UBC Theses and Dissertations

Advancing the serial analysis of gene expression technique and its application to the study of the development.. Zuyderduyn, Scott Dorjan 2009-12-31

ADVANCING THE SERIAL ANALYSIS OF GENE EXPRESSION TECHNIQUE AND ITS APPLICATION TO THE STUDY OF THE DEVELOPMENT OF SQUAMOUS CELL LUNG CANCER

by

Scott Dorjan Zuyderduyn
B.Sc.(Hon.), The University of British Columbia, 1999

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE STUDIES
(Biochemistry and Molecular Biology)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

April 2009

© Scott Dorjan Zuyderduyn, 2009

ABSTRACT

Lung cancer is one of the most common and deadliest forms of cancer. Squamous cell lung carcinomas (SCC), a common lung cancer subtype, feature a series of identifiable premalignant and early malignant forms that progress sequentially into full-blown tumours. This thesis describes a sophisticated and statistically rigorous analysis of global gene expression profiles taken from samples of several key stages of progression. This dataset was generated using serial analysis of gene expression (SAGE), a powerful transcriptome profiling technique that captures small sequence tags from each transcript in an mRNA population. These tags can then be counted and mapped back to a matching transcript sequence to quantitatively determine the expression of a given gene. The analysis identified several genes that show changes in expression that are highly correlated with the progressive steps of SCC. In addition, gene expression changes were identified in samples of bronchial epithelium that correspond to an acute response to tobacco smoke exposure, a major contributor to SCC development.

The use of multiple sample types, the presence of extensive cellular heterogeneity, and the rarity of biological material for the purpose of validation introduced additional layers of complexity that are not well suited to conventional methods of SAGE analysis. To address these challenges, this thesis describes the development of two methodological improvements to SAGE data analysis.
The first is a computational strategy that identifies additional sequence information, effectively increasing the length of SAGE tag sequences and greatly enhancing the fidelity of tag to gene mapping. The second is a new statistical method that shows improved performance in modelling SAGE data. The Poisson mixture model used in this work provides better estimates of statistical significance, is highly effective when using multiple sample types, and is a flexible framework for more complex meta-analyses.

TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ACKNOWLEDGEMENTS
CO-AUTHORSHIP STATEMENT

CHAPTER I  INTRODUCTION
  1.1 THE CHALLENGE OF CANCER
  1.2 TYPES OF GENETIC CHANGE IN CANCER
  1.3 LUNG CANCER
    1.3.1 Histological subtypes
      1.3.1.1 Small cell lung cancer (SCLC)
      1.3.1.2 Adenocarcinoma (AC)
      1.3.1.3 Atypical subtypes
      1.3.1.4 Squamous cell lung cancer (SCC)
        1.3.1.4.1 Stages of progression
        1.3.1.4.2 Common genetic alterations
  1.4 GENOMICS AND GENE EXPRESSION PROFILING
  1.5 SERIAL ANALYSIS OF GENE EXPRESSION (SAGE)
    1.5.1 Constructing a library
  1.6 DATA ANALYSIS AND STATISTICS
    1.6.1 Statistical error and bias
    1.6.2 Choice of test statistic
    1.6.3 Multiple testing
    1.6.4 Let the data challenge assumptions
    1.6.5 Information visualization, data clustering, and global comparisons
    1.6.6 Cross-validation
  1.7 THESIS OBJECTIVES
  BIBLIOGRAPHY

CHAPTER II  DETERMINING ADDITIONAL SEQUENCE DATA TO IMPROVE SAGE TAG TO GENE MAPPING
  2.1 INTRODUCTION
  2.2 MATERIALS AND METHODS
    2.2.1 SAGE data
    2.2.2 Software development
    2.2.3 Tag to gene mapping
  2.3 RESULTS
    2.3.1 Additional sequence information is available for the majority of SAGE tags
    2.3.2 Theoretical improvement in mapping accuracy from extra nucleotides
    2.3.3 Influence of nucleotide composition on tag length is small, but significant
    2.3.4 Predicted curvature of tag sequence is correlated with ditag length
    2.3.5 XBP-SAGE: an algorithm to estimate additional nucleotides
      2.3.5.1 The likelihood function
      2.3.5.2 The problem of global optimization
      2.3.5.3 Simulated annealing
      2.3.5.4 Simple reductions to the solution space and the use of an “unknown” nucleotide
      2.3.5.5 Algorithm summary
      2.3.5.6 Performance on a test dataset
      2.3.5.8 Improvements to tag gene mapping on real data
  2.4 DISCUSSION
  BIBLIOGRAPHY

CHAPTER III  STATISTICAL INFERENCE FROM SAGE USING A POISSON MIXTURE MODEL
  3.1 INTRODUCTION
  3.2 MATERIALS AND METHODS
    3.2.1 Test datasets
    3.2.2 Model fitting
      3.2.2.1 Log-linear (Poisson) regression model
      3.2.2.2 Overdispersed log-linear regression model
      3.2.2.3 Poisson mixture model
  3.3 RESULTS
    3.3.1 Goodness of fit
    3.3.2 Tags with ambiguous mappings are represented by a greater number of components
    3.3.3 Component assignment of libraries is non-random
    3.3.4 Determining differentially expressed genes
  3.4 DISCUSSION
  BIBLIOGRAPHY

CHAPTER IV  TRANSCRIPTOME EVOLUTION IN THE DEVELOPMENTAL STAGES OF SQUAMOUS CELL LUNG CARCINOMA
  4.1 INTRODUCTION
  4.2 MATERIALS AND METHODS
    4.2.1 Sample collection and preparation
    4.2.2 Data processing
    4.2.3 Multidimensional scaling
    4.2.4 k-means clustering
    4.2.5 Hierarchical clustering
    4.2.6 Gene Ontology and KEGG pathway enrichment
    4.2.7 Statistical analysis
      4.2.7.1 Preprocessing
      4.2.7.2 Feature selection
      4.2.7.3 Estimating the selectivity of a candidate tag list and the generation of optimal signatures
    4.2.8 Tag to gene mapping
    4.2.9 Microarray validation
    4.2.10 Tissue samples
    4.2.11 MMP11 antibody
    4.2.12 Immunohistochemistry
  4.3 RESULTS
    4.3.1 A global view of the transcriptome during the development of SCC
      4.3.1.1 Multidimensional scaling analysis
      4.3.1.2 Common gene expression patterns identified by k-means clustering
    4.3.2 Transcriptional signatures of developmental stages
      4.3.2.1 Changes associated with tobacco smoke exposure
        4.3.2.1.1 Acute response to tobacco smoke exposure
        4.3.2.1.2 Persistent response to tobacco smoke exposure
      4.3.2.2 Changes associated with SCC development
        4.3.2.2.1 Changes associated with pre-malignant transformation
        4.3.2.2.2 Changes associated with malignancy
        4.3.2.2.3 Changes associated with an invasive phenotype
    4.3.3 Comparison to existing squamous cell lung cancer profiles
  4.4 DISCUSSION
  BIBLIOGRAPHY

CHAPTER V  CONCLUSION AND FUTURE PROSPECTS
  6.1 OVERALL DISCUSSION AND CONCLUSION
  6.2 FUTURE PROSPECTS
  BIBLIOGRAPHY

APPENDIX I    LIKELIHOOD OF OBSERVING A DITAG GIVEN BOTH CONTRIBUTING TAG SEQUENCES
APPENDIX II   MODEL FITTING R SOURCE CODE
APPENDIX III  DERIVATION OF POISSON MIXTURE MODEL DIFFERENTIAL EXPRESSION CONFIDENCE SCORE

LIST OF TABLES

CHAPTER I
Table 1.1  Estimated new cases and deaths for the top 5 cancer types
Table 1.2  The histological subtypes of lung cancer
Table 1.3  Comparison of SAGE and DNA microarray technologies

CHAPTER II
Table 2.1  Estimated solutions for the distribution of tag lengths for 71 publicly available SAGE libraries
Table 2.2  Improvement of tag to gene mapping with addition of extra nucleotides
Table 2.3  Putative source of 14bp tags with ambiguous mappings that no longer map when lengthened to 16bp
Table 2.4  Putative source of 14bp tags with unambiguous mappings that no longer map when lengthened to 16bp
Table 2.5  Putative source of 14bp tags that fail to map

CHAPTER III
Table 3.1  Datasets used to evaluate models
Table 3.2  Comparison of model fits to a single group of biological replicates
Table 3.3  Mean number of mixture model components
Table 3.4  Top component memberships

CHAPTER IV
Table 4.1  Summary of SAGE libraries
Table 4.2  Source patient information for bronchial epithelium brushings samples
Table 4.3  Description of clusters identified by k-means analysis
Table 4.4  Microarray validation of smoke-exposure signatures

LIST OF FIGURES

CHAPTER I
Figure 1.1  The hallmarks of cancer
Figure 1.2  Histology of common subtypes of lung cancer
Figure 1.3  Histological progression of squamous cell lung cancer
Figure 1.4  The role of gene expression in the cell
Figure 1.5  The serial analysis of gene expression library construction protocol
Figure 1.6  Examples of common data visualization methods

CHAPTER II
Figure 2.1  Summary of genetic algorithm to find solutions for proportion of tag lengths (P(lt))
Figure 2.2  The beta distribution
Figure 2.3  Comparison of the observed and expected count of simulated tags with two additional nucleotides

CHAPTER III
Figure 3.1  Probability density of several models applied to data generated from two Poisson components
Figure 3.2  Comparison to significance scores for a test of differential expression calculated using a negative binomial model
Figure 3.3  Counts for two tags assessed using a negative binomial model and the Poisson mixture model where one model shows significance and the other does not
Figure 3.4  Comparison to Bayes error rate for a test of differential expression calculated using a beta binomial model
Figure 3.5  Counts for two tags assessed using a Bayes error rate and the Poisson mixture model where one model shows significance and the other does not

CHAPTER IV
Figure 4.1     A confidence score calculation for a tag
Figure 4.2     Multidimensional scaling (MDS) analysis of 40 SAGE libraries from samples reflecting different stages of SCC development
Figure 4.3     Multidimensional scaling (MDS) analysis of 24 SAGE libraries from brushings of bronchial epithelium with different levels of tobacco smoke exposure
Figure 4.4     Multidimensional scaling (MDS) analysis of 16 SAGE libraries from bulk samples reflecting the different stages of SCC development
Figure 4.5     Determination of cluster number for k-means analysis and the four major clusters
Figure 4.6     The five minor k-means clusters
Figure 4.7     Candidate smoke exposure tag selection plots
Figure 4.8     Heatmap with sample-wise hierarchical clustering of 70 tags upregulated in bronchial epithelium from current smokers
Figure 4.9     Heatmap with sample-wise hierarchical clustering of 9 tags downregulated in bronchial epithelium from current smokers
Figure 4.10    Heatmaps with sample-wise hierarchical clustering of 4 tags upregulated and 1 tag downregulated in former, but not current, smokers
Figure 4.11    Candidate squamous cell lung cancer progression tag selection plots
Figure 4.12    Venn diagram of the number of candidate tags identified for different combinations of SCC progression sample types
Figure 4.13.1  Heatmap with sample-wise hierarchical clustering of the first 69 of 138 tags upregulated in metaplasia and later stages of SCC development
Figure 4.13.2  Heatmap with sample-wise hierarchical clustering of the final 69 of 138 tags upregulated in metaplasia and later stages of SCC development
Figure 4.14.1  Heatmap with sample-wise hierarchical clustering of the first 79 of 316 tags downregulated in metaplasia and later stages of SCC development
Figure 4.14.2  Heatmap with sample-wise hierarchical clustering of the second 79 of 316 tags downregulated in metaplasia and later stages of SCC development
Figure 4.14.3  Heatmap with sample-wise hierarchical clustering of the third 79 of 316 tags downregulated in metaplasia and later stages of SCC development
Figure 4.14.4  Heatmap with sample-wise hierarchical clustering of the final 79 of 316 tags downregulated in metaplasia and later stages of SCC development
Figure 4.15    Microarray validation of optimal metaplasia progression-associated upregulated gene signature
Figure 4.16.1  Heatmap with sample-wise hierarchical clustering of the first 72 of 143 tags upregulated in malignant stages of SCC development
Figure 4.16.2  Heatmap with sample-wise hierarchical clustering of the final 71 of 143 tags upregulated in malignant stages of SCC development
Figure 4.17    Heatmap with sample-wise hierarchical clustering of the 26 tags downregulated in the malignant stages of SCC development
Figure 4.18.1  Microarray validation of optimal malignant progression-associated upregulated gene signature
Figure 4.18.2  Microarray validation of optimal malignant progression-associated downregulated gene signature
Figure 4.19    A model of MCM7 and CKS1B function based on current knowledge
Figure 4.20    Expression of select members of the MCM family and related genes
Figure 4.21    Heatmap with sample-wise hierarchical clustering of 29 tags upregulated in the invasive stage of SCC development
Figure 4.22    Microarray validation of optimal invasive progression-associated upregulated gene signature
Figure 4.23    Western blot confirming correct activity of MMP11 antibody
Figure 4.24    MMP11 detection by immunohistochemistry
Figure 4.25    Role of serine hydroxymethyltransferase in thymidine biosynthesis
Figure 4.26.1  Heatmap of SAGE tag expression of candidate SCC genes identified from the Nacht and Bhattacharjee datasets
Figure 4.26.2  Heatmap of SAGE tag expression of candidate SCC genes identified from the Erez/Dehan and Wachi datasets

LIST OF ABBREVIATIONS

Throughout this thesis, individual genes are referred to according to the symbols and guidelines specified by the HUGO Gene Nomenclature Committee (www.genenames.org) (White, 1997). Specifically: a) the standard HUGO gene symbol is used in all cases, although alias(es) will be mentioned with the first appearance of the symbol if they are still in common use (e.g. CDKN2A (a.k.a. p16INK4a)); b) when referring to a gene locus or mRNA transcript arising from a locus, the gene symbol is italicized (e.g. GAPDH); c) when referring to a protein product, the gene symbol is not italicized (e.g. MDM1).
The full name of the gene is not typically mentioned, but will always be present in this list of abbreviations.

ABL         c-abl oncogene 1, receptor tyrosine kinase
AC          adenocarcinoma
ADH7        alcohol dehydrogenase 7 (class IV), mu or sigma polypeptide
AIC         Akaike information criterion
AE          anchoring enzyme
AKR1B10     aldo-keto reductase family 1, member B10 (aldose reductase)
AKR1C2      aldo-keto reductase family 1, member C2 (dihydrodiol dehydrogenase 2; bile acid binding protein; 3-alpha hydroxysteroid dehydrogenase, type III)
AKR1C3      aldo-keto reductase family 1, member C3 (3-alpha hydroxysteroid dehydrogenase, type II)
ALDH3A1     aldehyde dehydrogenase 3 family, member A1
AML         acute myeloid leukemia
ATP1A1      ATPase, Na+/K+ transporting, alpha 1 polypeptide
ATP1A2      ATPase, Na+/K+ transporting, alpha 2 (+) polypeptide
ATP1A3      ATPase, Na+/K+ transporting, alpha 3 polypeptide
ATP1B3      ATPase, Na+/K+ transporting, beta 3 polypeptide
BCR         breakpoint cluster region
BIC         Bayesian information criterion
BIRC5       baculoviral IAP repeat-containing 5
BRCA1       breast cancer 1, early onset
BRCA2       breast cancer 2, early onset
C16orf89    chromosome 16 open reading frame 89
C3          complement component 3
C5orf32     chromosome 5 open reading frame 32
CAV1        caveolin 1, caveolae protein, 22kDa
CBFB        core-binding factor, beta subunit
CBR1        carbonyl reductase 1
CCND1       cyclin D1
CCNE1       cyclin E1
CDC45L      CDC45 cell division cycle 45-like (S. cerevisiae)
CDC6        cell division cycle 6 homolog (S. cerevisiae)
CDC7        cell division cycle 7 homolog (S. cerevisiae)
CDH1        cadherin 1, type 1, E-cadherin (epithelial)
CDK2        cyclin-dependent kinase 2
CDK4        cyclin-dependent kinase 4
CDKN1A      cyclin-dependent kinase inhibitor 1A (p21, Cip1)
CDKN1B      cyclin-dependent kinase inhibitor 1B (p27, Kip1)
CDKN2A      cyclin-dependent kinase inhibitor 2A (melanoma, p16, inhibits CDK4)
cDNA        complementary deoxyribonucleic acid
CDT1        chromatin licensing and DNA replication factor 1
CGH         comparative genomic hybridization
CI          confidence interval
CIS         carcinoma in situ
CKS1B       CDC28 protein kinase regulatory subunit 1B
CML         chronic myelogenous leukemia
CNV         copy number variation
COL1A2      collagen, type I, alpha 2
COL3A1      collagen, type III, alpha 1
COL6A3      collagen, type VI, alpha 3
CRIP2       cysteine rich protein 2
CSTA        cystatin A (stefin A)
CV          cross-validation
CYP1A1      cytochrome P450, family 1, subfamily A, polypeptide 1
CYP1B1      cytochrome P450, family 1, subfamily B, polypeptide 1
DAB         3,3’-diaminobenzidine
DBF4        DBF4 homolog (S. cerevisiae)
DHFR        dihydrofolate reductase
ECM         extracellular matrix
EGFR        epidermal growth factor receptor (a.k.a. ERBB1)
EPHX1       epoxide hydrolase 1, microsomal (xenobiotic)
ERBB2       v-erb-b2 erythroblastic leukemia viral oncogene homolog 2 (a.k.a. HER2/neu)
ERG         v-ets erythroblastosis virus E26 oncogene homolog (avian)
EST         expressed sequence tag
ETV1        ets variant 1
ETV4        ets variant 4
FDR         false discovery rate
GAPDH       glyceraldehyde-3-phosphate dehydrogenase
GEO         Gene Expression Omnibus
GMNN        geminin, DNA replication inhibitor
GO          Gene Ontology
GPX2        glutathione peroxidase 2 (gastrointestinal)
GPX4        glutathione peroxidase 4 (phospholipid hydroperoxidase)
GSR         glutathione reductase
GSTA2       glutathione S-transferase alpha 2
GSTP1       glutathione S-transferase pi 1
HIF1A       hypoxia-inducible factor 1, alpha subunit (basic helix-loop-helix transcription factor)
IRLS        iteratively reweighted least-squares
JUP         junction plakoglobin (a.k.a. γ-catenin)
KLK3        kallikrein-related peptidase 3 (a.k.a. PSA)
KRAS        v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog
KRT6A       keratin 6A
KRT6B       keratin 6B
KRT14       keratin 14 (epidermolysis bullosa simplex, Dowling-Meara, Koebner)
LAMA1       laminin, alpha 1
LGALS3      lectin, galactoside-binding, soluble, 3
LOH         loss of heterozygosity
LOOCV       leave-one-out cross-validation
MCM2        minichromosome maintenance complex component 2
MCM3        minichromosome maintenance complex component 3
MCM4        minichromosome maintenance complex component 4
MCM5        minichromosome maintenance complex component 5
MCM6        minichromosome maintenance complex component 6
MCM7        minichromosome maintenance complex component 7
MCM8        minichromosome maintenance complex component 8
MCM9        minichromosome maintenance complex component 9
MCM10       minichromosome maintenance complex component 10
MDM1        mouse double minute 1 nuclear protein homolog
MDM2        Mdm2 p53 binding protein homolog (mouse)
MDS         multidimensional scaling
MFAP2       microfibrillar-associated protein 2
ML          maximum likelihood
MLE         maximum likelihood estimation
MMP11       matrix metallopeptidase 11 (stromelysin 3)
MMP12       matrix metallopeptidase 12 (macrophage elastase)
MPSS        massively parallel signature sequencing
MYC         v-myc myelocytomatosis viral oncogene homolog (avian)
MYCN        v-myc myelocytomatosis viral related oncogene, neuroblastoma derived (avian)
MYH11       myosin, heavy chain 11, smooth muscle
NCBI        National Center for Biotechnology Information
NHBE        normal human bronchial epithelium
NQO1        NAD(P)H dehydrogenase, quinone 1
NSCLC       non-small cell lung cancer
PCA         principal component analysis
PTEN        phosphatase and tensin homolog
PTTG1       pituitary tumor-transforming 1
PSA         prostate-specific antigen
RB1         retinoblastoma 1
ROC         receiver operating characteristic
S100A2      S100 calcium binding protein A2
SA          simulated annealing
SAGE        serial analysis of gene expression
SCC         squamous cell carcinoma
SCLC        small cell lung cancer
SDC1        syndecan 1
SFN         stratifin
SHMT1       serine hydroxymethyltransferase 1 (soluble)
SHMT2       serine hydroxymethyltransferase 2 (mitochondrial)
SKP2        S-phase kinase-associated protein 2 (p45)
SLC6A8      solute carrier family 6 (neurotransmitter transporter, creatine), member 8
SPP1        secreted phosphoprotein 1
SPRR1A      small proline-rich protein 1A
SPRR2A      small proline-rich protein 2A
SRF         serum response factor (c-fos serum response element-binding transcription factor) (a.k.a. MCM1)
SVM         support vector machine
TALDO1      transaldolase 1
TE          tagging enzyme
TFF3        trefoil factor 3 (intestinal)
TMPRSS2     transmembrane protease, serine 2
TP63        tumor protein p63
TPM         tags per million
TSG         tumour suppressor gene
TYMS        thymidylate synthetase

ACKNOWLEDGEMENTS

I would like to thank my research supervisor Dr. Victor Ling for his unwavering support, enthusiasm, and commitment to good science. His ability to ask penetrating questions about subjects he should know nothing about was always a source of amazement. He is truly a scholar whom any aspiring scientist could look to as a role model.

I appreciate the support and companionship of past and present Ling Lab members, the most diverse mix of individuals I have ever had the pleasure to work with. I tip my hat to my student predecessors Dr. Dennis Leveson-Gower, Dr. Maria Ho, and Dr. Maisie Lo, who each showed that the way through graduate school is never a straight line but that, eventually, it does end. Long-distance running hurts, but it feels so good when you stop.

I thank Dr. Tania Kastelic for her wisdom, experience, and superhuman attention to detail. I thank Dr. Jonathan Sheps for his knowledge and candour, and for his enthusiasm in engaging in the prolonged discussions that computer geeks like me are compelled to start, not only about science and technology but also on wild tangents about the state of the world and the issues of the day. I thank Dr. Renxue Wang for sharing his knowledge and for his willingness to help with any problem at a moment’s notice. I thank Dr. Jaclyn Hung for her insights and expertise on the problem of lung cancer, and for her willingness to share her material and equipment.
I would like to acknowledge the often thankless and unnoticed tasks performed by Barb Schmidt, Lin Liu, and Sue Smith that keep the lab together, running smoothly, and able to carry out good science. I am particularly indebted to Dr. Greg Vatcher for his support, candour, ideas, and continued friendship. I appreciate the expertise and clinical material provided by lung oncologist Dr. Stephen Lam. I also acknowledge the technical support provided by fellow students Alvin Ng (Hung Lab) and Gerald Li (MacAulay Lab).

I gratefully acknowledge the generous funding support provided by the Canadian Institutes of Health Research (CIHR) and the Michael Smith Foundation for Health Research (MSFHR). I am grateful for the sponsorship of Dr. Raymond Ng in obtaining access to the Westgrid computing cluster, as well as the efforts of the Westgrid support staff.

Finally, I acknowledge the guidance and assistance of my thesis committee: Dr. Wan Lam, Dr. Calum MacAulay, Dr. Ross McGillivray, and Dr. Raymond Ng.

“Science… never solves a problem without creating ten more.”
- George Bernard Shaw

CO-AUTHORSHIP STATEMENT

Several chapters describe work done in collaboration with others. Chapter II was done in collaboration with Dr. Greg Vatcher. I developed and performed all data analyses as presented, and wrote the manuscript. I am the sole author of the work in Chapter III. Chapter IV was done in collaboration with Drs. Greg Vatcher, Stephen Lam, Wan Lam, Calum MacAulay, Raymond Ng, and Victor Ling. I developed and performed all data analyses as presented, and wrote the manuscript.

CHAPTER I  INTRODUCTION

Cancer is one of the most feared diseases and has afflicted humans for millennia. The first account of cancer was written on papyrus in 1500 BCE, and described a series of breast tumours that were treated with crude cauterization (Diamondopoulos, 1996). In the intervening three and a half thousand years, great progress has been made in understanding and treating the disease.
However, the exceedingly complex nature of cancer and its ability to evolve and adapt, turning the intricate systems of the body against itself, continues to challenge medical science. This introduction will provide an overview of the characteristics that make cancer so complex, before focussing on lung cancer, a multifaceted disease in its own right. Particular attention will be given to squamous cell carcinoma, a common lung cancer subtype. This will be followed by a discussion of genomics, and gene expression profiling in particular, as a means to unravel cancer’s complexity. Serial analysis of gene expression (SAGE), a particularly powerful technique for obtaining a global portrait of gene expression and the methodology that informs the work in this thesis, will be described. Finally, the methods and challenges of analyzing large-scale datasets, like those obtained from gene expression profiling, are explored.

1.1  THE CHALLENGE OF CANCER

Cancer is currently the second leading cause of death in Canada (behind cardiovascular disease), and is similarly placed throughout the Western world (Canadian Cancer Society, 2008; Jemal, 2007; Pisani, 2002). In 2007, the disease affected about 160,000 Canadians and contributed to over 72,000 deaths (Table 1.1) (Canadian Cancer Society, 2008). An additional 69,000 are affected by highly treatable forms of non-melanoma skin cancer (Canadian Cancer Society, 2008).

Table 1.1: Estimated new cases and deaths for the top 5 cancer types†

Cancer Type    Males               Females
               Cases     Deaths    Cases     Deaths
lung           12,400    11,000    10,900    8,900
breast         170       50        22,300    5,300
prostate       22,300    4,300     -         -
colorectal     11,400    9,000     4,700     4,000
lymphoma       3,700     1,700     3,100     1,400

Statistics from the National Cancer Institute of Canada (NCIC), 2007.
† 69,000 cases of non-melanoma skin cancers are excluded.

Although a greater awareness of the health risks associated with cancer and
continuing advances in diagnosis and treatment have led to a decline in cancer over the last fifteen years, a concurrent decline in deaths from heart disease has made cancer the number one killer in every age group except those over 85, and it is expected to lead in this group as well by 2018 (Jemal, 2005). For this reason, substantial progress must be made in cancer research and related fields in order to continue improving the long-term health prospects of this and future generations. The pathology of cancer always involves the formation of malignant cells that destroy tissue or interfere with health-maintaining systems of the body. In most cases, this is associated with tumour formation, although haematological neoplasms (i.e. leukemia, lymphoma) are exceptions. Current scientific consensus postulates that cancer is, at its heart, a disease of the genes. The initiation and progression of carcinogenesis are driven by successive genetic changes that interfere with the integrity, function, or regulation of particular genes. Unfortunately, the identity of these loci and the mechanisms that disrupt them are only partially known, and can vary widely from tumour to tumour, even among cancers that arise from the same tissue and display identical histological characteristics. This is in stark contrast to major killers like heart attack, stroke or infection where the aetiology is largely uniform and new treatments can be widely applied.
Despite the uncertainty surrounding the genetic basis of most cancers, a number of so-called “hallmarks” have been suggested that are considered requirements for cancer to thrive and kill successfully: a) malfunctions in the molecular signals that promote or inhibit cellular growth, b) independence from programmed cell death, c) unlimited proliferative capacity (or immortalization), d) sustained neovascularisation, and e) the ability to escape the barriers of the organs and tissues that gave rise to the cancer and to spread to others (Figure 1.1) (Hanahan, 2000). As a general principle, a causal genetic change will affect one or more of these processes. However, the order in which these requirements are met is not consistent, and the genetic changes that directly cause or attenuate them are highly variable. Of greatest concern from a clinical perspective is how these hallmarks conspire to make a tumour progress, to resist treatment outright, or to provide the foundations for further mutations that allow the disease to evolve and recur in a more resistant form even after successful treatment.

Figure 1.1: The hallmarks of cancer. Depicted are the capabilities a cell must acquire in order to become malignant: self-sufficiency in growth signals, evading apoptosis, insensitivity to anti-growth signals, sustained angiogenesis, tissue invasion and metastasis, and limitless replicative potential. The figure here is remastered from the original published in Hanahan and Weinberg (2000).

Although it may take many years of research to decipher the full complexity of cancer, a partial understanding of some of the molecular players has positively impacted patient outcomes. Familial mutations that increase the risk of disease have helped to identify segments of the population who will benefit from aggressive surveillance or preventative medicine.
For example, certain inherited variations in the BRCA1 and BRCA2 genes are an indicator of increased risk for breast cancer (reviewed in Palacios, 2008). Early detection screens have been developed based on the identification of molecular markers that manifest in premalignant or early stage neoplasms. For example, the serum level of KLK3 (a.k.a. PSA) is a commonly used, although controversial, test for the presence of prostate cancer (reviewed in Lilja, 2008). More recently, tailored treatments have been developed based on markers associated with drug response. For example, the monoclonal antibody trastuzumab (Herceptin) is an effective therapy for breast cancers that overexpress ERBB2 (reviewed in Nanda, 2007). The reality is that there is neither a “magic bullet” to attack cancer, nor a simple molecular mechanism to explain its occurrence or clinical course. The large number of cancer types, each associated with highly variable pathology and response to treatment, are a result of a complex genetic basis; it is this complexity that makes cancer such a difficult challenge.

1.2  TYPES OF GENETIC CHANGE IN CANCER

Evolution is the means by which life adapts to a changing environment or enables survival in a new environment; this process is carried out through heritable modifications to the genes and random changes to the overall genetic makeup of populations. However, not all changes are beneficial and cancer development is a downside of life’s genetic plasticity. Much of the complexity of cancer is due to the fact that all known natural, and normally beneficial, mechanisms of genetic change can be exploited to promote tumourigenesis. DNA changes most often occur on the scale of a single or very small group of nucleotides. This can result from errors made during replication or by exposure to radiation or certain chemicals. Point mutations refer to the change of a single nucleotide from one base to another.
When these occur in the coding region of a gene it is possible to introduce either: a) missense mutations that change the amino acid sequence of the encoded protein, altering or abolishing its function or b) nonsense mutations that introduce a premature stop codon, truncating the resulting protein. A classic example occurs in colon cancer, where point mutations in specific exons of the KRAS gene occur in about 50% of cases (Forrester, 1987; Bos, 1987). Mutations can also occur in non-coding regions such as regulatory elements, altering gene expression. A “C” to “G” transversion in the promoter region of BIRC5, a known antiapoptotic factor, is found in a variety of cancer cell lines and correlates with the increased expression of this gene (Xu, 2004). Small deletions or insertions (indels) in a coding region can change the structure of a protein by altering its amino acid sequence. Despite the small number of nucleotides affected, an indel can introduce a frameshift that will change the entire amino acid sequence downstream of the mutation site. As with point mutations, indels can occur in regulatory elements as well. These alterations are variable in size; most often a single base is involved, and they usually do not exceed 15-20bp (Stenson, 2008). Such events occur in germline mutations of CDH1 and are strongly associated with a type of hereditary gastric cancer (Humar, 2002; Brooks-Wilson, 2004). Other genetic changes occur on a larger scale. These copy number variations (CNVs) involve segments of a chromosome that can vary in size from a few hundred nucleotides to several megabases. These segments of DNA can be duplicated, amplifying the expression of the genes contained within. For example, amplification of the MYCN oncogene is frequently observed in neuroblastomas, and is strongly correlated with the stage and aggressiveness of the disease (Seeger, 1985). Correspondingly, segments of DNA can be lost, reducing the function of the affected genes.
Pediatric meningioma, a rare childhood cancer, is usually accompanied by a deletion of the tumour suppressor NF2 (Begnami, 2007). Gains and losses of whole chromosomes (and even aneuploidy and polyploidy, where the entire set of chromosomes is affected) are often observed in cancer cells, although it remains unclear to what extent this is a contributing cause of tumourigenesis or a symptom of genome instability (Storchova, 2004). Perhaps the most fascinating alterations are complex chromosomal rearrangements. When a section of one chromosome exchanges places with a section of a different chromosome, a translocation has occurred; this can disrupt a gene or its regulatory elements, or put these elements into functional proximity of another gene. The most well-known example is the Philadelphia Chromosome, where the end of chromosome 9 is exchanged with that of chromosome 22 (t(9;22)(q34.1;q11.2)), resulting in a fusion gene BCR-ABL. This rearrangement occurs in 95% of cases of CML (Rowley, 2001). Inversions are similar, occurring within one chromosome when a segment of DNA is reversed. One subtype of AML is almost always associated with a specific inversion of chromosome 16 (inv(16)(p13;q22)), resulting in the fusion gene MYH11-CBFB (Liu, 1993). Interestingly, this inversion actually confers sensitivity to chemotherapy and a higher percentage of these patients achieve complete remission (Liu, 1995). A growing appreciation has emerged for the role of epigenetic modifications in cancer (Feinberg, 2006). These modifications result in heritable changes to DNA that affect gene expression without modifying the sequence itself. Epigenetic modification generally refers to changes arising from two mechanisms. The first involves the post-translational modification of histones, which facilitates chromatin remodelling, packing DNA into a form inaccessible to transcriptional machinery (Jenuwein, 2001). The second involves the methylation of the
The second involves the methylation of the 7  cytosine in CpG dinucleotides.  The majority of CpG sites in mammalian genomes are  methylated, except for clusters of these dinucleotides called CpG islands that are present in the 5’ regulatory region of many genes (Bird, 1986). These CpG islands provide a mechanism to stably repress the expression of certain genes, and aberrant hyper- or hypomethylation, can result in unscheduled transcriptional repression or activation (Miranda, 2007).  For example,  hypermethylation has been shown to repress the transcription of GSTP1, an enzyme that plays an important role in cellular detoxification, in the majority of prostate cancers (Lin, 2001; Maruyama, 2002). Cancer-promoting changes can be the result of a combination of the above mechanisms, often through loss of heterozygosity (LOH). When one allele has become inactivated, the remaining functioning allele can usually compensate. This heterozygous state is then lost when the remaining gene is disrupted. This two-hit model was first proposed in a study of childhood retinoblastoma (Knudson, 1971). The age of diagnosis in patients with bilateral tumours tended to be younger than those with unilateral tumours. Moreover, the distribution of the age of diagnosis for bilateral tumours was consistent with a single hit, whereas the distribution for unilateral tumours was consistent with two hits. This suggested that both copies of a single gene were inactivated and in bilateral cases, a disrupted copy was inherited. This hypothesis was later confirmed with the discovery of the RB1 gene, where one copy carries an inherited mutation and the second is subsequently lost through deletion in most cases of inherited retinoblastoma (Friend, 1986). LOH has since been implicated in the loss of function of a large number of tumour suppressor genes, and is now one of the core concepts in molecular oncology (Mendelsohn, 2001). 
The recent discovery of microRNA (miRNA) has introduced an additional molecular player in cancer biology. These short, single-stranded RNAs act to regulate the expression of genes by targeting mRNAs with closely complementary sequences and bringing about their destruction (reviewed in Ambros, 2004). The specific activities of miRNAs in cancer are an area of intense investigation, but one study has demonstrated that the expression levels of several hundred miRNAs are able to distinguish between normal and tumour samples across a wide panel of tissue types (Lu, 2005). It is clear that there are a large number of genes that show alterations in cancer, a problem compounded by the variety of mechanisms that cause these changes. Furthermore, different malignancies arise from diverse, tissue-specific environments which can drastically influence the effect a given alteration will have. For example, CAV1 appears to act as a tumour-suppressor gene in one subtype of lung cancer, while being required for survival in another (Sunaga, 2004). Finally, genome instability is a characteristic of the vast majority of cancers, and is capable of inducing genetic changes that have little or no causal or supporting role in tumourigenesis, but nonetheless complicate the search for key alterations.

1.3  LUNG CANCER

Lung cancer is the third most commonly diagnosed cancer in both men and women (behind skin and breast/prostate cancers), but is the leading cause of cancer death in both genders (Canadian Cancer Society, 2008; Jemal, 2007). These tumours typically progress to an advanced and inoperable stage before symptoms compel a patient to seek medical attention. Indeed, a mass of considerable size would be expected before a patient would manifest common lung cancer indicators such as dyspnoea, haemoptysis and chest pain.
The result is dismal outcomes: in non-small cell lung cancers (NSCLC), the most common subtype (see Section 1.3.1), advanced disease (stage III and IV) has a 20-30% treatment response rate and an 8-10 month median survival time (Scagliotti, 2007). However, when these tumours are diagnosed early, the treatment success rate is high. Individuals with stage I or stage II NSCLC have average 5-year survival rates of 65% and 41%, respectively (Nesbitt, 1995). Like all cancers, lung tumourigenesis has a complex genetic basis. As discussed in the previous section, this complexity arises from the diversity of mechanisms that can alter DNA and the large number of potentially affected genes. The abundance of mutagenic agents in tobacco smoke, which is a clear causative factor in lung cancer initiation, adds a particularly insidious dynamic. Polycyclic aromatic hydrocarbons (PAH) such as benzopyrene and nicotine-derived nitrosamines such as 4-(methylnitrosamino)-1-(3-pyridyl)-1-butanone (NNK) are a few of the more potent examples of the hundreds of carcinogens present in tobacco smoke that have non-specific, damaging effects on DNA (Phillips, 1983; Hecht, 1998; Besaratinia, 2002).

1.3.1 Histological subtypes

Histology dictates the classification of lung cancer into several subtypes; each is associated with different clinical parameters such as prognosis, recurrence, and response to treatment. Malignant subtypes are separated into two categories: 1) small cell lung cancer forms its own group distinct from 2) non-small cell lung cancer (NSCLC), which comprises the adenocarcinoma, squamous cell carcinoma, large-cell carcinoma and other atypical subtypes. These two categories are largely historical and reflect the major differences in the primary choice of treatment at the time they were classified; the former being typically treated with chemotherapy, and the latter with surgery and radiotherapy (Holland, 2003).
Each subtype and its approximate prevalence are summarized in Table 1.2 and photos providing a visual reference are found in Figure 1.2. A brief overview of the major subtypes follows, with emphasis on the squamous cell subtype, which is the focus of the work in Chapter IV of this thesis.

Table 1.2: The histological subtypes of lung cancer

malignancy  category                            subtype†                     prevalence (%)‡
benign                                          papillomas                   NA
                                                adenomas                     NA
malignant   small cell lung cancer (SCLC)       small-cell carcinoma         14.6
            non-small cell lung cancer (NSCLC)  squamous cell carcinoma      21.3
                                                adenocarcinoma               35.0
                                                large-cell carcinoma         7.0
                                                adenosquamous carcinoma   }
                                                carcinoid tumour          }  22.0
                                                bronchial gland carcinoma }
                                                others                    }

† Categories from World Health Organization, Histological typing of lung tumors, 2nd ed. Geneva, Switzerland: WHO; 1991.
‡ Incidence rates are U.S. statistics (Alberg, 2003).

Figure 1.2: Histology of common subtypes of lung cancer. Panels: A, small cell lung cancer; B, adenocarcinoma; C, large cell carcinoma; D, squamous cell carcinoma. All slides are stained with haematoxylin and eosin. Images courtesy of PathoPic (http://alf3.urz.unibas.ch/pathopic).

1.3.1.1 Small cell lung cancer (SCLC)

SCLC almost always occurs in smokers and generally arises in the central airways. The cells are flat and small, containing exiguous cytoplasm (Figure 1.2A). SCLC is thought to arise from the Kulchitsky cell, which has important neuroendocrine functions (Hattori, 1972). This subtype is highly aggressive and is notorious for early and extensive metastases. Moreover, these tumours are rarely operable and are ultimately refractory to chemotherapy: in general, there is a very good initial response, but SCLC nearly always recurs in a highly resistant form (Rosti, 2006). Median survival is only about 7-20 months, depending on the extent of disease (El Maalouf, 2007).

1.3.1.2 Adenocarcinoma (AC)

Lung AC has recently become the most common subtype and occurs mainly in the peripheral airways.
Although predominantly associated with smoking history, it is the most likely of all the subtypes to occur when the patient is young or has never smoked (Kreuzer, 1999; Subramanian, 2007). ACs are thought to arise from both type II pneumocytes (which reside in the alveoli) and Clara cells (which reside in the bronchioles), which secrete substances such as structure-maintaining surfactant proteins (Wistuba, 2006) (Figure 1.2B). These tumours are histologically heterogeneous, and can be classified into even further subtypes. Treatment options are numerous (e.g. surgery, chemotherapy, radiotherapy) and depend heavily on stage and histological characteristics, among other factors (Collins, 2007).

1.3.1.3 Atypical subtypes

Rarer subtypes comprise tumours that either: a) have unusual histology, making them difficult to definitively assign to other groups, or b) affect pulmonary structures that are uncommon sites of tumour formation (e.g. bronchial gland carcinoma). The most common of these atypical types are large cell carcinomas, which get their name from the presence of large, undifferentiated cells (Figure 1.2C). Since they show no evidence of squamous or glandular differentiation, which would place them in the squamous cell carcinoma and adenocarcinoma categories, respectively, they are assigned to a separate group by default. On the other end of the spectrum, adenosquamous carcinomas display characteristics of both the adenocarcinoma and squamous cell carcinoma subtypes.

1.3.1.4 Squamous cell lung cancer (SCC)

SCC occurs mainly in the central airways and is strongly associated with smoking history. At one time, this subtype occurred with a much greater prevalence (similar to or greater than adenocarcinoma), but it is likely that the introduction of filtered, low-tar cigarettes and a decrease in smoking rates have contributed to its decline (Kreuzer, 1999).
SCCs are thought to arise from the phenotypically similar squamous epithelium, a common cell type that plays important functional roles throughout the body. In the lungs, the flat squamous cells line the alveoli and promote gas exchange; in the central airways, where SCCs occur, squamous epithelium results from a transformation of the normal layer of pseudostratified columnar epithelium as a result of genetic mutation (e.g. from carcinogens in tobacco smoke) (Puchelle, 2006) (Figure 1.2D). Unlike other lung cancer subtypes, SCCs have a clearly defined and observable progression from normal epithelium to malignancy. Squamous cell carcinomas at other sites of the body (e.g. skin, oesophagus, and cervix) have similar pre-malignant lesions (Neville, 2002; Greer, 2006; Shimizu, 2007).

1.3.1.4.1 Stages of progression

Despite the relative difficulty in obtaining samples of pre-malignant lesions, lung neoplasms of the squamous subtype are histologically well-studied, and a detailed series of early progressive steps has been defined (Colby, 1998; Franklin, 2000; Kerr, 2001). The majority of cases are thought to progress sequentially from normal epithelium through, in order, hyperplasia, squamous metaplasia, dysplasia (mild, moderate, severe), carcinoma in situ (CIS), and finally an invasive carcinoma (Figure 1.3). However, this series can only be considered generally correct since each individual lesion may regress, fluctuate between steps, and even skip some steps entirely (Breuer, 2005).

Figure 1.3: Histological progression of squamous cell lung cancer, from normal epithelium through hyperplasia, squamous metaplasia, dysplasia, and carcinoma in situ to invasive carcinoma. Adapted from Wistuba and Gazdar (2006).

Furthermore, histologists may differ on their classification scheme depending on a variety of subjective factors.
For example, squamous metaplasia may be considered mild dysplasia by some health facilities; dysplasia itself, which is a continuum of changes with overlapping features, is also subject to inter-observer variability with respect to severity (Nicholson, 2001). Bearing this in mind, a brief description of the stages is as follows: hyperplasia is characterized by a thickening of the basal cell layer with the cells appearing otherwise normal; squamous metaplasia shows characteristic cytologic atypia, in which the cells flatten and may show keratinisation; mild dysplasia shows a further thickening of the cell layer, a pleomorphic increase in the overall size of the cell and the relative size of the nucleus, and crowding of the cells at the basal layer; moderate dysplasia is more marked in these characteristics, along with evidence of rapid division apparent from cells in visible mitosis; severe dysplasia shows very apparent abnormalities of the previous types involving most, but not all, cells; in carcinoma in situ all cells are completely atypical, showing extensive defects in cellular architecture and mitotic activity; finally, an invasive carcinoma is one in which the cells have broken through the basement membrane, infiltrating the surrounding tissue (Nicholson, 2001). The step in this progression that represents a shift from an at-risk lesion to a committed carcinoma is unclear. Indeed, the increased risk of progressing to a carcinoma associated with each subsequent step is strong evidence for their assigned order. However, greater than 90% of the lesions identified as CIS eventually become a full-blown tumour, suggesting this step represents a de facto committed carcinoma (Venmans, 2000; Bota, 2001). Detection of SCC at this early stage of development has a drastic effect on survival, as treatment of CIS is essentially curative (>90% 5-year survival) (Lam, 2000).
1.3.1.4.2 Common genetic alterations

Some characterization of the genetic basis of SCC has been performed. However, no single causal alteration has been identified. Moreover, many of the identified markers consist of chromosomal aberrations where the specific gene(s) affected are as yet unknown. Nevertheless, there are a handful of genetic factors that currently stand out in SCC.

Deletions of 3p

Allelic losses of the short arm of chromosome 3 (3p) are an early and consistent genetic change associated with all lung cancers, including SCC (reviewed in Zabarovsky, 2002). In SCC, the frequency and extent of these losses is strongly correlated with the stage of progression; moreover, changes are often observed in histologically normal tissue in smokers, but never in non-smokers (Wistuba, 1999; Wistuba, 2000). Although 3p was initially thought to harbour a critical lung tumour suppressor gene (TSG), the inconsistency in the precise chromosomal locations of these changes and the failure to identify a single candidate gene has led to speculation that 3p harbours several such genes that have a complex relationship with cancer initiation and progression (Zabarovsky, 2002).

Loss of CDKN2A (p16INK4a)

Cyclin-dependent kinase inhibitor 2A, a classic TSG, plays a key role in cell cycle regulation (Kamb, 1994). The CDKN2A locus encodes several transcripts which inhibit CDK4, a kinase necessary for G1 phase progression; in addition, an alternate open reading frame (ORF) codes for a protein that sequesters MDM2, which is responsible for p53 degradation. Restricted CDKN2A function is observed in a wide variety of human cancers (Herman, 1995). In SCC, loss of CDKN2A expression is nearly universal and is often observed in preneoplastic lesions (Belinsky, 1998; Dessy, 2008). As with other cancers, this loss occurs through a variety of mechanisms; in lung cancer this commonly involves homozygous deletion or promoter hypermethylation (Gazzeri, 1998).
Increase in EGFR

The epidermal growth factor receptor, as its name suggests, is a transmembrane tyrosine kinase which is activated by binding members of the epidermal growth factor (EGF) family, initiating a signalling cascade that promotes cell proliferation (reviewed in Carpenter, 2000). Overexpression of EGFR, usually through an increase in gene copy number, is common in NSCLC and SCC in particular (Hirsch, 2003; Nakamura, 2006; Jeon, 2006). Recently, EGFR garnered intense interest for lung cancer treatment when it was discovered that particular somatic mutations predict sensitivity to the newly developed EGFR inhibitor gefitinib (Paez, 2004). Unfortunately, the results implicated only the adenocarcinoma subtype.

Loss of TP53

TP53 is the most frequently mutated gene known in human cancer, with modifications observed in more than half of cases (Benard, 2003). TP53 is often referred to as the “guardian of the genome” because of its role in initiating growth arrest in the event of DNA damage, and inducing apoptosis if the damage cannot be repaired (Lane, 1992). In lung cancer, TP53 mutations, most often accompanied by LOH, are observed in about 50% of cases of NSCLC (both AC and SCC subtypes) and over 70% of cases of SCLC (Weston, 1989; Miller, 1992).

Increased telomerase

Telomeres consist of TTAGGG repeats that reside at the ends of each chromosome. After mitosis, the ends of chromosomes are shortened and the telomeres provide a buffer that delays the degradation of important genetic material. In normal somatic cells, this progressive telomere shortening confers an upper bound (called the Hayflick limit) to the number of possible cell divisions. Once the telomeres are lost, DNA damage responses are triggered causing the cell to lose the ability to divide in a process called senescence. Telomerase is able to replace the DNA lost at the telomeres, and is vital in stem cells that are required to divide throughout the lifetime of the organism.
In many cancers, telomerase activity is increased and is likely an essential mechanism for allowing the uncontrolled cell division of a malignant tumour. This appears to be a common feature of lung cancer as well, with about 80% of SCCs showing increased telomerase activity; other lung cancer subtypes show similar incidence (Lantuéjoul, 2007).

1.4  GENOMICS AND GENE EXPRESSION PROFILING

Genomics, the study of the genes at the scale of the whole genome, is a field that has existed for less than thirty years. Advances in gene sequencing and computer technology have made genomic studies practical and cost effective, and the field continues to develop rapidly. There are two overarching rationales for using genomics as an approach in cancer research: a) hitherto unknown genetic determinants of tumourigenesis and other important factors, such as susceptibility or clinical response, can be more easily identified by a wide-ranging examination of the genome; and b) as a complex genetic disease, resulting from an intricate interplay of genetic aberrations, a global view of molecular change can provide a fuller understanding not possible with more selective study. Success in sequencing whole genomes, notably the human genome, was the result of significant advances in high-throughput, automated techniques (Lander, 2001; Venter, 2001). As DNA sequencing became more inexpensive and technically undemanding, the application of these technologies to mRNA in order to characterize the transcriptome was a natural progression. Early gene expression profiles were generated by sequencing ESTs (random, single pass reads from the ends of a collection of cDNAs) (Adams, 1991). As the sequence of expressed genes became better defined and prediction models of gene loci were developed, hybridization-based, low-cost and highly parallelized spotted cDNA and oligonucleotide microarrays became possible (Pease, 1994; Schena, 1995).
Finally, improvements to the original EST method were developed, including SAGE and MPSS; these techniques decreased the amount of sequence required to identify a gene, and increased the number of identified transcripts per sequence read (Velculescu, 1995; Brenner, 2000; Saha, 2002; Matsumura, 2005). The central dogma of molecular biology states that information flow in biological systems travels in one direction, from DNA to RNA to protein (Crick, 1970). Thus, changes to DNA affect RNA, which in turn affects the proteins that carry out the functions of the cell. In this sequential process, RNA is the first intrinsically dynamic layer, resulting from the interplay between the genome and the regulatory factors it interacts with (Figure 1.4). Thus, a transcriptome profile represents one of the most revealing looks at the current status of a cell, capable of indicating both genome-level changes and environmental influences. DNA deletions, amplifications, changes in the activity of transcription factors and changes to the methylation status of individual loci are all expected to influence gene expression levels. However, there are inherent limitations. Small mutations in the coding region of a gene will be missed. Complex chromosomal rearrangements may result in situations that are not reflected in a gene expression profile. Post-transcriptional splicing of mRNA can be detected in some, but not all, circumstances. The myriad post-translational modifications to proteins (e.g. alkylation, glycosylation, proteolytic cleavage, and so on) can have drastic effects on function, and cannot be detected. However, despite the inability of transcriptome profiling to directly identify these events, one can adopt an optimistic point of view.
Most, if not all, biochemical modifications will be expected to alter the transcriptome to some extent because the change a) has direct regulatory effects on other genes, or b) alters the steady-state of the cell requiring a response in order to maintain homeostasis. An example of the former is TP53, which is most often inactivated through missense point mutations not reflected at the transcriptional level (Soussi, 2001). However, TP53 is a potent transcriptional regulator, modulating the expression of dozens of genes such as CDKN1A, MDM2, PTEN, and SFN (Kanehisa, 2008). Thus, an alteration to TP53 may not result in a change to its own expression, but will be reflected in changes to the expression of downstream target genes. An example of the latter is the Warburg effect, which describes the increased glycolytic rates of tumour cells (Warburg, 1956). Rapid tumour growth results in a hypoxic environment that prevents oxidative phosphorylation, the normal source of energy, from functioning. HIF1A, a transcription factor that is activated under anaerobic conditions, is activated in the majority of cancers and induces the expression of several genes, including enzymes that drive the glycolysis pathway (Maxwell, 2001).

[Figure 1.4: The role of gene expression in the cell. The three asterisks signify processes that will be reflected in transcriptome profiles. Labels in the figure: Central Dogma of Molecular Biology; GENOME; TRANSCRIPTOME; PROTEOME; DNA; mRNA; miRNA; protein; transcription; RNA processing; translation; gene regulation; metabolic, structural, signalling, etc. functions.]

A common objective of gene expression profiling is to generate “signatures” that differentiate between two or more phenotypes.
In the best-case scenario, these signatures will include genes that have some direct, causal relationship with a phenotypic change; failing that, these signatures can identify genes or processes that can be used as a starting point to hunt down causal events through alternate methods. Moreover, improvements in clinical care can be achieved without identifying the underlying genetic aetiology, as long as the resulting signature has sufficient sensitivity and specificity.

An early demonstration of the power of gene expression profiling used microarrays to identify a signature that could distinguish between acute myeloid leukaemia (AML) and acute lymphocytic leukaemia (ALL) (Golub, 1999). Later, it was shown that gene expression profiles could identify specific signatures that could distinguish between many different tumour types (Ramaswamy, 2001). Such studies were successful in demonstrating that transcriptome profiles could yield sufficient information to make medically relevant phenotypic distinctions. Many of these distinctions are already possible with existing medical technology (for example, imaging techniques such as X-ray, CT, and MRI, or histological observation of tumour biopsies), but as genome profiling technologies become more advanced and economical they may become a superior alternative. Moreover, gene expression profiling studies have proven successful in discovering novel tumour subtypes that improve existing classifications, and have identified signatures that correlate with important clinical variables such as prognosis and treatment response (Lin, 2008; van’t Veer, 2008).

1.5  SERIAL ANALYSIS OF GENE EXPRESSION (SAGE)

Serial analysis of gene expression (SAGE) and DNA microarrays are the two methods which have emerged to obtain high-throughput gene expression profiles (Schena, 1995; Velculescu, 1995). It is instructive to compare SAGE to the more commonly used DNA microarray, since the two have similar purposes (Table 1.3).
SAGE is based on the capture of small sequence tags that are extracted from a defined position of each mRNA molecule in an input sample. The resulting SAGE data consists of quantitative tag counts; this is different from microarrays which use DNA hybridization to obtain an analog signal which is much more qualitative in nature. Moreover, microarray hybridization occurs on a pre-fabricated chip where the target sequences must be known a priori. In contrast, SAGE samples the entire mRNA population, generating a comprehensive profile that permits novel gene discovery. Although originally developed to study the cancer transcriptome, SAGE has found wide use in profiling diverse cell types from a variety of organisms. 1.5.1 Constructing a library There is a lengthy protocol required to construct a SAGE library (Figure 1.5). First, the mRNA from a sample of biological material is converted to cDNA using biotinylated oligonucleotide deoxythymidine primers. The incorporated biotin allows the cDNA to associate with streptavidin-coated magnetic beads so the molecules can be isolated by pulling down the 3’ end. The cDNA are then subjected to restriction digest by an anchoring enzyme (AE) (most frequently NlaIII, which cuts at the recognition site CATG). 
The 3’ ends of the cDNA, which are now shortened to the 3’-most AE recognition site, are again isolated using the magnetic beads.

Table 1.3: Comparison of SAGE and DNA microarray technologies

consideration      DNA microarray                            SAGE
data output        hybridization-based (analog)              count-based (digital)
detection          limited to probes present on the array    limited by sequencing depth of library
source of noise    background hybridization                  sequencing and PCR amplification errors
re-use             depends on type of array                  libraries from different sources can be compared
cost               low (thousands of $)                      high (tens of thousands of $)

[Figure 1.5: The serial analysis of gene expression library construction protocol. Steps shown, in order: polyadenylated RNA extracted from cell; cDNA synthesis and immobilization on streptavidin beads; restriction with anchoring enzyme (AE); ligation to linkers A and B; restriction with tagging enzyme (TE) and blunt ending; ligation to form ditags; PCR amplification and restriction with anchoring enzyme again; concatenation (linking sticky ends) and cloning. Image by Jiang Long for the UBC Science Creative Quarterly (sqc.ubc.ca). Minor modifications were made to correct the size and sequence of the TE recognition site and two cases where the AE site was incorrect. Reproduced under the Creative Commons license.]
The sample is divided into two pools and a ligation step is performed to attach one of two different linkers to the 5’ end of the cDNA pool. Both linkers introduce a recognition site for a type IIS restriction enzyme, referred to as the tagging enzyme (TE) (for the original SAGE protocol this is BsmFI, which cuts 14bp downstream from the recognition site GGGAC). When the sample is subjected to restriction digest by the TE, short tags are released while the remainder of the cDNA remains attached to the magnetic beads. The tags are then blunt-ended and ligated to form ditags. The population of ditags undergoes PCR amplification using primers for each of the two linker sequences. A further digestion with the AE removes the linkers, and the final ditags can be isolated. Finally, the ditags are ligated into long concatemers which are cloned into an appropriate vector for sequencing. Several variants of SAGE have been introduced, each differing in the length of the captured tag (14, 21, and 26bp) (Velculescu, 1999; Saha, 2002; Matsumura, 2003). Computational methods are then required to analyze the resulting sequence. Initially, this involves an in silico digest with the AE to reveal the ditags. These undergo further processing to remove erroneously captured linker sequences and duplicate ditags. The latter is part of an internal control for preferential PCR amplification that may introduce bias into the final tag counts. Indeed, the use of “ditags” is not strictly necessary to capture sequence tags. Since the number of different transcripts in a typical sample is large, it is unlikely that any two tags will associate more than once and any instances of this event can be ascribed to preferential amplification. Once this processing has occurred, the sequence and number of each individual tag can then be determined. Finally, tags must be assigned to a source sequence in order to identify the expressed gene. 
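The computational steps described above (in silico digest with the AE, removal of duplicate ditags, and tag counting) can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used in this thesis: the function names, the 10-base tag length beyond the CATG site, and the accepted ditag length range are all assumptions made for the example, and linker removal and sequencing-error handling are omitted.

```python
from collections import Counter

ANCHOR = "CATG"   # NlaIII recognition site (the anchoring enzyme, AE)
TAG_LEN = 10      # assumed number of bases captured beyond the anchor site
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_ditags(concatemer, min_len=20, max_len=24):
    """In silico AE digest: split on the anchor site and keep
    fragments whose length is plausible for a ditag (assumed range)."""
    fragments = concatemer.upper().split(ANCHOR)
    return [f for f in fragments if min_len <= len(f) <= max_len]


def count_tags(concatemers):
    """Count tags across sequence reads, counting each distinct ditag once."""
    seen = set()
    counts = Counter()
    for read in concatemers:
        for ditag in extract_ditags(read):
            # A ditag and its reverse complement are the same molecule
            # read from opposite strands, so use one canonical key.
            key = min(ditag, reverse_complement(ditag))
            if key in seen:
                continue  # duplicate ditag: likely preferential PCR amplification
            seen.add(key)
            # One tag is read from each end of the ditag.
            counts[ANCHOR + ditag[:TAG_LEN]] += 1
            counts[ANCHOR + reverse_complement(ditag)[:TAG_LEN]] += 1
    return counts
```

In this sketch a repeated ditag contributes only once to the counts, which mirrors the internal control for amplification bias described in the text.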
Invariably, SAGE libraries are constructed from samples with some phenotypic difference, and the purpose is to identify tags (and ultimately genes) that show a significant change in number. The application of statistical methods can be relatively straightforward for comparisons of two libraries, but as sequencing costs and the technical challenges of SAGE library construction have diminished, studies featuring many biological replicates and deeply sequenced libraries have become more common. More sophisticated methods are required not only to analyze datasets with multiple samples, but also to fully leverage the information gained from deeply sequenced libraries.

1.6  DATA ANALYSIS AND STATISTICS†

“Data mining” – the extraction of relevant information from large amounts of data – has wide applicability in many scientific fields such as epidemiology and in non-scientific fields such as business. Naturally, the large-scale nature of genomics has required the field to draw upon accepted data mining strategies. The use of these methods is required to isolate informative segments of large datasets and help determine which observations are most relevant to the subject of study. In the case of SAGE, this involves the identification of tag sequences that have some change in count consistent with a given hypothesis. However, data mining has known issues that must be carefully considered when analyzing and interpreting large datasets, and SAGE data is no exception.

These issues can be more clearly demonstrated by example. A recent study featured an analysis of SAGE profiles obtained from the lung epithelium of individuals who have never smoked, have quit smoking, or are current smokers (Chari, 2007). The study highlighted the differences in gene expression between these three groups, with a particular focus on a substantial number of changes that appear to persist after an individual has stopped smoking.
These results were intriguing because they suggest that some aetiology is permanently maintained after smoking cessation, that this manifests itself at the level of gene expression, and that such changes may contribute to an increased risk of developing lung cancer. Under scrutiny, however, the statistical methods employed by the authors seriously undermine the validity of these conclusions. Nevertheless, the Chari et al. study is a useful tool with which to discuss some of the common pitfalls of analyzing large datasets and their practical consequences. For simplicity, the language of frequentist statistics is used in the following overview, although the issues are equally applicable to Bayesian approaches.

† The examples used in this section of the introduction are taken from Zuyderduyn, S.D. (2009) Correspondence regarding 'Effect of active smoking on the human bronchial epithelium transcriptome'. BMC Genomics, 10:82. A fuller discussion of certain statistical pitfalls and other more esoteric flaws in the analysis of SAGE data can be found in the complete manuscript.

1.6.1 Statistical error and bias

Statistical errors are commonly divided into two categories and operate under the assumption of a null hypothesis, the default premise accepted as true until statistical evidence indicates otherwise. Type I error, or a false positive, refers to the rejection of a null hypothesis that is actually true. Type II error, or a false negative, refers to the failure to reject a null hypothesis that is actually false. Other types of error have been proposed, many of which are somewhat lighthearted descriptions of statistical pitfalls. For example, Type III error has been described as correctly rejecting the null hypothesis for the wrong reason and Type IV error as the incorrect interpretation of a correctly rejected null hypothesis (Mosteller, 1948; Marascuilo, 1970).
Statistical bias is easy to define, but is often difficult to identify in specific cases without some vigilance. When some set of data is collected or manipulated in a manner that makes it more likely that an undesired or unaccounted-for property is represented, there is a risk of bias. If the overrepresented property is associated in some way with a variable being tested for a significant change or effect, then the experiment is biased.

An example of this occurs in Chari et al. during the filtering of their dataset. When comparing a set of SAGE libraries, it is not unusual to remove tags with uniformly low expression since there is not enough signal to provide any statistical significance. However, Chari et al. restrict their testing to tags with a “mean tag count of ≥20 tags per million (TPM) in at least one of never, former or current smoker SAGE libraries”. This introduces a bias because the reduction criterion includes smoking status, the variable being tested. This becomes problematic when estimating the amount of Type I error (discussed in Section 1.6.3), since the dataset has now been enriched for data that will show changes between the three groups. This bias could be reasonably addressed by filtering using a criterion of a mean expression of 20 TPM across all libraries.

1.6.2 Choice of test statistic

A test statistic is a value calculated from the observed data, which can be used to determine significance by calculating the chance of observing an equally extreme value if the null hypothesis is true (i.e. the p-value). The choice of test statistic relies on certain assumptions about the data. Test statistics can be classified as parametric, where the frequency distribution of the data is known or can be reasonably assumed, and non-parametric, when the frequency distribution is not known.
For example, Student’s t-test is a parametric test used to compare the means of two groups of observations where the variance is unknown, but the data are assumed to follow a normal distribution (also referred to as a Gaussian distribution). The Mann-Whitney U test is its non-parametric counterpart, and discards the assumption of normality by comparing the relative ranks of the observed values. Although non-parametric tests are free of difficulties arising from incorrectly assuming an underlying distribution, they generally have less power than a parametric test whose assumptions are met.

In Chari et al., the authors choose the Mann-Whitney U test to determine if a tag is differentially expressed between groups. Although the test itself is perfectly valid, some difficulties arise from its use with SAGE data. First, since the relative rank of the tag count is used to calculate the test statistic, the following two comparisons are equivalent:

                Group 1        Group 2
Comparison 1    1, 2, 3, 5     4, 6, 7, 9
Comparison 2    1, 3, 5, 10    9, 30, 50, 60

Common sense would suggest that the values in the second comparison are more statistically significant than those in the first, but the Mann-Whitney U test will determine that both are equally significant. Second, Chari et al.’s dataset consists of groups containing 4, 12, and 8 libraries. When comparing such a small number of samples, the power of the Mann-Whitney U test is limited. For example, assume for some tag that expression is lower in all libraries of the group of 4 than the group of 8. The lowest possible two-tailed p-value is 0.004 (2/495, since there are 495 ways to choose which 4 of the 12 ranks belong to the smaller group). Although this value is more than acceptable for a single hypothesis test it can, as discussed in the next section, become problematic when testing many tags.

1.6.3 Multiple testing

Statistical hypothesis testing aims to determine if there is a true difference between two (or more) individuals or groups of observations. This was introduced in the previous section, where the use of a test statistic to determine a p-value is described.
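The two observations about the Mann-Whitney U test made in Section 1.6.2 can be checked with a short calculation. The sketch below is illustrative only; the function names are hypothetical, and the minimum p-value formula assumes an exact test with no tied values.

```python
from math import comb


def mann_whitney_u(group1, group2):
    """U statistic for group1: the number of (x, y) pairs with x > y,
    counting ties as one half. Only the ordering of values matters."""
    return sum((x > y) + 0.5 * (x == y) for x in group1 for y in group2)


def min_two_tailed_p(n1, n2):
    """Smallest attainable exact two-tailed p-value, reached when the two
    groups are completely separated (assumes no ties)."""
    return 2 / comb(n1 + n2, n1)


# The two comparisons from the text share the same rank pattern,
# so they yield the same U statistic despite very different magnitudes.
u_first = mann_whitney_u([1, 2, 3, 5], [4, 6, 7, 9])
u_second = mann_whitney_u([1, 3, 5, 10], [9, 30, 50, 60])

# Groups of 4 and 8 libraries: 2 / C(12, 4) = 2 / 495
p_floor = min_two_tailed_p(4, 8)
```

Because the attainable p-value has a floor that depends only on the group sizes, no tag in a 4 versus 8 comparison can reach a threshold stricter than roughly 0.004, however large the difference in counts.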
It naturally follows that when one performs such a test on enough observations, cases will appear where the null hypothesis is true but the observed value suggests otherwise. For example, the commonly accepted threshold for significance (α) is 0.05. If a hypothesis test is applied to 10,000 items (e.g. genes) where the null hypothesis is true, then one expects 500 will be identified as extreme, or significant, by chance alone. If 1,500 items are subsequently identified from real data, one can estimate that a third of these will be false positives. Despite the relative simplicity of the multiple testing problem, it continues to be inadequately addressed in many studies using high-throughput data (Dupuy and Simon, 2007).

There are numerous methods to correct for multiple testing; a simple Bonferroni correction, which controls the family-wise error rate, and the Benjamini-Hochberg method, which controls the false discovery rate (FDR), are widely accepted and commonly used (Bonferroni, 1936; Benjamini and Hochberg, 1995). Moreover, modern computers have made powerful resampling approaches such as bootstrapping and Monte Carlo simulation easy to perform.

In Chari et al., over 8,000 tags are tested for a change in expression between smoking status groups, and comparisons resulting in a p-value <0.05 were considered significant. However, since there is no correction for multiple testing, the actual significance of any given result will be much lower. To show this directly, the same procedure can be applied to a randomization of the data. The class labels are randomly re-shuffled to produce a “null” dataset where one knows that any finding of significance will be spurious. When the p<0.05 criterion is used, 418 tags are identified; on the actual dataset, 885 tags are identified. Thus, the false discovery rate can be estimated at 418/885=47.2%; that is, about half of the reported findings are false positives. Moreover, the bias introduced by pre-filtering based on group membership (see Section 1.6.1) means this value is an underestimate.
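The arithmetic in this section, together with the Benjamini-Hochberg step-up procedure mentioned above, can be sketched as follows. This is a minimal illustration with hypothetical function names, not the code used for the re-analysis; the p-values in the usage lines are invented for demonstration.

```python
def expected_false_positives(n_true_nulls, alpha=0.05):
    """Number of true null hypotheses expected to pass threshold alpha by chance."""
    return n_true_nulls * alpha


def empirical_fdr(n_significant_null, n_significant_real):
    """FDR estimate from a label-shuffled ("null") dataset:
    spurious hits divided by hits on the real data."""
    return n_significant_null / n_significant_real


def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg
    step-up procedure at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Find the largest rank whose p-value is under its threshold.
        if p_values[i] <= q * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Figures quoted in the text
chance_hits = expected_false_positives(10_000, 0.05)
fdr_estimate = empirical_fdr(418, 885)

# Invented p-values: five pass p < 0.05 uncorrected, fewer survive correction
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
uncorrected = [i for i, p in enumerate(p_values) if p < 0.05]
corrected = benjamini_hochberg(p_values, q=0.05)
```

In the invented example, the uncorrected threshold accepts five of the ten tests while the step-up procedure retains only the two smallest p-values.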
1.6.4 Let the data challenge assumptions

When a corollary assumption or supposition is made at the outset of an experiment, it is never a mistake to let the data challenge it. This serves several functions: a) it provides quality control, since a failure of the data to support a strong assumption may suggest a mistake in implementing or analyzing an experiment; b) it allows rejection or refinement of the assumption, so that appropriate measures can be taken to account for the uncertainty or pursue an alternative hypothesis.

Chari et al. provides a good example of how a core assumption underlying the experiment was accepted without letting the data challenge it. The authors report only the findings of a test of one null hypothesis: the expression of a tag in the never smoker group is the same as the current smoker group. Tags showing reversible or irreversible expression are later selected from this subset based on the expression in the former smoker group. This approach makes the tacit assumption that the limits of expression values are defined by the never smoker and current smoker groups, and that the expression of the former smoker group must be somewhere in between. Tests of the other potentially informative null hypotheses, specifically never smokers versus former smokers and former smokers versus current smokers, are not reported by the authors. When these are tested on the real and null datasets, an interesting result emerges; specifically, the estimated FDR is almost identical in all three of the possible comparisons. This disagrees with the assumption that former smokers will have gene expression profiles lying within a continuum of changes between never and current smokers, which would certainly result in a higher FDR for one or both of the comparisons involving former smokers. Although this does not disprove the assumption, it does suggest that the statistical methods used should be re-examined.
One likely explanation is that the criterion for statistical significance is simply not stringent enough to remove random variation, and may be obscuring the expected dynamics of the data.

1.6.5 Information visualization, data clustering, and global comparisons

When analyzing large-scale datasets, it is often desirable to identify broad similarities and differences. Sometimes this serves to efficiently summarize a dataset, but it can also be used in exploratory data analysis where the formulation of a hypothesis is desired. Common approaches are principal component analysis (PCA), multidimensional scaling (MDS), k-means clustering, and hierarchical (or agglomerative) clustering (Figure 1.6).

[Figure 1.6: Examples of common data visualization methods. Clockwise from top left are principal component analysis (PCA), multidimensional scaling (MDS), k-means clustering, and two-way hierarchical clustering. The PCA is from Yeung (2001), and the MDS and hierarchical clustering from Koinuma (2006). Axes of the PCA panel: PC1, PC2, PC3.]

PCA serves to reduce a multidimensional dataset to a smaller number of dimensions by identifying a principal component that accounts for as much of the variance as possible (Pearson, 1901). Further principal components can be identified by repeating the procedure on the remaining variance. In gene expression analysis, PCA is typically applied to a large list of genes in order to identify large trends that define the structure of the data (Yeung, 2001).

MDS reduces a matrix of pair-wise similarities into a lower-dimensional space (Gower, 1966). For example, a set of ten objects would require nine dimensions to fully visualize the similarity of each object to all others. Several types of MDS algorithm are available, but in general the procedure re-positions the objects in the lower dimensional space to minimize a “loss function”, which describes the amount of similarity information in the original matrix not accounted for. In gene expression analysis, MDS can be applied to a set of samples to identify the main dynamics that account for the observed similarity.

k-means clustering partitions a list of objects into k clusters based on their similarity (Steinhaus, 1956). The k-means algorithm begins by placing the objects into k clusters randomly or using some simple heuristic, and then calculating the average vector (or centroid) for each cluster. Each object is then moved to the cluster with the nearest centroid, and the centroids are recalculated after each round of re-arrangement. The procedure is repeated until the clusters are stable. In gene expression analysis, k-means clustering is useful for identifying families of genes with similar expression patterns across a set of samples (Tavazoie, 1999).

Hierarchical clustering partitions data into subsets (or clusters) based on some distance metric, which can be represented as a tree structure, or hierarchy, called a dendrogram (Eisen, 1998).
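The k-means procedure described above can be written out directly. The following is a minimal sketch under stated assumptions: objects are numeric vectors (for example, expression profiles), similarity is measured by squared Euclidean distance, and the initial placement deals objects into clusters in round-robin order as a stand-in for "some simple heuristic". The function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def kmeans(points, k, max_rounds=100):
    """Partition points into k clusters; returns (labels, centroids)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Initial placement: deal the objects into clusters round-robin.
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_rounds):
        # Recalculate the average vector (centroid) of each cluster.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Move each object to the cluster with the nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # clusters are stable
        labels = new_labels
    return labels, centroids


# Two well-separated groups of three toy profiles each
profiles = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(profiles, k=2)
```

The result depends on the initial placement, which is one reason k-means is often run several times from different starting assignments.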
There are a large number of variations in the hierarchical clustering algorithm, primarily concerned with the calculations of the distance between a) individual observations, and b) the clusters themselves. Hierarchical clustering is useful for determining similarities between samples or genes, and both can be presented simultaneously (e.g. two-way clustering where both the rows and columns of a data matrix are clustered).

A challenge of all of these approaches is determining how many components, dimensions or, in the case of hierarchical clustering, what level of a dendrogram adequately captures the significant features of the data. In the case of PCA, MDS, and k-means clustering a common strategy is to plot the amount of variance accounted for as a function of the number of components or dimensions, and select a point where the amount of additional variance removed is noticeably smaller. This is sometimes referred to as an “elbow” plot, because the datapoints appear as a bent arm with the elbow corresponding to the optimal number of components or dimensions. In the case of hierarchical clustering, one can choose a cutoff similarity where partitions within a cluster are no longer meaningful.

A potential pitfall of these methods arises from their use during class discovery. Identifying subsets of observations that show similar properties can identify meaningful relationships. For example, an analysis of expression profiles from different cancer samples may reveal similarities that correspond to the stage or aggressiveness of the tumour, or to previously unknown molecular subtypes. However, it is important to ensure that claims of correlations identified by clustering are not made on data previously selected based on the same correlation. Like the multiple testing problem, many studies using high-throughput data commit this error (Dupuy and Simon, 2007).
In Chari et al., the authors refer to the use of “supervised clustering”, which is a neologism, but captures the essential nature of this error. In the study, several hundred SAGE tags are selected based on their differential expression between never and current smokers, and it is implied that the observed clustering of these two groups is meaningful. Furthermore, for these specific tags, the expression values from the former smokers are included and then it is claimed that the separation of all three groups into distinct clusters is meaningful. However, the additional samples form a distinct cluster not because of any biological significance, but because they have not undergone the same a priori selection procedure applied to the other two groups. Ironically, if the former smokers had actually merged into the clusters of either the never or current smokers one could make a strong argument that gene expression changes arising from smoke exposure revert or remain stable upon smoking cessation. However, the argument as presented is only valid if the study used all of the tags (or a subset selected on an unrelated variable such as average expression) and clusters corresponding to smoke exposure emerged.

1.6.6 Cross-validation

Cross-validation (CV) is a powerful technique to confirm the validity of a statistical analysis, to estimate how well a set of putative descriptive features will perform on future data (e.g. predicting the disease status of a sample), and to guard against overfitting. When performed correctly, CV can be an effective means of confirming the findings of a statistical analysis when additional samples are not available or are difficult to obtain. Even when the availability of additional samples is not an issue, CV can help in the development of an effective model of the existing data before expending further resources on validation efforts.
The overall strategy involves partitioning the data into a training set, where a statistical model is applied, and a testing set, which is used to determine the performance of the model. The partitioning procedure is performed a number of times so that a given sample will sometimes be part of the training set and sometimes the testing set. K-fold cross-validation involves partitioning the data into K subsets, selecting each of the subsets in turn as the testing set, and combining the remaining subsets to form the training set. A single estimate of performance is then obtained by averaging the results of the K runs of CV. Leave-one-out cross-validation (LOOCV) is a special case of K-fold CV where K is equal to the number of samples. In other words, each sample acts as an individual testing set and the procedure involves the maximum number of possible rounds of validation.

It is vital that the construction of the predictive model, or classifier, from the training set not involve the testing set in any way. For example, one may wish to identify a set of genes that can predict whether a biopsy is a malignant tumour. A set of gene expression profiles from normal tissue or benign tumours and from malignant tumours is collected, and a statistical test is used to rank the genes according to how well they differentiate between the sample types (using a metric such as the p-value or correlation coefficient). CV can then be used to determine how many genes should be used to obtain an optimal classifier, and to estimate the performance of this classifier. A common error is to utilize the ranking metric calculated using all of the samples during the construction of the classifier for the training set in each round of CV. The correct procedure is to re-calculate the ranking metric for each of the training sets.
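The correct procedure can be made concrete with a small leave-one-out example in which the gene ranking is recomputed inside every round. This is a sketch only: the ranking metric (absolute difference in class means), the nearest-centroid classifier, the binary 0/1 labels, and all function names are assumptions chosen for brevity, and the toy data are invented.

```python
def mean(values):
    return sum(values) / len(values)


def rank_genes(samples, labels):
    """Rank gene indices by absolute difference in class means
    (an illustrative metric; a p-value or correlation could be used)."""
    def score(g):
        class0 = [s[g] for s, l in zip(samples, labels) if l == 0]
        class1 = [s[g] for s, l in zip(samples, labels) if l == 1]
        return abs(mean(class0) - mean(class1))
    return sorted(range(len(samples[0])), key=score, reverse=True)


def predict(train, train_labels, genes, sample):
    """Nearest-centroid prediction using only the selected genes."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_labels)):
        members = [s for s, l in zip(train, train_labels) if l == label]
        dist = sum((sample[g] - mean([m[g] for m in members])) ** 2
                   for g in genes)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(samples, labels, n_genes):
    """Leave-one-out CV with gene ranking redone for each training set."""
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        # The held-out sample plays no part in choosing the genes.
        genes = rank_genes(train, train_labels)[:n_genes]
        correct += predict(train, train_labels, genes, samples[i]) == labels[i]
    return correct / len(samples)


# Invented profiles: gene 0 separates the classes, gene 1 does not
profiles = [[1, 5], [2, 3], [1, 4], [9, 4], [8, 5], [10, 3]]
status = [0, 0, 0, 1, 1, 1]
accuracy = loocv_accuracy(profiles, status, n_genes=1)
```

Calling `rank_genes` once on all samples and reusing that ranking in every round would reproduce the error described in the text, since the held-out sample would then have influenced the gene selection.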
Failure to do so will result in an overestimate of the classifier performance, and may incorrectly calculate the number of genes that should be included in the optimal classifier. Along with the failure to correct for multiple testing and incorrectly assigning significance to clusters identified during class discovery, this is one of the most common errors in studies of cancer using gene expression profiles (Dupuy and Simon, 2007).

1.7  THESIS OBJECTIVES

The overall objective of this thesis was to identify changes in gene expression associated with the development of squamous cell lung cancer. This was accomplished using a large number of SAGE profiles generated from tissue samples of several pre-malignant and malignant stages. However, the rarity of this material made validation studies of any appreciable size extremely difficult. Furthermore, the nature of the samples was such that a straightforward analysis would have had limited effectiveness. As a result, much of the work in this thesis describes novel techniques that were developed to maximize the value of the SAGE profiles, and mitigate the requirement for large follow-up studies. The three chapters describe:

1. the development of a technique to extract additional sequence information from SAGE data in order to provide increased confidence in the assignment of a gene to a given tag
2. the development of a statistical model to analyze multiple groups of SAGE libraries that outperforms current methods
3. the characterization of the transcriptome of the developmental stages of squamous cell lung cancer and the identification of specific biomolecules that are highly correlated with each transition using bioinformatic techniques, including those described in Chapters II and III

BIBLIOGRAPHY

Adams, M. D., J. M. Kelley, et al. (1991). "Complementary DNA sequencing: expressed sequence tags and human genome project." Science 252(5013): 1651-6.
Ambros, V. (2004). "The functions of animal microRNAs."
Nature 431(7006): 350-5.
Begnami, M. D., E. J. Rushing, et al. (2007). "Evaluation of NF2 gene deletion in pediatric meningiomas using chromogenic in situ hybridization." Int J Surg Pathol 15(2): 110-5.
Belinsky, S. A., K. J. Nikula, et al. (1998). "Aberrant methylation of p16(INK4a) is an early event in lung cancer and a potential biomarker for early diagnosis." Proc Natl Acad Sci U S A 95(20): 11891-6.
Benard, J., S. Douc-Rasy, et al. (2003). "TP53 family members and human cancers." Hum Mutat 21(3): 182-91.
Benjamini, Y. and Y. Hochberg (1995). "Controlling the false discovery rate: a practical and powerful approach to multiple testing." J Roy Statist Soc Ser B 57: 289-300.
Besaratinia, A., J. C. Kleinjans, et al. (2002). "Biomonitoring of tobacco smoke carcinogenicity by dosimetry of DNA adducts and genotyping and phenotyping of biotransformational enzymes: a review on polycyclic aromatic hydrocarbons." Biomarkers 7(3): 209-29.
Bird, A. P. (1986). "CpG-rich islands and the function of DNA methylation." Nature 321(6067): 209-13.
Bonferroni, C. E. (1936). "Teoria statistica delle classi e calcolo delle probabilità." Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze 8: 3-62.
Bos, J. L., E. R. Fearon, et al. (1987). "Prevalence of ras gene mutations in human colorectal cancers." Nature 327(6120): 293-7.
Bota, S., J. B. Auliac, et al. (2001). "Follow-up of bronchial precancerous lesions and carcinoma in situ using fluorescence endoscopy." Am J Respir Crit Care Med 164(9): 1688-93.
Brenner, S., M. Johnson, et al. (2000). "Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays." Nat Biotechnol 18(6): 630-4.
Breuer, R. H., A. Pasic, et al. (2005). "The natural course of preneoplastic lesions in bronchial epithelium." Clin Cancer Res 11(2 Pt 1): 537-43.
Brooks-Wilson, A. R., P. Kaurah, et al. (2004). "Germline E-cadherin mutations in hereditary diffuse gastric cancer: assessment of 42 new families and review of genetic screening criteria." J Med Genet 41(7): 508-17.
Canadian Cancer Society/National Cancer Institute of Canada (2008). Canadian Cancer Statistics 2008. Toronto, Canada.
Carpenter, G. (2000). "The EGF receptor: a nexus for trafficking and signaling." Bioessays 22(8): 697-707.
Chari, R., K. M. Lonergan, et al. (2007). "Effect of active smoking on the human bronchial epithelium transcriptome." BMC Genomics 8: 297.
Colby, T. V., I. I. Wistuba, et al. (1998). "Precursors to pulmonary neoplasia." Adv Anat Pathol 5(4): 205-15.
Collins, L. G., C. Haines, et al. (2007). "Lung cancer: diagnosis and management." Am Fam Physician 75(1): 56-63.
Crick, F. (1970). "Central dogma of molecular biology." Nature 227(5258): 561-3.
Dessy, E., E. Rossi, et al. (2008). "Chromosome 9 instability and alterations of p16 gene in squamous cell carcinoma of the lung and in adjacent normal bronchi: FISH and immunohistochemical study." Histopathology 52(4): 475-82.
Diamandopoulos, G. T. (1996). "Cancer: an historical perspective." Anticancer Res 16(4A): 1595-602.
Dupuy, A. and R. M. Simon (2007). "Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting." J Natl Cancer Inst 99(2): 147-57.
Eisen, M. B., P. T. Spellman, et al. (1998). "Cluster analysis and display of genome-wide expression patterns." Proc Natl Acad Sci U S A 95(25): 14863-8.
El Maalouf, G., J. M. Rodier, et al. (2007). "Could we expect to improve survival in small cell lung cancer?" Lung Cancer 57 Suppl 2: S30-4.
Feinberg, A. P., R. Ohlsson, et al. (2006). "The epigenetic progenitor origin of human cancer." Nat Rev Genet 7(1): 21-33.
Forrester, K., C. Almoguera, et al. (1987). "Detection of high incidence of K-ras oncogenes during human colon tumorigenesis." Nature 327(6120): 298-303.
Franklin, W. A. (2000). "Pathology of lung cancer." J Thorac Imaging 15(1): 3-12.
Friend, S. H., R. Bernards, et al. (1986). "A human DNA segment with properties of the gene that predisposes to retinoblastoma and osteosarcoma." Nature 323(6089): 643-6.
Gazzeri, S., V. Gouyer, et al. (1998). "Mechanisms of p16INK4A inactivation in non small-cell lung cancers." Oncogene 16(4): 497-504.
Golub, T. R., D. K. Slonim, et al. (1999). "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring." Science 286(5439): 531-7.
Gower, J. C. (1966). "Some distance properties of latent root and vector methods used in multivariate analysis." Biometrika 53: 325-328.
Greer, R. O. (2006). "Pathology of malignant and premalignant oral epithelial lesions." Otolaryngol Clin North Am 39(2): 249-75, v.
Hanahan, D. and R. A. Weinberg (2000). "The hallmarks of cancer." Cell 100(1): 57-70.
Hattori, S., M. Matsuda, et al. (1972). "Oat-cell carcinoma of the lung. Clinical and morphological studies in relation to its histogenesis." Cancer 30(4): 1014-24.
Hecht, S. S. (1998). "Biochemistry, biology, and carcinogenicity of tobacco-specific N-nitrosamines." Chem Res Toxicol 11(6): 559-603.
Herman, J. G., A. Merlo, et al. (1995). "Inactivation of the CDKN2/p16/MTS1 gene is frequently associated with aberrant DNA methylation in all common human cancers." Cancer Res 55(20): 4525-30.
Hirsch, F. R., M. Varella-Garcia, et al. (2003). "Epidermal growth factor receptor in non-small-cell lung carcinomas: correlation between gene copy number and protein expression and impact on prognosis." J Clin Oncol 21(20): 3798-807.
Holland, J. F. and E. Frei, Eds. (2003). Cancer Medicine. Hamilton, Ontario, BC Decker Inc.
Humar, B., T. Toro, et al. (2002). "Novel germline CDH1 mutations in hereditary diffuse gastric cancer families." Hum Mutat 19(5): 518-25.
Jemal, A., T. Murray, et al. (2005). "Cancer statistics, 2005." CA Cancer J Clin 55(1): 10-30.
Jemal, A., R. Siegel, et al. (2007). "Cancer statistics, 2007." CA Cancer J Clin 57(1): 43-66.
Jenuwein, T. and C. D. Allis (2001).
"Translating the histone code." Science 293(5532): 107480. Jeon, Y. K., S. W. Sung, et al. (2006). "Clinicopathologic features and prognostic implications of epidermal growth factor receptor (EGFR) gene copy number and protein expression in non-small cell lung cancer." Lung Cancer 54(3): 387-98. Kamb, A., N. A. Gruis, et al. (1994). "A cell cycle regulator potentially involved in genesis of many tumor types." Science 264(5157): 436-40. Kanehisa, M., M. Araki, et al. (2008). "KEGG for linking genomes to life and the environment." Nucleic Acids Res 36(Database issue): D480-4. Kerr, K. M. (2001). "Pulmonary preinvasive neoplasia." J Clin Pathol 54(4): 257-71. Knudson, A. G., Jr. (1971). "Mutation and cancer: statistical study of retinoblastoma." Proc Natl Acad Sci U S A 68(4): 820-3. Kreuzer, M., L. Kreienbrock, et al. (1999). "Histologic types of lung carcinoma and age at onset." Cancer 85(9): 1958-65.  41  Lam, S., C. MacAulay, et al. (2000). "Detection and localization of early lung cancer by fluorescence bronchoscopy." Cancer 89(11 Suppl): 2468-73. Lander, E. S., L. M. Linton, et al. (2001). "Initial sequencing and analysis of the human genome." Nature 409(6822): 860-921. Lane, D. P. (1992). "Cancer. p53, guardian of the genome." Nature 358(6381): 15-6. Lantuejoul, S., C. Salon, et al. (2007). "Telomerase expression in lung preneoplasia and neoplasia." Int J Cancer 120(9): 1835-41. Lilja, H., D. Ulmert, et al. (2008). "Prostate-specific antigen and prostate cancer: prediction, detection and monitoring." Nat Rev Cancer 8(4): 268-78. Lin, J. and M. Li (2008). "Molecular profiling in the age of cancer genomics." Expert Rev Mol Diagn 8(3): 263-76. Lin, X., M. Tascilar, et al. (2001). "GSTP1 CpG island hypermethylation is responsible for the absence of GSTP1 expression in human prostate cancer cells." Am J Pathol 159(5): 1815-26. Liu, P., S. A. Tarle, et al. (1993). 
"Fusion between transcription factor CBF beta/PEBP2 beta and a myosin heavy chain in acute myeloid leukemia." Science 261(5124): 1041-4. Liu, P. P., A. Hajra, et al. (1995). "Molecular pathogenesis of the chromosome 16 inversion in the M4Eo subtype of acute myeloid leukemia." Blood 85(9): 2289-302. Lu, J., G. Getz, et al. (2005). "MicroRNA expression profiles classify human cancers." Nature 435(7043): 834-8. Marascuilo, L. A. and J. R. Levin (1970). "Appropriate post hoc comparisons for interaction and nested hypotheses in analysis of variance designs: the elimination of Type-IV errors." AERJ 7: 397-421. Maruyama, R., S. Toyooka, et al. (2002). "Aberrant promoter methylation profile of prostate cancers and its relationship to clinicopathological features." Clin Cancer Res 8(2): 514-9. Matsumura, H., S. Reich, et al. (2003). "Gene expression analysis of plant host-pathogen interactions by SuperSAGE." Proc Natl Acad Sci U S A 100(26): 15718-23. Maxwell, P. H., C. W. Pugh, et al. (2001). "Activation of the HIF pathway in cancer." Curr Opin Genet Dev 11(3): 293-9. Mendelsohn, J., P. M. Howley, et al., Eds. (2001). The molecular basis of cancer. Philadelphia, WB Saunders Company. Miller, C. W., K. Simon, et al. (1992). "p53 mutations in human lung tumors." Cancer Res 52(7): 1695-8.  42  Miranda, T. B. and P. A. Jones (2007). "DNA methylation: the nuts and bolts of repression." J Cell Physiol 213(2): 384-90. Mosteller, F. (1948). "A k-sample slippage test for an extreme population." Ann Math Stat 19: 58-65. Nakamura, H., N. Kawasaki, et al. (2006). "Survival impact of epidermal growth factor receptor overexpression in patients with non-small cell lung cancer: a meta-analysis." Thorax 61(2): 1405. Nanda, R. (2007). "Targeting the human epidermal growth factor receptor 2 (HER2) in the treatment of breast cancer: recent advances and future directions." Rev Recent Clin Trials 2(2): 111-6. Nesbitt, J. C., J. B. Putnam, Jr., et al. (1995). 
"Survival in early-stage non-small cell lung cancer." Ann Thorac Surg 60(2): 466-72. Neville, B. W. and T. A. Day (2002). "Oral cancer and precancerous lesions." CA Cancer J Clin 52(4): 195-215. Nicholson, A. G., L. J. Perry, et al. (2001). "Reproducibility of the WHO/IASLC grading system for pre-invasive squamous lesions of the bronchus: a study of inter-observer and intra-observer variation." Histopathology 38(3): 202-8. Paez, J. G., P. A. Janne, et al. (2004). "EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy." Science 304(5676): 1497-500. Palacios, J., M. J. Robles-Frias, et al. (2008). "The molecular pathology of hereditary breast cancer." Pathobiology 75(2): 85-94. Pearson, K. (1901). "On lines and planes of closest fit to systems of points in space." Philosophical Magazine 2: 559-572. Pease, A. C., D. Solas, et al. (1994). "Light-generated oligonucleotide arrays for rapid DNA sequence analysis." Proc Natl Acad Sci U S A 91(11): 5022-6. Phillips, D. H. (1983). "Fifty years of benzo(a)pyrene." Nature 303(5917): 468-72. Pisani, P., F. Bray, et al. (2002). "Estimates of the world-wide prevalence of cancer for 25 sites in the adult population." Int J Cancer 97(1): 72-81. Puchelle, E., J. M. Zahm, et al. (2006). "Airway epithelial repair, regeneration, and remodeling after injury in chronic obstructive pulmonary disease." Proc Am Thorac Soc 3(8): 726-33. Ramaswamy, S., P. Tamayo, et al. (2001). "Multiclass cancer diagnosis using tumor gene expression signatures." Proc Natl Acad Sci U S A 98(26): 15149-54. Rosti, G., G. Bevilacqua, et al. (2006). "Small cell lung cancer." Ann Oncol 17 Suppl 2: ii5-10. 43  Rowley, J. D. (2001). "Chromosome translocations: dangerous liaisons revisited." Nat Rev Cancer 1(3): 245-50. Saha, S., A. B. Sparks, et al. (2002). "Using the transcriptome to annotate the genome." Nat Biotechnol 20(5): 508-12. Scagliotti, G. (2007). 
"Optimizing chemotherapy for patients with advanced non-small cell lung cancer." J Thorac Oncol 2 Suppl 2: S86-91. Schena, M., D. Shalon, et al. (1995). "Quantitative monitoring of gene expression patterns with a complementary DNA microarray." Science 270(5235): 467-70. Seeger, R. C., G. M. Brodeur, et al. (1985). "Association of multiple copies of the N-myc oncogene with rapid progression of neuroblastomas." N Engl J Med 313(18): 1111-6. Shimizu, M., S. Ban, et al. (2007). "Squamous dysplasia and other precursor lesions related to esophageal squamous cell carcinoma." Gastroenterol Clin North Am 36(4): 797-811, v-vi. Soussi, T. (1996). "The p53 tumour suppressor gene: a model for molecular epidemiology of human cancer." Mol Med Today 2(1): 32-7. Steinhaus, H. (1956). "Sur la division des corp materiels en parties." Bull Acad Polon Sci, C1. III IV: 801-804. Stenson, P. D., E. Ball, et al. (2008). "Human Gene Mutation Database: towards a comprehensive central mutation database." J Med Genet 45(2): 124-6. Storchova, Z. and D. Pellman (2004). "From polyploidy to aneuploidy, genome instability and cancer." Nat Rev Mol Cell Biol 5(1): 45-54. Subramanian, J. and R. Govindan (2007). "Lung cancer in never smokers: a review." J Clin Oncol 25(5): 561-70. Sunaga, N., K. Miyajima, et al. (2004). "Different roles for caveolin-1 in the development of non-small cell lung cancer versus small cell lung cancer." Cancer Res 64(12): 4277-85. Tavazoie, S., J. D. Hughes, et al. (1999). "Systematic determination of genetic network architecture." Nat Genet 22(3): 281-5. van't Veer, L. J. and R. Bernards (2008). "Enabling personalized cancer medicine through analysis of gene-expression patterns." Nature 452(7187): 564-70. Velculescu, V. E., L. Zhang, et al. (1995). "Serial analysis of gene expression." Science 270(5235): 484-7. Venmans, B. J., T. J. van Boxem, et al. (2000). "Outcome of bronchial carcinoma in situ." Chest 117(6): 1572-6. Venter, J. C., M. D. Adams, et al. (2001). 
"The sequence of the human genome." Science 291(5507): 1304-51. 44  Warburg, O. (1956). "On respiratory impairment in cancer cells." Science 124(3215): 269-70. Weston, A., J. C. Willey, et al. (1989). "Differential DNA sequence deletions from chromosomes 3, 11, 13, and 17 in squamous-cell carcinoma, large-cell carcinoma, and adenocarcinoma of the human lung." Proc Natl Acad Sci U S A 86(13): 5099-103. White, J. A., P. J. McAlpine, et al. (1997). "Guidelines for human gene nomenclature (1997). HUGO Nomenclature Committee." Genomics 45(2): 468-71. Wistuba, II, C. Behrens, et al. (1999). "Sequential molecular abnormalities are involved in the multistage development of squamous cell lung carcinoma." Oncogene 18(3): 643-50. Wistuba, II, C. Behrens, et al. (2000). "High resolution chromosome 3p allelotyping of human lung cancer and preneoplastic/preinvasive bronchial epithelium reveals multiple, discontinuous sites of 3p allele loss and three regions of frequent breakpoints." Cancer Res 60(7): 1949-60. Wistuba, II and A. F. Gazdar (2006). "Lung cancer preneoplasia." Annu Rev Pathol 1: 331-48. Xu, Y., F. Fang, et al. (2004). "A mutation found in the promoter region of the human survivin gene is correlated to overexpression of survivin in cancer cells." DNA Cell Biol 23(9): 527-37. Yeung, K. Y. and W. L. Ruzzo (2001). "Principal component analysis for clustering gene expression data." Bioinformatics 17(9): 763-74. Zabarovsky, E. R., M. I. Lerman, et al. (2002). "Tumor suppressor genes on chromosome 3p involved in the pathogenesis of lung and other cancers." Oncogene 21(45): 6915-35. Zuyderduyn, S. D. (2009) “Corresponding regarding ‘Effect of active smoking on the human bronchial epithelium transcriptome’.” BMC Genomics 10:82.  45  CHAPTER II DETERMINING ADDITIONAL SEQUENCE DATA TO IMPROVE SAGE TAG TO GENE MAPPING†  2.1  INTRODUCTION One of the key components of SAGE analysis is matching a tag to an mRNA sequence in  order to identify the parent transcript or gene. 
Several modifications of the SAGE protocol have been developed to produce longer tag sequences to make this identification easier and more accurate (Saha, 2002; Matsumura, 2003; Gowda, 2004). However, generating longer tags directly increases the cost per unit of information. Moreover, although recent studies using SAGE typically use these improved protocols, there are over 700 profiles in the public domain that were generated using the original method (which produces 10bp of identifying sequence) (Lash, 2000; Barrett, 2006). The ability to use previously generated profiles in new studies is one of the distinct advantages of SAGE, so efforts to increase the information content of this existing knowledge base can be of great benefit.

† A version of this chapter will be submitted for publication. Zuyderduyn, S.D. and Vatcher, G. Additional sequence information from SAGE can reduce mapping ambiguity and aid novel gene discovery.

Difficulties with tag to gene mapping fall into two categories: 1) inability to assign a parent gene because two or more transcripts share the same tag (ambiguous tag), and 2) inability to identify the source of a tag when no transcript is matched (novel tag).

The original SAGE protocol uses the restriction enzyme BsmFI to cleave 14bp from the start of the recognition site (GGGAC) to release the tag (Figure 1.5). The 3' cytosine of the sequence is also the 5' cytosine of the anchoring enzyme recognition site (CATG). Thus, the final tag should be 15bp long. However, it is apparent from the downstream ditags that the length of the cleaved tag is variable. This effect is likely influenced by sequence, temperature, and the conditions of the reaction mixture (salt content, pH, etc.). The standard data processing for SAGE ignores these additional nucleotides, ostensibly as a trade-off between extra information and a uniform, high-confidence processed set of tags.
In fact, the original description of the SAGE technique considered only the first 13 nucleotides (Velculescu, 1995).

Although SAGE tag mapping is conceptually simple, there are several hazards that are seldom appreciated. First, several PCR amplification steps are required during library construction. Even with the use of a high-fidelity DNA polymerase, errors in replication do occur. When introduced during an early PCR cycle, these errors will be carried through to later cycles and can appear in significant numbers in the final dataset. Second, the anchoring enzyme NlaIII (or the less commonly used alternative, Sau3A) may not always cut at the 3’-most recognition site of an mRNA molecule. This results in the release of a product where the 2nd or even 3rd tag is captured. Third, tags corresponding to transcript sequences read in the 3’→5’ direction are often observed. Although there is speculation that some “antisense” tags may arise from regulatory RNA molecules, most are conspicuously positioned, sharing the same anchoring enzyme (AE) site as the expected tag and expressed at a lower level, suggesting they are a procedural artefact. In any event, the mechanism by which they arise is unclear. Finally, significant numbers of SAGE tags arise from regions of the mitochondrial genome that do not correspond to polyadenylated transcripts and are ostensibly the result of mitochondrial DNA contamination.

The addition of even a small number of nucleotides can potentially reduce the number of ambiguous tag to gene mappings, provide sufficient sequence information to allow the use of a whole-genome sequence search to identify the source of novel tags, and better identify artefacts arising from the SAGE protocol. The following chapter describes a method to determine these additional nucleotides.
2.2  MATERIALS AND METHODS

2.2.1 SAGE data

Sequence data from several hundred SAGE libraries was obtained from the SAGEmap project (ftp://ftp.ncbi.nlm.nih.gov/pub/sage/fasta) at the National Center for Biotechnology Information (NCBI) (Lash, 2000). Corresponding processed data and library information was obtained from the Gene Expression Omnibus (GEO), also at the NCBI (Barrett, 2007). The standard procedure for extracting SAGE tags from sequence data was performed using the Bio::SAGE::DataProcessing Perl module (Zuyderduyn, 2004).

2.2.2 Software development

The tag length distribution estimator (Section 2.3.1) was programmed using the Python language (version 2.4.4) (van Rossum, 2007). The XBP-SAGE algorithm (Section 2.3.5) was programmed using the C++ language (Stroustrup, 2000) and compiled using the GNU Compiler Collection (GCC) (version 4.1.2). Debugging was assisted using valgrind (version 3.2.1) (Nethercote, 2007). CPU profiling to assist in optimizing algorithm implementation was performed using the Google Performance Tools (google-perftools) (version 0.8) (http://code.google.com/p/google-perftools). Post-processing of the XBP-SAGE output was facilitated through the use of scripts written in Perl (Wall, 2000). Statistical analysis was performed using the R software package (version 2.6.1) (R Core Development Team, 2007).

Sequence bending/curvature predictions were performed using the FORTRAN program BEND (Goodsell, 1994). Minor source code modifications were required for a successful build. Wrapper scripts written in Perl were created to allow BEND to be run on many sequences, as the original program was written to provide estimates for a single input sequence.

2.2.3 Tag to gene mapping

Unless otherwise stated, tags were assigned a gene using SAGE Genie (March 03, 2008 release) (Boon, 2002).
When a more exhaustive tag to gene mapping was performed, the following resources were used:

1) SAGE Genie (March 03, 2008 release)
2) Exact-match BLASTN (Altschul, 1997) to the human genome (NCBI 36 assembly) using the web-based EnsEMBL BLAST resource (Flicek, 2008)
3) Exact-match BLASTN to the NCBI human EST collection (8,137,888 sequences) (Benson, 2008) using the web-based NCBI BLAST resource (Johnson, 2008)
4) Custom Perl scripts that match a specified tag to other tags in the library that differ by a single nucleotide or a single insertion or deletion. The tag was considered an artefact only if the closely related tag was expressed at a level at least 10-fold greater.
5) Match to a complete database of antisense tags and tags occurring at anchoring enzyme sites other than the 3’-most position from a meta-catalogue of sequences from Refseq (Pruitt, 2007) and Unigene (Wheeler, 2008) (developed by G. Vatcher)

2.3  RESULTS

2.3.1 Additional sequence information is available for the majority of SAGE tags

During library construction, SAGE tags are ligated tail-to-tail to form ditags. These ditags then undergo an additional ligation step to form long concatemers, which are cloned and sequenced. The anchoring enzyme (i.e. NlaIII) recognition site (i.e. CATG) then serves to identify the boundaries of each ditag in the resulting sequence data (Figure 1.5). Using the lengths of these ditags, one can determine the distribution of the tag sizes used to construct the library.

Simple inspection reveals that the vast majority (>95%) of ditags range in size from 28-32bp. Since the library construction protocol includes a ditag size selection step, sequences that are considerably larger or smaller likely represent cases where sequencing errors have introduced an anchoring enzyme recognition site where none exists or have destroyed a site that does exist. Based on this range, the associated tags must range in size from 14-16bp.
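As an illustration of how ditag lengths could be read off a concatemer sequence, the following Python sketch splits a sequence at successive CATG sites and tabulates the proportion of ditags of each length. It is not the procedure used to generate the results in this chapter. The function names are hypothetical, and the sketch assumes that a ditag's length is counted from the start of one anchoring enzyme site to the end of the next, so that both flanking CATG sites contribute to the 28-32bp total.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # NlaIII recognition site


def ditag_lengths(concatemer, site=ANCHOR_SITE):
    """Return one length per pair of neighbouring anchoring enzyme sites.

    Assumption: the length runs from the first base of the left-hand site
    to the last base of the right-hand site.
    """
    positions = []
    start = concatemer.find(site)
    while start != -1:
        positions.append(start)
        start = concatemer.find(site, start + 1)
    return [(right + len(site)) - left
            for left, right in zip(positions, positions[1:])]


def length_proportions(lengths, smallest=28, largest=32):
    """Proportion of ditags at each length from smallest to largest.

    Lengths outside the range are discarded, mirroring the observation that
    they likely reflect sequencing errors at an anchoring enzyme site.
    """
    kept = [n for n in lengths if smallest <= n <= largest]
    counts = Counter(kept)
    total = len(kept)
    return {n: (counts[n] / total if total else 0.0)
            for n in range(smallest, largest + 1)}


# Toy concatemer (invented sequence): three ditags with 20, 22 and 24
# bases between consecutive CATG sites.
toy = "CATG" + "A" * 20 + "CATG" + "C" * 22 + "CATG" + "T" * 24 + "CATG"
print(ditag_lengths(toy))
print(length_proportions(ditag_lengths(toy)))
```

The resulting proportions correspond to the observed ditag length distribution from which the tag length distribution is estimated.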
With tag sizes limited to 14-16bp, the following relationships must hold (where P(l_d) is the proportion of ditags of size l_d, and P(l_t) is the proportion of tags of size l_t):

\[
\begin{aligned}
P(l_d = 28) &= P(l_t = 14)^2 \\
P(l_d = 29) &= 2\,P(l_t = 14)\,P(l_t = 15) \\
P(l_d = 30) &= 2\,P(l_t = 14)\,P(l_t = 16) + P(l_t = 15)^2 \\
P(l_d = 31) &= 2\,P(l_t = 15)\,P(l_t = 16) \\
P(l_d = 32) &= P(l_t = 16)^2
\end{aligned}
\]

Throughout the text, P(l_d) denotes a vector containing the values from l_d=28 to l_d=32, and P(l_t) denotes a vector containing the values from l_t=14 to l_t=16. Using the relationships above, the proportion of each size of tag P(l_t) must satisfy the observed values for the proportion of ditag lengths P(l_d).

A simple genetic algorithm was developed to estimate the values of P(l_t) (Figure 2.1). The starting population consisted of 100 random solutions for P(l_t). Each of these solutions produces 100 children, where 90% will undergo a change to the value of a random element of P(l_t) (i.e. mutation) and the remaining 10% will swap the values of two random elements of P(l_t) (i.e. translocation). The amount a value of P(l_t) changes due to “mutation” is governed by a Beta distribution, with shape parameters α=1 and β=20 (Figure 2.2). This produces very small changes in the vast majority of offspring, but larger changes still have the opportunity to occur from time to time.

Figure 2.1: Summary of genetic algorithm to find solutions for proportion of tag lengths P(l_t). [Flowchart. Initialization: 100 random solutions for P(l_t). Selection: child with highest fitness (the sum of squared errors for the P(l_d) calculated from candidate P(l_t) vs. actual P(l_d)). Asexual reproduction: create 100 child solutions, of which 90% undergo a change to a random P(l_t) by an amount drawn from a Beta distribution (α=1, β=20) and 10% undergo “translocation”, where two random P(l_t) are swapped. Termination: stop algorithm when best solution is stable for 10 generations.]
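The genetic algorithm summarized in Figure 2.1 can be sketched in Python as follows. This is an illustrative reconstruction rather than the original estimator, and several details are assumptions that the text does not specify: only the fittest of the 100 starting solutions is allowed to reproduce, a mutation adds or subtracts the Beta-distributed amount with equal probability, each child is clipped at zero and rescaled to sum to one, and the parent is retained whenever no child is fitter (so "stable" means no improvement). All function names are hypothetical.

```python
import random


def expected_ditag_proportions(p_tag):
    """Expected P(l_d) for l_d = 28..32 given P(l_t) for l_t = 14, 15, 16."""
    p14, p15, p16 = p_tag
    return [
        p14 * p14,                   # 28 = 14 + 14
        2 * p14 * p15,               # 29 = 14 + 15, in either order
        2 * p14 * p16 + p15 * p15,   # 30 = 14 + 16 or 15 + 15
        2 * p15 * p16,               # 31 = 15 + 16, in either order
        p16 * p16,                   # 32 = 16 + 16
    ]


def sum_squared_error(p_tag, observed):
    """Fitness of a candidate P(l_t); smaller is better."""
    expected = expected_ditag_proportions(p_tag)
    return sum((e - o) ** 2 for e, o in zip(expected, observed))


def normalise(values):
    """Clip negatives to zero and rescale to sum to 1 (an assumption)."""
    clipped = [max(v, 0.0) for v in values]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in clipped]


def make_child(parent, rng):
    """Copy the parent with one mutation (90%) or translocation (10%)."""
    child = list(parent)
    if rng.random() < 0.9:
        i = rng.randrange(len(child))
        step = rng.betavariate(1, 20)  # mostly small changes
        child[i] += step if rng.random() < 0.5 else -step
    else:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return normalise(child)


def estimate_tag_proportions(observed, rng=None, n_start=100, n_children=100,
                             patience=10, max_generations=10000):
    """Return (best P(l_t), its sum of squared errors)."""
    rng = rng or random.Random()
    population = [normalise([rng.random() for _ in range(3)])
                  for _ in range(n_start)]
    best = min(population, key=lambda p: sum_squared_error(p, observed))
    best_error = sum_squared_error(best, observed)
    stable = 0
    for _ in range(max_generations):
        if stable >= patience:
            break
        children = [make_child(best, rng) for _ in range(n_children)]
        candidate = min(children, key=lambda p: sum_squared_error(p, observed))
        candidate_error = sum_squared_error(candidate, observed)
        if candidate_error < best_error:
            best, best_error, stable = candidate, candidate_error, 0
        else:
            stable += 1
    return best, best_error


# Synthetic check: build P(l_d) from a known P(l_t) and try to recover it.
observed = expected_ditag_proportions([0.04, 0.69, 0.27])
estimate, error = estimate_tag_proportions(observed, rng=random.Random(1))
print([round(p, 3) for p in estimate], error)
```

Because P(l_d=28) and P(l_d=32) depend only on P(l_t=14) and P(l_t=16) respectively, the observed ditag proportions determine a single solution, which is why such a simple search is sufficient here.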
The fitness of an individual solution is the sum of squared errors between the P(l_d) observed from the data and the expected P(l_d) calculated from the proposed solution; the smaller this value, the better the solution. The single fittest offspring was carried over to the successive round of reproduction, and the process was repeated until the best solution remained stable for 10 generations. The problem is relatively straightforward, and little optimization of the algorithm was required. A stable solution is usually found in <30 generations. On a Pentium 4 2.4GHz computer, execution took less than 30 seconds.

Figure 2.2: The beta distribution. Plotted is the probability density f(x; α=1, β=20) of the beta distribution (y-axis) as a function of x (x-axis).

The algorithm was applied to 71 publicly available SAGE libraries (see Section 2.2.1). The vast majority (>95%) of SAGE tags have additional sequence beyond the canonical 14bp (Table 2.1). Average values for P(l_t) were P(l_t=14)=0.036±0.039, P(l_t=15)=0.69±0.074, and P(l_t=16)=0.27±0.069 (95% confidence intervals shown). Thus, additional nucleotides beyond the commonly used 14bp are available in the majority of cases.

2.3.2 Theoretical improvement in mapping accuracy from extra nucleotides

In order to quantify the potential benefit of obtaining additional nucleotides, the SAGE Genie tag to gene mapping resource (March 03, 2008 release) (Boon, 2002) was examined. This resource contains several components; relevant here are a) a “master” list of tags and associated counts for 112 LongSAGE libraries, and b) a list of “best gene” mappings, where tags that have been previously observed in (a) are matched to a Unigene entry (Wheeler, 2008).
Table 2.1: Estimated solutions for the distribution of tag lengths for 71 publicly available SAGE libraries

Library name                        P(l_t=14)  P(l_t=15)  P(l_t=16)  Sum of squared errors
SAGE_293-CTRL                       0.03       0.70       0.27       5.61E-05
SAGE_293-IND                        0.03       0.70       0.27       1.26E-04
SAGE_95-259                         0.02       0.70       0.28       2.42E-05
SAGE_95-260                         0.03       0.69       0.28       8.35E-05
SAGE_95-347                         0.02       0.70       0.28       1.89E-04
SAGE_95-348                         0.03       0.71       0.26       2.94E-05
SAGE_A+                             0.03       0.63       0.33       3.32E-05
SAGE_A2780-9                        0.05       0.71       0.25       6.79E-06
SAGE_BB542_whitematter              0.03       0.72       0.24       3.04E-04
SAGE_Br_N                           0.03       0.68       0.29       4.80E-06
SAGE_CAPAN1                         0.04       0.72       0.24       6.28E-05
SAGE_CPDR_LNCaP-C                   0.05       0.68       0.27       2.26E-06
SAGE_CPDR_LNCaP-T                   0.03       0.69       0.28       9.92E-06
SAGE_Caco_2                         0.04       0.74       0.22       4.80E-04
SAGE_Chen_LNCaP                     0.06       0.70       0.24       1.88E-05
SAGE_Chen_LNCaP_no-DHT              0.06       0.68       0.26       1.20E-04
SAGE_Chen_Normal_Pr                 0.06       0.70       0.24       9.64E-06
SAGE_Chen_Tumor_Pr                  0.08       0.67       0.25       1.48E-04
SAGE_DCIS-3                         0.03       0.64       0.33       1.94E-06
SAGE_DCIS-4                         0.03       0.71       0.25       4.09E-06
SAGE_DCIS-5                         0.04       0.68       0.28       1.29E-06
SAGE_DCIS_2                         0.03       0.69       0.28       9.56E-05
SAGE_Duke-H988                      0.02       0.73       0.25       4.07E-05
SAGE_Duke_1273                      0.03       0.72       0.26       6.04E-05
SAGE_Duke_40N                       0.02       0.67       0.30       7.49E-06
SAGE_Duke_48N                       0.03       0.70       0.27       2.20E-05
SAGE_Duke_757                       0.03       0.71       0.26       1.00E-05
SAGE_Duke_96-349                    0.03       0.59       0.38       7.49E-04
SAGE_Duke_BB542_normal_cerebellum   0.12       0.62       0.26       5.43E-05
SAGE_Duke_GBM_H1110                 0.03       0.73       0.24       1.30E-04
SAGE_Duke_H1020                     0.02       0.71       0.27       2.83E-05
SAGE_Duke_H1043                     0.01       0.68       0.31       4.06E-05
SAGE_Duke_H1126                     0.02       0.69       0.29       2.40E-04
SAGE_Duke_H247_Hypoxia              0.03       0.74       0.23       5.17E-05
SAGE_Duke_H247_normal               0.03       0.75       0.22       9.17E-06
SAGE_Duke_H341                      0.05       0.71       0.25       2.83E-05
SAGE_Duke_H392                      0.02       0.70       0.28       1.12E-03
SAGE_Duke_H54_EGFRvIII              0.02       0.72       0.25       1.79E-04
SAGE_Duke_H54_lacZ                  0.02       0.71       0.27       1.82E-04
SAGE_Duke_HMVEC+VEGF                0.03       0.72       0.25       2.81E-05
SAGE_Duke_HMVEC                     0.04       0.69       0.27       2.61E-06
SAGE_Duke_Kidney                    0.05       0.70       0.25       1.57E-04
SAGE_Duke_leukocyte                 0.04       0.73       0.23       2.83E-06
SAGE_Duke_mhh-1                     0.04       0.71       0.25       2.64E-05
SAGE_Duke_post_crisis_fibroblasts   0.05       0.59       0.36       2.70E-05
SAGE_Duke_precrisis_fibroblasts     0.06       0.66       0.29       7.20E-05
SAGE_Duke_thalamus                  0.02       0.66       0.32       2.43E-04
SAGE_ES2-1                          0.02       0.79       0.19       4.36E-05
SAGE_H1126                          0.04       0.70       0.27       5.43E-06
SAGE_H126                           0.02       0.70       0.29       1.09E-04
SAGE_H127                           0.04       0.69       0.28       1.02E-06
SAGE_H408                           0.03       0.72       0.25       6.32E-06
SAGE_HCT116                         0.03       0.70       0.27       5.04E-04
SAGE_HMEC-B41                       0.13       0.59       0.27       4.56E-04
SAGE_HOSE_4                         0.02       0.68       0.29       6.91E-05
SAGE_HX                             0.03       0.69       0.28       1.16E-04
SAGE_Hemangioma_146                 0.04       0.70       0.26       3.56E-05
SAGE_IDC-3                          0.02       0.61       0.37       3.86E-06
SAGE_IDC-4                          0.02       0.61       0.37       1.92E-05
SAGE_IDC-5                          0.04       0.70       0.26       1.52E-06
SAGE_IOSE29-11                      0.02       0.73       0.26       1.47E-04
SAGE_LN-1                           0.04       0.67       0.29       1.71E-05
SAGE_LNCaP                          0.03       0.65       0.32       1.05E-05
SAGE_MDA453                         0.04       0.65       0.30       3.59E-05
SAGE_ML10-10                        0.03       0.72       0.25       4.89E-05
SAGE_Medullo_3871                   0.04       0.65       0.31       1.98E-05
SAGE_Meso-12                        0.04       0.72       0.24       4.13E-05
SAGE_MouseP8_PGCP                   0.03       0.69       0.28       5.68E-04
SAGE_Mouse_GCP_control              0.05       0.68       0.26       9.03E-06
SAGE_NC1                            0.05       0.67       0.28       3.34E-04
SAGE_NC2                            0.02       0.67       0.31       4.43E-04
Average                             0.04       0.69       0.27       1.18E-04

The “master” list contained 4,953,569 distinct tag types for a total of 158,291,068 tag counts and can be assumed to contain almost all possible SAGE tags present in human tissues. However, many of these tags will represent artefacts such as erroneously sequenced tags. During the processing of a single library, it is not uncommon to remove tags that only appear once in order to ameliorate the contribution of such artefacts. In this case, a slightly expanded criterion was used: a tag was considered an artefact if it appeared at a frequency less than or equal to 10 tags per million (TPM) and was observed in only one library. This removed the vast majority of tag types, but most of the total tag count was retained. The resulting list contained 188,525 distinct tag types for a total of 130,393,800 tag counts.
For each LongSAGE tag type, both the first 14bp and the first 16bp (two nucleotides longer than a canonical short SAGE tag) were used to search the “best gene” mappings for a match. Of the 188,525 tag types: 140,524 were unmatched, 44,462 mapped unambiguously to a single gene, 2,564 were ambiguous for 14bp but were completely resolved for 16bp, 313 were ambiguous for 14bp but were partially resolved for 16bp, and 746 were equally ambiguous for both 14bp and 16bp. Two conclusions can be made: 1) the majority of SAGE tags have no clear match to a gene (74.5%) and 2) of the 14bp tags that do match a gene, 7.4% are ambiguous, but the majority of these (77.0%) can be wholly or partially resolved with the addition of two extra nucleotides.

For SAGE tags that do not match a known mRNA sequence, an attempt to match the genome sequence can identify the source of the tag. However, for canonical 14bp tags this search will likely result in many spurious matches. The estimated size of the human genome is 3.2Gb (3.2×10^9 bp). The average number of locations in the genome that will match a given 14bp tag sequence is 3.2×10^9 / 4^14 ≈ 11.9. However, for longer sequences, this number drops to 3.2×10^9 / 4^15 ≈ 3.0 and 3.2×10^9 / 4^16 ≈ 0.75 matches for 15bp and 16bp tags, respectively. Thus, a small amount of additional sequence information can drastically increase the specificity of a whole-genome search; in the case of 16bp, a single hit is expected to occur in the majority of attempts.

This demonstrates that the addition of even a small number of extra nucleotides has the potential to substantially improve the tag to gene mapping accuracy for a short SAGE library by a) reducing ambiguity and b) providing additional sequence information for a whole-genome search when a tag fails to map to a known transcript.
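The expected number of chance matches in a whole-genome search can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch with a hypothetical function name; it assumes a 3.2×10^9 bp genome of uniformly random base composition and, by default, a search of one strand only. The result is sensitive to the genome size assumed: a size of 3.0×10^9 bp gives approximately 11.2, 2.8 and 0.70 matches for 14bp, 15bp and 16bp tags.

```python
def expected_random_matches(tag_length, genome_size=3.2e9, strands=1):
    """Expected number of chance occurrences of one specific tag sequence.

    Each of the 4**tag_length possible sequences is assumed to be equally
    likely at every genome position. Set strands=2 to count both strands.
    """
    return strands * genome_size / 4 ** tag_length


for n in (14, 15, 16):
    print(n, round(expected_random_matches(n), 2))
# 14 11.92
# 15 2.98
# 16 0.75
```

Each additional nucleotide divides the expected number of chance matches by four, so two extra nucleotides reduce it sixteen-fold.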
2.3.3 Influence of tag nucleotide composition on length is small, but significant

To investigate the influence of sequence on the length of a tag, a regression analysis was used to identify any relationship between ditag length (the response variable) and nucleotide content (the explanatory variable). This procedure was applied to each of the 71 publicly available SAGE libraries (see Section 2.2.1). Ditag length was modelled as a Gaussian (normal) distributed variable. It should be noted that although ditag length is a discrete value, the binomial distribution was a poor fit to the observed data. As a result, the estimated magnitude of the effect of nucleotide content should be considered a suboptimal, albeit close, approximation.

Let L_i be the length of some ditag i, μ be the mean of all ditag lengths, σ^2 be the variance of all ditag lengths, β be the vector of linear coefficients to be estimated, and x be the vector of explanatory variables (i.e. x_nuc is the percentage contribution of some arbitrary nucleotide(s)). The linear regression model can then be expressed as:

\[
L_i \sim \mathrm{Normal}(\mu, \sigma^2), \qquad L_i = \beta_0 + \beta_{nuc}\, x_{nuc}
\]

The model fit was performed using the standard regression functions of the R statistical software (R Core Development Team, 2007). Significance was determined by testing the null hypothesis that β_nuc=0 (nucleotide composition has no effect on ditag length) using the t-test. Let s.e.(β_nuc) be the standard error of the estimated coefficient β_nuc and let d.f. be the degrees of freedom. The t statistic is:

\[
t = \frac{\beta_{nuc}}{\mathrm{s.e.}(\beta_{nuc})}, \qquad \mathrm{d.f.} = n_{obs} - 2
\]

An effect was sought for each individual nucleotide (A, C, G, T) and all groups of two nucleotides (AC, AG, AT, CG, CT, GT). In most of the 71 libraries tested, there was a clear association between ditag length and the number of AT or CG nucleotides. In 43/71 libraries, the effect was highly significant (p<0.001) and in 6/71 libraries, the effect was moderately significant (p<0.01).
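The slope test described above can be sketched in plain Python. The analysis in this chapter used the regression functions of R; the version below is only an illustration of the same ordinary least-squares calculation, with hypothetical function names and invented toy numbers in place of real ditag data.

```python
import math


def nucleotide_percentage(sequence, nucleotides="AT"):
    """Percentage of bases in sequence that belong to the given set."""
    sequence = sequence.upper()
    hits = sum(1 for base in sequence if base in nucleotides)
    return 100.0 * hits / len(sequence)


def slope_t_statistic(x, y):
    """Fit y = b0 + b1 * x by least squares and test b1 = 0.

    Returns (slope, standard error, t statistic, degrees of freedom),
    where the degrees of freedom are n_obs - 2.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2
                      for xi, yi in zip(x, y))
    df = n - 2
    standard_error = math.sqrt(residual_ss / df / sxx)
    return slope, standard_error, slope / standard_error, df


# Toy numbers only (not SAGE data): x would be the AT percentage of each
# ditag and y its observed length.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
slope, se, t, df = slope_t_statistic(x, y)
print(slope, se, t, df)
```

The p-value then follows from comparing t against a t distribution with the returned degrees of freedom, for example with `pt` in R.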
Of the remaining libraries, 8/22 showed AT or CG content as the strongest association, although not at the chosen level of significance (p<0.01). In no case did another possible nucleotide composition show a significant effect. However, the magnitude of this effect was neither large nor consistent in sign. $\beta_{AT}$ was positive in 23/71 libraries (0.359±0.162; 95% CI) and negative in 48/71 libraries (-0.191±0.038; 95% CI). This suggests that a more descriptive explanatory variable exists that is somehow associated with AT and/or CG content. One hypothesis is that certain motifs substantially alter the structure of the target DNA, and those that shorten the tag product tend to be AT rich. Another possibility, which is not mutually exclusive, is that random effects (e.g. local fluctuations in temperature, reaction conditions) simply outweigh the effect of sequence.

2.3.4 Predicted curvature of tag sequence is correlated with ditag length

Based on the results of Section 2.3.3, it was hypothesized that tag sequence influences the shape of the DNA molecule in a manner sufficient to affect the point at which the tagging enzyme (BsmFI) cuts its target sequence. Research into the influence of sequence on DNA shape has been an ongoing concern in investigations into areas such as gene regulation (e.g. transcription factor binding) and DNA packaging (e.g. nucleosome binding). The rod model of DNA has been broadly adopted, and is appealing in the context of the problem here. This model describes short DNA sequences as resembling a “rod” which will bend and twist, depending on the constituent sequence (reviewed in Munteanu, 1998). If the tagging enzyme cleaves DNA at a fixed distance from the recognition site, then significant bending or other distortion could effectively “shorten” the target DNA, resulting in a larger cleavage product. Indeed, a well known example is that of interspersed adenine tracts that are in-phase with the turns of the helix.
Such a configuration will introduce rigidity on one face of the helix that will cause the DNA to bend (Reich, 1992). This example is consistent with the results in Section 2.3.3 where, in many libraries, the presence of AT nucleotides tends to result in shorter tags.

The BEND program predicts local bending and global curvature from arbitrary sequence using a choice of several models (Goodsell, 1994). The BEND developers found that a trinucleotide model based on nucleosome positioning data performs best and so it is used here. For each ditag, the two constituent 14bp tags were analyzed by the program and bend angles for each trinucleotide were calculated. If each trinucleotide X-Y-Z represents a distance of 2 units, then one would have to “walk” 13 units to navigate the helix along the backbone of a SAGE tag. Upon the introduction of a bend, the distance between nucleotide X and nucleotide Z will be shortened. Let $\theta_i$ be the bend angle of trinucleotide $i$ from the set of all length-1 trinucleotides. The modified length can be calculated using the expression:

$$L_{tag} = \sum_i \cos(\theta_i)$$

For each ditag, the $L_{tag}$ value for the two constituent tags is combined to create a predicted length $L_{ditag}$. As in Section 2.3.3, a linear regression analysis was performed to determine if there is a relationship between $L_{ditag}$ and the observed ditag length. Let $\mu$ and $\sigma^2$ be the mean and variance of the ditag length, respectively; and let $\beta_0$ and $\beta_L$ be the coefficients of the linear fit.

$$L_i \sim \mathrm{Normal}(\mu, \sigma^2)$$
$$L_i = \beta_0 + \beta_L\,L_{ditag}$$

The model fit was performed using the standard regression functions of the R statistical software (R Development Core Team, 2007). Significance was determined by testing the null hypothesis that $\beta_L=0$ (the value for $L_{ditag}$ has no relationship with ditag length) using the t-test, where:

$$t = \frac{\beta_L}{\mathrm{s.e.}(\beta_L)}, \qquad \mathrm{d.f.} = n_{obs} - 2$$

In this case, the relationship was highly significant in 67/71 libraries (p<0.001) and moderately significant in the remaining 4/71 libraries (p<0.05).
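The two calculations described above, the bend-adjusted length and the t-test on the regression slope, can be sketched in plain Python. This is an illustration only: the function names are hypothetical, the bend angles would in practice come from the BEND program (whose output format is not reproduced here), combining the two tags by summation is an assumption, and the thesis itself used the regression functions in R rather than this hand-written least-squares fit.

```python
import math


def predicted_tag_length(bend_angles_deg):
    """L_tag = sum_i cos(theta_i), one bend angle (in degrees) per trinucleotide."""
    return sum(math.cos(math.radians(angle)) for angle in bend_angles_deg)


def predicted_ditag_length(angles_5p_tag, angles_3p_tag):
    """Combine the two constituent tags; summation is an assumption of this sketch."""
    return predicted_tag_length(angles_5p_tag) + predicted_tag_length(angles_3p_tag)


def slope_t_statistic(x, y):
    """Ordinary least squares fit of y = b0 + b1*x.

    Returns (b1, s.e.(b1), t, d.f.) with t = b1 / s.e.(b1) and d.f. = n - 2,
    i.e. the test of the null hypothesis b1 = 0.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residual sum of squares around the fitted line
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    df = n - 2
    se_b1 = math.sqrt(sse / df / sxx)
    return b1, se_b1, b1 / se_b1, df
```

A perfectly straight tag (all bend angles zero) keeps its full length under this expression, and any non-zero bend shortens it, which is the behaviour the hypothesis relies on.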
This analysis provides compelling support for the notion that the local shape of the DNA at the endonuclease target region plays a significant role in the size of the cleavage product obtained. However, the relative magnitude of the relationship is not strong. Although a significant proportion of the total variance in ditag length is accounted for by predicted shape effects, a majority of the variance remains. Therefore, calculating the bending of an arbitrary sequence can provide a prediction of ditag length that will outperform a random guess but not to a degree of accuracy that would be of substantial practical use. Thus, it is not possible to determine which tag the additional nucleotides of a ditag arose from based on a particular sequence.

2.3.5 XBP-SAGE: an algorithm to estimate additional nucleotides

One can imagine a library of ditags as being the realization of a population of tags, each having a length of 16bp prior to being subjected to restriction digest by the tagging enzyme (in reality, a tag can be arbitrarily long prior to digestion, but since it is already established that the tagging enzyme does not release a product longer than 16bp, any additional sequence can be ignored and we can suppose a length of 16bp as the effective maximum). The tagging enzyme will release a product that is either this length, or one or two nucleotide(s) shorter. This population is then randomly ligated to form ditags. Thus, we are interested in determining what tag sequences in the original population would have resulted in the observed ditag sequences. This is a classic problem where one can find a solution using the method of maximum likelihood (ML). Whereas a probability represents the chance that an event (A) would occur given a known set of parameter values (B) (i.e. P(A|B)), a likelihood treats the observed event as fixed and evaluates the same quantity as a function of the candidate parameter values (i.e. L(B|A) = P(A|B)).
It is worth noting this distinction since, although the underlying mathematics and resulting values are similar, the terms refer to two separate concepts.

2.3.5.1 The likelihood function

Let x be an observed variable and θ be some parameter of interest. A likelihood function is defined as:

$$\theta \mapsto P(x|\theta)$$
$$L(\theta|x) = P(x|\theta)$$

In this case, let x be a possible ditag sequence and θ be two candidate tag sequences. Obviously, if the first 14bp of the tag sequences do not match the 14bp ends of the ditag sequence, then L(θ|x)=0. Otherwise, L(θ|x) will depend on the sequence of the ditag, the probability that a ditag of that length will be observed, and in what manner the extra nucleotides match those of the 16bp of the proposed tag sequences. The likelihood function can be expressed in more specific terms. Let λ be a ditag ligation site, which results in some putative assignment of the extra nucleotides to the 3' or 5' tag. Then each L(θ,λ|x) can be used to determine L(θ|x):

$$L(\theta|x) = P(x|\theta) = P(l_d)\sum_{\lambda}\left[P(x|\theta,\lambda)\,P(\lambda)\right]$$

Consider the following example:

x = 5'-CATGTTTTTTTTTTTCCCCCCCCCCCATG-3'
    3'-GTACAAAAAAAAAAAGGGGGGGGGGGTAC-5'   (29 bp)

θ = ( 5'-CATGTTTTTTTTTTGG-3', 5'-CATGGGGGGGGGGGAA-3' )

We know that the extra nucleotide [T] has an equal probability of arising from either the 5' or 3' tag, thus P(λ=CATGT_{10}T║C_{10}CATG)=P(λ=CATGT_{10}║TC_{10}CATG)=0.5 (where the character ║ denotes the ligation site and a subscript denotes a run of identical nucleotides). If it arose from the 5' tag, then observing θ is impossible ([T]≠[G]), thus P(x|θ,λ=CATGT_{10}T║C_{10}CATG)=0; if it arose from the 3' tag, then θ is plausible ([A]=[A]), thus P(x|θ,λ=CATGT_{10}║TC_{10}CATG)=1. In addition, we must also consider P(l_d), the probability that a ditag of the observed length results from the ligation of two tags (see Section 2.3.1). Therefore:

L(θ|x) = P(l_d=29) × [ P(x|θ,λ=CATGT_{10}T║C_{10}CATG) P(λ=CATGT_{10}T║C_{10}CATG) + P(x|θ,λ=CATGT_{10}║TC_{10}CATG) P(λ=CATGT_{10}║TC_{10}CATG) ]
       = P(l_d=29) × [ 0×0.5 + 1×0.5 ]
       = 0.5 × P(l_d=29)

In other words, if a 29bp ditag arose from a ligation of the two tags θ, then the observed ditag x would be expected to occur at a frequency equal to one half the probability of observing a ditag of length 29bp.

Consider a more complicated example:

x = 5'-CATGTTTTTTTTTTTGCCCCCCCCCCCATG-3'
    3'-GTACAAAAAAAAAAACGGGGGGGGGGGTAC-5'   (30 bp)

θ = ( 5'-CATGTTTTTTTTTTTA-3', 5'-CATGGGGGGGGGGGCA-3' )

We know that the two additional nucleotides [TG] could be a) both from the 5' tag, b) both from the 3' tag, or c) one from each tag. The probability of each possibility (e.g. P(l_t=16|l_d=30)) varies from library to library and can be estimated as previously described (see Section 2.3.1). Given a ditag of length 30bp, the probability of each of these three possibilities is:

a) P(λ=CATGT_{10}TG║C_{10}CATG) = P(l_t=16)P(l_t=14)/P(l_d=30)
b) P(λ=CATGT_{10}║TGC_{10}CATG) = P(l_t=14)P(l_t=16)/P(l_d=30)
c) P(λ=CATGT_{10}T║GC_{10}CATG) = P(l_t=15)P(l_t=15)/P(l_d=30)

Now, consider each possibility in turn: a) if both nucleotides came from the 5' tag, then θ is impossible ([TG]≠[TA]), thus P(x|θ,λ=CATGT_{10}TG║C_{10}CATG)=0; b) if both nucleotides came from the 3' tag, then θ is plausible ([CA]=[CA]), thus P(x|θ,λ=CATGT_{10}║TGC_{10}CATG)=1; and c) if one nucleotide came from each tag, then θ is plausible ([T]=[T] and [C]=[C]), thus P(x|θ,λ=CATGT_{10}T║GC_{10}CATG)=1. Therefore:

L(θ|x) = P(l_d=30) × [ P(x|θ,λ=CATGT_{10}TG║C_{10}CATG) P(λ=CATGT_{10}TG║C_{10}CATG) + P(x|θ,λ=CATGT_{10}║TGC_{10}CATG) P(λ=CATGT_{10}║TGC_{10}CATG) + P(x|θ,λ=CATGT_{10}T║GC_{10}CATG) P(λ=CATGT_{10}T║GC_{10}CATG) ]
       = P(l_d=30) × [ 0×P(l_t=16)P(l_t=14)/P(l_d=30) + 1×P(l_t=14)P(l_t=16)/P(l_d=30) + 1×P(l_t=15)P(l_t=15)/P(l_d=30) ]
       = P(l_t=14)P(l_t=16) + P(l_t=15)^2

Tables of likelihood values covering all possible solutions for ditags of length 28-32bp are provided for reference in Appendix I.
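The per-ditag likelihood worked through in the two examples above can be written as a short Python sketch. The function names and the tag-length probabilities used in the usage example are illustrative assumptions, not values or code from the XBP-SAGE implementation. The sketch relies on the identity P(l_d)·P(λ) = P(l_5')·P(l_3'), which follows from the definition of P(λ) given in the second example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def ditag_length_probability(ditag_len, p_tag_len):
    """P(l_d): sum of P(l5) * P(l3) over all tag lengths with l5 + l3 = l_d."""
    return sum(
        p5 * p_tag_len.get(ditag_len - l5, 0.0) for l5, p5 in p_tag_len.items()
    )


def ditag_likelihood(ditag, tag_5p, tag_3p, p_tag_len):
    """L(theta|x) for one ditag x and one candidate pair of 16bp tags theta.

    p_tag_len maps a tag length (14, 15, 16) to its probability P(l_t).
    """
    bottom = reverse_complement(ditag)  # the 3' tag, read 5'->3'
    total = 0.0
    for l5, p5 in p_tag_len.items():  # each l5 is one ligation site lambda
        l3 = len(ditag) - l5
        if l3 not in p_tag_len:
            continue
        # P(x|theta,lambda) is 1 if both tags agree with the ditag, else 0
        if ditag[:l5] == tag_5p[:l5] and bottom[:l3] == tag_3p[:l3]:
            total += p5 * p_tag_len[l3]  # equals P(l_d) * P(lambda)
    return total


if __name__ == "__main__":
    p_len = {14: 0.05, 15: 0.65, 16: 0.30}  # illustrative values only
    x30 = "CATG" + "T" * 10 + "TG" + "C" * 10 + "CATG"
    theta = ("CATG" + "T" * 10 + "TA", "CATG" + "G" * 10 + "CA")
    print(ditag_likelihood(x30, theta[0], theta[1], p_len))
```

With these illustrative probabilities, the 30bp example evaluates to P(l_t=14)P(l_t=16) + P(l_t=15)^2, matching the closed form derived above.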
However, these equations assume it is known which particular tag arose from any one ditag. In practice, one can only reduce the possible source to the set of ditags that match the first 14bp of sequence. Thus, one must determine the likelihood of observing some ditag given two sets of possible child tags (e.g. the counts of the pair of 14bp tags). Let us define δ as a ditag having a sequence 5’-aXb-3’, where a and b are the first 14bp of the 5’ and 3’ tag, respectively, and X are the additional nucleotide(s). Furthermore, let A and B be the set of tags where the first 14bp is a or b, respectively. The size of these sets is denoted by $n_A$ and $n_B$. Since one cannot impose an assumption of which specific tag arose from a ditag, the likelihood must be expressed in terms of observing the set of tags A and B, which is the average of all possible joinings of a tag in A to a tag in B:

$$L(A,B|\delta) = P(\delta|A,B) = \frac{\sum_{a \in A}\sum_{b \in B}\sum_{\lambda} P(\delta|\lambda)\,P(\lambda|a,b)}{n_A\,n_B}$$

Finally, the global likelihood of all of the observed ditags in a library given some set of tags is simply the joint likelihood over all ditags. Let Δ be the entire library of ditags and let Τ be the entire set of 16bp tags (where A and B denote subsets of Τ having the same first 14bp of sequence defined by the two flanking portions of some ditag δ).

$$L(\mathrm{T}|\Delta) = \prod_{\delta} L(A,B|\delta)$$
$$\log L(\mathrm{T}|\Delta) = \sum_{\delta} \log L(A,B|\delta)$$

And so, the objective is to determine a set of 16bp tags Τ that maximizes the likelihood of observing the ditag sequences Δ.

2.3.5.2 The problem of global optimization

Although it is theoretically possible to find the most likely solution by testing all possible Τ, this approach is not practical. Consider a very small Δ of 50 ditags. For each of the 100 tags, the two additional nucleotides can have one of 16 ($4^2$) possible sequences. Therefore, the solution space would consist of $16^{100}$ (approximately $2.6\times10^{120}$) possible solutions. Raising the number of ditags by one order of magnitude (500 ditags, or 1,000 tags) enlarges the solution space to $16^{1000}$ (approximately $1.3\times10^{1204}$), an increase of more than 1,000 orders of magnitude. Problems that grow rapidly in this fashion are said to undergo a “combinatorial explosion”, making a complete search of the solution space for any reasonably sized realization of the problem intractable. In the parlance of computational complexity theory, a brute force algorithm would run in $O(c^n)$ time [1]. In the 50-ditag example, if one assumes that calculating a single likelihood is exceedingly fast (1 calculation/picosecond), the search would take about $8\times10^{100}$ years [2].

[1] Big O notation is used to describe an algorithm’s resource usage (time in this case) in relation to the size of the input data. For example: $O(1)$ is constant time, $O(n)$ is linear time, and $O(c^n)$ is exponential time.
[2] The current age of the universe is estimated to be about $1.4\times10^{10}$ years (Hinshaw, 2009).

2.3.5.3 Simulated annealing

One approach to solving T without requiring an exhaustive search would be to modify each tag one at a time and accept the solution if the result was an increase in L(T|Δ). This would effectively make the running time linear, or $O(n)$. However, this strategy runs the danger of getting trapped in local maxima due to the interdependency of different tags in calculating the likelihood. For example, a change in the additional nucleotides of a tag in the set of tags A having the same starting 14bp will affect L(A,B|δ_ab), L(A,C|δ_ac), L(A,D|δ_ad), and so on. A subsequent change to the additional nucleotides of a tag in, say, the set of tags B may alter the likelihood L(A,B|δ_ab) to such an extent that the previous solution for A no longer provides optimal L(Τ|Δ).

This problem can be addressed by a simulated annealing (SA) strategy. This is a computational approach for finding a solution that is a good approximation of the global optimum given a large search space (Kirkpatrick, 1983).
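For readers unfamiliar with the technique, the general shape of simulated annealing can be sketched as follows. This is the textbook Metropolis-style formulation on a toy problem, not the XBP-SAGE procedure: the function names, the acceptance rule, the toy score, and the temperature schedule (start at 100, decrease by 10) are all assumptions chosen for illustration.

```python
import math
import random


def simulated_annealing(score, neighbour, start,
                        t_start=100.0, t_step=10.0, sweeps=50, seed=0):
    """Maximise score() by a generic simulated annealing search.

    score(state)          -> number, higher is better
    neighbour(state, rng) -> a nearby candidate state
    Returns (best_state, best_score) seen during the search.
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    current, current_score = start, score(start)
    best, best_score = current, current_score
    t = t_start
    while t > 0:  # stop before t reaches 0 to avoid dividing by zero
        for _ in range(sweeps):
            candidate = neighbour(current, rng)
            delta = score(candidate) - current_score
            # Always accept improvements; accept a worse move with
            # probability exp(delta / t), which shrinks as t cools.
            if delta >= 0 or rng.random() < math.exp(delta / t):
                current, current_score = candidate, current_score + delta
                if current_score > best_score:
                    best, best_score = current, current_score
        t -= t_step
    return best, best_score


if __name__ == "__main__":
    # Toy landscape on the integers 0..30 with several local maxima.
    def toy_score(x):
        return -abs(x - 20) + (3 if x % 5 == 0 else 0)

    def toy_neighbour(x, rng):
        return min(30, max(0, x + rng.choice((-1, 1))))

    print(simulated_annealing(toy_score, toy_neighbour, start=0))
```

The key design point is that moves which lower the score are sometimes accepted early on, which is what allows the search to leave a local maximum that a purely greedy, one-tag-at-a-time update would be stuck in.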
The rationale of SA is inspired by the metallurgical annealing process, where a material is subjected to high heat and then cooled slowly to remove defects. SA relies on an acceptance function, which utilizes a temperature parameter t. SA starts with a random solution to the problem. At the beginning of the algorithm, when t is highest, the selection of a nearby solution is nearly random, while at lowest t, when the algorithm terminates, the chosen nearby solution will have the best fitness. By slowly decreasing t, the risk of getting trapped in local extrema is reduced since there is ample opportunity for escape at higher temperatures. SA is not a deterministic algorithm, and cannot guarantee an optimal solution; however, it can provide an excellent approximation in a reasonable amount of time.

2.3.5.4 Simple reductions to the solution space and the use of an “unknown” nucleotide

Two simple assumptions can be imposed to reduce the size of the solution space that must be evaluated. First, longer ditags can reduce the number of possible solutions since the extra nucleotides can be assigned to two contributing tags with certainty. 32bp ditags must arise from two 16bp tags, so both extra nucleotides are known; and 31bp ditags must arise from two tags that are at least 15bp, so one extra nucleotide is known. Second, we can ignore any solution for a given tag where there is no supporting evidence from other tags with the same 14bp starting sequence. For example, if there is a set of three tags where we have proposed that two of these have extra nucleotides AT and AG based on the ditag evidence, then the extra nucleotides of the third tag can only be one of these two possibilities. This may or may not be true in every case, but it is a reasonable assumption made for the sake of efficiency. Finally, we can define an additional nucleotide which represents an “unknown” (denoted by an X in the program’s output).
An “unknown” nucleotide is assigned when supporting evidence is unavailable and any of the four possible nucleotides are equally likely. Specifically, an “unknown” 1st extra nucleotide is assigned if, and only if, no evidence of any kind is available. Either all of the tags of a sequence must be unknown at the first position, or none at all. For example, a solution of {XX, XX, XX} is acceptable, but {AA, XX, XX} is not. An unknown 2nd extra nucleotide is similar, except this restriction only applies with respect to the subset of tags with the same 1st extra nucleotide. For example, a solution of {AA, AA, CX, CX} is acceptable, but {AA, AA, CT, CX} is not.

2.3.5.5 Algorithm summary

The XBP-SAGE algorithm proceeds as follows:

Step 1
An initial solution is created by randomly choosing the ligation point for each ditag. Any missing nucleotides are filled in, if possible, according to the rules described in Section 2.3.5.4. Two global likelihood values are generated: 1) the preliminary global likelihood, $L_p$, based only on the sequence arising from cutting the ditag at a proposed ligation point; and 2) the full global likelihood, $L_f$, based on the additional filled-in sequence. The initial annealing temperature t is set to 100.

Step 2
Proceed through the set of ditags in a random order, and for each ditag calculate $L_p$ for each possible ligation point (this is the set of nearby solutions). A choice of which ligation point to proceed with is made using t. Let Λ be the set of $L_p$ for each possible ligation point $\lambda_1, \lambda_2, \ldots, \lambda_n$. Calculate a modifier value:

$$\mu = \left[\,|t-100|/10\,\right] + 1$$

Alter the likelihoods such that:

$$\lambda_i' = \exp\{\lambda_i - \min(\Lambda)\}^{\mu}$$

This procedure ensures that the ligation point that results in the highest likelihood will always be chosen when t=0. At the start of the algorithm, when t=100, the choice is a stochastic process influenced by the values of Λ.
For example, if $\lambda_1$ is twice the size of $\lambda_2$, then the ligation point corresponding to $\lambda_1$ will be chosen two-thirds of the time.

Step 3
In a randomly assigned order, generate additional nucleotides for tags where required (i.e. to avoid violating the assumptions described in Section 2.3.5.4) or change existing additional nucleotides if possible, and recalculate $L_f$. If $L_f$ does not improve and the previous extra nucleotide assignments are still valid, then discard the changes.

Step 4
Decrease the annealing temperature by 10. If t < 0 then accept the current solution and terminate the algorithm, otherwise return to Step 2.

2.3.5.6 Performance on a test dataset

To assess the performance of the XBP-SAGE algorithm, a publicly available normal human cervical epithelium LongSAGE library was used (GEO accession: GSM144012) (Barrett, 2007). LongSAGE tags are 21bp in length (including the anchoring enzyme site) and 100,000 were randomly selected and shortened to 16bp to create a SAGE library with an extra two nucleotides for each tag. This creates a test set with a realistic distribution of counts and sequence differences and known extra nucleotides. Each tag was then trimmed based on the following probabilities: P(l_t=14)=0.05, P(l_t=15)=0.65, and P(l_t=16)=0.30. Finally, the tags were randomly joined to create a simulated set of 50,000 ditags. The genetic algorithm (Section 2.3.1) to estimate values for P(l_t) from the ditag lengths yielded excellent agreement with values of: P(l_t=14)=0.0499, P(l_t=15)=0.650, P(l_t=16)=0.300. The data was subjected to the XBP-SAGE algorithm which completed in 2.5 hours on a single Pentium 4 2.4GHz CPU. 10,911 tag types (12,713 tag counts) were assigned no additional nucleotides, 12,921 tag types (26,584 tag counts) were assigned one additional nucleotide, and 5,777 tag types (60,703 tag counts) were assigned two additional nucleotides.
Thus, it can be generally stated for this dataset that a tag with a count >2 will reveal one additional nucleotide, and a tag with a count >9 will reveal two additional nucleotides. The extra nucleotides were compared to the actual data and 98,472/100,000 (98.5%) were assigned correctly (Figure 2.3). When an incorrect assignment did occur, the tag was expressed at low levels (<10 counts) and was usually a low-expressed member of a highly-expressed group of tags with the same starting 14bp. For example:

sequence        actual count    estimated count
CCAAGGTGGCCC    44              38
CCAAGGTGGCCT    0               6

However, these situations were rare and, in general, the algorithm identified the breakdown of counts closely. For example:

sequence        actual count    estimated count
ACTTTTTCAAAA    414             412
ACTTTTTCAAAG    3               6
ACTTTTTCAACA    1               1
ACTTTTTCAAG     3               2

2.3.5.7 Improvements to tag to gene mapping on real data

The XBP-SAGE algorithm was run on an unmodified SAGE library generated from colonic epithelium (GEO accession: GSM728) (Velculescu, 1995; Barrett, 2007). This library consisted of 25,570 ditags.
[Figure 2.3: six panels in three columns titled “no additional nucleotides”, “one additional nucleotide” and “two additional nucleotides”; x-axis: expected tag count; y-axis: observed tag count.]

Figure 2.3: Comparison of the observed and expected count of simulated tags with two additional nucleotides. Tags assigned zero, one, and two additional nucleotides are separated into three plots. The top three plots are scatterplots with the expected (known) count on the x-axis and the observed (estimated) count on the y-axis. The bottom three plots are density plots corresponding to the three scatterplots. These are provided to convey the number of datapoints at each x-y coordinate, as more than one tag can have the same expected and observed count. A dot in the density plot represents a case where a single observation is present at a given coordinate.

The genetic algorithm (Section 2.3.1) to estimate values for P(l_t) from the ditag lengths yielded values of: P(l_t=14)=0.05, P(l_t=15)=0.67, P(l_t=16)=0.28. The consensus tag list contained 788 14bp tag types (815 tag counts), 9,646 15bp tag types (14,275 tag counts), and 7,865 16bp tag types (36,048 tag counts). The success of tag to gene mapping was estimated using SAGE Genie (Boon, 2002) and was compared between the tags for which additional nucleotides were obtained and their canonical 14bp version (Table 2.2). To ameliorate the effect of spurious artefacts (e.g. PCR amplification or sequencing errors), tags which were observed only once were discarded. Of those tags that mapped ambiguously to a gene, 50% and 64% were no longer ambiguous with the addition of one or two extra nucleotides, respectively. Interestingly, some tags that were ambiguous or unambiguous at 14bp became unmapped with the addition of two nucleotides (40/459 ambiguous tags and 47/409 unambiguous tags). Ostensibly, these tags represent rare cases where XBP-SAGE assigned an additional nucleotide incorrectly. Tags with counts ≥5 were investigated in detail. Using a meta-list of all public LongSAGE libraries where each tag was shortened to 16bp, there was strong evidence (i.e. the 16bp tag was the highest expressed member of the group of tags with the same starting 14bp) for correctness in 5/12 previously ambiguous tags and 13/21 previously unambiguous tags. However, when a comprehensive mapping was performed (see Section 2.2.3), an XBP-SAGE miscall was supported in only 3/12 previously ambiguous tags (Table 2.3) and 5/21 previously unambiguous tags (Table 2.4). The discrepancy is predominantly the result of sequencing artefacts.
Of the previously ambiguous tags, assignments were made to the mitochondrial genome (4), PCR amplification artefacts (3), an antisense artefact (1), and an uncharacterized splice variant supported by EST evidence (1) (Table 2.3). Of the previously unambiguous tags, assignments were made to the mitochondrial genome (5), PCR amplification artefacts (3), antisense artefacts (2), probable antisense transcripts (2), an artefact resulting from tag capture near an internal polyadenylation sequence (1), and uncharacterized splice variants supported by EST evidence (3) (Table 2.4).

Table 2.2: Improvement of tag to gene mapping with addition of extra nucleotides

Tag Length = 16bp    14bp    16bp    % change
unmapped               80     167    +108.7%
ambiguous             459     165     -64.0%
unambiguous           409     616     +50.6%

Tag Length = 15bp    14bp    15bp    % change
unmapped              299     433     +44.8%
ambiguous            1069     534     -50.0%
unambiguous          1042    1443     +38.5%

Data was compiled from the results of the XBP-SAGE algorithm run on a public SAGE library from colonic epithelium (GEO accession: GSM728). The table includes tags where an additional one or two nucleotides were assigned and where the tag count exceeds one.
Table 2.3: Putative source of 14bp tags with ambiguous mappings that no longer map when lengthened to 16bp

Tag           Count  LongSAGE  Mapping After
ATTTGAGAAGCC  333    yes       mitochondrial genome
GGGGCAGGGCCC  46     yes       EIF5A antisense artefact
GGGAAGCAGATT  32     yes       mitochondrial genome
GGGGTCAGGGGT  22     yes       mitochondrial genome
CGCTGGTTCCAC  21     no        unknown (probable miscall)
GCCAACCTCCTA  20     yes       mitochondrial genome
AGCCACCGCGTA  11     no        unknown (probable miscall)
GCCATCCTCCAG  9      no        maps to ESTs from Hs.591502, possible splice variant
CTACTGCACTCG  9      no        artefact of CCACTGCACT (353 tags)
GTGAAACTCTGC  8      no        artefact of GTGAAACCCT (190 tags)
ATGGAGACTTCG  6      no        unknown (probable miscall)
CCTGTAATACCG  6      no        artefact of CCTGTAAT^CC (472 tags)

Mapping Before (entries in row order): CT: RAD23B; TG: Hs.656343; AG: LOC729591; AT: GFAP; CA: CRELD1; GG: F11R; AC: Hs.605638; TG: PYGB; AG: RPL11; GG: MYO1G; TG: LTBP3; AC: LOC286058; AG: EP400NL; 31 different genes; TG: SLC39A13; TT: RNF26; 7 different genes; 9 different genes; CA: CS; GG: Hs.626951; TG: ATP2B1; 16 different genes

Columns correspond to the elongated tag sequence, the observed count, whether the elongated sequence is supported by LongSAGE data, and the putative mapping(s) before and after the addition of extra nucleotides.
Table 2.4: Putative source of 14bp tags with unambiguous mappings that no longer map when lengthened to 16bp

Tag           Count  LongSAGE  Mapping Before   Mapping After
AAAACATTCTCC  170    yes       AC: Hs.707482    mitochondrial genome
CCTCAGGATACT  114    yes       GA: TGFA         mitochondrial genome
AGGTGGCAAGAA  56     yes       GG: LOC644075    mitochondrial genome
CTCCACCCGAAA  48     yes       GG: TFF3         possible variant of TFF3
CATTTGTAATAA  48     yes       AC: Hs.689535    mitochondrial genome
ACAGGGTGACCT  26     no        CC: EDF1         unknown (probable miscall)
GCCATCCCCTTA  25     yes       CC: Hs.622702    mitochondrial genome
AGAACCTTCCAG  21     yes       AA: HLA-A        possible variant of HLA-A
ACAAAAACTAGG  20     no        GC: Hs.703814    unknown (probable miscall)
AATCACAAATAA  17     yes       AC: CTAGE5       mispredicted 3’-UTR of CEACAM7
ACAAACCCCCAC  15     yes       TC: Hs.704518    ATP1B1 antisense transcript
GGCCCTGCAGGG  14     no        GA: SIRT6        unknown (probable miscall)
ATGATGGCACCT  12     yes       TA: LOC387647    TSPAN8 antisense transcript
AATGAGAAGGTA  11     yes       AA: PAF1         B4GALT1 internal poly-A tract
TTCCTATTAAGC  8      no        TT: NOM1         artefact linker TCCCTATTAA
GTAGCGCACGCA  7      no        CC: GLTP         unknown (probable miscall)
TCTGGTTTGTCT  7      yes       GG: WDR82        TMSB10 antisense transcript
TGGCTACTTAGT  5      yes       CG: KPNA2        MALL antisense artefact
CCCTGACTGCTG  5      no        TC: RDH5         unknown (probable miscall)
ATTTGAGAACCT  5      no        AT: REV1         artefact of ATTTGAGAAG (333 tags)
GCACAGGTCACC  5      no        CA: EFCAB1       artefact of GCCCAGGTCA (547 tags)

Columns correspond to the elongated tag sequence, the observed count, whether the elongated sequence is supported by LongSAGE data, and the putative mapping(s) before and after the addition of extra nucleotides.

There were 80 tags that were initially unmapped. A comprehensive mapping was attempted for those with counts ≥5. The additional nucleotides were supported by existing LongSAGE data in 17/25 cases. In the majority of cases (20/25), a source of the tag could be determined.
Assignments were made to the mitochondrial genome (9), PCR amplification artefacts (5), an antisense artefact (1), an antisense transcript (1), a second position tag likely due to an incomplete AE restriction digest (1), and several uncharacterized splice variants supported by EST evidence (3) (Table 2.5).

Table 2.5: Putative source of 14bp tags that fail to map

Tag           Count  LongSAGE  Mapping
CTCATAAGGAAA  107    yes       mitochondrial genome
TTTAACGGCCGC  86     yes       mitochondrial genome
AGACCCACAACA  52     yes       mitochondrial genome
GTAAGTGTACTG  47     yes       mitochondrial genome
TCCCGTACATCA  38     no        artefact linker TCCCCGTACA
ACTTTCCAAAAA  32     yes       mitochondrial genome
GCTAGGTTTATA  28     yes       mitochondrial genome
CTTACAAGCAAG  21     yes       mitochondrial genome
CCTGTCTGCCAG  14     no        EP400NL (probable expressed pseudogene)
TCCCTATAAGCC  13     no        artefact linker TCCCTATTAA
TCACCCACACCA  12     yes       RPL23 antisense transcript
ACCCCTAACAGG  8      yes       mitochondrial genome
TTCTTGTGGCGC  8      yes       RPS11 antisense artefact
AAACATCCTATC  7      yes       mitochondrial genome
GCCGGAGGGCCC  6      no        unknown
TCACCGTACATC  6      no        artefact linker TCCCCGTACA
GCCCCATTTTCC  6      no        unknown
GGGGTCCCATTC  5      no        unknown
CGGAACACCGTG  5      yes       EZR splice variant
GGAGGCGCTCAC  5      yes       unknown
CTAACTAGTTAC  5      yes       single genome match, predicted ncRNA (EnsEMBL:AL163011.3)
AGCTGTCCCCAC  5      yes       COX2 second position tag
CTGTAAAAAAAA  5      yes       unknown
TCCCTATTGAGC  5      no        artefact linker TCCCTATTAA
CCCCCTGCATCA  5      yes       artefact of CCCCCTGGAT (140 tags)

Columns correspond to the elongated tag sequence, the observed count, whether the elongated sequence is supported by LongSAGE data, and the putative mapping after the addition of extra nucleotides.

2.4 DISCUSSION

Identifying the correct source of a tag is the most crucial element of SAGE analysis. The power of the technique is diminished if a tag is assigned incorrectly or cannot be confidently determined.
This chapter demonstrates that the tagging enzyme usually releases a tag one or two nucleotides longer than the 14bp assumed during standard data processing. Although sequence has an effect on the size of the tag released, this accounts for a very small amount of the variation observed. The determination of extra nucleotides is complicated by the ambiguity introduced by the ligation of two tags head-to-head to form ditags, a procedure critical to control for PCR amplification bias. The XBP-SAGE algorithm is designed to model these effects and determine a high-confidence list of extended tags. As a result, cases where a tag matches to more than one transcript or gene are reduced, artefacts are easier to identify, and the source of novel tags can be identified with increased sensitivity (e.g. using a whole-genome search). Although the algorithm only estimates the maximum likelihood solution for the additional nucleotides, the performance of XBP-SAGE on a test dataset suggests the typical solution is of excellent quality (error rate <3%) and the small number of errors that do occur usually correspond to tags with very low counts. These would typically be of little interest in a SAGE study. Moreover, the enhanced mapping on a real dataset demonstrates that these errors are far outweighed by the ambiguities that are resolved and the correction of erroneous mappings.

The use of the XBP-SAGE algorithm will increase the overall accuracy of a list of genes identified by SAGE, reducing the error introduced in downstream, large-scale bioinformatic analyses (e.g. GO enrichment, pathway analysis, etc.). The algorithm is also helpful in confirming the identity of high-confidence targets identified by statistical analyses before expending resources on further experiments.

BIBLIOGRAPHY

Altschul, S. F., T. L. Madden, et al. (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res 25(17): 3389-402. Barrett, T., D. B.
Troup, et al. (2007). "NCBI GEO: mining tens of millions of expression profiles--database and tools update." Nucleic Acids Res 35(Database issue): D760-5. Boon, K., E. C. Osorio, et al. (2002). "An anatomy of normal and malignant gene expression." Proc Natl Acad Sci U S A 99(17): 11287-92. Flicek, P., B. L. Aken, et al. (2008). "Ensembl 2008." Nucleic Acids Res 36(Database issue): D707-14. Goodsell, D. S. and R. E. Dickerson (1994). "Bending and curvature calculations in B-DNA." Nucleic Acids Res 22(24): 5497-503. Gowda, M., C. Jantasuriyarat, et al. (2004). "Robust-LongSAGE (RL-SAGE): a substantially improved LongSAGE method for gene discovery and transcriptome analysis." Plant Physiol 134(3): 890-7. Johnson, M., I. Zaretskaya, et al. (2008). "NCBI BLAST: a better web interface." Nucleic Acids Res 36(Web Server issue): W5-9. Kirkpatrick, S., C. D. Gelatt, Jr., et al. (1983). "Optimization by Simulated Annealing." Science 220(4598): 671-680. Lash, A. E., C. M. Tolstoshev, et al. (2000). "SAGEmap: a public gene expression resource." Genome Res 10(7): 1051-60. Matsumura, H., S. Reich, et al. (2003). "Gene expression analysis of plant host-pathogen interactions by SuperSAGE." Proc Natl Acad Sci U S A 100(26): 15718-23. Munteanu, M. G., K. Vlahovicek, et al. (1998). "Rod models of DNA: sequence-dependent anisotropic elastic modelling of local bending phenomena." Trends Biochem Sci 23(9): 341-7. Nethercote, N. and J. Seward (2007). Valgrind: a framework for heavyweight dynamic binary instrumentation. Proceedings of ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, San Diego, California. Pruitt, K. D., T. Tatusova, et al. (2007). "NCBI reference sequences (RefSeq): a curated nonredundant sequence database of genomes, transcripts and proteins." Nucleic Acids Res 35(Database issue): D61-5. R Development Core Team (2007). R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing. Reich, Z., R. 
Ghirlando, et al. (1992). "Nucleic acids packaging processes: effects of adenine 76  tracts and sequence-dependent curvature." J Biomol Struct Dyn 9(6): 1097-109. Saha, S., A. B. Sparks, et al. (2002). "Using the transcriptome to annotate the genome." Nat Biotechnol 20(5): 508-12. van Rossum, G. (2007). "Python language website." from http://www.python.org. Velculescu, V. E., L. Zhang, et al. (1995). "Serial analysis of gene expression." Science 270(5235): 484-7. Wall, L., T. Christiansen, et al. (2000). Programming Perl. Sebastopol, California, O'Reilly. Wheeler, D. L., T. Barrett, et al. (2008). "Database resources of the National Center for Biotechnology Information." Nucleic Acids Res 36(Database issue): D13-21. Zuyderduyn, S. D. (2004). "Bio::SAGE::DataProcessing perl module." from http://search.cpan.org/dist/Bio-SAGE-DataProcessing.  77  CHAPTER III STATISTICAL INFERENCE FROM SAGE USING A POISSON MIXTURE MODEL†  3.1  INTRODUCTION As a counting technology, SAGE produces profiles consisting of a digital output that is  quantitative in nature. For example, a statement can be made with reasonable certainty that a SAGE tag observed 30 times in a library of 100,000 tags corresponds to a transcript that comprises 0.03% of the total transcriptome; the same statement cannot be made reliably with analog values, like those obtained from a microarray. Accordingly, a reliable statistical model should account for the discrete, count-based nature of SAGE observations. Statistical tests that incorporate a continuous probability distribution (e.g. the normal distribution assumed by Student’s t-test) are not appropriate. These tests require tag counts be normalized by division with the total library size to convert the data to the same continuous scale, discarding a statistically informative facet of the data. 
The sampling of SAGE tags can be modeled by the binomial distribution, which describes the probability of observing a given number of successes in a series of Bernoulli trials. Here, the library size corresponds to the number of trials and the count of a particular tag is the number of successful trial outcomes. As the number of trials increases and the probability of success becomes small, the binomial distribution approaches the Poisson distribution. This is the case for SAGE (since the tag counts are small relative to a large library size), so the form of the Poisson and binomial distributions is essentially the same. A fortunate characteristic of both of these distributions is that, for a given library size, they are a function of a single parameter only, since the variance in the observed data is directly calculable from the mean.

†  A version of this chapter has been published. Zuyderduyn, S.D. (2007) Statistical analysis and significance testing of serial analysis of gene expression data using a Poisson mixture model. BMC Bioinformatics, 8:282.

However, in practice, the variance of SAGE data is often larger than can be explained by sampling alone. Several authors have attributed this effect, termed “overdispersion”, to a latent biological variability (Baggerly, 2003; Baggerly, 2004; Lu, 2005). Baggerly et al. referred to this as “between”-library variability, as opposed to “within”-library variability caused by sampling (Baggerly, 2003). Factors that could contribute to this variability are numerous, for example: sample preparation or quality, artefacts intrinsic to the library construction protocol, differences in gene transcription due to environment, and the intrinsic stability or regulatory complexity of transcription at a particular locus. Ignoring this effect will adversely affect statistical analysis, because unmodelled additional variance results in overstated significance. Procedures for using hierarchical models which incorporate a continuous prior distribution to model the excess variance have been presented for both the binomial (viz.
beta-binomial using logistic regression (Baggerly, 2004) or tw-test (Baggerly, 2003)) and Poisson (viz. negative binomial a.k.a. hierarchical gamma-Poisson using log-linear regression (Lu, 2005)) distributions. Attempts to use the log-normal and inverse-Gaussian as prior distributions (both of these have longer tails) did not show an appreciable improvement and are computationally difficult to fit (data not shown).

Here, it is argued that the excess variation is due to a mixing of two or more distinct Poisson (or binomial) components, and this mixing is the predominant source of total variation. This assumption corresponds to a finite mixture model, which has found wide applicability in several fields (for a general introduction, see McLachlan and Peel, 2000). To illustrate, consider a tag from ten SAGE libraries of equal size (e.g. 100,000 tags) whose observed counts are realizations of an expression of 0.0003 in half of the libraries and of 0.0004 in the other half (i.e. expected counts of 30 and 40). As a result, the probability distribution of observing a particular tag count will be a combination of these two components (Figure 3.1). Note the similarity between the shapes of the probability distributions estimated from a fitted negative binomial (which assumes sampling variability drawn from a latent biological variability) and a Poisson mixture model (which assumes a set of independent components, each having sampling variability only).

[Figure 3.1: plot of probability (y-axis) against tag count (x-axis). Legend: Poisson; Negative binomial; Poisson mixture; actual mixture components.]

Figure 3.1: Probability density of several models applied to data generated from two Poisson components. 10 observations were randomly drawn from each of two Poisson distributions, one with a mean of 30, the other 40. The values drawn from the first component were [ 40, 34, 37, 28, 31, 21, 41, 27, 34, 27 ] and the values drawn from the second component were [ 36, 42, 26, 57, 43, 37, 38, 39, 35, 35 ]. The probability densities are shown for a single Poisson distribution, the negative binomial distribution, and a two-component Poisson mixture distribution using maximum likelihood estimates (see Methods). The probability densities of the individual Poisson components from which the data were actually drawn are also shown. The individual observations are represented by triangles at the bottom of the plot.

If the Poisson mixture model is an accurate foundation to explain SAGE observations, it is attractive for several reasons. First, this approach does not rely on a vague and continuous prior distribution to explain additional variance. Rather, the model asserts that a gene’s expression level will take on one of a number of distinct states. Second, overdispersed models applied to SAGE data tend to show a wide range of excess variation; in many cases, the excess is far greater than can be attributed to counting. This is a troubling prospect for studies that utilize a limited number of libraries (e.g. pair-wise comparisons), since the observed count may differ wildly from the underlying expression. If a mixture model provides an improved fit to SAGE data, this concern would be assuaged. Finally, mixture models, by nature, allow for the concept of subsets (or latent classes) in the expression values of each tag. Dysregulation of genes in disease processes such as cancer is often observed in only a proportion of profiled samples, and these subsets will be naturally identified during model fitting. This property can also be utilized to identify sets of co-expressed genes.
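
Two claims in this introduction can be checked numerically with a short sketch: that the binomial and Poisson probabilities are nearly identical at SAGE scale, and that pooling counts from two Poisson components gives a sample variance well above the mean. The observations are those listed in the Figure 3.1 caption; the function and variable names are illustrative only and are not taken from the thesis code.

```python
import math

def log_poisson_pmf(k, lam):
    """Log of the Poisson probability of observing count k given mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def log_binomial_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

# Observations listed in the Figure 3.1 caption.
component_1 = [40, 34, 37, 28, 31, 21, 41, 27, 34, 27]  # drawn with mean 30
component_2 = [36, 42, 26, 57, 43, 37, 38, 39, 35, 35]  # drawn with mean 40
counts = component_1 + component_2

mean = sum(counts) / len(counts)
variance = sum((y - mean) ** 2 for y in counts) / (len(counts) - 1)
dispersion_index = variance / mean  # close to 1 under pure Poisson sampling

# Binomial versus Poisson for the example used in the text:
# a tag at expression 0.0003 in a library of 100,000 tags, observed 30 times.
library_size, expression, k = 100_000, 0.0003, 30
binom = math.exp(log_binomial_pmf(k, library_size, expression))
pois = math.exp(log_poisson_pmf(k, library_size * expression))

print(f"mean={mean:.1f} variance={variance:.1f} index={dispersion_index:.2f}")
print(f"P(Y=30): binomial={binom:.6f} Poisson={pois:.6f}")
```

A dispersion index well above 1 for the pooled data is what an overdispersed model would pick up, even though each component on its own has only sampling variability.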
3.2  MATERIALS AND METHODS

3.2.1 Test datasets
Test datasets were obtained from the Gene Expression Omnibus (GEO) (Barrett, 2007) and reflect a range of cancer studies, including malignancies of the skin (Cornelissen, 2003; van Ruissen, 2002; Weeraratna, 2004), breast (Porter, 2003; Porter, 2001), blood (Lee, 2006) and brain (Boon, 2004) (Table 3.1). In the case of breast and skin data, libraries from a combination of studies were used. Datasets were filtered to remove tags expressed at a mean of less than 100 tags per million.

3.2.2 Model fitting
The open-source statistical software package R was used to perform all calculations (R Development Core Team, 2007). For each of the models, let $Y_i$ be the observed tag count in library $i$, $n_i$ be the total number of tags in library $i$, and $N$ be the total number of libraries. Also, let $x_i$ be the vector of explanatory variables (e.g. normal=0 and cancer=1) associated with library $i$, and $\beta$ be the vector of coefficients. Example R code used to fit each model is included in Appendix II.
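
The filtering step described in Section 3.2.1 could be sketched as follows. This is an illustration under stated assumptions, not the thesis code (which is in R): it assumes the mean is taken over per-library normalized values, and the tag sequences and counts are hypothetical. The three library sizes are the first three AML libraries listed in Table 3.1.

```python
def tags_per_million(count, library_size):
    """Normalize a raw tag count to tags per million."""
    return count / library_size * 1_000_000

def filter_low_expression(tag_counts, library_sizes, min_tpm=100.0):
    """Keep tags whose mean normalized expression is at least min_tpm.

    tag_counts maps a tag sequence to a list of counts, one per library.
    Assumption: the mean is computed over per-library normalized values.
    """
    kept = {}
    for tag, counts in tag_counts.items():
        tpm = [tags_per_million(c, n) for c, n in zip(counts, library_sizes)]
        if sum(tpm) / len(tpm) >= min_tpm:
            kept[tag] = counts
    return kept

library_sizes = [100003, 46329, 83577]  # GSM73364, GSM73365, GSM73366
tag_counts = {                          # hypothetical tags and counts
    "AAAAAAAAAA": [30, 12, 25],
    "CCCCCCCCCC": [2, 0, 1],
}
filtered = filter_low_expression(tag_counts, library_sizes)
print(sorted(filtered))
```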
Table 3.1: Datasets used to evaluate models

Accession  Library Size  Description

Dataset: acute myeloid leukemia (AML), purified patient samples
GSM73364   100003        inv(16)
GSM73365   46329         inv(16)
GSM73366   83577         inv(16)
GSM73367   77604         inv(16)
GSM73368   50604         inv(16)
GSM73369   80131         t(8;21)
GSM73370   48397         t(8;21)
GSM73371   50067         t(8;21)
GSM73372   68267         t(8;21)
GSM73373   48278         t(8;21)
GSM73374   54669         t(9;11)^1
GSM73375   67803         t(9;11)^1
GSM73376   55209         t(9;11)^1
GSM73377   51254         t(9;11)^1
GSM73378   83102         t(9;11)^2
GSM73379   20712         t(9;11)^2
GSM73380   84229         t(9;11)^2
GSM73381   72483         t(15;17)
GSM73382   48733         t(15;17)
GSM73383   48394         t(15;17)
GSM73384   66647         t(15;17)
GSM73385   57563         t(15;17)

Dataset: developmental stages of breast cancer, bulk tissue samples
GSM691     7165          normal
GSM692     12142         normal
GSM14756   58181         normal
GSM14801   59327         normal
GSM1731    43902         DCIS
GSM2389    58801         DCIS
GSM14800   50875         DCIS
GSM1733    70099         invasive^3
GSM670     40223         invasive
GSM672     67386         invasive
GSM2382    65045         invasive
GSM2383    61480         invasive
GSM14797   21951         invasive

Dataset: melanoma, bulk tissue samples
GSM1123    15167         normal
GSM1124    11563         normal
GSM1125    10080         normal
GSM3242    37292         normal
GSM14751   26032         cancer
GSM14775   41338         cancer
GSM14778   11399         cancer

Dataset: brain tumour, bulk tissue samples
GSM676     94876         normal
GSM695     58826         normal
GSM761     51280         normal
GSM763     63208         normal
GSM786     77968         normal
GSM7498    31538         normal
GSM14799   308589        normal^4
GSM31931   41773         normal^5
GSM31935   305546        normal^4
GSM697     52479         astrocytoma
GSM698     77004         astrocytoma
GSM699     28159         astrocytoma
GSM715     17576         astrocytoma
GSM1732    81495         astrocytoma
GSM2443    80265         astrocytoma
GSM2451    38634         astrocytoma
GSM2578    69513         astrocytoma
GSM14737   105764        astrocytoma
GSM14739   88568         astrocytoma
GSM14763   106982        astrocytoma
GSM14765   102439        astrocytoma
GSM14766   107344        astrocytoma
GSM14773   118733        astrocytoma
GSM1497    46928         ependymoma
GSM1735    74499         ependymoma
GSM2384    52934         ependymoma
GSM2408    52659         ependymoma
GSM14740   122690        ependymoma
GSM14741   120431        ependymoma
GSM14762   68614         ependymoma
GSM14776   75379         ependymoma
GSM14786   84073         ependymoma
GSM793     56871         ependymoma
GSM14767   100600        glioblastoma
GSM14768   102322        glioblastoma
GSM14769   99099         glioblastoma
GSM696     70087         glioblastoma
GSM745     60069         glioblastoma
GSM765     61886         glioblastoma
GSM1498    62675         glioblastoma
GSM690     38933         medulloblastoma
GSM693     19572         medulloblastoma
GSM14731   52645         medulloblastoma
GSM14732   48451         medulloblastoma
GSM14733   43068         medulloblastoma
GSM14734   69971         medulloblastoma
GSM14761   85376         medulloblastoma
GSM14772   60454         medulloblastoma
GSM14774   85984         medulloblastoma
GSM14779   72318         medulloblastoma
GSM14781   83671         medulloblastoma
GSM14782   68392         medulloblastoma
GSM14783   61853         medulloblastoma
GSM14787   57469         medulloblastoma
GSM14788   74295         medulloblastoma
GSM14791   32570         medulloblastoma
GSM14794   74612         medulloblastoma
GSM14795   67404         medulloblastoma
GSM689     28133         oligodendroglioma
GSM14742   32442         oligodendroglioma

Accessions, sizes, and descriptions of the libraries included in the study of the Poisson mixture model. Superscripts denote: 1. de novo translocation, 2. treatment induced translocation, 3. small DCIS component, 4. GSM31935 is a shortened version of LongSAGE library GSM14799 (not included in analysis), 5. shortened LongSAGE library.

3.2.2.1 Log-linear (Poisson) regression model
The log-linear model assumes that the observed tag counts are distributed as

$$Y_i \sim \mathrm{Poisson}(\mu_i), \qquad \mu_i = n_i p_i$$

where $p_i$ is the actual expression in terms of the proportion of all expressed tags. Here, the unconditional mean and variance are $E(Y_i) = \mathrm{Var}(Y_i) = \mu_i$. Using the log link function, which linearizes the relationship between the dependent variable and the predictor(s), we obtain the linear equation

$$\log(\mu_i) = \log(n_i) + x_i\beta, \qquad p_i = \exp(x_i\beta)$$

Using iteratively reweighted least-squares (IRLS), the value(s) for the coefficient(s) $\beta$ are estimated. The stats library included with R is used to fit the log-linear model.
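
For the common case of one binary covariate (e.g. normal=0, cancer=1), the log-linear model above has a closed-form maximum likelihood solution: the fitted proportion in each group is the pooled count divided by the pooled library size. The sketch below uses that shortcut in place of IRLS, so it is not the procedure used in the thesis, and the counts are hypothetical. The function names are illustrative only.

```python
import math

def fit_two_group_loglinear(counts, sizes, groups):
    """Closed-form MLE of log(mu_i) = log(n_i) + b0 + b1 * x_i, x_i in {0, 1}.

    Requires a non-zero pooled count in each group, otherwise log(0) is hit.
    """
    pooled = {0: [0, 0], 1: [0, 0]}  # group -> [total count, total library size]
    for y, n, x in zip(counts, sizes, groups):
        pooled[x][0] += y
        pooled[x][1] += n
    p0 = pooled[0][0] / pooled[0][1]
    p1 = pooled[1][0] / pooled[1][1]
    b0 = math.log(p0)
    b1 = math.log(p1) - b0
    return b0, b1

def poisson_loglik(counts, sizes, groups, b0, b1):
    """Poisson log-likelihood with log(n_i) as an offset."""
    total = 0.0
    for y, n, x in zip(counts, sizes, groups):
        mu = n * math.exp(b0 + b1 * x)
        total += y * math.log(mu) - mu - math.lgamma(y + 1)
    return total

# Hypothetical tag counts: three normal and three cancer libraries.
counts = [30, 28, 35, 60, 55, 70]
sizes = [100_000] * 6
groups = [0, 0, 0, 1, 1, 1]

b0, b1 = fit_two_group_loglinear(counts, sizes, groups)
print(f"baseline proportion={math.exp(b0):.6f} fold change={math.exp(b1):.3f}")
```

The log(0) restriction noted in the docstring is the same difficulty the Results describe for groups in which every tag count is zero.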
3.2.2.2 Overdispersed log-linear regression model
In contrast to a canonical log-linear model, we assume the actual expression is distributed as

$$\theta_i \sim \mathrm{Gamma}(\mu_i, 1/\phi), \qquad \mu_i = n_i p_i$$

where, as in Section 3.2.2.1, $p_i$ is the actual expression in terms of the proportion of all expressed tags. Here, the unconditional mean and variance are $E(\theta_i) = \mu_i$ and $\mathrm{Var}(\theta_i) = \mu_i^2\phi$. Since we are now sampling from this latent Gamma distribution, the observed tag counts are conditional on this underlying expression and are distributed as

$$Y_i \mid \theta_i \sim \mathrm{Poisson}(\theta_i)$$

Now, the unconditional mean and variance are $E(Y_i) = \mu_i$ and $\mathrm{Var}(Y_i) = \mu_i(1 + \mu_i\phi)$. As above, using the log link function we obtain the linear equation

$$\log(\mu_i) = \log(n_i) + x_i\beta, \qquad p_i = \exp(x_i\beta)$$

Here, maximum likelihood estimates of the coefficient(s) $\beta$ and the overdispersion parameter ($\phi$) can be obtained. The MASS library for R is used to fit the overdispersed log-linear model (Venables, 2002). A more complete discussion of this model and its application to SAGE, including significance testing, is given by Lu et al. (Lu, 2005).

3.2.2.3 Poisson mixture model
Like the canonical log-linear regression model, we assume the observed tag counts are Poisson distributed. However, the counts are conditional on the choice of a Poisson-distributed component, such that

$$Y_i \mid k \sim \mathrm{Poisson}(\mu_{ik}), \qquad \mu_{ik} = n_i p_{ik}$$

where the component $k = 1, 2, \ldots, K$ and $p_{ik}$ is the actual expression for component $k$ in terms of the proportion of all expressed tags. The posterior probability that an observed tag count belongs to a component $k$ is given by

$$P(k \mid Y_i, \psi) = \frac{\pi_k f(Y_i \mid \mu_{ik})}{\sum_{j=1}^{K} \pi_j f(Y_i \mid \mu_{ij})}$$

where $\psi$ is the parameter vector containing the component means $(\theta_1, \ldots, \theta_K)$ and mixing coefficients $(\pi_1, \ldots, \pi_{K-1})$, and $f(\cdot)$ is the Poisson probability mass function. To fit the model, one must estimate the values in $\psi$. This can be done by maximum likelihood estimation (MLE) using the EM algorithm (Dempster, 1977).
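
The EM iteration for this model can be sketched in a few lines. This is a minimal illustration, assuming a single expression proportion per component shared across libraries (so that the mean is the library size times that proportion), a fixed number of iterations, and user-supplied starting values. It is not the flexmix implementation, and all names are hypothetical.

```python
import math

def log_poisson_pmf(y, mu):
    """Log Poisson probability of count y given mean mu."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def fit_poisson_mixture(counts, sizes, init_props, n_iter=200):
    """EM for a K-component Poisson mixture with mu_ik = n_i * p_k.

    init_props holds starting expression proportions p_k, one per component.
    Returns (mixing coefficients, proportions, posteriors tau[i][k]).
    Assumes every component keeps a positive mean throughout.
    """
    n_comp = len(init_props)
    props = list(init_props)
    mix = [1.0 / n_comp] * n_comp
    tau = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each library.
        tau = []
        for y, n in zip(counts, sizes):
            log_w = [math.log(mix[k]) + log_poisson_pmf(y, n * props[k])
                     for k in range(n_comp)]
            top = max(log_w)
            w = [math.exp(v - top) for v in log_w]
            total = sum(w)
            tau.append([v / total for v in w])
        # M-step: weighted updates of mixing coefficients and proportions.
        for k in range(n_comp):
            weight = sum(t[k] for t in tau)
            if weight < 1e-12:
                continue  # component is effectively empty; leave it unchanged
            mix[k] = weight / len(counts)
            props[k] = (sum(t[k] * y for t, y in zip(tau, counts))
                        / sum(t[k] * n for t, n in zip(tau, sizes)))
    return mix, props, tau

# Counts from the Figure 3.1 caption, treated as 20 libraries of 100,000 tags.
counts = [40, 34, 37, 28, 31, 21, 41, 27, 34, 27,
          36, 42, 26, 57, 43, 37, 38, 39, 35, 35]
sizes = [100_000] * len(counts)
mix, props, tau = fit_poisson_mixture(counts, sizes, init_props=[0.0002, 0.0005])
print("mixing coefficients:", [round(m, 3) for m in mix])
print("component means:", [round(p * 100_000, 1) for p in props])
```

EM only finds a local maximum, so in practice one would try several starting values and choose the number of components with a criterion such as AIC or BIC.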
The flexmix library for R uses the EM algorithm to fit a variety of finite mixture models (Leisch, 2004).

3.3  RESULTS

3.3.1 Goodness of fit
In order to evaluate the efficacy of a mixture model approach, the goodness of fit of this and previously described models was compared on 15 sets of biological replicates from publicly available SAGE data (see Section 3.2.1). Goodness of fit was assessed for: 1) the canonical log-linear (Poisson) model, 2) negative binomial (i.e. hierarchical gamma-Poisson or overdispersed log-linear) model, and 3) k-component Poisson mixture model (see Section 3.2.2). Since maximum likelihood estimation (MLE) is used to fit each of these models, the log-likelihood was the basis for assessing relative goodness of fit. A comparison of the Akaike information criterion (AIC) (Akaike, 1974) and Bayesian information criterion (BIC) (Schwarz, 1978) (both of which use the log-likelihood and a term to penalize a model for estimating a larger number of parameters) was performed on each of the datasets (Table 3.2). As expected, the canonical Poisson model, which does not account for excess variance, performs poorly in all cases. The Poisson mixture model matches or outperforms the negative binomial model in every case except the mean BIC for the medulloblastoma libraries. The competitiveness of the negative binomial model is perhaps not surprising since a comparison of the fit of these two models to simulated data indicates that the negative binomial can often fit better to data generated from a two-component Poisson mixture. This becomes more problematic as the component means draw closer (data not shown; Figure 3.1 is a good example). However, several hypotheses can be tested to further strengthen the case for the mixture model approach. These are considered in turn.
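
For reference, the two criteria can be written out as below. The formulas are the standard definitions. The parameter count used for a K-component mixture (K means plus K-1 free mixing coefficients) is an assumption about how the models were scored, and the log-likelihood values are hypothetical.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

def mixture_param_count(n_components):
    """K component means plus K-1 free mixing coefficients (assumed)."""
    return 2 * n_components - 1

# Hypothetical log-likelihoods for one tag observed in 10 libraries.
n_libraries = 10
fits = {
    "Poisson": (-45.0, 1),                          # one mean
    "Negbin": (-33.0, 2),                           # mean and overdispersion
    "Mixture (K=2)": (-31.5, mixture_param_count(2)),
}
for name, (loglik, k) in fits.items():
    print(f"{name}: AIC={aic(loglik, k):.1f} BIC={bic(loglik, k, n_libraries):.1f}")
```

Because BIC multiplies the parameter count by ln(n), it penalizes the extra mixture parameters more heavily than AIC once more than about seven libraries are available.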
Table 3.2: Comparison of model fits to a single group of biological replicates

                                       mean AIC                  mean BIC
Dataset             N   tags  k    Poisson  Negbin  Mixture  Poisson  Negbin  Mixture
BRAIN
  astrocytoma       14  1141  2.6  238.2    105.2   103.6    238.9    106.5   106.3
  ependymoma        10  1205  2.3  152.4    80.9    75.0     152.7    81.5    76.1
  glioblastoma      7   1197  2.3  139.5    57.6    53.0     139.4    57.5    52.8
  medulloblastoma   18  1045  2.7  280.6    128.7   128.7    281.5    130.5   132.6
  normal            8   1099  2.4  156.8    68.0    59.8     156.8    68.2    60.1
AML
  inv(16)           5   900   1.7  68.9     39.3    37.6     68.5     38.5    36.7
  t(8;21)           5   1037  1.3  52.3     34.1    33.5     51.9     33.3    32.9
  t(15;17)          5   709   1.8  127.7    46.0    38.9     127.3    45.2    37.9
  t(9;11) de novo   4   954   1.8  58.5     34.6    30.1     57.9     33.4    28.5
  t(9;11) treatment 3   1061  1.5  42.9     32.0    20.9     42.0     30.2    19.1
BREAST
  normal            6   1259  1.8  71.6     43.9    41.6     71.4     43.5    41.0
  DCIS              4   598   1.3  25.5     24.0    21.0     24.9     22.8    20.1
  invasive          3   1069  2.0  60.4     27.8    22.8     59.5     26.0    20.1
SKIN
  normal            4   1015  1.6  33.8     24.6    22.2     33.2     23.4    20.9
  melanoma          3   992   1.8  38.2     24.0    19.6     37.3     22.2    17.4

SAGE tag counts from fifteen sets of biological replicates were fit to log-linear (Poisson), negative binomial (overdispersed log-linear), and Poisson mixture models. The table contains the number of replicates (N), tags tested, and mean number of mixture components (k). For each model, mean goodness of fit scores calculated using Akaike's Information Criterion (AIC) and Bayesian Information Criterion (BIC) are shown. For both scores, a lower value indicates a better fit.

3.3.2 Tags with ambiguous mappings are represented by a greater number of components
Consider an idealized situation where the expression of a gene can take on one of two states (and can therefore be modelled by a two-component Poisson mixture). A significant proportion of SAGE tags are ambiguous (correspond to more than one gene) and, under the idealized model, would result in tag counts that are modelled by $2^g$ components (where $g$ is the number of expressed genes the tag corresponds to), since each of the $g$ genes independently occupies one of its two states.
Therefore, the number of components in the mixture should be higher for ambiguous tags. Simply partitioning the data into ambiguous and unambiguous tags and comparing the number of components is unlikely to be informative since, for any given ambiguous tag, it is not known how many of the possible genes are actually expressed. However, two normal brain libraries used in this study were generated using LongSAGE (GSM31931 and GSM31935), which provides 17 base pairs of information rather than 10. The tag sequences in these libraries were shortened before inclusion in the normal brain dataset used in the previous section. However, by comparing the shortened tag list to the original library, tags that actually correspond to two or more LongSAGE tag sequences (and presumably represent different transcripts) were identified. Tag counts of one or two were considered artefacts of PCR amplification or sequencing and were not used in this determination. The number of ambiguous and unambiguous tags was tallied for each estimated number of components (Table 3.3). Ambiguous tags are represented more highly in the set of model fits that consist of a larger number of components. This effect, which is statistically significant, is consistent with the mixture model hypothesis.

Table 3.3: Mean number of mixture model components

Library    k  Unambiguous  Ambiguous  Significance
GSM31931   1  93           0          p<2.2E-16 (χ²=134.1; df=4)
           2  405          15
           3  210          32
           4  27           12
           5  5            12
GSM31935   1  74           34         p=1.8E-6 (χ²=32.1; df=4)
           2  317          246
           3  149          171
           4  17           30
           5  3            14

Expressed ambiguous and unambiguous 10 base pair tags for two LongSAGE libraries were distinguished based on the number of 17 base pair sequences that give rise to the same tag. The tags in each of these two groups were binned according to the number of estimated mixture components. The χ² statistic was used to test the null hypothesis that these two groups are equivalent.
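
The test statistics reported in Table 3.3 can be recomputed from the tabulated counts with a plain Pearson χ² calculation, sketched below. The closed-form tail probability holds only for 4 degrees of freedom, and the function names are illustrative.

```python
import math

def chi_square_statistic(table):
    """Pearson chi-square statistic for a table given as rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

def chi_square_sf_df4(x):
    """Upper-tail probability of a chi-square variable with exactly 4 df."""
    return math.exp(-x / 2) * (1 + x / 2)

# Rows: unambiguous, ambiguous; columns: k = 1..5 (counts from Table 3.3).
gsm31931 = [[93, 405, 210, 27, 5], [0, 15, 32, 12, 12]]
gsm31935 = [[74, 317, 149, 17, 3], [34, 246, 171, 30, 14]]

stat_31931 = chi_square_statistic(gsm31931)
stat_31935 = chi_square_statistic(gsm31935)
print(f"GSM31931: chi2={stat_31931:.1f} p={chi_square_sf_df4(stat_31931):.2g}")
print(f"GSM31935: chi2={stat_31935:.1f} p={chi_square_sf_df4(stat_31935):.2g}")
```

With two rows and five columns the test has (2-1)(5-1) = 4 degrees of freedom, matching the df reported in the table.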
3.3.3 Component assignment of libraries is non-random
If the mixture model approach holds, then the Poisson components should cluster the libraries into recurring groups. Such an enrichment of certain component assignments would be expected for a number of reasons. Two possibilities are: a) if one or more libraries are mislabelled, the tag expression in those libraries should show a preferential assignment to a separate component; and b) if the genes corresponding to a set of tags are co-expressed, the component assignment should be similar amongst these genes. Conversely, if the negative binomial model is more appropriate, then component assignments should essentially be random, since the distribution assumed to give rise to biological variability is continuous and unconditional.

For each of the datasets, the component assignments for tags where the estimated number of components is two were tallied. The individual assignment was based on the component with the highest posterior probability, given a tag count and library size. In all cases, there were a significant number of tags where the parent libraries were partitioned into the same two components (Table 3.4). For example, in the AML libraries containing the cytogenetic abnormality t(8;21), of the 225 tags that had expression that could be fit to two Poisson components, 110 were partitioned in the form -++-- (p=4.5E-67; binomial test). In other words, almost half of the tags that fit to two components were assigned to a single component configuration (for 5 libraries, $(2^5/2)-1=15$ such configurations are possible).

3.3.4 Determining differentially expressed genes
In previously described overdispersed models, the identity of a library is included a priori as a model covariate. Significance is then determined by testing the null hypothesis that the fitted β coefficient for this covariate is zero (Baggerly, 2004; Lu, 2005).
A Bayesian significance score has also been described, although this was developed using a beta-binomial model (Vencio, 2004). In contrast, the Poisson mixture model does not require the identity of the libraries be included (although the addition of such covariates is possible). Rather, once a mixture model has been fit, the posterior probabilities of membership in a particular component given the observed tag counts can be used to determine how well the components can differentiate between two or more sample types (e.g. normal versus cancer).

Table 3.4: Top component memberships

Dataset                           Component assignment          Freq.    p-value
BRAIN
  astrocytoma (N=14 nk=2=454)     -+--------------+++++++++++   14 12    4.3E-28 2.7E-23
  ependymoma (N=10 nk=2=544)      -+---------+------            16 15    6.0E-14 9.3E-13
  glioblastoma (N=7 nk=2=607)     ---+-------+                  48 42    3.2E-19 5.2E-15
  medulloblastoma (N=18 nk=2=438) -----------------+ -----------+-----+   4 4   1.5E-9 1.5E-9
  normal (N=9 nk=2=588)           -----+------+-+               41 21    7.4E-37 7.6E-14
AML
  inv(16) (N=5 nk=2=387)          -++++ -+-++                   110 36   2.5E-39 0.028
  t(8;21) (N=5 nk=2=387)          -++--++-+                     110 36   2.5E-39 0.028
  t(15;17) (N=5 nk=2=225)         -++--++-+                     110 36   4.5E-67 1.0E-6
  t(9;11) de novo (N=4 nk=2=502)  -+-+ -+-                      143 132  1.6E-16 1.4E-12
  t(9;11) treated (N=3 nk=2=405)  -++                           216      1.1E-16
BREAST
  normal (N=6 nk=2=571)           ---+------+                   82 65    3.6E-29 4.0E-18
  DCIS (N=4 nk=2=154)             --++ -+++                     97 41    1.4E-43 4.5E-5
  invasive (N=3 nk=2=765)         -+                            337      4.6E-10
SKIN
  normal (N=4 nk=2=500)           ---+                          215      1.8E-54
  melanoma (N=3 nk=2=650)         -+-                           405      2.1E-51

For each set of biological replicates, the top one or two component states were selected from tags where the estimated number of components is 2. One component was represented with -, the other with + (i.e. -+- is equivalent to +-+). The significance was calculated using a zero-truncated binomial test. The number of possible ways for the libraries to be assigned to the two components is $(2^N/2)-1$, where N is the number of libraries.
Here, a score is presented based on the confidence that a sample is of type ω given that it arises from component(s) k. Using Bayes' theorem, one can derive the following expression (the complete proof is included in Appendix III):

$$P(\omega \mid k) = \frac{\sum_{j \in \omega} \tau_{jk}}{N \pi_k}$$

where ω is the set of libraries corresponding to some label of interest (e.g. normal or cancer), $\tau_{jk}$ is the posterior probability of the tag count from library j arising from component(s) k, $\pi_k$ is the mixing coefficient for component(s) k, and N is the total number of libraries (Section 3.2.2). Using this expression, one can determine which tags have a set of mixture components that are closely linked with the sample type(s) of interest.

[Figure 3.2: scatter plot of mixture model confidence score (%) (y-axis) against p-value (negative binomial) (x-axis, log scale from 1e−11 to 1e−02). Tags ACAACAAAGA and CAGTTGTGGT are circled.]

Figure 3.2: Comparison to significance scores for a test of differential expression calculated using a negative binomial model. Using the tag counts from 8 normal brain libraries versus 10 ependymoma libraries, differential expression between these two sample types was assessed using two methods. Plotted are the significance scores calculated for a negative binomial model versus a Poisson mixture model. The negative binomial (x-axis) is a p-value, so smaller values are more significant. The Poisson mixture (y-axis) is a confidence score, so larger values are more significant. Circled are two examples of SAGE tags where one model shows significance while the other does not.

[Figure 3.3: two panels of normalized tag count (tags per 100k) for normal and ependymoma libraries. Panel titles: "ACAACAAAGA, log-linear p-val: 0.9998, mixture model confidence score: 99.42%" and "CAGTTGTGGT, log-linear p-val: 8.767E-7, mixture model confidence score: 59.75%".]

Figure 3.3: Counts for two tags assessed using a negative binomial model and the Poisson mixture model where one model shows significance and the other does not. The figure is divided to show separate plots of the expression level of two tags observed in 8 normal brain libraries and 10 ependymoma libraries. The x-axis is the normalized expression (count/library size*100,000) and the y-axis is divided into the two sample types. In the top plot, the negative binomial model is not significant and the Poisson mixture is significant; in the bottom plot, the situation is reversed. Light gray guide lines denote the expected expression level of the Poisson components.

To illustrate, SAGE libraries from normal brain (n=8) and ependymoma (n=10) (a type of brain tumour) were analyzed using both the overdispersed log-linear and Poisson mixture models. In the former case, significance was calculated using the method described by Baggerly et al. (Baggerly, 2004) (see also example R code in Appendix II). In the latter case, the method described above was used. A plot of the two sets of scores shows a moderate correlation and tags that are found highly significant in one test tend to be so in the other (Figure 3.2). However, a number of observations are found to be significant using the overdispersed log-linear model and not the Poisson mixture model, and vice versa. A closer look at the most extreme examples illustrates the superior performance of the mixture approach (Figure 3.3). In the first example, tag ACAACAAAGA seems clearly expressed in normal libraries, but is completely abolished in the ependymoma libraries. However, according to the overdispersed model, the observation is not at all significant (p=0.9998). The mixture model, however, produces a confidence score of 99.42%, which suggests this tag is highly informative with respect to sample type.
This example demonstrates the difficulty that the log-linear model has with fitting groups where tag counts are zero, a problem that is even more pronounced when using a logistic regression model (for a more thorough discussion of this problem see Lu et al. (Lu, 2005)). In the second example, tag CAGTTGTGGT clearly has increased expression in some libraries from both the normal and ependymoma groups.  However, according to the  overdispersed model, the observation is highly significant (p=8.8E-7). The mixture model, however, produces a confidence score of 59.8% which is only nominally better than chance. This example demonstrates how the log-linear model seems to downweight the occasional extreme observation in one group, even if it is in agreement with observations in the other group. This can result in candidate lists based on the log-linear significance containing tags that have extreme observations that occur at a higher rate in one group over another, which are typically of little interest. Similar results were obtained when comparing to the Bayes error rate described by Vencio et al. (Vencio, 2004). Again, a moderate correlation is seen and tags found highly significant in one test tend to be so in the other (Figure 3.4). Overall, the Bayes error rate is in better agreement with the mixture model confidence score and appears to be more robust in assessing tags with zero counts in one group. However, the assumption of a hierarchical model (in this case, a beta-binomial) used to calculate the Bayes error rate versus a Poisson mixture model results in differences between the two methods. Two examples, analogous to those described above, are highlighted (Figure 3.5). In both cases, the Poisson mixture model appears to give confidence values that are in better agreement with the observations.  
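
The confidence score used in these comparisons can be illustrated with a small sketch. It assumes the score takes the Bayes-rule form in which the summed posterior membership of the libraries of interest in a component is divided by the number of libraries times that component's mixing coefficient, with the mixing coefficient taken as the mean posterior (the usual EM estimate). The posterior values and labels below are hypothetical, and the thesis derivation in Appendix III remains the authoritative source.

```python
def component_confidence(tau, labels, target):
    """Confidence that a library is of type `target` given each component.

    tau[j][k] is the posterior probability that library j arose from
    component k. Assumed form: sum of tau[j][k] over libraries of the
    target type, divided by N * pi_k, with pi_k the mean of tau[j][k].
    """
    n_libraries = len(tau)
    n_components = len(tau[0])
    scores = []
    for k in range(n_components):
        pi_k = sum(row[k] for row in tau) / n_libraries
        in_class = sum(row[k] for row, label in zip(tau, labels)
                       if label == target)
        scores.append(in_class / (n_libraries * pi_k))
    return scores

# Hypothetical posteriors for one tag in three normal and three tumour libraries.
labels = ["normal"] * 3 + ["tumour"] * 3
tau = [
    [0.9, 0.1],
    [0.8, 0.2],
    [1.0, 0.0],
    [0.1, 0.9],
    [0.0, 1.0],
    [0.2, 0.8],
]
scores = component_confidence(tau, labels, "normal")
print([round(s, 2) for s in scores])
```

A component whose membership is drawn almost entirely from one sample type scores close to 1 for that type, while a component shared equally between types scores close to the type's share of the libraries.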
[Figure 3.4: scatter plot of mixture model confidence score (%) (y-axis) against Bayes error rate (x-axis, 0.0 to 1.0). Tags AGAGGTGTAG and CCAACCGTGC are circled.]

Figure 3.4: Comparison to Bayes error rate for a test of differential expression calculated using a beta binomial model. Using the tag counts from 8 normal brain libraries versus 10 ependymoma libraries, differential expression between these two sample types was assessed using two methods. Plotted are the Bayes error rate described in Vencio et al. (2004) versus a Poisson mixture model confidence score. For the Bayes error rate (x-axis) smaller values are more significant. The Poisson mixture (y-axis) is a confidence score, so larger values are more significant. Circled are two examples of SAGE tags where one model shows significance while the other does not.

[Figure 3.5: two panels of normalized tag count (tags per 100k) for normal and ependymoma libraries. Panel titles: "AGAGGTGTAG, Bayes error rate: 0.22, mixture model confidence score: 99.41%" and "CCAACCGTGC, Bayes error rate: 0.02, mixture model confidence score: 61.97%".]

Figure 3.5: Counts for two tags assessed using a Bayes error rate and the Poisson mixture model where one model shows significance and the other does not. The figure is divided to show separate plots of the expression level of two tags observed in 8 normal brain libraries and 10 ependymoma libraries. The x-axis is the normalized expression (count/library size*100,000) and the y-axis is divided into the two sample types. In the top plot, the Bayes error rate is not significant and the Poisson mixture is significant; in the bottom plot, the situation is reversed. Light gray guide lines denote the expected expression level of the Poisson components.
3.4  DISCUSSION

The exploration of statistical approaches to SAGE analysis is important since the growing number of studies using the technology continues to increase the amount of available data. The notion of sampling variability being the predominant source of “within”-library variability and distinct components being the predominant source of “between”-library variability is reassuring for investigators who choose the SAGE technique to obtain a comprehensive profile of gene expression in a limited number of samples. Nevertheless, latent biological variability certainly contributes, as evidenced by the increased performance of the negative binomial as the number of libraries increases. However, this work demonstrates that a simple overdispersed model may overstate this effect, and that expression clusters into distinct components, which are then sampled. This is consistent with the view of gene transcription for any one locus consisting of (possibly several) inactivated or activated state(s). The same idea holds for some known mechanisms of genetic disease, such as loss of heterozygosity (LOH) or amplification of a particular locus (e.g. cancer). For this reason, it is recommended that investigators try the mixture model approach in comparisons of groups of biological replicates. Failing this, some of the difficulties that can be encountered with the negative binomial model can be lessened by: a) setting a tolerance for how much overdispersion (φ) is acceptable in a final list of candidate tags, although such a cutoff would be somewhat arbitrary; and b) adding a small value to the tag count to avoid the problems the model has with groups consisting of many zero counts. One strategy is to assume equal odds that the next tag drawn is the one of interest by adding 1 to the count, and 2 to the library size (i.e. (count+1)/(size+2)) (K. Baggerly, personal communication).
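The zero-count adjustment in (b) is small enough to write out directly. The following is a minimal Python sketch of the (count+1)/(size+2) estimate; the function name and the example library size are illustrative assumptions and are not taken from the analysis code used in this work.

```python
def adjusted_proportion(count: int, library_size: int) -> float:
    """Estimate a tag's proportion as (count + 1) / (library_size + 2).

    Hypothetical helper: it adds one observation of the tag and one
    observation of "any other tag" to the library, so a tag with a
    count of zero no longer has an estimated proportion of exactly zero.
    """
    if count < 0 or library_size < count:
        raise ValueError("require 0 <= count <= library_size")
    return (count + 1) / (library_size + 2)


# A tag that was never observed in a 50,000-tag library (illustrative size).
print(adjusted_proportion(0, 50_000))
# A tag observed 25 times in the same library.
print(adjusted_proportion(25, 50_000))
```

For libraries of realistic depth the adjustment barely changes the estimate for well-expressed tags, while keeping the estimate strictly positive for tags that were never observed.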
In the future, it may be worthwhile to combine both approaches by defining a negative binomial mixture model. However, at this point, such an approach is unlikely to provide significant improvement given the small number of libraries in a typical set of available biological replicates. In addition, applying the concept of “information sharing” between tags may provide estimates of statistically informative variables that apply library-wide, and could be utilized to improve the power of the method described in this chapter (Kuznetsov, 2002; Thygesen, 2006).

The Poisson mixture model appears to be a rational means to represent SAGE data that are biological replicates and a sound basis for assigning significance when comparing multiple groups of such replicates. The use of a mixture model can improve the process of selecting differentially expressed genes, and provide a foundation for ab initio identification of co-expressed genes and/or biologically-relevant sample subsets.

BIBLIOGRAPHY

Akaike, H. (1974). "A new look at the statistical model identification." IEEE Transactions on Automatic Control 19(6): 716-723.
Baggerly, K. A., L. Deng, et al. (2003). "Differential expression in SAGE: accounting for normal between-library variation." Bioinformatics 19(12): 1477-83.
Baggerly, K. A., L. Deng, et al. (2004). "Overdispersed logistic regression for SAGE: modelling multiple groups and covariates." BMC Bioinformatics 5: 144.
Barrett, T., D. B. Troup, et al. (2007). "NCBI GEO: mining tens of millions of expression profiles--database and tools update." Nucleic Acids Res 35(Database issue): D760-5.
Boon, K., J. B. Edwards, et al. (2004). "Identification of astrocytoma associated genes including cell surface markers." BMC Cancer 4: 39.
Cornelissen, M., A. C. van der Kuyl, et al. (2003). "Gene expression profile of AIDS-related Kaposi's sarcoma." BMC Cancer 3: 7.
Dempster, A., N. Laird, et al. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm." Journal of the Royal Statistical Society. Series B (Methodological) 39(1): 1-38.
Kuznetsov, V. A., G. D. Knott, et al. (2002). "General statistics of stochastic process of gene expression in eukaryotic cells." Genetics 161(3): 1321-32.
Lee, S., J. Chen, et al. (2006). "Gene expression profiles in acute myeloid leukemia with common translocations using SAGE." Proc Natl Acad Sci U S A 103(4): 1030-5.
Leisch, F. (2004). "FlexMix: A general framework for finite mixture models and latent class regression in R." Journal of Statistical Software.
Lu, J., J. K. Tomfohr, et al. (2005). "Identifying differential expression in multiple SAGE libraries: an overdispersed log-linear model approach." BMC Bioinformatics 6: 165.
McLachlan, G. J. and D. Peel (2000). Finite mixture models. New York, Wiley.
Porter, D., J. Lahti-Domenici, et al. (2003). "Molecular markers in ductal carcinoma in situ of the breast." Mol Cancer Res 1(5): 362-75.
Porter, D. A., I. E. Krop, et al. (2001). "A SAGE (serial analysis of gene expression) view of breast tumor progression." Cancer Res 61(15): 5697-702.
R Development Core Team (2007). R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing.
Schwarz, G. (1978). "Estimating the dimension of a model." Annals of Statistics 6(2): 461-464.
Thygesen, H. H. and A. H. Zwinderman (2006). "Modeling Sage data with a truncated gamma-Poisson model." BMC Bioinformatics 7: 157.
van Ruissen, F., B. J. Jansen, et al. (2002). "A partial transcriptome of human epidermis." Genomics 79(5): 671-8.
Venables, W. N. and B. D. Ripley (2002). Modern Applied Statistics with S. New York, Springer.
Vencio, R. Z., H. Brentani, et al. (2004). "Bayesian model accounting for within-class biological variability in Serial Analysis of Gene Expression (SAGE)." BMC Bioinformatics 5: 119.
Weeraratna, A. T., D. Becker, et al. (2004). "Generation and analysis of melanoma SAGE libraries: SAGE advice on the melanoma transcriptome." Oncogene 23(12): 2264-74.

CHAPTER IV
TRANSCRIPTOME EVOLUTION IN THE DEVELOPMENTAL STAGES OF SQUAMOUS CELL LUNG CARCINOMA†

† A version of this chapter will be submitted for publication. Zuyderduyn, S., Vatcher, G., Lam, S., Lam, W., MacAulay, C., Ng, R., and Ling, V. Transcriptome evolution in the developmental stages of squamous cell lung cancer.

4.1  INTRODUCTION

Over 80% of cancers are carcinomas, which are malignancies that arise from epithelial cells. Carcinoma development is driven by a series of genetic alterations that are accompanied by histological changes that occur in a specific chronology and can often be observed in premalignant stages of tumourigenesis. A well-known example is the Fearon-Vogelstein model of colorectal cancer formation, which correlates certain genetic alterations with the histopathological stages of progression (Fearon, 1990). The identification and characterization of the genetic changes that drive development are critical to an overall understanding of tumourigenesis, and those that occur in pre-malignant and early malignant stages have particular importance as early detection markers. These genetic changes can potentially predict the risk for disease advancement and allow for more informed treatment decisions. However, studying these lesions is difficult since biological material is limited and clinical presentation is relatively rare.

This chapter describes the analysis of transcriptome profiles of the early stages of squamous cell carcinoma (SCC) of the lung. This dataset represents, to date, the most complete catalogue of gene expression during the development of a solid tumour. For several reasons, this cancer type is attractive as a target for this effort. Other than highly treatable forms of skin cancer, lung cancers are the most commonly diagnosed cancers and are the leading cause of cancer mortality, killing more people than breast, prostate and colon cancer combined (Jemal, 2007). The SCC subtype of lung cancer comprises 25-30% of these cancers, representing a large fraction of total cancer cases. Furthermore, this tumour subtype has a more defined and readily identifiable progression of pre-invasive lesions, a feature common to squamous cell carcinomas in other tissues. In lung, SCC progression starts with normal epithelium that (usually through exposure to tobacco smoke) advances to hyperplasia, squamous metaplasia (referring to the hallmark change in morphology of the normal columnar epithelium to a squamous cell type), varying degrees of dysplasia, carcinoma in situ (CIS), and finally a full-blown invasive carcinoma (Hirsch, 2001). This nomenclature is similar in other tissues (e.g. skin, oesophagus, cervix) (Neville, 2002; Greer, 2006; Shimizu, 2007). Some plasticity is possible in this sequence, although the sequence is thought to occur in the vast majority of cases. For example, lesions may spend very little time at a given stage of progression, or perhaps skip some stages altogether. Lesions in the precursor stages of malignant development often fail to progress and can even regress, particularly following smoking cessation (Colby, 1999). Observations of these events strongly suggest that an irreversible commitment to carcinogenesis does not occur until later stages of dysplasia, although it has almost certainly occurred at the CIS stage (Breuer, 2005). By associating these histological stages of progression with the occurrence of certain changes in gene expression, a refined picture of the molecular basis of malignant potential and commitment in lung SCC can be developed.

Previous gene expression profiles of SCC have compared normal samples (either bulk lung tissue or cell lines) to primary, invasive tumours (Bhattacharjee, 2001; Nacht, 2001). Since the intervening stages are omitted, it is impossible to assign a chronology to the observed expression changes and distinguish between benign, premalignant, and malignant events.
By identifying the stage of development to which each gene expression change contributes, the phenotype (e.g. pre-malignant vs. malignant) associated with that change can be refined.

This study profiles gene expression in several stages of SCC progression using serial analysis of gene expression (SAGE), a powerful technique for obtaining a comprehensive snapshot of the transcriptome (Velculescu, 1995). This technique generates small sequence tags extracted from a defined position of each mRNA sequence. These tags are serially ligated and cloned and can then be sequenced and counted, generating a nearly unbiased profile of the mRNA population. SAGE has the advantage of being: 1) based on counts, and therefore providing observations that are quantitative (compared to the more qualitative analog hybridization signals arising from microarray data); 2) capable of increased statistical power simply by increasing the sampling depth and/or including additional libraries generated at a later time; and 3) unbiased with respect to genes detected since no a priori knowledge of mRNA sequence is required.

The analysis presented here focuses on two objectives: 1) to investigate the relationship between the global gene expression profile and the histological stages of SCC development, and 2) to identify a set of gene expression signatures that can differentiate the different neoplastic stages of SCC.

4.2  MATERIALS AND METHODS

4.2.1 Sample collection and preparation

Details on the sampling and preparation of normal bronchial epithelium have been previously described (Lonergan, 2006). Pre-malignant lesions and tumour samples were obtained from biopsied material. Samples I11 and I12 are both pools of material from four patients and were obtained from Dr. Ming-Sound Tsao (Ontario Cancer Institute).

4.2.2 Data processing

Automated sequencer traces were base-called and assigned quality scores using the Phred software package (Ewing, 1998a; Ewing, 1998b).
SAGE tags were extracted from sequence data using Bio::SAGE::DataProcessing (ver. 1.20), a module written in Perl, and default parameters (Zuyderduyn, 2004).

4.2.3 Multidimensional scaling

Multidimensional scaling (MDS), or principal coordinates analysis, was used to characterize the overall relationship between sets of libraries. This method projects a multidimensional set of values into a lower-dimensional space to aid interpretation. The distance between all possible library pairs was calculated using the Pearson correlation coefficient (Gower, 1966). Let x_ik be the normalized expression (tag count/library size) of the k-th tag in library i, let x̄_i be the mean of x_ik over all tags, and let N be the total number of tags compared. The correlation between two libraries i and j is then:

$$r_{ij} = \frac{\sum_{k=1}^{N} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{N} (x_{ik} - \bar{x}_i)^2}\,\sqrt{\sum_{k=1}^{N} (x_{jk} - \bar{x}_j)^2}}$$

A dissimilarity matrix was created where each cell contained the value 1 − r_ij, for each possible pair of libraries i and j. The cmdscale function of the R statistical software was used to determine the optimal coordinates to represent the libraries in 3 dimensions (Gentleman, 2004).

4.2.4 k-means clustering

The objective of k-means clustering is to separate a set of n objects into k partitions by maximizing the similarity of objects within each partition (Steinhaus, 1956). A Poisson-based clustering algorithm specifically designed for use with SAGE data was utilized (Cai, 2004). The algorithm is unable to handle tag counts of zero, so a value of 0.5 was used for these tags. In addition, the algorithm determines cluster centres in terms of raw tag count only and this can complicate interpretation. Therefore, the equations for the algorithm were modified slightly to produce cluster centres corrected for library size. Finally, there is no guarantee that a k-means analysis will identify the optimal set of clusters. However, this shortcoming is greatly mitigated by running the analysis multiple times with different starting (or seed) clusters and choosing the outcome with the least residual variance.
In this case, the procedure was performed 20 times. In addition, a strategy called k-means++ can greatly improve the selection of the starting clusters and enhance the quality of the final clusters (Arthur, 2007). The k-means algorithm, supplemented with these improvements, was implemented using the C++ programming language (Stroustrup, 2000).

Since the number of partitions is not known a priori, the “elbow criterion” was used to determine a reasonable value for k (Jain, 1999). This involves performing the clustering over a range of k and plotting the amount of residual variance. In this case, the total within-cluster dispersion (S) was used as a metric for variance (Cai, 2004). As k increases, the amount of variance not explained by the clusters will decrease (by definition, the variance will be zero when k is equal to n). Under most circumstances, the plot will reveal an “elbow” where the addition of more clusters shows a marked decline in the amount of variance removed. This point was chosen as the optimal k.

4.2.5 Hierarchical clustering

Sample-wise hierarchical clustering was performed using the same Poisson-based distance metric described in Section 4.2.4, averaged across all tags included in the clustering. Cluster-to-cluster distances were calculated using the average linkage method, where the mean distance between objects in the two clusters is used.

4.2.6 Gene Ontology and KEGG pathway enrichment

The Gene Ontology (GO) project is a collaborative effort to assign biological processes, cellular components, and molecular functions to descriptions of genes using a controlled vocabulary (Ashburner, 2000). In addition, GO terms are assigned to genes along with an evidence code, which describes the criteria used to make the assignment. In cases where a large list of genes makes a detailed examination prohibitive, a GO enrichment analysis can quickly uncover highly represented biological themes.
The Kyoto Encyclopedia of Genes and Genomes (KEGG) features a collection of about 120 manually curated pathways grouped in broad categories of metabolism, genetic information processing, environmental information processing, cellular processes, and human diseases (Kanehisa, 2008). As with GO, a KEGG enrichment analysis can indicate units of the cell “circuitry” that show enhanced or depressed activity.

In this study, tags were mapped to Unigene clusters and provided to the online DAVID resource. DAVID performs statistical significance testing to identify enrichments of GO terms and KEGG pathways (Huang da, 2007). Re-sampling based false discovery rate (FDR) estimates were chosen to correct for multiple testing. In addition, only GO term-gene associations with acceptable evidence codes were chosen, based upon the recommendation of the developers of the GoMiner software package; specifically, the assignment must be made based on a traceable author statement (TAS) or inferred by: direct assay (IDA), mutant phenotype (IMP), genetic or physical interaction (IGI, IPI), sequence or structural similarity (ISS), or expression pattern (IEP) (Zeeberg, 2003).

4.2.7 Statistical analysis

4.2.7.1 Preprocessing

In order to reduce the dimensionality of the data to a level that allowed analysis to be undertaken in a reasonable time, tags were conservatively pre-filtered to include only those that were expressed at least 4 times in at least one library (19,147 tags). Tags that do not meet this criterion cannot be distinguished from the variability due to counting, and will not be statistically informative.

4.2.7.2 Feature selection

The counts for each tag were fit to a k-component Poisson mixture model (Zuyderduyn, 2007; see Chapter III). When the number of components in the mixture model is determined to be one (k=1), the variance in counts cannot be distinguished from sampling variability and the tag is excluded from consideration.
When k>1, a test statistic is calculated to determine how well the components separate the different stages of progression into some meaningful groups of interest (e.g. pre-malignant versus malignant, non-invasive versus invasive, etc.). Let x_i be the class label for library i (i.e. 0 for epithelial brushing, 1 for bulk normal, 2 for metaplasia/dysplasia, 3 for CIS, 4 for invasive). Let ω_i be the binary group label of interest for library i (i.e. 0 for pre-malignant, 1 for malignant). Let β be an ordered vector containing the estimated coefficients for the K mixture components (e.g. β_1 < β_2 < … < β_K) and let τ be an index value between 1 and K-1 that partitions the elements of β into two groups. Let P(k|Y_i, n_i) be the posterior probability that the count Y_i observed for library i of size n_i arose from component k. Since each component is Poisson distributed, this will be Poisson(Y_i; n_i exp(β_k)). For each class of libraries x, the average probability p_x = avg[P(k≤τ|Y_i)] is calculated. The score reflecting whether a tag count corresponds to one or the other group is:

$$\Theta_0 = \prod_{x:\,\omega=0} p_x, \qquad \Theta_1 = \prod_{x:\,\omega=1} (1 - p_x)$$

where the first product runs over each class x where ω=0 and the second over each class x where ω=1. The final test statistic is simply:

$$\Theta = \Theta_0 \times \Theta_1$$

This value can range from 0-1, exclusive; the value approaches 1 as the tag counts correspond to the components separating the two groups. A confidence calculation is made for each possible separation τ and the highest score is taken. An example calculation is shown in Figure 4.1.

The value of the test statistic is used to rank the relative suitability of each tag in a list of candidates. In order to determine a threshold score that can be used to isolate a list of high-quality candidates, a Monte Carlo simulation was performed. The class labels were randomly reassigned, and the procedure described above was re-run. This was repeated 1,000 times.
The number of candidates identified at a given threshold score was averaged over the 1,000 rounds of simulation and then compared to the actual number identified at that threshold for the true class labels. For example, at a confidence of 0.70 the Monte Carlo simulation may identify an average of 10 tags, while the true class labels identify 100 tags. Thus, for this threshold, one can estimate the false discovery rate (FDR) as 10/100=10% (i.e. candidates that are not truly associated with the sample types, but are a result of random variation). Threshold scores were chosen to achieve a 5% FDR.

[Figure 4.1 plot: counts per 50k (Y_i/n_i*50000) for the tag in each of the 40 libraries, shown in the order N13, N16, GSM762, NS90, NS92, NS93, NS97, FS03, FS20, FS21, FS24, FS32, FS34, FS38, FS44, FS49, FS50, FS73, FS06+07, CS26, CS28, CS36, CS42, CS48, CS75, CS80, CS94, M51, D101, C01, C02, C05, C27, C39, I04, I08, I11, I12, I18, I22. Mixture model fit (θ_i = n_i e^β): k=1, β=-9.92; k=2, β=-8.52; k=3, β=-7.66. Class labels x as plotted: 0 for N13, N16 and GSM762; 1 for the NS, FS and CS libraries; 2 for M51 and D101; 3 for the C libraries; 4 for the I libraries. Group labels ω: 0 for classes 0 to 2, 1 for classes 3 and 4.]

Average P(k|Y_i) for each class x:

| x | ω | k=1 | k=2 | k=3 |
|---|---|-------|-------|-------|
| 0 | 0 | 0.958 | 0.004 | 0.000 |
| 1 | 0 | 0.996 | 0.004 | 0.000 |
| 2 | 0 | 0.999 | 0.001 | 0.000 |
| 3 | 1 | 0.003 | 0.797 | 0.200 |
| 4 | 1 | 0.165 | 0.665 | 0.170 |

τ=1: Θ_{ω=0} = 0.953, Θ_{ω=1} = 0.832, Θ = 0.793 (accept)
τ=2: Θ_{ω=0} = 1.000, Θ_{ω=1} = 0.034, Θ = 0.034

Figure 4.1: A confidence score calculation for a tag. This example uses the tag CTGCACTTAC and group labels corresponding to a test for differential expression in malignant samples. A detailed explanation of the procedure is found in Section 4.2.7.2. The tag maps to MCM7 and is part of the optimal malignant signature discussed in Section 4.3.3.3.
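The score calculation of Figure 4.1 and the false discovery rate estimate of Section 4.2.7.2 can be retraced with a short Python sketch. This is an illustration under stated assumptions rather than the code used for the analysis: the function names are hypothetical, and the per-class averages p_x = avg[P(k≤τ|Y_i)] are typed in from the rounded values printed in Figure 4.1 instead of being derived from a fitted mixture model.

```python
from math import prod


def confidence_score(p_by_class, group_of_class):
    """Theta = Theta_0 * Theta_1 for one candidate split tau.

    p_by_class[x] is the class average of P(k <= tau | Y_i);
    group_of_class[x] is the binary group label (omega) of class x.
    """
    theta0 = prod(p for p, g in zip(p_by_class, group_of_class) if g == 0)
    theta1 = prod(1.0 - p for p, g in zip(p_by_class, group_of_class) if g == 1)
    return theta0 * theta1


def best_split(p_by_tau, group_of_class):
    """Score every candidate split tau and return (tau, score) for the best."""
    scores = {tau: confidence_score(p, group_of_class)
              for tau, p in p_by_tau.items()}
    tau = max(scores, key=scores.get)
    return tau, scores[tau]


def estimate_fdr(mean_null_candidates, observed_candidates):
    """FDR estimate: average candidate count under permuted class labels
    divided by the candidate count under the true labels."""
    return mean_null_candidates / observed_candidates


# Group label (omega) for classes x = 0..4, as in Figure 4.1.
GROUPS = [0, 0, 0, 1, 1]
# Rounded class averages of P(k <= tau | Y_i) read from Figure 4.1.
P_BY_TAU = {
    1: [0.958, 0.996, 0.999, 0.003, 0.165],
    2: [1.000, 1.000, 1.000, 0.800, 0.830],
}

tau, theta = best_split(P_BY_TAU, GROUPS)
print(tau, theta)             # tau = 1 gives the higher score
print(estimate_fdr(10, 100))  # the worked example from the text
```

Because the inputs are rounded to three decimals, the score for τ=1 agrees with the value printed in the figure only to within rounding.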
4.2.7.3 Estimating the selectivity of a candidate tag list and the generation of optimal signatures

The sensitivity and specificity of a given candidate tag list was estimated using leave-one-out cross-validation (LOOCV). Candidate tags were generated as described above, except one library was left out. This was repeated for all libraries. A prediction of whether the library was from group zero (ω=0) or group one (ω=1) (e.g. non-malignant versus malignant) was made with the Poisson distribution using the parameters determined by the mixture model fit:

$$P(\omega = 0 \mid Y_i, n_i, \beta, \tau) = \sum_{j=1}^{\tau} \mathrm{Poisson}(Y_i;\, n_i \exp(\beta_j))$$

with the sum taken over the remaining components (j > τ) for ω=1. These values were scaled to ensure that the probability that the sample came from any group is 100%. Finally, these values were averaged over all of the tags in the candidate list to arrive at a final membership probability. Let G_0 and G_1 denote the sets of libraries where ω=0 and ω=1, respectively. The sensitivity and specificity were estimated as follows:

$$\text{sensitivity} = \frac{\sum_{i \in G_0} P(\omega=0 \mid Y_i, n_i, \beta, \tau)}{\sum_{i \in G_0} P(\omega=0 \mid Y_i, n_i, \beta, \tau) + \sum_{i \in G_1} P(\omega=0 \mid Y_i, n_i, \beta, \tau)}$$

$$\text{specificity} = \frac{\sum_{i \in G_1} P(\omega=1 \mid Y_i, n_i, \beta, \tau)}{\sum_{i \in G_0} P(\omega=1 \mid Y_i, n_i, \beta, \tau) + \sum_{i \in G_1} P(\omega=1 \mid Y_i, n_i, \beta, \tau)}$$

In addition, a test statistic value was selected to generate an optimal signature using a forward selection strategy. A cutoff score was established and then successively lowered, and candidate genes that met this threshold were added to the signature in each round of cross-validation until a maximum selectivity was reached as determined by a receiver operating characteristic (ROC) curve. This score was then used to select an optimal signature from the full dataset.

4.2.8 Tag to gene mapping

To increase the accuracy and confidence of tag to gene mapping, tags were assigned additional nucleotides as described in Chapter II. A combination of resources was used to facilitate tag to gene mapping.
For rapid mapping of large tag lists, the LongSAGE (21bp) version of the SAGE Genie database (Boon, 2002) was used to allow the matching of 15-16bp tags obtained by the additional nucleotide procedure. When performing a more comprehensive mapping (i.e. to determine the source of a small number of tags of high interest), the SAGE Genie mapping was supplemented with:

1) Exact-match BLASTN (Altschul, 1997) to the human genome (NCBI 36 assembly) using the web-based EnsEMBL BLAST resource (Flicek, 2008)
2) Exact-match BLASTN to the NCBI human EST collection (8,137,888 sequences) (Benson, 2008) using the web-based NCBI BLAST resource (Johnson, 2008)
3) Custom Perl scripts that match a specified tag to other tags in the library that differ by a single nucleotide or a single insertion or deletion. The tag was considered an artefact only if the closely related tag was expressed at a 10-fold greater level.
4) Match to a complete database of antisense tags and tags occurring at anchoring enzyme sites other than the 3’-most position from a meta-catalogue of sequences from Refseq (Pruitt, 2007) and Unigene (Wheeler, 2008) (developed by G. Vatcher)

4.2.9 Microarray validation

Validation of gene expression signatures was carried out using publicly available microarray profiles. Unless otherwise stated, the data values determined by the processing protocol of the original authors were used. Genes that were identified as having an effect on the bronchial epithelium with exposure to tobacco smoke were validated with the following datasets:

1) The Spira dataset, containing 75 samples (23 never smokers, 18 former smokers, 34 current smokers) (Spira, 2004).
2) The Carolan dataset, containing 44 samples (18 never smokers, 26 current smokers) (Carolan, 2006).
3) The Beane dataset, containing 102 samples (21 never smokers, 31 former smokers, 52 current smokers) (Beane, 2007).
Genes that were identified in the developmental stages of SCC were validated with the following datasets:

1) The Bhattacharjee dataset, containing 38 samples (17 normal, 21 SCC) (Bhattacharjee, 2001). The profiles were generated using the Affymetrix U133A oligonucleotide array which contains 22,283 different probe sets. The published microarray data were processed using now obsolete methods, so the original Affymetrix CEL files were obtained from the authors’ website and processed with updated methods using the Bioconductor (www.bioconductor.org) R analysis package (Gautier, 2004). Specifically, the Robust Multichip Average method, which applies background correction and a quantile normalization step, was used (Bolstad, 2003; Irizarry, 2003a; Irizarry, 2003b). This technique is consistent with the data processing used in the remaining microarray validation datasets.
2) The Erez/Dehan dataset (GEO accession: GSE1987), containing 25 samples (2 normal, 7 tumour-associated normal, 16 SCC) (Erez, 2004; Dehan, 2007). The profiles were generated using the Affymetrix U95A oligonucleotide array which contains 12,651 different probe sets.
3) The Wachi dataset (GEO accession: GSE3268), containing 10 matched same-patient samples (5 normal, 5 SCC) (Wachi, 2005). As with the Bhattacharjee dataset, the profiles were generated using the Affymetrix U133A oligonucleotide array which contains 22,283 different probe sets.

4.2.10 Tissue samples

Immunohistochemistry was performed on sections of formalin-fixed, paraffin-embedded tissue blocks (gift from Jaclyn Hung). These included 10 samples of Stage I and Stage II squamous cell lung cancers obtained from surgical resection. A positive control was included using 5 samples of invasive breast carcinoma, where MMP11 is known to be expressed. In addition, two samples from normal lung were included (gift from Calum MacAulay).
4.2.11 MMP11 antibody

Detection of MMP11 was done using a mouse monoclonal antibody purchased from LabVision (Cat.#MS-1035-P1ABX), which has high sensitivity and specificity in both Western blots and formalin-fixed, paraffin-embedded tissues.

4.2.12 Immunohistochemistry

Tissue sections were de-waxed by heating at 60°C for 20 minutes, followed by two 5 minute washes with xylene, two 2 minute washes with 100% ethanol, two 2 minute washes with 95% ethanol, two 2 minute washes with 70% ethanol, and two 5 minute washes with distilled water. Antigen retrieval was performed by placing slides in a container filled with citrate buffer (0.094g citric acid and 0.60g sodium citrate per 250mL H2O, pH adjusted to 6.0) and microwaving for 2.5 minutes at high power, 4 minutes at 10% power, and 4 minutes at 20% power. The container was cooled for 30 minutes in running water, after which the slides were removed and rinsed three times with distilled water. Quenching to block endogenous peroxidase activity was performed by washing slides in PBS (0.8g PBS, 0.8g potassium dihydrogen orthophosphate, 5.4g potassium chloride, 32g di-sodium hydrogen orthophosphate per 4L H2O) for 3 minutes, 3% H2O2 in a light protected container for 10 minutes, and twice with PBS for 5 minutes. Several drops of DAKO Universal Blocker were applied to each slide and incubated for 15 minutes. The slides were each tapped on paper towel to remove excess blocker. The MMP11 antibody was diluted 500-fold with 1% BSA-PBS (5.8mL 0.1% Triton-PBS and 200μL 30% BSA) and several drops applied to each slide. The slides were incubated for 60 minutes. The slides were then washed three times with 0.1% Triton-PBS for 1 minute. Several drops of DAKO Envision secondary antibody (polymer linker) were then applied to each slide and incubated for 30 minutes. The slides were each tapped on paper towel to remove excess antibody and then washed three times with PBS for 1 minute.
Several drops of DAB were applied to each slide and incubated for 7 minutes in a light protected container. The reaction was then stopped by rinsing the slides in running water. The slides were then counterstained with hematoxylin.

4.3  RESULTS

Thirty-nine SAGE libraries were constructed using: brushings of normal bronchial epithelium from four never smokers, twelve former and eight current smokers; and bulk samples from two pools of normal lung parenchyma, one squamous metaplasia, one dysplasia, five carcinoma in situ (CIS), and six invasive cancers (Table 4.1). In addition, a publicly available bulk normal lung library generated as part of a larger human transcriptome profiling effort (GEO Accession: GSM762) was included, bringing the size of the total dataset to 40 libraries. For brevity, the libraries are prefixed with the labels NS, FS, CS (i.e. never, former, current smoker); N, M, D, C and I, respectively (the public library is referred to by its existing accession). Sample and library preparation and data pre-processing are described in Sections 4.2.1-4.2.2. The majority of the libraries were sequenced to exceptional depth (>100,000 tags). The combined dataset contains a total of 4,614,031 tags, representing 256,361 unique sequences. 123,635 (48.2%) of these are observed more than once, and are therefore less likely to represent artefacts arising from PCR amplification or sequencing errors. The brushings and normal lung parenchyma libraries have been previously published with an accompanying analysis (Lonergan, 2006).

4.3.1 A global view of the transcriptome during the development of SCC

4.3.1.1 Multidimensional scaling analysis

An unsupervised clustering of profiles was accomplished using multidimensional scaling (MDS) (as described in Section 4.2.3). In order to reduce the effect of random sampling variation on the Pearson distance, only the 10,718 tags where a normalized count of 5/100,000 was observed in at least one library were considered.
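The tag filter described here and the Pearson-based dissimilarity of Section 4.2.3 can be sketched in plain Python. This is a minimal illustration on a toy count matrix, not the code used for the analysis: the function names and the toy counts are assumptions, and the resulting matrix of 1 − r values is what would then be passed to a classical MDS routine such as cmdscale.

```python
from math import sqrt


def normalize(counts):
    """Convert one library's raw tag counts to proportions (count / library size)."""
    size = sum(counts)
    return [c / size for c in counts]


def filter_tags(libraries, min_per_100k=5.0):
    """Indices of tags whose normalized count reaches the threshold
    (in tags per 100,000) in at least one library."""
    n_tags = len(libraries[0])
    return [k for k in range(n_tags)
            if any(lib[k] * 100_000 >= min_per_100k for lib in libraries)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def dissimilarity_matrix(raw_counts, min_per_100k=5.0):
    """Matrix of 1 - r_ij over the tags that pass the filter."""
    libs = [normalize(c) for c in raw_counts]
    keep = filter_tags(libs, min_per_100k)
    libs = [[lib[k] for k in keep] for lib in libs]
    n = len(libs)
    return [[1.0 - pearson(libs[i], libs[j]) for j in range(n)]
            for i in range(n)]


# Toy data (rows are libraries, columns are tags); values are illustrative only.
raw = [
    [120, 30, 0, 850],
    [240, 60, 0, 1700],  # same profile as the first library at twice the depth
    [10, 400, 1, 589],
]
d = dissimilarity_matrix(raw)
for row in d:
    print([round(v, 3) for v in row])
```

Normalizing by library size before computing the correlation means that two libraries with the same expression profile but different sequencing depth receive a dissimilarity of (numerically) zero, as the first two toy libraries show.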
The similarity between all 40 libraries was projected into three dimensions (Figure 4.2). The clearest distinction appears between libraries derived from epithelial brushings (normal or smoke-exposed) and those derived from bulk tissue samples (including the bulk normal), showing that the largest source of variation arises from the method used to obtain the samples. With this aspect of the data's structure identified, an additional MDS was performed separately on each of these two groups to identify substructure(s) that have a more relevant biological basis.

Table 4.1: Summary of SAGE libraries

ID        Description                                   Tag Count  Tag Types  % Count Singletons  % Type Singletons
NS90      normal bronchial epithelium (never smoker)       138419      31768                14.1               61.6
NS92      normal bronchial epithelium (never smoker)       127477      29118                13.9               60.8
NS93      normal bronchial epithelium (never smoker)       132292      30920                14.6               62.5
NS97      normal bronchial epithelium (never smoker)       139536      33155                14.9               62.6
FS03      normal bronchial epithelium (former smoker)      118395      27798                14.9               63.5
FS20      normal bronchial epithelium (former smoker)       59831      18401                20.0               64.9
FS21      normal bronchial epithelium (former smoker)      135539      31944                14.5               61.6
FS24      normal bronchial epithelium (former smoker)      140290      31507                13.8               61.3
FS32      normal bronchial epithelium (former smoker)       71637      21341                19.9               66.8
FS34      normal bronchial epithelium (former smoker)       63766      18487                19.2               66.4
FS38      normal bronchial epithelium (former smoker)       50331      16045                22.3               70.1
FS44      normal bronchial epithelium (former smoker)      147304      32528                13.6               61.6
FS49      normal bronchial epithelium (former smoker)      121901      30072                15.5               62.6
FS50      normal bronchial epithelium (former smoker)      113315      28663                16.2               64.1
FS73      normal bronchial epithelium (former smoker)      144650      34744                15.3               63.6
FS06+07   normal bronchial epithelium (former smoker)      159687      45248                13.9               71.1
CS26      normal bronchial epithelium (current smoker)      74155      21279                18.5               64.6
CS28      normal bronchial epithelium (current smoker)      66327      18410                18.0               64.8
CS36      normal bronchial epithelium (current smoker)      60645      16831                19.9               64.4
CS42      normal bronchial epithelium (current smoker)      69997      20026                18.3               64.0
CS48      normal bronchial epithelium (current smoker)      91467      22911                15.5               61.9
CS75      normal bronchial epithelium (current smoker)     156777      33869                13.2               61.1
CS80      normal bronchial epithelium (current smoker)     130300      30739                14.5               61.4
CS94      normal bronchial epithelium (current smoker)     135783      32171                14.7               62.2
GSM762    normal lung                                       89607      24694                18.1               65.5
N13       normal lung parenchyma (4 pooled samples)         45127      11776                16.6               63.8
N16       normal lung parenchyma (4 pooled samples)         42784      13102                20.8               68.0
M51       squamous metaplasia                              156100      35063                15.2               67.5
D101      dysplasia                                        136257      31194                14.0               61.4
C01       carcinoma in situ                                121363      28482                15.9               67.7
C02       carcinoma in situ                                134327      27341                13.1               64.6
C05       carcinoma in situ                                152599      34719                15.0               66.0
C27       carcinoma in situ                                129372      28787                13.9               62.7
C39       carcinoma in situ                                154387      33249                13.4               62.4
I04       invasive carcinoma                               138269      31164                14.5               64.3
I08       invasive carcinoma                               133099      25736                12.0               62.1
I11       invasive carcinoma (4 pooled samples)            117768      25641                13.8               63.3
I12       invasive carcinoma (4 pooled samples)            103119      21373                13.0               62.7
I18       invasive carcinoma                               150444      31817                13.2               62.2
I22       invasive carcinoma                               159588      45713                19.9               69.3
          Pool of all 40 libraries                        4614031     256361                 2.7               48.2

Summary of forty (40) SAGE libraries constructed with samples of normal bronchial epithelium from never (4), former (12) and current (8) smokers; normal lung tissue (3); squamous metaplasia (1); dysplasia (1); carcinoma in situ (5); and invasive carcinoma (6). Twenty-eight (28) libraries consist of >100,000 SAGE tags.

Figure 4.2: Multidimensional scaling (MDS) analysis of 40 SAGE libraries from samples reflecting different stages of SCC development. Datapoints are shaped and colour-coded according to the sample type: bulk normal is denoted with a white circle, brushings from bronchial epithelium are green circles (dark green are never smokers, green are former smokers, and cyan are current smokers), metaplasia is denoted with a blue upwards triangle, dysplasia with a purple downwards triangle, carcinoma in situ with a yellow diamond, and invasive carcinoma with a red square. The full 3D plot is displayed in the top left section. Cross-sections, clockwise from the top right, show the X-Y, Y-Z, and X-Z planes. The dashed circles on the X-Y plane highlight the separation of brushing and bulk tissue sample types.

The MDS analysis of the 24 brushing samples reveals evidence of transcriptome changes in the epithelium of the lung due to smoke exposure, although this distinction is not particularly strong (Figure 4.3). Almost half of the libraries are organized into a single, tight cluster containing 4/4 never smokers, 6/12 former smokers, and 1/8 current smokers. The remaining libraries are positioned at varying distances from the main cluster, with the former smokers tending to lie closest and current smokers tending to be most distant. Although these data strongly suggest significant changes in gene expression as a result of smoke exposure (and particularly recent exposure), there appear to be other sources of variation that can have a substantial impact on the transcriptome. These factors remain to be elucidated: no correlation was found between pack-years, lung function, or time since smoking cessation (Table 4.2) and the relationship between sample transcriptomes.

The MDS analysis of the 16 bulk tissue samples representing different stages of SCC development reveals clearer differences (Figure 4.4). The three normal samples form a distinct cluster apart from both pre-malignant and malignant samples. A further partition of the latter can be identified that separates 3/5 CIS samples and 3/6 invasive carcinoma samples from the pre-malignant samples and the remainder of the malignant samples.
This suggests that squamous differentiation, the hallmark of metaplasia, is the predominant source of variation in gene expression. Although there is a degree of delineation between pre-malignancy and malignancy, this phenotypic distinction is not a particularly strong feature of the data. Moreover, a distinct positioning of the invasive samples is not evident.

Table 4.2: Source patient information for bronchial epithelium brushings samples

ID        Gender  Age  Pack-years  Cessation time (years)  Lung function (%)
NS90      M       58   n/a         n/a                     115
NS92      F       56   n/a         n/a                     104
NS93      M       53   n/a         n/a                     n/a
NS97      F       81   n/a         n/a                     n/a
FS03      M       68   33          19                      50
FS06+07   M       69   100         1                       21
FS21      M       68   30          1                       30
FS24      M       70   75          17                      76
FS32      M       67   55          5                       n/a
FS20      M       65   82          10                      59
FS34      F       56   64          1.5                     71
FS44      F       63   45          4.5                     83
FS49      M       72   40          32                      87
FS50      F       71   56          16                      58
FS38      M       72   63          6                       n/a
FS73      M       69   55          21                      57
CS26      F       63   40          n/a                     69
CS28      M       56   62          n/a                     89
CS36      F       63   44          n/a                     96
CS42      M       68   81          n/a                     76
CS48      M       64   45          n/a                     73
CS75      M       66   53          n/a                     85
CS80      M       52   48          n/a                     63
CS94      F       55   34          n/a                     81

Figure 4.3: Multidimensional scaling (MDS) analysis of 24 SAGE libraries from brushings of bronchial epithelium with different levels of tobacco smoke exposure. Datapoints are colour-coded according to the sample type: dark green are never smokers, green are former smokers, and cyan are current smokers. The full 3D plot is displayed in the top left section. Cross-sections, clockwise from the top right, show the X-Y, Y-Z, and X-Z planes.

Figure 4.4: Multidimensional scaling (MDS) analysis of 16 SAGE libraries from bulk samples reflecting the different stages of SCC development. Datapoints are colour-coded according to the sample type: bulk normal is denoted with a white circle, metaplasia is denoted with a blue upwards triangle, dysplasia with a purple downwards triangle, carcinoma in situ with a yellow diamond, and invasive carcinoma with a red square. The full 3D plot is displayed in the top left section. Cross-sections, clockwise from the top right, show the X-Y, Y-Z, and X-Z planes. The dashed circles on the X-Y plot highlight the separation between normal, pre-malignant, and malignant samples.

These data strongly suggest that the majority of transcriptome change occurs at the earliest stages of SCC development.
While the changes associated with squamous differentiation appear most consistent, changes associated with malignant transformation appear to be substantial but highly variable. Two factors may explain this observation: 1) cell type heterogeneity (e.g. cells of stromal origin, residual normal epithelium, or members of the immune system) may introduce considerable variability, and 2) the particular genes that show altered expression and the extent of this alteration may differ substantially from tumour to tumour. Finally, gene expression changes associated with the invasive phenotype appear to be relatively limited. From this analysis, it is impossible to speculate on the variability of such changes.

Although within-stage similarity coupled with a separation of samples according to the stage of SCC development (i.e. invasive samples most distant from normal, CIS between normal and invasive, etc.) would be an ideal result showing a strong relationship between histological classification and the transcriptome, the observed result is not entirely inconsistent with existing knowledge. Notably, genetic heterogeneity is an accepted reality in lung tumours (i.e. candidate oncogenes always appear disrupted in a certain percentage of tumours, typically 20-50%), and the high percentage of CIS that eventually progress to a full-blown invasive carcinoma (90-100%) suggests the number of additional molecular changes required to facilitate this process is rather small.

4.3.1.2 Common gene expression patterns identified by k-means clustering

In order to explore the transcriptome further, a k-means clustering analysis was performed (as detailed in Section 4.2.4). A total of 2,222 tags with a mean expression of 5/100,000 tags were included. The elbow criterion suggests k=9 clusters are sufficient to capture the most prominent features of the data (Figure 4.5A). Of these tags, 2,087 (93.9%) cluster into one of four patterns of expression (Figure 4.5B).
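The elbow criterion mentioned above can be sketched as follows: cluster the tag profiles for a range of k, record the total within-cluster dispersion, and look for the k beyond which the curve flattens. This is a minimal illustration under stated assumptions, not the method of Section 4.2.4. It swaps in ordinary Euclidean k-means with a sum-of-squares dispersion in place of the Poisson-based dispersion (S) of Cai et al. (2004), uses a deterministic farthest-first initialization chosen only for reproducibility, and expresses each tag as library-size-corrected ratios across samples so that every profile sums to 1. All function names and the toy data are hypothetical.

```python
import numpy as np


def expression_ratios(counts):
    """Correct for library size, then scale each tag (row) to sum to 1.

    Assumes every tag has at least one non-zero count.
    """
    counts = np.asarray(counts, dtype=float)
    per_library = counts / counts.sum(axis=0, keepdims=True)
    return per_library / per_library.sum(axis=1, keepdims=True)


def farthest_first(data, k):
    """Deterministic initial centres: start at row 0, then add the farthest point."""
    centres = [data[0]]
    for _ in range(k - 1):
        d2 = np.min([((data - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(data[int(np.argmax(d2))])
    return np.array(centres)


def assign(data, centres):
    """Squared Euclidean distances to each centre and the nearest-centre labels."""
    d2 = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), d2


def kmeans(data, k, n_iter=100):
    """Lloyd's algorithm; returns labels, centres and within-cluster dispersion."""
    data = np.asarray(data, dtype=float)
    centres = farthest_first(data, k)
    for _ in range(n_iter):
        labels, _ = assign(data, centres)
        new_centres = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    labels, d2 = assign(data, centres)
    dispersion = d2[np.arange(len(data)), labels].sum()
    return labels, centres, dispersion


def elbow_curve(data, k_values):
    """Total within-cluster dispersion for each candidate number of clusters."""
    return [kmeans(data, k)[2] for k in k_values]


# Hypothetical toy profiles: three well-separated groups of three points each.
toy = np.array([[0, 0], [0, 1], [1, 0],
                [10, 10], [10, 11], [11, 10],
                [20, 0], [20, 1], [21, 0]], dtype=float)
toy_curve = elbow_curve(toy, range(1, 6))
```

On the toy data the dispersion drops sharply up to k=3 and changes little afterwards, which is the shape the elbow criterion looks for; with real tag profiles the bend is usually less clear-cut and the choice of k involves some judgment.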
These clusters appear to represent distinct transcriptome types or states, as the four appear in equilibrium with one another. Of particular note is the opposing relationship of Cluster A with Cluster B. This is a significant concern for the analysis of the bulk samples, where the contribution of these two clusters is highly variable. Given the observed pattern and the identity of these samples, a reasonable conclusion is that an increase in Cluster A tags indicates a greater number of stromal cells (e.g. fibroblasts, endothelial cells), while Cluster B indicates a greater number of the actual pre-malignant or malignant cells (e.g. squamous epithelium). An increase in expression of tags in Clusters C and D is strongly associated with epithelial brushing samples. The two clusters are similar except for an increase in the expression of Cluster D tags in samples from current smokers. Given the purity of these samples, these two clusters likely correlate with the baseline expression of normal epithelium and an additional “stressed” state (i.e. as a result of current exposure to tobacco smoke), respectively. Furthermore, the values for the invasive sample I22 and, to a lesser extent, the dysplasia sample D101 seem to suggest some contamination with these cells. This manifests in the MDS analysis as a shift of the position of these samples towards the space occupied by the brushing samples (Figure 4.2). This may have resulted from the inadvertent capture of tissue surrounding the margin of these sites of interest.

Figure 4.5: Determination of cluster number for k-means analysis and the four major clusters. The top graph (A) is an elbow plot used to estimate a reasonable number of clusters. The x-axis is the number of clusters, and the y-axis is the total within-cluster dispersion (S) as defined in Cai et al. (2004). The chosen number (k=9) is indicated. The bottom graph (B) depicts the values of the four most populated clusters: Cluster A (948 tags, housekeeping type I), Cluster B (571 tags, housekeeping type II), Cluster C (458 tags, housekeeping type III), and Cluster D (110 tags, ciliated epithelial cells). The x-axis shows the sample and the y-axis shows the value of the cluster centre. The cluster centre values include a correction for the library size (see Section 4.2.4) and correspond to the ratio of expression (e.g. 0.10 means that 10% of the total observed expression appears in a given sample).

A GO enrichment analysis was performed on each cluster to determine if certain biological themes are overrepresented (as detailed in Section 4.2.6). The tags in each cluster were mapped to Unigene entries and submitted to the online DAVID analysis resource (Table 4.3) (Huang da, 2007). When enriched GO terms were identified with an FDR ≤5%, several clusters could be associated with specific themes. Cluster A is enriched for genes encoding intracellular proteins and functions that could be associated with general cellular housekeeping, including primary metabolic processes, gene expression, and translation. Cluster B and Cluster C did not show any significant enrichments. Cluster D was associated with the axoneme and contained genes previously found in normal lung, consistent with their increased presence in the ciliated epithelial cells collected by brushing.

The minor clusters contain small numbers of genes with obvious and specific roles (Figure 4.6). Cluster E contains genes involved in keratinocyte differentiation. Cluster F contains genes whose products are secreted and carry out structural and regulatory roles in the extracellular matrix. Cluster G contains immunoglobulins, indicating these samples have an increased infiltration by immune cells. Cluster H contains secretoglobin, a gene known to be expressed in the lung but whose function is not yet clear. Finally, Cluster I contains several pulmonary surfactants which are important for maintaining the structural integrity and function of the alveoli.

Table 4.3: Description of clusters identified by k-means analysis

Cluster  Unique tags  Unigene IDs  Unique DAVID IDs  Putative biological significance
A        948          624          534               housekeeping type I
B        571          281          220               housekeeping type II
C        458          307          243               housekeeping type III
D        110          76           60                ciliated epithelial cells
E        54           36           33                keratinocyte differentiation
F        44           30           26                extracellular matrix
G        15           7            4                 immune system markers
H        12           2            1                 secretoglobin
I        10           7            4                 pulmonary surfactants

Figure 4.6: The five minor k-means clusters. The clustering algorithm was set to determine k=9 clusters. The top graph (A) depicts the values of the fifth and sixth most populated clusters: Cluster E (54 tags, keratinocyte differentiation) and Cluster F (44 tags, extracellular matrix). The bottom graph (B) depicts the values of the seventh to ninth most populated clusters: Cluster G (15 tags, immune system markers), Cluster H (12 tags, secretoglobin), and Cluster I (10 tags, pulmonary surfactants). Both plots show the samples on the x-axis and the values of the cluster centres on the y-axis. The cluster centre values include a correction for the library size (see Section 4.2.4) and correspond to the ratio of expression (e.g. 0.10 means that 10% of the total observed expression appears in a given sample).

4.3.2 Transcriptional signatures of developmental stages

A supervised classification strategy was undertaken to identify sets of genes that are highly correlated with the stages of SCC development (detailed in Section 4.2.7). The Poisson mixture model (Chapter II) was particularly well suited to this dataset; indeed, it was partially inspired by the challenges this SAGE data presented. Specifically, the notion of mixture “components”, which essentially act as a discrete Bayesian prior distribution, can account for: a) a wide range of expression, possibly the result of different mechanisms of dysregulation or the extent to which they occur (e.g. the number of copies involved in copy number variation); and b) some of the cellular heterogeneity revealed by the k-means analysis (Section 4.3.1.2). A continuous, unimodal prior (such as the Beta or Gamma distribution), which is typically used in studies similar to this, fits poorly in such cases and regularly provides questionable estimates of significance (Zuyderduyn, 2007).

4.3.2.1 Changes associated with tobacco smoke exposure

The incidence of lung SCC is strongly associated with long-term exposure to tobacco smoke. Some gene expression changes that occur as a result of smoking may have causal or mitigating roles in the ultimate development of a tumour, while others may be a transient cellular response to this exposure that ceases upon smoking cessation. To explore the latter, an analysis of the subset of SAGE libraries from bronchial brushings was performed by searching for gene expression changes that were specific to one of the groups of never (NS), former (FS), and current (CS) smokers (Figure 4.7). Although substantial numbers of tags were identified that appear differentially expressed in the CS group and likely represent differences resulting from an acute response to tobacco exposure, none could be identified in the NS group. This supports the notion that the bronchial epithelium reverts to a near-normal state upon smoking cessation. However, a small number of changes were identified in the FS group, suggesting the presence of a distinct and persistent gene expression signature in cells that have had substantial past exposure to tobacco smoke.

4.3.2.1.1 Acute response to tobacco smoke exposure

In the CS group, a set of 70 upregulated and 9 downregulated tags were identified (Figures 4.8-4.9). Based on LOOCV, the sensitivity and specificity of these signatures are 66.2% and 86.4%, respectively, for the upregulated tags and 64.1% and 80.6%, respectively, for the downregulated tags. Optimal signatures were constructed to maximize the selectivity. This resulted in a 7 gene upregulated signature (ADH7, NQO1, CYP1B1, ALDH3A1, GPX2, AKR1B10, TFF3) with a sensitivity and specificity of 79.7% and 95.8%, respectively. Correspondingly, a single downregulated gene (C3) was identified with a sensitivity and specificity of 96.4% and 99.4%, respectively. The upregulated signature showed excellent agreement with all 3 of the validation microarray datasets, with a combined sensitivity and specificity of 92.0% and 92.8%, respectively (Table 4.4).
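As a worked check on the combined figures just quoted, the sketch below recomputes sensitivity and specificity from the upregulated-signature rows of Table 4.4. It assumes that each x/n entry in the table is the number of correctly classified samples out of the group total, that current smokers are the positive class, and that never and former smokers together form the negative class. Under that reading the upregulated rows are reproduced to within 0.1 percentage points; the sketch makes no claim about how the downregulated rows were scored. The function names are hypothetical.

```python
# Upregulated-signature rows of Table 4.4: group -> (correctly classified, total).
# The Carolan dataset has no former smokers, so that group is simply absent.
TABLE_4_4_UP = {
    "Spira":   {"never": (23, 23), "former": (14, 18), "current": (33, 34)},
    "Carolan": {"never": (18, 18), "current": (26, 26)},
    "Beane":   {"never": (21, 21), "former": (27, 31), "current": (44, 52)},
}


def sensitivity_specificity(groups, positive="current"):
    """Sensitivity on the positive group, specificity on all other groups pooled."""
    true_pos, n_pos = groups[positive]
    true_neg = sum(c for name, (c, _) in groups.items() if name != positive)
    n_neg = sum(t for name, (_, t) in groups.items() if name != positive)
    return true_pos / n_pos, true_neg / n_neg


def pool(datasets):
    """Add up the (correct, total) pairs of each group across datasets."""
    pooled = {}
    for groups in datasets.values():
        for name, (correct, total) in groups.items():
            c, t = pooled.get(name, (0, 0))
            pooled[name] = (c + correct, t + total)
    return pooled


for dataset, groups in TABLE_4_4_UP.items():
    sens, spec = sensitivity_specificity(groups)
    print(f"{dataset}: sensitivity {100 * sens:.2f}%, specificity {100 * spec:.2f}%")

combined_sens, combined_spec = sensitivity_specificity(pool(TABLE_4_4_UP))
print(f"combined: sensitivity {100 * combined_sens:.2f}%, specificity {100 * combined_spec:.2f}%")
```

Pooling the three datasets gives 103/112 current smokers and 103/111 never or former smokers correctly classified, which corresponds to the combined 92.0% and 92.8% reported above.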
However, the downregulated signature showed inconsistent performance. Although the C3 gene performed well in classifying the Carolan dataset, specificity was poor on the Spira and Beane datasets. The combined sensitivity and specificity were 96.4% and 51.3%, respectively.

Table 4.4: Microarray validation of smoke-exposure signatures

Regulation  Dataset  Genes  Never smokers  Former smokers  Current smokers  Sensitivity  Specificity
Up          Spira    7      23/23          14/18           33/34            97.0%        90.2%
Up          Carolan  6      18/18          n/a             26/26            100.0%       100.0%
Up          Beane    7      21/21          27/31           44/52            84.6%        92.3%
Down        Spira    1      20/23          14/18           21/34            92.1%        54.0%
Down        Carolan  1      18/18          n/a             24/26            100.0%       90.0%
Down        Beane    1      20/21          20/31           27/52            97.9%        35.7%

Classification performance of genes mapped to SAGE tags identified as upregulated in current smokers compared to never or former smokers. k-means clustering (k=2) was used to classify the samples. The optimal upregulated signature of 7 tags and the optimal signature of 1 downregulated tag were used. Note: AKR1B10 was not present on the Carolan dataset microarrays.

Figure 4.7: Candidate smoke exposure tag selection plots. Each plot depicts the number of tags (bar plot, left y-axis) and estimated false discovery rate (FDR) (line plot, right y-axis) selected at a given threshold test statistic value (x-axis) (the thick vertical line marks the 5% FDR level). The number of tags selected at the 5% FDR level and the value of the test statistic (Θ) are shown at the top-right of each plot. Panel annotations: never smokers, 0 tags (Θ = N/A) both upregulated and downregulated; former smokers, 4 upregulated tags (Θ = 0.58) and 1 downregulated tag (Θ = 0.62); current smokers, 70 upregulated tags (Θ = 0.48) and 9 downregulated tags (Θ = 0.58).

Figure 4.8: Heatmap with sample-wise hierarchical clustering of 70 tags upregulated in bronchial epithelium from current smokers. Samples are bronchial brushings from never, former, and current smokers. The dendrogram is generated using complete-linkage with a Poisson-based distance metric based on Cai et al. (2004). Each cell of the heatmap is based on the percentage expression across all samples to provide consistent colouration. A colour key shows the range of colours from 0-10% expression, with values exceeding this shown in bright red. A colour key placed between the dendrogram and the heatmap denotes the sample type: never smokers are dark green, former smokers are light green, and current smokers are cyan. Additional nucleotides identified by the XBP-SAGE method (Chapter II) are shown. Tags prefixed with an asterisk (*) indicate that the additional nucleotides were needed for a successful mapping. Row labels (mapped genes, in the order extracted): ADH7, NQO1, CYP1B1, ALDH3A1, GPX2, AKR1B10, TFF3, CLU, PIR, CBR1, ambiguous, MSMB, ALDH3A1 (antisense artefact), CYP1A1, CABYR, AKR1C2, GSTA2, UCHL1, FTH1, SPDEF, ALDH3A1 (antisense artefact), ALDH3A1 (transcript variant), TMEM45B, MSMB (antisense artefact), PGD, ADH7 (antisense artefact), AKR1C2, MUC5AC, SLC7A11, ALDH3A1, SBEM, ALDH3A1 (antisense artefact), ambiguous, SRXN1, ambiguous, TXN, RNF128, MSMB (sequencing artefact), novel, S100A1L, AKR1C3, NQO1, ambiguous, ambiguous, GSTP1, AGR2, ABCB10, UPK1B, SPP1, AKR1C1, NQO1 (antisense artefact), S100P, FTL, GPX4, ALDH1A1 (antisense artefact), C20orf114, MSMB (sequencing artefact), TALDO1, Hs.610279, CYB5R1, NT5DC1, GSR, ambiguous, Hs.636243/HLA-DQA1, ambiguous, ambiguous, SPAG7/RPL29, HTATIP2, SEPX1, CYP1B1.

Figure 4.9: Heatmap with sample-wise hierarchical clustering of 9 tags downregulated in bronchial epithelium from current smokers. Samples are bronchial brushings from never, former, and current smokers. The dendrogram is generated using complete-linkage with a Poisson-based distance metric based on Cai et al. (2004). Each cell of the heatmap is based on the percentage expression across all samples to provide consistent colouration. A colour key shows the range of colours from 0-10% expression, with values exceeding this shown in bright red. A colour key placed between the dendrogram and the heatmap denotes the sample type: never smokers are dark green, former smokers are light green, and current smokers are cyan. Additional nucleotides identified by the XBP-SAGE method (Chapter II) are shown. Tags prefixed with an asterisk (*) indicate that the additional nucleotides were needed for a successful mapping. Row labels (mapped genes, in the order extracted): C3, LYPD2, ambiguous, ambiguous, SAA1/SAA2, ambiguous, LRRC16, ABCA13, ambiguous.

The 70 upregulated tags were mapped to 37 unique Unigene identifiers and a GO enrichment analysis revealed the terms “electron transport” (GO:0055114; FDR<0.1%) and “oxidoreductase activity, acting on CH-OH group of donors” (GO:0016614; FDR=0.1%) as significant themes. Furthermore, a KEGG enrichment analysis revealed an overrepresentation of members of the pathways responsible for the “metabolism of xenobiotics by cytochrome P450” (KEGG:hsa00980; FDR<0.1%) and “glutathione metabolism” (KEGG:hsa00480; FDR<0.1%). Both of these pathways work in concert to drive the metabolism of a diverse range of chemicals, including those found in tobacco smoke.
Indeed, these two pathways form the canonical Phase I and Phase II enzyme system that is the first line of defence against foreign compounds. The Phase I reactions, involving cytochrome P450 in particular, drive various modification reactions that enable Phase II reactions, which include glutathione conjugation, to detoxify the substrate. Specifically, ADH7, AKR1C2, AKR1C3, ALDH3A1, CBR1, CYP1A1, and CYP1B1 are known Phase I enzymes, while GPX2, GPX4, GSR, GSTA2, GSTP1, NQO1, and TALDO1 are known Phase II enzymes. All but one of the genes present in the 7 gene optimal signature are known players in this system. The only exception, TFF3, is a component of the epithelial mucosa of several tissues, including the lung. It is thought to be responsive to epithelial damage and involved in effecting the repair of the mucosa (Wiede, 1999; Oertel, 2001; Hoffmann, 2007).

4.3.2.1.2 Persistent response to tobacco smoke exposure

No significant differences were found that were specific to the NS group, which would correspond to changes that persist even after an individual has stopped smoking. This suggests that the majority of the cellular response to tobacco smoke is acute and returns to a predominantly normal state once the exposure is removed. However, a small number of genes were identified in the FS group (Figure 4.10A-B). Such changes do not fit well with the notion of a dichotomous state that is influenced by past or present smoke exposure. Rather, these changes would correspond to a past exposure signature distinct from the acute response. A set of 4 upregulated tags had a sensitivity and specificity of 74.5% and 88.6%, respectively. However, this signature showed no consistent agreement with the Spira and Beane microarray datasets, which both contain former smokers. The single downregulated tag maps to EPHX1, and the sensitivity and specificity of this gene in identifying a former smoker are 65.9% and 84.8%, respectively.
Although this gene did not show significant change in the Spira dataset, the Beane dataset also displayed a noticeable downregulation of EPHX1 in former smokers, particularly in those who had quit for longer than 24 months (Figure 4.10C). The former smokers in the Spira dataset include a large number of individuals who smoked for a relatively short period of time (≤10 pack-years) and/or had quit for a short period of time (<12 months) before the samples were taken. It is possible that EPHX1 is downregulated following cessation after long-term exposure to tobacco smoke. Such an event would be interesting because of its role in the Phase I enzyme system, which is so clearly affected by acute exposure. Moreover, the gene is prominent in a known mechanism of tobacco smoke-induced mutagenesis. Normally EPHX1 plays a protective role, but benzo[a]pyrene, the major mutagen in tobacco smoke, is a procarcinogen that must undergo modification by CYP1A1 or CYP1B1 together with EPHX1 to form benzo[a]pyrene-7,8-dihydrodiol-9,10-oxide, a potent carcinogen (Denissenko, 1996). A loss of EPHX1 activity following cessation may then render the lung tissue more susceptible to mutation. Although the totality of the data suggests there is a true effect, the duration and amount of exposure and the time since smoking cessation appear to be important variables that cannot be adequately modelled with the available data.
[Figure 4.10. Panel A: heatmap (colour key 0 to >10% percent expression) of the 4 tags upregulated in former smokers: LCN2 (TGCCCTCAGG), LCN2 (*TGCCCTCAAA), SLPI (TGTGGGAAAT), and UBD (ATTTTTACTA). Panel B: heatmap (colour key 0 to 6% percent expression) of the downregulated EPHX1 tag (GCTTTGATGA). Panel C: boxplot of EPHX1 log hybridization in never, former, and current smokers.]

Figure 4.10: Heatmaps with sample-wise hierarchical clustering of 4 tags upregulated and 1 tag downregulated in former, but not current, smokers. A boxplot of the expression of the candidate downregulated gene EPHX1 in the Beane microarray dataset is also shown. Samples are bronchial brushings from never, former, and current smokers. The dendrogram is generated using complete-linkage with a Poisson-based distance metric based on Cai et al. (2004). Each cell of the heatmap is based on the percentage expression across all samples to provide consistent colouration. A colour key shows the range of colours from 0-10% expression, with values exceeding this shown in bright red. A colour key placed between the dendrogram and the heatmap denotes the sample type: never smokers are dark green, former smokers are light green, and current smokers are cyan. Additional nucleotides identified by the XBP-SAGE method (Chapter II) are shown. Tags prefixed with an asterisk (*) indicate that the additional nucleotides were needed for a successful mapping.

4.3.2.2 Changes associated with SCC development

Gene expression differences associated with SCC development were explored by grouping the bulk samples into 4 groups: normal, pre-malignant (metaplasia/dysplasia), CIS, and invasive carcinoma. The brushing libraries were included as an additional group, and were always used to support the bulk normal lung samples during classification.
Since the k-means analysis suggests that bulk normal samples are primarily composed of stromal cells, while the brushing samples are primarily composed of epithelial cells, a putative gene expression marker must show a significant dysregulation compared to both of these groups. This approach has some cost in terms of power, as some differences that would have been apparent when comparing a pure population of pre-malignant or malignant cells to a pure population of epithelial cells may be missed. In this case, a change in expression must be apparent in a mixed population when compared to two relatively pure populations. Nevertheless, the candidates that are identified can be regarded with high confidence. Attempts to model the amount of cellular heterogeneity proved unsuccessful and, interestingly, resulted in the identification of similar candidates (data not shown). SAGE tags were identified at a cutoff FDR of 5% for each stage of progression (plots of the FDR over a range of test scores are shown in Figure 4.11). In addition, candidates were identified for all other possible combinations of the four groups not consistent with the known progression of SCC (the number of candidates for each combination is summarized in Figure 4.12). Of the 789 differentially expressed tags, 630 (80%) correspond to the accepted order of SCC progression. Of the remaining 159 tags, 146 (92%) are upregulated either in metaplasia, dysplasia, and CIS, or in CIS alone. This would be expected in a situation where stromal cells are a considerable presence in the invasive samples, as suggested by the k-means analysis. These two categories likely correspond to genes that are, in reality, associated with squamous differentiation or malignancy, respectively, and have simply had their increase obscured.
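The selection of tags at a fixed FDR, as plotted in Figure 4.11, can be sketched in Python as follows. The sketch assumes that larger values of the test statistic Θ indicate stronger evidence, and that the FDR at a threshold is estimated as the mean number of tags reaching it in permuted (null) data divided by the number reaching it in the observed data. The estimator actually used in this thesis may differ, and `select_threshold` and all values below are hypothetical.

```python
from statistics import mean


def select_threshold(observed, null_permutations, target_fdr=0.05):
    """Return (threshold, n_selected, estimated_fdr) for the most permissive
    threshold whose estimated FDR is at or below target_fdr, or None."""
    # Scan candidate thresholds from most to least permissive.
    for theta in sorted(set(observed)):
        n_selected = sum(score >= theta for score in observed)
        # Expected number of false discoveries: average count of null scores
        # that reach the same threshold across permutations.
        expected_false = mean(
            sum(score >= theta for score in perm) for perm in null_permutations
        )
        fdr = expected_false / n_selected
        if fdr <= target_fdr:
            return theta, n_selected, fdr
    return None


# Made-up statistics for eight tags and two permutations of the sample labels.
observed = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.1, 0.05]
null_permutations = [
    [0.35, 0.2, 0.1, 0.1, 0.05, 0.05, 0.02, 0.01],
    [0.25, 0.22, 0.15, 0.1, 0.08, 0.04, 0.03, 0.01],
]

result = select_threshold(observed, null_permutations)
print(result)
```

With these toy values the estimated FDR is 10% at a threshold of 0.3 and falls to 0% at 0.6, so four tags are selected. Because the estimated FDR need not decrease monotonically with the threshold, scanning from the most permissive value is a design choice of this sketch rather than a requirement.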
A GO/KEGG enrichment analysis on each of these combinations of sample types supports this interpretation. While tags upregulated in metaplasia and later stages compared to normal map to genes that are strongly associated with keratinocyte differentiation and cornified envelope formation, which is consistent with a squamous cell phenotype, these terms are also enriched in the two anomalous comparisons.

[Figure 4.11. Tag selection plots arranged by direction (upregulated, downregulated) and stage (pre-malignant, malignant, invasive); x-axis: statistic (Θ), 0.0 to 1.0; left y-axis: tags, 1 to 10000; right y-axis: FDR, 0 to 100%. Annotations at the 5% FDR level: pre-malignant, 138 upregulated tags (Θ = 0.235) and 316 downregulated tags (Θ = 0.195); malignant, 143 upregulated tags (Θ = 0.240) and 26 downregulated tags (Θ = 0.320); invasive, 7 upregulated tags (Θ = 0.605), 29 upregulated tags with I22 excluded (Θ = 0.575), and 0 downregulated tags (Θ = N/A).]

Figure 4.11: Candidate squamous cell lung cancer progression tag selection plots. Each plot depicts the number of tags (bar plot, left y-axis) and estimated false discovery rate (FDR) (line plot, right y-axis) selected at a given threshold test statistic value (x-axis) (the thick vertical line marks the 5% FDR level). The number of tags selected at the 5% FDR level and the value of the test statistic (Θ) are shown at the top-right of each plot. The plot for upregulated invasive tags (bottom, far-right) is overlaid with the results of the procedure with sample I22 excluded (bar plot in light green, and line plot in green).

[Figure 4.12. Counts shown in the left Venn diagram: 316, 1407, 26, 0, 0, 0, 11, 108, 2, 138, 38, 0, 143, 7, 0. Region labels in the right Venn diagram: downregulated squamous differentiation; enriched in stroma; upregulated invasive; upregulated malignant; upregulated squamous differentiation; absent in stroma; downregulated invasive; downregulated malignant. Sets: bronchial brushings, normal, metaplasia/dysplasia, CIS, invasive.]

Figure 4.12: Venn diagram of the number of candidate tags identified for different combinations of SCC progression sample types. The left Venn diagram contains the number of candidates, and the right Venn diagram is a legend indicating the biological significance of each of the diagram’s regions. The dashed outline indicates the bronchial brushing sample type, which is combined with the bulk normal sample type in all comparisons, except for a test of brushings only (1407).

4.3.2.2.1 Changes associated with pre-malignant transformation

A total of 138 upregulated and 316 downregulated tags were associated with the transition from normal lung to pre-malignant metaplasia (Figures 4.13.1-4.13.2 and 4.14.1-4.14.4, respectively). The placement of the unusual sample I22 in hierarchical clustering is consistent with the notion of contamination by normal epithelium suggested by the k-means analysis.
The upregulated tags mapped to 85 unique UniGene entries, and a GO enrichment analysis of these genes revealed the term “keratinocyte differentiation” (GO:0030216; FDR < 0.1%), consistent with a squamous cell phenotype. This is implied by the upregulation of the genes CSTA, SDC1, SPRR1A, SPRR2A, SFN, and TP63 (Gibbs, 1993; Lee, 2000; Nakajima, 2003; Ojeh, 2008; Truong, 2006). Moreover, KRT6A, KRT6B, and KRT14 are upregulated, consistent with keratinisation, a histological hallmark of metaplasia and SCC. Keratins may also play an active role in progression. For example, a recent study transfected mice with the human KRT14 gene, and constitutively activated it in lung Clara cells (Dakir, 2008). This was sufficient to activate squamous differentiation, as indicated by an increase in the expression of particular markers, including SPRR1A. However, it was insufficient to immediately initiate squamous maturation, although these mice developed hyperplastic and metaplastic lesions with time. Based on LOOCV, the sensitivity and specificity of these signatures are 72.4% and 81.6%, respectively, for the upregulated tags and 83.1% and 76.7%, respectively, for the downregulated tags. Optimal signatures were constructed to maximize the selectivity.
This resulted in a 2 gene upregulated signature (S100A2 and KRT6A) with a sensitivity and specificity of 93.4% and 96.3%, respectively. Correspondingly, a 2 gene downregulated signature (C16orf89 and C5orf32) was identified with a sensitivity and specificity of 93.1% and 93.8%, respectively. The upregulation of both S100A2 and KRT6A is strongly supported by the three SCC microarray datasets (Figure 4.15). However, neither C16orf89 nor C5orf32 is present on these arrays and their downregulation in early stage lesions is yet to be confirmed.

[Figure 4.13.1. Heatmap with sample dendrogram; colour key 0 to >10% percent expression; rows are tags with gene annotations, columns are sample libraries. Leading rows: S100A2, KRT6A, ambiguous, KRT17, SFN, PKP1, Hs.579297, TP63.]

Figure 4.13.1: Heatmap with sample-wise hierarchical clustering of the first 69 of 138 tags upregulated in metaplasia and later stages of SCC development. The dendrogram is generated using average-linkage with a Poisson-based distance metric based on Cai et al. (2004). Each cell of the heatmap is based on the percentage expression across all samples to provide consistent colouration. A colour key shows the range of colours from 0-10% expression, with values exceeding this shown in bright red. A colour key placed between the dendrogram and the heatmap denotes the sample type: white is bulk normal; epithelial brushings from never, former, and current smokers shown in dark green, light green, and cyan, respectively; metaplasia in blue; dysplasia in purple; carcinoma in situ in yellow; and invasive carcinoma in red. Additional nucleotides identified by the XBP-SAGE method (Chapter II) are shown. Tags prefixed with an asterisk (*) indicate that the additional nucleotides were needed for a successful mapping.

[Figure 4.13.2. Heatmap with sample dendrogram; colour key 0 to >10% percent expression; rows are tags with gene annotations, columns are sample libraries.]

Figure 4.13.2: Heatmap with sample-wise hierarchical clustering of the final 69 of 138 tags upregulated in metaplasia and later stages of SCC development. The dendrogram is generated using average-linkage with a Poisson-based distance metric based on Cai et al. (2004). Each cell of the heatmap is based on the percentage expression across all samples to provide consistent colouration. A colour key shows the range of colours from 0-10% expression, with values exceeding this shown in bright red. A colour key placed between the dendrogram and the heatmap denotes the sample type: white is bulk normal; epithelial brushings from never, former, and current smokers shown in dark green, light green, and cyan, respectively; metaplasia in blue; dysplasia in purple; carcinoma in situ in yellow; and invasive carcinoma in red. Additional nucleotides identified by the XBP-SAGE method (Chapter II) are shown. Tags prefixed with an asterisk (*) indicate that the additional nucleotides were needed for a successful mapping.

[Figure 4.14.1. Heatmap with sample dendrogram; colour key 0 to >10% percent expression; rows are tags with gene annotations, columns are sample libraries. Leading rows: C16orf89, C5orf32.]

Figure 4.14.1: Heatmap with sample-wise hierarchical clustering of the first 79 of 316 tags downregulated in metaplasia and later stages of SCC development. The dendrogram is generated using average-linkage with a Poisson-based distance metric based on Cai et al. (2004). Each cell of the heatmap is based on the percentage expression across all samples to provide consistent colouration. A colour key shows the range of colours from 0-10% expression, with values exceeding this shown in bright red. A colour key placed between the dendrogram and the heatmap denotes the sample type: white is bulk normal; epithelial brushings from never, former, and current smokers shown in dark green, light green, and cyan, respectively; metaplasia in blue; dysplasia in purple; carcinoma in situ in yellow; and invasive carcinoma in red. Additional nucleotides identified by the XBP-SAGE method (Chapter II) are shown. Tags prefixed with an asterisk (*) indicate that the additional nucleotides were needed for a successful mapping.

[Figure 4.14.2. Heatmap with sample dendrogram; colour key 0 to >10% percent expression; rows are tags with gene annotations, columns are sample libraries.]

Figure 4.14.2: Heatmap with sample-wise hierarchical clustering of the second 79 of 316 tags downregulated in metaplasia and later stages of SCC development. The dendrogram is generated using average-linkage with a Poisson-based distance metric based on Cai et al. (2004). Each cell of the heatmap is based on the percentage expression across all samples to provide consistent colouration. A colour key shows the range of colours from 0-10% expression, with values exceeding this shown in bright red. A colour key placed between the dendrogram and the heatmap denotes the sample type: white is bulk normal; epithelial brushings from never, former, and current smokers shown in dark green, light green, and cyan, respectively; metaplasia in blue; dysplasia in purple; carcinoma in situ in yellow; and invasive carcinoma in red. Additional nucleotides identified by the XBP-SAGE method (Chapter II) are shown. Tags prefixed with an asterisk (*) indicate that the additional nucleotides were needed for a successful mapping.

[Figure 4.14.3. Heatmap with sample dendrogram; colour key 0 to >10% percent expression; rows are tags with gene annotations, columns are sample libraries.]

Figure 4.14.3: Heatmap with sample-wise hierarchical clustering of the third 79 of 316 tags downregulated in metaplasia and later stages of SCC development. The dendrogram is generated using average-linkage with a Poisson-based distance metric based on Cai et al. (2004). Each cell of the heatmap is based on the percentage expression across all samples to provide consistent colouration. A colour key shows the range of colours from 0-10% expression, with values exceeding this shown in bright red. A colour key placed between the dendrogram and the heatmap denotes the sample type: white is bulk normal; epithelial brushings from never, former, and current smokers shown in dark green, light green, and cyan, respectively; metaplasia in blue; dysplasia in purple; carcinoma in situ in yellow; and invasive carcinoma in red. Additional nucleotides identified by the XBP-SAGE method (Chapter II) are shown. Tags prefixed with an asterisk (*) indicate that the additional nucleotides were needed for a successful mapping.

[Figure 4.14.4. Heatmap with sample dendrogram; colour key 0 to >10% percent expression; rows are tags with gene annotations, columns are sample libraries.]

Figure 4.14.4: Heatmap with sample-wise hierarchical clustering of the final 79 of 316 tags downregulated in metaplasia and later stages of SCC development. The dendrogram is generated using average-linkage with a Poisson-based distance metric based on Cai et al. (2004). Each cell of the heatmap is based on the percentage expression across all samples to provide consistent colouration. A colour key shows the range of colours from 0-10% expression, with values exceeding this shown in bright red. A colour key placed between the dendrogram and the heatmap denotes the sample type: white is bulk normal; epithelial brushings from never, former, and current smokers shown in dark green, light green, and cyan, respectively; metaplasia in blue; dysplasia in purple; carcinoma in situ in yellow; and invasive carcinoma in red. Additional nucleotides identified by the XBP-SAGE method (Chapter II) are shown. Tags prefixed with an asterisk (*) indicate that the additional nucleotides were needed for a successful mapping.

Both upregulated genes have well established roles in squamous cell differentiation and lung cancer. S100A2 has been previously shown to be upregulated in almost all cases of SCC, and the level of expression is concordant with the amount of abnormality observed in premalignant and early stage lesions (Smith, 2004). Moreover, S100A2 transcription has been shown to be positively regulated by direct binding of TP63, a TP53 paralogue, to the promoter (Hibi, 2003).
TP63 overexpression is also strongly implicated by the SAGE data, and TP63 is ranked eighth in the list of pre-malignant tags (Figure 4.13.1). The functions of C16orf89 and C5orf32 are currently unknown, and the characterization of these genes and their potential involvement in SCC remains to be explored.

[Figure 4.15. Box plots of log hybridization by sample type: S100A2 in the Wachi dataset (normal, SCC); S100A2 and KRT6A in the Erez/Dehan dataset (normal, tumour-associated normal, SCC) and in the Bhattacharjee dataset (normal, SCC).]

Figure 4.15: Microarray validation of optimal metaplasia progression-associated upregulated gene signature. Each box plot shows the sample type (x-axis) and log-hybridization (y-axis). Each box-and-whisker pattern depicts, from top to bottom, the maximum observed value, upper quartile, median value, lower quartile, and minimum observed value. Additional circles indicate outliers. Note: KRT6A is not represented on the Wachi dataset array.

4.3.2.2.2 Changes associated with malignancy

A set of 143 upregulated and 26 downregulated tags was associated with the malignant phenotype (Figures 4.16.1-4.16.2 and 4.17, respectively). Again, the unusual invasive sample I22 is misclassified. A GO enrichment analysis did not reveal any statistically significant terms. Based on LOOCV, the sensitivity and specificity of these signatures are 61.4% and 85.3%, respectively, for the upregulated tags and 75.2% and 92.1%, respectively, for the downregulated tags. Optimal signatures were constructed to maximize the selectivity. This resulted in a 4 gene upregulated signature (MCM7, SLC6A8, CKS1B, ATP1B3) with a sensitivity and specificity of 81.4% and 92.1%, respectively. Correspondingly, a 2 gene downregulated signature (CRIP2, TMPRSS2) was identified with a sensitivity and specificity of 75.2% and 98.3%, respectively. The dysregulation of all 6 genes is strongly supported by the three SCC microarray datasets (Figures 4.18.1-4.18.2).

[Figure 4.16.1. Heatmap with sample dendrogram; colour key 0 to >10% percent expression; rows are tags with gene annotations, columns are sample libraries. Leading rows: MCM7, SLC6A8, CKS1B, ATP1B3.]

Figure 4.16.1: Heatmap with sample-wise hierarchical clustering of the first 72 of 143 tags upregulated in malignant stages of SCC development. The dendrogram is generated using average-linkage with a Poisson-based distance metric based on Cai et al. (2004). Each cell of the heatmap is based on the percentage expression across all samples to provide consistent colouration. A colour key shows the range of colours from 0-10% expression, with values exceeding this shown in bright red. A colour key placed between the dendrogram and the heatmap denotes the sample type: white is bulk normal; epithelial brushings from never, former, and current smokers shown in dark green, light green, and cyan, respectively; metaplasia in blue; dysplasia in purple; carcinoma in situ in yellow; and invasive carcinoma in red. Additional nucleotides identified by the XBP-SAGE method (Chapter II) are shown. Tags prefixed with an asterisk (*) indicate that the additional nucleotides were needed for a successful mapping.

[Figure 4.16.2. Heatmap with sample dendrogram; colour key 0 to >10% percent expression; rows are tags with gene annotations, columns are sample libraries.]

Figure 4.16.2: Heatmap with sample-wise hierarchical clustering of the final 71 of 143 tags upregulated in malignant stages of SCC development. The dendrogram is generated using average-linkage with a Poisson-based distance metric based on Cai et al. (2004). Each cell of the heatmap is based on the percentage expression across all samples to provide consistent colouration. A colour key shows the range of colours from 0-10% expression, with values exceeding this shown in bright red. A colour key placed between the dendrogram and the heatmap denotes the sample type: white is bulk normal; epithelial brushings from never, former, and current smokers shown in dark green, light green, and cyan, respectively; metaplasia in blue; dysplasia in purple; carcinoma in situ in yellow; and invasive carcinoma in red. Additional nucleotides identified by the XBP-SAGE method (Chapter II) are shown. Tags prefixed with an asterisk (*) indicate that the additional nucleotides were needed for a successful mapping.

[Figure 4.17. Heatmap with sample dendrogram; colour key 0 to >10% percent expression; rows are tags with gene annotations, columns are sample libraries. Leading rows: CRIP2, TMPRSS2.]

Figure 4.17: Heatmap with sample-wise hierarchical clustering of the 26 tags downregulated in the malignant stages of SCC development. The dendrogram is generated using average-linkage with a Poisson-based distance metric based on Cai et al. (2004). Each cell of the heatmap is based on the percentage expression across all samples to provide consistent colouration. A colour key shows the range of colours from 0-10% expression, with values exceeding this shown in bright red. A colour key placed between the dendrogram and the heatmap denotes the sample type: white is bulk normal; epithelial brushings from never, former, and current smokers shown in dark green, light green, and cyan, respectively; metaplasia in blue; dysplasia in purple; carcinoma in situ in yellow; and invasive carcinoma in red. Additional nucleotides identified by the XBP-SAGE method (Chapter II) are shown. Tags prefixed with an asterisk (*) indicate that the additional nucleotides were needed for a successful mapping.

Both MCM7 and CKS1B are key, highly conserved players in eukaryotic cell division. MCM7 is part of a heterohexameric helicase complex (consisting of MCM2-7) and is critical to the initiation of DNA replication (Maiorano, 2006). MCM7 has become increasingly prominent as an accurate tumour cell proliferation marker in a variety of cancers, outperforming the current gold standard marker Ki67 (Xue, 2003; Li, 2005; Facoetti, 2006a; Facoetti, 2006b; Feng, 2008; Nishihara, 2008). A direct role in SCC formation has been suggested by a mouse model in which MCM7 was introduced under the control of a KRT14 promoter; these mice developed SCC when subjected to chemical carcinogenesis, whereas wild-type mice did not (Honeycutt, 2006). CKS1B plays a less direct, but still critical, role as a cell cycle regulator.
The protein promotes SKP2-mediated degradation of the tumour suppressors CDKN1A and CDKN1B, which normally bind to CCNE1-CDK2 and CCND1-CDK4 complexes to maintain a quiescent state (Ganoth, 2001; Spruck, 2001). The proposed role of CKS1B and SKP2 in degrading CDKN1B has been demonstrated in oral SCC (Kitajima, 2004).

As MCM7 and CKS1B activity requires close cooperation with a number of other molecules (Figure 4.19), the expression of tags mapped to the genes for the origin recognition complex (ORC1L-ORC6L), other members of the MCM family (SRF, MCM2-10), known coparticipants in the pre-replication complex (CDC6, CDC7, CDT1, DBF4, CDC45L, GMNN), and cyclin-CDK complex members (CCND1, CCNE1, CDK2, CDK4, CDKN1A, CDKN1B) were examined in detail. Many of these genes are not detected, but it is impossible to determine whether this is due to a complete lack of expression or because expression levels are too low to be captured by SAGE. Although individually the genes that are detected are not expressed at levels necessary to provide statistical significance, taken as a whole a clear association with malignant progression is present. In particular, several other members of the MCM heterohexamer complex are upregulated (MCM2, MCM3, MCM5), as are CDK2 and CDT1, which bridge the MCM complex to the ORC (Figure 4.20) (Tsuyama, 2005). Moreover, CDKN1B, a key negative regulator of cell division, shows some evidence of downregulation as early as pre-malignancy. It is tempting to speculate that the increase in CKS1B acts to further abolish the function of this TSG through the SCF(SKP2)-mediated ubiquitination mechanism.

Figure 4.18.1: Microarray validation of optimal malignant progression-associated upregulated gene signature. Each box plot shows the sample type (x-axis) and log-hybridization (y-axis). Each box-and-whisker pattern depicts, from top to bottom, the maximum observed value, upper quartile, median value, lower quartile, and minimum observed value. Additional circles indicate outliers. Panels show MCM7, SLC6A8, CKS1B, and ATP1B3 in the Wachi, Erez/Dehan, and Bhattacharjee datasets.

Figure 4.18.2: Microarray validation of optimal malignant progression-associated downregulated gene signature. Each box plot shows the sample type (x-axis) and log-hybridization (y-axis). Each box-and-whisker pattern depicts, from top to bottom, the maximum observed value, upper quartile, median value, lower quartile, and minimum observed value. Additional circles indicate outliers. Panels show CRIP2 and TMPRSS2 in the Wachi, Erez/Dehan, and Bhattacharjee datasets.

Figure 4.19: A model of MCM7 and CKS1B function based on current knowledge. A) Role of CKS1B in activating the cyclin-CDK complex by targeting CDKN1A/CDKN1B for ubiquitin-mediated degradation. B) Role of the MCM family and other key proteins in initiating DNA replication. The content and style of the bottom figure (B) are adapted from Maiorano et al. (2006). The version shown here labels the biomolecules using official gene names, includes DBF4 in a complex with CDC7, includes CDC45L as a member of the pre-initiation complex, includes CCNE1 in a complex with CDK2, and makes reference to the specific members of the ORC and RPA gene families. ??? denotes unknown members.

Possible roles for the remaining upregulated malignancy-associated genes are unclear. SLC6A8 is responsible for transporting creatine and creatine analogues both into and out of cells (Sora, 1994). However, none of the tags corresponding to known creatine kinase genes, whose products catalyze the interconversion of creatine and creatine phosphate and thereby provide a source of or reservoir for ATP, show evidence of dysregulation. ATP1B3 is one of a family of regulatory subunits that heterodimerize with a catalytic subunit (one of ATP1A1, ATP1A2, or ATP1A3) to form an active Na+/K+-ATPase (Lingrel, 1990). Only the tag for ATP1A1 is detected; it is expressed in all samples, but there is no evidence of dysregulation. The significance of increased ATP1B3 in malignant progression remains to be investigated.
However, this gene has been shown to play a role in regulating T and B lymphocyte proliferation and may simply be part of an immune response at the site of malignancy (Chiampanichayakul, 2002).

CRIP2 is a member of the diverse family of LIM domain-containing proteins, which are characterized by two tandem zinc fingers that appear to mediate protein-protein interactions rather than to facilitate DNA binding (Karim, 1996; Zheng, 2007). CRIP2 has not been extensively characterized. However, it has been shown to act as a bridge between SRF (a.k.a. MCM1) and several GATA proteins, initiating a transcriptional program that causes the differentiation of fibroblasts into smooth muscle cells (Chang, 2003). One hypothesis is that, given the increase in the MCM family discussed above, the loss of CRIP2 abolishes a transcriptional program that may act against the progression of pre-malignant lesions to SCC.

Figure 4.20: Expression of select members of the MCM family and related genes. Each plot shows the normalized expression (tags/50,000) (y-axis) for all 40 SAGE libraries (x-axis). MCM2, MCM3, MCM5, CDK2, and CDT1 show evidence for upregulation in malignant samples. CDKN1B shows evidence for downregulation in both pre-malignant and malignant samples.

TMPRSS2 encodes a serine protease, a member of a family of enzymes that perform diverse physiological roles by cleaving a variety of target proteins. The biological function of TMPRSS2 is yet to be determined. However, TMPRSS2 has gained recent attention in prostate cancer pathogenesis, as in the majority of cases it is fused to one of a number of ETS-family transcription factor genes (e.g. ERG, ETV1, and ETV4) (Tomlins, 2005; Tomlins, 2006). TMPRSS2 expression is androgen-regulated, and the fusion of the 5’-UTR to an oncogenic ETS partner may be an early causal event in prostate tumourigenesis. Although there is no evidence that TMPRSS2 fusion genes are present in lung cancer, its identification here suggests it may have a more general role as a tumour suppressor.
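As a minimal sketch of the normalization behind the tags/50,000 values plotted in Figure 4.20, the Python function below rescales raw tag counts so that a library totals 50,000 tags. It assumes simple proportional scaling; the function name, tag labels, and counts are hypothetical and chosen only for illustration, and this is not the code used in this thesis.

```python
def tags_per_50k(counts, scale=50_000):
    """Rescale raw SAGE tag counts to a common library size.

    Assumes simple proportional scaling: count * scale / library total.
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("library must contain at least one tag")
    return {tag: count * scale / total for tag, count in counts.items()}


# Hypothetical library of 100,000 tags (labels and counts are illustrative only).
library = {"tag_a": 8, "tag_b": 2, "all_other_tags": 99_990}
normalized = tags_per_50k(library)
print(normalized["tag_a"])  # 8 observations in 100,000 tags -> 4.0 per 50,000
```

Scaling to a common total makes counts from libraries sequenced to different depths directly comparable, although low counts remain subject to sampling noise.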
4.3.2.2.3 Changes associated with an invasive phenotype

Using the complete dataset, a set of 7 upregulated tags was associated with an invasive phenotype (no set of downregulated tags was statistically significant). However, due to the small number of invasive samples (6), the unusual sample I22 introduces a substantial degree of interference in the selection of candidates. Given the results of both the k-means analysis and the candidate selection procedure for pre-malignant and malignant tags, the contamination of this sample by normal epithelium has been reasonably established. Therefore, an additional candidate selection run was performed with I22 omitted, resulting in a large improvement (Figure 4.21). The modified sample set produced a set of 29 upregulated tags associated with the invasive phenotype (again, no set of downregulated tags was statistically significant). The tags mapped to 26 unique UniGene entries, and a GO enrichment analysis of these genes revealed the terms “extracellular matrix” (GO:0031012; FDR<0.1%) and “collagen” (GO:0005581; FDR=1.9%). This enrichment reflects the upregulation of the genes COL1A2, COL3A1, COL6A3, LAMA1, MFAP2, MMP11, MMP12, and SPP1.
These terms are consistent with the invasive phenotype, as the degradation of and interaction with the extracellular matrix and surrounding tissue is an essential feature of this process. Based on LOOCV, the sensitivity and specificity of this signature are 52.5% and 91.4%, respectively. An optimal signature was constructed to maximize the selectivity. This resulted in a 3-gene upregulated signature (MMP11, GAPDH, SHMT2) with a sensitivity and specificity of 64.5% and 91.5%, respectively. Again, there is excellent concordance with the microarray validation datasets (Figure 4.22).

Figure 4.21: Heatmap with sample-wise hierarchical clustering of 29 tags upregulated in the invasive stage of SCC development. The dendrogram is generated using complete linkage with a Poisson-based distance metric based on Cai et al. (2004). Each cell of the heatmap is based on the percentage expression across all samples to provide consistent colouration. A colour key shows the range of colours from 0-10% expression, with values exceeding this shown in bright red. A colour key placed between the dendrogram and the heatmap denotes the sample type: white is bulk normal; epithelial brushings from never, former, and current smokers are shown in dark green, light green, and cyan, respectively; metaplasia in blue; dysplasia in purple; carcinoma in situ in yellow; and invasive carcinoma in red. Additional nucleotides identified by the XBP-SAGE method (Chapter II) are shown. Tags prefixed with an asterisk (*) indicate that the additional nucleotides were needed for a successful mapping.

MMP11, a member of the matrix metalloprotease family, has been strongly implicated in invasion by its presence in the stromal cells surrounding the invasive front of most, and likely all, epithelial cancers (Basset, 1993; Rio, 2005). A recent study identified COL6A3 as a specific MMP11 substrate (Motrescu, 2008). Interestingly, an antisense artefact tag from COL6A3 was the fourth-ranked tag in the list of invasive candidate tags.

The expression of the MMP11 protein was further explored by immunohistochemistry. A commercial antibody was obtained and its activity confirmed by Western blot of a protein extract from an MCF7 cell line transfected with the MMP11 gene. Non-transfected MCF7 protein extract was used as a negative control (both extracts were a gift from Ulrich auf dem Keller, Overall Lab, UBC). No band was observed in the wild-type (MCFWT) sample, and a band at the expected 55 kDa was seen in the MMP11-transfected (MCFMMP11) sample (the MMP11 precursor is 54,618 Da) (Figure 4.23). The antibody was then used on archival tissue sections from normal lung, breast cancer, and SCC. As expected, staining was absent in the lung cancer negative control and very strong in cases of breast carcinoma, where MMP11 is known to be expressed and where it was first implicated in invasion (Figure 4.24A-B) (Basset, 1993). Dark staining was observed both in breast tumour cells and in the surrounding fibroblasts and collagenous structures.
No staining was observed in normal lung tissue (Figure 4.24C). In lung cancer, MMP11 was present in stromal cells surrounding invasive tumour cells (Figure 4.24D-F). Staining was noticeably lighter than in breast carcinoma, which may indicate that MMP11 is present at lower levels in SCC, although this is difficult to state definitively given the qualitative nature of immunohistochemistry.

Figure 4.22: Microarray validation of optimal invasive progression-associated upregulated gene signature. Each box plot shows the sample type (x-axis) and log-hybridization (y-axis). Each box-and-whisker pattern depicts, from top to bottom, the maximum observed value, upper quartile, median value, lower quartile, and minimum observed value. Additional circles indicate outliers. Panels show MMP11, GAPDH, and SHMT2 in the Wachi, Erez/Dehan, and Bhattacharjee datasets.

Figure 4.23: Western blot confirming correct activity of MMP11 antibody. The expected size of the MMP11 precursor is 54.6 kDa. Lane 1: 5 μL Benchmark Pre-stained Marker; Lane 2: 10 μL wild-type MCF7 cell lysate; Lane 3: 10 μL MMP11-transfected MCF7 cell lysate.

Figure 4.24: MMP11 detection by immunohistochemistry. (A) squamous cell lung cancer negative control, (B) breast carcinoma positive control, (C) normal lung, (D,E,F) squamous cell lung cancer from 3 different patients. The presence of MMP11 is indicated by brown staining. All photographs were taken at 20X magnification, except the normal lung section, which was taken at 40X magnification. All sections are counterstained with hematoxylin. Representative areas of malignant cells (*) and stromal tissue (S) are marked. Scale bars indicate 1 mm.

Changes in cellular metabolic processes are of well-established importance in tumour growth and survival. Both GAPDH and SHMT2 are important players in metabolic systems; the former is a key part of the glycolysis pathway, which drives anaerobic energy metabolism, and the latter is important for nucleotide metabolism, which is necessary for transcription, translation, and DNA repair. An increase in glycolytic energy production as a means of overcoming the hypoxic conditions present in tumours, known as the Warburg effect, has been recognized for decades (Warburg, 1956). Mitochondrial SHMT2 and its cytosolic isoenzyme SHMT1 catalyze the simultaneous, reversible conversion of L-serine to glycine and 5,6,7,8-tetrahydrofolate (THF) to N5,N10-methylene-THF (Schirch, 1982). N5,N10-methylene-THF is an essential substrate for thymidine biosynthesis (Figure 4.25). Increased SHMT2 expression and activity have been observed in a variety of tumours, including those of the lung, presumably to support the rapid increase in the rate of mitosis (Tendler, 1987).

Interestingly, the other two enzymes in the cyclic pathway that regenerates N5,N10-methylene-THF and drives thymidine production are major, long-standing chemotherapeutic targets. TYMS, which utilizes N5,N10-methylene-THF to produce thymidylate (dTMP) and the byproduct DHF, is the target of the uracil analog 5-fluorouracil (5-FU), a principal agent in treating colon and pancreatic cancers (Danenberg, 1977). DHFR, which catalyzes the reduction of DHF back to THF, is the target of the folate analog methotrexate, one of the first chemotherapeutic agents (Schweitzer, 1990). It is still widely used to treat a variety of cancers, including those found in the lung.
For this reason, serine hydroxymethyltransferases, the remaining component of this cyclic reaction, have been proposed as a potential treatment target (Agrawal, 2003; Rimpi, 2007). Moreover, there is evidence that SHMT2, rather than SHMT1, is the primary driver of thymidine biosynthesis (Fu, 2001).

Figure 4.25: Role of serine hydroxymethyltransferase in thymidine biosynthesis. SHMT1 and SHMT2 catalyze the conversion of serine to glycine, transferring the single carbon to tetrahydrofolate (THF) to form N5,N10-methylene-THF. Thymidylate synthase (TYMS) catalyzes the conversion of uracil (dUMP) to thymidine (dTMP), producing dihydrofolate (DHF). Dihydrofolate reductase (DHFR) catalyzes the return of DHF to THF, and the cycle begins again. Several chemotherapeutic inhibitors that target TYMS (5-FU, raltitrexed) or DHFR (methotrexate, pemetrexed) are noted.

4.3.3 Comparison to existing squamous cell lung cancer profiles

In order to explore the increased information potential of fully delineated developmental stages, the tags corresponding to lung SCC genes identified by previous transcriptome profiling studies were examined in this dataset. A prior SAGE study identified 10 tags that are upregulated in SCC compared to a normal human bronchial epithelium (NHBE) cell line (Nacht, 2001) (Figure 4.26.1A). Of the three microarray studies, only the Bhattacharjee dataset identified specific differentially expressed candidates (Bhattacharjee, 2001) (Figure 4.26.1B). For the Erez/Dehan and Wachi datasets, the log-transformed hybridization values were subjected to Student’s t-test to identify the top 10 upregulated and downregulated genes according to p-value (Figure 4.26.2). While there is general agreement between the SAGE dataset presented here and these four prior studies, three major issues are evident.
First, genes that appear to be associated with smoke exposure are evident in the Nacht dataset. This is likely due to the use of an NHBE cell line for comparison rather than smoke-exposed epithelium, which is present in the majority of lung cancer patients. Second, genes that are strongly associated with the pre-malignant transformation of bronchial epithelium to metaplasia are highly overrepresented. This is particularly evident in the Bhattacharjee candidates, although it occurs to a large extent in all four datasets. Third, genes expressed by non-epithelial cells present in the surrounding stroma are present in the Nacht dataset, again likely due to the use of the NHBE cell line, and especially in the downregulated candidates in the Erez/Dehan and Wachi datasets. This comparison highlights the potential difficulties in identifying candidate lung cancer genes by global transcriptome profiling, difficulties that are also relevant to the study of other solid tumours.

Figure 4.26.1: Heatmap of SAGE tag expression of candidate SCC genes identified from the Nacht and Bhattacharjee datasets. The Nacht genes are the top heatmap (A) and the Bhattacharjee genes are the bottom heatmap (B). Both sets of genes are those identified by the original study authors. Each cell of the heatmap is based on the percentage expression across all samples to provide consistent colouration. A colour key shows the range of colours from 0-10% expression, with values exceeding this shown in bright red. A colour key at the top of the heatmap denotes the sample type: white is bulk normal; epithelial brushings from never, former, and current smokers are shown in dark green, light green, and cyan, respectively; metaplasia in blue; dysplasia in purple; carcinoma in situ in yellow; and invasive carcinoma in red.

Figure 4.26.2: Heatmap of SAGE tag expression of candidate SCC genes identified from the Erez/Dehan and Wachi datasets. The Erez/Dehan candidates are the top heatmaps corresponding to downregulated (A) and upregulated (B) genes. The Wachi candidates are the bottom heatmaps corresponding to downregulated (C) and upregulated (D) genes. Genes were identified by selecting those with the lowest p-value as determined by Student’s t-test. Each cell of the heatmap is based on the percentage expression across all samples to provide consistent colouration. A colour key shows the range of colours from 0-10% expression, with values exceeding this shown in bright red. A colour key at the top of the heatmap denotes the sample type: white is bulk normal; epithelial brushings from never, former, and current smokers are shown in dark green, light green, and cyan, respectively; metaplasia in blue; dysplasia in purple; carcinoma in situ in yellow; and invasive carcinoma in red.

4.4 DISCUSSION

This study presents an initial view of the global transcriptional changes that occur over the course of squamous cell lung cancer development. These data suggest that a great deal, and perhaps the majority, of the changes to gene expression during the development of SCC occur with the transformation of normal epithelium to the squamous cell type. Although this process is a pre-condition for the development of SCC – hence the name – and may result in a tumour-permissive state, it is not itself a malignant phenotype. Unfortunately, this results in a conflation of squamous differentiation-associated changes and malignancy-associated changes when comparing normal epithelium and de facto carcinomas, as is typically done.
Gene expression changes associated with the invasive phenotype appear to be relatively limited. This is consistent with the fact that a high proportion (>90%) of in situ carcinomas eventually become invasive. The variability in the cellular composition of the bulk invasive samples used in this analysis may disguise changes to some extent. However, based on what is known about the candidate genes, the notion that a substantial number of invasion-promoting genes may arise from the cellular milieu surrounding a tumour is strongly supported. In other words, invasion may depend largely on gene expression changes in non-tumour cells. Of course, the possibility of a number of distinct, mutually exclusive determinants of this phenotype, which would complicate the identification of a consistent invasive signature, cannot be ruled out.

A number of gene expression signatures were identified using computational strategies designed to maximize the information content of SAGE profiles. Both the improvement to tag-to-gene mapping presented in Chapter II and the statistical model presented in Chapter III were vital elements of the methodology used to analyze this complex dataset and identify high-quality signatures that correspond to the metaplastic, malignant, and invasive phenotypes that characterize SCC development. When compared to three separate gene expression studies using normal lung tissue and SCC samples, excellent agreement was found for all of the genes that comprise these signatures.

The results of this effort are sufficient to develop a convincing portrait of key gene expression changes associated with SCC development, including a number of novel genes that are promising targets of future study. The metaplastic phenotype is associated with the upregulation of S100A2 and KRT6A, and the downregulation of C16orf89 and C5orf32.
While both upregulated genes have been previously implicated in keratinocyte or squamous differentiation, the two downregulated genes are completely uncharacterized and are attractive targets for exploratory studies.

The malignant phenotype is associated with the upregulation of MCM7, SLC6A8, CKS1B, and ATP1B3, and the downregulation of CRIP2 and TMPRSS2. MCM7 and CKS1B are key participants in regulating the initiation of DNA replication and may represent a common modality that initiates SCC malignancy – likely in establishing and maintaining uncontrolled cell division. The role of the remaining genes is unclear, but both CRIP2 and TMPRSS2 are particularly attractive as targets for further study given the former’s possible role in bridging transcription machinery to particular loci, and the recent discovery of the latter’s fundamental importance in prostate cancer initiation.

The invasive phenotype is associated with an upregulation of MMP11, GAPDH, and SHMT2. MMP11 is of particular interest given the role of the matrix metalloproteinase family in the extracellular matrix degradation and remodelling that is considered a requirement for successful tumour invasion. An examination of MMP11 in invasive carcinomas by immunohistochemistry supports this hypothesis. Given the role of GAPDH in anaerobic energy production, a long-established feature of cancer etiology and progression, this gene is more an affirmation of this study’s methodology than a target of future work. Finally, the close relationship of SHMT2 to thymidine biosynthesis and a pathway that has long been a target of conventional chemotherapy is alluring and may represent a drug target with the potential to improve an existing strategy of cancer treatment.

In addition, an analysis of SAGE libraries obtained from bronchial epithelium from never, former, and current smokers revealed a distinct signature associated with recent exposure to tobacco smoke.
These genes were overwhelmingly associated with the Phase I and Phase II enzyme system responsible for xenobiotic metabolism. Although no gene expression changes that persist once an individual has quit were found, downregulation of the Phase I enzyme EPHX1 to levels below those of a never smoker may represent a distinct effect of cessation following long-term exposure. Although EPHX1 activity is required to generate a mutagenic form of the known pro-carcinogen benzo[a]pyrene, the primary role of the enzyme is protective and a decrease in its activity may render the epithelium more susceptible to the harmful effects of environmental toxins.

Finally, the data presented represent one of the most comprehensive catalogues of gene expression change in the step-wise development of an invasive carcinoma from normal tissue. The data and statistical analysis are of significant value to lung cancer researchers in terms of defining the temporal contribution of particular genes in SCC development. The dataset is amenable to meta-analyses that combine these data with other high-throughput experiments to identify important progression-related events (e.g. copy number variations or aberrant methylation).

BIBLIOGRAPHY

Agrawal, S., A. Kumar, et al. (2003). "Cloning, expression, activity and folding studies of serine hydroxymethyltransferase: a target enzyme for cancer chemotherapy." J Mol Microbiol Biotechnol 6(2): 67-75.
Altschul, S. F., T. L. Madden, et al. (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res 25(17): 3389-402.
Arthur, D. and S. Vassilvitskii (2007). k-means++: The advantages of careful seeding. Symposium on Discrete Algorithms (SODA).
Ashburner, M., C. A. Ball, et al. (2000). "Gene ontology: tool for the unification of biology. The Gene Ontology Consortium." Nat Genet 25(1): 25-9.
Basset, P., C. Wolf, et al. (1993). "Expression of the stromelysin-3 gene in fibroblastic cells of invasive carcinomas of the breast and other human tissues: a review." Breast Cancer Res Treat 24(3): 185-93.
Beane, J., P. Sebastiani, et al. (2007). "Reversible and permanent effects of tobacco smoke exposure on airway epithelial gene expression." Genome Biol 8(9): R201.
Benson, D. A., I. Karsch-Mizrachi, et al. (2008). "GenBank." Nucleic Acids Res 36(Database issue): D25-30.
Bhattacharjee, A., W. G. Richards, et al. (2001). "Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses." Proc Natl Acad Sci U S A 98(24): 13790-5.
Bolstad, B. M., R. A. Irizarry, et al. (2003). "A comparison of normalization methods for high density oligonucleotide array data based on variance and bias." Bioinformatics 19(2): 185-93.
Boon, K., E. C. Osorio, et al. (2002). "An anatomy of normal and malignant gene expression." Proc Natl Acad Sci U S A 99(17): 11287-92.
Breuer, R. H., A. Pasic, et al. (2005). "The natural course of preneoplastic lesions in bronchial epithelium." Clin Cancer Res 11(2 Pt 1): 537-43.
Cai, L., H. Huang, et al. (2004). "Clustering analysis of SAGE data using a Poisson approach." Genome Biol 5(7): R51.
Carolan, B. J., A. Heguy, et al. (2006). "Up-regulation of expression of the ubiquitin carboxyl-terminal hydrolase L1 gene in human airway epithelium of cigarette smokers." Cancer Res 66(22): 10729-40.
Chang, D. F., N. S. Belaguli, et al. (2003). "Cysteine-rich LIM-only proteins CRP1 and CRP2 are potent smooth muscle differentiation cofactors." Dev Cell 4(1): 107-18.
Chiampanichayakul, S., A. Szekeres, et al. (2002). "Engagement of Na,K-ATPase beta3 subunit by a specific mAb suppresses T and B lymphocyte activation." Int Immunol 14(12): 1407-14.
Colby, T. V., Wistuba, II, et al. (1998). "Precursors to pulmonary neoplasia." Adv Anat Pathol 5(4): 205-15.
Dakir, E. H., L. Feigenbaum, et al. (2008). "Constitutive Expression of Human Keratin 14 Gene in Mouse Lung Induces Premalignant Lesions and Squamous Differentiation." Carcinogenesis.
Danenberg, P. V. (1977). "Thymidylate synthetase - a target enzyme in cancer chemotherapy." Biochim Biophys Acta 473(2): 73-92.
Dehan, E., A. Ben-Dor, et al. (2007). "Chromosomal aberrations and gene expression profiles in non-small cell lung cancer." Lung Cancer 56(2): 175-84.
Denissenko, M. F., A. Pao, et al. (1996). "Preferential formation of benzo[a]pyrene adducts at lung cancer mutational hotspots in P53." Science 274(5286): 430-2.
Erez, A., M. Perelman, et al. (2004). "Sil overexpression in lung cancer characterizes tumors with increased mitotic activity." Oncogene 23(31): 5371-7.
Ewing, B. and P. Green (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities." Genome Res 8(3): 186-94.
Ewing, B., L. Hillier, et al. (1998). "Base-calling of automated sequencer traces using phred. I. Accuracy assessment." Genome Res 8(3): 175-85.
Facoetti, A., E. Ranza, et al. (2006). "Minichromosome maintenance protein 7: a reliable tool for glioblastoma proliferation index." Anticancer Res 26(2A): 1071-5.
Facoetti, A., E. Ranza, et al. (2006). "Immunohistochemical evaluation of minichromosome maintenance protein 7 in astrocytoma grading." Anticancer Res 26(5A): 3513-6.
Fearon, E. R. and B. Vogelstein (1990). "A genetic model for colorectal tumorigenesis." Cell 61(5): 759-67.
Feng, H. C., S. W. Tsao, et al. (2008). "Overexpression of prostate stem cell antigen is associated with gestational trophoblastic neoplasia." Histopathology 52(2): 167-74.
Flicek, P., B. L. Aken, et al. (2008). "Ensembl 2008." Nucleic Acids Res 36(Database issue): D707-14.
Fu, T. F., J. P. Rife, et al. (2001). "The role of serine hydroxymethyltransferase isozymes in one-carbon metabolism in MCF-7 cells as determined by (13)C NMR." Arch Biochem Biophys 393(1): 42-50.
Ganoth, D., G. Bornstein, et al. (2001). "The cell-cycle regulatory protein Cks1 is required for SCF(Skp2)-mediated ubiquitinylation of p27." Nat Cell Biol 3(3): 321-4.
Gautier, L., L. Cope, et al. (2004). "affy--analysis of Affymetrix GeneChip data at the probe level." Bioinformatics 20(3): 307-15.
Gentleman, R. C., V. J. Carey, et al. (2004). "Bioconductor: open software development for computational biology and bioinformatics." Genome Biol 5(10): R80.
Gibbs, S., R. Fijneman, et al. (1993). "Molecular characterization and evolution of the SPRR family of keratinocyte differentiation markers encoding small proline-rich proteins." Genomics 16(3): 630-7.
Gower, J. C. (1966). "Some distance properties of latent root and vector methods used in multivariate analysis." Biometrika 53: 325-328.
Greer, R. O. (2006). "Pathology of malignant and premalignant oral epithelial lesions." Otolaryngol Clin North Am 39(2): 249-75, v.
Hibi, K., S. Fujitake, et al. (2003). "Identification of S100A2 as a target of the DeltaNp63 oncogenic pathway." Clin Cancer Res 9(11): 4282-5.
Hirsch, F. R., W. A. Franklin, et al. (2001). "Early detection of lung cancer: clinical perspectives of recent advances in biology and radiology." Clin Cancer Res 7(1): 5-22.
Hoffmann, W. (2007). "TFF (trefoil factor family) peptides and their potential roles for differentiation processes during airway remodeling." Curr Med Chem 14(25): 2716-9.
Honeycutt, K. A., Z. Chen, et al. (2006). "Deregulated minichromosomal maintenance protein MCM7 contributes to oncogene driven tumorigenesis." Oncogene 25(29): 4027-32.
Huang da, W., B. T. Sherman, et al. (2007). "DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists." Nucleic Acids Res 35(Web Server issue): W169-75.
Irizarry, R. A., B. M. Bolstad, et al. (2003). "Summaries of Affymetrix GeneChip probe level data." Nucleic Acids Res 31(4): e15.
Irizarry, R. A., B. Hobbs, et al. (2003). "Exploration, normalization, and summaries of high density oligonucleotide array probe level data." Biostatistics 4(2): 249-64.
Jain, A. K., M. N. Murty, et al. (1999). "Data clustering: a review." ACM Computing Surveys 31: 265-323.
Jemal, A., R. Siegel, et al. (2007). "Cancer statistics, 2007." CA Cancer J Clin 57(1): 43-66.
Johnson, M., I. Zaretskaya, et al. (2008). "NCBI BLAST: a better web interface." Nucleic Acids Res 36(Web Server issue): W5-9.
Kanehisa, M., M. Araki, et al. (2008). "KEGG for linking genomes to life and the environment." Nucleic Acids Res 36(Database issue): D480-4.
Karim, M. A., K. Ohta, et al. (1996). "Human ESP1/CRP2, a member of the LIM domain protein family: characterization of the cDNA and assignment of the gene locus to chromosome 14q32.3." Genomics 31(2): 167-76.
Kitajima, S., Y. Kudo, et al. (2004). "Role of Cks1 overexpression in oral squamous cell carcinomas: cooperation with Skp2 in promoting p27 degradation." Am J Pathol 165(6): 2147-55.
Lee, C. H., L. N. Marekov, et al. (2000). "Small proline-rich protein 1 is the major component of the cell envelope of normal human oral keratinocytes." FEBS Lett 477(3): 268-72.
Li, S. S., W. C. Xue, et al. (2005). "Replicative MCM7 protein as a proliferation marker in endometrial carcinoma: a tissue microarray and clinicopathological analysis." Histopathology 46(3): 307-13.
Lingrel, J. B., J. Orlowski, et al. (1990). "Molecular genetics of Na,K-ATPase." Prog Nucleic Acid Res Mol Biol 38: 37-89.
Lonergan, K. M., R. Chari, et al. (2006). "Identification of novel lung genes in bronchial epithelium by serial analysis of gene expression." Am J Respir Cell Mol Biol 35(6): 651-61.
Maiorano, D., M. Lutzmann, et al. (2006). "MCM proteins and DNA replication." Curr Opin Cell Biol 18(2): 130-6.
Motrescu, E. R., S. Blaise, et al. (2008). "Matrix metalloproteinase-11/stromelysin-3 exhibits collagenolytic function against collagen VI under normal and malignant conditions." Oncogene 27(49): 6347-55.
Nacht, M., T. Dracheva, et al. (2001). "Molecular characteristics of non-small cell lung cancer." Proc Natl Acad Sci U S A 98(26): 15203-8.
Nakajima, T., H. Shimooka, et al. (2003). "Immunohistochemical demonstration of 14-3-3 sigma protein in normal human tissues and lung cancers, and the preponderance of its strong expression in epithelial cells of squamous cell lineage." Pathol Int 53(6): 353-60.
Neville, B. W. and T. A. Day (2002). "Oral cancer and precancerous lesions." CA Cancer J Clin 52(4): 195-215.
Nishihara, K., K. Shomori, et al. (2008). "Minichromosome maintenance protein 7 in colorectal cancer: implication of prognostic significance." Int J Oncol 33(2): 245-51.
Oertel, M., A. Graness, et al. (2001). "Trefoil factor family-peptides promote migration of human bronchial epithelial cells: synergistic effect with epidermal growth factor." Am J Respir Cell Mol Biol 25(4): 418-24.
Ojeh, N., K. Hiilesvuo, et al. (2008). "Ectopic expression of syndecan-1 in basal epidermis affects keratinocyte proliferation and wound re-epithelialization." J Invest Dermatol 128(1): 26-34.
Pruitt, K. D., T. Tatusova, et al. (2007). "NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins." Nucleic Acids Res 35(Database issue): D61-5.
Rimpi, S. and J. A. Nilsson (2007). "Metabolic enzymes regulated by the Myc oncogene are possible targets for chemotherapy or chemoprevention." Biochem Soc Trans 35(Pt 2): 305-10.
Rio, M. C. (2005). "From a unique cell to metastasis is a long way to go: clues to stromelysin-3 participation." Biochimie 87(3-4): 299-306.
Schirch, L. (1982). "Serine hydroxymethyltransferase." Adv Enzymol Relat Areas Mol Biol 53: 83-112.
Schweitzer, B. I., A. P. Dicker, et al. (1990). "Dihydrofolate reductase as a therapeutic target." Faseb J 4(8): 2441-52.
Shimizu, M., S. Ban, et al. (2007). "Squamous dysplasia and other precursor lesions related to esophageal squamous cell carcinoma."
Gastroenterol Clin North Am 36(4): 797-811, v-vi.
Smith, S. L., M. Gugger, et al. (2004). "S100A2 is strongly expressed in airway basal cells, preneoplastic bronchial lesions and primary non-small cell lung carcinomas." Br J Cancer 91(8): 1515-24.
Sora, I., J. Richman, et al. (1994). "The cloning and expression of a human creatine transporter." Biochem Biophys Res Commun 204(1): 419-27.
Spira, A., J. Beane, et al. (2004). "Effects of cigarette smoke on the human airway epithelial cell transcriptome." Proc Natl Acad Sci U S A 101(27): 10143-8.
Spruck, C., H. Strohmaier, et al. (2001). "A CDK-independent function of mammalian Cks1: targeting of SCF(Skp2) to the CDK inhibitor p27Kip1." Mol Cell 7(3): 639-50.
Steinhaus, H. (1956). "Sur la division des corps matériels en parties." Bull Acad Polon Sci, C1. III IV: 801-804.
Stroustrup, B. (2000). The C++ Programming Language. Reading, Massachusetts, Addison-Wesley.
Tendler, S. J., M. D. Threadgill, et al. (1987). "Activities of serine hydroxymethyltransferase in murine tissues and tumours." Cancer Lett 36(1): 65-9.
Tomlins, S. A., R. Mehra, et al. (2006). "TMPRSS2:ETV4 gene fusions define a third molecular subtype of prostate cancer." Cancer Res 66(7): 3396-400.
Tomlins, S. A., D. R. Rhodes, et al. (2005). "Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer." Science 310(5748): 644-8.
Truong, A. B., M. Kretz, et al. (2006). "p63 regulates proliferation and differentiation of developmentally mature keratinocytes." Genes Dev 20(22): 3185-97.
Tsuyama, T., S. Tada, et al. (2005). "Licensing for DNA replication requires a strict sequential assembly of Cdc6 and Cdt1 onto chromatin in Xenopus egg extracts." Nucleic Acids Res 33(2): 765-75.
Velculescu, V. E., L. Zhang, et al. (1995). "Serial analysis of gene expression." Science 270(5235): 484-7.
Wachi, S., K. Yoneda, et al. (2005). "Interactome-transcriptome analysis reveals the high centrality of genes differentially expressed in lung cancer tissues." Bioinformatics 21(23): 4205-8.
Warburg, O. (1956). "On respiratory impairment in cancer cells." Science 124(3215): 269-70.
Wheeler, D. L., T. Barrett, et al. (2008). "Database resources of the National Center for Biotechnology Information." Nucleic Acids Res 36(Database issue): D13-21.
Wiede, A., W. Jagla, et al. (1999). "Localization of TFF3, a new mucus-associated peptide of the human respiratory tract." Am J Respir Crit Care Med 159(4 Pt 1): 1330-5.
Wistuba, II and A. F. Gazdar (2006). "Lung cancer preneoplasia." Annu Rev Pathol 1: 331-48.
Xue, W. C., U. S. Khoo, et al. (2003). "Minichromosome maintenance protein 7 expression in gestational trophoblastic disease: correlation with Ki67, PCNA and clinicopathological parameters." Histopathology 43(5): 485-90.
Zeeberg, B. R., W. Feng, et al. (2003). "GoMiner: a resource for biological interpretation of genomic and proteomic data." Genome Biol 4(4): R28.
Zheng, Q. and Y. Zhao (2007). "The diverse biofunctions of LIM domain proteins: determined by subcellular localization and protein-protein interaction." Biol Cell 99(9): 489-502.
Zuyderduyn, S. D. (2004). "Bio::SAGE::DataProcessing perl module." from http://search.cpan.org/dist/Bio-SAGE-DataProcessing.
Zuyderduyn, S. D. (2007). "Statistical analysis and significance testing of serial analysis of gene expression data using a Poisson mixture model." BMC Bioinformatics 8: 282.

CHAPTER V CONCLUSION AND FUTURE PROSPECTS

5.1  OVERALL DISCUSSION AND CONCLUSION

The initial goal of this thesis was to determine the relationship of the transcriptome to the formation and progression of squamous cell lung cancer (SCC), a common tumour subtype that contributes to a large proportion of total cancer deaths (Canadian Cancer Society, 2008; Jemal, 2007).
The use of serial analysis of gene expression (SAGE) to generate the necessary transcriptome profiles was appealing due to the unbiased and comprehensive nature of the technique (Velculescu, 1995). While a relatively straightforward strategy involving existing, and widely adopted, approaches to SAGE analysis was envisioned, these techniques quickly proved to be problematic. These challenges were exacerbated by the lack of additional biological material to fully validate findings. Indeed, a common approach in large-scale genomic studies is to sacrifice optimization at the data analysis phase of an investigation for the sake of rapid transition to more targeted follow-up study where, presumably, false findings can be quickly identified.

One of the first issues that became clear was that, without the ability to validate, SAGE tag-to-gene mapping would be an area of continuous frustration. The veracity of SAGE analysis hinges on the ability to determine the original transcript from which an observed tag arose. Often there are several transcripts that share the same theoretical SAGE tag, resulting in a situation where a differentially expressed gene cannot be unambiguously identified. This is typically addressed by either: a) simply accepting that some candidate tags are lost to this issue, or b) performing additional experiments (e.g. qPCR) to determine which of the possible candidates is differentially expressed. The former is acceptable when the number of candidates is relatively large and/or the study objective is more concerned with uncovering larger biological themes, rather than specific genes. This proved not to be the case in the dataset central to this thesis. The latter, of course, is only feasible in cases where biological material is not limiting. Chapter II describes the strategy developed to increase the effective length of tag sequences (usually by 20%) to reduce the problem of transcript ambiguity.
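The ambiguity problem described above can be illustrated with a small Python sketch. It assumes the conventional SAGE layout, in which a tag is the run of bases immediately 3' of the last CATG (NlaIII) site in a transcript, and it compares 10-base tags with tags that are 20% longer (12 bases). The transcript sequences, gene names, and function names are invented for illustration. This is not the Chapter II procedure, only a demonstration of why a longer effective tag can separate transcripts that share a shorter one.

```python
from collections import defaultdict

ANCHOR = "CATG"  # NlaIII recognition site (assumed anchoring enzyme)


def extract_tag(transcript, tag_length):
    """Return the tag_length bases 3' of the last ANCHOR site, or None."""
    site = transcript.rfind(ANCHOR)
    if site == -1:
        return None
    start = site + len(ANCHOR)
    tag = transcript[start:start + tag_length]
    # Discard tags truncated by the end of the transcript.
    return tag if len(tag) == tag_length else None


def tag_to_genes(transcripts, tag_length):
    """Map each theoretical tag to the set of genes that produce it."""
    mapping = defaultdict(set)
    for gene, sequence in transcripts.items():
        tag = extract_tag(sequence, tag_length)
        if tag is not None:
            mapping[tag].add(gene)
    return dict(mapping)


def ambiguous_tags(transcripts, tag_length):
    """Return only the tags shared by more than one gene."""
    mapping = tag_to_genes(transcripts, tag_length)
    return {tag: genes for tag, genes in mapping.items() if len(genes) > 1}


# Hypothetical toy transcripts: geneA and geneB share their first 10
# bases after the anchor site but differ at bases 11 and 12.
TOY_TRANSCRIPTS = {
    "geneA": "GGCTACATGACGTTAGCCAAGTTTAAAA",
    "geneB": "TTGACCATGACGTTAGCCACCTTGAAAA",
    "geneC": "AACCGCATGTTTGGGCCCAAATTTAAAA",
}

if __name__ == "__main__":
    for length in (10, 12):
        shared = ambiguous_tags(TOY_TRANSCRIPTS, length)
        print(f"{length}-base tags shared by several genes: {shared}")
```

In this toy set the 10-base tag cannot distinguish geneA from geneB, whereas the 12-base tag maps each gene uniquely.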
However, it was also discovered that many of the tag-to-gene mappings that were previously considered unambiguous actually arose from other sources: most often as artefacts from highly expressed tags with close sequence similarity, or from a specific type of SAGE artefact that results in the capture of antisense sequences, but also from novel transcript variants of known genes or entirely novel loci. Applying this strategy proved invaluable in the analysis of the SCC dataset, as 35% of the candidate tags identified benefited from either reduced ambiguity or re-assignment to an improved source transcript.

The multiple developmental stages of SCC represented by the SAGE dataset presented challenges for statistical analysis. Standard hypothesis tests were inappropriate for comparisons involving multiple groups, and the heterogeneity of the sampled tissues represented an additional level of complexity not adequately addressed by simple approaches. Existing work by several researchers had begun to tackle the challenge of determining statistical significance when comparing multiple groups of SAGE libraries (Baggerly, 2003; Baggerly, 2004; Vencio, 2004; Lu, 2005). Although these methods provided an excellent framework with which to begin the analysis of the SCC dataset, it became clear that some of the existing assumptions used in these statistical models, although fundamentally sound, were inadequate. At 40 libraries, the sheer size of the SCC dataset was the largest factor in enabling these inaccuracies to be uncovered. The major issue was the assumption that additional variance could be adequately represented by a hierarchical model. Such models use a unimodal prior distribution to supplement the binomial or Poisson distribution that accounts for the variance that must arise from sampling a small subset of tags from the large initial pool of tags generated from biological material.
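The hierarchical structure described above can be sketched numerically in Python. The example assumes a gamma prior on the Poisson rate (one common unimodal choice, which yields a negative binomial) and contrasts it with a generic two-component Poisson mixture. All parameter values and function names are invented for illustration, and neither model is claimed to match the formulation of the cited methods or of Chapter III. The point is only that a unimodal prior inflates the variance while keeping a single mode, whereas a finite mixture can place counts in two distinct modes.

```python
import math


def poisson_pmf(k, rate):
    """Poisson probability of observing count k at the given rate."""
    return math.exp(k * math.log(rate) - rate - math.lgamma(k + 1))


def gamma_poisson_pmf(k, mean, shape):
    """Poisson whose rate follows a gamma prior (a negative binomial)."""
    p = shape / (shape + mean)
    log_pmf = (math.lgamma(k + shape) - math.lgamma(shape)
               - math.lgamma(k + 1)
               + shape * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


def mixture_pmf(k, weights, rates):
    """Finite Poisson mixture: weighted sum of Poisson components."""
    return sum(w * poisson_pmf(k, r) for w, r in zip(weights, rates))


def mean_and_variance(pmf, upper=400):
    """Moments of a count distribution, truncated at `upper`."""
    probs = [pmf(k) for k in range(upper)]
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, variance


def count_modes(pmf, upper=400):
    """Number of strict interior local maxima of the pmf."""
    probs = [pmf(k) for k in range(upper)]
    return sum(1 for k in range(1, upper - 1)
               if probs[k] > probs[k - 1] and probs[k] > probs[k + 1])


def plain(k):
    return poisson_pmf(k, 20.5)            # sampling variance only


def hierarchical(k):
    return gamma_poisson_pmf(k, 20.5, 3.0)  # unimodal prior on the rate


def mixture(k):
    return mixture_pmf(k, [0.5, 0.5], [5.5, 40.5])  # two expression states


if __name__ == "__main__":
    for name, pmf in [("Poisson", plain),
                      ("gamma-Poisson", hierarchical),
                      ("Poisson mixture", mixture)]:
        mean, variance = mean_and_variance(pmf)
        print(f"{name}: mean={mean:.2f} variance={variance:.2f} "
              f"modes={count_modes(pmf)}")
```

Non-integer rates are used so that no two adjacent counts tie exactly, which keeps the mode count well defined.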
Initially, effort was spent on determining a better choice for such a distribution, which was met with some limited success. Chapter III describes the eventual development of a statistical approach that utilizes a Poisson mixture model, rather than the previously proposed hierarchical model, to assign significance to SAGE library comparisons. This model was inspired by the structure of the SCC dataset, and while it initially appeared to draw its strength from accounting for the extensive sample heterogeneity, its strong performance on previously published datasets, several of which were generated from relatively pure biological material, revealed a general applicability to all SAGE data.

Chapter IV describes the final analysis of the SCC dataset. To date, it represents the most comprehensive profile of the transcriptome during the early development of a solid tumour. While the sample heterogeneity discussed above was a confounding factor throughout this research, a strict adherence to sound statistical methodology and the use of the techniques described in Chapters II and III resulted in the identification of several hundred genes associated with several steps in the development and progression of SCC.

A targeted analysis of the subset of samples drawn from brushings of bronchial epithelium from never, former, and current smokers was performed to gain insight into the role of tobacco exposure on the normal lung transcriptome. This analysis revealed a large response to such exposure, primarily from the Phase I and Phase II enzyme system. However, there was no evidence that any of these changes persist once an individual ceases exposure. One intriguing observation was a former smoker-specific decrease in the expression of EPHX1, a known player in generating a highly mutagenic product from the pro-carcinogen benzo[a]pyrene (Denissenko, 1996).
A decrease in the activity of this enzyme in former smokers may leave the epithelium more susceptible to future mutagenesis by, for example, air-borne carcinogens. A larger, longitudinal study with careful control of variables such as the timeline of past exposure and cessation is required to determine if this is a true effect. Of interest was the lack of any gene expression changes associated with the acute response persisting in later stages of SCC development. It appears that upon the formation of squamous metaplasia, the normal molecular response to tobacco smoke ceases altogether. This supports the notion that these early lesions are composed of cells that are highly susceptible to further mutagenesis.

A more comprehensive analysis was then performed that focussed on the pre-malignant and malignant stages of SCC. A number of broad insights were gained. First, the largest and most consistent set of gene expression changes is associated with the development of premalignant squamous metaplasia. This has two major implications for future studies of SCC and, quite likely, other tumour types: 1) the use of normal bronchial epithelium as the “baseline” from which to assess gene expression changes in SCC is unwise, at least if the objective is to identify those associated with malignancy; and 2) the consistency of these pre-malignant changes means they are more likely to be identified as significant than malignant changes, which display a far more varied incidence and level of expression. Second, the number of changes associated with acquiring an invasive capability is remarkably small, even when considering the increased heterogeneity of the samples representing this phenotype.

Although not presented in this thesis, the development of the computational strategies involved many modifications and continual improvements.
Access to the Westgrid resource, a high-performance computing (HPC) cluster containing over 1500 CPUs, allowed complex procedures to be applied to the large SCC dataset without major concerns about execution time. For example, the cross-validation and resampling strategies described in Chapter IV would take several months to execute on a typical desktop computer. The use of an HPC cluster allowed these procedures to be performed in a few days or less, and this allowed exceptional freedom to improve or re-run entire analysis pipelines when additional data became available or new ideas were formed. The difference in the quality of the first gene signatures developed at the beginning of this project and those presented in this thesis is enormous.

The final set of 13 genes most strongly associated with the complete transformation of bronchial epithelium to pre-malignant metaplastic lesions, to carcinoma in situ, and finally, the acquisition of the invasive phenotype represents a compelling set of targets for additional study. While the upregulated S100A2 and KRT6A genes are familiar participants in SCC and squamous differentiation, the downregulation of C16orf89 and C5orf32 in metaplasia is equally strong and presents an opportunity to investigate two entirely uncharacterized genes that are likely to play important roles in differentiation (Smith, 2004). MCM7 and CKS1B represent genes that play important and central roles in promoting DNA replication and their consistent increase upon progression to malignancy may represent a common outcome of the disruption of a range of TSGs and oncogenes (Ganoth, 2001; Spruck, 2001; Maiorano, 2006). However, the significance of the increased expression of SLC6A8, a creatine transporter, and ATP1B3, the regulatory subunit of Na+/K+-ATPase, is unclear and remains to be investigated.
The decreased expression of CRIP2, a possible bridge between the general transcription machinery and specific transcription factors, may represent a central point at which the expression of many malignancy-associated genes is altered (Chang, 2003). A similar decrease in TMPRSS2, which is now established as playing a critical role in prostate tumour initiation by driving the expression of certain oncogenes through gene fusion events, suggests this gene may have tumour-suppressing properties in its own right (Tomlins, 2005; Tomlins, 2006).

The top candidate identified in association with invasive samples was MMP11. This gene has been established as an important player in tumour invasion in a wide variety of tumours, although the exact mechanism remains unclear (Basset, 1993; Rio, 2005). A preliminary examination of its protein expression by immunohistochemistry supports its importance in SCC. Ironically, a likely source of MMP11 is the stromal cells surrounding invasive cancer cells, and the impurity of the samples analyzed in this thesis may have facilitated its identification here. The finding of the metabolic gene GAPDH in association with invasion is unsurprising, as the increased reliance on glycolysis is a long-established property of cancer that is used to overcome the hypoxic environment faced by a growing tumour (Warburg, 1956). Finally, the increase in SHMT2 is interesting due to its role in driving thymidine biosynthesis in close cooperation with TYMS and DHFR, two of the longest-standing targets of conventional chemotherapy (Danenberg, 1977; Schweitzer, 1990).
5.2  FUTURE PROSPECTS

The research described in this work presents the opportunity for further exploration in two major areas: 1) the SAGE method as a general approach for gene expression profiling and how to best utilize data captured by this technique; and 2) the molecular basis of squamous cell lung cancer progression and how this information can improve the prospects for individuals affected by this disease.

The XBP-SAGE approach presented in Chapter II demonstrates that additional sequence information can be extracted from SAGE data, resulting in a large increase in the fidelity of tag-to-gene mapping. This approach should be adopted as a standard method of data processing. Furthermore, revisiting the costs and benefits of adopting newer protocols, which produce longer tags, may be worthwhile for researchers who are considering the SAGE technique. For example, the cost of sequencing is almost 50% greater when utilizing the LongSAGE technique. If a proposed study plans to produce a number of profiles, the cost savings from using the shorter variant would allow the production and sequencing of libraries from additional biological replicates. Not only could additional libraries assist in the determination of extra nucleotides, but the statistical power gained may outweigh the slight loss in mapping fidelity.

The Poisson mixture model presented in Chapter III is demonstrably superior in determining the significance of gene expression changes when comparing SAGE libraries. However, a statistical model can only be shown to be an improvement over an existing one and there is always the possibility that an entirely novel approach exists to better describe a particular type of data. Nevertheless, several avenues exist for incremental improvements to existing approaches. Developing strategies that incorporate multiple SAGE tags to determine significance, rather than assessing them individually, is likely to bear fruit.
For example, as discussed in Chapter II, there are several mechanisms that result in the capture of multiple tags from a single transcript. Better estimates of significance could be determined by utilizing this property. A similar approach could be developed for tags arising from genes under common regulatory control.

The analysis of the changes in the transcriptome during the early development of SCC presented in Chapter IV suggests a plethora of future work. The observation that many, and perhaps the majority, of changes occur during the development of pre-malignant squamous metaplasia invites additional gene expression profiling that compares tissue from a much larger set of these lesions to more advanced stages of tumour progression. Such a study would likely produce a much clearer picture of malignant transformation and reveal more dynamic aspects of transcriptome change not possible with the dataset used here.

Nevertheless, the genes identified in Chapter IV represent excellent candidates for future study. An obvious next step is to confirm their specific association with a given stage of progression by performing qPCR validation on a larger panel of new samples representing the complete spectrum of SCC development. Western blot experiments or immunohistochemical staining can be used to confirm the presence of the corresponding protein product in cases where an antibody is available. Assuming these experiments confirm their involvement in a given stage of progression, a number of general strategies can be envisioned for several candidates. In the case of EPHX1, which may undergo downregulation in the bronchial epithelium following smoking cessation, a longitudinal study examining the incidence of pre-malignant lesion formation or progression of existing lesions in individuals relative to this gene’s activity is a possibility.
In the case of C16orf89 and C5orf32, which show a very consistent loss of expression with the formation of metaplasia, almost nothing is currently known. Both are evolutionarily conserved in vertebrates, but there are no within-species similarities to other genes or recognizable protein domains, so their function is likely quite specialized (Wheeler, 2008). These genes could potentially be cloned to facilitate the isolation of their protein product, whereupon techniques like a pull-down assay using epithelial cell lysates may help identify potential interactions with proteins having known functions. A strategy like an siRNA knockdown assay may also identify a specific phenotype. A similar approach may also help elucidate the role of CRIP2 and TMPRSS2, which are downregulated during the progression to malignancy. In this case, a knockdown assay in a squamous cell line could be followed by looking for an increase in malignant phenotypes (e.g. increased mitotic rate).

Experiments with more direct clinical relevance are appealing for genes found to be upregulated during the progression to malignancy or the acquisition of invasive capabilities. In the case of the former, MCM7 is particularly attractive, given its strong performance as a prognostic marker in a range of tumours, including other lung cancer subtypes (Xue, 2003; Li, 2005; Facoetti, 2006a; Facoetti, 2006b; Feng, 2008; Nishihara, 2008). In the case of the latter, MMP11 shares similar promise as a prognostic indicator and may have additional utility as a molecular indicator of invasiveness in very early tumours. For example, although carcinoma in situ appears locally confined and is highly treatable by surgery alone, the presence of MMP11 may identify a subset of patients who are at increased risk of recurrence and would benefit from adjuvant chemotherapy or aggressive surveillance.
In conclusion, the use of genome-wide experimental approaches such as global gene expression profiling, along with the use of sophisticated and carefully applied bioinformatic techniques, has the potential to increase our understanding of cancer and uncover new avenues of attack against this deadly disease. The continued application and improvement of these approaches are a cause for great optimism, and will undoubtedly benefit human health.

BIBLIOGRAPHY

Baggerly, K. A., L. Deng, et al. (2003). "Differential expression in SAGE: accounting for normal between-library variation." Bioinformatics 19(12): 1477-83.
Baggerly, K. A., L. Deng, et al. (2004). "Overdispersed logistic regression for SAGE: modelling multiple groups and covariates." BMC Bioinformatics 5: 144.
Basset, P., C. Wolf, et al. (1993). "Expression of the stromelysin-3 gene in fibroblastic cells of invasive carcinomas of the breast and other human tissues: a review." Breast Cancer Res Treat 24(3): 185-93.
Canadian Cancer Society/National Cancer Institute of Canada (2008). Canadian Cancer Statistics 2008. Toronto, Canada.
Chang, D. F., N. S. Belaguli, et al. (2003). "Cysteine-rich LIM-only proteins CRP1 and CRP2 are potent smooth muscle differentiation cofactors." Dev Cell 4(1): 107-18.
Danenberg, P. V. (1977). "Thymidylate synthetase - a target enzyme in cancer chemotherapy." Biochim Biophys Acta 473(2): 73-92.
Denissenko, M. F., A. Pao, et al. (1996). "Preferential formation of benzo[a]pyrene adducts at lung cancer mutational hotspots in P53." Science 274(5286): 430-2.
Facoetti, A., E. Ranza, et al. (2006a). "Minichromosome maintenance protein 7: a reliable tool for glioblastoma proliferation index." Anticancer Res 26(2A): 1071-5.
Facoetti, A., E. Ranza, et al. (2006b). "Immunohistochemical evaluation of minichromosome maintenance protein 7 in astrocytoma grading." Anticancer Res 26(5A): 3513-6.
Feng, H. C., S. W. Tsao, et al. (2008). "Overexpression of prostate stem cell antigen is associated with gestational trophoblastic neoplasia." Histopathology 52(2): 167-74.
Ganoth, D., G. Bornstein, et al. (2001). "The cell-cycle regulatory protein Cks1 is required for SCF(Skp2)-mediated ubiquitinylation of p27." Nat Cell Biol 3(3): 321-4.
Jemal, A., R. Siegel, et al. (2007). "Cancer statistics, 2007." CA Cancer J Clin 57(1): 43-66.
Li, S. S., W. C. Xue, et al. (2005). "Replicative MCM7 protein as a proliferation marker in endometrial carcinoma: a tissue microarray and clinicopathological analysis." Histopathology 46(3): 307-13.
Lu, J., J. K. Tomfohr, et al. (2005). "Identifying differential expression in multiple SAGE libraries: an overdispersed log-linear model approach." BMC Bioinformatics 6: 165.
Maiorano, D., M. Lutzmann, et al. (2006). "MCM proteins and DNA replication." Curr Opin Cell Biol 18(2): 130-6.
Nishihara, K., K. Shomori, et al. (2008). "Minichromosome maintenance protein 7 in colorectal cancer: implication of prognostic significance." Int J Oncol 33(2): 245-51.
Rio, M. C. (2005). "From a unique cell to metastasis is a long way to go: clues to stromelysin-3 participation." Biochimie 87(3-4): 299-306.
Schweitzer, B. I., A. P. Dicker, et al. (1990). "Dihydrofolate reductase as a therapeutic target." FASEB J 4(8): 2441-52.
Smith, S. L., M. Gugger, et al. (2004). "S100A2 is strongly expressed in airway basal cells, preneoplastic bronchial lesions and primary non-small cell lung carcinomas." Br J Cancer 91(8): 1515-24.
Spruck, C., H. Strohmaier, et al. (2001). "A CDK-independent function of mammalian Cks1: targeting of SCF(Skp2) to the CDK inhibitor p27Kip1." Mol Cell 7(3): 639-50.
Tomlins, S. A., D. R. Rhodes, et al. (2005). "Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer." Science 310(5748): 644-8.
Tomlins, S. A., R. Mehra, et al. (2006). "TMPRSS2:ETV4 gene fusions define a third molecular subtype of prostate cancer." Cancer Res 66(7): 3396-400.
Velculescu, V. E., L. Zhang, et al. (1995). "Serial analysis of gene expression." Science 270(5235): 484-7.
Vencio, R. Z., H. Brentani, et al. (2004). "Bayesian model accounting for within-class biological variability in Serial Analysis of Gene Expression (SAGE)." BMC Bioinformatics 5: 119.
Warburg, O. (1956). "On respiratory impairment in cancer cells." Science 124(3215): 269-70.
Wheeler, D. L., T. Barrett, et al. (2008). "Database resources of the National Center for Biotechnology Information." Nucleic Acids Res 36(Database issue): D13-21.
Xue, W. C., U. S. Khoo, et al. (2003). "Minichromosome maintenance protein 7 expression in gestational trophoblastic disease: correlation with Ki67, PCNA and clinicopathological parameters." Histopathology 43(5): 485-90.

APPENDIX I
LIKELIHOOD OF OBSERVING A DITAG GIVEN BOTH CONTRIBUTING TAG SEQUENCES

ld:       ditag length (20-24, where e.g. CATGX20CATG is ld=20)
Θ10:      some arbitrary 10bp sequence
A, B:     some arbitrary, known nucleotide
N:        a known nucleotide, where the actual base is not relevant
X:        an unknown nucleotide
P(lt=x):  probability that a tag is length x
P(ld=x):  probability that a ditag is length x

In each table, rows give the first contributing tag (Θ) and columns give the second contributing tag (Φ).

CATGΘ10Φ10CATG (each entry × P(ld=28))

            Φ10NN    Φ10NX    Φ10XX
Θ10NN       1        1        1
Θ10NX       1        1        1
Θ10XX       1        1        1

CATGΘ10AΦ10CATG (each entry × P(ld=29))

            Φ10AN    Φ10AX    Φ10BN    Φ10BX    Φ10XX
Θ10AN       1        1        0.5      0.5      0.625
Θ10AX       1        1        0.5      0.5      0.625
Θ10BN       0.5      0.5      0        0        0.125
Θ10BX       0.5      0.5      0        0        0.125
Θ10XX       0.625    0.625    0.125    0.125    0.25

CATGΘ10ABΦ10CATG

Φ10BA:
  Θ10AA          P(lt=10)P(lt=12) + P(lt=11)^2
  Θ10AB          P(ld=30)
  Θ10AX          P(lt=10)P(lt=12) + P(lt=11)^2 + P(lt=12)P(lt=11)×0.25
  Θ10BN, Θ10BX   P(lt=10)P(lt=12)
  Θ10XX          P(lt=10)P(lt=12) + P(lt=11)^2×0.25 + P(lt=12)P(lt=11)×0.0625

Φ10BB:
  Θ10AA          P(lt=11)^2
  Θ10AB          P(lt=11)^2 + P(lt=12)P(lt=11)
  Θ10AX          P(lt=11)^2 + P(lt=12)P(lt=11)×0.25
  Θ10BN, Θ10BX   0
  Θ10XX          P(lt=11)^2×0.25 + P(lt=12)P(lt=11)×0.0625

Φ10BX:
  Θ10AA          P(lt=10)P(lt=12)×0.25 + P(lt=11)^2
  Θ10AB          P(lt=10)P(lt=12)×0.25 + P(lt=11)^2 + P(lt=12)P(lt=11)
  Θ10AX          P(lt=10)P(lt=12)×0.25 + P(lt=11)^2 + P(lt=12)P(lt=11)×0.25
  Θ10BN, Θ10BX   P(lt=10)P(lt=12)×0.25
  Θ10XX          P(lt=10)P(lt=12)×0.25 + P(lt=11)^2×0.25 + P(lt=12)P(lt=11)×0.0625

Φ10AN, Φ10AX:
  Θ10AA          0
  Θ10AB          P(lt=12)P(lt=10)
  Θ10AX          P(lt=12)P(lt=10)×0.25
  Θ10BN, Θ10BX   0
  Θ10XX          P(lt=12)P(lt=10)×0.0625

Φ10XX:
  Θ10AA          P(lt=10)P(lt=12)×0.0625 + P(lt=11)^2×0.25
  Θ10AB          P(lt=10)P(lt=12)×0.0625 + P(lt=11)^2×0.25 + P(lt=12)P(lt=11)
  Θ10AX          P(lt=10)P(lt=12)×0.0625 + P(lt=11)^2×0.25 + P(lt=12)P(lt=11)×0.25
  Θ10BN, Θ10BX   P(lt=10)P(lt=12)×0.0625
  Θ10XX          P(ld=30)×0.0625

CATGΘ10ABCΦ10CATG (each entry × P(ld=31))

                 Φ10CB      Φ10CA      Φ10CX      Φ10BN, Φ10BX   Φ10XX
Θ10AB            1          0.5        0.625      0              0.15625
Θ10AC            0.5        0          0.125      0              0.03125
Θ10AX            0.625      0.125      0.25       0              0.0625
Θ10BN, Θ10BX     0          0          0          0              0
Θ10XX            0.15625    0.03125    0.0625     0              0.015625

CATGΘ10ABCDΦ10CATG (each entry × P(ld=32))

            Φ10DC      Φ10DB    Φ10DX       Φ10XX
Θ10AB       1          0        0.25        0.0625
Θ10AC       0          0        0           0
Θ10AX       0.25       0        0.0625      0.015625
Θ10XX       0.0625     0        0.015625    0.00390625

APPENDIX II
MODEL FITTING R SOURCE CODE

Sample Data

    # generate some sample data; replace with actual data
    counts <- c( 9, 13, 11, 8, 9, 20, 16, 19, 18, 15 )
    library.sizes <- rep( 100000, 10 )
    # generate some class labels; replace with actual labels
    # in this example, first 5 are class 0, last 5 are class 1
    classes <- c( rep( 0, 5 ), rep( 1, 5 ) )

Log-linear (Poisson) regression model

    # perform the model fit
    fit <- glm( counts ~ offset(log(library.sizes)) + classes,
                family=poisson(link="log") )
    # get the beta coefficients
    beta0 <- fit$coefficients[[1]]
    beta1 <- fit$coefficients[[2]]
    # get the expression (as a proportion) for each group
    prop0 <- exp(beta0)
    prop1 <- exp(beta0+beta1)
    # calculate the significance score for differential expression
    # i.e. null hypothesis is that beta1 = 0
    t.value <- summary(fit)$coefficients[,"z value"][2]
    p.value <- 2 * pt( -abs(t.value), fit$df.residual )

Overdispersed log-linear regression model

    # requires the MASS library
    library( MASS )
    fit <- glm.nb( counts ~ offset(log(library.sizes)) + classes )
    # get the beta coefficients and dispersion
    beta0 <- fit$coefficients[[1]]
    beta1 <- fit$coefficients[[2]]
    dispersion <- 1 / fit$theta
    # get the expression (as a proportion) for each group
    prop0 <- exp(beta0)
    prop1 <- exp(beta0+beta1)
    # calculate the significance score for differential expression
    # i.e. null hypothesis is that beta1 = 0
    t.value <- summary(fit)$coefficients[,"z value"][2]
    p.value <- 2 * pt( -abs(t.value), fit$df.residual )

Poisson mixture model

    # requires the flexmix library
    library( flexmix )
    # set fitting control parameters to settings that work well for SAGE
    custom.FLXcontrol <- list( iter.max=200, minprior=0, tolerance=1E-6,
                               verbose=0, classify="hard", nrep=1 )
    custom.FLXcontrol <- as( custom.FLXcontrol, "FLXcontrol" )
    # specify the maximum number of model components
    maxk <- 5
    # specify the number of fit attempts per component
    fit.attempts <- 5
    # create objects to store the fit for each k
    fits <- list()
    aic.fits <- rep( NA, maxk )
    # fit models with an increasing number of components
    for( k in 1:maxk ) {
      # don't bother fitting if there are fewer distinct values than k
      if( length(unique((counts+1)/(library.sizes+2))) < k ) break
      # make an initial "good" guess of class membership using k-means;
      # this helps avoid falling into a local likelihood maximum
      cm <- rep( 1, length(counts) )
      if( k > 1 )
        cm <- kmeans( (counts+1)/(library.sizes+2), centers=k )$cluster
      for( i in 1:fit.attempts ) {
        fit <- try( flexmix( counts ~ 1, k=k,
                             model=FLXglm( family="poisson",
                                           offset=log(library.sizes) ),
                             control=custom.FLXcontrol, cluster=cm ),
                    silent=TRUE )
        # if fitting failed (did not converge), try again
        if( "try-error" %in% class(fit) ) next
        # keep the first successful fit, or any later fit that improves AIC
        if( is.na( aic.fits[k] ) || AIC(fit) < aic.fits[k] ) {
          fits[[k]] <- fit
          aic.fits[k] <- AIC( fit )
        }
      }
      # if the fit failed all attempts, do not continue trying
      # to fit with an increasing number of components
      if( is.na( aic.fits[k] ) ) break
      # if the fit found fewer components than specified, do not
      # continue to fit with an increasing number of components
      if( max(clusters(fits[[k]])) < k ) break
    }
    # what is the optimal k? (minimum AIC)
    k.optimal <- which( aic.fits == min( aic.fits, na.rm=TRUE ) )[1]
    fit.optimal <- fits[[k.optimal]]
    # get the theta parameters (component coefficients)
    thetas <- array( dim=k.optimal )
    for( i in 1:k.optimal )
      thetas[i] <- parameters(fit.optimal, component=i)$coef
    # get the pi parameters (mixing coefficients)
    pis <- attributes(fit.optimal)$prior
    # what is the test statistic score that the fitted components
    # differentiate between groups? (pmm.testStatistic is defined below)
    confidence.up <- pmm.testStatistic( fit.optimal, k.optimal, classes,
                                        downreg=F )
    confidence.down <- pmm.testStatistic( fit.optimal, k.optimal, classes,
                                          downreg=T )

Mixture model test statistic

    "pmm.testStatistic" <- function( fit, k, classes, downreg=T,
                                     groups=NULL ) {
      # groups is indexed below as a two-column matrix: column 1 holds a
      # class label and column 2 the group (0 or 1) it is assigned to.
      # ASSUMPTION: when groups is not supplied, each class label is
      # treated as its own group (labels are expected to be 0 and 1).
      if( is.null( groups ) ) {
        labels <- sort( unique( classes[which(!is.na(classes))] ) )
        groups <- cbind( labels, labels )
      }
      if( k == 1 ) return( 0 )
      # get mixture component coefficients
      coefs <- array( dim=k )
      for( i in 1:k ) coefs[i] <- parameters(fit, component=i)$coef
      # get posterior probabilities
      post.probs <- matrix( ncol=k,
                            data=as.numeric(fit@posterior[["unscaled"]]) )
      # get mixing coefficients
      pis <- attributes(fit)$prior
      # scale the posterior probabilities
      post.probs <- post.probs / rowSums(post.probs)
      # reorder the coefs and posterior probabilities
      post.probs <- post.probs[,order(coefs,decreasing=downreg)]
      coefs <- coefs[order(coefs,decreasing=downreg)]
      scores <- rep( NA, k-1 )
      for( tau in 1:(k-1) ) {
        class.probs <- rep( 1,
          length( unique( classes[which(!is.na(classes))] ) ) )
        post.probs2 <- post.probs[,c(1:tau)]
        if( tau > 1 ) post.probs2 <- rowSums(post.probs2)
        for( cls in unique(classes[which(!is.na(classes))]) ) {
          class.probs[cls+1] <- sum(post.probs2[which(classes==cls)])
        }
        p0 <- 1
        for( idx in groups[which(groups[,2]==0),1] ) {
          p0 <- p0*class.probs[idx+1]
        }
        p1 <- 1
        for( idx in groups[which(groups[,2]==1),1] ) {
          p1 <- p1*(1-class.probs[idx+1])
        }
        scores[tau] <- p0*p1
      }
      # report the best score over all component cut-points tau
      return( max( scores ) )
    }

APPENDIX III
DERIVATION OF POISSON MIXTURE MODEL DIFFERENTIAL EXPRESSION CONFIDENCE SCORE

Given Bayes Theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

First, define A as the sample class ω (e.g. normal, cancer) and B as a mixture component (or components) k. Therefore, re-write Bayes Theorem as:

$$P(\omega \mid k) = \frac{P(k \mid \omega)\,P(\omega)}{P(k)}$$

The probability of sample type ω arising from mixture component k is given as:

$$P(k \mid \omega) = \frac{1}{N_\omega} \sum_{j \in \omega} \tau_{jk}$$

where $\tau_{jk}$ is the posterior probability that observation j (where $j \in \omega$ is the subset of observations of type ω) arose from component k and $N_\omega$ is the number of libraries of type ω.

The probability that a sample is of type ω is simply:

$$P(\omega) = \frac{N_\omega}{N}$$

where N is the total number of libraries.
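The unconditional probability P(k) that a sample arose from mixture component k is the mixing coefficient $\pi_k$.

Substituting terms, we arrive at:

$$P(\omega \mid k) = \frac{\dfrac{1}{N_\omega} \sum_{j \in \omega} \tau_{jk} \cdot \dfrac{N_\omega}{N}}{\pi_k}$$

Finally, by cancelling like terms, we have the expression:

$$P(\omega \mid k) = \frac{\sum_{j \in \omega} \tau_{jk}}{N\,\pi_k}$$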
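As a check on the derivation, the following worked example plugs small numbers into the result. It assumes the final expression reads $P(\omega \mid k) = \sum_{j \in \omega} \tau_{jk} / (N \pi_k)$. The four libraries and their posterior probabilities are hypothetical values chosen for illustration, not data from this thesis. The mixing coefficient is taken to equal the mean posterior probability for the component, which is what an EM fit would be expected to return at convergence.

```latex
% Hypothetical setup: N = 4 libraries, two normal and two cancer.
% Posterior probabilities of membership in component k = 1:
%   normal libraries: 0.9, 0.8      cancer libraries: 0.2, 0.1
\begin{align*}
\pi_1 &= \tfrac{1}{4}\,(0.9 + 0.8 + 0.2 + 0.1) = 0.5
  && \text{(assumed: mean posterior)} \\
P(1 \mid \text{normal}) &= \tfrac{1}{2}\,(0.9 + 0.8) = 0.85 \\
P(\text{normal}) &= \tfrac{2}{4} = 0.5 \\
P(\text{normal} \mid 1)
  &= \frac{P(1 \mid \text{normal})\,P(\text{normal})}{\pi_1}
   = \frac{0.85 \times 0.5}{0.5} = 0.85 \\
P(\text{normal} \mid 1)
  &= \frac{0.9 + 0.8}{4 \times 0.5} = 0.85
  && \text{(cancelled form, same value)} \\
P(\text{cancer} \mid 1)
  &= \frac{0.2 + 0.1}{4 \times 0.5} = 0.15
\end{align*}
```

Under that assumption about $\pi_1$, the two class probabilities for the component sum to one, and the long form and the cancelled form agree.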
