Open Collections

UBC Theses and Dissertations

The application of the permutation test on genome wide expression analysis Chan, Timothy 2006

The Application of the Permutation Test in Genome Wide Expression Analysis

by Timothy Chan
B.Sc., UBC, BC, Canada, 2002

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in the Faculty of Graduate Studies (Computer Science)

The University of British Columbia
March 2006
© Timothy Chan, 2006

Abstract

We are now in a new era. The recent completion of the entire sequence of the human genome and the arrival of high-throughput gene expression technologies have transformed the era of molecular biology into the era of genomics. Already, such technologies are showing great promise in disease classification and in the identification of gene targets. However, as with any exciting new technology, great promise and anticipation can lead to wasted resources and false hope. It is critical that we recognize the experimental limitations of these new technologies and, most importantly, that hidden problems be addressed.

The primary goal of a high-throughput gene expression experiment is to identify genes of interest that are differentially expressed between two sample groups. This thesis addresses two key issues that have hindered high-throughput gene expression technologies. The first is the sample size issue. Small sample sizes reduce statistical confidence and are much more sensitive to outliers. We show that by using a nonparametric statistical test known as the permutation test, we can achieve higher accuracy than with conventional parametric statistical tests such as the t-test.

The second issue we address is the use of housekeeping genes for normalization of mRNA levels. It is well known that many biological experiments require a set of reference genes that are highly expressed and constant from sample to sample. The choice of reference genes is critical, as the wrong choice can have dire effects on subsequent analyses.
To address this issue, we developed a methodology based on SAGE, a genome wide expression technology that does not require normalization. Our results suggest that reference genes chosen by our methodology are more appropriate for mRNA normalization than the standard set of housekeeping genes. Furthermore, our results suggest that reference genes are more effective if chosen in a tissue-specific manner.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgments
Dedication
1 Introduction
  1.1 The Genomic Era
  1.2 Challenges in Genome Wide Expression Analysis
    1.2.1 Sample Size
    1.2.2 Use of Housekeeping Genes for Normalization
  1.3 Contributions
    1.3.1 Differential Gene Expression Finder
    1.3.2 Constant Gene Expression Finder
  1.4 Outline
2 Background
  2.1 The Central Dogma of Biology
  2.2 Genome Wide Expression Technologies
    2.2.1 DNA Microarrays
    2.2.2 SAGE Technology
    2.2.3 SAGE versus DNA Microarrays
  2.3 Statistical Tests
    2.3.1 Parametric Tests
    2.3.2 Nonparametric Tests
3 Differential Gene Expression Finder
  3.1 Related Work
  3.2 Preprocessing SAGE Data
  3.3 The Permutation Framework
  3.4 Breast Versus Brain Cancer SAGE Libraries
    3.4.1 Summary of Validation of Breast Cancer and Brain Cancer Genes
    3.4.2 Discussion
  3.5 Biomarkers in Early Stages of Lung Cancer
    3.5.1 Background and Related Work
    3.5.2 Materials
    3.5.3 Candidate CIS Biomarkers
    3.5.4 Desmosomes as Candidate CIS Biomarkers
    3.5.5 Invasive Candidate Biomarkers
    3.5.6 Other Observed Gene Expression Patterns
    3.5.7 Summary
4 Constant Gene Expression Finder
  4.1 Related Work
  4.2 Novel Methodology
  4.3 Validation
  4.4 Importance of Tissue Specificity
5 Conclusion and Future Directions
  5.1 Conclusion
  5.2 The Future of Gene Expression Analysis
Bibliography
Appendix A Central Dogma of Biology
Appendix B Genome-wide Expression Technologies
  B.1 DNA Microarrays
  B.2 SAGE Technology
Appendix C Validation of Differentially Expressed Genes
  C.1 Validation of Differentially Expressed Genes of Breast Data
  C.2 Validation of Differentially Expressed Genes of Breast Data
Appendix D Normalization of Expression Using Permutation Test and SAGE (NEPS) Reference Genes
  D.1 Breast-SAGE Reference Genes
  D.2 Brain-SAGE Reference Genes
  D.3 Lung-SAGE Reference Genes
  D.4 All-SAGE Reference Genes

List of Tables

3.1 Statistical Comparison of the Top 30 Genes for Breast Data
3.2 Statistical Comparison of the Top 30 Genes for Brain Data
4.1 Comparison of the Top 40 Most Differentially Expressed Genes of Microarray Data Normalized by Two Approaches
D.1 Breast-SAGE Reference Genes
D.2 Brain-SAGE Reference Genes
D.3 Lung-SAGE Reference Genes
D.4 All-SAGE Reference Genes

List of Figures

2.1 Experimental steps in Serial Analysis of Gene Expression (SAGE)
2.2 Illustrates that every additional sample increases the power of the test exponentially
3.1 Illustrates that the higher the number of iterations, the more stable the list becomes
3.2 Diagram of a typical box and whisker plot
3.3 Graphical Representation of Steps of Analysis for CIS Specific Genes
3.4 Boxplots showing gene expression patterns of genes with high CIS expression
3.5 CIS genes that are down-regulated
3.6 A Model of the Basic Structure of Desmosomes
3.7 Expression pattern of DSG3 in normal (N), metaplasia (M), CIS (C), and invasive (I) stages
3.8 Expression pattern of DSC2 and DSC3 in normal (N), metaplasia (M), CIS (C), and invasive (I) stages
3.9 Expression patterns of armadillo proteins in normal (N), metaplasia (M), CIS (C), and invasive (I)
3.10 Gene expression patterns of a desmoplakin (plakin family) in normal (N), metaplasia (M), CIS (C), and invasive (I)
3.11 Gene Expression Patterns of Candidate Invasive Biomarkers
3.12 Gene Expression Patterns of Candidate Invasive Biomarkers
3.13 Boxplots showing genes that are down regulated in lung cancer regardless of stage
3.14 Boxplots showing gene expression pattern for CIS and Invasive specific stages
4.1 Raw Count versus Variability on Lung SAGE data
4.2 Table showing relative comparison of validation results
4.3 Boxplots that depict the importance of tissue specificity

Acknowledgments

There are a number of people I would like to acknowledge. First and foremost, I am most indebted to my supervisor for his guidance and wisdom. Without him, I would never have finished this thesis. Second, I am forever grateful to Wan Lam, Stephen Lam and Calum MacAulay at the BC Cancer Research Centre for their invaluable contributions, including providing me with the data set to work with. Third, I would like to thank my mom and family for their financial and emotional support.

Timothy Chan
The University of British Columbia
March 2006

To my father.

Chapter 1 Introduction

1.1 The Genomic Era

The Human Genome Project was launched in 1988 as a worldwide effort to decode the entire human DNA. The main goals of the project included identifying all 20 to 30 thousand human genes and accurately sequencing the entire human DNA. By February 15, 2001, the International Human Genome Mapping Consortium had published its first physical map in Nature, covering approximately 96% of the human genome [1]. Shortly after, on April 14, 2003, the project was officially completed, two years ahead of schedule. The project was accomplished by shotgun sequencing with the aid of a physical map of the genome.

This great scientific accomplishment was only the birth of the genomic era. Today researchers are actively using genome research to fight common diseases such as diabetes, cancer, and heart disease.
Decade-old technologies such as DNA microarrays and Serial Analysis of Gene Expression (SAGE) are finally coming of age, showing great promise as tools to detect diseases in their early stages and to aid the discovery of candidate drug targets. However, due to the complexities of the biology and of the experimental procedures behind these techniques, we currently cannot use these technologies as standard tests in hospitals and clinics. In this thesis, we address some of these challenges.

1.2 Challenges in Genome Wide Expression Analysis

In the past decade, there has been an exponential growth in biological experimental data. Two primary technologies responsible for this rise are Serial Analysis of Gene Expression (SAGE) and the more popular DNA microarrays. These two technologies were developed around the same time and have revolutionized genome-wide expression analysis. One of the key goals of these technologies is to allow the identification of genes that are either up or down regulated in specific conditions (e.g., diseases or environmental conditions). A second goal is to map these expression levels to biological pathways so that biologists may begin to formulate hypotheses on the condition of interest.

While collecting data poses some challenges, it is clear that the biggest challenge lies in the analysis of the collected data. In this thesis, we address two main issues:

1.2.1 Sample Size

For every genome wide expression experiment, samples, which include tissues or cell cultures, are processed to extract the expressed mRNA molecules. These molecules are then collected and counted via the microarray or SAGE method. Often, samples are limited, and the results therefore suffer from poor statistical significance. This is due to two primary reasons:

1. COST: Gene expression technologies are quite expensive, and thus the number of samples is often limited by cost. For instance, typical Affymetrix DNA microarray chips cost around 550 USD each.
Moreover, there is still the cost of obtaining the sample, the cost of microarray analysis software, and the cost of the scanner. A SAGE experiment is even more costly, as sequencing is an expensive process. In fact, to sequence the entire genome today, it would theoretically cost around 20 million dollars. Furthermore, a single SAGE library can cost upwards of $50,000 depending on how deeply the sample is sequenced.

2. RARE SAMPLES: Sometimes, even if funding is not an issue, the samples are simply rare. For example, in our collaboration with the BC Cancer Research Centre (BCCRC), we studied the early detection of lung cancer. Lung cancer is notorious for being difficult to detect in its early stages. One main reason for this is that lung cancer is usually symptomless until its late invasive (INV) stages. Thus, early lesions known as Carcinoma In Situ (CIS) are rarely encountered. Even after a few years of data collection, the number of CIS samples in BCCRC's cohort has only reached about 10. Failure to detect lung cancer at its early stages leads to poor prognoses. Thus, the survival rate of lung cancer patients is among the lowest of all cancer types.

Small sample sizes are a major problem because of the high rate of false positives in genome wide expression experiments. In fact, both the SAGE and the microarray method require some sort of additional biological validation (e.g., RT-PCR) before their results are accepted as evidence. Thus, it is crucial that we minimize the number of false positives, as each of these validation procedures takes time and money.

1.2.2 Use of Housekeeping Genes for Normalization

Normalization is a preprocessing step that is required in many biological experiments to balance the non-biological differences between different samples so that they can be compared.
Usually, housekeeping genes are used as references because they are believed to be expressed at a constant and abundant level (hereafter referred to as the constancy requirement) regardless of whether the cell is active, dividing, or simply idle. Experiments such as QRT-PCR and DNA microarrays rely on such genes for normalization. Microarrays in particular are extremely sensitive to normalization because of the many sources of error and the nature of the data. The data is intensity-based and represents the number of transcripts that naturally hybridize to a DNA microarray chip. Interpreting intensity data as mRNA count numbers is a controversial issue. Furthermore, DNA microarrays are also sensitive to many other factors, such as the hybridization rate in varying environments, spot intensity issues, and many others described in the next chapter. These variations make inter-chip comparisons difficult. Thus, microarray data must be examined with much scrutiny, as an intensity value of, say, 2 on chip A may not mean the same as an intensity value of 2 on chip B. Normalization is a critical step so that non-biological and biological differences can be distinguished.

In the past few years, it has become increasingly apparent that the commonly used housekeeping genes, such as glyceraldehyde-3-phosphate dehydrogenase (GAPDH) and beta actin (ACTB), violate the constancy requirement [35] [36] [37] [38]. The problem is that normalizing to such genes can skew data and affect any subsequent analyses.

1.3 Contributions

To address the above problems, our method uses two major components. First, we use a nonparametric statistical test known as the permutation test to deal with the small sample size problem. The permutation framework has the following benefits:

1. It is designed for small sample sizes.
2. It does not make any assumption about the underlying population (it is a nonparametric test).
3. It is robust to outliers.
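To make the idea of a permutation test concrete, the following is a minimal sketch of a generic, textbook two-sample permutation test on the difference in group means. It is not the permutation framework developed in Chapter 3 of this thesis; the function name `permutation_p_value` and the example tag counts are hypothetical and chosen only for illustration. Because the groups are small, every possible relabelling of the pooled samples is enumerated exactly rather than sampled at random.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in group means.

    Hypothetical helper for illustration only: it pools both groups,
    enumerates every relabelling that keeps the group sizes fixed, and
    returns the fraction of relabellings whose absolute mean difference
    is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)

    extreme = 0
    n_splits = 0
    # Each combination picks which pooled values are labelled "group A".
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance so the observed labelling always counts itself.
        if diff >= observed - 1e-12:
            extreme += 1
        n_splits += 1
    return extreme / n_splits


# Hypothetical tag counts for one gene in two groups of three libraries.
p_separated = permutation_p_value([1, 2, 3], [10, 11, 12])
p_identical = permutation_p_value([5, 5], [5, 5])
```

With three libraries per group there are only 20 possible relabellings, so in this example the smallest attainable two-sided p-value is 2/20. This is one way to see why each additional sample matters so much when sample sizes are small.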
Second, we use SAGE (Serial Analysis of Gene Expression) for the following benefits:

1. It is an open method, so there is no bias toward a predefined set of genes. That is, we do not need to know what genes we are looking for beforehand.
2. It does not have the same normalization issues as DNA microarrays, as SAGE is based on sequencing fragments of transcripts, which are counted in absolute terms.

As will be shown in Section 1.3.1 and Section 1.3.2, the contributions of this thesis are based on different ways of applying the permutation framework to SAGE in order to deal with the problems stated in Section 1.2.

1.3.1 Differential Gene Expression Finder

A common experiment in genome wide expression analysis is to find differentially expressed genes between diverse samples. For example, we may want to find the genes that are turned on in tumor tissues but turned off in normal tissues. A popular way to do this is to compare the mean cancer gene expression levels to the mean normal gene expression levels.

In this thesis, we will demonstrate the ability of the permutation framework to successfully identify differentially expressed genes between two diverse samples. To do this, we conducted two experiments. The first experiment shows that the permutation framework appears to be much more effective in identifying target candidate genes than the typical two-sample t-test of unequal variance. The data used here are the breast and brain SAGE data from NCBI's SAGEmap site (http://www.ncbi.nlm.nih.gov/SAGE/). The second experiment further shows that the permutation framework is effective in identifying biologically significant genes which warrant further biological investigation. That is, we demonstrate the hypothesis-generating power of the permutation test applied to SAGE data. Here, we use a cohort of lung cancer SAGE libraries at various stages (i.e., normal, metaplasia, CIS, and invasive) provided by our research partners at the BC Cancer Agency.
More specifically, using this set of libraries, we identified candidate biomarkers of lung cancer.

1.3.2 Constant Gene Expression Finder

Interestingly, the permutation framework is also useful for doing the reverse. As will be seen in Chapter 4, it is also useful in identifying constantly expressed genes. Experimentally, this has never been done on a global scale except with microarrays. However, extracting constantly expressed genes from a method that itself requires normalization is arguably inappropriate. Thus, this thesis proposes to use the permutation framework on SAGE to find constantly expressed genes and then to use these genes as references (i.e., as housekeeping genes). Our results suggest that our set of SAGE-selected reference genes is much more appropriate than the commonly used housekeeping genes. As will be shown in Chapter 4, it may also be important that these reference genes be tissue specific. That is, the reference genes for the lung may not be the same as the reference genes for breast tissue.

1.4 Outline

The outline of this thesis is as follows. In the next chapter, we first summarize some basic biological concepts. We then go into some details of the two most popular gene expression technologies: Serial Analysis of Gene Expression (SAGE) and DNA microarrays. Lastly, we describe the permutation test in detail, including its steps and algorithm. In Chapter 3, we describe two experiments that use the permutation test as a differential gene expression finder. The first experiment analyzes brain and breast SAGE data and compares the permutation test's performance with that of the standard two-sample t-test of unequal variance. The second experiment analyzes lung SAGE data provided by the BC Cancer Agency and compares the different stages of lung carcinoma in an attempt to discover potential biomarkers. In Chapter 4, we describe a novel normalization procedure that identifies constantly expressed genes with the permutation test.
Finally, we conclude with a brief summary and suggest further directions.

Chapter 2 Background

2.1 The Central Dogma of Biology

It is well known that our cells carry units called genes that are responsible for the various physical and mental characteristics that define us. Genes are found in our DNA and contain templates for specialized messenger molecules known as RNA. RNA in turn contains information used to make the biological worker molecules known as proteins. Proteins have a variety of roles, including signaling, immunity, physical structure, and inhibition or promotion of growth. The central dogma of biology refers to the flow of information from DNA to RNA (a process known as transcription) to protein (a process known as translation).

While the human cell is estimated to contain about 30,000 genes, the average cell has only about 20 percent of its genes turned on. From a molecular perspective, the difference between muscle cells, skin cells, and nerve cells is simply the set of genes that are turned on or off in the particular cell type. This differential expression profile defines the type of cell. A mature cell's expression profile can also change in response to changes in the environment, allowing the cell to adapt and survive. For instance, heat shock proteins are proteins whose levels are elevated in stressed conditions such as an increase in temperature or other environmental changes. One of the most important roles of these proteins is to help other proteins attain and maintain their conformations. In environments where protein degradation is likely, the RNA of heat shock proteins increases. At the same time, the RNA of metabolic proteins is likely to decrease, as temperatures are not optimal for such activity.

As it turns out, diseased cells also have different genetic signatures from healthy cells. Thus, one of the key purposes of gene expression analysis is to identify key genes that are responsible for the onset and phenotype of the disease.
Such knowledge is extremely helpful when designing treatments.

2.2 Genome Wide Expression Technologies

Although we would ideally like to analyze the protein levels in cells, this is currently extremely difficult for various reasons. First, all proteins have their own behavior and many are sensitive to degradation. Thus, even if we can extract the proteins, they often degrade so fast that we are not able to measure their expression accurately. Second, proteins often interact, chemically and physically, with other proteins, forming large multi-protein complexes. This field of study is known as proteomics and is much more complicated than genomics. Protein microarrays are in the works but are years away from being as wide-scale, accurate and reproducible as gene expression technologies. Thus, for the past decade, researchers have mostly taken the indirect route of analyzing gene expression. Fortunately, these technologies have proved to be quite useful in extracting interesting information and, most importantly, in developing hypotheses for further validation and study.

Gene expression analysis began about a decade ago, when two genome wide expression technologies were introduced that revolutionized functional genomics by allowing cross-tissue comparisons of expression profiles: serial analysis of gene expression (SAGE) and DNA microarrays. Both methods have been widely used to elucidate complete gene expression profiles. Over the years, these methods have been enhanced via statistical, mathematical and experimental techniques to account for their various weaknesses. For instance, long-SAGE was developed to allow cloning of 20-nucleotide SAGE tags, leading to improvements in tag-to-gene mappings [9], while MicroSAGE was developed to allow SAGE to be done when tissue material is scarce [10].
In DNA microarrays, various techniques have been developed to account for its sensitivity to normalization, such as the use of control genes [30] [29]. These techniques will be discussed in the next section.

2.2.1 DNA Microarrays

DNA microarrays were developed to facilitate the understanding of how a cell coordinates the expression of thousands of genes in different conditions (e.g., a diseased state). When a microarray experiment is performed, a "snapshot" of a cell's global gene expression levels is taken, giving researchers a large list of genes that are currently activated or suppressed. This technology has allowed scientists to conduct many experiments that were inconceivable just over a decade ago. For example, researchers can now compare the genetic profiles of diseased and normal cells to extract possible drug targets and key genes that may have caused the onset of the disease, thus contributing to both early detection and prevention.

In the following section, we describe two of the more popular types of DNA microarrays along with their similarities, differences, strengths and weaknesses. We then describe the experimental steps required in a DNA microarray experiment.

Spotted Microarrays and Affymetrix Oligonucleotide Microarrays

Spotted microarrays require molecules known as cDNA. cDNAs are reverse-transcribed mRNA molecules and thus are free of the non-coding sequences found in premature mRNA (i.e., introns). Spotted microarrays are made up of several thousand cDNA probes that are fixed on a slide. In general, spotted microarrays are more popular than commercially produced arrays (e.g., Affymetrix chips) for two main reasons. First, these types of chips are relatively inexpensive compared to commercially produced chips. Second, spotted microarrays are very flexible, since they allow scientists to design their own arrays and spot any type of DNA of interest onto their customized slide.

Unfortunately, spotted microarrays have a few pitfalls.
One major drawback is the high error rate during the spotting process. Many of these errors are due to variations in the experimental procedures and in the environments in which the experiment is conducted. In fact, it is well known that results from one lab will often vary from those of another lab, and even vary from experiment to experiment. Also, most spotted arrays use cDNA as probes and consequently require known DNA for the PCR process that produces the probes. Even though the whole genome has been sequenced, not every human transcript is known yet. Lastly, this method results in long DNA fragments on the spotted slide and thus increases the chances of probes hybridizing to similar cDNA (gene) fragments. As a result, it is virtually impossible to detect small differences between DNA fragments (for example, SNPs cannot be detected).

The second common type of microarray is primarily produced by the company Affymetrix. These DNA chips differ from spotted microarrays in two distinct ways. First, this technology is based on hybridization of mRNA samples to small, high-density arrays containing tens of thousands of synthetic oligonucleotides (synthetic DNA). This allows for the detection of SNPs (single nucleotide polymorphisms) and other small features of DNA (e.g., specialized regions such as enhancers and promoters, alternatively spliced genes, etc.). The second major difference is that only one sample is hybridized to the chip at a time, while spotted microarrays require a control and an experimental sample to be competitively hybridized onto the chip. Thus, Affymetrix chips produce data of absolute intensities, while spotted microarrays produce data of relative intensities.

Like spotted microarrays, these chips have their share of disadvantages. First, Affymetrix chips and the equipment to read them are costly. According to the Affymetrix website, a human gene chip currently goes for around 550 USD, and the basic fluidics station and scanner cost over $100,000.
As well, a custom chip would be even more costly. Second, the signals generated from different probes for the same gene often vary in magnitude (sometimes by as much as two-fold). Thus, combining these signal intensities into one intensity that represents an estimate of the abundance of the gene is an issue. Furthermore, since each Affymetrix chip is hybridized with only one sample, it has become a major challenge to normalize between the experimental and the control chips. These challenges will be discussed in the following section.

Experimental Steps of Affymetrix DNA Microarrays

A simple microarray experiment requires 6 basic steps:

1. Sample preparation: The process of obtaining the cleanest sample possible. This step is the most sensitive to errors in microarray experiments, as it is very difficult to obtain clean samples when they are subject to various environmental factors.
2. Array fabrication: The primary concern in this process is to develop arrays with high precision while lowering both the cost and the time to manufacture.
3. Hybridization of the sample to the array: In this step, researchers work on developing ways to increase the efficacy of hybridization.
4. Scanning and image analysis: Much of this part of the analysis deals with issues such as spot location, background correction and intensity assignments.
5. Normalization/data preprocessing: The goal of normalization is to distinguish non-biological differences from biological differences.
6. Data analysis: The objective here is to find interesting gene expression patterns and groups using various statistical and data mining methods.

Out of all these steps, normalization is arguably the most useful strategy to correct for non-biological systematic errors so that comparisons between arrays can be performed appropriately. However, in spite of years of research, the normalization of microarrays remains non-standardized.
Below we describe this process in detail, while the remaining steps are discussed in Appendix B.

Normalization of DNA microarray chips

Normalization in gene expression experiments is a broad term for the process of transforming mRNA expression values so that the non-biological differences between experiments are balanced and the real biological differences between experiments can be observed. This procedure is required in experiments such as RT-PCR and DNA microarrays. It should be stressed that any form of data manipulation does in itself introduce unwanted noise to the expression measures. Normalization is not a miracle step. The technique can only correct for minor variables; if the data is poor, no amount of normalization can save it.

Although many advances in the normalization of DNA microarrays have been made in the past decade, there is still no general consensus on which approach is the most appropriate. Unlike the standardized normalization process in RT-PCR, DNA microarray normalization is much more complicated. One primary reason is the numerous non-biological variables involved in a DNA microarray experiment. For example, a common variable that must be dealt with is the variation in the number of cells in different samples, since this means that we may have unequal quantities of starting DNA from array to array. Another common problem in microarrays is spatial effects. These problems range from the unequal distribution of solvent across the surface of the array to the quality of washing off non-hybridized mRNA. Furthermore, hybridization efficiency is another common issue. The strength of a hybridized DNA duplex can be affected by its GC versus AT content. GC pairings tend to be stronger, as they contain 3 hydrogen bonds, while AT pairings are weaker, with 2 hydrogen bonds. Also, since different genes often have homologous regions, cross hybridization is a possibility.
Moreover, statisticians must also take experimental bias into account. For example, one experimenter may decide to add fluorescent labels to all their chips in one step, while another may label each chip separately. The above is only a subset of the issues that need to be considered when analyzing DNA microarrays. In the next section, we will describe some ways to deal with some of these common variables.

Before describing some of these normalization options, it is important to note that the normalization process is applied after image analysis and before subsequent data mining analyses, including supervised and unsupervised clustering analyses (e.g., hierarchical clustering, K-means clustering, or self-organizing maps).

Some Normalization Options

Over the past decade, many normalization methods have been proposed, many of which were designed for the specific conditions of the experiment being run. That is, depending on what the experiment entails, there are choices to be made to yield the best results. For brevity, we will only describe a few of the more popular and newer methods.

The first step in normalization is to come up with an appropriate set of genes. Yang et al. suggested three basic approaches: (1) use all genes on the array; (2) use constantly expressed genes as references (i.e., housekeeping genes); (3) use control genes [29] [30].

All Genes on Array

This type of normalization is a simple approach and is based on the following assumption. Since only a small fraction of genes will be up or down regulated in the samples, on average, the up and down regulated genes ought to counter each other's effect on the total RNA count. In other words, there should be equal amounts of RNA in all samples, and thus the number of RNA molecules should also be roughly the same from experiment to experiment.
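As a rough illustration of the all-genes approach, the sketch below rescales each array so that its total intensity matches the average total across arrays. The function name `scale_to_common_total` and the example intensities are hypothetical, and this is only one simple way to act on the equal-total-RNA assumption, not a description of any particular published normalization method.

```python
def scale_to_common_total(arrays):
    """Rescale each array so its total intensity equals the mean total.

    `arrays` is a list of intensity lists, one list per chip, with genes
    in the same order on every chip. Hypothetical helper for illustration.
    """
    totals = [sum(chip) for chip in arrays]
    target = sum(totals) / len(totals)
    # One scaling factor per chip, applied to every gene on that chip.
    return [[value * target / total for value in chip]
            for chip, total in zip(arrays, totals)]


# Two hypothetical chips where the second is uniformly twice as bright.
scaled = scale_to_common_total([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
```

After scaling, both hypothetical chips have the same total and, in this contrived case, identical gene intensities, because the only difference between them was a global brightness factor.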
This assumption is valid for genome wide expression experiments covering the entire genome because we are dealing with thousands of genes of which only a handful are differentially expressed. The problem with this method is that most microarray chips do not cover the entire genome (especially for Homo sapiens). In addition, this method also assumes that the total mRNA remains constant under different experimental conditions. Consequently, this method is only really useful if the chips being compared have similar expression patterns, and it is invalid for chips with vastly different genetic signatures [29]. For instance, the genetic profiles of tumor cells are known to be vastly different from those of corresponding normal cells, since tumor cells are essentially in hyperdrive.

A variant of using all genes is normalizing to the total or ribosomal RNA. This method is based on the assumption that ribosomal RNA is constitutively expressed in all cells. Also, since ribosomal RNA makes up 90% of the total RNA in a typical cell, it was thought that it could be used as a normalization standard for small fluctuations in mRNA levels. However, studies have shown that this underlying assumption is not correct [36]. The total RNA produced in a cell can vary a lot from cell to cell, and it depends on many factors including the cell type and the conditions the cell is in.

Using Constantly Expressed Genes for Normalization

Normalizing to a set of constantly expressed genes is also a popular method and is used in Affymetrix's 133 series chips. Instead of using the whole set of genes for normalization, a small subset of genes is used, and chips are re-scaled so that the average value of each housekeeping gene is equal across chips. These genes are chosen based on the biological belief that they are required for a cell to function and live.
They are typically picked by selecting for abundant expression in all developmental stages of the cell and are assumed to be non-regulated over various conditions. However, as explained in detail later, these biological assumptions may be too strong. In fact, in recent years, there have been many studies evaluating the variability of these housekeeping genes [35], [36], [40], [38], [39]. Not surprisingly, we observed similar trends in our own results.

In the past few years, many have proposed ways to improve on this method. For instance, Kepler et al. proposed that the choice of housekeeping genes be made by looking for a stable background pattern of activity on the microarray. Under this assumption, a transcriptional core can be derived from this stable background pattern (identified statistically for each experiment). This method is referred to as normalization by self-consistency and uses the following steps to acquire the core. First, all genes are designated as core genes. Next, the whole chip is normalized to all the genes on the chip. After this step, all the genes that have not changed from the normalization process are kept. This process is repeated until the previous transcriptional core and the next transcriptional core are equal [31].

Another method of finding constantly expressed genes, proposed by Tseng et al., is a rank-invariant gene approach. This approach attempts to find genes that are non-differentially expressed by first separately computing the intensities of the fluorescent labels (of different colors). These are then compared, and if the ranks of the two differ by less than some arbitrarily set threshold and the rank of the averaged intensity is not among the highest l ranks or lowest l ranks, the gene is classified as a non-differentially expressed gene [32].

The problem with these two methods is that they are based on using on-chip data to find non-differentially expressed genes.
This, in itself, introduces a circular argument since these so-called constantly expressed genes are also subject to the same systematic technical variations as any other spot on the chip.

Control Genes

Controls are genes chosen because they have no relation to the experimental samples under study. Some Affymetrix chips employ such genes. One such type of control is known as spiked controls. They are synthetically generated DNA sequences or DNA sequences from a known organism different from the one under study. These genes are spotted on the array of both the control chip and the variable chip. These samples have an equal amount of mRNA and thus should have equal intensities and could be used for normalization. Another controls approach is known as the titration series approach, which involves spotting several different concentrations of the same genes or ESTs that span the range of intensities on the array. Unfortunately, this technique is technically challenging [29].

Scaling and Transformation

The next logical step after selecting which genes to use as references is to scale the rest of the data to match the relative distribution of these reference genes. There are several ways to approach this. For example, one simple approach is to scale to some arbitrary constant. This constant is often a calculated value such as the mean, trimmed mean, or median of all the intensity values across all the experiments. Sometimes replicates are generated to evaluate the relationships between chips, and often they reveal a non-linear relationship among probes. Thus, logarithmic transformations are often performed to stabilize such variances. Recently, Lu's studies have shown that eliminating outliers did not reduce variation among multiple arrays but rather increased it. Thus, he highly recommended using the mean of the logarithm-transformed signals to calculate the normalization constant [34].
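As a rough illustration of the scaling step described above, the following minimal Python sketch shifts each chip so that the mean of its log-transformed signals matches a common target. The function name `log_mean_scale`, the base-2 logarithm, the pseudocount and the choice of the average chip mean as the target are all assumptions made for this example; they are not taken from [34].

```python
import math


def log_mean_scale(chips, pseudocount=1.0):
    """Align chips on the mean of their log2-transformed intensities.

    `chips` is a list of chips, each a list of raw intensities.
    The pseudocount (an assumption of this sketch) avoids log2(0).
    Returns log2 intensities shifted so every chip has the same mean.
    """
    logged = [[math.log2(v + pseudocount) for v in chip] for chip in chips]
    chip_means = [sum(chip) / len(chip) for chip in logged]
    # Hypothetical target: the average of the per-chip log means.
    target = sum(chip_means) / len(chip_means)
    return [
        [v - mean + target for v in chip]
        for chip, mean in zip(logged, chip_means)
    ]


if __name__ == "__main__":
    # Two toy chips; the second is uniformly about twice as bright.
    scaled = log_mean_scale([[10, 100, 1000], [20, 200, 2000]])
    for chip in scaled:
        print([round(v, 3) for v in chip])
```

Because the shift is additive on the log scale, it corresponds to multiplying the raw intensities of a chip by a single constant, so differences between genes within a chip are preserved.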
Data Analysis

Recall that the last step in microarray analysis is simply the statistical tests applied to the cleaned-up data. Often researchers apply common statistical procedures such as ratios and absolute differences between intensities to find differentially expressed genes. Sometimes they will use more sophisticated tests such as the popular t-test. In addition, it is quite popular to cluster the data (unsupervised or supervised) in order to find biologically relevant groups of genes. These groups of genes could be used to identify subgroups of diseases at the genetic level.

2.2.2 SAGE Technology

Like DNA microarray technology, Serial Analysis of Gene Expression (SAGE) was developed to provide a comprehensive and quantitative gene expression profile of target cells. The SAGE technique was developed at Johns Hopkins University in the USA around 1995 by Velculescu et al. [5]. Unlike microarray techniques, which are based on relative expression levels (or absolute fluorescent intensity levels in the case of Affymetrix), it measures mRNA expression by counting representative short sequenced tags. The expression values are thus given in absolute terms. This avoids the need for the kind of normalization required for comparing multiple microarray chips. Another important advantage of the SAGE technique is its ability to detect small transcripts, which may be crucial genes as these could be the switches or pathway regulators leading to the onset of a disease like cancer. A third advantage of SAGE over microarrays is that it does not require the genes to be previously known. Thus, analysis can reveal previously unknown genes as potential markers. Lastly, SAGE libraries may be pooled. That is, even after an experiment is done, one can go back and sequence the same tissue library even deeper and simply add these results to the old data. As great as SAGE seems to be, it has one major pitfall.
Collecting SAGE data involves sequencing, which is a laborious and expensive process. At the current market rate, generating a SAGE library costs at least 100 times more than performing a microarray experiment. Thus, there are only a small number of SAGE libraries (a library corresponds to the measured mRNA expression levels of a sample or patient). For instance, as will be described in Chapter 3, when we conducted the brain and breast SAGE analysis 2 years ago, there were only 17 brain cancer libraries, 7 brain normal libraries, 10 breast cancer libraries and 5 breast normal libraries publicly available through the NCBI website. Thus, while SAGE libraries give high quality data, the small sample size problem is critical for the analysis of SAGE libraries. In Section 2.3, we will describe a statistical test designed for such a problem. However, before we get into the analysis, we describe the SAGE technique in detail below.

Methodology

Recall that the basic units that make up DNA and RNA are called nucleotides. To conduct SAGE, a method called sequencing is required to read these nucleotide sequences. This procedure is complex and costly in both money and time. In fact, if one were to sequence every mRNA molecule in a cell in its entirety, it would probably take a decade or so. Fortunately, as it turns out, fourteen nucleotides of an mRNA molecule are sufficient to identify the majority of the mRNA in a cell precisely. This was a major discovery, as an mRNA molecule can have thousands of nucleotides. The SAGE method is based on this idea of taking a small fragment (tag) of nucleotides from an mRNA molecule to represent the entire transcript of a particular gene. These tags are obtained by special cleaving enzymes (restriction enzymes) that recognize a specific sequence of nucleotides in the mRNA strand and cut around 14 nucleotides downstream from this recognized region.
This process will almost always obtain the same tag sequence if the transcript is the same. Of course, such a process raises the following question: is it possible that different genes can have the same tag? Yes, but because there are 14 base pairs, the likelihood is statistically acceptable. However, it is likely that more than one tag represents the same gene due to biological phenomena such as polymorphism and alternative splicing. In Chapter 3, we will describe these issues in more detail and how they are circumvented. A second question that may arise is: is it possible that some genes may not have this restriction enzyme site? Yes, but this is a relatively small percentage of genes, as restriction enzymes typically cleave at sites of no more than 4 nucleotides. Figure 2.1 briefly outlines the steps of a typical SAGE experiment (based on the figure at http://www.ncbi.nlm.nih.gov/Class/NAWBIS/Modules/Expression/exp82.html). Further details can be found in Appendix B.

2.2.3 SAGE versus DNA Microarrays

The choice of the SAGE technique over the DNA microarray hybridization technique depends on several factors including the amount of starting material, the number of samples, and the availability of resources [6]. For example, one major consideration for SAGE is the fact that around 1.5 × 10^6 bases must be sequenced in order to do a simple 2-sample library comparison. In spite of technological advancements, sequencing is still an expensive procedure as it requires an expensive automated sequencer. Thus, when the amount of sample material is not a consideration, the DNA microarray hybridization technique is much more cost effective and would yield many more results.
However, if we only have a few rare precious samples, the amplification step (PCR) in the SAGE method allows experiments to be done with sample sizes as low as 9 oocytes to 100,000 cells [6]. In fact, SAGE on a single cell has even been reported [7]. If there are a large number of samples, microarrays would be more appropriate as the costs are much lower.

[Figure 2.1: Experimental steps in Serial Analysis of Gene Expression (SAGE)]

2.3 Statistical Tests

Since few results in science are absolute, we must have a way to "weigh" our results. Statistics was developed to help us do so, so that we can determine whether an experiment is credible or not. In general, a hypothesis is generated and either accepted or rejected based on a mathematical measurement of confidence. A typical statistical experiment compares the values of a specific quantity between two samples, where each member has a value. For example, the expression values of a gene in a set of cancer cell lines and a set of normal cell lines may be compared to determine how likely it is that the gene in question is expressed differently in cancer than in normal cells.

2.3.1 Parametric Tests

Conventional statistics, also known as parametric tests, work well if the following assumptions are met:

1. The underlying population can be assumed. For example, a commonly used assumption is that the experimental values of the sample are normally distributed. This assumption is often referred to as the parametric assumption.
2. Overall, the samples have equal variances.
3. Samples are drawn from the population independently.

If a parametric assumption cannot be met, statisticians often apply various tactics to manipulate the data. As long as this is done with care and never used to adjust for a better p-value, such a procedure is permitted.
Such tactics include deletion of outliers, winsorization, and/or trimming the data. These procedures are tricky though, as we can never be sure whether the outlying values come from naturally occurring biological phenomena, chance, or a mistake in the data collection (e.g., sequencing errors). Furthermore, when sample sizes are small, deleting or manipulating any values can have drastic consequences for the final results. For instance, suppose we have two samples. Sample A consists of the values 4, 100 and 6. Sample B consists of 5, 5 and 5. The mean of sample A is 36.7 while the mean of sample B is 5. Thus, these two samples would be deemed different by a typical statistical test. However, if the outlier 100 is removed from sample A, samples A and B suddenly have the same mean.

2.3.2 Nonparametric Tests

Nonparametric tests are classes of statistical tests that require both the equal variance and the independence assumptions described above. The advantage of these types of tests is that they do not require a parametric assumption. The disadvantage is that distribution-free methods require great care because they are often less efficient, in that they detect false positives poorly. However, for situations where the distribution cannot be assumed (e.g., a small sample size), nonparametric tests are ideal.

Permutation Test

Often, we see inappropriate use of parametric tests conducted on only a handful of microarray or SAGE experiments. Here, we introduce the permutation test. This nonparametric test is designed for small sample sizes and makes no assumption about the underlying population. Furthermore, from Figure 2.2, we can see that for every additional sample, the power of the test increases exponentially [47]. In the following chapter, we will describe how we used the permutation test to conduct our analyses.
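The growth referred to in Figure 2.2 can be illustrated by counting how many distinct ways a pooled set of libraries can be split into a "normal" group and a "cancer" group. The Python sketch below is only an illustration: the function name `n_relabelings` is hypothetical, and the count it returns is the number of possible group assignments (which limits the smallest p-value an exact permutation test can report), not a direct measurement of statistical power.

```python
from math import comb


def n_relabelings(n_normal, n_cancer):
    """Number of distinct ways to choose which pooled samples are 'normal'."""
    return comb(n_normal + n_cancer, n_normal)


if __name__ == "__main__":
    # Equal group sizes, adding one sample to each group at a time.
    for k in range(2, 9):
        print(k, k, n_relabelings(k, k))

    # Library counts quoted in the text for the public SAGE data.
    print("breast (5 normal, 10 cancer):", n_relabelings(5, 10))   # 3003
    print("brain  (7 normal, 17 cancer):", n_relabelings(7, 17))   # 346104
```

With three samples per group there are only 20 possible assignments, so the count grows quickly as samples are added, which is why the plot uses a log scale.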
[Figure 2.2: Number of samples versus number of combinations (log scale). Illustrates that every additional sample increases the power of the test exponentially]

Chapter 3

Differential Gene Expression Finder

Cancer currently ranks second among diseases that kill humans. With the aging population and the increasing introduction of environmental complexities (e.g., pollution), it is likely that it will soon overtake cardiovascular disease as the number one killer of humans. The recent development of high-throughput technologies and the recent sequencing of the entire human genome have brought great promise of unraveling the mysteries of diseases such as cancer. However, progress in cancer research has been hindered by two main factors. First, cancer is a highly variable disease that is triggered by multiple genetic changes, making it difficult to pinpoint genetic targets of interest. Second, current high-throughput technologies such as DNA microarrays are not 100% reliable and often give rise to false positives (e.g., [44], [45]). Thus, the majority of findings from high-throughput technology cannot be considered ready and safe for clinical trials. For this to happen, validation of candidate genes is required, which is often laborious and unsuccessful. Consequently, it is critical that the differential gene expression selection process is as accurate as possible.

In this chapter, we demonstrate the importance of using an appropriate statistical test in identifying genes that are differentially expressed in tumors. As discussed in Chapter 1, one of the major problems that plagues experiments is the sample size problem. Much of the past research relied on parametric tests such as the t-test, which require a large sample size. Here, we introduce a nonparametric test known as the permutation test.
This test requires a much smaller sample size since, with each additional sample, the power of the test increases exponentially.

To show the appropriateness of this test we conducted two different experiments, both of which apply a permutation test algorithm to identify the genes that are the most differentially expressed. The first experiment has two main objectives: first, to compare the genetic signatures of two vastly different cancers (brain and breast cancer); second, to evaluate the performance of the permutation framework against the standard t-test. The second experiment focused on comparing the genetic signatures of different stages of the same type of cancer, in this case lung cancer. The goal here was to find genes that were specific to each stage and genes that were common to both stages. Before getting into the details of each of these experiments, we will first describe how we preprocessed the SAGE data in each case. Following this description, we will describe the finer details of the permutation framework we used to conduct our experiments. But first, we will briefly discuss recent findings in cancer research using high-throughput technologies.

3.1 Related Work

In the past decade, the development of high-throughput gene expression technology has created a stir of enthusiasm in the cancer research community. However, as with any new technology, such excitement and eagerness can often lead to false discoveries, as many problems may be overlooked. In the following sections, we will describe some of these experiments, some of which appear promising and some of which appear questionable.

Applications of High-throughput Gene Expression Technologies in Brain and Breast Cancer Research

One of the main goals of cancer research is to find reliable diagnostic or prognostic biomarkers for different grades of tissue-specific tumors.
Today, we rely on specialists to assess the grade and predict the behavior of a tumor based on an examination of its histology and morphology. These methods are highly subjective and thus not reliable. In brain cancer, for example, it has been shown that even among experienced neuropathologists, the final diagnosis is observer-dependent [46]. Thus, such biomarkers are desperately needed. A second goal of genomics in cancer research is to find the switches or regulators that may have caused the onset of the disease. Following such a discovery, researchers have the potential to design drugs to control these checkpoints in the pathway of the disease and thus treat the disease.

The advent of genomic technology has provided valuable tools to aid cancer researchers in both biomarker and switch/regulator discovery. Typically, normal and cancer tissues are compared in order to identify genes that are differentially expressed. Several research groups are already demonstrating the power and potential of such technologies. However, as will be seen below, accuracy is hindered by sample sizes and noise.

Ljubimova et al. used DNA microarray technology to identify 11 genes that were deemed differentially expressed in human gliomas, using a commercially produced DNA microarray called the Gene Discovery Array (GDA) from Incyte Genomics, Inc. These 11 genes were chosen based on the highest mean ratios of differential expression between normal brain tissues and brain tumor tissues. The problem with such an approach is that means are sensitive to outlying values. For instance, if one microarray experiment shows a very high gene expression level due to some noise, the gene will be deemed significant. Not surprisingly, only 2 of these 11 genes were verified via semi-quantitative RT-PCR and Northern blot analysis [45]. These results demonstrate that the chances of false positives are great. In another study, Watson et al.
identified 196 genes that demonstrated common differential patterns among different tumor grades of oligodendrogliomas. This study successfully showed that these expression patterns could be correlated with oligodendroglioma tumor grades. The main problem with this study is that the oligonucleotide microarray consisted of only 1100 genes and they only had 7 different samples/microarrays [46].

The accuracy of high-throughput gene expression studies for breast cancer is no better. Recently, Shen et al. demonstrated the loss of annexin A1 expression in human breast cancer using both SAGE and microarrays. For their microarray experiments, they used 2 normal and 7 breast cancer samples. They found 129 genes with a greater than five-fold change. To narrow the number of genes for further study, they used SAGE and EST analysis and identified only 4 qualified genes. Their expression pattern was further validated via RT-PCR. Following this analysis, ANXA1 was further confirmed at the mRNA level by the Human Breast Cancer Tissue Profiling Array and at the protein level. Only at this point were they satisfied that this gene is of significance [42]. This demonstrates that genomic technology is far from clinical use for screening patients.

Currently, there are few studies that compare brain and breast cancer at the molecular level. Ng et al., though, did show that by using a clustering algorithm called OPTICS on a subset of genes selected using the Wilcoxon test, they were able to create 8 distinct clusters including brain cancer, breast cancer, brain normal, and breast normal clusters. These results suggest that brain and breast cancers are also different at the molecular level. However, no subtypes of cancers were noticeably clustered, but this may have been due to the few libraries they had to work with [50].

To date, the success rate of these genome wide expression studies has not been as high as anticipated only a decade ago.
Such experiments generate large lists of differentially expressed genes, but only a small portion of these genes are validated upon further analysis. Because genomics data are subject to many non-biological errors, statistical significance needs to be very strong. Another common problem is small sample sizes. Undertaking a genomics experiment is a risky and expensive endeavor. In Section 3.4, we will demonstrate how much an analysis can differ just by changing how we identify significant genes. But first, we will describe how we preprocessed the SAGE data and describe the permutation framework we used to identify differentially expressed genes.

3.2 Preprocessing SAGE Data

Recall that SAGE has various deficiencies. First, SAGE tags often map to more than one gene due to the short sequence that represents the mRNA. Second, genes may also map to more than one tag due to two major biological phenomena: alternative splicing and polymorphism. Both will be discussed later in greater detail. Furthermore, due to the time and money involved, the publicly available SAGE libraries created by different research institutes and universities vary in size, ranging from 10,000 to 60,000 tag types per library. In order to conduct appropriate analyses, the data must be preprocessed. Below we describe the preprocessing steps we took to minimize these deficiencies.

1. Gene to Tag Assignment

An average mature mRNA sequence is around 1800 to 2000 nucleotides in length. SAGE relies on a 10 base pair sequence to uniquely identify a gene. However, since there are only 4 different bases in total, a tag is likely to map to more than one gene. Not surprisingly, in practice, we found that there are numerous tags that map to more than one gene. Since it is impossible to determine which gene is actually being expressed, we dealt with this problem by assigning the expression level of the tag to each gene that it mapped to.
For example, if tag A maps to genes 1, 2, and 3, all three genes will be assigned the tag count of tag A.

2. Tag to Gene Collapsing

This preprocessing step deals with the inherent fact that genes often map to more than one tag. As mentioned above, this is primarily due to two major biological phenomena. The first, known as alternative splicing, arises in the post-processing of an mRNA molecule before it enters the cytoplasm. Before mRNA leaves the nucleus to be translated, certain regions called introns are spliced out and the remaining regions (exons) are concatenated, forming a much shorter mRNA sequence. These regions can be combined in different ways, creating alternatively spliced forms of the same gene and increasing protein diversity. However, these alternate forms may or may not possess the same functionality [2] [4]. This phenomenon has recently been estimated to occur in about 55% of the human genome [3]. The other major reason for multiple tags mapping to the same gene is that polymorphism may exist among the population.

For this situation, we simply summed the expression levels of the tags that mapped to the same gene. For instance, if tags A, B, and C all mapped to gene X, the expression levels of tags A, B, and C would be summed into one expression level. By performing this procedure we also get the added benefit of a reduction in the dimensionality of the data, which in turn significantly decreases computation time. It should be noted though that this process was not done in all our experiments. If the type of experiment involves searching for only highly expressed tags, then we found it reasonable to collapse, since high counts are less prone to sequencing errors. In addition, larger counts tend to have better confidence for tag-to-gene mapping. We will discuss this issue further in Chapter 4, where collapsing is performed.
3. Scaling the SAGE libraries

To deal with the varying sizes of SAGE libraries, we scaled the tag counts to an arbitrary large total tag count (e.g., 1 million) to reduce comparison errors.

3.3 The Permutation Framework

The permutation test itself is not a new concept or method. In fact, it was introduced in the 1930s. The advent of computers has made this resampling method relatively easy to implement. Several researchers have adopted the permutation concept. For example, Dudoit et al. used the permutation test's resampling techniques to eliminate the t-statistic's strong normality assumption [49]. In contrast to Dudoit et al.'s work, the framework of the permutation test that we used is based strictly on a test of means between two different distributions. Here, we assume one distribution is for the expression level of a particular gene in normal tissues (subscript n) and the other for cancerous tissues (subscript c). To select significant genes, we use the following hypotheses:

Null hypothesis: $H_0: \mu_c - \mu_n = 0$ (3.1)

Alternative hypothesis: $H_a: \mu_c - \mu_n \neq 0$ (3.2)

The null hypothesis states that the distribution of average gene expression levels in cancerous libraries is the same as that in non-cancerous libraries. If the null hypothesis is rejected, it indicates that the gene expression levels of normal and cancer samples are sufficiently different (the alternative hypothesis). A variation of this test has previously been used for identifying differentially expressed genes in time-course microarray experiments [48]. In the following section, we show our specific implementation of the test. Let N and C be the number of normal tissue samples and the number of cancerous tissue samples, respectively.

Algorithm

1. For each gene, take all the gene counts of the normal libraries and all the cancer libraries. Mix them up in an urn.
2. Randomly select N counts from the urn to create a simulated normal set, and calculate the simulated normal mean $\mu_{sn}$.
3.
Similarly, the remaining C counts form the simulated cancerous set. Calculate the simulated cancer mean $\mu_{sc}$.
4. Consider the random variable $v = \mu_{sc} - \mu_{sn}$, called the simulated difference.
5. Repeat steps 1 to 4 above M times. Let $\mu$ and $\sigma$ denote the mean and the standard deviation of $v$.
6. Now separate the libraries back into their true identity: normal or cancerous. Calculate the true observed difference $\theta = \mu_{rc} - \mu_{rn}$, where $\mu_{rc}$ denotes the true mean count of the cancerous libraries and $\mu_{rn}$ denotes the true mean count of the normal libraries.
7. Calculate the permutation score $PS = (\theta - \mu) / \sigma$.

These permutation scores statistically rate how different the two groups are from each other. The higher the permutation score, the greater the statistical confidence that the two populations are different; the lower the permutation score, the more statistically similar the two populations are.

In the algorithm shown above, we used M permutations (iterations) for each gene. Figure 3.1 plots the hit ratio for varying values of M. For the results reported here, we used M = 5,000 on the breast data, where N = 5 samples and C = 10 samples. A natural question to ask is whether this is sufficient. To answer this question, we re-ran the test using M = 10,000 on the same data and selected all the genes whose permutation score exceeds a certain threshold. For example, the threshold 1.96 corresponds to the tail ends beyond 95% of the area under the standard normal curve. This set of genes is then intersected with the corresponding set of genes found using M = 5,000. The ratio of the size of this intersection to the size of the set of genes found using M = 10,000 is called the hit ratio. Figure 3.1 shows that the set of genes stabilizes rather quickly, and M = 5,000 is a reasonable value.
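The urn procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the reported results: the function name `permutation_score`, the fixed random seed, the use of the population standard deviation for the spread of the simulated differences, and the convention of returning 0.0 when that spread is zero are all choices made for this example.

```python
import random
import statistics


def permutation_score(normal, cancer, m=5000, seed=0):
    """Permutation score for one gene.

    `normal` and `cancer` are the gene's counts in the normal and
    cancerous libraries. Returns (theta - mu) / sigma, where theta is
    the observed difference in means (cancer minus normal) and mu and
    sigma summarise m simulated differences from shuffled labels.
    """
    rng = random.Random(seed)        # seeded only for reproducibility
    urn = list(normal) + list(cancer)
    n = len(normal)

    simulated = []
    for _ in range(m):
        rng.shuffle(urn)             # mix the counts in the urn
        sim_normal = urn[:n]         # N counts form the simulated normal set
        sim_cancer = urn[n:]         # the remaining C counts form the cancer set
        simulated.append(
            statistics.fmean(sim_cancer) - statistics.fmean(sim_normal)
        )

    mu = statistics.fmean(simulated)
    sigma = statistics.pstdev(simulated)
    theta = statistics.fmean(cancer) - statistics.fmean(normal)

    if sigma == 0:                   # all counts identical: no evidence either way
        return 0.0
    return (theta - mu) / sigma


if __name__ == "__main__":
    # Made-up counts for one gene: 5 normal and 10 cancer libraries.
    normal_counts = [1, 2, 1, 3, 2]
    cancer_counts = [20, 22, 19, 25, 21, 23, 20, 24, 22, 21]
    print(permutation_score(normal_counts, cancer_counts))
```

In a genome-wide run this function would be applied to each gene in turn, and the genes would then be ranked by score as described in the next section.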
3.4 Breast Versus Brain Cancer SAGE Libraries

To perform this experiment we used a set of brain and breast SAGE libraries found in NCBI's public database. At the time of analysis, there were 17 brain cancer libraries, 7 brain normal libraries, 10 breast cancer libraries and 5 breast normal libraries. To satisfy the first objective, we first applied the permutation test to select the most significant differentially expressed genes in brain cancer (i.e., brain cancer versus normal brain libraries). We then sorted the permutation scores in descending order and examined the top 30 ranked genes. The higher the score or rank, the more statistically confident we are that the two groups are different. Similarly, we identified the top 30 ranked genes for breast cancer. To compare the two lists, we performed an intersection of the top 500 ranked genes.

[Figure 3.1: Hit ratio versus iterations. Illustrates that the higher the number of iterations, the more stable the list becomes]

Thirty genes were chosen for two main reasons. First, each list generated over 500 genes, and looking up 500 different genes in the literature would be a very laborious task. Second, we believed that 30 genes would be sufficient to show which of the two tests gave more accurate results.

Next, to evaluate the effectiveness of the permutation test, we performed a similar experiment using the 2-sample t-test of unequal variance. To obtain the t-value, we first calculated the mean ($\mu$) and the pooled variance ($S_p^2$) separately for both the cancerous libraries and the normal libraries. Then we computed the t-value using equation (3.3). The t-values were then sorted in descending order and the top 30 ranked genes were identified. This list was then compared with the permutation lists.

To evaluate the validity of our results using the two different tests, we performed a literature search on PubMed on the top ranked genes. With these top ranked genes, we looked at whether the genes were related to the neoplastic process.
In order to be consistent, we used the following rules for a gene to be considered related to the neoplastic process:
1. The gene is up-regulated or down-regulated in breast or brain cancer.
2. The gene is up/down-regulated in another type of cancer.
3. The gene is a known cancer-related gene (e.g., oncogene, mutator, tumour suppressor).
4. The gene is a major component of the cell cycle.

3.4.1 Summary of Validation of Breast Cancer and Brain Cancer Genes

Category            Breast Cancer Related    Related To Other Cancer
T-test              3                        4
Permutation Test    11                       7

Table 3.1: Statistical Comparison of the Top 30 Genes for Breast Data

From Table 3.1 we can see that among the 30 top ranked genes of the breast libraries identified by the permutation test, 11 are related to breast cancer. Another 7 genes are related to other types of cancer. In contrast, among the top 30 ranked genes identified by the t-test, only 3 are known to be breast cancer related. Interestingly, all three are included in the top 30 selected by the permutation test. These results suggest that the permutation test may be more appropriate for differential gene selection when the sample size is small. For details of the exact genes found please refer to Appendix C.

Category            Brain Cancer Related     Related To Other Cancer
T-test              7                        8
Permutation Test    5                        7

Table 3.2: Statistical Comparison of the Top 30 Genes for Brain Data

Among the top 30 genes of the brain libraries identified by the permutation test, 5 are known to be related to brain cancer, and another 7 are known to be related to other cancers. In contrast, among the top 30 identified by the t-test, 7 are known to be related to brain cancer, and another 8 are known to be related to other cancers. These results show that the permutation test performed comparably with the t-test. The top ranked gene in the permutation test is an EST (expressed sequence tag), which is an unknown transcript that has been identified but not studied yet.
The next highest ranked gene is protein kinase C and casein kinase substrate in neurons 1. This gene is known to be highly expressed in the normal brain [43], which agrees with our results. To summarize our verification data, we have created a similar summary table of all the verified permutation score genes for our brain data experiment. For further information on the top ranked genes, we have provided complete tables of the genes found using the permutation test in Appendix C.

3.4.2 Discussion

An interesting trend in the literature verification data is that genes at higher ranks seem more likely to be verified. For example, for the breast data, 7 of the top 10 ranked genes were verified, 6 of the genes ranked 11 to 20 were verified, and 5 of the genes ranked 21 to 30 were verified. A similar trend was also observed in the brain data (ranks 1-10: 6 verified, ranks 11-20: 5 verified, ranks 21-30: 2 verified).

Compared to the breast SAGE data results, the brain SAGE data results do not look as impressive. However, it should be noted that there are many different types of brain tissues, and thus different types of brain tumors, including gliomas, meningioma, pituitary tumours, haemangioblastoma, acoustic neuroma, pineal gland tumors, spinal cord tumors, and lymphoma. At the time of our research, we combined all the brain cancer libraries and overlooked the issue of differing brain tumor types, because there were not enough libraries to conduct the analysis on these tissues separately. These different types of tumors are unlikely to all have similar gene expression profiles, and this may have contributed to the weaker results compared to the breast data.

There were also a couple of other factors that may have affected the quality of both experiments. First, the usage of cell line SAGE libraries may have contributed to some inconsistencies in the data.
Cell lines are tissue cells grown artificially in a petri dish and may not be representative of the true genome wide profile of a sample taken directly from diseased tissue (known as bulk tissue). Second, many of the top ranked genes were hypothetical proteins or unknown genes (e.g., 7 genes were hypothetical proteins for the breast data), and so literature for these genes is non-existent.

In addition to evaluating the breast and brain data separately, we also compared breast cancer and brain cancer at the sub-cellular level by intersecting the top 500 ranked genes to determine their similarity. A high intersection would suggest a high similarity while a low intersection would suggest a low similarity. For the top 500 permutation test results, 26/500 of the genes intersected. These results suggest a low similarity between the two cancers even at the sub-cellular level.

In spite of the few sources of error explained above, our overall results (especially for the breast data) demonstrate that the application of the permutation test on SAGE libraries has significant merit and thus should also be used on SAGE libraries of other tissues, keeping the above issues in mind. This performance could be attributed to the fact that the permutation framework is robust to outliers. The t-test, on the other hand, would pick out spikes (outliers) in the data. For instance, suppose there were 5 libraries in each sample. For one gene, the counts could look like the following: (1, 1, 1, 1, 5) and (1, 1, 1, 1, 1). The permutation test would assign this gene a relatively low score and thus a low rank, whereas a t-test would give this example a relatively higher score. Genes with no confirmed literature verification should also be further investigated via biological experimentation (e.g., RT-PCR).

3.5 Biomarkers in Early Stages of Lung Cancer

3.5.1 Background and Related Work

Lung cancer is the leading cause of cancer deaths worldwide, with 160,000 annual deaths in the United States alone [23].
One major reason for this is that lung cancer is very difficult to detect. In recent years, many advancements in screening technologies such as chest radiographs, computed tomography (CT) scanning, and sputum cytology have given doctors tools for the detection of cancers. However, these advancements have proven to be feeble when it comes to detecting lung cancer early enough. In fact, studies that screen high risk individuals have not been effective in decreasing mortality rates, and the 5-year patient survival rate at clinical stages II-IV is poor, ranging from 40% down to 5% [52], [51].

Today, there is a wealth of information on the histological and molecular characteristics of the premalignant changes in bronchial mucosa [52]. The earliest changes that can be observed are known as reserve cell hyperplasia and squamous metaplasia. These changes spontaneously reverse upon the cessation of smoking and thus are not considered to be true premalignant lesions. The true early premalignant changes are believed to be low and high grade dysplasia and carcinoma in situ (CIS). Currently, it is unclear whether low grade dysplasia will lead to the advanced stages of lung cancer. However, both high grade dysplasia and CIS have been shown in various studies to lead to the invasive stages of lung cancer [53], [54], [55]. At either of these two stages, it is unlikely that these lesions will regress with the cessation of smoking.

The recent advent of global gene expression technologies has given researchers the tools to quickly generate candidate biomarkers. As far as we know, there have been no studies using such technology for evaluating stage differences at the genetic level. One reason for this is that these early lesions are rare and it is very difficult to obtain a clean sample of them. In the following section we demonstrate an example of how such technologies can be used for this purpose.
3.5.2 Materials

To perform our analysis, we used lung SAGE libraries provided by the BC Cancer Agency, which include the following:
1. Fifteen smoke-damaged but otherwise non-cancerous libraries (hereafter we refer to these libraries as normal). These libraries were obtained via bronchial brushings. Brushing is a medical procedure often performed during a bronchoscopy, in which cells from the tissues lining the respiratory tract are obtained with a small brush-like device.
2. A metaplasia library obtained via a bronchial biopsy. This library contains the genetic profile of bronchial epithelium that has undergone changes due to activities such as smoking but is not cancerous. Specifically, metaplasia is the replacement of the normal ciliated columnar epithelium by a squamous epithelium.
3. Five carcinoma in situ (hereafter referred to as CIS) libraries obtained from a lung biopsy. CIS is the stage following severe dysplasia: a pre-invasive lung carcinoma that has no metastatic potential.
4. Six invasive libraries obtained from frozen resected samples. The invasive stage is reached when a CIS lesion breaks through the basement membrane and has metastatic potential.

3.5.3 Candidate CIS Biomarkers

The primary objective of our analysis is to isolate genes specific to the CIS stage. In order to analyze their expression patterns we plotted a series of boxplots using the statistical and graphical software package R. Figure 3.2 shows the basic parts of the boxplot. The box itself corresponds to the interquartile range (IQR), meaning that it contains 50% of the data. The upper edge (hinge) of the box indicates the 75th percentile of the data set (the median of the upper half of the data), while the lower hinge indicates the 25th percentile (the median of the lower half of the data). The outer whiskers represent the maximum and minimum of the data points unless an outlier exists.
If there is an outlier, the whiskers extend no further than the 75th percentile plus 1.5 times the interquartile range (the length of the box) for the top whisker, or the 25th percentile minus 1.5 times the interquartile range for the bottom whisker. An outlier is depicted by a circle (o).

Figure 3.2: Diagram of a typical box and whisker plot

To find these CIS specific genes we apply the permutation framework to the following:
1. CIS and Normal SAGE libraries
2. CIS and Invasive SAGE libraries

From the permutation test, we obtain a permutation score for each gene/tag that measures the statistical significance of the difference between the two groups. Recall that the higher the score, the more confident we are that the two distributions are different. However, it should be noted that this score does not distinguish whether group A is greater than group B or vice versa. After obtaining the scores for each gene/tag, we sort them in descending order. We then took the top 2000 most differentially regulated genes of the 2 sets and performed an intersection. From this intersection, we obtained a total of 79 tags. Figure 3.3 illustrates the steps of our analysis in a graphical manner.

Figure 3.3: Graphical Representation of Steps of Analysis for CIS Specific Genes

To search for genes that are up-regulated in CIS but down-regulated in smoke-damaged normal tissues and invasive tissues, we first selected genes that have a permutation score greater than 2.35 for both sets (corresponding to a p-value of around 0.01). All these genes then had to meet the following requirements to be deemed a CIS biomarker candidate: 1- Normal, ^ l u 9 ~> 9 a n d Metaplasia ^ z d n u

For the CIS biomarkers set, after applying the above filters we obtained 18 tags. These tags were then mapped to their corresponding unigene ids and gene names.
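The rank-and-intersect step used here (and for the top 500 lists in Section 3.4) can be sketched as follows. This is only an illustration: the function names `top_k` and `shared_candidates`, the dictionary layout mapping a tag to its permutation score, and the tag labels in the usage lines are hypothetical. The stage-specific fold-change requirements are not modelled.

```python
def top_k(scores, k):
    """Return the k tags with the highest permutation scores.

    scores: dict mapping a SAGE tag to its permutation score.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def shared_candidates(scores_a, scores_b, k, min_score=None):
    """Tags that appear in the top k of both comparisons.

    If min_score is given, a tag must also exceed it in both comparisons.
    """
    shared = top_k(scores_a, k) & top_k(scores_b, k)
    if min_score is not None:
        shared = {
            tag for tag in shared
            if scores_a[tag] > min_score and scores_b[tag] > min_score
        }
    return shared


# Hypothetical scores for four made-up tags in the two comparisons.
cis_vs_normal = {"TAG_A": 5.1, "TAG_B": 3.0, "TAG_C": 2.4, "TAG_D": 0.2}
cis_vs_invasive = {"TAG_A": 2.9, "TAG_B": 0.5, "TAG_C": 4.0, "TAG_D": 3.3}
candidates = shared_candidates(cis_vs_normal, cis_vs_invasive, k=3, min_score=2.35)
```

In the analysis described above, k would be 2000 and the threshold 2.35. The resulting set would then be passed on to the expression-level filters.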
According to the tag/gene database found at CGAP, our results mapped to 14 unigene ids, and a subset of these mapped to genes with names. These genes included those involved in cellular adhesion, EGFR trafficking and tyrosine or phosphatase kinase receptors. Figure 3.4 illustrates the gene expression patterns specific for high expression in the CIS stage.

Figure 3.4: Boxplots showing gene expression patterns of genes with high CIS expression. Panels: PPP2R2B, HCG9, SORCS2, KRT6A, SNX13, NTRK2, LOC339483, LASS3, PTPRD, DSC2, DSG3. The x-axis of each panel is the stage (N, M, C, I).

To search for genes that are down-regulated in CIS but up-regulated in both smoke-damaged normal tissues and invasive tissues, we had to relax our requirements to obtain a reasonable number of candidates. The only requirement was that the average of CIS must be less than the average of both normal and invasive. This generated 13 candidates. Figure 3.5 depicts the relative expression specific for low expression in the CIS stage. All 13 candidates mapped to unigene ids. However, 4 of the 13 are hypothetical and unknown genes. The remaining genes mapped to known genes. These genes included protein kinase regulators and inhibitors, and genes belonging to the annexin family. The literature shows that these genes may have potential roles in cellular signal transduction, inflammation, growth and differentiation.

Figure 3.5: CIS genes that are down-regulated. Panels: CNOT4, LOC340061, TUBA3, MGC27165, DKFZP5640123, ORF1-FL49, PKIG, BASP1, ANXA5, PRKAR1A, EPAS1, APLP2. The x-axis of each panel is the stage (N, M, C, I).

3.5.4 Desmosomes as Candidate CIS Biomarkers

Our previous results analyzed the gene expression patterns specific for up-regulation of genes in the CIS stage. Because two of the genes, DSC2 and DSG3, are components of a larger adhesion structure known as the desmosome, we decided to analyze the gene expression patterns of the other components (which may have been missed because of our restrictive filters) to help strengthen our observations.

Background

Desmosomes are specialized structures responsible for cellular adhesion. These highly organized intercellular junctions provide mechanical integrity to tissues by anchoring intermediate filaments to sites of strong adhesion [20]. They are made up of at least 3 major families of proteins: the desmosomal cadherins, the armadillo proteins and the plakins [21] [22]. The cadherins (as seen in figure 3.6) are single-pass transmembrane glycoproteins that are responsible for the actual physical adhesion between two cells and also facilitate calcium-dependent cell to cell adhesion. These cadherins can be subdivided into two subfamilies, the desmogleins and the desmocollins. The desmogleins consist of 4 different types termed DSG1, DSG2, DSG3, and DSG4. Similarly, the desmocollins consist of 3 types termed DSC1, DSC2 and DSC3. Armadillo proteins include plakoglobin, the plakophilins (PKP1, PKP2, PKP3), and p0071 [20]. Lastly, the plakins include plectin, desmoplakins I and II, and the cell envelope proteins envoplakin and periplakin [20].

As seen from figure 3.6, these groups of molecules interact in an organized fashion in order to facilitate both cell adhesion and cell rigidity/structure. The single-pass transmembrane proteins DSC and DSG are responsible for the physical adhesion between two cells.
Their extracellular domains span into the dense midline (DM), where they interact with cadherins from other cells. As seen from figure 3.6, the armadillo proteins associate closely with the tails of the desmosomal cadherins. At the same time, they also interact with plakins (the desmoplakins seen in the diagram). Plakoglobins and PKPs interact with the desmoplakin N-terminus to aid the clustering of the cadherins and work together to form the desmosomal plaque. Finally, the desmoplakin C-terminus anchors to the intermediate filaments (IF) in the inner density plaque (IDP). Proper anchoring to the IF is crucial for strengthening adhesion and tissue integrity [20].

Figure 3.6: A Model of the Basic Structure of Desmosomes

Desmosomes and Cancer

As discussed above, desmosomes play an important role in tissue integrity and in cellular adhesion. In many types of cancers, desmosome gene expression levels are disrupted. For example, in head, neck, and oral cancer, desmosome expression is down-regulated, suggesting a mechanism for a cancer to become invasive and metastasize. Such results suggest that desmosomes may have a tumor suppression function [19]. However, this is not always the case: in colon cancer, there is no observed down-regulation prior to invasion. This may mean that this type of cancer uses a different mechanism to invade (e.g., temporary loss of adhesion in the desmosomes by other means at the protein level) [19] [16]. These results have compelled us to take a closer look at the expression patterns of all desmosome components.

Expression Patterns of Desmosome Components

To analyze the expression patterns we also used boxplots as described in Section 3.5.3.

Desmosomal Cadherins

Recall that desmoglein 3 is one of the genes that survived our strict filters. As shown in figure 3.7, the plot shows a distinct pattern where the CIS libraries have, in general, a much higher expression than the normal libraries and still higher expression than the invasive libraries.
Figure 3.7: Expression pattern of DSG3 in normal (N), metaplasia (M), CIS (C), and invasive (I) stages

Similarly, as seen in figure 3.8, the other class of cadherins, the desmocollins, also shows a similar gene expression pattern, although the patterns are not as clear cut, especially for DSC3.

Figure 3.8: Expression pattern of DSC2 and DSC3 in normal (N), metaplasia (M), CIS (C) and invasive (I) stages

Armadillo Proteins

The armadillo proteins are responsible for regulating desmosomal assembly and adhesion. Studies have shown that knocking out these genes led to the loss of adhesion and tissue fragility [17]. From figure 3.9, we see that DSC2 is distinctly overexpressed in the CIS stage relative to both normal and invasive stages. Note that DSC2 is one of the two desmosomal cadherins extracted from our strict filters. It is also interesting to see that DSC3 shows relatively higher expression in CIS than in both invasive and normal, even though the difference is not as clear cut as it is for DSC2.

Figure 3.9: Expression patterns of armadillo proteins in normal (N), metaplasia (M), CIS (C), and invasive (I) stages. Panels include JUP (plakoglobin).

Plakins

The plakins are also an important part of the desmosome structure, as they are critical components for proper anchorage of the intermediate filaments (IF) to desmosomes, which in turn is important for strengthening adhesion and tissue integrity [20] [18]. Recall from figure 3.6 that the tails of the desmoplakins are located in the inner density plaque and directly interact with both the armadillo proteins and the intermediate filaments. As seen in figure 3.10, we observe the expected gene expression pattern, as with the other desmosome components we analyzed.

Figure 3.10: Gene expression patterns of a desmoplakin (plakin family) in normal (N), metaplasia (M), CIS (C), and invasive (I) stages

3.5.5 Invasive Candidate Biomarkers

The next part of our analysis was to analyze the gene expression patterns specific to the invasive stages. This study is similar to finding genes specific to the CIS stage. Like the CIS study, we performed an intersection of the top 2000 most differentially expressed genes, based on the permutation score, for the groups CIS vs Invasive and Normal vs Invasive. Out of this intersection, we obtained a total of 78 tags. To these 78 tags, we applied the following filters:
1. invasive / normal > 10
2. invasive / CIS > 5
3. invasive / metaplasia > 1

After applying these filters, we retrieved only 6 tags. All tags have unigene ids and all mapped to genes with names. These genes included SPARC, MMP11, COL4A1, MCAM, COL1A2 and C1R. Four out of the six are components of the adhesive extracellular matrix. As seen in figure 3.11, all exhibit the distinct pattern we filtered for. That is, in normal tissues there is low or no expression, CIS has higher expression, and the invasive stages have even higher expression.

Figure 3.11: Gene Expression Patterns of Candidate Invasive Biomarkers. Panels: SPARC, MMP11, COL4A1, MCAM, COL1A2, C1R. The x-axis of each panel is the stage (N, M, C, I) and the y-axis is scaled expression.

It should be noted that some of the genes found are consistent with the literature. For example, SPARC (secreted protein, acidic, cysteine-rich) is a matrix-associated protein that can change shape [27]. Such a function is something one would expect to be important in the invasive stages.
Another interesting gene is melanoma cell adhesion molecule (MCAM), which is a cell surface glycoprotein. According to northern blot and immunohistochemical experiments performed by Sers et al. [8], MCAM antigens are found primarily in advanced primary and metastatic melanomas. In addition to being associated with tumour progression, MCAM expression has also been found to be associated with shorter patient survival in adenocarcinoma patients. As well, it is expressed in a high proportion of NSCLC (non-small cell lung cancer) cells [8].

Like the CIS gene expression study, we also examined the mirror image, where genes are turned off or down-regulated in the invasive stages. To do this, we took the same list and applied the following filters: ^ normal i n invasive 2 CIS ^ q invasive 3. Fold change between invasive and metaplasia is ignored.

These filters generated 6 tags, 5 of which have gene names; one tag is currently unmapped. Only one of these genes proved to be interesting. PTPRN2 (protein tyrosine phosphatase, receptor type, N polypeptide 2) is a known gene whose product belongs to a family of signaling molecules that regulate a variety of cellular processes including cell growth, differentiation, mitotic cycle, and oncogenic transformation. Again, to illustrate the gene patterns we plotted boxplots at each stage, as seen in figure 3.12.

Figure 3.12: Gene Expression Patterns of Candidate Invasive Biomarkers. Panels: C10orf32, PTPRN2, DKFZP564I0422, CACTTTTAAA, LOC388389. The x-axis of each panel is the stage.

3.5.6 Other Observed Gene Expression Patterns

To further our analysis, we also analyzed whether there are genes that are not present in the CIS and invasive stages. To find these genes, we simply performed the permutation test on the normal lung SAGE libraries versus the CIS and invasive libraries pooled together. We then sorted the list in descending order based on the permutation scores and took the top 200 (all those with permutation scores > 6).
Finally, we selected genes with a fold change between normal and the CIS/invasive pool greater than 40. This filter yields 15 tags, 10 of which map to genes with symbols. Figure 3.13 illustrates the selected patterns.

Figure 3.13: Boxplots showing genes that are down-regulated in lung cancer regardless of stage. Panels: CLSPN, CXorf44, FLJ32855, KCNE1, KIAA1533, LOC283152, RAB33B, SPAG6, TEKT2, TTC18. The x-axis of each panel is the stage (N, M, C, I).

In addition to finding genes present in our normal lung samples but not in lung cancer samples, we also looked at the reverse. Here we selected genes with a fold change between the cancer average and the normal average greater than 10. This filter generated 6 candidates. Again, we plotted a set of boxplots to illustrate this pattern, as seen in figure 3.14. At first glance, the boxplots do not clearly show that these genes are CIS/invasive specific. This is because, in order to retrieve these genes, we compared the mean ratios. Means are sensitive to outliers, and most of the plots in figure 3.14 contain outliers, which would skew the selected mean ratios. Thus, care must be taken when selecting these genes.

3.5.7 Summary

In summary, we have analyzed all possible expression patterns between normal, CIS and invasive stages. Specifically, we have looked at genes that are CIS specific, invasive specific and those that are specific to both stages (CIS and invasive). For CIS, we observed clear cut CIS specific patterns for both down- and up-regulation. Similarly, we observed clear cut invasive specific patterns for both down- and up-regulation. However, the patterns specific to both CIS and invasive were not clear cut, suggesting that at the molecular level these stages are quite different.
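The caution in Section 3.5.6 about mean ratios can be illustrated with a small sketch. The counts below are made up, and the function name `fold_change`, the pseudocount of 1 used to avoid division by zero, and the comparison against a median-based ratio are assumptions introduced here. The analysis above used ratios of means only.

```python
import statistics


def fold_change(group_a, group_b, center=statistics.fmean, pseudocount=1.0):
    """Ratio of the centers of two groups of counts.

    center: function used to summarize each group (mean by default).
    pseudocount: added to both centers to avoid division by zero (an assumption).
    """
    return (center(group_a) + pseudocount) / (center(group_b) + pseudocount)


# Made-up counts: one cancer library carries a single large spike.
cancer = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 220]
normal = [0] * 15

mean_ratio = fold_change(cancer, normal)                              # driven by the spike
median_ratio = fold_change(cancer, normal, center=statistics.median)  # ignores the spike
```

Here the mean-based ratio passes a threshold of 10 even though ten of the eleven cancer libraries show no expression, while the median-based ratio stays at 1. This is the kind of case that the boxplots in figure 3.14 make visible.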
Figure 3.14: Boxplots showing gene expression pattern for CIS and Invasive specific stages. Panels: WDR1, CENTA2, COL5A3, KRT14, DT1P1A10, SLC16A4. The x-axis of each panel is the stage.

Chapter 4

Constant Gene Expression Finder

Reference genes are commonly used in many gene expression experiments to normalize mRNA levels between different samples. For example, QRT-PCR and certain normalization methods for DNA microarrays both require the use of such reference genes. In the case of DNA microarrays, the normalization step attempts to isolate variations that exist due to biological differences and not due to experimental conditions. Such non-biological errors include mismatch hybridization, unequal starting material, scanner problems and spot noise. Reference genes are often used to help calibrate the intensity data to account for these errors, as the expression level of these reference genes should be the same across multiple samples. However, if the reference genes are chosen incorrectly, then this type of normalization could severely skew the normalized data and may even inadvertently remove important biological differences between samples, especially for the more sensitive small expression values. Thus, selection of stable and abundant reference genes is a critical step, as failure to do so could severely jeopardize subsequent analyses.

In this chapter, we argue that reference genes must be chosen in a tissue specific manner. Furthermore, we propose a novel methodology for the identification and evaluation of good reference genes. This methodology uses the advantages of the SAGE method and the permutation framework to find a set of genes that is stable and highly expressed (hereafter referred to as the constancy requirement). In the following section, we first describe some related work. We then describe exactly why we use the SAGE method, followed by the details of our novel methodology.
Following this section, we describe how we evaluated these sets of genes and how our results suggest the importance of tissue specific reference genes.

4.1 Related Work

Reference genes are typically chosen because they are biologically required and thus are usually referred to as housekeeping genes. The expression levels of these housekeeping genes are assumed to be stable and high (the constancy requirement). The problem is that in the past few years, it has become increasingly apparent that many standard housekeeping genes may not be appropriate for many gene expression experiments [35],[36],[37],[40].

Recent studies by Barber et al. have shown that one of the most common housekeeping genes, glyceraldehyde-3-phosphate dehydrogenase (GAPDH), shows variable expression. Their experiment consisted of 72 different pathologically normal human tissue types. They performed 371,088 multiplexed Q-RTPCR experiments and found significant differences in expression levels of GAPDH mRNA between tissue types and between donors of the same tissue type. Most notably, a 15-fold difference in GAPDH mRNA copy numbers was observed between the highest (skeletal muscle) and lowest (breast) expressing tissue types [39]. As will be seen later, their conclusion is consistent with our results.

Zhang et al. recently reported that 3 of the 10 standard housekeeping genes were inappropriate for normalizing mRNA in neutrophils because of very low expression levels [38], and recommended using only the 5 that showed relatively stable expression. Like Barber et al.'s study, Zhang et al.'s study suggests that many standard housekeeping genes are inappropriate for mRNA normalization. The main problem with both these studies, though, is that they only analyze the expression level of the standard housekeeping genes, which make up less than one percent of the human genome. Thus, it is quite likely that better reference genes exist.
Gabrielsson et al. (2005) attempted to solve the reference gene problem by using 52 microarray expression profiles of human adipose tissue, from which they selected the 50 genes with the lowest coefficient of variation [26]. The problem with such an approach is that microarrays require normalization themselves. In fact, this decade-old technology is even more sensitive to normalization than RT-PCR because of the large number of sources of systematic variation (as described in Chapter 3).

4.2 Novel Methodology

In spite of the numerous normalization techniques introduced for DNA microarrays in the past decade, to date there is still no standard procedure. On the other hand, the SAGE (Serial Analysis of Gene Expression) technique is relatively standardized. Being a sequence based method, SAGE does not require such normalization. The only preprocessing that is required is somewhat similar to global normalization, where all samples are scaled up to some given constant. This is done to correct for the unequal number of tags produced by each library, as it is difficult to obtain libraries with exactly the same number of tags. In the case of DNA microarrays, the data depends on a natural hybridization reaction inherent in DNA. Furthermore, the obtained data is intensity-based and requires that we assume that the relative difference in intensity between sample A and B is approximately equal to the relative transcription between sample A and B. Arguably, this makes microarrays more susceptible to non-biological errors than SAGE. Thus, here we introduce a novel methodology to identify and evaluate reference genes based on SAGE data.

While the SAGE method makes up a large part of our methodology, we also require one more component. In order to identify and evaluate potential reference genes, we use the permutation framework introduced in Chapter 3. Recall that the permutation framework was used to identify differentially expressed genes.
As it turns out, this framework can also be used to find statistically similar gene expression profiles simply by reversing the list, since low permutation scores correspond to similar samples. Thus, to select reference genes that satisfy the constancy requirement, we chose thresholds that would generate a reasonable number of genes (about 20) that are constant and highly expressed. All our selected genes must meet the following criteria:
1. The permutation score must be less than 0.15.
2. The average raw gene count must be at least 25 and the raw count for each library must be at least 3.
3. The standard deviation $\sigma$ must be reasonably small (e.g., less than 10 for the lung data).

The second condition is set to guarantee that the selected reference gene is abundant and present to some degree in all libraries, while the last condition ensures that the simulated cancer distribution and simulated normal distribution are sufficiently close.

The above conditions were all applied to a set of lung SAGE libraries provided by the BC Cancer Agency. Their cohort consists of 18 non-cancerous smoke-damaged bronchial epithelial libraries, 5 pre-invasive lung carcinoma libraries (a stage known as carcinoma in situ [CIS]), and 6 invasive lung carcinoma libraries. The non-cancerous libraries were all extracted via lung brushings. Pre-invasive libraries were extracted surgically via lung biopsies and the invasive libraries were obtained from frozen resected samples.

After application of the above methodology to the lung SAGE data we obtained a list of 16 candidate reference genes. To visualize how these sets of reference genes performed, we plot the points on a scatter plot as seen in figure 4.1. The x-axis gives the permutation score, indicating variability, whereas the y-axis shows the average raw SAGE counts. These sets of lung-SAGE reference genes all satisfy the constancy requirement.
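The three selection criteria can be written as a single predicate, sketched below. The thresholds are the ones quoted above for the lung data. The function name `meets_constancy`, its argument layout, and the example counts are assumptions for illustration, and the permutation score and standard deviation are taken as already computed.

```python
import statistics


def meets_constancy(counts, perm_score, sigma,
                    max_score=0.15, min_mean=25, min_count=3, max_sigma=10):
    """Check the constancy requirement for one gene.

    counts: raw SAGE counts of the gene, one per library.
    perm_score: permutation score of the gene (cancerous vs non-cancerous).
    sigma: standard deviation of the simulated differences for the gene.
    """
    return (
        perm_score < max_score                     # criterion 1: statistically similar groups
        and statistics.fmean(counts) >= min_mean   # criterion 2: abundant on average
        and min(counts) >= min_count               # criterion 2: present in every library
        and sigma < max_sigma                      # criterion 3: small spread
    )


# Hypothetical gene: abundant, present in all libraries, low permutation score.
is_candidate = meets_constancy([30, 28, 35, 27], perm_score=0.10, sigma=4.0)
```

A gene with a count below 3 in any single library fails the check even if its average count is high, which is the intent of the second criterion.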
That is, they appear to be highly expressed and show almost constant expression across all 29 cancerous and non-cancerous libraries.

Conversely, none of the common housekeeping genes we examined meets our reference gene requirement. In fact, they are far from meeting the constancy requirement. Standard housekeeping genes such as GAPDH and TFRC in particular have a permutation score greater than 3. A permutation score of 3 corresponds to a p-value of 0.01; that is, a difference between the average expression in cancerous libraries and the average expression in non-cancerous libraries as large as the one observed would arise by chance only about 1% of the time. In the next section, we describe two ways in which these candidate reference genes (hereafter referred to as lung-SAGE reference genes) were validated.

[Figure 4.1: Raw Count versus Variability on Lung SAGE data. Scatter plot of average raw count (y-axis) against permutation score (x-axis) for the lung-SAGE, housekeeping, breast-SAGE, brain-SAGE, and all-SAGE reference genes; the labeled housekeeping genes are ACTB, STAT1, GAPDH, and TFRC.]

4.3 Validation

To validate the lung-SAGE reference genes, a handful of them were analyzed by the Ontario Cancer Institute using its own cohort of lung tissues. The Institute performed QRT-PCR experiments on 7 of our lung-SAGE reference genes and 4 commonly used housekeeping genes. The QRT-PCR values were used to evaluate whether a gene satisfies the constancy requirement: if the values for tumor and normal tissues are close, we say that the gene satisfies the requirement. To analyze these values, we combined the two samples to compute a single standard deviation of expression and, from it, a coefficient of variation. A coefficient of variation close to zero suggests that the expression level does not change much regardless of a cell's neoplastic state. A summary of the results can be seen in Figure 4.2.
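The pooled coefficient of variation used in this validation can be sketched as follows. The function name and the QRT-PCR values are hypothetical, and the use of the sample standard deviation (rather than the population form) is an assumption, since the text does not say which was used.

```python
from statistics import mean, stdev


def pooled_cv(tumor, normal):
    """Coefficient of variation of the tumor and normal values combined.

    The two groups are pooled into one sample, and the single standard
    deviation of that sample is divided by its mean.
    """
    values = list(tumor) + list(normal)
    return stdev(values) / mean(values)


# Hypothetical expression values for two genes.
stable = pooled_cv([10.0, 11.0, 9.0], [10.0, 10.5, 9.5])    # similar in both groups
variable = pooled_cv([20.0, 22.0, 18.0], [5.0, 6.0, 4.0])   # shifts between groups

print(f"stable gene CV:   {stable:.3f}")
print(f"variable gene CV: {variable:.3f}")
```

A gene whose expression shifts between tumor and normal tissue inflates the pooled standard deviation, so its coefficient of variation moves away from zero, which is the behavior the constancy requirement is meant to detect.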
This figure clearly shows that, in general, the standard housekeeping genes (top 4 bars) exhibit more variation than the lung-SAGE reference genes.

[Figure 4.2: Table showing relative comparison of validation results. Bar chart titled "SAGE-Ref Genes Vs Standard-Ref Genes" showing the coefficient of variation (CV) for GAPDH, ACTB, TFRC, STAT-1, MRCL3, PRDX-1, PKM2, RPS13, KCNN3, RPL15, and RPL11.]

In addition to the above biological validation, we also evaluated the effectiveness of our lung-SAGE reference genes by analyzing a dataset generated by Bhattacharjee et al. [33] in 2001. Essentially, we compared the same dataset normalized in two distinct ways: first by standard housekeeping genes, and second by the lung-SAGE reference genes. In both cases, we applied the same permutation framework as in our previous experiment on brain and breast data to identify the top 40 most differentially expressed genes. Below we outline the exact steps we took to renormalize Bhattacharjee et al.'s data:

1. Renormalized Bhattacharjee et al.'s published microarray data by fitting a line (LMS) through the SAGE-selected housekeeping genes (array vs. average of all arrays).
2. Performed a permutation test on the renormalized 17 normal lung arrays vs. 23 lung cancer arrays.
3. Performed a permutation test on the originally normalized lung arrays, using the same arrays as above.
4. Sorted each list in descending order of permutation score.
5. Intersected the top 40 genes from the SAGE-normalized and originally normalized Harvard data.
6. Looked up the SAGE-normalized-specific genes, the originally-normalized-specific genes, and the intersected genes, and categorized them as: up/down-regulated in lung cancer, up/down-regulated in other cancers, or not previously associated with cancer.

Of the top 40 genes, 22 were in both lists. We categorized the remaining genes into one of two cancer-related categories.
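Steps 4 and 5 of the procedure above (ranking by permutation score and intersecting the two top-k lists) can be sketched as follows. The function names, gene labels, and scores are hypothetical; ties are broken alphabetically here, which is an assumption, since the text does not say how ties were handled.

```python
def top_k(scores, k):
    """Return the k genes with the highest permutation scores."""
    ranked = sorted(scores, key=lambda gene: (-scores[gene], gene))
    return ranked[:k]


def compare_top_lists(scores_a, scores_b, k=40):
    """Split the two top-k lists into shared and list-specific genes."""
    top_a, top_b = set(top_k(scores_a, k)), set(top_k(scores_b, k))
    return {
        "both": sorted(top_a & top_b),
        "only_a": sorted(top_a - top_b),
        "only_b": sorted(top_b - top_a),
    }


# Hypothetical scores for the same genes under the two normalizations.
sage_normalized = {"G1": 3.2, "G2": 2.9, "G3": 2.5, "G4": 0.4, "G5": 0.1}
originally_normalized = {"G1": 3.0, "G2": 0.3, "G3": 2.7, "G4": 2.8, "G5": 0.2}

result = compare_top_lists(sage_normalized, originally_normalized, k=3)
print(result)
```

With k=3 the toy data give G1 and G3 in both lists, G2 specific to the first normalization, and G4 specific to the second; the list-specific genes are the ones that would then be looked up in the literature (step 6).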
The first category, as seen in Table 4.1, consists of genes that are up/down-regulated in lung cancer, while the second category consists of genes that are up/down-regulated in other cancer types. Our results, as seen in Table 4.1, suggest that the data normalized by lung-SAGE reference genes give results that are more consistent with the literature.

Criteria                  Intersection   By lung-SAGE   By Housekeeping Genes
Lung Cancer Related       5              5              1
Related to Other Cancer   7              4              1
Total                     12             9              2

Table 4.1: Comparison of the Top 40 Most Differentially Expressed Genes of Microarray Data Normalized by Two Approaches

4.4 Importance of Tissue Specificity

Our second major objective is to analyze whether the choice of reference genes ought to be tissue specific. That is, a set of reference genes that satisfies the constancy requirement for one tissue type may not be the same as the reference genes for another tissue type. To show this, we identify the sets of reference genes that satisfy the reference gene requirement for two additional tissue types, breast and brain. We use the same libraries as in our differential expression experiment in Chapter 3, except that we include only bulk tissues (resected from patients). We took this step to stay consistent with the lung libraries produced by the BC Cancer Agency, which were all of the bulk type. Below we outline the exact steps of our analysis:

1. Performed a permutation test on breast SAGE libraries from NCBI, comprising 12 cancer libraries and 3 normal libraries.
2. Performed a permutation test on brain SAGE libraries from NCBI, comprising 6 normal libraries and 20 cancer libraries.
3. Performed a permutation test on all of the above libraries together with our lung SAGE libraries.
4. Sorted each list in descending order to obtain lists of the most statistically differentially expressed genes.

After obtaining the breast and brain SAGE-based reference genes, we compared them with the lung-SAGE genes.
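The permutation tests in the steps above rely on the framework of Chapter 3, which is not reproduced in this chapter. As a rough illustration only, a generic exact two-sample permutation test on the difference of group means might look like the sketch below. The function name, the choice of test statistic, and the toy counts are assumptions, and the p-value it returns is not the permutation score used in this thesis.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means.

    Every way of relabelling the pooled values into groups of the
    original sizes is enumerated, and the p-value is the fraction of
    relabellings whose absolute mean difference is at least as large
    as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)

    def abs_diff(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_diff(sum(group_a))
    extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        n_splits += 1
        sum_a = sum(pooled[i] for i in idx)
        # Small tolerance so that ties are not lost to rounding.
        if abs_diff(sum_a) >= observed - 1e-12:
            extreme += 1
    return extreme / n_splits


# Hypothetical tag counts for one gene in cancer and normal libraries.
p_separated = permutation_p_value([50, 55, 60], [10, 12, 15])
p_similar = permutation_p_value([30, 28, 33], [29, 31, 30])
print(p_separated, p_similar)
```

With 3 libraries per group there are only 20 possible relabellings, so even perfectly separated groups cannot reach a two-sided p-value below 0.1. This illustrates why small sample sizes limit the statistical confidence that any test can provide.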
Figure 4.1 shows how the breast and brain SAGE reference genes compare with the lung-SAGE genes on lung data. Notice that most of the breast and brain SAGE genes have low average counts and relatively higher permutation scores than the lung-SAGE genes. This figure supports our hypothesis that tissue specificity is an important factor when selecting reference genes.

These findings compelled us to analyze the issue further, so we also examined the brain and breast SAGE data. Figure 4.3 summarizes our results in 3 boxplots. The y-axis gives the permutation score and the x-axis the type of reference gene. The first boxplot shows the permutation scores of the various reference genes on lung data. Lung-SAGE reference genes show almost constant expression, while standard housekeeping genes vary greatly. Interestingly, the other reference genes do not perform as badly as the housekeeping genes. The other 2 plots show similar behavior, except that the variability of the housekeeping genes is comparable to that of the other reference genes.

Our results collectively suggest the following:

1. Standard housekeeping genes may not be appropriate for the normalization of microarrays, and perhaps not for other biological experiments requiring normalization (e.g., QRT-PCR).
2. There are far better reference genes than those in use today, and tissue-specific reference genes may be more appropriate.

[Figure 4.3: Boxplots that depict the importance of tissue specificity. Panels: (A) Lung Data, (B) Breast Data, (C) Brain Data. x-axis: type of reference genes (HK, Lg, Bt, Br, All); y-axis: permutation score.]

Chapter 5

Conclusion and Future Directions

5.1 Conclusion

This thesis deals with two key problems in genome-wide expression analysis. The first is the small sample size problem. The second is the normalization issue in biological experiments, specifically the issues surrounding poorly chosen reference genes.
We have proposed using SAGE and the permutation framework to alleviate these issues. Our results in 3 independent experiments suggest that there is merit in our proposed solution.

Our first experiment compared the commonly used unequal-variance t-test with the permutation test. To evaluate their performance, each test was applied to two sets of SAGE data (brain and breast) to identify the top 30 most differentially expressed genes between normal and cancer groups. Overall, our results show that the permutation test produces results more consistent with the literature. For instance, for the breast SAGE data, 60% of the top 30 genes ranked by the permutation test were verified via the literature, while only around 23% of the highly ranked t-test genes were verified. For the brain SAGE data, the permutation test performed comparably with the t-test, with 40% and 50% verified respectively. However, it should be stressed that the brain data were less uniform, comprising various tumor types, which may have affected the overall results.

The second experiment in our study further strengthened the value of the permutation test as an identifier of interesting genes. Here we analyzed lung SAGE data from various stages and performed various intersections to produce genes specific to certain stages. By applying various filters, we were able to isolate genes specific to each stage, including the normal lung, CIS, and invasive stages. The key findings were two genes, DSG3 and DSC2. As described in Chapter 3, these two genes belong to a family of adhesion molecules known as desmosomal cadherins. The expression pattern we observe suggests that adhesion increases in pre-invasive lung carcinoma and is then down-regulated in the invasive stages of the disease. Biologically, this makes sense, as the definition of invasion is the spread of the cancer.
For that to happen, cancer cells must lose some adhesion so that they can travel to other parts of the body. With this knowledge, we also looked up the gene expression patterns of other associated structural genes that did not pass the strict filters of our analysis. Interestingly, several of these associated genes exhibited similar gene expression patterns (though not as strong as those of DSG3 and DSC2). These results are now under investigation as candidate biomarkers of pre-invasive lung squamous carcinoma.

In our third experiment, we used the SAGE and permutation framework to evaluate and identify reference genes. SAGE was used in place of DNA microarray data because microarray data themselves require normalization, while SAGE is immune to the normalization issues that plague DNA chip technology. In addition, recall that SAGE is a genome-wide expression technique and is thus capable of analyzing the entire set of expressed genes in any given sample. This combination is therefore well suited to identifying novel reference genes.

5.2 The Future of Gene Expression Analysis

Gene expression analysis technologies are powerful and are showing many signs of advancing medical research. For instance, in 2002, van 't Veer et al. demonstrated a strategy that analyzes the genetic profile of individuals to help decide which breast cancer patients would benefit from certain treatments [41]. Such knowledge would be critical to a doctor's decision on which treatment to administer.

The excitement and pace of these technologies, though, have led to various oversights. Fortunately, in recent years, a few scientists have recognized some of them. For example, as discussed in the previous chapter, researchers have now recognized the deficiencies and limitations of using standard housekeeping genes to normalize mRNA levels [35], [36], [37], [40], [38], [39].
However, most of these studies do not search for novel reference genes, and the ones that do [26] themselves require normalization to reference genes. Our novel methodology globally identifies the genes that are the most stable and highly expressed (what we called the constancy requirement). We showed that the genes it selected can be verified via QRT-PCR, and that normalizing DNA microarray data with these newly acquired reference genes produces results more consistent with the literature. Thus, future experiments must consider methods such as ours when normalizing mRNA levels.

There is still much to be done beyond our study. In fact, we have so far only scratched the surface of the SAGE data that we analyzed. The data set is very large and there are numerous ways to look at it. Recall from Chapter 3 that we generated many different figures showing many different gene expression patterns. All such patterns still need to be analyzed carefully to pull out any interesting genes that map to known pathways or structures. Following this analysis, we would analyze the expression patterns of the genes in each pathway, which can involve several genes. In addition to these studies, we could also look at low expressors. Low expressors may be interesting to study because they may be the switches or regulators that cause the onset of the disease. The good news is that this study could also be done using the permutation framework we have outlined, because during our analysis we found that the permutation test also pulls out results with low counts.

After extracting genes that belong to existing pathways and examining their expression patterns within our own SAGE data, we still need to conduct some form of validation. Recall from the second experiment of Chapter 3, on the lung SAGE data, that our objective was to find candidate biomarkers for early stages of lung cancer.
Following the analysis, we selected two of these candidates for further analysis, namely DSG3 and DSC2. Although the SAGE data provide strong evidence that desmosome components are up-regulated just before the invasive stages of lung cancer, we cannot simply publish what we observe. Researchers therefore often perform an extra step of analysis, namely biological validation. There are three reasons for this. First, biological validation can show that our observations are also true at the protein level (e.g., via antibody staining). Second, performing this extra step on a different cohort can make our findings more statistically convincing. Third, genome-wide expression technologies are subject to more errors than small-scale technologies. For instance, in SAGE, the tag-to-gene mapping database is constantly changing. Throughout our study, we had to update our results several times because of this issue, which sometimes led to a small but significant change in the results. Thus, validation can help distinguish artifact from true observation. Unfortunately, biological validation can take several months to a year. Furthermore, since our analysis attempts to extract CIS-specific genes, simply obtaining additional CIS samples for biological validation would take even more time.

We were very fortunate to have collaborated with experts in the biological domain. Recall that some of our experimental results, specifically those of the housekeeping project described in Chapter 4, were validated by the Ontario Cancer Institute. Their biological experiment was performed on a different cohort and helped validate and strengthen our findings. Without the relationship between wet lab and dry lab, our research could only go so far. Our results are strong on both the wet-lab and dry-lab sides.
Together, they strongly suggest that housekeeping genes do not make the best reference genes and that normalizing to reference genes is appropriate when the genes are selected in the way we have described.

Bibliography

[1] McPherson JD et al. A physical map of the human genome. Nature, 2001; Feb 15;409(6822):934-41.
[2] Kan Z, Rouchka EC, Gish WR, States DJ. Gene Structure Prediction and Alternative Splicing Analysis Using Genomically Aligned ESTs. Cell, 2001; 11, 889.
[3] Black DL. Protein diversity from alternative splicing: a challenge for bioinformatics and post-genome biology. Cell, 2000; 103, 367.
[4] Zhu J, Shendure J, Mitra RD, Church GM. Single molecule profiling of alternative pre-mRNA splicing. Science, 2003; 301(5634):836-8.
[5] Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science, 1995; 276, 1268-1272.
[6] Patino WD, Mian OY, Hwang PM. Serial analysis of gene expression: technical considerations and applications to cardiovascular biology. Circ Res., 2002; Oct 4;91(7):565-9.
[7] Schober MS, Min YN, Chen YQ. Serial analysis of gene expression in a single cell. Biotechniques, 2001; 31:1240-1242.
[8] Kristiansen G, Yu Y, Schluns K, Sers C, Dietel M, Petersen I. Expression of the cell adhesion molecule CD146/MCAM in non-small cell lung cancer. Anal Cell Pathol, 2003; 25(2):77-81.
[9] Saha S, Sparks AB, Rago C, Akmaev V, Wang CJ, Vogelstein B, Kinzler KW. Using the transcriptome to annotate the genome. Nat Biotechnol, 2002; 20:508-512.
[10] Datson NA, van der Perk-de Jong J, van den Berg MP, de Kloet ER, Vreugdenhil E. MicroSAGE: a modified procedure for serial analysis of gene expression in limited amounts of tissue. Nucleic Acids Res., 1999; 27:1300-1307.
[11] Patino WD, Mian OY, Hwang PM. Serial analysis of gene expression: technical considerations and applications to cardiovascular biology. Circ Res., 2002; Oct 4;91(7):565-9.
[12] Schena M, Shalon D, Davis RW, Brown PO.
Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 1995; 270:467-470.
[13] Okamoto T, Suzuki T, Yamamoto N. Microarray fabrication with covalent attachment of DNA using bubble jet technology. Nat. Biotechnol., 2000; Apr;18(4):438-41.
[14] Macgregor PF, Squire JA. Application of Microarrays to the Analysis of Gene Expression in Cancer. Clinical Chemistry, 2002; 48(8):1170-1177.
[15] Emmert-Buck MR, Bonner RF, Smith PD, Chuaqui RF, Zhuang Z, Goldstein SR, Weiss RA, Liotta LA. Laser Capture Microdissection. Science, 1996; 274(5289):998-1001.
[16] Garrod DR. Desmosomes and cancer. Cancer Surv, 1995; 24:97-111.
[17] Ruiz P, Brinkmann V, Ledermann B, Behrend M, Grund C, Thalhammer C, et al. Targeted mutation of plakoglobin in mice reveals essential functions of desmosomes in the embryonic heart. J Cell Biol, 1996; 135(1):215-25.
[18] Huen AC, Park JK, Godsel LM, Chen X, Bannon LJ, Amargo EV, et al. Intermediate filament-membrane attachments function synergistically with actin-dependent contacts to regulate intercellular adhesive strength. J Cell Biol, 2002; 159(6):1005-17.
[19] Depondt J, Shabana AH, Florescu-Zorila S, Gehanno P, Forest N. Down-regulation of desmosomal molecules in oral and pharyngeal squamous cell carcinomas as a marker for tumour growth and distant metastasis. Eur J Oral Sci., 1999; 107(3):183-93.
[20] Yin T, Green KJ. Regulation of desmosome assembly and adhesion. Semin Cell Dev Biol., 2004; 15(6):666-77.
[21] Huber O. Structure and function of desmosomal proteins and their role in development and disease. Cell Mol Life Sci, 2003; 60(9):1872-90.
[22] Getsios S, Huen AC, Green KJ. Working out the strength and flexibility of desmosomes. Nat Rev Mol Cell Biol, 2004; 5:271-81.
[23] Jemal A, Clegg LX, Ward E, Ries LA, Wu X, Jamison PM, Wingo PA, Howe HL, Anderson RN, Edwards BK. Annual report to the nation on the status of cancer, 1975-2001, with a special feature regarding survival.
Cancer, 2004; Jul 1;101(1):3-27.
[24] Wang Y, Lu J, Lee R, Gu Z, Clarke R. Iterative Normalization of cDNA Microarray Data. IEEE Transactions on Information Technology in Biomedicine, 2002; Vol. 6, No. 1.
[25] Tseng GC, Oh MK, Rohlin L, Liao JC, Wong WH. Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Res., 2001; 29(12):2549-57.
[26] Gabrielsson BG, Olofsson LE, Sjogren A, Jernas M, Elander A, Lonn M, Rudemo M, Carlsson LMS. Evaluation of reference genes for studies of gene expression in human adipose tissue. Obes Res., 2005; 13:649-652.
[27] Goldblum SE, Ding X, Funk SE, Sage EH. SPARC (secreted protein acidic and rich in cysteine) regulates endothelial cell shape and barrier function. Proc Natl Acad Sci USA, 1995; Apr 12;91(8):3448-52.
[28] Hoffmann R, Seidl T, Dugas M. Profound effect of normalization on detection of differentially expressed genes in oligonucleotide microarray data analysis. Genome Biology, 2002; 3(7):research0033.1-0033.11.
[29] Yang YH, Speed TP. Design issues for cDNA microarray experiments. Nature Reviews Genetics, 2002; 579-588.
[30] Baldi P, Hatfield GW. DNA microarrays and gene expression, from experiments to data analysis and modeling. Cambridge University Press, 2002.
[31] Kepler TB, Crosby L, Morgan KT. Normalization and analysis of DNA microarray data by self-consistency and local regression. Genome Biology, 2002; 3:research0037.1-0037.12.
[32] Tseng GC, Oh MK, Rohlin L, Liao JC, Wong WH. Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Res, 2001; 29:2549-2557.
[33] Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M.
Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA, 2001; 98(24):13790-5.
[34] Lu C. Improving the scaling normalization for high-density oligonucleotide GeneChip expression microarrays. BMC Bioinformatics, 2004; 5:103.
[35] Bustin SA. Absolute quantification of mRNA using real-time reverse transcription polymerase chain reaction assays. J Mol Endocrinol, 2000; 25:169-193.
[36] Suzuki T, Higgins PJ, Crawford DR. Control selection for RNA quantitation. Biotechniques, 2000; 29:332-337.
[37] Thellin O, Zorzi W, Lakaye B, De Borman B, Coumans B, Hennen G, Grisar T, Igout A, Heinen E. Housekeeping genes as internal standards: use and limits. J. Biotechnol, 1999; 75:291-295.
[38] Zhang X, Ding L, Sanford AJ. Selection of reference genes for gene expression studies in human neutrophils by real-time PCR. BMC Mol Biol, 2005; 6:4.
[39] Barber R et al. GAPDH as a housekeeping gene: analysis of GAPDH mRNA expression in a panel of 72 human tissues. Physiological Genomics, 2005; 21:389-395.
[40] Warrington JA, Nair A, Mahadevappa M, Tsyganskaya M. Comparisons of human adult and fetal expression and identification of 535 housekeeping/maintenance genes. Physiol Genomics, 2000; 2:143-147.
[41] van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peters HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GH, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 2002; 415(6871):530-6.
[42] Shen D, Chang HR, Chen Z, He J, Lonsberry V, Elshimali Y, Chia D, Seligson D, Goodglick L, Nelson SF, Gornbein JA. Loss of annexin A1 expression in human breast cancer detected by multiple high-throughput analyses. BBRC, 2005; 326, 218-227.
[43] Sumoy L, Pluvinet R, Andreu N, Estivill X, Escarceller M. PACSIN 3 is a novel SH3 domain cytoplasmic adapter protein of the pacsin-syndapin-FAP52 gene family.
Gene, 2001; 262(1-2), 199-205.
[44] Liau LM, Yang I. Microarrays and the Genetic Analysis of Brain Tumors. Current Genomics, 2002; 3, 33-41.
[45] Ljubimova JY, Khazenzon NM, Chen Z, Neyman YI, Turner L, Riedinger MS, Black KL. Gene expression abnormalities in human glial tumors identified by gene array. Int. J. Oncol, 2001; 18, 287-295.
[46] Watson MA, Perry A, Budhjara V, Hicks C, Shannon WD, Rich KM. Gene expression profiling with oligonucleotide microarrays distinguishes World Health Organization grade of oligodendrogliomas. Cancer Research, 2001; 81, 717-723.
[47] Good PI. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses, 2nd ed. Springer-Verlag New York, Inc., 2000.
[48] Park T, Yi SG, Lee S, Lee SY, Yoo DH, Ahn JI, Lee YS. Statistical tests for identifying differentially expressed genes in time-course microarray experiments. Bioinformatics, 2003; 19(6), 694-70.
[49] Dudoit S, Yang YH, Speed TP, Callow MJ. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 2002; 12, 111-139.
[50] Ng RT, Sander J, Sleumer MC, Yuen MS, Jones SJ. A Methodology for analyzing SAGE libraries for Cancer Profiling. ACM TOIS, 2005; V 23, 35-60.
[51] Dutu T, Michiels S, Fouret P, Penault-Llorca F, Validire P, Benhamou S, Taranchon E, Morat L, Grunenwald D, Le Chevalier T, Sabatier L, Soria JC. Differential expression of biomarkers in lung adenocarcinoma: a comparative study between smokers and never-smokers. Ann Oncol, 2005; 16(12):1906-14.
[52] Hirsch FR, Franklin WA, Gazdar AF, Bunn PA Jr. Early detection of lung cancer: clinical perspectives of recent advances and radiology. Clin Cancer Res, 2001; 7:5-22.
[53] Auerbach O, Gere B, Forman JB, et al. Changes in bronchial epithelium in relation to smoking and cancer of lung. N Engl J Med; 256:97-104.
[54] Saccomano G, Archer VE, Auerbach O, Saunders RP, Brennan LM.
Development of carcinoma of the lung as reflected in exfoliated cells. Cancer (Phila), 1974; 32:256-270.
[55] Venman BJW, van Boxem TJM, Smit EF, Postmus PE, Sutedja TG. Outcome of bronchial carcinoma in situ. Chest, 2000; 117:1572-1576.

Appendix A

Central Dogma of Biology

The smallest biological unit of life is known as a cell. Some organisms are unicellular while others are multicellular. Unicellular organisms contain all the machinery needed to sustain life and reproduce. Multicellular organisms are much more complex, as they require interactions among cells to obtain the nutrients necessary to sustain life, and often also require interactions with other multicellular organisms to reproduce. Different cells produce different gene products. The activation of specific gene products, or proteins, can have a great effect on the characteristics of a cell, as proteins are the actual worker molecules of the cell. For instance, a muscle cell and a nerve cell express different gene products and thus differ in both function and physical appearance.

The central dogma of biology explains how a gene product is produced in cells. Cells contain genetic templates known as DNA, located in a region called the nucleus. These gene templates are copied into corresponding molecules known as messenger RNA (mRNA) through a process known as transcription. mRNA molecules (often called transcripts) are then exported into the cytoplasm, where they are recruited by structures called ribosomes (rRNA). These structures in turn recruit free amino acids (the building blocks of proteins) and perform a process known as translation, in which the ribosome interprets the instructions contained in the transcript until the entire protein is created.

DNA and RNA consist of subunits known as nucleotides (A: adenine; T: thymine (DNA) or U: uracil (RNA); C: cytosine; G: guanine).
DNA in its natural state consists of 2 strands in which each nucleotide has an affinity for a corresponding nucleotide on the other strand, forming a robust double-stranded structure known as the double helix. Specifically, adenine has an affinity for thymine and cytosine has an affinity for guanine, and vice versa. This natural phenomenon is known as hybridization and, as you will see later, is a key reaction that makes DNA microarrays work. RNA, on the other hand, is single-stranded and contains not thymine but a similar molecule known as uracil.

Appendix B

Genome-wide Expression Technologies

B.1 DNA Microarrays

A simple microarray experiment requires 6 basic steps:

1. Sample preparation
2. Array fabrication
3. Hybridization of the sample to the array
4. Scanning and image analysis
5. Normalization/Data Preprocessing
6. Data Analysis

Sample preparation

Sample preparation is considered to be the most sensitive step in a microarray experiment. One reason for this is that samples must be extracted from crude sources such as blood, tumor biopsies, and complex multicellular tissues. Thus, the purity of samples becomes a major issue. For instance, a scientist may be interested only in the tumor tissue, but when a tumor is extracted, healthy tissue is extracted along with it. In addition, such samples must be treated with chemicals such as formalin to preserve the tissue, which may alter or degrade the RNA and thus make it unusable. Fortunately, this is an active area of research, and new technologies have been providing scientists with tools to help extract purer samples. For instance, the recent development of laser capture microdissection (LCM) helps eliminate unwanted cells and create a purer sample [15].

Array fabrication

The primary concerns when manufacturing a DNA microarray are precision, cost, and time.
Today, DNA chips are either manufactured in the lab (especially in academia) or ordered from specialized companies that manufacture and market them, including Affymetrix, Amersham Pharmacia Biotech, BioRobotics, and Genomic Solutions. The manufacturing process is also an active area of research, as scientists and businesses aim to reduce cost and time without loss of precision, or to increase precision at little cost. For instance, one of the first technologies developed, photolithographic DNA synthesis, allowed the placement of high-density oligonucleotides on microarray slides. However, the procedure is costly and time-consuming and, most importantly, its precision is poor, as the synthesized oligonucleotides are subject to wide variation and uncertainty. Thus, alternatives were developed, namely mechanical microspotting, ink jet ejection, and bubble jet technology, which helped alleviate various pitfalls of the older technologies [13].

Hybridization of the sample to the array

After the cell samples are captured (either grown or extracted directly from bulk tissues), they are placed in a test tube. The test tubes are centrifuged to move the cells to the bottom and then treated with an RNA extraction compound. This compound is then extracted, placed in another test tube, and reverse transcribed to produce cDNA molecules, as DNA molecules are much more stable than RNA molecules. As the cDNA is made, it is tagged with a fluorescent dye. These fluorescently labeled probes are then placed on a DNA microarray chip. DNA microarray chips contain many spots of complementary bases, made up of either oligonucleotides (synthetically produced short sequences) or entire cDNA sequences. For spotted microarrays, two samples (the reference and the experimental) are labeled with different dyes (e.g., Cy3 and Cy5) and competitively hybridized onto the chip.
As mentioned above, for Affymetrix-type arrays, control and experimental samples are hybridized on separate microarray slides and thus produce arrays of absolute expression intensities.

Scanning and image analysis (i.e., spot location, background correction, intensity assignment)

After hybridization, the fluorescent dyes cannot be seen until a specialized laser passes over the slide. As the laser scans, light is emitted, captured by a computer, and saved as an image file. Over the years, various groups have developed microarray image analysis software to extract the intensity of each spot in an array. To date, there is no standard way of doing this for microarrays, as there are many different image analysis methods. However, all microarray image analysis software has the same goals. First, in addition to calculating the signal intensity associated with each spot, image analysis aims to assess local background noise. In addition, most of these software tools allow basic filtering to deal with ghost spots (background higher than the spot intensity), damaged spots, and very low intensity spots. Many of these poor spots are caused by experimental factors such as wobbling of the robot arm that deposits the samples onto the slide and contamination (e.g., dust, thumbprints) [14].

Normalization/Data Preprocessing

Refer to Chapter 3.

Data Analysis

Refer to Chapter 3.

B.2 SAGE Technology

The SAGE methodology works by capturing the mRNAs in cells. Since most mRNA molecules end with a long string of As (polyadenylated tails), scientists have discovered that they can easily capture these molecules by using a long string of the complementary base, thymine (T). To perform this procedure, though, the mRNA must be reverse transcribed into cDNA (the DNA version of an mRNA transcript), since DNA is a much more stable structure than RNA. This process is performed by an enzyme found in RNA viruses called reverse transcriptase.
The next important step is to bind a molecule with 20 or more Ts to special microscopic magnetic structures called streptavidin beads. This allows the capture of mRNA molecules in the cytoplasm of the cells. These beads can then be withdrawn with a magnet along with the hybridized DNA molecule (a copy of the original RNA molecule). Next, as shown in the above figure, this cDNA molecule is cleaved by a restriction enzyme that creates an overhanging sticky end (CTAC). This sticky end allows linkers to be attached. Then a tagging enzyme, which detects the linker site, reaches down past the CTAC sticky end and cuts off a short segment of the cDNA molecule. This short segment is the tag and ideally contains enough nucleotides to identify the original mRNA molecule. After the enzymatic cutting steps, the ends of these short molecules have special characteristics that allow them to bind together. Together they form longer structures called concatemers. These structures are then PCR amplified so they can be detected and then fed into a sequencer that counts and adds up all unique tags.

Appendix C

Validation of Differentially Expressed Genes

C.1 Validation of Differentially Expressed Genes of Breast Data

1. Annexin A1 (UID: 782255):
   (a) Rank: 1
   (b) Score: 3.75
   (c) Description: They are thought to have a role in tumour suppression and are upregulated in many cancers including breast cancer.
   (d) References: PMID: 9062391, 9514092, 8387039

2. Serum Amyloid A2 (UID: 336462):
   (a) Rank: 4
   (b) Score: 3.32
   (c) Description: Elevated in many types of tumours including breast cancer.
   (d) References: OMIM: 104751, PMID: 6200925, 3734116, 11596022

3. Ribosomal protein S15; RPS15 (UID: 133230):
   (a) Rank: 5
   (b) Score: 3.32
   (c) Description: Found to be activated in various human tumors such as insulinomas, esophageal cancers, and colon cancers.
   (d) References: OMIM: 104751, PMID: 6200925, 3734116, 11596022

4. nuclear receptor subfamily 2, group F, member 6; NR2F6 (UID: 239752):
   (a) Rank: 6
   (b) Score: 3.30
   (c) Description: Members of this family of genes are involved in control and differentiation in neoplasia. Absent in human colon carcinoma.
   (d) References: OMIM: 132880, PMID: 2553781

5. interleukin 8 (UID: 624):
   (a) Rank: 7
   (b) Score: 3.18
   (c) Description: Studies suggest a role for IL-8 in promoting the metastatic potential of breast tumor cells.
   (d) References: OMIM: 146930, PMID: 11330965, 11159200, 11029796

6. protein tyrosine phosphatase, non-receptor type 13 (UID: 211595):
   (a) Rank: 8
   (b) Score: 3.17
   (c) Description: Is expressed in normal breast epithelial cells and is frequently up-regulated in breast cancer.
   (d) References: OMIM: 600267, PMID: 11696979, 9544992, 10640988

7. nuclear factor of kappa light polypeptide gene enhancer in B-cells inhibitor, alpha (UID: 81328):
   (a) Rank: 9
   (b) Score: 3.10
   (c) Description: Activation of nuclear factor-kappaB in breast cancer.
   (d) References: OMIM: 600664, PMID: 11325857, 11752211

8. LIM domain protein (UID: 79691):
   (a) Rank: 11
   (b) Score: 3.09
   (c) Description: Overexpressed in breast cancer.
   (d) References: OMIM: 603422, PMID: 11734645

9. transforming growth factor beta-stimulated protein TSC-22 (UID: 114360):
   (a) Rank: 12
   (b) Score: 3.07
   (c) Description: Transcription factor that belongs to the large family of early response genes and is thought to have a tumour suppressor role in many cancers.
   (d) References: OMIM: 607715, PMID: 11944908, 11836610, 11095965, 10879745, 10854535, 9459148, 9458104, 9195978

10. serum amyloid A1 (UID: 332053):
   (a) Rank: 13
   (b) Score: 3.06
   (c) Description: Expression in hepatoma cells.
   (d) References: OMIM: 104751, PMID: 1656519

11. inositol polyphosphate-1-phosphatase (UID: 32309):
   (a) Rank: 15
   (b) Score: 3.05
   (c) Description: Upregulated in human colorectal cancer.
   (d) References: OMIM: 147263, PMID: 10747296, 8392378

12. superoxide dismutase 2, mitochondrial (UID: 318885):
   (a) Rank: 17
   (b) Score: 3.01
   (c) Description: Down regulated in lung carcinoma.
   (d) References: OMIM: 147460, PMID: 11313974, 11491651

13. protein phosphatase 1, regulatory (inhibitor) subunit 15A (UID: 76556):
   (a) Rank: 18
   (b) Score: 3.01
   (c) Description: Member of a group of genes whose transcript levels are increased following stressful growth arrest conditions and treatment with DNA-damaging agents. Its protein response is correlated with apoptosis following ionizing radiation.
   (d) References: PMID: 11593419, 11836553

14. intercellular adhesion molecule 1 (CD54), human rhinovirus receptor (UID: 168383):
   (a) Rank: 22
   (b) Score: 2.95
   (c) Description: Upregulated in breast cancer.
   (d) References: OMIM: 147840, PMID: 11761443, 11471895, 11783310

15. spermidine/spermine N1-acetyltransferase; SAT (UID: 28491):
   (a) Rank: 23
   (b) Score: 2.93
   (c) Description: Activity in both breast cancer cells and other types of cancers (e.g., lung).
   (d) References: OMIM: 313020, PMID: 8216356, 2731159, 12697027

16. baculoviral IAP repeat-containing 3 (UID: 127799):
   (a) Rank: 24
   (b) Score: 2.91
   (c) Description: Inhibits apoptosis (programmed cell death).
   (d) References: OMIM: 601721, PMID: 8643514, 8552191

17. prostate epithelium-specific Ets transcription factor (UID: 79414):
   (a) Rank: 26
   (b) Score: 2.90
   (c) Description: The mRNA is overexpressed in human breast tumors and is a candidate breast tumor marker and a breast tumor antigen.
   (d) References: PMID: 11555586

18. keratin 14 (epidermolysis bullosa simplex, Dowling-Meara, Koebner); KRT14 (UID: 117729):
   (a) Rank: 30
   (b) Score: 2.86
   (c) Description: Expressed in breast cancer.
   (d) References: OMIM: 148066, PMID: 10786689, 11487275

C.2 Validation of Differentially Expressed Genes of Breast Data

1. protein kinase C and casein kinase substrate in neurons 1 (UID: 6462):
   (a) Rank: 2
   (b) Score: 4.26
   (c) Description: Strong expression in normal brain. Co-localizes and binds with dynamin. Phosphorylates AGT, which controls the susceptibility to methylating carcinogens in tumor cells.
   (d) References: OMIM: 606512, PMID: 11023825, 10667577, 9746365, 11179684, 11082044, 10667577

2. synuclein, beta (UID: 90297):
   (a) Rank: 3
   (b) Score: 4.25
   (c) Description: Upregulation in ovarian and breast tumors.
   (d) References: OMIM: 602569, PMID: 8194594, 10048491, 10813729

3. ATPase, Ca++ transporting, plasma membrane, 2 (UID: 89512):
   (a) Rank: 4
   (b) Score: 4.21
   (c) Description: Upregulated in brain.
   (d) References: OMIM: 108733, PMID: 10533058

4. Protein phosphatase-2A, regulatory subunit B' (PR 53) (UID: 236963):
   (a) Rank: 6
   (b) Score: 4.18
   (c) Description: Reduced expression of the Aalpha subunit of protein phosphatase 2A in human gliomas in the absence of mutations in the Aalpha and Abeta subunit genes. Thought to function as tumour suppressors.
   (d) References: OMIM: 600756, PMID: 11519040

5. Enolase-2, gamma, Neuronal (UID: 146580):
   (a) Rank: 8
   (b) Score: 4.17
   (c) Description: Elevated in glioma cells.
   (d) References: OMIM: 131360, PMID: 7520111, 6268172

6. Dynamin-1 (UID: 166161):
   (a) Rank: 10
   (b) Score: 4.12
   (c) Description: Associated with various brain tumors.
   (d) References: OMIM: 602377, PMID: 11072786, 1832879, 10749171

7. Visinin-like 1 (UID: 2288):
   (a) Rank: 12
   (b) Score: 4.08
   (c) Description: Plays an important role in regulating tumor cell invasiveness; its loss could aid in enhancing the advanced malignant phenotype.
   (d) References: OMIM: 600817, PMID: 9364517, 12941826

8. SH3 domain, GRB2-like, 2 (UID: 75149):
   (a) Rank: 14
   (b) Score: 4.06
   (c) Description: Preferentially expressed in the brain.
   (d) References: OMIM: 604465, PMID: 9169142

9. prostatic binding protein (UID: 80423):
   (a) Rank: 16
   (b) Score: 3.99
   (c) Description: Used to regulate the onset of mammary and prostate cancer in transgenic mice.
   (d) References: OMIM: 604311, PMID: 10713685, 7972041

10. RAS-associated protein RAB3A (UID: 27744):
   (a) Rank: 19
   (b) Score: 3.97
   (c) Description: Located in the chromosome region 19p13.1-p12. The 19p13.2 site is involved in malignant processes such as acute leukemias.
   (d) References: OMIM: 179490, PMID: 8432525

11. Kinesin 2, 60-70kD (UID: 117977):
   (a) Rank: 25
   (b) Score: 3.93
   (c) Description: Breast cancer antigen (upregulation).
   (d) References: OMIM: 600025, PMID: 9177777, 12747765

12. myelin transcription factor 1-like (UID: 172619):
   (a) Rank: 28
   (b) Score: 3.85
   (c) Description: Upregulated in high-grade human brain
   (d) References: OMIM: 600379, PMID: 9210873

Appendix D

Normalization of Expression Using Permutation Test and SAGE (NEPS) Reference Genes

D.1 Breast-SAGE Reference Genes

| # | Symbol | Gene | Score | Stdev. | Average |
|---|--------|------|-------|--------|---------|
| 1 | ITM2B | integral membrane protein 2B | 0.14 | 9.48 | 11.73 |
| 2 | TSPYL1 | TSPY-like | 0.13 | 18.12 | 10.40 |
| 3 | CDH1 | cadherin 1, type 1, E-cadherin (epithelial) | 0.060 | 20.75 | 12.13 |
| 4 | TAP | T-cell activation protein | 0.05 | 16.42 | 10.13 |
| 5 | HLA-DQB1 | major histocompatibility complex, class II, DQ beta 1 | 0.046 | 18.87 | 10.20 |
| 6 | CGI-135 | CGI-135 protein | 0.042 | 13.37 | 10.20 |
| 7 | DKFZp564F053 | Homo sapiens mRNA; cDNA DKFZp564F053 (from clone DKFZp564F053) | 0.033 | 16.54 | 10.60 |
| 8 | PPIB | peptidylprolyl isomerase B (cyclophilin B) | 0.024 | 20.48 | 22.80 |
| 9 | ATP1B | ATPase, Na+/K+ transporting beta 1 polypeptide | 0.022 | 16.86 | 10.73 |
| 10 | GK001 | GK001 protein | 0.020 | 13.20 | 11.27 |

Table D.1: Breast-SAGE Reference Genes

D.2 Brain-SAGE Reference Genes

| # | Symbol | Gene | Score | Stdev. | Average |
|---|--------|------|-------|--------|---------|
| 1 | TCEB2 | transcription elongation factor B (SIII), polypeptide 2 (18kDa, elongin B) | 0.15 | 9.59 | 10.81 |
| 2 | FLJ22678 | hypothetical protein FLJ22678 | 0.14 | 11.16 | 13.27 |
| 3 | ENO1 | enolase 1, (alpha) | 0.14 | 10.074 | 12.73 |
| 4 | P23 | unactive progesterone receptor, 23 kD | 0.14 | 14.020 | 16.81 |
| 5 | KIAA0193 | KIAA0193 gene product | 0.13 | 11.75 | 11.88 |
| 6 | TTYH1 | tweety homolog 1 (Drosophila) | 0.12 | 14.49 | 10.81 |
| 7 | ZYX | zyxin | 0.089 | 12.39 | 10.88 |
| 8 | DDAH1 | dimethylarginine dimethylaminohydrolase 1 | 0.088 | 13.43 | 10.54 |
| 9 | ATP5G2 | ATP synthase, H+ transporting, mitochondrial F0 complex, subunit c (subunit 9), isoform 2 | 0.077 | 13.99 | 12.46 |
| 10 | NCKAP1 | NCK-associated protein 1 | 0.049 | 13.25 | 11.12 |
| 11 | PPP1CB | protein phosphatase 1, catalytic subunit, beta isoform | 0.047 | 12.32 | 10.96 |
| 12 | MACF1 | microtubule-actin crosslinking factor 1 | 0.019 | 10.89 | 10.5 |
| 13 | CCT7 | chaperonin containing TCP1, subunit 7 (eta) | 0.016 | 10.44 | 10.69 |
| 14 | SAP18 | sin3-associated polypeptide, 18kDa | 0.0070 | 14.70 | 11.81 |
| 15 | PFKL | phosphofructokinase, liver | 0.0030 | 12.34 | 10.50 |

Table D.2: Brain-SAGE Reference Genes

D.3 Lung-SAGE Reference Genes

| # | Symbol | Gene | Score | Stdev. | Average |
|---|--------|------|-------|--------|---------|
| 1 | RPL11 | ribosomal protein L11 | 0.0043 | 8.57 | 71.45 |
| 2 | RPL15 | ribosomal protein L15 | 0.023 | 4.94 | 44.03 |
| 3 | RPL17 | ribosomal protein L17 | 0.027 | 7.99 | 63.72 |
| 4 | PKM2 | pyruvate kinase muscle | 0.054 | 7.51 | 66.65 |
| 5 | MRLC2 | myosin regulatory light chain | 0.054 | 4.48 | 40.52 |
| 6 | RPS13 | ribosomal protein S13 | 0.087 | 4.44 | 30.52 |
| 7 | ATP5A1 | ATP synthase, H+ transporting, mitochondrial F1 complex, alpha subunit isoform 1, cardiac muscle | 0.093 | 7.58 | 81.34 |
| 8 | ATP5J | ATP synthase, H+ transporting, mitochondrial F0 complex, subunit F6 | 0.097 | 2.65 | 29.48 |
| 9 | H3F3B | H3 histone, family 3B (H3.3B) | 0.097 | 2.91 | 28.34 |
| 10 | CST3 | cystatin C (amyloid angiopathy and cerebral hemorrhage) | 0.099 | 4.21 | 31.48 |
| 11 | PRDX1 | peroxiredoxin 1 | 0.11 | 7.32 | 53.66 |
| 12 | SRP14 | signal recognition particle 14kDa (homologous Alu RNA binding protein) | 0.12 | 4.00 | 42.31 |
| 13 | NINJ1 | ninjurin 1 | 0.12 | 3.06 | 28.14 |
| 14 | RPL4 | ribosomal protein L4 | 0.14 | 4.65 | 41.03 |
| 15 | TEBP | unactive progesterone receptor, 23 kD | 0.14 | 4.65 | 41.03 |
| 16 | NDUFV1 | NADH dehydrogenase (ubiquinone) flavoprotein 1, 51kDa | 0.14 | 4.00 | 29.00 |

Table D.3: Lung-SAGE Reference Genes

D.4 All-SAGE Reference Genes

| # | Symbol | Gene | Score | Stdev. | Average |
|---|--------|------|-------|--------|---------|
| 1 | UQCRB | ubiquinol-cytochrome c reductase binding protein | 0.15 | 1.98 | 23.30 |
| 2 | RPL6 | ribosomal protein L6 | | | |
| 3 | G22P1 | thyroid autoantigen 70kDa (Ku antigen) | 0.14 | 2.14 | 21.13 |
| 4 | FLJ20003 | hypothetical protein FLJ20003 | 0.13 | 6.38 | 37.07 |
| 5 | SRP9 | signal recognition particle 9kDa | 0.12 | 4.23 | 42.56 |
| 6 | GTF2I | general transcription factor II, i | 0.11 | 2.08 | 24.30 |
| 7 | ARHA | ras homolog gene family, member A | 0.11 | 2.62 | 31.10 |
| 8 | VPS28 | vacuolar protein sorting 28 (yeast) | 0.11 | 2.070 | 24.50 |
| 9 | SERP1 | stress-associated endoplasmic reticulum protein 1 | 0.084 | | 43.44 |
| 10 | CAPNS1 | calpain small subunit 1 | 0.083 | 4.28 | |
| 11 | RPS6 | ribosomal protein S6 | 0.063 | 3.81 | 35.27 |
| 12 | RPS2 | ribosomal protein S2 | 0.062 | 7.33 | 69.40 |
| 13 | LAPTM4A | lysosomal-associated protein transmembrane 4 alpha | 0.050 | 3.11 | 31.99 |
| 14 | RPL3 | ribosomal protein L3 | 0.048 | 8.69 | 101.14 |

Table D.4: All-SAGE Reference Genes
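The Stdev. and Average columns can also be read together. One simple way to compare how constant the listed genes are is the coefficient of variation (standard deviation divided by mean). The sketch below is an illustrative assumption: it is not the Score column, which is defined in the body of the thesis, and the function and variable names are hypothetical. The numbers are copied from four rows of Table D.3.

```python
def coefficient_of_variation(stdev, average):
    """Return stdev / average, a unitless measure of relative spread."""
    if average <= 0:
        raise ValueError("average must be positive")
    return stdev / average


# (Stdev., Average) pairs copied from four rows of Table D.3.
lung_rows = {
    "RPL11": (8.57, 71.45),
    "RPL15": (4.94, 44.03),
    "RPL17": (7.99, 63.72),
    "ATP5J": (2.65, 29.48),
}

cv_by_gene = {
    symbol: coefficient_of_variation(stdev, average)
    for symbol, (stdev, average) in lung_rows.items()
}

# Genes ordered from smallest to largest relative spread.
ranked = sorted(cv_by_gene, key=cv_by_gene.get)
```

A smaller value indicates less variation relative to the mean expression level, which is the property a reference gene is expected to have.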

