UBC Theses and Dissertations

The application of the permutation test on genome wide expression analysis Chan, Timothy 2006

The Application of the Permutation Test in Genome Wide Expression Analysis

by Timothy Chan
B.Sc., UBC, BC, Canada, 2002

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Computer Science)

The University of British Columbia
March 2006
© Timothy Chan, 2006

Abstract

We are now in a new era. The recent completion of the entire sequence of the human genome and the arrival of high-throughput gene expression technologies have transformed the era of molecular biology into the era of genomics. Already, such technologies are showing great promise in disease classification and in the identification of gene targets. However, as with any exciting new technology, great promise and anticipation can lead to wasted resources and false hope. It is critical that we recognize the experimental limitations of these new technologies and, most importantly, that hidden problems be addressed. The primary goal of a high-throughput gene expression experiment is to identify genes of interest that are differentially expressed between two sample groups. This thesis addresses two key issues that have hindered high-throughput gene expression technologies. The first is the sample size issue. Small sample sizes affect statistical confidence and are much more sensitive to outliers. We show that by using a nonparametric statistical test known as the permutation test, we can achieve higher accuracy than conventional parametric statistical tests such as the t-test. The second issue we address is the use of housekeeping genes for normalization of mRNA levels. It is well known that many biological experiments require a set of reference genes that are highly expressed and constant from sample to sample. The choice of reference genes is critical, as the wrong choice can have dire effects on subsequent analyses.
To address this issue, we developed a methodology based on SAGE, which is a genome wide expression technology that does not require normalization. Our results suggest that reference genes chosen by our methodology are more appropriate for mRNA normalization than the standard set of housekeeping genes. Furthermore, our results suggest that reference genes are more effective if chosen in a tissue-specific manner.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgments
Dedication
1 Introduction
  1.1 The Genomic Era
  1.2 Challenges in Genome Wide Expression Analysis
    1.2.1 Sample Size
    1.2.2 Use of Housekeeping Genes for Normalization
  1.3 Contributions
    1.3.1 Differential Gene Expression Finder
    1.3.2 Constant Gene Expression Finder
  1.4 Outline
2 Background
  2.1 The Central Dogma of Biology
  2.2 Genome Wide Expression Technologies
    2.2.1 DNA Microarrays
    2.2.2 SAGE Technology
    2.2.3 SAGE versus DNA Microarrays
  2.3 Statistical Tests
    2.3.1 Parametric Tests
    2.3.2 Nonparametric Tests
3 Differential Gene Expression Finder
  3.1 Related Work
  3.2 Preprocessing SAGE Data
  3.3 The Permutation Framework
  3.4 Breast Versus Brain Cancer SAGE Libraries
    3.4.1 Summary of Validation of Breast Cancer and Brain Cancer Genes
    3.4.2 Discussion
  3.5 Biomarkers in Early Stages of Lung Cancer
    3.5.1 Background and Related Work
    3.5.2 Materials
    3.5.3 Candidate CIS Biomarkers
    3.5.4 Desmosomes as Candidate CIS Biomarkers
    3.5.5 Invasive Candidate Biomarkers
    3.5.6 Other Observed Gene Expression Patterns
    3.5.7 Summary
4 Constant Gene Expression Finder
  4.1 Related Work
  4.2 Novel Methodology
  4.3 Validation
  4.4 Importance of Tissue Specificity
5 Conclusion and Future Directions
  5.1 Conclusion
  5.2 The Future of Gene Expression Analysis
Bibliography
Appendix A Central Dogma of Biology
Appendix B Genome-wide Expression Technologies
  B.1 DNA Microarrays
  B.2 SAGE Technology
Appendix C Validation of Differentially Expressed Genes
  C.1 Validation of Differentially Expressed Genes of Breast Data
  C.2 Validation of Differentially Expressed Genes of Breast Data
Appendix D Normalization of Expression Using Permutation Test and SAGE (NEPS) Reference Genes
  D.1 Breast-SAGE Reference Genes
  D.2 Brain-SAGE Reference Genes
  D.3 Lung-SAGE Reference Genes
  D.4 All-SAGE Reference Genes

List of Tables

3.1 Statistical Comparison of the Top 30 Genes for Breast Data
3.2 Statistical Comparison of the Top 30 Genes for Brain Data
4.1 Comparison of the Top 40 Most Differentially Expressed Gene of Microarray Data Normalized by two Approaches
D.1 Breast-SAGE Reference Genes
D.2 Brain-SAGE Reference Genes
D.3 Lung-SAGE Reference Genes
D.4 All-SAGE Reference Genes

List of Figures

2.1 Experimental steps in Serial Analysis of Gene Expression (SAGE)
2.2 Illustrates that every additional sample increases the power of the test exponentially
3.1 Illustrates that the higher the number of iterations, the more stable the list becomes
3.2 Diagram of a typical box and whisker plot
3.3 Graphical Representation of Steps of Analysis for CIS Specific Genes
3.4 Boxplots showing gene expression patterns of genes with high CIS expression
3.5 CIS genes that are down-regulated
3.6 A Model of the Basic Structure of Desmosomes
3.7 Expression pattern DSG3 in normal (N), metaplasia (M), CIS (C), and invasive (I) stages
3.8 Expression pattern of DSC2 and DSC3 in normal (N), metaplasia (M), CIS (C) and invasive (I) stages
3.9 Expression patterns of armadillo proteins in normal (N), metaplasia (M), CIS (C), and invasive (I)
3.10 Gene expression patterns of a desmoplakin (plakin family) in normal (N), metaplasia (M), CIS (C), and invasive (I)
3.11 Gene Expression Patterns of Candidate Invasive Biomarkers
3.12 Gene Expression Patterns of Candidate Invasive Biomarkers
3.13 Boxplots showing genes that are down regulated in lung cancer regardless of stage
3.14 Boxplots showing gene expression pattern for CIS and Invasive specific stages
4.1 Raw Count versus Variability on Lung SAGE data
4.2 Table showing relative comparison of validation results
4.3 Boxplots that depict the importance of tissue specificity

Acknowledgments

There are a number of people I would like to acknowledge. First and foremost, my supervisor who I am most indebted to for his guidance and wisdom. Without him, I would never have finished this thesis. Second, I am forever grateful to Wan Lam, Stephen Lam and Calum MacAulay at the BC Cancer Research Centre for their invaluable contributions including providing me the data set to work with.
Third, I would like to thank my mom and family for their financial and emotional support.

TIMOTHY CHAN
The University of British Columbia
March 2006

To my father.

Chapter 1

Introduction

1.1 The Genomic Era

The Human Genome Project was launched in 1988 in a worldwide effort to decode the entire human DNA. The main goals of the project included identifying all 20 to 30 thousand human genes and accurately sequencing the entire human DNA. On February 15, 2001, the International Human Genome Mapping Consortium published its first physical map in Nature, covering approximately 96% of the human genome [1]. Shortly after, on April 14, 2003, the project was officially completed, two years ahead of schedule. The project was accomplished by shotgun sequencing with the aid of a physical map of the genome. This great scientific accomplishment was only the birth of the genomic era. Today researchers are actively using genome research to fight common diseases such as diabetes, cancer, and heart disease. Decade-old technologies such as DNA microarrays and Serial Analysis of Gene Expression (SAGE) are finally coming of age, showing great promise as tools to detect diseases in their early stages and to aid the discovery of candidate drug targets. However, due to the complexities of the biology and of the experimental procedures behind these techniques, we currently cannot use these technologies as standard tests in hospitals and clinics. In this thesis, we address some of these challenges.

1.2 Challenges in Genome Wide Expression Analysis

In the past decade, there has been an exponential growth in biological experimental data. Two primary technologies responsible for this rise are Serial Analysis of Gene Expression (SAGE) and the more popular DNA microarrays. These two technologies were developed around the same time and have revolutionized genome-wide expression analysis.
One of the key goals of these technologies is to allow the identification of genes that are either up or down regulated in specific conditions (i.e., diseases or environmental conditions). A second goal is to map these expression levels to biological pathways so that biologists may begin to formulate hypotheses on the condition of interest. While collecting data poses some challenges, it is clear that the biggest challenge lies in the analysis of the collected data. In this thesis, we address two main issues:

1.2.1 Sample Size

For every genome wide expression experiment, samples, which include tissues or cell cultures, are processed to extract the expressed mRNA molecules. These molecules are then collected and counted via the microarray or SAGE method. Often, samples are limited, and the results therefore suffer from poor statistical significance. This is due to two primary reasons:

1. COST: Gene expression technologies are quite expensive, and thus the number of samples is often limited by cost. For instance, typical Affymetrix DNA microarray chips cost around 550 USD each. Moreover, there is still the cost of obtaining the sample, the cost of microarray analysis software, and the cost of the scanner. A SAGE experiment is even more costly, as sequencing is an expensive process. In fact, to sequence the entire genome today, it would theoretically cost around 20 million dollars. Furthermore, a single SAGE library can cost upwards of $50,000 depending on how deeply the sample is sequenced.

2. RARE SAMPLES: Sometimes, even if funding is not an issue, the samples are simply rare. For example, in our collaboration with the BC Cancer Research Centre (BCCRC), we studied the early detection of lung cancer. Lung cancer is notorious for being difficult to detect in its early stages. One main reason for this is that lung cancer is usually symptomless until its late invasive (INV) stages. Thus, early lesions known as Carcinoma In Situ (CIS) are rarely encountered. Even after a few years of data collection, the number of CIS samples in BCCRC's cohort has only reached about 10. Failure to detect lung cancer at its early stages leads to poor prognoses. Thus, the survival rate of lung cancer patients is among the lowest of all cancer types.

Small sample sizes are a major problem because of the high rate of false positives in genome wide expression experiments. In fact, both the SAGE and the microarray method require some sort of additional biological validation (i.e., RT-PCR) before their results are accepted as evidence. Thus, it is crucial that we minimize the number of false positives, as each of these validation procedures takes time and money.

1.2.2 Use of Housekeeping Genes for Normalization

Normalization is a preprocessing step that is required in many biological experiments to balance the non-biological differences between different samples so that they can be compared. Usually, housekeeping genes are used as references because they are believed to be expressed at a constant and abundant level (hereafter referred to as the constancy requirement) regardless of whether the cell is active, dividing, or simply idle. Experiments such as QRT-PCR and DNA microarrays rely on such genes for normalization. Microarrays in particular are extremely sensitive to normalization because of the many sources of error and the nature of the data. The data is intensity-based and represents the number of transcripts that naturally hybridize to a DNA microarray chip. Interpreting intensity data as mRNA count numbers is a controversial issue. Furthermore, DNA microarrays are also sensitive to many other factors such as the hybridization rate in varying environments, spot intensity issues, and many others described in the next chapter. These variations make inter-chip comparisons difficult.
Thus, microarray data must be treated with much scrutiny, as an intensity value of, say, 2 on chip A may not mean the same as an intensity value of 2 on chip B. Normalization is a critical step so that non-biological and biological differences can be distinguished. In the past few years, it has become increasingly apparent that the commonly used housekeeping genes, such as glyceraldehyde-3-phosphate dehydrogenase (GAPDH) and beta actin (ACTB), violate the constancy requirement [35] [36] [37] [38]. The problem is that normalizing to such genes can skew data and affect any subsequent analyses.

1.3 Contributions

To address the above problems, our method uses two major components. First, we use a nonparametric statistical test known as the permutation test to deal with the small sample size problem. The permutation framework has the following benefits:

1. It is designed for small sample sizes.
2. It does not make any assumption about the underlying population (it is a nonparametric test).
3. It is robust to outliers.

Second, we use SAGE (Serial Analysis of Gene Expression) for the following benefits:

1. It is an open method, so it is not biased toward a predefined set of genes. That is, we do not need to know what genes we are looking for beforehand.
2. It does not possess the same normalization issues as DNA microarrays, as SAGE is based on sequencing fragments of transcripts, which are counted in absolute terms.

As will be shown in Section 1.3.1 and Section 1.3.2, the contributions of this thesis are based on different ways of applying the permutation framework to SAGE in order to deal with the problems stated in Section 1.2.

1.3.1 Differential Gene Expression Finder

A common experiment in genome wide expression analysis is to find differentially expressed genes between diverse samples. For example, we may want to find the genes that are turned on in tumor tissues but turned off in normal tissues.
A popular way to do this is to compare the mean cancer gene expression levels to the mean normal gene expression levels. In this thesis, we will demonstrate the ability of the permutation framework to successfully identify differentially expressed genes between two diverse samples. To do this, we conducted two experiments. The first experiment shows that the permutation framework appears to be much more effective in identifying target candidate genes than the typical two-sample t-test of unequal variance. The data used here are the breast and brain SAGE data from NCBI's SAGEmap site (http://www.ncbi.nlm.nih.gov/SAGE/). The second experiment further shows that the permutation framework is effective in identifying biologically significant genes which warrant further biological investigation. That is, we demonstrate the hypothesis-generating power of the permutation test applied to SAGE data. Here, we use a cohort of lung cancer SAGE libraries at various stages (i.e., Normal, Metaplasia, CIS, and Invasive) provided by our research partners at the BC Cancer Agency. More specifically, using this set of libraries, we identified candidate biomarkers of lung cancer.

1.3.2 Constant Gene Expression Finder

Interestingly, the permutation framework is also useful for doing the reverse. As will be seen in Chapter 4, it is also useful in identifying constantly expressed genes. Experimentally, this has never been done on a global scale, with the exception of microarrays. However, extracting constantly expressed genes from a method that itself requires normalization is arguably inappropriate. Thus, this thesis proposes to use the permutation framework on SAGE to find constantly expressed genes and then to use these genes as references (i.e., housekeeping genes).
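As a rough illustration of the general idea behind a permutation test on a difference of means, the following minimal sketch may help. It is not the framework developed in this thesis (that is described in Chapter 3); the function name `permutation_test`, the choice of test statistic, the number of iterations, and the toy counts are all assumptions made only for illustration.

```python
import numpy as np


def permutation_test(group_a, group_b, n_iter=10_000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Hypothetical helper for illustration; returns an estimated p-value.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)

    # Observed statistic: absolute difference between the two group means.
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])

    hits = 0
    for _ in range(n_iter):
        # Randomly reassign the group labels by shuffling the pooled values.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        # Small tolerance guards against floating-point rounding of ties.
        if diff >= observed - 1e-12:
            hits += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_iter + 1)


# Toy expression counts for one gene in five "tumor" and five "normal" samples.
tumor = [50, 55, 60, 52, 58]
normal = [5, 8, 6, 7, 4]
p_value = permutation_test(tumor, normal)
```

No distribution is assumed for the counts; the null distribution is built from the data by relabeling, which is why the approach remains usable when only a handful of samples are available.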
O u r results suggest  6  that our set of S A G E selected reference genes are much more appropriate than the commonly used housekeeping genes. A s w i l l be shown i n Chapter 4 , it may also be important that these reference genes must be tissue specific. T h a t is, the reference genes for the lung may not be the same reference genes as for breast tissue.  1.4  Outline  The outline of this thesis is as follows.  In the next section, we first summarize  some basic biological concepts. We then go into some details of the two most popular gene expression technologies; Serial Analysis of Gene Expression ( S A G E ) and D N A Microarrays. Lastly, we describe the permutation test i n detail including the steps/algorithm. In Chapter 3, we describe two experiments that use the permutation test as a differential gene expression finder. T h e first experiment analyzes brain and breast S A G E data and compares the permutation test's performance w i t h the standard two-sample of unequal variance t-test. T h e second experiment analyzes L u n g S A G E data provided by B C Cancer Agency and compares the different stages of lung carcinoma i n attempt to discover potential biomarkers. In Chapter 4, we describe a novel normalization procedure that identifies constant expressed genes w i t h the permutation test. Finally, we conclude w i t h a brief summary and suggest further directions.  7  Chapter 2  Background 2.1  T h e Central D o g m a of Biology  It is well known that our bodies are composed of units called genes that are responsible for the various physical and mental characteristics that define us. Genes are found i n our D N A and contain templates for specialized messenger molecules known as R N A . R N A i n t u r n contains information used to make the biological worker molecules known as proteins.  Proteins have a variety of roles including  signaling, immunity, physical structure, and i n h i b i t i o n / p r o m o t i o n of growth. 
The central dogma of biology refers to the flow of information from DNA to RNA (a process known as transcription) to protein (a process known as translation). While the human cell is estimated to contain about 30,000 genes, the average cell only has about 20 percent of its genes turned on. From a molecular perspective, the difference between muscle cells, skin cells, and nerve cells is simply which genes are turned on or off in the particular cell type. This differential expression profile defines the type of cell. A mature cell's expression profile can also change in response to changes in the environment, allowing the cell to adapt and survive. For instance, heat shock proteins are proteins that are elevated in stressed conditions such as an increase in temperature or other environmental changes. One of the most important roles of these proteins is to assist in forming protein-protein conformations and to maintain those conformations. In environments where protein degradation is likely, the RNA of heat shock proteins increases. At the same time, the RNA of metabolic proteins is likely to decrease, as temperatures are not optimal for such activity. As it turns out, diseased cells also have different genetic signatures from healthy cells. Thus, one of the key purposes of gene expression analysis is to identify key genes that are responsible for the onset and phenotype of the disease. Such knowledge is extremely helpful when designing treatments.

2.2 Genome Wide Expression Technologies

Although we would truly like to analyze the protein levels in cells, this is currently extremely difficult for various reasons. First, all proteins have their own behavior and many are sensitive to degradation. Thus, even if we can extract the proteins, they would often degrade so fast that we would not be able to measure their expression accurately.
Second, proteins often interact with one another, both chemically and physically, forming large multi-protein complexes. This field of study is known as proteomics and is much more complicated than genomics. Protein microarrays are in the works but are years away from being as wide-scale, accurate and reproducible as gene expression technologies. Thus, for the past decade, researchers have mostly been analyzing protein levels indirectly, through gene expression. Fortunately, these technologies have proved to be quite useful in extracting interesting information and, most importantly, in developing hypotheses for further validation and study.

Gene expression analysis began about a decade ago, when two genome wide expression technologies were introduced that revolutionized functional genomics by allowing cross-tissue comparisons of expression profiles: serial analysis of gene expression (SAGE) and DNA microarrays. Both methods have been widely used to elucidate complete gene expression profiles. Over the years, these methods have been enhanced via statistical/mathematical and experimental techniques to account for their various weaknesses. For instance, long-SAGE was developed to allow cloning of 20 nucleotide SAGE tags, leading to improvements in tag to gene mappings [9], while MicroSAGE was developed to allow SAGE to be done when tissue material is scarce [10]. In DNA microarrays, various techniques have been developed to account for their sensitivity to normalization, such as the use of control genes [30] [29]. These techniques will be discussed later in the next section.

2.2.1 DNA Microarrays

DNA microarrays were developed to facilitate the understanding of how a cell coordinates the expression of thousands of genes in different conditions (i.e., a diseased state).
When a microarray experiment is performed, a "snapshot" of a cell's global gene expression levels is taken, giving researchers a large list of genes that are currently activated or suppressed. This technology has allowed scientists to conduct many experiments that were inconceivable just over a decade ago. For example, researchers can now compare the genetic profiles of diseased and normal cells and extract possible drug targets and key genes that may have caused the onset of the disease. This contributes to both early detection and prevention. In the following section, we describe two of the more popular types of DNA microarrays and their similarities/differences and strengths/weaknesses. We then describe the experimental steps required in a DNA microarray experiment.

Spotted Microarrays and Affymetrix Oligonucleotide Microarrays

Spotted microarrays require molecules known as cDNA. cDNAs are reverse transcribed mRNA molecules and thus are free of the sequences found only in premature mRNA (i.e., introns). Spotted microarrays are made up of several thousand cDNA probes that are fixed on a slide. In general, spotted microarrays are more popular than commercially produced arrays (i.e., Affymetrix slides) for two main reasons. First, these types of chips are relatively inexpensive compared to commercially produced chips. Second, spotted microarrays are very flexible since they allow scientists to design their own arrays and spot any type of DNA of interest onto their customized slide. Unfortunately, spotted microarrays have a few pitfalls. One major drawback is the high error rate during the spotting process. Many of these errors are due to variations in experimental procedures and in the environments in which the experiment is conducted. In fact, it is well known that results from one lab will often vary from those of another lab and even vary from experiment to experiment.
Also, most spotted arrays use cDNA as probes and consequently require known DNA for the PCR process that produces the probes. Even though the whole genome has been sequenced, not every human transcript is known. Lastly, this method results in long DNA fragments on the spotted slide and thus increases the chances of probes hybridizing to similar cDNA (gene) fragments. Thus, it is virtually impossible to detect small differences between DNA fragments (SNPs cannot be detected).

The second common type of microarray is primarily produced by the company Affymetrix. These DNA chips differ from spotted microarrays in two distinct ways. First, this technology is based on hybridization of mRNA samples to small, high-density arrays containing tens of thousands of synthetic oligonucleotides (synthetic DNA). This allows for detection of SNPs (single nucleotide polymorphisms) and other small features of DNA (i.e., specialized regions such as enhancers, promoters, alternatively spliced genes, etc.). The second major difference is that only one sample is hybridized to the chip at a time, while spotted microarrays require a control and an experimental sample to be competitively hybridized onto the chip. Thus, Affymetrix chips produce data of absolute intensities while spotted microarrays produce data of relative intensities. Like spotted microarrays, these chips do have their share of disadvantages. First, Affymetrix chips and the equipment to read them are costly. Currently, on the Affymetrix website, a human gene chip goes for around 550 USD and the basic Fluidics station and scanner cost over $100,000. A custom chip would be even more costly. Second, the signals generated from different probes for the same gene often vary in magnitude (sometimes by as much as two-fold). Thus, combining these signal intensities into one intensity that represents an estimate of the abundance of the gene is an issue.
Furthermore, since each Affymetrix chip is currently hybridized with only one sample, it has become a major challenge to normalize between the experimental and the control chips. These challenges will be discussed in the following section.

Experimental Steps of Affymetrix DNA Microarrays

A simple microarray experiment requires 6 basic steps:

1. Sample preparation: The process of obtaining the cleanest sample possible. This process is the most sensitive to errors in microarray experiments, as it is very difficult to obtain clean samples when they are subject to various environmental factors.
2. Array fabrication: The primary concern in this process is to develop arrays with high precision while lowering both the cost and the time to manufacture.
3. Hybridization of the sample to the array: In this step, researchers work on developing ways to increase the efficacy of hybridization.
4. Scanning and image analysis: Much of this part of the analysis deals with issues such as spot location, background correction and intensity assignments.
5. Normalization/Data preprocessing: The goal of normalization is to distinguish non-biological differences from biological differences.
6. Data analysis: The objective here is to find interesting gene expression patterns and groups using various statistical and data mining methods.

Out of all these steps, normalization is arguably the most useful strategy to correct for non-biological systemic errors so that comparisons between arrays can be performed appropriately. However, in spite of years of research, the normalization of microarrays remains non-standardized. Below we describe this process in detail, while the remaining steps are discussed further in Appendix B.
Normalization of DNA microarray chips

Normalization in gene expression experiments is a broad term for the process of transforming mRNA expression values so that the non-biological differences between experiments are balanced and the real biological differences between experiments can be observed. This procedure is required in experiments such as RT-PCR and DNA microarrays. It should be stressed that any form of data manipulation does in itself introduce unwanted noise to the expression measures. Normalization is not a miracle step. The technique can only correct for minor variables; if the data is poor, no amount of normalization can save it. Although many advances in the normalization of DNA microarrays have been made in the past decade, there is still no general consensus on which approach is the most appropriate. Unlike the standardized normalization process in RT-PCR, DNA microarray normalization is much more complicated. One primary reason is the numerous non-biological variables involved in a DNA microarray experiment. For example, a common variable that must be dealt with is the variation in the number of cells in different samples, since this would mean that we may have unequal quantities of starting DNA from array to array. Another common problem in microarrays is spatial effects. These problems range from the unequal distribution of solvent across the surface of the array to the quality of washing off non-hybridized mRNA. Furthermore, hybridization efficiency is another common issue. The strength of a hybridized DNA duplex can be affected by its GC versus AT content. GC pairings tend to be stronger as they contain 3 hydrogen bonds, while AT pairings are weaker with 2 hydrogen bonds. Also, since different genes often have homologous regions, cross hybridization is a possibility. Moreover, statisticians must also take experimental bias into account.
For example, one experimenter may decide to add fluorescent labels for all their chips in one step, while another may add them separately for each chip. The above is only a subset of the issues that need to be considered when analyzing DNA microarrays. In the next section, we will describe some ways to deal with some of these common variables. Before describing some of these normalization options, it is important to note that the normalization process is applied after image analysis and before other subsequent data mining analyses, including supervised and unsupervised clustering analyses (e.g., hierarchical clustering, K-means clustering, or self-organizing maps).

Some Normalization Options

Over the past decade, many normalization methods have been proposed, many of which were designed for the specific conditions of the experiment being run. That is, depending on what the experiment entails, there are choices to be made to yield the best results. For brevity, we will only describe a few of the more popular and newer methods. The first step in normalization is to come up with an appropriate set of genes. Yang et al. suggested three basic approaches: (1) use all genes on the array; (2) use constantly expressed genes as references (i.e., housekeeping genes); (3) use control genes [29] [30].

All Genes on Array

This type of normalization is a simple approach and is based on the following assumption. Since only a small fraction of genes will be up or down regulated in the samples, on average, the numbers of up and down regulated genes ought to counter each other's effect on the total RNA count. In other words, there should be equal weights of RNA for all samples and thus the number of RNA molecules should also be roughly the same from experiment to experiment.
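A minimal sketch of what scaling by all genes can look like in practice is given below. It is only an illustration under stated assumptions: the function name `scale_to_common_total`, the use of the mean total as the common target, and the toy intensities are hypothetical and are not taken from any particular microarray package.

```python
import numpy as np


def scale_to_common_total(arrays, target=None):
    """Rescale each array so that its summed intensity equals a common target.

    Hypothetical helper: `arrays` is a list of per-chip intensity vectors.
    If no target is given, the mean of the per-chip totals is used.
    """
    arrays = [np.asarray(x, dtype=float) for x in arrays]
    totals = np.array([x.sum() for x in arrays])
    if target is None:
        target = totals.mean()
    # One multiplicative factor per chip; relative values within a chip are unchanged.
    return [x * (target / t) for x, t in zip(arrays, totals)]


# Two toy chips measuring the same four genes; chip B is uniformly half as bright.
chip_a = [100, 200, 300, 400]
chip_b = [50, 100, 150, 200]
scaled_a, scaled_b = scale_to_common_total([chip_a, chip_b])
```

After scaling, both toy chips have the same total intensity, so a purely technical brightness difference disappears. The correction is only meaningful if the total RNA really is comparable between the samples, which is the assumption stated above.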
This assumption is valid for genome-wide expression experiments covering the entire genome because we are dealing with thousands of genes, of which only a handful are differentially expressed. The problem with this method is that most microarray chips do not cover the entire genome (especially for Homo sapiens). In addition, this method also assumes that the total mRNA remains constant under different experimental conditions. Consequently, this method is only really useful if the chips being compared have similar expression patterns, and it is invalid for chips with vastly different genetic signatures [29]. For instance, the genetic profiles of tumor cells are known to be vastly different from those of the corresponding normal cells, since tumor cells are essentially in hyperdrive.

A variant of using all genes is normalizing to the total or ribosomal RNA. This method is based on the assumption that ribosomal RNA is constitutively expressed in all cells. Also, since ribosomal RNA makes up 90% of the total RNA in a typical cell, it was thought that it could be used as a normalization standard for small fluctuations in mRNA levels. However, studies have shown that this underlying assumption is not correct [36]. The total RNA produced in a cell can vary considerably from cell to cell, and it is dependent on many factors, including the cell type and the conditions the cell is in.

Using Constantly Expressed Genes for Normalization

Normalizing to a set of constantly expressed genes is also a popular method and is used in Affymetrix's 133 series chips. Instead of using the whole set of genes for normalization, a small subset of genes is used, and chips are re-scaled so that the average values of each housekeeping gene are equal across chips. These genes are chosen based on the biological belief that they are required for a cell to function and live.
They are typically picked by selecting for abundant expression in all developmental stages of the cell and are assumed to be non-regulated over various conditions. However, as explained in detail later, these biological assumptions may be too strong. In fact, in recent years, there have been many studies evaluating the variability of these housekeeping genes [35], [36], [40], [38], [39]. Not surprisingly, we observed similar trends in our own results.

In the past few years, many have proposed ways to improve on this method. For instance, Kepler et al. proposed that the choice of housekeeping genes be made by looking for a stable background pattern of activity on the microarray. Using this assumption, they derive a transcriptional core from this stable background pattern (identified statistically for each experiment). This method is referred to as normalization by self-consistency and uses the following steps to acquire the core. First, all genes are designated as core genes. Next, the whole chip is normalized to all the genes on the chip. After this step, all the genes that have not changed from the normalization process are kept. This process is repeated until the previous transcriptional core and the next transcriptional core are equal [31].

Another method of finding constantly expressed genes, proposed by Tseng et al., is the rank invariance approach. This approach attempts to find genes that are non-differentially expressed by first separately computing the intensities of the fluorescent labels (of different colors). The ranks of these intensities are then compared, and if the ranks of the two differ by less than some arbitrarily set threshold and the rank of the averaged intensity is not among the highest l ranks or lowest l ranks (for a chosen cutoff l), the gene is classified as a non-differentially expressed gene [32]. The problem with these two methods is that they are based on using on-chip data to find non-differentially expressed genes.
This, in itself, introduces a circular argument since these so-called constantly expressed genes are also subject to the same systematic technical variations as any other spot on the chip.

Control Genes

Controls are chosen genes that have no relation to the experimental samples under study. Some Affymetrix chips employ such genes. One such type of control is known as spiked controls. They are synthetically generated DNA sequences or DNA sequences from a known organism different from the one under study. These genes are spotted on the array of the controlled chip and the variable chip. These samples have an equal amount of mRNA and thus should have equal intensities and could be used for normalization. Another controls approach is known as the titration series approach, which involves the spotting of several different concentrations of the same genes or ESTs that span the range of intensities on the array. Unfortunately, this technique is technically challenging [29].

Scaling and Transformation

The next logical step after selecting which genes to use as references is to scale the rest of the data to match the relative distribution of these reference genes. There are several ways to approach this. For example, one simple approach is to scale to some arbitrary constant. This constant is often a calculated value such as the mean, trimmed mean or median of all the intensity values across all the experiments. Sometimes replicates are generated to evaluate the relationships between chips, and often they reveal a non-linear relationship among probes. Thus, logarithmic transformations are often performed to stabilize such variances. Recently, Lu's studies have shown that eliminating outliers did not reduce variation among multiple arrays but rather increased it. Thus, he highly recommended using the mean of the logarithm-transformed signals to calculate the normalization constant [34].
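As a rough illustration of scaling to the mean of log-transformed signals, the following Python sketch rescales each chip so that its mean log2 intensity matches the average over all chips. The function name `scale_to_log_mean`, the base-2 logarithm, and the choice of the grand mean as the target constant are assumptions made for illustration; they are not taken from [34].

```python
import math


def scale_to_log_mean(chips):
    """Rescale each chip so all chips share the same mean log2 intensity.

    `chips` is a list of chips, each a list of positive intensities.
    Hypothetical helper: using the grand mean of the per-chip log means
    as the target constant is an illustrative assumption.
    """
    log_means = [sum(math.log2(x) for x in chip) / len(chip) for chip in chips]
    target = sum(log_means) / len(log_means)
    scaled = []
    for chip, log_mean in zip(chips, log_means):
        # Shifting by a constant in log space is a multiplication on the raw scale.
        factor = 2.0 ** (target - log_mean)
        scaled.append([x * factor for x in chip])
    return scaled


# Made-up intensities: the second chip is uniformly twice as bright as the first.
chips = [[100.0, 200.0, 400.0], [200.0, 400.0, 800.0]]
scaled = scale_to_log_mean(chips)
```

In this made-up example the two chips differ only by a constant factor, so after scaling they coincide. Real chips would still differ gene by gene after such a global correction.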
Data Analysis

Recall that the last step in microarray analysis is simply the statistical tests applied to the cleaned-up data. Researchers often apply common statistical procedures such as ratios and absolute differences between intensities to find differentially expressed genes. Sometimes they will use more sophisticated tests such as the popular t-test. In addition, it is quite popular to cluster the data (unsupervised or supervised) in order to find biologically relevant groups of genes. These groups of genes could be used to identify subgroups of diseases at the genetic level.

2.2.2 SAGE Technology

Like DNA microarray technology, Serial Analysis of Gene Expression (SAGE) was developed to provide a comprehensive and quantitative gene expression profile of target cells. The SAGE technique was developed at Johns Hopkins University in the USA around 1995 by Velculescu et al. [5]. Unlike microarray techniques, which are based on relative expression levels (or absolute fluorescent intensity levels (Affymetrix)), it measures mRNA expression by counting representative short sequenced tags. The expression values are thus given in absolute terms. This avoids the need for the kind of normalization required for comparing multiple microarray chips. Another important advantage of the SAGE technique is its ability to detect small transcripts, which may be crucial genes as these could be the switches or pathway regulators leading to the onset of a disease like cancer. A third advantage of SAGE over microarrays is that it does not require the genes to be previously known. Thus, analysis can reveal previously unknown genes as potential markers. Lastly, SAGE libraries may be pooled. That is, even after an experiment is done, one can go back and sequence the same tissue library even deeper and simply add these results to the old data.

As great as SAGE seems to be, it has one major pitfall.
Collecting SAGE data involves sequencing, which is a laborious and expensive process. At the current market rate, generating a SAGE library costs at least 100 times more than performing a microarray experiment. Thus, there are only a small number of SAGE libraries (a library corresponds to the measured mRNA expression levels of a sample or patient). For instance, as will be described in Chapter 3, when we conducted the brain and breast SAGE analysis 2 years ago, there were only 17 brain cancer libraries, 7 brain normal libraries, 10 breast cancer libraries and 5 breast normal libraries publicly available through the NCBI website. Thus, while SAGE libraries give high-quality data, the small sample size problem is critical for the analysis of SAGE libraries. In Section 2.3, we will describe a statistical test designed for such a problem. However, before we get into the analysis, we describe the SAGE technique in detail below.

Methodology

Recall that the basic units that make up DNA and RNA are called nucleotides. To conduct SAGE, a method called sequencing is required to read these nucleotide sequences. This procedure is complex and costly in both money and time. In fact, if one were to sequence every mRNA molecule in a cell in its entirety, it would probably take a decade or so. Fortunately, as it turns out, fourteen nucleotides of an mRNA molecule are sufficient to identify the majority of the mRNA in a cell precisely. This was a major discovery, as an mRNA molecule can have thousands of nucleotides. The SAGE method is based on this idea of taking a small fragment (tag) of nucleotides from an mRNA molecule to represent the entire transcript of a particular gene. These tags are obtained by special cleaving enzymes (restriction enzymes) that recognize a specific sequence of nucleotides in the mRNA strand and cut around 14 nucleotides downstream from this recognized region.
This process will almost always obtain the same tag sequence if the transcript is the same. Of course, such a process raises the following question: is it possible that different genes can have the same tag? Yes, but because there are 14 base pairs, the likelihood is statistically acceptable. However, it is likely that more than one tag represents the same gene due to biological phenomena such as polymorphism and alternative splicing. In Chapter 3, we will describe these issues in more detail and how they are circumvented. A second question that may arise is: is it possible that some genes may not have this restriction enzyme site? Yes, but this is a relatively small percentage of genes, as restriction enzymes typically cleave sites of no more than 4 nucleotides. Figure 2.1 briefly outlines the steps of a typical SAGE experiment (based on the figure at http://www.ncbi.nlm.nih.gov/Class/NAWBIS/Modules/Expression/exp82.html). Further details can be found in Appendix B.

2.2.3 SAGE versus DNA Microarrays

The choice of the SAGE technique over the DNA microarray hybridization technique depends on several factors including the amount of starting material, the number of samples, and the availability of resources [6]. For example, one major consideration for SAGE is the fact that around 1.5 × 10^6 bases need to be sequenced in order to do a simple 2-sample library comparison. In spite of technological advancements, sequencing is still an expensive procedure as it requires an expensive automated sequencer. Thus, when the amount of sample material is not a consideration, the DNA microarray hybridization technique is much more cost-effective and would yield many more results.
However, if we only have a few rare precious samples, the amplification step (PCR) in the SAGE method allows experiments to be done with sample sizes as low as 9 oocytes to 100,000 cells [6]. In fact, SAGE on a single cell has even been reported [7]. If there are a large number of samples, microarrays would be more appropriate as the costs are much lower.

Figure 2.1: Experimental steps in Serial Analysis of Gene Expression (SAGE)

2.3 Statistical Tests

Since few results in science are absolute, we must have a way to "weigh" our results. Statistics was developed to help "weigh" our results so we can determine whether an experiment is credible or not. In general, a hypothesis is generated and either accepted or rejected based on a mathematical measurement of confidence. A typical statistical experiment compares the values of a population between two samples, where each member has a value. For example, gene expression values of a set of cancer cell lines and a set of normal cell lines may be compared to determine how likely it is that the expression of the gene in question differs between cancer and normal.

2.3.1 Parametric Tests

Conventional statistical tests, also known as parametric tests, work well if the following assumptions are met:

1. The distribution of the underlying population can be assumed. For example, a commonly used assumption is that the experimental values of the sample are normally distributed. This assumption is often referred to as the parametric assumption.
2. Overall, the samples have equal variances.
3. Samples are drawn from the population independently.

If a parametric assumption cannot be met, statisticians often apply various tactics to manipulate the data. As long as this is done with care and never used to adjust for a better p-value, such a procedure is permitted.
Such tactics include deletion of outliers, winsorization, and/or trimming of the data. These procedures are tricky though, as we can never be sure whether these outlying values arise from naturally occurring biological phenomena, from chance, or from a mistake in the data collection (e.g., sequencing errors). Furthermore, when sample sizes are small, deleting or manipulating any values can have drastic consequences for the final results. For instance, suppose we have two samples. Sample A consists of the following values: 4, 100 and 6. Sample B consists of 5, 5 and 5. The mean of sample A is 36.7 while the mean of sample B is 5. Thus, these two samples would appear very different by a simple comparison of means. However, if the outlier 100 is removed from sample A, samples A and B suddenly have the same mean.

2.3.2 Nonparametric Tests

Nonparametric tests are classes of statistical tests that require both the equal variance and the independence assumptions described above. The advantage of these types of tests is that they do not require a parametric assumption. The disadvantage is that distribution-free methods require great care because they are often less efficient, in that they are poorer at detecting false positives. However, for situations where the distribution cannot be assumed (e.g., a small sample size), nonparametric tests are ideal.

Permutation Test

Often, we see inappropriate use of parametric tests conducted on only a handful of microarray or SAGE experiments. Here, we introduce the permutation test. This non-parametric test is designed for small sample sizes and makes no assumption about the underlying population. Furthermore, from Figure 2.2, we can see that for every additional sample, the power of the test increases exponentially [47]. In the following Chapter, we will describe how we used the permutation test to conduct our analyses.
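One way to read the growth described here is through the number of distinct ways the pooled samples can be relabelled into two groups, which limits how finely a permutation test can resolve significance. The short Python sketch below computes that count. The function name `num_relabelings` is hypothetical, and treating this count as the quantity of interest is an assumption rather than a statement from [47].

```python
from math import comb


def num_relabelings(n_normal, n_cancer):
    """Number of distinct ways to split the pooled samples into two groups
    of the given sizes (hypothetical helper, for illustration only)."""
    return comb(n_normal + n_cancer, n_normal)


# Balanced designs: the count grows quickly with each added sample per group.
for k in range(2, 8):
    print(k, num_relabelings(k, k))

# Library counts quoted in Section 2.2.2.
breast = num_relabelings(5, 10)   # 5 normal and 10 cancer breast libraries
brain = num_relabelings(7, 17)    # 7 normal and 17 cancer brain libraries
print(breast, brain)
```

For the library counts quoted earlier, this gives 3003 possible relabelings for the breast data and 346104 for the brain data.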
[Plot: # of Samples vs # of Combinations (Log Scale); x-axis: Number of Samples]

Figure 2.2: Illustrates that every additional sample increases the power of the test exponentially

Chapter 3

Differential Gene Expression Finder

Cancer currently ranks second among diseases that kill humans. With the aging population and the increasing introduction of environmental complexities (e.g., pollution), it is likely that it will soon overtake cardiovascular disease as the number one killer of humans. The recent development of high-throughput technologies and the recent sequencing of the entire human genome have brought great promise for unraveling the mysteries of diseases such as cancer. However, progress in cancer research has been hindered by two main factors. First, cancer is a highly variable disease that is triggered by multiple genetic changes, making it difficult to pinpoint genetic targets of interest. Second, current high-throughput technologies such as DNA microarrays are not 100% reliable and often give rise to false positives (e.g., [44], [45]). Thus, the majority of findings from high-throughput technology cannot be considered ready and safe for clinical trials. For this to happen, validation of candidate genes is required, which is often laborious and unsuccessful. Consequently, it is critical that the differential gene expression selection process is as accurate as possible.

In this Chapter, we demonstrate the importance of using an appropriate statistical test in identifying genes that are differentially expressed in tumors. As discussed in Chapter 1, one of the major problems that plagues experiments is the sample size problem. Much of the past research relied on parametric tests such as the t-test, which require a large sample size. Here, we introduce a non-parametric test known as the permutation test.
This test requires a much smaller sample size since, with each additional sample, the power of the test increases exponentially. To show the appropriateness of this test, we conducted two different experiments, both of which apply a permutation test algorithm to identify the genes that are the most differentially expressed. The first experiment has two main objectives: first, to compare the genetic signatures of two vastly different cancers (brain and breast cancer); second, to evaluate the performance of the permutation framework against the standard t-test. The second experiment focused on comparing the genetic signatures of different stages of the same type of cancer, in this case lung cancer. The goal here was to find genes that were specific to each stage and genes that were common to both stages.

Before getting into the details of each of these experiments, we will first describe how we preprocessed the SAGE data in each case. Following this description, we will also describe the finer details of the permutation framework we used to conduct our experiments. But first, we will briefly discuss recent findings in cancer research using high-throughput technologies.

3.1 Related Work

In the past decade, the development of high-throughput gene expression technology has created a stir of enthusiasm in the cancer research community. However, as with any new technology, such excitement and eagerness can often lead to false discoveries, as many problems could be overlooked. In the following sections, we will describe some of these experiments, some of which appear promising and some of which appear questionable.

Applications of High-throughput Gene Expression Technologies in Brain and Breast Cancer Research

One of the main goals of cancer research is to find reliable diagnostic or prognostic biomarkers for different grades of tissue-specific tumors.
Today, we rely on specialists to assess the grade and predict the behavior of a tumor based on an examination of its histology and morphology. These methods are highly subjective and thus not reliable. In brain cancer, for example, it has been shown that even among experienced neuropathologists, the final diagnosis is observer-dependent [46]. Thus, such biomarkers are desperately needed. A second goal of genomics in cancer research is to find the switches or regulators that may have caused the onset of the disease. Following such a discovery, researchers have the potential to design drugs to control these checkpoints in the pathway of the disease and thus treat the disease.

The advent of genomic technology has provided valuable tools for cancer researchers to aid both biomarker and switch/regulator discovery. Typically, normal and cancer tissues are compared in order to identify genes that are differentially expressed. Several research groups are already demonstrating the power and potential of such technologies. However, as will be seen below, accuracy is hindered by sample sizes and noise.

Ljubimova et al. used DNA microarray technology to identify 11 genes that were deemed differentially expressed in human gliomas, using a commercially produced DNA microarray called Gene Discovery Array (GDA) from Incyte Genomics, Inc. These 11 genes were chosen based on the highest mean ratios of differential expression between normal brain tissues and brain tumor tissues. The problem with such an approach is that means are sensitive to outlying values. For instance, if one microarray experiment shows a very high gene expression level due to some noise, the gene will be deemed significant. Not surprisingly, only 2 of these 11 genes were verified via semi-quantitative RT-PCR and Northern blot analysis [45]. These results demonstrate that the chances of false positives are great. In another study, Watson et al.
identified 196 genes that demonstrated common differential patterns among different tumor grades of oligodendrogliomas. This study successfully showed that these expression patterns could be correlated with oligodendroglioma tumor grades. The main problem with this study is that the oligonucleotide microarray consisted of only 1100 genes and they only had 7 different samples/microarrays [46].

The accuracy of high-throughput gene expression studies for breast cancer is no better. Recently, Shen et al. demonstrated the loss of annexin A1 expression in human breast cancer, detected using both SAGE and microarrays. For their microarray experiments, they used 2 normal and 7 breast cancer samples. They found 129 genes with a greater than five-fold change. To narrow the number of genes for further study, they used SAGE and EST analysis and identified only 4 qualified genes. Their expression pattern was further validated via RT-PCR. Following this analysis, ANXA1 was further confirmed at the mRNA level by Human Breast Cancer Tissue Profiling Array and at the protein level. Only at this point were they satisfied that this gene is of significance [42]. This demonstrates that genomic technology is far from clinical use for screening patients.

Currently, there are few studies that compare brain and breast cancer at the molecular level. Ng et al., though, did show that by using a clustering algorithm called OPTICS on a subset of genes selected using the Wilcoxon test, they were able to create 8 distinct clusters including brain cancer, breast cancer, brain normal, and breast normal clusters. These results suggest that brain and breast cancers are also different at the molecular level. However, no subtypes of cancers were noticeably clustered, but this may have been due to the few libraries they had to work with [50].
To date, the success rate of these genome-wide expression studies has not been as high as anticipated only a decade ago. Such experiments generate large lists of differentially expressed genes, but only a small portion of these genes are validated upon further analysis. Because genomics data are subject to many non-biological errors, statistical significance needs to be very strong. Another common problem is small sample sizes. Undertaking a genomics experiment is a risky and expensive endeavor. In Section 3.4, we will demonstrate how much an analysis can differ just by changing how we identify significant genes. But first, we will describe how we preprocessed the SAGE data and describe the permutation framework we used to identify differentially expressed genes.

3.2 Preprocessing SAGE Data

Recall that SAGE has various deficiencies. First, SAGE tags often map to more than one gene due to the short sequence that represents the mRNA. Second, genes may also map to more than one tag due to two major biological phenomena: alternative splicing and polymorphism. Both will be discussed later in greater detail. Furthermore, due to the time and money required, the publicly available SAGE libraries created by different research institutes and universities vary in size, ranging from 10,000 to 60,000 tag types per library. In order to conduct appropriate analyses, the data must be preprocessed. Below we describe the preprocessing steps we took to minimize these deficiencies.

1. Gene to Tag Assignment

An average mature mRNA sequence is around 1800 to 2000 nucleotides in length. SAGE relies on a 10 base pair sequence to uniquely identify a gene. However, since there are only 4 different bases in total, it is likely that a tag maps to more than one gene. Not surprisingly, in practice, we found that there are numerous tags that map to more than one gene.
Since it is impossible to determine which gene is actually being expressed, we dealt with this problem by assigning the expression level of the tag to each gene that it mapped to. For example, if tag A maps to genes 1, 2, and 3, all three genes will be assigned the tag count of tag A.

2. Tag to Gene Collapsing

This preprocessing step deals with the inherent fact that genes often map to more than one tag. As mentioned above, this is primarily due to two major biological phenomena. The first, known as alternative splicing, describes the post-processing steps of an mRNA molecule before it enters the cytoplasm. Before mRNA leaves the nucleus to be translated, certain regions called introns are spliced out and the remaining regions (exons) are concatenated, forming a much shorter mRNA sequence. These regions are mixed, creating alternatively spliced forms of the same gene and increasing protein diversity. However, these alternate forms may or may not possess the same functionality [2] [4]. This phenomenon has recently been estimated to occur in about 55% of the human genome [3]. The other major reason for multiple tags mapping to the same gene is that polymorphism among the population may exist, which may further contribute to several tags mapping to the same gene. For this situation, we simply summed the tag expression levels that mapped to the same gene. For instance, if tags A, B, and C all mapped to gene X, the expression levels of tags A, B, and C would be summed into one expression level. By performing this procedure we also get the added benefit of a reduction in the dimensionality of the data, which in turn significantly decreases computation time. It should be noted, though, that this process was not done in all our experiments.
If the type of experiment involves searching for only highly expressed tags, then we found it reasonable to collapse, since high counts are less prone to sequencing errors. In addition, larger counts tend to give better confidence in the tag-to-gene mapping. We will discuss this issue further in Chapter 4, where collapsing is performed.

3. Scaling the SAGE libraries

To deal with the varying sizes of SAGE libraries, we scaled the tag counts to an arbitrary large total tag count (e.g., 1 million) to reduce comparison errors.

3.3 The Permutation Framework

The permutation test itself is not a new concept or method. In fact, it was introduced in the 1930s. The advent of computers has made this resampling method relatively easy to implement. Several researchers have adopted the permutation concept. For example, Dudoit et al. used the permutation test's resampling techniques to eliminate the t-statistic's strong normality assumption [49].

In contrast to Dudoit et al.'s work, the framework of the permutation test that we used is based strictly on a test of means between two different distributions. Here, we assume one distribution is for the gene expression level of a particular gene in normal tissues (subscript n) and the other for cancerous tissues (subscript c). To select significant genes, we use the following hypotheses:

Null Hypothesis $H_0: \mu_c - \mu_n = 0$    (3.1)

Alternative Hypothesis $H_a: \mu_c - \mu_n \neq 0$    (3.2)

The null hypothesis states that the distribution of average gene expression levels in cancerous libraries is the same as that in non-cancerous libraries. If the null hypothesis is rejected, it indicates that the gene expression levels of normal and cancer samples are sufficiently different (the alternative hypothesis). A variation of this test has previously been used for identifying differentially expressed genes in time-course microarray experiments [48].
In the following section, we show our specific implementation of the test. Let $N$ and $C$ be the number of normal tissue samples and the number of cancerous tissue samples, respectively.

Algorithm

1. For each gene, take all the gene counts of the normal libraries and all the cancer libraries. Mix them up in an urn.
2. Randomly select $N$ counts from the urn to create a simulated normal set, and calculate the simulated normal mean $\mu_{sn}$.
3. Similarly, the remaining $C$ counts form the simulated cancerous set. Calculate the simulated cancer mean $\mu_{sc}$.
4. Consider the random variable $v = \mu_{sc} - \mu_{sn}$, called the simulated difference.
5. Repeat steps 1 to 4 above $M$ times. Let $\mu$ and $\sigma$ denote the mean and the standard deviation of $v$.
6. Now separate the libraries back into their true identities: normal or cancerous. Calculate the true observed difference $\theta = \mu_{rc} - \mu_{rn}$, where $\mu_{rc}$ denotes the true mean count of the cancerous libraries and $\mu_{rn}$ denotes the true mean of the normal libraries.
7. Calculate the Permutation Score $PS$, where $PS = (\theta - \mu) / \sigma$.

These permutation scores statistically rate how different the two groups are from each other. The higher the permutation score, the greater the statistical confidence that the two populations are different; the lower the permutation score, the more statistically similar the two populations are.

In the algorithm shown above, we used $M$ permutations or iterations for each gene. The following figure plots the hit ratio for varying values of $M$. For the results reported here, we used $M = 5,000$ on the breast data, where $N = 5$ samples and $C = 10$ samples. A natural question to ask is whether this is sufficient. To answer this question, we re-ran the algorithm using $M = 10,000$ on the same data and selected all the genes whose permutation score exceeds a certain threshold.
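The algorithm above can be sketched in Python for a single gene as follows. This is a minimal illustration rather than the implementation used for the reported results: the function name `permutation_score`, the use of the population standard deviation, the handling of a zero standard deviation, and the example counts are all assumptions.

```python
import random
import statistics


def permutation_score(normal, cancer, m=5000, seed=None):
    """Permutation score for one gene (illustrative sketch).

    `normal` and `cancer` hold the gene's counts in the normal and
    cancerous libraries; `m` is the number of random relabelings.
    """
    rng = random.Random(seed)
    urn = list(normal) + list(cancer)
    n = len(normal)
    simulated = []
    for _ in range(m):
        rng.shuffle(urn)                       # mix the counts in the urn
        sim_normal, sim_cancer = urn[:n], urn[n:]
        simulated.append(statistics.fmean(sim_cancer) - statistics.fmean(sim_normal))
    mu = statistics.fmean(simulated)           # mean of the simulated differences
    sigma = statistics.pstdev(simulated)       # their standard deviation
    theta = statistics.fmean(cancer) - statistics.fmean(normal)
    if sigma == 0:
        return 0.0                             # assumption: all counts identical
    return (theta - mu) / sigma


# Made-up counts for one gene in 5 normal and 10 cancer libraries.
normal_counts = [1, 2, 3, 2, 1]
cancer_counts = [50, 52, 49, 51, 50, 53, 48, 50, 51, 49]
score = permutation_score(normal_counts, cancer_counts, m=2000, seed=0)
```

Shuffling the pooled counts and slicing off the first N values is equivalent to drawing N counts from the urn without replacement. A gene with higher counts in the cancer libraries gets a positive score, and swapping the two groups flips the sign.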
For example, a permutation score threshold of 1.96 corresponds to the tail ends beyond 95% of the area under the standard normal curve. This set of genes is then intersected with the corresponding set of genes found using $M = 5,000$. The ratio of the size of this intersection to the size of the set of genes found using $M = 10,000$ is called the hit ratio. Figure 3.1 shows that the set of genes stabilizes rather quickly, and $M = 5,000$ is a reasonable value.

[Plot: Hit Ratio vs Iterations]

Figure 3.1: Illustrates that the higher the number of iterations, the more stable the list becomes

3.4 Breast Versus Brain Cancer SAGE Libraries

To perform this experiment, we used a set of brain and breast SAGE libraries found on NCBI's public database. At the time of analysis, there were 17 brain cancer libraries, 7 brain normal libraries, 10 breast cancer libraries and 5 breast normal libraries. To satisfy the first objective, we first applied the permutation test to select the most significantly differentially expressed genes in brain cancer (i.e., brain cancer versus normal brain libraries). We then sorted the permutation scores in descending order and examined the top 30 ranked genes. The higher the score or rank, the more statistically confident we are that the two groups are different. Similarly, we identified the top 30 ranked genes for breast cancer. To compare the two lists, we performed an intersection of the top 500 ranked genes. Thirty genes were chosen for two main reasons. First, each list contained over 500 genes, and looking up 500 different genes in the literature would be a very laborious task. Second, we believed that 30 genes would be sufficient to show which of the two tests gave more accurate results. Next, to evaluate the effectiveness of the permutation test, we performed a similar experiment using the 2-sample t-test of unequal variance.
To obtain the t-value, we first calculated the mean ($\mu$) and the pooled variance ($S_p^2$) separately for both the cancerous libraries and the normal libraries. Then we computed the t-value using:

t =                (3.3)

The t-values were then sorted in descending order and the top 30 ranked genes were identified. This list was then compared with the permutation lists.

To evaluate the validity of our results using the two different tests, we performed a literature search on PubMed on the top ranked genes. With these top ranked genes, we looked at whether the genes were related to the neoplastic process. In order to be consistent, we used the following rules for a gene to be considered related to the neoplastic process:

1. The gene is up-regulated or down-regulated in breast or brain cancer.
2. The gene is up/down-regulated in another type of cancer.
3. The gene is a known cancer-related gene (i.e., oncogene, mutator, tumour suppressor).
4. The gene is a major component of the cell cycle.

3.4.1  Summary of Validation of Breast Cancer and Brain Cancer Genes

Category                  T-test   Permutation Test
Breast Cancer Related     3        11
Related To Other Cancer   4        7

Table 3.1: Statistical Comparison of the Top 30 Genes for Breast Data

From Table 3.1 we can see that among the 30 top ranked genes of the breast libraries identified by the permutation test, 11 are related to breast cancer. Another 7 genes are related to other types of cancer. In contrast, among the top 30 ranked genes identified by the t-test, only 3 are known to be breast cancer related. Interestingly, all three are included in the top 30 selected by the permutation test. These results suggest that the permutation test may be more appropriate for differential gene selection when the sample size is small. For details of the exact genes found, please refer to Appendix C.
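As a minimal sketch of the permutation score procedure described earlier in this chapter (not the implementation actually used for the reported results), the per-gene computation could look as follows in Python. The function name `permutation_score`, the fixed random seed, the use of the population standard deviation, the form |theta - mu| / sigma for the score, and the convention of returning 0 when all simulated differences are identical are assumptions made for illustration.

```python
import random
import statistics


def permutation_score(normal, cancer, m=5000, seed=0):
    """Permutation score for one gene (hypothetical helper, for illustration).

    normal, cancer: sequences of tag counts, one per library.
    m: number of permutations (M in the text).
    """
    rng = random.Random(seed)          # fixed seed only to make runs repeatable
    n = len(normal)
    urn = list(normal) + list(cancer)  # step 1: mix all counts in an urn

    diffs = []
    for _ in range(m):                 # step 5: repeat M times
        rng.shuffle(urn)
        sim_normal = urn[:n]           # step 2: N counts form the simulated normal set
        sim_cancer = urn[n:]           # step 3: the remaining C counts
        # step 4: simulated difference v
        diffs.append(statistics.fmean(sim_cancer) - statistics.fmean(sim_normal))

    mu = statistics.fmean(diffs)
    sigma = statistics.pstdev(diffs)   # assumption: population standard deviation

    # step 6: true observed difference
    theta = statistics.fmean(cancer) - statistics.fmean(normal)

    if sigma == 0:                     # assumption: identical counts give score 0
        return 0.0
    return abs(theta - mu) / sigma     # step 7 (assumed form of the score)


if __name__ == "__main__":
    normal = [1, 2, 1, 2, 1]
    cancer = [50, 52, 49, 51, 50, 48, 53, 50, 51, 49]
    print(permutation_score(normal, cancer))
```

Because the score is an absolute value, it indicates how strongly the two groups differ but not which group has the higher expression.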
Category                  T-test   Permutation Test
Brain Cancer Related      7        5
Related To Other Cancer   8        7

Table 3.2: Statistical Comparison of the Top 30 Genes for Brain Data

Among the top 30 genes of the brain libraries identified by the permutation test, 5 are known to be related to brain cancer, and another 7 are known to be related to other cancers. In contrast, among the top 30 identified by the t-test, 7 are known to be related to brain cancer, and another 8 are known to be related to other cancers. These results show that the permutation test performed comparably with the t-test. The top ranked gene in the permutation test is an EST (expressed sequence tag), which is a transcript that has been identified but not yet studied. The next highest ranked gene is protein kinase C and casein kinase substrate in neurons 1. This gene is known to be highly expressed in the normal brain [43], which agrees with our results. To summarize our verification data, we have created a similar summary table of all the verified permutation score genes for our brain data experiment. For further information on the top ranked genes, we have provided complete tables of the genes found using the permutation test in Appendix C.

3.4.2  Discussion

An interesting trend in the literature verification data is that a gene seems more likely to be verified at higher ranks. For example, for the breast data, 7 genes are verified among the top 10 ranked genes, 6 genes are verified in ranks 11 to 20, and 5 genes are verified in ranks 21 to 30. A similar trend was also observed in the brain data (ranks 1-10: 6 verified, ranks 11-20: 5 verified, ranks 21-30: 2 verified). Compared to the breast SAGE data results, the brain SAGE data results do not look as impressive.
However, it should be noted that there are many different types of brain tissues, and thus different types of brain tumors, including gliomas, meningioma, pituitary tumours, haemangioblastoma, acoustic neuroma, pineal gland tumors, spinal cord tumors, and lymphoma. At the time of our research, we combined all the brain cancer libraries and overlooked the issue of differing brain tumor types, because there were not enough libraries to conduct the analysis on these tissues separately. These different types of tumors are unlikely to have similar gene expression profiles, which may have contributed to the weaker results compared to the breast data.

There were also two other factors that may have affected the quality of both experiments. First, the usage of cell line SAGE libraries may have contributed to some inconsistencies in the data. Cell lines are tissue cells grown artificially in a petri dish and may not be representative of a true genome wide profile of a tissue taken directly from a diseased tissue (known as bulk tissue). Second, many of the top ranked genes were hypothetical proteins or unknown genes (i.e., 7 genes were hypothetical proteins for the breast data), and so literature for these genes is non-existent.

In addition to evaluating the breast and brain data separately, we also compared breast cancer and brain cancer at the sub-cellular level by intersecting the top 500 ranked genes to determine their similarity. A high intersection would suggest a high similarity, while a low intersection would suggest a low similarity. For the top 500 permutation test results, 26/500 of the genes intersected. These results suggest a low similarity between the two cancers even at the sub-cellular level.
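Both the comparison of the two top-500 lists above and the hit ratio used earlier to choose the number of permutations reduce to intersecting sets of top-ranked genes. The following Python sketch illustrates this; the function names `top_k` and `hit_ratio`, the gene identifiers, and the scores are hypothetical and chosen only for illustration.

```python
def top_k(scores, k):
    """Return the set of the k gene identifiers with the highest scores.

    scores: dict mapping gene identifier -> permutation score.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def hit_ratio(reference, other):
    """Size of the intersection divided by the size of the reference set."""
    if not reference:
        return 0.0
    return len(reference & other) / len(reference)


if __name__ == "__main__":
    # Made-up scores for four made-up genes in two experiments.
    breast = {"g1": 5.0, "g2": 4.0, "g3": 3.0, "g4": 1.0}
    brain = {"g1": 0.5, "g2": 6.0, "g3": 0.1, "g4": 7.0}
    shared = top_k(breast, 2) & top_k(brain, 2)
    print(sorted(shared), hit_ratio(top_k(breast, 2), top_k(brain, 2)))
```

With real data, `k` would be 500 for the breast versus brain comparison, and the reference set for the hit ratio would be the gene set obtained with the larger number of permutations.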
In spite of the few sources of error explained above, our overall results (especially for the breast data) demonstrate that the application of the permutation test to SAGE libraries has significant merit and should also be used on SAGE libraries from other tissues, keeping the above issues in mind. This performance could be attributed to the fact that the permutation framework is robust to outliers. The t-test, on the other hand, would pick out spikes (outliers) in the data. For instance, suppose there were 5 libraries in each sample. For one gene, the counts could look like the following: (1, 1, 1, 1, 5) and (1, 1, 1, 1, 1). The permutation test would assign this gene a relatively low score and thus a low rank, whereas a t-test would give this example a relatively higher score. Genes with no confirmed literature verification should also be further investigated via biological experimentation (i.e., RT-PCR).

3.5  Biomarkers in Early Stages of Lung Cancer

3.5.1  Background and Related Work

Lung cancer is the leading cause of cancer deaths worldwide, with 160,000 annual deaths in the United States alone [23]. One major reason for this is that lung cancer is very difficult to detect. In recent years, many advancements in screening technologies such as chest radiographs, computed tomography (CT) scanning, and sputum cytology have given doctors tools for the detection of cancers. However, these advancements have proven to be feeble when it comes to detecting lung cancer early enough. In fact, studies that screen high risk individuals have not been effective in decreasing mortality rates, and the 5-year patient survival rate at clinical stages II-IV is poor, ranging from 40% down to 5% [52], [51]. Today, there is a wealth of information on the histological and molecular characteristics of the premalignant changes in bronchial mucosa [52]. The earliest changes that can be observed are known as reserve cell hyperplasia and squamous metaplasia.
These changes spontaneously reverse upon the cessation of smoking and thus are not considered to be true premalignant lesions. The true early premalignant changes are believed to be low and high grade dysplasia and carcinoma in situ (CIS). Currently, it is unclear whether low grade dysplasia will lead to the advanced stages of lung cancer. However, both high grade dysplasia and CIS have been shown in various studies to lead to the invasive stages of lung cancer [53], [54], [55]. At either of these two stages, it is unlikely that these lesions will regress with the cessation of smoking.

The recent advent of global gene expression technologies has given researchers the tools to quickly generate candidate biomarkers. As far as we know, there have been no studies using such technology for evaluating stage differences at the genetic level. One reason for this is that it is very difficult to obtain a clean sample of early lesions, and these lesions are rare. In the following section we demonstrate an example of how such technologies can be used for this purpose.

3.5.2  Materials

To perform our analysis, we used lung SAGE libraries provided by the BC Cancer Agency, which include the following SAGE libraries:

1. 15 smoke-damaged but otherwise non-cancerous libraries (hereafter we refer to these libraries as normal). These libraries were obtained via bronchial brushings. Brushing is a medical procedure often performed during a bronchoscopy, where cells from the tissues lining the respiratory tract are obtained by a small brush-like device.
2. A metaplasia library that was obtained via a bronchial biopsy. This library contains the genetic profile of bronchial epithelium that has undergone changes due to activities such as smoking but is not cancerous. Specifically, metaplasia is the transformation of the normal ciliated columnar epithelium into a squamous epithelium.
3. 5 carcinoma in situ (hereafter referred to as CIS) libraries obtained from a lung biopsy.
CIS is the stage following severe dysplasia, a pre-invasive lung carcinoma that has no metastatic potential.
4. 6 invasive libraries that were obtained via frozen resected samples. The invasive stage describes when a CIS lesion breaks through the basement membrane and has metastatic potential.

3.5.3  Candidate CIS Biomarkers

The primary objective of our analysis is to isolate genes specific to the CIS stage. In order to analyze their expression patterns, we plotted a series of boxplots using the statistical and graphical software package R. Figure 3.2 shows the basic parts of the boxplot. The box itself corresponds to the interquartile range (IQR), meaning that it contains 50% of the data. The upper edge (hinge) of the box indicates the 75th percentile of the data set (the median of the upper half of the data), while the lower hinge indicates the 25th percentile (the median of the lower half of the data). The outer whiskers represent the maximum and minimum of the data points unless an outlier exists. If there is an outlier, the upper whisker extends no further than the 75th percentile plus 1.5 times the interquartile range (the length of the box), and the lower whisker extends no further than the 25th percentile minus 1.5 times the interquartile range. An outlier is depicted by an o.

Figure 3.2: Diagram of a typical box and whisker plot (outliers, whiskers within 1.5 × IQR of the hinges, 75th, 50th and 25th percentiles, IQR)

To find these CIS specific genes we apply the permutation framework to the following:

1. CIS and Normal SAGE libraries
2. CIS and Invasive SAGE libraries

From the permutation test, we obtain a permutation score for each gene/tag that measures the statistical significance of the difference between the two groups. Recall that the higher the score, the more confident we are that the two distributions are different. However, it should be noted that this score does not distinguish whether group A is greater than group B or vice versa.
After obtaining the scores for each gene/tag, we sorted them in descending order. We then took the top 2000 most differentially regulated genes of the 2 sets and performed an intersection. From this intersection, we obtained a total of 79 tags. Figure 3.3 illustrates the steps of our analysis in a graphical manner.

Figure 3.3: Graphical Representation of Steps of Analysis for CIS Specific Genes (the intersection yields 79 SAGE tags)

To search for genes that are up-regulated in CIS but down-regulated in smoke-damaged normal tissues and invasive tissues, we first selected genes that have a permutation score greater than 2.35 for both sets (corresponding to a p-value of around 0.01). All these genes then had to meet the following requirements to be deemed a CIS biomarker candidate:

-  1  9  Normal, ^  l  u  ~>  Metaplasia ^  z  9 n au n d  d

For the CIS biomarkers set, after applying the above filters we obtained 18 tags. These tags were then mapped to their corresponding UniGene ids and gene names. According to the tag/gene database found at CGAP, our results mapped to 14 UniGene ids and ^ of these mapped to genes with names. These genes included those involved in cellular adhesion, EGFR trafficking, and tyrosine kinase or phosphatase receptors. Figure 3.4 illustrates the gene expression patterns specific for high expression in the CIS stage.

To search for genes that are down-regulated in CIS but up-regulated in both smoke-damaged normal tissues and invasive tissues, we had to relax our requirements to obtain a reasonable number of candidates. The only requirement was that the average of CIS must be less than the average of both normal and invasive. This generated 13 candidates. Figure 3.5 depicts the relative expression specific for low expression in the CIS stage. All of these 13 candidates mapped to UniGene ids. However, 4 of the 13 are hypothetical and unknown genes. The remaining genes mapped to known genes.
These genes included protein kinase regulators and inhibitors, and genes belonging to the annexin family. Literature shows that these genes may have potential roles in cellular signal transduction, inflammation, growth and differentiation.

3.5.4  Desmosomes as Candidate CIS Biomarkers

Our previous results analyzed the gene expression patterns specific for up-regulation of genes in the CIS stage. Because two of the genes, DSC2 and DSG3, are components of a larger adhesion structure known as the desmosome, we decided to analyze the gene expression patterns of the other components (which may have been missed because of our restrictive filters) to help strengthen our observations.

Figure 3.4: Boxplots showing gene expression patterns of genes with high CIS expression (one panel per gene, by stage N, M, C, I: PPP2R2B, HCG9, SORCS2, KRT6A, SNX13, NTRK2, LOC339483, LASS3, PTPRD, DSC2, DSG3)

Figure 3.5: CIS genes that are down-regulated (one panel per gene, by stage N, M, C, I: CNOT4, LOC340061, TUBA3, MGC27165, DKFZP5640123, ORF1-FL49, PKIG, BASP1, ANXA5, Ells1, PRKAR1A, EPAS1, APLP2)

Background

Desmosomes are specialized structures responsible for cellular adhesion. These highly organized intercellular junctions provide mechanical integrity to tissues by anchoring intermediate filaments to sites of strong adhesion [20]. They are made up of at least 3 major families of proteins: the desmosomal cadherins, the armadillo proteins and the plakins [21] [22]. The cadherins (as seen in figure 3.6) are single-pass transmembrane glycoproteins that are responsible for the actual physical adhesion between two cells and also facilitate calcium-dependent cell to cell adhesion.
These cadherins can be subdivided into two subfamilies, the desmogleins and the desmocollins. The desmogleins consist of 4 different types termed DSG1, DSG2, DSG3, and DSG4. Similarly, the desmocollins consist of 3 types termed DSC1, DSC2 and DSC3. Armadillo proteins include plakoglobin, the plakophilins (PKP1, PKP2, PKP3), and p0071 [20]. Lastly, the plakins include plectin, desmoplakins I and II, and the cell envelope proteins envoplakin and periplakin [20].

As seen from figure 3.6, these groups of molecules interact in an organized fashion in order to facilitate both cell adhesion and cell rigidity/structure. The single-pass transmembrane proteins DSC and DSG are responsible for the physical adhesion between two cells. Their extracellular domains span into the dense midline (DM), where they interact with cadherins from other cells. As seen from figure 3.6, the armadillo proteins associate closely with the tails of the desmosomal cadherins. At the same time, they also interact with plakins (the desmoplakins seen in the diagram). Plakoglobins and PKPs interact with the desmoplakin N-terminus to aid the clustering of the cadherins and work together to form the desmosomal plaque. Finally, the desmoplakin C-terminus anchors to the intermediate filaments (IF) in the inner density plaque (IDP). Proper anchoring to the IF is crucial for strengthening adhesion and tissue integrity [20].

Figure 3.6: A Model of the Basic Structure of Desmosomes

Desmosomes and Cancer

As discussed above, desmosomes play an important role in tissue integrity and in cellular adhesion. In many types of cancers, desmosome gene expression levels are disrupted. For example, in head, neck, and oral cancer, desmosome expression is down-regulated, suggesting a mechanism for a cancer to become invasive and metastasize. Such results suggest that desmosomes may have a tumor suppression function [19].
However, this is not always the case: in colon cancer, there is no observed down-regulation prior to invasion. This may mean that this type of cancer uses a different mechanism to invade (i.e., temporary loss of adhesion in the desmosomes by other means at the protein level) [19] [16]. These results have compelled us to take a closer look at the expression patterns of all desmosome components.

Expression Patterns of Desmosome Components

To analyze the expression patterns we again used boxplots, as described in Section 3.5.3.

Desmosomal Cadherins

Recall that desmoglein 3 is one of the genes that survived our strict filters. As shown in figure 3.7, the plot clearly shows a distinct pattern where the CIS libraries have, in general, a much higher expression than the normal libraries and still higher expression than the invasive libraries.

Figure 3.7: Expression pattern of DSG3 in normal (N), metaplasia (M), CIS (C), and invasive (I) stages

Similarly, as seen in figure 3.8, the other class of cadherins, the desmocollins, also shows a similar gene expression pattern, although the patterns are not as clear cut, especially for DSC3.

Figure 3.8: Expression pattern of DSC2 and DSC3 in normal (N), metaplasia (M), CIS (C) and invasive (I) stages

Armadillo Proteins

The armadillo proteins are responsible for regulating desmosomal assembly and adhesion. Studies have shown that knocking out these genes led to the loss of adhesion and tissue fragility [17]. From figure 3.9, we see that DSC2 is distinctly overexpressed in the CIS stage relative to both normal and invasive stages. Note that DSC2 is one of the two desmosomal cadherins extracted from our strict filters.

Figure 3.9: Expression patterns of armadillo proteins in normal (N), metaplasia (M), CIS (C), and invasive (I) stages (panels include JUP (Plakoglobin))
It is also interesting to see that DSC3 shows relatively higher expression in CIS than in both invasive and normal, even though the difference is not as clear cut as it is for DSC2.

Plakins

The plakins are also an important part of the desmosome structure, as they are critical components for proper anchorage of the intermediate filaments (IF) to desmosomes, which in turn is important for strengthening adhesion and tissue integrity [20] [18]. Recall from figure 3.6 that the tails of the desmoplakins are located in the inner density plaque and directly interact with both the armadillo proteins and the intermediate filaments. As seen in figure 3.10, we observe the same gene expression pattern as for the other desmosome components we analyzed.

Figure 3.10: Gene expression patterns of a desmoplakin (plakin family) in normal (N), metaplasia (M), CIS (C), and invasive (I) stages

3.5.5  Invasive Candidate Biomarkers

The next part of our analysis was to analyze the gene expression patterns specific to the invasive stage. This study is similar to finding genes specific to the CIS stage. As in the CIS study, we performed an intersection of the top 2000 most differentially expressed genes based on the permutation score, here for the groups CIS vs Invasive and Normal vs Invasive. Out of this intersection, we obtained a total of 78 tags. To these 78 tags, we applied the following filters:

1. invasive / normal > 10
2. invasive / CIS > 5
3. metaplasia > 1

After applying these filters, we retrieved only 6 tags. All tags have UniGene ids and all mapped to genes with names. These genes included SPARC, MMP11, COL4A1, MCAM, COL1A2 and C1R. Four out of the six are components of the adhesive extracellular matrix. As seen in figure 3.11, all exhibit the distinct pattern we filtered for.
That is, in normal tissues there is low or no expression, CIS has higher expression, and the invasive stage has even higher expression.

Figure 3.11: Gene Expression Patterns of Candidate Invasive Biomarkers (one panel per gene, scaled expression by stage N, M, C, I: SPARC, MMP11, COL4A1, MCAM, COL1A2, C1R)

It should be noted that some of the genes found are consistent with the literature. For example, SPARC (secreted protein, acidic, cysteine-rich) is a matrix-associated protein that can change cell shape [27]. Such a function is one that would be expected to be important in the invasive stages. Another interesting gene is melanoma cell adhesion molecule (MCAM), which is a cell surface glycoprotein. According to northern blot and immunohistochemical experiments performed by Sers et al. [8], MCAM antigens are found primarily in advanced primary and metastatic melanomas. In addition to being associated with tumour progression, MCAM expression has also been found to be associated with shorter patient survival in adenocarcinoma patients. It is also expressed in a high proportion of NSCLC (non-small cell lung cancer) cells [8].

As in the CIS gene expression study, we also examined the mirror image, where genes are turned off or down-regulated in the invasive stages. To do this, we took the same list and applied the following filters:

^  2  normal invasive CIS  in  ^q  invasive

3. Fold change between invasive and metaplasia is ignored.
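The fold-change filters used in this section can be sketched in Python as a list of (numerator stage, denominator stage, minimum ratio) rules applied to the mean count of a tag in each stage. This is only an illustration: the function names, the stage keys, and the pseudocount used to avoid division by zero are assumptions, and the example thresholds (10 and 5) follow the up-regulation filters only as far as they can be read from the text.

```python
def fold_change(numerator, denominator, pseudocount=1.0):
    """Ratio of two mean counts; the pseudocount is an assumption to avoid dividing by zero."""
    return (numerator + pseudocount) / (denominator + pseudocount)


def passes_filters(stage_means, rules, pseudocount=1.0):
    """Return True if every fold-change rule is satisfied.

    stage_means: dict mapping stage name -> mean count of the tag in that stage.
    rules: list of (numerator_stage, denominator_stage, min_ratio) tuples.
    """
    return all(
        fold_change(stage_means[num], stage_means[den], pseudocount) > min_ratio
        for num, den, min_ratio in rules
    )


if __name__ == "__main__":
    # Hypothetical rules for a tag that is high in the invasive stage.
    invasive_up = [("invasive", "normal", 10), ("invasive", "cis", 5)]
    tag = {"normal": 1.0, "metaplasia": 2.0, "cis": 10.0, "invasive": 120.0}
    print(passes_filters(tag, invasive_up))
```

Stages that a filter ignores, such as metaplasia in the down-regulation case, are simply left out of the rule list.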
These filters generated 6 tags, 5 of which have gene names, while one tag is currently unmapped. Only one of these genes proved to be interesting. PTPRN2 (protein tyrosine phosphatase, receptor type, N polypeptide 2) belongs to a family of signaling molecules that regulate a variety of cellular processes including cell growth, differentiation, mitotic cycle, and oncogenic transformation. Again, to illustrate the gene patterns we plotted boxplots at each stage, as seen in figure 3.12.

Figure 3.12: Gene Expression Patterns of Candidate Invasive Biomarkers (one panel per tag, by stage: C10orf32, PTPRN2, CACTTTTAAA, DKFZP564I0422, LOC388389, Ells1)

3.5.6  Other Observed Gene Expression Patterns

To further our analysis, we also analyzed whether there are genes that are not present in the CIS and invasive stages. To find these genes, we performed the permutation test on the normal lung SAGE libraries versus the CIS and invasive libraries pooled together. We then sorted the list in descending order based on the permutation scores and took the top 200 (all those with permutation scores > 6). Finally, we selected genes with a fold change between normal and the CIS/invasive pool greater than 40. From this filter, we obtained 15 tags, 10 of which are genes with symbols. Figure 3.13 illustrates the selected patterns.

Figure 3.13: Boxplots showing genes that are down-regulated in lung cancer regardless of stage (one panel per gene, by stage N, M, C, I: CLSPN, CXorf44, KCNE1, FLJ32855, KIAA1533, LOC283152, RAB33B, SPAG6, TEKT2, TTC18)

In addition to finding genes present in our normal lung samples but not in lung cancer samples, we also looked at the reverse. Here we selected genes with a fold change between the cancer average and the normal average greater than 10. Performing that filter generated 6 candidates.
Again, we plotted a set of boxplots to illustrate this pattern, as seen in figure 3.14. At first glance, the boxplots do not clearly show that these genes are CIS/Invasive specific. This is because, in order to retrieve these genes, we compared the ratios of the means. Means are sensitive to outliers, and most of the plots in figure 3.14 contain outliers which would skew the selected mean ratios. Thus, care must be taken when selecting these genes.

Figure 3.14: Boxplots showing gene expression patterns for CIS and Invasive specific stages (one panel per gene, by stage N, M, C, I: WDR1, CENTA2, COL5A3, KRT14, DT1P1A10, SLC16A4)

3.5.7  Summary

In summary, we have analyzed all possible expression patterns between normal, CIS and invasive stages. Specifically, we have looked at genes that are CIS specific, Invasive specific, and those that are specific to both stages (CIS and Invasive). For CIS, we observed clear cut CIS specific patterns for both down- and up-regulation. Similarly, we observed clear cut Invasive specific patterns for both down- and up-regulation. However, the patterns specific to both CIS and Invasive were not clear cut, suggesting that at the molecular level these stages are quite different.

Chapter 4

Constant Gene Expression Finder

Reference genes are commonly used in many gene expression experiments to attempt to normalize mRNA levels between different samples. For example, QRT-PCR and certain normalization methods in DNA microarrays both require the use of such reference genes. In the case of DNA microarrays, the normalization step attempts to isolate variations that exist due to biological differences and not due to experimental conditions. Such non-biological errors include mismatch hybridization, unequal starting material, scanner problems and spot noise. Reference genes are often used to help calibrate the intensity data to account for these errors, as the expression level of these reference genes should be the same across multiple samples.
However, if these reference genes are incorrect choices, then this type of normalization could severely skew the normalized data and may even inadvertently remove important biological differences between samples, especially for the more sensitive small expression values. Thus, selection of stable and abundant reference genes is a critical step, as failure to do so could severely jeopardize subsequent analyses.

In this chapter, we argue that reference genes must be chosen in a tissue specific manner. Furthermore, we propose a novel methodology for the identification and evaluation of good reference genes. This methodology uses the advantages of the SAGE method and the permutation framework to find a stable and highly expressed set of genes (hereafter referred to as the constancy requirement). In the following section, we will first describe some related work. We will then describe exactly why we use the SAGE method, followed by the details of our novel methodology. Following this section, we describe how we evaluated these sets of genes and how our results suggest the importance of tissue specific reference genes.

4.1  Related Work

Reference genes are typically chosen because they are biologically required and thus are usually referred to as housekeeping genes. These housekeeping genes are assumed to be stably and highly expressed (the constancy requirement). The problem is that in the past few years, it has become increasingly apparent that many standard housekeeping genes may not be appropriate for many gene expression experiments [35], [36], [37], [40]. Recent studies by Barber et al. have shown that one of the most common housekeeping genes, glyceraldehyde-3-phosphate dehydrogenase (GAPDH), shows variable expression. Their experiment consisted of 72 different pathologically normal human tissue types.
They performed 371,088 multiplexed Q-RT-PCR experiments and found significant differences in expression levels of GAPDH mRNA between tissue types and between donors of the same tissue type. Most notably, a 15-fold difference in GAPDH mRNA copy numbers was observed between the highest expressing tissue type (skeletal muscle) and the lowest expressing tissue type (breast) [39]. As will be seen later, their conclusion is consistent with our results. Zhang et al. recently reported that 3 of the 10 standard housekeeping genes were inappropriate for normalizing mRNA in neutrophils because of very low expression levels [38], and recommended using only the 5 that showed relatively stable expression. Like Barber et al.'s study, Zhang et al.'s study suggests that many standard housekeeping genes are inappropriate for mRNA normalization. The main problem with both these studies, though, is that they only analyze the expression levels of the standard housekeeping genes, which make up less than one percent of the human genome. Thus, it is quite likely that better reference genes exist.

Gabrielsson et al. (2005) attempted to solve the reference gene problem by using 52 microarray expression profiles of human adipose tissue, from which they selected the 50 genes with the lowest coefficient of variation [26]. The problem with such an approach is that microarrays require normalization themselves. In fact, this decade-old technology is even more sensitive to normalization than RT-PCR because of the large number of sources of systematic variation (as described in Chapter 3).

4.2  Novel Methodology

In spite of the numerous normalization techniques introduced for DNA microarrays in the past decade, to date there is still no standard procedure. On the other hand, the SAGE (Serial Analysis of Gene Expression) technique is relatively standardized. Being a sequence-based method, SAGE technology does not require such normalization.
The only preprocessing technique that is required is somewhat similar to global normalization, where all samples are scaled up to some given constant. This is done to correct for the unequal number of tags produced by each library, as it is difficult to obtain libraries with exactly the same number of tags. In the case of DNA microarrays, the data depends on a natural hybridization reaction inherent in DNA. Furthermore, the obtained data is intensity-based and requires that we assume that the relative difference in intensity between sample A and B is approximately equal to the relative difference in transcription between sample A and B. Arguably, this makes microarrays more susceptible to non-biological errors than SAGE. Thus, here we introduce a novel methodology to identify and evaluate reference genes based on SAGE data.

While SAGE makes up a large part of our methodology, we also require one more component. In order to identify and evaluate potential reference genes, we use the permutation framework introduced in Chapter 3. Recall that the permutation framework was used to identify differentially expressed genes. As it turns out, this framework can also be used to find statistically similar gene expression profiles simply by reversing the list, since low permutation scores correspond to similar samples. Thus, to select reference genes that satisfy the constancy requirement, we chose thresholds that would generate a reasonable number of genes (about 20) that are constant and highly expressed. All our selected genes must meet the following criteria:

1. The permutation score must be less than 0.15.
2. The average raw gene count must be at least 25 and the raw count for each library must be at least 3.
3. The standard deviation $\sigma$ must be reasonably small (i.e., less than 10 for the lung data).
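As a hedged illustration of how the three criteria above could be applied to a single tag, the following Python sketch checks them directly. The function name `is_reference_candidate` and its parameter names are hypothetical; the default thresholds are the ones quoted in the list (score below 0.15, mean raw count of at least 25, at least 3 counts in every library, standard deviation below 10 for the lung data), and the permutation score and $\sigma$ are assumed to have been computed beforehand with the permutation framework.

```python
def is_reference_candidate(score, sigma, raw_counts,
                           max_score=0.15, min_mean=25, min_count=3, max_sigma=10):
    """Check the constancy criteria for one tag (illustrative helper).

    score: permutation score of the tag.
    sigma: standard deviation of the simulated differences for the tag.
    raw_counts: raw counts of the tag, one per library.
    """
    mean_count = sum(raw_counts) / len(raw_counts)
    return (
        score < max_score                   # criterion 1: statistically similar groups
        and mean_count >= min_mean          # criterion 2: abundant on average
        and min(raw_counts) >= min_count    # criterion 2: present in every library
        and sigma < max_sigma               # criterion 3: small spread
    )


if __name__ == "__main__":
    # Made-up values for a stable, abundant tag and for a variable one.
    print(is_reference_candidate(0.10, 5.0, [30, 28, 40, 25]))
    print(is_reference_candidate(3.00, 5.0, [300, 280, 20, 15]))
```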
The second condition is set to guarantee that the selected reference gene is abundant and present to some degree in all libraries, while the last condition ensures that the simulated cancer distribution and simulated normal distribution are sufficiently close.

The above conditions were all applied to a set of lung SAGE libraries provided by the BC Cancer Agency. Their cohort consists of 18 non-cancerous smoke-damaged bronchial epithelial libraries, 5 pre-invasive lung carcinoma libraries (a stage known as carcinoma in situ [CIS]), and 6 invasive lung carcinoma libraries. The non-cancerous libraries were all extracted via lung brushings. Pre-invasive libraries were extracted surgically via lung biopsies, and the invasive libraries were obtained from frozen resected samples.

After applying the above methodology to the lung SAGE data, we obtained a list of 16 candidate reference genes. To visualize how this set of reference genes performed, we plot the genes on a scatter plot, as seen in Figure 4.1. The x-axis gives the permutation score, indicating variability, whereas the y-axis shows the average raw SAGE count. The lung-SAGE reference genes all satisfy the constancy requirement. That is, they appear to be highly expressed and show almost constant expression across all 29 cancerous and non-cancerous libraries. Conversely, none of the common housekeeping genes we examined meet our reference gene requirement. In fact, they are far from meeting our constancy requirement. Standard housekeeping genes such as GAPDH and TFRC in particular have a permutation score greater than 3. A permutation score of 3 is equivalent to saying that there is only a 1% probability that the observed difference between the average expression in cancerous libraries and the average expression in non-cancerous libraries arose by chance (i.e., a p-value of 0.01). In the next section, we describe two ways in which these candidate reference genes (hereafter referred to as lung-SAGE reference genes) were validated.
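For reference, the library-size scaling mentioned at the start of this section can be sketched as follows. This is an illustration under stated assumptions, not the exact preprocessing used here: the target total of 100,000 tags and the function name scale_library are hypothetical choices.

```python
def scale_library(tag_counts, target_total=100_000):
    """Scale one SAGE library so that its tag counts sum to target_total.

    tag_counts   -- dict: tag -> raw count in this library
    target_total -- common library size (hypothetical default)
    """
    total = sum(tag_counts.values())
    if total == 0:
        raise ValueError("library contains no tags")
    factor = target_total / total
    # Every tag is multiplied by the same factor, so relative
    # abundances within the library are unchanged.
    return {tag: count * factor for tag, count in tag_counts.items()}
```

Note that the selection criteria in this section are stated in terms of raw counts, so a scaling of this kind would be applied separately from the filter.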
[Figure 4.1: Raw Count versus Variability on Lung SAGE data. Scatter plot of average raw count (y-axis) against permutation score (x-axis) for the lung-SAGE, housekeeping, breast-SAGE, brain-SAGE and All-SAGE reference genes; labeled points include GAPDH, ACTB, TFRC and STAT1.]

4.3 Validation

To validate these lung-SAGE reference genes, a handful of them were analyzed by the Ontario Cancer Institute using their own cohort of lung tissues. They performed QRT-PCR experiments on 7 of our lung-SAGE reference genes and 4 commonly used housekeeping genes. The QRT-PCR values were used to evaluate whether a gene satisfies the constancy requirement: if the values for tumor and normal tissues are close, then we say that the gene satisfies the constancy requirement. To analyze these values, we combined the two samples and computed a single standard deviation of expression, from which we derived the coefficient of variation (the standard deviation divided by the mean). A coefficient of variation close to zero suggests that the expression level does not change much regardless of a cell's neoplastic state. A summary of the results can be seen in Figure 4.2. This figure clearly shows that, in general, the standard housekeeping genes (top 4 bars) exhibit more variation than the lung-SAGE reference genes.

In addition to the above biological validation, we also evaluated the effectiveness of our lung-SAGE reference genes by analyzing a dataset generated by Bhattacharjee et al. [33] in 2001. Essentially, we compared the same dataset normalized in two distinct ways. The first is the data as normalized by standard housekeeping genes, while the second is the data normalized by the lung-SAGE reference genes. In both cases, we applied the same permutation framework as in our previous experiment on brain and breast data to identify the top 40 most differentially expressed genes. Below we outline the exact steps we took to renormalize Bhattacharjee et al.'s data:

1. Re-normalized Bhattacharjee et al.'s published microarray data by fitting a line (LMS) through the SAGE-selected housekeeping genes (array vs. average of all arrays).
2. Performed a permutation test on the renormalized 17 normal lung arrays vs. 23 lung cancer arrays.
3. Performed a permutation test on the originally normalized lung arrays, using the same arrays as above.
4. Sorted each list in descending order of permutation score.
5. Intersected the top 40 genes from the SAGE-normalized and originally normalized Harvard data.
6. Looked up the SAGE-normalized-specific genes, the originally-normalized-specific genes and the intersected genes, and categorized them as: up/down-regulated in lung cancer, up/down-regulated in other cancers, or not previously associated with cancer.

[Figure 4.2: Table showing relative comparison of validation results. SAGE-Ref Genes vs. Standard-Ref Genes; bars give the CV for GAPDH, ACTB, TFRC, STAT-1, MRCL3, PRDX-1, PKM2, RPS13, KCNN3, RPL15 and RPL11.]

Of the top 40 genes, 22 were in both lists. As for the remaining genes, we categorized them into one of two cancer-related categories. The first category, as seen in Table 4.1, consists of genes that are up/down-regulated in lung cancer, while the second category consists of genes that are up/down-regulated in other cancer types. Our results, as seen in Table 4.1, suggest that the data normalized by lung-SAGE reference genes give results that are more consistent with the literature.

4.4 Importance of Tissue Specificity

Our second major objective is to analyze whether the choice of reference genes ought to be tissue specific.
That is, a set of reference genes that satisfies the constancy requirement for one tissue type may not be the same as the reference genes for another tissue type. To show this, we identify the sets of reference genes that satisfy the reference gene requirement for two additional tissue types, breast and brain. We use the same libraries as those used in our differential expression experiment in Chapter 3, with the exception that we only include bulk tissues (resected from patients). We took this step to stay consistent with the lung libraries produced by the BC Cancer Agency, which were all of the bulk type.

Table 4.1: Comparison of the Top 40 Most Differentially Expressed Genes of Microarray Data Normalized by Two Approaches

Criteria                  Intersection   By lung-SAGE   By Housekeeping Genes
Lung Cancer Related       5              5              1
Related to Other Cancer   7              4              1
Total                     12             9              2

Below we outline the exact steps of our analysis:

1. Performed a permutation test on breast SAGE libraries from NCBI, which include 12 cancer libraries and 3 normal libraries.
2. Performed a permutation test on brain SAGE libraries from NCBI, which include 6 normal libraries and 20 cancer libraries.
3. Performed a permutation test on all of the above libraries, including our lung SAGE libraries.
4. Sorted each list in descending order to get lists of the most statistically differentially expressed genes.

After obtaining the breast- and brain-SAGE-based reference genes, we compared them with the lung-SAGE genes. Figure 4.1 shows how the breast and brain SAGE reference genes compare with the lung-SAGE genes on lung data. Notice that for both the breast and brain SAGE reference genes, most of the genes have low average counts and relatively higher permutation scores than the lung-SAGE genes. This figure supports our hypothesis that tissue specificity is an important factor when selecting reference genes. These findings compelled us to analyze the issue further, so we looked at both the brain and breast SAGE data.
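The comparison of reference gene sets on a given tissue's data can be summarized per set before plotting. The sketch below is illustrative only: summarize_scores is a hypothetical helper, and the example scores in the usage comment are made-up placeholder values, not results from this study.

```python
from statistics import median


def summarize_scores(scores_by_set):
    """Summarize permutation scores for each reference gene set.

    scores_by_set -- dict: set label (e.g. "HK", "Lg") -> list of
                     permutation scores of its genes on one data set
    """
    summary = {}
    for label, scores in scores_by_set.items():
        summary[label] = {
            "min": min(scores),
            "median": median(scores),
            "max": max(scores),
        }
    return summary


# Usage with placeholder values (hypothetical, for illustration only):
# summarize_scores({"HK": [3.0, 1.0, 2.0], "Lg": [0.10, 0.05]})
```

A set whose median and maximum scores are both low on a tissue's data would be the better candidate for that tissue under the constancy requirement.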
Figure 4.3 shows a summary of our results in 3 boxplots. On the y-axis we have the permutation score and on the x-axis we have the type of reference gene. In the first boxplot, we show the permutation scores of the various reference genes on lung data. Lung-SAGE reference genes show almost constant expression, while standard housekeeping genes vary greatly. Interestingly, the other reference genes do not perform as badly as the housekeeping genes. The other 2 plots show similar behavior, but the variability of the housekeeping genes is comparable to that of the other reference genes. Our results collectively suggest the following:

1. Standard housekeeping genes may not be appropriate for normalization of microarrays, and perhaps not even for other biological experiments requiring normalization (e.g., QRT-PCR).
2. There are far better reference genes than those in use today, and tissue-specific reference genes may be more appropriate.

[Figure 4.3: Boxplots that depict the importance of tissue specificity. Panels: (A) Lung Data, (B) Breast Data, (C) Brain Data. x-axis: type of reference genes (HK, Lg, Bt, Br, All); y-axis: permutation score.]

Chapter 5

Conclusion and Future Directions

5.1 Conclusion

This thesis deals with two key problems in genome-wide expression analysis. The first is the small sample size problem. The second is the normalization issue in biological experiments, specifically the issues surrounding poorly chosen reference genes. We have proposed using SAGE and the permutation framework to alleviate these issues. Our results in 3 independent experiments suggest that there is merit in our proposed solution.

Our first experiment compared the commonly used t-test of unequal variance and the permutation test.
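As a point of reference, the two kinds of test can be sketched side by side. This is a generic illustration, not the scoring framework of Chapter 3: it computes the unequal-variance (Welch) t statistic and an exact two-sided permutation p-value on the difference of group means. The function names are hypothetical, and exhaustive enumeration is only practical for small groups.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Unequal-variance t statistic (undefined if both variances are 0)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def permutation_p_value(a, b):
    """Exact two-sided permutation p-value for a difference in means.

    Enumerates every way of relabeling the pooled values into groups
    of the original sizes and counts the relabelings whose absolute
    mean difference is at least as large as the observed one.
    """
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            extreme += 1
        n_splits += 1
    return extreme / n_splits
```

With very small groups the smallest attainable p-value is limited by the number of relabelings, which is one reason a permutation-based ranking can behave differently from a t-test ranking.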
To evaluate their performance, each test was applied to two sets of SAGE data (brain and breast) to identify the top 30 most differentially expressed genes between normal and cancer groups. Overall, our results show that the permutation test produces results more consistent with the literature. For instance, for the breast SAGE data, 60% of the top 30 genes ranked by the permutation test were verified via the literature, while only around 23% of the highly ranked t-test genes were verified. As for the brain SAGE data, the permutation test performed comparably with the t-test, with 40% and 50% of genes being verified respectively. However, it should be stressed that the brain data were less uniform, comprising various tumor types, which may have affected the overall results.

The second experiment in our study further strengthened the value of the permutation test as an identifier of interesting genes. Here we analyzed lung SAGE data from various stages and performed various intersections to produce genes specific to certain stages. By applying various filters, we were able to isolate genes specific to each stage, including those specific to the normal, CIS and invasive stages. The key findings were two genes, DSG3 and DSC2. As described in Chapter 3, these two genes belong to a family of adhesion molecules known as desmosomal cadherins. The expression pattern we observe suggests that adhesion increases in pre-invasive lung carcinoma, followed by down-regulation in the invasive stages of the disease. Biologically, this makes sense, as invasion is by definition the spread of the cancer. For that to happen, cancer cells must lose some adhesion so that they can travel to other parts of the body. Based on this knowledge, we also looked up the gene expression patterns of other associated structural genes that did not pass the strict filters of our analysis.
Interestingly, several of these associated genes exhibited similar gene expression patterns (though not as strong as those of DSG3 and DSC2). These results are now under investigation as candidate biomarkers of pre-invasive lung squamous carcinoma.

In our third experiment, we used the SAGE and permutation framework to evaluate and identify reference genes. SAGE data were used rather than DNA microarray data because microarray data themselves require normalization, while SAGE avoids the normalization issues that plague DNA chip technology. In addition, recall that SAGE is a genome-wide expression technique and is thus capable of analyzing the entire set of expressed genes in any given sample. This unique combination is therefore well suited to identifying novel reference genes.

5.2 The Future of Gene Expression Analysis

Gene expression analysis technologies are powerful and are showing many signs of advancing medical research. For instance, in 2002, van't Veer et al. demonstrated a strategy that analyzes the genetic profile of individuals to help decide which breast cancer patients would benefit from certain treatments [41]. Such knowledge would be critical to a doctor's decision on what treatment to administer to the patient.

The excitement and pace of these technologies, though, have led to various oversights. Fortunately, in recent years, a few scientists have recognized some of them. For example, as discussed in the previous chapter, researchers have now realized the deficiencies and limitations of using standard housekeeping genes to normalize mRNA levels [35], [36], [37], [40], [38], [39]. However, most of these studies do not search for novel reference genes, and the ones that do [26] themselves require normalization to reference genes. Our novel methodology globally identifies genes that are the most stable and highly expressed (what we called the constancy requirement).
We showed that the genes it selected can be verified via QRT-PCR, and that normalizing DNA microarray data with these newly acquired reference genes produces results more consistent with the literature. Thus, future experiments must consider methods such as ours to normalize mRNA levels.

There is still much to be done beyond our study. In fact, so far we have only scratched the surface of the SAGE data that we analyzed. The data set is very large and there are numerous ways to look at it. Recall from Chapter 3 that we generated many different figures showing many different gene expression patterns. All such patterns still need to be analyzed carefully to pull out any interesting genes that map to known pathways or structures. Following this analysis, we would analyze the gene expression patterns of the genes in each pathway, which can involve several genes. In addition to these studies, we could also look at low expressors. Low expressors may be interesting to study because they may be the switches or regulators that cause the onset of the disease. The good news is that this study could also be done using the permutation framework that we have outlined, because during our analysis we found that the permutation test also pulls out results with low counts.

After extracting genes that belong to existing pathways and examining their gene expression patterns within our own SAGE data, we still need to conduct some form of validation. Recall from the second experiment in Chapter 3, on the lung SAGE data, that our objective was to find candidate biomarkers for early stages of lung cancer. Following the analysis, we selected two of these candidates for further analysis, namely DSG3 and DSC2. In spite of our efforts to show that there is strong evidence from the SAGE data that desmosome components are up-regulated just before the invasive stages of lung cancer, we cannot simply publish what we observe.
Thus, researchers often perform an extra step of analysis, namely biological validation. There are three reasons for this. First, biological validation can show that our observations are also true at the protein level (e.g., via antibody staining). Second, this extra step, conducted on a different cohort, can make our findings more statistically convincing. Third, genome-wide expression technologies are subject to more errors than other small-scale technologies. For instance, in SAGE, one thing that is constantly changing is the tag-to-gene mapping database. Throughout our study, we had to update our results several times because of this issue, which sometimes led to a small but significant change in the results. Thus, validation can help distinguish artifact from true observation.

Unfortunately, biological validation can take several months to a year. Furthermore, since our analysis attempts to extract CIS-specific genes, simply obtaining additional CIS samples for biological validation would take even more time. We were very fortunate to have collaborated with experts in the biological domain. Recall that some of our experimental results, specifically those of the housekeeping project described in Chapter 4, were validated by the Ontario Cancer Institute. Their biological experiment was performed on a different cohort and helped validate and strengthen our findings. Without the relationship between wet lab and dry lab, our research could only go so far. Our results are strong on both the wet-lab and dry-lab sides, strongly suggesting that housekeeping genes do not make the best reference genes and that normalizing to reference genes is appropriate if we select them the way we have.

Bibliography

[1] McPherson JD et al. A physical map of the human genome. Nature, 2001; Feb 15;409(6822):934-41.
[2] Kan Z, Rouchka EC, Gish WR, States DJ. Gene Structure Prediction and Alternative Splicing Analysis Using Genomically Aligned ESTs. Cell, 2001; 11, 889.
[3] Black DL.
Protein diversity from alternative splicing: a challenge for bioinformatics and post-genome biology. Cell, 2000; 103, 367.
[4] Zhu J, Shendure J, Mitra RD, Church GM. Single molecule profiling of alternative pre-mRNA splicing. Science, 2003; 301(5634):836-8.
[5] Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science, 1995; 276, 1268-1272.
[6] Patino WD, Mian OY, Hwang PM. Serial analysis of gene expression: technical considerations and applications to cardiovascular biology. Circ Res., 2002; Oct 4;91(7):565-9.
[7] Schober MS, Min YN, Chen YQ. Serial analysis of gene expression in a single cell. Biotechniques, 2001; 31:1240-1242.
[8] Kristiansen G, Yu Y, Schluns K, Sers C, Dietel M, Petersen I. Expression of the cell adhesion molecule CD146/MCAM in non-small cell lung cancer. Anal Cell Pathol, 2003; 25(2):77-81.
[9] Saha S, Sparks AB, Rago C, Akmaev V, Wang CJ, Vogelstein B, Kinzler KW. Using the transcriptome to annotate the genome. Nat Biotechnol, 2002; 20:508-512.
[10] Datson NA, van der Perk-de Jong J, van den Berg MP, de Kloet ER, Vreugdenhil E. MicroSAGE: a modified procedure for serial analysis of gene expression in limited amounts of tissue. Nucleic Acids Res., 1999; 27:1300-1307.
[11] Patino WD, Mian OY, Hwang PM. Serial analysis of gene expression: technical considerations and applications to cardiovascular biology. Circ Res., 2002; Oct 4;91(7):565-9.
[12] Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 1995; 270:467-470.
[13] Okamoto T, Suzuki T, Yamamoto N. Microarray fabrication with covalent attachment of DNA using bubble jet technology. Nat. Biotechnol., 2000; Apr;18(4):438-41.
[14] Macgregor PF, Squire JA. Application of Microarrays to the Analysis of Gene Expression in Cancer. Clinical Chemistry, 2002; 48:8:1170-1177.
[15] Emmert-Buck MR, Bonner RF, Smith PD, Chuaqui RF, Zhuang Z, Goldstein SR, Weiss RA, Liotta LA.
Laser Capture Microdissection. Science, 1996; 274(5289):998-1001.
[16] Garrod DR. Desmosomes and cancer. Cancer Surv, 1995; 24:97-111.
[17] Ruiz P, Brinkmann V, Ledermann B, Behrend M, Grund C, Thalhammer C, et al. Targeted mutation of plakoglobin in mice reveals essential functions of desmosomes in the embryonic heart. J Cell Biol, 1996; 135(1):215-25.
[18] Huen AC, Park JK, Godsel LM, Chen X, Bannon LJ, Amargo EV, et al. Intermediate filament-membrane attachments function synergistically with actin-dependent contacts to regulate intercellular adhesive strength. J Cell Biol, 2002; 159(6):1005-17.
[19] Depondt J, Shabana AH, Florescu-Zorila S, Gehanno P, Forest N. Down-regulation of desmosomal molecules in oral and pharyngeal squamous cell carcinomas as a marker for tumour growth and distant metastasis. Eur J Oral Sci., 1999; 107(3):183-93.
[20] Yin T, Green KJ. Regulation of desmosome assembly and adhesion. Semin Cell Dev Biol., 2004; 15(6):666-77.
[21] Huber O. Structure and function of desmosomal proteins and their role in development and disease. Cell Mol Life Sci, 2003; 60(9):1872-90.
[22] Getsios S, Huen AC, Green KJ. Working out the strength and flexibility of desmosomes. Nat Rev Mol Cell Biol, 2004; 5:271-81.
[23] Jemal A, Clegg LX, Ward E, Ries LA, Wu X, Jamison PM, Wingo PA, Howe HL, Anderson RN, Edwards BK. Annual report to the nation on the status of cancer, 1975-2001, with a special feature regarding survival. Cancer, 2004; Jul 1;101(1):3-27.
[24] Wang Y, Lu J, Lee R, Gu Z, Clarke R. Iterative Normalization of cDNA Microarray Data. IEEE Transactions on Information Technology in Biomedicine, 2002; Vol. 6, No. 1.
[25] Tseng GC, Oh MK, Rohlin L, Liao JC, Wong WH. Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Res., 2001; 29(12):2549-57.
[26] Gabrielsson BG, Olofsson LE, Sjogren A, Jernas M, Elander A, Lonn M, Rudemo M, Carlsson LMS.
Evaluation of reference genes for studies of gene expression in human adipose tissue. Obes Res., 2005; 13:649-652.
[27] Goldblum SE, Ding X, Funk SE, Sage EH. SPARC (secreted protein acidic and rich in cysteine) regulates endothelial cell shape and barrier function. Proc Natl Acad Sci USA, 1995; Apr 12;91(8):3448-52.
[28] Hoffmann R, Seidl T, Dugas M. Profound effect of normalization on detection of differentially expressed genes in oligonucleotide microarray data analysis. Genome Biology, 2002; 3(7):research0033.1-0033.11.
[29] Yang YH, Speed TP. Design issues for cDNA microarray experiments. Nature Reviews Genetics, 2002; 579-588.
[30] Baldi P, Hatfield GW. DNA microarrays and gene expression: from experiments to data analysis and modeling. Cambridge University Press, 2002.
[31] Kepler TB, Crosby L, Morgan KT. Normalization and analysis of DNA microarray data by self-consistency and local regression. Genome Biology, 2002; 3:research0037.1-0037.12.
[32] Tseng GC, Oh MK, Rohlin L, Liao JC, Wong WH. Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Res, 2001; 29:2549-2557.
[33] Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, Beheshti J, Bueno R, Gillette M, Loda M, Weber G, Mark EJ, Lander ES, Wong W, Johnson BE, Golub TR, Sugarbaker DJ, Meyerson M. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA, 2001; 98(24):13790-5.
[34] Lu C. Improving the scaling normalization for high-density oligonucleotide GeneChip expression microarrays. BMC Bioinformatics, 2004; 5:103.
[35] Bustin SA. Absolute quantification of mRNA using real-time reverse transcription polymerase chain reaction assays. J Mol Endocrinol, 2000; 25:169-193.
[36] Suzuki T, Higgins PJ, Crawford DR. Control selection for RNA quantitation. Biotechniques, 2000; 29:332-337.
[37] Thellin O, Zorzi W, Lakaye B, De Borman B, Coumans B, Hennen G, Grisar T, Igout A, Heinen E. Housekeeping genes as internal standards: use and limits. J. Biotechnol, 1999; 75:291-295.
[38] Zhang X, Ding L, Sanford AJ. Selection of reference genes for gene expression studies in human neutrophils by real-time PCR. BMC Mol Biol, 2005; 6:4.
[39] Barber R, et al. GAPDH as a housekeeping gene: analysis of GAPDH mRNA expression in a panel of 72 human tissues. Physiological Genomics, 2005; 21:389-395.
[40] Warrington JA, Nair A, Mahadevappa M, Tsyganskaya M. Comparisons of human adult and fetal expression and identification of 535 housekeeping/maintenance genes. Physiol Genomics, 2000; 2:143-147.
[41] van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peters HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GH, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 2002; 415(6871):530-6.
[42] Shen D, Chang HR, Chen Z, He J, Lonsberry V, Elshimali Y, Chia D, Seligson D, Goodglick L, Nelson SF, Gornbein JA. Loss of annexin A1 expression in human breast cancer detected by multiple high-throughput analyses. BBRC, 2005; 326, 218-227.
[43] Sumoy L, Pluvinet R, Andreu N, Estivill X, Escarceller M. PACSIN 3 is a novel SH3 domain cytoplasmic adapter protein of the pacsin-syndapin-FAP52 gene family. Gene, 2001; 262(1-2), 199-205.
[44] Liau LM, Yang I. Microarrays and the Genetic Analysis of Brain Tumors. Current Genomics, 2002; 3, 33-41.
[45] Ljubimova JY, Khazenzon NM, Chen Z, Neyman YI, Turner L, Riedinger MS, Black KL. Gene expression abnormalities in human glial tumors identified by gene array. Int. J. Oncol, 2001; 18, 287-295.
[46] Watson MA, Perry A, Budhjara V, Hicks C, Shannon WD, Rich KM. Gene expression profiling with oligonucleotide microarrays distinguishes World Health Organization grade of oligodendrogliomas.
Cancer Research, 2001; 81, 717-723.
[47] Good PI. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses, 2nd ed. Springer-Verlag New York, Inc., 2000.
[48] Park T, Yi SG, Lee S, Lee SY, Yoo DH, Ahn JI, Lee YS. Statistical tests for identifying differentially expressed genes in time-course microarray experiments. Bioinformatics, 2003; 19(6), 694-70.
[49] Dudoit S, Yang YH, Speed TP, Callow MJ. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 2002; 12, 111-139.
[50] Ng RT, Sander J, Sleumer MC, Yuen MS, Jones SJ. A Methodology for analyzing SAGE libraries for Cancer Profiling. ACM TOIS, 2005; V 23, 35-60.
[51] Dutu T, Michiels S, Fouret P, Penault-Llorca F, Validire P, Benhamou S, Taranchon E, Morat L, Grunenwald D, Le Chevalier T, Sabatier L, Soria JC. Differential expression of biomarkers in lung adenocarcinoma: a comparative study between smokers and never-smokers. Ann Oncol, 2005; 16(12):1906-14.
[52] Hirsch FR, Franklin WA, Gazdar AF, Bunn PA Jr. Early detection of lung cancer: clinical perspectives of recent advances and radiology. Clin Cancer Res, 2001; 7:5-22.
[53] Auerbach O, Gere B, Forman JB, et al. Changes in bronchial epithelium in relation to smoking and cancer of lung. N Engl J Med; 256:97-104.
[54] Saccomano G, Archer VE, Auerbach O, Saunders RP, Brennan LM. Development of carcinoma of the lung as reflected in exfoliated cells. Cancer (Phila), 1974; 32:256-270.
[55] Venman BJW, van Boxem TJM, Smit EF, Postmus PE, Sutedja TG. Outcome of bronchial carcinoma in situ. Chest, 2000; 117:1572-1576.

Appendix A

Central Dogma of Biology

The smallest biological unit of life is known as a cell. Some organisms are unicellular while others are multicellular. Unicellular organisms contain all the machinery to sustain life and reproduce.
Multicellular organisms are much more complex, as they require interactions with other cells to obtain the nutrients necessary to sustain life, and often also require interactions with other multicellular organisms to reproduce.

Different cells produce different gene products. The activation of specific gene products, or proteins, can have a great effect on the characteristics of a cell, as they are the actual worker molecules of the cell. For instance, a muscle cell and a nerve cell express different gene products and thus differ in both function and physical appearance. The central dogma of biology explains how a gene product is produced in cells. Cells contain genetic templates known as DNA, located in a region called the nucleus. These gene templates are copied into corresponding molecules known as messenger RNA (mRNA) through a process known as transcription. mRNA molecules (often called transcripts) are then exported into the cytoplasm, where they are recruited by structures called ribosomes (rRNA). These structures in turn recruit free amino acids (the building blocks of proteins) and perform a process known as translation, in which the ribosome interprets the instructions contained in the transcript until the entire protein is created.

DNA and RNA consist of subunits known as nucleotides (A: adenine, T: thymine (DNA) / U: uracil (RNA), C: cytosine, G: guanine). DNA in its natural state consists of 2 strands, where each nucleotide has an affinity for a corresponding nucleotide on the other strand, forming a robust double-stranded structure known as the double helix. Specifically, adenine has an affinity for thymine and cytosine has an affinity for guanine, and vice versa. This natural phenomenon is known as hybridization and, as you will see later, is a key reaction for DNA microarrays to work. RNA, on the other hand, is single stranded and does not have thymine but rather a similar molecule known as uracil.
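The base-pairing rules above can be written down directly. The following small sketch is purely illustrative; the function names are hypothetical, and transcribe assumes its input is the coding (sense) strand, so that the transcript has the same sequence with uracil in place of thymine.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the strand that would hybridize to seq, read 5' to 3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(seq))


def transcribe(coding_strand):
    """Return the mRNA sequence for a DNA coding strand (T becomes U)."""
    return coding_strand.replace("T", "U")
```

For example, a probe with sequence reverse_complement(s) is the one expected to hybridize to a target with sequence s, which is the reaction that microarrays rely on.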
Appendix B

Genome-wide Expression Technologies

B.1 DNA Microarrays

A simple microarray experiment requires 6 basic steps:

1. Sample preparation
2. Array fabrication
3. Hybridization of the sample to the array
4. Scanning and image analysis
5. Normalization/Data Preprocessing
6. Data Analysis

Sample preparation

Sample preparation is considered to be the most sensitive step in a microarray experiment. One reason for this is that samples must be extracted from crude sources such as blood, tumor biopsies, and complex multi-cellular tissues. Thus, purity of samples becomes a major issue. For instance, a scientist may only be interested in the tumor tissue, but in order to extract a tumor, healthy tissues are also extracted along with it. In addition, such samples must be treated with chemicals such as formalin to preserve the tissue, which may alter or degrade the RNA and thus make it unusable. Fortunately, this is an active area of research in which new technologies have been providing scientists with tools to help extract purer samples. For instance, in recent years, the development of laser capture microdissection (LCM) has allowed scientists to eliminate unwanted cells and create purer samples [15].

Array fabrication

The primary concerns when manufacturing a DNA microarray are precision, cost and time. Today, DNA chips are either manufactured in the lab (especially in academia) or ordered from various specialized companies that manufacture and market them, including Affymetrix, Amersham Pharmacia Biotech, Biorobotics, and Genomic Solutions. The manufacturing process is also a hot area of research, as scientists and businesses aim to reduce cost and time without loss of precision, or to increase precision with little cost.
For instance, one of the first technologies developed, photolithographic DNA synthesis, allowed the placement of high-density oligonucleotides on microarray slides. However, the procedure is costly and time consuming and, most importantly, the precision is poor, as the synthesized oligonucleotides are subject to wide variation and uncertainty. Thus, alternatives were developed, namely mechanical microspotting, ink jet ejection and bubble jet technology, which helped alleviate various pitfalls of the older technologies [13].

Hybridization of the sample to the array

After the cell samples are captured (either grown or extracted directly from bulk tissues), they are placed in a test tube. These test tubes are centrifuged in order to move the cells to the bottom of the tube and then treated with an RNA extraction compound. This compound is then extracted, placed in another test tube and reverse transcribed to produce cDNA molecules, as DNA molecules are much more stable than RNA molecules. As the cDNA is made, it is tagged with a fluorescent dye. These fluorescently labeled probes are then placed on a DNA microarray chip. DNA microarray chips contain many spots of complementary bases and are made up either of oligonucleotides (synthetically produced short sequences) or of entire cDNA sequences. For spotted microarrays, two samples (the reference and the experimental) are labeled with different dyes (e.g., Cy3 and Cy5) and competitively hybridize onto the chip. As mentioned above, for Affymetrix-type arrays, control and experimental samples are hybridized on separate microarray slides and thus produce arrays of absolute expression intensities.

Scanning and image analysis (i.e., spot location, background correction, intensity assignments)

After hybridization, fluorescent dyes cannot be seen until a specialized laser passes over the slide. While this is done, light is emitted and captured by a computer and saved as an image file.
Over the years, various groups have developed microarray image analysis software to extract the intensity of each spot in an array. So far there is no standard way of doing this for microarrays, as there are many different image analysis methods. However, all microarray image analysis software has the same goals. First, in addition to calculating the signal intensity associated with each spot, image analysis aims to assess local background noise. In addition, most of these software tools allow basic filtering to deal with ghost spots (spots whose background is higher than the spot intensity), damaged spots, and very low intensity spots. Many of these poor spots are caused by experimental factors such as wobbling of the robot arm that deposits the samples onto the slide, and contamination (e.g. dust, thumbprints) [14].

Normalization/data preprocessing

Refer to Chapter 3.

Data analysis

Refer to Chapter 3.

B.2 SAGE Technology

The SAGE methodology works by capturing the mRNAs in cells. Since most mRNA molecules end with a long string of As (polyadenylated tails), scientists have discovered that they can easily capture these molecules by using a long string of the complementary base, thymine (T). To do so, the mRNA must be reverse transcribed into cDNA (the DNA version of an mRNA transcript), since DNA is a much more stable structure than RNA. This process is performed by reverse transcriptase, an enzyme found in RNA viruses. The next important step is to bind a molecule with 20 or more Ts to special microscopic magnetic structures called streptavidin beads. This allows the mRNA molecules in the cytoplasm of the cells to be captured. The beads can then be withdrawn with a magnet along with the hybridized DNA molecule (a copy of the original RNA molecule).
Next, as shown in the above figure, this cDNA molecule is cleaved by a restriction enzyme that leaves a sticky end (CTAC). This sticky end allows linkers to be attached. A tagging enzyme, which recognizes the linker site, then reaches past the CTAC sticky end and cuts off a short segment of the cDNA molecule. This short segment is the tag, and ideally it contains enough nucleotides to identify the original mRNA molecule. After the enzymatic cutting steps, the ends of these short molecules have special characteristics that allow them to bind together into longer structures called concatemers. These structures are then PCR amplified so they can be detected, and fed into a sequencer that counts all unique tags.

Appendix C

Validation of Differentially Expressed Genes

C.1 Validation of Differentially Expressed Genes of Breast Data

1. Annexin A1 (UID: 782255):
   (a) Rank: 1
   (b) Score: 3.75
   (c) Description: Thought to have a role in tumour suppression and upregulated in many cancers, including breast cancer.
   (d) References: PMID: 9062391, 9514092, 8387039

2. Serum amyloid A2 (UID: 336462):
   (a) Rank: 4
   (b) Score: 3.32
   (c) Description: Elevated in many types of tumours, including breast cancer.
   (d) References: OMIM: 104751, PMID: 6200925, 3734116, 11596022

3. Ribosomal protein S15; RPS15 (UID: 133230):
   (a) Rank: 5
   (b) Score: 3.32
   (c) Description: Found to be activated in various human tumors such as insulinomas, esophageal cancers, and colon cancers.
   (d) References: OMIM: 104751, PMID: 6200925, 3734116, 11596022

4. Nuclear receptor subfamily 2, group F, member 6; NR2F6 (UID: 239752):
   (a) Rank: 6
   (b) Score: 3.30
   (c) Description: Members of this family of genes are involved in control and differentiation in neoplasia. Absent in human colon carcinoma.
   (d) References: OMIM: 132880, PMID: 2553781

5. Interleukin 8 (UID: 624):
   (a) Rank: 7
   (b) Score: 3.18
   (c) Description: Studies suggest a role for IL-8 in promoting the metastatic potential of breast tumor cells.
   (d) References: OMIM: 146930, PMID: 11330965, 11159200, 11029796

6. Protein tyrosine phosphatase, non-receptor type 13 (UID: 211595):
   (a) Rank: 8
   (b) Score: 3.17
   (c) Description: Expressed in normal breast epithelial cells and frequently up-regulated in breast cancer.
   (d) References: OMIM: 600267, PMID: 11696979, 9544992, 10640988

7. Nuclear factor of kappa light polypeptide gene enhancer in B-cells inhibitor, alpha (UID: 81328):
   (a) Rank: 9
   (b) Score: 3.10
   (c) Description: Activation of nuclear factor-kappaB in breast cancer.
   (d) References: OMIM: 600664, PMID: 11325857, 11752211

8. LIM domain protein (UID: 79691):
   (a) Rank: 11
   (b) Score: 3.09
   (c) Description: Overexpressed in breast cancer.
   (d) References: OMIM: 603422, PMID: 11734645

9. Transforming growth factor beta-stimulated protein TSC-22 (UID: 114360):
   (a) Rank: 12
   (b) Score: 3.07
   (c) Description: Transcription factor belonging to the large family of early response genes; thought to have a tumour suppressor role in many cancers.
   (d) References: OMIM: 607715, PMID: 11944908, 11836610, 11095965, 10879745, 10854535, 9459148, 9458104, 9195978

10. Serum amyloid A1 (UID: 332053):
    (a) Rank: 13
    (b) Score: 3.06
    (c) Description: Expression in hepatoma cells.
    (d) References: OMIM: 104751, PMID: 1656519

11. Inositol polyphosphate-1-phosphatase (UID: 32309):
    (a) Rank: 15
    (b) Score: 3.05
    (c) Description: Upregulated in human colorectal cancer.
    (d) References: OMIM: 147263, PMID: 10747296, 8392378

12. Superoxide dismutase 2, mitochondrial (UID: 318885):
    (a) Rank: 17
    (b) Score: 3.01
    (c) Description: Down regulated in lung carcinoma.
    (d) References: OMIM: 147460, PMID: 11313974, 11491651

13. Protein phosphatase 1, regulatory (inhibitor) subunit 15A (UID: 76556):
    (a) Rank: 18
    (b) Score: 3.01
    (c) Description: Member of a group of genes whose transcript levels are increased following stressful growth arrest conditions and treatment with DNA-damaging agents. Its protein response is correlated with apoptosis following ionizing radiation.
    (d) References: PMID: 11593419, 11836553

14. Intercellular adhesion molecule 1 (CD54), human rhinovirus receptor (UID: 168383):
    (a) Rank: 22
    (b) Score: 2.95
    (c) Description: Upregulated in breast cancer.
    (d) References: OMIM: 147840, PMID: 11761443, 11471895, 11783310

15. Spermidine/spermine N1-acetyltransferase; SAT (UID: 28491):
    (a) Rank: 23
    (b) Score: 2.93
    (c) Description: Activity in both breast cancer cells and other types of cancers (e.g. lung).
    (d) References: OMIM: 313020, PMID: 8216356, 2731159, 12697027

16. Baculoviral IAP repeat-containing 3 (UID: 127799):
    (a) Rank: 24
    (b) Score: 2.91
    (c) Description: Inhibits apoptosis (programmed cell death).
    (d) References: OMIM: 601721, PMID: 8643514, 8552191

17. Prostate epithelium-specific Ets transcription factor (UID: 79414):
    (a) Rank: 26
    (b) Score: 2.90
    (c) Description: The mRNA is overexpressed in human breast tumors and is a candidate breast tumor marker and a breast tumor antigen.
    (d) References: PMID: 11555586

18. Keratin 14 (epidermolysis bullosa simplex, Dowling-Meara, Koebner); KRT14 (UID: 117729):
    (a) Rank: 30
    (b) Score: 2.86
    (c) Description: Expressed in breast cancer.
    (d) References: OMIM: 148066, PMID: 10786689, 11487275

C.2 Validation of Differentially Expressed Genes of Breast Data

1. Protein kinase C and casein kinase substrate in neurons 1 (UID: 6462):
   (a) Rank: 2
   (b) Score: 4.26
   (c) Description: Strong expression in normal brain. Co-localizes and binds with dynamin. Phosphorylates AGT, which controls the susceptibility to methylating carcinogens in tumor cells.
   (d) References: OMIM: 606512, PMID: 11023825, 10667577, 9746365, 11179684, 11082044

2. Synuclein, beta (UID: 90297):
   (a) Rank: 3
   (b) Score: 4.25
   (c) Description: Upregulation in ovarian and breast tumors.
   (d) References: OMIM: 602569, PMID: 8194594, 10048491, 10813729

3. ATPase, Ca++ transporting, plasma membrane, 2 (UID: 89512):
   (a) Rank: 4
   (b) Score: 4.21
   (c) Description: Upregulated in brain.
   (d) References: OMIM: 108733, PMID: 10533058

4. Protein phosphatase-2A, regulatory subunit B' (PR 53) (UID: 236963):
   (a) Rank: 6
   (b) Score: 4.18
   (c) Description: Reduced expression of the Aalpha subunit of protein phosphatase 2A in human gliomas in the absence of mutations in the Aalpha and Abeta subunit genes. Thought to function as tumour suppressors.
   (d) References: OMIM: 600756, PMID: 11519040

5. Enolase-2, gamma, neuronal (UID: 146580):
   (a) Rank: 8
   (b) Score: 4.17
   (c) Description: Elevated in glioma cells.
   (d) References: OMIM: 131360, PMID: 7520111, 6268172

6. Dynamin-1 (UID: 166161):
   (a) Rank: 10
   (b) Score: 4.12
   (c) Description: Associated with various brain tumors.
   (d) References: OMIM: 602377, PMID: 11072786, 1832879, 10749171

7. Visinin-like 1 (UID: 2288):
   (a) Rank: 12
   (b) Score: 4.08
   (c) Description: Plays an important role in regulating tumor cell invasiveness; its loss could aid in enhancing the advanced malignant phenotype.
   (d) References: OMIM: 600817, PMID: 9364517, 12941826

8. SH3 domain, GRB2-like, 2 (UID: 75149):
   (a) Rank: 14
   (b) Score: 4.06
   (c) Description: Preferentially expressed in the brain.
   (d) References: OMIM: 604465, PMID: 9169142

9. Prostatic binding protein (UID: 80423):
   (a) Rank: 16
   (b) Score: 3.99
   (c) Description: Used to regulate the onset of mammary and prostate cancer in transgenic mice.
   (d) References: OMIM: 604311, PMID: 10713685, 7972041

10. RAS-associated protein RAB3A (UID: 27744):
    (a) Rank: 19
    (b) Score: 3.97
    (c) Description: Located in the chromosome region 19p13.1-p12. The 19p13.2 site is involved in malignant processes such as acute leukemias.
    (d) References: OMIM: 179490, PMID: 8432525

11. Kinesin 2, 60-70kD (UID: 117977):
    (a) Rank: 25
    (b) Score: 3.93
    (c) Description: Breast cancer antigen (upregulation).
    (d) References: OMIM: 600025, PMID: 9177777, 12747765

12. Myelin transcription factor 1-like (UID: 172619):
    (a) Rank: 28
    (b) Score: 3.85
    (c) Description: Upregulated in high-grade human brain
    (d) References: OMIM: 600379, PMID: 9210873

Appendix D

Normalization of Expression Using Permutation Test and SAGE (NEPS) Reference Genes

D.1 Breast-SAGE Reference Genes

| # | Symbol | Gene | Score | Stdev. | Average |
|---|--------|------|-------|--------|---------|
| 1 | ITM2B | integral membrane protein 2B | 0.14 | 9.48 | 11.73 |
| 2 | TSPYL1 | TSPY-like | 0.13 | 18.12 | 10.40 |
| 3 | CDH1 | cadherin 1, type 1, E-cadherin (epithelial) | 0.060 | 20.75 | 12.13 |
| 4 | TAP | T-cell activation protein | 0.05 | 16.42 | 10.13 |
| 5 | HLA-DQB1 | major histocompatibility complex, class II, DQ beta 1 | 0.046 | 18.87 | 10.20 |
| 6 | CGI-135 | CGI-135 protein | 0.042 | 13.37 | 10.20 |
| 7 | DKFZp564F053 | Homo sapiens mRNA; cDNA DKFZp564F053 (from clone DKFZp564F053) | 0.033 | 16.54 | 10.60 |
| 8 | PPIB | peptidylprolyl isomerase B (cyclophilin B) | 0.024 | 20.48 | 22.80 |
| 9 | ATP1B | ATPase, Na+/K+ transporting, beta 1 polypeptide | 0.022 | 16.86 | 10.73 |
| 10 | GK001 | GK001 protein | 0.020 | 13.20 | 11.27 |

Table D.1: Breast-SAGE Reference Genes

D.2 Brain-SAGE Reference Genes

| # | Symbol | Gene | Score | Stdev. | Average |
|---|--------|------|-------|--------|---------|
| 1 | TCEB2 | transcription elongation factor B (SIII), polypeptide 2 (18kDa, elongin B) | 0.15 | 9.59 | 10.81 |
| 2 | FLJ22678 | hypothetical protein FLJ22678 | 0.14 | 11.16 | 13.27 |
| 3 | ENO1 | enolase 1, (alpha) | 0.14 | 10.074 | 12.73 |
| 4 | P23 | unactive progesterone receptor, 23 kD | 0.14 | 14.020 | 16.81 |
| 5 | KIAA0193 | KIAA0193 gene product | 0.13 | 11.75 | 11.88 |
| 6 | TTYH1 | tweety homolog 1 (Drosophila) | 0.12 | 14.49 | 10.81 |
| 7 | ZYX | zyxin | 0.089 | 12.39 | 10.88 |
| 8 | DDAH1 | dimethylarginine dimethylaminohydrolase 1 | 0.088 | 13.43 | 10.54 |
| 9 | ATP5G2 | ATP synthase, H+ transporting, mitochondrial F0 complex, subunit c (subunit 9), isoform 2 | 0.077 | 13.99 | 12.46 |
| 10 | NCKAP1 | NCK-associated protein 1 | 0.049 | 13.25 | 11.12 |
| 11 | PPP1CB | protein phosphatase 1, catalytic subunit, beta isoform | 0.047 | 12.32 | 10.96 |
| 12 | MACF1 | microtubule-actin crosslinking factor 1 | 0.019 | 10.89 | 10.5 |
| 13 | CCT7 | chaperonin containing TCP1, subunit 7 (eta) | 0.016 | 10.44 | 10.69 |
| 14 | SAP18 | sin3-associated polypeptide, 18kDa | 0.0070 | 14.70 | 11.81 |
| 15 | PFKL | phosphofructokinase, liver | 0.0030 | 12.34 | 10.50 |

Table D.2: Brain-SAGE Reference Genes

D.3 Lung-SAGE Reference Genes

| # | Symbol | Gene | Score | Stdev. | Average |
|---|--------|------|-------|--------|---------|
| 1 | RPL11 | ribosomal protein L11 | 0.0043 | 8.57 | 71.45 |
| 2 | RPL15 | ribosomal protein L15 | 0.023 | 4.94 | 44.03 |
| 3 | RPL17 | ribosomal protein L17 | 0.027 | 7.99 | 63.72 |
| 4 | PKM2 | pyruvate kinase, muscle | 0.054 | 7.51 | 66.65 |
| 5 | MRLC2 | myosin regulatory light chain | 0.054 | 4.48 | 40.52 |
| 6 | RPS13 | ribosomal protein S13 | 0.087 | 4.44 | 30.52 |
| 7 | ATP5A1 | ATP synthase, H+ transporting, mitochondrial F1 complex, alpha subunit, isoform 1, cardiac muscle | 0.093 | 7.58 | 81.34 |
| 8 | ATP5J | ATP synthase, H+ transporting, mitochondrial F0 complex, subunit F6 | 0.097 | 2.65 | 29.48 |
| 9 | H3F3B | H3 histone, family 3B (H3.3B) | 0.097 | 2.91 | 28.34 |
| 10 | CST3 | cystatin C (amyloid angiopathy and cerebral hemorrhage) | 0.099 | 4.21 | 31.48 |
| 11 | PRDX1 | peroxiredoxin 1 | 0.11 | 7.32 | 53.66 |
| 12 | SRP14 | signal recognition particle 14kDa (homologous Alu RNA binding protein) | 0.12 | 4.00 | 42.31 |
| 13 | NINJ1 | ninjurin 1 | 0.12 | 3.06 | 28.14 |
| 14 | RPL4 | ribosomal protein L4 | 0.14 | 4.65 | 41.03 |
| 15 | TEBP | unactive progesterone receptor, 23 kD | 0.14 | 4.65 | 41.03 |
| 16 | NDUFV1 | NADH dehydrogenase (ubiquinone) flavoprotein 1, 51kDa | 0.14 | 4.00 | 29.00 |

Table D.3: Lung-SAGE Reference Genes

D.4 All-SAGE Reference Genes

| # | Symbol | Gene | Score | Stdev. | Average |
|---|--------|------|-------|--------|---------|
| 1 | UQCRB | ubiquinol-cytochrome c reductase binding protein | 0.15 | 1.98 | 23.30 |
| 2 | RPL6 | ribosomal protein L6 | 0.14 | 2.14 | 21.13 |
| 3 | G22P1 | thyroid autoantigen 70kDa (Ku antigen) | 0.13 | 6.38 | 37.07 |
| 4 | FLJ20003 | hypothetical protein FLJ20003 | 0.12 | 4.23 | 42.56 |
| 5 | SRP9 | signal recognition particle 9kDa | 0.11 | 2.08 | 24.30 |
| 6 | GTF2I | general transcription factor II, i | 0.11 | 2.62 | 31.10 |
| 7 | ARHA | ras homolog gene family, member A | 0.11 | 2.070 | 24.50 |
| 8 | VPS28 | vacuolar protein sorting 28 (yeast) | 0.084 | | |
| 9 | SERP1 | stress-associated endoplasmic reticulum protein 1 | 0.083 | | |
| 10 | CAPNS1 | calpain small subunit 1 | 0.083 | 4.28 | 43.44 |
| 11 | RPS6 | ribosomal protein S6 | 0.063 | 3.81 | 35.27 |
| 12 | RPS2 | ribosomal protein S2 | 0.062 | 7.33 | 69.40 |
| 13 | LAPTM4A | lysosomal-associated protein transmembrane 4 alpha | 0.050 | 3.11 | 31.99 |
| 14 | RPL3 | ribosomal protein L3 | 0.048 | 8.69 | 101.14 |

Table D.4: All-SAGE Reference Genes
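One way reference genes such as those tabulated above might be used is to rescale each SAGE library so that the mean tag count of the reference genes is the same in every library. The sketch below illustrates only that general idea and is not a reproduction of the NEPS procedure. The function names, library names, and tag counts are hypothetical. The symbols RPL6 and RPS2 are borrowed from Table D.4 purely as labels, and GENE_X is a placeholder.

```python
def reference_scale_factors(libraries, reference_genes):
    """Return one scale factor per library.

    After scaling, the mean count of the reference genes is identical in
    every library and equal to the average of the original reference means.
    """
    ref_means = {}
    for name, counts in libraries.items():
        values = [counts[g] for g in reference_genes if g in counts]
        if not values or sum(values) == 0:
            raise ValueError(f"no usable reference gene counts in {name}")
        ref_means[name] = sum(values) / len(values)
    target = sum(ref_means.values()) / len(ref_means)
    return {name: target / mean for name, mean in ref_means.items()}


def normalize(libraries, reference_genes):
    """Scale every tag count in each library by that library's factor."""
    factors = reference_scale_factors(libraries, reference_genes)
    return {
        name: {gene: count * factors[name] for gene, count in counts.items()}
        for name, counts in libraries.items()
    }


# Hypothetical tag counts: lib_b was sequenced about twice as deeply as lib_a.
libraries = {
    "lib_a": {"RPL6": 20, "RPS2": 60, "GENE_X": 10},
    "lib_b": {"RPL6": 40, "RPS2": 120, "GENE_X": 10},
}
reference_genes = ["RPL6", "RPS2"]

factors = reference_scale_factors(libraries, reference_genes)
normalized = normalize(libraries, reference_genes)
# factors is {"lib_a": 1.5, "lib_b": 0.75}; GENE_X becomes 15.0 and 7.5.
```

In this toy case the raw count of GENE_X is the same in both libraries, but after scaling it is twice as high in lib_a, which is the kind of difference that normalization against stable reference genes is meant to expose. The approach is only as good as the assumption that the chosen reference genes are truly constant across samples.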

