UBC Theses and Dissertations

Outcome prediction and genome-transcriptome correlation analysis in classical Hodgkin's lymphoma Lee, Tang 2009

OUTCOME PREDICTION AND GENOME-TRANSCRIPTOME CORRELATION ANALYSIS IN CLASSICAL HODGKIN’S LYMPHOMA

by

TANG LEE

B.Sc., Cell Biology and Genetics, The University of British Columbia, 2006

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

April 2009

© Tang Lee, 2009

ABSTRACT

Treatment outcome prediction in classical Hodgkin’s Lymphoma is currently standardized with the International Prognostic Score (IPS), a scoring system based on 7 clinical parameters: age, stage, sex, serum albumin, absolute lymphocyte count or percentage, hemoglobin, and white blood cell count. Known limitations of the system are that it is tailored for advanced-stage patients and that it is unable to identify patients with very poor prognosis. In our dataset of 100 cases, the IPS correctly predicted only 28% of the treatment failures and 78% of the treatment successes. We examined the outcome predictive power of whole-tumour gene expression profiling (GEP) in comparison to the clinical parameters, to see whether additional predictive power can be gained by combining the two data sources. Random Forests and Sparse Multinomial Logistic Regression were used for classification and feature importance ranking. Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) values did not suggest a significant improvement with GEP, but potentially important GEP predictors were revealed (including CSDA, DPEP2, PDE4D, and HBP1), and only one of the seven clinical parameters (Ann Arbor Stage) was found to have predictive value. The use of whole-tumour GEP meant that only a very limited amount of the data reflected the biology of the malignant Hodgkin Reed-Sternberg (HRS) cells, since these cells make up only 1-2% of the whole tumour.
Treatment response/outcome likely involves a significant contribution by the HRS cells; examining an enriched pool of microdissected HRS cells would therefore be very beneficial. Twelve cases of microdissected HRS cells were available, and this limited sample size prevented the development of a reliable classification model. Instead, we gained insights into the biology of HRS cells by examining the relationship between DNA copy number (CN), as profiled by array CGH, and GEP. The second part of the thesis involved a single-sample strategy for the examination of the twelve cases and a joint analysis to compare between cases of different CN status. Sparse patterns of correlations with CN gains were found on chromosomes 2p, 6p, 9p, and 12p, including the JAK2 locus, which has a known correlation with gain of 9p.

TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
DEDICATION
CHAPTER 1: INTRODUCTION
  1.1 PROJECT RATIONALE
  1.2 FOCUSING ON CLASSICAL HODGKIN’S LYMPHOMA
  1.3 TREATMENT STRATEGIES AND PROGNOSTIC FACTORS
  1.4 BIOLOGY OF THE DISEASE
    1.4.1 HRS cells and their origin
    1.4.2 HRS cells and the microenvironment
    1.4.3 Genomic aberrations in HRS cells
  1.5 GENOMIC APPROACHES AND CORRELATIONS WITH CLINICAL OUTCOME
  1.6 PROJECT GOALS AND THESIS OUTLINE
CHAPTER 2: OUTCOME CLASSIFICATION ANALYSIS - GENE EXPRESSION PROFILING AND CLINICAL VARIABLES AS PREDICTORS
  2.1 INTRODUCTION
    2.1.1 Project goal
    2.1.2 Feature selection
    2.1.3 Classification / class prediction
    2.1.4 Model evaluation strategies
    2.1.5 Introduction of algorithms used
      2.1.5.1 Random Forests
      2.1.5.2 Sparse Multinomial Logistic Regression
  2.2 MATERIALS AND METHODS
    2.2.1 Hodgkin’s lymphoma whole tissue samples
    2.2.2 Data type comparison and evaluation metrics
    2.2.3 Random Forests
      2.2.3.1 Global pre-filtering, out-of-bag error estimation (GP-OOB)
      2.2.3.2 Feature selection, 3-fold cross validation (FS-CV)
      2.2.3.3 Settings
    2.2.4 Sparse Multinomial Logistic Regression
  2.3 RESULTS ON RANDOM FORESTS
    2.3.1 GP-OOB
    2.3.2 FS-CV
    2.3.3 Stability of features
  2.4 RESULTS ON SMLR
  2.5 DISCUSSION
    2.5.1 Comparison of GEP and clinical variables
    2.5.2 Predictive genes
      2.5.2.1 Based on variable importance
      2.5.2.2 Stability of features
      2.5.2.3 Comparison with previous studies
    2.5.3 Significance of the clinical model results
  2.6 CONCLUDING REMARKS
  2.7 FURTHER RESEARCH
CHAPTER 3: CORRELATION ANALYSIS OF DNA COPY NUMBER AND GENE EXPRESSION IN HODGKIN REED-STERNBERG CELLS
  3.1 INTRODUCTION
    3.1.1 Project goal
    3.1.2 Correlation between copy number and transcription level
    3.1.3 Current methods and limitations
  3.2 MATERIALS AND METHODS
    3.2.1 Micro-dissected HRS cells
    3.2.2 Copy number and gene expression calls
    3.2.3 Inter-platform mapping
    3.2.4 Single-sample analysis
    3.2.5 Joint analysis
  3.3 RESULTS
    3.3.1 Summary of data and comparison with manual calls
    3.3.2 Single-sample analysis
    3.3.3 Joint analysis
  3.4 DISCUSSION
    3.4.1 Inter-platform mapping
    3.4.2 Single-sample analysis
    3.4.3 Joint analysis
  3.5 CONCLUSION AND FUTURE WORK
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1 Definition of classes and error rates.
Table 2.2 Top 12 ranked genes preceding the first clinical variable, Stage, from GP-OOB Random Forests.
Table 2.3 FS-CV results for each model: mean and standard deviations of AUC and class-specific error rates.
Table 2.4 Top 3 genes preceding the first clinical variable, Stage, found by FS-CV Random Forests.
Table 2.5 The 8 most stable genes (frequency above half of all 18 folds) from Random Forests 3-fold CV.
Table 2.6 The 81 genes selected by SMLR as informative predictors with Importance ranked by Random Forests.
Table 3.1 Pearson’s Correlations between GE log2 intensities and CN log2 ratios.
Table 3.2 Number of regions identified by the single-sample correlation analysis listed by the two outcome classes.
Table 3.3 Selected genes in the minimal overlapping regions of CN loss in 4 chromosomes.
Table 3.4 Genes identified from joint analysis.

LIST OF FIGURES

Figure 2.1 Illustration of k-fold cross-validation, using k=5.
Figure 2.2 Random Forests set-ups.
Figure 2.3 OOB-based ROC curves of Random Forests models: GEP, clinical, and their combination.
Figure 2.4 FS-CV notched boxplots for true and false positive rates for each model.
Figure 2.5 ROC curves of the SMLR models.
Figure 3.1 Single-sample analysis, chromosome-based t-test.
Figure 3.2 Joint analysis for testing CN-GE correlation.
Figure 3.3 Composite frequency plot of twelve CN profiles.
Figure 3.4 Comparison of size of CN gain and loss regions identified by HMM and manual calls.
Figure 3.5 Recurrence of HMM and manual CN calls across the 12 samples.
Figure 3.6 Selected results from single-sample analysis: CN gained regions with elevated expression levels compared to all other loci on chromosome.
Figure 3.7 Selected results from single-sample analysis: CN loss regions with lowered expression levels compared to all other loci on chromosome.
Figure 3.8 Joint analysis results plotted using Bioconductor geneplotter.

ACKNOWLEDGEMENTS

I thank my rotation supervisors, Dr. Steven Hallam and Dr. Wyeth Wasserman, for providing me with interesting research questions and discussions. Many thanks to my first-year committee, Dr. Irmtraud Meyer, Dr. Mark Wilkinson, and Dr. Artem Cherkasov, for their support and input in the early stage of my degree. I am grateful to my thesis committee, Dr. Steven Jones and Dr. Raymond Ng, for their valuable advice and challenging questions throughout my thesis work. Thank you to Dr. Christian Steidl and Tarun Nayar for their continuous support on a daily basis. I am most grateful to my supervisor, Dr. Randy Gascoyne, for providing me with an enjoyable work environment and many stimulating discussions leading to the development of this thesis project.

DEDICATION

To my mother, my brother, and Jimmy

CHAPTER 1: INTRODUCTION

1.1  PROJECT RATIONALE

Hodgkin’s Lymphoma (HL) is a lymphoid tumour characterized by the presence of rare, large, multinucleated malignant cells within a sea of immune cells. The disease accounts for approximately one third of all malignant lymphomas in the Western world, and is the most common lymphoma in the young adult population (National Cancer Institute, Hodgkin Lymphoma, 2008). For a newly diagnosed patient, one of four HL clinical stages is assigned before the appropriate treatment program is determined and the appropriate prognostic score is applied. The stages indicate the spread of the malignancy, ranging from a single lymph node at stage I to distant spread and multiple lymph nodes at stage IV (Carbone, Kaplan, Musshoff, Smithers, & Tubiana, 1971; Lister et al., 1989).
The stage is further characterized by the presence (type B) or absence (type A) of specific systemic symptoms including weight loss, night sweats, or fever (Carbone et al., 1971; Lister et al., 1989). Selection of treatment strategies is then based on this staging along with additional clinical prognostic factors, which affect the number and duration of chemotherapy and/or radiotherapy cycles, in order to optimize the balance between increased efficacy and lowered toxicity (Diehl, Stein, Hummel, Zollinger, & Connors, 2003; Diehl, Engert, & Re, 2007). The current treatment strategies yield an overall 5-year survival rate of about 80%, but this number declines steadily for advanced-stage cases and as the number of risk factors increases (van Spronsen & Veldhuis, 2003). While 20-25% of patients fail the primary/standard treatment, a similar proportion of patients suffer from over-treatment, leading to increased long-term risk of secondary malignancy. These two groups of patients could benefit from treatment alternatives that are better tailored to their needs, if a reliable prognostic test were available at the time of diagnosis.

The current prognostic test for advanced-stage disease (Stages IB-IIB bulky and III-IV) is the International Prognostic Score (IPS) involving 7 clinical risk factors (Diehl et al., 2003; van Spronsen & Veldhuis, 2003), which is noted for various limitations and inconsistent performance. Developing a better prognostic system for HL patients using the power of Gene Expression Profiling was the motivation for this work, in which we searched for a small number of highly predictive molecular markers that could ultimately be tested in a routine clinical setting.
1.2  FOCUSING ON CLASSICAL HODGKIN’S LYMPHOMA

Hodgkin’s Lymphoma is histologically classified by the World Health Organization into two distinct subtypes: classical HL (cHL), which accounts for approximately 95% of all HL cases, and nodular lymphocyte predominant HL (NLPHL) (Hansmann & Willenbrock, 2002; Jaffe, Harris, Diebold, & Muller-Hermelink, 1998). Features that differ between NLPHL and cHL include malignant cell identity, antigen profile, and somatic mutation patterns, while the four cHL subtypes (nodular sclerosis, mixed cellularity, lymphocyte-rich, and lymphocyte-depleted) differ primarily in their tumour cellular composition and malignant cell morphology (Foss, Marafioti, & Stein, 2000; Jaffe et al., 1998). This project focused on cHL collectively as a unified phenotype, for which we had data for 100 cases.

1.3  TREATMENT STRATEGIES AND PROGNOSTIC FACTORS

Most patients are diagnosed with advanced-stage cHL, and the current treatment program involves at least 4 cycles of polychemotherapy, with radiotherapy if indicated. Current therapies fail to cure about one third of patients with advanced disease, while a similar proportion of patients might be over-treated (van Leeuwen et al., 2000). Patients who fail their primary treatment may receive secondary treatment involving autologous stem cell transplantation, with a 50% cure rate, while patients who responded but were over-treated may develop increased long-term risk of secondary malignancy (Salloum et al., 1996; van Leeuwen et al., 2000). These concerns demonstrate the importance of more accurate prediction of primary treatment outcome, so that improved treatment approaches tailored to these two response groups can be developed.
The gold standard for risk stratification of advanced-stage HL patients is currently the IPS, which is calculated from 7 clinical parameters recorded at the time of diagnosis: age, stage, sex, serum albumin, absolute lymphocyte count or percentage, hemoglobin, and white blood cell count (Hasenclever & Diehl, 1998). Every parameter is dichotomized into ‘favourable’ and ‘unfavourable’ by a measurement threshold, and the total number of unfavourable factors (the IPS) determines the risk of the case. The five-year event-free survival starts at 84% for cases with zero unfavourable factors, and declines to 42% for cases with 5 or more unfavourable factors (Hasenclever & Diehl, 1998). ‘High risk’ cases are indicated by the presence of 4 or more unfavourable factors (IPS≥4). Although this score is easy to evaluate and interpret, the system is imperfect and its level of accuracy is inconsistent across data sets. In addition to its limited application for early-stage cases, it is also unable to identify cases with very poor prognosis, defined by a 5-year survival rate of less than 50% (Axdorph et al., 2000).

1.4  BIOLOGY OF THE DISEASE

Histopathologically, cHL (and NLPHL) is distinct from virtually all other cancers in the scarcity of its malignant cells, known as Hodgkin Reed-Sternberg (HRS) cells. These cells comprise approximately 1-2% of the tumour tissue and are vastly outnumbered by non-neoplastic cells in the surrounding microenvironment. This non-neoplastic microenvironment includes numerous lymphocytes, macrophages, eosinophils, plasma cells, stromal cells, and fibroblasts, and the proportions of all components vary between cHL subtypes (Bjorkholm et al., 1995).
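To make the IPS scoring rule from Section 1.3 concrete, the following is a minimal sketch of how the tally could be computed. The thresholds are the ones commonly cited from Hasenclever & Diehl (1998) and are not stated in this thesis, so they should be read as assumptions; the `Patient` fields and function names are hypothetical and chosen only for illustration. The high-risk cut-off of 4 or more unfavourable factors follows the text above.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    """Hypothetical record of the clinical parameters used by the IPS."""
    age_years: float
    is_male: bool
    ann_arbor_stage: int            # 1 to 4
    albumin_g_per_l: float
    hemoglobin_g_per_l: float
    wbc_10e9_per_l: float           # white blood cell count
    lymphocytes_10e9_per_l: float   # absolute lymphocyte count
    lymphocyte_fraction: float      # lymphocytes as a fraction of WBC (0 to 1)


def ips_score(p: Patient) -> int:
    """Count unfavourable factors (0 to 7).

    Thresholds are assumptions taken from the commonly cited
    Hasenclever & Diehl (1998) definitions, not from this thesis.
    """
    unfavourable = [
        p.age_years >= 45,
        p.is_male,
        p.ann_arbor_stage == 4,
        p.albumin_g_per_l < 40,
        p.hemoglobin_g_per_l < 105,
        p.wbc_10e9_per_l >= 15,
        # Lymphocyte count OR percentage counts as a single factor.
        p.lymphocytes_10e9_per_l < 0.6 or p.lymphocyte_fraction < 0.08,
    ]
    return sum(unfavourable)


def is_high_risk(p: Patient, cutoff: int = 4) -> bool:
    """'High risk' means at least `cutoff` unfavourable factors."""
    return ips_score(p) >= cutoff


# Illustrative (invented) patients, not drawn from the study cohort.
example_high = Patient(50, True, 4, 35, 120, 8, 1.5, 0.19)
example_low = Patient(30, False, 2, 42, 130, 7, 2.0, 0.28)
```

Because each factor contributes exactly one point regardless of how far the measurement lies from its threshold, the score is simple to compute and interpret, which is also one reason it may discard prognostic information.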
1.4.1  HRS cells and their origin

The origin of the malignant HRS cells was established to be the germinal centre B cell when microdissected single HRS cells were shown to harbour rearrangements in the immunoglobulin (Ig) heavy chain variable region gene together with frequent somatic mutations, both phenomena unique to germinal centre B cells (Kuppers & Rajewsky, 1998; Marafioti et al., 2000). Germinal centres are structures that develop after mature B cells are stimulated by antigen, and are the sites of rapid B cell proliferation and somatic hypermutation (Kuppers, 2009). B cell variants are generated, and those that acquire advantageous mutations are positively selected, while those that acquire unfavourable mutations undergo apoptosis (Kuppers, 2009). HRS cells are unfavourable post-germinal centre B cells that have escaped apoptosis (Kanzler, Kuppers, Hansmann, & Rajewsky, 1996; Kuppers & Rajewsky, 1998). Rarely (2% of cases), a T cell origin of HRS cells is supported by the presence of T cell receptor β gene rearrangements (Muschen et al., 2000).

Though of B cell origin, HRS cells have remarkably lost their B cell expression program, and in almost all cases the only B cell features retained are functions associated with antigen presentation and T helper cell interactions (Kuppers, 2009). The B cell receptor signaling pathway that is essential in normal B cells is down-regulated (Kuppers et al., 2003; Schwering et al., 2003), and other B cell features such as functional Ig production and expression of the markers CD20 and CD79a are lost (Kanzler et al., 1996; Marafioti et al., 2000). The genes expressed in HRS cells comprise a mixture of markers of other cell lineages, including TARC (chemoattractant associated with Th2 recruitment), CD15 (ligand associated with neutrophils), and CD30 (receptor associated with T, B, and natural killer cells) (Pileri et al., 2002).
The result is a range of complex immunophenotypes that do not resemble any known cell type, and recent work has revealed some of the mechanisms driving these changes. As reviewed by Kuppers (2009), these mechanisms include lost expression of multiple B cell transcription factors (OCT2, BOB1, PU.1), epigenetic silencing of many B cell genes, low expression of the early B-cell factor 1 (EBF1) that commits a cell to the B lineage, high expression of the T cell transcription factor and B-cell negative regulator Notch 1, and down-regulation of various transcription factors that suppress non-B-lineage genes.

The proliferative capacity and anti-apoptotic nature of HRS cells have been attributed to various factors, but largely to the well-characterized constitutive activation of the NF-κB and Jak-Stat pathways. The transcription factor protein complex NF-κB is a key regulator of T/B cell development and survival, and its constitutive activation in HRS cells is driven by ligand-dependent tumour necrosis factor receptor (TNFR) signaling and/or mutations of the NF-κB inhibitor IκB (Jost & Ruland, 2007). A link to Epstein-Barr virus (EBV) infection (present in about 40% of cHL cases) has also been established, in which expression of viral latent membrane proteins mimicking TNFR activity leads to NF-κB activation (Deacon et al., 1993; Eliopoulos et al., 2003). The Jak-Stat pathway involves complex cytokine signaling events leading to the activation and nuclear translocation of STAT transcription factors (Kuppers, 2009). The mechanisms leading to its constitutive activation mainly involve genomic gains of JAK2 and inactivation of the negative regulator SOCS1 (Kuppers, 2009). High constitutive expression of activating transcription factor 3 (ATF3) has also been implicated as an important promoter of HRS cell viability (Janz et al., 2006), and further complex interactions with the microenvironment are believed to help HRS cells escape apoptosis and maintain survival.
1.4.2  HRS cells and the microenvironment

Cytokines are signaling proteins that regulate many processes by modulating the activity of surrounding cells or their own precursor cells, and chemokines regulate leukocyte trafficking by selective binding to receptors (Skinnider & Mak, 2002). HRS cells express a variety of cytokines and chemokines that mediate interaction with other cells and create a microenvironment favourable for their survival. The overexpression of Th2 cytokines and chemokines (Th2 being one of two subsets of T helper cells in the cell-mediated immune response) is well known, including IL-13 (B-cell proliferation), IL-5 (eosinophil recruitment), IL-6 (plasma cell differentiation and TNFα production), TARC and MDC (Th2 recruitment), and Eotaxin (eosinophil and Th2 recruitment) (Maggio et al., 2002; Marshall et al., 2004; Skinnider & Mak, 2002). HRS cells also attract mast cells and neutrophils by secreting CCL5 and IL-8 (Kuppers, 2009). The resulting microenvironment provides the HRS cells with survival signals, and leads to suppression of immunosurveillance, further activation of NF-κB, and sustained support of the reactive infiltrate. Cytokines that inhibit Th1 responses are also highly expressed in HRS cells, such as IL-10 and transforming growth factor (TGF) β, an immunosuppressor of B/T cells (Poppema, 2005). The overall HRS cell expression program leads to an environment that suppresses cell-mediated immunity and allows the tumour cells to proliferate. The whole tumour, on the other hand, has high expression levels of several Th1 factors, including IL-12 (Th1 cell differentiation), IP-10, and MIG (Th1 recruitment), which also demonstrates a role for the Th1-driven immune response in cHL (Skinnider & Mak, 2002).

1.4.3  Genomic aberrations in HRS cells

Multiple genomic alterations have been recurrently observed in cHL cases, but a single, unifying marker aberration that could define HRS cells has yet to be elucidated.
Among the known aberrations, copy number (CN) gains of chromosomal arms 2p and 9p are best characterized. Recurring in more than 30% of cases, these two events are associated with amplifications of REL (a member of the NF-κB family of transcription factors) and JAK2 (a tyrosine kinase involved in Jak-Stat signaling), respectively (Barth et al., 2003; Joos et al., 2000; Joos et al., 2003; Martin-Subero et al., 2002). Other recurrent genomic alterations observed include gains of 12p, 16q, 17p, and 17q, and losses of 4q, 6q, 11q, and 13q (Chui et al., 2003; Joos et al., 2003). Amplification of MDM2 (negative regulator of the tumour suppressor p53) was found in one study to be associated with 12q gain (Kupper et al., 2001), and other genes, including the transcription factor JunB (an oncogene) on 19p and signal transducer and activator of transcription 6 (STAT6) on 12q, have been suggested to have altered expression associated with CN alterations (Hartmann et al., 2008; Joos et al., 2002; Kluiver et al., 2007). Recurrent breakpoints target chromosomes 1p, 6q, 7q, 11q, 12q, and 14q (Falzetti et al., 1999), and IG translocations are present in about 20% of cHLs (Martin-Subero et al., 2006).

1.5  GENOMIC APPROACHES AND CORRELATIONS WITH CLINICAL OUTCOME

In an effort to add predictive power to the current clinical prognostic system, gene expression profiling (GEP) is being examined widely to find biomarkers correlating with clinical outcome, such as overall survival and response to therapy. For cHL, three studies have published results on this approach using whole tissue lymph node biopsy material (microenvironment), based on different methodologies. Based on hierarchical clustering of 63 cHL samples (41 Favourable Outcome, 21 Unfavourable Outcome) and 6,229 probe sets selected by variance, Chetaille et al. (2009) observed a split between two groups of samples that was significantly correlated with clinical outcome by Fisher’s exact test (p=0.03).
The gene clusters overexpressed in the Favourable Outcome group were associated with apoptosis, cell cycle, B cells, and antiviral response, while the extracellular matrix remodeling cluster was overexpressed in the Unfavourable Outcome group. To search for an outcome-associated gene expression signature, Cox analysis was then applied to 52 adult patients (31 favourable outcome, 21 unfavourable outcome). A total of 501 probe sets were determined to be associated with outcome, and B cell, apoptosis, and cell metabolism genes were again found to be associated with favourable outcome (Chetaille et al., 2009).

Devilard et al. (2002) performed hierarchical clustering on 21 cHL samples (16 Good Outcome, 5 Bad Outcome) based on 1,045 selected cDNA clones (associated with cancer and/or immune reactions), and two main branches were created, one of which contained all 5 Bad Outcome samples. Three gene clusters were visually identified to be associated with clinical outcome, and these were used to recluster the 21 cHL cases. A new division into three groups of samples was produced, and the Chi-squared test showed a significant correlation with response to therapy and survival. Over-expression of genes involved in fibroblast activation, angiogenesis, and extracellular matrix remodeling was found in the Bad Outcome group, while genes involved in tumour suppression, apoptosis, and cell signaling were found to be linked with Good Outcome (Devilard et al., 2002).

Sanchez-Aguilera et al. (2006) performed a supervised analysis on 29 samples (14 Favourable Outcome, 15 Unfavourable Outcome) based on Student’s t-test to identify differentially expressed genes between the two outcome groups. 11,675 cDNA clones representing 9,348 cancer-related genes were used, and 145 genes met the FDR filtering criteria applied. Hierarchical clustering of these genes allowed the identification of 4 gene clusters, 3 of which were overexpressed in the Unfavourable Outcome group.
The genes in these 3 clusters are involved in host immune response, apoptosis regulation, signal transduction, and mitotic checkpoint regulation. The gene cluster that was overexpressed in the Favourable Outcome group consists of genes involved in extracellular matrix remodeling, fibroblast function, and specific B cell populations (Sanchez-Aguilera et al., 2006).

While these studies presented some overlap in their outcome-associated gene signatures, the directionality (over- or under-expression in a given outcome group) of some signatures is not consistent. This might be a result of different methodologies (microarray platforms and number of genes, patient population, definition of clinical outcome groups) and small sample sizes (52, 21, and 29). Our larger dataset of 100 cases and our use of the Affymetrix platform (approximately 50,000 probe sets) might help confirm or clarify these findings.

For isolated/microdissected HRS cells, one group has previously examined the same question using GEP and identified IL-11 receptor alpha as a previously unrecognized overexpressed gene in comparison to cell line data. Correlation with outcome, however, was not possible due to the small sample size of 14 (Karube et al., 2006). Studies of CN changes in HRS cells have established a collection of recurrent aberrations in the disease, and a few genes with expression associated with the CN changes, but these studies have not focused on correlation with clinical outcome. Only one group has suggested a correlation of 13q losses with poor outcome (Chui et al., 2003).

1.6  PROJECT GOALS AND THESIS OUTLINE

We applied GEP to our frozen tissue archive of 100 whole cHL tumour biopsies obtained at diagnosis and examined its potential predictive power in comparison to the 8 clinical variables currently used in the IPS (8 rather than 7 because lymphocyte count and lymphocyte percentage were both used).
We examined the two data types using two tools, Random Forests (Breiman, 2001) and Sparse Multinomial Logistic Regression (Krishnapuram, Carin, Figueiredo, & Hartemink, 2005), which returned lists of selected predictive features (genes or clinical variables) that were then ranked. By examining the top predictors and the relative rankings of the two data types, this work addresses the question of whether GEP yields additional outcome prediction power, and identifies the best predictive/discriminative features in the two types of data. Our sample size of 100 was considerably larger than those presented in previous studies (21 in Devilard et al., 2002 and 29 in Sanchez-Aguilera et al., 2006), and comprised 18 samples from patients who progressed following their primary therapy (treatment failure) and 82 samples from patients who maintained a sustained complete remission (treatment success). This classification analysis constitutes the first part of this thesis.

As the non-neoplastic microenvironment comprises 98% of the whole tumour, GEP of the whole tumour would represent a story predominantly derived from the non-neoplastic cells, with limited sensitivity to reflect the rare HRS cells. Since the HRS cells are the defining characteristic of the disease, and a complex interplay exists between the microenvironment and the HRS cells, the outcome predictor built from whole tumour GEP alone would certainly be informative but imperfect. The hypothesis that HRS cell biology contributes significantly to clinical outcome is appealing, but molecular data collection on HRS cells is hampered by the technical challenges of enriching for these cells. We used Laser Capture Microdissection (LCM) and have obtained 12 HRS-cell-enriched samples to date. While this small sample size prevented a classification analysis, other aspects of HRS biology can be explored.
A great deal of work has been dedicated to understanding HRS cell origin, expression profile, and recurrent genomic aberrations (Kuppers, Schwering, Brauninger, Rajewsky, & Hansmann, 2002; Re, Kuppers, & Diehl, 2005; Thomas, Re, Wolf, & Diehl, 2004). We therefore wanted to examine the correlation between DNA copy number changes and gene expression levels in HRS cells, an area which still requires a great deal of work in cHL. The increasing number of GEP and CN studies has motivated the integrated analysis of the two genomic data types, since the two may closely interact by gene-dosage mechanisms. This topic has been explored in many other diseases including breast cancer (Pollack 2002, Hyman 2002), neuroblastoma (Wang Q 2006), and gastric cancer (Yang S 2007), all demonstrating an important role of CN in gene expression levels and at the same time identifying important target genes.

The second part of this thesis investigated CN changes and their correlation to gene expression levels in 12 HRS-cell-enriched samples. This work, with its small sample size, served as a preliminary step to devising methods for single-sample analysis (correlations in each individual case) and joint-sample analysis (patterns observed simultaneously across multiple samples) that could be used to make novel discoveries in the future when a sufficiently large sample size is reached. The joint-sample analysis might arrive at recurrent regions that could further extend to correlation analysis with treatment outcome, whereas the single-sample findings could provide insights into concepts such as disease-related polymorphisms. Results from this study might also provide some characterization of the overall influence of CN alterations on gene expression in HRS cells.
CHAPTER 2: OUTCOME CLASSIFICATION ANALYSIS - GENE EXPRESSION PROFILING AND CLINICAL VARIABLES AS PREDICTORS

2.1  INTRODUCTION

2.1.1  Project goal

The International Prognostic Score (IPS) is considered inaccurate for predicting outcome for cases of early stage cHL and those with very bad prognosis (Axdorph et al., 2000; Franklin et al., 2000). In our set of 100 cases, the IPS predicted only 28% of treatment failures correctly, and 78% of treatment successes correctly. We examined 100 cHL cases and compared the contribution of gene expression profiling (GEP) and clinical data using two classification algorithms, Random Forests and Sparse Multinomial Logistic Regression. The main focus was to identify the most important features from both data types, in the search for better prognostic factors and, ultimately, to improve the current prognostic system.

2.1.2  Feature selection

The initial approach to examining variables in a GEP analysis is to consider the total number of probe sets represented on the array. We used the Affymetrix HG U133 Plus 2.0 array, which contains over 54,000 probe sets. We hypothesized that the genes of interest that contribute to outcome class separation would most likely comprise less than 1% of the initial pool of genes, as most would likely be irrelevant. While microarray technology gained popularity for its power to capture a large amount of data in one experiment and thus offer a better chance for discovery, the research question is usually much more focused and involves few genes. The true signals become difficult to extract in this setting, as more noise and errors are also present. When the number of features is so large, a huge number of samples is needed in order to generalize findings to other data sets. This is known as the “curse of dimensionality”, and GEP suffers greatly from this as large sample sizes are difficult to obtain.
When the number of samples is small relative to the number of variables, GEP models may adapt to specific patterns and become specialized on the training data, resulting in the problem of “overfitting” (Simon, Radmacher, Dobbin, & McShane, 2003). In this context, dimensionality reduction becomes an essential step, to strategically reduce the number of input variables to a manageable number. In classification analysis, this pre-processing step also helps to improve models by reducing noise (such as minimally expressed or minimally varying genes) and increasing computation speed.

Feature selection is a class of dimensionality reduction techniques in which a subset of features is selected without being altered (Saeys, Inza, & Larranaga, 2007), so the returned information remains interpretable. This is in contrast to other dimensionality reduction methods, such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), where original variables are combined or compressed to create new variables that become uninterpretable (Guyon & Elisseeff, 2003; Ma & Huang, 2008; Saeys et al., 2007). At the end of feature selection, important contributors are identified, less space and time are required, and better predictive power can be obtained whilst eliminating noise. Depending on the questions asked, feature selection algorithms may select a subset of relevant features (for example, in predictor construction), or rank a list of features (for example, in drug lead identification). In the context of classification analysis as in this study, feature selection may be incorporated at three different stages: prior to model construction (filter methods), prior to model construction but optimized by model evaluation (wrapper methods), or embedded within model construction (embedded methods).
Filter methods, such as t-test, Wilcoxon rank sum, Fisher score, ANOVA, Pearson correlation and Bayesian networks, are the most common in gene expression studies due to their simple implementation (Saeys et al., 2007). The use of machine learning algorithms in this project involved a mixture of feature selection schemes.

2.1.3  Classification / class prediction

After the set of informative features is identified, a classifier can be constructed based on pre-defined class labels (in supervised learning) using these features. The goal of classification is to group individuals with similar features, and constructing a classifier essentially involves taking all input features and selecting the ones that show distinguishable and consistent differences between the pre-defined groups. During classifier construction (model training), a cutoff value may be computed for each feature to optimize its class separation, and a weight may also be assigned to each feature to indicate its importance. Once a model is established, new samples of unknown class can be used as input and the classifier outputs the predicted classes for them.

Various review articles have focused on discriminatory/classification methods for GEP data (Boulesteix, Porzelius, & Daumer, 2008; Dudoit, Fridlyand, & Speed, 2002). The more traditional methods include Fisher’s linear discriminant analysis, which searches for linear combinations of the genes that separate the classes, and nearest neighbor, which makes use of neighboring observations to obtain a class vote. Classification trees are more modern methods, in which the variables and measurements are structured to yield a class label at the terminal node. Most recent work focuses on machine learning methods, such as aggregating classifiers, in which multiple predictors are built using bootstrap samples and then aggregated to obtain a class vote (Breiman, 1996; Freund & Schapire, 1997).
The problem of overfitting is a critical limitation of supervised methods, and this limitation leads to an emphasis on the evaluation process of any constructed classifier. Model evaluation and error estimation help to quantify the model bias, which serves as an indicator of its generalization capability.

2.1.4  Model evaluation strategies

The ideal method to evaluate a classifier and to obtain unbiased error estimation is testing on an independent sample set. Since sample size is limited, cross-validation (CV) is the best option because it efficiently uses the data as both training and evaluation set (Figure 2.1).

Figure 2.1 Illustration of k-fold cross-validation, using k=5.
[Figure: the total number of samples is shown for Iterations 1 to 5, with a different fold marked as the Test set in each iteration.]

In k-fold CV, the total number of samples (n) is divided into k folds and one fold is kept hidden while all remaining samples are used as training data to construct the model. The hidden fold (consisting of n/k samples) is then used as validation data to obtain error estimation and other performance metrics of the model. This is iterated k times such that a different fold is assigned as the validation set each time, resulting in k models. The overall performance is calculated as the average of all k models (Boulesteix et al., 2008). Two adjustments improve error estimation in CV: stratification and varying partitions (Boulesteix et al., 2008; Braga-Neto & Dougherty, 2004). In stratified CV, class proportions within each fold are kept the same as those in the whole data set. In varying partitions, CV is performed multiple times to create different partitions of samples in each run, in order to reduce the variability generated by the initial partition into folds. The choice of k depends on the initial sample size and the class proportions, but the most common setup is k=5 or 10.
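The stratified partitioning described above can be sketched in Python as follows. This is a minimal illustration and not the procedure used in this study: the function names `stratified_folds` and `cv_splits` and the round-robin assignment of samples to folds are assumptions made for the example, and only the class sizes (18 failures and 82 successes) are taken from the data set analysed in this chapter.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

def cv_splits(labels, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k iterations."""
    folds = stratified_folds(labels, k, seed)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test

# Class sizes from the text: 18 treatment failures (F) and 82 successes (S).
labels = ["F"] * 18 + ["S"] * 82
splits = list(cv_splits(labels, k=3, seed=1))
for train, test in splits:
    n_fail = sum(labels[i] == "F" for i in test)
    print(len(train), len(test), n_fail)
```

Calling `cv_splits` again with a different `seed` gives a different partition of the same samples, which corresponds to the varying-partitions adjustment.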
Leave-one-out CV is the special case when k equals n, and this setup has been shown to perform best, providing a nearly unbiased estimate of the true error rate (Lachenbruch, 1967). Both 3-fold CV and leave-one-out CV were used in this project, but there are other methods for estimating error rates, such as Monte-Carlo CV, bootstrap, holdout and the 0.632 and 0.632+ estimators (Boulesteix et al., 2008).

2.1.5  Introduction of algorithms used

2.1.5.1  Random Forests

Random Forests (Breiman, 2001) is a collection of classification trees, each built using a bootstrap sample of the original data set, with each node chosen from a random subset of the original variables. The idea is that if important patterns exist in these data, different trees will uncover different portions of them, and they will be repeatedly detected in the forest (Breiman, 2001). As a whole, the forest can effectively make use of the discriminative power of the variables. In this analysis, Random Forests (RF) was used mainly for importance ranking of the lists of input variables. The importance of a variable is measured by the decline in prediction accuracy that results after the model uses randomized values of the variable: the larger the drop in model performance, the more important the original variable was. RF has been shown to be very successful for this purpose, in applications such as biomarker selection (Remlinger, 2007) and SNP interaction detection (Bureau et al., 2005; Lunetta, Hayward, Segal, & Van Eerdewegh, 2004). It has also been shown to perform as well as or better than other feature selection/classification tools for gene expression data (Diaz-Uriarte & Alvarez de Andres, 2006; Man, Dyson, Johnson, & Liao, 2004). The generalization error of the algorithm has been shown to converge to a limiting value as the number of trees increases, thus overcoming the problem of overfitting (Breiman, 2001).
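The idea behind the importance measure described above (randomize one variable and record the drop in accuracy) can be illustrated with a short Python sketch. It is a simplification under stated assumptions: Random Forests computes this quantity per tree on left-out samples, whereas the sketch below uses a single fixed stand-in predictor (`toy_predict`) and toy data, both of which are hypothetical and are not part of this study.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of samples whose predicted class matches the true class."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(labels)

def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Accuracy drop observed after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        # Rebuild the data with only column j replaced by its shuffled values.
        permuted = [row[:j] + [value] + row[j + 1:] for row, value in zip(rows, column)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return drops

# Toy data (hypothetical): feature 0 determines the class, feature 1 is noise.
rng = random.Random(42)
rows = [[float(i % 2), rng.random()] for i in range(40)]
labels = [int(row[0]) for row in rows]

def toy_predict(row):
    """Stand-in for a trained model; it only looks at feature 0."""
    return int(row[0] > 0.5)

importance = permutation_importance(toy_predict, rows, labels, n_features=2)
print(importance)
```

In this toy setting the noise feature receives an importance of exactly zero because the stand-in predictor never uses it, while shuffling the informative feature lowers the accuracy.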
The large number of trees helps to wash out accidental patterns in the data, creating robustness to outliers and noise, while constructing a powerful model even when individual input variables are weak discriminators, as long as their correlations are low (Breiman, 2001). Another important feature of the algorithm is the automatic creation of “Out-Of-Bag” (OOB) data that can be used for internal error estimation. OOB data are the samples that were left out when trees were built from randomly selected sample subsets, thus they serve as independent test sets for their particular trees. Lastly, the algorithm is fast, as each tree node examines only a small subset of the variable space.

2.1.5.2  Sparse Multinomial Logistic Regression

Sparse Multinomial Logistic Regression (SMLR) (Krishnapuram et al., 2005) is a sparse, Bayesian-based probabilistic multi-class classifier that utilizes features of the modern machine-learning domain. Sparsity refers to the selection of a minimal set of informative features, so that most input features have no influence on the classifier. This is achieved in SMLR by the incorporation of a sparsity-promoting Laplacian prior on the model parameters. The Laplacian distribution is sharply peaked at zero, thus in the learning process, feature weights of exactly zero are favoured over small values close to zero, resulting in only a small set of features with non-zero weights (Krishnapuram, Hartemink, Carin, & Figueiredo, 2004). This small set of selected features is determined by the algorithm to be relevant in making class predictions, and this sparsity prevents over-fitting and helps to achieve good generalization ability. The SMLR training process examines each class and estimates a sparse set of feature weights that best reflect the relationship between the features and the class.
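The role of the Laplacian prior can be made concrete with the general form of a multinomial logistic model under such a prior. The notation below is introduced here for illustration only and is an assumption rather than a transcription of the SMLR software: w^{(c)} is the weight vector of class c, x_i is the feature vector of sample i with class label y_i, and λ controls the strength of the prior.

```latex
% Class membership probability under a multinomial logistic model
P(y = c \mid x, w) = \frac{\exp\!\left(w^{(c)\top} x\right)}{\sum_{c'} \exp\!\left(w^{(c')\top} x\right)}

% Laplacian prior on the weights, sharply peaked at zero
p(w) \propto \exp\!\left(-\lambda \lVert w \rVert_1\right)

% Maximum a posteriori estimate: log-likelihood plus log-prior
\hat{w} = \arg\max_{w} \left[ \sum_{i=1}^{n} \log P(y_i \mid x_i, w) \;-\; \lambda \lVert w \rVert_1 \right]
```

Because the penalty term grows linearly in the absolute value of each weight, a weight must improve the likelihood by a minimum amount before it is worth moving away from zero, which is why many weights are estimated as exactly zero.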
At the classification stage of an input, SMLR predicts the posterior probabilities of its class memberships based on the feature weights, and assigns the class with the highest posterior probability.

2.2  MATERIALS AND METHODS

2.2.1  Hodgkin’s lymphoma whole tissue samples

A total of 121 Affymetrix HG U133 Plus 2.0 arrays were normalized using robust multiarray analysis (Irizarry et al., 2003) and 113 arrays passed the quality-control criterion: Normalized Unscaled Standard Errors (NUSE) (Bolstad, 2008) median ≤ 1.05. The outcome class proportion in this dataset was 31 failures and 82 successes, but the failures contained a mixture of diagnostic and relapse biopsy samples. Focusing on pretreatment outcome prediction in this study, we used only the 18 failure biopsies that were taken at diagnosis, leaving a total of 100 samples for the classification analysis. The GEP consisted of 54,675 probe sets, and the 8 clinical variables were those defined in the IPS: age, stage, sex, albumin, absolute lymphocyte count, lymphocyte percentage, hemoglobin, white blood cell count.

2.2.2  Data type comparison and evaluation metrics

Three data sets were compared to see if there were measurable differences in the accuracy of the classifiers they produced: GEP alone (GEP model), clinical variables alone (clinical model), and their combination (combination model). Model comparisons were based on the area under the Receiver-Operating Characteristic (ROC) curve (AUC). The positive class was assigned to treatment Failure (class F), and definitions for various error rates are listed in Table 2.1, accordingly.

Table 2.1 Definition of classes and error rates.
True Status | Predicted Class F | Predicted Class S | Total number of cases
Class F | A | B | A+B
Class S | C | D | C+D
Total number of predictions | A+C | B+D | A+B+C+D

Error term | Definition
True positive (TP) | A
False positive (FP) | C
True negative (TN) | D
False negative (FN) | B
True positive rate (TPR) or Sensitivity | A/(A+B)
False positive rate (FPR) | C/(C+D)
True negative rate (TNR) or Specificity | D/(C+D)
False negative rate (FNR) | B/(A+B)

Features derived from the combination model were ranked, with a cutoff set at the point when the first clinical variable appeared.

2.2.3  Random Forests

2.2.3.1  Global pre-filtering, out-of-bag error estimation (GP-OOB)

Non-specific global filtering was used to reduce the number of GEP features before training the classifiers, based on an expression intensity filter: log2 intensity greater than or equal to 4 in at least one third of samples. This resulted in 24,595 qualified probe sets from the original pool of 54,675 probe sets. The purpose of this step was to reduce the size of the input to speed up computation, and it was also necessary for using RF because the algorithm accepts a maximum of 32,768 features. The assumption here is that most of the lowly expressed features will not be relevant to the classifier. Three RFs were then created using three different inputs: GEP (filtered), 8 clinical variables, and their combination. Since this feature pre-filtering was applied to all 100 samples, no independent set could be used as validation data, so OOB error estimation was used.

2.2.3.2  Feature selection, 3-fold cross validation (FS-CV)

The second approach involved feature selection on GEP prior to inputting into RF. Three-fold CV was performed and, for each training set (2 of the 3 folds), probe sets were selected based on the t-test and a threshold at a false discovery rate of 0.05 (Benjamini & Hochberg, 1995). The aim of this step was to obtain informative genes to be used as input, in order to improve classification models.
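The false discovery rate threshold mentioned above follows the step-up procedure of Benjamini and Hochberg (1995), which can be sketched in Python as follows. This is an illustrative sketch only: the function name `benjamini_hochberg` and the example p-values are hypothetical, and the software actually used for this step is not specified here. In the FS-CV setting, such p-values would come from per-probe-set t-tests computed on the training folds only.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of hypotheses rejected at the given false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank r whose p-value satisfies p_(r) <= (r / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff (step-up rule).
    return sorted(order[:cutoff_rank])

# Hypothetical p-values, one per probe set.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
selected = benjamini_hochberg(p_values, fdr=0.05)
print(selected)
```

Note that the rule is step-up: a p-value that fails its own rank threshold is still selected if a larger p-value further down the sorted list passes its threshold.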
RFs were constructed using these features and evaluated on the left-out fold. Use of an independent validation set that was not subject to feature selection was important in this setup to avoid selection bias, thus OOB error estimation was not used. An overview of this approach and its contrast to the use of OOB data is illustrated in Figure 2.2.

Figure 2.2 Random Forests set-ups. (A) Random Forests built-in Out-Of-Bag error estimation. Each tree in the forest is constructed using a different bootstrap sample from the original sample set. Random sampling with replacement leaves approximately 37% of the samples out of the construction step of this tree. These left-out samples are the Out-Of-Bag (OOB) data, and using these for error estimation of the constructed tree has proven to be unbiased. (B) Random Forests 3-fold cross-validation. Prior to Random Forests, feature selection was applied to the GEP in a training set containing 2/3 of the samples. A Random Forest was then constructed using this training set based on the selected features. The test set, which is completely independent from these steps, is used for error estimation of the constructed Random Forest.

The small k=3 was chosen due to the highly unbalanced classes in the 100 samples (82 successes versus 18 treatment failures). Similar setups for RF (prior feature selection) were employed by other groups to avoid selection bias (Boulesteix et al., 2008; Markowetz, Ruschhaupt, & Spang, 2007). To build the clinical model, no feature selection was performed and all 8 clinical features were used. To construct the combination model, the 8 clinical variables were added to the list of selected features prior to running RF. Three-fold CV was repeated 6 times for each model type (GEP, clinical, combination), to create different partitions for reducing sampling bias. Variable importance of the combination model was examined only for features that occurred in 6 or more iterations (1/3 of all iterations).
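Counting how often a feature is selected across the repeated CV iterations is simple bookkeeping, sketched below in Python. The function name `stable_features` and the probe set names are hypothetical placeholders, and the example uses 4 iterations rather than the 18 (3 folds repeated 6 times) described above.

```python
from collections import Counter

def stable_features(selected_per_iteration, min_count):
    """Count how often each feature was selected and keep those reaching min_count."""
    counts = Counter()
    for selected in selected_per_iteration:
        counts.update(set(selected))  # count each feature at most once per iteration
    return {feature: n for feature, n in counts.items() if n >= min_count}

# Hypothetical selections from 4 iterations.
iterations = [
    ["probe_a", "probe_b", "probe_c"],
    ["probe_a", "probe_b"],
    ["probe_a", "probe_d"],
    ["probe_a", "probe_b", "probe_d"],
]
print(stable_features(iterations, min_count=3))
```

With 18 iterations, `min_count=6` would correspond to the one-third cutoff applied before examining variable importance.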
The 8 clinical variables were included since they were added into each iteration by default (occurred in all 18 iterations). In addition to importance, stability of the features was examined, which was measured by the frequency of a variable’s presence across 18 iterations.

2.2.3.3  Settings

Settings for all runs of RF (Salford Systems) were kept constant as follows: number of trees = 10,000, class weights = balanced, number of features to test at each node = sqrt(N), where N is the total number of input features.

2.2.4  Sparse Multinomial Logistic Regression

All 54,675 probe sets were used as input without pre-filtering, and the built-in leave-one-out CV was used for all models. All other parameters were set to the default values, including unit normalization of all variables (‘Recenters and rescales each feature to have zero mean and unit variance’). To construct the combination model, the 8 clinical variables were added to the complete GEP as input to SMLR. RF was used to rank the importance of the SMLR-selected features.

2.3  RESULTS ON RANDOM FORESTS

2.3.1  GP-OOB

The combination model performed similarly to the GEP model without much improvement, and AUC values indicated that the clinical model was superior overall, although the ROC plots showed that this occurred only below the 75% sensitivity level (Figure 2.3).

Figure 2.3 OOB-based ROC curves of Random Forests models: GEP, clinical, and their combination. GEP filtering on all samples was performed to reduce the number of inputs from 54,675 to one that is acceptable to Random Forests (below 32,768). The intensity filter was set at log2 intensity ≥ 4 in at least 100/3 samples. The clinical model produced the best AUC value while the combination model produced the worst. The superiority of the clinical model was not uniformly present, as the ROC curves intersect, and it disappeared at sensitivity levels higher than 0.75.
[Figure 2.3 plot: RF (OOB) ROC curves, true positive rate versus false positive rate. AUC=0.697 (Clinical), AUC=0.667 (Gene Expression), AUC=0.659 (Combination).]

Variable importance ranked (Ann Arbor) stage as the best clinical predictor, while the 7 remaining clinical variables were uninformative with weights near zero. This was expected, as stage is a critical parameter in guiding treatment decisions by classifying patients based on the extent of their disease. Twelve genes (17 probe sets) ranked higher than stage (Table 2.2), including 4 of the most differentially expressed genes between the two outcome classes (data not shown): dipeptidase 2 (DPEP2), hematopoietic SH2 (HSH2D), cold shock domain protein A (CSDA), and phosphodiesterase 4D (PDE4D).

Table 2.2 Top 12 ranked genes preceding the first clinical variable, Stage, from GP-OOB Random Forests.

Gene Symbol | Gene Title | GO Biological Process
EPOR | Erythropoietin receptor | Signal transduction / signal transduction / elevation of cytosolic calcium ion concentration / small GTPase mediated signal transduction / brain development / heart development
DPEP2 | Dipeptidase 2 | Proteolysis
PIAS1 | Protein inhibitor of activated STAT, 1 | Transcription / regulation of transcription, DNA-dependent / ubiquitin cycle / signal transduction / JAK-STAT cascade / androgen receptor signaling pathway / positive regulation of transcription, DNA-dependent
HSH2D | Hematopoietic SH2 domain containing | T cell activation
SYNJ2 | Synaptojanin 2 | ---
TGIF2 | TGFB-induced factor homeobox 2 | Transcription / regulation of transcription, DNA-dependent / regulation of transcription, DNA-dependent / regulation of transcription
USP32 | Ubiquitin specific peptidase 32 | Ubiquitin-dependent protein catabolic process / ubiquitin cycle
KPNA4 | Karyopherin alpha 4 (importin alpha 3) | Protein import into nucleus / NLS-bearing substrate import into nucleus / transport / intracellular protein transport / protein transport
RPS19 | Ribosomal protein S19 | Translation / translation / response to extracellular stimulus / gas transport / erythrocyte differentiation / anatomical structure development / positive regulation of cell motility
CSDA | Cold shock domain protein A | Negative regulation of transcription from RNA polymerase II promoter / transcription / regulation of transcription, DNA-dependent / response to cold
PDE4D | Phosphodiesterase 4D, cAMP-specific (phosphodiesterase E3 dunce homolog, Drosophila) | Signal transduction
ALDH9A1 | Aldehyde dehydrogenase 9 family, member A1 | Aldehyde metabolic process / metabolic process / carnitine metabolic process / neurotransmitter biosynthetic process / hormone metabolic process / oxidation reduction

2.3.2  FS-CV

The averaged results across 18 iterations for each method are presented in Table 2.3.

Table 2.3 FS-CV results for each model: mean and standard deviations of AUC and class-specific error rates. Variability across folds was not accounted for in these measures and might have contributed to the overall small difference in average AUC values: about 5% difference between the clinical model and the other two models. Differences in class error rates between the clinical model and the other two models, however, are greater at about 20% and 35% for class Failure and class Success, respectively.

Model | AUC ± SD | Class F error ± SD | Class S error ± SD
GEP | 0.684 ± 0.106 | 0.509 ± 0.25 | 0.254 ± 0.12
Clinical | 0.647 ± 0.127 | 0.407 ± 0.22 | 0.398 ± 0.09
Combination | 0.681 ± 0.109 | 0.519 ± 0.25 | 0.252 ± 0.11

Mean AUC (± standard deviation) for the GEP model was highest (0.684±0.106), followed by the combination (0.681±0.109) then the clinical (0.647±0.127) model. The clinical model had the lowest class F error (false-negative rate) and highest class S error (false-positive rate), but only the latter was significant, as indicated by the non-overlapping notches of the boxplots (Figure 2.4).

Figure 2.4 FS-CV notched boxplots for true and false positive rates for each model.
A total of 45 features (22 annotated genes) were found repeatedly in 6 or more iterations, and their variable importance was examined. Stage was again the only important clinical predictor, and 3 genes (6 probe sets) ranked higher (Table 2.4). Only 3 of these probe sets corresponded to annotated genes, and CSDA and DPEP2 were again found to be among the best predictors.

Table 2.4 Top 3 genes preceding the first clinical variable, Stage, found by FS-CV Random Forests.

Gene Symbol | Gene Title | GO biological process term
ZNF395 | Zinc finger protein 395 | Transcription /// regulation of transcription, DNA-dependent /// regulation of transcription, DNA-dependent
DPEP2 | Dipeptidase 2 | Proteolysis
CSDA | Cold shock domain protein A | Negative regulation of transcription from RNA polymerase II promoter /// response to cold

The average number of probe sets selected in an iteration was 110. The number of unique probe sets selected across the 18 iterations was 1,205 in total, and 70% of these were selected in only one iteration. The most stable probe set was selected in 15 iterations, followed by 11 other probe sets selected in more than 10 iterations.

2.3.3  Stability of features

To evaluate stability of the features, a threshold of 9 out of 18 iterations was chosen. Eight genes (15 probe sets) met this stability criterion and are listed in Table 2.5.

Table 2.5 The 8 most stable genes (frequency of at least half of all 18 iterations) from Random Forests 3-fold CV. Only the annotated genes are shown here. The most stable probe sets were 239829_at and 1556824_at, which were identified in 15 and 13 iterations respectively.
Frequency | Gene Symbol | Gene Title | GO biological process
12 | PDE4D | Phosphodiesterase 4D, cAMP-specific | signal transduction
12 | CBY3 | Chibby homolog 3 (Drosophila) | ---
11 | CSDA | cold shock domain protein A | negative regulation of transcription from RNA polymerase II promoter/transcription/response to cold
11 | RHEB | Ras homolog enriched in brain | signal transduction/small GTPase mediated signal transduction
10 | ZNF7 | Zinc finger protein 7 | transcription/regulation of transcription, DNA-dependent/multicellular organismal development
10 | SERP2 | Stress-associated endoplasmic reticulum protein family member 2 | transport/protein transport/intracellular protein transport across a membrane
9 | HSH2D | hematopoietic SH2 domain containing | T cell activation
9 | RALGPS2 | Ral GEF with PH domain and SH3 binding motif 2 | small GTPase mediated signal transduction

The best feature was selected in 15 out of 18 iterations, but this probe set is not annotated (ID 239829_at). Annotated genes on this list include PDE4D, CSDA, Ras homolog enriched in brain (RHEB), stress-associated endoplasmic reticulum protein (SERP2), and HSH2D, showing consistency with the findings by Variable Importance.

2.4  RESULTS ON SMLR

AUC was highest for the combination model, and its ROC curve lies between those of the other two models below a 70% sensitivity level but overtakes them thereafter (Figure 2.5). The clinical model performed worst in the high sensitivity range, as was also found with RF. In the combination model, SMLR selected 81 genes (103 probe sets) and Stage as predictive features. RF importance ranking placed stage as the 8th best predictive feature. The top gene was HMG-box transcription factor 1 (HBP1), and CSDA and PDE4D were again found at the top of the list (Table 2.6).

Figure 2.5 ROC curves of the SMLR models. The false positive rate corresponding to the IPS score is drawn to indicate the performance of the current prognostic system. The combination model produced the best AUC value, followed by the GEP model.
In agreement with Random Forests, the clinical model showed superiority at lower true positive rate levels but quickly weakened at higher levels.

Table 2.6 The 81 genes selected by SMLR as informative predictors with Importance ranked by Random Forests.

Importance | Gene_Symbol | Gene Title
100 | HBP1 | HMG-BOX TRANSCRIPTION FACTOR 1
68.01 | CSDA | COLD SHOCK DOMAIN PROTEIN A
43.17 | PDE4D | PHOSPHODIESTERASE 4D, CAMP-SPECIFIC
38.09 | SORT1 | SORTILIN 1
37.27 | ATP5A1 | ATP SYNTHASE, H+ TRANSPORTING, MITOCHONDRIAL F1 COMPLEX, ALPHA SUBUNIT 1, CARDIAC MUSCLE
34.13 | NKAIN3 | NA+/K+ TRANSPORTING ATPASE INTERACTING 3
33.41 | STAGE | STAGE
31.27 | SPATA3 | SPERMATOGENESIS ASSOCIATED 3
28.91 | PRLR | PROLACTIN RECEPTOR
27.91 | GPT2 | GLUTAMIC PYRUVATE TRANSAMINASE (ALANINE AMINOTRANSFERASE) 2
24.38 | FBP2 | FRUCTOSE-1,6-BISPHOSPHATASE 2
22.55 | GPR135 | G PROTEIN-COUPLED RECEPTOR 135
21.84 | POLR2A | POLYMERASE (RNA) II (DNA DIRECTED) POLYPEPTIDE A, 220KDA
21.27 | SNX13 | SORTING NEXIN 13
20.83 | EMID2 | EMI DOMAIN CONTAINING 2
19.66 | ARMC5 | ARMADILLO REPEAT CONTAINING 5
18.45 | PADI1 | PEPTIDYL ARGININE DEIMINASE, TYPE I
18.27 | CCBP2 | CHEMOKINE BINDING PROTEIN 2
17.62 | CCDC85A | COILED-COIL DOMAIN CONTAINING 85A
17.52 | GPC1 | GLYPICAN 1
17.48 | TFB2M | TRANSCRIPTION FACTOR B2, MITOCHONDRIAL
17.29 | LOC643711 | PLATELET-ACTIVATING FACTOR ACETYLHYDROLASE, ISOFORM IB, BETA SUBUNIT 30KDA PSEUDOGENE
15.17 | PPM1L | PROTEIN PHOSPHATASE 1 (FORMERLY 2C)-LIKE
15.07 | ESR1 | ESTROGEN RECEPTOR 1
14.62 | NR5A2 | NUCLEAR RECEPTOR SUBFAMILY 5, GROUP A, MEMBER 2
14.36 | RASAL1 | RAS PROTEIN ACTIVATOR LIKE 1 (GAP1 LIKE)
14.13 | MGC16824 | ESOPHAGEAL CANCER ASSOCIATED PROTEIN
13.59 | ADAM33 | ADAM METALLOPEPTIDASE DOMAIN 33
13.39 | EARS2 | GLUTAMYL-TRNA SYNTHETASE 2, MITOCHONDRIAL (PUTATIVE)
11.77 | IGHV3-74 | IMMUNOGLOBULIN HEAVY VARIABLE 3-74
10.98 | STXBP5 | SYNTAXIN BINDING PROTEIN 5 (TOMOSYN)
10.72 | AMMECR1 | ALPORT SYNDROME, MENTAL RETARDATION, MIDFACE HYPOPLASIA AND ELLIPTOCYTOSIS CHROMOSOMAL REGION, GENE 1
10.18 | SLC1A6 | SOLUTE CARRIER FAMILY 1 (HIGH AFFINITY ASPARTATE/GLUTAMATE TRANSPORTER), MEMBER 6
9.75 | SPATA6 | SPERMATOGENESIS ASSOCIATED 6
9.71 | MAPK4 | MITOGEN-ACTIVATED PROTEIN KINASE 4
9.23 | RB1 | RETINOBLASTOMA 1 (INCLUDING OSTEOSARCOMA)
9.01 | ADH6 | ALCOHOL DEHYDROGENASE 6 (CLASS V)
8.91 | SLC7A1 | SOLUTE CARRIER FAMILY 7 (CATIONIC AMINO ACID TRANSPORTER, Y+ SYSTEM), MEMBER 1
8.83 | MXD1 | MAX DIMERIZATION PROTEIN 1
8.64 | CD5L | CD5 MOLECULE-LIKE
7.65 | RDH11 | RETINOL DEHYDROGENASE 11 (ALL-TRANS/9-CIS/11-CIS)
7.4 | PSCD1 | PLECKSTRIN HOMOLOGY, SEC7 AND COILED-COIL DOMAINS 1 (CYTOHESIN 1)
7.27 | ATP8A1 | ATPASE, AMINOPHOSPHOLIPID TRANSPORTER (APLT), CLASS I, TYPE 8A, MEMBER 1
6.96 | SIGLEC11 | SIALIC ACID BINDING IG-LIKE LECTIN 11
6.89 | HELB | HELICASE (DNA) B
6.83 | LILRA5 | LEUKOCYTE IMMUNOGLOBULIN-LIKE RECEPTOR, SUBFAMILY A (WITH TM DOMAIN), MEMBER 5
6.42 | BRCA2 | BREAST CANCER 2, EARLY ONSET
6.16 | POLD1 | POLYMERASE (DNA DIRECTED), DELTA 1, CATALYTIC SUBUNIT 125KDA
6.14 | SAA1 /// SAA2 | SERUM AMYLOID A1 /// SERUM AMYLOID A2
5.93 | NTAN1 | N-TERMINAL ASPARAGINE AMIDASE
5.83 | FAM135B | FAMILY WITH SEQUENCE SIMILARITY 135, MEMBER B
5.62 | CIDEC | CELL DEATH-INDUCING DFFA-LIKE EFFECTOR C
5.61 | FLJ13305 | HYPOTHETICAL PROTEIN FLJ13305
5.39 | OLFM3 | OLFACTOMEDIN 3
5.15 | IL1RAP | INTERLEUKIN 1 RECEPTOR ACCESSORY PROTEIN
4.91 | PTX3 | PENTRAXIN-RELATED GENE, RAPIDLY INDUCED BY IL-1 BETA
4.69 | PKHD1L1 | POLYCYSTIC KIDNEY AND HEPATIC DISEASE 1 (AUTOSOMAL RECESSIVE)-LIKE 1
4.22 | CCIN | CALICIN
4.19 | HP | HAPTOGLOBIN
4.19 | SH3PXD2A | SH3 AND PX DOMAINS 2A
4.08 | DISP2 | DISPATCHED HOMOLOG 2 (DROSOPHILA)
3.87 | HIBADH | 3-HYDROXYISOBUTYRATE DEHYDROGENASE
3.42 | PDE4A | PHOSPHODIESTERASE 4A, CAMP-SPECIFIC (PHOSPHODIESTERASE E2 DUNCE HOMOLOG, DROSOPHILA)
3.18 | PPARD | PEROXISOME PROLIFERATOR-ACTIVATED RECEPTOR DELTA
3.17 | DCD | DERMCIDIN
2.76 | FOLH1 | FOLATE HYDROLASE (PROSTATE-SPECIFIC MEMBRANE ANTIGEN) 1
2.72 | FRS2 | FIBROBLAST GROWTH FACTOR RECEPTOR SUBSTRATE 2
2.63 | --- | TRANSCRIBED LOCUS
2.48 | TM2D3 | TM2 DOMAIN CONTAINING 3
2.38 | FRK | FYN-RELATED KINASE
2.16 | SAA1 | SERUM AMYLOID A1
2.12 | CG012 | HYPOTHETICAL GENE CG012
1.88 | DYNC1H1 | DYNEIN, CYTOPLASMIC 1, HEAVY CHAIN 1
1.8 | JUND | JUN D PROTO-ONCOGENE
0.88 | FAM110C | FAMILY WITH SEQUENCE SIMILARITY 110, MEMBER C
0.61 | ABHD1 | ABHYDROLASE DOMAIN CONTAINING 1
0.34 | FIP1L1 | FIP1 LIKE 1 (S. CEREVISIAE)
0 | ZNF713 | ZINC FINGER PROTEIN 713
0 | ITGA9 | INTEGRIN, ALPHA 9
0 | ADH1C | ALCOHOL DEHYDROGENASE 1C (CLASS I), GAMMA POLYPEPTIDE
0 | HSPA1L | HEAT SHOCK 70KDA PROTEIN 1-LIKE
0 | ARTS-1 | TYPE 1 TUMOUR NECROSIS FACTOR RECEPTOR SHEDDING AMINOPEPTIDASE REGULATOR
0 | SCO2 | SCO CYTOCHROME OXIDASE DEFICIENT HOMOLOG 2 (YEAST)

2.5  DISCUSSION

2.5.1  Comparison of GEP and clinical variables

GP-OOB found that the clinical variables generated the best classifier, and that combining both types of data produced the worst. Two possible interpretations of these results are that the current prognostic system is sufficient and GEP provides no improvement, or that the two data types carry contradictory information so that combining them worsens prediction. FS-CV, based on the average AUC measures alone, suggested that GEP performed the best, and that the combination model averaged out the individual influences of the two data types. The small difference in average AUC, however, might also reflect the large variability across CV folds. SMLR agreed that GEP, and further combining it with the clinical variables, improved outcome prediction. The ROC plots illustrated that the clinical variables performed best with respect to predicting class Failure, but at higher FPR levels the GEP model was superior with respect to predicting class Success. The combination model merged these strengths of the two individual models and averaged out their weaknesses. The clinical interpretation is that the current prognostic system could be improved if GEP markers were incorporated.
While this cannot be concluded from these preliminary results, it has been shown by two groups to be true in breast cancer prognosis (Gevaert, De Smet, Timmerman, Moreau, & De Moor, 2006; Sun, Goodison, Li, Liu, & Farmerie, 2007).

2.5.2  Predictive genes

2.5.2.1  Based on variable importance

GP-OOB found 12 genes to be better predictors than the best clinical predictor, Stage; FS-CV found 3 such genes; and the overlap included the two genes CSDA and DPEP2. CSDA is a negative regulator of transcription (Coles, Diamond, Occhiodoro, Vadas, & Shannon, 1996) that is over-expressed in class Failure, while DPEP2 is a dipeptidase involved in proteolysis (Habib, Shi, Cuevas, & Lieberman, 2003) that is over-expressed in class Success. Using stability across RF iterations as an indicator of predictor importance, the top genes found are associated with functions in signal transduction, transcription regulation, and T cell activation. PDE4D (Bolger et al., 1997), CSDA, and HSH2D (Oda et al., 2001) were among the most stable features, lending confidence to the results derived from importance ranking. SMLR selected HBP1 as the most important predictor; HBP1 is a transcription factor for the highly conserved nucleoprotein domain HMG-box (Smith, Bowles, Wilson, & Koopman, 2004). This was the most differentially expressed gene between outcome classes found in our previous expression analysis, and it is highly over-expressed in class Failure. Among the top 5 genes were CSDA and PDE4D, which have been consistently identified as the best predictors by various methods in this project. Prompted by these results, an ongoing study aims to validate the correlation of CSDA with treatment failure by immunohistochemistry using an anti-CSDA antibody and an independent patient cohort.
2.5.2.2  Stability of features

The small number of stable probe sets found from FS-CV demonstrated that our collection of samples is pathologically heterogeneous, so feature selection was highly dependent on the sample subset used. Although the true patterns that separate the two classes were subtle, our methods were able to pick up the few relevant features. A few genes were identified repeatedly by different methods, showing the robustness of the methods and the importance of these features. The planned immunohistochemistry work using CSDA as a marker would help to explore how well our results generalize to new datasets.

2.5.2.3  Comparison with previous studies

Although the work by Chetaille et al. (2009), Devilard et al. (2002), and Sanchez-Aguilera et al. (2006) did not involve classification algorithms or integration with clinical parameters, our results can be compared to theirs by examining the overlap between the outcome-associated genes reported. In summary, the proportions of genes selected in our three analyses that were found on the combined gene lists of the three previous studies were: 12/81 from SMLR, 7/20 (top 20 features) from RF (GP-OOB), and 4/22 from RF (FS-CV). The genes from SMLR include MAP2K4 (mitogen-activated), ITGA9 (integrin), RASIP1 (Ras-interacting), CCBP2 (chemokine binding), and BRCA2 (breast cancer). The genes from RF (GP-OOB) include LDHA (lactate dehydrogenase), HSPA5 (heat shock), CXCR4 (chemokine receptor), TGIF2 (TGFB-induced), and TNFSF10 (TNF member). The genes from RF (FS-CV) include ATP5C (ATP synthase) and ZNF395 (zinc finger protein). Interestingly, one of the top features from our analyses, CSDA, was found by Chetaille et al. (2009) to be associated with EBV+ status. In fact, all three analyses we performed had overlapping genes with Chetaille et al.’s EBV-associated gene lists: 10 from SMLR, 2 from RF (GP-OOB), and 2 from RF (FS-CV).
These include G protein-coupled receptor, calcium-activated potassium channel, leukocyte immunoglobulin-like receptor, and various small peptides. Chetaille et al. also found variants of our top feature PDE4D (PDE4B, PDE6D) to be associated with clinical outcome.

2.5.3  Significance of the clinical model results

Regardless of the methods used, the clinical model showed weakness in predicting class Success. Clinically, this is an important issue because wrongly predicting treatment Success patients as treatment Failure could result in over-treatment of these individuals. Over-treatment has been shown to lead to a higher risk of secondary malignancies, and therefore represents a more serious problem than the reverse scenario of misclassifying class Failure patients, which occurs in the GEP model. SMLR additionally showed that using raw clinical variables yields higher predictive power than the IPS system. At the FPR threshold corresponding to the IPS threshold, the raw clinical variables performed significantly better (vertical line in Figure 2.5).

2.6  CONCLUDING REMARKS

Random Forests and SMLR both showed that the clinical variables predicted better for treatment Failures, while GEP predicted better for treatment Successes. Combining the two types of data led to slightly increased performance by encompassing the strengths of each approach. At this stage, the system has not improved to a level at which clinical decision-making would be altered, because the error rates remain unsatisfactory. We hypothesize that many other factors are actively involved in the treatment response process, such as HRS cell specific features including differentiation stage, tumour stem cell properties, drug resistance, or host detoxification metabolism. Many such features are unlikely to be detected by GEP of the microenvironment alone, and may require data collection by other means such as array Comparative Genomic Hybridization (arrayCGH) for copy number data.
Examination of the predictive features across all methods identified Stage as the only important clinical variable, which consistently ranked within the top 20 features. This indicates that the remaining clinical variables are comparatively weak predictors when combined with GEP, although they have the advantage of being easy to retrieve and interpret in the clinical setting. A few genes were also consistently found to be among the important predictors, including CSDA, PDE4D, and DPEP2, and these may serve as candidates for further experimental validation. An antibody for CSDA will be used for this purpose as a result of this analysis.

2.7  FURTHER RESEARCH

In our work we integrated two different types of data, GEP and clinical variables. For both Random Forests and SMLR, the integration occurred at the input stage and the combined data were treated as one uniform dataset without distinguishing between GEP and clinical variables. This simple approach has the advantage that any correlations between the two data types would not be neglected, but a few issues may also arise. First, the pooled dataset contained mostly GEP variables and only a few clinical variables, which might be unfair for even the most powerful clinical predictors. In Random Forests, for example, the GEP variables would have a greater chance of being selected at a tree node. The second problem concerns the different scales of the two data types: GEP log2 intensities ranged from 0 to 15, while the clinical variables were on many different numeric scales and a few were categorical. Biases may occur towards certain types of variables; for example, previous work suggested that classification trees favor variables with more categories (Strobl, Boulesteix, Zeileis, & Hothorn, 2007). In summary, the integration time point and method are non-trivial, and more experiments are required to find the optimal approach.
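The first concern can be made concrete with a small calculation. The sketch below is illustrative only and rests on assumptions that are not stated in this chapter: that all 54,675 probe sets and all 7 clinical parameters enter the pooled dataset, and that each tree node draws mtry = floor(sqrt(p)) candidate variables uniformly without replacement, which is the usual Random Forests default for classification. The function name `prob_any_clinical` is hypothetical and is not part of any package used in this work.

```python
import math


def prob_any_clinical(n_gep, n_clinical, mtry):
    """Probability that at least one clinical variable is among the
    mtry candidate variables drawn without replacement at a tree node."""
    total = n_gep + n_clinical
    # P(no clinical variable drawn) is a hypergeometric term, computed
    # here as a running product over the mtry successive draws.
    p_none = 1.0
    for i in range(mtry):
        p_none *= (n_gep - i) / (total - i)
    return 1.0 - p_none


# Assumed pool sizes (see the lead-in): 54,675 probe sets, 7 clinical variables.
n_gep, n_clinical = 54675, 7
mtry = math.isqrt(n_gep + n_clinical)  # assumed default: floor(sqrt(p))
print(mtry, round(prob_any_clinical(n_gep, n_clinical, mtry), 4))
```

Under these assumptions a clinical variable is offered to a node only about 3% of the time, so even a strong clinical predictor would rarely have the opportunity to be chosen. Filtering probe sets before model building, or sampling the two data types separately, would change this figure.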
Three different integration approaches using Bayesian networks were examined in one study (Gevaert et al., 2006). In addition to full integration, which is analogous to our method, they examined decision integration and partial integration. Decision integration is analogous to Random Forests creating separate models for the two data types, then merging their best predictors to build a combined model. This could provide an interesting comparison to the current approach. Partial integration is unique to the Bayesian network structure and therefore cannot be used to analyze our dataset. The next step might also require examining the top predictors identified through this analysis. We focused only on the best clinical variable and the GEP features that ranked higher. Examining the entire lists of variables would be important for understanding the mechanism of class separation, and for comparing the different model-building strategies used.

CHAPTER 3: CORRELATION ANALYSIS OF DNA COPY NUMBER AND GENE EXPRESSION IN HODGKIN REED-STERNBERG CELLS

3.1  INTRODUCTION

3.1.1  Project goal

This work aimed to examine methods for characterizing the influence of copy number (CN) alterations on gene expression (GE) in micro-dissected HRS cells. Two approaches were involved: examining patterns in individual samples and examining recurrent patterns across samples. Results from these strategies might further extend to correlation analysis with treatment outcome, where different patterns may recurrently occur in samples of each of the two outcome classes. Insights into the disease might also be gained regarding genetic heterogeneity and the different genetic mechanisms that produce a common phenotype.

3.1.2  Correlation between copy number and transcription level

The increasing number of GEP and CN studies has motivated the integrated analysis of these two genomic data types, and early studies focused on quantifying the global effect of copy number alterations (CNA) on transcription levels.
The first studies examined breast cancer and reported that 62% of the genes in CN amplified regions were over-expressed and, in the opposite direction, that 10.5% of the over-expressed genes were in CN amplified regions (Hyman et al., 2002; Pollack et al., 2002). A comparative study examined the global CN-GE correlation in five published cancer data sets, and concluded that the correlation is weak at around 12% but consistent across studies (Gu, Choi, & Ghosh, 2008). These results revealed that a small percentage of gene expression variation can be explained by CN, in addition to regulatory mechanisms such as histone modification, sequence mutations, microRNA, and protein-DNA interactions. Recent studies have moved from examining global correlations to focusing on small, interesting regions, and to identifying specific targeted genes. One group examined CN status at four prognostically relevant loci in neuroblastoma and showed that GE patterns also cluster according to the CN groups (Wang et al., 2006). Their work produced a prioritized list of candidate neuroblastoma suppressor genes. Another group examined CN patterns of 64 cancer-relevant genes in 60 cancer cell lines and identified a high correlation between 3p gain and the carcinogenesis-inducing ERBB2 (Bussey et al., 2006; Isola et al., 1999). They further studied enzyme drug sensitivity of 118 compounds in relation to CN and found a striking correlation between sensitivity to the drug L-asparaginase and CN loss of genes near the asparagine synthetase gene in ovarian cancer cell lines, suggesting a functionally relevant drug resistance mechanism.

3.1.3  Current methods and limitations

Most of the publications involving integration of the two data types are clinical articles focusing on results and biological interpretations; very few are methodological articles that aim to improve the analytical strategies.
The common methods can be summarized as involving three steps: compilation of an integrated data set, processing of data points, and computation of a test for correlation. In the first step, the two data types are either mapped to a set of common genes or mapped by coordinates (Linn et al., 2003; Yang et al., 2007); no mapping is required when both experiments are performed using cDNA microarrays (Hyman et al., 2002; Pollack et al., 2002). In the second step, the CN and GE values are often segmented into discrete groups by setting thresholds on the raw ratios, resulting in the groups amplified/normal/deleted for CN and over/under-expressed for GE (Hyman et al., 2002; Yoshimoto et al., 2007). If a single correlation statistic (for continuous variables) is desired, the raw values can be used directly and no pre-processing is performed (Yang et al., 2007). The last step involves a test for correlation, and the choice of test varies widely across studies. For a single statistic, Pearson’s correlation is most often used (Linn et al., 2003; Yang et al., 2007). When segmented data are used, GE is compared between CN groups, and examples of the tests employed include the t-statistic (Hyman et al., 2002), the Wilcoxon rank test (van Wieringen, Belien, Vosse, Achame, & Ylstra, 2006), and a sum of squares based formula (Sweet-Cordero et al., 2006). One group took a different approach by first identifying differentially expressed genes between the CN groups, then examining the chromosomal distribution of the over- and under-expressed genes (Yoshimoto et al., 2007). They inferred sites of correlation based on prior knowledge of the chromosomes involved in genomic aberrations in the disease. Two issues that are of interest to us are lacking in all these studies: single-sample patterns and clinical-class patterns.
These studies focused on methods to examine patterns in a particular disease as a whole, often comparing between cancer and normal profiles only. Individual samples and their unique patterns were not discussed in these joint-sample analyses, and clinically relevant subgroups such as treatment outcomes were not compared. In this project, both single-sample and joint (multi-sample) analyses were performed, and outcome classes could be visually compared in the preliminary examination of single-sample results. Since our methods were developed on a small sample set of 12 cases, many steps might need further optimization when applied to larger data sets; the parameter settings applied should therefore be considered preliminary and freely adjustable in the future. These adjustable parameters include the number of CN groups (3 or 5) generated by a Hidden Markov Model (HMM), the threshold on log2 intensities used to generate GE groups, the overlap criterion in the inter-platform mapping, and the resultant filtering methods.

3.2  MATERIALS AND METHODS

3.2.1  Micro-dissected HRS cells

Submegabase Resolution Tiling array (SMRT) (Ishkanian et al., 2004) and GEP data of 12 HRS cell samples were available at the start of this analysis. The array CGH data in log2 ratios were normalized by the method outlined in Khojasteh, Lam, Ward, & MacAulay (2005) and provided in SeeGH-readable (Chi, DeLeeuw, Coe, MacAulay, & Lam, 2004) text files. GEP required RMA normalization to produce the log2 intensities. The SMRT array contained 26,310 BAC clones, and the Affymetrix array contained 54,675 probe sets.

3.2.2  Copy number and gene expression calls

Computational calls on the CN log2 ratios were generated using the Matlab program CNA-HMMer (Shah et al., 2006), which employs the HMM and assigns segments of BACs into three CN groups: gain (1), neutral (0), and loss (-1). Manual calls on the same data set were made previously by Dr. Christian Steidl, and these were used in this project solely for comparing data distributions with the HMM calls. Calls for GEP were made relative to the average intensity of every probe set across the 12 samples. A probe set with a value more than 1 standard deviation (SD) above the average was called Over-expressed (+1), a value more than 1 SD below the average was called Under-expressed (-1), and a value between these 2 endpoints was called Normal (0).

3.2.3  Inter-platform mapping

Chromosomal locations of the BAC clones and Affymetrix probe sets were obtained from the SMRT array input file and the Affymetrix HG U133 Plus2.0 annotation file respectively, and all overlapping records were paired. The overlap percentage OP was then introduced to control the required overlap portion of the probe set. In this analysis OP = 50, so 50% or more of the probe set must overlap with a BAC clone in order to map the pair. This resulted in 77,750 spots (pairs) across the 24 chromosomes, comprising 47,408 (87%) unique probe sets and 16,595 (63%) unique BAC clones.

3.2.4  Single-sample analysis

For each of the 12 samples, two computations were performed: Pearson’s correlation between log2 intensities and log2 ratios across the whole genome, and a chromosome-based t-test comparing GE between CN groups. The chromosome-based analysis required 2 steps (Figure 3.1).

Figure 3.1 Single-sample analysis, chromosome-based t-test. Step 1: Identify contiguous copy number regions (CN groups 1, 0, -1) on a chromosome. Step 2: Perform t-test (group X versus group Y) on gene expression between regions.

Contiguous CN regions were first identified on each chromosome, then the average unlogged expression in each CN altered (CNA) region was compared to that of the remaining loci on the chromosome. The latter may contain a mixture of CN regions.
In the case of a whole chromosome event where the CNA region was the whole chromosome, the comparison group was all CN neutral BACs (~50,000) in the genome. P-values were adjusted by the Benjamini-Hochberg method (Benjamini & Hochberg, 1995) and an FDR threshold of 0.01 was used to filter for significant regions. These were plotted by chromosome using the UCSC genome browser, and outcome classes were labeled differently.

3.2.5  Joint analysis

The 12 samples were examined simultaneously in this analysis, and the following steps were performed for each of the 77,750 spots (Figure 3.2).

Figure 3.2 Joint analysis for testing CN-GE correlation. Step 1: Categorization into CN groups (CN matrix; 77,750 spots by 12 samples). Step 2: Compute fold-change (group X versus group Y) on gene expression between groups (GE matrix; 37,052 tests by 12 samples).

In the first step, each sample was categorized by CN status. The unlogged expressions were then compared between the CN groups using fold-change. The four possible CN group comparisons were: 1 versus 0, 1 versus others (0 and -1), -1 versus 0, and -1 versus others (0 and 1). The small sample size prevented the use of a t-test, and comparisons were performed only when the two comparison groups each had a minimum of 3 samples. In total, 37,052 spots were examined, and 74% belonged to the CN=1 versus CN=0 scenario. The unlogged expression of one CN group was compared to that of the other CN group using one standard error above and below the average value. The spots with non-overlapping ranges were then filtered using a fold change threshold of 1.0. The Bioconductor package geneplotter was used to plot these spots across the genome (Gentleman, 2008).
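The per-spot comparison of Section 3.2.5 can be sketched as follows. This is a minimal illustration, not the code used in this work, and it makes two assumptions that the text leaves open: the fold change is taken as the log2 ratio of the two group means of unlogged expression, with the threshold of 1.0 applied to its absolute value, and the standard error is the sample standard deviation divided by the square root of the group size. The names `compare_spot`, `mean_se_range`, `min_group`, and `min_log2_fc` are hypothetical.

```python
import math
from statistics import mean, stdev

# The four CN group comparisons (group X versus group Y) from the text.
COMPARISONS = {
    "gain_vs_neutral": (lambda c: c == 1, lambda c: c == 0),
    "gain_vs_others": (lambda c: c == 1, lambda c: c != 1),
    "loss_vs_neutral": (lambda c: c == -1, lambda c: c == 0),
    "loss_vs_others": (lambda c: c == -1, lambda c: c != -1),
}


def mean_se_range(values):
    """Return (mean, mean - SE, mean + SE) for a list of expression values."""
    m = mean(values)
    se = stdev(values) / math.sqrt(len(values))
    return m, m - se, m + se


def compare_spot(cn_calls, expression, min_group=3, min_log2_fc=1.0):
    """Test one spot: cn_calls holds -1/0/1 per sample, expression holds the
    matching unlogged (strictly positive) intensities."""
    results = {}
    for name, (in_x, in_y) in COMPARISONS.items():
        x = [e for c, e in zip(cn_calls, expression) if in_x(c)]
        y = [e for c, e in zip(cn_calls, expression) if in_y(c)]
        # Skip comparisons where either group has fewer than min_group samples.
        if len(x) < min_group or len(y) < min_group:
            continue
        mx, lo_x, hi_x = mean_se_range(x)
        my, lo_y, hi_y = mean_se_range(y)
        # The mean +/- SE ranges of the two groups must not overlap.
        separated = hi_x < lo_y or hi_y < lo_x
        log2_fc = math.log2(mx / my)  # assumed definition of fold change
        results[name] = {
            "log2_fc": log2_fc,
            "passes": separated and abs(log2_fc) >= min_log2_fc,
        }
    return results


# Toy spot with 12 samples: 3 gains, 7 neutral, 2 losses.
cn = [1, 1, 1, 0, 0, 0, 0, 0, 0, -1, -1, 0]
ge = [400, 420, 380, 100, 110, 90, 105, 95, 100, 50, 60, 100]
print(compare_spot(cn, ge))
```

In the toy spot only the two gain comparisons are evaluated, because the loss group has fewer than 3 samples. Exposing the group-size minimum and the fold-change threshold as arguments reflects the statement in Section 3.1.3 that these settings should be regarded as adjustable.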
3.3  RESULTS

3.3.1  Summary of data and comparison with manual calls

A composite frequency plot of the 12 CN profiles generated by CNA-HMMer showed patterns consistent with the known aberrations in Hodgkin’s lymphoma (Figure 3.3). Gains of 2p, 9p, 12q, 16p, 17p, 17q, and 20q were all clearly present, as were losses of 6q and 13q and, more subtly, of 4q and 11q.

Figure 3.3 Composite frequency plot of twelve CN profiles. Known CN aberrations associated with the disease are detected in the 12 samples; the most prominent are gains of chromosomes 2p, 9p, 12p, 16p, 17, and losses of chromosomes 6q, 13q.

On the BAC level, the distributions of calls into each CN group are similar for the HMM and manual methods. Nearly 80% of BAC clones were called CN neutral by both methods, more CN losses were called by the manual method (12% versus 8%), and more CN gains were called by the HMM (13% versus 10%). When the distribution was examined for contiguous regions, HMM calls were much more fragmented, identifying a larger number of small regions, whereas the manual calls identified fewer but larger regions (Figure 3.4).

Figure 3.4 Comparison of size of CN gain and loss regions identified by HMM and manual calls. Contiguous BAC clones with identical CN calls were defined as regions, and size distributions of these regions were plotted for the two call methods. HMM calls resulted in 526 gain regions and 729 loss regions, with maximum region size near 1000bp. Manual calls resulted in 130 gain regions and 131 loss regions, with maximum region size near 1800bp.

Recurrence of the CN calls across the 12 samples had similar distributions in both computational and manual calls. Approximately 10% of the BACs were called CN neutral in all 12 samples. CN gained and lost BACs were most frequently called in a single sample only (Figure 3.5).

Figure 3.5 Recurrence of HMM and manual CN calls across the 12 samples.
Similar patterns of recurrence are observed for the two call methods. CN neutral BAC clones are highly recurrent across the samples but not the CN gain or loss ones.

3.3.2  Single-sample analysis

Pearson correlation coefficients for the 12 samples had a median of 0.0668, a maximum of 0.1061 (Hd40), and a minimum of -0.0458 (Hd23) (Table 3.1). Chromosome-based analysis resulted in a total of 259 significant regions across the 12 samples, and over 90% were associated with CN loss (Table 3.2).

Table 3.1 Pearson’s Correlations between GE log2 intensities and CN log2 ratios.

Sample | Pearson’s correlation | P-value
Hd23 | -0.04582 | 2.2e-16
Hd31 | 0.00091 | 0.799
Hd30 | 0.02445 | 9.244e-12
Hd20 | 0.04391 | 2.2e-16
Hd1 | 0.06028 | 2.2e-16
Hd21 | 0.06141 | 2.2e-16
Hd28 | 0.07225 | 2.2e-16
Hd5 | 0.07325 | 2.2e-16
Hd15 | 0.08408 | 2.2e-16
Hd25 | 0.08427 | 2.2e-16
Hd38 | 0.10249 | 2.2e-16
Hd40 | 0.10613 | 2.2e-16

Table 3.2 Number of regions identified by the single-sample correlation analysis listed by the two outcome classes. Above 90% of the findings involve regions of lowered expression associated with CN losses.

Class | Sample | Gain associated | Loss associated
S | Hd1 | 1 | 12
S | Hd5 | 1 | 15
S | Hd30 | 2 | 26
S | Hd31 | 0 | 12
S | Hd40 | 2 | 21
F | Hd15 | 2 | 49
F | Hd20 | 1 | 15
F | Hd21 | 2 | 9
F | Hd23 | 1 | 6
F | Hd25 | 3 | 20
F | Hd28 | 1 | 18
F | Hd38 | 0 | 40

On the per-sample level, the number of regions ranged from a minimum of 7 per case to a maximum of 51 per case. Results plotted by chromosome showed non-recurrent, large regions associated with CN gains, and more recurrent, fragmented regions associated with CN losses. Figure 3.6 shows three chromosomes illustrating CN gain associated regions.

Figure 3.6 Selected results from single-sample analysis: CN gained regions with elevated expression levels compared to all other loci on the chromosome. Results for each sample are presented as tracks using the UCSC Genome browser. Blue indicates samples in class Failure and red indicates samples in class Success.
Results associated with CN gains involve large, sparse regions with low recurrence. With respect to the known CN-correlated loci, two samples covered the REL locus on chromosome 2p.

REL on 2p was found in 2 cases, and JAK2 on 9p (not shown) was found in only 1 case involving a whole-arm CNA. Figure 3.7 shows five chromosomes illustrating recurrent CN loss associated regions on 1p, 6q, 7q, 10q, and 16.

Figure 3.7 Selected results from single-sample analysis: CN loss regions with lowered expression levels compared to all other loci on the chromosome. Results for each sample are presented as tracks using the UCSC Genome browser. Blue indicates samples in class Failure and red indicates samples in class Success. Results associated with CN losses involved small, dense, and recurrent regions.

The minimal overlapping region on chromosome 16 did not contain any RefSeq genes, but selected genes on the other 4 chromosomes are listed in Table 3.3.

Table 3.3 Selected genes in the minimal overlapping regions of CN loss in 4 chromosomes.
Chromosome 1
ACOT11 | ACYL-COA THIOESTERASE 11
DHCR24 | 24-DEHYDROCHOLESTEROL REDUCTASE
DIO1 | DEIODINASE, IODOTHYRONINE, TYPE I
GLIS1 | GLIS FAMILY ZINC FINGER 1
MAGOH | MAGO-NASHI HOMOLOG, PROLIFERATION-ASSOCIATED (DROSOPHILA)
PARS2 | PROLYL-TRNA SYNTHETASE (MITOCHONDRIAL)(PUTATIVE)
PPAP2B | PHOSPHATIDIC ACID PHOSPHATASE TYPE 2B
PRKAA2 | PROTEIN KINASE, AMP-ACTIVATED, ALPHA 2 CATALYTIC SUBUNIT
SCP2 | STEROL CARRIER PROTEIN 2

Chromosome 6
GPR126 | G PROTEIN-COUPLED RECEPTOR 126
HINT3 | HISTIDINE TRIAD NUCLEOTIDE BINDING PROTEIN 3
HIVEP2 | HUMAN IMMUNODEFICIENCY VIRUS TYPE I ENHANCER BINDING PROTEIN 2
HSF2 | HEAT SHOCK TRANSCRIPTION FACTOR 2
NCOA7 | NUCLEAR RECEPTOR COACTIVATOR 7
PKIB | PROTEIN KINASE (CAMP-DEPENDENT, CATALYTIC) INHIBITOR BETA
TPD52L1 | TUMOUR PROTEIN D52-LIKE 1

Chromosome 7
CASP2 | CASPASE 2, APOPTOSIS-RELATED CYSTEINE PEPTIDASE
GSTK1 | GLUTATHIONE S-TRANSFERASE KAPPA 1
KEL | KELL BLOOD GROUP, METALLO-ENDOPEPTIDASE

Chromosome 10
BNIP3 | BCL2/ADENOVIRUS E1B 19KDA INTERACTING PROTEIN 3
GPR123 | G PROTEIN-COUPLED RECEPTOR 123
INPP5A | INOSITOL POLYPHOSPHATE-5-PHOSPHATASE, 40KDA
KNDC1 | KINASE NON-CATALYTIC C-LOBE DOMAIN (KIND) CONTAINING 1
MTG1 | MITOCHONDRIAL GTPASE 1 HOMOLOG (S. CEREVISIAE)
PPP2R2D | PROTEIN PHOSPHATASE 2, REGULATORY SUBUNIT B, DELTA ISOFORM
STK32C | SERINE/THREONINE KINASE 32C
TCERG1L | TRANSCRIPTION ELONGATION REGULATOR 1-LIKE
TUBGCP2 | TUBULIN, GAMMA COMPLEX ASSOCIATED PROTEIN 2
UTF1 | UNDIFFERENTIATED EMBRYONIC CELL TRANSCRIPTION FACTOR 1

Among these are protein kinases, G protein-coupled receptors, transcription associated factors, apoptosis-related caspase, and BCL2 associated protein.

3.3.3  Joint analysis

152 genes were found to have significant CN-GE correlation using a fold-change based comparison. 134 (88%) of these were associated with CN gains, mainly located on chromosomes 2p, 6p, 9p, 12p (Figure 3.8 (A)).

Figure 3.8 Joint analysis results plotted using Bioconductor geneplotter.
(A) Probe sets with elevated expression in CN-gained samples. (B) Probe sets with lowered expression in CN-loss samples.

The known association of JAK2 on 9p (Kupper et al., 2001) was found, while few genes were found to be in CN loss regions. Table 3.4 shows a complete list of the genes identified; those associated with CN gain include Interleukins 22 and 26 (Jak/Stat pathway), RAB of the Ras oncogene family, BCL2-interacting protein, Immunoglobulin subunits, EBV-induced gene, and JAK2. GO analysis of the 134 gain-associated genes showed enrichment in the categories negative regulation of transcription (5 genes, p=6.47e-3), transcription factor activity (14 genes, p=8.57e-4), regulation of cellular metabolism (24 genes, p=5.34e-3), and negative regulation of cellular physiological process (11 genes, p=4.77e-3). The 18 CN loss associated genes include insulin growth factor, ribonuclease, and transcription elongation factor. GO analysis showed 2 genes (p=1.67e-5) involved in transmembrane receptor protein tyrosine kinase activity.

Table 3.4 Genes identified from joint analysis.
(A) Genes with elevated expression in CN-gained samples

Chrom | Genes | Annotated Genes
1 | CTSS | CATHEPSIN S
2 | IAH1, KLF11, SF3B14, YPEL5, DYNC2LI1, PREPL, VRK2, COMMD1, GFPT1, ADD2, PLGLA1, NT5DC4 | ISOAMYL ACETATE-HYDROLYZING ESTERASE 1 HOMOLOG, SPLICING FACTOR 3B, DYNEIN CYTOPLASMIC 2, PROLYL ENDOPEPTIDASE-LIKE, COPPER METABOLISM (MURR1) DOMAIN CONTAINING 1, GLUTAMINE-FRUCTOSE-6-PHOSPHATE TRANSAMINASE 1, ADDUCIN 2 (BETA)
3 | KPNA1, PARP14 | KARYOPHERIN ALPHA 1 (IMPORTIN ALPHA 5), POLY (ADP-RIBOSE) POLYMERASE FAMILY, MEMBER 14
5 | SRD5A1, MARVELD2, FCHO2, ODZ2 | STEROID-5-ALPHA-REDUCTASE, ALPHA POLYPEPTIDE 1, MARVEL DOMAIN CONTAINING 2, FCH DOMAIN 2
6 | SERPINB1, SERPINB6, TUBB2B, TFAP2A, AOF1, MBOAT1, MRS2L, ALDH5A1, TTRAP, CMAH, TRIM38, HMGN4, ZSCAN16, ZNF193, MRPS18B, GTF2H4, BAT1, MICA, TNF, HSPA1B, SRPK1, MAPK13, PEX6 | SERPIN PEPTIDASE INHIBITOR, TUBULIN, BETA 2B, TRANSCRIPTION FACTOR AP-2 ALPHA, AMINE OXIDASE (FLAVIN CONTAINING) DOMAIN 1, MEMBRANE BOUND O-ACYLTRANSFERASE DOMAIN CONTAINING 1, ALDEHYDE DEHYDROGENASE 5 FAMILY, MEMBER A1 (SUCCINATE-SEMIALDEHYDE DEHYDROGENASE), TRAF AND TNF RECEPTOR ASSOCIATED PROTEIN, CYTIDINE MONOPHOSPHATE-N-ACETYLNEURAMINIC ACID HYDROXYLASE, TRIPARTITE MOTIF-CONTAINING 38, HIGH MOBILITY GROUP NUCLEOSOMAL BINDING DOMAIN 4, ZINC FINGER AND SCAN DOMAIN CONTAINING 16, ZINC FINGER PROTEIN 193, MITOCHONDRIAL RIBOSOMAL PROTEIN S18B, TRANSCRIPTION FACTOR IIH, POLYPEPTIDE 4, HLA-B ASSOCIATED TRANSCRIPT 1, MHC CLASS I POLYPEPTIDE-RELATED SEQUENCE A, TUMOUR NECROSIS FACTOR (TNF SUPERFAMILY, MEMBER 2), HEAT SHOCK 70KDA PROTEIN 1B, SFRS PROTEIN KINASE 1, MITOGEN-ACTIVATED PROTEIN KINASE 13, PEROXISOMAL BIOGENESIS FACTOR 6
7 | ZNF655, RABL5 | ZINC FINGER PROTEIN 655, RAB, MEMBER RAS ONCOGENE FAMILY-LIKE 5
8 | CDCA2, GNRH1, BNIP3L, DPYSL2, ELP3, FBXO32 | GONADOTROPIN-RELEASING HORMONE 1, BCL2/ADENOVIRUS E1B 19KDA INTERACTING PROTEIN 3-LIKE, F-BOX PROTEIN 32
9 | KIAA0020, RFX3, JAK2, ERMP1, RANBP6, NFIB, ZDHHC21, DENND4C, RPS6, KLHL9, PLAA, NFX1, NOL6, UBAP2, ZCCHC7, GRHPR, EXOSC3, SHB | JANUS KINASE 2 (A PROTEIN TYROSINE KINASE), ENDOPLASMIC RETICULUM METALLOPEPTIDASE 1, RAN BINDING PROTEIN 6, NUCLEAR FACTOR I/B, ZINC FINGER DHHC-TYPE CONTAINING 21, DENN/MADD DOMAIN CONTAINING 4C, RIBOSOMAL PROTEIN S6, PHOSPHOLIPASE A2-ACTIVATING PROTEIN, NUCLEAR TRANSCRIPTION FACTOR, X-BOX BINDING 1, NUCLEOLAR PROTEIN FAMILY 6 (RNA-ASSOCIATED), UBIQUITIN ASSOCIATED PROTEIN 2, ZINC FINGER, CCHC DOMAIN, GLYOXYLATE REDUCTASE/HYDROXYPYRUVATE REDUCTASE, EXOSOME COMPONENT 3
10 | PRKCQ | PROTEIN KINASE C
11 | ARCN1, TRAPPC4 | ARCHAIN 1, TRAFFICKING PROTEIN PARTICLE COMPLEX 4
12 | RAD52, FOXM1, GAPDH, ENO2, CSDA, GPR19, ST8SIA1, ETNK1, BCAT1, FGFR1OP2, VDR, GPR84, FAM112B, IL26, IL22, MDM1 | GLYCERALDEHYDE-3-PHOSPHATE DEHYDROGENASE, ENOLASE 2 (GAMMA, NEURONAL), COLD SHOCK DOMAIN PROTEIN A, G PROTEIN-COUPLED RECEPTOR 19, ST8 ALPHA-N-ACETYL-NEURAMINIDE ALPHA-2,8-SIALYLTRANSFERASE 1, ETHANOLAMINE KINASE 1, CYTOSOLIC, FGFR1 ONCOGENE PARTNER 2, VITAMIN D (1,25-DIHYDROXYVITAMIN D3) RECEPTOR, G PROTEIN-COUPLED RECEPTOR 84, INTERLEUKIN 26, INTERLEUKIN 22, MDM4 TRANSFORMED 3T3 CELL DOUBLE MINUTE 1 P53 BINDING PROTEIN (MOUSE)
14 | ARHGAP5, PNN, SPG3A, PELI2, DHRS7, SIX1, PCNX | GTPASE ACTIVATING PROTEIN 5, SPASTIC PARAPLEGIA 3A (AUTOSOMAL DOMINANT), DEHYDROGENASE/REDUCTASE (SDR FAMILY) MEMBER 7, SIX HOMEOBOX 1
15 | IGKC, CA12, KIF23, ARIH1, SGK269, KIAA1199, NR2F2 | IMMUNOGLOBULIN KAPPA CONSTANT, CARBONIC ANHYDRASE, KINESIN FAMILY MEMBER 23, NKF3 KINASE FAMILY MEMBER, NUCLEAR RECEPTOR SUBFAMILY
16 | KIAA1576 | KIAA1576
17 | ZNF207, PCGF2, ARL17, KIAA1267, HOXB9, HOXB9, SPAG9, TOB1, PPM1E, YPEL2, MIRN21, PSMD12, KCNJ2, DNAI2 | ZINC FINGER PROTEIN, TRANSDUCER OF ERBB2, 1, PROTEIN PHOSPHATASE 1E (PP2C DOMAIN CONTAINING), PROTEASOME (PROSOME, MACROPAIN) 26S SUBUNIT, DYNEIN, AXONEMAL
19 | EBI3, ZNF260, CCDC114 | EPSTEIN-BARR VIRUS INDUCED GENE 3, ZINC FINGER PROTEIN 260
20 | SNAP25, VSTM2L, SERINC3, PKIG, YWHAB, SPATA2, BCAS1 | SYNAPTOSOMAL-ASSOCIATED PROTEIN, SERINE INCORPORATOR 3, PROTEIN KINASE (CAMP-DEPENDENT, CATALYTIC) INHIBITOR GAMMA, TYROSINE 3-MONOOXYGENASE/TRYPTOPHAN 5-MONOOXYGENASE ACTIVATION PROTEIN, BETA POLYPEPTIDE
21 | MORC3, ZNF295, WDR4 | MORC FAMILY CW-TYPE ZINC FINGER 3, ZINC FINGER PROTEIN 295
22 | UFD1L, MIAT, SFI1 | MYOCARDIAL INFARCTION ASSOCIATED TRANSCRIPT, SFI1 HOMOLOG
23 | CYBB | CYTOCHROME B-245, BETA POLYPEPTIDE

(B) Genes with lowered expression in CN-loss samples

Chrom | Genes | Annotated Genes
1 | SMYD2 | SET AND MYND DOMAIN CONTAINING 2
6 | SNAP91, EPHA7, FNDC1, IGF2R, RNASET2, THBS2 | SYNAPTOSOMAL-ASSOCIATED PROTEIN, EPH RECEPTOR A7, FIBRONECTIN TYPE III DOMAIN CONTAINING 1, INSULIN-LIKE GROWTH FACTOR 2 RECEPTOR, RIBONUCLEASE T2, THROMBOSPONDIN 2
7 | ASZ1, CFTR, GIMAP8 | CYSTIC FIBROSIS TRANSMEMBRANE CONDUCTANCE REGULATOR (ATP-BINDING CASSETTE SUB-FAMILY C, MEMBER 7), GTPASE IMAP FAMILY MEMBER 8
11 | JAM3 | JUNCTIONAL ADHESION MOLECULE 3
13 | LGMN | LEGUMAIN
23 | JARID1D, EIF1AY, BEX1, BEX2, TCEAL4, MAGEA9 | EUKARYOTIC TRANSLATION INITIATION FACTOR 1A Y-LINKED, BRAIN EXPRESSED, X-LINKED 1, X-LINKED 2, TRANSCRIPTION ELONGATION FACTOR A (SII)-LIKE 4, MELANOMA ANTIGEN FAMILY A

3.4  DISCUSSION

3.4.1  Inter-platform mapping

The current base-pair-based mapping was direct and unbiased but had two issues that should be considered. First, the SMRT array consists of overlapping BAC clones, so multiple BACs could be mapped to one GE probe set. If these BACs span a region with differing CN, the inconsistency would destroy the correlation at this site. Second, the Affymetrix platform has a similar problem of inconsistency: a gene could be represented by multiple probe sets, and if these probe sets mapped to different CN regions, any correlation would also be destroyed. A more refined mapping model should first aggregate overlapping features within a platform to remove redundancy and inconsistency for the subsequent analyses.
For example, the values measured by multiple probes at the same locus could be averaged. In the current work, the parameter OP was introduced to remove some of the redundancy without manipulating the GE or CN values (for instance, by averaging). Using this approach, a probe set would not be mapped to a BAC if the overlapping portion was too small.

3.4.2  Single-sample analysis

As an initial analysis, the Pearson correlations indicated that little relationship existed between CN and GEP in our 12 cases. Compared with the 5 datasets analyzed in Gu et al. (2008), the maximum from our dataset was lower than their median (0.1061 and 0.12, respectively). They concluded that CN explained 12%-40% of the variation in GE, but they had larger sample sizes, filter-reduced cDNA microarray data, and mostly cell line experiments. Although our values were small, these computations were a standard and quick method to obtain an overall impression of our dataset.

The chromosome-based single-sample analysis identified recurrent patterns of correlations associated with CN losses of chromosomes 1q, 6q, 7q, 10q, and 16. The best-known correlations, associated with CN gains of 2p and 9p, were found in 2 and 1 samples, respectively. The interpretation of a significant region from this analysis was that the CN gain/loss region had a higher/lower average GEP level compared to all other loci on the same chromosome. Since this was based on observations in one sample only, it remained unknown whether the GEP change in fact traveled with the CN change. A single-sample analysis also required a ‘reference’ and an ‘event’ to be assigned for comparison. While the event was clearly a CNA region, the choice of reference was non-trivial. The space flanking the CNA region was assigned as the reference in this analysis, but a genome-derived CN-neutral space is also possible.
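As a minimal sketch of the ‘event’ versus ‘reference’ comparison described above, the following Python fragment contrasts the mean expression inside a CNA region with the mean over its flanking loci. The function name event_vs_flank, the flank parameter, and the toy expression values are assumptions made for illustration only. The plain difference of means is a stand-in and is not the test statistic used in this work.

```python
from statistics import mean


def event_vs_flank(ge, start, end, flank=3):
    """Mean expression in a CNA 'event' minus the mean in its flanking 'reference'.

    ge    : expression values ordered by chromosomal position
    start : index of the first locus in the event (inclusive)
    end   : index one past the last locus in the event (exclusive)
    flank : number of loci taken on each side as the reference (hypothetical)
    """
    if not 0 <= start < end <= len(ge):
        raise ValueError("event must lie within the chromosome")
    event = ge[start:end]
    # Reference = up to `flank` loci on each side of the event.
    reference = ge[max(0, start - flank):start] + ge[end:end + flank]
    if not reference:
        raise ValueError("no flanking loci available as reference")
    return mean(event) - mean(reference)


# Toy chromosome (invented values): loci 3..5 sit in a hypothetical CN-gain region.
ge_values = [7.0, 7.2, 6.8, 9.0, 9.4, 9.2, 7.1, 6.9, 7.0]
shift = event_vs_flank(ge_values, start=3, end=6, flank=3)
```

Widening flank until it covers every other locus on the chromosome, or swapping the flanking loci for a set of CN-neutral loci drawn from the whole genome, would give the alternative references mentioned above.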
Overall, this analysis provided some insights into individual samples, but it should not be used to identify CN-GE correlations in a group of samples. This motivated the joint analysis, in which analyzing all samples together by CN group should better address the question of CN-GE correlation. If a region had different GE levels between different CN groups, then a correlation would be supported.

3.4.3  Joint analysis

Findings were mostly associated with CN gains, and enrichment was found on chromosomes 2p, 6p, 9p, and 12p. JAK2 on 9p was found but not REL on 2p. Gene Ontology analysis of both the 134 genes associated with CN gains and the 18 genes associated with CN loss showed that over half of the genes were associated with physiological responses and basal cellular processes. Among the genes found were an EBV-induced gene and JAK2, both known to play important roles in the disease. Most of the genes found have not been previously linked to the disease, but their identification in this analysis might suggest a role in the CN-GE interaction of the disease. New insights into the disease might be gained, for example, by identifying genes whose expression changes are highly concordant with the different CN shifts within a disease. These expression shifts might not be detectable by GE analyses alone, since tumour cases often have heterogeneous CN profiles. This justifies the importance of identifying genes involved in CN-GE interactions. Analyzing 12 samples served as a first step towards a comprehensive search for those genes in the Hodgkin’s lymphoma genome, before data from a larger patient group allow a statistically more powerful analysis.

3.5  CONCLUSION AND FUTURE WORK

Steps to refine the presented methods will be examined in upcoming projects in our group. In the first step, the inter-platform mapping will be refined to a gene-based model.
Problems associated with the current approach should thereby be eliminated to produce more accurate findings. All BAC clones and Affymetrix probe sets will be mapped to genes directly, and redundant mappings will be averaged. The genes will act as anchors to merge the two data types, so each gene will correspond to a single GE value and a single CN value.

The chromosome-based single-sample analysis will be replaced by a gene-based approach. For each gene, its CN and GE values will be combined into a single measure indicating the degree of correlation. Originally, multiplication of the two values was considered, but a major problem arose concerning the incomparable scales of the two data types. The CN log ratios reflect a relative difference between two channels and comprise both negative and positive values, while the GE log intensities reflect absolute measured levels on a positive scale only. To multiply the two data types properly, they must first be normalized and transformed to comparable scales. The technique of quantile normalization will be examined along with other relevant normalization strategies.

The joint analysis methods could also be extended from the BAC level to a region level. Since CN patterns are examined in smoothed, contiguous regions, GE correlation should be examined in a similar way to provide confidence and a biologically sound interpretation. The challenge is to locate the minimal overlapping regions across samples as the target regions, since all samples have different CN combinations across a genomic region. The first step thus requires employing a separate algorithm to pre-define minimal overlapping regions across all samples. Array CGH analysis tools such as Bioconductor cghMCR and the stand-alone JAVA application CGHPRO could be explored for this purpose (Chen, Erdogan, Ropers, Lenzner, & Ullmann, 2005; Kim et al., 2005).
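The gene-based merging and rescaling steps proposed in this section could be sketched as follows, in Python. This is only an illustration under stated assumptions: the helper names average_by_gene and quantile_normalize are hypothetical, the (gene, value) pairs are invented toy numbers rather than data from this work, and ties in the quantile step are broken by position for simplicity. How the two rescaled values would finally be combined into one measure is left open, as it is in the text.

```python
from collections import defaultdict
from statistics import mean


def average_by_gene(measurements):
    """Collapse redundant (gene, value) pairs to one mean value per gene."""
    grouped = defaultdict(list)
    for gene, value in measurements:
        grouped[gene].append(value)
    return {gene: mean(values) for gene, values in grouped.items()}


def quantile_normalize(columns):
    """Give equal-length columns one shared distribution.

    Each value is replaced by the mean of the values that hold the same
    rank across all columns. Ties are broken by position (a simplification).
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("columns must have equal length")
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [mean(col[i] for col in sorted_cols) for i in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Toy input (invented values): redundant BAC and probe-set mappings per gene.
cn_raw = [("JAK2", 0.4), ("JAK2", 0.6), ("REL", 0.1), ("IGF2R", -0.3)]
ge_raw = [("JAK2", 9.0), ("REL", 7.5), ("REL", 8.5), ("IGF2R", 6.0)]

cn = average_by_gene(cn_raw)
ge = average_by_gene(ge_raw)

# Genes act as anchors: keep only genes measured on both platforms.
genes = sorted(cn.keys() & ge.keys())
cn_norm, ge_norm = quantile_normalize(
    [[cn[g] for g in genes], [ge[g] for g in genes]]
)
```

After this step each gene carries one CN value and one GE value drawn from the same distribution, so the two columns are at least numerically comparable. Whether a product of such values is a meaningful correlation measure would still need to be examined.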
BIBLIOGRAPHY

Axdorph, U., Sjoberg, J., Grimfors, G., Landgren, O., Porwit-MacDonald, A., & Bjorkholm, M. (2000). Biological markers may add to prediction of outcome achieved by the international prognostic score in hodgkin's disease. Annals of Oncology : Official Journal of the European Society for Medical Oncology / ESMO, 11(11), 1405-1411.
Barth, T. F., Martin-Subero, J. I., Joos, S., Menz, C. K., Hasel, C., Mechtersheimer, G., et al. (2003). Gains of 2p involving the REL locus correlate with nuclear c-rel protein accumulation in neoplastic cells of classical hodgkin lymphoma. Blood, 101(9), 3681-3686. doi:10.1182/blood-2002-08-2577
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57, 289-300.
Bjorkholm, M., Axdorph, U., Grimfors, G., Merk, K., Johansson, B., Landgren, O., et al. (1995). Fixed versus response-adapted MOPP/ABVD chemotherapy in hodgkin's disease. A prospective randomized trial. Annals of Oncology : Official Journal of the European Society for Medical Oncology / ESMO, 6(9), 895-899.
Bolger, G. B., Erdogan, S., Jones, R. E., Loughney, K., Scotland, G., Hoffmann, R., et al. (1997). Characterization of five different proteins produced by alternatively spliced mRNAs from the human cAMP-specific phosphodiesterase PDE4D gene. The Biochemical Journal, 328 (Pt 2), 539-548.
Bolstad, B. (2008). affyPLM: Model based QC assessment of affymetrix GeneChips. Retrieved 2008, from http://bioconductor.org/packages/1.9/bioc/vignettes/affyPLM/inst/doc/QualityAssess.pdf
Boulesteix, A. L., Porzelius, C., & Daumer, M. (2008). Microarray-based classification and clinical predictors: On combined classifiers and additional predictive value. Bioinformatics (Oxford, England), 24(15), 1698-1706. doi:10.1093/bioinformatics/btn262
Braga-Neto, U. M., & Dougherty, E. R. (2004). Is cross-validation valid for small-sample microarray classification? Bioinformatics (Oxford, England), 20(3), 374-380.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
Bureau, A., Dupuis, J., Falls, K., Lunetta, K. L., Hayward, B., Keith, T. P., et al. (2005). Identifying SNPs predictive of phenotype using random forests. Genetic Epidemiology, 28(2), 171-182.
Bussey, K. J., Chin, K., Lababidi, S., Reimers, M., Reinhold, W. C., Kuo, W. L., et al. (2006). Integrating data on DNA copy number with gene expression levels and drug sensitivities in the NCI-60 cell line panel. Molecular Cancer Therapeutics, 5(4), 853-867.
Carbone, P. P., Kaplan, H. S., Musshoff, K., Smithers, D. W., & Tubiana, M. (1971). Report of the committee on hodgkin's disease staging classification. Cancer Research, 31(11), 1860-1861.
Chen, W., Erdogan, F., Ropers, H. H., Lenzner, S., & Ullmann, R. (2005). CGHPRO -- a comprehensive data analysis tool for array CGH. BMC Bioinformatics, 6, 85. doi:10.1186/1471-2105-6-85
Chetaille, B., Bertucci, F., Finetti, P., Esterni, B., Stamatoullas, A., Picquenot, J. M., et al. (2009). Molecular profiling of classical Hodgkin lymphoma tissues uncovers variations in the tumor microenvironment and correlations with EBV infection and outcome. Blood, 113(12), 2765-2775.
Chi, B., DeLeeuw, R. J., Coe, B. P., MacAulay, C., & Lam, W. L. (2004). SeeGH--a software tool for visualization of whole genome array comparative genomic hybridization data. BMC Bioinformatics, 5, 13. doi:10.1186/1471-2105-5-13
Chui, D. T., Hammond, D., Baird, M., Shield, L., Jackson, R., & Jarrett, R. F. (2003). Classical hodgkin lymphoma is associated with frequent gains of 17q. Genes, Chromosomes & Cancer, 38(2), 126-136.
Coles, L. S., Diamond, P., Occhiodoro, F., Vadas, M. A., & Shannon, M. F. (1996). Cold shock domain proteins repress transcription from the GM-CSF promoter. Nucleic Acids Research, 24(12), 2311-2317.
Deacon, E. M., Pallesen, G., Niedobitek, G., Crocker, J., Brooks, L., Rickinson, A. B., et al. (1993). Epstein-barr virus and hodgkin's disease: Transcriptional analysis of virus latency in the malignant cells. The Journal of Experimental Medicine, 177(2), 339-349.
Devilard, E., Bertucci, F., Trempat, P., Bouabdallah, R., Loriod, B., Giaconia, A., et al. (2002). Gene expression profiling defines molecular subtypes of classical hodgkin's disease. Oncogene, 21(19), 3095-3102.
Diaz-Uriarte, R., & Alvarez de Andres, S. (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7, 3. doi:10.1186/1471-2105-7-3
Diehl, V., Engert, A., & Re, D. (2007). New strategies for the treatment of advanced-stage hodgkin's lymphoma. Hematology/oncology Clinics of North America, 21(5), 897-914.
Diehl, V., Stein, H., Hummel, M., Zollinger, R., & Connors, J. M. (2003). Hodgkin's lymphoma: Biology and treatment strategies for primary, refractory, and relapsed disease. Hematology / the Education Program of the American Society of Hematology. American Society of Hematology. Education Program, 225-247.
Dudoit, S., Fridlyand, J., & Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97, 77-87.
Eliopoulos, A. G., Caamano, J. H., Flavell, J., Reynolds, G. M., Murray, P. G., Poyet, J. L., et al. (2003). Epstein-barr virus-encoded latent infection membrane protein 1 regulates the processing of p100 NF-kappaB2 to p52 via an IKKgamma/NEMO-independent signalling pathway. Oncogene, 22(48), 7557-7569.
Falzetti, D., Crescenzi, B., Matteuci, C., Falini, B., Martelli, M. F., Van Den Berghe, H., et al. (1999). Genomic instability and recurrent breakpoints are main cytogenetic findings in hodgkin's disease. Haematologica, 84(4), 298-305.
Foss, H. D., Marafioti, T., & Stein, H. (2000). Hodgkin lymphoma. classification and pathogenesis. [Hodgkin-Lymphome. Klassifikation und Pathogenese] Der Pathologe, 21(2), 113-123.
Franklin, J., Paulus, U., Lieberz, D., Breuer, K., Tesch, H., & Diehl, V. (2000). Is the international prognostic score for advanced stage hodgkin's disease applicable to early stage patients? german hodgkin lymphoma study group. Annals of Oncology : Official Journal of the European Society for Medical Oncology / ESMO, 11(5), 617-623.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55, 119-139.
Gentleman, R. (2008). Geneplotter. Retrieved 2008, from http://www.bioconductor.org/packages/2.3/bioc/html/geneplotter.html
Gevaert, O., De Smet, F., Timmerman, D., Moreau, Y., & De Moor, B. (2006). Predicting the prognosis of breast cancer by integrating clinical and microarray data with bayesian networks. Bioinformatics (Oxford, England), 22(14), e184-90.
Gu, W., Choi, H., & Ghosh, D. (2008). Global associations between copy number and transcript mRNA microarray data: An empirical study. Cancer Informatics, 6, 17-23.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157.
Habib, G. M., Shi, Z. Z., Cuevas, A. A., & Lieberman, M. W. (2003). Identification of two additional members of the membrane-bound dipeptidase family. The FASEB Journal : Official Publication of the Federation of American Societies for Experimental Biology, 17(10), 1313-1315. doi:10.1096/fj.02-0899fje
Hansmann, M. L., & Willenbrock, K. (2002). WHO classification of hodgkin's lymphoma and its molecular pathological relevance. [Die WHO-Klassifikation des Hodgkin-Lymphoms und ihre molekularpathologische Relevanz] Der Pathologe, 23(3), 207-218.
Hartmann, S., Martin-Subero, J. I., Gesk, S., Husken, J., Giefing, M., Nagel, I., et al. (2008). Detection of genomic imbalances in microdissected hodgkin and reed-sternberg cells of classical hodgkin's lymphoma by array-based comparative genomic hybridization. Haematologica, 93(9), 1318-1326. doi:10.3324/haematol.12875
Hasenclever, D., & Diehl, V. (1998). A prognostic score for advanced hodgkin's disease. international prognostic factors project on advanced hodgkin's disease. The New England Journal of Medicine, 339(21), 1506-1514.
Hyman, E., Kauraniemi, P., Hautaniemi, S., Wolf, M., Mousses, S., Rozenblum, E., et al. (2002). Impact of DNA amplification on gene expression patterns in breast cancer. Cancer Research, 62(21), 6240-6245.
Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., et al. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics (Oxford, England), 4(2), 249-264.
Ishkanian, A. S., Malloff, C. A., Watson, S. K., DeLeeuw, R. J., Chi, B., Coe, B. P., et al. (2004). A tiling resolution DNA microarray with complete coverage of the human genome. Nature Genetics, 36(3), 299-303. doi:10.1038/ng1307
Isola, J., Chu, L., DeVries, S., Matsumura, K., Chew, K., Ljung, B. M., et al. (1999). Genetic alterations in ERBB2-amplified breast carcinomas. Clinical Cancer Research : An Official Journal of the American Association for Cancer Research, 5(12), 4140-4145.
Jaffe, E. S., Harris, N. L., Diebold, J., & Muller-Hermelink, H. K. (1998). World health organization classification of lymphomas: A work in progress. Annals of Oncology : Official Journal of the European Society for Medical Oncology / ESMO, 9 Suppl 5, S25-30.
Janz, M., Hummel, M., Truss, M., Wollert-Wulf, B., Mathas, S., Johrens, K., et al. (2006). Classical hodgkin lymphoma is characterized by high constitutive expression of activating transcription factor 3 (ATF3), which promotes viability of Hodgkin/Reed-sternberg cells. Blood, 107(6), 2536-2539. doi:10.1182/blood-2005-07-2694
Joos, S., Granzow, M., Holtgreve-Grez, H., Siebert, R., Harder, L., Martin-Subero, J. I., et al. (2003). Hodgkin's lymphoma cell lines are characterized by frequent aberrations on chromosomes 2p and 9p including REL and JAK2. International Journal of Cancer. Journal International Du Cancer, 103(4), 489-495.
Joos, S., Kupper, M., Ohl, S., von Bonin, F., Mechtersheimer, G., Bentz, M., et al. (2000). Genomic imbalances including amplification of the tyrosine kinase gene JAK2 in CD30+ hodgkin cells. Cancer Research, 60(3), 549-552.
Joos, S., Menz, C. K., Wrobel, G., Siebert, R., Gesk, S., Ohl, S., et al. (2002). Classical hodgkin lymphoma is characterized by recurrent copy number gains of the short arm of chromosome 2. Blood, 99(4), 1381-1387.
Jost, P. J., & Ruland, J. (2007). Aberrant NF-kappaB signaling in lymphoma: Mechanisms, consequences, and therapeutic implications. Blood, 109(7), 2700-2707.
Kanzler, H., Kuppers, R., Hansmann, M. L., & Rajewsky, K. (1996). Hodgkin and reed-sternberg cells in hodgkin's disease represent the outgrowth of a dominant tumor clone derived from (crippled) germinal center B cells. The Journal of Experimental Medicine, 184(4), 1495-1505.
Karube, K., Ohshima, K., Suzumiya, J., Kawano, R., Kikuchi, M., & Harada, M. (2006). Gene expression profile of cytokines and chemokines in microdissected primary hodgkin and reed-sternberg (HRS) cells: High expression of interleukin-11 receptor alpha. Annals of Oncology : Official Journal of the European Society for Medical Oncology / ESMO, 17(1), 110-116. doi:10.1093/annonc/mdj064
Khojasteh, M., Lam, W. L., Ward, R. K., & MacAulay, C. (2005). A stepwise framework for the normalization of array CGH data. BMC Bioinformatics, 6, 274. doi:10.1186/1471-2105-6-274
Kim, S. Y., Nam, S. W., Lee, S. H., Park, W. S., Yoo, N. J., Lee, J. Y., et al. (2005). ArrayCyGHt: A web application for analysis and visualization of array-CGH data. Bioinformatics (Oxford, England), 21(10), 2554-2555. doi:10.1093/bioinformatics/bti357
Kluiver, J., Kok, K., Pfeil, I., de Jong, D., Blokzijl, T., Harms, G., et al. (2007). Global correlation of genome and transcriptome changes in classical hodgkin lymphoma. Hematological Oncology, 25(1), 21-29. doi:10.1002/hon.804
Krishnapuram, B., Carin, L., Figueiredo, M. A., & Hartemink, A. J. (2005). Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 957-968.
Krishnapuram, B., Hartemink, A. J., Carin, L., & Figueiredo, M. A. (2004). A bayesian approach to joint feature selection and classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9), 1105-1111. doi:10.1109/TPAMI.2004.55
Kupper, M., Joos, S., von Bonin, F., Daus, H., Pfreundschuh, M., Lichter, P., et al. (2001). MDM2 gene amplification and lack of p53 point mutations in hodgkin and reed-sternberg cells: Results from single-cell polymerase chain reaction and molecular cytogenetic studies. British Journal of Haematology, 112(3), 768-775.
Kuppers, R. (2009). The biology of hodgkin's lymphoma. Nature Reviews. Cancer, 9(1), 15-27. doi:10.1038/nrc2542
Kuppers, R., Klein, U., Schwering, I., Distler, V., Brauninger, A., Cattoretti, G., et al. (2003). Identification of hodgkin and reed-sternberg cell-specific genes by gene expression profiling. The Journal of Clinical Investigation, 111(4), 529-537.
Kuppers, R., & Rajewsky, K. (1998). The origin of hodgkin and Reed/Sternberg cells in hodgkin's disease. Annual Review of Immunology, 16, 471-493.
Kuppers, R., Schwering, I., Brauninger, A., Rajewsky, K., & Hansmann, M. L. (2002). Biology of hodgkin's lymphoma. Annals of Oncology : Official Journal of the European Society for Medical Oncology / ESMO, 13 Suppl 1, 11-18.
Lachenbruch, P. A. (1967). An almost unbiased method of obtaining confidence intervals for the probability of misclassification in discriminant analysis. Biometrics, 23(4), 639-645.
Linn, S. C., West, R. B., Pollack, J. R., Zhu, S., Hernandez-Boussard, T., Nielsen, T. O., et al. (2003). Gene expression patterns and gene copy number changes in dermatofibrosarcoma protuberans. The American Journal of Pathology, 163(6), 2383-2395.
Lister, T. A., Crowther, D., Sutcliffe, S. B., Glatstein, E., Canellos, G. P., Young, R. C., et al. (1989). Report of a committee convened to discuss the evaluation and staging of patients with hodgkin's disease: Cotswolds meeting. Journal of Clinical Oncology : Official Journal of the American Society of Clinical Oncology, 7(11), 1630-1636.
Lunetta, K. L., Hayward, L. B., Segal, J., & Van Eerdewegh, P. (2004). Screening large-scale association study data: Exploiting interactions using random forests. BMC Genetics, 5(1), 32. doi:10.1186/1471-2156-5-32
Ma, S., & Huang, J. (2008). Penalized feature selection and classification in bioinformatics. Briefings in Bioinformatics, 9(5), 392-403. doi:10.1093/bib/bbn027
Maggio, E., van den Berg, A., Diepstra, A., Kluiver, J., Visser, L., & Poppema, S. (2002). Chemokines, cytokines and their receptors in hodgkin's lymphoma cell lines and tissues. Annals of Oncology : Official Journal of the European Society for Medical Oncology / ESMO, 13 Suppl 1, 52-56.
Man, M. Z., Dyson, G., Johnson, K., & Liao, B. (2004). Evaluating methods for classifying expression data. Journal of Biopharmaceutical Statistics, 14(4), 1065-1084.
Marafioti, T., Hummel, M., Foss, H. D., Laumen, H., Korbjuhn, P., Anagnostopoulos, I., et al. (2000). Hodgkin and reed-sternberg cells represent an expansion of a single clone originating from a germinal center B-cell with functional immunoglobulin gene rearrangements but defective immunoglobulin transcription. Blood, 95(4), 1443-1450.
Markowetz, F., Ruschhaupt, M., & Spang, R. (2007). Molecular diagnosis - tumor classification by PAM and random forest. Retrieved 2008, from http://compdiag.molgen.mpg.de/ngfn/pma2007may.php
Marshall, N. A., Christie, L. E., Munro, L. R., Culligan, D. J., Johnston, P. W., Barker, R. N., et al. (2004). Immunosuppressive regulatory T cells are abundant in the reactive lymphocytes of hodgkin lymphoma. Blood, 103(5), 1755-1762. doi:10.1182/blood-2003-07-2594
Martin-Subero, J. I., Gesk, S., Harder, L., Sonoki, T., Tucker, P. W., Schlegelberger, B., et al. (2002). Recurrent involvement of the REL and BCL11A loci in classical hodgkin lymphoma. Blood, 99(4), 1474-1477.
Martin-Subero, J. I., Klapper, W., Sotnikova, A., Callet-Bauchu, E., Harder, L., Bastard, C., et al. (2006). Chromosomal breakpoints affecting immunoglobulin loci are recurrent in hodgkin and reed-sternberg cells of classical hodgkin lymphoma. Cancer Research, 66(21), 10332-10338.
Muschen, M., Rajewsky, K., Brauninger, A., Baur, A. S., Oudejans, J. J., Roers, A., et al. (2000). Rare occurrence of classical hodgkin's disease as a T cell lymphoma. The Journal of Experimental Medicine, 191(2), 387-394.
National cancer institute hodgkin lymphoma. (2008). Retrieved 08/01, 2008, from http://www.cancer.gov/cancertopics/types/hodgkin
Oda, T., Muramatsu, M. A., Isogai, T., Masuho, Y., Asano, S., & Yamashita, T. (2001). HSH2: A novel SH2 domain-containing adapter protein involved in tyrosine kinase signaling in hematopoietic cells. Biochemical and Biophysical Research Communications, 288(5), 1078-1086.
Pileri, S. A., Ascani, S., Leoncini, L., Sabattini, E., Zinzani, P. L., Piccaluga, P. P., et al. (2002). Hodgkin's lymphoma: The pathologist's viewpoint. Journal of Clinical Pathology, 55(3), 162-176.
Pollack, J. R., Sorlie, T., Perou, C. M., Rees, C. A., Jeffrey, S. S., Lonning, P. E., et al. (2002). Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proceedings of the National Academy of Sciences of the United States of America, 99(20), 12963-12968. doi:10.1073/pnas.162471999
Poppema, S. (2005). Immunobiology and pathophysiology of hodgkin lymphomas. Hematology / the Education Program of the American Society of Hematology. American Society of Hematology. Education Program, 231-238.
Re, D., Kuppers, R., & Diehl, V. (2005). Molecular pathogenesis of hodgkin's lymphoma. Journal of Clinical Oncology : Official Journal of the American Society of Clinical Oncology, 23(26), 6379-6386.
Remlinger, K. (2007). RandomForest as a variable selection tool for biomarker data. Retrieved 2007, from http://statgen.ncsu.edu/icsa2007/talks/Session%208B%20Remlinger.ppt
Saeys, Y., Inza, I., & Larranaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics (Oxford, England), 23(19), 2507-2517. doi:10.1093/bioinformatics/btm344
Salloum, E., Doria, R., Schubert, W., Zelterman, D., Holford, T., Roberts, K. B., et al. (1996). Second solid tumors in patients with hodgkin's disease cured after radiation or chemotherapy plus adjuvant low-dose radiation. Journal of Clinical Oncology : Official Journal of the American Society of Clinical Oncology, 14(9), 2435-2443.
Sanchez-Aguilera, A., Montalban, C., de la Cueva, P., Sanchez-Verde, L., Morente, M. M., Garcia-Cosio, M., et al. (2006). Tumor microenvironment and mitotic checkpoint are key factors in the outcome of classic hodgkin lymphoma. Blood, 108(2), 662-668. doi:10.1182/blood-2005-12-5125
Schwering, I., Brauninger, A., Klein, U., Jungnickel, B., Tinguely, M., Diehl, V., et al. (2003). Loss of the B-lineage-specific gene expression program in hodgkin and reed-sternberg cells of hodgkin lymphoma. Blood, 101(4), 1505-1512. doi:10.1182/blood-2002-03-0839
Shah, S. P., Xuan, X., DeLeeuw, R. J., Khojasteh, M., Lam, W. L., Ng, R., et al. (2006). Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics (Oxford, England), 22(14), e431-9.
Simon, R., Radmacher, M. D., Dobbin, K., & McShane, L. M. (2003). Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute, 95(1), 14-18.
Skinnider, B. F., & Mak, T. W. (2002). The role of cytokines in classical hodgkin lymphoma. Blood, 99(12), 4283-4297.
Smith, J. M., Bowles, J., Wilson, M., & Koopman, P. (2004). HMG box transcription factor gene Hbp1 is expressed in germ cells of the developing mouse testis. Developmental Dynamics : An Official Publication of the American Association of Anatomists, 230(2), 366-370.
Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8, 25. doi:10.1186/1471-2105-8-25
Sun, Y., Goodison, S., Li, J., Liu, L., & Farmerie, W. (2007). Improved breast cancer prognosis through the combination of clinical and genetic markers. Bioinformatics (Oxford, England), 23(1), 30-37. doi:10.1093/bioinformatics/btl543
Sweet-Cordero, A., Tseng, G. C., You, H., Douglass, M., Huey, B., Albertson, D., et al. (2006). Comparison of gene expression and DNA copy number changes in a murine model of lung cancer. Genes, Chromosomes & Cancer, 45(4), 338-348.
Thomas, R. K., Re, D., Wolf, J., & Diehl, V. (2004). Part I: Hodgkin's lymphoma--molecular biology of hodgkin and reed-sternberg cells. The Lancet Oncology, 5(1), 11-18.
van Leeuwen, F. E., Klokman, W. J., Veer, M. B., Hagenbeek, A., Krol, A. D., Vetter, U. A., et al. (2000). Long-term risk of second malignancy in survivors of hodgkin's disease treated during adolescence or young adulthood. Journal of Clinical Oncology : Official Journal of the American Society of Clinical Oncology, 18(3), 487-497.
van Spronsen, D. J., & Veldhuis, G. J. (2003). New developments in staging and follow-up of patients with hodgkin's lymphoma. The Netherlands Journal of Medicine, 61(9), 278-284.
van Wieringen, W. N., Belien, J. A., Vosse, S. J., Achame, E. M., & Ylstra, B. (2006). ACE-it: A tool for genome-wide integration of gene dosage and RNA expression data. Bioinformatics (Oxford, England), 22(15), 1919-1920. doi:10.1093/bioinformatics/btl269
Wang, Q., Diskin, S., Rappaport, E., Attiyeh, E., Mosse, Y., Shue, D., et al. (2006). Integrative genomics identifies distinct molecular classes of neuroblastoma and shows that multiple genes are targeted by regional alterations in DNA copy number. Cancer Research, 66(12), 6050-6062.
Yang, S., Jeung, H. C., Jeong, H. J., Choi, Y. H., Kim, J. E., Jung, J. J., et al. (2007). Identification of genes with correlated patterns of variations in DNA copy number and gene expression level in gastric cancer. Genomics, 89(4), 451-459. doi:10.1016/j.ygeno.2006.12.001
Yoshimoto, T., Matsuura, K., Karnan, S., Tagawa, H., Nakada, C., Tanigawa, M., et al. (2007). High-resolution analysis of DNA copy number alterations and gene expression in renal clear cell carcinoma. The Journal of Pathology, 213(4), 392-401.
