UBC Theses and Dissertations

Computational tools for CNV detection using probe-level analysis of Affymetrix SNP arrays : application… Farnoud, Noushin 2012
Computational Tools for CNV Detection Using Probe-level Analysis of Affymetrix SNP Arrays: Application to the Study of CNVs in Follicular Lymphoma

by Noushin R. Farnoud

M.A.Sc. Electrical Engineering, Ryerson University, 2004
B.Sc. Computer Software Engineering, Shahid Beheshti University, 2000

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Genetics)

The University of British Columbia (Vancouver)
August 2012
© Noushin R. Farnoud, 2012

Abstract

Copy number variants (CNVs) account for both variation among normal individuals and pathogenic variation. The introduction of DNA microarrays significantly improved the resolution of detectable CNVs and yielded a new perspective on submicroscopic CNVs. Oligonucleotide microarrays, such as Affymetrix SNP arrays, have been commonly used for genome-wide CNV analysis. Despite improvements in the technology, a major concern in using microarrays is how a putative CNV is defined. A disadvantage of oligonucleotide arrays is the poor signal-to-noise ratio of the data, which leads to considerable variation in reported intensity readouts. Such variation produces false positive and false negative results regardless of how the data are analysed. The most common approach to circumvent this problem is to look for abrupt intensity-ratio shifts across several consecutive markers (e.g., SNP probes). However, this approach reduces the overall resolution and mitigates the sensitivity for detecting CNVs spanning fewer probes. This limitation emphasizes the importance of designing methods that can identify noisy readouts at the probe level. The main goals of this work were to study the scale of the variability in Affymetrix SNP arrays and to develop computational tools that improve the resolution of CNV detection.
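The consecutive-marker limitation described above can be illustrated with a minimal sketch. This is not the thesis's OPAS method; the log2-ratio threshold and minimum probe count are invented parameters chosen only to show the behaviour: a deletion is called when at least a fixed number of adjacent probes all show a suppressed log2 ratio, so a real CNV covered by fewer probes than that minimum is necessarily missed.

```python
def call_deletions(log2_ratios, threshold=-0.3, min_probes=5):
    """Call deletions where >= min_probes consecutive log2 ratios fall below threshold.

    Returns a list of (start_index, end_index) probe intervals.
    """
    calls, start = [], None
    for i, lr in enumerate(log2_ratios):
        if lr < threshold:
            if start is None:
                start = i          # open a candidate run of suppressed probes
        else:
            if start is not None and i - start >= min_probes:
                calls.append((start, i - 1))   # run is long enough: call it
            start = None           # run too short: discard it
    if start is not None and len(log2_ratios) - start >= min_probes:
        calls.append((start, len(log2_ratios) - 1))
    return calls

# A 6-probe deletion is detected, but an equally deep 3-probe deletion is not:
signal = [0.0] * 10 + [-0.8] * 6 + [0.0] * 10 + [-0.8] * 3 + [0.0] * 5
print(call_deletions(signal))  # [(10, 15)]
```

Lowering `min_probes` recovers the small event but also admits isolated noisy probes as false calls, which is exactly the resolution/noise trade-off motivating probe-level analysis.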
Using simulated data, it was shown that the proposed method improved the accuracy and precision of detecting CNVs with fewer probes compared to standard methods. The approach was also applied to tumor/normal pairs from 25 follicular lymphoma patients, and 286 candidate CNVs were found, of which 261 (91.2%) were also seen by other array-based method(s). Importantly, of 32 novel deletions undetected by other array-based methods, at least 15 (47%) were shown to be real by sequence-based validation. An example of a novel discovery was a partial deletion of the extracellular domain of the KIT proto-oncogene that may lead to constitutive activation of this gene. Gain-of-function mutations of KIT have been previously reported in several other hematologic cancers through other mechanisms, such as point mutations. In conclusion, CNV discovery contributes to our understanding of complex diseases, and the methods presented here should provide means for better detection of CNVs and their interpretation.

Preface

Portions of the statistical methods used to analyse the copy number data in Chapters 3 and 4 have been described in: T. J. Pugh, A. D. Delaney, N. Farnoud, S. Flibotte, M. Griffith, H. I. Li, H. Qian, P. Farinha, R. D. Gascoyne and M. A. Marra, "Impact of whole genome amplification on analysis of copy number variants". Nucleic Acids Res. 2008; 36(13): e80. I was involved in the computational aspects of this study and wrote the relevant sections of the paper. Specifically, I performed all computational analysis described in the 'Sequence analysis of recurrent whole genome amplification-induced artifacts' section of the paper, including post-analysis of copy number data and investigation of how various factors, such as GC content and repetitive sequences, influence this analysis. I also generated Figure 4 of the paper. The tools described in Chapters 3 and 4 were also used in the following manuscripts and conference proceedings: J. M. Friedman, A. Baross, A. D.
Delaney, A. Ally, L. Arbour, J. Asano, D. K. Bailey, S. Barber, P. Birch, M. Brown-John, M. Cao, S. Chan, D. L. Charest, N. Farnoud, N. Fernandes, S. Flibotte, A. Go, W. T. Gibson, R. A. Holt, S. J. M. Jones, G. C. Kennedy, M. Krzywinski, S. Langlois, H. I. Li, B. C. McGillivray, T. Nayar, T. J. Pugh, E. Rajcan-Separovic, J. E. Schein, A. Schnerch, A. Siddiqui, M. I. Van Allen, G. Wilson, S.-L. Yong, F. Zahir, P. Eydoux, and M. A. Marra. "Oligonucleotide Microarray Analysis of Genomic Imbalance in Children with Mental Retardation". American Journal of Human Genetics. September 2006; 79(3): 500-513. M. Rahmani, M. Earp, P. Pannu, N. Farnoud, J. Wu, L. Akhabir, J. Halaschek-Wiener, B. Munt, C. Thompson, S. Mitropanopoulos, D. Craig, P. Par, B. McManus, and A. Brooks-Wilson. "Identification of novel risk loci for calcific aortic valve stenosis on chromosome 1 by a genome-wide scan of 1,000,000 single nucleotide polymorphisms". Proceedings of National Research Forum for Young Investigators in Circulatory and Respiratory Health, May 2009. N. Farnoud, S. Chan, S. Flibotte, A. Delaney, J. M. Friedman and M. A. Marra. "DLOH: A novel bioinformatics tool for detection of copy-number deletions using LOH data". Proceedings of Advances in Genome Biology and Technology conference, February 2008. A manuscript based on Chapter 3 is in preparation: Noushin Farnoud, Stephane Flibotte, Inanc Birol, P. Eydoux, Robert A. Holt, J. M. Friedman and Marco A. Marra. "Detecting DNA copy-number variations based on probe-level analysis of Affymetrix SNP array data". This manuscript reports the details of the OPAS copy number analysis approach explained in Chapter 3. I developed all the methods described in the manuscript and made the relevant figures and tables with input from my supervisor M.A.M. and members of my supervisory committee J.M.F. and R.A.H. J.M.F. provided samples and P.E. performed FISH experiments.
Figures 3.1, 3.9, 3.11 and 3.15 of Chapter 3 and Table I.1 are taken directly from this manuscript draft. Figures 3.6, 3.7 and 3.12 of Chapter 3 are from its supplementary material. S.F. and I.B. provided guidance for statistical analysis. M.A.M., R.A.H., P.E. and J.M.F. provided guidance for biological interpretation of the CNV results. M.A.M. helped in data interpretation and provided supervisory support, including manuscript revision. A version of Chapter 4 is in preparation: Noushin Farnoud, Andy J. Mungall, Susana Ben-Neriah, Andy Chu, Martin Krzywinski, Inanc Birol, Jacqueline Schein, Randy Gascoyne and Marco A. Marra. "Integrated genome-wide DNA copy number and expression analysis of follicular lymphoma genomes". The samples and SNP array data were provided by R.G. at the Centre for Lymphoid Cancers. The fluorescence in situ hybridization (FISH) validation experiments in this manuscript were performed by S.B.-N. at the Clinical Cancer Genetics Laboratory at the BC Cancer Agency. The DNA fingerprint profiling of the samples was conducted by J.S., and sequence validation experiments were performed by A.J.M. at the Genome Sciences Centre. A.C. implemented the Tumordb database and also helped to obtain the data from the multiple platforms used in the integrated analysis. For this study, I performed all the array copy number analysis of follicular lymphoma samples (detailed in Sections 4.3.3-4.3.6). I also implemented several additional scripts to summarize the CNV findings in large-scale studies (automated using Matlab and PERL), such as scripts to obtain recurrently affected regions across multiple patients (used to generate Figures 4.9 and 4.17, both taken from the manuscript draft), and a script to automatically generate UCSC tracks that represent the CNV results with comprehensive graphical details (such as shown in Figure 4.25c).
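The recurrence-summarization step described above can be sketched roughly as follows. This is a simplified illustration, not the thesis's actual Matlab/PERL scripts; the 1-Mb binning, the input layout, and the patient identifiers are assumptions made only for the example.

```python
from collections import defaultdict

def recurrent_regions(cnv_calls, min_patients=2):
    """Count how many patients carry a CNV overlapping each genomic bin.

    cnv_calls: {patient_id: [(chrom, start_bp, end_bp), ...]}
    Returns {(chrom, bin_index): patient_count} for bins hit in >= min_patients.
    """
    BIN = 1_000_000  # 1-Mb bins; bin size is an illustrative choice
    hits = defaultdict(set)
    for patient, calls in cnv_calls.items():
        for chrom, start, end in calls:
            for b in range(start // BIN, end // BIN + 1):
                hits[(chrom, b)].add(patient)  # set: count each patient once
    return {region: len(p) for region, p in hits.items() if len(p) >= min_patients}

calls = {
    "ht-7": [("1", 500_000, 2_400_000)],       # a 1p36-like distal deletion
    "ht-9": [("1", 100_000, 1_200_000)],       # overlapping deletion, second patient
    "ht-20": [("4", 54_000_000, 55_500_000)],  # non-recurrent event
}
print(recurrent_regions(calls))  # {('1', 0): 2, ('1', 1): 2}
```

Only the chromosome 1 bins shared by two patients survive the recurrence filter; the singleton chromosome 4 event is dropped.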
I also made the figures and tables, interpreted the results, and wrote the sections of the manuscript (a version of Sections 4.2-4.4 of the thesis), with the exception of the FISH images, which were provided by S.B.-N. (Figures 4.3b, 4.28b, 4.32). I also performed the pathway analysis using Ingenuity software (Figure 4.21 from the manuscript). M.A.M. supervised all aspects of this study. I also conducted an integrative analysis of copy number and expression data. The integrated analysis was performed using an available software package (DR-Integrator) and a script that I designed to integrate copy-number/expression results for a single sample. The latter method is not included in the thesis since this work was done following thesis preparation. Under the supervision of M.A.M., I combined the data from different platforms and interpreted the results. M.A.M., A.J.M. and J.S. provided guidance for biological interpretation of the CNV results, and R.G. helped in the conception of integrating gene expression and copy number results. A.J.M. implemented the experiments for Illumina sequence validation of several CNVs (Section, adapted from the manuscript), and was also involved in designing a computational model to investigate a hypothetical fusion between the HVCN1 and PPTC7 genes (Section and Figure 4.29, also from the manuscript). Figures 4.7, 4.10, 4.17, 4.21, 4.29 and 4.34 and Tables 4.2, 4.4 and 4.7 are taken from the manuscript draft, and Figures 4.9, 4.15, 4.31, 4.32, 4.33 and J.1 are from the supplementary material.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments

1 Methods and Strategies for Analysing Copy Number Variation Using DNA Microarrays
  1.1 Introduction
  1.2 Technologies
    1.2.1 First Generation Techniques
      Chromosome Banding
      Spectral Karyotyping (SKY) and Multiple Fluorescent in Situ Hybridization (M-FISH) Analysis
    1.2.2 Second Generation High-Resolution Techniques
      Comparative Genome Hybridization (CGH)
      DNA Oligonucleotide Arrays
      Genotyping Arrays
        Affymetrix SNP Arrays
        Illumina SNP Arrays
        Application of SNP Arrays for CNV Detection
    1.2.3 Third Generation High-Resolution Techniques
  1.3 Methods to Discover CNVs Using SNP-array Data
    1.3.1 Normalization Methods
    1.3.2 CNV Calling Methods
  1.4 Limitations of Current CNV Detection Algorithms for SNP Data Analysis
  1.5 Thesis Objectives and Hypothesis
  1.6 Chapter Summaries
  1.7 Figures and Tables

2 Analysing Variability in Microarray Data
  2.1 Introduction
    2.1.1 Variability in Microarray Data
    2.1.2 Biological Variability
    2.1.3 Technical Variability
  2.2 Methods for Measuring and Quantifying Microarray Variability
    2.2.1 Quantifying Technical Variability in Affymetrix SNP Arrays
      Log-transformation
      Coefficient of Variation (CV)
    2.2.2 A Link Between the CV and the Probability of Observing k-fold Disparities Between Replicate Measurements
  2.3 Results
    2.3.1 Affymetrix 10K Replicate Experiment
    2.3.2 Quantifying Technical Variability in Affymetrix SNP Arrays
    2.3.3 Assessing Chip Variability in the Replicate Dataset (10K)
    2.3.4 Assessing Labeling Variability in the Replicate Dataset
    2.3.5 Analysing the Relationship Between Oligo-level and SNP-level Variabilities
      The Acceptable Range of Variability Between Replicate Oligos
      Finding a Link Between Oligo-level CV and the Changes in SNP-level LR Values
      Evaluating the Impact of Oligo-level Variability on the Extent and Frequency of Noisy SNPs in Replicate Experiments
  2.4 Conclusions
  2.5 Figures and Tables

3 Algorithm for Oligonucleotide Probe-level Analysis of Signal Intensities (OPAS)
  3.1 Introduction
  3.2 Methods
    3.2.1 Algorithm Design
    3.2.2 SNP Pre-processing Phase
      Quantile Normalization
      Clustering Individual Oligonucleotide Probes in a SNP Probe Set
        Cluster Prediction: Fuzzy Subtractive Clustering
        Cluster Estimation: k-means Clustering
      SNP Classification
      Likelihood Estimation: Kolmogorov-Smirnov (KS) Test
      Machine Learning Classifier Based on Discriminant Analysis
    3.2.3 Alternative Approach for SNP Pre-processing Based on Naive Bayes Classification
    3.2.4 Post-processing and CNV Calling
      PCR Fragment Length Normalization
      Circular Binary Segmentation (CBS) Algorithm
    3.2.5 OPAS Visualization and Other Features
    3.2.6 Simulated Data for Comparative Analysis of CNV Calling Algorithms
    3.2.7 Analysing the Effect of Noise on CNV Calling Performance
    3.2.8 Comparative Analysis of OPAS and Circular Binary Segmentation
  3.3 Results
    3.3.1 Patterns of LR Intensity Fluctuations in SNP Array Data
    3.3.2 Analysing the Impact of the Size of the Reference Set on the Estimated SNP Signals
      Impact of the Size of the Reference Set on the Estimated LR Values
      Impact of the Size of the Reference Set on the Number of CNV-affirmative Oligos
    3.3.3 Results of OPAS Pre-processing Phase
      Clustering PM Oligos in SNP Probe-sets
      Performing Oligonucleotide Probe-level Analysis of the SNP Array Data
      SNP Classification and LR Estimation
      Comparison of OPAS Pre-processing and Naive Bayes Classification
      An Example of the Impact of SNP Pre-processing on Improving CNV Data Quality
    3.3.4 Results of OPAS Post-processing Phase
    3.3.5 Assessing the Effect of Noise on CNV Calling Performance
    3.3.6 Comparative Analysis of CNV-calling Algorithms
    3.3.7 Comparing OPAS and CBS Accuracy
  3.4 Conclusions
  3.5 Figures and Tables

4 Analysing CNVs in Follicular Lymphoma Genomes
  4.1 Introduction
  4.2 Materials and Methods
    4.2.1 Samples and Cytogenetic Analysis
    4.2.2 BAC Arrays and SNP Arrays
    4.2.3 Fingerprint Profiling (FPP)
    4.2.4 Ingenuity Pathway Analysis Software
  4.3 Results
    4.3.1 Magnitude of Copy Number Changes in FL Genomes
    4.3.2 Spectrum of Somatic CNVs in FL Genomes
    4.3.3 Category 1: CNVs Affecting Whole-chromosomes or Chromosome Arms (WCA)
      Recurrent WCA CNVs
    4.3.4 Category 2: CNVs Affecting the Distal Ends of Chromosomes
      Functional Analysis of the Genes Affected by Distal CNVs
    4.3.5 Category 3: Other CNVs
    4.3.6 Candidate Focal CNVs in FL Genomes and Functional Analysis of the Affected Genes
    4.3.7 Comparison of OPAS Generated CNVs with Other Methods
    4.3.8 Examples of Sequence Validated Novel (OPAS-exclusive) CNV Findings
      HVCN1
      CDKN2A
      KIT
  4.4 Conclusions
  4.5 Figures and Tables

5 Conclusions and Future Directions
  5.1 Summary
  5.2 Significance and Contribution to Field of Study
  5.3 Potential Applications and Future Directions

Bibliography

A Appendix A: Probability Distribution Function and Cumulative Distribution Function
B Appendix B: Normal and Standard Normal Distributions
C Appendix C: Proof of the Relationship Between p(k) and CV
D Appendix D: Estimating SNP-array Reproducibility Using Analysis of Variance (ANOVA)
E Appendix E: Description of Boxplots (in the thesis)
F Appendix F: Comparative Analysis of the SNP Array Normalization Techniques
  F.1 Introduction
  F.2 Method and Results
G Appendix G: Description of QDA
H Appendix H: Measures of Predicting the Accuracy
I Appendix I: List of Validated CNVs in 146 MR Patients
J Appendix J: Relationship Between Hybridization Intensity Noise and the Number of Predicted CNVs in FL Genomes
K Appendix K: Description of FPP Events
L Appendix L: Supplementary Material for Chapter 4

List of Tables

Table 1.1  Aneuploidies and large-scale CNVs associated with human disease
Table 1.2  Comparison of array specifications in 4 generations of Affymetrix SNP arrays
Table 1.3  A partial list of Affymetrix SNP array data (raw .CEL files) that are publicly available
Table 2.1  The relationship between CV and predicted replicate oligos that are ≥ k-fold different
Table 2.2  The impact of oligo-level variability across replicate measurements on the expected SNP-level variability
Table 3.1  Three sets of hypothesis tests used in the null-likelihood phase
Table 3.2  The impact of the size of reference sets on the estimated SNP signals
Table 3.3  Comparison of the performance of Naive Bayes and OPAS QDA-based SNP pre-processing methods
Table 3.4  Comparing CBS and OPAS performance in detecting known CNVs with respect to the number of SNP probes in the CNV regions
Table 4.1  Spectrum of LR deviation of all estimated DNA regions from 25 FL genomes
Table 4.2  Summary of candidate somatic copy number changes in the FL dataset
Table 4.3  Frequency of candidate distal CNVs (category 2) in each FL chromosome
Table 4.4  Summary of all candidate somatic focal CNVs (≤ 150 kb) that affect at least 1 gene in an FL patient
Table 4.5  List of 19 new candidate OPAS amplifications that were not previously detected by SNP data analysis
Table 4.6  List of OPAS-exclusive deletions that overlap with FPP or Illumina sequence validated events
Table 4.7  Partial list of some important genes that have been linked to cancer and their frequency across the FL dataset
Table D.1  Results of assessing chip and labeling variability in 8 samples based on two-way ANOVA analysis
Table F.1  List of CNVs and large-scale copy number alterations in 8 studied follicular lymphoma patients
Table F.2  Comparing the results of normalization methods in 8 cancer samples (FL)
Table I.1  Validation of OPAS sensitivity in detecting previously known CNVs in 146 mental retardation patients
Table L.1  Ingenuity Pathway Analysis list of known genes that were affected by candidate focal CNVs (≤ 150 kb)

List of Figures

Figure 1.1  Karyotype and M-FISH chromosome analysis of a patient with follicular lymphoma
Figure 1.2  Structure of SNP probe sets in Affymetrix GeneChip® SNP arrays
Figure 2.1  Sources of variability in SNP microarray experiments for identification of copy number variations (CNVs)
Figure 2.2  Common sources of microarray technical variability
Figure 2.3  Impact of log-transformation on the distribution of raw signal intensities from Affymetrix SNP arrays
Figure 2.4  Schematic representation of the Affymetrix 10K replicate study
Figure 2.5  Schematic representation of assessing technical variability in the 10K SNP array replicate study
Figure 2.6  Global normalization of log-transformed intensity readouts from 3 replicate arrays
Figure 2.7  MA-plots of chip variability
Figure 2.8  Histograms of deviation (error) in replicate arrays
Figure 2.9  Estimated chip and labeling variability of the Affymetrix 10K replicate experiment
Figure 2.10  Relationship between array variability and the probability of estimating oligos with k-fold difference in their intensity readout
Figure 3.1  Flowchart of the OPAS algorithm
Figure 3.2  Flowchart of the alternative approach for SNP pre-processing based on Naive Bayes classification
Figure 3.3  Schematic representation of OPAS input/output data
Figure 3.4  The Nsp signal of chromosome 14 of a follicular lymphoma patient that harbors a deletion on 14q32.33 (∼644 kb; 9 SNPs)
Figure 3.5  Example of oligo-level and SNP-level variability in SNP arrays (100K data)
Figure 3.6  Comparing the impact of the size of the reference set on estimated log2-ratio intensity readouts of SNP probe sets within a deleted region
Figure 3.7  Comparison of CDFs of all oligos in the SUFU deleted region with 6 reference sets of varying sizes
Figure 3.8  Boxplots of the average number of CNV-affirmative PM oligos per SNP probe set (θ), with respect to 3 reference sets of varying sizes
Figure 3.9  Fuzzy-Kmeans clustering of PM oligos in 5 SNPs within the SUFU deleted region
Figure 3.10  Schematic representation of the oligo-clustering and likelihood estimation modules of OPAS default pre-processing
Figure 3.11  Distribution of PM log2-ratio intensities before and after pre-processing (500K data)
Figure 3.12  The relationship between SNP log2-ratio intensities and PCR fragment length, before and after LOWESS normalization
Figure 3.13  Comparing estimated LR values of 567 SNPs in 19 validated regions of copy number loss based on QDA and Naive Bayes classifications
Figure 3.14  Examples of OPAS detected CNVs in simulated signals with different magnitudes of LR deviation
Figure 3.15  Comparing the accuracy and precision of CNV calling algorithms
Figure 3.16  Number of false negative deletion calls of the IGH locus, plotted against increasing noise of the simulated data
Figure 3.17  Results of comparing CBS and OPAS performance in detecting a known deletion
Figure 4.1  Example of an FPP event on chromosome 4q12 in FL patient 20 that is proved to be a deletion by Illumina sequencing
Figure 4.2  Distribution of LR intensity measurements of all regions across the FL dataset
Figure 4.3  Deletion on the 1p36 chromosomal region of ht-9 with slight signal deviation (LR = -0.13), validated by FISH
Figure 4.4  Probability density function (PDF) of z-scores from all regions with gain (LR > 0) or loss (LR < 0) of log2-ratio signal intensity (in FL dataset)
Figure 4.5  Boxplot of z-scores of all regions with loss or gain of log2-ratio signal intensities (in FL dataset)
Figure 4.6  Candidate deletions on chromosome 4 of patient 29 with slight signal deviations but significant z-scores
Figure 4.7  Pie charts of the frequency of candidate amplifications and deletions in 25 FL patients
Figure 4.8  Examples of two real whole chromosome gains in an FL patient (ht-11) with slight log-ratio deviations from baseline (LR = 0.12 and 0.13)
Figure 4.9  Frequency of WCA events per chromosome across all FL patients
Figure 4.10  Chromosome ideogram view of 48 WCA CNVs in the FL dataset
Figure 4.11  The only two WCA events that were not directly validated by cytogenetic analysis (slight gains of chromosomes 7 and X in ht-29)
Figure 4.12  Example of a distal CNV on chromosome 22 of an FL patient (ht-22)
Figure 4.13  Two candidate amplifications on chromosome 5 of an FL patient (ht-12), including a distal copy number gain on the 5 q-end
Figure 4.14  Candidate distal deletion on chromosome 1p36 (ht-7) with slight LR deviation but a significant z-score (LR = -0.12; z-score = -1)
Figure 4.15  Recurrent distal deletion of chromosome 1p36 in the FL dataset
Figure 4.16  Gene networks most significantly associated with candidate distal CNVs of the FL dataset
Figure 4.17  Frequency of candidate CNVs in each chromosome among 25 FL patients
Figure 4.18  Validated focal deletion (∼38.7 kb) on 13q21.33, detected by OPAS and SMD but not by aCGH
Figure 4.19  Candidate focal deletions on chromosome 2 that potentially affect the first intron of the ERBB4 gene in two FL patients (OPAS-exclusive)
Figure 4.20  OPAS-exclusive candidate deletion on chromosome 3q13.33 of patient 20 (∼97 kb) that is adjacent to an FPP inversion event
Figure 4.21  The most significant gene network associated with OPAS candidate focal CNVs (≤ 150 kb)
Figure 4.22  Venn diagram comparing predicted copy number amplifications in FL samples, generated by 3 methods (OPAS, SMD and aCGH)
Figure 4.23  Candidate OPAS-exclusive focal amplification on 9p13.2 (∼76 kb; 14 SNPs) that encompasses 2 exons of PAX5
Figure 4.24  Venn diagram comparing predicted copy number deletions in FL samples, generated by 3 methods (OPAS, SMD and aCGH)
Figure 4.25  Multiple OPAS candidate deleted regions on chromosome 10 of an FL patient (ht-19) that align with FPP complex events
Figure 4.26  Candidate OPAS-exclusive deletion on chromosome 14q32.33 of patient 14, mapping to an FPP 'coverage-gap' (∼230 kb; 4 SNPs)
Figure 4.27  Candidate OPAS-exclusive deletion on 15q11.2 (ht-21), mapping to a region with several 'multi fpp' events
Figure 4.28  Validated OPAS-exclusive focal deletion on chromosome 12 of patient 6 (∼143 kb; 4 SNP probes), affecting 4 genes including HVCN1 and PPTC7
Figure 4.29  Analysis of a putative fusion between the HVCN1 and PPTC7 genes in FL patient 6 as the result of a focal deletion (145,148 bp) on 12q24.11
Figure 4.30  Candidate ∼1.4 Mb deletion on 12q24 in patient 9, encompassing the HVCN1 gene
Figure 4.31  Sequence validated OPAS-exclusive focal deletion on 9p21.3 in FL patient 16, encompassing the CDKN2A gene
Figure 4.32  Recurrent deletion of the 9p21.3 chromosomal region in 4/25 (16%) FL patients, suggesting a potentially important role of the CDKN2A tumor suppressor in FL
Figure 4.33  Sequence validated OPAS-exclusive focal deletion on 4q12 affecting the KIT gene (ht-20)
Figure 4.34  Analysis of the impact of the 4q12 deletion (136,811 bp) in FL patient 20 on the KIT gene
Figure A.1  Probability density function (PDF)
Figure A.2  A graphical representation of the relationship between the PDF and CDF
Figure B.1  PDF and CDF of the normal distribution
Figure D.1  Schematic representation of the two-way ANOVA test performed on each 10K SNP
Figure D.2  Measuring the effect of chip and labeling variability in the 10K replicate experiment using two-way ANOVA analysis
Figure D.3  Frequency of inconsistent SNPs across 8 studied DNA samples
Figure E.1  Boxplot and probability density function (PDF) of a dataset with standard normal distribution
Figure F.1  Comparing the results of 5 normalization techniques on the estimated probe-level log2-ratio intensity readouts
Figure J.1  Relationship between variation of hybridization signal intensities and CNV counts in 25 FL genomes
282  Figure L.1  Candidate distal deletion on chromosome 8p23.3 of patient 25 (∼21 kb) containing 3 SNP probe markers (OPAS-exclusive) . . . . . . . . . . . . . . . . . 286  Figure L.2  Candidate OPAS distal deletion on chromosome 1p36 (ht-18), that includes an array CGH predicted deletion . . . . . . . . . . . . . . . . . . . . . . . . . . 287  Figure L.3  Candidate distal amplification on 14p36 (ht-7) with slight gain of signal intensity (LR = +0.18) but a significant z-score (+1.1) . . . . . . . . . . . . . . . . 288  Figure L.4  Candidate OPAS distal deletion on chromosome 6 of patient 20 that includes 11 SNP probes with significant loss of signal intensity (LR = -0.42; z-score = -1.50) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289  Figure L.5  Examples of FISH validated 1p36 deletions in 4 FL patients . . . . . . . . . . 290  Figure L.6  Candidate OPAS-exclusive focal deletions on chromosome 19 of patient 4 . . . 291  Figure L.7  Slight gain of signal intensity of a region within a deleted chromosome arm, predicted to represent an amplification (LR = 0.03; z-score = +0.6) . . . . . . . 293  xxi  Figure L.8  Candidate OPAS-exclusive focal deletion on chromosome 4 of FL patient 19 (∼10 kb; 4 SNPs), adjacent to an FPP translocation site between 4q28.3 and 2p25.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294  xxii  Acknowledgments I owe many thanks to my PhD thesis supervisor Dr. Marco Marra for the opportunity to pursue studies in his lab, as well as for his guidance and support. He was an inspiration, and always enthusiastic about teaching biology and fostering interdisciplinary science. Besides being a great supervisor, he was an excellent mentor, and someone I would like to emulate in my career. My sincere thanks also go to the members of my supervisory committee: Dr. Jan Friedman and Dr. Robert Holt, who were instrumental in my training throughout the course of my degree. 
I have also enjoyed the support of many fellow graduate students and post-doctoral fellows, including Trevor Pugh, Suganthi Chittaranjan, Ryan Morin, Malachi Griffith, Sorana Morrissy, Claire Hou, Jaswinder Khattra, Monica Sleumer, Maria Mendez-Lago, Rodrigo Goya, and Obi Griffith. I especially extend my sincere thanks to Olena Morozova, who has not only been a true friend and an invaluable emotional support throughout my PhD studies, but has also been immensely helpful in reviewing my thesis in more ways than I could ever list here. I also wish to express my gratitude to Dr. Inanc Birol and Dr. Stephane Flibotte, who were abundantly helpful and offered invaluable assistance in the statistical aspects of my thesis. I am grateful for the funding received from the National Cancer Institute of Canada, the University of British Columbia (Department of Genetics), Genome Canada, Genome British Columbia and the British Columbia Cancer Foundation. Most of the laboratory validation work described in this thesis would not have been possible without the help of Dr. Andy Mungall, whose exceptional dedication to science has motivated my work and who has always answered my questions about biology with patience. I must also extend my thanks to many other colleagues at the GSC, especially Martin Krzywinski, Irene Li, Andy Chu, Matthew Field, Dr. Allen Delaney and Jacquie Schein, as well as collaborators outside the Genome Sciences Centre, including Dr. Horsman, Susana Ben-Neriah, Dr. Patrice Eydoux and Dr. K-John Cheung, who helped me in many aspects of data analysis and biological interpretation of the results. I also extend appreciation and thanks to all the members of the Genome Sciences Centre whom I do not mention by name but who helped me by creating an open and exciting atmosphere of scientific collaboration.
The path I have chosen in science, and my immense passion for the application of mathematics to solving biological problems, would not have been ignited if it were not for the enthusiasm and inspiration of certain people whom I have been blessed to meet and collaborate with in my life. I am forever indebted to Dr. Michael Kolios, my Master's thesis mentor, who introduced me to the fascinating field of biophysics and provided me with the first opportunity to experience the exciting field of biomedical data analysis. It is also difficult to overstate my gratitude to the memory of two people who also had major impacts on my passion for bioinformatics: Dr. Sam Roweis, whose immense passion and enthusiasm for new ideas and new perspectives on interdisciplinary science sparked my passion for bioinformatics; and my very good friend and fellow graduate student Adrian Quayle, whom I will always be grateful to have known. On a personal level, I cannot forget to express thanks to all my friends and my brothers for believing in me, and especially Ali, who has been there for me every step of the way. Finally, thanks to my father, who dedicated so much of his life to my happiness and whom I miss dearly. To my mother, I dedicate this work. She continues to be the only constant parameter of inspiration and love among all other variabilities of my life.

Chapter 1

Methods and Strategies For Analysing Copy Number Variation Using DNA Microarrays

1.1 Introduction

Genetic variants resulting in gains or losses of DNA segments are collectively termed copy number variants, or CNVs, and are found both in humans and in other mammals such as the chimpanzee [1-3]. From the earliest days of cytogenetics-based chromosomal analysis, scientists were able to identify chromosomal variants and in many cases were able to associate them with certain human diseases [4-6] (Table 1.1).
Association of copy number variation with a phenotype goes back as early as 1936, when Bridges identified duplication of the BAR gene in Drosophila melanogaster as the cause of the 'Bar eye' phenotype [7]. In 1959, Jérôme Lejeune discovered the first chromosomal disorder in humans, an extra copy of chromosome 21 (trisomy 21) that was associated with Down syndrome [8], more than 90 years after Down syndrome was first described by John Langdon Down in 1866 [9]. This discovery of the first human condition definitively attributable to chromosome copy number variation was regarded as a turning point in cytogenetics. Since then, many other syndromes have been associated with large deletions or duplications of chromosomal regions, which were visible using chromosome microscopy [4-6]. In addition to the discovery of copy number variants in human disease, seemingly benign chromosomal variants were also identified among normal individuals (often referred to as copy number polymorphisms). These events were frequently detected in regions of heterochromatin on chromosomes 1, 9 and 16 and in the short arm of the acrocentric chromosome 6 [10, 11]. Later, the hybridization of molecular probes to human chromosomes, particularly with fluorescence in situ hybridization (FISH) [12], provided an effective tool for detection of subtle DNA gains and losses, as well as of other chromosomal rearrangements such as inversions and translocations. The advances in molecular analysis techniques paved the way for the discovery of numerous new genetic variants, including short tandem repeats [13] and single nucleotide polymorphisms (SNPs) [14-17]. As a result of these discoveries, it became clear that the scale of variation in the human genome ranged from single base pairs (SNPs1) to regions as large as several megabases in size [18-25].
Since then, our understanding of chromosomal variants, both in human disease and in normal populations, has been profoundly expanded as the result of genome-wide chromosome analysis techniques that have allowed us to interrogate the DNA sequence and discover submicroscopic CNVs (<1 Mb), much smaller than those detectable by the earlier cytogenetic analyses (>5-10 Mb).

We now know that CNVs are common characteristics of human diseases such as mental retardation [26-30], autism [31, 32] and cancer [33-36]. Copy number variants can directly cause disease through altering the abundance of dosage-sensitive genes [37], as in micro-deletion or micro-duplication disorders [26, 38, 39], or can affect gene expression, either directly by affecting the genes that are harboured within CNVs, or indirectly through altering upstream or downstream sequences that are involved in gene regulation [40-43]. Furthermore, multiple recent studies have indicated the importance of copy number alteration in susceptibility to complex human diseases such as Alzheimer disease [44-46], Crohn's disease [47, 48], autism [31, 32, 49, 50], psoriasis [51], Parkinson's disease [52-54], schizophrenia [55-59] and glomerulonephritis [60]. Importantly, over the past several years emerging evidence has shown the significance of smaller copy number variants associated with human disease [61-65] as well as with normal genome diversity [20, 66-68]. However, the resolution of CNV detection depends not only on the resolution of the technology, but also on the sensitivity and precision of the computational methods that are used for CNV analysis. The capacity to reliably detect small copy number variations (below 100 kb in size) using early clinical microarray platforms appeared to be limited [69-71], suggesting that there are as yet undetected and potentially disease-causing CNVs that require higher-resolution methods of genome analysis.
Given that the importance of copy number variants is well established, it is important to note that the power to detect these variants depends on two main factors: 1) the resolution of the technology being used to study the sample, and 2) the sensitivity and specificity of the computational methods that are applied to analyse the data. In the rest of this chapter, I will first review the technologies that have been used for CNV detection and then will discuss some of the computational approaches that are employed for CNV analysis.

1 SNPs are individual base positions in the genome that show natural variation in a population with more than 1% frequency (according to the Single Nucleotide Polymorphism database, dbSNP).

1.2 Technologies

The techniques to detect chromosomal abnormalities can be broadly categorized into 3 groups based on their detection resolution and genome coverage: techniques with low to moderate resolution and limited genome coverage (first generation), high-resolution chromosomal microarray analysis (second generation), and massively parallel sequencing technologies (third generation).

1.2.1 First Generation Techniques

Chromosome Banding

For more than 50 years, the standard clinical method for detection of chromosomal abnormalities was chromosomal cytogenetic analysis using karyotyping, a microscopic method that requires highly skilled interpretation. The conventional process for karyotyping, known as "chromosome banding", involves adding a dye to metaphase chromosomes that provides a visual image of different regions of each chromosome through its unique pattern (usually a black-and-white staining pattern). The most commonly used chromosome banding method, G-banding, uses Giemsa stain to visualize transverse bands on a chromosome1. Each chromosome has a characteristic banding pattern that helps to identify it, and both members of a chromosome pair have the same banding pattern (see Figure 1.1).
This method typically produces between 400 and 800 unique bands, which can be distinguished by the order and size of each band. Karyotypes are arranged with the paired chromosomes ordered by size, the short arm of each chromosome on top and the long arm on the bottom [72]. Comparing each chromosome's banding pattern to its normal pattern enables cytogeneticists to recognize chromosomal abnormalities such as numerical changes, as well as deletions, duplications and translocations, if they are of sufficient size (e.g., larger than 4-5 Mb; Figure 1.1). A major challenge of using G-banding to stain chromosomes is that subtle deletions and translocations near the telomeres are extremely difficult to identify by this method (since Giemsa staining produces light bands for most chromosome tips), and thus other staining methods such as R-banding1 have been used to detect telomere-specific aberrations [73-75]. Despite their limitations, G-banded karyotypes are still routinely used in clinical applications to diagnose a wide range of large-scale chromosomal abnormalities, including trisomy 21 in Down syndrome (written as 47,XX,+21).

1 In general, heterochromatic regions, which tend to be AT-rich, stain more darkly in G-banding, in contrast to less condensed GC-rich chromatin, which incorporates less stain and thus appears as light bands in G-banding.

Spectral Karyotyping (SKY) and Multiplex Fluorescent in Situ Hybridization (M-FISH) Analysis

To improve the resolution and efficiency of conventional chromosome banding, Spectral Karyotyping (SKY) and Multiplex Fluorescent in Situ Hybridization (M-FISH) chromosome analysis methods were developed. These methods use labeled chromosome-specific paints to provide simultaneous visualization of chromosomes in different colors [76, 77]. SKY uses multiple fluorochromes to measure the spectrum of each image pixel simultaneously, by means of an interferometer.
M-FISH, on the other hand, generates separate images for each of the 5 employed fluorochromes through the application of special filters, and later superimposes these images automatically to obtain a single image in full color (see Figure 1.1b). The main advantage of SKY and M-FISH is characterizing changes with respect to their origin, such as translocations. SKY and M-FISH have been used to detect and characterize chromosomal abnormalities in different cancers, including those of the breast [78, 79], colon [80, 81], bladder [82-84], lung [85, 86] and cervix [79, 83, 84, 87]. In contrast to conventional karyotyping, both SKY and M-FISH generate a digital image in full color (instead of a simple black-and-white pattern), which enhances the observation of structural aberrations in the entire genome and provides insight into the chromosomal composition of several ambiguous marker chromosomes2 [88]. Furthermore, working with digital images allowed scientists to use computers to analyse the "painted" chromosomes and automate the identification of structural abnormalities. This significantly reduced the cost and time of conventional labour-intensive karyotyping, and improved the accuracy of finding true abnormalities by minimizing the human errors of interpreting black-and-white karyotypes. As a result, SKY and M-FISH have been used to detect chromosomal rearrangements [89, 90]. However, the resolution of these techniques is estimated to be approximately 1-2 Mb for both SKY [91, 92] and M-FISH [93].

1 R-banding is a staining method in which chromosomes are heated in a phosphate buffer, then treated with Giemsa stain to produce a banding pattern that is the reverse of that produced by G-banding. Thus, the dark regions are gene-rich euchromatin and the light bands are heterochromatin (a tightly packed form of DNA).

2 A marker chromosome is an abnormal chromosome that is distinctive in appearance but not fully identified.
1.2.2 Second Generation High-Resolution Techniques

Comparative Genome Hybridization (CGH)

In cytogenetic CGH, which was first developed by Kallioniemi et al. [94], the patient and normal samples (also referred to as "test" and "reference" samples, respectively) are labeled with different fluorescent tags and co-hybridized to normal metaphase chromosomes. The hybridization fluorescence intensities from the test and normal samples are converted into quantitative ratio measurements that represent the gains and losses in the test (patient) genome relative to the reference genome [94]. Cytogenetic CGH has been a popular tool to characterize chromosome imbalances in many different clinical applications [94-97], including mental retardation [97-99] and cancer [94, 100-106]. For example, application of CGH in ovarian cancer resulted in the discovery of several copy number variants that affected some of the key genes in ovarian tumorigenesis, including loss of 17pter-q21, which harbors p53; gain of 17q, which results in amplification of HER2/neu (ERBB2); and amplification of 8q24, including the MYC oncogene [107-109]. CGH has also been successfully applied to analyse hematological cancers such as leukemia and lymphoma [103, 106, 110]. Although cytogenetic CGH technology had a huge impact on the cytogenetic analysis of human disease, these methods were very labor intensive and required the use of metaphase chromosomes, which led to limited resolution, typically about 5-10 Mb [103, 111, 112]. Completion of the Human Genome Project [113], in which large-insert clone libraries were developed and assembled into overlapping contigs for sequencing, initiated a major improvement to CGH techniques. In an attempt to overcome the aforementioned limitations associated with cytogenetic CGH, investigators developed a method that combined the principles of CGH with the use of microarrays [114].
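Before turning to the array-based version of CGH, the ratio computation just described can be made concrete with a small sketch: paired test/reference fluorescence intensities are converted to log2 ratios, and values beyond a cutoff are called gains or losses. The intensity values and the ±0.3 cutoff below are illustrative assumptions, not parameters of any particular CGH platform.

```python
import math

def log2_ratios(test, reference):
    """Convert paired test/reference fluorescence intensities to log2 ratios:
    0 means equal copy number, +1 a doubling, -1 a halving."""
    return [math.log2(t / r) for t, r in zip(test, reference)]

def classify(ratio, gain_cutoff=0.3, loss_cutoff=-0.3):
    """Call one probe as 'gain', 'loss' or 'neutral' with fixed cutoffs
    (illustrative; real pipelines estimate thresholds from the data)."""
    if ratio >= gain_cutoff:
        return "gain"
    if ratio <= loss_cutoff:
        return "loss"
    return "neutral"

test_intensities = [980.0, 2100.0, 510.0, 1010.0]    # hypothetical tumour signal
ref_intensities = [1000.0, 1000.0, 1000.0, 1000.0]   # hypothetical normal signal
ratios = log2_ratios(test_intensities, ref_intensities)
calls = [classify(r) for r in ratios]
print(calls)  # ['neutral', 'gain', 'loss', 'neutral']
```

A roughly 2:1 test-to-reference intensity (log2 ratio near +1) reads out as a gain and a 1:2 intensity as a loss; as discussed later in this chapter, measurement noise makes single-probe calls like these unreliable in practice.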
Instead of using metaphase chromosomes, this method, which is known as array CGH (aCGH), used DNA clones that accurately mapped to known regions of the genome and were robotically spotted onto array glass slides or glass capillaries [115]. In array CGH, labelled samples are applied to a slide containing thousands or millions of DNA probes [116]. The resolution of such arrays depends on the size of the genomic fragments that are used as DNA probes (e.g., ∼150 kb for BAC aCGH), the density of the array (e.g., the Agilent ultra-dense 1M array CGH platform includes 1 million probes) and the structure of the genome sequence being analysed. As in cytogenetic CGH, in array CGH technology the test and reference DNAs are labeled with different fluorescent dyes and then co-hybridized to the array. Relative gains and losses of signal intensities are subsequently measured and reported as copy number deleted or amplified regions of the genome. This technique has revolutionized the study of CNVs in many different applications, such as cancer studies [116-121].

Among the genomic representations of DNA that are used as probes in array CGH platforms, Bacterial Artificial Chromosomes, or BACs, were heavily used initially, especially for studying CNVs in cancers [36, 122] and mental retardation syndromes [26, 123-125]. Using BACs for CGH (known as BAC aCGH) provided genome resolution and coverage that were unprecedented at that time, as shown in a study by Krzywinski et al. [126], which indicated that more than 99% of the entire human genome can be represented by a set of 32,000 BAC clones with an average intermarker distance of 76 kb between BAC clones.
Despite the unprecedented resolution and genome coverage of BAC aCGH, the empirical resolution of this technology for detecting chromosomal abnormalities was still limited by the average size of BAC clones [126, 127] and by the technical difficulties of producing high-density, highly reproducible BAC arrays in the large numbers required for clinical applications. Following the success of aCGH technology in detecting structural aberrations up to one-tenth the size of those detectable by conventional cytogenetics [36, 122, 126, 128-132], the use of DNA arrays with shorter probe sequences became increasingly popular in recent years. Thus, in the next section I will focus on describing oligonucleotide arrays, an important tool for high-resolution whole-genome analysis developed during the past decade.

DNA Oligonucleotide Arrays

Oligonucleotide arrays consist of an arrayed series of thousands or millions of microscopic spots of DNA oligonucleotides, called features, each containing a specific DNA sequence (or probe) [133-138]. The length of oligonucleotide probes, or oligos, varies between different array types and vendors but typically is in the range of 25 (used by Affymetrix) to 60 (used by Agilent and Illumina) nucleotides [137-139]. The intensity of target-probe hybridization is translated into measurements representing the abundance of DNA in the test sample relative to the reference. Oligonucleotide arrays provide the highest potential resolution for microarrays. However, in practice the effective resolution of oligonucleotide microarrays is dependent on several factors, such as the length of the oligonucleotide probes (or "oligos"), the density of the probes on the array and the coverage of the genome. Another source of data variability in microarrays is the array manufacturing technology. Based on the manufacturing technology, microarrays can be broadly categorized into "spotted" and "in situ synthesized" arrays.
In earlier arrays, the probes were synthesized prior to deposition on the array and then spotted on the array surface by means of a spotting robot (hence "spotted arrays"). Spotting technology allowed a maximum of approximately 60,000 oligos to be printed on any given array [139], and consequently the density of spotted arrays was limited to approximately a single probe per 50 kb of sequence [139]. The next major manufacturing technology, in situ synthesis, was fundamentally different from robotic spotting, as the oligos were synthesized, base-by-base, directly on the array surface. Over the past decade, many developments have been made in array technology, and in particular there has been a significant trend toward increased numbers of features (probes) and toward shorter DNA sequences as hybridization targets [139], both of which have impacts on the resolution at which CNVs can be detected. Despite their improved density and detection resolution, a major disadvantage of oligo arrays is the relatively poor signal-to-noise ratio of the hybridization intensities, which leads to considerable variability in the reported number and size of CNVs [139-142]. To address this signal-to-noise ratio (SNR) limitation, Lucito et al. [115] developed a method based on reducing the complexity of the genomic DNA that is being hybridized to oligo arrays, known as Representational Oligonucleotide Microarray Analysis, or ROMA, which uses 70-mer oligos as genome representations. Briefly, in ROMA the genomic DNA is digested using a restriction enzyme (often BglII) and the resultant fragments are then ligated to adapters and PCR amplified using universal primers. Because of preferential PCR amplification of smaller segments, the final amplification product is depleted in larger fragments, leading to a reduction in the complexity of the sample [115]. These products are then hybridized to an oligonucleotide array consisting of probes selected to match the reduced set of restriction fragments.
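The complexity-reduction step can be illustrated with a toy in-silico digest: cut a sequence at a recognition site, then keep only the fragments in the size window that PCR preferentially amplifies. The toy sequence, the simplified cut position and the 200-1,100 bp window below are illustrative stand-ins, not the actual assay parameters.

```python
import re

def digest(sequence, site="AGATCT"):
    """Return fragment lengths after cutting at each occurrence of a
    restriction site (AGATCT is the BglII recognition sequence; for
    simplicity this toy cuts at the start of the site rather than at
    the true enzymatic cut position)."""
    cuts = [m.start() for m in re.finditer(site, sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

def representation(fragment_lengths, lo=200, hi=1100):
    """Keep only fragments in the preferentially amplified size range;
    larger fragments are effectively depleted from what reaches the array."""
    return [n for n in fragment_lengths if lo <= n <= hi]

toy_genome = "A" * 500 + "AGATCT" + "C" * 3000 + "AGATCT" + "G" * 800
fragments = digest(toy_genome)
print(fragments)                   # [500, 3006, 806]
print(representation(fragments))   # [500, 806] - the ~3 kb fragment drops out
```

This also illustrates the caveats discussed next: a region whose local restriction-site spacing puts its fragments outside the amplifiable window is simply absent from the representation, so sequence content, or an individual's particular digestion pattern, can mimic a copy number difference.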
ROMA, however, also suffers from poor signal-to-noise ratio measurements. Furthermore, it presents other potential problems for CNV detection studies. First, as the result of the complexity reduction step (explained above), different regions of the DNA could have different representations due to their sequence content (not their sequence abundance), and this differing representation could be mistakenly interpreted as copy number variation. Second, different individuals will have different restriction digestion patterns, and it is possible that some individual probe ratios may be related to restriction fragment size differences rather than to true copy number changes. However, the main limitation of ROMA arrays is their low signal-to-noise ratio compared to BAC arrays; thus, typically 3 probes are averaged to improve the associated variance of the array data, leading to lower CNV detection resolution compared to BAC aCGH [139].

Genotyping Arrays

During the past two decades, Single Nucleotide Polymorphisms (SNPs) have been recognized as a major source of human genetic variation [14-17, 143-145]. These findings have been made possible largely by the development of high-throughput array technologies for SNP genotyping from commercial vendors such as Affymetrix and Illumina. Although these arrays were originally developed for genotyping SNPs, the intensity information from these arrays can be used to detect copy number variants, providing both SNP genotypes and copy number estimates simultaneously from a single experiment. Since the introduction of this technology, extensive research has focused on studying CNVs in human disease, such as mental retardation [29, 30, 146], schizophrenia [147], autism [148] and cancer [149-160], as well as in normal polymorphisms [69, 153, 158, 161-164]. In the next two sections I present a brief description of the SNP arrays from Affymetrix and Illumina.
Affymetrix SNP Arrays: The first generation of commercial SNP arrays, known as "HuSNP", was produced by Affymetrix and became available more than a decade ago [17]. The early HuSNP arrays were capable of genotyping 1,494 SNPs in a single experiment, and since then Affymetrix has continued to release newer arrays with increased numbers of features, including 10,000, 100,000, 500,000 and now ∼1 million SNPs (see Table 1.2).

Briefly, in this technology, total genomic DNA (250 ng) is digested with a restriction enzyme (such as Nsp I or Sty I in the 500K arrays) and ligated to adaptors for PCR amplification1. The PCR conditions have been optimized to preferentially amplify fragments in the 200 to 1,100 base pair (bp) size range2. This preferential amplification reduces the complexity of the hybridization by incorporating the smaller fragments. Finally, the amplified DNA is fragmented, labeled, and hybridized to a chip.

1 datasheet.pdf
2 human snp5/faq 4.jsp

In the Affymetrix technology used in the 10K, 100K and 500K platforms, each SNP sequence is interrogated by a set of 25-mer oligonucleotide probes that target the SNP site and its surrounding base pairs, as shown in Figure 1.2. Each SNP on the array is represented by a collection of probe quartets, also known as the SNP probe set. A probe quartet consists of a set of 25-mer oligonucleotide "probe pairs" for the two most common alleles (known as 'A' and 'B') and for both the forward and reverse strands (antisense and sense) of the SNP. Each probe pair consists of a perfect match (PM) probe and a mismatch (MM) probe (see Table 1.2). The number of designated oligonucleotides in a probe set varies in each generation of the arrays. In the 10K and 100K arrays [135, 137], a probe set consisted of 40 different oligonucleotide probes, while in the 500K arrays this number was generally reduced to 20 oligonucleotides, although a subset of SNPs in the 500K Affymetrix arrays still retain 40 features [165]. For CNV detection, the signal intensities of the probes (see Figure 1.2) are compared with values from another individual (or group of individuals) and the relative copy number
For CNV detection, the signal intensities of the probes (see Figure 1.2) are compared with values from another individual (or group of individuals) and the relative copy number 1 datasheet.pdf 2 human snp5/faq 4.jsp  8  per locus is determined [166, 167]. The technology used in the latest Affymetrix genotyping array, SNP 6.0, has major differences with the three previous generations of Affymetrix SNP arrays. The SNP 6.0 array consists of probes for both SNPs and copy number variation. The copy number variation probes were selected based on Toronto Database of Genomic Variants (DGV) [168]. In SNP 6.0 platform each ’A’ and ’B’ allele of a SNP probe are presented by 3-4 replicate PM oligonucleotide probes resulting in 6-8 oligos per SNP. R The Affymetrix Genome-Wide Genotyping� arrays have been widely used for high-throughput  SNP genotyping (in population genetics [169], linkage disequilibrium analysis [170], and wholegenome association studies [171]), and copy number analysis [149–157].  Illumina SNP arrays: Similar to Affymetrix technology, Illumina SNP arrays have  also been increased in capacity from 100,000 SNP probes in Human-1 array to ∼1.2 million probes  in Infinium HD BeadChip that consists of both SNP and CNV probes [172, 173]. Both Affymetrix and Illumina technologies share the same underlying principle for identifying CNVs through using the array intensity data and both enable simultaneous analysis of genotypes and copy number data [174]. Despite their similarities, the two products have substantial differences [172, 173, 175]. For example, Illumina arrays use 50-mer oligos compared to Affymetrix’s 25-mers. Also, Illumina has 1 or 2 replicate probes per SNP allele whereas Affymetrix has about 4-6 probes per allele (Affymetrix 10K-500K arrays have 10-40 oligos to interrogate a SNP locus, however, these oligos do not have the exact same sequence1 ). 
In the context of SNP genotyping, the Illumina Infinium assay, which runs on its 1M-Duo chip, uses single-base extension with a labeled base to call a SNP genotype [172, 175], whereas Affymetrix genotype calls are based exclusively on differential hybridization [173, 175]. Nonetheless, despite their different SNP genotyping approaches, in the context of chromosome copy number discovery the signal-intensity output from both platforms presents similar analysis and interpretation problems [173, 175-177].

Application of SNP Arrays for CNV Detection: Initial genotyping arrays provided unprecedented resolution for identifying chromosome copy number aberrations in both normal and disease states, and the results have improved with subsequent developments of the technology. Even so, the high level of associated noise has been the main computational challenge of interpreting the array signal intensity for CNV detection [176]. Since the introduction of this technology, various methods have been developed to reduce the noise and improve the sensitivity and specificity of CNV calling [69, 149, 150, 161, 178-183]. Also, various copy number detection algorithms have been developed to aid CNV detection [161, 168, 184-193]. However, for robust CNV detection, most of these methods require significant concordant ratio shifts in several probes sequentially located along the genome, which consequently lowers the "effective resolution" of these arrays [139, 141].

1 The new design strategy in the Affymetrix 6.0 arrays uses replicate oligonucleotide probes to interrogate each SNP (see Table 1.2).
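The consecutive-marker requirement can be reduced to a short run-length scan over probe log2 ratios: report a CNV only where enough adjacent probes shift in the same direction. The ±0.3 threshold, the 8-probe minimum and the toy data below are illustrative assumptions, not the parameters of any published caller.

```python
def call_cnv_segments(log2_ratios, threshold=0.3, min_probes=8):
    """Return (start_index, end_index, direction) for every run of at
    least `min_probes` consecutive probes whose log2 ratios all exceed
    `threshold` in the same direction: the rule that makes calls robust
    to probe-level noise at the cost of missing small CNVs."""
    segments, start, direction = [], None, 0
    for i, r in enumerate(log2_ratios + [0.0]):  # trailing 0.0 flushes the last run
        d = 1 if r >= threshold else (-1 if r <= -threshold else 0)
        if d != direction:
            if direction != 0 and i - start >= min_probes:
                segments.append((start, i - 1, "gain" if direction > 0 else "loss"))
            start, direction = i, d
    return segments

# One isolated noisy probe (-0.4) is ignored; only the sustained 9-probe
# loss of signal is reported.
noisy = [0.05, -0.4, 0.1] + [-0.8] * 9 + [0.02, 0.4]
print(call_cnv_segments(noisy))  # [(3, 11, 'loss')]
```

This is exactly the trade-off noted above: the lone -0.4 probe might be a real single-probe deletion or just noise, and a consecutive-marker rule cannot distinguish the two, so it stays silent.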
More importantly, the distribution of the probes is not uniform across the genome (e.g., in the Affymetrix GeneChip 100K, SNPs have a median spacing of 8.5 kb but a mean intermarker distance of 23.6 kb [167]); thus, CNV calling approaches based on ratio shifts in several consecutive SNPs will inevitably limit sensitivity for small CNVs and for CNVs in genomic regions sparsely populated by probes (methods often require at least 8-10 probes to identify a CNV).

1.2.3  Third Generation High-Resolution Techniques

More recently, with the advent of next-generation sequencing technologies, a few groups have applied massively parallel sequencing platforms with the aim of improving the sensitivity of CNV detection to base-pair resolution [194–197]. The existing algorithms for sequencing-based CNV analysis can broadly be categorized into two groups. The first category is primarily based on paired-end read mapping (PEM), as previously reported by Tuzun et al. [24] and Korbel et al. [198] (both obtained with 454 technology). In the PEM approach, 3 kb paired-end reads are computationally mapped to the human reference genome, and the mapped pattern of the 3 kb reads is then analysed to detect regions of structural variation [198]. Deletions are identified by paired ends spanning a genomic region in the reference genome longer than the known fragment length. Similarly, insertions are predicted through paired ends that span a region shorter than the reference genome would predict, or paired ends that cross chromosomes. PEM-based detection methods have several reported limitations: they demonstrate particularly poor performance in complex genomic regions that are rich in segmental duplications, and they have limited ability to detect insertions larger than the average insert size of the library [24]. The second category uses the depth of sequencing coverage to predict CNVs [194]. Evan Eichler's group used this approach to develop the mrFAST algorithm.
This algorithm measures the depth of coverage of whole-genome shotgun sequencing (WGS) reads that are aligned to the human reference genome by checking every locus in the genome and matching it to reads with at least 94% identity. The algorithm then uses the average depth of the aligned reads to detect regions of copy number aberration [194]. The performance of this approach was tested using 3 sequenced human genomes: Yoruban [199], Han Chinese [187] and the Watson genome [200]. The results of this study indicated that the number of reads sampled from a given region, referred to as read depth, was proportional to the number of times the region appeared in the corresponding genome. To further test this hypothesis, Alkan et al. [194] studied read depth in 961 autosomal duplications and concluded that at 20-fold sequence coverage, > 90% of all segmental duplications larger than 20 kb could be accurately identified by analysing the sequencing read depth. By selecting regions with increased sequence coverage, Alkan et al. [194] identified 725 non-overlapping large segmental duplications. Nearly all of these 725 detected segmental duplications were present in the sequenced genomes of all three subjects. A similar approach for detecting structural variations in next-generation sequencing data was proposed by Yoon et al. [195]. This approach, known as event-wise testing (EWT), also uses read depth (RD) of coverage and relies on statistical testing of 100-bp RD intervals to identify potential regions of increased or decreased RD coverage, and then uses this information to infer regions of copy number variation. The plots of averaged RD data across the chromosome lengths resemble copy number scatterplots from oligonucleotide arrays, and there appears to be significant variation in the estimated RD measurements [195].
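The windowed read-depth idea at the core of these methods can be sketched in a few lines. This is a minimal illustration, not the published EWT implementation; the window size and the 2-SD flagging cutoff are assumptions made here for the example.

```python
# Minimal sketch of read-depth (RD) windowing: average per-base depth in
# fixed-size windows, then flag windows whose averaged depth deviates from the
# overall mean. Illustrative only; real methods use calibrated statistics.
from statistics import mean, stdev

def window_read_depth(depth, window=100):
    """Average per-base read depth in non-overlapping fixed-size windows."""
    return [mean(depth[i:i + window])
            for i in range(0, len(depth) - window + 1, window)]

def flag_rd_outliers(windowed, n_sd=2.0):
    """Indices of windows whose mean RD deviates > n_sd SDs from the mean."""
    mu, sd = mean(windowed), stdev(windowed)
    return [i for i, rd in enumerate(windowed) if abs(rd - mu) > n_sd * sd]
```

For a flat 30x genome with one 100-bp stretch at 60x, the single elevated window is flagged; in practice the normality assumption behind such cutoffs is exactly the weakness discussed next.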
Furthermore, the windowing approach to smoothing the RD data has certain limitations; notably, the criteria used to select the optimal window size are rather experiment-specific and may not be applicable to other experiments (a detailed discussion of the pros and cons of windowing-based smoothing of the data is presented in Section 1.3). The basis of the method proposed by Yoon et al. [195] is that the scatterplot of RD coverage along the genome will follow a normal distribution after averaging the RD measurements in 100-bp intervals; however, this hypothesis may not hold true for different experiments, or even for different parts of the genome with uneven coverage [197]. Therefore, in practice, a 100-bp windowing of RD data may not generate a normal distribution, and none of the downstream statistical analyses in EWT [195] would then be applicable, since they are strictly based on the assumption of normality. Substituting non-parametric methods to analyse the sequence data could reduce variation in RD data and may be a potent tool to improve CNV identification at high resolution. It is clear that new computational approaches are needed to systematically detect copy number variants from sequence data [201]. Oligonucleotide arrays are also being used in parallel with sequencing data. For instance, Affymetrix 500K SNP arrays were used to assess the accuracy of the known sequence-derived SNPs from the Watson genome [200] and to provide a map of CNVs in the Craig Venter genome [163]. Additionally, as the costs of array production, labelling, and hybridization continue to fall, these arrays are becoming more accessible and, therefore, the range of their applications is growing [202, 203]. These factors emphasize the significance of developing highly accurate computational methods that can improve the sensitivity and specificity of current CNV detection algorithms.
Additionally, although next-generation DNA sequencers may become the dominant technologies for CNV detection [202], computational biologists have already begun borrowing methods initially developed for oligonucleotide arrays to analyse sequence data. The next section details the history of computational advances associated with SNP arrays, which have been the focus of my thesis.

1.3 Methods to Discover CNVs Using SNP-array Data

The underlying principle of all SNP array copy number analysis algorithms is to compare the fluorescence intensity ratios along the length of each chromosome to identify regions of candidate copy number loss or gain in the test sample, relative to the reference sample. A major concern for the detection of CNVs using oligonucleotide array technology is how a putative CNV is defined computationally. There is a plethora of different methods for calling significant changes in relative intensity ratio from arrays. Relative to Illumina SNP arrays, more methods have been developed and evolved to analyse Affymetrix SNP array data, since these arrays have more data redundancy per SNP locus and have been commercially available longer than Illumina SNP arrays. Typically, each copy number analysis algorithm involves 2 main steps: (1) normalization, and (2) CNV calling. With each new version of Affymetrix SNP array technology, these modules have evolved and been modified to improve the sensitivity and specificity of CNV detection [69, 149, 150, 161, 178–183].

1.3.1  Normalization Methods

As previously mentioned, SNP array data analysis is based on processing the relative signal ratio changes in a test array against one or more samples known as the "reference set". By doing so, the analysis inevitably incorporates the variability that exists between different arrays (chip-to-chip variability) [204].
Therefore, to perform an accurate data analysis, it is crucial to first normalize raw intensities to correct for the variation that exists between different chips and different hybridization experiments [205] (also known as between-array normalization). Several commonly used methods for normalizing SNP microarray data have been adopted from expression arrays, including global normalization [206], invariant-set normalization [207], and LOWESS [208]. The current consensus, however, is based on Quantile Normalization [205], a non-parametric approach developed by Terry Speed's group for SNP oligonucleotide array data normalization. Quantile normalization forces all samples in an analysis to have identical intensity distributions (instead of just matching the mean/median of intensities across the chips, as in global normalization).

1.3.2  CNV Calling Methods

Statistical methods for analysing copy number data are necessary for the identification of CNVs. The development of methods that can accurately identify CNV regions has been a major challenge for microarray-based copy number analysis during the past several years [175, 176]. Therefore, a variety of statistical analysis and visualization tools have been developed for Affymetrix SNP array platforms [69, 149, 150, 161, 175, 176, 178–183]. Despite their algorithmic differences, the logical structure underlying these approaches typically belongs to one of the following statistical models: (1) hidden Markov models (HMMs), such as QuantiSNP [185], PennCNV [186], HMMSeg [184] and dChipSNP [188, 209]; (2) segmentation algorithms such as DNAcopy [189], GLAD [210], Circular Binary Segmentation (CBS) [189] and FACADE [211]; and (3) t-tests and standard deviation-based thresholding of the log2-ratio intensity measurements, as in [210, 212, 213]. Many of these methods, such as CBS and HMMs, were initially designed for aCGH and later adopted for SNP arrays [139, 214].
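Before turning to these CNV calling categories in detail, the quantile normalization step described in Section 1.3.1 can be illustrated with a short sketch: each array's sorted intensities are replaced by the across-array mean at each rank, so every array ends up with the same empirical distribution. This pure-Python sketch assumes equal-length arrays and ignores tie handling, which real implementations must address.

```python
# Minimal quantile normalization sketch: substitute each value by the mean of
# the values at the same rank across all arrays. Illustrative only.
from statistics import mean

def quantile_normalize(arrays):
    """arrays: list of equal-length intensity lists; returns normalized copies."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # mean of the k-th smallest value across arrays, for each rank k
    rank_means = [mean(s[k] for s in sorted_arrays) for k in range(n)]
    out = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # rank of each probe
        norm = [0.0] * n
        for rank, i in enumerate(order):
            norm[i] = rank_means[rank]
        out.append(norm)
    return out
```

After normalization, every array shares exactly the same set of values; only the probe-to-value assignment (the ranks) differs between arrays.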
Below is a brief description of the fundamentals of each of these algorithm categories. One of the simplest approaches to identifying shifts in the array intensity outputs is based on analysing the standard deviation (SD) of log2-ratio intensities, using thresholds to identify putative regions with significant log2-ratio deviation from the baseline, as in [212, 215]. These methods were originally developed and widely used for copy number data analysis on aCGH platforms (such as BAC aCGH [213] and cDNA-based aCGH [34]). Despite their simplicity and speed, in the presence of non-specific variation in the signal intensities (or noise) the thresholding-based methods perform poorly in detecting true regions of copy number aberration [175, 176]. In an attempt to overcome the limitations of thresholding-based CNV detection algorithms, Pollack et al. [216] used a modified thresholding approach on an aCGH platform to detect CNVs in 44 primary breast tumors and 10 breast cancer cell lines. In this approach, the data were first smoothed by averaging over a window of optimal size, and a statistic was calculated for each probe in the window. Next, based on these statistics, CNV false discovery rates (FDRs) were estimated using the Benjamini and Hochberg method [217] and applied to determine the thresholds of DNA copy number gains and losses in the corresponding breast cancer data set [216]. Windowing-based t-test methods have since been widely used in numerous copy number detection studies [29, 30, 162, 215], but these methods suffer from major drawbacks. The main challenge of windowing approaches is the ambiguous nature of determining an optimal window length (the same limitation described earlier for the EWT method proposed by Yoon et al. [195]; p. 11). In fact, genomics was not the first field of study that applied windowing to reduce signal variation and recognized its limitations.
The drawbacks of windowing-based smoothing had previously been brought to attention in neurology, when scientists tried to identify normal and abnormal patterns of brain EEG (electroencephalography, the electrical activity along the scalp produced by the firing of neurones within the brain) and knee VAG (vibroarthrographic, the vibration signals emitted during movement of the knee) signals among individuals. The common problem of windowing-based methods is choosing the optimal length of the defined window (or window size). Choosing a large window size results in a greater degree of smoothing but inevitably hides small copy number changes. On the other hand, if the selected window size is too small, the presence of a few sporadic noisy probes is sufficient to generate a false positive CNV readout. Another drawback of this approach is that it suppresses the magnitude of signal intensities for both noisy and informative probes. Although such data suppression reduces the variation of signal intensities and lowers the overall standard deviation of the signal, it also reduces the magnitude of true signal aberrations. This limitation can reduce the sensitivity of CNV detection, particularly for small CNVs or CNVs with fewer SNP probe markers [218–221]. A fundamental drawback of the above approach to data smoothing and thresholding is handling tumor heterogeneity in cancer, which refers to the presence of different cell subpopulations in a sample [218]. In such cases the combination of normal and copy-number-aberrated cells results in log2-ratio measurements that are well below the predetermined thresholds for calling a CNV. Thus methods that are solely based on thresholding cannot detect such changes unless the CNV is present in the majority of the cells, generating a significant shift in signal intensities away from the baseline.
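To make the approach just critiqued concrete, a minimal windowed-smoothing, SD-thresholding caller might look like the following. The window size and ±2 SD cutoffs are illustrative assumptions, not values from the cited studies.

```python
# Minimal sketch of windowed smoothing followed by SD-based thresholding:
# moving-average the log2 ratios, then call gain/loss against mean +/- n_sd*SD.
from statistics import mean, stdev

def call_by_threshold(log2_ratios, window=5, n_sd=2.0):
    """Smooth log2 ratios with a moving average, then call gain/loss/neutral."""
    half = window // 2
    smoothed = [mean(log2_ratios[max(0, i - half):i + half + 1])
                for i in range(len(log2_ratios))]
    mu, sd = mean(smoothed), stdev(smoothed)
    calls = []
    for v in smoothed:
        if v > mu + n_sd * sd:
            calls.append("gain")
        elif v < mu - n_sd * sd:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls
```

Note how the moving average already attenuates the edges of an aberrant run, illustrating the signal-suppression drawback described above; a subclonal event at a fraction of this amplitude would fall entirely below the cutoff.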
To overcome the limitations of thresholding-based CNV calling algorithms in handling noisy data (particularly in oligonucleotide arrays), model-based algorithms were developed that focused on statistical analysis of candidate CNV regions to improve the ability to distinguish between non-specific (noise) and informative hybridization signal intensities. One of the earliest model-based approaches for copy number data was proposed by Hodgson et al. (2001) [33] in an aCGH study in mice, where a three-component mixture model was fit to islet tumor data, with each component representing one copy number state: gain, loss or neutral. They subsequently used the information from these Gaussian models to determine the thresholds above or below which aCGH ratios should be considered increased or decreased. In 2003, Snijders et al. [35] developed a heuristic method to fit a Gaussian hidden Markov model (HMM) to array CGH copy number data, and since then HMMs have been routinely used to detect CNV regions and to predict the actual ploidy of the regions [184]. A common assumption of HMMs is that observed intensities are related to an unobserved copy number state at each locus that can be defined by an emission distribution (often assumed to be Gaussian). Like windowing methods, HMMs have a long history in other applications, mainly in speech recognition [222], and were later adopted by bioinformaticians, initially for DNA sequence alignment [223]. Another important assumption of an HMM is that copy number states follow a pattern, so neighbouring SNPs have similar copy number states, and thus the transitions between copy number states can be predicted through a transition matrix that describes the probability of moving from one state to another.
This transition matrix is learned directly from the data (known as the training set) by applying another statistical method, such as Expectation Maximization (EM) [175, 224]. After the training phase of the model, the HMM can be applied for CNV detection in a new experiment, where each observed log2-ratio is assigned a state, with the Viterbi algorithm used to predict the most likely state sequence for the observations [175]. Since the introduction of HMMs to array genomic hybridization, many different methods have been developed for analysing array copy number data from different platforms, such as HMMSeg [184] (in aCGH), QuantiSNP [185], dChipSNP [188, 209], and PennCNV [186]. But despite their popularity and the various publications that have used this technique to identify novel CNVs, HMMs have their own limitations. First, training an HMM requires proper initialization of both the transition matrix (which describes the probability of moving from the current state to a new state) and the emission matrix (which describes the likelihood of a certain output given the current state), with the transition matrix being particularly sensitive to initialization values. This implies that the transition probabilities, and thus the estimated HMM results, are sensitive to the assumptions about the patterns of fluctuation of log2-ratio intensity data between neighbouring probes. Therefore, training an HMM on one particular disease yields a model that is sensitive to the CNV patterns of that particular disease. The HMM approach for CNV detection has a number of other limitations. Estimating putative regions of CNVs depends on the initial hypothesis about the number of states present in the array data. Often two states are assumed in the data (one for amplification and another for deletion); however, it can be difficult to assign a single state to a genomic region, particularly in cancer studies.
For example, if a fraction of the tumor cells have lost a particular DNA segment while others have not (CNV heterogeneity), or if the size of the lost region varies between tumor cells, we would observe slight gains or losses of signal intensities that do not necessarily fall into a predefined HMM state. There is an increasing body of evidence indicating that tumor heterogeneity is an important characteristic of most cancers [225]. Another factor that can result in an ambiguous state is sample impurity, for example, when the pathologist has not been successful in removing surrounding normal tissue, or when the tumor sample itself is an admixture of cancer and normal cells. These properties of biological samples, which make it difficult, if not impossible, to accurately define certain copy number states, together with the dependency of the HMM model on the predefined number of copy number states, suggest that HMM-based models may not be suitable for CNV analysis in cancer studies. Also, as previously explained, HMM predictions are dependent on the transition probabilities that are estimated from the training data. However, results from cytogenetics and other chromosomal analysis techniques have shown that the patterns of copy number changes are substantially different between diseases (e.g., lymphoma versus mental retardation). Therefore, it is reasonable to assume that an HMM trained on validated copy number gains and losses from a particular training set would be more sensitive towards identifying CNVs in samples with similar properties (for example, with similar sample heterogeneity).
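To make the HMM machinery described above concrete, the following toy three-state copy-number HMM (loss/neutral/gain) is decoded with the Viterbi algorithm. All parameters here (state means, a shared SD, a 0.98 self-transition probability) are invented for illustration and do not correspond to any of the cited methods.

```python
# Toy copy-number HMM decoded with Viterbi; all parameters are illustrative.
import math

STATES = ("loss", "neutral", "gain")
MEANS = {"loss": -1.0, "neutral": 0.0, "gain": 0.58}  # assumed log2-ratio means
SD = 0.25          # assumed emission SD (Gaussian emissions)
STAY = 0.98        # assumed probability of remaining in the current state

def _log_emit(state, x):
    """Log of a Gaussian emission density for one observed log2 ratio."""
    return -0.5 * ((x - MEANS[state]) / SD) ** 2 - math.log(SD * math.sqrt(2 * math.pi))

def viterbi(log2_ratios):
    """Most likely copy-number state sequence for a series of log2 ratios."""
    move = math.log((1 - STAY) / (len(STATES) - 1))
    stay = math.log(STAY)
    prev = {s: math.log(1 / len(STATES)) + _log_emit(s, log2_ratios[0]) for s in STATES}
    back = []
    for x in log2_ratios[1:]:
        cur, ptr = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: prev[p] + (stay if p == s else move))
            cur[s] = prev[best_prev] + (stay if best_prev == s else move) + _log_emit(s, x)
            ptr[s] = best_prev
        prev, back = cur, back + [ptr]
    path = [max(STATES, key=lambda s: prev[s])]
    for ptr in reversed(back):        # backtrace from the best final state
        path.append(ptr[path[-1]])
    return path[::-1]
```

The high self-transition probability is what enforces the "neighbouring SNPs share a state" assumption; lowering it makes the decoder switch states on single noisy probes, which is exactly the initialization sensitivity discussed above.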
Furthermore, if the transition probabilities are not properly initialized, there is a high risk of the algorithm getting stuck in a local minimum, resulting in improper training of the HMM, which further complicates HMM initialization. The aforementioned limitations of HMMs (training dependency and initialization of states) emphasize that, prior to using a computational method to analyse biological data, we need a thorough knowledge of the biological properties of the data, as well as of the requirements of the computational methods, in order to select a model that is likely to generate more accurate results. Another popular approach for CNV calling is segmentation of log2-ratio copy number data [226, 227]. As is the case with HMMs, there are different kinds of segmentation models, many of which were originally developed for CNV analysis of cDNA and BAC CGH arrays. The common assumption underlying all copy number segmentation methods is that CNVs occur in contiguous regions of the chromosomes, often spanning multiple probes. Based on this hypothesis, segmentation methods attempt to split the chromosomes into regions of equal copy number. The average (or median) log2-ratio of each segment is then used to identify candidate CNVs. Various segmentation methods have been proposed, such as SW-Array (Price et al. [228]) and CGHseg (Picard et al. [229]). A popular segmentation method is Circular Binary Segmentation (CBS; Olshen et al. [189]). This non-parametric method is a modification of binary segmentation (developed by Sen and Srivastava [230]), with improved sensitivity towards small variants that may otherwise be obscured within larger segments. CBS assumes that copy number data may be noisy and that, as a result, some probes do not reflect the true copy number in the test sample.
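The recursive splitting idea behind binary segmentation can be sketched as follows. This simplified version scores each candidate change-point with a t-like statistic and uses a fixed cutoff; real CBS instead uses a circular statistic with permutation-based p-values, so both the score and the cutoff here are illustrative assumptions.

```python
# Simplified, permutation-free sketch of binary segmentation: recursively
# split at the change-point with the largest scaled mean difference.
from statistics import mean

def best_split(x):
    """Return (index, score) of the split maximizing the mean-difference score."""
    n, best_i, best_score = len(x), None, 0.0
    for i in range(2, n - 1):
        left, right = x[:i], x[i:]
        # scale the mean difference by an effective sample size
        score = abs(mean(left) - mean(right)) * (len(left) * len(right) / n) ** 0.5
        if score > best_score:
            best_i, best_score = i, score
    return best_i, best_score

def segment(x, start=0, cutoff=2.0, out=None):
    """Recursively collect (start, end, mean) segments of equal copy number."""
    if out is None:
        out = []
    i, score = best_split(x) if len(x) >= 4 else (None, 0.0)
    if i is None or score < cutoff:
        out.append((start, start + len(x), mean(x)))
    else:
        segment(x[:i], start, cutoff, out)
        segment(x[i:], start + i, cutoff, out)
    return out
```

The recursion stops when no candidate split is strong enough, mirroring CBS's behaviour of dividing a region only while a significant breakpoint remains.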
Since this algorithm does not make any assumption regarding the distribution of the data (it is non-parametric), it provides a natural way to segment a chromosome into contiguous regions by recursively applying a statistical test to detect significant breakpoints in the data, and it continues to divide a region into segments until it no longer finds a segment that is different from the neighbouring regions (or until it reaches a maximum number of permutations). The CBS change-point detection method is designed to identify all the change-points that partition the chromosome into segments with the same (log2-ratio) copy number. Due to its completely non-parametric treatment of the array data, CBS is potentially one of the most robust CNV calling algorithms [231, 232]. A comparison by Lai et al. [231] of 11 different methods for analysis of aCGH data found that CBS was among the top 2 algorithms with the best performance under various conditions. Another independent study by Willenbrock and Fridlyand [232] compared 3 Bioconductor packages: DNAcopy [189] (based on CBS segmentation), the aCGH software (based on an HMM) and GLAD [210] (based on adaptive weights smoothing), and found that CBS-based DNAcopy [189] had the best performance in terms of its sensitivity and FDR for breakpoint detection [232].

1.4 Limitations of Current CNV Detection Algorithms for SNP Data Analysis

Regardless of the algorithm used, the variation in hybridization intensity measurements from the array can impact the reliability of CNV detection. The high rate of variability in signal intensity outputs increases the number of regions that are mistakenly identified as CNVs, resulting in increased false positive rates. An excellent example of this undesirable effect is the overpopulation of apparent CNVs that has been observed in the Database of Genomic Variants (DGV) in several independent publications [3, 141, 168].
In addition to increasing false positive rates, such variability can hamper our sensitivity to detect true CNVs and subsequently increases the false negative rate of CNV identification. Parametric approaches mitigate high rates of false positives by applying more stringent criteria, often requiring that a significant shift in signal intensity outputs be detected in multiple neighbouring SNP probes before a CNV is identified. This approach inevitably under-identifies small CNVs and CNVs that occur in regions of the genome with low probe density, such as segmental duplications [141]. The main caveat of such parametric approaches is that the discovery procedure emphasizes specificity over sensitivity; as a result, despite a general reduction in false positive calls, the detection power is dependent on probe counts and the computational methods are not sensitive enough to detect small CNVs.

A study of CNV hotspots in a general population published by Itsara et al. [141] illustrates the impact of the aforementioned computational limitations on CNV measurements: the authors suggested that their reported CNV findings significantly underestimated the number and size of small copy number aberrations, and elaborated that this shortcoming was due to the dependency of their CNV detection algorithm on the number of probes [141]. Generally, while large CNVs (> 4 Mb) are routinely identified by most of the available algorithms, when it comes to identifying small CNVs or variants located in regions with reduced probe density (< 8-10 SNP probes), these algorithms are no longer consistent. The choice of reference set sample size (the number of reference samples required by a particular algorithm) is another bioinformatics challenge, especially in cancer analysis, where it is often desirable to perform a pairwise analysis of tumor and matching normal DNA to identify somatic CNVs.
Meanwhile, some methods (such as CNAT [121] and GLAD [210]) depend on smoothing the variation of the intensity data by averaging the reference signal over multiple normal individuals and therefore do not allow pairwise analysis. It is clear that developing a method that can reduce both false positive and false negative CNV calls will have a major impact on the accuracy and reliability of CNV findings from SNP array data. As mentioned earlier, controlling false positive/negative rates also depends on reducing the associated noise (non-specific variation) in the SNP intensity outputs. These factors imply that optimal CNV detection requires an algorithm with the following two components: (1) a preprocessing phase that filters unwanted noise from the array's raw signal intensity readouts, and (2) a non-parametric CNV calling approach that can translate the intensity information into relative copy number change and apply statistical methods to identify locations of copy number gain or loss. Nonetheless, very little work has been done to address the associated noise as an independent module [218, 233, 234]. Instead, algorithms most often tend to adjust for the impact of the unwanted variation by modifying the downstream CNV calling algorithm, which consequently results in the high false positive/negative rates discussed earlier [140]. In conclusion, the high level of noise associated with SNP oligonucleotide data is still a major limitation in identifying true CNVs and their boundaries [175, 235].

1.5 Thesis Objectives and Hypothesis

Copy number gains and losses have been shown to be associated with complex human diseases, such as developmental abnormalities [26–32] and cancer [33–36], as well as with increased susceptibility to several diseases (such as Parkinson's disease and HIV) [52, 236]. In addition to their role in human disease, CNVs are also a major source of genome diversity in unaffected human populations [18–20, 22, 24, 25, 66, 69, 161, 194, 237].
It follows that identifying and characterizing these variants is critical to our understanding of genome structure and function and to the application of genomics to human disease, including the development of personalized genomic tools. Based on literature reviews of copy number detection methods and algorithms, it is well accepted that oligonucleotide microarray noise is a major source of false positives and negatives in CNV results, and thus a key factor in the underestimation of small CNVs [139, 141, 175, 176]. Considering the current limitations of available CNV calling algorithms (discussed in Section 1.4), it is hypothesized that the currently available methods have significant statistical biases that result in lower CNV detection accuracy. This leads to lower sensitivity for detecting small aberrations, an issue which has been raised by several independent groups [141, 238, 239]. A study of global variation in 270 normal human genomes by Redon et al. [69] using the Affymetrix 500K SNP array found that, on average, 206 kb of the genome is affected by copy number variation in each individual. Comparing the latter finding with emerging evidence that highlights the prevalence of small CNVs, between 10-100 kb [20, 22, 24, 240], emphasizes that many smaller variants are missed at the current effective resolution of the arrays. An underlying assumption of this thesis is that a portion of these events can be successfully identified if proper statistical methods are used to analyse the array data. The ability to detect such events carries the potential to discover small CNVs that are associated with human disease or predisposition. For instance, it has been shown that small deletions, between 70 bp and 7 kb, of the MTUS1 tumor suppressor gene are associated with a decreased risk of familial and high-risk breast cancer [238, 241].
The general aim of this thesis was to develop new computational methods to facilitate analysis of Affymetrix SNP microarray data and to design a method with improved accuracy for identifying copy number variant regions. In particular, I focused on developing tools that facilitate identification of CNVs based on non-parametric approaches for improving the quality of SNP array data at the individual oligonucleotide probe-level. To address these challenges, I took advantage of developments in CGH microarray technology and non-parametric statistical methods to develop a novel approach to identify CNVs, and applied this method to study CNVs in follicular lymphoma patients. By developing these non-parametric probe-specific methods, my goal was to improve the accuracy of CNV detection, particularly for smaller events [141]. The main hypotheses of this thesis were as follows: (1) detection of candidate CNVs using current SNP microarray methods is greatly dependent not only on the density of the probes but also on the number of probes within the candidate CNVs, and therefore (2) current analysis methods largely underestimate the extent of small CNVs. Furthermore, (3) the number of candidate CNVs reported using current methods is largely dependent on the level of array-wide variation of signal intensities (SD), which yields an increased chance of false positive calls. Finally, (4) non-parametric probe-level analysis of SNP arrays allows identification of true CNVs that may be important in disease or disease progression.

1.6 Chapter Summaries

A brief summary of the analyses that I developed and implemented to identify CNVs (Chapters 2 and 3) and two examples of applying this method to identify CNVs in human cancer (follicular lymphoma, Chapter 4) is provided below. In Chapter 2, I focused on evaluating the variability of hybridization intensity outputs from Affymetrix GeneChip SNP arrays.
To perform this analysis, I first assessed the technical variability of SNP arrays by analysing the intensity readouts from 11,564 SNP probes (10K array) in a replicate study consisting of 72 experiments from 8 individuals. The result of this analysis indicated that Affymetrix SNP array technology is highly reproducible (coefficient of variability, CV = 5.16% for chip variability and CV = 6.3% for labeling variability). Next, I combined statistical theory with the reproducibility measured from empirical data to predict the likelihood of observing 2 or more random probes on the array with k-fold differences in their hybridization intensity (k ≥ 2). The aim of this chapter was to detect possible sources of variation in SNP array data; based on the results presented in this chapter, I concluded that the non-specific variation between the performance of individual oligonucleotide probes is a major contributor to the overall microarray noise. The replicate experiment used in this chapter was designed and performed by collaborators at Affymetrix. The aim of the work presented in Chapter 3 was to develop and implement an enhanced algorithm to reduce the probe-level variability in SNP array data, in order to assess whether such an approach would provide improved accuracy for detecting CNVs. The specific aims of this chapter were (1) to develop an optimized approach for estimating SNP signal intensity, and (2) to implement an accurate CNV calling model to improve the sensitivity and precision of predicted CNVs. To develop this model, I took advantage of several non-parametric statistical methods that had previously been used in other applications, such as speech signal processing, geology and array CGH technology. The resulting algorithm, called Oligonucleotide Probe-level Analysis of Signal (OPAS), involves two major components: probe-level analysis and SNP-level analysis.
I then designed and implemented OPAS visualization software and an associated pipeline that facilitates automatic sample import and high-throughput sample analysis. An advantage of this approach is that most of the individual modules can be easily extracted from the source code and applied to analyse data from other platforms, for example, Illumina SNP arrays and sequence-based read depth data.

In the pre-processing phase of the algorithm, probe-level analysis is conducted. In this phase, the intensity of each PM oligonucleotide probe (or oligo) is analysed to identify noisy probes and subsequently eliminate them from data analysis. To achieve this goal, within each SNP probe set the PM oligos are categorized into groups or clusters with similar intensity patterns. To facilitate probe set clustering of PM oligos, I proposed a new clustering approach, referred to as Fuzzy-Kmeans Clustering, based on combining two well-known clustering methods: the optimization-based k-means and the subtractive fuzzy-logic-based clustering algorithms. Next, the non-parametric KS-test is applied to each determined cluster of oligos (referred to as an "oligo cluster") to evaluate the likelihood that the intensity pattern of the PM oligos in that cluster represents a significant shift in the signal intensity. These KS-generated probabilities and the oligo cluster information are then passed to a machine-learning algorithm to identify the most significant oligo cluster(s) within each SNP probe set. Subsequently, the mean log2-ratio intensity of the oligos in the most significant oligo cluster(s) is estimated and used to represent the SNP log-ratiometric value.

In the post-processing phase of the algorithm, SNP-level analysis is performed. In this step, I first apply GC-fragment length normalization to minimize the effect of fragment length biases on the estimated SNP readouts through a non-linear LOWESS normalization method.
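These two phases can be illustrated with a toy sketch. Everything below is my own simplified stand-in, not the OPAS implementation: a plain 1-D k-means replaces the Fuzzy-Kmeans step, the KS-test is computed against the array-wide background, and a binned running median replaces the non-linear LOWESS fragment-length fit; all function names and thresholds are illustrative:

```python
import numpy as np

def ks_2sample_pvalue(a, b):
    """Two-sample KS test with the asymptotic (Stephens) p-value, kept
    dependency-free here; scipy.stats.ks_2samp computes the same statistic."""
    a, b = np.sort(a), np.sort(b)
    data = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, data, side="right") / len(a)
    cdf_b = np.searchsorted(b, data, side="right") / len(b)
    stat = np.max(np.abs(cdf_a - cdf_b))
    en = np.sqrt(len(a) * len(b) / (len(a) + len(b)))
    t = (en + 0.12 + 0.11 / en) * stat
    j = np.arange(1, 101)
    p = 2.0 * np.sum((-1.0) ** (j - 1) * np.exp(-2.0 * (j * t) ** 2))
    return stat, float(min(max(p, 0.0), 1.0))

def significant_oligo_clusters(probe_lr, background_lr, n_clusters=2, alpha=0.01):
    """Probe-level phase (toy): split one probe set's PM oligo log2-ratios
    into intensity clusters with 1-D k-means, then KS-test each cluster
    against the array-wide background. Returns (mean, p-value, significant?)."""
    x = np.asarray(probe_lr, float)
    centres = np.quantile(x, np.linspace(0.1, 0.9, n_clusters))
    for _ in range(25):
        labels = np.argmin(np.abs(x[:, None] - centres[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centres[c] = x[labels == c].mean()
    out = []
    for c in range(n_clusters):
        members = x[labels == c]
        if len(members) >= 3:
            _, p = ks_2sample_pvalue(members, np.asarray(background_lr, float))
            out.append((members.mean(), p, p < alpha))
    return out

def fragment_length_normalise(lr, frag_len, n_bins=20):
    """SNP-level phase (toy): subtract the fragment-length intensity trend.
    A binned median stands in for the non-linear LOWESS smoother."""
    lr = np.asarray(lr, float)
    frag_len = np.asarray(frag_len, float)
    edges = np.quantile(frag_len, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, frag_len, side="right") - 1, 0, n_bins - 1)
    trend = np.array([np.median(lr[idx == b]) if np.any(idx == b) else 0.0
                      for b in range(n_bins)])
    return lr - trend[idx]
```

A probe set whose deleted-allele oligos sit around LR ≈ −1 while the remaining oligos hug the baseline would yield one significant oligo cluster, whose mean log2-ratio then represents the SNP's log-ratiometric value.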
Next, in order to identify regions of the genome with copy number alterations, the algorithm applies the non-parametric Circular Binary Segmentation (CBS) [189, 242] CNV calling method to the pre-processed SNP data. It is important to note that, while many of the components of this algorithm had previously been developed for other applications (e.g., CBS was originally designed for aCGH), adapting these methods to SNP microarray data required a largely novel implementation to accommodate a different data type.

To facilitate high-throughput data analysis, I designed and implemented OPAS software that automatically generates a relevant record of the sample analysis. Upon sample import, OPAS generates separate image and data folders for each sample, named according to the sample file name (and date of analysis, if the name already exists). During sample analysis, OPAS automatically creates a comprehensive catalogue of the graphs and data for each step of the sample processing, from normalization and mean/intensity (MA) plots to visualization of CNVs in each chromosome along the chromosome ideogram, and subsequently saves these images and data in the pre-designated sample folders (see page 95). The purpose of this enhanced sample recording is to provide a useful sample tracking method for future follow-ups. Furthermore, despite the fact that all the components of the OPAS algorithm were based on non-parametric analytical techniques, there were still a few assignable parameters that could affect the algorithm's performance. Thus, I extensively tested each module to better assign these parameters and to improve the quality and speed of OPAS. The samples used in the work presented in this chapter were provided by Dr. Jan Friedman, and the "wet-lab" work (sample preparation and Affymetrix experiments) was performed at the Genome Sciences Centre. Throughout this chapter, I also use validated CNV results of the mental retardation project by Dr.
Friedman, which has already been published [29, 30, 215], to test the performance of my proposed method in detecting known CNVs.

In Chapter 4, I present the results of applying OPAS to detect somatic CNVs (present in the tumor but not the matching normal DNA) in 25 follicular lymphoma (FL) patients. Follicular lymphoma is the most prevalent type of non-Hodgkin lymphoma (NHL), the cancers of the lymph nodes. In Canada, non-Hodgkin lymphomas accounted for about 7,500 new cases of cancer in 2010 (making them the fifth most common cancer) and an estimated 3,200 deaths. The statistics also indicate an increasing rate of incidence of NHL among young women aged 20-39 [243]. More importantly, follicular lymphomas frequently transform to a more rapidly progressive, invasive and lethal cancer, diffuse large B-cell lymphoma or DLBCL (10-year survival < 20%). Cytogenetic abnormalities are common characteristics of FL genomes [244–248]. A genetic hallmark of follicular lymphoma is the recurrent chromosomal translocation t(14;18)(q32;q21), which is present in approximately 85-90% of patients [245, 249, 250]. As a result of this translocation, a part of chromosome 14 involving the enhancer of the immunoglobulin heavy chain (IGH) locus moves to chromosome 18 and into the proximity of the BCL2 anti-apoptotic gene, resulting in BCL2 over-expression [245, 249–251]. However, transgenic mice with BCL2 overexpression do not develop lymphoma [252, 253], and t(14;18)-bearing lymphocytes have also been reported in healthy individuals [254, 255]. These findings suggest that t(14;18) alone is not sufficient to produce clinical FL [256–258]. I hypothesized that the OPAS algorithm could improve CNV discovery in FL patients and could enhance our understanding of FL genetics. To test this hypothesis, I looked for candidate somatic aberrations in 25 FL genomes by performing a pairwise analysis of Affymetrix 500K data from tumor and matched normal DNA.
In total, I identified 286 somatic CNVs (11.4 per patient), of which 18.5% (53/286) were smaller than 150 kb and 14.3% (41/286) encompassed fewer than 10 SNP probe markers. To assess the accuracy of the putative CNV calls, I compared OPAS findings in 23/25 patients with the results from several alternative technologies and methods that were applied to study the same patients, including the Significance Mean Distance method (SMD [259]) based on 500K SNP arrays, BAC aCGH, BAC-end fingerprint profiling (FPP), and Illumina sequencing data. This comparison indicated that 87.9% of all OPAS-predicted CNVs were seen in at least one additional dataset, while most of the remaining candidate events had no corresponding FPP or sequence information that could reject them. One of the sequence-validated CNVs that was below the detection power of standard SNP calling methods, and thus was previously undetected with SNP arrays, was a ∼104 kb somatic deletion on 9p21.3 which included only 8 SNP probes. This deletion harbored the CDKN2A gene, which is frequently deleted in many cancers [260, 261], for instance, in an aggressive subset of cutaneous T-cell lymphomas [262]. Another interesting finding was a deletion, approximately 143 kb in length, on chromosome 12q24.11 that encompassed only 4 Nsp SNP probes and was validated using BAC end sequence data. This deletion affected the HVCN1 and PPTC7 genes. A recent publication on the voltage-gated proton channel HVCN1 gene suggested that it modulates the B cell antigen receptor [263]. The deletion of 12q24.11 removes all but the first exon of HVCN1 and the first 3 coding exons of PPTC7, juxtaposing the HVCN1 promoter to the remaining PPTC7 exons. Therefore, this deletion can potentially create a novel fusion gene between the 5' end of HVCN1 and the 3' end of PPTC7.
Both of the deletions found (CDKN2A and HVCN1-PPTC7) were validated, although neither was previously identified when the same 500K data were analysed using the SNP analysis method described in [29, 30]. These findings confirmed that the accuracy of calling CNVs can be improved by performing a probe-level analysis of the array data and applying non-parametric data mining methods to analyse the data.

Other advantages: During the past several years, there has been a major breakthrough in massively parallel sequencing technologies, which offer the promise of detecting whole-genome structural aberrations at base-pair resolution. The two main aspects of my research that I believe are relevant to emerging sequencing technologies are: (1) methods currently developed for detecting CNVs from sequencing data are in their infancy, and many non-parametric modules that I developed and/or implemented in this work can be easily extracted and modified for detecting CNVs using sequencing data (for example, for analysing RD coverage data; Section 1.2.3, p. 10), and (2) during the past several years, thousands of projects have been completed using SNP arrays to study CNVs in thousands of individuals affected with cancers, developmental abnormalities and rare diseases, as well as unaffected "normal" individuals. The raw array data files (.CEL) of many of these projects are accessible online (for example, the HuRef [163] replicate 500K arrays can be downloaded; see Table 1.3 for more examples). Re-analysis of these data sets with an enhanced algorithm, such as OPAS, can result in discovering small recurrent and potentially pathogenic aberrations, well below the sensitivity thresholds of other algorithms. Considering the relatively low cost of computational array analysis with respect to sample preparation and processing, OPAS re-analysis of the extant data sets will create new discovery opportunities at minimal cost.
Throughout my research, I have generated many tools to facilitate data acquisition, sequence interpretation, statistical inference and visualization of genomic data. In order to keep my developed tools reusable and traceable, I generated user-friendly Graphical User Interfaces (GUIs) for most of these tools. For instance, the DLOH software and GUI were designed to interrogate SNP genotype data in order to identify regions of loss-of-heterozygosity (LOH). DLOH also generates a statistical score for each LOH region to predict potential regions of copy number deletion. I presented DLOH at the Advances in Genome Biology and Technology (AGBT) conference in 2008.

In addition to the work described in this thesis, I have been involved in several other collaborative projects at the Genome Sciences Centre (GSC) which have resulted in publications or presentations. I used non-parametric statistical tests to assess the impact of whole genome amplification on analysis of copy number variants (Pugh et al. [264]). In this work, I compared the distribution of data from paired pre- and post-whole-genome-amplified (WGA) samples for 3 separate DNA samples and identified apparent WGA-induced over- and under-amplifications in each of the three comparisons of amplified versus unamplified material. I also collaborated with Dr. Maziar Rahmani in analysing data and preparing genome-wide visualization of markers in several genome-wide association studies (GWAS). Based on my results, several novel risk loci were chosen in patients with calcific aortic valve stenosis, which were presented at the Canadian Human Genetics Conference [265] and the National Research Forum for Young Investigators in Circulatory and Respiratory Health [265]. More recently, a large-scale study has been initiated by Dr. Rahmani at the Genome Sciences Centre, and I will apply the same approach to identify potentially significant markers, which will then be genotyped for validation and presented in a future manuscript.
In conclusion, whole-genome analysis techniques have provided us with the means to understand the extent of the variation in normal and disease genomes. Nonetheless, the past several years have yielded rapid developments in sequencing technology, creating a field of investigation that is transforming both our concept of the human genome and its application to clinical practice. I believe that merging the vast amount of data available from microarrays and recent sequence-based studies carries the promise of understanding the mechanisms by which disease and normal genomes diverge. However, gaining this knowledge is largely dependent on the use of improved computational methods that can provide accurate analysis of these data.

1.7 Figures and Tables

(a) Karyotype of an FL patient (48,XY,+X)   (b) M-FISH of the same patient

Figure 1.1: Karyotype and M-FISH chromosome analysis of a patient with follicular lymphoma (FL). The karyotype and M-FISH analysis of an FL patient are shown in panels (a) and (b), respectively. Both techniques indicate several microscopically detectable chromosomal alterations in this patient, including extra copies of chromosomes 7 and X (48,XY,+X,+7). Karyotype and M-FISH results can also detect balanced translocations, such as t(14;18)(q32;q21) in the above FL patient (figures provided by Dr. Horsman's lab at the BCCRC).

Figure 1.2: Structure of SNP probe sets in Affymetrix GeneChip® SNP arrays. In Affymetrix technology, each SNP on the array is represented by a collection of probe quartets, also known as a SNP probe set. A probe quartet consists of a set of 25-mer oligonucleotide "probe pairs" for the two most common alleles (known as 'A' and 'B') and for both forward and reverse strands (antisense and sense) of the SNP. Each probe pair consists of a perfect match (PM) probe and a mismatch (MM) probe.
The Affymetrix chip design strategy is to use a set of these PM/MM probe pairs to interrogate the bases surrounding each SNP on the forward and/or reverse strand for both alleles. At the top of the figure, the 'A' and 'B' alleles of a given SNP are shown by pink and yellow beads in the middle of the green sequence. The 4 rectangles below the depicted sequence represent the probe quartets for the forward and reverse strands of allele 'A' (red rectangles) and allele 'B' (yellow rectangles). As shown in this figure by blue beads ('G' and 'A'), the sequence of the MM probes is the same as that of the PM probes, except for a single nucleotide difference. Furthermore, within a probe set there are additional probes to interrogate the neighboring bases of the SNP locus, known as "offset probes". The position of the probes within a probe set is typically denoted by (-4, -2, -1, 0, +1, +2, +4), where the integer n refers to the base position being interrogated by the offset probes relative to the SNP position (n = 0). The bottom panel illustrates two offset probes (n = −1 and n = +4) of a given SNP sequence. The number of designated oligonucleotide probes in a probe set varies in each generation of the Affymetrix GeneChip SNP arrays. In the 10K and 100K arrays, a probe set consists of 40 PM/MM oligonucleotide probes. Thus, for example, the 100K SNP array has in total more than 4.5 million features (oligonucleotide probes) on the array. The Nsp (∼262,000 SNPs) and Sty (∼238,000 SNPs) arrays from the Affymetrix 500K dual-array set have a combined density of more than 500,000 SNP sites with 20 or 40 PM/MM probe pairs per SNP locus, resulting in > 3 million features on each array and a total of ∼6,071,040 features.
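The probe-set organization described in this caption can be modelled with a small data structure. The sketch below is purely illustrative; the class names are mine, and the quartet counts in the usage example are hypothetical rather than the layout of any particular chip:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ProbePair:
    pm: str                 # 25-mer perfect-match sequence
    mm: str                 # same 25-mer with the interrogated base substituted

@dataclass
class ProbeQuartet:
    strand: str             # 'sense' or 'antisense'
    offset: int             # interrogated position relative to the SNP (0 = SNP)
    allele_a: ProbePair     # probe pair matching allele 'A'
    allele_b: ProbePair     # probe pair matching allele 'B'

@dataclass
class SNPProbeSet:
    snp_id: str
    quartets: List[ProbeQuartet]

    def n_features(self) -> int:
        # each quartet contributes 4 features: PM and MM for both alleles
        return 4 * len(self.quartets)

# Hypothetical probe set: offsets -4, 0, +4 on both strands -> 6 quartets
pair = ProbePair(pm="A" * 25, mm="A" * 12 + "T" + "A" * 12)
quartets = [ProbeQuartet(strand=s, offset=o, allele_a=pair, allele_b=pair)
            for s in ("sense", "antisense") for o in (-4, 0, 4)]
probe_set = SNPProbeSet(snp_id="SNP_X", quartets=quartets)
```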
Table 1.1: Aneuploidies and large-scale CNVs associated with human disease

Name of Abnormality    | Type            | Chromosomal Variation                             | Cytogenetic Representation
Turner syndrome        | aneuploidy      | loss of entire X chromosome                       | (45,X) or (45,X0)
Klinefelter syndrome*  | aneuploidy      | gain of entire X chromosome                       | (47,XXY)
Edwards syndrome       | aneuploidy      | trisomy of chromosome 18                          | (47,+18)
Down syndrome          | aneuploidy      | trisomy of chromosome 21                          | (47,+21)
Patau syndrome         | aneuploidy      | trisomy of chromosome 13                          | (47,+13)
Cri du chat            | large-scale CNV | loss of the short arm of chromosome 5             | 46,XX,del(5)(p15.2)
1p36 Deletion syndrome | large-scale CNV | loss of a region on the short arm of chromosome 1 | deletions on 1p36
Angelman syndrome      | large-scale CNV | loss of ∼4 Mb of the long arm of chromosome 15†   | 46,XX,del(15)(q11-q13)

* the most common male chromosomal disease
† observed in 50% of patients with Angelman syndrome

Table 1.2: Comparison of array specifications in 4 generations of Affymetrix SNP arrays

Array | Enzymes       | SNPs      | Probe Pairs | Quartets | Features   | Median IMD* (kb) | Average IMD (kb) | Coverage
10K   | XbaI          | 10,204    | 40          | 14       | 647,080    | 113              | 258              | at least 1 SNP per 100 kb
100K  | XbaI, HindIII | 116,204** | 40          | 10       | 4,648,160  | 8.5              | 23.6             | 92% of genome within 100 kb of a SNP; 40% within 10 kb of a SNP
500K  | NspI, StyI    | 500,568†  | 24, 40      | 6-10     | 12,013,632 | 2.5              | 5.8              | 85% of genome within 10 kb of a SNP

* IMD: Inter Marker Distance
** There are 58,960 SNP probes on XbaI and 57,244 on HindIII arrays.
† There are ∼262,000 SNP probes on NspI and ∼238,000 on StyI arrays.

SNP 6.0 (NspI/StyI) has major differences from the three previous generations of Affymetrix SNP arrays:
• combines the NspI and StyI fractions that were previously assayed on two separate arrays
• contains 906,600 SNP probes in addition to 945,826 non-polymorphic copy number (CN) variation probes (∼744,000 were selected for their spacing and ∼202,000 were based on known copy number changes‡)
• 3-4 replicate perfect match (PM) probes per SNP probe (CN probes have no replicate)
• median inter-marker distance = 2180 bp (CN), 1270 bp (SNP), 680 bp (SNP + CN)
• average inter-marker distance = 3160 bp (CN), 3230 bp (SNP), 1600 bp (SNP + CN)

‡ based on the Toronto Database of Genomic Variants (DGV) [168]

Table 1.3: A partial list of Affymetrix SNP array data (raw .CEL files) that are publicly available

• 270 samples from the HapMap project (raw data/affy500k/)
• 9 tumor/normal pairs derived from several human cancer cell lines, including adenocarcinoma, non-small cell lung carcinoma and primary ductal carcinoma (sample data/copy number data.affx)
• GlaxoSmithKline (GSK) has made available the 500K SNP data for over 300 cancer cell lines from 30 different tissue types in a wide range of cancers, including small cell lung carcinoma, neuroblastoma, lymphoma and glioblastoma (GSKdata/)
• Marshall et al. (2008; [50]) provided the 500K .CEL files of 1318 individuals with Autism Spectrum Disorder (ASD) (cgi?acc=GSE9222)
• Chiang et al. (2009; [266]) provided the Mapping250K-Sty array data for 77 replicates of HCC1143 (breast ductal carcinoma), 69 replicates of HCC1143BL (matched normal), 42 replicates of HCC1954 (breast ductal carcinoma), 36 replicates of HCC1954BL (matched normal) and 1 replicate of NCI-H2347 (lung adenocarcinoma) (cancer/publications/pub paper.cgi?mode=view&paper id=182)
• Solomon et al. (2008; [267]) provided Affymetrix Mapping250K-Nsp array data for 58 glioblastoma multiforme tumor samples (cgi?acc=GSE13021)
• 161 primary breast cancer samples (500K array) reported by Kadota et al. (2009; [268]) (cgi?acc=GSE16619)
• 141 gliomas and 33 normal tissue samples (100K array), used by Beroukhim et al. (2007; [269]) (cancer/publications/pub paper.cgi?mode=view&paper id=162&p=t)
• 768 Affymetrix Mapping250K-Sty data files for 384 tumor/normal lung adenocarcinoma pairs for Weir et al.
(2007; [270]) (tsp/)

Chapter 2

Analysing Variability in Microarray Data

2.1 Introduction

Oligonucleotide microarrays are powerful tools that enable high-throughput measurement of DNA copy number alterations across the entire genome. One application of oligonucleotide microarrays is copy number profiling, which has led to significant advances in the understanding of complex diseases and the discovery of submicroscopic aberrations in genes that are associated with or causative of diseases such as mental retardation [271, 272] and cancer [273]. A major concern for the identification of CNVs using oligonucleotide microarray technology is how a putative copy number change is defined. A plethora of different methods have been used to call changes in DNA copy number from the relative ratio intensity outputs of these arrays, ranging from simple preset thresholds [213] to complex statistical modelling [212]. Despite their differences, the fundamental principles underlying these approaches often include one (or a combination) of hidden Markov models (HMMs) [185, 186], segmentation algorithms [189, 210, 242], or t-tests and standard deviations (SDs) of the log2-ratio intensities (referred to as LR in this thesis) [212]. Regardless of how the data are analysed, non-specific variations that are due to assay variability will degrade the accuracy of the downstream CNV results. One drawback of such non-specific variation is that the incorporated noise can push the measured signal intensities past a predefined threshold of CNV calling, leading to false positive and false negative CNV results. To circumvent the false positive problem, algorithms often apply more stringent CNV calling criteria (for example, by raising the LR thresholds for calling significant deviations from the baseline and/or increasing the minimum number of consecutively shifted SNPs required for calling CNVs).
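A minimal sketch of the simple preset-threshold calling contrasted above with model-based methods might look like the following; the threshold and minimum-probe values are arbitrary illustrations, not recommendations from the thesis:

```python
import numpy as np

def threshold_cnv_calls(lr, threshold=0.3, min_probes=5):
    """Flag runs of at least `min_probes` consecutive SNPs whose log2-ratio
    (LR) deviates from the baseline beyond `threshold` in the same direction.
    Returns (start_index, end_index, 'gain'/'loss') tuples."""
    lr = np.asarray(lr, float)
    # per-SNP state: +1 above threshold, -1 below, 0 within the baseline band
    state = np.where(lr > threshold, 1, np.where(lr < -threshold, -1, 0))
    calls, start = [], 0
    for i in range(1, len(state) + 1):
        if i == len(state) or state[i] != state[start]:
            if state[start] != 0 and i - start >= min_probes:
                calls.append((start, i - 1, "gain" if state[start] > 0 else "loss"))
            start = i
    return calls
```

The trade-off described above falls out of this scheme directly: raising `threshold` or `min_probes` suppresses noise-driven false positives, but any real aberration spanning too few probes, or with too small a shift, is silently dropped.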
This, on the other hand, reduces the ability to identify real aberrations that do not meet the more stringent constraints, and consequently leads to an increased false negative rate. Therefore, despite the importance of technological advances in microarray platform design and array processing during the past decade, it has become increasingly clear that the capability to assess variability associated with array outputs is of paramount significance and is an essential factor for developing analytical tools that can maximize the utility of these platforms [139, 274]. Nonetheless, variability and its possible sources have remained largely unexplored in SNP arrays [140, 233, 275], unlike expression arrays, where a great effort has been invested in understanding the sources of variability among probes for each transcript [207, 234, 276–281]. As a result, the impact of oligonucleotide probe-level variability on SNP data reproducibility, which directly influences CNV analysis and interpretation, has also been largely unexplored. To address this issue, in this chapter I explore different sources of variability in Affymetrix SNP arrays. The specific aims of the work presented in this chapter are: (1) to estimate the relative magnitudes of different sources of variation in Affymetrix SNP array data, and (2) to assess the variability associated with each probe set by modelling variance as a function of intensity. To perform these analyses, I first examine common causes of variation among array results by assessing the reproducibility of the Affymetrix GeneChip® 10K SNP array platform using a replicate data set consisting of 69 arrays from 8 individuals. Next, I present a mathematical model to study the relationship between the empirical variability (%CV) and the theoretical fraction of individual oligos that are expected to differ in their log2-ratio intensity readouts by at least some given factor.
The aim of this model is to incorporate the theoretical probabilities and the empirical variabilities (%CV) to determine the acceptable range of variability across replicate oligos on 10K SNP arrays. In the rest of this chapter, I first explain some of the major sources of variability in microarray data and then discuss the methods I use to analyse variability in Affymetrix SNP arrays and the corresponding results.

2.1.1 Variability in Microarray Data

A typical microarray experiment, regardless of the array platform used, has many different sources of variation [204, 282, 283]. Figure 2.1 illustrates some of the common sources of variability in Affymetrix SNP genotyping arrays [204, 282]. As seen in Figure 2.1, the sources of variation detected by microarrays can be broadly attributed to biological and technical causes. These sources of variability are briefly explained in the following sections (2.1.2-2.1.3).

2.1.2 Biological Variability

At the highest level of the variation hierarchy in Figure 2.1 is population variability or biological variability, a well-known source of variation that exists among normal individuals. Such variability is independent of the microarray experimental process [204]. This type of intrinsic copy number variation among individuals, known as copy number polymorphism (CNP), is responsible for a substantial amount of human phenotypic variation and genome diversity [19, 66, 69, 163, 284–287]. Other studies of CNPs have also discovered several associations between CNPs affecting genes and biological functions related to immunity [46, 288, 289], sensory reception [290, 291] and other phenotypes, such as predisposition to HIV infection [236] and susceptibility to Crohn's disease [47]. Early studies suggested that about 12-18% of the human genome is involved in copy number polymorphisms [19, 69].
However, more recently, several independent large-scale studies of CNPs have consistently concluded that the frequency and size of these variations were largely overestimated in the initial studies [66, 168]. Nonetheless, these biological variations are a major source of variability among normal individuals and a fascinating field of population genetics [292].

2.1.3 Technical Variability

During microarray experiments, many factors can lead to unwanted variation or noise in the generated data, commonly referred to as experimental or technical variation. Such variation in array data affects our ability to identify real copy number changes in the downstream analysis. Technical variability is, therefore, fundamentally different from biological variability, which is an indicator of genetic diversity (Sec. 2.1.2). Multiple sources of variation in a microarray experiment can impact the overall experimental variability, including the array manufacturing process, the DNA or RNA isolation method, sample preparation, target labeling, hybridization and scanning. These sources of technical variability are not platform-specific and are relevant to all commercially available microarrays, albeit to different degrees [204, 293–297]. While the choice of array platform can affect the quality of the obtained results [205, 298, 299], the ability to compare findings within the same platform requires systematic assessment of technical variability. An accurate estimation of the different sources of technical variability is essential not only for understanding how well a microarray platform performs, but also for calibrating the array output signal and improving the sensitivity and specificity of the findings. An extensive amount of research has been carried out in the past decade to discover, understand and quantify unwanted sources of technical variation.
These studies have motivated the development of many advanced computational techniques for microarray data normalization and filtering [180, 205, 296, 300–302]. Figure 2.2 depicts some of the major components of technical variation in microarrays (applicable to both expression and SNP genotyping arrays). The first component depicted in this figure, "the variation between Cy5 and Cy3 colors", is specific to two-color arrays such as CGH-based platforms. In such arrays, two samples (e.g., "test" and "reference") are labeled with different fluorophores (usually cyanine-3 (Cy3) and cyanine-5 (Cy5)) and hybridized together on a single microarray, and thus the difference between the two dyes can affect the integrity of the resultant readouts (refer to Chapter 1 for more details). Such variation does not exist in one-color arrays, such as Affymetrix SNP genotyping chips, where in each experiment only a single sample is hybridized to an array after it has been labeled with a single fluorophore (such as Cy3 or Cy5). This implies that one-color arrays, such as Affymetrix SNP chips, are impacted by fewer sources of variability than two-color arrays. The next source of variability denoted in Figure 2.2 is labelling variability, which is due to differences between sample labelling reactions. Such variability has also been described in other studies as the difference in labelling efficiency across multiple arrays; however, in this chapter labelling variability refers to the technical variation that is due to both the sample preparation and the labelling process. Another aspect of technical variation, commonly observed in large-scale studies, occurs when samples are divided into different groups and each group is processed independently.
Such variation, commonly referred to as batch effects [303–308], has been observed since the earliest microarray experiments [309] and can be caused by many factors, including the batch of amplification reagent used or the hybridization reaction. In contrast to random non-specific variation (noise) in microarrays, batch effects exclusively describe systematic technical differences when samples are processed and measured in different batches [303, 307]. In large-scale studies, practical considerations limit the number of samples that can be amplified and hybridized at one time, so samples may be generated several days or months apart; batch effects will therefore inevitably confound the results of such large-scale studies. Batch effects can potentially mask real biological events, particularly when the average variation between different batches is significantly larger than the biological variation within each batch [307, 308]. Thus, proper monitoring and management of batch effects is essential for extracting relevant biological information [303, 307, 308]. The variability between hybridizations to different arrays, referred to as chip variability, is another major source of technical variability illustrated in Figure 2.2. This variability represents the underlying chip-to-chip variation of the intensity readouts between multiple arrays (of the same platform). As shown in Figure 2.2, chip variability is intrinsically dependent on two other components of technical variability: (1) variability between arrays in relation to the manufacturing process (also referred to as manufacturing variability), and (2) variability between different hybridization reactions. As a result of the intercorrelation between different sources of microarray variability, as illustrated in Figure 2.2, measuring the independent contribution of each source of technical variability is a very difficult, if not impossible, task.
The rationale of the work presented in Chapters 2 and 3 is that an experimental design that takes into account the impact of experimental variability at the level of individual oligonucleotide probes can identify and exclude the noisy oligos and minimize the impact of noise on the downstream CNV detection process. In the next section (Section 2.2), I discuss some of the methods that were used to estimate two major components of variability, chip and labelling variability, among replicate Affymetrix SNP arrays (described in Sections 2.3.3-2.3.4).

2.2 Methods for Measuring and Quantifying Microarray Variability

A common approach for measuring microarray data variability, regardless of the platform, is to perform replicate studies. Microarray experiments can be replicated at the biological and technical levels to assess the two main categories of variation discussed in the previous sections (Sections 2.1.2-2.1.3).

Biological replicates: replicates taken at the level of the population being studied (e.g., copy number polymorphism among normal individuals). Such replicate data sets include samples that are independently obtained from replicate sources (such as multiple cell lines, multiple biopsies or multiple patients). The purpose of studying biological replicates in copy number studies is to evaluate the extent of DNA copy number diversity among normal individuals [19, 66, 69, 163, 284–286, 310], and to measure the possible functional or phenotypic implications of such variations in normal individuals [46, 47, 236, 288–291].

Technical replicates: replicates generated at the level of the experimental process [204]. The purpose of studying technical replicates in Affymetrix SNP arrays is to evaluate the technical variability in the generated log2-ratio intensity readouts from a SNP array, which can directly affect the consistency and reliability of both genotyping and CNV results in the downstream data analysis.
As discussed in detail in Section 2.1.3, technical variability consists of several further components, such as labelling and chip variability. Estimating these aspects of technical variation and their impact on data integrity is an important task when evaluating the performance of any microarray experiment. The focus of the remainder of this Chapter is to present the methods that were used and/or developed to measure the technical variability of Affymetrix SNP arrays, and the results of applying these methods to real data from a replicate SNP array experiment.

2.2.1 Quantifying Technical Variability in Affymetrix SNP Arrays

Quantifying the level of experimental (or technical) variability is common practice before undertaking any large-scale microarray project [204]. Such variability is often evaluated by performing replicate experiments that aim to assess the systematic reproducibility of a particular technology (as explained in Section 2.2) [204]. Below is a description of some of the main computational aspects of analysing and quantifying variability in microarray experiments. It is important to note that these general methods are not platform-specific and can be applied to a variety of different microarrays, such as Affymetrix and Illumina SNP chips or expression arrays [301, 306].

Log-transformation

Often the first step in estimating technical variability on any microarray platform, including Affymetrix SNP arrays, is to transform the raw intensity data of the replicate study into log intensities [204, 311, 312]. It has been shown that, in microarray experiments generally, larger intensities tend to have larger variations [301, 313, 314]. This bias leads to inconsistent variance across the measured range of intensities. Such inconsistency in variance, described in statistics as "heteroskedasticity", poses a serious challenge for analysing variability in any application, including microarray experiments [315, 316].
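To illustrate this point, the following sketch uses simulated data (all probe counts and noise levels here are illustrative assumptions, not values from the experiments in this Chapter) to show that multiplicative noise produces intensity-dependent variance on the raw scale but roughly constant variance after log-transformation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated raw probe intensities with multiplicative (log-normal) noise,
# so the spread grows with the mean: the heteroskedasticity described above.
true_signal = rng.uniform(200, 20000, size=5000)
replicate_1 = true_signal * rng.lognormal(mean=0.0, sigma=0.07, size=5000)
replicate_2 = true_signal * rng.lognormal(mean=0.0, sigma=0.07, size=5000)

bright = true_signal > 10000
dim = ~bright

# On the raw scale, bright probes vary far more than dim probes.
raw_diff = replicate_1 - replicate_2
print(np.std(raw_diff[bright]) > np.std(raw_diff[dim]))

# After log-transformation, replicate differences are roughly Gaussian
# and their spread no longer depends on intensity.
log_diff = np.log(replicate_1) - np.log(replicate_2)
ratio = np.std(log_diff[bright]) / np.std(log_diff[dim])
print(0.9 < ratio < 1.1)
```

The second comparison is the practical payoff: once the variance is stabilized, a single error model can be fitted across the whole intensity range.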
One of the primary reasons for performing log-transformation is that it circumvents this issue by converting the asymmetric distribution of (raw) intensities into a symmetric, Gaussian-like distribution, as illustrated in the example of Figure 2.3. Furthermore, this transformation enables us to apply powerful statistical methods to stabilize variance or measure the underlying variability in a microarray experiment (see the following Section) [301, 311, 312].

Coefficient of Variation (CV)

In statistics, it is common to represent the variability in a data set by the standard deviation (SD) of the data. In microarray applications, however, it is well known that the standard deviation of the intensities is positively correlated with the mean signal intensity of the array [204, 317]. If the variance is proportional to the raw signal intensity of the data, applying a log transformation produces a constant variance across the range of signal intensities on the logarithmic scale. The most important advantage of the log-normal assumption is that it allows different levels of variability to be expressed as a percentage known as the "coefficient of variation" (CV) [204], defined by:

CV = (σ / µ) × 100   (2.2.1)

where σ and µ represent the standard deviation and mean of the input signal, respectively. Based on Equation (2.2.1), the CV is the ratio of the standard deviation of the signal to its mean and serves as an indicator of data variability. A key advantage of using the coefficient of variation (CV) for SNP arrays, instead of the simple standard deviation (SD), is that dividing the standard deviation (σ) by the mean (µ) removes the intensity-specific dependence of the estimated variability.
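As a small illustration, the CV of Equation (2.2.1) can be computed directly, and its scale-invariance (unlike the raw SD) is easy to verify. The intensity value and noise level below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def coefficient_of_variation(x):
    """CV in percent, as in Equation (2.2.1): 100 * sigma / mu."""
    return 100.0 * np.std(x) / np.mean(x)

# Illustrative replicate readouts of a single probe: log-normal noise
# around an arbitrary true intensity of 1000.
readouts = 1000.0 * rng.lognormal(mean=0.0, sigma=0.05, size=100_000)

cv_original = coefficient_of_variation(readouts)
cv_scaled = coefficient_of_variation(10.0 * readouts)   # a 10x brighter probe

# The SD grows 10-fold with the signal, but the CV is unchanged, which is
# exactly why the CV is preferred for comparing probes of different brightness.
print(round(cv_original, 1))
print(abs(cv_original - cv_scaled) < 1e-9)
```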
This scale-invariance makes the CV particularly useful for quantifying the variability in replicate microarray studies. Based on Equation (2.2.1), a lower CV represents higher microarray reproducibility. However, the acceptable range of CV varies between studies and groups [318–321]. For example, the Institute of Food Research (IFR) microarray facility in the UK uses median CVs of 5% and 10% as the critical values for technical and biological replicates, respectively [321]; and the MicroArray Quality Control (MAQC) project has reported CVs of 5-20% for six commercially available microarray platforms for gene expression analysis (Applied Biosystems, Affymetrix, Agilent Technologies, GE Healthcare, Illumina, and Eppendorf) [318]. Based on theoretical statistics, it is reasonable to assume that if the ratio of mean to standard deviation in a normal distribution is < 3, the experiment is not reproducible. Using this hypothesis, Johnson and Welch had previously stated that a CV of 33% constitutes a permissible upper limit, implying that any experiment with CV > 33% is not reproducible [322]. The main reason behind the debate about the critical value of the CV is that there is no fundamental analysis relating assay variability to the reliability (or precision) of the biological interpretation of array findings. Understanding the link between the measured variability and its impact on the expected frequency of inconsistent readouts is crucial for determining the acceptable extent of technical variation in microarray experiments. Nonetheless, no previous study has explored the relationship between technical variation and the expected frequency of erroneous readouts in SNP arrays. To address this limitation, the focus of the rest of this Chapter is to adapt previous studies of CV and variability in other applications [279, 323] to develop a comprehensive model that relates oligo-specific variability in Affymetrix SNP arrays (as measured by the CV) to the frequency of potential SNP-level errors. Such a detailed analysis of variability and its impact on SNP data quality allows us to determine the critical value of CV that defines the acceptable range of technical variation in SNP microarrays.

2.2.2 A Link Between the CV and the Probability of Observing k-fold Disparities Between Replicate Measurements

Two studies, by Wood et al. [323] and Reed et al. [279], had previously focused on developing a link between the estimated variation in replicate measurements and the expected fraction of pairs of those measurements that differ by a given factor, in the context of serological assays. Wood et al. [323] derived a mathematical relationship between the error frequency and the magnitude of the SD, under the assumption that the logarithm of the measurements is normally distributed and using a maximum acceptable variability of a 2-fold change (k = 2). Reed et al. [279] extended Wood's approach and derived a mathematical relationship between the CV and the expected frequency of any k-fold disparate results in serological assays. However, no publication links the estimated variation in SNP arrays to their performance and error rates in the way Wood et al. [323] treated the SD or Reed et al. [279] treated the CV in serological assays. To address this limitation, I used Reed's model [279] to relate the oligo-specific CV to the expected fraction of oligos that have a k-fold difference in their signal intensity readouts in a replicate Affymetrix SNP array experiment. I then further expanded this model and used the empirical CV results to estimate the effect of PM oligo-specific variability on the measured SNP log2-ratio (LR) values (described in Section 2.3.5).
According to Reed's model [279], in a given assay with a log-normal data distribution and a known CV, the likelihood p(k) that two replicate measurements differ by at least a factor of k is estimated by Equation (C.9), restated here:

p(k) = 2Φ( −loge(k) / √(2 loge(CV² + 1)) )   (2.2.2)

where Φ denotes the cumulative density function (CDF) of the standard normal distribution (see Appendices A and B for more detail). The complete mathematical proof of this equation is presented in Appendix C. In the context of SNP arrays, knowledge of this probability (p(k)) helps to determine what magnitude of difference between oligos can be expected by chance alone when a particular coefficient of variation (%CV) is in effect. This information plays a pivotal role in understanding whether an estimated CV represents a reproducible or non-reproducible microarray experiment. It must be added that the CV is a common standard for assessing microarray data reproducibility and has been used for microarray platform comparisons in the MicroArray Quality Control (MAQC) project and several other studies [318, 324, 325]. Analysis of Variance (ANOVA) is another statistical approach for assessing microarray reproducibility; the description of this method and the results of its application to Affymetrix SNP array data are detailed in Appendix D.

2.3 Results

To estimate the technical variability of Affymetrix SNP arrays, a replicate study was performed using 69 samples studied on Affymetrix GeneChip® 10K SNP arrays. It is important to note that this experiment was carried out at the Affymetrix company (as part of a collaboration between Affymetrix and the Genome Sciences Centre). My role in this study was to apply computational techniques to quantify variability in the data generated from this experiment and to determine whether Affymetrix 10K SNP arrays were reproducible. The results of my analyses are explained in the following sections.
2.3.1 Affymetrix 10K Replicate Experiment

A schematic representation of the replicate experiment is shown in Figures 2.4-2.5. For this experiment, 8 individuals with normal karyotypes were selected by our collaborators at Affymetrix. The DNA from each subject was divided into 3 batches of approximately 750 ng. Each batch was labelled and hybridized to 3 Affymetrix SNP chips (10K), providing a total of 9 replicate arrays for each sample (Figure 2.4). The only exception in this design was sample #1, for which only 6 replicate arrays were available (2 chips per labelled batch). For each subject in this study, sample processing and hybridization were performed according to the Affymetrix GeneChip® 10K SNP assay [326], resulting in 69 arrays for the entire replicate data set. The .CEL files for these 69 arrays were obtained from Affymetrix for further analysis of technical variation. These .CEL files contain the raw intensity readouts of all (∼462,400) oligonucleotide probes on the 10K chips. This data set is referred to as "10K rep-test" in the remainder of this chapter.

2.3.2 Quantifying Technical Variability in Affymetrix SNP Arrays

In Section 2.2, several computational aspects of evaluating technical variability in replicate microarray data were discussed. Below is a summary of the steps that were performed to quantify technical variability in the "10K rep-test" analysis (adapted and modified from Stekel et al. [204]).

1. Log-transform the raw probe intensities (obtained from .CEL files) of the 69 arrays in "10K rep-test". Here, I used natural logarithms for the data transformation (see Figure 2.3).

2. Apply global normalization to the arrays by bringing the overall mean intensity of the SNP arrays to the same level (see Figure 2.6).

3. Estimate the mean log-intensity value of each oligonucleotide probe feature among the replicate arrays (the 10K SNP array has ∼462,400 oligos).
This value is commonly known as A in the context of MA-plots, which will be discussed in Step 5.

4. For each oligo in the replicate arrays, estimate the deviation from the mean. This is also known as the error between replicate features [204, 205] and is typically denoted by M in MA-plots (see Step 5).

5. Generate MA-plots [204, 205] to visualize the relationship between the estimated error (M, obtained in Step 4) and the mean log-intensities (A, measured in Step 3). MA-plots help to examine whether the magnitude of the error is independent of the signal intensity (see Figure 2.7).

6. If the data in the MA-plot are not symmetrical around a horizontal line, the error depends on intensity and, based on the shape of the plot, linear normalization (e.g., linear regression) or non-linear normalization (e.g., LOWESS) techniques can be applied to correct for the error biases [205]. Otherwise, proceed to the next step (Step 7).

7. Calculate the standard deviation of the error distribution (σe). If the MA-plot suggests that the variation is dependent on the signal intensity (an asymmetrical MA shape), the data can be partitioned into subsets with different intensity ranges so that the oligo readouts within each subset are intensity-independent. In such circumstances, a separate σe must be evaluated for each partition, independently.

(In this algorithm, the term replicate arrays refers to the chips that are compared together to determine the chip or labelling variability of a particular sample. For example, as seen in Figure 2.5, the first DNA batch (b = 1) of individual #5 (s = 5) is hybridized to 3 separate chips (c1, c2 and c3). Therefore, to measure the chip variability of this particular batch (s = 5, b = 1), the hybridization intensities of c1, c2 and c3 must be compared. These 3 arrays are therefore referred to as replicate arrays for analysing the chip variability of the above DNA batch.)

8.
A common model for errors in microarray experiments is the log-normal assumption, which assumes that the deviations in replicate experiments follow a normal distribution in the log-scale. This can be verified by plotting a histogram of the estimated deviation values (see Figure 2.8).

9. One advantage of the log-normal model of errors is that the coefficient of variation (CV) relates to the standard deviation of the errors in the log-scale (σe) by the following formula (solved in Appendix C):

CV = √(e^(σ²) − 1)   (2.3.1)

If the log-normal assumption is validated, evaluate the coefficient of variation (CV) by substituting the σe's (obtained in Step 7) into Equation (2.3.1). For datasets with multiple partitions, as described in Step 7, estimate one CV for each partition, independently.

The above algorithm is a general method and can be used to estimate variability in data from other biological platforms, when applicable. As shown in Figure 2.3, the log-transformed intensity readouts of PM oligos from the "10K rep-test" arrays follow a normal distribution. This finding is consistent across all 69 arrays in this dataset (each image was inspected separately, but to avoid redundancy only a subset of the images is shown in Figure 2.3). Next, according to Step 2, I applied global normalization to bring the overall (mean) intensity of these arrays to the same level, as shown in Figure 2.6. In the subsequent step (Step 5), sample-specific MA-plots of the deviations of oligo-level hybridization intensities were generated. As seen in the MA-plots depicted in Figure 2.7, the symmetrical shape of the data indicates that the oligo-level errors (deviations) in these arrays are largely independent of their mean intensity values.
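The algorithm above (Steps 1-9) can be sketched end-to-end as follows. Here the .CEL intensities are replaced by a simulated matrix (rows = replicate arrays, columns = probes), and all array sizes and noise parameters are illustrative assumptions rather than values from the "10K rep-test" data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for raw intensities from three replicate arrays; real input
# would be the probe readouts parsed from the .CEL files.
n_probes = 20_000
truth = rng.uniform(200, 20000, size=n_probes)
raw = truth * rng.lognormal(mean=0.0, sigma=0.05, size=(3, n_probes))

# Step 1: natural-log transformation.
logged = np.log(raw)

# Step 2: global normalization -- shift each array onto a common mean.
logged -= logged.mean(axis=1, keepdims=True) - logged.mean()

# Steps 3-4: per-probe mean log-intensity (A) and deviations from it (M).
A = logged.mean(axis=0)
M = logged - A                       # replicate errors per probe

# Step 7: standard deviation of the error distribution.
sigma_e = M.std()

# Step 9: convert sigma_e to a CV via Equation (2.3.1).
cv_percent = 100.0 * np.sqrt(np.exp(sigma_e ** 2) - 1.0)
print(round(cv_percent, 1))
```

Note that, because M is measured against the mean of only three replicates, this simple estimator slightly understates the per-array noise (by a factor of √((c−1)/c) for c replicate chips); the sketch follows the M definition of Steps 4 and 7 as written.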
This finding is consistent across all other samples in the "10K rep-test" dataset, suggesting that there was no need to perform any further sophisticated normalization that would otherwise have been required to remove intensity-dependent deviations (as described in Step 6). Next, according to Step 7, I estimated the standard deviation of the errors (σe) and then plotted the histograms of these errors to investigate whether the log-normal assumption held in the "10K rep-test" dataset. As shown in Figure 2.8, the histograms of chip-specific deviations for all 8 samples in the "10K rep-test" dataset followed a normal distribution, and this observation was consistent across all samples for both chip and labelling variability. The latter finding confirmed the log-normal assumption for the samples in the "10K rep-test" dataset.

2.3.3 Assessing Chip Variability in the Replicate Dataset (10K)

As shown in Figure 2.5, each DNA sample in the "10K rep-test" study was hybridized to a total of 9 separate 10K chips (except for sample #1, which had 6 replicate arrays). To measure chip variability in the "10K rep-test" dataset, the hybridization intensities between these chips were analysed for each DNA sample, independently. To estimate the chip variability of each individual oligonucleotide probe on the 10K SNP array, the mean (A) and the deviation from the mean (M) were calculated based on the following formulas:

A_p(s, B) = (1/3) Σ_{i=1..3} log I_p(s, B, C_i)   (2.3.2a)
M_p(s, B, C) = log I_p(s, B, C) − A_p(s, B)   (2.3.2b)

where s ∈ {1, . . . , 8} corresponds to the DNA sample number, B ∈ {1, 2, 3} represents the index of the DNA batch, and C_i denotes the i-th replicate chip of a particular DNA batch (i ∈ {1, 2, 3}). Also, I_p(s, B, C_i) represents the intensity of oligonucleotide probe p in the i-th replicate array (C_i) of a particular DNA batch (B) of sample s.
For example, I_100(5, 1, C_3) indicates the intensity of the 100-th oligo on the 3rd replicate chip (c = 3) of the first batch of sample #5 (b = 1, s = 5; see the left panel of Figure 2.5). Based on the above definitions, Equation (2.3.2a) evaluates the mean (A) of the oligonucleotide probe-level signal intensity (I_p) of the same DNA batch across replicate arrays, and Equation (2.3.2b) uses the estimated mean value (A) to assess the error (deviation; M) of chip variability across replicate arrays. As described in the previous section, a separate MA-plot was generated for each DNA batch by using the evaluated oligo-specific M and A values. The aim of this plot was to assess whether oligo-level variations were dependent on the average intensity across replicate arrays. These values were then used in Equation (2.3.1) to assess the %CV of chip variability for all 8 samples of the "10K rep-test" dataset, generating 3 estimates of CV per sample (one for each DNA batch) and a total of 24 CVs for the entire dataset. As depicted in Figure 2.9a, the estimates of CVc varied between 4 and 7% among all samples of the "10K rep-test" dataset, with a mean chip variability (CVc) of 5.16%.

2.3.4 Assessing Labelling Variability in the Replicate Dataset

To assess labelling variability, it is necessary to measure the difference in hybridization intensity between different labelling reactions of the same DNA sample. As in the assessment of chip variability, log-transformation, global normalization and the other steps described in Section 2.3.2 were applied to the replicate arrays to evaluate the variability associated with the labelling reactions. According to the experimental design of the "10K rep-test" dataset, and as seen in Figure 2.5, each DNA sample was hybridized on 9 SNP arrays (with the exception of sample #1, which was applied to 6 arrays; for this sample, i ∈ {1, 2} in the formulas below). Considering that the aim of this analysis was to find the signal intensity deviation between different labelling reactions, I first estimated the average intensity for each batch and then compared the average batch intensities across the 3 labelling reactions, as illustrated in Figure 2.5. The formulas for M and A for analysing labelling variability are given by:

A_p(s) = (1/9) Σ_{j=1..3} Σ_{i=1..3} log I_p(s, B_j, C_i)   (2.3.3a)
M_p(s, B) = Ī_p(s, B) − A_p(s)   (2.3.3b)
Ī_p(s, B) = (1/3) Σ_{i=1..3} log I_p(s, B, C_i)

where s ∈ {1, . . . , 8} represents the sample number, B_j, j ∈ {1, 2, 3}, denotes the batch number of a specified sample (s), and C_i, i ∈ {1, 2, 3}, indicates the replicate chip for each batch. Also, as described before, I_p(s, B, C) corresponds to the intensity of a given oligonucleotide probe p on chip C, which is associated with batch B of sample s (for a detailed description of these parameters, see Section 2.3.3). In Equation (2.3.3b), M_p(s, B) denotes the deviation of the average signal intensity of probe p (Ī_p(s, B)) in the specified DNA batch (s, B) from the mean intensity of the same probe (A_p(s)) across the entire set of replicate arrays of the same sample (evaluated by Eq. (2.3.3a)). Next, the standard deviations of the estimated errors associated with labelling variability (σe) were measured and, subsequently, the coefficients of variability for the different labelling reactions (CVl) were generated based on Equation (2.3.1) (Section 2.3.2). Figure 2.9b illustrates the CVl results for the "10K rep-test" dataset. The figure indicates that CVl varies between 6 and 7% across the 8 DNA samples in this study, with an average CVl of 6.36% per sample. Here, we observe that the overall estimate of labelling variability is slightly larger than the chip variability (CVc = 5.16%; ∆CV = |CVl − CVc| ≈ 1.20%).
The fact that one component of labelling variability is the chip-to-chip variation between arrays carrying the same labelled DNA explains this marginal increase of labelling variability relative to chip variability.

2.3.5 Analysing the Relationship Between Oligo-level and SNP-level Variabilities

As discussed earlier, the coefficient of variability (CV) is preferred over the standard deviation (SD) as a means of quantifying the reproducibility of the SNP arrays in the "10K rep-test" dataset. However, understanding the extent to which technical variation influences the measured LR values requires a reliable formulation that links the CV to assay performance. To characterize this relationship in Affymetrix SNP arrays and to investigate how it influences the log2-ratio values of the oligos, I performed the following analysis. Initially, I estimated the acceptable k-fold variation between intensity measurements of the same oligonucleotide probe across replicate arrays, and then I used Reed's model (Equation (2.2.2)) to relate the CV to the probability of two disparate oligos (i.e., oligonucleotide probes that differ by at least k-fold between replicate arrays) in the "10K rep-test" dataset. In the final analysis, I derived a mathematical formulation that links the oligo-specific variability to the extent of SNP-level LR variation, under the assumption that the average log2-ratio signal intensity from the n PM oligonucleotide probes (oligos) in the same probe set is used to measure the SNP log2-ratio intensity value (throughout this Chapter, I will refer to the SNP log2-ratio intensity estimate as the SNP LR value). Finally, I applied the above method to the real data obtained from the "10K rep-test" experiment to evaluate the associated oligo-level and SNP-level variability in this dataset. The detailed description of these analyses is presented in the following sections.
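Reed's model from Equation (2.2.2) is straightforward to evaluate numerically; the sketch below (standard-library Python, with Φ expressed through the error function) tabulates p(k) over a grid of CV values, in the spirit of the nomogram in Figure 2.10:

```python
from math import erf, log, sqrt

def normal_cdf(z):
    """Phi: CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_disparate(k, cv_percent):
    """Equation (2.2.2): probability that two replicate measurements of the
    same quantity differ by at least a factor of k, given the CV (percent),
    under the log-normal assumption of Reed's model."""
    cv = cv_percent / 100.0
    return 2.0 * normal_cdf(-log(k) / sqrt(2.0 * log(cv * cv + 1.0)))

# Tabulate p(k) for a few CV values; the grid is an arbitrary choice.
for cv in (5, 10, 20, 33):
    row = [round(p_disparate(k, cv), 4) for k in (1.1, 1.4, 2.0)]
    print(f"CV={cv:2d}%  p(1.1), p(1.4), p(2.0) = {row}")
```

As expected, p(k) falls rapidly as k grows and rises with the CV, which is the qualitative behaviour the nomogram visualizes.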
The Acceptable Range of Variability Between Replicate Oligos

According to Reed's model (discussed in Section 2.2.2), the probability that two independent measurements from the same sample will differ by a factor of k or more is given by Eq. (2.2.2), under the assumption that the data from these samples are normally distributed after logarithmic transformation [279]. As discussed in the previous section, the histograms of chip and labelling errors confirmed the log-normal hypothesis for the "10K rep-test" dataset (see Figure 2.8), thus satisfying the prerequisite of Equation (2.2.2) [209, 275, 327]. By applying this model, I generated Figure 2.10, which depicts the probability of observing a > k-fold difference in the raw intensities of two replicate oligos. The curves in this graph reflect the aforementioned relationship over a range of k and CV values. To interpret the results from this nomogram (Figure 2.10), we need to identify the critical value of k, which defines the maximum acceptable range of variation between replicate oligos in the context of copy number data analysis. For copy number analysis, the permissible variability corresponds to the range of variation between individual PM oligos that does not affect the SNP's overall LR value. To find the proper value of k for the 10K SNP arrays, I performed the following analysis. Let A and R be the readouts of the same normal oligonucleotide probe (LR = 0) in the 'test' and 'reference' arrays, respectively. Therefore:

log2(A/R) = 0 ⇒ log2 A − log2 R = 0   (2.3.4)

Now assume that in a replicate array the raw signal intensity of this oligo is increased k-fold (A → kA, k > 1). Thus the log2-ratio readout of the same oligo in the replicate array is:

log2(kA/R) = log2 kA − log2 R = log2 k + log2 A − log2 R   (2.3.5)

Substituting the result of Eq.
(2.3.4) into (2.3.5) results in:

log2(kA/R) = log2 k + 0 = log2 k   (2.3.6)

The result of Equation (2.3.6) shows that a k-fold increase in the raw intensity readout of a normal oligonucleotide probe (LR = 0) results in a log2(k) increase in the corresponding log2-ratio (LR) estimate (assuming that the same reference set is used to analyse the data from the replicate arrays). In practice, regardless of the CNV calling method used, not every deviation of the LR from the theoretical baseline (LR = 0) is reported as a significant copy number change. Instead, default deletion and amplification thresholds are often used to select significant intensity deviations as potential regions of copy number aberration [328–330]. For instance, let us assume that log2-ratios of ±0.5 are used to call amplifications and deletions. It must be noted that the theoretical log2-ratios of one-copy gain and loss are +0.58 and −1, respectively; however, to compensate for the effect of noise, a smaller magnitude of aberration is often used to determine significant DNA gains and losses. Here, the selected values (LR = ±0.5) are arbitrary thresholds that are used to show the probability of observing random noisy oligos that could shift a normal SNP above a deletion or an amplification threshold. As seen later in this Chapter, the presented results are generated under several conditions, so that the aforementioned probability can be studied with different thresholds. With the above assumption (LR = ±0.5 as CNV thresholds), the acceptable technical variability is −0.5 < LR < +0.5. Therefore, the critical value of k estimated by Equation (2.3.6)
By substituting the measured CV from the empirical data (”10K rep-test” dataset), the frequency of replicate oligos that differ by � 1.4 fold in Affymetrix 10K SNP arrays is estimated as the following: � � − loge (1.4) p(1.4) = 2Φ � = 6.66e − 04 2 loge [(7/100)2 ) + 1  where an upper estimate of CV (≈ 7%) was used to assess p(k = 1.4). Alternatively, this value may also be approximated by inspection of the nomogram presented in Figure 2.10 which plots the probabilities by using appropriate k (1.4) and CV (7%) values. This nomogram helps to understand the link between coefficient of variation (CV) and the probability of observing k fold disparate results in a replicate experiment. This analysis reveals that based on estimated variability in 10K Affymetrix SNP arrays in the ”10K rep-test” experiment, the p-value (P) of observing � 1.4 fold1 difference in the raw oligo-level intensities between replicate oligos is P = 6.66e-04. This means that based on estimated CV values in ”10K rep-test” dataset, by random chance, about 308 oligos from 462,200 oligos on 10K SNP arrays are expected to be significantly different (� 1.4-fold) across replicate experiments. In general according to Figure 2.10, any variability below 10% would essentially have a p(k) probability of zero for observing ≥ 1.4 fold difference in SNP-level LR outputs. However, the visible difference between k = 1.1 and k = 1.4 in this figure emphasizes on the fact that, just by random chance, there  is a much larger probability of observing disparate results (in replicate oligos) between 1.1-1.4 fold apart (corresponding to ∼ 0.25 − 0.48 difference in log2-intensity values). A careful inspection  of Figure 2.10 shows that there is ∼50% probability that log2-ratio readouts of the same oligo in  replicate arrays differ by 0.26 (in log2-scale), even in a highly reproducible experiment with CV as low as 10%. 
Linking the Oligo-level CV to Changes in SNP-level LR Values

Understanding the extent to which these highly variable oligos affect the SNP-level LR readouts requires an analytical approach to identify how many noisy oligos in a 10K SNP probe set can make a significant change in the LR value of the corresponding SNP probe (each 10K SNP readout is based on the information from 20 PM oligos). Here, I assumed that the SNP log2-ratio copy number (LR) values are calculated by averaging the log2-ratio intensity measurements of the PM oligos that belong to the SNP probe set. So the LR value of SNP "A", consisting of n perfect-match oligos, A = {A1, A2, . . . , An}, is estimated by:

S(A, R) = (1/n) [ log2(A1/R1) + log2(A2/R2) + . . . + log2(An/Rn) ]   (2.3.8)

where n = 20 for the Affymetrix 10K SNP array, and Ri denotes the intensity of the same PM oligonucleotide probe (the i-th probe) in the reference set. The measurement of the same SNP probe obtained from a replicate array is represented by:

A′ = {A′1, A′2, . . . , A′n} = {k1A1, k2A2, . . . , knAn}, k > 1

where ki indicates the magnitude of the intensity fold-change of the i-th PM oligo. Similar to Equation (2.3.8), the average log2-ratio signal intensity of the replicate SNP (A′) is assessed by:

S(A′, R) = (1/n) [ log2(k1A1/R1) + log2(k2A2/R2) + . . . + log2(knAn/Rn) ]   (2.3.9)

To understand how the ki's can make a significant change in the estimated LR values, the following analysis was performed. It was assumed that LR = −0.5 and LR = +0.5 are the theoretical thresholds for loss and gain of DNA copy number, respectively. Thus, for a given normal SNP with LR = 0, the maximum acceptable random variability between replicate arrays is |LR| < 0.5. This means that the SNP readouts in replicate experiments are inconsistent (or disparate) if:

S(A′, R) = (1/n) [ log2(A′1/R1) + log2(A′2/R2) + . . . + log2(A′n/Rn) ] ≥ 0.5
⇒ log2(k1A1/R1) + log2(k2A2/R2) + . . . + log2(knAn/Rn) ≥ 0.5n   (2.3.10)

By expanding the logarithms in Equation (2.3.10) and grouping the resulting terms, we have:

(log2 k1 + log2 k2 + . . . + log2 kn) + (log2 A1 + log2 A2 + . . . + log2 An) − (log2 R1 + log2 R2 + . . . + log2 Rn) ≥ 0.5n
⇒ (log2 k1 + log2 k2 + . . . + log2 kn) + (log2(A1/R1) + log2(A2/R2) + . . . + log2(An/Rn)) ≥ 0.5n
⇒ log2 ∏_{i=1..n} ki + n·S(A, R) ≥ 0.5n

Assuming that SNP A represents a normal region of the genome with LR = 0 (S(A, R) = 0), and substituting this into the previous inequality, we have:

log2 ∏_{i=1..n} ki + 0 ≥ 0.5n ⇒ ∏_{i=1..n} ki ≥ 2^(0.5n)   (ki = 1 when A′i = Ai)   (2.3.11)

where the parameter n denotes the total number of PM probes in the probe set. If inequality (2.3.11) holds, it implies that the variations in the raw oligonucleotide signal intensities (denoted by the ki's) are sufficient to increase the average SNP LR measurement by at least 0.5 units (in log2-scale). Assuming that m represents the number of variable oligos (i.e., oligos that have at least a k-fold change across replicate arrays) in a SNP probe set (m < n), the left-hand side of inequality (2.3.11) can be substituted by (k̄m)^m. Therefore:

∏_{i=1..m} k̄m ≥ 2^(0.5n) ⇒ (k̄m)^m ≥ 2^(0.5n)   (2.3.12)

where m denotes the number of variable oligos in the SNP probe set that differ, on average, by ≥ k̄m fold across replicate arrays, and n is the total number of oligos per SNP probe set. The extent to which a random k-fold change in m oligos (of the same SNP probe set) affects the overall SNP LR value can be measured by subtracting Equation (2.3.8) from Equation (2.3.9):

S(A′, R) − S(A, R) = (1/n) log2 ∏_{i=1..n} ki = ∆S   (2.3.13)

where ki is the variability of each PM oligonucleotide probe belonging to the same SNP probe set.
Assuming that a given SNP probe set has only m variable (or disparate) oligos (m < n), with an average k̄_m fold-change, the above equation becomes:

    ∆S = (1/n) log2 ∏_{i=1}^{m} k̄_m = (1/n) log2 (k̄_m)^m = (m/n) log2 k̄_m    (2.3.14)

Equations (2.3.13) and (2.3.14) represent the magnitude of the difference between LR measurements of the same SNP in two replicate experiments (∆S), depending on the oligo-level variability of the individual PM oligos in the SNP probe set. Estimating ∆S for hypothetically varied values of m and k̄_m can help to understand variations of SNP LR readouts based on the variability of their PM oligos.

Evaluating the Impact of Oligo-level Variability on the Extent and Frequency of Noisy SNPs in Replicate Experiments

The work presented in this section aimed to (1) assess what proportion of PM oligonucleotide probes on an Affymetrix SNP array is expected to vary by a minimum of k fold between replicate experiments, and (2) what proportion of Affymetrix SNP probe sets is affected by such oligo-level variations. Table 2.1 presents the results of evaluating oligos that vary by >= k fold (aim #1). The first column, %CV, indicates an arbitrary range of CV values. Columns 2-6, denoted by p(k), k ∈ {1.2, 1.41, 1.5, 2, 3}, represent the probability (p) that an oligonucleotide probe would differ by at least k fold across replicate experiments. The k and CV values used in this table are selected from the curves in Figure 2.10. As described in Section, k = 1.41 corresponds to the critical value of k in SNP data analysis, estimated by Equation (2.3.7). The last 5 columns in Table 2.1 denote the predicted frequency of oligos in 10K SNP arrays that differ >= k fold across replicate experiments. It is observed that as the CV increases, the probability of observing variable oligos also increases. Furthermore, in spite of relatively small p(k) values, we observe a considerable number of oligos that are highly variable among replicate experiments.
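The p(k) values in Table 2.1 can be reproduced from the CV. The sketch below assumes (consistently with the tabulated values) a log-normal intensity model with log-scale SD σ = sqrt(ln(1 + CV²)), so that p(k) = 2·(1 − Φ(ln k / (√2·σ))); the 462,400-oligo count is the PM readout total used for the frequency columns:

```python
import math

def p_k(k: float, cv: float) -> float:
    """Probability that two replicate readouts of the same oligo differ
    by >= k-fold, given the platform coefficient of variation `cv`.

    Assumes log-normally distributed intensities; sigma is the SD of the
    log-intensities implied by the CV. The difference of two independent
    log-intensities has SD sqrt(2)*sigma, hence the sqrt(2) below.
    """
    sigma = math.sqrt(math.log(1.0 + cv * cv))
    z = math.log(k) / (math.sqrt(2.0) * sigma)
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

N_PM_OLIGOS = 462_400  # PM oligo readouts on the 10K platform

# Reproduce two entries of Table 2.1:
print(round(p_k(1.2, 0.10), 3))   # ~0.196
print(round(p_k(1.5, 0.20), 3))   # ~0.148
# Expected count of >= 1.5-fold different oligos at CV = 20% (table: 68,266):
print(round(p_k(1.5, 0.20) * N_PM_OLIGOS))
```

The small residual differences from the published table come from rounding in the tabulated probabilities.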
For example, for an experimental platform with a CV = 20%, the probability of an oligonucleotide exhibiting a >= 1.5 fold difference across replicate experiments is only 0.148 (p(1.5)). However, under these circumstances the data from a 10K SNP replicate array are expected to contain 68,266 variable oligos. The impact of oligo-level variability on SNP-level variability is shown in Table 2.2. It is observed that for a CV between 10-20%, there are between 445 and 3,065 SNPs that have at least two oligos that vary >= 1.2 fold across replicate experiments. As evident from this table, the predicted frequency of observing ∆S ≈ 0.5 in the log2-ratio intensity measurement from the same SNP probe set across replicate arrays depends on the number of variable PM oligos across replicate measurements (m) and the extent of this difference (k̄_m). In addition to these parameters, the frequency of such potentially unreliable SNPs is directly affected by the coefficient of variation (CV) of the experimental platform.

In summary, in this section I applied a mathematical model to the empirical data from a SNP array replicate experiment to predict the frequency of SNP readouts with significant non-specific variability between replicate arrays (more than a 1.41-fold difference in raw signal intensities, which corresponds to more than a 0.5 intensity difference in log2-scale). The results of Tables 2.1 and 2.2 indicate that, based on the estimated CV in the "10K rep-test" experiment (CV ≈ 7%), it is very unlikely that the experimental variation in this dataset would result in any significant change (∆S) in the SNP LR values. However, it is important to note that the assessed CV is particularly low and the same data quality may not be reproduced under typical laboratory conditions.
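The ∆S and SNP-frequency entries of Table 2.2 can likewise be sketched. Here ∆S follows Eq. (2.3.14), and the frequency uses the simple product model implied by the table: the number of probe sets on the 10K array (≈ 11,560, i.e. 231,200 PM oligos / 20) times p^m. Both the probe-set count and the independence assumption behind p^m are illustrative assumptions on my part:

```python
import math

N_SNPS = 231_200 // 20   # ~11,560 SNP probe sets on the 10K array

def delta_s(m: int, k_bar: float, n: int = 20) -> float:
    """Shift in the averaged SNP log2-ratio caused by m oligos with an
    average k_bar fold-change (Eq. 2.3.14): dS = (m/n) * log2(k_bar)."""
    return (m / n) * math.log2(k_bar)

def expected_noisy_snps(p: float, m: int) -> float:
    """Expected number of probe sets containing m oligos that each differ
    >= k-fold, modelling the m events as independent with probability p."""
    return N_SNPS * p ** m

# m = 10 oligos at 1.4-fold shift the SNP LR by ~0.24 log2 units (Table 2.2).
print(round(delta_s(10, 1.4), 2))            # 0.24
# With p(1.2) = 0.196 at CV = 10% (Table 2.1), ~444 probe sets are expected
# to carry 2 such oligos (table: 445, from the unrounded probability).
print(round(expected_noisy_snps(0.196, 2)))
```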
Also, it must be pointed out that, even at the same CV value, there are still many oligos that may have LR changes of less than 0.5 (see the intersection of CV = 7% with the k = 1.1 (shown in black) and k = 1.2 (shown in blue) curves in Figure 2.10).

2.4 Conclusions

In this Chapter, I used a replicate data set obtained from 8 normal individuals to analyse probe-specific contributions to technical variability in Affymetrix GeneChip® 10K SNP arrays. To quantify technical variability, a general step-by-step procedure was presented based on using the coefficient of variation, or CV (Section 2.3.2). By applying this procedure to a 10K replicate dataset consisting of 69 samples from 8 individuals (Section 2.3.1), the average chip and labeling CV values were estimated to be 5.15% and 6.36%, respectively (Figure 2.9). These estimates quantified the technical variability between intensity readouts of the same oligonucleotide probes across replicate SNP arrays, but they did not provide any information regarding the extent of variation of the SNP log2-ratios (∆S) across replicate arrays (p. 43). This link is critical to understanding the impact of the estimated CVs on array performance in the context of copy number analysis. To address this issue, I used the mathematical model proposed by Reed et al. [279] to find the relationship between the CV and the probability of observing a k-fold random difference (p(k)) in intensity measurements of replicate oligos (Section ). Next, I further expanded this model and developed a method to quantify the contribution of probe set-specific technical variability to the estimated SNP log2-ratio values (∆S; Section ). Then, I used the estimated CV values (shown in Figure 2.9) from the "10K rep-test" dataset in the aforementioned model and found that under these conditions there was only a slight chance (p = 6.66e-04) that two replicate oligos in Affymetrix 10K SNP arrays would differ significantly (k ≥ 1.4; p. 45).
The development of the relationship between the CV and p(k) enhances the usefulness of the CV in the biological interpretation of SNP array findings, which had not been previously investigated in the context of microarrays. This model allows the generation of a dimensionless index of variability that is universally applicable for assessing the performance of SNP array platforms under different laboratory conditions. Furthermore, the model can be easily adapted for other applications, for example, to find the probability of any specified k-fold difference in expression data occurring by random chance between replicate arrays. By expanding Reed's model and applying it to SNP array data (Section 2.3.5), I was able to accomplish two important applications of the mathematical formulation that linked CV and p(k): (1) to assess whether or not the difference in raw intensity readouts of the same oligonucleotide probes between replicate arrays is due to random variation (by analysing the oligo-level variability; Table 2.1 and Figure 2.10); and (2) to assess whether the variation in a set of individual PM oligos can introduce a significant bias (∆S) into the resultant SNP log2-ratio measurement (Table 2.2). In clinical applications, this model can also provide a quality control tool through which it can be determined whether the currently estimated variability exceeds what has been established from past experiments. It is important to note that although in this analysis I used a critical value of k corresponding to a minimum 1.4-fold difference in raw intensities (corresponding to a ∼0.5 difference in log2-ratio intensity readouts), the choice of k can also be easily adjusted to optimize the sensitivity and specificity of a specific analysis. For example, one could use this model to predict the frequency of observing a difference of 0.25 units across replicate experiments by setting k = 1.19 in Equation (2.2.2) and Figure 2.10.
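The conversion between a desired log2-ratio threshold δ and the corresponding raw fold-change k used above is simply k = 2^δ; a one-line sketch:

```python
# k = 2**delta maps a log2-ratio difference threshold to a raw fold-change,
# recovering the k values quoted in the text (1.19 for 0.25 units, 1.41 for 0.5).
for delta in (0.25, 0.5):
    print(delta, round(2 ** delta, 2))
```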
One important observation from the nomogram in Figure 2.10 is that any variability below 10% would have an essentially zero probability p(k) of producing a k ≥ 1.4 fold difference between replicate oligo readouts. However, the visible separation of the k = 1.1 and k = 1.2 curves from the k = 1.4 curve in Figure 2.10 implies that, just by random chance, there is a much larger probability of observing replicate oligos that are less than 1.4 fold different (k = 1.1 and k = 1.2 correspond to log2-ratio intensity changes of ∼0.14 and ∼0.26, respectively). A careful inspection of Figure 2.10 also indicates that there is a ∼20% probability that two oligos would have at least a 0.26 difference in their log2-ratio intensity readouts between replicate arrays, even in a highly reproducible experiment with a CV as low as 10%.

In this Chapter, I have outlined an approach to generate a reliable estimate of variability for Affymetrix SNP arrays and then developed a model that uses the measured CVs to determine the quality of SNP data. The provided equations, the nomogram visualization in Figure 2.10 and the critical-value tables provided in this Chapter (Tables 2.1-2.2) are simple tools that extend the understanding of the CV and increase its usefulness in the interpretation of the technical variability of SNP arrays. The main conclusion of the work presented in this Chapter is that although Affymetrix 10K SNP arrays are shown to be highly reproducible, there is still a significant likelihood of the occurrence of noisy oligos just by random chance (Section ). Such noisy oligos may affect SNP-level LR results and ultimately affect the quality of the downstream CNV findings. Based on the results of this Chapter, I concluded that in order to improve the quality of CNV results, it is important to develop a CNV detection method that is based on analysing the probe-level variability of SNP array data.
2.5 Figures and Tables

[Figure 2.1: diagram of nested sources of variability — variability between individuals (copy number polymorphism); between sample preparations (labeling variability); between arrays and hybridizations (chip variability); and between features on the array (SNP variability and oligo variability).]

Figure 2.1: Sources of variability in SNP microarray experiments for identification of copy number variations (CNVs). At the highest level of variability, there is biological variation in the DNA copy number values that exists among different individuals in a population, also known as copy number polymorphism (CNP). At the experimental level, there is variability in the preparation and labeling of the same sample (labeling variability), as well as variability in the hybridization of a sample to different arrays (chip variability). The last source of variability is the variability between different features on the same array, which includes both the variability between probes that target different SNP loci and the variability between individual oligonucleotide probes within the same SNP probe set.

[Figure 2.2: diagram of the components of technical variability — Cy5/Cy3 channel variability (does not exist in one-color arrays such as Affymetrix SNP arrays); labeling variability; batch effects; and chip variability, which depends on array manufacturing and hybridization-reaction variability.]

Figure 2.2: Common sources of microarray technical variability. There are several components of technical variability in microarray experiments. The first component, "variability between Cy5 and Cy3 colors", is specific to two-color arrays, such as CGH-based platforms, where two samples (e.g., test and reference) are labeled with different fluorophores (usually Cy3 and Cy5 dyes) and co-hybridized on a single microarray. The other denoted factors are common among all arrays, including Affymetrix SNP arrays, Illumina SNP arrays and aCGH technology. Labeling variability refers to the variation between intensity readouts of a sample that has been prepared and labelled by separate labeling reactions. Batch effects are often observed in large-scale studies where, due to practical considerations, the number of samples that can be prepared and hybridized at one time is limited. Under such circumstances, samples are often divided into groups or batches, and the samples in each group are analysed together in separate experiments that may be conducted several days or months apart. This procedure introduces systematic batch effects between these separate groups that make it difficult, if not impossible, to compare between batches. The next source of technical variability is the variation between hybridizations to different arrays, referred to as chip variability. As indicated in the diagram, chip variability depends on two other components: (1) variability between arrays in relation to the manufacturing process, also known as manufacturing variability; and (2) variability between different hybridization reactions.

[Figure 2.3: (a) histogram of raw PM intensities, sample 5; (b) histogram of log-transformed PM intensities, sample 5.]

Figure 2.3: Impact of log-transformation on the distribution of raw signal intensities from Affymetrix SNP arrays. Panel (a) denotes the histogram of raw intensity readouts from all perfect match (PM) oligos on a 10K SNP array (total of 231,200 PM oligos), for a given 10K sample (S5 in the replicate experiment, previously discussed in Figure 2.4). It is clear from this graph that the raw intensity data are not normally distributed. Panel (b) demonstrates the histogram of PM signal intensities of the same sample (S5) after log-transformation (natural logarithm). The red curve illustrates the estimated normal fit to the log-data. The bell shape of the histogram presented in (b) indicates that the raw (PM) intensity readouts tend to follow a normal distribution after log-transformation.

[Figure 2.4: schematic of the Affymetrix 10K replicate study design, Steps A-C.]

Figure 2.4: Schematic representation of the Affymetrix 10K replicate study. The replicate experiment, which is referred to as "10K rep-test", was designed and performed by our collaborators at Affymetrix® according to the process outlined below. The study included 8 normal individuals, referred to by index 'S' as depicted in Step A. None of these individuals had any previously known genetic abnormality. Total genomic DNA obtained from each subject (such as S5, shown above) was divided into 3 batches*. Each batch was then prepared and labelled independently, as illustrated in Step B. Following the labeling process, each batch was hybridized onto 3 separate Affymetrix GeneChip® 10K SNP arrays, as demonstrated in Step C. Thus, the experiment resulted in 9 replicate arrays for each individual 'S' (except for S1, which had 6 replicate arrays*). The "10K rep-test" experiment, therefore, generated a dataset consisting of the data from 69 replicate 10K arrays from 8 individual normal samples. *The only exception is S1, where only 2 (instead of 3) batches were available for the analysis.
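As a quick sanity check on the design in Figure 2.4 (3 chips per batch; 3 batches per subject except S1, which had 2), the total array count can be tallied; a minimal sketch:

```python
# Replicate-array bookkeeping for the "10K rep-test" design (Figure 2.4):
# 8 subjects, 3 chips per batch; S1 contributed 2 batches, all others 3.
batches_per_subject = {f"S{i}": (2 if i == 1 else 3) for i in range(1, 9)}
chips_per_batch = 3

total_arrays = sum(b * chips_per_batch for b in batches_per_subject.values())
print(total_arrays)  # 7*9 + 1*6 = 69 replicate arrays
```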
[Figure 2.5: schematic of variability assessment — panel (a): batch b = 1 of subject S5 hybridized to chips c = 1-3; panel (b): batches b = 1-3 of S5 across chips c = 1-9.]

Figure 2.5: Schematic representation of assessing technical variability in the 10K SNP array replicate study. Panel (a) demonstrates how chip-to-chip variability is estimated in the "10K rep-test" experiments. The "chip variability" is evaluated based on the difference of oligo-level intensity measurements of the same labelled sample hybridized to 3 replicate arrays. In the depicted example, the chip variability of batch 1 of sample 5 (S5, b = 1) is estimated by comparing the microarray signal intensity outputs from chips c = 1, c = 2 and c = 3. Panel (b) depicts how "labeling variability" is measured in the "10K rep-test" datasets. This variability, defined as the variation between separate labeling reactions of the same sample, is evaluated by comparing the oligo-level signal intensity outputs from arrays that have been hybridized with different batches of the same sample. In the depicted example (in panel (b)), the labeling variability of sample 5 (S5) is evaluated by comparing the variation in the hybridization intensity outputs from 9 arrays, denoted by c = 1 to c = 9.
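The per-oligo chip CV described in Figure 2.5.a can be sketched numerically. This is an illustrative reimplementation, not the thesis code; the array shape and the mean-of-per-oligo-CVs summary are assumptions:

```python
import numpy as np

def chip_cv(intensities: np.ndarray) -> float:
    """Average per-oligo coefficient of variation across replicate chips.

    `intensities` has shape (n_chips, n_oligos): raw readouts of the same
    labelled sample hybridized to several replicate arrays.
    """
    per_oligo_cv = intensities.std(axis=0, ddof=1) / intensities.mean(axis=0)
    return float(per_oligo_cv.mean())

# Toy example: 3 replicate chips, 4 oligos, ~5% multiplicative noise.
rng = np.random.default_rng(0)
truth = np.array([200.0, 800.0, 1500.0, 400.0])
chips = truth * rng.normal(1.0, 0.05, size=(3, 4))
print(f"{100 * chip_cv(chips):.1f}%")  # a few percent, on the order of Figure 2.9
```

Labeling variability (Figure 2.5.b) follows the same computation, with the 9 arrays from the 3 batches stacked along the first axis.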
[Figure 2.6: boxplots of probe-level log10 intensities (462,400 oligos) for 3 replicate arrays, (a) before and (b) after global normalization.]

Figure 2.6: Global normalization of log-transformed intensity readouts from 3 replicate arrays. The boxplots were generated to compare the mean and spread of oligo-specific intensity readouts from 3 replicate chips of a sample from the "10K rep-test" experiment (S8, b = 3). On each box, the central red line is the 50th percentile (median), the edges of the box are the 25th and the 75th percentiles, and outliers are depicted by red '+' markers (see Appendix E for more information about boxplot visualization). Panel (a) shows the boxplots of the log-transformed data before global normalization, and panel (b) shows the boxplots of the same data after global normalization. It is evident from these plots that the overall mean intensities of these arrays have been brought to the same level as a result of normalization.

[Figure 2.7: MA-plots of chip variability, panels (A)-(F).]

Figure 2.7: MA-plots of chip variability. Each dot represents the relationship between the deviation in the log-intensity readouts of a particular oligo across 3 replicate arrays (the same labelled sample hybridized to 3 separate 10K arrays, evaluated for 462,400 oligos on the 10K array). The x-axis denotes the average intensity between replicate oligos ('A'), and the y-axis represents the deviation of the intensity readouts of the same oligo between replicate arrays ('M'). Panels (A)-(C) show the MA-plots for 3 different batches of sample S2, and panels (D)-(F) indicate the MA-plots of 3 different batches of sample S5, described previously in Figure 2.5.a. All depicted MA-plots are symmetrical around the x-axis, suggesting that the overall array deviation is not intensity-specific and, thus, there is no need to apply a more sophisticated normalization technique prior to quantifying labeling and chip variabilities.

[Figure 2.8: histograms of oligo-level deviation with superimposed normal fits, samples S1, S3, S4 and S5.]

Figure 2.8: Histograms of deviation (error) in replicate arrays. The plots show the histograms of the oligo-specific deviation of signal intensities of 4 samples in "10K rep-test" (S1, S3, S4 and S5). The histogram of oligo-level deviation is shown in blue and the predicted normal fit to each distribution is superimposed on the histogram by red curves. These bell-shaped distributions indicate that the log-transformed intensity deviation (error) follows a normal distribution in these samples. Similar results were obtained for the rest of the samples in the "10K rep-test" dataset (4 other samples; not shown here). This finding shows that the errors from the "10K rep-test" experiment are normally distributed, an assumption which is the prerequisite for several statistical methods that were applied to this dataset throughout this chapter (such as Reed's mathematical model relating the CV to the probability of randomly different readouts in replicate experiments, as described in Section 2.2.2).
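The M and A quantities plotted in Figure 2.7 can be computed directly from two replicate vectors of log-intensities. A minimal sketch, assuming the usual convention M = difference, A = average on log-scale data:

```python
import numpy as np

def ma_values(x: np.ndarray, y: np.ndarray):
    """M/A coordinates for two replicate arrays of log-intensities.

    M: per-oligo deviation between replicates; A: per-oligo average intensity.
    An M cloud that is symmetric around zero at all A values suggests no
    intensity-dependent bias.
    """
    m = x - y
    a = (x + y) / 2.0
    return m, a

# Two identical replicates give M = 0 everywhere.
log_int = np.log10(np.array([150.0, 900.0, 3000.0]))
m, a = ma_values(log_int, log_int.copy())
print(m)  # [0. 0. 0.]
```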
[Figure 2.9: bar charts of (a) chip variability and (b) labeling variability in the SNP 10K replicate dataset.]

Figure 2.9: Estimated chip and labeling variability of the Affymetrix 10K replicate experiment. Panel (a) denotes the results of 10K chip variability in the 8 DNA samples of the "10K rep-test" experiment. The x-axis ('S') represents the DNA sample index, and the y-axis (%CV) denotes the measured coefficient of variation for each specified batch of a given DNA sample. The estimated chip variabilities for the 3 DNA batches of each sample are represented by red, blue and green bars, respectively. The horizontal dashed line depicts the mean chip variability index across all samples and all batches in the "10K rep-test" dataset (CVc = 5.16%). Panel (b) denotes the labeling variability, evaluated by assessing the variation in the hybridization intensity of the same sample that was prepared through 3 separate labeling reactions (9 replicate arrays per individual; see Figure 2.5.b). The red dashed line represents the mean labeling variability across all samples (CVl = 6.36%). 1*: denotes the only exception, S1, where only 2 (instead of 3) batches were available for this analysis.

[Figure 2.10: nomogram of p(k) versus %CV for several values of k.]

Figure 2.10: Relationship between array variability and the probability of observing oligos with a k-fold difference in their intensity readouts. This nomogram depicts the probability that two measurements from the same oligo will differ by a factor of k or more across replicate Affymetrix SNP arrays. The probability curves in this figure were generated by using a hypothetically varied range of CV and k values in Eq. (2.2.2). This graph allows monitoring of the expected quality and consistency of the data generated from an array experiment with known variability (%CV). The red solid curve (labelled as k = 1.4*) presents the probability of observing a 0.5-unit difference between log2-intensity readouts of the same oligo across replicate arrays. This magnitude of difference between log2-signal intensities (0.5), which is equivalent to a ~40% difference in raw signal intensities, is assumed as the maximum acceptable oligo-level variability between replicate measurements in this chapter. By finding the intersection of the k = 1.4 curve with the estimated %CV of the "10K rep-test" data set (max CV = 6.36%), presented in the previous figure (Fig. 2.9), the nomogram indicates that there is a very low chance of observing oligos that differ ≥ 1.4 fold between the 10K replicate arrays. Thus, we can accurately conclude that the data from the "10K rep-test" experiments were highly reproducible.

         p(k): probability that oligos differ >= k-fold       Predicted no. of >= k-fold different oligos in 10K†
  %CV    p(1.2)   p(1.41)*  p(1.5)    p(2)      p(3)      |  F‡[1.2]   F[1.41]   F[1.5]    F[2]      F[3]
  5      0.010    9.38E-07  9.60E-09  1.03E-22  1.68E-54  |  4566      0.43      0.004     4.77E-17  7.78E-49
  6.36   0.042    1.15E-04  6.41E-06  1.22E-14  2.24E-34  |  19619     53        3         5.63E-09  1.03E-28
  7      0.065    4.56E-04  4.12E-05  2.38E-12  1.11E-28  |  30129     211       19        1.10E-06  5.11E-23
  10     0.196    1.40E-02  4.05E-03  8.95E-07  6.82E-15  |  90689     6481      1872      0.41      3.15E-09
  12     0.281    0.040     0.016     4.15E-05  8.20E-11  |  129854    18680     7624      19        3.79E-05
  15     0.387    0.100     0.055     0.001     1.91E-07  |  179073    46411     25235     470       0.09
  20     0.515    0.216     0.148     0.013     8.76E-05  |  238062    99807     68266     6160      40
  25     0.601    0.320     0.244     0.047     0.002     |  277578    147720    112892    21503     742
  30     0.661    0.404     0.329     0.095     0.008     |  305303    186656    151944    43908     3762
  40     0.738    0.525     0.457     0.203     0.044     |  341056    242523    211111    93962     20224
  50     0.785    0.604     0.544     0.299     0.100     |  362789    279132    251386    138415    46252
  60     0.816    0.659     0.605     0.377     0.161     |  377227    304376    279689    174136    74522
  70     0.838    0.698     0.650     0.438     0.219     |  387431    322601    300345    202287    101053
  80     0.855    0.728     0.684     0.486     0.269     |  394980    336263    315934    224581    124509
  90     0.867    0.750     0.710     0.525     0.313     |  400764    346824    328039    242461    144764

  † Affymetrix GeneChip® 10K SNP array.
  * 1.41 is the critical value of k in 10K SNP arrays, as estimated by Equation (2.3.7).
  ‡ 'F' denotes the frequency of observing m variable oligos with k̄_m fold-differences in the same SNP probe set, across replicate 10K SNP arrays.

Table 2.1: The relationship between the CV and the predicted number of replicate oligos that are >= k-fold different. This table summarizes the probability of obtaining a specific value of fold change, p(k), for various assumed values of the CV, and uses the oligo-level variation information to predict the frequency of SNP probe sets in Affymetrix 10K SNP arrays that are affected by such variable oligos. The m and k̄_m denote the number of variable oligos in a SNP probe set and the average fold-difference of variable oligos across replicate experiments, respectively. Columns 2-6, denoted by p(k), k ∈ {1.2, 1.41, 1.5, 2, 3}, represent the probability (p) that an oligonucleotide probe would differ by at least k fold across replicate experiments (see Figure 2.10). The last 5 columns of this table denote the predicted frequency of oligos in 10K SNP arrays that differ >= k fold across replicate experiments. In this table we observe that as the CV increases, the probability of observing variable oligos increases. Furthermore, in spite of relatively small p(k) values, there can be a considerable number of oligos on the 10K SNP array that differ >= k fold among replicate experiments. For example, for an experimental platform with a CV = 20%, the probability of an oligonucleotide having a >= 1.5 fold difference across replicate experiments is only 0.148 (p(1.5)). However, under these circumstances the data from a 10K SNP array are expected to contain 68,266 oligonucleotide intensity readouts that are likely to differ by >= 1.5 fold across replicate experiments.

  PM oligos in the SNP probe set        Predicted no. of SNPs with ∆S difference on the
  that differ >= k-fold                 Affymetrix GeneChip® 10K SNP array
  m*   k̄_m**  ∆S†   |  CV = 7%    CV = 10%   CV = 15%   CV = 20%   CV = 40%
  2    1.2    0.03  |  49         445        1734       3065       6292
  5    1.2    0.07  |  0.01       3          101        419        2528
  10   1.2    0.13  |  1.60E-08   9.77E-04   0.88       15         553
  15   1.2    0.20  |  1.88E-14   2.84E-07   0.01       0.55       121
  18   1.2    0.24  |  5.22E-18   2.15E-09   4.47E-04   0.08       49
  2    1.4    0.05  |  0.01       3          142        609        3330
  5    1.4    0.12  |  1.52E-12   1.68E-05   0.19       7          515
  10   1.4    0.24  |  2.00E-28   2.43E-14   3.20E-06   4.71E-03   23
  15   1.4    0.36  |  2.62E-44   3.52E-23   5.32E-11   3.00E-06   1
  18   1.4    0.44  |  7.76E-54   1.75E-28   7.21E-14   3.64E-08   0.16
  2    1.5    0.06  |  1.96E-05   0.19       34         252        2411
  5    1.5    0.15  |  1.37E-18   1.26E-08   0.01       0.81       230
  10   1.5    0.29  |  1.62E-40   1.37E-20   2.72E-09   5.71E-05   5
  15   1.5    0.44  |  1.91E-62   1.50E-32   1.32E-15   4.01E-09   0.09
  18   1.5    0.53  |  1.33E-75   9.94E-40   2.15E-19   1.29E-11   0.01
  2    2      0.10  |  6.52E-20   9.25E-09   0.01       2          478
  5    2      0.25  |  8.75E-55   6.62E-27   1.26E-11   4.86E-06   4
  10   2      0.50  |  6.63E-113  3.79E-57   1.37E-26   2.04E-15   1.39E-03
  15   2      0.75  |  5.02E-171  2.17E-87   1.49E-41   8.60E-25   4.84E-07
  18   2      0.90  |  6.74E-206  1.56E-105  1.56E-50   2.04E-30   4.06E-09

  * m: number of variable (PM) oligos that belong to the same SNP probe set (m <= n; n = 20).
  ** k̄_m: average fold-change of the log2-ratio intensity measurements of the m variable oligos within the same SNP probe set (p. 47).
  † ∆S: magnitude of the difference in the LR value of a SNP due to oligo-level variability (estimated according to Eq. (2.3.14)).

Table 2.2: The impact of oligo-level variability across replicate measurements on the expected SNP-level variability. This table summarizes the effect of variable PM oligos on the estimated variability of SNP log2-ratio readouts (∆S) and the frequency of such SNPs. m (1st column) denotes the number of PM oligos in the SNP probe set that are different between replicate arrays (m <= 20 in Affymetrix 10K SNP arrays). k̄_m (2nd column) denotes the average fold-change, k, of the corresponding m variable oligos (explained on page 47). The magnitude of the resultant difference in the SNP LR values between replicate arrays is evaluated by Equation (2.3.14) and is shown in the 3rd column (∆S). The last 5 columns indicate the predicted frequency of SNPs on the Affymetrix 10K SNP array whose probe sets are expected to contain m oligos that differ >= k̄_m fold between replicate arrays. For example, for k̄_m = 1.2, the expected frequency of SNP probe sets that have 2 PM oligos that are at least 1.2-fold different across replicate experiments is 49, when the SNP platform CV is 7% (the approximate value of the CV that was estimated for the "10K rep-test" dataset). Under such circumstances, the SNP probe set mean signal could vary between replicate experiments by ∆S = 0.03 (in log2-scale). We also observe that for a CV between 10% and 20%, 445 to 3,065 SNP probe sets are expected to contain 2 oligos that vary >= 1.2 fold across replicate experiments, and therefore the SNP probe set readouts could differ by ∆S = 0.03. However, this number increases to 6,292 if the CV of the experimental platform is 40%.

Chapter 3

Algorithm for Oligonucleotide Probe-level Analysis of Signal Intensities (OPAS)

3.1 Introduction

In recent years, high-density SNP genotyping arrays have become increasingly popular for copy number detection applications, since these arrays can serve a dual role for both SNP genotyping and copy number analysis [135, 136, 149, 162, 209]. The most prominent SNP arrays are from two commercial vendors, Affymetrix and Illumina (explained previously in Section ). As described in Chapter 1, the Affymetrix SNP array construction involves synthesizing 25-mer oligonucleotide probes (which I also refer to as "oligos" in this thesis) corresponding to a perfect match and mismatch of the two SNP alleles.
This probe quartet, commonly known as a SNP "probe set", is the basic unit for downstream computational analysis (see Figure 1.2 for more details). The hybridization reaction of target DNA to the above oligos generates signal intensity measurements, which can then be converted by computational tools to infer SNP genotypes as well as regions of copy number variation [161, 172]. Successful application of this technology to copy number analysis has discovered a number of interesting CNVs with relationships to complex disease. For example, rare CNVs have been linked to schizophrenia [57] in a study where micro-deletions and duplications were shown to be responsible for disrupting genes involved in neurodevelopment. The UGT2B17 gene on 4q13.2 was also linked to osteoporosis in a case-control study of 350 affected individuals in a Chinese population [331]. Despite these advances, a major disadvantage of oligonucleotide arrays is their poor signal-to-noise ratio, which can affect the quality of the predicted CNVs by increasing the rates of false positive and false negative CNV calls [141, 168, 332]. Several algorithms have been developed to improve the signal-to-noise ratio of these arrays by taking into account the length and GC content of the probes, and various copy number detection algorithms have been developed to aid CNV detection using approaches based on Hidden Markov Models (such as QuantiSNP [185] and PennCNV [186]), segmentation (such as CBS segmentation [189]) or t-tests [210, 212, 213]. Nonetheless, algorithms often depend on multiple sequential SNPs with a significant intensity change to call a putative CNV. As a consequence, several studies (such as Itsara et al. [141]) have shown that CNV calling algorithms often emphasize specificity over sensitivity, and as a result the CNV detection power is dependent on the number of SNP probe markers in a region of interest.
Therefore, while large CNVs (> 4 Mb) are routinely identified by most of the available algorithms, for small CNVs (< 100-150 kb) or CNVs in regions with lower probe density (< 8-10 SNP probes), the results from different algorithms are often no longer consistent [141, 215].

As discussed earlier, each Affymetrix SNP is represented by a collection of oligos that interrogate a SNP locus and its surrounding sequence in the genome. This design strategy is inherited from Affymetrix expression arrays, where the relative expression of each gene was estimated by a set of probes in a probe set. In expression arrays, it was shown that although ideally all probes within the same probe set should represent the expression of the same gene, there was often remarkable variation between their intensity readouts [207, 276]. In this Chapter, I first provide evidence that (1) there is significant variation both between SNP probe sets in regions with known copy number values and among the oligonucleotide probes in the same SNP probe set; and (2) such variation exists regardless of the array density (e.g., 100K or 500K) or the size of the normal reference set that is used to evaluate intensity ratios. I therefore hypothesize that (3) the accuracy of predicted CNVs can be improved by distinguishing between true signal and noisy oligos and incorporating only the true information into the downstream CNV calling method. The underlying hypothesis of the work presented in this Chapter is thus that by improving the quality of the SNP readouts, which are the input data to the downstream CNV calling algorithm, the impact of noise in Affymetrix SNP arrays can be reduced and, consequently, the CNV detection process would have higher sensitivity and specificity.
Based on this hypothesis, I developed an algorithm that utilizes non-parametric, statistical model-based methods for pre-processing SNP array data by analysing the effects of non-specific binding and oligonucleotide probe-specific behaviour on the estimated noise in the array readout. I will show in this Chapter that this pre-processing dramatically reduced the oligonucleotide probe-level variation (noise) within SNP probe sets, improving the SD values from 1.22 (before pre-processing) to 0.41 (after pre-processing). In addition, I will provide evidence that, despite the reduction in the SD values, the magnitudes of copy number aberrations are significantly increased after the pre-processing phase (see Results). This indicates that the proposed algorithm addresses one of the major issues in analysing SNP array data. As described in Chapter 1, other methods often improve the SD by smoothing the data or incorporating large windows of neighbouring SNPs, which reduces the magnitude of both noisy oligos and true copy number aberrations. Thus, the apparent noise reduction in such algorithms comes at the cost of reduced detection sensitivity, particularly for CNVs in DNA regions with a low density of SNP markers [220, 221]. In contrast, the non-parametric approach I implemented distinguishes between noisy and informative oligonucleotide probes and then excludes the noisy oligos from the array readouts.

Contribution

The samples used in the work presented in this Chapter were provided by Dr. Jan Friedman, and the "wet-lab" work (sample preparation and Affymetrix experiments) was performed at the Genome Sciences Centre. Throughout this Chapter, I also use validated CNV results of the mental retardation (MR) project that have already been published [29, 30, 215] to test the performance of OPAS in reducing noise and detecting real CNVs. The reported MR CNVs in the Friedman et al.
study were based on 100K SNP array data [29]. The CNVs were first selected using the CNAG [180] and dChipSNP [209] software packages with additional t-test statistics, as explained in [29] (this analysis was performed by a separate group at the GSC). The candidate CNVs were then examined by FISH analysis performed in Dr. Patrice Eydoux's lab at the Children's and Women's Hospital in Vancouver. The reported CNVs were a subset of all de novo events (i.e., events that were present in the affected child and not in either of the normal parents) that were successfully validated by FISH analysis. The second MR-related CNV study was based on Affymetrix 500K SNP arrays [30]. In the latter study, chip-to-chip normalization, standardization to a reference set, genotype detection and copy number estimation were all performed using the Affymetrix Power Tools (version 1.6.0) software suite; this analysis was performed by a separate group at the GSC. Estimation of CNV boundary positions was done using the Significance of Mean Difference (SMD) method, developed by Delaney et al. [259]. In SMD, the mean of the SNP copy number estimates (or log2-ratios) within a candidate CNV region is compared to the mean of those of the SNPs on the rest of the chromosome. The probability of accepting the null hypothesis of Student's t-test (i.e., that the means are from the same distribution) was then calculated and used to identify the most significant putative CNV regions [30]. The CNVs reported in the Friedman et al. 500K array publication [30] were selected based on SMD hits with at least 10 contiguous SNPs and with a p-value less than 10e-8. As in the 100K study [29], all reported MR-related CNVs in the 500K publication [30] were validated de novo events. Throughout this Chapter, all references to known CNVs in the MR study refer to the results from the aforementioned studies [29, 30].
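The SMD idea described above can be sketched in a few lines. This is a minimal illustration using Welch's t-test on synthetic log2-ratios; the data and region coordinates are invented for demonstration, and the published analysis used the Delaney et al. implementation with a 10e-8 threshold rather than this toy code.

```python
import numpy as np
from scipy import stats

def smd_pvalue(chrom_lr, start, end):
    """Significance of Mean Difference: compare SNP log2-ratios inside a
    candidate CNV region with those on the rest of the chromosome."""
    inside = chrom_lr[start:end]
    outside = np.concatenate([chrom_lr[:start], chrom_lr[end:]])
    # Welch's t-test (unequal variances) between the two groups
    return stats.ttest_ind(inside, outside, equal_var=False).pvalue

rng = np.random.default_rng(0)
chrom = rng.normal(0.0, 0.2, 2000)      # synthetic copy-neutral chromosome
chrom[800:815] -= 1.0                   # implant a 15-SNP deletion

p_cnv = smd_pvalue(chrom, 800, 815)     # deleted region: very small p-value
p_null = smd_pvalue(chrom, 100, 115)    # unremarkable region of the same size
```

Regions whose p-value falls below the chosen significance threshold would then be reported as putative CNVs.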
In this Chapter, I present an algorithm for CNV detection based on probe-level data analysis using Affymetrix GeneChip Mapping 10K, 100K and 500K arrays. This algorithm, called Oligonucleotide Probe-level Analysis of Signal Intensities (OPAS), makes no assumptions about the performance of individual oligos within a SNP probe set; instead, OPAS uses fuzzy-logic theory to analyse the relationships between the individual perfect match (PM) oligos in a SNP probe set and to determine how many possible groups of such oligos exist in each probe set. The decision about which group(s) of oligos are informative is made through non-parametric statistical tests and a machine learning classifier. The validated CNVs from the aforementioned MR publications [29, 30] are used to analyse the oligo-level variabilities in known CNV regions and to examine the performance of OPAS in detecting real copy number aberrations. This approach of identifying individual noisy oligos in each SNP probe set is novel and has not yet been reported in other copy number detection algorithms [168, 180, 184, 185, 188, 333, 334].

3.2 Methods

3.2.1 Algorithm Design

The OPAS design is divided into two main levels, as illustrated in Figure 3.1. The first module, "SNP pre-processing", aims to find the most informative subset of PM oligos in each SNP probe set and to generate improved SNP log-ratio readouts; the improvements are detailed later in this Chapter. The next phase, "SNP post-processing", applies a normalization method to correct for PCR-induced biases and partitions each chromosome into regions where the copy number changes between neighbouring segments. The main underlying hypothesis of the OPAS algorithm is that the noise in Affymetrix SNP arrays, in addition to being SNP dependent, also depends on the oligonucleotide probes within the SNP probe sets.
Therefore, the reliability of individual oligos can directly impact the quality of the estimated SNP log2-ratio (LR) values and, consequently, the accuracy of the downstream CNV calls. Detailed explanations of the different parts of this algorithm are provided in the following sections.

3.2.2 SNP Pre-processing Phase

As previously described in Chapter 1, Affymetrix genotyping arrays interrogate each SNP genotype with a set of 25-mer oligonucleotide probes (or oligos), known as a SNP probe set. The oligos in a probe set are designed to query both DNA strands at multiple offsets with respect to the SNP position (Figure 1.2). The generated intensity data from these arrays can also be compared to a reference set to determine the relative abundance of DNA at the specified SNP sites for CNV analysis.

Quantile Normalization

In the first step of pre-processing, a quantile normalization [205, 335] technique is applied to the log-transformed fluorescent intensity data from the test and reference SNP arrays (.CEL files) to enable differentiation between real variations in DNA copy number and variations due to experimental differences between multiple arrays. The quantile normalization method was originally designed to normalize intensities in Affymetrix high-density oligonucleotide expression arrays [205, 335, 336]; however, the same general method can be adapted to normalize data from SNP arrays. This normalization method is based on the assumption that the data from all samples being compared follow a common underlying distribution [336]. A possible theoretical problem with this approach is the risk of removing some of the signal in the tails of the distribution; however, several studies have shown that empirical evidence does not indicate that quantile normalization leads to such errors in practice [205, 302].¹ The OPAS default normalization approach is a modified version that adjusts the data in extreme tail values to allow for greater differentiation [336].
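The core of standard quantile normalization can be sketched in a few lines. This is a minimal illustration of the general method only (it does not include the tail adjustment of the modified OPAS variant, and it breaks ties crudely):

```python
import numpy as np

def quantile_normalize(X):
    """Force each column (array) of X onto a common distribution.

    X: 2-D array, rows = probes, columns = arrays (log-scale intensities).
    Each value is replaced by the mean quantile at its within-array rank,
    so all arrays end up with identical sorted values.
    """
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank of each probe per array
    mean_quantiles = np.sort(X, axis=0).mean(axis=1)   # average distribution
    return mean_quantiles[ranks]

# toy example: two "arrays" measuring the same probes at different scales
X = np.array([[5.0, 10.0],
              [2.0,  4.0],
              [3.0,  6.0]])
Xn = quantile_normalize(X)
# after normalization both columns hold the same values, rank order preserved
```

After this step, any remaining intensity ratio differences between a test and reference array reflect the data rather than array-wide scale effects.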
¹ Appendix F provides a comparative study of several normalization techniques using 500K array data from cancer samples. The results of this analysis also indicate that background-adjusted quantile normalization does not empirically suppress the magnitude of real CNVs.

Clustering Individual Oligonucleotide Probes in a SNP Probe Set

The goal of the OPAS pre-processing phase is to improve the quality of the estimated SNP signal by finding a subset of oligonucleotide probes with the most informative log2-ratio intensity in each SNP probe set. The OPAS strategy for finding such an informative subset of oligos is to first cluster the PM oligos in each SNP probe set into groups with similar LR values. One of the most popular methods for data clustering is the k-means algorithm [337, 338], a successful method that has been widely used in several applications, such as gene expression profiling [339-342] and expressed sequence tag (EST) analysis [343-345]. Despite the popularity of the k-means algorithm for data clustering, one of its limitations is that it requires the number of clusters (k) and the approximate centroid values (µk) as input parameters. If no prior information is available for the cluster centers, random initialization is often employed; however, k-means is particularly sensitive to these initialization values. Another drawback of random initialization is that clustering the same SNP probe set multiple times would not necessarily generate the same clusters. These limitations imply that in order to apply k-means to clustering oligos, the appropriate initialization values must first be determined in a non-random manner. To accomplish this task, I designed a two-stage clustering algorithm that first predicts the optimal clustering parameters (k and µk) using a non-parametric approach based on fuzzy logic theory (subtractive cluster analysis) [346] and then uses these values to initialize an optimization-based k-means clustering.
The details of this clustering approach are described in the following two sections.

Cluster Prediction: Fuzzy Subtractive Clustering

In step one of clustering, a fuzzy-logic subtractive algorithm [346-348] is applied to the PM oligonucleotide probe-level data in each Affymetrix SNP probe set. The subtractive method is a fast, one-pass algorithm for estimating the optimal number of clusters (k) and the cluster centers (µk) in a set of data [346-349]. Assuming that P = {p_i | i = 1, ..., n} denotes a SNP probe set with n PM oligos, the likelihood that the i-th PM oligo (p_i) of P is a cluster center is defined by [350]:

D_i = \sum_{\substack{j=1 \\ j \neq i}}^{n} \exp\!\left(-\frac{\|p_i - p_j\|^2}{(r_a/2)^2}\right), \qquad i = 1, 2, \ldots, n \qquad (3.2.1)

where r_a > 0 is the cluster radius (the best default values of r_a are usually between 0.2 and 0.5). In the following step, the PM oligo associated with the largest likelihood value D is chosen as the first cluster center. Subsequently, the density measures of the remaining PM oligos in the probe set P are revised by:

D_i = D_i - D_{c_1} \exp\!\left(-\frac{\|p_i - p_{c_1}\|^2}{(r_b/2)^2}\right) \qquad (3.2.2)

where c_1 is the first cluster center and r_b is usually set to 1.5 × r_a to avoid obtaining closely spaced cluster centers. The subtractive clustering algorithm iterates between Equations (3.2.1)-(3.2.2) until all of the PM oligos in a SNP probe set are within the radius of a cluster center [350]. Notably, fuzzy clustering alone did not perform well at clustering borderline oligos; therefore, it was combined with k-means optimization-based clustering, as described below.

Cluster Estimation: k-means Clustering

In the second phase of clustering, for each SNP probe set the number of clusters (k) and the cluster centers (µk) obtained by fuzzy subtractive analysis are used to initialize a k-means clustering algorithm.
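The two-stage scheme can be sketched as follows for the 1-D log2-ratios of a single probe set. This is an illustrative implementation, not the OPAS code: the radius ra = 0.3 and the density-ratio stopping rule are assumptions (the text stops once every oligo lies within the radius of some center).

```python
import numpy as np

def subtractive_centers(p, ra=0.3, stop_ratio=0.15):
    """Stage 1 (Eqs. 3.2.1-3.2.2): predict the number of oligo clusters
    and their centers from the log2-ratios of one SNP probe set."""
    p = np.asarray(p, dtype=float)
    rb = 1.5 * ra                                        # discourage nearby centers
    d2 = (p[:, None] - p[None, :]) ** 2                  # pairwise squared distances
    D = np.exp(-d2 / (ra / 2) ** 2).sum(axis=1) - 1.0    # drop the j == i term
    D0, centers = D.max(), []
    while D.max() > stop_ratio * D0:                     # simplified stopping rule
        c = p[np.argmax(D)]                              # densest remaining oligo
        centers.append(c)
        D = D - D.max() * np.exp(-(p - c) ** 2 / (rb / 2) ** 2)
    return np.array(centers)

def kmeans_1d(p, centers, max_iter=100, tol=1e-6):
    """Stage 2: Lloyd's k-means seeded with the predicted centers,
    so no random initialization is needed."""
    p, mu = np.asarray(p, dtype=float), np.array(centers, dtype=float)
    for _ in range(max_iter):
        labels = np.argmin(np.abs(p[:, None] - mu[None, :]), axis=1)
        new_mu = np.array([p[labels == i].mean() if np.any(labels == i) else mu[i]
                           for i in range(len(mu))])
        if np.max(np.abs(new_mu - mu)) < tol:
            return labels, new_mu
        mu = new_mu
    return labels, mu

# a probe set with a deletion-like group and a noisy group of PM oligos
oligos = np.array([-1.3, -1.25, -1.35, -1.28, 0.18, 0.22, 0.2])
centers = subtractive_centers(oligos)    # two centers predicted
labels, mu = kmeans_1d(oligos, centers)  # oligos partitioned into two clusters
```

Because the seeds come from the subtractive step, re-clustering the same probe set always yields the same partition, avoiding the random-initialization drawback discussed above.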
The k-means algorithm partitions the PM oligos in each SNP probe set into k mutually exclusive clusters, such that the oligos within each cluster are as close to each other as possible and as far from the oligos in the other clusters as possible. In the rest of this thesis, I refer to a cluster of similar PM oligos as an "oligo cluster". The k-means method iteratively updates the cluster centers to minimize a cost function, until there is no significant change in the cluster centers or the maximum number of iterations is exceeded. The k-means cost function f is defined by:

f = \arg\min_{C} \sum_{i=1}^{k} \sum_{p_j \in c_i} \|p_j - \mu_i\|^2 \qquad (3.2.3)

where c_i represents the i-th oligo cluster in a SNP probe set, with centroid µ_i, and k is the number of predicted oligo clusters in the SNP probe set (1 ≤ k ≤ n). When clustering finds more than one oligo cluster in a SNP probe set, further analysis is required to distinguish between noisy and informative oligo clusters. This process, referred to as SNP classification, is described in the following section.

SNP Classification

The goal of SNP classification is to identify which subset of oligos in a SNP probe set represents the true SNP signal intensity. The OPAS default SNP classification approach consists of two modules. First, the likelihood estimation phase tests whether the log2-ratio intensity values of each oligo cluster differ significantly from the normal baseline. In the next step, the prediction phase, the likelihoods of these null-hypothesis tests are used as input data to a machine learning classifier that finds the most informative oligo cluster in each SNP probe set. For each SNP, the center of the informative oligo cluster is then used as the OPAS-estimated LR value.

Likelihood Estimation: Kolmogorov-Smirnov (KS) Test

In this phase, three sets of non-parametric Kolmogorov-Smirnov tests are separately applied to each oligo cluster of the SNP probe sets, as listed in Table 3.1.
Each KS test in this table makes a statement about how the population of oligos in an oligo cluster (X) is related to a specified baseline distribution (e.g., y0). The alternative-hypothesis settings shown in Table 3.1 ("two.sided", "less" and "greater") define the null hypotheses that the cumulative distribution function (CDF) of the oligo cluster data X is equal to, not less than, or not greater than the cumulative distribution function of the background, respectively. For instance, KS test set #1 tests the alternative hypothesis that the CDFs of the oligo cluster data and the background population are not equal (H1: X ≠ y0).

For an oligo cluster X with k PM oligos and a standard deviation of σx, the background distribution for test set #1, y0, is defined as a distribution of k randomly generated numbers with mean zero and the same standard deviation as the oligo cluster data being compared (σx). Nonetheless, when the same background distribution (y0) was used in KS tests 2-3, often none of these mutually exclusive null-hypothesis tests were rejected. The implication of this observation was that we often could not determine whether the oligo cluster data were likely smaller or larger than the zero-mean baseline.¹ To circumvent this issue, different baseline distributions were used for the aforementioned test sets (Table 3.1). The baseline distributions for tests 2 and 3 were generated with the same strategy used to create y0, but with different population means (µ(y1) = -0.5, µ(y2) = +0.6). These values were set following the analysis of more than 1,300 oligo clusters in more than 600 SNP probe sets from known deleted and amplified regions (from the mental retardation project). These background distributions (y1 and y2) provided the greatest distinguishability between the results of the aforementioned three mutually exclusive tests.
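The baseline-comparison step can be sketched as follows. This illustration uses SciPy's two-sample KS test in its two-sided form only, with invented sample sizes; the baseline means -0.5 and +0.6 are the values given in the text, and the one-sided alternatives of Table 3.1 would be applied analogously.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def ks_screen(cluster_lr, n_baseline=1000):
    """Compare one oligo cluster against three synthetic baselines:
    zero-mean (test set 1), deletion-like mean -0.5 (test set 2) and
    amplification-like mean +0.6 (test set 3), each sharing the
    cluster's SD as described in the text. Returns KS p-values."""
    x = np.asarray(cluster_lr, dtype=float)
    sd = x.std(ddof=1)
    pvals = {}
    for name, mu in [("y0_normal", 0.0), ("y1_deleted", -0.5), ("y2_amplified", 0.6)]:
        baseline = rng.normal(mu, sd, n_baseline)
        pvals[name] = stats.ks_2samp(x, baseline).pvalue
    return pvals

# an oligo cluster whose LR values sit near the deletion baseline
cluster = rng.normal(-0.5, 0.15, 15)
p = ks_screen(cluster)
# the zero-mean and amplified baselines are strongly rejected
```

The resulting p-values for each oligo cluster are exactly the kind of evidence the downstream classifier consumes.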
It must be added, however, that ~20% of SNP probe sets still did not provide any relevant copy number information; in most cases this was not because the tests failed to generate accurate results, but because the SNP probe set readouts were inconsistent with the known DNA copy number of the underlying region. Some examples of such inconsistent SNPs were described earlier in this Chapter and can also be seen in Figure 3.6 (e.g., SNP80 and SNP81). The schematic representation of the clustering and likelihood estimation analysis for a given SNP probe set is presented in Figure 3.10.

¹ This statement requires further clarification. In biology, a hypothesis test can often have one of two outcomes: either the researcher can "accept" the hypothesis or "reject" it. For instance, if a scientist discovers that a particular gene does not appear in the genome sequence, then it is deduced that it has been deleted. However, in statistics there is a major issue with the notion of accepting the null hypothesis. Instead, the failure to reject the initial hypothesis implies that the data are not sufficiently persuasive to prefer the alternative hypothesis over the null hypothesis. Therefore, even if we reject the null hypothesis that the oligo cluster data are equal to the baseline and also reject that they are smaller than the baseline, but fail to reject that they are larger than the latter distribution, we cannot directly conclude that the oligo data are in fact larger than the baseline.

Machine Learning Classifier Based on Discriminant Analysis

In the next phase of SNP classification, a machine learning-based classifier was implemented to discover the most informative oligo cluster for each SNP by utilizing the information obtained in the likelihood estimation phase. A quadratic discriminant analysis (QDA) was implemented to perform this classification task.
The main inputs to the QDA classifier are the estimated p-values of the hypothesis tests described in the previous section, the number of PM oligos in each oligo cluster, and the cluster centers. A further description of the QDA classifier is presented in Appendix G. It is hypothesized that the SNP data manipulations described in the SNP pre-processing phase can decrease the overall standard deviation (noise) of the array while increasing the magnitude of true signal aberrations (results shown later in this Chapter). Such improvements can directly influence the quality of the downstream CNV calling process, as detailed in Sections 3.3.6-3.3.7.

3.2.3 Alternative Approach for SNP Pre-processing Based on Naive Bayes Classification

It has been suggested that instead of clustering oligos into groups with similar LR values and using likelihood estimation and QDA to independently classify each SNP, a classifier could be applied directly to the SNP probe sets. To implement this alternative approach, I trained Naive Bayes classifiers using SNPs from known deleted, amplified and normal regions (the same training data used for QDA¹). The underlying assumption is that a Naive Bayes classifier can directly determine the SNP copy-number status (i.e., normal, deleted or amplified) by comparing the log2-ratio intensity pattern of the oligos in the SNP probe set with those from DNA regions with known copy-number values (Figure 3.2). Two Naive Bayes classifiers were implemented for this analysis, one with a normal (Gaussian) distribution and another with kernel smoothing density estimation. The main difference between the Naive Bayes approach and the OPAS default SNP classification is that all the PM oligos in a SNP probe set are used as the input data to the Naive Bayes classifiers, whereas, as explained in the previous section, the OPAS SNP classification approach is based on QDA analysis of the information obtained from the oligo clusters (compare the flowcharts in Figures 3.1 and 3.2).
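The Gaussian variant of this alternative can be sketched with a hand-rolled naive Bayes model (the kernel-density variant is analogous). Everything below is illustrative: each PM oligo's log2-ratio is treated as an independent feature, and the class means used to simulate the training data (-1.0, 0.0, +0.58) are assumptions loosely following the theoretical one-copy-loss and one-copy-gain log2-ratios.

```python
import numpy as np

class GaussianNaiveBayes:
    """Minimal Gaussian naive Bayes: each feature (oligo log2-ratio)
    is modelled independently per class."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-6 for c in self.classes])
        self.prior = np.array([(y == c).mean() for c in self.classes])
        return self

    def predict(self, X):
        X = np.atleast_2d(X)
        # log-likelihood up to a constant, summed over independent oligos
        ll = -0.5 * (((X[:, None, :] - self.mu) ** 2) / self.var
                     + np.log(2 * np.pi * self.var)).sum(axis=2)
        return self.classes[np.argmax(ll + np.log(self.prior), axis=1)]

rng = np.random.default_rng(2)
n_oligos = 20   # PM oligos per SNP probe set (illustrative size)
# training SNPs simulated from regions of "known" copy number
X = np.vstack([rng.normal(m, 0.3, (100, n_oligos)) for m in (-1.0, 0.0, 0.58)])
y = np.repeat(["deleted", "normal", "amplified"], 100)

nb = GaussianNaiveBayes().fit(X, y)
test_snp = rng.normal(-1.0, 0.3, (1, n_oligos))  # a deletion-like probe set
print(nb.predict(test_snp))                      # classified as 'deleted'
```

Note the contrast with the default pipeline: here the whole probe set is classified in one shot, so a few noisy oligos dilute the evidence rather than being identified and removed.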
The comparison of the classification performance of the QDA and Naive Bayes approaches is detailed in the Results section.

¹ The same 324 SNPs from known classes (deleted, amplified and normal regions) that were used for QDA training.

3.2.4 Post-processing and CNV Calling

The goal of the OPAS post-processing phase is to partition the whole genome into regions where the copy number changes between contiguous segments. The post-processing phase includes two main sub-modules (Figure 3.1). In the first step, a second phase of normalization is applied to the SNP log2-ratio data (generated by the pre-processing phase) to correct for biases that are due to the PCR process. Next, a non-parametric CNV calling algorithm (Circular Binary Segmentation) is applied to identify putative regions of copy number change in the Affymetrix SNP data. I hypothesized that combining high-quality SNP readouts with a non-parametric CNV calling approach can improve the quality of the ultimate CNV calls. In this section, I describe the OPAS post-processing modules.

PCR Fragment Length Normalization

As mentioned in Chapter 1, the Affymetrix genotyping assay involves a Polymerase Chain Reaction (PCR) process to amplify the target DNA sample. It has been shown that locus-specific array intensity data may be correlated with PCR fragment length [180]. This is due to the fact that longer fragments usually generate fewer amplified products, which reduces the material available for labelling and hybridization and results in weaker signals [180, 351]. The magnitudes of such PCR-induced biases may vary between arrays, so they do not necessarily cancel out in the estimated test-versus-reference LR ratios [180, 351-353]. To correct for such biases, a nonlinear LOWESS regression method was used [208, 354, 355]. This method has been used in a wide range of microarray applications, such as adjusting the waves in microarray signal intensities [140, 356] and normalizing Illumina Infinium SNP data [357].
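The shape of this correction can be illustrated with a simplified binned-median stand-in for the LOWESS fit (a true LOWESS would smooth locally rather than per bin). The fragment-length range, bias slope, and bin count below are all invented for demonstration.

```python
import numpy as np

def fragment_length_correct(lr, frag_len, n_bins=10):
    """Remove the trend of log2-ratio vs. PCR fragment length.

    A simplified, binned-median stand-in for LOWESS: estimate the
    median LR per fragment-length bin and subtract it from every
    SNP in that bin."""
    lr = np.asarray(lr, dtype=float)
    # quantile bin edges so every bin holds a similar number of SNPs
    bins = np.quantile(frag_len, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(frag_len, bins[1:-1]), 0, n_bins - 1)
    trend = np.array([np.median(lr[idx == b]) for b in range(n_bins)])
    return lr - trend[idx]

rng = np.random.default_rng(3)
frag_len = rng.uniform(100, 1100, 5000)
# longer fragments amplify less well -> weaker signal (downward trend)
lr = rng.normal(0, 0.1, 5000) - 0.0004 * (frag_len - 100)
corrected = fragment_length_correct(lr, frag_len)
# the strong negative correlation with fragment length is removed
```

Because the correction is estimated per array, array-specific bias magnitudes are handled even when they do not cancel in the test-versus-reference ratio.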
Circular Binary Segmentation (CBS) Algorithm

The Circular Binary Segmentation (CBS) algorithm [189] is a modification of binary segmentation [230], a well-known method for the change-point problem in statistics. The basic idea of this entirely non-parametric approach is to recursively split chromosomes into segments based on a maximum p-value that is estimated by permutation analysis. A study by Lai et al. [231] comparing 11 different CNV calling algorithms, including mixture models, HMMs, maximum likelihood, regression, wavelets and genetic algorithms, concluded that CBS appears to perform consistently well. Another comparison study, by Willenbrock and Fridlyand et al. [232], found that a CBS-based CNV calling method (the DNAcopy software) [189] had the highest sensitivity and the lowest false discovery rate (FDR) for CNV breakpoints compared to the Gaussian-based GLAD [210] and an HMM-based method [358]. A key advantage of the CBS method is that it does not place a strict limitation on the minimum number of probes in a candidate CNV region. The main disadvantage of this algorithm is its low speed; however, the Venkatraman et al. [242] modification of the original algorithm has alleviated this problem to some extent. Due to its non-parametric nature and proven accuracy, CBS is one of the most powerful statistics-based algorithms for CNV detection [164, 232, 359-362].

3.2.5 OPAS Visualization and Other Features

Several visualization and computational tools have been developed to facilitate OPAS data analysis and CNV interpretation. Figure 3.3 presents a schematic representation of the data that are generated during OPAS analysis. As noted in this graph, the OPAS software uses the raw Affymetrix .CEL files and generates three main output files (.jpg, .BED and .txt files). The .jpg files show the OPAS results of the segmented data in each chromosome along the chromosome ideograms, based on banding patterns from the UCSC genome browser (hg17, hg18 or hg19; the OPAS default genome build is hg18, but it can be modified through the user interface).
The .txt file lists all estimated segments in all chromosomes following the CBS analysis. An additional script allows OPAS to generate UCSC custom tracks for any list of candidate CNVs, based on the .BED format. This tool can also generate tracks with color spectra to better represent the type of predicted event (e.g., deletion versus amplification) or the magnitude of the estimated LR values. Other computational tools have also been implemented to allow finding overlapping CNVs across multiple samples and comparing a list of putative CNVs with known copy number polymorphisms in the Toronto Database of Genomic Variants (DGV). These tools help to obtain further information about the generated CNV results.

3.2.6 Simulated Data for Comparative Analysis of CNV Calling Algorithms

A simulated data set was generated based on the Affymetrix 250K Nsp array to evaluate the performance of the OPAS algorithm. To create the simulated signals, first a distribution of random Gaussian data was generated. The SNP-level and oligonucleotide-level standard deviations of the generated distribution were set to equal those of a follicular lymphoma sample¹ to demonstrate performance on real data (SD = 0.08 and 0.04). Next, 200 non-overlapping simulated regions of copy number change were randomly scattered throughout the simulated signal. These simulated CNVs had 8 different alteration sizes, with 2, 4, 8, 10, 15, 25, 100, and 200 data points. For each alteration size (referred to as w), 25 non-overlapping random CNV regions were generated, resulting in a total of 200 (8 × 25) distinct simulated CNVs throughout the entire signal. This data vector is referred to as a template signal.
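The construction of such a template signal can be sketched as follows. The widths, counts, and SNP-level SD are those stated in the text; the random placement scheme and the choice of a 0.4 log2-ratio shift (one of the alteration magnitudes used in the study) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

n_snps = 250_000                     # approximately Nsp-array sized
sd_snp = 0.08                        # SNP-level SD matched to the FL sample
widths = [2, 4, 8, 10, 15, 25, 100, 200]

signal = rng.normal(0.0, sd_snp, n_snps)      # copy-neutral template
occupied = np.zeros(n_snps, dtype=bool)
cnv_regions = []
for w in widths:                              # 25 non-overlapping CNVs per width
    placed = 0
    while placed < 25:
        start = int(rng.integers(0, n_snps - w))
        if not occupied[start:start + w].any():
            occupied[start:start + w] = True
            cnv_regions.append((start, w))
            placed += 1

# shift each simulated CNV by a constant log2-ratio magnitude
magnitude = 0.4
for start, w in cnv_regions:
    signal[start:start + w] += magnitude
```

Repeating the final step for each magnitude in {0.11, 0.2, 0.4, 0.6, 0.8, 1.0} yields the family of test signals used to probe calling sensitivity as a function of CNV size and amplitude.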
To implement the range of alterations expected in a typical SNP array experiment, a constant log2-ratio magnitude of 0.11, 0.2, 0.4, 0.6, 0.8 or 1.0 was then added to the simulated CNV regions of the template signal. This model is used to compare the sensitivity and precision of CNV calling algorithms, as detailed in Section 3.3.6.

3.2.7 Analysing the Effect of Noise on CNV Calling Performance

The second simulation analysis used an artificial but biologically inspired model to generate synthetic data for evaluating OPAS CNV detection sensitivity in the presence of added noise. This simulated model is based on Affymetrix Nsp SNP array (250K) data from a follicular lymphoma patient. The genome of this patient, referred to as ht-17, has been thoroughly analysed for both sequence-level mutations and structural aberrations using several different platforms, including array CGH, 500K SNP arrays, fingerprint profiling (FPP) [363] and whole transcriptome shotgun sequencing (WTSS) [364] (all of these data were provided by other groups at the GSC). The FPP results found a deletion in the chromosome 14 immunoglobulin heavy locus (IGH@), spanning approximately 870 kb (14q32; Figure 3.4). This deletion was validated by BAC end sequencing (BES) of an FPP clone that harboured this deletion (clone HTa17-0164B08, denoted in Figure 3.4b). No other CNV was detected in this chromosome by any of the above methods. Therefore, it can be hypothesized that the aforementioned deletion on 14q32 (105,163,197-106,035,402) is likely the only CNV in chromosome 14 of this patient. To generate simulated chromosome 14 signals based on real data, I added random Gaussian noise to the original Nsp data of this patient. Based on this model, the simulated noisy signal Y is defined by:

Y_i = X_i + \varepsilon_i, \quad 1 \leq i \leq n, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \qquad (3.2.4)

where n represents the sample size and X is the original Nsp data of chromosome 14.

¹ This sample was obtained from the follicular lymphoma project, described in Chapter 4.

The term ε
The term ε 1 This  sample is obtained from follicular lymphoma project, described in Chapter 4.  2  77  denotes the added gaussian noise which follows a normal distribution with standard deviation of σ and mean of zero. Based on this model, signal Y includes the IGH deletion (on 14q32) but is corrupted with a known magnitude of noise (ε).  3.2.8  Comparative Analysis of OPAS and Circular Binary Segmentation  As explained in Section 3.2.4, in OPAS post-processing phase Circular Binary Segmentation (CBS) algorithm is applied to detect regions where log2-ratio intensity changes between neighbouring segments. The aim of the following experiment is to investigate whether analysing a sample using OPAS has any advantage for CNV detection over application of CBS segmentation alone. The 100K array data from a mental retardation (MR) patient with a known (validated) deletion is used to perform this experiment. The aforementioned deletion spans approximately 294 Kb (chromosome 2p16.3) and includes 17 Xba SNPs. Next, a series of simulated signals are generated by randomly selecting N SNPs from the deleted region and eliminating them from the original signal. This process is repeated multiple times for each value of N (see Section 3.3.7 for more details). The performance of OPAS and CBS is compared by assessing the sensitivity of each method to detect the known deletion with fewer SNP probe markers. The results of this analysis are described in Section 3.3.7.  3.3 Results 3.3.1  Patterns of LR Intensity Fluctuations in SNP Array Data  To study the extent of variabilities in SNP array data, I examined more than 950 SNP probe sets in known regions of copy number variation using FISH validated CNVs from Friedman et al. [29] and Taylor et al. [365] publications. 
The results of this analysis revealed different fluorescent intensity responses to DNA copy number abundance, both across SNPs in a region with the same copy number (SNP-level variability) and among perfect-match oligos in the same probe set (oligo-level variability). Figure 3.5 demonstrates examples of such SNP-level and oligo-level variabilities in log2-ratio fluorescent intensities. The top panel (Fig. 3.5a) presents a scatterplot of a region on chromosome 10 that includes a known deletion, approximately 353 kb in size (10q24.31-q25.1; denoted by red arrows). The sample is from a patient with global developmental delay and a desmoplastic medulloblastoma that was previously reported by Taylor et al. [365]. The phenotypes in this patient were associated with the disruption of the SUFU gene, which is located in the deleted region. In the rest of this Chapter, I will refer to the aforementioned deletion in this patient as the "SUFU deleted region". The highlighted SNPs in Figure 3.5a (SNPs 79, 81, 83, 93 and 100) are all within the SUFU deleted region.

As seen in this figure, these deleted SNPs have noticeably different LR values and standard deviations (Figures 3.5b-3.5e; -1.34 ≤ LR ≤ -0.05, 0.3 ≤ SD ≤ 1.3). Figure 3.5e shows an example of the impact of oligo-level variability on the quality of the estimated SNP signal. The depicted SNP (SNP100) has a mean LR value of -0.5; however, 5 of the 20 perfect-match (PM) oligos of this SNP indicate increased signal intensities (with a mean log2-ratio value of +0.2; marked by blue dots) and can be considered noisy oligos. In the same SNP probe set, 11 other PM oligos indicate a relative loss of signal intensity (with a mean log2-ratio intensity of -0.52; marked by green dots). The remaining 4 PM oligos in this probe set denote a significant loss of signal intensity, with a mean log2-ratio of -1.3 (marked by red dots).
The latter subset of oligos represents a 2.6-fold increase in the magnitude of copy number loss compared to the original mean signal of SNP100 (LR = −0.5), and an LR estimate that is approximately equal to the theoretical log2-ratio value of a one-copy loss (log2(1/2) = −1). Such findings, which are commonly observed in SNP arrays, demonstrate that using a subset of PM oligos to evaluate log2-ratio values can improve the quality of the SNP signal. It is also reasonable to ask whether the SNP signal can be improved simply by increasing the number of reference samples. The analysis of the impact of reference set size on the estimated log2-ratio intensities is presented in the following section (Section 3.3.2).

3.3.2 Analysing the Impact of the Size of the Reference Set on the Estimated SNP Signals

To assess whether a larger reference set improves the quality of SNP signals, it is important to measure its impact on both the magnitude of the estimated LR values and the number of informative probes in regions with known copy number values. These two analyses are described in the following two subsections, respectively.

Impact of the Size of the Reference Set on the Estimated LR Values

For this analysis, the relative copy number ratios of all PM oligos in the "SUFU deleted region" were compared using 6 different reference sets of varying sizes. Three of these reference sets were the normal father (Rfather), normal mother (Rmother) and the mean of both parents (Rparents) of the affected MR patient. The other three reference sets consisted of 24 (R24), 99 (R99) and 150 (R150) normal individuals, all with normal karyotypes.1 The results of this analysis are presented in Figures 3.6-3.7 and Table 3.2.

1 These samples were part of a larger data set of normal parents in a trio study by Friedman et al. [29].
Figure 3.7 presents a comparison between the distribution of log2-ratio intensity readouts of the PM oligos in the SUFU deleted region (500) and all other PM oligos on the array (> 1.1 million), using the cumulative density plots of these two populations. As seen in this figure, in all cases the distribution of the SUFU deleted oligos falls on the left side of the baseline CDF (the CDF of the entire array; shown in black). The latter observation implies that the distribution of log2-ratio intensities of the PM oligos in the SUFU deleted region is consistently smaller than the rest of the array, regardless of the size of the reference set used to calculate the LR values. The distance between the CDFs of the deleted and the baseline oligos, referred to as δ, represents the distinguishability of the deleted oligos from the rest of the array. Figure 3.7 also shows that at F(x) = 0.5,1 the δ value with respect to the largest reference set (CDF150; yellow curve) is smaller than the δ with respect to a single reference sample of the normal father (CDFfather; blue curve). The latter finding suggests that a larger reference set does not necessarily improve the magnitude of LR deviation of real CNVs.

Impact of the Size of the Reference Set on the Number of CNV-affirmative Oligos

The Xba SNP array data (50K) of 995 PM oligos in 22 regions of known copy number deletion (previously reported in [29]) were analysed to investigate whether using a larger reference set improves the proportion of CNV-affirmative oligos in these CNVs. In this analysis, the proportion of CNV-affirmative oligos (referred to as θ) is defined, for each deleted SNP probe set, as the number of PM oligonucleotide probes that indicate loss of signal intensity (LR < 0) divided by the total number of PM oligos in the probe set.
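Under this definition, θ can be computed directly from the per-probe-set LR readouts. The sketch below is a minimal illustration (the variable names and the 20-oligo probe sets are assumptions for the example, not OPAS code):

```python
import numpy as np

def theta(probe_set_lr):
    """Proportion of CNV-affirmative PM oligos in one probe set:
    oligos with LR < 0 divided by the total number of PM oligos."""
    lr = np.asarray(probe_set_lr, float)
    return (lr < 0).sum() / lr.size

def mean_theta(region):
    """Average theta across all SNP probe sets in a deleted region,
    as summarized per region in Figure 3.8."""
    return float(np.mean([theta(ps) for ps in region]))
```

For example, a probe set in which 15 of 20 PM oligos have LR < 0 yields θ = 0.75, matching the way values such as θ = 15.3/20 are reported in the text.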
The boxplots in Figure 3.8 illustrate the distributions of the average θ values (average number of affirmative oligos per SNP probe set) across the 22 known deletions with respect to 3 separate reference sets (R24, R99 and R150).2 As seen in this figure, the reference set with 99 normal samples (R99) has a better θ rate than the reference set with 24 samples (θR24 = 13.3/20 ≈ 66%, θR99 = 15.3/20 ≈ 77%). However, the proportion of affirmative oligos using the largest reference set, with 150 samples, is lower than that of 99 samples (θR150 = 14.3/20 ≈ 71% < θR99 = 77%).

A similar analysis was applied to 5 SNPs in the SUFU deleted region (SNPs 79, 81, 83, 93 and 100; shown in Fig. 3.5) to study the impact of the size of the reference set on the estimated SNP log2-ratio values and the number of CNV-affirmative oligos in these 5 SNPs. Comparing the data generated by R150 and Rfather in Table 3.2 reveals that using Rfather leads to larger magnitudes of copy number loss and approximately the same overall number of informative oligos for these 5 deleted SNPs. In summary, the results of the analyses presented above suggest that there is no substantial evidence to support the hypothesis that increasing the size of a reference set improves the magnitude of real copy number aberrations.1

1 F(x) is defined as the proportion of X values that are less than or equal to x.
2 These reference sets were previously described on page 79.

3.3.3 Results of OPAS Pre-processing Phase

Clustering PM Oligos in SNP Probe-sets

The raw array data (from .CEL files) were normalized using the quantile normalization method, as explained in the pre-processing methods. Fuzzy-kmeans clustering was then applied to each SNP probe set according to the model described there. Figure 3.9 illustrates examples of fuzzy-kmeans clustering for 5 SNPs in the SUFU deleted region that were previously shown to have noticeably different LR values (see Figure 3.5).
Each predicted oligo cluster is shown in a different color in these plots (red, blue and green). As observed, 3/5 SNPs (Figs. 3.9a-3.9c) are predicted to have 2 separate oligo clusters (k = 2), while the remaining 2 SNPs (Figs. 3.9d-3.9e) each have 3 separate oligo clusters (k = 3). As observed, excluding noisy oligos (such as the blue oligo cluster in Figure 3.9c) can lead to significant improvements in the estimated SNP signal intensities. However, filtering oligos does not always improve the magnitude of LR deviation. For instance, excluding the blue oligo cluster from the probe set of SNP81 (Figure 3.9b) does not significantly improve the estimated SNP LR value (compare LR and LR' values in Fig. 3.9b). Figure 3.6 also confirms that SNP81 does not provide any deletion information to begin with (regardless of the reference set). Therefore, oligo-level processing cannot have a significant impact on the quality of the estimated signal from this SNP.

1 In fact, based on empirical data, I observed that using a larger reference set can have a negative impact on the ability to detect somatic changes in cancer samples.

Performing Oligonucleotide Probe-level Analysis of the SNP Array Data

As explained in the pre-processing methods, for each oligo cluster in a SNP probe set three null-hypothesis tests were applied to assess whether the oligo cluster indicates a significant LR deviation from the baseline. Figure 3.10 provides an example of clustering and likelihood estimation for a given SNP on the Xba array (SNP A-1740765, panel (A)). The fuzzy-kmeans clustering found 2 separate PM oligo clusters in the SNP probe set, which are denoted by red and blue colors in Panel (B). Panels (D) and (E) show the results of the null hypothesis tests for each of the above 2 oligo clusters. The data generated by these tests are passed as input to the downstream QDA classifier, detailed in the next section.
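As a rough illustration of the clustering step, the sketch below applies a plain 1-D k-means to a probe set's LR values; OPAS itself seeds the cluster count and centers with a subtractive fuzzy-inference step, which is not reproduced here. The example probe set mirrors SNP100 from Figure 3.5e (4 oligos near −1.3, 11 near −0.52, 5 near +0.2):

```python
import numpy as np

def kmeans_1d(x, k, iters=50):
    """Plain 1-D k-means over a probe set's LR values. OPAS instead chooses
    k and the initial centers via a subtractive fuzzy-inference step; here the
    centers are simply initialized at evenly spaced quantiles."""
    x = np.asarray(x, float)
    centers = np.quantile(x, np.linspace(0.1, 0.9, k))
    for _ in range(iters):
        # assign each oligo to its nearest center, then recompute centers
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        new = np.array([x[labels == j].mean() if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

On the SNP100-like example, the three recovered centers correspond to the red, green and blue oligo groups of Figure 3.5e, and the most negative center would be the deletion-affirmative cluster.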
SNP Classification and LR Estimation

The clustering of SNPs in multiple arrays revealed that less than 5% of SNPs have only 1 oligo cluster (k = 1). For these SNPs no further pre-processing is applied, and the center of the single predicted oligo cluster is used as the SNP-level LR readout. The remaining SNPs, with more than 1 oligo cluster, are passed through QDA classifiers (Equation (G.2)). The QDA classifiers were trained using 108 SNPs from validated regions of copy number loss [29] and the same number of SNPs from validated regions of copy number gain, as well as 108 SNPs from putative normal regions.1 As more regions of deletion and amplification are validated, the user can add the validated SNP data to the training set. This feature enables the algorithm to potentially improve its own decision boundaries. The input data to the QDA classifiers include the p-values of the KS-tests from the likelihood estimation phase, the number of PM oligos in each oligo cluster and their centroid values.

Comparison of OPAS Pre-processing and Naive Bayes Classification

As explained earlier in Section 3.2.3, an alternative approach was suggested for pre-processing Affymetrix SNP probes based on Bayes classification and without oligo clustering. The same dataset that was used to train the QDA classifier (324 SNPs from known deleted, amplified and normal regions2) was also used for training the Naive Bayes classifiers. Table 3.3 presents comparisons between the performance of QDA and the 2 Naive Bayes classifiers (described in Section 3.2.3), using 567 SNPs in 19 regions of copy number deletion in follicular lymphoma samples that were validated by Illumina sequencing3 (500K data). The aims of this analysis were: 1) to compare the proportion of the 567 SNPs that were accurately detected as deleted by each classifier; and 2) to compare the magnitudes of the LR values estimated by each approach.
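For orientation, a minimal QDA of the kind used here (a per-class Gaussian with its own covariance matrix) can be written in a few lines. The two-dimensional features below are synthetic stand-ins for the real inputs (KS-test p-values, oligo-cluster sizes and centroids), not the actual training data:

```python
import numpy as np

class QDA:
    """Minimal quadratic discriminant analysis: fit one Gaussian per class
    and classify by the largest class-conditional log-density plus log-prior."""
    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.params_ = []
        for c in self.classes_:
            Xc = X[y == c]
            mu = Xc.mean(axis=0)
            cov = np.cov(Xc.T) + 1e-6 * np.eye(X.shape[1])  # regularize
            prior = len(Xc) / len(X)
            self.params_.append((mu, np.linalg.inv(cov),
                                 np.log(np.linalg.det(cov)), np.log(prior)))
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        scores = []
        for mu, icov, logdet, logprior in self.params_:
            d = X - mu
            # log-density up to a constant: -0.5 d' S^-1 d - 0.5 log|S| + log prior
            scores.append(-0.5 * np.einsum('ij,jk,ik->i', d, icov, d)
                          - 0.5 * logdet + logprior)
        return self.classes_[np.argmax(scores, axis=0)]
```

Because each class keeps its own covariance, the decision boundary is quadratic rather than linear, which is what distinguishes QDA from LDA; retraining with additional validated SNPs simply refits the per-class means and covariances.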
The summary of the results, presented in the last row of Table 3.3, indicates that the OPAS number of false negative deletions was 151/567 (27%); however, both of the Naive Bayes classifiers showed lower deletion sensitivity, with 217/567 (38%; kernel fit) and 208/567 (37%; normal fit) false negative deletion calls (these are deletion calls at the SNP level, not the CNV level). It is also observed that for some of the deleted regions QDA has noticeably better sensitivity than the Naive Bayes classifiers. For instance, for region #4 in Table 3.3, QDA predicted 101/152 (66%) SNPs as deleted, whereas the Naive Bayes classifiers predicted 71-72 (∼47%) deleted SNPs.

1 The exact copy number values for these regions were not experimentally determined; however, all these SNPs had LR ≈ 0 and were manually inspected prior to adding them to the training set.
2 108 × 3 = 324
3 These data are from the analysis of follicular lymphoma CNVs, described in Chapter 4.

The suggested OPAS modification excludes the oligo clustering phase (see Fig. 3.2); therefore, the mean log2-ratio intensity of all PM oligos with LR < 0 was used as the LR estimate of the SNPs that were classified as deleted by the Naive Bayes classifiers.1 The mean Naive Bayes estimated LR values of all SNPs within the deleted regions are shown in columns 10-11 (LRNBN and LRNBK) of Table 3.3. This table also shows the original LR values based on the average log2-ratio intensity of all PM oligos in the SNP probe sets (column 9; LR), as well as the OPAS-estimated LR values (column 12; LROPAS). Comparing the estimated LR values (columns 9-12) in Table 3.3 reveals that for all 19 sequence-validated deletions, the OPAS pre-processed LR values (LROPAS = −1.01) have a noticeably larger magnitude of copy number loss than the mean probe set LR measurements before pre-processing (LR = −0.43). The magnitude of the OPAS-generated LR values is also larger than the LR estimates based on the Naive Bayes classifiers (LRNBN ≈ LRNBK = −0.59).
The boxplots in Figure 3.13 also demonstrate noticeable differences between the LR values of the same deleted regions based on the different approaches. These boxplots indicate that the OPAS-estimated LR values (boxplot 4) have larger magnitudes of copy number loss than the LR measurements based on the Naive Bayes classifiers (boxplots 2-3). It is also evident from this figure that the LR values of the deleted regions are significantly improved after the OPAS SNP pre-processing analysis (compare boxplots 1 and 4 in Fig. 3.13). The latter observation confirms the initial hypothesis that oligo-level analysis of SNPs can lead to improved LR estimates.

It is also important to investigate whether the higher sensitivity of QDA in calling deleted SNPs, compared to the Naive Bayes classifiers, comes at the cost of a poor false positive rate for deletion SNP calls. This can be assessed by randomly selecting SNPs from regions that are putatively normal and comparing the number of such SNPs that are predicted as deleted by each classifier. To implement this analysis, 567 SNPs2 were randomly selected from the autosomes of two FL samples (any known CNV regions were excluded from these autosomes prior to the random SNP selection). In the rest of this analysis, these 567 randomly selected SNPs are referred to as 'putative normal SNPs'. The QDA classification of OPAS and the Naive Bayes classifiers were then applied to these putative normal SNPs and the number of deletion calls was recorded. The results of the OPAS analysis suggest that 35.4% (201/567) of the aforementioned putative normal SNPs appear to indicate loss of signal intensity.

1 For those SNPs that were classified as normal by the Naive Bayes methods, the mean of all PM oligos in the probe set with −0.1 ≤ LR ≤ +0.1 was used as the SNP's LR measurement.
2 The same number of SNPs as in the 19 deleted regions of Table 3.3.
The Naive Bayes classifiers with normal distribution and kernel fit, respectively, predicted 31.2% (177/567) and 39.5% (224/567) of these putative normal SNPs as deleted. These findings indicate that all 3 classifiers generated almost the same proportion of deleted SNP calls among the 567 putative normal SNPs (31.2%, 35.4% and 39.5%). In conclusion, Table 3.3, Figure 3.13, and the findings of the latter analysis collectively suggest that using Naive Bayes classification in the SNP pre-processing phase provides no substantial advantage. Nonetheless, the two Naive Bayes classifiers have also been added as optional features of OPAS that can be selected by the user instead of the default pre-processing method.

An Example of the Impact of SNP Pre-processing on Improving CNV Data Quality

As described in Chapter 1, a major limitation of microarray data smoothing is that such approaches often tend to smooth the entire signal aggressively in order to eliminate noisy artifacts, and therefore may also suppress the magnitude of true CNVs [220, 221]. To investigate whether such problems exist in OPAS outputs, the distribution of log2-ratio fluorescent intensity readouts of 540 PM oligos from 45 Nsp SNPs in a known deleted region was analysed before and after pre-processing (this deletion was detected in an MR patient and was validated by FISH [30]). Figure 3.11 illustrates the CDF plot of these 540 deleted oligos before and after pre-processing, shown by black and red curves, respectively. This figure also depicts the CDF of all oligos on the array (blue curve), which is considered a baseline distribution for the copy number analysis.1 As observed, the CDF of the deleted oligos after pre-processing (red curve) falls further left of the distribution of the same oligos before pre-processing (black curve).

1 No other large-scale CNV or aneuploidy was discovered in this MR patient; therefore, it is reasonable to consider that the distribution of the whole-genome microarray data (> 3 million oligos) represents a copy-number-normal baseline.
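The CDF comparisons used in Figures 3.7 and 3.11 can be outlined as follows; reading δ off as a horizontal quantile gap at F(x) = 0.5 is one plausible formalization of the figure, not necessarily the exact measurement used to produce it:

```python
import numpy as np

def ecdf(values):
    """Return an empirical CDF: F(x) = fraction of `values` <= x."""
    v = np.sort(np.asarray(values, float))
    return lambda x: np.searchsorted(v, x, side='right') / v.size

def delta_at(deleted_lr, baseline_lr, f=0.5):
    """Horizontal gap between the two CDFs at height F(x) = f: the baseline
    quantile minus the deleted-oligo quantile. A larger positive delta means
    the deleted oligos are more clearly left-shifted from the baseline."""
    return np.quantile(baseline_lr, f) - np.quantile(deleted_lr, f)
```

A deleted population centred near −1 against a baseline centred near 0 yields δ ≈ 1, and comparing δ across reference sets reproduces the kind of ranking shown in Figure 3.7.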
In summary, Figure 3.11 indicates that SNP pre-processing improved both the proportion of informative oligos and the magnitude of LR deviation from the baseline in the aforementioned known CNV region. As a result, there is a wider separation between the LR distributions of the deleted and baseline oligos. Such improvements can consequently lead to better CNV detection accuracy in the downstream copy number analysis. Section 3.3.6 details the impact of such oligo-level improvements on the downstream CNV calling accuracy, based on a comparative analysis using simulated data.

3.3.4 Results of OPAS Post-processing Phase

Analysis of the estimated SNP signals from different experiments showed that a considerable correlation sometimes exists between the LR values and the PCR fragment length data, as shown in Figure 3.12a (r = 0.9, p = 2e-5). To correct for such biases, LOWESS non-parametric normalization is applied to the SNP pre-processed data. The effect of LOWESS normalization in removing the fragment length bias is shown in Figure 3.12 [180]. Next, to find regions with multiple SNPs that exhibit statistically different LR measurements, the Circular Binary Segmentation algorithm is applied to the modified SNP data (Fig. 3.1). This entirely non-parametric method splits the chromosomes into contiguous regions of equal copy number by modelling discrete copy number gains and losses through permutation analysis [189, 242]. In the OPAS design, a default of 10,000 permutations is used to estimate the p-values that define the significance of a segment split.
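As a schematic of how a permutation test scores a candidate split, the sketch below uses a simplified mean-shift statistic rather than the full circular statistic that CBS computes:

```python
import numpy as np

def split_pvalue(lr, i, j, n_perm=10000, seed=0):
    """Permutation p-value for the mean-shift of segment lr[i:j] against the
    rest of the chromosome. A simplified stand-in for the CBS split test;
    OPAS runs the full CBS algorithm with a default of 10,000 permutations."""
    lr = np.asarray(lr, float)
    rng = np.random.default_rng(seed)
    w = j - i
    obs = abs(lr[i:j].mean() - np.delete(lr, np.arange(i, j)).mean())
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(lr)
        # under the null, any w-sized window is exchangeable with the rest
        if abs(p[:w].mean() - p[w:].mean()) >= obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction
```

A genuine mean shift produces a p-value near the 1/(n_perm + 1) floor, while a segment drawn from the same distribution as its surroundings yields a large p-value, which is the basis for accepting or rejecting a split.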
Although in theory any detected segment with an LR deviation from the baseline (|LR| > 0) presents a putative CNV, in practice other factors can contribute to fluctuations in LR intensities that are statistically significant but do not represent a real biological event (e.g., DNA quality). Therefore, in a classification-based CNV calling method a second filter is used to choose segments that likely indicate a real copy number change, and this filter often depends on the LR value. For example, in a study of ovarian tumor samples, regions with log2-ratio > +0.3 or < −0.3 were used as candidate CNVs [366] (direct use of LR cut-offs for CNV calling); and in another study, z-scores of the segmented read-depth coverage data were used to call putative CNVs from genome sequence data [195] (z-scores also depend on the distribution of LR values). From the biological perspective, we also have some prior knowledge about the frequency and extent of genomic CNVs in different samples. It is well known that cancer DNA is frequently affected by genome rearrangements, beyond the copy number polymorphisms observed among normal individuals [18, 19, 24, 25, 37, 69]. Therefore, instead of using predefined cut-offs to label segments as candidate CNVs, OPAS provides the list of all segments in the genome (Fig. 3.3) together with a basic analysis of the distribution of these segments (meta-data). The proposed approach is to use both the OPAS-generated data and the biological information to assign proper statistical cut-offs for each separate study, depending on the type of samples being analysed (e.g., cancer or mental retardation).

3.3.5 Assessing the Effect of Noise on CNV Calling Performance

To evaluate the robustness of the CNV calling in the presence of noise, the biologically inspired model described in Section 3.2.7 was used to generate a dataset of simulated noisy signals. These simulated noisy signals represent data from chromosome 14 of a sample with a known IGH deletion1 (ht-17).
Figure 3.16 presents the effect of increasing the σ of the underlying noise on the performance of the CNV calling algorithm, based on 9 different levels of noise (σnoise = .01, .03, .05, .09, .13, .15, .17, .19, .21). For each specified σnoise, 10 random simulated signals were generated and analysed using the OPAS method. The number of times that the known IGH deletion was not detected in a simulated signal with a given σnoise, divided by the total number of generated signals with the same σnoise (n = 10), is referred to as the detection error rate (η). The signal-to-noise ratio (SNR) at each noise level was also computed, based on the estimated amplitude of the original signal and the known added noise. As expected, the detection error rate (η) increases with the standard deviation of the noise. Figure 3.16 also shows that when σnoise reaches 0.17 (marker 'B' in Fig. 3.16) the IGH deletion is no longer detected in any of the simulated signals (η = 1). Further study of this particular data point revealed that at σnoise = 0.17 the SNR is approximately equal to 1 (SNR = 1.03), and the noise and the original signal have approximately the same standard deviation (σsignal = 0.1694 and σnoise = 0.17). This finding implies that when the magnitude of the simulated noise is equal to or greater than that of the original signal, the algorithm cannot distinguish between the distribution of the deleted SNPs and the rest of the data. It must be noted that the above analysis found no false positive CNV calls in any of the simulated signals (n = 90). Therefore, I continued this experiment by increasing the magnitude of the added noise in the simulated chromosome 14 signals and analysing the resulting CNV calls. The first false positive CNV call was detected when the amplitude of the added noise was more than two times larger than that of the original signal (σnoise = 0.35 and σsignal ≈ 0.17).
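The quantities in this experiment can be sketched as follows; defining SNR as the ratio of the signal and noise standard deviations is one plausible reading of the text (which reports SNR = 1.03 at σsignal = 0.1694, σnoise = 0.17, so the amplitude estimate used in the thesis may differ slightly):

```python
import numpy as np

def add_noise(lr_signal, sigma, seed=0):
    """Y = X + eps with eps ~ N(0, sigma^2): the simulation model of Sec. 3.2.7."""
    rng = np.random.default_rng(seed)
    return np.asarray(lr_signal, float) + rng.normal(0.0, sigma, len(lr_signal))

def detection_error_rate(detected_flags):
    """eta: fraction of the simulated signals in which the deletion was missed."""
    flags = np.asarray(detected_flags, bool)
    return 1.0 - flags.mean()

def snr(signal_sd, noise_sd):
    """Signal-to-noise ratio as the ratio of standard deviations (assumption)."""
    return signal_sd / noise_sd
```

With these definitions, detecting the deletion in 7 of 10 replicates at some σnoise gives η = 0.3, and σnoise = σsignal ≈ 0.17 corresponds to SNR ≈ 1, the regime where η reaches 1 in Figure 3.16.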
In conclusion, the findings of these analyses suggest that, despite the increased level of noise, the CNV calling algorithm yields reasonable results.

1 The validated IGH deletion is located on chromosome 14q32.33, as shown in Figure 3.4.

3.3.6 Comparative Analysis of CNV-calling Algorithms

A simulated model was implemented according to the method described in Section 3.2.6 (see Figure 3.14). OPAS, SMD and GLAD were run with default parameters, as most users would do. For OPAS and GLAD, all regions with LR < 0 and LR > 0 were selected as candidate deletions and amplifications, respectively. Based on the SMD recommendations, two different p-value cut-offs (p-values ≤ 1e-6 and ≤ 1e-8) were used to generate two separate lists of SMD results.1 The results from each dataset were compared to the known simulated CNV regions for each alteration size (w = 2, 4, 8, 10, 15, 25, 100, 200) and LR ratio response (δ = 0.11, 0.2, 0.4, 0.6, 0.8, 1.0). True positives, false positives and false negatives were aggregated for each algorithm and simulation to evaluate the sensitivity (true positive rate; TPR) and precision (positive predictive value; PPV) estimates. As shown in Figure 3.15, all algorithms offered similar sensitivity and precision for alterations with log2-ratio shifts larger than 0.6 (δ ≥ 0.6) that contained more than 10 to 15 SNP probe markers (panels D-E and I-J; Figure 3.15). However, detecting CNVs with fewer SNPs (w < 10) was largely dependent on the magnitude of the LR deviation (δ) and the method used to analyse the data. For example, Figure 3.15.A shows that all methods had substantially lower sensitivity for detecting simulated CNVs with slight LR deviations that also contained fewer than 10 SNP probe markers (δ = 0.11 and w < 10).
This plot also shows that the OPAS sensitivity for simulated CNVs with an LR deviation of 0.11 improved when these CNV regions contained at least 10-15 SNP probe markers (Panel (A)). Panel (F) shows that the estimated precision (PPV = TP/(TP + FP)) of detecting CNVs with only a slight LR deviation (δ = 0.11) was also low, regardless of the algorithm used to analyse the simulated data. As expected, in each plot the detection sensitivity and precision increased with the number of SNP probe markers within the simulated regions (w). The main conclusion of the analysis presented in Figure 3.15 is that for mid-range LR deviations (δ = 0.2-0.6) OPAS provides noticeably better sensitivity and precision for detecting simulated CNVs with fewer than 10 SNP probes (see panels (B-C) and (G-H)).

1 The SMD documentation indicates that putative CNVs generated with p-values ≤ 1e-8 have the lowest false positive rate but a higher rate of false negatives; on the contrary, results with p ≤ 1e-6 have better CNV detection sensitivity but a ∼40% false positive rate.

3.3.7 Comparing OPAS and CBS Accuracy

To investigate whether the observed improvements in CNV detection accuracy with fewer SNP probe markers are due to the OPAS approach to dealing with noisy oligos, and not to CBS segmentation alone, the performance of OPAS and CBS was compared using the model previously described in Section 3.2.8. A series of simulated signals was generated by randomly selecting N SNPs from a known deleted region (on chromosome 2p16.3 of an MR patient) and eliminating them from the Xba SNP array data (the deletion includes 17 Xba SNP probe markers). Fourteen different N values (N = {2, 3, . . . , 15}) were used to generate these simulated signals. This process was repeated 100 times for each value of N, resulting in a total of 1,400 simulated signals with 2 to 15 SNP probe markers remaining within the known deleted locus (ι).
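The signal-generation procedure can be sketched as below; the LR values and SNP positions are synthetic stand-ins for the real 2p16.3 Xba data:

```python
import numpy as np

def thin_deleted_region(lr, region_idx, n_remove, rng):
    """Drop n_remove randomly chosen SNPs from the known deleted region,
    mimicking the simulation of Sec. 3.2.8 (17 Xba SNPs in the 2p16.3 deletion)."""
    region_idx = np.asarray(region_idx)
    drop = rng.choice(region_idx, size=n_remove, replace=False)
    keep = np.setdiff1d(np.arange(len(lr)), drop)
    return np.asarray(lr)[keep]

# 100 replicates for each N in 2..15 -> 1,400 simulated signals
rng = np.random.default_rng(0)
lr = np.zeros(500)
region = np.arange(240, 257)      # hypothetical positions of the 17 deleted SNPs
lr[region] = -1.0                 # idealized one-copy-loss LR for illustration
signals = [thin_deleted_region(lr, region, n, rng)
           for n in range(2, 16) for _ in range(100)]
```

Removing N SNPs leaves ι = 17 − N deleted markers in each signal, so the set spans ι = 15 down to ι = 2, matching the design above.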
The simulated signals were analysed with OPAS and CBS independently, and segments with log2-ratio < 0 were considered for further analysis. The boundaries of these segments (with LR < 0) were then compared with the known 2p16.3 deleted region, and those with > 60% coverage were considered putative correct calls. Table 3.4 and Figure 3.17 show the results of this analysis. As observed, OPAS detected almost all deletions that had 8 or more SNP probe markers (except for 1 false negative call in 800 signals with ι ≥ 8; see row 7 of Tab. 3.4), and CBS detected almost all deletions with at least 9 SNP probe markers (except for 2 false negative calls; see rows 5-6 of Tab. 3.4). The results of this analysis also indicate that as the number of remaining SNPs in the deleted region (ι) drops below 9, the methods show increasingly different CNV detection accuracies (see the pink dashed line in Fig. 3.17a).

As previously described on page 85, the accuracy of detecting CNVs using a segmentation-based approach depends not only on the ability of the algorithm to find a segment that maps to the real CNV event, but also on its estimated magnitude of LR change. Therefore, to perform an accurate comparison between OPAS and CBS, the estimated LR values of the known deleted region (within the simulated signals) are also presented in Figure 3.17b and Table 3.4. Comparing the OPAS and CBS estimated LR values shows significant improvements in the magnitude of copy number loss as a result of the OPAS analysis. For instance, the results in Figure 3.17a and Table 3.4 indicate that OPAS detected the known deletion in 92 of the 100 (92%) simulated signals that had only 3 deleted SNP probes (ι = 3), whereas CBS detected this deletion in 58 (58%) of these simulated signals.
The LR data of the same simulated signals (ι = 3), presented in Figure 3.17b and Table 3.4, reveal that the CBS and OPAS estimated average LR values of the deleted regions are approximately −0.1 and −0.82, respectively1 (see columns 6 and 4 of Table 3.4). These findings suggest that even though CBS found an overlapping segment in 58% of the aforementioned ι = 3 signals, unlike the OPAS results, these segments do not appear as significant and reliable deletion calls. Collectively, these data indicate that the OPAS approach resulted in a significant loss of signal intensity for the known deleted region in more than 90% of the signals that had only 3 SNP probe markers within the deleted boundaries (2p16.3). In summary, the results of this experiment confirm the initial hypothesis that OPAS oligo-level data processing has a major impact on the accuracy of finding CNVs with fewer SNP probe markers, and that these improvements are not the result of the CBS segmentation alone (Table 3.4 and Figure 3.17). This conclusion is also supported by the results of the CNV analysis in follicular lymphoma patients, presented in the next Chapter, where OPAS detected several real CNVs (validated by FPP or Illumina sequencing) that were not identified by several alternative methods.

3.4 Conclusions

To investigate the sources of variability in SNP arrays, in this Chapter I studied several factors that influence signal intensity readouts in Affymetrix GeneChip SNP arrays (Sections 3.3.1-3.3.2), such as the impact of the reference set size on the magnitude of copy number deviation. This analysis revealed that, in contrast to what might be expected, a larger reference set does not necessarily yield better CNV detection sensitivity (Tab. 3.2). Based on these results and the results of analysing SNP probe sets (Fig. 3.5), I hypothesized that processing the SNP array data at the oligo level could improve the accuracy of the downstream CNV detection.
1 In both cases LRbaseline ≈ 0.

To implement this idea, I developed the algorithm for Oligonucleotide Probe-level Analysis of Signal intensities, or OPAS (Fig. 3.1). In the first step of OPAS, the raw signal intensities of the test and reference samples are log-transformed and normalized using quantile normalization (described in Appendix F). As mentioned earlier, a possible theoretical problem with this approach is the risk of removing some of the signal in the tails of the distribution; however, several studies have shown that empirical data do not support this hypothesis [205, 302]. To investigate this problem, a comparative analysis of the impact of normalization on the estimated log2-ratio values was performed, as described in Appendix F. This analysis applies 5 normalization methods to SNP array data (500K) from 8 follicular lymphoma samples and compares the average LR values of 50 sequence-validated CNVs in these samples. The results of this comparison did not provide any substantial evidence to support the hypothesis that the quantile-based normalization method suppresses the magnitude of real CNVs in the aforementioned follicular lymphoma samples. As seen in Table F.2, contrary to what was expected, the quantile-based OPAS normalization often showed an increase in the magnitude of real CNVs in these patients. Nonetheless, the 5 normalization approaches in Appendix F have been added to the OPAS software, and the user can change the default pre-processing normalization method or replace it with a new user-defined function. Following normalization, for each SNP on the Affymetrix array, the PM oligos within the SNP probe set are separated into groups with similar LR values based on a two-level clustering approach (Fig. 3.10; panels A-B).
The proposed clustering method, referred to as fuzzy-kmeans clustering, first uses a fuzzy inference system (the subtractive algorithm) to model the probe set data behavior through a minimum number of rules, and then uses this information to initialize a k-means optimization-based clustering algorithm (Fig. 3.9). Each estimated cluster of PM oligos (i.e., oligo cluster) is subsequently passed through a hypothesis testing model that performs KS-tests of the null hypothesis that the oligo cluster is distributed around the normal copy number baseline (panels C-E in Fig. 3.10). These estimates are then passed to a QDA machine learning classifier to identify the SNP's oligo cluster with the most informative signal, and the center of this oligo cluster is subsequently used as the SNP-level log2-ratio (LR) measurement. In the post-processing phase, a LOWESS non-parametric method is first applied to the estimated SNP LR values to remove potential PCR-fragment length biases (Fig. 3.12). The generated OPAS meta-data can also help to identify potential issues with specific samples. It is important to note that in more than 600 analysed samples (from mental retardation, follicular lymphoma, as well as normal individuals) only about 5% of the cases showed a strong PCR-fragment length bias (r ≈ 0.9; such as shown in Figure 3.12a). A possible cause of such correlation is that experimental issues with the PCR process or poor quality of the sample's genomic data (problems at the experiment level) have introduced this bias. Although this correlation is computationally corrected (Fig. 3.12b), there is no guarantee that the quality of the data generated in such experiments is at the same level as that of samples that showed no such experimental biases.
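The fragment-length correction can be illustrated with a minimal tricube-kernel local-linear smoother standing in for LOWESS; the synthetic data in the accompanying check fabricate a strong length bias like the r ≈ 0.9 case described above:

```python
import numpy as np

def lowess_correct(lr, frag_len, frac=0.3):
    """Remove the PCR fragment-length trend from SNP LR values: fit a locally
    weighted line of LR vs. fragment length at each point and subtract the
    fitted trend. A minimal stand-in for the LOWESS step of OPAS."""
    lr, fl = np.asarray(lr, float), np.asarray(frag_len, float)
    span = frac * (fl.max() - fl.min()) / 2
    trend = np.empty_like(lr)
    for i, x in enumerate(fl):
        # tricube weights over a window of half-width `span` around x
        w = np.clip(1 - (np.abs(fl - x) / span) ** 3, 0, None) ** 3
        b, a = np.polyfit(fl, lr, 1, w=np.sqrt(w))  # weighted local line
        trend[i] = a + b * x
    return lr - trend
```

A local-linear fit (rather than a local mean) avoids bias at the ends of the fragment-length range, so the corrected LR values are essentially uncorrelated with fragment length, which is the behavior shown in Figure 3.12b.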
The generated OPAS meta-data, such as the intensity scatter plots before and after PCR-fragment-length normalization that are automatically generated and stored during sample analysis, can help to track down problematic samples. Although building proper quality assurance controls is beyond the scope of this thesis, analysing the generated meta-data can provide further information about the samples and may facilitate the quality control process. The final module of OPAS, shown in the Fig. 3.1 flowchart, is non-parametric CBS segmentation (Section). The aim of applying CBS segmentation is to identify neighboring regions of DNA that exhibit a statistically significant difference in their average signal intensities. To assess CNV calling accuracy, OPAS, GLAD and SMD results were compared using a simulated Nsp data set (described in Section 3.3.6). The results of this analysis indicated that all methods had similar sensitivity and precision for LR shifts larger than 0.6 (δ > 0.6) and for regions with more than 10-15 SNP probe markers (Sec. 3.3.6). The analysis also found that for midrange shifts (0.2 ≤ δ ≤ 0.6), the performance of OPAS in detecting CNVs with fewer than 10 SNP probes was noticeably higher than that of the other methods (Fig. 3.15). The latter finding implies that OPAS has better accuracy in detecting CNVs with fewer probes compared to GLAD and SMD. To investigate whether this improved performance was due to the OPAS oligo-level pre-processing analysis and not the CBS segmentation alone, another experiment was implemented to compare the accuracy of both methods in detecting a known CNV with a variable number of SNP probe markers (Sec. 3.2.8 and Sec. 3.3.7). For this analysis, 1,400 signals were generated with 2-15 SNP probe markers within a known deleted region on chromosome 2p16.3 of a patient with mental retardation. 
The results of this analysis (Figure 3.17, Table 3.4) also supported the hypothesis that OPAS oligo-level processing of SNP data has a major impact on the accuracy of finding CNVs with fewer SNP probe markers, an accuracy that is not achieved by CBS segmentation alone. In addition to the simulated data analysis described in Sec. 3.3.6, OPAS was applied to data from 146 patients with mental retardation, and the predicted CNVs were compared to a list of 30 validated CNVs in the same patients that were previously found by integrating the results of 7 alternative copy number algorithms [215]. The results of this analysis, presented in Table I.1, revealed that OPAS detected all 30 of these validated CNVs, in addition to 52 extra putative CNVs in these MR patients. While there is no biological verification for the new putative events, Ingenuity Pathway Analysis found genetic disorder, neurological disease and behavior to be the most significant functions associated with these candidate MR CNVs. The pathway analysis results provide additional confidence in the OPAS findings in the MR dataset in the absence of experimental validation. Furthermore, I also used a biologically inspired model to evaluate the performance of CNV calling in the presence of controlled added noise (on the chromosome 14 IGH locus; Fig. 3.4). The results of this analysis showed that the average error rate (η) increased with the magnitude of simulated noise and reached 100% when the signal-to-noise ratio was greater than 1 (Fig. 3.16). The first false positive CNV call among these simulated signals was observed when SNR ≈ 3. Based on these findings, it can be speculated that the CNV calling algorithm yields reasonable sensitivity and specificity for detecting real CNVs. In the next chapter (Chapter 4), I will apply OPAS to study CNVs in 25 patients with follicular lymphoma and will use several alternative data sets to compare the predicted CNV results. 
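The noise-injection experiment described above — adding controlled noise to a template signal containing a known deletion and measuring how often the deletion is missed — can be sketched as follows. The caller here is a deliberately simple mean-threshold rule (the threshold and signal shape are made up for illustration; OPAS itself calls CNVs via CBS segmentation).

```python
import numpy as np

def error_rate_vs_noise(template, region, noise_sds, n_rep=50,
                        thresh=-0.4, seed=0):
    """Add Gaussian noise of increasing magnitude to a template LR signal
    and record how often a simple threshold caller misses the known
    deletion (hypothetical caller, for illustration only)."""
    rng = np.random.default_rng(seed)
    template = np.asarray(template, float)
    rates = []
    for sd in noise_sds:
        misses = 0
        for _ in range(n_rep):
            noisy = template + rng.normal(0.0, sd, template.shape)
            if noisy[region].mean() > thresh:   # deletion not called
                misses += 1
        rates.append(misses / n_rep)
    return rates

# A 9-probe deletion (LR = -0.8) embedded in a flat signal, loosely
# mimicking the chromosome 14 IGH template of Fig. 3.4.
template = np.zeros(100)
template[45:54] = -0.8
rates = error_rate_vs_noise(template, slice(45, 54),
                            noise_sds=[0.1, 0.5, 2.0])
```

As expected, the miss rate is essentially zero at low noise and grows as the injected noise begins to swamp the deletion's LR shift.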
This analysis will confirm the high sensitivity of the OPAS non-parametric approach in identifying small CNVs with only a few SNP probes that were otherwise cryptic to alternative SNP CNV analysis methods (for instance, OPAS detected several deletions with 4-8 SNPs that were validated by sequencing; p. 132). Such findings support the underlying hypothesis of the OPAS design: that a probe-level copy number analysis approach can improve CNV detection accuracy in Affymetrix SNP arrays, particularly for events with fewer SNP probe markers.

3.5 Figures and Tables

[Figure 3.1 flowchart image: .CEL files → log2 estimation → between-array quantile normalization → for each SNP i = 1..N (e.g., N = 262,000 for Nsp): cluster PM probes in the SNP(i) probe set (fuzzy-kmeans), find likelihood estimates for each PM cluster (KS-test), classify the SNP (QDA), estimate the SNP log2-ratio → PCR fragment-length normalization (LOWESS) → region identification (CBS segmentation).]

Figure 3.1: Flowchart of the OPAS algorithm. The OPAS algorithm has 2 main modules, pre-processing (shown by the red arrow) and post-processing (shown by the yellow arrow). The first phase, pre-processing, aims to detect the most informative PM oligos within each SNP probe set in order to improve the SNP log2-ratio intensity (LR) readouts. The main modules in SNP pre-processing are quantile normalization (Section), fuzzy K-means clustering (Section), likelihood estimation (Section), and QDA-based machine learning classification (Section). The goal of the next phase, post-processing, is to partition the whole genome into regions where the copy number changes between contiguous segments. 
In this phase, the SNP-level LR intensities are first subjected to a LOWESS normalization to remove the systematic biases induced by the impact of PCR fragment length on the estimated SNP signal intensities (Section). Subsequently, the LR values are passed to the CBS segmentation algorithm (Section). Analysing the mean log2-ratio values of these segments, or their z-scores, can help to identify candidate CNV regions.

[Figure 3.2 flowchart image: the default OPAS pipeline of Fig. 3.1, with a Naive Bayes classification branch replacing the fuzzy-kmeans clustering, KS-test likelihood estimation and QDA classification steps.]

Figure 3.2: Flowchart of the alternative approach for SNP pre-processing based on Naive Bayes classification. The modifications suggested for the OPAS pre-processing phase are superimposed on the default algorithm flowchart (previously shown in Fig. 3.1). The blue boxes on the left of the main flowchart denote the modifications. The red boxes (with greyed-out text) are the steps of the default OPAS pre-processing phase that are being replaced by these modifications. As seen in this plot, clustering is not applied to the SNP probe sets; instead, all PM oligos in each probe set are passed to a Naive Bayes classifier. The aim of this classifier is to determine whether the log2-ratio intensity of each SNP probe set is similar to that of SNPs from known deleted, normal or amplified regions. Since clustering information is not available in this approach (blue modules), log2-ratio values of SNPs are estimated based on the method explained in Section.

[Figure 3.3 schematic image: Affymetrix .CEL files (see Ch. 1) feed the OPAS algorithm, which produces meta-data (MA plots; density plots before and after quantile normalization; PCR fragment-length plots before and after LOWESS normalization) and generated results (karyotype visualization of CNVs as .jpg files; .BED files for import to the UCSC genome browser; a .txt list of all regions, including candidate CNVs and normal regions), stored on the server.]

Figure 3.3: Schematic representation of OPAS input/output data. As shown in this figure, the Affymetrix GeneChip raw .CEL intensity files are the only user-input data to the OPAS algorithm (OPAS allows both single and batch imports). During sample analysis, several different meta-data are generated, which are automatically stored on the server or the designated storage space. The output .txt file includes a list of all segments in each of the 23 chromosomes of the analysed sample. The generated CNV plots (.jpg files) provide a means to visualize CNVs along chromosome ideograms that are generated based on the UCSC genome browser (the visualization also accommodates switching between different versions of UCSC genome builds). An additional script enables generating UCSC custom tracks for any list of OPAS putative CNVs, according to BED formatting (.BED files). During sample analysis, a series of other figures and data (referred to as meta-data) are also generated and automatically saved at a pre-allocated space on the server (such as PCR fragment-length normalization plots, and probability density plots of the estimated LR values). These generated meta-data and graphs provide a means to facilitate CNV interpretation and analysis, as discussed in Section 3.4. 
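The .BED export mentioned above can be illustrated with a minimal sketch. The track name, labels and coordinates below are made up (they are not the real coordinates of any call in this thesis); note that BED uses 0-based, half-open intervals.

```python
import os
import tempfile

def write_bed_track(cnvs, path, name="OPAS_CNVs"):
    """Write putative CNV calls as a UCSC custom track in BED format.
    `cnvs` holds (chrom, start, end, label) tuples with 0-based,
    half-open coordinates, as BED requires."""
    with open(path, "w") as fh:
        # One track line, then one tab-separated BED line per call.
        fh.write(f'track name="{name}" description="OPAS putative CNVs"\n')
        for chrom, start, end, label in cnvs:
            fh.write(f"{chrom}\t{start}\t{end}\t{label}\n")

# Illustrative call with invented coordinates on chromosome 14.
path = os.path.join(tempfile.mkdtemp(), "opas_cnvs.bed")
write_bed_track([("chr14", 105000000, 105644000, "putative_del")], path)
```

The resulting file can be pasted or uploaded directly into the UCSC genome browser's custom-track dialog.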
[Figure 3.4 image: panel (a), OPAS visualization output for chromosome 14 of patient 17, marking the OPAS-predicted deletion and a deletion predicted by an alternative CNV analysis algorithm (SMD); panel (b), UCSC screenshot of the deleted region with FPP results, including the FPP clone HTa17_0164B08 that detected the deletion.]

Figure 3.4: The Nsp signal of chromosome 14 of a follicular lymphoma patient that harbors a deletion on 14q32.33 (∼644 kb; 9 SNPs) and is used as the template to generate simulated noisy signals. Panel (a) shows the OPAS visualization output of chromosome 14 of a follicular lymphoma patient (patient 17; obtained from the Chapter 4 CNV analysis). The black arrow highlights a predicted deletion, approximately 644 kb with 9 Nsp SNP probe markers, located in the chromosome 14 IGH locus. Panel (b) shows a screenshot of the UCSC genome browser illustrating the fingerprint profiling (FPP) alignment of BAC clones in chromosome 14q32.33 of this sample (ht-17) to the reference human genome (hg18). BACs with linear alignments to the reference genome are coloured blue and the ones with split alignments are coloured green (see Appendix K for more information). The two blue arrows indicate that the ends of BAC clone HTa17-0164B08 align to different loci on chromosome 14, suggesting that the region between these two arrows may have been deleted. The OPAS-predicted region of copy number deletion in the same sample is shown in pink. The black vertical dashed lines denote the concordance between the OPAS and FPP predicted boundaries of the IGH deletion. 
The black arrow in (b) indicates that the above deletion in patient 17 was also detected by an alternative CNV calling algorithm (SMD [259]). These observations suggest that the predicted deletion of the IGH locus in patient 17 is a real CNV event.

[Figure 3.5 image: panel (a), SNP scatterplot; panels (b)-(f), per-SNP probe-set plots of log2-ratio versus PM oligo index — (b) SNP79: LR = −0.89, SD = 1.3; (c) SNP83: LR = −1.34, SD = 1.1; (d) SNP93: LR = −1.33, SD = 1.3; (e) SNP100: LR = −0.50, SD = 0.6; (f) SNP81: LR = −0.05, SD = 0.3.]

Figure 3.5 (previous page): Example of oligo-level and SNP-level variability in SNP arrays (100K data). Panel (a) shows the SNP scatterplot of a region on chromosome 10 in a child with developmental abnormalities that harbors a known deletion, highlighted by the red dashed lines. This deletion is ∼353 kb in length and includes 25 Xba SNP probes. The deletion disrupts several genes, including the SUFU gene, which has been associated with the observed phenotypes in this patient [365]. Panels (b)-(f) illustrate the probe sets of 5 SNPs in the SUFU deleted region (SNPs 79, 81, 83, 93 and 100). The x-axis denotes the index of the PM oligonucleotide probes in the SNP probe set (1, 2, . . . , 20), and the y-axis shows their fluorescent log2-ratio intensity readouts. The probe-set plots (b-f) reveal that although all of the aforementioned 5 SNPs are within the SUFU deleted region, there is a wide range of variability between their estimated LR values and standard deviations (−1.34 ≤ LR ≤ −0.05, 0.3 ≤ SD ≤ 1.3). 
Some of these SNPs have similar probe sets and LR values, such as SNP83 (3.5c) and SNP93 (3.5d), which both indicate significant loss of signal intensities (LR83 = −1.34, LR93 = −1.33). Compared to these 2 SNPs, SNP100 (3.5e) has a relatively smaller magnitude of copy number loss (LR100 = −0.5; Panel 3.5e). The colored dots in panel 3.5e denote the differences between the log2-ratio readouts of PM oligos in the SNP100 probe set, as detailed on page 79 (red and green dots show oligos with LR < −0.5 and oligos with −0.5 ≤ LR < 0, respectively, while the blue dots show noisy oligos with LR > 0). The last plot in this figure (3.5f) illustrates the SNP81 probe set. As observed, the log2-ratio intensity readouts of the PM oligos in this probe set provide no substantial evidence to conclude that SNP81 is deleted (LR81 ≈ 0).

[Figure 3.6 image: a 10 × 6 grid of probe-set plots. Rows: SNPs 79-83 and 94-98 in the SUFU deleted region; columns: single-parent and pooled reference sets (RefSet Father, RefSet Mother, RefSet Parents) and R24, R99, R150.]

Figure 3.6: Comparing the impact of the size of the reference set on the estimated log2-ratio intensity readouts of SNP probe sets within the SUFU deleted region. These plots show the probe sets of 10 SNPs in the SUFU deleted region. Three of these 10 SNPs were also shown in the previous figure, including the non-informative SNP81 illustrated in Fig. 3.5f. The plots in each row illustrate the PM oligonucleotide probe sets of the same SNP. Each column represents estimated log2-ratio measurements with respect to a separate reference set. 
These 6 reference sets include the mother of the affected patient (column 1), his father (column 2), the mean of both parents (column 3), and three other reference sets with 24, 99 and 150 normal samples (R24, R99, R150; columns 4-6). This figure indicates that, regardless of the reference set used to estimate LR values, there is remarkable variation in the overall pattern of probe-set LR intensities. It is also observed that using a larger reference set does not improve the probe-set response of uninformative SNPs, such as SNP81 (Fig. 3.5f) or SNP96.

[Figure 3.7 image: cumulative distribution plot, x-axis log2-ratio (LR), y-axis proportion of PM oligos (× 100); curves for whole-array PM probes (base-line), RefSet Father, RefSet Mother, RefSet Mean-parents, R24, R99 and R150.]

Figure 3.7: Comparison of cumulative density functions (CDFs) of all oligos in the SUFU deleted region with 6 reference sets of varying sizes. This figure illustrates the CDF of log2-ratio fluorescent intensities from 500 Xba PM oligos (representing 25 Xba SNPs) in the SUFU deleted region (previously shown in Fig. 3.5). Six reference sets of varying sizes (Fig. 3.6) were used to estimate the log2-ratio intensities of the PM oligos in this region. Each CDF is depicted by a different color, as detailed in the figure legend (see Section for the description of these reference sets). In addition to the CDF of the deleted oligos, the CDF of all PM oligos on the Xba array (> 1.1 million PM oligos) is also shown by the black curve (referred to as the base-line CDF). The fact that all CDFs that represent the SUFU deleted region fall on the left side of the base-line CDF emphasizes that no matter which reference set was used to estimate test versus normal ratios, the log2-ratio intensity distribution of the SUFU deleted region is smaller than, and distinct from, the LR distribution of the entire array. 
However, the extent of this distinguishability, which is determined by the deviation between the SUFU and base-line CDFs, varies depending on the reference set that was used to estimate the LR values. It is observed that using the largest reference set (R150; denoted by the orange arrow) does not improve the distinguishability of the SUFU deleted oligos from the rest of the array, compared to the single-array reference set Rfather (denoted by the blue arrow).

[Figure 3.8 image: boxplots of the average number of PM oligos with LR < 0 per SNP probe set (θ) in 22 validated deleted regions, one boxplot per reference set (R24, R99, R150).]

Figure 3.8: Boxplots of the average number of CNV-affirmative PM oligos per SNP probe set (θ), with respect to 3 reference sets of varying sizes. Three reference sets with 24, 99 and 150 samples were used to estimate the log2-ratio values of the SNPs from 22 regions of known copy number loss in mental retardation patients [29, 30]. Each boxplot presents the average number of PM oligos per SNP probe set with LR < 0 across the 22 aforementioned deletions, with respect to a separate reference set (the reference sets are denoted on the x-axis). These boxplots indicate that when R24 is used, on average, 13.3 out of 20 PM oligos in each Xba SNP probe set indicate loss of signal intensity (θ24 = 13.3/20 ≈ 67%). Using a larger reference set with 99 normal samples (R99) improves the number of CNV-affirmative oligos per SNP to 15.4/20 (θ99 = 77%). However, when the largest reference set (R150) is used, the average number of CNV-affirmative PM oligos per SNP probe set drops to 14.3/20 (θ150 = 71%). 
This observation implies that the number of PM oligos within the probe set of a SNP in a deleted region that indicate loss of signal intensity does not necessarily improve with an increasing number of reference samples.

[Figure 3.9 image: five probe-set plots of log2-ratio versus PM oligo index, with cluster means marked as dashed lines — (a) SNP79: LR = −0.89, LR′ = −1.97; (b) SNP81: LR = −0.05, LR′ = −0.068; (c) SNP83: LR = −1.34, LR′ = −2.37; (d) SNP93: LR = −1.33, LR′ = −1.60; (e) SNP100: LR = −0.50, LR′ = −0.79.]

Figure 3.9: Fuzzy-Kmeans clustering of PM oligos in 5 SNPs within the SUFU deleted region. Each plot depicts the probe set of an Xba SNP in the SUFU deleted region that was previously shown in Figure 3.5 (each consisting of 20 PM oligos). The fuzzy-kmeans predicted oligo clusters in each SNP probe set are shown by different colors (red, blue and green). The mean of each oligo cluster is shown by a horizontal dashed line, with the same color as the corresponding oligo cluster. The red oligo cluster in each plot indicates the subset of PM oligos with the largest magnitude of copy number loss, while the blue oligo cluster denotes PM oligos with either a mediocre loss of signal (as in 3.9c) or a positive mean log2-ratio value (noisy oligos). The reported LR value for each SNP is the average log2-ratio intensity measurement of the 20 PM oligos in the SNP probe set, and LR′ presents the average SNP LR value after excluding the blue oligos from the probe set.

[Figure 3.10 image, panels (A)-(E) for SNP_A-1740765 (Mapping50K-Xba240-076): (A) the raw probe set (average LR = −0.19); (B) fuzzy-kmeans clustering into Cluster #1 and Cluster #2; (C) the two separated clusters (c1 = +0.39, 11 oligos; c2 = −0.91, 9 oligos); (D)-(E) likelihood estimation: KS-test p-values for the null hypothesis H0: μ = μ0 against the alternatives x = y0, x > ya and x < yd for each cluster.]

Figure 3.10: Schematic representation of the oligo-clustering and likelihood estimation modules of OPAS default pre-processing. Panel (A) shows the prob