UBC Theses and Dissertations

CIS-features mediating CAG/CTG repeat instability, the Satellog database, and candidate repeat prioritization… Missirlis, Perseus Ioannis 2004

CIS-FEATURES MEDIATING CAG/CTG REPEAT INSTABILITY, THE SATELLOG DATABASE, AND CANDIDATE REPEAT PRIORITIZATION IN SCHIZOPHRENIA

by

PERSEUS IOANNIS MISSIRLIS

B.Sc.H., Queen's University, 2002

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES, GENETICS GRADUATE PROGRAM

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA

August 2004

© Perseus Ioannis Missirlis, 2004

THE UNIVERSITY OF BRITISH COLUMBIA, FACULTY OF GRADUATE STUDIES

Library Authorization

In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Name of Author: Perseus Ioannis Missirlis
Date (dd/mm/yyyy): 18/08/2004
Title of Thesis: cis-Features Mediating CAG/CTG Repeat Instability, the Satellog Database, and Candidate Repeat Prioritization in Schizophrenia
Degree: Master of Science
Year: 2004
Department of Genetics Graduate Program, The University of British Columbia, Vancouver, BC, Canada

ABSTRACT

Polyglutamine repeat expansions in the coding regions of unrelated genes have been implicated in the neurodegenerative phenotype of nine separate diseases. However, little is known about the role of flanking cis-sequences in mediating this repeat instability. Brock et al. identified an association between flanking GC content and CAG/CTG repeat instability at many of these disease loci by using a relative measure of repeat instability called 'expandability'. We have extended the analysis of Brock and colleagues, using the expandability metric to test other features theorized to contribute to CAG/CTG repeat instability: repeat length and purity, proximity to CCCTC-binding factor (CTCF) binding sites, and the nucleosome formation potential of the surrounding DNA. Our results confirmed the earlier relationship between flanking GC content and CAG/CTG repeat instability and also suggest a novel one involving flanking CTCF binding sites. Conversely, no relationships between expandability and repeat length, purity, or nucleosome formation were detected.

Anticipation refers to the progressive worsening of a disease phenotype and earlier age of onset in successive generations. Anticipation has been reported in a number of diseases in which repeat expansion may have a role in etiology. We developed Satellog, a database that catalogs all pure 1-16 repeat unit repeats in the human genome along with supplementary data of use for the prioritization of repeats in disease association studies. For each pure repeat we calculate the percentile rank of its length relative to other repeats of the same class in the genome, its polymorphism within UniGene clusters, its location either within or adjacent to EnsEMBL-defined genes, and its expression profile in normal tissues according to the GeneNote database. By examining the global repeat polymorphism profile, we found that highly polymorphic coding repeats were mostly restricted to trinucleotide repeats, whereas a wider range of repeat unit lengths was tolerated in untranslated sequence. We also found that 3'-UTR sequence tolerates more repeat polymorphisms than 5'-UTR or exonic sequence.
Lastly, we use Satellog to prioritize repeats for disease-association studies in schizophrenia. Satellog is available as a freely downloadable MySQL and web-based database.

TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ACKNOWLEDGEMENTS
DEDICATION
PREFACE
CHAPTER 1 INTRODUCTION
1.1 cis-Features of unstable CAG/CTG repeats
1.1.1 Unstable repeats and disease
1.1.2 The argument for cis mediators of instability
1.1.2.1 Flanking %GC and CpG islands
1.1.2.2 Repeat length and purity
1.1.2.3 The role of the CTCF insulator protein
1.1.2.4 The role of nucleosomes
1.1.3 Objectives
1.1.4 Specific aims and rationale
1.2 The unstable repeat perspective of schizophrenia
1.2.1 Biology of schizophrenia
1.2.2 Genetics of schizophrenia
1.2.3 Anticipation in neuropsychiatric diseases
1.2.4 CAG/CTG repeats in schizophrenia
1.2.5 Published satellite repeat analyses and databases
1.2.6 Objectives
1.2.7 Specific aims and rationale
CHAPTER 2 MATERIALS AND METHODS
2.1 cis-Features of unstable CAG/CTG repeats
2.1.1 Collection of candidate CAG/CTG repeats for cis sequence analysis
2.1.2 Software Dependencies I
2.1.3 Implementing the gems_cis database
2.1.4 Overview of the flanker.pl script
2.1.4.1 Collection of flanking %GC, CpG islands, length and purity and other repetitive elements
2.1.4.2 Detection of flanking CTCF insulator protein binding sites
2.1.5 Detection of nucleosome formation potential with NucleoMeter
2.1.6 Statistics and plots with R
2.2 The satellog database
2.2.1 Software Dependencies II
2.2.2 Implementing the satellog database
2.2.3 Preliminary set-up
2.2.3.1 Detecting pure repeats with Tandem Repeats Finder (TRF)
2.2.3.2 Identifying unique repeat classes
2.2.3.3 Preparing expression data from the GeneNote database
2.2.3.4 Detecting repeat polymorphisms within UniGene clusters
2.2.4 Overview of the repeatalyzer.pl script
2.2.5 Generating a measure of repeat length significance
2.2.6 Detection and input of disease-associated repeats
2.3 Prioritizing candidate repeats for disease-association studies in schizophrenia
2.3.1 Input of neuropsychiatric linkage regions into Satellog
2.3.2 Prioritizing candidate repeats with Satellog
CHAPTER 3 RESULTS
3.1 cis-Features of unstable CAG/CTG repeats
3.1.1 Correlation of flanking CAG/CTG repeat features to Brock et al. expandability data
3.1.1.1 Correlation of CpG islands with expandability
3.1.1.2 Correlation of flanking %GC with expandability
3.1.1.3 Correlation of repeat length and purity with expandability
3.1.1.4 Correlation of CTCF binding sites with expandability
3.1.1.5 Correlation of nucleosome formation potential with expandability
3.2 Genomic repeat analysis with the Satellog database
3.2.1 Summary statistics
3.2.2 Characteristics of disease-associated repeats
3.2.3 Characteristics of repeats polymorphic within UniGene clusters
3.2.4 Disease-associated repeats detected in UniGene clusters
3.3 Candidate repeats for typing in schizophrenia and bipolar disorder
3.3.1 Top 20 polymorphic schizophrenia candidate repeats
3.3.2 Top 20 globally prioritized schizophrenia candidate repeats
3.3.3 Top 20 polymorphic bipolar disorder candidate repeats
3.3.4 Top 20 globally prioritized bipolar candidate repeats
3.3.5 Top 20 polymorphic schizophrenia candidate repeats from disease-associated classes
3.3.6 Top 20 globally prioritized schizophrenia candidate repeats from disease-associated classes
3.3.7 Top 20 polymorphic bipolar disorder candidate repeats from disease-associated classes
3.3.8 Top 20 globally prioritized bipolar candidate repeats from disease-associated classes
CHAPTER 4 DISCUSSION
4.1 cis-Features of unstable CAG/CTG repeats
4.1.1 Identifying cis-mediators of instability
4.1.1.1 Association between flanking %GC and instability
4.1.1.2 Association between flanking repeat length, purity and instability
4.1.1.3 Association between flanking CTCF binding sites and instability
4.1.1.4 Association between flanking nucleosome formation and instability
4.1.2 Prioritizing candidate CAG/CTG repeats
4.2 Genomic repeat analysis with the Satellog database
4.3 Repeat prioritization in schizophrenia with Satellog
4.3.1 Top 20 polymorphic schizophrenia candidate repeats
4.3.2 Top 20 globally prioritized schizophrenia candidate repeats
4.3.3 Top 20 polymorphic schizophrenia candidate repeats from disease-associated classes
4.3.4 Top 20 globally prioritized schizophrenia candidate repeats from disease-associated classes
4.4 Conclusions
4.5 Problems encountered and limitations
4.5.1 Brock et al.'s expandability metric
4.5.2 Limitations of the GeneNote dataset
4.5.3 Mapping repeats to UniGene clusters
4.5.4 Prioritizing with p-values
4.5.5 Multiple repeats detected for known diseases
4.6 Future studies
4.6.1 Identifying cis-mediators of instability
4.6.2 Improvements to Satellog
4.6.3 Disease association studies in schizophrenia
4.6.3.1 Specimens for analysis
4.7 Significance
BIBLIOGRAPHY
APPENDIX A
APPENDIX B
APPENDIX C
APPENDIX D
APPENDIX E
APPENDIX F
APPENDIX G
APPENDIX H
APPENDIX I
APPENDIX J
APPENDIX K
APPENDIX L
APPENDIX M
APPENDIX N
APPENDIX O
APPENDIX P
APPENDIX Q
APPENDIX R

LIST OF TABLES

Table 1: Genetic anticipation in schizophrenia; summary of linkage studies from 1996-1999 (adapted from Vincent et al., 2000)
Table 2: All unstable and candidate CAG/CTG repeat-containing genes located within a CpG island. 'Start' and 'End' columns refer to start and end co-ordinates of the CpG island relative to the 50 Mb slice of genomic sequence flanking the CAG/CTG repeat (i.e. the CAG/CTG repeat starts at 50,000).
Table 3: All CAG/CTG repeat-containing genes with 100 bp of flanking sequence having %GC at least equal to that of HD.
The '100_bp' column summarizes the G+C fraction of 100 bp flanking the CAG/CTG repeat
Table 4: All CTCF binding sites with an HMMer score greater than 1 that are within 1,000 bp of a CAG/CTG repeat
Table 5: All CTCF binding sites with an HMMer score between 0 and 1 within 1,000 bp of a CAG/CTG repeat. These may represent true CTCF sites because of the binding degeneracy of CTCF
Table 6: The ten most unstable coding repeats organized by descending standard deviation. Repeats highlighted in bold are known disease-associated repeats
Table 7: The ten most unstable untranslated repeats organized by descending standard deviation. No disease-associated repeats are present in this sample
Table 8: Top 20 polymorphic schizophrenia candidate repeats
Table 9: Top 20 globally prioritized schizophrenia candidate repeats
Table 10: Top 20 polymorphic bipolar disorder candidate repeats
Table 11: Top 20 globally prioritized bipolar candidate repeats
Table 12: Top 20 polymorphic schizophrenia candidate repeats from disease-associated classes
Table 13: Top 20 globally prioritized schizophrenia candidate repeats from disease-associated classes
Table 14: Top 20 polymorphic bipolar disorder candidate repeats from disease-associated classes
Table 15: Top 20 globally prioritized bipolar candidate repeats from disease-associated classes
Table 16: Summary of expandable CAG loci and candidate CAG/CTG repeat-containing genes with at least one feature associated with repeat expandability. %GC (100 bp) refers to the %GC of 100 bp flanking the CAG/CTG repeat. CTCF scores and absolute distance (i.e. either upstream or downstream) relative to the repeat are listed. Multiple CTCF hits are separated by commas
Table 17: Summary of disease-associated repeats from Cleary and Pearson, 2003 as detected in Satellog. Each disease is associated with one or more repeat co-ordinates
Table 18: Summary of schizophrenia and bipolar disorder linkage regions from Sklar (2002). This table summarizes the linkage studies in the paper and includes the cytogenetic band and genetic marker (with co-ordinates) of each study cited in the review. The ref column refers to the PubMed ID of each linkage study. This represents a portion of the linkage table in Satellog.

LIST OF FIGURES

Figure 1: Flowchart outlining how the candidate GeMS list was populated
Figure 2: Schema for gems_cis database
Figure 3: Flowchart for flanker.pl, a Perl script designed to analyze the cis-elements flanking disease-associated and candidate CAG/CTG repeats
Figure 4: Alignment of experimentally identified CTCF binding sites used to build the HMM. Sequences are from DM in human, the chicken B-globin FN, mouse H19 DMD4 and DMD7, chicken myc FV, and human MYCA. Bold nucleotides are essential contact guanosines; grey bars highlight inter-site conservation (adapted from Filippova et al., 2001)
Figure 5: Satellog database schema
Figure 6: a-c) Correlation between ranked median expandability and ranked %GC in 100 bp, 500 bp, and 1,000 bp flanking unstable CAG/CTG repeats (Brock et al., 1999). d) Spearman's rank correlation (rho) of median expandability and %GC of 50 bp, 100 bp, 500 bp, 1,000 bp, 1,500 bp, 2,000 bp, 2,500 bp, 3,000 bp, 3,500 bp, 4,000 bp, 4,500 bp, and 5,000 bp of sequence flanking the CAG/CTG repeat. The P-value (P) is the probability of observing Spearman's coefficient (rho) by chance
Figure 7: Histogram of %GC of 100 bp flanking the repeat in candidate CAG/CTG repeat sequences. Red bar indicates the sole gene (SCA7) with %GC content achieving statistical significance based on z-score within this distribution
Figure 8: No correlation was observed between ranked expandability and ranked repeat length. Repeat length is defined as the absolute length of the repeat, irrespective of purity, defined by Tandem Repeats Finder co-ordinates. The P-value (P) is the probability of observing Spearman's coefficient (rho) by chance
Figure 9: Correlation between ranked expandability and ranked CAG/CTG repeat purity. No correlation was observed between ranked expandability and ranked CAG/CTG repeat purity. CAG/CTG repeat purity is defined as the longest contiguous stretch of the repeat unit specified in Tandem Repeats Finder. The P-value (P) is the probability of observing Spearman's coefficient (rho) by chance
Figure 10: a) Plot of ranked median expandability against ranked score of computationally detected CTCF binding sites (from HMMer) in 5,000 bp flanking the CAG/CTG repeats known to be unstable (Brock et al., 1999). b) Plot of ranked median expandability against ranked distance of CTCF binding site in 5,000 bp flanking the CAG/CTG repeat of genes known to be unstable (Brock et al., 1999). c) Correlation between ranked distance from the CAG/CTG repeat and ranked score of hits. Each point represents a CTCF binding site. The P-value (P) is the probability of observing Spearman's coefficient (rho) by chance
Figure 11: Genomic distribution of repeat lengths of all repeat classes associated with disease. A repeat class represents all repeat variations of a given repeat unit (i.e. CAG, AGC, GCA, GTC, TGC, CTG)
Figure 12: Polymorphic repeats make up a tiny portion of all pure repeats detected in Satellog. Approximately half of all the 111,950 transcribed repeats were mapped to UniGene clusters, but only 5,546 or 0.07% of all repeats were detected as polymorphic within UniGene clusters
Figure 13: Median standard deviations (line through box) of polymorphic repeats detected in exonic, 5'-UTR, and 3'-UTR sequence.
Median exonic and 5'-UTR standard deviations did not differ significantly from each other, but did differ significantly from those of 3'-UTR repeats, implying that the 3'-UTR tolerates larger, more expanded repeats (one-way ANOVA, P < 0.05).
Figure 14: Repeat period distribution of polymorphic non-coding repeats at increasing standard deviation (sd) cut-offs
Figure 15: Repeat period distribution of polymorphic coding repeats at increasing standard deviation (sd) cut-offs

LIST OF ABBREVIATIONS

5-HT2A serotonin 2A
ABI Applied Biosystems
ANOVA Analysis of Variance
API Application Programming Interface
AR Androgen Receptor
BCCA BC Cancer Agency
BLAST Basic Local Alignment Search Tool
BLAT BLAST-like Alignment Tool
CCD cleidocranial dysplasia
CpG cytosine and guanine separated by a phosphate
CTCF CCCTC-binding Factor
DM Myotonic Dystrophy
DNA Deoxyribonucleic Acid
DRD3 dopamine D3 receptor gene
DRPLA Dentatorubral-Pallidoluysian Atrophy
EMBOSS The European Molecular Biology Open Software Suite
EPM1 progressive myoclonic epilepsy type 1
FISH fluorescent in situ hybridization
FRAXA Fragile X Syndrome (A subtype)
FRAXE Fragile X Syndrome (E subtype)
FRDA Friedreich's Ataxia
GABA gamma-aminobutyric acid
GeMS Genomic Mutational Signatures
GeneNote Gene Normal Tissue Expression
GEO Gene Expression Omnibus
HAT histone acetyl-transferases
HD Huntington's Disease
HFGS hand-foot-genital syndrome
HMM Hidden Markov Model
hmmfs Hidden Markov Model fragment search
HUGO Human Genome Organization
indel insertion and deletion
kb kilobases
Mb Megabases
MRD Microsatellite Repeats Database
MZ Monozygotic
NA Not Applicable
ND Not Determined
NS Not Significant
OPMD oculopharyngeal muscular dystrophy
Perl Practical extraction and report language
R The R project for Statistical Computing
RAPID Repeat Analysis, Pooled Isolation and Detection
RED Repeat Expansion Detection
SBMA X-linked Spinal and Bulbar Muscular Atrophy
SCA Spinocerebellar Ataxia
SQL Structured Query Language
STR Short Tandem Repeats
TBP TATA-box binding protein
TNR(s) Trinucleotide Repeat(s)
TRF Tandem Repeats Finder
UBiC UBC Bioinformatics Centre
UCSC University of California Santa Cruz
UTR Untranslated Region
VNTR Variable Number Tandem Repeats

ACKNOWLEDGEMENTS

Dr. Rob Holt, for supporting and directing my research for the past year. Thanks for tolerating, and being receptive to, my wide-ranging interests in genomics and its application to medicine.

Stefanie Butland and Francis Ouellette, for the chance to contribute to and collaborate with the GeMS project, and for the opportunity to apply my training to a tangible clinical problem in a true bioinformatics environment.

The entire BCCA Genome Sciences Centre team. My work at the Genome Sciences Centre is equally the result of its incredibly resourceful and helpful work environment. This study simply would not have been possible without your help.

Dr. Marco Marra, Dr. Steven Jones and Dr. Phil Hieter, for their support and input as mentors (Drs. Marra and Jones) and senior supervisor (Dr. Hieter).

The Canadian Institutes for Health Research and the Michael Smith Foundation for Health Research, for funding this project and the bioinformatics program in general, and for the tremendous opportunity in health-oriented bioinformatics research.

DEDICATION

This work is dedicated to my parents, Elly and Stellios Missirlis, for their support and encouragement throughout my undergraduate and graduate studies. This work is also dedicated to Madeleine de Trenqualye for her support, humour, appreciation, and generosity during the twilight months of my graduate career.

PREFACE

The CIHR training program in Bioinformatics, in the Genetics Graduate Program at UBC, is structured in a rotation-based format in order to expose students to different scientific problems and laboratory cultures. I had the opportunity to rotate through three different projects before extending my final project into my Master's thesis with Dr. Rob Holt.
It should be noted that the work presented here is the result of one 4-month rotation with Francis Ouellette at the University of British Columbia Bioinformatics Centre (UBiC) and the final 8 months of my graduate career with Dr. Rob Holt, not my entire two years in graduate school. These two portions of my work have been selected because they represent my primary interests towards the end of my Master's degree and provide the best framework for a thesis. That is not to say that my other rotations are insignificant, but rather that they did not fit into an organic whole that could be written as a Master's thesis.

This study is broadly divided into two sections. The first deals with my work investigating cis-mediators of CAG/CTG repeat instability, which was the result of my 4-month rotation with UBiC. The second builds upon the ideas and software tools from this initial rotation to create a comprehensive database for the prioritization of candidate repeats in association studies. Each chapter of this thesis is therefore further divided into sub-chapters dealing with these two major themes.

CHAPTER 1 INTRODUCTION

1.1 cis-Features of unstable CAG/CTG repeats

1.1.1 Unstable repeats and disease

Repeat instability as an etiological mechanism for neurodegenerative diseases is a relatively new observation. The expansion of trinucleotide repeats (TNRs) was identified in 1991 and 1992 as the disease-causing mutation for X-linked spinal and bulbar muscular atrophy (SBMA) (La Spada et al., 1991), fragile X syndrome (FRAXA) (Kremer et al., 1991; Verkerk et al., 1991), and myotonic dystrophy type 1 (DM) (Brook et al., 1992). These diseases exhibit 'genetic anticipation', a progressive worsening of the phenotype and earlier age of onset due to the transmission of an expanding TNR. The process of inter-generational TNR expansion or contraction has been termed 'dynamic mutation' because of the repeat's unstable length.
Since the research in this thesis concerns only repeat expansions, the term 'repeat instability' will henceforth refer solely to repeat expansions. Today, 35 human diseases, some of which also exhibit anticipation, have been associated with unstable repeats (Cleary and Pearson, 2003). Diseases in which unstable microsatellites are the causative mechanism can be divided into those caused by coding repeat expansions and those caused by non-coding repeat expansions.

The majority of disease-associated coding repeats identified to date are CAG/CTG repeats encoding an expanded polyglutamine tract in affected individuals. CAG/CTG expansion disorders include spinal and bulbar muscular atrophy (SBMA) (La Spada et al., 1991), dentatorubral-pallidoluysian atrophy (DRPLA) (Koide et al., 1994), Huntington disease (HD) (Huntington's Disease Collaborative Research Group, 1993) and a range of spinocerebellar ataxias (SCAs) including SCA1 (Banfi et al., 1994), SCA2 (Imbert et al., 1996), SCA3 (Ikeda et al., 1996), SCA6 (Zhuchenko et al., 1997), and SCA7 (David et al., 1997). In these diseases, an expanded polyglutamine tract results in a toxic gain of function causing either neuronal degeneration (Ross et al., 1998) or, in mouse models of spinocerebellar ataxia (SCA), neuronal dysfunction due to Purkinje cell abnormalities (Cummings and Zoghbi, 2000). The precise pathogenic mechanism is unknown but requires expression of the expanded polyglutamine tract. Neuronal inclusion bodies are observable on autopsy (Cummings and Zoghbi, 2000).

Untranslated repeats are diverse and include non-trinucleotide repeats. For example, progressive myoclonic epilepsy type 1 (EPM1) pathology results from an expansion of the dodecamer CCCCGCCCCGCG (Lalioti et al., 1997), and an ATTCT repeat expansion is the pathogenic agent in SCA10 (Matsuura et al., 2000).
In contrast to the coding repeat disorders, non-coding repeats such as the one underlying myotonic dystrophy can expand dramatically, into the range of thousands of repeats (Brook et al., 1992). Non-coding repeat expansions are not associated with neuronal inclusion bodies on autopsy (Cummings and Zoghbi, 2000).

1.1.2 The argument for cis mediators of instability

Murine models provide some of the most compelling evidence of the role of cis-elements in CAG/CTG repeat instability. Mice with genetically identical inbred backgrounds had a transgene (constructed of a large CTG repeat with little flanking human sequence) integrated randomly into their genomes. The mice demonstrated a wide range of instabilities at the repeat locus, reflecting the influence of mouse cis-sequences at the site of transgene integration (Zhang et al., 2002). In another experiment, mice had a human transgene integrated into their genomes consisting of CTG repeats plus 45 kb of flanking sequence from an affected DM patient. These mice had uniform instability regardless of the genomic insertion site (Gourdon et al., 1997), which suggested that the identical flanking cis-sequences of the human transgene dictated the level of instability.

1.1.2.1 Flanking %GC and CpG islands

An association exists between the relative expandability of repeats and flanking GC content (Brock et al., 1999). To quantify relative levels of expandability, the expandability metric uses the following formula: length change / (progenitor allele length - 35 repeats). This measure quantifies the "tendency of an above threshold repeat block to undergo further expansion". The authors reasoned that repeat length changes needed to be expressed relative to the progenitor length of the repeat. Progenitor allele length was "standardized" by subtracting 35 repeats, the hypothetical threshold of repeat instability in many coding CAG/CTG disorders (Cleary and Pearson, 2003).
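As a minimal sketch of the formula just quoted, the calculation can be written in a few lines of Python. The function name, its argument names, and the example transmission (a 45-repeat progenitor allele gaining 5 repeats) are hypothetical illustrations, not values or code from Brock et al.; rejecting alleles at or below the threshold is an assumption based on the metric being described for "above threshold" repeat blocks.

```python
# Hypothetical helper illustrating the expandability formula quoted in the text:
#   expandability = length change / (progenitor allele length - 35 repeats)
THRESHOLD_REPEATS = 35  # instability threshold stated in the text


def expandability(length_change, progenitor_length, threshold=THRESHOLD_REPEATS):
    """Return the length change relative to the above-threshold progenitor length.

    All quantities are in repeat units. Alleles at or below the threshold are
    rejected (an assumption of this sketch), which also avoids division by zero.
    """
    excess = progenitor_length - threshold
    if excess <= 0:
        raise ValueError("progenitor allele length must exceed the threshold")
    return length_change / excess


# Illustrative transmission (made-up numbers): 45 repeats expanding to 50.
example = expandability(length_change=5, progenitor_length=45)
print(example)  # 5 / (45 - 35) = 0.5
```

Dividing by the above-threshold excess rather than the raw progenitor length is what lets loci with different progenitor sizes be compared on one scale.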
This expandability metric represents a summary of pedigree analyses and literature detailing expansion events at various CAG/CTG loci. Furthermore, it represents a standardized way of comparing loci irrespective of the effect of progenitor allele length. It established that expandability varies among CAG/CTG repeats at different loci, which the authors attributed to the particular milieu of cis-features flanking the repeats. GC-rich sequence, including CpG islands, was observed at the more expandable loci, while low GC content and no CpG islands were observed at the less expandable loci. CpG islands were significantly associated with more expandable loci (P < 0.01, Fisher's exact test). Strong positive rank correlations were observed between expandability and GC content (in the 100 bp flanking the repeat) for male transmissions determined indirectly by pedigree analysis (rho = 0.817, P < 0.01) and directly by single sperm analysis (rho = 0.9, P < 0.05). The correlation in female transmissions was similar but weaker (rho = 0.717, P < 0.025). These relationships persisted at 500 bp flanking the repeat, but the correlation coefficients were lower. No conserved flanking motifs were detected, only variations in %GC.

1.1.2.2 Repeat length and purity

TNR length and purity are important determinants of stability. Repeat length dictates the severity of symptoms and the approximate age of onset of TNR diseases. The majority of coding CAG/CTG TNRs longer than ~35 repeats become genetically unstable (Cleary and Pearson, 2003). Most TNRs show a polymorphic distribution in the human genome; for example, the (CAG)n repeats of SCA7 range from 10-13 repeats, with the majority consisting of 10 repeats (Gouw et al., 1998). Some large normal pre-mutation lengths approach the 20-35 range and can undergo de novo expansion to full mutation lengths.
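To make the two quantities in this section concrete, the sketch below counts total repeat length and the longest uninterrupted run of a repeat unit, the latter being the purity definition this thesis gives in its list of figures. The function name and the example tract are hypothetical, and the code assumes the tract is already trimmed to start in frame with the repeat unit; it is an illustration, not the thesis's flanker.pl or Tandem Repeats Finder logic.

```python
def longest_pure_run(tract, unit="CAG"):
    """Return the longest run of consecutive, uninterrupted `unit` copies.

    Assumption of this sketch: `tract` starts in frame with `unit`, so the
    sequence is read in non-overlapping windows of len(unit) bases.
    """
    best = current = 0
    step = len(unit)
    for start in range(0, len(tract) - step + 1, step):
        if tract[start:start + step] == unit:
            current += 1
            best = max(best, current)
        else:
            current = 0  # an interruption resets the pure run
    return best


# Made-up tract: 12 CAG units, one CAT interruption, then 15 CAG units.
tract = "CAG" * 12 + "CAT" + "CAG" * 15
total_units = len(tract) // 3          # repeat length irrespective of purity
pure_units = longest_pure_run(tract)   # purity as longest contiguous stretch
print(total_units, pure_units)         # 28 15
```

A single interruption leaves the total length almost unchanged while nearly halving the pure stretch, which is why the two measures are treated as separate features.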
Large normal repeats may be indicative of a propensity to undergo de novo expansion. Large pure repeat tracts are more likely to expand following transmission than impure repeat tracts. For example, normal CAG/CTG repeats in SCA1, which can be as large as 39 repeats, are stable when interrupted with a single CAT. In comparison, SCA1 repeats consisting of 40 pure repeats are genetically unstable (Chong et al., 1995). FRAXA (Eichler et al., 1994), SCA2 (Choudhry et al., 2001) and Friedreich's Ataxia (FRDA) (Cossee et al., 1997; Montermini et al., 1997) also have stabilizing repeat interruptions. Part of the stabilizing effect of interrupting mutations may come from disruptions in repeat secondary structure. In SCA1 and FRAXA, repeat purity and formation of slipped-strand structures correlate with repeat instability and disease (Pearson et al., 1998). Many TNRs form stable secondary structures that may interfere with cellular replication machinery. Stable secondary structures have been theorized to facilitate polymerase slippage and contribute to repeat expansion (Cleary and Pearson, 2003). The ability of a TNR tract to form secondary structure is both length and purity dependent, and most structures cannot be formed by sub-threshold repeats. However, other interruptions within disease-associated CAG/CTG repeats are selected for following expansion in certain genes such as SCA3. In SCA3, a 5' AAG interruption codes for an essential lysine residue, which provides evidence that the protein context of some interruptions may be favoured in expanded alleles (Kawaguchi et al., 1994).

1.1.2.3 The role of the CTCF insulator protein

Myotonic dystrophy (DM) is a TNR expansion disease characterized by CTG expansions in the 3'-UTR of the DMPK gene. The polyadenylation site of DMPK is less than 300 bp from the transcriptional start site of SIX5, a gene that, when dysregulated, is theorized to cause the cataract phenotype in DM patients.
Interestingly, DMPK and SIX5 have different expression profiles despite their proximity. This phenomenon can be explained by cis-acting insulator elements that act as a barrier to the local effects of other flanking cis-sequences such as enhancers. The CTCF (for 'CCCTC-binding factor') protein is an eleven zinc-finger DNA-binding protein that acts as an insulator when bound to CTCF sites on DNA (Ohlsson et al., 2001). Gel mobility shift assays were used with portions of the 3'-UTR of DMPK to determine whether CTCF bound to those regions (Filippova et al., 2001). It was observed that CTCF bound to the DNA fragments immediately flanking the CTG repeat and acted as an insulator for DMPK. Interestingly, the approximately 176 bp that separate the two CTCF binding sites also bind a nucleosome, as determined by standard micrococcal nuclease assays. The DMPK insulator sites are also methylation-sensitive; when methylated, insulator function is impaired.

Furthermore, the importance of CTCF binding sites flanking CTG20 for insulator activity was tested by constructing various vectors containing one, both or neither of the CTCF binding sites. Maximum insulator activity was observed with both CTCF binding sites plus the CTG20 repeat. The upstream CTCF site was more important than the downstream site. Most importantly, the composition of the sequence between the CTCF sites was found to be important for maximum insulator functionality. When the CTG20 between the two CTCF binding sites was replaced with λ phage DNA of equal length, a small but significant decrease in insulator function was observed. Flanking CTCF binding sites were also observed at the DM, DRPLA, and SCA7 CAG/CTG repeats by gel retardation assays. For HD and SCA2, only upstream CTCF binding sites were detected.
This suggests that CAG/CTG repeats flanked by CTCF sites may have important insulator activity in the human genome and rationalizes their selection over evolutionary history. Nonetheless, the presence of CTCF binding sites flanking the CAG/CTG repeat within these genes suggests that some link may exist between CTCF and instability.

1.1.2.4 The role of nucleosomes

Transcriptionally inactive DNA is stored by an octamer of histone protein pairs. These pairs consist of H2A, H2B, H3, and H4 and are termed the 'nucleosome' (Kornberg and Lorch, 1999). The nucleosome is the basic unit of DNA packaging, and modification of its acetylation state is an important process of gene regulation. Nucleosomal acetylation (and the subsequent displacement of nucleosomes from DNA) by histone acetyl-transferases (HATs) has been shown to be an important regulatory mechanism in yeast (Sterner et al., 1999) and in humans (Imhof et al., 1997). Interestingly, the repeat threshold for instability (30-40 repeats or 90-120 bp) is almost the same as the amount of DNA in a nucleosome (~146 bp). Some repetitive sequences readily form nucleosomes (e.g. (CTG)n and (GGA)n) while others do not (e.g. (CGG)n and poly(A)n) (Cleary and Pearson, 2003). Extensive research contends that the most efficient sequence for nucleosome positioning is (CTG)n, whereas CAT interruptions disrupt formation. Although these observations are suggestive of some link between TNRs and nucleosome formation, their role in mediating repeat instability is currently unknown.

1.1.3 Objectives

Trinucleotide repeat instability is associated with flanking %GC content and CpG islands and may be related to repeat length and purity, CTCF binding sites and nucleosome formation potential. In this analysis, the expandability metric developed from male transmissions in Brock et al.
is correlated with flanking %GC, CpG islands, repeat length and purity, CTCF binding sites, and nucleosome formation potential to confirm published relationships and identify new ones between unstable coding CAG/CTG repeats and flanking sequence.

1.1.4 Specific aims and rationale

Aim 1: Expand the initial analysis by Brock et al. by correlating flanking %GC, CpG islands, repeat length and purity, other repetitive elements, CTCF binding sites, and nucleosome formation potential with expandability data.

Brock et al. found that the expandability of unstable CAG/CTG repeats correlated with flanking %GC and CpG islands, using an incomplete version of the human genome. We extend this analysis to the complete human genome (UCSC v. 34) and analyze other cis features for their ability to identify unstable CAG/CTG repeats in the human genome.

1.2 The unstable repeat perspective of schizophrenia

1.2.1 Biology of schizophrenia

Schizophrenia is a neuropsychiatric disorder characterized by positive symptoms such as delusions, hallucinations, and thought disorder, and by negative symptoms including catatonic behaviour, affective flattening, alogia, and avolition (American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders, 4th ed. (DSM-IV)). Prevalence of the disease ranges from 0.75% to 1.5% in the general population, independent of the culture studied. Age of onset is generally in early adulthood, from the late teens to the late twenties (Bray and Owen, 2001).

The lack of overt pathology in the schizophrenic brain, along with the anti-psychotic properties of certain drugs, gave rise to the idea that disturbances in neurotransmitter systems were responsible for the disease. In the 1950s and 1960s the "dopamine hypothesis" was born from the clinical efficacy of anti-psychotic drugs.
Abundant circumstantial evidence suggests that excessive dopamine plays a role in schizophrenia, such as:

1) anti-psychotic drugs block central nervous system postsynaptic D2 receptors
2) drugs that increase dopaminergic activity either aggravate schizophrenic symptoms or give rise to symptoms in patients de novo
3) schizophrenic brains have higher dopamine receptor density post-mortem
4) schizophrenic brains have higher dopamine receptor density detected by positron emission tomography (PET)
5) increased concentrations of homovanillic acid (a metabolite of dopamine) in the cerebrospinal fluid, plasma and urine of patients successfully treated for schizophrenia
(summarized from Potter and Hollister, 2001)

These conclusions are complicated by the fact that newer anti-psychotic drugs have been shown to interact with other receptor systems such as serotonin (Wooley and Shaw, 1954), glutamate (Kim et al., 1980), and GABA (Roberts, 1972). Therefore, direct evidence of a dopaminergic basis for schizophrenia has yet to be established.

Recently, a neurodevelopmental theory of schizophrenia has been proposed due to supporting epidemiological, neuroanatomical, and histological evidence. Epidemiologic evidence has revealed a pattern of obstetric complications, childhood neuropsychological deficits, and non-specific neuropathological anomalies in schizophrenics, but not at a statistically reproducible level (Weinberger, 1995). Anecdotal observations from neuroanatomical, volumetric and histological studies offer other perspectives on the disease. Neuroimaging has revealed increased lateral ventricle size at onset and in adolescents with high risk of developing the disorder (Degreef et al., 1992). Furthermore, volumetric studies have suggested that schizophrenics have a small reduction in brain size relative to controls (McCarley et al., 1999).
Lastly, histological analysis has revealed reduced axonal and dendritic markers within the frontal lobes and temporo-limbic structures in schizophrenic brains (Harrison, 1999). These observations form the basis for the newer "disconnectivity theory" of schizophrenia. According to this theory, schizophrenia is a result of subtle defects in the overall synaptic network affecting the organization of neurons in the brain (Bullmore et al., 1997), but more studies are needed to determine the likelihood of this hypothesis. In conclusion, the biological research to date on schizophrenia has yet to yield any clear and undisputed pathology for the disease.

1.2.2 Genetics of schizophrenia

There is a strong, genetic, non-Mendelian component to schizophrenia that is apparent from family, twin and adoption studies (McGuffin et al., 1995). Risk of developing schizophrenia increases exponentially based on familial relatedness to affected individuals. Identical or monozygotic (MZ) twins have a 50% lifetime risk of developing the disease. The fact that the remaining 50% do not develop the disease highlights the importance of environmental or epigenetic factors in schizophrenia predisposition. Adoption studies have firmly established that individuals adopted into families with schizophrenic members retain their biological, and not their adopted family's, risk of developing schizophrenia. Genetic epidemiology makes apparent that what is inherited is a predisposition to develop the disease, rather than the certainty of doing so (Bray and Owen, 2001). Biometric models of schizophrenia have estimated that 80% of the risk of developing schizophrenia is genetic, while the remaining 20% is conferred by non-genetic factors (Cardno et al., 1999), perhaps including obstetric complications, maternal viral infections, and social stressors (Tsuang, 2000).

Identification of susceptibility genes for schizophrenia is a long sought-after goal of linkage and association studies.
Identification of alleles that segregate with schizophrenia could either support the current neurochemical models of the disease or introduce new ones. It is unlikely that a single locus is responsible for genetic predisposition given the recurrence risk of the disease in relatives of affected individuals. It has been estimated that, assuming extreme epistasis is not occurring, between 2 and 3 genes contribute to the schizophrenia phenotype (Risch, 2000).

The control of schizophrenic symptoms by anti-psychotic drugs led to the "dopamine hypothesis of schizophrenia". Most association studies focus on genes with some relationship to dopaminergic or other neurotransmitter systems, but to date, significant polymorphisms have been detected only in the dopamine D3 receptor gene (DRD3) (Crocq et al., 1992) and the serotonin 2A (5-HT2A) receptor gene (Inayama et al., 1996). Unfortunately, the Ser9Gly mutation in DRD3 has a low odds ratio and the T102C mutation in 5-HT2A does not change the amino acid sequence of the serotonin receptor. Association studies in schizophrenia are unlikely to find a single locus of significant effect; it is more probable that numerous loci of small effect each contribute a fraction of relative risk, and that these additively increase one's risk of developing schizophrenia.

1.2.3 Anticipation in neuropsychiatric diseases

The phenomenon of anticipation in schizophrenia was first documented in the 19th century by Morel, who noted it amongst many other forms of disease that worsened in succeeding generations (Morel, 1857). The "law of anticipation" was formalized by Sir Frederick Mott by examining 420 parent-offspring pairs in the asylums of London (Mott, 1910). The leap to rationalizing anticipation in genetic terms was made unscientifically, most probably because it fit neatly into the social Darwinist paradigm of the era.
This view prevailed until the mid-20th century, when the scientific rationale of anticipation in myotonic dystrophy was forcefully criticized as an artifact of ascertainment bias by Penrose (Penrose, 1948). Ascertainment bias has been reviewed elsewhere, but briefly it includes:

1) Preferential ascertainment of parents with later onset (due to decreased reproductive fitness of earlier onset individuals)
2) Preferential ascertainment of offspring with early onset, as later onset individuals may not be affected at time of study
3) Decreased memory of earlier episodes in older generation
4) Increased awareness of illness by family members and physician
(excerpted from Vincent et al., 2000)

To overcome these biases, suggested "ideal" study designs include population-based random samples of families, irrespective of family history, followed prospectively until they are through the age of risk. Unfortunately, no published epidemiological studies exist with these criteria. Eliminating ascertainment and other biases in the evaluation of anticipation is likely to be impossible, leading some investigators to speculate that anticipation is a matter of opinion in certain diseases (Ashizawa et al., 1992). Controversy surrounding conclusions of genetic anticipation predates the conclusive findings in FRAXA and HD, suggesting that these issues are in perpetual consideration and difficult to control (Ashizawa et al., 1992). Despite these issues, numerous studies of genetic anticipation in schizophrenia employing a variety of designs have found surprisingly consistent positive results (Table 1).

Table 1: Genetic anticipation in schizophrenia; summary of linkage studies from 1996-1999 (adapted from Vincent et al., 2000).

Reference                    Ascertainment            No. families            Anticipation   Genomic imprinting
(Bassett and Honer, 1994)    Linkage study            8 Families              +              NS
(Asherson et al., 1994)      Linkage study            28-40 Relative Pairs    11.1 years     NS
(Thibaut et al., 1995)       Linkage study            26 Families             +              NS
(Yaw et al., 1996)           Linkage study            15 Families             +              NS
(Gorwood et al., 1996)       Population based         24 Families             +              NS
(Ohara et al., 1997)         Clinic-based             24 Families             +              NS
(Bassett and Husted, 1997)   Retrospective registry   137 Pairs               +              ND
(Johnson et al., 1997)       Linkage study            33 Families             +              ND
(Imamura et al., 1998)       Clinic-based             37 Pairs                +              NS
(Valero et al., 1998)        Linkage study            25 Families             +              NS
(Heiden et al., 1999)        Clinic-based             15 Families             +              ND

Pairs = Parent-child pairs; NS = Not significant; ND = Not determined.

1.2.4 CAG/CTG repeats in schizophrenia

Interest in the idea of anticipation in psychoses was revived with the identification in 1991 of trinucleotide repeat expansion as the genetic mechanism responsible for anticipation in FRAXA (Kremer et al., 1991) and SBMA (La Spada et al., 1991). Trinucleotide repeat screens in psychoses have mostly focused on CAG/CTG type mutations based on their prevalence in neurodegenerative disorders and on repeat expansion detection (RED) evidence. RED is a technique that can detect large expanded repeats in DNA without prior knowledge of their genomic co-ordinates (Schalling et al., 1993). RED uses long genomic repeats as a template to which repeat oligonucleotides anneal. Two or more oligonucleotides are ligated when they anneal adjacently on an expanded repeat. After a number of denaturation, annealing and ligation cycles, large single-stranded multimers are produced, which are then detected by polyacrylamide gel electrophoresis, blotting and hybridization with complementary 32P-ATP labelled molecules as a probe. RED studies in schizophrenia have shown that the median length of CAG/CTG repeats in affected probands was longer relative to unaffected controls (Vincent et al., 2000). With evidence of genomic CAG/CTG expansions, subsequent molecular studies tried to identify the location of these loci.
These studies identified three expanded loci: one at ERDA1, which was cloned using a genomic library from a patient identified by RED as having a large CAG/CTG repeat (Nakamoto et al., 1997); another within SEF2-1B (this repeat locus is also referred to as CTG18.1), which was detected by fluorescent in situ hybridization (FISH) using a large CAG/CTG repeat probe (Haaf et al., 1996); and SCA8, which was detected by RED and then cloned from total genomic DNA by Repeat Analysis, Pooled Isolation and Detection (RAPID) (Koob et al., 1998). Although expanded in schizophrenia, these loci individually failed either to segregate with disease in family studies or to associate with disease in affected individuals versus unaffected controls, at least at levels of statistical significance (Vincent et al., 2000).

A recent study specifically examined whether CAG/CTG repeat expansions as detected by RED in 100 unrelated schizophrenia patients were due to expansions in these loci. The study showed that 28% of studied probands had an expanded CAG/CTG repeat relative to controls. This study took the unprecedented step of typing the CAG/CTG repeats known to be polymorphic in the general population. The authors concluded that most repeat expansions could be accounted for by expansions in non-pathogenic polymorphic CAG/CTG repeats at the ERDA1, CTG18.1, and SCA8 loci, but could not exclude a small CAG expansion independent of these regions having a limited phenotypic effect (Tsutsumi et al., 2004). Importantly, no RED or association studies have been conducted on non-CAG/CTG repeats in the genomes of schizophrenics.

1.2.5 Published satellite repeat analyses and databases

Historically, if one suspected that a polymorphic microsatellite repeat was associated with a disease, few bioinformatics resources were available to identify relevant repeats in the human genome.
Repeat prioritization is the process by which candidate repeats in the genome are hierarchically ranked based on investigator-defined variables. One approach now available is to browse the Tandem Repeats Finder (TRF) (Benson, 1999) track on the University of California Santa Cruz (UCSC) genome browser (http://genome.ucsc.edu/) (Kent et al., 2002) within a genomic region of interest. TRF at UCSC was executed with liberal insertion and deletion (indel) and substitution penalties that allow the detection of larger, frequently impure repeats. Since pure repeat tracts are more likely to expand than impure repeat tracts following transmission (Chung et al., 1993; Kunst and Warren, 1994; Chong et al., 1995), a large fraction of repeats presented at UCSC are probably not relevant for disease association studies. Furthermore, certain known disease-associated repeats, such as the GAA repeat in Friedreich's Ataxia (Campuzano et al., 1996), are not detected at all (chr9:67,109,320-67,109,339) at UCSC because they are too short to be detected with the TRF parameters used there.

Other groups have created databases of all 2-16 repeat unit satellite repeats within human gene regions (Subramanian et al., 2002; Collins et al., 2003) and of all 1-6 repeat unit microsatellites across prokaryotic and eukaryotic taxa (Subramanian et al., 2002). Collins detected microsatellites with a novel algorithm and deposited these data in a relational database called the GRID Short Tandem Repeats (STR) database (Collins et al., 2003). This database included in silico polymorphism detection of coding trinucleotide repeats by using the BLAST algorithm to detect each repeat's length polymorphisms within GenBank, but only for a subset of exonic repeats (Collins et al., 2003). These resources enrich the microsatellite repeat bioinformatics landscape but do not integrate these data with other published resources in a way relevant for repeat prioritization in disease-association studies.
Also, these resources do not provide flexible interfaces for combining data in user-defined ways to allow dynamic generation of candidate repeat lists. For example, both the Microsatellite Repeats Database (MRD) (Subramanian et al., 2002) and the STR database (Collins et al., 2003) provide static co-ordinates of candidate repeats for disease-association studies defined by the authors' criteria, but lack the functionality to easily re-prioritize repeats based on user preferences.

1.2.6 Objectives

We sought to create a resource to allow prioritization of candidate repeats that utilized bioinformatics features relevant to unstable repeats. To identify which repeats are the most likely substrates for expansion, we prioritized candidate repeats using a comprehensive database named Satellog. Satellog integrates the co-ordinates of all micro- and mini-satellite repeats in the human genome with gene proximity, gene expression, and repeat polymorphism data within UniGene clusters. Given the lack of overt neurological pathology in schizophrenics and the evidence of anticipation, we propose that a non-CAG/CTG repeat expansion may confer genetic risk that increases the probability of developing schizophrenia. Schizophrenia linkage data was incorporated into Satellog in order to prioritize candidate repeats in schizophrenia. Prioritized repeats will be analyzed with GeneScan software on an ABI 3700 sequencer by other individuals/groups at the BCCA Genome Sciences Centre.

1.2.7 Specific aims and rationale

Aim 1: Develop a comprehensive bioinformatics resource for the prioritization of candidate repeats for disease-association studies in the human genome.

No resource exists that allows the generation of candidate gene lists based on bioinformatics integrating satellite repeat co-ordinates, gene proximity, gene expression and repeat polymorphism within UniGene clusters. We have created a database to accomplish this that we have named Satellog.
This resource will enrich the bioinformatics possible on micro- and minisatellite repeats and provide a powerful tool for researchers investigating repeats believed to be associated with a disease of interest.

Aim 2: Generate candidate repeat lists using Satellog based on schizophrenia linkage regions.

We intend to generate a prioritized candidate repeat list of all repeats in schizophrenia and bipolar linkage regions that have interesting features within Satellog. A portion of these repeats will be typed at the BCCA Genome Sciences Centre in disease-association studies with DNA from individuals with schizophrenia, bipolar disorder and unaffected control individuals (n=35 each).

CHAPTER 2 MATERIALS AND METHODS

2.1 cis-Features of unstable CAG/CTG repeats

2.1.1 Collection of candidate CAG/CTG repeats for cis sequence analysis

Previous work at UBiC generated a list of candidate CAG/CTG repeats (also called Genomic Mutational Signatures or GeMS) by scanning the human genome for genes with relatively large, coding CAG/CTG repeats (Butland S., personal communication). A Perl script (Appendix A) was written that relied on the Tandem Repeats Finder (TRF) (Benson, 1999) results generated from the human genome assembly version 33, available from the UCSC genome browser (http://genome.ucsc.edu) (Kent et al., 2002). Using the co-ordinates from TRF for CAG/CTG repeats, each repeat was extracted and tested with the EnsEMBL API to see if it existed within an EnsEMBL (www.ensembl.org) annotated gene (Hubbard et al., 2002). If so, and the gene was associated with at least one transcript that coded for five or more glutamines, then the repeat was collected as a candidate CAG/CTG repeat or GeMS candidate repeat (Figure 1). A total of 66 repeats fulfilled the above criteria for candidate CAG/CTG repeats (Appendix B).
The next problem involved prioritizing genes based on the composition of their flanking cis sequences.

[Figure 1: Flowchart outlining how the candidate GeMS list was populated. Steps shown: 1) use Tandem Repeats Finder (TRF) co-ordinates from the UCSC data dump table; 2) a repeat qualifies as a candidate if (a) it is located in an EnsEMBL gene and (b) one or more transcripts code for 5 or more glutamines (Q); 3) a total of 66 genes, which are then prioritized and genotyped.]

2.1.2 Software Dependencies I

All bioinformatics analysis was conducted with a single Perl script entitled flanker.pl (Appendix C). This script is essentially a wrapper for a number of programs that need to be installed locally: BioPerl version 0.7.2 (Stajich et al., 2002), the EnsEMBL API (Hubbard et al., 2002), EMBOSS v.2.7.1 (Rice et al., 2000), HMMer v.1.8.4 (Eddy, 1998), and MySQL v.3.23.52. An older version of HMMer was selected because v.1.8.4 is optimized for nucleic acid analyses while newer versions focus on protein sequence. All of these programs are installed on stent.cmmt.ubc.ca, and flanker.pl was run on this machine at the UBiC.

2.1.3 Implementing the gems_cis database

All data was deposited into a MySQL database entitled 'gems_cis' (Figure 2) (Appendix D). Henceforth, references to slices describe the length (in bp) of the TRF-defined CAG/CTG repeat plus the up- and down-stream flanking sequence. The database is composed of 10 tables entitled build_info, gems, gc_plots, ctcf, cpg, gc, repeats, exons, flanking, and gems_feat. All tables, with the exception of build_info, can be related to each other based on the name variable, which is simply the Human Genome Organization (HUGO) name of the gene the CAG/CTG repeat is within.
[Figure 2: Schema for gems_cis database.]

The build_info table stores one-time operational information about the database such as: the version of the EnsEMBL human genome assembly used, the name of this database (gems_cis), and the date and time that flanker.pl was run in order to populate the database. The gems table stores the EnsEMBL gene IDs. The gc_plots table stores the position within the slice of the observed over expected %GC within a sliding window of 200 bp and the %GC value calculated by EMBOSS. The ctcf table contains the HMMer score, start, end, strand, and absolute distance from the CAG/CTG repeat as calculated for each hit within the slice. The cpg table collects the start, end and score data for each CpG island, as defined by EnsEMBL, within the slice. The gc table collects the %GC of the sequence flanking the CAG/CTG repeat in windows from 50 bp upwards. The repeats table collects the name, repeat class, start and end co-ordinates relative to the slice, score, strand, and distance from the CAG/CTG repeat of each repetitive element. The exons table collects the start and end co-ordinates of each exon. The flanking table stores the chromosome and the chromosomal start and end co-ordinates of the slice, as well as the sequence of the slice in the forward and reverse orientations, both without and with repeat masking.
The gems_feat table collects the chromosome, strand, chromosomal start and end co-ordinates, repeat unit, sequence, length, purity and expandability (Brock et al., 1999) of each TRF-defined CAG/CTG repeat.

2.1.4 Overview of the flanker.pl script

The co-ordinates of candidate CAG/CTG repeats are input in flat-file format to flanker.pl (Appendix B) (Figure 3), which loops through them sequentially until the end of the file is reached. The first part of the program collected the sequence of the repeat plus a user-defined amount of flanking sequence, in our analysis 50 kb. This value was selected because mouse experiments in which a transgene consisting of ~45 kb of human sequence flanking the DM repeat was integrated into the mouse genome yielded uniform repeat instability (Gourdon et al., 1997). This experiment established that at least 45 kb of cis sequence is required to observe the effects of human cis-sequences. To be conservative, the flanking sequence considered in this analysis was extended to 50 kb on each side of each CAG repeat.

2.1.4.1 Collection of flanking %GC, CpG islands, length and purity and other repetitive elements

The next phase of the loop involves collecting data or calculating values either with Perl or with external programs. For each feature extracted or calculated, the script automatically inserted data into the MySQL database 'gems_cis'. Using the EnsEMBL API, the script extracted EnsEMBL-defined CpG islands and all the repeats in the sequence surrounding the CAG/CTG repeat. CAG length and purity, and flanking %GC, were calculated by the script. All data was input into gems_cis.

2.1.4.2 Detection of flanking CTCF insulator protein binding sites

The CTCF binding sites were detected by using HMMer (Eddy, 1998). HMMer can build Hidden Markov Models (HMM) in a number of different ways. CTCF binding sites are expected to occur more than once and not overlap.
HMMer has an option, hmmfs (HMM fragment search), that reports multiple non-overlapping Smith/Waterman matches. hmmfs corrects for the length of the model when calculating statistics by employing a cyclic model that permits multiple matches. hmmfs was used to create an HMM (ctcf-md.hmm) that defined the CTCF binding sites. The HMM for the CTCF binding sites was generated from the published multiple sequence alignment (Figure 4). This alignment highlighted essential contact guanosine nucleotides that mutational analyses revealed as the most important determinants of CTCF binding (Filippova et al., 2001).

[Figure 4: Multiple sequence alignment of CTCF binding sites used to generate the HMM.]

2.1.5 Detection of nucleosome formation potential with NucleoMeter

Nucleosome formation potential was calculated with the NucleoMeter program (http://wwwmgs.bionet.nsc.ru/mgs/programs/recon/). NucleoMeter applies discriminant analysis techniques to detect the nucleosome formation potential of eukaryotic DNA (Levitsky et al., 2001). NucleoMeter partitions the input sequence into regions with more homogeneous dinucleotide frequencies in order to maximize detection of nucleosome binding sites. The algorithm then searches for the partition that maximizes the Mahalanobis distance R^2 used to discriminate between potential nucleosomal sites. The nucleosome formation potential (φ(X)) is constructed so that input sequences similar to the 141 nucleosome binding training sequences return scores close to +1, while non-nucleosomal binding sequences have scores of -1.

2.1.6 Statistics and plots with R

R is an open-source computer language and environment for statistical analysis (R Development Core Team, 2004). The final component of the program generates an R script for visualizing all the sequence features. A file entitled genename.R is generated within each gene's directory by flanker.pl. This file automatically generates a plot with the R software package to visualize the sequence features flanking each CAG/CTG repeat. Data was collected from the gems_cis database for statistical analysis (see Appendix F for sample queries). Statistical analysis was conducted with the R package. Spearman's rank correlations and Fisher's exact test were used to determine the significance of the associations (see Appendix G for R scripts).
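As a rough illustration of the Spearman rank correlation named above, the following Python sketch ranks two variables (averaging the ranks of tied values) and then computes Pearson's correlation on the ranks. It is not the R code from Appendix G; the function names and the toy %GC and expandability values are made up purely for illustration and are not data from this analysis.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x ** 0.5 * var_y ** 0.5)


# Hypothetical toy values (NOT results from this study):
toy_flanking_gc = [42.0, 55.5, 61.0, 48.2, 70.3]   # flanking %GC per repeat
toy_expandability = [0.1, 0.6, 0.4, 0.2, 0.9]      # expandability per repeat
rho = spearman_rho(toy_flanking_gc, toy_expandability)
print(round(rho, 3))
```

Because the statistic depends only on ranks, it suits measures such as expandability whose relationship to a flanking feature may be monotonic without being linear.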
2.2 The satellog database

Working with CAG/CTG repeats revealed that there is no easy way for researchers to extract genomic repeat sequence data and information from the current genome browsers. Expertise gained from collecting genomic CAG/CTG repeat co-ordinates was combined with other bioinformatics features to generate the Satellog database to address this deficiency. The build procedure for the Satellog database is outlined below and in the appendices.

2.2.1 Software Dependencies II

A Perl script, "repeatalyzer.pl", functions as a wrapper for a number of different programs to achieve the endpoints of Satellog. repeatalyzer.pl is run with Perl v5.6.1 and used BioPerl v1.2 (Stajich et al., 2002), the EnsEMBL Perl API (May 24th, 1999 release), MySQL v10.8 Distribution 3.23.21-beta (for pc-linux-gnu), BLAT v. 28 (Kent, 2002) and v. 34 of the human genome sequence (Lander et al., 2001). repeatalyzer.pl was run against the homo_sapiens_core_19_34b EnsEMBL database and v. 34 of the human genome sequence. The script was processed in parallel on our in-house 40 processor Opteron cluster.

2.2.2 Implementing the satellog database

Before proceeding, a MySQL database called Satellog must be created with all the required tables (Appendix H). The database is composed of 17 tables: repeats, linkage, unigene, gc, class_stats, ugstats, ugcount, rep_stats, rep_class, transcripts, ens_db, disease, go, pdb, mim, affy, and GeneNote (Figure 5). All tables are organized around the repeats table in a star schema. This table stores output from Tandem Repeats Finder (Benson, 1999) including chromosome start and end co-ordinates, repeat unit length (referred to as period), the sequence of the repeat unit, the distinct repeat class of which the repeat is a member, the sequence of the repeat and the pure repeat length.
The p-value is calculated independently and represents the fraction of repeats of the same class having the same or greater length. The linkage table contains information about genomic linkage regions implicated in diseases of interest. For each disease linkage study, the linkage table stores the cytogenetic band of the genetic marker used, marker genomic co-ordinates, the original reference's PubMed ID, the linkage score if provided, the type of linkage, any reported p-values and notes of interesting or confounding principles. The pstart and qend values are co-ordinates encompassing 50 Mb flanking the genetic marker co-ordinates (recombination boundaries of the marker). The gc table contains the %GC of the 100 bp, 500 bp, and 1,000 bp of sequence flanking the repeat. The unigene table contains the genomic co-ordinates of each UniGene cluster successfully mapped to the human genome, including its score from the BLAST-like Alignment Tool (BLAT) (Kent, 2002) and the percent identity of the alignment. The rep_class table stores a unique repeat class identifier that is created by concatenating all repeat class members in the class field. The class_stats table stores a p-value for each repeat class length that represents the fraction of repeats of the same class having the same or greater length. The ugcount table links each unique repeat by its repeat ID (rep_id) to the UniGene cluster sequences it has been detected in by BLAT (Kent, 2002) and stores the repeat's length in each hit cluster. The ugstats table collects summary statistics of all UniGene repeat length hits for each rep_id including the count (total number of hits), minimum value, maximum value, mean, and the standard deviation of all detected repeat lengths. Supplementary information about adjacent transcripts is collected in the transcripts table if a repeat is within 60 kb of an EnsEMBL-defined gene.
For each such repeat this includes the EnsEMBL transcript identifier, distance from or location within the EnsEMBL transcript, coding peptide sequence (if the repeat is exonic), and the EnsEMBL gene identifier of the hit. The ens_db table stores supplementary information on all the EnsEMBL genes that contain repeats. This table stores each EnsEMBL gene's unique identifier, Human Genome Organization (HUGO) name, text description (if known), chromosomal co-ordinates and strand location. The go, pdb, mim, and affy tables respectively store any Gene Ontology (Ashburner et al., 2000), Protein Data Bank (Berman et al., 2002), and Mendelian Inheritance in Man (Wheeler et al., 2004) entries, and AffyMetrix probe sets associated with each gene. The genenote table contains AffyMetrix expression values from the Gene Normal Tissue Expression (GeneNote) database (Shmueli et al., 2003). Specifically, this table includes each probe's identifier, expression value and expression call (either Absent (A), Marginal (M), or Present (P)) calculated from Microarray Analysis Suite (MAS) 5.0 with default parameters, and the AffyMetrix array and number.

2.2.3 Preliminary set-up

Prior to running repeatalyzer.pl, a number of preliminary programs need to be run and "staging" databases must be created to collect temporary data required for subsequent analyses.

2.2.3.1 Detecting pure repeats with Tandem Repeats Finder (TRF)

We were interested exclusively in pure repeat tracts, which are more likely to expand following transmission (Chung et al., 1993; Kunst and Warren, 1994; Chong et al., 1995). Command-line TRF has seven parameters that can be manually assigned at run-time: matching weight, mismatch and indel penalties, match probability, indel probability, minimum alignment score to report, and maximum period size to report (Benson, 1999).
We found that matching weight, mismatch and indel penalties, minimum alignment score and maximum period size directly affected the length and purity of hits detected by TRF, whereas changing the match and indel probability features was not useful. The match and indel probability features refer respectively to the percent identity and fraction of indels tolerated in each serial tandem unit detected as a hit. These features allow users to specify alternative expected matching and indel statistical distributions.

Next we evaluated the ability of the matching weight and maximum period size parameters to detect short repeats. Period size refers to the length of the tandemly repeated DNA unit; for instance, CAG/CTG repeats have a period of 3. Since TRF hits must be at least 10 bp, the smallest hit for each repeat class reported in Satellog spans 10 divided by the repeat unit length repeat units. For example, for CAG/CTG repeats, the smallest detectable hit that satisfies the minimum hit length is a 3 1/3 repeat unit hit (i.e. CAGCAGCAGC). Due to this constraint, only repeats of 10 bp and up are stored in Satellog.

Lastly we investigated the utility of adjusting the mismatch and indel penalties. We found that setting the penalty for these parameters to 4090 produced no impure repeats as hits. TRF was run on whole chromosome FASTA files from v. 34 of the human genome downloaded from the UCSC genome browser. Hit purity was confirmed by visually inspecting the top high period hits (these hits have the highest probability of introducing indels due to the scoring scheme used by TRF (Benson, 1999)) (Appendix I).

2.2.3.2 Identifying unique repeat classes

A repeat can be represented in a number of ways in double-stranded DNA. TRF detects repeats by the first tandemly repeated unit; therefore, CAGCAGCAG, AGCAGCAGC, and GCAGCAGCA are detected as repeats of CAG, AGC, and GCA respectively.
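To make the equivalence between these representations concrete, the sketch below enumerates every rotation of a repeat unit together with the reverse complement of each rotation, so that the opposite-strand forms are included (for CAG these are CTG, GCT and TGC). It is a minimal illustration, not the thesis code from Appendix J: the function names are hypothetical, and the "/" separator used when concatenating members into an identifier is an assumption.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def repeat_class_members(unit):
    """All representations of a repeat unit on both strands.

    Each rotation of the unit (first letter moved to the end) is one
    member; its reverse complement is the same repeat read 5'->3' on the
    opposite strand.
    """
    unit = unit.upper()
    members = set()
    for shift in range(len(unit)):
        rotated = unit[shift:] + unit[:shift]
        members.add(rotated)
        members.add(reverse_complement(rotated))
    return sorted(members)


def repeat_class_id(unit):
    """Concatenate the sorted members into one identifier.

    The "/" separator is an assumption made for readability.
    """
    return "/".join(repeat_class_members(unit))


print(repeat_class_members("CAG"))
# ['AGC', 'CAG', 'CTG', 'GCA', 'GCT', 'TGC']
```

Because the members are sorted before concatenation, any member of a class yields the same identifier, so `repeat_class_id("CTG")` equals `repeat_class_id("GCA")`.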
Furthermore, the reference human genome sequence is only presented as the positive strand. Repeats of CTG, GCT, and TGC on the positive strand represent 5'→3' CAG, AGC and GCA repeats respectively on the negative strand. Therefore, to identify all CAG/CTG repeats in the human genome it is necessary to detect all CAG, AGC, GCA, CTG, GCT, and TGC repeats on the positive strand. We developed an algorithm to generate all possible sequence varieties of a repeat unit on the positive and negative strands. Our repeat classification algorithm operates by taking an input repeat unit, e.g. CAG, removing the first letter (C in this case) and appending it to the end of the remainder (AG) to create the second repeat unit (AGC). This is then reverse complemented to generate the equivalent sequence on the negative strand (GCT). This procedure is repeated (repeat unit length - 1) times to generate a unique identifier, henceforth referred to as the repeat class. Each repeat in Satellog is associated with a single unique repeat class (Appendix J).

2.2.3.3 Preparing expression data from the GeneNote database

The GeneNote (Gene Normal Tissue Expression) database provides baseline normal expression data of human genes for use in disease studies (Shmueli et al., 2003). GeneNote data was downloaded from the Gene Expression Omnibus (GEO) (Appendix K). A total of twelve human tissue profiles are presented in GeneNote: bone marrow, brain, heart, kidney, liver, lung, pancreas, prostate, skeletal muscle, spinal cord, spleen, and thymus. These profiles were generated with the AffyMetrix HG-U95 A-E probe-set, covering 62,839 probe-sets. EnsEMBL genes have been mapped to AffyMetrix HG-U95 probes by the EnsEMBL project (Hubbard et al., 2002).
Once a repeat is detected either inside or within 60 kb of an EnsEMBL gene, that gene's normal expression profile is evaluated by cross-referencing its AffyMetrix tags to the GeneNote database within Satellog (Appendix K).

2.2.3.4 Detecting repeat polymorphisms within UniGene clusters

UniGene contains the largest public repository of transcribed human sequence and represents an attempt to organize this wealth of expression data into discrete transcriptional loci (Wheeler et al., 2004). All human UniGene sequences were processed for use with repeatalyzer.pl (Appendix L). For each repeat detected in UTR or exonic sequence, the repeat plus 10 bp of flanking sequence was extracted from EnsEMBL and queried using the BLAT algorithm (Kent, 2002) against a BLAT-formatted database created from sequences representing the longest, highest quality stretch of DNA from each individual UniGene cluster (this sequence is provided by UniGene as the file Hs.seq.uniq). Polymorphism is evaluated only if BLAT analysis against all UniGene clusters resulted in hits for which 1) the BLAT score was at least 85% of the theoretical maximum for a perfect hit; 2) 90% of the query sequence matched identically within the cluster; and 3) the repeat mapped within 10 kb of the genomic co-ordinates of the UniGene cluster (Appendix M discusses the mapping of UniGene clusters to the human genome). If a hit to a UniGene cluster satisfied these criteria, the length of the repeat in the cluster is stored in Satellog. This feature allows investigators to query all repeats with polymorphisms in UniGene clusters from genomic regions of interest.

2.2.4 Overview of the repeatalyzer.pl script

Once the above software and data dependencies are configured, the Perl script repeatalyzer.pl automatically populates Satellog. The script processes the flat files output by TRF.
These files contain the repeat co-ordinates plus the repeat period (the size of the repeated unit), the sequence of the individual repeat unit, the entire repetitive sequence and the repeat length. Repeat co-ordinates are passed to the EnsEMBL API to confirm the authenticity of the co-ordinates generated by TRF. If the repeat is not detected within a gene with the EnsEMBL API, then progressively larger slices, incrementing by 15 kb, are taken in search of flanking genes. As soon as a gene is located in flanking sequence, no further flanking sequence is collected. If no genes are detected within 60 kb of the repeat co-ordinates, repeatalyzer.pl stops searching for genes. If a repeat is detected inside or within 60 kb adjacent to an EnsEMBL-defined gene, then that gene's primary information (co-ordinates, HUGO name, EnsEMBL ID and description) is collected along with metadata stored in EnsEMBL such as Protein Data Bank (PDB) (Berman et al., 2002), Online Mendelian Inheritance in Man (Wheeler et al., 2004), Gene Ontology (GO) (Ashburner et al., 2000), and mappings to AffyMetrix probe sets. If the repeat is located in the 5'-UTR, 3'-UTR, or exon of a gene, then its polymorphism profile within UniGene clusters is evaluated.

2.2.5 Generating a measure of repeat length significance

After running the script to populate Satellog, each repeat's length is compared to its class' genomic repeat length profile. The majority of repeats associated with disease undergo expansions from already large reference genome lengths relative to other repeats of the same class (Cleary and Pearson, 2003). The percentile rank of each repeat length (referred to as p-value in Satellog) is calculated from the distribution of repeat lengths within each repeat's class (Appendix N). It reflects the proportion of repeats with the same or greater length from the repeat's genomic length distribution.
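The length p-value described above can be sketched as follows. This is a hedged illustration, not the code of Appendix N: the function name, the `(rep_id, repeat_class, length)` tuple layout and the example rows are assumptions. For each repeat, it computes the fraction of repeats in the same class whose length is the same or greater.

```python
from bisect import bisect_left
from collections import defaultdict


def length_p_values(repeats):
    """Map rep_id -> fraction of same-class repeats at least as long.

    `repeats` is a list of (rep_id, repeat_class, length) tuples; this
    layout is an assumption made for illustration.
    """
    by_class = defaultdict(list)
    for _, rep_class, length in repeats:
        by_class[rep_class].append(length)
    for lengths in by_class.values():
        lengths.sort()

    p_values = {}
    for rep_id, rep_class, length in repeats:
        lengths = by_class[rep_class]
        # Count entries >= length in the sorted list of class lengths.
        at_least = len(lengths) - bisect_left(lengths, length)
        p_values[rep_id] = at_least / len(lengths)
    return p_values


# Hypothetical rows for illustration only.
rows = [
    (1, "CAG", 10),
    (2, "CAG", 12),
    (3, "CAG", 12),
    (4, "CAG", 30),
    (5, "SINGLETON", 22),
]
print(length_p_values(rows))
```

Under this definition the longest repeat of a class receives the smallest value, and a repeat that is the only member of its class always receives a value of 1, whatever its length.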
2.2.6 Detection and input of disease-associated repeats

Disease-associated repeats and their common properties were recently reviewed (Cleary and Pearson, 2003). Repeats that were not analyzed either had a repeat period greater than 16 (thus not detected by our TRF parameters) or were polymorphic but not associated with any disease. For these disease-associated repeats, there is no record of their precise genomic co-ordinates. To address this, we used Satellog to probe for the probable repeat that corresponded to each disease by selecting all repeats of the expected class within each disease gene. Except for the repeat responsible for blepharophimosis (Crisponi et al., 2001), all repeats were detected. A total of 51 repeats were mapped for 31 diseases (Appendix O).

2.3 Prioritizing candidate repeats for disease-association studies in schizophrenia

2.3.1 Input of neuropsychiatric linkage regions into Satellog

A recent article exhaustively reviewed all schizophrenia and bipolar linkage regions identified to date (Sklar, 2002). We manually collected each linkage region and input it in a standard format into Satellog 2. For each linkage region, we input the genetic marker and its co-ordinates, the cytogenetic band, the PubMed ID of the source paper, and the score, p-value, and type of linkage if provided (e.g. a log odds (LOD) score of 3.4) (Appendix P). If there were any points of interest mentioned by Sklar, these were also included as supplementary notes.

2.3.2 Prioritizing candidate repeats with Satellog 2

Bipolar disorder has overlapping symptoms (DSM-IV) and linkage regions (Sklar, 2002) with schizophrenia. We decided to prioritize bipolar disorder repeats as well because the linkage regions were readily available in the same review as the schizophrenia linkage regions (Sklar, 2002) and because we had DNA from bipolar disorder probands (see 4.6.3.1). However, the rationale for this study is based strictly on the biology of schizophrenia.

Repeats were prioritized for disease association studies in schizophrenia and bipolar disorder (Appendix Q). All repeats within 50 Mb of genetic markers associated with each disease were selected from Satellog. Linkage depth was calculated for each repeat by counting the number of linkage co-ordinates that overlapped with the repeat's genomic co-ordinates. We were interested in globally prioritizing repeats in both transcribed and untranscribed regions. Of the remaining transcribed repeats, those that had any evidence of repeat polymorphism (defined as any length polymorphism within UniGene clusters) were deposited into tables for each disease (schz_cand and bp_cand). This restricted the analysis to repeats within either UTR or exonic sequence. Each repeat's co-ordinates, repeat unit, period, class, length, p-value, linkage depth, UniGene length polymorphism statistics, location within or adjacent to EnsEMBL genes, peptide sequence (if exonic), HUGO name, text description of associated genes, and tissue expression and call within GeneNote were collected. Lastly, only repeats with some evidence of brain expression in the GeneNote database were retained.

All repeats were also globally prioritized without considering evidence of repeat polymorphism within UniGene clusters and were deposited into tables for each disease (schz_cand_global and bp_cand_global). This prioritization paradigm therefore also considers intronic repeats. Each repeat's co-ordinates, repeat unit, period, class, length, p-value, linkage depth, location within or adjacent to EnsEMBL genes, peptide sequence (if exonic), HUGO name, text description of associated genes, and tissue expression and call within GeneNote were collected.
Lastly, only repeats with a p-value less than 0.05 or repeat length greater than 10 and some evidence of brain expression in the GeneNote database were retained. We also checked that any repeats greater than 10 repeat units going through this filter did not have a p-value of 1. Singleton repeats have no other repeats in their repeat class and thus will have a p-value of 1 regardless of their repeat length. The repeat associated with progressive myoclonic epilepsy (EPM1) was from such a distribution and we were concerned about missing other such repeats. One such repeat (rep_id: 2829206, chr4:48577074-48577404, GGAGAAGAGGGAGAA, repeat length = 22) was detected in the bipolar linkage regions. Lastly, the prioritization approach summarized above was repeated, this time selecting repeats of the same repeat class as disease-associated repeats.

CHAPTER 3 RESULTS

3.1 cis-Features of unstable CAG/CTG repeats

3.1.1 Correlation of flanking CAG/CTG repeat features to Brock et al. expandability data

Both male and female expandability metrics based on pedigree transmission of unstable repeats were developed by Brock et al. to correlate cis-sequence features to expandability (Brock et al., 1999). All of the subsequent analyses in this report used the male pedigree transmission because the male correlation of expandability to flanking %GC was strongest and featured broader gradation of expandability values (in contrast to two 0 expandability values for SCA1 and SBMA in the female dataset) (Brock et al., 1999). For each cis feature analyzed, any of the candidate CAG/CTG repeats (those within genes that code for 5 or more glutamines) satisfying the profile of expandable repeats within gems_cis are also displayed. All SQL code required to extract the data from the gems_cis database is available in Appendix R.

3.1.1.1 Correlation of CpG islands with expandability

Brock et al.
noted that the most expandable CAG/CTG repeats were within CpG islands. Since our analysis was limited to CAG/CTG repeats within coding regions, we sought to establish whether the most expandable loci were within CpG islands. The three most expandable loci were within CpG islands while the four least expandable were not (P < 0.01, Fisher's exact test) (Table 2).

3.1.1.2 Correlation of flanking %GC with expandability

GC content at 100 bp and 500 bp flanking disease-associated CAG/CTG repeat loci is positively associated with expandability (Brock et al., 1999). We extended this analysis to 50 bp, 100 bp, 500 bp, and 1,000 bp flanking all the candidate CAG/CTG repeats and known unstable repeats. The correlations observed in Brock et al. existed in our set. Spearman rank correlations of %GC versus expandability were calculated for 50 bp, 100 bp, 500 bp, and 1,000 bp of sequence flanking the CAG/CTG repeat. Significant positive correlations were detected for 50 bp, 100 bp, 500 bp, and 1,000 bp of flanking sequence respectively (rho = 0.82, 0.89, 0.93, 0.89; P = 0.04, 0.02, 0.01, 0.02) (Figure 6). At all values of flanking sequence the associations were strong, but the association was strongest at 100 bp (Figure 7). Given this, we investigated the distribution of flanking %GC at 100 flanking bp for all candidate CAG/CTG repeats (Figure 7). The %GC values assumed a normal distribution, but the repeat within one gene, SCA7, had high enough %GC to achieve statistical significance (Z-score > 1.96, P < 0.05) (Figure 7). However, if the %GC threshold is set to that of the third most expandable locus (HD), then a total of six genes have significant flanking %GC. Three other genes without expandability data, POU3F2, C14orf4, and CACNA1A, had flanking %GC at least equal to HD (Table 3).

Name          Start    End      Expandability   Islands
SCA7          49034    50415    1.3             TRUE
SCA2          48861    50806    0.97            TRUE
HD            48883    50320    0.29            TRUE
SI7E_HUMAN    48837    50259    NULL            TRUE
IRS1          48020    50849    NULL            TRUE
RUNX2         49004    51229    NULL            TRUE
POU3F2        46199    50723    NULL            TRUE
NM_175863     48790    51627    NULL            TRUE
PHLDA1        49109    51401    NULL            TRUE
ASCL1         47940    50545    NULL            TRUE
C14orf4       48179    51516    NULL            TRUE
POLG          49573    51779    NULL            TRUE
094795        48954    50320    NULL            TRUE
SOC6_HUMAN    48962    50278    NULL            TRUE
MN1           47193    52991    NULL            TRUE

Table 2: All unstable and candidate CAG/CTG repeat-containing genes located within a CpG island. 'Start' and 'End' columns refer to start and end co-ordinates of the CpG island relative to the 50 kb slice of genomic sequence flanking the CAG/CTG repeat (i.e. the CAG/CTG repeat starts at 50,000).

Figure 6: a-c) Correlation between ranked median expandability and ranked %GC in 100 bp, 500 bp, and 1,000 bp flanking unstable CAG/CTG repeats (Brock et al., 1999). d) Spearman's rank correlation (rho) of median expandability and %GC of 50 bp, 100 bp, 500 bp, 1,000 bp, 1,500 bp, 2,000 bp, 2,500 bp, 3,000 bp, 3,500 bp, 4,000 bp, 4,500 bp, and 5,000 bp of sequence flanking the CAG/CTG repeat. The P-value (P) is the probability of observing Spearman's coefficient (rho) by chance.

Figure 7: Histogram of %GC of 100 bp flanking the repeat in candidate CAG/CTG repeat sequences (plot title: "Histogram of %GC 100 bp flanking CAG repeat of Candidate GeMS Genes"; x-axis: %GC of 100 bp flanking repeat). Red bar indicates sole gene (SCA7) with %GC content achieving statistical significance based on z-score within this distribution.

3.1.1.3 Correlation of repeat length and purity with expandability

Having confirmed the correlations observed in Brock et al., we sought to explore new relationships between expandability and CAG/CTG repeat length and purity.
Repeat length was defined as the number of nucleotides bounded by the chromosomal co-ordinates detected by Tandem Repeats Finder (TRF). Spearman ranked correlations revealed no relationship between expandability and CAG/CTG repeat length (rho = -0.32, P = 0.50) (Figure 8). Repeat purity was defined internally by flanker.pl by counting the number of contiguous occurrences of the repeat unit as specified by TRF. Spearman ranked correlations revealed no relationship between expandability and CAG/CTG repeat purity (rho = -0.11, P = 0.84) (Figure 9).

Name       Expandability   100_bp
SCA7       1.3             0.83
SCA2       0.97            0.77
HD         0.29            0.76
POU3F2     NULL            0.76
C14orf4    NULL            0.765
CACNA1A    NULL            0.8

Table 3: All CAG/CTG repeat-containing genes with 100 bp of flanking sequence having %GC at least equal to that of HD. The '100_bp' column summarizes the G+C fraction of 100 bp flanking the CAG/CTG repeat.

Figure 8: Expandability vs. repeat length for CAG repeats known to be unstable (x-axis: repeat length (rank); rho = -0.32, P = 0.50). No correlation was observed between ranked expandability and ranked repeat length. Repeat length is defined as the absolute length of the repeat, irrespective of purity, defined by Tandem Repeats Finder co-ordinates. The P-value (P) is the probability of observing Spearman's coefficient (rho) by chance.

Figure 9: Expandability vs. repeat purity for CAG repeats known to be unstable (x-axis: CAG-repeat purity (rank); rho = -0.11, P = 0.84). No correlation was observed between ranked expandability and ranked CAG/CTG repeat purity. CAG/CTG repeat purity is defined as the longest contiguous stretch of the repeat unit specified in Tandem Repeats Finder. The P-value (P) is the probability of observing Spearman's coefficient (rho) by chance.

3.1.1.4 Correlation of CTCF binding sites with expandability

CTCF binding sites are known to flank unstable repeats. For this section of the analysis we included DMPK, a gene with an expandable CAG/CTG repeat in its 3'-UTR, because it is the only disease-associated locus for which it is known with experimental certainty that CTCF binds to sequences flanking the CAG/CTG repeat (Filippova et al., 2001). We felt that including the CTCF binding sites flanking the CAG/CTG repeat of DMPK in this analysis provides insight into the significance of CTCF binding sites flanking CAG/CTG repeats in all genomic contexts.

The expandable genes were analyzed for CTCF binding sites within 5,000 bp flanking their CAG/CTG repeats. This amount of flanking sequence was chosen because a high number of hits are within this region across most of the expandable genes. CTCF binding site proximity to the CAG/CTG repeat was detected with HMMer. Spearman ranked correlations revealed no relationship between expandability and CTCF scores within 5,000 bp flanking the CAG/CTG repeat (rho = 0.08, P = 0.84) (Figure 10). Furthermore, no relationship was obvious between expandability and distance from the repeat (rho = -0.09, P = 0.81) (Figure 10). However, a significant negative association was detected between CTCF score and distance from the CAG/CTG repeat (rho = -0.75, P = 0.03) (Figure 10). These relationships were similar for 10,000 bp flanking the repeat.

Next we sought to compare the distribution of CTCF binding sites in candidate CAG/CTG repeats versus expandable genes.
Interestingly, when only CTCF hits scoring 1.00 or higher within 1,000 bp of flanking sequence were evaluated, the three most expandable loci had one or more CTCF binding sites (P < 0.01, Fisher's exact test) (Table 4). The highest scoring hits were those for the DMPK CTCF binding sites because they were used as part of the training set for the HMM. The quality of the remaining hits cannot be assessed because they have not been confirmed experimentally (Table 4). From a qualitative perspective, CTCF binding site hits can be ranked using those of DMPK as the ideal (which has a profile of high score, small distance from the CAG/CTG repeat, and presence of a flanking CTCF binding site). The CTCF binding sites detected adjacent to the CAG/CTG repeats within the KCNN3, RUNX2 and POU3F2 genes are also high scoring hits. The CTCF binding site flanking the CAG/CTG repeat within TNS is a lower scoring hit but is closer to the CAG/CTG repeat.

Figure 10: a) Plot of ranked median expandability against ranked score of computationally detected CTCF binding sites (from HMMer) in 5,000 bp flanking the CAG/CTG repeats known to be unstable (Brock et al., 1999) (rho = 0.08, P = 0.84). b) Plot of ranked median expandability against ranked distance of CTCF binding site in 5,000 bp flanking the CAG/CTG repeat of genes known to be unstable (Brock et al., 1999) (rho = -0.09, P = 0.81). c) Correlation between ranked distance from the CAG/CTG repeat and ranked score of hits. Each point represents a CTCF binding site. The P-value (P) is the probability of observing Spearman's coefficient (rho) by chance.
The CAG/CTG repeats within SCA7, NM_175863, and SOC6_HUMAN have CTCF sites flanking both sides of the CAG/CTG repeat. The CTCF binding sites are degenerate because of the domain architecture of the protein (Ohlsson et al., 2001). This degeneracy may be reflected as weaker scoring hits (score of 0 to 0.99), such as those flanking the CAG/CTG repeat within IRS1 (Table 5).

Gene Name     Score    Distance   Expandability   Start    End
DMPK          15.5     101        4.81            49839    49899
DMPK          16.04    34         4.81            50096    50156
SCA7          6.18     15         1.3             49924    49985
SCA7          1.31     838        1.3             49100    49162
SCA2          2.61     481        0.97            50552    50615
KCNN3         3.73     170        NULL            49769    49830
TNS           1.22     22         NULL            50052    50113
RUNX2         6.17     641        NULL            49301    49359
POU3F2        3.18     918        NULL            49019    49082
NM_175863     2.91     538        NULL            49401    49462
NM_175863     5.84     415        NULL            49524    49585
NM_175863     3.22     225        NULL            50279    50339
NM_175863     2.01     402        NULL            50456    50516
NM_175863     1.33     429        NULL            49509    49571
CIZ1_HUMAN    1.19     883        NULL            50977    51039
CIZ1_HUMAN    2.98     460        NULL            50554    50615
094795        2.26     219        NULL            50308    50369
SOC6_HUMAN    2.96     0          NULL            50025    50086
SOC6_HUMAN    2.37     566        NULL            49374    49434

Table 4: All CTCF binding sites with an HMMer score greater than 1 that are within 1,000 bp of a CAG/CTG repeat.

Name          Score    Distance   Expandability   Start    End
DRPLA         0.14     701        0.19            50761    50822
RORC          0.16     0          NULL            49991    50053
IRS1          0.08     46         NULL            50091    50152
IRS1          0.99     425        NULL            49515    49575
RUNX2         0.06     545        NULL            49394    49455
CIZ1_HUMAN    0.53     618        NULL            50712    50774
CIZ1_HUMAN    0.19     654        NULL            50748    50807
CACNA1A       0.04     602        NULL            50643    50706
PRKCBP1       0.06     395        NULL            50436    50499

Table 5: All CTCF binding sites with an HMMer score between 0 and 1 within 1,000 bp of a CAG/CTG repeat. These may represent true CTCF sites because of the binding degeneracy of CTCF.

3.1.1.5 Correlation of nucleosome formation potential with expandability

Nucleosome formation potential was evaluated with the web-based version of NucleoMeter (http://wwwmgs.bionet.nsc.ru/mgs/programs/recon/) (Levitsky et al., 2001). NucleoMeter calculates the nucleosome formation potential of input sequence within a 160 bp window. Values of +1 and higher indicate high nucleosome formation potential, whereas values of -1 and lower indicate poor nucleosome formation potential. Portions of flanking sequence that had poor di-nucleotide content were inappropriate for consideration of nucleosome binding sites and were given a null score by NucleoMeter. NucleoMeter returns a running score over each 160 bp window of input sequence. We summarized flanking nucleosome formation potential by averaging the running score for the flanking sequences and then averaging the upstream and downstream scores. We felt that this gives a rough indication of the nucleosome formation potential of the flanking sequences as a whole. NucleoMeter lacks an API, so nucleosome formation potential had to be assessed manually with nucleo.pl, a script designed to extract 200 bp both upstream and downstream flanking the CAG/CTG repeat defined by TRF, but not including the repeat itself (Appendix E). The SCA3 flanking sequence was composed solely of null scores and was given a score of NA for statistical calculations in R. Spearman ranked correlations revealed no relationship between expandability and nucleosome formation potential (rho = -0.37, P = 0.50). See Appendix G for example R scripts used for the statistical analysis discussed in this section.
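The summary score described in Section 3.1.1.5 (mean running score per flank, then the mean of the upstream and downstream values) can be sketched as below. This is an illustration under stated assumptions, not the thesis code: the function names are hypothetical, null scores are represented as `None`, and the handling of a flank made up entirely of null scores (it is skipped, and the result is `None` only when both flanks are null) is an assumption.

```python
from statistics import mean


def mean_ignoring_null(scores):
    """Mean of the non-null running scores, or None if all are null."""
    valid = [s for s in scores if s is not None]
    return mean(valid) if valid else None


def flanking_potential(upstream, downstream):
    """Average the per-flank means of the running scores.

    Assumption: a flank with only null scores is skipped; if both flanks
    are entirely null the result is None (reported as NA in the text).
    """
    side_means = [
        m
        for m in (mean_ignoring_null(upstream), mean_ignoring_null(downstream))
        if m is not None
    ]
    return mean(side_means) if side_means else None


# Hypothetical running scores for illustration only.
upstream_scores = [1.0, None, 0.0]
downstream_scores = [-1.0, -0.5]
print(flanking_potential(upstream_scores, downstream_scores))
```

Averaging the two flank means, rather than pooling all windows, gives each side equal weight even when one side has more null windows than the other.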
3.2 Genomic repeat analysis with the Satellog database

3.2.1 Summary statistics

A total of 8,357,425 pure repeats were detected by TRF in the human genome and were stored in Satellog. Of these, 5,398,328 or 64.6% were detected within an EnsEMBL-defined gene or within 60 kb flanking either side of an EnsEMBL gene. These repeats mapped to 7,260,625 genetic locations in or near EnsEMBL genes, reflecting the fact that some repeats were located within more than one gene. Of the genes in EnsEMBL, 92% (21,654 / 23,531) had at least one pure repeat within 60 kb of their gene boundaries. All repeats in Satellog clustered into 70,318 unique repeat classes.

3.2.2 Characteristics of disease-associated repeats

Disease-associated repeats and their common properties were recently reviewed (Cleary and Pearson, 2003). We queried Satellog with these sequences to observe any characteristic features of these repeats relative to all other repeats. We asked how many of these repeats could be identified as potentially unstable using only the bioinformatics resources within Satellog. A total of 31 of the 35 disease-associated repeats were manually collected from the review and input into Satellog. Repeats that were not analyzed either had a repeat period greater than 16 (thus not detected by our TRF parameters) or were polymorphic but not associated with any disease. For these disease-associated repeats, there is no record of their precise genomic co-ordinates. To address this, we used Satellog to probe for the probable repeat that corresponded to each disease by selecting all repeats of the expected class within each disease gene. All repeats were detected, except for the repeat responsible for blepharophimosis (Crisponi et al., 2001). In 12 cases, more than one candidate was detected as the disease-associated repeat for a disease.
These cases usually involve flanking repeats of the same class that are detected as two distinct repeats because of an interrupting unit, an established characteristic of some disease-associated repeats such as those responsible for SCA1 (Chung et al., 1993) and Fragile X (Kunst and Warren, 1994). In these cases, we simply retained both repeats and associated them with the disease. A total of 51 repeats were mapped for 31 diseases. These disease-associated repeats were located in all gene locations, from exons, introns, 5'-UTR, and 3'-UTR to up to 45 kb away from a gene. Interestingly, these repeats were from only 6 repeat classes.

Trinucleotide repeats are the most common repeat class implicated in disease (Cleary and Pearson, 2003), especially for disorders caused by coding repeat expansion. Of the disease-associated repeats we analyzed, 28 of the 31 were trinucleotide repeats, with 16 being from the CAG/CTG repeat class, 11 from the GCG repeat class, and one each from the CCCCGCCCCGCG, CCTG, GAA, and ATTCT repeat classes. These disease-associated repeat classes had dramatically different genomic distributions (Figure 11). For example, the CCCCGCCCCGCG dodecamer implicated in progressive myoclonic epilepsy type 1 (EPM1) (Lalioti et al., 1997) is the only pure repeat of its class detected in the human genome and therefore has a singleton as its distribution. The remaining repeat classes have broader distributions, particularly the GAA repeat class. GAA repeats have been reported to have a unique distribution relative to other trinucleotide repeats due to their evolutionary origin within Alu repeats (Clark et al., 2004). Satellog recapitulated a distinct, expanded profile for GAA repeats relative to all other trinucleotide repeats (Figure 11).

We define significant repeat length in the reference genome as any repeat with length within the top 5% of its class (corresponds to a p-value < 0.05 in Satellog).
Using this cut-off, we determined whether the reference genome repeat length is significant for any of the disease-associated repeats within their respective disease classes. Interestingly, 80% (24/30) of the disease-associated repeats in Figure 11 were significantly long in the reference genome given their repeat class's length distribution (p-value < 0.05). In fact, 20 of the 30 disease-associated repeats had a p-value of 0.01 or less, indicating that these repeats were extreme outliers within their class. Of the coding repeats, 12 of 17 had significant repeat lengths, including all the CAG-type repeats. The exceptions were the coding GCG repeats for cleidocranial dysplasia (CCD), hand-foot-genital syndrome (HFGS), synpolydactyly, oculopharyngeal muscular dystrophy (OPMD), and holoprosencephaly. The CCCCGCCCCGCG dodecamer implicated in progressive myoclonic epilepsy type 1 (EPM1) is not included in this comparison because there were no other pure repeats of its class in the genome.

3.2.3 Characteristics of repeats polymorphic within UniGene clusters

We used a bioinformatics approach to see if we could detect repeat polymorphisms within UniGene sequences. Of the 8,357,425 pure repeats detected by Satellog, 1.3% or 111,950 repeats were detected as transcribed by the EnsEMBL API (either in the UTR or exon sequence of the gene). Of these repeats, approximately half (57.4% or 64,116 repeats) were detected within UniGene cluster sequences. Finally, of these repeats, only 5,546 were detected as polymorphic (defined as any repeat that had at least one sequence within a cluster with a different repeat length) (Figure 12). A measure of repeat polymorphism was provided by calculating the standard deviation (sd) of all repeat lengths detected within a UniGene cluster.
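As a minimal sketch of the two definitions above, the following treats a repeat as polymorphic when at least two distinct lengths are observed in a cluster and reports the standard deviation of the observed lengths. Whether the original analysis used the population or the sample standard deviation is not stated here; the population form is assumed, lengths are assumed to be in repeat units, and the function name is hypothetical.

```python
from statistics import pstdev


def cluster_polymorphism(observed_lengths):
    """Return (is_polymorphic, sd) for the repeat lengths seen in one cluster.

    Assumptions: lengths are copy numbers, and sd is the population
    standard deviation (the source does not say which form was used).
    """
    if not observed_lengths:
        raise ValueError("no observations for this repeat")
    # Polymorphic means at least one sequence shows a different length.
    is_polymorphic = len(set(observed_lengths)) > 1
    return is_polymorphic, pstdev(observed_lengths)
```

For example, three observations of length 10 give a stable repeat with an sd of zero, while observations of 10 and 12 give a polymorphic repeat.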
A total of 3,165, 2,044, and 55 polymorphic repeats were detected in 3'-UTR, exonic, and 5'-UTR sequence respectively. (Note that a repeat may exist in more than one gene, which is why the location break-down of the repeats is greater than the total number of distinct polymorphic repeats of 5,546.) The degree of polymorphism is greater in the 3'-UTR sequence than in exonic or 5'-UTR sequence (one-way ANOVA, P < 0.05) (Figure 13).

Figure 12: Unstable repeats as a fraction of total repeats detected in UniGene clusters. Categories shown: untranscribed repeats; transcribed repeats not detected in UniGene clusters; transcribed repeats detected as stable in UniGene clusters; transcribed repeats detected as unstable in UniGene clusters. Polymorphic repeats make up a tiny portion of all pure repeats detected in Satellog. Approximately half of all the 111,950 transcribed repeats were mapped to UniGene clusters, but only 5,546 or 0.07% of all repeats were detected as polymorphic within UniGene clusters.

Next we evaluated the tolerance of repeat polymorphisms by various repeat periods in exonic and UTR sequence. To observe if highly polymorphic repeats were restricted to certain repeat periods (defined as repeat unit length), the repeat period distribution was observed at progressively increasing sd values (Figures 14 and 15). Untranslated repeats were well distributed across all repeat periods except for 16mers at an sd cut-off of 1 (which roughly corresponded to repeat polymorphisms of 1 repeat unit). At increasing sd cut-offs, untranslated polymorphic repeats were detected as penta-, tri-, and mainly di-nucleotide repeats (Figure 14). In contrast, while coding repeat polymorphisms were widely distributed at an sd of 1, they were mainly restricted to trinucleotide repeats at higher sd cut-offs (Figure 15). Although the untranslated repeats had higher sd values, their most polymorphic sd values were restricted to mono- and di-nucleotide repeats.
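The comparison of repeat periods at increasing sd cut-offs can be sketched as a simple tally. The input format, the function name, and the choice of an inclusive cut-off (sd greater than or equal to the threshold) are assumptions made for illustration.

```python
from collections import Counter


def period_distribution(repeats, sd_cutoff):
    """Count polymorphic repeats per repeat period at a given sd cut-off.

    `repeats` is an iterable of (period, sd) pairs, a hypothetical format.
    Assumption: the cut-off is inclusive.
    """
    return Counter(period for period, sd in repeats if sd >= sd_cutoff)
```

Calling the function with progressively larger cut-offs on the same input shows which periods drop out as the threshold rises.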
3.2.4 Disease-associated repeats detected in UniGene clusters

We were interested in whether known disease-associated repeats were polymorphic within UniGene clusters. We extracted the top ten most polymorphic coding and non-coding repeats, based on their sd value, and determined if any of the disease-associated repeats were among them. The repeats associated with SBMA (AR is the gene mutated in individuals affected with SBMA), DRPLA, and SCA17 (TBP is the gene mutated in individuals affected with SCA17) were detected as the first-, third- and fourth-most polymorphic coding repeats (Table 6). The coding AIB-I repeat that confers increased risk of prostate cancer was also detected as polymorphic but not in the top ten. Of the non-coding repeats, the repeat responsible for FRAXE was detected as polymorphic, but not as one of the top ten most polymorphic untranslated repeats (Table 7).

Figure 13: Boxplot comparison of polymorphic repeats from exonic, 5'-UTR and 3'-UTR sequence. Median standard deviations (line through box) of polymorphic repeats detected in exonic, 5'-UTR, and 3'-UTR sequence. Median exonic and 5'-UTR standard deviations did not significantly differ from each other, but did significantly differ from 3'-UTR repeats, implying that the 3'-UTR tolerates larger, more expanded repeats (one-way ANOVA, P < 0.05).

Of the 31 disease-associated repeats discussed previously, only 5 repeats were detected as polymorphic within UniGene clusters. We sought to understand why this occurred. Of the 31 disease-associated repeats, 4 failed to map within the genomic co-ordinates of any mapped UniGene cluster. The remaining 27 repeats mapped within a UniGene cluster's genomic co-ordinates. However, 16 of these failed to be detected within UniGene sequences even though they mapped within a UniGene cluster.
This could be because of the 3' bias of the UniGene sequences, the incomplete nature of the clusters (Wheeler et al., 2004), sequence errors in the representative UniGene cluster sequence we searched against for hits (Hs.seq.uniq; see METHODS for details), or the limitations of our mapping algorithm. Our approach requires that the repeat be found with at least 10 bp of flanking sequence, which excludes repeats at the edge of UniGene clusters. The remaining 11 disease-associated repeats were detected within UniGene clusters, but only 5 of these repeats were polymorphic. On average, the repeats detected as polymorphic had more hits within UniGene clusters than those detected as stable (an average of 17.4 observations per repeat for the polymorphic repeats versus 4.54 for stable repeats). This suggests that there is a greater chance of observing repeat polymorphism with deeper sampling. All of the polymorphic repeats were limited to one UniGene cluster and none of the lengths surpassed the disease pre-mutation thresholds of 29, 25, 36, 42, and 39 pure repeats for the repeats responsible for increased prostate cancer risk (AIB-I), DRPLA, SBMA, SCA17, and FRAXE respectively (Cleary and Pearson, 2003).
3.3 Candidate repeats for typing in schizophrenia and bipolar disorder

Candidate polymorphic repeats were scored and selected in descending order of score. The score was calculated by multiplying linkage depth by the standard deviation of repeat lengths within UniGene clusters and then dividing by the repeat's p-value. Candidate repeats were also selected by a global prioritization approach that did not take repeat polymorphism into account. This approach selected only those repeats with a p-value less than 0.05 and a repeat length greater than 10. Lastly, we prioritized repeats with the same approach, but restricted the repeats to those from repeat classes associated with disease. One repeat in bipolar linkage regions was extracted separately (see 2.3.2) (rep_id: 2829206, chr 4: 48577074-48577404, GGAGAAGAGGGAGAA, repeat length = 22).
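The two selection rules described in this section can be written down as a brief sketch. Linkage depth is treated as a plain number because its definition is not restated here, and the dictionary keys and function names are hypothetical.

```python
def candidate_score(linkage_depth, sd, p_value):
    """Score = linkage depth x sd of repeat lengths / length p-value."""
    if p_value <= 0:
        raise ValueError("p-value must be positive")
    return linkage_depth * sd / p_value


def rank_polymorphic_candidates(candidates):
    """Sort candidate records by descending score.

    Each record is a dict with the hypothetical keys
    'linkage_depth', 'sd' and 'p_value'.
    """
    return sorted(
        candidates,
        key=lambda c: candidate_score(c["linkage_depth"], c["sd"], c["p_value"]),
        reverse=True,
    )


def passes_global_filter(p_value, repeat_length):
    """Global prioritization: p-value below 0.05 and length above 10."""
    return p_value < 0.05 and repeat_length > 10
```

Dividing by the p-value means that repeats with unusually long reference lengths are pushed up the list, while the sd term rewards observed length variation.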
[Candidate repeat tables: contents not recoverable from the extracted text.]
CHAPTER 4 DISCUSSION

4.1 cis-Features of unstable CAG/CTG repeats

4.1.1 Identifying cis-mediators of instability

Research into cis-sequences mediating the instability of CAG/CTG repeats has suffered from a lack of theoretical knowledge about repeat expansion in general. However, a useful measure of genomic CAG/CTG repeat instability has been developed, termed 'expandability' (Brock et al., 1999). This expandability metric allows flanking sequence features to be correlated with repeat instability at disparate CAG/CTG loci. We have repeated this analysis and extended it to identify novel flanking sequence features associated with instability, with one modification to the approach of Brock et al., 1999: only coding CAG/CTG repeats were analyzed. We limited our analysis in this respect because the selective constraints on expansion differ between coding and non-coding regions (Cleary and Pearson, 2003).
Our results agree with those of Brock et al., 1999 to the extent that we found flanking %GC and presence within CpG islands to be correlated with repeat expandability. Furthermore, we have extended the original analysis to evaluate associations between expandability and repeat length and purity, CTCF binding sites, and nucleosome formation potential. We have also summarized a number of candidate CAG/CTG repeats that have cis-features similar to genes associated with instability.

4.1.1.1 Association between flanking %GC and instability

We observed a positive association between flanking GC content and expandability, but the exact percentages and trends were different from those published (Brock et al., 1999). Of immediate concern were the differences in flanking GC content between our study and Brock et al., 1999. In Brock et al., 1999 and in our study the SCA7 locus had the highest flanking GC content, with 83.5% and 83.0% respectively. We sought to understand why these differences existed. Flanking sequences were manually extracted and counted, and the reason for this difference became apparent. We relied on TRF sequence co-ordinates to locate the repeat, but TRF identifies repeats by their first tandemly repeated unit. Therefore, if a CAG/CTG repeat was preceded by a non-repeat G, the first tandemly repeating unit, GCA, would be counted. For example, in the sequence AAT|GCAGCAGCAG|GGAG, the CAG repeat proper is CAGCAGCAG, whereas TRF reports the repeat in the GCA phase, starting one base earlier at the preceding G. In contrast, Brock et al., 1999 manually extracted the sequence based on the absolute co-ordinates of each CAG repeat. When we repeated the analysis in this way for the SCA7 locus, we obtained the same value as that published (83.5%). The GC% difference between our study and Brock et al., 1999 at the SCA3 locus was a result of using the complete human genome sequence versus the incomplete version available at the time of their study.
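The effect of the one-base phase shift on flanking GC content can be reproduced on the toy sequence from the text. The sketch below is an illustration only: it assumes 0-based half-open co-ordinates, pools the upstream and downstream flanks into a single percentage, and uses hypothetical function names.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a DNA string (case-insensitive)."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def flanking_gc(sequence, start, end, flank):
    """GC% of `flank` bases on each side of the repeat at [start, end).

    Assumption: both flanks are pooled into one percentage.
    """
    left = sequence[max(0, start - flank):start]
    right = sequence[end:end + flank]
    return gc_percent(left + right)


# Toy sequence from the text: AAT|GCAGCAGCAG|GGAG
TOY = "AATGCAGCAGCAGGGAG"
TRF_COORDS = (3, 12)  # GCA-phased repeat, GCAGCAGCA
CAG_COORDS = (4, 13)  # CAG-phased repeat, CAGCAGCAG

gc_trf = flanking_gc(TOY, *TRF_COORDS, flank=2)
gc_cag = flanking_gc(TOY, *CAG_COORDS, flank=2)
```

With a 2 bp flank the two co-ordinate choices give different values on this toy sequence, which is the kind of discrepancy described above for the SCA7 locus.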
Another difference between our study and Brock et al., 1999 was the trend in the correlation co-efficient. The correlation co-efficient in Brock et al. decreased when 100 bp to 500 bp flanking the repeat was considered. Conversely, in our study, the strongest correlation co-efficient was observed at 500 bp (0.93), although the difference was minuscule (0.4). Interestingly, our weakest correlation was at 50 bp flanking the repeat. This difference arises because we did not consider the ERDA1 locus, which is in a GC-poor portion of the genome and skews correlations higher. CpG islands, on the other hand, were detected as expected from Brock et al., with the more expandable loci harbouring them.

4.1.1.2 Association between flanking repeat length, purity and instability

No correlation between expandability and repeat length or purity was observed in our analysis. Brock et al., 1999 likewise found no association between repeat length and expandability; they did not explore relationships with repeat purity. In one respect this is encouraging as it indicates that the expandability metric is independent of the repeat length. Repeat length and purity remain useful tools in haplotype analysis. For example, in HD, longer, purer repeats are on average diagnostic of earlier and more severe disease onset (Cleary and Pearson, 2003).

4.1.1.3 Association between flanking CTCF binding sites and instability

CTCF binding sites have been documented to be associated with expandable loci (Filippova et al., 2001). We have used an HMM to detect CTCF binding sites in 1,000 bp flanking the CAG/CTG repeat.
Important points from experimentation with CTCF binding sites at the DM locus are that a) flanking proximal CTCF binding sites in conjunction with CAG/CTG repeats form an insulator module, b) the CAG/CTG repeats are an important component of this biological system, and c) the upstream binding site is more important than the downstream site for insulator activity (Filippova et al., 2001). Of initial interest was whether our HMM detected experimentally determined CTCF binding sites flanking CAG/CTG repeats. As revealed by gel-shift assays, the CAG/CTG loci within DM, DRPLA and SCA7 had upstream and downstream CTCF binding sites, while HD and SCA2 only had upstream sites. We detected CTCF binding sites flanking DMPK (as expected, since DMPK CTCF binding sites were used in training the HMM), two upstream of SCA7, one downstream of DRPLA and SCA2, and none at HD. This is not necessarily a failure of the HMM because the precise sequence that CTCF bound to was not determined in these experiments and it is not known if these sequences are of the same family as those used to train our HMM. The CTCF protein has 11 zinc fingers, which may mediate interactions with a range of sequence profiles (Ohlsson et al., 2001). Our HMM can only detect binding sites resembling those it has been trained on; hits must therefore share sequence features with the previously determined CTCF sites. The proximity (15 bp) and high score (6.18) of the hit at the SCA7 locus make it appear more 'real' than the others, but such statements are strictly qualitative without further experimental data. Interestingly, La Spada et al. presented evidence at a recent microsatellites conference that the particular sequence motif of the CTCF binding site flanking the CAG/CTG repeat at the SCA7 locus mediates repeat instability (La Spada et al., 2004).
In that work, direct mutation of known CTCF contact nucleotides in a binding site adjacent to the SCA7 locus, the most unstable CAG/CTG repeat, resulted in a significant decrease in repeat instability.

4.1.1.4 Association between flanking nucleosome formation and instability

Spearman ranked correlations revealed no relationship between expandability and nucleosome formation potential as defined by NucleoMeter (Levitsky et al., 2001) (rho = -0.37, P = 0.50). NucleoMeter assigns high nucleosome formation potential scores to DNA regions most likely to form nucleosomes. In their training set, DNA with tissue-specific expression profiles had the highest scores, reflecting the fact that in most tissues these genes would be heterochromatinized (Levitsky et al., 2001). We cross-referenced the unstable genes summarized by Brock et al. with the GeneNote database to see if they exhibited tissue-specific expression. Over half of the genes, including SCA7, SCA2, SCA1, and SCA3, had evidence of expression in all tissues. HD and SBMA were expressed in seven tissues, which is over half of the tissues profiled by GeneNote. DRPLA had the most limited expression profile, as it was expressed in only three tissues. Therefore, the genes correlated against nucleosome formation potential, and especially the most unstable ones (SCA7 and SCA2), generally lacked tissue-specific expression. It is not surprising that the nucleosome formation potential at these sites does not correlate with expandability, as the majority of the genes evaluated appear to be ubiquitously expressed and would likely not share cis-features with genes having tissue-specific expression. It should be noted that the field of computational nucleosome formation potential prediction is new and Levitsky's group is the only one solely focused on this problem. It is thus challenging to assess the accuracy of NucleoMeter's predictions as there are no competing tools investigating the problem.
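For reference, the Spearman rank correlation used above can be computed as the Pearson correlation of the ranks, with tied values sharing their average rank. The sketch below is a generic pure-Python implementation rather than the software used in the thesis, and it returns only rho, not the associated P-value.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))
```

In this setting x would hold the expandability values and y the nucleosome formation potential scores for the same loci.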
4.1.2 Prioritizing candidate CAG/CTG repeats

We sought to apply the conclusions from both our re-analysis of the Brock et al. data and our own analyses to the set of candidate CAG/CTG repeats. Our assumption is that candidate CAG/CTG repeats sharing features with unstable loci are more likely themselves to be unstable and should be given higher priority for genotyping. We identified statistically significant associations between CpG islands, GC content, and CTCF binding sites. Candidate CAG/CTG repeats having at least one of these features were compared to identify co-occurrences (Table 16). Since no expandability data exist for the candidate CAG/CTG loci, statistical associations cannot be directly calculated for these loci and their flanking sequence features. Instead we sought to identify candidate loci that fit the 'expandability profile'. Eight genes fit the 'expandability profile': CACNA1A, C14orf4, POU3F2, ASCL1, RUNX2, SOC6_HUMAN, IRS1, NM_175863. CACNA1A, C14orf4, POU3F2, ASCL1, and RUNX2 were interesting based on their high flanking GC content, presence within CpG islands and CTCF binding sites.
SOC6_HUMAN, IRS1, and NM_175863 were interesting primarily because of the presence of pairs of flanking CTCF binding sites and, in the case of SOC6_HUMAN, the close proximity of the CTCF site to the CAG/CTG repeat.

4.2 Genomic repeat analysis with the Satellog database

Satellog presents human microsatellite repeat data in a manner relevant to disease association studies. The selection of each bioinformatics feature or supplementary data source in Satellog is justified by its biological relevance to polymorphic satellite repeats. Satellog recapitulates many known biological facts about micro- and minisatellite repeats and reveals new patterns of disease-associated repeats.

There is no documentation in the literature of differences in polymorphism among repeats residing in different genetic regions. Although one might expect greater polymorphism in UTR sequence relative to exons due to reduced evolutionary constraints, both 5'-UTR and exonic repeats had similar rates of polymorphism, whereas 3'-UTR repeats had significantly greater polymorphism compared to these two groups (Figure 13). This may be due to the documented 3'-UTR sequence over-representation in UniGene (Wheeler et al., 2004).

However, depending on whether the repeat is within exonic or UTR sequence, there appear to be constraints on which repeat unit sizes can tolerate large polymorphisms. Of the more polymorphic UTR repeats (those with sd values greater than 3), there was a single trinucleotide repeat amongst mainly dinucleotide and mononucleotide repeats (Figure 14, Table 7). On the other hand, the majority of exonic polymorphism, although less pronounced, is almost entirely in multiples of three (Figure 15, Table 6).
Our results support the observation that coding microsatellite polymorphisms are usually in-frame in order to avoid a deleterious phenotype resulting from frame-shift (Metzgar and Wills, 2000) or to provide a rapid evolutionary response to a changing environment (Kashi et al., 1997). It is interesting that polymorphism data present in the UniGene dataset recapitulate this biological principle.

4.3 Repeat prioritization in schizophrenia with Satellog

We selected repeats that were within the recombination limits (50 Mb or the end of the chromosome) of genetic markers with evidence of linkage to schizophrenia and bipolar disorder in multiple studies. We felt that these broad regions, which had some evidence of association with schizophrenia, were of more interest than random genomic sequence. Our prioritization strategy looked at polymorphic repeats in linkage regions, any repeats in linkage regions, and then lastly disease-associated repeat classes that were polymorphic or had significant lengths. Since we felt that an unstable repeat may confer genetic risk for developing schizophrenia, repeats that had shown some evidence of repeat polymorphism in UniGene clusters were prioritized. A parallel strategy did not take polymorphism profiles into account and instead emphasized the repeat's p-value and length. The candidate repeat lists present the first objective prioritization of candidate repeats in schizophrenia and bipolar disorder linkage regions. Previous disease association studies looking at candidate repeats in schizophrenia investigated long CAG/CTG repeats close to schizophrenia linkage regions due to their prevalence in polyglutamine-encoding expansion disorders. We hope our approach will help identify the repeats implicated in schizophrenia if the disorder is in fact mediated by an unstable repeat tract. Here we summarize the interesting repeats from each prioritization paradigm, highlighting those with a putative role in neurobiology.
The function of these repeat-containing genes is summarized from their GeneCards entries (Rebhan et al., 1997).

4.3.1 Top 20 polymorphic schizophrenia candidate repeats

Interesting polymorphic repeats were detected within genes with evidence of neuronal function, such as Neurochondrin (chr 1: 35459893-35459943), MOG (myelin-oligodendrocyte glycoprotein precursor) (chr 6: 29732916-29732936), and VANGL2 (van gogh, (Drosophila)-like 2) (chr 1: 157614137-157614167) (Table 8). Neurochondrin cDNAs are highly expressed in brain and kidney tissue. Little is known about the function of the gene, but homozygous null mutations in mice are lethal (Mochizuki et al., 2003). MOG is a minor component of the myelin sheath and is linked to GO terms for synaptic transmission and central nervous system development. VANGL2 is involved in morphogenesis and patterning of the neural plate.

4.3.2 Top 20 globally prioritized schizophrenia candidate repeats

The TATAT repeat within the intron of GRIK2 (glutamate receptor, ionotropic, kainate 2) (chr 6: 102371516-102371577) (Table 9) was the sole repeat within a gene with evidence of neuronal function. Glutamate mediates excitatory neurotransmission in the brain and anti-psychotics have shown effects on this system (Kim et al., 1980). This receptor has marked expression in the brain and to a lesser extent in the spinal cord.

4.3.3 Top 20 polymorphic schizophrenia candidate repeats from disease-associated classes

A coding CTG repeat was detected in PCSK9 (proprotein convertase subtilisin/kexin type 9) (chr 1: 54875471-54875493) (Table 12). PCSK9 has been implicated in the differentiation of cortical neurons and its expression has been noted in embryonic brain telencephalon neurons (Seidah et al., 2003).
4.3.4 Top 20 globally prioritized schizophrenia candidate repeats from disease-associated classes

No repeats from disease-associated classes were detected within genes with obvious neurobiological function by global prioritization (Table 13). However, the disease-associated repeats for SCA1 were detected (chr 6: 16435844-6435887 and chr 6: 16435895-16435934).

4.4 Conclusions

Brock et al. identified an association between flanking GC content and CAG/CTG repeat instability at disease loci by using a relative measure of repeat instability called 'expandability'. Using this expandability measure, we have extended the analysis of Brock and colleagues to include sequence regions omitted due to the incomplete state of the human genome sequence at the time of their study. Furthermore, we have used the expandability metric to test for association between instability and other features theorized to contribute to it, such as CAG/CTG repeat length and purity, proximity to CCCTC-binding factor (CTCF) binding sites, and the nucleosome formation potential of the surrounding DNA. Our results recapitulated Brock's observations regarding flanking GC, CpG islands and CAG/CTG repeat instability. Specifically, we observed a positive association between flanking GC content and expandability, but the exact percentages and trends were different from those published (Brock et al., 1999). We also observed that in general unstable repeats were located within CpG islands. Our work also suggested a novel relationship between flanking CTCF binding sites and unstable repeats. Conversely, no relationships between expandability and repeat length, purity, or nucleosome formation were detected. Our results provide further insight into which cis-sequences may contribute to CAG/CTG repeat instability.
We developed Satellog, a database that catalogs all pure repeats with unit lengths of 1-16 bp in the human genome, along with supplementary data we believe to be of use for the prioritization of satellite repeats in disease association studies. For each pure repeat we also record a p-value for its length relative to other repeats of the same class in the genome, its frequency of polymorphism within UniGene clusters, its location either within or adjacent to EnsEMBL-defined genes, and its expression profile in normal tissues according to the GeneNote database. Satellog is the first database capable of dynamic candidate repeat prioritization in the human genome based on these features. By examining the global repeat polymorphism profile, we found that highly polymorphic coding repeats were mostly restricted to trinucleotide repeats, whereas a wider range of repeat unit lengths was tolerated in untranslated sequence. We also found that 3'-UTR sequence has more repeat polymorphisms than 5'-UTR or exonic sequence. To showcase Satellog's potential utility, we used Satellog to prioritize repeats for disease-association studies in schizophrenia. We hope that Satellog proves useful for candidate repeat prioritization in schizophrenia or any other disease in which unstable repeats are thought to have a role in disease etiology.

4.5 Problems encountered and limitations

4.5.1 Brock et al.'s expandability metric

The expandability metric employed by Brock et al. is derived from the authors' collation of pedigree analyses and published results. To quantify relative levels of instability, the metric uses the following formula: length change / (progenitor allele length - 35 repeats). This measure quantifies the "tendency of an above threshold repeat block to undergo further expansion". The authors believed that repeat length changes needed to be expressed relative to the progenitor length of repeats.
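The formula quoted above can be written as a short Python function. This is only a sketch: the function name `expandability`, the error handling for alleles at or below the threshold, and the numbers in the example are assumptions made here for illustration and do not come from Brock et al.'s data.

```python
def expandability(length_change, progenitor_length, threshold=35):
    """Length change per repeat above the threshold (all values in repeat units).

    Implements: length change / (progenitor allele length - threshold).
    """
    excess = progenitor_length - threshold
    if excess <= 0:
        # The ratio is undefined (or negative) for alleles at or below the threshold.
        raise ValueError("progenitor allele length must exceed the threshold")
    return length_change / excess


# Hypothetical transmission: a 55-repeat progenitor allele gains 10 repeats.
print(expandability(10, 55))                # 0.5
print(expandability(10, 55, threshold=50))  # 2.0
```

The threshold is exposed as a parameter because the same observed length change yields a very different value when the assumed threshold changes.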
Progenitor allele length was "standardized" by subtracting 35 repeats, the hypothetical threshold of repeat instability in many coding CAG/CTG disorders (Cleary and Pearson, 2003). This assumption has a number of limitations, mainly because the myotonic dystrophy repeat was included in their study. Since myotonic dystrophy is a non-coding CAG/CTG disorder, it has different molecular genetic and etiological properties. Most importantly, the published minimum threshold for disease in myotonic dystrophy is 50 repeats, greater than the threshold of 35 used by Brock et al. (1999). Furthermore, this metric was never tested, nor were the raw data used to generate it provided. This weakens our ability to rely on Brock et al.'s expandability as a useful measure of genomic CAG/CTG repeat instability.

4.5.2 Limitations of the GeneNote dataset

The GeneNote Affymetrix microarray experiments (Shmueli et al., 2003) were based on whole-tissue RNA samples, which is an important consideration if one is interested in gene over-expression in a particular anatomical region. For instance, high gene expression local to the frontal lobe would be diluted in the larger tissue section used for the GeneNote experiments. Repeat-prioritization approaches that require expression in a particular tissue should bear this in mind.

4.5.3 Mapping repeats to UniGene clusters

A major problem with mapping repeats to UniGene clusters is that repetitive sequence usually hits many clusters, the majority of which are false positives. To ensure that each hit was real, we pre-mapped each unique UniGene cluster to the human genome and stored the chromosomal co-ordinates in a table named unigene (Appendix M). At run-time, every time a repeat was detected in a UniGene cluster, the hit's co-ordinates were compared to the mapped co-ordinates of the cluster.
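The run-time check in this section can be sketched in Python as follows, using the 10 kb window stated in the text. Everything else is an assumption made for illustration: the tuple layout of hits and clusters, the function names `near_cluster` and `merged_sd`, measuring distance as the gap between the two intervals, and treating the sd value as a sample standard deviation of the retained repeat lengths.

```python
from statistics import stdev

WINDOW_BP = 10_000  # 10 kb window described in this section


def near_cluster(hit, cluster, window=WINDOW_BP):
    """True if a repeat hit lies within `window` bp of the cluster's mapped position.

    `hit` and `cluster` are (chromosome, start, end) tuples (hypothetical layout).
    """
    hit_chrom, hit_start, hit_end = hit
    chrom, start, end = cluster
    if hit_chrom != chrom:
        return False
    # Gap between the two intervals; 0 when they overlap.
    gap = max(hit_start - end, start - hit_end, 0)
    return gap <= window


def merged_sd(hits, cluster):
    """Sample standard deviation of repeat lengths for hits retained by the filter.

    `hits` is a list of ((chromosome, start, end), repeat_length) pairs.
    """
    lengths = [length for coords, length in hits if near_cluster(coords, cluster)]
    return stdev(lengths) if len(lengths) > 1 else 0.0


cluster = ("chr6", 16_400_000, 16_450_000)
hits = [
    (("chr6", 16_435_844, 16_435_887), 10),  # inside the cluster: kept
    (("chr6", 16_455_000, 16_455_030), 12),  # 5 kb downstream: kept
    (("chr1", 16_435_844, 16_435_887), 30),  # wrong chromosome: dropped
    (("chr6", 17_000_000, 17_000_030), 40),  # 550 kb away: dropped
]
print(merged_sd(hits, cluster))
```

Filtering before computing the spread is the point of the design: the two dropped hits would otherwise inflate the apparent polymorphism of the repeat.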
If the repeat co-ordinates were within 10 kb of the UniGene genomic co-ordinates, then the repeat length hits were retained and merged into a single sd value. It is also important to consider that larger repeat polymorphisms could cause a UniGene cluster to "split" into two distinct clusters. This could downplay a repeat's polymorphism because such repeats would not be evaluated as a single group, thereby decreasing the repeat's sd value. Pre-mapping the clusters controlled for this as well, because if a cluster was split by a large repeat polymorphism, then both clusters should map to the same genomic co-ordinates. In practical terms this was not an issue, since only one of our most polymorphic repeats (sd > 2) mapped to two clusters.

4.5.4 Prioritizing with p-values

The genomic length distribution of a repeat class determines each length's p-value in Satellog (2.2.3.1), but it should be emphasized that this value is meaningless for repeat classes with few members in their distribution. For example, the CCCCGCCCCGCG dodecamer implicated in progressive myoclonus epilepsy type 1 (EPM1) (Lalioti et al., 1997) is the only pure repeat of its class detected in the human genome and therefore has a singleton as its distribution. The p-value for this disease-associated repeat is 1. P-values should be used carefully when considering larger period repeats, as their distributions contain fewer, usually single, observations.

4.5.5 Multiple repeats detected for known diseases

Using the repeat type and gene information in Cleary and Pearson (2003), we attempted to uniquely identify all disease-associated repeats. In some cases, more than one repeat was a candidate for the disease-associated repeat of a given disease. These cases usually involved adjacent repeats of the same class that were detected as two distinct repeats because of an interrupting unit, a known feature of some disease-associated repeats (e.g.,
SCA1 (Chung et al., 1993; Kunst and Warren, 1994; Chong et al., 1995)). In these cases, we simply retained both repeats and associated them with the disease.

4.6 Future studies

4.6.1 Identifying cis-mediators of instability

It is encouraging that recent research on CTCF binding sites flanking CAG/CTG repeats implicated the sites in instability (La Spada et al., 2004). Future experiments should aim to explain why the published CTCF binding sites flanking DRPLA, SCA7, HD and SCA2 (Filippova et al., 2001) failed to be detected by our HMM. Implicit in the published analysis was that the multiple sequence alignment of known sites revealed a consensus sequence of conserved guanine nucleotides essential for CTCF recognition (Figure 4). Perhaps the sites not detected by the HMM contained a consensus sequence that interacts with a different combination of CTCF zinc fingers than the binding sites published in the multiple alignment. Gel-shift assays established that CTCF binding sites exist flanking DM, DRPLA and SCA7, while HD and SCA2 only had upstream sites (Filippova et al., 2001). These sites should be characterized by DNase I footprinting to observe the consensus site that CTCF interacts with and to enrich our HMM, optimizing detection of CTCF binding sites either flanking CAG/CTG repeats or at any other genomic locus. New CTCF binding sites should be incorporated into our HMM with the eventual goal of creating a CTCF binding site predictor.

4.6.2 Improvements to Satellog

Satellog can be improved in many ways in order to increase its utility to the micro- and minisatellite research community. More sophisticated algorithms can be deployed to detect repeat polymorphisms in UniGene clusters. Once a repeat has been mapped to a UniGene cluster with BLAT, our approach relied on an exact match of at least 10 bp of flanking sequence to register a hit to a specific cluster sequence.
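The exact flanking match described above can be illustrated with a short Python sketch. It is not the Satellog implementation: the function name `repeat_copy_numbers`, the requirement that both flanks match (rather than one), and the toy sequences are assumptions made here. Only the 10 bp minimum comes from the text.

```python
import re


def repeat_copy_numbers(unit, left_flank, right_flank, cluster_sequences, min_flank=10):
    """Copy number of `unit` in each sequence where both flanks match exactly.

    Only the `min_flank` bases immediately adjacent to the repeat are required
    to match; sequences without an exact match are skipped.
    """
    left, right = left_flank.upper(), right_flank.upper()
    if len(left) < min_flank or len(right) < min_flank:
        raise ValueError("each flank must supply at least min_flank bases")
    pattern = re.compile(
        re.escape(left[-min_flank:])
        + "((?:" + re.escape(unit.upper()) + ")+)"
        + re.escape(right[:min_flank])
    )
    copies = []
    for sequence in cluster_sequences:
        match = pattern.search(sequence.upper())
        if match:
            copies.append(len(match.group(1)) // len(unit))
    return copies


left, right = "ACGTACGTTG", "TTGACCATGA"
sequences = [
    "NN" + left + "CAG" * 5 + right + "NN",  # exact flanks, 5 copies
    left + "CAG" * 7 + right,                # exact flanks, 7 copies
    "ACGTACGTTA" + "CAG" * 5 + right,        # one flank mismatch: no hit
]
print(repeat_copy_numbers("CAG", left, right, sequences))  # [5, 7]
```

The third sequence shows the cost of exact matching: a single sequencing error in the flank hides an otherwise valid observation, which is the motivation for an indel- and mismatch-tolerant alignment step.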
Future approaches should further incorporate the BLAT algorithm for this procedure, as it is optimized to detect short, nearly exact matches rapidly (Kent, 2002) while tolerating indels. Rigorous parameter testing will be needed to determine how to optimize detection of repeats with BLAT. This should greatly enrich the polymorphism profile of repeats within Satellog. Furthermore, new disease-associated repeats should be added to the database as they are published.

4.6.3 Disease association studies in schizophrenia

Repeats from the candidate repeat lists (3.3.1-8) will be analyzed by GeneScan software on an ABI 3700 sequencer to determine whether they are expanded in schizophrenics versus controls. GeneScan is a fragment analysis package that automatically identifies, quantitates, and sizes each DNA fragment that passes through an ABI instrument.

4.6.3.1 Specimens for analysis

High quality, high molecular weight DNA from individuals with schizophrenia, individuals with bipolar disorder, and unaffected control individuals (n=35 each) has been obtained from the Stanley Medical Research Institute (Bethesda, Maryland) free of charge, specifically for the work outlined in this proposal. The Stanley Array Collection is a brain collection developed specifically for molecular studies using high throughput methodologies. DNA has been extracted from post-mortem brain specimens collected, with informed consent from next-of-kin, by participating medical examiners between January 1995 and June 2002. The specimens were all collected, processed, and stored in a standardized way. Exclusion criteria for all specimens included:

• Significant structural brain pathology on post-mortem examination by a qualified neuropathologist, or by pre-mortem imaging,
• History of significant focal neurological signs pre-mortem,
• History of central nervous system disease that could be expected to alter gene expression in a persistent way,
• Documented IQ < 70.
Additional exclusion criteria for unaffected controls included age less than 30 (thus, still in the period of maximum risk), and substance abuse within one year of death or evidence of significant alcohol-related changes in the liver. Diagnoses were made by two senior psychiatrists, using DSM-IV criteria, based on medical records and, when necessary, telephone interviews with family members. Diagnoses of unaffected controls were based on structured interviews by a senior psychiatrist with family member(s) to rule out Axis I diagnoses.

There are several distinct advantages in using the Stanley array samples for this study. First, there is a control group that is matched to the degree possible for age, ethnicity, gender and cause of death. A matched control group will allow polymorphic DNA repeat length changes that are present in both patients and controls to be recognized as unlikely candidate susceptibility loci. While this study could be done using DNA from brain, peripheral lymphocytes, or any other tissue, data obtained from peripheral lymphocytes would only be revealing in the case of inherited or constitutive changes. While brain tissue is essentially post-mitotic, an advantage of using DNA from brain is the opportunity to detect sporadic repeat expansions that might occur during brain development. Finally, brain tissue and high quality RNA samples are also available through the Stanley Medical Research Institute, and these tissue and RNA samples are derived from the same individuals, described above, that were the source of the DNA. This additional biological material will be an important resource for relating any observed unstable repeats to neuronal pathology in these diseases in subsequent studies.

4.7 Significance

Satellog enriches the current bioinformatics landscape in which repeats are viewed.
For example, the GAA repeat in Friedreich's Ataxia (Campuzano et al., 1996) is not detected at all (chr9:67,109,320-67,109,339) in the UCSC genome browser (Kent et al., 2002) by the TRF (Benson, 1999) and Variable Number Tandem Repeats (VNTR) tracks. The VNTR feature in UCSC detects all perfect repeats with unit lengths of 2 to 10 bp and 10 or more copies. Repeats detected by this method may over-represent insignificant low period repeats and under-represent potentially interesting high period repeats. In Satellog, not only is the Friedreich's Ataxia GAA repeat detected, but its p-value also suggests that this size of GAA repeat is a relatively rare observation in the human genome (P = 0.045). Satellog integrates disparate data sources to give researchers an idea of how interesting certain repeats are based on their genetic location, tissue expression profile and polymorphism. It should be noted that Satellog is not intended to be a de novo detection method for disease-associated repeats. Instead, it provides a comprehensive, integrated bioinformatics platform to prioritize repeats in a convenient and efficient manner. Satellog also presents the first comprehensive identification and integration of disease-associated repeats with other genomic resources for use as bioinformatics reagents in other studies. Satellog should prove useful to investigators interested in prioritizing repeats for typing in diseases showing anticipation or in which repeat polymorphism is thought to play a role in etiology, and as a general bioinformatics resource for microsatellite repeat studies.

Secondly, we have produced the first objective lists of candidate repeats for association studies in schizophrenia and bipolar disorder. Previous studies have arbitrarily selected repeats based on their prevalence in other etiologically distinct neurological diseases (for example, CAG/CTG repeats in inclusion disorders).
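To make the length p-values mentioned in this chapter concrete, the following minimal Python sketch treats the p-value as the fraction of repeats in the same class that are at least as long as the repeat of interest. That definition is an assumption made here for illustration (Satellog's actual calculation is described in 2.2.3.1 and may differ), as are the function name `length_p_value` and the toy length lists.

```python
def length_p_value(observed_length, class_lengths):
    """Fraction of repeats in a class with length >= observed_length.

    `class_lengths` holds the lengths of every repeat of the class in the
    genome, including the repeat being scored.
    """
    if not class_lengths:
        raise ValueError("the repeat class has no members")
    at_least_as_long = sum(1 for length in class_lengths if length >= observed_length)
    return at_least_as_long / len(class_lengths)


# A class with 19 short repeats and one long repeat: the long one is rare.
print(length_p_value(20, [5] * 19 + [20]))  # 0.05

# A class with a single member carries no information about rarity.
print(length_p_value(12, [12]))  # 1.0
```

Under this definition a class with a single member always yields 1, which matches the behaviour reported for the EPM1 dodecamer in 4.5.4.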
Our study is the first to objectively prioritize repeats in the human genome by integrating disparate bioinformatics resources. We also provide the infrastructure to dynamically redefine candidate sets based on new biological knowledge or research interests. We hope that Satellog will facilitate the identification of disease-associated repeats in schizophrenia and other ailments. Satellog is available as a freely downloadable MySQL and web-based database.

BIBLIOGRAPHY

American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC.
Ashburner, M., C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin and G. Sherlock (2000). "Gene ontology: tool for the unification of biology. The Gene Ontology Consortium." Nat Genet 25(1): 25-9.
Asherson, P., C. Walsh, J. Williams, M. Sargeant, C. Taylor, A. Clements, M. Gill, M. Owen and P. McGuffin (1994). "Imprinting and anticipation. Are they relevant to genetic studies of schizophrenia?" Br J Psychiatry 164(5): 619-24.
Ashizawa, T., C. J. Dunne, J. R. Dubel, M. B. Perryman, H. F. Epstein, E. Boerwinkle and J. F. Hejtmancik (1992). "Anticipation in myotonic dystrophy. I. Statistical verification based on clinical and haplotype findings." Neurology 42(10): 1871-7.
Banfi, S., A. Servadio, M. Y. Chung, T. J. Kwiatkowski, Jr., A. E. McCall, L. A. Duvick, Y. Shen, E. J. Roth, H. T. Orr and H. Y. Zoghbi (1994). "Identification and characterization of the gene causing type 1 spinocerebellar ataxia." Nat Genet 7(4): 513-20.
Bassett, A. S. and W. G. Honer (1994). "Evidence for anticipation in schizophrenia." Am J Hum Genet 54(5): 864-70.
Bassett, A. S. and J. Husted (1997). "Anticipation or ascertainment bias in schizophrenia? Penrose's familial mental illness sample."
Am J Hum Genet 60(3): 630-7.
Benson, G. (1999). "Tandem repeats finder: a program to analyze DNA sequences." Nucleic Acids Res 27(2): 573-80.
Berman, H. M., T. Battistuz, T. N. Bhat, W. F. Bluhm, P. E. Bourne, K. Burkhardt, Z. Feng, G. L. Gilliland, L. Iype, S. Jain, P. Fagan, J. Marvin, D. Padilla, V. Ravichandran, B. Schneider, N. Thanki, H. Weissig, J. D. Westbrook and C. Zardecki (2002). "The Protein Data Bank." Acta Crystallogr D Biol Crystallogr 58(Pt 6 No 1): 899-907.
Bray, N. J. and M. J. Owen (2001). "Searching for schizophrenia genes." Trends Mol Med 7(4): 169-74.
Brock, G. J., N. H. Anderson and D. G. Monckton (1999). "Cis-acting modifiers of expanded CAG/CTG triplet repeat expandability: associations with flanking GC content and proximity to CpG islands." Hum Mol Genet 8(6): 1061-7.
Brook, J. D., M. E. McCurrach, H. G. Harley, A. J. Buckler, D. Church, H. Aburatani, K. Hunter, V. P. Stanton, J. P. Thirion, T. Hudson and et al. (1992). "Molecular basis of myotonic dystrophy: expansion of a trinucleotide (CTG) repeat at the 3' end of a transcript encoding a protein kinase family member." Cell 68(4): 799-808.
Bullmore, E. T., S. Frangou and R. M. Murray (1997). "The dysplastic net hypothesis: an integration of developmental and dysconnectivity theories of schizophrenia." Schizophr Res 28(2-3): 143-56.
Campuzano, V., L. Montermini, M. D. Molto, L. Pianese, M. Cossee, F. Cavalcanti, E. Monros, F. Rodius, F. Duclos, A. Monticelli and et al. (1996). "Friedreich's ataxia: autosomal recessive disease caused by an intronic GAA triplet repeat expansion." Science 271(5254): 1423-7.
Cardno, A. G., E. J. Marshall, B. Coid, A. M. Macdonald, T. R. Ribchester, N. J. Davies, P. Venturi, L. A. Jones, S. W. Lewis, P. C. Sham, Gottesman, II, A. E. Farmer, P. McGuffin, A. M. Reveley and R. M. Murray (1999). "Heritability estimates for psychotic disorders: the Maudsley twin psychosis series." Arch Gen Psychiatry 56(2): 162-8.
Chong, S. S., A. E.
McCall, J. Cota, S. H. Subramony, H. T. Orr, M. R. Hughes and H. Y. Zoghbi (1995). "Gametic and somatic tissue-specific heterogeneity of the expanded SCA1 CAG repeat in spinocerebellar ataxia type 1." Nat Genet 10(3): 344-50.
Choudhry, S., M. Mukerji, A. K. Srivastava, S. Jain and S. K. Brahmachari (2001). "CAG repeat instability at SCA2 locus: anchoring CAA interruptions and linked single nucleotide polymorphisms." Hum Mol Genet 10(21): 2437-46.
Chung, M. Y., L. P. Ranum, L. A. Duvick, A. Servadio, H. Y. Zoghbi and H. T. Orr (1993). "Evidence for a mechanism predisposing to intergenerational CAG repeat instability in spinocerebellar ataxia type I." Nat Genet 5(3): 254-8.
Clark, R. M., G. L. Dalgliesh, D. Endres, M. Gomez, J. Taylor and S. I. Bidichandani (2004). "Expansion of GAA triplet repeats in the human genome: unique origin of the FRDA mutation at the center of an Alu." Genomics 83(3): 373-83.
Cleary, J. D. and C. E. Pearson (2003). "The contribution of cis-elements to disease-associated repeat instability: clinical and experimental evidence." Cytogenet Genome Res 100(1-4): 25-55.
Collins, J. R., R. M. Stephens, B. Gold, B. Long, M. Dean and S. K. Burt (2003). "An exhaustive DNA micro-satellite map of the human genome using high performance computing." Genomics 82(1): 10-9.
Cossee, M., M. Schmitt, V. Campuzano, L. Reutenauer, C. Moutou, J. L. Mandel and M. Koenig (1997). "Evolution of the Friedreich's ataxia trinucleotide repeat expansion: founder effect and premutations." Proc Natl Acad Sci U S A 94(14): 7452-7.
Crisponi, L., M. Deiana, A. Loi, F. Chiappe, M. Uda, P. Amati, L. Bisceglia, L. Zelante, R. Nagaraja, S. Porcu, M. S. Ristaldi, R. Marzella, M. Rocchi, M. Nicolino, A. Lienhardt-Roussie, A. Nivelon, A. Verloes, D. Schlessinger, P. Gasparini, D. Bonneau, A. Cao and G. Pilia (2001). "The putative forkhead transcription factor FOXL2 is mutated in blepharophimosis/ptosis/epicanthus inversus syndrome." Nat Genet 27(2): 159-66.
Crocq, M.
A., R. Mant, P. Asherson, J. Williams, Y. Hode, A. Mayerova, D. Collier, L. Lannfelt, P. Sokoloff, J. C. Schwartz and et al. (1992). "Association between schizophrenia and homozygosity at the dopamine D3 receptor gene." J Med Genet 29(12): 858-60.
Cummings, C. J. and H. Y. Zoghbi (2000). "Trinucleotide repeats: mechanisms and pathophysiology." Annu Rev Genomics Hum Genet 1: 281-328.
David, G., N. Abbas, G. Stevanin, A. Durr, G. Yvert, G. Cancel, C. Weber, G. Imbert, F. Saudou, E. Antoniou, H. Drabkin, R. Gemmill, P. Giunti, A. Benomar, N. Wood, M. Ruberg, Y. Agid, J. L. Mandel and A. Brice (1997). "Cloning of the SCA7 gene reveals a highly unstable CAG repeat expansion." Nat Genet 17(1): 65-70.
Degreef, G., M. Ashtari, B. Bogerts, R. M. Bilder, D. N. Jody, J. M. Alvir and J. A. Lieberman (1992). "Volumes of ventricular system subdivisions measured from magnetic resonance images in first-episode schizophrenic patients." Arch Gen Psychiatry 49(7): 531-7.
Eddy, S. R. (1998). "Profile hidden Markov models." Bioinformatics 14(9): 755-63.
Eichler, E. E., J. J. Holden, B. W. Popovich, A. L. Reiss, K. Snow, S. N. Thibodeau, C. S. Richards, P. A. Ward and D. L. Nelson (1994). "Length of uninterrupted CGG repeats determines instability in the FMR1 gene." Nat Genet 8(1): 88-94.
Filippova, G. N., C. P. Thienes, B. H. Penn, D. H. Cho, Y. J. Hu, J. M. Moore, T. R. Klesert, V. V. Lobanenkov and S. J. Tapscott (2001). "CTCF-binding sites flank CTG/CAG repeats and form a methylation-sensitive insulator at the DM1 locus." Nat Genet 28(4): 335-43.
Gorwood, P., M. Leboyer, B. Falissard, M. Jay, F. Rouillon and J. Feingold (1996). "Anticipation in schizophrenia: new light on a controversial problem." Am J Psychiatry 153(9): 1173-7.
Gourdon, G., F. Radvanyi, A. S. Lia, C. Duros, M. Blanche, M. Abitbol, C. Junien and H. Hofmann-Radvanyi (1997). "Moderate intergenerational and somatic instability of a 55-CTG repeat in transgenic mice."
Nat Genet 15(2): 190-2.
Gouw, L. G., M. A. Castaneda, C. K. McKenna, K. B. Digre, S. M. Pulst, S. Perlman, M. S. Lee, C. Gomez, K. Fischbeck, D. Gagnon, E. Storey, T. Bird, F. R. Jeri and L. J. Ptacek (1998). "Analysis of the dynamic mutation in the SCA7 gene shows marked parental effects on CAG repeat transmission." Hum Mol Genet 7(3): 525-32.
Haaf, T., G. Sirugo, K. K. Kidd and D. C. Ward (1996). "Chromosomal localization of long trinucleotide repeats in the human genome by fluorescence in situ hybridization." Nat Genet 12(2): 183-5.
Harrison, P. J. (1999). "The neuropathology of schizophrenia. A critical review of the data and their interpretation." Brain 122 (Pt 4): 593-624.
Heiden, A., U. Willinger, J. Scharfetter, K. Meszaros, S. Kasper and H. N. Aschauer (1999). "Anticipation in schizophrenia." Schizophr Res 35(1): 25-32.
Hubbard, T., D. Barker, E. Birney, G. Cameron, Y. Chen, L. Clark, T. Cox, J. Cuff, V. Curwen, T. Down, R. Durbin, E. Eyras, J. Gilbert, M. Hammond, L. Huminiecki, A. Kasprzyk, H. Lehvaslaiho, P. Lijnzaad, C. Melsopp, E. Mongin, R. Pettett, M. Pocock, S. Potter, A. Rust, E. Schmidt, S. Searle, G. Slater, J. Smith, W. Spooner, A. Stabenau, J. Stalker, E. Stupka, A. Ureta-Vidal, I. Vastrik and M. Clamp (2002). "The Ensembl genome database project." Nucleic Acids Res 30(1): 38-41.
Huntington's Disease Collaborative Research Group, T. (1993). "A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. The Huntington's Disease Collaborative Research Group." Cell 72(6): 971-83.
Ikeda, H., M. Yamaguchi, S. Sugai, Y. Aze, S. Narumiya and A. Kakizuka (1996). "Expanded polyglutamine in the Machado-Joseph disease protein induces cell death in vitro and in vivo." Nat Genet 13(2): 196-202.
Imamura, A., S. Honda, Y. Nakane and Y. Okazaki (1998). "Anticipation in Japanese families with schizophrenia." J Hum Genet 43(4): 217-23.
Imbert, G., F. Saudou, G. Yvert, D. Devys, Y.
Trottier, J. M. Garnier, C. Weber, J. L. Mandel, G. Cancel, N. Abbas, A. Durr, O. Didierjean, G. Stevanin, Y. Agid and A. Brice (1996). "Cloning of the gene for spinocerebellar ataxia 2 reveals a locus with high sensitivity to expanded CAG/glutamine repeats." Nat Genet 14(3): 285-91.
Imhof, A., X. J. Yang, V. V. Ogryzko, Y. Nakatani, A. P. Wolffe and H. Ge (1997). "Acetylation of general transcription factors by histone acetyltransferases." Curr Biol 7(9): 689-92.
Inayama, Y., H. Yoneda, T. Sakai, T. Ishida, Y. Nonomura, Y. Kono, R. Takahata, J. Koh, J. Sakai, A. Takai, Y. Inada and H. Asaba (1996). "Positive association between a DNA sequence variant in the serotonin 2A receptor gene and schizophrenia." Am J Med Genet 67(1): 103-5.
Johnson, J. E., J. Cleary, H. Ahsan, J. Harkavy Friedman, D. Malaspina, C. R. Cloninger, S. V. Faraone, M. T. Tsuang and C. A. Kaufmann (1997). "Anticipation in schizophrenia: biology or bias?" Am J Med Genet 74(3): 275-80.
Kashi, Y., D. King and M. Soller (1997). "Simple sequence repeats as a source of quantitative genetic variation." Trends Genet 13(2): 74-8.
Kawaguchi, Y., T. Okamoto, M. Taniwaki, M. Aizawa, M. Inoue, S. Katayama, H. Kawakami, S. Nakamura, M. Nishimura, I. Akiguchi and et al. (1994). "CAG expansions in a novel gene for Machado-Joseph disease at chromosome 14q32.1." Nat Genet 8(3): 221-8.
Kent, W. J. (2002). "BLAT--the BLAST-like alignment tool." Genome Res 12(4): 656-64.
Kent, W. J., C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler and D. Haussler (2002). "The human genome browser at UCSC." Genome Res 12(6): 996-1006.
Kim, J. S., H. H. Kornhuber, W. Schmid-Burgk and B. Holzmuller (1980). "Low cerebrospinal fluid glutamate in schizophrenic patients and a new hypothesis on schizophrenia." Neurosci Lett 20(3): 379-82.
Koide, R., T. Ikeuchi, O. Onodera, H. Tanaka, S. Igarashi, K. Endo, H. Takahashi, R. Kondo, A. Ishikawa, T. Hayashi and et al. (1994). "Unstable expansion of CAG repeat in hereditary dentatorubral-pallidoluysian atrophy (DRPLA)." Nat Genet 6(1): 9-13.
Koob, M. D., K. A. Benzow, T. D. Bird, J. W. Day, M. L. Moseley and L. P. Ranum (1998). "Rapid cloning of expanded trinucleotide repeat sequences from genomic DNA." Nat Genet 18(1): 72-5.
Kornberg, R. D. and Y. Lorch (1999). "Twenty-five years of the nucleosome, fundamental particle of the eukaryote chromosome." Cell 98(3): 285-94.
Kremer, E. J., M. Pritchard, M. Lynch, S. Yu, K. Holman, E. Baker, S. T. Warren, D. Schlessinger, G. R. Sutherland and R. I. Richards (1991). "Mapping of DNA instability at the fragile X to a trinucleotide repeat sequence p(CCG)n." Science 252(5013): 1711-4.
Kunst, C. B. and S. T. Warren (1994). "Cryptic and polar variation of the fragile X repeat could result in predisposing normal alleles." Cell 77(6): 853-61.
La Spada, A. R., R. I. Richards and B. Wieringa (2004). "Dynamic mutations on the move in Banff." Nat Genet 36(7): 667-70.
La Spada, A. R., E. M. Wilson, D. B. Lubahn, A. E. Harding and K. H. Fischbeck (1991). "Androgen receptor gene mutations in X-linked spinal and bulbar muscular atrophy." Nature 352(6330): 77-9.
Lalioti, M. D., H. S. Scott, C. Buresi, C. Rossier, A. Bottani, M. A. Morris, A. Malafosse and S. E. Antonarakis (1997). "Dodecamer repeat expansion in cystatin B gene in progressive myoclonus epilepsy." Nature 386(6627): 847-51.
Lander, E. S., L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, R. Funke, D. Gage, K. Harris, A. Heaford, J. Howland, L. Kann, J. Lehoczky, R. LeVine, P. McEwan, K. McKernan, J. Meldrim, J. P. Mesirov, C. Miranda, W. Morris, J. Naylor, C. Raymond, M. Rosetti, R. Santos, A. Sheridan, C. Sougnez, N. Stange-Thomann, N. Stojanovic, A. Subramanian, D. Wyman, J. Rogers, J. Sulston, R. Ainscough, S. Beck, D. Bentley, J. Burton, C. Clee, N. Carter, A. Coulson, R. Deadman, P. Deloukas, A. Dunham, I. Dunham, R.
Durbin, L. French, D. Grafham, S. Gregory, T. Hubbard, S. Humphray, A. Hunt, M. Jones, C. Lloyd, A. McMurray, L. Matthews, S. Mercer, S. Milne, J. C. Mullikin, A. Mungall, R. Plumb, M. Ross, R. Shownkeen, S. Sims, R. H. Waterston, R. K. Wilson, L. W. Hillier, J. D. McPherson, M. A. Marra, E. R. Mardis, L. A. Fulton, A. T. Chinwalla, K. H. Pepin, W. R. Gish, S. L. Chissoe, M. C. Wendl, K. D. Delehaunty, T. L. Miner, A. Delehaunty, J. B. Kramer, L. L. Cook, R. S. Fulton, D. L. Johnson, P. J. Minx, S. W. Clifton, T. Hawkins, E. Branscomb, P. Predki, P. Richardson, S. Wenning, T. Slezak, N. Doggett, J. F. Cheng, A. Olsen, S. Lucas, C. Elkin, E. Uberbacher, M. Frazier, R. A. Gibbs, D. M. Muzny, S. E. Scherer, J. B. Bouck, E. J. Sodergren, K. C. Worley, C. M. Rives, J. H. Gorrell, M. L. Metzker, S. L. Naylor, R. S. Kucherlapati, D. L. Nelson, G. M. Weinstock, Y. Sakaki, A. Fujiyama, M. Hattori, T. Yada, A. Toyoda, T. Itoh, C. Kawagoe, H. Watanabe, Y. Totoki, T. Taylor, J. Weissenbach, R. Heilig, W. Saurin, F. Artiguenave, P. Brottier, T. Bruls, E. Pelletier, C. Robert, P. Wincker, D. R. Smith, L. Doucette-Stamm, M. Rubenfield, K. Weinstock, H. M. Lee, J. Dubois, A. Rosenthal, M. Platzer, G. Nyakatura, S. Taudien, A. Rump, H. Yang, J. Yu, J. Wang, G. Huang, J. Gu, L. Hood, L. Rowen, A. Madan, S. Qin, R. W. Davis, N. A. Federspiel, A. P. Abola, M. J. Proctor, R. M. Myers, J. Schmutz, M. Dickson, J. Grimwood, D. R. Cox, M. V. Olson, R. Kaul, N. Shimizu, K. Kawasaki, S. Minoshima, G. A. Evans, M. Athanasiou, R. Schultz, B. A. Roe, F. Chen, H. Pan, J. Ramser, H. Lehrach, R. Reinhardt, W. R. McCombie, M. de la Bastide, N. Dedhia, H. Blocker, K. Hornischer, G. Nordsiek, R. Agarwala, L. Aravind, J. A. Bailey, A. Bateman, S. Batzoglou, E. Birney, P. Bork, D. G. Brown, C. B. Burge, L. Cerutti, H. C. Chen, D. Church, M. Clamp, R. R. Copley, T. Doerks, S. R. Eddy, E. E. Eichler, T. S. Furey, J. Galagan, J. G. Gilbert, C. Harmon, Y. Hayashizaki, D.
Haussler, H. Hermjakob, K. Hokamp, W. Jang, L. S. Johnson, T. A. Jones, S. Kasif, A. Kaspryzk, S. Kennedy, W. J. Kent, P. Kitts, E. V. Koonin, I. Korf, D. Kulp, D. Lancet, T. M. Lowe, A. McLysaght, T. Mikkelsen, J. V. Moran, N. Mulder, V. J. Pollara, C. P. Ponting, G. Schuler, J. Schultz, G. Slater, A. F. Smit, E. Stupka, J. Szustakowski, D. Thierry-Mieg, J. Thierry-Mieg, L. Wagner, J. Wallis, R. Wheeler, A. Williams, Y. I. Wolf, K. H. Wolfe, S. P. Yang, R. F. Yeh, F. Collins, M. S. Guyer, J. Peterson, A. Felsenfeld, K. A. Wetterstrand, A. Patrinos, M. J. Morgan, J. Szustakowki, P. de Jong, J. J. Catanese, K. Osoegawa, H. Shizuya, S. Choi and Y. J. Chen (2001). "Initial sequencing and analysis of the human genome." Nature 409(6822): 860-921.
Levitsky, V. G., O. A. Podkolodnaya, N. A. Kolchanov and N. L. Podkolodny (2001). "Nucleosome formation potential of eukaryotic DNA: calculation and promoters analysis." Bioinformatics 17(11): 998-1010.
Matsuura, T., T. Yamagata, D. L. Burgess, A. Rasmussen, R. P. Grewal, K. Watase, M. Khajavi, A. E. McCall, C. F. Davis, L. Zu, M. Achari, S. M. Pulst, E. Alonso, J. L. Noebels, D. L. Nelson, H. Y. Zoghbi and T. Ashizawa (2000). "Large expansion of the ATTCT pentanucleotide repeat in spinocerebellar ataxia type 10." Nat Genet 26(2): 191-4.
McCarley, R. W., C. G. Wible, M. Frumin, Y. Hirayasu, J. J. Levitt, I. A. Fischer and M. E. Shenton (1999). "MRI anatomy of schizophrenia." Biol Psychiatry 45(9): 1099-119.
McGuffin, P., M. J. Owen and A. E. Farmer (1995). "Genetic basis of schizophrenia." Lancet 346(8976): 678-82.
Metzgar, D. and C. Wills (2000). "Evidence for the adaptive evolution of mutation rates." Cell 101(6): 581-4.
Mochizuki, R., M. Dateki, K. Yanai, Y. Ishizuka, N. Amizuka, H. Kawashima, Y. Koga, H. Ozawa and A. Fukamizu (2003). "Targeted disruption of the neurochondrin/norbin gene results in embryonic lethality." Biochem Biophys Res Commun 310(4): 1219-26.
Montermini, L., E. Andermann, M.
Labuda, A. Richter, M. Pandolfo, F. Cavalcanti, L. Pianese, L. Iodice, G. Farina, A. Monticelli, M. Turano, A. Filla, G. De Michele and S. Cocozza (1997). "The Friedreich ataxia GAA triplet repeat: premutation and normal alleles." Hum Mol Genet 6(8): 1261-6.
Morel, B. (1857). "Traite des degenerescences." J.B. Bailliere. 1.
Mott, F. (1910). "Hereditary aspects of nervous and mental diseases." Br Med J 2: 1013-1020.
Nakamoto, M., H. Takebayashi, Y. Kawaguchi, S. Narumiya, M. Taniwaki, Y. Nakamura, Y. Ishikawa, I. Akiguchi, J. Kimura and A. Kakizuka (1997). "A CAG/CTG expansion in the normal population." Nat Genet 17(4): 385-6.
Ohara, K., H. D. Xu, N. Mori, Y. Suzuki, D. S. Xu and Z. C. Wang (1997). "Anticipation and imprinting in schizophrenia." Biol Psychiatry 42(9): 760-6.
Ohlsson, R., R. Renkawitz and V. Lobanenkov (2001). "CTCF is a uniquely versatile transcription regulator linked to epigenetics and disease." Trends Genet 17(9): 520-7.
Pearson, C. E., E. E. Eichler, D. Lorenzetti, S. F. Kramer, H. Y. Zoghbi, D. L. Nelson and R. R. Sinden (1998). "Interruptions in the triplet repeats of SCA1 and FRAXA reduce the propensity and complexity of slipped strand DNA (S-DNA) formation." Biochemistry 37(8): 2701-8.
Penrose, L. (1948). "The problem of anticipation in pedigrees of dystrophia myotonica." Ann Eugenics 14: 125-132.
Potter and Hollister (2001). Antipsychotic Agents & Lithium. Basic & Clinical Pharmacology, McGraw-Hill Publishers.
R Development Core Team (2004). R: A language and environment for statistical computing.
Rebhan, M., V. Chalifa-Caspi, J. Prilusky and D. Lancet (1997). "GeneCards: integrating information about genes, proteins and diseases." Trends Genet 13(4): 163.
Rice, P., I. Longden and A. Bleasby (2000). "EMBOSS: the European Molecular Biology Open Software Suite." Trends Genet 16(6): 276-7.
Risch, N. J. (2000). "Searching for genetic determinants in the new millennium." Nature 405(6788): 847-56.
Roberts, E. (1972). "Prospects for research on schizophrenia. An hypothesis suggesting that there is a defect in the GABA system in schizophrenia." Neurosci Res Program Bull 10(4): 468-82.
Ross, C. A., R. L. Margolis, M. W. Becher, J. D. Wood, S. Engelender, J. K. Cooper and A. H. Sharp (1998). "Pathogenesis of neurodegenerative diseases associated with expanded glutamine repeats: new answers, new questions." Prog Brain Res 117: 397-419.
Schalling, M., T. J. Hudson, K. H. Buetow and D. E. Housman (1993). "Direct detection of novel expanded trinucleotide repeats in the human genome." Nat Genet 4(2): 135-9.
Seidah, N. G., S. Benjannet, L. Wickham, J. Marcinkiewicz, S. B. Jasmin, S. Stifani, A. Basak, A. Prat and M. Chretien (2003). "The secretory proprotein convertase neural apoptosis-regulated convertase 1 (NARC-1): liver regeneration and neuronal differentiation." Proc Natl Acad Sci U S A 100(3): 928-33.
Shmueli, O., S. Horn-Saban, V. Chalifa-Caspi, M. Shmoish, R. Ophir, H. Benjamin-Rodrig, M. Safran, E. Domany and D. Lancet (2003). "GeneNote: whole genome expression profiles in normal human tissues." C R Biol 326(10-11): 1067-72.
Sklar, P. (2002). "Linkage analysis in psychiatric disorders: the emerging picture." Annu Rev Genomics Hum Genet 3: 371-413.
Stajich, J. E., D. Block, K. Boulez, S. E. Brenner, S. A. Chervitz, C. Dagdigian, G. Fuellen, J. G. Gilbert, I. Korf, H. Lapp, H. Lehvaslaiho, C. Matsalla, C. J. Mungall, B. I. Osborne, M. R. Pocock, P. Schattner, M. Senger, L. D. Stein, E. Stupka, M. D. Wilkinson and E. Birney (2002). "The Bioperl toolkit: Perl modules for the life sciences." Genome Res 12(10): 1611-8.
Sterner, D. E., P. A. Grant, S. M. Roberts, L. J. Duggan, R. Belotserkovskaya, L. A. Pacella, F. Winston, J. L. Workman and S. L. Berger (1999). "Functional organization of the yeast SAGA complex: distinct components involved in structural integrity, nucleosome acetylation, and TATA-binding protein interaction." Mol Cell Biol 19(1): 86-98.
Subramanian, S., V. M. Madgula, R. George, R. K. Mishra, M. W. Pandit, C. S. Kumar and L. Singh (2002). "MRD: a microsatellite repeats database for prokaryotic and eukaryotic genomes." Genome Biol 3(12): PREPRINT0011.
Thibaut, F., M. Martinez, M. Petit, M. Jay and D. Campion (1995). "Further evidence for anticipation in schizophrenia." Psychiatry Res 59(1-2): 25-33.
Tsuang, M. (2000). "Schizophrenia: genes and environment." Biol Psychiatry 47(3): 210-20.
Tsutsumi, T., S. E. Holmes, M. G. McInnis, A. Sawa, C. Callahan, J. R. DePaulo, C. A. Ross, L. E. DeLisi and R. L. Margolis (2004). "Novel CAG/CTG repeat expansion mutations do not contribute to the genetic risk for most cases of bipolar disorder or schizophrenia." Am J Med Genet 124B(1): 15-9.
Valero, J., L. Martorell, J. Marine, E. Vilella and A. Labad (1998). "Anticipation and imprinting in Spanish families with schizophrenia." Acta Psychiatr Scand 97(5): 343-50.
Verkerk, A. J., M. Pieretti, J. S. Sutcliffe, Y. H. Fu, D. P. Kuhl, A. Pizzuti, O. Reiner, S. Richards, M. F. Victoria, F. P. Zhang and et al. (1991). "Identification of a gene (FMR-1) containing a CGG repeat coincident with a breakpoint cluster region exhibiting length variation in fragile X syndrome." Cell 65(5): 905-14.
Vincent, J. B., A. D. Paterson, E. Strong, A. Petronis and J. L. Kennedy (2000). "The unstable trinucleotide repeat story of major psychosis." Am J Med Genet 97(1): 77-97.
Weinberger, D. R. (1995). "From neuropathology to neurodevelopment." Lancet 346(8974): 552-7.
Wheeler, D. L., D. M. Church, R. Edgar, S. Federhen, W. Helmberg, T. L. Madden, J. U. Pontius, G. D. Schuler, L. M. Schriml, E. Sequeira, T. O. Suzek, T. A. Tatusova and L. Wagner (2004). "Database resources of the National Center for Biotechnology Information: update." Nucleic Acids Res 32 Database issue: D35-40.
Wooley, D. W. and E. Shaw (1954). "A biochemical and pharmacological suggestion about certain mental disorders." Proc. Natl. Acad. Sci. U.
S. A. 40: 228-231.
Yaw, J., M. Myles-Worsley, M. Hoff, J. Holik, R. Freedman, W. Byerley and H. Coon (1996). "Anticipation in multiplex schizophrenia pedigrees." Psychiatr Genet 6(1): 7-11.
Zhang, Y., D. G. Monckton, M. J. Siciliano, T. H. Connor and M. L. Meistrich (2002). "Age and insertion site dependence of repeat number instability of a human DM1 transgene in individual mouse sperm." Hum Mol Genet 11(7): 791-8.
Zhuchenko, O., J. Bailey, P. Bonnen, T. Ashizawa, D. W. Stockton, C. Amos, W. B. Dobyns, S. H. Subramony, H. Y. Zoghbi and C. C. Lee (1997). "Autosomal dominant cerebellar ataxia (SCA6) associated with small polyglutamine expansions in the alpha 1A-voltage-dependent calcium channel." Nat Genet 15(1): 62-9.

APPENDIX A
polyQ.pl

#!/usr/local/bin/perl -w
# polyq_ens_july28_2003.pl
# usage: polyq_ens.pl trf_cag_build33.out
# Stef Butland, updated by Perseus Missirlis July 25, 2003

use strict;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

####################
# Global Variables #
####################
# EnsEMBL API
my $host    = 'kaka.sanger.ac.uk';
my $user    = 'anonymous';
my $db_name = 'homo_sapiens_core_15_33';

###############################
# Connect to EnsEMBL with API #
###############################
my $db = new Bio::EnsEMBL::DBSQL::DBAdaptor(-host => $host, -user => $user, -dbname => $db_name);
my $slice_adaptor = $db->get_SliceAdaptor;

#######################################################################
# FILEHANDLES                                                         #
# six outfiles for coordinates, one for each category                 #
# will need sequence outfiles for the three polyQ gene categories     #
# the polyQ gene categories represent the diff priorities with        #
# a polyQ known gene with multiple aa repeats as the top priority for #
# disease candidate                                                   #
#######################################################################
open GENE, ">> gene.out" or die "cannot
open/write to gene.out: $!\n";
open GENEQ, ">> geneq.out" or die "cannot open/write to geneq.out: $!\n";
open GENEQSEQ, ">> geneqseq.out" or die "cannot open/write to geneqseq.out: $!\n";

###############################################################################
# Message to print at start of sequence output...hmmm....print to which file? #
###############################################################################
print GENEQSEQ "EnsEMBL info\: host\: $host\, db\: $db_name Note that the sequences in this file are all from the plus strand regardless of the strand on which a given gene is encoded. These sequences are intended for PCR primer prediction. Each sequence is 500bp on either side of the repeat in question.\n";

############################################
# Message to print at start of data output #
############################################
print GENE "EnsEMBL info\: host\: $host\, db\: $db_name\nThe genes listed in this file contain >=8 CAG/CTG repeats (not necessarily 8 pure rpts) but did not have >=5 Q's in protein sequence. If the coding region of one of these genes is not properly represented (example: true start ATG is farther upstream), the true protein may contain >=5 Q's. This was the case for HD protein in EnsEMBL 110 but the full length protein is represented in RefSeq. Therefore, this file can be combed for more polyQ candidates by checking coding sequence from different db's.\n";
print GENEQ "EnsEMBL info\: host\: $host\, db\: $db_name\nThis file contains a list of genes that contain >=8 CAGs (not necessarily 8 pure rpts) in nucleotide sequence (as predicted by Tandem Repeat Finder) and >=5 Qs in protein sequence.
Each gene is listed as 'known' or 'novel' according to EnsEMBL definitions. Known genes are EnsEMBL predicted genes with SwissProt or RefSeq records. For known genes, this file will contain accession numbers for external databases. Occasionally, a known gene will lack a description line. Novel genes will lack descriptions and accessions to external databases.\n\nTranslation numbers for each candidate gene indicate, where the same candidate gene has been listed, that there are multiple transcripts in EnsEMBL.\n";

##################################################################################
# read in file specified on the cmd line                                         #
# read multiple coords of CAG/CTG repeat locations from STDIN one line at a time #
##################################################################################
while (<>) {
  chomp;

  # put elements into an array
  my @rpt_coords = split /\t/;
  print "array values: $rpt_coords[0] $rpt_coords[1] $rpt_coords[2] $rpt_coords[3] $rpt_coords[5] $rpt_coords[16]\n\n";

  # pull out coords of interest
  my $chrom      = $rpt_coords[1];
  my $chromStart = $rpt_coords[2];
  my $chromEnd   = $rpt_coords[3];
  my $rptsize    = $rpt_coords[5];
  my $rptUnit    = $rpt_coords[16];

  # following line strips text "chr" from "chr1, chrX etc" in UCSC simpleRepeat table for use in EnsEMBL coord system
  $chrom =~ s/chr(.+)/$1/;

  #############
  # Get Slice #
  #############
  my $slice = $slice_adaptor->fetch_by_chr_start_end($chrom,$chromStart,$chromEnd);
  my $rptRegion = $slice->seq;

  ############################
  # Get all genes from slice #
  ############################
  my $genes = $slice->get_all_Genes;
  foreach my $gene (@$genes) {

    #################################
    # Get all transcripts from slice #
    #################################
    my $transcripts = $gene->get_all_Transcripts;
    my $i = 0;
    foreach my $transcript (@$transcripts) {
      $i++;
      my $peptide = $transcript->translate;
      if ($peptide->seq =~ /(Q{5,})/i) {
        my $qlength = length $1;
        print GENEQ " \n";
        print GENEQ join "| ", @rpt_coords,"\n";
        print GENEQ "Translation " . $i . " of " . $gene->stable_id . " has " . $qlength . " Qs in its first Q run\n";
        print GENEQ "http://www.ensembl.org/perl/geneview?gene=" . $gene->stable_id . "\n";
        print GENEQ "Description: " . $gene->description . "\n";
        print GENEQ "List of amino acid runs of five or more:\n";

        #############################################################################
        # screen for multiple amino acid runs here                                  #
        # code from http://lists.evolt.org/archive/Week-of-Mon-20010326/153470.html #
        # works in independent code test with peptide string below                  #
        # test with hd gene coords                                                  #
        #############################################################################
        my @myChar;
        my $x;
        my $compare;
        my $myCompare;
        my $repeated = 1;
        $myChar[0] = substr($peptide->seq, 0, 1);
        for ($x = 1; $x < length($peptide->seq); $x++) {
          $myChar[$x] = substr($peptide->seq, $x, 1);
          $compare = $x - 1;
          $myCompare = $myChar[$compare];
          if ($myCompare eq $myChar[$x]) {
            $repeated++;
          }
          else {
            print GENEQ "$repeated $myCompare\n" if $repeated >= 5;
            $repeated = 1;
          }
        }
        print GENEQ "$repeated $myCompare\n" if $repeated >= 5;

        ###########################
        # Calculate repeat purity #
        ###########################
        my $max = 0;
        while ($rptRegion =~ /(($rptUnit)+)/g) {
          my $len = $1;
          my $l_hit = (length($len))/3;
          if ($l_hit > $max) {
            $max = $l_hit;
          }
        }
        print GENEQ "Slice has a maximum of $max pure $rptUnit repeats\n\n";

        ###################################################################################
        # cat
DNA seq of slice to file in fasta format using chromosome coords in defline #
        ###################################################################################
        print GENEQSEQ "\>chr$chrom\: " . ($chromStart-500) . "\-" . ($chromEnd+500) . " " . $rptUnit . " (CAG-type) rpt in " . $gene->stable_id . " " . $gene->description . "\n";
        my $primer_slice = $slice_adaptor->fetch_by_chr_start_end($chrom,$chromStart-500,$chromEnd+500);
        print GENEQSEQ $primer_slice->seq . "\n";

        ##########################################################################
        # Collect supplementary information if slice contains known EnsEMBL gene #
        ##########################################################################
        if ($gene->is_known) {
          print GENEQ "Known gene\n";
          print GENEQ "DBLinks and synonyms:\n";
          foreach my $link (@{$gene->get_all_DBLinks}) {
            print GENEQ $link->display_id . " " . $link->database . "\n";
            my @syns = @{$link->get_all_synonyms};
            print GENEQ "@syns\n";
          }
        }
        else {
          print GENEQ "Not a known gene\n";
        }
      }
      else {
        print GENE " \n";
        print GENE join "| ", @rpt_coords,"\n";
        print GENE "No polyQ in this translation\n";
      }
    }
  }
}

#####################
# close filehandles #
#####################
close GENE or warn "errors while closing gene.out: $!\n";
close GENEQ or warn "errors while closing geneq.out: $!\n";
close GENEQSEQ or warn "errors while closing geneqseq.out: $!\n";
exit;

APPENDIX B
Flat file of candidate CAG/CTG repeats (GeMS) (used as input to flanker.pl)

Gene        EnsEMBL Gene ID  Chr  Start      End        Repeat
SCA3_MJD    ENSG00000066427  14   90527395   90527437   CTG
DMPK        ENSG00000104936  19   50949512   50949573   CAG
SCA7        ENSG00000163635  3    63753267   63753299   GCA
SCA2        ENSG00000089232  12   111819606  111819676  GCT
HD          ENSG00000125387  4    3113331    3113395    CAG
DRPLA       ENSG00000111676  12   6925140    6925199    CAG
SCA1        ENSG00000124788  6    16390403   16390494   TGC
SBMA        ENSG00000169083  X    64998383   64998486   GCA
SI7E_HUMAN  ENSG00000117069  1    76750189   76750226   GCA
TNRC4       ENSG00000159409  1    148453802  148453843  TGC
RORC        ENSG00000143365  1    148552932  148552983  GCT
KIAA0476    ENSG00000130568  1    150682577  150682628  CTG
KCNN3       ENSG00000143603  1    151617621  151617660  GCT
TNS         ENSG00000079308  2    218676908  218676937  GCT
IRS1        ENSG00000169047  2    227628083  227628127  GCT
TNRC15      ENSG00000066216  2    233676219  233676265  CAG
SATB1       ENSG00000182568  3    18265417   18265463   CTG
BAIAP1      ENSG00000151276  3    65280467   65280529   CTG
Q8IVF3      ENSG00000176542  3    114657333  114657376  TGC
HYP_95.5    ENSG00000138756  4    80184862   80184945   CAG
TNRC3       ENSG00000179637  4    141277244  141277331  TGC
TFEB        ENSG00000112561  6    41660233   41660265   GCT
RUNX2       ENSG00000124813  6    45391833   45391899   CAG
POU3F2      ENSG00000184486  6    99283278   99283355   GCA
NM_175863   ENSG00000049618  6    157054532  157054585  CAG
TBP         ENSG00000112592  6    170546468  170546579  GCA
RD_POU      ENSG00000106536  7    39086067   39086098   CAG
FOXP2       ENSG00000128573  7    113810616  113810739  GCA
PAXIP1L     ENSG00000157212  7    154075553  154075591  TGC
SMARCA2     ENSG00000080503  9    2029754    2029837    GCA
NM_152786   ENSG00000157653  9    109641295  109641327  CAG
CIZ1_HUMAN  ENSG00000148337  9    124406704  124406797  CTG
MAML2       ENSG00000184384  11   96009730   96009957   TGC
PRDMO       ENSG00000170325  11   129813900  129813925  CTG
FLJ31638    ENSG00000151065  12   1941584    1941613    TGC
ZNF384      ENSG00000126746  12   6656325    6656374    GCT
EDR1        ENSG00000111752  12   8985591    8985638    GCA
MLL2        ENSG00000167548  12   49143997   49144033   TGC
PHC1        ENSG00000179899  12   55523949   55523996   GCT
PHLDA1      ENSG00000139289  12   76141663   76141715   TGC
ASCL1       ENSG00000139352  12   103285098  103285155  GCA
NCOR2       ENSG00000139720  12   124611001  124611053  GCT
EP400       ENSG00000183495  12   132427572  132427656  GCA
C14orf4     ENSG00000119669  14   75483802   75483867   TGC
BRIGHT      ENSG00000179361  15   72412214   72412254   CAG
POLG        ENSG00000140521  15   87464052   87464093   GCT
MEF2A       ENSG00000068305  15   97845438   97845472   CAG
CREBBP      ENSG00000005339  16   3778685    3778739    TGC
TNRC6       ENSG00000090905  16   24715802   24715902   CAG
094795      ENSG00000168286  16   67612229   67612317   GCA
NFAT5       ENSG00000102908  16   69462945   69462986   CAG
MINK-1      ENSG00000141503  17   4738742    4738792    GCA
RAI1        ENSG00000108557  17   17640149   17640190   CAG
S0C6_HUMAN  ENSG00000174111  17   36419000   36419025   AGC
ZNF161      ENSG00000136451  17   56398683   56398723   TGC
CACNA1A     ENSG00000141837  19   13163881   13163921   CTG
BRD4        ENSG00000141867  19   15194877   15194932   GCT
CHERP       ENSG00000085872  19   16485772   16485809   GCT
NUMBL       ENSG00000105245  19   45849908   45849973   GCT
NCOA6       ENSG00000088297  20   34014378   34014452   TGC
PRKCBP1     ENSG00000101040  20   46505971   46506011   GCT
NCOA3       ENSG00000124151  20   46918236   46918323   GCA
PCQAP       ENSG00000099917  22   19245310   19245403   CAG
MN1         ENSG00000169184  22   26520156   26520207   TGC
KIAA1093    ENSG00000100354  22   38940215   38940240   GCA
MKL1        ENSG00000100361  22   39225928   39225985   TGC
TNRC11      ENSG00000184634  X    68594302   68594402   AGC
KIAA1817    ENSG00000147234  X    104879384  104879465  CAG
CXorf6      ENSG00000013619  X    147410589  147410627  GCA

APPENDIX C
flanker.pl

#!/usr/local/bin/perl -w
# flanker.pl
# Mightily committed to code by Perseus Missirlis (perseus@bioinformatics.ubc.ca / perseus@canada.com)
# Bioinformatics Graduate Student
# UBC Bioinformatics Centre (UBiC) at the Centre for Molecular Medicine and Therapeutics (CMMT)
# http://bioinformatics.ubc.ca/
# Last update: Aug 27, 2003

use strict;
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use DBI;
use Data::Dumper;

####################
# Global Variables #
####################
# EnsEMBL API
my $host         = 'ensembldb.ensembl.org';
my $user         = 'anonymous';
my $db_name      = 'homo_sapiens_core_16_33';
my $prog_version = '0.4';

# DBI
my ($dsn)       = "DBI:mysql:gems_cis:stent.cmmt.ubc.ca";
my ($user_name) = "gems_rw";
my ($password)  = "g7e6m5";
my ($dbh, $sth);
my (@ary);

#######################
# Connect to Database #
#######################
$dbh = DBI->connect ($dsn, $user_name, $password, { RaiseError => 1 });

##############################
# Insert to build_info table #
##############################
print "\n\n###################################\n";
print "# Inserting into build_info table #\n";
print "###################################";
$sth = $dbh->prepare ("INSERT INTO build_info VALUES('$host','$db_name',NOW())");
$sth->execute ();

######################################
# Check for proper command-line file #
######################################
my($USAGE) = "
+-----------------------------------------------------------------------------------------+
| Automatic Sequence Analysis Tool for Disease-Associated Repeat Instability Studies -v$prog_version |
| UBC Bioinformatics                                                                      |
| By: Perseus Missirlis (perseus\@canada.com)                                             |
+-----------------------------------------------------------------------------------------+\n
USAGE: $0 coordfile
OPTIONS: None so far!\n\n";

unless(@ARGV) {
  print $USAGE;
  exit;
}

############################################
# Figure out how many
sequences to compare #
############################################
print "\n
+-----------------------------------------------------------------------------------------+
| Automatic Sequence Analysis Tool for Disease-Associated Repeat Instability Studies -v$prog_version |
| UBC Bioinformatics                                                                      |
| By: Perseus Missirlis (perseus\@canada.com)                                             |
+-----------------------------------------------------------------------------------------+\n\n";

################################
# Open files for sequence data #
################################
my $outputfile1 = "seq.fa";
unless ( open(SEQ, ">$outputfile1") ) {
  die "Cannot open file \"$outputfile1\" to write to!\n\n";
}
my $outputfile2 = "seq.fa.masked";
unless ( open(REP_SEQ, ">$outputfile2") ) {
  die "Cannot open file \"$outputfile2\" to write to!\n\n";
}
my $outputfile3 = "gc.txt";
unless ( open(GC, ">$outputfile3") ) {
  die "Cannot open file \"$outputfile3\" to write to!\n\n";
}
print GC "Gene\tGC\%_50bp\tGC\%_100bp\tGC\%_150bp\tGC\%_200bp\tGC\%_250bp\tGC\%_300bp\tGC\%_350bp\tGC\%_400bp\tGC\%_450bp\tGC\%_500bp\tGC\%_1000bp\n";
my $outputfile4 = "ctcf_scores_dist.txt";
unless ( open(CTCF_SCORE, ">$outputfile4") ) {
  die "Cannot open file \"$outputfile4\" to write to!\n\n";
}
print CTCF_SCORE "Gene\tScore\tDistance\n";
my $outputfile5 = "ctcf_scores_dist2.txt";
unless ( open(CTCF_SCORE2, ">$outputfile5") ) {
  die "Cannot open file \"$outputfile5\" to write to!\n\n";
}
# header goes to the second CTCF file (the original printed it to CTCF_SCORE a second time)
print CTCF_SCORE2 "Gene\tScore\tDistance\n";

#################################
# Specify input file from STDIN #
#################################
my $aln_filename = "$ARGV[0]";
unless ( -e $aln_filename) {
  die "File \"$aln_filename\" doesn\'t seem to exist!!
\n";
}
unless ( open(ALN_FILE, $aln_filename) ) {
  die "Cannot open file \"$aln_filename\"\n\n";
}

####################################
# Collect User Input Prior to Loop #
####################################
# Sequence flanking the repeat
print "How much sequence flanking the repeat do you wish to collect? (0-100,000) -> ";
my $flanker = <STDIN>;
$flanker =~ s/,//g;
$flanker =~ s/\s//g;
while ( ($flanker < 0) || ($flanker > 100000)) {
  print "\nIncorrect amount of flanking sequence, try again (0-100,000) : ";
  $flanker = <STDIN>;
  $flanker =~ s/,//g;
  $flanker =~ s/\s//g;
}

###############################
# Connect to EnsEMBL with API #
###############################
my $db = new Bio::EnsEMBL::DBSQL::DBAdaptor(-host => $host, -user => $user, -dbname => $db_name);
my $slice_adaptor = $db->get_SliceAdaptor;

############################################################
# Get info from flat file to interact with the EnsEMBL API #
############################################################
my $n = 1;
my $input_filename;
my $outputfile;
while (<ALN_FILE>) {
  chomp;

  #########################################
  # EnsEMBL API sequence extraction phase #
  #########################################
  # in this regex
  # $1 collects the common gene name
  # $2 collects the EnsEMBL gene ID
  # $3 collects the last digit from the EnsEMBL gene ID, it's an effect of the internal brackets
  # $4 collects the chromosome number
  # $5 collects the repeat start position in chromosome $4
  # $6 collects the repeat end position in chromosome $4
  # $7 collects the repeat unit
  if (/^(\w+)\s+(ENSG(\d){11})\s+(\S+)\s+(\d+)\s+(\d+)\s+(\w+)/) {
    print "1 is $1, 2 is $2, 3 is $3, 4 is $4, 5 is $5, 6 is $6, 7 is $7\n";
    my $gene_name = $1;
    $gene_name =~ s/\s//g;
    my $gene_id = $2;
    $gene_id =~ s/\s//g;
    $gene_id =~ tr/a-z/A-Z/;
    my $chrom = $4;
    my $rep_start = $5;
    my $rep_end = $6;
    my $repeat_unit = $7;

    ########################
    # Insert to gems table #
    ########################
    $sth = $dbh->prepare ("INSERT INTO gems VALUES('$gene_name','$gene_id')");
    $sth->execute ();

    ###################################
    # Collect flanking GC percentages #
    ###################################
    # 50
    my $slice_left_50 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-50,$rep_start-1);
    my $slice_right_50 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+50);
    my $left_50 = $slice_left_50->seq;
    my $right_50 = $slice_right_50->seq;
    my $gc_50_seq = $left_50 . $right_50;
    my $g=0; my $c=0;
    my $gc_length_50 = length($gc_50_seq);
    while($gc_50_seq =~ /g/ig){$g++}
    while($gc_50_seq =~ /c/ig){$c++}
    my $gc_50 = ($g+$c)/$gc_length_50;

    # 100
    my $slice_left_100 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-100,$rep_start-1);
    my $slice_right_100 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+100);
    my $left_100 = $slice_left_100->seq;
    my $right_100 = $slice_right_100->seq;
    my $gc_100_seq = $left_100 . $right_100;
    $g=0; $c=0;
    my $gc_length_100 = length($gc_100_seq);
    while($gc_100_seq =~ /g/ig){$g++}
    while($gc_100_seq =~ /c/ig){$c++}
    my $gc_100 = ($g+$c)/$gc_length_100;

    # 150
    my $slice_left_150 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-150,$rep_start-1);
    my $slice_right_150 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+150);
    my $left_150 = $slice_left_150->seq;
    my $right_150 = $slice_right_150->seq;
    my $gc_150_seq = $left_150 .
    $right_150;
    $g=0; $c=0;
    my $gc_length_150 = length($gc_150_seq);
    while($gc_150_seq =~ /g/ig){$g++}
    while($gc_150_seq =~ /c/ig){$c++}
    my $gc_150 = ($g+$c)/$gc_length_150;

    # 200
    my $slice_left_200 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-200,$rep_start-1);
    my $slice_right_200 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+200);
    my $left_200 = $slice_left_200->seq;
    my $right_200 = $slice_right_200->seq;
    my $gc_200_seq = $left_200 . $right_200;
    $g=0; $c=0;
    my $gc_length_200 = length($gc_200_seq);
    while($gc_200_seq =~ /g/ig){$g++}
    while($gc_200_seq =~ /c/ig){$c++}
    my $gc_200 = ($g+$c)/$gc_length_200;

    # 250
    my $slice_left_250 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-250,$rep_start-1);
    my $slice_right_250 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+250);
    my $left_250 = $slice_left_250->seq;
    my $right_250 = $slice_right_250->seq;
    my $gc_250_seq = $left_250 . $right_250;
    $g=0; $c=0;
    my $gc_length_250 = length($gc_250_seq);
    while($gc_250_seq =~ /g/ig){$g++}
    while($gc_250_seq =~ /c/ig){$c++}
    my $gc_250 = ($g+$c)/$gc_length_250;

    # 300
    my $slice_left_300 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-300,$rep_start-1);
    my $slice_right_300 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+300);
    my $left_300 = $slice_left_300->seq;
    my $right_300 = $slice_right_300->seq;
    my $gc_300_seq = $left_300 .
    $right_300;
    $g=0; $c=0;
    my $gc_length_300 = length($gc_300_seq);
    while($gc_300_seq =~ /g/ig){$g++}
    while($gc_300_seq =~ /c/ig){$c++}
    my $gc_300 = ($g+$c)/$gc_length_300;

    # 350
    my $slice_left_350 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-350,$rep_start-1);
    my $slice_right_350 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+350);
    my $left_350 = $slice_left_350->seq;
    my $right_350 = $slice_right_350->seq;
    my $gc_350_seq = $left_350 . $right_350;
    $g=0; $c=0;
    my $gc_length_350 = length($gc_350_seq);
    while($gc_350_seq =~ /g/ig){$g++}
    while($gc_350_seq =~ /c/ig){$c++}
    my $gc_350 = ($g+$c)/$gc_length_350;

    # 400
    my $slice_left_400 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-400,$rep_start-1);
    my $slice_right_400 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+400);
    my $left_400 = $slice_left_400->seq;
    my $right_400 = $slice_right_400->seq;
    my $gc_400_seq = $left_400 . $right_400;
    $g=0; $c=0;
    my $gc_length_400 = length($gc_400_seq);
    while($gc_400_seq =~ /g/ig){$g++}
    while($gc_400_seq =~ /c/ig){$c++}
    my $gc_400 = ($g+$c)/$gc_length_400;

    # 450
    my $slice_left_450 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-450,$rep_start-1);
    my $slice_right_450 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+450);
    my $left_450 = $slice_left_450->seq;
    my $right_450 = $slice_right_450->seq;
    my $gc_450_seq = $left_450 .
    $right_450;
    $g=0; $c=0;
    my $gc_length_450 = length($gc_450_seq);
    while($gc_450_seq =~ /g/ig){$g++}
    while($gc_450_seq =~ /c/ig){$c++}
    my $gc_450 = ($g+$c)/$gc_length_450;

    # 500
    my $slice_left_500 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-500,$rep_start-1);
    my $slice_right_500 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+500);
    my $left_500 = $slice_left_500->seq;
    my $right_500 = $slice_right_500->seq;
    my $gc_500_seq = $left_500 . $right_500;
    $g=0; $c=0;
    my $gc_length_500 = length($gc_500_seq);
    while($gc_500_seq =~ /g/ig){$g++}
    while($gc_500_seq =~ /c/ig){$c++}
    my $gc_500 = ($g+$c)/$gc_length_500;

    # 1000
    my $slice_left_1000 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-1000,$rep_start-1);
    my $slice_right_1000 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+1000);
    my $left_1000 = $slice_left_1000->seq;
    my $right_1000 = $slice_right_1000->seq;
    my $gc_1000_seq = $left_1000 .
    $right_1000;
    $g=0; $c=0;
    my $gc_length_1000 = length($gc_1000_seq);
    while($gc_1000_seq =~ /g/ig){$g++}
    while($gc_1000_seq =~ /c/ig){$c++}
    my $gc_1000 = ($g+$c)/$gc_length_1000;
    print GC "$gene_name\t$gc_50\t$gc_100\t$gc_150\t$gc_200\t$gc_250\t$gc_300\t$gc_350\t$gc_400\t$gc_450\t$gc_500\t$gc_1000\n";

    ######################
    # Insert to gc table #
    ######################
    print "\n\n###########################\n";
    print "# Inserting into gc table #\n";
    print "###########################\n\n";
    $sth = $dbh->prepare ("INSERT INTO gc VALUES('$gene_name','$gc_50','$gc_100','$gc_150','$gc_200','$gc_250','$gc_300','$gc_350','$gc_400','$gc_450','$gc_500','$gc_1000')");
    $sth->execute ();

    ####################################
    # Create new directories for files #
    ####################################
    mkdir "$gene_name" or warn "Cannot make $gene_name directory: $!";
    chdir "./$gene_name" or die "cannot chdir to ./$gene_name: $!";
    print "current directory is: ";
    system("pwd");
    print "\n\n";
    chomp(my $pwd = `pwd`);
    $pwd =~ s!/disk2/home2!//STENT!;

    ################################
    # Open files for sequence data #
    ################################
    my $outputfile1 = "$gene_name.fa";
    unless ( open(SEQ2, ">$outputfile1") ) {
      die "Cannot open file \"$outputfile1\" to write to!\n\n";
    }
    my $outputfile2 = "$gene_name.fa.masked";
    unless ( open(REP_SEQ2, ">$outputfile2") ) {
      die "Cannot open file \"$outputfile2\" to write to!\n\n";
    }
    my $outputfile3 = "$gene_name.repeat";
    unless ( open(REPEAT, ">$outputfile3") ) {
      die "Cannot open file \"$outputfile3\" to write to!\n\n";
    }
    my $outputfile4 = "exons.gff";
    unless ( open(EXONS, ">$outputfile4") ) {
      die "Cannot open file \"$outputfile4\" to write to!\n\n";
    }
    my $outputfile5 = "cpg.gff";
    unless ( open(CPG, ">$outputfile5") ) {
      die "Cannot open file \"$outputfile5\" to write to!\n\n";
    }
    my $outputfile6 = "hmm_test.fa";
    unless ( open(HMM, ">$outputfile6") ) {
      die "Cannot open file \"$outputfile6\" to write to!\n\n";
    }
    my $outputfile7 = "rep_coords.txt";
    unless ( open(REP_COORD, ">$outputfile7") ) {
      die "Cannot open file \"$outputfile7\" to write to!\n\n";
    }
    my $outputfile8 = "rep_purity.txt";
    unless ( open(REP_PURE, ">$outputfile8") ) {
      die "Cannot open file \"$outputfile8\" to write to!\n\n";
    }
    my $outputfile9 = "alu.txt";
    unless ( open(ALU, ">$outputfile9") ) {
      die "Cannot open file \"$outputfile9\" to write to!\n\n";
    }
    my $outputfile10 = "rep_elements.txt";
    unless ( open(REP_EL, ">$outputfile10") ) {
      die "Cannot open file \"$outputfile10\" to write to!\n\n";
    }

    # repeat
    my $slice = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start,$rep_end);
    my $chr_name        = $slice->chr_name;
    my $chr_start       = $slice->chr_start;
    my $chr_end         = $slice->chr_end;
    my $strand          = $slice->strand;
    my $repeat_lengther = $slice->length;
    my $strand2;
    if ($strand == 1) {
      $strand2 = "+";
    }
    elsif ($strand == -1) {
      $strand2 = "-";
    }
    my $rep_total_seq = $slice->seq;
    print "\n$gene_name Repeat Slice Genomic info:\n";
    print "********************************\n\n";
    print "Chromosome       : " . $chr_name . "\n";
    print "Start            : " . $chr_start . "\n";
    print "End              : " . $chr_end . "\n";
    print "Strand           : " . $strand2 . "\n";
    print "Length of repeat : " . $repeat_lengther . "\n";
    print "EnsEMBL server   : " . $host . "\n";
    print "Database         : " . $db_name . "\n";
    print "Repeat           : \n\n";
    my $repeat_length = $repeat_lengther;
    print $rep_total_seq . "\n";

    ##########################
    # R plot for repeat unit #
    ##########################
    my $i = $flanker + 1;
    my $end = $i + $repeat_length;
    while ($i < $end) {
      print REP_COORD "$i\t1\n";
      $i++;
    }
    close(REP_COORD);

    ########################
    # Print repeat to disk #
    ########################
    print REPEAT ">$gene_name | $gene_id | $db_name | repeat co-ordinates: Chr$chrom$strand2\:$rep_start\-$rep_end\ | length = $repeat_lengther bp\n";
    print REPEAT "$rep_total_seq\n\n";
    close(REPEAT);

    ###########################
    # Calculate repeat purity #
    ###########################
    my $max = 0;
    while ($rep_total_seq =~ /(($repeat_unit)+)/g) {
      my $len = $1;
      my $l_hit = (length($len))/3;
      if ($l_hit > $max) {
        $max = $l_hit;
      }
    }
    print REP_PURE "$max $repeat_unit";
    close(REP_PURE);

    ###############################
    # Insert to repeat_feat table #
    ###############################
    print "\n\n##################################\n";
    print "# Inserting into gems_feat table #\n";
    print "##################################\n\n";
    $sth = $dbh->prepare ("INSERT INTO gems_feat VALUES('$gene_name','$chr_name','$strand2','$chr_start','$chr_end','$repeat_unit','$rep_total_seq','$repeat_lengther','$max',NULL)");
    $sth->execute ();

    #####################################
    # Get repeat plus flanking sequence #
    #####################################
    $slice = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-$flanker,$rep_end+$flanker);
    my $invert_strand = $slice->invert;
    $chr_name  = $slice->chr_name;
    $chr_start = $slice->chr_start;
    $chr_end   = $slice->chr_end;
    $strand    = $slice->strand;
    $strand2   = 0;
    my $lengther = $slice->length;
    my $genes = $slice->get_all_G
e n e s ; foreach my $gene(@$genes) { ##################### # C o l l e c t a l l exons # ##################### my $temp_id = $gene->stable_id; p r i n t "$temp_id before loop \n"; i f ($temp_id eq $gene_id) { p r i n t "$temp_id i n loop\n"; p r i n t "\nStrand of $temp_id i s " . $gene->strand . "\n\n"; i f ($gene->strand == 1) { $strand2 = "+"; } e l s i f ($gene->strand == -1 ) { $strand2 = »-"; } 134 p r i n t "\n#########################\n"; p r i n t "# ensembl API #\n" ; p r i n t "# C o l l e c t i n g a l l exons. #\n"; p r i n t "#########################\n\n"; my $exons = $gene->get_all_Exons; foreach my $exon(®$exons) { p r i n t EXONS $exon->gffstring . "\n" } close(EXONS); p r i n t "Done.\n\n"; $chr_name $chr s t a r t $lengther "\n\n"; my $ t o t a l _ s e q = $slice->seq; my $repeat_seq = $slice->get_repeatmasked_seq; my $rep_seq = $repeat_seq->seq; my $i n v e r t _ s e q = $invert_strand->seq; my $ i n v e r t _ s e q _ c o l l e c t = $invert_strand->get_repeatmasked_seq; my $invert_seq_rep = $invert_seq_collect'->seq; p r i n t "\n$gene_name S l i c e Genomic i n f o : \ n " ; p r i n t "EnsEMBL Gene ID . "\n"; p r i n t "Chromosome . "\n"; p r i n t " S t a r t . "\n"; p r i n t "End . "\n"; p r i n t "Strand . "\n"; p r i n t "Length of repeat plus f l a n k i n g sequence . "\n" ; p r i n t "EnsEMBL server . 
"\n"; p r i n t "Database $gene_id $chr_end $strand2 $host $db_name # P r i n t f l a n k i n g r e p e t i t i v e sequence to d i s k p r i n t SEQ2 ">$gene_name | $gene_id | $db_name | Chr$chr_name\+\:$chr_start\-$chr_end\ | repeat co-ordinates: Chr$chrom\+\:$rep_start\-$rep_end\ | f l a n k i n g sequence up/down-stream of repeat: $ f l a n k e r | length: $lengther\n"; p r i n t SEQ2 "$total_seq\n\n"; p r i n t REP_SEQ2 ">$gene_name | $gene_id | $db_name | Chr$chr_name\+\:$chr_start\-$chr_end\ | repeat co-ordinates: Chr$chrom\+\:$rep_start\-$rep_end\ | f l a n k i n g sequence up/down-stream of repeat: $ f l a n k e r | length: $lengther\n"; p r i n t REP_SEQ2 "$rep_seq\n\n"; p r i n t HMM ">$gene_name | $gene_id | $db_name | Chr$chr_name\+\:$chr_start\-$chr_end\ | repeat co-ordinates: Chr$chrom\+\:$rep_start\-$rep_end\ | f l a n k i n g sequence up/down-stream of repeat: $ f l a n k e r |\n"; p r i n t HMM "$total_seq\n\n"; p r i n t HMM ">$gene_name | $gene_id | $db_name | Chr$chr_name\-\:$chr_start\-$chr_end\ | repeat co-ordinates: Chr$chrom\-\:$rep_start\-$rep_end\ | f l a n k i n g sequence up/down-stream of repeat: $ f l a n k e r |\n"; p r i n t HMM "$invert_seq\n\n"; close(SEQ2); close(REP_SEQ2); close(HMM); ############################ # Ins e r t to f l a n k i n g t a b l e # ############################ p r i n t "#################################\n"; 135 p r i n t "# I n s e r t i n g i n t o f l a n k i n g t a b l e #\n" ; p r i n t "#################################\n\n"; $sth = $dbh->prepare ("INSERT INTO f l a n k i n g VALUES( 1$gene_name','$chr_name', 1$strand2', 1$chr_start 1,'$chr_end 1,'$total_seq', 1$invert_ se q 1 , 1 $ r e p e a t _ s e q ' , ' $ i n v e r t _ s e q _ r e p ' ) " ) ; $sth->execute ( ) ; ################## # CpG p l o t phase # ################## ##################################### # Get a l l CpG i s l a n d s - EnsEMBL ver # ##################################### p r i n t "###############################\n" ; p r i 
n t "# Scanning f o r CpG Islands #\n"; p r i n t "# EnsEMBL plus EMBOSS running #\n"; p r i n t "###############################\n\n"; my $simple_feature_adaptor = $db->get_SimpleFeatureAdaptor; my $cpg_islands = $simple_feature_adaptor-> f e t c h _ a l l _ b y _ S l i c e _ a n d _ s c o r e ( $ s l i c e , 50, 'CpG'); foreach my $cpg (®$cpg_islands) { # p r i n t CPG "Label: ", $cpg->display_label, "\n"; # p r i n t CPG "Seqname: ", $cpg->seqname, "\n"; p r i n t CPG $cpg->start . " \ t " ; p r i n t CPG $cpg->end . " \ t " ; p r i n t CPG $cpg->score, "\n"; my $cpg_start = $cpg->start; my $cpg_end = $cpg->end; my $cpg_score = $cpg->score; ####################### # In s e r t t o cpg t a b l e # ####################### p r i n t "############################\n"; p r i n t "# I n s e r t i n g i n t o cpg t a b l e #\n"; p r i n t "############################\n\n"; $sth = $dbh->prepare ("INSERT INTO cpg VALUES('$gene_name', 1$cpg_start','$cpg_end','$cpg_score')"); $sth->execute ( ) ; } close(CPG); p r i n t "CpG data c o l l e c t e d ! \ n \ n " ; ########## # EMBOSS # ########## system("cpgplot $gene_name.fa -graph cps -window 500 - s h i f t 100 -minlen 200 -minoe 0.6 -minpc 50 - o u t f i l e $gene_name.plotfile > $gene_name\_gc.txt"); system("ps2pdf cpgplot.ps c p g p l o t . p d f " ) ; system("mv cpgplot.pdf $gene_name.pdf"); ############# # Run HMMer # ############# p r i n t "\n###################################\n"; p r i n t "# HMMer vl.84 #\n"; p r i n t "# Scanning f o r CTCF b i n d i n g s i t e s #\n"; p r i n t "###################################\n\n"; 136 system("hmmfs -c ../ctcf-md.hmm $gene_name\.fa > $gene_name\_ctcf.out"); p r i n t "Done!\n\n"; ##################### # C o l l e c t a l l Alu's # ##################### my $repeats = $slic e - > g e t _ a l l _ R e p e a t F e a t u r e s ; foreach my $repeat(®$repeats) { i f ($repeat->repeat_consensus->name eq 'AluY') { p r i n t ALU $r e p e a t - > g f f s t r i n g . " \ t " . 
          $repeat->repeat_consensus->name . "\t" .
          $repeat->repeat_consensus->repeat_class . "\n"
      }
    }
    close(ALU);

    #######################
    # Collect all Repeats #
    #######################
    $repeats = $slice->get_all_RepeatFeatures;
    foreach my $repeat (@$repeats) {
      print REP_EL $repeat->gffstring . "\t" .
          $repeat->repeat_consensus->name . "\t" .
          $repeat->repeat_consensus->repeat_class . "\n";
    }
    close(REP_EL);

    $input_filename = "rep_elements.txt";
    unless ( open(IN_FILE, "$input_filename") ) {
      die "Cannot open file \"$input_filename\" to write to!\n\n";
    }
    while (<IN_FILE>) {
      chomp;
      if (/^\S+\.\S+\-\S+\s+\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\d+)\s(\S+)\s+\S+\s+\S+\s+(\S+)\s+(\S+)/) {
        my $r_start = $1;
        my $r_end = $2;
        my $r_score = $3;
        my $r_strand = $4;
        my $r_name = $5;
        my $r_class = $6;
        my $r_distance;
        if ($r_start > $flanker) {
          $r_distance = $r_start - ($flanker + $repeat_lengther);
        } elsif ($r_start < $flanker) {
          $r_distance = $flanker - $r_end;
        }
        print "\n";
        print "################################\n";
        print "# Inserting into repeats table #\n";
        print "################################\n\n";
        $sth = $dbh->prepare ("INSERT INTO repeats VALUES('$gene_name','$r_name','$r_class','$r_start','$r_end','$r_score','$r_strand','$r_distance')");
        $sth->execute();
      }
    }

    ####################
    # Create R_scripts #
    ####################
    #########
    # Exons #
    #########
    $input_filename = "exons.gff";
    unless ( open(IN_FILE, "$input_filename") ) {
      die "Cannot open file \"$input_filename\" to write to!\n\n";
    }
    $outputfile = "exons_R.txt";
    unless ( open(EXON_R, ">$outputfile") ) {
      die "Cannot open file \"$outputfile\" to write to!\n\n";
    }
    while (<IN_FILE>) {
      chomp;
      ##################################
      # generate R plot file for exons #
      ##################################
      if (/^\S+\s+\S+\s+\S+\s+(\S+)\s+(\S+)\s+\S+/) {
        my $i = $1;
        my $end = $2;
        print "##############################\n";
        print "# Inserting into exons table #\n";
        print "##############################\n\n";
        $sth = $dbh->prepare ("INSERT INTO exons VALUES('$gene_name','$1','$2')");
        $sth->execute();
        while ($i <= $end) {
          print EXON_R "$i\t0.5\n";
          $i++;
        }
      }
    }
    close(IN_FILE);
    close(EXON_R);

    #######
    # CpG #
    #######
    $input_filename = "cpg.gff";
    unless ( open(IN_FILE, "$input_filename") ) {
      die "Cannot open file \"$input_filename\" to write to!\n\n";
    }
    $outputfile = "cpg_R.txt";
    unless ( open(CPG_R, ">$outputfile") ) {
      die "Cannot open file \"$outputfile\" to write to!\n\n";
    }
    while (<IN_FILE>) {
      chomp;
      ################################
      # generate R plot file for CpG #
      ################################
      print "Creating R plots...\n";
      if (/^(\S+)\s+(\S+)\s+(\S+)/) {
        my $i = $1;
        my $end = $2;
        while ($i <= $end) {
          print CPG_R "$i\t$3\n";
          $i++;
        }
      }
    }
    close(IN_FILE);
    close(CPG_R);

    ########
    # CTCF #
    ########
    $input_filename = "$gene_name\_ctcf.out";
    unless ( open(IN_FILE, "$input_filename") ) {
      die "Cannot open file \"$input_filename\" to write to!\n\n";
    }
    $outputfile = "hmm_R.txt";
    unless ( open(HMM_R, ">$outputfile") ) {
      die "Cannot open file \"$outputfile\" to write to!\n\n";
    }
    while (<IN_FILE>) {
      chomp;
      #######################################
      # generate R plot file for ctcf sites #
      #######################################
      my $c_distance;
      if (/^(\d+\.\d+)\s+(\d+)\s+(\d+)/) {
        print "1 is $1 and 2 is $2 and 3 is $3\n";
        if ($2 < $3) {
          if ($2 > $flanker) {
            $c_distance = $2 - ($flanker + $repeat_lengther)
          } elsif ($2 < $flanker) {
            $c_distance = $flanker - $3;
          }
          print "\n###############################\n";
          print "# Inserting into ctcf table 1 #\n";
          print "###############################\n\n";
          $sth = $dbh->prepare ("INSERT INTO ctcf VALUES('$gene_name','$1','$2','$3','+','$c_distance')");
          $sth->execute();
          my $i = $2;
          my $end = $3;
          while ($i <= $end) {
            print HMM_R "$i\t$1\n";
            $i++;
          }
        } elsif ($2 > $3) {
          if ($3 > $flanker) {
            $c_distance = $3 - ($flanker + $repeat_lengther)
          } elsif ($3 < $flanker) {
            $c_distance = $flanker - $2;
          }
          print "\n###############################\n";
          print "# Inserting into ctcf table 2 #\n";
          print "###############################\n\n";
          $sth = $dbh->prepare ("INSERT INTO ctcf VALUES('$gene_name','$1','$3','$2','-','$c_distance')");
          $sth->execute();
          my $i = $3;
          my $end = $2;
          while ($i <= $end) {
            print HMM_R "$i\t$1\n";
            $i++;
          }
        }
      }
    }
    close(IN_FILE);
    close(HMM_R);
    print "Done!\n\n";

    ############
    # GC Stats #
    ############
    $input_filename = "$gene_name\_gc.txt";
    unless ( open(IN_FILE, "$input_filename") ) {
      die "Cannot open file \"$input_filename\" to write to!\n\n";
    }
    $outputfile = "gc_R.txt";
    unless ( open(GC_R, ">$outputfile") ) {
      die "Cannot open file \"$outputfile\" to write to!\n\n";
    }
    my @obsex;
    my @gc;
    $i = 1;
    my $j = 1;
    my $n = $lengther + 2;
    $end = ($lengther*2) + 1;
    while (<IN_FILE>) {
      chomp;
      ########################################
      # Parse cpgplot info for gc statistics #
      ########################################
      if (/^(\d+\.\d+)/) {
        if ($i <= $lengther) {
          $obsex[$i] = $1;
          $i++;
        } elsif (($n <= $end && $i > $lengther)) {
          $gc[$j] = $1;
          $n++;
          $j++;
        }
      }
    }
    close(IN_FILE);

    #####################
    # Print to gc_R.txt #
    #####################
    $i = 1;
    while ($i <= $lengther) {
      if ($gc[$i] > 0) {
        print GC_R "$i\t$obsex[$i]\t" . ($gc[$i]/100) . "\n";
        my $obs_in = $obsex[$i];
        my $gc_in = ($gc[$i]/100);
        print "\n";
        print "###########################\n";
        print "# Inserting into gc_plots #\n";
        print "###########################\n\n";
        $sth = $dbh->prepare ("INSERT INTO gc_plots VALUES('$gene_name','$i','$obs_in','$gc_in')");
        $sth->execute();
      }
      $i++;
    }
    close(GC_R);

    ########
    # AluY #
    ########
    $input_filename = "alu.txt";
    unless ( open(IN_FILE, "$input_filename") ) {
      die "Cannot open file \"$input_filename\" to write to!\n\n";
    }
    $outputfile = "alu_R.txt";
    unless ( open(ALU_R, ">$outputfile") ) {
      die "Cannot open file \"$outputfile\" to write to!\n\n";
    }
    while (<IN_FILE>) {
      chomp;
      if (/^\S+\.\S+\-\S+\s+\S+\s+\S+\s+(\d+)\s+(\d+)\s+(\d+)\s(\S+)/) {
        my $i = $1;
        my $end = $2;
        while ($i <= $end) {
          print ALU_R "$i\t0.5\n";
          $i++;
        }
      }
    }
    close(ALU_R);

    #####################
    # Print R plot file #
    #####################
    $outputfile = "$gene_name\.R";
    unless ( open(R_PLOT, ">$outputfile") ) {
      die "Cannot open file \"$outputfile\" to write to!\n\n";
    }
    print R_PLOT "\# FILE\: $gene_name.R\n";
    print R_PLOT "# AUTH: Perseus Missirlis <perseus\@bioinformatics.ubc.ca>\n";
    print R_PLOT "\n";
    print R_PLOT "# $gene_name\n";
    print R_PLOT "\n";
    print R_PLOT "$gene_name\.rep = read.delim(\"$pwd/rep_coords.txt\", header\=FALSE)\;\n";
    print R_PLOT "$gene_name\.exons = read.delim(\"$pwd/exons_R.txt\", header=FALSE);\n";
    print R_PLOT "$gene_name\.cpg = read.delim(\"$pwd/cpg_R.txt\", header=FALSE);\n";
    print R_PLOT "$gene_name\.ctcf = read.delim(\"$pwd/hmm_R.txt\", header=FALSE);\n";
    print R_PLOT "$gene_name\.alu = read.delim(\"$pwd/alu_R.txt\", header=FALSE);\n";
    print R_PLOT "$gene_name\.gc = read.delim(\"$pwd/gc_R.txt\", header=FALSE);\n";
    print R_PLOT "$gene_name\.flank\.0 = read.delim(\"$pwd/ctcf_dint.txt\", header=FALSE);\n";
    print R_PLOT "$gene_name\.flank\.100 = read.delim(\"$pwd/ctcf_dint_100.txt\", header=FALSE);\n";
    print R_PLOT "$gene_name\.flank\.500 = read.delim(\"$pwd/ctcf_dint_500.txt\", header=FALSE);\n";
    print R_PLOT "\n";
    print R_PLOT "x <- dim($gene_name\.rep)[1];\n";
    print R_PLOT "mylim = $flanker*2 + x;\n";
    print R_PLOT "\n";
    print R_PLOT "plot($gene_name\.cpg\$V1, (($gene_name\.cpg[,2]/max($gene_name\.cpg[,2]))/2), xlim=c(0, mylim), col=\"green\", xlab=\"$gene_name\", cex=0.5, type=\"h\", ylim=c(0,max($gene_name\.gc\$V2)), ylab=\"score\");\n";
    print R_PLOT "lines($gene_name\.exons\$V1, $gene_name\.exons\$V2, col=\"black\", cex=0.5, type=\"h\")\n";
    print R_PLOT "lines($gene_name\.ctcf\$V1, (($gene_name\.ctcf[,2]/max($gene_name\.ctcf[,2]))/3), col=\"red\", cex=0.5, type=\"h\")\n";
    print R_PLOT "lines($gene_name\.alu\$V1, $gene_name\.alu[,2], col=\"blue\", cex=0.5, type=\"h\");\n";
    print R_PLOT "lines($gene_name\.rep\$V1, $gene_name\.rep\$V2, col=\"purple\", cex=0.5, type=\"h\");\n";
    print R_PLOT "lines($gene_name\.gc\$V1, $gene_name\.gc\$V2, col=\"orange\", cex=0.5, type=\"l\", lty=1);\n";
    print R_PLOT "lines($gene_name\.gc\$V1, $gene_name\.gc\$V3, col=\"blue\", cex=0.5, type=\"l\", lty=2);\n";
    print R_PLOT "\n";
    print R_PLOT "plot($gene_name\.cpg\$V1, (($gene_name\.cpg[,2]/max($gene_name\.cpg[,2]))/2), xlim=c(" . ($flanker-1000) . ", " . ($flanker+1000) . "+x), col=\"green\", xlab=\"$gene_name\", cex=0.5, type=\"h\", ylim=c(0,max($gene_name\.gc\$V2)), ylab=\"score\");\n";
    print R_PLOT "lines($gene_name\.exons\$V1, $gene_name\.exons\$V2, col=\"black\", cex=0.5, type=\"h\")\n";
    print R_PLOT "lines($gene_name\.ctcf\$V1, (($gene_name\.ctcf[,2]/max($gene_name\.ctcf[,2]))/3), col=\"red\", cex=0.5, type=\"h\");\n";
    print R_PLOT "lines($gene_name\.alu\$V1, $gene_name\.alu[,2], col=\"blue\", cex=0.5, type=\"h\");\n";
    print R_PLOT "lines($gene_name\.rep\$V1, $gene_name\.rep\$V2, col=\"purple\", cex=0.5, type=\"h\");\n";
    print R_PLOT "lines($gene_name\.gc\$V1, $gene_name\.gc\$V2, col=\"orange\", cex=0.5, type=\"l\", lty=1);\n";
    print R_PLOT "lines($gene_name\.gc\$V1, $gene_name\.gc\$V3, col=\"blue\", cex=0.5, type=\"l\", lty=2);\n";
    print R_PLOT "abline(0.5,0);\n";
    print R_PLOT "\n";
    print R_PLOT "plot($gene_name\.cpg\$V1, (($gene_name\.cpg[,2]/max($gene_name\.cpg[,2]))/2), xlim=c(" . ($flanker-10000) . ", " . ($flanker+10000) . "+x), col=\"green\", xlab=\"$gene_name\", cex=0.5, type=\"h\", ylim=c(0,max($gene_name\.gc\$V2)), ylab=\"score\");\n";
    print R_PLOT "lines($gene_name\.exons\$V1, $gene_name\.exons\$V2, col=\"black\", cex=0.5, type=\"h\")\n";
    print R_PLOT "lines($gene_name\.ctcf\$V1, (($gene_name\.ctcf[,2]/max($gene_name\.ctcf[,2]))/3), col=\"red\", cex=0.5, type=\"h\");\n";
    print R_PLOT "lines($gene_name\.alu\$V1, $gene_name\.alu[,2], col=\"blue\", cex=0.5, type=\"h\");\n";
    print R_PLOT "lines($gene_name\.rep\$V1, $gene_name\.rep\$V2, col=\"purple\", cex=0.5, type=\"h\");\n";
    print R_PLOT "lines($gene_name\.gc\$V1, $gene_name\.gc\$V2, col=\"orange\", cex=0.5, type=\"l\", lty=1);\n";
    print R_PLOT "lines($gene_name\.gc\$V1, $gene_name\.gc\$V3, col=\"blue\", cex=0.5, type=\"l\", lty=2);\n";
    print R_PLOT "abline(0.5,0);\n";
    close(R_PLOT);

    ##########################################
    # Open HMMer file for CTCF binding sites #
    ##########################################
    $aln_filename = "$gene_name\_ctcf.out";
    unless ( -e $aln_filename) {
      die "File \"$aln_filename\" doesn\'t seem to exist!!\n";
    }
    unless ( open(CTCF_SITES, $aln_filename) ) {
      die "Cannot open file \"$aln_filename\"\n\n";
    }

    ################################################
    # Open file for CTCF binding sites di-nt stats #
    ################################################
    my $outputfile = "ctcf_dint.txt";
    unless ( open(CTCF, ">$outputfile") ) {
      die "Cannot open file \"$outputfile\" to write to!\n\n";
    }
    print CTCF "Score\t" . "aa\t" . "at\t" . "ag\t" . "ac\t" . "ta\t" . "tt\t" . "tg\t" . "tc\t" . "ga\t" . "gt\t" . "gg\t" . "gc\t" . "ca\t" . "ct\t" . "cg\t" . "cc\t" . "errors\n";
    $outputfile = "ctcf_dint_100.txt";
    unless ( open(CTCF_100, ">$outputfile") ) {
      die "Cannot open file \"$outputfile\" to write to!\n\n"
    }
    print CTCF_100 "Score\t" . "aa\t" . "at\t" . "ag\t" . "ac\t" . "ta\t" . "tt\t" . "tg\t" . "tc\t" . "ga\t" . "gt\t" . "gg\t" . "gc\t" . "ca\t" . "ct\t" . "cg\t" . "cc\t" . "errors\n";
    $outputfile = "ctcf_dint_500.txt";
    unless ( open(CTCF_500, ">$outputfile") ) {
      die "Cannot open file \"$outputfile\" to write to!\n\n";
    }
    print CTCF_500 "Score\t" . "aa\t" . "at\t" . "ag\t" . "ac\t" . "ta\t" . "tt\t" . "tg\t" . "tc\t" . "ga\t" . "gt\t" . "gg\t" . "gc\t" . "ca\t" . "ct\t" . "cg\t" . "cc\t" . "errors\n";

    ###########################################################
    # Initialize variables to count di-nucleotide frequencies #
    ###########################################################
    my $aa; my $at; my $ag; my $ac;
    my $ta; my $tt; my $tg; my $tc;
    my $ga; my $gt; my $gg; my $gc;
    my $ca; my $ct; my $cg; my $cc;
    my $e;
    while (<CTCF_SITES>) {
      chomp;
      if (/^(\d+\.\d+)\s+(\d+)\s+(\d+)/) {
        print "1 is $1, 2 is $2, 3 is $3\n";
        my $start;
        my $size;
        if ($2 < $3) {
          if ($2 < $flanker) {
            my $start = $flanker-$2;
            print CTCF_SCORE "$gene_name\t$1\t" . $start . "\n"
          } elsif ($2 > $flanker) {
            my $start = $2-($flanker+$repeat_length);
            print CTCF_SCORE "$gene_name\t$1\t" . $start . "\n";
          }
          if ($2 < $flanker) {
            my $start = $2-$flanker;
            print CTCF_SCORE2 "$gene_name\t$1\t" . $start . "\n";
          } elsif ($2 > $flanker) {
            my $start = $2-($flanker+$repeat_length);
            print CTCF_SCORE2 "$gene_name\t$1\t" .
                $start . "\n";
          }
          $start = $2-1;
          $size = $3-$2-1;

          #####
          # 0 #
          #####
          my $for = substr($total_seq, $start, $size);
          #############################################
          # Count di-nucleotides in flanking sequence #
          #############################################
          $aa=0; $at=0; $ag=0; $ac=0;
          $ta=0; $tt=0; $tg=0; $tc=0;
          $ga=0; $gt=0; $gg=0; $gc=0;
          $ca=0; $ct=0; $cg=0; $cc=0;
          $e=0;
          while($for =~ /aa/ig){$aa++}
          while($for =~ /at/ig){$at++}
          while($for =~ /ag/ig){$ag++}
          while($for =~ /ac/ig){$ac++}
          while($for =~ /ta/ig){$ta++}
          while($for =~ /tt/ig){$tt++}
          while($for =~ /tg/ig){$tg++}
          while($for =~ /tc/ig){$tc++}
          while($for =~ /ga/ig){$ga++}
          while($for =~ /gt/ig){$gt++}
          while($for =~ /gg/ig){$gg++}
          while($for =~ /gc/ig){$gc++}
          while($for =~ /ca/ig){$ca++}
          while($for =~ /ct/ig){$ct++}
          while($for =~ /cg/ig){$cg++}
          while($for =~ /cc/ig){$cc++}
          while($for =~ /[^atgc]/ig){$e++}
          print CTCF "$gene_name\.$1\.0\t" . "$aa\t" . "$at\t" . "$ag\t" . "$ac\t" . "$ta\t" . "$tt\t" . "$tg\t" . "$tc\t" . "$ga\t" . "$gt\t" . "$gg\t" . "$gc\t" . "$ca\t" . "$ct\t" . "$cg\t" . "$cc\t" . "$e\n";
          print "\n####################################\n";
          print "# Inserting into ctcf_dint_0 table #\n";
          print "####################################\n\n";
          $sth = $dbh->prepare ("INSERT INTO ctcf_dint_0 VALUES('$gene_name','$1','$aa','$at','$ag','$ac','$ta','$tt','$tg','$tc','$ga','$gt','$gg','$gc','$ca','$ct','$cg','$cc','$e')");
          $sth->execute();

          #######
          # 100 #
          #######
          my $for_left = substr($total_seq, $start-100, 100);
          my $for_right = substr($total_seq, $start+$size+100, 100);
          my $for_100 = $for_left . $for .
              $for_right;
          #############################################
          # Count di-nucleotides in flanking sequence #
          #############################################
          $aa=0; $at=0; $ag=0; $ac=0;
          $ta=0; $tt=0; $tg=0; $tc=0;
          $ga=0; $gt=0; $gg=0; $gc=0;
          $ca=0; $ct=0; $cg=0; $cc=0;
          $e=0;
          while($for_100 =~ /aa/ig){$aa++}
          while($for_100 =~ /at/ig){$at++}
          while($for_100 =~ /ag/ig){$ag++}
          while($for_100 =~ /ac/ig){$ac++}
          while($for_100 =~ /ta/ig){$ta++}
          while($for_100 =~ /tt/ig){$tt++}
          while($for_100 =~ /tg/ig){$tg++}
          while($for_100 =~ /tc/ig){$tc++}
          while($for_100 =~ /ga/ig){$ga++}
          while($for_100 =~ /gt/ig){$gt++}
          while($for_100 =~ /gg/ig){$gg++}
          while($for_100 =~ /gc/ig){$gc++}
          while($for_100 =~ /ca/ig){$ca++}
          while($for_100 =~ /ct/ig){$ct++}
          while($for_100 =~ /cg/ig){$cg++}
          while($for_100 =~ /cc/ig){$cc++}
          while($for_100 =~ /[^atgc]/ig){$e++}
          print CTCF_100 "$gene_name\.$1\.100\t" . "$aa\t" . "$at\t" . "$ag\t" . "$ac\t" . "$ta\t" . "$tt\t" . "$tg\t" . "$tc\t" . "$ga\t" . "$gt\t" . "$gg\t" . "$gc\t" . "$ca\t" . "$ct\t" . "$cg\t" . "$cc\t" . "$e\n";
          print "######################################\n";
          print "# Inserting into ctcf_dint_100 table #\n";
          print "######################################\n\n";
          $sth = $dbh->prepare ("INSERT INTO ctcf_dint_100 VALUES('$gene_name','$1','$aa','$at','$ag','$ac','$ta','$tt','$tg','$tc','$ga','$gt','$gg','$gc','$ca','$ct','$cg','$cc','$e')");
          $sth->execute();

          #######
          # 500 #
          #######
          $for_left = substr($total_seq, $start-500, 500);
          $for_right = substr($total_seq, $start+$size+500, 500);
          my $for_500 = $for_left .
              $for_right;
          #############################################
          # Count di-nucleotides in flanking sequence #
          #############################################
          $aa=0; $at=0; $ag=0; $ac=0;
          $ta=0; $tt=0; $tg=0; $tc=0;
          $ga=0; $gt=0; $gg=0; $gc=0;
          $ca=0; $ct=0; $cg=0; $cc=0;
          $e=0;
          while($for_500 =~ /aa/ig){$aa++}
          while($for_500 =~ /at/ig){$at++}
          while($for_500 =~ /ag/ig){$ag++}
          while($for_500 =~ /ac/ig){$ac++}
          while($for_500 =~ /ta/ig){$ta++}
          while($for_500 =~ /tt/ig){$tt++}
          while($for_500 =~ /tg/ig){$tg++}
          while($for_500 =~ /tc/ig){$tc++}
          while($for_500 =~ /ga/ig){$ga++}
          while($for_500 =~ /gt/ig){$gt++}
          while($for_500 =~ /gg/ig){$gg++}
          while($for_500 =~ /gc/ig){$gc++}
          while($for_500 =~ /ca/ig){$ca++}
          while($for_500 =~ /ct/ig){$ct++}
          while($for_500 =~ /cg/ig){$cg++}
          while($for_500 =~ /cc/ig){$cc++}
          while($for_500 =~ /[^atgc]/ig){$e++}
          print CTCF_500 "$gene_name\.$1\.500\t" . "$aa\t" . "$at\t" . "$ag\t" . "$ac\t" . "$ta\t" . "$tt\t" . "$tg\t" . "$tc\t" . "$ga\t" . "$gt\t" . "$gg\t" . "$gc\t" . "$ca\t" . "$ct\t" . "$cg\t" . "$cc\t" . "$e\n";
          print "######################################\n";
          print "# Inserting into ctcf_dint_500 table #\n";
          print "######################################\n\n";
          $sth = $dbh->prepare ("INSERT INTO ctcf_dint_500 VALUES('$gene_name','$1','$aa','$at','$ag','$ac','$ta','$tt','$tg','$tc','$ga','$gt','$gg','$gc','$ca','$ct','$cg','$cc','$e')");
          $sth->execute();
        } elsif ($2 > $3) {
          if ($3 < $flanker) {
            my $start = $flanker-$3;
            print CTCF_SCORE "$gene_name\t$1\t" . $start . "\n";
          } elsif ($3 > $flanker) {
            my $start = $3-($flanker+$repeat_length);
            print CTCF_SCORE "$gene_name\t$1\t" . $start . "\n";
          }
          if ($3 < $flanker) {
            my $start = $3-$flanker;
            print CTCF_SCORE2 "$gene_name\t$1\t" .
                $start . "\n";
          } elsif ($3 > $flanker) {
            my $start = $3-($flanker+$repeat_length);
            print CTCF_SCORE2 "$gene_name\t$1\t" . $start . "\n";
          }

          #####
          # 0 #
          #####
          $start = $3-1;
          $size = $2-$3+1;
          print "reverse complement! start $start and size $size \n";
          my $hit = substr($total_seq, $start, $size);
          $hit =~ tr/ATGCatgcn/TACGTACGN/;
          my $rev_hit = reverse($hit);
          #############################################
          # Count di-nucleotides in flanking sequence #
          #############################################
          $aa=0; $at=0; $ag=0; $ac=0;
          $ta=0; $tt=0; $tg=0; $tc=0;
          $ga=0; $gt=0; $gg=0; $gc=0;
          $ca=0; $ct=0; $cg=0; $cc=0;
          $e=0;
          while($rev_hit =~ /aa/ig){$aa++}
          while($rev_hit =~ /at/ig){$at++}
          while($rev_hit =~ /ag/ig){$ag++}
          while($rev_hit =~ /ac/ig){$ac++}
          while($rev_hit =~ /ta/ig){$ta++}
          while($rev_hit =~ /tt/ig){$tt++}
          while($rev_hit =~ /tg/ig){$tg++}
          while($rev_hit =~ /tc/ig){$tc++}
          while($rev_hit =~ /ga/ig){$ga++}
          while($rev_hit =~ /gt/ig){$gt++}
          while($rev_hit =~ /gg/ig){$gg++}
          while($rev_hit =~ /gc/ig){$gc++}
          while($rev_hit =~ /ca/ig){$ca++}
          while($rev_hit =~ /ct/ig){$ct++}
          while($rev_hit =~ /cg/ig){$cg++}
          while($rev_hit =~ /cc/ig){$cc++}
          while($rev_hit =~ /[^atgc]/ig){$e++}
          print CTCF "$gene_name\.$1\.0\t" . "$aa\t" . "$at\t" . "$ag\t" . "$ac\t" . "$ta\t" . "$tt\t" . "$tg\t" . "$tc\t" . "$ga\t" . "$gt\t" . "$gg\t" . "$gc\t" . "$ca\t" . "$ct\t" . "$cg\t" . "$cc\t" . "$e\n";
          print "\n####################################\n";
          print "# Inserting into ctcf_dint_0 table #\n";
          print "####################################\n\n";
          $sth = $dbh->prepare ("INSERT INTO ctcf_dint_0 VALUES('$gene_name','$1','$aa','$at','$ag','$ac','$ta','$tt','$tg','$tc','$ga','$gt','$gg','$gc','$ca','$ct','$cg','$cc','$e')");
          $sth->execute();

          #######
          # 100 #
          #######
          my $flank1 = substr($total_seq, $start-100, 100);
          $flank1 =~ tr/ATGCatgcn/TACGTACGN/;
          my $rev_left = reverse($flank1);
          my $flank2 = substr($total_seq, $start+$size+100, 100);
          $flank2 =~ tr/ATGCatgcn/TACGTACGN/;
          my $rev_right = reverse($flank2);
          my $rev_100 = $rev_left . $rev_hit .
              $rev_right;
          #############################################
          # Count di-nucleotides in flanking sequence #
          #############################################
          $aa=0; $at=0; $ag=0; $ac=0;
          $ta=0; $tt=0; $tg=0; $tc=0;
          $ga=0; $gt=0; $gg=0; $gc=0;
          $ca=0; $ct=0; $cg=0; $cc=0;
          $e=0;
          while($rev_100 =~ /aa/ig){$aa++}
          while($rev_100 =~ /at/ig){$at++}
          while($rev_100 =~ /ag/ig){$ag++}
          while($rev_100 =~ /ac/ig){$ac++}
          while($rev_100 =~ /ta/ig){$ta++}
          while($rev_100 =~ /tt/ig){$tt++}
          while($rev_100 =~ /tg/ig){$tg++}
          while($rev_100 =~ /tc/ig){$tc++}
          while($rev_100 =~ /ga/ig){$ga++}
          while($rev_100 =~ /gt/ig){$gt++}
          while($rev_100 =~ /gg/ig){$gg++}
          while($rev_100 =~ /gc/ig){$gc++}
          while($rev_100 =~ /ca/ig){$ca++}
          while($rev_100 =~ /ct/ig){$ct++}
          while($rev_100 =~ /cg/ig){$cg++}
          while($rev_100 =~ /cc/ig){$cc++}
          while($rev_100 =~ /[^atgc]/ig){$e++}
          print CTCF_100 "$gene_name\.$1\.100\t" . "$aa\t" . "$at\t" . "$ag\t" . "$ac\t" . "$ta\t" . "$tt\t" . "$tg\t" . "$tc\t" . "$ga\t" . "$gt\t" . "$gg\t" . "$gc\t" . "$ca\t" . "$ct\t" . "$cg\t" . "$cc\t" . "$e\n";
          print "######################################\n";
          print "# Inserting into ctcf_dint_100 table #\n";
          print "######################################\n\n";
          $sth = $dbh->prepare ("INSERT INTO ctcf_dint_100 VALUES('$gene_name','$1','$aa','$at','$ag','$ac','$ta','$tt','$tg','$tc','$ga','$gt','$gg','$gc','$ca','$ct','$cg','$cc','$e')");
          $sth->execute();

          #######
          # 500 #
          #######
          $flank1 = substr($total_seq, $start-500, 500);
          $flank1 =~ tr/ATGCatgcn/TACGTACGN/;
          $rev_left = reverse($flank1);
          $flank2 = substr($total_seq, $start+$size+500, 500);
          $flank2 =~ tr/ATGCatgcn/TACGTACGN/;
          $rev_right = reverse($flank2);
          my $rev_500= $rev_left .
$ r e v _ h i t . $ r e v _ r i g h t ; ############################################# # Count d i - n u c l e o t i d e s i n f l a n k i n g sequence # ############################################# $aa=0; $at=0; $ag=0; $ac=0; $ta=0; $tt=0; $tg=0; $tc=0; $ga=0; $gt=0; $gg=0; $gc=0; $ca=0; $ct=0; $cg=0; $cc=0; $e=0; while $rev_ 500 =~ /aa/ig) $aa++} while $rev "500 =~ /a t / i g ) $at++} while $rev_ ^500 =~ /ag/ig) $ag++} while $rev_ 500 =~ /ac/ig) $ac++} while $rev "500 =~ /ta / i g ) $ta++} while $rev ~500 =~ / t t / i g ) $tt++} while $rev_ ^500 =~ /tg/ i g ) $tg++} while $rev_ ^500 =~ / t c / i g ) $tc++} while $rev_ [500 =~ /ga/ig) $ga++} while $rev_ "500 =~ /gt/ig) $gt++} while $rev_ [500 =~ /gg/ig) $gg++} while ($rev_ "500 =~ /gc/ig) $gc++} while ($rev_ [500 =~ /ca/ig) $ca++} while $rev_ ]500 =- / c t / i g ) $ct++} while $rev_ "500 =~ /cg/ig) $cg++} while ($rev_ ^500 =~ /cc/ig) $cc++} while ;$rev_ ]soo =- /["atgc /ig){$e++} p r i n t CTCF_500 "$gene_name\.$1\.500\t" . "$aa\t" . "$a t \ t " . "$ag\t" . "$ac\t" . "$t a \ t " . " $ t t \ t " . "$tg\t" . " $ t c \ t " . "$ga\t" . "$gt\t" . "$gg\t" . "$gc\t" . "$ca\t" . " $ c t \ t " . "$cg\t" . "$cc\t" . "$e\n"; p r i n t "######################################\n"; p r i n t "# I n s e r t i n g i n t o c t c f _ d i n t _ 5 0 0 t a b l e #\n"; p r i n t "######################################\n\n"; $sth = $dbh->prepare ("INSERT INTO c t c f _ d i n t _ 5 0 0 VALUES (' $gene_name ' , ' $1' , 1 $aa ' , ' $at' , ' $ag ' , ' $ac ' , 1 $ta' ,.' $ t t > , ' $tg' , ' $tc ' , ' $ga' , 1 $gt 1 , 1 $gg 1,'$gc 1,'$ca','$ct', 1$cg 1, 1$cc', '$e ' ) ") ; $sth->execute ( ) ; } } } close(CTCF); close(CTCF_100); close(CTCF_500); close(CTCF_SITES); ############################## # P r i n t master sequence f i l e # ############################## # need to go down one d i r e c t o r y to p r i n t sequence to seq.fa master f i l e c h d i r "../" or d i e "cannot c h d i r to . 
./: $!"; p r i n t SEQ ">$gene_name | $gene_id | $db_name | Chr$chr_name$strand2\:$chr_start\-$chr_end\ | repeat co-ordinates: Chr$chrom$strand2\:$rep_start\-$rep_end\ | f l a n k i n g sequence up/down-stream of repeat: $ f l a n k e r |\n"; p r i n t SEQ "$total_seq\n\n"; p r i n t REP_SEQ ">$gene_name | $gene_id | $db_name | repeat co-or d i n a t e s : Chr$chrom$strand2\:$rep_start\-$rep_end\ | f l a n k i n g sequence up/down-stream of repeat: $ f l a n k e r | \n"; p r i n t REP_SEQ "$rep_seq\n\n"; } } } p r i n t "the value of \$n i s $n \n" ; $n++; } ############################################################ # Close f i l e h a n d l e r s f o r repeat t r a c t + f l a n k i n g sequences # ############################################################ Close(CTCF_SCORE); close(CTCF_SC0RE2); close(GC); close(SEQ); close(REP_SEQ); e x i t ; 152 APPENDIX D Interacting with and creating the 'gems_cis' database Re-populating the Database: MySQL Syntax for deleting the tables WARNING: This will delete all data collected so far by the database! If you want to re-run flanker.pl and re-populate the database you must delete all the tables currently in the database with these commands: DROP TABLE repeats; DROP TABLE b u i l d _ i n f o ; DROP TABLE cpg; DROP TABLE c t c f ; DROP TABLE c t c f d i n t _ 0; DROP TABLE c t c f _ d i n t _ 100; DROP TABLE c t c f _ d i n t _ 500; DROP TABLE f l a n k i n g ; DROP TABLE gc; DROP TABLE g c _ p l o t s ; DROP TABLE gems ; DROP TABLE gems_feat; DROP TABLE exons; Re-populating the Database: MySQL Syntax for creating the tables You must now re-create all the tables you just deleted so that flanker.pl can insert data into the expected fields. 
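The drop-then-recreate cycle described here lends itself to a small script instead of typing each statement by hand. The following is only a minimal sketch under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for the MySQL server, and the function names reset_tables and row_count, the SCHEMA dictionary, and the example row are hypothetical and cover just two of the tables. It is not part of flanker.pl.

```python
import sqlite3

# Hypothetical subset of the gems_cis schema (two tables only), written in
# syntax that SQLite accepts; the real database is MySQL.
SCHEMA = {
    "build_info": (
        "CREATE TABLE build_info ("
        "ens_db VARCHAR(50) NOT NULL, "
        "db_name VARCHAR(50) NOT NULL, "
        "date_run DATETIME)"
    ),
    "gems": (
        "CREATE TABLE gems ("
        "name VARCHAR(15) NOT NULL PRIMARY KEY, "
        "ens_ID CHAR(15) NOT NULL)"
    ),
}


def reset_tables(conn, schema):
    """Drop every table named in `schema`, then re-create it empty."""
    cur = conn.cursor()
    for table, create_sql in schema.items():
        # Table names come from the fixed dictionary above, not user input.
        cur.execute(f"DROP TABLE IF EXISTS {table}")
        cur.execute(create_sql)
    conn.commit()


def row_count(conn, table):
    """Return the number of rows currently stored in `table`."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    reset_tables(conn, SCHEMA)
    # Placeholder row; 'ENSG_EXAMPLE' is not a real Ensembl identifier.
    conn.execute("INSERT INTO gems VALUES ('HD', 'ENSG_EXAMPLE')")
    print(row_count(conn, "gems"))
    reset_tables(conn, SCHEMA)          # wipes the row inserted above
    print(row_count(conn, "gems"))
```

Because every table is dropped with IF EXISTS before being re-created, the sketch can be run repeatedly, including on an empty database. The MySQL statements for the real tables are given next.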
Execute the following commands to do this:

CREATE TABLE build_info (
    ens_db VARCHAR(50) NOT NULL,
    db_name VARCHAR(50) NOT NULL,
    date_run DATETIME
);

CREATE TABLE gems (
    name VARCHAR(15) NOT NULL PRIMARY KEY,
    ens_ID CHAR(15) NOT NULL
);

CREATE TABLE gc (
    name VARCHAR(15) NOT NULL PRIMARY KEY,
    50_bp DECIMAL(7,6) NOT NULL,
    100_bp DECIMAL(7,6) NOT NULL,
    150_bp DECIMAL(7,6) NOT NULL,
    200_bp DECIMAL(7,6) NOT NULL,
    250_bp DECIMAL(7,6) NOT NULL,
    300_bp DECIMAL(7,6) NOT NULL,
    350_bp DECIMAL(7,6) NOT NULL,
    400_bp DECIMAL(7,6) NOT NULL,
    450_bp DECIMAL(7,6) NOT NULL,
    500_bp DECIMAL(7,6) NOT NULL,
    1000_bp DECIMAL(7,6) NOT NULL
);

CREATE TABLE gems_feat (
    name VARCHAR(15) NOT NULL PRIMARY KEY,
    chr CHAR(2) NOT NULL,
    strand CHAR(1) NOT NULL,
    start INT UNSIGNED,
    end INT UNSIGNED,
    unit CHAR(3),
    seq VARCHAR(255) NOT NULL,
    length VARCHAR(4) NOT NULL,
    purity VARCHAR(3) NOT NULL,
    expandability DECIMAL(3,2) NULL
);

CREATE TABLE flanking (
    name VARCHAR(15) NOT NULL PRIMARY KEY,
    chr CHAR(2) NOT NULL,
    strand CHAR(1) NOT NULL,
    start INT UNSIGNED,
    end INT UNSIGNED,
    for_seq TEXT,
    rev_seq TEXT,
    for_rep_seq TEXT,
    rev_rep_seq TEXT
);

CREATE TABLE cpg (
    name VARCHAR(15) NOT NULL,
    start INT UNSIGNED,
    end INT UNSIGNED,
    score VARCHAR(8)
);

CREATE TABLE exons (
    name VARCHAR(15) NOT NULL,
    start INT,
    end INT
);

CREATE TABLE repeats (
    name VARCHAR(15) NOT NULL,
    rep_name VARCHAR(15),
    rep_class VARCHAR(15),
    start INT UNSIGNED,
    end INT UNSIGNED,
    score INT UNSIGNED,
    strand CHAR(1),
    distance MEDIUMINT UNSIGNED
);

CREATE TABLE gc_plots (
    name VARCHAR(15) NOT NULL,
    start INT UNSIGNED,
    obsex DECIMAL(7,6) NOT NULL,
    gc DECIMAL(7,6) NOT NULL
);

CREATE TABLE ctcf (
    name VARCHAR(15) NOT NULL,
    score DECIMAL(5,2) NOT NULL,
    start INT UNSIGNED,
    end INT UNSIGNED,
    strand CHAR(1) NOT NULL,
    distance INT UNSIGNED
);

CREATE TABLE ctcf_dint_0 (
    name VARCHAR(15) NOT NULL,
    score DECIMAL(5,2) NOT NULL,
    aa VARCHAR(4),
    at VARCHAR(4),
    ag VARCHAR(4),
    ac VARCHAR(4),
    ta VARCHAR(4),
    tt VARCHAR(4),
    tg VARCHAR(4),
    tc VARCHAR(4),
    ga VARCHAR(4),
    gt VARCHAR(4),
    gg VARCHAR(4),
    gc VARCHAR(4),
    ca VARCHAR(4),
    ct VARCHAR(4),
    cg VARCHAR(4),
    cc VARCHAR(4),
    errors VARCHAR(4)
);

CREATE TABLE ctcf_dint_100 (
    name VARCHAR(15) NOT NULL,
    score DECIMAL(5,2) NOT NULL,
    aa VARCHAR(4),
    at VARCHAR(4),
    ag VARCHAR(4),
    ac VARCHAR(4),
    ta VARCHAR(4),
    tt VARCHAR(4),
    tg VARCHAR(4),
    tc VARCHAR(4),
    ga VARCHAR(4),
    gt VARCHAR(4),
    gg VARCHAR(4),
    gc VARCHAR(4),
    ca VARCHAR(4),
    ct VARCHAR(4),
    cg VARCHAR(4),
    cc VARCHAR(4),
    errors VARCHAR(4)
);

CREATE TABLE ctcf_dint_500 (
    name VARCHAR(15) NOT NULL,
    score DECIMAL(5,2) NOT NULL,
    aa VARCHAR(4),
    at VARCHAR(4),
    ag VARCHAR(4),
    ac VARCHAR(4),
    ta VARCHAR(4),
    tt VARCHAR(4),
    tg VARCHAR(4),
    tc VARCHAR(4),
    ga VARCHAR(4),
    gt VARCHAR(4),
    gg VARCHAR(4),
    gc VARCHAR(4),
    ca VARCHAR(4),
    ct VARCHAR(4),
    cg VARCHAR(4),
    cc VARCHAR(4),
    errors VARCHAR(4)
);

Inserting the expandability data

Since the expandability data cannot be generated automatically, it must be set manually AFTER flanker.pl has been run, using these commands:

UPDATE gems_feat SET expandability = '4.81' WHERE name = 'DMPK';
UPDATE gems_feat SET expandability = '1.30' WHERE name = 'SCA7';
UPDATE gems_feat SET expandability = '0.97' WHERE name = 'SCA2';
UPDATE gems_feat SET expandability = '0.29' WHERE name = 'HD';
UPDATE gems_feat SET expandability = '0.19' WHERE name = 'DRPLA';
UPDATE gems_feat SET expandability = '0.14' WHERE name = 'SCA1';
UPDATE gems_feat SET expandability = '0.08' WHERE name = 'SBMA';
UPDATE gems_feat SET expandability = '0.07' WHERE name = 'SCA3_MJD';

APPENDIX E nucleo.pl

#!/usr/local/bin/perl -w

use strict;
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use DBI;
use Data::Dumper;

####################
# Global Variables #
####################
# ensembl API
my $host         = 'ensembldb.ensembl.org';
my $user         = 'anonymous';
my $db_name      = 'homo_sapiens_core_16_33';
my $prog_version = '0.4';

###############################
# Connect to ensembl with API #
###############################
my $db = new Bio::EnsEMBL::DBSQL::DBAdaptor(-host => $host, -user => $user, -dbname => $db_name);
my $slice_adaptor = $db->get_SliceAdaptor;

while (<>) {
chomp;

#########################################
# ensembl API sequence extraction phase #
#########################################
# in this regex
# $1 collects the common gene name
# $2 collects the ensembl gene ID
# $3 collects the last digit from the ensembl gene ID, it's an effect of the internal brackets
# $4 collects the chromosome number
# $5 collects the repeat start position in chromosome $4
# $6 collects the repeat end position in chromosome $4
if (/^(\w+)\s+(ENSG(\d){11})\s+(\S+)\s+(\d+)\s+(\d+)\s+(\w+)/) {
print "1 is $1, 2 is $2, 3 is $3, 4 is $4, 5 is $5, 6 is $6, 7 is $7\n";
my $genename = $1;
my $chrom = $4;
my $rep_start = $5;
my $rep_end = $6;
my $repeat_unit;

my $slice_left_100 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-100,$rep_start-1);
my $slice_right_100 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+100);
my $left_100 = $slice_left_100->seq;
my $right_100 = $slice_right_100->seq;
my $nucleo_100_seq = $left_100 .
$right_100;
print "$genename with 100 bp flanking sequence\n\n$nucleo_100_seq\n\n";

my $slice_left_500 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-500,$rep_start-1);
my $slice_right_500 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+500);
my $left_500 = $slice_left_500->seq;
my $right_500 = $slice_right_500->seq;
my $nucleo_500_seq = $left_500 . $right_500;
print "$genename with 500 bp flanking sequence\n\n$nucleo_500_seq\n\n";

my $slice_left_1000 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_start-1000,$rep_start-1);
my $slice_right_1000 = $slice_adaptor->fetch_by_chr_start_end($chrom,$rep_end+1,$rep_end+1000);
my $left_1000 = $slice_left_1000->seq;
my $right_1000 = $slice_right_1000->seq;
my $nucleo_1000_seq = $left_1000 . $right_1000;
print "$genename with 1000 bp flanking sequence\n\n$nucleo_1000_seq\n\n";
}
}

APPENDIX F Sample queries

SELECT g.name, g.expandability, gc.50_bp, gc.100_bp, gc.150_bp, gc.200_bp, gc.250_bp, gc.300_bp, gc.350_bp, gc.400_bp, gc.450_bp, gc.500_bp, gc.1000_bp
FROM gems_feat g, gc
WHERE g.expandability > 0
AND g.name = gc.name;

SELECT g.name, g.expandability, r.name, r.distance
FROM gems_feat g, repeats r
WHERE g.name = r.name
AND r.rep_name LIKE 'AluY%'
AND expandability > 0
AND r.distance < 20000
AND g.name NOT LIKE 'DMPK'
ORDER BY r.distance;

SELECT g.name, g.expandability, r.rep_name, r.distance
FROM gems_feat g, repeats r
WHERE g.name = r.name
AND r.distance < 10000
AND r.distance > 0
AND r.rep_name NOT LIKE "dust"
AND r.rep_name NOT LIKE "(CAG)n"
AND g.expandability > 0.9
ORDER BY r.rep_name;

SELECT g.name, g.expandability, r.rep_name, r.distance
FROM gems_feat g, repeats r
WHERE g.name = r.name
AND r.distance < 10000
AND r.distance > 0
AND r.rep_name LIKE "Alu%";

SELECT g.name, g.expandability, r.rep_name, r.distance
FROM gems_feat g, repeats r
WHERE g.name = r.name
AND expandability > 0
AND r.distance < 1000
AND r.distance > 10
ORDER BY r.rep_name DESC;

SELECT g.name, g.expandability, c.score, c.distance
FROM gems_feat g, ctcf c
WHERE g.name = c.name
AND c.distance < 1000
AND g.expandability > 0
ORDER BY c.score DESC;

SELECT c.name, c.score, c.distance, g.expandability, c.start, c.end
FROM ctcf c, gems_feat g
WHERE c.distance < 1000
AND c.score > 2
AND g.name = c.name
ORDER BY c.name;

SELECT c.name, c.score, c.distance, g.expandability, c.start, c.end
FROM ctcf c, gems_feat g
WHERE g.name = c.name
AND c.name = "DRPLA"
ORDER BY c.name;

APPENDIX G Example R scripts and selected results

#######################
# Flanking GC Content #
#######################

# Bring data into R
gc.brock = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/gc_stats.txt", sep = "")
gc.all = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/gc_all.txt", sep = "")

# Calculate ranks for flanking %GC
exp.rank = rank(gc.brock[,2])
gc.100.rank = rank(gc.brock[,4])
gc.500.rank = rank(gc.brock[,12])
gc.1000.rank = rank(gc.brock[,13])
gc.ranks = cbind(exp.rank,gc.100.rank,gc.500.rank,gc.1000.rank)

# Calculate Spearman's Ranked Correlation
cor.test(gc.brock[,2], gc.brock[,3], method="spearman")

flanking.gc = c(0, 50, 100, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000)
rho = c(NA, 0.8214286, 0.8928571, 0.9285714, 0.8928571, 0.4285714, 0.3214286, 0.3214286, 0.3214286, 0.3214286, 0.3214286, 0.3214286, 0.2142857)
p.value = c(NA, 0.03945, 0.01898, 0.01181, 0.01898, 0.349, 0.4948, 0.4948, 0.4948, 0.4948, 0.4948, 0.4948, 0.6615)
rho = cbind(flanking.gc,rho,p.value)

# Brock et al. cor co-efficients
genes = c("DM", "SCA7", "SCA2", "HD", "DRPLA", "SCA1", "SBMA", "SCA3", "ERDA1")
exp = c(4.81, 1.3, 0.97, 0.29, 0.19, 0.14, 0.08, 0.07, -0.01)
gc.100 = c(69.5, 83.5, 77, 74.5, 63.5, 66, 65, 36.5, 38.5)
gc.500 = c(66, 71.5, 79, 71, 66, 67.2, 59, 38.5, 43)
brock = cbind(genes,exp,gc.100,gc.500)
cor.test(brock[,2],brock[,3], method="spearman")

# Plot of rho vs. increasing flanking sequence
plot(x=rho[,1],y=rho[,2],xlab="Flanking sequence (bp)",
     ylab="Spearman's rank correlation (rho)",
     main="Effect of increasing flanking sequence on Rho value",
     sub="Figure 1d: Spearman rank correlation (rho) value of median expandability to %GC over 50 bp, 100 bp, 500 bp, and 1000 bp stretches of flanking sequence",
     xlim=c(0,1000), ylim=c(0,1));
lines(x=rho[,1],y=rho[,2])

##################
# Figure 5 Plots #
##################
pdf("Figure_5_GC.pdf")
par(mfrow=c(2,2))
plot(x=gc.ranks[,2],y=gc.ranks[,1],xlab="%GC of 100 bases flanking repeat (rank)",
     ylab="Median Expandability (rank)", main="100 bp"); abline(0,1)
plot(x=gc.ranks[,3],y=gc.ranks[,1],xlab="%GC of 500 bases flanking repeat (rank)",
     ylab="Median Expandability (rank)", main="500 bp"); abline(0,1)
plot(x=gc.ranks[,4],y=gc.ranks[,1],xlab="%GC of 1000 bases flanking repeat (rank)",
     ylab="Median Expandability (rank)", main="1000 bp"); abline(0,1)
plot(x=rho[,1],y=rho[,2],xlab="Flanking sequence (bp)",
     ylab="Spearman's rank correlation (rho)",
     main="Effect of increasing flanking\nsequence on Rho value",
     xlim=c(0,5000), ylim=c(0,1)); lines(x=rho[,1],y=rho[,2])
dev.off()

############
# Figure 6 #
############
pdf("Figure_6_hist.pdf")
par(mfrow=c(1,1))
colour = c(0,0,0,0,0,0,0,0,0,2)
hist(gc.all[,4],xlim=c(0,1),
     main="Histogram of %GC\n100 bp flanking CAG/CTG repeat of Candidate CAG/CTG repeats\n",
     xlab="%GC of 100 bp flanking repeat", ylab="number of genes", col=colour); dev.off()

mean(gc.all[,4])+2*(sd(gc.all[,4]))
sum(gc.all[,4] > 0.762376)

# z score for second highest expandable locus.
(0.767326-(mean(gc.all[,4])))/(sd(gc.all[,4]))
mean(gc.all[,12])+2*(sd(gc.all[,12]))

#########################################
# Expandability and Length Calculations #
#########################################
len.pur = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/len_pur.txt", sep = "")
len.rank = rank(len.pur[,3])
pur.rank = rank(len.pur[,4])
len.pur.ranks = cbind(exp.rank,len.rank,pur.rank)
cor.test(len.pur.ranks[,1],len.pur.ranks[,2], method="spearman")
plot(x=len.pur.ranks[,1],y=len.pur.ranks[,2],xlab="CAG-repeat length (rank)",
     ylab="Median Expandability (rank)", main="Expandability vs. Repeat Length",
     sub="Figure 2: Ranked Expandability vs. Ranked CAG/CTG repeat length"); abline(0,1)

############
# Figure 7 #
############
pdf("Figure_7_length.pdf")
par(mfrow=c(1,1))
plot(x=len.pur.ranks[,1],y=len.pur.ranks[,2],xlab="Repeat Length (rank)",
     ylab="Median Expandability (rank)",
     main="Expandability vs. Repeat Length\nCAG/CTG repeats known to be unstable\nRepeat length derived from TRF co-ordinates"); dev.off()

# Output
Spearman's rank correlation rho
data: len.pur.ranks[, 1] and len.pur.ranks[, 2]
S = 74, p-value = 0.4948
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.3214286

# plot of exp. vs purity
cor.test(len.pur.ranks[,1],len.pur.ranks[,3], method="spearman")

############
# Figure 8 #
############
pdf("Figure_8_purity.pdf")
par(mfrow=c(1,1))
plot(x=len.pur.ranks[,1],y=len.pur.ranks[,3],xlab="CAG-repeat purity (rank)",
     ylab="Median Expandability (rank)",
     main="Expandability vs. Repeat Purity\nCAG/CTG repeats known to be unstable\nPurity defined as longest contiguous repeat unit"); dev.off()

# Output
Spearman's rank correlation rho
data: len.pur.ranks[, 1] and len.pur.ranks[, 3]
S = 62, p-value = 0.843
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.1071429

###############
# Alu repeats #
###############
alu.total = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu.txt", sep = "")
cor.test(alu.total[,2],alu.total[,4], method="spearman")

genes = c("SCA3_MJD","SCA7","SCA2","HD","DRPLA","SCA1","SBMA")
alu.count = c(14,6,21,12,6,6,1)
alu = cbind(exp.rank, rank(alu.count))
plot(x=alu[,2],y=alu[,1],xlab="Number of AluY Repeats (rank)",
     ylab="Median Expandability (rank)",
     main="Expandability vs. Number of Flanking AluY Repeats",
     sub="Figure 3: Ranked expandability vs. ranked total number of AluY repeats in 50,000 bp of flanking sequence"); abline(0,1)
cor.test(alu[,1],alu[,2], method="spearman")
cor.test(gc.brock[,2],alu.count, method="spearman")

# Output
Spearman's rank correlation rho
data: gc.brock[, 2] and alu.count
S = 46, p-value = 0.7207
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.1853123

# 10,000
alu.total.10000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_10000.txt", sep = "")
cor.test(alu.total.10000[,2],alu.total.10000[,4], method="spearman")

# Output
Spearman's rank correlation rho
data: alu.total.10000[, 2] and alu.total.10000[, 4]
S = 156, p-value = 0.04489
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.5718863

# 20,000
alu.total.20000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_20000.txt", sep = "")
cor.test(alu.total.20000[,2],alu.total.20000[,4], method="spearman")

# Output
Spearman's rank correlation rho
data: alu.total.20000[, 2] and alu.total.20000[, 4]
S = 5785, p-value = 0.8534
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.03312307

# 30,000
alu.total.30000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_30000.txt", sep = "")
cor.test(alu.total.30000[,2],alu.total.30000[,4], method="spearman")

# Output
Spearman's rank correlation rho
data: alu.total.30000[, 2] and alu.total.30000[, 4]
S = 17424, p-value = 0.3314
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.1478174

# 40,000
alu.total.40000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_40000.txt", sep = "")
cor.test(alu.total.40000[,2],alu.total.40000[,4], method="spearman")

# Output
Spearman's rank correlation rho
data: alu.total.40000[, 2] and alu.total.40000[, 4]
S = 28314, p-value = 0.5679
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.07925101

# 50,000
alu.total.50000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_50000.txt", sep = "")
cor.test(alu.total.50000[,2],alu.total.50000[,4], method="spearman")

# Output
Spearman's rank correlation rho
data: alu.total.50000[, 2] and alu.total.50000[, 4]
S = 53582, p-value = 0.3424
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.1185093

###########################
# Numbers of AluY Repeats #
###########################
names = c("genes","expandability","5,000","10,000","20,000","30,000","40,000","40,000")
alu.number = matrix(data = NA, nrow=7, ncol = 8, byrow = FALSE)
alu.number[,1] = c("SCA3_MJD","SCA7","SCA2","HD","DRPLA","SCA1","SBMA")
alu.number[,2] = c(0.07,1.30,0.97,0.29,0.19,0.14,0.08)
alu.number[,3] = c(1,0,1,1,0,0,0)
alu.number[,4] = c(3,1,5,3,1,0,0)
alu.number[,5] = c(5,3,11,9,2,3,0)
alu.number[,6] = c(9,3,16,10,3,4,0)
alu.number[,7] = c(9,4,19,11,5,6,0)
alu.number[,8] = c(14,6,21,12,6,6,1)
cor.test(alu.number[,2],alu.number[,8], method="spearman")

names = c("SCA3_MJD","SCA7","SCA2","HD","DRPLA","SCA1","SBMA")
exp = c(0.07,1.30,0.97,0.29,0.19,0.14,0.08)
a.5000 = c(1,0,1,1,0,0,0)
a.10000 = c(3,1,5,3,1,0,0)
a.20000 = c(5,3,11,9,2,3,0)
a.30000 = c(9,3,16,10,3,4,0)
a.40000 = c(9,4,19,11,5,6,0)
a.50000 = c(14,6,21,12,6,6,1)
alu.number = cbind(names,exp,a.5000,a.10000,a.20000,a.30000,a.40000,a.50000)

###########################
# Generate plot variables #
###########################
alu.total.10000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_10000.txt", sep = "")
alu.total.20000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_20000.txt", sep = "")
alu.total.30000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_30000.txt", sep = "")
alu.total.40000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_40000.txt", sep = "")
alu.total.50000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/alu_50000.txt", sep = "")

plot(x=a1[,2],y=a1[,1],xlab="(rank)", ylab="Median Expandability (rank)", main="1000 bp");

e1 = rank(alu.total.10000[,2])
r1 = rank(alu.total.10000[,4])
a1 = cbind(e1,r1)
plot(x=a1[,2],y=a1[,1],xlab="Distance of AluY-type repeat from CAG/CTG repeat (rank)",
     ylab="Median Expandability (rank)",
     main="Median Expandability vs.\nDistance of AluY Repeat\n(10,000 bp)")
abline(0,1)

e2 = rank(alu.total.20000[,2])
r2 = rank(alu.total.20000[,4])
a2 = cbind(e2,r2)
plot(x=a2[,2],y=a2[,1])

e3 = rank(alu.total.30000[,2])
r3 = rank(alu.total.30000[,4])
a3 = cbind(e3,r3)
plot(x=a3[,2],y=a3[,1],xlab="Distance of AluY-type repeat from CAG/CTG repeat (rank)",
     ylab="Median Expandability (rank)",
     main="Median Expandability vs.\nDistance of AluY Repeat\n(30,000 bp)")
abline(0,1)

e4 = rank(alu.total.40000[,2])
r4 = rank(alu.total.40000[,4])
a4 = cbind(e4,r4)
plot(x=a4[,2],y=a4[,1])

e5 = rank(alu.total.50000[,2])
r5 = rank(alu.total.50000[,4])
a5 = cbind(e5,r5)
plot(x=a5[,2],y=a5[,1],xlab="Distance of AluY-type repeat from CAG/CTG repeat (rank)",
     ylab="Median Expandability (rank)",
     main="Median Expandability vs.\nDistance of AluY Repeat\n(50,000 bp)")
abline(0,1)

# rho values
alu.bp = c(10000,20000,30000,40000,50000)
alu.rho = c(0.5718863,0.03312307,-0.1478174,-0.07925101,-0.1185093)

############
# Figure 9 #
############
pdf("Figure_9_AluY.pdf")
par(mfrow=c(2,2))
plot(x=a1[,2],y=a1[,1],xlab="Distance of AluY-type repeat\nfrom CAG/CTG repeat (rank)",
     ylab="Median Expandability (rank)",
     main="Median Expandability vs.\nDistance of AluY Repeats\n(10,000 bp)", xlim=c(0,14), ylim=c(0)
abline(0,1)
plot(x=a3[,2],y=a3[,1],xlab="Distance of AluY-type repeat\nfrom CAG/CTG repeat (rank)",
     ylab="Median Expandability (rank)",
     main="Median Expandability vs.\nDistance of AluY Repeats\n(30,000 bp)")
plot(x=a5[,2],y=a5[,1],xlab="Distance of AluY-type repeat\nfrom CAG/CTG repeat (rank)",
     ylab="Median Expandability (rank)",
     main="Median Expandability vs.\nDistance of AluY Repeats\n(50,000 bp)")
plot(x=alu.bp,y=alu.rho,xlab="Flanking sequence (bp)",
     ylab="Spearman's rank correlation (rho)",
     main="Effect of increasing\nflanking sequence on rho value")
lines(x=alu.bp,y=alu.rho)
dev.off()

########
# CTCF #
########

# 50,000
cor.test(ctcf.50000.rank[,2],ctcf.50000.rank[,3], method="spearman")

# 1,000
ctcf.1000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/ctcf_1_000.txt", sep = "")
ctcf.1000.rank = matrix(data = NA, nrow = 5, ncol = 4, byrow = FALSE)
ctcf.1000.rank[,1] = ctcf.1000[,1]
ctcf.1000.rank[,2] = rank(ctcf.1000[,2])
ctcf.1000.rank[,3] = rank(ctcf.1000[,3])
ctcf.1000.rank[,4] = rank(ctcf.1000[,4])

# Cor. between exp and score
cor.test(ctcf.1000.rank[,2],ctcf.1000.rank[,3], method="spearman")

Spearman's rank correlation rho
data: ctcf.1000.rank[, 2] and ctcf.1000.rank[, 3]
S = 4, p-value = 0.1333
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.7905694

# Cor. between exp and distance
cor.test(ctcf.1000.rank[,2],ctcf.1000.rank[,4], method="spearman")

Spearman's rank correlation rho
data: ctcf.1000.rank[, 2] and ctcf.1000.rank[, 4]
S = 27, p-value = 0.5167
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.3689324

# Cor. between score and distance
cor.test(ctcf.1000.rank[,3],ctcf.1000.rank[,4], method="spearman")

Spearman's rank correlation rho
data: ctcf.1000.rank[, 3] and ctcf.1000.rank[, 4]
S = 34, p-value = 0.2333
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.7

# 5,000
ctcf.5000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/ctcf_5_000.txt", sep = "")
ctcf.5000.rank = matrix(data = NA, nrow = 9, ncol = 4, byrow = FALSE)
ctcf.5000.rank[,1] = ctcf.5000[,1]
ctcf.5000.rank[,2] = rank(ctcf.5000[,2])
ctcf.5000.rank[,3] = rank(ctcf.5000[,3])
ctcf.5000.rank[,4] = rank(ctcf.5000[,4])

# Cor. between exp and score
cor.test(ctcf.5000.rank[,2],ctcf.5000.rank[,3], method="spearman")

Spearman's rank correlation rho
data: ctcf.5000.rank[, 2] and ctcf.5000.rank[, 3]
S = 110, p-value = 0.8438
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.08257228

# Cor. between exp and distance
cor.test(ctcf.5000.rank[,2],ctcf.5000.rank[,4], method="spearman")

Spearman's rank correlation rho
data: ctcf.5000.rank[, 2] and ctcf.5000.rank[, 4]
S = 131, p-value = 0.8096
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.09174698

# Cor. between score and distance
cor.test(ctcf.5000.rank[,3],ctcf.5000.rank[,4], method="spearman")

Spearman's rank correlation rho
data: ctcf.5000.rank[, 3] and ctcf.5000.rank[, 4]
S = 210, p-value = 0.02742
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.75

# 10,000
ctcf.10000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/ctcf_10_000.txt", sep = "")
ctcf.10000.rank = matrix(data = NA, nrow = 11, ncol = 4, byrow = FALSE)
ctcf.10000.rank[,1] = ctcf.10000[,1]
ctcf.10000.rank[,2] = rank(ctcf.10000[,2])
ctcf.10000.rank[,3] = rank(ctcf.10000[,3])
ctcf.10000.rank[,4] = rank(ctcf.10000[,4])

# Cor. between exp and score
cor.test(ctcf.10000.rank[,2],ctcf.10000.rank[,3], method="spearman")

Spearman's rank correlation rho
data: ctcf.10000.rank[, 2] and ctcf.10000.rank[, 3]
S = 204, p-value = 0.8388
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.0697736

# Cor. between exp and distance
cor.test(ctcf.10000.rank[,2],ctcf.10000.rank[,4], method="spearman")

Spearman's rank correlation rho
data: ctcf.10000.rank[, 2] and ctcf.10000.rank[, 4]
S = 266, p-value = 0.5391
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.2093208

# Cor. between score and distance
cor.test(ctcf.10000.rank[,3],ctcf.10000.rank[,4], method="spearman")

Spearman's rank correlation rho
data: ctcf.10000.rank[, 3] and ctcf.10000.rank[, 4]
S = 376, p-value = 0.01873
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.7090909

################
# With No DMPK #
################

# 1,000
ctcf.1000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/ctcf_1_000_noDM.txt", sep = "")
ctcf.1000.rank = matrix(data = NA, nrow = 3, ncol = 4, byrow = FALSE)
ctcf.1000.rank[,1] = ctcf.1000[,1]
ctcf.1000.rank[,2] = rank(ctcf.1000[,2])
ctcf.1000.rank[,3] = rank(ctcf.1000[,3])
ctcf.1000.rank[,4] = rank(ctcf.1000[,4])
cor.test(ctcf.1000.rank[,2],ctcf.1000.rank[,3], method="spearman")

# 5,000
ctcf.5000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/ctcf_5_000_noDM.txt", sep = "")
ctcf.5000.rank = matrix(data = NA, nrow = 4, ncol = 4, byrow = FALSE)
ctcf.5000.rank[,1] = ctcf.5000[,1]
ctcf.5000.rank[,2] = rank(ctcf.5000[,2])
ctcf.5000.rank[,3] = rank(ctcf.5000[,3])
ctcf.5000.rank[,4] = rank(ctcf.5000[,4])
cor.test(ctcf.5000.rank[,2],ctcf.5000.rank[,3], method="spearman")

# 10,000
ctcf.10000 = read.delim("g:/My Documents/Bioinformatics/GeMS/stats/ctcf_10_000_noDM.txt", sep = "")
ctcf.10000.rank = matrix(data = NA, nrow = 5, ncol = 4, byrow = FALSE)
ctcf.10000.rank[,1] = ctcf.10000[,1]
ctcf.10000.rank[,2] = rank(ctcf.10000[,2])
ctcf.10000.rank[,3] = rank(ctcf.10000[,3])
ctcf.10000.rank[,4] = rank(ctcf.10000[,4])
cor.test(ctcf.10000.rank[,2],ctcf.10000.rank[,3], method="spearman")

#############
# Figure 10 #
#############
pdf("Figure_10_CTCF.pdf")
par(mfrow=c(2,2))
plot(x=ctcf.5000.rank[,3],y=ctcf.5000.rank[,2],xlab="CTCF Score (rank)",
     ylab="Median Expandability (rank)",
     main="Median Expandability vs.\nCTCF Score\n(5,000 bp)")
plot(x=ctcf.5000.rank[,4],y=ctcf.5000.rank[,2],xlab="Distance from CAG-repeat (rank)",
     ylab="Median Expandability (rank)",
     main="Median Expandability vs.\nDistance from Repeat\n(5,000 bp)")
plot(x=ctcf.5000.rank[,3],y=ctcf.5000.rank[,4],xlab="CTCF Score (rank)",
     ylab="Distance from Repeat (rank)",
     main="Distance of CTCF hit vs.\nCTCF Score\n(5,000 bp)")
abline(10,-1)
dev.off()

##################################
# Nucleosome Formation Potential #
##################################
names = c("genes","expandability","100")
nucleo = matrix(data = NA, nrow=7, ncol = 3, byrow = FALSE)
nucleo[,1] = c("SCA3_MJD","SCA7","SCA2","HD","DRPLA","SCA1","SBMA")
nucleo[,2] = c(0.07,1.30,0.97,0.29,0.19,0.14,0.08)
nucleo[,3] = c(0.823,-0.758,NA,0.035,0.624,0.265,-0.039)
cor.test(nucleo[,2],nucleo[,3],method="spearman")

Spearman's rank correlation rho
data: nucleo[, 2] and nucleo[, 3]
S = 48, p-value = 0.4972
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.3714286

#############
# Figure 11 #
#############
pdf("Figure_11_nucleo.pdf")
par(mfrow=c(1,1))
plot(x=rank(nucleo[,3]),y=rank(nucleo[,2]),xlab="Nucleosome Formation Potential (rank)",
     ylab="Median Expandability (rank)",
     main="Median Expandability vs.\nNucleosome Formation Potential\nof 100 bp flanking the repeat")
dev.off()

APPENDIX H Satellog MySQL Database commands

########################################
# MySQL Commands for Satellog Database #
########################################

The following document has all of the commands needed to recreate the Satellog database from scratch.
Broadly speaking, the tables can be divided into two classes: those that must be populated manually prior to running repeatalyzer.pl and those populated automatically when repeatalyzer.pl is run. The distinction between the two classes is made in the comments. Some comments are provided after each create table command to briefly explain what data the table holds. See the Satellog manuscript for further details.

#######################
# DROP TABLE COMMANDS #
#######################
DROP TABLE affy;
DROP TABLE build_info;
DROP TABLE ugcount;
DROP TABLE ugstats;
DROP TABLE ens_db;
DROP TABLE gc;
DROP TABLE go;
DROP TABLE mim;
DROP TABLE pdb;
DROP TABLE repeats;
DROP TABLE transcripts;

# Following tables are not automatically generated
# Are you sure you need to drop them?
DROP TABLE linkage;
DROP TABLE rep_stats;
DROP TABLE rep_class;
DROP TABLE GeneNote;
DROP TABLE disease;
DROP TABLE unigene;

#########################
# CREATE TABLE COMMANDS #
#########################
CREATE TABLE build_info (
    ens_db VARCHAR(50) NOT NULL PRIMARY KEY,
    db_name VARCHAR(50) NOT NULL,
    date_run DATETIME
);
# collects the name and version of the EnsEMBL databases used
# collects the date the database was populated

CREATE TABLE repeats (
    rep_id INT auto_increment NOT NULL PRIMARY KEY,
    chr VARCHAR(4),
    start INT UNSIGNED,
    end INT UNSIGNED,
    period TINYINT UNSIGNED,
    unit VARCHAR(16),
    class_id INT,
    seq VARCHAR(255),
    length INT UNSIGNED,
    pvalue DECIMAL(8,6) NULL
);
CREATE INDEX r_total ON repeats (rep_id,chr,start,end,period,unit,seq,length);
CREATE INDEX r_common ON repeats (rep_id,period,unit,length);
CREATE INDEX r_chr ON repeats (chr,start,end,period,unit,length);
CREATE INDEX r_total_period ON repeats (rep_id,period);
CREATE INDEX r_total_unit ON repeats (rep_id,unit);
CREATE INDEX r_total_length ON repeats (rep_id,length);
CREATE INDEX r_co_ords ON repeats (rep_id,chr,start,end);
CREATE INDEX rapid ON repeats (period,length);
CREATE INDEX r_length_class ON repeats (class,length);
# this is the primary organizing table of the database
# collects raw information from the Tandem Repeats Finder (TRF) raw output files

###########
# ugcount #
###########
CREATE TABLE ugcount (
    count_id INT auto_increment NOT NULL PRIMARY KEY,
    rep_id INT NOT NULL,
    cluster VARCHAR(20),
    sequence VARCHAR(20),
    length INT
);
CREATE INDEX ug_hits ON ugcount (rep_id,cluster,sequence,length);
CREATE INDEX length ON count (rep_id,length);

CREATE TABLE ugstats (
    count_id INT auto_increment NOT NULL PRIMARY KEY,
    rep_id INT NOT NULL,
    count INT NOT NULL,
    min INT NOT NULL,
    max INT NOT NULL,
    mean DECIMAL(8,2) NOT NULL,
    sd DECIMAL(8,2) NULL
);
CREATE INDEX ug_stats ON ugstats (rep_id,count,min,max,mean,sd);
CREATE INDEX sd ON ugstats (rep_id,sd);

CREATE TABLE transcripts (
    ts_id INT auto_increment NOT NULL PRIMARY KEY,
    rep_id INT NOT NULL,
    ens_ts VARCHAR(15),
    gene_location VARCHAR(20),
    pep VARCHAR(150) NULL,
    ens_id INT NOT NULL
);
CREATE INDEX t_common ON transcripts (rep_id,gene_location,pep,ens_id);
CREATE INDEX t_pep ON transcripts (rep_id,gene_location);
CREATE INDEX t_gene ON transcripts (rep_id,ens_id);

CREATE TABLE gc (
    rep_id INT NOT NULL PRIMARY KEY,
    100_bp DECIMAL(7,6) NOT NULL,
    500_bp DECIMAL(7,6) NOT NULL,
    1000_bp DECIMAL(7,6) NOT NULL
);
CREATE INDEX gc_100 ON gc (rep_id,100_bp);
CREATE INDEX gc_500 ON gc (rep_id,500_bp);
CREATE INDEX gc_1000 ON gc (rep_id,1000_bp);

CREATE TABLE ens_db (
    ens_id INT auto_increment NOT NULL PRIMARY KEY,
    ens_name VARCHAR(15) NOT NULL,
    name VARCHAR(15) NULL,
    description TEXT NULL,
    chr CHAR(2),
    start INT UNSIGNED,
    end INT UNSIGNED,
    strand CHAR(1)
);
CREATE INDEX ens_common ON ens_db (ens_id,ens_name,name);
CREATE INDEX ens_lookup ON ens_db (ens_name,ens_id);

CREATE TABLE go (
    go_id INT auto_increment NOT NULL PRIMARY KEY,
    ens_id INT NOT NULL,
    go_value VARCHAR(50) NOT NULL
);
CREATE INDEX go_go_value ON go (ens_id,go_value);

CREATE TABLE pdb (
    pdb_id INT auto_increment NOT NULL PRIMARY KEY,
    ens_id INT NOT NULL,
    domain VARCHAR(20) NOT NULL
);
CREATE INDEX pdb_domain ON pdb (ens_id,domain);

CREATE TABLE mim (
    mim_id INT auto_increment NOT NULL PRIMARY KEY,
    ens_id INT NOT NULL,
    mim_value VARCHAR(20) NOT NULL
);
CREATE INDEX mim_mim_value ON mim (ens_id,mim_value);

CREATE TABLE affy (
    affy_id INT auto_increment NOT NULL PRIMARY KEY,
    ens_id INT NOT NULL,
    g_id INT NOT NULL
);
CREATE INDEX affy_id_ref ON affy (ens_id,g_id);

# Following tables are not automatically generated
CREATE TABLE linkage (
    link_id INT auto_increment NOT NULL PRIMARY KEY,
    disease VARCHAR(10) NOT NULL,
    band VARCHAR(25),
    marker VARCHAR(50),
    chr CHAR(2),
    pstart INT,
    start INT,
    end INT,
    qend INT,
    ref INT,
    score DECIMAL(3,2),
    type VARCHAR(10),
    p_value DECIMAL(11,8),
    notes TEXT
);

CREATE TABLE rep_stats (
    class_id INT NOT NULL,
    chr VARCHAR(4),
    length INT UNSIGNED
);
CREATE INDEX stats_search ON rep_stats (class_id,length);
CREATE INDEX stats_search_chr ON rep_stats (chr,class_id,length);

CREATE TABLE GeneNote (
    g_id INT auto_increment NOT NULL PRIMARY KEY,
    id_ref VARCHAR(15) NOT NULL,
    value DECIMAL(10,1) NOT NULL,
    call CHAR(1) NOT NULL,
    tissue VARCHAR(15) NOT NULL,
    array CHAR(1) NOT NULL,
    number CHAR(4)
NOT NULL ) ; CREATE INDEX g_lookup ON GeneNote ( i d _ r e f , g _ i d ) ; CREATE INDEX g _ a l l ON GeneNote (g_ i d , i d _ r e f , value, c a l l , t i s s u e , array, number) ,-CREATE INDEX g_common ON GeneNote ( g _ i d , c a l l , t i s s u e ) ; CREATE TABLE r e p _ c l a s s ( c l a s s _ i d INT auto_increment NOT NULL PRIMARY KEY, c l a s s TEXT ) ; CREATE INDEX rc _ s e a r c h ON r e p _ c l a s s ( c l a s s (20) ASC , r e p _ c l a s s _ i d ) ; CREATE TABLE disease ( d i s e a s e _ i d INT auto_increment NOT NULL PRIMARY KEY, r e p _ i d INT, short_name VARCHAR(15), full_name VARCHAR(IOO) , name VARCHAR(15), ens_name VARCHAR(15) NOT NULL, norm_min INT, norm_max INT, dis_min INT, dis_max INT, locus CHAR, a n t i c i p a t i o n VARCHAR(2) ) ; CREATE TABLE unigene ( c l u s t e r _ i d INT auto_increment NOT NULL PRIMARY KEY, C l u s t e r VARCHAR(20), chr VARCHAR(4), s t a r t INT UNSIGNED, end INT UNSIGNED, b l a t s c o r e INT, i d e n t i t y DECIMAL(4,1) ) ; CREATE INDEX unigene_look_up ON unigene ( c l u s t e r _ i d , c l u s t e r , c h r , s t a r t , e n d ) ; CREATE TABLE c l a s s _ s t a t s ( c l a s s INT NOT NULL, length INT, pvalue DECIMAL(9,8) ) ; CREATE INDEX c l a s s _ s t a t s _ l o o k _ u p ON c l a s s _ s t a t s ( c l a s s , l e n g t h , p v a l u e ) ; CREATE TABLE r e p e a t s _ i n _ l i n k a g e ( r e p _ i d INT, disease VARCHAR(10) NOT NULL, l i n k _ i d INT ) ; CREATE INDEX r e p _ l i n k ON r e p e a t s _ i n _ l i n k a g e ( r e p _ i d , l i n k _ i d ) ; CREATE INDEX r e p _ l i n k _ d i s e a s e ON r e p e a t s _ i n _ l i n k a g e ( r e p _ i d , d i s e a s e , l i n k _ i d ) ; LOAD DATA INFILE '/home/perseusm/My_Documents/Publications/Satellog/Results/repeats_in_linkage.txt 1 INTO TABLE r e p e a t s _ i n _ l i n k a g e IGNORE 1 LINES; CREATE TABLE go_terms ( go_value VARCHAR(50) NOT NULL PRIMARY KEY, go_term VARCHAR(200), go_class VARCHAR(1) ) ; LOAD DATA INFILE 
1/home/perseusm/GO/go_terms.txt 1 INTO TABLE go_terms; 174 APPENDIX I Running TRF on v.34 whole chromosome fasta files from UCSC @@@@@@@@@@@@®@@@@@@@@ @ Human Genome v.34 @ @@@@@@®@@@@@®®@@@@®@® The human genome goldenpath f o r a l l chromosomes (excluding random chromosome DNA data) was saved i n /home/perseusm/goldenpath f o r subsequent a n a l y s i s . These f i l e s were downloaded from UCSC. How to download f a s t a f i l e s by FTP: f t p - i hgdownload.cse.ucsc.edu # - i turns o f f i n t e r a c t i v e mode, t h e r e f o r e no prompting during mget l o g i n u:anonymous p:your®email.com get f i l e s cd goldenPath/hgl6/chromosomes/ mget * @@@@®®® @ TRF ® @@®®®®@ We are i n t e r e s t e d i n developing our own repeat co-ordinates d i s t i n c t from the pre-computed co-ords provided by UCSC f o r two reasons: 1) We want t o detect repeats much smaller than the smallest at UCSC 2) We are on l y i n t e r e s t e d i n pure repeats The f o l l o w i n g parameters were recommended to detect the purest repeats p o s s i b l e w i t h TRF without running out of memory: S h e l l S c r i p t f o r TRF /home/perseusm/trf/trf321.linux.exe /home/perseusm/goldenpath/chr7.fa 3 4090 4090 80 10 30 16 -d; f o r f i l e 2 i n /home/perseusm/chr7*.html; do rm - i $ f i l e 2 - f ; done; f o r f i l e 3 i n /home/perseusm/chr7*.tmp; do rm - i $ f i l e 3 - f ; done This i s an i n t e r e s t i n g s h e l l s c r i p t here that gets r i d of a l l the html f i l e s spawned by TRF. This, much to my annoyment, was not an opt i o n that could be d i s a b l e d . You need to do each chromosome s e q u e n t i a l l y because temporary f i l e s are created that are needed i n the c r e a t i o n of the f i n a l .dat f i l e . The next t h i n g I wanted to do was to t e s t and ensure that i n f a c t only pure repeats were being detected by the s c r i p t . 
The following is a quick and dirty script that extracts the largest hits from the TRF .dat files. Because of the way the scoring algorithm works, larger repeats have a higher chance of tolerating indels and substitutions. I wanted to make sure the TRF parameters I selected reported only pure hits.

# Execute purity test script
/home/perseusm/goldenpath/3.4090.4090.80.10.30.16/parse_dis.pl > purity_test.txt

# Code for purity test script

################
# parse_dis.pl #
################

#!/usr/bin/perl
use strict;

my $chrom;

while (<>) {
    chomp;
    if ($_ =~ /chr(\S+)/) {
        # sequence header line: remember the current chromosome
        $chrom = $1;
    }
    elsif ($_ =~ /^(\d+)\s+(\d+)\s+(\d+)\s+(\d+\.\d+)\s+(\d+)\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\.\d+\s+(\S+)\s+(\S+)/) {
        # pull out coords of interest
        my $chromStart   = $1;
        my $chromEnd     = $2;
        my $rptPeriod    = $3;
        my $rptSize      = $4;
        my $rptConsensus = $5;
        my $rptUnit      = $6;
        my $rpt          = $7;
        my $rptLength    = length $rpt;
        # report only 16mers present in more than 10 copies
        if (($rptPeriod == 16) && ($rptSize > 10)) {
            print "$chrom\t$chromStart\t$chromEnd\t$rptUnit\t$rpt\t$rpt\t$rptPeriod\t$rptSize\n";
        }
    }
}

#################
# End of Script #
#################

The contents of /home/perseusm/goldenpath/3.4090.4090.80.10.30.16/purity_test.txt indicate that all the largest hits are pure, which implies that all smaller hits are pure as well. You should go through this file manually and ensure each hit is pure, i.e. composed only of tandem repeat units.

APPENDIX J
Generating the repeat classifier

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ Determining the distinct repeat classes @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

Mathematically, if one is looking at all 1mers to 16mers then there is a huge number of potential combinations.
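To put a rough number on that statement, the short Perl sketch below counts every possible unit string of length 1 to 16 over the four bases. The subroutine name count_possible_units is only illustrative and is not part of Satellog.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count every possible unit string of length 1..$max_k over the alphabet A, C, G, T.
# (count_possible_units is a hypothetical helper, not a Satellog routine.)
sub count_possible_units {
    my ($max_k) = @_;
    my $total      = 0;
    my $per_length = 1;
    for my $k (1 .. $max_k) {
        $per_length *= 4;          # 4^k strings of length k
        $total      += $per_length;
    }
    return $total;
}

print count_possible_units(16), "\n";   # prints 5726623060
```

This figure is an upper bound on raw strings only. It still counts rotations, reverse complements and units that are themselves repeats of a shorter unit (for example AAA) as separate entries, so the number of distinct repeat families is considerably smaller.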
Not all mathematically predicted repeats exist in the genome (at least not as pure repeats). For the pure repeats that do exist in the human genome we need a way of classifying them into their repeat families. For example, CAG, AGC, GCA, CTG, TGC and GCT are all the same repeat (the rotations of the unit and of its reverse complement). They are detected distinctly because of the way TRF works (see text). To handle this we developed a repeat class that lists all repeats in a family. We also had to design it so that it could be searched specifically for each repeat. To do this, all detected repeats had to be collected from the directory containing the TRF output files as follows:

for file in *.dat; do cut -d " " -f 14 $file; done > rep_units_4090

This cuts the TRF output for each chromosome, separated by spaces, at the 14th column (the column containing the repeat unit) and writes it to rep_units_4090. One erroneous repeat unit was detected with AAA instead of A. The cause of this error is unknown, but this class of repeat is represented by the A.
So this value was changed in chr14.fa.3.4090.4090.80.10.30.16.dat, and the change is reflected in rep_units_4090.2.

Original erroneous hit in chr14.fa.3.4090.4090.80.10.30.16.dat:

chr 14 102425491 102425500 3 3.3 3 100 0 30 100 0 0 0 0.00 AAA AAAAAAAAAA

Next, we sought to determine all the distinct repeats. This was accomplished by making a temporary "staging" table:

CREATE TABLE rep_class (
    class CHAR(16) PRIMARY KEY
);

and inserting all the repeats into this table:

LOAD DATA INFILE '/home/perseusm/goldenpath/3.4090.4090.80.10.30.16/rep_units_4090.2' IGNORE INTO TABLE rep_class;

From this table, unique repeats were selected with the following shell script:

echo "SELECT DISTINCT class FROM rep_class" | mysql --quick -h athena -u schz_rw -prepeat schz_db > repeats_4090.txt

Next we needed to generate all the distinct repeat classes possible from the distinct repeats in our dataset. /home/perseusm/progs/repeat_classer/repeat_classer2.pl does this.

# Execute
/home/perseusm/progs/repeat_classer/repeat_classer2.pl repeats_4090.txt

############################
# Begin repeat_classer2.pl #
############################

#!/usr/bin/perl
# repeat_classer2.pl
# usage: rep_classer2.pl rep_units.txt
# run this script before repeatalyzer.pl
# detects what macroclass each repeat belongs to
# rep_units.txt is a list of all distinct types of repeats in the db
# Perseus Missirlis - Mar 18, 2004

use strict;

my $rep = "rep_class.txt";
unless ( open(REP_CLASS, ">$rep") ) {
    die "Cannot open file \"$rep\" to write to!\n\n";
}

my $i = 1;

while (<>) {
    chomp;
    if ($_ =~ /(\S+)/) {
        print "line $i matched: $1\n\n";
        my @units;
        my $rep_class;
        my @repeat = split('', $1);
        my $lengther = scalar @repeat;
        my $n = 0;
        print "this is the repeat: @repeat\n";
        print "this is the length of the repeat: $lengther\n\n";
        my $x = 0;
        while ($n < $lengther) {
            # rotate the unit left by one base
            my $for = shift @repeat;
            $repeat[($lengther - 1)] = $for;
            my $for_run = "@repeat";
            $for_run =~ s/\s//g;
            # reverse complement of the rotated unit
            my $rev_run = reverse $for_run;
            $rev_run =~ tr/ATGCatgc/TACGtacg/;
            print "this is the repeat after popping it: @repeat\nthis is the variable for regex: $for_run\nthis is the variable in reverse: $rev_run\n\n";
            $units[$x] = $for_run;
            $x++;
            $units[$x] = $rev_run;
            $x++;
            $n++;
        }
        print "this is the units array: @units\n";
        @units = sort(@units);
        print "sorted units:\n";
        foreach my $unit (@units) {
            print $unit . "\n";
            $rep_class .= $unit . "o";
        }
        # the class string is every family member, delimited by "o"
        $rep_class = "o" . $rep_class;
        print "this is the rep_class: $rep_class\n\n";
        print REP_CLASS $rep_class . "\n";
        $i++;
    }
}

close(REP_CLASS);
exit;

##############
# End script #
##############

This script generates all the distinct classes in a file, repeats_classes_4090.txt, located at:

/home/perseusm/progs/repeat_classer/repeats_classes_4090.txt

This file is fed through the rep_class staging table, as above, to obtain the distinct classes, which are saved in a file also called repeats_classes_4090.txt.
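The classification idea used by repeat_classer2.pl can be condensed into the following minimal sketch. It is not the script itself: the subroutine names reverse_complement and repeat_class are assumptions made for illustration, and unlike the original it upper-cases the unit and removes duplicate family members (which arise for units such as AT) by collecting them in a hash.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reverse complement of a DNA string (hypothetical helper).
sub reverse_complement {
    my ($seq) = @_;
    my $rc = scalar reverse $seq;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Build an "o"-delimited class string from every rotation of a unit
# and the reverse complement of every rotation (hypothetical helper).
sub repeat_class {
    my ($unit) = @_;
    my %members;
    my $rotated = uc $unit;
    for (1 .. length $rotated) {
        # rotate left by one base
        $rotated = substr($rotated, 1) . substr($rotated, 0, 1);
        $members{$rotated} = 1;
        $members{ reverse_complement($rotated) } = 1;
    }
    return 'o' . join('o', sort keys %members) . 'o';
}

print repeat_class('CAG'), "\n";   # prints oAGCoCAGoCTGoGCAoGCToTGCo
print repeat_class('GCT'), "\n";   # prints oAGCoCAGoCTGoGCAoGCToTGCo
```

Because every member of a family yields the same string, the string can serve as a class key, and the "o" delimiters on both sides of each unit mean a single unit can presumably be located with a pattern such as %oCAGo% without matching inside a longer unit.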
The final class list is located at:

/home/perseusm/progs/repeat_classer/repeats_classes_4090.txt

Now create the final, usable rep_class table:

CREATE TABLE rep_class (
    rep_class_id INT auto_increment NOT NULL PRIMARY KEY,
    class TEXT
);
CREATE INDEX rc_search ON rep_class (class(50) ASC, rep_class_id);

which can be fed into the db using LOAD DATA or the following script:

# execute:
/home/perseusm/progs/repeat_classer/reg_ex_test.pl repeats_classes_4090.txt

##################
# reg_ex_test.pl #
##################

#!/usr/bin/perl
# reg_ex_test.pl
# run this script before repeatalyzer.pl
# detects what macroclass each repeat belongs to
# rep_units.txt is a list of all distinct types of repeats in the db
# Perseus Missirlis - Mar 18, 2004

use strict;
use DBI;

# DBI
my ($dsn) = "DBI:mysql:schz_db:athena.bcgsc.ca";
my ($user_name) = "schz_rw";
my ($password) = "repeat";
my ($dbh, $sth);
my (@ary);

#######################
# Connect to Database #
#######################

$dbh = DBI->connect ($dsn, $user_name, $password, { RaiseError => 1 });

my $i = 1;

while (<>) {
    chomp;
    if ($_ =~ /(\S+)/) {
        # print "line: $i - Inserting: $1\n\n";
        $sth = $dbh->prepare ("INSERT INTO rep_class VALUES(NULL,'$1')");
        $sth->execute ();
        $i++;
    }
}

exit;

##############
# End script #
##############

APPENDIX K
Downloading and populating the GeneNote tables in Satellog

@@@@@@@@@@@@@@@@@
@ GeneNote Data @
@@@@@@@@@@@@@@@@@

Get the GeneNote dataset from GEO: http://www.ncbi.nlm.nih.gov/geo/
Dataset ID: GSE803

# Execute
/home/perseusm/genenote/genenote_parser.pl GSE803.txt

######################
# genenote_parser.pl #
######################

#!/usr/bin/perl
# genenote_parser.pl
# usage: genenote_parser.pl GSE803.txt
# Perseus Missirlis - Jan 15, 2004

use strict;
use DBI;

####################
# Global Variables #
####################

# DBI
my ($dsn) = "DBI:mysql:schz_db:athena.bcgsc.ca";
my ($user_name) = "schz_rw";
my ($password) = "repeat";
my ($dbh, $sth);
my (@ary);

#######################
# Connect to Database #
#######################

$dbh = DBI->connect ($dsn, $user_name, $password, { RaiseError => 1 });

##############################################################################
# read in file specified on the cmd line                                     #
# read multiple coords of CAG/CTG repeat locations from STDIN one line at a time #
##############################################################################

my $tissue;
my $array;
my $number;
my $last_insert_id;

while (<>) {
    chomp;
    if ($_ =~ /\!Sample_title\s+\=\s+Normal\s+((\S+\s+\S+)|\S+)\s+\S+\s+(\S+)\s+\S+\s+(\S+)/) {
        # sample header: remember tissue, array and replicate number
        $tissue = $1;
        $array  = $3;
        $number = $4;
    }
    elsif ($_ =~ /(\S+)\s+(\d+\.\d+)\s+(\S)/) {
        # data row: probe id, expression value, detection call
        my $id_ref = $1;
        my $value  = $2;
        my $call   = $3;
        # print "id_ref $id_ref value $value call $call tissue $tissue array $array number $number\n";
        $sth = $dbh->prepare ("INSERT INTO GeneNote VALUES(NULL,'$id_ref','$value','$call','$tissue','$array','$number')");
        $sth->execute ();
        $last_insert_id = $sth->{mysql_insertid};
        print "The id of the last record inserted into the db is $last_insert_id\n";
    }
}
print "outside of loop this is the last record $last_insert_id\n\n";

##############
# End script #
##############

This will populate the GeneNote table and have it ready for future queries.

APPENDIX L
Downloading and processing UniGene data

We were curious whether there was any indication of repeat polymorphism in the UniGene clusters posted at NCBI. repeatalyzer.pl automatically evaluates each repeat for polymorphisms within UniGene clusters. To do this, however, we need the clusters and all sequences:

How to download fasta files by FTP:

# open FTP connection to NCBI
$ ftp -i ftp.ncbi.nih.gov

# login
u:anonymous
p:your@email.com

# change to UniGene directory
cd /repository/UniGene

# download all human UniGene files
mget Hs*

Convert FASTA formatting of Hs.seq.uniq file

The Hs.seq.uniq file contains all sequences representing the longest, highest quality stretch of DNA for each particular UniGene cluster. We will be using the BLAT algorithm to see if each repeat plus 10 bp of upstream and downstream genomic sequence can be detected within these sequences. The FASTA files provided by NCBI have a long, somewhat cumbersome naming convention that is too big for the BLAT output. For example, the FASTA header for Hs.2 is:

>gnl|UG|Hs#S1728506 Homo sapiens N-acetyltransferase 2 (arylamine N-acetyltransferase) (NAT2), mRNA /cds=(108,980) /gb=NM_000015 /gi=4557782 /ug=Hs.2 /len=1276

From this, we only really need the cluster identifier (Hs.2) and the UniGene identifier for this sequence within Hs.2 (Hs#S1728506). Run the following command-line perl script to format this file for subsequent BLAT analysis:

$ perl -i.bak -p -e 's/^.*(Hs\#\S+).*\/ug\=(\S+).*$/>$2\|$1/g' Hs.seq.uniq

The FASTA header for all sequences in Hs.seq.uniq now has the form:

>Hs.2|Hs#S1728506

Now rename this file to Hs.seq.uniq2:

$ mv Hs.seq.uniq Hs.seq.uniq2

And rename the back-up file created by the command-line script to the original name:

$ mv Hs.seq.uniq.bak Hs.seq.uniq

Make Hs.seq.uniq2 into a BLATable database

BLAT requires multiple FASTA files to be converted to the .2bit file format in order to process them.

$ ~/blat/faToTwoBit Hs.seq.uniq2 Hs.seq.uniq2.2bit

Remember where this file is; it is required by repeatalyzer to work.

Split the UniGene clusters into cluster-delineated multiple FASTA files

The Hs.seq.all file from UniGene is essentially one huge flat file. Within this file, UniGene clusters are delimited by a # followed by the collection of sequences that make up the UniGene cluster. For repeatalyzer to work, the UniGene clusters need to be parsed into separate files, each representing one cluster with all of its associated sequences. The Hs.seq.all file was parsed by the following script:

# make a new directory (105680 files will be created!)
# make a note of the absolute location of these files
# they will be needed by repeatalyzer
$ mkdir ugc_fasta

# run the script
$ ./parse_unigene_all.pl Hs.seq.all

# Code for parsing Hs.seq.all

########################
# parse_unigene_all.pl #
########################

#!/usr/bin/perl -w
# parse_unigene_all.pl

use strict;

my $outputfile = "frig";
my $i;
my $count;

while (<>) {
    if (/^\s+$/) {
        next;
    }
    elsif (/#.*containing\s+(\d+)/) {
        # cluster delimiter line: note how many sequences follow
        $count = $1;
        $i = 1;
        print "conditional 2: $count\n";
    }
    elsif (/^(>.*\/ug\=(\S+).*$)/) {
        if ($outputfile eq "frig") {
            # very first cluster: open its output file
            $outputfile = $2;
            unless ( open(SEQ, ">$outputfile\.ugc") ) {
                die "Cannot open file \"$outputfile\" to write to!\n\n";
            }
            print SEQ "$1\n";
            $i++;
        }
        elsif (($outputfile ne "frig") && ($i == 1)) {
            # first sequence of a new cluster: switch output files
            close(SEQ);
            $outputfile = $2;
            unless ( open(SEQ, ">$outputfile\.ugc") ) {
                die "Cannot open file \"$outputfile\" to write to!\n\n";
            }
            print SEQ "$1\n";
            $i++;
        }
        elsif (($outputfile ne "frig") && ($i <= $count)) {
            print SEQ "$1\n";
            $i++;
        }
    }
    elsif (/(\S+)/) {
        # sequence line
        print SEQ "$1\n";
    }
}

exit;

APPENDIX M
Mapping the unique UniGene clusters to the human genome

Split the unique UniGene clusters into individual FASTA files

Each unique UniGene cluster was mapped to the human genome. Doing so exploited all of the sequence information within each cluster and allowed us to control false positives. Whenever a repeat was detected in a UniGene cluster that did not map within 10 kb of the repeat's chromosomal co-ordinates, it was not evaluated further. The Hs.seq.uniq file contains a single sequence representing the longest, highest quality stretch of DNA for each UniGene cluster. These sequences were parsed into individual FASTA files so that they could be BLATed against the human genome.
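The 10 kb rule described above can be sketched as a small coordinate check. This is only an illustration under stated assumptions: the subroutine names interval_gap and maps_near_repeat are hypothetical, "within 10 kb" is interpreted here as the gap between the two intervals, and the UniGene hit co-ordinates are made up. The repeat co-ordinates are the Huntington's Disease entry from Table 17.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Gap in bp between two intervals on the same chromosome (0 if they overlap).
# Hypothetical helper, not part of repeatalyzer.pl.
sub interval_gap {
    my ($s1, $e1, $s2, $e2) = @_;
    return $s2 - $e1 if $s2 > $e1;
    return $s1 - $e2 if $s1 > $e2;
    return 0;
}

# True (1) if a mapped UniGene cluster lies on the repeat's chromosome and
# within $window bp of it. Interpreting "within 10 kb" as an interval gap
# is an assumption made for this sketch.
sub maps_near_repeat {
    my ($repeat, $hit, $window) = @_;
    return 0 unless $repeat->{chr} eq $hit->{chr};
    my $gap = interval_gap($repeat->{start}, $repeat->{end},
                           $hit->{start},    $hit->{end});
    return $gap <= $window ? 1 : 0;
}

my $repeat   = { chr => '4', start => 3108016, end => 3108074 };  # from Table 17
my $near_hit = { chr => '4', start => 3100000, end => 3105000 };  # made-up hit
my $far_hit  = { chr => '4', start => 3200000, end => 3201000 };  # made-up hit

print maps_near_repeat($repeat, $near_hit, 10_000), "\n";   # prints 1
print maps_near_repeat($repeat, $far_hit,  10_000), "\n";   # prints 0
```

A cluster failing this check would simply be skipped for that repeat, which is how the mapping step limits false positives from clusters that contain a similar repeat elsewhere in the genome.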
# create a new directory
$ mkdir ugc_unique

# parse Hs.seq.uniq
# use original Hs.seq.uniq, not Hs.seq.uniq2 from Appendix L
# parse_unigene_unique.pl will format output files correctly
$ parse_unigene_unique.pl Hs.seq.uniq

###########################
# parse_unigene_unique.pl #
###########################

#!/usr/bin/perl -w
# parse_unigene_unique.pl

use strict;

my $outputfile = "frig";

while (<>) {
    if (/^.*(Hs\#\S+).*\/ug\=(\S+).*$/) {
        if ($outputfile eq "frig") {
            # first header: open a file named after the cluster
            $outputfile = "$2";
            unless ( open(SEQ, ">$outputfile") ) {
                die "Cannot open file \"$outputfile\" to write to!\n\n";
            }
            print SEQ ">$2\|$1\n";
        }
        elsif ($outputfile ne "frig") {
            # subsequent headers: close the previous cluster file first
            close(SEQ);
            $outputfile = "$2";
            unless ( open(SEQ, ">$outputfile") ) {
                die "Cannot open file \"$outputfile\" to write to!\n\n";
            }
            print SEQ ">$2\|$1\n";
        }
    }
    elsif (/\S+/) {
        # sequence line
        print SEQ "$&\n";
    }
}

close(SEQ);
exit;

Set up a BLAT server of all human chromosomes

Create a BLAT server using all the human chromosomes from UCSC. Soft mask (-mask) the sequences so that repeats are not allowed to initiate an alignment but can be used to extend an alignment.

/home/perseusm/blat/gfServer -mask -canStop start OofO 8050 /home/perseusm/goldenpath/*.nib

Run mapugc.pl on each unique UniGene sequence

# use the following shell script to run perl script on each unique UniGene sequence
for file in /home/unigene/ugc_unique/Hs.*; do ./mapugc.pl $file; done;

Warning: You need to manually set the gfClient command in the following script to match the host and port settings from the gfServer command used to set up the BLAT server.
The $cutoff variable is set to 85%; that is, only BLAT hits scoring above 85% of the maximum (calculated for a perfect hit) are entered into the database.

#############
# mapugc.pl #
#############

#!/usr/bin/perl
# mapugc.pl UniGene_Cluster.fa
# use a shell "for file in Hs.*" loop to parse each individual UniGene fa file
# blats each unique unigene sequence against human genome
# if score is at least 85% of the theoretical max (i.e. all bases match) and 90% of input bases match somewhere, it qualifies as a hit
# use this script to fish out chromosome co-ordinates for unigene clusters
# make sure /home/unigene/ugc_unique/ has no *.out files in it before running this program
# find -name "*.out" -print0 | xargs -0 ls
# find -name "*.out" -print0 | xargs -0 rm -f
# make sure BLAT gfClient host and port match your gfServer settings

use DBI;
use strict;

# DBI
my ($dsn) = "DBI:mysql:schz_db:athena.bcgsc.ca";
my ($user_name) = "schz_rw";
my ($password) = "repeat";
my ($dbh, $sth);
my (@ary);

#######################
# Connect to Database #
#######################

$dbh = DBI->connect ($dsn, $user_name, $password, { RaiseError => 1 });
$sth = $dbh->prepare ("INSERT INTO unigene VALUES(NULL,?,?,?,?,?,?)");

my $blat_file = $ARGV[0];
print "file to blat is $blat_file\n\n";

# total query length = theoretical maximum BLAT score
my $total;
unless ( open(FILE, "$blat_file") ) {
    die "Cannot open file \"$blat_file\" to read from!\n\n";
}
while (<FILE>) {
    chomp;
    if (/^(\w+)$/) {
        my $seq = $1;
        $seq =~ s/\s+//g;
        my $length = length($seq);
        $total += $length;
    }
}
my $cutoff = 0.85*$total;
print "$total\n";
print "cutoff is $cutoff\n\n";
close(FILE);

print "BLAT query\n\n";
print `/home/perseusm/blat/gfClient oOOOl 8050 / $blat_file $blat_file.out`;

my $blat_hits = "$blat_file.out";
print "open this file: $blat_hits\n\n";
close(HITS);
unless ( open(HITS, "$blat_hits") ) {
    die "Cannot open file \"$blat_hits\" to read from!\n\n";
}

while (<HITS>) {
    if (/^(\d+)\s+(\d+)\s+\d+\s+\d+\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+\S+\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+\S+\s+(\S+)\s+(\S+)/) {
        print " 1 $1 2 $2 3 $3 4 $4 5 $5 6 $6 7 $7 8 $8 9 $9 10 $10 11 $11 12 $12 13 $13 \n\n";
        my $match     = $1;
        my $mismatch  = $2;
        my $qGapCount = $3;
        my $qGapBases = $4;
        my $tGapCount = $5;
        my $tGapBases = $6;
        my $query     = $7;
        my $qSize     = $8;
        my $qStart    = $9;
        my $qEnd      = $10;
        my $chr       = $11;
        my $start     = $12;
        my $end       = $13;

        # query name has the form Hs.2|Hs#S1728506
        my $ug_cluster  = $7;
        my $ug_sequence = $ug_cluster;
        $ug_cluster  =~ s/^(Hs\.\S+)\|.*/$1/;
        $ug_sequence =~ s/^.*\|(Hs\#.*)/$1/;

        my $blatScore = $match - $mismatch - $qGapCount - $tGapCount;
        my $identity  = ($match)/($match + $mismatch + $qGapCount);

        $chr =~ s/chr//g;
        $identity = $identity * 100;
        $identity = sprintf "%.1f", $identity;

        print "this is the cluster: $ug_cluster\n";
        print "this is the sequence: $ug_sequence\n";
        print "maps to $chr\:$start\-$end in the human genome\n";
        print "identity: $identity score: $blatScore\n\n";

        if (($blatScore > $cutoff) && ($identity > 90)) {
            print "it's a keeper\n";
            print "this is the cluster: $ug_cluster\n";
            print "this is the sequence: $ug_sequence\n";
            print "maps to $chr\:$start\-$end in the human genome\n";
            print "identity: $identity score: $blatScore\n\n";
            # insert hit into database
            $sth->execute ($ug_cluster,$chr,$start,$end,$blatScore,$identity);
        }
    }
}

# system("rm $blat_file.out -f");
exit;

APPENDIX N
Generating the percentile rank for each repeat (p-values)

We were interested in knowing how significant a repeat length was relative to all other repeats of the same class in the human genome. In the case of CAG class repeats, longer coding repeats in the human genome are more unstable relative to shorter repeats. For each repeat class, we collected counts of repeats at each length and stored them in a table called class_stats with the following script.

#######################
# generate_pvalues.pl #
#######################

#!/usr/bin/perl
# generate_pvalues.pl
# Perseus Missirlis
# This script loops through the distinct repeat classes in Satellog and collects counts of all the lengths for all repeat classes
# Results are stored in the class_stats table in Satellog

use strict;
use DBI;

# DBI
my ($dsn) = "DBI:mysql:schz_db:athena.bcgsc.ca";
my ($user_name) = "schz_rw";
my ($password) = "repeat";
my ($dbh, $sth);
my (@ary);

#######################
# Connect to Database #
#######################

$dbh = DBI->connect ($dsn, $user_name, $password, { RaiseError => 1 });

my $sth1 = $dbh->prepare ("SELECT COUNT(*) AS count FROM repeats WHERE class = ?");
# lengths must be visited in ascending order for the running total below
my $sth2 = $dbh->prepare ("SELECT length, COUNT(length) AS count FROM repeats WHERE class = ? GROUP BY length ORDER BY length;");
my $sth3 = $dbh->prepare ("INSERT INTO class_stats VALUES(?,?,?)");

#####################
# Extract for query #
#####################

my $i = 1;
print "i is $i\n\n";

# 90700
while ($i < 90700) {
    print "i is $i\n\n";

    # total number of repeats in this class
    $sth1->execute($i);
    my $count;
    while ( my $href = $sth1->fetchrow_hashref ) {
        $count = $href->{count};
        print "$i\t$count\n";
    }

    $sth2->execute($i);
    my $fraction_larger = $count;
    my $pvalue;
    while ( my $href = $sth2->fetchrow_hashref ) {
        my $length = $href->{length};
        my $length_count = $href->{count};
        # p-value = fraction of repeats in the class that are at least this long
        $pvalue = $fraction_larger / $count;
        print "$i\t$length\t$length_count\t$fraction_larger\t$pvalue\n";
        $sth3->execute($i,$length,$pvalue);
        $fraction_larger = ($fraction_larger - $length_count);
    }
    $i++;
}

print "done\n\n";

Lastly, each rep_id in the repeats table had its class and length queried against the class_stats table to extract a p-value for each repeat.

APPENDIX O
Disease-associated repeats

Table 17: Summary of disease-associated repeats from Cleary and Pearson, 2003 as detected in Satellog. Each disease is associated with one or more repeat co-ordinates.
Disease | chr | start | end | unit | length
prostate cancer risk | 20 | 46953964 | 46953975 | CAG | 4
prostate cancer risk | 20 | 46965237 | 46965257 | GCA | 7
prostate cancer risk | 20 | 46965259 | 46965287 | CAG | 9
dentatorubral-pallidoluysian atrophy / Haw River Syndrome | 12 | 6916153 | 6916199 | CAG | 15
Huntington's Disease | 4 | 3108016 | 3108074 | CAG | 19
Huntington's Disease-like 2 | 16 | 87419384 | 87419431 | CTG | 16
spinal and bulbar muscular atrophy | X | 65631950 | 65632018 | GCA | 23
spinal and bulbar muscular atrophy | X | 65632034 | 65632052 | GCA | 6
spinocerebellar ataxia 1 | 6 | 16435844 | 16435887 | TGC | 14
spinocerebellar ataxia 1 | 6 | 16435895 | 16435934 | TGC | 13
spinocerebellar ataxia 2 | 12 | 110448707 | 110448734 | GCT | 9
spinocerebellar ataxia 2 | 12 | 110448736 | 110448776 | TGC | 13
spinocerebellar ataxia 3 / Machado-Joseph Disease | 14 | 90527396 | 90527419 | CTG | 8
spinocerebellar ataxia 6 | 19 | 13179673 | 13179712 | CTG | 13
spinocerebellar ataxia 7 | 3 | 63855699 | 63855730 | GCA | 10
spinocerebellar ataxia 17 | 6 | 170727556 | 170727614 | CAG | 19
infantile spasm syndrome | X | 24393203 | 24393234 | GCC | 10
cleidocranial dysplasia | 6 | 45437342 | 45437356 | GGC | 5
cleidocranial dysplasia | 6 | 45437358 | 45437374 | GCG | 5
hand-foot-genital syndrome | 7 | 26981724 | 26981734 | GCC | 3
hand-foot-genital syndrome | 7 | 26981742 | 26981753 | GCC | 4
hand-foot-genital syndrome | 7 | 26981820 | 26981830 | GCC | 3
hand-foot-genital syndrome | 7 | 26981847 | 26981858 | GCC | 4
synpolydactyly | 2 | 177160255 | 177160270 | GGC | 5
synpolydactyly | 2 | 177160330 | 177160344 | GGC | 5
synpolydactyly | 2 | 177160355 | 177160371 | GCG | 5
oculopharyngeal muscular dystrophy | 14 | 21780809 | 21780829 | GGC | 7
holoprosencephaly | 13 | 98332395 | 98332411 | GCG | 5
holoprosencephaly | 13 | 98332445 | 98332454 | GGC | 3
holoprosencephaly | 13 | 98335704 | 98335714 | GCG | 3
holoprosencephaly | 13 | 98335716 | 98335729 | GCG | 4
holoprosencephaly | 13 | 98335731 | 98335744 | GCG | 4
Myotonic Dystrophy | 19 | 50965303 | 50965364 | CAG | 20
unknown | 14 | 75483803 | 75483834 | TGC | 10
unknown | 14 | 75483836 | 75483855 | TGC | 6
possible bipolar disorder | 18 | 51402372 | 51402447 | AGC | 25
spinocerebellar ataxia 12 | 5 | 146286801 | 146286832 | GCT | 10
Fragile X (A subtype) | X | 145661208 | 145661218 | GGC | 3
Fragile X (A subtype) | X | 145661287 | 145661316 | GGC | 10
Fragile X (E subtype) | X | 146287684 | 146287698 | GCC | 5
Fragile X (E subtype) | X | 146287711 | 146287757 | GCC | 15
Fragile X (E subtype) | X | 146287806 | 146287815 | CCG | 3
Fragile X (E subtype) | X | 146287817 | 146287826 | GCC | 3
Fragile X (E subtype) | X | 146288149 | 146288162 | CCG | 4
Fragile X (F subtype) | X | 147419081 | 147419105 | CGC | 8
Jacobsen Syndrome | 11 | 118614652 | 118614685 | CGG | 11
Myotonic Dystrophy 2 | 3 | 130212329 | 130212359 | CAGG | 7
progressive myoclonic epilepsy type 1 | 21 | 44052526 | 44052562 | GCGCGGGGCGGG | 3
spinocerebellar ataxia 10 | 22 | 44467801 | 44467870 | ATTCT | 14
Friedreich's ataxia | 9 | 67109320 | 67109339 | AAG | 6
spinocerebellar ataxia 8 | 13 | 68511517 | 68511562 | CTG | 15

APPENDIX P Schizophrenia and bipolar disorder linkage regions (adapted from (Sklar, 2002)).

Table 18: Summary of schizophrenia and bipolar disorder linkage regions from (Sklar, 2002). This table summarizes the linkage studies in the paper and includes the cytogenetic band and genetic marker (with co-ordinates) of each study cited in the review. The ref column gives the PubMed ID of each linkage study. This represents a portion of the linkage table in Satellog.
Disease | band | marker | chr | start | end | ref
BP | 13q32.3 | D13S1271-D13S779 | 13 | 97566284 | 99202143 | 10374733
BP | 13q22.1 | D13S800 | 13 | 71672693 | 71672988 | 9184308
BP | 13q32.1 | D13S793 | 13 | 95549764 | 95750042 | 9184308
BP | 13q32.1 | D13S154 | 13 | 93960285 | 93960543 | 11149935
BP | 13q32.3 | D13S225-D13S796 | 13 | 99244463 | 105587128 | 10631152
BP | 13q32.3 | D13S225-D13S796 | 13 | 99244463 | 105587128 | 11673797
BP | 13q32.3 | D13S779 | 13 | 99201956 | 99202143 | 11673797
BP | 21q22.3 | D21S171 | 21 | 44848869 | 44848988 | 7647797
BP | 21q22.3 | D21S1260 | 21 | 41716438 | 41716647 | 9915960
BP | 21q22.1 | D21S1254 | 21 | 33995763 | 33996026 | 9184307
BP | 21q21.2-21q21.3 | D21S265 | 21 | 25841358 | 25841605 | 9184307
BP | 21q22.12-21q22.13 | D21S1252 | 21 | 36747281 | 36747527 | 9184307
BP | 21q22.13 | D21S1440 | 21 | 38062023 | 38062184 | 9184307
BP | 22q11.22 | D22S303 | 22 | 21599366 | 21599581 | 9129709
BP | 22q12.3 | D22S278 | 22 | 34678466 | 34678703 | 11149935
BP | 22q12.3 | D22S278 | 22 | 34678466 | 34678703 | 11149935
BP | 22q11.23-22q12.1 | D22S419 | 22 | 24267850 | 24268118 | 9184305
BP | 22q12.1-22q12.3 | D22S689-D22S685 | 22 | 27181014 | 27181237 | 10318931
BP | 18p11.22 | D18S21 | 18 | 8552482 | 8552642 | 8016089
BP | NULL | D18S32 | 18 | 0 | 0 | 9006397
BP | 18p11.22 | D18S53 | 18 | 11482737 | 11482915 | 9529343
BP | 18q21.31 | D18S41 | 18 | 52500001 | 54600000 | 9529343
BP | 18q23 | D18S554 | 18 | 73170118 | 73170331 | 8630501
BP | 18q23 | D18S70 | 18 | 75963363 | 75963476 | 8630501
BP | 18q21.33 | D18S51 | 18 | 59097813 | 59098118 | 8731454
BP | 18q22.2 | D18S61 | 18 | 65585088 | 65585244 | 8731454
BP | 18q22.3 | D18S541 | 18 | 68323159 | 68323445 | 9399888
BP | 18q22-23 | NULL | 18 | 59800001 | 76115139 | 10089014
BP | 18q12.3 | D18S1145 | 18 | 35400001 | 41700000 | 11673797
BP | 12q24.11 | ATP2A2 | 12 | 109182151 | 109251615 | 8199789
BP | 12q24.31 | D12S1639 | 12 | 124731268 | 124731488 | 9800214
BP | 12q24.21 | D12S2070 | 12 | 114494662 | 114494765 | 10318931
BP | 12q24.21 | D12S2070 | 12 | 114494662 | 114494765 | 10631152
BP | 4p16.1 | D4S394 | 4 | 7024570 | 7024766 | 8630499
BP | NULL | D4S4394 | 4 | 0 | 0 | 9774780
BP | 4p15.1 | D4S2408-D4S2632 | 4 | 31055117 | 35601445 | 10318931
SCZ | 8p21.3 | D8S258 | 8 | 20377523 | 20377672 | 7573181
SCZ | 8p21.2 | D8S1771 | 8 | 25463145 | 25463370 | 9731535
SCZ | 8p21.2 | D8S1752 | 8 | 22690067 | 22690218 | 9731535
SCZ | 8p21.2 | D8S1771 | 8 | 25463145 | 25463370 | 11126395
SCZ | 8p21.3 | D8S258 | 8 | 20377523 | 20377672 | 8942448
SCZ | 8p22 | D8S261 | 8 | 12700001 | 18700000 | 8950417
SCZ | 8p21.3 | D8S439 | 8 | 22271321 | 22271581 | 9754621
SCZ | 8p12 | D8S1791 | 8 | 38171429 | 38171667 | 9674972
SCZ | 8p21.3 | D8S136 | 8 | 22455333 | 22455402 | 10784452
SCZ | 8p21.1 | D8S1771 | 8 | 25463145 | 25463370 | 11179014
SCZ | 1q21.3-1q23.3 | D1S1653-D1S1677 | 1 | 155149566 | 155149673 | 10784452
SCZ | 1q21-22 | NULL | 1 | 40800001 | 153700000 | 9754621
SCZ | 1q21-22 | NULL | 1 | 40800001 | 153700000 | 11179014
SCZ | 1q23.3 | D1S2675 | 1 | 159397370 | 159397531 | 11126394
SCZ | 13q32.2 | D13S128 | 13 | 96558115 | 96558272 | 11126394
SCZ | 13q32.2 | D13S128 | 13 | 96558115 | 96558272 | 9050933
SCZ | 13q33.1 | D13S174 | 13 | 100652077 | 100652253 | 9731535
SCZ | 13q31.1 | D13S170 | 13 | 78907238 | 78907358 | 9754621
SCZ | 13q32.1-13q32.3 | D13S793-D13S779 | 13 | 99201956 | 99202143 | 10784452
SCZ | 6p22.3 | D6S260 | 6 | 15512626 | 15512804 | 7647789
SCZ | 6p24.3 | D6S296 | 6 | 8798698 | 8798995 | 12140777
SCZ | 6p22.3 | D6S274-D6S285 | 6 | 16854125 | 18679238 | 7581458
SCZ | 6p22.3 | D6S274 | 6 | 16854125 | 16854302 | 7581457
SCZ | 6p22.3 | D6S285 | 6 | 18679024 | 18679238 | 7581457
SCZ | 6p21.32 | HLA-DQB1 | 6 | 32673703 | 32680835 | 11920855
SCZ | 6p22.3 | D6S260 | 6 | 15512626 | 15512804 | 11920855
SCZ | 6p24.3 | D6S470 | 6 | 10133771 | 10133890 | 8950417
SCZ | 6q21 | D6S416 | 6 | 112496958 | 112497214 | 9226366
SCZ | 6q16.1 | D6S424 | 6 | 95514543 | 95514789 | 9226366
SCZ | 6q16.1 | D6S424 | 6 | 95514543 | 95514789 | 10402499
SCZ | 6q16.1-6q16.3 | D6S424-D6S301 | 6 | 95514543 | 95514789 | 10402499
SCZ | 6q15 | D6S1570 | 6 | 91189580 | 91189714 | 11096332
SCZ | 6p22.3 | D6S242 | 6 | 23689569 | 23689852 | 10924404
SCZ | 10p12.31 | D10S1423 | 10 | 19441909 | 19442134 | 9674973
SCZ | 10p12.31 | D10S1714 | 10 | 18844109 | 18844295 | 9674975
SCZ | 10p12.1 | D10S2443 | 10 | 26765475 | 26765569 | 0
SCZ | 10p12.2 | D10S245 | 10 | 23607341 | 23607508 | 11673797
SCZ | 22q12.3 | IL2RB | 22 | 35764920 | 35789001 | 8178837
SCZ | 22q11-13 | D22S84 | 22 | 11800001 | 24300000 | 7909992
SCZ | 22q13.33 | D22S55 | 22 | 47700001 | 49396972 | 7909992
SCZ | 22q11.21 | D22S446 | 22 | 20343712 | 20343913 | 9754621
SCZ | 22q12.3 | D22S283 | 22 | 35022762 | 35022895 | 9754621
SCZ | 15q14 | 300kb of α-nicotinic receptor | 15 | 29838757 | 30477178 | 8776738
SCZ | 15q14 | D15S1012 | 15 | 36723599 | 36723772 | 11001582

APPENDIX Q Code for repeat prioritization in schizophrenia and bipolar disorder linkage regions

###############################################################################
# Shell script to extract all repeats within 50 Mb of linkage genetic markers #
###############################################################################
#!/bin/sh
#
# Automated Queries to schz_db
# Perseus Missirlis
# 040128

echo "
SELECT r.rep_id, l.disease, l.link_id
FROM repeats r, linkage l
WHERE r.chr = l.chr
AND r.start >= l.pstart
AND r.end <= l.qend;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > repeats_in_linkage.txt

#################################################
# Table to store all repeats in linkage regions #
#################################################
CREATE TABLE repeats_in_linkage (
    rep_id INT,
    disease VARCHAR(10) NOT NULL,
    link_id INT
);
CREATE INDEX rep_link ON repeats_in_linkage (rep_id, link_id);
CREATE INDEX rep_link_disease ON repeats_in_linkage (rep_id, disease, link_id);

#############
# LOAD DATA #
#############
LOAD DATA INFILE '/home/perseusm/My_Documents/Publications/Satellog/Results/repeats_in_linkage.txt'
INTO TABLE repeats_in_linkage IGNORE 1 LINES;

################################################
# GET ALL DISTINCT rep_id's of linked rep_id's #
################################################
#!/bin/sh
#
# Automated Queries to schz_db
# Perseus Missirlis
# 040128

echo "
SELECT DISTINCT rep_id FROM repeats_in_linkage;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > distinct_repeats_in_linkage.txt

###########################
# Calculate Linkage Depth #
###########################
CREATE TABLE linkage_depth (
    rep_id INT NOT NULL,
    disease VARCHAR(10) NOT NULL,
    linkage_depth INT
);

####################
# linkage_depth.pl #
####################
#!/usr/bin/perl
# queries.pl
# usage: queries.pl somefile.txt
# Perseus Missirlis - Jan 29, 2004
# Last updated:

use strict;
use DBI;

# DBI
my ($dsn) = "DBI:mysql:schz_db:athena.bcgsc.ca";
my ($user_name) = "schz_rw";
my ($password) = "repeat";
my ($dbh, $sth);
my (@ary);

#######################
# Connect to Database #
#######################
$dbh = DBI->connect ($dsn, $user_name, $password, { RaiseError => 1 });

# disease is selected and grouped as well, so that each row fills all three
# columns of linkage_depth and can later be filtered by disorder
my $sth1 = $dbh->prepare ("SELECT rep_id, disease, COUNT(link_id) AS linkage_depth FROM repeats_in_linkage WHERE rep_id = ? GROUP BY rep_id, disease;");
my $sth2 = $dbh->prepare("INSERT INTO linkage_depth VALUES(?,?,?)");

#####################
# Extract for query #
#####################
while (<>) {
    if (/^(\d+)/) {
        my $rep_id = $1;
        $sth1->execute($rep_id);
        while ( my $href = $sth1->fetchrow_hashref ) {
            my $disease = $href->{disease};
            my $linkage_depth = $href->{linkage_depth};
            $sth2->execute($rep_id,$disease,$linkage_depth);
        }
    }
}

# Run ./linkage_depth.pl distinct_repeats_in_linkage.txt

#######################################################################################
# SELECT ALL TRANSCRIBED, POLYMORPHIC CANDIDATE REPEATS FOR SCHIZOPHRENIA AND BIPOLAR #
#######################################################################################
#!/bin/sh
#
# Automated Queries to schz_db
# Perseus Missirlis
# 040128

echo "
SELECT DISTINCT r.rep_id, r.chr, r.start, r.end, r.unit, r.period, r.class, r.length, r.pvalue,
ld.linkage_depth, u.count, u.min, u.max, u.mean, u.sd
FROM repeats r, linkage_depth ld, ugstats u
WHERE u.sd > 0
AND u.rep_id = ld.rep_id
AND u.rep_id = r.rep_id
AND ld.disease = \"SCZ\";
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > schz_cand.txt

echo "
SELECT DISTINCT r.rep_id, r.chr, r.start, r.end, r.unit, r.period, r.class, r.length, r.pvalue,
ld.linkage_depth, u.count, u.min, u.max, u.mean, u.sd
FROM repeats r, linkage_depth ld, ugstats u
WHERE u.sd > 0
AND u.rep_id = ld.rep_id
AND u.rep_id = r.rep_id
AND ld.disease = \"BP\";
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > bp_cand.txt

#####################
# DROP TABLE SYNTAX #
#####################
# SCHZ
DROP TABLE schz_cand;
# BP
DROP TABLE bp_cand;

#######################
# CREATE TABLE SYNTAX #
#######################
# SCHZ
CREATE TABLE schz_cand (
    rep_id INT NOT NULL,
    chr VARCHAR(4),
    start INT UNSIGNED,
    end INT UNSIGNED,
    unit VARCHAR(16),
    period TINYINT UNSIGNED,
    class_id INT,
    length INT UNSIGNED,
    pvalue DECIMAL(8,6) NULL,
    linkage_depth INT NOT NULL,
    count INT NOT NULL,
    min INT NOT NULL,
    max INT NOT NULL,
    mean DECIMAL(8,2) NOT NULL,
    sd DECIMAL(8,2) NULL,
    gene_location VARCHAR(20),
    pep VARCHAR(150) NULL,
    name VARCHAR(15) NULL,
    description TEXT NULL,
    tissue VARCHAR(15) NOT NULL,
    call CHAR(1) NOT NULL
);

# BP
CREATE TABLE bp_cand (
    rep_id INT NOT NULL,
    chr VARCHAR(4),
    start INT UNSIGNED,
    end INT UNSIGNED,
    unit VARCHAR(16),
    period TINYINT UNSIGNED,
    class_id INT,
    length INT UNSIGNED,
    pvalue DECIMAL(8,6) NULL,
    linkage_depth INT NOT NULL,
    count INT NOT NULL,
    min INT NOT NULL,
    max INT NOT NULL,
    mean DECIMAL(8,2) NOT NULL,
    sd DECIMAL(8,2) NULL,
    gene_location VARCHAR(20),
    pep VARCHAR(150) NULL,
    name VARCHAR(15) NULL,
    description TEXT NULL,
    tissue VARCHAR(15) NOT NULL,
    call CHAR(1) NOT NULL
);

###################
# POPULATE TABLES #
###################
./expressed_in_brain_candidates.pl schz_cand.txt
./expressed_in_brain_candidates.pl bp_cand.txt
# Note:
# Remember to change input table in $sth3 and $sth4 below

####################################
# expressed_in_brain_candidates.pl #
####################################
#!/usr/bin/perl
# expressed_in_brain.pl
# usage:
# expressed_in_brain.pl somefile.txt
# Perseus Missirlis - Jan 29, 2004
# Last updated:

use strict;
use DBI;

# DBI
my ($dsn) = "DBI:mysql:schz_db:athena.bcgsc.ca";
my ($user_name) = "schz_rw";
my ($password) = "repeat";
my ($dbh, $sth);
my (@ary);

#######################
# Connect to Database #
#######################
$dbh = DBI->connect ($dsn, $user_name, $password, { RaiseError => 1 });

# Transcript annotation for a repeat
my $sth1 = $dbh->prepare ("SELECT DISTINCT r.rep_id AS a, t.gene_location AS b, t.pep AS c, e.name AS d, e.description AS e
    FROM repeats r, transcripts t, ens_db e
    WHERE r.rep_id = ? AND r.rep_id = t.rep_id AND t.ens_id = e.ens_id;");
# The same annotation, restricted to transcripts called present (P) in brain
my $sth2 = $dbh->prepare ("SELECT DISTINCT r.rep_id AS a, t.gene_location AS b, t.pep AS c, e.name AS d, e.description AS e, g.tissue AS f, g.call AS g
    FROM repeats r, transcripts t, ens_db e, affy a, GeneNote g
    WHERE r.rep_id = ? AND r.rep_id = t.rep_id AND t.ens_id = e.ens_id AND e.ens_id = a.ens_id AND a.g_id = g.g_id
    AND g.tissue = \"Brain\" AND g.call = \"P\";");
my $sth3 = $dbh->prepare("INSERT INTO bp_cand VALUES(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)");
my $sth4 = $dbh->prepare("UPDATE bp_cand SET tissue = ?, call = ? WHERE rep_id = ?");

#####################
# Extract for query #
#####################
my $null = "NULL";
while (<>) {
    if (/^(\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+\.\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)/) {
        my $rep_id = $1;
        my $chr = $2;
        my $start = $3;
        my $end = $4;
        my $unit = $5;
        my $period = $6;
        my $class = $7;
        my $length = $8;
        my $pvalue = $9;
        my $link_depth = $10;
        my $count = $11;
        my $min = $12;
        my $max = $13;
        my $mean = $14;
        my $sd = $15;

        $sth1->execute($rep_id);
        while ( my $href = $sth1->fetchrow_hashref ) {
            my $gene_location = $href->{b};
            my $pep = $href->{c};
            my $name = $href->{d};
            my $description = $href->{e};
            $sth3->execute($rep_id,$chr,$start,$end,$unit,$period,$class,$length,$pvalue,$link_depth,$count,$min,$max,$mean,$sd,$gene_location,$pep,$name,$description,$null,$null);
        }

        $sth2->execute($rep_id);
        while ( my $href = $sth2->fetchrow_hashref ) {
            my $tissue = $href->{f};
            my $call = $href->{g};
            $sth4->execute($tissue,$call,$rep_id);
        }
    }
}

##############################################################
# SELECT ALL CANDIDATE REPEATS FOR SCHIZOPHRENIA AND BIPOLAR #
##############################################################
#!/bin/sh
#
# Automated Queries to schz_db
# Perseus Missirlis
# 040128

echo "
SELECT DISTINCT r.rep_id, r.chr, r.start, r.end, r.unit, r.period, r.class, r.length, r.pvalue, ld.linkage_depth
FROM repeats r, linkage_depth ld
WHERE r.rep_id = ld.rep_id
AND ld.disease = \"SCZ\";
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > schz_cand_global.txt

echo "
SELECT DISTINCT r.rep_id, r.chr, r.start, r.end, r.unit, r.period, r.class, r.length, r.pvalue, ld.linkage_depth
FROM repeats r, linkage_depth ld
WHERE r.rep_id = ld.rep_id
AND ld.disease = \"BP\";
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > bp_cand_global.txt

#####################
# DROP TABLE SYNTAX #
#####################
# SCHZ
DROP TABLE schz_cand_global;
# BP
DROP TABLE bp_cand_global;

#######################
# CREATE TABLE SYNTAX #
#######################
# SCHZ
CREATE TABLE schz_cand_global (
    schz_cand_global_id INT auto_increment PRIMARY KEY,
    rep_id INT NOT NULL,
    chr VARCHAR(4),
    start INT UNSIGNED,
    end INT UNSIGNED,
    unit VARCHAR(16),
    period TINYINT UNSIGNED,
    class_id INT,
    length INT UNSIGNED,
    pvalue DECIMAL(8,6) NULL,
    linkage_depth INT NOT NULL,
    gene_location VARCHAR(20),
    pep VARCHAR(150) NULL,
    name VARCHAR(15) NULL,
    description TEXT NULL,
    tissue VARCHAR(15) NOT NULL,
    call CHAR(1) NOT NULL
);
CREATE INDEX lookup ON schz_cand_global (rep_id, schz_cand_global_id);

# BP
CREATE TABLE bp_cand_global (
    bp_cand_global_id INT auto_increment PRIMARY KEY,
    rep_id INT NOT NULL,
    chr VARCHAR(4),
    start INT UNSIGNED,
    end INT UNSIGNED,
    unit VARCHAR(16),
    period TINYINT UNSIGNED,
    class_id INT,
    length INT UNSIGNED,
    pvalue DECIMAL(8,6) NULL,
    linkage_depth INT NOT NULL,
    gene_location VARCHAR(20),
    pep VARCHAR(150) NULL,
    name VARCHAR(15) NULL,
    description TEXT NULL,
    tissue VARCHAR(15) NOT NULL,
    call CHAR(1) NOT NULL
);
CREATE INDEX lookup ON bp_cand_global (rep_id, bp_cand_global_id);

###################
# POPULATE TABLES #
###################
./expressed_in_brain_global.pl schz_cand_global.txt &
./expressed_in_brain_global2.pl bp_cand_global.txt &
# Note: Remember to change input table in $sth3 and $sth4 below

################################
# expressed_in_brain_global.pl #
################################
#!/usr/bin/perl
# expressed_in_brain_global.pl
# usage: expressed_in_brain_global.pl somefile.txt
# Perseus Missirlis - Jan 29, 2004
# Last updated:

use strict;
use DBI;

# DBI
my ($dsn) = "DBI:mysql:schz_db:athena.bcgsc.ca";
my ($user_name) = "schz_rw";
my ($password) = "repeat";
my ($dbh, $sth);
my (@ary);

#######################
# Connect to Database #
#######################
$dbh = DBI->connect ($dsn, $user_name, $password, { RaiseError => 1 });

my $sth1 = $dbh->prepare ("SELECT DISTINCT r.rep_id AS a, t.gene_location AS b, t.pep AS c, e.name AS d, e.description AS e
    FROM repeats r, transcripts t, ens_db e
    WHERE r.rep_id = ? AND r.rep_id = t.rep_id AND t.ens_id = e.ens_id;");
my $sth2 = $dbh->prepare ("SELECT DISTINCT r.rep_id AS a, t.gene_location AS b, t.pep AS c, e.name AS d, e.description AS e, g.tissue AS f, g.call AS g
    FROM repeats r, transcripts t, ens_db e, affy a, GeneNote g
    WHERE r.rep_id = ? AND r.rep_id = t.rep_id AND t.ens_id = e.ens_id AND e.ens_id = a.ens_id AND a.g_id = g.g_id
    AND g.tissue = \"Brain\" AND g.call = \"P\";");
# NULL in the first column lets MySQL assign the auto_increment key
my $sth3 = $dbh->prepare("INSERT INTO schz_cand_global VALUES(NULL,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)");
my $sth4 = $dbh->prepare("UPDATE schz_cand_global SET tissue = ?, call = ? WHERE rep_id = ?");

#####################
# Extract for query #
#####################
my $null = "NULL";
while (<>) {
    if (/^(\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+\.\d+)\s+(\d+)/) {
        my $rep_id = $1;
        my $chr = $2;
        my $start = $3;
        my $end = $4;
        my $unit = $5;
        my $period = $6;
        my $class = $7;
        my $length = $8;
        my $pvalue = $9;
        my $link_depth = $10;

        $sth1->execute($rep_id);
        while ( my $href = $sth1->fetchrow_hashref ) {
            my $gene_location = $href->{b};
            my $pep = $href->{c};
            my $name = $href->{d};
            my $description = $href->{e};
            $sth3->execute($rep_id,$chr,$start,$end,$unit,$period,$class,$length,$pvalue,$link_depth,$gene_location,$pep,$name,$description,$null,$null);
        }

        $sth2->execute($rep_id);
        while ( my $href = $sth2->fetchrow_hashref ) {
            my $tissue = $href->{f};
            my $call = $href->{g};
            $sth4->execute($tissue,$call,$rep_id);
        }
    }
}

#################################################################
# COLLECT ALL REPEATS IN OVERLAPPING SCZ AND BP LINKAGE REGIONS #
#################################################################
CREATE TABLE schz_bp_cand
SELECT s.rep_id, s.chr, s.start, s.end, s.unit, s.period, s.class_id, s.length, s.pvalue, s.linkage_depth,
s.count, s.min, s.max, s.mean, s.sd, s.gene_location, s.pep, s.name, s.description, s.tissue, s.call
FROM schz_cand s, bp_cand b
WHERE s.rep_id = b.rep_id;

#########################################
# SHELL SCRIPT TO SELECT TOP 20 REPEATS #
#########################################
#!/bin/sh
#
# Automated Queries to schz_db
# Perseus Missirlis
# 040128

#############################################################
# SELECT TOP 20 POLYMORPHIC SCHIZOPHRENIA CANDIDATE REPEATS #
#############################################################
# The gene_location alternatives are parenthesized so that the brain
# expression filter applies to all three of them
echo "
SELECT DISTINCT rep_id, chr, start, end, unit, length, pvalue, linkage_depth, sd, gene_location, pep, name,
(linkage_depth*sd/pvalue) AS score
FROM schz_cand
WHERE tissue = \"Brain\" AND call = \"P\"
AND (gene_location = \"exon\" OR gene_location = \"5utr\" OR gene_location = \"3utr\")
ORDER BY score DESC LIMIT 50;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > poly_schz_cand.txt

######################################################################
# SELECT TOP 20 GLOBALLY PRIORITIZED SCHIZOPHRENIA CANDIDATE REPEATS #
######################################################################
echo "
SELECT DISTINCT rep_id, chr, start, end, unit, length, pvalue, linkage_depth, gene_location, pep, name,
(linkage_depth/pvalue) AS score
FROM schz_cand_global
WHERE tissue = \"Brain\" AND call = \"P\"
ORDER BY score DESC LIMIT 50;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > global_schz_cand.txt

#######################################################
# SELECT TOP 20 POLYMORPHIC BIPOLAR CANDIDATE REPEATS #
#######################################################
echo "
SELECT DISTINCT rep_id, chr, start, end, unit, length, pvalue, linkage_depth, sd, gene_location, pep, name,
(linkage_depth*sd/pvalue) AS score
FROM bp_cand
WHERE tissue = \"Brain\" AND call = \"P\"
AND (gene_location = \"exon\" OR gene_location = \"5utr\" OR gene_location = \"3utr\")
ORDER BY score DESC LIMIT 50;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > poly_bp_cand.txt
################################################################
# SELECT TOP 20 GLOBALLY PRIORITIZED BIPOLAR CANDIDATE REPEATS #
################################################################
echo "
SELECT DISTINCT rep_id, chr, start, end, unit, length, pvalue, linkage_depth, gene_location, pep, name,
(linkage_depth/pvalue) AS score
FROM bp_cand_global
WHERE tissue = \"Brain\" AND call = \"P\"
ORDER BY score DESC LIMIT 50;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > global_bp_cand.txt

#########################################################################
# SHELL SCRIPT TO SELECT TOP 20 REPEATS FROM DISEASE-ASSOCIATED CLASSES #
#########################################################################
#!/bin/sh
#
# Automated Queries to schz_db
# Perseus Missirlis
# 040128

#############################################################
# SELECT TOP 20 POLYMORPHIC SCHIZOPHRENIA CANDIDATE REPEATS #
#############################################################
echo "
SELECT DISTINCT rep_id, chr, start, end, unit, length, pvalue, linkage_depth, sd, gene_location, pep, name,
(linkage_depth*sd/pvalue) AS score
FROM schz_cand
WHERE tissue = \"Brain\" AND call = \"P\"
AND (gene_location = \"exon\" OR gene_location = \"5utr\" OR gene_location = \"3utr\")
AND (class_id = 182 OR class_id = 381 OR class_id = 51 OR class_id = 36 OR class_id = 285 OR class_id = 42495)
ORDER BY score DESC LIMIT 50;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > poly_schz_cand_dis.txt

######################################################################
# SELECT TOP 20 GLOBALLY PRIORITIZED SCHIZOPHRENIA CANDIDATE REPEATS #
######################################################################
echo "
SELECT DISTINCT rep_id, chr, start, end, unit, length, pvalue, linkage_depth, gene_location, pep, name,
(linkage_depth/pvalue) AS score
FROM schz_cand_global
WHERE tissue = \"Brain\" AND call = \"P\"
AND (class_id = 182 OR class_id = 381 OR class_id = 51 OR class_id = 36 OR class_id = 285 OR class_id = 42495)
ORDER BY score DESC LIMIT 50;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > global_schz_cand_dis.txt

#######################################################
# SELECT TOP 20 POLYMORPHIC BIPOLAR CANDIDATE REPEATS #
#######################################################
echo "
SELECT DISTINCT rep_id, chr, start, end, unit, length, pvalue, linkage_depth, sd, gene_location, pep, name,
(linkage_depth*sd/pvalue) AS score
FROM bp_cand
WHERE tissue = \"Brain\" AND call = \"P\"
AND (gene_location = \"exon\" OR gene_location = \"5utr\" OR gene_location = \"3utr\")
AND (class_id = 182 OR class_id = 381 OR class_id = 51 OR class_id = 36 OR class_id = 285 OR class_id = 42495)
ORDER BY score DESC LIMIT 50;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > poly_bp_cand_dis.txt

################################################################
# SELECT TOP 20 GLOBALLY PRIORITIZED BIPOLAR CANDIDATE REPEATS #
################################################################
echo "
SELECT DISTINCT rep_id, chr, start, end, unit, length, pvalue, linkage_depth, gene_location, pep, name,
(linkage_depth/pvalue) AS score
FROM bp_cand_global
WHERE tissue = \"Brain\" AND call = \"P\"
AND (class_id = 182 OR class_id = 381 OR class_id = 51 OR class_id = 36 OR class_id = 285 OR class_id = 42495)
ORDER BY score DESC LIMIT 50;
" | mysql --quick -h athena -u schz_rw -prepeat schz_db > global_bp_cand_dis.txt

APPENDIX R SQL code to generate results

Extract names of genes whose CAG/CTG repeats were within CpG islands:

SELECT c.name, c.start, c.end, g.expandability,
IF(((c.start + (c.end - c.start)) > 50000) AND ((c.end - (c.end - c.start)) < 50000),"TRUE","FALSE") AS islands
FROM cpg c, gems_feat g
WHERE c.name = g.name
ORDER BY islands DESC;

Extract genes, their expandability metric and flanking %GC:

SELECT g.name, g.expandability, gc.50_bp, gc.100_bp, gc.500_bp, gc.1000_bp
FROM gems_feat g, gc
WHERE g.expandability > 0
AND g.name = gc.name;

Extract genes with 100 bp flanking %GC at least equal to that of HD (0.76):

SELECT g.name, g.expandability, gc.100_bp
FROM gems_feat g, gc
WHERE g.name = gc.name
AND gc.100_bp >= 0.76;

Extract genes with flanking CTCF binding site scores of 1.00 or higher:

SELECT c.name, c.score, c.distance, g.expandability, c.start, c.end
FROM ctcf c, gems_feat g
WHERE c.distance < 1000
AND c.score > 1
AND g.name = c.name
ORDER BY g.expandability DESC;

Extract genes with weak flanking CTCF scores (0 < score < 1):

SELECT c.name, c.score, c.distance, g.expandability, c.start, c.end
FROM ctcf c, gems_feat g
WHERE c.distance < 1000
AND c.score > 0 AND c.score < 1
AND g.name = c.name
ORDER BY g.expandability DESC;
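The islands flag in the CpG island query at the start of this appendix can be read more simply: c.start + (c.end - c.start) is just c.end, and c.end - (c.end - c.start) is just c.start, so the IF tests whether the island strictly spans position 50000. The sketch below restates that test as a small shell function. The name spans_position is hypothetical and not part of Satellog, and reading 50000 as the repeat's offset within the extracted flanking sequence is an assumption rather than something the query states.

```shell
#!/bin/sh
# Sketch only: spans_position is a hypothetical helper, not part of Satellog.
# It prints TRUE when the interval [start, end] strictly contains pos, which
# is what the IF(...) expression in the CpG island query reduces to:
#   start + (end - start) = end   and   end - (end - start) = start
spans_position() {
    start=$1
    end=$2
    pos=${3:-50000}   # 50000 is the constant used in the query (meaning assumed)
    if [ "$end" -gt "$pos" ] && [ "$start" -lt "$pos" ]; then
        echo "TRUE"
    else
        echo "FALSE"
    fi
}

spans_position 49500 50400   # island covering the position
spans_position 51000 52000   # island lying entirely downstream of it
```

Because both comparisons are strict, an island that starts or ends exactly at the position is reported as FALSE, which matches the behaviour of the original expression.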
