Satellog: A database for the identification and prioritization of satellite repeats in disease association… Missirlis, Perseus I; Mead, Carri-Lyn R; Butland, Stefanie L; Ouellette, BF F; Devon, Rebecca S; Leavitt, Blair R; Holt, Robert A Jun 10, 2005

BMC Bioinformatics (BioMed Central)
Open Access
Database

Satellog: A database for the identification and prioritization of satellite repeats in disease association studies

Perseus I Missirlis*1, Carri-Lyn R Mead1, Stefanie L Butland2, BF Francis Ouellette2, Rebecca S Devon3, Blair R Leavitt3 and Robert A Holt1,4

Address: 1 Genome Sciences Centre, BC Cancer Agency, Suite 100, 570 West 7th Ave, Vancouver, BC, V5Z 4S6, Canada, 2 UBC Bioinformatics Centre, University of British Columbia, 950 West 28th Ave, Vancouver, BC V5Z 4H4, Canada, 3 Centre for Molecular Medicine and Therapeutics, University of British Columbia, 950 West 28th Avenue, Vancouver, B.C., V5Z 4H4, Canada and 4 Department of Psychiatry, University of British Columbia, 2255 Wesbrook Mall, Vancouver, BC, V6T 2A1, Canada

Email: Perseus I Missirlis* - perseusm@bcgsc.ca; Carri-Lyn R Mead - cmead@bcgsc.ca; Stefanie L Butland - butland@bioinformatics.ubc.ca; BF Francis Ouellette - francis@bioinformatics.ubc.ca; Rebecca S Devon - Rebecca.Devon@ed.ac.uk; Blair R Leavitt - bleavitt@cmmt.ubc.ca; Robert A Holt - rholt@bcgsc.ca
* Corresponding author

Abstract

Background: To date, 35 human diseases, some of which also exhibit anticipation, have been associated with unstable repeats. Anticipation has been reported in a number of diseases in which repeat expansion may have a role in etiology. Despite the growing importance of unstable repeats in disease, currently no resource exists for the prioritization of repeats. Here we present Satellog, a database that catalogs all pure 1–16 repeat unit satellite repeats in the human genome along with supplementary data. Satellog analyzes each pure repeat in UniGene clusters for evidence of repeat polymorphism.

Results: A total of 5,546 such repeats were identified, providing the first indication of many novel polymorphic sites in the genome. Overall, polymorphic repeats were over-represented within 3'-UTR sequence relative to 5'-UTR and coding sequence.
Interestingly, we observed that repeat polymorphism within coding sequence is restricted to trinucleotide repeats whereas UTR sequence tolerated a wider range of repeat period polymorphisms. For each pure repeat we also calculate its repeat length percentile rank, its location either within or adjacent to EnsEMBL genes, and its expression profile in normal tissues according to the GeneNote database.

Conclusion: Satellog provides the ability to dynamically prioritize repeats based on any of their characteristics (i.e. repeat unit, class, period, length, repeat length percentile rank, genomic co-ordinates), polymorphism profile within UniGene, proximity to or presence within gene regions (i.e. cds, UTR, 15 kb upstream etc.), metadata of the genes they are detected within and gene expression profiles within normal human tissues. Unstable repeats associated with 31 diseases were analyzed in Satellog to evaluate their common repeat properties. The utility of Satellog was highlighted by prioritizing repeats for Huntington's disease and schizophrenia. Satellog is available online at http://satellog.bcgsc.ca.

Published: 10 June 2005
BMC Bioinformatics 2005, 6:145 doi:10.1186/1471-2105-6-145
Received: 12 January 2005
Accepted: 10 June 2005
This article is available from: http://www.biomedcentral.com/1471-2105/6/145
© 2005 Missirlis et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Anticipation is a medical observation that refers to the progressive worsening of a disease's symptoms and/or an earlier age of onset over successive generations of affected family members [1].
Although historically controversial, the concept gained widespread scientific acceptance with the identification in 1991 of unstable trinucleotide repeats associated with Fragile X syndrome [2,3] and spinal and bulbar muscular atrophy (SBMA) [4]. Today, 35 human diseases, some of which also exhibit anticipation, have been associated with unstable repeats [5]. Diseases for which unstable microsatellites are the causative disease mechanism can be divided into those caused by coding or non-coding repeat expansions.

The majority of disease-associated coding repeats identified to date are CAG-type repeats encoding an expanded poly-glutamine tract in affected individuals. CAG-type expansion disorders include spinal and bulbar muscular atrophy (SBMA) [4], dentatorubral-pallidoluysian atrophy (DRPLA) [6], Huntington disease (HD) [7] and a range of spinocerebellar ataxias (SCAs) including SCA1 [8], SCA2 [9], SCA3 [10], SCA6 [11], and SCA7 [12]. In these diseases, an expanded poly-glutamine tract results in a toxic gain of function causing either neuronal degeneration [13], or in mouse models of spinocerebellar ataxia (SCA), neuronal dysfunction due to Purkinje cell abnormalities [14]. The precise pathogenic disease mechanism is unknown but requires expression of the expanded polyglutamine tract. Neuronal inclusion bodies are observable on autopsy [14].

Untranslated repeats are diverse and include non-trinucleotide repeats. For example, progressive myoclonic epilepsy type 1 (EPM1) pathology results from an expansion of the dodecamer CCCCGCCCCGCG [15] and an ATTCT repeat expansion is the pathogenic agent in SCA10 [16]. In contrast to the coding repeat disorders, non-coding repeats can expand dramatically into the range of thousands of repeats [17].
Most non-coding repeat expansions are not associated with neuronal inclusion bodies on autopsy [14], with the exception of Fragile X-associated tremor ataxia syndrome [18], and nuclear foci observed in neurons of myotonic dystrophy patients [19].

Anticipation has been reported in a number of orphan diseases in which repeat expansion may have a role in etiology. These diseases include autosomal dominant limb-girdle muscular dystrophy [20], Crohn's disease [21], leukemia [22], nodal osteoarthritis [23], Parkinson's disease [24], rheumatoid arthritis [25], truncal heart defects [26], mood disorders [27], schizophrenia [28,29], and anxiety disorders [30,31]. Although no repeat expansions have been associated with any of these disorders, no comprehensive surveys have been undertaken.

Historically, if one suspected that a polymorphic microsatellite repeat was associated with a disease, few bioinformatics resources were available to identify relevant repeats in the human genome. One approach now available is to browse the Tandem Repeats Finder (TRF) [32] track on the UCSC genome browser [33] within a genomic region of interest. TRF at UCSC was executed with liberal insertion and deletion (indel) and substitution penalties that allow the detection of larger, frequently impure repeats. Since pure repeat tracts are more likely to expand than impure repeat tracts following transmission [34-36], a large fraction of repeats presented at UCSC are probably not relevant for disease association studies. Furthermore, certain known disease-associated repeats, such as the GAA repeat in Friedreich's Ataxia (chr9:67,109,320-67,109,339) [37], are not detected at all at UCSC because they are too short to be detected by their TRF parameters. Other groups have created databases of all 2–16 repeat unit satellite repeats within human gene regions [38,39] and of all 1–6 repeat unit microsatellites across prokaryotic and eukaryotic taxa [38]. Collins detected microsatellites with a novel algorithm and deposited this data in a relational database called GRID Short Tandem Repeats (STR) database [39]. This database included in silico polymorphism detection of coding trinucleotide repeats by using the BLAST algorithm to detect each repeat's length polymorphisms within GenBank, but only for a subset of coding repeats [39]. These resources enrich the microsatellite repeat bioinformatics landscape but do not integrate these data with other published resources in a way relevant for repeat prioritization in disease-association studies. Also, these resources do not provide flexible interfaces for combining data in user-defined ways to allow dynamic generation of candidate repeat lists. For example, both the Microsatellites Repeat Database (MRD) [38] and the STR databases [39] provide static co-ordinates of candidate repeats for disease-association studies defined by the authors' criteria, but lack the functionality to easily re-prioritize repeats based on user preferences.

To address these deficiencies we created Satellog, a database that catalogs all pure 1–16 repeat unit satellite repeats in the human genome along with supplementary data we believe to be of use for the prioritization of satellite repeats in disease association studies. For each pure repeat Satellog can also calculate the percentile rank of its length relative to other repeats of the same class in the genome, its polymorphism within UniGene clusters [40], its location relative to known genes [41], and its expression profile in normal tissues according to the GeneNote database [42]. Repeats within Satellog can be prioritized based on any of their characteristics (i.e. repeat unit, class, period, length, length percentile rank, genomic co-ordinates), polymorphism profile within UniGene, proximity to or presence within gene regions (i.e. cds, UTR, 15 kb upstream etc), metadata of the genes they are detected within, and gene expression profiles within normal human tissues. Disease-associated repeats from 31 diseases were used as a test set to see what fraction could be detected independently within Satellog and what could be learned about polymorphic repeats in general. To showcase its utility, we used Satellog to prioritize repeats for disease-association studies in Huntington's disease and schizophrenia. Satellog is available as a web-queriable database along with all source code licensed under GNU General Public License at http://satellog.bcgsc.ca.

Results

Summary statistics

A total of 8,357,425 pure repeats were detected by TRF in the human genome and were stored in Satellog. Of these, 5,398,328 or 64.6% were detected within an EnsEMBL-defined gene or within 60 kb flanking either side of an EnsEMBL gene. These repeats mapped to 7,260,625 genetic locations in or near EnsEMBL genes, reflecting the fact that some repeats were located within more than one gene. Of the genes in EnsEMBL, 92% (21,654 / 23,531) had at least one pure repeat within 60 kb of their gene boundaries. All repeats in Satellog clustered into 70,318 unique repeat classes. Overall, repeat counts correlated with decreasing chromosomal size; however, chromosome 19 had the highest density of repeats in accordance with previously published reports [43] (Figure S1, Table S1 – supplementary information available online at http://satellog.bcgsc.ca/source.php).
Data summarizing repeat counts and density by repeat unit size and chromosome (Table S2), by specific repeat unit (Table S3) and by gene region (Table S4) are also available online as supplementary information.

Characteristics of disease-associated repeats

Disease-associated repeats and their common properties were recently reviewed [5]. We queried the database with these sequences to observe any characteristic features of these repeats relative to all other repeats. We asked how many of these repeats could be identified as potentially unstable using only the bioinformatics resources within Satellog. The co-ordinates for 31 of the 35 disease-associated repeats were manually collected from the review and identified in Satellog. Repeats that were not analyzed either had a repeat period greater than 16 (thus not detected by our TRF parameters) or were polymorphic but not associated with any disease. For these disease-associated repeats, there is no record of their precise genomic co-ordinates. To address this, we used Satellog to probe for the probable repeat that corresponded to each disease by selecting all repeats of the expected class within each disease gene. All repeats were detected, except for the repeat responsible for blepharophimosis [44]. In 12 cases, more than one candidate was detected as the disease-associated repeat for a disease. These cases usually involve flanking repeats of the same class that are detected as two distinct repeats because of an interrupting unit, an established characteristic of some disease-associated repeats such as those responsible for SCA1 [35] and Fragile X syndrome [36]. In these cases, we simply retained both repeats and associated them with the disease.

A total of 51 repeats were mapped for 31 diseases. Interestingly, these repeats were from only 6 repeat classes. Trinucleotide repeats are the most common repeat class implicated in disease [5], especially for disorders caused by coding repeat expansion.
Of the disease-associated repeats we analyzed, 28 of the 31 were trinucleotide repeats with 16 being from the CAG repeat class, 11 from the GCG repeat class, and one each from the CCCCGCCCCGCG, CCTG, GAA, and ATTCT repeat classes respectively. These disease-associated repeat classes had dramatically different genomic distributions (Figure 1). For example, the CCCCGCCCCGCG dodecamer implicated in progressive myoclonic epilepsy type 1 (EPM1) [15] is the only pure repeat of its class detected in the human genome and therefore has a singleton as its distribution. The remaining repeat classes have broader distributions, particularly the GAA repeat class. GAA repeats have been reported to have a unique distribution relative to other trinucleotide repeats due to their evolutionary origin within Alu repeats [45].

Table 1: Unstable coding repeats organized by descending standard deviation. Sample output from Satellog.

unit    length  gene location  pep                        name    mean   sd
GCA     23      cds            LQQQQQQQQQQQQQQQQQQQQQQQ   AR      20.36  4.11
CAG     15      cds            QQQQQQQQQQQQQQQH           DRPLA   12.44  3.9
GGC     17      cds            GGGGGGGGGGGGGGGGGE         AR      15.1   3.54
CAG     19      cds            QQQQQQQQQQQQQQQQQQQQ       TBP     17.1   3.01
ACC     13      cds            LPPPPPPPPPPPPP             NULL    11.5   2.12
GGC     8       cds            GGGGGGGGG                  GDF7    9      1.73
CTG     6       cds            GSSSSSR                    PCDH12  7.2    1.55
CCG     9       cds            PAAAAAAAAA                 NULL    6      1.41
GGGGCC  4       cds            APAPAPAPAP                 CDKN1C  3.33   1.15
GGC     6       cds            GGGGGG                     NULL    6.67   1.03

The ten most unstable coding repeats organized by descending standard deviation. Repeats highlighted in bold are known disease-associated repeats. (Note: trailing non-consensus amino acids are not artefactual output. Repeat units continue to be detected at the DNA level even if they do not completely achieve the consensus. For example in the second row above, the corresponding DNA sequence (CAG)15 CA contains a trailing CA (of the subsequent CAT codon) that translates into histidine).
Satellog recapitulated a distinct, expanded profile for GAA repeats relative to all other trinucleotide repeats (Figure 1).

We defined significant repeat length in the reference genome as any repeat with length within the top 5% of its class (corresponds to a percentile rank < 0.05 in Satellog). Using this cut-off, we determined whether the reference genome repeat length is significant for any of the disease-associated repeats within their respective disease classes. Interestingly, 80% (24/30) of the disease-associated repeats in Figure 1 were significantly long in the reference genome given their repeat class' length distribution (percentile rank < 0.05). In fact, 20 of 30 of all disease-associated repeats had a percentile rank of 0.01 or less indicating that these repeats were the extreme outliers within their class. Of the coding repeats, 12 of 17 had significant repeat lengths, including all the CAG-type repeats. Exceptions were the cleidocranial dysplasia (CCD), hand-foot-genital syndrome (HFGS), synpolydactyly, oculopharyngeal muscular dystrophy (OPMD), and holoprosencephaly coding GCG repeats. The CCCCGCCCCGCG dodecamer implicated in progressive myoclonic epilepsy type 1 (EPM1) is not included in this comparison because there were no other pure repeats of its class in the genome.

Figure 1: Genome-wide repeat lengths of disease-associated repeat classes. Genomic distribution of repeat lengths of all repeat classes associated with disease.

Polymorphic repeats detected in UniGene clusters

We used a bioinformatics approach to see if we could detect repeat polymorphisms within UniGene sequences. Of the 8,357,425 pure repeats detected by Satellog, 1.3% or 111,950 repeats were detected as transcribed by the EnsEMBL API (either in the UTR or coding sequence (cds) of the gene).
Of these repeats, just over half (57.4% or 64,116 repeats) were detected within UniGene cluster sequences. Finally, of these repeats, only 5,546 repeats were detected as polymorphic (defined as any repeat that had at least one sequence within a cluster with a different repeat length). A measure of repeat polymorphism was provided by calculating the standard deviation (sd) of all repeat lengths detected within a UniGene cluster. A total of 2,763, 541, and 4,244 polymorphic repeats were detected in coding, 5'-UTR, and 3'-UTR sequence respectively (Note, repeats may exist in more than one gene which is why the location break-down of the repeats is greater than the total number of distinct polymorphic repeats of 5,546). Our ability to generalize repeat polymorphism trends within genetic regions was confounded by increased sampling of the 3' end of genes (Figure 2). To control for this, we compared the polymorphism profile of repeats in coding, 5'-UTR, and 3'-UTR regions that had equal sampling depth. By one-way ANOVA, we found a significant difference between coding (0.322 ± 0.134), 5'-UTR (0.416 ± 0.207), and 3'-UTR (0.510 ± 0.184) repeats. There was significant repeat polymorphism in the 3'-UTR sequence relative to coding sequence but not to 5'-UTR sequence after controlling for sampling bias (Tukey-Kramer post-hoc multiple comparisons test, P < 0.001). Next we evaluated the tolerance of repeat polymorphisms by various repeat periods in coding and UTR sequence. To observe if highly polymorphic repeats were restricted to certain repeat periods (defined as repeat unit length), the repeat period distribution was observed at progressively increasing sd values (Figures 3 and 4). Untranslated repeats were well distributed across all repeat periods except for 16 mers at an sd cut-off of 1 (which roughly corresponded to repeat polymorphisms of 1 repeat unit). At increasing sd cut-offs, untranslated polymorphic repeats were detected as penta-, tri- and mainly di-nucleotide repeats (Figure 3). In contrast, while coding repeat polymorphisms were widely distributed at an sd of 1, they were mainly restricted to trinucleotide repeats at higher sd cut-offs (Figure 4). Although the untranslated repeats had higher sd values, their most polymorphic sd values were restricted to mono- and di-nucleotide repeats.

Figure 2: Boxplot comparison of polymorphic repeats from coding, 5'-UTR and 3'-UTR sequence. Median standard deviations (line through box) of all polymorphic repeats detected in coding, 5'-UTR, and 3'-UTR sequence. After controlling for sampling bias, coding and 5'-UTR standard deviations did not significantly differ from each other, but did significantly differ from 3'-UTR repeats implying that the 3'-UTR tolerates larger, more expanded repeats (P < 0.001).

Disease-associated repeats detected in UniGene clusters

To address whether known disease-associated repeats were polymorphic within UniGene clusters, we extracted the top ten most polymorphic coding and non-coding repeats, based on their sd value, and determined if any of the disease-associated repeats were also the most polymorphic. The repeats associated with SBMA (AR is the gene mutated in individuals affected with SBMA), DRPLA, and SCA17 (TBP is the gene mutated in individuals affected with SCA17) were detected as the first-, third- and fourth-most polymorphic coding repeats (Table 1). The AIB-I repeat that confers increased risk of prostate cancer was also detected as polymorphic but not in the top ten. The repeat responsible for FRAXE was detected as polymorphic, but not as one of the top ten most polymorphic untranslated repeats (Table 2).

Of the 31 disease-associated repeats discussed previously, only 5 repeats were detected as polymorphic within UniGene clusters. We sought to understand why this occurred. Of the 31 disease-associated repeats, 4 failed to map within the genomic co-ordinates of any mapped UniGene cluster. The remaining 27 repeats mapped within a UniGene cluster's genomic co-ordinates. However, 16 of these failed to be detected within UniGene sequences even though they mapped within a UniGene cluster. This could be because of the 3' bias of the UniGene sequences, the incomplete nature of the clusters [40], sequence errors in the representative UniGene cluster sequence we searched against for hits (Hs.seq.uniq – see Methods for details), or the limitations of our mapping algorithm. Our approach enforces that the repeat must exist with at least 10 bp of flanking sequence, which leaves out repeats at the edge of UniGene clusters. The remaining 11 disease-associated repeats were detected within UniGene clusters, but only 5 of these repeats were polymorphic. On average, the repeats detected as polymorphic had more hits within UniGene clusters than those detected as stable (there were an average of 17.4 observations per repeat for the polymorphic repeats compared with 4.54 for stable repeats). This suggests that there is a greater chance of observing repeat polymorphism with deeper sampling. All of the polymorphic repeats were limited to one UniGene cluster and none of the lengths surpassed the disease pre-mutation threshold of 29, 25, 36, 42, and 39 pure repeats for the repeats responsible for increased prostate cancer risk (AIB-I), DRPLA, SBMA, SCA17, and FRAXE respectively [5].

Discussion

Although one might expect greater polymorphism in UTR sequence relative to coding sequence due to reduced evolutionary constraints, both 5'-UTR and coding repeats had similar rates of polymorphism, whereas 3'-UTR repeats had significantly greater polymorphism compared to these two groups. This may be due to the documented 3'-UTR sequence over-representation in UniGene [40]. However, depending on whether the repeat is within coding or UTR sequence, there appear to be constraints regarding what repeat unit sizes can tolerate large polymorphisms. Of the more polymorphic UTR repeats (those with sd values greater than 3), there was a single trinucleotide repeat amongst mainly dinucleotide and mononucleotide repeats (Figure 2, Table 2). On the other hand, the majority of coding repeat polymorphisms, although less pronounced, are almost entirely in factors of three (Figure 1, Table 1). Our results support the observation that coding microsatellite polymorphisms are usually in-frame in order to avoid a deleterious phenotype resulting from frame-shift or to provide a rapid evolutionary response to a changing environment [46].

Figure 3: Counts of unstable non-coding repeats at increasing instability cut-offs. Repeat period distribution of polymorphic non-coding repeats at increasing standard deviation (sd) cut-offs.

Figure 4: Counts of unstable coding repeats at increasing instability cut-offs. Repeat period distribution of polymorphic coding repeats at increasing standard deviation (sd) cut-offs.

Table 2: Unstable untranslated repeats organized by descending standard deviation. Sample output from Satellog.

unit  length  gene location  name    mean   sd
GT    9       3utr           NULL    12.17  7.29
AT    25      3utr           SPATA2  19     7.07
TA    10      3utr           NULL    11.11  6.33
T     11      3utr           LYZ     13.08  4.72
AC    23      3utr           NAV1    17.71  4.39
AC    28      5utr           NULL    25.4   3.71
GCA   16      5utr           GLS     9.6    3.58
GCC   14      5utr           DAZAP1  15     2.28
T     13      5utr           NULL    15     2.24
T     19      5utr           NULL    17.5   2.12

The ten most unstable untranslated repeats organized by descending standard deviation. No disease-associated repeats are present in this sample.

It is important to consider that larger repeat polymorphisms could cause a UniGene cluster to "split" into two distinct clusters. This could downplay a repeat's polymorphism because such repeats would not be evaluated as a single group, therefore decreasing the repeat's sd value. This issue was addressed by pre-mapping all UniGene clusters to the human genome. If the repeat co-ordinates were within 10 kb of the UniGene genomic co-ordinates, then the repeat length hits were retained and merged into a single sd value.
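The polymorphism measure and the 10 kb merging rule described above can be sketched as follows. This is an illustration under stated assumptions, not the Satellog implementation: the function names `pool_hits` and `summarize` and all example co-ordinates are hypothetical, "within 10 kb" is read as a gap of at most 10,000 bp between the repeat and the mapped cluster, and the sample standard deviation is assumed because the text does not say which estimator was used.

```python
from statistics import mean, stdev

MAX_DISTANCE_BP = 10_000  # "within 10 kb", read here as a gap of at most 10,000 bp


def pool_hits(repeat_start, repeat_end, cluster_hits, max_distance=MAX_DISTANCE_BP):
    """Merge repeat-length observations from every nearby cluster.

    cluster_hits is a list of (cluster_start, cluster_end, lengths) tuples,
    where lengths are the repeat lengths observed in that cluster's sequences.
    """
    pooled = []
    for cluster_start, cluster_end, lengths in cluster_hits:
        # Gap between the two intervals; 0 when they overlap.
        gap = max(cluster_start - repeat_end, repeat_start - cluster_end, 0)
        if gap <= max_distance:
            pooled.extend(lengths)
    return pooled


def summarize(lengths):
    """Return mean, sd and a polymorphism flag, or None with fewer than 2 hits."""
    if len(lengths) < 2:
        return None
    return {
        "mean": mean(lengths),
        "sd": stdev(lengths),  # assumption: sample (n - 1) standard deviation
        "polymorphic": len(set(lengths)) > 1,  # at least one differing length
    }


# Hypothetical repeat at 1,000-1,030 with three mapped clusters.
example_clusters = [
    (500, 2_000, [16, 19]),     # overlaps the repeat
    (8_000, 9_000, [16]),       # 6,970 bp away, so it is merged
    (50_000, 60_000, [30]),     # too far away, so it is ignored
]
pooled_lengths = pool_hits(1_000, 1_030, example_clusters)
summary = summarize(pooled_lengths)
```

Pooling before computing the sd means that a repeat whose alleles were split across two clusters is still evaluated as a single group, which is the concern raised in the text.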
In practical terms this was not an issue, since only one of our most polymorphic repeats (sd > 2) mapped to two clusters.

There are certain limitations in using the GeneNote database to establish expression of repeat-containing genes. Specifically, the GeneNote microarray experiments were conducted with whole tissues, not tissues from particular tissue sub-types [42]. For example, users limiting their search to repeats expressed in the brain must bear in mind the possibility that a transcript highly expressed in one anatomical region (e.g. hippocampus) may lack sufficient global expression to be detected in the whole brain tissue used by the GeneNote experiments. Users interested in expression in particular anatomical regions might benefit from integrating gene expression data from their anatomical region of interest with repeat data from Satellog.

As an example of the utility of Satellog, we wished to see how it might have expedited research for groups in the past hunting for candidate unstable repeats. In 1992, haplotype analysis of linkage disequilibrium data in Huntington's disease patients had indicated a portion of 4p16.3 (chr4:1-4,600,000) as the likely location of the mutation [47]. We assumed that the investigators at the time were looking specifically for an unstable, brain-expressed, CAG repeat to explain the disease phenotype, similar to SBMA [4]. Using the Satellog database, we narrowed down our search for candidate repeats in this area from 13,804 to 13 (Figure 5). Three polyglutamine repeats are returned by the database, but the repeat implicated in Huntington's disease (chr4:3108016-3108074) stands out as a strong candidate due to its size. If we re-run this query and select only the top 5% of repeats relative to their class, chr4:3108016-3108074 is the only polyglutamine repeat.
These repeat characteristics (CAG repeat type, brain expression and presence within the top 5% of its repeat class), plus the privilege of hindsight, easily allow us to distinguish this repeat as the lead candidate in this region.

Secondly, we sought to prioritize all repeats in a disease in which unstable repeats might play a role but in which none have been successfully correlated with disease to date. Schizophrenia is one such disease with genetic linkage in region 22q [48-50] suggesting some role of chromosome 22 aberrations in disease development. Microdeletions in this region in patients affected with Velocardiofacial Syndrome (VCFS) confer the most consistent genetic predisposition to developing schizophrenia [51]. First, we collected all repeats on chromosome 22 resulting in a total of 113,789 repeats. Next, since we only observed trinucleotide repeats and higher period repeats in our disease-associated set, we restricted our repeats to those with a period greater than 2 resulting in 91,918 repeats. Since the majority of the disease-associated repeats had a significantly longer reference genome length relative to other repeats of the same class, we selected the 2,934 repeats with a percentile rank less than 0.05. The cellular pathology associated with schizophrenia shows no evidence of nuclear inclusions mediated by polyglutamine expansions; therefore, the disease phenotype may be mediated by an expansion in the UTR region. We selected 27 repeats from our set that were located in either the 5'- or 3'-UTR. Assuming that genes relevant to schizophrenia are expressed in the brain, we limited our analysis to the 18 repeats that were within genes expressed in the brain. Of our final set of 18 repeats, 2 repeats in the 3'-UTRs of CRKL and NIPSNAP1 had evidence of repeat polymorphism in UniGene clusters (Table 3).
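The successive filters used in the chromosome 22 example can be written as an ordered cascade that records how many candidates survive each step. The sketch below is only an illustration of that idea: the `Repeat` record, its field names and the `prioritize` function are hypothetical and do not reflect the Satellog schema or web interface. The CRKL and NIPSNAP1 records use the values printed in Table 3, with the p-value column assumed to hold the percentile rank; every other record is invented to exercise one filter each.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Repeat:
    """Hypothetical record holding the attributes used for prioritization."""
    chrom: str
    unit: str
    percentile_rank: float
    location: str                 # "cds", "5utr" or "3utr"
    brain_expressed: bool
    unigene_sd: Optional[float]   # None when no UniGene hits were found


# Ordered (label, predicate) pairs mirroring the steps described in the text.
FILTERS = [
    ("on chromosome 22", lambda r: r.chrom == "22"),
    ("period greater than 2", lambda r: len(r.unit) > 2),
    ("percentile rank < 0.05", lambda r: r.percentile_rank < 0.05),
    ("within a 5'- or 3'-UTR", lambda r: r.location in {"5utr", "3utr"}),
    ("gene expressed in brain", lambda r: r.brain_expressed),
    ("polymorphic in UniGene", lambda r: r.unigene_sd is not None and r.unigene_sd > 0),
]


def prioritize(repeats, filters=FILTERS):
    """Apply each filter in turn; return survivors and per-step counts."""
    remaining = list(repeats)
    counts = []
    for label, keep in filters:
        remaining = [r for r in remaining if keep(r)]
        counts.append((label, len(remaining)))
    return remaining, counts


EXAMPLE = [
    Repeat("22", "AAC", 0.019894, "3utr", True, 0.51),    # CRKL row of Table 3
    Repeat("22", "GGCCT", 0.017437, "3utr", True, 0.17),  # NIPSNAP1 row of Table 3
    Repeat("22", "AC", 0.010, "3utr", True, 1.20),        # invented: period too short
    Repeat("22", "CAG", 0.300, "cds", True, 0.00),        # invented: not in top 5%
    Repeat("22", "GCC", 0.020, "cds", True, 0.40),        # invented: coding, not UTR
    Repeat("22", "AAT", 0.030, "5utr", False, None),      # invented: not brain-expressed
    Repeat("22", "TTC", 0.040, "3utr", True, 0.00),       # invented: not polymorphic
    Repeat("4", "CAG", 0.001, "cds", True, 2.00),         # invented: wrong chromosome
]

candidates, step_counts = prioritize(EXAMPLE)
```

Because the filters are plain data, a user could reorder them, drop the UTR restriction to include intronic repeats, or change the percentile cut-off without touching the cascade itself.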
In this prioritization paradigm, we did not look at any intronic repeats which may mediate the neurological phenotype by a mechanism similar to that of Friedreich's ataxia [37]. The point is that the prioritization paradigm can be defined by the user to dynamically generate a list of candidate repeats based on feature preference within Satellog or the fluctuating biological interpretation of repeat instability.

Conclusion

Satellog enriches the current bioinformatics landscape in which repeats are viewed. For example, the GAA repeat in Friedreich's Ataxia [37] is not detected at all (chr9:67,109,320-67,109,339) in the UCSC genome browser [33] by the TRF [32] and Variable Number Tandem Repeats (VNTR) tracks. The VNTR feature in UCSC detects all perfect 2 to 10 repeat units with 10 or more copies. Repeats detected by this method may over-represent insignificant low period repeats and under-represent potentially interesting high period repeats. In Satellog, not only is the Friedreich's Ataxia GAA repeat detected, but its percentile rank also suggests that this size of GAA repeat is a relatively rare observation in the human genome (percentile rank = 0.045). Satellog integrates disparate data sources to give researchers an idea of how interesting certain repeats are based on their genetic location, tissue expression profile and polymorphism within UniGene. It should be noted that Satellog does not intend to be a de novo detection method for disease-associated repeats. Instead, it provides a comprehensive, integrated bioinformatics platform to prioritize repeats in a convenient and efficient manner. Satellog also presents the first comprehensive identification and integration of disease-associated repeats with other genomic resources for use as bioinformatics reagents in other studies. Satellog should prove useful to investigators interested in prioritizing repeats for typing in diseases showing anticipation or in which repeat polymorphism is thought to play a role in etiology. In addition, given that all sequence information (i.e. the human genome sequence and UniGene sequences) is from presumed "normal" individuals lacking disease phenotypes, Satellog may also prove useful in extending our understanding of the normal role of repeats in genes and transcripts.

Figure 5: Candidate repeats within Huntington's disease linkage region 4p16.3. Sample output from Satellog summarizing candidate repeats within the 4p16.3 Huntington's disease linkage region. Coding CAG-type repeats from chr4:1-4,600,000 were selected along with their peptide sequence, HUGO names and ensembl gene IDs. The repeat encoding 19 glutamines has been associated with Huntington disease progression.

Table 3: Candidate repeats within the chromosome 22 schizophrenia linkage region.

chr  start     end       unit   length  p-value   gene location  name      tissue  mean  sd
22   19632267  19632294  AAC    9       0.019894  3utr           CRKL      Brain   8.04  0.51
22   28276064  28276078  GGCCT  3       0.017437  3utr           NIPSNAP1  Brain   2.97  0.17

Candidate repeats within the chromosome 22 linkage region implicated in schizophrenia along with the tissue expression call in the brain and UniGene cluster summary statistics indicating mean repeat length and polymorphism (standard deviation (sd) values > 0).

Methods

Software dependencies

A perl script "repeatalyzer.pl" functions as a wrapper for a number of different programs to achieve the endpoints of Satellog. repeatalyzer.pl was run with perl v5.6.1 and used BioPerl v1.2 [52], the EnsEMBL Perl API (May 24th, 1999 release), MySQL v10.8 Distribution 3.23.21-beta (for pc-linux-gnu), BLAT v. 28 [53] and v. 34 of the human genome sequence [54].
This script was run in parallel on a 192-node Linux cluster at the BCCA Genome Sciences Centre. More detailed methods information is available at http://satellog.bcgsc.ca.

Detecting microsatellite repeats with Tandem Repeats Finder (TRF)
First, we chose to detect sequences repeated at least twice; second, we were interested exclusively in pure repeat tracts, which are more likely to expand following transmission [34-36]. Command-line TRF has seven parameters that can be manually assigned at run-time: matching weight, mismatch and indel penalties, match probability, indel probability, minimum alignment score to report, and maximum period size to report [32]. We found that matching weight, mismatch and indel penalties, minimum alignment score and maximum period size directly affected the length and purity of hits detected by TRF, whereas changing the match and indel probability features was not useful. The match and indel probability features refer respectively to the percent identity and fraction of indels tolerated in each serial tandem unit detected as a hit. These features allow users to specify alternative expected matching and indel statistical distributions.

Next we evaluated the ability of the matching weight and maximum period size parameters to detect short repeats. Period size refers to the length of the tandemly repeated DNA unit; for instance, CAG repeats have a period of 3. Since TRF hits must be at least 10 bp, the smallest hit for each repeat class reported in Satellog is 10 divided by the repeat unit length. For example, for CAG repeats, the smallest detectable hit that satisfies the minimum hit length is a 3 1/3 repeat unit hit (i.e. CAG CAG CAG C). In short, only pentanucleotide and larger repeats have a minimum of two repeat units in Satellog.

Lastly we investigated the utility of adjusting the mismatch and indel penalties. We found that setting the penalty for these parameters to 4090 produced no impure repeats as hits.
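As a worked illustration of the minimum hit length described above, the following Python sketch computes the smallest number of repeat units for each period. It assumes that the 10 bp minimum and the "repeated at least twice" rule combine as a simple maximum; the function names are hypothetical and are not part of TRF or repeatalyzer.pl.

```python
from fractions import Fraction

MIN_HIT_BP = 10  # minimum TRF hit length stated in the text
MIN_COPIES = 2   # "repeated at least twice", as stated in the text


def min_copies(period: int) -> Fraction:
    """Smallest number of repeat units for a given period.

    Assumption: the two constraints combine as a simple maximum.
    """
    if period < 1:
        raise ValueError("period must be a positive integer")
    return max(Fraction(MIN_HIT_BP, period), Fraction(MIN_COPIES))


def smallest_hit(unit: str) -> str:
    """Shortest pure tract of `unit` that satisfies both constraints."""
    if not unit:
        raise ValueError("unit must not be empty")
    length = max(MIN_HIT_BP, MIN_COPIES * len(unit))
    copies = -(-length // len(unit))  # ceiling division
    return (unit * copies)[:length]


if __name__ == "__main__":
    for period in range(1, 17):
        print(period, min_copies(period))
    print(smallest_hit("CAG"))
```

Under these assumptions, periods 1 to 4 need more than two units to reach 10 bp, while periods 5 to 16 are bounded by the two-copy rule.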
TRF was run on whole chromosome FASTA files from v. 34 of the human genome downloaded from the UCSC genome browser. Hit purity was confirmed by visually inspecting the top high period hits (these hits have the highest probability of introducing indels due to the scoring scheme used by TRF [32]).

Identifying unique repeat classes
A repeat can be represented in a number of ways in double-stranded DNA. TRF detects repeats by the first tandemly repeated unit; therefore, CAGCAGCAG, AGCAGCAGC, and GCAGCAGCA are detected as repeats of CAG, AGC, and GCA respectively. Furthermore, the reference human genome sequence is only presented as the positive strand. Repeats of GTC, TCG, and CGT on the positive strand represent 5'->3' CAG, AGC and GCA repeats respectively on the negative strand. Therefore, to identify all CAG repeats in the human genome it is necessary to detect all CAG, AGC, GCA, GTC, TCG, and CGT repeats on the positive strand. We developed an algorithm to generate all possible sequence varieties of a repeat unit on the positive and negative strands. Our repeat classification algorithm operates by taking an input repeat unit, e.g. CAG, removing the first letter (C in this case) and appending it to the end of the remainder (AG) to create the second repeat unit (AGC). This is then reverse complemented to generate the equivalent sequence on the negative strand (TCG). This procedure is repeated (repeat unit length – 1) times to generate a unique identifier henceforth referred to as the repeat class. Each repeat in Satellog is associated with a single unique repeat class.

Preparing Affymetrix expression data from the GeneNote database
The GeneNote (Gene Normal Tissue Expression) database provides baseline normal expression data of human genes for use in disease studies [42]. GeneNote data were downloaded from the Gene Expression Omnibus (GEO). A total of twelve human tissue profiles are presented in GeneNote: bone marrow, brain, heart, kidney, liver, lung, pancreas, prostate, skeletal muscle, spinal cord, spleen, and thymus. These profiles were generated with the Affymetrix HG-U95 A-E probe-set, covering 62,839 probe-sets. EnsEMBL genes have been mapped to Affymetrix HG-U95 probes by the EnsEMBL project [41]. Once a repeat is detected either inside or within 60 kb of an EnsEMBL gene, that gene's normal expression profile is evaluated by cross-referencing its Affymetrix tags to the GeneNote database within Satellog.

Detecting repeat polymorphisms within UniGene clusters
UniGene contains the largest public repository of transcribed human sequence and represents an attempt to organize this wealth of expression data into discrete transcriptional loci [40]. All human UniGene sequences were processed for use with repeatalyzer.pl. For each repeat detected in UTR or coding sequence, the repeat plus 10 bp of flanking sequence was extracted from EnsEMBL and queried using the BLAT algorithm [53] against a BLAT-formatted database created from sequences representing the longest, highest quality stretch of DNA from each individual UniGene cluster (pre-selected by UniGene as the file Hs.seq.uniq). Polymorphism was evaluated only if BLAT analysis against all UniGene clusters resulted in: 1) hits that achieved BLAT scores of at least 85% of the theoretical maximum for a perfect hit; 2) 90% of the query sequence matching identically within the cluster; and 3) the repeat mapping within 10 kb of the genomic co-ordinates of the UniGene cluster. If a hit to a UniGene cluster satisfied these criteria, the length of the repeat in the cluster was stored in Satellog.
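The three acceptance criteria above can be expressed as a simple filter. The Python sketch below is illustrative only: the record layout and all names are hypothetical, the theoretical maximum score for a perfect hit is assumed to equal the query length, and the thresholds are treated as inclusive. It does not reproduce the logic of repeatalyzer.pl.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SCORE_PERCENT = 85     # criterion 1: share of the perfect-hit score
MIN_IDENTITY_PERCENT = 90  # criterion 2: share of query bases matched identically
MAX_DISTANCE_BP = 10_000   # criterion 3: distance to the cluster's genomic locus


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # inclusive
    end: int    # inclusive


@dataclass(frozen=True)
class ClusterHit:
    query_length: int     # repeat plus flanking sequence, in bp
    score: int            # alignment score of the hit
    identical_bases: int  # query bases matched identically
    repeat_locus: Interval
    cluster_locus: Interval


def gap_bp(a: Interval, b: Interval) -> Optional[int]:
    """Distance between two intervals: 0 if they overlap, None across chromosomes."""
    if a.chrom != b.chrom:
        return None
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def passes_filters(hit: ClusterHit) -> bool:
    """Apply the three criteria with integer arithmetic to avoid rounding."""
    # Assumption: a perfect hit scores one point per query base.
    score_ok = hit.score * 100 >= MIN_SCORE_PERCENT * hit.query_length
    identity_ok = hit.identical_bases * 100 >= MIN_IDENTITY_PERCENT * hit.query_length
    gap = gap_bp(hit.repeat_locus, hit.cluster_locus)
    position_ok = gap is not None and gap <= MAX_DISTANCE_BP
    return score_ok and identity_ok and position_ok


if __name__ == "__main__":
    # Made-up coordinates and scores, for illustration only.
    repeat = Interval("chr22", 1000, 1047)
    cluster = Interval("chr22", 5000, 9000)
    print(passes_filters(ClusterHit(48, 45, 46, repeat, cluster)))
```

Only hits that pass all three checks would have their repeat length recorded; a hit failing any single criterion is discarded in this sketch.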
This feature allows investigators to query all repeats with polymorphisms in UniGene clusters from genomic regions of interest.

repeatalyzer.pl overview
Once the above software and data dependencies are configured, repeatalyzer.pl automatically populates Satellog (Figure 6). The script processes the flat files output by TRF. These files contain the repeat co-ordinates plus the repeat period (the size of the repeated unit), the sequence of the individual repeat unit, the entire repetitive sequence and the repeat length. Repeat co-ordinates are passed to the EnsEMBL API to confirm the authenticity of the co-ordinates generated by TRF. If the repeat is not detected within a gene with the EnsEMBL API, then progressively larger slices incrementing by 15 kb are taken in search of flanking genes. As soon as a gene is located in flanking sequence, no further flanking sequence is collected. However, if no genes are detected within 60 kb of the repeat co-ordinates, repeatalyzer.pl stops searching for genes. If a repeat is detected inside or within 60 kb of an EnsEMBL-defined gene, that gene's primary information (co-ordinates, HUGO name, EnsEMBL ID and description) is collected along with metadata stored in EnsEMBL such as Protein Data Bank (PDB) [55], Online Mendelian Inheritance in Man [40], Gene Ontology (GO) [56], and mappings to Affymetrix probe sets. If the repeat is located in the 5'-UTR, 3'-UTR, or coding sequence of a gene, its polymorphism profile within UniGene clusters is evaluated.

Figure 6. repeatalyzer.pl flowchart. Flowchart outlining how repeatalyzer.pl populates the Satellog database.

Generating a measure of repeat length significance
After running the script to populate Satellog, each repeat's length is compared to the lengths of all repeats of the same repeat class.
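The comparison within a repeat class can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class identifier is taken to be the alphabetically smallest member of the set of rotations on both strands, the negative strand is obtained with the standard reverse complement (so CAG groups with CTG, whereas the worked example in the text lists complemented units such as TCG), and the rank is returned as the fraction of repeats in the class with the same or greater length. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Standard reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def rotations(unit: str) -> List[str]:
    """All cyclic shifts of a unit: move the first letter to the end, repeatedly."""
    return [unit[i:] + unit[:i] for i in range(len(unit))]


def repeat_class(unit: str) -> str:
    """Canonical identifier for a unit and its equivalents on both strands.

    Assumption: the identifier is the alphabetically smallest member.
    """
    unit = unit.upper()
    if not unit or set(unit) - set("ACGT"):
        raise ValueError("unit must be a non-empty string over A, C, G, T")
    members = set(rotations(unit)) | set(rotations(reverse_complement(unit)))
    return min(members)


def percentile_ranks(repeats: Sequence[Tuple[str, int]]) -> List[float]:
    """For each (unit, length), the fraction of same-class repeats that are
    the same length or longer."""
    lengths_by_class: Dict[str, List[int]] = defaultdict(list)
    for unit, length in repeats:
        lengths_by_class[repeat_class(unit)].append(length)
    ranks = []
    for unit, length in repeats:
        peers = lengths_by_class[repeat_class(unit)]
        same_or_longer = sum(1 for other in peers if other >= length)
        ranks.append(same_or_longer / len(peers))
    return ranks


if __name__ == "__main__":
    # Made-up lengths, for illustration only.
    example = [("CAG", 30), ("CTG", 12), ("GCA", 57), ("AGC", 12), ("GAA", 20)]
    print(percentile_ranks(example))
```

A small rank therefore marks a repeat that is long relative to others of its class, which is the sense in which a low percentile rank flags a rare observation.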
The majority of repeats associated with disease undergo expansions from already large reference genome lengths relative to other repeats of the same class [5]. Each repeat's percentile rank is calculated from the distribution of repeat lengths within that repeat's class. It reflects the proportion of repeats with the same or greater length in the repeat class's genomic distribution.

Authors' Contributions
PIM conceived of the study, wrote all analysis scripts, collected and input data into the database, analyzed the data, directed the Satellog website design, wrote all documentation and the tutorial accompanying the database and drafted the manuscript. CRM developed the online graphical user interface for the database, troubleshot and re-indexed queries for the database and provided technical expertise for realizing the web version of Satellog. SLB participated in the design of the study and gave crucial intellectual direction to the final manuscript. BFFO participated in the design of the study and provided assistance with bioinformatics analysis. RSD provided key biological background to guide the design of the study. BRL participated in the design and strengthened the clinical perspective of the final manuscript. RAH participated in the study design and coordination, performed data analysis and gave critical direction to the final manuscript.
All authors read and approved the final manuscript.

Appendix
Figure S1 – Repeat density (bp of repeat sequence / Mb) per human chromosome. Available online at: http://satellog.bcgsc.ca/source.php.
Table S1 – Total repeat count and density by chromosome. Available online at: http://satellog.bcgsc.ca/source.php.
Table S2 – Repeat period count and density by chromosome. Available online at: http://satellog.bcgsc.ca/source.php.
Table S3 – Repeat unit count and density by chromosome. Available online at: http://satellog.bcgsc.ca/source.php.
Table S4 – Repeat unit count and density by gene region. Available online at: http://satellog.bcgsc.ca/source.php.

Acknowledgements
1) UBC/SFU CIHR Training Program for Bioinformatics in Health Research, Rooms 308/308A, 2206 East Mall, University of British Columbia, Vancouver, BC, V6T 1Z3, Canada
2) Mark Mayo and Bernard Li at the BCCA Genome Sciences Centre for technical support with cluster computing.
3) Martin Krzywinski for creating the Satellog logo.

References
1. Harper PS, Harley HG, Reardon W, Shaw DJ: Anticipation in myotonic dystrophy: new light on an old problem. Am J Hum Genet 1992, 51:10-16.
2. Verkerk AJ, Pieretti M, Sutcliffe JS, Fu YH, Kuhl DP, Pizzuti A, Reiner O, Richards S, Victoria MF, Zhang FP, et al.: Identification of a gene (FMR-1) containing a CGG repeat coincident with a breakpoint cluster region exhibiting length variation in fragile X syndrome. Cell 1991, 65:905-914.
3. Kremer EJ, Pritchard M, Lynch M, Yu S, Holman K, Baker E, Warren ST, Schlessinger D, Sutherland GR, Richards RI: Mapping of DNA instability at the fragile X to a trinucleotide repeat sequence p(CCG)n. Science 1991, 252:1711-1714.
4. La Spada AR, Wilson EM, Lubahn DB, Harding AE, Fischbeck KH: Androgen receptor gene mutations in X-linked spinal and bulbar muscular atrophy. Nature 1991, 352:77-79.
5. Cleary JD, Pearson CE: The contribution of cis-elements to disease-associated repeat instability: clinical and experimental evidence. Cytogenet Genome Res 2003, 100:25-55.
6. Koide R, Ikeuchi T, Onodera O, Tanaka H, Igarashi S, Endo K, Takahashi H, Kondo R, Ishikawa A, Hayashi T, et al.: Unstable expansion of CAG repeat in hereditary dentatorubral-pallidoluysian atrophy (DRPLA). Nat Genet 1994, 6:9-13.
7. A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. The Huntington's Disease Collaborative Research Group. Cell 1993, 72:971-983.
8. Banfi S, Servadio A, Chung MY, Kwiatkowski TJJ, McCall AE, Duvick LA, Shen Y, Roth EJ, Orr HT, Zoghbi HY: Identification and characterization of the gene causing type 1 spinocerebellar ataxia. Nat Genet 1994, 7:513-520.
9. Imbert G, Saudou F, Yvert G, Devys D, Trottier Y, Garnier JM, Weber C, Mandel JL, Cancel G, Abbas N, Durr A, Didierjean O, Stevanin G, Agid Y, Brice A: Cloning of the gene for spinocerebellar ataxia 2 reveals a locus with high sensitivity to expanded CAG/glutamine repeats. Nat Genet 1996, 14:285-291.
10. Ikeda H, Yamaguchi M, Sugai S, Aze Y, Narumiya S, Kakizuka A: Expanded polyglutamine in the Machado-Joseph disease protein induces cell death in vitro and in vivo. Nat Genet 1996, 13:196-202.
11. Zhuchenko O, Bailey J, Bonnen P, Ashizawa T, Stockton DW, Amos C, Dobyns WB, Subramony SH, Zoghbi HY, Lee CC: Autosomal dominant cerebellar ataxia (SCA6) associated with small polyglutamine expansions in the alpha 1A-voltage-dependent calcium channel. Nat Genet 1997, 15:62-69.
12. David G, Abbas N, Stevanin G, Durr A, Yvert G, Cancel G, Weber C, Imbert G, Saudou F, Antoniou E, Drabkin H, Gemmill R, Giunti P, Benomar A, Wood N, Ruberg M, Agid Y, Mandel JL, Brice A: Cloning of the SCA7 gene reveals a highly unstable CAG repeat expansion. Nat Genet 1997, 17:65-70.
13. Ross CA, Margolis RL, Becher MW, Wood JD, Engelender S, Cooper JK, Sharp AH: Pathogenesis of neurodegenerative diseases associated with expanded glutamine repeats: new answers, new questions. Prog Brain Res 1998, 117:397-419.
14. Cummings CJ, Zoghbi HY: Trinucleotide repeats: mechanisms and pathophysiology. Annu Rev Genomics Hum Genet 2000, 1:281-328.
15. Lalioti MD, Scott HS, Buresi C, Rossier C, Bottani A, Morris MA, Malafosse A, Antonarakis SE: Dodecamer repeat expansion in cystatin B gene in progressive myoclonus epilepsy. Nature 1997, 386:847-851.
16. Matsuura T, Yamagata T, Burgess DL, Rasmussen A, Grewal RP, Watase K, Khajavi M, McCall AE, Davis CF, Zu L, Achari M, Pulst SM, Alonso E, Noebels JL, Nelson DL, Zoghbi HY, Ashizawa T: Large expansion of the ATTCT pentanucleotide repeat in spinocerebellar ataxia type 10. Nat Genet 2000, 26:191-194.
17. Brook JD, McCurrach ME, Harley HG, Buckler AJ, Church D, Aburatani H, Hunter K, Stanton VP, Thirion JP, Hudson T, et al.: Molecular basis of myotonic dystrophy: expansion of a trinucleotide (CTG) repeat at the 3' end of a transcript encoding a protein kinase family member. Cell 1992, 68:799-808.
18. Greco CM, Hagerman RJ, Tassone F, Chudley AE, Del Bigio MR, Jacquemont S, Leehey M, Hagerman PJ: Neuronal intranuclear inclusions in a new cerebellar tremor/ataxia syndrome among fragile X carriers. Brain 2002, 125:1760-1771.
19. Jiang H, Mankodi A, Swanson MS, Moxley RT, Thornton CA: Myotonic dystrophy type 1 is associated with nuclear foci of mutant RNA, sequestration of muscleblind proteins and deregulated alternative splicing in neurons. Hum Mol Genet 2004, 13:3079-3088.
20. Speer MC, Gilchrist JM, Stajich JM, Gaskell PC, Westbrook CA, Horrigan SK, Bartoloni L, Yamaoka LH, Scott WK, Pericak-Vance MA: Evidence for anticipation in autosomal dominant limb-girdle muscular dystrophy. J Med Genet 1998, 35:305-308.
21. Bayless TM, Picco MF, LaBuda MC: Genetic anticipation in Crohn's disease. Am J Gastroenterol 1998, 93:2322-2325.
22. Horwitz M, Goode EL, Jarvik GP: Anticipation in familial leukemia. Am J Hum Genet 1996, 59:990-998.
23. Wright GD, Regan M, Deighton CM, Wallis G, Doherty M: Evidence for genetic anticipation in nodal osteoarthritis. Ann Rheum Dis 1998, 57:524-526.
24. Bonifati V, Vanacore N, Meco G: Anticipation of onset age in familial Parkinson's disease. Neurology 1994, 44:1978-1979.
25. McDermott E, Khan MA, Deighton C: Further evidence for genetic anticipation in familial rheumatoid arthritis. Ann Rheum Dis 1996, 55:475-477.
26. Bleyl S, Nelson L, Odelberg SJ, Ruttenberg HD, Otterud B, Leppert M, Ward K: A gene for familial total anomalous pulmonary venous return maps to chromosome 4p13-q12. Am J Hum Genet 1995, 56:408-415.
27. Ohara K, Suzuki Y, Ushimi Y, Yoshida K: Anticipation and imprinting in Japanese familial mood disorders. Psychiatry Res 1998, 79:191-198.
28. Bassett AS, Honer WG: Evidence for anticipation in schizophrenia. Am J Hum Genet 1994, 54:864-870.
29. Bassett AS, Husted J: Anticipation or ascertainment bias in schizophrenia? Penrose's familial mental illness sample. Am J Hum Genet 1997, 60:630-637.
30. Battaglia M, Bertella S, Bajo S, Binaghi F, Bellodi L: Anticipation of age at onset in panic disorder. Am J Psychiatry 1998, 155:590-595.
31. Ohara K, Suzuki Y, Ochiai M, Yoshida K: Age of onset anticipation in anxiety disorders. Psychiatry Res 1999, 89:215-221.
32. Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 1999, 27:573-580.
33. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res 2002, 12:996-1006.
34. Chong SS, McCall AE, Cota J, Subramony SH, Orr HT, Hughes MR, Zoghbi HY: Gametic and somatic tissue-specific heterogeneity of the expanded SCA1 CAG repeat in spinocerebellar ataxia type 1. Nat Genet 1995, 10:344-350.
35. Chung MY, Ranum LP, Duvick LA, Servadio A, Zoghbi HY, Orr HT: Evidence for a mechanism predisposing to intergenerational CAG repeat instability in spinocerebellar ataxia type I. Nat Genet 1993, 5:254-258.
36. Kunst CB, Warren ST: Cryptic and polar variation of the fragile X repeat could result in predisposing normal alleles. Cell 1994, 77:853-861.
37. Campuzano V, Montermini L, Molto MD, Pianese L, Cossee M, Cavalcanti F, Monros E, Rodius F, Duclos F, Monticelli A, Zara F, Canizares J, Koutuikova H, Bidichandani SI, Gellera C, Brice A, Trouillas P, De Michele G, Filla A, De Frutos R, Palau F, Patel PI, Di Donato S, Mandel JL, Cocozza S, Koenig M, Pandolfo M: Friedreich's ataxia: autosomal recessive disease caused by an intronic GAA triplet repeat expansion. Science 1996, 271:1423-1427.
38. Subramanian S, Madgula VM, George R, Mishra RK, Pandit MW, Kumar CS, Singh L: MRD: a microsatellite repeats database for prokaryotic and eukaryotic genomes. Genome Biol 2002, 3:PREPRINT0011.
39. Collins JR, Stephens RM, Gold B, Long B, Dean M, Burt SK: An exhaustive DNA micro-satellite map of the human genome using high performance computing. Genomics 2003, 82:10-19.
40. Wheeler DL, Church DM, Edgar R, Federhen S, Helmberg W, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Suzek TO, Tatusova TA, Wagner L: Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res 2004, 32 Database issue:D35-40.
41. Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I, Clamp M: The Ensembl genome database project. Nucleic Acids Res 2002, 30:38-41.
42. Shmueli O, Horn-Saban S, Chalifa-Caspi V, Shmoish M, Ophir R, Benjamin-Rodrig H, Safran M, Domany E, Lancet D: GeneNote: whole genome expression profiles in normal human tissues. C R Biol 2003, 326:1067-1072.
43. Subramanian S, Mishra RK, Singh L: Genome-wide analysis of microsatellite repeats in humans: their abundance and density in specific genomic regions. Genome Biol 2003, 4:R13.
44. Crisponi L, Deiana M, Loi A, Chiappe F, Uda M, Amati P, Bisceglia L, Zelante L, Nagaraja R, Porcu S, Ristaldi MS, Marzella R, Rocchi M, Nicolino M, Lienhardt-Roussie A, Nivelon A, Verloes A, Schlessinger D, Gasparini P, Bonneau D, Cao A, Pilia G: The putative forkhead transcription factor FOXL2 is mutated in blepharophimosis/ptosis/epicanthus inversus syndrome. Nat Genet 2001, 27:159-166.
45. Clark RM, Dalgliesh GL, Endres D, Gomez M, Taylor J, Bidichandani SI: Expansion of GAA triplet repeats in the human genome: unique origin of the FRDA mutation at the center of an Alu. Genomics 2004, 83:373-383.
46. Kashi Y, King D, Soller M: Simple sequence repeats as a source of quantitative genetic variation. Trends Genet 1997, 13:74-78.
47. MacDonald ME, Novelletto A, Lin C, Tagle D, Barnes G, Bates G, Taylor S, Allitto B, Altherr M, Myers R, Lehrach H, Collins FS, Wasmuth JJ, Frontali M, Gusella JF: The Huntington's disease candidate region exhibits many different haplotypes. Nat Genet 1992, 1:99-103.
48. Shaw SH, Kelly M, Smith AB, Shields G, Hopkins PJ, Loftus J, Laval SH, Vita A, De Hert M, Cardon LR, Crow TJ, Sherrington R, DeLisi LE: A genome-wide search for schizophrenia susceptibility genes. Am J Med Genet 1998, 81:364-376.
49. Pulver AE, Karayiorgou M, Wolyniec PS, Lasseter VK, Kasch L, Nestadt G, Antonarakis S, Housman D, Kazazian HH, Meyers D, Ott J, Lamacz M, Liang K-Y, Hanfelt J, Ullrich G, DeMarchi N, Ranu E, McHugh PR, Adler L, Thomas M: Sequential strategy to identify a susceptibility gene for schizophrenia: report of potential linkage on chromosome 22q12-q13.1: Part 1. Am J Med Genet 1994, 54:36-43.
50. Coon H, Jensen S, Holik J, Hoff M, Myles-Worsley M, Reimherr F, Wender P, Waldo M, Freedman R, Leppert M, et al.: Genomic scan for genes predisposing to schizophrenia. Am J Med Genet 1994, 54:59-71.
51. Murphy KC: Schizophrenia and velo-cardio-facial syndrome. Lancet 2002, 359:426-430.
52. Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E: The Bioperl toolkit: Perl modules for the life sciences. Genome Res 2002, 12:1611-1618.
53. Kent WJ: BLAT--the BLAST-like alignment tool. Genome Res 2002, 12:656-664.
54. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, Szustakowki J, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ: Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921.
55. Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, Feng Z, Gilliland GL, Iype L, Jain S, Fagan P, Marvin J, Padilla D, Ravichandran V, Schneider B, Thanki N, Weissig H, Westbrook JD, Zardecki C: The Protein Data Bank. Acta Crystallogr D Biol Crystallogr 2002, 58:899-907.
56. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25:25-29.
