UBC Faculty Research and Publications

CAG-encoded polyglutamine length polymorphism in the human genome Butland, Stefanie L; Devon, Rebecca S; Huang, Yong; Mead, Carri-Lyn; Meynert, Alison M; Neal, Scott J; Lee, Soo S; Wilkinson, Anna; Yang, George S; Yuen, Macaire M; Hayden, Michael R; Holt, Robert A; Leavitt, Blair R; Ouellette, BF F May 22, 2007

Full Text

BioMed Central
BMC Genomics
Open Access
Research article

CAG-encoded polyglutamine length polymorphism in the human genome

Stefanie L Butland1, Rebecca S Devon2, Yong Huang1, Carri-Lyn Mead1, Alison M Meynert1, Scott J Neal2, Soo Sen Lee1, Anna Wilkinson1, George S Yang3, Macaire MS Yuen1, Michael R Hayden2,4, Robert A Holt3,5, Blair R Leavitt†2,4 and BF Francis Ouellette*†1,4

Address: 1UBC Bioinformatics Centre, Michael Smith Laboratories, University of British Columbia, Vancouver, Canada, 2Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, Canada, 3Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, Canada, 4Department of Medical Genetics, University of British Columbia, Vancouver, Canada and 5Department of Psychiatry, University of British Columbia, Vancouver, Canada

Email: Stefanie L Butland - butland@bioinformatics.ubc.ca; Rebecca S Devon - Rebecca.Devon@ed.ac.uk; Yong Huang - dewriver@gmail.com; Carri-Lyn Mead - cmead@bcgsc.ca; Alison M Meynert - ameynert@ebi.ac.uk; Scott J Neal - sneal@cmmt.ubc.ca; Soo Sen Lee - soo_sen@yahoo.com; Anna Wilkinson - guywil@shaw.ca; George S Yang - gyang@bcgsc.ca; Macaire MS Yuen - mackyuen@gmail.com; Michael R Hayden - mrh@cmmt.ubc.ca; Robert A Holt - rholt@bcgsc.ca; Blair R Leavitt - bleavitt@cmmt.ubc.ca; BF Francis Ouellette* - francis@bioinformatics.ubc.ca

* Corresponding author    † Equal contributors

Abstract

Background: Expansion of polyglutamine-encoding CAG trinucleotide repeats has been identified as the pathogenic mutation in nine different genes associated with neurodegenerative disorders. The majority of individuals clinically diagnosed with spinocerebellar ataxia do not have mutations within known disease genes, and it is likely that additional ataxias or Huntington disease-like disorders will be found to be caused by this common mutational mechanism.
We set out to determine the length distributions of CAG-polyglutamine tracts for the entire human genome in a set of healthy individuals in order to characterize the nature of polyglutamine repeat length variation across the human genome, to establish the background against which pathogenic repeat expansions can be detected, and to prioritize candidate genes for repeat expansion disorders.

Results: We found that repeats, including those in known disease genes, have unique distributions of glutamine tract lengths, as measured by fragment analysis of PCR-amplified repeat regions. This emphasizes the need to characterize each distribution and avoid making generalizations between loci. The best predictors of known disease genes were occurrence of a long CAG-tract uninterrupted by CAA codons in their reference genome sequence, and high glutamine tract length variance in the normal population. We used these parameters to identify eight priority candidate genes for polyglutamine expansion disorders. Twelve CAG-polyglutamine repeats were invariant and these can likely be excluded as candidates. We outline some confusion in the literature about this type of data, difficulties in comparing such data between publications, and its application to studies of disease prevalence in different populations. Analysis of Gene Ontology-based functions of CAG-polyglutamine-containing genes provided a visual framework for interpretation of these genes' functions. All nine known disease genes were involved in DNA-dependent regulation of transcription or in neurogenesis, as were all of the well-characterized priority candidate genes.

Conclusion: This publication makes freely available the normal distributions of CAG-polyglutamine repeats in the human genome. Using these background distributions, against which pathogenic expansions can be identified, we have begun screening for mutations in individuals clinically diagnosed with novel forms of spinocerebellar ataxia or Huntington disease-like disorders who do not have identified mutations within the known disease-associated genes.

Published: 22 May 2007
BMC Genomics 2007, 8:126 doi:10.1186/1471-2164-8-126
Received: 23 October 2006
Accepted: 22 May 2007
This article is available from: http://www.biomedcentral.com/1471-2164/8/126
© 2007 Butland et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Nine different neurodegenerative disorders are known to be caused by expansions of polyglutamine-encoding CAG trinucleotide (CAGpolyQ) repeats in the following genes: the HD gene in Huntington disease [1], ATN1 in dentatorubral pallidoluysian atrophy or Haw River syndrome [2,3], AR in spinal and bulbar muscular atrophy [4], CACNA1A in spinocerebellar ataxia SCA6 [5], TBP in SCA17 [6] and ATXN1, 2, 3, and 7 in SCA1 [7], SCA2 [8-10], SCA3 (Machado-Joseph disease) [11], and SCA7 [12]. These disorders share similar clinical features which include selective neuronal degradation associated with a progressive neurological phenotype, but their respective causative genes appear to have little functional or structural similarity, suggesting that functional genomics approaches to identifying new gene-disease associations will not be useful.
The repeat expansion mechanism of pathogenesis is a shared molecular feature, and this form of mutation has only been exhaustively ruled out for a few familial forms of SCA, and has not been examined at all for the majority of patients who present with SCA or HD-like disorders.

Despite recent advances in molecular diagnosis, the majority of individuals clinically diagnosed with SCA do not have identified mutations within the known disease-associated genes [13]. There are 28 genetically distinct SCAs identified by the HUGO Gene Nomenclature Committee (HGNC) [14], but only 13 causative genes are known. Six genes cause SCA by CAGpolyQ expansions, but the remaining 15 clinically-defined forms of SCA have no known genetic mutation associated with them, and the search for causative genes continues. It is likely that some of these forms of SCA will be found to be caused by this common mutational mechanism. Candidate genes for SCA and HD-like disorders can be identified using a whole-genome screening approach based on the computational identification of a common sequence we have termed a Genomic Mutational Signature (GeMS). GeMS are sequence patterns occurring in the normal genome that, when mutated, cause disease; in this case, CAG trinucleotide repeats that encode an extended tract of glutamine residues in the protein. A significant advantage of this approach is that novel candidate disease genes are identified and can then be screened for mutations in single cases. This approach does not require additional family members, additional affected patients, or a detailed family history.

Partial lists of CAGpolyQ-containing genes identified using classical [15-20] or computational methods [21-24] have been published. Screening for CAG expansions in one such gene list, in patients with hereditary ataxias, led directly to the discovery of the causative gene for dentatorubral pallidoluysian atrophy [2,16]. To date, there has been no complete genome-wide analysis of the distributions of CAGpolyQ repeat lengths in a control population in order to set the baseline from which to detect expansions. Studies on a limited number of genes have revealed that different genes have very different polyglutamine tract (Q-tract) length distributions, with some invariant (CREBBP) [25], some bimodal (ATXN3) [26], some very narrow (ATXN2) [26] and some broad (AR, ATN1, SMARCA2 and THAP11) [26-28].

The molecular nature of polyglutamine repeats

The amino acid glutamine (Q) is encoded by CAG and CAA trinucleotides. Q-tracts in proteins are typically encoded by mixtures of these two codons, while expanded Q-tracts in disease-causing genes are typically composed of long uninterrupted repeats of the CAG trinucleotide only. Long uninterrupted CAG repeats are known to be a substrate for expansion mutation by a variety of mechanisms. The underlying process is currently thought to involve the generation of abnormal DNA structures induced by factors such as replication slippage, DNA repair and recombination, which can contribute to repeat instability acting either separately or in combination [29-32]; these mutations underlie pathogenic expansions [33] and genetic anticipation [34,35]. Q-tracts encoded by mixtures of CAG and CAA codons, however, are less prone to expansion [30,36,37]. The precise nucleotide sequence of a repeat tract determines a particular allele's susceptibility to large expansion mutations, while the amino acid sequence (the Q-tract) in the context of the whole protein determines the effect of a length change on molecular and clinical phenotypes.

Characteristics of known disease genes

One motivation for this research was to enable us to prioritize candidate genes for polyglutamine expansion disorders. Thus, we sought to identify hallmarks among the known disease genes to which we could compare our data on CAGpolyQ genes not yet associated with disease. Disease-causing CAGpolyQ-containing genes tend to be considered a homogeneous group in terms of their repeats, with an often-cited pathogenic threshold of about 35 glutamines. In fact, a closer look at the normal and pathogenic characteristics of each reveals their unique qualities. ATXN2 has a remarkably narrow distribution of Q-tract lengths with very few alleles longer or shorter than the modal length of 22Q [26,37]. In contrast, ATXN3 has a broad bi- (or tri-) modal distribution of Q-tract lengths [26]. Disease genes can also differ in the number of Q-residues that separate the longest normal from the shortest pathogenic allele. The longest normal ATN1 Q-tract is 36Q and the shortest disease allele has 48Q [26,38], while a single residue separates normal (19Q) from pathogenic (20Q) Q-tracts in CACNA1A [26,38]. Some disease genes carry non-glutamine interruptions in their Q-tracts, though their lengths are often reported as "repeat lengths" as if they were pure Q-tracts. For example, normal ATXN1 has one to three CAT (coding for histidine, H) interruptions near the middle of the Q-tract, but in SCA1 disease alleles the repeat tracts are pure CAGpolyQ [39].
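The difference between a reported "repeat length", the encoded Q-tract and the longest pure CAG run can be made concrete with a short sketch. It is illustrative only: the helper names and the three-codon lookup table are assumptions made for this example, and the two alleles are written out from the reference repeat structures listed in Table 1 (ATXN1 as 12 CAG, CAT, CAG, CAT, 14 CAG; ATXN2 as 13 CAG, CAA, 9 CAG).

```python
# Only the codons discussed in the text; this is not a full genetic code table.
CODON_TO_AA = {"CAG": "Q", "CAA": "Q", "CAT": "H"}


def codons(dna):
    """Split an in-frame DNA string into a list of codons."""
    if len(dna) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def translate(dna):
    """Translate the repeat region into one-letter amino acid codes."""
    return "".join(CODON_TO_AA[c] for c in codons(dna))


def longest_run(items, wanted):
    """Length of the longest run of consecutive items that are in `wanted`."""
    best = current = 0
    for item in items:
        current = current + 1 if item in wanted else 0
        best = max(best, current)
    return best


# Reference alleles written out from the repeat structures in Table 1.
atxn1 = "CAG" * 12 + "CAT" + "CAG" + "CAT" + "CAG" * 14
atxn2 = "CAG" * 13 + "CAA" + "CAG" * 9

for name, dna in [("ATXN1", atxn1), ("ATXN2", atxn2)]:
    print(
        name,
        "codons in repeat:", len(codons(dna)),
        "longest Q-tract:", longest_run(translate(dna), {"Q"}),
        "longest pure CAG run:", longest_run(codons(dna), {"CAG"}),
    )
```

In this sketch the ATXN1 repeat spans 29 codons but its longest pure CAG run is 14, while the ATXN2 reference allele encodes a 23-residue Q-tract whose longest pure CAG run is 13.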
Clearly one must be cautious in making assumptions about common features among polyglutamine expansion disease genes when seeking to identify new disease-associated genes.

At the sequence level, polyglutamine expansion disease genes share several characteristics. They have long uninterrupted CAG tracts [29] and tend to have polymorphic Q-tract lengths [26,36]. Analysis of both genomic DNA and expressed sequence tags has shown that pure CAG-tract length is correlated with Q-tract variance [36,40,41] and that interruptions provide stability to repeat tracts [36,37]. Finally, comparisons of orthologous human and rodent genes show that the lengths of disease-associated Q-tracts have a low level of conservation between species compared with those that are not associated with disease [29,42].

The products of the genes causing polyglutamine expansion disorders do not all share a specific function, but the phenotypic overlap of these disorders does suggest some common functions in either their normal or mutated states, or both. As early as 1989, researchers noted the involvement of polyQ-containing genes in transcriptional regulation [43]. This connection spans organisms from yeast to humans [44-48] and known disease-causing genes like HD, TBP and ATXN7 are directly involved in transcription and transcriptional regulation [49-55]. ATXN1 and ATXN2 are thought to be involved in RNA metabolism [56,57] while CACNA1A is the only ion channel gene known to cause a polyglutamine expansion disorder [5]. The normal function of a gene product and the role of the Q-tract in that protein determine the distribution of repeat lengths in the normal population and the threshold for pathogenic expansion for each gene. Therefore, the functions of CAGpolyQ-containing genes must be assessed in conjunction with the normal levels of repeat polymorphism in order to prioritize candidate genes for polyglutamine expansion disorders.

Summary

Using the human genome reference sequence [58,59] and Ensembl annotated genes [60] we performed a genome-wide computational identification of all candidate genes containing a specific GeMS sequence, CAGpolyQ repeats. We used fragment analysis to assess the CAG-tract lengths of these candidate genes in a large control population. We also applied two methods of analyzing the potential functions of these genes based on the Gene Ontology (GO) system of functional classification [61] in order to identify and visualize the network of functional relationships among the CAGpolyQ-containing genes in the human genome. Using related approaches, Lavoie and colleagues identified polyalanine-containing genes in the human genome and assessed their normal levels of polymorphism [62]. Functional analysis revealed that the majority of polyalanine-containing genes have roles in transcriptional regulation [62].

In characterizing the Q-tract length distributions for 64 CAGpolyQ tracts in 62 genes in the human genome, we find that each Q-tract has a unique distribution of Q-tract lengths. The best predictors of known disease genes were occurrence of a long uninterrupted CAG-tract in their reference genome sequence and high Q-tract length variance in the normal population. Therefore, we used these parameters to identify eight priority candidate genes for polyglutamine expansion disorders. The majority of CAGpolyQ-containing genes are involved in transcriptional regulation and neurogenesis. We provide a visual framework for interpretation of new information on CAGpolyQ gene functions and their biomolecular interactions.

Results

Identification of CAGpolyQ-containing Genes

CAGpolyQ repeats were identified on the basis of having tandemly repeated CAG trinucleotides in the sequence within the boundaries of a known gene that had five or more tandem glutamine residues in its peptide sequence (see Methods for detailed description of approach and data sources). Build 33 of the human genome sequence [58] contained 436 CAG trinucleotide repeats in total. Sixty-six of these CAG repeats lay in glutamine-coding sequences in genes including all nine genes in which mutation by expansion of their CAGpolyQ repeat tract is known to cause a neurodegenerative disorder (Table 1).

Distributions of Q-tract Lengths

Using PCR amplification and ABI fragment analysis we established the range of CAGpolyQ tract lengths for 64 targets (in 62 genes) in a set of healthy individuals of mixed ethnic background (Table 1, Additional file 1). We screened at least 130 normal alleles for each target (mean 162), including X-linked genes, giving us 99% confidence that 95% of the whole population lie between the minimum and maximum values in our sample (95% tolerance; see Methods), with the exception of four targets for which we screened slightly fewer alleles due to technical limitations: ATXN2 and CACNA1A (94% tolerance), FOXP2 and RUNX2 (93% tolerance). Table 1 summarizes data for 66 CAGpolyQ repeat targets in 64 genes.

Known disease genes have long uninterrupted CAG-tracts and high Q-tract length variances

We sought in our data some hallmark of the nine known disease genes that would allow us to prioritize candidates among the 54 genes not yet associated with CAGpolyQ expansion disorders.
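The tolerance figures quoted for the allele screen (at least 130 alleles for 99% confidence that 95% of the population lies between the sample minimum and maximum) can be explored with a minimal sketch. It assumes the standard distribution-free result for the smallest and largest of n independent observations; the authors' own calculation is described in their Methods and may differ. The function names are illustrative, and the formula is derived for continuous measurements, so it is only an approximation for integer tract lengths.

```python
def minmax_coverage_confidence(n, p):
    """Confidence that at least a fraction p of the population lies between
    the smallest and largest of n independent observations.

    Assumed formula (distribution-free tolerance interval on the sample
    extremes): 1 - n * p**(n - 1) + (n - 1) * p**n
    """
    return 1.0 - n * p ** (n - 1) + (n - 1) * p ** n


def alleles_needed(p, confidence):
    """Smallest sample size whose min-max range reaches the target confidence."""
    n = 2
    while minmax_coverage_confidence(n, p) < confidence:
        n += 1
    return n


# Allele counts (N) from Table 1 paired with the tolerance levels in the text.
screened = [
    ("typical target", 130, 0.95),
    ("ATXN2", 124, 0.94),
    ("CACNA1A", 112, 0.94),
    ("FOXP2", 100, 0.93),
    ("RUNX2", 100, 0.93),
]

for label, n, p in screened:
    conf = minmax_coverage_confidence(n, p)
    print(f"{label}: n={n}, coverage={p:.2f}, confidence={conf:.4f}")

print("alleles needed for 95% coverage at 99% confidence:",
      alleles_needed(0.95, 0.99))
```

Under this assumed formula, 130 is the smallest sample size that reaches 99% confidence for 95% coverage, and the smaller samples for ATXN2, CACNA1A, FOXP2 and RUNX2 reach 99% confidence only at the lower coverage levels quoted in the text.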
Sorting CAGpolyQ repeats byincreasing Q-tract length variance (Table 1) clustered dis-ease genes in the top one third of 64 targets. Known dis-ease gene Q-tract length variances ranged from 0.79(ATXN2) to 29.2 (ATXN3). The highest Q-tract length var-iances of all targets were observed in four known diseasegenes: ATXN3, ATN1, AR and HD. The least polymorphicdisease target, ATXN2, is distinguished from other diseasegenes by its previously documented tight distribution ofQ-tract lengths [26].Q-tracts are made up of lengths of CAG codons that canbe pure or interspersed with one or more CAA codons.Length polymorphism tends to occur within CAG-tracts.Sorting CAGpolyQ repeats by the length of their longestuninterrupted CAG-tract in the reference genome clus-tered disease genes in the top half of 64 targets. This wasincreased to the top one third if ATXN3 was excluded dueto its reference genome sequence reflecting the low modeof a bimodal distribution of repeat tract lengths (see graphin Additional file 2). Disease gene CAG-tract lengthsranged from 10 (ATXN7) to 22 (AR) and the longest unin-terrupted CAG-tracts of all targets occurred in four diseasegenes: AR (22 CAG), HD (19 CAG), TBP (19 CAG) andATN1 (15 CAG).The length of the longest uninterrupted CAG-tract in thereference genome for each target (e.g. CAG13CAA1CAG9has CAG-tract length of 13; see Table 1) was positively cor-between long CAG-tracts and high Q-tract length vari-ance, we divided all targets in two groups at the medianCAG length of eight and tested the null hypothesis thatvariances were equal in the two groups. Q-tract length var-iances were indeed higher with longer CAG-tracts (p =0.002, 1-tailed heteroscedastic t-test).Mean or maximum Q-tract length failed to yield any sig-nificant clustering of disease genes, and mean Q-tractlength was only very weakly correlated with Q-tract lengthvariance (correlation = 0.12). 
Underlying this relationshipis the fairly weak correlation of uninterrupted CAG-tractlength with mean Q-tract length (0.31, ATXN3 excluded).Mixtures of CAG and CAA codons making up the Q-tractaccount for this. One telling example is FOXP2 which hadthe longest mean and maximum Q-tract lengths but rela-tively little variance in Q-tract length. In fact, FOXP2 hadthe second-shortest uninterrupted CAG-tract of all 66 tar-gets. Based on our analysis, this low level of polymor-phism is predicted by the short pure CAG repeat length.Sorting targets according to other parameters also failed toyield any significant clustering of disease genes. Theseincluded sorting by the proportion of alleles with Q-tractlengths longer than mean + 1 SD, and by repeat purity,which was a combined measure of both the length of thelongest uninterrupted CAG-tract and the total Q-tractlength.Priority candidates for polyglutamine expansion disordersA plot of CAG length versus Q-tract length variance foreach target allowed us to identify eight genes as prioritycandidates for polyglutamine expansion disorders (Figure1). We selected genes that had uninterrupted CAG-tractsequal to or longer than 10 CAG (the shortest uninter-rupted CAG-tract in a known disease gene, ATXN7) andhad Q-tract length variance equal to or higher than 0.79(the lowest Q-tract variance in a known disease gene,ATXN2). All eight priority candidates: C14orf4, KCNN3,KIAA2018, MEF2A, NCOR2, RAI1, SMARCA2, andTHAP11 are expressed in normal brain [63-66]. This list isnot meant to be exhaustive, but rather a list of the topeight genes prioritized according to two hallmarks ofknown disease genes.Twelve invariant CAGpolyQ repeats have short CAG-tractsIn this set of 64 CAGpolyQ repeats, having at least fourtandem CAG codons coding for five tandem glutamineresidues, mean Q-tract length ranged from five to 39.8(Table 1). 
Twelve repeats in eleven genes, including CREB-binding protein (CREBBP) for example, had no changesin Q-tract length in as many as 212 alleles tested. An addi-tional six repeats were essentially invariant with only onePage 4 of 18(page number not for citation purposes)related with its level of polymorphism (correlation = 0.62,ATXN3 excluded; Figure 1). Given this associationout of as many as 184 alleles differing in length by one Q-residue (Table 1). The twelve invariant repeats had unin-BMC Genomics 2007, 8:126 http://www.biomedcentral.com/1471-2164/8/126Table 1: Q-tract length variation in genes containing polyglutamine-encoding CAG-type trinucleotide repeats, sorted by Q-tractChromosomeBandGene Namea Repeat Sequence from Reference Genome (sense strand)bExpected Q-tract Length from Reference GenomecNd ObservedQ-tractLengthMin-MaxQ-tractMeanQ-tractVariance17p13.2 MINK1* G4N1G5 Q4LQ5 (SwP) 162 5 – 5 5.0 09q34.11 CIZ1 G6 Q6 154 6 – 6 6.0 07q36.2 PAXIP1L* G7 Q7 168 7 – 7 7.0 011q24.3 PRDM10 G8 Q8 172 8 – 8 8.0 04q31.1 MAML3a* G9 Q9 156 8 – 8 8.0 06p21.1 TFEB G6A1G3 Q10 162 10 – 10 10.0 019p13.11 CHERP G6A1G5 Q12 192 12 – 12 12.0 012q21.2 PHLDA1 G5A1G6A2G1 Q15 212 14 – 14 14.0 016p13.3 CREBBP G4A1G3A2G2A1G4A1 Q18 158 18 – 18 18.0 04q31.1 MAML3b* G3A1G3A1G1A1G8 Q18 166 18 – 18 18.0 020q11.22 NCOA6* G4A4G8A2G1A1G2A2G1 Q25 166 25 – 25 25.0 0Xq13.1 MED12* G5A1G2A1G1A1G5A1G1A1G7N4G6 Q26X4Q6 205 26 – 27 26.0 020q13.12 PRKCBP1 G7A1 Q8 152 8 – 9 8.0 0.0115q24.1 ARID3B G8A2G1 Q11 212 11 – 12 11.0 0.0122q11.21 PCQAPa G4A1G3N1G5N3G7A1G3N8G3N5G5N1G8 Q8FQ5X3Q11X16Q5LQ8152 11 – 12 11.0 0.013p24.3 SATB1 G1A1G3A1G1A1G7 Q15 174 15 – 16 15.0 0.016q16.2 POU3F2 G3A1G1A1G3A1G2A1G6A1G1 Q21 148 21 – 22 21.0 0.01Xq22.3 FRMPD3 G4A3G4A3G3A3G7 Q27 (SwP) 184 26 – 27 27.0 0.012q35 TNS G9 Q9 178 9 – 11 9.0 0.0219p13.12 BRD4 G5N1G1N1A1G4A1G1A1 Q5RQEQ8 140 8 – 9 8.0 0.0312p13.31 PHC1 G5A2G1A1G2A1G3 Q15 170 13 – 15 15.0 0.059q32 C9orf43 G6A1G1 Q8 168 8 – 9 8.1 0.071q21.3 TNRC4 A1G8A1G4A1 Q15 150 15 – 
18 15.0 0.0817q12 SOCS7 G7A1 Q8 (SwP) 134 8 – 9 8.1 0.121p31.1 ST6GALNAC5 G7A1G4 Q12 150 12 – 14 12.1 0.1315q26.1 POLG G10A1G2 Q13 164 13 – 15 13.1 0.1622q13.1 TNRC6B G8 Q8 166 7 – 8 7.8 0.1712q13.12 MLL2* G5N1A1G1A1G1A1N1G7N1A1G1A1G1A1N1 G2A1G1N1A1G2A1G4N1A2G3A1G1N1A1G2 A1G2N1A1G1A1G1A3G3N1A1G3A1G3Q5LQ5LQ7LQ5LQ4LQ8LQ7 LQ6LQ10FQ8184 8 – 11 10.2 0.217p14.1 POU6F2 G10 Q10 168 6 – 11 10.0 0.22Xq28 CXorf6 G1A1G8A1N92G5A1G4 Q11X92Q10 168 11 – 12 11.6 0.2512p13.33 DCP1B G9A1 Q10 136 10 – 12 10.5 0.2617q23.2 VEZF1 G12A6 Q13 (through intron)176 8 – 15 13.1 0.2922q11.21 PCQAPb G3A1G2N9A2G1A1G12 Q6X9Q16 152 12 – 18 16.1 0.343p14.1 MAGI1 G5A1G3A1G10 Q20 168 16 – 21 20.3 0.364q21.21 BMP2K G8A1G1A1G4A1G1A1G9 Q27 148 23 – 28 26.9 0.3616q22.1 NFAT5* G5A1G3A1G3A3G1 Q17 168 11 – 19 17.0 0.3712p13.31 ZNF384 G14A1G1 Q16 214 11 – 20 15.2 0.4722q12.1 MN1* A1G9A1G6A1G1A1G1A1G6 Q28 180 26 – 30 28.7 0.5312q24.33 EP400 G6A2G14A1G4A1G1 Q29 158 28 – 31 28.8 0.5312q23.2 ASCL1 G12 Q12 148 9 – 15 12.3 0.656q25.3 ARID1B G7A1G7A1G1A1 Q18 152 16 – 23 17.7 0.6911q21 MAML2 G1A1G2A1G13A1G5A1G1A1G1A1G1A1G2N 5A2G1A1G3N5A1G5A2G5A3G1A2G6A2Q31X5Q7X5Q27 (through intron)168 27 – 31 28.3 0.7512q24.12 ATXN2 G13A1G9 Q23 124 17 – 27 22.2 0.799p24.3 SMARCA2 G1A2G3A1G13A1G2 Q23 130 18 – 24 22.7 0.7920q13.12 NCOA3 G6A1G9A1G1A1G1A1G1A1G2A1G2A1 Q29 150 26 – 31 28.4 0.8017p11.2 RAI1 G13A1 Q14 184 11 – 17 14.6 0.847q31.1 FOXP2* G4A1G4A2G2A2G3A5G2A2G5A1G5A1G1 Q40 100 34 – 40 39.8 0.853p14.1 ATXN7 G10 Q10 184 7 – 14 10.4 0.8919q13.2 NUMBL G6A1G1A1G7A1G2A1 Q20 156 18 – 20 18.7 0.93Page 5 of 18(page number not for citation purposes)terrupted CAG-tracts from four to nine repeat units longbut had mean Q-tract lengths evenly distributed from fiveto 26 residues (Table 1). Thus, a lack of polymorphismwas restricted to relatively short pure CAG-tracts but theirQ-tract lengths varied widely. 
This again emphasizes theutility of using pure CAG-tract length rather than Q-tractlength in assessments of length polymorphism.Each CAGpolyQ repeat has a unique distribution of Q-tract lengthsThe two allele frequency distributions of Q-tract lengthsin Figure 2 provide examples of the 64 CAGpolyQ repeatswe analyzed. ATXN3 had a unique bi- or tri-modal distri-bution that is virtually identical to published data [26].RAI1, a priority candidate disease gene with a long CAG-tract and relatively high Q-tract variance, had a simplerdistribution that is consistent with the published Q-tractlength range [62]. The 64 plots of allele frequency distri-butions of Q-tract lengths for each CAGpolyQ repeat illus-trate clearly that there is no single pattern that is typical ofQ-tract length distributions across the human genome(Additional file 2).Functional classification of CAGpolyQ-containing genesBrowsing descriptions associated with the 64 CAGpolyQgenes suggested an over-representation of genes involvedin transcriptional processes and genes involved in chro-matin architecture, and thyroid hormone receptor bind-based classification of these genes to determine whetherspecific functional categories are statistically overrepre-sented, to visualize the network of functional relation-ships among CAGpolyQ-containing genes, and todetermine whether priority candidates for polyglutamineexpansion are associated with one or more specific GOterms.GO over-representation analysisWe used GoMiner [67] to look for statistical over-repre-sentation of CAGpolyQ genes in GO terms in the top fourlevels of the three GO categories: biological process,molecular function, and cellular component. GO termdescriptions can be viewed at the Gene Ontology website[68]. GoMiner contained gene name-GO term annota-tions for 56 of our 64 genes against a background of13,598 HGNC genes. 
Genes without GO term assign-ments at the time of this analysis were: C14orf4, C9orf43,CXorf6, DENND4B, FRMPD3, KIAA2018, TNRC15 andTNS. Our null hypothesis was that the genes of interestwould be distributed among the chosen GO terms in thesame proportions as the background set. GO terms withp-values below the significance threshold (p = 0.05) wereconsidered to be over-represented among CAGpolyQgenes. In negative control experiments (see Methods) wefound no over-representation in GO terms under molecu-lar function in 100 replicates. Under biological process,BMC Genomics 2007, 8:126 http://www.biomedcentral.com/1471-2164/8/12612q24.31 NCOR2 G3A2G12 Q16 (through intron)172 13 – 20 16.9 0.9515q26.3 MEF2A G11 Q11 174 8 – 16 10.2 1.1314q24.3 C14orf4 A1G1A1G1A1G6A1G10A2G1 Q25 (through intron)150 20 – 31 23.4 1.173q13.2 KIAA2018 G11A1G1A4 Q14 (through intron)150 11 – 16 12.6 1.441q21.3 DENND4B A1G5A1G9 Q16 156 13 – 17 15.2 2.046p22.3 ATXN1 G12T1G1T1G14 Q12HQHQ14 130 11 – 21 14.6 2.236q27 TBP G3A3G8A1G1A1G19A1G1 Q38 158 30 – 41 36.9 2.2619p13.3 CACNA1A G13 Q13 112 7 – 16 12.1 2.4216p12.1 TNRC6A G4A1G3 Q8 166 4 – 8 7.2 2.506p21.1 RUNX2 A1G3A1G4A1G6A1G6 Q23 100 18 – 30 22.5 3.0416q22.1 THAP11 G3A1G5A1G2A1G5A1G10 Q29 170 18 – 30 28.5 3.121q22 KCNN3 G7A1G4N25G14 Q12X25Q14 170 15 – 25 20.3 3.984p16.3 HDe G19A1G1 Q21 252 9 – 33 17.2 7.18Xq12 AR G22A1N5G6 Q23X5Q6 180 14 – 33 23.7 9.3412p13.31 ATN1 G1A1G1A1G15 Q19 168 11 – 27 17.6 11.614q32.12 ATXN3 G2A1N1G1A1G8 Q3KQ10 168 10 – 27 17.8 29.22q37.1 TNRC15 G6 Q6 n.d. n.d. n.d. n.d.aBoldface   text marks a gene known to cause disease by expansion of a   polyglutamine-encoding CAG trinucleotide repeat. 'a' and 'b' after MAML3   and PCQAP denote two targets within these genes. Genes marked with an   asterisk (*) contain an additional repeat target that was not screened   in this study.bG denotes   "CAG", A denotes "CAA" and N denotes a non-glutamine codon, each   followed by the number of tandem repeats of that codon. 
Boldface text marks the longest uninterrupted CAG-tract.
c X indicates a non-glutamine amino acid; SwP indicates peptide sequence obtained from SwissProt record.
d N denotes number of alleles screened.
e Data for N, Observed Q-tract Length Min-Max, Q-tract Mean, Q-tract Variance, taken from Andres et al. [26].
Table 1: Q-tract length variation in genes containing polyglutamine-encoding CAG-type trinucleotide repeats, sorted by Q-tract

BMC Genomics 2007, 8:126 http://www.biomedcentral.com/1471-2164/8/126

Figure 1. Relationship between length of longest uninterrupted CAG-tract and Q-tract length variance. (A) All targets. HD Q-tract length variance from Andres et al. [26]. Correlation = 0.62, not including ATXN3. (B) Higher resolution view of targets with Q-tract length variance < 4.0. Dashed lines at 10 CAG and 0.79 variance represent the cut-off for identifying candidate genes for polyglutamine expansion disorders. See text for list of genes falling in this area.
[Figure 1 plots: x-axis, CAG Length (reference genome); y-axis, Q-tract Length Variance; labeled points include TBP, ATXN1, CACNA1A, ATXN7, ATXN2, ATXN3, ATN1, AR and HD.]

Figure 2. Example distributions of normal Q-tract lengths. (A) ATXN3, ataxin 3 (B) RAI1, retinoic acid receptor 1.
[Figure 2 plots: x-axis, Q-tract Length; y-axis, Allele Frequency.]

ing. We assessed these and other observations using GO-

three out of 100 replicates each had one over-represented GO term. Under cellular component, one out of 100 replicates had one over-represented GO term and one out of 100 replicates had two over-represented GO terms.

Over-representation analysis confirmed these 56 CAGpolyQ genes' functional association with transcription and revealed some specific details. There were six significant GO terms under molecular function (Table 2). These included 13.4-fold over-representation of transcription coactivator activity, which is a child term of the 8.8-fold over-represented transcription cofactor activity. CAGpolyQ transcriptional coactivators on our gene list include: ARID1B, CREBBP, MAML2, MAML3, MED12, MEF2A, NCOA3, NCOA6, and SMARCA2. Transcription factor binding was 8.3-fold over-represented, including the transcription coactivator genes above, as well as HD, NCOR2 and TBP. Half of the 56 genes bind DNA. There were five significant GO terms under biological process (Table 2), with the most specific, positive regulation of metabolism, 6.5-fold over-represented (MAML2, CREBBP, RUNX2, ARID1B, NCOA6, NFAT5, and MAML3). There were seven significant GO terms under cellular component (Table 2), with nucleoplasm 4.1-fold over-represented. Genes in over-represented GO categories are listed in Additional file 3 (Biological Process), Additional file 4 (Molecular Function) and Additional file 5 (Cellular Component).

Shared GO-term analysis

To delve deeper into the possible functional relationships among genes containing CAGpolyQ repeats, we developed a method for quantitative comparison of GO terms annotated to each gene product, based on the structure of the GO graph (AMM, SLB, BFFO, manuscript in preparation). Briefly, given a pair of genes, their GO term annotations, and a comparison scoring function for GO terms, we calculated similarity scores for every pair of GO terms for that pair of genes. GO term pairs scoring above a threshold were used to construct a graph where each node represents a gene and weighted edges between nodes represent pairs of GO term annotations and their scores. Genes were grouped by a simple visual clustering algorithm that assigns shorter lengths to edges with higher weights (i.e. more similar shared GO terms). Because a gene may have multiple shared GO terms with other genes, this method allowed us to cluster the functions of genes that share terms on different branches and at different levels of the gene ontology. Related functions go unnoticed without this clustering.

Only seven gene pairs scored above the cutoff (estimated 99th percentile; described in Methods) for the cellular component category (Additional file 6) so we did not consider this category further. There were 544 gene pairs with scores above the cutoff in the biological process category, representing 45 genes. There were 503 pairs among 42 genes in the molecular function category. The functional relationships among these CAGpolyQ genes are illustrated in Figure 3. GO terms and the genes that share them are listed in Additional file 7 (Biological Process), Additional file 8 (Molecular Function) and Additional file 6 (Cellular Component).

Based on our analysis of relationships among GO terms shared by two or more genes, CAGpolyQ genes in the human genome clustered primarily under two major biological processes: DNA dependent regulation of transcription, and neurogenesis (Figure 3A).
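As a rough illustration of the thresholding step described above, the sketch below keeps gene pairs whose shared GO-term score exceeds a cutoff and counts the pairs and genes that remain. The scores, the cutoff value and the function name are hypothetical placeholders for illustration; they are not values or code from this study, whose similarity measure is described only as a manuscript in preparation.

```python
from collections import defaultdict


def build_shared_term_graph(pair_scores, cutoff):
    """Return an adjacency map holding only gene pairs that score above `cutoff`."""
    graph = defaultdict(dict)
    for (gene_a, gene_b), score in pair_scores.items():
        if score > cutoff:
            # Undirected weighted edge: store it in both directions.
            graph[gene_a][gene_b] = score
            graph[gene_b][gene_a] = score
    return dict(graph)


# Hypothetical similarity scores, for illustration only.
pair_scores = {
    ("CREBBP", "NCOA3"): 0.92,
    ("CREBBP", "TBP"): 0.81,
    ("NCOA3", "TBP"): 0.40,
    ("KCNN3", "CACNA1A"): 0.88,
    ("POLG", "TBP"): 0.10,
}

graph = build_shared_term_graph(pair_scores, cutoff=0.75)
n_pairs = sum(len(neighbours) for neighbours in graph.values()) // 2
n_genes = len(graph)
print(n_pairs, "pairs among", n_genes, "genes")
```

Genes with no pair above the cutoff simply do not appear in the graph, which mirrors how genes are reported as "not represented" in a figure panel.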
Other processes included establishment and/or maintenance of chromatin architecture and post-translational modifications. Since there were few functional clusters, it was not surprising that all but one known disease gene and most priority candidate genes were involved in DNA dependent regulation of transcription and in neurogenesis (Figure 3A). ATXN7, the one disease gene excluded from the cluster involved in DNA dependent regulation of transcription, was recently shown to be an integral component of the TFTC (TATA-binding protein-free TAF-containing) complex and the STAGA (SPT3/TAF9/GCN5 acetyltransferase) complex involved in transcriptional regulation [52-54].

Consistent with their predominant classification in DNA dependent regulation of transcription, DNA binding was the primary shared molecular function among these 64 genes (Figure 3B). Known disease genes were involved in DNA, calcium and zinc binding and HD was classified as having transcription corepressor activity (Figure 3B). All but one priority candidate gene had DNA binding activity according to current GO annotations. CAGpolyQ genes with invariant Q-tract lengths were not limited to any one biological process or molecular function.

Table 2: Functional classification of CAGpolyQ genes: Gene Ontology over-representation analysis.

Gene Ontology term (levels)                                          GO ID       Candidate genes in GO term  Fold enrichment*
Biological Process
regulation of biological process (1)                                 GO:0050789  37                          2.3
regulation of physiological process (2)                              GO:0050791  36                          2.5
regulation of metabolism (3)                                         GO:0019222  29                          3.0
positive regulation of metabolism (4)                                GO:0009893  7                           6.5
nucleobase, nucleoside, nucleotide and nucleic acid metabolism (4)   GO:0006139  34                          2.5
Molecular Function
transcription regulator activity (1)                                 GO:0030528  24                          4.0
transcription cofactor activity (2,4)                                GO:0003712  11                          8.8
transcription coactivator activity (3,5)                             GO:0003713  9                           13.4
nucleic acid binding (2)                                             GO:0003676  35                          2.8
DNA binding (3)                                                      GO:0003677  28                          3.1
transcription factor binding (3)                                     GO:0008134  12                          8.3
Cellular Component
organelle (1)                                                        GO:0043226  43                          1.7
membrane-bound organelle (2)                                         GO:0043227  43                          1.9
intracellular (2)                                                    GO:0005622  47                          1.5
intracellular organelle (2,3)                                        GO:0043229  43                          1.7
intracellular membrane-bound organelle (3,4)                         GO:0043231  43                          1.9
nucleus (3,4,5)                                                      GO:0005634  41                          2.7
nucleoplasm (4,5,6)                                                  GO:0005654  11                          4.1

All levels for each GO term are indicated, with boldface indicating one path through the GO.
*p < 0.00004 for all GO terms listed except nucleoplasm, p = 0.0001.

Figure 3. Functional classification of CAGpolyQ genes: shared Gene Ontology term analysis. Known disease genes are marked with a 'D', candidate disease genes are marked with a 'C' and genes with invariant Q-tracts (Table 1) are marked with an 'I'. Clusters of genes are labeled with the GO terms that best described each cluster. GO terms shared by gene pairs are listed in Additional file 7 and Additional file 8. Genes not represented in a graph either had no annotation under that GO namespace or did not share a GO term with a score above the 99th percentile. (A) Biological process. Genes not represented: ARID1B, ATXN1, ATXN2, BRD4, C9ORF43, DCP1B, HD, DENND4B, FRMPD3, MAML2, PAXIP1L, PHC1, PHLDA1, SOCS7, THAP11, TNRC15, TNRC6A, TNRC6B and TNS. (B) Molecular function. Genes not represented: ATN1, ATXN1, ATXN3, BRD4, C14ORF4, C9ORF43, CHERP, DCP1B, KCNN3, DENND4B, FRMPD3, MAML2, NUMBL, PAXIP1L, PCQAP, PHLDA1, SOCS7, ST6GALNAC5, TNRC15, TNRC6A, TNRC6B and TNS.
[Figure 3, panel A (Biological Process), cluster labels as printed: GO:0006486 protein amino acid phosphorylation; GO:0006468 protein amino acid glycosylation; GO:0000004 biological process unknown; GO:0006350 establishment and/or maintenance of chromatin architecture; GO:0006461 protein complex assembly; GO:0007399 neurogenesis; GO:0006355 regulation of transcription, DNA-dependent; GO:0000074 regulation of cell cycle; GO:0000398 nuclear mRNA splicing, via spliceosome; GO:0006260 DNA replication; GO:0006281 DNA repair; GO:0007268 synaptic transmission; GO:0007417 central nervous system development; GO:0007601 visual perception.]
[Figure 3, panel B (Molecular Function), cluster labels as printed: GO:0005509 calcium ion binding; GO:0005554 molecular function unknown; GO:0003714 transcription corepressor activity; GO:0005524 ATP binding; GO:0008270 zinc ion binding; GO:0016563 transcriptional activator activity; GO:0030374 ligand-dependent nuclear receptor transcription coactivator activity; GO:0046966 thyroid hormone receptor binding; GO:0004674 protein serine/threonine kinase activity; GO:0003713 transcription coactivator activity; GO:0003677 DNA binding.]

Discussion

Our findings build on previous work indicating that uninterrupted CAG-tract length, not the Q-tract length encoded by CAG plus CAA codons, influences the degree of polymorphism of a Q-tract. Uninterrupted CAG-tract length and Q-tract length variance are the most useful parameters in characterizing known disease genes and identifying candidate genes for expansion disorders. At one extreme, zero variance CAGpolyQ repeats, those that do not tolerate changes in Q-tract length, can likely be excluded as candidates for polyglutamine expansion disorders. The shapes of Q-tract length distributions differed widely between various loci across the genome. Thus, the data presented here for allele length distributions for 64 Q-tracts in 62 genes, with detailed conditions for their screening, will be invaluable for identifying putative expansion mutations in candidate genes not yet associated with CAGpolyQ-type neurodegenerative disorders. All nine known polyglutamine expansion disorder genes are involved in DNA-dependent regulation of transcription or in neurogenesis, as are all of the well-characterized priority candidate genes identified in this study.

Many groups have published lists of CAGpolyQ-containing genes identified using classical [15,17-20] or computational methods [21-24]. The content of each computationally-derived list differs slightly depending on the repeat detection algorithms and gene data sets used but they are largely the same. Tandem Repeat Finder, used in this study under default parameters, is not guaranteed to find all CAGpolyQ repeats, but it is likely that the vast majority of long repeats were found. Our approach is validated by its detection of all nine genes known to cause diseases by expansion of CAGpolyQ repeats. This study of the normal levels of polymorphism of human CAGpolyQ repeats is the most exhaustive conducted to date.

Our allele frequency distributions match those published for known disease genes AR [69], ATN1 [2,3,26,70], ATXN1 [26], ATXN2 [26], ATXN3 [26,70], ATXN7 [70,71], CACNA1A [26,70], and TBP [70,72]. The same is true for CAGpolyQ repeats in other genes whose Q-tract lengths have been found to be invariant like CREBBP [25] and MED12 [19], moderately polymorphic FOXP2 [73], NCOA3 [25,26,74], POLG [75], RAI1 [76], SMARCA2 [28] or highly polymorphic THAP11 [28] and KCNN3 [26,77]. Differences in apparent repeat lengths between this study and published data for ATXN1 [26,70] and ATXN3 [26,70] exist because we report repeat lengths based on the longest pure Q-tract while Andres et al. [26] and Juvonen et al.
[70] report "repeat lengths" that contain non-glutamine amino acids. For ATN1, the shape of our distribution matches published data but our distribution is shifted upward by two to four glutamine residues.

Among our eight priority candidate genes some features are already known. CAG length variation in RAI1 is responsible for 4.1% of age of onset variability in SCA2 [76]. Huang et al. [42] identified RAI1 (called RAI2 in that paper) and NCOA3 as candidate disease genes by virtue of their long CAG tracts and the fact that their mouse and rat orthologues had Q-tracts less than half the size of the human repeats. In our study, NCOA3 lay just below the threshold for priority candidate disease genes, with nine CAG while priority candidates had ten CAG. KCNN3 CAG-tract length differences have been associated with anorexia [78] and with schizophrenia and bipolar disorder but these associations are controversial [79]. SMARCA2 and THAP11 were previously identified as candidates by Pandey [28] based on their relatively long uninterrupted CAG-tracts. Four genes identified by Huang et al. [42] as candidate genes of interest fell far below our threshold of Q-tract variance so we do not consider them to be priority expansion disease candidates. These were DCP1B, MAML3 (called TNRC3 in Huang et al.), POLG (called NFYC in Huang et al.) and POU6F2 (called RPF-1 in Huang et al.).

Q-tract lengths for many genes do not have a normal distribution and differ widely between loci, as previously observed [27,36]. Even different disease genes have very different Q-tract length distribution shapes, with different minima and maxima in normal populations and different minimum disease allele lengths, so it is critical to characterize each distribution without making generalizations between loci. A gene containing more than one CAGpolyQ repeat can have two invariant repeats (MAML3) or a combination of invariant and variant repeats (PCQAP). Orthologous repeats in human and mouse genomes can have very different levels of polymorphism: human VEZF1 has a polymorphic Q-tract (this study) while the corresponding Q-tract in its mouse orthologue is invariant [80].

Long pure repeats expand

Alba and colleagues [29,30,81] have clearly shown that, with respect to evolutionary processes, there are two classes of Q-tracts in human proteins: those whose lengths are conserved between human and mouse orthologues, and those whose lengths differ. Length-conserved polyQ repeats tend to be encoded by mixtures of CAG and CAA codons and are likely to be restricted in length by purifying selection. PolyQ repeats whose lengths vary between human and mouse tend to be encoded by longer pure CAG-tracts that evolve nearly neutrally [29,30,81]. Our data on Q-tract polymorphism within a normal human population corroborate their between-species data and build on previous work, with longer pure CAG-tracts having higher Q-tract length variance and invariant CAGpolyQ repeats having relatively short pure CAG-tracts [40,41]. Again, the extremes reinforce the rules; FOXP2 with a short 5-CAG repeat has the longest mean Q-tract length of all candidate genes but a low level of polymorphism.

Correlation of uninterrupted CAG length with Q-tract length variance is consistent with work on dinucleotide repeats [82] and on tetra- and penta-nucleotide repeats [83]. For all of these, the level of polymorphism increases with the number of pure repeats, and non-polymorphic repeats have the shortest pure repeat tracts. Similarly, in the HD gene, as CAG repeat number increases, there is a significant increase in the frequency of expansion mutations and the mean number of repeats added per expansion [34].

Pure CAG length is not the only factor determining repeat instability. An in-frame interruption in a CAG-tract has a stabilizing influence over and above that of reducing the pure CAG-tract length. In yeast, dinucleotide repeats with a single dinucleotide interruption in the middle of the tract are five times more stable than a pure repeat of the same length [84]. SCA1 disease alleles of the ATXN1 gene all contain uninterrupted tracts of CAG repeats while virtually all normal alleles have one to three CAT (coding for histidine) interruptions in the middle of the Q-tract [39]. Other factors underlying repeat instability include different repair mechanisms [32], flanking sequence elements [85,86], CpG methylation, and nucleosome and replication origin positioning [86-88].

Rozanska et al. [89] recently published a large study that complements our results, analyzing repeat lengths and interruption patterns in a normal Polish population. They determined that the length of uninterrupted repeat tract in the most frequent allele for a locus is correlated with the degree of length polymorphism for that tract, and provide further evidence for a stabilizing effect of repeat interruptions.
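The distinction between a pure CAG-tract and an interrupted or mixed-codon tract can be made concrete with a short sketch. Both coding sequences below are hypothetical constructions chosen for illustration (the first mimics a tract with two CAT interruptions, the second a glutamine tract containing one CAA codon); they are not sequences taken from this study, and the helper names are assumptions.

```python
def split_codons(seq):
    """Split an in-frame nucleotide string into codons, ignoring any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def longest_run(codons, allowed):
    """Length of the longest run of consecutive codons that all belong to `allowed`."""
    best = current = 0
    for codon in codons:
        current = current + 1 if codon in allowed else 0
        best = max(best, current)
    return best


# Hypothetical interrupted tract: 12 CAG, CAT, CAG, CAT, 14 CAG.
interrupted = split_codons("CAG" * 12 + "CAT" + "CAG" + "CAT" + "CAG" * 14)
# Hypothetical mixed tract: CAA also encodes glutamine but breaks the pure CAG run.
mixed = split_codons("CAG" * 5 + "CAA" + "CAG" * 3)

for name, codons in (("interrupted", interrupted), ("mixed", mixed)):
    pure_cag = longest_run(codons, {"CAG"})        # longest uninterrupted CAG-tract
    q_tract = longest_run(codons, {"CAG", "CAA"})  # longest uninterrupted Q-tract
    print(name, len(codons), pure_cag, q_tract)
```

In the first case the histidine codons shorten both the pure CAG-tract and the Q-tract; in the second the Q-tract stays intact while the pure CAG-tract is shortened.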
Trinucleotide repeat expansion disease genes were found to have a higher proportion of long repeat alleles than those not associated with disease [89].

Inferences about repeat lengths and disease prevalence

Lack of detailed reporting of repeat sequence lengths in disease genes, such as Q-tract lengths in ATXN1 and ATXN3, is a potential source of confusion in the literature and highlights the difficulties in comparing Q-tract length distributions for the same genes from different publications. The amino acid sequence of the most common normal ATXN1 repeat tract is Q12H1Q1H1Q14 [37] but it is frequently reported as 29 "repeats" and the ATXN3 repeat tract, Q3K1Q10, is reported as 14 "repeats" [26]. Non-glutamine interruptions in a Q-tract are critical to phenotype, so it is misleading to report these as "Q repeats" or "CAG repeats". For this reason, we reported all target Q-tract lengths based on the longest uninterrupted Q-tract (encoded by CAG/CAA) in the reference genome (Table 1, Additional file 2).

Measuring Q-tract lengths in affected individuals enables identification of putative repeat expansions outside the normal range, but more in-depth characterization requires precise determination of the underlying amino acid and nucleotide sequences of individual alleles. Characterization of each allele at the nucleotide sequence level, in addition to the normal (wild-type) Q-tract length distribution, will be critical in better identifying candidate CAGpolyQ genes not yet associated with disease, determining which alleles at a given locus are prone to expansion, and, for disease genes, characterizing allele repeat sequences with respect to disease prevalence in a given population [33].

As has been expertly laid out by Sobczak and Krzyzosiak [37], repeat interruption patterns in a given target can differ between populations, even when Q-tract length distributions are similar. Repeat interruption characteristics are not commonly studied, but reporting overall repeat lengths in the absence of repeat interruption patterns may be quite misleading in studies of allele lengths as they relate to disease prevalence in a given population [37,70,90]. Juvonen and colleagues [70] recently reported that the frequencies of large normal alleles at SCA loci were poor predictors of the prevalence of the respective diseases in Finland but Q-tract lengths were assayed without reporting CAG-tract interruption patterns in different alleles. A different picture might be revealed by characterization of repeat interruption patterns at each SCA locus in that population.

The genotype-phenotype connection

Q-tract length variance is influenced both by specific sequence characteristics and by the specific role of the Q-tract within a protein's structure and function. AR provides an excellent example of this balance. The AR Q-tract in the reference genome has a very long pure CAG-tract of 22 CAGs, consistent with its high length variance. The CAGpolyQ tract in the AR protein lies in its N-terminal transactivation domain which interacts with the C-terminal ligand binding domain (the N/C interaction). Buchanan et al. [69] found no changes in in vitro N/C interaction for Q-tract lengths of 16 to 29 but shorter or longer tracts resulted in a significant decrease in N/C interaction. Over 90% of normal alleles fall within the Q16-Q29 range both in this study and in Buchanan's re-examination of published data [69]. Q-tracts in AR equal to or longer than 38 glutamines cause the polyglutamine expansion disorder spinal and bulbar muscular atrophy while short Q-tracts are associated with increased risk of prostate cancer [69]. In other genes, Q-tracts with no length variation suggest the presence of strong purifying selection in which a precise Q-tract length is required to maintain a protein's structure or its biomolecular interactions, and its function. Therefore, a length change in a non-variant Q-tract is presumed to be lethal.

CAGpolyQ Gene Functions

Based on GO over-representation and shared-term analysis we find that CAGpolyQ genes are involved, in general, in two major biological processes, DNA dependent regulation of transcription and neurogenesis, and are enriched for transcriptional coactivator and transcription factor binding functions. Subgroups of genes such as known polyglutamine expansion disease genes, priority candidates, or genes containing invariant Q-tracts are not obviously distinguished by association with a particular process or molecular function. Polyglutamine-containing proteins in organisms from yeast to humans have been previously noted to be involved in transcriptional regulation [44-48]. In fact, most eukaryotic repeat-containing proteins are involved in transcription or translation or interact directly with DNA, RNA or chromatin, irrespective of the amino acid repeat type [48]. The majority of repeat-containing proteins perform roles in processes that require the assembly of large multiprotein or protein/nucleic acid complexes [48]. Expanded Q-tracts in HD and ATN1 gene products interfere with CREBBP-activated gene transcription via interaction of their Q-rich domains [91,92] and mutant HD targets specific components of the core transcriptional machinery, in a Q-tract length-sensitive manner, to disrupt gene expression in cultured HD cells [55].
We anticipate that continual incorporation into the GO of newly published information about the normal functions of polyglutamine expansion disorder genes will reveal more specific shared functions among them.

Conclusion

We have characterized the levels of Q-tract length polymorphism in 64 CAGpolyQ repeat tracts in a normal human population, and found a strong positive correlation between uninterrupted CAG-tract length and Q-tract length variance. The best predictors of known disease genes were the occurrence of a long uninterrupted CAG-tract in the reference genome sequence and high Q-tract length variance in the normal population. Using these criteria we identified eight priority candidate genes for polyglutamine expansion disorders based on the presence of pure CAG-tracts longer and Q-tract variances higher than the smallest values in known disease genes. Twelve invariant Q-tracts (in eleven genes) are unlikely to be candidates for polyglutamine expansion disorders. Each CAGpolyQ repeat, including those in known disease genes, has a unique distribution of Q-tract lengths, emphasizing the need to characterize each distribution without making generalizations between loci. This publication makes freely available for the first time the length distributions of virtually all of the CAGpolyQ repeats in the human genome. Using these normal repeat distributions against which pathogenic expansions can be identified, we have begun screening for mutations in individuals clinically diagnosed with SCA or Huntington disease-like disorders who do not have identified mutations within known disease genes.

Methods

Selection of candidate genes

Candidate genes were identified on the basis of having a CAG-type simple repeat within the boundaries of a known gene with five or more tandem glutamine residues in the peptide sequence of that gene. To accomplish this, the Simple Repeats table (simpleRepeat.txt.gz) was downloaded from the UCSC genome annotation database [59] for build 33 (April 2003) of the human genome sequence assembly [58] and uploaded into a local MySQL database. The Simple Repeats table contained chromosomal location coordinates of all repeats detected by Tandem Repeat Finder (TRF) software [93] using default parameters. Locations of all the CAG-type repeats in this table were exported to a file using an SQL query to extract all records with the sequences 'CAG', 'AGC', 'CGA', 'CTG', 'GCT' and 'TCG' to accommodate all six potential reading frames of the repeat as they might appear in genomic sequence. This file was used as input to a Perl script that used the Ensembl Perl API [60] version 15_33 to extract all known genes (Ensembl-predicted transcripts that map to species-specific SwissProt, RefSeq or TrEMBL database entries) whose chromosomal coordinates overlapped with the repeat coordinates. For each known gene with a CAG-type repeat, if the Ensembl peptide sequence contained five or more glutamine residues in tandem, that gene was considered a candidate. A minimum glutamine repeat length of five was used since Karlin [94] determined that for a "typical" protein of 400 residues and average composition, a run of an individual amino acid is statistically significant if it is five or more residues long.

The candidate gene list was generated from Build 33 of the human genome sequence assembly (April 2003), and the nucleotide/amino acid sequences of each glutamine tract reported in Table 1 were generated from Build 35 (May 2004). Two new candidate genes were identified in the later build (Ensembl known genes data set version 30_35c) that were not part of our study: MKL1 and C14orf43, and additional CAGpolyQ repeats were detected in nine of our existing candidate genes: FOXP2, MAML3, MED12, MINK1, MLL2, MN1, NCOA6, NFAT5, PAXIP1L. These targets have been denoted by an asterisk in Table 1. Chromosome band was obtained from the UCSC Chromosome band track [95] and may differ slightly from a gene's location listed by the HGNC Database, Genew [14]. Gene names listed are official HGNC gene symbols from the HGNC website [96] (accessed March 13, 2007).

DNA samples

Control DNA samples (extracted from blood) were from a population of mixed ethnic background with individuals of Western European descent most highly represented (Additional file 1). Forty-eight of these were from the Coriell Cell Repository [97].

PCR primers and amplification of candidate repeats

Additional file 9 lists primer sequences, annealing temperatures, specific PCR conditions and expected fragment size (from the reference genome) for each repeat target. PCR primers for candidate repeat amplification were designed using Primer3 [98]. Forward primers were 5'-labeled with 5-HEX, 6-FAM or TAMRA fluorescent dyes (Operon) and reverse primers all had a 5'-GTTT "PIG-tail" [99]. PCR amplification was performed with standard Taq polymerase (Invitrogen) or AccuPrime Taq polymerase (Invitrogen) in 96-well plates according to the conditions specified for each target in Additional file 9. PCR products were visualized and quantitated by comparing the signal intensity of a specific volume of PCR product against 4 μl of Low DNA Mass Ladder (Invitrogen) on an agarose gel. The accuracy of this quantitation method was validated against the PicoGreen® dsDNA Quantitation assay (Molecular Probes) [100].

ABI 3700 fragment analysis and GeneMapper band calling

PCR products for fragment sizing were assembled in 96-well microtiter plates at 0.5 ng/μl in each well, with up to six PCR products multiplexed per well according to their predicted allele sizes and fluorescent labels.
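Fragment sizes measured in this way are converted to Q-tract lengths with the relation given under Data management and analysis: observed Q-tract length = (observed fragment size - expected fragment size)/3 + expected Q-tract length. A minimal sketch of that arithmetic follows. The fragment sizes are hypothetical, the function name is an assumption, and the expected size is assumed to already include the 4-nucleotide primer tail.

```python
def q_tract_length_obs(fragment_obs, fragment_exp, q_tract_exp):
    """Observed Q-tract length from observed and expected PCR fragment sizes (nucleotides).

    `fragment_exp` and `q_tract_exp` refer to the reference genome; the expected
    fragment size is assumed to include the primer tail.
    """
    size_difference = fragment_obs - fragment_exp
    # Each glutamine codon gained or lost changes the fragment by 3 nucleotides.
    return size_difference / 3 + q_tract_exp


# Hypothetical target: 200-nt expected fragment carrying a 14-glutamine reference tract.
longer_allele = q_tract_length_obs(206, 200, 14)
shorter_allele = q_tract_length_obs(194, 200, 14)
print(longer_allele, shorter_allele)
```

A result that is not close to a whole number would suggest that the size difference is not explained by whole codons alone, which is why representative alleles were sequenced to confirm the correspondence.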
One microliter of the multiplexed PCR products was added to 9 μl of either 2% 400 HD [ROX] sizing standard (Applied Biosystems) or 2% 500 [ROX] sizing standard (Applied Biosystems) depending on the estimated sizes of products being analyzed. DNA fragments were separated by capillary electrophoresis using the ABI Prism 3700 DNA Analyzer (Applied Biosystems) with POP-6 polymer (Applied Biosystems). Sizing of the PCR fragments was accomplished using GeneMapper software (v.3.0, Applied Biosystems). Representative alleles from each locus were sequenced to determine the exact correspondence between fragment size and Q-tract length. In all cases (except TNRC15, for which we do not present data), fragment length polymorphism was entirely accounted for by changes in Q-tract length. At least one such sequenced allele was included on every run as a calibrator.

Data management and analysis

Repeat information, PCR conditions, sample information and analysis results were stored in a MySQL database called GeMSdb (Genomic Mutational Signature sequences database). Data was input into GeMSdb using Perl scripts and through a web interface built with PHP and Apache. Data analysis and graphics were done using PHP.

The Q-tract length of each allele was based on the difference between observed PCR fragment size from a DNA sample and expected PCR fragment size from the reference genome (plus 4 nucleotides from the primer tail). Expected fragment sizes and Q-tract lengths (reference genome Build 35) for every target are listed in Additional file 9. The expected Q-tract length (subscript Exp) below is that of the longest uninterrupted Q-tract in the target. For example, the expected length of the ATXN1 Q-tract (Q12H1Q1H1Q14) is 14 because the overall repeat region of 29 residues is interrupted by two non-glutamine amino acids.

$$\text{Q-tract length}_{\text{Obs}} = \frac{\text{Fragment size}_{\text{Obs}} - \text{Fragment size}_{\text{Exp}}}{3} + \text{Q-tract length}_{\text{Exp}}$$

Repeat purity was calculated as a normalized weighted measure, nWP, combining both the length of the longest uninterrupted CAG-tract (CAG-length) and the total Q-tract length (Q-length) of each repeat. Weighted purity (WP) for each repeat was normalized by dividing by the highest WP among loci, which was 21.04 for AR.

$$\text{nWP} = \frac{\text{CAG-length}}{\text{Q-length}} \times \frac{\text{CAG-length}}{21.04}$$

Statistical analysis

Because there was no a priori knowledge of the distribution of Q-tract lengths in each gene for the typical control population, we applied the statistics of tolerance levels to determine the number of control alleles that must be screened to distinguish a Q-tract length that occurs in the affected but not unaffected populations with a given level of confidence. Screening 130 control alleles provides us with 99% confidence that 95% of the population of interest lies between the minimum and maximum repeat lengths in our samples [101].

Gene expression

Candidate genes' expression in brain was determined according to either eVOC controlled vocabularies for gene expression data [63,64] queried through BioMart [65] or according to expression data at the GeneCards website [66] (accessed September 19, 2005).

Gene functional classification

Gene Ontology over-representation analysis

We used GoMiner [67] for GO over-representation analysis down to the fourth level in the ontology. The target and background gene sets were generated as follows. We downloaded 23,913 HGNC gene IDs on June 28, 2005 from the HGNC website [96]. All IDs ending in '~withdrawn' were removed to generate a list of 21,591 IDs used as the 'query gene file' for GoMiner. GoMiner matched 13,598 of these to GO terms. We conducted 100 negative control replicates of this experiment for the three GO categories, each replicate with 56 randomly selected genes out of the 13,598 background gene set. To correct for multiple testing we used a Bonferroni correction to adjust the threshold of significance appropriately. The raw threshold of significance was p = 0.05. Adjusted significance thresholds were: molecular function p = 0.00004; biological process p = 0.00005; cellular component p = 0.00009.
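The sample-size statement in the Statistical analysis section above (130 control alleles, 99% confidence, 95% coverage) can be checked against the standard distribution-free result for a tolerance interval bounded by the sample minimum and maximum. The sketch below is an assumption about the kind of calculation involved; the cited reference [101] is not reproduced here and may use a different formulation, and the function names are illustrative.

```python
def tolerance_confidence(n, coverage):
    """Confidence that [sample min, sample max] of n observations covers at least
    the fraction `coverage` of the population.

    Distribution-free, order-statistic result for a two-sided interval:
    1 - n * p**(n - 1) + (n - 1) * p**n, with p = coverage.
    """
    p = coverage
    return 1 - n * p ** (n - 1) + (n - 1) * p ** n


def min_sample_size(coverage, confidence):
    """Smallest n for which the min-max interval reaches the requested confidence."""
    n = 2
    while tolerance_confidence(n, coverage) < confidence:
        n += 1
    return n


print(tolerance_confidence(130, 0.95))
print(min_sample_size(0.95, 0.99))
```

Under these assumptions 130 is the smallest sample size that reaches 99% confidence for 95% coverage, and 129 falls just short.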
Graph-based shared Gene Ontology term analysis

For each pair of genes among our set of 64, the GO terms annotated to each gene were compared and we calculated a graph-based similarity measure (AMM, SLB, BFFO, manuscript in preparation) for all gene pairs. In order to determine significant scores and produce a meaningful subgraph, we bootstrapped an estimate of the score required to be above the 99th percentile for a set of genes of that size (64) from the background set. We randomly drew 1000 replicates from the set of 15,168 Entrez Gene human protein-coding genes and took the mean of the 99th percentile score for each GO namespace (biological process, molecular function and cellular component) as our cut-off value. Pairs of genes with shared GO terms scoring above the cut-off value were visualized using Cytoscape 2.1 [102] with the "organic" arrangement of nodes, which produced a natural set of clusters. The "organic" node arrangement treats edges as springs: the more edges among a group of nodes, the tighter they cluster. The pairwise similarity measure links GO terms via their lowest common ancestor term in the graph. These lowest common ancestor terms are output with each pair of GO terms that are scored, and can be considered as edge labels in the resulting graph. Clusters of genes joined by the same GO term edge labels were manually annotated with those GO terms.

Abbreviations

CAGpolyQ, polyglutamine-encoding CAG trinucleotide repeat; Q-tract, polyglutamine tract; HD, Huntington disease; SCA, spinocerebellar ataxia; ATN1, atrophin1; AR, androgen receptor; TBP, TATA-binding protein; ATXN, ataxin; GeMS, Genomic Mutational Signature; GO, Gene Ontology; HGNC, HUGO Gene Nomenclature Committee

Authors' contributions

SLB, RSD, BRL, BFFO and RAH conceived and designed the experiments.
SLB, RSD, CLM, AMM, SJN, SSL, AW,GSY and MMSY performed the experiments. SLB, YH, SJN,CLM and AMM analyzed the data. CLM and YH designedand managed the database. MRH and RAH contributedreagents/materials. SLB wrote the paper and all authorsread and approved the final manuscript.Additional materialAdditional file 1Ethnic composition of control population.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2164-8-126-S1.xls]Additional file 2Allele length distributions in a normal population for 64 polyglutamine-encoding CAG trinucleotide repeat targets (A) – (BL). This multi-page document provides plots of allele frequency distributions.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2164-8-126-S2.pdf]Additional file 3Genes in over-represented GO terms under Biological Process. For each over-represented GO term and its GO ID, this document lists the CAG-polyQ repeat-containing genes that were annotated with that GO term.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2164-8-126-S3.pdf]Additional file 4Genes in over-represented GO terms under Molecular Function. For each over-represented GO term and its GO ID, this document lists the CAG-polyQ repeat-containing genes that were annotated with that GO term.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2164-8-126-S4.pdf]Additional file 5Genes in over-represented GO terms under Cellular Component. For each over-represented GO term and its GO ID, this document lists the CAG-polyQ repeat-containing genes that were annotated with that GO term.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2164-8-126-S5.pdf]Additional file 6Genes and their shared GO terms under Cellular Component. 
This document provides GO IDs, their descriptions, and the lists of CAGpolyQ repeat-containing genes that shared these annotations above the 99th percentile cutoff.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-126-S6.pdf]

Additional file 7
Genes and their shared GO terms under Biological Process. This document provides GO IDs, their descriptions, and the lists of CAGpolyQ repeat-containing genes that shared these annotations above the 99th percentile cutoff.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-126-S7.pdf]

Additional file 8
Genes and their shared GO terms under Molecular Function. This document provides GO IDs, their descriptions, and the lists of CAGpolyQ repeat-containing genes that shared these annotations above the 99th percentile cutoff.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-126-S8.pdf]

Additional file 9
Conditions for PCR amplification of CAGpolyQ repeats in 64 CAGpolyQ repeats. This table provides primer sequences, annealing temperatures and expected fragment sizes for screening these repeats.
[http://www.biomedcentral.com/content/supplementary/1471-2164-8-126-S9.xls]

Acknowledgements
This study has been approved by the University of British Columbia Clinical Research Ethics Board. The authors wish to thank Christopher Pearson and Simon Warby for helpful discussions, Terry Pape for suggesting a critical experiment, Ian Bosdet and Jacquie Schein for early technology development, Elizabeth Simpson for Coriell controls, and Clinical Research Support at Children's and Women's Health Centre of British Columbia for statistical consulting services. Funding for this study was provided by the Canadian Genetic Diseases Network, the National Organization for Rare Disorders, and the University of British Columbia. RAH is a Michael Smith Foundation for Health Research Scholar and AMM was funded by the Natural Sciences and Engineering Research Council of Canada.

References
1. A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. The Huntington's Disease Collaborative Research Group. Cell 1993, 72:971-983.
2. Koide R, Ikeuchi T, Onodera O, Tanaka H, Igarashi S, Endo K, Takahashi H, Kondo R, Ishikawa A, Hayashi T, et al.: Unstable expansion of CAG repeat in hereditary dentatorubral-pallidoluysian atrophy (DRPLA). Nat Genet 1994, 6:9-13.
3. Nagafuchi S, Yanagisawa H, Sato K, Shirayama T, Ohsaki E, Bundo M, Takeda T, Tadokoro K, Kondo I, Murayama N, et al.: Dentatorubral and pallidoluysian atrophy expansion of an unstable CAG trinucleotide on chromosome 12p. Nat Genet 1994, 6:14-18.
4. La Spada AR, Wilson EM, Lubahn DB, Harding AE, Fischbeck KH: Androgen receptor gene mutations in X-linked spinal and bulbar muscular atrophy. Nature 1991, 352:77-79.
5. Zhuchenko O, Bailey J, Bonnen P, Ashizawa T, Stockton DW, Amos C, Dobyns WB, Subramony SH, Zoghbi HY, Lee CC: Autosomal dominant cerebellar ataxia (SCA6) associated with small polyglutamine expansions in the alpha 1A-voltage-dependent calcium channel. Nat Genet 1997, 15:62-69.
6. Nakamura K, Jeong SY, Uchihara T, Anno M, Nagashima K, Nagashima T, Ikeda S, Tsuji S, Kanazawa I: SCA17, a novel autosomal dominant cerebellar ataxia caused by an expanded polyglutamine in TATA-binding protein. Hum Mol Genet 2001, 10:1441-1448.
7. Orr HT, Chung MY, Banfi S, Kwiatkowski TJ Jr., Servadio A, Beaudet AL, McCall AE, Duvick LA, Ranum LP, Zoghbi HY: Expansion of an unstable trinucleotide CAG repeat in spinocerebellar ataxia type 1. Nat Genet 1993, 4:221-226.
8. Imbert G, Saudou F, Yvert G, Devys D, Trottier Y, Garnier JM, Weber C, Mandel JL, Cancel G, Abbas N, Durr A, Didierjean O, Stevanin G, Agid Y, Brice A: Cloning of the gene for spinocerebellar ataxia 2 reveals a locus with high sensitivity to expanded CAG/glutamine repeats. Nat Genet 1996, 14:285-291.
9. Sanpei K, Takano H, Igarashi S, Sato T, Oyake M, Sasaki H, Wakisaka A, Tashiro K, Ishida Y, Ikeuchi T, Koide R, Saito M, Sato A, Tanaka T, Hanyu S, Takiyama Y, Nishizawa M, Shimizu N, Nomura Y, Segawa M, Iwabuchi K, Eguchi I, Tanaka H, Takahashi H, Tsuji S: Identification of the spinocerebellar ataxia type 2 gene using a direct identification of repeat expansion and cloning technique, DIRECT. Nat Genet 1996, 14:277-284.
10. Pulst SM, Nechiporuk A, Nechiporuk T, Gispert S, Chen XN, Lopes-Cendes I, Pearlman S, Starkman S, Orozco-Diaz G, Lunkes A, DeJong P, Rouleau GA, Auburger G, Korenberg JR, Figueroa C, Sahba S: Moderate expansion of a normally biallelic trinucleotide repeat in spinocerebellar ataxia type 2. Nat Genet 1996, 14:269-276.
11. Kawaguchi Y, Okamoto T, Taniwaki M, Aizawa M, Inoue M, Katayama S, Kawakami H, Nakamura S, Nishimura M, Akiguchi I, et al.: CAG expansions in a novel gene for Machado-Joseph disease at chromosome 14q32.1. Nat Genet 1994, 8:221-228.
12. David G, Abbas N, Stevanin G, Durr A, Yvert G, Cancel G, Weber C, Imbert G, Saudou F, Antoniou E, Drabkin H, Gemmill R, Giunti P, Benomar A, Wood N, Ruberg M, Agid Y, Mandel JL, Brice A: Cloning of the SCA7 gene reveals a highly unstable CAG repeat expansion. Nat Genet 1997, 17:65-70.
13. Rudnicki DD, Margolis RL: Repeat expansion and autosomal dominant neurodegenerative disorders: consensus and controversy. Expert Rev Mol Med 2003, 2003:1-24.
14. Wain HM, Lush MJ, Ducluzeau F, Khodiyar VK, Povey S: Genew: the Human Gene Nomenclature Database, 2004 updates. Nucleic Acids Res 2004, 32:D255-7.
15. Gastier JM, Brody T, Pulido JC, Businga T, Sunden S, Hu X, Maitra S, Buetow KH, Murray JC, Sheffield VC, Boguski M, Duyk GM, Hudson TJ: Development of a screening set for new (CAG/CTG)n dynamic mutations. Genomics 1996, 32:75-85.
16. Li SH, McInnis MG, Margolis RL, Antonarakis SE, Ross CA: Novel triplet repeat containing genes in human brain: cloning, expression, and length polymorphisms. Genomics 1993, 16:572-579.
17. Riggins GJ, Lokey LK, Chastain JL, Leiner HA, Sherman SL, Wilkinson KD, Warren ST: Human genes containing polymorphic trinucleotide repeats. Nat Genet 1992, 2:186-191.
18. Reddy PH, Stockburger E, Gillevet P, Tagle DA: Mapping and characterization of novel (CAG)n repeat cDNAs from adult human brain derived by the oligo capture method. Genomics 1997, 46:174-182.
19. Margolis RL, Abraham MR, Gatchell SB, Li SH, Kidwai AS, Breschel TS, Stine OC, Callahan C, McInnis MG, Ross CA: cDNAs with long CAG trinucleotide repeats from human brain. Hum Genet 1997, 100:114-122.
20. Schalling M, Hudson TJ, Buetow KH, Housman DE: Direct detection of novel expanded trinucleotide repeats in the human genome. Nat Genet 1993, 4:135-139.
21. Karlin S, Brocchieri L, Bergman A, Mrazek J, Gentles AJ: Amino acid runs in eukaryotic proteomes and disease associations. Proc Natl Acad Sci U S A 2002, 99:333-338.
22. Collins JR, Stephens RM, Gold B, Long B, Dean M, Burt SK: An exhaustive DNA micro-satellite map of the human genome using high performance computing. Genomics 2003, 82:10-19.
23. Subramanian S, Madgula VM, George R, Mishra RK, Pandit MW, Kumar CS, Singh L: Triplet repeats in human genome: distribution and their association with genes and other genomic regions. Bioinformatics 2003, 19:549-552.
24. Jasinska A, Michlewski G, de Mezer M, Sobczak K, Kozlowski P, Napierala M, Krzyzosiak WJ: Structures of trinucleotide repeats in human transcripts and their functional implications. Nucleic Acids Res 2003, 31:5463-5468.
25. Hayashi Y, Yamamoto M, Ohmori S, Kikumori T, Imai T, Funahashi H, Seo H: Polymorphism of homopolymeric glutamines in coactivators for nuclear hormone receptors. Endocr J 1999, 46:279-284.
26. Andres AM, Lao O, Soldevila M, Calafell F, Bertranpetit J: Dynamics of CAG repeat loci revealed by the analysis of their variability. Hum Mutat 2003, 21:61-70.
27. Edwards A, Hammond HA, Jin L, Caskey CT, Chakraborty R: Genetic variation at five trimeric and tetrameric tandem repeat loci in four human population groups. Genomics 1992, 12:241-253.
28. Pandey N, Mittal U, Srivastava AK, Mukerji M: SMARCA2 and THAP11: potential candidates for polyglutamine disorders as evidenced from polymorphism and protein-folding simulation studies. J Hum Genet 2004, 49:596-602.
29. Alba MM, Santibanez-Koref MF, Hancock JM: Conservation of polyglutamine tract size between mice and humans depends on codon interruption. Mol Biol Evol 1999, 16:1641-1644.
30. Alba MM, Santibanez-Koref MF, Hancock JM: The comparative genomics of polyglutamine repeats: extreme differences in the codon organization of repeat-encoding regions between mammals and Drosophila. J Mol Evol 2001, 52:249-259.
31. Levinson G, Gutman GA: Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol Biol Evol 1987, 4:203-221.
32. Pearson CE, Edamura KN, Cleary JD: Repeat instability: mechanisms of dynamic mutations. Nat Rev Genet 2005, 6:729-742.
33. Squitieri F, Andrew SE, Goldberg YP, Kremer B, Spence N, Zeisler J, Nichol K, Theilmann J, Greenberg J, Goto J, et al.: DNA haplotype analysis of Huntington disease reveals clues to the origins and mechanisms of CAG expansion and reasons for geographic variations of prevalence. Hum Mol Genet 1994, 3:2103-2114.
34. Leeflang EP, Zhang L, Tavare S, Hubert R, Srinidhi J, MacDonald ME, Myers RH, de Young M, Wexler NS, Gusella JF, et al.: Single sperm analysis of the trinucleotide repeats in the Huntington's disease gene: quantification of the mutation frequency spectrum. Hum Mol Genet 1995, 4:1519-1526.
35. Telenius H, Kremer HP, Theilmann J, Andrew SE, Almqvist E, Anvret M, Greenberg C, Greenberg J, Lucotte G, Squitieri F, et al.: Molecular analysis of juvenile Huntington disease: the major influence on (CAG)n repeat length is the sex of the affected parent. Hum Mol Genet 1993, 2:1535-1540.
36. Jodice C, Giovannone B, Calabresi V, Bellocchi M, Terrenato L, Novelletto A: Population variation analysis at nine loci containing expressed trinucleotide repeats. Ann Hum Genet 1997, 61:425-438.
37. Sobczak K, Krzyzosiak WJ: Patterns of CAG repeat interruptions in SCA1 and SCA2 genes in relation to repeat instability. Hum Mutat 2004, 24:236-247.
38. GeneReviews at GeneTests: Medical Genetics Information Resource [http://www.genetests.org]
39. Chung MY, Ranum LP, Duvick LA, Servadio A, Zoghbi HY, Orr HT: Evidence for a mechanism predisposing to intergenerational CAG repeat instability in spinocerebellar ataxia type I. Nat Genet 1993, 5:254-258.
40. Wren JD, Forgacs E, Fondon JW 3rd, Pertsemlidis A, Cheng SY, Gallardo T, Williams RS, Shohet RV, Minna JD, Garner HR: Repeat polymorphisms within gene regions: phenotypic and evolutionary implications. Am J Hum Genet 2000, 67:345-356.
41. Mularoni L, Guigo R, Alba MM: Mutation patterns of amino acid tandem repeats in the human proteome. Genome Biol 2006, 7:R33.
42. Huang H, Winter EE, Wang H, Weinstock KG, Xing H, Goodstadt L, Stenson PD, Cooper DN, Smith D, Alba MM, Ponting CP, Fechtel K: Evolutionary conservation and selection of human disease gene orthologs in the rat and mouse genomes. Genome Biol 2004, 5:R47.
43. Mitchell PJ, Tjian R: Transcriptional regulation in mammalian cells by sequence-specific DNA binding proteins. Science 1989, 245:371-378.
44. Bhandari R, Brahmachari SK: Analysis of CAG/CTG triplet repeats in the human genome: Implication in transcription factor gene regulation. Journal of biosciences 1995, 20:613-627.
45. Karlin S, Burge C: Trinucleotide repeats and long homopeptides in genes and proteins associated with nervous system disease and development. Proc Natl Acad Sci U S A 1996, 93:1560-1565.
46. Alba MM, Santibanez-Koref MF, Hancock JM: Amino acid reiterations in yeast are overrepresented in particular classes of proteins and show evidence of a slippage-like mutational process. J Mol Evol 1999, 49:789-797.
47. Alba MM, Guigo R: Comparative analysis of amino acid repeats in rodents and humans. Genome Res 2004, 14:549-554.
48. Faux NG, Bottomley SP, Lesk AM, Irving JA, Morrison JR, de la Banda MG, Whisstock JC: Functional insights from the distribution and role of homopeptide repeat-containing proteins. Genome Res 2005, 15:537-551.
49. Dunah AW, Jeong H, Griffin A, Kim YM, Standaert DG, Hersch SM, Mouradian MM, Young AB, Tanese N, Krainc D: Sp1 and TAFII130 transcriptional activity disrupted in early Huntington's disease. Science 2002, 296:2238-2243.
50. Freiman RN, Tjian R: Neurodegeneration. A glutamine-rich trail leads to transcription factors. Science 2002, 296:2149-2150.
51. van Roon-Mom WM, Reid SJ, Faull RL, Snell RG: TATA-binding protein in neurodegenerative disease. Neuroscience 2005, 133:863-872.
52. Helmlinger D, Hardy S, Sasorith S, Klein F, Robert F, Weber C, Miguet L, Potier N, Van-Dorsselaer A, Wurtz JM, Mandel JL, Tora L, Devys D: Ataxin-7 is a subunit of GCN5 histone acetyltransferase-containing complexes. Hum Mol Genet 2004, 13:1257-1265.
53. Palhan VB, Chen S, Peng GH, Tjernberg A, Gamper AM, Fan Y, Chait BT, La Spada AR, Roeder RG: Polyglutamine-expanded ataxin-7 inhibits STAGA histone acetyltransferase activity to produce retinal degeneration. Proc Natl Acad Sci U S A 2005, 102:8472-8477.
54. McMahon SJ, Pray-Grant MG, Schieltz D, Yates JR 3rd, Grant PA: Polyglutamine-expanded spinocerebellar ataxia-7 protein disrupts normal SAGA and SLIK histone acetyltransferase activity. Proc Natl Acad Sci U S A 2005, 102:8478-8482.
55. Zhai W, Jeong H, Cui L, Krainc D, Tjian R: In vitro analysis of huntingtin-mediated transcriptional repression reveals multiple transcription factor targets. Cell 2005, 123:1241-1253.
56. Ralser M, Albrecht M, Nonhoff U, Lengauer T, Lehrach H, Krobitsch S: An integrative approach to gain insights into the cellular function of human ataxin-2. J Mol Biol 2005, 346:203-214.
57. Irwin S, Vandelft M, Pinchev D, Howell JL, Graczyk J, Orr HT, Truant R: RNA association and nucleocytoplasmic shuttling by ataxin-1. J Cell Sci 2005, 118:233-242.
58. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ: Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921.
59. Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu YT, Roskin KM, Schwartz M, Sugnet CW, Thomas DJ, Weber RJ, Haussler D, Kent WJ: The UCSC Genome Browser Database. Nucl Acids Res 2003, 31:51-54.
60. Hubbard T, Andrews D, Caccamo M, Cameron G, Chen Y, Clamp M, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Gilbert J, Hammond M, Herrero J, Hotz H, Howe K, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Kokocinsci F, London D, Longden I, McVicker G, Melsopp C, Meidl P, Potter S, Proctor G, Rae M, Rios D, Schuster M, Searle S, Severin J, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Trevanion S, Ureta-Vidal A, Vogel J, White S, Woodwark C, Birney E: Ensembl 2005. Nucleic Acids Res 2005, 33:D447-53.
61. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25:25-29.
62. Lavoie H, Debeane F, Trinh QD, Turcotte JF, Corbeil-Girard LP, Dicaire MJ, Saint-Denis A, Page M, Rouleau GA, Brais B: Polymorphism, shared functions and convergent evolution of genes with sequences coding for polyalanine domains. Hum Mol Genet 2003, 12:2967-2979.
63. Kelso J, Visagie J, Theiler G, Christoffels A, Bardien S, Smedley D, Otgaar D, Greyling G, Jongeneel CV, McCarthy MI, Hide T, Hide W: eVOC: a controlled vocabulary for unifying gene expression data. Genome Res 2003, 13:1222-1230.
64. Hide W, Smedley D, McCarthy M, Kelso J: Application of eVOC: controlled vocabularies for unifying gene expression data. C R Biol 2003, 326:1089-1096.
65. Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res 2004, 14:160-169.
66. Rebhan M, Chalifa-Caspi V, Prilusky J: GeneCards: encyclopedia for genes, proteins and diseases. [http://www.genecards.org]
67. Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC, Lababidi S, Bussey KJ, Riss J, Barrett JC, Weinstein JN: GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol 2003, 4:R28.
68. Gene Ontology [http://www.geneontology.org]
69. Buchanan G, Yang M, Cheong A, Harris JM, Irvine RA, Lambert PF, Moore NL, Raynor M, Neufing PJ, Coetzee GA, Tilley WD: Structural and functional consequences of glutamine tract variation in the androgen receptor. Hum Mol Genet 2004, 13:1677-1692.
70. Juvonen V, Hietala M, Kairisto V, Savontaus ML: The occurrence of dominant spinocerebellar ataxias among 251 Finnish ataxia patients and the role of predisposing large normal alleles in a genetically isolated population. Acta Neurol Scand 2005, 111:154-162.
71. Gouw LG, Castaneda MA, McKenna CK, Digre KB, Pulst SM, Perlman S, Lee MS, Gomez C, Fischbeck K, Gagnon D, Storey E, Bird T, Jeri FR, Ptacek LJ: Analysis of the dynamic mutation in the SCA7 gene shows marked parental effects on CAG repeat transmission. Hum Mol Genet 1998, 7:525-532.
72. Zuhlke C, Hellenbroich Y, Dalski A, Kononowa N, Hagenah J, Vieregge P, Riess O, Klein C, Schwinger E: Different types of repeat expansion in the TATA-binding protein gene are associated with a new form of inherited ataxia. Eur J Hum Genet 2001, 9:160-164.
73. Bruce HA, Margolis RL: FOXP2: novel exons, splice variants, and CAG repeat length stability. Hum Genet 2002, 111:136-144.
74. Dai P, Wong LJ: Somatic instability of the DNA sequences encoding the polymorphic polyglutamine tract of the AIB1 gene. J Med Genet 2003, 40:885-890.
75. Rovio AT, Abel J, Ahola AL, Andres AM, Bertranpetit J, Blancher A, Bontrop RE, Chemnick LG, Cooke HJ, Cummins JM, Davis HA, Elliott DJ, Fritsche E, Hargreave TB, Hoffman SM, Jequier AM, Kao SH, Kim HS, Marchington DR, Mehmet D, Otting N, Poulton J, Ryder OA, Schuppe HC, Takenaka O, Wei YH, Wichmann L, Jacobs HT: A prevalent POLG CAG microsatellite length allele in humans and African great apes. Mamm Genome 2004, 15:492-502.
76. Hayes S, Turecki G, Brisebois K, Lopes-Cendes I, Gaspar C, Riess O, Ranum LP, Pulst SM, Rouleau GA: CAG repeat length in RAI1 is associated with age at onset variability in spinocerebellar ataxia type 2 (SCA2). Hum Mol Genet 2000, 9:1753-1758.
77. Figueroa KP, Chan P, Schols L, Tanner C, Riess O, Perlman SL, Geschwind DH, Pulst SM: Association of moderate polyglutamine tract expansions in the slow calcium-activated potassium channel type 3 with ataxia. Arch Neurol 2001, 58:1649-1653.
78. Koronyo-Hamaoui M, Gak E, Stein D, Frisch A, Danziger Y, Leor S, Michaelovsky E, Laufer N, Carel C, Fennig S, Mimouni M, Apter A, Goldman B, Barkai G, Weizman A: CAG repeat polymorphism within the KCNN3 gene is a significant contributor to susceptibility to anorexia nervosa: a case-control study of female patients and several ethnic groups in the Israeli Jewish population. Am J Med Genet B Neuropsychiatr Genet 2004, 131:76-80.
79. Tsutsumi T, Holmes SE, McInnis MG, Sawa A, Callahan C, DePaulo JR, Ross CA, DeLisi LE, Margolis RL: Novel CAG/CTG repeat expansion mutations do not contribute to the genetic risk for most cases of bipolar disorder or schizophrenia. Am J Med Genet B Neuropsychiatr Genet 2004, 124:15-19.
80. Ogasawara M, Imanishi T, Moriwaki K, Gaudieri S, Tsuda H, Hashimoto H, Shiroishi T, Gojobori T, Koide T: Length variation of CAG/CAA triplet repeats in 50 genes among 16 inbred mouse strains. Gene 2005, 349:107-119.
81. Hancock JM, Worthey EA, Santibanez-Koref MF: A role for selection in regulating the evolutionary emergence of disease-causing and other coding CAG repeats in humans and mice. Mol Biol Evol 2001, 18:1014-1023.
82. Weber JL: Informativeness of human (dC-dA)n.(dG-dT)n polymorphisms. Genomics 1990, 7:524-530.
83. Brinkmann B, Klintschar M, Neuhuber F, Huhne J, Rolf B: Mutation rate in human microsatellites: influence of the structure and length of the tandem repeat. Am J Hum Genet 1998, 62:1408-1415.
84. Petes TD, Greenwell PW, Dominska M: Stabilization of microsatellite sequences by variant repeats in the yeast Saccharomyces cerevisiae. Genetics 1997, 146:491-498.
85. Michlewski G, Krzyzosiak WJ: Molecular architecture of CAG repeats in human disease related transcripts. J Mol Biol 2004, 340:665-679.
86. Cleary JD, Pearson CE: The contribution of cis-elements to disease-associated repeat instability: clinical and experimental evidence. Cytogenet Genome Res 2003, 100:25-55.
87. Cleary JD, Pearson CE: Replication fork dynamics and dynamic mutations: the fork-shift model of repeat instability. Trends Genet 2005, 21:272-280.
88. Mulvihill DJ, Edamura KN, Hagerman KA, Pearson CE, Wang YH: Effect of CAT or AGG interruptions and CpG methylation on nucleosome assembly upon trinucleotide repeats on spinocerebellar ataxia, type 1 and fragile X syndrome. J Biol Chem 2005, 280:4498-4503.
89. Rozanska M, Sobczak K, Jasinska A, Napierala M, Kaczynska D, Czerny A, Koziel M, Kozlowski P, Olejniczak M, Krzyzosiak WJ: CAG and CTG repeat polymorphism in exons of human genes shows distinct features at the expandable loci. Hum Mutat 2007, 28:451-458.
90. Takano H, Cancel G, Ikeuchi T, Lorenzetti D, Mawad R, Stevanin G, Didierjean O, Durr A, Oyake M, Shimohata T, Sasaki R, Koide R, Igarashi S, Hayashi S, Takiyama Y, Nishizawa M, Tanaka H, Zoghbi H, Brice A, Tsuji S: Close associations between prevalences of dominantly inherited spinocerebellar ataxias with CAG-repeat expansions and frequencies of large normal CAG alleles in Japanese and Caucasian populations. Am J Hum Genet 1998, 63:1060-1066.
91. Shimohata T, Nakajima T, Yamada M, Uchida C, Onodera O, Naruse S, Kimura T, Koide R, Nozaki K, Sano Y, Ishiguro H, Sakoe K, Ooshima T, Sato A, Ikeuchi T, Oyake M, Sato T, Aoyagi Y, Hozumi I, Nagatsu T, Takiyama Y, Nishizawa M, Goto J, Kanazawa I, Davidson I, Tanese N, Takahashi H, Tsuji S: Expanded polyglutamine stretches interact with TAFII130, interfering with CREB-dependent transcription. Nat Genet 2000, 26:29-36.
92. Nucifora FC Jr., Sasaki M, Peters MF, Huang H, Cooper JK, Yamada M, Takahashi H, Tsuji S, Troncoso J, Dawson VL, Dawson TM, Ross CA: Interference by huntingtin and atrophin-1 with cbp-mediated transcription leading to cellular toxicity. Science 2001, 291:2423-2428.
93. Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucl Acids Res 1999, 27:573-580.
94. Karlin S: Statistical significance of sequence patterns in proteins. Curr Opin Struct Biol 1995, 5:360-371.
95. Furey TS, Haussler D: Integration of the cytogenetic map with the draft human genome sequence. Hum Mol Genet 2003, 12:1037-1044.
96. HUGO Gene Nomenclature Committee [http://www.gene.ucl.ac.uk/nomenclature]
97. Coriell Cell Repository [http://coriell.umdnj.edu]
98. Rozen S, Skaletsky H: Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol 2000, 132:365-386.
99. Brownstein MJ, Carpten JD, Smith JR: Modulation of non-templated nucleotide addition by Taq DNA polymerase: primer modifications that facilitate genotyping. Biotechniques 1996, 20:1004-6, 1008-10.
100. Ahn SJ, Costa J, Emanuel JR: PicoGreen quantitation of DNA: effective evaluation of samples pre- or post-PCR. Nucleic Acids Res 1996, 24:2623-2625.
101. Mood AM: Introduction to the theory of statistics. New York, McGraw-Hill; 1974:516-517.
102. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13:2498-2504.

