UBC Faculty Research and Publications

GWATCH: a web platform for automated gene association discovery analysis
Svitin, Anton; Malov, Sergey; Cherkasov, Nikolay; Geerts, Paul; Rotkevich, Mikhail; Dobrynin, Pavel; Shevchenko, Andrey; Guan, Li; Troyer, Jennifer; Hendrickson, Sher; Dilks, Holli H; Oleksyk, Taras K; Donfield, Sharyne; Gomperts, Edward; Jabs, Douglas A; Sezgin, Efe; Van Natta, Mark; Harrigan, P R; Brumme, Zabrina L; O’Brien, Stephen J
Nov 5, 2014

Full Text

TECHNICAL NOTE (Open Access)

GWATCH: a web platform for automated gene association discovery analysis

Anton Svitin1*†, Sergey Malov1,2†, Nikolay Cherkasov1†, Paul Geerts3, Mikhail Rotkevich1, Pavel Dobrynin1, Andrey Shevchenko1, Li Guan1, Jennifer Troyer4, Sher Hendrickson5, Holli Hutcheson Dilks6, Taras K Oleksyk7, Sharyne Donfield8, Edward Gomperts9, Douglas A Jabs10, Efe Sezgin11, Mark Van Natta11, P Richard Harrigan12,13, Zabrina L Brumme14 and Stephen J O’Brien1,15*

Svitin et al. GigaScience 2014, 3:18
http://www.gigasciencejournal.com/content/3/1/18

* Correspondence: anton.svitin@gmail.com; lgdchief@gmail.com
†Equal contributors
1Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, St. Petersburg 199004, Russia
15Oceanographic Center, Nova Southeastern University, Ft. Lauderdale, FL 33004, USA
Full list of author information is available at the end of the article

© 2014 Svitin et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

Background: As genome-wide sequence analyses for complex human disease determinants are expanding, it is increasingly necessary to develop strategies to promote discovery and validation of potential disease-gene associations.

Findings: Here we present a dynamic web-based platform – GWATCH – that automates and facilitates four steps in genetic epidemiological discovery: 1) Rapid gene association search and discovery analysis of large genome-wide datasets; 2) Expanded visual display of gene associations for genome-wide variants (SNPs, indels, CNVs), including Manhattan plots, 2D and 3D snapshots of any gene region, and a dynamic genome browser illustrating gene association chromosomal regions; 3) Real-time validation/replication of candidate or putative genes suggested from other sources, limiting Bonferroni genome-wide association study (GWAS) penalties; 4) Open data release and sharing by eliminating privacy constraints (The National Human Genome Research Institute (NHGRI) Institutional Review Board (IRB), informed consent, The Health Insurance Portability and Accountability Act (HIPAA) of 1996 etc.) on unabridged results, which allows for open access comparative and meta-analysis.

Conclusions: GWATCH is suitable for both GWAS and whole genome sequence association datasets. We illustrate the utility of GWATCH with three large genome-wide association studies for HIV-AIDS resistance genes screened in large multicenter cohorts; however, association datasets from any study can be uploaded and analyzed by GWATCH.

Keywords: AIDS, HIV, Complex diseases, Genome-wide association studies (GWAS), Whole genome sequencing (WGS)

Findings

Introduction

Annotations of human genome variation have identified some 60 million single nucleotide polymorphisms (SNPs), which offer the promise of connecting nucleotide and structural variation to hereditary traits [1-3]. Genotyping arrays that resolve millions of common SNPs have enabled over 2,000 genome-wide association studies (GWAS) to discover principal genetic determinants of complex multifactorial human diseases [4,5]. Today, whole-genome sequence association has extended the prospects for personalized genomic medicine, capturing rare variants, copy number variation (CNV), indels, epistatic and epigenetic interactions in hopes of achieving individualized genomic assessment, diagnostics, and therapy of complex maladies by interpreting one’s genomic heritage [6-9].

To date, GWAS studies have produced conflicting signals because many SNP associations are not replicated in subsequent studies. Further, GWAS frequently fail to implicate previously-validated gene regions described in candidate gene associations for the same disease, and in most cases offer less than 10% of the explanatory variance for the disease etiology [9-13]. In addition, discovered gene variants are frequently nested in noncoding desert regions of the genome that are difficult to interpret. At least part of these weaknesses derive from discounting SNP association “hits” that fail to achieve “genome-wide significance”, a widely accepted, albeit conservative, statistical threshold set to discard the plethora of false positive statistical associations (Type I errors) that derive from the large number of SNPs interrogated [2,13-16].

A challenge to genetic epidemiology involves disentangling the true functional associations that straddle the genome-wide significance threshold from the myriad of statistical artifacts that also occur. No one has developed a real solution to this conundrum, though some approaches have been offered [11,15-21]. Many researchers agree that more widely practiced open access data sharing of unabridged GWAS data would offer the opportunity for multiple plausible approaches to bear on this question [22,23]. However, for many cohorts, especially those developed before the advent of the genomics era, participants were not consented for open access of genome-wide data. Since patient anonymization is virtually impossible with genetic epidemiological data, the prospects of sharing patients’ genotype and clinical data may conflict with ethical concerns over protecting the individual privacy of study subjects [24-26]. GWATCH (Genome-Wide Association Tracks Chromosome Highway) addresses this issue through an organized open release of unabridged SNP-test association results from GWAS and whole genome sequencing (WGS) association studies and illustrates its utility using a SNP association analysis for HIV-AIDS in multiple cohorts [10,11,19,27-32].

Results

GWATCH is a dynamic genome browser that automates and displays primary analysis results: p-values and Quantitative Association Statistic (QAS, a general term for statistics explaining direction and strength of associations: odds ratio, relative hazard and ez2-transformed correlation coefficient; see Section 2 of Additional file 1: Materials and Methods) from multiple association tests performed for one or more cohorts in a GWAS or WGS association study as a visual array ordered by SNP chromosomal position [33].
GWATCH offers a number of “features” that allow automated analysis and visualization of multiple test results, rapid discovery, replication and data release of unabridged association results (Table 1). A typical input of a GWAS analysis includes a large unabridged Data Table listing p-values and QASs across multiple SNP association tests performed for a list of ~10,000,000 ordered SNPs (Additional file 2: Table S1). GWATCH displays the Data Table, association tests and various perspectives for results: Manhattan plots for each single test (Additional file 1: Figure S1), 2D and 3D snapshots of test results for chromosome regions of “hits”, and a dynamic chromosome browser that illustrates significant p-values and QASs from the Data Table (Figure 1A, B and C). The imagery provides a dynamic traverse along a human chromosome producing a “bird’s eye” view of the strong SNP associations that rise above the chromosome highway surface. The idea is to visualize association results across a gene region (e.g., one that may include a highly significant SNP association) for all the tests performed (on the same or different cohorts) and for all the neighboring, potentially proxy SNPs (i.e., SNPs which track the neighboring causal, disease-affecting SNP due to the linkage disequilibrium [LD]) for the same tests.

Top hits are ranked based upon extreme p-values, QASs, or “density” of composite p-value peaks (representing proxy SNPs in linkage disequilibrium and multiple non-independent association tests). A multi-page “TRAX REPORT” produces curves, tables and appropriate statistics for a selected variant (SNP, indel or CNV tracked) on request.
As genotyping and clinical data are organized, GWATCH automates the computation and visualization of results, allowing instant replication of putative discoveries suggested by outside cohort studies or functional experiments. GWATCH also provides a simple procedure for web release of the association results to interested researchers.

We illustrate the utility, interpretation, and navigation of GWATCH using a GWAS carried out with study participants enrolled in eight prospective HIV-AIDS cohorts, searching for AIDS Restriction Genes [10,11,19,27-32]. We performed a GWAS meta-analysis on 5,922 patients with distinctive clinical outcomes genotyped using an Affymetrix 6.0 genotyping array (700,022 SNPs after quality control [QC] filters) and parsed into three population groups: Group A) A select group of 1,527 European American individuals; Group B) A larger group of 4,462 European American individuals that includes Group A; Group C) An independent group of 1,460 African American individuals (Table 2). Based upon available clinical information, we performed 123 association tests on Group A, 144 association tests on Group B, and 60 association tests on Group C (Table 3 and Additional file 3: Table S2, Additional file 4: Table S3, Additional file 5: Table S4, Additional file 6: Table S5). The tests include allele and genotype associations for four stages of AIDS: HIV acquisition/infection, AIDS progression (including categorical and survival analyses), AIDS-defining conditions and Highly Active AntiRetroviral Therapy (HAART) outcomes as described previously [27-32]; however, the unabridged dataset displayed in GWATCH-AIDS is far richer. For example, references [28,31,32] each describe one association test (implicating the PARD3B, PROX1, and CCR5-Δ32 AIDS restriction genes, respectively), while [29,30] analyze small subsets of the SNPs tested within NEMP and HDF gene groups, respectively.
GWATCH-AIDS presents complete results for 700,022 SNPs for 327 tests (Table 3) for 5,922 study participants listed in Table 2.

Table 1 Display feature components of GWATCH

Features displayed (with illustration)
1. Unabridged data table of SNP chromosome coordinates, MAF*, p-value and QAS** for each SNP for each test (Additional file 2: Table S1)
2. Association tests list and Manhattan plots for each test across all SNPs (Additional file 1: Figure S1)
3. SNAPSHOTS of SNP-test results in a chromosome region:
   1. 2D heat plot snapshot illustrating p-values in any selected chromosome region (Figure 1A and Additional file 1: Figure S2)
   2. 3D checkerboard plot snapshot illustrating p-values and QAS** in any selected chromosome region (Figure 1B and Additional file 1: Figure S3)
   3. LD-polarized 3D checkerboard snapshot illustrating p-values and QAS** in any selected chromosome region (Figure 1B and Additional file 1: Figure S4)
4. Dynamic HIGHWAY view by chromosome browser illustrating p-values and QAS** (Figure 1C)
5. Top association hits:
   1. Top hits based on ranked -log p-value (Additional file 10: Table S7)
   2. Top hits based on ranked QAS** (Additional file 10: Table S7)
   3. Top hits based on ranked Density of -log p-value within a SNP genomic region (Additional file 10: Table S7)
6. TRAX feature:
   1. TRAX PAGE – two-page graphic summary illustrating p-values and QAS** for one selected SNP (Additional file 7: Figure S5)
   2. TRAX REPORT – eleven-page analysis summary with graphs, curves and tables for all association tests for one selected SNP (Additional file 8: Figure S6)
Abbreviations: *MAF minor allele frequency, **QAS quantitative association statistic (OR, RH, ez2-transformed correlation coefficient).

The first step of data analysis using GWATCH is to produce a large Data Table listing all SNP names, chromosome coordinates and minor allele frequency (MAF), with p-values and QASs for each test (Additional file 2: Table S1) plus a description of each test. Results in this Table are displayed as familiar Manhattan plots for each test as well as by SNAPSHOT views of chromosome regions. 2D-SNAPSHOT is a heat plot of ordered SNP-test results (e.g., ~80 SNPs at 4 kb average distance for 123 tests in Group A (Table 2) equaling ~10,000 SNP-test combinations) indexed by the p-values from p > 0.05 (light grey) to richer colors for decreasing p-values, assuring that significant region clusters are more densely colorful (Figure 1A and Additional file 1: Figure S2). Similarly, a 3D-SNAPSHOT presents a checkerboard view of a chromosome region whereby the blocks rising above the surface reflect –log p-value and the color intensity reflects the QAS values, with green indicating “resistant” associations (QAS < 1.0) and red showing “susceptible” ones (QAS > 1.0) (Figure 1B and Additional file 1: Figure S3 and S4). The moving browser HIGHWAY, a major feature of GWATCH, scrolls across entire chromosomes in the 3D view of background statistical “noise” plus interesting regions of dense elevated blocks (Figure 1C).

Since susceptible/resistant colors are initially indexed by the minor (less common) allele at any locus, color discordance will arise in a region when the minor allele at a given locus is tracked in LD by the common allele at an adjacent locus. The POLARIZE option corrects this computational artifact by inverting the QAS in locus pairs that show discrepant (common and minor allele tracking as proxies) LD polarity. When the entire association signal for a region, driving the non-independent SNPs and non-independent tests, derives from a single causal allele within the region, the blocks of associated SNPs in the viewed region should be the same color after polarization (Figure 1B and Additional file 1: Figure S4).

Automated searches for extreme locus “hits” revealing remarkable associations across the genome can be performed for each stage of disease (see above), screening for extreme p-values, QAS values and/or density of extreme p-values. For loci of particular interest, a detailed TRAX REPORT is generated to display each curve, table and statistic that had driven the association discovery (Additional file 7: Figure S5 and Additional file 8: Figure S6). TRAX REPORT is available for 641 SNPs in 241 genes listed in Additional file 9: Table S6. For the rest of the SNPs, the TRAX PAGE (shorter version of TRAX REPORT) is available.

To demonstrate GWATCH, three previously validated AIDS resistance gene regions, CCR5-Δ32, PROX1 and PARD3B, can be examined by simply entering an rs-number, gene name or chromosome coordinates in the search option (see also 2D and 3D snapshots in Additional file 1: Figures S2-S4).
GWATCH moves HIGHWAY to the selected region so one can visualize the signal with the 2D and 3D-SNAPSHOTS plus the TRAX REPORTS.

Lastly, we also include a listing of discovered regions that showed AIDS association signals that, though they did not reach genome-wide significance, represented outlier values for several related tests and linked SNPs (Additional file 10: Table S7). These regions then would be considered as candidates for future evaluation and replication in independent cohort studies.

Finally, GWATCH is a generalizable web tool suitable for GWAS and/or WGS datasets for any complex disease. The “finished” or “processed” data (ones containing a final Data Table of p-values and QAS for completed association tests) can be uploaded directly by following instructions for dataset upload on the GWATCH website. “Primary” or “unfinished” data (ones with genotypes and clinical data for which tests need to be constructed and calculated) will be uploaded with our assistance in custom development of a disease-specific GWATCH-based analysis.

Figure 1 GWATCH produces different kinds of snapshots and views for a selected genomic region. (A) 2D-SNAPSHOT of PARD3B region of Chromosome 2 [28] tested for the 123 tests in Group A (Table 2). (B) POLARIZED 3D-SNAPSHOT of the PROX1 region of Chromosome 1 [31] tested for the same group (Table 2). (C) Dynamic 3D HIGHWAY chromosome browser view of CCR5 region of Chromosome 3 [32] tested for the same group (Table 2). See also Additional file 1: Figures S2-S4.

Table 2 Categories and numbers of patients genotyped in AIDS GWAS meta-analysis

                                                                               Number of patients for each group    Total
Abbreviations      Risk groups                                                 Group A    Group B      Group C      B + C
                                                                               EA*-I      EA*-Total    AA**
HREU               High Risk Exposed HIV Uninfected                            254        300          148          448
EU (except HREU)   Exposed HIV Uninfected (all risks)                          1          351          267          618
SC                 Sero-Convertor                                              703        767          288          1,055
SP-LTS             Sero-Prevalent-Long-Term-Survivor (no AIDS for >10 years)   444        831          170          1,001
Sequelae           AIDS sequelae diagnosis                                     461        1,848        0            1,848
HAART              Anti-retroviral treatment                                   485        1,319        65           1,384
Total study participants                                                       1,527      4,462        1,460        5,922
Abbreviations: *EA European Americans, **AA African Americans.

Discussion

GWATCH is designed to enable investigators and users not connected to the original study to access the results of SNP association (from the whole genome sequence or SNP array genotyping) in order to view and share their study design and results openly. It can be used for visualization of regions with low p-values to inspect the pattern of variation across linked SNPs and also at different stages of disease (e.g., HIV infection, AIDS progression and treatment outcome).

As a primary discovery approach, screening across unabridged test results poses large statistical penalties for multiple tests, eroding confidence in associations that fail to achieve genome-wide significance [2,13-18,21]. For this reason, one should use caution in inspection of putative regions of significance. Nonetheless, wholesale discarding of marginally significant “hits” will discount some true associations within the mix of statistical artifacts. GWATCH offers an opportunity to screen the genome for disease-associated regions, which may contain causal SNP variants included (or not) in the SNP array used for genotyping, as well as proxy SNPs tracking the causal variant. Further, in complex diseases for which there are many different cohorts being studied (e.g., in HIV-AIDS there are at least twenty different groups conducting AIDS GWAS on small, well-defined cohorts that may differ in genetic background and clinical data available for association testing) [34], GWATCH offers rapid replication opportunities with an independent dataset.

There are several websites that aim at cataloguing and displaying SNP associations. For example, GWAS Central [35] is a valuable resource for releasing and accessing GWAS data [36]. At the same time, we believe that GWATCH can be advantageous in some cases for the following reasons: 1) GWATCH utilizes (while not revealing directly) primary unabridged clinical/phenotypic data providing detailed analytical reports, like TRAX, not offered in GWAS Central; 2) GWATCH contains summary tools, such as top hits tables, and performs calculation of density that allows for identification, inspection and replication of putative association hits; 3) GWAS Central reports traditional Manhattan plots while GWATCH extends these to 2D and 3D static and dynamic region visuals that expand user comprehension and perception for better grasp of large data.

The GWATCH web browser provides a dynamic visual journey, similar to driving a video game along human chromosomes to view patterns of GWAS- or WGS-based variant association with any complex disease. It is meant to be appealing, intuitive, and accessible to non-experts and experts alike, including the various contributors to today’s exciting gene association studies. The format and open web access allow for importing new data from any disease-gene association study with multiple disease stages or genetic models of analysis. The wide breadth of test associations displayed is particularly suited to complex disease cohorts with detailed clinical parameters over distinct disorder stages. Further, although GWATCH is potentially useful for initial gene discovery, an important corollary lies in providing rapid replication of gene discoveries from independent cohort studies by simply keying in the putative gene region and inspecting the many test results of the posted dataset.
Since replication screens are hypothesis-driven, they avoid the stringent multiple test correction penalties of a GWAS/WGS (p < 10^−8). Finally, different cohort studies can be compared directly or combined to build meta-analyses.

Should many cohort investigators release their unabridged results, then association discoveries will be replicated (or not) in a rapid, open and productive manner, allowing for large meta-analyses as have been proposed for HIV-AIDS and other complex diseases [22,23,34]. Unlike other methods of data sharing, this results-based open data sharing/release approach avoids any violation of patient privacy, IRB (institutional review board) and HIPAA (Health Insurance Portability and Accountability Act of 1996) concerns, or informed consent constraints, since the primary clinical and genotype data remain confidential while the derivative results (p-values, QASs, plots) of multiple conceivable analytical approaches are openly released. In this approach, we hope to considerably expand discovery and replication opportunities in important biomedical research. To us, this ensures the maximum benefit of open access data sharing while protecting patients who prefer privacy (many do), but wish to see their volunteerism fulfilled.

Table 3 Statistical tests performed on 3 HIV-AIDS cohort Study Groups A-C (see Table 2)

                                                                          Number of tests for each group
Clinical stage                  Test type                                 Group A    Group B    Group C
I. HIV Infection                Ia. Infection - categorical               3          12         12
II. HIV Progression             IIa. Progression - categorical dichotomous 12        12         12
                                IIb. Progression - categorical multipoint 12         12         12
                                IIc. Progression - survival               48         48         24
III. AIDS-defining Conditions   IIIa. Sequelae - categorical first sequela 9         9          -
                                IIIb. Sequelae - survival first sequela   9          -          -
                                IIIc. Sequelae - categorical any sequelae 9          33         -
                                IIId. Sequelae - survival any sequelae    9          6          -
IV. Treatment with ARV          IVa. HAART - categorical                  6          -          -
                                IVb. HAART - survival                     6          12         -
Total                                                                     123        144        60
See Additional file 3: Table S2, Additional file 4: Table S3, Additional file 5: Table S4, Additional file 6: Table S5, for detailed description of statistical association tests performed in each group.

Materials and methods

GWATCH implementation

GWATCH is a web-based application that integrates several technologies and programming languages. The server side is represented by an Apache web server, which employs the PHP engine and the Java-based toolkit Batik. R-project functions and modules are used for performing statistical tests, polarization and density calculation. The MySQL database component of GWATCH allows access, retrieval and management of genotypes, clinical information and test results. On the frontend, GWATCH employs HTML5, Javascript, jQuery and WebGL for the HIGHWAY browser interface, and Ajax and JSON technologies for data exchange between server and client.

GWATCH tools

TRAX REPORTS

After screening for associations of clinical traits and genotypes one may be interested in a closer review of certain SNPs. The TRAX REPORT (Additional file 7: Figure S5 and Additional file 8: Figure S6) tool allows the production of reports on extended statistical analysis for any single SNP if the corresponding genotype and clinical information is available for all individuals. Important genotype information is given in the header on the TRAX front page: SNP identifier, SNP coordinate, chromosome, alleles and their frequencies. The header also lists information on populations involved in the analysis. In addition to the header, the front page also contains a summary for all tests with p-values, as well as values of QAS represented in bar plot form. The following pages of TRAX REPORT contain detailed information, such as contingency tables (that are produced in the form of corresponding bar plots for any categorical test, including progression categorical tests), and Kaplan–Meier survival curves that are reported for all three genotypes for all survival tests.

Polarization

The polarization tool enables the inversion of test results for minor and common SNP-alleles around some fixed SNP (called index SNP) for better approximation of true associations. A polarization table is produced using linkage disequilibrium coefficients (D’) between neighboring SNPs. Linkage disequilibrium coefficients are calculated for 80 SNPs upstream and 80 SNPs downstream of the index SNP. In the case of a sufficiently large positive value of linkage disequilibrium (D’ > 0.9), the polarization mark is assigned to 1, whereas in the case of a sufficiently large negative linkage disequilibrium (D’ < −0.9) the polarization mark is assigned to −1. If the linkage disequilibrium is sufficiently small, the polarization mark is assigned to 0. In the process of polarization, QAS values for test results of neighboring SNPs are inverted if the polarization mark is −1, implying the inversion of direction of disease association for such SNPs.

Density

Density top scoring that identifies regions of concentration of small p-values is calculated for each SNP in two steps:

1) in the window of specified size (n SNPs upstream and downstream or n Kbp upstream and downstream) average -log p-value is computed for each test (lane of the Highway)
2) these per-test (per-lane) averages are used for calculating density at this SNP either by averaging them or by finding the largest one (depending on the option chosen)

The second step can be performed for all the tests or for the group of tests by the disease stage (e.g., all tests for HIV infection, all tests for AIDS progression etc.).

Statistical tests and data used for complex AIDS study

General types of statistical data and tests relevant to GWATCH are described in Additional file 1: Materials and Methods.
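As an illustration of the Polarization and Density tools described under GWATCH tools, the two procedures can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the GWATCH implementation (which the text says uses R): the function names are hypothetical, "inversion" of a QAS is assumed to mean the reciprocal of a ratio-type statistic (odds ratio or relative hazard), the logarithm is assumed to be base 10, windows are counted in SNPs only (not Kbp), and windows are assumed to be truncated at the ends of the SNP list.

```python
import math


def polarization_mark(d_prime, threshold=0.9):
    """Polarization mark of a neighboring SNP from its signed D' with the index SNP."""
    if d_prime > threshold:
        return 1
    if d_prime < -threshold:
        return -1
    return 0


def polarize_qas(qas, mark):
    """Invert a ratio-type QAS when the polarization mark is -1.

    Assumption: inversion is the reciprocal, so a value above 1.0
    ("susceptible") becomes a value below 1.0 ("resistant") and vice versa.
    Marks of 0 and 1 leave the value unchanged.
    """
    return 1.0 / qas if mark == -1 else qas


def density(p_values, index, n, combine="mean"):
    """Density at the SNP in position `index`.

    p_values: one list per test (Highway lane), each ordered by SNP position.
    n:        number of SNPs taken upstream and downstream of `index`.
    combine:  "mean" averages the per-lane averages, "max" takes the largest.
    """
    lane_averages = []
    for lane in p_values:
        lo = max(0, index - n)              # assumption: truncate at the ends
        hi = min(len(lane), index + n + 1)
        window = lane[lo:hi]
        # Step 1: average -log10(p) within the window for this lane.
        lane_averages.append(sum(-math.log10(p) for p in window) / len(window))
    # Step 2: combine the per-lane averages.
    if combine == "mean":
        return sum(lane_averages) / len(lane_averages)
    if combine == "max":
        return max(lane_averages)
    raise ValueError("combine must be 'mean' or 'max'")
```

For example, `density([[0.1, 0.01, 0.001], [1.0, 1.0, 1.0]], index=1, n=1, combine="max")` averages the -log10 p-values of the first lane over three SNPs and reports that lane because its average is the larger of the two. Restricting `p_values` to the lanes of one disease stage gives the per-stage variant mentioned in the text.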
Below we describe particular tests and data types used in the exemplary analysis of HIV/AIDS study data.

To illustrate GWATCH utility in the analysis of GWAS results we used data from multicenter longitudinal studies of several cohorts of patients exposed to the risk of HIV infection and/or already infected with HIV: ALIVE, DCG, HGDS, HOMER, LSOCA, MACS, MHCS and SFCC [11,34,37,38]. The total pool of patients was divided into three groups A, B and C based on ethnicity and timing of data development (see Table 2). A total of 5,922 patients were analyzed in all 3 groups.

All patient samples and genotypes were subjected to QC filtering depicted in Additional file 1: Table S8 as described previously [28,31]. Once final genotypes were obtained, population structure was assessed using the Principal Components Analysis module of Eigensoft software in European and African American populations [39] and structured SNP variants were excluded [28,39].

The statistical tests described below and listed in Table 3 and Additional file 3: Table S2, Additional file 4: Table S3, Additional file 5: Table S4 and Additional file 6: Table S5, were applied to the three patient study groups A, B and C (see Table 2). For each of the tests described below three genetic models were used (D, R and CD, see Section 1 of Supplementary Materials and Methods under “Genotype classification” in Additional file 1) unless stated otherwise.

Infection tests (INF)

The aim of infection tests is to specify association of any selected genotype with HIV infection. The original clinical data is of categorical type based on the population of seronegatives (SN, individuals who stay HIV-negative throughout the whole study) at the baseline with the response variable indicating serostatus at the endpoint and having three levels: “high risk exposed uninfected” (HREU) seronegatives, “other seronegatives” (OSN) and “seroconverters” (SC, individuals who entered the study as HIV-negative, but became HIV-positive during the study).
Three combinations of HIV status classifications were used to perform the categorical tests: “SC” vs. “HREU”, “SC” vs. “HREU” plus “OSN” and “SC” vs. “HREU” vs. “OSN”. In addition to the three genotype classifications described above (D, R and CD), the allelic model (A) was also used for this test. One more group of individuals based on infection status, “seroprevalents” (SP, individuals who entered the study already being HIV-positive), was not informative for this type of test and therefore was not included in it.

Disease progression tests

The disease progression tests were used for screening significant associations between AIDS progression and genotype. The original data were of right-censored survival type under four different criteria of AIDS disease: CD4 < 200 (level of CD4+ cells falling below 200 cells/mm3), AIDS-1987 (patient meeting criteria of 1987 CDC definition of AIDS), AIDS-1993 (patient meeting criteria of 1993 CDC definition of AIDS) and Death from AIDS. Only SC and SP individuals were included in this analysis. SC individuals were included into analysis with HIV infection date (date of seroconversion) as the baseline. SP individuals were included into categorical analysis with the date of the first visit as the baseline with some warnings.

Disease progression categorical analysis (PDCA) used the categorical tests for survival data (CTSD) approach described in Section 2 of Additional file 1: Materials and Methods. The CTSD were performed in dichotomous (PDCA2, two groups by the survival time or current status data) and multipoint (PDCAM, more than two groups by the survival time) forms. All individuals censored before the breakpoint were removed from the PDCA dichotomous analysis, as well as the SP individuals who failed before the breakpoint. All remaining individuals censored or failed after the breakpoint were classified into the group of long-term survivors (LTS, those who do not show AIDS symptoms before the breakpoint). The breakpoints used for classification in multipoint PDCA are stated in Additional file 3: Table S2, Additional file 4: Table S3, Additional file 5: Table S4.

Proportional hazard (PHAZ) analysis of disease progression used the proportional hazards survival tests (PHST) approach described in Section 2 of Additional file 1: Materials and Methods. These tests were performed for all four criteria of AIDS. Only SC individuals were included into PHAZ analysis.

Sequelae tests

Survival and categorical tests were performed for survival data on Kaposi’s sarcoma (KS), Pneumocystis carinii pneumonia (PCP), cytomegalovirus infection (CM), lymphoma (LY), mycobacterial infection (MYC) and other opportunistic infections (OOI). As in disease progression tests, survival sequelae tests included seroconverters only, while categorical sequelae tests included both seroconverters and seroprevalents.

Sequelae tests for any infection order classify patients based on whether a specific sequela occurred at all, irrespective of its order (i.e., whether it was the first sequela to occur for the patient). The survival tests (SEQSA) under the proportional hazards model as well as the progression categorical tests (SEQCA) were performed separately for each of the diseases described above.

Sequelae tests for the first infection classify patients based on whether a specific sequela occurred first or not. The survival tests (SEQS1) under the proportional hazards model as well as the progression categorical tests (SEQC1) were performed separately for each of the diseases described above.

Highly active antiretroviral therapy (HAART) tests

HAART tests were performed for the cohorts of patients who were subject to this type of treatment. Patients were classified based on either the level of suppression of HIV viral load or on the rebound of viral load following its suppression. Both survival (HRTS) and progression categorical (HRTC) tests were used for this analysis.

Hardy–Weinberg equilibrium (HWE) tests

The HWE tests are performed to control for the quality of data used for the screening of associations. Large deviations from HWE are not typical of large populations and thus signal genotyping error or some other type of data quality breach.

Availability and requirements

Project name: GWATCH
Project home page: gen-watch.org; https://github.com/DobzhanskyCenter/GWATCH
Operating system(s): Platform independent (runs in the web browser)
Programming language: HTML5, Javascript, PHP, Java, R, MySQL
Other requirements: WebGL-supporting web browser (Firefox 4.0 and above; Chrome 12 and above; under OS X runs also in Safari 5.1 and above)
License: GPL v2.0
Any restrictions to use by non-academics: no

Availability of supporting data

Archive of the version of GWATCH used in this paper is available from the GigaScience database [40], and for the most recent version please see our GitHub repository.

Additional files

Additional file 1: Supplementary Information. Contains Materials and Methods, Figures S1–S4, legends for Figure S5 and S6, legends for Table S1–S7, Table S8 and References.

Additional file 2: Table S1. Data Table of GWAS results: 100 rows of the Data Table containing SNPs, p-values and QASs for AIDS Restriction Genes dataset in Study Group A in the PARD3B region of chromosome 2. Full unabridged data tables for Groups A-C are available on the GWATCH web portal [33].

Additional file 3: Table S2. List of SNP association statistical tests and patient counts for Study Group A.

Additional file 4: Table S3. List of SNP association statistical tests and patient counts for Study Group B.

Additional file 5: Table S4. List of SNP association statistical tests and patient counts for Study Group C.

Additional file 6: Table S5. Summary of SNP association tests performed for each Study Group.

Additional file 7: Figure S5. TRAX PAGE, 2 page summary of all test results for a single SNP for a study group (e.g. p-values and QASs for HIV infection, AIDS progression using categorical and survival tests, AIDS sequelae, and HAART outcomes can be viewed and compared). TRAX PAGE can be generated de novo for any SNP of interest by placing the mouse tip over a significant tower/block in the HIGHWAY and selecting the TRAX PAGE option from the data window that appears (SNPs for which TRAX REPORT is available do not have a separate TRAX PAGE option in the data window since TRAX REPORT includes TRAX PAGE content).

Additional file 8: Figure S6. Detailed 11 page TRAX REPORT of derived statistics for all the tests accomplished including tables, bar graphs, survival curves and additional parameters for each test. TRAX REPORT can be generated de novo for the SNP of interest by placing the mouse tip over a significant tower/block in HIGHWAY and selecting the TRAX REPORT option from the data window that appears. TRAX REPORTs are available for 641 SNPs in 241 human genes that were genotyped to replicate the GWAS associations for Study Groups A-C (Additional file 9: Table S6).

Additional file 9: Table S6. List of 641 SNPs within 241 human genes that were assessed to replicate the GWAS associations for Study Groups A-C. For each of these SNPs a full TRAX REPORT (11 page report of figures and tables for each test) is available on the GWATCH web portal [33] as illustrated in Additional file 8: Figure S6.

CA 90027, USA. Departments of Ophthalmology and Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA.

genome-wide association signals. Nat Rev Genet 2009, 10:318–329.
Departmentof Epidemiology, The Johns Hopkins University Bloomberg School of PublicHealth, Baltimore, MD 21205, USA. 12British Columbia Centre for Excellence inHIV/AIDS, Vancouver, BC V6Z 1Y6, Canada. 13Division of AIDS, Faculty ofAdditional file 10: Table S7. Genomic regions of remarkable statisticalassociation (HITS) identified in ARG-GWAS by the screen for extremep-values.AbbreviationsAIDS: Acquired immunodeficiency syndrome; CDC: Centers for DiseaseControl and Prevention; CNV: Copy-number variation; CTSD: categoricaltests for survival data; GWAS: Genome-wide association study; GWATCH:Genome-Wide Association Tracks Chromosome Highway; HAART: HighlyActive Antiretroviral Therapy; HIPAA: The Health Insurance Portability andAccountability Act of 1996; HIV: Human immunodeficiency virus; HREU: Highrisk exposed uninfected; HTML5: Hypertext markup language, revision 5;IRB: Institutional Review Board; HWE: Hardy-Weinberg equilibrium;LD: Linkage disequilibrium; LTS: Long-term survivor; MAF: Minor allelefrequency; OSN: Other seronegatives; QAS: Quantitative Association Statistic;QC: Quality control; PDCA: Disease progression categorical analysis;PHAZ: Proportional hazard; PHP: Hypertext Preprocessor; SC: Seroconverter;SN: Seronegative; SNP: Single nucleotide polymorphism; SP: Seroprevalent;WGS: Whole genome sequencing.Competing interestsASv, SM, NC, PG and SJO are authors of the provisional application for patentUS 61/897,524 “Visualization, sharing and analysis of large data sets” filed on10/30/2013.Authors’ contributionsASv, SM, NC, PG, MR, PD, ASh, TKO and SJO developed GWATCH. LG, JT, SH,HHD, ES and SJO performed the original GWAS studies. SD, EG, DAJ, MVN,RH and ZLB contributed new epidemiological data from their AIDS cohorts.ASv, SM, NC and SJO wrote the manuscript. 
All authors read and approvedthe final manuscript.AcknowledgementsWe gratefully acknowledge the prior collaborative contribution of the patients,health care givers and investigators of HIV-AIDS cohorts who developed andcatalogued the demographic information used in this illustration.This work was supported in part by Russian Ministry of Science Mega-grant No.11.G34.31.0068; Stephen J. O’Brien, Principal Investigator. The HemophiliaGrowth and Development Study is funded by the National Institutes of Health,National Institute of Child Health and Human Development, R01-HD-41224.This work was supported by the National Eye Institute, National Institutes ofHealth (grants U10EY008052, U10EY008057, and U10EY008067). ZLB issupported by a New Investigator Award from the Canadian Institutes forHealth Research and a Scholar Award from the Michael Smith Foundationfor Health Research.Author details1Theodosius Dobzhansky Center for Genome Bioinformatics, St. PetersburgState University, St. Petersburg 199004, Russia. 2Department of Mathematics,St. Petersburg Electrotechnical University, St. Petersburg 197376, Russia.3Scientific Data Visualization Consultant, Turner, ACT 2612, Australia.4Genetics and Genomics Group, Advanced Technology Program,SAIC-Frederick, National Cancer Institute, Frederick, MD 21702, USA.5Department of Biology, Shepherd University, Shepherdstown, WV 25443,USA. 6Vanderbilt Technologies for Advanced Genomics, Office of Research,Vanderbilt University Medical Center, Nashville, TN 37204, USA. 7BiologyDepartment, University of Puerto Rico, Mayaguez, PR 00680, USA.8Department of Biostatistics, Rho, Inc., Chapel Hill, NC 27517, USA. 9Divisionof Hematology-Oncology, Children’s Hospital of Los Angeles, Los Angeles,10Medicine, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.14Faculty of Health Sciences, Simon Fraser University, Burnaby, BC V5A 1S6,Canada. 15Oceanographic Center, Nova Southeastern University, Ft.Lauderdale, FL 33004, USA.18. 
Moskvina V, Schmidt KM: On multiple-testing correction in genome-wideassociation studies. Genet Epidemiol 2008, 32:567–573.19. O’Brien SJ, Hendrickson S: Host genomic influences on HIV/AIDS. Genome BiolReceived: 6 June 2014 Accepted: 30 September 2014Published: 5 November 2014References1. 1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD,DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA:An integrated map of genetic variation from 1,092 human genomes.Nature 2012, 491:56–65.2. Wellcome Trust Case Control Consortium: Genome-wide association studyof 14,000 cases of seven common diseases and 3,000 shared controls.Nature 2007, 447:661–678.3. International HapMap 3 Consortium, Altshuler DM, Gibbs RA, Peltonen L,Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F,Peltonen L, Dermitzakis E, Bonnen PE, Altshuler DM, Gibbs RA, de Bakker PI,Deloukas P, Gabriel SB, Gwilliam R, Hunt S, Inouye M, Jia X, Palotie A, ParkinM, Whittaker P, Yu F, Chang K, Hawes A, Lewis LR, Ren Y, et al: Integratingcommon and rare genetic variation in diverse human populations.Nature 2010, 467:52–58.4. Hindorff LA, MacArthur J, Morales J, Junkins HA, Hall PN, Klemm AK,Manolio TA: A Catalog of Published Genome-Wide Association Studies.[http://www.genome.gov/gwastudies]5. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS,Manolio TA: Potential etiologic and functional implications of genome-wideassociation loci for human diseases and traits. Proc Natl Acad Sci 2009,106:9362–9367.6. Jiang YH, Yuen RK, Jin X, Wang M, Chen N, Wu X, Ju J, Mei J, Shi Y, He M,Wang G, Liang J, Wang Z, Cao D, Carter MT, Chrysler C, Drmic IE, Howe JL,Lau L, Marshall CR, Merico D, Nalpathamkalam T, Thiruvahindrapuram B,Thompson A, Uddin M, Walker S, Luo J, Anagnostou E, Zwaigenbaum L,Ring RH, et al: Detection of clinically relevant genetic variants in autismspectrum disorder by whole-genome sequencing. Am J Hum Genet 2013,93:249–263.7. 
Kilpivaara O, Aaltonen LA: Diagnostic cancer genome sequencing and thecontribution of germline variants. Science 2013, 339:1559–1562.8. Wade CH, Tarini BA, Wilfond BS: Growing up in the genomic era:implications of whole-genome sequencing for children, families, andpediatric practice. Annu Rev Genomics Hum Genet 2013, 14:535–555.9. Cirulli ET, Goldstein DB: Uncovering the roles of rare variants in commondisease through whole-genome sequencing. Nat Rev Genet 2010,11:415–425.10. Hutcheson HB, Lautenberger JA, Nelson GW, Pontius JU, Kessing BD,Winkler CA, Smith MW, Johnson R, Stephens R, Phair J, Goedert JJ, DonfieldS, O’Brien SJ: Detecting AIDS restriction genes: from candidate genes togenome-wide association discovery. Vaccine 2008, 26:2951–2965.11. O’Brien SJ, Nelson GW: Human genes that limit AIDS. Nat Genet 2004,36:565–574.12. Bushman FD, Malani N, Fernandes J, D'Orso I, Cagney G, Diamond TL, Zhou H,Hazuda DJ, Espeseth AS, König R, Bandyopadhyay S, Ideker T, Goff SP, KroganNJ, Frankel AD, Young JA, Chanda SK: Host cell factors in HIV replication:meta-analysis of genome-wide studies. PLoS Pathog 2009,5:e1000437.13. Goldstein DB: Common genetic variation and human traits. N Engl J Med2009, 360:1696–1698.14. Conneely KN, Boehnke M: So many correlated tests, so little time! Rapidadjustment of P values for multiple correlated tests. Am J Hum Genet2007, 81:1158–1168.15. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JP,Hirschhorn JN: Genome-wide association studies for complex traits:consensus, uncertainty and challenges. Nat Rev Genet 2008, 9:356–369.16. Johnson RC, Nelson GW, Troyer JL, Lautenberger JA, Kessing BD, Winkler CA,O'Brien SJ: Accounting for multiple comparisons in a genome wideassociation study (GWAS). BMC Genomics 2010, 11:724.17. Ioannidis JP, Thomas G, Daly MJ: Validating, augmenting and refining2013, 14:201.20. Dudbridge F, Gusnanto A: Estimation of significance thresholds forgenome wide association scans. 
Genet Epidemiol 2008, 32:227–234.21. Best practices in GWAS. In Genome Technology Supplemental report 2009.[http://www.genomeweb.com/node/917734]Svitin et al. GigaScience 2014, 3:18 Page 10 of 10http://www.gigasciencejournal.com/content/3/1/1822. Johnson AD, O’Donnell CJ: An open access database of genome-wideassociation results. BMC Med Genet 2009, 10:6.23. Hayden EC: Geneticists push for global data-sharing. Nature 2013, 498:16–17.24. Greely HT: The uneasy ethical and legal underpinnings of large-scalegenomic biobanks. Annu Rev Genomics Hum Genet 2007, 8:343–364.25. O’Brien SJ: Stewardship of human biospecimens, DNA, genotype, andclinical data in the GWAS era. Annu Rev Genomics Hum Genet 2009,10:193–209.26. Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y: Identifying personalgenomes by surname inference. Science 2013, 339:321–324.27. O’Brien SJ, Nelson GW, Winkler CA, Smith MW: Polygenic and multifactorialdisease gene association in man: Lessons from AIDS. Annu Rev Genet2000, 34:563–591.28. Troyer JL, Nelson GW, Lautenberger JA, Chinn L, McIntosh C, Johnson RC,Sezgin E, Kessing B, Malasky M, Hendrickson SL, Li G, Pontius J, Tang M, AnP, Winkler CA, Limou S, Le Clerc S, Delaneau O, Zagury JF, Schuitemaker H,van Manen D, Bream JH, Gomperts ED, Buchbinder S, Goedert JJ, Kirk GD,O'Brien SJ: Genome-wide association study implicates PARD3B-basedAIDS restriction. J Infect Dis 2011, 203:1491–1502.29. Hendrickson SL, Lautenberger JA, Chinn LW, Malasky M, Sezgin E, Kingsley LA,Goedert JJ, Kirk GD, Gomperts ED, Buchbinder SP, Troyer JL, O’Brien SJ: Geneticvariants in nuclear-encoded mitochondrial genes influence AIDSprogression. PLoS One 2010, 5:e12862.30. Chinn LW, Tang M, Kessing BD, Lautenberger JA, Troyer JL, Malasky MJ,McIntosh C, Kirk GD, Wolinsky SM, Buchbinder SP, Gomperts ED, Goedert JJ,O’Brien SJ: Genetic associations of variants in genes encodingHIV-dependency factors required for HIV-1 infection. J Infect Dis 2010,202:1836–1845.31. 
Herbeck JT, Gottlieb GS, Winkler CA, Nelson GW, An P, Maust BS, Wong KG,Troyer JL, Goedert JJ, Kessing BD, Detels R, Wolinsky SM, Martinson J,Buchbinder S, Kirk GD, Jacobson LP, Margolick JB, Kaslow RA, O'Brien SJ,Mullins JI: Multistage genomewide association study identifies a locus at1q41 associated with rate of HIV-1 disease progression to clinical AIDS.J Infect Dis 2010, 201:618–626.32. Dean M, Carrington M, Winkler C, Huttley GA, Smith MW, Allikmets R,Goedert JJ, Buchbinder SP, Vittinghoff E, Gomperts E, Donfield S, Vlahov D,Kaslow R, Saah A, Rinaldo C, Detels R, O’Brien SJ: Genetic restriction ofHIV-1 infection and progression to AIDS by a deletion allele of the CKR5structural gene. Science 1996, 273:1856–1862.33. GWATCH: Genome-Wide Association Tracks Chromosome Highway.[http://gen-watch.org]34. McLaren PJ, Coulonges C, Ripke S, van den Berg L, Buchbinder S, CarringtonM, Cossarizza A, Dalmau J, Deeks SG, Delaneau O, De Luca A, Goedert JJ,Haas D, Herbeck JT, Kathiresan S, Kirk GD, Lambotte O, Luo M, Mallal S, vanManen D, Martinez-Picado J, Meyer L, Miro JM, Mullins JI, Obel N, O'Brien SJ,Pereyra F, Plummer FA, Poli G, Qi Y, et al: Association study of commongenetic variants and HIV-1 acquisition in 6,300 infected cases and 7,200controls. PLoS Pathog 2013, 9:e1003515.35. GWAS Central. [www.gwascentral.org]36. Beck T, Hastings RK, Gollapudi S, Free RC, Brookes AJ: GWAS Central: acomprehensive resource for the comparison and interrogation ofgenome-wide association studies. Eur J Hum Genet 2014, 22:949–952.37. Sezgin E, van Natta ML, Ahuja A, Lyon A, Srivastava S, Troyer JL, O'Brien SJ,Jabs DA, Studies of the ocular complications of AIDS research group:Association of host genetic risk factors with the course ofcytomegalovirus retinitis in patients infected with humanimmunodeficiency virus. Am J Ophthalmol 2011, 151:999–1006.e4.38. 
Harris M, Nosyk B, Harrigan R, Lima VD, Cohen C, Montaner J:Cost-effectiveness of antiretroviral therapy for multidrug-resistant HIV:past, present, and future. AIDS Res Treat 2012, 2012:595762.39. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D:Principal components analysis corrects for stratification in genome-wideassociation studies. Nat Genet 2006, 38:904–909.Submit your next manuscript to BioMed Centraland take full advantage of: • Convenient online submission• Thorough peer review• No space constraints or color figure charges• Immediate publication on acceptance• Inclusion in PubMed, CAS, Scopus and Google Scholar• Research which is freely available for redistribution40. Svitin A, Malov S, Cherkasov N, Geerts P, Rotkevich M, Dobrynin P,Shevchenko A, Guan L, Troyer J, Hendrickson S, Hutcheson Dilks H, OleksykTK, Donfield S, Gomperts E, Jabs DA, Sezgin E, Van Natta M, Harrigan PR,Brumme ZL, O'Brien SJ: Software and Supporting Material for: “GWATCH:A Web Platform For Automated Gene Association Discovery Analysis”.In GigaScience Database. 2014. http://dx.doi.org/10.5524/10.5524/100109.doi:10.1186/2047-217X-3-18Cite this article as: Svitin et al.: GWATCH: a web platform for automatedgene association discovery analysis. GigaScience 2014 3:18.Submit your manuscript at www.biomedcentral.com/submit

