Open Collections

UBC Faculty Research and Publications

MD-SeeGH: a platform for integrative analysis of multi-dimensional genomic data
Chi, Bryan; deLeeuw, Ronald J; Coe, Bradley P; Ng, Raymond T; MacAulay, Calum; Lam, Wan L
May 20, 2008

Full Text

BMC Bioinformatics (BioMed Central)
Open Access
Software

MD-SeeGH: a platform for integrative analysis of multi-dimensional genomic data

Bryan Chi*1, Ronald J deLeeuw1, Bradley P Coe1, Raymond T Ng1,2, Calum MacAulay3 and Wan L Lam1

Address: 1Department of Cancer Genetics and Developmental Biology, British Columbia Cancer Research Centre, Vancouver, BC, Canada, 2Department of Computer Science, University of British Columbia, Vancouver, BC, Canada and 3Department of Cancer Imaging, British Columbia Cancer Research Centre, Vancouver, BC, Canada

* Corresponding author

Abstract

Background: Recent advances in global genomic profiling methodologies have enabled multi-dimensional characterization of biological systems. Complete analysis of these genomic profiles requires an in-depth look at parallel profiles of segmental DNA copy number status, DNA methylation state, single nucleotide polymorphisms, as well as gene expression profiles. Due to the differences in data types, it is difficult to conduct parallel analysis of multiple datasets from diverse platforms.

Results: To address this issue, we have developed an integrative genomic analysis platform, MD-SeeGH, a software tool that allows users to rapidly and directly analyze genomic datasets spanning multiple genomic experiments. With MD-SeeGH, users have the flexibility to easily update datasets in accordance with new genomic builds, make a quality assessment of data using the filtering features, and identify genetic alterations within single or across multiple experiments.
Multiple sample analysis in MD-SeeGH allows users to compare profiles from many experiments alongside tracks containing detailed localized gene information, microRNA, CpG islands, and copy number variations.

Conclusion: MD-SeeGH is a new platform for the integrative analysis of diverse microarray data, facilitating multiple profile analyses and group comparisons.

Published: 20 May 2008
BMC Bioinformatics 2008, 9:243 doi:10.1186/1471-2105-9-243
Received: 30 October 2007
Accepted: 20 May 2008
© 2008 Chi et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Recent advances in global genomic profiling methodologies have enabled multi-dimensional characterization of biological systems. The deciphering of downstream effects of genetic and epigenetic alterations on expression patterns is paramount in understanding disease phenotype and requires the integration of segmental DNA copy number status, DNA methylation state and single nucleotide polymorphism (SNP) status. The large scale generation of such data has created a need for robust software to integrate multiple large genetically linked data sets generated on diverse microarray platforms. Although several visualization software programs are available publicly (for example [1-5]), there is a growing demand for new bioinformatics tools that allow for the concerted analysis of multiple genome-wide experiments derived from different experimental platforms [6]. Blue Fuse [7] and CGH Analytics [8], two commercially available software tools, offer integrative analysis with expression data, but neither contains the full feature set that we deem necessary (Table 1).
SeeGH (v1.6) was initially developed primarily to view array CGH data [2], but as we continued to use and develop the software we realized that there was a need for the combined analysis of multi-platform data, which required significant upgrades to the initial version of SeeGH. Here we present the MultiDimensional-SeeGH (MD-SeeGH) analysis platform, a powerful software tool that allows users to quickly and easily analyze genomically anchored datasets drawn from multiple genomic experiments (Figure 1).

Implementation

MD-SeeGH was created using Borland's C++ Builder 6 development platform. MySQL, which is freely available for download, is used as the backend database server. The MD-SeeGH software was developed and tested on Windows XP and Vista. The software and documentation are publicly available online [9].

From our interaction with researchers and clinicians, we note that some of the key features required by integrative analysis software for handling diverse genomic datasets are: (1) flexibility, (2) data quality assessment, (3) visualization, (4) single and multiple sample analyses, (5) multi-group analyses, and (6) comprehensive reporting. To highlight how MD-SeeGH performs these functions, we discuss the parallel analysis of genomic and epigenomic array comparative genomic hybridization (CGH) data as well as the analysis of multidimensional data sets including gene expression, comparative genomic hybridization, differential methylation, and single nucleotide polymorphisms.

Results and discussion

The following sections describe the functional modules of MD-SeeGH, which are summarized in the flow chart (Figure 1).

Data tracking, preprocessing and import of data

Microarray data captured after hybridization, scanning, and spot finding are imported into MD-SeeGH as tab-delimited text files. At this time each dataset can be annotated to facilitate data tracking, easy recall, and group definition. Clinical information can also be entered and associated with each dataset.
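
The import step described above can be pictured with a short Python sketch. It is an illustration only, under stated assumptions: MD-SeeGH itself is a C++ application, and the column names (`spot_id`, `log2_ratio`), the `Dataset` class, and the `load_dataset` helper are hypothetical and are not taken from the software or its file format.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """One imported experiment plus the free-form annotations attached to it."""
    name: str
    annotations: dict = field(default_factory=dict)
    spots: list = field(default_factory=list)


def load_dataset(name, handle, **annotations):
    """Read tab-delimited spot data and attach tracking annotations.

    The column names used here are assumptions made for this sketch.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    spots = [
        {"spot_id": row["spot_id"], "log2_ratio": float(row["log2_ratio"])}
        for row in reader
    ]
    return Dataset(name=name, annotations=dict(annotations), spots=spots)


# A tiny in-memory file stands in for a real export from spot-finding software.
example = io.StringIO("spot_id\tlog2_ratio\nspot_A\t0.12\nspot_B\t-0.85\n")
dataset = load_dataset("sample_01", example, group="tumour", platform="SMRT")
```

Keeping the annotations next to the spot data is what later makes it possible to recall a dataset by name or to define groups from a shared annotation value.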
Microarray image data are commonly normalized to remove intensity and spatial biases. For example, the output of a stepwise normalization algorithm, CGH-Norm, is seamlessly imported into MD-SeeGH [10].

Flexible genome mapping and annotation

To relate array spot information to a specific genomic map location, it is important to use the appropriate genome build (e.g. UCSC Human Genome Freeze Mar 2006/hg18). We have embedded the genomic locations of array features within MD-SeeGH for all available genome builds utilized by the common genomic microarray platforms. New mapping information (future builds) can be easily imported into MD-SeeGH using tab-delimited text files containing base pair information for each array feature. This gives the user the flexibility to remap entire datasets against any genomic build without having to transform each one manually and reload it into MD-SeeGH [see Additional file 1].

Table 1: Feature comparison of integrative analysis platforms
Platforms compared (references): MD-SeeGH; VAMP/CAPweb [6, 20]; ISACGH [21]; CGH Analytics [8]; CGHPRO [1]; CGH Explorer [3]; Blue Fuse [7]; ArrayCyGHt [22]; M-CGH [5]; SeeGH v1.6 [2].
Features compared: Segmentation; Normalization; ISCN Reporting; Integrative Analysis with Expression data; Multiple Sample Visualization; Gene Tracks; Other Tracks (i.e. miRNA, CNV, CpG island, etc.); Links to external websites (i.e. NCBI, UCSC, etc.); Mapping files (different Genomic Builds); Integration of third party analysis tools; Frequency Plot; Heatmap; Group analysis; Free public access; Data storage (samples only loaded once); Web based software.

Quality assessment of data

MD-SeeGH allows filtering of spot data based on the standard deviation of replicate measurements and on spot signal to noise ratios, using user-defined thresholds.
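
A minimal Python sketch of this kind of filter follows. It is not the MD-SeeGH implementation: the field names (`replicates`, `snr`), the `filter_spots` helper, and the default thresholds are assumptions chosen only to make the idea concrete.

```python
from statistics import mean, stdev


def filter_spots(spots, max_sd=0.075, min_snr=3.0):
    """Keep spots whose replicates agree and whose signal stands out from noise.

    Both thresholds are hypothetical defaults; in practice they are user-defined.
    Each spot needs at least two replicate measurements for a standard deviation.
    """
    if not spots:
        raise ValueError("no spots to filter")
    kept = [
        spot
        for spot in spots
        if stdev(spot["replicates"]) <= max_sd and spot["snr"] >= min_snr
    ]
    percent_discarded = 100.0 * (len(spots) - len(kept)) / len(spots)
    average_snr = mean(spot["snr"] for spot in spots)
    return kept, percent_discarded, average_snr


spots = [
    {"id": "a", "replicates": [0.10, 0.12, 0.11], "snr": 12.0},     # passes both checks
    {"id": "b", "replicates": [0.10, 0.60, -0.30], "snr": 15.0},    # replicates disagree
    {"id": "c", "replicates": [0.00, 0.01, 0.02], "snr": 1.5},      # signal too weak
    {"id": "d", "replicates": [-0.50, -0.52, -0.51], "snr": 8.0},   # passes both checks
]
kept, percent_discarded, average_snr = filter_spots(spots)
```

The two summary values returned alongside the kept spots are the per-dataset figures a user would inspect to judge overall data quality.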
The percentage of spots discarded by filtering and the average signal to noise ratio are displayed for each experimental dataset. A recently described phenomenon in array CGH experimentation (regardless of the array platform used) is a recurrent artefact pattern that is independent of copy number status [11,12]. We have created a tool to measure and compensate for this recurrent baseline pattern (noise) within array CGH experiments [11] [see Additional file 1].

Figure 1: Flow Diagram Summarizing the MD-SeeGH Platform.

Detection of genetic alterations

Once the imported data have undergone the appropriate quality assessment, analysis can begin with the identification of alterations for each sample (Analysis I, Figure 1). Many segmentation algorithms have been developed to identify regions of alteration, each with its own strengths and weaknesses [13]. Given that each microarray platform may require a specific segmentation algorithm, for example a modified Hidden Markov Model for segmentation of array CGH data [14], MD-SeeGH allows the user to import the output from such algorithms as CNAHMMer, DNAcopy and aCGH Smooth [14-16]. (Any dataset in which each spot is annotated with a call can also be imported.) The result of each segmentation output is displayed beside each measured experimental data feature to assist the user in assigning copy number representation to the data within an experiment (Figure 2) [see Additional file 1].

Gene and additional tracks

Accessible information embedded in MD-SeeGH includes all annotated RefSeq genes, microRNAs [17], CpG islands [18], and natural copy number variations (CNVs) [19]. In addition, the user has the flexibility to display any genomically annotated field as a track beside the experimental data (Figure 3a). Data within gene tracks can be selected to display information about each gene from multiple sources (OMIM, Entrez mRNA, Entrez protein, Pubmed, and UCSC genome browser). The gene and additional tracks allow the user to determine whether a specific spot on their array overlaps with a specific gene, microRNA, CNV, etc. Up to 4 tracks are visible at all zoom levels during analysis and visualization. Alternately, an entire […] can be displayed via the UCSC genome browser at the touch of a button [see Additional file 1].

Multiple sample analyses

In MD-SeeGH, data from up to 50 experiments can be aligned for direct comparison, allowing cross-platform analysis or the viewing of multiple patient samples from the same disease type (Analysis II, Figure 1) (Figure 3). Multiple samples are viewed one chromosome at a time, and the chromosome can be changed via a drop-down box. There is no limit on the density of arrays when viewing multiple samples, although larger arrays take longer to load. During testing of the Multiple Alignment feature we were able to load 50 Agilent 244 K arrays in 1 minute and 20 seconds on a computer with 2 GB of RAM and a 2.7 GHz processor. On the same machine, 50 SMRT 32 K arrays took less than 15 seconds to load. In addition, up to 100 experiments can be analyzed and summarized as a heatmap (Figure 4). The heatmap is generated by calculating a moving average across each experiment and provides a quick way to find regions of interest across a large number of experiments. A given region of interest identified on the heatmap can be further investigated in detail by directly switching to multiple alignment of individual profiles.
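
The moving-average step behind the heatmap can be sketched in a few lines of Python. The text does not state the window size or how the ends of a chromosome are handled, so the centred window, the truncation at the edges, and the function names `moving_average` and `heatmap_columns` are all assumptions of this sketch rather than a description of MD-SeeGH itself.

```python
def moving_average(values, window=3):
    """Centred moving average; the window is truncated at the ends of the list."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def heatmap_columns(experiments, window=3):
    """Smooth each experiment independently, giving one heatmap column per sample.

    `experiments` maps a sample name to its ratios ordered by genomic position.
    """
    return {name: moving_average(ratios, window) for name, ratios in experiments.items()}


columns = heatmap_columns({"sample_01": [0, 0, 3, 3, 3, 0]}, window=3)
```

Smoothing each sample on its own keeps single noisy spots from dominating a column, so runs of consistently high or low values stand out when many columns are viewed side by side.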
MD-SeeGH also has the ability to analyze up to 1000 samples as a frequency plot showing the percentage of samples altered (Figure 5). The frequency plot feature gives the user the ability to identify minimal regions of interest across large datasets. Frequency plots can be created within MD-SeeGH for any datasets from the same array platform that have called data. The frequency of alteration is calculated for each spot of the selected array platform. When creating frequency plots within MD-SeeGH, the maximum density allowed is 25,000 spots per chromosome for a total of 600 K spots in the array. Frequency plot data can also be created externally and loaded into MD-SeeGH. When analysing frequency plots, whole genome and individual chromosome plots are available to the user [see Additional file 1].

Platform independence and integrative analyses of multi-dimensional datasets

Any data that is tied to a genomic base pair position can be loaded into MD-SeeGH. This includes single channel Affymetrix SNP arrays and double channel Agilent, Nimblegen, Illumina, and SMRT arrays. Within multiple sample analyses, it is not a requirement that all data be created from the same microarray platform. This capability can be utilized to assess the differing characteristics of microarray platforms (Figure 3) or to combine data derived from the latest platforms with data created using older platforms. This functionality is increasingly desirable for analyzing multi-dimensional datasets, for example, the integration of methylation patterns, copy number alterations and single nucleotide polymorphisms (Figure 6a,b). However, its main advantage is in analyzing gene expression changes in the context of these genetic features (Figure 6c).

Analysis of multiple groups

An additional level of analysis is the ability to compare two groups of experiments to identify differences between them. In MD-SeeGH this is achieved through the comparison of frequency plots (Figure 7). Permutation testing, Fisher's exact tests, and other statistical tests can be easily conducted using data exported from MD-SeeGH. These statistical analyses provide p-values for the differences between the two groups [see Additional file 1].

Exporting results and analysis reporting

MD-SeeGH provides three main ways to export data. Firstly, any image can be saved as either a jpeg or a bitmap file. Secondly, noting that new analysis algorithms are constantly being developed, we built in the ability to export data from MD-SeeGH in a tab-delimited text format that can be readily manipulated with other software/statistical packages. Finally, in a clinical or repetitive standardized analysis setting, the attachment of an entire array CGH data file to a report is unrealistic; therefore, we allow direct generation of a cytogenetic report, formatted with the latest ISCN standard for array experiments (Figure 8) [see Additional file 1].

Conclusion

In conclusion, we have developed a new platform for the integrative analysis of diverse microarray data, facilitating multiple profile analyses and group comparisons.

Availability and requirements

Project name: MD-SeeGH
Project home page: http://www.arraycgh.ca
Operating system: Microsoft Windows XP, Microsoft Windows Vista
Programming language: C++, SQL
Other requirements: MySQL database
License: Academic software license must be agreed upon during installation.
Any restrictions to use by non-academics: Yes

Authors' contributions

BC was the principal programmer of the source code. RJdL, BPC and RTN contributed ideas for software features and requirements. CM and WLL are principal investigators of this work. All authors contributed to writing the manuscript. All authors read and approved the final manuscript.

Additional material

Additional file 1: Supplementary figures.

Acknowledgements

We thank Spencer Watson and Dr. Doug Horsman for useful discussion and Andre Soesilo, Tony Qin, and Philip Wang for help with software development. This work was supported by funds from Genome Canada/British Columbia, Canadian Institutes for Health Research, Canadian Breast Cancer Research Alliance and NIH (NIDCR) R01 DE15965.

Figure 2: Identification and Annotation of Altered Regions. The Annotate Regions option is an analysis tool that allows you to record regions of interest (i.e. amplifications, deletions), save them to the database, and create ISCN reports. Annotating regions can be used side by side with segmentation probabilities to verify the called regions and can also be used to compare amplifications and deletions across multiple samples or create Frequency Plots. Numbers indicate genomic view of (1) annotated regions and (2) segmentation calls, and chromosome view of (3) annotation form where the user can mark the region as an amplification, gain, deletion, loss or neutral region and (4) segmentation calls which aid in making the calls.

Figure 3: Multiple Analyses of Different Platforms. Comparison of the same sample (BT474 cell line) across the following different array platforms: (1) RefSeq GeneTrack, (2) SMRT array, (3) Affymetrix GeneChip human mapping 500 K set, (4) Agilent 244A, and (5) VUMC MACF human 30 K.

Figure 4: Heatmap for Defining Recurring Features. Heatmap allows the user to analyze up to 100 samples and find common regions of amplification or deletion across groups of samples. (1) Each column represents a moving average heatmap of a single sample. Amplifications are shaded red and deletions are shaded green. The greater the moving average ratio the brighter the red (-) or green (+).

Figure 5: Frequency Plot. The Frequency Plot can be used to analyze a group of samples and find minimal regions of amplifications or deletions. Frequency plot scoring for up to 1000 samples can be created within SeeGH or created externally and loaded into SeeGH. Once loaded each sample is stored in the SeeGH database. Amplifications are shaded red and deletions are shaded green. Left panel (A) shows genome view and right panel (B) shows the chromosome view.

Figure 6: Analysis of Multi-Dimensional Data: Integration of Different Types of Data. a. Integration of copy number (CN) data in the context of SNP profile for MCF7 cells. (1) SMRT array CGH profile displayed alongside (2) Affymetrix SNP array: Homozygous AA on left, Heterozygous AB in middle, Homozygous BB on right. Region between the blue lines (3) shows a copy number loss (left) on chromosome 1 associated with LOH (right). b. Integration of epigenetic and genomic profiles. (4) Methylated DNA immunoprecipitation (MeDIP) array CGH profile displayed alongside a (5) SMRT array CGH. Region between the blue lines (6) shows both hypomethylation (left) and copy number change (right). c. Integration of Array CGH and Lymphochip cDNA Gene Expression. (7) SMRT array CGH profile displayed alongside a (8) cDNA expression (EXP) profile (Lymphochip). Yellow highlighted region (9) shows a BCL2 gain and overexpression.

Figure 7: Group Comparison. Frequency plot comparison of two different groups (derived from two frequency plot datasets) representing different disease types. Once frequency plots have been loaded/created in SeeGH the user can compare two frequency plots using the overlay feature. Each group is a different color (Group 1: Fuchsia, Group 2: Blue) and any overlapping regions are a third color (Intersection: Yellow). This is a useful feature to determine similarities and differences between two groups of samples.

Figure 8: Sample ISCN report exported from MD-SeeGH. Tab-delimited ISCN report exported from MD-SeeGH. Regions are noted with (1) chromosome banding position, (2) first and last clone/feature of the region, and (3) whether it is an amplification or deletion. Amplifications or copy number gains are marked as 'x3', while deletions or copy number losses are marked as 'x1'.

References

1. Chen W, Erdogan F, Ropers HH, Lenzner S, Ullmann R: CGHPRO -- a comprehensive data analysis tool for array CGH. BMC Bioinformatics 2005, 6:85.
2. Chi B, DeLeeuw RJ, Coe BP, MacAulay C, Lam WL: SeeGH -- a software tool for visualization of whole genome array comparative genomic hybridization data. BMC Bioinformatics 2004, 5:13.
3. Lingjaerde OC, Baumbusch LO, Liestol K, Glad IK, Borresen-Dale AL: CGH-Explorer: a program for analysis of array-CGH data. Bioinformatics 2005, 21(6):821-822.
4. Lockwood WW, Chari R, Chi B, Lam WL: Recent advances in array comparative genomic hybridization technologies and their applications in human genetics. Eur J Hum Genet 2006, 14(2):139-148.
5. Wang J, Meza-Zepeda LA, Kresse SH, Myklebost O: M-CGH: analysing microarray-based CGH experiments. BMC Bioinformatics 2004, 5:74.
6. La Rosa P, Viara E, Hupe P, Pierron G, Liva S, Neuvial P, Brito I, Lair S, Servant N, Robine N, Manie E, Brennetot C, Janoueix-Lerosey I, Raynal V, Gruel N, Rouveirol C, Stransky N, Stern MH, Delattre O, Aurias A, Radvanyi F, Barillot E: VAMP: visualization and analysis of array-CGH, transcriptome and other molecular profiles. Bioinformatics 2006, 22(17):2066-2073.
7. Blue Fuse Software
8. CGH Analytics Software
9. Flintbox
10. Khojasteh M, Lam WL, Ward RK, MacAulay C: A stepwise framework for the normalization of array CGH data. BMC Bioinformatics 2005, 6:274.
11. Blesa D, Suela J, Melchor L, Alvarez S, Largo C, Ferreira B, Calasanz MJ, Cifuentes F, Cigudosa JC: Artefacts in aCGH. La Grande Motte, France; 2006.
12. Marioni JC, Thorne NP, Valsesia A, Fitzgerald T, Redon R, Feigler H, Andrews TD, Stranger BE, Lynch AG, Dermitzakis ET, Carter NP, Tavare S, Hurles ME: Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization. Genome Biology 2007, 8:R228.
13. Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005, 21(19):3763-3770.
14. Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics 2006, 22(14):e431-9.
15. Jong K, Marchiori E, Meijer G, Vaart AV, Ylstra B: Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics 2004, 20(18):3636-3637.
16. Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics 2007, 23(6):657-663.
17. Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ: miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res 2006, 34:D140-D144.
18. UCSC Genome Browser
19. Wong KK, deLeeuw RJ, Dosanjh NS, Kimm LR, Cheng Z, Horsman DE, MacAulay C, Ng RT, Brown CJ, Eichler EE, Lam WL: A comprehensive analysis of common copy-number variations in the human genome. Am J Hum Genet 2007, 80(1):91-104.
20. Liva S, Hupe P, Neuvial P, Brito I, Viara E, La Rosa P, Barillot E: CAPweb: a bioinformatics CGH array Analysis Platform. Nucleic Acids Res 2006, 34(Web Server issue):W477-W481.
21. Conde L, Montaner D, Burquet-Castell J, Taragga J, Medina I, Al-Shahrour F, Dopazo J: ISACGH: a web-based environment for the analysis of Array CGH and gene expression which includes functional profiling. Nucleic Acids Res 2007, 35(Web Server issue):W81-5.
22. Kim SY, Nam SW, Lee SH, Park WS, Yoo NJ, Lee JY, Chung YJ: ArrayCyGHt: a web application for analysis and visualization of array-CGH data. Bioinformatics 2005, 21(10):2554-2555.

