CGMIM: Automated text-mining of Online Mendelian Inheritance in Man (OMIM) to identify genetically-associated… Bajdik, Chris D; Kuo, Byron; Rusaw, Shawn; Jones, Steven; Brooks-Wilson, Angela Mar 29, 2005


BMC Bioinformatics (BioMed Central)
Open Access
Software

CGMIM: Automated text-mining of Online Mendelian Inheritance in Man (OMIM) to identify genetically-associated cancers and candidate genes

Chris D Bajdik (1,*), Byron Kuo (1), Shawn Rusaw (2), Steven Jones (2) and Angela Brooks-Wilson (1,2)

Address: (1) Cancer Control Research Program, BC Cancer Agency, 600 West 10th Avenue, Vancouver BC, V5Z 4E6, Canada and (2) Genome Sciences Centre, BC Cancer Agency, 600 West 10th Avenue, Vancouver BC, V5Z 4E6, Canada

* Corresponding author

Abstract

Background: Online Mendelian Inheritance in Man (OMIM) is a computerized database of information about genes and heritable traits in human populations, based on information reported in the scientific literature. Our objective was to establish an automated text-mining system for OMIM that will identify genetically-related cancers and cancer-related genes. We developed the computer program CGMIM to search for entries in OMIM that are related to one or more cancer types. We performed manual searches of OMIM to verify the program results.

Results: In the OMIM database on September 30, 2004, CGMIM identified 1943 genes related to cancer. BRCA2 (OMIM *600185), BRAF (OMIM *164757) and CDKN2A (OMIM *600160) were each related to 14 types of cancer. There were 45 genes related to cancer of the esophagus, 121 genes related to cancer of the stomach, and 21 genes related to both. Analysis of CGMIM results indicates that fewer than three gene entries in OMIM would be expected to mention both by chance, and the more than seven-fold discrepancy suggests cancers of the esophagus and stomach are more genetically related than current literature suggests.

Conclusion: CGMIM identifies genetically-related cancers and cancer-related genes. In several ways, cancers with shared genetic etiology are anticipated to lead to further etiologic hypotheses and advances regarding environmental agents.
CGMIM results are posted monthly and the source code can be obtained free of charge from the BC Cancer Research Centre website.

Published: 29 March 2005
Received: 29 October 2004
Accepted: 29 March 2005

BMC Bioinformatics 2005, 6:78 doi:10.1186/1471-2105-6-78

© 2005 Bajdik et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Cancers are complex diseases with multiple genetic and environmental factors contributing to their development. The most prominent success stories in cancer genetics to date have involved genes that produce a recognizable pattern of disease within certain rare families. Most cancers, however, are sporadic and appear in people who do not have a clear family history of the disease. These cancers are currently being studied in epidemiological investigations that examine genetics, environmental exposures or both. These studies often compare "cases" or affected individuals to "controls" or unaffected individuals, to determine which group has a higher frequency of a particular gene variant or a greater level of exposure to an environmental agent. The studies require logical hypotheses regarding the genes to be tested and clear criteria for case definition. Cases may be defined as people who have any of several types of cancer, if those types are related. For example, epidemiologic studies of BRCA1 mutation carriers might benefit from information collected about both breast and ovarian cancer cases. But what genes are associated with a group of cancers, and what cancers are associated with a particular gene? The answers can be found in literature regarding cancer genetics, microbiology, clinical medicine, epidemiology and other sciences.
More than 1% of all human genes are associated with cancer [1] and information about the association between genes and cancer changes constantly.

Online Mendelian Inheritance in Man (OMIM) is a computerized database of information about genes and heritable traits in human populations. The database was created by Victor McKusick at Johns Hopkins University and is now edited by him and colleagues around the world [2]. We consider it a particularly high-quality data source because it is curated by a knowledgeable team, based on information reported in the scientific literature, and continuously updated. OMIM is maintained on the Internet by the National Center for Biotechnology Information at the US National Institutes of Health [3].

Data mining aims to discover unexpected trends and patterns from large sets of data [4], and the rapid growth of biomedical literature underscores the value of text-mining in particular. Text-mining has been described as a modular process involving document categorization, named entity tagging, fact and information extraction, and collection-wide analysis [5]. In document categorization, a subset of potentially relevant documents is retrieved to increase the efficiency of subsequent steps. Named entity tagging identifies the important entities or objects mentioned in the article, often using a list of synonyms. Fact and information extraction identifies the relationships between entities. Finally, in collection-wide analysis, information extracted from different documents is integrated.

Table 1: The twenty pairs of cancer types with the highest ratio of observed (O) to expected (E) number of associated genes. The O/E ratio and a 95% confidence interval (95% CI) are provided. Results are based on cancers mentioned in Online Mendelian Inheritance in Man (OMIM), searched September 30, 2004.

Pair of cancer types     Genes related to both          O/E ratio and 95% CI
                         Observed (O)   Expected (E)
cervix – larynx                2            0.16            12.6 ± 4.9
larynx – mouth                 2            0.17            11.6 ± 4.8
larynx – uterus                2            0.25             7.9 ± 3.9
esophagus – stomach           21            2.80             7.5 ± 1.2
larynx – stomach               3            0.44             6.9 ± 3.0
bladder – esophagus           10            1.51             6.6 ± 1.6
larynx – myeloma               1            0.15             6.6 ± 5.1
larynx – esophagus             1            0.16             6.2 ± 4.9
cervix – esophagus             6            1.02             5.9 ± 1.9
bladder – cervix               8            1.47             5.4 ± 1.6
cervix – uterus                8            1.59             5.1 ± 1.6
larynx – bladder               1            0.23             4.3 ± 4.1
pancreas – stomach            23            6.10             3.8 ± 0.8
cervix – stomach              10            2.74             3.7 ± 1.2
esophagus – mouth              4            1.11             3.6 ± 1.9
brain – kidney                19            5.39             3.5 ± 0.8
bladder – testis              18            5.12             3.5 ± 0.9
bladder – prostate            17            5.05             3.4 ± 0.9
bladder – pancreas            11            3.28             3.4 ± 1.1
cervix – lung                 20            6.14             3.3 ± 0.8

Many research studies aim to explore the association between genes and cancer. The design of these studies requires the identification of appropriate patient groups and candidate genes, and both steps can benefit from effective text-mining of public data sources. OMIM is a high-quality information source and considered a key reference database by the genetics community. Our objective was to establish an automated text-mining system for OMIM that will identify genetically-related cancers and cancer-related genes.

Implementation

We developed the computer program CGMIM to text-mine OMIM. The software considers 21 major cancer types identified by the National Cancer Institute of Canada [ref [6], page 18, Table 1]. CGMIM recognizes genetically-related cancers by identifying cancer types mentioned in association with a specific gene.
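
As a rough illustration of what "identifying cancer types mentioned in association with a specific gene" can look like in practice, the following Python sketch tags a piece of entry text with cancer sites using the sentence-level rule that this section describes: a sentence counts toward a cancer type when it contains both a site term and a cancer term. It is a minimal sketch under stated assumptions and not the CGMIM source (which is written in Perl). The synonym lists are tiny and hypothetical, the function names are invented for this example, and simple lowercase prefix matching stands in for Porter stemming.

```python
import re

# Hypothetical, heavily abbreviated synonym lists. The paper's lists were
# built from ICD-O terminology plus lay terms; these are placeholders.
CANCER_TERMS = ("cancer", "carcinoma", "tumor", "neoplasm")
SITE_TERMS = {
    "breast": ("breast", "mammary"),
    "ovary": ("ovary", "ovarian"),
    "prostate": ("prostate", "prostatic"),
    "skin": ("skin",),
}


def split_sentences(text):
    """Split text where a period is followed by whitespace."""
    return [s.strip() for s in re.split(r"\.\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def has_term(tokens, terms):
    """Crude stand-in for stemming: a token matches if it starts with a term."""
    return any(tok.startswith(term) for tok in tokens for term in terms)


def cancer_types_mentioned(text):
    """Return the set of sites named in a sentence that also names cancer."""
    found = set()
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        if not has_term(tokens, CANCER_TERMS):
            continue  # no cancer synonym, so the sentence is ignored
        for site, synonyms in SITE_TERMS.items():
            if has_term(tokens, synonyms):
                found.add(site)
    return found


listed = cancer_types_mentioned(
    "The gene is mutated in cancer of the ovary, breast, and skin. "
    "It is expressed in mammary tissue."
)
expressed = cancer_types_mentioned(
    "It is expressed in breast cancer cells and prostate cells."
)
```

Under these assumptions, `listed` picks up all three sites from the list-style sentence and ignores the second sentence, which has no cancer term. `expressed` also contains `prostate`, which reproduces the kind of false positive the paper reports for sentences about expression in both cancerous and normal tissue.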
For pairs of cancer types, CGMIM generates a table with rows and columns for each cancer type, and cells containing the number of OMIM gene entries that mention an association with those cancers. We refer to this table as the siteXsite matrix. If several OMIM entries mention one type of cancer, and several entries mention another type of cancer, then some entries will mention both types of cancer by chance alone. If the mention of different cancers occurred at random, the expected number of genes (E) in OMIM that mention two specific types of cancer can be estimated as the total number of genes related to cancer, multiplied by the probabilities that an entry mentions each individual cancer type. The latter probabilities are estimated as the proportion of cancer-related genes in OMIM that are related to each cancer type. Explicitly, if there are $N$ genes related to cancer, $G_A$ genes related to cancer type A and $G_B$ genes related to cancer type B, then

$$E_{AB} = (G_A / N) \times (G_B / N) \times N \qquad (1)$$

where $E_{AB}$ is the expected number of genes related to both cancer types A and B; the expression simplifies to $G_A G_B / N$. The observed number of genes (O) is the number of OMIM entries that mention both cancer types, and O/E indicates whether the number of genes associated with a pair of cancer sites is different than chance alone would predict. An O/E value of 1.0 indicates the number of entries observed is the number expected by chance. An approximate 95% confidence interval (95% CI) is $O/E \pm 1.96/\sqrt{E}$.

Our text-mining algorithm begins by separating paragraphs of an OMIM entry into constituent sentences, and assumes sentences end with a period followed by a space. There are many words and phrases that refer to cancer. A breast cancer might be described as a breast tumor, breast carcinoma or mammary gland neoplasm. A list of synonyms for each cancer type was developed using the International Classification of Diseases for Oncology (ICD-O) [7] and augmented by familiar lay terminology. Other variation occurs as the result of English grammar. Breast cancer might be referred to as cancer of the breast, and several cancers might be referred to in a list (e.g., "cancer of the ovary, breast, and skin"). The algorithm identified OMIM entries for each type of cancer by finding sentences that included both a site synonym and a cancer synonym. For phrases in the synonym list, CGMIM searched for sentences containing all of the individual words.

"Stemming" was used to remove capitalization and common suffixes from words, and thereby change similar words to identical word fragments. The process is best demonstrated with an example.

Unstemmed
Large-cell lymphomas comprise approximately 25% of all non-Hodgkin lymphomas in children and young adults, and approximately one-third of these tumors have a t(2;5)(p23;q35) translocation.

Stemmed
larg-cell lymphoma compris approxim 25% of all non-hodgkin lymphoma in children and young adult, and approxim on-third of these tumor have a t(2;5)(p23;q35) transloc

We used an established algorithm ("Porter's algorithm") to perform the stemming [8].

Our list of synonyms was stemmed and then compared to the stemmed sentences in OMIM. An OMIM entry may contain alternative entry names, mapping information, a text summary, references to key publications, examples of known allelic variants, and a clinical synopsis of the corresponding phenotype. Some of these fields are subjective, such as the examples of allelic variants, and we restricted our search to the text summary.

Finally, not all OMIM entries refer to specific genes. Some entries refer to heritable traits for which no gene has been identified. In addition, more than one OMIM entry can refer to the same gene. This typically occurs when the entry for a trait is linked to a gene that was previously identified and described in a separate OMIM entry. Because OMIM is dynamically organized and updated, this type of multiple referencing is unavoidable. To restrict searches to only the OMIM entries for genes, CGMIM compares each entry name and alternative names with a list of gene names assigned by the Human Genome Organisation (HUGO).

We performed manual searches of the OMIM database to identify the strengths and weaknesses of the computerized search method, and to iteratively modify the software. This involved selecting a sample of OMIM entries and reading through the text to determine whether the entries referred to a cancer, or whether entries were identified by CGMIM where, in reality, there was no true cancer reference. We also reviewed the entries to identify sentences that referred to cancer, but for which evidence indicated there was no association. (E.g., "An early study showed the gene was not related to breast cancer.") While an OMIM entry might include a sentence of that sort, another sentence in the entry might cite evidence supporting the association. (E.g., "A subsequent study showed the gene was related to breast cancer.") Despite the negative statement, this example OMIM entry mentions evidence supporting the association and hence would be included when tallying entries associated with the cancer.

CGMIM was written in the Perl computer language and implemented on a Linux workstation. OMIM is updated daily and we created static copies of the database to provide a stable reference for search evaluation. The copies of OMIM used to develop CGMIM were downloaded between March and October of 2003, and each copy contained more than 14,000 entries.

Results and discussion

In the OMIM database on September 30, 2004, CGMIM identified 1943 genes related to cancer. BRCA2 (OMIM *600185), BRAF (OMIM *164757) and CDKN2A (OMIM *600160) were each related to 14 types of cancer.
The OMIM entries for all three genes mention leukemia, melanoma, breast cancer, colorectal cancer, pancreatic cancer, stomach cancer, ovarian cancer and prostate cancer. The entry for BRCA2 also mentions cancer of the brain, larynx, cervix, uterus, thyroid and kidney. The entry for BRAF also mentions lymphoma and cancer of the lung, bladder, testes, cervix and uterus. The entry for CDKN2A also mentions lymphoma and cancer of the lung, bladder, brain, esophagus and kidney. Each gene defines a large group of related cancers.

The numbers of genes associated with each pair of cancer types are summarized in the siteXsite matrix (Figure 1). Diagonal cells in the matrix contain the total numbers of genes identified for each cancer type; off-diagonal cells are the numbers of genes identified by both the row and the column titles. For example, there were 45 genes related to cancer of the esophagus, 121 genes related to cancer of the stomach, and 21 genes related to both. The cancer mentioned by the greatest number of OMIM entries was leukemia, and the greatest number of OMIM gene entries that mention a combination of two cancers was 143 for lymphoma and leukemia. For some pairs of cancer sites, no genes were identified.

The numbers in the off-diagonal cells depend on the number of genes related to the individual cancers. Based on the number of OMIM entries that mention leukemia and lymphoma individually, the number expected to mention both is 98.3 and the ratio of the observed and expected values is 1.5 (95% CI 1.3–1.7). (In equation (1), $G_{\text{leukemia}} = 643$, $G_{\text{lymphoma}} = 297$ and $N = 1943$.) This indicates there are 50% more genes related to both cancers than would be expected by chance. Table 1 provides a list of 20 pairs of cancer types where the ratio of the observed and expected number of genes in the siteXsite matrix is greatest. The table indicates that fewer than three genes in OMIM should mention both cancer of the esophagus and cancer of the stomach by chance, but 21 entries mention both cancers.
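
The arithmetic behind these figures can be reproduced with a short Python sketch. It applies equation (1) and the approximate 95% confidence interval to the counts quoted in the text, and also shows one way the diagonal and off-diagonal counts of a siteXsite matrix could be tallied. The function names and the three-gene toy mapping are hypothetical and exist only for illustration; this is not the CGMIM implementation.

```python
import math
from collections import Counter
from itertools import combinations


def site_by_site(gene_to_cancers):
    """Tally genes per cancer type (diagonal) and per pair (off-diagonal)."""
    totals, pairs = Counter(), Counter()
    for cancers in gene_to_cancers.values():
        totals.update(cancers)
        # Sort so each unordered pair is always stored under the same key.
        pairs.update(combinations(sorted(cancers), 2))
    return totals, pairs


def expected_both(g_a, g_b, n):
    """Equation (1): E_AB = (G_A / N) * (G_B / N) * N."""
    return (g_a / n) * (g_b / n) * n


def observed_over_expected(observed, g_a, g_b, n):
    """Return (E, O/E, half-width of the approximate 95% CI)."""
    e = expected_both(g_a, g_b, n)
    return e, observed / e, 1.96 / math.sqrt(e)


# Hypothetical toy input, only to show the shape of the matrix counts.
toy = {
    "GENE1": {"esophagus", "stomach"},
    "GENE2": {"stomach"},
    "GENE3": {"bladder", "esophagus", "stomach"},
}
totals, pairs = site_by_site(toy)

# Counts quoted in the text (OMIM searched September 30, 2004).
N = 1943
leukemia_lymphoma = observed_over_expected(143, 643, 297, N)
esophagus_stomach = observed_over_expected(21, 45, 121, N)
```

With the quoted counts, the leukemia and lymphoma pair gives an expected value that rounds to 98.3 and an O/E ratio that rounds to 1.5, with an interval of roughly 1.3 to 1.7. The esophagus and stomach pair gives an expected value of about 2.80 and an O/E ratio of about 7.5 ± 1.2, in agreement with Table 1.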
This more than seven-fold discrepancy between the observed and expected counts suggests that cancers of the esophagus and stomach might be more related than current literature suggests. Similar conclusions might be made for the other pairs of cancer types in Table 1.

We randomly selected 25 genes related to cancer and manually reviewed the text of the corresponding OMIM entries. All of the entries correctly mention one or more types of cancer, but for 20% of those entries, one of the cancers was only mentioned in the context of evidence suggesting no association.

CGMIM can assist in designing effective studies of genetically-related cancers. CGMIM uses a high-quality database of genetic information to produce a summary of gene and cancer associations. A group of cancer types might be related by physical proximity in the body (e.g., prostate and bladder cancer), a shared physiologic function (e.g., cancers involving the digestive tract), a common exposure (e.g., cancers caused by air pollution) or a common genetic characteristic (e.g., cancers in tissues that express BRCA1). The identification of such groups becomes more difficult and time-consuming as the literature about genes and cancer expands, and efficient text-mining tools have increasing value.

In several ways, groups of cancers that have shared genetic factors are anticipated to lead to further etiologic hypotheses and advances regarding environmental agents. First, grouping cancers will be especially useful if a group combines several cancers that are rare and difficult to study individually. Second, knowledge of genetic pathways might suggest an environmental factor associated with all of the cancers. For example, a grouping defined by a vitamin receptor gene would suggest vitamin intake as a possible environmental agent in the etiology of all of the cancers. Third, CGMIM will allow us to design studies that might extend gene-cancer associations to include cancers at other sites.
The groups can also be used to identify cancers that should be considered together in a definition of family history, and in selection of genetic tests that might be adopted for high-risk families. During development of CGMIM, we observed changes in OMIM and the cancer groups that it produced from one week to another. This illustrates the need for a tool that can routinely perform the analysis, as opposed to a set of results based on the OMIM contents from a particular day.

Figure 1: A siteXsite matrix for 21 major cancer types as reported by the National Cancer Institute of Canada. Matrix cells indicate the number of genes related to cancers named in the row and column labels. Cell entries are based on cancers mentioned in Online Mendelian Inheritance in Man (OMIM), searched September 30, 2004.

OMIM is based on published material from the scientific literature. The number of genes identified by our program does not necessarily indicate the relatedness of two or more cancer types, but rather what is known about those cancers. This reflects what research has been funded, performed and published. There is more funding for certain types of cancer, there are more journals that address certain types of cancer, and there are more people studying certain types of cancer. Published information reflects our knowledge base and the scientific literature is hence a valid basis for identifying cancer groups and genes for further study. In some cases, evidence about an association was based on studies of cell lines or non-human organisms. In other cases, evidence was based on anecdotal observations in a small number of people.
Some associations were based on several independent studies that each involved hundreds of patients.

There are sentences in OMIM that contain phrases such as "is not related to breast cancer". We could not create an algorithm that recognized all negative references without overlooking valid positive ones. Some OMIM entries report both negative and positive evidence of an association. These "mixed" entries are tallied as positive reports by CGMIM, consistent with our interest in positive associations. Other sentences in OMIM describe evidence of gene expression in both cancerous and normal tissue, for example "... has been shown to be expressed in breast cancer cells and prostate cells". Such sentences are incorrectly interpreted as mentions of prostate cancer. Manual review of OMIM indicated that a minority of apparent associations (about 20%) between a gene and a specific type of cancer were the result of negative evidence and are thus "false-positive" text-mining associations. We suggest that a manual review of OMIM associations always precede subsequent study design and analysis. We assume the excess 20% is included in every cell of the siteXsite matrix. Thus expected values also include the 20% excess, and the O/E ratios are not affected: if every count is inflated by the same factor, that factor appears once in both O and E and cancels in the ratio.

Other databases might be used as the basis for assessing scientific knowledge regarding genetic cancer groupings, but OMIM offers several advantages. OMIM is based on all publications in the PubMed database that are related to a specific human gene or trait. Results based on mining all of PubMed would be of interest, but would involve a much larger volume of literature and lack the expert review that is characteristic of OMIM. More specialized cancer groupings also might be created using computerized conference proceedings or journal contents.
Likewise, a list of synonyms might be determined from other sources such as the UMLS (Unified Medical Language System) Specialist Lexicon of the National Library of Medicine. We used ICD-O terminology because it is the basis for most scientific writing on cancer.

This project used resources that have been developed by the US National Institutes of Health and Human Genome Project [3]. Our approach is exhaustive of the information reported in OMIM, will produce a computer algorithm for near-automatic updating of the review, and has the potential to be extended to other computerized databases. We will use CGMIM along with other criteria to guide the design of studies of genes and environment in cancer etiology.

Conclusion

CGMIM uses an expert database of genetic information to determine a summary of gene and cancer associations. The software identifies genes that are associated with a particular type of cancer, groups of cancers that share a common genetic association, and pairs of cancer types where there are more related genes than expected by chance.

Availability and requirements

• Project name: CGMIM
• Project home page:
• Operating system: The source code for CGMIM can be downloaded from the CGMIM homepage and run under Linux.
• Programming language: The source code for CGMIM can be downloaded from the CGMIM homepage and is written in Perl.

Abbreviations

OMIM is Online Mendelian Inheritance in Man; HUGO is the Human Genome Organisation; ICD-O is the International Classification of Diseases for Oncology.

Authors' contributions

The software was developed by BK and SR under the direction of SJ and CDB. The website and manuscript were created by CDB and AB. Funding for the project was obtained by CDB, AB and SJ.

Acknowledgements

Chris Bajdik and Steven Jones are scholars of the Michael Smith Foundation for Health Research. This work was supported by a research grant from the Canadian Cancer Etiology Research Network.
Steve Sung assisted with design of the CGMIM website and Chris Young assisted with manual searches of the OMIM database.

References

1. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nat Rev Cancer 2004, 4:177-183.
2. Hamosh A, Scott AF, Amberger J, Bocchini C, Valle D, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research 2002, 30:52-55.
3. Wheeler DL, Church DM, Edgar R, Federhen S, Helmberg W, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Suzek TO, Tatusova TA, Wagner L: Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Research 2004, 32:D35-40.
4. Han J, Kamber M: Data Mining: Concepts and Techniques. First edition. Morgan Kaufmann Publishers; 2001.
5. de Bruijn B, Martin J: Getting to the (c)ore of knowledge: mining biomedical literature. Int J Medical Informatics 2002, 67:7-18.
6. National Cancer Institute of Canada: Canadian Cancer Statistics. Toronto; 2004.
7. Fritz A, Percy C, Jack A, Shanmugaratnam K, Sobin L, Parkin DM, Whelan S: International Classification of Diseases for Oncology. Third edition. World Health Organization; 2000.
8. Porter MF: An algorithm for suffix stripping. Program 1980, 14:130-137. Reprinted in Sparck Jones K, Willett P (eds): Readings in Information Retrieval. San Francisco: Morgan Kaufmann; 1997.

