UBC Faculty Research and Publications

Improving the specificity of high-throughput ortholog prediction
Fulton, Debra L; Li, Yvonne Y; Laird, Matthew R; Horsman, Benjamin G; Roche, Fiona M; Brinkman, Fiona S
May 28, 2006

BMC Bioinformatics (BioMed Central)
Open Access
Methodology article

Improving the specificity of high-throughput ortholog prediction

Debra L Fulton†1,2, Yvonne Y Li†1,3, Matthew R Laird1, Benjamin GS Horsman1, Fiona M Roche1 and Fiona SL Brinkman*1

Address: 1Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC, Canada, 2Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada and 3Canada's Michael Smith Genome Sciences Centre, 570 W. 7th Avenue, Vancouver, BC, Canada

Email: Debra L Fulton - debra@cmmt.ubc.ca; Yvonne Y Li - yli@bcgsc.ca; Matthew R Laird - lairdm@sfu.ca; Benjamin GS Horsman - bhorsman@sfu.ca; Fiona M Roche - fiona_roche@sfu.ca; Fiona SL Brinkman* - brinkman@sfu.ca

* Corresponding author    † Equal contributors

Abstract

Background: Orthologs (genes that have diverged after a speciation event) tend to have similar function, and so their prediction has become an important component of comparative genomics and genome annotation. The gold standard phylogenetic analysis approach of comparing available organismal phylogeny to gene phylogeny is not easily automated for genome-wide analysis; therefore, ortholog prediction for large genome-scale datasets is typically performed using a reciprocal-best-BLAST-hits (RBH) approach. One problem with RBH is that it will incorrectly predict a paralog as an ortholog when incomplete genome sequences or gene loss is involved. In addition, there is an increasing interest in identifying orthologs most likely to have retained similar function.

Results: To address these issues, we present here a high-throughput computational method named Ortholuge that further evaluates previously predicted orthologs (including those predicted using an RBH-based approach), identifying which orthologs most closely reflect species divergence and may more likely have similar function.
Ortholuge analyzes phylogenetic distance ratios involving two comparison species and an outgroup species, noting cases where relative gene divergence is atypical. It also identifies some cases of gene duplication after species divergence. Through simulations of incomplete genome data/gene loss, we show that the vast majority of genes falsely predicted as orthologs by an RBH-based method can be identified. Ortholuge was then used to estimate the number of false-positives (predominantly paralogs) in selected RBH-predicted ortholog datasets, identifying approximately 10% paralogs in a eukaryotic data set (mouse-rat comparison) and 5% in a bacterial data set (Pseudomonas putida-Pseudomonas syringae species comparison). Higher quality (more precise) datasets of orthologs, which we term "ssd-orthologs" (supporting-species-divergence-orthologs), were also constructed. These datasets, as well as Ortholuge software that may be used to characterize other species' datasets, are available at http://www.pathogenomics.ca/ortholuge/ (software under GNU General Public License).

Conclusion: The Ortholuge method reported here appears to significantly improve the specificity (precision) of high-throughput ortholog prediction for both bacterial and eukaryotic species.
This method, and its associated software, will aid those performing various comparative genomics-based analyses, such as the prediction of conserved regulatory elements upstream of orthologous genes.

Published: 28 May 2006
BMC Bioinformatics 2006, 7:270 doi:10.1186/1471-2105-7-270
Received: 03 October 2005
Accepted: 28 May 2006
This article is available from: http://www.biomedcentral.com/1471-2105/7/270
© 2006 Fulton et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Ortholog prediction is an important facet of comparative genomics and is frequently used in genome annotation, gene function characterization, evolutionary genomics, and in the identification of conserved regulatory elements. As the number of genome sequences grows, comparative genomics has become increasingly relevant. Errors in ortholog prediction can greatly affect such studies and associated downstream analyses (including functional genomics and proteomics analyses), so there has been increasing interest in high quality ortholog prediction.

Orthologs are commonly defined as genes that have diverged after a speciation event [1], whereas genes that have diverged after a gene duplication event, either before a speciation event (out-paralogs) or after a speciation event (in-paralogs), are collectively known as paralogs. It has been found that orthologs tend to have similar function and so their utility in comparative analyses is paramount. Classically, orthologous genes are identified by phylogenetic analysis.
A phylogenetic tree for the genes is compared against a reference species tree, with the notion that the gene tree of orthologs should be similar to the species tree. However, sophisticated phylogenetic analysis is not easily automated, due in part to the complexity of both manual sequence alignment editing and choice of appropriate genes and species to be included in an analysis.

Whole-genome analyses indicate that many gene families (essentially paralogs) were formed before the divergence of most species commonly being compared in a comparative genomics analysis (out-paralogs). Therefore, orthologs, which diverged due to speciation, are typically more similar to each other than to other genes in the genome. This is why sequence similarity is often used to infer gene orthology between two or more species, and is also the premise behind the most common high-throughput ortholog prediction method used today: the reciprocal-best-BLAST-hits (RBH) analysis [2]. With the RBH method, genes from species A and species B are predicted to be orthologs if they are both the "best BLAST hit" of the other, when all genes from species A are compared to all genes from species B by BLAST analysis. There are numerous resources and methods that use a version of RBH as part of their ortholog prediction process, including the Clusters of Orthologous Groups (COG) database [3,4], The Institute for Genomic Research (TIGR)'s EGO database [5], and INPARANOID [6,7]. However, if a gene is not present in one organism's gene dataset, perhaps due to incomplete genome sequence data or gene loss in the organism, the RBH method will incorrectly predict a paralog as an ortholog (Fig. 1). Today, comparative genomics is often being performed using incomplete genomes, especially for large eukaryotic genome sequencing projects. Also, gene loss is a major driving force behind bacterial evolution [8].
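The RBH criterion described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited resources; the function names, the score dictionaries, and the toy gene identifiers are all hypothetical, and real pipelines would parse BLAST output rather than hand-written scores. The toy data mimic the failure case discussed in the text: the true ortholog of gene a1 is absent from species B, so a paralog becomes its reciprocal best hit.

```python
from typing import Dict, List, Tuple

Scores = Dict[Tuple[str, str], float]  # (query gene, subject gene) -> similarity score


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query gene to its highest-scoring subject gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> List[Tuple[str, str]]:
    """Return (gene in A, gene in B) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: the true ortholog of a1 is missing from the species B
# gene set, so its paralog is the best (and reciprocal best) hit instead.
a_vs_b = {("a1", "b1_paralog"): 120.0, ("a2", "b2"): 300.0, ("a2", "b1_paralog"): 40.0}
b_vs_a = {("b1_paralog", "a1"): 118.0, ("b2", "a2"): 295.0, ("b1_paralog", "a2"): 38.0}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy case the pair (a1, b1_paralog) is reported alongside the genuine pair (a2, b2), which is the kind of false positive that sequence similarity alone cannot flag.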
It is therefore important to recognize that many of the current ortholog databases will likely contain false-positives due to the limitations of the RBH approach.

For comparative analyses, it is also frequently desirable to identify orthologs that most likely have similar function. In some cases, an ortholog may diverge more rapidly in sequence (and function) in one organism/species versus another related organism/species. In addition, a gene duplication may occur in one species, but not a second species, after species divergence. In this case either one or both of the duplicated genes (in-paralogs) are more likely to have diverged in function [9]. We therefore propose to differentiate such orthologs (reflecting what has sometimes been referred to as "many-to-many" ortholog relationships) from those that appear to have diverged only due to a speciation event. We also wish to identify those orthologs that have diverged to a degree that is similar to that expected for their species, since those orthologs that have undergone unusually rapid divergence in one species, relative to another, may have also diverged more in function. We therefore propose the term ssd-orthologs (for "supporting-species-divergence" orthologs) to define orthologs that appear to have diverged only due to speciation, and have diverged to the same relative degree as their species.

Figure 1. An example of how RBH analysis may falsely identify a paralog as an ortholog. Illustrated is a hypothetical species tree and gene tree for the human, cattle, and mouse species, where human and cattle orthologs (unshaded genes) are being identified. If the true cattle ortholog has not yet been sequenced because of an incomplete bovine genome project, it will not be present in the gene dataset used for analysis (cattle gene crossed out with an X), and the best reciprocal BLAST hit for the human gene will be a cattle paralog (shaded gene).
However, Ortholuge will detect this case as a potential paralog, because it examines the relative phylogenetic distance between genes and identifies how well their relative distances match expected species divergence.

These ssd-orthologs are more likely to have retained similar function, and would better suit the purposes of many comparative analyses. To avoid the confusion that may stem from the association of the term "many-to-many orthologs" with in-paralogs, we will use the term paralogs in this text to refer to out-paralogs and specify in-paralogs, when applicable.

To address these issues, we have developed a method we call Ortholuge. Ortholuge is a high-throughput analysis pipeline that evaluates previously predicted orthologs (such as RBH-predicted orthologs on a genome-wide scale) and generates predictions regarding which of these are likely ssd-orthologs and which are likely paralogs or other non-ssd-orthologs. The pipeline requires tentative ortholog predictions (and the associated gene/protein sequences) for large gene datasets from three species, two of which are the species to be compared, and one of which is an outgroup species. All phylogenetic distances between the genes/proteins in an ortholog group are computed for each group in the input list. Ratios of these distances are used to evaluate ortholog quality. We find that these ratios show certain consistencies over several sets of eukaryotic and bacterial orthologs, including data sets into which true-negatives were introduced for comparison. This permitted the formulation of ratio cut-offs for retaining ssd-orthologs and removing probable paralogs, which resulted in a higher quality data set of orthologs. Overall, we demonstrate that the relative evolutionary relationships may be used to support the prediction of orthologs.
In addition, noting those orthologs with prominent differences (such as recent gene duplications after species divergence) may help refine analyses to permit the identification of those orthologs that most likely retain the same function.

Results

An overview of the Ortholuge approach for increasing the specificity of ortholog predictions is outlined in Figure 2. Based on the analyses described below, the details of this approach were formulated and the approach validated using both prokaryotic and eukaryotic data sets. Ortholuge software is available [28] to assist with the analysis of data sets other than those reported here.

Data sets exhibited little bias due to the automated sequence alignment trimming approach

We investigated the behaviour and utility of Ortholuge through analysis of diverse eukaryotic and bacterial RBH-derived datasets. For the initial test eukaryotic data set, we chose predicted mouse-rat-human orthologs from the expressed sequence tag (EST) data in TIGR's Eukaryotic Gene Ortholog (EGO) database [5] (for a mouse-rat comparison, with human as the outgroup). The majority of our subsequent analyses utilized the higher quality MGD-based dataset (see Methods describing datasets) and the RefSeq-based RBH dataset composed of these same species, as indicated. For the bacterial data set, we chose three gamma-proteobacteria: Escherichia coli, Pseudomonas putida, and Pseudomonas syringae (a Pseudomonas species comparison, with E. coli as the outgroup). Orthologs between these three species (and other sets of species subsequently examined) were predicted using a transitive RBH approach, applied to the deduced proteins from complete genome sequences [10-12].

Accurate sequence alignment is critical for phylogenetic analysis; thus, we wished to improve the automated alignment and trimming components of the Ortholuge method.

Figure 2. An overview of the Ortholuge method. (A) Flow-chart outlining the main steps of the method. (B) The three ratios computed by Ortholuge.
The phylogenetic distances in the numerator (dark line) and denominator (dashed line) for each ratio are shown, overlaid on the phylogenetic tree (gray line) that relates the ingroups and outgroup. Note that the three ratios are related such that Ratio2 = Ratio1 × Ratio3. Therefore, ratio data are presented both in terms of frequency histograms for all three ratios (see Fig. 4) and also as Ratio1 × Ratio2 plots (see Fig. 5) for just two of the three ratios; the latter is simply another way to conveniently visualize the data.

We therefore performed a comprehensive examination of biases in our automated alignment editing process (see Methods). A sample of RBH-predicted ortholog sequence sets was analyzed to devise the gap-masking and sequence trimming approaches. The sequence sets were examined to identify both gaps introduced by misalignments and gaps introduced through sequence insertions and deletions. Our observations suggested that some of the noise introduced through misalignment may be alleviated by removing the portions flanking the gapped segments. We also noted that there was no appreciable effect on the sequence distances when the flanking sequences around the sequence-variation gapped regions were removed. We manually introduced gap-masking simulations over the sequences using various window length criteria to establish a gap-masking approach with a relatively conservative worst-case scenario. Both the trimming and gap-masking methods were evaluated for the introduction of ratio distribution biases by selected alignment characteristics.
No obvious bias was observed through the introduction of our gap masking approach or alignment trimming (Fig. 3).

Ortholuge produces ratios which form distributions

Ortholuge was designed with the purpose of overcoming certain limitations of the RBH method, such as the problem illustrated in Figure 1. Ortholuge overcomes this problem by using ratios of phylogenetic distances between genes to evaluate orthology, and using an outgroup species as a reference for two ingroup species being compared (Fig. 2). For these three species, the distances for the "ortholog triple" are calculated and the three possible ratios that can be generated are calculated (Fig. 2). With this approach, the problem illustrated in Figure 1 would be detected because the human-cattle distance is unexpectedly larger than the human-mouse distance, which affects the ratio values. We ran Ortholuge on three mouse-rat-human datasets: two sets of RBH-predicted orthologs (one based on EGO data and the other based on RefSeq data) and a third high-quality curated set. For all datasets, human was the outgroup used to help predict more precise orthologs between mouse (ingroup1) and rat (ingroup2). The resulting Ortholuge phylogenetic distance ratios are shown in Figure 4 and Supplemental Figure 3 as histograms. For each of the three ratios, we tabulated the frequency of putative orthologous groups within certain ratio value ranges. Ratio1, Ratio2, and Ratio3 each form clear distributions. Ratio3 is generally located around a ratio value of 1, which is expected if the chosen outgroup is more distant relative to the ingroups. It is centered to the left or right of 1 depending on which of the two ingroups is closer to the outgroup. The Ratio1 and Ratio2 distributions are generally located at a ratio much lower than 1, reflecting the closer relationship between the ingroup species versus any ingroup to the outgroup. We ran our analyses on both protein and nucleotide sequences and found that for closely related species such as these, nucleotide sequences provide a better ratio distribution resolution.
Figure 3. Ratio 1 (R1) ratio distribution curves for selected alignment characteristics. Higher quality mouse-rat-human ortholog sequence sets were analyzed to devise the gap-masking and sequence trimming approaches. These methods were evaluated for the introduction of ratio distribution biases for selected alignment characteristics such as identity and gap length. Ratio distribution curves were plotted for several characteristics. No obvious bias was observed through the introduction of our gap masking approach or alignment trimming.

Figure 4. Histogram illustrating the distribution of RBH-predicted (i.e. putative) orthologous groups across the three Ortholuge distance ratios. The results for predicted mouse-rat-human RBH ortholog sets (EGO RBH data set; 19,200 ortholog groups) are shown. Each of the three ratios forms its own distribution: Ratio1 and Ratio2 are generally located at ratio values lower than 1 and Ratio3 is generally located about a ratio value of 1, reflecting the relative distances between ingroups and between each ingroup and the outgroup. A similar ratio analysis was performed on a RefSeq RBH dataset (see Figure 3 of [Additional file 1]).

However, the overall ratio distributions are similar, even when using different methods of initial ortholog detection (see Figure 4 of [Additional file 1]).

We also performed this analysis with our bacterial P. putida-P. syringae-E. coli orthologs, comparing P. putida (ingroup1) and P. syringae (ingroup2) using E. coli as the outgroup.
We observed very similar results: both the eukaryotic and prokaryotic data sets are consistent in the distributions formed, and in the approximate position of the distributions. Since we expected most ssd-orthologs (see Introduction for definition) to evolve in a similar manner, we hypothesized that orthologs falling within the higher frequency ranges of the distributions are more likely to be ssd-orthologs compared to those that are outliers. In essence, what is defining the species divergence is the divergence observed for most genes (i.e. the highest frequency ranges).

Ortholuge ratios can also be conveniently visualized in an R1 × R2 plot

Instead of histograms (Fig. 4), an alternative way to represent Ortholuge ratios is to use a 2-dimensional plot of two Ortholuge ratios, where each putative ortholog group is represented by one point in the graph. In principle, any two of the three ratios can be used for the plot, since the three ratios are related. That is, Ratio3 equals Ratio2 divided by Ratio1. Through subsequent analyses, we found that the Ratio1 and Ratio2 combination (i.e. an R1 × R2 plot) was the simplest to visualize and to work with.

For the R1 × R2 plots, the eukaryotic mouse-rat-human RBH-predicted putative orthologous groups appear to occupy three types of positions (Fig. 5A and 5D). (1) The majority of points form a cluster (highest frequency range) at low Ratio1 and Ratio2 values. In fact, about 85% of orthologs have Ratio1 and Ratio2 values less than 1. (2) Some points with higher Ratio1 values are located along a curve that approaches, and then falls along, the line Ratio2 = 1. This is consistent with an unusually high divergence of a gene from ingroup 2. (3) Conversely, some points with higher Ratio2 values are located roughly along the line Ratio1 = 1. This is consistent with an unusually high divergence for a gene from ingroup 1. The RBH-predicted orthologous groups for P. putida-P. syringae-E. coli species show a similar R1 × R2 plot (Fig. 6A and 6D).
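The ratio behaviour described above can be reproduced with a short Python sketch. The exact ratio definitions are given graphically in Figure 2 and are not spelled out in this text, so the definitions below are an assumption, chosen because they satisfy every relationship stated here: Ratio3 equals Ratio2 divided by Ratio1, and an unusually divergent ingroup 2 gene yields a high Ratio1 with Ratio2 near 1. The function name and the distance values are hypothetical.

```python
import math
from typing import NamedTuple


class Ratios(NamedTuple):
    ratio1: float
    ratio2: float
    ratio3: float


def ortholuge_ratios(d_in1_in2: float, d_in1_out: float, d_in2_out: float) -> Ratios:
    """Distance ratios for one ortholog triple (assumed definitions, see lead-in)."""
    return Ratios(
        ratio1=d_in1_in2 / d_in1_out,  # ingroup1-ingroup2 over ingroup1-outgroup
        ratio2=d_in1_in2 / d_in2_out,  # ingroup1-ingroup2 over ingroup2-outgroup
        ratio3=d_in1_out / d_in2_out,  # ingroup1-outgroup over ingroup2-outgroup
    )


# Hypothetical distances for a typical triple: the ingroups are close to each
# other and roughly equidistant from the outgroup.
typical = ortholuge_ratios(d_in1_in2=0.10, d_in1_out=0.40, d_in2_out=0.42)

# Hypothetical distances when the ingroup2 gene is unusually divergent (for
# example a paralog): both distances that involve ingroup2 become large.
in2_divergent = ortholuge_ratios(d_in1_in2=1.20, d_in1_out=0.40, d_in2_out=1.25)

for name, r in (("typical", typical), ("ingroup2 divergent", in2_divergent)):
    assert math.isclose(r.ratio2, r.ratio1 * r.ratio3)  # Ratio2 = Ratio1 x Ratio3
    print(f"{name}: R1={r.ratio1:.2f} R2={r.ratio2:.2f} R3={r.ratio3:.2f}")
```

Under these assumed definitions the typical triple lands in the low Ratio1, low Ratio2 cluster, while the divergent ingroup 2 case lands at high Ratio1 close to the line Ratio2 = 1, matching position type (2) above.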
Consistent with the eukaryotic results, the vast majority of orthologous groups for this prokaryotic analysis also exhibit Ratio1 and Ratio2 values less than 1.

We expected most ssd-orthologs to evolve in a similar manner, and found that most orthologous groups form a cluster (high frequency range) in an R1 × R2 plot. Therefore, we hypothesized that orthologous groups falling within the high frequency range are more likely to contain ssd-orthologs. Conversely, those outside of this range (i.e. high Ratio1 or Ratio2 values) are more likely to contain, in an ingroup, either an ortholog that has undergone unusual divergence, or a paralog.

"Higher quality" orthologous groups are found primarily in "low" Ortholuge ratio ranges, in R1 × R2 plots

The data sets of tentative orthologs predicted above by an RBH approach will certainly contain genes that are being falsely identified as orthologs. It is difficult, if not impossible, to obtain a dataset of this size that contains only true orthologs, due to the inherent nature of inference associated with evolutionary study. However, data sets of "higher" and "lower" quality can be constructed and examined (see Methods), to observe how their Ortholuge ratios change in comparison to each other.

Figure 5. Ortholuge R1 × R2 plots (Ratio1 versus Ratio2) for selected eukaryotic data, where each point represents one putative ortholog group. (A) Putative orthologous groups identified using RBH for mouse-rat-human (Figure 4 shows the corresponding histogram). (B) Putative ortholog groups for mouse-rat-human from a higher quality (more precise) dataset (see Methods). It is expected that this more precise data set comprises primarily true orthologs. (C) A lower quality data set of RBH-predicted orthologous groups for cattle-human-mouse, where cattle genes have been identified from an incomplete genome sequence. (D), (E), (F) are zoomed-in versions of (A), (B), (C), respectively, with axes shown from 0 to 2 instead of 0 to 30.
Note that most orthologous groups exhibit low Ratio1 and Ratio2 values, in all three data sets. For example, in panels A and D, about 86% of orthologs have Ratio1 and Ratio2 values less than 1. However, the higher quality data set (panels B and E) contains fewer points at higher Ratio values versus the RBH-predicted data set. The lower quality data set contains more points with very high Ratio2 values (i.e. only 73% of points have Ratio1 and Ratio2 values less than 1), potentially reflecting the increased occurrence of probable cattle paralogs (i.e. paralogs being misidentified as orthologs by an RBH-analysis with an incomplete cattle genome).

These data sets should contain a notably greater or smaller proportion of true orthologs, respectively.

We therefore examined the behaviour of Ortholuge ratios for a higher quality data set of probable orthologs. Curated orthologs between human, mouse, and rat genomes were acquired from the Mouse Genome Database (MGD). Figure 5B and 5E illustrate that this higher quality data set occupies a smaller area of the R1 × R2 plot. This smaller area is observed, even when the number of points is normalized with the number plotted for the RBH-based data (data not shown). For this higher quality (more precise) data set there are notably fewer points along the Ratio1 = 1 and Ratio2 = 1 lines in the plot, compared to the RBH-based data plot in Figure 5A and 5D.

Conversely, we examined the ratios associated with a "lower quality" data set, involving RBH-predicted orthologs for bovine, human, and mouse, from TIGR's EGO database (with mouse as the outgroup).
The incomplete state of the bovine genome data at the time of this analysis should lead to more falsely predicted orthologs, since some true orthologs will be missing from the bovine dataset (see Fig. 1 for a scenario). These results are shown in Figure 5C and 5F. Note the higher number of points with a high Ratio2 value, falling along the line Ratio1 = 1; these points are consistent with how the ratio would behave if the bovine data contained paralogs that were notably more divergent than expected for most orthologs.

To gain a sense of the differences in plots of different quality datasets, note that below Ratio1 and Ratio2 values of 1, there lie 97% of high quality dataset points (Fig. 5B), 86% of RBH-predicted ortholog group points (Fig. 5A), and only 73% of the low quality data set points (Fig. 5C). These results suggest that true orthologs (or at least more precise ortholog data sets) tend to fall within the bulk of the highest frequency range (i.e. relatively "low" Ratio values in an R1 × R2 plot), while orthologs with unusual divergence patterns (non-ssd-orthologs) and paralogs have either high Ratio1 or high Ratio2 values.

For the prokaryotic analysis, a higher quality data set was compared to the RBH-based data set as well. Figure 6A and 6B illustrate the same trend as the eukaryotic data, with respect to how the R1 × R2 plots look for more precise and less precise ortholog data sets.

Known paralogs (true-negatives) introduced into orthologous groups generate either high Ratio1 or high Ratio2 values, as shown in a gene loss/incomplete genome simulation

The above comparisons of higher quality (more precise) and lower quality (less precise) ortholog data sets support our hypothesis that orthologs and paralogs fall within different regions of the R1 × R2 plot. However, a stronger argument can be made by examining specifically where falsely predicted orthologs (true paralogs) occur in such distributions.
A true-negative data set was therefore constructed by removing genes from one of the ingroup gene data sets and then identifying the next best reciprocal BLAST hit with the other ingroup (ensuring transitivity of this introduction with the other ingroup and outgroup). Therefore a true negative is essentially an ortholog triple which has been transformed into a false positive by introducing a less similar sequence for one of the species sequences. These true negatives represent the types of ortholog predictions that would result from an RBH-method in scenarios such as Figure 1. Since we know that RBH can make incorrect predictions when a genome is incomplete or when gene loss has occurred, this analysis simulates what would occur with the RBH method in such cases.

Figure 6. Ortholuge R1 × R2 plots for the prokaryotic data, illustrating two ortholog data sets and a true-negative data set. (A) Putative orthologous groups from an RBH-predicted data set. (B) Probable true orthologs from a higher quality (more precise) data set. (C) True-negative orthologs (i.e. true paralogs) from the "gene-loss simulation" data set. Darker dots represent putative orthologous groups which have had an ingroup1 true-negative (paralog) introduced into the group. Lighter dots represent putative orthologous groups which have had an ingroup2 true-negative (paralog) introduced into the group. (D), (E), (F) are zoomed-in versions of (A), (B), (C), respectively, with axes shown from 0 to 2 instead of 0 to 10. Most putative ortholog groups (particularly for the high quality data set) exhibit low Ratio1 and Ratio2 values (for example, all values are less than 1 for the points in the high quality data set plot), whereas most true-negative groups exhibit higher Ratio1 and Ratio2 values.
Only 9% of ingroup1 true-negative introductions, and 6% of ingroup2 true-negative introductions, have points with Ratio1 and Ratio2 values less than 1.

The benefit of this analysis is that we specifically know the true-negatives introduced, allowing us to examine how the Ortholuge ratios for these true-negatives (paralogs) behave.

For the E. coli-P. putida-P. syringae input ortholog groups, we constructed two true-negative data sets. In the first, we replaced P. putida genes with their next best RBH hit to P. syringae, resulting in ingroup1 paralogs. In the second, we replaced P. syringae genes with their next best RBH hit to P. putida, resulting in ingroup2 paralogs. For both, we conservatively introduced all possible paralogs into the analysis, resulting in roughly 50% of the genes converted to true-negatives (conservative, because most data sets would never contain this many true-negatives). The results from these two data sets (Fig. 6C and 6F) show that these true-negatives overlap very little with the RBH-predicted orthologs (Fig. 6A) or with the high quality (more precise) orthologs (Fig. 6B). This demonstrates that even with all possible true paralogs simulated, very few of them fall within the higher frequency ranges of the RBH distributions.

We also constructed a third true-negative data set with all outgroup genes (E. coli) replaced by their next best RBH hit to both P. syringae and P. putida. The R1 × R2 plot (Figure 7) shows that these true-negative cases plot at lower Ratio1 and Ratio2 values and do not separate well from what would be expected for true-orthologs. This is actually promising, since in the case of a paralog in an outgroup, the two ingroups should still be regarded as probable true orthologs and should still be falling within the main cluster of true-orthologs, as we observe.
In other words, since the goal of Ortholuge is to improve ortholog identification between the two ingroups, it is beneficial that an outgroup paralog does not generally interfere with or affect the analysis.

Ortholuge ratio cut-offs, to separate orthologs from paralogs, can be determined based on an iterative-true-negative analysis

After determining that the introduced true-negatives almost never fall within certain ratio ranges, it became clear that ratio cut-offs could be derived to exclude most true-negatives, and thus improve the specificity (precision) of ortholog prediction. To do this, another strategy was employed to simulate the introduction of paralogs (true-negative ortholog predictions) and then formulate ortholog identification cut-offs. This second strategy, involving an iterative-true-negative analysis, allows one to view the variance in proportion of true-negatives in a particular ratio range, and is also amenable to high throughput use for the formulation of cut-offs. For both the eukaryotic (human-mouse-rat) RBH-predicted data set (RefSeq-based), and the prokaryotic RBH-predicted data set, we conservatively modeled an incomplete genome (or gene loss) scenario by randomly replacing 25% of the genes in the RBH-predicted data set with the "next best RBH" hit (i.e. a true-negative). This randomized introduction of true-negatives was iterated at least 50 times, and each iteration was evaluated by Ortholuge. The proportion of true-negative orthologs was averaged over all iterations and the standard deviation determined. We found that, once again, the ratio values of true-negative orthologs do not overlap well with those of the bulk of RBH-predicted orthologous groups (Figure 8 and Supplemental Figures 1 and 2).

For both the prokaryotic and eukaryotic RBH-based data sets, this iterative true-negative analysis was used to determine ratio ranges where true paralogs were very unlikely to land and ranges where they were very likely to land.
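The iterative true-negative analysis can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Ortholuge implementation: it assumes the (Ratio1, Ratio2) values for each original group and for its "next best RBH" replacement have already been computed, whereas the analysis above re-runs Ortholuge on each iteration. The function name, the range predicates, the toy data, and the use of the sample standard deviation are all hypothetical choices.

```python
import random
import statistics
from typing import Callable, List, Tuple

Point = Tuple[float, float]  # (Ratio1, Ratio2) for one putative ortholog group


def true_negative_proportions(
    original: List[Point],
    next_best: List[Point],
    in_range: Callable[[Point], bool],
    fraction: float = 0.25,
    iterations: int = 50,
    seed: int = 0,
) -> Tuple[float, float]:
    """Mean and sample standard deviation, over iterations, of the proportion
    of groups inside a ratio range that are introduced true-negatives."""
    rng = random.Random(seed)
    n_replace = round(fraction * len(original))
    proportions = []
    for _ in range(iterations):
        # Randomly choose which groups are converted to true-negatives.
        replaced = set(rng.sample(range(len(original)), n_replace))
        total_in_range = 0
        negatives_in_range = 0
        for i in range(len(original)):
            is_negative = i in replaced
            point = next_best[i] if is_negative else original[i]
            if in_range(point):
                total_in_range += 1
                negatives_in_range += int(is_negative)
        proportions.append(negatives_in_range / total_in_range if total_in_range else 0.0)
    return statistics.mean(proportions), statistics.stdev(proportions)


# Hypothetical toy data: originals sit in the low-ratio cluster, and every
# next-best replacement has a high Ratio1.
original = [(0.2, 0.2)] * 100
next_best = [(3.0, 1.0)] * 100

low_mean, low_sd = true_negative_proportions(
    original, next_best, in_range=lambda p: p[0] < 1 and p[1] < 1)
high_mean, high_sd = true_negative_proportions(
    original, next_best, in_range=lambda p: p[0] >= 1 or p[1] >= 1)
print(low_mean, low_sd, high_mean, high_sd)
```

Sweeping the range predicate across successive ratio bins would give the per-range proportions from which cut-offs can be read; with real data the proportions vary between iterations, which is what the standard deviation is meant to capture.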
Theborders of these ranges (described in Figure 8 and Supple-mental Figures 1 and 2) became the ratio cut-off values.This permitted classification of the RBH-predicted tenta-tive orthologous groups into probable ssd-orthologs,probable paralogs, or "uncertain" categories. It should benoted that a more accurate name for the 'probable para-log' category might be 'probable non-ssd-ortholog,'because there may be true orthologs that have undergoneunusual divergence in one ingroup species within this cat-egory. However, in such cases the non-ssd-orthologs mayhave functionally diverged, and therefore are cases that wewould want to differentiate from our ssd-ortholog set.R1 × R2 plots, for the prokaryotic data, illustrating the effect of introducing outgrou  paralogs (ou group or holog true-nega ives) in the analysisFi u e 7R1 × R2 plots, for the prokaryotic data, illustrating the effect of introducing outgroup paralogs (outgroup ortholog true-negatives) in the analysis. Unlike for other figures of R1 × R2 plots in the paper, only ratio ranges from 0 to 2 are shown for each axis. (A) RBH-predicted orthologous groups. (B) Outgroup paralogs from a true-negative data set where all possible outgroups were replaced with next best RBH para-logs. They cannot be well distinguished from other orthologs, however, this is actually promising, since Ortholuge is in essence identifying orthologs between the ingroups only. 
This analysis shows that an outgroup paralog does not inter-fere greatly with the identification of true orthologs shared between the ingroups.Page 7 of 16(page number not for citation purposes)promising, since in the case of a paralog in an outgroup,the two ingroups should still be regarded as probable trueRegardless, for ease of comprehension, we propose to callthose cases with very atypical ratios (in the range of whatBMC Bioinformatics 2006, 7:270 http://www.biomedcentral.com/1471-2105/7/270is observed for paralogs) "probable paralogs", since para-logs likely predominate in this region.We chose a 25% true-negative introduction, since this islikely above a worst-case scenario in terms of the numberof genes that may be missing in an incomplete genome, ormost cases of naturally occurring gene loss. We felt it wasimportant to "saturate" the data set with true-negatives,because any given RBH-based dataset will likely containsome proportion of false-positives in the putative orthol-ogous groups (i.e. it is difficult to ensure one has a com-pletely true-positive set of orthologs). Therefore, toeffectively identify the ranges where true-negatives werebecoming increasingly more common we needed toobserve a large proportion of true-negatives. However, wedid not want to transform a data set with all possible true-negatives, as this would not provide a sense of the varia-tion in proportion of true-negatives within a given ratiorange. Note that we also chose to report the results herefor a transformation of an RBH-predicted data set with thetrue-negatives (i.e. a RefSeq-based RBH analysis), ratherthan a transforming a high quality dataset, since the Ref-Seq based analysis could be more easily fully automated(i.e. it did not require developing a curated set of highquality orthologs). However, transformation of a eukary-otic high quality dataset with true-negatives generatedsimilar cut-off values (data not shown). 
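The iterative introduction of true-negatives described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Ortholuge implementation (which was written in Perl). The function name, its parameters, and the two input lists are hypothetical. The sketch assumes that a ratio value has already been computed for every group twice: once as originally predicted, and once after swapping in the next best RBH hit.

```python
import random
import statistics


def iterative_true_negative(ortholog_ratios, paralog_ratios, fraction=0.25,
                            iterations=50, bin_width=0.5, seed=0):
    """Mean and standard deviation of the true-negative proportion per ratio bin.

    ortholog_ratios[i]: ratio of group i as originally predicted (hypothetical input).
    paralog_ratios[i]:  ratio of group i after replacing one gene with its
                        next best RBH hit (hypothetical input).
    """
    rng = random.Random(seed)
    n = len(ortholog_ratios)
    k = round(fraction * n)          # number of groups converted per iteration
    per_bin = {}                     # bin index -> one proportion per iteration

    for _ in range(iterations):
        swapped = set(rng.sample(range(n), k))
        totals, negatives = {}, {}
        for i in range(n):
            is_true_negative = i in swapped
            ratio = paralog_ratios[i] if is_true_negative else ortholog_ratios[i]
            b = int(ratio // bin_width)
            totals[b] = totals.get(b, 0) + 1
            negatives[b] = negatives.get(b, 0) + (1 if is_true_negative else 0)
        for b, total in totals.items():
            per_bin.setdefault(b, []).append(negatives[b] / total)

    # Bins that are empty in an iteration contribute no value for that iteration.
    return {b: (statistics.mean(v), statistics.pstdev(v))
            for b, v in sorted(per_bin.items())}
```

Each key of the returned dictionary is a bin index (bin `b` covers ratios from `b * bin_width` up to, but not including, `(b + 1) * bin_width`), and each value is the mean and standard deviation of the true-negative proportion in that bin across iterations.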
Through an iterative sampling approach we were able to generate standard deviations of the proportion of true-negatives in a given ratio range (Figure 8B), providing a clearer picture of the likelihood of a true-negative occurring in that range.

Ortholuge ratios in combination can help predict which gene in a given putative orthologous group is likely a paralog

A closer inspection of the Ortholuge ratios shows that they behave in a predictable fashion when the ortholog group contains one or more false-positives (Table 1). For example, if ingroup1 is actually a paralog, then the distance between ingroup1-outgroup and the distance between ingroup1-ingroup2 would be larger than the norm for an ssd-ortholog. This would cause Ratio2 to increase (the degree of increase would depend on how diverged the paralog is from the missing 'true' ortholog), and Ratio1 to increase a slighter amount (depending on how distant the outgroup is). Conversely, if ingroup2 is actually the paralog, then Ratio1 would be expected to increase and Ratio2 to increase slightly. These predictable changes do indeed occur, as illustrated by an analysis of true-negatives (Figure 6C and 6F), an analysis of a dataset of tentative orthologs identified by RBH using an incomplete genome (Figure 5C and 5F), and an additional manual review of selected cases (data not shown). We propose that when unusual ratio ranges are identified for a given orthologous group, the relative changes can facilitate predictions regarding which of the two ingroups may contain a paralog (or non-ssd-ortholog).

Note that an outgroup paralog cannot be well predicted; however, this does not affect the utility of Ortholuge, since the method is focused on characterizing the orthology of the two ingroups. It should also be noted that multiple-paralog scenarios (last three rows in Table 1) are more complex. Though relatively easy to predict on paper, they are more difficult to distinguish in reality, because the amount of divergence for the two paralogs may vary greatly. In most cases they would resemble one of the first three scenarios, depending on which of the two paralogs was more diverged. Nevertheless, in the end, these rare cases (two paralogs in a group of three) will still most frequently display atypical ratios, and will not fall within probable ortholog cut-offs.

Figure 8. Example of the generation of cut-offs for classification of ssd-orthologs and probable paralogs, based on an iterative-true-negative analysis (i.e. based on an introduction of random sets of true-negatives). The particular analysis illustrated here is a Ratio1 analysis for the mouse, rat, human RefSeq RBH dataset, with true-negatives introduced into the mouse (ingroup1) set. In panel A, the number of putative orthologous groups in each ratio range for the true-negative-transformed data set is shown for the whole data set (light shaded bars) and for the introduced true-negatives only (dark shaded bars). Note how the distribution of the data set differs from that of the true negatives (i.e. introduced paralogs). In panel B, the proportion of randomly introduced true-negatives at 0.5 ratio range intervals is used to formulate cut-offs (denoted by dashed lines) for classifying ssd-orthologs and probable paralogs for the analysis. For the ssd-orthologs cut-off (left-most dashed line), no more than 10% true negatives in a given ratio range are permitted for the ssd-orthologs range. For the probable paralogs cut-off (right-most dashed line) the proportion of true negatives is at or above 50 percent. The resulting middle region bounded by these two cut-off points establishes the "uncertain" orthology class ratio range. Dashed lines denoting these particular cut-offs are also illustrated on the figure in Panel A for reference. This approach for a true-negative analysis and cut-off generation is also performed for Ratio2 [Additional file 1], and the combination of cut-offs for Ratio1 and Ratio2 is used to classify putative orthologous groups from another data set (such as an RBH-predicted data set) into the three classification levels of "probable ssd-ortholog", "uncertain" and "probable paralogs". Panel C schematically shows the areas of an R1 × R2 plot that would be classified in this way, with the cut-off numbers in this particular example matching the RefSeq RBH-based mouse-rat-human analysis (see Table 2 for how these ranges are numerically determined).

Ortholuge in action: an estimation of probable ssd-orthologs and probable paralogs in RBH-based data sets

An example of ratio cut-offs generated based on our true-negative analysis is listed in Table 2 (see also Figure 8 and Supplemental Figures 1 and 2). Researchers are of course encouraged to choose their own cut-off to suit their needs (i.e. more sensitivity or specificity).
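For readers who wish to derive their own cut-offs, the rule stated in the Figure 8 caption (at most 10% true-negatives for the ssd-ortholog range, at least 50% for the probable paralog range) might be applied to binned true-negative proportions roughly as follows. This is a hedged Python sketch: the function and parameter names are hypothetical. How the paper places a cut-off within a bin is not stated, so the sketch assumes the ssd-ortholog cut-off is the upper edge of the last consecutive low bin and the probable paralog cut-off is the lower edge of the first high bin.

```python
def derive_cutoffs(bin_edges, tn_proportions, ssd_max=0.10, paralog_min=0.50):
    """Return (ssd_cutoff, paralog_cutoff) from per-bin true-negative proportions.

    bin_edges:      ascending ratio values, one more entry than tn_proportions.
    tn_proportions: mean proportion of introduced true-negatives in each bin.
    Either cut-off is None if no bin satisfies the corresponding rule.
    """
    # ssd-ortholog cut-off: extend rightwards while bins stay at or below ssd_max.
    ssd_cutoff = None
    for i, proportion in enumerate(tn_proportions):
        if proportion > ssd_max:
            break
        ssd_cutoff = bin_edges[i + 1]

    # Probable paralog cut-off: lower edge of the first bin at or above paralog_min.
    paralog_cutoff = None
    for i, proportion in enumerate(tn_proportions):
        if proportion >= paralog_min:
            paralog_cutoff = bin_edges[i]
            break

    return ssd_cutoff, paralog_cutoff
```

Raising `ssd_max` trades specificity for sensitivity, and lowering it does the reverse. The region between the two returned values corresponds to the "uncertain" class.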
However, based on our simulations, these cut-offs should effectively differentiate probable orthologs and paralogs for these data sets. We also propose that these cut-offs can identify those orthologs most closely following species divergence (i.e. ssd-orthologs): orthologs which may be more functionally similar to each other versus those that have diverged at different evolutionary rates in each species.

Table 1: Ortholuge ratios can help predict which gene in a given putative orthologous group is likely a paralog (a).

Probable paralog (a)              Ratio1                    Ratio2                    Ratio3
Ingroup1 paralog                  ↑ (smaller)               ↑                         (symbol lost)
Ingroup2 paralog                  ↑                         ↑ (smaller)               ↓
Outgroup paralog (b)              ↓                         ↓                         -
Ingroup1 & Ingroup2 paralogs      (symbol lost) or see c    (symbol lost) or see c    variable (d)
Ingroup1 & Outgroup paralogs      ↓                         variable (d)              (symbol lost)
Ingroup2 & Outgroup paralogs      variable (d)              ↓                         ↓

a Only selected scenarios are listed. Arrows indicate relative increases or decreases in a ratio value, when compared to the highest frequency values in a histogram plot (i.e. "expected" ratio value). Smaller arrows indicate that the increase is less. In the case of the ingroup1 or ingroup2 paralog scenarios, it will depend on how divergent the paralog is and how distant the outgroup is.
b Note that an outgroup paralog cannot be discriminated from cases of orthologs, nor does this analysis need to discriminate such cases (see text). However, this has been included in the table solely to illustrate how outgroup paralog cases can be discriminated (using Ratio 3) from cases where there is a combination of an ingroup1 (or ingroup2) paralog and an outgroup paralog.
c This scenario will resemble an ingroup1 paralog scenario or ingroup2 paralog scenario, if one of the two ingroup paralogs diverged much more than the other.
d The variation may be an increase or decrease, depending on which of the two paralogs is more diverged. Ratio 3 can help resolve such cases.

Using the derived ratio cut-offs, we have constructed several data sets of probable ssd-orthologs consisting of: mouse-rat comparisons (with human as the outgroup), and one for a P. putida-P. syringae comparison (with E. coli as the outgroup). These ssd-orthologs are particularly suited for comparative genomics analyses. In addition, notations are added to all the data analysed, indicating cases of probable gene duplication after species divergence ("possible in-paralog"), a scenario that can increase the likelihood of functional divergence of the genes. These higher quality sets of orthologs can be found via the Ortholuge website [28]. The proportion of ssd-orthologs in the RBH-predicted data sets is summarized in Table 2. Note that cases of in-paralogs are not counted within the counts of ssd-orthologs in Table 2. Such cases, due to their uncertain potential to have diverged in function because of a gene duplication, are counted within the "uncertain" category.

Using the cut-offs, we were also able to estimate the proportion of RBH-predicted orthologs that are likely paralogs for these eukaryotic and prokaryotic data sets (Table 2; see also data available on the Ortholuge website [28], which includes a classification of the EGO dataset using the RefSeq analysis cut-offs). For the prokaryotic data, about 5% of RBH-based predictions are probable paralogs. For the eukaryotic data, about 10% of the RBH-predictions are probable paralogs.
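The way such proportions follow from a pair of cut-offs per ratio can be illustrated with a short Python sketch. It is an illustration only, with hypothetical function names. The parameters a and b stand for the lower and upper Ratio1 cut-offs and c and d for the lower and upper Ratio2 cut-offs, following the naming in footnote f of Table 2. The sketch assumes the "or" form of the probable paralog rule given for the rat-mouse comparison, treats everything that is neither class as "uncertain", and does not model the separate in-paralog flag.

```python
from collections import Counter


def classify_group(r1, r2, a, b, c, d):
    """Assign one putative orthologous group to a class from its Ratio1/Ratio2."""
    if r1 <= a and r2 <= c:
        return "probable ssd-ortholog"
    if r1 > b or r2 > d:
        return "probable paralog"
    return "uncertain"


def class_proportions(ratio_pairs, a, b, c, d):
    """Fraction of groups in each class for a list of (Ratio1, Ratio2) pairs."""
    counts = Counter(classify_group(r1, r2, a, b, c, d) for r1, r2 in ratio_pairs)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

With the rat-mouse cut-offs listed in Table 2 (a = 0.60, b = 0.80, c = 0.55, d = 0.80), a group with Ratio1 = 0.7 and Ratio2 = 0.5 would be placed in the "uncertain" class by this sketch.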
These are significant numbers that validate the need for a method like Ortholuge, particularly if one is trying to use RBH-predicted orthologs for downstream analyses that require stringent ortholog prediction (for example, for regulatory element detection).

Application of these cut-offs to classify the curated eukaryote and prokaryote datasets suggests that the false negative rate is in the range of 0.7% for the prokaryote data and 3% for the eukaryote data.

To facilitate the analysis of other datasets, we have developed Ortholuge software that can be used to characterize any existing dataset of orthologs. If no pre-existing ortholog dataset is available, Ortholuge can also construct such a dataset using an RBH-based approach applied to whole genome datasets (or other adequate datasets of genes from three organisms that a user supplies). Ortholuge was developed using Perl under Linux (SuSE 9.0 and RH 9.0) and operates in any UNIX environment, provided all the needed tools (see Methods) are available for the user's operating system. This freely available, open source software can be obtained from the Ortholuge website [28].

Discussion

For cross-genome comparison purposes, researchers often wish to compare orthologs, in particular orthologs that have not undergone unusual divergence rates relative to one another and have more likely retained similar function. We propose that Ortholuge is an approach, suitable for high-throughput genome-scale analysis, which aids identification of such orthologs. The Ortholuge method significantly improves the specificity/precision of high-throughput RBH-based ortholog analysis. For example, our results indicate that roughly 1 in 10 RBH-predicted rat-mouse orthologs are very likely paralogs, and about 1 in 20 RBH-predicted orthologs for two Pseudomonas species are similarly likely incorrect. Note that our RBH analysis requires transitivity between three species, rendering it more stringent than the typical RBH analysis between two species.
This suggests that the typical RBH analysis may have an even greater number of false predictions. The resulting more specific identification of orthologs by Ortholuge is an important requirement for many downstream analyses, such as identifying gene regulatory regions, or characterizing differences in microarray-measured gene expression responses across species. An automated method such as Ortholuge is of course no substitute for a more manual, comprehensive phylogenetic analysis and has some limitations as mentioned below. However, its simplicity and utility for high-throughput analyses suggest that it is a useful complement to RBH-based identification of putative orthologs using whole-genome gene datasets. In addition, Ortholuge's higher specificity approach can complement other methods that may provide a higher sensitivity/recall approach for ortholog identification [13].

Ortholuge evaluates orthologs through phylogenetic distance comparisons. To perform such comparisons, an outgroup is required to assist the prediction of orthologs between the two ingroups, which has simultaneous advantages and disadvantages. The added sequence provides extra resolution and extra specificity; however, a distant outgroup may lessen the sensitivity of the approach. Presumably, though, as more genomes are sequenced, the number of possible outgroups available to choose from will increase and very distant outgroups will become less of a problem.

The Ortholuge pipeline generates predictions by evaluating the entire genome at once (or at least adequate gene representation for the species). The more data points that are representative of the genome, the more confident the ratio cut-offs will be. It assumes that the majority of incoming predictions are true orthologs, will exhibit expected ratios, and will thus form the high frequency ranges of the distributions. Our analysis does suggest this assumption to be reasonable and, notably, both eukaryotic and bacterial orthologs display similar ratio distributions, despite marked differences regarding how such organisms evolve.

Once the genome-wide predictions are made for a certain species combination, Ortholuge can be used to estimate how likely it is that a specific putative orthologous group contains a true-negative within its ingroups. In such cases, we can match these ratios with a category (i.e. classification shown in Table 2), to suggest which gene in the ortholog group is likely to be the paralog. However, it should be emphasized that at this time we have not exhaustively examined all possible scenarios, and so such analysis should be taken as a guide requiring further investigation. Interestingly, this method also appears to be useful for examining, on a genome-wide scale, the relationships between species. By examining the ratio values at the highest frequency ranges in the histograms, one can easily determine which two of any three organisms are more similar to each other, on average, and on a genome-wide scale (for example, that cattle genes are more similar to human genes than mouse genes are to human genes, on average).

The simplicity of Ortholuge allows for many benefits. For example, it can easily be re-run when genome annotations undergo significant changes. In addition, it can easily be customized with any method of sequence alignment or phylogenetic distance calculation, depending upon the researcher's preference. It is expected that further analysis will reveal relationships between true-negative analyses and ratio cut-off generation, negating the need to perform a full iterative-introduced-true-negative analysis for each species comparison.
Of course, users can choose their own Ortholuge ratio cut-offs, either using a true-negative analysis, or another approach of their choice, for identification of orthologs at their preferred level of specificity.

Accepting only orthologs in a certain ratio range and discarding the rest will certainly eliminate a small fraction of true orthologs from the input set. For example, if the probable paralog cut-offs are applied to the "high quality" curated prokaryotic and eukaryotic data sets, we eliminate 0.7% and 3% of the prokaryotic and eukaryotic predictions, respectively. However, if the more stringent ssd-ortholog cut-offs are applied, we eliminate 1.4% and 8% of the predictions, respectively. While these outliers may be false-positives in the curated data, they may also be true orthologs that have undergone unusual divergence in one ingroup species. For example, if a gene duplication occurred in one ingroup species after the speciation divergence, the resulting duplicated gene may undergo accelerated evolution [14]. Such scenarios would result in skewed ratios for true orthologs. However, we propose that such orthologs with unusual (relative) divergence may more likely have differing function at some level. In many genome-wide studies involving comparisons between species, researchers wish to identify those genes that are more likely to be functionally equivalent, i.e. orthologs that did not experience unusual rates of evolution or gene transfer. Ortholuge improves the identification of such "supporting species divergence" ortholog pairs (i.e. ssd-orthologs).

Table 2: Proportion of RBH-predicted (a) orthologs that are likely ssd-orthologs (b) and likely paralogs, according to Ortholuge analysis.

For each data set (c) and class, the three values are: ratio range (c); proportion of introduced true-negatives in a true-negative analysis (d); proportion of RBH-predicted orthologs (e).

Rat-mouse comparison (human outgroup)
  Probable ssd-ortholog:    R1 ≤ 0.60 and R2 ≤ 0.55;  0.8%;  76%
  Orthology uncertain (f):  see footnote f;  16%;  14%
  Probable paralog:         R1 > 0.80 or R2 > 0.80;  77% (d);  10%

P. putida-P. syringae comparison (E. coli outgroup)
  Probable ssd-ortholog:    R1 ≤ 0.55 and R2 ≤ 0.70;  1.3%;  91%
  Orthology uncertain (f):  see footnote f;  24%;  4%
  Probable paralog:         R1 > 0.75 and R2 > 0.85;  87%;  5%

a RBH-predicted = Predicted to be orthologous using a Reciprocal-best BLAST hit approach.
b "Supporting-species-divergence orthologs" = orthologs that appear to have diverged only due to speciation and have diverged at an expected relative rate for the species. Such orthologs are likely to have more similar function. See text for details.
c Ratio Range for both Ratio1 (R1) and Ratio2 (R2). See Figure 8C for a schematic illustration of the cut-off ranges on a R1 × R2 plot.
d Proportion of introduced true-negatives for the 25% true-negative analysis is shown here; however, the actual number of true-negatives will be higher due to false-positives likely occurring in the original ortholog dataset. This analysis was used to estimate % false predictions in range (see text and Figure 8).
e RBH-predicted data sets were examined using the cut-offs generated by the true-negative analysis, to identify what proportion of all RBH-predicted orthologs fell within each range. For the rat-mouse comparison 6294 RefSeq-based groups were classified into "probable ssd-ortholog", "uncertain", and "probable paralog" classes. For the Pseudomonas comparison, a total of 1456 groups were classified. Note that for an analysis of the EGO-based rat-mouse data set of 19,200 groups with the same cut-offs, 76% ssd-orthologs and 16% probable paralogs were predicted (when in-paralogs were not counted, because of the lack of differentiation of gene isoforms in the EGO data set).
f This "uncertain" category falls between the other two ranges and is graphically illustrated, for ease of understanding, in Figure 8C. This category follows the formula (R1 > a and R1 < b and R2 < d) or (R2 > c and R2 < d and R1 < a), where a and b are the lower and upper cut-off values, respectively, for Ratio1 (i.e. lower = cut-off for ssd-orthologs and higher = cut-off for probable paralogs), and c and d are the lower and upper cut-off values, respectively, for Ratio2. Note this "uncertain" category also contains counts of in-paralogs detected (7% of eukaryotic data, and negligible for prokaryotic data); see text for details.

This is apparently an important issue, as illustrated by some confusion occurring in the literature regarding the definition of orthologs. The definition that we, and many evolutionary biologists, use is the one initially proposed [1] that describes orthologs as genes that have diverged due to speciation (rather than due to gene duplication, which describes paralogs). However, the term ortholog is increasingly being inferred to mean 'functionally equivalent genes in different species', a common misconception [15]. While we and others agree that orthologs tend to have similar function, this is not a requirement for orthology [16]. So, it appears that while many researchers are identifying orthologs in a genetic or genomic study, what they really wish to identify is the subset of orthologs that are specifically functionally equivalent.

Some methods, such as the widely used INPARANOID, refer to all in-paralogs (i.e. genes created by gene duplication after the species divergence) in the one species as orthologous to the related gene in the other species.
They do not clearly distinguish between such cases of in-paralogy and more simple one-to-one orthologous relationships. We believe that such cases should be differentiated because a duplication event after species divergence may have led to significant functional divergence of one or both of the duplicated genes in the one species. In Ortholuge, cases involving possible in-paralogs are flagged using a simple analysis that focuses on detecting the most clear-cut in-paralog cases. For our analysis, we did apply ratio cut-offs derived using one-to-one ortholog RBH-based (RefSeq) data to classify a same species (EGO) data set that includes both one-to-one orthologs and many-to-many orthologs. However, we recognize a need to implement more robust procedures that would consider all cases of suspected recent gene duplications in the analysis (the current method is subject to the limitations of the initial RBH-based ortholog identification). It would also be desirable to complement this analysis further by noting cases of relative gene rearrangement in the input set of orthologs. Ortholuge in its current form cannot detect gene rearrangements; however, it could potentially complement other bioinformatics approaches that detect such rearrangements [17]. Ortholuge could also be adapted to contain a gene rearrangement analysis that is customized to its methodology. These additional ortholog evolutionary scenarios, involving possible in-paralogy or gene rearrangements, should be specifically noted because they cannot be distinguished by examining Ortholuge distance ratios alone. They require further study in any comparative analysis, since functional equivalence between the orthologs is less likely.

Regarding the limitations of this method, it should be emphasized that Ortholuge is limited by the quality of the initial ortholog analysis (i.e. RBH can miss cases of true orthology, and some data sets such as those from EGO are incomplete and don't clarify which genes are isoforms, which complicates in-paralog analysis). Ortholuge is also only as good as the quality of the sequence data being analyzed. We have tested our alignment trimming and masking of regions of lower alignment quality extensively to improve the critical sequence alignment component of our method; however, certainly this method will fail if low quality sequences, with many errors, are used in the analysis. In addition, the top BLAST hit is not necessarily the nearest neighbour [18], and so true orthologs may be missed when using Ortholuge after initially identifying orthologs with an RBH-based approach. Ortholuge could therefore improve if the initial ortholog prediction method is improved (it should be emphasized that Ortholuge can be used with any input dataset of proposed orthologs deduced by any current or future ortholog prediction methodology, not just the ones presented).

Regardless of any limitations, Ortholuge appears to effectively improve the specificity of ortholog identification and is suitable for high-throughput, genome-wide use. Given the amount of genomics data being obtained at this time, such specific, high-throughput approaches will become increasingly necessary, as genomics research moves further toward more multi-genome comparative analyses.

Conclusion

Ortholuge improves the specificity of ortholog identification and is suitable for high-throughput use. This precise ortholog prediction method complements other ortholog prediction methods that are not focused on precision, and it potentially identifies those orthologs most likely to be functionally similar. The Ortholuge method provides important data set evaluation for a variety of analyses based on comparative approaches, including gene function prediction, prediction of conserved regulatory elements, and comparative analysis of gene order or gene/protein expression data.

Methods

Data sets

1. Eukaryotic Gene Orthologs (EGO) RBH data set

EGO release 8 database was obtained from The Institute for Genomic Research (TIGR) [5]. This database is composed of two files: 1) one file housing ortholog identifiers and tentative consensus sequence (TC) identifiers and 2) a second file housing TC sequences in FASTA format. Both files were used to extract and create 19,200 unique mouse, rat, human tentative ortholog gene set files (TOGs) for Ortholuge analysis.

2. Eukaryotic curated orthologs ("high quality" MGD data set)

The Mouse Genome Database (MGD) is a comprehensive, high-quality database which currently includes orthology information for mouse, human, rat, and 14 other mammals [19]. Orthology annotations are manually curated from scientific literature and each orthology assertion is based on criteria recommended by the Human Genome Organisation (HUGO).

A program was developed to extract the orthologous gene pairs from the MGD Sybase database for two species triples: 1) mouse, rat, human 2) cattle, human, mouse. All relevant human, mouse, and rat RefSeq [20] mRNA sequences and protein sequences were obtained from the National Center for Biotechnology Information (NCBI) FTP site along with the Locus Link RefSeq mappings file. FASTA-formatted ortholog sets were created for those ortholog pairs that satisfied a transitive, triple ortholog relationship and had corresponding RefSeq sequences annotated with a reviewed or validated status. 2642 mouse, rat, human mRNA, 2499 mouse, rat, human protein, and 427 cattle, human, mouse mRNA ortholog sets were created.

3. Eukaryotic Gene Orthologs (EGO) "lower quality" RBH set

Cattle, human, and mouse ortholog groups, totaling 16,134 in number, were extracted from the EGO release 8 database.
The cattle genome was incomplete at this time and thus we expected more incorrectly predicted orthologs by the RBH method (see Fig. 1 for the scenario).

4. Eukaryotic RefSeq-based RBH ortholog set
The species-specific mouse, rat, and human RefSeq files were obtained from the NCBI FTP site and BLAST databases [2] were constructed for each file. A pairwise blastall analysis was performed between each species enforcing a 10e-04 E value cut-off. 6294 ortholog FASTA-formatted sets were created from transitive, best-hit mRNA RefSeqs. We allowed one unique best-hit isoform per Locuslink ID in the RBH dataset.

5. Eukaryotic RBH Tentative Consensus (TC) ortholog set involving cattle
A higher-quality, non-redundant RBH TC dataset was established using the cattle, human, and mouse tentative sequences found in the EGO release 8 database. The transitive, triple reciprocal top best BLAST hit for each unique cattle TC was used to form 15,660 ortholog groups. This approach served to reduce the over-representation of TCs found in the currently established set of EGO tentative ortholog groups (TOGs) due to the allowance of multiple RBH relationships within a specified cut-off.

6. Bacterial RBH-predicted data sets
Protein sequences of Escherichia coli K12 [10], Pseudomonas syringae pv. tomato str. DC3000 [11], and Pseudomonas putida KT2440 [12] were obtained from NCBI. For the RBH analysis, first a BLASTp was performed between all pair-wise combinations, with an E-value cut-off of 10e-04. Genes that retained a transitive reciprocal best hit property and passed the BLAST cut-off were retained. There were 1456 ortholog groups constructed.

7. Bacterial higher quality orthologs
A set of higher quality orthologs was constructed from a set provided by Lerat et al. [21], who found all the gene families in 13 gamma-proteobacteria genomes that had exactly one gene per species. For simplicity, we chose those that had annotated gene names in each of our three chosen bacterial species. Initially, there were 156 ortholog groups, and of these 143 ortholog groups passed our automated alignment editing stage.

8. OrthoMCL eukaryote ortholog dataset
The OrthoMCL database files were downloaded [22] and a set of mouse-rat-human ortholog triples was extracted from the OrthoMCL clusters. These predicted ortholog groups were analyzed using the Ortholuge analysis software.

Through our analyses, we observed that the use of nucleotide sequences provided better resolution for these particular sets of eukaryotic data, at the level of divergence being examined using Ortholuge (see Figure 4 of [Additional file 1]), whereas protein sequences provided better resolution for the particular bacterial data we were analyzing (data not shown). Consequently, all analyses below were performed using nucleotide sequences for eukaryote analysis and protein sequences for the given prokaryote analysis.

Ortholuge analysis pipeline
The input parameters for Ortholuge include a list of tentative ortholog species groups with sets of FASTA-formatted sequences for each respective gene/protein in the tentative orthologs set. A flowchart overview of this pipeline can be seen in Figure 2. If ortholog groups have not been pre-determined, the Ortholuge software we developed is capable of calculating an initial list of tentative orthologous groups, using the RBH approach. In this latter case, the input required is a FASTA-formatted list of sequences from genes predicted in three genomes to be examined (two sequences to be compared, one reference sequence as an outgroup). Note that whole-genome data does not necessarily need to be used; however, the dataset should be large enough to ensure that the distribution of relative evolutionary distances will centre around what is likely the true median for the relative evolutionary distance for the given organisms being examined.

1. Sequence alignments
Initial alignments of the genes/proteins for each tentative ortholog group are generated using CLUSTALW [23] with either DNA or PROTEIN alignment options. All other parameters are default.

2. Automated alignment editing
All alignment overhangs and poly-A tails are removed in each aligned set of sequences. An alignment must be aligned over 300 base pairs (bp) or 100 amino acids (aa) or it is discarded from the analysis. This choice of threshold was based on previous studies that have suggested that it is more likely that a sequence codes for a protein if its length is over 100 amino acids [24].

Gap masking is performed to remove ambiguously aligned gap-flanking regions. A sample of RBH-predicted ortholog sequence sets was examined to identify both gaps introduced by misalignments and gaps introduced through sequence insertions and deletions. Gap-masking simulations using various window length intervals were applied to the aligned sequences to establish a gap-masking approach. Our approach entails running a sliding 25-base pair window over the aligned sequences in both directions to assess gap percentages exceeding a 40% gap threshold. The window size and gap threshold were chosen such that overlapping windows exceeding the gap threshold would produce a worst-case gap masked region of 49 base pairs.

Both the trimming and gap-masking methods were evaluated for the introduction of ratio distribution biases by selected alignment characteristics. Selected characteristics of both trimmed and gap masked alignments were recorded and analyzed to determine whether the automated alignment editing process had created a ratio distribution bias for certain alignment characteristics. These characteristics included: number of aligned base pairs, identity over aligned length, identity over left and right ends, proportion of gaps over full length, proportion of gaps over left and right ends.
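
The sliding-window gap masking described above can be illustrated with a short sketch. It is not the Ortholuge implementation: the function names `gap_mask` and `apply_mask` are hypothetical, and the sketch assumes that a column counts as a gap column when any sequence has a gap there, that a window is flagged when more than 40% of its columns are gap columns, and that every column of a flagged window is masked. The text does not spell out these details, so the masked extent produced here may differ from the authors' (for instance, from the stated 49 base pair worst case).

```python
from typing import List


def gap_mask(aligned: List[str], window: int = 25, threshold: float = 0.40) -> List[bool]:
    """Return one flag per alignment column; True means the column is masked.

    Assumptions (not taken from the paper): a column is a "gap column" if any
    sequence has '-' at that position, and all columns of a window whose
    gap-column fraction exceeds `threshold` are masked.
    """
    if not aligned or len(aligned[0]) == 0:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    gap_col = [any(seq[i] == "-" for seq in aligned) for i in range(length)]
    masked = [False] * length

    # Sliding with step 1 visits every window position, so a second
    # right-to-left pass would flag the same windows under these assumptions.
    for start in range(0, max(length - window, 0) + 1):
        end = min(start + window, length)
        gap_fraction = sum(gap_col[start:end]) / (end - start)
        if gap_fraction > threshold:
            for i in range(start, end):
                masked[i] = True
    return masked


def apply_mask(aligned: List[str], masked: List[bool]) -> List[str]:
    """Drop the masked columns from every sequence of the alignment."""
    return ["".join(c for c, m in zip(seq, masked) if not m) for seq in aligned]
```

Calling `gap_mask` on the sequences of one tentative ortholog group and passing the result to `apply_mask` yields the columns that would go on to the distance calculation under these assumptions.
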
Here we defined end length as MIN(.25 * alignment length, 150 bp/50aa). See also Figure 3.

3. Sequence distances and calculation of Ortholuge ratios
The EDNADIST or EPROTDIST programs of EMBOSS [25] and PHYLIP 3.6 [26] software, respectively, were used to compute the nucleotide or protein distances. We opted to analyze our data using the Kimura distance formula due to its simplicity and computational efficiency. We used a conservative transition/transversion rate of 2 as an approximation, although studies do suggest that transition/transversion rates are context dependent [27]. All other parameters were defaults. The phylogenetic distances were used to compute the three ratios, Ratio1, Ratio2, and Ratio3, as described in Figure 2.

Ratios are then displayed manually in two forms: as histograms and as R1 × R2 plots. The ratio frequencies are enumerated for a given interval and histograms are constructed for all three ratios, visually displaying the ratio frequencies of tentative orthologous groups within a ratio of 2.5. The R1 × R2 plots are comprised of an x-y plot of Ratio1 versus Ratio2 which facilitates visualization of the full ratio distribution range (though zoomed-in versions of these plots up to ratio values of 2.0 are also provided to facilitate viewing data in low ranges in this format).

True-negative introduction analyses
Mean/iterative true-negative analysis
For the eukaryotic RefSeq-based RBH ortholog dataset, a selected proportion of the ortholog sets were randomly transformed to true-negative ortholog sets and then run through the Ortholuge analysis. We report here the results for a 25% transformation of the data, though other percentages were examined (data not shown). To do this transformation (introduction of true-negatives), the full set of mouse, rat, and human RefSeq sequence files were obtained from NCBI and a pairwise best-hits list was created using a pairwise blastall analysis with a 10e-4 E-value cut-off. An orthologous set was transformed to a true-negative by replacing one of the species sequences with another sequence that had a greater (next highest) BLAST expect value and which still satisfied a reciprocal and transitive best BLAST hit with the two other sequences in the orthologous set. In essence, we were removing an RBH-predicted ortholog and identifying another gene that could satisfy an RBH relationship. This essentially simulated what could happen if a gene was lost in one genome, or a genome sequence was incomplete, by removing a gene from a proposed ortholog "triple" and determining what the next RBH relationship would be for the remaining genes in the triple. Care was taken to ensure that the set of original sequences in the higher quality set being transformed initially satisfy an RBH relationship. Furthermore, the algorithm mandated that the non-orthologous replacement is not an isoform of the replaced sequence. Each transformed dataset was then run through the Ortholuge analysis. Such true-negative transformations were iteratively performed 50 to 100 times for each true-negative percentage proportion. A mean true-negative value and standard deviation for each ratio value in the distribution could then be calculated. Note that this same approach was also used to perform an iterative true-negative analysis for other eukaryotic data sets, and the prokaryotic data.

The true-negative mean and standard deviations were analyzed to establish conservative ratio cut-offs and estimate false-positive proportions. A three-level classification system for true mean false-positive values over defined ratio intervals was derived from this analysis (see the "Establishing Cut-offs" Methods section, below). The number of ratios in a given set falling into each level (i.e. "probable ssd-ortholog", "uncertain", and "probable paralog" classes) was counted.

True-negative introduction in the bacterial set – an introduction of all possible true-negatives
For the bacterial RBH-predicted data set, P. syringae genes (ingroup2) were replaced with their next best reciprocal BLAST hit to P. putida (within a 10e-4 E value cut-off), wherever possible. 668 out of 1456 ortholog triples were transformed into true-negative triples for this dataset. There are no iterations necessary here, so the transformed data set was then run through Ortholuge once.

Establishing cut-offs for Ortholuge-predicted "probable paralogs", uncertain, and probable ssd-orthologs
Researchers are of course encouraged to use the above true-negative analysis to formulate their own cut-offs, since cut-offs of differing levels of sensitivity and specificity are possible. In our example analysis, we examined the iterative/mean true-negative analysis for a eukaryotic and prokaryotic dataset using a histogram and examined the data in terms of the proportion of introduced true-negatives identified in each ratio range. These percentages are used to aid in identifying cut-offs for more specific (precise) identification of probable orthologs (or ssd-orthologs) and probable paralogs. We examined the trend manually, and opted to identify "probable ssd-orthologs" as those occurring in ratio ranges where there were, on average, only between 0–10% introduced true-negatives (out of the total number of tentative orthologous groups in the range; see Results, Figure 8). Tentative orthologous groups falling in ratio ranges that contained between 10 to 50% introduced true-negatives (on average) were classified as orthology "uncertain".
Finally, groups falling in ratio ranges that contained greater than 50% introduced true-negatives were classified as "probable paralog". We chose this cut-off because it was at this point that the transition from few introduced true-negatives in a range, to mostly introduced true-negatives in a range, increased significantly. Note that at this point there will also likely be some true-negatives occurring in the analyzed dataset (as illustrated also by our "higher quality" data set analyses), and so the actual proportion of true-negatives at this probable paralog cut-off point will likely be much higher. As mentioned in the results, we opted to perform this analysis on completely automated RBH data (RefSeq-based), rather than high quality data, since we appear to be able to obtain meaningful results, while being able to take advantage of the automated nature of RBH data set generation. However, we did also perform this analysis on the high quality data set, and on any EGO data set, generating comparable results.

Identification of in-paralogs
For those tentative orthologs predicted by Ortholuge to be ssd-orthologs (and also for other classes as well, in case researchers wish to use other cut-offs), we performed an additional analysis to identify cases of in-paralogy that may affect the possible functional equivalence of the ssd-orthologs (see the Introduction for a discussion of this issue). To do this, we combined both ingroup species' sequences into one database and performed a BLAST analysis using all the individual sequences from each of the ingroup species as a query. We then identified individual sequence cases in which top hits (other than a query sequence self-hit) were to another sequence in its own species. If the bit score for this same-species hit was greater than the bit score for the other species hit, then the case was flagged as an in-paralog candidate (i.e. a gene duplication may have occurred after the speciation, potentially affecting the function of the ssd-ortholog). Any such in-paralog cases were classified under the "uncertain" category, unless they had been classified, according to a Ratio1 and Ratio2 analysis, as belonging to the "probable paralog" category (in the latter case they would remain in the probable paralogs category). Note that this analysis only identifies a proportion of all cases – in particular very clear cut cases. It does not identify all possible in-paralogs and researchers are encouraged to investigate any such cases more thoroughly.

Authors' contributions
BH and FSLB developed the initial framework for this computational method and FSLB led formation of the final draft of the manuscript. DLF and YYL developed the final methodology, performed the analyses of the selected data sets, and drafted the initial versions of the manuscript, with each focusing their research and analyses on eukaryotic and prokaryotic data, respectively, during their research rotations. FMR participated in the design of this study and provided critical input that improved this work. MRL used initial scripts developed by DLF and YYL to develop a software package that will perform the main component of the Ortholuge analysis. All authors read and approved the final manuscript.

Additional material
Additional file 1
Supplementary Figures. Supplementary Figure 1. Ratio1, Ratio2 and Ratio3 histograms of the P. putida – P. syringae – E. coli putative orthologous sets summarizing results of a true negative introduction analysis. Supplementary Figure 2. Ratio2 and Ratio3 histograms of the mouse-rat-human putative orthologous sets indicating the average proportion of true negatives observed in our simulation of an incomplete genome through the iterative introduction of a mouse (ingroup1) paralog in randomly selected ortholog sets. Supplementary Figure 3. Histograms of Ortholuge Ratios 1, 2, and 3 for the mouse-rat-human RBH RefSeq nucleotide dataset. Supplementary Figure 4. Histograms of Ortholuge Ratios 1, 2, and 3 for the mouse-rat-human OrthoMCL protein dataset.
[http://www.biomedcentral.com/content/supplementary/1471-2105-7-270-S1.pdf]

Acknowledgements
The authors wish to thank members of the Brinkman Laboratory for helpful discussions and technical assistance. FSLB is a Canadian Institutes of Health Research New Investigator (CIHR) and Michael Smith Foundation for Health Research (MSFHR) Scholar. DLF and YYL are CIHR/MSFHR Bioinformatics Training Program for Health Research award recipients. All other authors of this work, as well as computer hardware resources utilized for this project, were supported by the Functional Pathogenomics of Mucosal Immunity Project and Pathogenomics of Innate Immunity Project (funded by Genome Canada/Genome Prairie/Genome BC and Inimex Pharmaceuticals) and by IBM and Sun Microsystems.

References
1. Fitch WM: Distinguishing homologous from analogous proteins. Syst Zool 1970, 19:99-113.
2. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403-410.
3. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: An updated version includes eukaryotes. BMC Bioinformatics 2003, 4:41.
4. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: New developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001, 29:22-28.
5. Lee Y, Sultana R, Pertea G, Cho J, Karamycheva S, Tsai J, Parvizi B, Cheung F, Antonescu V, White J, Holt I, Liang F, Quackenbush J: Cross-referencing eukaryotic genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res 2002, 12:493-502.
6. Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 2001, 314:1041-1052.
7. O'Brien KP, Remm M, Sonnhammer EL: Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 2005, 33:D476-480.
8. Kunin V, Ouzounis CA: The balance of driving forces during genome evolution in prokaryotes. Genome Res 2003, 13:1589-1594.
9. Zhang P, Gu Z, Li WH: Different evolutionary patterns between young duplicate genes in the human genome. Genome Biol 2003, 4:R56.
10. Blattner FR, Plunkett G 3rd, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, Gregor J, Davis NW, Kirkpatrick HA, Goeden MA, Rose DJ, Mau B, Shao Y: The complete genome sequence of escherichia coli K-12. Science 1997, 277:1453-1474.
11. Buell CR, Joardar V, Lindeberg M, Selengut J, Paulsen IT, Gwinn ML, Dodson RJ, Deboy RT, Durkin AS, Kolonay JF, Madupu R, Daugherty S, Brinkac L, Beanan MJ, Haft DH, Nelson WC, Davidsen T, Zafar N, Zhou L, Liu J, Yuan Q, Khouri H, Fedorova N, Tran B, Russell D, Berry K, Utterback T, Van Aken SE, Feldblyum TV, D'Ascenzo M, Deng WL, Ramos AR, Alfano JR, Cartinhour S, Chatterjee AK, Delaney TP, Lazarowitz SG, Martin GB, Schneider DJ, Tang X, Bender CL, White O, Fraser CM, Collmer A: The complete genome sequence of the arabidopsis and tomato pathogen pseudomonas syringae pv. tomato DC3000. Proc Natl Acad Sci U S A 2003, 100:10181-10186.
12. Nelson KE, Weinel C, Paulsen IT, Dodson RJ, Hilbert H, Martins dos Santos VA, Fouts DE, Gill SR, Pop M, Holmes M, Brinkac L, Beanan M, DeBoy RT, Daugherty S, Kolonay J, Madupu R, Nelson W, White O, Peterson J, Khouri H, Hance I, Chris Lee P, Holtzapple E, Scanlan D, Tran K, Moazzez A, Utterback T, Rizzo M, Lee K, Kosack D, Moestl D, Wedler H, Lauber J, Stjepandic D, Hoheisel J, Straetz M, Heim S, Kiewitz C, Eisen JA, Timmis KN, Dusterhoft A, Tummler B, Fraser CM: Complete genome sequence and comparative analysis of the metabolically versatile pseudomonas putida KT2440. Environ Microbiol 2002, 4:799-808.
13. Zheng XH, Lu F, Wang ZY, Zhong F, Hoover J, Mural R: Using shared genomic synteny and shared protein functions to enhance the identification of orthologous gene pairs. Bioinformatics 2005, 21:703-710.
14. Castillo-Davis CI, Hartl DL, Achaz G: Cis-regulatory and protein evolution in orthologous and duplicate genes. Genome Res 2004, 14:1530-1536.
15. Jensen RA: Orthologs and paralogs – we need to get it right. Genome Biol 2001, 2:INTERACTIONS1002.
16. Fitch WM: Homology a personal view on some of the problems. Trends Genet 2000, 16:227-231.
17. Brudno M, Malde S, Poliakov A, Do CB, Couronne O, Dubchak I, Batzoglou S: Glocal alignment: Finding rearrangements during alignment. Bioinformatics 2003, 19(Suppl 1):i54-62.
18. Koski LB, Golding GB: The closest BLAST hit is often not the nearest neighbor. J Mol Evol 2001, 52:540-542.
19. Eppig JT, Bult CJ, Kadin JA, Richardson JE, Blake JA, Anagnostopoulos A, Baldarelli RM, Baya M, Beal JS, Bello SM, Boddy WJ, Bradt DW, Burkart DL, Butler NE, Campbell J, Cassell MA, Corbani LE, Cousins SL, Dahmen DJ, Dene H, Diehl AD, Drabkin HJ, Frazer KS, Frost P, Glass LH, Goldsmith CW, Grant PL, Lennon-Pierce M, Lewis J, Lu I, Maltais LJ, McAndrews-Hill M, McClellan L, Miers DB, Miller LA, Ni L, Ormsby JE, Qi D, Reddy TB, Reed DJ, Richards-Smith B, Shaw DR, Sinclair R, Smith CL, Szauter P, Walker MB, Walton DO, Washburn LL, Witham IT, Zhu Y, Mouse Genome Database Group: The Mouse Genome Database (MGD): from genes to mice – a community resource for mouse biology. Nucleic Acids Res 2005, 33:D471-475.
20. Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequence (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2005, 33:D501-504.
21. Lerat E, Daubin V, Moran NA: From gene trees to organismal phylogeny in prokaryotes: The case of the gamma-proteobacteria. PLoS Biol 2003, 1:E19.
22. Chen F, Mackey AJ, Stoeckert CJ Jr, Roos DS: OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res 2006, 34:D363-368.
23. Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence alignment with the clustal series of programs. Nucleic Acids Res 2003, 31:3497-3500.
24. Brinkman FS, Blanchard JL, Cherkasov A, Av-Gay Y, Brunham RC, Fernandez RC, Finlay BB, Otto SP, Ouellette BF, Keeling PJ, Rose AM, Hancock RE, Jones SJ, Greberg H: Evidence that plant-like genes in chlamydia species reflect an ancestral relationship between chlamydiaceae, cyanobacteria, and the chloroplast. Genome Res 2002, 12:1159-1167.
25. Rice P, Longden I, Bleasby A: EMBOSS: The european molecular biology open software suite. Trends Genet 2000, 16:276-277.
26. Felsenstein J: PHYLIP-phylogeny inference package. Cladistics 1989, 5:164-166.
27. Hwang DG, Green P: Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc Natl Acad Sci U S A 2004, 101:13994-14001.
28. Ortholuge [http://www.pathogenomics.ca/ortholuge/]
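
As an illustration of the cut-off scheme described under "Establishing cut-offs" in the Methods, the following sketch computes, for each ratio range, the proportion of introduced true-negatives and maps it to the three classes. It is a simplified, single-iteration version (the analysis in the Methods averages over 50 to 100 iterations) and is not the Ortholuge code. The function names, the default bin width, and the handling of proportions exactly equal to 10% or 50% are assumptions that the text does not specify.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def true_negative_proportions(
    groups: Iterable[Tuple[float, bool]], bin_width: float = 0.1
) -> Dict[int, float]:
    """Map each ratio bin index to the fraction of its groups that are
    introduced true-negatives.

    `groups` holds (ratio, is_introduced_true_negative) pairs, one per
    tentative orthologous group. The bin width is an assumed value.
    """
    totals: Dict[int, int] = defaultdict(int)
    negatives: Dict[int, int] = defaultdict(int)
    for ratio, is_negative in groups:
        bin_index = int(ratio // bin_width)
        totals[bin_index] += 1
        if is_negative:
            negatives[bin_index] += 1
    return {b: negatives[b] / totals[b] for b in totals}


def classify(proportion: float) -> str:
    """Assign a class from the proportion of introduced true-negatives.

    Boundary handling (<= 10%, <= 50%) is an assumption of this sketch.
    """
    if proportion <= 0.10:
        return "probable ssd-ortholog"
    if proportion <= 0.50:
        return "uncertain"
    return "probable paralog"
```

Each tentative orthologous group would then take the class of the ratio range it falls in; researchers choosing different cut-offs would only need to change the two thresholds in `classify`.
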

