ralssBioMed CentBMC BioinformaticsOpen AcceResearch articleCore Hunter: an algorithm for sampling genetic resources based on multiple genetic measuresChris Thachuk*1, José Crossa*2, Jorge Franco4,2, Susanne Dreisigacker3, Marilyn Warburton3,5 and Guy F Davenport2Address: 1Department of Computer Science, University of British Columbia, 2366 Main Mall, Vancouver, BC V6T1Z4, Canada, 2Crop Research Informatics Laboratory, International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, 06600, México D.F., México, 3Applied Biotechnology Center, International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, 06600, México D.F., México, 4International Institute of Tropical Agriculture (IITA), Ibadan, Nigeria and 5USDA-ARS-CHPRRU, Box 9555, Mississippi State, MS 39762, USAEmail: Chris Thachuk* - cthachuk@cs.ubc.ca; José Crossa* - j.crossa@cgiar.org; Jorge Franco - j.franco@cgiar.org; Susanne Dreisigacker - s.dreisigacker@cgiar.org; Marilyn Warburton - m.warburton@cgiar.org; Guy F Davenport - g.davenport@cgiar.org* Corresponding authors AbstractBackground: Existing algorithms and methods for forming diverse core subsets currently addresseither allele representativeness (breeder's preference) or allele richness (taxonomist's preference).The main objective of this paper is to propose a powerful yet flexible algorithm capable of selectingcore subsets that have high average genetic distance between accessions, or rich genetic diversityoverall, or a combination of both.Results: We present Core Hunter, an advanced stochastic local search algorithm for selectingcore subsets. Core Hunter is able to find core subsets having more genetic diversity and betteraverage genetic distance than the current state-of-the-art algorithms for all genetic distance anddiversity measures we evaluated. Furthermore, Core Hunter can attempt to optimize any numberof genetic measures simultaneously, based on the preference of the user. Notably, Core Hunter isable to select significantly smaller core subsets, which retain all unique alleles from a referencecollection, than state-of-the-art algorithms.Conclusion: Core Hunter is a highly effective and flexible tool for sampling genetic resources andestablishing core subsets. Our implementation, documentation, and source code for Core Hunteris available at http://corehunter.orgBackgroundGenetic resources stored in gene banks are usually sam-pled with the purpose of evaluating and utilizing themefficiently, as well as studying phenotypic and genotypicdiversity, identifying duplicate accessions, and formingcore subsets. The aim of the latter activity is to preserve inon varying criteria including phenotypic traits or variousforms of molecular marker data including, but not limitedto, single nucleotide polymorphisms (SNP), amplifiedfragment length polymorphisms (AFLP), random ampli-fied polymorphic DNA (RAPD), and simple sequencerepeats (SSR). A simple example considering SNP data isPublished: 6 August 2009BMC Bioinformatics 2009, 10:243 doi:10.1186/1471-2105-10-243Received: 2 December 2008Accepted: 6 August 2009This article is available from: http://www.biomedcentral.com/1471-2105/10/243© 2009 Thachuk et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.Page 1 of 13(page number not for citation purposes)the sample as much of the diversity present in the originalcollection as possible. Core subset selection can be basedgiven in Figure 1. The concept of core collections (or coresubsets) was introduced to increase the efficiency of char-BMC Bioinformatics 2009, 10:243 http://www.biomedcentral.com/1471-2105/10/243acterizing and utilizing the collections stored in genebanks while preserving the genetic diversity of the collec-tion [1,2].In many instances, gene bank curators and geneticresource conservation managers need to stratify their sam-pling procedure prior to forming a core subset. The criteriafor stratifying samples may be based on ecogeographicalsub-regions or on genetic considerations such as racesand/or land-races. Standard stratified sampling strategiesseek to maximize the diversity among clusters while min-imizing the diversity within groups. Hierarchical cluster-ing algorithms along with statistical models thatmaximize the probability of assigning accessions intoeach cluster have been used for this purpose [3-7]. Onceclusters are formed, an appropriate allocation methodshould be used to determine the number of accessions tobe drawn from each cluster.Using phenotypic evaluation and characterization data,the D-method was developed as an allocation criterion fordetermining the number of accessions to be drawn fromeach cluster [8]. The D allocation method determines thatthe size of the sample to be drawn from each clustershould be proportional to the diversity between acces-sions within that cluster. The authors showed that the D-method produced samples with significantly more diver-sity than other allocation methods. In another study, theD-method was used, along with other sampling strategies,for forming core subsets of maize using molecular markerdata [9]. The results showed that the unweighted pair-core subsets with significantly more diversity than othermethods in terms of genetic distances and diversity indi-ces. Another study, not using the D-method, showed sim-ilar results when using deviation sampling with theunweighted pair-group average method for hierarchicalclustering [11].Although the stratified sampling strategy using the D-method proved to be efficient for forming diverse coresubsets, these can be formed by a non stratified procedurein which accessions are directly selected from the entirecollection by maximizing an objective function. In [12],Schoen and Brown addressed the issue of how to usegenetic markers to sample collections of wild related spe-cies while maximizing allelic richness. They proposed theM (maximization) strategy that maximizes the number ofobserved alleles at each marker locus. A study using com-puter simulation for comparing the retention of neutralalleles when forming core collections using non marker-based random sampling and stratified random samplingstrategies versus the M strategy using genetic markers,found the M-strategy very effective for retaining wide-spread and low frequency alleles [13].MSTRAT, a local search algorithm based on the M-strat-egy, has been proposed [14]. As described by the authors,MSTRAT uses a maximum iterative improvement searchand consists of (1) forming a subset of n accessions cho-sen at random from the N accessions of the whole collec-tion, (2) all possible subsets of size n - 1 are tested forallele diversity and the subset showing the highest level ofrichness is retained, and (3) the accession bringing thegreatest increment in the diversity criterion among theremnant accessions is added, forming an new subset ofsize n. Steps (2) and (3) are repeated until the richness ofthe subset is no longer improved. The diversity of the coresubsets formed is measured using a score of allele rich-ness.The objectives of MSTRAT differ from those of the D-method. It has been suggested that core subsets can beformed that either include rare and localized alleles,which will maximize the total allelic diversity in the core(as favored by taxonomists and geneticists), or can be con-structed by including widely adapted accessions that max-imize the representativeness of the genetic diversity in thecore (which is the breeder's preference) [15]. The objectiveof the D-method is to select the most diverse accessions interms of genetic distances among genotypes, whereas theM-strategy emphasizes selecting accessions with the mostdiverse alleles. This was confirmed in studies, usingmolecular marker data, which found that MSTRAT is veryeffective for retaining widespread and low frequency alle-Example of core subset selectionFigure 1Example of core subset selection. A collection of six samples with data collected for three markers, each having two possible alleles, is shown. Each sample shows the pres-ence for one of two possible alleles at each marker position. The two rows bordered in red are a possible core subset which contain all unique alleles found in the original collec-tion.0 0 0S11 0 1S20 1 1S31 1 1S4A1 A2 A4 A5 A610110010000A70Marker 1 Marker 2 Marker 30 0 1S51 0 0S610 11 01Page 2 of 13(page number not for citation purposes)group method using arithmetic average clustering(UPGMA) [10] with the D allocation method producedles and thus for forming core subsets with high allele rich-ness and a low proportion of non-informative allelesBMC Bioinformatics 2009, 10:243 http://www.biomedcentral.com/1471-2105/10/243[9,14]. Furthermore, it was found that MSTRAT formedcore subsets with more allelic diversity than the D-methodin most populations [9]. In general, the D-method formedcore subsets having higher average genetic distancesbetween genotypes than MSTRAT.Recently, Power Core, a new algorithm also based on theM-strategy was proposed [16]. However, Power Core'salgorithm differs significantly from existing ones. Theauthors proposed a deterministic heuristic search, basedon the concepts of A* search [17] – an exhaustive graphsearching algorithm, guaranteed to find an optimal solu-tion. As suggested by the authors, it would be infeasible torun A* search on even moderately sized probleminstances (collections), due to its exhaustive nature.Therefore, the deterministic heuristic they propose is nec-essary to ensure Power Core terminates in a reasonableamount of time. Although their algorithm is not guaran-teed to find optimal solutions, it found core subsets thatwere superior to those found by MSTRAT, as defined bytheir proposed evaluation criteria of variable coverage: theratio of unique values present in the core subset versusthose found in the entire collection, averaged over all var-iables [16].Other sampling procedures have been proposed such asgenetic distance sampling [18], which finds core subsetsguaranteeing no two accessions are within a defined dis-tance of each other. In this way, the user need not specifythe core size; however, the choice of an appropriate valuefor the distance parameter can be as problematic as theselection of the core size itself. Another iterative proce-dure, least distance stepwise sampling (LDSS), was pro-posed which uses the distances and groupings fromhierarchical clustering to determine which accessions toeliminate, and which to add in each step of the procedureuntil the desired core size has been attained [19]. Whileboth methods employ the use of genetic distances duringtheir heuristic sampling, neither attempts to directly opti-mize them.A first objective of this paper is to demonstrate the effec-tiveness of formally treating core subset selection as anoptimization problem. This entails first defining whichcharacteristics of the core subset should be optimized.These may include the average genetic distance betweenaccessions in the core subset, and/or its redundancy ofparticular alleles, amongst other criteria. Even constraintson the core subset such as ensuring it does not containtwo accessions within a threshold distance of each other,as guaranteed by genetic distance sampling, can be treatedin a broader view as an optimization problem. We willshow that once the criteria for the desired core subset isble of finding as good or better core subsets whencompared with existing methods.Despite incremental improvements to the challenge ofselecting the best core subsets, most existing algorithmsand methods currently address either allele representa-tiveness (breeder's concept) or allele richness (taxono-mist's perspective). A second objective of this paper is topropose an algorithm capable of selecting core subsetshaving high average genetic distance between accessionsor a rich genetic diversity overall, or a core subset that con-siders both criteria. A method for combining discretemolecular marker data and continuous genetic data forcore subset selection has been previously proposed [20].We generalize this notion in order to consider anynumber of optimization criteria simultaneously. In thispaper we limit our focus to discrete molecular markerdata, although the same approach could be extended toconsider continuous genetic data.We present Core Hunter, an algorithm based on anadvanced stochastic local search method. Results from thediversity of the core subsets selected by Core Hunter arecompared with the diversity of the core subsets formedusing the current state-of-the-art methods with an availa-ble implementation: D-Method [8], MSTRAT [14], andPower Core [16]. We demonstrate that Core Hunter findsas good or better core subsets than other methods for allgenetic measures evaluated when attempting to optimizea single genetic measure. Furthermore, by attemptingsimultaneous optimization of multiple genetic measures,Core Hunter often finds core subsets that simultaneouslyhave higher average genetic distance and genetic diversityvalues than any reported by the other algorithms evalu-ated. We also demonstrate that Core Hunter is able to findsmaller core subsets which maintain all unique allelesfound in a reference collection, than all other methodsevaluated.MethodsTo simplify the discussion that follows, we first formalizethe core subset selection problem.The core subset selection problemLet S denote the original collection of resources and γ, 0 ≤γ ≤ 1, the sampling intensity used to form the core subset.Furthermore, let C(S) be the set of all possible core subsetsof S of size n, n = γ·|S|. Finally, let F be the objective func-tion we wish to maximize. F may be a genetic diversitymeasure such as Shannon's diversity index, an averagegenetic distance within a population, possibly measuredby Modified Rogers distance, or some multi-objectivefunction which we will detail next. Formally, we wish toPage 3 of 13(page number not for citation purposes)well defined, the selection can be effectively and effi-ciently handled by sophisticated search algorithms capa-select an optimal core subset c*, c* ∈ C(S), such that F(c*)= max{F(c')|c' ∈ C(S)}.BMC Bioinformatics 2009, 10:243 http://www.biomedcentral.com/1471-2105/10/243The proposed pseudo-index for integrating genetic distances and diversity indicesAn advantage of using search algorithms for core subsetselection is the potential to select core subsets that attemptto optimize more than one criterion. As an example andgiven the above definition of core subset selection, letF(c') = α·Fdistance(c') + (1 - α)Fdiversity (c'), where α, 0 ≤ α ≤1, is a weight associated with genetic distance. In this for-mulation, if α = 1.0, then a core subset would be selectedwhich attempts to maximize the average genetic distance,using Modified Roger's distance for example, whilegenetic diversity, possibly Shannon's diversity index,would not be considered. Likewise, one could attempt tomaximize the genetic diversity by setting α = 0.0. How-ever, for the case of 0 <α < 1, a core subset will be formedby attempting to maximize both genetic distance anddiversity proportional to the weight assigned to α. Thisconcept is easily generalized to incorporate any k meas-ures as shown below, where Fi(c') ≥ 0, 1 ≤ i ≤ k, and.Indeed, this is a common approach in multi-objectiveoptimization referred to as Pareto optimization [21]. Westress that the pseudo-index does not provide any biolog-ical insight into the chosen samples; rather, it serves onlyas a mean for attempting optimization of more than onegenetic measures simultaneously, based on the weightsassigned to standard measures. Another commonapproach in multi-objective optimization is Pareto rank-ing [22], a technique not explored further here.The Core Hunter algorithm for the proposed pseudo-indexThe Core Hunter algorithm uses an advanced stochasticlocal search (SLS) algorithm, replica exchange MonteCarlo [23-25], to maximize the pseudo-index we proposeabove. This search method has been effectively used tosolve high dimensional search problems containing manylocal maxima embedded in rugged search terrains inmany areas of study including spin glasses [26,27] andprotein folding [28]. Given the high dimensionality of thecore subset selection problem, and the vast number ofpossible core subsets, this was deemed as a necessary alter-native to the use of simpler search methods, such as itera-tive improvement (hill climbing), which are more likelyto become trapped in local maxmima due to their greedynature. We now provide a brief overview of the algorithm.The reader is referred to a review of extended ensembleFor each replica, Monte Carlo search initiates by ran-domly selecting n accessions from the reference set toform the initial solution, where n is the number of acces-sions desired in the core subset. The search proceeds byperturbing the current solution and exploring the so-called search neighborhood. In this application, oursearch neighborhood consists of all possible subsets thatdiffer from our current solution, by at most one accession.Thus, to evaluate a new potential solution found in thesearch neighborhood, we perturb the current core subsetby randomly removing an accession and adding a newaccession from the original collection. This is commonlyreferred to as a 1-exchange neighborhood. While it is triv-ial to extend this concept to a δ-exchange neighborhood,δ > 1, it was deemed unnecessary given our results. Notethat when the core subset need not be of a specific size, itis possible that a perturbation remove an accession, butnot add another, or vice versa, in order to shrink or growthe core subset. This is useful, for instance, when attempt-ing to find the smallest core subset which contains allunique alleles of a collection. After perturbing a currentsolution (core subset) s to form an alternate core subset s',the so-called Metropolis criterion is used to determine ifthe new core subset s' should be accepted. The probabilityof accepting s' as the new solution can be expressed aswhere ΔF := F(s') - F(s) is the difference in the pseudo-index score between the new (s') and old (s) solutions(core subsets), and t denotes the temperature of the rep-lica.Intuitively, a replica at a higher temperature is more likelyto accept a bad transition, one where the new core subsethas a worse score than the original. Accepting bad transi-tions to core subsets with worse scores than a previoussolution allows the search to proceed without beingtrapped at a local maximum. This is a fundamental differ-ence from other search algorithms such as maximum iter-ative improvement that is implemented in the MSTRATprogram [14]. Conversely, replicas at lower temperaturesare less likely to accept worsening transitions, and there-fore converge towards a solution.For the duration of the search, k independent replicas aremaintained, each performing a Monte Carlo search asdescribed above at, a unique temperature ti, 1 ≤ i ≤ k. Rep-licas are ordered such that t1 <t2 <...<tk. Neighboring repli-cas periodically swap their core subsets in order to allowa iik==∑ 1 01 .F c F c F c F ck k( ) ( ) ( ) ( )′ = ⋅ ′ + ⋅ ′ + + ⋅ ′a a a1 1 2 2 K (1)Pr s sFeFt[ ] :,,→ ′ =≥−⎧⎨⎪⎩⎪⎫⎬⎪⎭⎪1 0 if otherwiseΔΔ (2)Page 4 of 13(page number not for citation purposes)Monte Carlo algorithms [25] for further details. solutions that have stagnated to break out of local maxi-mums and to allow promising solutions to converge. TheBMC Bioinformatics 2009, 10:243 http://www.biomedcentral.com/1471-2105/10/243probability of swapping the current solutions of replicas iand i + 1, namely the core subsets si and si+1, can beexpressed aswhere ψ is the product of the pseudo-index difference andinverse temperature difference:As the probability of accepting an exchange of core subsetsof two replicas drops exponentially as the temperature dif-ference between them increases, potential replicaexchanges are only considered between neighboring tem-peratures [28].Therefore, the search consists of performing a MonteCarlo search independently for a number of replicas. Inbrief, each replica is a potential solution and represents acore subset. After each Monte Carlo search has progressedfor a fixed number of steps, replica exchanges are consid-ered, temperatures are possibly swapped, and a MonteCarlo search begins again for each replica. The best solu-tion, among all replicas, is tracked during the entiresearch. After a fixed runtime, the best solution observed isreported.Core Hunter with stratified samplingAs previously stated, it is sometimes necessary to firststratify samples based on ecogeographical information,and other criteria, prior to forming core subsets. In thesecases, Core Hunter can be used in a complementary man-ner with the D-method, or similar allocation methods.After stratification, the D-method will determine thenumber of resources to sample from each cluster, andCore Hunter can then be run independently on each clus-ter at the specified sampling intensity.Genetic distances and diversity indicesTo evaluate the quality of core subsets formed by the dif-ferent algorithms, and as compared with the original col-lection, similarly to a previous study [9] we use twogenetic distances between genotypes and three diversityindices that can be incorporated into the pseudo-index wepropose above. Before providing precise definitions ofthese measures, we pause to give a brief summary of thepurpose of each measure and some insight into how themeasures differ from one another. For an in depth treat-ment of appropriate applications of various genetic meas-ures, and how they relate mathematically, the reader istherein, including a review article by Mohammadi andPrasanna [30].Genetic distances are a measure defined between twosamples in order to quantify their degree of dissimilarity;simply put, the larger the value of a genetic distance, themore genetically different the two samples are. Con-versely, redundant or highly similar pairs of accessionscan easily be identified within a collection as those withvery low genetic distance between each other. As a geneticdistance measure is defined between a pair of samples,and not an entire collection, it is customary to report theaverage genetic distance between all unique pairs of sam-ples within a collection. This type of measure is particu-larly useful for breeders interested in forming core subsetswhere each chosen accession is sufficiently distant fromthe others. The two genetic distance measures used hereare Modified Rogers (MR) [31] and Cavalli-Sforza andEdwards (CE) [32]. Both measures compare samples atthe allelic level. Modified Rogers distance is a refinementof the standard Euclidean distance where each allele istreated as a separate dimension.Cavalli-Sforza and Edwards distance is similar to Modi-fied Rogers distance, however, it assumes a selective driftmodel where samples are subject to a low mutation rateand rapid changes in selective pressure [29]. For this rea-son, Modified Rogers distance may be a more suitablemeasure than Cavalli-Sforza and Edwards distance inbreeding programs where consistent selective pressure isapplied for particular traits. In contrast to genetic distancemeasures, genetic diversity measures do not consider pairsof accessions, rather the allelic composition of the sampleas a whole. Genetic diversity measures are particularly use-ful for ensuring rare alleles, which may confer diseaseresistance or some other desirable property, are includedduring core subset formation. For this reason, these meas-ures are well suited for genetic conservation efforts such asseed banks. Although, they can be equally valuable inbreeding programs to ensure a large distribution of allelesis maintained. We consider three genetic diversity indicesin this study. The first, Shannon's diversity index (SH), isdirectly related to Shannon's information content meas-ure [33]. The index is defined in such a way that the largestvalue attainable occurs when each allele is present onlyonce in the entire sample being measured. Generallyspeaking, it penalizes redundancy at the allelic level, withrespect to the entire sample. Therefore, Shannon's diver-sity index is an appropriate measure when forming coresubsets that attempt to retain as many rare alleles as pos-sible, regardless of their co-location within loci (markers).The expected proportion of heterozygous loci (HE) [34]on the other hand, specifically considers diversity withinPr s sF s F sei ii i[ ] :, ( ) ( ),↔ =− ≥⎧⎨⎪⎩⎪⎫⎬⎪⎭⎪++−111 0 if otherwisey(3)y : ( ( ) ( ))=+−⎛⎝⎜⎞⎠⎟ −+1111t i t iF s F si i (4)Page 5 of 13(page number not for citation purposes)referred to a review article by Reif et al. [29] and references each loci. Intuitively, since each loci contributes equally tothe overall value of this measure, core subsets selectedBMC Bioinformatics 2009, 10:243 http://www.biomedcentral.com/1471-2105/10/243using this measure are less likely to be homozygous for anumber of different loci than core subsets selected withShannon's diversity index. The number of effective alleles(NE), by definition, positively correlates with HE andmeasures the number of alleles within a loci and howevenly alleles are distributed within that loci [34], aver-aged over all loci in the sample. Thus, both measures aresuitable for selecting core subsets which ensure allelicdiversity within and across loci.In each evaluation of a sample, we also report two auxil-iary values: the proportion of non-informative alleles(PN) (see [9] and references therein) and allele coverage(CV) [16]. CV is a simple measure which reports the per-centage of alleles retained in a core subset compared withthe original collection. This measure is particularly suita-ble in selecting core subsets for the purpose of allele con-servation in gene banks and seed banks. For instance, dueto time or financial constraints, it may be desirable toselect the smallest core subset possible, which retains allunique alleles found within a larger collection (CV =100%). PN is defined to be the opposite of CV. Thus, max-imizing the one will minimize the other.We now give the precise definition of these measuresbelow where we let L be the number of loci (markers), nlbe the number of alleles within the lth loci, and A be thenumber of alleles in the collection ( ). Fur-thermore, let denote the relative frequency of allele aover the observed frequency of all A alleles (note = 1), let denote the relative frequency of the ath allelein the lth loci and let denote the relative frequency ofallele a within loci l for genotype x.1. The Modified Rogers distance (MR) between a pairof genotypes x, y is.2. The Cavalli-Sforza and Edwards distance (CE)between a pair of genotypes x, y is.3. The Shannon diversity index (SH) of the entire sam-ple is .5. The number of effective alleles (NE) is.6. Proportion of non-informative alleles in the coresubset (PN) is an auxiliary variable measuring the pro-portion of alleles lost in a core subset compared withthe alleles found in the original collection. Specifi-cally, let be the set of alleles found in the originalcollection and let be the set of alleles found in thecore subset. Then .7. Coverage of alleles in the core subset (CV) is an aux-iliary variable measuring the percentage of alleles fromthe collection which are also present in the core sub-set. Note that CV = (1.0-PN) * 100.Criteria for the best core subsetThe best core subset has the highest average genetic dis-tance between accessions, the highest allele richness, andthe lowest proportion of non-informative alleles (andequivalently, the highest allele coverage). These criteriaare in agreement with previous works that suggest coresubsets can be formed with the aim of maximizing thetotal diversity through allele richness and/or maximizingthe representativeness of the genetic diversity in the coresubset [9,15].Data setsTo evaluate the utility of the proposed pseudo-index andthe effectiveness of the Core Hunter algorithm, a numberof experiments were conducted to compare them to exist-ing strategies. For comparison with MSTRAT and D-Method, the same three molecular marker data sets froma previous study were used [9]. A brief description of thethree data sets follows:• 'bulk data set':- 275 samples, having 24 markers and 186 totalalleles- obtained by fingerprinting 275 bulks (popula-tions represented by two bulks of 15 genotypeseach) of maize landrace populations from theAmericas and Europe, using 24 SSR markers withat least one SSR per chromosome and a total of186 alleles [35]• 'accession data set':A nllL==∑ 1pˆapˆaaA=∑ 1pˆlapˆ xla0 112211≤ = − ≤==∑∑MR p pxy xla ylaanlLL l ( )0 112211≤ = − ≤==∑∑CE p pxy xla ylaanlLL l ( )SH p pa aaA= −=∑ ˆ ln( ˆ )11 1 2 111≤ = −==∑∑NE pL laanlL l( ) col csPN cs col= −1 0. | | / | | Page 6 of 13(page number not for citation purposes)4. The expected proportion of heterozygous loci perindividual (HE) is .- 521 samples, having 26 markers and 209 totalalleles0 1 11 211≤ = − ≤==∑∑HE pL laanlL l( )BMC Bioinformatics 2009, 10:243 http://www.biomedcentral.com/1471-2105/10/243- obtained by fingerprinting 521 maize individualsfrom 25 maize populations using 26 SSR markerswith at least one SSR per chromosome and a totalof 209 alleles [36].• 'populations data set':- 25 samples, having 26 markers and 209 total alle-les- obtained from the 'accession data set' by group-ing the individuals of each population, and calcu-lating the allele frequency per population; thisdata set had a total of 25 populations and 209 alle-les [36].Comparisons with Power Core used the same 'rice SSR'data set as used in the original study on Power Core [16].The SSR data set contained values for 1000 individualaccessions at 18 loci.Implementation and hardwareCore Hunter was implemented in Java (version 1.6.0).Experiments were run on our reference Pentium IV 2.4GHz processor machines, with 1GB main memory and256 Kb of CPU cache, running SUSE Linux version 10.1.Results and DiscussionIn the following section, we compare our proposed algo-rithm for forming core subsets, Core Hunter, with threestate-of-the-art methods for which implementations areavailable: MSTRAT [14], D-Method [8] and Power Core[16]. Core Hunter is evaluated on the same data sets usedin recent studies of these algorithms. Results for MSTRATand D-Method were reported in a previous study [9]where core subsets were selected using a sampling inten-sity of 20%, a typical choice suggested in the literature[37,38]. For all comparisons with these methods, CoreHunter also used a sampling intensity of 20%. Specifi-cally, 55 samples were chosen for each core subset of thebulk data set, 104 samples for the accession data set, and5 samples for the population data. We note that in generalthe choice of sampling intensity, and thus core subset size,for a particular purpose may be based on many independ-ent factors. These factors may include criteria such asdiversity, redundancy and possibly financial constraints.Power Core results were determined by calculating theseven genetic measures used in this study on the core sub-set it selected for a rice SSR data set. The core subsetselected by Power Core was previously published, consist-ing of 87 accessions, and made available online [16].When comparing with Power Core, all core subsetsdemonstrated the software is capable of selecting smallercore subsets that maintained all unique alleles from thereference collection than other algorithms they comparedagainst. We repeat this experiment with our Core Hunteralgorithm.We show how core subsets can be selected which attemptto optimize multiple genetic measures simultaneously,respective of an assigned weight, using the pseudo-indexproposed above. We also explore how core subsets differunder various sampling intensities. As Core Hunter is arandomized algorithm, we also report on the solutionquality variance arising from repeated independent simu-lations.Optimizing a single distance or diversity measureFor all data sets, Core Hunter was run with the objectiveof optimizing each genetic measure independently (CoreHunter (single)). Results reported for each measure areindependent of results reported for all other measures. Foreach genetic measure being optimized for a given data set,20 independent runs were performed with a maximumsearch runtime of 5 CPU minutes.Table 1 compares Core Hunter with MSTRAT and D-Method and lists the mean value of these independentsruns for each combination of data set and genetic measurealong with previously reported results for MSTRAT and D-Method [9]. When attempting to minimize the propor-tion of non-informative alleles (PN) and when attempt-ing to maximize the coverage (CV) for the accession dataset, Core Hunter matches the performance of MSTRATwith both algorithms finding an optimal solution (0.0and 100.0, respectively). In every other instance, for everymeasure, Core Hunter outperforms both MSTRAT and D-Method for the measure being optimized, often by animportant margin. Notably, Core Hunter finds core sub-sets that, on average, improve upon the number of effec-tive alleles for the accession data set by 17.4%, and theaverage Modified Roger's distance by 17.2% and 13.7%for the population and bulk data sets, respectively. Thelowest increases were observed for the Shannon diversityindex on each data set, ranging from 0.2% to 2.0%improvement over MSTRAT. Overall, Core Hunter is ableto select better cores than MSTRAT and D-Method, withrespect to the single genetic measures being optimized.Table 2 compares Core Hunter's performance with PowerCore. Both algorithms select core subsets having an opti-mal solution for proportion of non-informative alleles(0.0) and coverage (100.0). For every other measure, CoreHunter outperforms Power Core with respect to the meas-ure being optimized. The least improvement is observedPage 7 of 13(page number not for citation purposes)selected by Core Hunter also contain 87 accessions, unlessotherwise noted. In that study, the authors of Power Corefor Shannon's diversity index, with Core Hunter improv-ing upon Power Core by 2%. Core Hunter finds core sub-BMC Bioinformatics 2009, 10:243 http://www.biomedcentral.com/1471-2105/10/243sets having values of Modified Roger's and Cavalli-Sforzaand Edwards' distance that are 5% better than those ofPower Core. The most important gain is for the number ofeffective alleles, with Core Hunter improving upon PowerCore by 21%.While Core Hunter is capable of selecting core subsetswhich meet or exceed the quality of those chosen by exist-ing software for a particular genetic measure, an impor-tant question is how the values of the other measures,which are not considered during optimization, wereaffected. These values are reported in Table S1 and TableS2 [see Additional file 1] and we summarize the findingshere. The following general trends were noticed. WhenCore Hunter only attempts to optimize the auxiliarymeasures CV and PN, core subsets were selected that ingeneral had worse values than at least one of the otheralgorithms for every other measure. Conversely, CoreHunter consistently reported the worse score for CV andPN measures when attempting to optimize any othermeasure. This suggests selecting core subsets whichattempt to minimize allele redundancy does not necessar-ily result in core subsets which have high average geneticdistance or diversity, at least in the case of core subsetsfound by Core Hunter.Table 1: Comparison of core subsets selected by MSTRAT, D-Method and Core HunterStrategy MR CE SH HE NE PN CVBulk data setCore Hunter (single)† 0.572 0.641 4.531 0.667 3.446 0.000 100.000Core Hunter (multi)‡ 0.506 0.598 4.513 0.662 3.403 0.015 98.500MSTRAT 0.477 0.571 4.493 0.649 3.217 0.021 97.900D-Method§ 0.503 0.578 4.411 0.626 2.980 0.066 93.400COLLECTION 0.440 0.521 4.399 0.620 2.937 0.000 100.000Accession data setCore Hunter (single)† 0.694 0.752 4.670 0.676 3.501 0.000 100.000Core Hunter (multi)‡ 0.659 0.733 4.613 0.650 3.281 0.084 91.600MSTRAT 0.647 0.718 4.579 0.624 2.982 0.000 100.000D-Method§ 0.653 0.719 4.525 0.619 2.963 0.164 83.600COLLECTION 0.630 0.696 4.467 0.591 2.742 0.000 100.000Population data setCore Hunter (single)† 0.442 0.540 4.503 0.619 2.997 0.177 82.300Core Hunter (multi)‡ 0.396 0.508 4.482 0.609 2.969 0.225 77.500MSTRAT 0.357 0.465 4.450 0.593 2.763 0.183 81.700D-Method§ 0.377 0.485 4.409 0.579 2.702 0.264 73.600COLLECTION 0.357 0.455 4.466 0.592 2.749 0.000 100.000†each selection criteria was attempted to be optimized independently by performing 20 runs with 100% weight given to the respective selection criteria during each run. Results reported for each measure are independent of results reported for all other measures.‡20 independent runs were performed with equal weight given to each of the seven measures in an attempt to maximize (minimize) all measures simultaneously.§for each measure, results are shown for the best performing strategy as reported in [9].Table 2: Comparison of core subsets selected by Power Core and Core HunterStrategy MR CE SH HE NE PN CVCore Hunter (single)† 0.926 0.926 5.259 0.873 9.431 0.000 100.000Core Hunter (multi)‡ 0.884 0.884 5.157 0.841 7.928 0.000 100.000Power Core 0.880 0.880 5.131 0.834 7.444 0.000 100.000COLLECTION 0.733 0.733 4.397 0.659 3.700 0.000 100.000†each selection criteria was attempted to be optimized, independently, by performing 20 runs with 100% weight given to the respective selection Page 8 of 13(page number not for citation purposes)criteria during each run. Results reported for each measure are independent of results reported for all other measures.‡20 independent runs were performed with equal weight given to each of the seven measures in an attempt to maximize (minimize) all measures simultaneously.BMC Bioinformatics 2009, 10:243 http://www.biomedcentral.com/1471-2105/10/243Another overall trend that was observed was an apparenttrade-off between optimizing genetic distance measuresand optimizing genetic diversity measures. This observa-tion was made in a previous study comparing D-Methodand MSTRAT [9] and the results for Core Hunter are noexception to this trend. When Core Hunter optimizes agenetic diversity measure, it generally finds core subsetshaving worse values for distance measures and is outper-formed with respect to other software. Likewise, whenoptimizing a genetic distance, Core Hunter often findscore subsets with worse genetic diversity. There are a fewinteresting exceptions. For the Accession data set (the larg-est Maize data set), Core Hunter consistently finds bettercore subsets with respect to all distance and diversitymeasures, regardless of which of those measures is beingoptimized, with only one exception: MSTRAT finds a coresubset with a better SH value when Core Hunter is opti-mizing the MR measure. Also of note is that when CoreHunter is optimizing a genetic distance for the rice dataset, it finds core subsets with both good average distanceand high diversity, compared with other software.These results motivate further study in a number of inter-esting directions. In the next sections, we study thesetrade-offs in more detail to determine if Core Hunter canbe used to find core subsets having both high geneticdiversity and high average genetic distance. We also con-sider the case for trying to find core subsets which havedesirable properties for a number of measures, simultane-ously.Simultaneous distance and diversity optimizationTo test how effectively Core Hunter can attempt to opti-mize genetic distance and genetic diversity measuressimultaneously, we conducted the following experiments.For each data set, Core Hunter was run with the objectiveof optimizing both a genetic distance and a genetic diver-sity index simultaneously, with respect to a weightassigned to each measure proportional to the pseudo-index parameter α. One hundred uniform values weretested in the range [0,1]. For each value of the parameterα, 20 independent runs were performed for a duration of5 CPU minutes, and the mean values of both measureswere determined.Figure 2 plots the values of the average Modified Roger'sdistance versus the Shannon's diversity index values of thecore subsets selected by MSTRAT, D-Method, and CoreHunter for different values of α. The highest average Mod-ified Roger's distance is observed when all weight is givento genetic distance (i.e., α = 1.0). Likewise, the Shannon'sdiversity index value is highest when no weight is given togenetic distance (i.e., α = 0.0). In all other cases (0 <α <1), a trade-off between the measures can be observed, rel-ative to their respective weight in the pseudo-index. Thistrade-off is characterized as the Pareto set, or Pareto fron-tier. Note that sufficiently close values of α may not nec-essarily result in distinct core subsets being selected.Therefore, larger data sets (at the same sampling intensity)may result in a larger Pareto set, as is the case when com-paring the population data set (Figure 2 (left)) to theaccession data set (Figure 2 (right)). Interestingly, in allthree data sets, there are values of α such that Core Hunterselects core subsets that simultaneously have bettergenetic distance and genetic diversity than the core subsetsselected by either MSTRAT or D-Method.The same experimental protocol was used for simultane-ous optimization of the Cavalli-Sforza and Edwards dis-tance and the number of effective alleles diversity index.The margin by which Core Hunter outperforms MSTRATand D-Method is even larger, with results shown in FigureMaximizing Modified Roger's (MR) distance and Shannon's diversity (SH) index simultaneouslyFigure 2Maximizing Modified Roger's (MR) distance and Shannon's diversity (SH) index simultaneously. Core Hunter was run independently for 100 different values of the genetic distance weight parameter, α, in an attempt to identify the Pareto frontier for each of the three datasets having results for MSTRAT and D-Method. Each point on the frontier is the mean value 4.3604.3804.4004.4204.4404.4604.4804.5004.5200.340 0.360 0.380 0.400 0.420 0.440 0.460SH diversityMR distanceCore HunterMSTRATD-Method4.4004.4204.4404.4604.4804.5004.5204.5400.470 0.490 0.510 0.530 0.550 0.570 0.590SH diversityMR distanceMSTRATD-MethodCore Hunter4.5204.5404.5604.5804.6004.6204.6404.6604.6800.640 0.650 0.660 0.670 0.680 0.690 0.700SH diversityMR distanceCore HunterMSTRATD-MethodPage 9 of 13(page number not for citation purposes)of 20 independent runs. Results for the population, bulk and accession datasets are shown on the left, center, and right, respec-tively.BMC Bioinformatics 2009, 10:243 http://www.biomedcentral.com/1471-2105/10/2433. For the accession data set (Figure 3 (right)), CoreHunter outperforms both algorithms for both measuressimultaneously, regardless of the weights assigned to therespective measure.Figure 4 shows the results of attempting to optimize Mod-ified Roger's distance and the number of effective allelesdiversity index compared to the performance of PowerCore. Regardless of the value of α, Core Hunter alwaysfinds core subsets with a higher number of effective alle-les. When α < 0.88, the core subset selected by Power Corehas a higher average Modified Roger's distance than thoseselected by Core Hunter. However, when more weight isgiven to genetic distance (i.e., α ≥ 0.88), Core Hunter eas-ily finds core subsets having better average ModifiedRoger's distance and a higher number of effective alleles.As was observed when optimizing a single measure, thereis a clear trade-off between genetic distance and geneticdiversity. The resulting core subsets discussed above donot have as high average genetic distance than core subsetsoptimized solely for that property. The same can be saidregarding genetic distance. There is a necessary trade-off ofone type of property to benefit the other. However, asshown above, core subsets can be selected that still exhibithigh average genetic distance and diversity, especiallywhen compared with core subsets formed by MSTRAT, D-Method, or Power Core.Optimizing multiple genetic measuresTo test the performance of Core Hunter when attemptingto optimize more than two measures, the algorithm wasrun with weights assigned to all seven genetic measuresdetailed in this paper, for each data set we tested. In allcases, Core Hunter was run for 20 independent trials of 5CPU minutes.For the comparison with MSTRAT and D-Method, themean solution quality is reported in Table 1 (Core Hunter(multi)) with each measure given equal weight. For eachof these data sets, Core Hunter is able to select a core sub-set which better optimizes each genetic measure simulta-neously than MSTRAT or D-Method, with the onlyexception being the proportion of non-informative allelesand coverage. By varying the weights of each measure (i.e.,assigning more weight to PN or CV), a core subset can befound which outperforms MSTRAT and D-Method for allmeasures (data not shown).Results comparing Core Hunter to Power Core can befound in Table 2 (Core Hunter (multi)). As a goal ofPower Core is to optimize coverage, we assigned 99% ofthe weight to coverage, and distributed the remaining 1%of the weight equally amongst the other measures, for allruns of Core Hunter. Our intention is to show that it ispossible to select core subsets which satisfy a primaryobjective, such as ensuring perfect coverage (CV = 100.0),while still attempting to optimize other measures in theprocess. Indeed, Core Hunter is able to find a core subsetthat outperforms Power Core for every genetic measuresimultaneously, with the exception of proportion of non-informative alleles and coverage, as both algorithms findan optimal solution.As discussed in the previous section, the core subsetsfound when optimizing multiple criteria generally per-form worse with respect to individual measures whencompared with the core subsets selected for those specificproperties. For instance, when optimizing only ModifiedRoger's distance, Core Hunter finds core subsets whichgenerally have 5% higher average Modified Roger's dis-tance than the core subsets selected by Core Hunter(multi) [see Additional file 1, Table S1]. While these coreAttempting to maximize Cavalli-Sforza and Edwards' (CE) distance and the number of effective alleles (NE) simultaneouslyFigure 3Attempting to maximize Cavalli-Sforza and Edwards' (CE) distance and the number of effective alleles (NE) simultaneously. Core Hunter was run independently for 100 different values of the genetic distance weight parameter, α, in an attempt to identify the Pareto frontier for each of the three data sets having results for MSTRAT and D-Method. Each point 2.6502.7002.7502.8002.8502.9002.9503.0003.0500.450 0.470 0.490 0.510 0.530 0.550NE diversityCE distanceCore HunterMSTRATD-Method2.9002.9753.0503.1253.2003.2753.3503.4253.5000.560 0.580 0.600 0.620 0.640 0.660NE diversityCE distanceCore HunterMSTRATD-Method2.9003.0003.1003.2003.3003.4003.5003.6000.710 0.720 0.730 0.740 0.750 0.760NE diversityCE distanceCore HunterMSTRATD-MethodPage 10 of 13(page number not for citation purposes)on the frontier is the mean value of 20 independent runs. Results for the population, bulk and accession datasets are shown on the left, center, and right, respectively.BMC Bioinformatics 2009, 10:243 http://www.biomedcentral.com/1471-2105/10/243subsets have higher average Modified Roger's distance,they have 6% less allele coverage on average. Thus, thevarious trade-offs between genetic distance, diversity andpreservation of rare alleles must considered in the contextof the intended purpose of the core subset being formed.While optimizing a single one of these criteria will beimportant in some instances, it will often be the case thatfinding a suitable balance will be the desired outcome.Selecting minimal size core subsets with perfect coverageWith respect to genetic data sets, a goal of Power Core isto select a core subset that retains all unique alleles foundin the collection (perfect coverage) and is as small as pos-sible. Using the SSR rice dataset, Kim et al. demonstratethat Power Core is able to select smaller core subsets hav-ing perfect coverage than a random selection method (R-core), a proportional selection method (P-core) andMSTRAT [16]. Details of the experimental protocol andother selection methods are given in [16].Core Hunter was run on the same data set, for 20 inde-pendent runs of 5 CPU minutes, with the objective offinding the smallest core subset, which retains all uniquealleles. Results for Core Hunter and those previouslycoverage while containing only 80 accessions, 7 acces-sions fewer than Power Core.Effect of sampling intensityThe choice of sampling intensity when forming core sub-sets is usually determined by a number of factors. In theprevious section the core subset was not intended to be afixed size, rather the smallest possible size which pro-duced a core subset that retains all unique alleles of a col-lection. This is beneficial for the application of geneticconservation of rare alleles. Sampling intensity can bechosen based on preliminary estimations of redundancyin the original collection, which is an approach taken inthe program MSTRAT [14]. Often, the sampling intensityis chosen based on a combination of factors, includingfinancial constraints, time constraints, and the particularapplication for which the core subset is intended. A sam-pling intensity of 5% to 20% has been suggested by manyauthors in literature [2,12,37,38].To determine the effect of sampling intensity on the vari-ous genetic measures we selected core subsets for the threeMaize data sets using two new sampling intensities andcompared the results to the core subsets selected using thedefault 20% sampling intensity. Results for 10% and 30%sampling intensity are shown in Tables S3 and S4 [seeAdditional file 1] and are summarized here. Two veryclear, yet expected, trends were observed. With very fewexceptions, core subsets selected with a 10% samplingintensity generally had higher average genetic distanceand higher genetic diversity when compared with the coresubsets selected using 20% sampling intensity. However,the resulting core subsets did not preserve rare alleles aseffectively as the larger core subsets unless the geneticmeasure being optimized was specifically PN or CV. Coresubsets selected with a 30% sampling intensity preservedrare alleles as well or better than smaller core subsets aswould be expected. However, the core subsets selectedgenerally had worse average genetic distance and lowergenetic diversity. There were exceptions noticed withregards to the Population data set which are probablyexplained by the small size of the original collection.Overall, it was observed that a small sampling intensity of10% results in core subsets which have high averageAttempting to maximize Modified Roger's (MR) distance and the number f effective alleles (NE) simultaneouslyFigure 4Attempting to maximize Modified Roger's (MR) dis-tance and the number of effective alleles (NE) simul-taneously. Core Hunter was run independently for 100 different values of the genetic distance weight parameter, α, in an attempt to identify the Pareto frontier on the rice SSR data set. Each point on the frontier is the mean value of 20 independent runs.Table 3: Comparison of core subset size and coverageStrategy Coverage (CV) % Number of entriesCore Hunter 100.0 80Power Core† 100.0 87MSTRAT† 88.9 87P-core† 55.0 100Page 11 of 13(page number not for citation purposes)reported in [16] can be found in Table 3. All 20 independ-ent runs of Core Hunter found core subsets having perfectR-core† 46.8 100†results were previously reported in [16].BMC Bioinformatics 2009, 10:243 http://www.biomedcentral.com/1471-2105/10/243genetic distance and diversity and could be an appropriatechoice for breeding programs. The higher sampling inten-sities selected more homogeneous core subsets that pre-served rare alleles better and would be an appropriatechoice for applications involving genetic conservation.Variability of solution qualityAs Core Hunter is a randomized algorithm, solution qual-ity may vary between independent runs for the same prob-lem instance. However, it was observed that in most casesall independent runs of Core Hunter for a particular com-bination of data set and genetic measure converge to thesame solution quality, yielding negligible variance. Thelargest range of solution quality values observed involvedthe optimization of the number of effective alleles (NE)diversity index for the rice SSR data set, [9.42661,9.43369], having a difference of less than 0.008. Thecumulative distribution of solution quality for 20 inde-pendent runs is shown in Figure 5. The second largestrange of values, also for the rice data set, involved the opti-mization of the expected proportion of heterozygous lociper individual (HE) diversity index, [0.87333,0.87364],having a difference of less than 0.0004. In nearly everyother experiment, the independent trials converged tosolutions having the same score.ConclusionWe have demonstrated that our proposed algorithm forcore subset selection, Core Hunter, has improved uponstate-of-the-art selection methodologies in several ways.Results for four distinct genetic data sets show that CoreHunter, when attempting to optimize a single genetic dis-tance or diversity measure, selects core subsets as good asor better than existing algorithms, often by a significantmargin. Furthermore, when using the proposed pseudo-index, the algorithm attempts to optimize multiplegenetic criteria simultaneously, often finding core subsetsthat have better values for all genetic measures evaluated,when compared with existing methods. Therefore, it isnow possible to select core subsets which satisfy both thebreeders' and taxonimists' perspectives, respective of aweight assigned to each genetic measure as specified bythe user. Also, our algorithm is agnostic to the choice ofgenetic measures. New measures can be incorporatedwithout altering the underlying algorithm. We have fur-ther demonstrated that Core Hunter is able to select signif-icantly smaller core subsets that retain all unique alleleswithin a collection, than other algorithms designed forthis purpose such as Power Core.While we believe Core Hunter will significantly improvethe process of core subset selection, there are a number ofdirections in which this approach can be furtherimproved. First, our algorithm currently considers onlygenetic data. Selection of crop varieties always depend onphenotypic traits and a sole focus on genetic informationmay bias results due to non-functional genetic variations.Power Core [16], MSTRAT [14] and D-Method [8] all pro-vide support for using phenotypic measures when select-ing core subsets. Second, when combining a large numberof molecular marker data with phenotypic variables, it ischallenging to come up with a unified approach so thatinformation on both data sets can be utilized. While CoreHunter is freely available for use, it currently lacks a richgraphical user interface such as those found in Power Coreand MSTRAT. In order to make Core Hunter more accessi-ble to users, development has begun on a rich graphicaluser interface as well as a web based interface. Announce-ments regarding these efforts will be made on the projectwebsite.Authors' contributionsGD, JC, and JF proposed the project, and collaboratedwith CT on the algorithm development and design andanalysis of experiments. CT proposed the algorithm,implemented it, and performed all new experiments. SD,MW provided advice, data and feedback throughout theprocess. CT and JC wrote the initial manuscript with allauthors contributing to the final version.Additional materialAdditional file 1Supplemental results. Expanded listing of Tables 1 and 2 in addition to results for various sampling intensities.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-Solution quality varianceFig re 5Solution quality variance. Cumulative distribution of solu-tion quality, over 20 independent runs, is shown for the rice SSR data set when attempting to maximize the number of Page 12 of 13(page number not for citation purposes)2105-10-243-S1.pdf]effective alleles diversity index.Publish with BioMed Central and every scientist can read your work free of charge"BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime."Sir Paul Nurse, Cancer Research UKYour research papers will be:available free of charge to the entire biomedical communitypeer reviewed and published immediately upon acceptancecited in PubMed and archived on PubMed Central BMC Bioinformatics 2009, 10:243 http://www.biomedcentral.com/1471-2105/10/243AcknowledgementsWe would like to thank Jonathan Crouch and Holger H. Hoos for helpful comments concerning this project and manuscript. We would also like to thank the anonymous reviewers for their valuable suggestions. Computa-tional biology research in CRIL was funded by the Generation Challenge Program (GCP). JC and GD are supported with funding by IFAD and Euro-pean Union. CT received additional support from an NSERC PGS-D3 scholarship.References1. Frankel OH, Brown AHD: Plant genetic resources today: a crit-ical appraisal. In Crop genetic resources: conservation and evaluationEdited by: Hoden HW, Williams JT. London, UK: George Allen andUnwin; 1984:249-257. 2. Brown AHD: Core collections: A practical approach togenetic resources management. Genome 1989, 31:818-824.3. Franco J, Crossa J, Taba S, Shands H: A multivariate method forclassifying cultivars and studying group × environment × traitinteraction. Crop Science 2003, 43:1249-1258.4. Franco J, Crossa J: The Modified Location Model for classifyinggenetic resources. I. Association between Categorial andContinuous Variables. Crop Science 2002, 42:1719-1726.5. Franco J, Crossa J, Taba S, Eberhart SA: The Modified LocationModel for classifying genetic resources. II Unrestrictive vari-ance-covariance matrices. Crop Science 2002, 42:1727-1736.6. Franco J, Crossa J, Villaseñor J, Taba S, Eberhart SA: A two-stage,three-way method for classifying genetic resources in multi-ple environments. Crop Science 1999, 39:259-267.7. Franco J, Crossa J, Villaseñnor J, Taba S, Eberhart SA: Classifyinggenetic resources by categorical and continuous variables.Crop Science 1998, 38(6):1688-1696.8. Franco J, Crossa J, Taba S, Shands H: A sampling strategy for con-serving genetic diversity when forming core subsets. Crop Sci-ence 2005, 45:1035-1044.9. Franco J, Crossa J, Warburton ML, Taba S: Sampling strategies forconserving maize diversity when forming core subsets usinggenetic markers. Crop Science 2006, 46:854-864.10. Sokal RR, Michener CD: A statistical method for evaluating sys-tematic relationships. University of Kansas Science Bulletin 1958,38:1409-1438.11. Li CT, Shi CH, Wu JG, Xu HM, Zhang HZ, Ren YL: Methods ofdeveloping core collections based on the predicted geno-typic value of rice (Oryza sativa L.). Theoretical Applied Genetics2004, 108(6):1172-1176.12. Schoen DJ, Brown AHD: Conservation of allelic richness in wildcrop relatives is aided by assessment of genetic markers. Pro-ceedings of the Natial Academy of Sciences, USA 1993, 38:10623-10627.13. Bataillon TL, David JL, Schoen DJ: Neutral genetic markers andconservation genetics: simulated germplasm collections.Genetics 1996, 144:409-417.14. Gouesnard B, Bataillon TM, Decoux G, Rozale C, Schoen DJ, DavidJL: MSTRAT: An algorithm for building germplasm core col-lections by maximizing allelic or phenotypic richness. Journalof Heredity 2001, 92:93-94.15. Marita JM, Rodríguez JM, Nienhuis J: Development of an algo-rithm indentifying maximally diverse core collections.Genetic Resources and Crop Evolution 2000, 47:515-526.16. Kim KW, Chung HK, Cho GT, Ma KH, Chandrabalan D, Gwag JG,Kim TS, Cho EG, Park YJ: PowerCore: A program applying theadvanced M strategy with a heuristic search for establishingcore sets. Bioinformatics 2007, 23(16):515-526.17. Hart PE, Nilsson NJ, Raphael B: A Formal Basis for the HeuristicDetermination of Minimum Cost Paths. IEEE Transactions onSystems Science and Cybernetics 1968, 4(2):100-107.18. Jansen J, van Hintum T: Genetic distance sampling: a novel sam-pling method for obtaining core collections using genetic dis-tances with an application to cultivated lettuce. TheoreticalApplied Genetics 2007, 114(3):421-428.19. Wang JC, Hu J, Xu HM, Zhang S: A strategy on constructing corecollections by least distance stepwise sampling. TheoreticalApplied Genetics 2007, 115:1-8.for constructing core subsets. Journal of Integrative Plant Biology2006, 48(11):1371-1378.21. Steuer RE: Multiple Criteria Optimization: Theory, Computation and Appli-cation John Wiley & Sons, New York, NY; 1985. 22. Coello Coello C, Lamont G, Van Veldhuizen D: Evolutionary Algorithmsfor Solving Multi-objective Problems, Springer 2nd edition. 2007.23. Geyer C: Markov chain Monte Carlo maximum likelihood.Computing Science and Statistics: Proceedings of the 23rd Symposium onthe Interface 1991.24. Kimura K, Taki K: Time-homogeneous parallel annealing algo-rithm. Proceedings of the 13th IMACS World Congress on Computationand Applied Mathematics (IMACS'91) 1991, 2:827-828.25. Iba Y: Extended Ensemble Monte Carlo. International Journal ofModern Physics C 2001, 12(5):623-656.26. Hukushima K, Nemoto K: Exchange Monte Carlo Method andApplication to Spin Glass Simulations. Journal of the Physical Soci-ety of Japan 1996, 65:1604-1608.27. Hukushima K, Takayama H, Yoshino H: Exchange Monte CarloDynamics in the SK Model. Journal of the Physical Society of Japan1998, 67:12-15.28. Thachuk C, Shmygelska A, Hoos HH: A replica exchange MonteCarlo algorithm for protein folding in the HP model. BMC Bio-informatics 2007, 8:342.29. Reif JC, Melchinger AE, Frisch M: Genetical and MathematicalProperties of Similarity and Dissimilarity CoefficientsApplied in Plant Breeding and Seed Bank Management. CropScience 2005, 45:1-7.30. Mohammadi SA, Prasanna BM: Analysis of Genetic Diversity inCrop Plants-Salient Statistical Tools and Considerations.Crop Sci 2003, 43(4):1235-1248.31. Wright S: Evolution and the Genetics of Populations: A treatise in four vol-umes Volume IV. University of Chicago Press; 1978. 32. Cavalli-Sforza L, Edwards A: Phylogenetic analysis. Models andestimation procedures. American Journal of Human Genetics 1967,19(3):233-257.33. Shannon C: A mathematical theory of communication. Bell Sys-tems Technical Journal 1948, 27(3):2398-403.34. Berg EE, Hamrick JL: Quantification of genetic diversity at alloz-yme loci. Canadian Journal of Forest Research 1997, 27(3):415-424.35. Dubreuil P, Warburton M, Chastanet M, Hoisington D, Charcosset A:More on the introduction of temperate maize into Europe:Large-scale bulk SSR genotyping and new historical evi-dence. Maydica 2006, 51(2):281-291.36. Warburton M, Crossa J, Diaz L, Gomez A, Taba S: Diversidadgenética en criollos de mais medida por microsatélites. VCongreso Nacional de Biotecnología Agropecuaria y Forestal, Chapingo,México 2004.37. van Hintum TJL: The general methodology for creating a corecollection. In Core collections for today and tomorrow Edited by: John-son RC, Hodgkin T. International Plant Genetic Resources Institute,Rome; 1999:10-17. 38. van Hintum TJL, Brown AHD, Spillane C, Hodgkin T: Core collec-tions of plant genetic resources. IPGRI Technical Bulletin 3,International Plant Genetic Resources Institute, Rome, Italy; 2000. yours — you keep the copyrightSubmit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.aspBioMedcentralPage 13 of 13(page number not for citation purposes)20. Wang J, Hu J, Liu N, Xu H, Zhang S: Investigation of combiningplant genotypic values and molecular marker information
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Faculty Research and Publications /
- Core Hunter: an algorithm for sampling genetic resources...
Open Collections
UBC Faculty Research and Publications
Core Hunter: an algorithm for sampling genetic resources based on multiple genetic measures Thachuk, Chris; Crossa, José; Franco, Jorge; Dreisigacker, Susanne; Warburton, Marilyn; Davenport, Guy F Aug 6, 2009
pdf
Page Metadata
Item Metadata
Title | Core Hunter: an algorithm for sampling genetic resources based on multiple genetic measures |
Creator |
Thachuk, Chris Crossa, José Franco, Jorge Dreisigacker, Susanne Warburton, Marilyn Davenport, Guy F |
Publisher | BioMed Central |
Date Issued | 2009-08-06 |
Description | Background: Existing algorithms and methods for forming diverse core subsets currently address either allele representativeness (breeder's preference) or allele richness (taxonomist's preference). The main objective of this paper is to propose a powerful yet flexible algorithm capable of selecting core subsets that have high average genetic distance between accessions, or rich genetic diversity overall, or a combination of both. Results: We present Core Hunter, an advanced stochastic local search algorithm for selecting core subsets. Core Hunter is able to find core subsets having more genetic diversity and better average genetic distance than the current state-of-the-art algorithms for all genetic distance and diversity measures we evaluated. Furthermore, Core Hunter can attempt to optimize any number of genetic measures simultaneously, based on the preference of the user. Notably, Core Hunter is able to select significantly smaller core subsets, which retain all unique alleles from a reference collection, than state-of-the-art algorithms. Conclusion: Core Hunter is a highly effective and flexible tool for sampling genetic resources and establishing core subsets. Our implementation, documentation, and source code for Core Hunter is available at http://corehunter.org |
Genre |
Article |
Type |
Text |
Language | eng |
Date Available | 2016-01-19 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution 4.0 International (CC BY 4.0) |
DOI | 10.14288/1.0223744 |
URI | http://hdl.handle.net/2429/56569 |
Affiliation |
Computer Science, Department of Science, Faculty of |
Citation | BMC Bioinformatics. 2009 Aug 06;10(1):243 |
Publisher DOI | 10.1186/1471-2105-10-243 |
Peer Review Status | Reviewed |
Scholarly Level | Faculty |
Copyright Holder | Thachuk et al. |
Rights URI | http://creativecommons.org/licenses/by/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 52383-12859_2008_Article_2973.pdf [ 681.4kB ]
- Metadata
- JSON: 52383-1.0223744.json
- JSON-LD: 52383-1.0223744-ld.json
- RDF/XML (Pretty): 52383-1.0223744-rdf.xml
- RDF/JSON: 52383-1.0223744-rdf.json
- Turtle: 52383-1.0223744-turtle.txt
- N-Triples: 52383-1.0223744-rdf-ntriples.txt
- Original Record: 52383-1.0223744-source.json
- Full Text
- 52383-1.0223744-fulltext.txt
- Citation
- 52383-1.0223744.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.52383.1-0223744/manifest