ralssBioMed CentBMC BioinformaticsOpen AcceResearch articleAn adaptive bin framework search method for a beta-sheet protein homopolymer modelAlena Shmygelska1 and Holger H Hoos*2Address: 1Department of Structural Biology, Stanford University, 299 W. Campus Dr., Stanford, CA 94305, USA and 2Department of Computer Science, University of British Columbia, 2366 Main Mall, Vancouver, BC V6T 1Z4, CanadaEmail: Alena Shmygelska - alenas@stanford.edu; Holger H Hoos* - hoos@cs.ubc.ca* Corresponding author AbstractBackground: The problem of protein structure prediction consists of predicting the functional ornative structure of a protein given its linear sequence of amino acids. This problem has played aprominent role in the fields of biomolecular physics and algorithm design for over 50 years.Additionally, its importance increases continually as a result of an exponential growth over time inthe number of known protein sequences in contrast to a linear increase in the number ofdetermined structures. Our work focuses on the problem of searching an exponentially large spaceof possible conformations as efficiently as possible, with the goal of finding a global optimum withrespect to a given energy function. This problem plays an important role in the analysis of systemswith complex search landscapes, and particularly in the context of ab initio protein structureprediction.Results: In this work, we introduce a novel approach for solving this conformation search problembased on the use of a bin framework for adaptively storing and retrieving promising locally optimalsolutions. Our approach provides a rich and general framework within which a broad range ofadaptive or reactive search strategies can be realized. Here, we introduce adaptive mechanisms forchoosing which conformations should be stored, based on the set of conformations already storedin memory, and for biasing choices when retrieving conformations from memory in order toovercome search stagnation.Conclusion: We show that our bin framework combined with a widely used optimization method,Monte Carlo search, achieves significantly better performance than state-of-the-art generalizedensemble methods for a well-known protein-like homopolymer model on the face-centered cubiclattice.BackgroundConsidering the close connection between the function ofproteins and their three-dimensional (tertiary) structure,there are many reasons for studying protein folding; theseunderstand a number of diseases that are directly causedby protein misfolding, aggregation and fibrillogenesis(some of which include Alzheimer's, Huntington's andprion disease as well as cystic fibrosis [1]), and to designPublished: 24 April 2007BMC Bioinformatics 2007, 8:136 doi:10.1186/1471-2105-8-136Received: 12 January 2007Accepted: 24 April 2007This article is available from: http://www.biomedcentral.com/1471-2105/8/136© 2007 Shmygelska and Hoos; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.Page 1 of 15(page number not for citation purposes)include the desire to predict protein function based onsequence data (via tertiary structure prediction), to betterproteins with desired structure and function. Since exper-imental methods (X-ray crystallography and NMR) forBMC Bioinformatics 2007, 8:136 http://www.biomedcentral.com/1471-2105/8/136protein structure determination are highly labour inten-sive and require purification and, in the case of X-ray crys-tallography, crystallization of proteins, computationalmethods for predicting protein structure from sequenceare very attractive.The ab initio protein folding problem is the problem ofpredicting the tertiary structure (the native state) of a pro-tein from its amino acid sequence by minimizing a givenenergy function. Even for simple models that discretizeconformations on a lattice (grid), this optimization prob-lem is P-hard [2,3]. Its difficulty stems from the factthat the space of possible conformations is vast, and thesearch landscapes induced by the given energy functionare very complex. One of the most prominent and success-ful approaches for solving this and many other P-hardoptimization problems is known as stochastic local search(SLS) [4]. Applied to protein folding problems, SLS meth-ods attempt to find native states by iteratively performingsmall conformational modifications guided by the givenenergy function; randomized decisions are used to avoidgetting trapped in local minima of the given energy land-scape. Monte Carlo algorithms, which are widely used inprotein structure prediction, are a prominent special caseof SLS methods.The performance of stochastic local search algorithms iscritically dependent on the properties of the search land-scape encountered, such as the number and distributionof local minima, the overall landscape ruggedness (meas-ured, for example, using auto-correlation or fitness-dis-tance analysis), and the basin structure. Therefore, searchstrategies that can extract important features of the land-scape and adapt the search accordingly are among themost effective tools for solving optimization problemswith complex search landscapes. As evident from an anal-ysis of the literature describing search methods for ab initioprotein folding presented in the related work section, suchadaptive search strategies have not been widely studied forthis problem.In this work, we introduce a novel adaptive SLS methodthat is based on a system of bins for storing a diverse setof conformations (candidate solutions) in memory. Thegeneral idea behind our approach is to store promisingconformations encountered during the search for later usein situations when search stagnation is detected. Theseconformations are pooled into a number of bins accord-ing to energy and diversity criteria. The storage andretrieval mechanisms used in this context adaptively con-trol the behaviour of a subsidiary local search procedure,The remainder of this paper is structured as follows: First,we provide background information and discuss relatedwork on protein folding problems and algorithms. Next,we introduce the bin framework and describe a MonteCarlo algorithm that utilizes this bin framework adap-tively. We then compare the performance of our algo-rithm with that of other prominent methods from theliterature, followed by a discussion of how the behaviorand performance of our algorithm is influenced by itsparameters. Finally, we explain how the bin frameworkintroduced in this work relates to other search methodsknown from the literature, summarize our findings, drawsome general conclusions, and indicate directions forfuture work. In the "Methods" section, we describe theface-centered cubic (FCC) lattice model and β-sheetenergy potential used in our computational analysis andprovides details of the experimental protocols used in theempirical analysis of our algorithm.Related workTo address the ab initio protein folding problem, the fol-lowing three issues need to be considered: (1) the modelused for the representation of protein structure (whichmay have implications on prediction accuracy); (2) theenergy potential function (which ideally should be able todiscriminate between native and non-native conforma-tions); and (3) the method used for searching through thespace of possible conformations (which should be able tofind optimal conformations as efficiently as possible). Inthis section, we discuss the choice of protein representa-tion and energy function made in this work; we also pro-vide a brief overview of the best-performing searchmethods for ab initio protein folding known from the lit-erature.To facilitate ab initio protein structure prediction by meansof more efficient methods for searching in the space ofprotein conformations, various reduced models of pro-tein structure have been introduced by biochemists andphysicists. These reduced models fall into two majorclasses: lattice and off-lattice models. The primary reasonfor choosing off-lattice models over lattice models is toobtain better geometrical accuracy. Despite the biasesintroduced by the lattice models, namely a somewhatrestricted ability to accurately represent secondary struc-ture and backbone conformation, lattice models stillretain essential properties of the system [5-7] and offer anumber of computational advantages; these advantagesinclude fast energy computation, easiness of testing self-avoidance and the ability to use pre-computed tables ofmoves, all of which help to compute search steps effi-ciently.Page 2 of 15(page number not for citation purposes)such as canonical Monte Carlo search, and are shown togreatly improve its performance.An important representative of lattice models is the Face-Centered Cubic (FCC) lattice, which underlies the crystal-BMC Bioinformatics 2007, 8:136 http://www.biomedcentral.com/1471-2105/8/136line structure of most metals. Even though in the contextof ab initio protein structure prediction, the simpler squareand cubic lattices are the most widely studied models inthe literature, the FCC lattice captures real protein confor-mations with higher accuracy (coordinate Cα root meansquare deviation below 2 Å) [7] while still being represen-tationally rather simple (it requires only 12 basis vectors).It has also been shown that local packing of amino acidsin real proteins closely fits a distorted FCC lattice [8], andthat the FCC model allows a reasonable description of sec-ondary structure elements; furthermore, it can representgeometrically accurate hydrogen bonding [6]. As a result,the FCC lattice is considered the best overall choiceamong the simpler regular lattice models [6]. A detaileddescription of the FCC lattice is provided in the "Meth-ods" Section.As an energy function to be used in conjunction with theFCC lattice model we chose the β-sheet potential [9,10].This choice was motivated by the fact that there are nouniversally used energy functions for ab initio proteinstructure prediction; at the same time, the β-sheet poten-tial has been used relatively widely in the literature for theempirical evaluation of the best-performing protein struc-ture prediction methods discussed later in this section.This enables us to compare our new approach against arelatively wide range of other conformation search meth-ods known from the literature. Furthermore, this β-sheetpotential exhibits characteristics of more complicatedenergy functions used for off-lattice models [9,11] (partic-ularly, cooperative all-or-none folding transition, charac-teristic interplay between short- and long-interactions andsecondary structure propensity). Finally, the problem offolding of β-sheet proteins is particularly important, sincethe accuracy of protein structure prediction methods forβ-sheet proteins is the lowest among all structural classesof proteins [12]. The β-sheet potential used in this work isdescribed in detail in the "Methods" Section.There are a number of search methods applicable to theprotein folding problem that can be used in conjunctionwith reduced complexity models and simplified poten-tials to perform a broad search through low-resolutionstructures. The most widely used methods includeMetropolis Monte Carlo (MC) search [13-17], GeneticAlgorithms [18-20] and Generalized Ensemble Methods[21-23], which include the currently best-performingalgorithms for ab initio protein structure prediction. Thislast class of algorithms is based on the observation thatcanonical Monte Carlo methods sample conformationsaccording to Boltzmann probabilities. For typical distri-butions of states (i.e., protein conformations) over energylevels this means that very high and, more importantly,come this problem by striving to perform a random walkin energy space by computing the density of states, bysampling expanded range of temperatures or by comput-ing other physical quantities affecting transitions betweenthe states during search.Currently, the best-performing Generalized EnsembleMethod for ab initio protein structure prediction is ReplicaExchange Monte Carlo search (REMC) [23], also knownas the multiple Markov Chain method or Parallel Temper-ing [22]. In REMC, a number of non-interacting copies(replicas) of the given protein sequence are folded inde-pendently and at different temperature settings of theunderlying canonical Monte Carlo search. Every few steps,pairs of replicas are exchanged (i.e., the temperature set-tings of the MC search performed on them are swapped)with a probability that depends on the energies of therespective conformations (using Boltzmann weighting).While other Generalized Ensemble Methods, such as Mul-ticanonical (MUCA) Monte Carlo (or Entropy SamplingMonte Carlo) search [9], maintain only one conformationat any given time, the number of replicas required inREMC increases as the square root of the number ofdegrees of freedom (which in its turn increases linearlywith sequence length) [23].Improvements of REMC introduced in the literatureinclude hybrid approaches between REMC (for the weightfactor determination) and MUCA, or Simulated Temper-ing production runs [22-24]. The Parallel-Hat Tempering(PHAT) Monte Carlo method utilizes an additionalweight factor based on the histogram of energies sampledby each temperature replica [10] to achieve an exponen-tial increase in the acceptance probabilities for high- andlow-energy conformations, which increases the efficiencyof the search process by allowing it to effectively overcomehigher energy barriers and to explore a wider range of con-formations for each replica.The following algorithms have been implemented andempirically evaluated for the FCC lattice model with theβ-sheet energy function considered in this work: canoni-cal Metropolis Monte Carlo search [9], MUCA [9], REMC[9] and PHAT [10]. Of these, PHAT is the best-performingconformation search method for the ab initio prediction ofprotein structures on the FCC lattice using the β-sheetpotential.Bin framework Monte Carlo searchThe key idea behind our adaptive bin framework is toimprove the effectiveness of a given search process, suchas canonical Monte Carlo search, by making it adaptiveand augmenting it with a mechanism for storing andPage 3 of 15(page number not for citation purposes)very low energy conformations are rarely sampled. Gener-alized Ensemble Monte Carlo Methods attempt to over-retrieving promising conformations. This is achieved byusing a series of bins each of which holds a set of confor-BMC Bioinformatics 2007, 8:136 http://www.biomedcentral.com/1471-2105/8/136mations within a certain energy range and an adaptivestrategy for restarting a given search process with a confor-mation retrieved from these bins when the search stag-nates. By varying the search strategy according to a prioridefined transition probabilities (which are dependent onthe search progress), this approach leads to an algorithmthat sacrifices an exact relationship with the canonicalensemble for search efficiency. This method effectivelyreduces the slow convergence, or quasi-ergodicity, in rug-ged energy landscapes; it is therefore very useful when themain interest is in finding global minima, rather than inobtaining other physical properties from canonicalensembles.Conformations are added to the bins in a way that isaimed at maintaining a diverse set of low-energy confor-mations. To achieve this we define energy and diversitythresholds for each bin, which are dynamically modifiedduring the search process (it may be noted that this adap-tive strategy is closely related to the concept of reactivesearch [25]).With each bin i the following properties are associated:• The capacity of the bin, capi, i.e., the maximal number ofconformations that can be stored in the bin at any giventime.• The current number of conformations stored in the bin,numi. These conformations themselves are stored in a listthat is sorted according to energy, to facilitate efficientretrieval of low-energy conformations.• The width of the bin's maximal energy range, ΔEi.• The bin's energy threshold, . This is the highestenergy that a conformation can reach and still be placedinto bin i if the respective diversity threshold (described inthe following) is satisfied.• The diversity threshold, which determines how differenta conformation has to be from other conformationsalready stored at the same energy level in order for thenew conformation to be accepted into the bin. The pair-wise diversity of conformations is measured using a dis-tance measure that depends on the protein model underconsideration. Here, we use the normalized average Ham-ming distance, HD, between the β-sheet energy sequenceof a newly considered conformation c and all β-sheetenergy sequences for the set C' of all conformations withthe same energy that are already in the bin, see MethodsSection for details.Furthermore, the overall bin framework has the followingparameters:• The total number of bins, numBins.• The energy range of interest, ΔE. Together with the cur-rent estimate of the ground state energy, , which is mod-ified throughout the search to always represent the lowestenergy encountered so far, ΔE controls the energy intervalinto which conformations must fall in order to beaccepted into any bin. This range represents the estimateof the maximal barrier height that needs to be sur-mounted to reach lower energy conformations.• The temperature Tbin, which controls the retrieval of con-formations from bins.The general bin framework search mechanism is outlinedin Figure 1. Procedure initalizeBins is used to set all binparameters to their initial values. Bins are numbered 1 ...Ei+EˆHigh-level outline of the main body of the Bin Framework Search MethodFi ur 1High-level outline of the main body of the Bin Frame-work Search Method. Outline of the main body of the Bin procedure BinFramework(c, numBins, ΔE, Tbin)input: initial conformation c,number of bins numBins,the range of energies of interest ΔE,the temperature Tbin controlling the retrieval;output: lowest energy conformation cˆ;cˆ := c;initializeBins;while (termination condition not satisfied) doc′ := subSearchStep(c);if (c′ = c) thenif (E(c′) < Eˆ) thencˆ := c′;Eˆ := E(c′);end ifconsiderStoringInBin(c′,Eˆ,ΔE);end ifif (stagnation detected) thenc := retrieveFromBin(Tbin,Eˆ);elsec := c′;end ifend whilereturn cˆ;end BinFrameworkPage 4 of 15(page number not for citation purposes)Framework Search Method.BMC Bioinformatics 2007, 8:136 http://www.biomedcentral.com/1471-2105/8/136numBins, and for every bin i, ΔEi and are alwaysassigned such that , i.e., the energyintervals for different bins never overlap.Furthermore, the energy bounds are always assigned suchthat , i.e., bin 1 always has thehighest energy interval, while bin numBins stores the low-est energy conformations. Procedure subSearchStep per-forms one step of a subsidiary search procedure (such ascanonical MC search) on a given conformation c andreturns the resulting conformation c'. This step involves asingle proposal of the move from the given move set andits subsequent acceptance or rejection. The two proce-dures considerStoringInBin and retrieveFromBin control thestorage of conformations in the bin system and theretrieval of previously stored conformations. Note thatconformations will only be stored in a bin if they satisfythe corresponding energy and diversity thresholds. Stor-ing a conformation may lead to adjustments of the energythresholds for the corresponding bin or addition of a newbin (this will be described later in detail for the BINMCalgorithm). Finally, a stagnation criterion is used to decidewhen to retrieve a conformation from the bin system inorder to overcome search stagnation, and a terminationcondition is used to determine when the search processshould terminate.In the following, we will describe the specific instantiationof this framework on which the remainder of our study isfocused: the BINMC algorithm.In BINMC, for simplicity all bin capacities capi are set tothe same value, and this value is kept constant during thesearch. The same holds for the energy ranges ΔEi. Finally,for simplicity we also keep the number of bins, numBins,constant during the run of the algorithm. This number isdetermined by the interval of energies of interest and theenergy window width used:numBins = LΔE/ΔEiO. (1)At the beginning of the search process, the energy thresh-old for each bin i is set to := -(i - 1)·ΔEi. Initially, theenergy intervals of all bins form a partition of the interval[0, numBins·ΔEi), note that this interval is larger than orequal to the desired interval [0, ΔE); it is larger if ΔE doesIt should be noted that 0 is the highest energy possibleunder the model chosen in this work and all the energiesare integer values; in the general case energy thresholdscan be adjusted initially to store the highest energy confor-mations under the protein model considered. The binenergy bounds are adjusted during the search, as will beexplained later.The diversity thresholds for the bins, HDi, are determinedbased on the following formula:HDi = HDmin + (numBins - i)·(HDmax - HDmin)/(numBins - 1),(2)where HDmin and HDmax are parameters of the algorithmthat determine the diversity threshold of the lowest andhighest energy bins, respectively. This choice is based onthe experimental observation that fewer protein confor-mations exist for lower energies, and therefore, the set ofof conformations to be found at low energy levels can beexpected to be less diverse. (This is also consistent with theprevalent view that the energy landscapes encountered inab initio protein structure prediction problems are fun-neled [26].) Figure 2 depicts some of the properties of binsand conformations in a bin and the overall relationship ofEi+E E Ei i i+−+−< − +1 1 1Δ[ , )E E Ei i i+ +− ΔE E Ei numBins+ + +> > >2 "Ei+Relationship between the bin framework and the energy landscapeFigure 2Relationship between the bin framework and the energy landscape. An illustration of how conformations at a given state of the bin framework relate to the energy land-scape of a given protein. is the best solution quality found so far and serves as an estimate of the ground state energy, ΔE is the energy range of interest, and conformations within this range are binned. Each bin i has energy threshold , EˆEi+Page 5 of 15(page number not for citation purposes)not divide evenly by ΔEi. diversity threshold HDi, and energy window ΔEi.BMC Bioinformatics 2007, 8:136 http://www.biomedcentral.com/1471-2105/8/136conformations stored in the framework to the energylandscape.BINMC uses a standard canonical Monte Carlo searchprocedure running at a constant inverse temperature βMC:= 1/(kB·TMC) (where kB denotes the Boltzmann constant),that controls the probability with which worsening searchsteps are accepted. Our canonical Monte Carlo procedurefor the FCC lattice is based on the same search neighbour-hood (move set) as used by Gront et al. [9] and by Zhangand Skolnick [10]. In each search step, either a double-bond move or a chain-end move is attempted. A double-bond move involves the modification of two successivebond angles, whose position in the chain is chosen uni-formly at random. Similarly, in a chain-end move, thelocation of the first or the last residue is changed. For effi-ciency and speed, double-bond moves are pre-computedin a table, as done by Gront et al. [9] and by Zhang & Skol-nick [10]. The search proceeds in phases, each of whichstarts with two attempts at chain-end moves (one on eachend, in random order) followed by N/2 successiveattempts at double-bond moves (each chosen uniformlyat random, without repetition but allowing overlap withpreviously chosen double-bond moves) any number ofwhich may be accepted. Procedure subSearchStep in Figure1 performs a single step of this simple subsidiary searchprocess by attempting one move, starting from conforma-tion c and resulting in conformation c' (which, if the pro-posed move has not been accepted, is equal to c); theattempted moves are chosen such that the previouslydescribed phasing of chain-end and double-bond movesis ensured.After a new conformation has been accepted by the sub-sidiary MC procedure, it is considered for placement intoa bin. If the new conformation c' has lower energy thenany other conformation encountered so far, i.e., if E(c')< , it is always accepted into the bin framework and thecurrent estimate of the ground state, , is updated. If E(c')falls outside the energy range currently represented by thebin framework, before accepting the new conformation, anew bin (or a number of bins if needed) is created and thefirst bin storing conformations with high energies (or anumber of bins starting with the first one) is deleted asfollows: (here we only describe addition of a single bin,addition of multiple bins is handled analogously): Weadd a new bin numBins + 1 and delete all conformationsfrom bin 1 along with bin 1 itself. We also shift bin num-bers by -1, such that bin 2 becomes bin 1 and bin numBins+ 1 becomes bin numBins. The energy threshold . The diversity thresholds arenot shifted with the bins, such that HDmax and HDminremain associated with the first and the last bin, respec-tively.If conformation E(c') falls within the energy interval of abin i that is not yet filled to capacity (i.e., numi < capi), andc' satisfies the diversity criterion for that bin – i.e., theHamming distance between the conformation c' andother conformations c" with the same energy E should belarger or equal to HDi (see Methods Section for details) –c' is added to that bin, and numi is incremented by one.(Note that there is at most one such bin i, since bin energyintervals never overlap.) Finally, if E(c') falls within theenergy interval of a bin i that is already filled to capacity(i.e., numi = capi) and it satisfies the diversity criterion forthat bin, c' is added to the content of bin i and the highestenergy conformation is removed. At the same time, isset to the energy of the conformation that is currently thehighest in the bin; as a consequence, conformations withenergy above the updated will not be accepted into bini in the future, and therefore, the energy ranges of bins i -1 and i are now may no longer be adjacent.The stagnation detection mechanism used in BINMC isquite simple: the search is considered to be stagnatedwhen no improvement on the lowest energy has beenachieved for noImprRetrieve search steps, where noImprRe-trieve is a parameter of the algorithm.To retrieve a conformation from the bin system, BINMCuses a two-phase procedure that first selects a bin and thenchooses one of the conformations stored in that bin. Inthe first phase, the probability of selecting a bin i dependson the difference between its upper energy threshold and the best energy reached so far, and is proportional to:where βbin = 1/(kB·Tbin), kB denotes the Boltzmann con-stant, and Tbin is BINMC's temperature parameter. In thesecond phase, the probability of choosing a conformationc from that bin is analogously proportional to:In general, conformations could be chosen with or with-EˆEˆEnumBins+E EnumBins numBins−+−− +1 1 1ΔEˆEˆEˆe bin iE E− −+β ( )e bin E c E− ⋅ −β ( ( ) )Page 6 of 15(page number not for citation purposes)for the newly added bin is set toout replacement; here we limited ourselves to choosingconformation with replacement, since the same confor-mation can yield a different fold each time it is picked.BMC Bioinformatics 2007, 8:136 http://www.biomedcentral.com/1471-2105/8/136As in the stochastic tunneling approach [27], to lessenexponential decay of the probability function we usedBoltzmann-based modified weights proportional to. This weighting preserves the location of allminima, but maps the entire energy space from to themaximum energy 0 onto the interval [0, 1]. The dynamicprocess following the Boltzmann distribution can there-fore pass through energy barriers of an arbitrary height.Overall, this retrieval procedure ensures that lower energyconformations are selected with higher probability, whichis in accordance with the belief that the energy landscapesof real proteins are funneled. The search is terminatedwhen the target energy level has been reached or when auser-specified number of steps has been executed.ResultsIn this section, we present empirical performance resultsfor BINMC as compared to the best-performing algo-rithms known from the literature. MC, REMC and PHAThave been tested by their original authors on thehomopolymers of length 32 and 64, but only results forthe homopolymer of length 64 have been published[9,10]. (We contacted D. Gront [9] and Y. Zhang [10],who commented that all of their algorithms were able toreach quite easily what they believed to be the global min-imum for the polymer of length 32. However, they couldnot provide any information regarding the energy valuesor conformations reached in these experiments.) For theprotein of length 64, Gront et al. believed the lowestenergy they had reached, -374, to correspond to a groundstate conformation; however, Zhang and Skolnick laterreported a conformation with energy -387 for this system[10].Analogous with the results of Gront et al. for MC andREMC [9] and of Zhang and Skolnick for PHAT [10], inTable 1 we provide results averaged over 10 independentruns of each algorithm. It should be noted that severaldetails were not evident from the published descriptionsof these algorithms. From personal communication withD. Gront, we learned that their MC procedure was runwith temperatures from 2.75 to 1.25 [ε0/kB]. Since wecould not determine the exact annealing schedule used intheir study, we chose a constant temperature of 1.25 forour MC procedure. Therefore, the results for the MCSAalgorithm of Gront et al. may not be exactly comparablewith that of our implementation of pure MC. Since thenumber of iterations performed within a given amount oftime varies significantly based on implementation details,we used the average CPU times reported by Gront et al. [9]and by Zhang and Skolnick [10] as the cut-off time for ouralgorithms.As seen from Table 1, our implementations of MC andREMC show comparable performance to that of MCSAand REMC by Gront et al. [9]. (As discussed in the tablecaption, differences in execution environments weretaken into account.) However, our implementation ofPHAT did not reach the performance reported in the liter-e E E− −β( )EˆTable 1: Performance differences among algorithms for the homopolymer of length 64. Method Temperature set Timecut-off Eavg ± sd Emin P – valueMCSA [9] annealed from 2.75 to 1.25 24 min (approx) -349.3 (± 2.1) -362REMC [9] linear 1.25 to 2.75 28 min (approx) -368.2 (± 0.8) -373PHAT [10] linear 1.25 to 2.75 1 hr 25 min (approx) -380.4 (± 1.9) -387our MC 1.25 24 min -367.2 (± 1.7) -370our MC 1.25 28 min -367.4 (± 2.7) -371 0.1367our REMC linear 1.25 to 2.75 28 min -368.5 (± 2.1) -373 0.3425our PHAT linear 1.3 to 2.75 28 min -367.5 (± 3.3) -372 0.1599our BINMC TMC = 1.25, Tbin = 6.521 28 min -370.3 (± 4.3) -379our MC 1.25 1 hr 25 min -368.2 (± 4.6) -374 0.0006*our REMC linear 1.25 to 2.75 1 hr 25 min -369.4 (± 3.0) -376 0.0008*our PHAT linear 1.3 to 2.75 1 hr 25 min -369.5 (± 3.2) -376 0.0023*our BINMC TMC = 1.25, Tbin = 6.521 1 hr 25 min -375.7 (± 3.8) -383Comparison of the energy levels reached for the homopolymer of length N = 64 by Monte Carlo Simulated Annealing (MCSA) [9], the Replica Exchange Monte Carlo (REMC) algorithm with a linear set of temperatures [9] and the Parallel-hat Tempering algorithm (PHAT) [10] with our implementation of Monte Carlo (MC), REMC and PHAT, as well as with our new Bin Framework Monte Carlo (BINMC) algorithm. The run-time Page 7 of 15(page number not for citation purposes)reported for MCSA and REMC [9] has been conservatively adjusted to our 2.4 GHz reference machine (this was done by dividing the published run-times, which have been obtained on a 500 MHz CPU, by a factor of 4.8). The same was done for the run-times for PHAT (which were originally obtained on a 750 MHz CPU and therefore conservatively divided by 3.2).BMC Bioinformatics 2007, 8:136 http://www.biomedcentral.com/1471-2105/8/136ature [10]. Therefore, we contacted Y. Zhang to checkunpublished details of their algorithm, but unfortunately,those details along with the precise information on thebest conformation reported in their paper (with energy -387) were no longer available due to data loss. Our novelBINMC algorithm performs better than MC and REMC,and than our implementation of PHAT. We used the fol-lowing set of parameters for the bin framework for thehomopolymer of length 64: ΔE = 30 [ε0], ΔEi = 5 [ε0], TMC= 1.25 [ε0/kB], Tbin = 6.521 [ε0/kB], binCapacity = 100,HDMAX = 0.8, HDMIN = 0.01, noImprRetrieve = 2 000 000steps. These settings were determined in a series of exper-iments in which we studied the influence of differentparameter settings; these will be further discussed in Dis-cussion Section.The lowest energy level for the homopolymer of length 64reached by our BINMC algorithm is -391; this is lowerthan the energy of any conformation previously reportedin the literature. Conformations with energy -391 werefound in 2 out of 10 runs, each with a 100 CPU hour cut-off on our reference machine, after 47 and 55 CPU hours,respectively. One of the two resulting conformations isshown in Figure 3, the other is its exact mirror image. Con-formations with energies of -389 and -388 (some of whichare shown in the supplementary material, see Additionalfile 1), were found multiple times by BINMC within aCPU time cut-off of 10 hours on our reference machines.To extend our comparison of the re-implemented meth-ods from the literature (MC, REMC and PHAT) withBINMC, we tested these methods on instances of length12, 24 and 32. Again, we performed 10 independent runsof each algorithm on every problem instance, measuringthe mean as well as the standard deviation of the energylevels reached after a run-time of 1 CPU hour. For BINMCwhen applied to the homopolymers of length 12 and 24,we used the following parameter settings: ΔE = 20 [ε0], ΔEi= 5 [ε0], TMC = 1.25 [ε0/kB], Tbin = 4.344 [ε0/kB], binCapacity= 100, HDMAX = 0.6, HDMIN = 0.01 and noImprRetrieve =100 000 steps, whereas on the homopolymer of length32, we set the parameters to: ΔE = 20 [ε0], ΔEi = 5 [ε0], TMC= 1.25 [ε0/kB], Tbin = 4.344 [ε0/kB], binCapacity = 100,HDMAX = 0.6, HDMIN = 0.01 and noImprRetrieve = 1 000000 steps. (These parameter settings are discussed in Dis-cussion Section and in the supplementary material, seeAdditional file 1.)As can be seen from our results presented in Table 2, allmethods find what appears to be the lowest energy (-39)for the homopolymer of length 12 in less than 1 CPU sec-ond on our reference machine. For the homopolymer oflength 24, we are starting to see differences among thealgorithms: BINMC slightly outperforms all other algo-rithms in terms of CPU time required for reaching anenergy of -109, which we believe to be the global mini-mum for this problem instance. MC is the next bestmethod in terms of performance, followed by PHAT andREMC. The performance results for REMC and PHAT areworse than for MC because this homopolymer is too shortfor the additional time invested in exchanges between rep-licas to be amortized. For the homopolymer of length 32,BINMC outperforms the other algorithms by obtaininglower average energy (and also finding lower energystates, for example with energy -161, more often), fol-lowed by PHAT, REMC and finally MC. We show mini-mum energy conformations for the polymers of length 12,24, and 32 in Figure 4. These solutions appear to beunique in terms of short-range and long-range energy val-ues, since all of the conformations found by any of thealgorithms we ran show the same short- vs long-rangeenergy interplay.Additionally, to see to which extent the results reported inTables 1 and 2 could be further improved, we carried out10 long independent runs for the homopolymers oflength 64 and 32, using a cut-off time of 10 CPU hours onour reference machine. The results of this experiment areshown in Table 3; clearly, BINMC outperforms our imple-mentation of the state-of-the-art REMC and PHAT algo-rithms as well as the canonical MC algorithm in terms ofthe solution quality reached in these long runs, withREMC ranking second, followed by PHAT and MC.Next, we conducted a more thorough performance com-parison of the algorithms based on run-time distributions(RTDs) measured on the homopolymers of length 32 andThe lowest energy conformation found by BINMC for the homopolym r of length 64Figure 3The lowest energy conformation found by BINMC for the homopolymer of length 64. Part (a) shows the lowest energy conformation of homopolymer of length 64 found by BINMC (total energy -391, short-range energy -212, long-range energy -179). A detailed description of this conformation is given in the supplementary material, see Additional file 1. Part (b) shows same conformation as seen (a) (b)Page 8 of 15(page number not for citation purposes)64. The goal of this analysis was to analyze the variabilitybetween independent runs on the same problem instance,from above.BMC Bioinformatics 2007, 8:136 http://www.biomedcentral.com/1471-2105/8/136which is known to reflect the parallelization efficiency ofan algorithm and can also reveal detrimental stagnationbehavior [4]. In order to be able to perform 100 inde-pendent runs of each algorithm on both sequences withina reasonable overall computation time using our referencemachine, we used sub-optimal energy levels of -158 and -370 for N = 32 and N = 64, respectively. The resultingempirical RTDs are shown in Figure 5 in the form of therespective cumulative distribution functions. For thehomopolymer of length 32, BINMC and MC outperformREMC and PHAT in terms of the time required to reachthe sub-optimal solution quality of -158. For thehomopolymer of length 64 BINMC performs best interms of the time required to reach the sub-optimal solu-tion quality of -370, followed by MC, REMC and PHAT.As can be seen from the graphs in Figure 5, the RTDs of allfour algorithms closely resemble exponential distribu-tions. (Note that the cumulative distribution functions ofall exponential distributions have exactly the same shapewhen shown in a semi-log plot.) This indicates that forreaching the energy levels considered here, none of thealgorithms stagnates and all of them can be parallelizedefficiently by concurrently executing independent runs[4].It may come as a surprise that MC performs better thanREMC and PHAT. However, it is important to note thatthe RTDs reported in Figure 5 are for sub-optimal qualitiesonly. Since MC at a low temperature is "greedier" anddoes not run multiple chains at different temperatures norattempts exchanges between them, it gets to sub-optimalenergies faster. After reaching them, however, it stagnates;this is reflected in the observation that solution qualitydoes not improve when running MC for a long time (10Examples of the lowest energy conformations for the homopolymers of leng h 12, 24, a d 32Figure 4Examples of the lowest energy conformations for the homopolymers of length 12, 24, and 32. The lowest energy conformations of the FCC homopolymers of length 12, 24, and 32 amino acids (the last of these is shown from the side and above) found by the algorithms we tested; the corresponding energies are: -39 for N = 12 (short-range energy = -28, long-range energy = -11), -109 for N = 24 (short-range energy = -68, long-range energy = -41) and -161 for N = 32 (short-range energy = -112, long-range energy = -49). These conformations are specified in detail in the supple-Table 2: Performance differences among algorithms for the homopolymers of length 12, 24, 32. Method Length Eavg± sd Emin CPU Timeavg Timemed Timeq75 Timeq25 p – valueMC 12 -39 (± 0) -39 < 1 sec < 1 sec < 1 sec < 1 secREMC 12 -39 (± 0) -39 < 1 sec < 1 sec < 1 sec < 1 secPHAT 12 -39 (± 0) -39 < 1 sec < 1 sec < 1 sec < 1 secBINMC 12 -39 (± 0) -39 < 1 sec < 1 sec < 1 sec < 1 secMC 24 -109 (± 0) -109 5.0 min (± 4.1 min) 5.5 min 7.2 min 1.5 min 0.1230REMC 24 -109 (± 0) -109 18.3 min (± 18.0 min) 16.3 min 19.4 min 4.3 min 0.0015*PHAT 24 -109 (± 0) -109 8.7 min (± 8.2 min) 6.6 min 11.7 min 2.9 min 0.0039*BINMC 24 -109 (± 0) -109 1.7 min (± 1.2 min) 1.8 min 2.7 min 0.5 minMethod Length Eavg± sd Lowest E CPU Timeavg Emed E q75 E q25 p – valueMC 32 -158.1 (± 0.9) -161 4.3 min (± 8.4 min) -158 -158 -158 0.0155*REMC 32 -158.2 (± 0.7) -161 4.5 min (± 8.6 min) -158 -158 -158 0.0185*PHAT 32 -158.3 (± 0.9) -161 5.8 min (± 8.0 min) -158 -158 -158 0.0214*BINMC 32 -158.9 (± 0.6) -161 23.0 min (± 20.1 min) -159 -159 -158Comparison of the average energy level obtained and the average time required for the homopolymers of lengths N = 12, 24, 32 for the re-implemented MC, REMC, PHAT, and BINMC. The time cut-off used was 1 CPU hour on our 2.4 GHz reference machine, and all statistics were calculated from 10 independent runs. The temperature sets used for our implementations of MC, REMC and are the same as in Table 1. The p-values reported in the last column were determined using the Mann-Whitney U test to test the null hypothesis that the mean run-time (for N = 24) and the mean energies reached by the respective algorithm vs BINMC (within the same CPU cut-off time for N = 32), respectively, are identical [4]; p-values marked with an asterisk (*) correspond to cases in which the null hypothesis is rejected at a standard significance level of 0.05, and therefore indicate statistically significant performance differences.Page 9 of 15(page number not for citation purposes)CPU hours or more on our 2.4 GHz reference machine)for the homopolymers of length 32 and 64, as shown inmentary material, see Additional file 1.BMC Bioinformatics 2007, 8:136 http://www.biomedcentral.com/1471-2105/8/136Table 3. We also investigated the scaling behaviour ofREMC and BINMC with homopolymer length. We meas-ured the median run-time to reach the global minimumover 20 runs for sequences of length 12, 24, and 32 aminoacids (the sequence of length 64 was not used, since onlyBINMC reaches the lowest known energy). The resultingsets of three data points for each algorithm were each fit-ted with a line in a semi-logarithmic plot (which corre-sponds to fitting an exponential function to the originaldata in a way that counteracts over-fitting for largeinstance sizes). Based on this analysis, the median run-time required for finding (purportedly optimal) confor-mations appears to scale as 100.34·N-5.2 for REMC and as100.28·N-4.7 for BINMC.Finally, we inspected the distribution of energies sampledby each method for the long homopolymer of length 64,based on approximately 5 × 109 conformations each. Asseen in Figure 6, REMC and PHAT show typical energy dis-tributions for each replica, as reported by Zhang and Skol-nick [10]; as expected, in the case of PHAT, theprobabilities of encountering low and high energies areelevated. Interesting differences are observed when exam-ining the distribution of energies visited by MC andDistribution of run-times for all algorithms required to reach sub-optimal conformations for homopolymers of length 32 and 64Figure 5Distribution of run-times for all algorithms required to reach sub-optimal conformations for homopolymers of length 32 and 64. Distribution of run-times required by MC, REMC, PHAT and BINMC to reach sub-optimal conformations with energy -158 for the homopolymer of length 32 (part a) -370 for the homopolymer of length 64 (part b), based on 100 independent runs on our reference machine, each of which reached the target energy value. We fitted the run-time distribu-tion (RTD) of BINMC for the homopolymer of length 64 with exponential distribution, to illustrate that the respective RTD is 0 0.2 0.4 0.6 0.8 1 1 10 100 1000 10000P(solve)run-time, [CPU sec]BINMCMCREMCPHAT(a) 0 0.2 0.4 0.6 0.8 1 1 10 100 1000 10000 100000 1e+06P(solve)run-time, [CPU sec]BINMCMCREMCPHAT1-2-x/1500(b)Table 3: Performance differences among algorithms for the homopolymers of length 64 and 32 in long runs. Method Temperature set Length Eavg± sd Emed E q75 E q25 Emin p – valueour MC 1.25 32 -158.7 (± 1.9) -159 -159 -158 -161 0.0271*our REMC linear 1.25 to 2.75 32 -159.6 (± 1.3) -160 -161 -158 -161 0.5471our PHAT linear 1.3 to 2.75 32 -158.9 (± 1.4) -159 -159 -158 -161 0.0638our BINMC TMC = 1.25, Tbin = 6.521 32 -160.1 (± 0.9) -161 -161 -159 -161our MC 1.25 64 -372.2 (± 2.3) -372 -373 -371 -377 0.0005*our REMC linear 1.25 to 2.75 64 -376.1 (± 3.5) -376 -378 -373 -382 0.0521our PHAT linear 1.3 to 2.75 64 -374.1 (± 3.8) -374 -377 -371 -383 0.0120*our BINMC TMC = 1.25, Tbin = 6.521 64 -379.5 (± 3.3) -381 -382 -376 -389Comparison of the energy levels reached for the homopolymers of length 64 and 32 by our implementations of MC, REMC (with the linear set of temperatures), PHAT, and the new BINMC algorithm in 10 independent runs of 10 CPU hours each on our 2.4 GHz reference machine. The p-values reported in the last column were determined using the Mann-Whitney U test to test the null hypothesis that the mean energies reached by the respective algorithm and BINMC (within the same CPU cut-off time) are identical [4]; p-values marked with an asterisk (*) correspond to cases in which the null hypothesis is rejected at a standard significance level of 0.05, and therefore indicate statistically significant performance differences.Page 10 of 15(page number not for citation purposes)approximately exponential. (The same holds for all other RTDs shown in these plots.)BMC Bioinformatics 2007, 8:136 http://www.biomedcentral.com/1471-2105/8/136BINMC (see Figure 7): While our new algorithm samplesenergies according to the Boltzmann probability densityfunction, the distributions are shifted towards lowerenergy values, reflecting the fact that BINMC tends toreach lower-energy conformations more efficiently.DiscussionSimilar to the Model-based Search (MBS) method [28]introduced for off-lattice fragment insertion as used in theROSETTA [15] algorithm, our bin framework stores prom-ising candidate solutions for future reuse. However,unlike MBS, we developed and tested an adaptive diversi-fication mechanism that varies based on the energy levelconsidered and takes into account how different a confor-mation is from other conformations with the sameenergy. Additionally, the energy level of interest, whichdetermines the highest energy conformations stored inthe bin framework are allowed to have, and individualHamming distance criteria used for each bin, are adaptedaccording to the current estimate of the ground stateenergy. MBS does not have a mechanism comparable toour diversity criteria between stored conformations. InMBS, a number of elite conformations are stored, whosequality is measured using a score determined from theirenergies and the radius of the local minimum representedby them. (This radius is estimated from the distance totheir nearest neighbours using root mean square devia-tion.) Another important distinction between BINMC andMBS is that in BINMC, the model (a pool of stored con-formations) is updated during the search and influencesthe choice of a new candidate solution every noImprRe-trieve steps, when search stagnation is detected. On theother hand, in MBS for the discrete off-lattice model withstructural fragment insertion as described by Brunette andBrock [28], the choice of a new candidate solution is influ-enced at every step by the pool of conformations stored.Thus, MBS exploration of the search space is only depend-ent on the conformations stored [28]. Therefore, regionsExample of distributions of energies visited by BINMCFigure 7Example of distributions of energies visited by BINMC. Distributions of energies visited by MC and our new BINMC algorithm for the homopolymer of length 64. The time cut-off used was 28 CPU min on our 2.4 GHz ref- 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05-400 -350 -300 -250 -200 -150 -100 -50 0Probability(E)E, [ε0] BINMCMCExample of distributions of energies visited by different replicas in REMC and PHATFigure 6Example of distributions of energies visited by different replicas in REMC and PHAT. Distributions of energies vis-ited by different replicas in a representative run of REMC (a) and of PHAT (b) for the homopolymer of length 64. The time cut-off used was 28 CPU min on our 2.4 GHz reference machine. 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0.045 0.05-400 -350 -300 -250 -200 -150 -100 -50 0Probability(E)E, [εο]T=1.250T=1.625T=2.000T=2.375T=2.750(a) 0 0.005 0.01 0.015 0.02 0.025 0.03 0.035-400 -350 -300 -250 -200 -150 -100 -50 0Probability(E)E, [ε0]T=1.300T=1.670T=2.140T=2.750(b)Page 11 of 15(page number not for citation purposes)that are pruned based on the model are eliminated andnot explored any further. In contrast, the bin frameworkerence machine.BMC Bioinformatics 2007, 8:136 http://www.biomedcentral.com/1471-2105/8/136provides only a subsidiary mechanism for generating can-didate solutions when search stagnation is detected, anddoes not completely eliminate unexplored regions of thesearch space. This is achieved by running a non-model-based search (canonical MC) for a sufficiently long timeto allow it to explore other regions of the search space.The bin framework sorts conformations into bins repre-senting different energy levels to make sure that the modelcontains as many energy levels of interest as possiblewhile still reducing the search space. This aspect of thesearch is conceptually related to histogram-based sam-pling and search methods such as Multicannonical algo-rithm (MUCA) [24] and Energy Landscape Paving (ELP)method [21]. It should be noted, however, that our binframework is a non-parametric model of the search spacethat consists of a diverse set of promising candidate con-formations stored in the bins. MUCA, on the other hand,and to some degree ELP, are based on a parametric modelof sampling all energy levels with the same probabilitywithout emphasis on low energies. In addition, our binframework uses Hamming distance criteria that are basedon the energy level of each bin to ensure that the respec-tive sets of conformations stored are diverse and capturethe overall funnel-like structure of the landscape.Up to this point, we have focused on comparing BINMCwith existing algorithms. We now turn our attention tothe question how the performance of the BINMC algo-rithm depends on its parameters and the algorithm com-ponents controlled by them, and to the determination ofgood settings for these parameters. We conducted thisinvestigation for the homopolymer of length 32, sincereaching its best known energy level of -161 is challeng-ing, but not too computationally expensive to precludeperforming multiple successful runs for a large number ofparameter settings. For each parameter configuration, weconducted 20 independent runs on our reference machineand recorded the time required in each of these to reachan energy of -161; from these runs, we then determinedthe average CPU time required to reach this target energy.To study the influence of the different parameters on algo-rithm performance, we varied one (or, in the case ofclosely related parameters, two) of them at a time, whilekeeping all other parameter values fixed. Unless indicatedotherwise, the parameters kept fixed in these experimentswere set to the following values: ΔE = 20 [ε0], ΔEi = 5 [ε0],TMC = 1.25 [ε0/kB], Tbin = 4.344 [ε0/kB], binCapacity = 100,HDMAX = 0.6, HDMIN = 0.1, noImprRetrieve = 100 000 steps.Here, we summarize the results of this study; details areprovided in the supplementary material, see Additionalfile 1. The impact of the parameters on the performance of1. the number of non-improving steps (over the bestenergy) that are performed before reaching into a bin andreplacing the current conformation in the Monte Carlorun, noImprRetrieve (the optimal value seems to be around1 000 000 steps which allows the underlying MC search toexplore the current region of the landscape);2. the ratio between the width of the energy of interestconsidered for binning, ΔE, and the bin temperature, Tbin(the ratio that results in at least 0.01 probability ofretrieval of the highest energy conformations works well);3. the diversity criteria used during binning high- and low-energy conformations (the Hamming distance limits),values of HDmin = 0.01 and HDmax = 0.06 guarantee that weare selective enough when storing promising conforma-tions at low and high energy levels correspondingly andresult in the best performance;4. the width of the energy window considered by each bin(ΔEi = 5 seem to provide optimal discretization of thesearch space); and5. the capacity of bins, capi (storing 100 conformations ineach bin seems to work best, this provides required diver-sification during the search).Furthermore, we made the following observations regard-ing parameter settings of our algorithm for the longerhomopolymer of length 64:1. a higher value of noImprRetrieve should be used (2 000000 was found to work well), which indicates that longersearch times are required to effectively explore the neigh-bourhood of the current conformation before reachinginto the bin framework and replacing it with anotherpromising conformation;2. ΔE should be increased, which is consistent with thecommon belief that the barrier heights for longerhomopolymers are higher, and Tbin should be adjustedsuch that the ratio ΔE/Tbin remains the same as for length32;3. ΔEi should be increased to Ei = 10, which suggests that acoarser search space discretization may be beneficial forlonger homopolymers; note that the combination ofincreases in ΔE and ΔEi results in only a slight increase inthe total number of bins, numBins;4. the same values for HDmin, HDmax and capi can be usedas in the case of length 32.Page 12 of 15(page number not for citation purposes)BINMC can be ranked as follows (in decreasing order): Finally, our empirical results presented in Additional file1 show that the BINMC algorithm performs substantiallyBMC Bioinformatics 2007, 8:136 http://www.biomedcentral.com/1471-2105/8/136better than the simple restart strategy and than the pureMonte Carlo on which it is based. We also determinedthat all components of our algorithm are important for itsefficiency, see Additional file 1. For example, the averagetime required for finding purportedly optimal conforma-tions increases significantly (at least threefold for thehomopolymer of length 32) when increasing noImprRe-trieve until bin framework retrieval is performed veryinfrequently, when decreasing ΔE or capi or when increas-ing the diversity thresholds HDmin and HDmax, resulting invery few conformations being stored, or when reducingthe number of bins (by varying ΔEi for fixed ΔE).ConclusionThe bin framework introduced in this work is a generalapproach that can be used to augment existing conforma-tion search methods in order to increase their ability tofocus on promising regions of the search phase (intensifi-cation) and to effectively overcome stagnation in regionsof sub-optimal conformations (diversification). As shownby our computational experiments, even very simpleinstantiations of the general bin framework can result inhighly effective search algorithms.In particular, our novel Bin Framework Monte Carlo algo-rithm (a combination of the bin framework and a simpleMonte Carlo search procedure) surpasses ReplicaExchange Monte Carlo search and its heuristic variant,Parallel-hat Tempering, in its ability to find (purportedly)minimum energy conformations for β-sheet homopoly-mers on the FCC lattice. Furthermore, using our new algo-rithm we have improved the best known solution for thehomopolymer of length 64 from -387 to -391.In future work, we plan to consider more advanced adap-tive bin framework strategies that control search diversifi-cation and intensification reactively during the search,based on observed features of the search landscape. Addi-tionally, we are planning to generalize our bin frameworkto work on partial as well as complete conformations,producing an efficient generalized framework that com-bines two distinct search strategies. Finally, we would liketo extend our bin framework to address other models ofprotein structure, such as the FCC model with a morecomplex energy function or other discrete and continuousoff-lattice models.Given the results for β-sheet homopolymers on the FCClattice achieved in this work, we believe that further inves-tigation of our adaptive bin framework and its applicationto other protein structure prediction problems holdsmuch promise.MethodsIn this section, we provide a detailed description of theFCC lattice model, β-sheet energy potential, and experi-mental analysis performed to compare our approach tothe existing methods from the literature.Face-Centered Cubic lattice modelIn the FCC model, the polypeptide is restricted to a face-centered cubic lattice [11] that has 12 base set vectors:vbase = {e1, e2, ..., e12}, (3)where e1 = (1, 1, 0), e2 = (1, -1, 0), e3 = (1, 0, 1),e4 = (1, 0, -1), e5 = (0, 1, 1), e6 = (0, 1, -1), e7 = (0, -1, 1),e8 = (0, -1, -1), e9 = (-1, 0, 1), e10 = (-1, 0, -1),e11 = (-1, 1, 0), e12 = (-1, -1, 0).A protein chain of N residues is described by N - 1 vectors,where vector vi connects residues i and (i + 1). The 12 basevectors allow for the following valence angles betweeneach pair of vectors: 60°, 90°, 120°, and 180° [9].Beta sheet energy potential used with the FCC latticeTo model β-sheet proteins and the stiffness of the polymerchain, the following definition of an extended, β -typechain conformation was defined by Gront et al. [9]: threevectors are in an extended state if1. the angles between vectors vi-1 and vi and between vi andvi+1 are greater than 90 degrees;2. the dot product vi-1·vi+1 is larger than 0, which meansthat the angle between vectors vi-1 and vi+1 is less than 90degrees.The energy potential for this model is composed of twoterms: the short-range potential Ui-1, i, i+1 that depends onthe three consecutive vectors in the chain (vi-1, vi, vi+1) andmimics conformational propensity to form an extendedset of β-strands:and the long-range potential Vi, j for two non-bondedchain residues, defined as:Ui i iB i i i− +− +=−1 11 1, ,, ,ε if the triple is in extended stv v v ateotherwise,,0⎧⎨⎩(4)Vr i jri ji jA i j,,,, ,, (=+∞ = ≠− =for and for in lattice unit01ε s and for in lattice units) | | ,, ( ),i jri j− >>⎧⎨⎪⎩⎪10 1Page 13 of 15(page number not for citation purposes)where ri, j is the lattice distance between residues i and j.(5)BMC Bioinformatics 2007, 8:136 http://www.biomedcentral.com/1471-2105/8/136For a chain of length N, the total energy is defined as:where δij is the Kronecker delta (δij = 1 when i = j, and 0otherwise) and the values of the force field parameters aredefined as εA := 1.0 [ε0] and εB := 4.0 [ε0] (with ε0 := 1.0representing one unit of interaction energy) to model asemi-flexible polymer [9].Hamming distance criteria for the FCC latticeFor the FCC β-sheet energy potential the normalized aver-age Hamming distance, HD, between the β-sheet energysequence of a newly considered conformation c and all β-sheet energy sequences for the set C' of all conformationswith the same energy that are already in the bin is calcu-lated as follows.where , Ui(x) denotes the Ui-1, i, i+1 value forconformation x, and εB = 4.0 [ε0] represents the energycontribution of each β-residue [9]. The inner sum rangesfrom i = 2 to i = N - 2, since the first residue and the twolast residues can never be in the extended β-state.Empirical analysisBINMC has been implemented in C++ and compiledusing gcc (version 3.3.3) for the Linux operating system.The same holds for our implementations of three pub-lished algorithms for the same protein models: simpleMonte Carlo (MC), Replica Exchange Monte Carlo(REMC) and Parallel-hat (PHAT) Monte Carlo search[9,10]; we had to re-implement these because the respec-tive codes could not be obtained from the authors. Allexperiments were performed on PCs with 2.4 GHz Pen-tium IV CPUs, 256 Kb cache, and 1 Mb RAM, runningRedhat Linux (our reference machine). Their run-time wasmeasured in terms of the CPU time required to reach (orexceed) a specified energy level.For our performance analysis we used the FCC β-homopolymers of length 12, 24, 36, and 64. The algo-rithms were evaluated based on a number of independentruns on each homopolymer. In most experiments, eachrun was terminated after a fixed CPU time limit (cut-offtime) had been reached. From the distribution of energylevels over 10 independent runs, we determined the aver-age energy, median energy, 25- and 75-percentiles as wellas the lowest energy reached. To further evaluate the per-of Hoos and Stützle [29] and analyzed run-time distribu-tions (RTDs) of the algorithms, i.e., the (empirical) prob-ability distributions over the run-time required to reachthe lowest known (or, in some cases, certain sub-optimal)energy level for the respective homopolymer based on100 independent runs.Authors' contributionsBoth authors contributed to the development of ideas,design of experiments, analysis and interpretation ofresults, and the writing of the paper. AS implemented theproposed method and performed the computationalexperiments.Additional materialAcknowledgementsThis work has been supported by an NSERC Postgraduate Scholarship (PGS-B) held by AS and by funding from the Mathematics of Information Technology and Complex Systems (MITACS) Network of Centers of Excellence held by HH.References1. Notling B: Protein Folding Kinetics Springer; 1993. 2. Ngo JT, Marks J, Karplus M: Computational Complexity: ProteinStructure Prediction and the Levinthal Paradox. Protein Engi-neering 1992, 5:313-321.3. Unger R, Moult J: Finding the Lowest Free Energy Conforma-tion of a Protein is a NP-hard Problem: Proof and Implica-tions. Bulletin of Mathematical Biology 1993, 55:1183-1198.4. Hoos HH, Stützle T: Stochastic Local Search: Foundations and Applica-tions Morgan Kaufmann Publishers, Elsevier; 2004. 5. Bonneau R, Baker D: Ab Initio Protein Structure Prediction:Progress and Prospects. Annual Review of Biophysics and Biomolecu-lar Structure 2001, 30:173-189.6. Kolinski A, Skolnick J: Reduced Models of Proteins and theirApplications. Polymer 2004, 45:511-524.7. Park BH, Levitt M: The Complexity and Accuracy of DiscreteState Models of Protein Structure. Journal of Molecular Biology1995, 249:493-507.8. Bagci Z, Jernigan RL, Bahar I: Residue coordination in ProteinsConforms to the Closest Packing of Spheres. Polymer 2002,43:451-459.9. Gront D, Kolinski A, Skolnick J: Comparison of Three MonteCarlo Conformational Search Strategies for a Protein-likeHomopolymer Model: Folding Thermodynamics and Identi-fication of Low-Energy Structures. Journal of Chemical PhysicsE U Vi i i i j ijjNiNiN= + −− +===−∑∑∑ 1 111211, , , ( ),δ (6)HD U c M U c Ni kiNiiNkMB= ′ −⎛⎝⎜⎜⎞⎠⎟⎟ ⋅ −=−=−=∑ ∑∑ ( )/ ( ) /( ( )),222213εC c cM= ′ ′{ ,..., }1Additional file 1Additional data for an adaptive bin framework search method for a beta-sheet protein homopolymer model. In this supplementary file, we show additional low-energy conformations for the 64 amino acid homopolymer found by BINMC, provide details on experiments conducted to determine good parameter settings for BINMC and supply the necessary information to reconstruct the lowest energy conformations of homopolymers of length 12, 24, 32, and 64 found by our new algorithm.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-8-136-S1.pdf]Page 14 of 15(page number not for citation purposes)formance of our BINMC algorithm and the methodsknown from the literature, we followed the methodology2000, 13:5065-5071.Publish with BioMed Central and every scientist can read your work free of charge"BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime."Sir Paul Nurse, Cancer Research UKYour research papers will be:available free of charge to the entire biomedical communitypeer reviewed and published immediately upon acceptancecited in PubMed and archived on PubMed Central BMC Bioinformatics 2007, 8:136 http://www.biomedcentral.com/1471-2105/8/13610. Zhang Y, Skolnick J: Parallel-Hat Tempering: A Monte CarloSearch Scheme for the Identification of Low-Energy Struc-tures. Journal of Chemical Physics 2001, 115:5027-5032.11. Pokarowski P, Kolinski A, Skolnick J: A Minimal Physically Realis-tic Protein-Like Lattice Model: Designing and Energy Land-scape that Ensures All-Or-None Folding to a Unique NativeState. Biophysical Journal 2003, 84:1518-1526.12. Aloy P, Stark A, Hadley C, Russel RB: Predictions without Tem-plates: New Folds, Secondary Structure, and Contacts inCASP5. Proteins: Structure, Function, and Genetics 2003, 53:436-456.13. Jones DT: Predicting Novel Protein Folds by Using FRAG-FOLD. Proteins: Structure, Function, and Genetics Suppl 2001,5:127-132.14. Ortiz AR, Kolinski A, Rotkiewicz P, Ilkowski B, Skolnick J: Ab InitioFolding of Proteins Using Restrains Derived from Evolution-ary Information. Proteins: Structure, Function, and Genetics Suppl1999, 3:177-185.15. Simons K, Kooperberg C, Huang E, Baker D: Assembly of ProteinTertiary Structures from Fragments with Similar LocalSequences Using Simulated Annealing and Bayesian ScoringFunction. Journal of Molecular Biology 1997, 268:209-225.16. Skolnick J, Kolinski A, Kihara D, Betancourt M, Rotkiewicz P, BonieckiM: Ab Initio Protein Structure Prediction via a Combinationof Threading, Lattice Folding, Clustering, and StructureRefinement. Proteins: Structure, Function, and Genetics Suppl 2001,5:149-156.17. Zhang Y, Kihara D, Skolnick J: Local Energy Landscape Flatten-ing: Parallel Hyperbolic Monte Carlo Sampling of ProteinFolding. Proteins: Structure, Function, and Genetics 2002, 48:192-201.18. Bowie JU, Eisenberg D: An Evolutionary Approach to FoldingSmall α-helical Proteins that Use Sequence Information andan Empirical Guiding Fitness Function. Proceedings of theNational Academy of Sciences of the USA 1994, 91:4436-4440.19. Dandekar T, Argos P: Folding the Main Chain of Small Proteinswith the Genetic Algorithm. Journal of Molecular Biology 1994,236:844-861.20. Pedersen JT, Moult J: Protein Folding Simulations with GeneticAlgorithms and a Detailed Molecular Description. Journal ofMolecular Biology 1997, 269:240-259.21. Hansmann UHE: Protein Folding Simulations in a DeformedEnergy Landscape. The European Physical Journal B 1999,12:607-611.22. Mitsutake A, Sugita Y, Okamoto Y: Generalized-Ensemble Algo-rithms for Molecular Simulations of Biopolymers. Biopolymers(Peptide Science) 2001, 60:96-123.23. Sugita Y, Okamoto Y: Replica-Exchange Multicanonical Algo-rithm and Multicanonical Replica-Exchange Method for Sim-ulating Systems with Rough Energy Landscape. CondensedMaterials Archive 2004:0009119.24. Okamoto Y: Generalized-Ensemble Algorithms: EnhancedSampling Techniques for Monte Carlo and MolecularDynamic Simulations. Journal of Molecular Graphics and Modeling2004, 22:425-439.25. Battiti R, Tecchiolli G: Residue Coordination in Proteins Con-forms to the Closest Packing of Spheres. Polymer 2002,43:451-159.26. Plotkin SS, Onuchic JN: Understanding Protein Folding withEnergy Landscape Theory Part I: Basic Concepts. QuarterlyReviews of Biophysics 2002, 35:111-167.27. Wenzel W, Hamacher K: Stochastic Tunneling Approach forGlobal Minimization of Complex Potential Energy Land-scapes. Physical Review Letters 1999, 82:3003-3007.28. Brunette TJ, Brock O: Improving Protein Structure Predictionwith Model-based Search. Bioinformatics 2005, 21:i66-i74.29. Hoos HH, Stützle T: On the Empirical Evaluation of Las VegasAlgorithms. Proceedings of the 14th Conference on Uncertainty in Arti-ficial Intelligence 1998:238-245.yours — you keep the copyrightSubmit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.aspBioMedcentralPage 15 of 15(page number not for citation purposes)
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Faculty Research and Publications /
- An adaptive bin framework search method for a beta-sheet...
Open Collections
UBC Faculty Research and Publications
An adaptive bin framework search method for a beta-sheet protein homopolymer model Shmygelska, Alena; Hoos, Holger H Apr 24, 2007
pdf
Page Metadata
Item Metadata
Title | An adaptive bin framework search method for a beta-sheet protein homopolymer model |
Creator |
Shmygelska, Alena Hoos, Holger H |
Publisher | BioMed Central |
Date Issued | 2007-04-24 |
Description | Background: The problem of protein structure prediction consists of predicting the functional or native structure of a protein given its linear sequence of amino acids. This problem has played a prominent role in the fields of biomolecular physics and algorithm design for over 50 years. Additionally, its importance increases continually as a result of an exponential growth over time in the number of known protein sequences in contrast to a linear increase in the number of determined structures. Our work focuses on the problem of searching an exponentially large space of possible conformations as efficiently as possible, with the goal of finding a global optimum with respect to a given energy function. This problem plays an important role in the analysis of systems with complex search landscapes, and particularly in the context of ab initio protein structure prediction. Results In this work, we introduce a novel approach for solving this conformation search problem based on the use of a bin framework for adaptively storing and retrieving promising locally optimal solutions. Our approach provides a rich and general framework within which a broad range of adaptive or reactive search strategies can be realized. Here, we introduce adaptive mechanisms for choosing which conformations should be stored, based on the set of conformations already stored in memory, and for biasing choices when retrieving conformations from memory in order to overcome search stagnation. Conclusion We show that our bin framework combined with a widely used optimization method, Monte Carlo search, achieves significantly better performance than state-of-the-art generalized ensemble methods for a well-known protein-like homopolymer model on the face-centered cubic lattice. |
Genre |
Article |
Type |
Text |
Language | eng |
Date Available | 2015-11-15 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution 4.0 International (CC BY 4.0) |
DOI | 10.14288/1.0215910 |
URI | http://hdl.handle.net/2429/55337 |
Affiliation |
Computer Science, Department of Science, Faculty of Non UBC |
Citation | BMC Bioinformatics. 2007 Apr 24;8(1):136 |
Publisher DOI | 10.1186/1471-2105-8-136 |
Peer Review Status | Reviewed |
Scholarly Level | Faculty |
Copyright Holder | Shmygelska and Hoos. |
Rights URI | http://creativecommons.org/licenses/by/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 52383-12859_2007_Article_1508.pdf [ 749.11kB ]
- Metadata
- JSON: 52383-1.0215910.json
- JSON-LD: 52383-1.0215910-ld.json
- RDF/XML (Pretty): 52383-1.0215910-rdf.xml
- RDF/JSON: 52383-1.0215910-rdf.json
- Turtle: 52383-1.0215910-turtle.txt
- N-Triples: 52383-1.0215910-rdf-ntriples.txt
- Original Record: 52383-1.0215910-source.json
- Full Text
- 52383-1.0215910-fulltext.txt
- Citation
- 52383-1.0215910.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.52383.1-0215910/manifest