ralssBioMed CentBMC BioinformaticsOpen AcceResearch articleA replica exchange Monte Carlo algorithm for protein folding in the HP modelChris Thachuk1, Alena Shmygelska2 and Holger H Hoos*3Address: 1School of Computing Science, Simon Fraser University, Burnaby, B.C., V5A 1S6, Canada, 2Department of Structural Biology, Stanford University, Stanford, CA, 94305, USA and 3Department of Computer Science, University of British Columbia, B.C., V6T 1Z4, CanadaEmail: Chris Thachuk - cthachuk@sfu.ca; Alena Shmygelska - alena.shmygelska@stanford.edu; Holger H Hoos* - hoos@cs.ubc.ca* Corresponding author AbstractBackground: The ab initio protein folding problem consists of predicting protein tertiary structurefrom a given amino acid sequence by minimizing an energy function; it is one of the most importantand challenging problems in biochemistry, molecular biology and biophysics. The ab initio proteinfolding problem is computationally challenging and has been shown to be -hard even whenconformations are restricted to a lattice. In this work, we implement and evaluate the replicaexchange Monte Carlo (REMC) method, which has already been applied very successfully to morecomplex protein models and other optimization problems with complex energy landscapes, incombination with the highly effective pull move neighbourhood in two widely studied HydrophobicPolar (HP) lattice models.Results: We demonstrate that REMC is highly effective for solving instances of the square (2D)and cubic (3D) HP protein folding problem. When using the pull move neighbourhood, REMCoutperforms current state-of-the-art algorithms for most benchmark instances. Additionally, weshow that this new algorithm provides a larger ensemble of ground-state structures than theexisting state-of-the-art methods. Furthermore, it scales well with sequence length, and it findssignificantly better conformations on long biological sequences and sequences with a provablyunique ground-state structure, which is believed to be a characteristic of real proteins. We alsopresent evidence that our REMC algorithm can fold sequences which exhibit significant interactionbetween termini in the hydrophobic core relatively easily.Conclusion: We demonstrate that REMC utilizing the pull move neighbourhood significantlyoutperforms current state-of-the-art methods for protein structure prediction in the HP model on2D and 3D lattices. This is particularly noteworthy, since so far, the state-of-the-art methods for2D and 3D HP protein folding – in particular, the pruned-enriched Rosenbluth method (PERM) and,to some extent, Ant Colony Optimisation (ACO) – were based on chain growth mechanisms. Tothe best of our knowledge, this is the first application of REMC to HP protein folding on the cubiclattice, and the first extension of the pull move neighbourhood to a 3D lattice.Published: 17 September 2007BMC Bioinformatics 2007, 8:342 doi:10.1186/1471-2105-8-342Received: 28 February 2007Accepted: 17 September 2007This article is available from: http://www.biomedcentral.com/1471-2105/8/342© 2007 Thachuk et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.Page 1 of 20(page number not for citation purposes)BMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342BackgroundThe ab initio protein folding problem concerns the predic-tion of the three dimensional functional state, i.e., thenative fold, of a protein given only its sequence informa-tion. A successful method for solving this problem wouldhave far reaching implications in many fields includingstructural biology, genetics and medicine. Current labora-tory techniques for protein structure determination areboth costly and time consuming. In the current era of highthroughput sequencing, it is infeasible to rely exclusivelyon time and labor-intensive experimental structure deter-mination techniques, such as X-ray crystallography andnuclear magnetic resonance, for characterizing the proteinproducts of newly discovered genes; there is a clear needfor effective and efficient computational protein structureprediction programs. However, even for simplified pro-tein models that use lattices to discretize the conforma-tional space, the ab initio protein structure predictionproblem has been shown to be -hard [1-3], and a pol-ynomial-time algorithm is therefore unlikely to exist.One of the most prevalently studied abstractions of the abinitio protein structure prediction problem is Dill's hydro-phobic polar (HP) model. Many algorithms have been for-mulated to address the protein folding problem using twodimensional (2D) and three dimensional (3D) HP mod-els on a variety of lattices (see, e.g., [4-12]). In this study,we restrict our attention to those HP models that embedall protein folds into the 2D square lattice or the 3D cubiclattice. Many of these algorithms can be classified prima-rily as construction based (or chain growth) algorithms,which determine folds by sequentially placing residuesonto the lattice. Among these, the pruned enriched Rosen-bluth method (PERM) [13] has been particularly success-ful in finding optimal conformations for standardbenchmark sequences in both 2D and 3D. PERM is aMonte Carlo based chain growth algorithm that iterativelyconstructs partial conformations; it is heavily based onmechanisms for pruning unfavourable folds and forenriching promising partial conformations, to facilitatetheir further exploration.Despite being one of the most successful algorithms for abinitio protein structure prediction in the 2D and 3D HPmodels, PERM – like all other currently known algorithmsfor this problem – is not dominant in every instance. Inthe work of Shmygelska and Hoos [9] it was shown thatPERM has great difficulty folding proteins which have ahydrophobic core located in the middle and not at one ofthe ends of the sequence, as is the case when the core isformed from interacting termini. We note that an earliermore effective in folding these types of sequences. How-ever, to the best of our knowledge, no comparison hasbeen made with the most recent version of PERM or otherprotein folding algorithms.Shmygelska and Hoos proposed an ant colony optimiza-tion algorithm, ACO-HPPFP-3, which employs both con-struction and local search phases on completeconformations [9]. Ant Colony Optimisation (ACO) is apopulation-based stochastic search method for solving awide range of combinatorial optimisation problems. ACOis based on the concept of stigmergy – indirect communi-cation between members of a population through interac-tion with the environment. From the computationalpoint of view, ACO is an iterative construction searchmethod in which a population of simple agents ('ants')repeatedly constructs candidate solutions to a given prob-lem instance; this construction process is probabilisticallyguided by heuristic information on the problem instanceas well as by a shared memory containing experience gath-ered by the ants in previous iterations of the search proc-ess ('pheromone trails') [15]. The ACO-HPPFP-3algorithm combines a relatively straight-forward applica-tion of the general ACO method to the 2D and 3D HPprotein structure prediction problem with specific localsearch procedures that are used to optimize the conforma-tions constructed by the ants.In the 2D case, ACO-HPPFP-3 was shown to be competi-tive with PERM on many benchmark instances and dom-inant on proteins whose hydrophobic core is located inthe middle of the sequence. Other attempts at the prob-lem use local search methods on complete conforma-tions, including the GTabu algorithm [7]. This methodutilizes the generic tabu search algorithm from theHuman Guided Search (HuGS) framework [16]. GTabuwas shown to find conformations with the lowest knownenergy for several benchmark instances in the 2D case.This was primarily made possible by using a newly intro-duced neighbourhood consisting of so-called pull moves,which is also utilized in our work.In addition to PERM, many other Monte Carlo algorithmshave been devised to address the problem of ab initio pro-tein structure prediction using lattice models [17-20]. Aclass of Monte Carlo methods known as generalized ensem-ble algorithms have been shown to be particularly effectivefor more complex lattices and for the off-lattice case [5,21-24]. Classical Monte Carlo search methods for proteinstructure prediction typically sample conformationsaccording to the Boltzmann distribution in energy space.In generalized ensemble algorithms, random walks inother dimensions, such as temperature, can also be real-Page 2 of 20(page number not for citation purposes)version of PERM [14], capable of initiating search at non-terminus positions, was previously proposed and may beized. This is the case for replica exchange Monte Carlo(REMC) algorithms, which maintain many independentBMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342replicas of potential solutions, i.e., protein conforma-tions. Each replica is set at a different temperature andlocally runs a Markov process sampling from the Boltz-mann distribution in energy space. A random walk intemperature space is achieved by periodic exchanges ofconformations at neighbouring temperatures. REMCappears to have been discovered independently by variousresearchers [25-28] and is also known as parallel temper-ing, multiple Markov chain Monte Carlo and exchangeMonte Carlo search. REMC has been shown to be highlyeffective in high dimensional search problems with rug-ged landscapes containing many local minima. Initiallythis was demonstrated in an application to spin glass sys-tems [26,29]. REMC has also been applied to the off-lat-tice protein folding problem [21-24,30-34]. Furthermore,it was previously used for folding proteins on the 2Dsquare lattice in a study by Irbäck [23] and to the face-cen-tred cubic lattice in the work of Gront et al [5]. However,to the best of our knowledge, no extensive study of theREMC algorithm in the HP model on the cubic lattice hasbeen undertaken. The remainder of this paper is struc-tured as follows. First, we formally introduce the hydropho-bic polar model and describe in detail the two searchneighbourhoods (move sets) utilized later in this work.Next, we present the general REMC method followed bythe three instantiations we have developed for the 2D and3D HP protein folding problem. Then, we report resultsfrom a comparative empirical performance analysis of ournew algorithms vs PERM and ACO-HPPFP-3. The respec-tive computational experiments are run on standardbenchmark instances as well as on two new sequence sets,which we introduced to evaluate the performance ofREMC when folding long sequences and sequences whichhave a provably unique optimal structure. We also reportresults from experiments involving proteins with terminiinteracting to form a hydrophobic core. Next, we comparethe performance of our new REMC algorithms with that ofGTabu. A discussion follows, in which we report empiricalresults regarding the effects of various parameters on theperformance of our new algorithms. We close with a high-level summary of our major findings and a brief discus-sion of potential future work.The hydrophobic polar modelThe hydrophobic polar (HP) model was first introducedby Dill in 1985 [35]. In this model, amino acids are clas-sified as either H (hydrophobic) or P (polar). Informally, asequence of H's and P's is embedded into a lattice struc-ture. A valid conformation of the sequence corresponds toa self-avoiding walk on the lattice. Borrowing the termi-nology used by Lau and Dill [36], we define connectedneighbours as any two residues k and k + 1 that are adjacentalong the given sequence, and topological neighbours as res-conformation can be calculated as the number of H-Hcontacts between topological neighbours. This is illus-trated in Figure 1, which shows a conformation withenergy -2 (every H-H contact contributes -1 to the totalenergy, while all other contacts do not contribute).Formally, for a sequence s ∈ Σn with Σ = {H, P} and n = |s|,we define a conformation ci ∈ Cs to have energy E(ci),where Cs is the set of all valid self-avoiding walks on somelattice L for sequence s, and E(ci) is given by the followingequation:In this model, we search for a conformation c* that mini-mizes the objective energy function E(ci). Such a confor-mation is considered a solution and is also called aground-state conformation of the given protein sequence.However, many instances of the HP protein folding prob-lem exhibit solution degeneracy, i.e., have more than oneminimum-energy conformation. In this sense, our defini-tion of ground-state conformation does not imply aunique solution, but simply one that satisfies the follow-ing equation:E c NNj ki jkk jnjnjk( ) ,==−= +=−∑∑ withif and are both H r1111 esiduesand topological neighboursotherwise;.0A ground-state conformation in the 2D HP modelFi ure 1A ground-state conformation in the 2D HP model. The grid points and lines represent the 2D square lattice this con-formation is embedded upon. Filled, black circles represent hydrophobic residues while unfilled circles represent polar residues. The conformation above yields an optimal energy score in the HP model of -2. The two hydrophobic contacts contributing to the score are between residues 4 and 13 and 12 34 513 126 7891011141516Page 3 of 20(page number not for citation purposes)idues adjacent in topological space (forming a contact)that are not also connected neighbours. The energy of abetween residues 5 and 12.BMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342E(c*) = min{E(ci) | ci ∈ Cs}Although ground-state structures in this model typicallydo not closely resemble the known native conformationsof the respective proteins, a close correspondence hasbeen observed in some cases [37]; this is particularly truefor higher resolution lattices such as the face-centred cubiclattice.Generally, simplified models, such as the ones consideredhere, are widely considered to be useful in studying cer-tain aspects of protein folding and structure prediction,including the formation of conformations exhibiting ahydrophobic core [38,39].Search neighbourhoodsLocal search methods (including REMC and simpleMonte Carlo search) are based on the idea of iterativelyimproving a given candidate solution by exploring itslocal neighbourhood. In the protein folding problem as itis presented here, the neighbourhood of a conformationcan be thought to consist of slight perturbations of therespective structure. The neighbourhoods (move sets) usedin solving this problem specify a perturbation as a feasiblechange from a current conformation c at time t to a validconformation at time t + 1. Thus, the neighbourhood of aconformation c is a set of valid conformations c' that areobtained by applying a specific set of perturbations to c. Inthis study we consider two such neighbourhoods, the so-called VSHD moves and pull move neighbourhoods, for both,the 2D and 3D HP models.VSHD movesVSHD moves, as we will refer to them in this study,appeared early on in the simulation of polymer chains byVerdier and Stockmayer [40]. In this early work, only sin-gle residue moves were used, and the single residue end andcorner moves were introduced. That work was later cri-tiqued in a study by Hilhorst and Deutch [18], which alsointroduced the two residue crankshaft move. Gurler et al.combined all three types of moves into one search neigh-bourhood [41], which we call the VSHD neighbourhood.End movesFor a chain of length n, an end move can be performed onresidue 1 or residue n. The residue is pivoted relative to itsconnected neighbour to a free position adjacent to thatneighbour. This mechanism ensures that the chainremains connected. If more than one valid position is free,one is chosen uniformly at random. For instance, in Fig-ure 2a, residue 1 could be moved to two possible posi-tions on the lattice. Generally, for the 2D and 3D HPmodel, there are up to 2 and 4 possible moves for each ofCorner movesA corner move can potentially be performed on any resi-due excluding the end residues. For a corner move to bepossible, the two connected neighbours of some residue imust be mutually adjacent to another, unoccupied posi-tion on the lattice. Note that for both, the 2D square andthe 3D cubic lattices, any two residues i - 1 and i + 1 canshare at most one adjacent lattice position. When this sit-uation occurs, a corner is formed by residues i - 1, i and i+ 1. If the mutually adjacent position is empty, residue ican be moved to it. This is illustrated in Figure 2b for the2D case. Overall, in 2D as well as in 3D, there are at mostn - 2 possible corner moves for any conformation of a n-residue chain.Crankshaft movesA crankshaft move can occur if some residue i is part of au-shaped bend in the chain, as shown in Figure 2c. Refer-ring to this figure, the crankshaft move can be performedin 2D if positions i' and i + 1' are empty. Crankshaftmoves in 2D always involve a 180° rotation of a u-shapedstructure consisting of four connected neighbours on thechain. The 3D case is handled analogously, except that themotif is rotated by either 90° or -90°, provided the appro-priate positions are empty. (If both rotations are feasible,VSHD MovesFigure 2VSHD Moves. Residue positions are shown before the move and immediately after a successful move. T(t) denotes the state of the conformation at time t. In 2a there are two possible positions that residue one could be moved to, denoted by 1' in grey circles. Each position is checked in ran-dom order for availability. If a position is found to be free, the residue is moved. In 3D the same logic is followed except there is a possibility of two additional potential positions (four in total). End moves are applied on the last residue n in the same manner. 2b shows there to be only one potential new position for a corner move. This is also the case in 3D where the position must lie on the plane formed by i - 1, i, and i + 1. 2c shows the case for a crankshaft move. In 3D, the crankshaft could potentially rotate 90° or -90°.3211'T(t) T(t+1)1'321 i-1ii+1 i'T(t) T(t+1)i-1ii+11i-2i-1i+2ii+1 i+1'i'i+31i-2i-1i+2ii+1i+3T(t) T(t+1)(a) (b)(c)Page 4 of 20(page number not for citation purposes)the two end residues, respectively. one of them is chosen uniformly at random). Note that inBMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342the figure, the same crankshaft move can be initiated fromresidue i and i + 1.Pull movesPull moves have been introduced relatively recently by Leshet al. [7], who used them in the context of a generic tabusearch algorithm for the 2D HP protein folding problem.In the following, we will briefly introduce the central ideabehind this type of move. For a formal treatment of thepull move neighbourhood and the proof of its complete-ness (i.e., the fact that any two valid sequence conforma-tions on the 2D square lattice can be transformed intoeach other by a sequence of pull moves), the reader isdirected to the original paper by Lesh et al. [7].Suppose at time t for some residue i there is an empty lat-tice position labeled L which is adjacent to residue i + 1and diagonally adjacent to i. Further consider a positionmutually adjacent to L and i, labeled C. Using this labe-ling, a square is formed by residues i, i + 1, L and C, asillustrated in Figure 3a. A pull move can only proceed if Cis either empty or occupied by residue i - 1.The simplest case occurs when C is occupied by residue i -1, in which case the entire move consists of moving resi-due i to location L. Note that this move, which is illus-trated in Figure 3a, is equivalent to the previouslyintroduced corner move. When C is not occupied by resi-due i - 1, i is moved to L and i - 1 is moved to C. If residuei - 2 is adjacent to position C, this second operation com-pletes the pull move. This case is illustrated in Figure 3b.If, however, residue i - 2 is adjacent to position C, thechain is still not in a valid conformation at this point, andin this case, the following procedure is used. Using thenotation by Lesh et al. [7], starting with residue j = i - 2, let(xj(t + 1), yj(t + 1)) = (xj+2(t), yj+2(t)) until a valid confor-mation has been found or residue 1 has been moved.Informally speaking, residues are successively pulled intopositions that have just been vacated (as a consequence ofpulling another residue) until a valid conformation hasbeen obtained or one end of the chain is reached. Figure3c illustrates this situation where residues i to i - 3 werepulled successively, until the valid conformation shownon the right was obtained. Note that pull moves have beendescribed as pulling from residue i down to residue 1, ifneeded. Pulling in the opposite direction is equivalentand also valid.When they introduced pull moves, Lesh et al. claimed thatthe resulting neighbourhood could be generalized to the3D case. However, to the best of our knowledge no algo-rithm implementing pull moves for the 3D case has beenconsider choices of L and C in any plane containing bothi and i + 1; in the case of the 3D cubic lattice, there are twosuch planes. In our study presented here, we have imple-mented this generalization of pull moves in the context ofa standard REMC algorithm, which will be described inthe following.Replica exchange Monte Carlo searchIn the following, we provide a brief introduction to replicaexchange Monte Carlo search. For an in-depth descriptionof the algorithm including its historical aspects, the readeris referred to the review of extended ensemble MonteCarlo algorithms by Iba [42], which also provides detailsrelated to simulated tempering [43] and replica MonteCarlo search [44].Replica exchange Monte Carlo (REMC) search maintainsχ independent replicas of a potential solution. Each of theχ replicas has an associated temperature value (T1, T2,...,Tχ). Each temperature value is unique and the replicas arenumbered such that T1 <T2 < ... <Tχ. In our description ofthe algorithm, we will label the χ conformations main-tained by the algorithm at any given time with the replicanumbers (1 ,... χ,) and always associate temperature Tjwith replica j (for all j such that 1 ≤ j ≤ χ). Thus, theexchange of replicas is equivalent to (and is commonlyimplemented as) the swap of replica labels.Each of the χ replicas independently performs a simpleMonte Carlo search at the respective temperature setting.The transition probability from some current conforma-tion c to an alternative conformation c' is determinedusing the so-called Metropolis criterion such thatwhere ∆E : = E(c') - E(c) is the difference in energy betweenconformations c' and c, and T denotes the temperature ofthe replica.We can represent the current state of the extended ensem-ble of all χ replicas as a vector c : = (c1,..., cχ) shown below,where cj is the conformation of replica j, which (as previ-ously stated) runs at temperature Tj. During replicaexchange, temperature values of neighbouring replicas areswapped with a probability proportional to their energyand temperature differences. An exchange of tempera-tures, and therefore a relabeling of replicas, affects thestate of the extended ensemble c. Therefore, we define anexchange between two replicas i and j more generally as atransition of the current ensemble state c to an alteredPr[ ] :,.c cEeET→ ′ =≤−1 0if otherwise∆∆ (1)Page 5 of 20(page number not for citation purposes)published. For the 2D case, valid choices of L and C arerestricted to a single plane. The generalization to 3D canstate c'. We define l(ci) = i, the current label or replicanumber, for all ci. The probability of a transition fromBMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342ensemble state c to state c' by exchanging replicas i and jis defined as:The value ∆ is the product of the energy difference andinverse temperature difference:∆ : = (βj - βi)(E(ci) - E(cj)).where is the inverse of the temperature of replicai.Potential replica exchanges are only performed betweenneighbouring temperatures, since the acceptance proba-bility of the exchange drops exponentially as the temper-ature difference between replicas increases.Our REMC algorithmswith three variants of the REMC algorithm for both the 2Dand 3D case, which differ only in the neighbourhoodsused in the subsidiary Monte Carlo local search proce-dure. REMCvshd folds protein sequences using exclusivelythe VSHD neighbourhood. Likewise, REMCpm is based onthe pull move neighbourhood. Our third variant, REMCm,makes use of a hybrid neighbourhood that allows both,pull moves and VSHD moves to be performed; more pre-cisely, in each local search step, the pull move neighbour-hood will be used with probability ρ (where ρ is aconfigurable parameter of the algorithm) and otherwise,the VSHD neighbourhood will be used.ResultsTo evaluate the performance of our REMC algorithms wedirectly compared results against those for two state-of-the-art folding algorithms, ACO-HPPFP-3 and PERM. Inthe same manner in which the parameters for REMCremain fixed for all experiments, the PERM and ACO-HPPFP-3 parameters have been fixed to the values sug-gested by their authors. The parameter values for ACO-HPPFP-3 have been taken from Shmygelska and Hoos [9],and those for PERM were optimized by P. Grassberger andhis group and pre-configued in the code kindly providedPr Pr[ ] : [ ( ) ( )]:.c c→ ′ = ↔=≤ −l c l cei j1 0∆∆ otherwise(2)βiiT=1Pull MovesFigure 3Pull Moves. This figure has been reproduced from [7] to illustrate the central idea behind this neighbourhood. In 3a, the sim-plest case where position C is occupied by residue i - 1 is shown. This move is equivalent to a corner move in the VSHD move-set. In 3b, residue i is moved to L and i - 1 to C. The chain is in a valid conformation and the move is finished. In 3c, residues i down to i - 3 must be pulled until a valid conformation is found.ii+1 LT(t) T(t+1)Ci+1 ii-1(a)i+1 LT(t) T(t+1)Ci-2i-1ii+1i-2i-1i(b)LT(t) T(t+1)Ci-7i-4i-5i-6i-2i-3i-1ii+1 i-7i-4i-5i-6i-3ii+1i-2 i-1(c)Page 6 of 20(page number not for citation purposes)Details of our implementation of REMC search are pre-sented in the 'Methods' section. We have experimentedto us. For all runs of PERM, the parameter settings β : = 26and q : = 0.2 were used [13].BMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342In our experiments we conducted a number of runs witha given energy or CPU time cut-off on a standard set ofbenchmark instances for both the 2D and 3D HP proteinfolding problems. Furthermore, several new benchmarksets were created to evaluate the performance of REMC onlong, biologically inspired sequences as well as onsequences with provably unique optimal conformations.A direct comparison between ACO-HPPFP-3 and PERMhas been previously reported by Shmygelska and Hoos[9]. In this earlier work it has been shown through exper-iments on artificially designed as well as on known bio-logical sequences that PERM has inherit difficulties withfolding proteins where the termini interact in the forma-tion of the hydrophobic core. Here, we performed analo-gous experiments to determine the performancedifferences between ACO-HPPFP-3, PERM and our REMCalgorithms for these cases. We further tested our 2D algo-rithms using the pull move neighbourhood, REMCm andREMCpm, against the first algorithm based on this neigh-bourhood, GTabu, by means of a computational experi-ment analogous to that performed by Lesh et al. [7].Results for standard benchmark sequencesThere are eleven benchmark sequences for the 2D HPmodel and ten for the 3D HP model. This benchmark set,in whole or in part, has been used extensively in the liter-ature [8,9,11,12,45-48]. A complete listing of the 2D and3D sequences can be found in Table 1. To evaluate per-formance differences between ACO-HPPFP-3, PERM andour REMC algorithms, we follow the experimental proto-col established by Shmygelska and Hoos [9].Every run was performed independently with a uniquerandom seed. In the 2D case, for sequences of length n ≤50, 500 independent runs were performed; for 50 <n ≤ 64,100 runs; and for n > 64, 20 runs. In the case of 3D, 100independent runs were performed for each sequence.Results for ACO-HPPFP-3 and PERM were taken from thestudy of Shmygelska and Hoos [9], which used the sameexperimental environment and protocol. Expected run-times for PERM are computed as ,where t1 and t2 are the average run-times when foldingfrom the N-terminus and C-terminus of the given proteinsequence, respectively; as noted by Shmygelska and Hoos,the performance of PERM often varies substantiallybetween folding directions [9].Results for the 2D case are listed in Table 2. All algorithmsshow similar running times for the first three benchmarkother two variants of REMC, both utilizing pull moves,perform better than ACO-HPPFP-3 for all instances.PERM shows better performance than REMCm and REM-Cpm for sequence S1-7. On average, it also solves S1-9faster than REMCpm. In every other case, however, REMCpmand REMCm outperform PERM, often by a significant fac-tor. Of particular note is the fact that the variants usingpull moves solve sequence S1-8 in a matter of CPU sec-onds compared to 78 CPU hours required on average byPERM (ACO-HPPFP-3 also outperforms PERM on thistt texp= ⋅ +−21 11 21Table 1: Standard benchmark sequencesID Length E* Protein Sequence2D HPS1-1 20 -9 (HP)2PH2PHP2HPH2P2HPHS1-2 24 -9 H2(P2H)7HS1-3 25 -8 P2HP2(H2P4)3H2S1-4 36 -14 P3H2P2H2P5H7P2H2P4H2P2HP2S1-5 48 -23 P2H(P2H2)2P5H10P6(H2P2)2HP2H5S1-6 50 -21 H2(PH)3PH4PH(P3H)2P4H(P3H)2PH4(PH)4HS1-7 60 -36 P2H3PH8P3H10PHP3H12P4H6PH2PHPS1-8 64 -42 H12(PH)2(P2H2)2P2HP2H2PPH2P2HP2(H2P2)2(HP)2H12S1-9 85 -53 H4P4H12P6(H12P3)3HP2(H2P2)2HPHS1-10 100 -50 P3H2P2H4P2H3(PH2)2PH4P8H6P2H6P9HPH2PH11P2H3PH2PHP2HPH3P6H3S1-11 100 -48 P6HPH2P5H3PH5PH2P4H2P2H2PH5PH10PH2PH7P11H7P2HPH3P6HPH23D HPS2-1 48 -32 HPH2P2H4PH3P2H2P2HPH3PHPH2P2H2P3HP8H2S2-2 48 -34 H4PH2PH5P2HP2H2P2HP6HP2HP3HP2H2P2H3PHS2-3 48 -34 PHPH2PH6P2HPHP2HPH2(PH)2P3H(P2H2)2P2HPHP2HPS2-4 48 -33 PHPH2P2HPH3P2H2PH2P3H5P2HPH2(PH)2P4HP2(HP)2S2-5 48 -32 P2HP3HPH4P2H4PH2PH3P2(HP)2HP2HP6H2PH2PHS2-6 48 -32 H3P3H2PH(PH2)3PHP7HPHP2HP3HP2H6PHS2-7 48 -32 PHP4HPH3PHPH4PH2PH2P3HPHP3H3(P2H2)2P3HS2-8 48 -31 PH2PH3PH4P2H3P6HPH2P2H2PHP3H2(PH)2PH2P3S2-9 48 -34 (PH)2P4(HP)2HP2HPH6P2H3PHP2HPH2P2HPH3P4HS2-10 48 -33 PH2P6H2P3H3PHP2HPH2(P2H)2P2H2P2H7P2H22D and 3D standard benchmark collection which can be found in [9]. These sequences can be found originally amongst [6, 20]. As presented in [9], Hi and Pi indicate a string of i consecutive H's and P's; likewise (s)i indicated an i-fold repetition of string s.Page 7 of 20(page number not for citation purposes)sequences (S1-1 to S1-3). For sequences S1-4 to S1-11,REMCvshd exhibited the worst performance; however, thesequence with a mean running time of 1.5 CPU hours).Sequence S1-8 has a symmetric core formed by extensiveBMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342interactions between the two termini; the difficulty ofsequences with interacting termini for PERM has beenpreviously demonstrated by Shmygelska and Hoos [9].The second-hardest instance for PERM, S1-10, is solved onaverage 2.5 times faster by REMCm and nearly 6 timesfaster by REMCpm. The other benchmark sequence with100 residues, S1-11, is solved approximately 8 times fasterby both pull move variants. Overall, on the eleven bench-mark instances the performance of PERM is matched orexceeded in 9 and 10 cases by REMCpm and REMCm,respectively.In the 3D case (see Table 3), the general performancetrend is similar. All REMC variants report superior run-ning times to ACO-HPPFP-3 in every case, as does PERM.Furthermore, PERM outperforms REMCvshd in each case,often by a significant factor. However, the generalizationof pull moves to the 3D case has lead to a substantialincrease in the performance of REMC. For only onesequence, S2-10, does PERM outperform REMCpm andREMCm (by a factor of 10). REMC using pull moves showssignificantly better performance than PERM on S2-4, S2-5and S2-9, where a five- to twenty-fold increase in perform-ance is observed. For the other instances, REMCpm andREMCm match or outperform PERM by a small margin.REMCpm and REMCm also outperform other algorithmsfound in the literature. Shmygelska and Hoos comparedPERM and ACO-HPPFP-3 against other methods withincluded the genetic algorithm of Unger and Moult [11],the evolutionary Monte Carlo algorithm of Liang andWong [8], and the multi-self-overlap ensemble algorithmof Chikenji et al. [47]. Furthermore, a previous applica-tion of replica exchange Monte Carlo search (parallel tem-pering) on the 2D square lattice [23] has failed to reachground-state configurations for the two largest standardbenchmark sequences (here referred to as S1-10 and S1-11) [47]. For the 3D cubic lattice, the hydrophobic zipperalgorithm [49], the constraint-based hydrophobic coreconstruction method [37], the core-directed chain growthalgorithm [46] and the contact interactions algorithm[10] were considered. Considering these previously pub-lished results in combination with the results reportedhere, REMCpm and REMCm both perform better than anyof the earlier methods mentioned above in terms of theenergy of the conformations found or the CPU timerequired for reaching a given energy level (where differ-ences in CPU speed are taken into account).Due to their superior performance, only the REMCpm andREMCm algorithms were considered in the remainder ofour study.Characteristic performance of REMCPrompted by the results on sequence S1-8, we decided tofurther investigate to which extent REMC using pullmoves can fold proteins with interacting termini in theircores substantially more effectively than PERM. To thatTable 2: Results on 2D benchmark sequencesID E* ACO-HPPFP-3 REMCvshd REMCpm REMCmS1-1 -9 -9 (< 1 sec) -9 (< 1 sec) -9 (< 1 sec) -9 (< 1 sec) -9 (< 1 sec)S1-2 -9 -9 (< 1 sec) -9 (< 1 sec) -9 (< 1 sec) -9 (< 1 sec) -9 (< 1 sec)S1-3 -8 -8 (2 sec) -8 (2 sec) -8 (< 1 sec) -8 (< 1 sec) -8 (< 1 sec)S1-4 -14 -14 (< 1 sec) -14 (4 sec) -14 (15 sec) -14 (< 1 sec) -14 (< 1 sec)S1-5 -23 -23 (2 sec) -23 (1 min) -23 (91% of runs 18 min)-23 (< 1 sec) -23 (< 1 sec)S1-6 -21 -21 (3 sec) -21 (15 sec) -21 (98% of runs 19 min)-21 (< 1 sec) -21 (< 1 sec)S1-7 -36 -36 (4 sec) -36 (20 min) -34 (33% of runs 33 min)-36 (10 sec) -36 (13 sec)S1-8 -42 -42 (78 hrs) -42 (1.5 hrs) -35 (11% of runs 40 min)-42 (5 sec) -42 (6 sec)S1-9 -53 -53 (1 min) -53 (20% of runs 1 day)-50 (5% of runs 19 min)-53 (2 min) -53 (38 sec)S1-10 -50 -50 (20 min) -49 (12 hrs) -46 (5% of runs 41 min)-50 (3.5 min) -50 (8 min)S1-11 -48 -48 (8 min) -47 (10 hrs) -46 (5% of runs 97 min)-48 (1 min) -48 (1.2 min)Details on runs can be found in the text. Results for PERM and ACO-HPPFP-3 are reproduced from [9]. In all instances, REMCvshd reports the worst running times followed by ACO-HPPFP-3. REMCm outperforms PERM in 10 of 11 instances and REMCpm reports better times than PERM in 9 of 11 instances. Details of the experimental protocol can be found in the text.PERMtexpPage 8 of 20(page number not for citation purposes)previously published results on the standard benchmarksets [9]. For the 2D square lattice, this comparisonend, we used three additional sequences that had beenshown to be difficult for PERM by Shmygelska and HoosBMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342[9]; these sequences are listed in Supplemental Table 1[see Additional file 1]. These sequences and the corre-sponding mean run-times for each algorithm (determinedfrom 100 independent runs) are reported in Table 4. Forall four instances, both REMC variants outperform ACO-HPPFP-3 by factors ranging from 21 to 236. In the case ofB50-5, REMCpm and REMCm easily outperform PERM (bya factor of 20) when the latter is folding from both direc-tions or from the C-terminus; however, when foldingfrom the N-terminus, PERM slightly outperforms REMCpm(by a factor of 1.2).For B50-7, REMC beats all variants of PERM by more thantwo orders of magnitude. As B50-7 involves significantinteraction between both termini, the folding direction ofPERM appears to be inconsequential. This is not the casefor D-1. When folding from the C-terminus, PERM has nodifficulty folding the sequence within 1 CPU second, as asignificant part of the C-terminus forms the hydrophobiccore of this protein. This performance is matched by bothREMC algorithms. However, when folding from the N-ter-minus, PERM requires a mean run-time of over one CPUhour. The D-2 sequence is highly symmetric in its core for-mation. PERM reports the worst run-times of all algo-rithms in this instance with a mean run-time of over 2.5CPU hours in the best case. This is more than 200 timesworse than either of the REMC algorithms. Overall, theseresults clearly indicate that, compared to PERM, REMC ismuch more effective in finding low-energy structureswhose termini interact to form hydrophobic cores.It has also been previously demonstrated that ACO-HPPFP-3 provides a larger range of relative H-H contactorder values than PERM when analyzing the ensemble offolds obtained from multiple independent runs on thependent runs. The relative H-H contact order measuresthe average separation of hydrophobic-hydrophobic con-tacts and is formally defined aswhere l is the number H-H contacts, n is the number ofhydrophobic residues, and i and j are hydrophobic resi-dues in contact that are not neighbours in the chain. Thismeasure can be employed to compare the quantity anddiversity of structures returned by one or more algorithms.Since identical conformations have the same relative H-Hcontact order value, the number of unique structures in aset is bounded from below by the number of unique con-tact order values. Furthermore, a larger range of relativecontact order values is indicative of a more diverse set ofstructures.Figure 4 demonstrates the frequency distribution of rela-tive H-H contact orders for AC0-HPPFP-3, PERM and theREMC variants using pull moves. Ground-state conforma-tions were examined from 500 independent runs per algo-rithm on S1-7 for the 2D case (left side) and on S2-5 forthe 3D case (right side). Runs were terminated immedi-ately after a ground-state conformation was found. For the2D case, ACO-HPPFP-3 and REMC find conformationswith higher relative contact order than PERM does (rela-tive COH-H = 0.324). REMC also appears to have a flatter,more even distribution than either ACO-HPPFP-3 andPERM. Both REMCpm and REMCm find 34 unique relativecontact order values, while ACO-HPPFP-3 finds 22 andPERM only 15.In the 3D case, the REMC algorithms also find a moreCOl ni jH Hi j−< −=⋅−∑: ,11Table 3: Results on 3D benchmark sequencesID E* ACO-HPPFP-3 REMCvshd REMCpm REMCmS2-1 -32 -32 (0.1 min) -32 (30 min) -32 (0.75 min) -32 (0.1 min) -32 (0.1 min)S2-2 -34 -34 (0.3 min) -34 (420 min) -34 (8.1 min) -34 (0.2 min) -34 (0.2 min)S2-3 -34 -34 (0.1 min) -34 (120 min) -34 (3.3 min) -34 (0.1 min) -34 (0.1 min)S2-4 -33 -33 (2 min) -33 (300 min) -33 (2.2 min) -33 (0.2 min) -33 (0.1 min)S2-5 -32 -32 (0.5 min) -32 (15 min) -32 (1.2 min) -32 (0.1 min) -32 (0.1 min)S2-6 -32 -32 (0.1 min) -32 (720 min) -32 (1.5 min) -32 (0.1 min) -32 (0.1 min)S2-7 -32 -32 (0.5 min) -32 (720 min) -32 (3.9 min) -32 (0.4 min) -32 (0.3 min)S2-8 -31 -31 (0.3 min) -31 (120 min) -31 (2.3 min) -31 (0.2 min) -31 (0.1 min)S2-9 -34 -34 (5 min) -34 (450 min) -34 (14 min) -34 (0.7 min) -34 (0.9 min)S2-10 -33 -33 (0.01 min) -33 (60 min) -33 (2 min) -33 (0.1 min) -33 (0.1 min)Details on runs can be found in the text. Results for PERM and ACO-HPPFP-3 have been reproduced from [9]. All variants of REMC outperform ACO-HPPFP-3. REMCm and REMCpm match or outperform PERM in 9 of 10 instances. Details of the experimental protocol can be found in the text.PERMtexpPage 9 of 20(page number not for citation purposes)same sequence [9], where the ensemble contains the firstoptimal conformation encountered in each of the inde-diverse set of ground-state structures than ACO-HPPFP-3and PERM. REMCm and REMCpm return 82 and 83 uniqueBMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342values, respectively, compared to 74 found by ACO-HPPFP-3 and 69 by PERM. Furthermore, REMC finds con-formation with larger relative contact order values thanACO-HPPFP-3 and PERM; the largest values are 0.789 forREMCm, 0.776 for REMCm, 0.75 for ACO-HPPFP-3 and0.737 for PERM.Results for longer sequencesTo evaluate how REMC's performance scales withsequence length, a new, biologically motivated test setwas constructed. All sequences were taken from the Pro-tein Data Bank and have length between 200 and 250 res-idues at a sequence similarity of less than 10%. Sequenceswere translated into HP strings based on the RASMOLhydrophobicity classification scale [50], except for non-standard amino acid symbols, such as X and Z, whichwere skipped (the same protocol has been previously usedby Shmygelska and Hoos [9]). The resulting HP sequencesare listed in Supplemental Table 2 [see Additional file 1].As ACO-HPPFP-3 scaled poorly with sequence length onthe benchmark sequences compared with PERM andREMC, it has been omitted from this evaluation. PERMwas run in both directions for each instance.For each sequence, ten independent runs were conductedfor each algorithm in both 2D and 3D. Runs were termi-nated after 60 CPU minutes on our reference machine,and the best energy was recorded. Figure 5 shows the bestand mean energy values for REMCm plotted against therespective performance metrics for PERM; the best energyvalue corresponds to the lowest energy value foundamongst all independent runs, while the mean energyvalue we report is the average of the best energies found ineach independent run. In the 2D case (Figure 5, left side),Comparison of the distribution of H-H contact orders found by REMC, ACO-HPPFP-3 and PERMFigure 4Comparison of the distribution of H-H contact orders found by REMC, ACO-HPPFP-3 and PERM. The frequency distribution of relative contact order values for folding S1-7 in 2D (left side) and S2-5 in 3D (right side) over 500 independent runs is shown. This measure can be employed to compare the quantity and diversity of folds returned by one or more algo-rithms. In both the 2D and 3D case, REMC variants have a more even distribution and find a larger number of relative contact order values than PERM or ACO-HPPFP-3. REMC and ACO-HPPFP-3 both find relative contact orders larger than PERM in 2D. In 3D, REMC finds larger relative contact order values than both PERM and ACO-HPPFP-3. 0 50 100 150 200 250 300 350 400 450 500 0.22 0.24 0.26 0.28 0.3 0.32 0.34Frequency (Relative COHH)Relative COHHREMCmREMCpmPERMACO 0 50 100 150 200 250 300 350 400 450 500 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8Frequency (Relative COHH)Relative COHHREMCmREMCpmPERMACOTable 4: Results for biological and designed sequencesID E* ACO-HPPFP-3 REMCpm REMCmB50-5 -22 5 sec 118 sec 9 sec 820 sec 6 sec 5 secB50-7 -17 271 sec 299 sec 284 sec 130 sec 1 sec 2 secD-1 -19 3 795 sec 1 sec 2 sec 236 sec 1 sec 1 secD-2 -17 9 257 sec 19 356 sec 12 524 sec 951 sec 44 sec 41 secResults for PERM and ACO-HPPFP-3 are reproduced from [9]. In all instances, REMC finds optimal conformations relatively easily compared with the other algorithms. REMC does not demonstrate an inherit difficulty folding sequences when conformations involve hydrophobic cores confined to one end of the sequence or the case involving both ends. For every instance, 100 independent runs were conducted of 1 CPU hour each. In cases where not every run reached the same energy value after 1 hour, the expected run-time to reach the energy value shown in the table was calculated using the equation detailed by Parkes and Walser [54].PERMt1 PERMt2PERMtexpPage 10 of 20(page number not for citation purposes)BMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342PERM finds a better energy value for one sequence andfinds the same best energy values for two others. In theremaining seven cases, REMCm finds superior conforma-tions. For every sequence, REMCm achieves better meanenergy values than PERM.In the 3D case (Figure 5, right side), the performance dif-ference is more pronounced. REMC finds better confor-mations on average and in the best case for everysequence. Considering the best conformations foundamong the ten independent runs for each algorithm andfor each initial direction in the case of PERM, REMCmreaches significantly lower energies; the same holds withrespect to average energy values. REMCpm achieves similarperformance in all cases (results not shown).Results for sequences with unique ground-state conformationsFurther experiments were conducted on three classes ofsequences with unique ground-state conformations in the2D HP model on the square lattice. In 2003, Aichholzer etal. identified and proved that a class of sequencesuniquely fold into structures dubbed Z-structures [51].Later, Gupta et al. proposed a tile set used to design con-structible structures for the inverse protein folding prob-lem. Of these constructible structures, the authors provedthat the sequences associated with linear structures withno bends (L0) and linear structures with one bend (L1)uniquely fold into the designed conformation [52]. Exam-ples of these structures are shown in Figure 6. We con-structed a new test set comprising ten sequences ofincreasing length for each of these classes of sequencesand list them in Supplemental Table 3 [see Additional file1]. To evaluate the performance of PERM and REMC onthese test sequences, we performed 100 independent runsper sequence, each with a cut-off time of 1 CPU hour.PERM was run in both directions, and in the case whereneither direction finds the (known) lowest energy, theexpected run-time is reported for the best energy found.The mean run-times for the Z-structures is reported inTable 5. REMC finds the unique conformation of eachsequence relatively easily with a worst case mean run-timeof 2 CPU seconds. PERM's performance, on the otherhand, scales very poorly with sequence length, and thealgorithm is unable to find the optimal energy for the fourlongest sequences.The L0 structures turned out to be much more difficult tosolve for REMC (see Table 6). Neither PERM nor REMCare able to find the optimal conformation for L0-9 or L0-10, although REMCm finds lower-energy conformationsthan PERM in both cases. PERM finds the same sub-opti-nate PERM by finding either lower energy conformationsor by requiring less run-time for reaching the same energyvalues.The L1 structures are the hardest for all algorithms consid-ered here (see Table 7). For the three longest sequences,both REMC algorithms find the same sub-optimal solu-tions as PERM, but PERM reaches these only for one fold-ing direction. For the other instances, REMC consistentlyfinds the optimal conformation significantly faster thanPERM.Comparison with GTabuThe variants of REMC utilizing pull moves significantlyoutperform REMCvshd for the 2D and 3D HP models. Thisclearly demonstrates the effectiveness of the pull moveneighbourhood. To address the question to which extentthe REMC search strategy contributes to the excellent per-formance of REMCpm and REMCm, we compared the per-formance of these algorithms with that of GTabu, the firstalgorithm for the HP model to use pull moves. In theirpaper describing GTabu and pull moves, Lesh et al.reported performance results for the standard benchmarkTable 5: Results on stable Z-structuresID E* REMCpm REMCmZ-4 -3 -3 (< 1 sec) -3 (< 1 sec) -3 (< 1 sec)Z-8 -7 -7 (< 1 sec) -7 (< 1 sec) -7 (< 1 sec)Z-12 -11 -11 (< 1 sec)-11 (< 1 sec) -11 (< 1 sec)Z-16 -15 -15 (3 sec) -15 (< 1 sec) -15 (< 1 sec)Z-20 -19 -19 (51 min)-19 (< 1 sec) -19 (< 1 sec)Z-24 -23 -23 (49 hrs†)-23 (< 1 sec) -23 (< 1 sec)Z-28 -27 -26 -27 (< 1 sec) -27 (< 1 sec)Z-32 -31 -29 -31 (< 1 sec) -31 (< 1 sec)Z-36 -35 -31 -35 (1 sec) -35 (< 1 sec)Z-40 -39 -34 -39 (2 sec) -39 (1 sec)The Z-structures proposed in [51] are easy for REMC to fold when compared with PERM. After Z-24, PERM is unable to find the unique optimal conformation in any of the 100 independent runs conducted. When only one folding direction of PERM finds the optimal conformation, we report the mean run-time of that direction, denoting this in the table with a †. When best energies found by and differed (and neither find the optimal solution), the best energy by either is reported and the run-time is omitted. For every instance, 100 independent runs were conducted of 1 CPU hour each. In cases where not every run reached the same energy value after 1 hour, the expected run-time to reach the energy value shown in the table was calculated using the equation detailed by Parkes and Walser [54].PERMtexpPERMt1 PERMt2Page 11 of 20(page number not for citation purposes)mal solution as REMCpm for L0-10 in significantly lesstime. For all other instances, both REMC variants domi-sequences S1-8 to S1-11 [7]. To ensure the comparabilityof results, we used the same experimental protocol as LeshBMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342et al. for evaluating our REMC algorithms on thesesequences. Two hundred independent runs were per-formed for each sequence and the rate of successful com-pletion (i.e., fraction of runs in which the best knownenergy was reached) after 30 minutes and 60 minutes wasreported.Lesh et al. pointed out that the performance of theirimplementation of GTabu could be improved by a factorof 2 to 5 through relatively straightforward optimizations.Furthermore, GTabu was evaluated on different hardware(based on the 1000 MHz Alpha processor). Therefore,optimistically assuming our hardware is three times fasterand GTabu performance could be improved by a factor offive, we also report run-times for GTabu if it were faster bya factor of fifteen.Figure 7 shows the run-time distributions of REMCpm andREMCm (i.e., empirical distributions of run-times over the200 independent runs) for each of the four sequences. Wealso indicate the completion rates achieved by GTabuafter 30 CPU minutes (scaled to 2 minutes) and 60 CPUminutes (scaled to 4 minutes). Overall, even for the opti-mistically scaled results, it is clear that REMC significantlyoutperforms GTabu on three of the four instances. Theremaining instance, S1-8, is the most difficult of thebenchmark sequences for PERM, while being solved easilyby both, GTabu and REMC.DiscussionIn all experiments reported so far, the parameters of theREMC algorithms have remained fixed at the values listedin the 'Methods' section. These parameter sets were cho-sen separately for the 2D case and for the 3D case usingthe automatic algorithm configuration tool of Hutter et al.[53], which performs a local search in parameter space toTable 7: Results on stable L1-structuresID E* REMCpm REMCmL1-1-3 -16 -16 (120 sec)-16 (6 sec) -16 (6 sec)L1-2-2 -16 -16 (57 sec)-16 (3 sec) -16 (2 sec)L1-3-1 -16 -16 (28 sec)-16 (3 sec) -16 (3 sec)L1-1-5 -24 -24 (100 hrs†)-24 (17 min)-24 (14 min)L1-2-4 -24 -24 (49 hrs†)-24 (9 min) -24 (7 min)L1-3-3 -24 -23 -24 (7 min) -24 (5 min)L1-4-2 -24 -24 (49 hrs†)-24 (5 min) -24 (4 min)L1-3-7 -40 -38 -38 (20 hrs) -38 (20 hrs)L1-5-5 -40 -38 -38 (16 hrs) -38 (14 hrs)L1-8-2 -40 -38 -38 (16 hrs) -38 (14 hrs)The L1-structures proposed in [52] are the most difficult stable structures for REMC to fold. Both PERM and REMC are unable to find the optimal conformations for L1-3-7, L1-5-5, and L1-8-2 after 100, one hour runs. PERM also did not find the optimal conformation for L1-3-3. When only one folding direction of PERM finds the optimal conformation, we report the mean run-time of that direction, denoting this in the table with a †. When best energies found by and differed (and neither find the optimal solution), the best energy by either is reported and the run-time is omitted. For every instance, 100 independent runs were conducted of 1 CPU hour each. In cases where not every run reached the same energy value after 1 hour, the expected run-time to reach the energy value shown in the table was calculated using the equation detailed by Parkes and Walser [54].PERMtexpPERMt1 PERMt2Table 6: Results on stable L0-structuresID E* REMCpm REMCmL0-1 -4 -4 (< 1 sec) -4 (< 1 sec) -4 (< 1 sec)L0-2 -8 -8 (< 1 sec) -8 (< 1 sec) -8 (< 1 sec)L0-3 -12 -12 (< 1 sec)-12 (< 1 sec)-12 (< 1 sec)L0-4 -16 -16 (32 sec)-16 (7 sec) -16 (5 sec)L0-5 -20 -20 (3 hrs†)-20 (1.1 min)-20 (55 sec)L0-6 -24 -23 -24 (16 min)-24 (13 min)L0-7 -28 -26 (33 sec) -28 (3.2 hrs)-28 (2.5 hrs)L0-8 -32 -30 (3 min) -32 (50 hrs)-32 (16 hrs)L0-9 -36 -34 (22 min)-35 (99 hrs) -35 (100 hrs)L0-10 -40 -38 (40 min)-38 (9.6 hrs)-39 (100 hrs)The L0-structures proposed in [52] are hard for both REMC and PERM to fold. After L0-5, PERM is unable to find the unique optimal conformation in any of the 100 independent runs conducted. REMC is unable to find the ground-state conformation for L0-9 and L0-10, however, REMCm finds better sub-optimal conformations than PERM in both instances. When only one folding direction of PERM finds the optimal conformation, we report the mean run-time of that direction, denoting this in the table with a †. When best energies found by and differed (and neither find the optimal solution), the best energy by either is reported and the run-time is omitted. For every instance, 100 independent runs were conducted of 1 CPU hour each. In cases where not every run reached the same energy value after 1 hour, the expected run-time to reach the energy value shown in the table was calculated using the equation detailed by Parkes and Walser [54].PERMtexpPERMt1 PERMt2Page 12 of 20(page number not for citation purposes)optimize a given performance criterion (here: mean run-time). Attempts to manually configure the algorithmBMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342parameters failed to produce settings as robust as thosefound by the automated tool. Experiments using manu-ally tuned parameter configurations yielded performanceresults that were biased in favour of either short or longsequences.To better understand the impact of parameter settings onthe performance of our REMC algorithms, we performeda series of additional experiments, in which we varied oneparameter at a time, while leaving all others fixed at theirdefault values (as specified in the 'Methods' section), i.e.,(ϕ, τmin, τmax, χ, ρ) : = (500, 160, 220, 5, 0.4) in 2D and (ϕ,τmin, τmax, χ, ρ) : = (500, 160, 220, 2, 0.5) in 3D, where ϕis the number of local steps in a Monte Carlo search, τmin,and τmax, are the minimum and maximum temperaturevalues respectively, χ is the number of replicas to simulateand ρ is the probability of performing a pull move.Two test sequences were selected from the standardbenchmark set for this purpose. The first sequence, S1-7,was selected for the 2D case as it does not involve signifi-cant interaction of the sequence termini in hydrophobiccore formation. We did not choose a sequence with a sym-metric optimal fold, such as S1-8, since in that case, REMCalways appeared to find an optimal conformation fast(compared to the time required for solving othersequences of similar length), irrespectively of the parame-ter settings used. For the 3D case, sequence S2-7 was cho-sen. Neither sequence demands extensive CPU time tosolve, therefore 100 independent runs were conducted foreach parameter value being evaluated. In the following,we always report the mean run-time required for reachingthe target energy level. Results for REMCvshd have beenomitted, because they are always significantly inferior tothose achieved by REMCm and REMCpm.Examples of sequence classes with unique structuresFigure 6Examples of sequence classes with unique structures. On the left, an example of a Z-structure proposed by Aich-holzer et al. [51] is shown. On the right, we show an example of a L1-structure proposed by Gupta et al. [52]. An L1-struc-ture has one bend whereas the other L-structure we experi-ment with (L0) has no bends. The sequences associated with these structures have a provably unique optimal conforma-tion [51, 52].1 n n1Comparison of energies found by REMC and PERM for long biological sequencesFigure 5Comparison of energies found by REMC and PERM for long biological sequences. The best and mean energy values found over 10 independent, one hour runs for each of the long biological sequences found in Supplemental Table 2 [see Addi-tional file 1] is shown. The mean energy values reported for each instance is the average energy found amongst the 10 inde-pendent runs; the best energy is the lowest value found amongst the 10 runs. Notice that ground-state conformations have minimal free energy and therefore lower energy values are more desirable. In the 2D case (left side), PERM finds a best energy value lower than REMC in one instance and matched the best energy value found by REMC in two other instances. In the other 7 instances, REMC finds conformations with lower energies. In all instances, the average energy found was lower for REMC than PERM. In the 3D case (right side), REMC reports lower energies than PERM overall and on average for every instance.-120-110-100-90-80-120-110-100-90-80PERM energyREMCm energymean energybest energy-220-210-200-190-180-170-160-150-140-220-210-200-190-180-170-160-150-140PERM energyREMCm energymean energybest energyPage 13 of 20(page number not for citation purposes)BMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342Number of replicasA particularly important parameter of any REMC algo-rithm is the number of replicas, i.e., the number of confor-mations on which Monte Carlo search is concurrentlyperformed. In the literature, the use of χ : = replicashas been suggested, where N is the number of degrees offreedom within the system [42].To test the specific effect of this parameter within the con-text of protein folding in the HP model with our currentimplementation, we conducted experiments on S1-7 andS2-7 using 2, 3, 4, 5, 6, 7, 8, 9 and 10 replicas. As statedabove, all other parameters remained fixed including theminimum and maximum temperature, set to 160 and220, respectively. Formally, when evaluating the perform-ance for replicas χ the temperature of replica k, 1 ≤ k ≤ χ,was determined by the uniform linear function(right). Interestingly, the best parameters found by theautomatic algorithm configuration tool of Hutter et al.[53], 5 replicas for the 2D case and 2 replicas for the 3Dcase, seem to perform poorly for the problem instancestested here. In fact, the worst results in the 2D case forboth REMCm and REMCpm occurred when using 5 replicas.However, the results shown in Figure 8 demonstrate theeffect of the number of replicas on run-time only for twospecific problem instances, whereas the automatic algo-rithm configuration tool determined parameters such thatperformance was optimized across a wide range of prob-lem instances.Temperature distributionWe now focus our attention on the effect of temperaturevalues on run-time. The probability distributions thatcontrol the acceptance of conformations during theMonte Carlo search depend directly on the temperaturesettings for each replica (see Equation 1); similarly, theprobability for performing replica exchanges depends onthe temperature difference between neighbouring replicas(see Equation 2). Generally, a replica with a higher tem-NT k k( ) : ( )= + − ⋅−−160 1220 1601χComparing REMC and GTabuFigure 7Comparing REMC and GTabu. The run-time distributions of REMCm and REMCpm for the four largest benchmark instances in 2D are shown; P(solve) denotes the probability of finding a ground-state conformation within a given run-time. The comple-tion rates for GTabu after 30 minutes and 60 minutes as reported in [7] are plotted. Optimistically assuming GTabu could be improved by a factor of 15 under different experimental conditions and implementation improvements, we have also plotted the same completion rates after 2 and 4 minutes. In the case of S1-8, GTabu reports a 100% successful completion rate. In all other instances, both variants of REMC using pull moves in their local search neighbourhood outperform GTabu even under a handicapped analysis. 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.1 1 10 100 1000 10000S1-8 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1 10 100 1000 10000S1-9 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1 10 100 1000 10000P(solve)run-time, log [CPU sec]S1-10REMCmREMCpmGTabu - ScaledGTabu - Actual 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1 10 100 1000 10000S1-11Page 14 of 20(page number not for citation purposes)Figure 8 shows the effect of the number of replicas onmean run-time in the 2D case (left) and the 3D caseperature value will accept a worsening move with a higherprobability than a replica at a lower temperature. Hence,BMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342at higher temperatures, the search process is less likely tostagnate in local minima of the energy landscape. At thesame time, however, lower temperatures are required forexploring promising regions effectively. Therefore, there isa trade-off between search diversification and intensifica-tion that is controlled by the temperature values used bythe replicas. While our algorithms support arbitraryassignments of temperature values to each replica, in allexperiments conducted in this study we have chosen sim-ple linear temperature distributions over replicas, inwhich the temperature values are obtained by uniformlinear interpolation between a minimum and a maximumtemperature value. Furthermore, we have chosen to fix theminimum temperature to 160 in all cases; at this value,significantly worsening moves are accepted with a proba-bility near zero while exchanges between the neighbour-ing temperature are still possible when reasonable valuesare chosen. For a thorough discussion on the use of expo-nential temperature distributions and the general effect oftemperature distribution on performance, the reader isreferred to the work of Iba [42] and Mitsutake et al. [24].The results reported in Figure 9 suggest a clear relationshipbetween maximum temperature and mean run-time inboth, the 2D case (left side) and the 3D case (right side).In the 2D case, run-time is poor at both extremes of thetemperature range. When temperature values are too low,the algorithm gets trapped in local minima regions forextended periods of time; likewise, higher temperaturesmake it difficult for the algorithm to effectively optimizepromising conformations. The default parameter value of220 seems a reasonable choice for both REMCm and REM-Number of MC stepsThe parameter φ specifies the number of Monte Carlosteps performed by each replica between any two (pro-posed) replica exchanges. To determine the effect of thisparameter on the run-time of our REMC algorithms, weconducted experiments using a number of values rangingfrom 5 to 5000 MC steps between replica exchanges.Figure 10 shows the results for the 2D case (left side) and3D case (right side). Although the relationship betweenthe setting of ϕ and algorithm performance is not as clearas in the case of temperature choices, we observe that ourdefault value of 500 MC steps is a good choice for REMCmand REMCpm on both, 2D and 3D instances.Probability of performing pull movesIn REMCm, a parameter ρ is used to specify the probabilityof using the pull move (rather than the VSHD) neighbour-hood in any given search step. Figure 11 illustrates howthe value of ρ affects the performance of the algorithm inthe 2D case (left side) and 3D case (right side). Note thatfor ρ = 0, REMCm becomes identical to REMCvshd, and forρ = 1, REMCm behaves exactly like REMCpm. As can beexpected based on the results previously shown for allthree algorithms, low settings of ρ result in substantiallyweaker performance than high settings; for the instancesconsidered here, there were no significant performancedifferences for ρ ≥ 0.3.ConclusionIn this work we have demonstrated the effectiveness of anEffect of number of replicas on run-timeFigure 8Effect of number of replicas on run-time. Results for mean runtimes of 100 independent runs at varying number of repli-cas is shown for S1-7 in 2D (left) and S2-7 in 3D (right). A general relationship is unclear, however, the runtimes observed while increasing the number of replicas scale less than the expected linear increase in most cases. 8 9 10 11 12 13 14 15 2 3 4 5 6 7 8 9 10CPU time [sec]number of replicasREMCmREMCpm 14 15 16 17 18 19 20 21 2 3 4 5 6 7 8 9 10CPU time [sec]number of replicasREMCmREMCpmPage 15 of 20(page number not for citation purposes)Cpm. In the case of 3D, it seems that run-time scales worseas temperature is increased.extended ensemble algorithm, replica exchange MonteCarlo search, when applied to the protein structure predic-BMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342tion problem for the HP model on the two dimensionalsquare lattice and the three dimensional cubic lattice. Adirect comparison with two state-of-the-art algorithms,ACO-HPPFP-3 and PERM, on a standard set of bench-mark sequences has shown that when using the pull moveneighbourhood, REMC performs exceptionally well. Inthe 2D case, REMC ties or outperforms ACO-HPPFP-3 onevery problem instance we studied. Furthermore, the per-formance of REMCm matches or exceeds that of PERM onten out of the eleven benchmark instances.In 3D, we have shown that REMC outperforms ACO-HPPFP-3 on all commonly studied benchmark instances.Moreover, REMC variants based on pull moves findground-state conformations as fast or faster than PERMfor nine of ten instances and with a mean run-time of 0.1CPU seconds on the remaining instance (which PERMsolves in 0.01 CPU seconds on average).We have demonstrated that in the context of REMCsearch, using pull moves – as opposed to VSHD movesonly – results in substantial performance improvements.We have also shown that by combining pull moves andVSHD moves into a hybrid search neighbourhood, better(and more robust) performance can be obtained in somecases. At the same time, the use of REMC search also con-tributes to the overall effectiveness of our new algorithms,as can be seen from the fact that our REMC algorithmsusing pull moves performs better than the GTabu algo-rithms, which is also based on the pull move neighbour-lattice, (to the best of our knowledge) our study is the firstto use pull moves on a 3-dimensional lattice model.REMC proved to be very effective in folding proteinswhose hydrophobic cores are formed by interacting ter-mini, such as S1-8 from the standard benchmark set – aclass of sequences that are very difficult for PERM. Simi-larly, we have shown that REMC finds ground-state con-formations for sequences with provably unique optimalstructures more effectively than PERM. We also presentedevidence indicating that when applied to sequences withdegenerate ground-states, REMC finds a larger and morediverse set of ground-state conformations in both 2D and3D. Finally, we have demonstrated that REMC performsbetter than PERM on long biological sequences (in 2Dand 3D), which suggests that REMC's performance scalesquite well with sequence length. We expect, however, thatfor very long sequences it may be beneficial to use a chain-growth method to generate a compact conformation fromwhich REMC search is started. Overall, we have demon-strated that REMC algorithms using the pull move neigh-bourhood show excellent performance on the HP model.Considering the generality of REMC and the possibility ofadapting the concept of pull moves to more complex lat-tice structures, we see much promise in developing similaralgorithms for models that can represent protein structuremore accurately.MethodsIn this section, specific details of our algorithm, experi-Effect of maximum temperature on run-timeFigure 9Effect of maximum temperature on run-time. Results for mean runtimes of 100 independent runs is shown for an increasing value of the maximum temperature. In the 2D case of folding S1-7 (left side), extremely low and extremely high tem-peratures yield the worst running times. The mean run-time seems to consistently scale worse as the maximum temperature is increased in the 3D case, while folding S2-7. 8 10 12 14 16 18 20 22 24 160 180 200 220 240 260 280 300 320 340CPU time [sec]maximum temperatureREMCmREMCpm 10 15 20 25 30 35 40 45 160 180 200 220 240 260 280 300 320 340CPU time [sec]maximum temperatureREMCmREMCpmPage 16 of 20(page number not for citation purposes)hood. While GTabu introduced pull moves on the square mental protocol and experimental environment are listed.BMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342Naming of problem instancesWe have followed the naming conventions of probleminstances established by Shmygelska and Hoos [9]. Theinstances with unique ground-state conformations werenamed analogously. For the long biologically-inspiredsequences we retained the respective Protein Data Bankidentification codes.Details of our REMC algorithmIn Figure 12, pseudo-code is presented illustrating thedetails of our Monte Carlo search procedure performedfor a single replica and a predetermined number of steps.Figure 13 presents pseudo-code of our replica exchangeimplementation.In order to demonstrate the effectiveness of the REMCalgorithms without prior knowledge of problemEffect of pull move probability on run-timeFigure 11Effect of pull move probability on run-time. Results for mean runtimes of 100 independent runs using different probabil-ities of performing pull moves are reported for folding S1-7 in 2D (left) and S2-7 in 3D (right). Worst running times are observed for very low values. Otherwise, performance is generally consistent for other values. 5 10 15 20 25 30 35 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1CPU time [sec]pull move probabilityREMCm 14 16 18 20 22 24 26 28 30 32 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1CPU time [sec]pull move probabilityREMCmEffect of number of local steps on run-timeFigure 10Effect of number of local steps on run-time. Results for mean runtimes of 100 independent runs using different numbers of local steps during Monte Carlo search, ranging from 5 to 5000, are presented. The value of local steps is plotted in log scale. Results in 2D for folding S1-7 are shown on the left and those of folding S2-7 in 3D are shown on the right. The relationship in this instance appears to be more erratic, with poorest performance often observed at the extreme values tested. The default value of 500 local steps reports good relative running times for both REMCm and REMCpm in both 2D and 3D. 6 8 10 12 14 16 18 1 10 100 1000 10000CPU time [sec]number of local stepsREMCmREMCpm 14 16 18 20 22 24 26 28 30 1 10 100 1000 10000CPU time [sec]number of local stepsREMCmREMCpmPage 17 of 20(page number not for citation purposes)BMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342instances, we have fixed parameters across all experimentsincluding long sequences. For the 2D case, we use theparameter configuration (ϕ, τmin, τmax, χ, ρ) : = (500, 160,220, 5, 0.4); in 3D, (ϕ, τmin, τmax, χ, ρ) : = (500, 160, 220,2, 0.5) where ϕ is the number of local steps in a MonteCarlo search, τmin, and τmax, are the minimum and maxi-mum temperature values respectively, χ is the number ofreplicas to simulate and ρ is the probability of performinga pull move. In the case of REMCvshd, where pull moves arenot considered, we use ρ : = 0.0; likewise, ρ : = 1.0 is usedin the case of REMCpm. A detailed description of theseparameters and their effects on run-time can be found inthe 'Discussion' section. The REMC algorithms are alwaysrun on one processor and have not been parallelized.Experimental protocolIn all experiments, runs were conducted independentlyand with unique random seeds. All runs were terminatedtime was exceeded in the case of fixed time runs. Whenless than 100% of runs were successful, i.e.. reached thetarget energy level, the expected run-time was calculatedas detailed by Parkes and Walser [54] as, where ts and tf are the mean run-time of successful and unsuccessful runs, respectively, andsr is the ratio of successful to unsuccessful runs. All timingt tsrtexp s f:= + − ⋅11Outline Replica exchange Monte Carlo searchFigur 13Outline Replica exchange Monte Carlo search. (a, b) denotes a uniform random selection of a real number in the inclusive range a to b. The procedure swapLabels(ca, cb) swaps the labels (and therefore their temperature values) of replicas a and b.Procedure REMCSimulation(c, E∗,φ, ν)Input: c – the state of the extendedensemble, E∗ – the optimal energy,φ – the number of local steps, ν –the search neighbourhoodOutput: c′ – the modified state of theextended ensembleE′ ← 0;offset ← 0;while E′ > E∗ doforeach replica i in M doMCsearch (φ,ci,ν);if E(ci) < E′ thenE′ ← E(ci);endifendfchi← offset + 1;while i+ 1 ≤M doj ← i+ 1;∆← (βj − βi)(E(ci)− E(cj));if ∆ ≤ 0 thenswapLabels (ci,cj);elseq ← U(0, 1);if q ≤ e−∆ thenswapLabels (ci,cj);endifendifi← i+ 2;endwoffset ← 1− offset ;endwOutline of Monte Carlo procedureFigur 12Outline of Monte Carlo procedure. (a, b), and (a, b), denote a uniform random selection of a real number, and respectively an integer number, in the inclusive range a to b. (c', k, ν) denotes a mutation performed on conformation c' at residue k, using a move from the ν neighbourhood. When more than one move is feasible at position k, one is chosen uniformly at random.Procedure MCsearch(φ, c, ν)Input: φ – the number of search steps toperform, c – the currentconformation, and ν the searchneigbourhoodOutput: c′ – the modified conformationfor i← 1 . . .φ doc′ ← c;k ← Û(1, n);c′ ←M(c′, k, ν) ;∆E ← E(c′)− E(c);if ∆E ≤ 0 thenc← c′;elseq ← U(0, 1);if q > e−∆ET thenc← c′;endifendifendfor lPage 18 of 20(page number not for citation purposes)immediately upon discovering the best known energy ofsome sequence, or when a pre-specified maximum run-of runs was performed measuring CPU time and startedwith the first search step.BMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/342Experimental environmentAll experiments were performed on PCs with 2.4Ghz Pen-tium IV processors with 256KB cache and 1GB of RAM,running SUSE Linux version 9.2, unless explicitly statedotherwise.ImplementationAll versions of our REMC protein folding algorithms werecoded in C++ and compiled using g++ (GCC version3.3.3). The source code is freely available under the GNUGeneral Public License (GPL) and can be downloadedfrom our project website [55].Authors' contributionsHH and AS initially proposed to investigate REMC incombination with the pull move neighbourhood for theHP folding problem. AS provided code which partiallyformed the basis of the initial REMC implementation. CTimplemented REMC and conducted all experiments. CTand HH collaborated on improving REMC, incorporatingpull moves and generalizing them to the 3D cubic lattice;they also designed most of the computational experi-ments. All authors were involved in interpreting experi-mental results and in writing this manuscript.Additional materialAcknowledgementsWe would like to thank Neil Lesh, Michael Mitzenmacher and Sue White-sides for providing us with their GTabu code, and Peter Grassberger for providing us with an implementation of PERM. CT was funded by the CIHR/MSFHR Bioinformatics Training Program for Health Research. This research was also supported by funding from the Mathematics of Informa-tion Technology and Complex Systems (MITACS) Network of Centres of Excellence held by HH. We would like to thank the anonymous reviewers for their helpful comments.References1. Berger B, Leighton T: Protein folding in the hydrophobic-hydrophilic (HP) is NP-complete. Proceedings of the secondannual international conference on Computational molecular biology 1998,5(1):27-40.2. Crescenzi P, Goldman D, Papadimitriou C, Piccolboni A, YannakakisM: On the complexity of protein folding. Proceedings of the sec-ond annual international conference on Computational molecular biology3. Hart W, Istrail S: Robust proofs of NP-hardness for proteinfolding: general lattices and energy potentials. Journal of Com-putational Biology 1997, 4:1-22.4. Grassberger P: Pruned-enriched Rosenbluth method: Simula-tions of θ polymers of chain length up to 1 000 000. Physicalreview. E, Statistical physics, plasmas, fluids, and related interdisciplinarytopics 1997, 56(3):3682-3693.5. Gront D, Kolinski A, Skolnick J: A new combination of replicaexchange Monte Carlo and histogram analysis for proteinfolding and thermodynamics. The Journal of Chemical Physics 2001,115(3):1569-1574.6. Konig R, Dandekar T: Improving genetic algorithms for proteinfolding simulations by systematic crossover. Biosystems 1999,50:17-25.7. Lesh N, Mitzenmacher M, Whitesides S: A complete and effectivemove set for simplified protein folding. In RECOMB '03: Proceed-ings of the seventh annual international conference on Research in compu-tational molecular biology New York, NY, USA: ACM Press;2003:188-195. 8. Liang F, Wong WH: Evolutionary Monte Carlo for protein fold-ing simulations. The Journal of Chemical Physics 2001,115:3374-3380.9. Shmygelska A, Hoos H: An ant colony optimisation algorithmfor the 2D and 3D hydrophobic polar protein folding prob-lem. BMC Bioinformatics 2005, 6:30.10. Toma L, Toma S: Contact interactions method: A new algo-rithm for protein folding simulations. Protein Science 1996,5:147-153.11. Unger R, Moult J: Genetic Algorithms for Protein Folding Sim-ulations. Journal of Molecular Biology 1993, 231:75-81.12. Unger R, Moult J: Genetic Algorithm for 3D Protein FoldingSimulations. In Proceedings of the 5th International Conference onGenetic Algorithms San Francisco, CA, USA: Morgan Kaufmann Publish-ers Inc; 1993:581-588. 13. Hsu HP, Mehra V, Nadler W, Grassberger P: Growth-based opti-mization algorithm for lattice heteropolymers. Physical review.E, Statistical, nonlinear, and soft matter physics 2003, 68(2):021113.14. Bastolla U, Frauenkron H, Grassberger P: Phase Diagram of Ran-dom Heteropolymers: Replica Approach and Application ofa New Monte Carlo Algorithm. Journal of Molecular Liquids 2000,84:111-129.15. Dorigo M, Gambardella LM: Ant Colony System: A CooperativeLearning Approach to the Traveling Salesman Problem.IEEE Transactions on Evolutionary Computation 1997, 1:53-66.16. Klau GW, Lesh N, Marks J, Mitzenmacher M: Human-guided tabusearch. In Eighteenth national conference on Artificial intelligence MenloPark, CA, USA: American Association for Artificial Intelligence;2002:41-47. 17. Gront D, Kolinski A, Skolnick J: Comparison of three MonteCarlo conformational search strategies for a proteinlikehomopolymer model: Folding thermodynamics and identifi-cation of low-energy structures. The Journal of Chemical Physics2000, 113(12):5065-5071.18. Hilhorst HJ, Deutch JM: Analysis of Monte Carlo results on thekinetics of lattice polymer chains with excluded volume. TheJournal of Chemical Physics 1975, 63(12):5153-5161.19. Kremer K, Binder K: Monte Carlo simulation of lattice modelsfor macromolecules. Computer Physics Reports 1988, 7(6):259-310.20. Ramakrishnan R, Ramachandran B, Pekny JF: A dynamic MonteCarlo algorithm for exploration of dense conformationalspaces in heteropolymers. The Journal of Chemical Physics 1997,106(6):2418-2425.21. Hansmann UHE: Parallel tempering algorithm for conforma-tional studies of biological molecules. Chemical Physics Letters1997, 281:140-150.22. Irbäck A, Sandelin E: Monte Carlo study of the phase structureof compact polymer chains. The Journal of Chemical Physics 1999,110(24):12256-12262.23. Irbäck A: Dynamic Parameter Algorithms for Protein Fold-ing. In Monte Carlo Approach to Biopolymers and Protein Folding Editedby: Grassberger P, Barkema G, Nadler W. World Scientific, Singa-pore; 1998:98-109. 24. Mitsutake A, Sugita Y, Okamoto Y: Generalized-ensemble algo-rithms for molecular simulations of biopolymers. Peptide Sci-Additional file 1Supplemental material. This file contains tables listing the biologically motivated benchmark sets and the problems instances with a provably unique ground-state conformation. Additionally, results of simulations are reported for the rate of energy evaluations (per CPU second) achieved by our implementation.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-8-342-S1.pdf]Page 19 of 20(page number not for citation purposes)1998:61-62. ence 2001, 60(2):96-123.Publish with BioMed Central and every scientist can read your work free of charge"BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime."Sir Paul Nurse, Cancer Research UKYour research papers will be:available free of charge to the entire biomedical communitypeer reviewed and published immediately upon acceptancecited in PubMed and archived on PubMed Central BMC Bioinformatics 2007, 8:342 http://www.biomedcentral.com/1471-2105/8/34225. Geyer C: Markov chain Monte Carlo maximum likelihood.Computing Science and Statistics: Proceedings of the 23rd Symposium onthe Interface 1991.26. Hukushima K, Nemoto K: Exchange Monte Carlo Method andApplication to Spin Glass Simulations. Journal of the Physical Soci-ety of Japan 1996, 65:1604-1608.27. Iba Y: Review of Extended Ensemble Algorithms. Proceedingsof the Institute of Statistical Mathematics 1993, 41:65.28. Kimura K, Taki K: Time-homogeneous parallel annealing algo-rithm. Proceedings of the 13th IMACS World Congress on Computationand Applied Mathematics (IMACS'91) 1991, 2:827-828.29. Hukushima K, Takayama H, Yoshino H: Exchange Monte CarloDynamics in the SK Model. Journal of the Physical Society of Japan1998, 67:12-15.30. Lin CY, Hu CK, Hansmann UH: Parallel tempering simulationsof HP-36. Proteins 2003, 52(3):436-445.31. Sugita Y, Kitao A, Okamoto Y: Multidimensional replica-exchange method for free-energy calculations. The Journal ofChemical Physics 2000, 113(15):6042-6051.32. Sugita Y, Okamoto Y: Replica-exchange molecular dynamicsmethod for protein folding. Chemical Physics Letters 1999, 314(1–2):141-151.33. Sugita Y, Okamoto Y: Free-Energy Calculations in Protein Fold-ing by Generalized-Ensemble Algorithms. Lecture Notes in Com-putational Science and Engineering 2001.34. Sugita Y, Okamoto Y: Replica-exchange multicanonical algo-rithm and multicanonical replica-exchange method for sim-ulating systems with rough energy landscape. Chemical PhysicalLetters 2000, 329(3–4):261-270.35. Dill KA: Theory for the folding and stability of globular pro-teins. Biochemistry 1985, 24(6):1501-1509.36. Lau KF, Dill KAD: A lattice statistical mechanics model of theconformational and sequence spaces of proteins. Macromole-cules 1989, 22(10):3986-3997.37. Yue K, Dill K: Forces of Tertiary Structural Organization inGlobular Proteins. Proceedings of the National Academy of Sciencesof the United States of America 1995, 92:146-150.38. Kolinski A, Skolnick J: Reduced models of proteins and theirapplications. Polymer 2004, 45(2):511-524.39. Dill KA, Bromberg S: Molecular Driving Forces New York and London:Garland Science; 2003. 40. Verdier PH, Stockmayer WH: Monte Carlo Calculations on theDynamics of Polymers in Dilute Solution. The Journal of Chemi-cal Physics 1962, 36:227-235.41. Gurler MT, Crabb CC, Dahlin DM, Kovac J: Effect of bead move-ment rules on the relaxation of cubic lattice models of poly-mer chains. Macromolecules 1983, 16(3):398-403.42. Iba Y: Extended Ensemble Monte Carlo. International Journal ofModern Physics C 2001, 12:623.43. Marinari E, Parisi G: Simulated tempering: a new Monte Carloscheme. Europhysics Letters 1992, 19:451-458.44. Swendsen R, Wang J: Replica Monte Carlo simulation of spinglasses. Physical Review Letters 1986, 57:2607-2609.45. Bastolla U, Frauenkron H, Gerstner E, Grassberger P, Nadler W:Testing a new Monte Carlo algorithm for protein folding.Proteins 1998, 32(1):52-66.46. Beutler TC, Dill KA: A fast conformational search strategy forfinding low energy structures of model proteins. Protein Sci-ence 1996, 5(10):2037-2043.47. Chikenji G, Kikuchi M, Iba Y: Multi-Self-Overlap Ensemble forProtein Folding: Ground State Search and Thermodynam-ics. Physical Review Letters 1999, 83(9):1886-1889.48. Krasnogor N, Hart WE, Smith J, Pelta DA: Protein Structure Pre-diction With Evolutionary Algorithms. In Proceedings of theGenetic and Evolutionary Computation Conference Volume 2. Edited by:Banzhaf W, Daida J, Eiben AE, Garzon MH, Honavar V, Jakiela M,Smith RE. Morgan Kaufmann; 1999:1596-1601. 49. Dill K, Fiebig K, Chan H: Cooperativity in Protein-FoldingKinetics. Proceedings of the National Academy of Sciences of the UnitedStates of America 1993, 90(5):1942-1946.50. Sayle R, Milner-White EJ: RASMOL – Biomolecular Graphics forAll. Trends in biochemical sciences 1995, 20(9):374-376.51. Aichholzer O, Bremner D, Demaine ED, Meijer H, Sacristan V, SossM: Long proteins with unique optimal foldings in the H-P52. Gupta A, Manuch J, Stacho L: Structure-Approximating InverseProtein Folding Problem in the 2D HP Model. Journal of Com-putational Biology 2005, 12(10):1328-1345.53. Hutter F, Hoos HH, Stützle T: Automatic Algorithm Configura-tion based on Local Search. Proceedings of the Twenty-Second Con-ference on Artifical Intelligence (AAAI '07) 2007:1152-1157.54. Parkes A, Walser J: Tuning Local Search for Satisfiability Test-ing. In Proceedings of the Application of Artifical Intelligence ConferenceMIT Press; 1996:356-362. 55. A replica exchange Monte Carlo algorithm for protein fold-ing in the HP model: Project website [http://www.cs.ubc.ca/labs/beta/Projects/REMC-HPPFP]yours — you keep the copyrightSubmit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.aspBioMedcentralPage 20 of 20(page number not for citation purposes)model. Computational Geometry 2003, 25(1–2):139-159.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Faculty Research and Publications /
- A replica exchange Monte Carlo algorithm for protein...
Open Collections
UBC Faculty Research and Publications
A replica exchange Monte Carlo algorithm for protein folding in the HP model Thachuk, Chris; Shmygelska, Alena; Hoos, Holger H Sep 17, 2007
pdf
Page Metadata
Item Metadata
Title | A replica exchange Monte Carlo algorithm for protein folding in the HP model |
Creator |
Thachuk, Chris Shmygelska, Alena Hoos, Holger H |
Publisher | BioMed Central |
Date Issued | 2007-09-17 |
Description | Background: The ab initio protein folding problem consists of predicting protein tertiary structure from a given amino acid sequence by minimizing an energy function; it is one of the most important and challenging problems in biochemistry, molecular biology and biophysics. The ab initio protein folding problem is computationally challenging and has been shown to be -hard even when conformations are restricted to a lattice. In this work, we implement and evaluate the replica exchange Monte Carlo (REMC) method, which has already been applied very successfully to more complex protein models and other optimization problems with complex energy landscapes, in combination with the highly effective pull move neighbourhood in two widely studied Hydrophobic Polar (HP) lattice models. Results We demonstrate that REMC is highly effective for solving instances of the square (2D) and cubic (3D) HP protein folding problem. When using the pull move neighbourhood, REMC outperforms current state-of-the-art algorithms for most benchmark instances. Additionally, we show that this new algorithm provides a larger ensemble of ground-state structures than the existing state-of-the-art methods. Furthermore, it scales well with sequence length, and it finds significantly better conformations on long biological sequences and sequences with a provably unique ground-state structure, which is believed to be a characteristic of real proteins. We also present evidence that our REMC algorithm can fold sequences which exhibit significant interaction between termini in the hydrophobic core relatively easily. Conclusion We demonstrate that REMC utilizing the pull move neighbourhood significantly outperforms current state-of-the-art methods for protein structure prediction in the HP model on 2D and 3D lattices. This is particularly noteworthy, since so far, the state-of-the-art methods for 2D and 3D HP protein folding – in particular, the pruned-enriched Rosenbluth method (PERM) and, to some extent, Ant Colony Optimisation (ACO) – were based on chain growth mechanisms. To the best of our knowledge, this is the first application of REMC to HP protein folding on the cubic lattice, and the first extension of the pull move neighbourhood to a 3D lattice. |
Genre |
Article |
Type |
Text |
Language | eng |
Date Available | 2015-12-22 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution 4.0 International (CC BY 4.0) |
DOI | 10.14288/1.0221561 |
URI | http://hdl.handle.net/2429/56017 |
Affiliation |
Computer Science, Department of Science, Faculty of Non UBC |
Citation | BMC Bioinformatics. 2007 Sep 17;8(1):342 |
Publisher DOI | 10.1186/1471-2105-8-342 |
Peer Review Status | Reviewed |
Scholarly Level | Faculty |
Copyright Holder | Thachuk et al. |
Rights URI | http://creativecommons.org/licenses/by/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 52383-12859_2007_Article_1714.pdf [ 1.6MB ]
- Metadata
- JSON: 52383-1.0221561.json
- JSON-LD: 52383-1.0221561-ld.json
- RDF/XML (Pretty): 52383-1.0221561-rdf.xml
- RDF/JSON: 52383-1.0221561-rdf.json
- Turtle: 52383-1.0221561-turtle.txt
- N-Triples: 52383-1.0221561-rdf-ntriples.txt
- Original Record: 52383-1.0221561-source.json
- Full Text
- 52383-1.0221561-fulltext.txt
- Citation
- 52383-1.0221561.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.52383.1-0221561/manifest