UBC Theses and Dissertations

Performance modelling and automated algorithm design for NP-hard problems (Xu, Lin, 2014)

Full Text

Performance Modelling and Automated Algorithm Design for NP-hard Problems

by

Lin Xu

BSc. Chemistry, Nanjing University, 1996
MSc. Chemistry, Nanjing University, 1999
MSc. Computer Science, University of Nebraska-Lincoln, 2003

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

November 2014

© Lin Xu, 2014

Abstract

In practical applications, some important classes of problems are NP-complete. Although no worst-case polynomial-time algorithm exists for solving them, state-of-the-art algorithms can solve very large problem instances quickly, and algorithm performance varies significantly across instances. In addition, such algorithms are rather complex and have largely resisted theoretical average-case analysis. Empirical studies are often the only practical means for understanding algorithms' behavior and for comparing their performance.

My thesis focuses on two types of research questions. On the science side, the thesis seeks a better understanding of the relations among problem instances, algorithm performance, and algorithm design. I propose many instance features/characteristics based on instance formulation, instance graph representations, as well as progress statistics from running some solvers. With such informative features, I show that solvers' runtimes can be predicted by predictive performance models with high accuracy. Perhaps more surprisingly, I demonstrate that the solutions of NP-complete decision problems (e.g., whether a given propositional satisfiability problem instance is satisfiable) can also be predicted with high accuracy.

On the engineering side, I propose three new automated techniques for achieving state-of-the-art performance in solving NP-complete problems. In particular, I construct portfolio-based algorithm selectors that outperform any single solver on heterogeneous benchmarks.
By adopting automated algorithm configuration, our highly parameterized local search solver, SATenstein-LS, achieves state-of-the-art performance across many different types of SAT benchmarks. Finally, I show that portfolio-based algorithm selection and automated algorithm configuration can be combined into an automated portfolio construction procedure. It requires significantly less domain knowledge, and achieves similar or better performance than portfolio-based selectors built from known high-performance candidate solvers.

The experimental results on many solvers and benchmarks demonstrate that the proposed prediction methods achieve high accuracy both for predicting algorithm performance and for predicting solutions, while our automatically constructed solvers are state of the art for solving the propositional satisfiability problem (SAT) and the mixed integer programming problem (MIP). Overall, my research has resulted in more than 8 publications, including the 2010 IJCAI/JAIR best paper award. The portfolio-based algorithm selector SATzilla won 17 medals in the international SAT solver competitions from 2007 to 2012.

Preface

I am indebted to many outstanding co-authors for their contributions to the work described in this thesis. My two supervisors, Kevin Leyton-Brown and Holger Hoos, were involved in all of my work and contributed to its design, discussion, and writing.

The development of SATzilla had a major impact on my PhD thesis. In collaboration with Frank Hutter, Holger Hoos, and Kevin Leyton-Brown, I designed and built all versions of SATzilla, which won 17 medals in the international SAT solver competitions from 2007 to 2012.
More details of this work are described in Chapter 7 (published in [210, 211, 216]) and Chapter 8 (published in [215]).

Collaborating with Holger Hoos and Kevin Leyton-Brown, I then designed the first version of Hydra, which automatically configures algorithms for portfolio-based selection, building on the work of SATzilla (Chapter 10.2, published in [212]). Later, with Frank Hutter joining the effort, we developed an improved version of Hydra and applied it to the domain of MIP (Chapter 10.3, published in [213]).

Both SATzilla and Hydra were based on our own predictive performance models. Runtime prediction is described in Chapter 5 (published in [208]); solution prediction is described in Chapter 4 (published in [214]). I also collaborated with Frank Hutter to compare different prediction techniques in Chapter 6 (published in [101]).

The work on automatically building high-performance algorithms from components (Chapter 9, published in [115]) is based on the SATenstein project, which was initiated by Ashiqur KhudaBukhsh for his M.S. thesis. I contributed to its supervision, design, implementation, and writing.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 Science Side Research
  1.2 Engineering Side Research
  1.3 Overview of Contributions
2 Related Work
  2.1 Empirical Hardness Models
  2.2 Algorithm Portfolios and the Algorithm Selection Problem
  2.3 Automated Construction of Algorithms
  2.4 Automated Algorithm Configuration Tools
  2.5 Automatically Configuring Algorithms for Portfolio-Based Selection
3 Domains of Interest
  3.1 Propositional Satisfiability (SAT)
    3.1.1 Tree Search for SAT
    3.1.2 Local Search for SAT
    3.1.3 SAT Features
    3.1.4 SAT Benchmarks
  3.2 Mixed Integer Programming (MIP)
    3.2.1 IBM ILOG CPLEX
    3.2.2 MIP Features
    3.2.3 MIP Benchmarks
  3.3 Traveling Salesperson (TSP)
    3.3.1 TSP Solvers
    3.3.2 TSP Features
    3.3.3 TSP Benchmarks
4 Solution Prediction for SAT
  4.1 Uniform Random 3-SAT and Phase Transition
  4.2 Experimental Setup
  4.3 Experimental Results
  4.4 Conclusion
5 Runtime Prediction with Hierarchical Hardness Models
  5.1 Empirical Hardness Models
    5.1.1 Overview of Linear Basis-function Ridge Regression
    5.1.2 Hierarchical Hardness Models
  5.2 Experimental Setup
  5.3 Experimental Results
    5.3.1 Performance of Conditional and Oracular Models
    5.3.2 Performance of Classification
    5.3.3 Performance of Hierarchical Models
  5.4 Conclusions
6 Performance Prediction with Empirical Performance Models
  6.1 Methods Used in the Literature
    6.1.1 Ridge Regression
    6.1.2 Neural Networks
    6.1.3 Gaussian Process Regression
    6.1.4 Regression Trees
  6.2 New Modeling Techniques for EPMs
    6.2.1 Scaling to Large Amounts of Data with Approximate Gaussian Processes
    6.2.2 Random Forest Models
  6.3 Experimental Setup
  6.4 Experimental Results
  6.5 Conclusions
7 SATzilla: Portfolio-based Algorithm Selection for SAT
  7.1 Procedure of Building Portfolio-Based Algorithm Selection
  7.2 Algorithm Selection Core: Predictive Models
    7.2.1 Accounting for Censored Data
    7.2.2 Predicting Performance Score Instead of Runtime
    7.2.3 More General Hierarchical Performance Models
  7.3 Portfolio Construction
    7.3.1 Selecting Instances
    7.3.2 Selecting Solvers
    7.3.3 Choosing Features
    7.3.4 Computing Features and Runtimes
    7.3.5 Identifying Pre-solvers
    7.3.6 Identifying the Backup Solver
    7.3.7 Learning Empirical Performance Models
    7.3.8 Solver Subset Selection
    7.3.9 Different SATzilla Versions
  7.4 Performance Analysis of SATzilla
    7.4.1 Random Category
    7.4.2 Crafted Category
    7.4.3 Industrial Category
    7.4.4 ALL
  7.5 Further Improvements over the Years
    7.5.1 SATzilla09 for Industrial
    7.5.2 SATzilla2012 with New Algorithm Selector
  7.6 Conclusions
8 Evaluating Component Solver Contributions to Portfolio-Based Algorithm Selectors
  8.1 Measuring the Value of a Solver
  8.2 Experimental Setup
  8.3 Experimental Results
  8.4 Conclusions
9 Automatically Building High-performance Algorithms from Components
  9.1 SATenstein-LS
    9.1.1 Design
    9.1.2 Implementation and Validation
  9.2 Experimental Setup
    9.2.1 Benchmarks
    9.2.2 Tuning Scenario and PAR
    9.2.3 Solvers Used for Performance Comparison
    9.2.4 Execution Environment
  9.3 Performance Results
    9.3.1 Comparison with Challengers
    9.3.2 Comparison with Automatically Configured Versions of Challengers
    9.3.3 Comparison with Complete Solvers
    9.3.4 Configurations Found
  9.4 Quantitative Comparison of Algorithm Configurations
    9.4.1 Concept DAGs
    9.4.2 Comparison of SATenstein-LS Configurations
    9.4.3 Comparison to Configured Challengers
  9.5 Conclusions
10 Hydra: Automatic Configuration of Algorithms for Portfolio-Based Selection
  10.1 Hydra
  10.2 Hydra for SAT
    10.2.1 Experimental Setup
    10.2.2 Experimental Results
  10.3 Hydra for MIP
    10.3.1 Experimental Setup
    10.3.2 Experimental Results
  10.4 Conclusions
11 Conclusion and Future Work
  11.1 Statistical Models of Instance Hardness and Algorithm Performance
  11.2 Portfolio-based Algorithm Selection
  11.3 Automatically Building High-performance Algorithms from Components
  11.4 Automatically Configuring Algorithms for Portfolio-Based Selection
  11.5 Future Research Directions
Bibliography

List of Tables

Table 4.1  The performance of decision forests with 61 features on our 21 primary instance sets. We report median classification accuracy over 25 replicates with different random splits of training and test data, as well as the fraction of false positive and false negative predictions.

Table 4.2  The mean of median classification accuracy with up to 10 features selected by forward selection. The stepwise improvement for a feature fi at forward selection step k is the improvement when we add fi to the existing k-1 features. Each median classification accuracy is based on the results of 25 runs of classification with different random splits of training and test data.

Table 5.1  Accuracy of hardness models for different solvers and instance distributions.

Table 5.2  The five most important features (listed from most to least important) for classification, as chosen by backward selection.

Table 5.3  Comparison of oracular, unconditional and hierarchical hardness models. The second number of each entry is the ratio of the model's RMSE to the oracular model's RMSE. (*For SW-GCP, even the oracular model exhibits a large runtime prediction error.)

Table 6.1  Overview of our models.

Table 6.2  Quantitative comparison of models for runtime predictions on previously unseen instances. We report 10-fold cross-validation performance.
Lower RMSE values are better (0 is optimal). Note the very large RMSE values for ridge regression on some data sets (we use scientific notation, denoting "×10^x" as "Ex"); these large errors are due to extremely small/large predictions for a few data points. Boldface indicates performance not statistically significantly different from the best method in each row.

Table 6.3  Quantitative comparison of models for runtime predictions on unseen instances. We report 10-fold cross-validation performance. Higher rank correlations are better (1 is optimal); log-likelihoods are only defined for models that yield a predictive distribution (here: PP and RF), and higher values are better. Boldface indicates results not statistically significantly different from the best.

Table 7.1  Instances from before 2007 and from 2007, randomly split into training (T), validation (V) and test (E) data sets. These sets include instances for all categories: Random, Crafted and Industrial.

Table 7.2  Data sets used in our experiments. Note that all data sets use identical test data, but different training data.

Table 7.3  The seven solvers in SATzilla07; we refer to this set of solvers as S.

Table 7.4  Eight complete solvers from the 2007 SAT Competition.

Table 7.5  Four local search solvers from the 2007 SAT Competition.

Table 7.6  Solver sets used in our second series of experiments.

Table 7.7  The different SATzilla versions evaluated in our second set of experiments.

Table 7.8  Pre-solver candidates for our four data sets. These candidates were automatically chosen based on the scores on validation data achieved by running the respective algorithms for a maximum of 10 CPU seconds.
Table 7.9  SATzilla's configurations for the Random category; cutoff times for pre-solvers are specified in CPU seconds.

Table 7.10  The performance of SATzilla compared to the best solvers on Random. The cutoff time was 1200 CPU seconds; SATzilla07∗(S++,D+) was trained on ALL. Scores were computed based on 20 reference solvers: the 19 solvers from Tables 7.3, 7.4, and 7.5, as well as one version of SATzilla. To compute the score for each non-SATzilla solver, the SATzilla version used as a member of the set of reference solvers was SATzilla07+(S++,D+r). Since we did not include SATzilla versions other than SATzilla07+(S++,D+r) in the set of reference solvers, scores for these solvers are incomparable to the other scores given here, and therefore we do not report them. Instead, for each SATzilla solver, we indicate in parentheses its performance score as a percentage of the highest score achieved by a non-portfolio solver, given a reference set in which the appropriate SATzilla solver took the place of SATzilla07+(S++,D+r).

Table 7.11  The solvers selected by SATzilla07+(S++,D+r) for the Random category. Note that the column "Selected [%]" shows the percentage of instances remaining after pre-solving for which the algorithm was selected, and this sums to 100%. Cutoff times for pre-solvers are specified in CPU seconds.

Table 7.12  SATzilla's configurations for the Crafted category.

Table 7.13  The performance of SATzilla compared to the best solvers on Crafted. Scores for non-portfolio solvers were computed using a reference set in which the only SATzilla solver was SATzilla07+(S++,D+h). Cutoff time: 1200 CPU seconds; SATzilla07∗(S++,D+) was trained on ALL.

Table 7.14  The solvers selected by SATzilla07+(S++,D+h) for the Crafted category.
Table 7.15  SATzilla's configuration for the Industrial category.

Table 7.16  The performance of SATzilla compared to the best solvers on Industrial. Scores for non-portfolio solvers were computed using a reference set in which the only SATzilla solver was SATzilla07+(S++,D+i). Cutoff time: 1200 CPU seconds; SATzilla07∗(S++,D+) was trained on ALL.

Table 7.17  The solvers selected by SATzilla07+(S++,D+i) for the Industrial category.

Table 7.18  SATzilla's configurations for the ALL category.

Table 7.19  The performance of SATzilla compared to the best solvers on ALL. Scores for non-portfolio solvers were computed using a reference set in which the only SATzilla solver was SATzilla07∗(S++,D+). Cutoff time: 1200 CPU seconds.

Table 7.20  The solvers selected by SATzilla07∗(S++,D+) for the ALL category.

Table 7.21  Confusion matrix for the 6-way classifier on data set ALL.

Table 8.1  Comparison of SATzilla11 to the VBS, an Oracle over its component solvers, SATzilla09, the 2011 SAT competition winners, and the best single SATzilla11 component solver for each category. We counted timed-out runs as 5000 CPU seconds (the cutoff).

Table 8.2  Performance of SATzilla11 component solvers, disregarding instances that could not be solved by any component solver. We counted timed-out runs as 5000 CPU seconds (the cutoff). Average correlation for s is the mean of Spearman correlation coefficients between s and all other solvers. Marginal contribution for s is negative if dropping s improved test set performance. (Usually, SATzilla's solver subset selection avoids such solvers, but they can slip through when the training set is too small.)
SATzilla11(Application) ran its backup solver Glucose2 for 10.3% of the instances (and thereby solved 8.7%). SATzilla11 chose only one presolver for all folds of Random and Application; for Crafted, it chose Sattime as the first presolver in 2 folds, and Sol as the second presolver in 1 of these; for the remaining 8 folds, it did not select presolvers.

Table 9.1  SATenstein-LS components.

Table 9.2  Design choices for selectFromPromisingList().

Table 9.3  List of heuristics chosen by the parameter heuristic, and dependent parameters.

Table 9.4  Categorical parameters of SATenstein-LS. Unless otherwise mentioned, multiple "active when" parameters are combined together using AND.

Table 9.5  Integer parameters of SATenstein-LS and the values considered during ParamILS tuning. Multiple "active when" parameters are combined together using AND. Existing defaults are highlighted in bold. For parameters first introduced in SATenstein-LS, default values are underlined.

Table 9.6  Continuous parameters of SATenstein-LS and the values considered during ParamILS tuning. Unless otherwise mentioned, multiple "active when" parameters are combined together using AND. Existing defaults are highlighted in bold. For parameters first introduced in SATenstein-LS, default values are underlined.

Table 9.7  Our eleven challenger algorithms.

Table 9.8  Complete solvers we compared against.

Table 9.9  Performance of SATenstein-LS and the 11 challengers. Every algorithm was run 25 times with a cutoff of 600 CPU seconds per run.
Each cell 〈i, j〉 summarizes the test-set performance of algorithm i on distribution j as a/b/c, where a (top) is the PAR10 score; b (middle) is the median of the median runtimes (where the outer median is taken over the instances and the inner median over the runs); c (bottom) is the percentage of instances solved (median runtime < cutoff). The best-scoring algorithm(s) in each column are indicated in bold, and the best-scoring challenger(s) are underlined.

Table 9.10  Percentage of instances on which SATenstein-LS achieved better (equal) median runtime than each of the 11 challengers. Medians were taken over 25 runs on each instance, with a cutoff time of 600 CPU seconds per run.

Table 9.11  Performance summary of the automatically configured versions of 8 challengers (three challengers have no parameters). Every algorithm was run 25 times on each problem instance with a cutoff of 600 CPU seconds per run. Each cell 〈i, j〉 summarizes the test-set performance of algorithm i on distribution j as a/b/c, where a (top) is the penalized average runtime; b (middle) is the median of the median runtimes over all instances (not defined if more than half of the median runs failed to find a solution within the cutoff time); c (bottom) is the percentage of instances solved (i.e., having median runtime < cutoff). The best-scoring algorithm(s) in each column are indicated in bold.

Table 9.12  Performance of SATenstein-LS solvers, the best challengers with default configurations, and the best automatically configured challengers. Every algorithm was run 25 times on each instance with a cutoff of 600 CPU seconds per run.
Each table entry 〈i, j〉 indicates the test-set performance of algorithm i on distribution j as a/b/c, where a (top) is the penalized average runtime; b (middle) is the median of the median runtimes over all instances; c (bottom) is the percentage of instances solved (i.e., those with median runtime < cutoff).

Table 9.13  Performance summary of SATenstein-LS and the complete solvers. Every complete solver was run once (SATenstein-LS was run 25 times) on each instance with a per-run cutoff of 600 CPU seconds. Each cell 〈i, j〉 summarizes the test-set performance of algorithm i on distribution j as a/b/c, where a (top) is the penalized average runtime; b (middle) is the median of the median runtimes over all instances (for SATenstein-LS, this median is not defined if more than half of the median runs failed to find a solution within the cutoff time); c (bottom) is the percentage of instances solved (i.e., having median runtime < cutoff). The best-scoring algorithm(s) in each column are indicated in bold.

Table 9.14  The SATenstein-LS parameter configuration found for each distribution.

Table 10.1  Performance comparison between Hydra, SATenstein-LS, challengers, and portfolios based on 11 (without the 6 SATenstein-LS solvers) and 17 (with the 6 SATenstein-LS solvers) challengers. All results are based on 3 runs per algorithm and instance; an algorithm solves an instance if its median runtime on that instance is below the given cutoff time.

Table 10.2  The percentage of instances for each solver chosen by algorithm selection at each iteration for RAND (left) and INDULIKE (right). Pk and sk are, respectively, the portfolio and algorithm obtained in iteration k.
Table 10.3  Performance (average runtime and PAR in seconds, and percentage solved) of HydraDF,4, HydraDF,1 and HydraLR,1 after 5 iterations.

List of Figures

Figure 3.1  12 groups of SAT features.

Figure 3.2  MIP instance features; for the variable-constraint graph, linear constraint matrix, and objective function features, each feature is computed with respect to three subsets of variables: continuous (C), non-continuous (NC), and all (V).

Figure 3.3  9 groups of TSP features.

Figure 4.1  Left: median runtime of kcnfs07 for each instance set. The solutions of some instances in v = 575 and v = 600 were estimated by running adaptg2wsat09++ for 36000 CPU seconds. Right: CDF of kcnfs07's runtime for v = 500.

Figure 4.2  Classification accuracies achieved on our 21 primary instance sets. The blue box plots are based on 25 replicates of decision forest models, trained and evaluated on different random splits of training and test data. The median prediction accuracies of the decision forest trained on v100(large) are shown as red stars. The median prediction accuracies of a single decision tree trained on v100(large) based on two features are shown as green squares.

Figure 4.3  Classifier confidence vs fraction of instances. Left: v = 200; right: v = 500.

Figure 4.4  Classifier confidence vs instance hardness. Each marker ([x,y]) shows the average runtime of kcnfs07 over a bin of instances with classifier confidence (predicted probability of SAT) between x-0.05 and x. Each marker's intensity corresponds to the amount of data inside the bin. Left: v = 200; right: v = 500.
Figure 4.5  Statistical significance of pairwise differences in classification accuracy for our 21 primary instance sets. Yellow: accuracy on the smaller instance size is significantly higher than on the larger size. Blue: accuracy on the smaller instance size is significantly lower than on the larger size. No dot: the difference is insignificant. Significance level: p = 0.05.

Figure 4.6  Distribution of LPSLACK coeff variation over instances in each of our 21 sets. Left: SAT; right: UNSAT. Top: original value; bottom: value after normalization. The line at y = 0.0047 indicates the decision threshold used in the tree from Figure 4.8.

Figure 4.7  Distribution of POSNEG ratio var mean over instances in each of our 21 instance sets. Left: SAT; right: UNSAT. Top: original value; bottom: value after normalization. The line at y = 0.1650 indicates the decision threshold used in the tree from Figure 4.8.

Figure 4.8  The decision tree trained on v100(large) with only the features LPSLACK coeff variation and POSNEG ratio var mean, and with minparent set to 10000.

Figure 5.1  Graphical model for our mixture-of-experts approach.

Figure 5.2  Prediction accuracy comparison of the oracular model (left, RMSE = 0.247) and the unconditional model (right, RMSE = 0.426). Distribution: QCP; solver: satelite.

Figure 5.3  Actual vs predicted logarithmic runtime using only Msat (left, RMSE = 1.493) and only Munsat (right, RMSE = 0.683), respectively. Distribution: QCP; solver: satelite.

Figure 5.4  Classification accuracy for different data sets.

Figure 5.5  Classification accuracy vs classifier output (top) and fraction of instances within the given set vs classifier output (bottom). Left: rand3-var; right: QCP.
Figure 5.6  Classification accuracy vs classifier output (top) and fraction of the instances within the given set vs classifier output (bottom). Left: rand3-fix; right: SW-GCP.

Figure 5.7  Actual vs. predicted logarithmic runtime for satz on rand3-var. Left: unconditional model (RMSE = 0.387); right: hierarchical model (RMSE = 0.344).

Figure 5.8  Actual vs. predicted logarithmic runtime for satz on rand3-fix. Left: unconditional model (RMSE = 0.420); right: hierarchical model (RMSE = 0.413).

Figure 5.9  Actual vs. predicted logarithmic runtime for satelite on QCP. Left: unconditional model (RMSE = 0.426); right: hierarchical model (RMSE = 0.372).

Figure 5.10  Actual vs. predicted logarithmic runtime for zchaff on SW-GCP. Left: unconditional model (RMSE = 0.993); right: hierarchical model (RMSE = 0.983).

Figure 5.11  Classifier output vs runtime prediction error (left); relationship between classifier output and RMSE (right). Data set: QCP; solver: satelite.

Figure 6.1  Visual comparison of models for runtime predictions on previously unseen test instances. The data sets used in each column are shown at the top. The x-axis of each scatter plot denotes true runtime, and the y-axis 2-fold cross-validated runtime as predicted by the respective model; each dot represents one instance. Predictions above 3000 or below 0.001 are denoted by a blue cross rather than a black dot. Plots for other benchmarks are qualitatively similar.
87Figure 6.2 Prediction quality for varying numbers of training instances.For each model and number of training instances, we plot themean (taken across 10 cross-validation folds) correlation co-efficient (CC) between true and predicted runtimes for newtest instances; larger CC is better, 1 is perfect. Plots for otherbenchmarks are qualitatively similar. . . . . . . . . . . . . . 89xixFigure 7.1 Left: CDFs for SATzilla07+(S++,D+r ) and the best non-portfoliosolvers on Random; right: CDFs for the different versions of SATzillaon Random shown in Table 7.9, where SATzilla07∗(S++,D+)was trained on ALL. All other solvers’ CDFs are below the onesshown here. . . . . . . . . . . . . . . . . . . . . . . . . . . . 110Figure 7.2 Left: CDFs for SATzilla07+(S++,D+h ) and the best non-portfoliosolvers on Crafted; right: CDFs for the different versions of SATzillaon Crafted shown in Table 7.12, where SATzilla07∗(S++,D+)was trained on ALL. All other solvers’ CDFs are below the onesshown here. . . . . . . . . . . . . . . . . . . . . . . . . . . . 113Figure 7.3 Left: CDFs for SATzilla07+(S++,D+i ) and the best non-portfoliosolvers on Industrial; right: CDFs for the different versions ofSATzilla on Industrial shown in Table 7.15, where SATzilla07∗(S++,D+)was trained on ALL. All other solvers’ CDFs (including Eureka’s)are below the ones shown here. . . . . . . . . . . . . . . . . . . 116Figure 7.4 Left: CDF for SATzilla07∗(S++,D+) and the best non-portfoliosolvers on ALL; right: CDFs for different versions of SATzilla onALL shown in Table 7.18. All other solvers’ CDFs are below theones shown here. . . . . . . . . . . . . . . . . . . . . . . . . . 119Figure 8.1 Visualization of results for category Random . . . . . . . . . 130Figure 8.2 Visualization of results for category Crafted . . . . . . . . 132Figure 8.3 Visualization of results for category Application . . . . . 135Figure 9.1 Performance comparison of SATenstein-LS and the bestchallenger. Left: R3SAT; Right: FAC. 
Medians were takenover 25 runs on each instance with a cutoff time of 600 CPUseconds per run. . . . . . . . . . . . . . . . . . . . . . . . . . 155xxFigure 9.2 Performance of SATenstein-LS solvers vs challengers withdefault and optimized configurations. For every benchmarkdistribution D, the base-10 logarithm of the ratio between SATenstein[D]and one challenger (default and optimized) is shown on the y-axis, based on data from Tables 9.9 and 9.11. Top-left: QCP;Top-right: SWGCP; Middle-left: R3SAT; Middle-right: HGEN;Bottom-left: FAC; Bottom-right: CBMC(SE) . . . . . . . . . 157Figure 9.3 Visualization of the transformation costs in the design of 16high-performance solvers (359 configurations) obtained via Isomap.165Figure 9.4 True vs mapped distances in Figure 9.3. The data points cor-respond to the complete set of SATenstein-LS[D] for alldomains and all challengers with their default and domain-specific, optimized configurations. . . . . . . . . . . . . . . 166Figure 9.5 The transformation costs of configuration of individual chal-lengers and selected SATenstein-LS solvers. (a): SAPS(best on HGEN and FACT); (b): SAPS and SATenstein[HGEN,FACT]; (c): G2 (best on SWGCP); (d): G2 and SATenstein[SWGCP];(e): VW (best on CBMC(SE), QCP, and R3FIX); (f): VW andSATenstein[CBMC, QCP, R3FIX]. . . . . . . . . . . 169Figure 10.1 Hydra’s performance progress after each iteration, for BM (left)and INDULIKE (right). Performance is shown in terms ofPAR-10 score; the vertical lines represent the best challenger’sperformance for each data set. . . . . . . . . . . . . . . . . . 182Figure 10.2 Performance comparison between Hydra[D,7] and Hydra[D,1]on the test sets, for BM (left) and INDULIKE (right). Perfor-mance is shown in terms of PAR-10 score. . . . . . . . . . . 183Figure 10.3 Performance per iteration for HydraDF,4, HydraDF,1 and HydraLR,1,evaluated on test data. . . . . . . . . . . . . . . . . . . . . . 
Acknowledgments

I gratefully acknowledge my two symmetrical co-supervisors, Holger Hoos and Kevin Leyton-Brown. They not only accepted me into the Ph.D. program eight years ago, but also made sure that I stayed on the right track. They encouraged me to develop principled ideas and fine-tuned them with me to ensure their good performance in practice. Their passion for science, creativity and high standards deeply influenced me in many ways. In finalizing this thesis, I benefited from their detailed and insightful feedback. Many thanks also to Alan Hu, the third member of my supervisory committee, for making sure I worked on interesting but feasible questions and progressed toward finishing my Ph.D.

While co-authors are acknowledged in a separate section of the front matter, I would like to thank Frank Hutter in particular, for the fruitful collaboration and many insightful discussions. I would like to thank many other members of the Laboratory for Computational Intelligence (LCI) and the Bioinformatics, Empirical and Theoretical Algorithmics (BETA) Laboratory at UBC Computer Science, for many interesting chats about our respective work and for the friendship that added to the joy of my graduate studies.

I would like to thank my family here: my wife, Ping Xiang; my son, Daniel Binbin Xu; and my daughter, Grace Linglin Xu, for their continuous support and for bringing sunshine into my life when all else failed. I would also like to thank my mom and my two sisters back in China for supporting my choice.

Chapter 1

Introduction

In many practical applications, algorithm designers confront computationally hard problems. Examples include graph coloring (see, e.g., Garey and Johnson, 1979), planning and scheduling (see, e.g., Kautz and Selman, 1999), Boolean satisfiability (SAT) (see, e.g., Cook, 1971), the traveling salesperson problem (TSP) (see, e.g., Applegate et al., 2006), software/hardware verification (see, e.g., Biere et al., 1999), protein folding (see, e.g., Fraenkel, 1993), and gene sequencing (see, e.g., Pop et al., 2002).
In complexity theory, such problems belong to the class of NP-complete problems (NPC), the most difficult problems in NP. If one could find a deterministic, polynomial-time algorithm for any NP-complete problem, then one could construct a polynomial-time solution to every other problem in NP. It is widely believed that no worst-case polynomial-time algorithm exists for solving NP-complete problems. The Clay Mathematics Institute has offered a one million US dollar prize for the first correct proof or disproof of whether NP is equivalent to the complexity class P, in which every problem can be solved on a deterministic sequential machine in polynomial time [102].

For NP-complete problems, even the best currently known algorithms have worst-case runtimes that increase exponentially with instance size. Luckily, while these problems may be hard to solve on worst-case inputs, it is often feasible to solve large problem instances that arise in practice. However, state-of-the-art algorithms often exhibit exponential runtime variation across instances from realistic distributions, even when problem size is held constant; conversely, the same instance can take exponentially different amounts of time to solve depending on the algorithm used [13]. There is little theoretical understanding of what causes this variation, and thus it is nontrivial to determine how long a given algorithm will take to solve a given problem instance without incurring the potentially large cost of running the algorithm. This phenomenon suggests that worst-case analysis is not sufficient for studying an algorithm's behavior in practical applications. Instead, empirical studies are often the only practical means for assessing and comparing an algorithm's performance.
Researchers and practitioners seek to identify features/characteristics of instances that explain when instances will be hard for a particular algorithm, to choose the most promising heuristics for designing high-performance algorithms, and to find the most efficient algorithm for an unseen instance drawn from a given instance distribution. Answers to these questions can help one to better understand and solve NP-complete problems.

My PhD work focuses on studying NP-complete problems based on empirical data and machine learning techniques, and on proposing automated methods for improving the effectiveness of solving problem instances from real applications. My work has four major components. For a better understanding of the nature of NP-complete problems, the relations between instance features and algorithm performance were studied for both NP-complete decision and optimization problems using supervised machine learning techniques. Furthermore, the relations between instance features and an instance's satisfiability status were studied for decision problems. Based on the successes of these studies, we set out to improve the state of the art in solving NP-complete problems. The thesis proposes three different approaches. The first is a portfolio-based algorithm selector that combines the strengths of multiple candidate solvers. Here, predictive models are used as the basis for an algorithm portfolio that automatically selects the most promising candidate solver for an unseen instance. Inspired by the success of automated algorithm configuration, which is able to find very good parameter settings for highly parameterized algorithms, the second approach suggests a new algorithm design philosophy: unlike the traditional approach to building heuristic algorithms, the algorithm designer should include as many alternative approaches to solving the same subproblem as seem promising, instead of fixing most of the design choices at development time.
The optimal instantiation of such heuristic algorithms for a given instance benchmark should be produced automatically by algorithm configuration tools. The third approach targets domains where only one highly parameterized algorithm is competitive, and combines portfolio-based algorithm selection and automated algorithm configuration in a novel manner. By changing the performance measure used in algorithm configuration, this approach automatically discovers algorithm configurations that possess the greatest potential for improving the current algorithm portfolio.

1.1 Science Side Research

Previous empirical studies on NP-complete problems have revealed many intriguing results. For example, problem instances with certain properties can be much harder than others for many algorithms. Mitchell et al. (1992) showed that random 3-SAT instances with clauses-to-variables ratios around 4.3 are usually harder than other random 3-SAT instances of the same size. More recent work studied the use of machine learning methods to make instance-specific predictions about solver runtimes. Leyton-Brown et al. (2002, 2009) introduced the use of such models for predicting the runtimes of solvers for NP-complete problems, and Nudelman et al. (2004) showed that, using this approach, surprisingly accurate runtime predictions can be obtained for uniform random 3-SAT. Nudelman et al. also noticed that training models on only SAT or only UNSAT instances allowed much simpler, albeit very dissimilar, models to achieve high accuracy. Since unconditional models, which do not consider SAT/UNSAT status, are able to predict runtimes accurately despite the qualitative differences between the SAT and UNSAT regimes, these models must implicitly predict satisfiability status.

Motivated by this result, we investigated the feasibility of predicting the satisfiability of a previously unseen SAT instance, considering a variety of both structured and unstructured SAT instances.
The empirical results proved rather promising: classification accuracies were always better than 68%, compared to the 50% expected from random guessing. A detailed case study of uniform random 3-SAT at the phase transition revealed that classification accuracies remained roughly constant and far above random guessing, even when using a single decision tree with only two simple features. Furthermore, we investigated the benefit of a reasonably accurate (but imperfect) classifier for runtime prediction. We improved runtime prediction by constructing hierarchical hardness models using a mixture-of-experts approach with fixed ("clamped") experts, that is, with conditional models trained on satisfiable and unsatisfiable instances separately. The classifier's confidence correlated with prediction accuracy, giving useful per-instance evidence about the quality of the runtime prediction. Of course, there are many other regression techniques that could be used for runtime/performance prediction. We performed a thorough study of different machine learning techniques on several NP-complete problems, namely Boolean satisfiability (SAT), mixed integer programming (MIP), and the traveling salesperson problem (TSP).

1.2 Engineering Side Research

The wide applicability of NP-complete problems has motivated the development of high-performance algorithms, and significant research and engineering efforts have led to sophisticated algorithms. In one prominent and ongoing example, the SAT community holds an annual SAT Competition/Race/Challenge (http://www.satcompetition.org/). This competition is intended to provide an objective assessment of SAT algorithms, and thus to track the state of the art in SAT solving, to assess and promote new solvers, and to identify new challenging benchmarks. Solvers are judged based on their empirical performance, with both speed and robustness taken into account. One observation from the competitions is that algorithm performance depends strongly on the type of instances.
One algorithm can be much better than others at solving one class of instances, but dramatically worse on instances from other classes (see, e.g., Le Berre et al., 2012). One possible explanation is that many practical problem instances possess special structure; solvers can achieve much better performance if they exploit such structure.

One way in which evaluations such as the SAT competition are useful is that they allow practitioners to determine which algorithm performs best for instances relevant to their problem domain. However, choosing a single algorithm on the basis of competition ranks is not always a good approach. Such a "winner-take-all" approach typically results in the neglect of many algorithms that are not competitive on average but that nevertheless offer very good performance on particular instances. Thus, practitioners with hard problems to solve confront a potentially difficult "algorithm selection problem" [170]: which algorithm(s) should be run in order to minimize a performance objective, such as expected runtime? The ideal solution to the algorithm selection problem would be to consult an oracle that knows the amount of time each algorithm will take to solve a given problem instance, and then to select the algorithm with the best performance. Unfortunately, computationally cheap, perfect oracles of this nature are not available for SAT or any other NP-complete problem. Inspired by the success of runtime prediction, Nudelman et al. (2004) proposed an automated algorithm selection approach based on approximate performance predictors, which can be seen as a heuristic approximation to a perfect oracle. An initial trial of this approach demonstrated promising results: in the 2003 SAT Competition, the first version of SATzilla [155] placed 2nd in two categories and 3rd in another. Note that, due to the nature of NP-completeness, one cannot expect such an approximation to be perfect without solving the instances.
Therefore, we introduced several new techniques to improve the robustness of SATzilla, such as pre-solving, backup solvers, and feature cost prediction. The competition results demonstrate that my portfolio-based algorithm selectors are capable of achieving state-of-the-art performance: they won many medals in the 2007 and 2009 SAT Competitions, as well as the 2012 SAT Challenge (with a new selection technique based on cost-sensitive classification).

We also showed that the general framework of SATzilla is compatible with other performance predictors, and that it performs very well in other problem domains, such as MIP. Given that portfolio-based algorithm selectors often achieve state-of-the-art performance, the community could benefit from rethinking how to value individual solvers. Developing a solver that helps to improve state-of-the-art performance should be more valuable than designing a solver that is slightly better on average. We developed techniques for analyzing the extent to which the performance of the state-of-the-art (SOTA) portfolio depends on each of its component solvers.

High-performance heuristic algorithms are able to solve very large problem instances from practical applications. However, designing them is a time-consuming task, even for domain experts. Traditionally, heuristic algorithms are designed in an iterative, manual process in which most design choices are fixed at development time, usually based on preliminary experimentation, leaving only a small number of parameters exposed to the user. Although this approach has proven effective in the past, it requires a significant amount of effort on the part of domain experts. Recently, a new line of research has attempted to automate parts of the algorithm design process using cheap computing power, and has achieved many successes (see, e.g., Hoos, 2008).
Inspired by such work, our team proposed a new approach to heuristic algorithm design in which the designer fixes as few design choices as possible, instead exposing all promising design choices as parameters. This approach removes from the algorithm designer the burden of making early design decisions without knowing how different algorithm components will interact on problem distributions of interest. Instead, the designer is encouraged to consider many alternative designs from known solvers, in addition to novel mechanisms. Of course, such flexible, highly parameterized algorithms must be instantiated appropriately to achieve good performance on a given instance set. With the availability of advanced automated parameter configurators and cheap computational resources, finding a good parameter configuration in a huge parameter space becomes practical (see, e.g., [22, 33, 94]).

Although this general idea is not tailored to a particular domain, in this work we applied it to the challenge of constructing stochastic local search (SLS) algorithms for the propositional satisfiability problem (SAT). SLS-based solvers have exhibited consistently dominant performance on several families of SAT instances; they also play an important role in state-of-the-art portfolio-based automated algorithm selection methods for SAT [210]. Our team implemented a highly parameterized SLS algorithm, termed SATenstein-LS, by drawing mechanisms from two dozen existing high-performance SLS SAT solvers and also incorporating many novel strategies. Similar to the "perfect human being" created by Victor Frankenstein using scavenged human body parts in the classic novel Frankenstein, here one scavenges components from existing high-performance algorithms for a given problem and combines them to build new high-performance algorithms.
Unlike Frankenstein's creation, our algorithm is built using an automated construction process that enables one to optimize performance with minimal human effort. The design space contains a total of 2.01 × 10^14 possible instantiations, and includes most existing state-of-the-art SLS SAT solvers proposed in the literature. With the aid of automated algorithm configuration tools, we demonstrate experimentally that our new, automatically constructed solvers dramatically outperform the best SLS-based SAT solvers currently available on six well-known SAT instance distributions, ranging from hard random 3-SAT instances to SAT-encoded factoring and software verification problems. This makes it interesting to understand the similarities and differences between our new configurations and existing SLS algorithms. We propose an automatic, quantitative approach for visualizing the degree of similarity within a set of algorithms. Using this approach, we investigated the similarities among our SATenstein-LS solvers and SLS-based incumbents. This visualization demonstrates that most of our new solvers are very different from existing solvers.

Although portfolio-based algorithm selection and automated algorithm configuration have demonstrated many positive results in practice [22, 33, 94, 210], each has shortcomings. The former approach requires relatively significant domain knowledge, including, in particular, a set of relatively uncorrelated candidate solvers. The latter approach requires no domain knowledge beyond a parameterized algorithm framework, and no human effort to target a new domain; however, it produces only a single algorithm, which is designed to achieve high performance overall, but which may perform badly on many individual instances. This drawback is particularly serious when the instance distribution is heterogeneous.
Once a state-of-the-art portfolio exists for a domain, as SATzilla does for various SAT distributions, the critical question for the algorithm developer is: how should new research aim to improve upon it? One approach is to build new stand-alone algorithms, either by hand or by automatic configuration, with the goal of replacing the portfolio. This approach has the weakness that it reinvents the wheel: the new algorithm must perform well on all the instances for which the portfolio is already effective, and must also make additional progress.

Alternatively, one may attempt to build a new algorithm to complement the portfolio, an idea that has been dubbed "boosting as a metaphor for algorithm design" [128]. Boosting algorithms in machine learning build an ensemble of classifiers by focusing on problems that are handled poorly by the existing ensemble. The proposal is to approach algorithm design analogously, focusing on instances on which the existing portfolio performs poorly. In particular, the suggestion is to use sampling (with replacement) to generate a new benchmark distribution that will be harder for an existing portfolio, and for new algorithms to attempt to minimize average runtime on this benchmark. Indeed, this method was shown to be particularly effective for inducing new, hard distributions. While we agree with the core idea of aiming explicitly to build algorithms that complement a portfolio, we have come to disagree with its concrete realization as described most thoroughly by Leyton-Brown et al. (2009): average performance on a new benchmark distribution is not always an adequate proxy for the extent to which a new algorithm would complement a portfolio. A region of the original distribution that is exceedingly hard for all candidate algorithms can dominate the new distribution, leading to stagnation.

Based on this observation, we introduced Hydra, a new method for automatically designing algorithms to complement a portfolio.
This name was inspired by the Lernaean Hydra, a mythological, multi-headed beast that grew new heads to replace those cut off during its struggle with the Greek hero Hercules. Hydra, given only a highly parameterized algorithm and a set of instance features, automatically generates a set of configurations that form an effective portfolio. It thus does not require any domain knowledge in the form of existing algorithms. Hydra is an anytime procedure: it begins by identifying a single configuration with the best overall performance, and then iteratively adds algorithms to the portfolio. Hydra is also able to drop previously added algorithms when they are no longer helpful. Hydra offers the greatest potential benefit in domains where only one highly parameterized algorithm is competitive (e.g., certain distributions of mixed-integer programming problems), and the least potential benefit in domains where a wide variety of strong, uncorrelated solvers already exists. We performed case studies on both SAT and MIP, in which Hydra consistently achieved significant improvements over the best existing individual algorithms, designed both by experts and by automatic configuration methods. More importantly, Hydra always at least roughly matched, and indeed often exceeded, the performance of the best portfolio of such algorithms.

1.3 Overview of Contributions

Overall, my research resulted in major advances in understanding a variety of NP-complete decision and optimization problems, as well as in pushing forward the state of the art in solving them. My major contributions are summarized as follows.

• Features: We extended the feature set proposed by Nudelman et al. (2004) for characterizing the propositional satisfiability problem (SAT). Chapter 3 also introduces new features for other NP-complete problems (TSP and MIP).
These features have proven informative and have been widely used by other research groups [111].

• Predictive models: We demonstrated that simple rules can predict the solubility of uniform random 3-SAT at the phase transition with surprisingly high accuracy (Chapter 4). Extensive empirical results suggest that classification accuracy does not decrease with instance size. Chapter 5 describes how to improve runtime prediction by combining classifiers with conditional hardness models into a hierarchical hardness model using a mixture-of-experts approach. Chapter 6 presents a thorough comparison of existing and new model-building techniques for SAT, MIP, and TSP. We demonstrated that random forests yield substantially better runtime predictions than previous approaches.

• Portfolio-based algorithm selection: We made significant advances in building state-of-the-art portfolio-based algorithm selectors. With the many new techniques introduced in Chapter 7, SATzilla won the 2007 and 2009 SAT Competitions and the 2012 SAT Challenge. Due to the huge success of SATzilla, the paper by Xu et al. (2008) won the 2010 IJCAI-JAIR best paper prize. In addition to achieving state-of-the-art performance, SATzilla is useful for evaluating solver contributions: by omitting a solver from the portfolio, we measured the contribution of that solver by computing SATzilla's performance difference with and without it. Chapter 8 shows that solvers that exploited novel strategies were more valuable than those with the best overall performance. We also demonstrate that a cost-sensitive classification-based algorithm selector achieved the best performance.
In fact, SATzilla2012 won the 2012 SAT Challenge using cost-sensitive decision forests as the algorithm selector.

• Automatically building high-performance algorithms from components: We proposed a new approach to heuristic algorithm design in which the designer fixes as few design choices as possible at development time, instead exposing a huge number of design choices in the form of parameters. Chapter 9 presents a case study on constructing stochastic local search (SLS) algorithms for SAT. Taking components from 25 local search algorithms, we built a highly parameterized local search algorithm, SATenstein-LS, which can be instantiated as 2.01 × 10^14 different solvers. The empirical results show that the automatically constructed SATenstein-LS outperforms existing state-of-the-art solvers with both manually and automatically tuned configurations. In addition, we proposed a new representation for algorithm parameter settings, concept DAGs, and defined a novel similarity metric based on transformation cost. We have shown that visualization based on this similarity measure provides useful insights into algorithm design.

• Automatically configuring algorithms for portfolio-based selection: By combining the strengths of automated algorithm selection and automated algorithm configuration, we proposed a novel technique, Hydra, for iteratively and automatically discovering a set of solvers with complementary strengths. A case study on SAT benchmarks (Section 10.2) showed that Hydra, with a single solver, SATenstein-LS, significantly outperforms state-of-the-art SLS algorithms. Hydra reaches, and often exceeds, the performance of portfolios that use many strong local search solvers as candidate solvers.
By adapting cost-sensitive classification models and modifying the method for selecting candidate configurations, we demonstrated that MIP-Hydra converges faster and achieves strong performance for MIP (Section 10.3).

Chapter 2

Related Work

Over the past decades, considerable research efforts have led to significant progress in understanding and solving NP-complete problems. In this chapter, we review some of the work that is most relevant to this thesis. The remainder of this chapter is structured as follows. Section 2.1 introduces recent advances in empirical hardness models. Section 2.2 summarizes related work on algorithm portfolios and the algorithm selection problem. Section 2.3 discusses automated techniques for constructing high-performance algorithms, which closely relate to my SATenstein work. Section 2.4 gives an overview of techniques for automated algorithm configuration, which play an important role in building SATenstein and Hydra. Finally, we discuss other approaches for automatically configuring algorithms for portfolio-based selection, and compare them with my Hydra approach.

2.1 Empirical Hardness Models

Most heuristic algorithms for solving NP-complete problems are highly complex, and have thus largely resisted theoretical average-case analysis. Instead, empirical studies are often the only practical means for assessing and comparing their performance. One recent approach for understanding the empirical hardness of computationally hard problems was proposed by Leyton-Brown et al. (2002). It used linear basis-function regression to build models that predict the time required for an algorithm to solve a given problem instance [127].
These so-called empirical hardness models can be used to evaluate the factors responsible for an algorithm's performance, or to generate challenging instances for a given algorithm [128]. They can also be leveraged to select among several different algorithms for solving a given problem instance [128, 129, 209], and can be applied in automated algorithm configuration and tuning [92].

At a high level, empirical hardness models represent functional relations between instance characteristics and algorithm performance (e.g., CPU time). Given a set of training data (pairs of instance characteristics and algorithm performance), an empirical hardness model is trained to fit the training data using regression techniques. Later, for a new, unseen problem instance (a test data point), a performance prediction can be made by evaluating the empirical hardness model on the characteristics of the test instance. Instance characteristics are very important for building good models of an algorithm's performance: good instance characteristics should correlate well with algorithm performance and be cheap to compute. Algorithm performance is measured by a function that maps algorithm output to a real value (e.g., an algorithm's runtime [156], a performance score [209], or the solution quality found within a certain budget).

Beyond the previous work conducted in our group [128, 129, 209], there exist a few other approaches for predicting algorithm runtime. Similar models were applied by Brewer (1995) and Huang et al. (2010), although they considered only algorithms with low-order polynomial runtimes. The most closely related work is by Smith-Miles and van Hemert (2011), who employed neural network models to predict the runtime of local search algorithms for the traveling salesperson problem. A different approach for predicting the performance of tree search algorithms rests on predictions of the search tree size [116, 118, 138].
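The high-level workflow described above — extract instance features, fit a regression model mapping features to (log) runtime, then evaluate the model on an unseen instance — can be sketched as follows. This is a minimal illustration only: the feature values and runtimes are invented, and ridge regression with an identity-plus-bias basis is just one simple instantiation of linear basis-function regression.

```python
import numpy as np

# Toy training data (values invented for illustration): each row holds two
# features of one SAT instance (e.g., clauses-to-variables ratio and a graph
# statistic); y holds the log10 runtime of one solver on that instance.
X_train = np.array([[4.0, 0.30], [4.3, 0.55], [4.5, 0.60],
                    [3.8, 0.25], [4.2, 0.50], [4.6, 0.70]])
y_train = np.log10(np.array([3.0, 310.0, 95.0, 0.8, 120.0, 400.0]))

def fit_ridge(X, y, lam=1e-2):
    """Fit a linear basis-function model (features plus bias) by ridge regression."""
    Phi = np.hstack([X, np.ones((X.shape[0], 1))])  # basis expansion: identity + bias
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)            # closed-form ridge solution

def predict(w, X):
    """Predicted log10 runtime for each row of X."""
    Phi = np.hstack([X, np.ones((X.shape[0], 1))])
    return Phi @ w

w = fit_ridge(X_train, y_train)
x_new = np.array([[4.25, 0.52]])                    # a previously unseen instance
print(f"predicted runtime: {10 ** predict(w, x_new)[0]:.1f} CPU seconds")
```

Predicting the logarithm of runtime, as here, is the usual choice when runtimes vary by orders of magnitude across instances.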
The literature on search space analysis has investigated measures that correlate with algorithm runtime. Prominent examples include fitness distance correlation [110], landscape ruggedness [205], and autocorrelation [83]. The typical approach is either to visually inspect the relationship between a measure and runtime (e.g., in a scatter plot), or to compute descriptive statistics, such as the Spearman correlation coefficient between the two.

Empirical hardness models have proven effective in predicting runtime for many algorithms on a number of interesting instance distributions. In a study of combinatorial auction winner determination, a prominent NP-hard optimization problem, empirical hardness models were used to predict CPLEX's runtimes on randomly generated problem instances using 30 characteristics [127]. Nudelman et al. (2004) used empirical hardness models to predict several tree-search algorithms' runtimes on uniform-random 3-SAT instances. One interesting observation from this work is that if instances were restricted to be either only satisfiable or only unsatisfiable, very different models were needed to make accurate runtime predictions. Furthermore, models for each type of instance were simpler and more accurate than models that must handle both types. Empirical hardness models have been applied to the study of local search algorithms as well. Building on the work of Nudelman et al. (2004), Hutter et al. (2006) used empirical hardness models to predict runtime distributions of randomized, incomplete algorithms. They have also been used in model-based algorithm configuration procedures (such as SMAC [99]) to identify promising combinations of algorithm components to evaluate.

2.2 Algorithm Portfolios and the Algorithm Selection Problem

With recent advances in algorithm development, many previously challenging problem instances can be quickly solved by at least some algorithms. However, one algorithm often performs well only on some small classes of instances.
Hence, one possible approach for solving NP-complete problems effectively is to use multiple existing algorithms and find the best way to allocate computational resources to each individual algorithm.

One way of using multiple existing algorithms is to build algorithm portfolios. The term "algorithm portfolio" was introduced by Huberman et al. (1997) to describe the strategy of running k algorithms in parallel, potentially with each algorithm i getting a share of computational resources s_i (i = 1, ..., k). Gomes and Selman (2001) built a portfolio of stochastic algorithms for quasi-group completion and logistics scheduling problems. Low-knowledge algorithm control by Carchrae and Beck (2005) employed a portfolio of anytime algorithms, prioritizing each algorithm according to its performance so far. Dynamic algorithm portfolios by Gagliolo and Schmidhuber (2006) also ran several algorithms at once, where an algorithm's priority depends on its predicted runtime conditioned on the fact that it has not yet found a solution. In a recent approach, black-box techniques were used for learning how to interleave the execution of multiple heuristics to improve average-case performance based on the development of solution quality [190].

Besides algorithm portfolios, there is another line of research that takes advantage of multiple algorithms. Given a computationally hard problem instance and multiple algorithms with relatively uncorrelated performance, it is natural to define an "algorithm selection problem" [170]: which algorithm(s) should be used to minimize some performance objective, such as classification error (for solving a classification problem) or expected runtime (e.g., for solving SAT)? Much early work on solving the algorithm selection problem focused on selecting learning algorithms for solving classification problems [1, 137, 159]. Instead of using the term "algorithm selection", these studies used the term "meta-learning".
For example, Aha (1992) used rule-based learning algorithms to decide which classification algorithm should be used based on a number of characteristics of the test data sets. Later, this learning approach was applied to many other problem domains. Arinze et al. (1997) demonstrated a knowledge-based system that selected among three forecasting methods with six features for solving a time-series forecasting problem. Lobjois and Lemaître (1998) studied the problem of selecting between branch-and-bound algorithms based on an estimation of search tree size due to Knuth (1975). Gebruers et al. (2005) employed case-based reasoning to select a solution strategy for instances of a constraint programming problem. Various authors have proposed classification-based methods for algorithm selection [55, 65, 66, 84]. Note that one problem with such approaches is that they typically use an error metric that penalizes all misclassifications equally, regardless of their real cost. However, using a sub-optimal algorithm is acceptable in solving an algorithm selection problem if the difference between its performance and that of the best algorithm is small. The studies of Leyton-Brown et al. (2003) and Nudelman et al. (2004) were most closely related to my own work on SATzilla. (Nudelman et al. (2004) indeed coined the name SATzilla.) Specifically, they built empirical hardness models to predict the runtime of given algorithms using regression techniques. By modeling a portfolio of algorithms and choosing the algorithm predicted to have the lowest runtime, empirical hardness models can serve as the basis for building an automatic system that solves the algorithm selection problem. In fact, such a system can also be viewed as a type of classification that takes the real cost of misclassification into account.

Algorithm selection is closely related to algorithm portfolios.
They work for the same reason: they exploit a lack of correlation in the best-case performance of several algorithms in order to obtain improved performance in the average case. In fact, algorithm selection can be viewed as a special type of algorithm portfolio in which the algorithm with the best performance receives a 100% share of computational resources. To more clearly describe algorithm portfolios in a broad sense, we introduced some new terminology [210]. An (a,b)-of-n portfolio is defined as a procedure for selecting among a set of n algorithms with the property that if no algorithm terminates early, at least a and no more than b algorithms will be executed. We consider a portfolio to have terminated early if it solves the problem before one of the solvers has a chance to run, or if one of the solvers crashes. For brevity, we also use the term a-of-n portfolio to refer to an (a,a)-of-n portfolio, and n-portfolio for an n-of-n portfolio. It is also useful to distinguish how solvers are run after being selected. Portfolios can be parallel, sequential, or partly sequential (some combination of the two). Thus, traditional algorithm portfolios can be described as parallel n-portfolios. In contrast, pure algorithm selection procedures are 1-of-n portfolios.

Some approaches fall between these two extremes, making decisions about which algorithms to use on the fly instead of committing in advance to a fixed number of candidates. Lagoudakis and Littman (2001) employed reinforcement learning to solve an algorithm selection problem at each decision point of a DPLL solver for SAT in order to select a branching rule. Similarly, Samulowitz and Memisevic (2007) employed classification to switch between different heuristics for QBF solving during the search. These approaches can be viewed as (1,n)-of-n portfolios.

In recent SAT Competitions and the SAT Challenge, portfolio-based solvers achieved many successes.
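The selection scheme described above (fit one empirical hardness model per solver, then run the solver predicted to be fastest on each instance) can be sketched as follows. This is a minimal illustration only: the solver names, the single instance feature, and the training data are all hypothetical, and a real selector such as SATzilla uses many features and much richer regression models.

```python
# Sketch: a 1-of-n portfolio built from per-solver runtime models.
# Each model is a simple least-squares line predicting log10(runtime)
# from a single instance feature; selection picks the predicted-fastest solver.

def fit_linear(xs, ys):
    """Closed-form least-squares fit of y = a*x + b (one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

def select_solver(models, feature_value):
    """Return the solver whose model predicts the lowest log-runtime."""
    return min(models, key=lambda s: models[s][0] * feature_value + models[s][1])

# Hypothetical training data: (feature, log10 runtime) pairs per solver.
# "solverA" is fast on low-feature instances, "solverB" on high-feature ones.
train = {
    "solverA": ([3.0, 4.0, 5.0, 6.0], [0.1, 0.8, 1.9, 3.1]),
    "solverB": ([3.0, 4.0, 5.0, 6.0], [2.0, 1.6, 1.1, 0.7]),
}
models = {name: fit_linear(xs, ys) for name, (xs, ys) in train.items()}

print(select_solver(models, 3.2))
print(select_solver(models, 5.8))
```

Because the selector commits to a single solver per instance, a badly wrong runtime prediction directly costs performance, which is why accurate empirical hardness models matter here.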
Our own portfolio-based algorithm selector, SATzilla, won a total of 17 medals in the 2007 and 2009 SAT Competitions and the 2012 SAT Challenge. A simple parallel portfolio, ppfolio [171], with 5 candidate solvers performed very well in the 2011 SAT Competition and the 2012 SAT Challenge. 3S [112] achieved remarkable performance by using nearest-neighbor classification. It also used a powerful fixed solver schedule as a pre-solving step.

2.3 Automated Construction of Algorithms

Designing high-performance heuristic algorithms for solving NP-complete problems is often a time-consuming task. The traditional approach requires significant effort from domain experts to select design choices and pick default parameters based on preliminary experimentation. However, the demand for high-performance solvers for difficult combinatorial problems in practical applications has increased sharply. With the ever-increasing availability of cheap computing power, a new line of research has automated parts of the algorithm design process (see also Hoos, 2008) and achieved many successes [31, 50, 51, 54, 64, 145, 157, 158, 207, 210].

Here we discuss three closely related lines of previous work in more detail. First, Minton (1993) used meta-level theories to produce distribution-specific versions of generic heuristics, and then found the most useful combination of these heuristics by evaluating their performance on a small set of test instances. He focused on producing distribution-specific versions of candidate heuristics, and only considered at most 100 possible heuristics. The performance of the resulting algorithms was comparable with that of algorithms designed by a skilled programmer, but not an algorithm expert. Second, Gratch and Dejong (1992) presented a system that starts with a STRIPS-like planner and augments it by incrementally adding search control rules.
Finally and most relatedly, Fukunaga (2002) proposed a genetic programming approach that has a goal similar to the one we pursued in our work on SATenstein (Chapter 9): the automated construction of local search heuristics for SAT. Fukunaga considered a potentially unbounded design space, based only on GSAT-based and WalkSAT-based SLS algorithms up to the year 2000. His candidate variable selection mechanisms were evaluated on Random 3-SAT instances and graph coloring instances with at most 250 variables. While Fukunaga's approach could in principle be used to obtain high-performance solvers for specific types of SAT instances, to our knowledge this potential has never been realized; the best automatically constructed solvers obtained by Fukunaga only achieved a performance level similar to that of the best WalkSAT variants available in 2000 on moderately-sized SAT instances. In contrast, as mentioned in Chapter 1, our new SATenstein-LS solvers perform substantially better than current state-of-the-art SLS-based SAT solvers on a broad range of challenging, modern SAT instances. We consider a huge but bounded combinatorial space of algorithms based on components taken from 25 of the best SLS algorithms for SAT currently available, and we use an off-the-shelf, general-purpose algorithm configuration procedure to search this space. Our target distribution contains instances with up to 4,978 variables.

2.4 Automated Algorithm Configuration Tools

Recently, considerable attention has been paid to the problem of automated algorithm configuration [3, 12, 50, 64, 93, 96]. A variety of black-box, automated configuration procedures have been proposed. They take as input a highly parameterized algorithm, a set of benchmark instances, and a performance metric, and then optimize the algorithm's empirical performance automatically.
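A minimal, model-free configurator with exactly this black-box interface might be sketched as follows. Everything here is invented for illustration: the parameter space, the instances, and the simulated runtime function are hypothetical stand-ins, and real tools such as ParamILS or F-Race use far more sophisticated search and racing strategies than uniform random sampling.

```python
import random

# Sketch of black-box algorithm configuration: given a parameterized target
# algorithm, a set of benchmark instances, and a performance metric, search
# the configuration space for the configuration with the best average score.

def simulated_runtime(config, instance):
    """Hypothetical stand-in for running the target algorithm; a real
    configurator would launch the solver and measure CPU time."""
    noise, restarts = config["noise"], config["restarts"]
    return abs(noise - 0.4) * instance + 0.01 * restarts * instance

def random_search_configure(space, instances, metric, budget, seed=0):
    """Model-free configuration by uniform random sampling of configurations."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(budget):
        config = {name: rng.choice(values) for name, values in space.items()}
        # Performance metric: mean simulated runtime across the benchmark set.
        score = sum(metric(config, inst) for inst in instances) / len(instances)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

space = {"noise": [0.1, 0.2, 0.3, 0.4, 0.5], "restarts": [1, 10, 100]}
instances = [50, 100, 200]  # hypothetical instance "sizes"
best, score = random_search_configure(space, instances, simulated_runtime, budget=500)
print(best, score)
```

Even this crude sampler illustrates the interface shared by the tools discussed below: the configurator never inspects the algorithm's internals, only the performance metric's value for each (configuration, instance) pair.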
These approaches can be categorized into two major families: model-based approaches that learn a response surface over the parameter space, and model-free approaches that do not. Most of the early approaches were only able to handle relatively small numbers of numerical (often continuous) parameters. Some relatively recent approaches permit both larger numbers of parameters and/or categorical domains, in particular Composer [63], F-Race [21–23], and ParamILS [94, 96, 97].

Automated algorithm configuration procedures have been applied to optimize a variety of parametric algorithms. Gratch and Chien (1996) successfully applied their Composer system to optimize an algorithm for scheduling communication between a collection of antennas and spacecraft in deep space. F-Race and its extensions have been used to optimize numerous algorithms, including iterated local search for the quadratic assignment problem, ant colony optimization for the traveling salesperson problem, and the best-performing algorithm submitted to the 2003 timetabling competition [23]. Our group has successfully used various versions of ParamILS to configure algorithms for a wide variety of problem domains [94, 96, 97].

2.5 Automatically Configuring Algorithms for Portfolio-Based Selection

In domains where only one highly parameterized algorithm is competitive (e.g., certain distributions of mixed-integer programming problems), how should we build a strong portfolio-based algorithm selector for a given (potentially heterogeneous) distribution? Applying automated algorithm configuration tools can improve overall performance. However, the resulting single configuration may perform poorly on some subset of instances. Meanwhile, due to the absence of multiple strong and uncorrelated candidate solvers, algorithm selection approaches cannot provide an edge over a single configuration.
One possible solution is to combine the above two techniques and perform instance-specific selection from an automatically generated set of algorithm configurations.

Beyond our work on Hydra, there exist a few other approaches for solving this problem. Stochastic offline programming (SOP) [140] assumes that each of these algorithms has a particular structure, iteratively sampling from a distribution over heuristics and using the sampled heuristic for one search step. It clusters the instances based on features and then configures one algorithm for each cluster. A custom optimization method is used for building its set of algorithms.

Later, the same research group improved this approach with Instance-Specific Algorithm Configuration (ISAC) [111]. It first divides instance sets into clusters based on instance features using the G-means clustering algorithm, then applies an algorithm configurator to find a good configuration for each cluster. At runtime, ISAC computes the distance in feature space to each cluster centroid and selects the configuration for the closest cluster.

We note two theoretical problems with this approach. First, ISAC's clustering is based solely on distance in feature space, ignoring the importance of each feature to runtime. Thus, ISAC's performance can change dramatically if additional features are added (even if they are uninformative). Second, no amount of training time allows ISAC to recover from a misleading initial clustering or an algorithm configuration run that yields poor results. Nevertheless, ISAC substantially outperformed solvers with default configurations and configurations obtained by automated algorithm configuration tools on a set covering problem, MIP, and SAT [111].

Chapter 3

Domains of Interest

Computationally hard combinatorial problems are ubiquitous in AI. This thesis focuses on some fundamental problems in computer science that have wide real-world applications.
In particular, we applied machine learning techniques to study the empirical hardness of the propositional satisfiability problem (SAT), the mixed integer programming problem (MIP), and the traveling salesperson problem (TSP). We also developed many meta-algorithmic techniques that improved the state of the art for solving SAT and MIP. This chapter first gives an overview of each problem domain and of prominent solvers for each domain, then introduces the sets of features for characterizing problem instances in conjunction with the benchmarks used for case studies.¹

3.1 Propositional Satisfiability (SAT)

The propositional satisfiability problem (SAT) asks, for a given propositional formula F, whether there exists a complete assignment of truth values to the variables of F under which F evaluates to true [83]. F is considered satisfiable if there exists at least one such assignment; otherwise, the formula is labeled unsatisfiable. A SAT instance is usually represented in conjunctive normal form (a conjunction of disjunctions), where each disjunction has one or more literals, each of which is either a variable or the negation of a variable. These disjunctions are called clauses. Thus, the goal for a SAT solver is to find a variable assignment that satisfies all clauses or to prove that no such assignment exists. For example, a solution of the formula (A∨B∨C)∧(¬B∨¬C) is A = true, B = false, C = false. SAT is one of the most fundamental problems in computer science. Indeed, there are entire conferences and journals devoted to the study of this problem. Another important reason for interest in SAT is that instances of other NP-complete problems can be encoded into SAT and solved by SAT solvers.

¹ This chapter is based on joint work with Ashiqur KhudaBukhsh, Frank Hutter, Holger Hoos, and Kevin Leyton-Brown [101, 115, 209, 210].
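The small formula above can be checked mechanically. The following sketch (our own illustration, with literals encoded as signed integers in the style of the standard DIMACS convention) evaluates a CNF formula under an assignment and finds a satisfying assignment by exhaustive search; this exponential enumeration is only meant to make the problem statement concrete, not to resemble a real solver.

```python
from itertools import product

# Literals are signed integers: variable 1 is A, 2 is B, 3 is C;
# a negative literal means the negation of that variable.
# The example formula (A or B or C) and (not B or not C):
formula = [[1, 2, 3], [-2, -3]]

def satisfies(clauses, assignment):
    """assignment maps variable -> bool; a clause is satisfied if any of
    its literals evaluates to true, and F is satisfied if every clause is."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

def brute_force_sat(clauses, num_vars):
    """Try all 2^n complete assignments; exponential, illustration only."""
    for values in product([True, False], repeat=num_vars):
        assignment = dict(enumerate(values, start=1))
        if satisfies(clauses, assignment):
            return assignment
    return None  # unsatisfiable

print(brute_force_sat(formula, 3))
```

For instance, `satisfies(formula, {1: True, 2: False, 3: False})` confirms the solution given in the text.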
This approach has been shown to be effective for several real-world applications, including planning [113, 114], scheduling [39], graph coloring [202], bounded model checking [20], and formal verification [189].

Over the past decades, considerable research and engineering effort has been invested into designing and optimizing algorithms for SAT solving. Today's high-performance SAT solvers include tree-search algorithms [41, 44, 46, 73, 121, 139], local search algorithms [80, 91, 105, 131, 178, 179], and resolution-based preprocessors [9–11, 40, 42, 192]. Here, we give a brief introduction to the most popular SAT solving methods: tree search and local search.

3.1.1 Tree Search for SAT

A tree-search algorithm attempts to locate solutions to a problem instance in a systematic manner. It guarantees that a solution is eventually found if one exists, or the algorithm will report that no solution exists. In other words, tree search is complete. Most modern tree-search algorithms for SAT are based on the Davis-Putnam-Logemann-Loveland (DPLL) procedure [41]. This procedure explores a binary search tree in which each node corresponds to assigning a truth value to one variable (that value is then fixed for all subtrees beneath that node). Since the search space size increases exponentially with the number of variables, simple backtrack search rapidly becomes infeasible even for relatively small problem instances. Fortunately, in many cases, it is possible to prune large parts of the search tree that do not contain any solution. One of the key techniques used in SAT solving for reducing the size of the search tree is unit propagation. When SAT instances are represented in conjunctive normal form, any clause containing a literal with a "true" assignment can be deleted, and all literals with "false" assignments can be eliminated from the clauses. Clauses with only one literal are termed unit clauses. The literal in a unit clause must be assigned the value "true".
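The simplification rules just described (delete clauses satisfied by a true literal, eliminate false literals, and force the literal of any unit clause to true) can be sketched as a simple propagation loop. This is a minimal illustration with our own helper names, not any particular solver's implementation; literals are again signed integers.

```python
# Sketch of unit propagation on a CNF formula (literals as signed integers).
# Repeatedly: find a unit clause, force its literal to true, and simplify the
# formula under that assignment. Returns (simplified clauses, forced literals),
# or (None, forced) if an empty clause (a contradiction) arises.

def assign(clauses, literal):
    """Simplify the formula under literal := true: drop satisfied clauses,
    strip the now-false complementary literal from the remaining ones."""
    new_clauses = []
    for clause in clauses:
        if literal in clause:
            continue                      # clause satisfied: delete it
        reduced = [l for l in clause if l != -literal]
        new_clauses.append(reduced)       # may become a new unit (or empty)
    return new_clauses

def unit_propagate(clauses):
    forced = []
    while True:
        if any(not c for c in clauses):
            return None, forced           # contradiction: prune this branch
        units = [c[0] for c in clauses if len(c) == 1]
        if not units:
            return clauses, forced
        lit = units[0]
        forced.append(lit)
        clauses = assign(clauses, lit)

# (x1) and (not x1 or x2) and (not x2 or x3 or x4): forcing x1 cascades to x2.
clauses, forced = unit_propagate([[1], [-1, 2], [-2, 3, 4]])
print(forced, clauses)
```

The example shows the cascading behavior: forcing x1 turns the second clause into the unit clause (x2), which is then forced in turn.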
Furthermore, assigning a value to the variable in a unit clause may lead to more unit clauses. Therefore, unit propagation is a procedure that propagates the consequences of a particular unit clause's literal assignment down the search tree and prunes parts of the search tree that do not contain any solution.

Recent advances in tree search algorithms include clause learning [139], preprocessing [45], backbone detection [44], and belief propagation [85]. These techniques are used for intelligent backtracking, simplifying the original formula, determining the best variable for branching, and finding the most promising assignments, respectively.

3.1.2 Local Search for SAT

Another common approach for solving hard combinatorial problems is local search. A local search algorithm starts at some location in the space of candidate solutions and subsequently moves from the present location to a neighboring location. There are many ways to define a neighborhood relation. For SAT, the neighbors of a candidate solution (a complete assignment) are usually the candidates differing from the current one only in a single variable assignment. Typically, every candidate solution has more than one neighbor; the choice of which one to move to is based mainly on information about the candidates in the neighborhood of the current one, such as the number of unsatisfied clauses for each neighbor. In contrast to tree search algorithms, typical local search algorithms are incomplete: there is no guarantee that an existing solution will eventually be found within a limited amount of time, nor can unsatisfiability ever be proven.

The basic local search framework for SAT solving [83] is as follows. Given a propositional formula F with n variables, first randomly pick a complete variable assignment that corresponds to a point in the solution space. Then check whether the current assignment satisfies F. If so, terminate and report the current assignment as a solution.
Otherwise, modify the current assignment (i.e., visit a neighboring location) by selecting a variable based on some predefined scoring function and changing its value from "true" to "false" or vice versa. This procedure repeats until a solution is found or a maximal number of steps has been performed.

Many local search algorithms can get stuck in a small part of the solution space (a situation called search stagnation), and they are unable or unlikely to escape from this condition without some special mechanism. In order to avoid search stagnation, modern local search algorithms are typically randomized, leading to stochastic local search (SLS) [83]. For example, WalkSAT avoids search stagnation by using a random walk strategy that randomly changes the value of a variable in an unsatisfied clause [179]. Currently, much research in local search focuses on finding good tradeoffs between intensification (more intensely searching a promising small part of the solution space) and diversification (exploring other regions of the solution space).

Existing SLS-based SAT solvers can be grouped into four broad categories: GSAT-based algorithms [178], WalkSAT-based algorithms [179], dynamic local search algorithms [91, 195], and G2WSAT variants [131]. SATenstein-LS, the highly parameterized algorithm framework described in Chapter 9, takes components from solvers in each of these categories; therefore, we describe the major features of each category in detail in the following subsections.

Category 1: GSAT-based Algorithms

GSAT [178] was one of the earliest SLS SAT solvers.
At each step, GSAT computes the score of each variable using a scoring function, then flips the variable (changes its value from true to false or from false to true) with the best score. The score of a variable depends on two quantities, MakeCount and BreakCount. The MakeCount of a variable with respect to an assignment is the number of previously-unsatisfied clauses that will become satisfied if the variable is flipped. Similarly, the BreakCount of a variable with respect to an assignment is the number of previously-satisfied clauses that will become unsatisfied if the variable is flipped. The scoring function of GSAT is MakeCount - BreakCount.

Variants of GSAT introduced many techniques that were later used by other SLS solvers. For example, GWSAT [177] performs a conflict-directed random walk step with probability wp; otherwise it performs a regular GSAT step. Conflict-directed random walk is an example of a search diversification strategy that was later used by many SLS solvers. GSAT randomly picks a variable if multiple variables have the same score. HSAT [57] introduces a new tie-breaking scheme in which ties are broken in favor of the least-recently-flipped variable. In subsequent SLS solvers, breaking ties randomly and breaking ties in favor of the least-recently-flipped variable were prominent tie-breaking schemes. GSAT now has only historical importance, as there is a substantial performance gap between GSAT and recent state-of-the-art SLS solvers.

Category 2: WalkSAT-based Algorithms

The major difference between WalkSAT algorithms and GSAT algorithms is the neighborhood each considers. For a WalkSAT algorithm, the neighborhood consists of the variables appearing in currently unsatisfied clauses rather than the full set of variables. At each search step, a WalkSAT algorithm first picks an unsatisfied clause (e.g., uniformly at random), and then flips a variable from that clause, chosen according to some heuristic.
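The MakeCount/BreakCount scoring and the two step types just described can be sketched as follows. This is a toy implementation with our own helper names: efficient SLS solvers maintain these counts incrementally rather than rescanning every clause on each step, and real WalkSAT variants use more refined variable-selection heuristics than the plain best-score rule shown here.

```python
import random

# Sketch: GSAT-style scoring (MakeCount - BreakCount) and one flip step for
# GSAT and for a generic WalkSAT-style algorithm. Literals are signed ints.

def clause_satisfied(clause, assignment):
    return any(assignment[abs(l)] == (l > 0) for l in clause)

def score(var, clauses, assignment):
    """MakeCount - BreakCount of flipping `var` under `assignment`."""
    flipped = dict(assignment)
    flipped[var] = not flipped[var]
    make = sum(1 for c in clauses
               if not clause_satisfied(c, assignment) and clause_satisfied(c, flipped))
    brk = sum(1 for c in clauses
              if clause_satisfied(c, assignment) and not clause_satisfied(c, flipped))
    return make - brk

def gsat_step(clauses, assignment):
    """GSAT: flip a best-scoring variable over ALL variables."""
    best = max(assignment, key=lambda v: score(v, clauses, assignment))
    assignment[best] = not assignment[best]
    return best

def walksat_step(clauses, assignment, rng):
    """WalkSAT: pick a random unsatisfied clause, flip its best variable."""
    unsat = [c for c in clauses if not clause_satisfied(c, assignment)]
    clause = rng.choice(unsat)
    best = max({abs(l) for l in clause}, key=lambda v: score(v, clauses, assignment))
    assignment[best] = not assignment[best]
    return best

clauses = [[1], [1, 2], [-2]]
assignment = {1: False, 2: True}      # clauses [1] and [-2] are unsatisfied
flipped_var = gsat_step(clauses, assignment)
rng = random.Random(0)
walksat_step(clauses, assignment, rng)
print(assignment)
```

Note how the two steps restrict their candidate variables differently: `gsat_step` scores every variable, while `walksat_step` only considers the variables of one unsatisfied clause.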
WalkSAT/SKC [179] was one of the earliest WalkSAT algorithms, and has a scoring function that depends only on BreakCount.

Novelty [142] and its several variants are among the most prominent WalkSAT algorithms. Novelty picks a random unsatisfied clause and computes the variables with the highest and second-highest scores under the same scoring function as GSAT. Ties are broken in favor of the least-recently-flipped variable. If the variable with the highest score is not the most recently flipped variable within the clause, then it is deterministically selected for flipping. Otherwise, it is selected with probability (1 - p), where p is a parameter termed the noise setting (with probability p, the second-best variable is selected). The idea of considering flip history is exploited in various ways in different SLS solvers, such as the age of a variable (e.g., in Novelty) and flip counts (e.g., in VW [166]). To prevent stagnation (getting stuck in local minima), Novelty is often augmented with a probabilistic conflict-directed random walk [79]. Recent Novelty variants (e.g., adaptNovelty+ [80]) also use a reactive mechanism that adaptively changes the noise parameter. This reactive mechanism has been extended to many SLS solvers [135] and often yields improved performance.

Category 3: Dynamic Local Search Algorithms

The most prominent feature of dynamic local search (DLS) algorithms is the use of "clause penalties" or "clause weights". At each step, the penalty of each unsatisfied clause is increased (this increase can be additive [195] or multiplicative [91]). In this manner, information that pertains to the difficulty of solving a given clause is recorded in its associated clause penalty. In order to prevent an unbounded increase in weights and to emphasize the most recent information about the difficulty of a given clause, occasional smoothing steps are performed to reduce them. The scoring function is the sum of the clause penalties of all unsatisfied clauses.
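The clause-weighting scheme described above might be sketched as follows. This is our own minimal illustration with additive penalty bumps and a multiplicative smoothing step; real DLS solvers such as SAPS and PAWS differ in many details (when and how they bump, smooth, and break ties), and the parameter values here are arbitrary.

```python
# Sketch of dynamic local search clause penalties: every clause carries a
# weight; weights of unsatisfied clauses are increased each step (additively
# here), occasionally smoothed back toward 1, and the search minimizes the
# total weight of unsatisfied clauses.

def clause_satisfied(clause, assignment):
    return any(assignment[abs(l)] == (l > 0) for l in clause)

def weighted_cost(clauses, weights, assignment):
    """Scoring function: sum of penalties of all unsatisfied clauses."""
    return sum(w for c, w in zip(clauses, weights)
               if not clause_satisfied(c, assignment))

def update_penalties(clauses, weights, assignment,
                     bump=1.0, smooth=0.8, smooth_every=10, step=0):
    """Additively bump unsatisfied clauses; occasionally smooth all weights."""
    for i, c in enumerate(clauses):
        if not clause_satisfied(c, assignment):
            weights[i] += bump
    if step % smooth_every == smooth_every - 1:
        weights[:] = [1.0 + smooth * (w - 1.0) for w in weights]
    return weights

clauses = [[1, 2], [-1, 2], [-2]]
weights = [1.0, 1.0, 1.0]
assignment = {1: True, 2: False}      # only [-1, 2] is unsatisfied
update_penalties(clauses, weights, assignment)
print(weights, weighted_cost(clauses, weights, assignment))
```

Repeatedly unsatisfied clauses accumulate weight, so the weighted cost steers the search toward satisfying exactly the clauses it has found hardest.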
For prominent DLS solvers, such as SAPS, RSAPS [91], and PAWS [195], the neighborhood consists of variables that appear in at least one unsatisfied clause.

Category 4: G2WSAT Variants

G2WSAT [131] can be viewed as a combination of the GSAT and WalkSAT architectures. Similar to GSAT, G2WSAT has a deterministic greedy component that examines a large number of variables belonging to a promising list data structure that contains promising decreasing variables (defined below). If the list has at least one promising decreasing variable, G2WSAT deterministically selects the variable with the best score for flipping. Ties are broken in favor of the least-recently-flipped variable. If the list is empty, G2WSAT executes its stochastic component, a Novelty variant that belongs to the WalkSAT architecture.

The definition of a promising decreasing variable is somewhat technical. A variable x is said to be decreasing with respect to an assignment A if score_A(x) > 0. A promising decreasing variable is defined as follows:

1. For the initial random assignment A, all decreasing variables with respect to A are promising.

2. Let x and y be two different variables, where x is not decreasing with respect to A. If, after y is flipped, x becomes decreasing with respect to the new assignment A', then x is a promising decreasing variable with respect to A'.

3. As long as a promising decreasing variable is decreasing, it remains promising with respect to subsequent assignments in local search.

Apart from G2WSAT [131], all G2WSAT variants use the reactive mechanism found in adaptNovelty+ [79]. gNovelty+ [161], the winner of the 2007 SAT Competition in the random satisfiable category, also uses the clause penalties and smoothing found in dynamic local search algorithms [195].

UBCSAT

UBCSAT [198] is an SLS solver implementation and experimentation environment for SAT.
It has already been used to implement many existing high-performance SLS algorithms from the literature (e.g., SAPS [91], adaptG2WSAT+ [135]). These implementations generally match or exceed the efficiency of the implementations by the original authors. UBCSAT implementations have therefore been widely used as reference implementations (see, e.g., [115, 166]) for many well-known local search algorithms. In addition, it also provides a rich interface that includes numerous statistical and reporting features facilitating empirical analysis of SLS algorithms.

Many existing SLS algorithms for SAT share common components and data structures. The general design of UBCSAT allows for the reuse and extension of such common components and mechanisms. This rendered UBCSAT a suitable environment for the implementation of highly parameterized local search algorithms, such as our SATenstein-LS solver described in Chapter 9.

3.1.3 SAT Features

For the propositional satisfiability (SAT) problem, we used 138 features, listed in Figure 3.1. Since a preprocessing step can significantly reduce the size of the CNF formula (particularly for industrial-type instances), we chose to apply the preprocessing procedure SatElite [45] on all instances first, and then to compute instance features on the preprocessed instances. The first 90 features, except features 22–26 and 32–36, were introduced by Nudelman et al. [156]. They can be categorized as problem size features (1–7), graph-based features (8–36), balance features (37–49), proximity to Horn formula features (50–55), DPLL probing features (56–62), LP-based features (63–68), and local search probing features (69–90).

Figure 3.1: 12 groups of SAT features

Problem Size Features:
1–2. Number of variables and clauses in original formula: denoted v and c, respectively
3–4. Number of variables and clauses after simplification with SatElite: denoted v' and c', respectively
5–6. Reduction of variables and clauses by simplification: (v-v')/v' and (c-c')/c'
7. Ratio of variables to clauses: v'/c'

Variable-Clause Graph Features:
8–12. Variable node degree statistics: mean, variation coefficient, min, max, and entropy
13–17. Clause node degree statistics: mean, variation coefficient, min, max, and entropy

Variable Graph Features:
18–21. Node degree statistics: mean, variation coefficient, min, and max
22–26. Diameter: mean, variation coefficient, min, max, and entropy

Clause Graph Features:
27–31. Node degree statistics: mean, variation coefficient, min, max, and entropy
32–36. Clustering coefficient: mean, variation coefficient, min, max, and entropy

Balance Features:
37–41. Ratio of positive to negative literals in each clause: mean, variation coefficient, min, max, and entropy
42–46. Ratio of positive to negative occurrences of each variable: mean, variation coefficient, min, max, and entropy
47–49. Fraction of unary, binary, and ternary clauses

Proximity to Horn Formula:
50. Fraction of Horn clauses
51–55. Number of occurrences in a Horn clause for each variable: mean, variation coefficient, min, max, and entropy

DPLL Probing Features:
56–60. Number of unit propagations: computed at depths 1, 4, 16, 64, and 256
61–62. Search space size estimate: mean depth to contradiction, estimate of the log of the number of nodes

LP-Based Features:
63–66. Integer slack vector: mean, variation coefficient, min, and max
67. Ratio of integer vars in LP solution
68. Objective value of LP solution

Local Search Probing Features, based on 2 seconds of running each of SAPS and GSAT:
69–78. Number of steps to the best local minimum in a run: mean, median, variation coefficient, 10th and 90th percentiles
79–82. Average improvement to best in a run: mean and coefficient of variation of improvement per step to best solution
83–86. Fraction of improvement due to first local minimum: mean and variation coefficient
87–90. Best solution: mean and variation coefficient

Clause Learning Features (based on 2 seconds of running Zchaff rand):
91–99. Number of learned clauses: mean, variation coefficient, min, max, 10%, 25%, 50%, 75%, and 90% quantiles
100–108. Length of learned clauses: mean, variation coefficient, min, max, 10%, 25%, 50%, 75%, and 90% quantiles

Survey Propagation Features:
109–117. Confidence of survey propagation: for each variable, compute the higher of P(true)/P(false) or P(false)/P(true). Then compute statistics across variables: mean, variation coefficient, min, max, 10%, 25%, 50%, 75%, and 90% quantiles
118–126. Unconstrained variables: for each variable, compute P(unconstrained). Then compute statistics across variables: mean, variation coefficient, min, max, 10%, 25%, 50%, 75%, and 90% quantiles

Timing Features:
127–138. CPU time required for feature computation: one feature for each of 12 computational subtasks

We incrementally introduced additional features in our work on SATzilla [210, 211] since 2007. Our new diameter features 22–26 are based on the variable graph [71]. For each node i in that graph, we compute the longest of the shortest paths between i and any other node. As with most of the features that follow, we then compute various statistics over this vector (e.g., mean, max). Our new clustering coefficient features 32–36 measure the local cliqueness of the clause graph. For each node in the clause graph, let p denote the number of edges present between the node and its neighbours, and let m denote the maximum possible number of such edges; we compute p/m for each node.

Our new clause learning features (91–108) are based on statistics gathered in 2-second runs of Zchaff rand [139]. We measure the number of learned clauses (features 91–99) and the length of the learned clauses (features 100–108) after every 1000 search steps.

Our new survey propagation features (109–126) are based on estimates of variable bias in a SAT formula obtained using probabilistic inference [86].
We used VARSAT's implementation to estimate the probabilities that each variable is true in every satisfying assignment, false in every satisfying assignment, or unconstrained. Features 109–117 measure the confidence of survey propagation (that is, max(Ptrue(i)/Pfalse(i), Pfalse(i)/Ptrue(i)) for each variable i), and features 118–126 are based on the Punconstrained vector.

Finally, our new timing features (127–138) measure the time taken by 12 different blocks of feature computation code: instance preprocessing by SatElite; problem size (1–6); variable-clause graph (clause node) and balance features (7, 13–17, 37–41, 47–49); variable-clause graph (variable node), variable graph, and proximity to Horn formula features (8–12, 18–21, 42–46, 50–55); diameter-based features (22–26); clause graph features (27–36); unit propagation features (56–60); search space size estimation (61–62); LP-based features (63–68); local search probing features (69–90) with SAPS and GSAT; clause learning features (91–108); and survey propagation features (109–126).

The cost of computing these features depends on the size of the SAT instance. Typically, feature computation takes less than 200 CPU seconds on an Intel Xeon 3.2GHz CPU, but it may take over 1000 seconds for very large instances (with millions of variables) from industrial applications.

3.1.4 SAT Benchmarks

Many interesting SAT benchmarks have been used for studying empirical hardness and for constructing high-performance portfolio solvers. They come from three major sources: SAT Competitions and SAT Races, instance generators, and other sources.

SAT Competitions and SAT Races These benchmarks comprise instances from the 2002–2011 SAT Competitions and the 2006–2010 SAT Races. For each competition, the instances were divided into three categories: Industrial/Application (INDU/APP), Handmade/Crafted (HAND/CRAFTED), and Random (RAND). The SAT Races only included industrial/application instances.
These are very heterogeneous benchmarks, with the number of instances limited by the actual data used in the competitions/races.

Instance generators Many instance generators have been used for generating random and structured instances. Detailed information about the instance generators and their parameters is as follows.

• rand3-fix/R3SAT: uniform random 3-SAT at the solubility phase transition (c = 4.258·v + 58.26·v^(−2/3)) [32, 180]. The satisfiable/unsatisfiable ratio is approximately 50/50.

• rand3-var: uniform random 3-SAT with the clauses-to-variables ratio selected uniformly at random from 3.26 to 5.26.

• QCP: SAT-encoded quasi-group completion problems: the task of determining whether the missing entries of a partial Latin square can be filled in to obtain a complete Latin square. We generated instances around the solubility phase transition, using the parameters given by Gomes and Selman (1997) (order O ∈ [10, ..., 30]; holes H = h × O^1.55, h ∈ [1.2, ..., 2.2]).

• SW-GCP: SAT-encoded graph-coloring problems on small-world graphs [58], with ring lattice size S ∈ [100, ..., 400], 10 nearest neighbors connected, rewiring probability 2^(−7), and chromatic number 6.

• HGEN: satisfiable-only random instances generated using HGEN2 [76].

• FAC: SAT-encoded factoring problems based on prime numbers ∈ [3000, ..., 4000] [200].

• CBMC(SE): SAT-encoded software verification instances based on a binary search algorithm [34], with array size s ∈ [1, ..., 2000] and loop-unwinding values n ∈ [4, 5, 6].
To reduce the size of the original instances, we preprocessed them with SatElite [45].

Other sources Some benchmarks were obtained from industrial users with real applications.

• IBM: This distribution of SAT-encoded bounded model checking instances comprises 765 instances generated by Zarpas (2005); these instances were downloaded from the IBM Formal Verification Benchmarks Library.

• SWV: This distribution of SAT-encoded software verification instances comprises 604 instances generated with the CALYSTO static checker [7], used for the verification of five programs: the spam filter Dspam, the SAT solver HyperSAT, the Wine Windows OS emulator, the gzip archiver, and a component of xinetd (a secure version of inetd).

3.2 Mixed Integer Programming (MIP)

Mixed integer programming (MIP) is a general approach for representing constrained optimization problems with integer-valued and continuous variables. Because MIP serves as a unifying framework for NP-complete optimization problems and combines the expressive power of integrality constraints with the efficiency of continuous optimization, it is widely used both in academia and in industry. MIP used to be studied mainly in operations research, but it has recently become an important tool in AI, with applications ranging from auction theory [125] to computational sustainability [62]. Furthermore, several recent advances in MIP solving have been achieved with AI techniques [59, 97].

3.2.1 IBM ILOG CPLEX

One important advantage of the MIP representation is that broadly applicable solvers can be developed in a problem-independent manner. IBM ILOG's CPLEX solver is particularly well known for achieving strong practical performance; it is used by over 1 300 corporations (including one-third of the Global 500) and by researchers at more than 1 000 universities [103]. CPLEX principally uses a branch-and-cut algorithm that essentially solves a series of relaxed LP subproblems.
In order to find the optimal solution more effectively, additional cuts and sophisticated branching strategies are employed for these subproblems. CPLEX also uses heuristics that help to find good initial solutions, and it includes a sophisticated mixed integer preprocessing system. There are other commercial and non-commercial MIP solvers, such as XPRESS, Gurobi, and SCIP, each offering its own advantages. For example, the XPRESS MIP Optimizer uses a sophisticated branch-and-bound algorithm to solve MIP problems and is well known for its ability to quickly find high-quality solutions [147].

State-of-the-art MIP solvers typically expose many parameters to end users; for example, CPLEX 12.1 comes with a 221-page parameter reference manual describing 135 parameters.

3.2.2 MIP Features

Figure 3.2 summarizes 121 features for mixed integer programs (i.e., MIP instances). These include 101 features based on existing work [90, 111, 130], 15 new probing features, and 5 new timing features. Features 1–101 are primarily based on features for the combinatorial winner determination problem from our group's past work [130], generalized to MIP and previously only described in a Ph.D. thesis [90]. These features can be categorized as problem type & size features (1–25), variable-constraint graph features (26–49), linear constraint matrix features (50–73), objective function features (74–91), and LP-based features (92–95). We also integrated ideas from the feature set used by Kadioglu et al. (2010): the right-hand side features (96–101) and the computation of separate statistics for continuous variables, non-continuous variables, and their union. We extended existing features by adding richer statistics where applicable: medians, variation coefficients (vc), and percentile ratios (q90/q10), computed for features that are based on vectors of values.

Problem Type (trivial):
1. Problem type: LP, MILP, FIXEDMILP, QP, MIQP, FIXEDMIQP, QCP, or MIQCP, as attributed by CPLEX

Problem Size Features (trivial):
2–3. Number of variables and constraints: denoted n and m, respectively
4. Number of non-zero entries in the linear constraint matrix, A
5–6. Quadratic variables and constraints: number of variables with quadratic constraints and number of quadratic constraints
7. Number of non-zero entries in the quadratic constraint matrix, Q
8–12. Number of variables of type: Boolean, integer, continuous, semi-continuous, semi-integer
13–17. Fraction of variables of type (summing to 1): Boolean, integer, continuous, semi-continuous, semi-integer
18–19. Number and fraction of non-continuous variables (counting Boolean, integer, semi-continuous, and semi-integer variables)
20–21. Number and fraction of unbounded non-continuous variables: fraction of non-continuous variables that have an infinite lower or upper bound
22–25. Support size: mean, median, vc, q90/10 of a vector composed of the following values for bounded variables: domain size for binary/integer variables, 2 for semi-continuous variables, 1 + domain size for semi-integer variables

Variable-Constraint Graph Features (cheap): each feature is replicated three times, for X ∈ {C, NC, V}
26–37. Variable node degree statistics: characteristics of the vector (∑_{cj∈C} I(Ai,j ≠ 0)) over xi ∈ X: mean, median, vc, q90/10
38–49. Constraint node degree statistics: characteristics of the vector (∑_{xi∈X} I(Ai,j ≠ 0)) over cj ∈ C: mean, median, vc, q90/10

Linear Constraint Matrix Features (cheap): each feature is replicated three times, for X ∈ {C, NC, V}
50–55. Variable coefficient statistics: characteristics of the vector (∑_{cj∈C} Ai,j) over xi ∈ X: mean, vc
56–61. Constraint coefficient statistics: characteristics of the vector (∑_{xi∈X} Ai,j) over cj ∈ C: mean, vc
62–67. Distribution of normalized constraint matrix entries, Ai,j/bi: mean and vc (only over elements where bi ≠ 0)
68–73. Variation coefficient of normalized absolute non-zero entries per row (normalized by dividing by the sum of the row's absolute values): mean, vc

Objective Function Features (cheap): each feature is replicated three times, for X ∈ {C, NC, V}
74–79. Absolute objective function coefficients {|ci|} (i = 1, ..., n): mean and stddev
80–85. Normalized absolute objective function coefficients {|ci|/ni} (i = 1, ..., n), where ni denotes the number of non-zero entries in column i of A: mean and stddev
86–91. Square-root-normalized absolute objective function coefficients {|ci|/√ni} (i = 1, ..., n): mean and stddev

LP-Based Features (expensive):
92–94. Integer slack vector: mean, max, L2 norm
95. Objective function value of LP solution

Right-hand Side Features (trivial):
96–97. Right-hand side for ≤ constraints: mean and stddev
98–99. Right-hand side for = constraints: mean and stddev
100–101. Right-hand side for ≥ constraints: mean and stddev

Presolving Features (moderate):
102–103. CPU times: presolving and relaxation CPU time
104–107. Presolving result features: number of constraints, variables, non-zero entries in the constraint matrix, and clique table inequalities after presolving

Probing Cut Usage Features (moderate):
108–112. Number of specific cuts: clique cuts, Gomory fractional cuts, mixed integer rounding cuts, implied bound cuts, flow cuts

Probing Result Features (moderate):
113–116. Performance progress: MIP gap achieved, number of new incumbents found by primal heuristics, number of feasible solutions found, number of solutions or incumbents found

Timing Features:
117–121. CPU time required for feature computation: one feature for each of 5 groups of features (see text for details)

Figure 3.2: MIP instance features; for the variable-constraint graph, linear constraint matrix, and objective function features, each feature is computed with respect to three subsets of variables: continuous (C), non-continuous (NC), and all (V).

We introduce two new sets of features. Firstly, our new MIP probing features 102–116 are based on 5-second runs of CPLEX with default settings.
They are obtained via the CPLEX API and include 6 presolving features based on the output of CPLEX's presolving phase (102–107); 5 probing cut usage features describing the different cuts CPLEX used during probing (108–112); and 4 probing result features summarizing the probing runs (113–116).

Secondly, our new timing features 117–121 capture the CPU time required for computing five different groups of features: variable-constraint graph, linear constraint matrix, and objective function features for the three subsets of variables ("continuous", "non-continuous", and "all"; 26–91); LP-based features (92–95); and CPLEX probing features (102–116). The cost of computing the remaining features (1–25, 96–101) is small (linear in the number of variables or constraints).

3.2.3 MIP Benchmarks

Most of the MIP benchmarks were collected from other research groups (except REG and RCW). In order to test the robustness of our predictive models and MIP solvers, we also considered some heterogeneous benchmarks created by combining several homogeneous benchmarks.

BIGMIX This benchmark is a highly heterogeneous mix of 1 510 publicly available Mixed Integer Linear Programming (MILP) instances. The instances in this set have an average of 8 610 variables and 4 250 constraints. Some of the instances are very large, with up to 550 539 variables and 550 339 constraints.

CORLAT This benchmark comprises 2 000 MILP instances based on real data used for the construction of a wildlife corridor for grizzly bears in the Northern Rockies region (the instances were described by Gomes et al. (2008) and made available to us by Bistra Dilkina).
On average, the instances have 466 variables and 486 constraints.

RCW This benchmark comprises 1 980 MILP-encoded instances from computational sustainability, modelling the spread of the endangered red-cockaded woodpecker conditional on decisions about which parcels of land to protect. We generated the 1 980 instances (20 random instances for each combination of 9 maps and 11 budgets) using the generator from Ahmadizadeh et al. (2010) with the same parameter settings, but with a smaller sample size of 5.

REG This benchmark comprises 2 000 MILP-encoded instances based on the winner determination problem in combinatorial auctions. We used the regions generator from the Combinatorial Auction Test Suite [126], with the number of bids selected uniformly at random between 750 and 1250 and a fixed bids/goods ratio of 3.91 (following [130]).

CL∪REG This set is a mixture of two homogeneous subsets, CORLAT and REG. We randomly selected 1 000 CORLAT and 1 000 REG instances.

CL∪REG∪RCW This benchmark set is the union of CL∪REG and 990 randomly selected RCW instances.

ISAC(new) This set is a subset of the MIP benchmark set used by Kadioglu et al. (2010); we could not use the entire set, since the authors informed us that they irretrievably lost their test set. There are 276 instances in total.

MIX This is a very heterogeneous benchmark that combines the sets studied by Hutter et al. (2010). It includes all instances from MASS (100 instances), MIK (120 instances), and CLS (100 instances), as well as subsets of CL (120 instances) and REG200 (120 instances); see, e.g., [97] for descriptions of the underlying sets. There are 560 instances in total.

MIPLIBless MIPLIB is one of the most widely used benchmarks for studying and evaluating MIP solvers.
MIPLIBless consists of the 44 instances that can be solved by CPLEX 12.1 with default settings within 1 800 CPU seconds on our reference machines with Intel Xeon 3.2GHz CPUs.

NELAND This benchmark set comprises 640 MIP instances from Northeast Land Management [141] and is divided into 32 subsets, each containing 20 instances.

3.3 Traveling Salesperson (TSP)

The traveling salesperson problem (TSP) is one of the most widely studied combinatorial optimization problems. Given an edge-weighted directed graph G with vertices V = {v1, ..., vn}, the goal is to find a Hamiltonian cycle (tour) in G with minimal path weight. For simplicity, a TSP instance is often defined in such a manner that the underlying graph is complete, with very large edge weights for edges between disconnected nodes. Hence, a Hamiltonian cycle in G corresponds exactly to a cyclic permutation of the vertices in V. There are many different types of TSP instances, depending on the restrictions placed on their weight functions. The best-studied type is the Euclidean TSP, where the edge weight function w is a Euclidean distance metric.

Over the past five decades, work on TSP has been a driving force for many important research areas, such as stochastic local search [4, 107], branch-and-cut methods [5], and Ant Colony Optimization algorithms [191].

3.3.1 TSP Solvers

Most state-of-the-art complete algorithms for TSP are based on branch-and-cut methods. In brief, the basic process of branch-and-cut methods is to formulate the TSP as an integer programming (IP) problem and to repeatedly solve linear programming (LP) relaxations of it. First, a cutting-plane method is used to solve an LP relaxation of the TSP in which variables are allowed to take arbitrary values between 0 and 1.
If the optimal LP solution is also an IP solution, the algorithm terminates and reports the IP solution; otherwise, a new restriction (cut) is added to the LP relaxation that cuts off non-integer solutions but does not cut off any integer solution. The new LP relaxation (with the additional restrictions) is solved to optimality again. This procedure is repeated until no good "cut" can be found. At this stage, the current problem is branched into two sub-problems: an edge is selected and forced to be part of all solutions of one sub-problem, and excluded from all solutions of the other. For each sub-problem, the cutting-plane method is used again. Thus, branch-and-cut methods alternate between a cutting-plane step and a branching step until an integer solution is found. The state-of-the-art complete algorithm for TSP, Concorde, can solve very large TSP instances [35].

Much work on incomplete algorithms for solving TSP has focused on tour-construction heuristics and iterative tour-improvement algorithms. Tour-construction heuristics include the Nearest Neighbor Heuristic, the Insert Heuristic, and the Greedy Heuristic [169]. As an example, the Nearest Neighbor Heuristic constructs a tour starting from a randomly chosen vertex u1 and then iteratively adds one unvisited vertex uk+1 to the current partial tour (u1, ..., uk) such that the edge (uk, uk+1) has minimal weight. After all vertices have been visited, a complete tour is obtained by connecting the end vertex un of the partial tour to the starting vertex u1. In practice, the tours obtained by tour-construction heuristics are usually quite good for medium-size TSP instances with a few thousand nodes (within 11–16% of the optimal solutions) [83].

Most successful tour-improvement algorithms are based on k-exchange iterative improvement methods. Two candidate solutions, s and s′, are called direct k-exchange neighbors if and only if s′ can be obtained from s by deleting k edges and reconnecting the resulting k tour fragments into a complete tour with k new edges (the new edges may be the same as the deleted ones). The common choices of k are 2 or 3; much larger values of k are rarely used, because of the complexity of implementation and the high computational cost of each step (O(n^k)). Experimental results suggest that with larger k (4 or 5), an algorithm's solution quality improves only slightly [183]. Some additional techniques are available for making k-exchange more efficient, such as fixed-radius search, candidate lists, and don't-look bits. One of the most well-known tour-improvement algorithms for TSP is the Lin-Kernighan (LK) algorithm [136]. This rather complicated algorithm is the foundation of many state-of-the-art incomplete TSP algorithms, such as Iterated Lin-Kernighan [107], Chained Lin-Kernighan [4], and Iterated Helsgaun [70]. Using high-performance tour-improvement algorithms, optimal or very close to optimal solutions can be obtained within hours for TSP instances with tens of thousands of nodes [108].

Problem Size Features:
1. Number of nodes: denoted n

Cost Matrix Features:
2–4. Cost statistics: mean, variation coefficient, skew

Minimum Spanning Tree Features:
5–8. Cost statistics: sum, mean, variation coefficient, skew
9–11. Node degree statistics: mean, variation coefficient, skew

Cluster Distance Features:
12–14. Cluster distance: mean, variation coefficient, skew

Local Search Probing Features:
15–17. Tour cost from construction heuristic: mean, variation coefficient, skew
18–20. Local minimum tour length: mean, variation coefficient, skew
21–23. Improvement per step: mean, variation coefficient, skew
24–26. Steps to local minimum: mean, variation coefficient, skew
27–29. Distance between local minima: mean, variation coefficient, skew
30–32.
Probability of edges in local minima: mean, variation coefficient, skew

Branch and Cut Probing Features:
33–35. Improvement per cut: mean, variation coefficient, skew
36. Ratio of upper bound to lower bound
37–43. Solution after probing: percentage of integer and of non-integer values in the final solution after probing; for the non-integer values, statistics across nodes: min, max, 25%, 50%, 75% quantiles

Ruggedness of Search Landscape:
44. Autocorrelation coefficient

Timing Features:
45–50. CPU time required for feature computation: one feature for each of 6 computational subtasks

Node Distribution Features (after instance normalization):
51. Cost matrix standard deviation: standard deviation of the cost matrix after the instance has been normalized into the rectangle [(0,0), (400,400)]
52–55. Fraction of distinct distances: at precision of 1, 2, 3, 4 decimal places
56–57. Centroid: the (x, y) coordinates of the instance centroid
58. Radius: the mean distance from each node to the centroid
59. Area: the rectangular area within which the nodes lie
60–61. nNNd: the standard deviation and coefficient of variation of the normalized nearest neighbour distance
62–64. Cluster: #clusters / n, #outliers / n, variation of #nodes in clusters

Figure 3.3: 9 groups of TSP features

3.3.2 TSP Features

Figure 3.3 summarizes 64 features for the travelling salesperson problem (TSP). Features 1–50 are new, while features 51–64 were introduced by Smith-Miles et al. [185]. Features 51–64 capture the spatial distribution of nodes (features 51–61) and the clustering of nodes (features 62–64); we used the authors' code (available at http://www.vanhemert.co.uk/files/TSP-feature-extract-20120212.tar.gz) to compute these features.

Our 50 new TSP features are as follows.2 The problem size feature (1) is the number of nodes in the given TSP instance. The cost matrix features (2–4) are statistics of the costs between pairs of nodes.
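As an illustration of the cost matrix features just mentioned, here is a minimal sketch computing mean, variation coefficient, and skew over the pairwise Euclidean costs of a EUC-2D instance ("skew" is read here as the standardized third moment; the thesis implementation may differ in detail):

```python
import math
from itertools import combinations

def cost_stats(coords):
    """Cost matrix features (2-4): mean, variation coefficient, and skew
    of the pairwise Euclidean costs. A sketch for EUC-2D instances."""
    costs = [math.dist(a, b) for a, b in combinations(coords, 2)]
    n = len(costs)
    mean = sum(costs) / n
    std = math.sqrt(sum((c - mean) ** 2 for c in costs) / n)
    vc = std / mean if mean else 0.0
    # standardized third moment as the "skew" statistic
    skew = sum(((c - mean) / std) ** 3 for c in costs) / n if std else 0.0
    return mean, vc, skew
```

For a unit square, for instance, the six pairwise costs are four sides of length 1 and two diagonals of length √2, giving a positively skewed cost distribution.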
Our minimum spanning tree features (5–11) are based on constructing a minimum spanning tree over all nodes of the TSP instance: features 5–8 are statistics of the edge costs in the tree, and features 9–11 are based on its node degrees. Our cluster distance features (12–14) are based on the cluster distance between every pair of nodes, which is the minimum bottleneck cost of any path between them; here, the bottleneck cost of a path is defined as the largest edge cost along the path. Our local search probing features (15–32) are based on 20 short runs (1000 steps each) of LK [136], using the implementation available from [37]. Specifically, features 15–17 are based on the tour length obtained by LK; features 18–20, 21–23, and 24–26 are based on the tour length of local minima, the tour quality improvement per search step, and the number of search steps to reach a local minimum, respectively; features 27–29 measure the Hamming distance between two local minima; and features 30–32 describe the probability of edges appearing in any local minimum encountered during probing. Our branch and cut probing features (33–43) are based on 2-second runs of Concorde. Specifically, features 33–35 measure the improvement in the lower bound per cut; feature 36 is the ratio of upper and lower bound at the end of the probing run; and features 37–43 analyze the final LP solution. Feature 44 is the autocorrelation coefficient, a measure of the ruggedness of the search landscape based on an uninformed random walk (see, e.g., [83]). Finally, our timing features 45–50 measure the CPU time required for computing feature groups 2–7 (the cost of computing the number of nodes can be ignored).

3.3.3 TSP Benchmarks

We used three TSP benchmarks that come from random TSP generators and from TSPLIB. Detailed information about these benchmarks is as follows.

2 In independent work, Mersmann et al.
[144] have introduced feature sets similar to some of those described here.

PORTGEN This benchmark comprises 4 993 uniform random EUC-2D TSP instances generated by the random TSP generator portgen [106]. The number of nodes was randomly selected from 100 to 1 600, and the generated TSP instances have 849±429 nodes.

PORTCGEN This benchmark comprises 5 001 random clustered EUC-2D TSP instances generated by the random TSP generator portcgen [106]. The number of nodes was randomly selected from 100 to 1 600, and the number of clusters was set to 1% of the number of nodes. The generated TSP instances have 852±432 nodes.

TSPLIB This benchmark contains a subset of the prominent TSPLIB repository (http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/). We only included the 63 instances for which both our own feature computation code and the code by Smith-Miles and van Hemert (2011) completed successfully within 3 600 CPU seconds on our reference machines.

Chapter 4

Solution Prediction for SAT

Recent work has studied the use of regression methods to make instance-specific predictions of solver runtimes. Nudelman et al. (2004) showed that using this approach, surprisingly accurate runtime predictions can be obtained for uniform random 3-SAT. They also noticed that training models on only SAT or only UNSAT instances allowed much simpler, but very different, models to achieve high accuracies. (In Chapter 5, we demonstrate that such simpler models can be used to construct hierarchical hardness models for better runtime prediction.) Since unconditional models are able to predict runtimes accurately despite the qualitative differences between the SAT and UNSAT regimes, we believe that these models must implicitly predict satisfiability status. This chapter tests this hypothesis on one of the most difficult SAT benchmarks, uniform random 3-SAT at the phase transition, and shows how to build classification models that achieve accuracies of approximately 70% with a small set of features.
Two arguments demonstrate that this is not a small-size effect. First, the models' predictive accuracy remains roughly constant, and far better than that of random guessing (50%), across the entire range of problem sizes. Second, a classifier trained on our easiest (v = 100) instances again achieves very accurate predictions across the whole range of instance sizes. A detailed investigation shows that two features suffice to achieve good performance for all instance sizes: one based on the variation in the slack vector of an LP relaxation of the problem, and one based on the ratio of positive to negative literals in the formula. Finally, we trained a three-leaf decision tree based on these two features, and only on the smallest instances; it achieved prediction accuracies across the entire range of instance sizes close to those of our most complex model.1

4.1 Uniform Random 3-SAT and Phase Transition

A prominent family of SAT instances is uniform random 3-SAT. Instances from this class are easy to generate and often hard to solve, and they have often been used as a test bed in the design and evaluation of heuristic algorithms (see, e.g., Le Berre et al., 2012). One interesting phenomenon related to uniform random 3-SAT is the so-called solubility phase transition: the probability that a random 3-SAT instance is satisfiable exhibits sharp threshold behavior as the control parameter α = c/v passes a critical value [32, 146]. The window in which this solubility phase transition takes place becomes narrower as instance size grows.

Most interestingly, a wide range of state-of-the-art SAT solvers exhibit dramatically longer runtimes on instances in this critical region. For intuition, note that instances are under-constrained when α is small (few constraints exist, and therefore many solutions) and over-constrained when α is large (many constraints exist, making it relatively easy to derive a contradiction).
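To make the control parameter α = c/v concrete, here is a minimal sketch of uniform random 3-SAT generation: c = round(α·v) clauses, each over 3 distinct variables, with each literal negated independently with probability 1/2. The actual competition generator [182] used in this work differs in implementation detail:

```python
import random

def random_3sat(v, alpha, seed=0):
    """Sample a uniform random 3-SAT instance with v variables and
    round(alpha * v) clauses. Positive integers denote positive literals,
    negative integers negated ones (DIMACS-style). A minimal sketch, not
    the competition generator used in the thesis."""
    rng = random.Random(seed)
    clauses = []
    for _ in range(round(alpha * v)):
        variables = rng.sample(range(1, v + 1), 3)   # 3 distinct variables
        clauses.append([x if rng.random() < 0.5 else -x for x in variables])
    return clauses
```

With α well below the critical value the instance is almost surely satisfiable; well above it, almost surely unsatisfiable.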
The so-called phase transition point occurs between these extremes, where the probability of generating a satisfiable instance is 0.5. Crawford and Auton (1996) confirmed these findings in an extensive empirical study and proposed a more accurate formula for identifying the phase transition point. Kirkpatrick and Selman (1994) used finite-size scaling, a method from statistical physics, to characterize size-dependent effects near the transition point, with the width of the transition narrowing as the number of variables increases. Yokoo (1997) studied the behavior of simple local search algorithms on uniform random 3-SAT instances, observing a peak in the hardness of solving satisfiable instances at the phase transition point. He attributed this hardness peak to the relatively larger number of local minima present in critically constrained instances, as compared to over-constrained satisfiable instances.

There is a useful analogy between uniform random 3-SAT problems and what physicists call "disordered materials": the conflicting interactions in the latter are similar to the randomly negated variables in the former. Exploiting this connection, uniform random 3-SAT has been studied using methods from statistical physics. Monasson and Zecchina [148, 149] applied replica methods to determine the characteristics of uniform random 3-SAT and showed that the ground state entropy is finite at the phase transition. They concluded that the transition itself is due to the abrupt appearance of logical contradictions in all solutions, and not to a progressive decrease in the number of models.

1 This chapter is based on joint work with Holger Hoos and Kevin Leyton-Brown [214].

4.2 Experimental Setup

We considered uniform random 3-SAT instances generated at the solubility phase transition, with v ranging from 100 to 600 variables in steps of 25.
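The generator places each instance set at the threshold location quoted in Section 3.1.4, c = 4.258·v + 58.26·v^(−2/3). A quick sketch of the resulting clause counts:

```python
def clauses_at_threshold(v):
    """Clause count at the 3-SAT solubility phase transition, using the
    formula quoted in Section 3.1.4: c = 4.258*v + 58.26*v^(-2/3)."""
    return round(4.258 * v + 58.26 * v ** (-2 / 3))

# The finite-size correction term matters most for small v: the effective
# ratio c/v drifts down toward 4.258 as v grows.
ratios = {v: clauses_at_threshold(v) / v for v in (100, 300, 600)}
```

For example, v = 100 gives 429 clauses (ratio ≈ 4.29), while v = 600 gives 2 556 clauses (ratio 4.26).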
For each value ofv, we generated 1000 instances with different random seeds using the same instancegenerator as used in SAT competitions since 2002 [182]. In total, we generated21 instance sets jointly comprising 21000 3-SAT instances. For v = 100, 25000additional instances are generated (referred as v100(large)).We solved all of these instances using kcnfs07 [44] within 36000 CPU secondsper instance, with the exception of 2 instances for v = 575 and 117 for v = 600. Forthese instances, additional 5 runs of adaptg2wsat09++ [134] were performed witha cutoff time of 36000 CPU seconds, which failed to solve them. As this cutoffis more than 100 times larger than the runtime of the hardest instances solved byadaptg2wsat09++, and the largest increase in running time needed for any sizev > 475 to solve an additional instance we ever observed was lower than a factorof 6.5, we believe that these instances are unsatisfiable, and so treated them assuch for the remainder of our study. (Readers who feel uncomfortable with thisapproach should feel free to disregard our results for v = 575 and v = 600; none ofour qualitative conclusions is affected.)Figure 4.1 (left) shows the median runtime of kcnfs07 on both satisfiable andunsatisfiable instances across our 21 instance sets. Median kcnfs07 runtime in-creased exponentially with the number of variables, growing by a factor of about2.3 with every increase of 25 variables beyond v = 200 (smaller instances weresolved more quickly than the smallest CPU time one could measure). To verifythat the generating instances are indeed at the phase transition point, we examined42100 200 300 400 500 60010−2100102104106Instance SizeKcnfs07 Median Runtime (CPU seconds)  SATUNSAT 10−1 100 101 102 103 1040102030405060708090100Runtime (CPU seconds)Solved Percentage  SATUNSATFigure 4.1: Left: Median runtime of kcnfs07 for each instance set. 
The solu-tions of some instances in v = 575 and v = 600 were estimated by runningadaptg2wsat09++ for 36 000 CPU seconds. Right: CDF of kcnfs07’s runtimefor v = 500.the fraction of satisfiable and unsatisfiable instances for each set; the majority con-tained between 49 and 51% satisfiable instances (with mean 50.2% and standarddeviation 1.6%), and there was no indication that deviations from the 50% markcorrelated with instance size. The v100(large) set contained 49.5% satisfiable in-stances. As illustrated in Figure 4.1 (right) for v = 500, unsatisfiable instancestended to be harder to solve, and to give rise to less runtime variation; satisfiableinstances were easier on average but with larger runtime variation. Intuitively, toprove unsatisfiability, a complete solver such as kncfs07 needs to reason aboutthe entire space of candidate assignments, while satisfiability may be proven by ex-hibiting a single model. Depending on the number of solutions of a given instance,which is known to vary at the phase transition point, the search cost of finding thefirst one can vary significantly.This work used a state-of-the-art classifier, decision forests, as they can pro-duce good predictions with robust uncertainty estimates and direct visualizationfrom tree structures. Decision forests are constructed as collections of T decisiontrees [196]; in this work, a rather large number of trees (T = 99) is used for highclassification accuracy. Following Breiman (2001), given n training data pointswith k features each, for each tree we drew a bootstrap sample of n training datapoints sampled uniformly at random with repetitions; during tree construction, wesampled a random subset of log2(k)+1 features at each internal node to be consid-43ered for splitting the data at that node. Predictions were based on majority votingacross all T trees. In our case, the class labels were SAT and UNSAT. 
Since we used 99 trees, an instance i was classified as SAT if more than 49 trees predicted that it was satisfiable. We measured the decision forest's confidence as the fraction of trees that predicted i to be satisfiable; by choosing T as an odd number, we avoided the possibility of ties.

We used the feature set listed in Figure 3.1. The feature computation time depends on the size of the instance under consideration (e.g., ≈ 50 CPU seconds on average for all features of a single instance with v = 550, of which ≈ 41 CPU seconds were spent on computing the 6 LP-based features). Some easy instances are solved during the computation of certain features (in particular, local search probing features). Thus, even though these features have been found quite useful for other problems and tasks [156], we excluded them from this study and resolved to use 61 features, of which 7 are related to problem size, 29 to graph-based representations of the CNF formula, 13 to balance properties, 6 to proximity to a Horn formula, and 6 to LP relaxations.

For each instance size v, we first partitioned the respective instance set into two subsets based on satisfiability status, and then randomly split each subset 60:40 into training and test sets. Finally, we combined the training sets for SAT and UNSAT into the final training set, and the SAT and UNSAT test sets into the final test set. We trained our decision forests on the training sets only, and used only the test sets to measure model accuracy. In order to reduce variance in these accuracy measurements, we repeated this whole process 25 times (with different random training/test splits); the results reported here are medians across these 25 runs.

All runtime and feature data were collected on a computer cluster with 840 nodes, each equipped with two 3.06 GHz Intel Xeon 32-bit processors and 2GB of RAM per processor.
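The stratified splitting and repetition protocol just described can be sketched as follows (a simplified illustration; function names are mine):

```python
import numpy as np

def stratified_split(is_sat, train_frac=0.6, rng=None):
    """Split SAT and UNSAT instances 60:40 separately, then recombine,
    so both classes appear proportionally in the training and test sets."""
    if rng is None:
        rng = np.random.default_rng(0)
    train = np.zeros(len(is_sat), dtype=bool)
    for cls in (False, True):
        idx = np.flatnonzero(is_sat == cls)
        rng.shuffle(idx)
        train[idx[: int(round(len(idx) * train_frac))]] = True
    return train, ~train

def median_test_accuracy(is_sat, evaluate, n_reps=25):
    """Median accuracy over 25 random splits; `evaluate` is assumed to
    train a classifier on the training mask and return test accuracy."""
    accs = [evaluate(*stratified_split(is_sat, rng=np.random.default_rng(r)))
            for r in range(n_reps)]
    return float(np.median(accs))
```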
The decision forest classifier was implemented in Matlab, version R2010a, which we also used for data analysis.

Figure 4.2: Classification accuracies achieved on our 21 primary instance sets. The blue box plots are based on 25 replicates of decision forest models, trained and evaluated on different random splits of training and test data. The median prediction accuracies of the decision forest trained on v100(large) are shown as red stars. The median prediction accuracies of a single decision tree trained on v100(large) based on two features are shown as green squares.

4.3 Experimental Results

At the solubility phase transition, uniform random 3-SAT instances are equally likely to be satisfiable or unsatisfiable. Thus, random guessing achieves a predictive accuracy of only 50%. The first goal was to investigate the extent to which our models were able to make more accurate predictions. As shown in Figure 4.2 (first data series) and Table 4.1, the models did achieve accuracies of between about 70% and 75%. There was no significant difference in the frequency of the two possible prediction errors (predicting SAT as UNSAT and vice versa).

Figure 4.3 shows two sample distributions (v = 200 and v = 500) of the classifier's confidence. The plots for other instance sets were qualitatively similar. Recall that the confidence of the classifier is measured by the fraction of 'SAT' predictions among the 99 trees.
Therefore, the classifier had full confidence if all 99 predictions were consistent, and low confidence if the numbers of 'SAT' and 'UNSAT' predictions were about the same. As shown in Figure 4.3, the classifier had low levels of confidence more often than high levels of confidence. The representative examples in the left and right panes of Figure 4.3 illustrate that our decision forests had more difficulty with small instances, in the sense that they were uncertain about a larger fraction of such instances.

Table 4.1: The performance of decision forests with 61 features on our 21 primary instance sets. We report median classification accuracy over 25 replicates with different random splits of training and test data, as well as the fraction of false positive and false negative predictions.

Variables   Median Accuracy   False Positives   False Negatives
100         0.694             0.138             0.168
125         0.709             0.125             0.166
150         0.702             0.148             0.150
175         0.702             0.155             0.144
200         0.682             0.153             0.164
225         0.703             0.148             0.153
250         0.697             0.158             0.148
275         0.740             0.140             0.120
300         0.714             0.143             0.143
325         0.749             0.122             0.130
350         0.704             0.151             0.143
375         0.697             0.148             0.155
400         0.724             0.143             0.135
425         0.727             0.138             0.135
450         0.740             0.128             0.132
475         0.744             0.118             0.138
500         0.737             0.130             0.133
525         0.733             0.143             0.125
550         0.747             0.120             0.133
575         0.762             0.113             0.125
600         0.732             0.129             0.139

As one would hope, confidence was positively correlated with classification accuracy. This can be seen by comparing the heights of the bars for correct and incorrect predictions at each predicted probability of SAT. When the predicted probability of SAT was close to 0 or 1, the classifier was almost always correct; when the predicted probability of SAT was close to 0.5, accuracy dropped towards 0.5 (i.e., random guessing).

Figure 4.3: Classifier confidence vs fraction of instances. Left: v = 200; Right: v = 500.

The decision forest's confidence was also correlated with kcnfs07's runtime. As shown in Figure 4.4, instances tended to be easier to solve when the predicted probability of SAT was close to either 0 or 1. Recall that variation in runtime was more pronounced on satisfiable instances, as previously illustrated in Figure 4.1 (right).

Figure 4.4: Classifier confidence vs instance hardness. Each marker ([x, y]) shows the average runtime of kcnfs07 over a bin of instances with classifier confidence (predicted probability of SAT) between x − 0.05 and x. Each marker's intensity corresponds to the amount of data inside the bin. Left: v = 200; Right: v = 500.

We now examine the hypothesis that our models' prediction accuracy decreases as problem size grows. Two pieces of evidence suggest that this hypothesis should be rejected. The first is the result of a pairwise comparison of the classification accuracies obtained from the full decision forest models trained for each instance size. For each pair of data sets with instance sizes i and j (i > j), Figure 4.5 shows a blue dot when classification accuracy on size i was significantly higher than on size j, and a yellow dot when classification accuracy on size i was significantly lower than on size j, according to a Mann-Whitney U test. Among the 210 paired comparisons with significance level 0.05, there are 133 blue dots (63.3%) and 21 yellow dots (10.0%).
Thus, there is little evidence that prediction accuracy decreases as instance size grows; indeed, the data shown in Figure 4.5 appear to be more consistent with the hypothesis that prediction accuracy increases with instance size.

The second piece of evidence against the hypothesis of lower accuracies for bigger problems is that models trained only on the smallest problems achieved high levels of predictive accuracy across the whole range of problem sizes. The red stars in Figure 4.2 (the second data series) indicate the performance of the decision forest trained on v100(large) evaluated on problems of other sizes. This single model performed about as well as, and in many cases better than, the models specialized to different problem sizes. Although we do not report the results here, we also trained decision forests on each of the other 20 instance sets; in each case, the models generalized across the entire range of problem sizes in qualitatively the same manner.

The next task is to identify the smallest set of features that can be used to build accurate models. We hope that such a feature set will be useful to other researchers seeking a theoretical explanation of the phenomenon identified in this chapter. Since the most predictive features might differ between large and small instances, we divided the 21 instance sets into two groups: small (10 instance sets, v = 100 to 325) and large (10 instance sets, v = 375 to 600). We did not use the 350-variable set in this analysis in order to keep the two groups balanced. For every subset of the 61 features with cardinality 1, 2 and 3, we measured the accuracy of the corresponding model. For both small and large, the best 1- and 2-feature sets were subsets of the best 3-feature set. Next, we used a forward selection procedure to build subsets of up to 10 features, as exhaustive enumeration was infeasible beyond 3 features.
Specifically, starting with the best 3-feature set, for every feature not yet in the set we computed the mean of median classification accuracy across all of the instance sets in the group, and then added the feature with the best such accuracy to the feature subset. These steps were repeated until the subset contained 10 features.

Figure 4.5: Statistical significance of pairwise differences in classification accuracy for our 21 primary instance sets. Yellow: accuracy on the smaller instance size is significantly higher than on the larger size. Blue: accuracy on the smaller instance size is significantly lower than on the larger size. No dot: the difference is insignificant. Significance level: p = 0.05.

Table 4.2 describes the resulting feature sets, as well as the improvement in classification accuracy achieved at each step. For both small and large, the classifier was able to achieve good predictive accuracy with a small number of features; for each number of features, the classification accuracy on large instances was better than on small instances. The two most informative features were identical for both groups, and adding features beyond this point offered little marginal benefit.

Table 4.2: The mean of median classification accuracy with up to 10 features selected by forward selection. The stepwise improvement for a feature f_i at forward selection step k is the improvement when we add f_i to the existing k − 1 features. Each median classification accuracy is based on the results of 25 runs of classification with different random splits of training and test data.

Group   Features ordered by forward selection   Accuracy   Stepwise Improvement
small   LPSLACK_coeff_variation                 0.614      –
        POSNEG_ratio_var_mean                   0.670      0.056
        LP_OBJ                                  0.681      0.011
        VG_mean                                 0.688      0.007
        LPSLACK_max                             0.690      0.002
        VG_max                                  0.692      0.002
        VCG_var_max                             0.694      0.002
        HORNY_var_coeff_variation               0.694      0.000
        LPSLACK_mean                            0.695      0.001
        LP_int_ratio                            0.697      0.002
large   LPSLACK_coeff_variation                 0.646      –
        POSNEG_ratio_var_mean                   0.696      0.050
        LPSLACK_mean                            0.706      0.010
        LP_int_ratio                            0.714      0.008
        VCG_clause_max                          0.720      0.006
        CG_mean                                 0.721      0.001
        TRINARYp                                0.725      0.004
        HORNY_var_coeff_variation               0.727      0.002
        DIAMETER_entropy                        0.728      0.001
        POSNEG_ratio_clause_entropy             0.728      0.000

It is worth understanding the meaning of these features. LPSLACK_coeff_variation is based on solving a linear programming relaxation of an integer program representation of SAT instances. For each variable i with LP solution S_i, LPSLACK_i is defined as min{1 − S_i, S_i}: S_i's proximity to integrality. LPSLACK_coeff_variation is then the coefficient of variation of the vector LPSLACK. POSNEG_ratio_var_mean measures the average imbalance between positive and negative occurrences of each variable. For each variable i with P_i positive occurrences and N_i negative occurrences, POSNEG_ratio_var_i is defined as 2 · |0.5 − P_i/(P_i + N_i)|. POSNEG_ratio_var_mean is then the average over the elements of the vector POSNEG_ratio_var.

Figure 4.6: Distribution of LPSLACK_coeff_variation over instances in each of our 21 sets. Left: SAT; Right: UNSAT. Top: original value; Bottom: value after normalization.
The line at y = 0.0047 indicates the decision threshold used in the tree from Figure 4.8.

There are three main findings: (1) models achieved high accuracies; (2) models trained on small instances were effective for large instances; (3) a model consisting of only two features was nearly as accurate as the full model. We now show that all of these findings hold simultaneously: we were able to achieve high accuracies using a two-feature model trained only on small instances. Specifically, we constructed a single decision tree (rather than a random forest) using only the LPSLACK_coeff_variation and POSNEG_ratio_var_mean features, trained on only the very easiest instances, v100(large). We further simplified this model by setting the parameter minparent of the tree-building procedure to 10 000. The minparent parameter defines the smallest number of observations that impure nodes may contain before they are allowed to further split; setting it to such a large value forced the decision tree to be extremely simple.

Figure 4.7: Distribution of POSNEG_ratio_var_mean over instances in each of our 21 instance sets. Left: SAT; Right: UNSAT. Top: original value; Bottom: value after normalization. The line at y = 0.1650 indicates the decision threshold used in the tree from Figure 4.8.

This tree's performance is plotted using green squares (third data series) in Figure 4.2.
Overall, it achieved remarkably good prediction accuracies of more than 65% on all instance sets.

Figure 4.8 shows the decision tree. First, it classifies instances as satisfiable if LPSLACK_coeff_variation takes a large value: that is, if LPSLACK exhibits large variance across the variables in the given formula (region A). When LPSLACK_coeff_variation takes a small value, the model considers the balance between positive and negative literals in the formula (POSNEG_ratio_var_mean). If the literals' signs are relatively balanced, the model predicts unsatisfiability (region B); otherwise, it predicts satisfiability (region C).

Figure 4.8: The decision tree trained on v100(large) with only the features LPSLACK_coeff_variation (threshold 0.00466585) and POSNEG_ratio_var_mean (threshold 0.164963), and with minparent set to 10 000.

To gain further understanding of the effectiveness of this model, we partitioned each of our 21 data sets into the three regions and observed the fraction of each partition that was correctly labeled by the tree. These fractions were between 60 and 70% (region A), between 70 and 80% (region C), and about 50% (region B).

Finally, Figures 4.6 and 4.7 show the distributions of the LPSLACK_coeff_variation and POSNEG_ratio_var_mean features over each of our 21 instance sets, before and after normalization, considering satisfiable and unsatisfiable instances separately. Note that both features' pre-normalization variation decreased with instance size, while their median values remained relatively constant. After normalization, both features' distributions remained very similar as instance size increased. The decision thresholds used by our simple decision tree are plotted as solid horizontal lines in these figures.
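Put together, the two features and the three-leaf rule are simple enough to state in a few lines of code. This sketch follows the definitions and thresholds above; the branch orientation follows the chapter's prose, and all function names are mine:

```python
import numpy as np

def posneg_ratio_var_mean(clauses, n_vars):
    """Mean over variables of 2*|0.5 - P_i/(P_i + N_i)|, where P_i (N_i)
    counts positive (negative) occurrences of variable i."""
    pos, neg = np.zeros(n_vars), np.zeros(n_vars)
    for clause in clauses:                  # clauses as lists of signed ints
        for lit in clause:
            (pos if lit > 0 else neg)[abs(lit) - 1] += 1
    ratio = pos / np.maximum(pos + neg, 1)
    return float(np.mean(2.0 * np.abs(0.5 - ratio)))

def lpslack_coeff_variation(lp_solution):
    """Coefficient of variation of LPSLACK_i = min(1 - S_i, S_i), where S_i
    is variable i's value in the LP relaxation's solution."""
    slack = np.minimum(1.0 - lp_solution, lp_solution)
    return float(np.std(slack) / np.mean(slack))

def predict_satisfiability(lpslack_cv, posneg_rvm):
    """The three-leaf decision tree of Figure 4.8."""
    if lpslack_cv >= 0.00466585:
        return "SAT"      # region A: large LPSLACK variation
    if posneg_rvm < 0.164963:
        return "UNSAT"    # region B: balanced literal signs
    return "SAT"          # region C: imbalanced literal signs
```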
Both thresholds were located near the 25th or 75th percentiles of the respective distributions, regardless of instance size.

4.4 Conclusion

Uniform random 3-SAT instances from the solubility phase transition are, considering their size, challenging to solve. Nevertheless, this work has shown that the satisfiability of such instances can be predicted efficiently and with surprisingly high accuracy. The experimental results demonstrated that high prediction accuracies (19.4% – 26.2% better than random guessing) can be achieved across a wide range of instance sizes, and that there is little support for the hypothesis that this accuracy decreases as instance sizes grow. The predictive confidence of our classifiers correlates with the prediction accuracy obtained and with the runtime of a state-of-the-art complete SAT solver. A classifier trained on small, very easy instances also performed well on large, extremely challenging instances. Furthermore, the features most important to models trained on different problem sizes were substantially the same. Finally, using only two features, LPSLACK_coeff_variation and POSNEG_ratio_var_mean, we could build a trivial, three-leaf decision tree that achieved classification accuracies only slightly below those of the most complex decision forest classifier. Examining the operation of this model, we observe the surprisingly simple rules that instances with large variation in LPSLACK (the distance of LP solution values to integrality) across variables are likely to be satisfiable, while instances with small variation in LPSLACK and roughly the same number of positive and negative literals are likely to be unsatisfiable.
We hope that these rules will lead to novel heuristics for SAT solvers targeting random instances, and will serve as a starting point for new theoretical analysis of uniform random 3-SAT at the phase transition.

Chapter 5
Runtime Prediction with Hierarchical Hardness Models

The previous chapter showed that the satisfiability of uniform random 3-SAT instances at the phase transition can be predicted with high accuracy. This chapter demonstrates that such satisfiability prediction can be used to construct hierarchical hardness models for better runtime prediction. First, we confirm the observation of Nudelman et al. (2004) that better models can be learned from only satisfiable instances (SAT) or only unsatisfiable instances (UNSAT). Then, we show that predicting satisfiability is possible in general, by considering a variety of both structured and unstructured SAT instances. More importantly, satisfiability prediction is useful for combining SAT models and UNSAT models into hierarchical hardness models using a mixture-of-experts approach. Such hierarchical models improve overall runtime prediction accuracy. Classification confidence correlates with runtime prediction accuracy, giving useful per-instance evidence about the quality of the runtime prediction.¹

5.1 Empirical Hardness Models

For a given problem instance, empirical hardness models predict the runtime of an algorithm based on polytime-computable instance features using regression techniques.

¹ This chapter is based on joint work with Holger Hoos and Kevin Leyton-Brown [208].
One of the most popular methods is linear basis-function ridge regression, which has previously been demonstrated to be very successful in studies of uniform random SAT and combinatorial auctions [127, 156].

5.1.1 Overview of Linear Basis-function Ridge Regression

In order to predict the runtime of an algorithm A on a distribution D of problem instances, one first collects the runtimes y of algorithm A on n instances drawn from D and computes their p-dimensional feature vectors X. Let I_p be the p × p identity matrix, and let ε be a small constant. Then, compute the weight vector

    w = (X^T X + ε I_p)^{-1} X^T y,

where X^T denotes the transpose of matrix X. The effect of ε > 0 is to regularize the model by penalizing large coefficients w and to improve numerical stability. For the latter reason, we also use forward selection to eliminate highly correlated features. Since algorithm runtime can often be better approximated by a polynomial function than by a linear one, a quadratic basis-function expansion was used by Leyton-Brown et al. (2002), augmenting each model input X_i = [x_{i,1}, ..., x_{i,p}]^T with pairwise product inputs x_{i,j} · x_{i,q} for j = 1, ..., p and q = j, ..., p. Then, another pass of forward feature selection is performed to select a subset of extended features Φ. The final empirical hardness model is learned from Φ and y. Given a new, unseen instance with extended feature vector Φ_{n+1}, ridge regression predicts f_w(Φ_{n+1}) = w^T Φ_{n+1}.

Empirical hardness models have a probabilistic interpretation. The features Φ and the empirical algorithm runtime y, when seen as random variables, are related by a simple graphical model in which the feature vector Φ (or X) is always observed and the probability distribution over runtime y is conditionally dependent on Φ. Since one trains a linear model using least-squares fitting, we have implicitly chosen to represent P(y|Φ) as a Gaussian with mean w^T Φ and some fixed variance β. The prediction of an empirical hardness model is actually E(y|Φ), the mean of this distribution conditioned on the observed feature vector.

Figure 5.1: Graphical model for our mixture-of-experts approach. The features and the probability of satisfiability (x, s) feed a model-selection oracle z, which determines the distribution over runtime y.

5.1.2 Hierarchical Hardness Models

Previous work [156] has shown that if instances are restricted to be either only satisfiable or only unsatisfiable, very different models are needed to make accurate runtime predictions. Furthermore, models for each type of instance are simpler and more accurate than models that must handle both types, which means that better empirical hardness models can be built if one knows the satisfiability of instances. With accurate satisfiability prediction (shown in the previous chapter), it would be tempting to construct a hierarchical model that uses a classifier to pick the most likely conditional model and then simply returns that model's prediction. However, while this approach could sometimes be a good heuristic, it is not theoretically sound. Intuitively, the problem is that the classifier does not take into account the accuracies of the different conditional models. The model trained on one class of instances could have very large (indeed, unbounded) prediction error on instances from another class.

A more principled method of combining conditional models can be derived from the probabilistic interpretation of empirical hardness models given in Section 5.1.1. As before (see Figure 5.1), there is a set of features that are always observed and a random variable representing runtime that is conditionally dependent on the features. Now the features and our classifier's prediction s are combined into a new feature vector (Φ, s). A new random variable, z ∈ {sat, unsat}, is introduced to represent the oracle's choice of which conditional model will perform best for a given instance.
Instead of selecting one of the predictions from the two conditional models for runtime y, we use their weighted sum:

    P(y|(Φ,s)) = Σ_{z ∈ {sat, unsat}} P(z|(Φ,s)) · P_{M_z}(y|(Φ,s)),    (5.1)

where P_{M_z}(y|(Φ,s)) is the probability of y evaluated according to model M_z. Since the models are fit using ridge regression, Eq. (5.1) can be written as

    P(y|(Φ,s)) = Σ_{z ∈ {sat, unsat}} P(z|(Φ,s)) · N(y | w_z^T Φ, β_z),    (5.2)

where w_z and β_z are the weights and the standard deviation of model M_z, respectively. The weighting functions P(z|(Φ,s)) are learned so as to maximize the likelihood of the training data under P(y|(Φ,s)). As a hypothesis space for these weighting functions we chose the commonly used softmax function

    P(z = sat|(Φ,s)) = e^{v^T (Φ,s)} / (1 + e^{v^T (Φ,s)}),    (5.3)

where v is a vector of free parameters that must be learned [24]. The loss function is therefore

    L = Σ_{i=1}^{N} ( ŷ_i − Σ_{k ∈ {sat, unsat}} P(z = k|(Φ_i, s_i)) · E(y_i, k|Φ_i) )²,    (5.4)

where E(y_i, k|Φ_i) is the prediction of M_k and ŷ_i is the actual runtime. This can be seen as a mixture-of-experts problem with the experts clamped to M_sat and M_unsat (see, e.g., [24]). For implementation convenience, we used an existing mixture-of-experts implementation, which is built around an EM algorithm and performs iteratively reweighted least squares in the M step [150]. This code was modified slightly to clamp the experts and to set the initial values of P(z|(Φ,s)) to s (i.e., we initialized the choice of experts to the classifier's output).
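A minimal sketch of the clamped mixture at prediction time follows; in practice the experts and the gate vector v come out of the EM fit described above, and all names here are mine:

```python
import numpy as np

def ridge_weights(X, y, eps=1e-3):
    """Expert training (Section 5.1.1): w = (X^T X + eps*I_p)^(-1) X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + eps * np.eye(p), X.T @ y)

def gate_prob_sat(v, phi, s):
    """Eq. (5.3): softmax gate on the extended feature vector (phi, s);
    e^t / (1 + e^t) written in the numerically stable sigmoid form."""
    t = v @ np.append(phi, s)
    return 1.0 / (1.0 + np.exp(-t))

def hhm_predict(phi, s, v, w_sat, w_unsat):
    """Confidence-weighted sum of the two clamped experts, as in Eq. (5.2)."""
    p_sat = gate_prob_sat(v, phi, s)
    return p_sat * (w_sat @ phi) + (1.0 - p_sat) * (w_unsat @ phi)
```

With v = 0, the gate is indifferent (p_sat = 0.5) and the prediction is the plain average of the two experts; EM moves v so that each instance leans towards the expert that explains its runtime best.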
To evaluate the model and obtain a runtime prediction for a test instance, we simply compute the features X and the classifier's output s, and then evaluate

    E(y|(Φ,s)) = Σ_{k ∈ {sat, unsat}} P(z = k|(Φ,s)) · w_k^T Φ,    (5.5)

where w_k are the weights of model M_k and Φ is the basis-function expansion of X. Thus, the classifier's output is not used to directly select a model, but rather serves as a feature upon which the weighting functions P(z|(Φ,s)) depend, and as the initialization of the EM algorithm.

5.2 Experimental Setup

For the experiments conducted throughout this chapter, two distributions of unstructured SAT instances and two of structured instances were selected: rand3-fix and rand3-var with 400 variables, QCP, and SW-GCP. Each instance set was randomly split into training, validation and test sets at a ratio of 70:15:15. All parameter tuning was performed on the validation sets; test sets were used only to generate the final results reported in this chapter. For each instance, 84 features from nine categories were computed: problem size, variable-clause graph, variable graph, clause graph, balance features, proximity to Horn formulae, LP-based, DPLL search space, and local search space. We used quadratic basis functions in addition to raw features to achieve better runtime prediction accuracy. The accuracy of logarithmic runtime prediction was measured by root mean squared error (RMSE).

For uniform random 3-SAT instances, we ran four solvers that are known to perform well on these distributions: kcnfs [44], oksolver [121], march_dl [72], and satz [133]. For structured SAT instances, we ran six solvers that are known to perform well on these distributions: oksolver, zchaff [221], sato [220], satelite [45], minisat [46], and satzoo [46].
Note that in the 2005 SAT competition, satelite won gold medals in the Industrial and Handmade SAT+UNSAT categories; minisat and zchaff won silver and bronze, respectively, in Industrial SAT+UNSAT; and kcnfs and march_dl won gold and silver, respectively, in the Random SAT+UNSAT category.

For classification, we used Sparse Multinomial Logistic Regression (SMLR) [120], a state-of-the-art sparse classification algorithm. Similar to relevance vector machines and sparse probit regression, SMLR uses sparsity-promoting priors to control the expressivity of the learned classifier, which tends to result in better generalization. SMLR encourages parameter weights either to be significantly large or exactly zero. It also learns a sparse multi-class classifier that scales favorably in both the number of training samples and the input dimensionality, which is important for our problems, since we have tens of thousands of samples per data set. We used the same set of raw features as was available to the regression models, although in this case we did not find it necessary to use a basis-function expansion of these features. Note that, since the goal of this chapter was to obtain the highest possible classification accuracy, we also used probing features in this study.

We performed all of our experiments using a cluster consisting of 50 computers equipped with dual Intel Xeon 3.2GHz CPUs with 2MB cache and 2GB RAM, running Suse Linux 9.1. All runs of any solver that exceeded 1 CPU hour were terminated and recorded in our database of experimental results with a runtime of 1 CPU hour; this occurred in fewer than 3% of all runs.

5.3 Experimental Results

This section presents three sets of results. Section 5.3.1 confirms Nudelman et al.'s observation that much simpler and more accurate empirical hardness models can be learned when all instances are either satisfiable or unsatisfiable [156].
We refer to these two models as conditional models, while models trained on all instances are referred to as unconditional models. Let M_sat (M_unsat) denote a model trained only on satisfiable (unsatisfiable) instances. Models equipped with an oracle that knows which conditional model performs better for a given instance are referred to as oracular models. (Note that the oracle chooses the best model for a particular instance, not the model trained on data with the same satisfiability status as the instance.) Section 5.3.2 shows that satisfiability can be accurately predicted for a variety of data sets. Section 5.3.3 compares the prediction accuracy of hierarchical hardness models with that of unconditional models and oracular models.

5.3.1 Performance of Conditional and Oracular Models

Figure 5.2 shows the difference between using oracular models and unconditional models on structured SAT instances (distribution: QCP, solver: satelite). For oracular models, we observed almost perfect predictions of runtime for unsatisfiable instances and noisier, but unbiased, predictions for satisfiable instances (Figure 5.2, left). Figure 5.2 (right) shows that the runtime predictions for unsatisfiable instances made by unconditional models can exhibit both less accuracy and more bias.

Figure 5.2: Prediction accuracy comparison of the oracular model (left, RMSE=0.247) and the unconditional model (right, RMSE=0.426). Distribution: QCP, solver: satelite.

Even though using the best conditional model can result in higher prediction accuracy, there is a large penalty for using the wrong conditional model to predict the runtime of an instance.
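The RMSE values quoted in these comparisons are root mean squared errors over log runtime predictions; assuming base-10 logarithms (as suggested by the figures' log-runtime axes; the function name is mine), the computation is simply:

```python
import numpy as np

def rmse_log10(predicted_runtimes, actual_runtimes):
    """RMSE of log10 runtime predictions, the accuracy measure used
    throughout this chapter (runtimes in CPU seconds)."""
    err = np.log10(predicted_runtimes) - np.log10(actual_runtimes)
    return float(np.sqrt(np.mean(err ** 2)))
```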
Figure 5.3 (left) shows that if M_sat is used for runtime prediction on an unsatisfiable instance, the prediction error is often very large. The large bias in these inaccurate predictions is due to the fact that models trained on different types of instances are very different. As shown in Figure 5.3 (right), similar phenomena occur when we use M_unsat to predict the runtime on a satisfiable instance.

Figure 5.3: Actual vs predicted logarithmic runtime using only M_sat (left, RMSE=1.493) and only M_unsat (right, RMSE=0.683), respectively. Distribution: QCP, solver: satelite.

The results in Table 5.1 are consistent across data sets and solvers. Oracular models always achieve higher accuracy than unconditional models. The very large prediction errors in Table 5.1 for M_sat and M_unsat indicate that these models are very different. In particular, the RMSE for using a model trained on unsatisfiable instances to predict runtimes on a mixture of instances is as high as 14.914 (distribution: QCP, solver: sato).

Table 5.1: Accuracy of hardness models for different solvers and instance distributions.

                 RMSE for rand3-var models                 RMSE for rand3-fix models
Solver      sat.    unsat.   unconditional  oracular   sat.    unsat.   unconditional  oracular
satz        5.481   3.703    0.385          0.329      0.459   0.835    0.420          0.343
march_dl    1.947   3.705    0.396          0.283      0.604   1.097    0.542          0.444
kcnfs       4.766   4.765    0.373          0.294      0.550   0.983    0.491          0.397
oksolver    8.169   4.141    0.443          0.356      0.689   1.161    0.596          0.497

                 RMSE for QCP models                       RMSE for SW-GCP models
Solver      sat.    unsat.   unconditional  oracular   sat.    unsat.   unconditional  oracular
zchaff      1.866   1.163    0.675          0.303      1.230   1.209    0.993          0.657
minisat     1.761   1.150    0.574          0.305      1.280   1.275    1.022          0.682
satzoo      1.293   0.876    0.397          0.240      0.709   0.796    0.581          0.384
satelite    1.493   0.683    0.426          0.247      1.232   1.226    0.970          0.618
sato        2.373   14.914   0.711          0.375      1.682   1.887    1.353          0.723
oksolver    1.213   1.062    0.548          0.427      1.807   2.064    1.227          0.601

Unfortunately, oracular models rely on information that is unavailable in practice: the respective accuracies of the two conditional models on a given (test) instance. Nevertheless, the prediction accuracies achieved by oracular models suggest that it may be promising to find some practical means of combining conditional models. However, doing so could also be harmful: if one makes the wrong choices, prediction error can be much higher than when using an unconditional model.

5.3.2 Performance of Classification

Considering the difficulty of the classification task, our experimental results proved very good. Overall accuracy on test data was as high as 98%, and never lower than 73%, substantially better than random guessing. Compared to Chapter 4, we show that classification can be applied to a large variety of instance distributions. With probing features (e.g., local-search probing), we were able to achieve even better classification accuracy for rand3sat-fix. Furthermore, the classifier was usually very confident about the satisfiability of an instance (i.e., it returned probabilities very close to 0 or 1), and the more confident the classifier was, the more accurate it tended to be. These results are summarized in Figures 5.4–5.6.

Figure 5.4: Classification accuracy for different data sets.

Dataset        on sat.   on unsat.   overall
rand3sat-var   0.9791    0.9891      0.9840
rand3sat-fix   0.8480    0.8814      0.8647
QCP            0.9801    0.9324      0.9597
SW-GCP         0.7516    0.7110      0.7340

For the rand3-var data set (Figure 5.5, left), the overall classification error was only 1.6%.
Using only the clauses-to-variables ratio (greater or less than 4.26) as the basis for predicting the satisfiability of an instance yielded an error of 3.7%; therefore, by using SMLR rather than this simple classifier, the classification error was halved.

Figure 5.5: Classification accuracy vs. classifier output (top) and fraction of instances within the given set vs. classifier output (bottom). Left: rand3-var; right: QCP.

On the QCP data set (Figure 5.5, right), classification accuracy was 96%, and the classifier was extremely confident in its predictions. For rand3sat-fix (Figure 5.6, left), the classification accuracy was considerably higher than in the previous chapter (86% vs. 75%), with probing features now included. For SW-GCP (Figure 5.6, right), classification accuracy was much lower (73%). This was because the features were less predictive on this instance distribution, which is consistent with the results of the unconditional hardness models for SW-GCP. Note that the fraction of instances for which the classifier was confident was smaller for the last two distributions than for rand3-var and QCP. However, even for SW-GCP, there was a strong correlation between the classifier's output and classification accuracy on test data.

One further interesting finding is that classifiers can achieve very high accuracies even given very small sets of features. For example, on the QCP data, the SMLR classifier achieved 93% accuracy with only 5 features. The five most important features for classification on all four data sets are shown in Table 5.2.
Local-search-based features turned out to be very important for classification in all four data sets, which may be because easy instances can be solved while computing these probing features.

Figure 5.6: Classification accuracy vs. classifier output (top) and fraction of instances within the given set vs. classifier output (bottom). Left: rand3-fix; right: SW-GCP.

Data sets                rand3-var                         rand3-fix
Five                     gsat_BestCV_Mean                  saps_BestSolution_CoeffVariance
features                 saps_BestStep_CoeffVariance       gsat_BestSolution_Mean
                         lobjois_mean_depth_over_vars      saps_BestCV_Mean
                         VCG_VAR_max                       lobjois_mean_depth_over_vars
                         saps_BestSolution_Mean            gsat_BestCV_Mean
Accuracy (5 features)    98.4%                             86.5%
Accuracy (all features)  98.4%                             86.5%

Data sets                QCP                               SW-GCP
Five                     lobjois_log_num_nodes_over_vars   vars_reduced_depth
features                 saps_BestSolution_Mean            gsat_BestCV_Mean
                         saps_BestCV_Mean                  nvars
                         vars_clauses_ratio                VCG_VAR_min
                         saps_BestStep_CoeffVariance       saps_BestStep_Mean
Accuracy (5 features)    93.0%                             73.2%
Accuracy (all features)  96.0%                             73.4%

Table 5.2: The five most important features (listed from most to least important) for classification, as chosen by backward selection.

Overall, the experimental results confirmed that a classifier may be used to make accurate polynomial-time predictions pertaining to the satisfiability of SAT instances. This finding may be useful in its own right. For example, researchers interested in evaluating incomplete SAT algorithms on large numbers of satisfiable instances drawn from a distribution that produces both satisfiable and unsatisfiable instances could use a complete search algorithm to label a relatively small training set, and then use the classifier to filter instances.

5.3.3 Performance of Hierarchical Models

The broader performance of the different unconditional, oracular, and hierarchical models is shown in Table 5.3.

            RMSE (rand3-var models)                  RMSE (rand3-fix models)
Solvers     oracular  uncond.      hier.             oracular  uncond.      hier.
satz        0.329     0.385 (85%)  0.344 (96%)       0.343     0.420 (82%)  0.413 (83%)
march_dl    0.283     0.396 (71%)  0.306 (92%)       0.444     0.542 (82%)  0.533 (83%)
kcnfs       0.294     0.373 (79%)  0.312 (94%)       0.397     0.491 (81%)  0.486 (82%)
oksolver    0.356     0.443 (80%)  0.378 (94%)       0.497     0.596 (83%)  0.587 (85%)

            RMSE (QCP models)                        RMSE (SW-GCP models)*
Solvers     oracular  uncond.      hier.             oracular  uncond.      hier.
zchaff      0.303     0.675 (45%)  0.577 (53%)       0.657     0.993 (66%)  0.983 (67%)
minisat     0.305     0.574 (53%)  0.500 (61%)       0.682     1.022 (67%)  1.024 (67%)
satzoo      0.240     0.397 (60%)  0.334 (72%)       0.384     0.581 (66%)  0.581 (66%)
satelite    0.247     0.426 (58%)  0.372 (66%)       0.618     0.970 (64%)  0.978 (63%)
sato        0.375     0.711 (53%)  0.635 (59%)       0.723     1.352 (53%)  1.345 (54%)
oksolver    0.427     0.548 (78%)  0.506 (84%)       0.601     1.337 (45%)  1.331 (45%)

Table 5.3: Comparison of oracular, unconditional, and hierarchical hardness models. The second number of each entry is the ratio of the oracular model's RMSE to the model's RMSE. (*For SW-GCP, even the oracular model exhibits a large runtime prediction error.)

For rand3-var, classification accuracy was very high (classification error was only 1.6%). Our experiments confirmed that hierarchical hardness models can achieve almost the same runtime prediction accuracy as oracular models for all four solvers considered in our study. Figure 5.7 shows that using the hierarchical hardness model to predict satz's runtime was much better than using the unconditional model.

On the rand3-fix data set, results for all four solvers were qualitatively similar: hierarchical hardness models gave slightly but consistently better runtime predictions than unconditional models.
On this distribution, the gap in prediction accuracy between unconditional and oracular models was already quite small, which made further significant improvements more difficult to achieve. Detailed analysis of actual vs. predicted runtimes for satz (see Figure 5.8) shows that, particularly for unsatisfiable instances, the hierarchical model tended to produce slightly more accurate predictions. Further investigation confirmed that those instances in Figure 5.8 (right) that were far away from the ideal prediction line (y = x) had low classification confidence.

Figure 5.7: Actual vs. predicted logarithm runtime for satz on rand3-var. Left: unconditional model (RMSE=0.387); right: hierarchical model (RMSE=0.344).

Figure 5.8: Actual vs. predicted logarithm runtime for satz on rand3-fix. Left: unconditional model (RMSE=0.420); right: hierarchical model (RMSE=0.413).

For the structured QCP instances, similar runtime prediction accuracy improvements were obtained by using hierarchical models. Since the classification accuracy for QCP was higher than the classification accuracy for rand3-fix, we expected bigger improvements. The experimental results confirm this hypothesis (Figure 5.9).

Figure 5.9: Actual vs. predicted logarithm runtime for satelite on QCP. Left: unconditional model (RMSE=0.426); right: hierarchical model (RMSE=0.372).
For example, a hierarchical model for the satelite solver achieved an RMSE of 0.372, compared to 0.426 obtained from an unconditional model (whereas the oracular model yields an RMSE of 0.247).

However, the runtime prediction accuracy obtained by hierarchical hardness models depends on the quality of the underlying conditional models (experts). In the case of data set SW-GCP (see Figure 5.10), both unconditional and oracular models had fairly large prediction error. This is also consistent with the classification error on SW-GCP being much higher (26.6%, compared to 4.0% on QCP and 13.5% on rand3sat-fix).

Figure 5.10: Actual vs. predicted logarithm runtime for zchaff on SW-GCP. Left: unconditional model (RMSE=0.993); right: hierarchical model (RMSE=0.983).

When investigating the relationship between the classifier's confidence and the accuracy of the regression-based runtime predictions, we found that higher classification confidence tends to be indicative of more accurate runtime predictions. This relationship is illustrated in Figure 5.11 for the satelite solver on the QCP data set: when the classifier was more confident about the satisfiability of an instance, both the prediction error (Figure 5.11, left) and the RMSE (Figure 5.11, right) were smaller.

Figure 5.11: Classifier output vs. runtime prediction error (left); relationship between classifier output and RMSE (right).
Data set: QCP, solver: satelite.

5.4 Conclusions

This chapter shows that there are big differences between models trained only on satisfiable instances and models trained only on unsatisfiable instances, not only for uniform random 3-SAT (as was previously reported by Nudelman et al. (2004)), but also for distributions of structured SAT instances, such as QCP and SW-GCP. Furthermore, these conditional models have higher prediction accuracy than the respective unconditional models. A classifier can be used to distinguish between satisfiable and unsatisfiable instances with high accuracy for all of the above distributions. Compared to Chapter 4, we confirm that adding probing features significantly improves classification accuracy. Furthermore, such a classifier can be combined with conditional hardness models into a hierarchical hardness model using a mixture-of-experts approach. In cases of high classification accuracy, the hierarchical models thus obtained always offered substantial improvements over an unconditional model. In cases of less accurate classification, the hierarchical models could not offer a substantial improvement over the unconditional model; however, the hierarchical models were never significantly worse. It should be noted that hierarchical models come at virtually no additional computational cost, as they depend on the same features as the individual regression models.

Chapter 6

Performance Prediction with Empirical Performance Models

The previous chapter demonstrated that an algorithm's runtime can be predicted with high accuracy using linear regression models. However, we neither have to use linear regression models nor have to predict runtime. This chapter extends empirical hardness models (EHMs) to empirical performance models (EPMs) to reflect this broadened scope.
Such models have important applications to algorithm analysis, portfolio-based algorithm selection, and the automatic configuration of parameterized algorithms.

Over the past decade, a wide variety of techniques have been studied for building such models. In this chapter, we describe a thorough comparison of different existing and new model-building techniques for SAT, MIP, and TSP. Our experiments consider 11 algorithms and 35 instance distributions, with the least structured having been generated uniformly at random and the most structured having emerged from real industrial applications. Overall, we demonstrate that our new models yield substantially better runtime predictions than previous approaches.¹

¹ This chapter is based on joint work with Frank Hutter, Holger Hoos, and Kevin Leyton-Brown [101].

6.1 Methods Used in the Literature

The goal of performance prediction is to find a mapping from instance features to algorithm performance. Therefore, any regression technique can be applied. In this chapter, we only consider machine learning methods that have previously been used to predict algorithm runtimes: ridge regression [28, 29, 88, 92, 127, 130, 156, 209, 210], neural networks [184], Gaussian process regression [92], and regression trees [15].

Let X_i denote a p-dimensional feature vector for instance i, and let y_i denote the performance of an algorithm A on i. An EPM is a stochastic process f : X ↦ y that defines a probability distribution over performance measures y for A on problem instances with features X. The prediction at a particular input is thus a distribution over performance values. Let Π be the set of n training instances drawn from an instance distribution D. The training data for the regression models is simply {(X_i, y_i)}_{i=1}^n. Throughout this chapter, we focus on runtime as the performance measure and use a log transformation.
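As a minimal illustration of this setup (all feature values and runtimes below are made up, not taken from the benchmarks in this chapter), the training data is just per-instance feature vectors paired with log-transformed runtimes:

```python
import math

# Hypothetical raw data: p = 2 features per instance, and measured runtimes in seconds.
features = [[2.0, 4.3], [5.1, 4.1], [9.7, 4.5]]   # X_i
runtimes = [0.12, 3.5, 870.0]                      # y_i before transformation

# EPMs here are trained on log10 runtimes, which compresses the huge dynamic
# range of solver runtimes into a scale that regression models handle well.
training_data = [(x, math.log10(y)) for x, y in zip(features, runtimes)]
```

The log transformation means that a prediction error of 1.0 corresponds to being off by one order of magnitude in runtime, which is the scale on which RMSE values are reported throughout this chapter.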
Since many of the methods yield only point-valued runtime predictions, our experimental analysis focuses on the accuracy of mean predictions.

6.1.1 Ridge Regression

Ridge regression [see, e.g., 24] is a simple regression method that fits a linear function f_w(X) of its inputs X. Due to its simplicity (both conceptual and computational) and its interpretability, ridge regression has been widely used for building EPMs [92, 127, 130, 156, 208].

As mentioned in the previous chapter, two key techniques are widely used to better approximate algorithm performance: feature expansion, which extends the feature space, and feature selection, which removes redundant features. Many different methods exist for feature expansion and selection, and we review three different ridge regression variants from the recent literature that differ only in these design decisions.

Ridge Regression Variant RR: Two-phase forward selection [209, 210]

Chapter 5 uses a simple and scalable feature selection method based on forward selection [see, e.g., 68] to build the regression models. This iterative method starts with an empty input set, greedily adds one linear input at a time to minimize cross-validation error at each step, and stops when l linear inputs have been selected. It then performs a full quadratic expansion of these l linear features (using the original features, and then normalizing the resulting quadratic features again to have mean zero and standard deviation one). Finally, it carries out another pass of forward selection with the expanded feature set to select q features. One advantage of this two-phase approach is its scalability: it can handle a very large number of features.
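The two-phase procedure can be sketched as follows. This is an illustrative skeleton, not the thesis implementation: the data is synthetic, cross-validation is replaced by a single validation split for brevity, and the normalization step is omitted.

```python
import numpy as np

def ridge_fit_predict(X_tr, y_tr, X_va, eps=1e-3):
    """Fit ridge regression on the given columns and predict on validation data."""
    w = np.linalg.solve(X_tr.T @ X_tr + eps * np.eye(X_tr.shape[1]), X_tr.T @ y_tr)
    return X_va @ w

def forward_select(X_tr, y_tr, X_va, y_va, k):
    """Greedily add one column at a time, minimizing validation RMSE at each step."""
    selected = []
    for _ in range(k):
        best = None
        for j in range(X_tr.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            pred = ridge_fit_predict(X_tr[:, cols], y_tr, X_va[:, cols])
            rmse = np.sqrt(np.mean((pred - y_va) ** 2))
            if best is None or rmse < best[0]:
                best = (rmse, j)
        selected.append(best[1])
    return selected

def quadratic_expansion(X):
    """Append all pairwise products of the given columns (including squares)."""
    p = X.shape[1]
    prods = [X[:, i] * X[:, j] for i in range(p) for j in range(i, p)]
    return np.column_stack([X] + prods)

# Synthetic data: the response depends on features 0 and 2 and their product.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2 * X[:, 0] - X[:, 2] + 0.5 * X[:, 0] * X[:, 2] + 0.01 * rng.normal(size=200)
X_tr, y_tr, X_va, y_va = X[:100], y[:100], X[100:], y[100:]

lin = forward_select(X_tr, y_tr, X_va, y_va, k=2)         # phase 1: linear terms
Q = quadratic_expansion(X[:, lin])                         # expand selected features
quad = forward_select(Q[:100], y_tr, Q[100:], y_va, k=3)   # phase 2: final q terms
```

On this toy data, phase 1 recovers the two informative raw features, and phase 2 can then pick up their interaction term from the quadratic expansion.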
The computational complexity of forward selection can be reduced by exploiting the fact that the inverse matrix A′⁻¹ resulting from including one additional feature can be computed incrementally from the previous inverse matrix A⁻¹ by two rank-one updates, requiring quadratic rather than cubic time [181].

In the experiments, we fix l = 30 to keep the result of a full quadratic basis function expansion manageable in size. Free parameters are the maximum number of quadratic terms q and the ridge penalizer ε; by default, we use q = 20 and ε = 10⁻³.

Ridge Regression Variant SPORE-FoBa: Forward-backward selection [88]

Recently, Huang et al. (2010) described a method for predicting algorithm runtime, termed Sparse POlynomial REgression (SPORE), which is based on ridge regression with forward-backward (FoBa) feature selection. In contrast to the two RR variants, SPORE-FoBa employs a cubic feature expansion (based on its own normalization of the original predictors). Essentially, it performs a single pass of forward selection, adding a small set of terms at each step determined by a forward-backward phase on a feature's candidate set. Specifically, having already selected a set of terms T based on raw features S, SPORE-FoBa loops over all raw features r ∉ S, constructing a candidate set T_r that consists of all polynomial expansions of S ∪ {r} that include r with non-zero degree and whose total degree is bounded by 3. For each such candidate set T_r, the forward-backward phase iteratively adds the best term t ∈ T_r \ T if its reduction χ of root mean squared error (RMSE) exceeds a threshold γ (forward step), and then removes the worst term t ∈ T if its reduction of RMSE is below 0.5·γ (backward step). This phase terminates when no single term t ∈ T_r \ T can be added to reduce RMSE by more than γ.
Finally, SPORE-FoBa's outer forward selection loop chooses the T resulting from the best of its forward-backward phases, and iterates until the number of terms in T reaches a pre-specified maximum of t_max terms. SPORE-FoBa's free parameters are the ridge penalizer ε, t_max, and γ; by default, we use ε = 10⁻³ and t_max = 10.

6.1.2 Neural Networks

Neural networks are a well-known regression method inspired by information processing in human brains. The multilayer perceptron (MLP) is a particularly popular type of neural network that organizes single computational units ("neurons") in layers (input, hidden, and output layers), using the outputs of all units in one layer as the inputs of all units in the next layer. Each neuron n_i in the hidden and output layers with k inputs a_i = [a_{i,1}, ..., a_{i,k}] has an associated weight vector w_i = [w_{i,1}, ..., w_{i,k}] and a bias term b_i, and computes the function w_iᵀ·a_i + b_i. For neurons in the hidden layer, the result of this function is further propagated through a nonlinear activation function g : R → R (often instantiated as tanh). Given an input x = [x_1, ..., x_p], a network with a single hidden layer of h neurons n_1, ..., n_h and a single output neuron n_{h+1} then computes the output

  f̂(x) = (Σ_{j=1}^h g(w_jᵀ·x + b_j)·w_{h+1,j}) + b_{h+1}.

The p·h + h weight terms and h + 1 bias terms can be combined into a single weight vector w, which can be set to minimize the network's prediction error using any continuous optimization algorithm (e.g., the classic "backpropagation" algorithm performs gradient descent to minimize squared prediction error).

Smith-Miles and van Hemert (2011) used an MLP with one hidden layer of 28 neurons to predict the runtime of local search algorithms for solving timetabling instances. They used the proprietary neural network software Neuroshell, but advised us to compare to an off-the-shelf Matlab implementation instead. We thus employed the popular Matlab neural network package NETLAB [151].
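The single-hidden-layer computation above can be written out directly. The weights and inputs below are arbitrary illustrative values, not a trained network:

```python
import math

def mlp_output(x, W_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer MLP with tanh activation:
    f(x) = sum_j tanh(w_j . x + b_j) * w_out[j] + b_out."""
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
              for w, b in zip(W_hidden, b_hidden)]
    return sum(h * wo for h, wo in zip(hidden, w_out)) + b_out

# h = 2 hidden neurons, p = 3 inputs (all values made up for illustration)
W_hidden = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]
b_hidden = [0.0, 0.1]
w_out = [1.0, -2.0]
b_out = 0.5
y = mlp_output([1.0, 2.0, 3.0], W_hidden, b_hidden, w_out, b_out)
```

Training then amounts to adjusting the p·h + h weights and h + 1 biases so that this output matches the observed (log) runtimes on the training data.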
NETLAB uses the activation function g = tanh and supports a regularizing prior on the weights, minimizing the error metric

  Σ_{i=1}^N (f̂(x_i) − y_i)² + α·wᵀ·w,

where α is a parameter determining the prior's strength. In the experiments, we used NETLAB's default optimizer (scaled conjugate gradients, SCG) to minimize this error metric, stopping the optimization after the default of 100 SCG steps. Free parameters are the regularization prior α and the number of hidden neurons h; as defaults, we used NETLAB's α = 0.01 and, like Smith-Miles and van Hemert [184], h = 28.

6.1.3 Gaussian Process Regression

Stochastic Gaussian processes (GPs) [168] are a popular class of regression models with roots in geostatistics, where they are also termed Kriging models [119]. GPs are the dominant modern approach for building response surface models of a process whose underlying fundamental mechanism is largely unknown [14, 109, 172, 174].

To construct a GP regression model, we first select a parameterized kernel function k_λ to characterize the degree of similarity between two elements of the input space. We also need to determine the variance σ² of Gaussian-distributed observation noise, which in our setting corresponds to the variance of the target algorithm's runtime distribution. The predictive distribution of a zero-mean Gaussian stochastic process for response y_{n+1} at input X_{n+1}, given training data D = {(X_1, y_1), ..., (X_n, y_n)}, measurement noise variance σ², and kernel function k, is then the Gaussian

  p(y_{n+1} | X_{n+1}, X_{1:n}, y_{1:n}) = N(y_{n+1} | μ_{n+1}, Var_{n+1})     (6.1)

with mean and variance

  μ_{n+1} = k_*ᵀ·[K + σ²·I_n]⁻¹·y_{1:n}
  Var_{n+1} = k_** − k_*ᵀ·[K + σ²·I_n]⁻¹·k_*,

where K is the n×n matrix with entries K(i, j) = k(X_i, X_j),

  k_* = (k(X_1, X_{n+1}), ..., k(X_n, X_{n+1}))ᵀ, and
  k_** = k(X_{n+1}, X_{n+1}) + σ².

Refer to Rasmussen and Williams [168] for a derivation.

A variety of kernel functions are possible, but the most common choice for continuous predictors is the squared exponential kernel

  K_cont(X_i, X_j) = exp[Σ_{l=1}^p (−λ_l·(X_{i,l} − X_{j,l})²)],     (6.2)

where λ_1, ..., λ_p are kernel parameters. This kernel is most appropriate if the response is expected to vary smoothly in the predictors X.

The GP equations above assume fixed kernel parameters λ_1, ..., λ_p and fixed observation noise variance σ². These constitute the GP's hyperparameters, which are typically set by maximizing the marginal likelihood p(y_{1:n}) of the data with a gradient-based optimizer; we refer to Rasmussen and Williams [168] for the equations. The choice of optimizer can make a substantial difference in practice. We used minFunc [176] with its default setting of a limited-memory version of BFGS [154].

Learning a GP model from data can be computationally expensive. Inverting the n×n matrix [K + σ²·I_n] takes O(n³) time and has to be performed in every step of the hyperparameter optimization (h steps in total), yielding a total complexity of O(h·n³). Subsequent predictions at a new input are relatively cheap: O(n) and O(n²) for predictions of the mean and the variance, respectively.

6.1.4 Regression Trees

Regression trees [27] are simple tree-based regression models. They are known to handle discrete predictors well; we believe that their first application to the prediction of algorithm runtime was by Bartz-Beielstein and Markon (2004).

The leaf nodes of a regression tree partition the input space into disjoint regions R_1, ..., R_M, and use a simple model for prediction in each region R_m; the most common choice is to predict a constant c_m. This leads to the following prediction for an input point X: μ̂(X) = Σ_{m=1}^M c_m·I_{X∈R_m}, where the indicator function I_z takes value 1 if the proposition z is true and 0 otherwise.
Note that since the regions R_m partition the input space, this sum always involves exactly one nonzero term. We denote the subset of training data points in region R_m as D_m. Under the standard squared error loss function Σ_{i=1}^n (y_i − μ̂(X_i))², the error-minimizing choice of the constant c_m in region R_m is the sample mean of the data points in D_m:

  c_m = (1/|D_m|)·Σ_{X_i∈R_m} y_i.     (6.3)

To construct a regression tree, we use the following standard recursive procedure [see, e.g., Breiman et al. (1984)], which starts at the root of the tree with all available training data points D = {(X_i, y_i)}_{i=1}^n. We consider binary partitionings of a given node's data along split variables j and split points s. For a real-valued split variable j, s is a scalar, and data point X_i is assigned to region R_1(j, s) if X_{i,j} ≤ s and to region R_2(j, s) otherwise. For a categorical split variable j, s is a set, and data point X_i is assigned to region R_1(j, s) if X_{i,j} ∈ s and to region R_2(j, s) otherwise. At each node, we select split variable j and split point s to minimize the sum of squared differences to the regions' means,

  l(j, s) = Σ_{X_i∈R_1(j,s)} (y_i − c_1)² + Σ_{X_i∈R_2(j,s)} (y_i − c_2)²,     (6.4)

where c_1 and c_2 are chosen according to Equation (6.3) as the sample means in regions R_1(j, s) and R_2(j, s), respectively. We continue this procedure recursively, finding the best split variable and split point, partitioning the data into two child nodes, and recursing into the child nodes. The process terminates when all training data points in a node share the same x values, meaning that no more splits are possible. This procedure tends to overfit the data, which can be mitigated by recursively pruning away branches that contribute little to the model's predictive accuracy.
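The split-point search in Equation (6.4) can be sketched for a single real-valued variable as follows (toy data; a full implementation would recurse on both children and also handle categorical splits, and this sketch uses the conventional midpoint split points rather than the randomized variant introduced later):

```python
def best_split(xs, ys):
    """Return the split point s minimizing Equation (6.4):
    l(s) = sum over left of (y - c1)^2 + sum over right of (y - c2)^2,
    where c1, c2 are the sample means of each side (Equation 6.3)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        s = (pairs[i - 1][0] + pairs[i][0]) / 2.0   # candidate split between neighbors
        left = [y for x, y in pairs if x <= s]
        right = [y for x, y in pairs if x > s]
        c1 = sum(left) / len(left)
        c2 = sum(right) / len(right)
        loss = sum((y - c1) ** 2 for y in left) + sum((y - c2) ** 2 for y in right)
        if best is None or loss < best[0]:
            best = (loss, s)
    return best[1]

# The responses jump between x = 3 and x = 10, so the best split lies between them.
split = best_split([1, 2, 3, 10, 11, 12], [0.1, 0.0, 0.2, 5.0, 5.1, 4.9])
```

Repeating this search over all split variables at each node, and recursing into the two resulting children, yields the full tree-construction procedure described above.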
We use cost-complexity pruning with 10-fold cross-validation to identify the best tradeoff between complexity and predictive quality [see, e.g., Hastie et al. (2009) for details].

To predict the response value at a new input X_i, we propagate X_i down the tree; that is, at each node with split variable j and split point s, we continue to the left child node if X_{i,j} ≤ s, and to the right child node otherwise. The predictive mean for X_i is the constant c_m in the leaf that this process selects; there is no variance predictor.

6.2 New Modeling Techniques for EPMs

We also introduce some advanced machine learning techniques and apply them for the first time to algorithm performance prediction. In particular, we introduce two new, more sophisticated runtime prediction models. The first is based on an approximate version of Gaussian processes that scales gracefully to many data points, and also includes a new kernel function for handling categorical data. The second is based on random forests, collections of regression trees that yield much better predictions than single trees and are known to perform well on discrete inputs.

6.2.1 Scaling to Large Amounts of Data with Approximate Gaussian Processes

Fitting a Gaussian process has complexity cubic in the number of data points, which limits the amount of data that can practically be used to fit these models. To deal with this obstacle, the machine learning literature has proposed various approximations to Gaussian processes [see, e.g., 167]. We experimented with the Bayesian committee machine [199], the informative vector machine [123], and the projected process (PP) approximation [168]. All of these methods performed rather similarly, with a slight edge for the PP approximation. In the following, we give the final equations for the PP's predictive mean and variance [see, e.g., Rasmussen and Williams (2006) for a derivation].

The PP approximation to GPs uses a subset of a of the n training data points, the so-called active set.
Let v be a vector consisting of the indices of these a data points. We extend the notation for exact GPs (see Section 6.1.3) as follows: let K_aa denote the a×a matrix with K_aa(i, j) = k(x_{v(i)}, x_{v(j)}), and let K_an denote the a×n matrix with K_an(i, j) = k(x_{v(i)}, x_j). The predictive distribution of the PP approximation is then a Gaussian with mean and variance

  μ_{n+1} = k_*ᵀ·(σ²·K_aa + K_an·K_anᵀ)⁻¹·K_an·y_{1:n}
  Var_{n+1} = k_** − k_*ᵀ·K_aa⁻¹·k_* + σ²·k_*ᵀ·(σ²·K_aa + K_an·K_anᵀ)⁻¹·k_*.

We perform h steps of hyperparameter optimization based on a standard GP trained on a set of a data points sampled uniformly at random without replacement from the n input data points. We then use the resulting hyperparameters and another independently sampled set of a data points (sampled in the same way) for the subsequent PP approximation. In both cases, if a > n, we simply use all n data points.

The complexity of the PP approximation is superlinear only in a, meaning that the approach is much faster when we choose a ≪ n. The hyperparameter optimization based on a data points takes O(h·a³) time. In addition, there is a one-time cost of O(a²·n) for evaluating the PP equations. Thus, the complexity of fitting the approximate GP model is O([h·a + n]·a²), as compared to O(h·n³) for the exact GP model. The complexity of predictions with the PP approximation is O(a) for the mean and O(a²) for the variance of the predictive distribution [168], as compared to O(n) and O(n²) for the exact version. In our experiments, we set a = 300 and h = 50 to achieve a good compromise between speed and predictive accuracy.

6.2.2 Random Forest Models

Random forests [26] are a flexible tool for regression and classification, and are particularly effective for high-dimensional and discrete input data. To the best of our knowledge, they had not previously been used for algorithm runtime prediction except in the most recent work on algorithm configuration [98, 99] performed by our own group.
Here, we describe the standard RF framework and some nonstandard implementation choices.

The Standard Random Forest Framework

A random forest (RF) consists of a set of regression trees. If grown to sufficient depth, regression trees can capture very complex interactions and thus have low bias. However, they can also have high variance: small changes in the data can lead to a dramatically different tree. Random forests [26] reduce this variance by aggregating predictions across multiple different trees. These trees are made to be different in one of two ways: by training them on different subsamples of the training data, or by permitting only a random subset of the variables as split variables at each node. Our experiments showed slightly worse performance with a combination of the two approaches; we therefore chose the latter option, using the full training set for each tree.

Mean predictions for a new input x are trivial: predict the response for x with each tree and average the predictions. Predictive quality improves as the number of trees B grows, but computational cost also grows linearly in B. We used B = 10 throughout our experiments to keep computational costs low. Random forests have two additional hyperparameters: the percentage of variables to consider at each split point, perc, and the minimal number of data points required in a node to make it eligible for further splitting, n_min. We set perc = 0.5 and n_min = 5 by default.

Modifications to Standard Random Forests

We introduce a simple yet effective method for quantifying predictive uncertainty in random forests. (Our method is similar to that of Meinshausen [143], who recently introduced quantile regression trees that allow for predictions of quantiles of the predictive distribution; in contrast, we predict mean and variance.) In each leaf of each regression tree, in addition to the empirical mean of the training data associated with that leaf, we store that data's empirical variance.
To avoid making deterministic predictions for leaves with few data points, we round the stored variance up to at least a constant σ²_min; we set σ²_min = 0.01 throughout. For any input, each regression tree T_b thus yields a predictive mean μ_b and a predictive variance σ²_b. To combine these estimates into a single estimate, we treat the forest as a mixture model of B different models. We denote the random variable for the prediction of tree T_b as L_b and the overall prediction as L; we then have L = L_b if Y = b, where Y is a multinomial variable with p(Y = i) = 1/B for i = 1, ..., B. The mean and variance of L can then be expressed as follows:

  μ = E[L] = (1/B)·Σ_{b=1}^B μ_b

  σ² = Var(L) = E[Var(L|Y)] + Var(E[L|Y])
     = ((1/B)·Σ_{b=1}^B σ²_b) + (E[E(L|Y)²] − E[E(L|Y)]²)
     = ((1/B)·Σ_{b=1}^B σ²_b) + ((1/B)·Σ_{b=1}^B μ²_b) − E[L]²
     = ((1/B)·Σ_{b=1}^B (σ²_b + μ²_b)) − μ².

Thus, the mean prediction is simply the mean of the individual trees' mean predictions. To compute the variance prediction, we used the law of total variance [206], which expresses the total variance as the variance across the individual trees' mean predictions (predictions are uncertain if the trees disagree) plus the average variance of the individual trees (predictions are uncertain if the individual tree predictions are uncertain).

A second nonstandard ingredient in our models concerns the choice of split points. Consider splits on a real-valued variable j. Note that when the loss in Equation (6.4) is minimized by choosing a split point s between the values of x_{k,j} and x_{l,j}, one is still free to choose the exact location of s anywhere in the interval (x_{k,j}, x_{l,j}). In typical implementations, s is chosen as the midpoint between x_{k,j} and x_{l,j}. Instead, we draw it uniformly at random from (x_{k,j}, x_{l,j}). In the limit of an infinite number of trees, this leads to a linear interpolation of the training data instead of a partition into regions of constant prediction.
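The mixture-model combination above is straightforward to implement. The per-tree (mean, variance) values below are made-up numbers standing in for the outputs of individual trees:

```python
def combine_tree_predictions(means, variances):
    """Combine per-tree predictive means/variances into a forest prediction
    via the law of total variance:
    mu = mean of tree means; sigma^2 = mean(sigma_b^2 + mu_b^2) - mu^2."""
    B = len(means)
    mu = sum(means) / B
    var = sum(v + m * m for m, v in zip(means, variances)) / B - mu * mu
    return mu, var

# Hypothetical outputs of B = 3 trees for one test instance.
mu, var = combine_tree_predictions([1.0, 2.0, 3.0], [0.01, 0.04, 0.01])
```

In this example the trees disagree strongly (means 1.0, 2.0, 3.0), so the disagreement term dominates the combined variance, matching the intuition stated above.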
This randomized split-point choice also causes variance estimates to vary smoothly and to grow with the distance from observed data points.

Complexity of Fitting Random Forests

The computational cost for fitting a random forest is relatively low. We need to fit B regression trees, each of which is somewhat easier to fit than a normal regression tree, since at each node we only consider v = max(1, ⌊perc · p⌋) out of the p possible split variables. Building B trees simply takes B times as long as building a single tree. Thus, the complexity of learning a random forest is O(B · v · n² · log n) in the worst case (splitting off one data point at a time) and O(B · v · n · log² n) in the best case (perfectly balanced trees).

Prediction with a random forest model entails predicting with B regression trees (plus an O(B) computation to compute mean and variance across those predictions). The complexity of predictions is thus O(B · n) in the worst case and O(B · log n) for perfectly balanced trees.

Abbreviation  Reference Section  Description
RR            6.1.1              Ridge regression with 2-phase forward selection
SP            6.1.1              SPORE-FoBa (ridge regression with forward-backward selection)
NN            6.1.2              Feedforward neural network with one hidden layer
PP            6.2.1              Projected process (approximate Gaussian process); optimized via minFunc
RT            6.1.4              Regression tree with cost-complexity pruning
RF            6.2.2              Random forest

Table 6.1: Overview of our models.

6.3 Experimental Setup

Table 6.1 provides an overview of the models we evaluated; we evaluate each method's accuracy at predicting the runtime of a variety of solvers for SAT, MIP, and TSP on multiple benchmarks, which are described in Chapter 3.

For SAT, we used a wide range of instance distributions. Briefly, INDU, HAND, and RAND are collections of industrial, handmade, and random instances from the international SAT competitions and races. COMPETITION is their union. SWV and IBM are sets of software and hardware verification instances, and SWV-IBM is their union.
Finally, RANDSAT is a subset of RAND containing only satisfiable instances. For all distributions but this last one, we ran the popular tree search solver Minisat 2.0 [46]. For INDU, SWV and IBM, we also ran two more solvers: the winner of the SAT Race 2010, CryptoMiniSat [186], and SPEAR [8] (which has shown state-of-the-art performance on IBM and SWV with optimized parameter settings [93]). Finally, to evaluate predictions for local search algorithms, we used the RANDSAT instances and considered two solvers: TNM [203] (the winner of the SAT 2009 competition in the random satisfiable category) and the dynamic local search algorithm SAPS [91] (which we see as a baseline).

For MIP, we used two instance distributions from computational sustainability (RCW and CORLAT), one from winner determination in combinatorial auctions (REG), two unions of these (CR and CRR), and a very heterogeneous mix of publicly available MIP instances (BIGMIX). We used the two state-of-the-art commercial solvers, CPLEX 12.1 [104] and Gurobi 2.0 [67]. We also used the two strongest non-commercial solvers, SCIP [17] and lp solve 5.5 [16].

For TSP, we used three instance distributions: uniform random instances (PORTGEN), random clustered instances (PORTCGEN), and TSPLIB, a heterogeneous set of prominent TSP instances. We ran the most prominent systematic and local search algorithms, Concorde [5] and LK-H [70]. For the latter, we computed runtimes as the time required to find a solution of optimal quality.

For each algorithm–distribution pair, we executed the algorithm with its default parameter setting on all instances of the distribution, measured runtimes, and collected the results. All algorithm runs were executed on a cluster of 55 dual 3.2GHz Intel Xeon PCs with 2MB cache and 2GB RAM, running OpenSuSE Linux 11.1; runtimes were measured as CPU time on these reference machines.
We cut off each algorithm run after one CPU hour; this gives rise to capped runtime observations, because we only observe a lower bound on the runtime. Like most past work on runtime modeling, we simply counted such capped runs as having taken one hour. Due to the resolution of the CPU timer, runtimes below 0.01 seconds are measured as 0 seconds. To make y_i = log(r_i) well defined in these cases, such low runtimes are counted as 0.005 seconds.

Since the goal is to compare different model construction techniques, all the features listed in Chapter 3 were used. Features that have constant value across all training data points are removed, and the remaining ones are normalized to have mean 0 and standard deviation 1. For some instances, certain feature values were missing because of timeouts, crashes, or because they were undefined (e.g., because preprocessing already solved an instance). Since these missing values occur relatively rarely, a simple mechanism is used for handling them: disregard missing values for the purposes of normalization, and then set the missing values to zero. This means that the feature's value is at the data mean and is thus minimally informative; in some models (ridge regression and neural networks), this mechanism actually leads us to ignore the feature, since its weight is multiplied by zero.

We evaluate methods for building empirical performance models by assessing the quality of the predictions they make about inputs that were not used to train the model. This can be done visually (for example, in the scatter plots in Figure 6.1) or quantitatively. We considered three complementary quantitative metrics to evaluate mean predictions {µ_i}_{i=1}^n and predictive variances {σ²_i}_{i=1}^n given true performances {y_i}_{i=1}^n.
Root mean squared error (RMSE) is defined as

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \mu_i)^2};

Pearson's correlation coefficient (CC) is defined as

\mathrm{CC} = \frac{\left(\sum_{i=1}^{n}\mu_i y_i\right) - n\,\bar{\mu}\,\bar{y}}{(n-1)\, s_\mu\, s_y},

where \bar{x} and s_x denote the sample mean and standard deviation of x; and log likelihood (LL) is defined as

\mathrm{LL} = \sum_{i=1}^{n}\log\varphi\!\left(\frac{y_i - \mu_i}{\sigma_i}\right),

where φ denotes the probability density function (PDF) of a standard normal distribution. Intuitively, LL is the log probability of observing the true values y_i under the predicted distributions N(µ_i, σ²_i). For CC and LL, higher values are better, while for RMSE lower values are better. We used k-fold cross-validation and report means of these measures across the k folds. Scatter plots show cross-validated predictions for a random subset of up to 1 000 data points.

6.4 Experimental Results

Table 6.2 provides quantitative results for all benchmarks, and Figure 6.1 visualizes results. At the broadest level, we can conclude that most of the methods were able to capture sufficient information pertaining to algorithm performance on training data to make meaningful predictions on test data, most of the time: easy instances tended to be predicted as easy, and hard ones as hard.
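For reference, the three evaluation metrics can be computed directly from their definitions; the following is a minimal sketch (function names are ours), with LL implemented exactly as defined in the text:

```python
import math

def rmse(y, mu):
    # Root mean squared error between true values y and mean predictions mu.
    n = len(y)
    return math.sqrt(sum((yi - mi) ** 2 for yi, mi in zip(y, mu)) / n)

def cc(y, mu):
    # Pearson's correlation coefficient, with sample means and
    # (n-1)-normalized sample standard deviations, as in the definition.
    n = len(y)
    ybar, mbar = sum(y) / n, sum(mu) / n
    sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    sm = math.sqrt(sum((mi - mbar) ** 2 for mi in mu) / (n - 1))
    num = sum(mi * yi for mi, yi in zip(mu, y)) - n * mbar * ybar
    return num / ((n - 1) * sm * sy)

def ll(y, mu, sigma):
    # Sum of log standard-normal densities of the standardized residuals.
    return sum(-0.5 * ((yi - mi) / si) ** 2 - 0.5 * math.log(2 * math.pi)
               for yi, mi, si in zip(y, mu, sigma))
```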
For example, in the case of predicting the runtime of Minisat 2.0 on a heterogeneous mix of SAT competition instances (refer to the leftmost column in Figure 6.1 and the top row of Table 6.2), Minisat 2.0 runtimes varied by almost six orders of magnitude, while predictions with the better models were rarely off by more than one order of magnitude (outliers may draw the eye in the scatter plot, but quantitatively, the RMSE for predicting log10 runtime is low – e.g., 0.47 for random forests, which means an average misprediction of a factor of 3).

In our experiments, random forests were indeed the overall winner among the

Domain | RMSE: RR SP NN PP RT RF | Time to learn model (s): RR SP NN PP RT RF
Minisat-COMPETITION | 1.01 1.25 0.62 0.92 0.68 0.47 | 6.8 28.08 21.84 46.56 20.96 22.42
Minisat-HAND | 1.05 1.34 0.63 0.85 0.75 0.51 | 3.7 7.92 6.2 44.14 6.15 5.98
Minisat-RAND | 0.64 0.76 0.38 0.55 0.5 0.37 | 4.46 7.98 10.81 46.09 7.15 8.36
Minisat-INDU | 0.94 1.01 0.78 0.86 0.71 0.52 | 3.68 7.82 5.57 48.12 6.36 4.42
Minisat-SWV-IBM | 0.53 0.76 0.32 0.52 0.25 0.17 | 3.51 6.35 4.68 51.67 4.8 2.78
Minisat-IBM | 0.51 0.71 0.29 0.34 0.3 0.19 | 3.2 5.17 2.6 46.16 2.47 1.5
Minisat-SWV | 0.35 0.31 0.16 0.1 0.1 0.08 | 3.06 4.9 2.05 53.11 2.37 1.07
CryptoMiniSat-INDU | 0.94 0.99 0.94 0.9 0.91 0.72 | 3.65 7.9 5.37 45.82 5.03 4.14
CryptoMiniSat-SWV-IBM | 0.77 0.85 0.66 0.83 0.62 0.48 | 3.5 10.83 4.49 48.99 4.75 2.78
CryptoMiniSat-IBM | 0.65 0.96 0.55 0.56 0.53 0.41 | 3.19 4.86 2.59 44.9 2.41 1.49
CryptoMiniSat-SWV | 0.76 0.78 0.71 0.66 0.63 0.51 | 3.09 4.62 2.09 53.85 2.32 1.03
SPEAR-INDU | 0.95 0.97 0.85 0.87 0.8 0.58 | 3.55 9.53 5.4 45.47 5.52 4.25
SPEAR-SWV-IBM | 0.67 0.85 0.53 0.78 0.49 0.38 | 3.49 6.98 4.32 48.48 4.9 2.82
SPEAR-IBM | 0.6 0.86 0.48 0.66 0.5 0.38 | 3.18 5.77 2.58 45.72 2.5 1.56
SPEAR-SWV | 0.49 0.58 0.48 0.44 0.47 0.34 | 3.09 6.24 2.09 56.09 2.38 1.13
TNM-RANDSAT | 1.01 1.05 0.94 0.93 1.22 0.88 | 3.79 8.63 6.57 46.21 7.64 5.42
SAPS-RANDSAT | 0.94 1.09 0.73 0.78 0.86 0.66 | 3.81 8.54 6.62 49.33 6.59 5.04
CPLEX-BIGMIX | 2.7E8 0.93 1.02 1 0.85 0.64 | 3.39 8.27 4.75
41.25 5.33 3.54
Gurobi-BIGMIX | 1.51 1.23 1.41 1.26 1.43 1.17 | 3.35 5.12 4.55 40.72 5.45 3.69
SCIP-BIGMIX | 4.5E6 0.88 0.86 0.91 0.72 0.57 | 3.43 5.35 4.48 39.51 5.08 3.75
lp solve-BIGMIX | 1.1 0.9 0.68 1.07 0.63 0.5 | 3.35 4.68 4.62 43.27 2.76 4.92
CPLEX-CORLAT | 0.49 0.52 0.53 0.46 0.62 0.47 | 3.19 7.64 5.5 27.54 4.77 3.4
Gurobi-CORLAT | 0.38 0.44 0.41 0.37 0.51 0.38 | 3.21 5.23 5.52 28.58 4.71 3.31
SCIP-CORLAT | 0.39 0.41 0.42 0.37 0.5 0.38 | 3.2 7.96 5.52 26.89 5.12 3.52
lp solve-CORLAT | 0.44 0.48 0.44 0.45 0.54 0.41 | 3.25 5.06 5.49 31.5 2.63 4.42
CPLEX-RCW | 0.25 0.29 0.1 0.03 0.05 0.02 | 3.11 7.53 5.25 25.84 4.81 2.66
CPLEX-REG | 0.38 0.39 0.44 0.38 0.54 0.42 | 3.1 6.48 5.28 24.95 4.56 3.65
CPLEX-CR | 0.46 0.58 0.46 0.43 0.58 0.45 | 4.25 11.86 11.19 29.92 11.44 8.35
CPLEX-CRR | 0.44 0.54 0.42 0.37 0.47 0.36 | 5.4 18.43 17.34 35.3 20.36 13.19
LK-H-PORTGEN | 0.61 0.63 0.64 0.61 0.89 0.67 | 4.14 1.14 12.78 22.95 11.49 11.14
LK-H-PORTCGEN | 0.71 0.72 0.75 0.71 1.02 0.76 | 4.19 2.7 12.93 24.78 11.54 10.79
LK-H-TSPLIB | 9.55 1.11 1.77 1.3 1.21 1.06 | 1.61 3.02 0.51 4.3 0.17 0.11
Concorde-PORTGEN | 0.41 0.43 0.43 0.42 0.59 0.45 | 4.18 3.6 12.7 22.28 10.79 9.9
Concorde-PORTCGEN | 0.33 0.34 0.34 0.34 0.46 0.35 | 4.17 2.32 12.68 24.8 11.16 10.18
Concorde-TSPLIB | 120.6 0.69 0.99 0.87 0.64 0.52 | 1.54 2.66 0.47 4.26 0.22 0.12

Table 6.2: Quantitative comparison of models for runtime predictions on previously unseen instances. We report 10-fold cross-validation performance. Lower RMSE values are better (0 is optimal). Note the very large RMSE values for ridge regression on some data sets (we use scientific notation, denoting "×10^x" as "Ex"); these large errors are due to extremely small/large predictions for a few data points.
Boldface indicates performance not statistically significantly different from the best method in each row.

Domain | Spearman rank correlation coefficient: RR SP NN PP RT RF | Log likelihood: PP RF
Minisat-COMPETITION | 0.69 0.57 0.86 0.79 0.83 0.9 | -4.78 -0.33
Minisat-HAND | 0.69 0.59 0.87 0.81 0.84 0.91 | -2.65 -0.43
Minisat-RAND | 0.79 0.74 0.82 0.8 0.78 0.83 | -1.12 -0.18
Minisat-INDU | 0.7 0.66 0.85 0.79 0.87 0.92 | -5.72 -0.43
Minisat-SWV-IBM | 0.95 0.89 0.97 0.96 0.98 0.99 | -6.64 0.12
Minisat-IBM | 0.94 0.91 0.97 0.97 0.98 0.99 | -6.13 0.06
Minisat-SWV | 0.94 0.95 0.97 0.98 0.99 0.99 | -4.83 0.2
CryptoMiniSat-INDU | 0.66 0.59 0.72 0.71 0.76 0.81 | -5.99 -0.9
CryptoMiniSat-SWV-IBM | 0.93 0.9 0.94 0.91 0.96 0.97 | -6.91 -0.37
CryptoMiniSat-IBM | 0.93 0.85 0.94 0.94 0.96 0.97 | -5.8 -0.23
CryptoMiniSat-SWV | 0.92 0.94 0.95 0.93 0.97 0.97 | -6.88 -0.59
SPEAR-INDU | 0.63 0.62 0.78 0.75 0.82 0.88 | -6.66 -0.59
SPEAR-SWV-IBM | 0.94 0.91 0.95 0.92 0.97 0.98 | -13.6 -0.22
SPEAR-IBM | 0.95 0.87 0.96 0.93 0.96 0.98 | -2.58 -0.18
SPEAR-SWV | 0.95 0.93 0.94 0.95 0.96 0.97 | -7.33 -0.19
TNM-RANDSAT | 0.87 0.86 0.9 0.89 0.83 0.91 | -4.65 -1.32
SAPS-RANDSAT | 0.9 0.86 0.93 0.92 0.91 0.95 | -3.16 -0.79
CPLEX-BIGMIX | 0.82 0.81 0.81 0.76 0.84 0.9 | -8.06 -0.7
Gurobi-BIGMIX | 0.62 0.62 0.57 0.57 0.54 0.64 | -18.09 -2.36
SCIP-BIGMIX | 0.81 0.76 0.81 0.73 0.84 0.89 | -7.33 -0.72
lp solve-BIGMIX | 0.34 0.31 0.35 0.22 0.47 0.6 | -13.22 -0.24
CPLEX-CORLAT | 0.95 0.95 0.94 0.96 0.93 0.95 | -4.46 -0.53
Gurobi-CORLAT | 0.95 0.93 0.94 0.95 0.92 0.95 | -3.12 -0.38
SCIP-CORLAT | 0.94 0.94 0.93 0.95 0.91 0.94 | -5.04 -0.38
lp solve-CORLAT | 0.76 0.75 0.75 0.75 0.82 0.76 | -1.53 -0.25
CPLEX-RCW | 0.94 0.92 0.99 1 1 1 | 2 0.23
CPLEX-REG | 0.87 0.87 0.82 0.87 0.75 0.84 | -8.52 -0.59
CPLEX-CR | 0.9 0.86 0.9 0.91 0.85 0.9 | -15.46 -0.54
CPLEX-CRR | 0.89 0.85 0.9 0.92 0.88 0.92 | -20.04 -0.29
LK-H-PORTGEN | 0.82 0.81 0.8 0.82 0.7 0.77 | -46.04 -1.16
LK-H-PORTCGEN | 0.73 0.72 0.69 0.73 0.55 0.68 | -26.86 -1.25
LK-H-TSPLIB | 0.64 0.8 0.55 0.71 0.76 0.75 | -3.78 -2
Concorde-PORTGEN | 0.88 0.88 0.88 0.88 0.79 0.86 | -34.47 -0.66
Concorde-PORTCGEN | 0.86 0.85 0.85 0.85
0.76 0.84 | -26.36 -0.36
Concorde-TSPLIB | 0.73 0.86 0.72 0.67 0.86 0.91 | -1.44 -1.1

Table 6.3: Quantitative comparison of models for runtime predictions on unseen instances. We report 10-fold cross-validation performance. Higher rank correlations are better (1 is optimal); log likelihoods are only defined for models that yield a predictive distribution (here: PP and RF); higher values are better. Boldface indicates results not statistically significantly different from the best.

[Figure 6.1 grid of scatter plots; columns: Minisat-COMPETITION, CPLEX-BIGMIX, CPLEX-RCW, Concorde-PORTGEN; rows: Ridge regression (RR), SPORE-FoBa, Projected process, Neural network, Random forest.]

Figure 6.1: Visual comparison of models for runtime predictions on previously unseen test instances. The data sets used in each column are shown at the top. The x-axis of each scatter plot denotes true runtime and the y-axis 2-fold cross-validated runtime as predicted by the respective model; each dot represents one instance. Predictions above 3000 or below 0.001 are denoted by a blue cross rather than a black dot. Plots for other benchmarks are qualitatively similar.

different methods, yielding the best predictions in terms of all our quantitative measures (refer to root mean squared error results in Table 6.2; correlation coefficient and log likelihood results in Table 6.3). For SAT, they were always the best method, and for MIP they clearly yielded the best performance for the most heterogeneous instance set, BIGMIX (refer to Column 2 of Figure 6.1). We attribute the strong performance of random forests on highly heterogeneous data sets to a tree-based approach being able to model very different parts of the data separately, whereas methods that fit continuous functions typically allow the fit in one region to affect the fit in another. Indeed, all ridge regression variants produced extremely badly predicted outliers for BIGMIX. For the other MIP datasets, either random forests or projected processes performed best, often followed closely by ridge regression variant RR.
CPLEX's performance on set RCW was a special case that could be predicted extremely well across models (see Column 3 of Figure 6.1). Finally, for TSP, projected processes and ridge regression had a slight edge for the homogeneous PORTGEN and PORTCGEN benchmarks, whereas tree-based methods (once again) performed best on the most heterogeneous benchmark, TSPLIB. The last column of Figure 6.1 shows that in the case where random forests performed worst, the qualitative differences in predictions were small. In terms of computational requirements, random forests were among the cheapest methods, taking between 0.1 and 11 seconds for model learning.

Since the number of instances for which performance data is available can be very limited in practice, we are interested in how the predictive quality of our models depends on the number of training instances. Figure 6.2 visualizes this scaling behaviour for six representative benchmarks (plots for other benchmarks are qualitatively similar). We show CC rather than RMSE, for two reasons. First, plots of RMSE are often cluttered due to poorly performing outliers (mostly of the ridge regression variants). Second, plotting CC allows immediate visual performance comparisons across benchmarks, since CC ∈ [−1, 1].

Overall, random forests performed best across training set sizes. Interestingly, both versions of ridge regression (SP and RR) performed poorly for small training sets. This observation is significant since most past work employed ridge regression to construct empirical performance models in situations where data was sparse, as in old versions of SATzilla, for example [210].

[Figure 6.2, top row of panels: Minisat-COMPETITION, CPLEX-BIGMIX, CPLEX-CORLAT; axes: number of training data points, n, vs. correlation coefficient; legend: RF, RT, PP, NN, SP, RR.]
[Figure 6.2, bottom row of panels: Minisat-SWV-IBM, Concorde-PORTGEN, Concorde-PORTGEN; axes: number of training data points, n, vs. correlation coefficient.]

Figure 6.2: Prediction quality for varying numbers of training instances. For each model and number of training instances, we plot the mean (taken across 10 cross-validation folds) correlation coefficient (CC) between true and predicted runtimes for new test instances; larger CC is better, 1 is perfect. Plots for other benchmarks are qualitatively similar.

6.5 Conclusions

This chapter assessed and advanced the state of the art in predicting the performance of combinatorial algorithms. We proposed new techniques for building such predictive models and conducted the largest experimental study of which we are aware—predicting the performance of 11 algorithms on 35 instance distributions from SAT, MIP and TSP—comparing our new modeling approaches with all those previously used in the literature. We showed that our new approaches—chiefly those based on random forests, but also approximate Gaussian processes—offer the best performance for predictions on unseen problem instances. We also demonstrated that very accurate predictions (correlation coefficients between predicted and true runtime exceeding 0.9) are possible based on very small amounts of training data (only hundreds of runtime observations). Overall, we showed that our methods are fast, general, and achieve good, robust performance; we hope they will be useful to a wide variety of researchers who seek to model algorithm performance for algorithm analysis, scheduling, algorithm portfolio construction, automated algorithm configuration, and other applications.

Chapter 7
SATzilla: Portfolio-based Algorithm Selection for SAT

It has been widely observed that there is no single "dominant" SAT solver; instead, different solvers perform best on different instances.
Rather than following the traditional approach of choosing the best solver for a given class of instances, we advocate making this decision online, on a per-instance basis. In particular, this chapter describes SATzilla, an automated approach for constructing per-instance algorithm portfolios for SAT that use machine learning techniques to choose among their constituent solvers. SATzilla takes as input a distribution of problem instances and a set of component solvers, and constructs a portfolio optimizing a given objective function (such as mean runtime, percent of instances solved, or score in a competition).

In addition to demonstrating the design and analysis of the first state-of-the-art portfolio-based algorithm selector for SAT, SATzilla07, this chapter goes well beyond it by making the portfolio construction scalable and completely automated, by improving it with local search solvers as candidate solvers, by predicting performance score instead of runtime, and by using hierarchical hardness models that consider different types of SAT instances. The effectiveness of these new techniques is demonstrated through extensive experimental results.

SATzilla remains an ongoing project. In 2009, a new procedure for predicting feature computation cost was added to SATzilla09 to improve SATzilla's performance on the industrial category. In 2012, SATzilla2012 introduced a new selection procedure based on an explicit cost-sensitive loss function that punishes misclassifications in direct proportion to their impact on portfolio performance. The excellent performance of SATzilla on SAT was independently verified in the 2007/2009 SAT Competitions and the 2012 SAT Challenge, where SATzilla solvers won more than 10 medals.[1]

7.1 Procedure for Building a Portfolio-based Algorithm Selector

The general methodology for building a portfolio-based algorithm selector that we use in this work follows Leyton-Brown et al. (2003) in its broad strokes, but we have made significant extensions here.
Portfolio construction transpires offline, as part of algorithm development, and comprises the following steps.

1. Identify a target distribution of problem instances. Practically, this means selecting a set of instances believed to be representative of some underlying distribution, or using an instance generator that constructs instances that represent samples from such a distribution.

2. Select a set of candidate solvers that have relatively uncorrelated runtimes on this distribution and are known or expected to perform well on at least some of the instances.

3. Identify features that characterize problem instances. In general this cannot be done automatically, but rather must reflect the knowledge of a domain expert. To be usable effectively for automated algorithm selection, these features must be related to instance hardness and be relatively cheap to compute.

4. On a training set of problem instances, compute these features and run each algorithm to determine its running times.

[1] This chapter is based on joint work with Frank Hutter, Holger Hoos, and Kevin Leyton-Brown [210, 211, 216].

5. Identify one or more solvers to use for pre-solving instances. These pre-solvers will later be run for a short amount of time before features are computed (refer to step 9), in order to ensure good performance on very easy instances and to allow the empirical performance models to focus exclusively on harder instances.

6. Using a validation data set, determine which solver achieves the best performance on all instances that are not solved by the pre-solvers and for which feature computation times out. We refer to this solver as the backup solver. In the absence of a sufficient number of instances for which pre-solving and feature computation timed out, we employed the single best component solver (i.e., the winner-take-all choice) as a backup solver.

7.
Construct an empirical performance model for each algorithm in the portfolio, which predicts the runtime of the algorithm for each instance, based on the instance's features.

8. Choose the best subset of solvers to use in the final portfolio. We formalize and automatically solve this as a simple subset selection problem: from all given solvers, select a subset for which the respective portfolio (which uses the empirical performance models learned in the previous step) achieves the best performance on the validation set. (Observe that because our runtime predictions are not perfect, dropping a solver from the portfolio entirely can increase the portfolio's overall performance.)

Then, online, to solve a given problem instance, the following steps are performed.

9. Run each pre-solver until a predetermined fixed cutoff time is reached.

10. Compute feature values. If feature computation cannot be completed for some reason (error or timeout), select the backup solver identified in step 6.

11. Otherwise, predict each algorithm's runtime using the empirical performance models from step 7.

12. Run the algorithm predicted to be the best. If a solver fails to complete its run (e.g., it crashes), run the algorithm predicted to be next best.

Since 2009, we introduced an additional step prior to Step 7 that constructs a model for predicting the cost of feature computation based on some cheap features (e.g., number of variables, number of clauses). For a new test instance, before feature computation (Step 10), SATzilla first extracts the cheap features and predicts the cost of computing all features. If the predicted cost is higher than a given threshold, then SATzilla runs the backup solver instead of performing feature computation. In 2012, the empirical performance models were replaced by pairwise cost-sensitive classification models.
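As an illustration (all names and signatures below are hypothetical, not taken from the SATzilla code), the online phase, Steps 9 to 12 plus the 2009 feature-cost check, might be sketched as:

```python
def run_portfolio(instance, presolvers, solvers, models, features, backup,
                  cost_model=None, cost_threshold=None, presolve_cutoff=5.0):
    """Online phase of a per-instance algorithm selector (Steps 9-12).

    presolvers/solvers map names to callables returning a solution or None
    (failure/timeout); models maps names to predicted-runtime functions of
    a feature vector; features(instance) raises on error or timeout;
    backup is the backup solver's name.
    """
    # Step 9: run each pre-solver for a short, fixed cutoff.
    for name, solve in presolvers.items():
        if solve(instance, presolve_cutoff) is not None:
            return name
    # 2009 addition: predict feature-computation cost from cheap features
    # and fall back to the backup solver if it looks too expensive.
    if cost_model is not None and cost_model(instance) > cost_threshold:
        return backup
    # Step 10: compute features; on error or timeout, use the backup solver.
    try:
        x = features(instance)
    except Exception:
        return backup
    # Steps 11-12: try solvers in order of predicted runtime, moving on to
    # the next-best solver if one crashes.
    for name in sorted(solvers, key=lambda n: models[n](x)):
        if solvers[name](instance, None) is not None:
            return name
    return backup
```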
Nevertheless, the described procedure is the basis of many advanced algorithm selectors.

7.2 Algorithm Selection Core: Predictive Models

The effectiveness of an algorithm selector depends on the ability to learn empirical performance models that can accurately predict a solver's performance on a given instance using efficiently computable features. In the experiments presented in this chapter, we use the same ridge regression method as in Chapter 5, which has previously proven to be very successful in predicting runtime on uniform random k-SAT, on structured SAT instances, and on combinatorial auction winner determination problems [92, 127, 156]. It should be noted that our portfolio methodology can make use of any regression/classification approach that provides sufficiently accurate estimates of an algorithm's performance and that is adequately computationally efficient, in the sense that the time spent making a prediction can be compensated for by the performance gain obtained through improved algorithm selection. Detailed information on building predictive models was given in previous chapters. In what follows, we introduce some special techniques that help to obtain more robust algorithm selectors.

7.2.1 Accounting for Censored Data

As is common with heuristic algorithms for solving NP-complete problems, SAT algorithms tend to solve some instances very quickly, while taking an extremely long amount of time to solve other instances. Hence, runtime data can be very costly to gather, as individual runs can take literally weeks to complete, even when other runs on instances of the same size require only milliseconds.
The common solution to this problem is to "censor" some runs by terminating them after a fixed cutoff time.

The question of how to fit good models in the presence of censored data has been extensively studied in the survival analysis literature in statistics, which originated in actuarial questions, such as estimating a person's lifespan given mortality data as well as the ages and characteristics of others who are still alive. Observe that this problem is the same as ours, except that in our case, data points are always censored at the same value, a subtlety that turns out not to matter.

The best approach that we know of for dealing with censored data is to build models that use all available information about censored runs, by using the censored runtimes as lower bounds on the actual runtimes. To our knowledge, this technique was first used in the context of SAT by Gagliolo and Schmidhuber (2006). This chapter adopts the simple yet effective method of Schmee and Hahn (1979) to deal with censored samples. In brief, this method first trains a hardness model treating the cutoff time as the true (uncensored) runtime for censored samples, and then repeats the following steps until convergence.

1. Estimate the expected runtime of censored runs using the hardness model. Since in ridge regression predictions are in fact normal distributions (with fixed variance), the expected runtime conditional on the runtime exceeding the cutoff time is the mean of the corresponding normal distribution truncated at the cutoff time.

2. Train a new hardness model using true runtimes for the uncensored instances and the predictions generated in the previous step for the censored instances.

We compared this approach with two other approaches to managing censored data: dropping such data entirely, and treating censored runs as though they finished at the cutoff threshold. The experimental results [209] demonstrated that both of these methods are significantly worse than the method presented above.
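A minimal sketch of this iterative scheme follows, assuming a fixed predictive standard deviation sigma, as in the ridge regression models of this chapter; the function names and the fixed iteration count (standing in for a convergence test) are ours:

```python
import math

def truncated_normal_mean(mu, sigma, cutoff):
    """E[X | X > cutoff] for X ~ N(mu, sigma^2): the imputed value for a
    run censored at `cutoff`."""
    a = (cutoff - mu) / sigma
    phi = math.exp(-0.5 * a * a) / math.sqrt(2 * math.pi)   # standard normal PDF
    Phi = 0.5 * (1.0 + math.erf(a / math.sqrt(2)))          # standard normal CDF
    return mu + sigma * phi / max(1.0 - Phi, 1e-12)

def schmee_hahn(fit, predict, X, y, censored, sigma, n_iter=20):
    """Sketch of the Schmee & Hahn (1979) scheme.

    fit(X, y) -> model; predict(model, x) -> mean prediction;
    censored[i] is True if y[i] is only a lower bound (the cutoff time).
    """
    y_imputed = list(y)
    model = fit(X, y_imputed)   # initialize treating cutoffs as true runtimes
    for _ in range(n_iter):
        for i, c in enumerate(censored):
            if c:   # replace censored values by their conditional expectations
                y_imputed[i] = truncated_normal_mean(predict(model, X[i]),
                                                     sigma, y[i])
        model = fit(X, y_imputed)
    return model
```

Since E[X | X > c] always exceeds c, the imputed runtimes lie above the cutoff, which is exactly how the method uses censored observations as lower bounds.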
Intuitively, both old methods introduce bias into empirical hardness models, whereas the method of Schmee and Hahn (1979) is unbiased.

7.2.2 Predicting Performance Score Instead of Runtime

The general portfolio methodology is based on empirical hardness models, which predict an algorithm's runtime. However, one may not simply be interested in using a portfolio to pick the solver with the lowest expected runtime. For example, in the 2007 SAT Competition, solvers were evaluated based on a complex scoring function that depends only partly on a solver's runtime. Although the idiosyncrasies of this scoring function are somewhat particular to the SAT competition, the idea that a portfolio should be built to optimize a performance score more complex than runtime has wide applicability. In this section we describe techniques for building models that predict such a performance score directly.

One critical issue is that—as long as one depends on standard supervised learning methods that require independent and identically distributed training data—one can only deal easily with scoring functions that actually associate a score with each single instance and combine the partial scores of all instances to compute the overall score. Given training data labeled with such a scoring function, SATzilla can simply learn a model of the score (rather than runtime) and then choose the solver with the highest predicted score. Unfortunately, the scoring function used in the 2007 SAT Competition does not satisfy this independence property: the score a solver attains for solving a given instance depends in part on its (and, indeed, other solvers') performance on other, similar instances. More specifically, in the SAT competition each instance P has a solution purse Solution_P and a speed purse Speed_P; all instances in a given series (typically 5–40 similar instances) share one series purse Series_P. Algorithms are ranked by summing three partial scores derived from these purses.

1.
For each problem instance P, its solution purse is equally distributed between the solvers S_i that solve the instance within the cutoff time (thereby rewarding robustness of a solver).

2. The speed purse for P is divided among the set of solvers S that solved the instance as

\mathrm{Score}(P, S_i) = \frac{\mathit{Speed}_P \times \mathrm{SF}(P, S_i)}{\sum_j \mathrm{SF}(P, S_j)},

where the speed factor

\mathrm{SF}(P, S) = \frac{\mathit{timeLimit}(P)}{1 + \mathit{timeUsed}(P, S)}

is a measure of speed that discounts small absolute differences in runtime.

3. The series purse for each series is divided equally and distributed between the solvers S_i that solved at least one instance in that series.

S_i's partial score from problem P's solution and speed purses solely depends on the solver's own runtime for P and the runtime of all competing solvers for P. Thus, given the runtimes of all competing solvers as part of the training data, we can compute the score contributions from the solution and the speed purses of each instance P, and these two components are independent across instances. In contrast, since a solver's share of the series purse will depend on its performance on other instances in the series, its partial score received from the series purse for solving one instance is not independent of its performance on other instances.

Our solution to this problem is to approximate an instance's share of the series purse score by an independent score. If N instances in a series are solved by any of SATzilla's component solvers, and if n solvers solve at least one of the instances in that series, we assign a partial score of Series_P/(N × n) to each solver S_i (where i = 1, ..., n) for each instance in the series it solved. This approximation of a non-independent score as independent is not always perfect, but it is conservative because it defines a lower bound on the partial score from the series purse. Predicted scores will only be used in SATzilla to choose between different solvers on a per-instance basis.
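For concreteness, the per-instance score shares described above can be sketched as follows; the function names are hypothetical, and the series share uses the independent approximation Series_P/(N × n) just introduced:

```python
def speed_factor(time_limit, time_used):
    # SF(P, S) = timeLimit(P) / (1 + timeUsed(P, S))
    return time_limit / (1.0 + time_used)

def purse_scores(runtimes, time_limit, solution_purse, speed_purse,
                 series_purse, n_series_instances, n_series_solvers):
    """Per-instance score shares for one instance P.

    runtimes: dict mapping each solver that solved P within the cutoff to
    its runtime; n_series_instances is N and n_series_solvers is n in the
    Series_P/(N * n) approximation.
    """
    solved = list(runtimes)
    sf = {s: speed_factor(time_limit, runtimes[s]) for s in solved}
    total_sf = sum(sf.values())
    series_share = series_purse / (n_series_instances * n_series_solvers)
    return {s: solution_purse / len(solved)        # solution purse: equal split
               + speed_purse * sf[s] / total_sf    # speed purse: by speed factor
               + series_share                      # approximate series share
            for s in solved}
```

With a time limit of 1200 seconds, a solver finishing instantly gets twice the speed factor of one finishing after one second, so small absolute runtime differences near the cutoff barely move the speed purse.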
Thus, the partial score of a solver for an instance should reflect how much it would contribute to SATzilla's score. If SATzilla were perfect (i.e., for each instance, it always selected the best algorithm), our score approximation would be correct: SATzilla would solve all N instances from the series that any component solver can solve, and thus would actually achieve the series score Series_P/(N × n) × N = Series_P/n. If SATzilla performed very poorly and did not solve any instance in the series, our approximation would also be exact, since it would estimate the partial series score as zero. Finally, if SATzilla were to pick successful solvers for some (say, M) but not all instances of the series that could be solved by its component solvers (i.e., M < N), we would underestimate the partial series purse, since Series_P/(N × n) × M < Series_P/n.

While the learning techniques require an approximation of the performance score as an independent score, our empirical evaluation of solver scores employs the actual SAT competition scoring function. As explained previously, in the SAT competition, the performance score of a solver depends on the scores of all other solvers in the competition. In order to simulate a competition, we select a large number of solvers and pretend that these "reference solvers" and SATzilla are the only solvers in the competition. Throughout our analysis, we used the 19 solvers listed in Tables 7.3, 7.4 and 7.5. This is not a perfect simulation, since the scores change somewhat when different solvers are added to or removed from the competition. However, we obtained much better approximations of the performance score by following the methodology outlined here than by using cruder measures, such as learning models to predict mean runtime or the number of benchmark instances solved.

Finally, predicting performance score instead of runtime has a number of implications for the components of SATzilla.
First, notice that one can compute an exact score for each algorithm and instance, even if the algorithm times out unsuccessfully or crashes: in these cases, the score from all three components is simply zero. When predicting scores instead of runtimes, we thus no longer need to rely on censored sampling techniques (see Section 7.2.1). Secondly, notice that the oracles for maximizing SAT competition score and for minimizing runtime are identical, since always using the solver with the smallest runtime guarantees that the highest values in all three components are obtained.

7.2.3 More General Hierarchical Performance Models

In this chapter, we consider a very heterogeneous instance distribution that consists of all instances from the categories Random, Crafted and Industrial. In order to further improve performance on this benchmark, we extend our previous hierarchical hardness model approach (predicting satisfiability status and then using a mixture of two conditional models, as in Chapter 5) to the more general scenario of six underlying empirical hardness models (one for each combination of category and satisfiability status). The output of the general hierarchical model is a weighted linear combination of the outputs of its components. As described in Chapter 5, we can approximate the model selection oracle by a softmax function whose parameters are estimated using EM.

7.3 Portfolio Construction

In this section, we describe the procedure for constructing SATzilla solvers for the SAT Competition, which features three main categories of instances: Random, Crafted (also known as Handmade) and Industrial. In order to study SATzilla's performance on an even more heterogeneous instance distribution, a third version of SATzilla is trained on data from all three categories of the competition; we label this new category ALL.

All of the SATzilla solvers were built using the design methodology detailed in Section 7.1.
Each of the following subsections corresponds to one step of this methodology.

7.3.1 Selecting Instances

In order to train empirical performance models for any of the above scenarios, we needed instances similar to those used in the real competition. For this purpose, we used instances from the respective categories of the previous SAT competitions (2002, 2003, 2004 and 2005), as well as from the 2006 SAT Race (which only featured Industrial instances). Instances that were repeated in previous competitions were also repeated in our data sets. Overall, there were 4811 instances: 2300 instances in category Random, 1490 in category Crafted and 1021 in category Industrial; of course, category ALL included all of these instances. In addition to all instances from before 2007, we also added the 869 instances from the 2007 SAT Competition to our four data sets. Overall, this resulted in 5680 instances: 2811 instances in category Random, 1676 in category Crafted and 1193 in category Industrial. Approximately 72% of the instances could be solved by at least one of the 19 solvers considered within the cutoff time of 1200 CPU seconds on the reference machine; the remaining instances were excluded from our analysis.

We randomly split the above benchmark sets into training, validation and test sets, as described in Table 7.1. All parameter tuning and intermediate testing was performed on validation sets, and test sets were used only to generate the final results reported here.

We will be interested in analyzing SATzilla's performance as we vary the

                  "Old" instances before 2007   "New" instances in 2007
Training (40%)    To (1925 instances)           Tn (347 instances)
Validation (30%)  Vo (1443 instances)           Vn (261 instances)
Test (30%)        Eo (1443 instances)           En (261 instances)

Table 7.1: Instances from before 2007 and from 2007, randomly split into training (T), validation (V) and test (E) data sets.
These sets include instances from all categories: Random, Crafted and Industrial.

Data set   Training   Validation   Test
D′         To         Vo           Eo ∪ En
D+         To ∪ Tn    Vo ∪ Vn      Eo ∪ En

Table 7.2: Data sets used in our experiments. Note that both data sets use identical test data, but different training and validation data.

data that was used to train it. To make it easy to refer to our different data sets, we describe them here and assign them names (D′, D+). Table 7.1 shows the division of our data into "old" (pre-2007) and "new" (2007) instances. Table 7.2 shows how we combined these data to construct the two data sets we use for evaluation. Data set D′ uses only pre-2007 instances for training and validation, and both old and new instances for testing. Data set D+ combines both old and new instances in its training, validation and test sets.

Since both data sets use the same test sets, the performance of portfolios trained using these different sets can be compared directly. However, we expect a portfolio trained using D+ to be at least slightly better, because it has access to more data. For the individual categories, we use D′r, D′h and D′i to refer to the instances from Random, Crafted and Industrial, respectively (and likewise for D+).

7.3.2 Selecting Solvers

To decide which algorithms to include in our portfolio, we considered a wide variety of solvers that had been entered into previous SAT competitions and into the 2006 SAT Race. We manually analyzed the results of these competitions, identifying all algorithms that yielded the best performance on some subset of instances. Since the instance sets contain both satisfiable and unsatisfiable instances, we did not choose any incomplete algorithms for this case study (the cost of misclassification would be very high if we chose a local search solver for an unsatisfiable instance).
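The data-set construction of Tables 7.1 and 7.2 can be sketched as follows (a hypothetical sketch: the 40/30/30 split and the D′/D+ compositions follow the tables, while the numeric instance identifiers are invented for illustration):

```python
import random

def split_40_30_30(instances, seed=0):
    # Random 40% / 30% / 30% split into training, validation and test sets,
    # mirroring Table 7.1.
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    a, b = int(0.4 * len(shuffled)), int(0.7 * len(shuffled))
    return shuffled[:a], shuffled[a:b], shuffled[b:]

# Hypothetical identifiers: "old" (pre-2007) and "new" (2007) instances
# are split independently.
To, Vo, Eo = split_40_30_30(range(4811), seed=1)
Tn, Vn, En = split_40_30_30(range(4811, 5680), seed=2)

# D' trains and validates on old data only; D+ adds the 2007 instances.
# Both share the test set Eo ∪ En, so the resulting portfolios are
# directly comparable (Table 7.2).
D_prime = {"train": To,      "valid": Vo,      "test": Eo + En}
D_plus  = {"train": To + Tn, "valid": Vo + Vn, "test": Eo + En}
```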
Ultimately, we selected the seven high-performance solvers shown in Table 7.3 as candidates for the SATzilla07 portfolio.

Solver         Reference
Eureka         [152]
Kcnfs06        [44]
March dl04     [75]
Minisat        [47]
Rsat 1.03      [162]
Vallst         [201]
Zchaff Rand    [139]

Table 7.3: The seven solvers in SATzilla07; we refer to this set of solvers as S.

When we shift to predicting and optimizing performance score instead of runtime, local search solvers always obtain a score of exactly zero on unsatisfiable instances, since they are guaranteed not to solve them within the cutoff time. (Of course, they do not need to be run on an instance during training if the instance is known to be unsatisfiable.) Hence, we can build models for predicting the score of local search solvers using exactly the same methods as for complete solvers. Therefore, we considered eight new complete solvers (Table 7.4) and four local search solvers (Table 7.5) from the 2007 SAT Competition for inclusion in our portfolio.

As with training instances, the set of candidate solvers is treated as an input parameter of SATzilla, S. The sets of candidate solvers used in our experiments are detailed in Table 7.6.

7.3.3 Choosing Features

The choice of instance features has a significant impact on the performance of empirical performance models. Good features need to correlate well with (solver-specific) instance hardness and need to be cheap to compute, since feature computation time counts as part of SATzilla07's runtime.
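Because feature computation (and pre-solving) counts toward the portfolio's runtime, the cost charged to SATzilla on one instance decomposes roughly as follows (a hedged sketch; the function and parameter names are invented for this illustration and do not appear in the thesis):

```python
def portfolio_runtime(presolver_runs, feature_time, solver_time, cutoff):
    # presolver_runs: ordered list of (time_needed, slice) pairs, where
    # time_needed is the CPU time the pre-solver would need on this instance
    # (float("inf") if it cannot solve it) and slice is its pre-solving cutoff.
    elapsed = 0.0
    for time_needed, slice_cap in presolver_runs:
        if time_needed <= slice_cap:
            return elapsed + time_needed      # solved during pre-solving
        elapsed += slice_cap                  # the full slice is spent in vain
    # Otherwise: pay for feature computation plus the selected solver's run,
    # capped at the overall cutoff.
    return min(elapsed + feature_time + solver_time, cutoff)
```

The sketch makes the trade-off explicit: pre-solving saves feature time on easy instances but adds its slices to the cost of every other instance.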
Good features need to correlate well with (solver-specific) instance hardness and need to be cheap to compute, since feature compu-tation time counts as part of SATzilla07’s runtime.101Solver ReferenceKcnfs04 [43]TTS [188]Picosat [19]MXC [25]March ks [74]TinisatElite [87]Minisat07 [187]Rsat 2.0 [163]Table 7.4: Eight complete solvers from the 2007 SAT Competition.Solver ReferenceRanov [160]Ag2wsat0 [30]Ag2wsat+ [204]Gnovelty+ [161]Table 7.5: Four local search solvers from the 2007 SAT Competition.Name of Set Solvers in the SetS all 7 solvers from Table 7.3S+ all 15 solvers from Tables 7.3 and 7.4S++ all 19 solvers from Tables 7.3, 7.4 and 7.5Table 7.6: Solver sets used in our second series of experiments.We used a subset of features from Figure 3.1. These features can be classifiedinto nine categories: problem size, variable-clause graph, variable graph, clausegraph, balance, proximity to Horn formulae, LP-based, DPLL probing and localsearch probing features. In order to limit the time spent computing features, weexcluded a number of computationally expensive features, such as LP-based andclause graph features. The computation time for each of the local search and DPLLprobing features was limited to 1 CPU second, and the total feature computationtime per instance was limited to 60 CPU seconds. After eliminating some featuresthat had the same value across all instances and some that were too unstable given102only 1 CPU second of local search probing, we ended up using 48 raw features.7.3.4 Computing Features and RuntimesAll our experiments were performed using a computer cluster consisting of 55machines with dual Intel Xeon 3.2GHz CPUs, 2MB cache and 2GB RAM, runningSuse Linux 10.1. As in the SAT competition, all runs of any solver that exceeded acertain runtime were aborted (censored) and recorded as such. 
In order to keep the computational cost manageable, we chose a cutoff time of 1200 CPU seconds.

7.3.5 Identifying Pre-solvers

As described in Section 7.1, in order to solve easy instances quickly without spending any time on the computation of features, we use one or more pre-solvers: algorithms that are run unconditionally but briefly before features are computed. Good algorithms for pre-solving solve a large proportion of instances quickly.

The naive approach to identifying pre-solvers is manual selection based on an examination of the training runtime data. There are several limitations to this approach. First and foremost, manual pre-solver selection does not scale well. If there are many candidate solvers, manually finding the best combination of pre-solvers and cutoff times can be difficult and requires significant amounts of valuable human time. In addition, manual pre-solver selection concentrates solely on solving a large number of instances quickly and does not take into account the pre-solvers' effect on model learning. In fact, pre-solving has three consequences.

1. Pre-solving solves some instances quickly before features are computed. In the context of the SAT competition, this improves SATzilla's scores for easy problem instances due to the "speed purse" component of the SAT competition scoring function (see Section 7.2.2 above).

2. Pre-solving increases SATzilla's runtime on instances not solved during pre-solving by adding the pre-solvers' time to every such instance. Like feature computation itself, this additional cost reduces SATzilla's scores.

3. Pre-solving filters out easy instances, allowing our empirical performance models
to be trained exclusively on harder instances.

Manual selection considers (1) and (2), but not (3). In particular, it ignores the fact that the use of different pre-solvers and/or cutoff times results in different training data and hence in different learned models, which can also affect a portfolio's effectiveness.

The new automatic pre-solver selection technique functions as follows. We committed in advance to using a maximum of two pre-solvers: one of three complete search algorithms and one of three local search algorithms. The three candidates for each of the two search approaches are automatically determined for each data set as those with the highest score on the validation set when run for a maximum of 10 CPU seconds. We also use a number of possible cutoff times, namely 2, 5 and 10 CPU seconds, as well as 0 seconds (i.e., the pre-solver is not run at all), and consider both orders in which the two pre-solvers can be run. For each of the resulting 288 possible combinations of two pre-solvers and cutoff times, SATzilla's performance on the validation data is evaluated by performing steps 6, 7 and 8 of the general methodology presented in Section 7.1:

6. determine the backup solver for selection when features time out;

7. construct an empirical performance model for each algorithm; and

8. automatically select the best subset of algorithms to use as part of SATzilla.

The best-performing subset found in this last step (evaluated on validation data) is selected as the algorithm portfolio for the given combination of pre-solver/cutoff-time pairs. Overall, this method aims to choose the pre-solver configuration that yields the best-performing portfolio.

7.3.6 Identifying the Backup Solver

We computed the average runtime of every solver on every category, counting timeouts as runs that completed at the cutoff time of 1200 CPU seconds. For the categories Random and Crafted, we did not encounter instances for which feature computation timed out. Thus, we employed the winner-take-all solver as the backup solver in both of these domains.
For categories Industrial and ALL, we chose the104solver that performed best on those instances that remained unsolved after pre-solving and for which feature computation timed out.7.3.7 Learning Empirical Performance ModelsWe learned empirical performance models for predicting each solver’s runtime/per-formance as described in Section 7.1, using the procedure of Schmee and Hahn(1979) for managing censored data and also employing hierarchical hardness mod-els.7.3.8 Solver Subset SelectionFor a small number of candidate solvers, we performed automatic exhaustive sub-set search as outlined in Section 7.1 to determine which solvers to include inSATzilla. For a large number of component solvers, such a procedure is in-feasible (N component solvers would require the consideration of 2N solver sets,for each of which a model would have to be trained). The automatic pre-solverselection methods described previously in Section 7.3.5 further worsen this sit-uation: solver selection must be performed for every candidate configuration ofpre-solvers, because new pre-solver configurations induce new models.As an alternative to exhaustively considering all subsets, we implemented arandomized iterative improvement procedure to search for a good subset of solvers.The local search neighborhood used by this procedure consists of all subsets ofsolvers that can be reached by adding or dropping a single component solver. Start-ing with a randomly selected subset of solvers, in each search step, we consider aneighboring solver subset selected uniformly at random and accept it if validationset performance increases; otherwise, we accept the solver subset anyway with aprobability of 5%. Once 100 steps have been performed with no improving step, anew run is started by re-initializing the search at random. After 10 such runs, thesearch is terminated and the best subset of solvers encountered during the searchprocess is returned. 
Preliminary evidence suggests that this local search procedure is efficient in finding very good subsets of solvers.

SATzilla version        Description
SATzilla07(S,D′)        Basic version for the 2007 SAT Competition, but evaluated on an extended test set.
SATzilla07(S+,D+)       The same design as SATzilla07(S,D′), but includes new complete solvers (Table 7.4) and new data (Section 7.3.1).
SATzilla07+(S++,D+)     In addition to new complete solvers and data, this version uses local search solvers (Table 7.5) and all of the new design elements except "more general hierarchical hardness models" (Section 7.2.3).
SATzilla07∗(S++,D+)     This version uses all solvers, all data and all new design elements. Unlike for the other versions, we trained only one variant of this solver for use in all data set categories.

Table 7.7: The different SATzilla versions evaluated in our second set of experiments.

7.3.9 Different SATzilla Versions

With new design ideas for SATzilla (Sections 7.3.5, 7.3.8, 7.2.3 and 7.2), new training data (Section 7.3.1) and new solvers (Section 7.3.2), we were interested in evaluating how much our portfolio improved as a result. In order to gain insights into how much performance improvement was achieved by these different changes, we studied several intermediate SATzilla solvers, which are summarized in Table 7.7.

SATzilla07(S,D′) used manual pre-solver selection, exhaustive search for solver subset selection, and EHMs for predicting runtime. It only considered "old" solvers and training data. The construction of SATzilla07(S+,D+) was the same as that of SATzilla07(S,D′), except that it relied on different solvers and corresponding training data.

SATzilla07+(S++,D+) and SATzilla07∗(S++,D+) incorporated the new techniques. Pre-solvers were identified automatically as described in Section 7.3.5, using the (automatically determined) candidate solvers listed in Table 7.8. We built models to predict the performance score of each algorithm.
This score is well defined even in the case of timeouts and crashes; thus, there was no need to deal with censored data.

                        Random       Crafted      Industrial   ALL
Complete pre-solver     Kcnfs06      March dl04   Rsat 1.03    Minisat07
candidates              March dl04   Vallst       Picosat      March ks
                        March ks     March ks     Rsat 2.0     March dl04
Local search pre-       Ag2wsat0     Ag2wsat0     Ag2wsat0     SAPS
solver candidates       Gnovelty+    Ag2wsat+     Ag2wsat+     Ag2wsat0
                        SAPS         Gnovelty+    Gnovelty+    Gnovelty+

Table 7.8: Pre-solver candidates for our four data sets. These candidates were automatically chosen based on the scores on validation data achieved by running the respective algorithms for a maximum of 10 CPU seconds.

In the manner of SATzilla07, SATzilla07+ used hierarchical empirical hardness models [208] with two underlying models (Msat and Munsat) for predicting a solver's score. For SATzilla07∗, we built more general hierarchical hardness models for predicting scores; these models were based on six underlying empirical hardness models (Msat and Munsat trained on data from each SAT competition category). We chose solver subsets based on the results of the local search procedure for subset search outlined in Section 7.3.8.

Observe that all of these solvers were evaluated using identical test data and were thus directly comparable. We generally expected each solver to outperform its predecessors in the list. The exception was SATzilla07∗(S++,D+): this last solver was designed to achieve good performance across a broader range of instances. Thus, we expected SATzilla07∗(S++,D+) to outperform the others on category ALL, but not to outperform SATzilla07+(S++,D+) on the more specific categories. The resulting final components of SATzilla07, SATzilla07+ and SATzilla07∗ for each category are described in detail in the following section.

7.4 Performance Analysis of SATzilla

The effectiveness of our new techniques was investigated by evaluating the four SATzilla versions (Table 7.7): SATzilla07(S,D′), SATzilla07(S+,D+), SATzilla07+(S++,D+) and SATzilla07∗(S++,D+).
To evaluate their performance, we constructed a simulated SAT competition using the same scoring function as in the 2007 SAT Competition, but differing in a number of important aspects. The participants in our competition were the 19 solvers listed in Tables 7.3, 7.4 and 7.5 (all solvers were considered for all categories), and the test instances were Eo ∪ En, as described in Tables 7.1 and 7.2. Furthermore, our computational infrastructure differed from that of the 2007 competition, and we also used a shorter cutoff time of 1200 seconds. For these reasons, some solvers ranked slightly differently in our simulated competition than in the 2007 competition.

SATzilla version          Pre-Solvers (time)           Component solvers
SATzilla07(S,D′r)         March dl04 (5); SAPS (2)     Kcnfs06, March dl04, Rsat 1.03
SATzilla07(S+,D+r)        March dl04 (5); SAPS (2)     Kcnfs06, March dl04, March ks, Minisat07
SATzilla07+(S++,D+r)      SAPS (2); Kcnfs06 (2)        Kcnfs06, March ks, Minisat07, Ranov, Ag2wsat+, Gnovelty+

Table 7.9: SATzilla's configurations for the Random category; cutoff times for pre-solvers are specified in CPU seconds.

7.4.1 Random Category

Table 7.9 shows the configuration of the three SATzilla versions for the Random category. Note that the automatic solver selection in SATzilla07+(S++,D+r) included different solvers than those used in SATzilla07(S+,D+r); in particular, it chose three local search solvers, Ranov, Ag2wsat+ and Gnovelty+, that were not available to SATzilla07. Also, the automatic pre-solver selection chose a different order and cutoff times of pre-solvers than our manual selection: it chose to first run SAPS for two CPU seconds, followed by two CPU seconds of Kcnfs06.
Even though running the local search algorithm SAPS did not help with solving unsatisfiable instances, we see in Figure 7.1 (left) that SAPS solved many more instances than March dl04 in the first few seconds.

Table 7.10 shows the performance of the different versions of SATzilla compared to the best solvers in the Random category. All versions of SATzilla outperformed every non-portfolio solver in terms of average runtime and number of instances solved. SATzilla07+ and SATzilla07∗, the variants optimizing score rather than another objective function, also clearly achieved higher scores than the non-portfolio solvers. This was not always the case for the other versions; for example, SATzilla07(S+,D+r) achieved only 86.6% of the score of the best solver, Gnovelty+ (where scores were computed based on a set of 20 reference solvers: the 19 solvers from Tables 7.3, 7.4 and 7.5, as well as SATzilla07(S+,D+r)). Table 7.10 and Figure 7.1 show that adding complete solvers and training data did not greatly improve SATzilla07. At the same time, substantial improvements were achieved by the new mechanisms in SATzilla07+, leading to 11% more instances solved, a reduction of average runtime by more than half, and an increase in score of over 50%. Interestingly, the performance of the more general SATzilla07∗(S++,D+), trained on instance mix ALL and tested on the Random category, was quite close to that of the best version of SATzilla specifically designed for Random instances, SATzilla07+(S++,D+r). Note that due to their excellent performance on satisfiable instances, the local search solvers in Table 7.10 (Gnovelty+ and the Ag2wsat variants) tended to have higher overall scores than the complete solvers (Kcnfs04 and March ks), even though they solved fewer instances and in particular could not solve any unsatisfiable instance.
In the 2007 SAT Competition, however, all winners of the Random SAT+UNSAT category were complete solvers, which led to speculation that local search solvers were not considered in this category (while in the Random SAT category, all winners were indeed local search solvers).

Figure 7.1 presents CDFs summarizing the performance of the best non-portfolio solvers, the SATzilla solvers and two oracles. All non-portfolio solvers omitted from the figure had CDFs below those shown. The oracles represent ideal versions of SATzilla that choose among their component solvers perfectly and without any computational cost. More specifically, given an instance, an oracle picks the fastest algorithm; it is allowed to consider SAPS (with a maximum runtime of 10 CPU seconds) and any solver from the given set (S for one oracle and S++ for the other).

Table 7.11 indicates how often each component solver of SATzilla07+(S++,D+r) was selected, how often it was successful, and its average runtime.

Solver                    Avg. runtime [s]   Solved [%]   Performance score
Kcnfs04                   852                32.1         38309
March ks                  351                78.4         113666
Ag2wsat0                  479                62.0         119919
Ag2wsat+                  510                59.1         110218
Gnovelty+                 410                67.4         131703
SATzilla07(S,D′r)         231                85.4         — (86.6%)
SATzilla07(S+,D+r)        218                86.5         — (88.7%)
SATzilla07+(S++,D+r)      84                 97.8         189436 (143.8%)
SATzilla07∗(S++,D+)       113                95.8         — (137.8%)

Table 7.10: The performance of SATzilla compared to the best solvers on Random. The cutoff time was 1200 CPU seconds; SATzilla07∗(S++,D+) was trained on ALL. Scores were computed based on 20 reference solvers: the 19 solvers from Tables 7.3, 7.4 and 7.5, as well as one version of SATzilla. To compute the score for each non-SATzilla solver, the SATzilla version used as a member of the set of reference solvers was SATzilla07+(S++,D+r). Since we did not include SATzilla versions other than SATzilla07+(S++,D+r) in the set of reference solvers, scores for these solvers are incomparable to the other scores given here, and therefore we do not report them.
Instead, for each SATzilla solver, we indicate in parentheses its performance score as a percentage of the highest score achieved by a non-portfolio solver, given a reference set in which the appropriate SATzilla solver took the place of SATzilla07+(S++,D+r).

Figure 7.1: Left: CDFs for SATzilla07+(S++,D+r) and the best non-portfolio solvers on Random; right: CDFs for the different versions of SATzilla on Random shown in Table 7.9, where SATzilla07∗(S++,D+) was trained on ALL. All other solvers' CDFs are below the ones shown here.

Pre-Solver (Pre-Time)   Solved [%]   Avg. Runtime [CPU sec]
SAPS (2)                52.2         1.1
March dl04 (2)          9.6          1.68

Selected Solver   Selected [%]   Success [%]   Avg. Runtime [CPU sec]
March dl04        34.8           96.2          294.8
Gnovelty+         28.8           93.9          143.6
March ks          23.9           92.6          213.3
Minisat07         4.4            100           61.0
Ranov             4.0            100           6.9
Ag2wsat+          4.0            77.8          357.9

Table 7.11: The solvers selected by SATzilla07+(S++,D+r) for the Random category. Note that the column "Selected [%]" shows the percentage of instances remaining after pre-solving for which each algorithm was selected; this column sums to 100%. Cutoff times for pre-solvers are specified in CPU seconds.

We found that the solvers picked by SATzilla07+(S++,D+r) solved the given instance in most cases. Another interesting observation is that when a solver's success ratio was high, its average runtime tended to be lower.

7.4.2 Crafted Category

The configurations of the three SATzilla versions designed for the Crafted category are shown in Table 7.12.
Again, SATzilla07+(S++,D+h) included three local search solvers, Ranov, Ag2wsat+ and Gnovelty+, which were not available to SATzilla07. As with the manual choice in SATzilla07, the automatic pre-solver selection chose to run March dl04 for five CPU seconds. Unlike the manual selection, it abstained from using SAPS (or indeed any other solver) as a second pre-solver.

Table 7.13 shows the performance of the different versions of SATzilla compared to the best solvers for category Crafted. Here, about half of the observed performance improvement was achieved by using more solvers and more training data; the other half was due to the improvements in SATzilla07+. Note that for the Crafted category, SATzilla07∗(S++,D+) performed quite poorly. We attribute this to a weakness of the feature-based classifier on Crafted instances, an issue discussed further in Section 7.4.4.

Table 7.14 indicates how often each component solver of SATzilla07+(S++,D+h) was selected, how many problem instances it solved, and its average runtime on those runs.

SATzilla version        Pre-Solver (time)            Components
SATzilla07(S,D′h)       March dl04 (5); SAPS (2)     Kcnfs06, March dl04, Minisat, Rsat 1.03
SATzilla07(S+,D+h)      March dl04 (5); SAPS (2)     Vallst, Zchaff rand, TTS, MXC, March ks, Minisat07, Rsat 2.0
SATzilla07+(S++,D+h)    March ks (5)                 Eureka, March dl04, Minisat, Rsat 1.03, Vallst, TTS, Picosat, MXC, March ks, TinisatElite, Minisat07, Rsat 2.0, Ranov, Ag2wsat0, Gnovelty+

Table 7.12: SATzilla's configurations for the Crafted category.

Solver                    Avg. runtime [s]   Solved [%]   Performance score
TTS                       729                41.1         40669
MXC                       527                61.9         43024
March ks                  494                63.9         68859
Minisat07                 438                68.9         59863
March dl04                408                72.4         73226
SATzilla07(S,D′h)         284                80.4         — (93.5%)
SATzilla07(S+,D+h)        203                87.4         — (118.8%)
SATzilla07+(S++,D+h)      131                95.6         112287 (153.3%)
SATzilla07∗(S++,D+)       215                88.0         — (110.5%)

Table 7.13: The performance of SATzilla compared to the best solvers on Crafted.
Scores for non-portfolio solvers were computed using a reference set in which the only SATzilla solver was SATzilla07+(S++,D+h). Cutoff time: 1200 CPU seconds; SATzilla07∗(S++,D+) was trained on ALL.

Figure 7.2: Left: CDFs for SATzilla07+(S++,D+h) and the best non-portfolio solvers on Crafted; right: CDFs for the different versions of SATzilla on Crafted shown in Table 7.12, where SATzilla07∗(S++,D+) was trained on ALL. All other solvers' CDFs are below the ones shown here.

Pre-Solver (Pre-Time)   Solved [%]   Avg. Runtime [CPU sec]
March ks (5)            39.0         3.2

Selected Solver   Selected [%]   Success [%]   Avg. Runtime [CPU sec]
Minisat07         40.4           89.3          205.1
TTS               11.5           91.7          133.2
MXC               7.2            93.3          310.5
March ks          7.2            100           544.7
Eureka            5.8            100           0.34
March dl04        5.8            91.7          317.6
Rsat 1.03         4.8            100           185.1
Picosat           3.9            100           1.7
Ag2wsat0          3.4            100           0.5
TinisatElite      2.9            100           86.5
Ranov             2.9            83.3          206.1
Minisat 2.0       1.4            66.7          796.5
Rsat 2.0          1.4            100           0.9
Gnovelty+         1.0            100           3.2
Vallst            0.5            100           <0.01

Table 7.14: The solvers selected by SATzilla07+(S++,D+h) for the Crafted category; for each solver, we report how often it was selected, how many instances it solved, and its average runtime on those runs.

SATzilla version        Pre-Solver (time)             Components
SATzilla07(S,D′i)       Rsat 1.03 (2)                 Eureka, March dl04, Minisat, Rsat 1.03
SATzilla07(S+,D+i)      Rsat 2.0 (2)                  Eureka, March dl04, Minisat, Zchaff Rand, TTS, Picosat, March ks
SATzilla07+(S++,D+i)    Rsat 2.0 (10); Gnovelty+ (2)  Eureka, March dl04, Minisat, Rsat 1.03, TTS, Picosat, Minisat07, Rsat 2.0

Table 7.15: SATzilla's configuration for the Industrial category.

There are many solvers that SATzilla07+(S++,D+h) picked quite rarely; however, in most cases their success ratios are close to 100% and their average runtimes are very low.

7.4.3 Industrial Category

Table 7.15 shows the configuration of the three SATzilla versions designed for the Industrial category. Local search solvers performed poorly on the instances in this category, with the best local search solver, Ag2wsat0, solving only 23% of the instances within the cutoff time. Consequently, no local search solver was selected by the automatic solver subset selection for SATzilla07+(S++,D+i). However, automatic pre-solver selection did include the local search solver Gnovelty+ as the second pre-solver, to be run for 2 CPU seconds after 10 CPU seconds of Rsat 2.0.

Table 7.16 compares the performance of the different versions of SATzilla and the best solvers on Industrial instances. It is not surprising that more training data and more solvers helped SATzilla07 to improve in terms of all our metrics (average runtime, percentage of instances solved and score).
A bigger improvement was due to the new mechanisms in SATzilla07+, which led to SATzilla07+(S++,D+i) outperforming every non-portfolio solver with respect to every metric, particularly in terms of performance score.

Solver                    Avg. runtime [s]   Solved [%]   Performance score
Rsat 1.03                 353                80.8         52740
Rsat 2.0                  365                80.8         51299
Picosat                   282                85.9         66561
TinisatElite              452                70.8         40867
Minisat07                 372                76.6         60002
Eureka                    349                83.2         71505
SATzilla07(S,D′i)         298                87.6         — (91.3%)
SATzilla07(S+,D+i)        262                89.0         — (98.2%)
SATzilla07+(S++,D+i)      233                93.1         79724 (111.5%)
SATzilla07∗(S++,D+)       239                92.7         — (104.8%)

Table 7.16: The performance of SATzilla compared to the best solvers on Industrial. Scores for non-portfolio solvers were computed using a reference set in which the only SATzilla solver was SATzilla07+(S++,D+i). Cutoff time: 1200 CPU seconds; SATzilla07∗(S++,D+) was trained on ALL.

Note that the general SATzilla version SATzilla07∗(S++,D+), trained on ALL, achieved performance very close to that of SATzilla07+(S++,D+i) on the Industrial data set in terms of average runtime and percentage of solved instances.

As can be seen from Figure 7.3, the performance improvements achieved by SATzilla over non-portfolio solvers were smaller for the Industrial category than for the other categories. Note that the best Industrial solver, Picosat, performed very well, solving 85.9% of the instances within the cutoff time of 1200 CPU seconds. Recall that this means it solved 85.9% of the instances that could be solved by at least one solver. Compared to our other data sets, it would appear either that solvers exhibited more tightly correlated behavior on Industrial instances or that instances in this category exhibited greater variability in hardness. Nevertheless, SATzilla07+(S++,D+i) had a significantly smaller average runtime (by 17%) and solved 7.2% more instances than the best component solver, Picosat.
Likewise, the score for SATzilla07+(S++,D+i) was 11.5% higher than that of the top-ranking component solver (in terms of score), Eureka.

Table 7.17 indicates how often each component solver of SATzilla07+(S++,D+i) was selected, how many problem instances it solved, and its average runtime for these runs.

[Figure 7.3 (runtime CDFs: % instances solved vs. runtime in CPU seconds). Left: CDFs for SATzilla07+(S++,D+i) and the best non-portfolio solvers on Industrial; right: CDFs for the different versions of SATzilla on Industrial shown in Table 7.15, where SATzilla07∗(S++,D+) was trained on ALL. All other solvers' CDFs (including Eureka's) are below the ones shown here.]

Pre-Solver (Pre-Time) | Solved [%] | Avg. Runtime [CPU sec]
Rsat 2.0 (10) | 38.1 | 6.8
Gnovelty+ (2) | 0.3 | 2.0

Selected Solver | Selected [%] | Success [%] | Avg. Runtime [CPU sec]
Eureka (BACKUP) | 29.1 | 88.5 | 385.4
Eureka | 15.1 | 100 | 394.2
Picosat | 14.5 | 96.2 | 179.6
Minisat07 | 14.0 | 84.0 | 306.3
Minisat 2.0 | 12.3 | 68.2 | 709.2
March dl04 | 8.4 | 86.7 | 180.8
TTS | 3.9 | 100 | 0.7
Rsat 2.0 | 1.7 | 100 | 281.6
Rsat 1.03 | 1.1 | 100 | 10.6

Table 7.17: The solvers selected by SATzilla07+(S++,D+i) for the Industrial category.
In this case, the backup solver Eureka was used for problem instances for which feature computation timed out and the pre-solvers did not produce a solution.

7.4.4 ALL

There are four versions of SATzilla specialized for category ALL. Their detailed configurations are listed in Table 7.18.

SATzilla | Pre-Solver (time) | Components
SATzilla07(S,D') | March dl04 (5); SAPS (2) | Eureka, Kcnfs06, March dl04, Minisat, Zchaff rand
SATzilla07(S+,D+) | March dl04 (5); SAPS (2) | Eureka, March dl04, Zchaff rand, Kcnfs04, TTS, Picosat, March ks, Minisat07
SATzilla07+(S++,D+) | SAPS (2); March ks (2) | Eureka, Kcnfs06, Rsat 1.03, Zchaff rand, TTS, MXC, TinisatElite, Rsat 2.0, Ag2wsat+, Ranov
SATzilla07∗(S++,D+) | SAPS (2); March ks (2) | Eureka, Kcnfs06, March dl04, Minisat, Rsat 1.03, Picosat, MXC, March ks, Minisat07, Ag2wsat+, Gnovelty+

Table 7.18: SATzilla's configurations for the ALL category.

The results of automatic pre-solver selection were identical for SATzilla07+ and SATzilla07∗: both chose to first run the local search solver SAPS for two CPU seconds, followed by two CPU seconds of March ks. These solvers were similar to our manual selection, but their order was reversed. For solver subset selection, SATzilla07+ and SATzilla07∗ yielded somewhat different results, but both of them kept two local search algorithms: Ag2wsat+ and Ranov, and Ag2wsat+ and Gnovelty+, respectively.

Table 7.19 compares the performance of the four versions of SATzilla on our ALL test set. Roughly equal improvements in terms of all our performance metrics were due to more training data and solvers on the one hand, and to the improvements in SATzilla07+ on the other hand. The best performance in terms of all our performance metrics was obtained by SATzilla07∗(S++,D+). Recall that the only difference between SATzilla07+(S++,D+) and SATzilla07∗(S++,D+) was the use of more general hierarchical hardness models, as described in Section 7.2.3.

Solver | Avg. runtime [s] | Solved [%] | Performance score
Rsat 1.03 | 542 | 61.1 | 131399
Kcnfs04 | 969 | 21.3 | 46695
TTS | 939 | 22.6 | 74616
Picosat | 571 | 57.7 | 135049
March ks | 509 | 62.9 | 202133
TinisatElite | 690 | 47.3 | 93169
Minisat07 | 528 | 61.8 | 162987
Gnovelty+ | 684 | 43.9 | 156365
March dl04 | 509 | 62.7 | 205592
SATzilla07(S,D') | 282 | 83.1 | — (125.0%)
SATzilla07(S+,D+) | 224 | 87.0 | — (139.2%)
SATzilla07+(S++,D+) | 194 | 91.1 | — (158%)
SATzilla07∗(S++,D+) | 172 | 92.9 | 344594 (167.6%)

Table 7.19: The performance of SATzilla compared to the best solvers on ALL. Scores for non-portfolio solvers were computed using a reference set in which the only SATzilla solver was SATzilla07∗(S++,D+). Cutoff time: 1200 CPU seconds.

Note that using a classifier is of course not as good as using an oracle for determining the distribution an instance comes from; thus, the success ratios of the solvers selected by SATzilla07∗ over the instances in the test set for distribution ALL (see Table 7.20) were slightly lower than those for the solvers picked by SATzilla07+ for each of the distributions individually (see Tables 7.11, 7.14, and 7.17). However, when compared to SATzilla07+ on distribution ALL, SATzilla07∗ performed significantly better, achieving overall performance improvements of 11.3% lower average runtime, 1.8% more solved instances, and 9.6% higher score.
This supports our initial hypothesis that SATzilla07∗ would perform slightly worse than the specialized versions of SATzilla07+ in each single category, yet would yield the best result when applied to a broader and more heterogeneous set of instances.

The runtime cumulative distribution function (Figure 7.4, right) shows that SATzilla07∗(S++,D+) dominated the other versions of SATzilla on ALL and solved approximately 30% more instances than the best non-portfolio solver, March dl04 (Figure 7.4, left).

[Figure 7.4 (runtime CDFs: % instances solved vs. runtime in CPU seconds). Left: CDF for SATzilla07∗(S++,D+) and the best non-portfolio solvers on ALL; right: CDFs for the different versions of SATzilla on ALL shown in Table 7.18. All other solvers' CDFs are below the ones shown here.]

Pre-Solver (Pre-Time) | Solved [%] | Avg. Runtime [CPU sec]
SAPS (2) | 33.0 | 1.4
March ks (2) | 13.9 | 1.6

Selected Solver | Selected [%] | Success [%] | Avg. Runtime [CPU sec]
Minisat07 | 21.2 | 85.5 | 247.5
March dl04 | 14.5 | 84.0 | 389.5
Gnovelty+ | 12.5 | 85.2 | 273.2
March ks | 9.1 | 89.8 | 305.6
Eureka (BACKUP) | 8.9 | 89.7 | 346.1
Eureka | 7.2 | 97.9 | 234.6
Picosat | 6.6 | 90.7 | 188.6
Kcnfs06 | 6.5 | 95.2 | 236.3
MXC | 5.5 | 88.9 | 334.0
Rsat 1.03 | 4.0 | 80.8 | 364.9
Minisat 2.0 | 3.5 | 56.5 | 775.7
Ag2wsat+ | 0.5 | 33.3 | 815.7

Table 7.20: The solvers selected by SATzilla07∗(S++,D+) for the ALL category.

Table 7.21 shows the performance of the general classifier in SATzilla07∗(S++,D+). We note several patterns. Firstly, classification performance for Random and Industrial instances was much better than for Crafted instances.
Secondly, for Crafted instances, most misclassifications were not due to a misclassification of the instance type, but rather due to the satisfiability status. Finally, one can observe that Random instances were almost perfectly classified as Random, and only very few other instances were classified as Random, while Crafted and Industrial instances were confused somewhat more often. The comparably poor classification performance for Crafted instances partly explains why SATzilla07∗(S++,D+) did not perform as well for the Crafted category as for the others.

(true class) | R, sat | R, unsat | H, sat | H, unsat | I, sat | I, unsat
classified R, sat | 92% | 5% | 1% | – | 1% | 1%
classified R, unsat | 4% | 94% | – | 1% | – | 1%
classified H, sat | – | – | 57% | 38% | – | 5%
classified H, unsat | – | 1% | 23% | 71% | 1% | 4%
classified I, sat | – | – | 8% | – | 81% | 11%
classified I, unsat | – | – | – | 5% | 6% | 89%

Table 7.21: Confusion matrix for the 6-way classifier on data set ALL.

7.5 Further Improvements over the Years

Our group is actively working on introducing new techniques for improving SATzilla's performance. Currently, SATzilla is still considered the state of the art, even when compared to many new portfolio-building techniques.

7.5.1 SATzilla09 for Industrial

SATzilla07 achieved less improvement in the Industrial domain than in the others. One reason is that Industrial instances are often very large, so feature computation can be very costly and take a large proportion of the total CPU budget. The other reason is that the state-of-the-art solvers for Industrial instances are complete algorithms based on the DPLL procedure. To better handle Industrial instances, we introduced two new techniques in SATzilla09.

1. New instance features. After 2007, we introduced several new classes of instance features: 18 features based on clause learning, 18 based on survey propagation, and 5 based on graph diameter. For the Industrial category, we also discarded 12 computationally expensive features based on DPLL probing and graph diameter.

2. Prediction of feature computation time.
Before feature computation, we introduced an additional step that predicts the time required for feature computation. If that prediction exceeds two minutes, we run the backup solver; otherwise, we continue with feature computation. In order to predict the feature computation time for an instance based on its number of variables and clauses, we built a simple linear regression model with quadratic basis functions. This was motivated by the fact that SATzilla's feature computation timed out on over 50% of the Industrial instances in the 2007 SAT Competition and the 2008 SAT Race. By applying feature cost prediction, we force SATzilla to use a default solver on very large Industrial instances without paying the large cost of feature computation.

With the above two improvements, as well as updated candidate solvers and training data, SATzilla09 performed very well in the 2009 SAT Competition. For the first time, SATzilla won a gold medal in the INDUSTRIAL category.

7.5.2 SATzilla2012 with New Algorithm Selector

Previous versions of SATzilla perform algorithm selection based on empirical performance models that predict each algorithm's performance. However, the goal of algorithm selection is to select solvers so as to optimize some performance objective. If multiple solvers have similar performance on instance i, selecting any of them does not reduce the performance of SATzilla. By contrast, if solvers have very different performance on instance j, then picking an incorrect solver for j is more harmful than picking an incorrect solver for i. Therefore, the cost of misclassification depends on the performance difference among the candidate solvers. The new SATzilla2012 [216] is based on cost-sensitive classification models that punish misclassifications in direct proportion to their impact on portfolio performance. In addition, we also introduced a new procedure that generates a stand-alone SATzilla executable based on models learned within Matlab.
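As an illustration of why a cost-sensitive loss changes the learned selector, the following sketch trains one weighted classifier per solver pair and lets the pairwise winners vote. The feature values, runtimes, and the one-feature decision stump are invented stand-ins for SATzilla2012's instance features and cost-sensitive decision forests; only the weighting scheme — each training instance weighted by the runtime gap between the two solvers — reflects the idea described above.

```python
from itertools import combinations

# Invented training data: one numeric feature per instance, plus the runtimes
# (CPU sec) of three hypothetical solvers on those instances.
features = [0.1, 0.2, 0.8, 0.9]
runtimes = {
    "A": [5.0, 6.0, 400.0, 500.0],    # strong on "small" instances
    "B": [300.0, 250.0, 10.0, 8.0],   # strong on "large" instances
    "C": [40.0, 35.0, 45.0, 50.0],    # mediocre everywhere
}

def train_stump(xs, labels, weights):
    """One-feature decision stump minimizing the cost-weighted
    misclassification loss (the cost-sensitive part)."""
    best = None
    for t in xs:
        for below, above in ((0, 1), (1, 0)):
            cost = sum(w for x, y, w in zip(xs, labels, weights)
                       if (below if x <= t else above) != y)
            if best is None or cost < best[0]:
                best = (cost, t, below, above)
    _, t, below, above = best
    return lambda x: below if x <= t else above

def select_solver(x_new):
    """Pairwise voting: each pair's stump votes for its predicted winner."""
    votes = {s: 0 for s in runtimes}
    for a, b in combinations(sorted(runtimes), 2):
        # Weight = performance gap, so cheap mistakes barely matter.
        gaps = [abs(ra - rb) for ra, rb in zip(runtimes[a], runtimes[b])]
        labels = [0 if ra <= rb else 1
                  for ra, rb in zip(runtimes[a], runtimes[b])]
        winner = train_stump(features, labels, gaps)(x_new)
        votes[a if winner == 0 else b] += 1
    return max(votes, key=votes.get)
```

Instances on which the two solvers differ by hundreds of CPU seconds dominate the weighted misclassification cost, so the learned split is driven by exactly the cases where choosing wrongly is expensive.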
These two improvements are described in detail in what follows.

1. New algorithm selector. Our new selection procedure is based on classification models with a cost-sensitive loss function. To the best of our knowledge, this is the first time this approach has been applied to algorithm selection. We construct cost-sensitive decision forests (DFs) as collections of 99 cost-sensitive decision trees for every pair of algorithms. Each DF casts a vote for the better solver of its pair, and the solver receiving the most votes is selected.

2. New SATzilla executable. To avoid the hassle of installing the free Matlab runtime environment (MRE) for making predictions based on models built with Matlab, we converted our Matlab-built models to Java and provide Java code to make predictions using them. Thus, running SATzilla2012 now only requires the scripting language Ruby (which is used for running the SATzilla pipeline).

SATzilla2012 won first place in Application, Hard Combinatorial, and Sequential Portfolio; second place in Application, Hard Combinatorial, and Random SAT; and third place in Random SAT (see [124] for detailed information).

7.6 Conclusions

Algorithms can be combined into portfolios to build a whole greater than the sum of its parts. We have significantly extended earlier work on algorithm portfolios for SAT that select solvers on a per-instance basis using empirical hardness models for runtime prediction. We have demonstrated the effectiveness of our portfolio construction method, SATzilla07, on four large sets of SAT competition instances. The experiments reveal that the SATzilla07 portfolio solvers always outperformed their components.
Furthermore, SATzilla07's excellent performance in the 2007 SAT Competition demonstrates the practical effectiveness of our approach.

Following this work, we pushed the SATzilla approach further beyond SATzilla07. For the first time, we showed that portfolios can optimize complex scoring functions and integrate local search algorithms as component solvers. Furthermore, we showed how to automate the process of pre-solver selection, one of the last aspects of our approach that was previously based on manual engineering. In 2012, we introduced a completely new algorithm selector based on cost-sensitive classification models that punish misclassifications in direct proportion to their impact on portfolio performance. As demonstrated by extensive computational experiments and the competition results, these enhancements improved SATzilla07's performance substantially.

SATzilla is now at a stage where it can be applied "out of the box", given a set of possible component solvers along with representative training and validation instances. In an automated built-in meta-optimization process, the component solvers to be used and the solvers to be used as pre-solvers are automatically determined from the given set of solvers, without any human effort. The computational bottleneck is executing the possible component solvers on a representative set of instances in order to obtain adequate runtime data for building reasonably accurate empirical hardness models. However, these computations can be parallelized very easily and require no human intervention, only computer time, which becomes ever cheaper. The code for building empirical hardness models and SATzilla portfolios that use these models is available online at http://www.cs.ubc.ca/labs/beta/Projects/SATzilla.

SATzilla's performance ultimately depends on the power of all its component solvers and automatically gets better as they are improved.
Furthermore, SATzilla takes advantage of solvers that are competitive only for certain kinds of instances and perform poorly otherwise; thus, SATzilla's success demonstrates the value of such solvers. Indeed, the identification of more such solvers, which are otherwise easily overlooked, still has the potential to further improve SATzilla's performance substantially.

Chapter 8
Evaluating Component Solver Contributions to Portfolio-Based Algorithm Selectors

Having established SATzilla's effectiveness in 2007 and 2009, our team decided not to compete in the solver track of the 2011 competition, to avoid discouraging new work on (non-portfolio) solvers. Instead, we entered SATzilla in a new "analysis track", hoping other portfolio authors would do the same. However, other portfolio-based methods did feature prominently among the winners in every solver track: the algorithm selection and scheduling system 3S [112] and the simple, yet efficient parallel portfolio ppfolio [171] won a combined seven gold and 16 other medals (out of 18 categories overall).

Considering that portfolio-based solvers often achieve state-of-the-art performance, we believe that the community could benefit from rethinking how the value of individual solvers is measured. In this chapter, we demonstrate techniques for analyzing the extent to which a state-of-the-art (SOTA) portfolio's performance depends on each of its component solvers. Such measures of solver contributions may also be applied to other portfolio approaches, including parallel portfolios. We hope that this analysis serves as an encouragement to the community to focus on creative approaches that complement the strengths of existing solvers, even though they may (at least initially) be effective only on certain classes of instances.[1]

8.1 Measuring the Value of a Solver

One of the main reasons for holding a solver competition is to answer the question: what is the current state of the art (SOTA)?
The traditional answer to this question has been the winner of the respective category of the competition; we call such a winner a single best solver (SBS). However, as clearly demonstrated by the efficacy of algorithm portfolios, different solver strategies are (at least sometimes) complementary. This fact suggests a second answer to the SOTA question: the virtual best solver (VBS), defined as the best competition entry on a per-instance basis. The VBS typically achieves much better performance than the SBS, and does provide a useful theoretical upper bound on the performance currently achievable. However, this bound is typically not tight: the VBS is not an actual solver, because it only indicates which underlying solver to run after the performance of each solver on a given instance has been measured, and thus the VBS cannot be (efficiently) run on new instances. Here, we propose a third answer to the SOTA question: the best portfolio that can be constructed in a fully automated fashion from the available solvers; we call such a portfolio a SOTA portfolio. Since algorithm portfolios often substantially outperform their component solvers, SOTA portfolios can be expected to achieve better performance than SBSs; unlike the VBS, a SOTA portfolio is an executable algorithm that can be run on novel instances.

The most natural way of assessing the performance of a solver is by means of some statistic of its performance over a set (or distribution) of instances, such as the number of instances solved within a time budget, or its average runtime on an instance set.
While there is value in these natural performance measures, we believe that they are not sufficient for capturing the value a solver brings to the community.

[1] This chapter is based on joint work with Frank Hutter, Holger Hoos, and Kevin Leyton-Brown [215].

Take, for example, two solvers MiniSAT'++ and NewSAT, where MiniSAT'++ is based on MiniSAT [46] and improves some of its components, while NewSAT is a (hypothetical) radically different solver that performs extremely well on a limited class of instances and poorly elsewhere. While solver MiniSAT'++ has a good chance of winning medals in the SAT competition's Application track, solver NewSAT may not even be submitted, since (due to its poor average performance) it would be unlikely even to survive Phase 1 of the competition. However, MiniSAT'++ may only very slightly improve on the previous (MiniSAT-based) incumbent's performance, while NewSAT might represent deep new insights into the solution of instances that are intractable for all other known techniques.

The notion of state-of-the-art (SOTA) contributors [193] captures a solver's value to the community much more effectively than does average algorithm performance. The drawback is that it describes idealized solver contributions rather than contributions to an actual executable method. We propose instead measuring the SOTA contribution of a solver as its contribution to a SOTA portfolio that can be automatically constructed from the available solvers. This new notion resembles the prior notion of SOTA contributors, but directly quantifies their contributions to an executable portfolio solver, rather than to an abstract virtual best solver (VBS).

We must still describe exactly how we should assess a solver A's contribution to a portfolio. One may measure the frequency with which the portfolio selects A, or the number of instances the portfolio solves using A.
However, neither of these measures accounts for the fact that if A were not available, other solvers would be chosen instead and might perform nearly as well. (Consider again solver MiniSAT'++, and presume that it is chosen frequently by a portfolio. However, if it had not been created, the set of solved instances might be the same, and the portfolio's performance only slightly worse.) We argue that a solver A should be judged by its marginal contribution to the SOTA: the difference between the SOTA portfolio's performance including A and the portfolio's performance excluding A. (Here, we measure portfolio performance as the percentage of instances solved, since this is the main performance metric in the SAT competition.)

8.2 Experimental Setup

Solvers. In order to evaluate the SOTA portfolio contributions of the SAT competition solvers, we constructed SATzilla portfolios using all sequential, non-portfolio solvers from Phase 2 of the 2011 SAT Competition as component solvers: 9, 15, and 18 candidate solvers for the Random, Crafted, and Application categories, respectively. (These solvers are listed in Table 8.2; see, e.g., [124] for detailed information on them.) We hope that in the future, fully automated construction procedures will also be made publicly available for other portfolio builders, such as 3S [112]; if so, our analysis could be easily and automatically repeated for them. For each category, we also computed the performance of an oracle over the sequential non-portfolio solvers (an idealized algorithm selector that picks the best solver for each instance) and of the virtual best solver (VBS; an oracle over all 17, 25 and 31 entrants for the Random, Crafted and Application categories, respectively). These oracles do not represent the current state of the art in SAT solving, since they cannot be run on new instances; however, they serve as upper bounds on the performance that any portfolio-based selector over these solvers could achieve.
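Both the oracle and the marginal-contribution measure reduce to simple computations on a solver-by-instance runtime matrix. The sketch below, with an invented matrix and the competition's 5000-second cutoff, uses the oracle itself as the idealized portfolio; in the chapter, contributions are of course measured against actual SATzilla portfolios rather than the oracle.

```python
CUTOFF = 5000  # CPU sec cutoff used in the 2011 competition

# Invented runtimes; CUTOFF denotes a timed-out run.
runtimes = {
    "solver1": [10, 20, CUTOFF, CUTOFF],
    "solver2": [CUTOFF, 15, 30, CUTOFF],
    "solver3": [12, CUTOFF, CUTOFF, 40],
}

def oracle_solved(solvers):
    """Fraction of instances solved by a per-instance oracle over `solvers`."""
    n = len(next(iter(runtimes.values())))
    best = [min(runtimes[s][i] for s in solvers) for i in range(n)]
    return sum(t < CUTOFF for t in best) / n

def marginal_contribution(solver):
    """Drop in oracle performance when `solver` is removed from the pool."""
    rest = [s for s in runtimes if s != solver]
    return oracle_solved(list(runtimes)) - oracle_solved(rest)
```

In this toy matrix, solver1 has zero marginal contribution because every instance it solves is also solved by another solver, while solver3 contributes 0.25: it is the only solver that handles the fourth instance.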
We also compared to the performance of the winners of all three categories (including other portfolio-based solvers).

Features. We used a subset of 115 features from Figure 3.1. They fall into 9 categories: problem size, variable graph, clause graph, variable-clause graph, balance, proximity to Horn formula, local search probing, clause learning, and survey propagation. Feature computation averaged 31.4, 51.8 and 158.5 CPU seconds on Random, Crafted, and Application instances, respectively; this time counted as part of SATzilla's runtime budget.

Methods. We constructed SATzilla11 portfolios using the improved procedure described in Section 7.5.2. We set the feature computation cutoff tf = 500 CPU seconds (a tenth of the time allocated for solving an instance). To demonstrate the effectiveness of our improvements, we also constructed a version of SATzilla09 (which uses ridge regression models), using the same training data.

We used 10-fold cross-validation to obtain an unbiased estimate of SATzilla's performance. First, we eliminated all instances that could not be solved by any candidate solver (we denote this set of instances U). Then, we randomly partitioned the remaining instances (denoted S) into 10 disjoint sets. Treating each of these sets in turn as the test set, we constructed SATzilla using the union of the other 9 sets as training data, and measured SATzilla's runtime on the test set. Finally, we computed SATzilla's average performance across the 10 test sets.

To evaluate how important each solver was for SATzilla, for each category we quantified the marginal contribution of each candidate solver, as well as the percentage of instances solved by each solver during SATzilla's pre-solving (Pre1 or Pre2), backup, and main stages. Note that our use of cross-validation means that we constructed 10 different SATzilla portfolios using 10 different subsets ("folds") of instances.
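The evaluation protocol just described — discard the unsolvable set U, split the remaining set S into 10 folds, train on 9, and test on the held-out fold — can be sketched generically as follows; build_portfolio and evaluate are placeholders for the actual SATzilla training and measurement routines, and the instance representation is invented.

```python
import random

def cross_validate(instances, build_portfolio, evaluate, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, average the scores."""
    solvable = [i for i in instances if i["solvable"]]  # drop the set U
    random.Random(seed).shuffle(solvable)               # random partition of S
    folds = [solvable[j::k] for j in range(k)]
    scores = []
    for j in range(k):
        train = [x for fold in folds[:j] + folds[j + 1:] for x in fold]
        portfolio = build_portfolio(train)
        scores.append(evaluate(portfolio, folds[j]))
    return sum(scores) / k
```

Each instance thus appears in exactly one test fold, so the averaged score is an unbiased estimate of performance on unseen instances.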
These 10 portfolios can be qualitatively different (e.g., selecting different presolvers); we report aggregates over the 10 folds.

Data. Runtime data was provided by the organizers of the 2011 SAT competition. All feature computations were performed by Daniel Le Berre, using our code, on a quad-core computer with 4GB of RAM running Linux. Four out of 1200 instances (from the Crafted category) had no feature values, due to a database problem caused by duplicated file names. We treated these instances as timeouts for SATzilla, thus obtaining a lower bound on SATzilla's true performance.

8.3 Experimental Results

We begin by assessing the performance of our SATzilla portfolios, to confirm that they did indeed yield SOTA performance. Table 8.1 compares SATzilla11 to the other solvers discussed above. In all categories, SATzilla11 outperformed all of its component solvers. It also always outperformed SATzilla09, which in turn was slightly worse than the best component solver on Application.

Solver | Application: Runtime (Solved) | Crafted: Runtime (Solved) | Random: Runtime (Solved)
VBS | 1104 (84.7%) | 1542 (76.3%) | 1074 (82.2%)
Oracle | 1138 (84.3%) | 1667 (73.7%) | 1087 (82.0%)
SATzilla11 | 1685 (75.3%) | 2096 (66.0%) | 1172 (80.8%)
SATzilla09 | 1905 (70.3%) | 2219 (63.0%) | 1205 (80.3%)
Gold medalist | Glucose2: 1856 (71.7%) | 3S: 2602 (54.3%) | 3S: 1836 (68.0%)
Best comp. | Glucose2: 1856 (71.7%) | Clasp2: 2996 (49.7%) | Sparrow: 2066 (60.3%)

Table 8.1: Comparison of SATzilla11 to the VBS, an Oracle over its component solvers, SATzilla09, the 2011 SAT competition winners, and the best single SATzilla11 component solver for each category. We counted timed-out runs as 5000 CPU seconds (the cutoff).

SATzilla11 also outperformed each category's gold medalist (including portfolio solvers such as 3S and ppfolio). Note that this does not constitute a fair comparison of the underlying portfolio construction procedures, as SATzilla had access to data and solvers unavailable to the portfolios that competed in the solver track. This finding does, however, give us reason to believe that SATzilla portfolios either represent or at least closely approximate the best performance reachable by current methods. Indeed, in terms of instances solved, SATzilla11 reduced the gap between the gold medalists and the (upper performance bound defined by the) VBS by 27.7% on Application, by 53.2% on Crafted, and by 90.1% on Random. The remainder of this chapter studies the contributions of each component solver to these portfolios. To substantiate our previous claim that marginal contribution is the most informative measure, we contrast it here with various other measures.

Random. Figure 8.1 presents a comprehensive visualization of our findings for the Random category; Table 8.2 (top) shows the underlying data. First, Figure 8.1a considers the set of instances that could be solved by at least one solver, and shows the percentage that each component solver is able to solve. By this measure, the two best solvers were Sparrow and MPhaseSAT M. The former is a local search algorithm; it solved 362 satisfiable instances and no unsatisfiable ones. The latter is a complete search algorithm; it solved 255 + 104 = 359 instances. Neither of these solvers won medals in the combined SAT + UNSAT Random category, as they were outperformed by portfolio solvers that combined both local and complete solvers. Figure 8.1b shows a correlation matrix of component solver performance: the entry for solver pair (A,B) is computed as the Spearman rank correlation coefficient between A's and B's runtimes, with black and white representing perfect correlation and perfect independence, respectively. Two clusters are apparent: six local search solvers (EagleUP, Sparrow, Gnovelty+2, Sattime11, Adaptg2wsat11, and TNM), and two versions of the complete solver March, which achieved almost identical performance.
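Correlation matrices like the one in Figure 8.1b can be reproduced without numerical libraries: Spearman's coefficient is simply the Pearson correlation of the rank-transformed runtime vectors, with tied values receiving average ranks. The implementation below is a dependency-free sketch; it assumes neither input vector is constant.

```python
def ranks(xs):
    """1-based ranks; tied values receive their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the group of values tied with xs[order[i]].
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Runtime vectors of two near-interchangeable solvers (like the two March variants, with coefficient 0.9974) yield a value near 1; using ranks rather than raw runtimes makes the statistic robust to the heavy-tailed runtime distributions typical of SAT solvers.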
MPhaseSAT M performed well on both satisfiable and unsatisfiable instances; it was strongly correlated with the local search solvers on the satisfiable instance subset, and very strongly correlated with the March variants on the unsatisfiable subset.

[Figure 8.1: Visualization of results for category Random. (a) Percentage of solvable instances solved by each component solver alone; (b) correlation matrix of solver runtimes (darker = more correlated); (c) solver selection frequencies for SATzilla11 (names shown only for solvers picked for > 5% of the instances); (d) fraction of instances solved by SATzilla11's pre-solver(s), backup solver, and main-stage solvers; (e) runtime CDF for SATzilla11, the VBS, the Oracle, and the gold medalist 3S; (f) marginal contribution to the Oracle and to SATzilla11.]

Figure 8.1c shows the frequency with which different solvers were selected by SATzilla11. The main solvers selected in SATzilla11's main phase were the best-performing local search solver, Sparrow, and the best-performing complete solver, March. As shown in Figure 8.1d, the local search solver EagleUP was consistently chosen as a presolver and was responsible for more than half (51.3%) of the instances solved by SATzilla11 overall.
We observe that MPhaseSAT M did not play a large role in SATzilla11: it was only run for 2 out of 492 instances (0.4%). Although MPhaseSAT M achieved very strong overall performance, its versatility appears to have come at the price of not excelling on either satisfiable or unsatisfiable instances, being largely dominated by local search solvers on the former and by the March variants on the latter. Figure 8.1e shows that SATzilla11 closely approximated both the Oracle over its component solvers and the VBS, and stochastically dominated 3S, the gold medalist. Finally, Figure 8.1f shows the metric that we previously argued is the most important: each solver's marginal contribution to SATzilla11's performance. The most important portfolio contributor was Sparrow, with a marginal contribution of 4.9%, followed by EagleUP with a marginal contribution of 2.2%. EagleUP's low marginal contribution may be surprising at first glance (recall that it solved 51.3% of the instances SATzilla11 solved overall); however, 49.1% of these instances were also solvable by other local search solvers. Similarly, both March variants had very low marginal contributions (0% and 0.2%, respectively), since they were essentially interchangeable (correlation coefficient 0.9974). Further insight can be gained by examining the marginal contribution of sets of highly correlated solvers. The marginal contribution of the set of both March variants was 4.0% (MPhaseSAT M could still solve most instances), while the marginal contribution of the set of six local search solvers was 22.5% (nearly one-third of the satisfiable instances were not solvable by any complete solver).

Crafted. Overall, sufficiently many solvers were relatively uncorrelated in the Crafted category (Figure 8.2) to yield a portfolio with many important contributors.
The most important of these was Sol, which solved all of the 13.7% of the instances for which SATzilla11 selected it; without it, SATzilla11 would have solved 8.1% fewer instances. We observe that Sol was not identified as an important solver in the SAT competition results, ranking 11th of 24 solvers in the SAT+UNSAT category. Similarly, MPhaseSAT M, Glucose2, and Sattime each solved a 3.6% fraction of instances that would have gone unsolved without them.

[Figure 8.2: Visualization of results for category Crafted. (a) Percentage of solvable instances solved by each component solver alone; (b) correlation matrix of solver runtimes (darker = more correlated); (c) solver selection frequencies for SATzilla11 (names shown only for solvers picked for > 5% of the instances); (d) fraction of instances solved by SATzilla11's pre-solver(s), backup solver, and main-stage solvers; (e) runtime CDF for SATzilla11, the VBS, the Oracle, and the gold medalist 3S; (f) marginal contribution to the Oracle and to SATzilla11.]
(This is particularly noteworthy for MPhaseSAT M, which was only selected for 5% of the instances in the first place.) Considering the marginal contributions of sets of highly correlated solvers, we observed that {Clasp1, Clasp2} was the most important at 6.3%, followed by {Sattime, Sattime11} at 5.4%. {QuteRSat, CryptoMiniSat} and {PicoSAT, JMiniSat, Minisat07, RestartSAT, SApperloT} were relatively unimportant even as sets, with marginal contributions of 0.5% and 1.8%, respectively.

Application. All solvers in the Application category (Figure 8.3) exhibited rather highly correlated performance. It is thus not surprising that in 2011, no medals were awarded to portfolio solvers in the sequential Application track, and that in 2007 and 2009, SATzilla versions performed worst in this track, only winning a single gold medal in the 2009 satisfiable category. As mentioned earlier, SATzilla11 did outperform all competition solvers, but here the margin was only 3.6% (as compared to 12.8% and 11.7% for Random and Crafted, respectively). All solvers were rather strongly correlated, and each solver could be replaced in SATzilla11 without a large decrease in performance; for example, dropping the competition winner only decreased SATzilla11's percentage of solved instances by 0.4%. The highest marginal contribution across all 18 solvers was four times larger: 1.6% for MPhaseSAT64. Similar to MPhaseSAT in the Crafted category, it was selected infrequently (only for 3.6% of the instances) but was the only solver able to solve about half of these instances. We conjecture that this was due to its unique phase selection mechanism. Both MPhaseSAT64 and Sol (in the Crafted category) thus come close to the hypothetical solver NewSAT mentioned earlier: they showed outstanding performance on certain instances and thus contributed substantially to a portfolio, while having achieved an unremarkable ranking in the competition (9th of 26 for MPhaseSAT64, 11th of 24 for Sol).
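The interchangeability arguments above rest on rank correlations between solver runtimes (e.g., the 0.9974 reported for the two March variants). A pure-Python sketch of the Spearman rank correlation used for such comparisons (the runtime vectors are illustrative):

```python
def ranks(xs):
    # Average ranks; ties share the mean of their rank positions (1-based).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Spearman rank correlation = Pearson correlation of the rank vectors.
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Two near-interchangeable solvers: their runtimes rank instances identically.
march_rw = [12.0, 250.0, 3.1, 4800.0, 95.0]
march_hi = [14.0, 230.0, 3.5, 5000.0, 90.0]
print(round(spearman(march_rw, march_hi), 4))   # 1.0
```

Rank correlation (rather than Pearson correlation on raw runtimes) is the natural choice here, since SLS runtimes span several orders of magnitude and are censored at the cutoff.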
We did observe one larger marginal contribution than that of MPhaseSAT64 when dropping sets of solvers: 2.3% for {Glueminisat, LR GL SHR}. The other three highly correlated clusters also gave rise to relatively high marginal contributions: 1.5% for {CryptoMiniSat, QuteRSat}, 1.5% for {Glucose1, Glucose2, EBGlucose}, and 1.2% for {Minisat, EBMiniSAT, MiniSATagile}.

8.4 Conclusions

This chapter investigates the question of assessing the contributions of individual SAT solvers by examining their value to SATzilla, a portfolio-based algorithm selector. SATzilla11 is an improved version of this procedure based on cost-based decision forests, which entered the new analysis track of the 2011 SAT competition. Its automatically generated portfolios achieved state-of-the-art performance across all competition categories, and consistently outperformed its constituent solvers, other competition entrants, and our previous versions of SATzilla. The experimental results show that the selection frequency of a component solver is a poor measure of that solver's contribution to SATzilla11's performance. Instead, we advocate assessing solvers in terms of their marginal contributions to the state of the art in SAT solving.

One main observation was that the solvers with the largest marginal contributions to SATzilla were often not competition winners (e.g., MPhaseSAT64 in Application SAT+UNSAT; Sol in Crafted SAT+UNSAT). To encourage improvements to the state of the art in SAT solving, and taking into account the practical effectiveness of portfolio-based approaches, we suggest rethinking the way future SAT competitions are conducted. In particular, we suggest that all solvers that solve some instances that no other solver can handle pass Phase 1 of the competition, and that solvers contributing most to the best-performing portfolio-based approaches be given formal recognition.
We also recommend that portfolio-based solvers be evaluated separately (and with access to all submitted solvers as components) rather than competing with traditional solvers.

[Figure 8.3: Visualization of results for category Application. (a) Percent of solvable instances solved by each component solver alone; (b) correlation matrix of solver runtimes (darker = more correlated); (c) solver selection frequencies for SATzilla11 (names shown only for solvers picked for > 5% of the instances); (d) fraction of instances solved by SATzilla11 pre-solver(s), backup solver, and main stage solvers; (e) runtime CDF for SATzilla11, VBS, Oracle, and the gold medalist Glucose2; (f) marginal contribution for Oracle and SATzilla11.]

Solver         | Avg. runtime (solved) | Avg. correl. | Backup (solved) | Pre (solved)  | Picked by model (solved) | Marg. contrib. SATzilla (Oracle)

Random:
EagleUP        | 1761 (66.7%) | 0.70 | - (-)        | 10/10 (47.6%) | 4.5% (3.7%)   | 2.2% (0.2%)
Sparrow        | 1422 (73.6%) | 0.67 | - (-)        | - (-)         | 20.3% (20.3%) | 4.9% (2.4%)
March_rw       | 2714 (51.0%) | 0.23 | - (-)        | - (-)         | 20.1% (19.9%) | 0.0% (0.0%)
March_hi       | 2725 (51.0%) | 0.23 | - (-)        | - (-)         | 5.3% (5.1%)   | 0.2% (0.0%)
Gnovelty+2     | 2146 (60.6%) | 0.71 | - (-)        | - (-)         | 0.8% (0.8%)   | 0.4% (0.0%)
MPhaseSAT M    | 1510 (73.0%) | 0.67 | - (-)        | - (-)         | 0.4% (0.4%)   | 0.2% (0.0%)
Sattime11      | 1850 (67.9%) | 0.70 | - (-)        | - (-)         | 0.4% (0.4%)   | 0.6% (0.0%)
Adaptg2wsat11  | 1847 (66.7%) | 0.70 | - (-)        | - (-)         | 0.4% (0.2%)   | -0.4% (0.2%)
TNM            | 1938 (65.9%) | 0.70 | - (-)        | - (-)         | 0.2% (0.2%)   | 0.4% (0.0%)

Crafted:
Sattime        | 2638 (49.3%) | 0.39 | - (-)        | 2/10 (6.4%)   | 15.5% (12.8%) | 3.6% (1.4%)
Sol            | 2563 (52.5%) | 0.48 | - (-)        | 1/10 (1.4%)   | 13.7% (13.7%) | 8.1% (6.4%)
Clasp2         | 2280 (67.4%) | 0.69 | - (-)        | - (-)         | 17.4% (15.5%) | 2.7% (0.4%)
PicoSAT        | 2729 (54.8%) | 0.73 | - (-)        | - (-)         | 10.1% (10.1%) | 0.5% (0.4%)
Clasp1         | 2419 (67.4%) | 0.67 | - (-)        | - (-)         | 7.8% (6.9%)   | 1.4% (1.4%)
QuteRSat       | 2793 (49.8%) | 0.69 | - (-)        | - (-)         | 6.9% (6.4%)   | 1.4% (0.0%)
Sattime+       | 2681 (48.0%) | 0.40 | - (-)        | - (-)         | 6.4% (5.9%)   | 1.4% (1.4%)
MPhaseSAT      | 2398 (59.7%) | 0.62 | - (-)        | - (-)         | 5.0% (4.6%)   | 3.6% (1.8%)
CryptoMiniSat  | 2766 (49.8%) | 0.68 | - (-)        | - (-)         | 3.2% (2.7%)   | 0.5% (0.0%)
RestartSAT     | 2773 (50.7%) | 0.73 | - (-)        | - (-)         | 2.7% (1.4%)   | 1.4% (0.0%)
SApperloT      | 2798 (49.3%) | 0.73 | - (-)        | - (-)         | 1.4% (1.4%)   | 1.8% (0.0%)
Glucose2       | 2644 (56.6%) | 0.66 | - (-)        | - (-)         | 0.5% (0.5%)   | 3.6% (0.0%)
JMiniSat       | 3026 (44.3%) | 0.74 | - (-)        | - (-)         | 0.5% (0.5%)   | 0.9% (0.0%)
Minisat07      | 2738 (55.2%) | 0.70 | - (-)        | - (-)         | 0.0% (0.0%)   | 0.0% (0.0%)
Sathys         | 2955 (43.4%) | 0.69 | - (-)        | - (-)         | 0.0% (0.0%)   | 0.0% (0.0%)

Application:
Glucose2       | 1272 (85.0%) | 0.86 | 10/10 (8.7%) | 3/10 (6.3%)   | 9.9% (9.5%)   | 0.4% (0.0%)
Glueminisat    | 1391 (83.4%) | 0.86 | - (-)        | 5/10 (13.4%)  | 12.7% (9.9%)  | 0.8% (0.0%)
QuteRSat       | 1380 (81.4%) | 0.80 | - (-)        | - (-)         | 12.7% (11.1%) | 0.8% (0.0%)
Precosat       | 1411 (81.4%) | 0.85 | - (-)        | - (-)         | 5.5% (4.7%)   | 0.4% (0.0%)
EBGlucose      | 1630 (78.7%) | 0.87 | - (-)        | 1/10 (2.8%)   | 1.9% (1.6%)   | 0.8% (0.0%)
CryptoMiniSat  | 1328 (81.8%) | 0.82 | - (-)        | - (-)         | 3.6% (3.6%)   | 0.4% (0.4%)
Minisat_psm    | 1564 (77.9%) | 0.88 | - (-)        | 1/10 (2.8%)   | 0.4% (0.4%)   | 1.2% (0.0%)
MPhaseSAT64    | 1529 (79.4%) | 0.82 | - (-)        | - (-)         | 3.6% (2.8%)   | 1.6% (1.2%)
Lingeling      | 1355 (82.2%) | 0.86 | - (-)        | - (-)         | 2.4% (2.4%)   | 0.8% (0.4%)
Contrasat      | 1592 (78.7%) | 0.80 | - (-)        | - (-)         | 2.4% (2.0%)   | 1.2% (0.0%)
Minisat        | 1567 (76.7%) | 0.88 | - (-)        | - (-)         | 2.0% (2.0%)   | -0.4% (0.0%)
LR GL SHR      | 1667 (75.1%) | 0.85 | - (-)        | - (-)         | 2.0% (1.6%)   | 0.8% (0.0%)
RestartSAT     | 1437 (79.4%) | 0.88 | - (-)        | - (-)         | 1.9% (1.2%)   | 0.4% (0.4%)
Rcl            | 1752 (72.7%) | 0.86 | - (-)        | - (-)         | 1.2% (1.2%)   | 0.4% (0.0%)
MiniSATagile   | 1626 (74.7%) | 0.87 | - (-)        | - (-)         | 1.6% (0.8%)   | 0.4% (0.0%)
Cirminisat     | 1514 (79.8%) | 0.88 | - (-)        | - (-)         | 0.8% (0.8%)   | 0.0% (0.0%)
Glucose1       | 1614 (77.8%) | 0.86 | - (-)        | - (-)         | 0.0% (0.0%)   | 0.0% (0.0%)
EBMiniSAT      | 1552 (77.5%) | 0.89 | - (-)        | - (-)         | 0.0% (0.0%)   | 0.0% (0.0%)

Table 8.2: Performance of SATzilla11 component solvers, disregarding instances that could not be solved by any component solver. We counted timed-out runs as 5000 CPU seconds (the cutoff). Average correlation for s is the mean of Spearman correlation coefficients between s and all other solvers. Marginal contribution for s is negative if dropping s improved test set performance. (Usually, SATzilla's solver subset selection avoids such solvers, but they can slip through when the training set is too small.) SATzilla11(Application) ran its backup solver Glucose2 for 10.3% of the instances (and thereby solved 8.7%). SATzilla11 only chose one presolver for all folds of Random and Application; for Crafted, it chose Sattime as the first presolver in 2 folds, and Sol as the second presolver in 1 of these; for the remaining 8 folds, it did not select presolvers.

Chapter 9
Automatically Building High-performance Algorithms from Components

Designing high-performance solvers for computationally hard problems is a difficult and often time-consuming task.
Although such design problems are traditionally solved by the application of human expertise, we argue instead for the use of automatic methods. In this work, we consider the design of stochastic local search (SLS) solvers for the propositional satisfiability problem (SAT). We first introduce a generalized, highly parameterized solver framework, dubbed SATenstein, that includes components drawn from or inspired by existing high-performance SLS algorithms for SAT. The parameters of SATenstein determine which components are selected and how these components behave; they allow SATenstein to instantiate many high-performance solvers previously proposed in the literature, along with trillions of novel solver strategies. We used an automated algorithm configuration procedure to find instantiations of SATenstein that perform well on several well-known, challenging distributions of SAT instances. Our experiments show that SATenstein solvers achieved dramatic performance improvements as compared to the previous state of the art in SLS algorithms; for many benchmark distributions, our new solvers also significantly outperformed all automatically tuned variants of previous state-of-the-art algorithms. To better understand the novel algorithm designs generated in our work, we propose a new metric for quantitatively measuring the similarity between algorithm configurations, and show how to leverage this metric for visualizing the relative similarities between different solver designs.

9.1 SATenstein-LS

SATenstein advocates designing new solvers by inducing a single parameterized solver from distinct examples in the literature, and then searching this parameter space automatically [115]. This approach is an example of (and indeed was part of the inspiration for) a design philosophy we call Programming by Optimization (PbO) [82].
In general, PbO means seeking and exposing design choices during a development process, and then automatically finding instantiations of these choices that optimize performance in a given use context. SATenstein-LS can be seen as an example of PbO in which the algorithm design space has been obtained by unifying a large number of local search schemes for SAT into a tightly integrated, highly parametric algorithm framework. However, the PbO philosophy goes further and is ultimately more general: it emphasizes encouraging developers to identify and expose design choices as parameters, rather than merely recovering parameters from existing, fully implemented examples. Because of its emphasis on changing the software development process, the PbO paradigm is also supported by programming language extensions that allow parameters and design choices to be exposed quickly and transparently. For more information, please see the PbO website at www.prog-by-opt.net.

9.1.1 Design

As discussed in Section 3.1.2, most SLS algorithms for SAT can be categorized into four broad categories: GSAT, WalkSAT, dynamic local search, and G2WSAT. Since no recent, state-of-the-art SLS solver is GSAT-based, we constructed SATenstein-LS by drawing components from algorithms belonging to the three remaining categories. (This chapter is based on joint work with Ashiqur KhudaBukhsh, Holger Hoos, and Kevin Leyton-Brown [115].)

As shown in the high-level algorithm outline (Procedure SATenstein-LS), SATenstein-LS is comprised of five major building blocks, B1–B5. Any instantiation of SATenstein-LS follows the same high-level structure:

1. Optionally execute B1, which performs search diversification.

2. Execute either B2, B3 or B4, thus performing a G2WSAT-based, WalkSAT-based, or dynamic local search procedure, respectively.

3.
Optionally execute B5, to update data structures such as promising list, clause penalties, dynamically adaptable parameters, or tabu attributes.

Procedure SATenstein-LS(...)
  Input: CNF formula φ; real number cutoff; Booleans performDiversification,
         singleClauseAsNeighbor, usePromisingList
  Output: satisfying variable assignment

  Start with random assignment A;
  Initialize parameters;
  while runtime < cutoff do
      if A satisfies φ then return A;
      varFlipped ← FALSE;
      if performDiversification then                      // B1
          with probability diversificationProbability() do
              c ← selectClause();
              y ← diversificationStrategy(c);
              varFlipped ← TRUE;
      if not varFlipped then
          if usePromisingList then                        // B2
              if promisingList is not empty then
                  y ← selectFromPromisingList();
              else
                  c ← selectClause();
                  y ← selectHeuristic(c);
          else if singleClauseAsNeighbor then             // B3
              c ← selectClause();
              y ← selectHeuristic(c);
          else                                            // B4
              sety ← selectSet();
              y ← tieBreaking(sety);
      flip y;
      update();                                           // B5

Each of our building blocks is composed of one or more components (listed in Table 9.1); some of these components are shared across different building blocks. Each component is configurable by one or more parameters. Out of 42 parameters overall, 6 of SATenstein-LS's parameters are integer-valued (listed in Table 9.5), 19 are categorical (listed in Table 9.4), and 17 are real-valued (listed in Table 9.6). All of these parameters are exposed on the command line so that they can be optimized using an automatic configurator. After fixing the domains of integer- and real-valued parameters to between 3 and 16 values each (as we did in our experiments, reported later), the total number of valid SATenstein-LS instantiations was 2.01 × 10^14.

We now give a high-level description of each of the building blocks. B1 is constructed using the SelectClause(), DiversificationStrategy() and DiversificationProbability() components. SelectClause() is configured by one categorical parameter and, depending on its value, either selects an unsatisfied clause uniformly at random or selects a clause with probability proportional to its clause penalty [198]. Component diversificationStrategy() can be configured by a categorical parameter to do any of the following with probability diversificationProbability(): flip the least recently flipped variable [131], flip the least frequently flipped variable [166], flip the variable with minimum variable weight [166], or flip a randomly selected variable [80].

Block B2 instantiates G2WSAT-based algorithms that use a data structure promising list that keeps track of a set of variables considered for being flipped. In the literature on G2WSAT, there are two strategies for selecting a variable from the promising list: choosing the variable with the highest score [131] or choosing the least recently flipped variable [135]. We added nine novel strategies based on variable selection heuristics from other solvers. These, to the best of our knowledge, have never been used before in the context of promising variable selection for G2WSAT-based algorithms. For example, in previous work, variable selection mechanisms used in Novelty variants are only applied to variables of unsatisfiable clauses, not to promising lists. Table 9.2 lists the eleven possible strategies for
Table 9.2 lists the eleven possible strategies forSelectFromPromisingList.140Component Block Parameters Instantiations Detailed InfodiversificationStrategy() 1 searchDiversificationStrategy 4 Table 9.4SelectClause() 1, 2, 3 selectClause 2 Table 9.4diversificationProbability() 1 rdp, rfp, rwp 216 Table 9.6selectFromPromisingList() 2 selectPromVariable 4312 Table 9.2, 9.4promDp, promWp, promNovNoise Table 9.6selectHeuristic() 2, 3 heuristic Table 9.3, 9.4performAlternateNovelty 1.83×106 Table 9.4wp, dp, wpWalk, novNoise, s, c Tabel 9.6selectSet() 4 scoringMeasure, smoothingScheme Table 9.4maxinc 24576 Table 9.5alpha,rho, sapsthresh, pflat Table 9.6tiebreaking() 4 tieBreaking 4 Table 9.4update() 5 useAdaptiveMechanism, adaptivenoisescheme, Table 9.4adaptWalkProb, performTabuSearch, Table 9.4useClausePenalty, adaptiveProm, Table 9.4adaptpromwalkprob, updateSchemePromList, 1.76×108 Table 9.4tabuLength, phi, theta, promPhi,promTheta, Table 9.5ps Table 9.6Table 9.1: SATenstein-LS components..Param Value Design choice Based on1 If freebie exists, use tieBreaking(); [179]else, select uniformly at random2 Variable with best score [131]3 Least-recently-flipped variable [135]4 Variable with best VW1 score [166]5 Variable with best VW2 score [166]6 Variable selected uniformly at random [79]7 Variable selection from Novelty [142]8 Variable selection from Novelty++ [131]9 Variable selection from Novelty+ [79]10 Variable selection from Novelty++′ [132]11 Variable selection from Novelty+p [132]Table 9.2: Design choices for selectFromPromisingList().If promising list is empty, B2 behaves exactly as B3, which instantiates WalkSAT-based algorithms. As previously described in the context of B1, component Select-Clause() is used to select an unsatisfiable clause c. The SelectHeuristic() compo-nent selects a variable from c for flipping. 
Depending on a categorical parameter, SelectHeuristic() can behave as any of thirteen well-known WalkSAT-based heuristics, including the Novelty variants, VW1 and VW2. Table 9.3 lists these heuristics and related continuous parameters. We also extended the Novelty variants with an optional "flat move" mechanism as found in the selection strategy of gNovelty+ [161, 195].

Param. value | Selected heuristic  | Dependent parameters
1            | Novelty [142]       | novnoise
2            | Novelty+ [80]       | novnoise, wp
3            | Novelty++ [131]     | novnoise, dp
4            | Novelty++′ [132]    | novnoise, dp
5            | R-Novelty [142]     | novnoise
6            | R-Novelty+ [80]     | novnoise, wp
7            | VW1 [166]           | wpwalk
8            | VW2 [166]           | s, c, wp
9            | WalkSAT-SKC [179]   | wpwalk
10           | Noveltyp [132]      | novnoise
11           | Novelty+p [132]     | novnoise, wp
12           | Novelty++p [132]    | novnoise, dp
13           | Novelty++′p [132]   | novnoise, dp

Table 9.3: List of heuristics chosen by the parameter heuristic and dependent parameters.

Block B4 instantiates dynamic local search algorithms. The selectSet() component considers the set of variables that occur in any unsatisfied clause. It associates with each such variable v a score, which depends on the clause weights of each clause that changes satisfiability status when v is flipped. These clause weights reflect the perceived importance of satisfying each clause. For example, weights might increase the longer a clause has been unsatisfied, and decrease afterwards [91, 195]. After scoring the variables, selectSet() returns all variables with maximal score. Our implementation of this component incorporates three different scoring functions, including those due to [142] and [179], and a novel, greedier variant that only considers the number of previously unsatisfied clauses that are satisfied by a variable flip. The tieBreaking() component selects a variable from the maximum-scoring set according to the same strategies used by the diversificationStrategy() component.

Block B5 updates data structures required by the previously mentioned mechanisms (e.g., dynamic local search) after a variable has been flipped.
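The weighted scoring measures just described for selectSet() can be sketched as follows (the clause-weight update scheme itself is elided, and the function names are ours, not SATenstein-LS's):

```python
def variable_scores(clauses, weights, assignment, scoring="make-break"):
    """Return the maximum-scoring set of variables from unsatisfied clauses.

    make(v): total weight of currently-unsatisfied clauses that flipping v
    satisfies; break(v): total weight of currently-satisfied clauses that
    flipping v falsifies.  Illustrative stand-in for selectSet().
    """
    def sat(clause, A):
        return any((lit > 0) == A[abs(lit)] for lit in clause)

    candidates = {abs(l) for c in clauses if not sat(c, assignment) for l in c}
    scores = {}
    for v in candidates:
        flipped = dict(assignment)
        flipped[v] = not flipped[v]
        make = sum(w for c, w in zip(clauses, weights)
                   if not sat(c, assignment) and sat(c, flipped))
        brk = sum(w for c, w in zip(clauses, weights)
                  if sat(c, assignment) and not sat(c, flipped))
        scores[v] = {"make-break": make - brk,
                     "make": make,
                     "-break": -brk}[scoring]
    best = max(scores.values())
    return [v for v, s in scores.items() if s == best]   # max-scoring set

# Under weights [1, 3, 1] and assignment x1=T, x2=F, flipping x1 repairs the
# heavy clause [-1] at the cost of breaking [1, 2], so it scores highest.
clauses, weights = [[1, 2], [-1], [2]], [1.0, 3.0, 1.0]
print(variable_scores(clauses, weights, {1: True, 2: False}))   # [1]
```

A tieBreaking() step would then pick one variable from the returned set, e.g., the least recently flipped one.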
Performing these updates in an efficient manner is of crucial importance for the performance of many SLS algorithms. As the SATenstein-LS framework supports the combination of mechanisms from many different SLS algorithms, each depending on different data structures, the implementation of the update() function was technically quite challenging.

Parameter                      | Active when                                   | Domain    | Description
performSearchDiversification   | base-level parameter                          | {0,1}     | If true, block B1 is performed.
usePromisingList               | base-level parameter                          | {0,1}     | If true, block B2 is performed.
singleClauseAsNeighbor         | base-level parameter                          | {0,1}     | If true, block B3 is performed; else, block B4 is performed.
selectPromVariable             | usePromisingList = 1                          | {1,...,11}| See Table 9.2.
heuristic                      | singleClauseAsNeighbor = 1                    | {1,...,13}| See Table 9.3.
performAlternateNovelty        | singleClauseAsNeighbor = 1                    | {0,1}     | If true, performs Novelty variant with "flat move".
useAdaptiveMechanism           | base-level parameter                          | {0,1}     | If true, uses adaptive mechanisms.
adaptivenoisescheme            | useAdaptiveMechanism = 1 and                  | {1,2}     | Specifies adaptive noise mechanisms.
                               | usePromisingList = 1                          |           |
adaptWalkProb                  | useAdaptiveMechanism = 1                      | {0,1}     | If true, walk probability or diversification probability of a heuristic is adaptively tuned.
performTabuSearch              | base-level parameter                          | {0,1}     | If true, tabu variables are not considered for flipping.
useClausePenalty               | base-level parameter                          | {0,1}     | If true, clause penalties are computed.
selectClause                   | singleClauseAsNeighbor = 1                    | {1,2}     | 1 selects an UNSAT clause uniformly at random; 2 selects an UNSAT clause with a probability proportional to its clause penalty.
searchDiversificationStrategy  | performSearchDiversification = 1              | {1,2,3,4} | 1 randomly selects a variable from an UNSAT clause; 2 selects the least-recently-flipped variable from an UNSAT clause; 3 selects the least-frequently-flipped variable from an UNSAT clause; 4 selects the variable with least VW2 weight from an UNSAT clause.
adaptiveProm                   | usePromisingList = 1                          | {0,1}     | If true, performs adaptive versions of Novelty variants to select variable from promising list.
adaptpromwalkprob              | usePromisingList = 1 and adaptiveProm = 1     | {0,1}     | If true, walk probability or diversification probability of Novelty variants used on promising list is adaptively tuned.
scoringMeasure                 | usePromisingList = 0 and                      | {1,2,3}   | Specifies the scoring measure: 1 uses MakeCount - BreakCount; 2 uses MakeCount; 3 uses -BreakCount.
                               | singleClauseAsNeighbor = 0                    |           |
tieBreaking                    | (usePromisingList = 1 and                     | {1,2,3,4} | 1 breaks ties randomly; 2 in favor of the least-recently-flipped variable; 3 in favor of the least-frequently-flipped variable; 4 in favor of the variable with least VW2 score.
                               | selectPromVariable ∈ {1,4,5}) or              |           |
                               | singleClauseAsNeighbor = 0                    |           |
updateSchemePromList           | usePromisingList = 1                          | {1,2,3}   | 1 and 2 follow G2WSAT; 3 follows gNovelty+.
smoothingScheme                | useClausePenalty = 1                          | {1,2}     | When singleClauseAsNeighbor = 1: 1 performs smoothing only for random 3-SAT instances with 0.4 fixed smoothing probability; 2 performs smoothing for all instances. When singleClauseAsNeighbor = 0: 1 performs SAPS-like smoothing; 2 performs PAWS-like smoothing.

Table 9.4: Categorical parameters of SATenstein-LS. Unless otherwise mentioned, multiple "active when" conditions are combined using AND.

Parameter  | Active when                                       | Description                                      | Values considered
tabuLength | performTabuSearch = 1                             | Specifies tabu step-length                       | 1, 3, 5, 7, 10, 15, 20
phi        | useAdaptiveMechanism = 1 and                      | Parameter for adaptively setting noise           | 3, 4, 5, 6, 7, 8, 9, 10
           | singleClauseAsNeighbor = 1                        |                                                  |
theta      | useAdaptiveMechanism = 1 and                      | Parameter for adaptively setting noise           | 3, 4, 5, 6, 7, 8, 9, 10
           | singleClauseAsNeighbor = 1                        |                                                  |
promPhi    | usePromisingList = 1 and adaptiveProm = 1 and     | Parameter for adaptively setting noise           | 3, 4, 5, 6, 7, 8, 9, 10
           | selectPromVariable ∈ {7,8,9,10,11}                |                                                  |
promTheta  | usePromisingList = 1 and adaptiveProm = 1 and     | Parameter for adaptively setting noise           | 3, 4, 5, 6, 7, 8, 9, 10
           | selectPromVariable ∈ {7,8,9,10,11}                |                                                  |
maxinc     | singleClauseAsNeighbor = 0 and                    | PAWS [195] parameter for additive clause         | 5, 10, 15, 20
           | useClausePenalty = 1 and smoothingScheme = 2      | weighting                                        |

Table 9.5: Integer parameters of SATenstein-LS and the values considered during ParamILS tuning. Multiple "active when" conditions are combined using AND. Existing defaults are highlighted in bold. For parameters first introduced in SATenstein-LS, default values are underlined.

9.1.2 Implementation and Validation

SATenstein-LS is built on top of UBCSAT [198], a well-known framework for developing and empirically evaluating SLS algorithms for SAT. UBCSAT makes use of a trigger-based architecture that facilitates the reuse of existing mechanisms. While designing and implementing SATenstein-LS, we not only studied existing SLS algorithms, as presented in the literature, but we also analyzed the SAT competition submissions of such algorithms. We found that the pseudocode of VW2 according to [166] differed from its SAT competition 2005 version, which includes a reactive mechanism; we included both versions in SATenstein-LS's implementation. We also found that in the SAT competition implementation of gNovelty+, Novelty uses a PAWS-like [195] "flat move" mechanism.
We implemented this alternate version of Novelty in SATenstein-LS and exposed a categorical parameter to choose between the two implementations. While examining the implementations of various SLS solvers, we noticed that certain key data structures were implemented in different ways. In particular, different G2WSAT variants use different realizations of the update scheme of promising list. We included all these update schemes in SATenstein-LS and declared parameter updateSchemePromList to select between them.

Since SATenstein-LS is quite complex, we took great care in validating its implementation of existing SLS-based SAT solvers. We compared our SATenstein-LS implementation with ten well-known algorithms' reference implementations (specifically, every algorithm listed in Table 9.7 except for Ranov), measuring running times as the number of variable flips. (Since SATenstein-LS does not use any preprocessor, we manually disabled the preprocessing steps of G2, AG2p, AG2+, and AG20 when performing this validation.) These ten algorithms span G2WSAT-based, WalkSAT-based, and dynamic local search procedures, and also make use of all the prominent SLS solver mechanisms discussed earlier. Our validation results showed that in every case reference solvers and their SATenstein-LS implementations had the same run-length distributions on a small set of validation instances chosen from block world and software verification, based on a Kolmogorov-Smirnov test (5000 runs per solver-instance pair with significance threshold 0.05).

Parameter    | Active when                                                | Description                                  | Discrete values considered
wp           | (singleClauseAsNeighbor = 1 and heuristic ∈ {2,6,11} and   | Random walk probability for Novelty+         | 0, 0.01, 0.03, 0.04, 0.05, 0.06, 0.07, 0.1, 0.15, 0.20
             | useAdaptiveMechanism = 0) or (smoothingScheme = 1 and      |                                              |
             | singleClauseAsNeighbor = 0 and useClausePenalty = 0)       |                                              |
dp           | singleClauseAsNeighbor = 1 and heuristic ∈ {3,4,12,13} and | Diversification probability for Novelty++    | 0.01, 0.03, 0.05, 0.07, 0.1, 0.15, 0.20
             | useAdaptiveMechanism = 0                                   | and Novelty++′                               |
promDp       | usePromisingList = 1 and selectPromVariable ∈ {8,10} and   | Diversification probability for Novelty      | 0.01, 0.03, 0.05, 0.07, 0.1, 0.15, 0.20
             | adaptiveProm = 0                                           | variants used on the promising list          |
novNoise     | singleClauseAsNeighbor = 1 and                             | Noise parameter for all Novelty variants     | 0.01, 0.03, 0.05, 0.07, 0.1, 0.15, 0.20
             | heuristic ∈ {1,2,3,4,5,6,10,11,12,13} and                  |                                              |
             | useAdaptiveMechanism = 0                                   |                                              |
wpWalk       | singleClauseAsNeighbor = 1 and heuristic ∈ {7,9} and       | Noise parameter for WalkSAT and VW1          | 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8
             | useAdaptiveMechanism = 0                                   |                                              |
promWp       | usePromisingList = 1 and selectPromVariable ∈ {9,11}       | Random walk probability for Novelty          | 0.01, 0.03, 0.05, 0.07, 0.1, 0.15, 0.20
             |                                                            | variants used on the promising list          |
promNovNoise | usePromisingList = 1 and                                   | Noise parameter for all Novelty variants     | 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8
             | selectPromVariable ∈ {7,8,9,10,11}                         | used on the promising list                   |
alpha        | singleClauseAsNeighbor = 0 and useClausePenalty = 1 and    | Parameter for SAPS                           | 1.01, 1.066, 1.126, 1.189, 1.3, 1.256, 1.326, 1.4
             | smoothingScheme = 1                                        |                                              |
rho          | singleClauseAsNeighbor = 0 and useClausePenalty = 1 and    | Parameter for SAPS                           | 0, 0.17, 0.333, 0.5, 0.666, 0.8, 0.83, 1
             | smoothingScheme = 1                                        |                                              |
sapsthresh   | singleClauseAsNeighbor = 0 and useClausePenalty = 1 and    | Parameter for SAPS                           | -0.1, -0.2, -0.3, -0.4
             | smoothingScheme = 1                                        |                                              |
ps           | (useClausePenalty = 1 and singleClauseAsNeighbor = 1) or   | Smoothing parameter for SAPS, RSAPS,         | 0, 0.033, 0.05, 0.066, 0.1, 0.133, 0.166, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
             | (singleClauseAsNeighbor = 0 and useClausePenalty = 1 and   | and gNovelty+                                |
             | useAdaptiveMechanism = 0 and smoothingScheme = 1)          |                                              |
s            | (singleClauseAsNeighbor = 1 and useAdaptiveMechanism = 0)  | VW parameter for smoothing                   | 0.1, 0.01, 0.001
             | or (singleClauseAsNeighbor = 0 and tieBreaking = 4 and     |                                              |
             | useAdaptiveMechanism = 0)                                  |                                              |
c            | (singleClauseAsNeighbor = 1 and useAdaptiveMechanism = 0)  | VW parameter for smoothing                   | 0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001
             | or (singleClauseAsNeighbor = 0 and tieBreaking = 4 and     |                                              |
             | useAdaptiveMechanism = 0)                                  |                                              |
rdp          | performSearchDiversification = 1 and                       | Parameter for search diversification         | 0.01, 0.03, 0.05, 0.07, 0.1, 0.15
             | searchDiversificationStrategy ∈ {2,3}                      |                                              |
rfp          | performSearchDiversification = 1 and                       | Parameter for search diversification         | 0.01, 0.03, 0.05, 0.07, 0.1, 0.15
             | searchDiversificationStrategy = 4                          |                                              |
rwp          | performSearchDiversification = 1 and                       | Parameter for search diversification         | 0.01, 0.03, 0.05, 0.07, 0.1, 0.15
             | searchDiversificationStrategy = 1                          |                                              |
pflat        | singleClauseAsNeighbor = 0 and useClausePenalty = 1 and    | Parameter for PAWS that controls "flat       | 0.05, 0.10, 0.15, 0.20
             | smoothingScheme = 2                                        | moves"                                       |

Table 9.6: Continuous parameters of SATenstein-LS and values considered during ParamILS tuning. Unless otherwise mentioned, multiple "active when" conditions are combined using AND. Existing defaults are highlighted in bold. For parameters first introduced in SATenstein-LS, default values are underlined.

9.2 Experimental Setup

In order to study the effectiveness of our proposed approach for algorithm design, we configured SATenstein-LS on training sets from various distributions of SAT instances and compared the performance of the SATenstein-LS solvers thus obtained against that of several existing high-performance SAT solvers on disjoint test sets.

9.2.1 Benchmarks

We considered six sets of well-known benchmark instances for SAT. They can be roughly categorized into three broad categories of SAT instances, namely industrial (CBMC(SE), FAC), handmade (QCP, SW-GCP), and random (R3SAT, HGEN). Because SLS algorithms are unable to prove unsatisfiability, we constructed our benchmark sets to include only satisfiable instances.

The instance generators for HGEN [76] and FAC [200] only produce satisfiable instances. For each of these two distributions, we generated 2000 instances. For QCP [60] and SW-GCP [58], we first filtered out unsatisfiable instances using complete solvers and then randomly chose 2000 satisfiable instances. For R3SAT [180], we generated a set of 1000 instances with 600 variables and a clauses-to-variables ratio of 4.26.
We identified 521 satisfiable instances and randomly chose 500 of them. Finally, we used the CBMC generator [34] to generate 611 SAT-encoded software verification instances. We preprocessed these instances using SatELite [45], identifying 604 of them as satisfiable and the remaining 7 as unsatisfiable.

We randomly split each of the six instance sets thus obtained into training and test sets of equal size.

9.2.2 Tuning Scenario and PAR

In order to perform automatic algorithm configuration, we first had to quantify performance using an objective function. Consistent with most previous work on SLS algorithms for SAT, we chose to focus on mean runtime. In order to deal with runs that had to be terminated at a given cutoff time, following Hutter et al. [2009], we used a variant of mean runtime known as penalized average runtime (PAR-10), defined as the average runtime over a given set of runs, where timed-out runs are counted as 10 times the given cutoff time. Unless explicitly stated otherwise, all runtimes reported in this chapter were measured using PAR-10 over the respective set of instances.

To perform automated configuration, we used the FocusedILS procedure from the ParamILS framework, version 2.3 [95]. We chose this method because it has been demonstrated to operate effectively on many extremely large, discrete parameter spaces (see, e.g., [94, 97, 165, 197]), and because it supports conditional parameters (discussed below). FocusedILS takes as input a parameterized algorithm (the so-called target algorithm), a specification of domains and (optionally) conditions for all parameters, a set of training instances, and an evaluation metric. It outputs a parameter configuration of the target algorithm that approximately minimizes the given evaluation metric.

As just mentioned, FocusedILS supports conditional parameters, which are important to SATenstein-LS. For example, the condition A|B = b means that parameter A is activated if parameter B takes the value b.
When more than one such condition is given for the same parameter A, the conditions are interpreted as being connected by logical AND; for example, the two conditions A|B = b and A|C = c are interpreted as A|(B = b) ∧ (C = c). Some parameters in SATenstein-LS can be activated in more than one way. While this cannot be directly specified in the input to FocusedILS, we can express such disjunctive conditions using dummy parameters, as illustrated in the following example. Consider an algorithm S with four parameters, {A, B, C, D}, where A is activated if B = b or C = c, while D is activated if A = a. As it is impossible to express the condition A|(B = b) ∨ (C = c) directly in the input to FocusedILS, we introduce two dummy parameters, A* and D*. Using these additional parameters, the given conditions can be expressed as A|B = b; A*|C = c; A*|B ≠ b; D|A = a; D*|A* = a. Since only one of (A, A*), and correspondingly only one of (D, D*), is ever activated, we can simply map A* to A and D* to D when instantiating S with a parameter configuration found by FocusedILS.

We used a cutoff time of 5 CPU seconds for each target algorithm run, and allotted 7 days to each run of FocusedILS; we note that, while 5 CPU seconds is unrealistically short for assessing the performance of SAT solvers, using short cutoff times during configuration is important for the efficiency of the configuration process and typically works well, as demonstrated by our SATenstein-LS results. Since ParamILS cannot operate directly on continuous parameters, each continuous parameter was discretized into a set containing between 3 and 16 values that we considered reasonable (see Table 9.5). Except for a small number of cases (e.g., the parameters s and c), for which we used the same discrete domains as in the publication first describing them [166], we selected these values using a regular grid over a range of values that appeared reasonable. For each integer parameter, we specified 4 to 10 values, always including the known defaults (see Table 9.6).
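The dummy-parameter encoding just described can be checked mechanically. The sketch below is illustrative only: the parameters A, B, C, D are the hypothetical ones from the example above, and the `"!b"` notation for "not equal to b" is our own encoding, not FocusedILS input syntax.

```python
# Each parameter maps to a list of (parent, required_value) conjuncts,
# mirroring the conjunctive conditions FocusedILS supports. The disjunctive
# condition A | (B = b) or (C = c) is split across A and the dummy copy A*.
conditions = {
    "A":  [("B", "b")],               # A  | B = b
    "A*": [("C", "c"), ("B", "!b")],  # A* | (C = c) and (B != b)
    "D":  [("A", "a")],               # D  | A = a
    "D*": [("A*", "a")],              # D* | A* = a
}

def satisfied(config, parent, required):
    value = config.get(parent)
    if required.startswith("!"):      # "!x" encodes "parent takes a value != x"
        return value is not None and value != required[1:]
    return value == required

def active(config, param):
    # A parameter is active when all of its conjunctive conditions hold.
    return all(satisfied(config, p, r) for p, r in conditions.get(param, []))

config = {"B": "not-b", "C": "c"}
assert not active(config, "A") and active(config, "A*")
# At most one of (A, A*) is ever active, so A* can safely be mapped back to A.
```

Because the two activation conditions for A and A* are mutually exclusive by construction (one requires B = b, the other B ≠ b), the mapping back from the dummy parameter is unambiguous.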
In all cases, these choices included the parameter values required to cover the default configurations of the solvers whose components were integrated into SATenstein-LS's design space. Categorical parameters and their respective domains are listed in Table 9.4. As mentioned before, based on this discretization, SATenstein-LS's parameter configuration space consists of 2.01 × 10^14 distinct configurations.

Since the performance of FocusedILS can vary significantly depending on the order in which instances appear in the training set, we ran FocusedILS 20 times on the training set, using a different, randomly determined instance ordering for each run. From the 20 parameter configurations obtained from FocusedILS for each instance distribution D, we selected the configuration with the best penalized average runtime on the training set and evaluated it on the test set. For a given distribution D, we refer to the corresponding instantiation of a solver S as S[D].

9.2.3 Solvers Used for Performance Comparison

For each instance distribution D, we compared the performance of SATenstein-LS[D] against that of 11 high-performance SLS-based SAT solvers on the respective test sets. We included every SLS algorithm that won a medal in any category of a SAT Competition between 2002 and 2007; those algorithms are all part of the SATenstein-LS design space. Although dynamic local search (DLS) algorithms have not won medals in recent SAT Competitions, we also included three prominent, high-performing DLS algorithms, for two reasons. First, some of them represented the state of the art when introduced (e.g., SAPS [91]) and still offer competitive performance on many instances. Second, techniques used in these algorithms have been incorporated into other recent high-performance SLS algorithms; for example, the additive clause weighting scheme used in PAWS is also used in the 2007 SAT Competition winner gNovelty+ [161].
We call these algorithms challengers and list them in Table 9.7. In order to demonstrate the full performance potential of these solvers, we also tuned the parameters of all parameterized challengers using the same configuration procedure and protocol as for SATenstein-LS, including the same choices of discrete values for continuous and integer parameters.

SATenstein-LS can be instantiated such that it emulates all 11 challenger algorithms (except for the preprocessing components used in Ranov, AG2p, AG2plus, and AG20). However, in some cases the original implementations of these algorithms are more efficient (on our data, by at most a factor of two on average per instance set), mostly because SATenstein-LS's generality rules out some data structure optimizations. Thus, we based all of our experimental comparisons on the original algorithm implementations, as submitted to the respective SAT Competitions. The exceptions are PAWS, whose implementation within UBCSAT is almost identical to the original in terms of runtime, as well as SAPS, RSAPS, and ANOV, whose UBCSAT implementations are those used in the competitions. All of our comparisons on the test set are based on running each solver 25 times per instance, with a per-run cutoff of 600 CPU seconds.

Our goal was to improve the state of the art in SAT solving.
Thus, although the design space of SATenstein-LS consists solely of SLS solvers, we have also compared its performance to that of high-performance complete solvers (listed in Table 9.8). Unlike SLS solvers, these complete solvers are deterministic. Thus, for every instance in each distribution, we ran each complete solver once with a per-run cutoff of 600 CPU seconds.

Algorithm                Abbrev  Reason for inclusion                                              Parameters
Ranov [160]              Ranov   gold, 2005 SAT Competition (random)                               wp
G2WSAT [131]             G2      silver, 2005 SAT Competition (random)                             novNoise, dp
VW [166]                 VW      bronze, 2005 SAT Competition (random)                             c, s, wpWalk
gNovelty+ [161]          GNOV    gold, 2007 SAT Competition (random)                               novNoise, wpWalk, ps
adaptG2WSAT0 [132]       AG20    silver, 2007 SAT Competition (random)                             NA
adaptG2WSAT+ [135]       AG2+    bronze, 2007 SAT Competition (random)                             NA
adaptNovelty+ [80]       ANOV    gold, 2004 SAT Competition (random)                               wp
adaptG2WSATp [135]       AG2p    performance comparable to G2WSAT [131], Ranov,                    NA
                                 and adaptG2WSAT+; see [132]
SAPS [91]                SAPS    prominent DLS algorithm                                           alpha, ps, rho, sapsthresh, wp
RSAPS [91]               RSAPS   prominent DLS algorithm                                           alpha, ps, rho, sapsthresh, wp
PAWS [195]               PAWS    prominent DLS algorithm                                           maxinc, pflat

Table 9.7: Our eleven challenger algorithms.

Category (benchmarks)          Solver            Reason for inclusion
Industrial (CBMC(SE), FAC)     Picosat [18, 19]  gold, silver in 2007 SAT Competition (industrial)
                               Minisat2.0 [187]  bronze, silver in 2007 SAT Competition (industrial)
Handmade (QCP, SW-GCP)         Minisat2.0 [187]  bronze, silver in 2007 SAT Competition (handmade)
                               March pl [73]     improved, bug-free version of March ks [74], gold in 2007 SAT Competition (handmade)
Random (HGEN, R3SAT)           Kcnfs 04 [44]     silver in 2007 SAT Competition (random)
                               March pl [73]     improved, bug-free version of March ks [74], silver in 2007 SAT Competition (random)

Table 9.8: Complete solvers we compared against.

9.2.4 Execution Environment

We performed our experiments on a cluster of 55 dual 3.2GHz Intel Xeon PCs with 2MB cache and 2GB RAM, running OpenSuSE Linux 11.1.
Our computer cluster was managed by a distributed resource manager, Sun Grid Engine (version 6.0). Runtimes for all algorithms (including FocusedILS) were measured as CPU time on these reference machines, and each run of any solver used only one CPU.

9.3 Performance Results

In this section, we present the results of performance comparisons between SATenstein-LS and the 11 challenger SLS solvers (listed in Table 9.7), configured versions of these challengers, and two complete solvers for each of our benchmark distributions (listed in Table 9.8). Although in our configuration experiments we optimized SATenstein-LS for penalized average runtime (PAR-10), we also examine its performance in terms of other metrics, such as median runtime and the percentage of instances solved within the given cutoff time.

9.3.1 Comparison with Challengers

For every one of our six benchmark distributions, we were able to find a SATenstein-LS configuration that outperformed all 11 challengers. Our results are summarized in Table 9.9 and Figure 9.1.

In terms of penalized average runtime, the performance metric we explicitly optimized using ParamILS (with a cutoff time of 5 CPU seconds rather than the 600 CPU seconds used here for testing, as explained in Section 5.2), our SATenstein-LS solvers achieved better performance than every challenger on every distribution. For QCP, HGEN, and CBMC(SE), SATenstein-LS achieved a PAR-10 score that was orders of magnitude better than that of the respective best challenger. For SW-GCP, R3SAT, and FAC, the improvement was substantial but less dramatic. The modest improvement on R3SAT was not very surprising (Figure 9.1, left): R3SAT is a well-known SAT distribution on which SLS solvers have been evaluated and optimized for decades.
Conversely, on a new benchmark distribution, CBMC(SE), where DPLL solvers represent the state of the art, SATenstein-LS solvers performed markedly better than every SLS-based challenger. We were surprised by the amount of improvement we obtained for HGEN, a hard random SAT distribution very similar to R3SAT, and for QCP, a widely known SAT distribution. We noticed that on HGEN, some older solvers, such as SAPS and PAWS, performed much better than more recent medal winners such as GNOV and AG20. Also, for QCP, a somewhat older algorithm, ANOV, turned out to be the best challenger. These observations led us to believe that the strong performance of SATenstein-LS was partly due to the fact that the past seven years of SLS SAT solver development had not taken these types of distributions into account and had not yielded across-the-board improvements in SLS solver performance.

Solver                   QCP                 SW-GCP              R3SAT             HGEN               FAC                 CBMC(SE)
SATenstein-LS[D] [115]   0.08/0.01/100%      0.03/0.02/100%      1.11/0.14/100%    0.02/0.01/100%     10.89/7.90/100%     4.75/0.02/100%
AG20 [132]               1054.99/0.03/81.2%  0.64/0.11/100%      2.14/0.13/100%    137.02/0.57/98.1%  3594.40/N/A/35.9%   2169.77/0.56/61.1%
AG2p [135]               1119.96/0.02/80.1%  0.43/0.06/100%      2.35/0.14/100%    105.30/0.48/98.4%  1954.83/330.26/80.6%  2294.24/2.57/61.1%
AG2+ [135]               1091.37/0.03/80.3%  0.67/0.08/100%      3.04/0.16/100%    148.28/0.59/98.0%  1450.89/238.31/91.0%  2181.92/0.64/61.1%
ANOV [80]                25.42/0.02/99.6%    4.86/0.04/100%      11.17/0.15/100%   109.94/0.50/98.6%  2897.52/588.23/51.4%  2021.22/3.10/61.1%
G2 [131]                 2942.13/341.60/50.9%  4092.29/N/A/31.0%  3.69/0.13/100%   104.55/0.60/98.7%  5947.80/N/A/0%      2139.12/0.57/65.4%
GNOV [161]               414.69/0.03/93.3%   1.20/0.09/100%      11.14/0.15/100%   52.58/0.71/99.4%   5935.39/N/A/0%      2236.85/0.67/61.5%
PAWS [195]               1127.84/0.03/81.0%  4495.50/N/A/24.3%   1.77/0.08/100%    62.18/0.82/99.4%   22.05/10.41/100%    1693.82/0.18/70.8%
RANOV [160]              73.38/0.1/99.1%     0.15/0.12/100%      18.29/0.36/100%   151.11/0.90/98.2%  887.33/152.16/96.8%  1227.07/0.58/79.7%
RSAPS [91]               1255.94/0.05/79.2%  5635.54/N/A/5.4%    18.42/1.86/100%   33.28/2.33/99.7%   17.86/11.53/100%    827.81/0.02/85.0%
SAPS [91]                1248.34/0.04/79.4%  3864.74/N/A/34.2%   22.93/1.77/100%   40.17/2.65/99.5%   16.41/10.56/100%    646.89/0.02/89.7%
VW [166]                 1022.69/0.25/81.9%  161.74/40.26/99.4%  12.45/0.82/100%   176.18/3.13/97.8%  3382.02/N/A/35.3%   385.12/0.23/93.4%

Table 9.9: Performance of SATenstein-LS and the 11 challengers. Every algorithm was run 25 times with a cutoff of 600 CPU seconds per run. Each cell summarizes the test-set performance of the given algorithm on the given distribution as a/b/c, where a is the PAR-10 score; b is the median of the median runtimes (the outer median taken over instances, the inner median over runs); and c is the percentage of instances solved (median runtime < cutoff). The best-scoring algorithm(s) in each column are indicated in bold, and the best-scoring challenger(s) are underlined.

We also evaluated the performance of SATenstein-LS solvers using two other performance metrics: median-of-median runtime and percentage of solved instances. If a solver finishes most of the runs on most instances, capped runs do not affect its median-of-medians performance, and hence this metric does not need a way of accounting for the cost of capped runs. (When the median of medians is a capped run, we say that the metric is undefined.) Table 9.9 shows that, although the SATenstein-LS solvers were obtained by optimizing for PAR-10, they still outperformed every challenger on every distribution except R3SAT, on which the challengers achieved slightly better performance than SATenstein-LS. Finally, we measured the percentage of instances on which the median runtime was below the cutoff used for capping runs. According to this measure, SATenstein-LS either equalled or beat every challenger, since it solved 100% of the instances in every benchmark set. In contrast, only 4 challengers managed to solve more than 50% of the instances in every test set.
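The three test-set measures discussed above (PAR-10, median-of-median runtime, and percentage of instances solved) can be computed from a matrix of per-run times; a minimal sketch follows, with made-up runtimes rather than the measured data from Table 9.9.

```python
import statistics

CUTOFF = 600.0  # per-run cutoff in CPU seconds

def par10(runs):
    # Penalized average runtime: timed-out runs count as 10x the cutoff.
    return statistics.mean(t if t < CUTOFF else 10 * CUTOFF for t in runs)

def median_of_medians(instance_runs):
    # Inner median over runs per instance, outer median over instances;
    # undefined (None) when the median of medians is itself a capped run.
    med = statistics.median(statistics.median(r) for r in instance_runs)
    return med if med < CUTOFF else None

def pct_solved(instance_runs):
    # Percentage of instances whose median runtime is below the cutoff.
    solved = sum(statistics.median(r) < CUTOFF for r in instance_runs)
    return 100.0 * solved / len(instance_runs)

# Two instances, three runs each (illustrative values only).
runs = [[0.2, 0.5, 0.9], [3.0, CUTOFF, CUTOFF]]
print(par10([t for r in runs for t in r]))   # capped runs dominate: ~2000.77
print(median_of_medians(runs), pct_solved(runs))
```

The example illustrates why PAR-10 is sensitive to capped runs while the median-of-medians measure is not: a single instance with mostly timed-out runs inflates the mean by 10 × 600 seconds per run but merely shifts the outer median.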
Overall, the SATenstein-LS solvers scored well on these measures, even though their performance was not explicitly optimized for them.

The relative performance of the challengers varied significantly across distributions. For example, the three dynamic local search solvers (SAPS, PAWS, and RSAPS) performed substantially better than the other challengers on factoring instances (FAC), but performed weakly on SW-GCP. Similarly, GNOV (the 2007 SAT Competition winner in the random satisfiable category) performed very poorly on our two industrial benchmark distributions, CBMC(SE) and FAC, but solved SW-GCP and HGEN instances quite efficiently.³ This suggests that different distributions are most efficiently solved by rather different solvers. We are thus encouraged that our automatic algorithm construction process was able to find good configurations for each distribution.

³ Interestingly, on both types of random instances we considered, GNOV failed to outperform some of the older solvers, in particular PAWS and RSAPS.

So far, we have discussed performance metrics that describe aggregate performance over the entire test set. One might wonder whether SATenstein-LS's strong performance is due to its ability to solve relatively few instances very efficiently while performing poorly on others. Table 9.10, which summarizes the performance of each SATenstein-LS solver compared to each challenger on a per-instance basis, shows that this is typically not the case. Except on R3SAT, SATenstein-LS solvers outperformed the respective best challengers for each distribution on a per-instance basis. R3SAT was the exception: PAWS outperformed SATenstein-LS[R3SAT] on the largest fraction of instances (77.2%), yet SATenstein-LS[R3SAT] still achieved a lower PAR-10 score, indicating that it performed dramatically better than PAWS on a relatively small number of hard instances.

Challenger  QCP          SW-GCP        R3SAT         HGEN         FAC           CBMC(SE)
AG20        76.1 (23.3)  95.8 (4.2)    45.6 (17.6)   98.0 (1.5)   100.0 (0.0)   100.0 (0.0)
AG2p        70.6 (28.6)  88.9 (10.7)   47.6 (15.2)   98.2 (1.1)   100.0 (0.0)   100.0 (0.0)
AG2+        75.4 (24.1)  94.3 (5.7)    61.6 (12.4)   98.5 (1.1)   100.0 (0.0)   100.0 (0.0)
ANOV        57.7 (40.4)  68.5 (27.2)   57.2 (8.0)    97.6 (1.3)   99.9 (0.0)    100.0 (0.0)
G2          81.4 (18.6)  100.0 (0.0)   34.0 (15.2)   98.0 (1.4)   100.0 (0.0)   100.0 (0.0)
GNOV        97.5 (2.4)   99.6 (0.4)    48.8 (16.4)   99.4 (0.4)   100.0 (0.0)   100.0 (0.0)
PAWS        69.0 (30.1)  100.0 (0.0)   19.6 (3.2)    100.0 (0.0)  68.8 (0.0)    100.0 (0.0)
RANOV       100.0 (0.0)  100.0 (0.0)   99.2 (0.0)    100.0 (0.0)  100.0 (0.0)   100.0 (0.0)
RSAPS       71.5 (28.0)  99.8 (0.2)    96.8 (3.2)    100.0 (0.0)  81.1 (0.0)    42.2 (54.5)
SAPS        70.9 (28.5)  100.0 (0.0)   96.8 (2.4)    100.0 (0.0)  73.7 (0.2)    48.8 (48.5)
VW          85.3 (14.7)  100.0 (0.0)   100.0 (0.0)   100.0 (0.0)  100.0 (0.0)   100.0 (0.0)

Table 9.10: Percentage of instances on which SATenstein-LS achieved better (equal) median runtime than each of the 11 challengers. Medians were taken over 25 runs on each instance with a cutoff time of 600 CPU seconds per run.

[Figure 9.1 (scatter plots not reproduced here): Performance comparison of SATenstein-LS and the best challenger. Left: per-instance median runtimes of SATenstein[R3SAT] vs. PAWS on R3SAT; Right: SATenstein[FAC] vs. SAPS on FAC (both axes in CPU seconds). Medians were taken over 25 runs on each instance with a cutoff time of 600 CPU seconds per run.]

9.3.2 Comparison with Automatically Configured Versions of Challengers

The fact that SATenstein-LS solvers achieved significantly better performance than all 11 challengers with default parameter configurations (i.e., those selected by their designers) admits two possible explanations. First, it could be due to the fact that SATenstein-LS's (vast) design space includes useful new configurations that combine solver components in novel ways. Second, the performance gains may have been achieved simply by better configuring existing SLS algorithms within their existing, and quite small, design spaces. To determine which of these two hypotheses holds, we compared SATenstein-LS solvers against challengers configured for optimized performance on our benchmark sets, using the same automated configuration procedure and protocol.

Table 9.11 summarizes the performance thus obtained, and Figure 9.2 shows the PAR-10 ratios of SATenstein-LS solvers over the default and configured challengers. Compared to challengers with default configurations (see Table 9.9), the specifically optimized versions of the challenger solvers often achieved significantly better performance, reducing their performance gaps to the SATenstein-LS solvers. For example, automatic configuration of G2 led to a speedup of 5 orders of magnitude in terms of PAR-10 on SWGCP and solved 100% of the instances in that benchmark set within a 600-second cutoff (vs. 31% for G2 default).
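The "better (equal)" entries of Table 9.10 reduce to simple counting over per-instance median runtimes. A minimal sketch, with invented medians rather than measured data:

```python
def better_equal_pct(med_a, med_b):
    # Percentage of instances where solver A's median runtime is strictly
    # better than B's, and (in parentheses, as in Table 9.10) where they tie.
    n = len(med_a)
    better = sum(a < b for a, b in zip(med_a, med_b))
    equal = sum(a == b for a, b in zip(med_a, med_b))
    return 100.0 * better / n, 100.0 * equal / n

# Illustrative per-instance medians for 4 instances (not thesis data):
sat_ls = [0.01, 0.02, 0.50, 0.02]
chall  = [0.03, 0.02, 0.40, 0.09]
print(better_equal_pct(sat_ls, chall))  # (50.0, 25.0)
```

Note that the two percentages need not sum to 100: the remainder is the fraction of instances on which the challenger was strictly faster (e.g., PAWS's 77.2% on R3SAT corresponds to the table entry 19.6 (3.2)).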
Solver          QCP                 SW-GCP              R3SAT            HGEN              FAC                  CBMC(SE)
ANOV[D] [80]    26.13/0.02/99.6%    0.06/0.04/100%      2.68/0.12/100%   119.75/0.54/98.2%  1731.16/296.84/90.1%  994.94/0.50/83.4%
G2[D] [131]     514.29/0.03/91.4%   0.05/0.05/100%      3.64/0.15/100%   98.70/0.75/99.1%   617.83/110.42/97.8%   1084.60/0.58/81.4%
GNOV[D] [161]   417.33/0.03/92.9%   0.22/0.09/100%      8.87/0.17/100%   68.24/0.62/99.4%   5478.75/N/A/0.3%      2195.76/0.19/61.8%
PAWS[D] [195]   68.06/0.02/99.2%    0.70/0.35/100%      1.91/0.09/100%   64.48/0.83/99.4%   22.01/10.39/100%      1925.56/0.50/67.7%
RANOV[D] [160]  75.06/0.1/98.9%     0.15/0.12/100%      13.85/0.24/100%  141.61/0.77/98.1%  336.27/95.53/100%     1223.83/0.47/80.4%
RSAPS[D] [91]   868.37/0.04/85.2%   0.19/0.15/100%      1.32/0.11/100%   42.99/0.64/99.5%   12.17/7.86/100%       67.59/0.02/99.0%
SAPS[D] [91]    27.69/0.06/99.8%    0.31/0.21/100%      1.54/0.16/100%   31.77/0.75/99.6%   10.68/7.00/100%       62.63/0.02/99.0%
VW[D] [166]     0.33/0.02/100%      417.71/8.43/94.8%   1.26/0.15/100%   57.44/1.00/99.6%   32.38/17.60/100%      16.45/0.02/100%

Table 9.11: Performance summary of the automatically configured versions of 8 challengers (the other three challengers have no parameters). Every algorithm was run 25 times on each problem instance with a cutoff of 600 CPU seconds per run. Each cell summarizes the test-set performance of the given algorithm on the given distribution as a/b/c, where a is the penalized average runtime; b is the median of the median runtimes over all instances (not defined when the median of the median runtimes corresponds to a timed-out run); c is the percentage of instances solved (i.e., having median runtime < cutoff). The best-scoring algorithm(s) in each column are indicated in bold.

[Figure 9.2 (bar charts not reproduced here): Performance of SATenstein-LS solvers vs. challengers with default and optimized configurations. For every benchmark distribution D, the base-10 logarithm of the PAR-10 ratio between SATenstein[D] and one challenger (default and optimized) is shown on the y-axis, based on data from Tables 9.9 and 9.11. Top-left: QCP; top-right: SWGCP; middle-left: R3SAT; middle-right: HGEN; bottom-left: FAC; bottom-right: CBMC(SE).]

However, it is worth noting that the configured challengers sometimes also exhibited worse performance than the default configurations (in the worst case, VW[SWGCP] was 2.58 times slower than VW default in terms of PAR-10 with a cutoff of 600 CPU seconds).
This was caused by the short cutoff time used during the configuration process, as motivated in Section 5.2; had we used the same 5 CPU second cutoff time for computing PAR-10 scores, we expect that the configured challengers would always have outperformed the default versions.

Examining benchmark distributions individually and ranging over our 8 challengers, we observed average and median speedups over default configurations of 396 and 3.58 (for QCP), 15,900 and 3,240 (for SWGCP), 5.84 and 2.74 (for R3SAT), 1.23 and 1.01 (for HGEN), 15.4 and 1.61 (for FAC), and 6.61 and 2.00 (for CBMC(SE)). We were surprised to observe only small speedups for all challengers on HGEN. Considering challengers individually and ranging over our 6 benchmark distributions, the average and median PAR-10 improvements were 15.0 and 1.85 (for ANOV), 13,200 and 3.84 (for G2), 1.74 and 1.05 (for GNOV), 1,070 and 0.98 (for PAWS), 1.33 and 1.03 (for RANOV), 4,870 and 6.85 (for RSAPS), 2,080 and 12.3 (for SAPS), and 539 and 16.6 (for VW). RANOV showed the smallest performance improvement as a result of automated configuration across all benchmarks; this is likely due to RANOV's small parameter space (it has only one parameter).

Table 9.12 shows the performance of our SATenstein-LS solvers, the best default challengers, and the best automatically configured challengers. For QCP, HGEN, and CBMC(SE), the SATenstein-LS solvers still significantly outperformed the best configured challengers. For R3SAT and SWGCP, the performance difference was smaller, but still above 10%. The only benchmark on which the best configured challenger outperformed SATenstein-LS was FAC.
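The speedup statistics quoted above, and the log ratios plotted in Figure 9.2, are simple functions of PAR-10 scores. The sketch below uses hypothetical scores, not the measured ones:

```python
import math
import statistics

def speedups(par_default, par_tuned):
    # Per-benchmark speedup factors of the tuned over the default config;
    # values below 1 mean that tuning hurt performance.
    return [d / t for d, t in zip(par_default, par_tuned)]

# Hypothetical PAR-10 scores for one challenger on three benchmarks:
default_scores = [1000.0, 4.0, 20.0]
tuned_scores = [10.0, 2.0, 40.0]  # tuning can also hurt (last entry)

s = speedups(default_scores, tuned_scores)
print(statistics.mean(s), statistics.median(s))  # one huge win skews the mean

# Figure 9.2 plots log10(PAR(SATenstein[D]) / PAR(challenger)); negative
# values mean SATenstein-LS is faster.
print(math.log10(5.0 / 500.0))  # -2.0, i.e., a 100-fold PAR-10 advantage
```

The contrast between mean and median speedups in the example (100x on one benchmark versus roughly 2x typically) mirrors why the text reports both statistics: a single extreme benchmark can dominate the average.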
As we will see later (in Figure 9.3), SATenstein-LS[FAC] turns out to be very similar to the best configured challenger, SAPS[FAC].

Distribution                 QCP              SW-GCP           R3SAT            HGEN              FAC                CBMC(SE)
Best challenger (default)    ANOV             RANOV            PAWS             RSAPS             SAPS               VW
  performance                25.42/0.02/99.6%  0.15/0.12/100%  1.77/0.08/100%   33.28/2.33/99.7%  16.41/10.56/100%   385.12/0.23/93.4%
Best challenger (tuned)      VW[D]            G2[D]            VW[D]            SAPS[D]           SAPS[D]            VW[D]
  performance                0.33/0.02/100%   0.05/0.05/100%   1.26/0.15/100%   31.77/0.75/99.6%  10.68/7.00/100%    16.45/0.02/100%
SATenstein-LS[D]
  performance                0.08/0.01/100%   0.03/0.02/100%   1.11/0.14/100%   0.02/0.01/100%    10.89/7.90/100%    4.75/0.02/100%

Table 9.12: Performance of SATenstein-LS solvers, the best challengers with default configurations, and the best automatically configured challengers. Every algorithm was run 25 times on each instance with a cutoff of 600 CPU seconds per run. Each entry gives test-set performance as a/b/c, where a is the penalized average runtime; b is the median of the median runtimes over all instances; c is the percentage of instances solved (i.e., those with median runtime < cutoff).

Overall, these experimental results provide evidence in favour of our first hypothesis: the good performance of SATenstein-LS solvers is due to combining components gleaned from existing high-performance algorithms in novel ways. Later, in Section 9.4.3, we provide a detailed analysis demonstrating the design differences between configured SATenstein-LS solvers and their component solvers.

9.3.3 Comparison with Complete Solvers

Table 9.13 compares the performance of SATenstein-LS solvers and four prominent complete SAT solvers (two for each distribution). For four out of our six benchmark distributions, SATenstein-LS solvers comprehensively outperformed the complete solvers. For the other two distributions, which are industrial (FAC and CBMC(SE)), the performance of the selected complete solvers was much better than that of both the SATenstein-LS solvers and all of our other local search solvers. The success of DPLL-based complete solvers on industrial instances is not surprising; it is widely believed to be due to their ability to take advantage of instance structure (by means of unit propagation and clause learning). Our results confirm that state-of-the-art local search solvers cannot compete with state-of-the-art DPLL solvers on industrial instances. However, SATenstein-LS solvers have made significant progress in closing the gap. For example, for CBMC(SE), state-of-the-art complete solvers were five orders of magnitude better than the next-best SLS challenger, VW; SATenstein-LS reduced the performance gap to three orders of magnitude. We also obtained a modest improvement (a factor of 1.51) for FAC.

Distribution       QCP               SW-GCP            R3SAT            HGEN               FAC              CBMC(SE)
Complete solver    Minisat2.0        Minisat2.0        Kcnfs 04         Kcnfs 04           Minisat2.0       Minisat2.0
  performance      35.05/0.02/99.5%  2.17/0.9/100%     4905.6/N/A/18.8%  3108.77/N/A/49.5%  0.03/0.02/100%   0.23/0.03/100%
Complete solver    March pl          March pl          March pl         March pl           Picosat          Picosat
  performance      120.29/0.2/98.1%  253.99/1.12/95.8%  3543.01/N/A/42.0%  2763.41/400.78/55.2%  0.02/0.02/100%  0.03/0.01/100%
SATenstein-LS[D]
  performance      0.08/0.01/100%    0.03/0.02/100%    1.11/0.14/100%   0.02/0.01/100%     10.89/7.90/100%  4.75/0.02/100%

Table 9.13: Performance summary of SATenstein-LS and the complete solvers. Every complete solver was run once (SATenstein-LS was run 25 times) on each instance with a per-run cutoff of 600 CPU seconds. Each cell summarizes the test-set performance of the given algorithm on the given distribution as a/b/c, where a is the penalized average runtime; b is the median runtime over all instances (for SATenstein-LS, the median of the median runtimes over all instances; medians are not defined when the corresponding runs failed to find a solution within the cutoff time); c is the percentage of instances solved (i.e., having median runtime < cutoff). The best-scoring algorithm(s) in each column are indicated in bold.

9.3.4 Configurations Found

To better understand the automatically constructed SATenstein-LS solvers, we compared their automatically selected design choices to the designs of existing SLS solvers for SAT. The full parameter configurations of the six SATenstein-LS solvers are shown in Table 9.14.

SATenstein-LS[QCP] uses building blocks 1, 3, and 5. Recall that block 1 is used for performing search diversification, and block 5 is used to update data structures, tabu attributes, and clause penalties. In block 3, which is used to instantiate a solver belonging to the WalkSAT architecture, the heuristic is based on Novelty++′, and in block 1, diversification flips the least-frequently-flipped variable from an UNSAT clause. SATenstein-LS[SW-GCP] is similar to SATenstein-LS[QCP] but does not use block 1; in block 3, its heuristic is based on Novelty++ as used within G2. SATenstein-LS[R3SAT] uses blocks 1, 4, and 5; it is closest to SAPS, but performs search diversification, and a tabu list of length 3 is used to exclude some variables from the search neighbourhood. Recall that block 4 is used to instantiate dynamic local search algorithms. SATenstein-LS[HGEN] uses blocks 1, 3, and 5. It is similar to SATenstein-LS[QCP] but uses a heuristic based on VW1 as well as a tabu list of length 3.
SATenstein-LS[FAC] uses blocks 4 and 5; its instantiation closely resembles that of SAPS, but differs in the way in which variable scores are computed.

QCP:       -useAdaptiveMechanism 0 -performSearchDiversification 1 -usePromisingList 0 -singleClauseAsNeighbor 1
           -adaptWalkProb 0 -selectClause 1 -useClausePenalty 0 -performTabuSearch 0 -heuristic 4
           -performAlternateNovelty 0 -searchDiversificationStrategy 3 -dp 0.07 -c 0.0001 -novNoise 0.5 -rfp 0.1 -s 0.1
SW-GCP:    -useAdaptiveMechanism 0 -performSearchDiversification 0 -usePromisingList 0 -singleClauseAsNeighbor 1
           -adaptWalkProb 0 -selectClause 1 -useClausePenalty 0 -performTabuSearch 0 -heuristic 3
           -performAlternateNovelty 0 -dp 0.01 -c 0.01 -novNoise 0.1 -s 0.1
R3SAT:     -useAdaptiveMechanism 0 -performSearchDiversification 1 -singleClauseAsNeighbor 0 -scoringMeasure 3
           -tieBreaking 2 -useClausePenalty 1 -searchDiversificationStrategy 1 -smoothingScheme 1 -tabuLength 3
           -performTabuSearch 1 -alpha 1.189 -ps 0.1 -rho 0.8 -sapsthresh -0.1 -rwp 0.05 -wp 0.01
HGEN:      -useAdaptiveMechanism 0 -performSearchDiversification 1 -usePromisingList 0 -singleClauseAsNeighbor 1
           -tabuLength 3 -performTabuSearch 1 -useClausePenalty 0 -searchDiversificationStrategy 4
           -adaptWalkProb 0 -selectClause 1 -heuristic 7 -c 0.001 -rfp 0.15 -s 0.1 -wpWalk 0.1
FAC:       -useAdaptiveMechanism 0 -performSearchDiversification 0 -singleClauseAsNeighbor 0 -scoringMeasure 3
           -tieBreaking 1 -useClausePenalty 1 -smoothingScheme 1 -tabuSearch 0
           -alpha 1.189 -ps 0.066 -rho 0.83 -sapsthresh -0.3 -wp 0.03
CBMC(SE):  -useAdaptiveMechanism 0 -performSearchDiversification 1 -singleClauseAsNeighbor 0 -useClausePenalty 1
           -smoothingScheme 1 -performTabuSearch 0 -searchDiversificationStrategy 4 -scoringMeasure 3
           -tieBreaking 2 -alpha 1.066 -ps 0 -rho 0.83 -sapsthresh -0.3 -wp 0.01 -rfp 0.1

Table 9.14: SATenstein-LS parameter configuration found for each distribution.
SATenstein-LS[CBMC(SE)] uses blocks 1, 4, and 5; it computes variable scores using -BreakCount and employs a search diversification strategy similar to that of VW.

Interestingly, none of the six SATenstein-LS configurations we found uses a promising list (block 2), a technique integrated into many recent SAT Competition winners. This indicates that many interesting designs that could compete with existing high-performance solvers still remain unexplored in the SLS design space. In addition, we found that all SATenstein-LS configurations differ from existing SLS algorithms (except for SATenstein[FAC], whose configuration and performance are similar to those of SAPS). This underscores the importance of an automated approach, since manually finding such good configurations in a huge design space is very difficult.

9.4 Quantitative Comparison of Algorithm Configurations

So far, we have examined the parameter configurations identified for each of our six benchmark distributions, quantitatively assessed the performance they achieved, and qualitatively observed that most were substantially different from existing solver designs. We would like to be able to dig deeper, saying something about how similar each of these configurations is to existing designs. More broadly, we would like automatic and quantitative techniques for comparing different solver designs in terms of their similarities and differences. In the case of highly configurable algorithms like SATenstein-LS, this requires some sophistication, because parameters share conditional dependencies. The approach presented in the following can deal with arbitrary levels of conditional parameter dependence and can be applied to arbitrary parametric algorithms.
Unlike previous work on this problem, which only considered the edit distance between configurations [153], the metric we introduce takes into account not only the differences between the parameters that are active in two given configurations, but also the importance of each parameter.

9.4.1 Concept DAGs

To preserve the hierarchical structure of parameter dependencies, we use a novel data structure called a concept DAG to represent algorithm configurations; our notion of a concept DAG is based on that of a concept tree [217]. We then define four operators whose repeated application can be used to map between arbitrary concept DAGs, and assign each operator a cost. To compare two parameter configurations, we first represent them using concept DAGs and then define their similarity as the minimal total cost of transforming one DAG into the other.

A concept DAG is a six-tuple G = (V, E, L_V, R, D, M), where V is a set of nodes, E is a set of directed edges between the nodes in V such that they form an acyclic graph, L_V is a set of lexicons (terms) for concepts used as node labels, R is a distinguished node called the root, D is the domain of the nodes, and M is an injective mapping from V to L_V. A parameter configuration can be expressed as a concept DAG⁴ in which each node in V represents a parameter and each directed edge in E represents the conditional dependence relationship between two parameters. D is the domain of all parameters, and M specifies which values any given parameter v ∈ V can take. We add an artificial root node R, which connects to all parameter nodes that do not have any parent, and refer to these parameters as level-1 parameters.

⁴ It was necessary for us to base this data structure on DAGs rather than trees because parameters may have more than one parent parameter, where the child is only active if the parents take certain values. For example, the noise parameter phi is only activated when both useAdaptiveMechanism and singleClauseAsNeighbor are turned on.

We can transform one concept DAG into another by a series of delete, insert, relabel, and move operations, each of which has an associated cost. To measure the degree of similarity between two algorithm configurations, we first express them as concept DAGs, DAG1 and DAG2, and define the distance between these DAGs as the minimal total cost required for transforming DAG1 into DAG2. Obviously, the distance between two identical configurations is 0.

The parameters with the biggest impact on an algorithm's execution path are likely to have a low level (i.e., to be conditional upon few or no other parameters) and/or to turn on a complex mechanism (i.e., to have many parameters conditional upon them). Therefore, we say that the importance of a parameter v is a function of its depth (the length of the longest path from the root R of the given concept DAG to v) and the total number of other parameters conditional on it. To capture this definition of importance, we define the cost of each of the four DAG-transforming operations as follows.

Deletion cost: C(delete(v)) = (1/|V|) · (height(DAG) − depth(v) + 1 + |DE(v)|), where height(DAG) is the height of the DAG, depth(v) is the depth of node v, and DE(v) is the set of descendants of node v. This captures the idea that it is more costly to delete low-level parameters and parameters that turn on complex mechanisms.

Insertion cost: C(insert(u, v)) = (1/|V|) · (height(DAG) − depth(u) + 1 + |DE(v)|), where DE(v) is the set of descendants of v after the insertion.

Moving cost: C(move(u, v)) = ((|V| − 2)/(2 · |V|)) · [C(delete(v)) + C(insert(u, v))], where |V| > 2.

Relabelling cost: C(relabel(v, l_v, l_v*)) = [C(delete(v)) + C(insert(u, v))] · s(l_v, l_v*), where s(l_v, l_v*) is a measure of the distance between the two labels l_v and l_v*. For parameters with continuous domains, s(l_v, l_v*) = |l_v − l_v*|. For parameters whose domains are some finite, ordinal, discrete set {l_v1, l_v2, . . .
, lvk},s(lv, lv∗)= abs(v−v∗)/(k−1), where abs(v−v∗) measures the number of in-termediate values between v and v∗. For categorical parameters, s(lv, lv∗) = 0if lv = lv∗and 1 otherwise.9.4.2 Comparison of SATenstein-LS ConfigurationsWe are now able to compare our automatically identified SATenstein-LS solverdesigns to all of the challengers, without having to resort to expert knowledge. Aspreviously mentioned, 3 of our 11 challengers (AG2p, AG2plus, and AG20) areparameter free. Furthermore, RANOV only differs from ANOV by the addition of apreprocessing step, and so can be understood as a variant of the same algorithm.This leaves us with 7 parameterized challengers to consider. For each, we sam-pled 50 configurations (consisting of the default configuration, one configurationoptimized for each of our 6 benchmark distributions, and 43 random configura-tions). We then computed the pairwise transformation cost between the resulting359 configurations (7 × 50 challengers’ configurations + 6 SATenstein-LSsolvers + AG2p + AG2plus + AG20). The result can be understood as a graphwith 359 nodes and 128 522 edges, with nodes corresponding to concept DAGs,and edges labeled by the minimum transformation cost between them. To visual-ize this graph, we used a dimensionality reduction method to map it onto a plane,with the aim of positioning points so that the Euclidean distance between every pairof points approximated their transformation cost as accurately as possible. In par-ticular, we used the Isomap algorithm [194], which builds on a multidimensionalscaling technique but has the ability to preserve the intrinsic geometry of the data,as captured in the geodesic manifold distances between all pairs of data points. Itis capable of discovering the nonlinear degrees of freedom that underlie complexnatural observations. 
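The cost definitions from Section 9.4.1 are straightforward to operationalize. Below is a minimal Python sketch of the deletion cost (the thesis implementation was written in Matlab); the toy parameter hierarchy is invented for illustration, and |V| is taken here to include the artificial root node, a detail the text leaves unspecified.

```python
# Sketch of the concept-DAG deletion cost defined in Section 9.4.1.
# The parameter hierarchy below is a toy example invented for
# illustration; |V| is taken to include the artificial root node.

class ConceptDAG:
    def __init__(self, parents):
        # parents: node -> list of parent nodes; "ROOT" is the artificial root.
        self.parents = parents
        self.nodes = set(parents) | {"ROOT"}

    def children(self, u):
        return [v for v, ps in self.parents.items() if u in ps]

    def depth(self, v):
        # Length of the longest path from the root to v.
        if v == "ROOT":
            return 0
        return 1 + max(self.depth(p) for p in self.parents[v])

    def height(self):
        return max(self.depth(v) for v in self.nodes)

    def descendants(self, v):
        out = set()
        for c in self.children(v):
            out |= {c} | self.descendants(c)
        return out

def delete_cost(dag, v):
    # C(delete(v)) = 1/|V| * (height(DAG) - depth(v) + 1 + |DE(v)|):
    # low-level parameters and parameters that enable complex mechanisms
    # (many descendants) are costlier to delete.
    return (dag.height() - dag.depth(v) + 1 + len(dag.descendants(v))) / len(dag.nodes)

# Toy configuration space: two level-1 parameters gate one child parameter,
# mirroring the phi / useAdaptiveMechanism / singleClauseAsNeighbor example.
dag = ConceptDAG({
    "useAdaptiveMechanism": ["ROOT"],
    "singleClauseAsNeighbor": ["ROOT"],
    "phi": ["useAdaptiveMechanism", "singleClauseAsNeighbor"],
})

print(delete_cost(dag, "useAdaptiveMechanism"))  # 0.75: level-1, one descendant
print(delete_cost(dag, "phi"))                   # 0.25: deep leaf, cheap to delete
```

The insertion, move, and relabel costs follow the same pattern, each reusing height, depth, and descendant counts as in the formulas above.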
We implemented the transformation cost computation code using Matlab, and performed all computations using the same computer cluster as described in Section 9.2.

[Figure 9.3: Visualization of the transformation costs in the design of 16 high-performance solvers (359 configurations) obtained via Isomap.]

The final layout of similarities among 359 configurations (16 algorithms) is shown in Figure 9.3. Observe that in most cases the 50 different configurations for a given challenger solver were so similar that they mapped to virtually the same point in the graph.

As noted earlier, the distance between any two configurations shown in Figure 9.3 only approximates their true distance. In addition, the result of the visualization also depends on the number of configurations considered: adding an additional configuration may affect the position of many or all other configurations. Thus, before drawing further conclusions about the results illustrated in Figure 9.3, we validated the fidelity of the visualization to the original distance data. As can be seen from Figure 9.4, although Isomap tended to underestimate the true distances between configurations, there was a strong correlation between the computed and mapped distances (Pearson correlation coefficient: 0.93). Also, the mapping did a good job of preserving the relative ordering of the true distances between configurations (Spearman correlation coefficient: 0.91); in other words, distances that appear similar in the 2D plot tend to correspond to similar true distances (and vice versa).

[Figure 9.4: True vs. mapped distances in Figure 9.3. The data points correspond to the complete set of SATenstein-LS[D] for all domains and all challengers with their default and domain-specific, optimized configurations.]
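The fidelity check just described amounts to correlating true pairwise transformation costs with the Euclidean distances of the 2D layout. The following self-contained sketch uses invented distance values (the actual analysis used the full 359-configuration data and was done in Matlab):

```python
import math

# Fidelity check for a 2D embedding: correlate true pairwise transformation
# costs with the Euclidean distances of the mapped layout. The distance
# values below are invented for illustration.

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Mapped distances systematically underestimate the true ones (as Isomap
# did in Figure 9.4) while largely preserving their ordering.
true_dist   = [0.5, 1.0, 2.0, 3.5, 5.0, 6.5]
mapped_dist = [0.4, 0.9, 1.6, 3.0, 4.2, 5.5]

r = pearson(true_dist, mapped_dist)
print(r > 0.99)  # True: the layout is a faithful summary of the distances
```

A Spearman coefficient, as also reported above, is obtained the same way after replacing each value by its rank, which checks ordering rather than linear agreement.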
Digging deeper, we confirmed that the challenger closest in Figure 9.3 to each given SATenstein-LS solver was indeed the one with the lowest true transformation cost. This was not true for the most distant challengers; however, we find this acceptable, since in the following we are mainly interested in examining which configurations are similar to each other.

Having confirmed that our dimensionality reduction method performs reliably, let us examine Figure 9.3 in more detail. Overall, and unsurprisingly, we first note that the transformation cost between two configurations in the design space is very weakly related to their performance difference (quantitatively, the Spearman correlation coefficient between performance difference (PAR-10 ratio) and configuration difference (transformation cost) was 0.25). As we suspected based on our manual examination of parameter configurations, each SATenstein-LS solver except SATenstein-LS[FAC] was quite different from every challenger. This provides further evidence that the superior performance of SATenstein-LS solvers is due to combining components from existing SLS algorithms in novel ways. Examining algorithms by type, we note that all dynamic local search algorithms are grouped together, on the right side of Figure 9.3; likewise, the algorithms using adaptive mechanisms are grouped together at the bottom of the figure. SATenstein-LS solvers were typically more similar to each other than to challengers, and fell into two broad clusters. The first cluster also includes the SAPS variants (SAPS and RSAPS), while the second also includes G2 and VW. None of the SATenstein-LS solvers uses an adaptive mechanism to automatically adjust other parameters; in fact, as shown in Table 9.12, the same is true of the best performance-optimized challengers.
This suggests that in many cases, contrary to common belief (see, e.g., [80, 135]), it may be preferable to expose parameters so that they can be instantiated by sophisticated configurators, rather than adjusting them automatically at runtime using a simple adaptive mechanism.

We now consider the benchmarks individually. For the FAC benchmark, SATenstein-LS[FAC] had similar performance to SAPS[FAC]; as seen in Figure 9.3, both solvers are structurally very similar as well. Overall, for the 'industrial' distributions, CBMC(SE) and FAC, dynamic local search algorithms often yielded the best performance amongst all challengers. Our automatically constructed SATenstein-LS solvers for these two distributions are also dynamic local search algorithms. Due to their larger search neighbourhood and their use of clause penalties, dynamic local search algorithms are more suitable for solving industrial SAT instances, which often have some special global structure.

For R3SAT, a well-studied distribution, many challengers showed good performance (the top three challengers were VW, RSAPS, and SAPS). The performance of SATenstein-LS[R3SAT] is only slightly better than that of VW[R3SAT]. Figure 9.3 shows that SATenstein-LS[R3SAT] is a dynamic local search algorithm similar to RSAPS and SAPS.

For HGEN, even the best performance-optimized challengers, RSAPS[HGEN] and SAPS[HGEN], performed poorly. SATenstein-LS[HGEN] achieves more than 1,000-fold speedups over all challengers. Its configuration is far away from any dynamic local search algorithm (the best challengers), and closest to VW and G2.

For QCP, VW[QCP] does not reach the performance of SATenstein-LS[QCP], but significantly outperforms all other challengers. Our transformation cost analysis shows that VW is the closest neighbour of SATenstein-LS[QCP].
For SWGCP, many challengers achieve performance similar to that of SATenstein-LS[SWGCP]. Figure 9.3 shows that SATenstein-LS[SWGCP] is close to G2[SWGCP], which is the best-performing challenger on SWGCP.

9.4.3 Comparison to Configured Challengers

Since there are large performance gaps between default and configured challengers (as seen in Figure 9.2), we were also interested in the transformation cost between the configurations of individual challenger solvers. Recall that after configuring each challenger for each distribution, we found that SAPS was best on HGEN and FAC; G2 was best on SWGCP; and VW was best on CBMC(SE), QCP, and R3FIX. The left column of Figure 9.5 visualizes the parameter spaces of each of these three solvers (43 random configurations + the default configuration + 6 optimized configurations). The right column of Figure 9.5 shows the same thing, but also adds the best SATenstein-LS configurations for each benchmark on which the challenger had top performance.

Examining the figures in the left column of Figure 9.5, we first note that the SAPS configurations optimized for FAC and HGEN are very similar to each other but differ substantially from SAPS's default configuration. On SWGCP, the optimized configuration of G2 not only performed much better than the default but, as seen in Figure 9.5 (middle left), is also quite different. All three top-performing VW configurations are rather different from VW's default, and none of them uses the adaptive mechanism for choosing the parameters wpWalk, s, and c.
Since the parameter useAdaptiveMechanism is a level-1 parameter and many other parameters are conditionally dependent on it, the transformation costs between VW's default and optimized configurations are very large, due to the high relabelling cost for these nodes in our concept DAGs.

The right column of Figure 9.5 illustrates the similarity between optimized SATenstein-LS solvers and the best-performing challenger for each benchmark. As previously noted, SATenstein[FAC] and SAPS[FACT] are not only very similar in performance, but also structurally similar. Likewise, SATenstein[SWGCP] is similar to G2[SWGCP]. On R3SAT, many challengers have similar performance. SATenstein[R3SAT] (PAR-10 = 1.11) is quite different from the best challenger, VW[R3SAT] (PAR-10 = 1.26), but resembles SAPS[R3SAT] (PAR-10 = 1.53). For the three remaining benchmarks, SATenstein-LS solvers exhibited much better performance than the best optimized challengers, and their configurations likewise differ substantially from the challengers' configurations.

[Figure 9.5: The transformation costs of configurations of individual challengers and selected SATenstein-LS solvers. (a): SAPS (best on HGEN and FACT); (b): SAPS and SATenstein[HGEN, FACT]; (c): G2 (best on SWGCP); (d): G2 and SATenstein[SWGCP]; (e): VW (best on CBMC(SE), QCP, and R3FIX); (f): VW and SATenstein[CBMC, QCP, R3FIX].]

As an aside, it might initially be surprising that qualitative features of the visualizations in Figure 9.5 appear to be absent from Figure 9.3. In particular, the sets of randomly sampled challenger configurations that are quite well separated in Figure 9.5 are nearly collapsed into single points in Figure 9.3, although the scales are not vastly different. The reason for this lies in the fact that the 2D mapping of the highly non-planar pairwise distance data performed by Isomap focuses on minimal overall distortion. For example, when visualizing the differences within a set of randomly sampled SAPS configurations (Figure 9.5 (a)), Isomap spreads these out into a cloud of points to represent their differences. However, the presence of a single SATenstein-LS configuration that has large transformation costs from all of these SAPS configurations forces Isomap to use one dimension to capture those differences, leaving essentially only one dimension to represent the much smaller differences between the SAPS configurations (Figure 9.5 (b)). Adding further very different configurations (as present in Figure 9.3) leads to mappings in which the smaller differences between configurations of the same challenger become insignificant.

9.5 Conclusions

In this chapter, we have proposed a new approach for designing heuristic algorithms based on (1) a framework that can flexibly combine components drawn from existing high-performance solvers, and (2) a powerful algorithm configuration tool for finding instantiations that perform well on given sets of instances.
We have demonstrated the effectiveness of our approach by automatically constructing high-performance stochastic local search solvers for SAT. We have shown empirically that these automatically constructed SAT solvers outperform existing state-of-the-art solvers with manually and automatically optimized configurations on a range of widely studied distributions of SAT instances.

We have also proposed a new metric for quantitatively assessing the similarity between configurations of highly parametric solvers. We first introduced a data structure, the concept DAG, that preserves the internal hierarchical structure of parameters. We then estimated the similarity of two configurations as the transformation cost from one configuration to the other. We have demonstrated that visualizations based on transformation costs can provide useful insights into similarities and differences between solver configurations. In addition, we believe that this metric could be useful for suggesting potential links between algorithm structure and algorithm performance. While this chapter only applied our metric to compare SATenstein-LS and several local search algorithms for SAT, we expect the same technique to be useful more broadly for comparing different algorithm designs that can be expressed within the same configuration space.

Our original inspiration comes from Mary Shelley's classic novel, Frankenstein. One important methodological difference is that we use automated methods for selecting components for our monster instead of picking them by hand. The outcomes are quite different. Unlike the tragic figure of Dr. Frankenstein, whose monstrous creature haunted him enough to quench forever his ambitions to create a 'perfect' human, we feel encouraged to unleash not only our new solvers, but also the full power of our automated solver-building process onto other classes of SAT benchmarks. Like Dr.
Frankenstein, we find our creations somewhat monstrous, recognizing that the SATenstein solvers do not always represent the most elegant designs. Thus, desirable lines of future work include techniques for understanding the importance of different parameters in achieving strong performance on a given benchmark; the extension of our solver framework with preprocessors; and the investigation of algorithm configuration procedures other than ParamILS in the context of our approach. Encouraged by the results achieved with SLS algorithms for SAT, we believe that the general approach behind SATenstein-LS is equally applicable to non-SLS-based solvers and to other combinatorial problems. Finally, we encourage participants from the SAT community to apply SATenstein-LS to their own problem distributions, and to extend SATenstein-LS with their own heuristics. Source code and documentation for our SATenstein-LS framework are freely available at http://www.cs.ubc.ca/labs/beta/Projects/SATenstein.

Chapter 10
Hydra: Automatic Configuration of Algorithms for Portfolio-Based Selection

This chapter introduces Hydra, a novel technique for combining automated algorithm configuration and portfolio-based algorithm selection, thereby realizing the benefits of both. Hydra automatically builds a set of solvers with complementary strengths by iteratively configuring new algorithms. It is primarily intended for use in problem domains for which an adequate set of candidate solvers does not already exist. Nevertheless, we tested Hydra on a widely studied domain, stochastic local search algorithms for SAT, in order to characterize its performance against a well-established and highly competitive baseline. We found that Hydra consistently achieves major improvements over the best existing individual algorithms, and always at least roughly matches, and indeed often exceeds, the performance of the best portfolios of these algorithms.
(This work is based on joint work with Holger Hoos and Kevin Leyton-Brown [212].)

For mixed integer programming problems, there are very few strong solvers, and state-of-the-art solvers are highly parameterized. These observations render MIP a perfect case for applying techniques such as Hydra. We demonstrate how to improve Hydra to achieve strong performance for MIP based on a single MIP solver, IBM ILOG's CPLEX. By applying an advanced algorithm selection technique and modifying Hydra's method for selecting candidate configurations, we show that Hydra dramatically improves CPLEX's performance on a variety of MIP benchmarks, as compared to ISAC [111], algorithm configuration alone, and CPLEX's default configuration. (This part of the work is based on joint work with Frank Hutter, Holger Hoos, and Kevin Leyton-Brown [213].)

10.1 Hydra

Once state-of-the-art algorithm portfolios exist for a problem, such as the SATzilla portfolios for various categories of SAT instances, the question arises: how should new research aim to improve upon them? Inspired by the idea of "boosting as a metaphor for algorithm design" [128], we believe that algorithm design should focus on problems for which the existing portfolio performs poorly. In particular, [128] suggested using sampling (with replacement) to generate a new benchmark distribution that will be harder for an existing portfolio, and having new algorithms attempt to minimize average runtime on this benchmark. While we agree with the core idea of aiming explicitly to build algorithms that will complement a portfolio, we have come to disagree with its concrete realization as described most thoroughly by Leyton-Brown et al. (2009), realizing that average performance on a new benchmark distribution is not always an adequate proxy for the extent to which a new algorithm would complement a portfolio.

We note that a region of the original distribution that is exceedingly hard for all candidate algorithms can dominate the new distribution, leading to stagnation. A further problem is illustrated in the following, more complex example (due to Frank Hutter). Consider a uniform distribution over instance types A, B, and C. The current portfolio solves C instances in 0.01 seconds, and A and B instances in 20 seconds each. The new distribution S thus emphasizes instance types A and B. There are three kinds of algorithms. X algorithms solve A instances in 0.1 ± ε seconds and B instances in 100 ± ε seconds each, where ε is a number between 0 and 0.01; the actual runtime varies randomly within this range across given algorithm–instance pairs. Y algorithms solve B instances in 0.1 ± ε seconds and A instances in 100 ± ε seconds each. Z algorithms solve both A and B instances in 25 ± ε seconds each. All three algorithm types solve C instances in 10 ± ε seconds each. The best average performance on S will be achieved by some Z algorithm, which we thus add to the portfolio. However, observe that this new Z algorithm is dominated by the current portfolio. Thus our new distribution S′ will be the same as S. The process thus stagnates (endless algorithms of type Z exist), and we never add any X or Y algorithm to the portfolio, although adding any pair of these would lead to improved overall performance.

To overcome these limitations, we introduce Hydra, a new method for automatically designing algorithms to complement a portfolio. This name is inspired by the Lernaean Hydra, a mythological, multi-headed beast that grew new heads for those cut off during its struggle with the Greek hero Heracles. Hydra, given only a highly parameterized algorithm and a set of instance features, automatically generates a set of configurations that form an effective portfolio.
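The arithmetic behind the X/Y/Z example above is easy to verify. The following sketch (with ε = 0 for simplicity; names and values follow the example in the text) shows that a Z algorithm minimizes average runtime on S, even though a per-instance selector over one X and one Y algorithm beats it everywhere:

```python
# Arithmetic check of the A/B example, with epsilon = 0 for simplicity.
# Runtimes in seconds; the resampled distribution S contains only the
# instance types (A and B) that are hard for the current portfolio.
runtime = {
    "X": {"A": 0.1, "B": 100.0},
    "Y": {"A": 100.0, "B": 0.1},
    "Z": {"A": 25.0, "B": 25.0},
}

def avg_on_S(solver):
    # Average runtime over S, which weights A and B equally.
    return (runtime[solver]["A"] + runtime[solver]["B"]) / 2

# Minimizing average runtime on S selects a Z algorithm ...
best = min(runtime, key=avg_on_S)
print(best, avg_on_S(best))  # Z 25.0 (vs. 50.05 for X and for Y)

# ... yet a per-instance selector over one X and one Y beats Z everywhere.
xy_portfolio = {t: min(runtime["X"][t], runtime["Y"][t]) for t in ("A", "B")}
print(xy_portfolio)  # {'A': 0.1, 'B': 0.1}
```

This is precisely the stagnation problem: the average-runtime criterion keeps selecting Z-type algorithms and never admits the complementary X/Y pair.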
Hydra thus does not require any domain knowledge in the form of existing algorithms or algorithm components that are expected to work well, and can be applied to any problem. Hydra is an anytime algorithm: it begins by identifying a single configuration with the best overall performance, and then iteratively adds algorithms to the portfolio. It is also able to drop previously added algorithms when they are no longer helpful.

The critical idea behind Hydra is that a new candidate algorithm should be preferred exactly to the extent that it improves upon the performance of a (slightly idealized) portfolio. Hydra is thus implemented by changing the performance measure given to the algorithm configurator. A candidate algorithm is scored with its actual performance in cases where it is better than the existing portfolio, but with the portfolio's performance in cases where it is worse. Thus an algorithm is not penalized for bad performance on instances for which it should not be selected, but is only rewarded to the extent that it outperforms the portfolio. The examples given earlier would be handled properly by this approach: the presence of intractable instances does not lead one to ignore performance gains elsewhere, and X and Y algorithms would be chosen in the first two iterations.

As shown in the pseudocode below, Hydra takes five inputs: a parameterized solver s, a set of training problem instances I, an algorithm configuration procedure AC, a performance metric m to be optimized, and a procedure PB for building portfolio-based algorithm selectors.

In its first iteration, Hydra uses the configurator AC to produce a configuration of s, dubbed s1, that is optimized on the training set I according to performance metric m. Solver s1 is then run on all instances of I in order to collect data that can eventually be input to PB; runs performed during the earlier configuration process can be cached and reused as appropriate.
We define portfolio P1 as the portfolio that always selects s1, and the solver set S1 as {s1}.

Then, in each subsequent iteration k ≥ 2, Hydra defines a modified performance metric mk as the better of the performance of the solver being assessed and the performance of the current portfolio, both measured according to performance metric m. The configurator AC is run to find a configuration sk of s that optimizes mk on the entire training set I. As previously, the resulting solver is evaluated on the entire set I and then added to the solver set S. We then use PB to construct a new portfolio Pk from the given set of solvers. In each iteration of Hydra, the size of the candidate solver set Sk grows by one; however, PB may drop solvers that do not contribute to the performance of portfolio Pk (as in SATzilla [210]). Slightly modifying the second example provided earlier, if Z algorithms have slightly better performance on A and B instances than the current portfolio, some Z algorithm will be chosen in the first iteration. However, X and Y algorithms are chosen in the next two iterations, at which point the Z algorithm will be dropped, because it is dominated by the pair of X and Y algorithms.

Hydra can be terminated using various criteria, such as a user-specified bound on the number of iterations and/or a total computation-time budget.

The algorithm configuration procedure AC used within Hydra must be able to deal efficiently with configurations having equal performance on some or all instances, because such configurations can be expected to be encountered frequently. (For example, all configurations dominated by portfolio Pk−1 will have equal performance under performance metric mk.) It is also possible to exploit mk for computational gain when optimizing runtime (as we do in our experimental study below). Specifically, a run of s on some instance i ∈ I can be terminated during configuration once its runtime reaches portfolio Pk−1's runtime on i.
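The modified metric mk just described can be seen concretely in a small sketch. Everything below is invented for illustration: the candidate pool stands in for the configuration space searched by AC (and is searched exhaustively here), and the portfolio builder PB is idealized as a per-instance oracle; the real system uses FocusedILS and SATzilla instead.

```python
# Minimal sketch of the Hydra loop for a runtime metric (lower is better).
# 'candidates' stands in for the configuration space searched by AC; the
# portfolio builder PB is idealized as a per-instance oracle over S.
candidates = {
    "c1": [0.1, 100.0, 10.0],   # runtime on instances 0, 1, 2
    "c2": [100.0, 0.1, 10.0],
    "c3": [25.0, 25.0, 10.0],
}
instances = range(3)

def portfolio_runtime(S, i):
    # Idealized portfolio: per-instance best runtime over the solver set S.
    return min(candidates[s][i] for s in S)

S = []
for k in range(3):
    def m_k(c):
        # Modified metric: a candidate is only credited where it beats the
        # current portfolio; elsewhere it is scored with the portfolio's time.
        if not S:
            return sum(candidates[c][i] for i in instances)
        return sum(min(candidates[c][i], portfolio_runtime(S, i))
                   for i in instances)
    best = min(candidates, key=m_k)
    if best in S:
        break  # no candidate improves the portfolio; stop early
    S.append(best)

print(S)  # ['c3', 'c1', 'c2']: the average-case winner first, then complements
```

Note that c3 (the Z-like average-case winner) is dominated once c1 and c2 are both in the set; in the full method, the portfolio builder PB may then drop it, as discussed above.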
(Refer to the analogous discussion of capping in algorithm configuration by Hutter et al. (2009).)

Procedure Hydra(s, I, AC, m, PB)
  Input: parametric solver s; instance set I; algorithm configurator AC;
         performance metric m; portfolio builder PB
  Output: portfolio Pk

  k := 1; m1 := m;
  obtain a solver s1 by running configurator AC on parametric solver s and
    instance set I with performance metric m1;
  measure the performance of s1 on all instances in I, using performance metric m;
  let P1 be a portfolio that always selects s1;
  let S1 := {s1};
  while termination condition not satisfied do
    k := k + 1;
    define performance metric mk as the better of the performance of the solver
      being assessed and the performance of portfolio Pk−1, both measured using
      performance metric m;
    obtain a new solver sk by running configurator AC on parametric solver s and
      instance set I with performance metric mk;
    measure the performance of sk on all instances in I, using performance metric m;
    Sk := Sk−1 ∪ {sk};
    obtain a new portfolio Pk by running portfolio builder PB on Sk;
  return Pk

It should be noted that Hydra need not be started from an empty set of algorithms, nor need it consider only one parameterized algorithm. For example, it is straightforward to initialize S with existing state-of-the-art algorithms before running Hydra, or to optimize across multiple parameterized algorithms.

10.2 Hydra for SAT

Hydra offers the greatest potential benefit in domains where only one highly parameterized algorithm is competitive (e.g., certain distributions of mixed-integer programming problems), and the least potential benefit in domains where a wide variety of strong, uncorrelated solvers already exist. Nevertheless, we chose to evaluate Hydra on SAT (possibly the most extreme example of the latter category), effectively building a SATzilla of SATensteins. We did so for several reasons. Most of all, to demonstrate the usefulness of the approach, we considered it important to work on a problem for which the state of the art is known to be very strong.
SLS-based SAT algorithms have been the subject of a large and sustained research effort over the past two decades, and the success of SATzilla demonstrates that existing SAT algorithms can be combined to form very strong portfolios. The bar for success is thus set extremely high in this domain. Further, studying SLS for SAT also offered several pragmatic benefits: a wide variety of data sets exist and are agreed to be interesting; effective instance-based features are available; and SATenstein is a suitable configuration target. Finally, because SAT is an important problem, even small improvements are significant.

10.2.1 Experimental Setup

We chose inputs for Hydra to facilitate comparisons with past work, setting s, I, AC, and m as in Chapter 9, and taking PB from Chapter 7. Inputs s, I and m define the application context in which Hydra is run. In contrast, AC and PB are generic components; we chose these "off the shelf" and made no attempt to modify them to achieve domain-specific performance improvements. We do not expect that an end user would have to vary them either.

Parametric Solver: SATenstein-LS. As our parametric solver s, we chose SATenstein-LS, a generalized, highly parameterized stochastic local search (SLS) framework (Chapter 9).

Instances. We investigated the effectiveness of Hydra on four distributions, drawing on well-known families of SAT instances. As no state-of-the-art SLS algorithms are able to prove unsatisfiability, we considered only satisfiable instances. We identified these by running all complete algorithms that had won a SAT Competition category between 2002 and 2007 for one hour. First, the BM data set is constructed from 500 instances taken from each of the six distributions used in Chapter 9 (QCP, SWGCP, FACT, CBMC, R3FIX, and HGEN), split evenly into training and test sets. Second, the INDULIKE data set is a mixture of 500 instances from each of the CBMC and FACT distributions, again split evenly into training and test sets.
Third and fourth, the HAND and RAND data sets include all satisfiable instances from the Random and Handmade categories of the SAT Competitions held between 2002 and 2007; we split the data 1141:571 and 342:171 into training and test sets, respectively.

Algorithm Configurator: FocusedILS. As our algorithm configurator AC, we chose the FocusedILS procedure from the ParamILS framework, version 2.3 [96]. ParamILS is able to deal with extremely large configuration spaces such as SATenstein-LS's, and indeed was the method used to identify the high-performing SATenstein-LS configurations mentioned previously. FocusedILS compares a new configuration with an incumbent by running on instances one at a time, and rejects the new configuration as soon as it yields weakly worse overall performance on the set of instances than the incumbent. As we expect many ties under Hydra's modified performance measures mk, particularly in later iterations, we changed this mechanism so that new configurations are rejected only once they yield strictly worse overall performance. We also modified FocusedILS to cap all runs at the corresponding runtime of the portfolio Pk−1, as discussed previously.

Performance Metric: PAR. We followed the approach of Chapter 9, capping runs at 5 seconds and setting our performance metric m to be Penalized Average Runtime–10 (PAR-10). We performed 10 independent FocusedILS runs on training data with different instance orderings and with a one-day time bound. We retained the parameter configuration that yielded the best PAR score on training data.

Portfolio Builder: SATzilla. As our portfolio builder PB, we used the SATzilla framework [210], which is based on empirical hardness models. We computed the same set of features as in Figure 3.1.
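The PAR-10 metric used as m above can be stated in two lines: runs that hit the cutoff are counted at ten times the cutoff, and the result is the mean over the instance set. A small sketch with invented runtimes and the 5-second cutoff used in this chapter:

```python
def par10(runtimes, cutoff):
    # Penalized Average Runtime-10: runs that hit the cutoff count as
    # 10x the cutoff; the result is the mean over the instance set.
    return sum(t if t < cutoff else 10 * cutoff for t in runtimes) / len(runtimes)

# Invented example: three solved instances and one run that timed out
# at the 5-second cutoff.
runs = [0.2, 1.0, 4.8, 5.0]
print(par10(runs, 5.0))  # (0.2 + 1.0 + 4.8 + 50) / 4 = 14.0
```

The heavy penalty on timeouts is what makes PAR-10 sensitive to unsolved instances, which matters for the solved-percentage results reported below.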
For BM and INDULIKE, we only used 40 very efficiently computable features (with an average feature computation time of 0.04 seconds in both cases), since initial, exploratory experiments revealed that Hydra could achieve performance on the order of seconds on these data sets. For the same reason, we also reduced the time allowed for subset selection on these distributions by a factor of 10, allowing time budgets taken from {0s, 0.2s, 0.5s, 1s}. For RAND and HAND, we used all features except the most expensive ones (LP-based and clause-graph-based features); the average feature computation times were 4.2 seconds and 4.9 seconds, respectively.

Challengers. As previously explained, one reason that we studied SLS for SAT is that a wide variety of strong solvers exist for this domain. In particular, we identified 17 such algorithms, which we dubbed "challengers." Following Chapter 9, we included all 7 SLS algorithms that won a medal in any of the SAT Competitions between 2002 and 2007, and also 5 additional prominent high-performance algorithms. We also included the 6 SATenstein-LS configurations introduced in Chapter 9. While in some sense this set a high standard for Hydra (it had to compete against strong configurations of its own parametric solver), we included these configurations because they were shown to outperform the previous state of the art.

Experimental Environment. We collected training data and performed ParamILS runs on two different compute clusters. The first had 55 dual 3.2GHz Intel Xeon machines with 2MB cache and 2GB RAM, running OpenSuSE Linux 11.1; the second cluster, from WestGrid, had 384 dual 3.0GHz Intel Xeon E5450 quad-core machines with 16GB of RAM, running Red Hat Linux 4.1.2. Although the use of different machines added noise to the runtime observations in our training data, it had to be undertaken to leverage additional computational resources.
To ensure that our results were meaningful, we gathered all test data using only the first cluster; all results reported in this chapter were collected using this data, and the data was used for no other purpose. Runtime was always measured as CPU time.

10.2.2 Experimental Results

To establish a baseline for our empirical evaluation, we first ran all 17 challenger algorithms on each of our test sets. The best-performing challengers are identified in the third column of Table 10.1, and their PAR-10 scores are shown in the fourth column. We also report the percentages of instances that each algorithm solved.

Dataset   | Metric     | Best Challenger     | Chall. Perf. | Portf. 11-Chall. | Portf. 17-Chall. | Hydra[D,1] | Hydra[D,7]
BM        | PAR Score  | SATenstein-LS[FACT] | 224.53       | 54.04            | 3.06             | 249.44     | 3.06
          | Solved (%) |                     | 96.4         | 99.3             | 100              | 96.0       | 100
INDULIKE  | PAR Score  | SATenstein-LS[CBMC] | 11.89        | 135.84           | 7.74             | 33.49      | 7.77
          | Solved (%) |                     | 100          | 98.1             | 100              | 100        | 100
RAND      | PAR Score  | gNovelty+           | 1128.63      | 897.37           | 813.72           | 1166.66    | 631.35
          | Solved (%) |                     | 81.6         | 85.5             | 86.9             | 80.8       | 89.8
HAND      | PAR Score  | adaptG2WSAT+        | 2960.39      | 2670.22          | 2597.71          | 2915.22    | 2495.06
          | Solved (%) |                     | 50.9         | 55.8             | 56.9             | 51.7       | 58.7

Table 10.1: Performance comparison between Hydra, SATenstein-LS, challengers, and portfolios based on 11 (without the 6 SATenstein-LS solvers) and 17 (with the 6 SATenstein-LS solvers) challengers. All results are based on 3 runs per algorithm and instance; an algorithm solves an instance if its median runtime on that instance is below the given cutoff time.

We then used SATzilla to automatically construct portfolios, first from the 11 manually crafted challenger algorithms, and then from the full set of 17 challengers that also included the 6 SATenstein-LS solvers. As can be seen from column 6 of Table 10.1, the latter portfolios perform much better than the best individual challenger, and the same holds for the former, more limited portfolios (column 5) as compared to the best of their 11 handcrafted component solvers.
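The evaluation protocol noted beneath Table 10.1 (three runs per algorithm and instance; an instance counts as solved if its median runtime lies below the cutoff) can be sketched as follows (names are ours):

```python
from statistics import median

def par10_and_solved(runs_per_instance, cutoff):
    """Evaluation protocol in the spirit of Table 10.1: for each instance,
    take the median of the (three) observed runtimes; the instance counts
    as solved if the median is below the cutoff, and PAR-10 charges
    10 * cutoff for unsolved instances."""
    medians = [median(runs) for runs in runs_per_instance]
    penalized = [m if m < cutoff else 10 * cutoff for m in medians]
    par10 = sum(penalized) / len(penalized)
    solved_pct = 100.0 * sum(m < cutoff for m in medians) / len(medians)
    return par10, solved_pct

# Toy example: two instances, the second one unsolved at a 5-second cutoff.
score, pct = par10_and_solved([[1.0, 2.0, 3.0], [9.0, 10.0, 11.0]], cutoff=5.0)
```

Using the median of three runs reduces the influence of the randomized solvers' runtime variance on the reported scores.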
As one would expect, the performance gain was particularly marked for instance set BM, which is highly heterogeneous. In all cases, the inclusion of the 6 SATenstein-LS solvers, which were derived by automatic configuration on the six instance distributions considered by [115], led to improved performance. While this was expected for BM and INDULIKE, which are combinations of the instance distributions for which the 6 SATenstein-LS solvers were built, we were more surprised to observe the same qualitative effect for RAND and HAND.

Column 7 of Table 10.1 shows the performance of the single SATenstein-LS configuration that was obtained in the initial phase of Hydra. Comparing these results to those of the portfolio obtained after 7 iterations (column 8), we confirm that Hydra is able to automatically configure solvers to work well as components of a portfolio. Furthermore, in all cases the Hydra portfolio outperformed the portfolio of 11 challengers. The Hydra portfolio outperformed the portfolio of 17 challengers on RAND and HAND, and effectively tied with it on BM and INDULIKE. Note that these latter distributions are those for which the SATenstein-LS solvers were specifically built; indeed, we found that the 17-challenger portfolios relied very heavily on these solvers. Furthermore, we note that Chapter 9 devoted about 240 CPU days to the construction of the 6 SATenstein-LS solvers, while the construction of the entire Hydra[D,7] portfolio required only about 70 CPU days.

Overall, recall that the success of the challenger-based portfolios depends critically upon the availability of domain knowledge in the form of very strong solvers (some handcrafted, such as 11 of the challengers, and some constructed automatically based on clearly delineated instance distributions, such as the 6 SATenstein-LS solvers).
In contrast, Hydra always achieved equivalent or significantly better performance without relying on such domain knowledge.

Figure 10.1 shows the PAR-10 performance improvements achieved in each Hydra iteration, considering both training and test data for BM and INDULIKE. In all cases, test performance closely resembled training performance. Hydra's test performance improved monotonically from one iteration to the next. Furthermore, on BM, HAND and RAND, Hydra achieved better performance than the best challenger after at most two iterations. On INDULIKE, Hydra took five iterations to outperform the best challenger, SATenstein-LS[CBMC]. While this may appear surprising, considering that the latter is itself a configuration of SATenstein-LS, it can be attributed to the fact that each Hydra iteration was allotted much less CPU time than the construction of SATenstein-LS[CBMC].

Figure 10.2 compares the test-set performance of Hydra[D,1] and Hydra[D,7] for BM and INDULIKE. (The plots for HAND and RAND are not shown here, but resemble the BM plot.) Note that Hydra[D,7] is substantially stronger than Hydra[D,1], particularly on hard instances. The fact that Hydra[D,1] on occasion outperforms Hydra[D,7] is due to the feature-based selection not always choosing the best solver from the given portfolio, and to the fact that the algorithms are randomized.

Table 10.2 shows, over each of the 7 iterations, the fraction of training instances solved by each Hydra portfolio component. By construction, a total of k solvers is available in each stage k. Note that solver subset selection does lead Hydra to exclude solvers from the portfolio; this happens on RAND, for example, where the third solver was dropped in iteration 7. Another interesting effect can be observed in iteration 3 on INDULIKE, where the second solver was effectively replaced by the third, whose instance share is marginally higher.
Figure 10.1: Hydra's performance progress after each iteration, for BM (left) and INDULIKE (right). Performance is shown in terms of PAR-10 score; the vertical lines represent the best challenger's performance for each data set. [Plots omitted: Hydra[BM] and Hydra[INDULIKE] training/test curves versus SATenstein-LS[FACT] and SATenstein-LS[CBMC].]

Had we allowed the algorithm configurator to run longer in iteration 2, it would eventually have found this latter solver. The fact that it was found in the subsequent iteration illustrates Hydra's ability to recover from an insufficient allocation of runtime to the algorithm configurator. A similar phenomenon occurred in iterations 6 and 7 on INDULIKE. The solver found in iteration 6 turned out not to be useful at all, and was therefore dropped immediately; in the next round of algorithm configuration, a useful solver was found. (However, we see in Figure 10.1 that the overall benefit derived from using this latter solver turned out to be quite small.) Finally, we note that for all four distributions, the Hydra[D,7] portfolios consisted of at least 5 solvers, each of which was executed on between 6.8% and 41.8% of the instances. This indicates that the individual solvers constructed by Hydra indeed worked well on sizeable subsets of our instance sets, without the explicit use of problem-dependent knowledge (such as instance features) for partitioning these sets.

10.3 Hydra for MIP

It is difficult to directly apply the original Hydra to the MIP domain, for two reasons.
Figure 10.2: Performance comparison between Hydra[D,7] and Hydra[D,1] on the test sets, for BM (left) and INDULIKE (right). Performance is shown in terms of PAR-10 score. [Scatter plots of per-instance PAR scores omitted.]

RAND:
     s1    s2    s3    s4    s5    s6    s7
P1   100   0     0     0     0     0     0
P2   45.1  54.9  0     0     0     0     0
P3   27.4  44.4  28.2  0     0     0     0
P4   18.1  31.0  21.6  29.4  0     0     0
P5   13.9  25.9  19.8  26.1  14.3  0     0
P6   12.5  22.9  16.8  23.2  13.2  11.5  0
P7   12.5  23.9  0     22.8  13.2  13.2  14.4

INDULIKE:
     s1    s2    s3    s4    s5    s6    s7
P1   100   0     0     0     0     0     0
P2   50.0  50.0  0     0     0     0     0
P3   49.0  0     51.0  0     0     0     0
P4   47.8  0     42.8  9.4   0     0     0
P5   35.8  0     42.6  9.4   12.2  0     0
P6   35.8  0     42.6  9.4   12.2  0     0
P7   31.2  0     41.8  9.2   11.0  0     6.8

Table 10.2: The percentage of instances for each solver chosen by algorithm selection at each iteration, for RAND (top) and INDULIKE (bottom). P_k and s_k are, respectively, the portfolio and the algorithm obtained in iteration k.

First, the data sets we deal with in MIP tend to be highly heterogeneous; preliminary prediction experiments showed that Hydra's linear regression models were not robust for such heterogeneous inputs, sometimes yielding extreme mispredictions of more than ten orders of magnitude. Second, individual Hydra iterations can take days to run, even on a large computer cluster, making it difficult for the method to converge within a reasonable amount of time. (We say that Hydra has converged when substantial increases in running time cease to lead to significant performance gains.)

For MIP, we proposed two major improvements to Hydra that address both of these issues. First, we modify the model-building method used by the algorithm selector, using a classification procedure based on decision forests with a non-uniform loss function. Second, we modify Hydra to add multiple solvers in each iteration and to reduce the cost of evaluating these candidate solvers, speeding up convergence.
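The first of these improvements, selection by pairwise cost-sensitive classification, can be sketched as follows (an illustrative reconstruction with our own names; in the actual system, the per-pair predictor is a cost-sensitive decision forest, and tie-breaking is handled as described in the experimental setup):

```python
from itertools import combinations

def pairwise_labels_and_costs(runtimes_i, runtimes_j):
    """Training data for one configuration pair (i, j): the label records
    which configuration was faster on each training instance, and the cost
    of misclassifying that instance is the runtime difference |t_i - t_j|."""
    labels = ["i" if ti <= tj else "j" for ti, tj in zip(runtimes_i, runtimes_j)]
    costs = [abs(ti - tj) for ti, tj in zip(runtimes_i, runtimes_j)]
    return labels, costs

def select_configuration(features, configs, pairwise_model):
    """Voting-based selection: for every pair of configurations, a learned
    model (an opaque callable here) predicts the faster one on the given
    instance; the configuration with the most votes wins (tie-breaking
    among equally voted configurations omitted)."""
    votes = {c: 0 for c in configs}
    for i, j in combinations(configs, 2):
        votes[pairwise_model(i, j, features)] += 1
    return max(configs, key=votes.get)
```

Weighting each training instance by the runtime difference means a misprediction on an instance where the two configurations perform nearly identically costs almost nothing, while a misprediction with a large performance gap is heavily penalized.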
We denote the original method as Hydra[LR,1] ("LR" stands for linear regression, and "1" indicates the number of configurations added to the portfolio per iteration), the new method including only our first improvement as Hydra[DF,1] ("DF" stands for decision forests), and the full new method as Hydra[DF,k].

• Cost-sensitive Decision Forests for Algorithm Selection: Since 2011, the new versions of SATzilla have been based on cost-sensitive classification models, in particular cost-sensitive decision forests (DFs). DFs offer the promise of effectively partitioning the feature space into qualitatively different parts, particularly for heterogeneous benchmark sets. In contrast to clustering methods, DFs take runtime into account when determining that partitioning. We therefore adopted this new approach in Hydra and construct a cost-sensitive DF for every pair of configurations (i, j) based on training data. The cost for a given instance n for configuration pair (i, j) is defined as the performance difference between i and j. For a test instance, we apply each DF to vote for the stronger configuration and select the configuration with the most votes as the best algorithm for that instance.

• Speeding Up Convergence: Hydra uses an automated algorithm configurator as a subroutine, which is called in every iteration to find a configuration that augments the current portfolio as well as possible. Since algorithm configuration is a hard problem, configuration procedures are incomplete and typically randomized. As a single run of a randomized configuration procedure may not yield a high-performing parameter configuration, it is common practice to perform multiple runs in parallel and to use the configuration that performs best on the training set [93, 96, 97, 212].

Here, we make two modifications to Hydra to speed up its convergence. First, in each iteration, we add k promising configurations to the portfolio, rather than just the single best.
If algorithm configuration runs were inexpensive, this modification to Hydra would not help: additional configurations could always be found in later iterations, if they indeed complemented the portfolio at that point. However, when each iteration must repeatedly solve many difficult MIP instances, it may be impossible to perform more than a small number of Hydra iterations within any reasonable amount of time, even when using a computer cluster. In such a case, when many good (and rather different) configurations are found in an iteration, it can be wasteful to retain only one of them.

Our second change to Hydra concerns the way that the 'best' configurations returned by different algorithm configuration runs are identified. Hydra[DF,1] determines the 'best' of the configurations found in a number of independent configurator runs by evaluating each configuration on the full training set and selecting the one with the best performance. This evaluation phase can be very costly: for example, if we use a cutoff time of 300 seconds per run during training and have 1 000 instances, then computing the training performance of each candidate configuration can take nearly four CPU days. Therefore, in Hydra[DF,k], we select the configuration for which the configuration procedure's internal estimate of the average performance improvement over the existing portfolio is largest. This alternative is computationally cheap: it does not require any evaluations of configurations beyond those already performed by the configurator. However, it is also potentially risky, as different configurator runs typically use the training instances in a different order and evaluate configurations using different numbers of instances. It is thus possible that the configurator's internal estimate of improvement for a parameter configuration is high, but that the configuration turns out not to help on instances the configurator has not yet used.
Fortunately, adding k parameter configurations to the portfolio in each iteration mitigates this problem: if each of the k selected configurations independently has probability p of being poor, the probability of all k configurations being poor is only p^k (e.g., for p = 0.2 and k = 4, p^k = 0.0016).

10.3.1 Experimental Setup

While the improvements to Hydra presented here were motivated by MIP, they can nevertheless be applied to any domain. The parameterized solver, instance benchmark and performance metric define the application context in which Hydra is run. In contrast, the automated algorithm configuration tool and portfolio builder are generic components.

CPLEX Parameters. Out of CPLEX 12.1's 135 parameters, we selected a subset of 74 parameters to be optimized. These are the same parameters considered in [97], minus two parameters governing the time spent on probing and solution polishing. (These led to problems when the captime used during parameter optimization was different from that used at test time.) We were careful to keep fixed all parameters that change the problem formulation (e.g., parameters such as the optimality gap below which a solution is considered optimal). The 74 parameters we selected affect all aspects of CPLEX. They include 12 preprocessing parameters; 17 MIP strategy parameters; 11 parameters controlling how aggressively to use which types of cuts; 8 MIP "limits" parameters; 10 simplex parameters; 6 barrier optimization parameters; and 10 further parameters. Most parameters have an "automatic" option as one of their values. We allowed this value, but also included other values (all other values for categorical parameters, and a range of values for numerical parameters). Taking into account that 4 parameters were conditional on others taking certain values, this gave rise to 4.75 × 10^45 distinct parameter configurations.

MIP Benchmark Sets. Our goal was to obtain a MIP solver that works well on heterogeneous data.
Thus, we selected four heterogeneous sets of MIP benchmark instances, composed of many well-studied MIP instances. They range from a relatively simple combination of two homogeneous subsets (CL∪REG) to heterogeneous sets using instances from many sources (e.g., MIX). While previous work in automated portfolio construction for MIP [111] has only considered very easy instances (ISAC(new), with a mean CPLEX default runtime below 4 seconds), our three new benchmark sets are much more realistic, with CPLEX default runtimes ranging from seconds to hours. We split these instances 50:50 into training and test sets, except for ISAC(new), where we divided the 276 instances into a new training set of 184 and a test set of 92 instances. Due to the small size of this data set, we performed the split in a stratified fashion, first ordering the instances based on CPLEX default runtime and then picking every third instance for the test set.

We used all 143 MIP features from Figure 3.2, including 101 features from [90, 111, 130] and 42 new probing features. In our feature computation, we used a 5-second cutoff for computing probing features. We omitted these probing features (only) for the very easy ISAC(new) benchmark set.

Algorithm Configurator: FocusedILS. For algorithm configuration we used ParamILS version 2.3.4 with its default instantiation of FocusedILS with adaptive capping [96]. We always executed 25 parallel configuration runs with different random seeds and a 2-day cutoff. (Running times were always measured as CPU time.) During configuration, the captime for each CPLEX run was set to 300 seconds, and the performance metric was penalized average runtime (PAR-10). For testing, we used a cutoff time of 3 600 seconds.

Portfolio Builder: SATzilla based on cost-sensitive decision forests. We used the Matlab R2010a implementation of cost-sensitive decision trees described in Chapter 7. For any pair of algorithms (i, j), a cost-sensitive decision forest was built with 99 trees.
Therefore, i receives a vote from j if more than 44 trees predict it to be "better". The algorithm with the most votes is selected as the best algorithm for a given instance. Ties are broken by counting only the votes from those decision forests that involve the algorithms that received equal votes; further ties are broken randomly.

Experimental Environment. All of our experiments were performed on a cluster of 55 dual 3.2GHz Intel Xeon PCs with 2MB cache and 2GB RAM, running OpenSuSE Linux 11.1.

Computational Cost. The total running time of the various Hydra procedures was often dominated by the time required for running the configurator, and therefore turned out to be approximately proportional to the number of Hydra iterations performed. Each iteration required 50 CPU days for algorithm configuration, as well as validation time to (1) select the best configuration in each iteration (only for Hydra[LR,1] and Hydra[DF,1]); and (2) gather performance data for the selected configurations. Since Hydra[DF,4] selects 4 solvers in each iteration, it has to gather performance data for 3 additional solvers per iteration (using the same captime of 3 600 seconds), which roughly offsets its savings due to skipping the validation step. Using the format (Hydra[DF,1], Hydra[DF,4]), the overall runtime requirements in CPU days were as follows: (366, 356) for CL∪REG; (485, 422) for CL∪REG∪RCW; (256, 263) for ISAC(new); and (274, 269) for MIX. Thus, the computational cost for each iteration of Hydra[LR,1] and Hydra[DF,1] was similar.

10.3.2 Experimental Results

We evaluated our full Hydra[DF,4] approach for MIP; on all four MIP benchmarks, we compared it to Hydra[DF,1], to the best configuration found by ParamILS, and to the CPLEX default. For ISAC(new) and MIX we also assessed Hydra[LR,1]. We did not do so for CL∪REG and CL∪REG∪RCW, because they are relatively simple and we expected the DF and LR models to perform almost identically.

Table 10.3 presents these results.
First, comparing Hydra[DF,4] to ParamILS alone and to the CPLEX default, we observed that Hydra[DF,4] achieved dramatically better performance, yielding between 2.52-fold and 8.83-fold speedups over the CPLEX default and between 1.35-fold and 2.79-fold speedups over the configuration optimized with ParamILS in terms of average runtime. Note that (likely due to the heterogeneity of the data sets) the built-in CPLEX self-tuning tool was unable to find any configuration better than the default for any of our four data sets. Compared to Hydra[LR,1], Hydra[DF,4] yielded a 1.3-fold speedup for ISAC(new) and a 1.5-fold speedup for MIX. Hydra[DF,4] also typically performed better than our intermediate procedure Hydra[DF,1], with speedup factors up to 1.21 (ISAC(new)). However, somewhat surprisingly, it actually performed worse for one distribution, CL∪REG∪RCW. We analyzed this case further and found that in Hydra[DF,4], after iteration three, ParamILS did not find any configurations that would further improve the portfolio, even with a perfect algorithm selector. This poor ParamILS performance could be explained by the fact that Hydra's dynamic performance metric only rewarded configurations that made progress on solving some instances better; almost certainly starting in a poor region of configuration space, ParamILS did not find configurations that made progress on any instances over the already strong portfolio, and thus lacked guidance towards better regions of configuration space. We believed that this problem could be addressed by means of better configuration procedures in the future.

DataSet      Solver            Train (cross valid.)     Test
                               Time   PAR (Solved)      Time   PAR (Solved)
CL∪REG       Default           424    1687 (96.7%)      424    1493 (96.7%)
             ParamILS          145    339  (99.4%)      134    296  (99.5%)
             Hydra[DF,1]       64     97   (99.9%)      63     63   (100%)
             Hydra[DF,4]       42     42   (100%)       48     48   (100%)
             MIPzilla          40     40   (100%)       39     39   (100%)
             Oracle(MIPzilla)  33     33   (100%)       33     33   (100%)
CL∪REG∪RCW   Default           405    1532 (96.5%)      406    1424 (96.9%)
             ParamILS          148    148  (100%)       151    151  (100%)
             Hydra[DF,1]       89     89   (100%)       95     95   (100%)
             Hydra[DF,4]       106    106  (100%)       112    112  (100%)
             MIPzilla          99     99   (100%)       99     99   (100%)
             Oracle(MIPzilla)  89     89   (100%)       89     89   (100%)
ISAC(new)    Default           3.98   3.98 (100%)       3.77   3.77 (100%)
             ParamILS          2.06   2.06 (100%)       2.13   2.13 (100%)
             Hydra[LR,1]       1.67   1.67 (100%)       1.52   1.52 (100%)
             Hydra[DF,1]       1.2    1.2  (100%)       1.42   1.42 (100%)
             Hydra[DF,4]       1.05   1.05 (100%)       1.17   1.17 (100%)
             MIPzilla          2.19   2.19 (100%)       2.00   2.00 (100%)
             Oracle(MIPzilla)  1.83   1.83 (100%)       1.81   1.81 (100%)
MIX          Default           182    992  (97.5%)      156    387  (99.3%)
             ParamILS          139    717  (98.2%)      126    357  (99.3%)
             Hydra[LR,1]       74     74   (100%)       90     205  (99.6%)
             Hydra[DF,1]       60     60   (100%)       65     181  (99.6%)
             Hydra[DF,4]       53     53   (100%)       62     177  (99.6%)
             MIPzilla          48     48   (100%)       48     164  (99.6%)
             Oracle(MIPzilla)  34     34   (100%)       39     155  (99.6%)

Table 10.3: Performance (average runtime and PAR in seconds, and percentage solved) of Hydra[DF,4], Hydra[DF,1] and Hydra[LR,1] after 5 iterations.

Figure 10.3 shows the test performance that the different Hydra versions achieved as a function of their number of iterations, as well as the performance of the MIPzilla portfolios we built manually. When building these MIPzilla portfolios for CL∪REG, CL∪REG∪RCW, and MIX, we exploited ground-truth knowledge about the constituent subsets of instances, using a configuration optimized specifically for each of these subsets. As a result, these portfolios yielded very strong performance. Although our various Hydra versions did not have access to this ground-truth knowledge, they still roughly matched MIPzilla's performance (indeed, Hydra[DF,1] outperformed MIPzilla on CL∪REG).
Figure 10.3: Performance per iteration for Hydra[DF,4], Hydra[DF,1] and Hydra[LR,1], evaluated on test data. [Plots omitted: PAR score versus number of Hydra iterations for (a) CL∪REG, (b) CL∪REG∪RCW, (c) ISAC(new) and (d) MIX, each also showing MIPzilla and Oracle(MIPzilla).]

For ISAC(new), our baseline MIPzilla portfolio used CPLEX configurations obtained by ISAC [111]; all Hydra versions clearly outperformed MIPzilla in this case, which suggests that its constituent configurations are suboptimal. For ISAC(new), we observed that for (only) the first three iterations, Hydra[LR,1] outperformed Hydra[DF,1]. We believed that this occurred because in later iterations the portfolio had stronger solvers, making the predictive models more important. We also observed that Hydra[DF,4] consistently converged more quickly than Hydra[DF,1] and Hydra[LR,1]. While Hydra[DF,4] stagnated after three iterations for data set CL∪REG∪RCW (refer to our prior discussion), it achieved the best performance at every given point in time for the three other data sets. For ISAC(new), Hydra[DF,1] did not converge after 5 iterations, while Hydra[DF,4] converged after 4 iterations and achieved better performance. For the other three data sets, Hydra[DF,4] converged after two iterations. The performance of Hydra[DF,4] after the first iteration (i.e., with 4 candidate solvers available to the portfolio) was already very close to the performance of the best portfolios for MIX and CL∪REG.

We spent a tremendous amount of effort attempting to compare Hydra[DF,4] with ISAC [111], since ISAC is also a method for automatic portfolio construction and was previously applied to a distribution of MIP instances.
ISAC's authors supplied us with their training instances and the CPLEX configurations their method identified, but were generally unable to make their code available to other researchers and, as mentioned previously, were unable to recover their test data. We therefore compared Hydra[DF,4]'s and ISAC's relative speedups over the CPLEX default (thereby controlling for different machine architectures) on their training data. We note that Hydra[DF,4] was given only 2/3 as much training data as ISAC (due to the need to recover a test set from [111]'s original training set); that the methods were evaluated using only the original ISAC training set; that the data set is very small, and hence high-variance; and that all instances were quite easy even for the CPLEX default. In the end, Hydra[DF,4] achieved a 3.6-fold speedup over the CPLEX default, as compared to the 2.1-fold speedup reported in [111].

As shown in Figure 10.3, all versions of Hydra performed much better than a MIPzilla portfolio built from the configurations obtained from ISAC's authors for the ISAC(new) dataset. In fact, even a perfect oracle over these configurations only achieved an average runtime of 1.82 seconds, which is a factor of 1.67 slower than Hydra[DF,4].

10.4 Conclusions

In this chapter, we introduced Hydra, a new automatic algorithm design approach that combines portfolio-based algorithm selection with automatic algorithm configuration. We applied Hydra to SAT, a particularly well-studied and challenging problem domain, producing high-performance portfolios based only on a single highly parameterized SLS algorithm, SATenstein-LS. Our experimental results on widely studied SAT instances showed that Hydra significantly outperformed 17 state-of-the-art SLS algorithms. Hydra reached, and in two of four cases exceeded, the performance of portfolios that used all 17 challengers as candidate solvers, 6 of which had been configured automatically using domain knowledge about specific types of SAT instances.
At the same time, the total CPU time used by Hydra to reach this performance level for each distribution was less than a third of that used for configuring the 6 automatically configured challengers.

We also showed how to extend Hydra to achieve strong performance for heterogeneous MIP distributions, outperforming CPLEX's default, ParamILS alone, ISAC, and the original Hydra approach. This was accomplished by using a cost-sensitive classification model for algorithm selection (which also led to performance improvements in SATzilla), along with improvements to Hydra's convergence speed. We expect that Hydra[DF,k] can be further strengthened by using improved algorithm configurators, such as model-based procedures (e.g., SMAC [99]). Overall, the availability of effective procedures for constructing portfolio-based algorithm selectors, such as our new Hydra, should encourage the development of highly parameterized algorithms for other prominent NP-hard problems in AI, such as planning and CSP.

Chapter 11

Conclusion and Future Work

Computationally hard problems play a key role in many practical applications, including formal verification, planning and scheduling, and resource allocation and management. Even though, theoretically, no worst-case polynomial-time algorithm is known for such problems, heuristic methods are able to solve large problem instances effectively in practice. However, designing a high-performance heuristic solver is not an easy task. Traditionally, algorithm developers manually explore combinations of algorithmic components and empirically evaluate them on small benchmarks. Due to the nature of NP-completeness, it is often the case that the developed solver is only good on certain types of benchmarks.
Given a new benchmark, the whole difficult, tedious and time-consuming process needs to be repeated.

Motivated by an increasing demand for high-performance solvers for difficult combinatorial problems in practical applications, by the desire to reduce the human effort required for building such algorithms, and by an ever-increasing availability of cheap computing power that can be harnessed for automating parts of the algorithm design process, this thesis leveraged rigorous statistical methods to study such computationally hard problems. My work has already had substantial impact in many areas of research and development. The sets of instance characteristics we proposed have been widely used for obtaining insights into the hardness of problem instances and for designing algorithms to solve them (see, e.g., [111, 112, 140]). The statistical methods we constructed can characterize algorithm runtime (and even the satisfiability status of NP-complete decision problems) with high levels of confidence. My automated algorithm design approaches have led to substantial improvements in solving a range of hard computational problems.

11.1 Statistical Models of Instance Hardness and Algorithm Performance

Traditional notions of complexity cannot adequately explain the strong performance of heuristic algorithms; empirical methods are often the only practical means for assessing and comparing algorithms' performance. By adapting supervised machine learning techniques, we constructed statistical models that predict solvers' performance without actually running them.

Inspired by the success of Leyton-Brown et al. [127], who predicted an algorithm's runtime using so-called empirical hardness models (EHMs), my work made significant advances in improving prediction accuracy. Firstly, we extended the feature set that characterizes propositional satisfiability (SAT) instances.
In addition, we introduced features for other types of NP-complete problems, such as the travelling salesman problem and the mixed integer programming problem (Chapter 3). Such features proved to be informative and have been widely adopted by other research groups [111]. Secondly, we used new statistical models (e.g., Gaussian processes, random forests [101], hierarchical hardness models [208]), substantially improving prediction accuracy over previous linear regression models (Chapter 5, Chapter 6). Thirdly, we showed that EHMs are not limited to predicting the runtime of an algorithm, and can instead predict arbitrary user-defined measures of algorithm performance (e.g., penalized runtime or some arbitrary performance score). Finally, we showed that other statistical models (classifiers) can be constructed to predict the solutions of NP-complete decision problems (Chapter 4). My classifiers achieved high accuracy in predicting the satisfiability status of SAT instances ranging from randomly generated to industrial; the classifications were sufficiently accurate to help make better predictions of an algorithm's performance [208].

Overall, the extensive experimental study showed that the regression models achieved good, robust performance in predicting algorithms' performance. These models are useful for algorithm analysis, scheduling, algorithm portfolio construction, automated algorithm configuration, and other applications. The classification models were able to predict the satisfiability status of SAT instances with high accuracy, even for uniform-random 3-SAT instances from the solubility phase transition (Chapter 4).
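To illustrate how compact such satisfiability-status classifiers can be, consider the following deliberately tiny decision tree; both features and all thresholds here are invented for illustration (the actual features and models are described in Chapter 4):

```python
def predict_satisfiable(features):
    """A three-leaf decision tree for predicting satisfiability status, in
    the spirit of the classifiers discussed above. The two feature names
    and the thresholds are hypothetical, chosen only for this sketch."""
    if features["clauses_per_var"] <= 4.26:  # under-constrained region
        return True
    if features["fraction_positive_literals"] >= 0.5:
        return True
    return False
```

A tree of this size has only two internal tests, so its decisions are trivially interpretable, which is part of what makes such results interesting.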
Further study on this benchmark showed that one could build a three-leaf decision tree, trained on very small instances and using only two features, that achieved good classification accuracy across a wide range of instance sizes.

11.2 Portfolio-based Algorithm Selection

Heuristic algorithms are capable of handling very large instances, but they often perform well only on certain types of instances. Practitioners therefore confront a potentially difficult algorithm selection problem: given a new instance, which algorithm should be used in order to optimize some performance objective? We developed a portfolio-based algorithm selector, SATzilla, which solves the algorithm selection problem automatically, based on the statistical models introduced in the previous section. By exploiting the complementary strengths of different SAT solvers, SATzilla won more than 10 medals in the 2007 and 2009 SAT Competitions [124] and the 2012 SAT Challenge [13]. It encouraged the development of many other portfolio-based solvers, and represented the state of the art in SAT solving for many years.

The goal of algorithm selection is to find a mapping from instances to algorithms that optimizes some performance metric. SATzilla approaches this problem by using EHMs to predict the solvers' performance. For a new instance, it first predicts the performance of each solver based on instance features, then picks the one that is predicted to be best. Several new techniques were introduced in Chapter 7 to improve the robustness of SATzilla, such as pre-solvers, backup solvers, and solver subset selection [209, 210]. These components played important roles in the success of SATzilla and became standard techniques widely used in constructing portfolio-based algorithm selectors. Recently, we improved SATzilla's performance even further by adopting a new cost-sensitive classification technique that punishes misclassification in direct proportion to its impact on portfolio performance (Chapter 7).
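The selection scheme just described, predicting each solver's performance and picking the predicted best, can be sketched as follows (a minimal illustration with our own names; each per-solver callable stands in for an EHM):

```python
def satzilla_select(features, models):
    """Model-based algorithm selection: each solver's empirical hardness
    model (here an opaque prediction function) maps instance features to a
    predicted runtime, and the solver with the lowest prediction is chosen.
    `models` maps solver name -> prediction function."""
    return min(models, key=lambda solver: models[solver](features))
```

In the full system, this core step is preceded by pre-solving and guarded by a backup solver, so that a feature-computation failure or a gross misprediction does not leave the instance unsolved.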
The new SATzilla won the 2012 SAT Challenge.

The general framework of SATzilla can be applied to any problem domain. In Chapter 10.3, we showed that MIPzilla (SATzilla for MIP) also achieved state-of-the-art performance in solving the mixed integer programming problem.

In addition to its state-of-the-art performance, portfolio-based algorithm selection is useful for evaluating solver contributions. By omitting a solver from the portfolio, one can measure the contribution of this solver by computing SATzilla's performance difference with and without it. In Chapter 8, we showed that solvers that exploit novel strategies proved more valuable than those that exhibited the best overall performance.

11.3 Automatically Building High-performance Algorithms from Components

Designing high-performance heuristic algorithms for solving NP-complete problems is a time-consuming task, even for domain experts. The resulting algorithm may not meet users' requirements due to 1) the absence of the optimal combination of heuristics and 2) a lack of knowledge of the user's benchmarks or performance metric. We addressed the first problem by designing a generalized, highly parameterized algorithm incorporating many different promising heuristics, whose choices and behaviors are controlled by a large set of parameters. This approach removes the burden of making early design choices without knowing the interaction of multiple heuristics, and encourages the designer to consider many alternative designs from existing algorithms in addition to novel mechanisms. To better meet the requirements of end users, we used automated algorithm configuration tools [96] that optimized solver parameters given any benchmark and performance metric.

This general approach can be applied to many domains; its effectiveness was demonstrated on the domain of SAT (Chapter 9).
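The configuration step can be illustrated with a deliberately simple sketch: random search over a small parameter space, minimizing a user-supplied performance metric. The tools cited in this chapter (e.g., ParamILS) use far more sophisticated search; the parameter space and metric below are hypothetical.

```python
# Hypothetical sketch of automated algorithm configuration: sample
# configurations from a parameter space and keep the one with the best
# score under the user's performance metric (e.g., mean penalized
# runtime on a benchmark). Parameter names and values are invented.
import random

PARAM_SPACE = {
    "noise": [0.1, 0.2, 0.3, 0.4],  # hypothetical noise parameter
    "restart": [100, 1000, 10000],  # hypothetical restart interval
}

def configure(evaluate, n_samples=50, seed=0):
    """Random-search configurator: return (best_config, best_score)."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_samples):
        config = {p: rng.choice(vals) for p, vals in PARAM_SPACE.items()}
        score = evaluate(config)  # lower is better
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Toy metric: pretend runtime is minimized at noise=0.3, restart=1000.
toy_metric = lambda c: abs(c["noise"] - 0.3) + abs(c["restart"] - 1000) / 1000
best, score = configure(toy_metric)
print(best)
```

The essential interface is the same as in the real tools: the configurator treats the target algorithm as a black box and only needs a way to evaluate a configuration under the user's metric.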
By taking components from 25 local search algorithms, we built a highly parameterized local search algorithm, SATenstein, that could be instantiated as 2.01 × 10^14 different solvers. Most of these instantiations had never been previously studied. Given a benchmark and a performance metric, we applied a black-box automated algorithm configurator to optimize SATenstein's parameters. Empirical evidence demonstrated that the automatically constructed SATenstein outperformed existing state-of-the-art solvers, with both manually and automatically tuned configurations, on several widely studied SAT benchmarks. In addition, we proposed a new data structure, the concept DAG, to represent parameter configurations, and defined a novel similarity metric for comparing different configurations based on the transformation cost between concept DAGs. The visualization of this similarity measure provided useful insights into algorithm design. For example, contrary to common belief (see, e.g., [80, 135]), it is preferable to expose parameters so that they can be instantiated by automated configurators, rather than adjusting them at run time using a simple adaptive mechanism.

11.4 Automatically Configuring Algorithms for Portfolio-Based Selection

In a problem domain with only one or a few high-performance parameterized algorithms, it would be difficult to construct a portfolio-based algorithm selector due to the small number of candidates. Automated algorithm configuration produces a single algorithm, which may achieve high performance overall but perform badly on many individual instances. To overcome these problems, we developed a new automated algorithm design approach, Hydra, that combines the strengths of algorithm selection and algorithm configuration, and achieved state-of-the-art performance on SAT (Chapter 10.2) and MIP (Chapter 10.3) with a single parameterized algorithm.

Once a portfolio exists for a domain, how should new research aim to improve upon it?
Hydra approaches this question by adapting the concept of boosting: it focuses on problems that are handled poorly by the current portfolio. Hydra is implemented by iteratively changing the performance metric given to the algorithm configurator; this metric emphasizes the potential marginal contribution of a new configuration to the existing portfolio. In each iteration, one algorithm (configuration) is added to the candidate algorithm set to improve the performance of the current portfolio. With a single parameterized solver, Hydra reached, and often exceeded, the performance of portfolios that used the state-of-the-art solvers as candidate solvers (including solvers configured using domain knowledge of specific types of instances). Later, in Chapter 10.3, we further improved Hydra using a more advanced algorithm selection technique, along with improvements to speed up convergence. The resulting HydraDF,k achieved strong performance on heterogeneous MIP distributions, outperforming CPLEX's default configuration, configurations from ParamILS alone, and ISAC.

Since Hydra requires only very little domain knowledge (one parameterized algorithm and an instance feature extractor), it is extremely attractive for new problem domains. It also encourages the development of highly parameterized algorithms for other prominent NP-complete problems in AI, such as planning and CSP.

11.5 Future Research Directions

I am particularly interested in studying and solving problems that have a substantial impact on society and industry, such as computational sustainability, bioinformatics, hardware and software verification, and other problems pertinent to industrial applications. I firmly believe that meta-algorithmic techniques are the future of solving hard computational problems in a broad range of areas. Therefore, I plan to continue developing new techniques and look forward to collaboration with domain experts in a variety of areas.
Evidenced by the successes on SAT and MIP, I believe that my research on automated algorithm design will generate state-of-the-art algorithms for problems in real-world applications.

Informative Features. One challenge in applying meta-algorithmic techniques to practical applications is to discover a set of features that is both easy to compute and well correlated with an algorithm's performance. Although our feature sets (for SAT, MIP, and TSP) have proven to be informative and efficient, more powerful features could be obtained by using heuristic information from new algorithms or intuition from domain experts. For example, the distribution of flip counts over variables (based on local search probing) could be used to indicate the number of local minima.

More Applications. Although my proposed automated algorithm design approaches have been applied to SAT and MIP, they can be applied to any computationally hard problem. I continually seek more applications, for example, building portfolio-based algorithm selectors to solve protein folding problems, or constructing highly parameterized complete solvers for SAT and optimizing them with automated algorithm configuration tools.

Integrating Strong Presolving Techniques. In SATzilla's presolving phase, two algorithms are run for a short time with the goal of solving easy instances. Recently, there have been many advances in optimizing solver schedules. 3S [112] introduced a new approach to presolving by constructing a fixed solver schedule using mixed integer programming; aspeed [77] used ASP to solve timeout-optimal scheduling. I believe that SATzilla's performance could be further improved by adopting such advanced techniques for optimizing presolving.

Explaining Instance Hardness.
In real applications, it would be important to know "What property makes some types of instances hard?" or "Which algorithm components are most important for achieving good performance on certain types of instances?" I plan to extend my work on EHMs and develop an automated procedure for the analysis of instances and algorithm components. This research will permit domain experts to gain insights into the functionality of, and interactions between, multiple heuristics. Our group has been actively working on this topic: Hutter et al. (2014) proposed efficient methods for gaining insight into the relative importance of different hyperparameters and their interactions for machine learning algorithms; Fawcett et al. (2014) studied the relative importance of features for domain-independent planners.

Parallel Algorithm Portfolios. Depending on which heuristics are used, different algorithms' performance can vary significantly on the same instance. Running multiple solvers in parallel can, on average, be beneficial. Therefore, I plan to extend sequential portfolio techniques to a parallel environment. When there is great uncertainty in algorithm selection, running more uncorrelated solvers in parallel can reduce the cost of picking a solver with a runtime much larger than that of the best solver. One recent approach [78] used an automatic algorithm configurator to produce a set of configurations to be executed in parallel.

Bibliography

[1] D. W. Aha. Generalizing from case studies: A case study. In Proceedings of the 9th International Conference on Machine Learning (ICML'92), pages 1–10. Morgan Kaufmann, 1992. → pages 14
[2] K. Ahmadizadeh, B. Dilkina, C. Gomes, and A. Sabharwal. An empirical study of optimization for maximizing diffusion in networks. In Proceedings of the 16th International Conference on Principles and Practice of Constraint Programming (CP'10), volume 6308 of LNCS, pages 514–521. Springer, 2010. → pages 34
[3] C. Ansótegui, M. Sellmann, and K. Tierney.
A gender-based genetic algorithm for the automatic configuration of algorithms. In Proceedings of the 15th International Conference on Principles and Practice of Constraint Programming (CP'09), volume 5732 of LNCS, pages 142–157. Springer, 2009. → pages 17
[4] D. Applegate, R. Bixby, V. Chvátal, and W. Cook. Finding tours in the TSP. Technical Report TR-99885, University of Bonn, 1999. → pages 35, 36
[5] D. L. Applegate, R. E. Bixby, V. Chvátal, and W. J. Cook. The Traveling Salesman Problem: A Computational Study. Princeton University Press, 2006. → pages 1, 35, 83
[6] B. Arinze, S.-L. Kim, and M. Anandarajan. Combining and selecting forecasting models using rule based induction. Computers and Operations Research, 24:423–433, 1997. → pages 14
[7] D. Babić and A. J. Hu. Structural abstraction of software verification conditions. In Proceedings of the 19th International Conference on Computer Aided Verification (CAV'07), volume 4590 of LNCS, pages 366–378. Springer, 2007. → pages 30
[8] D. Babić and F. Hutter. Spear theorem prover. Solver description, SAT competition 2007, http://www.domagoj-babic.com/uploads/Pubs/SAT08/sat08.pdf, Version last visited on April 29, 2014, 2007. → pages 82
[9] F. Bacchus. Enhancing Davis Putnam with extended binary clause reasoning. In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI'02), pages 613–619. AAAI Press, 2002. → pages 21
[10] F. Bacchus. Exploring the computational tradeoff of more reasoning and less searching. In Proceedings of the 5th International Conference on Theory and Applications of Satisfiability Testing (SAT'02), pages 7–16, 2002. → pages
[11] F. Bacchus and J. Winter. Effective preprocessing with hyper-resolution and equality reduction. In Proceedings of the 6th International Conference on Theory and Applications of Satisfiability Testing (SAT'03), volume 2919 of LNCS, pages 341–355. Springer, 2003. → pages 21
[12] P. Balaprakash, M. Birattari, and T. Stützle.
Improvement strategies for the F-race algorithm: Sampling design and iterative refinement. In Proceedings of the 4th International Conference on Hybrid Metaheuristics (HM'07), pages 108–122, 2007. → pages 17
[13] A. Balint, A. Belov, M. Järvisalo, and C. Sinz. The international SAT Challenge. http://baldur.iti.kit.edu/SAT-Challenge-2012/, 2012. Version last visited on August 6, 2012. → pages 2, 195
[14] T. Bartz-Beielstein. Experimental Research in Evolutionary Computation: The New Experimentalism. Natural Computing Series. Springer, 2006. → pages 75
[15] T. Bartz-Beielstein and S. Markon. Tuning search algorithms for real-world applications: A regression tree based approach. In Proceedings of the 2004 Congress on Evolutionary Computation (CEC'04), pages 1111–1118. IEEE Press, 2004. → pages 72, 77
[16] M. Berkelaar, J. Dirks, K. Eikland, P. Notebaert, and J. Ebert. lp_solve 5.5. http://lpsolve.sourceforge.net/5.5/index.htm, 2012. Version last visited on August 6, 2012. → pages 83
[17] T. Berthold, G. Gamrath, S. Heinz, M. Pfetsch, S. Vigerske, and K. Wolter. SCIP. http://scip.zib.de/doc/html/index.shtml, 2012. Version last visited on August 6, 2012. → pages 83
[18] A. Biere. Picosat version 535. Solver description, SAT competition 2007, http://www.satcompetition.org/2007/picosat.pdf, Version last visited on April 29, 2014, 2007. → pages 150
[19] A. Biere. Picosat essentials. Journal on Satisfiability, Boolean Modeling and Computation (JSAT), 4:75–97, 2008. → pages 102, 150
[20] A. Biere, A. Cimatti, E. M. Clarke, M. Fujita, and Y. Zhu. Symbolic model checking using SAT procedures instead of BDDs. In Proceedings of the 36th Annual ACM/IEEE Design Automation Conference (DAC'99), pages 317–320. ACM Press, 1999. → pages 1, 21
[21] M. Birattari. The Problem of Tuning Metaheuristics as Seen from a Machine Learning Perspective. PhD thesis, Université Libre de Bruxelles, Brussels, Belgium, 2004. → pages 17
[22] M. Birattari, T. Stützle, L. Paquete, and K. Varrentrapp.
A racing algorithm for configuring metaheuristics. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO'02), pages 11–18. Springer-Verlag, 2002. → pages 6, 7
[23] M. Birattari, Z. Yuan, P. Balaprakash, and T. Stützle. Empirical Methods for the Analysis of Optimization Algorithms, chapter F-race and iterated F-race: an overview, pages 311–336. Springer-Verlag, 2010. → pages 17
[24] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. → pages 58, 72
[25] D. R. Bregman and D. G. Mitchell. The SAT solver MXC, version 0.5. Solver description, SAT competition 2007, http://www.cs.sfu.ca/~mitchell/papers/MXC-Sat07-competition.pdf, Version last visited on April 29, 2014, 2007. → pages 102
[26] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. → pages 43, 79, 80
[27] L. Breiman, J. H. Friedman, R. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984. → pages 76, 77
[28] E. A. Brewer. Portable High-Performance Supercomputing: High-Level Platform-Dependent Optimization. PhD thesis, Massachusetts Institute of Technology, September 1994. → pages 72
[29] E. A. Brewer. High-level optimization via automated statistical modeling. In Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP-95), pages 80–91, 1995. → pages 12, 72
[30] C. M. Li, W. Wei, and H. Zhang. Combining adaptive noise and promising decreasing variables in local search for SAT. Solver description, SAT competition 2007, http://www.satcompetition.org/2007/adaptG2WSAT.pdf, Version last visited on April 29, 2014, 2007. → pages 102
[31] T. Carchrae and J. C. Beck. Applying machine learning to low knowledge control of optimization algorithms. Computational Intelligence, 21(4):373–387, 2005. → pages 13, 16
[32] P. Cheeseman, B. Kanefsky, and W. M. Taylor. Where the really hard problems are. In Proceedings of the 9th National Conference on Artificial Intelligence (AAAI'91), pages 331–337. AAAI Press, 1991. → pages 29, 41
[33] M.
Chiarandini, C. Fawcett, and H. Hoos. A modular multiphase heuristic solver for post enrolment course timetabling. In Proceedings of the 7th International Conference on the Practice and Theory of Automated Timetabling (PATAT'08), 2008. → pages 6, 7
[34] E. Clarke, D. Kroening, and F. Lerda. A tool for checking ANSI-C programs. In Proceedings of the 10th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS'04), pages 168–176. Springer, 2004. → pages 30, 146
[35] Concorde. Concorde TSP Solver, 2012. http://www.math.uwaterloo.ca/tsp/concorde/index.html. Version last visited on October 24, 2012. → pages 36
[36] S. A. Cook. The complexity of theorem proving procedures. In Proceedings of the 3rd Annual ACM Symposium on Theory of Computing, pages 151–158, 1971. → pages 1
[37] W. Cook. Concorde downloads page, 2012. http://www.tsp.gatech.edu/concorde/downloads/downloads.htm. Version last visited on October 24, 2012. → pages 38
[38] J. M. Crawford and L. D. Auton. Experimental results on the crossover point in random 3SAT. Artificial Intelligence, 81:31–35, 1996. → pages 41
[39] J. M. Crawford and A. B. Baker. Experimental results on the application of satisfiability algorithms to scheduling problems. In Proceedings of the 12th National Conference on Artificial Intelligence (AAAI'94), pages 1092–1097. AAAI Press, 1994. → pages 21
[40] M. Davis and H. Putnam. A computing procedure for quantification theory. Journal of the ACM, 7(1):201–215, 1960. → pages 21
[41] M. Davis, G. Logemann, and D. Loveland. A machine program for theorem proving. Communications of the ACM, 5(7):394–397, 1962. → pages 21
[42] R. Dechter and I. Rish. Directional resolution: The Davis-Putnam procedure, revisited. In Proceedings of the 4th International Conference on Principles of Knowledge Representation and Reasoning (KR'94), pages 134–145. Morgan Kaufmann, 1994. → pages 21
[43] G. Dequen and O. Dubois. kcnfs.
Solver description, SAT competition 2007, http://home.mis.u-picardie.fr/~dequen/sat/kcnfs-description--07.pdf, Version last visited on April 29, 2014, 2007. → pages 102
[44] O. Dubois and G. Dequen. A backbone-search heuristic for efficient solving of hard 3-SAT formulae. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI'01), pages 248–253. Morgan Kaufmann, 2001. → pages 21, 22, 42, 59, 101, 150
[45] N. Eén and A. Biere. Effective preprocessing in SAT through variable and clause elimination. In Proceedings of the 8th International Conference on Theory and Applications of Satisfiability Testing (SAT'05), volume 3569 of LNCS, pages 61–75. Springer, 2005. → pages 22, 26, 30, 59, 146
[46] N. Eén and N. Sörensson. An extensible SAT-solver. In Proceedings of the 7th International Conference on Theory and Applications of Satisfiability Testing (SAT'04), volume 2919 of LNCS, pages 502–518. Springer, 2004. → pages 21, 59, 82, 126
[47] N. Eén and N. Sörensson. Minisat v2.0 (beta). Solver description, SAT Race 2006, http://fmv.jku.at/sat-race-2006/descriptions/27-minisat2.pdf, Version last visited on April 29, 2014, 2006. → pages 101
[48] C. Fawcett, M. Vallati, F. Hutter, J. Hoffmann, H. Hoos, and K. Leyton-Brown. Improved features for runtime prediction of domain-independent planners. In International Conference on Automated Planning and Scheduling (ICAPS'14), to appear, 2014. → pages 199
[49] A. S. Fraenkel. Complexity of protein folding. Bulletin of Mathematical Biology, 55:1199–1210, 1993. → pages 1
[50] A. S. Fukunaga. Automated discovery of composite SAT variable-selection heuristics. In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI'02), pages 641–648. AAAI Press, 2002. → pages 16, 17
[51] M. Gagliolo and J. Schmidhuber. Learning dynamic algorithm portfolios. Annals of Mathematics and Artificial Intelligence, 47(3-4):295–328, 2006. → pages 14, 16
[52] M. Gagliolo and J. Schmidhuber.
Impact of censored sampling on the performance of restart strategies. In Proceedings of the 12th International Conference on Principles and Practice of Constraint Programming (CP'06), volume 4204 of LNCS, pages 167–181. Springer, 2006. → pages 95
[53] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, 1979. → pages 1
[54] L. D. Gaspero and A. Schaerf. EasySyn++: A tool for automatic synthesis of stochastic local search algorithms. In Proceedings of the International Conference on Engineering Stochastic Local Search Algorithms: Designing, Implementing and Analyzing Effective Heuristics (SLS'07), volume 4638 of LNCS, pages 177–181. Springer, 2007. → pages 16
[55] C. Gebruers, A. Guerri, B. Hnich, and M. Milano. Making choices using structure at the instance level within a case based reasoning framework. In Proceedings of the International Conference on Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems (CPAIOR'04), volume 3011 of LNCS, pages 380–386. Springer, 2004. → pages 14
[56] C. Gebruers, B. Hnich, D. Bridge, and E. Freuder. Using CBR to select solution strategies in constraint programming. In Proceedings of the 6th International Conference on Case Based Reasoning (ICCBR'05), volume 3620 of LNCS, pages 222–236. Springer, 2005. → pages 14
[57] I. P. Gent and T. Walsh. Towards an understanding of hill-climbing procedures for SAT. In Proceedings of the 11th National Conference on Artificial Intelligence (AAAI'93), pages 28–33. AAAI Press, 1993. → pages 24
[58] I. P. Gent, H. H. Hoos, P. Prosser, and T. Walsh. Morphing: Combining structure and randomness. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI'99), pages 654–660. AAAI Press, 1999. → pages 29, 146
[59] A. Gilpin and T. Sandholm. Information-theoretic approaches to branching in search. Discrete Optimization, 8(2):147–159, 2010. → pages 30
[60] C. P. Gomes and B. Selman.
Problem structure in the presence of perturbations. In Proceedings of the 14th National Conference on Artificial Intelligence (AAAI'97), pages 221–226. AAAI Press, 1997. → pages 29, 146
[61] C. P. Gomes and B. Selman. Algorithm portfolios. Artificial Intelligence, 126(1-2):43–62, 2001. → pages 13
[62] C. P. Gomes, W. van Hoeve, and A. Sabharwal. Connections in networks: A hybrid approach. In Proceedings of the International Conference on Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems (CPAIOR'08), volume 5015 of LNCS, pages 303–307. Springer, 2008. → pages 30, 33
[63] J. Gratch and S. A. Chien. Adaptive problem-solving for large-scale scheduling problems: A case study. Journal of Artificial Intelligence Research, 4:365–396, 1996. → pages 17
[64] J. Gratch and G. Dejong. COMPOSER: A probabilistic solution to the utility problem in speed-up learning. In Proceedings of the 10th National Conference on Artificial Intelligence (AAAI'92), pages 235–240. AAAI Press, 1992. → pages 16, 17
[65] A. Guerri and M. Milano. Learning techniques for automatic algorithm portfolio selection. In Proceedings of the 16th European Conference on Artificial Intelligence (ECAI'04), pages 475–479. IOS Press, 2004. → pages 14
[66] H. Guo and W. H. Hsu. A learning-based algorithm selection meta-reasoner for the real-time MPE problem. In Proceedings of the 17th Australian Conference on Artificial Intelligence (AI'04), volume 3339 of LNCS, pages 307–318. Springer, 2004. → pages 14
[67] Gurobi Optimization Inc. Gurobi 2.0. http://www.gurobi.com/, 2012. Version last visited on August 6, 2012. → pages 83
[68] I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh. Feature Extraction, Foundations and Applications. Springer, 2006. → pages 72
[69] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, 2nd edition, 2009. → pages 78
[70] K. Helsgaun.
An effective implementation of the Lin-Kernighan traveling salesman heuristic. European Journal of Operational Research, 126(1):106–130, 2000. → pages 36, 83
[71] P. Herwig. Using graphs to get a better insight into satisfiability problems. Master's thesis, Delft University of Technology, Department of Electrical Engineering, Mathematics and Computer Science, 2006. → pages 28
[72] M. Heule and H. V. Maaren. march_dl: Adding adaptive heuristics and a new branching strategy. Journal on Satisfiability, Boolean Modeling and Computation, 2:47–59, 2006. → pages 59
[73] M. Heule and H. V. Maaren. Improved version of march_ks. http://www.st.ewi.tudelft.nl/sat/Sources/stable/march_pl, 2007. Version last visited on June 16, 2009. → pages 21, 150
[74] M. Heule and H. v. Maaren. march_ks. Solver description, SAT competition 2007, http://www.satcompetition.org/2007/march_ks.pdf, Version last visited on April 29, 2014, 2007. → pages 102, 150
[75] M. Heule, J. Zwieten, M. Dufour, and H. Maaren. march_eq: Implementing additional reasoning into an efficient lookahead SAT solver. In Proceedings of the 8th International Conference on Theory and Applications of Satisfiability Testing (SAT'05), volume 4642 of LNCS, pages 345–359. Springer, 2005. → pages 101
[76] E. A. Hirsch. Random generator hgen2 of satisfiable formulas in 3-CNF. http://logic.pdmi.ras.ru/~hirsch/benchmarks/hgen2-1.01.tar.gz, 2002. Version last visited on June 16, 2009. → pages 29, 146
[77] H. Hoos, R. Kaminski, T. Schaub, and M. Schneider. aspeed: ASP-based solver scheduling. In Technical Communications of the 28th International Conference on Logic Programming (ICLP'12), pages 176–187. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2012. → pages 199
[78] H. Hoos, K. Leyton-Brown, T. Schaub, and M. Schneider. Algorithm configuration for portfolio-based parallel SAT-solving. In Workshop on Combining Constraint Solving with Mining and Learning (CoCoMile) at the European Conference on Artificial Intelligence (ECAI), 2012. → pages 199
[79] H. H.
Hoos. On the run-time behaviour of stochastic local search algorithms for SAT. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI'99), pages 661–666. AAAI Press, 1999. → pages 24, 26, 141
[80] H. H. Hoos. An adaptive noise mechanism for WalkSAT. In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI'02), pages 655–660. AAAI Press, 2002. → pages 21, 24, 139, 142, 150, 152, 156, 167, 197
[81] H. H. Hoos. Computer-aided design of high-performance algorithms. Technical Report TR-2008-16, University of British Columbia, Department of Computer Science, 2008. → pages 6, 16
[82] H. H. Hoos. Programming by optimization. Communications of the ACM, 55(2):70–80, 2012. → pages 138
[83] H. H. Hoos and T. Stützle. Stochastic Local Search – Foundations & Applications. Morgan Kaufmann, 2005. → pages 12, 20, 22, 23, 36, 38
[84] E. Horvitz, Y. Ruan, C. P. Gomes, H. Kautz, B. Selman, and D. M. Chickering. A Bayesian approach to tackling hard computational problems. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI'01), pages 235–244. Morgan Kaufmann, 2001. → pages 14
[85] E. I. Hsu and S. A. McIlraith. Characterizing propagation methods for Boolean satisfiability. In Proceedings of the 9th International Conference on Theory and Applications of Satisfiability Testing (SAT'06), volume 4121 of LNCS, pages 325–338. Springer, 2006. → pages 22
[86] E. I. Hsu, C. Muise, J. C. Beck, and S. A. McIlraith. Probabilistically estimating backbones and variable bias: Experimental overview. In Proceedings of the 14th International Conference on Principles and Practice of Constraint Programming (CP'08), volume 5202 of LNCS, pages 613–617. Springer, 2008. → pages 28
[87] J. Huang. TINISAT in SAT competition 2007. Solver description, SAT competition 2007, http://www.satcompetition.org/2007/tinisat.pdf, Version last visited on April 29, 2014, 2007. → pages 102
[88] L. Huang, J. Jia, B. Yu, B. Chun, P. Maniatis, and M. Naik.
Predicting execution time of computer programs using sparse polynomial regression. In Proceedings of the 23rd Conference on Advances in Neural Information Processing Systems (NIPS'10), pages 883–891. MIT Press, 2010. → pages 12, 72, 73
[89] B. Huberman, R. Lukose, and T. Hogg. An economics approach to hard computational problems. Science, 265:51–54, 1997. → pages 13
[90] F. Hutter. Automated Configuration of Algorithms for Solving Hard Computational Problems. PhD thesis, University of British Columbia, Computer Science, 2009. → pages 31, 186
[91] F. Hutter, D. A. D. Tompkins, and H. H. Hoos. Scaling and probabilistic smoothing: Efficient dynamic local search for SAT. In Proceedings of the 8th International Conference on Principles and Practice of Constraint Programming (CP'02), volume 2470 of LNCS, pages 233–248. Springer, 2002. → pages 21, 23, 25, 26, 83, 142, 149, 150, 152, 156
[92] F. Hutter, Y. Hamadi, H. H. Hoos, and K. Leyton-Brown. Performance prediction and automated tuning of randomized and parametric algorithms. In Proceedings of the 12th International Conference on Principles and Practice of Constraint Programming (CP'06), volume 4204 of LNCS, pages 213–228. Springer, 2006. → pages 12, 13, 72, 94
[93] F. Hutter, D. Babić, H. H. Hoos, and A. J. Hu. Boosting verification by automatic tuning of decision procedures. In Proceedings of the 7th International Conference on Formal Methods in Computer-Aided Design (FMCAD'07), pages 27–34. IEEE Press, 2007. → pages 17, 82, 184
[94] F. Hutter, H. H. Hoos, and T. Stützle. Automatic algorithm configuration based on local search. In Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI'07), pages 1152–1157. AAAI Press, 2007. → pages 6, 7, 17, 147
[95] F. Hutter, H. H. Hoos, T. Stützle, and K. Leyton-Brown. ParamILS version 2.3. http://www.cs.ubc.ca/labs/beta/Projects/ParamILS, 2008. Version last visited on Sept. 16, 2013. → pages 147
[96] F. Hutter, H. H. Hoos, K. Leyton-Brown, and T. Stützle.
ParamILS: an automatic algorithm configuration framework. Journal of Artificial Intelligence Research, 36:267–306, 2009. → pages 17, 18, 147, 175, 176, 178, 184, 187, 196
[97] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Automated configuration of mixed integer programming solvers. In Proceedings of the International Conference on Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems (CPAIOR'10), volume 6140 of LNCS, pages 186–202. Springer, 2010. → pages 17, 18, 30, 34, 147, 184, 186
[98] F. Hutter, H. H. Hoos, K. Leyton-Brown, and K. P. Murphy. Time-bounded sequential parameter optimization. In Proceedings of the 4th Learning and Intelligent Optimization Conference (LION'10), volume 6073 of LNCS, pages 281–298. Springer, 2010. → pages 79
[99] F. Hutter, H. H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of the 5th Learning and Intelligent Optimization Conference (LION'11), volume 6683 of LNCS, pages 507–523. Springer, 2011. → pages 13, 79, 192
[100] F. Hutter, J. Hoffmann, H. Hoos, and K. Leyton-Brown. An efficient approach for assessing hyperparameter importance. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), to appear. Omnipress, 2014. → pages 199
[101] F. Hutter, L. Xu, H. H. Hoos, and K. Leyton-Brown. Algorithm runtime prediction: methods & evaluation. Artificial Intelligence, 206(0):79–111, 2014. → pages iv, 20, 71, 194
[102] Clay Mathematics Institute. P vs NP Problem. http://www.claymath.org/millenium-problems/p-vs-np-problem, 2014. Version last visited on March 6, 2014. → pages 1
[103] International Business Machines Corp. IBM ILOG CPLEX Optimizer – Data Sheet, 2011. ftp://public.dhe.ibm.com/common/ssi/ecm/en/wsd14044usen/WSD14044USEN.PDF. Version last visited on August 6, 2012. → pages 31
[104] International Business Machines Corp. CPLEX 12.1. http://www-01.ibm.com/software/integration/optimization/cplex-optimizer, 2012.
Version last visited on August 6, 2012. → pages 83
[105] A. Ishtaiwi, J. Thornton, Anbulagan, A. Sattar, and D. N. Pham. Adaptive clause weight redistribution. In Proceedings of the 12th International Conference on Principles and Practice of Constraint Programming (CP'06), volume 4204 of LNCS, pages 229–243. Springer, 2006. → pages 21
[106] D. S. Johnson. Random TSP generators for the DIMACS TSP Challenge. http://www2.research.att.com/~dsj/chtsp/codes.tar, 2011. Version last visited on May 16, 2011. → pages 39
[107] D. S. Johnson and L. A. McGeoch. Local Search in Combinatorial Optimization, chapter The Traveling Salesman Problem: A Case Study in Local Optimization. Wiley and Sons, 1997. → pages 35, 36
[108] D. S. Johnson and L. A. McGeoch. Experimental analysis of heuristics for the STSP. The Traveling Salesman Problem and its Variations, pages 369–443, 2002. → pages 36
[109] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black box functions. Journal of Global Optimization, 13:455–492, 1998. → pages 75
[110] T. Jones and S. Forrest. Fitness distance correlation as a measure of problem difficulty for genetic algorithms. In Proceedings of the 6th International Conference on Genetic Algorithms, pages 184–192. Morgan Kaufmann, 1995. → pages 12
[111] S. Kadioglu, Y. Malitsky, M. Sellmann, and K. Tierney. ISAC – instance-specific algorithm configuration. In Proceedings of the 19th European Conference on Artificial Intelligence (ECAI'10), pages 751–756. IOS Press, 2010. → pages 9, 18, 19, 31, 34, 173, 186, 190, 191, 193, 194
[112] S. Kadioglu, Y. Malitsky, A. Sabharwal, H. Samulowitz, and M. Sellmann. Algorithm selection and scheduling. In Proceedings of the 17th International Conference on Principles and Practice of Constraint Programming (CP'11), number 6876 in LNCS, pages 454–469. Springer, 2011. → pages 16, 124, 127, 193, 199
[113] H. Kautz and B. Selman. Pushing the envelope: Planning, propositional logic, and stochastic search.
In Proceedings of the 13th National Conference on Artificial Intelligence (AAAI’96), pages 1194–1201. AAAI Press, 1996. → pages 21
[114] H. A. Kautz and B. Selman. Unifying SAT-based and graph-based planning. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI’99), pages 318–325. Morgan Kaufmann, 1999. → pages 1, 21
[115] A. KhudaBukhsh, L. Xu, H. H. Hoos, and K. Leyton-Brown. SATenstein: Automatically building local search SAT solvers from components. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI’09), pages 517–524. Morgan Kaufmann, 2009. → pages iv, 20, 26, 138, 152, 180
[116] P. Kilby, J. Slaney, S. Thiebaux, and T. Walsh. Estimating search tree size. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI’06), pages 1014–1019. AAAI Press, 2006. → pages 12
[117] S. Kirkpatrick and B. Selman. Critical behavior in the satisfiability of random Boolean formulae. Science, 264:1297–1301, 1994. → pages 41
[118] D. Knuth. Estimating the efficiency of backtrack programs. Mathematics of Computation, 29(129):121–136, 1975. → pages 12, 14
[119] D. G. Krige. A statistical approach to some basic mine valuation problems on the Witwatersrand. Journal of the Chemical, Metallurgical and Mining Society of South Africa, 52(6):119–139, 1951. → pages 75
[120] B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957–968, 2005. → pages 60
[121] O. Kullmann. Investigating the behaviour of a SAT solver on random formulas. http://www.cs.swan.ac.uk/~csoliver/Artikel/OKsolverAnalyse.html, 2002. Version last visited on June 22, 2008. → pages 21, 59
[122] M. G. Lagoudakis and M. L. Littman. Learning to select branching rules in the DPLL procedure for satisfiability. In Electronic Notes in Discrete Mathematics (ENDM), 2001. → pages 15
[123] N. D.
Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: The informative vector machine. In Proceedings of the 15th Conference on Advances in Neural Information Processing Systems (NIPS’02), pages 609–616. MIT Press, 2003. → pages 78
[124] D. Le Berre, O. Roussel, and L. Simon. The international SAT Competitions web page. www.satcompetition.org, 2012. Version last visited on Jan 29, 2012. → pages 4, 41, 122, 127, 195
[125] D. Lehmann, R. Müller, and T. Sandholm. The Winner Determination Problem. MIT Press, 2006. → pages 30
[126] K. Leyton-Brown, M. Pearson, and Y. Shoham. Towards a universal test suite for combinatorial auction algorithms. In Proceedings of the 2nd ACM Conference on Electronic Commerce (EC’00), pages 66–76. ACM, 2000. → pages 34
[127] K. Leyton-Brown, E. Nudelman, and Y. Shoham. Learning the empirical hardness of optimization problems: The case of combinatorial auctions. In Proceedings of the 8th International Conference on Principles and Practice of Constraint Programming (CP’02), volume 2470 of LNCS, pages 556–572. Springer, 2002. → pages 3, 11, 12, 13, 56, 72, 94, 194
[128] K. Leyton-Brown, E. Nudelman, G. Andrew, J. McFadden, and Y. Shoham. Boosting as a metaphor for algorithm design. In Proceedings of the 9th International Conference on Principles and Practice of Constraint Programming (CP’03), volume 2833 of LNCS, pages 899–903. Springer, 2003. → pages 7, 12, 173
[129] K. Leyton-Brown, E. Nudelman, G. Andrew, J. McFadden, and Y. Shoham. A portfolio approach to algorithm selection. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI’03), pages 1542–1543. Morgan Kaufmann, 2003. → pages 12, 14, 92
[130] K. Leyton-Brown, E. Nudelman, and Y. Shoham. Empirical hardness models: Methodology and a case study on combinatorial auctions. Journal of the ACM, 56(4):1–52, 2009. → pages 3, 8, 31, 34, 72, 173, 186
[131] C. Li and W. Huang. Diversification and determinism in local search for satisfiability.
In Proceedings of the 8th International Conference on Theory and Applications of Satisfiability Testing (SAT’05), volume 3569 of LNCS, pages 158–172. Springer, 2005. → pages 21, 23, 25, 26, 139, 141, 142, 150, 152, 156
[132] C. Li, W. Wei, and H. Zhang. Combining adaptive noise and promising decreasing variables in local search for SAT. Solver description, SAT competition 2007, http://www.satcompetition.org/2007/adaptG2WSAT.pdf, Version last visited on April 29, 2014, 2007. → pages 141, 142, 150, 152
[133] C. M. Li and Anbulagan. Look-ahead versus look-back for satisfiability problems. In Proceedings of the 3rd International Conference on Principles and Practice of Constraint Programming (CP’97), volume 1330 of LNCS, pages 341–355. Springer, 1997. → pages 59
[134] C. M. Li and W. Wei. Combining adaptive noise and promising decreasing variables in local search for SAT. Solver description, SAT competition 2009, https://www.laria.u-picardie.fr/~cli/adaptg2wsat2009++.tar, Version last visited on April 29, 2014, 2009. → pages 42
[135] C. M. Li, W. X. Wei, and H. Zhang. Combining adaptive noise and look-ahead in local search for SAT. In Proceedings of the 10th International Conference on Theory and Applications of Satisfiability Testing (SAT’07), volume 4501 of LNCS, pages 121–133. Springer, 2007. → pages 24, 26, 140, 141, 150, 152, 167, 197
[136] S. Lin and B. W. Kernighan. An effective heuristic algorithm for the traveling-salesman problem. Operations Research, 21:498–516, 1973. → pages 36, 38
[137] G. Lindner and R. Studer. AST: Support for algorithm selection with a CBR approach. In Principles of Data Mining and Knowledge Discovery, volume 1704 of LNCS, pages 418–423. Springer, 1999. → pages 14
[138] L. Lobjois and M. Lemaître. Branch and bound algorithm selection by performance prediction. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI’98), pages 353–358. AAAI Press, 1998. → pages 12, 14
[139] Y. S. Mahajan, Z. Fu, and S. Malik.
Zchaff2004: an efficient SAT solver. In Proceedings of the 8th International Conference on Theory and Applications of Satisfiability Testing (SAT’05), volume 3542 of LNCS, pages 360–375. Springer, 2005. → pages 21, 22, 28, 101
[140] Y. Malitsky and M. Sellmann. Stochastic offline programming. In Proceedings of the 21st IEEE International Conference on Tools with Artificial Intelligence, pages 784–791. IEEE, 2009. → pages 18, 193
[141] P. J. Manning and M. E. McDill. Optimal parameter settings for solving harvest scheduling models with adjacency constraints. Mathematical and Computational Forestry & Natural-Resource Sciences, 4(1):16–26, 2012. → pages 35
[142] D. McAllester, B. Selman, and H. Kautz. Evidence for invariants in local search. In Proceedings of the 14th National Conference on Artificial Intelligence (AAAI’97), pages 321–326. AAAI Press, 1997. → pages 24, 141, 142
[143] N. Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983–999, 2006. → pages 80
[144] O. Mersmann, B. Bischl, H. Trautmann, M. Wagner, J. Bossek, and F. Neumann. A novel feature-based approach to characterize algorithm performance for the traveling salesperson problem. Annals of Mathematics and Artificial Intelligence (AMAI), pages 151–182, 2013. → pages 38
[145] S. Minton. An analytic learning system for specializing heuristics. In Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI’93), pages 922–929. Morgan Kaufmann, 1993. → pages 16
[146] D. Mitchell, B. Selman, and H. Levesque. Hard and easy distributions of SAT problems. In Proceedings of the 10th National Conference on Artificial Intelligence (AAAI’92), pages 459–465. AAAI Press, 1992. → pages 3, 41
[147] H. D. Mittelmann. Performance of optimization software - an update. http://plato.asu.edu/talks/mittelmann_bench.pdf, 2011. Version last visited on March 16, 2014. → pages 31
[148] R. Monasson and R. Zecchina. Entropy of the k-satisfiability problem. Physical Review Letters, 76:3881–3885, 1996.
→ pages 42
[149] R. Monasson and R. Zecchina. The statistical mechanics of the random k-satisfiability model. Physical Review E, 56:1357–1370, 1997. → pages 42
[150] K. Murphy. The Bayes net toolbox for MATLAB. Computing Science and Statistics: Proceedings of Interface, 33(2):1024–1034, 2001. → pages 59
[151] I. T. Nabney. NETLAB: algorithms for pattern recognition. Springer-Verlag, 2002. → pages 74
[152] A. Nadel, M. Gordon, A. Palti, and Z. Hanna. Eureka-2006 SAT solver. Solver description, SAT Race 2006, http://www.cs.tau.ac.il/research/alexander.nadel/Eureka.pdf, Version last visited on April 29, 2014, 2006. → pages 101
[153] M. Nikolić, F. Marić, and P. Janičić. Instance-based selection of policies for SAT solvers. In Proceedings of the 12th International Conference on Theory and Applications of Satisfiability Testing (SAT’09), volume 5584 of LNCS, pages 326–340. Springer, 2009. → pages 162
[154] J. Nocedal and S. J. Wright. Numerical Optimization (Second Edition). Springer, 2006. → pages 76
[155] E. Nudelman, K. Leyton-Brown, A. Devkar, Y. Shoham, and H. Hoos. SATzilla: An algorithm portfolio for SAT. Solver description, SAT competition 2004, http://www.researchgate.net/publication/2925723_SATzilla_An_Algorithm_Portfolio_for_SAT, Version last visited on April 29, 2014, 2004. → pages 5, 14
[156] E. Nudelman, K. Leyton-Brown, H. H. Hoos, A. Devkar, and Y. Shoham. Understanding random SAT: beyond the clauses-to-variables ratio. In Proceedings of the 10th International Conference on Principles and Practice of Constraint Programming (CP’04), volume 3258 of LNCS, pages 438–452. Springer, 2004. → pages 3, 9, 12, 13, 26, 40, 44, 55, 56, 57, 60, 69, 72, 94
[157] M. Oltean. Evolving evolutionary algorithms using linear genetic programming. Evolutionary Computation, 13(3):387–410, 2005. → pages 16
[158] E. O’Mahony, E. Hebrard, A. Holland, C. Nugent, and B.
O’Sullivan. Using case-based reasoning in an algorithm portfolio for constraint solving. In Proceedings of the 19th Irish Conference on Artificial Intelligence and Cognitive Science, 2008. → pages 16
[159] B. Pfahringer, H. Bensusan, and C. Giraud-Carrier. Meta-learning by landmarking various learning algorithms. In Proceedings of the 17th International Conference on Machine Learning (ICML’00), pages 743–750. Morgan Kaufmann, 2000. → pages 14
[160] D. Pham and Anbulagan. Resolution enhanced SLS solver: R+AdaptNovelty+. Solver description, SAT competition 2007, http://www.satcompetition.org/2007/ranov.pdf, Version last visited on April 29, 2014, 2007. → pages 102, 150, 152, 156
[161] D. N. Pham, J. Thornton, C. Gretton, and A. Sattar. Combining adaptive and dynamic local search for satisfiability. Journal on Satisfiability, Boolean Modeling and Computation, 4:149–172, 2008. → pages 26, 102, 141, 149, 150, 152, 156
[162] K. Pipatsrisawat and A. Darwiche. RSat 1.03: SAT solver description. Technical Report D-152, Automated Reasoning Group, UCLA, 2006. → pages 101
[163] K. Pipatsrisawat and A. Darwiche. RSat 2.0: SAT solver description. Solver description, SAT competition 2007, http://reasoning.cs.ucla.edu/rsat/papers/rsat_2.0.pdf, Version last visited on April 29, 2014, 2007. → pages 102
[164] M. Pop, S. L. Salzberg, and M. Shumway. Genome sequence assembly: algorithms and issues. Computer, 35(7):47–54, 2002. → pages 1
[165] P. C. Pop and S. Iordache. A hybrid heuristic approach for solving the generalized traveling salesman problem. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, pages 481–488. ACM, 2011. → pages 147
[166] S. Prestwich. Random walk with continuously smoothed variable weights. In Proceedings of the 8th International Conference on Theory and Applications of Satisfiability Testing (SAT’05), volume 3569 of LNCS, pages 203–215. Springer, 2005. → pages 24, 26, 139, 141, 142, 144, 148, 150, 152, 156
[167] J. Quinonero-Candela, C. E. Rasmussen, and C.
K. Williams. Approximation methods for Gaussian process regression. In Large-Scale Kernel Machines, Neural Information Processing, pages 203–223. MIT Press, 2007. → pages 78
[168] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006. → pages 75, 76, 78, 79
[169] G. Reinelt. The Travelling Salesman: Computational Solutions for TSP Applications. Springer-Verlag, 1994. → pages 36
[170] J. R. Rice. The algorithm selection problem. Advances in Computers, 15:65–118, 1976. → pages 5, 14
[171] O. Roussel. Description of ppfolio. Solver description, SAT competition 2011, http://www.cril.univ-artois.fr/~roussel/ppfolio/solver1.pdf, Version last visited on April 29, 2014, 2011. → pages 16, 124
[172] J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn. Design and analysis of computer experiments. Statistical Science, 4(4):409–423, 1989. → pages 75
[173] H. Samulowitz and R. Memisevic. Learning to solve QBF. In Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI’07), pages 255–260. AAAI Press, 2007. → pages 15
[174] T. J. Santner, B. J. Williams, and W. I. Notz. The Design and Analysis of Computer Experiments. Springer-Verlag, 2003. → pages 75
[175] J. Schmee and G. J. Hahn. A simple method for regression analysis with censored data. Technometrics, 21(4):417–432, 1979. → pages 95, 105
[176] M. Schmidt. minFunc. http://www.di.ens.fr/~mschmidt/Software/minFunc.html, 2012. Version last visited on August 5, 2012. → pages 76
[177] B. Selman and H. A. Kautz. Domain-independent extensions to GSAT: Solving large structured satisfiability problems. In Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI’93), pages 290–295. Morgan Kaufmann, 1993. → pages 23
[178] B. Selman, H. J. Levesque, and D. Mitchell. A new method for solving hard satisfiability problems. In Proceedings of the 10th National Conference on Artificial Intelligence (AAAI’92), pages 440–446. AAAI Press, 1992. → pages 21, 23
[179] B. Selman, H. A.
Kautz, and B. Cohen. Noise strategies for improving local search. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI’94), pages 337–343. AAAI Press, 1994. → pages 21, 23, 24, 141, 142
[180] B. Selman, D. G. Mitchell, and H. J. Levesque. Generating hard satisfiability problems. Artificial Intelligence, 81:17–29, 1996. → pages 29, 146
[181] J. Sherman and W. Morrison. Adjustment of an inverse matrix corresponding to changes in the elements of a given column or a given row of the original matrix (abstract). Annals of Mathematical Statistics, 20:621, 1949. → pages 73
[182] L. Simon. SAT competition random 3CNF generator. www.satcompetition.org/2003/TOOLBOX/genAlea.c, 2002. Version last visited on May 16, 2012. → pages 42
[183] S. Skiena. The Algorithm Design Manual. Springer-Verlag, 1998. → pages 36
[184] K. Smith-Miles and J. van Hemert. Discovering the suitability of optimisation algorithms by learning from evolved instances. Annals of Mathematics and Artificial Intelligence, 61:87–104, 2011. → pages 12, 39, 72, 74, 75
[185] K. Smith-Miles, J. van Hemert, and X. Y. Lim. Understanding TSP difficulty by learning from evolved instances. In Proceedings of the 4th Learning and Intelligent Optimization Conference (LION’10), volume 6073 of LNCS, pages 266–280. Springer, 2010. → pages 37
[186] M. Soos. CryptoMiniSat 2.5.0. Solver description, SAT Race 2010, http://baldur.iti.uka.de/sat-race-2010/descriptions/solver_13.pdf, Version last visited on April 29, 2014, 2010. → pages 82
[187] N. Sörensson and N. Eén. MiniSat 2007. http://www.cs.chalmers.se/Cs/Research/FormalMethods/MiniSat/, 2007. Version last visited on May 20, 2010. → pages 102, 150
[188] I. Spence. Ternary tree solver (tts-4-0). Solver description, SAT competition 2007, http://baldur.iti.uka.de/sat-race-2010/descriptions/solver_13.pdf, Version last visited on April 29, 2014, 2007. → pages 102
[189] P. Stephan, R. Brayton, and A. Sangiovanni-Vincentelli. Combinational test generation using satisfiability.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 15:1167–1176, 1996. → pages 21
[190] M. Streeter, D. Golovin, and S. F. Smith. Combining multiple heuristics online. In Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI’07), pages 1197–1203. AAAI Press, 2007. → pages 14
[191] T. Stützle and H. Hoos. MAX-MIN ant system. Future Generation Computer Systems, 16(8):889–914, 2000. → pages 35
[192] S. Subbarayan and D. Pradhan. NiVER: Non-increasing variable elimination resolution for preprocessing SAT instances. In Proceedings of the 7th International Conference on Theory and Applications of Satisfiability Testing (SAT’04), volume 3542 of LNCS, pages 276–291. Springer, 2005. → pages 21
[193] G. Sutcliffe and C. B. Suttner. Evaluating general purpose automated theorem proving systems. Artificial Intelligence, 131(1-2):39–54, 2001. → pages 126
[194] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000. → pages 164
[195] J. Thornton, D. N. Pham, S. Bain, and V. Ferreira. Additive versus multiplicative clause weighting for SAT. In Proceedings of the 19th National Conference on Artificial Intelligence (AAAI’04), pages 191–196. AAAI Press, 2004. → pages 23, 25, 26, 141, 142, 144, 150, 152, 156
[196] K. M. Ting. An instance-weighting method to induce cost-sensitive trees. IEEE Transactions on Knowledge and Data Engineering, 14(3):659–665, 2002. → pages 43
[197] D. A. Tompkins, A. Balint, and H. H. Hoos. Captain Jack – new variable selection heuristics in local search for SAT. In Proceedings of the 14th International Conference on Theory and Applications of Satisfiability Testing (SAT’11), volume 6695 of LNCS, pages 302–316. Springer, 2011. → pages 147
[198] D. A. D. Tompkins and H. H. Hoos. UBCSAT: An implementation and experimentation environment for SLS algorithms for SAT & MAX-SAT.
In Proceedings of the 7th International Conference on Theory and Applications of Satisfiability Testing (SAT’04), volume 3542 of LNCS, pages 306–320. Springer, 2004. → pages 26, 139, 143
[199] V. Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719–2741, 2000. → pages 78
[200] T. Uchida and O. Watanabe. Hard SAT instance generation based on the factorization problem. http://www.is.titech.ac.jp/~watanabe/gensat/a2/GenAll.tar.gz, 1999. Version last visited on June 6, 2009. → pages 30, 146
[201] D. Vallstrom. Vallst documentation. http://vallst.satcompetition.org/inde