Automated Configuration of Algorithms for Solving Hard Computational Problems

by

Frank Hutter

Dipl. Inf., Darmstadt University of Technology, 2004

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE STUDIES (Computer Science)

The University Of British Columbia (Vancouver)

October 2009

© Frank Hutter, 2009

Abstract

The best-performing algorithms for many hard problems are highly parameterized. Selecting the best heuristics and tuning their parameters for optimal overall performance is often a difficult, tedious, and unsatisfying task. This thesis studies the automation of this important part of algorithm design: the configuration of discrete algorithm components and their continuous parameters to construct an algorithm with desirable empirical performance characteristics. Automated configuration procedures can facilitate algorithm development and be applied on the end user side to optimize performance for new instance types and optimization objectives. The use of such procedures separates high-level cognitive tasks carried out by humans from tedious low-level tasks that can be left to machines. We introduce two alternative algorithm configuration frameworks: iterated local search in parameter configuration space and sequential optimization based on response surface models. To the best of our knowledge, our local search approach is the first that goes beyond local optima. Our model-based search techniques significantly outperform existing techniques and extend them in ways crucial for general algorithm configuration: they can handle categorical parameters, optimization objectives defined across multiple instances, and tens of thousands of data points. We study how many runs to perform for evaluating a parameter configuration and how to set the cutoff time, after which algorithm runs are terminated unsuccessfully.
We introduce data-driven approaches for making these choices adaptively, most notably the first general method for adaptively setting the cutoff time. Using our procedures—to the best of our knowledge still the only ones applicable to these complex configuration tasks—we configured state-of-the-art tree search and local search algorithms for SAT, as well as CPLEX, the most widely-used commercial optimization tool for solving mixed integer programs (MIP). In many cases, we achieved improvements of orders of magnitude over the algorithm default, thereby substantially improving the state of the art in solving a broad range of problems, including industrially relevant instances of SAT and MIP. Based on these results, we believe that automated algorithm configuration procedures, such as ours, will play an increasingly crucial role in the design of high-performance algorithms and will be widely used in academia and industry.

Table of Contents

Abstract  ii
Table of Contents  iii
List of Tables  ix
List of Figures  xi
List of Algorithms and Procedures  xiv
Glossary  xvi
Acknowledgements  xix
Dedication  xxi
Co-Authorship Statement  xxii

I Algorithm Configuration: The Problem  1

1 Introduction  2
  1.1 Motivation for Automated Algorithm Configuration  4
    1.1.1 Traditional, Manual Algorithm Design  4
    1.1.2 Automated Algorithm Design from Components  6
    1.1.3 Practical End Use of Algorithms  6
    1.1.4 Scientific Studies of Algorithm Performance  7
  1.2 Problem Definitions and Notation  8
    1.2.1 Blackbox Optimization  9
    1.2.2 Algorithm Configuration  9
  1.3 Summary of Contributions  14
  1.4 How to Read this Thesis  15
  1.5 Thesis Outline  16
  1.6 Chapter Summary  18

2 Related Work  19
  2.1 Algorithm Configuration and Parameter Optimization  19
    2.1.1 Direct Search Methods  19
    2.1.2 Model-Based Optimization  22
    2.1.3 Racing Algorithms  23
  2.2 Related Work on the Construction of Algorithms  24
    2.2.1 Algorithm Synthesis  24
    2.2.2 Genetic Programming  25
  2.3 Approaches for Related Problems  27
    2.3.1 Per-Instance Approaches  27
    2.3.2 Dynamic Algorithm Portfolios  29
    2.3.3 Online Methods  30
    2.3.4 Finding a Single Good Solution for Optimization Problems  32
  2.4 Further Promising Application Areas for Algorithm Configuration  33
    2.4.1 Machine Learning  33
    2.4.2 Optimizing Algorithms for Frequently-Used Polytime Operations  34
  2.5 Chapter Summary  35

II Algorithm Configuration: Scenarios and Empirical Analysis  36

3 Algorithm Configuration Scenarios  37
  3.1 Problems  37
    3.1.1 Propositional Satisfiability (SAT)  38
    3.1.2 Mixed Integer Programming (MIP)  38
  3.2 Target Algorithms  38
    3.2.1 Target Algorithms for SAT  39
    3.2.2 Target Algorithm for MIP  42
    3.2.3 CMA-ES for Global Continuous Function Optimization  43
    3.2.4 GLS+ for the Most Probable Explanation (MPE) Problem  43
  3.3 Benchmark Sets  43
    3.3.1 SAT Benchmarks  44
    3.3.2 MIP Benchmarks  46
    3.3.3 Test Functions for Continuous Global Optimization  48
  3.4 Optimization Objective: Penalized Average Runtime  48
  3.5 Configuration Scenarios  49
    3.5.1 Set of Configuration Scenarios BLACKBOXOPT  50
    3.5.2 Set of Configuration Scenarios SINGLEINSTCONT  51
    3.5.3 Set of Configuration Scenarios SINGLEINSTCAT  51
    3.5.4 Set of Configuration Scenarios BROAD  52
    3.5.5 Set of Configuration Scenarios VERIFICATION  52
    3.5.6 Set of Configuration Scenarios CPLEX  52
    3.5.7 Set of Configuration Scenarios COMPLEX  53
  3.6 Experimental Preliminaries  53
    3.6.1 Selecting Instances and Seeds  53
    3.6.2 Comparison of Configuration Procedures  54
    3.6.3 Reference Machines  55
    3.6.4 Implementation  55
  3.7 Chapter Summary  56

4 Empirical Analysis of Algorithm Configuration Scenarios Based on Random Sampling  57
  4.1 Introduction  57
  4.2 Gathering a Matrix of Runtimes  59
  4.3 Analysis of Runtime Variability across Configurations and Instances  61
  4.4 Tradeoffs in Identifying the Best Configuration  64
    4.4.1 Overconfidence and overtuning  64
    4.4.2 Trading off number of configurations, number of instances, and captime  65
  4.5 Tradeoffs in Ranking Configurations  69
  4.6 Chapter Summary  71

III Model-free Search for Algorithm Configuration  73

5 Methods I: PARAMILS—Iterated Local Search in Parameter Configuration Space  74
  5.1 The PARAMILS framework  75
  5.2 The BASICILS Algorithm  77
    5.2.1 Algorithm Statement  77
    5.2.2 Experimental Evaluation  78
  5.3 FOCUSEDILS: Adaptively Selecting the Number of Training Instances  80
    5.3.1 Algorithm Statement  81
    5.3.2 Theoretical Analysis  81
    5.3.3 Experimental Evaluation  84
  5.4 Configuration of SAPS, GLS+ and SAT4J  86
  5.5 Chapter Summary  88

6 Applications I: A Case Study on the Configuration of SPEAR for Industrial Verification Problems  90
  6.1 Formal Verification and Decision Procedures  90
  6.2 Algorithm Development  91
    6.2.1 Manual Tuning  92
    6.2.2 Initial Performance Assessment  93
  6.3 Automated Configuration of SPEAR  94
    6.3.1 Parameterization of SPEAR  94
    6.3.2 Configuration for the 2007 SAT Competition  95
    6.3.3 Configuration for Specific Applications  96
  6.4 Discussion  99
  6.5 Chapter Summary  101

7 Methods II: Adaptive Capping of Algorithm Runs  103
  7.1 Adaptive Capping for RANDOMSEARCH  103
  7.2 Adaptive Capping in BASICILS  106
    7.2.1 Trajectory-Preserving Capping  107
    7.2.2 Aggressive Capping  107
  7.3 Adaptive Capping in FOCUSEDILS  110
  7.4 Final Evaluation of Default vs Optimized Parameter Configurations  111
  7.5 Chapter Summary  113

8 Applications II: Configuration of Various Target Algorithms based on PARAMILS  115
  8.1 Configuration of CPLEX  115
    8.1.1 Experimental Setup  116
    8.1.2 Experimental Results  116
  8.2 Self-Configuration of PARAMILS  120
  8.3 Applications of PARAMILS by Other Researchers  124
    8.3.1 Configuration of SATenstein  124
    8.3.2 Configuration of a Monte Carlo Algorithm for Protein Folding  124
    8.3.3 Configuration of a Local Search Algorithm for Time-Tabling  125
  8.4 Chapter Summary  126

IV Model-based Search for Algorithm Configuration  127

9 Sequential Model-Based Optimization: The State of The Art  128
  9.1 A Gentle Introduction to Sequential Model-Based Optimization (SMBO)  129
  9.2 Gaussian Process Regression  132
  9.3 Two Methods for Handling Noisy Data  133
    9.3.1 Standard Gaussian Processes  133
    9.3.2 Noise-free Gaussian Processes Trained on Empirical Statistics  134
  9.4 Log Transformations of Raw Data and Cost Statistics  134
  9.5 Two Existing SMBO Approaches for Noisy Functions: SKO and SPO  136
    9.5.1 A Framework for Sequential Model-Based Optimization  136
    9.5.2 Initialization  137
    9.5.3 Fit of Response Surface Model  137
    9.5.4 Selection of new Parameter Settings to Evaluate  140
    9.5.5 Intensification  141
  9.6 An Experimental Comparison of SKO and SPO  141
    9.6.1 Experimental Setup  142
    9.6.2 Experimental Results  143
  9.7 Chapter Summary  145

10 Improvements I: An Experimental Investigation of SPO Components  146
  10.1 Experimental Setup  147
  10.2 Model Quality  147
    10.2.1 Measures of Model Quality  147
    10.2.2 Transforming Performance Data  149
    10.2.3 Choosing the Initial Design  153
  10.3 Sequential Experimental Design  155
    10.3.1 Intensification Mechanism  155
    10.3.2 Expected Improvement Criterion  160
    10.3.3 Overall Evaluation  161
  10.4 Reducing Overheads: SPO∗ and RANDOM∗  163
  10.5 Chapter Summary  166

11 Improvements II: Different Models  168
  11.1 Random Forests  168
    11.1.1 Regression Trees  169
    11.1.2 Standard Random Forest Framework  170
    11.1.3 Uncertainty Estimates Across Trees  171
    11.1.4 Transformations of the Response Variable  171
  11.2 Qualitative Comparison of Models  172
    11.2.1 Qualitative Model Quality  172
    11.2.2 Potential Failure Modes of SMBO Using Various Models  175
  11.3 Scaling To Many Data Points  177
    11.3.1 Computational Complexity of Random Forests  178
    11.3.2 Approximate Gaussian Processes  179
    11.3.3 Experiments for Scaling Behaviour  180
  11.4 Model Quality  182
    11.4.1 Diagnostic Plots  182
    11.4.2 Overall Model Performance  183
    11.4.3 Log Transformation for approximate GPs and Random Forests  186
  11.5 Sequential Optimization Process  187
    11.5.1 Configuration Procedures based on Different Models  187
    11.5.2 Experimental Evaluation of Configuration Procedures  188
  11.6 Chapter Summary  188

12 Extensions I: Categorical Variables  192
  12.1 Models for Partly Categorical Inputs  192
    12.1.1 Categorical Inputs in Random Forests  192
    12.1.2 A Weighted Hamming Distance Kernel for Categorical Inputs in Gaussian Processes  193
  12.2 Model Quality  195
    12.2.1 Effect of Discretization on Model Quality  196
    12.2.2 Predictive Performance for Many Categorical Parameters  196
  12.3 Sequential Optimization Process  196
    12.3.1 Optimization of Expected Improvement for Categorical Parameters  199
    12.3.2 Convergence of ACTIVECONFIGURATOR for Categorical Parameters  199
    12.3.3 Quantifying the Loss Due to Discretization  201
    12.3.4 Experimental Evaluation of Configuration Procedures  203
  12.4 Chapter Summary  207

13 Extensions II: Multiple Instances  208
  13.1 Instance Features  209
    13.1.1 SAT  209
    13.1.2 MIP  210
    13.1.3 Generic Features  213
    13.1.4 Principal Component Analysis (PCA) to Speed Up Learning  213
  13.2 Modelling Cost Measures Defined Across Multiple Instances  214
    13.2.1 Gaussian Processes: Existing Work for Predicting Marginal Performance Across Instances  214
    13.2.2 Random Forests: Prediction of Cost Measures Across Multiple Instances  215
  13.3 Experimental Setup  216
  13.4 Model Quality  217
    13.4.1 Evaluation of Different Sets of Instance Features  217
    13.4.2 Evaluation of PCA to Speed Up Learning  221
  13.5 Prediction of Matrix of Runtimes  221
  13.6 Sequential Optimization Process  227
    13.6.1 Blocking on Instances and Seeds  227
    13.6.2 Experimental Evaluation of SMBO Configuration Procedures  231
    13.6.3 Chapter Summary  233

V Conclusion  237

14 Conclusion  238
  14.1 General Advice for Practitioners  239
    14.1.1 Necessary Ingredients To Apply Algorithm Configuration  239
    14.1.2 Which Configuration Procedure to Apply When?  241
  14.2 Dimensions of Algorithm Configuration  243
    14.2.1 Sequential Search Strategy: Model-free or Model-based?  243
    14.2.2 Which Instances to Use and How Many Runs to Perform?  245
    14.2.3 Which Captime To Use?  247
  14.3 Future Work  247
    14.3.1 Configuration Scenarios and Their Analysis  247
    14.3.2 Search Strategy  248
    14.3.3 Adaptive Approaches for Selecting Captime and Instances  250
    14.3.4 Scientific Studies of Target Algorithm Performance  251
    14.3.5 Per-Instance Approaches  252
  14.4 Outlook  253

Bibliography  255

List of Tables

3.1 Target algorithms and their parameters  39
3.2 Target algorithms with only numerical parameters  39
3.3 Summary of our BLACKBOXOPT configuration scenarios  50
3.4 Experimental setup for the CMA-ES configuration scenarios  50
3.5 Summary of our SINGLEINSTCONT scenarios  51
3.6 Summary of our SINGLEINSTCAT scenarios  51
3.7 Summary of our BROAD configuration scenarios  52
3.8 Summary of our VERIFICATION configuration scenarios  52
3.9 Summary of our CPLEX configuration scenarios  53
3.10 Summary of our COMPLEX configuration scenarios  53
3.11 Summary of performance measures for configurators  54
4.1 Quality of default, best-known, and best sampled configuration  60
5.1 Comparison of RANDOMSEARCH and BASICILS  80
5.2 Comparison of SIMPLELS and BASICILS  80
5.3 Comparison of BASICILS and FOCUSEDILS  86
5.4 Configuration of GLS+, SAPS, and SAT4J  87
6.1 Summary of results for configuring SPEAR  101
7.1 Speedup of RANDOMSEARCH due to adaptive capping  106
7.2 Speedup of BASICILS due to adaptive capping  108
7.3 Comparison of RANDOMSEARCH and BASICILS, with adaptive capping  109
7.4 Speedup of FOCUSEDILS due to adaptive capping  111
7.5 Final evaluation of default vs configurations found with BASICILS and FOCUSEDILS  112
8.1 Overview of our five CPLEX configuration scenarios  116
8.2 Experimental results for our CPLEX configuration scenarios  119
8.3 Parameters in the self-configuration of PARAMILS  122
8.4 Effect of self-configuration  123
8.5 Performance summary of SATenstein-LS; reproduced from (KhudaBukhsh et al., 2009)  125
9.1 Summary of notation  136
9.2 SPO and SPO+ algorithm parameters  136
10.1 p-values for comparison of intensification mechanisms  158
10.2 p-values for comparison of expected improvement criteria  159
10.3 Performance comparison of various configuration procedures for optimizing SAPS on instance QWH  162
11.1 Quantitative comparison of configurators for SINGLEINSTCONT scenarios  190
12.1 Quantification of loss due to discretization for SINGLEINSTCONT scenarios  202
12.2 Quantitative comparison of configurators for SINGLEINSTCAT and discretized SINGLEINSTCONT scenarios  203
13.1 Model quality for predictions of 〈configuration, instance〉 combinations  224
13.2 Quantitative comparison of configurators for BROAD scenarios  235

List of Figures

1.1 Visualization of algorithm configuration  3
1.2 Visualization of Algorithm Configuration: Runtime Matrix  13
4.1 Raw data: runtime matrix  61
4.2 Hardness variation across instances  62
4.3 Overconfidence and overtuning  65
4.4 Performance with various values of N  66
4.5 Performance with various values of κ  67
4.6 Performance with various combinations of N and κ  69
4.7 Reliability of comparisons between solvers  70
4.8 Ratio of correct comparisons for various combinations of N and κ  71
5.1 Comparison of BASICILS and RANDOMSEARCH  79
5.2 Comparison of BASICILS(N) with N = 1, 10, and 100 vs FOCUSEDILS for scenario SAPS-QWH  85
5.3 Comparison of BASICILS(N) with N = 1, 10, and 100 vs FOCUSEDILS  85
5.4 SAPS default vs SAPS configured for SWGCP  88
6.1 MiniSAT 2.0 vs SPEAR default  93
6.2 SPEAR default vs SPEAR configured for SAT competition, on SAT competition instances  96
6.3 SPEAR default vs SPEAR configured for SAT competition, on BMC and SWV  97
6.4 SPEAR configured for general benchmark set vs SPEAR configured for specific benchmark sets  98
6.5 SPEAR default vs SPEAR configured for specific benchmark sets  99
6.6 MiniSAT 2.0 vs SPEAR configured for specific benchmark sets  100
7.1 Speedup of RANDOMSEARCH due to adaptive capping  106
7.2 Speedup of BASICILS due to adaptive capping  108
7.3 Comparison of BASICILS and RANDOMSEARCH, with adaptive capping  109
7.4 Speedup of FOCUSEDILS due to adaptive capping  112
7.5 Comparison of default vs automatically-determined parameter configurations  113
8.1 CPLEX default vs configurations configured for specific benchmark sets  117
8.2 CPLEX default vs configurations configured for quadratic programs  118
8.3 Visualization of self-configuration  121
9.1 Two steps of SMBO for the optimization of a 1-d function  130
9.2 Alternative ways of fitting a response surface to noisy observations  133
9.3 Alternative GP fits to log-transformed data  135
9.4 Comparison of SKO and three variants of SPO for optimizing CMA-ES  144
9.5 Comparison of SKO and three variants of SPO for optimizing CMA-ES, with log-transformations  145
10.1 Diagnostic plots of Jones et al. (1998) for scenario CMAES-SPHERE  150
10.2 Offline diagnostic plots for scenario CMAES-SPHERE  152
10.3 Comparison of models based on log-transformed and original data  153
10.4 Comparison of models based on different initial designs  154
10.5 Comparison of intensification mechanisms, end result  158
10.6 Comparison of intensification mechanisms, over time  159
10.7 Comparison of expected improvement criteria, end result  160
10.8 Comparison of SPO variants for tuning SAPS on instance QWH  162
10.9 Comparison of SPO+ and SPO∗  166
11.1 Step 1 of SMBO based on RF models, for noisy and noise-free responses  173
11.2 Standard GP and RF model fits for proportional noise  174
11.3 Standard GP and RF model fits with a log-transformation for proportional noise  174
11.4 Standard GP and RF fits to log-transformed data with nonstationary multiplicative noise  175
11.5 Failure mode of SMBO based on standard GP model  176
11.6 Failure mode of SMBO based on DACE model  177
11.7 Failure mode of SMBO based on RF model  178
11.8 Scaling of RF and approximate GP model with number of training data points, n, for scenario SAPS-QCP-Q075  180
11.9 Predictive quality of ranks for RF and approximate GP models with growing number of training data points  181
11.10 Scaling of RF models (with # trees) and of approximate GP models (with size of active set), for scenario SAPS-QCP-Q075  182
11.11 Scaling of RF models (with # trees) and of approximate GP models (with size of active set), for other scenarios  183
11.12 Predicted vs actual cost for approximate GP, RF, and DACE model  184
11.13 Comparison of models  185
11.14 Comparison of models built on original and log-transformed data  186
11.15 Comparison of configurators for SINGLEINSTCONT scenarios, over time  189
11.16 Boxplot comparison of configurators for SINGLEINSTCONT scenarios  190
12.1 Comparison of approximate GP and RF models for SINGLEINSTCONT scenarios with discretized parameters  197
12.2 Comparison of approximate GP and RF for SINGLEINSTCAT scenarios  198
12.3 Quantification of loss due to discretizing parameters, for RANDOM∗  202
12.4 Quantification of Loss Due to Discretizing Parameters in SMBO  202
12.5 Comparison of configurators for SINGLEINSTCONT scenarios with discretized parameters  204
12.6 Comparison of configurators for SINGLEINSTCAT scenarios  205
12.7 Boxplot comparison of configurators for categorical parameters on single instances  206
13.1 11 groups of SAT features; these were introduced by Nudelman et al. (2004) and Xu et al. (2008, 2009)  211
13.2 Eight groups of features for the mixed integer programming problem  212
13.3 RF model quality based on different features, for scenario SAPS-QCP  218
13.4 RF model quality based on different features, for scenario SAPS-SWGCP  219
13.5 RF model quality based on different sets of instance features  220
13.6 RF model quality based on different number of principal components  222
13.7 Predictions of runtime matrix  225
13.8 Predictions of runtime matrix, for COMPLEX scenarios with large κmax  228
13.9 RANDOM∗ with and without blocking, for scenarios with multiple and single instances  230
13.10 Performance of AC(RF) variants based on different sets of features, over time  232
13.11 Final performance of AC(RF) variants based on different sets of features  233
13.12 Performance of AC(RF), RANDOM∗ and FOCUSEDILS for BROAD scenarios  234
13.13 Performance of AC(RF), RANDOM∗ and FOCUSEDILS for BROAD scenarios  235

List of Algorithms and Procedures

5.1 PARAMILS  76
5.2 PARAMILS  76
5.3 betterN(θ1, θ2)  78
5.4 RANDOMSEARCH  79
5.5 betterFoc  82
5.6 betterFoc  82
7.1 objective  105
9.1 Sequential Model-Based Optimization (SMBO)  137
9.2 Initialize in SKO  138
9.3 ExecuteRuns  138
9.4 Initialize in SPO  138
9.5 FitModel in SKO  139
9.6 FitModel in SPO  139
9.7 SelectNewConfigurations in SKO  140
9.8 SelectNewConfigurations in SPO  140
9.9 Intensify in SKO  142
9.10 Intensify in SPO 0.3  142
9.11 Intensify in SPO 0.4  143
10.1 Intensify in SPO+  156
10.2 SelectNewParameterConfigurations in SPO+  157
10.3 Sequential Model-Based Optimization (SMBO) With a Time Budget  164
10.4 SelectNewParameterConfigurations in SPO∗  164
10.5 Intensify in SPO∗  165
11.1 Fit Random Forest  170
11.2 Prediction with Random Forest  172
12.1 SelectNewParameterConfigurations in ACTIVECONFIGURATOR for discrete configuration spaces  200
13.1 Prediction with Random Forest for Multiple Instances  217
13.2 Intensify with Blocking in ACTIVECONFIGURATOR for Multiple Instances  229
13.3 ExecuteRuns with Multiple Instances  231

List of Symbols

Symbol  Meaning
Θ  Parameter configuration space, the space of allowable parameter configurations
θ  Parameter configuration, element of Θ
θi:j  Vector of parameter configurations, [θi, θi+1, ..., θj]
θi,j  Value of the j-th parameter of parameter configuration θi
d  Dimensionality of Θ (number of parameters to be tuned)
Π  Set of problem instances
π  Problem instance, element of Π
s  Random number seed
o  Outcome of a single target algorithm run (e.g., runtime, solution cost)
R  Sequence of runs, with elements (θi, πi, si, κi, oi) as defined in Section 1.
R[i]  The i-th element of R
Rθ  Subsequence of runs using configuration θ (i.e., those runs with θi = θ)
N(θ)  Length of Rθ
cvalid(θ)  Empirical cost estimate of θ using all 〈instance, seed〉 combinations in the training set
ĉ(θ)  Empirical cost statistic over the N(θ) runs with θ: τ̂({R[i].o | θi = θ})
ĉN(θ)  Identical to ĉ(θ), emphasizing that N(θ) = N
M  Response surface model
p(x|y)  The conditional probability distribution of x given y
N(µ, Σ)  Normal distribution with mean µ and covariance matrix Σ
p(x|y) = N(x|µ, Σ)  The conditional probability distribution of x given y is a Normal distribution with mean µ and covariance matrix Σ

Glossary

Algorithm Configuration  The formally defined algorithm configuration problem; see Definition 1 on page 10
Algorithm Configuration Procedure  See Configuration Procedure
AC  ACTIVECONFIGURATOR, a model-based framework for the configuration of algorithms with discretized configuration spaces; see Section 12.3.2
Blackbox Optimization Problem  The problem of minimizing a "blackbox" function, that is, a function for which the only available information is its value at query points we can select sequentially; see Section 1.2.1
Captime  A time bound, usually denoted by κ, after which unsuccessful target algorithm runs are terminated
Cutoff time  See Captime
Conditional Parameter  A target algorithm parameter that is only defined if certain "higher-level" parameters take on specific values
Configuration  See Parameter Configuration
Configuration Procedure  An executable procedure, i.e., an algorithm, for optimizing the parameters of target algorithms
Cost Distribution  P{θ}, a distribution of costs of single runs of
target algorithm A with configuration θ, on instances π drawn from a distribution S and pseudo-random number seeds s drawn uniformly at random from a set of allowable seeds
Cost Measure  c(θ), the criterion being optimized: a statistical population parameter, τ, of the cost distribution P{θ}; for example, expected runtime or median solution cost
Cost Statistic  See Empirical Cost Statistic
Configurator  See Configuration Procedure
DACE model  A noise-free GP model introduced by Sacks et al. (1989); DACE is an acronym for “Design and Analysis of Computer Experiments”. In SPO and its variants, the DACE model is fitted on empirical cost statistics, and we use the term in this meaning
Deterministic Blackbox Optimization Problem  See Blackbox Optimization Problem
EIC  Expected Improvement Criterion, a function that quantifies how promising a parameter configuration is by using predictions of a response surface model
EGO  Efficient Global Optimization, a model-based approach for deterministic continuous blackbox optimization described by Jones et al. (1998)
Empirical Cost Statistic  ĉN(θ), an empirical statistic of the cost distribution c(θ) based on N samples
GP Model  Regression model based on a Gaussian stochastic process; see Section 9.2
Incumbent Parameter Configuration  θinc, the parameter configuration a configurator judges to be best and would return if it had to terminate. The incumbent changes over time, and we denote the incumbent at time step t as θinc(t)
MIP  The mixed integer programming problem; see Section 3.1.2
PAR  Penalized average runtime of a set of runs with a captime κ, where runs that time out at the captime are counted as a · κ; we set the penalization constant a to 10 throughout this thesis; see Section 3.4
Parameter Configuration  θ, a vector of parameter values (θ1, . . . , θd)T that uniquely identifies a single executable variant of a target algorithm
Parameter Configuration Space  Θ, the set of all parameter configurations θ a target algorithm allows; this is a subset of the Cartesian product of the domains of each parameter, Θ1 × · · · × Θd
RF model  Regression model based on a random forest; see Section 11.1
Response Surface Model  Regression model that aims to predict the cost measure, c(θ), of parameter configurations θ. In the context of multiple instances, these models can also be used to predict the cost distribution for a combination of a parameter configuration and an instance
RMSE  Root mean squared error, a measure of model quality ∈ [0, ∞) (the lower the better); RMSE is the square root of the mean squared difference between model predictions and validation costs; see Definition 14 on page 148
EIC quality  A measure of model quality ∈ [−1, 1] (the higher the better); it is the Spearman correlation coefficient between expected improvement based on model predictions and validation costs; see Definition 13 on page 148
Quality of predictive ranks  A measure of model quality ∈ [−1, 1] (the higher the better); it is the Spearman correlation coefficient between model predictions and validation costs; see Definition 12 on page 148
SAT  The propositional satisfiability problem; see Section 3.1.1
SPO  Sequential Parameter Optimization, a model-based framework for stochastic continuous blackbox optimization first described by Bartz-Beielstein et al. (2005)
SKO  Sequential Kriging Optimization, a model-based approach for stochastic continuous blackbox optimization described by Huang et al. (2006)
Stochastic Blackbox Optimization Problem  Similar to a deterministic Blackbox Optimization Problem, but the observations we make at a query point θ are samples of a cost distribution P{θ}, rather than deterministic values of a function f(θ); see Section 1.2.1
Target Algorithm  A parameterized algorithm whose parameters are to be optimized
Training Performance of a Configurator at Time t  Empirical cost estimate of the configurator’s incumbent at time t, θinc(t), across the runs the configurator performed with θinc(t)
Test Performance of a Configurator at Time t  Empirical cost estimate of the configurator’s incumbent at time t, θinc(t), for runs with θinc(t) on the test set
Training Set  Set of 〈problem instance, pseudo-random number seed〉 combinations used during configuration
Test Set  Set of 〈problem instance, pseudo-random number seed〉 combinations used for offline test purposes; disjoint from the training set
Validation Cost  cvalid(θ), empirical cost estimate of parameter configuration θ using all 〈instance, seed〉 combinations in the training set

Acknowledgements

I am deeply grateful to my three co-supervisors, Holger Hoos (my principal supervisor), Kevin Leyton-Brown, and Kevin Murphy. With their energy, creativity, and rigor, they shaped my work in many important ways, while at the same time giving me the freedom to explore my own ideas. Together, throughout my Ph.D., they made sure that I developed principled methods that walk the fine line between elegance and good performance in practice. Without their encouragement to always go a step further than what I thought was possible, I would probably have shied away from problems as hard as configuring CPLEX (which eventually became one of the two most exciting application domains in this thesis). Many thanks also to Alan Mackworth, the fourth member of my supervisory committee, for making sure I worked on interesting but feasible problems and stayed on track.
I also thank my external examiner Henry Kautz, as well as my university examiners Michael Friedlander and Lutz Lampe, for their valuable feedback. Holger and Kevin L.-B. have also helped me greatly during the writing process of this thesis. With their detailed and insightful feedback, they have improved the end result in many ways, including content, structure, narrative, grammar, and style. I gratefully acknowledge funding from various sources. Firstly, I thank the German National Academic Foundation (“Studienstiftung des deutschen Volkes”) for continuing to fund me after my M.Sc. thesis, through a one-year foreign exchange scholarship and a three-year doctoral fellowship. In this context, my sincere gratitude goes to Wolfgang Bibel, who nominated me for a fellowship in the first place and who supported me as a mentor for many years. I also received a university graduate fellowship and an R. Howard Webster Foundation fellowship from the University of British Columbia (UBC). Co-authors are acknowledged in a separate section of the frontmatter, but I would like to thank some of them in particular. I would like to thank Thomas Stützle, my M.Sc. thesis supervisor, for the continued fruitful collaboration and many insightful discussions. I would also like to express my gratitude to Youssef Hamadi for hiring me as a research intern at Microsoft Research in Cambridge, UK, in the summer of 2005. The 12 weeks spent under his guidance were very enjoyable and influenced me in many ways. Most importantly in the context of my Ph.D., they boosted my interest in automated parameter optimization and helped me decide to choose it as my dissertation topic. I am indebted to Lin Xu for the many fruitful discussions we shared, for fixing countless problems with our computer cluster and database, and for being a pleasant colleague who brought joy to many of my days.
The productive and enjoyable collaboration with Domagoj Babić has opened my eyes to the full potential of automated algorithm configuration, removing my final doubts about its eventual usefulness for practitioners. Many thanks to everyone who has helped me out in one way or another. In particular, thanks to Dave Tompkins for help with UBCSAT, to Daniel Le Berre for help with SAT4J, to Belarmino Adenso-Díaz for providing the CALIBRA system and benchmark data, and to Theodore Allen for access to the SKO source code. Many thanks to Chris Fawcett for taking over and maintaining the PARAMILS project and fixing some bugs to help external users. Also thanks to Christian Bang, my office mate during my M.Sc. thesis, who helped me automate my experiments based on Ruby scripts; over time, these scripts grew into the first version of PARAMILS. I would like to thank many other members of the laboratory for computational intelligence (LCI) and the laboratory for bioinformatics, empirical and theoretical algorithmics (β) at UBC computer science, both for their friendship and many interesting chats about our respective work. In LCI, these include Eric Brochu, Mike Chiang, Nando de Freitas, Matt Hoffman, Roman Holenstein, Emtiyaz Khan, Hendrik Kueck, and Mark Schmidt, and in β Mirela Andronescu, Chris Fawcett, Ashiqur KhudaBukhsh, Kevin Smyth, Chris Thachuk, Dave Tompkins, Dan Tulpan, and Lin Xu. I was lucky enough to spend almost three years in Green College, a residential graduate college at UBC. I thank my many Green College friends from all kinds of academic disciplines for broadening my perspective on both life and research, and for the endless fun we shared. Finally, I would like to thank my loving partner Tashia for putting up with my nights and weekends spent working and for bringing sunshine into my life when all else failed.
I am extremely grateful to her and to my family, especially my parents, my brother, and his wife, for their unconditional love, care, advice, and emotional support. I also would like to thank my family and many friends in Germany for the peaceful and enjoyable environment I can always be certain to find when I return home. I especially thank my parents for supporting my choice to follow the path that led me so far away from home, for so long.

Dedication

To my parents, Ilona and Gerald Hutter

Co-Authorship Statement

Over the past five years I have had the good fortune to work with outstanding co-authors. I am indebted to them for their many contributions to the work described in this thesis. The development of PARAMILS had a major impact on my Ph.D. thesis. An early version of PARAMILS goes back to an unpublished side project during my M.Sc. thesis (see Appendix A of Hutter, 2004), which was co-supervised by Thomas Stützle and Holger Hoos. In joint work with Holger and Thomas during my Ph.D., we improved PARAMILS and first published it at AAAI-07 (Hutter et al., 2007b). This formed the basis for Chapter 5. Subsequently, Kevin Leyton-Brown got involved in the project and, notably, contributed the idea for the adaptive capping mechanism discussed in Chapter 7; thanks also to Kevin Murphy for valuable discussions of that topic. Adaptive capping is discussed in a comprehensive journal article on PARAMILS with Holger Hoos, Kevin Leyton-Brown, and Thomas Stützle, which also introduced the application to CPLEX and has been accepted for publication at JAIR (Hutter et al., 2009c). That article contains much of the material in Chapters 5, 7, and 8. Chapter 6 is based on joint work with Domagoj Babić, Holger Hoos, and Alan Hu that appeared at FMCAD-07 (Hutter et al., 2007a).
Domagoj deserves credit for much of the research presented in that chapter; it is based on a joint case study, in which he played the role of algorithm designer while I performed the algorithm configuration. Chapter 4 is based on joint work with Holger Hoos and Kevin Leyton-Brown, which is about to be submitted for publication (Hutter et al., 2009d). Thomas Stützle provided valuable feedback on an early draft of that work. Chapters 9 and 10 are primarily based on a conference publication at GECCO-09 (Hutter et al., 2009e), co-authored by Holger Hoos, Kevin Leyton-Brown, and Kevin Murphy, as well as a book chapter with the same co-authors and Thomas Bartz-Beielstein (Hutter et al., 2009a). Chapters 11 through 13 are all based on yet-unpublished joint work with Holger Hoos, Kevin Leyton-Brown, and Kevin Murphy. A number of my publications are only indirectly reflected in this thesis. My first publications on (per-instance) parameter optimization were written jointly with Youssef Hamadi (Hutter and Hamadi, 2005), and with Youssef Hamadi, Holger Hoos, and Kevin Leyton-Brown (Hutter et al., 2006). I have not included that work prominently in this thesis since it dealt with a different (albeit very interesting) problem; it is briefly discussed in Section 14.3.5. Joint publications with Lin Xu, Holger Hoos, and Kevin Leyton-Brown on per-instance algorithm selection (Xu et al., 2007b, 2008) fall into the same category. The SAT instance features we use in Section 13.1 originated in this work.

Part I
Algorithm Configuration: The Problem

—in which we introduce and motivate the algorithm configuration problem and discuss related work

Chapter 1
Introduction

Civilization advances by extending the number of important operations which we can perform without thinking of them.
—Alfred North Whitehead, English mathematician and philosopher

Parameterized algorithms are abundant in computer science and its applications.
For many computational problems, there exists a wide array of solution approaches, and it is the task of computer scientists to identify the one that best matches the domain-specific user requirements. Typically, once a general solution approach has been chosen, there are a number of subsequent lower-level choices to be made before arriving at a complete algorithm specification. Often, some of those choices are left open; these free parameters allow users to adapt the algorithm to their particular scenario. This is particularly the case for the heuristic algorithms used for solving computationally hard problems. As an example, consider CPLEX, the most widely used commercial optimization tool for solving mixed integer programming problems.1 Its latest version, 11.2, has about 80 parameters that affect the solver’s search mechanism and can be configured by the user. Other examples of parameterized algorithms can be found in areas as diverse as sorting (Li et al., 2005), linear algebra (Whaley et al., 2001), numerical optimization (Audet and Orban, 2006), compiler optimization (Cavazos and O’Boyle, 2005), parallel computing (Brewer, 1995), computer vision (Muja and Lowe, 2009), machine learning (Maron and Moore, 1994; Kohavi and John, 1995), database query optimization (Stillger and Spiliopoulou, 1996), database server optimization (Diao et al., 2003), protein folding (Thachuk et al., 2007), formal verification (Hutter et al., 2007a), and even in areas far outside of computer science, such as water resource management (Tolson and Shoemaker, 2007).

In this thesis, we adopt a very general notion of what constitutes an algorithm parameter. This notion includes numerical parameters (e.g., the level of a real-valued threshold); ordinal parameters (e.g., low, medium, high); and categorical parameters (e.g., the choice of a heuristic), with the frequent special case of binary parameters (e.g., an algorithm component being active/inactive).
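To make this taxonomy concrete, the four parameter types can be written down as a small configuration-space declaration. The sketch below is purely illustrative: every parameter name and domain is invented for this example, not taken from CPLEX or any other solver.

```python
import random

# Hypothetical configuration space covering the four parameter types above.
# All names and domains are invented for illustration.
CONFIG_SPACE = {
    "noise":        ("numerical",   (0.0, 1.0)),                  # real-valued threshold
    "restart_freq": ("ordinal",     ["low", "medium", "high"]),   # ordered levels
    "heuristic":    ("categorical", ["h1", "h2", "h3"]),          # choice of heuristic
    "preprocess":   ("binary",      [False, True]),               # component on/off
}

def sample_configuration(space, rng):
    """Draw one parameter configuration uniformly at random from the space."""
    theta = {}
    for name, (kind, domain) in space.items():
        if kind == "numerical":
            lo, hi = domain
            theta[name] = rng.uniform(lo, hi)
        else:  # ordinal, categorical, and binary domains are finite lists
            theta[name] = rng.choice(domain)
    return theta
```

Uniform random sampling over such a space is also the natural baseline against which more sophisticated configuration procedures are compared.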
Note that categorical parameters can be used to select and combine discrete building blocks of an algorithm (e.g., preprocessing and variable ordering heuristics in a SAT solver). Consequently, our general view of algorithm configuration includes the automated construction of a heuristic algorithm from such building blocks. Thus, while other authors have referred to the optimization of an algorithm’s performance by setting its (typically few and numerical) parameters as parameter tuning, we deliberately use the term algorithm configuration to emphasize our focus on configuring algorithms with potentially many, partially categorical algorithm parameters. (We discuss related work on parameter tuning in Chapter 2.) To avoid potential confusion between algorithms whose performance is optimized and algorithms used for carrying out this optimization task, we refer to the former as target algorithms and the latter as configuration procedures. We refer to instances of the algorithm configuration problem as configuration scenarios. This setup is illustrated in Figure 1.1.

Figure 1.1: Visualization of algorithm configuration. A configuration procedure executes the target algorithm with specified parameter settings on one or more problem instances, receives information about algorithm performance, and uses this information to decide about target algorithm runs to perform subsequently. A configuration scenario includes the target algorithm to be configured and a collection of instances.

Note that we treat algorithm configuration as a black-box optimization problem: our configuration procedures execute the target algorithm on a problem instance and receive feedback about the algorithm’s performance; yet, they do not have access to any internal state of the algorithm (unless it is part of the optimization objective and encoded in the performance measure).

1 http://www.ilog.com/products/cplex/
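This black-box protocol amounts to a tiny interface: the configurator supplies a configuration, an instance, and a seed, and observes only a scalar cost. The sketch below, with invented function names and costs, shows the simplest conceivable configurator (pure random search) written against such an interface; it is a toy illustration of the interface only, not one of the procedures developed in this thesis.

```python
import random

def random_search(configs, instances, evaluate, budget, rng):
    """Toy black-box configurator: repeatedly pick a random configuration,
    measure its mean cost across the training instances via the black-box
    call evaluate(theta, instance, seed), and keep the best one seen."""
    incumbent, incumbent_cost = None, float("inf")
    for _ in range(budget):
        theta = rng.choice(configs)
        costs = [evaluate(theta, inst, rng.randrange(10**6)) for inst in instances]
        mean_cost = sum(costs) / len(costs)
        if mean_cost < incumbent_cost:       # new incumbent found
            incumbent, incumbent_cost = theta, mean_cost
    return incumbent

# Example black box with invented costs: the cost depends only on the
# configuration and instance, mimicking a deterministic target algorithm.
def fake_evaluate(theta, instance, seed):
    return abs(theta - 3) + 0.01 * instance

best = random_search(configs=[1, 2, 3, 4], instances=[0, 1],
                     evaluate=fake_evaluate, budget=50,
                     rng=random.Random(1))
```

Because the configurator sees nothing but the returned cost, swapping in a real target algorithm only requires replacing `fake_evaluate` with a wrapper that launches the algorithm and measures its performance.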
This leads to a clean interface and considerably eases the application of our methods to the configuration of new algorithms. Different variants of parameter optimization problems have been considered in the literature, including setting parameters on a per-instance basis and modifying the parameters while the algorithm is running; we defer a discussion of these approaches to Chapter 2.

Whether manual or automated, effective configuration procedures are central in the development of heuristic algorithms for many classes of problems, particularly for NP-hard problems. Algorithms with state-of-the-art empirical performance for such problems tend to be highly heuristic, and their behaviour and efficiency depend strongly and in complex ways on their parameters. They often combine a multitude of approaches and feature a correspondingly large and structured parameter space (see, e.g., the aforementioned solver CPLEX or many state-of-the-art solvers for SAT). In this area, algorithm configuration is crucial, as the runtimes of weak and strong algorithms regularly differ by several orders of magnitude on the same problem instances. Existing theoretical techniques are typically not powerful enough to determine whether one parameter configuration is superior to another, and so the algorithm designer typically relies on manual empirical evaluation to determine the best configuration. Hence, in this thesis, we propose automated methods for algorithm configuration. We hope that such methods will spare researchers and practitioners most of the countless (and often futile) hours spent manually experimenting with different heuristics and “tweaking parameters”. As we will demonstrate in Chapters 6 and 8, these automated configuration procedures have already improved the performance of state-of-the-art algorithms for a variety of problems.
The concepts and approaches we introduce in this thesis are not restricted to the configuration of algorithms for solving hard combinatorial problems. On the contrary, the issue of setting an algorithm’s parameters to achieve good performance is almost omnipresent in algorithmic problem solving. There already exists a host of domains in which automated configuration procedures have been applied successfully—we review these in Chapter 2. However, most existing work has applied methods geared specifically towards particular applications. We believe that more general, freely-available algorithm configuration procedures could benefit research in most areas of computer science.

In the remainder of this chapter, we motivate the algorithm configuration problem further, define the problem formally, summarize the main contributions of our work, and outline the remainder of the thesis.

1.1 Motivation for Automated Algorithm Configuration

Our main motivation for developing generally-applicable automated procedures for algorithm configuration is (1) the potential to replace the most tedious and unrewarding part of traditional algorithm design with an automated process that leads to better results in a fraction of the time. In particular, automated configuration procedures pave the way for a paradigm of automated algorithm design from components. Automated algorithm configuration can also be used to (2) replace its manual counterpart on the end-user side of a complex algorithm, and (3) enable fairer comparisons between algorithms. In order to motivate the automated configuration of such algorithms, we first describe the traditional, manual approach to algorithm design (see also Hoos, 2008). We then show how this process can be improved by means of automated algorithm configuration.
1.1.1 Traditional, Manual Algorithm Design

State-of-the-art algorithms for solving computationally hard problems are traditionally constructed in an iterative, manual process in which the designer gradually introduces or modifies components or mechanisms. The performance resulting from these modifications is then evaluated empirically, typically based on rather small-scale experiments (small and fairly easy sets of benchmark problems and comparably small cutoff times are typically used to limit the time required). Based on these experiments, some of the degrees of freedom—typically those the algorithm designer is most uncertain about—are exposed as parameters, while most others are hard-coded and usually never re-considered for modification.

When the iterative design process continues for an extended period of time, the considered space of potential algorithm designs can rapidly grow unwieldy. This is particularly the case when the number of exposed parameters is not pruned to remain small throughout the process. An extreme example of this is the commercial optimization tool CPLEX, which—during its 20-year history—has exposed more and more parameters, leading to over 80 user-controllable parameters in version 10.1. Identifying the best design choices in such an immense design space requires substantial knowledge about the heuristic algorithm components and their interactions, which is far beyond what can be expected from a typical user. In fact, even the algorithm designers themselves often cannot solve this complex problem very well (Hutter et al., 2007a). During the development of a typical high-performance algorithm there is only time to consider a few different algorithm designs (i.e., combinations of parameter settings). This is because even small-scale empirical evaluations require a considerable amount of time, and experiments generally have to be launched manually.
Instead of rigidly following an experimental design strategy (e.g., as outlined in Crary and Spera, 1996), the designs to be evaluated are typically chosen in an ad hoc fashion based on the intuition of the algorithm designer whenever a new feature is implemented. In particular, single design choices are often evaluated in isolation, with all other choices fixed, implicitly—and falsely—assuming that design choices do not interact. Such experiments often lead algorithm designers to make unjustified generalizations, such as “algorithms using component A1 perform better than algorithms using component A2”. If these components have only been empirically compared based on fixed instantiations of other design choices (e.g., B2, C1, D3), then all that can be claimed is that, for the particular problem set used, A1 performs better in combination with the remaining design choices (B2, C1, D3) than does A2. Since design choices often interact in complex and nonintuitive ways that are hard to grasp by human designers, the manual process is likely to miss the best combination of design choices (e.g., A2, B2, C3, D2).

In contrast to humans, computer-aided mechanisms tend to perform well at such complex high-dimensional optimization tasks with many interdependent decisions. This very reason has brought success to the fields of operations research and constraint programming; the ability to efficiently handle complex high-dimensional data has also led to some of the most impressive success stories in artificial intelligence. As an analogy, consider Sudoku puzzles. While humans can in principle solve these, computer algorithms can do so much more efficiently. While Sudoku is a fun exercise for the brain, countless days of experimenting with different heuristics and parameter settings are not; quite the contrary, such experimentation holds researchers (often graduate students) back from spending their time on more meaningful problems (or at the beach, for that matter).
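The danger of evaluating choices in isolation can be made concrete with a toy cost table in which two binary design choices interact; all numbers are invented purely for illustration. Tuning A first with B fixed, then B with the chosen A, settles on a suboptimal design, while exhaustive search over the four combinations finds the true optimum.

```python
# Invented costs (lower is better) for two interacting binary design choices.
COST = {
    ("A1", "B1"): 10, ("A1", "B2"): 7,
    ("A2", "B1"): 11, ("A2", "B2"): 3,   # A2 only pays off together with B2
}

def one_factor_at_a_time(start=("A1", "B1")):
    """The manual strategy criticized above: vary one choice with the other fixed."""
    a, b = start
    a = min(("A1", "A2"), key=lambda x: COST[(x, b)])  # tune A with B fixed at B1
    b = min(("B1", "B2"), key=lambda x: COST[(a, x)])  # then tune B with the chosen A
    return a, b

ofat_design = one_factor_at_a_time()   # settles on ("A1", "B2"), cost 7
best_design = min(COST, key=COST.get)  # true optimum ("A2", "B2"), cost 3
```

With B fixed at B1, component A2 looks worse than A1 (cost 11 vs. 10), so the one-factor-at-a-time search never discovers the interaction that makes (A2, B2) the best design; search procedures that explore combinations of choices do not share this blind spot.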
More importantly, this manual experimentation is unlikely to fully realize the potential of a given algorithm. For this reason, we advocate the use of automated algorithm configuration tools whenever possible; in the next section, we outline a number of potential uses.

1.1.2 Automated Algorithm Design from Components

Efficient approaches for automated algorithm configuration pave the way for a paradigm of automated algorithm design from components. In this paradigm, the algorithm designer implements a high-level parameterized framework with interchangeable algorithm components, as well as a set of high-performance (parameterized) algorithm components that can instantiate the various parts of the framework. Then, he chooses a type of problem instances and a performance metric to be minimized (e.g., cost, operational risk, or undesirable impacts on the environment) and applies an automated algorithm configuration procedure to search for the framework instantiation that empirically optimizes the metric on the chosen type of instances. This semi-automated process has the advantage of a clean separation between high-level cognitive tasks performed by human designers and end users on the one hand, and tedious low-level tasks that can be left to machines2 on the other hand. High-level cognitive tasks include the conceptual design of the algorithm framework and the specification of which algorithm components should be considered (and which lower-level parameters should be exposed for tuning). Together, these high-level design choices comprise the design space that is later searched by automated procedures. Further high-level choices concern the selection of a problem distribution and a performance metric of interest. These can in fact be left to the end user: she simply calls an automated configuration procedure to search the design space (defined by the algorithm designer) in order to generate an algorithm optimizing the objective she specified.
If her objective or instance distribution ever changes, she can simply repeat the automated procedure with a modified objective to generate an algorithm customized for the new problem. Note that this process of automated algorithm design from components has in fact already been demonstrated to yield new state-of-the-art algorithms. In particular, we automatically constructed different instantiations of the SPEAR algorithm and thereby substantially improved the state of the art for two sets of SAT-encoded industrial verification problems (Hutter et al., 2007a). The SATENSTEIN framework of local search algorithms for SAT by KhudaBukhsh et al. (2009) took this process to the extreme, combining a multitude of components from various existing local search algorithms. Both SPEAR and SATENSTEIN were configured using ParamILS, one of the automated configuration procedures we introduce in this thesis (see Chapter 5). We provide full details on the configuration of SPEAR in Chapter 6 and review the SATENSTEIN application in Section 8.3.1. We hope that automated algorithm configuration procedures will become a mainstream technique in the design of algorithms; they have the potential to significantly speed up and improve this process.

2 For the time being, these tasks are addressed by human-designed automated approaches carried out on machines.

1.1.3 Practical End Use of Algorithms

The end use of existing state-of-the-art algorithms is often complicated by the fact that these algorithms are highly parameterized. This is aggravated by the fact that the ability of complex heuristic algorithms to solve large and hard problem instances often depends critically on the use of parameter settings suitable for the particular type of problem instances. End users have two options: to use the default parameter configuration and hope for the best, or to invest time in exploring the space of possible parameter configurations.
End users often opt to use the default algorithm configuration, since they typically have little or no knowledge about the algorithm's parameter configuration space and little time to explore alternative parameter configurations. However, even if it has been carefully optimized on a standard benchmark set, a default configuration typically does not perform as well on the particular problem instances encountered by a user as a parameter configuration developed specifically with that type of instances in mind. In situations where the default configuration does not yield satisfactory performance, end users are forced to experiment with different parameters. For example, the ILOG CPLEX manual states this explicitly: "Integer programming problems are more sensitive to specific parameter settings, so you may need to experiment with them." (ILOG CPLEX 10.0 user manual, page 130) Such experimentation requires expertise in the heuristic methods the algorithm is based on and experience in their empirical evaluation, neither of which is necessarily available on the end user's side. Most importantly, it is simply very time-consuming. As a consequence, shortcuts are often taken and suboptimal parameter settings are used, which can result in solutions far worse than necessary. This leads to an opportunity cost in the metric being minimized.

1.1.4 Scientific Studies of Algorithm Performance

A central question in empirical comparisons of algorithms is whether one algorithm outperforms another one because it is fundamentally superior, or because its developers more successfully optimized its parameters (Johnson, 2002). In principle, similar efforts should be exerted for configuring (or tuning) all algorithms participating in a "horse-race" comparison. In practice, however, algorithm designers are most familiar with their own algorithm, say A, and often invest considerable manual effort to tune it.
Even if a competitor algorithm, B, exposes a set of parameters, and even if the designers of A invest the same amount of effort in tuning B as they did for A (both of which are desirable but not always true in practice), they can still be expected to do better at tuning A. This is because they are more familiar with A and have a better intuition about the importance of each of its parameters and about good ranges for continuous parameters. In contrast, using a generally-applicable configuration procedure can mitigate this problem of unfair tuning and thus facilitate more meaningful comparative studies. One would simply run an automated configurator for a prespecified amount of time for each algorithm of interest and then compare the resulting instantiations of these algorithms against each other. In contrast to the manual configuration of each competitor algorithm, the use of an automated configuration method is also a well-defined and repeatable process. However, we note that absolute fairness might be hard to achieve. One challenge is that the time allowed for configuration can have a strong influence on the outcome of comparisons. For example, think of a comparison between a robust, (basically) parameter-less algorithm A and a heavily parameterized algorithm B that strongly depends on a configurator to instantiate it for a given domain. It can be expected that for short configuration times A outperforms B, whereas, given a long enough configuration time, a specialized configuration of B outperforms A. Of course, different configuration procedures might also do better at configuring different algorithms. A further potential use of algorithm configuration procedures is for identifying which types of parameter configurations do well for which types of instances. The model-based approaches we investigate in Part IV of this thesis show particular promise for linking the performance of algorithm components to characteristics of the instance at hand.
Such insights may lead to new scientific discoveries and further improve algorithm performance for important subclasses of problems. They may also guide the (human part of the) algorithm design process. On the theoretical side, the study of algorithm behaviour with different parameter configurations can yield important insights that may help characterize which core components of an algorithm are effective for solving various kinds of problem instances. Similar to lab experiments in the natural sciences, empirical studies can help build intuition and inspire new theoretical results, such as probabilistic algorithm guarantees (Hoos, 1999b). Furthermore, note that it is entirely possible to automatically configure algorithms with certain theoretical performance guarantees. This can be achieved by defining the parameter configuration space such that all allowed configurations have these guarantees (Hoos, 2008). Finally, if an automated method could be devised to check whether a certain theoretical property holds for a given configuration (or which out of a set of properties hold), the same methods we discuss in this thesis could be used to search for a configuration with the desirable properties.

1.2 Problem Definitions and Notation

The central topic of this thesis is the algorithm configuration problem. This problem can be informally stated as follows: given an algorithm, a set of parameters for the algorithm, and a set of input data, find parameter values under which the algorithm achieves the best possible performance on the input data. Before we define this problem more formally, we first introduce some notation and relate the problem to standard blackbox function optimization problems. Let A denote an algorithm, and let θ_1, ..., θ_k be parameters of A.
We denote the domain of possible values for each parameter θ_i as Θ_i; these domains can be infinite (as for continuous and unbounded integer parameters), finite and ordered (ordinal), or finite and unordered (categorical). We use Θ ⊆ Θ_1 × ... × Θ_k to denote the space of all feasible parameter configurations, and A(θ) to denote the instantiation of algorithm A with parameter configuration θ ∈ Θ. Let D denote a probability distribution over a space Π of problem instances, and denote an element of Π, i.e., an individual problem instance, as π. D may be given implicitly, as through a random instance generator or a distribution over such generators. It is also possible (and indeed common) for Π to consist of a finite sample of instances. In this case, we define D as the uniform distribution over Π.

1.2.1 Blackbox Optimization

Algorithm configuration can be seen as a special type of BlackBox Optimization (BBO) problem. Deterministic BBO problems are problems of the form min_{θ∈Θ} f(θ), where f : Θ → R is a "blackbox function": we can query f at arbitrary inputs θ ∈ Θ, and the only information available about f is its function values at these queried points. In typical BBO problems, the optimization domain is continuous, that is, Θ = R^d. In stochastic BBO problems, the (deterministic) function f is replaced with a stochastic process {F_θ | θ ∈ Θ}, a collection of random variables indexed by θ ∈ Θ. The goal in stochastic BBO is to find the configuration θ ∈ Θ that optimizes a given statistical parameter, τ (e.g., the expected value or median), of F_θ's distribution. Denoting the distribution of F_θ as P_θ, stochastic BBO problems are thus of the form min_{θ∈Θ} τ(P_θ). For example, in algorithm configuration, P_θ might be the runtime distribution of a randomized algorithm with parameter configuration θ, and τ the expected value.
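This stochastic BBO formulation, min over θ of τ(P_θ), can be sketched in a few lines of Python. Everything below (the quadratic toy cost, the noise model, the candidate grid) is an illustrative assumption, not part of the formal development:

```python
import random
import statistics

def run_target(theta, seed):
    """Hypothetical stochastic blackbox: one noisy cost observation
    (a sample of F_theta) for configuration theta."""
    rng = random.Random(seed)
    # deterministic "true" cost plus run-to-run noise
    return (theta - 0.3) ** 2 + rng.gauss(0.0, 0.05)

def estimate_tau(theta, n_runs=100):
    """Estimate the statistical parameter tau of F_theta's
    distribution (here: the expected value) from repeated samples."""
    return statistics.fmean(run_target(theta, seed) for seed in range(n_runs))

# Stochastic BBO over a small candidate grid: pick the configuration
# that minimizes the estimated mean cost.
candidates = [i / 10 for i in range(11)]
best = min(candidates, key=estimate_tau)
```

Because the same seeds are reused for every candidate, the noise terms cancel in the comparison; as discussed below, this kind of structure is precisely what the algorithm configuration setting exposes and plain BBO does not.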
We refer to P_θ as the configuration's cost distribution, to observed samples from that distribution as single observed costs, o_i, and to the statistical parameter, τ, of that cost distribution as the overall cost, c(θ): c(θ) := τ(P_θ). Since objectives differ between applications, we leave the definition of the overall cost, c(θ), up to the user. For example, we might aim to minimize expected runtime or median solution cost. In this thesis, we predominantly minimize mean runtime (penalizing timed-out runs as discussed in Section 3.4).

1.2.2 Algorithm Configuration

Algorithm configuration for deterministic algorithms and single instances can be seen as a deterministic BBO problem, whereas the configuration of randomized algorithms or the configuration on distributions with infinite support Π would be a stochastic BBO problem. Since algorithm parameters are often discrete, in algorithm configuration we usually have Θ ≠ R^d. Another difference between algorithm configuration and (standard) BBO problems lies in the fact that the cost distribution, P_θ, of a parameter configuration, θ, contains structure. For a fixed combination of a parameter configuration, θ, an instance, π, and a pseudorandom number seed, s, the observed cost, o(θ, π, s), is deterministic. (This relies on the fact that randomized algorithms typically take a seed for a pseudorandom number generator as an input. Note that this is an artifact of the way random decisions are implemented in current algorithms. We can exploit this artifact by making informed decisions about which seeds to employ, as for example in blocking, but our approaches also work for target algorithms that do not rely on seeds to implement randomness.) The distribution P_θ is induced by sampling instances π from distribution D and seeds from a uniform distribution, S, over allowable seeds. The structure in this distribution can, indeed, be exploited. For example, knowing the runtime of configuration θ_1 on an instance can yield important information on the runtime of configuration θ_2 on the same instance. This is even the case for random seeds.
Think of two configurations θ_1 and θ_2 that only differ in a parameter the algorithm never uses; given o(θ_1, π, s_1), we perfectly know o(θ_2, π, s_1), but not o(θ_2, π, s_2). We can, for example, exploit such dependence by performing blocked comparisons (see Section 3.6.1). Finally, we have an additional degree of freedom in algorithm configuration: at which cutoff time, κ, to terminate a run that is unsuccessful within time κ. In standard BBO problems, each function evaluation is assumed to take the same amount of time. In contrast, in algorithm configuration, runtimes with different configurations can vary substantially. For example, when minimizing algorithm runtime to solve a given problem, runtimes can differ by many orders of magnitude across parameter configurations. In order to save time spent in long runs for poor parameter configurations, we can limit the time for each run by a cutoff time, κ. With this intuition in mind, we now define the algorithm configuration problem formally.

Definition 1 (Algorithm Configuration Problem). An instance of the algorithm configuration problem is a 6-tuple 〈A, Θ, D, κ_max, o, τ〉, where:

• A is a parameterized algorithm;
• Θ is the parameter configuration space of A;
• D is a distribution over problem instances with some domain Π;
• κ_max is a cutoff time (or captime), after which each run of A will be terminated if still running;
• o is a function that measures the observed cost of running A(θ) on instance π ∈ Π with captime κ ∈ R; and
• τ is a statistical population parameter to be optimized.

Examples for o are runtime for solving the instance, or quality of the solution found within the captime; in the former case, o must also define a cost for runs that do not complete within the captime.
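The blocked comparisons mentioned above (detailed in Section 3.6.1) can be illustrated with a small sketch: running two configurations on the same (instance, seed) pairs and averaging per-pair cost differences cancels per-instance hardness, which would otherwise dominate an unpaired comparison. The cost values below are made up for illustration:

```python
def blocked_comparison(costs_1, costs_2):
    """Compare two configurations run on the SAME (instance, seed)
    pairs: average the per-pair cost differences rather than comparing
    the two unpaired means, so per-instance hardness cancels out."""
    diffs = [o1 - o2 for o1, o2 in zip(costs_1, costs_2)]
    return sum(diffs) / len(diffs)

# Instance hardness dominates the raw costs (1s, 10s, 100s instances),
# but the paired differences reveal that configuration 2 is
# consistently 0.1s faster than configuration 1:
c1 = [1.1, 10.1, 100.1]
c2 = [1.0, 10.0, 100.0]
mean_diff = blocked_comparison(c1, c2)
```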
Throughout most of this thesis, we aim to minimize the penalized average runtime of an algorithm on a distribution of instances, where timeouts after κ_max are counted as a · κ_max with a ≥ 1, and τ is the expected value (see Section 3.4 for a discussion of that choice of optimization objective). Our problem formulation also allows us to express conditional parameter dependencies. For example, one algorithm parameter, θ_1, might be used to select among several search components, with component i's behaviour controlled by further parameters. The values of those further parameters are irrelevant unless θ_1 = i; we thus call such parameters conditional on the higher-level parameter θ_1. The configuration framework we introduce in Part III of this thesis exploits this and effectively searches the space of equivalence classes in parameter configuration space. In addition, our formulation supports the specification of infeasible combinations of parameter values, which are excluded from Θ. Any parameter configuration θ ∈ Θ is a candidate solution of the algorithm configuration problem. An algorithm configuration procedure (short: configuration procedure or configurator) is a procedure for solving the algorithm configuration problem. For each configuration, θ, P_θ denotes the cost distribution induced by function o, applied to instances π drawn from distribution D and multiple independent runs for randomized algorithms, using captime κ_max. As in the case of stochastic BBO problems, the cost of a candidate solution θ is defined as

c(θ) := τ(P_θ), (1.1)

the statistical population parameter, τ, of the cost distribution. An optimal solution, θ∗, minimizes c(θ), i.e.,

θ∗ ∈ argmin_{θ∈Θ} c(θ). (1.2)
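The penalized average runtime objective described at the start of this section can be sketched as a small helper (a hypothetical implementation for illustration; the thesis's own definition is in Section 3.4, and a = 10 is used here only as an example penalty factor):

```python
def penalized_average_runtime(runtimes, kappa_max, a=10):
    """Penalized average runtime (PAR-a): runs that hit the cutoff
    kappa_max are counted as a * kappa_max instead of their observed
    runtime. A runtime of None marks a timed-out run."""
    penalized = [a * kappa_max if t is None or t >= kappa_max else t
                 for t in runtimes]
    return sum(penalized) / len(penalized)

# Three successful runs and one timeout at a 5-second cutoff:
# (0.2 + 1.0 + 4.5 + 10*5.0) / 4 = 13.925
score = penalized_average_runtime([0.2, 1.0, 4.5, None], kappa_max=5.0)
```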
We cannot optimize c directly, since we cannot write this function analytically. (In some special cases, writing the function may actually be possible; here, we use a blackbox formalization for generality.) Instead, we must execute a sequence of runs R of the target algorithm A with different parameter configurations, derive empirical estimates of c's values at particular points in configuration space, and use them to identify a configuration with low cost. We denote the sequence of runs executed by a configurator as R = ((θ_1, π_1, s_1, κ_1, o_1), ..., (θ_n, π_n, s_n, κ_n, o_n)). The ith run is described by five values:

• θ_i ∈ Θ denotes the parameter configuration being evaluated;
• π_i ∈ Π denotes the instance on which the algorithm is run;
• s_i denotes the random number seed used in the run (a constant for algorithms that do not accept seeds);
• κ_i ∈ R denotes the run's captime; and
• o_i denotes the observed cost of the run.

Note that each of θ, π, s, κ, and o can vary from one element of R to the next, regardless of whether or not other elements are held constant. This is in particular the case for κ: we are free to terminate algorithm runs after any captime κ ≤ κ_max, but the eventual cost of a configuration is always computed using the "full" captime, κ_max. Also note that R is typically constructed sequentially. We denote the ith run of R as R[i], and the subsequence of runs using parameter configuration θ (i.e., those runs with θ_i = θ) as R_θ. We use such sequences of runs, R, in order to estimate the cost, c(θ), of a parameter configuration θ, both online, during the runtime of a configurator, and offline, for evaluation purposes. We now introduce the notion of an empirical cost estimate.

Definition 2 (Empirical Cost Estimate). Given an algorithm configuration problem instance 〈A, Θ, D, κ_max, o, τ〉, an empirical cost estimate of cost c(θ) based on a sequence of runs R = ((θ_1, π_1, s_1, κ_1, o_1), ..., (θ_n, π_n, s_n, κ_n, o_n)) is defined as ĉ(θ, R) := τ̂({o_i | θ_i = θ}), where τ̂ is the sample statistic analogous to the statistical population parameter τ. For example, when c(θ) is mean runtime over a distribution of instances and random number seeds, ĉ(θ, R) is the sample mean runtime of runs R_θ. We often omit R for brevity of notation and write ĉ_N(θ) to emphasize that estimates are based on N runs, that is, |R_θ| = N. All configuration procedures studied in this thesis are anytime algorithms in the sense that at all times they keep track of the configuration currently believed to have the lowest cost. We refer to this configuration as the incumbent configuration or, for short, the incumbent, θ_inc. We evaluate a configurator's performance at time t by means of its incumbent's training and test performance, defined as follows.

Definition 3 (Training Performance). When at some time t a configurator has performed a sequence of runs R = ((θ_1, π_1, s_1, κ_1, o_1), ..., (θ_t, π_t, s_t, κ_t, o_t)) to solve an algorithm configuration problem instance 〈A, Θ, D, κ_max, o, τ〉, and has thereby found an incumbent configuration θ_inc(t), then its training performance at time t is defined as the empirical cost estimate ĉ(θ_inc(t), R). We denote training performance at time t as p_{t,train}. The set of instances {π_1, ..., π_t} discussed above is called the training set. While the true cost of a parameter configuration cannot be computed exactly, it can be estimated using training performance. However, the training performance of a configurator is a biased estimator of its incumbent's true cost, because the same instances are used for selecting the incumbent as for evaluating it (see, e.g., Birattari, 2005). In order to achieve unbiased estimates during offline evaluation, we set aside a fixed set of instances {π′_1, ..., π′_T} (called the test set) and random seeds {s′_1, ..., s′_T}, both unknown to the configurator, and use these for evaluation.
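Definition 2 can be made concrete with a short sketch; the run records below follow the (θ, π, s, κ, o) notation of the run sequence R, with made-up values:

```python
from statistics import fmean

# Run records (theta, pi, s, kappa, o): configuration, instance,
# seed, captime, observed cost -- illustrative values only.
runs = [
    ("theta_1", "inst_a", 0, 5.0, 1.2),
    ("theta_2", "inst_a", 0, 5.0, 3.4),
    ("theta_1", "inst_b", 1, 5.0, 0.8),
    ("theta_1", "inst_c", 2, 5.0, 2.5),
]

def empirical_cost(theta, runs, tau_hat=fmean):
    """c_hat(theta, R): the sample statistic tau_hat applied to the
    observed costs o_i of all runs with theta_i == theta."""
    return tau_hat([o for (t, pi, s, kappa, o) in runs if t == theta])

# Sample mean runtime of the three runs with theta_1:
# (1.2 + 0.8 + 2.5) / 3 = 1.5
cost = empirical_cost("theta_1", runs)
```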
Note that test set and training set are disjoint.

Definition 4 (Test Performance). At some time t, let a configurator's incumbent for an algorithm configuration problem instance 〈A, Θ, D, κ_max, o, τ〉 be θ_inc(t) (this is found by means of executing a sequence of runs on the training set). Furthermore, let R_test = ((θ_inc(t), π′_1, s′_1, κ_max, o_1), ..., (θ_inc(t), π′_T, s′_T, κ_max, o_T)) be a sequence of runs on the T instances and random number seeds in the test set (which is performed offline for evaluation purposes); then the configurator's test performance at time t is defined as the empirical cost estimate ĉ(θ_inc(t), R_test). We denote test performance at time t as p_{t,test}, or simply as p_t. Finally, we evaluate predictions, μ̂(θ), of a parameter configuration's cost measure, c(θ), by comparing them to θ's empirical cost estimate on the entire training set. Since this estimate is computed offline for validation purposes, we refer to it as validation cost, c_valid(θ).

Definition 5 (Validation Cost). For an algorithm configuration problem instance 〈A, Θ, D, κ_max, o, τ〉 and a parameter configuration, θ ∈ Θ, let R_valid = ((θ, π′_1, s′_1, κ_max, o_1), ..., (θ, π′_T, s′_T, κ_max, o_T)) be a sequence of runs on the T instances and random number seeds in the training set (which is performed offline for evaluation purposes). Then, the validation cost, c_valid(θ), of configuration θ is defined as the empirical cost estimate ĉ(θ, R_valid).

To provide more intuition for the algorithm configuration problem, we visualize the joint space of configurations and instances faced by configuration procedures.
For finite configuration spaces, Θ, and training sets, Π, we can imagine the algorithm configuration problem as a matrix of size |Θ| × |Π|, where entry (i, j) contains the (deterministic) cost (in this thesis typically runtime) of A(θ_i) using seed s_j on instance π_j.

[Figure 1.2: Visualization of algorithm configuration: matrix of runtimes for M = 1 000 configurations and all training instances, Π, for (a) CPLEX-REGIONS100 and (b) SPEAR-IBM. Each dot in the matrix represents the runtime of a configuration on a single instance: darker dots represent shorter runtimes; note that the scale is logarithmic with base 10. Configurations are sorted by penalized average runtime (PAR, see Section 3.4) on the training set; instances are sorted by hardness (mean runtime of the |Θ| configurations, analogously to PAR counting runs that timed out at captime κ_max as 10 · κ_max).]

For visualization purposes, we randomly sample a subset of parameter configurations θ ∈ Θ and sort them by their empirical cost estimate ĉ_{|Π|} across the training set. Likewise, we sort instances by their hardness across configurations. Figure 1.2 shows two configuration scenarios; the target algorithms and instance sets used in these scenarios are described in detail in Chapter 3. Briefly, in CPLEX-REGIONS100 (see Figure 1.2(a)) we optimize CPLEX for a homogeneous set of MIP-encoded winner determination problems in combinatorial auctions, and in SPEAR-IBM (see Figure 1.2(b)), we optimize the tree search SAT solver SPEAR for a set of industrial bounded model-checking instances. This visualization provides some intuition about how different actual instances of the algorithm configuration problem can be.
For example, in scenario CPLEX-REGIONS100, the difference between the worst and the best configurations was much higher than for SPEAR-IBM: the worst sampled configuration did not solve any instance within κ_max = 5 seconds, whereas the best configuration solved all of them. In contrast, in scenario SPEAR-IBM every sampled configuration solved the easiest 50% of the instances and no configuration solved the hardest 30%. Intuitively, in order to find a configuration that minimizes c(θ), algorithm configuration procedures sequentially execute runs of the target algorithm, thereby filling in entries of the runtime matrix (only filling in lower bounds in cases where κ is chosen < κ_max and the run times out). It is not obvious how an automatic algorithm configurator should choose which runs to perform. Filling in the whole matrix is computationally prohibitive: gathering the data for the visualization in Figures 1.2(a) and 1.2(b) took 1.5 and 2.5 CPU months, respectively. This is despite the fact that we only used a small fraction of 1 000 randomly-sampled configurations from a very large configuration space, Θ (|Θ| = 1.38 · 10^37 for CPLEX and |Θ| = 8.34 · 10^17 for SPEAR). In order to best minimize c(θ) within a given time budget, we thus have to make the following choices:

1. Which sequential search strategy should be used in order to select a sequence of parameter configurations ~Θ to be evaluated?
2. Which problem instances Π_θ′ ⊆ Π should be used for evaluating each element of ~Θ, and how many runs should be performed for it?
3. Which cutoff time κ should be used for each run?

In combination, these choices define the design space for algorithm configuration procedures. We therefore refer to them as the dimensions of algorithm configuration. It is not straightforward to make the right choices for these dimensions; in essence, the topic of this thesis is to determine choices that result in effective algorithm configuration procedures.
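The three dimensions above can be viewed as plug-in strategies in a generic configurator loop. The following skeleton is a simplification for illustration only (the toy run_target and the lambda strategies are assumptions, not any procedure from this thesis):

```python
import random

def run_target(theta, pi, kappa):
    """Toy stand-in for a capped run of A(theta) on instance pi:
    returns an observed cost, truncated at the cutoff kappa."""
    return min(kappa, 0.1 + 0.9 * theta + 0.01 * pi)

def configure(choose_config, choose_instances, choose_cutoff, budget):
    """Generic configurator loop with the three dimensions of
    algorithm configuration supplied as strategy arguments."""
    runs = []                                       # run history R
    time_used = 0.0
    while time_used < budget:
        theta = choose_config(runs)                 # dimension 1: which configuration to try
        for pi in choose_instances(theta, runs):    # dimension 2: which instances / how many runs
            kappa = choose_cutoff(theta, pi, runs)  # dimension 3: per-run cutoff time
            cost = run_target(theta, pi, kappa)
            runs.append((theta, pi, kappa, cost))
            time_used += cost
            if time_used >= budget:
                break
    return runs

# Toy strategies: a random configuration, a fixed instance set, a fixed cutoff.
rng = random.Random(0)
history = configure(
    choose_config=lambda runs: rng.choice([0.0, 0.5, 1.0]),
    choose_instances=lambda theta, runs: [0, 1, 2],
    choose_cutoff=lambda theta, pi, runs: 5.0,
    budget=3.0,
)
```

Every concrete procedure in this thesis can be thought of as a particular (and far more sophisticated) instantiation of these three strategy slots.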
As we will show throughout, using our configuration procedures we found much better configurations than with a simple random sampling approach, in a fraction of the runtime.

1.3 Summary of Contributions

The main contribution of this thesis is a comprehensive study of the algorithm configuration problem. This includes an empirical analysis approach to study the characteristics of configuration scenarios, two fundamentally different search frameworks to instantiate the first dimension of algorithm configuration, various adaptive mechanisms for the second and third dimensions, and a demonstration of algorithm configuration's practical relevance. Here, we describe these in somewhat more detail.

• We introduce and experimentally compare two fundamentally different frameworks (model-free and model-based search) for the first dimension of algorithm configuration, the search strategy (Parts III and IV). To the best of our knowledge, our model-free search mechanism is the first that goes beyond local optima. Our model-based search techniques significantly outperform existing techniques and are substantially more widely applicable.

• We introduce the first algorithm configuration procedures capable of configuring algorithms with many categorical parameters, such as CPLEX (63 parameters), SPEAR (26 parameters) and SATENSTEIN (41 parameters) (Chapters 5 and 12). Our first method, PARAMILS (introduced in Chapter 5), is already in use by other research groups to configure their algorithms for solving hard combinatorial problems.

• We demonstrate the practical relevance of automated algorithm configuration by using it to substantially advance the state of the art for solving important types of problem instances. In particular, we configure the tree search SAT algorithm SPEAR, yielding 4.5-fold and over 500-fold speedups for SAT-encoded industrial hardware and software verification, respectively (Chapter 6).
This amounts to a substantial improvement of the state of the art for solving these types of instances. Based on the automated configuration of the commercial optimization tool CPLEX, we also achieve up to 23-fold speedups of the state of the art for MIP-encoded problem distributions (Section 8.1). Finally, by configuring GLS+, we improve the state of the art for solving the MPE problem in certain types of Bayesian networks by more than a factor of 360 (Section 5.4).

• We empirically analyze the tradeoff between the second and third dimensions of algorithm configuration—how many runs to perform for evaluating each configuration and how to set the per-run captime (Chapter 4). We demonstrate that there exists no optimal fixed choice. Thus, we introduce data-driven approaches for making these choices adaptively. We present three new methods for adaptively selecting the number of runs to perform for each configuration (Sections 5.3, 10.3.1, and 13.6.1). We also introduce a general method for adaptively setting the captime—to the best of our knowledge, the first of its kind (Chapter 7).

• Overall, we offer a thorough and systematic study of algorithm configuration and provide reliable and scalable configuration procedures that can strongly improve algorithm performance with minimal manual user effort.

A further contribution stems from the improvements and extensions of model-based sequential optimization procedures.

• We substantially extend the applicability of model-based search techniques to handle categorical parameters (Chapter 12) and optimization objectives defined across multiple problem instances (Chapter 13). Furthermore—in contrast to previous methods—our techniques scale to configuration scenarios that require tens of thousands of target algorithm runs (Sections 11.3 and 11.4) and smoothly handle non-Gaussian, non-stationary observation noise (Section 11.2.1).
1.4 How to Read this Thesis

This thesis provides a comprehensive account of algorithm configuration, organized in a hierarchical fashion (see the next section for the outline). The interested but time-constrained reader can find chapter summaries at the end of all chapters (except the last) that provide a high-level overview of the material and are designed to be self-contained. Chapters 2 (Related Work) and 3 (Configuration Scenarios) can largely serve as reference chapters, while Chapter 4 builds intuition that is likely of more immediate benefit. Parts III (Chapters 5-8) and IV (Chapters 9-13) are largely decoupled. Part III is likely of more immediate interest to practitioners looking to use algorithm configuration: it covers a simple algorithm configuration framework and its various existing practical applications. In contrast, Part IV is likely of more interest to researchers with a background in blackbox optimization or in statistics/machine learning. It builds on sequential experimental design methods from the statistics literature, improving and extending them to create a new framework for general algorithm configuration based on response surface models. Chapter 14 (Conclusion) puts these two different frameworks into perspective; it is somewhat self-contained but points to specific pieces of the thesis to make its points.

1.5 Thesis Outline

This thesis is organized in five parts. Part I introduces and motivates the algorithm configuration problem (this current chapter) and discusses related work (Chapter 2). In Part II, we introduce our sets of algorithm configuration scenarios, which provide the testing grounds for the configuration procedures we discuss in this thesis. In Chapter 3, we describe the target algorithms, benchmark instances, and optimization objectives used. In Chapter 4, we present the empirical analysis techniques that we used to gain insight into the characteristics of a given algorithm configuration scenario.
In this analysis, we focus on the intricate tradeoffs between the number of parameter configurations we are able to evaluate, the (fixed) number of instances used for each evaluation, and the (fixed) captime after which to terminate unsuccessful algorithm runs. We show that there exists no optimal fixed tradeoff: the best number of instances and captime to use differs across algorithm configuration scenarios. In Part III, we discuss our first set of algorithm configuration procedures, as well as practical applications of them. This part focuses on model-free search for algorithm configuration, that is, approaches that directly search the space of parameter configurations without relying on a predictive model of performance to guide the search. In Chapter 5, we present a framework for iterated local search (ILS) for algorithm configuration called PARAMILS. ILS is one possible choice for the first dimension of algorithm configuration (which parameter configurations to evaluate). Our FOCUSEDILS algorithm, one instantiation of the PARAMILS framework, also implements an adaptive mechanism for making the second fundamental choice of algorithm configuration (how many algorithm runs to execute for each parameter configuration). In Chapter 6, we report a "real-life" application study of PARAMILS. In this case study, we automatically configured the 26 parameters of the tree search SAT solver SPEAR for large industrial bounded model checking and software verification instances, achieving 4.5-fold and 500-fold speedups in mean runtime over the manually-engineered default, respectively. In Chapter 7, we introduce a novel adaptive capping scheme that determines automatically when poor algorithm runs can be terminated early, thereby providing the first adaptive mechanism to instantiate the third and last dimension of algorithm configuration (which captime should be used for each run).
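To give a flavour of what iterated local search in configuration space looks like, here is a deliberately minimal sketch over categorical parameter domains, with a one-exchange neighbourhood, perturbation steps, and an incumbent. It illustrates the general ILS idea only; it is not the actual PARAMILS procedure, which is specified in Chapter 5:

```python
import random

def one_exchange_neighbours(theta, domains):
    """All configurations that differ from theta in exactly one parameter."""
    for i, domain in enumerate(domains):
        for value in domain:
            if value != theta[i]:
                yield theta[:i] + (value,) + theta[i + 1:]

def iterated_local_search(cost, domains, n_iterations=10, seed=0):
    """Minimal ILS in configuration space: first-improvement descent to
    a local optimum, then a random perturbation of the incumbent."""
    rng = random.Random(seed)
    theta = tuple(rng.choice(d) for d in domains)
    incumbent = theta
    for _ in range(n_iterations):
        improved = True
        while improved:                       # local search phase
            improved = False
            for neighbour in one_exchange_neighbours(theta, domains):
                if cost(neighbour) < cost(theta):
                    theta, improved = neighbour, True
                    break
        if cost(theta) < cost(incumbent):     # update the incumbent
            incumbent = theta
        # perturbation: randomly reset some parameters of the incumbent
        theta = tuple(rng.choice(d) if rng.random() < 0.5 else v
                      for v, d in zip(incumbent, domains))
    return incumbent

# Toy configuration space: three ternary parameters, optimum at (1, 1, 1).
domains = [(0, 1, 2), (0, 1, 2), (0, 1, 2)]
best = iterated_local_search(lambda t: sum(abs(v - 1) for v in t), domains)
```

In the actual thesis setting, cost would itself be an empirical estimate ĉ based on capped target-algorithm runs rather than a cheap closed-form function.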
Finally, in Chapter 8, we report further successful "real-life" applications of PARAMILS, most notably our application to automatically setting the 63 parameters of CPLEX, the most complex and heavily parameterized algorithm of which we are aware. Even without using any explicit domain knowledge, we were able to achieve substantial speedups over the well-engineered CPLEX default parameter configuration, in some domains exceeding a factor of ten. We also review other applications of PARAMILS, including an application to automated algorithm design from parameterized components. In Part IV, we explore a fundamentally different choice for the first dimension of algorithm configuration, the search strategy. In particular, we study model-based approaches, which utilize predictive models of algorithm performance (so-called response surface models) in order to guide the search for optimal configurations. This approach was much less developed as a tool for general algorithm configuration, and thus our contribution in this part is mostly methodological; nevertheless, after five chapters of improvements and extensions of existing procedures, we end up with a state-of-the-art procedure for general algorithm configuration. In Chapter 9, we start this part with an introduction to sequential model-based optimization (SMBO) for general blackbox function optimization. We introduce a general framework for SMBO and describe two existing state-of-the-art approaches, SPO and SKO, as instantiations of this framework. We then experimentally compare the two, showing that for some configuration scenarios SPO is more robust "out-of-the-box". We substantially improve upon these approaches in Chapters 10 and 11. In Chapter 10, we experimentally investigate the components of SPO, and identify and resolve a number of weaknesses.
Most notably, we introduce a novel intensification mechanism, and show that log transformations of cost statistics can substantially improve the quality of the response surface model. This mechanism leads to substantially improved robustness of our resulting configuration procedure, dubbed SPO+, as compared to SPO. To reduce computational overheads due to model construction, we also introduce a modification of SPO+, dubbed SPO∗. In Chapter 11, we consider the use of different response surface models in SMBO, based on random forests and approximations of Gaussian processes. We demonstrate qualitatively that these models can improve predictive quality under various noise models as well as the performance of SMBO. Quantitatively, we demonstrate that our new models lead to substantially improved predictions. Furthermore, the computational complexity of their construction is substantially lower than that of traditionally-used models. Our computational experiments demonstrate that this leads to speedups of orders of magnitude. Based on these models, we introduce a new family of configuration procedures—dubbed ACTIVECONFIGURATOR—which we demonstrate to show significantly better performance than SPO∗. Next, we substantially expand the scope of SMBO in ways crucial for its use in algorithm configuration. While, prior to our work, existing SMBO methods had been limited to the optimization of numerical parameters, in Chapter 12 we extend the scope of SMBO to include the optimization of categorical parameters. In particular, we extend our response surface models to handle categorical inputs and introduce a simple local search mechanism to search for the most promising parameter configuration to try next. We also compare instantiations of our ACTIVECONFIGURATOR and PARAMILS frameworks for configuration scenarios with single instances and show that ACTIVECONFIGURATOR yields state-of-the-art performance.
In Chapter 13, we extend SMBO to handle general algorithm configuration problems defined across distributions (or sets) of problem instances. We integrate problem instance features into our response surface models, enabling them to predict the runtime for combinations of instances and parameter configurations. We discuss different sets of instance features and experimentally compare models based on them. Overall, we find that in some cases predictive performance is excellent, yielding rank correlations between predicted and actual runtimes above 0.95. We then compare ACTIVECONFIGURATOR variants based on the different sets of features, and demonstrate that they achieve state-of-the-art performance in a variety of configuration scenarios. Finally, in Part V, we offer conclusions about algorithm configuration in general and about our methods in particular, and discuss various promising avenues for future work.

1.6 Chapter Summary

In this chapter, we motivated and formally defined the algorithm configuration problem, the problem central to this thesis. The algorithm configuration problem can be informally stated as follows: given an algorithm, a set of parameters for the algorithm, and a set of input data, find parameter values under which the algorithm achieves the best possible performance on the input data. An automated method for solving this problem (a so-called algorithm configuration procedure) can be used to replace the most tedious and unrewarding part of traditional algorithm design. The resulting automated process may require only a fraction of the time human experts need to establish good parameter settings while yielding better results. Automated configuration procedures (short: configuration procedures or configurators) not only facilitate algorithm development, they can also be applied on the end user side to optimize performance for new instance types and optimization objectives.
The use of such procedures separates high-level cognitive tasks carried out by humans from tedious low-level tasks that can be left to machines. Toward the end of the chapter, we also summarized the contributions of this thesis, provided a guideline on how to read the thesis, and gave an outline of its remainder.

Chapter 2

Related Work

If I have seen further than others, it is by standing upon the shoulders of giants.
—Sir Isaac Newton, English physicist and astronomer

Many researchers before us have been dissatisfied with the manual optimization of algorithm parameters, and various fields have developed their own approaches to automating this task. In particular, research communities that have contributed techniques for algorithm configuration or parameter tuning include planning (Gratch and Dejong, 1992), evolutionary computation (Bartz-Beielstein, 2006), meta-heuristics (Birattari, 2005), genetic algorithms (Fukunaga, 2008), parallel computing (Brewer, 1995), and numerical optimization (Audet and Orban, 2006). In this review of previous work, we focus on the proposed techniques, only mentioning application areas in passing. However, at the end of the chapter (in Section 2.4), we discuss some additional promising application areas. We broadly categorize approaches related to our work into three categories, discussed in the first three sections of this chapter: (1) approaches that optimize an algorithm’s fixed set of parameters; (2) approaches for the automated construction of algorithms; and (3) approaches that deal with related problems (such as per-instance approaches, algorithm portfolios, and online approaches).

2.1 Algorithm Configuration and Parameter Optimization

In this section, we present previous algorithm configuration and parameter optimization methods based on direct search, model-based search and racing.
2.1.1 Direct Search Methods

Approaches for automated algorithm configuration go back at least to the early 1990s, when a number of systems were developed for what Gratch and Chien (1996) called adaptive problem solving. The Composer system of Gratch and Dejong (1992) performs a hill-climbing search in configuration space, taking moves if enough evidence has been gathered to render a neighbouring configuration statistically significantly better than the current configuration. Gratch and Chien (1996) successfully applied Composer to improving the five parameters of an algorithm for scheduling communication between a collection of ground-based antennas and spacecraft in deep space. Around the same time, Minton (1993, 1996) developed the MULTI-TAC system. MULTI-TAC takes as input a number of generic heuristics as well as a specific problem domain and a distribution over problem instances. It adapts the generic heuristics to the problem domain and automatically generates domain-specific LISP programs implementing them. In order to choose the best program, it uses a beam search, evaluating each program by running it on a number of problem instances sampled from the given distribution. Terashima-Marín et al. (1999) introduced a genetic algorithm for configuring a constraint satisfaction algorithm for large-scale university exam scheduling. They constructed and configured an algorithm that works in two stages and has seven configurable categorical parameters. They optimized these choices with a genetic algorithm for each of 12 problem instances, and for each of them found a configuration that improved performance over a modified Brelaz algorithm. However, note that they performed this optimization separately for each instance. Their paper did not quantify how long these optimizations took, but stated that “Issues about the time for delivering solutions with this method are still a matter of research”.
Muja and Lowe (2009) applied a custom algorithm configuration procedure to the problem of identifying the best-performing approximate nearest neighbour algorithm for a specific type of data set. They select between multiple randomized kd-trees and a new version of hierarchical k-means trees, and also select the best parameter setting for the chosen algorithm. Their automated configuration procedure first evaluates multiple parameter configurations; it then employs the Nelder-Mead downhill simplex procedure, initialized at the best of these, to locally optimize numerical parameters. Evaluations are based on cross-validation using a fixed-size data set, typically only one tenth of the available data, to reduce the computational burden. The reported experiments show a roughly ten-fold speedup compared to the best previously-available software for approximate nearest neighbour matching. A multitude of direct search algorithms exist for the optimization of numerical parameters. Here, we discuss only those that have been used to optimize algorithm performance. Coy et al. (2001) presented another search-based approach that uses a fixed training set. Their approach works in two stages. First, it finds a good parameter configuration θi for each instance Ii in the training set by a combination of experimental design (full factorial or fractional factorial) and gradient descent. Next, it combines the parameter configurations θ1, . . . , θN thus determined by setting each parameter to the average of the values it takes in all of them. This averaging step restricts the applicability of the method to algorithms with exclusively numerical parameters. A similar approach, also based on a combination of experimental design and gradient descent, using a fixed training set for evaluation, is implemented in the CALIBRA system of Adenso-Diaz and Laguna (2006). CALIBRA starts by evaluating each parameter configuration in a full factorial design with two values per parameter.
It then iteratively homes in on good regions of parameter configuration space by employing fractional experimental designs that evaluate nine configurations around the best-performing configuration found so far. The grid for the experimental design is refined in each iteration. Once a local optimum is found, the search is restarted (with a coarser grid). Experiments showed CALIBRA’s ability to find parameter settings for six target algorithms that matched or outperformed the respective originally-proposed parameter configurations. Its main drawback is the limitation to tuning numerical and ordinal parameters, and to a maximum of five parameters. Tolson and Shoemaker (2007) presented a simple global optimization algorithm. Their algorithm, called dynamically dimensioned search (DDS), can be interpreted as a variable-neighbourhood local search mechanism in continuous space. Limited to tuning continuous parameters only, DDS starts from an initial solution and then iterates the following steps: (1) randomly select a subset of parameters; (2) perturb the values of the selected parameters by sampling a new value for each selected parameter θi from a Gaussian distribution N(θi, 0.2 · ri), where θi is the parameter’s current value and ri is the parameter’s range; and (3) accept the perturbed solution if it yields a performance improvement. In a mechanism related to simulated annealing, the probability of selecting parameters to be perturbed shrinks over time. Experiments for the automatic calibration of watershed simulation models with 6 to 30 dimensions showed that some parameterizations of DDS could outperform a competitor, the shuffled complex evolution algorithm. Work on automated parameter tuning can also be found in the numerical optimization literature. In particular, Audet and Orban (2006) proposed the mesh adaptive direct search (MADS) algorithm.
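The DDS loop described above (steps 1–3) can be sketched as follows. This is a hedged simplification, not Tolson and Shoemaker's implementation: we clamp perturbed values at the parameter bounds (the original algorithm reflects them), use the common 1 − ln(t)/ln(budget) schedule for the shrinking per-parameter selection probability, and assume a deterministic objective; all function and variable names here are illustrative.

```python
import math
import random

def dds(objective, lower, upper, budget, seed=0):
    """Dynamically dimensioned search, simplified sketch (minimization)."""
    rng = random.Random(seed)
    d = len(lower)
    best = [(lo + hi) / 2.0 for lo, hi in zip(lower, upper)]  # initial solution
    best_cost = objective(best)
    for t in range(1, budget + 1):
        # Probability of perturbing each parameter shrinks over time.
        p = max(1.0 - math.log(t) / math.log(budget), 1.0 / d)
        chosen = [i for i in range(d) if rng.random() < p]
        if not chosen:                      # always perturb at least one parameter
            chosen = [rng.randrange(d)]
        cand = best[:]
        for i in chosen:
            r_i = upper[i] - lower[i]       # parameter's range r_i
            v = rng.gauss(cand[i], 0.2 * r_i)
            cand[i] = min(max(v, lower[i]), upper[i])  # clamp (original DDS reflects)
        cand_cost = objective(cand)
        if cand_cost < best_cost:           # greedy acceptance of improvements only
            best, best_cost = cand, cand_cost
    return best, best_cost
```

The greedy acceptance rule makes each accepted move an improvement, while the shrinking selection probability moves the search from global perturbations toward fine-grained, low-dimensional refinement.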
Designed for purely continuous parameter configuration spaces, MADS is guaranteed to converge to a local optimum of the cost function. The optimization objective in their work was the runtime and number of function evaluations required by interior point methods for solving a set of large unconstrained regular problems from the CUTEr collection (Gould et al., 2004). To reduce training cost, MADS used a set of smaller problems from the same collection to evaluate training performance. Although this bears the risk of improving performance for small problems while worsening it for large problems, the experiments reported by Audet and Orban (2006) demonstrated performance improvements of around 25% over the classical configuration of four continuous parameters of interior point methods. Audet and Orban (2006) mentioned very small amounts of noise (around 1% variation between evaluations of the same parameter configuration), but did not specifically address this noise with stochastic optimization techniques. As discussed in Section 1.2.2, algorithm configuration can be seen as a stochastic optimization problem, and there exists a large body of algorithms designed for such problems (see, e.g., Spall, 1999, 2003). However, many of the algorithms in the stochastic optimization literature require explicit gradient information and are thus inapplicable to algorithm configuration. Often, one can simply approximate gradients by finite differences, but this typically increases the number of required function evaluations substantially. Since algorithms with purely continuous parameter configuration spaces can be expected to have very few local minima, this approach is promising for tuning such algorithms. For objective functions with many local minima, simulated annealing is a frequently-used technique in stochastic optimization.
However, in the context of optimizing functions with many categorical parameters, simulated annealing often yields suboptimal results (Hoos and Stützle, 2005). For details on stochastic optimization, we refer the reader to the book by Spall (2003).

2.1.2 Model-Based Optimization

Model-based optimization techniques are typically developed for blackbox optimization, where the only information available about the function to be optimized is its value at queried data points. Given these values, these techniques construct a response surface model, that is, a model that predicts the true value of the function at unseen data points. They use this response surface in order to identify optimal parameter settings. Experimental design is concerned with the study of which design points to choose in order to gather data for learning the model. Several measures of a design’s optimality exist—for a review of the extensive literature on experimental design see, e.g., the article by Chaloner and Verdinelli (1995). Sequential model-based optimization methods combine a response surface model with a criterion for selecting the next design point. A very prominent response surface model in statistics, used, for example, by Sacks et al. (1989), Jones et al. (1998), and Santner et al. (2003), is a combination of a linear regression model and a noise-free stochastic Gaussian process model (also referred to as a “kriging” model in the statistics literature) that is fitted on the residual errors of the linear model. This model has come to be known as the “DACE” model—an acronym for “Design and Analysis of Computer Experiments”, the title of the paper by Sacks et al. (1989) which popularized the method. A popular criterion for selecting the next design point θ is the expectation of positive improvement over the incumbent solution at θ (where the expectation is taken over the response at θ with respect to the current model).
This expected improvement criterion goes back to the work of Mockus et al. (1978) and continues to be the most widely-used criterion today. The combination of the DACE model, the expected improvement criterion and a branch and bound method for optimizing the expected improvement criterion makes up the efficient global optimization (EGO) algorithm by Jones et al. (1998), a popular framework in statistics for deterministic blackbox function optimization. The EGO algorithm has been extended by three different lines of work. Two of these extensions deal with noisy functions: the sequential kriging optimization (SKO) algorithm by Huang et al. (2006), and the sequential parameter optimization (SPO) procedure by Bartz-Beielstein et al. (2005); Bartz-Beielstein (2006). SKO extends EGO by adding noise to the DACE model and augmenting the expected improvement criterion. SPO computes an empirical summary statistic for each design point and fits a noise-free Gaussian process to the values of this statistic. Over time, it increases the number of runs this statistic is based on. For a more detailed description and a comparison of SKO and SPO, see Chapter 10. The third extension, presented by Williams et al. (2000), constructs a Gaussian process model that predicts the response value integrated over a set of “environmental conditions”. We discuss this approach in detail in Section 13.2.1. Briefly, it uses its model to optimize marginal performance across the environmental conditions. In the context of algorithm configuration, this could be used to optimize mean performance across a set of problem instances. This extension is computationally expensive, and only applicable for optimizing mean performance. As we will demonstrate in Section 9.4, it is not compatible with transformations of the response variable, which are often crucial for strong model performance (Jones et al., 1998; Huang et al., 2006).
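Under a Gaussian predictive distribution N(mu, sigma²) for the (to-be-minimized) response at a candidate point, the expected improvement criterion discussed above has a well-known closed form, EI = (f_min − mu)·Φ(z) + sigma·φ(z) with z = (f_min − mu)/sigma. A minimal sketch (not the implementation used in this thesis):

```python
import math

def expected_improvement(mu, sigma, f_min):
    """Closed-form EI for minimization: E[max(0, f_min - Y)] with Y ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(f_min - mu, 0.0)          # deterministic prediction: plain improvement
    z = (f_min - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    return (f_min - mu) * cdf + sigma * pdf
```

A sequential optimizer evaluates this acquisition function at many candidate configurations and runs the one maximizing it; candidates score highly either through a promising predicted mean or through high predictive uncertainty, which is exactly the criterion's exploration-exploitation tradeoff.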
From the point of view of algorithm configuration, the main drawbacks of the EGO line of work are:

• its limitation to continuous parameters;
• its limitation to noise-free functions or Gaussian-distributed noise;
• the cubic scaling of the time complexity of Gaussian process regression models with respect to the number of data points; and
• its limitation to single problem instances (or mean performance across instances, with the limitation that no response transformation be applied).

In Chapters 10 through 13, we will address these issues in a model-based optimization framework that follows the EGO line of work in its broad strokes. Not all work on model-based optimization of algorithm parameters builds on Gaussian processes. Ridge and Kudenko (2007) employed classical experimental design methods to tune the parameters of the Max-Min Ant System (MMAS) for solving the travelling salesperson problem. First, they selected values for ten search parameters that represent “low” and “high” levels of each parameter (in statistical terms, factor). Then, they employed a fractional factorial design with a number of replicates for each design point, leading to 1452 design points. Out of these design points, they discarded 3% as outliers and then fitted a quadratic model. They used this model for two purposes: to predict the relative importance of each parameter and to predict which combination of parameter settings can be expected to be optimal. Their approach also relies on Gaussian noise assumptions that are often not met by cost distributions in an algorithm configuration setting (Hoos and Stützle, 2005). Srivastava and Mediratta (2005) used a decision tree classifier to partition parameter configuration space into good and bad configurations for a given domain. All their parameters were discrete or discretized, and the experiments covered scenarios with fewer than 200 possible parameter configurations.
Unfortunately, their evaluation is inconclusive, since the only comparison made is to the worst parameter configuration, as opposed to the default or the best parameter configuration. Bartz-Beielstein and Markon (2004) compared tree-based regression techniques to the DACE model and to a classical regression analysis. Their application domain was the optimization of the parameters of an evolution strategy (with four binary and five continuous parameters) and a simulated annealing algorithm (with two continuous parameters) for a single elevator control problem instance. Their experiments are based on 16 data points from a fractional factorial design. Based on the results on this very small data set, they suggested that tree-based methods be used in a first stage of algorithm configuration for screening out important factors, since they can handle categorical parameters.

2.1.3 Racing Algorithms

A rather different approach for algorithm configuration is based on adaptations of racing algorithms from machine learning (such as Hoeffding races, introduced by Maron and Moore (1994)). Birattari et al. (2002, 2004) developed a procedure dubbed F-Race and used it to configure various stochastic local search algorithms. F-Race takes as input an algorithm A, a finite set of parameter configurations Θ, and an instance distribution D. It iteratively runs the target algorithm with all “surviving” parameter configurations on a number of instances sampled from D (in the simplest case, each iteration runs all surviving configurations on one instance). After each iteration, F-Race performs a non-parametric Friedman test to check whether there are significant differences among the configurations. If this is the case, it eliminates the inferior configurations using a series of pairwise tests. This process is iterated until only one configuration survives or a given cutoff time is reached.
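The racing loop just described can be sketched as follows. This is a simplified sketch, not Birattari et al.'s implementation: we compute the Friedman chi-squared statistic directly and compare it against a caller-supplied critical value (e.g., 5.99 for k − 1 = 2 degrees of freedom at significance level 0.05), and we replace the series of pairwise post-hoc tests with a crude rule that keeps only configurations whose rank sum lies within a margin of the best.

```python
def average_ranks(row):
    """Within-instance ranks of configuration costs (1 = lowest cost), averaging ties."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1                           # extend block of tied values
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def friedman_statistic(costs):
    """Friedman chi-squared statistic; costs[i][j] = cost of config j on instance i."""
    n, k = len(costs), len(costs[0])
    rank_sums = [0.0] * k
    for row in costs:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(R * R for R in rank_sums) - 3.0 * n * (k + 1)

def f_race(run, configs, instances, critical_value, margin):
    """Simplified F-Race: run all survivors on one instance per iteration, then test."""
    survivors = list(configs)
    costs = []                               # accumulated per-instance costs of survivors
    for inst in instances:
        costs.append([run(c, inst) for c in survivors])
        if len(survivors) > 1 and friedman_statistic(costs) > critical_value:
            k = len(survivors)
            rank_sums = [0.0] * k
            for row in costs:
                for j, r in enumerate(average_ranks(row)):
                    rank_sums[j] += r
            best = min(rank_sums)            # crude stand-in for the pairwise tests
            keep = [j for j in range(k) if rank_sums[j] <= best + margin]
            survivors = [survivors[j] for j in keep]
            costs = [[row[j] for j in keep] for row in costs]
        if len(survivors) == 1:
            break
    return survivors
```

The key property of the race carries over even to this sketch: clearly inferior configurations stop consuming algorithm runs early, so the evaluation budget concentrates on the contenders.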
Various applications of F-Race have demonstrated very good performance (for an overview, see the PhD thesis by Birattari (2004)). However, since at the start of the procedure all candidate configurations are evaluated, this approach is limited to situations in which the space of candidate configurations can practically be enumerated. In fact, published experiments with F-Race have been limited to applications with at most 1 200 configurations. A recent extension presented by Balaprakash et al. (2007) iteratively performs F-Race on subsets of parameter configurations. This approach scales better to large configuration spaces, but the version described in that paper handles only algorithms with numerical parameters. In related work, Chen et al. (2000) introduced an algorithm dubbed optimal computing budget allocation (OCBA). Their application domain is identifying the best out of a finite number of competing designs for a complex system modelled by simulation experiments. This problem is in principle equivalent to the problem of identifying the best out of a finite number of parameter configurations for solving a single problem instance. OCBA is a sequential algorithm for this problem that asymptotically maximizes the probability of identifying the best design given a fixed computational budget (i.e., a fixed number of allowed simulation runs). Their experiments are promising but limited to small configuration spaces (they considered their space of 210 designs “huge”).

2.2 Related Work on the Construction of Algorithms

Some approaches aim to automatically construct algorithms to solve instances of a given type. Here, we review methods from the fields of algorithm synthesis and genetic programming.

2.2.1 Algorithm Synthesis

Cognizant of the time-intensive and error-prone nature of manual implementations of complex algorithms, researchers have introduced systems that transform formal problem specifications into synthesized software that is correct by construction.
We note that the approaches for algorithm synthesis discussed here complement work on automated algorithm configuration and parameter optimization. They focus on reducing implementation effort, while the particular choice of heuristics still rests with the algorithm designer. Westfold and Smith (2001) described techniques for the automatic reformulation of problems into an effective representation that allows efficient constraint propagation and pruning. They also showed how to automatically derive efficient code for these operations. The authors reported the synthesized programs to be very efficient (sometimes orders of magnitude faster than manually written programs for the same problem), and attributed this to the specialized representation of constraints and their optimized propagation. Van Hentenryck and Michel (2007) described the automated synthesis of constraint-based local search methods. In their approach, the algorithm designer only needs to provide a model of the problem in a high-level constraint-based language specifying objectives as well as hard and soft constraints. Their code synthesizer analyzes this high-level model, deduces facts about the type and tightness of constraints, and uses this information to automatically select local search operators guaranteed to preserve feasibility. The actual code synthesis is based on Comet (Hentenryck and Michel, 2005), an object-oriented programming language that includes modelling and control abstractions to support constraint-based local search. Experiments demonstrated that the synthesized code performed almost as well as hand-coded solutions for a large variety of constraint programming domains. We note that in this approach the user still has to specify the meta-heuristic to be used, as well as its parameters. Thus, this approach is perfectly orthogonal to the topic of this thesis.
Indeed, we see considerable potential in combining the two approaches in future work: simply search the space of possible parameterized meta-heuristics and evaluate a configuration by synthesizing the code for it and running it. In follow-up work, Monette et al. (2009) introduced the AEON system for the domain of scheduling. AEON analyzes a given high-level input model to identify problem characteristics, uses these to classify the problem (using a lookup table), and selects the appropriate solving strategy based on the classification output. (AEON is closely related to the per-instance algorithm configuration approach discussed later in this chapter, except that here it is applied at the level of problems, not problem instances.) It then uses the corresponding implementation for this solving strategy. Note that AEON supports both complete search and local search, as well as hybrid algorithms generated by composition (e.g., first use a Tabu Search to find a bound B on solution quality, and then apply branch and bound using bound B). As with the work discussed above, we believe that algorithm configuration could be used to explore the space of possible algorithms defined by all basic search procedures and their parameters, as well as their compositions. Finally, Di Gaspero and Schaerf (2007) presented EasySyn++, a different tool for the synthesis of source code for stochastic local search (SLS) methods. Based on the EasyLocal++ framework for developing SLS algorithms, EasySyn++ is a software environment for the fast prototyping of SLS algorithms. While the above-mentioned Comet system follows a top-down approach that—given a high-level problem specification—generates a complete program not to be touched by the user, EasySyn++ follows a bottom-up approach. In particular, it synthesizes most of the code needed on top of the EasyLocal++ framework, leaving only a small—but crucial—portion of the source code to be written by the algorithm designer.
While the necessity of having to write low-level code is a drawback, the advantage of EasySyn++ is that algorithm designers have more control over the details of their algorithm.

2.2.2 Genetic Programming

Genetic programming (see, e.g., Poli et al., 2008) evolves a population of programs for solving a given problem. The fitness of an individual can, for example, be assessed by executing the program and measuring its performance on a training set. Thus, it is in principle possible to directly generate effective programs for solving instances of a given class of problems. Fukunaga (2008) used genetic programming to construct complex variable selection mechanisms for a generic local search algorithm for SAT. These mechanisms are constructed as LISP programs and generated from terminals and functions by means of composition. There is no clear separation between the space of possible mechanisms and the method used to search it: the space of mechanisms is unbounded, but the genetic programming procedure limits itself to a bounded subspace. Fukunaga’s generic algorithm only supports variable selection methods based on the prominent SAT local search algorithms GSAT and WalkSAT. His experiments demonstrated that for Random-3-SAT instances the automatically-generated variable selection mechanisms achieved performance comparable to the best local search SAT algorithms in the year 2002. Li et al. (2005) applied genetic programming to generate optimized sorting algorithms. Their paper introduced two applications of genetic programming to selecting the best combination of various sorting primitives for a given sorting instance. Both approaches take into account instance features, such as the number of elements and the entropy of the values to be sorted. Similar to Fukunaga’s work discussed above, the first approach is based on the composition of various primitives in a tree; this approach also shares the problem of an unbounded search space.
The second approach is used to generate a classifier system that partitions the space of possible instance features and selects the most appropriate sorting algorithm for each partition. In the experiments Li et al. reported, both new genetic programming approaches yielded substantial speedups over state-of-the-art algorithm libraries on multiple architectures. Oltean (2005) employed genetic programming to evolve genetic algorithms themselves, searching for the best combination of genetic programming operators, such as mutation, crossover, and selection of individuals. In his experiments, the automatically-evolved genetic algorithms outperformed standard implementations of genetic algorithms on a variety of tasks, such as function optimization, the travelling salesperson problem, and the quadratic assignment problem (Oltean, 2005). However, this work did not use a separate test set to evaluate the performance of the final configurations found. Thus, it is unclear to what extent the reported results would generalize to other problem instances. In a similar vein, Bölte and Thonemann (1996) applied genetic programming to optimize simulated annealing. In particular, they optimized the annealing schedule for solving the quadratic assignment problem. In their experiments, an automatically-found oscillating schedule outperformed the best previously-known simulated annealing algorithms. Finally, Stillger and Spiliopoulou (1996) applied genetic programming to the database query optimization problem. They evolved a population of query execution plans (QEPs), using QEPs directly instead of a string representation as had been done in previous work. One interesting feature of the application to database query optimization is that evaluating the fitness of a particular individual does not require executing the QEP. Instead, the performance of any QEP can be evaluated before its execution. It is thus possible to search for the best QEP for a given query, and then execute that QEP.
Stillger and Spiliopoulou (1996) reported experiments in which the QEPs their method found yielded performance similar to that of QEPs constructed with a widely-used iterative improvement method.

2.3 Approaches for Related Problems

So far we have focused on the problem of finding the best parameter configuration for an entire set (or distribution) of problem instances. Related problems are (1) to find the best configuration or algorithm on a per-instance basis; (2) to run multiple algorithms or copies of a single algorithm at once; (3) to adapt algorithm components or parameters during the execution of an algorithm; and (4) to combine algorithms to find the best solution for a single problem. We now survey approaches for these problems.

2.3.1 Per-Instance Approaches

While some algorithms and parameter configurations dominate others in the sense that they perform better on all instances of a problem class, often there is not a single best approach for all problem instances.1 This observation led Rice (1976) to define the algorithm selection problem: selecting the best algorithm for each problem instance based on computable properties of the instance. In what we call per-instance algorithm configuration (PIAC), analogously, the task is to select the best parameter configuration for each problem instance. Note that per-instance algorithm configuration is a generalization of algorithm selection: the choice between different algorithms can be encoded as a categorical top-level parameter of a portfolio-based algorithm that chooses between subalgorithms. PIAC is more complex, however, since it also deals with ordinal and numerical parameters, and since the parameter configuration space is structured. Most importantly, the number of possible configurations to choose from is typically quite small in algorithm selection and typically very large in PIAC.
This fact often renders it impossible to perform experiments with every single configuration, as is often done in algorithm selection (see, e.g., Xu et al., 2008). Methods for solving PIAC are relevant for many of the application scenarios given in Section 1.1, namely the design and development of complex algorithms; empirical studies, evaluations, and comparisons of algorithms; and the actual end use of algorithms in order to solve problems. A robust solution to PIAC would enable algorithm developers to focus on the development of algorithm components and use the PIAC solver to pick the right combination of components for every problem instance. PIAC is in principle more powerful than the “per-distribution” algorithm configuration we consider throughout this thesis. However, it requires the existence of cheaply-computable features that characterize problem instances and are informative about which kinds of approaches would work well for a given instance. The models this approach is based on can also require a large amount of training data. Knuth (1975) introduced a Monte-Carlo approach to estimate the size of a search tree, which can be used to judge the hardness of an instance. Lobjois and Lemaître (1998) used a similar approach to choose the algorithm that can traverse the entire search tree most quickly. A related approach was presented by Kilby et al. (2006). Leyton-Brown et al. (2003b,a, 2009) introduced portfolio-based algorithm selection using predictive models of algorithm runtime, dubbed empirical hardness models. These models predict the hardness of an unseen test instance based on a set of polytime-computable instance features. In most of their work, Leyton-Brown et al. use linear basis function regression to predict log-transformed algorithm performance. Nudelman et al. (2004) demonstrated that empirical hardness models work on (uniform-random) SAT. Hutter et al. (2006) showed how to apply them to randomized, incomplete algorithms. Xu et al. (2007a) introduced hierarchical hardness models, which first apply a classifier to predict the type or solution status of an instance, and then combine the predictions of lower-level models for the separate classes. Empirical hardness models can be used in a straightforward manner for portfolio-based algorithm selection: simply predict the performance of each candidate algorithm and select the one with the best predicted performance. This idea was first used by Leyton-Brown et al. (2003b,a) and was later extended by Xu et al. (2007b, 2008). These portfolio approaches have repeatedly been demonstrated to yield state-of-the-art performance for various classes of SAT problems. Most notably, in each of the 2007 and 2009 SAT Competitions, SATzilla won three gold and two other medals. Brewer (1994, 1995) proposed a very similar approach in the context of parallel computing. That work used linear regression models to predict the runtime of different implementations of portable, high-level libraries for multiprocessors, with the goal of automatically selecting the implementation and parameter setting with the best predicted runtime on a new architecture. While most other work we discuss here is concerned with solving NP-hard problems, Brewer focused on problems of low polynomial complexity (sorting and stencil computations, with respective asymptotic complexities of O(n log n) and O(n^1.5)).

1 One citation typically used in support of this argument is the so-called No Free Lunch (NFL) Theorem (Wolpert and Macready, 1997). However, note that this theorem only applies to optimization problems for which the objective function is specified as a blackbox lookup table—which is far from typical for hard combinatorial problems.
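The portfolio-selection recipe just described (fit one runtime model per algorithm on log-transformed runtimes, then pick the algorithm with the best prediction) can be sketched in a few lines. This is a hedged illustration rather than SATzilla itself: the single synthetic "size" feature, the toy runtimes, and plain least squares in place of the richer basis functions of empirical hardness models are all stand-ins of ours.

```python
import numpy as np

def fit_hardness_model(features, runtimes):
    """Fit a linear model of log10(runtime) from instance features
    (plain least squares; real empirical hardness models use richer
    basis functions)."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # add bias term
    w, *_ = np.linalg.lstsq(X, np.log10(runtimes), rcond=None)
    return w

def select_algorithm(models, feats):
    """Portfolio selection: predict log-runtime under each algorithm's
    model and return the index of the lowest prediction."""
    x = np.append(feats, 1.0)
    return int(np.argmin([x @ w for w in models]))

# Toy data: algorithm 0 is fast on small instances, algorithm 1 on large ones.
rng = np.random.default_rng(0)
feats = rng.uniform(0, 1, size=(50, 1))    # one feature: instance "size"
rt0 = 10 ** (2 * feats[:, 0] + 0.1)        # runtime grows with size
rt1 = 10 ** (2 * (1 - feats[:, 0]) + 0.1)  # runtime shrinks with size
models = [fit_hardness_model(feats, rt0), fit_hardness_model(feats, rt1)]
print(select_algorithm(models, np.array([0.1])))  # small instance -> 0
print(select_algorithm(models, np.array([0.9])))  # large instance -> 1
```

Because the synthetic log-runtimes are exactly linear in the feature, the fitted models recover the generating coefficients and the selector picks the right algorithm on each side.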
While some approaches exist for algorithm selection, there is not much work on algorithm configuration on a per-instance basis, that is, on approaches that pick a parameter configuration depending on instance features (but then keep it fixed). Patterson and Kautz (2001) introduced the Auto-WalkSAT algorithm. This approach is based on an easily computable characteristic (the “invariant ratio”; see the work by McAllester et al., 1997), which was empirically found to typically be 10% less than the optimal value for WalkSAT’s noise parameter. Auto-WalkSAT simply computes the invariant ratio and then sets the noise parameter to its value plus 10%. The experiments Patterson and Kautz (2001) reported show that this simple approach found almost optimal settings of the noise parameter for such heterogeneous problem classes as unstructured, graph colouring, and blocksworld instances. However, it failed for logistics instances, for which the above-mentioned relationship between invariant ratio and optimal noise parameter does not hold. Auto-WalkSAT is inherently limited to SAT and in particular to the WalkSAT framework and its single noise parameter. Gebruers et al. (2005) used case-based reasoning—a classification technique—in order to determine the best configuration of a constraint programming algorithm on a per-instance basis. In particular, the parameters being optimized included problem modelling, propagation, the variable selection heuristic, and the value selection heuristic. However, they used a flat representation of all possible parameter configurations that did not exploit the structured parameter configuration space and thus led to a simple algorithm selection problem instead of a structured per-instance algorithm configuration problem. They did not specify how many parameter configurations are possible, nor which kind of instance features were used (except that they are extracted offline, before the algorithm starts).
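Returning to Auto-WalkSAT: its feature-based rule is simple enough to state directly. The sketch below treats the invariant-ratio computation itself as a black box, and it reads "plus 10%" as an additive shift of 10 percentage points; both the helper's name and that reading are our assumptions.

```python
def auto_walksat_noise(invariant_ratio):
    """Auto-WalkSAT's rule, as described above: the invariant ratio was
    empirically observed to lie about 10% below the optimal noise setting,
    so the noise parameter is set to the measured ratio plus 10%
    (read here as an additive 0.10, capped at 1.0). Computing the
    invariant ratio itself, from statistics of the objective function
    (McAllester et al., 1997), is out of scope for this sketch."""
    return min(1.0, invariant_ratio + 0.10)

# E.g., a measured invariant ratio of 0.45 yields a noise setting of 0.55.
print(auto_walksat_noise(0.45))
```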
Experiments demonstrated that the case-based reasoning approach performed better than C4.5 with default parameters for solving instances of the social golfer problem. Hutter and Hamadi (2005) and Hutter et al. (2006) used an approach based on empirical hardness models in order to select the best parameter settings of two local search algorithms for SAT. They used linear basis function regression in order to fit a joint model of instance characteristics and (continuous) algorithm parameter settings. They then used this model to determine the appropriate parameter settings to be used on a per-instance basis. Their experiments showed that, for a mixed benchmark set, parameter settings that were automatically selected on a per-instance basis outperformed the best fixed parameter setting by a factor of almost 10. However, the methods introduced in these papers were limited to purely continuous algorithm parameters. Finally, there are applications of per-instance approaches in compiler optimization. For example, for each program a different combination of compiler optimizations yields the best performance; this is even true on a per-method basis. Both of these problems can be formalized in the PIAC framework, with program features in the first case and method features in the second. There are a number of possible compiler optimizations, and ideally, one would like to determine the optimal sequence in which to apply them. Cavazos and O’Boyle (2006) simplified this problem to only picking the optimizations to be performed, with their order defined externally. For each compiler optimization flag, they performed a binary logistic regression to determine whether or not the optimization would lead to a speedup in a given method. They generated training data offline and for all logistic regressions at once, by compiling a large number of training methods under all possible compiler optimizations and recording the best optimization sequence for each one.
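The per-flag scheme just described (one independent binary logistic regression per optimization flag, trained on method features) can be sketched as follows. The features, labels, and training constants are synthetic; this is an illustration of the idea, not Cavazos and O'Boyle's implementation.

```python
import numpy as np

def train_flag_classifier(X, y, lr=0.5, steps=2000):
    """One binary logistic regression per optimization flag: given method
    features X, predict whether that flag appeared in the best optimization
    sequence for the method (label y). Plain full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def choose_flags(classifiers, method_features):
    """Enable every flag whose classifier predicts a speedup (p > 0.5);
    flags are treated independently, as in the approach described above."""
    x = method_features
    return [i for i, w in enumerate(classifiers)
            if 1.0 / (1.0 + np.exp(-(x @ w))) > 0.5]

# Toy training data: flag 0 helps "loop-heavy" methods (feature 1 high),
# flag 1 helps "small" methods (feature 2 low); column 0 is a bias term.
X = np.array([[1, 0.9, 0.1], [1, 0.8, 0.9], [1, 0.1, 0.2], [1, 0.2, 0.8]])
y_flag0 = np.array([1.0, 1.0, 0.0, 0.0])   # in best sequence iff loop-heavy
y_flag1 = np.array([1.0, 0.0, 1.0, 0.0])   # in best sequence iff small
clfs = [train_flag_classifier(X, y_flag0), train_flag_classifier(X, y_flag1)]
print(choose_flags(clfs, np.array([1, 0.85, 0.15])))  # loop-heavy and small
```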
Whether or not optimization i is used in the best optimization sequence for a training method determines the method’s training label for the ith logistic regression. This approach does not take interactions between the optimizations into account. However, the experimental results reported by Cavazos and O’Boyle (2006) were strong, suggesting that the independent treatment of optimizations was not too detrimental. In another application in compiler optimization, Agakov et al. (2006) predicted which program optimizations were promising on a per-program basis. In an offline training phase, they evaluated 1,000 optimization sequences for a set of training programs and kept all good sequences for each training program (within 5% of optimal). For each new program P to be compiled, their method determines the closest training program, P′, in Euclidean PCA-reduced feature space, and creates a probability distribution D based on the optimization sequences with good performance on P′. Then, their method uses distribution D in a rather simple way to bias a search in parameter space towards good compiler optimizations for P: it randomly samples from D or initializes the population of a genetic algorithm based on a random sample from D. Their experiments showed that this approach sped up the search for good compiler optimizations by up to an order of magnitude, compared to a random search or to initialization of the GA with a random sample.

2.3.2 Dynamic Algorithm Portfolios

Work on dynamic restarts of algorithms involves multiple independent runs of an algorithm. The approach by Horvitz et al. (2001) and Kautz et al. (2002) executes one algorithm run at a time, and after an initial observation horizon predicts whether the algorithm run will be good or bad. For this prediction, it uses a binary classification into long and short runs by means of a decision tree.
Due to the sometimes extraordinarily long runtimes of bad runs, such an approach can lead to significant speedups for algorithms with heavy-tailed runtime distributions. However, this approach does not scale to making multiple decisions during an algorithm run: for each decision, it requires a separate classifier. In order to scale to a more general online case, one would like to learn a single classifier which can handle features that are computed at arbitrary times during the search. Gagliolo and Schmidhuber (2006) introduced dynamic algorithm portfolios, which run multiple algorithms with different shares in parallel. This approach fits a runtime distribution across instances. (When instance features are available, such a model could also take them into account.) Gagliolo and Schmidhuber stress that the learning in this case does not happen offline but rather in a life-long learning scenario, in which the learned models are updated after each solved problem instance. Arguably, the more important novelty in this work is that predictions are no longer just used to make a single irreversible decision about which algorithm to use. Rather, the decision about how to prioritize algorithms is constantly revisited in the light of new evidence, namely the fact that algorithms have not terminated after having been allotted a certain amount of time. Later work on learning restart strategies by Gagliolo and Schmidhuber (2007) is another example of dynamic algorithm portfolios, applying more principled methodology based on a bandit problem solver. Finally, low-knowledge algorithm control is an approach by Carchrae and Beck (2004, 2005) to build reactive algorithm portfolios for combinatorial optimization problems. They assume that all algorithms in the portfolio are anytime algorithms that continuously improve a lower bound on solution quality, and assign priorities to each algorithm based on its respective improvements of solution quality over time.
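A bare-bones version of running multiple algorithms with shares can be sketched by interleaving their execution in time slices. In this minimal sketch the shares are fixed and equal, whereas the dynamic portfolios described above continually reallocate shares based on a learned runtime model; the slice length and the stand-in runtimes are illustrative.

```python
def run_portfolio(algorithms, runtimes, slice_len=1.0):
    """Interleave the algorithms in equal time slices until the first one
    finishes. `runtimes[i]` stands in for algorithm i's (in reality
    unknown) time-to-solution. Returns (winner, total CPU time spent).
    A real dynamic portfolio would adjust each algorithm's share after
    every slice instead of keeping them equal."""
    progress = [0.0] * len(algorithms)
    total = 0.0
    while True:
        for i in range(len(algorithms)):
            progress[i] += slice_len     # give algorithm i one slice
            total += slice_len
            if progress[i] >= runtimes[i]:
                return algorithms[i], total

# Algorithm B alone needs 2 time units; interleaving with A costs 4 in total,
# but avoids committing to A, which would have needed 5.
print(run_portfolio(["A", "B"], [5.0, 2.0]))
```

The example shows the basic trade-off that share learning tries to improve: interleaving hedges against picking the wrong algorithm at the price of a bounded slowdown relative to the (unknown) best one.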
Algorithms only communicate by sharing the best solutions found so far.

2.3.3 Online Methods

Finally, there exist a variety of approaches that combine different solution strategies during a single algorithm run. The most appropriate strategy to use typically varies over the course of a search trajectory, and online approaches aim to select the most appropriate one in a given situation. It is hard to attribute the improvements an algorithm incrementally achieves during a search trajectory to single decisions or series of decisions made by the algorithm. This blame-attribution problem invites the use of reinforcement learning. Examples of reinforcement learning approaches include the STAGE algorithm by Boyan and Moore (2000), as well as work by Lagoudakis and Littman (2000, 2001) on algorithm selection and on selecting branching rules in the DPLL procedure for SAT solving. The original results for STAGE were very encouraging, but unfortunately we are not aware of any follow-up work that showed STAGE to outperform state-of-the-art meta-heuristics, in particular iterated local search with simple random perturbations. Lagoudakis and Littman (2000, 2001) showed that in simple domains it is possible to perform better than the best single approach for a particular instance. However, their reinforcement learning approach was limited to very simple characteristics of the current search state, such as problem size for algorithm selection and the number of unassigned variables for SAT tree search algorithms. How to make reinforcement learning work with a more expressive set of search state features is an open and interesting research problem. Some approaches adaptively change the heuristics to be used in tree search algorithms. Samulowitz and Memisevic (2007) applied an online approach to the problem of solving quantified Boolean formulae (QBF).
Their tree search method solves an algorithm selection problem at various levels of the search tree in order to determine the best solution approach for the particular subtree rooted at the current decision point. Even though this approach does not take future decisions into account (as a reinforcement learner would), their experiments showed that it sped up the search substantially. A more heuristic approach was taken by Borrett et al. (1995). Their quickest first principle (QFP) employs a number of algorithms of increasing power (but also increasing complexity), and switches to the next approach when search stagnation is detected. The Adaptive Constraint Engine (ACE) by Epstein and Freuder (2001) and Epstein et al. (2002) uses a set of so-called advisors, heuristics that vote for possible actions during the search. These votes are taken at every decision point during the search and effectively compose the variable- and value-selection heuristics of ACE. Work on hyper-heuristics originates in the meta-heuristic2 community and is concerned with the development of heuristics that search a space of heuristic algorithms to identify one of them to be used in a given situation. Burke et al. (2003) surveyed the area and stated as the main motivation behind hyper-heuristics that many competitive meta-heuristics are too problem-specific or too knowledge-intensive to be implemented cheaply in practice. Hyper-heuristics are hoped to raise the level of generality and ease of use of meta-heuristics. Burke et al. gave a framework for hyper-heuristic algorithms, which iterates the following central step: given a problem state S_i, find the heuristic ingredient that is most suitable for transforming that state, and apply it to reach a new state S_{i+1}. Cowling et al. (2002) discussed the application of such a hyper-heuristic to a university personnel scheduling problem.
They introduced several possible move types and defined low-level heuristics that iteratively use one move type until some termination criterion is met. One of their proposed hyper-heuristics adaptively ranks the low-level heuristics and selects effective ones more often. The results presented show that this hyper-heuristic produced very good solutions, dramatically better than those found manually or by a constructive heuristic. The local search community has developed a great variety of approaches for adapting search parameters to the algorithm trajectory. The reactive search framework by Battiti et al. (2008) uses a history-based approach to decide whether the search is trapped in a small region of the search space, and makes a diversification move more likely when the search is deemed trapped. For example, reactive tabu search manages the tradeoff between intensification (more intensely searching a promising small part of the search space) and diversification (exploring other regions of the search space) via its tabu tenure—the number of steps for which a modified variable cannot be changed again after a modification. When search stagnation is detected, reactive tabu search increases the tabu tenure exponentially, and otherwise slowly decreases it. Hoos (2002a) used a very similar mechanism in an adaptive noise procedure for WalkSAT. Instead of the tabu tenure, the resulting Adaptive Novelty+ algorithm controls its noise parameter. Adaptive Novelty+ increases its noise if it does not observe any improvement in the objective function for too long, and decreases it otherwise. Introduced in 2002, Adaptive Novelty+ is still a competitive local search algorithm for SAT. Hutter et al. (2002) introduced a similar reactive variant of the SAPS algorithm.

2 Note that the term “meta-heuristic” is commonly used to refer to a set of heuristics that are generally applicable to more than one problem. Unlike hyper-heuristics, it does not refer to heuristics that act on heuristics.
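The stagnation-triggered adaptation shared by reactive tabu search and Adaptive Novelty+ can be sketched as a single update rule. The constants and thresholds below are illustrative, not the actual values used by Hoos (2002a); only the direction of the updates (raise noise on stagnation, lower it on improvement) reflects the mechanism described above.

```python
def adapt_noise(noise, steps_since_improvement, theta=10, phi=0.2):
    """Sketch of a reactive noise update: diversify (move noise toward 1)
    when no improvement has been observed for more than `theta` steps;
    intensify (move noise toward 0) right after an improvement; otherwise
    leave the noise unchanged. `theta` and `phi` are illustrative."""
    if steps_since_improvement > theta:
        return noise + (1.0 - noise) * phi    # stagnation: diversify
    if steps_since_improvement == 0:
        return noise - noise * phi / 2.0      # just improved: intensify
    return noise

print(adapt_noise(0.5, 50))  # stagnating: noise rises
print(adapt_noise(0.5, 0))   # just improved: noise falls
```

The asymmetric step sizes (a larger increase than decrease) mirror the common design intuition that escaping stagnation is more urgent than fine-tuning intensification, but the exact asymmetry here is our choice.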
Reactive SAPS, or RSAPS, adapts the probability of performing a smoothing step, where smoothing corresponds to an intensification of the search. Since the optimal parameter setting may change during the course of the search, in principle, this strategy has the potential to achieve better performance than any fixed parameter setting. In practice this is true for some instances, but overall, SAPS still shows more robust performance with its default parameter settings than RSAPS. One could argue that online methods are more powerful than approaches that commit to a given parameter configuration before running the algorithm. The flexibility of modifying algorithm components at runtime leads to a much larger configuration space that may indeed contain much better algorithms. However, it is not clear that existing online methods make the best use of this flexibility. Since many decisions have to be made during the course of a search, the computational efficiency of the learning component also becomes an important issue. Thus, research has focused on simple heuristics or hard-coded learning rules (where this use of the term “learning” does not have much in common anymore with traditional machine learning approaches). One exception is the more principled work on reinforcement learning for search. However, so far, this line of work has not resulted in state-of-the-art algorithms for SAT or related problems. We hope that the type of response surface models we discuss in Part IV of this thesis can be generalized to help tackle the problem of reactively tuning algorithm parameters and choosing algorithms while solving a problem instance. Finally, work on online methods and work on algorithm configuration are in part orthogonal. Online methods often retain many parameters whose settings are kept fixed throughout the search. Thus, it is perfectly possible to configure online methods using automated algorithm configuration procedures.
In particular, dynamic methods are often more complex than their static counterparts, and automated configuration procedures can facilitate the consideration of such complex mechanisms. In fact, such methods have already been used to optimize the parameters of a dynamic multi-armed bandit approach for adaptive operator selection (Maturana et al., 2009).

2.3.4 Finding a Single Good Solution for Optimization Problems

Cases where the objective is simply to find a good solution for a given optimization problem cannot be formalized as algorithm configuration problems, except in the case of deterministic algorithms. In algorithm configuration, we try to identify a configuration with good performance across repeated runs, such as mean solution quality across test runs. In contrast, in some optimization problems, the objective is simply to achieve good performance once; the objective function is thus the maximal performance across a number of training runs. For deterministic algorithms, this can be formalized as an algorithm configuration problem in which training and test set coincide. For randomized algorithms, one could still apply the algorithm configuration approaches discussed here, but they can be expected to yield suboptimal results since they optimize a different objective function. Cicirello and Smith (2004) introduced the max k-armed bandit problem in order to model scenarios in which the goal is to maximize the maximal performance achieved during training. Their framework can be used to prioritize runs of various randomized algorithms for optimization problems (or runs of a single algorithm with several parameter configurations). While earlier work (Cicirello and Smith, 2004, 2005; Streeter and Smith, 2006a) assumed generalized extreme value distributions of performance, Streeter and Smith (2006b) applied a distribution-free approach.
This approach was used to assign priorities to five different priority dispatching rules for the resource-constrained project scheduling problem with maximal time lags (RCPSP/max). For this problem, round-robin sampling achieved better overall performance than any single rule, and the max k-armed bandit approach further reduced the regret of round-robin sampling by half.

2.4 Further Promising Application Areas for Algorithm Configuration

While the discussion above concentrated on methods, here we discuss two promising application areas for automated algorithm configuration; these applications are rather different from the ones we focus on in this thesis.

2.4.1 Machine Learning

For potential applications of algorithm configuration in supervised machine learning, we distinguish between model parameters and algorithm parameters. In order to fit a given data set, supervised learners typically select the best-fitting model from a given hypothesis space, using either closed-form solutions or some sort of optimization to set a number of model parameters. This is not what we would refer to as algorithm configuration or parameter tuning. In contrast, there exist algorithm parameters that determine the hypothesis space or the method for searching this space; it is those parameters which would lend themselves to the use of automated configuration techniques. Further parameters that could be determined by algorithm configuration procedures are model hyper-parameters or model complexity parameters; for simplicity, we also refer to these as algorithm parameters. Model parameters are typically set automatically within the learning algorithm to minimize some loss function, whereas algorithm parameters are typically determined manually or optimized based on cross-validation.
Typical examples include the number of hidden layers in a neural network, the number of neighbours in a nearest neighbour classifier, the choice of which numerical optimizer to use (and how to initialize it), and the manual setting of hyper-parameters. As is the case for algorithms outside machine learning, in some situations certain algorithm parameters can be transformed into model parameters or avoided altogether: for example, the depth to which to build a decision tree does not need to be specified if one instead uses pruning techniques. The typical optimization objective in machine learning is predictive performance on a previously-unseen test set of instances. As one of many examples, consider the work of Kohavi and John (1995). They demonstrated that it is possible to automatically find good parameter configurations θ for the popular decision tree learning algorithm C4.5. In that study, a parameter configuration consisted of the instantiation of one integer, one continuous, and two binary parameters; the parameter configuration space Θ was discretized to a total of 1156 choices. For each of the 33 data sets in their study, they performed a best-first search in Θ, where each θ ∈ Θ was evaluated by cross-validation on the training set. Their approach found different well-performing configurations for each of their data sets, which significantly outperformed the default parameter configuration of C4.5 in nine of the 33 cases. Note that in our terminology, a “machine learning data set” translates to a “problem instance π ∈ Π”. A k-fold cross-validation on the training set can thus be seen as assessing the average performance of a parameter configuration across k instances. Alternative problem formulations could have been to find the configuration θ with the best performance across the 33 data sets, or—given some cheaply computable features for data sets—the best configuration on a per-data-set basis.
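The evaluation scheme used by Kohavi and John, scoring each discretized configuration by cross-validation on the training set, can be sketched generically. The toy "learner" below (a fixed decision threshold, with no actual training step) and the exhaustive search over configurations (in place of their best-first search) are simplifications of ours.

```python
def cross_val_score(evaluate, data, theta, k=3):
    """Mean validation accuracy of configuration theta under k-fold
    cross-validation. `evaluate(theta, train, val)` is any procedure that
    runs the learner with configuration theta and scores it on val."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        val = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(evaluate(theta, train, val))
    return sum(scores) / k

def best_configuration(evaluate, data, configurations, k=3):
    """Exhaustive search over a discretized configuration space; a
    stand-in for a more sophisticated (e.g., best-first) search."""
    return max(configurations,
               key=lambda th: cross_val_score(evaluate, data, th, k))

# Toy scenario: pick the decision threshold t that best separates labels.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
def evaluate(t, train, val):           # "training" is a no-op here
    return sum((x > t) == bool(y) for x, y in val) / len(val)
print(best_configuration(evaluate, data, [0.0, 0.5, 1.0]))
```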
Some machine learning algorithms have a rather large number of parameters. One example can be found in the training of restricted Boltzmann machines and deep belief networks, which relies on various implementation “tricks” (Sversky and Murphy, 2009). The right combination of the parameters associated with these “tricks” could be found with automated configuration procedures, an approach that is far closer to the philosophy of machine learning than the manual optimization of such parameters.

2.4.2 Optimizing Algorithms for Frequently-Used Polytime Operations

In this thesis, we focus on the optimization of algorithms for solving hard computational problems, such as NP-hard problems. Very similar approaches would also be applicable to the empirical optimization of polytime algorithms for many prominent problems, such as sorting, finding shortest paths in graphs, and linear algebra operations. Here, algorithm configuration interfaces with algorithm engineering, an area that combines theoretical algorithm design with the empirical analysis of practical algorithm performance. While algorithm engineering manually combines algorithm components to yield algorithms that are often provably efficient (but can also include heuristics), some or all of this task could be automated. Even though the theoretical performance guarantees may not change (and even if they degrade, as, e.g., in Quicksort compared to MergeSort), empirical performance can improve by orders of magnitude when using careful implementations. As an example, consider the construction of fast sorting algorithms. One example of the engineering effort necessary for this task is given by Bentley and McIlroy (1993). That paper specifically concludes: “We have added a few new tricks to Quicksort’s bag [...]
We mixed these with a few old tricks, ignored many others, and arrived at the new champion Program.” We note that these “tricks” often involve numerical parameters and combinations of categorical algorithm components. For example, in their work there are three integer parameters which determine the subarray sizes for which to use InsertionSort and which select between different adaptive partitioning schemes. Further, the swapping macro is highly optimized and contains categorical decisions about whether or not to inline algorithm code. These explicit optimizations sped up swaps by a factor of 12 for the common special case of sorting integer arrays. Another example can be found in the problem of finding the shortest path between two vertices of a graph (V, E). Even though this problem can be solved by Dijkstra’s classical algorithm (Dijkstra, 1959) in time O(|E| + |V| log |V|), until recently practitioners used heuristic solutions without correctness guarantees. Only the last decade has seen the development of (parameterized) algorithms that are both exact and empirically outperform Dijkstra’s algorithm by up to a factor of one million (Sanders and Schultes, 2007). A final but no less important problem is the optimization of high-performance libraries, such as those in a linear algebra package. Research in automatically optimizing such libraries has demonstrated that frequently-used routines can be optimized to run orders of magnitude faster than naïve implementations (Whaley, 2004). However, the optimizations are platform-specific, and an optimization on one platform may cause a performance decrease on another. Thus, the approach taken in the ATLAS project (Whaley et al., 2001) is to optimize algorithm performance on the end user side. Custom search heuristics are used to automatically determine the most appropriate implementation for the user’s architecture, which—in addition to setting various numerical parameters—also requires making some categorical choices.
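The InsertionSort-cutoff parameter mentioned above is easy to exhibit in isolation. The sketch below is a plain Hoare-partition Quicksort with a tunable `cutoff`, not Bentley and McIlroy's engineered implementation (which adds adaptive pivot selection and optimized swap macros); `cutoff=8` is an arbitrary illustrative default, and it is exactly the kind of integer parameter an automated configurator could tune empirically.

```python
def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] in place by insertion; fast for tiny subarrays."""
    for i in range(lo + 1, hi + 1):
        key, j = a[i], i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key

def quicksort(a, lo=0, hi=None, cutoff=8):
    """Quicksort that falls back to InsertionSort on subarrays of size
    <= cutoff; `cutoff` is the tunable integer parameter of interest."""
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= cutoff:
        insertion_sort(a, lo, hi)
        return a
    pivot = a[(lo + hi) // 2]
    i, j = lo, hi
    while i <= j:                     # Hoare-style partitioning
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    quicksort(a, lo, j, cutoff)
    quicksort(a, i, hi, cutoff)
    return a

print(quicksort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0]))
```

The theoretically best cutoff depends on constant factors of the target machine, which is why such parameters are natural candidates for empirical, per-platform configuration.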
General algorithm configuration procedures could be used in cases where custom heuristics either do not perform well or simply have not been developed yet. In all of these potential applications of automated algorithm configuration procedures, single algorithm runs would only require a small fraction of the time they take in the applications typical for this thesis. Nevertheless, the configuration procedures we introduce in this thesis would remain applicable. However, for the optimization of extremely fast operations it appears advisable to pay special attention to keeping the overhead of configuration procedures small. Another characteristic shared by algorithms for all these applications is their widespread use. Improvements would thus have an immediate impact on a large user base.

2.5 Chapter Summary

In this chapter, we discussed related work from various fields of research. We started with the most closely-related work: methods that apply relatively directly to (special cases of) the algorithm configuration problem we defined in Section 1.2.2. Most such existing methods are based on direct search in parameter configuration space, some are based on response surface models, and some on statistical racing techniques. We then discussed methods for automatically constructing algorithms. This included methods for the construction of correct implementations based on high-level specifications, with little or no manual implementation effort in low-level programming languages. We also discussed related methods from genetic programming that evolve programs for solving problem instances of a given type. In such methods, the specification of possible programs is often tightly coupled to the methods used to search the space of programs.
Next, we discussed a variety of approaches for problems related to algorithm configuration, such as per-instance algorithm selection and parameter tuning, algorithm portfolios, and methods that adapt algorithm parameters while solving a problem instance. Finally, we discussed two areas of research that promise to hold much potential for applying automated algorithm configuration procedures: machine learning and the optimization of algorithms for frequently-used polytime operations.

Part II

Algorithm Configuration: Scenarios and Empirical Analysis

—in which we introduce our algorithm configuration scenarios and empirically analyze their characteristics

Chapter 3

Algorithm Configuration Scenarios

We have to learn again that science without contact with experiments is an enterprise which is likely to go completely astray into imaginary conjecture.
—Hannes Alfvén, Swedish plasma physicist

In this chapter, we describe the algorithm configuration scenarios we use throughout this thesis to experimentally evaluate configuration procedures. We constructed these configuration scenarios—instances of the algorithm configuration problem—to study various aspects of algorithm configuration. Algorithm configuration aims to improve the performance of existing target algorithms for solving sets of instances of a problem. First, in Section 3.1, we discuss the two problems we focus on: propositional satisfiability (SAT) and mixed integer programming (MIP). Then, we introduce target algorithms (Section 3.2) for these problems, sets of benchmark instances (Section 3.3), and optimization objectives (Section 3.4). Next, we introduce seven sets of configuration scenarios that we will use throughout the thesis (Section 3.5). Finally, at the end of this chapter, we also discuss experimental preliminaries that apply throughout the thesis. This chapter serves as a reference and can be skimmed or skipped by readers eager to move on.
3.1 Problems

We studied algorithm configuration for two problems in detail. In particular, we focused on SAT, the most widely studied NP-complete problem, and mixed integer programming (MIP), which is important due to the widespread use of MIP formulations in industry and academia. Beyond these two, we considered additional problems to demonstrate the generality of our methods. In particular, we studied global continuous blackbox optimization and the most probable explanation problem (MPE).

3.1.1 Propositional Satisfiability (SAT)

The propositional satisfiability (SAT) problem is the prototypical NP-hard problem (Garey and Johnson, 1979). All SAT instances we used were propositional formulae in conjunctive normal form (CNF); that is, each instance is a conjunction of disjunctions of literals (negated or non-negated variables). The goal is to determine whether or not there exists a variable assignment under which the formula evaluates to TRUE. We restrict ourselves to CNF-encoded formulae since this is the widely-accepted standard input format for SAT solvers. Any propositional formula can be encoded into CNF with linear overhead by adding additional variables.

3.1.2 Mixed Integer Programming (MIP)

Mathematical programs are general optimization problems of the form min f(x) s.t. c(x), where f(·) is the objective function and c(·) expresses feasibility constraints. In integer programs, all variables xᵢ are constrained to be integer. In mixed integer programs, only a subset of the variables is constrained in that way. In the frequent case of mixed integer linear programs (MILP), f(·) is a linear function. Even with binary domains for each xᵢ, this conceptually simple problem is NP-hard. All MIP instances we used can be expressed in the rather general representation used by CPLEX 10.1.1, which we used to solve these instances:

    minimize    1/2 · xᵀ · Q · x + cᵀ · x
    subject to  A · x ⋈ b
                aᵢᵀ · x + xᵀ · Q′ᵢ · x ≤ rᵢ    for i = 1, . . . , q
                lᵢ ≤ xᵢ ≤ uᵢ                   for i = 1, . . . , n
                xᵢ is integer                  for i in a subset of {1, . . . , n},

where ⋈ may be ≥, ≤, or =, x denotes the n optimization variables, Q is an n × n matrix of quadratic coefficients of the objective function, c is a column vector of linear coefficients of the objective function, the m × n matrix A and the column vector b specify linear constraints, the n × n matrices Q′ᵢ and scalars rᵢ specify quadratic constraints, and the scalars lᵢ and uᵢ specify lower and upper bounds on xᵢ. These lower and upper bounds can be positive or negative infinity or any real number.

3.2 Target Algorithms

In this section, we first describe the SAT solvers we used, then the commercial optimization tool CPLEX for MIP, and finally the algorithms for the other two problems. Table 3.1 gives an overview of those target algorithms for which we considered a discretized parameter configuration space. Table 3.2 lists our target algorithms with only numerical parameters. Note that SAPS is contained in both tables; we performed experiments with both discretized and continuous versions of its parameter configuration space.

Algorithm     Parameter type   # params of type   # values considered   Total # configs, |Θ|
CPLEX         Categorical            50                 2–7             1.38 · 10^37
              Integer                 8                 5–7
              Continuous              5                 3–5
SPEAR         Categorical            10                 2–20            8.34 · 10^17
              Integer                 4                 5–8
              Continuous             12                 3–6
SATENSTEIN    Categorical            16                 2–13            4.82 · 10^12
              Integer                 5                 4–9
              Continuous             20                 3–10
SAT4J         Categorical             4                 3–6             567 000
              Integer                 4                 5–7
              Continuous              3                 5
SAPS          Continuous              4                 7               2 401
GLS+          Categorical             1                 2               1 680
              Integer                 2                 5–7
              Continuous              2                 4–6

Table 3.1: Target algorithms and their parameters. We list those algorithms for which we discretized all parameters, in order of decreasing complexity (measured in terms of total number of parameter configurations).

Algorithm   Parameter type   # parameters of type
SAPS        Continuous              4
CMA-ES      Integer                 1
            Continuous              3

Table 3.2: Target algorithms with only numerical parameters.
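To make the general MIP form above concrete, here is a minimal brute-force sketch (an illustration only, not how CPLEX works, and restricted to a pure-integer linear program with a bounded box; the function name is ours):

```python
import itertools

# Brute-force solver for a tiny pure-integer linear program in the general
# form above (no quadratic terms): minimize c^T x  s.t.  A x <= b and
# l <= x <= u with all x_i integer. Enumerates the bounded integer box,
# so it is only usable for toy instances.
def solve_tiny_milp(c, A, b, lb, ub):
    best_x, best_val = None, float("inf")
    ranges = [range(l, u + 1) for l, u in zip(lb, ub)]
    for x in itertools.product(*ranges):
        feasible = all(
            sum(a_ij * x_j for a_ij, x_j in zip(row, x)) <= b_i
            for row, b_i in zip(A, b))
        if feasible:
            val = sum(c_j * x_j for c_j, x_j in zip(c, x))
            if val < best_val:
                best_x, best_val = x, val
    return best_x, best_val

# maximize 3x + 2y (i.e., minimize -3x - 2y) s.t. x + y <= 4, 0 <= x, y <= 3
x, v = solve_tiny_milp([-3, -2], [[1, 1]], [4], [0, 0], [3, 3])
# optimum at x = (3, 1) with objective value -11
```

Real MIP solvers rely on branch-and-cut rather than enumeration; the sketch only serves to fix the notation.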
Throughout, we paid careful attention to selecting only high-performance solvers as target algorithms to be configured. Improving poor algorithms might be much easier, but good results for such algorithms might not imply improvements of the state of the art.

3.2.1 Target Algorithms for SAT

For SAT, we considered two local search and two tree search algorithms.

Stochastic Local Search Algorithms for SAT

SAPS This stochastic local search (SLS) SAT solver is a type of dynamic local search algorithm; that is, it dynamically changes its internal cost function during the search. SAPS does this by associating weights with the clauses of the input SAT formula. In each encountered local minimum, m, it increases the weights of the currently unsatisfied clauses, thereby changing the cost function to ensure that m is not a local minimum anymore. At each iteration, with a certain probability, clause weights are smoothed towards the uniform distribution. SAPS (short for “Scaling And Probabilistic Smoothing”) was introduced by Hutter et al. (2002); we use the UBCSAT implementation by Tompkins and Hoos (2004). When introduced in 2002, SAPS was a state-of-the-art solver, and, with appropriately-chosen parameters, it still offers high performance on many types of “crafted” and unstructured instances. We chose to study this algorithm because it is well known, it has relatively few parameters, we are intimately familiar with it, and we knew from earlier work that SAPS’s parameters can have a strong impact on its performance (Hutter et al., 2006). Its four continuous parameters control the scaling and smoothing of clause weights, as well as the percentage of random steps. Their original defaults were set through manual configuration based on experiments with prominent sets of benchmark instances; this manual experimentation kept the percentage of random steps fixed and took about one week of development time.
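The clause-weight mechanism just described can be sketched as follows. This is an illustrative approximation rather than the UBCSAT implementation, and the default values of alpha, rho, and p_smooth are placeholders:

```python
import random

# Illustrative SAPS-style clause-weight update (not the UBCSAT code; the
# default parameter values below are placeholders). In a local minimum,
# scale up the weights of the currently unsatisfied clauses; with
# probability p_smooth, smooth all weights towards their mean.
def update_weights(weights, unsat_idx, alpha=1.3, rho=0.8,
                   p_smooth=0.05, rng=random):
    for i in unsat_idx:
        weights[i] *= alpha                      # scaling step
    if rng.random() < p_smooth:                  # probabilistic smoothing
        mean = sum(weights) / len(weights)
        weights[:] = [rho * w + (1 - rho) * mean for w in weights]
    return weights
```

The roles of alpha (scaling), rho (smoothing), and p_smooth correspond to the scaling and smoothing parameters discussed in the text; the fourth SAPS parameter, the percentage of random steps, lives in the search loop and is not shown here.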
We gained more experience with SAPS’s parameters for more general problem classes in our early work on parameter optimization (Hutter et al., 2006). Informed by that work, we chose promising intervals for each parameter, including, but not centered at, the original default. For our experiments with procedures restricted to discretized configuration spaces, we picked seven possible values for each parameter, spread uniformly across its respective interval, that is, equally dividing the interval. For the one multiplicative parameter (‘α’), we picked these seven values uniformly on a log scale, that is, by an equal division of its log-transformed interval. This resulted in 2 401 possible parameter configurations (exactly the values used in Hutter et al., 2007b). As the starting configuration for configuration procedures requiring discretized values, we used the center of each parameter’s interval. We also performed configuration experiments without discretizing SAPS’s parameters. In these experiments, we used the same interval as above for each parameter, extending from the minimum to the maximum of the seven values chosen above. Since the original SAPS default is contained in this continuous space, we used it as the starting configuration.

SATENSTEIN This highly parameterized framework for stochastic local search (SLS) SAT solvers was very recently introduced by KhudaBukhsh et al. (2009). SATENSTEIN draws on components from WalkSAT-based algorithms (Selman et al., 1996), dynamic local search algorithms (Hutter et al., 2002), and G2WSAT variants (Li and Huang, 2005), all combined in a highly parameterized framework solver with a total of 41 parameters. It covers almost all state-of-the-art SLS solvers for SAT and is designed to be used in conjunction with a configuration procedure to automatically construct new SLS solvers for new types of problem instances of interest. The parameter configuration space of SATENSTEIN is highly structured.
There is one top-level parameter that decides whether to construct a dynamic local search (DLS) algorithm or a non-DLS algorithm. The subspace for DLS algorithms has 21 parameters, 17 of which are conditional parameters that are only active under certain instantiations of other (higher-level) parameters. The non-DLS subspace has 36 parameters, 29 of which are conditional. Thus, an efficient mechanism for handling conditional parameters might be important for the effective configuration of SATENSTEIN. In total, the configuration space of SATENSTEIN is of size 4.82 · 10^12. When studying 1 000 random configurations of SATENSTEIN in Chapter 4, we used 500 configurations from each of those two subspaces.

Tree Search Algorithms for SAT

SPEAR This recent tree search algorithm for SAT solving was developed by Babić (2008). It is a state-of-the-art SAT solver for industrial instances, and with appropriate parameter settings it is the best available solver for certain types of SAT-encoded hardware and software verification instances (Hutter et al., 2007a). Furthermore, configured with one of the algorithm configuration tools we introduce in this thesis (PARAMILS), SPEAR won the quantifier-free bit-vector arithmetic category of the 2007 Satisfiability Modulo Theories Competition. SPEAR has 26 parameters, including ten categorical, four integer, and twelve continuous parameters. Their default values were manually engineered by its developer, using a benchmark set of relatively small software verification and bounded model checking instances. (Manual tuning required about one week.) The categorical parameters mainly control heuristics for variable and value selection, clause sorting, and resolution ordering, and also enable or disable optimizations, such as the pure literal rule.
The continuous and integer parameters mainly deal with activity, decay, and elimination of variables and clauses, as well as with the interval of randomized restarts and the percentage of random choices. We discretized the integer and continuous parameters by choosing lower and upper bounds at reasonable values and allowing between three and eight discrete values spread relatively uniformly across the resulting interval, including the default. The number of discrete values was chosen according to our intuition about the importance of the parameter. After this discretization, there were 3.7 · 10^18 possible parameter configurations. Exploiting the fact that nine of the parameters are conditional (i.e., only relevant when other parameters take certain values) reduced this to 8.34 · 10^17 configurations. As the starting configuration for our configuration procedures, we used the default. In Section 6.3.1, we discuss SPEAR’s parameter configuration space in more detail in the context of a case study for algorithm configuration.

SAT4J This library (http://www.sat4j.org/index.php), developed by Le Berre, provides an implementation of SAT solvers in Java. It is targeted at users who would like to embed a black-box SAT solver in their application without worrying about the details. SAT4J is based on an implementation of MiniSAT (Eén and Sörensson, 2004), extended with new heuristics and learning strategies. In our experiments, we used a modified version of SAT4J version 1.5, for which all parameters had been made accessible from the command line; the version number output by the code is “OBJECTWEB.1.0.113”. (Many thanks to Daniel Le Berre for preparing this version for us.) These parameters include categorical choices of variable ordering, learning strategy, data structure, and clause minimization. One variable ordering heuristic and three learning heuristics are parameterized with one additional continuous parameter each.
Finally, three other numerical parameters govern variable decay, the first restart, and the multiplication factor of the restart interval. We only experimented with SAT4J early on and replaced it with SPEAR later, since SPEAR was typically faster and, in particular, was the best available algorithm for some important industrial benchmark distributions. However, we do see much promise in integrating automated algorithm configuration techniques with current versions of SAT4J. SAT4J’s typical usage scenario as a black-box SAT solver for users unfamiliar with SAT technology could be combined nicely with an automated option to improve the solver for the (a priori unknown) types of problem instances it faces.

3.2.2 Target Algorithm for MIP

CPLEX The most widely used commercial optimization tool for solving mixed integer programs is ILOG CPLEX. As stated on the CPLEX website (http://www.ilog.com/products/cplex/), currently over 1 300 corporations and government agencies use CPLEX, along with researchers at over 1 000 universities. CPLEX is massively parameterized, and considerable effort has been expended to engineer a good default parameter configuration: “A great deal of algorithmic development effort has been devoted to establishing default ILOG CPLEX parameter settings that achieve good performance on a wide variety of MIP models.” (ILOG CPLEX 10.0 user manual, page 247) Despite these efforts, the end user still sometimes has to experiment with parameters: “Integer programming problems are more sensitive to specific parameter settings, so you may need to experiment with them.” (ILOG CPLEX 10.0 user manual, page 130) As such, the automated configuration of CPLEX is very promising and has the potential to directly impact a large user base. For the experiments in this thesis, we used CPLEX version 10.1.1. Out of its 159 user-specifiable parameters, we identified 81 parameters that affect its search trajectory.
We did this without any prior experience with CPLEX, solely based on two days spent with its manual. Thus, we may have omitted some important parameters or included inconsequential ones. We were careful to omit all parameters that change the problem formulation (e.g., by changing the numerical accuracy of a solution). Many CPLEX parameters deal with MIP strategy heuristics (such as variable and branching heuristics, probing, dive type, and subalgorithms) and with the amount and type of preprocessing to be performed. There are also nine parameters each governing how frequently a different type of cut should be used (there are four magnitude values and the value “choose automatically”; note that this last value prevents the parameters from being ordinal). A considerable number of other parameters deal with simplex and barrier optimization, and with various other algorithm components. For categorical parameters with an automatic option, we considered all manual values as well as the automatic one. In contrast, for continuous and integer parameters with an automatic option, we chose that option instead of hypothesizing values that might work well. We also identified some parameters that primarily deal with numerical accuracy and fixed those to their default values. For the other numerical parameters, we chose up to five possible values that seemed sensible, including the default. In the end, we were left with 63 configurable parameters, leading to 1.78 · 10^38 possible configurations. Exploiting seven conditional parameters reduced this to 1.38 · 10^37 distinct configurations. As the starting configuration for our configuration procedures, we used the CPLEX default settings.
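The savings from conditional parameters noted above (for SPEAR and CPLEX alike) come from the fact that configurations differing only in an inactive parameter are identical. A small hypothetical example (these are not the real CPLEX or SPEAR parameters):

```python
import itertools

# Hypothetical example: a "depth" parameter that is only active when the
# parent "heuristic" takes value "a". Inactive parameters are ignored, so
# configurations differing only in an inactive parameter collapse into one
# distinct configuration.
def count_distinct(params, conditions):
    # params: name -> list of values
    # conditions: child -> (parent, set of activating parent values)
    names = list(params)
    seen = set()
    for combo in itertools.product(*(params[n] for n in names)):
        assign = dict(zip(names, combo))
        active = tuple(sorted(
            (n, v) for n, v in assign.items()
            if n not in conditions
            or assign[conditions[n][0]] in conditions[n][1]))
        seen.add(active)
    return len(seen)

params = {"heuristic": ["a", "b"], "depth": [1, 2, 3]}
conditions = {"depth": ("heuristic", {"a"})}
# naive product: 2 * 3 = 6; distinct configurations: 3 + 1 = 4
```

The same collapsing effect, applied to seven conditional CPLEX parameters, accounts for the reduction from 1.78 · 10^38 to 1.38 · 10^37 reported above.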
3.2.3 CMA-ES for Global Continuous Function Optimization

This prominent gradient-free global optimization algorithm for continuous functions (Hansen and Ostermeier, 1996; Hansen and Kern, 2004) is based on an evolution strategy that uses a covariance matrix adaptation scheme. CMA-ES has been used as an application domain of parameter optimization algorithms (Bartz-Beielstein et al., 2008). We used the interface those authors wrote for the SPO toolbox, which used the Matlab implementation of CMA-ES 2.54.³ As an evolution strategy, CMA-ES has two obvious parameters: the number of parents, N, and a factor ν ≥ 1 relating the number of parents to the population size. (The population size is defined as ⌊N · ν + 0.5⌋.) Furthermore, Bartz-Beielstein et al. (2008) modified CMA-ES’s interface to expose two additional parameters: the “learning rate for the cumulation for the step size control”, cσ or cs, and the damping parameter, dσ or damps (for details, see Hansen, 2006). All these parameters are numerical. We used CMA-ES as a test case for optimizing such parameters without the need for discretization.

3.2.4 GLS+ for the Most Probable Explanation (MPE) Problem

GLS+ is a guided local search algorithm for solving the most probable explanation (MPE) problem in discrete-valued graphical models, that is, the problem of finding the variable instantiation with maximal overall likelihood (Hutter et al., 2005; Hutter, 2004). It has five parameters: one binary and four continuous. The binary parameter decides whether to initialize at random or by using a weight-bounded version of the Mini-Buckets approximation algorithm (Dechter and Rish, 2003). The numerical parameters govern the scaling of clause penalties (amount and interval), the weighting factor for clause weights, and the weight bound on preprocessing with Mini-Buckets. We only performed experiments with discretized parameters (between 4 and 7 values per numerical parameter, depending on our intuition as to its importance).
This led to a parameter configuration space of size 1 680.

³The newest CMA-ES version, 3.0, differs mostly in the interface and in supporting “separable” CMA (see the change log at http://www.lri.fr/~hansen/cmaes_inmatlab.html).

3.3 Benchmark Sets

We selected our benchmark distributions according to the following principles. Firstly, we only used standard benchmark distributions that had previously been used to evaluate algorithms for the respective problem. This avoided the potential issue of creating a problem distribution that lends itself to our approach. Secondly, whenever possible, we used distributions of practical relevance, to ensure that improvements we make by automated configuration would have immediate impact. Prime examples are our two sets of industrial verification instances; some of our CPLEX distributions are also derived from industrial applications. We selected instance distributions of varying hardness. While easy instances were useful for facilitating algorithm development, we needed to verify the performance of our configurators on harder instance distributions. Along the same lines, for the evaluation of configuration procedures that are limited to single instances, we selected a variety of single instances of varying hardness. Generally, we randomly split each benchmark set 50-50 into training and test sets. All configuration runs only had access to the training set, and we report results on the test set unless explicitly stated otherwise. For benchmark “sets” that only contain one instance, we measured test performance on independent test runs with different random seeds but on the same single instance.
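The 50-50 split described above can be sketched as follows; the seeding and shuffling details are illustrative assumptions, not the exact procedure used in the thesis:

```python
import random

# Illustrative 50-50 split into training and test sets; configuration runs
# see only the training half. The fixed seed is an assumption for
# reproducibility, not the seeding used in the thesis.
def split_benchmark(instances, seed=0):
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]
```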
3.3.1 SAT Benchmarks

For our experiments on configuring the various SAT solvers discussed above, we selected structured instances from two broad categories: “crafted” instances, based on encodings of randomly-generated instances of other NP-hard problems that are prominent in SAT research, and “industrial” instances, based on encodings of industrial verification problems. For the study of configuration procedures that are limited to single problem instances, we selected instances of varying hardness within each of these categories. All instances we study contain structure that can potentially be exploited by SAT solvers. This structure is due to both the SAT encoding and the structure in the original problem.

Encodings of Randomly Generated Instances of Other Hard Problems

QCP This benchmark set contains 23 000 instances of the quasi-group completion problem (QCP), which has been widely studied by researchers in artificial intelligence and constraint programming. The objective in this problem is to determine whether the unspecified entries of a partial Latin square can be filled to obtain a complete Latin square. Latin squares play a role in applications such as scheduling, timetabling, experimental design, and error-correcting codes. They have been widely used for benchmarking purposes (Gomes and Selman, 1997). Xu generated these QCP instances around the solubility phase transition, using the parameters given by Gomes and Selman (1997). Specifically, the order n was drawn uniformly from the interval [26, 43], and the number of holes H (open entries in the Latin square) was drawn uniformly from [1.75, 2.3] · n^1.55. For use with complete SAT solvers, we sampled 2 000 of these SAT instances uniformly at random. These had on average 1 497 ± 1 094 variables (mean ± standard deviation across all instances) and 13 331 ± 12 473 clauses; 1 182 of the instances were satisfiable.
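The QCP generation parameters quoted above can be sampled as in the following sketch; constructing the partial Latin square and its SAT encoding is omitted, and the function name is ours:

```python
import random

# Sampling the QCP generation parameters described above: order n uniform
# in [26, 43], number of holes H uniform in [1.75, 2.3] * n^1.55.
def sample_qcp_params(rng):
    n = rng.randint(26, 43)
    holes = int(rng.uniform(1.75, 2.3) * n ** 1.55)
    return n, holes
```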
For use with incomplete SAT solvers, we randomly sampled 2 000 instances from the subset of satisfiable instances (determined using a complete algorithm). Their numbers of variables and clauses were very similar to those of the set used for complete solvers (1 515 ± 1 173 variables, 14 304 ± 12 528 clauses).

Single Instances QCPmed, QCPq075, and QCPq095 These three single instances are the instances at the 50%, 75%, and 95% quantiles of the subset of satisfiable instances of QCP in terms of hardness for SAPS with its default configuration. They contain 1 009, 2 367, and 2 946 variables and 7 647, 23 332, and 29 096 clauses, respectively.

Single Instance QWH This benchmark consists of a single quasigroup completion problem based on a quasigroup with randomly punched holes. (QWH differs from QCP in that QWH instances are always satisfiable by construction.) It is one of a set of 10 000 instances that Xu generated using lsencode by Gomes and Selman (1997), namely the instance at the 50% quantile of hardness for SAPS with its default configuration. This instance contains 1 077 variables and 7 827 clauses. In our experiments, it could be solved by SAPS in a median of 85 500 search steps, taking well below a second on our reference machines (see Section 3.6). It thus allowed us to perform many target algorithm runs quickly and so facilitated algorithm development. For this reason, we used this instance extensively in the early development phases of both algorithm configuration frameworks discussed in this thesis: model-free search in Part III and model-based search in Part IV.

SWGCP This benchmark set contains 20 000 instances of the graph colouring problem (GCP) based on the small-world (SW) graphs of Gent et al. (1999). Using their generator, Xu created these instances with a ring lattice size sampled uniformly at random from [100, 400], each node connected to its 10 nearest neighbours, a rewiring probability of 2^-7, and chromatic number 6.
We sampled 2 000 of these instances uniformly at random for use with complete SAT solvers. These had 1 813 ± 703 variables and 13 902 ± 5 393 clauses; 1 109 of the instances were satisfiable. For use with incomplete local search solvers, we randomly sampled 2 000 satisfiable instances (again, determined using a complete SAT algorithm). Their numbers of variables and clauses were very similar to those in the set for complete solvers (1 958 ± 646 variables, 15 012 ± 4 953 clauses).

Single Instances SWGCPmed, SWGCPq075, and SWGCPq095 These three single instances are the instances at the 50%, 75%, and 95% quantiles of the subset of satisfiable instances of SWGCP in terms of hardness for SAPS with its default configuration. They contain 2 616, 2 586, and 1 956 variables and 20 056, 19 826, and 14 996 clauses, respectively. (Note that although instance hardness often correlates with instance size, this is not a strict rule.)

Structured Instances from Industrial Verification Problems

SWV This set of SAT-encoded software verification instances comprises 604 instances generated with the CALYSTO static checker (Babić and Hu, 2007b), used for the verification of five programs: the spam filter Dspam, the SAT solver HyperSAT, the Wine Windows OS emulator, the gzip archiver, and a component of xinetd (a secure version of inetd). These instances contain 64 416 ± 53 912 variables and 195 058 ± 174 534 clauses. (We only employed complete solvers for these instances, and thus did not create a separate subset of satisfiable instances.)

Single Instances SWVmed, SWVq075, and SWVq095 These three single instances are the instances at the 50%, 75%, and 95% quantiles of SWV in terms of hardness for SPEAR with its default configuration. They contain 92 737, 72 596, and 114 395 variables and 273 793, 214 459, and 370 825 clauses, respectively.
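Picking single instances at hardness quantiles, as done for the QCPmed, SWGCPmed, and SWVmed families above, can be sketched like this; the sorting and indexing conventions are assumptions, with hardness standing for, e.g., default-configuration runtime:

```python
# Picking benchmark instances at hardness quantiles. hardness is a list of
# (instance_name, measured_hardness) pairs, e.g. runtime of the default
# configuration; the indexing convention (floor of q * N, clamped) is an
# assumption, not the thesis's exact rule.
def quantile_instances(hardness, quantiles=(0.50, 0.75, 0.95)):
    ranked = sorted(hardness, key=lambda pair: pair[1])
    picks = []
    for q in quantiles:
        idx = min(int(q * len(ranked)), len(ranked) - 1)
        picks.append(ranked[idx][0])
    return picks
```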
BMC This set of SAT-encoded bounded model checking instances comprises 765 instances generated by Zarpas (2005); these instances were selected as the instances in 40 randomly-selected folders from the IBM Formal Verification Benchmarks Library. These instances contain an average of 91 041 variables and 386 171 clauses, with respective standard deviations of 149 815 and 646 813. Some of the instances in this set are extremely hard; the largest instance contains 1 400 478 variables and 5 502 329 clauses. (As above, we only employed complete solvers for these instances and did not create a separate subset of satisfiable instances.)

Single Instances IBMq025 and IBMmed These two single instances are the instances at the 25% and 50% quantiles of BMC in terms of hardness for SPEAR with its default configuration. They contain 45 853 and 29 725 variables and 151 611 and 125 883 clauses, respectively. (Unlike in the case of the other single instances, we did not use instances at the 75% and 95% quantiles of this distribution, since those could not be solved by any parameter configuration we studied.)

3.3.2 MIP Benchmarks

For our configuration experiments with CPLEX, we considered a variety of benchmark sets that have previously been used in computational experiments with CPLEX. As in the case of SAT, we used encodings of another NP-hard problem prominent in the evaluation of MIP algorithms, as well as prominent instance types based on industrial applications.

Benchmarks from the Combinatorial Auction Test Suite

In some types of auctions, bidders can bid on sets of goods rather than on single goods. The winner determination problem in such combinatorial auctions is to determine the revenue-maximizing allocation of goods to bidders. This problem is NP-hard (Rothkopf et al., 1998), and instances of it can easily be encoded as MILPs. We created two instance sets using the generator provided with the Combinatorial Auction Test Suite (Leyton-Brown et al., 2000).
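The winner determination problem has a standard 0-1 encoding (one binary variable per bid, at most one accepted bid per good); the brute-force sketch below is only meant to make that encoding concrete, not to stand in for the MILP solved by CPLEX:

```python
import itertools

# Winner determination as a 0-1 program: x_b = 1 iff bid b is accepted.
# Maximize total price of accepted bids, subject to each good appearing in
# at most one accepted bid. Brute force over bid subsets (toy sizes only).
def solve_wdp(bids):
    # bids: list of (price, set_of_goods) pairs
    best_val, best_x = 0, ()
    for x in itertools.product((0, 1), repeat=len(bids)):
        goods = [g for keep, (_, gs) in zip(x, bids) if keep for g in gs]
        if len(goods) == len(set(goods)):        # no good sold twice
            val = sum(p for keep, (p, _) in zip(x, bids) if keep)
            if val > best_val:
                best_val, best_x = val, x
    return best_val, best_x

bids = [(8, {"a", "b"}), (6, {"b", "c"}), (5, {"c"})]
# accepting bids 0 and 2 is optimal: revenue 8 + 5 = 13
```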
These two sets are very similar, differing only in instance size, which allows us to study how our methods scale to larger instances. Instances based on this generator have been used before to model CPLEX performance and to perform per-instance algorithm selection (Leyton-Brown et al., 2009).

Regions100 For this benchmark set, we generated 2 000 MILPs with the above generator, using the regions option with the ‘goods’ parameter set to 100 and the ‘bids’ parameter set to 500. The resulting MILP instances contained 501 variables and 193 inequalities on average, with a standard deviation of 1.7 variables and 2.5 inequalities.

Regions200 This set is very similar to the Regions100 set, but its instances are much larger. We generated 2 000 MILP instances as above, but now with the ‘goods’ parameter set to 200 and the ‘bids’ parameter set to 1 000. These instances contain an average of 1 002 variables and 385 inequalities, with respective standard deviations of 1.7 and 3.4.

Benchmarks from the Berkeley Computational Optimization Lab

We obtained a variety of instance sets from the Berkeley Computational Optimization Lab. We only used sets of instances that were large and homogeneous enough to be split into disjoint training and test sets while still having the training set be quite representative.

MJA This benchmark set, introduced by Aktürk et al. (2007), comprises 343 machine-job assignment instances encoded as mixed integer quadratically constrained programs (MIQCP). These instances contain 2 769 ± 2 133 variables and 2 255 ± 1 592 constraints.

CLS This set comprises 100 capacitated lot-sizing instances encoded as mixed integer linear programs (MILP). It was introduced by Atamtürk and Muñoz (2004). All 100 instances contain 181 variables and 180 constraints.

MIK This set of 120 MILP-encoded mixed-integer knapsack instances