Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Analysing the empirical time complexity of high-performance algorithms for SAT and TSP Mu, Zongxu 2015

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
24-ubc_2016_february_mu_zongxu.pdf [ 1.29MB ]
Metadata
JSON: 24-1.0215912.json
JSON-LD: 24-1.0215912-ld.json
RDF/XML (Pretty): 24-1.0215912-rdf.xml
RDF/JSON: 24-1.0215912-rdf.json
Turtle: 24-1.0215912-turtle.txt
N-Triples: 24-1.0215912-rdf-ntriples.txt
Original Record: 24-1.0215912-source.json
Full Text
24-1.0215912-fulltext.txt
Citation
24-1.0215912.ris

Full Text

Analysing the Empirical Time Complexityof High-performance Algorithms for SAT and TSPbyZongxu MuB.Sc. (Hons), City University of Hong Kong, 2013A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinThe Faculty of Graduate and Postdoctoral Studies(Computer Science)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)November 2015© Zongxu Mu 2015AbstractThe time complexity of problems and algorithms, i.e., the scaling of the time re-quired for solving a problem instance as a function of instance size, is of key in-terest in theoretical computer science and practical applications. In this context,the propositional satisfiability (SAT) and the travelling salesperson (TSP) are twoof the most intensely studied problems, and it is generally believed that solvingSAT or TSP requires exponential time in the worst case. In this work, we refineand extend a recent empirical scaling analysis approach and study the empiricalscaling of the running times of several prominent, high-performance SAT and TSPalgorithms. For SAT, we focus on random 3-SAT instances from the phase trans-ition region, arguably the most prominent model for difficult SAT instances, andobtain interesting and surprising scaling results for both SLS- and DPLL-basedsolvers. Particularly, we find solid support for polynomial scaling for SLS-basedsolvers on phase-transition random 3-SAT instances. We also show that DPLL-based solvers scale exponentially and are faster by only a constant factor in solv-ing satisfiable instances compared to unsatisfiable instances. We further reportempirical scaling results for two classes of random 4-SAT instances to gain addi-tional insights into the performance of state-of-the-art SAT solvers. For TSP, weconcentrate on two-dimensional random uniform Euclidean (RUE) instances, andcharacterise the scaling of running time for complete and incomplete algorithmsfor finding optimal solutions. Our results indicate that the scaling of all these al-gorithms is consistent with or bounded from above by root-exponential models ofthe form a ·b√n. We also explored the impact of automated algorithm configurationon the scaling of these algorithms. Since our approach is applicable beyond SATand TSP, to enable its broad use, we designed Empirical Scaling Analyser (ESA),an automated tool that can be conveniently used to study empirical scaling of manytypes of algorithms. In particular, ESA presents scaling analysis results in the formof automatically generated, detailed technical reports. Many results reported in thisthesis, including most tables and figures, were automatically generated by ESA andare only slightly modified to fit here.iiPrefaceThis thesis covers several publications as well as manuscripts not yet published,which I co-authored with others.A major part of the SAT-related work is published in [52], where I performedall experiments, including those for collection of running time data, modelling ofphase transition, and analysis of scaling behaviours of the solvers. I also wrote thefirst draft of the paper, including preparation of the tables and the figures.Work on the scaling of TSP solvers has been prepared for publication (witha complete draft finished in parallel with this thesis) [54]. I performed scalinganalysis on the running time data collected by Dubois-Lacoste and Stützle, andwrote the first draft of the paper. For work on impact of automated configurationof TSP solvers, I ran the configuration experiments, collected running time dataand performed the scaling analysis. Thomas Stützle helped prepare the parameterfiles of the solvers, and answered my questions regarding these parameters.I was heavily involved in the conceptual design of ESA [53], and was respons-ible for the development of its code and website.Hereafter, I adopt the plural subject “we” in recognition of the collaborationwith Prof. Hoos and others.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviiiDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Empirical Analysis of Algorithms . . . . . . . . . . . . . . . . . 11.2 The Propositional Satisfiability Problem . . . . . . . . . . . . . . 21.3 The Travelling Salesperson Problem . . . . . . . . . . . . . . . . 41.4 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 52 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.1 Related Work on Random k−SAT . . . . . . . . . . . . . . . . . 82.2 Related Work on TSP . . . . . . . . . . . . . . . . . . . . . . . 92.3 Related Work on Empirical Scaling Analysis . . . . . . . . . . . 103 Empirical Scaling Methodology and Extensions . . . . . . . . . . . 123.1 Empirical Scaling Methodology . . . . . . . . . . . . . . . . . . 123.2 Extensions to the Methodology . . . . . . . . . . . . . . . . . . 144 Empirical Scaling Analyser:An Automated System for Scaling Analysis . . . . . . . . . . . . . . 164.1 Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164.2 Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20ivTable of Contents4.3 Automated Interpretation of Scaling Results . . . . . . . . . . . . 254.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 264.5 Downloading and Running ESA . . . . . . . . . . . . . . . . . . 285 Empirical Time Complexity of Random k-SAT . . . . . . . . . . . . 315.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 315.2 Location of Phase Transition . . . . . . . . . . . . . . . . . . . . 335.3 Empirical Scaling of Solver Performance on Random 3-SAT atPhase Transition . . . . . . . . . . . . . . . . . . . . . . . . . . 365.4 Empirical Scaling of Solver Performance on Random 4-SAT . . 485.5 Experiments for the Survey Propagation Algorithm . . . . . . . . 565.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . 576 Empirical Scaling of Running Time of TSP Solvers . . . . . . . . . 606.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 606.2 Empirical Scaling of Running Time of Concorde for Finding Op-timal Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . 626.3 Empirical Scaling of Running Time of EAX and LKH . . . . . . 666.4 Impact of Automated Configuration on Scaling of EAX and LKH 706.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . 857 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . 877.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 877.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91AppendicesA LATEX Template for ESA . . . . . . . . . . . . . . . . . . . . . . . . 97B Gnuplot Templates for ESA . . . . . . . . . . . . . . . . . . . . . . 106B.1 Gnuplot Template for Plotting Models . . . . . . . . . . . . . . . 106B.2 Gnuplot Template for Plotting Residue Curves . . . . . . . . . . 106vList of Tables4.1 Example output of ESA – statistics of running times for support data. 204.2 Example output of ESA – statistics of running times for challengedata. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.3 Example output of ESA – fitted models and corresponding RMSEvalues. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.4 Example output of ESA – bootstrap confidence intervals for allmodel parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . 224.5 Example output of ESA – bootstrap confidence intervals for ob-served and predicted running times for support data. . . . . . . . . 234.6 Example output of ESA – bootstrap confidence intervals for ob-served and predicted running times for support data. . . . . . . . . 245.1 Lower and upper bounds on the location of the phase transition(mc/n) for 3-SAT determined with 95% confidence, and predic-tions from the two models discussed in the text. In parentheses:number of instances used for determining the bound. Model pre-dictions inconsistent with our lower bounds are marked with aster-isks (*). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355.2 Exclusive lower and upper bounds of phase transition points for 4-SAT determined with 95% confidence, and predictions from themodel discussed in the text. The number after the lower/upperbounds (in parentheses) are the least number of instances we haveused to conclude with 95% confidence that the ratio is away fromthe phase transition point. . . . . . . . . . . . . . . . . . . . . . . 375.3 We used UniformKSAT to generate 12 sets of random 3-SAT in-stances of different sizes, at clause-to-variable ratios computed asequation 5.2. This table summarises the numbers of variables andof clauses, and the clause-to-variable ratios. . . . . . . . . . . . . 38viList of Tables5.4 Fitted models for median running times of SLS-based solvers solv-ing phase-transition random 3-SAT instances and the correspond-ing RMSE values (in CPU sec). Model parameters are shown with6 significant digits, and RMSEs with 4 significant digits; the mod-els yielding more accurate predictions (as per RMSEs on challengedata) are shown in boldface. . . . . . . . . . . . . . . . . . . . . 395.5 95% bootstrap confidence intervals for observed and predicted run-ning times of SLS-based solvers solving random 3-SAT instances.The instance sizes shown here are larger than those used for fit-ting the models. Bootstrap intervals on predictions that agree withthe observed point estimates are shown in boldface, and those thatfully contain the confidence intervals on observations are markedby asterisks (*). . . . . . . . . . . . . . . . . . . . . . . . . . . . 405.6 Bootstrap intervals of model parameters for median running timesof SLS-based solvers solving phase-transition random 3-SAT in-stances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405.7 Fitted models for 0.75-quantile of the running times of SLS-basedsolvers solving phase-transition random 3-SAT instances and thecorresponding RMSE values (in CPU sec). Model parameters areshown with 5 significant digits and RMSEs with 4 significant di-gits; the models yielding more accurate predictions (as per RMSEson challenge data) are shown in boldface. . . . . . . . . . . . . . 425.8 Fitted models for median running times of the DPLL-based solverssolving phase-transition random 3-SAT instances and the corres-ponding RMSE values (in CPU sec). Model parameters are shownwith 6 significant digits, and RMSEs with 4 significant digits; themodels yielding more accurate predictions (as per RMSEs on chal-lenge data) are shown in boldface. . . . . . . . . . . . . . . . . . 435.9 95% bootstrap confidence intervals for observed running times ofDPLL-based solvers solving random 3-SAT instances. The in-stance sizes shown here are larger than those used for fitting themodels. Bootstrap intervals on predictions that agree with the ob-served point estimates are shown in boldface, and those that fullycontain the confidence intervals on observations are marked by as-terisks (*). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435.10 Bootstrap intervals of model parameters for median running timesof DPLL-based solvers solving phase-transition random 3-SAT in-stances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46viiList of Tables5.11 Challenging the models with running times for large 3-SAT in-stances. For the bootstrap intervals of the predicted median runningtimes, those overlapping with the observed confidence intervals arein boldface, those that agree with the observed point estimates aremarked by #, and those agreeing with the observed confidence in-tervals are followed by *. . . . . . . . . . . . . . . . . . . . . . . 485.12 Fitted models for median running times of the SLS-based solverssolving phase-transition random 4-SAT instances and the corres-ponding RMSE values (in CPU sec). The models yielding the mostaccurate predictions (as per RMSEs on challenge data) are shownin boldface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495.13 Bootstrap intervals of model parameters for median running timesof SLS-based solvers solving phase-transition random 4-SAT in-stances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495.14 95% bootstrap confidence intervals for observed running times ofSLS-based solvers solving phase-transition random 4-SAT instances.The instance sizes shown here are larger than those used for fittingthe models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505.15 95% bootstrap confidence intervals for predicted median runningtime of SLS-based solvers solving phase-transition random 4-SATinstances. The instance sizes shown here are larger than those usedfor fitting the models. Bootstrap intervals on predictions that over-lap with the corresponding bootstrap interval on observed data areshown in boldface, those that agree with the observed point es-timates are marked by sharps (#), and those that fully contain theconfidence intervals on observations are marked by asterisks (*). . 515.16 Fitted models for median running times of the DPLL-based solv-ers solving phase-transition random 4-SAT instances and the cor-responding RMSE values (in CPU sec). The models yielding moreaccurate predictions (as per RMSEs on challenge data) are shownin boldface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535.17 Bootstrap intervals of model parameters for median running timesof DPLL-based solvers solving phase-transition random 4-SAT in-stances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53viiiList of Tables5.18 95% bootstrap confidence intervals for observed running times ofDPLL-based solvers solving phase-transition random 4-SAT in-stances. The instance sizes shown here are larger than those usedfor fitting the models. Bootstrap intervals on predictions that over-lap with the corresponding bootstrap interval on observed data areshown in boldface, those that agree with the observed point es-timates are marked by sharps (#), and those that fully contain theconfidence intervals on observations are marked by asterisks (*). . 545.19 Fitted models for median running times of the solvers solving less-constrained random 4-SAT instances and corresponding RMSE val-ues (in CPU sec). The models yielding more accurate predictions(as per RMSEs on challenge data) are shown in boldface. . . . . . 545.20 Bootstrap intervals of model parameters for median running timeof the solvers solving less-constrained random 4-SAT instances. . 545.21 95% bootstrap confidence intervals for observed running times onless-constrained random 4-SAT instances. The instance sizes shownhere are larger than those used for fitting the models.. . . . . . . . 555.22 95% bootstrap confidence intervals for predicted running timeson less-constrained random 4-SAT instances. The instance sizesshown here are larger than those used for fitting the models. Boot-strap intervals on predictions that overlap with the correspondingbootstrap interval on observed data are shown in boldface, thosethat agree with the observed point estimates are marked by sharps(#), and those that fully contain the confidence intervals on obser-vations are marked by asterisks (*). . . . . . . . . . . . . . . . . . 555.23 Number of the satisfiable instances of different sizes, and the num-ber and percentage of instances that cannot be solved in at least 3out of 5 independent runs of the survey propagation algorithm. . . 596.1 Fitted models for the medians of the running times of Concordeand the corresponding RMSE values (in CPU sec). The modelsyielding more accurate predictions (as per RMSEs on challengedata) are shown in boldface. . . . . . . . . . . . . . . . . . . . . 636.2 95% bootstrap confidence intervals for the parameters of the scal-ing models for Concorde’s running times to find optimal solutionsof RUE instances. . . . . . . . . . . . . . . . . . . . . . . . . . . 636.3 95% bootstrap confidence intervals for the medians of observedrunning times for Concorde to find optimal solutions of RUE in-stances. The instance sizes shown here are larger than those usedfor fitting the models. . . . . . . . . . . . . . . . . . . . . . . . . 64ixList of Tables6.4 95% bootstrap confidence intervals for the medians of the runningtime predictions for Concorde to find optimal solutions of RUEinstances. The instance sizes shown here are larger than those usedfor fitting the models. Bootstrap intervals on predictions that areweakly consistent with the observed data are shown in boldface,those that are strongly consistent are marked by sharps (#), andthose that fully contain the confidence intervals on observationsare marked by asterisks (*). . . . . . . . . . . . . . . . . . . . . . 646.5 95% bootstrap confidence intervals for the medians of the observedand predicted overall running times for Concorde to solve RUEinstances. Bootstrap intervals on predictions that are weakly con-sistent with the observed finding times are shown in boldface, thosethat are strongly consistent are marked by sharps (#), and those thatfully contain the confidence intervals on observations are markedby asterisks (*). . . . . . . . . . . . . . . . . . . . . . . . . . . . 646.6 Fitted models for the medians of the running times of EAX andLKH and the corresponding RMSE values (in CPU sec). The mod-els yielding more accurate predictions (as per RMSEs on challengedata) are shown in boldface. . . . . . . . . . . . . . . . . . . . . 676.7 95% bootstrap confidence intervals for the model parameters of thescaling models for EAX’s and LKH’s running times to find optimalsolutions of RUE instances. . . . . . . . . . . . . . . . . . . . . . 676.8 95% bootstrap confidence intervals for the medians of the observedrunning times for EAX and LKH to find optimal solutions of RUEinstances. The instance sizes shown here are larger than those usedfor fitting the models. . . . . . . . . . . . . . . . . . . . . . . . . 686.9 95% bootstrap confidence intervals for the medians of the runningtime predictions for EAX and LKH to find optimal solutions ofRUE instances. The instance sizes shown here are larger than thoseused for fitting the models. Bootstrap intervals on predictions thatare weakly consistent with the observed data are shown in bold-face, those that are consistent are marked by plus signs (+), thosethat are strongly consistent are marked by sharps (#), and those thatfully contain the confidence intervals on observations are markedby asterisks (*). . . . . . . . . . . . . . . . . . . . . . . . . . . . 686.10 Fitted models for the medians of the running times of EAX withconfigured parameters and the corresponding RMSE values (in CPUsec). The models yielding more accurate predictions (as per RMSEson challenge data) are shown in boldface. . . . . . . . . . . . . . 72xList of Tables6.11 95% bootstrap confidence intervals for the parameters of the scal-ing models for EAX’s running times (with configured parameters)to find optimal solutions of RUE instances. . . . . . . . . . . . . 726.12 Fitted models for the medians of the running times of EAX with de-fault parameters and varying population size and the correspondingRMSE values (in CPU sec). The models yielding more accuratepredictions (as per RMSEs on challenge data) are shown in boldface. 756.13 95% bootstrap confidence intervals for the parameters of the scal-ing models for EAX’s running times (with default parameters andvarying population size) to find optimal solutions of RUE instances. 756.14 Fitted models for the medians of the running times of EAX withconfigured parameters and varying population size and the corres-ponding RMSE values (in CPU sec). The models yielding moreaccurate predictions (as per RMSEs on challenge data) are shownin boldface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766.15 95% bootstrap confidence intervals for the parameters of the scal-ing models for EAX’s running times (with configured parametersand varying population size) to find optimal solutions of RUE in-stances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766.16 List of numerical (N) and categorical (C) parameters for LKH thatare configurable. . . . . . . . . . . . . . . . . . . . . . . . . . . 796.17 Fitted models for the medians of the running times of LKH withconfigured parameters from experiments following the standardprotocol and the corresponding RMSE values (in CPU sec). Themodels yielding more accurate predictions (as per RMSEs on chal-lenge data) are shown in boldface. . . . . . . . . . . . . . . . . . 816.18 95% bootstrap confidence intervals for the parameters of the scal-ing models for EAX’s running times (with configured parameters)to find optimal solutions of RUE instances. . . . . . . . . . . . . 816.19 Fitted models for the medians of the running times of LKH withconfigured parameters from experiments following the scaling pro-tocol and the corresponding RMSE values (in CPU sec). The mod-els yielding more accurate predictions (as per RMSEs on challengedata) are shown in boldface. . . . . . . . . . . . . . . . . . . . . 826.20 95% bootstrap confidence intervals for the parameters of the scal-ing models for EAX’s running times (with configured parameters)to find optimal solutions of RUE instances . . . . . . . . . . . . . 82xiList of Tables6.21 Fitted models for the medians of the running times of LKH withconfigured parameters from experiments following the scaling pro-tocol but with less parameters and the corresponding RMSE values(in CPU sec). The models yielding more accurate predictions (asper RMSEs on challenge data) are shown in boldface. . . . . . . . 836.22 95% bootstrap confidence intervals for the parameters of the scal-ing models for EAX’s running times (with configured parameters)to find optimal solutions of RUE instances. . . . . . . . . . . . . 84xiiList of Figures1.1 Typical scenario for the process of empirical scaling analysis (fig-ure taken from [59]). Here, Parkes and Walser studied the empir-ical scaling of number of flips taken by WalkSAT/SKC to solverandom 3-SAT instances at phase transition. They fitted two func-tions, f (n) = f1 ·n f2+ f3 log(n) and g(n) = g1 ·exp(ng2 · (1+g3/n)), tothe mean number of flips. . . . . . . . . . . . . . . . . . . . . . . 33.1 The methodology for empirical scaling analysis. . . . . . . . . . . 134.1 Excerpt of an input file for ESA, where deleted lines are represen-ted by “...”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184.2 Example of a configuration file for ESA. . . . . . . . . . . . . . . 194.3 An example of model specification for ESA. . . . . . . . . . . . . 194.4 Example output of ESA – a figure of running times, fitted modelsand corresponding bootstrap confidence intervals of each model. . 224.5 Example output of ESA – a figure of residues of the fitted models. 254.6 Snapshot of the technical report generated by ESA. . . . . . . . . 264.7 Flow diagram on how ESA automatically interprets the fitting res-ults. Detailed definitions of the conditions are given in the maintext. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274.8 The user interface of ESA as a web service. . . . . . . . . . . . . 305.1 Empirical bounds on the location of the phase transition points for3-SAT, mc/n for different n, data used for fitting our model (eq.5.2) and model predictions. Data for n > 500 is based on heuristicestimates of the fraction of satisfiable instances and may underes-timate the true mc/n. . . . . . . . . . . . . . . . . . . . . . . . . 365.2 Empirical bounds on the location of the phase transition points for4-SAT, mc/n for different n, data used for fitting our model (eq.5.3) and model predictions. Data for n≥ 200 is based on heuristicestimates of the fraction of satisfiable instances and may underes-timate the true mc/n. . . . . . . . . . . . . . . . . . . . . . . . . 38xiiiList of Figures5.3 Fitted models for the median running times (in CPU-sec). Bothmodels are fitted with the median running times of WalkSAT/SKCsolving the satisfiable instances from the set of 1200 random 3-SAT instances of 200, 250, ..., 500 variables, and are challengedby median running times of 600, 700, ..., 1000 variables. . . . . . 395.4 Comparing the scaling model for BalancedZ with that for Walk-SAT/SKC on solving 3-SAT instances. Both models are fitted withthe median running times of WalkSAT/SKC or BalancedZ solving3-SAT instances of 200, 250, ..., 500 variables, and are challengedby median running times of 600, 700, ..., 1000 variables. Similarresults can also been obtained by doing the comparison in the otherway around. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415.5 Fitted models for the 0.75-quantiles of the running times (in CPU-sec). Both models are fitted with the 0.75 quantiles of the runningtimes of WalkSAT/SKC solving the satisfiable instances from theset of 1200 random 3-SAT instances of 200, 250, ..., 500 variables,and are challenged by median running times of 600, 700, ..., 1000variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445.6 Fitted models for the median running times of kcnfs [CPU-sec].Both models are fitted with the median running times of kcnfs solv-ing 1200 random 3-SAT instances of 200, 250, ..., 400 variables,and are challenged by median running times of 450, 500 and 550variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455.7 Comparing scaling models of kcnfs and march_hi on solving 3-SAT instances. Both models are fitted with the median runningtimes of kcnfs or march_hi solving 1200 3-SAT instances of 200,250, ..., 400 variables, and are challenged by median running timesof 450, 500 and 550 variables. . . . . . . . . . . . . . . . . . . . 465.8 Scaling of the median running time of march_hi on satisfiable vsunsatisfiable 3-SAT instances. We show the scaling model andbootstrap confidence intervals for performance on satisfiable 3-SAT instances, along with observations on support and challengedata for satisfiable and unsatisfiable instances. . . . . . . . . . . 475.9 Fitted models for median running times of BalancedZ. Both mod-els are fitted to the median running times of BalancedZ solving the4-SAT instances from the set of phase-transition random instanceswith 70, 80, ... 150 variables, and are challenged by median run-ning times for instances with 160, 170, ... 200, 225, ... 300 vari-ables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52xivList of Figures5.10 Fitted models for median running times of march_hi. Both modelsare fitted with the median running times of march_hi solving the4-SAT instances from the set of phase-transition random instanceswith 40, 50, ... 120 variables, and are challenged by median run-ning times for instances with 130, 140, ... 190 variables. . . . . . 565.11 Fitted models for the median running times of WalkSAT/SKC. Bothmodels are fitted to the median running times of WalkSAT/SKCsolving the less-constrained 4-SAT instances from the set of ran-dom instances with 6000,7000, · · · ,15000 variables, and are chal-lenged by median running times with ≥ 16000 variables. . . . . . 575.12 Fitted models for the median running times of kcnfs. Both mod-els are fitted to the median running times of kcnfs solving theless-constrained 4-SAT instances from the set of random instanceswith 300, · · · ,450 variables, and are challenged by median runningtimes for instances with 500,550,600 variables. . . . . . . . . . . 586.1 Fitted models for the medians of the running times for Concordeto find optimal solutions of RUE instances without proving optim-ality. All models are fitted with the medians of the running timesof Concorde solving the RUE instances from the set of instances500 ≤ n ≤ 1500 variables, and are challenged by the medians ofthe running times of 2000≤ n≤ 4500 variables. . . . . . . . . . . 656.2 Fitted models for the medians of the running times for EAX tofind optimal solutions of RUE instances. All models are fitted withthe medians of the running times of EAX solving the set of RUEinstances of 500 ≤ n ≤ 1500 variables, and are challenged by themedians of the running times of 2000≤ n≤ 4500 variables. . . . 696.3 Fitted models for the medians of the running times for LKH tofind optimal solutions of RUE instances. All models are fitted withthe medians of the running times of LKH solving the set of RUEinstances of 500 ≤ n ≤ 1500 variables, and are challenged by themedians of the running times of 2000≤ n≤ 4500 variables. . . . 706.4 Fitted one-parameter models (of the form a ·b√nConcorde) for the me-dians of the running times for EAX (top) and LKH (bottom) tofind optimal solutions of RUE instances. All models are fittedwith the medians of the running times of the solvers solving theRUE instances from the set of instances of 500 ≤ n ≤ 1500 vari-ables, and are challenged by the medians of the running times of2000≤ n≤ 4500 variables. . . . . . . . . . . . . . . . . . . . . . 71xvList of Figures6.5 Fitted models for the medians of the running times for EAX withconfigured parameters to find optimal solutions of RUE instances.The parameters of EAX are set to the optimised values determinedby configuration on instances with 1500 variables using SMAC.All models are fitted with the medians of the running times ofEAX solving the SAT instances from the set of RUE instances of500 ≤ n ≤ 1500 variables, and are challenged by the medians ofthe running times of 2000≤ n≤ 4500 variables. . . . . . . . . . . 736.6 Best fitted models for the medians of the running times for EAXwith configured and default parameters to find optimal solutionsof RUE instances. All models are fitted with the medians of therunning times of EAX solving the SAT instances from the set ofRUE instances of 500≤ n≤ 1500 variables, and are challenged bythe medians of the running times of 2000≤ n≤ 4500 variables. . 746.7 Fitted models for the medians of the running times of EAX to findoptimal solutions of RUE instances. The parameters of EAX areset to the default values, and the population size is a simple func-tion of instance size. All models are fitted with the medians of therunning times of EAX solving the SAT instances from the set ofRUE instances of 500≤ n≤ 1500 variables, and are challenged bythe medians of the running times of 2000≤ n≤ 4500 variables. . 766.8 Best fitted models for the medians of the running times for EAXwith configured and default parameters to find optimal solutionsof RUE instances. All models are fitted with the medians of therunning times of EAX solving the SAT instances from the set ofRUE instances of 500≤ n≤ 1500 variables, and are challenged bythe medians of the running times of 2000≤ n≤ 4500 variables. . 776.9 Fitted models for the medians of the running times of EAX to findoptimal solutions of RUE instances. The parameters of EAX areset to the optimised values, and the population size is a simplefunction of instance size. All models are fitted with the medians ofthe running times of EAX solving the SAT instances from the setof RUE instances of 500≤ n≤ 1500 variables, and are challengedby the medians of the running times of 2000≤ n≤ 4500 variables. 786.10 Best fitted models for the medians of the running times for EAXwith configured and default parameters to find optimal solutionsof RUE instances. All models are fitted with the medians of therunning times of EAX solving the SAT instances from the set ofRUE instances of 500≤ n≤ 1500 variables, and are challenged bythe medians of the running times of 2000≤ n≤ 4500 variables. . 80xviList of Figures6.11 Fitted models for the medians of the running times for LKH withconfigured parameters from experiments following the standardprotocol. All models are fitted with the medians of the runningtimes of LKH solving the set of RUE instances of 500≤ n≤ 1500variables, and are challenged by the medians of the running timesof 2000≤ n≤ 4500 variables. . . . . . . . . . . . . . . . . . . . 816.12 Fitted models for the medians of the running times of LKH withconfigured parameters from experiments following the scaling pro-tocol. All models are fitted with the medians of the running timesof LKH solving the set of RUE instances of 500 ≤ n ≤ 1500 vari-ables, and are challenged by the medians of the running times of2000≤ n≤ 4500 variables. . . . . . . . . . . . . . . . . . . . . . 836.13 Fitted models for the medians of the running times of LKH withconfigured parameters from experiments following the scaling pro-tocol but with less parameters. All models are fitted with the medi-ans of the running times of LKH solving the set of RUE instancesof 500≤ n≤ 1500 variables, and are challenged by the medians ofthe running times of 2000≤ n≤ 4500 variables. . . . . . . . . . . 84xviiAcknowledgementsI would like to start by thanking Prof. Holger Hoos, my supervisor, for his capablesupervision, enduring guidance and unceasing encouragement during the pursuit ofmy Master’s degree in Computer Science. With both a vision of the big picture anda focus in details, he kept me on track by reminding me of the scientific questionsthat we are investigating, encouraged me to work harder by pointing out the impactthat my work can have, and was always ready to dive into the data and analyse theproblems. Prof. Hoos is extremely knowledgeable, and always encouraged me toformulate my own ideas and test them. He also shared with me his thoughts andcomments on scientific writing, presentation and communication.My collaborators at Université Libre de Bruxelles, Thomas Stützle and JérémieDubois-Lacoste, were extremely supportive. They have answered numerous ques-tions from me on the TSP solvers and on the running time data and helped organiseour data in a better way.For work on empirical time complexity of phase-transition random 3-SAT, Ithank the anonymous reviewers of our IJCAI paper for their thoughtful commentsand suggestions. I also thank Dimitris Achlioptas, Paul Beame, Henry Kautz, Don-ald Knuth, Bart Selman and Moshe Vardi for valuable advice regarding previouswork on the 3-SAT phase transition and empirical performance of SAT algorithms,as well as Alfredo Braunstein and Riccardo Zecchina for discussions related to thesurvey propagation solver.I also owe my gratitude to members of the β -lab in the CS department: Prof.Kevin Leyton-Brown, Chris Cameron, Chris Fawcett, Alexandre Fréchette, JulietaMartinez, Steve Ramage, Shu Yang and others, for the insightful discussions andvaluable support.Lastly but not the least, I want to thank my fellow classmates in the department,especially Rui Ge, Yidan Liu, Yifan Peng, Michael M.A. Wu, and Ben Zhu, formaking my graduate school an unforgettable experience, and my family and friendsfor their unconditioned support and encouragement.xviiiDedicationTo my parents, Yutong Zhao and Shanwen Mu.xixChapter 1IntroductionIn theoretical computer science, time complexity is arguably the most important as-pect for analysing and understanding problems and algorithms. Time complexityof an algorithm is the scaling of the time required for solving a problem instance asa function of instance size. Traditionally studied via rigorous mathematical meth-ods, complexity results are usually studied and understood via a family of functionsof instance size, which describe the asymptotic scaling of operation counts usingnotations such as O(n) or Θ(n). Essentially, such results are compact descriptorsof the performance of algorithms and form the basis of the hierarchy of complex-ity classes in theoretical computer science. They also convey certain informationregarding the feasibility of using a given algorithm to solve a problem instance ofcertain size.In spite of the significant role that theoretical methods play in understandingcomplexity of problems and algorithms, many high-performance algorithms arebeyond the reach of such methods due to the complexity of the algorithms. Yet,these algorithms often represent the state of the art for solving problems of wideinterest in practice. Thus, empirical analysis arises as the alternative for study-ing the time complexity of these algorithms. In such studies, the propositionalsatisfiability problem (SAT) and the travelling salesperson problem (TSP) are twoimportant problems that have attracted sustained academic interest and have seenmany practical applications. In this thesis, we extended a recent empirical scalinganalysis approach [33], which emphasises the idea of challenging by extrapola-tion and uses bootstrap re-sampling to statistically assess obtained models. Wealso applied the methodology to studying the empirical time complexity of severalhigh-performance algorithms for SAT and TSP.1.1 Empirical Analysis of AlgorithmsTheoretical methods usually ignore or idealise low-level details of algorithm im-plementations and execution environments. Moreover, they are typically obtainedfor extreme or average cases and for simpler variants of algorithms. Empirical ana-lysis of algorithms has seen increasing interest, because it permits predicting the11.2. The Propositional Satisfiability Problemrunning times of high-performance algorithms in practice and may also provideuseful insights into their behaviours. This is especially true for N P-hard prob-lems, where worst-case theoretical analysis typically does not offer much in under-standing, evaluating and designing high-performance algorithms. Many of theseproblems have important applications in practice, for which high-performance al-gorithms have been designed. Examples of such problems include SAT and TSP,which we concentrate on in this work, and many others as well. In this context, em-pirical analysis is often the only way to understand the relevant performance differ-ences that we observe in practice. Hoos and Stützle [34], for instance, conducteda systematic empirical analysis to evaluate a number of local search algorithms forSAT. Their results drew a comprehensive picture on the performance of these al-gorithms, and, for some of them, revealed some fundamental weaknesses in theirdesign. Empirical analysis is also applicable to tractable or polynomial-time solv-able problems, where empirical results often complement results obtained fromworst- and average-case theoretical analysis. Such analysis can help choose orconfigure an algorithm for a given problem instance or situation, as demonstratedin several studies of sorting algorithms [9, 45, 10].Empirical analysis recently witnessed progress in studying the scaling or timecomplexity of algorithms. For such analysis, the typical approach is to collect run-ning time data for problem instances of various sizes, use a representative statisticsuch as the mean to characterise the running time for each size, and then derivea parametric function that fits or bounds the statistics. In other words, such ana-lysis focus on obtaining a good model that describes the functional dependency ofrunning time on instance size. The process usually involves comparing, based ongoodness of the fit or a quantitative error measure, one parametric function withanother to determine which describes the data better. Figure 1.1 (taken from [59])illustrates a typical scenario of such a process, as done by Parkes and Walser [59]when studying the scaling behaviour of WalkSAT/SKC at phase transition.1.2 The Propositional Satisfiability ProblemThe propositional satisfiability problem (SAT) is a conceptually simple decisionproblem. Given a Boolean formula, typically in conjunctive normal form, the prob-lem asks whether there exists an interpretation, i.e., a mapping from propositionalvariables to truth values, that satisfies the formula. The following is an example ofa Boolean formula:(x1∨ x2∨ x3)∧ (¬x2∨ x3∨¬x4)∧ (¬x1∨ x2∨ x4) ,21.2. The Propositional Satisfiability ProblemFigure 1.1: Typical scenario for the process of empirical scaling analysis (fig-ure taken from [59]). Here, Parkes and Walser studied the empirical scalingof number of flips taken by WalkSAT/SKC to solve random 3-SAT instances atphase transition. They fitted two functions, f (n) = f1 · n f2+ f3 log(n) and g(n) =g1 · exp(ng2 · (1+g3/n)), to the mean number of flips.which can be satisfied, for instance, by setting all Boolean variables to true. SATwas the first problem proven to be N P-complete [16], and no known algorithmsolves the problem efficiently in the sense of better-than-exponential scaling ofrunning time with instance size in the worst case. SAT also has a broad rangeof practical applications, and despite its discouraging worst-case time complex-ity, many SAT solvers have been constructed that perform well on theoreticallyand practically interesting sets of benchmark instances. The performance of SATsolvers is regularly assessed in SAT competitions [62] , which feature benchmarkinstances from various applications, instances that are hand-crafted to challengesolvers, and randomly generated instances.Following seminal work by Cheeseman et al. [13] and Mitchell et al. [51], ran-domly generated 3-SAT instances at the so-called solubility phase transition, where50% of the instances generated at a given size are satisfiable, have been widelystudied; to this day, they are regarded as a model for the difficulty of the N P-complete 3-SAT problem and represent one of the most prominent benchmarks forSAT solvers. It is conjectured, yet unproven, that the location of the solubilityphase transition for uniform random 3-SAT, in terms of the ratio of the numberof clauses and variables, m/n, converges towards a limiting value as n approachesinfinity. Crawford and Auton [17] studied the location of the phase transition forrandom 3-SAT as a function of n and provided a widely used formula, based onempirical data for n = 20 . . .300.In this thesis, we consider the question how the performance of high-performanceSAT solvers on uniform random 3-SAT at the solubility phase transition scale with31.3. The Travelling Salesperson Problemn. Specifically, prompted in part by earlier, inconclusive results suggesting poly-nomial scaling of incomplete SAT solvers based on stochastic local search (SLS)[25, 59, 26], we address the problem of characterising the empirical scaling ofthe performance of prominent incomplete, SLS-based and complete, DPLL-basedSAT solvers. SLS-based solvers start with a random interpretation and work byrepeatedly flipping the value of a selected variable until a satisfiable interpreta-tion is found. They usually implement heuristic strategies to select the variableto flip and often have mechanisms to restart with a new random assignment if nosolution is found for too long. DPLL-based solvers explore the search space ofvariable assignment to find satisfiable interpretations following systematic back-tracking procedures based on the Davis–Putnam–Logemann–Loveland algorithm[18, 19]. We are interested in determining whether there are qualitative differencesin the scaling behaviour of SLS-based and DPLL-based solvers, in the scaling ofDPLL-based solvers on satisfiable and unsatisfiable instances, and in the scaling ofdifferent solvers within the two major groups (SLS-based and DPLL-based). Wealso performed brief experiments on 4-SAT to further understand the scaling per-formance of these solvers. Our experiments concerned phase-transition instances,as well as a distribution of instances that is less constrained but believed to be hardby theorists.To avoid ‘falling off’ the phase transition, which could bias our scaling resultstowards easier instances, we model the location of the solubility phase transitionpoint with extensive experiments. For 3-SAT, we re-examine the model in [17] forthe location of the phase transition point, and derive a new model that agrees betterwith observed data for n > 300 and with recent results from statistical physics (seeSec. 5.2.1); for 4-SAT, we fit a model using the same approach.1.3 The Travelling Salesperson ProblemThe travelling salesperson problem (TSP) is a well known and widely studiedN P-hard combinatorial optimisation problem. Given a set of cities and their pair-wise distances, the objective of TSP is to find the shortest round trip to visit eachcity exactly once. TSP has motivated sustained development of new algorithmicideas in the domain of combinatorial optimisation. TSP algorithms are usuallycategorised into two kinds: exact algorithms, which guarantee to find an optimalsolution of any TSP instance and can prove the optimality of the solution, and in-exact algorithms, which may find optimal solutions but cannot prove optimality.Until today, Concorde [4] represents the long-standing state-of-the-art completealgorithm for solving TSP. For incomplete algorithms, LKH [29, 30] had been thebest available solver until the introduction of EAX [57], which is an evolutionary41.4. Outline of the Thesisalgorithm that improved the edge assembly crossover operator Nagata [56] that re-combines short tours effectively. Empirical results in [57] show that EAX tendsto perform better than LKH on a broad range of TSP instances, though LKH, asshown by Kotthoff et al. [41], is not dominated in that it is more efficient in solvinga substantial proportion of the instances.In the following, we focus on the finding times, namely the times required bystate-of-the-art complete and incomplete solvers to find optimal solutions of a TSPproblem without proving optimality, and we want to investigate how the runningtimes of different solvers scale with instance size. Natural as the focus on find-ing times is for incomplete solvers, it may not seem straightforward why findingtime should of interest for a complete solver. On one hand, we want to comparethe scaling models of finding and overall running times of complete solvers to bet-ter understand them. On the other hand, we are also interested in comparing thescaling models of complete and incomplete solvers.2D Euclidean instances, i.e., instances where the locations to be visited corres-pond to points in the Euclidean plane, often occur in practical applications. Onespecial type of such instances are so-called RUE instances, which can be gener-ated by placing cities uniformly at random. Even though RUE instances typicallydo not occur in practice, they have similar properties as general 2D Euclidean in-stances and are widely studied in the literature. Following previous works on em-pirical scaling of TSP solvers [35, 21], we choose 2D RUE instances because theyrepresent a distribution of TSP instances that is widely studied and can be easilygenerated.1.4 Outline of the ThesisWhat follows immediately in Chapter 2 is discussion of previous works on theempirical time complexity of SAT and TSP and on empirical methods for scalinganalysis, which motivated much of this work. Then we review the methodologyproposed by Hoos [33] and describe our improvements in Chapter 3. These im-provements make extended use of the bootstrap intervals for statistical assessmentof fitted models and introduce a novel way for comparison of scaling models andthus scaling performances of solvers. This is followed by Chapter 4 on empiricalscaling analyser (ESA), the automated tool that we designed to perform scalinganalysis automatically. ESA takes a file of running time data as input and canoutput a technical report of empirical scaling analysis results.In Chapter 5, we present our results on the empirical scaling of running timesof six SAT solvers on random 3-SAT, followed by a similar analysis for 4-SATinstances. Our main findings on 3-SAT are as follows:51.4. Outline of the Thesis• The median running times of the three prominent SLS-based solvers we stud-ied, WalkSAT/SKC [64], BalancedZ [44] and probSAT [7, 6], scale polyno-mially (with an exponent of about 3), and the best exponential models are re-jected with 95% confidence. Furthermore, we found no evidence that higherpercentiles of the distributions of running times over sets of instances forfixed n may scale exponentially.• The median running times of the three DPLL-based solvers we considered,kcnfs [20], march_hi [32] and march_br [31], exhibit exponential scaling(with a base of about 1.03), and the best polynomial models are rejectedwith 95% confidence.• For all three DPLL-based solvers, the median running times when solvingonly satisfiable and only unsatisfiable instances, respectively, clearly exhibitexponential running time scaling, and the respective scaling models differmainly by a constant factor.• While the scaling models for the SLS-based solvers are very similar to eachother, the two march-variants scale significantly better than kcnfs.Afterwards, we investigate the time complexity of three TSP solvers in Chapter6. For both complete and incomplete solvers, we focus on times required to findoptimal solutions (without proving optimality), and compare the scaling models ofthese solvers. Our major conclusions are:• The median finding times of Concorde, the state-of-the-art complete TSPsolver, are consistent with a root-exponential model, i.e., a model of theform a ·b√n (with a base of about 1.25), and the best exponential model andpolynomial model can be rejected with 95% confidence.• The time Concorde takes for finding an optimal solution scales slightly worsethan the overall running time, but more detailed models with lower-orderterms may be required to better capture this difference.• For the two prominent incomplete solvers we studied, the median runningtimes of EAX are consistent with a root-exponential model (with a base ofabout 1.12), and the best exponential and polynomial models can be rejectedwith 95% confidence; the median running times of LKH are bound by apolynomial (with an exponent of about 2.9) and a root-exponential model(with a base of about 1.19), and the best exponential model can be rejectedwith 95% confidence.61.4. Outline of the Thesis• Both incomplete solvers scale significantly better than Concorde, with b inthe root-exponential model for Concorde (≈ 1.25) significantly larger than bin those for EAX (≈ 1.12) and LKH (≈ 1.19).We also assess the impact of automated configuration on the scaling of incompleteTSP solvers in Chapter 6. For EAX, we showed that automated algorithm config-uration and adapting population size with instance size can significantly improvethe scaling behaviour; while for LKH, we observed overfitting during automatedconfiguration in that it improved the performance of LKH for smaller instances butcaused it to suffer for larger ones.Finally, we conclude the thesis in Chapter 7 and discuss potential areas forfuture work.7Chapter 2Related WorkIn this chapter, we briefly survey related work on SAT and TSP as well as onempirical scaling analysis. For work on empirical time complexity of SAT andTSP, we discuss their limitations and point out gaps that we will later fill withadditional analysis. For work on empirical scaling analysis, we discuss variousmethods and applications found in the literature, including the methodology thatour work directly builds on.2.1 Related Work on Random k−SATSAT is one of the most intensely studied problems in the computer science literat-ure and beyond. The k-SAT problem is arguably the most widely studied class ofSAT. Given a Boolean formula in conjunctive normal form for which every clauseconsists of exactly k literals, the problem asks whether there exists an interpreta-tion that satisfies the formula. 3-SAT, the special case for k = 3, is arguably themost prominentN P-complete decision problem [16]. Interest in phase transitionphenomena in combinatorial problems and in uniform random 3-SAT specificallyrose sharply when Cheeseman et al. [13] demonstrated that the hardest instancesare found around a critical value of an order parameter, where a transition from pre-dominantly soluble to mostly insoluble instances occurs. Uniform random k-SATinstances are generated by constructing uniformly and independently at random mclauses, each of which is obtained by sampling, again uniformly and independentlyat random, 3 of n variables, and negating each of them with probability 1/2 [51](duplicate clauses are eliminated); the order parameter is the clauses/variable ratio,m/n. It is believed, but not yet proven, that for k ≥ 3, the location of the phasetransition point of uniform random k-SAT converges to a fixed threshold value as napproaches infinity. Assuming threshold values exist for small k, an accurate the-oretical upper bound was proven by Franco and Paull [23] and was later improvedto 2k log2− (1+ log2)/2+ ok (1) by Kirousis et al. [40]. Achlioptas and Peres[1] achieved a major breakthrough in proving a lower bound, which was recentlyimproved to 2k log2− (1+ log2)/2−ok (1) by Coja-Oghlan [15].In a prominent study on the empirical difficulty of SAT instances, Mitchell82.2. Related Work on TSPet al. [51] demonstrated that instances drawn from the phase transition region ofuniform random 3-SAT instances tend to be the most difficult for a simple DPLLsolver. Similar results were shown by Yokoo [68] for an SLS-based solver on thesatisfiable phase of uniform random 3-SAT. Crawford and Auton [17] studied thephase transition region of uniform random 3-SAT empirically, developed a modelfor the location of the phase transition point and presented additional evidence forthe difficulty of the SAT instances found there.Gent and Walsh [25] studied the empirical behaviour of GSAT, one of the earli-est and most prominent SLS-based SAT solvers [63], and its two variants, DSATand HSAT. They noted that the scaling of the average number of variable flips re-quired by these solvers for solving phase transition random 3-SAT instances wasconsistent with less-than-linear exponential scaling with n and did not rule out apolynomial scaling model with a degree of about 3. Later, Parkes and Walser [59]presented empirical evidence that the scaling of the average number of flips re-quired by a more recent, prominent SLS-based SAT solver, WalkSAT/SKC [64],on the same class of instances might scale either as a power function with a slowlygrowing exponent, or as a sub-exponential function. Furthermore, Gent et al. [26]found that the 90th percentile of the number of flips of GSAT for random 3-SATappears to grow no faster than n4. However, in all cases, performance was meas-ured in variable flips (which become more expensive as n increases) and based onlimited curve fitting only, with a vaguely defined notion of a ‘good fit’.Coarfa et al. [14] used simple curve fitting to study the empirical scaling ofmedian running time for three complete solvers on random 3-SAT and observedexponential scaling above certain solver-dependent density thresholds.In contrast, in the following, we consider the actual scaling of running time anduse a considerably more advanced and statistically sound approach for assessingscaling models. Unlike these earlier studies, we also challenge our scaling modelsby assessing their predictions on larger instances sizes than those used for fittingthe models.2.2 Related Work on TSPThe computation complexity of TSP has been intensely studied. It is N P-hardto solve both the general TSP [24] and the special case of 2D Euclidean instances[58]. For Euclidean distances, in spite of known polynomial approximation schemes,it takes exponential time to find good solutions as the gap to optimality decreases[5].Much less work has been done on investigating the empirical scaling of modernTSP solvers on interesting distributions of TSP instances. For complete solvers,92.3. Related Work on Empirical Scaling Analysisan important observation is made the book by Applegate et al. [3], who presenta graphical analysis of observed mean running times suggesting that Concordemay scale exponentially. A more thorough investigation is presented by Hoos andStützle [35], which is a direct precursor of our work on TSP. After observing a log-normal distribution of the running times for given instance size, they found thatthe median (and other quantiles of) running times of Concorde scale as a functionof the form a · b√n, where b is about 1.24194. In addition to analysing runningtimes required by Concorde to find an optimal solution and to prove its optimality,there is work investigating how much of the time is spent on finding the optimalsolution. Hoos and Stützle [36] found that time spent on finding optimal solutionsaccounts for a larger percentage of the overall running time, and the percentagetends to increase with instance size.For incomplete solvers, Dubois-Lacoste et al. [21] studied the scaling of run-ning times of EAX and LKH (with restart mechanisms), also on solving RUE in-stances. They found that scaling models of EAX and LKH are of a similar formwith Concorde, namely of the form a ·b√n, but they seem to scale better than Con-corde in that their models have smaller values of b. We note that some of thiswork was carried out in parallel with our work presented in the following, and ourscomplemented and extended their results for a better understanding of the scalingbehaviour of these two incomplete algorithms.2.3 Related Work on Empirical Scaling AnalysisEmpirical analysis of the time complexity of algorithms has been applied to prob-lem other than SAT and TSP. Some of these studies use graphical analysis only,while others involve the use of model fitting. Subramani and Desovski [67], for in-stance, investigated the empirical scaling of the vertex contraction (VC) algorithm,a very well known greedy procedure for the negative cost cycle detection (NCCD)problem. Their comprehensive results demonstrated that the VC algorithm out-perform the standard Bellman-Ford algorithm by an order of magnitude. Anotherexample is found in work by Kunkle [42], which studied four different algorithmsfor the longest common subsequence problem. Referencing theoretical results,Kunkle fitted one-parameter models to find values of the constants in the models.He also examined the impact of sequence structure and alphabet size on the empir-ical running times of the algorithms. Aguirre-Hernández et al. [2] also employedgraphical analysis and model fitting to analyse two algorithms for the design ofRNA secondary structure. In particular, they investigated the impact of RNA struc-tures and primary structure constraints on the scaling of these algorithms and foundsupport of polynomial scaling for both algorithms.102.3. Related Work on Empirical Scaling AnalysisMoreover, there has been significant interest in the methodology for empiricalscaling analysis and in its role in complexity research. Sanders and Fleischer [61],for instance, discussed the role of empirical investigations in guiding theoreticalresearch, falsifying results from incomplete theoretical analysis or even producinghigh-quality surrogates for unproven conjectures on scaling of algorithms. Theyillustrated this via several examples, some involving graphical scaling analysis andeven the standard t-test to evaluate hypotheses on scaling of algorithms. McGeochet al. [48, 49] also examined the role of analysing asymptotic trends from exper-imental data as part of a scientific method and evaluated several curve-boundingtechniques using designed experiments. In particular, they focused on polynomialmodels and evaluated their techniques on data, both artificial and real, that areknown to be bounded by low-degree polynomials. Success was seen for a strategythat finds bounding polynomials essentially by performing linear regression on log-log transformed data. On the other hand, they observed unsatisfactory results fromnon-linear regression-based strategies. This observations, as Hoos [33] suspected,might come from the difficulty of bounding slowly-growing scaling behaviour andfrom bounding (rather than fitting) data with a RMSE-minimisation procedure.Another development is to use empirical scaling analysis as a testing tool, al-lowing developers to test whether their software scales as theory predicts and/oras they expect. For this purpose, Goldsmith et al. [28] developed a tool namedTrend Profiler that analyse the computational complexity of software as a func-tion of a feature (such as size) of a given workload. Trend Profiler models theexecution frequencies of blocks or clusters of blocks of software and fits them tolinear or power law models using linear regression (on actual or transformed data).These models are qualitatively assessed by scatter plots that relate input featurevalues to observed and predicted execution frequencies, and to the residues, thatis, differences between observed and predicted frequencies. R2, the square of Pear-son’s correlation coefficient, is also used to quantify the quality of such models.Trend Profiler also calculates bootstrap confidence intervals for predictions madeon workloads with extrapolated input feature values, though these confidence inter-vals are not utilised in statistical testing. Goldsmith et al. [28] have demonstratedthe effectiveness of Trend Profiler on several pieces of real-world software, includ-ing bzip2 [12] and gcov [27]. All software they considered showed slow scaling oflow-degree polynomials.More recently, Hoos [33] proposed an empirical scaling analysis methodologythat emphasised the idea of challenging by extrapolation and used bootstrap re-sampling to statistically assess obtained models. Different from earlier approaches,this method uses non-linear numerical techniques to fit models. Our work directlybuilds on this methodology and improves in two useful ways, described detailedlyin Chapter 3.11Chapter 3Empirical Scaling Methodologyand Extensions3.1 Empirical Scaling MethodologyA significant advance in the methodology for studying the empirical scaling of al-gorithm performance with input size was achieved by Hoos [33]. His method usesstandard numerical optimisation approaches to automatically fit scaling models,which are then challenged by extrapolation to larger input sizes. Most importantly,it uses a re-sampling approach to assess the models and their predictions in a stat-istically meaningful way. Here, we briefly reviewed the methodology in steps, aspresented in Figure 3.1:1. Collect running time data.Running time data should be collected for one type of benchmark instancesof interest. These instances need to be collected or generated in sets of dif-ferent sizes, and the number of different sizes and the number of instances ineach set should be carefully chosen, considering the goal of the analysis andthe budget of computation resources. Then, algorithm runs should be per-formed on these instances in a homogeneous environment, and the runningtimes should be reliably measured. For randomised algorithms, multiple runsare required for each instance.2. Fit parametric models.Running times of an algorithm typically vary greatly for instances of thesame size. To study scaling of running times with instance size, a statisticis typically used to summarise running time data of the same size. Typicalexamples include the median, the mean, and higher quantiles. Medians andother quantiles have the advantage that they can be more reliably estimated,even when there are (not too many) time-out or crashed runs. Then, a stand-ard numerical optimisation procedures, such as the Levenberg-MarquardtAlgorithm [43, 47], is used to fit scaling models over summarised runningtime data. To perform the next step, only running times of smaller instances123.1. Empirical Scaling Methodologysolverrunningtimesfit parametricmodelschallenge byextrapolationresultuse bootstrap re-samplingfor further assessmentFigure 3.1: The methodology for empirical scaling analysis.(support data) should be used for the fitting process. The numerical optimisa-tion procedure tries to minimise an error measure such as root-mean-squarederror (RMSE). Commonly considered models are polynomial and exponen-tial models of the form a ·nb and a ·bn, but other parametric models may beconsidered as necessary.3. Challenge models by extrapolation.Extrapolation is viewed as an important way to evaluate fitted models. Thereason is that scaling models are of interest mainly because their ability topredict of running times for larger, unseen instances, and we usually acceptor reject a scaling model based on whether good predictions can be made forsolving larger problem instances. Thus, models obtained from the last stepare challenged by running times required to solve larger instances, which arereferred to as challenge data. In other words, RMSEs on the challenge dataare more informative than those on the support data, and should be relied onwhen comparing scaling models.4. Assess models using bootstrap confidence intervals.Bootstrap analysis [22] is used to assess the statistical confidence we shouldhave in the fitted models. In detail, the method re-samples, with replacement,the running times for each support size. For each sample, it computes the de-scriptive statistic of interest (e.g., median running time) and fits a parametricmodel which gives predictions for the challenge sizes. Through this pro-cess, a collection of fitted models is obtained for each model family, whichthen generates a set of predictions for the challenge sizes. Bootstrap percent-ile confidence intervals (for certain confidence level denoted as α) are thencomputed for each challenge size and for each model family. By comparingthe observed running time statistics with the confidence intervals of a model133.2. Extensions to the Methodologyfamily, we can determine whether a model is consistent with observations orshould be rejected with confidence level of α .3.2 Extensions to the MethodologyOur work presented in the following applies this empirical methodology to SATand TSP solvers, and extends it in two useful ways.Firstly, noting that observed running time statistics for challenge data are alsobased on measurements on sets of instances, we calculate bootstrap percentile con-fidence intervals for those, in addition to the point estimates used in the originalapproach. This way, we capture dependency of these statistics on the underlyingsample of instances.More formally, we have running time data for instance sizes n1, · · · ,ns+t , di-vided into a support set of s different sizes and a challenge set of t sizes. For eachsize ni, we have a set of instances Ii. We have also collected a set of running timedata D(Ii) for each size. When computing bootstrap confidence intervals, we res-ample, with replacement, r samples of the instance set Ii,1, · · · , Ii,r. For each sampleI·, we compute statistics S (I·) of the support sizes for model fitting, which thengives us confidence intervals CI1, · · · ,CIt of model predictions (see step 4 above).In [33], these confidence intervals are directly compared to S (Is+1) , · · · ,S (Is+t)to assess whether a model is a good fit. However, these statistics are also basedon sets of running time data. Thus, we calculate bootstrap confidence intervalsfor observed challenge data as well. That is, we compute CIOi for i = 1, · · · , tas confidence intervals of S (Is+i,1) , · · · ,S (Is+i,r) and compare it with CIi to assesswhether a model is a good fit or not. The comparison results can be: the two con-fidence intervals are disjoint (which we treat as inconsistent), they overlap witheach other but are not fully contained (which we say to be weakly consistent), andone is contained in another (which we say to be fully consistent). (In addition, ifthe point estimate of the observed running time falls within the confidence intervalof the prediction, we say it to be strongly consistent.)Secondly, we propose a way to compare scaling models between algorithms.This is done by cross-checking model predictions and observed data. We determ-ine to which extent a solver A1 shows scaling behaviour different from anothersolver A2, by comparing the observed running time statistics of A1 to the boot-strap confidence intervals obtained from the scaling model of A2. In other words,we compare CIOi’s for i = 1, · · · ,n of A1 to the CIi’s of A2. If the latter do notoverlap with the former, we can reject the hypothesis that the performance ofA1 isconsistent with the scaling model of A2. We note that this method also applies tocomparing scaling models of one algorithm solving two different distributions of143.2. Extensions to the Methodologyinstances.A similar methodology can be used to determine if two scaling models differby a constant factor. For example, if we have shown that the running times of A1and A2 are consistent with an exponential model of the form a ·bn, and the modelfor A1 is a1 ·bn1, then we can determine if the scaling of the two algorithms differby a constant factor by fitting a one-parameter model of the form a ·bn1, where a isa free parameter, on the running times of A2. If the resulting model is not a goodfit (determined based on bootstrap confidence intervals, as explained in step 4 ofSec. 3.1), we reject the hypothesis that the two models differ by only a constantfactor with confidence level of α .15Chapter 4Empirical Scaling Analyser:An Automated System for ScalingAnalysis1In this chapter, we present an automated tool – the Empirical Scaling Analyser(ESA) – that can perform core elements, particularly steps 2 through 4, of theanalysis described in Chapter 3. (The workflow implemented by ESA is illustratedin Figure 3.1) To use ESA, a user needs to prepare an input file of running time dataof an algorithm (referred to as target algorithm hereafter), as well as other optionalfiles, including a configuration file, a file specifying the parametric models to befitted, a LATEX template and a gnuplot template. Details of these input files willbe given in Section 4.1. Note that ESA is not limited to fitting and assessing asingle scaling model, but can deal with multiple models simultaneously. In otherwords, once data collection is finished, a user can put all running time data into afile, feed it into ESA and obtain the results from the scaling analysis using severalparametric models. Results are presented in a technical report, which containseasy-to-read tables and figures for the scaling of the target algorithm. The detailsof the output report are described in Section 4.2.We believe the tool is useful for other researchers who want to study the empir-ical time complexity of other algorithms, and can thus promote the use of such ana-lysis for other problems and algorithms. The tool is available as an easy-to-use on-line service at www.cs.ubc.ca/labs/beta/Projects/ESA/esa-online.htmland as a command-line tool with additional functionality (see Section 4.5 on howto run ESA). Here, we describe all features available in the command-line version.4.1 InputTo perform scaling analysis, ESA requires input data containing the sizes of theinstances studied and the running times the target algorithm requires for solving1A shorter version was published as a late breaking abstract in GECCO’15 [53].164.1. Inputthese instances. The input running time data need to follow the following format-ting rules:• the input file contains lines of details of the instances, one instance per line;• in each line, the following three pieces of information are provided in orderand are separated by “,”:– instance name (e.g., file name) and other optional information (thisfield is for the user’s reference only; ESA does not use this field inthe scaling analysis);– instance size;– running time required to solve the instance, which itself may be a stat-istic for multiple runs of the target algorithm solving the instance andmay be “inf" for time-out or crashed runs.Besides, the user may specify the number of instances for some sizes. If there arenot enough entries for one size, ESA will treat the missing entries as instances withunknown running time. An example for such data is described in Section 6.3, in thecontext of analysing the scaling behaviour of EAX and LKH, where the runningtimes of some instances are unknown because no optimal solution has been foundin previous runs of Concorde. An excerpt of an input file for ESA is in Figure 4.1.ESA also takes as input a configuration file, containing details on the targetalgorithm (algName), the instance distribution (instName), the number of bootstrapsamples (numTrainingData), etc. The file contains lines of configurations, oneconfiguration per line. Each configuration follows the “name : value” format.An example of a configuration file is shown in Figure 4.2.There are a number of other files that a user may supply (if not supplied, ESAwill use the default file(s) distributed with the code), including:• a file specifying the models to be fit• a LATEX template specifying the content and format of the output report• gnuplot templates specifying the format of the plotsThe first of these is needed, because ESA supports customised models, as long asthe models are supported by python (including the math and the numpy packages)and gnuplot. This file contains lines of models, one model per line. Each containsthe following items, separated by “,”:• Model name (e.g., Exponential)174.1. Input# instance name , size , datum (running time)portgen -500 -1000.tsp ,500 ,2.3portgen -500 -100.tsp ,500 ,2.58portgen -500 -101.tsp ,500 ,2.36portgen -500 -102.tsp ,500 ,2.51portgen -500 -103.tsp ,500 ,2.63portgen -500 -104.tsp ,500 ,2.84portgen -500 -105.tsp ,500 ,2.62portgen -500 -106.tsp ,500,3...portgen -600 -1000.tsp ,600 ,3.42...portgen -4500 -10.tsp ,4500 ,727.68portgen -4500 -11.tsp ,4500,inf...#instances ,4000 ,100#instances ,4500 ,100Figure 4.1: Excerpt of an input file for ESA, where deleted lines are representedby “...”.• Number of parameters (e.g., 2)• LATEXexpression of the model• Python expression of the model• Gnuplot expression of the model• Default values of the parameters, separated by “,”For all expressions of the models, x represents the size, and the parameters shouldbe a,b, . . . , and should be surrounded by “@@”. For example, the specificationin Figure 4.3, which is also the default model specification, tells ESA to fit anexponential and a polynomial model of the form a ·bx and a · xb respectively:For the LATEX template, ESA will use the default template, if no customisedtemplate is found. In the template, dynamic values should be surrounded by “@@”.For instance, the name of the algorithm (which is defined in the configuration file)is a dynamic value. Wherever mentioned in the template file, the user should use“@@algName@@”, and ESA will instantiate it to be the real name of the al-gorithm when generating the report. Users can also specify the formats of the plots184.1. InputfileName : runtimes.csvalgName : WalkSAT/SKCinstName : random 3-SAT instances at phase transitionmodelFileName : models.txtnumTrainingData : 7alpha : 95numBootstrapSamples : 1000statistic : medianlatexTemplate : template -AutoScaling.texmodelPlotTemplate : template_plotModels.pltresiduePlotTemplate : template_plotResidues.pltFigure 4.2: Example of a configuration file for ESA.Exp ,2,@@a@@\times @@b@@^{x},@@a@@*@@b@@**x,@@a@@*@@b@@**x,1e-4 ,1.01Poly ,2,@@a@@\times x^{@@b@@},@@a@@*x**@@b@@ ,@@a@@*x**@@b@@ ,1e-8,1Figure 4.3: An example of model specification for ESA.194.2. Outputn 500 600 700 800# instances 1000 1000 1000 1000# running times 1000 1000 1000 1000mean 19.33 31.55 56.8 89.35coefficient of variation 1.164 1.418 1.303 1.455Q(0.1) 0.43 2.21 6.73 10.62Q(0.25) 3.75 8.75 15.3 22.31median 12.22 18.64 33.15 48.95Q(0.75) 25.01 37.64 68.29 102.5Q(0.9) 47.33 70.35 130.4 195.5n 900 1000 1100 1200# instances 1000 1000 1000 1000# running times 1000 1000 1000 1000mean 139.7 201.2 314.6 385.4coefficient of variation 1.734 1.759 1.851 1.713Q(0.1) 17.23 23.28 28.8 38.76Q(0.25) 32.72 43.23 60.84 78.66median 70.64 98.56 145.7 177.2Q(0.75) 142.7 216.1 341.3 409.2Q(0.9) 302.7 429.4 693.2 846.3n 1300 1400 1500# instances 1000 1000 1000# running times 1000 1000 1000mean 548.7 749 1072coefficient of variation 1.859 2.227 2.109Q(0.1) 53.27 73.78 93.04Q(0.25) 112.2 153.3 210.1median 271.7 344.3 483.5Q(0.75) 583.6 783.6 1136Q(0.9) 1190 1517 2277Table 4.1: Example output of ESA – statistics of running times for support data.via the template gnuplot script. For instance, users may choose whether to use alog-log plot or a semi-log plot via the template gnuplot script. Default templatesare presented in Appendices A and B and are available for download together withthe source code (see Section 4.5 for details).4.2 OutputESA automatically generates a technical report containing detailed empirical scal-ing analysis results and interpretation. This report contains tables and figures thatusers can easily read (see Chapters 5 and 6 for examples), including:• two tables of statistics of running times, one for support data and the otherfor challenge, as illustrated in Tables 4.1 & 4.2, respectively;204.2. Outputn 2000 2500 3000# instances 1000 100 100# running times 1000 100 100mean 5402 ∞ ∞coefficient of variation 2.624 N/A N/AQ(0.1) 307 671.5 3538Q(0.25) 765.5 1694 8188median 1969 6149 1.84×104Q(0.75) 5207 1.766×104 4.118×104Q(0.9) 1.137×104 4.611×104 9.048×104n 3500 4000 4500# instances 100 100 100# running times 100 100 100mean ∞ ∞ ∞coefficient of variation N/A N/A N/AQ(0.1) 6060 2.096×104 3.041×104Q(0.25) 1.226×104 4.125×104 1.22×105median 3.246×104 1.312×105 2.633×105Q(0.75) 9.717×104 3.938×105 ∞Q(0.9) 2.76×105 ∞ ∞Table 4.2: Example output of ESA – statistics of running times for challenge data.ModelRMSE RMSE(support) (challenge)ConcordeExp. Model 4.0388×1.0032x 7.7847 2.7852×106RootExp. Model 0.083457×1.2503√x 7.0439 9169.4Poly. Model 1.6989×10−10× x3.9176 9.9327 1.038×105Table 4.3: Example output of ESA – fitted models and corresponding RMSE val-ues.• a table of fitted models and corresponding RMSE values, as illustrated inTable 4.3;• a figure of running times, fitted models and corresponding bootstrap confid-ence intervals of each model, as illustrated in Figure 4.4;• a figure of residues of the fitted models, as illustrated in Figure 4.5;• a table of bootstrap confidence intervals for all model parameters, as illus-trated in Table 4.4;• two tables of bootstrap confidence intervals for observed and predicted run-ning times, one for support data and the other for challenge, as illustrated inTables 4.5 & 4.6.214.2. Output100101102103104105106107108 1000CPU time [sec]nSupport dataExp. modelRootExp. modelPoly. modelExp. model bootstrap intervalsRootExp. model bootstrap intervalsPoly. model bootstrap intervalsChallenge data (with confidence intervals)Figure 4.4: Example output of ESA – a figure of running times, fitted models andcorresponding bootstrap confidence intervals of each model.Solver Model Confidence interval of a Confidence interval of bConcordeExp. [2.6108,5.2975] [1.003,1.0036]RootExp. [0.037056,0.15111] [1.2287,1.2793]Poly.[6.1872×10−12,1.7351×10−9] [3.5859,4.3713]Table 4.4: Example output of ESA – bootstrap confidence intervals for all modelparameters.224.2. OutputSolver nPredicted confidence intervals Observed median run-timeExp. model Point estimates Confidence intervalsConcorde500 [15.43,23.47] 12.22 [11.01,13.33]600 [22,31.52] 18.64 [16.99,20.16]700 [31.38,42.57]# 33.15 [30.38,35.9]800 [44.68,57.59]# 48.95 [44.05,53.15]900 [63.49,77.63]* 70.64 [64.34,77.28]1000 [90.18,104.8]# 98.56 [90.16,107.8]1100 [127.6,141.9] 145.7 [130.9,158.6]1200 [178.6,193] 177.2 [165.8,196.1]1300 [244.7,266.8] 271.7 [244.9,298.3]1400 [332.7,375.7]# 344.3 [318.9,383.3]1500 [448.8,534.7]# 483.5 [432.8,532.9]Solver nPredicted confidence intervals Observed median run-timeRootExp. model Point estimates Confidence intervalsConcorde500 [9.176,15.31]* 12.22 [11.01,13.33]600 [15.56,23.81]* 18.64 [16.99,20.16]700 [25.21,35.83]# 33.15 [30.38,35.9]800 [39.43,52.46]# 48.95 [44.05,53.15]900 [60.23,74.99]# 70.64 [64.34,77.28]1000 [89.56,105.2]# 98.56 [90.16,107.8]1100 [130.4,145.2] 145.7 [130.9,158.6]1200 [184.4,199.1] 177.2 [165.8,196.1]1300 [252,274]# 271.7 [244.9,298.3]1400 [336.6,380.4]# 344.3 [318.9,383.3]1500 [442,522.8]# 483.5 [432.8,532.9]Solver nPredicted confidence intervals Observed median run-timePoly. model Point estimates Confidence intervalsConcorde500 [4.206,8.58] 12.22 [11.01,13.33]600 [9.398,16.47] 18.64 [16.99,20.16]700 [18.41,28.62] 33.15 [30.38,35.9]800 [32.91,46.48] 48.95 [44.05,53.15]900 [55.04,71.24]# 70.64 [64.34,77.28]1000 [86.98,104.3]# 98.56 [90.16,107.8]1100 [131.4,147.6]# 145.7 [130.9,158.6]1200 [188.5,204.4] 177.2 [165.8,196.1]1300 [257.9,280.1]# 271.7 [244.9,298.3]1400 [339.3,384.2]# 344.3 [318.9,383.3]1500 [435.4,516.3]# 483.5 [432.8,532.9]Table 4.5: Example output of ESA – bootstrap confidence intervals for observedand predicted running times for support data.234.2. OutputSolver nPredicted confidence intervals Observed median run-timeExp. model Point estimates Confidence intervalsConcorde2000 [1988,3179] 1969 [1739,2222]2500[8718,1.884×104] 6149 [4084,8812]3000[3.853×104,1.103×105] 1.84×104 [1.332×104,2.669×104]3500[1.698×105,6.479×105] 3.246×104 [2.581×104,5.038×104]4000[7.5×105,3.809×106] 1.312×105 [7.073×104,2.024×105]4500[3.301×106,2.245×107] 2.633×105 [1.73×105,4.419×105]Solver nPredicted confidence intervals Observed median run-timeRootExp. model Point estimates Confidence intervalsConcorde2000 [1528,2269]* 1969 [1739,2222]2500 [4536,8335]# 6149 [4084,8812]3000[1.212×104,2.694×104]* 1.84×104 [1.332×104,2.669×104]3500[3.001×104,7.925×104]# 3.246×104 [2.581×104,5.038×104]4000[6.95×104,2.163×105]* 1.312×105 [7.073×104,2.024×105]4500[1.528×105,5.563×105]* 2.633×105 [1.73×105,4.419×105]Solver nPredicted confidence intervals Observed median run-timePoly. model Point estimates Confidence intervalsConcorde2000 [1228,1795] 1969 [1739,2222]2500 [2737,4771] 6149 [4084,8812]3000[5252,1.057×104] 1.84×104 [1.332×104,2.669×104]3500[9149,2.069×104] 3.246×104 [2.581×104,5.038×104]4000[1.477×104,3.708×104] 1.312×105 [7.073×104,2.024×105]4500[2.248×104,6.205×104] 2.633×105 [1.73×105,4.419×105]Table 4.6: Example output of ESA – bootstrap confidence intervals for observedand predicted running times for support data.244.3. Automated Interpretation of Scaling Results-7e+06-6e+06-5e+06-4e+06-3e+06-2e+06-1e+06 0 1e+06 500  1000  1500  2000  2500  3000  3500  4000  4500Residue [sec]nExp. Model ResiduesRootExp. Model ResiduesPoly. Model ResiduesSupport DataChallenge DataFigure 4.5: Example output of ESA – a figure of residues of the fitted models.A snapshot of the reported generated by ESA using the default LATEX template isshown in Figure 4.6.4.3 Automated Interpretation of Scaling ResultsIn addition, ESA generates automated interpretations for scaling analysis results. Itevaluates how well a model fits the given data based on the percentage of challengesizes for which the model predicts the corresponding running times reasonablyaccurately. If a model predicts well for most challenge sizes, then the model shouldbe accepted as a good fit. Technically, the evaluation is based on the percentageof challenge points that lie within the predicted bootstrap confidence intervals ofthe model. It especially emphasises the challenge points for larger input sizes, asthose provide more information regarding whether the model predicts well. Thedetailed criteria, which we design based on extensive experiments with SAT andTSP algorithms, are as follows:• fair fit: the model predicts well for a fair percentage of the challenge sizes,more precisely, > 70% of the challenge points or > 70% of the larger half ofthe challenge points are within the predicted bootstrap intervals;• very good fit: the model predicts well for almost all challenge sizes, moreprecisely,> 95% of the challenge points are within the predicted bootstrapintervals;254.4. ImplementationFigure 4.6: Snapshot of the technical report generated by ESA.• over-/under-estimate: the model over-/under-estimates the running times ofa significant percentage of the challenge sizes, more precisely, > 75% ofthe challenge points or > 75% of the larger half of the challenge points arebelow/above the predicted bootstrap intervals.These criteria are combined into the fully automated interpretation procedure illus-trated in Figure 4.7. Note that when medians (or other quantiles) are not definitelyknown (due to instances with unknown running times), we compare the intervalsof the medians against the predicted bootstrap intervals. To be more precise, forinstance, we say a challenge point is below the predicted bootstrap interval, if theupper bound of the median is smaller than the lower bound of the bootstrap interval.4.4 ImplementationESA is implemented in python 2.7 and calls gnuplot to generate plots. All providedgnuplot scripts are prepared for gnuplot version 4.6, but only minimal modifica-tions, if any, will be needed to use with another gnuplot version. The online servicehas a clear user interface implemented with HTML/CSS, and the back-end servicewas realised using python CGI.One technical difficulty for implementing ESA is to interact with the LATEX264.4. Implementationmodels& data fair fit?very goodfit?"fits verywell""tends tofit well"over-estimate?under-estimate?"tendsto over-esimate""tends tounder-esimate""does notfit well"yesyesnono yesnoyesnoFigure 4.7: Flow diagram on how ESA automatically interprets the fitting results.Detailed definitions of the conditions are given in the main text.274.5. Downloading and Running ESAtemplate for producing the output technical report. Generating LATEX commands,especially those concerning tables, requires careful implementation. Moreover, itwas important to modularise the code for maximum re-use. For instance, there isone function that takes a string as input and output a string that instructs LATEX tomake the content boldface. This is very frequently used in ESA, since boldfaceis widely used for highlighting models and bootstrap confidence intervals. Thefunction will also need to distinguish between plain text and mathematics environ-ments, as different commands are needed to make them boldface. Other examplesinclude the conversion of numbers into LATEX expressions, and the generation oftable rows. Modularisation is also of great value beyond working with LATEX, asseveral operations need to be performed in more than one stage of an ESA run.An example of such operations is to call gnuplot for model fitting/plotting, whichneeds to be done for each model in steps 3 and 4.In addition, ESA was designed to work with customisable LATEX and gnuplottemplates (see Sec. 4.1 for details). There is a special LATEX syntax that can beused as "variables" in the LATEX template, which will be replaced by actual contentwhen ESA runs. The LATEX template can also be compiled "as-is"to make it easierfor users to adapt it to their specific needs. We chose gnuplot for plot generation sothat users can easily supply a template for customised figure formatting. ESA alsosupports user-defined models for scaling analysis. To achieve this, ESA definesfunctions on-the-fly from user-supplied strings.4.5 Downloading and Running ESA4.5.1 Downloading and Running ESA from the Command LineESA can be downloaded from the project page online at www.cs.ubc.ca/labs/beta/Projects/ESA/. After unzipping the compressed file, there will be a dir-ectory named ESA, which contains runESA.sh, the source code and other supportfiles. A user needs to have python and gnuplot installed in order to run ESA. Weused python 2.7 and gnuplot 4.6 in our environment, but expect ESA to work forother versions with minimal modifications, if any.To run ESA from command line, a user should follow the following steps:1. Create a directory for input and output files.2. Put the primary input file for running time data into the directory.3. Create files for models and LATEX and gnuplot templates (optional; if notprovided, ESA will use the default files distributed with the source code).284.5. Downloading and Running ESA4. Create a configuration file named configurations.txt within the directory,telling ESA the details it requires, including names of the algorithm andthe instance set/distribution, file names of running time data and, if used, thefile names of model specifications and LATEX and gnuplot templates.5. Run the script in the ESA directory by ./runESA.sh <directory name>,and ESA will run according to the specifications in <directory name>/configurations.txt.4.5.2 Running ESA as a Web ServiceESA is also available online as a web service, which supports the essential but notall features of the command-line version. In particular:• it only supports three pre-defined models, include:– exponential: a ·bn,– root-exponential: a ·b√n, and– polynomial: a ·nb;• it uses the default LATEX and gnuplot templates for report generation;• it only supports a limited number of statistics, include median, mean, as wellas 75th, 90th and 95th percentiles.To run ESA as a web service, visit www.cs.ubc.ca/labs/beta/Projects/ESA/esa-online.html, upload the file of running time data and fill in the other detailsof the form. After submission, ESA will run in the back and redirect the user to theoutput technical report. The user interface is shown in Figure 4.8.294.5. Downloading and Running ESAFigure 4.8: The user interface of ESA as a web service.30Chapter 5Empirical Time Complexity ofRandom k-SAT2In this chapter, we study the empirical time complexity of random k-SAT, witha focus on random 3-SAT at the phase transition. Our analyses were performedon prominent complete and incomplete SAT solvers. We also study the empiricalscaling of these solvers on two interesting distributions of random 4-SAT instancesto further understand the time complexity of random k-SAT.5.1 Experimental SetupFor our study, we selected three SLS-based solvers, WalkSAT/SKC [64], Bal-ancedZ [44] and probSAT [7], and three DPLL-based solvers, kcnfs [20], march_hi[32] and march_br [31]. We chose these, because WalkSAT/SKC and kcnfs are twoclassical solvers that are widely known in the community, and the others showedexcellent performance in recent SAT competitions.All sets of uniform random 3-SAT instances used in our study were obtainedusing the same generator used for producing SAT competition instances [8]. Toseparate satisfiable from unsatisfiable instances with n ≤ 500, we used the CSH-CrandMC hybrid solver[46], the winner of Random SAT+UNSAT Track in the2013 SAT Competition. For larger instances, for which the use of complete solversis impractical, we performed long runs (43 200 CPU sec) of BalancedZ and treatedthose instances not solved in those runs as unsatisfiable. Since BalancedZ neverfailed to solve any of our instances known to be satisfiable, and because the cut-offtime we used for these long runs was at least 20 times of the maximum runningtime observed on any satisfiable instance of the same size, we have high confid-ence that we did not miss any satisfiable instances. We note that, even if we hadmisclassified some satisfiable instances, the quantile performance measures wouldnot be much affected (if at all).2This chapter covers major results from [52] on the empirical time complexity of random 3-SATat the phase transition.315.1. Experimental SetupAfter observing floor effects for WalkSAT/SKC due to small CPU times re-quired for solving small instances, we calculated running times based on the num-ber of local search steps performed, and on estimates of CPU time per local searchstep for each problem size obtained from long runs on unsatisfiable instances. TheCPU time estimates thus obtained closely agreed with measured CPU times forshort and long runs, but are considerably more accurate. For all other solvers,which were less affected by such floor effects, we used runsolver [60] to recordCPU times (and to enforce CPU time limits).For our scaling analysis, we restricted ourselves to 2-parameter models. Inprinciple, our methodology works for models with fewer or more parameters; how-ever, fitting models with more parameters requires more data points (and hencecomputational resource) and has more potential for overfitting. In particular, wemainly considered two parametric models:• Exp [a,b] (n) = a ·bn (2-parameter exponential);• Poly [a,b] (n) = a ·nb (2-parameter polynomial).For investigating 4-SAT, we also added the following model, which was motivatedby Hoos and Stützle [35], to better characterise the scaling behaviour of the solvers:• RootExp [a,b] (n) = a ·b√n (2-parameter root-exponential).We note that these models are used to capture the first-order terms of the actualscaling function, which we expect to also contain lower-order terms in most cases.Particularly, we did not include constant offsets in these models, which could cap-tured running time for setting up, reporting results, etc., because we believe that,for the problems and instance sizes studied here, those offsets are small enough tobe negligible.Models were fitted to performance observations in the form of quantiles, chieflythe median of the distributions of running times over sets of instances for given n.Compared to the mean, the median has two advantages: it is statistically morestable and is immune to the presence of a certain amount of timed-out runs. Con-sidering the stochastic nature of the SLS-based solvers, we performed 5 independ-ent runs per instance and used the median over those 5 running times as the runningtime for the respective instance. We note that more stable estimates of the medianscan be obtained by performing more runs per instance, but we restricted ourselvesto 5 runs due to limitations of our computational resources. Our approach couldbe easily extended to other scaling models, but, as we will show in the following,these models jointly characterise the scaling observed in all our experiments, andthus we saw no need to consider different or more complex models. For fitting325.2. Location of Phase Transitionparametric scaling models to observed data, we used the non-linear least-squaresLevenberg-Marquardt algorithm.Closely following works by Hoos [33] and Hoos and Stützle [35], we computed95% bootstrap confidence intervals for the performance predictions obtained fromour scaling models, based on 1000 bootstrap samples per instance set and 1000automatically fitted variants of each scaling model. We extended their approach intwo ways, which are described detailedly in Chapter 3.For collecting running time data for our SAT solvers, we used Compute Canada/ Westgrid cluster orcinus (DDR), each equipped with two Intel Xeon E5450 quad-core CPUs at 3.0 GHz with 16GB of RAM running 64-bit Red Hat EnterpriseLinux Server 5.3.5.2 Location of Phase TransitionWe first revisited studies on the location of phase transition for 3- and 4-SAT, andrefined existing models for them. This refinement is important because we want toavoid drifting off the phase transition region, which may bias our results.5.2.1 Location of Phase Transition for 3-SATCrawford and Auton modelled the location of the phase transition point based onextensive experiments on uniform random 3-SAT instances with up to 300 vari-ables as:mc = 4.258 ·n+58.26 ·n−2/3, (5.1)where n is the number of variables and mc the critical number of clauses, at whichabout 50% satisfiable instances are obtained [17]. The parametric form of thismodel was derived using finite-size scaling by Kirkpatrick and Selman [39]. Whilethis model does provide a good fit for n ≤ 300, in preliminary experiments, wefound that for n > 300, it increasingly underestimates mc. Furthermore, the modelin Eq. 5.1 predicts that, as n grows, mc/n approaches 4.258, which is in contradic-tion with a more recent result by Mertens et al., who used the heuristic one-stepreplica symmetry breaking cavity method from statistical physics to estimate thevalue of mc/n as 4.26675±0.00015 [50].Because in the empirical scaling study that follows, we wanted to be sure to notdrift off the phase transition point (which could bias the scaling models, especially,as the phase transition, in terms of mc/n, is known to become increasingly steep asn grows), we decided to revisit and improve the Crawford & Auton’s model.We first chose several m/n ratios for each n ∈ {300,400, . . . ,1400}, aroundthe predictions from Eq. 5.1, loosely adjusted based on results from preliminary335.2. Location of Phase Transitionexperiments. Next, we generated sets of 600 uniform random 3-SAT instancesfor each m/n ratio. We separated out the satisfiable instances using a hybrid com-plete solver, CSHCrandMC [46], with a cutoff of 3600 CPU seconds per run. Up ton= 500, we solved all instances within that cutoff. Beyond, it would have been im-practical to run any complete solver sufficiently long to prove unsatisfiability, andwe therefore heuristically assumed that the instances not solved by CSHCrandMCare unsatisfiable. As mentioned earlier, we later used much longer runs of Bal-ancedZ to challenge these putative unsatisfiable instances further, and among thethousands of instances for which this was done, only one was found to be satis-fiable. Nevertheless, throughout what follows we are well aware of the fact that wemay underestimate the fraction of satisfiable instances in our sets.We note that, even at large numbers of instances sampled at the correct (un-known) value of mc/n, we should expect to get sets with a fraction of satisfiableinstances varying according to a binomial distribution. Based on this observation,we determined 95% confidence intervals for the fraction of satisfiable instancesobserved for each n. We then rejected the m/n values for which we empirically ob-served fractions of satisfiable instances outside of the respective 95% confidenceinterval as valid estimates of mc/n. In this way, we obtained bounds on the locationof the phase transition point, mc/n. In many cases, the confidence intervals for ourinitial sets of 600 instances were too wide to yield bounds, and we subsequentlyincreased the number of instances in those sets until for every n, we obtained anupper and lower bound on mc/n. Those bounds and the respective set sizes areshown in Table 5.1.These results provide further evidence for our earlier observation that for largern, Crawford & Auton’s model becomes inaccurate, as the respective predictions arebelow the lower bounds from our analysis. We note that our lower bounds are valid,as they could only increase if some of our putatively unsatisfiable instances werein fact satisfiable. For the same reason, the upper bounds may be inaccurate.Interestingly, and notably different from Crawford & Auton’s model, our res-ults suggest that mc/n as a function of n is not monotonic, but first drops, beforeslowly increasing again, possibly towards a threshold value. This observation, incombination with the limiting value found by Mertens et al. [50] led us to choosea model of the formmc = 4.26675 ·n+a ·nb+ c ·nd ,where a · c < 0, b < 0 and d < 0.3Finally, taking the data points from work by Crawford and Auton [17] up to3This type of model is consistent with current theoretical thinking that, for small n, second-orderterms may cause non-monotonic behaviour of the function mc(n). (Dimitris Achlioptas, personalcommunication.)345.2. Location of Phase Transitionn Lower bound Upper boundPredictionEq. 5.1 Eq. 5.2300 4.2600 (3600) 4.2667 (3600) 4.2623 4.2638400 4.2575 (4800) 4.2650 (600) 4.2607 4.2626500 4.2580 (2400) 4.2660 (600) 4.2598 4.2622600 4.2567 (600) 4.2667 (4800) 4.2594 4.2622700 4.2600 (1200) 4.2657 (4800) 4.2591* 4.2623800 4.2575 (2400) 4.2638 (2400) 4.2588 4.2624900 4.2600 (2400) 4.2667 (3600) 4.2587* 4.26261000 4.2600 (600) 4.2670 (2400) 4.2586* 4.26271100 4.2609 (1200) 4.2655 (600) 4.2585* 4.26291200 4.2600 (2400) 4.2650 (3600) 4.2584* 4.26301300 4.2608 (600) 4.2646 (600) 4.2584* 4.26311400 4.2600 (1200) 4.2664 (2400) 4.2583* 4.2633Table 5.1: Lower and upper bounds on the location of the phase transition (mc/n)for 3-SAT determined with 95% confidence, and predictions from the two modelsdiscussed in the text. In parentheses: number of instances used for determiningthe bound. Model predictions inconsistent with our lower bounds are marked withasterisks (*).n = 300 (which we believe to be of high quality) and the mid-points between ourbounds from Table 5.1, we fitted this 4-parameter model, resulting in:mc = 4.26675n+447.884n−0.0350967−430.232n−0.0276188 (5.2)Figure 5.1 shows the predictions obtained from this model, along with the boundsand mid-points between them from Table 5.1. This model was subsequently usedto generate the instance sets used in the scaling analysis described in Section 5.3.While it is possible that our model is biased by the fact that we may have missedsatisfiable instances for larger n, it provides a better basis than previously availablefor generating instances at or very near the solubility phase transition of uniformrandom 3-SAT.5.2.2 Location of Phase Transition for 4-SATFollowing the same approach used for modelling the locations of the phase trans-ition points for 3-SAT (see Section 5.2.1), we run experiments to determine theintervals of the phase transition points for 4-SAT as well. The lower and upperbounds can be found in Table 5.2.The bounds, in combination with the limiting value found by [50], led us tochoose a model of the formr = 9.931+a ·nb,355.3. Empirical Scaling of Solver Performance on Random 3-SAT at Phase Transition 4.25 4.255 4.26 4.265 4.27 4.275 4.28 4.285 4.29 4.295 4.3 0  200  400  600  800  1000  1200  1400m c/nnSupport data from Crawford&Auton (1996)Support data (all solved)Support data (estimated)Challenge data (estimated)Bounds (95% confidence)Fitted modelFigure 5.1: Empirical bounds on the location of the phase transition points for3-SAT, mc/n for different n, data used for fitting our model (eq. 5.2) and modelpredictions. Data for n > 500 is based on heuristic estimates of the fraction ofsatisfiable instances and may underestimate the true mc/n.resulting, after fitting, inr = 9.931+155.729×n−2.01596. (5.3)Figure 5.2 shows the predictions obtained from this model, along with the boundsand mid-points between them from Table 5.2. This model was subsequently usedto generate the instance sets used in the scaling analysis described in Section 5.4.5.3 Empirical Scaling of Solver Performance on Random3-SAT at Phase TransitionWe first studied 3-SAT – arguably the conceptually simplestN P-complete prob-lem, and focused on random 3-SAT instances at phase transition, which is a widelyknown distribution of hard 3-SAT instances.5.3.1 Empirical Scaling of the Performance of SLS-based SolversWe first fitted our two parametric scaling models to the median running times ofthe three SLS-based solvers we considered, as described in Section 5.1. For eachSLS-based solver, both models were fitted using median running times for n =365.3. Empirical Scaling of Solver Performance on Random 3-SAT at Phase Transition# Variables Lower bound Upper bound Prediction by eq. 5.320 10.20 (600) 10.40 (600) 10.30240 9.975 (600) 10.10 (600) 10.02360 9.95 (2400) 10.0 (1200) 9.97280 9.90 (600) 9.975 (2400) 9.954100 9.91 (600) 9.95 (1200) 9.945120 9.925 (600) 9.958 (2400) 9.941140 9.921 (600) 9.95 (1200) 9.938160 9.919 (600) 9.931 (2400) 9.937180 9.922 (600) 9.944 (600) 9.935200 9.92 (1200) 9.94 (600) 9.935250 9.916 (1200) 9.94 (600) 9.933300 9.917 (600) 9.94 (600) 9.933350 9.92 (600) 9.9428 (600) 9.932400 9.92 (600) 9.945 (600) 9.932Table 5.2: Exclusive lower and upper bounds of phase transition points for 4-SATdetermined with 95% confidence, and predictions from the model discussed inthe text. The number after the lower/upper bounds (in parentheses) are the leastnumber of instances we have used to conclude with 95% confidence that the ratiois away from the phase transition point.200,250, . . .500 (support) and later challenged with median running times for n =600,700, . . .1000. Details of the instances can be found in Table 5.3. This resultedin the models, shown along with RMSEs on support and challenge data, in Table5.4. In particular, we illustrate the fitted models of WalkSAT/SKC in Figure 5.3;similar results are obtained for BalancedZ and probSAT (figures not shown).We note that the RMSEs on support data, i.e., the data used for fitting the mod-els, often, but not always provide a good indication of their predictive power. Basedon the latter, in the form of RMSEs on challenge data, we see clear indications thatthe median running times of all three SLS-based solvers are overall more consistentwith our polynomial scaling model than the best exponential model.But how much confidence should we have in these models? Are the RMSEssmall enough that we should accept them? To answer this question, we assessedthe fitted models using the bootstrap approach described in Chapter 3 and imple-mented in ESA. The results of this analysis, shown in Table 5.5 , clearly showthat observed median running times for the SLS-based solvers are consistent withour polynomial scaling model and inconsistent with the exponential model (as il-lustrated in Figure 5.3 for WalkSAT/SKC). Also, this analysis gives us bootstrapconfidence intervals for the model parameters, as shown in Table 5.6. Note thatthe scaling of all SLS-based solvers are consistent with a low-degree polynomialmodel (b < 4). This is especially striking for the larger challenge instances sizes.Limited experiments for even larger instances sizes (up to n = 2000) produced375.3. Empirical Scaling of Solver Performance on Random 3-SAT at Phase Transition 9.6 9.7 9.8 9.9 10 10.1 10.2 10.3 10.4 0  200  400m c/nnSupport data (all solved)Challenge data (estimated)Bounds (95% confidence)Fitted modelFigure 5.2: Empirical bounds on the location of the phase transition points for4-SAT, mc/n for different n, data used for fitting our model (eq. 5.3) and modelpredictions. Data for n ≥ 200 is based on heuristic estimates of the fraction ofsatisfiable instances and may underestimate the true mc/n.# Variables 200 250 300 350 400 450 500# Clauses 854 1066 1279 1492 1704 1918 2130C-to-V ratio 4.27 4.264 4.263 4.2629 4.260 4.260 4.260equation 5.2 4.2687 4.2652 4.2638 4.2630 4.2626 4.2623 4.2622# Variables 600 700 800 900 1000# Clauses 2557 2984 3410 3836 4263C-to-V ratio 4.2617 4.2629 4.2625 4.2622 4.2630equation 5.2 4.2622 4.2623 4.2624 4.2625 4.2627Table 5.3: We used UniformKSAT to generate 12 sets of random 3-SAT instancesof different sizes, at clause-to-variable ratios computed as equation 5.2. This tablesummarises the numbers of variables and of clauses, and the clause-to-variableratios.385.3. Empirical Scaling of Solver Performance on Random 3-SAT at Phase TransitionSolver ModelRMSE RMSE(support) (challenge)WalkSAT/SKCExp. Model 6.89157×10−4×1.00798n 0.0008564 0.7600Poly. Model 8.83962×10−11×n3.18915 0.0007433 0.03142BalancedZExp. Model 1.32730×10−3×1.00759n 0.001759 1.081Poly. Model 5.14258×10−10×n2.97890 0.002870 0.05039probSATExp. Model 8.35877×10−4×1.00763n 0.0013867 0.6487Poly. Model 2.92275×10−10×n2.99877 0.002285 0.03301Table 5.4: Fitted models for median running times of SLS-based solvers solvingphase-transition random 3-SAT instances and the corresponding RMSE values (inCPU sec). Model parameters are shown with 6 significant digits, and RMSEs with4 significant digits; the models yielding more accurate predictions (as per RMSEson challenge data) are shown in boldface.10-410-310-210-1100101 100  200  500  1000CPU time [sec]nSupport dataExp. model: 6.89157e-04 × 1.00798nPoly. model: 8.83962e-11 × n3.18915Exp. model bootstrap intervalsPoly. model bootstrap intervalsChallenge data (with confidence intervals)Figure 5.3: Fitted models for the median running times (in CPU-sec). Both modelsare fitted with the median running times of WalkSAT/SKC solving the satisfiableinstances from the set of 1200 random 3-SAT instances of 200, 250, ..., 500 vari-ables, and are challenged by median running times of 600, 700, ..., 1000 variables.395.3. Empirical Scaling of Solver Performance on Random 3-SAT at Phase TransitionSolver nObserved median running time (sec) Predicted confidence intervals (sec)Point estimates Confidence intervals Poly. model Exp. modelWalkSAT/SKC600 0.056 [0.050,0.070] [0.054,0.081] [0.067,0.104]700 0.108 [0.093,0.120] [0.083,0.146]* [0.137,0.264]800 0.180 [0.132,0.209] [0.122,0.238]* [0.277,0.664]900 0.267 [0.222,0.323] [0.170,0.373]* [0.565,1.676]1000 0.385 [0.327,0.461] [0.229,0.557]* [1.151,4.200]BalancedZ600 0.095 [0.085,0.102] [0.082,0.116]* [0.103,0.149]700 0.142 [0.131,0.154] [0.124,0.195]* [0.204,0.348]800 0.194 [0.177,0.212] [0.177,0.308]* [0.400,0.816]900 0.270 [0.231,0.324] [0.240,0.462] [0.782,1.915]1000 0.353 [0.307,0.398] [0.316,0.663] [1.531,4.493]probSAT600 0.050 [0.043,0.059] [0.050,0.085] [0.063,0.110]700 0.089 [0.077,0.105] [0.073,0.149]* [0.119,0.271]800 0.151 [0.133,0.197] [0.101,0.245]* [0.222,0.664]900 0.237 [0.209,0.295] [0.135,0.379]* [0.413,1.640]1000 0.357 [0.304,0.438] [0.174,0.559]* [0.771,4.050]Table 5.5: 95% bootstrap confidence intervals for observed and predicted runningtimes of SLS-based solvers solving random 3-SAT instances. The instance sizesshown here are larger than those used for fitting the models. Bootstrap intervalson predictions that agree with the observed point estimates are shown in boldface,and those that fully contain the confidence intervals on observations are marked byasterisks (*).Solver Model Confidence interval of a Confidence interval of bWalkSAT/SKCPoly.[2.58600×10−12,8.63869×10−10] [2.80816,3.76751]Exp.[4.05064×10−4,1.00662×10−3] [1.00709,1.00924]BalancedZPoly.[3.65984×10−11,4.53094×10−9] [2.60985,3.41689]Exp.[8.91856×10−4,1.83333×10−3] [1.00675,1.00855]probSATPoly.[5.00843×10−12,1.02411×10−8] [2.40567,3.66266]Exp.[4.90478×10−4,1.42905×10−3] [1.00629,1.00907]Table 5.6: Bootstrap intervals of model parameters for median running times ofSLS-based solvers solving phase-transition random 3-SAT instances.405.3. Empirical Scaling of Solver Performance on Random 3-SAT at Phase Transition10-410-310-210-1100101 100  200  500  1000CPU time [sec]nSupport DataChallenge DataExponential ModelPolynomial ModelExponential Model Bootstrap IntervalsPolynomial Model Bootstrap IntervalsSupport Data for BalancedZChallenge Data for BalancedZFigure 5.4: Comparing the scaling model for BalancedZ with that for WalkSA-T/SKC on solving 3-SAT instances. Both models are fitted with the median run-ning times of WalkSAT/SKC or BalancedZ solving 3-SAT instances of 200, 250,..., 500 variables, and are challenged by median running times of 600, 700, ..., 1000variables. Similar results can also been obtained by doing the comparison in theother way around.further evidence consistent with the polynomial scaling models for all three SLS-based solvers (see Section 5.3.3). Although the running time of the SLS-basedsolvers are still quite moderate even at these instances sizes, the much longer runsrequired to filter out satisfiable instances with high confidence make it difficult tofurther increase n.Next, we used the same bootstrap confidence intervals to investigate the signi-ficance of differences observed between the scaling models for different solvers.Using the approach described in Section 3.2, we cannot reject the hypothesis thatthe differences reflected in the constants for the polynomial models of our threeSLS-based solvers seen in Table 5.4 are insignificant. The difference betweenWalkSAT/SKC and BalancedZ, for instance, is illustrated in Figure 5.4. Examiningthese results in detail, we believe that by substantially increasing the number ofinstances for each value of n, the bootstrap confidence intervals can likely be nar-rowed to also demonstrate significant scaling differences between all pairs of SLS-based solvers.Finally, we investigated the question whether our observations regarding thequalitative differences in the scaling of median running time of SLS-based SATsolvers also hold when considering higher quantiles of the distribution of running415.3. Empirical Scaling of Solver Performance on Random 3-SAT at Phase TransitionSolver ModelRMSE RMSE(support) (challenge)WalkSAT/SKCExp. Model 1.3165×10−3×1.0092n 0.002370 5.354Poly. Model 6.5151×10−12×n3.8148 0.002143 0.06216BalancedZExp. Model 3.0082×10−3×1.0075n 0.005848 2.024Poly. Model 1.0385×10−9×n2.9892 0.004009 0.05739probSATExp. Model 1.5357×10−3×1.0088n 0.004594 3.912Poly. Model 2.1829×10−11×n3.6122 0.004743 0.2951Table 5.7: Fitted models for 0.75-quantile of the running times of SLS-based solv-ers solving phase-transition random 3-SAT instances and the corresponding RMSEvalues (in CPU sec). Model parameters are shown with 5 significant digits andRMSEs with 4 significant digits; the models yielding more accurate predictions (asper RMSEs on challenge data) are shown in boldface.times across instance sets. For these solvers, we noted weak evidence that theratio between the higher quantiles and the median increases with n, but the ob-served scaling of the 0.75-quantile is still found to be consistent with a polynomialmodel and inconsistent with an exponential model. Details of the models, alongwith RMSEs on support and challenge data, are shown in Table 5.7. Note that forBalancedZ, the parameter b in the polynomial model for 0.75-quantiles does notincrease much from that of the model for medians, which is in contrast to our find-ings for the other two solvers. The fitted models of WalkSAT/SKC are illustratedin Figure 5.5.5.3.2 Empirical Scaling of the Performance of DPLL-based SolversApplying the same methodology, we studied the empirical scaling of the perform-ance of DPLL-based solvers. For each DPLL-based solver, we used support datafor n= 200,250, . . .400 and challenge data for n= 450,500,550, as for even largern, we could no longer complete sufficiently many runs to estimate even median run-ning times. The fitted models and the corresponding RMSEs are shown in Table5.8. We note that running times of the DPLL-based solvers are more consistentwith our exponential scaling model – even when only considering performance onsatisfiable instances. As an example, we illustrate the fitted models of kcnfs in Fig-ure 5.6; similar results are obtained for the two march variants (figures not shown).We also performed similar bootstrap analysis for DPLL-based solvers. Theresult of this analysis, presented in Table 5.9, shows opposite results for SLS-basedsolvers: the median running times of DPLL-based solvers, even solving satisfiableinstances only, are consistent with exponential models but not polynomial ones.Also, this analysis gives us bootstrap confidence intervals for the model parameters,425.3. Empirical Scaling of Solver Performance on Random 3-SAT at Phase TransitionSolver Instances ModelRMSE RMSE(support) (challenge)kcnfsAllExp. Model 4.30400×10−5×1.03411n 0.05408 143.3Poly. Model 9.40745×10−31×n12.1005 0.06822 1516Sat.Exp. Model 2.41708×10−5×1.03205n 0.02496 83.86Poly. Model 2.41048×10−30×n11.7142 0.05600 225.8Unsat.Exp. Model 6.38367×10−5×1.03386n 0.001466 52.19Poly. Model 9.70804×10−31×n12.1448 0.1813 2291march_hiAllExp. Model 4.93309×10−5×1.03295n 0.05449 460.0Poly. Model 1.05593×10−30×n12.0296 0.05971 1266Sat.Exp. Model 8.33113×10−6×1.03119n 0.03035 3.087Poly. Model 2.44435×10−30×n11.4789 0.03879 61.77Unsat.Exp. Model 7.86081×10−5×1.03281n 0.03336 399.7Poly. Model 2.10794×10−30×n11.9828 0.16703 1912march_brAllExp. Model 6.17600×10−5×1.03220n 0.05401 402.4Poly. Model 5.56146×10−30×n11.7408 0.05199 1253Sat.Exp. Model 1.02788×10−5×1.03048n 0.02497 13.72Poly. Model 1.25502×10−29×n11.1944 0.03341 67.85Unsat.Exp. Model 6.10959×10−5×1.03344n 0.03230 262.8Poly. Model 5.18600×10−31×n12.2154 0.1586 1920Table 5.8: Fitted models for median running times of the DPLL-based solverssolving phase-transition random 3-SAT instances and the corresponding RMSEvalues (in CPU sec). Model parameters are shown with 6 significant digits, andRMSEs with 4 significant digits; the models yielding more accurate predictions (asper RMSEs on challenge data) are shown in boldface.Solver nObserved median running time (sec) Predicted confidence intervals (sec)Point estimates Confidence intervals Poly. model Exp. modelkcnfs450 156.480 [143.340,166.770] [98.326,122.115] [120.078,161.444]500 750.510 [708.290,806.130] [327.997,439.089] [561.976,889.428]*550 3896.450 [3633.630,4130.915] [971.862,1402.255] [2622.488,4901.661]*march_hi450 112.553 [101.957,121.167] [62.021,91.787] [74.982,116.729]500 564.821 [508.433,614.105] [190.395,333.963] [317.398,628.498]*550 2971.450 [2660.430,3152.570] [523.034,1074.244] [1342.438,3375.460]*march_br450 112.812 [101.108,121,469] [69.649,91.290] [84.743,117.937]500 594.095 [542.564,620.963] [226.874,332.773] [385.943,640.179]*550 2975.580 [2544.450,3179.950] [659.553,1070.478] [1754.277,3492.830]*Table 5.9: 95% bootstrap confidence intervals for observed running times ofDPLL-based solvers solving random 3-SAT instances. The instance sizes shownhere are larger than those used for fitting the models. Bootstrap intervals on predic-tions that agree with the observed point estimates are shown in boldface, and thosethat fully contain the confidence intervals on observations are marked by asterisks(*).435.3. Empirical Scaling of Solver Performance on Random 3-SAT at Phase Transition10-310-210-1100101102 100  200  500  1000CPU time [sec]nSupport dataExp. modelPoly. modelExp. model bootstrap intervalsPoly. model bootstrap intervalsChallenge data (with confidence intervals)Figure 5.5: Fitted models for the 0.75-quantiles of the running times (in CPU-sec).Both models are fitted with the 0.75 quantiles of the running times of WalkSA-T/SKC solving the satisfiable instances from the set of 1200 random 3-SAT in-stances of 200, 250, ..., 500 variables, and are challenged by median running timesof 600, 700, ..., 1000 variables.as shown in Table 5.10. Note that we found no evidence for a substantial changein the ratios between the median and higher quantiles as n increases; thus, there isno reason to assume any difference in the scaling models of a higher quantile andthe median other than a constant factor.Additionally, we investigated whether the differences between the scaling mod-els of DPLL-based solvers are significant using the approach described in Section3.2. We found that the bootstrap confidence intervals for march_hi and march_brlargely overlap with each other, and we cannot detect sufficient evidence for statist-ically significant differences in scaling behaviour. On the other hand, the observedrunning times for kcnfs are inconsistent with the scaling models for march_hi andmarch_br, confirming that the performance of the march solvers indeed scales sig-nificantly better than that of kcnfs. (See Figure 5.7 for an illustration of the com-parison of the models for kcnfs and march_hi.)Using the same approach, we investigated the significance of the differencesin scaling behaviour for each DPLL-based solver when applied to satisfiable andunsatisfiable instances only, respectively. Intuitively, we would expect unsatis-fiable instances to be harder, and the differences between the respective scalingmodels are significant, in that the observed running times for solving unsatisfiableinstances are generally inconsistent with the scaling model for the same solver on445.3. Empirical Scaling of Solver Performance on Random 3-SAT at Phase Transition10-410-310-210-1100101102103104105 100  200  300  400  500  600CPU time [sec]nSupport dataExp. model: 6.05099e-05 × 1.03311nPoly. model: 1.00831e-30 × n12.08155Exp. model bootstrap intervalsPoly. model bootstrap intervalsChallenge data (with confidence intervals)Figure 5.6: Fitted models for the median running times of kcnfs [CPU-sec]. Bothmodels are fitted with the median running times of kcnfs solving 1200 random 3-SAT instances of 200, 250, ..., 400 variables, and are challenged by median runningtimes of 450, 500 and 550 variables.only satisfiable instances (as illustrated in Figure 5.8 for march_hi). To furtherinvestigate whether the performance difference can be attributed to the constantfactor a in the exponential models, we fitted a one-parameter model of the forma · bnsat of the median running times for solving unsatisfiable instances, where a isa free parameter and bsat is the value of parameter b from the scaling model forsatisfiable instances. For all three DPLL-based solvers, the observed running timesfor solving unsatisfiable instances is consistent with these one-parameter models,suggesting that, indeed, the scaling on satisfiable and unsatisfiable instances differsonly by a constant factor of around 20.We also experimented with sat11, the running times of which are highly con-sistent with exponential models but not with polynomial ones. The exponent issimilar to those of other DPLL-based solvers, but the constant factor is much lar-ger (potentially due to implementation details of the solver itself).5.3.3 Challenging the Scaling Models of SAT Solvers with LargerInstancesWe further challenged our models with runs of solvers of larger instances, the sizesof which match those used in the most recent SAT Competitions. Due to the cost455.3. Empirical Scaling of Solver Performance on Random 3-SAT at Phase TransitionSolver Instances Model Confidence interval of a Confidence interval of bkcnfsAllPoly.[3.33969×10−31,4.30846×10−29] [11.4234,12.2674]Exp.[3.33378×10−5,1.07425×10−4] [1.03136,1.03476]Sat.Poly.[1.81751×10−36,2.47924×10−21] [8.19161,14.0613]Exp.[2.02817×10−6,5.85540×10−4] [1.02283,1.03835]Unsat.Poly.[5.65987×10−32,5.28138×10−30] [11.8558,12.6251]Exp.[4.22382×10−5,1.03613×10−4] [1.03252,1.03508]march_hiAllPoly.[1.72213×10−31,3.51322×10−27] [10.6379,12.3284]Exp.[2.90480×10−5,1.72479×10−4] [1.02928,1.03433]Sat.Poly.[5.68054×10−32,1.93360×10−24] [9.16765,12.1075]Exp.[7.97341×10−7,7.07414×10−5] [1.02521,1.03765]Unsat.Poly.[5.01545×10−31,1.43762×10−29] [11.6557,12.2263]Exp.[5.51043×10−5,1.06014×10−4] [1.03195,1.03386]march_brAllPoly.[1.26357×10−31,1.79577×10−28] [11.1507,12.3819]Exp.[2.61030×10−5,1.08165×10−4] [1.03064,1.03466]Sat.Poly.[4.33022×10−32,1.53387×10−24] [9.10489,12.1426]Exp.[1.81911×10−6,6.40234×10−5] [1.02515,1.03519]Unsat.Poly.[4.36438×10−31,6.51380×10−30] [11.7848,12.2416]Exp.[4.27173×10−5,9.18950×10−5] [1.03230,1.03443]Table 5.10: Bootstrap intervals of model parameters for median running times ofDPLL-based solvers solving phase-transition random 3-SAT instances.10-610-510-410-310-210-1100101102103104105 100  200  300  400  500  600CPU time [sec]nSupport DataChallenge DataExponential Model Bootstrap IntervalsPolynomial Model Bootstrap IntervalsExponential Model: 7.22871e-05 × 1.03181nPolynomial Model: 1.31142e-29 × n11.5990Support Data of kcnfsChallenge Data of kcnfsFigure 5.7: Comparing scaling models of kcnfs and march_hi on solving 3-SAT in-stances. Both models are fitted with the median running times of kcnfs or march_hisolving 1200 3-SAT instances of 200, 250, ..., 400 variables, and are challenged bymedian running times of 450, 500 and 550 variables.465.3. Empirical Scaling of Solver Performance on Random 3-SAT at Phase Transition 1e-05 0.0001 0.001 0.01 0.1 1 10 100 1000 10000 100  200  300  400  500  600CPU time [sec]nSupport data for sat. instancesChallenge data for sat. instancesExp. model for sat.: 8.33113e-06 × 1.03119nExp. model bootstrap intervals for sat.Support data for unsat. instancesChallenge data for unsat. instancesFigure 5.8: Scaling of the median running time of march_hi on satisfiable vs un-satisfiable 3-SAT instances. We show the scaling model and bootstrap confidenceintervals for performance on satisfiable 3-SAT instances, along with observationson support and challenge data for satisfiable and unsatisfiable instances.of solving these instances, we only used a small number of these. For SLS-basedsolvers, we generated 120 instances of sizes 1500 and 2000 as well as 11 instancesof sizes 5000 and 10000. For these instances, 66, 58, 5 and 7 instances were solvedrespectively, with a cut-off of 2 CPU days. Note that, WalkSAT/SKC failed tosolve one n = 5000 instance and five n = 10000 instances that BalancedZ man-aged to solve, and probSAT missed one instance of each size. For DPLL-basedsolvers, we generated 120 instances with n = 650, 67 our of which are satisfiableand 53 are unsatisfiable. Note that about half of the march_hi and march_br runsproduced SIGSEGV errors when solving these larger instances, and we did notfurther analyse them with larger instances.The detailed results are shown in Table 5.11. From these results, we see evid-ence that the exponential model remains a good descriptor of the scaling of kcnfsfor the large n = 650 instances. For the SLS-based solvers, the predictions madeby the polynomial models are also well consistent with the observed running times,except for WalkSAT/SKC solving n = 10000 instances. We have not found con-crete explanations for this exception, but believe that it may result from imple-mentation details of the solver. Nevertheless, our results are striking, consideringthat we fitted our polynomial models on instances with n ≤ 500 and these modelsremain valid, even when challenged with 10 times larger instances that are challen-ging for modern SAT solvers.475.4. Empirical Scaling of Solver Performance on Random 4-SATSolver nPredicted Observed median running time (sec)Confidence intervals Point estimate Confidence intervalsWalkSAT/SKC1500 [0.721,2.569]# 1.775 [1.135,5.286]2000 [1.641,7.571] 12.002 [4.969,32.103]5000 [20.48,217.6]# 149.6 [94.91,331.9]BalancedZ1500 [0.906,2.653] # 1.081 [0.541,2.586]2000 [1.915,7.068] # 3.465 [1.485,6.383]5000 [23.72,185.0] 18.07 [4.515,4077]10000 [149.9,2072]# 1064 [224.1,20475]probSAT1500 [0.488,2.755] 2.820 [1.335,5.745]2000 [0.989,8.129] 19.38 [3.740,37.944]5000 [9.387,257.527]# 12.75 [10.329,∞]10000 [51.502,3546.870] 11377 [1326,∞]kcnfs650, All [57720,149011]* 108071 [62308,126999]650, Sat. [1285,96993]* 26370.6 [11877,44980]650, Unsat. [111920,228028]* 159747 [135445,191781]Table 5.11: Challenging the models with running times for large 3-SAT instances.For the bootstrap intervals of the predicted median running times, those overlap-ping with the observed confidence intervals are in boldface, those that agree withthe observed point estimates are marked by #, and those agreeing with the observedconfidence intervals are followed by *.5.4 Empirical Scaling of Solver Performance on Random4-SATWe next studied two distributions of random 4-SAT instances: phase-transitioninstances (as discussed in Section 5.2.2) and a class of less constrained instancesproposed by Dimitris Achlioptas.5.4.1 Empirical Scaling of Solver Performance on Random 4-SAT atPhase TransitionSimilar to the analysis performed for random 3-SAT, we analysed the scalingperformance of the same set of SAT solvers (WalkSAT/SKC, BalancedZ, prob-SAT, kcnfs, march_hi and march_br) on phase-transition random 4-SAT as well.For SLS-based solvers, we used n = 70,80, · · ·150 as the support data and n =160,170, · · ·200,225, · · · ,300 as the challenge data; while for DPLL-based solv-ers, we used n = 80,90, · · · ,130 as the support data and n = 140,150, · · · ,190 asthe challenge data. The sizes were chosen considering the running times of thesolvers and our budget of computational resources, and we roughly used half ofthe sizes for model fitting and the other half for challenging the obtained models.To reduce floor effects, we made sure that median running times for SLS-based485.4. Empirical Scaling of Solver Performance on Random 4-SATModelRMSE RMSE(support) (challenge)WalkSAT/SKCExp. Model 6.9784×10−5×1.0456n 0.00084331 11.175RootExp. Model 2.0584×10−7×2.777√n 0.0011903 1.1389Poly. Model 9.6674×10−15×n5.8626 0.0015716 3.5491BalancedZExp. Model 1.1841×10−3×1.0208n 0.00077 0.03889RootExp. Model 1.0763×10−4×1.5631√n 0.00089 0.08843Poly. Model 1.473×10−7×n2.4042 0.00107 0.13516probSATExp. Model 8.7936×10−4×1.0208n 0.00071 0.06340RootExp. Model 8.082×10−5×1.5616√n 0.00087 0.03013Poly. Model 1.1498×10−7×n2.3939 0.00104 0.06473Table 5.12: Fitted models for median running times of the SLS-based solvers solv-ing phase-transition random 4-SAT instances and the corresponding RMSE values(in CPU sec). The models yielding the most accurate predictions (as per RMSEson challenge data) are shown in boldface.Solver Model Confidence interval of a Confidence interval of bWalkSAT/SKCPoly.[8.3451×10−17,7.7952×10−13] [4.7239,6.7928]RootExp.[3.375×10−8,1.1002×10−6] [2.308,3.2189]Exp.[2.9942×10−5,0.00014855] [1.0377,1.0518]BalancedZPoly.[1.9088×10−8,4.2697×10−7] [2.1762,2.8316]RootExp.[4.7598×10−5,1.6375×10−4] [1.5002,1.6841]Exp.[7.9832×10−4,1.4443×10−3] [1.0189,1.0242]probSATPoly.[2.4434×10−8,3.9394×10−7] [2.1333,2.7203]RootExp.[4.3564×10−5,1.3324×10−4] [1.4899,1.6541]Exp.[6.5415×10−4,1.1207×10−3] [1.0186,1.0234]Table 5.13: Bootstrap intervals of model parameters for median running times ofSLS-based solvers solving phase-transition random 4-SAT instances.solvers are above 0.005 CPU sec, and those for DPLL-based solvers above 0.1CPU sec.We first fitted parametric models to the median running times. Preliminary ex-periments showed that polynomial and exponential models do not accurately cap-ture the scaling behaviour of the SLS-based solvers. Thus, as mentioned in Section5.1, we included root-exponential models to better characterise their scaling beha-viour. The fitted models and corresponding RMSE values for SLS-based solversare shown in Table 5.12. We also performed bootstrap analysis as described inSection 3.1 and computed bootstrap confidence intervals for the model paramet-ers, observed and predicted running times, which are presented in Tables 5.13,5.14 and 5.15, respectively. Our results show that exponential (for BalancedZ)or root-exponential (for the other solvers) models usually fit the running times ofSLS-based solvers better, which stands in contrast to random 3-SAT. Figure 5.9, forinstance, shows the fitted models and corresponding bootstrap confidence intervals495.4. Empirical Scaling of Solver Performance on Random 4-SATSolver nObserved median running time (sec)Point estimates Confidence intervalsWalkSAT/SKC160 0.07675 [0.06683,0.09427]170 0.1112 [0.09516,0.1296]180 0.1661 [0.1439,0.2145]190 0.2111 [0.1773,0.252]200 0.3488 [0.2739,0.3967]225 0.7695 [0.5895,1.055]250 2.306 [1.936,3.239]275 5.159 [4.166,7.331]300 13.29 [8.837,17.68]BalancedZ160 0.03399 [0.02849,0.03878]170 0.03899 [0.03447,0.04252]180 0.04849 [0.04247,0.05478]190 0.06099 [0.04947,0.07156]200 0.08699 [0.06646,0.1016]225 0.106 [0.08699,0.1492]250 0.209 [0.186,0.2967]275 0.283 [0.232,0.3613]300 0.4724 [0.3586,0.6418]probSAT160 0.02499 [0.022,0.02852]170 0.02899 [0.027,0.03452]180 0.03099 [0.02899,0.03499]190 0.03899 [0.03347,0.04652]200 0.04699 [0.03999,0.05751]225 0.06599 [0.05641,0.07851]250 0.118 [0.09241,0.136]275 0.166 [0.128,0.209]300 0.2615 [0.17,0.3467]Table 5.14: 95% bootstrap confidence intervals for observed running times of SLS-based solvers solving phase-transition random 4-SAT instances. The instance sizesshown here are larger than those used for fitting the models.505.4. Empirical Scaling of Solver Performance on Random 4-SATSolver nPredicted confidence intervals of median running time (sec)Poly. model RootExp. model Exp. modelWalkSAT/SKC160 [0.06396,0.09442]* [0.06672,0.09797]* [0.06985,0.1021]#170 [0.08559,0.1412]* [0.09274,0.153]* [0.1013,0.1676]#180 [0.1121,0.2077]# [0.1272,0.2377]* [0.1466,0.2776]#190 [0.1447,0.2971]* [0.173,0.3618]* [0.2123,0.4554]200 [0.1844,0.42]* [0.2334,0.5483]* [0.3074,0.7519]#225 [0.3217,0.9324]# [0.4784,1.49]* [0.7752,2.645]250 [0.5291,1.907] [0.943,3.848]* [1.955,9.34]#275 [0.83,3.644] [1.798,9.486]* [4.932,32.98]#300 [1.252,6.581] [3.332,22.46]* [12.44,116.5]#BalancedZ160 [0.02644,0.03333] [0.0275,0.0349]# [0.02862,0.03662]#170 [0.03021,0.03962]# [0.03225,0.04282]* [0.03455,0.0466]#180 [0.03425,0.04663] [0.03764,0.05223]# [0.04173,0.05931]*190 [0.03857,0.05441] [0.04374,0.06337]# [0.05039,0.07523]#200 [0.04317,0.06298] [0.05064,0.07626] [0.06086,0.09537]#225 [0.05592,0.08773] [0.07192,0.119]# [0.09753,0.1728]#250 [0.07048,0.118] [0.1001,0.1812] [0.1561,0.313]*275 [0.08678,0.1543] [0.1371,0.2704] [0.2495,0.5671]#300 [0.1049,0.1971] [0.1849,0.3968] [0.3985,1.03]#probSAT160 [0.01968,0.02449] [0.0205,0.02566]# [0.02136,0.02694]#170 [0.02242,0.02885] [0.02397,0.03119]# [0.02574,0.03394]#180 [0.02535,0.03368]# [0.02791,0.03769]* [0.03105,0.04275]190 [0.02848,0.03899] [0.03239,0.04532]# [0.03731,0.05385]#200 [0.03183,0.0448] [0.03733,0.05423]# [0.04482,0.06784]#225 [0.0409,0.06161] [0.05244,0.08337]* [0.07101,0.1208]250 [0.05113,0.08195] [0.07232,0.1252]# [0.1129,0.2152]#275 [0.06258,0.1061] [0.09844,0.1843]# [0.1794,0.3833]300 [0.07526,0.1342] [0.1322,0.2668]# [0.285,0.684]Table 5.15: 95% bootstrap confidence intervals for predicted median running timeof SLS-based solvers solving phase-transition random 4-SAT instances. The in-stance sizes shown here are larger than those used for fitting the models. Bootstrapintervals on predictions that overlap with the corresponding bootstrap interval onobserved data are shown in boldface, those that agree with the observed point es-timates are marked by sharps (#), and those that fully contain the confidence inter-vals on observations are marked by asterisks (*).515.4. Empirical Scaling of Solver Performance on Random 4-SAT10-310-210-1100101 60  80  100  120  140  160  180  200  250  300CPU time [sec]nSupport dataExp. modelRootExp. modelPoly. modelExp. model bootstrap intervalsRootExp. model bootstrap intervalsPoly. model bootstrap intervalsChallenge data (with confidence intervals)Figure 5.9: Fitted models for median running times of BalancedZ. Both models arefitted to the median running times of BalancedZ solving the 4-SAT instances fromthe set of phase-transition random instances with 70, 80, ... 150 variables, and arechallenged by median running times for instances with 160, 170, ... 200, 225, ...300 variables.for BalancedZ.On the other hand, results for DPLL-based solvers on random 4-SAT are verysimilar to those on 3-SAT, with exponential models fitting the running times verywell. Details of the fitted models and corresponding RMSE values can be foundin Table 5.16. Note that we did not present result for polynomial models, becausethey clearly under-estimate the scaling of the running times, and the Levenberg-Marquardt Algorithm takes extremely long time to fit such models. We againperformed bootstrap analysis and obtained the bootstrap confidence intervals formodel parameters and for observed and predicted running times shown in Tables5.17 and 5.18, respectively. Figure 5.10 illustrates the fitted models and corres-ponding bootstrap confidence intervals for march_hi.5.4.2 Empirical Scaling of Solver Performance on Less-constrainedRandom 4-SATWe also explored less-constrained 4-SAT instances generated according to a for-mula proposed by Dimitris Achlioptas, namelym = 2k−1 ·n (5.4)525.4. Empirical Scaling of Solver Performance on Random 4-SATSolver ModelRMSE RMSE(support) (challenge)kcnfsExp. Model 1.9621×10−5×1.1098n 0.02472 689.0RootExp. Model 5.8865×10−11×10.008√n 0.06817 1085march_hiExp. Model 5.4972×10−5×1.1006n 0.01430 333.02RootExp. Model 4.8241×10−10×8.285√n 0.06351 1318march_brExp. Model 3.272×10−5×1.1045n 0.07124 123.7RootExp. Model 1.813×10−10×8.9758√n 0.11317 1034Table 5.16: Fitted models for median running times of the DPLL-based solverssolving phase-transition random 4-SAT instances and the corresponding RMSEvalues (in CPU sec). The models yielding more accurate predictions (as perRMSEs on challenge data) are shown in boldface.Solver Model Confidence interval of a Confidence interval of bkcnfsExp.[1.1077×10−5,3.4453×10−5] [1.1047,1.1149]RootExp.[1.8193×10−11,1.9247×10−10] [8.9979,11.116]march_hiExp.[2.2428×10−5,9.0675×10−5] [1.0961,1.1082]RootExp.[7.9199×10−11,1.3699×10−9] [7.5342,9.701]march_brExp.[9.033×10−6,5.9334×10−5] [1.0994,1.1156]RootExp.[1.3416×10−11,6.0806×10−10] [8.0647,11.323]Table 5.17: Bootstrap intervals of model parameters for median running times ofDPLL-based solvers solving phase-transition random 4-SAT instances.For WalkSAT/SKC, we used n = 6000,7000, · · · ,15000 as support data, and n =16000,17000, · · · ,20000,30000,40000 as challenge data. For kcnfs, we usedn = 300,350, · · · ,450 as support data, and n = 500,550,600 as challenge data.The sizes were chosen considering the running times of the solvers and our budgetof computational resources, and we roughly used half of the sizes for model fittingand the other half for challenging the obtained models. To reduce floor effects, wemade sure that median running times for SLS-based solvers are above 0.005 CPUsec, and those for DPLL-based solvers above 0.1 CPU sec. The fitted models andcorresponding RMSE values are shown in Table 5.19. Results of bootstrap ana-lysis, including confidence intervals for model parameters, for observed runningtimes and for predicted values, are shown in Tables 5.20, 5.21 and 5.22, respect-ively. Figure 5.11 illustrates the fitted models of WalkSAT/SKC, which suggestssome scaling behaviour lower-bounded by, and close to, polynomial. For kcnfs,the fitted models are shown in Figure 5.12, and the observed running times aresurprisingly well consistent with the polynomial model.To summarise, both WalkSAT/SKC and kcnfs demonstrate significantly betterscaling behaviour for solving the less-constrained 4-SAT instances than for phase-transition instances. This confirms earlier observations that phase-transition SAT535.4. Empirical Scaling of Solver Performance on Random 4-SATSolver nObserved median running time (sec) Predicted confidence intervals (sec)Point estimates Confidence intervals RootExp. model Exp. modelkcnfs140 39.39 [37.06,41.39] [36.81,42.96]* [38.74,45.25]#150 115.2 [108.5,119.9] [91.8,116.9]# [105,134.3]*160 319.6 [306.1,327.3] [221.9,306.9] [284.1,398]*170 839.7 [783.9,874.3] [521.8,785.3] [768.8,1183]*180 2331 [2242,2448] [1197,1955] [2081,3519]*190 6151 [5745,6447] [2686,4749][5632,1.046×104]*march_hi140 36.54 [32.77,39.18] [31.81,39.33]* [33.43,41.37]#150 101.6 [95.02,105.3] [73.85,101.8]# [83.98,116.5]*160 278.8 [263.4,291.4] [166.8,254.3] [211,326.4]*170 759.6 [710.9,803.5] [367.3,615.5] [530.1,910.3]*180 2070 [1965,2157] [790.7,1449] [1332,2535]*190 5227 [4887,5537] [1667,3331] [3346,7059]*march_br140 32.6 [28.06,35.07] [31.9,38.24]# [33.53,40.23]150 94.26 [85.39,99.02] [76.18,104.4]* [86.79,119.8]#160 261.1 [251.2,272.4] [176.8,277.8]* [224.7,359.2]*170 714.1 [686.2,748.8] [399.3,717]# [580.3,1077]*180 1868 [1774,1962] [880,1800] [1498,3227]*190 4892 [4529,5237] [1898,4396] [3865,9646]*Table 5.18: 95% bootstrap confidence intervals for observed running times ofDPLL-based solvers solving phase-transition random 4-SAT instances. The in-stance sizes shown here are larger than those used for fitting the models. Bootstrapintervals on predictions that overlap with the corresponding bootstrap interval onobserved data are shown in boldface, those that agree with the observed point es-timates are marked by sharps (#), and those that fully contain the confidence inter-vals on observations are marked by asterisks (*).Solver ModelRMSE RMSE(support) (challenge)WalkSAT/SKCExp. Model 0.02346×1.00012n 0.00301 0.76778Poly. Model 2.97526×10−7×n1.35594 0.00139 0.08560kcnfsExp. Model 2.47586×10−5×1.02815n 0.02313 137.183Poly. Model 9.44803×10−31×n11.625 0.03740 2.91674Table 5.19: Fitted models for median running times of the solvers solving less-constrained random 4-SAT instances and corresponding RMSE values (in CPUsec). The models yielding more accurate predictions (as per RMSEs on challengedata) are shown in boldface.Solver Model Confidence interval of a Confidence interval of bWalkSAT/SKCExp. [0.02190,0.02570] [1.00011,1.00012]Poly.[2.12689×10−7,4.36148×10−7] [1.31430,1.39195]kcnfsExp.[3.47927×10−6,0.00012] [1.02419,1.03295]Poly.[3.92302×10−36,3.17454×10−26] [9.91562,13.6855]Table 5.20: Bootstrap intervals of model parameters for median running time ofthe solvers solving less-constrained random 4-SAT instances.545.4. Empirical Scaling of Solver Performance on Random 4-SATSolver nObserved median running time (sec)Point estimates Conf. intervalsWalkSAT/SKC16000 0.1404 [0.1387,0.1422]17000 0.1540 [0.1521,0.1558]18000 0.1697 [0.1671,0.1717]19000 0.1862 [0.1841,0.1882]20000 0.2040 [0.2014,0.2075]30000 0.4363 [0.4234,0.4546]40000 0.7262 [0.7113,0.7355]kcnfs500 21.3758 [17.8882,24.9521]550 72.2845 [55.7475,89.4627]600 189.071 [153.728,239.913]Table 5.21: 95% bootstrap confidence intervals for observed running times on less-constrained random 4-SAT instances. The instance sizes shown here are larger thanthose used for fitting the models..Solver nPredicted confidence intervals (sec)Exp. model Poly. modelWalkSAT/SKC16000 [0.1510,0.1617] [0.1462,0.1519]17000 [0.1687,0.1832] [0.1583,0.1652]18000 [0.1884,0.2075] [0.1707,0.1789]19000 [0.2105,0.2351] [0.1833,0.1929]*20000 [0.2351,0.2663] [0.1961,0.2072]#30000 [0.7111,0.9267] [0.3341,0.3637]40000 [2.1504,3.2241] [0.4876,0.5423]kcnfs500 [18.7603,39.0276]# [16.1396,32.6376]*550 [61.7576,189.54]# [41.1617,115.85]*600 [208.907,957.263] [99.4254,379.376]*Table 5.22: 95% bootstrap confidence intervals for predicted running times on less-constrained random 4-SAT instances. The instance sizes shown here are larger thanthose used for fitting the models. Bootstrap intervals on predictions that overlapwith the corresponding bootstrap interval on observed data are shown in boldface,those that agree with the observed point estimates are marked by sharps (#), andthose that fully contain the confidence intervals on observations are marked byasterisks (*).555.5. Experiments for the Survey Propagation Algorithm10-210-1100101102103104105 80  100  120  140  160  180  200CPU time [sec]nSupport dataExp. modelRootExp. modelExp. model bootstrap intervalsRootExp. model bootstrap intervalsChallenge data (with confidence intervals)Figure 5.10: Fitted models for median running times of march_hi. Both modelsare fitted with the median running times of march_hi solving the 4-SAT instancesfrom the set of phase-transition random instances with 40, 50, ... 120 variables,and are challenged by median running times for instances with 130, 140, ... 190variables.instances represent a very hard distribution, and the less-constrained instances westudied here are significantly easier than phase-transition 4-SAT instances.5.5 Experiments for the Survey Propagation AlgorithmWe also tried to experiment with the survey propagation algorithm [11], but en-countered some practical problems with the implementation made available bythe authors4. In particular, the solver turns out to be unable to solve a signi-ficant proportion of satisfiable phase-transition random 3-SAT instances for n =200, · · · ,1000. In our experiments, 5 runs were performed for each instance andwe report the number of instances that cannot be solved in at least 3 out of the 5runs in Table 5.23. It is clear that the proportion of failed runs is quite large andtends to increase with n. For n = 1000, more than 70% of the instances could notbe solved in 3 out of 5 runs. To further evaluate whether we could restart the al-gorithm to make it a practical solver for satisfiable instances, we performed 1000independent runs on 3 instances, and none of these instances was ever solved inany single run of the algorithm.4http://areeweb.polito.it/ricerca/cmp/code/sp565.6. Chapter Summary10-210-1100101 5000  10000  15000  20000  30000  40000CPU time [sec]nSupport dataExp. modelPoly. modelExp. model bootstrap intervalsPoly. model bootstrap intervalsChallenge data (with confidence intervals)Figure 5.11: Fitted models for the median running times of WalkSAT/SKC.Both models are fitted to the median running times of WalkSAT/SKC solv-ing the less-constrained 4-SAT instances from the set of random instances with6000,7000, · · · ,15000 variables, and are challenged by median running times with≥ 16000 variables.According to Alfredo Braunstein and Riccardo Zecchina5, the reason for this isthat the solver, after some decimation, starts not converging and eventually goes toa singularity of the underlying fixed point equations. Then, the solver presumablymakes random choices and eventually gets into a bad partial configuration of vari-ables that leads to a contradiction. To solve this problem, one needs to implementa solver combining survey propagation with backtracking. Due to time constraints,we did not pursue this path to evaluate the survey propagation algorithm empir-ically. However, our methodology is readily applicable to the survey propagationalgorithm, once a practical implementation of the algorithm is available, and suchanalysis should be of interest to the community.5.6 Chapter SummaryIn this chapter, we presented an empirical analysis of the scaling behaviour of sev-eral prominent SAT solvers on phase transition uniform random 3-SAT instances.We were surprised to find solid support for low-degree polynomial scaling of theperformance of all three SLS-based SAT solvers we studied, which stands in stark5Personal communications.575.6. Chapter Summary10-210-1100101102103 300  400  500  600CPU time [sec]nSupport dataExp. modelPoly. modelExp. model bootstrap intervalsPoly. model bootstrap intervalsChallenge data (with confidence intervals)Figure 5.12: Fitted models for the median running times of kcnfs. Both modelsare fitted to the median running times of kcnfs solving the less-constrained 4-SATinstances from the set of random instances with 300, · · · ,450 variables, and arechallenged by median running times for instances with 500,550,600 variables.contrast to the exponential performance scaling of the three DPLL-based solverswe considered, even when these were run on satisfiable instances only. As ex-pected, we did find evidence for significant differences in the performance scalingmodels of DPLL-based solvers on satisfiable and unsatisfiable instances, but thesedifferences could be attributed to a constant factor, suggesting that they cannot beleveraged to obtain reasonably accurate guesses on the unsatisfiability of instancesbased on long runs of those solvers. However, the qualitative differences betweenthe scaling of SLS- and DPLL-based solvers can be exploited, in that for growingn, it appears to be possible to separate satisfiable from unsatisfiable instances withhigh and further increasing accuracy based on long runs of SLS-based or hybridsolvers. Furthermore, the polynomial scaling of even high quantiles of the distri-bution of running times for SLS-based solvers across instance sets suggests thatrandom 3-SAT instances from the solubility phase transition are likely not captur-ing the difficulty we expect to encounter in the worst case when solving anN P-hard problem. We note that, considering their sizes, these instances are still veryhard compared to almost all types of structured SAT instances and may thereforestill be useful as a benchmark.We also explored the empirical scaling of these solvers on two distributionsof random 4-SAT instances. For phase-transition instances, exponential (for Bal-ancedZ) or root-exponential (for the other solvers) models usually fit the running585.6. Chapter Summaryn 200 250 300 350 400 450# Instances 581 589 633 558 579 572# Not solved 211 268 317 285 319 329% Not solved 36.3% 45.5% 50.1% 51.1% 55.1% 57.5%n 500 600 700 800 900 1000# Instances 578 572 607 584 592 593# Not solved 355 356 398 408 393 437% Not solved 61.4% 62.2% 65.6% 69.9% 66.4% 73.7%Table 5.23: Number of the satisfiable instances of different sizes, and the numberand percentage of instances that cannot be solved in at least 3 out of 5 independentruns of the survey propagation algorithm.times of SLS-based solvers better, while DPLL-based solvers demonstrate scalingbehaviour well characterised by exponential models. For solving a class of less-constrained instances, we showed that both WalkSAT/SKC and kcnfs scale signi-ficantly better than solving phase-transition instances in that a polynomial modelis the better fit for both solvers.To the best of our knowledge, ours is the first study that uses a statisticallysound way to assess the scaling of SAT solver performance with instance size, andto discriminate between different scaling models. The empirical scaling analysiswe have performed here can easily be applied to other SAT solvers and other dis-tributions of SAT instances (as long as reasonably large sets of instances for eachn can be obtained). We believe that doing so can and will produce interesting anduseful results that can inspire the design of algorithms and benchmark instancegenerators as well as, hopefully, theoretical work.59Chapter 6Empirical Scaling of RunningTime of TSP Solvers 6In this chapter, we studied the empirical scaling of finding time, that is, the timerequired by a solver to find an optimal solution of a TSP instance without provingoptimality, of prominent TSP solvers. Our analyses were done on the state-of-the-art complete solver, Concorde, and two prominent incomplete solvers, EAX andLKH.6.1 Experimental SetupFor our analyses, we used the benchmark RUE instances that were previously usedand made available to us by Dubois-Lacoste et al. [21] as well as by Hoos andStützle [35, 36]. We focused on RUE instances, because they represent a widely-studied distribution of hard TSP instances and can be obtained easily for a givenset of instance sizes. These instances were generated using the portgen generatorfrom the 8th DIMACS implementation challenge for TSP [38]. It places n pointsin a 100000× 100000 square uniformly at random and computes the Euclideandistances between pairs of points. There are 1000 instances for each instance sizen= 500,600, · · · ,1500,2000 and 100 instances for each n= 2500,3000, · · · ,4500.Regarding solvers, we chose three high-performance TSP solvers that are wellknown in the community, including one complete algorithm and two incompletealgorithms. All algorithms perform well on large TSP instances and play an im-portant role in the development of TSP solving techniques. We thus have reasonsto expect that analyses of these solvers would greatly enrich our understanding ofempirical time complexity for finding optimal TSP solutions.For complete algorithm, we concentrated on the long-time state-of-the-art exactTSP solver, Concorde [4]. Mainly based on a branch & cut scheme, it makes use ofa multitude of heuristics and is the best-performing exact solver that we are awareof. The solver has been used to solve some largest non-trivial instances.6This chapter covers results that are also reported in two working papers [54, 55]606.1. Experimental SetupFor incomplete algorithms, we chose the latest versions of LKH and EAX.LKH [29, 30], Helsgaun’s variant of the Lin-Kernighan TSP heuristic, is a variable-depth search method that performs sophisticated heuristic-guided local search movesbased on sequences of five or more edge exchanges. It can perform restarts basedon perturbations of previously found solutions using a variety of strategies. Thealgorithm arguably represents a milestone in the development of inexact TSP solv-ing and the state-of-the-art for finding high-quality TSP solutions. In this work, weused LKH version 2.0.7, keeping parameters at default except for PATCHING_Aand PATCHING_C, which we set to 2 and 3 respectively to include patching ofcycles in searching for improving moves. These values were also adopted in theexample parameter file for solving TSPLIB instance pr2392 and in earlier work byDubois-Lacoste et al. [21].Recently, there has been a fast-developing line of work on evolutionary al-gorithms for inexact solving of TSP. EAX [57], a very successful genetic algorithm,makes use of improved variants of the edge assembly cross-over recombinationoperator. It also exploits diversity preservation techniques and carefully initial-ises the initial population by local optimisation. In the following, we used EAXwith default parameters, namely with population size as 100 and number of no-improvement iterations before restart as 30.For our analyses, we used implementations of LKH and EAX made availableto us by Dubois-Lacoste et al. [21]. Their versions of the solvers use a restart mech-anism to achieve improved performance. Restart mechanisms are needed becauseeven high-performance local search algorithms can suffer from stagnation beha-viour, and restart mechanisms help both algorithms considerably to find the knownoptimal solution more efficiently.For our scaling analysis, we considered three parametric models:• Exp [a,b] (n) = a ·bn (2-parameter exponential);• RootExp [a,b] (n) = a ·b√n (2-parameter root-exponential);• Poly [a,b] (n) = a ·nb (2-parameter polynomial).Similar to our analyses on SAT, we fitted these models on the medians of runningtimes, and computed 95% confidence intervals of the predictions following meth-odology outlined in Chapter 3.For running these TSP solvers, we used a cluster of computers, each equippedwith a 2.0GHz eight-core AMD 6128 processor, 2×12MB L2/L3 cache and 16GBRAM. The operating system was Cluster Rocks Linux 6.0/CentOS 6.3, and gcc4.4.6 was used to compile the programs with optimisation flag -O3. To avoid meas-urement noise caused by memory bottlenecks, we used only one core per CPU to616.2. Empirical Scaling of Running Time of Concorde for Finding Optimal Solutionsrun our solvers and imposed a memory limit of 1GB per run. We performed 10independent runs of incomplete TSP solvers, and used the median as well.6.2 Empirical Scaling of Running Time of Concorde forFinding Optimal SolutionsIn this section, we study empirical scaling of finding time, namely the time requiredto find an optimal solution of a given TSP instance, of Concorde, the state-of-the-art complete solver for TSP. We further compared the scaling results with thoseof overall running time of Concorde and obtained interesting results. Our workcomplement previous analyses on empirical scaling of TSP, as discussed in Section2.2.6.2.1 Treatments of running time dataWhen studying the running times for finding optimal solutions (finding times) com-pared to those including proving optimality (overall running times), care needs tobe taken in the treatment of timed-out runs. In the Concorde case, this concernsruns on instances for which only pseudo-optimal solutions or not even pseudo-optimal solutions are available. These solutions were found through multiple runsof EAX and LKH. For a subset of these instances, EAX and LKH reached the samebest solution in every single run; we conjecture that these best solutions are in factoptimal and refer to them as pseudo-optimal. When comparing the best solutionfound during a run of Concorde to the pseudo-optimal solution or the best solutionwe know for an instance in case these are not considered pseudo-optimal, we couldverify that for all except one such solution Concorde did not find within the 7 CPUdays such solutions, that is, we can verify that the finding times of Concorde arelarger than 7 CPU days and, hence, larger than the median times on which we basethe analysis given below. It thus enables us to compute the median finding time in adefinite manner, despite that finding times for these instances are unknown. For theremaining instance on which Concorde found the same solution as the best knownone, the computation time to match the best solution known was larger than therespective median time required on instances for which proven optimal solutionsare already known.6.2.2 Scaling models, RMSEs and bootstrap intervalsWe first fitted parametric models on the finding times of Concorde, and obtainedthree models together with corresponding RMSE values as in Table 6.1. According626.2. Empirical Scaling of Running Time of Concorde for Finding Optimal SolutionsSolver ModelRMSE RMSE(support) (challenge)ConcordeExp. 4.0388×1.0032n 7.7847 2.7852×106RootExp. 0.083457×1.2503√n 7.0439 9169.4Poly. 1.6989×10−10×n3.9176 9.9327 1.038×105Table 6.1: Fitted models for the medians of the running times of Concorde andthe corresponding RMSE values (in CPU sec). The models yielding more accuratepredictions (as per RMSEs on challenge data) are shown in boldface.Solver Model Confidence interval of a Confidence interval of bConcordeExp. [2.6108,5.2975] [1.0030,1.0036]RootExp. [0.037056,0.15111] [1.2287,1.2793]Poly.[6.1872×10−12,1.7351×10−9] [3.5859,4.3713]Table 6.2: 95% bootstrap confidence intervals for the parameters of the scalingmodels for Concorde’s running times to find optimal solutions of RUE instances.to the RMSE values on support and challenge data, the root exponential modelgives the best fit.To assess the confidence we should have in these models, we used the bootstrapre-sampling method described in Section 3.1 to evaluate the models, which givesus the confidence intervals for model parameters (Table 6.2) and for observed/pre-dicted running times (Tables 6.3 and 6.4 respectively). The results, also clearlyseen from Figure 6.1, indicate that the root-exponential model with a base ≈ 1.25is a very good fit for the finding times of Concorde.6.2.3 Comparison with scaling for overall running time of ConcordePreviously, Hoos and Stützle [35] have shown that the overall running times (forfinding optimal solutions and proving optimality) of Concorde also scale as a root-exponential model. Here, we repeated the analysis with newer overall running timedata that we collected together with finding times. The results we thus obtainedconfirmed a root-exponential scaling model for the overall running time of Con-corde, with a in the interval [0.11658,0.35323] and b in the interval [1.2126,1.2522].Note that the confidence interval for b is very close to that obtained for finding time,which is [1.2287,1.2793]. Moreover, the predictions made by the root-exponentialmodel of overall running time, as shown in Table 6.5, are quite consistent with ob-served finding times. A closer look at the data, however, reveals that the observedfinding times are usually closer to the lower end of the corresponding bootstrapconfidence intervals, while the observed overall running times are usually closer636.2. Empirical Scaling of Running Time of Concorde for Finding Optimal SolutionsSolver nObserved median finding timePoint estimates Confidence intervalsConcorde2000 1969 [1739,2222]2500 6149 [4084,8812]3000 1.84×104 [1.332×104,2.669×104]3500 3.246×104 [2.581×104,5.038×104]4000 1.312×105 [7.073×104,2.024×105]4500 2.633×105 [1.73×105,4.419×105]Table 6.3: 95% bootstrap confidence intervals for the medians of observed runningtimes for Concorde to find optimal solutions of RUE instances. The instance sizesshown here are larger than those used for fitting the models.Solver nPredicted confidence intervalsExp. model RootExp. model Poly. modelConcorde2000 [1988,3179] [1528,2269]* [1228,1795]2500[8718,1.884×104] [4536,8335]# [2737,4771]3000[3.853×104,1.103×105] [1.212×104,2.694×104]* [5252,1.057×104]3500[1.698×105,6.479×105] [3.001×104,7.925×104]# [9149,2.069×104]4000[7.5×105,3.809×106] [6.95×104,2.163×105]* [1.477×104,3.708×104]4500[3.301×106,2.245×107] [1.528×105,5.563×105]* [2.248×104,6.205×104]Table 6.4: 95% bootstrap confidence intervals for the medians of the running timepredictions for Concorde to find optimal solutions of RUE instances. The instancesizes shown here are larger than those used for fitting the models. Bootstrap inter-vals on predictions that are weakly consistent with the observed data are shown inboldface, those that are strongly consistent are marked by sharps (#), and those thatfully contain the confidence intervals on observations are marked by asterisks (*).Solver nPredicted confidence intervals Observed median overall running timeRootExp. model Point estimates Confidence intervalsConcorde2000 [1962,2736]# 2508 [2197,2760]2500 [5431,8914]# 7899 [4886,9789]3000[1.366×104,2.612×104]# 2.064×104 [1.492×104,2.795×104]3500[3.181×104,6.994×104]# 4.057×104 [2.586×104,5.719×104]4000[6.987×104,1.753×105]# 1.377×105 [8.236×104,2.108×105]4500[1.462×105,4.154×105]# 3.264×105 [1.953×105,5.087×105]Table 6.5: 95% bootstrap confidence intervals for the medians of the observed andpredicted overall running times for Concorde to solve RUE instances. Bootstrapintervals on predictions that are weakly consistent with the observed finding timesare shown in boldface, those that are strongly consistent are marked by sharps (#),and those that fully contain the confidence intervals on observations are marked byasterisks (*).646.2. Empirical Scaling of Running Time of Concorde for Finding Optimal Solutions100101102103104105106107108 500  1000  2000  4000CPU time [sec]nSupport dataExp. modelRootExp. modelPoly. modelExp. model bootstrap intervalsRootExp. model bootstrap intervalsPoly. model bootstrap intervalsChallenge data (with confidence intervals)Figure 6.1: Fitted models for the medians of the running times for Concorde tofind optimal solutions of RUE instances without proving optimality. All modelsare fitted with the medians of the running times of Concorde solving the RUEinstances from the set of instances 500 ≤ n ≤ 1500 variables, and are challengedby the medians of the running times of 2000≤ n≤ 4500 variables.to the higher end. In other words, we do not have strong evidence that the findingtimes scale differently compared to the overall running times, but more experi-ments may narrow down the confidence intervals and help us better understand thescaling behaviour of Concorde.To further test whether the difference between finding and overall runningtimes is explained by a constant factor, we fitted a one-parameter root-exponentialmodel of the form a · b√nproving on the finding times of Concorde, where bproving =1.2281 from the root-exponential model for Concorde overall running time. Ourresults show that the model obtained in this way tends to under-estimate the find-ing times, and imply that the finding times scale slightly worse than overall runningtimes. Note that this is also suggested by the difference in the confidence intervalsfor b in the two models. Moreover, it was also observed by Hoos and Stützle that,as n increases, the percentage of running times spent on finding optimal solutionsgoes up and dominates the running time for large n [36].So, do Concorde finding times scale worse than overall running times? Whilethis is not clear from cross-checking model predictions with observed finding times,our analysis using the one-parameter root-exponential model clearly suggests anaffirmative answer. We argue that two reasons cause our analysis with the two-parameter model of overall running time to fail to illustrate this: 1) the two-656.3. Empirical Scaling of Running Time of EAX and LKHparameter model represents a large family of models, and the bootstrap confidenceintervals tend to be wide; 2) the fitted root-exponential model for overall runningtime has a larger a, which causes an illusion that overall running times scale worsethan finding times. Given the relationship between finding and overall runningtimes, though, we believe that more sophisticated models than the two-parameterroot-exponential model we use here would be required in order to explain the in-crease of fraction of running time spent on finding optimal solutions.6.3 Empirical Scaling of Running Time of EAX andLKHIn this section, we study empirical scaling of running time of EAX and LKH, twoprominent incomplete solvers for TSP. We further compare the scaling results withthose of findings time of Concorde and obtain interesting results. Our work com-plements previous analyses on empirical scaling of TSP, as discussed in Section2.2.6.3.1 Details of instancesAs described in detail before, Concorde does not manage to solve all instanceswithin the allowed time. Because EAX and LKH cannot prove optimality, theyneed to be given the optimal solution in advance in order to terminate once theoptimal solution is reached. To make more instances available for EAX and LKHto solve, we ran Concorde on the previously unsolved instances with different seedsand/or on (limited number of) faster machines. In addition, we performed multipleruns of EAX and LKH on those instances that remain unsolved by Concorde. Fora subset of these instances, EAX and LKH reached the same best solution in everysingle run; we conjecture that these best solutions are in fact optimal and refer tothem as pseudo-optimal. In the results to follow, we include data for both optimaland pseudo-optimal instances. We note that only using data for optimal instancesgives qualitatively similar conclusions.For the six instances (two of size 4000 and four of size 4500) for which we didnot succeed in establishing pseudo-optimal solutions, we took special care in howto treat them. As found by Dubois-Lacoste et al. [21], performance correlationsbetween each pair of solvers are all very low. Thus, the instances for which wedo not have optimal or pseudo-optimal solutions may actually be easy for the in-complete solvers. Thus, they are treated using an optimistic/pessimistic estimationas done by Dubois-Lacoste et al. [21]. More precisely, we treat these instances aseasy with smaller-than-the-median running times in the optimistic estimation, and666.3. Empirical Scaling of Running Time of EAX and LKHSolver ModelRMSE RMSE(support) (challenge)EAXExp. 1.6512×1.0017n 0.80329 [1513.3,1566.2]RootExp. 0.24795×1.1230√n 0.45614 [44.77,88.739]Poly. 1.9196×10−5×n1.9055 0.10699 [235.79,287.6]LKHExp. 0.56147×1.0025x 0.51265 [18213,18330]RootExp. 0.030075×1.1879√x 0.38383 [797.7,909.51]Poly. 1.15×10−8×x2.9297 0.50193 [282.12,378.53]Table 6.6: Fitted models for the medians of the running times of EAX and LKH andthe corresponding RMSE values (in CPU sec). The models yielding more accuratepredictions (as per RMSEs on challenge data) are shown in boldface.Solver Model Confidence interval of a Confidence interval of bEAXExp. [1.6234,1.6764] [1.0017,1.0018]RootExp. [0.23938,0.25592] [1.1219,1.1242]Poly.[1.6803×10−5,2.1556×10−5] [1.8887,1.9245]LKHExp. [0.46665,1] [1.0021,1.0027]RootExp. [0.020678,0.043847] [1.1749,1.2006]Poly.[2.769×10−9,4.8245×10−8] [2.7229,3.1287]Table 6.7: 95% bootstrap confidence intervals for the model parameters of thescaling models for EAX’s and LKH’s running times to find optimal solutions ofRUE instances.as timed-out instances in the pessimistic treatment. This gives us intervals for themedians for those sizes where there are instances with unknown (pseudo-)optimalsolutions. We note that these intervals are not confidence intervals, but boundson the median running times, as the true median must be contained within theseintervals.6.3.2 Scaling models, RMSEs and bootstrap intervalsSimilarly, we first fitted parametric models to the running times of EAX and LKH,resulting in the models and corresponding RMSE values shown in Table 6.6. Judgedfrom the RMSE values, the running times of EAX and LKH are best approximatedby a root-exponential and a polynomial model, respectively.We also assessed the fitted models with bootstrap re-sampling, using the methoddescribed in Section 3.1. In particular, we calculated the confidence intervals formodel parameters (Table 6.7) and for observed/predicted running times (Tables 6.8and 6.9). The results for EAX, also presented graphically in Figure 6.2, indic-ate that the root-exponential model with a base ≈ 1.123 is a fairly good fit. Onthe other hand, the results for LKH, also shown in Figure 6.3, suggest a scalingbehaviour bounded from below and above by a polynomial and a root-exponential676.3. Empirical Scaling of Running Time of EAX and LKHSolver nObserved median run-timePoint estimates Confidence intervalsEAX2000 41.24 [40.03,42.26]2500 73.19 [61.11,118]3000 172.2 [155.7,223.2]3500 239.9 [220,357.7]4000 [483.2,547.4] [370.2,649.8]4500 [611.7,727.7] [520,877.6]LKH2000 62.64 [58.03,69.61]2500 137 [108.5,199.7]3000 249.4 [201.3,372.2]3500 382.8 [260.8,648.8]4000 [891,907.3] [551.5,1207]4500 [1059,1352] [808.2,2203]Table 6.8: 95% bootstrap confidence intervals for the medians of the observedrunning times for EAX and LKH to find optimal solutions of RUE instances. Theinstance sizes shown here are larger than those used for fitting the models.Solver nPredicted confidence intervalsExp. model RootExp. model Poly. modelEAX2000 [53.08,54.75] [43.77,45.01] [37.02,37.98]2500 [125.9,131.9] [80.33,83.51] [56.43,58.35]3000 [298.7,317.8] [139,146] [79.63,82.87]3500 [709.2,765.6] [230.3,244]# [106.5,111.5]4000 [1682,1845] [368.4,393.6] [137.1,144.1]4500 [3991,4445] [572.8,616.7]+ [171.3,180.8]LKH2000 [62.15,95.52]# [58.8,73.94]# [48.05,59.71]2500 [174.5,360.2] [138.1,193.6] [88.54,119.8]3000 [490,1361] [299.3,462.5] [145.9,211.4]3500 [1376,5154] [609.3,1030] [222.4,342.4]4000[3863,1.954×104] [1176,2178] [320.7,519.9]4500[1.085×104,7.412×104] [2173,4391] [442.8,751.6]Table 6.9: 95% bootstrap confidence intervals for the medians of the running timepredictions for EAX and LKH to find optimal solutions of RUE instances. Theinstance sizes shown here are larger than those used for fitting the models. Boot-strap intervals on predictions that are weakly consistent with the observed data areshown in boldface, those that are consistent are marked by plus signs (+), thosethat are strongly consistent are marked by sharps (#), and those that fully containthe confidence intervals on observations are marked by asterisks (*).686.3. Empirical Scaling of Running Time of EAX and LKH100101102103104 500  1000  2000  4000CPU time [sec]nSupport dataExp. modelRootExp. modelPoly. modelExp. model bootstrap intervalsRootExp. model bootstrap intervalsPoly. model bootstrap intervalsChallenge data (with confidence intervals)Figure 6.2: Fitted models for the medians of the running times for EAX to findoptimal solutions of RUE instances. All models are fitted with the medians ofthe running times of EAX solving the set of RUE instances of 500 ≤ n ≤ 1500variables, and are challenged by the medians of the running times of 2000 ≤ n ≤4500 variables.model, respectively.6.3.3 Comparison with scaling of Concorde for finding optimalsolutionsTo compare the scaling of Concorde with that of EAX and LKH, we first ana-lysed the bootstrap intervals of the values of the parameter b in the obtained root-exponential models. For Concorde, b∈ [1.2255,1.2747], which is significant largerthan that for EAX ([1.1219,1.1242]) and for LKH ([1.1749,1.2006]). This obser-vation suggests that Concorde scales significantly worse than EAX and LKH.To further test this, we fitted two one-parameter models of the form a ·b√nConcordeto the running times of EAX and LKH, where bConcorde = 1.2503. The obtainedmodels, shown in Figure 6.4, clearly over-estimate the running times of EAX andLKH, which confirms our observation that the finding times of Concorde scalesignificantly worse than those of EAX and LKH.696.4. Impact of Automated Configuration on Scaling of EAX and LKH10-1100101102103104105 500  1000  2000  4000CPU time [sec]nSupport dataExp. modelRootExp. modelPoly. modelExp. model bootstrap intervalsRootExp. model bootstrap intervalsPoly. model bootstrap intervalsChallenge data (with confidence intervals)Figure 6.3: Fitted models for the medians of the running times for LKH to findoptimal solutions of RUE instances. All models are fitted with the medians ofthe running times of LKH solving the set of RUE instances of 500 ≤ n ≤ 1500variables, and are challenged by the medians of the running times of 2000 ≤ n ≤4500 variables.6.4 Impact of Automated Configuration on Scaling ofEAX and LKHWe also performed several experiments investigating the impact of automated con-figuration on scaling behaviour of incomplete TSP solvers. Automated config-uration involves the use of an automated procedure to determine good parametervalues for a given algorithm. In this work, we used SMAC [37], the prominent se-quential model-based algorithm configurator, to configure EAX and LKH, and as-sessed the impact of algorithm configuration on their scaling behaviour. Followingthe standard configuration protocol, we performed 25 independent runs of SMACfor each scenario, selected the best parameter setting according to performance onthe given set of training instances and report scaling results for that setting.6.4.1 Impact of Automated Configuration on Scaling of EAXWe first present results for EAX, which has only two exposed parameters. One keyparameter is fNumOfPop, the population size whose default value is 100, and theother is fNumOfKids, the number of no-improvement iterations before the searchrestarts, for which the default value is 30. In our experiments, we enforced a cur-706.4. Impact of Automated Configuration on Scaling of EAX and LKH10-1100101102103104105 500  1000  2000  4000CPU time [sec]nSupport dataFixed-base RootExp. modelRootExp. model bootstrap intervalsChallenge data (with confidence intervals)10-1100101102103104105 500  1000  2000  4000CPU time [sec]nSupport dataFixed-base RootExp. modelRootExp. model bootstrap intervalsChallenge data (with confidence intervals)Figure 6.4: Fitted one-parameter models (of the form a ·b√nConcorde) for the mediansof the running times for EAX (top) and LKH (bottom) to find optimal solutionsof RUE instances. All models are fitted with the medians of the running times ofthe solvers solving the RUE instances from the set of instances of 500≤ n≤ 1500variables, and are challenged by the medians of the running times of 2000 ≤ n ≤4500 variables.716.4. Impact of Automated Configuration on Scaling of EAX and LKHSolver ModelRMSE RMSE(support) (challenge)EAXExp. 1.2511×1.0017n 0.4711 [1085,1097.8]RootExp. 0.1991×1.1193√n 0.22625 [22.278,32.899]Poly. 2.0916×10−5×n1.8464 0.050459 [124.04,135.61]Table 6.10: Fitted models for the medians of the running times of EAX with con-figured parameters and the corresponding RMSE values (in CPU sec). The modelsyielding more accurate predictions (as per RMSEs on challenge data) are shown inboldface.Solver Model Confidence interval of a Confidence interval of bEAXExp. [1.2249,1.2709] [1.0017,1.0017]RootExp. [0.19024,0.20586] [1.1182,1.1208]Poly.[1.7804×10−5,2.3636×10−5] [1.8288,1.8693]Table 6.11: 95% bootstrap confidence intervals for the parameters of the scalingmodels for EAX’s running times (with configured parameters) to find optimal solu-tions of RUE instances.off of 1 CPU day for each SMAC run. To ensure that SMAC could perform atleast 1000 runs of EAX, we further enforced a cut-off time of 86 sec for each EAXrun. After training on a set of RUE instances with 1500 cities, SMAC picked aparameter setting with larger population sizes (167 vs. 100) and smaller number ofno-improvement iterations (20 vs. 30).To investigate the scaling of EAX with optimised parameters, we used themethodology described in Section 3.1. The fitted models are presented in Table6.10 and illustrated in Figure 6.5 together with the confidence intervals obtainedfrom bootstrap analysis. Similar to results for the default version of EAX, root-exponential characterises the scaling the best, while the best exponential and poly-nomial models can be rejected with 95% confidence. From the confidence intervalsof the model parameters, as shown in Table 6.11, there is evidence that algorithmconfiguration improves the scaling of EAX, since it reduces the value of b in root-exponential models from 1.123 to 1.119 with completely disjoint confidence in-tervals ([1.1219,1.1242] vs. [1.1182,1.1208] ). Following the method describedin Section 3.2, we cross-checked the predictions of the root-exponential model forthe default version of EAX against the observed running times for the optimisedversion of EAX. Our results, illustrated in Figure 6.6, clearly show that the modelover-estimate the running times for the optimised version of EAX.726.4. Impact of Automated Configuration on Scaling of EAX and LKH100101102103104 500  1000  2000  4000CPU time [sec]nSupport dataExp. modelRootExp. modelPoly. modelExp. model bootstrap intervalsRootExp. model bootstrap intervalsPoly. model bootstrap intervalsChallenge data (with confidence intervals)Figure 6.5: Fitted models for the medians of the running times for EAX with con-figured parameters to find optimal solutions of RUE instances. The parametersof EAX are set to the optimised values determined by configuration on instanceswith 1500 variables using SMAC. All models are fitted with the medians of therunning times of EAX solving the SAT instances from the set of RUE instances of500≤ n≤ 1500 variables, and are challenged by the medians of the running timesof 2000≤ n≤ 4500 variables.736.4. Impact of Automated Configuration on Scaling of EAX and LKH100101102103 500  1000  2000  4000CPU time [sec]nSupport data for optimised EAXRootExp. model for optimised EAXBootstrap intervals for optimised EAXChallenge data for optimised EAXSupport data for default EAXRootExp. model for default EAXBootstrap intervals for default EAXChallenge data for default EAXFigure 6.6: Best fitted models for the medians of the running times for EAX withconfigured and default parameters to find optimal solutions of RUE instances. Allmodels are fitted with the medians of the running times of EAX solving the SATinstances from the set of RUE instances of 500 ≤ n ≤ 1500 variables, and arechallenged by the medians of the running times of 2000≤ n≤ 4500 variables.6.4.2 Effect of Population Size for EAXConsidering the impact of the parameters, it is natural to increase the populationsize when solving a larger instance size. Thus, we also performed experiments toquantify the impact of varying population size with instance size. In particular, weset the population size as a multiple of the instant size, namely,ps(n) = α ·n,where ps(n) is the population size for instance size n.We experimented with two different ways of determining α . The first way isto choose α based on the default parameter settings, namely, fNumOfPop = 100(for training instances with instance size n = 1500) and fNumOfKids = 30. Thisgives α = 0.067. We then followed the methodology described in Section 3.1 toexamine the scaling of EAX with these optimised parameters. The fitted modelsare presented in Table 6.12 and illustrated in Figure 6.7. Surprisingly, the best-fitting model is now the polynomial model, while the best and root-exponentialmodels are rejected with 95% confidence. This stands, as illustrated in Figure 6.8,in contrast with the scaling of the default version of EAX, which is well character-ised by a root-exponential model. We also performed bootstrap analysis described746.4. Impact of Automated Configuration on Scaling of EAX and LKHSolver ModelRMSE RMSE(support) (challenge)EAXExp. 0.34117×1.0026n 0.37901 [13976,13987]RootExp. 0.017513×1.1914√n 0.17295 [840.11,850.63]Poly. 4.7691×10−9×n2.9918 0.099509 [11.214,21.427]Table 6.12: Fitted models for the medians of the running times of EAX with de-fault parameters and varying population size and the corresponding RMSE values(in CPU sec). The models yielding more accurate predictions (as per RMSEs onchallenge data) are shown in boldface.Solver Model Confidence interval of a Confidence interval of bEAXExp. [0.32156,0.36342] [1.0025,1.0026]RootExp. [0.01539,0.01993] [1.1870,1.1957]Poly.[2.9101×10−9,7.6663×10−9] [2.9241,3.0607]Table 6.13: 95% bootstrap confidence intervals for the parameters of the scalingmodels for EAX’s running times (with default parameters and varying populationsize) to find optimal solutions of RUE instances.in Section 3.1, which results in the confidence intervals for the model parametersshown in Table 6.13.The second way is to configure α (hence fNumOfPop) and fNumOfKids us-ing SMAC. Following the same configuring protocol in the previous configura-tion experiment with EAX, we arrived at the optimised setting with α = 0.111and fNumOfKids = 20. We then followed the same methodology to examine thescaling of EAX with these optimised parameters. The fitted models are presen-ted in Table 6.14 and illustrated in Figure 6.9. Similar to the previous experi-ment, the best-fitting model is now the polynomial model, while the best and root-exponential models are rejected with 95% confidence. This also stands, as illus-trated in Figure 6.10, in contrast with the scaling of the default version of EAX.We also performed bootstrap analysis described in Section 3.1, which results in theconfidence intervals for the model parameters shown in Table 6.15.To summarise, both experiments indicate that varying population size can sig-nificantly improve the scaling behaviour of EAX. Whether the formula for thepopulation size is determined based on default parameter settings or based on auto-mated configuration of α (together with fNumOfKids), the scaling of EAX is bestcaptured by a polynomial model. Cross-checking the two polynomial models ob-tained above using the method described in Section 3.2, we found that the differ-ence between them is not significant.756.4. Impact of Automated Configuration on Scaling of EAX and LKH10-1100101102103104105 500  1000  2000  4000CPU time [sec]nSupport dataExp. modelRootExp. modelPoly. modelExp. model bootstrap intervalsRootExp. model bootstrap intervalsPoly. model bootstrap intervalsChallenge data (with confidence intervals)Figure 6.7: Fitted models for the medians of the running times of EAX to findoptimal solutions of RUE instances. The parameters of EAX are set to the defaultvalues, and the population size is a simple function of instance size. All modelsare fitted with the medians of the running times of EAX solving the SAT instancesfrom the set of RUE instances of 500≤ n≤ 1500 variables, and are challenged bythe medians of the running times of 2000≤ n≤ 4500 variables.Solver ModelRMSE RMSE(support) (challenge)EAXExp. 0.43879×1.0024n 0.48413 [10931,10940]RootExp. 0.02628×1.1816√n 0.24178 [667.42,676.81]Poly. 1.6194×10−8×n2.8364 0.027288 [37.049,44.847]Table 6.14: Fitted models for the medians of the running times of EAX with con-figured parameters and varying population size and the corresponding RMSE val-ues (in CPU sec). The models yielding more accurate predictions (as per RMSEson challenge data) are shown in boldface.Solver Model Confidence interval of a Confidence interval of bEAXExp. [0.42779,0.44748] [1.0024,1.0025]RootExp. [0.024996,0.027408] [1.1801,1.1833]Poly.[1.3335×10−8,1.9003×10−8] [2.8132,2.8636]Table 6.15: 95% bootstrap confidence intervals for the parameters of the scalingmodels for EAX’s running times (with configured parameters and varying popula-tion size) to find optimal solutions of RUE instances.766.4. Impact of Automated Configuration on Scaling of EAX and LKH10-1100101102103104 500  1000  2000  4000CPU time [sec]nSupport data for optimised EAXPoly. model for optimised EAXBootstrap intervals for optimised EAXChallenge data for optimised EAXSupport data for default EAXRootExp. model for default EAXBootstrap intervals for default EAXChallenge data for default EAXFigure 6.8: Best fitted models for the medians of the running times for EAX withconfigured and default parameters to find optimal solutions of RUE instances. Allmodels are fitted with the medians of the running times of EAX solving the SATinstances from the set of RUE instances of 500 ≤ n ≤ 1500 variables, and arechallenged by the medians of the running times of 2000≤ n≤ 4500 variables.6.4.3 Impact of Automated Configuration on Scaling of LKHWe also attempted to configure LKH for scaling performance. With over 40 para-meters, LKH is arguably much more configurable than EAX. Out of these para-meters, some require additional information like initial tours or sub-division of agiven TSP instance, which are not available in our case. Thus, we analysed the de-tails of the parameters, and selected 21 parameters that we could configure withoutadditional information or code modification. These parameters are listed in Table6.16 in detail. Out of them, 12 are numerical paramters and 9 are categorical. Wekept all parameter at default values specified in the user guide, except for PATCH-ING_A and PATCHING_C. As mentioned earlier, we used 2 and 3 respectively forthese two parameters in our earlier experiments, and continued using these valuesas default in our configuration experiments. The ranges of all parameters were de-termined based on the user guide. When in doubt, we tried to use large ranges andleave larger configuration space to SMAC.We first followed the standard protocol, performing 25 SMAC runs to configureLKH using a set of 100 instances of size n = 1500 randomly selected from thelarger set. We enforced a cur-off of 2 CPU days for each SMAC run. To ensurethat SMAC could perform at least 1000 runs of LKH, we further enforced a cut-off776.4. Impact of Automated Configuration on Scaling of EAX and LKH10-1100101102103104105 500  1000  2000  4000CPU time [sec]nSupport dataExp. modelRootExp. modelPoly. modelExp. model bootstrap intervalsRootExp. model bootstrap intervalsPoly. model bootstrap intervalsChallenge data (with confidence intervals)Figure 6.9: Fitted models for the medians of the running times of EAX to find op-timal solutions of RUE instances. The parameters of EAX are set to the optimisedvalues, and the population size is a simple function of instance size. All modelsare fitted with the medians of the running times of EAX solving the SAT instancesfrom the set of RUE instances of 500≤ n≤ 1500 variables, and are challenged bythe medians of the running times of 2000≤ n≤ 4500 variables.786.4. Impact of Automated Configuration on Scaling of EAX and LKHParameter name N/C DomainsASCENT_CANDIDATES N [10,500]BACKBONE_TRIALS N {0,1,2,3,4,5}BACKTRACKING C {YES, NO}CANDIDATE_SET_TYPE C {ALPHA, DELAUNAY, NEAREST-NEIGHBOR,QUADRANT}EXTRA_CANDIDATES N [0,20]EXTRA_CANDIDATE_SET_TYPE C {NEAREST-NEIGHBOR, QUADRANT}GAIN23 C {YES, NO}GAIN_CRITERION C {YES, NO}INITIAL_STEP_SIZE N {1,2,3,4,5}INITIAL_TOUR_ALGORITHM C {BORUVKA, GREEDY, MOORE,NEAREST-NEIGHBOR, QUICK-BORUVKA,SIERPINSKI, WALK}KICK_TYPE N {0}∪ [4,20]KICKS N {0,1,2,3,4,5}MAX_CANDIDATES N [3,20]MOVE_TYPE N [2,20]PATCHING_A N {1,2,3,4,5}PATCHING_C N {1,2,3,4,5}POPULATION_SIZE N [0,1000]RESTRICTED_SEARCH C {YES, NO}SUBGRADIENT C {YES, NO}SUBSEQUENT_MOVE_TYPE N {0}∪ [2,20]SUBSEQUENT_PATCHING C {YES, NO}Table 6.16: List of numerical (N) and categorical (C) parameters for LKH that areconfigurable.796.4. Impact of Automated Configuration on Scaling of EAX and LKH10-1100101102103104 500  1000  2000  4000CPU time [sec]nSupport data for optimised EAXPoly. model for optimised EAXBootstrap intervals for optimised EAXChallenge data for optimised EAXSupport data for default EAXRootExp. model for default EAXBootstrap intervals for default EAXChallenge data for default EAXFigure 6.10: Best fitted models for the medians of the running times for EAXwith configured and default parameters to find optimal solutions of RUE instances.All models are fitted with the medians of the running times of EAX solving theSAT instances from the set of RUE instances of 500≤ n≤ 1500 variables, and arechallenged by the medians of the running times of 2000≤ n≤ 4500 variables.time of 172 sec for each LKH run. Out of the 25 parameter settings, we selected thebest parameter setting based on training performance on the same set. Analysingthe scaling of LKH with this parameter setting, we obtained the three models shownin Table 6.17. Based on RMSE (challenge), the root-exponential model providesthe best fit, but has a larger b than that for LKH with default parameters (1.2452 vs1.1879). We also performed the bootstrap analysis described in Section 3.1. Fromthe bootstrap confidence intervals of the model parameters shown in Table 6.18, theconfidence interval of b in the root-exponential model is also larger than that forLKH with default parameters ([1.2213,1.2720] vs. [1.1749,1.2006] ). Moreover,the root-exponential model actually under-estimates the observed running times onchallenge data, as clearly seen in Figure 6.11. Examining the running times indetail, we saw that configuration does bring down the running times of LKH forn ≤ 2000, but causes it to perform worse for larger instances. In other words, theconfiguration procedure seems to overfit on smaller instance sizes.We then followed a protocol proposed by Styles et al. [66], performing 25SMAC runs to configure LKH using a set of 100 instances with instance size n =1000, and selected the best parameter setting based on validation performance ona set of 50 instances with instance size n= 1500 randomly selected from the largerset (referred to as scaling protocol hereafter). The underlying idea is to select806.4. Impact of Automated Configuration on Scaling of EAX and LKHSolver ModelRMSE RMSE(support) (challenge)LKHExp. 0.10374×1.0031n 0.24213 [32342,44308]RootExp. 0.0022353×1.2452√n 0.17628 [5682.2,17808]Poly. 6.8001×10−12×n3.8412 0.19239 [7758.7,19825]Table 6.17: Fitted models for the medians of the running times of LKH with con-figured parameters from experiments following the standard protocol and the cor-responding RMSE values (in CPU sec). The models yielding more accurate pre-dictions (as per RMSEs on challenge data) are shown in boldface.Solver Model Confidence interval of a Confidence interval of bLKHExp. [0.071619,0.14743] [1.0028,1.0034]RootExp. [0.0010571,0.004436] [1.2213,1.272−]Poly.[3.4774×10−13,9.5553×10−11] [3.4693,4.2573]Table 6.18: 95% bootstrap confidence intervals for the parameters of the scalingmodels for EAX’s running times (with configured parameters) to find optimal solu-tions of RUE instances.10-1100101102103104105106 500  1000  2000  4000CPU time [sec]nSupport dataExp. modelRootExp. modelPoly. modelExp. model bootstrap intervalsRootExp. model bootstrap intervalsPoly. model bootstrap intervalsChallenge data (with confidence intervals)Figure 6.11: Fitted models for the medians of the running times for LKH with con-figured parameters from experiments following the standard protocol. All modelsare fitted with the medians of the running times of LKH solving the set of RUEinstances of 500 ≤ n ≤ 1500 variables, and are challenged by the medians of therunning times of 2000≤ n≤ 4500 variables.816.4. Impact of Automated Configuration on Scaling of EAX and LKHSolver ModelRMSE RMSE(support) (challenge)LKHExp. 0.17326×1.0027n 0.25216 [12219,12917]RootExp. 0.0066414×1.2077√n 0.13341 [533.13,1165.9]Poly. 4.7378×10−10×n3.2473 0.090054 [1281.8,1941.8]Table 6.19: Fitted models for the medians of the running times of LKH with con-figured parameters from experiments following the scaling protocol and the cor-responding RMSE values (in CPU sec). The models yielding more accurate pre-dictions (as per RMSEs on challenge data) are shown in boldface.Solver Model Confidence interval of a Confidence interval of bLKHExp. [0.13477,0.22338] [1.0025,1.0029]RootExp. [0.0040961,0.010936] [1.1908,1.2245]Poly.[7.2956×10−11,2.9719×10−9] [2.9843,3.5068]Table 6.20: 95% bootstrap confidence intervals for the parameters of the scalingmodels for EAX’s running times (with configured parameters) to find optimal solu-tions of RUE instancesparameter settings that tend to perform well for larger instances. We again analysedthe scaling of LKH with the parameter setting such obtained, and obtained thethree models shown in Table 6.19. Based on the RMSE (challenge), the root-exponential model is still the best fit, with a smaller b (1.2077) than the previousconfiguration (1.2452) but still larger than the default (1.1879). We also performedthe bootstrap analysis described in Section 3.1. From the bootstrap confidenceintervals of the model parameters shown in Table 6.20, the confidence interval of bin the root-exponential model is also slightly larger than that for LKH with defaultparameters ([1.1908,1.2245] vs. [1.1749,1.2006]). As seen in Figure 6.12, theroot-exponential model is a very good fit up to n= 3500. Investigating the runningtimes more closely, we see that this configuration brings down the running timesof LKH compared to the default up to instance size n = 3500. In other words, thenew configuration protocol seems to less overfit on the smaller instances, but stillruns into problems for large instances.Comparing the two resulted parameter settings, we believed that the configuredparameter settings help LKH do a more careful check of possible moves on smal-ler instances, but that does not pay off for larger instances. Seeing this, we fixedKICKS, MOVE_TYPE, POPULATION_SIZE and RESTRICTED_SEARCH totheir default values and performed another round of configuration following thescaling protocol. The result parameter setting was also analysed using the meth-odology described in Section 3.1, resulting in the three models as shown Table6.21. We also performed the bootstrap analysis described in Section 3.1. From826.4. Impact of Automated Configuration on Scaling of EAX and LKH10-1100101102103104105 500  1000  2000  4000CPU time [sec]nSupport dataExp. modelRootExp. modelPoly. modelExp. model bootstrap intervalsRootExp. model bootstrap intervalsPoly. model bootstrap intervalsChallenge data (with confidence intervals)Figure 6.12: Fitted models for the medians of the running times of LKH with con-figured parameters from experiments following the scaling protocol. All modelsare fitted with the medians of the running times of LKH solving the set of RUEinstances of 500 ≤ n ≤ 1500 variables, and are challenged by the medians of therunning times of 2000≤ n≤ 4500 variables.Solver ModelRMSE RMSE(support) (challenge)LKHExp. 0.1084×1.0027n 0.14318 [4910.6,5596.5]RootExp. 0.0048097×1.2010√n 0.15081 [1394.8,2090.6]Poly. 6.0534×10−10×n3.1401 0.19236 [1764.1,2460.9]Table 6.21: Fitted models for the medians of the running times of LKH with con-figured parameters from experiments following the scaling protocol but with lessparameters and the corresponding RMSE values (in CPU sec). The models yield-ing more accurate predictions (as per RMSEs on challenge data) are shown inboldface.836.4. Impact of Automated Configuration on Scaling of EAX and LKHSolver Model Confidence interval of a Confidence interval of bLKHExp. [0.079825,0.14304] [1.0024,1.0029]RootExp. [0.0027107,0.008814] [1.1802,1.2210]Poly.[6.2838×10−11,6.1249×10−9] [2.8125,3.4561]Table 6.22: 95% bootstrap confidence intervals for the parameters of the scalingmodels for EAX’s running times (with configured parameters) to find optimal solu-tions of RUE instances.10-1100101102103104105 500  1000  2000  4000CPU time [sec]nSupport dataExp. modelRootExp. modelPoly. modelExp. model bootstrap intervalsRootExp. model bootstrap intervalsPoly. model bootstrap intervalsChallenge data (with confidence intervals)Figure 6.13: Fitted models for the medians of the running times of LKH with con-figured parameters from experiments following the scaling protocol but with lessparameters. All models are fitted with the medians of the running times of LKHsolving the set of RUE instances of 500 ≤ n ≤ 1500 variables, and are challengedby the medians of the running times of 2000≤ n≤ 4500 variables.the bootstrap confidence intervals of the model parameters shown in Table 6.22,the confidence interval of b in the root-exponential model is also slightly largerthan that for LKH with default parameters ([1.1802,1.2210] vs. [1.1749,1.2006]). Again, the root-exponential model gives the best fit according to RMSE (chal-lenge), and fits the running times, as seen in Figure 6.13, well up to n = 4000. Theb in the model (1.2010) is very similar to before (1.2077), with similar confidenceintervals ([1.1802,1.2210] vs. [1.1908,1.2245]), but remains larger than the de-fault. Examining the running times, we noticed that the new configuration furtherdecreases running times for n ≤ 4000, but performs even worse for n = 4500. Inother words, the new configuration protocol seems to even less overfit on the smal-ler instances, but still runs into problems for the largest instance size in our analysis846.5. Chapter Summaryand presumably even larger sizes.To sum up, we have attempted to configure LKH in several different settingsand observed clear impact of configuration on the scaling of LKH. However, allof our attempts overfit the running times for smaller instances and lead to worseperformance for instances larger than some threshold that varied with the configur-ation protocol used. This calls for more experiments on LKH to study the impactof its key parameters on the scaling of the solver and to examine if we can improveits scaling by adapting certain parameters with instance size. Another direction isthe development of more sophisticated protocols for algorithm configuration thatincorporate automated scaling analysis. Such protocols can be useful in config-uring algorithms with a large number of parameters for scaling performance. Weelaborate more on this idea in Section 7.2, where we discuss potential future work.6.5 Chapter SummaryIn this chapter, we presented empirical scaling results for several state-of-the-artTSP solvers for finding optimal solutions of RUE instances.For Concorde, a root-exponential model of the form a · b√n, with b beingaround 1.25, was found to best describe the scaling behaviour. Compared to thescaling models for overall running times (for finding optimal solutions and prov-ing optimality), we found that the finding times of Concorde scale slightly worsethan the overall running times, which can be seen in that b is slightly larger forthe model of finding time (1.25032 vs 1.2281); the effect is also seen in the boot-strap intervals for b ([1.2287,1.2793] vs [1.2126,1.2522]). While this is consistentwith the conclusion that the fraction of finding time goes up and approaches 1 asn increases Hoos and Stützle [36], we note that better capturing this property mayrequire more sophisticated models with second-order terms.EAX also scales root-exponentially, best described by a model with b beingaround 1.123. For LKH, the scaling seems also to be root-exponential, but we can-not rule out a polynomial model. For both solvers, there is evidence that they scalebetter than Concorde for finding the optimal solutions of RUE instances. In otherwords, if we treat Concorde as an incomplete solver, then it scales significantlyworse than EAX and LKH. We are hopeful that by exploiting solutions found byhigh-performance incomplete solvers, better exact algorithms can be constructedfor the TSP problem.We also investigated the impact of parameter settings and automated configura-tion on the scaling of algorithms. For EAX, algorithm configuration helps improvethe scaling, which can be further improved by adapting the population size withinstance size. For LKH, we have observed significant impact of parameter settings856.5. Chapter Summaryon its scaling, but the state-of-the-art algorithm configurator SMAC tends to overfitthe running times for smaller instances and thus produces configurations for whichLKH scales worse. This calls for thorough studies on the parameters of LKH, andfor the development of better configuration protocols for scaling performance.Overall, our work complements earlier work on the scaling of state-of-the-artTSP solvers [35, 36, 21] and deepens our understandings of them. It also indicatespotential for improvements in scaling behaviour through automated algorithm con-figuration and through setting certain parameters in dependence of features of theproblem instance to be solved.86Chapter 7Conclusions and Future Work7.1 ConclusionsIn this work, we further developed a framework to analyse the empirical scaling ofthe performance of algorithms in a statistically sound way. The original method-ology was introduced by Hoos [33], who proposed to challenge obtained modelsby extrapolation and to use bootstrap re-sampling to assess these models. Here,we further expanded the use of bootstrap intervals to observed running times ofchallenge data, and proposed a systematic way to compare scaling models for dif-ferent algorithms. The former extension allows us to better capture variability ofobserved running time data, while the latter helps compare scaling behaviour ofalgorithms to draw useful conclusions. To facilitate the broad use of the method-ology, we designed ESA, an automated empirical scaling analysis tool, which canbe easily used as a web service or a command-line tool. ESA takes a file of runningtimes as input, performs scaling analysis as described in Chapter 3, and generatesa technical report on the scaling of the target algorithm.We applied and extended the methodology for empirical scaling analysis to twoprominent problems, SAT and TSP. For phase-transition random 3-SAT, we invest-igated the time complexity of three SLS-based (WalkSAT/SKC, BalancedZ andprobSAT) and three DPLL-based solvers (kcnfs, march_hi and march_br). Ourresults show that the running times of SLS-based solvers are surprisingly wellcaptured by polynomial models, while DPLL-based solvers clearly scale expo-nentially. We also compared the scaling models within each category of solvers,and found no significant difference between the SLS-based solvers we studied,but showed that the two march variants scale significantly better than kcnfs. Wealso demonstrated that DPLL-based solvers are faster by only a constant factorfor solving satisfiable instances compared to their performance on unsatisfiableinstances. In addition, we explored two distributions of 4-SAT instances. Forphase-transition instances, the performance of SLS-based solvers are consistentwith exponential (for BalancedZ) or root-exponential (for the other two solvers)models, while DPLL-based solvers show clear exponential scaling. For a class ofless-constrained instances, the performance of both WalkSAT/SKC and kcnfs iswell characterised by polynomial models.877.2. Future WorkFor TSP, we investigated the scaling of running times of state-of-the-art com-plete solver, Concorde, as well as incomplete solvers, EAX and LKH, for findingoptimal solutions without proving optimality. For Concorde, we found that thefinding times scale root-exponentially, and the finding times scale slightly worsethan overall running times, but more sophisticated models may be needed to bettercapture this. For EAX and LKH, we showed that their running times are consist-ent with or upper bounded by a root-exponential model, respectively, and that theyscale significantly better than Concorde. We also studied the impact of automatedconfiguration on the scaling of the two incomplete solvers and obtained positiveresults for EAX. For LKH, we encountered overfitting, which calls for more stud-ies on its parameters and for the development of new configuration protocols forscaling performance.In summary, we showed that the empirical scaling analysis methodology can besuccessfully applied to high-performance solvers for prominent decision or optim-isation problems like SAT and TSP, and that such analysis can produce interestingand even surprising results. We note that the methodology is also applicable toother problems, and we have designed ESA to facilitate such applications. Wehope that empirical scaling analysis can play a more important role in the designand analysis of algorithms and instance generators, and can even inspire new linesof theoretical analysis.7.2 Future WorkWe see several areas for future work, which potentially will lead to enhanced meth-odology and improved tool for empirical scaling analysis, as well as to deeper un-derstanding of target algorithms and problems. For SAT, we see some interest infurther investigating the survey propagation algorithm, which is of interest to manyresearchers in the community. Also of interest, as pointed out by the anonymousreviewers for our paper [52], is the empirical scaling analysis for structured SATinstances. More generally, it would be very interesting to develop empirical scal-ing analysis methods for classes of instances for which no instance generator isavailable. Such instances can be of substantial industrial or practical interest, likestructured SAT and real-world non-Euclidean TSP instances. In such cases, wemay have only one or a few instances for each size, and much fewer instancesoverall. As a result, estimations of medians or other statistics will be of muchlower quality. Presumably, outlier detection will be an interesting and difficult taskfor scaling analysis in this context. On the methodology side, how to handle estim-ations of varying quality for different instance sizes is also a challenging problem.Currently, the methodology treats the estimation of all support sizes as of equal887.2. Future Workquality, and needs to be extended if the assumption does not hold. Intuitively, thefitted model should be less influenced by an estimation obtained from fewer in-stances and thus being of lower quality. Bayesian methods can potentially addressthis situation, but further thought and thorough evaluation is clearly needed whenextending the methodology to handle such situations.Earlier work on empirical scaling analysis has involved evaluating curve bound-ing techniques with artificial data [48]. McGeoch et al. [49] also reported poorresults when using non-linear regression for curve bounding of slow-growing func-tions. We suspect that this results from their focus on slow-growing functions andon curve bounding rather than fitting. Thus, it will be interesting to evaluate ourmethodology with this type of data. It will also be interesting to explore how toextend our methodology to empirically bounding asymptotic growth of algorithms.One future direction for the development of our methodology and of ESA isto use nested bootstrap re-sampling when dealing with randomised algorithms andmultiple runs per problem instance. Currently, the methodology only re-samplesthe running times for different instances. For randomised algorithms, multiple runsare performed for each instance and the running time for one instance is actuallya statistic of multiple running times for the same instance. Thus, re-sampling canalso be done on multiple running times for the same instance in order to capturethe variability of running time on the same instance. We call this nested bootstrapre-sampling, since it involves two nested steps of bootstrap re-sampling.Another future direction is to automatically select models from a large familyof functions based on input data. This can also facilitate fitting of models withlower-order terms. One possible approach is to repeatedly fit models, first on theoriginal data, then on the residues, in order to obtain a model with several terms. Inthis way, the tool will be easily applicable to an even broader range of algorithmswith little human input, and may potentially produce even more interesting results.Potentially, such models can make more accurate predictions on running times forlarge instances, and can help bring us to another level of understanding of the al-gorithms. One particular example is the investigation of the scaling of finding andoverall running times of Concorde, where models with lower-order terms are be-lieved to be needed in order to model the observed running times more accurately.In addition, we see great potential in developing automated configuration pro-cedures for better scaling behaviour. Such procedures can make algorithm config-uration even more applicable to real-world situations, as problem instances fromthe target distribution can take a long time to solve. This makes it infeasible to runalgorithm configuration directly on these instances, because an algorithm configur-ation usually requires many runs of the target algorithms with different parametersettings. Some previous work has shown that new protocols are needed to con-figure algorithms for better scaling performance [66, 65] . As we observed when897.2. Future Workconfiguring LKH, an algorithm configurator may overfit the support data. A prom-ising area for the development of configurators is to incorporate empirical scalinganalysis into algorithm configuration. To achieve that goal, our automated scalinganalysis tool, ESA, should provide a solid basis. The major challenge comes fromthe fact the scaling analysis involves solving a large number of problem instancesof different sizes and takes a long time to complete. Thus, it will be importantto design a way to reduce the time, possibly by leveraging previously fitted mod-els. One possible way is to incorporate Bayesian methods into empirical scalinganalysis, with a previous model acting as the prior for model fitting.Application of the methodology to other problems and instance distributionsis also a promising area for future research. As seen in this thesis and in previouswork, empirical scaling analysis has revealed interesting and surprising propertiesof high-performance algorithm for SAT and TSP, and we believe other problemscan also benefit from such analysis. Some example problems include AI plan-ning and scheduling, for which international competitions are regularly organisedto evaluate solvers, but the methodology is applicable other problems as well. Wenote that empirical analysis is also of value for polynomial-time solvable prob-lems, particularly in understanding the impact of features other than the size ofinstances on the empirical complexity of algorithms. Empirical investigation ofthe complexity of prominent algorithms for these domains can potentially enrichour understanding of these algorithms and inspire the design of new algorithms.Finally, we envision that our methodology can be adapted to analyse propertiesbeyond time complexity of algorithms, as long as the change in the property withsome aspect (e.g., size) of problem instances is expected to follow some parametricfunctions. For instance, one may use similar ways to model the scaling of qualityof algorithm outcomes. A more concrete example is modelling of learning curvesof machine learning algorithms.90Bibliography[1] Dimitris Achlioptas and Yuval Peres. The threshold for random k-SAT is2k log2−O(k). Journal of the American Mathematical Society, 17(4):947–973, 2004.[2] Rosalía Aguirre-Hernández, Holger H Hoos, and Anne Condon. Computa-tional RNA secondary structure design: Empirical complexity and improvedmethods. BMC Bioinformatics, 8(1):34, 2007.[3] David L Applegate, Robert E Bixby, Vasek Chvátal, and William J Cook. TheTraveling Salesman Problem: A Computational Study. Princeton UniversityPress, 2006.[4] David L Applegate, Robert E Bixby, Vasek Chvátal, and William J Cook.The traveling salesman problem, concorde TSP solver. 2012. URL http://www.tsp.gatech.edu/concorde. Last visited on 24 October 2015.[5] Sanjeev Arora. Polynomial time approximation schemes for euclidean trav-eling salesman and other geometric problems. Journal of the ACM, 45(5):753–782, 1998.[6] Adrian Balint and Uwe Schöning. Choosing probability distributions forstochastic local search and the role of make versus break. In Proceedingsof SAT 2012, pages 16–29. Springer, 2012.[7] Adrian Balint and Uwe Schöning. probSAT and pprobSAT. In Proceedingsof SAT Competition 2014, page 63, 2014.[8] Adrian Balint, Anton Belov, Matti Järvisalo, and Carsten Sinz. SAT challenge2012 random SAT track: Description of benchmark generation. In Proceed-ings of SAT Challenge 2012, page 72, 2012.[9] Jon L Bentley and M Douglas McIlroy. Engineering a sort function. Software– Practice and Experience, 23(11):1249–1265, 1993.91Bibliography[10] Paul Biggar, Nicholas Nash, Kevin Williams, and David Gregg. An experi-mental study of sorting and branch prediction. Journal of Experimental Al-gorithmics, 12:1–8, 2008.[11] Alfredo Braunstein, Marc Mézard, and Riccardo Zecchina. Survey propaga-tion: An algorithm for satisfiability. Random Structures & Algorithms, 27(2):201–226, 2005.[12] bzip.org. bzip2 project homepage. URL http://www.bzip.org/. Lastvisited on 24 October 2015.[13] Peter Cheeseman, Bob Kanefsky, and William M Taylor. Where the reallyhard problems are. In Proceedings of IJCAI 1991, pages 331–337, 1991.[14] Cristian Coarfa, Demetrios D Demopoulos, Alfonso San Miguel Aguirre,Devika Subramanian, and Moshe Y Vardi. Random 3-SAT: The plot thickens.Constraints, 8(3):243–261, 2003.[15] Amin Coja-Oghlan. The asymptotic k-SAT threshold. In Proceedings ofSTOC 2014, pages 804–813. ACM, 2014.[16] Stephen A Cook. The complexity of theorem-proving procedures. In Pro-ceedings of STOC 1971, pages 151–158. ACM, 1971.[17] James M Crawford and Larry D Auton. Experimental results on the crossoverpoint in random 3-SAT. Artificial Intelligence, 81(1):31–57, 1996.[18] Martin Davis and Hilary Putnam. A computing procedure for quantificationtheory. Journal of the ACM, 7(3):201–215, 1960.[19] Martin Davis, George Logemann, and Donald Loveland. A machine programfor theorem-proving. Communications of the ACM, 5(7):394–397, 1962.[20] Gilles Dequen and Olivier Dubois. Kcnfs: An efficient solver for random k-SAT formulae. In Proceedings of SAT 2004, pages 486–501. Springer, 2004.[21] Jérémie Dubois-Lacoste, Holger H Hoos, and Thomas Stützle. On the em-pirical scaling behaviour of state-of-the-art local search algorithms for theEuclidean TSP. In Proceedings of GECCO 2015, pages 377–384. ACM,2015.[22] Bradley Efron and Robert J Tibshirani. An Introduction to the Bootstrap,vol. 57 of Monographs on Statistics and Applied Probability. New York:Chapman and Hall, 1993.92Bibliography[23] John Franco and Marvin Paull. Probabilistic analysis of the Davis Putnamprocedure for solving the satisfiability problem. Discrete Applied Mathemat-ics, 5(1):77–87, 1983.[24] Michael R Garey and David S Johnson. Computers and intractability: Aguide to the theory of NP-completeness. San Francisco, CA: Freeman, 1979.[25] Ian P Gent and Toby Walsh. Towards an understanding of hill-climbing pro-cedures for SAT. In Proceedings of AAAI 1993, pages 28–33, 1993.[26] Ian P Gent, Ewan MacIntyre, Patrick Prosser, and Toby Walsh. The scalingof search cost. In Proceedings of AAAI/IAAI 1997, pages 315–320, 1997.[27] gnu.org. gcov documentation. URL http://gcc.gnu.org/onlinedocs/gcc/Gcov.html. Last visited on 24 October 2015.[28] Simon F Goldsmith, Alex S Aiken, and Daniel S Wilkerson. Measuring em-pirical computational complexity. In Proceedings of ESEC/FSE 2007, pages395–404. ACM, 2007.[29] Keld Helsgaun. An effective implementation of the Lin–Kernighan travelingsalesman heuristic. European Journal of Operational Research, 126(1):106–130, 2000.[30] Keld Helsgaun. General k-opt submoves for the Lin–Kernighan TSP heur-istic. Mathematical Programming Computation, 1(2-3):119–163, 2009.[31] Marijn JH Heule. march_br. In Proceedings of SAT Competition 2013,page 53, 2013.[32] Marjin JH Heule and Hans van Marren. march_hi: Solver description. InProceedings of SAT Competition 2009, pages 23–24, 2009.[33] Holger H Hoos. A bootstrap approach to analysing the scaling of empiricalrun-time data with problem size. Technical report, TR-2009-16, Departmentof Computer Science, University of British Columbia, 2009.[34] Holger H Hoos and Thomas Stützle. Local search algorithms for SAT: Anempirical evaluation. Journal of Automated Reasoning, 24(4):421–481, 2000.[35] Holger H Hoos and Thomas Stützle. On the empirical scaling of run-timefor finding optimal solutions to the travelling salesman problem. EuropeanJournal of Operational Research, 238(1):87–94, 2014.93Bibliography[36] Holger H Hoos and Thomas Stützle. On the empirical time complexity offinding optimal solutions vs proving optimality for Euclidean TSP instances.Optimization Letters, 9:1247–1254, 2015.[37] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings ofLION 2011, pages 507–523. Springer, 2011.[38] David Johnson, Lyle McGeoch, Cesar Rego, and Fred Glover. 8th DIMACSimplementation challenge. 2001. URL http://dimacs.rutgers.edu/Challenges/TSP/. Last visited on 24 October 2015.[39] Scott Kirkpatrick and Bart Selman. Critical behavior in the satisfiability ofrandom boolean expressions. Science, 264(5163):1297–1301, 1994.[40] Lefteris M Kirousis, Evangelos Kranakis, Danny Krizanc, Yannis C Stama-tiou, et al. Approximating the unsatisfiability threshold of random formulas.Random Structures and Algorithms, 12(3):253–269, 1998.[41] Lars Kotthoff, Pascal Kerschke, Holger Hoos, and Heike Trautmann. Im-proving the state of the art in inexact tsp solving using per-instance algorithmselection. In Proceedings of LION 2015. Springer, 2015.[42] Daniel Kunkle. Empirical complexities of longest common subsequence al-gorithms. Technical report, Computer Science Department, Rochester Insti-tute of Technology, NY, USA, 2002.[43] Kenneth Levenberg. A method for the solution of certain problems in leastsquares. Quarterly of Applied Mathematics, 2:164–168, 1944.[44] Chumin Li, Chong Huang, and Ruchu Xu. Balance between intensificationand diversification: a unity of opposites. In Proceedings of SAT Competition2014, pages 10–11, 2014.[45] Xiaoming Li, María Jesús Garzarán, and David Padua. A dynamically tunedsorting library. In Proceedings of CGO 2004, pages 111–122. IEEE, 2004.[46] Yuri Malitsky, Ashish Sabharwal, Horst Samulowitz, and Meinolf Sellmann.Csch-based portfolio using ccsat and march. In Proceedings of SAT Compet-ition 2013, page 28, 2013.[47] Donald W Marquardt. An algorithm for least-squares estimation of nonlinearparameters. Journal of the Society for Industrial & Applied Mathematics, 11(2):431–441, 1963.94Bibliography[48] Catherine McGeoch, Doina Precup, and Paul R Cohen. How to find big-ohin your data set (and how not to). In Proceedings of IDA 1997, pages 41–52.Springer, 1997.[49] Catherine McGeoch, Peter Sanders, Rudolf Fleischer, Paul R Cohen, andDoina Precup. Using finite experiments to study asymptotic performance.In Experimental Algorithmics, pages 93–126. Springer, 2002.[50] Stephan Mertens, Marc Mézard, and Riccardo Zecchina. Threshold values ofrandom k-SAT from the cavity method. Random Structures and Algorithms,28(3):340–373, 2006.[51] David Mitchell, Bart Selman, and Hector Levesque. Hard and easy distribu-tions of SAT problems. In Proceedings of AAAI 1992, pages 459–465, 1992.[52] Zongxu Mu and Holger H Hoos. On the empirical time complexity of random3-SAT at the phase transition. In Proceedings of IJCAI 2015, pages 367–373,2015.[53] Zongxu Mu and Holger H Hoos. Empirical scaling analyser: An automatedsystem for empirical analysis of performance scaling. In Companion Public-ation of GECCO 2015, pages 771–772, 2015.[54] Zongxu Mu, Jérémie Dubois-Lacoste, Holger H Hoos, and Thomas Stützle.On the empirical scaling for finding optimal solutions of the TSP problem. Inpreparation, 2015.[55] Zongxu Mu, Holger H Hoos, and Thomas Stützle. The impact of automatedalgorithm configuration on the scaling behaviour of state-of-the-art incom-plete tsp solvers. In preparation, 2015.[56] Yuichi Nagata. Edge assembly crossover: A high-power genetic algorithmfor the traveling salesman problem. In Proceedings of ICGA 1997, pages450–457, 1997.[57] Yuichi Nagata and Shigenobu Kobayashi. A powerful genetic algorithm us-ing edge assembly crossover for the traveling salesman problem. INFORMSJournal on Computing, 25(2):346–363, 2013.[58] Christos H Papadimitriou. The euclidean travelling salesman problem is np-complete. Theoretical Computer Science, 4(3):237–244, 1977.[59] Andrew J Parkes and Joachim P Walser. Tuning local search for satisfiabilitytesting. In Proceedings of AAAI/IAAI 1996, pages 356–362, 1996.95Bibliography[60] Olivier Roussel. Controlling a solver execution with the runsolver tool systemdescription. Journal on Satisfiability, Boolean Modeling and Computation, 7:139–144, 2011.[61] Peter Sanders and Rudolf Fleischer. Asymptotic complexity from experi-ments? A case study for randomized algorithms. In Algorithm Engineering,pages 135–146. Springer, 2001.[62] SatCompetition.org. The international SAT competitions web page, 2014.URL http://www.satcompetition.org/. Last visited on 24 October2015.[63] Bart Selman, Hector J Levesque, David G Mitchell, et al. A new methodfor solving hard satisfiability problems. In Proceedings of AAAI 1992, pages440–446, 1992.[64] Bart Selman, Henry A Kautz, and Bram Cohen. Noise strategies for improv-ing local search. In Proceedings of AAAI 1994, pages 337–343, 1994.[65] James Styles and Holger Hoos. Ordered racing protocols for automaticallyconfiguring algorithms for scaling performance. In Proceedings of GECCO2013, pages 551–558. ACM, 2013.[66] James Styles, Holger H Hoos, and Martin Müller. Automatically configuringalgorithms for scaling performance. In Proceedings of LION 2012, pages205–219. Springer, 2012.[67] K Subramani and Dejan Desovski. On the empirical efficiency of the ver-tex contraction algorithm for detecting negative cost cyles in networks. InProceedings of ICCS 2005, pages 180–187. Springer, 2005.[68] Makoto Yokoo. Why adding more constraints makes a problem easier forhill-climbing algorithms: Analyzing landscapes of CSPs. In Proceedings ofCP 1997, pages 356–370. Springer, 1997.96Appendix ALATEX Template for ESA%% template -AutoScaling.tex%% Author: Zongxu Mu%% This is the LaTeX template file for Empirical ScalingAnalyser (ESA).%% ESA takes the template , replace variables with theircorresponding values ,%% and generates an output file named AutoScaling.tex.%% You may compile this file alone without running ESA tosee how the output looks like.%% Variables are surrounded by ``@@ ''s%% Supported variable names include:%% - algName , e.g., ``WalkSAT/SKC ''%% - instName , e.g., ``random 3SAT at phasetransition ''%% - models , e.g., ``\\begin{itemize }\n\\item $Exp\\left[a,b\\ right ]\\ left(n\\right)=a\\cdot b^{n}$ \\quad {}(2- parameter exponential);\n\\end{itemize}''%% - numBootstrapSamples , the number of bootstrapsamples used in the analysis , e.g., 1000%% - numSizes , the number of sizes used in theanalysis , e.g., 12%% - largestSupportSize , e.g., 500%% - table -Details -dataset , e.g., ``\\input{table_Details -dataset}''%% - table -Details -dataset , e.g., ``\\input{table_Details -dataset}''%% - table -Fitted -models , e.g., ``\\input{table_Fitted -models}''%% - figure -fittedModels , e.g., ``\\ includegraphics[width =0.8\ textwidth ]{ fittedModels_loglog}''%% - supportSizes , the sizes used for fitting themodels , e.g., ``200, 250, 300, 350, 400, 450, 500''%% - challengeSizes , e.g., ``600, 700, 800, 900,1000''%% - table -Bootstrap -intervals -of-parameters , e.g.,97Appendix A. LATEX Template for ESA``\\input{table_Bootstrap -intervals -of -parameters}''%% - table -Bootstrap -intervals , e.g., ``\\input{table_Bootstrap -intervals}''%% - analysisSummary , e.g., ``observed medianrunning times are consistent with the polynomialscaling model ''\documentclass[british ]{ article}\usepackage[T1]{ fontenc}\usepackage[latin9 ]{ inputenc}\usepackage{geometry}\geometry{verbose ,tmargin =3.5cm ,bmargin =3.5cm,lmargin =3cm,rmargin =3cm}\usepackage{array}\usepackage{multirow}\usepackage{amstext}\usepackage{graphicx}\usepackage{color}\newcommand {\ medianInterval }[1]{}@@customCommands@@\makeatletter%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeXcommands.%% Because html converters don 't know tabularnewline\providecommand {\ tabularnewline }{\\}%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeXcommands.\title{On the empirical scaling of running time of@@algName@@ for solving @@instName@@}\author{Empirical Scaling Analyser}\makeatother\usepackage{babel}\begin{document}\maketitle %\section{Introduction}This is the automatically generated report on the98Appendix A. LATEX Template for ESAempirical scalingof the running time of @@algName@@ for solving@@instName@@.\section{Methodology}\label{sec:Methodology}% models , model fittingFor our scaling analysis , we considered the followingparametric models:@@models@@% \begin{itemize}% \item $Exp\left[a,b\right ]\left(n\right)=a\cdot b^{n}$\quad {}(2- parameter exponential);% \item $RootExp\left[a,b\right]\left(n\right)=a\cdot b^{\ sqrt{n}}$ \quad {}(2- parameter root -exponential);% \item $Poly\left[a,b\right]\left(n\right)=a\cdot n^{b}$\quad {}(2- parameter polynomial).% \end{itemize}Note that the approach could be easily extended to otherscaling models.For fitting parametric scaling models to observed data ,we used thenon -linear least -squares Levenberg -Marquardt algorithm.% what we fitted the models to , how we assessed model fitModels were fitted to performance observations in theform of @@statistic@@sof the distributions of running times over sets ofinstances for given$n$ , the instance size.%Compared to the mean , the median has two%advantages: it is statistically more stable and immuneto the presence%of a certain amount of timed -out runs.To assess the fit of a givenscaling model to observed data , we used root -mean -squareerror (RMSE).\medianInterval{Due to the instances for which the running times areunknown ,there is uncertainty about the99Appendix A. LATEX Template for ESAprecise location of the @@statistic@@s of the runningtime distributionsat each such $n$ , and we can only provide bounds onthose @@statistic@@s instead. Closely following \cite{dubois2015on}, we calculatethese bounds based on the bestand worst case scenarios , in which all instances withunknown running timesare easiest or hardest , respectively.We note that these are not confidenceintervals , since we can guarantee the actual@@statistic@@ runningtimes to lie within them.We also calculate RMSEs and confidence intervals based onthese bounds.}% bootstrap confidence intervalsClosely following \cite{hoos2009bootstrap ,hoos2014empirical}, wecomputed 95\% bootstrap confidence intervals for theperformance predictionsobtained from our scaling models , based on@@numBootstrapSamples@@ bootstrap samplesper instance set and @@numBootstrapSamples@@automatically fitted variants of each scalingmodel.In the following , we say that a scaling model is in-consistent with observed data if the bootstrap confidenceinterval for the observed datais disjoint from the bootstrap confidence interval forpredicted running times;\medianInterval{we say that a scaling model is consistentwith observed data , if the interval for observed mediansoverlaps with , but is not fully contained within thebootstrap confidence interval for predicted running times;}we say that a scaling model is strongly consistent withobserveddata , if the observed median is fully containedwithin the bootstrap confidence interval for predictedrunning times.Also , we define residue of a model at a given size as theobservedpoint estimate less the predicated value.100Appendix A. LATEX Template for ESA\section{Dataset Description}The dataset contains running times of the @@algName@@algorithm solving@@numSizes@@ sets of instances of different sizes. Wesplit the running times intotwo categories , support ($n\leq@@largestSupportSize@@$)and challenge ($n >@@largestSupportSize@@$). Thedetails of the dataset can be found in Tables \ref{tab:Details -dataset -support}and \ref{tab:Details -dataset -challenge }.\begin{table *}\noindent \begin{centering}@@table -Details -dataset -support@@\par\end{centering}\caption {\label{tab:Details -dataset -support} Details ofthe running time dataset used as support data formodel fitting .}\end{table *}\begin{table *}\noindent \begin{centering}@@table -Details -dataset -challenge@@\par\end{centering}\caption {\label{tab:Details -dataset -challenge} Details ofthe running time dataset used as challenge data formodel fitting .}\end{table *}%% Figure \ref{fig:CDFs} shows the distributions of therunning times of% @@algName@@ solving @@instName@@.% \begin{figure *}[tb]% \begin{centering}% @@figure -cdfs@@% % \includegraphics[width =0.8\ textwidth ]{cdfs}% \par\end{centering}%% \noindent \centering {}\ caption {\ label{fig:CDFs}Distribution of running times across instance setsfor101Appendix A. LATEX Template for ESA% @@algName@@ .}% \end{figure *}%%\section{Empirical Scaling of Solver Performance}\label{sec:Results}We first fitted our parametric scaling models to the@@statistic@@s of the running timesof @@algName@@ , as described in Section \ref{sec:Methodology }. Themodels were fitted using the @@statistic@@s of therunning times for $@@supportSizes@@$(support) and later challenged with the @@statistic@@s ofthe running times for $@@challengeSizes@@$.This resulted in the models , shown along with RMSEs onsupport andchallenge data , shown in Table ~\ref{tab:Fitted -models }.\begin{table }[tb]\begin{centering}@@table -Fitted -models@@% \input{table_Fitted -models}\par\end{centering}\caption {\label{tab:Fitted -models}Fitted models for the@@statistic@@s of the running times and RMSEvalues (in CPU sec). The models yielding moreaccurate predictions (as per RMSEs on challenge data) areshown inboldface .}\end{table}In addition , we illustrate the fitted models of@@algName@@ in Figure ~\ref{fig:Fitted -models},and the residues for the models in Figure ~\ref{fig:Fitted-residues }.\begin{figure }[tb]\noindent \begin{centering}@@figure -fittedModels@@% \includegraphics[width =0.8\ textwidth ]{ fittedModels}\par\end{centering}\caption {\label{fig:Fitted -models} Fitted models for the@@statistic@@s of the running times.102Appendix A. LATEX Template for ESAThe models are fitted with the @@statistic@@s of therunning times of@@algName@@ solving the set of @@instName@@of $@@supportSizes@@$ variables , and are challenged bythe @@statistic@@s of therunning times of $@@challengeSizes@@$ variables .}\end{figure}\begin{figure }[tb]\noindent \begin{centering}@@figure -fittedResidues@@% \includegraphics[width =0.8\ textwidth ]{ fittedResidues}\par\end{centering}\caption {\label{fig:Fitted -residues} Residues of thefitted models for the @@statistic@@sof the running times. }\end{figure}But how much confidence should we have in these models?Are the RMSEssmall enough that we should accept them? To answer thisquestion ,we assessed the fitted models using the bootstrapapproach outlinedin Section ~\ref{sec:Methodology }. Table ~\ref{tab:Bootstrap -intervals -of-parameters}shows the bootstrap intervals of the model parameters ,and Table~\ref{tab:Bootstrap -intervals -support}contains the bootstrap intervals for the support data.Challenging the models with extrapolation , as shown inTable ~\ref{tab:Bootstrap -intervals -challenge}, it isconcluded that@@analysisSummary@@(as also illustrated in Figure ~\ref{fig:Fitted -models }).\begin{table *}[tb]\noindent \begin{centering}@@table -Bootstrap -intervals -of -parameters@@% \input{table_Bootstrap -intervals -of -parameters}\par\end{centering}\caption {\label{tab:Bootstrap -intervals -of-parameters}95\% bootstrap intervals103Appendix A. LATEX Template for ESAof model parameters for the @@statistic@@s of the runningtimes}\end{table *}\begin{table *}[tb]\noindent \begin{centering}@@table -Bootstrap -intervals -support@@% \input{table_Bootstrap -intervals}\par\end{centering}\caption {\label{tab:Bootstrap -intervals -support} 95\%bootstrap confidence intervalsfor the @@statistic@@s of the running time predictionsand observed running times on @@instName@@.The instance sizes shown here are those used for fittingthe models.Bootstrap intervals on predictions that are weaklyconsistentwith the observed point estimates are shown in boldface ,those that are consistent are marked by plus signs ({+}) ,and those that fully contain the confidence intervals onobservations are marked by asterisks ({*}).}\end{table *}\begin{table *}[tb]\noindent \begin{centering}@@table -Bootstrap -intervals -challenge@@% \input{table_Bootstrap -intervals}\par\end{centering}\caption {\label{tab:Bootstrap -intervals -challenge} 95\%bootstrap confidence intervalsfor the @@statistic@@s of the running time predictionsand observed running times on @@instName@@.The instance sizes shown here are larger than those usedfor fitting the models.Bootstrap intervals on predictions that are weaklyconsistentwith the observed data are shown in boldface ,\medianInterval{those that are consistent are marked byplus signs ({+}) ,}those that are strongly consistent are markedby sharps ({\#}) ,and those that fully contain the confidence intervals onobservations are marked by asterisks ({*}).}\end{table *}104Appendix A. LATEX Template for ESA\section{Conclusion}In this report , we presented an empirical analysis of thescalingbehaviour of @@algName@@ on @@instName@@. We found@@analysisSummary@@.\bibliographystyle{plain}\begin{thebibliography }{1}\bibitem{dubois2015on}J{\'e}r{\'e}mie Dubois -Lacoste , Holger~H. Hoos , andThomas St{\"u}tzle.\newblock On the empirical scaling behaviour of state -of -the -art local searchalgorithms for the euclidean tsp.\newblock In {\em GECCO }. ACM , 2015.\bibitem{hoos2009bootstrap}Holger~H Hoos.\newblock A bootstrap approach to analysing the scalingof empirical run -timedata with problem size.\newblock Technical report , Technical Report TR -2009-16 ,University of BritishColumbia , 2009.\bibitem{hoos2014empirical}Holger~H Hoos and Thomas St{\"u}tzle.\newblock On the empirical scaling of run -time forfinding optimal solutions tothe travelling salesman problem.\newblock {\em European Journal of Operational Research},238(1) :87--94, 2014.\end{thebibliography}\end{document}105Appendix BGnuplot Templates for ESAB.1 Gnuplot Template for Plotting Models#!/ gnuplotset terminal pdfcairo dashed fontscale 0.5set termopt enhancedset logscaleset xlabel 'n'set ylabel 'CPU time [sec]'set format y "10^{%T}"set key top leftset xtics autoset grid xtics ytics mxtics mytics lc rgb '#999999 ' lw 1lt 0set style fill transparent solid 0.2 noborderB.2 Gnuplot Template for Plotting Residue Curves#!/ gnuplotset terminal pdfcairo dashed fontscale 0.5set termopt enhancedset print "plot -residues.log"set xlabel 'n'set ylabel 'Residue [sec]'set xtics autoset key left bottomset grid xtics ytics mxtics mytics lc rgb '#999999 ' lw 1lt 0set style fill transparent solid 0.2 noborder106

Cite

Citation Scheme:

        

Citations by CSL (citeproc-js)

Usage Statistics

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0215912/manifest

Comment

Related Items