UBC Theses and Dissertations
Methods to compare expensive stochastic optimization algorithms with application to road design. Xie, Shangwei, 2017.


Full Text

Methods to Compare Expensive Stochastic Optimization Algorithms with Application to Road Design

by

Shangwei Xie

B.A. Hons., The University of Nottingham, 2012
M.Q.F., Rutgers University, 2014

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE COLLEGE OF GRADUATE STUDIES (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Okanagan)

June 2017

© Shangwei Xie, 2017

The undersigned certify that they have read, and recommend to the College of Graduate Studies for acceptance, a thesis entitled: Methods to Compare Expensive Stochastic Optimization Algorithms with Application to Road Design, submitted by Shangwei Xie in partial fulfilment of the requirements of the degree of Master of Science.

Dr. Warren Hare, Irving K. Barber School of Arts and Sciences (Supervisor)
Dr. Jason Loeppky, Irving K. Barber School of Arts and Sciences (Co-supervisor)
Dr. Yves Lucet, Irving K. Barber School of Arts and Sciences (Supervisory Committee Member)
Dr. Ross Hickey, Irving K. Barber School of Arts and Sciences (University Examiner)

June 2, 2017 (Date Submitted to Grad Studies)

Abstract

Analyzing test data of stochastic optimization algorithms under random restarts is challenging. The data needs to be resampled to estimate the behavior of the incumbent solution during the optimization process. The estimation error needs to be understood in order to make reasonable inference on the actual behavior of the incumbent solution. Comparing the performance of different algorithms based on proper interpretation of the estimator is also very important. We model the incumbent solution of the optimization problem over time as a stochastic process and design an estimator of it based on bootstrapping from test data. Some asymptotic properties of the estimator and its bias are shown. The estimator is then validated by an out-of-sample test. Three methods for comparing the performance of different algorithms based on the estimator are proposed and demonstrated with data from a road design optimization problem.

Preface

The identification and design of the program was joint work of the author and his supervisors. The author was responsible for the design, theoretical analysis, numerical validation and illustration of the model developed in this thesis.

Contents

Abstract
Preface
Contents
List of Tables
List of Figures
Glossary of Notation
Acknowledgements
Chapter 1: Introduction
  1.1 Comparing Stochastic Optimization Algorithms
  1.2 Road Design Optimization
    1.2.1 Horizontal Alignment Optimization
    1.2.2 Vertical Alignment Optimization
    1.2.3 Implementation and Surrogate Cost Functions
  1.3 Challenges and Organization
Chapter 2: Bootstrapping the Incumbent Solution Process
  2.1 Preliminaries
  2.2 Problem Definition
  2.3 Bootstrap Estimation of the Incumbent Solution Process
  2.4 Theoretical Analysis
Chapter 3: Numerical Validation of the Bootstrap Method
  3.1 Test Setup
  3.2 Basic Test Results
  3.3 Reliable Time Interval and Quantiles
Chapter 4: Methods to Compare Algorithms
  4.1 Road Design Optimization Algorithms Compared
  4.2 Prediction Band with Median
  4.3 Integrated P-P Plot
  4.4 Speed Quantification
Chapter 5: Conclusion
  5.1 Contribution
  5.2 Future Work
Bibliography

List of Tables

Table 4.1: Speed quantification by 50th percentile
Table 4.2: Advanced speed quantification by 50th and 90th percentiles

List of Figures

Figure 1.1: 3D road alignment
Figure 1.2: Horizontal road alignment
Figure 1.3: Vertical road alignment
Figure 1.4: 3D alignment optimization framework
Figure 3.1: Six instances of the estimate versus the truth
Figure 3.2: Mean of estimates
Figure 3.3: Mean and standard deviation of estimates
Figure 3.4: Relative error for f1 on R^5
Figure 3.5: Average-case relative error
Figure 3.6: Worst-case relative error
Figure 3.7: Reliable time intervals
Figure 4.1: The 80% prediction band with median for two algorithms
Figure 4.2: The 80% prediction band with median for ten algorithms
Figure 4.3: Integrated P-P plot
Figure 5.1: Uniqueness of λ

Glossary of Notation

Ω: set of all possible restarts
X: feasible set
f(x): objective function
τ: time since the start of the optimization process
B: number of bootstrap paths
M: number of sample solutions in each bootstrap path
N: number of restarts in the test
N_s: number of samples drawn in the numerical validation
τ_max: upper bound of the focused time horizon
Y: algorithm's output
T: algorithm's running time
{(Y_i, T_i)}_{i=1}^N: a sample of (Y, T)
{U_i}_{i=1}^∞: independent and identically distributed discrete uniform random variables between 1 and N
Γ(τ): incumbent solution process
Γ̂(τ; N): theoretical approximator for Γ(τ)
Γ̂_i(τ; N): the ith unique path of Γ̂(τ; N)
Γ̂_i(τ; N, B): the ith bootstrap path
C(τ): count of outputs generated by the algorithm at time τ
Ĉ(τ; N): count of outputs in the path of Γ̂(τ; N) by time τ
Ĉ*(τ; N): count of unique outputs in the path of Γ̂(τ; N) by time τ
C*(i): count of unique values in {Y_j}_{j=1}^i
G_{Y,T}(y, t): CDF of (Y, T)
G_Γ(y; τ): CDF of Γ(τ)
G_Γ̂(y; τ, N): CDF of Γ̂(τ; N)
Ĝ_Γ̂(y; τ, N, B): ECDF of Γ̂_i(τ; N, B); the bootstrap estimator of the incumbent solution process
Ĝ_Γ̂^{-1}(p; τ, N, B, i): estimate based on the ith random sample in the numerical validation
G_α(y; τ): bootstrap estimator for algorithm α
G_α(y): integrated CDF for algorithm α

Acknowledgements

I would like to thank my supervisors Dr. Warren Hare and Dr. Jason Loeppky for guiding me through my research towards this thesis. I would also like to thank my committee member Dr. Yves Lucet for helping me set up the environment for implementation of the algorithms tested in this work. In addition, I would like to thank my colleague Soroor Sarafrazi for helping me understand road design optimization models at the early stage of my research. Finally, I would like to thank our industrial partner Softree Technical Systems Inc. for providing test data.

This research is partially funded by a Collaborative Research and Development (CRD) Grant from the Natural Sciences and Engineering Research Council (NSERC) sponsored by Softree Technical Systems Inc., and by the Graduate Entrance Scholarship and the University Graduate Fellowship from the University of British Columbia. Part of the computation in this research was performed in the Computer-Aided Convex Analysis Laboratory funded by the Canadian Foundation for Innovation (CFI) and by the British Columbia Knowledge Development Fund (BCKDF).

Chapter 1: Introduction

1.1 Comparing Stochastic Optimization Algorithms

Comparing the performance of different optimization algorithms is an important task in modern research [VWD16]. Stochastic optimization algorithms are algorithms designed to solve optimization problems with the use of some random procedures. Since they rely on random procedures to seek solutions to optimization problems, their outputs are usually random. Computational complexity is usually not sufficient for comparison between expensive algorithms, because they tend to belong to the same complexity class. Hence, numerical experiments are commonly used to compare the performance of expensive stochastic optimization algorithms. To properly characterize a stochastic algorithm's performance, tests should be conducted over a range of random restarts. Since test data is also random, comparing the performance of different algorithms based on it is nontrivial, especially when limited data is available due to the expense of the algorithms. In this thesis, a methodology is developed for comparing the performance of expensive stochastic optimization algorithms when random restarts are employed.

Researchers have proposed many comparison methods based on running time or number of iterations, e.g., operating characteristics [Gri78], time-to-target plots [FRS94], run-time distributions [HS98], performance profiles [DM02] and data profiles [MW09]. However, none of the above methods addresses how the behavior of stochastic algorithms is influenced by random parameters.

Some researchers have examined comparison methods for stochastic optimization algorithms. Since the performance of stochastic algorithms can be sensitive to random inputs such as the random seed [FRK+08], performance tests that ignore this issue can give misleading results. The operational zone proposed by Sergeyev et al. [SKM16] handled this issue by applying random restarts in the test and showing the best- and worst-case performances along with the average performance.
This approach is based on the assumption that the end user will solve problems with a single algorithm run. If a problem is solved with a single run, then it is reasonable to project the performance based on the probability distributions of the solution time and quality. However, in practice it is common to employ multiple random restarts when an optimization problem is solved by a stochastic algorithm [BHHR14, HJK96, KGB13, Spa03]. In particular, when the objective function is non-convex, multiple restarts help the algorithm find more local minima and thus achieve a better solution [KG09].

For example, given different initial points, the algorithm may converge to different local minima. For algorithms relying on random seeds to determine search paths, different seeds will also result in different local minima. Even if the objective function is convex, different parameter settings may lead to different solutions or running times. For instance, a line search method may return different solutions when different values are used as the step size parameter. Alternatively, a pattern search algorithm using different random seeds may have the same output but different running times, because the pattern may move through different trajectories depending on the random numbers generated by the seed.

In these situations, the probability distributions of the solution time and quality alone cannot disclose sufficient information on the performance of the algorithms. With different restarts, both the solution and the running time can be different, which makes it nontrivial to compare the performance of different algorithms. In addition, if the algorithms being compared are expensive, only a limited amount of test data is available for comparison, and the comparison becomes even more challenging. Therefore, statistical methods need to be applied to process the test data and make inference about the performance.

Truchet et al. [TRC12] proposed a statistical method to analyze the parallel speedup of Las Vegas algorithms [Bab79], i.e., algorithms that always give correct results but take random running times. This is the first work to analyze the performance of stochastic algorithms under multiple restarts. However, the method by Truchet et al. [TRC12] is not sufficient for handling the problem of interest in this thesis. Firstly, it focuses on the analysis of an algorithm's speed under parallel rather than sequential restarts. In addition, it is limited to a subclass of stochastic optimization algorithms. A stochastic optimization algorithm in general may return different outputs under different random restarts, thus a new approach is needed to analyze both the speed and the solution quality of the algorithms.

Moreover, Truchet et al.'s work mainly relies on the assumption that the running time distribution is exponential or lognormal. This is further based on the assumption that both the solution in the search space and the search speed are uniformly distributed, which is not always satisfied in reality. On the contrary, the statistical method developed in this thesis does not impose any assumption about the distribution of the algorithm's running time or output.

1.2 Road Design Optimization

The comparison methods developed in this thesis will be applied to an optimization problem in road design. A good transportation system is essential to the economic development of a country [RCS06, Don10]. Road transportation forms a significant part of all forms of transportation.
To build and maintain the road network, governments all over the world spend large amounts of money. For example, in 2008 the Canadian government spent 20.9 billion dollars, which is 1.6% of the country's GDP, to build and maintain 1 million kilometers of road [Can08]. When a road is designed, it is therefore important to invest effort in reducing construction costs.

In particular, road design optimization is the problem of minimizing the cost of building a road between two points given the ground profile and design constraints [ABS05]. The geometric specification of the road is called the alignment and is a curve in three-dimensional Euclidean space connecting the start and the end of the road. Usually, road alignment is modeled separately as the horizontal alignment and the vertical alignment [HEAEH98]. Consequently, a complete road design optimization problem is often defined as a bi-level optimization problem, of which the outer level is the horizontal alignment optimization problem and the inner level is the vertical alignment optimization problem [MLH15]. The algorithms tested in this thesis are based on the horizontal alignment optimization model in [MLH15] and the vertical alignment optimization model in [HHLR14]. A brief introduction to the two models is given in the next two subsections.

1.2.1 Horizontal Alignment Optimization

The horizontal alignment is the projection of the three-dimensional road alignment onto the xy-plane, as shown in Figure 1.1. In particular, the x and y coordinates represent the location of an alignment and the z coordinate represents the elevation of points in Figure 1.1. The blue curve is the three-dimensional road alignment in the xyz-space connecting the two end points; the red line is defined on the xy-plane and is the projection of the three-dimensional road alignment.

[Figure 1.1: 3D road alignment. Picture from [Mon14], used with permission.]

The horizontal alignment consists of straight lines and curves. The framework we present here models the curves as arcs.
For smoothness of the road, the straight lines are tangent to the arc(s) they connect to, so they are called tangents. To model a horizontal alignment with n turns from its start S to its end E, n arcs and n+1 tangents are needed. Each arc is connected to two tangents. The first and the last tangent respectively connect S to the first arc and E to the last arc. For the ith turn, the radius of the arc is denoted by r_i ∈ R_+ and the intersection point of the two tangents connecting the arc is denoted by P_i ∈ R^2. An example of a horizontal alignment modeled by arcs and tangents is shown in Figure 1.2, in which the red lines are tangents and the blue curves are arcs. Given any horizontal alignment, a vertical alignment optimization problem can be derived, as shown in the next subsection. The cost of a horizontal alignment is the minimum of the vertical alignment problem derived from it.

[Figure 1.2: Horizontal road alignment.]

In particular, the horizontal alignment optimization problem is a black-box optimization problem, also known as a derivative-free optimization problem [CSV09]. In this problem, the vertical alignment solver is treated as a black box, which takes the data input and returns the minimum of the vertical alignment problem. A horizontal alignment with n turns can be specified by {(r_i, P_i)}_{i=1}^n. Given a vertical alignment optimization model, the cost of constructing a road can also be determined by {(r_i, P_i)}_{i=1}^n; therefore it is the collection of decision variables in the horizontal alignment optimization problem. Three sets of constraints are imposed to ensure that i) the road is continuous, ii) the locations of the turns are desired, and iii) the turning radius is large enough for safety. For more details, please refer to [Mon14].

1.2.2 Vertical Alignment Optimization

Given a horizontal alignment, the vertical alignment represents the elevation of the road along the horizontal alignment [Kan08]. Mathematically, it is a function of the distance from the start of the road along its corresponding horizontal alignment. In particular, the model used in this thesis specifies the vertical alignment by quadratic splines, as shown in Figure 1.3.

[Figure 1.3: Vertical road alignment.]

In practice, terrain elevation data are sampled discretely, hence the costs are calculated only at certain sample points. A vertical alignment is divided into sections, each of which has a sample point for cost calculation. According to [HHLR14], the costs associated with road construction are usually earth cutting, filling and movement costs. For example, to build the road in Figure 1.3, earth needs to be filled into sections 0 to 3, cut from sections 4 to 6 and transported between the sections. Borrow pits and waste pits can be used (at a cost) to deal with absent or excess earth. The objective function of this optimization problem is the sum of all the above costs throughout the entire road. The decision variables are the coefficients of the quadratic splines and the volumes of earth moved between different sections, borrow pits and waste pits. Constraints are imposed for smoothness of the splines, curvature of the road for safety, balance of earth movements, maximum cut and fill volume, etc. This optimization problem is formulated as a mixed-integer linear program.
For more details, please refer to [HHLR14].

1.2.3 Implementation and Surrogate Cost Functions

The vertical and horizontal alignment problems are respectively implemented in IBM ILOG CPLEX 12.5.1 (http://www.cplex.com) and NOMAD 3.7 [AAC+]. Figure 1.4(a) shows the relation between the horizontal and the vertical alignment optimization problems.

In black-box optimization, surrogates, or surrogate functions, are cheaper approximations of the actual objective function used to speed up the minimization process [VDHL17]. A framework for the usage of surrogates in black-box optimization can be found in [BDF+99]. Generally, in each iteration the black-box solver evaluates the surrogate on a set of points and uses the result to guide the subsequent evaluation of the actual objective function. Ideally, the solver can quickly identify potential descent directions with the surrogate and save time on the evaluation of the expensive objective function. In the case of road design optimization, a surrogate is a function that can provide a quick approximation of the minimal cost of the vertical alignment optimization problem. The model framework of the road alignment optimization problem with a surrogate is illustrated in Figure 1.4(b).

[Figure 1.4: 3D alignment optimization framework. (a) Original Model. (b) Surrogate Model. HA and VA correspondingly stand for horizontal alignment and vertical alignment.]

Surrogates are usually categorized into two types [BDF+99]. One type is functions built through interpolating or smoothing sample points of the actual objective function. The other is objective functions based on simplified physical models. In addition, some surrogates are dynamic, i.e., able to adjust themselves according to the optimization results, and other surrogates are static, i.e., they remain the same during the optimization process. In this thesis, all the surrogates designed are static surrogates based on simplified physical models; they are described in Section 4.1.

When the surrogate model is implemented, the user needs to provide two objective functions, marking one as the true cost function and the other as the surrogate. In the process of optimizing a road alignment with surrogates, the use of the surrogate is managed internally by NOMAD. Specifically, NOMAD evaluates the surrogate function over points in the poll set and uses the results to decide the order in which the true objective function is evaluated. Section 6.2 of the NOMAD User Guide [ALT09] provides detailed instructions on the implementation of surrogates.

1.3 Challenges and Organization

There are three challenges in analyzing the test data. The first challenge is extracting meaningful information from the test data. During the test, we observe a particular path of solutions (path is formally defined in Section 2.3), but it is only one realization of many possible paths and does not necessarily reflect the general case. Specifically, when different solutions appear in different orders, the path evolves in different ways. To avoid the bias of any single observation, in Chapter 2 a bootstrap [ET93] technique is developed to simulate numerous solution paths based on the test data, and thus to offer a more comprehensive view of the behavior of the solutions.

Another challenge is making proper use of the bootstrap result. Caution should be taken when using bootstrap to estimate the minimum or maximum of a statistical sample [Che07], as the sketch below illustrates.
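To see why such caution is warranted, consider the following minimal demonstration (an illustrative addition, not part of the thesis experiments; it assumes only NumPy). Resampling with replacement can never produce a value below the observed sample minimum, so a bootstrapped minimum is systematically biased upward relative to the true minimum:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(size=30)   # true distribution: Exp(1), true minimum 0

# Bootstrap the sample minimum 10,000 times.
boot_mins = np.array([
    rng.choice(sample, size=sample.size, replace=True).min()
    for _ in range(10_000)
])

# Every resampled minimum is >= sample.min() > 0, the true minimum,
# so the bootstrap overestimates the minimum on average.
print(sample.min(), boot_mins.mean())
```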
In our case, a point in the solution path is the minimum of all previously generated solutions, therefore we need to understand the limitations of our estimation of the incumbent solution. In Section 2.4, the asymptotic correctness and the bias of the bootstrap estimator are analyzed. In Chapter 3, the reliability of the estimator is demonstrated with an out-of-sample test. To the author's knowledge, this work is the first to analyze the estimation error of test results for the performance of optimization algorithms.

The last challenge is interpreting the processed data for intuitive comparison of different algorithms. Many popular comparison methods [Gri78, FRS94, HS98, DM02, MW09] successfully compare the performance of optimization algorithms through their speeds for solving certain problems or their solution qualities under a time budget. However, none of them is designed for the case where multiple restarts are employed. When the algorithm's output may differ under different random restarts, the user may employ multiple restarts to achieve better results. In this thesis, we analyze a stochastic optimization algorithm's performance when multiple random restarts are employed. The major challenge arises when both speed and solution quality are simultaneously considered by modeling the solution path as a stochastic process. In Chapter 4, three comparison methods are demonstrated along with test data of 11 algorithms solving the road design optimization problem mentioned in the previous section. Before introducing the comparison methods, the algorithms being compared are described in Section 4.1.

Chapter 2: Bootstrapping the Incumbent Solution Process

In this chapter, we develop a bootstrap method to model the solution paths of the stochastic optimization algorithms being compared. Since the method will be applied to every algorithm in the same way, by "the algorithm" we mean a given stochastic optimization algorithm.

2.1 Preliminaries

The term incumbent solution usually means the best solution found so far during the process of solving an optimization problem, but the exact definition can differ between scenarios. Below is a formal definition of the incumbent solution as used in this thesis.

Definition 2.1. When an optimization problem is solved by a stochastic algorithm with random restarts, the incumbent solution at a given time is the optimal output of all existing outputs generated by the algorithm since the first restart.

The cumulative distribution function is a very important concept in probability and statistics, since a probability distribution is fully characterized by its cumulative distribution function.

Definition 2.2. The cumulative distribution function (CDF) of a real-valued random variable Z is defined as
\[ G_Z(z) = P(Z \le z), \]
where z ∈ R. The cumulative distribution function of a vector of real-valued random variables (Z_1, Z_2, ..., Z_n) is defined as
\[ G_{Z_1,Z_2,\dots,Z_n}(z_1, z_2, \dots, z_n) = P(Z_1 \le z_1, Z_2 \le z_2, \dots, Z_n \le z_n), \]
where z_1, z_2, ..., z_n ∈ R.

Definition 2.3. The indicator function of a set A is defined as
\[ I_A(x) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases} \]

Definition 2.4. Let {Z_i}_{i=1}^n be independent, identically distributed (i.i.d.) real-valued random variables with a common distribution function. Their empirical cumulative distribution function (ECDF) is defined as
\[ \hat G_n(z) = \frac{1}{n} \sum_{i=1}^n I_{(-\infty, z]}(Z_i). \]

Definition 2.5. A set of real-valued random variables Z_1, Z_2, ..., Z_n are conditionally independent given another random variable Z if
\[ G_{Z_1,Z_2,\dots,Z_n}(z_1, z_2, \dots, z_n \mid Z) = \prod_{i=1}^n G_{Z_i}(z_i \mid Z), \]
where G is the cumulative distribution function and z_1, z_2, ..., z_n ∈ R.

Definition 2.6. A sequence of random variables {Z_n} converges in probability to the random variable Z, or Z_n → Z in probability, if
\[ \lim_{n\to\infty} P(|Z_n - Z| \ge \epsilon) = 0 \]
for all ε > 0.

Definition 2.7. Let θ̂ be an estimator of a real number θ. The bias of θ̂ is defined as
\[ E[\hat\theta] - \theta. \]

2.2 Problem Definition

The algorithm is used to solve the following problem with different random restarts:
\[ \begin{aligned} \min\ & f(x) : \mathbb{R}^n \to \mathbb{R} \\ \text{subject to: } & x \in X \subseteq \mathbb{R}^n. \end{aligned} \tag{2.1} \]

Let (Ω, P) be a probability space, where Ω is the set of all possible restarts and P is the probability of using certain restart(s). Given a certain restart, the algorithm becomes deterministic, thus we can assume it returns a deterministic output and spends a deterministic amount of CPU time under the same test environment. We then define a random variable
\[ (Y, T) : \Omega \to \mathbb{R} \times \mathbb{R}_{++}, \]
where Y(ω) is the output produced and T(ω) is the time spent if ω ∈ Ω is employed. The unknown cumulative distribution function of (Y, T) is denoted by G_{Y,T}(y, t).

To understand this framework better, we present the following example.

Example 2.1. Consider the following instance of Problem (2.1):
\[ \begin{aligned} \min\ & f(x) : \mathbb{R}^n \to \mathbb{R} \\ \text{subject to: } & x \in [0, 1]^n. \end{aligned} \tag{2.2} \]
Suppose the problem above is approached by the Random Pursuit algorithm [SMG13] shown in Algorithm 1, where LS(x_k, u_k) is a line search subroutine (e.g., the golden-section search [Kie53]).

Algorithm 1: Random Pursuit
Input: a problem of the form (2.2); K ∈ N: number of iterations; x_0: an initial point.
Output: approximate solution x_K to (2.2).
for k ← 0 to K − 1 do
    choose u_k uniformly at random from {y ∈ R^n : ‖y‖_2 = 1}
    set x_{k+1} ← x_k + LS(x_k, u_k) · u_k
end

Depending on the user's strategy for employing restarts, there are a few possible cases of random restarts, as discussed below. Each case is characterized by (Ω, P). The probability measure P is sometimes much easier to describe in terms of probability distributions, especially when Ω ⊂ R. However, the probability space (Ω, P) itself is not a random variable. If we define the random variable
\[ I(\omega) : \Omega \to \Omega, \quad I(\omega) = \omega, \]
i.e., the identity map, then I is a random variable on Ω with probability measure P. In this way, the law of P can be shown via the probability distribution of I with sufficient technical validity. For technical details on probability measures and random variables, please refer to any measure-theoretic probability theory book, for example [JP03].

Next, we describe three cases of random restarts.

Case 1: Random seed. The random search directions {u_k}_{k=0}^{K−1} depend on a sequence of pseudo-random numbers, which depend on the random seed given to the algorithm. Therefore, the random seed affects the result through the search directions. If the algorithm accepts 32-bit unsigned binary integer values as random seeds, then Ω = {0, 1, ..., 2^32 − 1}. Usually random seeds are sampled uniformly from Ω, so I ∼ U{0, 2^32 − 1}.

Case 2: Random initial point. Searching from different initial points can also result in different solutions. Without any prior understanding of the objective function, the user can uniformly sample initial points from the feasible region [0, 1]^n; then I follows a multivariate continuous uniform distribution between 0 and 1.

Case 3: Combination of multiple cases. For instance, when Case 1 and Case 2 are combined, the setup can be Ω = {0, 1, ..., 2^32 − 1} × [0, 1]^n and I = (I_1, I_2), where I_1 ∼ U{0, 2^32 − 1} and I_2 follows a multivariate continuous uniform distribution between 0 and 1.

To compare the performance of different algorithms, we need to understand how the incumbent solution evolves during the optimization process. We model the incumbent solution by the following stochastic process.

Definition 2.8. Given a stochastic optimization algorithm, the incumbent solution process is defined as
\[ \Gamma(\tau) = \min\left\{ \inf_{k\in\mathbb{N}} \left\{ Y_k : \sum_{i=1}^k T_i \le \tau \right\},\ f(x_0) \right\}, \]
where τ ≥ 0 is the time since the start of the optimization process, the sequence {(Y_i, T_i)}_{i=1}^∞ are i.i.d. random variables following the distribution of (Y, T), and x_0 is a given point in the feasible set X.

When τ < T_1, the infimum in the above definition is +∞, because the set it corresponds to is ∅. Thus, the value f(x_0) is used to avoid ∞ when τ is small. The value f(x_0) can be considered as the minimal known value of f before the optimization process. Clearly, the choice of x_0 in a particular optimization problem should be the same for different algorithms for a fair comparison of solution quality.

We want to compare the performance of different algorithms by comparing their incumbent solution processes. Since the incumbent solution process Γ(τ) is characterized by the unknown G_{Y,T}(y, t), we cannot fully understand it. However, we can estimate it based on an observed sample of (Y, T) from the test. Next, we will develop an estimator for the incumbent solution process's cumulative distribution function,
\[ G_\Gamma(y; \tau) = P(\Gamma(\tau) \le y), \tag{2.3} \]
through bootstrapping from the sample.

2.3 Bootstrap Estimation of the Incumbent Solution Process

We now introduce a theoretical framework for the bootstrap method. As mentioned earlier, G_{Y,T}(y, t) characterizes Γ(τ) and thus G_Γ(y; τ), but it is unknown. In the test, we observe {(Y_i, T_i)}_{i=1}^N, a sample of (Y, T) with size N. It follows that (Y_i, T_i) has the same distribution as (Y, T) for all 1 ≤ i ≤ N.

Definition 2.9. Given a sequence {(Y_i, T_i)}_{i=1}^N of i.i.d. random variables following the distribution of (Y, T), the theoretical approximator of the incumbent solution process is a stochastic process defined as
\[ \hat\Gamma(\tau; N) = \min\left\{ \inf_{k\in\mathbb{N}} \left\{ Y_{U_k} : \sum_{i=1}^k T_{U_i} \le \tau \right\},\ f(x_0) \right\}, \]
where x_0 is a given point in the feasible set X and {U_i}_{i=1}^∞ are i.i.d. discrete uniform random variables between 1 and N.

Given the above definition, {(Y_{U_i}, T_{U_i})}_{i=1}^∞ are i.i.d. random variables and
\[ P\{(Y_{U_i}, T_{U_i}) = (Y_j, T_j)\} = \frac{1}{N} \]
for all i ≥ 1 and 1 ≤ j ≤ N. Resampled with replacement from {(Y_i, T_i)}_{i=1}^N, the set {(Y_{U_i}, T_{U_i})}_{i=1}^∞ approximates {(Y_i, T_i)}_{i=1}^∞. As a result, Γ̂(τ; N) approximates Γ(τ) based on the observations {(Y_i, T_i)}_{i=1}^N.

Next, we introduce the implementation of the bootstrap method. Given {(Y_i, T_i)}_{i=1}^N, the cumulative distribution function of the theoretical approximator is defined by
\[ G_{\hat\Gamma}(y; \tau, N) = P\{\hat\Gamma(\tau; N) \le y\}, \tag{2.4} \]
and is deducible analytically. However, it is difficult to compute, therefore it is estimated by its empirical cumulative distribution function based on a finite collection of {(Y_{U_i}, T_{U_i})}_{i=1}^∞. We simulate B random paths of the theoretical approximator based on {(Y_i, T_i)}_{i=1}^N for 0 ≤ τ ≤ τ_max, where τ_max is the upper bound of the time horizon we focus on. To ensure data sufficiency for each path, we generate
\[ M = \frac{\tau_{\max}}{\min_{1\le i\le N} T_i} \tag{2.5} \]
sample outputs in each path.
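Before formalizing the estimator, here is a minimal computational sketch of this simulation (an illustrative Python addition; the thesis implementations were in MATLAB and C++, so the function names `bootstrap_incumbents` and `ecdf_estimator` are hypothetical). It resamples (Y, T) pairs with replacement, accumulates running times along each path, and returns each path's incumbent value at time τ:

```python
import numpy as np

def bootstrap_incumbents(Y, T, tau, B, f_x0, rng):
    """Simulate B bootstrap paths from the sample (Y, T) and return
    each path's incumbent value at time tau (cf. Definitions 2.9-2.10)."""
    N = len(Y)
    M = max(1, int(np.ceil(tau / T.min())))  # enough resamples per path, cf. (2.5)
    idx = rng.integers(0, N, size=(B, M))    # the U_i: uniform indices into the sample
    elapsed = np.cumsum(T[idx], axis=1)      # running time along each path
    vals = np.where(elapsed <= tau, Y[idx], np.inf)  # only outputs produced by tau count
    return np.minimum(vals.min(axis=1), f_x0)        # incumbent, floored at f(x0)

def ecdf_estimator(paths, y):
    """Bootstrap estimator of G_Gamma(y; tau): the fraction of the B
    incumbent values at or below y (cf. Definition 2.11)."""
    return np.mean(paths <= y)
```

Evaluating `ecdf_estimator` over grids of y and τ values yields the estimator Ĝ_Γ̂(y; τ, N, B) defined formally below; the whole computation is a few vectorized array operations, which is what makes the method cheap relative to generating new test data.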
In particular, we generate only the first B·M elements of {(Y_{U_i}, T_{U_i})}_{i=1}^∞ and reindex them as {(Y*_{i,j}, T*_{i,j})}_{i=1,j=1}^{i=B,j=M}.

Definition 2.10. Given resampled test data {(Y*_{i,j}, T*_{i,j})}_{i=1,j=1}^{i=B,j=M}, the ith bootstrap path is defined as
\[ \hat\Gamma_i(\tau; N, B) = \min\left\{ \inf_{1\le k\le M} \left\{ Y^*_{i,k} : \sum_{j=1}^k T^*_{i,j} \le \tau \right\},\ f(x_0) \right\}, \]
where x_0 is a given point in the feasible set X, 1 ≤ i ≤ B and 0 ≤ τ ≤ τ_max.

Finally, our estimator is given by the following definition.

Definition 2.11. The bootstrap estimator of the cumulative distribution function of the incumbent solution process is defined as the bootstrap paths' empirical cumulative distribution function
\[ \hat G_{\hat\Gamma}(y; \tau, N, B) = \frac{1}{B} \sum_{i=1}^B I_{(-\infty, y]}\big(\hat\Gamma_i(\tau; N, B)\big). \]

2.4 Theoretical Analysis

In this section, we first show that the bootstrap estimator proposed in Section 2.3 is asymptotically consistent. The following fact provides a foundation for the proof.

Fact 2.12. [BF81, Corollary 4.1] Let {Z_i}_{i=1}^N be a sample from an unknown cumulative distribution function G, which is estimated by {Z_i}_{i=1}^N's empirical cumulative distribution function G_N. Given {Z_i}_{i=1}^N, let {Z*_i}_{i=1}^B be conditionally independent, with common distribution G_N. Let G_{NB} be the empirical cumulative distribution function of {Z*_i}_{i=1}^B. Then, as N and B → ∞, ‖G_{NB} − G‖ → 0 in probability.

Theorem 2.13. Given any τ_max ∈ R_+, ‖Ĝ_Γ̂(y; τ, N, B) − G_Γ(y; τ)‖ → 0 in probability for all τ ∈ [0, τ_max] as N → ∞ and B → ∞.

Proof. We first construct the elements needed for the use of Fact 2.12. Since Γ̂(τ; N) is based on a finite sample {(Y_i, T_i)}_{i=1}^N, given any τ_max ∈ R_+, it has finitely many unique paths for τ ∈ [0, τ_max]. Let this finite number be denoted by N*(τ_max); then the collection of unique paths of Γ̂(τ; N) can be denoted by {Γ̂_i(τ; N)}_{i=1}^{N*(τ_max)}. Each unique path has the same probability, thus G_Γ̂(y; τ, N) is the empirical cumulative distribution function of {Γ̂_i(τ; N)}_{i=1}^{N*(τ_max)}.

Next, we apply Fact 2.12 to show the result. For any τ ∈ [0, τ_max], the set {Γ̂_i(τ; N)}_{i=1}^{N*(τ_max)} is a sample from the unknown distribution G_Γ(y; τ), which is estimated by the sample's empirical cumulative distribution function G_Γ̂(y; τ, N). Given the sample {Γ̂_i(τ; N)}_{i=1}^{N*(τ_max)}, it follows that {Γ̂_i(τ; N, B)}_{i=1}^B are conditionally independent with common distribution G_Γ̂(y; τ, N). The function Ĝ_Γ̂(y; τ, N, B) is the empirical cumulative distribution function of {Γ̂_i(τ; N, B)}_{i=1}^B. As N → ∞ and B → ∞, we also have N*(τ_max) → ∞ and B → ∞, so Fact 2.12 implies that ‖Ĝ_Γ̂(y; τ, N, B) − G_Γ(y; τ)‖ → 0 in probability for all τ ∈ [0, τ_max]. ∎

Although the bootstrap estimator is asymptotically consistent by Theorem 2.13, it has a bias when N and B are finite. Next we show a property of this bias. The following assumption facilitates a step in the proof of Theorem 2.15. It assumes that the bootstrap method introduced in Section 2.3 gives an unbiased estimation of the distribution of a counting process associated with the incumbent solution process.

Assumption 2.1. Let C(τ) be a random variable representing the count of outputs generated by the algorithm at time τ, i.e., the maximal index i satisfying the constraint in Definition 2.8. Let Ĉ(τ; N) be a random variable representing the count of outputs in the path of Γ̂(τ; N) by time τ, i.e., the maximal index j satisfying the constraint in Definition 2.9.
Assume that the expected distribution of Ĉ(τ; N) is the same as the distribution of C(τ), i.e., for all i we have
\[ E\big[P(\hat C(\tau; N) = i)\big] = P(C(\tau) = i). \]

The following fact is a basic property of order statistics and is important for the proof of the next theorem.

Fact 2.14. [DN05, (2.12)] Let Z_1, Z_2, ..., Z_n be independent and identically distributed random variables with cumulative distribution function G(z). Then the cumulative distribution function of the smallest element of Z_1, Z_2, ..., Z_n is given by
\[ 1 - [1 - G(z)]^n. \]

Theorem 2.15. Given Assumption 2.1, the bootstrap estimator's bias
\[ E\big[\hat G_{\hat\Gamma}(y; \tau, N, B)\big] - G_\Gamma(y; \tau) \]
is non-positive.

Proof. The estimator Ĝ_Γ̂(y; τ, N, B) has two sources of uncertainty: the random sample {(Y_i, T_i)}_{i=1}^N and the random bootstrap paths {(Y*_{i,j}, T*_{i,j})}_{i=1,j=1}^{i=B,j=M}. We first eliminate the latter in the expectation:
\[ \begin{aligned} E\big[\hat G_{\hat\Gamma}(y; \tau, N, B)\big] - G_\Gamma(y; \tau) &= E\Big\{ E\big[\hat G_{\hat\Gamma}(y; \tau, N, B) \,\big|\, \{(Y_i, T_i)\}_{i=1}^N\big] \Big\} - G_\Gamma(y; \tau) \\ &= E\big\{ G_{\hat\Gamma}(y; \tau, N) \big\} - G_\Gamma(y; \tau), \end{aligned} \tag{2.6-2.8} \]
where the last step is a result of the fact that Ĝ_Γ̂(y; τ, N, B) is defined as an empirical CDF of Γ̂(τ; N) given any {(Y_i, T_i)}_{i=1}^N. Next, we analyze E[G_Γ̂(y; τ, N)] − G_Γ(y; τ) as a proxy of E[Ĝ_Γ̂(y; τ, N, B)] − G_Γ(y; τ).

Case 1: 0 ≤ τ < min{T_i}_{i=1}^N. In this case,
\[ \begin{aligned} & \Gamma(\tau) \le f(x_0) = \hat\Gamma(\tau; N) \\ \Rightarrow\ & G_\Gamma(y; \tau) \ge G_{\hat\Gamma}(y; \tau, N) \quad \forall\, \{(Y_i, T_i)\}_{i=1}^N \\ \Rightarrow\ & G_\Gamma(y; \tau) \ge E\big[G_{\hat\Gamma}(y; \tau, N)\big] \\ \Rightarrow\ & E\big[G_{\hat\Gamma}(y; \tau, N)\big] - G_\Gamma(y; \tau) \le 0. \end{aligned} \tag{2.9-2.12} \]
Hence, the bias is non-positive in the first case.

Case 2: τ ≥ min{T_i}_{i=1}^N. Let G_Y(y) denote the marginal cumulative distribution function of Y. Let Ĉ*(τ; N) be the count of unique outputs in the path of Γ̂(τ; N) by time τ, i.e., the number of unique values in {Y_i}_{i=1}^{Ĉ(τ;N)}. We introduce Ĉ*(τ; N) to enable the use of Fact 2.14. Let C*(i) be the count of unique values in {Y_j}_{j=1}^i. All of C(τ), Ĉ(τ; N), Ĉ*(τ; N) and C*(i) are random. The process Ĉ(τ; N) solely depends on the distribution of {T_i}_{i=1}^N, but C*(i) is independent of {T_i}_{i=1}^N. Therefore, Ĉ(τ; N) and C*(i) are independent, as needed in the proof below. Now, by Fact 2.14,
\[ \begin{aligned} & E\big[G_{\hat\Gamma}(y; \tau, N)\big] - G_\Gamma(y; \tau) \\ &= E\big[1 - (1 - G_Y(y))^{\hat C^*(\tau;N)}\big] - E\big[1 - (1 - G_Y(y))^{C(\tau)}\big] \\ &= E\big[(1 - G_Y(y))^{C(\tau)}\big] - E\big[(1 - G_Y(y))^{\hat C^*(\tau;N)}\big] \\ &= E\big[(1 - G_Y(y))^{C(\tau)}\big] - E\Big[ E\big[(1 - G_Y(y))^{\hat C^*(\tau;N)} \,\big|\, \{(Y_i, T_i)\}_{i=1}^N\big] \Big] \\ &= \sum_{i=1}^{\infty} P(C(\tau) = i)(1 - G_Y(y))^i - E\left[ \sum_{i=1}^{\infty} P(\hat C(\tau; N) = i)(1 - G_Y(y))^{C^*(i)} \right] \\ &= \sum_{i=1}^{\infty} P(C(\tau) = i)(1 - G_Y(y))^i - \sum_{i=1}^{\infty} E\big[P(\hat C(\tau; N) = i)\big]\, E\big[(1 - G_Y(y))^{C^*(i)}\big] \\ &= \sum_{i=1}^{\infty} P(C(\tau) = i)(1 - G_Y(y))^i - \sum_{i=1}^{\infty} P(C(\tau) = i)\, E\big[(1 - G_Y(y))^{C^*(i)}\big] \\ &= \sum_{i=1}^{\infty} P(C(\tau) = i)\, E\big[(1 - G_Y(y))^i - (1 - G_Y(y))^{C^*(i)}\big], \end{aligned} \tag{2.13-2.21} \]
where the second-to-last equality is due to Assumption 2.1.

Since C*(i) ≤ i is always true and 0 ≤ 1 − G_Y(y) ≤ 1,
\[ (1 - G_Y(y))^i - (1 - G_Y(y))^{C^*(i)} \le 0 \tag{2.22} \]
is always true. Thus,
\[ E\big[(1 - G_Y(y))^i - (1 - G_Y(y))^{C^*(i)}\big] \le 0. \tag{2.23} \]
As P(·) ≥ 0, for all i ≥ 1 we have
\[ P(C(\tau) = i)\, E\big[(1 - G_Y(y))^i - (1 - G_Y(y))^{C^*(i)}\big] \le 0. \tag{2.24} \]
It therefore follows that
\[ E\big[G_{\hat\Gamma}(y; \tau, N)\big] - G_\Gamma(y; \tau) = \sum_{i=1}^{\infty} P(C(\tau) = i)\, E\big[(1 - G_Y(y))^i - (1 - G_Y(y))^{C^*(i)}\big] \le 0, \tag{2.25-2.27} \]
which shows the non-positivity of the bias in the second case.

In conclusion, E[Ĝ_Γ̂(y; τ, N, B)] − G_Γ(y; τ) = E[G_Γ̂(y; τ, N)] − G_Γ(y; τ) ≤ 0. ∎

As a consequence of the above theorem, we have E[Ĝ_Γ̂^{-1}(y; τ, N, B)] − G_Γ^{-1}(y; τ) ≥ 0, i.e., the expected estimates of the incumbent solution's quantiles are not less than their true values at any time τ.

Chapter 3: Numerical Validation of the Bootstrap Method

In the previous chapter, we introduced an estimator for the incumbent solution process and analyzed its theoretical properties.
In this chapter, we numerically validate the estimator through an out-of-sample test.

3.1 Test Setup

In the test, we minimize five objective functions. In particular, we selected five non-convex functions from [Ort12] and modified them so that they have more local minima. The objective functions are as follows:
\[ f_1(x) = \exp\Big\{ \max\{|\cos(|\sin(|x_i|)|)| : i = 1, 2, \dots, n\} \cdot \eta\big[ |\tan(|\cot(|x_1|)|)|, |\tan(|\cot(|x_2|)|)|, \dots, |\tan(|\cot(|x_n|)|)| \big] \Big\}, \tag{3.1} \]
where η(x_1, x_2, ..., x_n) is the median of x_1, x_2, ..., x_n,
\[ f_2(x) = \ln\left( \sum_{i=1}^m \sum_{j=i+1}^m \sin^2(3\pi x_i) + (x_i - 1)^2\big[\sin^2(3\pi x_j) + 1\big] + (x_j - 1)^2\big[\sin^2(2\pi x_j) + 1\big] \right), \tag{3.2} \]
\[ f_3(x) = \left| 100 - \frac{\|x\|}{\pi} \right| + \sum_{i=1}^n \log(|\sin(x_i)|), \tag{3.3} \]
\[ f_4(x) = \ln\left( \left| \sum_{i=1}^m \sum_{j=i+1}^m (x_j + 47)\sin\left(\sqrt{|x_i/2 + x_j + 47|}\right) + x_i \sin\left(\sqrt{|x_i - x_j - 47|}\right) \right| \right), \tag{3.4} \]
\[ f_5(x) = -\sum_{i=1}^m \sum_{j=i+1}^m \left| \sin(x_i)\cos(x_j) \exp\left( \left| 1 - \frac{1}{\pi}\sqrt{x_i^2 + x_j^2} \right| \right) \right|. \tag{3.5} \]

Each of the five objective functions is applied with 5, 10 and 50 variables, resulting in 15 test functions. Setting the feasible set
\[ X = \{(x_1, x_2, \dots, x_n) : -99 \le x_i \le 99,\ i = 1, 2, \dots, n\} \]
for all test functions, we have 15 test problems of the form (2.1).

The derivative-free optimization algorithm LTMADS [AD06] is used to solve all of the test problems. The LTMADS algorithm is a direct search method that uses random numbers to help determine poll directions. Fixing a random seed makes the algorithm deterministic. We let Ω represent the set of random seeds. For the purpose of numerical testing, we assume that all positive integers not larger than 10^5 are used by the user as random seeds, so we can set Ω = {1, 2, ..., 10^5}. In the test, we solve each test problem with every ω ∈ Ω.

The derivative-free optimization software NOMAD 3.7.2 [AAC+] is based on LTMADS and used as the solver in the numerical experiments. All the computer code is programmed in MATLAB 2014b. The MATLAB version of NOMAD is called through OPTI Toolbox [CW12]. All of the numerical experiments in this chapter were performed on an Apple iMac with a 3.5 GHz Intel Core i5 processor, 8 GB of RAM and a 64-bit Windows 8 operating system under Boot Camp.

3.2 Basic Test Results

We use the data set from the numerical experiments to demonstrate the accuracy of the bootstrap estimator by comparing it with the incumbent solution process. Since all the test functions are cheap, we are able to generate the test result for all ω ∈ Ω. Therefore, we know the distribution of (Y, T) and can deduce the distribution of Γ(τ) in theory. However, calculating the CDF of Γ(τ) is not computationally tractable. Instead, we approximate the CDF of Γ(τ) by its empirical CDF. In particular, we randomly generate 10^5 independent paths of Γ(τ) and derive their empirical cumulative distribution function, which we call the "truth".

To reflect its behavior when the objective functions are expensive and only limited data is available, we generate the bootstrap estimator based on small samples from the test data and call it the "estimate". We test using N ∈ {50, 100, 200, 400} and B ∈ {10^3, 10^4, 10^5}. To explore the variability of the estimator, for each combination of N and B, we generate 1000 random samples of a subset of the test data, each of which has size N and corresponds to a particular realization of {(Y*_{i,j}, T*_{i,j})}_{i=1,j=1}^{i=B,j=M}. In addition, for each test problem we set
\[ \tau_{\max} = 400\,\bar T, \tag{3.6} \]
where
\[ \bar T = \frac{1}{|\Omega|} \sum_{\omega\in\Omega} T(\omega) \tag{3.7} \]
normalizes the time unit and provides a convenient reference to the number of restarts.
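This truth-versus-estimate comparison can be summarized by the following sketch (again an illustrative Python addition; it reuses the hypothetical `bootstrap_incumbents` from the Section 2.3 sketch, and `Y_all`, `T_all` stand for NumPy arrays holding the full test results over all ω ∈ Ω):

```python
import numpy as np

rng = np.random.default_rng(1)

def truth_quantile(Y_all, T_all, tau, p, f_x0, n_paths=100_000):
    """'Truth': empirical p-quantile of Gamma(tau), from independent
    paths simulated directly from the full test data."""
    vals = bootstrap_incumbents(Y_all, T_all, tau, n_paths, f_x0, rng)
    return np.quantile(vals, p)

def estimated_quantile(Y_all, T_all, tau, p, f_x0, N, B=100_000):
    """'Estimate': draw a size-N subsample (the data an expensive test
    would actually yield) and bootstrap the same quantile from it."""
    sub = rng.choice(len(Y_all), size=N, replace=False)
    vals = bootstrap_incumbents(Y_all[sub], T_all[sub], tau, B, f_x0, rng)
    return np.quantile(vals, p)
```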
For the rest of this chapter, τ is measured in units of T̄.

We compare the estimator with the truth through various figures. For all the figures in this section, we only show the test results for f_1 on R^5 with B = 10^5. The results are similar for different test functions and values of B. The full results are summarized in Figure 3.7 and are also available on the Internet (https://www.dropbox.com/sh/gpldpjg1b7ahr8l/AADzOt5c1w4xC-UU1vGoY_Qua?dl=0).

Figure 3.1 gives an intuitive comparison between the estimate and the truth for N = 100 and B = 10^5. The lines are the medians of the distributions; the lower and upper bounds of the bands are the 10th and 90th percentiles of the distributions. These represent three different samples (one per row) and two different bootstraps for each sample (one per column). It can be observed that the estimates look identical within each row, which indicates that the random error from different collections of bootstrap paths is negligible when B = 10^5. Hence, for the rest of this chapter we only generate one collection of bootstrap paths for each bootstrap estimate. We also set B = 10^5 unless specified otherwise. On the contrary, the accuracy of the estimate varies significantly with the test data sample it is constructed from. Specifically, instances (1a) and (1b) estimate the truth relatively well. However, the bootstrap estimator may under-estimate or over-estimate the performance of the algorithm when an "unlucky" or "lucky" random sample is drawn, as shown by (2a)/(2b) and (3a)/(3b).

[Figure 3.1: Six instances of the estimate versus the truth for f1 on R^5, N = 100 and B = 10^5. The three rows of the subplot matrix are based on three different test data samples. In each row, the two subplots are based on the same test data sample, but generated from two different collections of bootstrap paths.]

Since the behavior of the estimate varies with different random samples, there is a need to show the average case. Instead of showing special instances of the estimate, Figure 3.2 shows the mean of all instances generated in the test. It can be observed that the mean accuracy decays as time increases, but the larger the sample size the slower the decay becomes. Among the three quantiles, the 90th percentile has the best and the 10th percentile the worst match to the truth. Thus, the mean accuracy of the quantile estimation tends to increase as its corresponding probability increases. Generally, the estimates are not reasonably accurate when τ > N. For example, the estimate stays relatively close to the truth only for τ ≤ 75 when N = 100 and for τ ≤ 300 when N = 400.

[Figure 3.2: Mean of estimates for f1 on R^5 and B = 10^5.]

In addition to Figure 3.2, Figure 3.3 also shows the standard deviation of all the estimates for different τ. In particular, the dotted and dashed lines are one standard deviation away from the mean estimates of the 10th and 90th percentiles. It can be observed that the standard deviation increases as time increases, but the larger the sample size the slower the increase becomes. Moreover, the estimates of the 90th percentile have smaller standard deviations than those of the 10th percentile. Thus, the volatility of the quantile estimation tends to decrease as its corresponding probability increases.

[Figure 3.3: Mean and standard deviation of estimates for f1 on R^5 and B = 10^5.]

Even with B = 10^5, computing bootstrap estimators is still much cheaper than solving an expensive test problem under various random restarts. For instance, in the project demonstrated in Chapter 4, the computation of the bootstrap estimates took only a few minutes for B = 10^5, but the samples took a month to generate for N = 100. Considering the negligible marginal cost, we recommend large values of B, e.g., 10^5 or 10^6, to avoid possible unknown flaws. It can be concluded that the bootstrap estimator can accurately predict the incumbent solution process under certain conditions. When used properly, it can save significant computer time.

3.3 Reliable Time Interval and Quantiles

Figures 3.1 to 3.3 intuitively show the accuracy of the bootstrap estimator, and it is obvious that the estimator suffers from significant error under some conditions. Next, we quantify the error and propose a method to avoid excessive error by properly choosing the time interval and quantiles.

Definition 3.1. Given a test problem, let Ĝ_Γ̂^{-1}(p; τ, N, B, i) be the estimate based on the ith random sample and N_s be the total number of random samples drawn. The relative error of the bootstrap estimator for probability p (or the p-quantile) and time τ is defined as
\[ \frac{ \dfrac{1}{N_s} \sum_{i=1}^{N_s} \left| \hat G_{\hat\Gamma}^{-1}(p; \tau, N, B, i) - G_\Gamma^{-1}(p; \tau) \right| }{ f(x_0) - G_\Gamma^{-1}(p; \tau) }. \]

The above definition characterizes the deviation of the estimator from the truth. To compare accurately across different test problems, the deviation is normalized relative to the decrease of the incumbent solution.

Figure 3.4 shows the relative error for f_1 on R^5. The observations are consistent with those from the previous figures, since the relative error increases as τ increases and p decreases. In addition, an increase of N also leads to a decrease of the relative error. For the other test problems, the relative errors follow the same pattern; the results are available on the Internet (same address as above).

[Figure 3.4: Relative error for f1 on R^5 and B = 10^5.]

Figures 3.5 and 3.6 correspondingly show the average-case and worst-case (maximal) relative errors over all test problems when N = 100 and B = 10^5 are set. It can be observed that the average-case and worst-case relative errors correspondingly stay below 10% and 20% for most of the probabilities shown and τ ∈ [0, N]. When N and B are set to other values, a similar pattern appears.

Next, based on the relative error we propose a method to determine time intervals within which the bootstrap estimator remains reliable.

Definition 3.2.
Given a test problem, a probability p (or a p-quantile), a threshold δ and τ_max, the reliable time interval is the largest subinterval of [0, τ_max] within which the corresponding relative error does not exceed δ.

Taking the subplot titled "N = 400" in Figure 3.4 as an example, if we set p = 10%, δ = 0.05 and τ_max = 400, then there are only two groups of subintervals of [0, τ_max] in which the relative error appears to be lower than δ. The first and second groups are respectively subsets of [0, 33.6] and [42.4, 98.8]. Since [42.4, 98.8] is larger than [0, 33.6], the largest subinterval of [0, τ_max] satisfying the conditions is approximately [42.4, 98.8]. Therefore, the reliable time interval in this case is approximately [42.4, 98.8].

[Figure 3.5: Average-case relative error for N = 100 and B = 10^5.]
[Figure 3.6: Worst-case relative error for N = 100 and B = 10^5.]

From Figure 3.4, the reliable time intervals can be determined for only one test problem and five quantiles. In contrast, Figure 3.7 provides an overview of the reliable time intervals for all test problems, quantiles and values of N and B. Each color corresponds to one of the test problems. Given a test problem and a value p, the interval between the dotted and solid lines represents the reliable time interval for the threshold of 10%. It should be noted that there are multiple vertical lines at τ = 0 and τ = 400 = τ_max, but only one of them is visible. Taking the subplot titled "N = 50 B = 1000" as an example, if we set p = 0.5, then the reliable time interval of the yellow-represented problem is approximately [5, 250], because the horizontal line at p = 0.5 intersects the dotted and solid yellow curves at about τ = 5 and τ = 250.

The observations reflect those from the previous figures for the following reasons. It can be seen that all the reliable time intervals have a lower bound close to 0, but not all of them have an upper bound close to τ_max, indicating that the reliability of the bootstrap estimator decays over time. Moreover, except for extremely high quantiles and a few cases, the higher the quantile the larger the reliable time interval tends to be, indicating that the higher the quantile the higher the estimator's predictability. In addition, the length of the reliable time interval increases as N increases, indicating that larger samples produce more reliable estimates. However, there is no significant difference when B varies. Figure 3.7 provides guidance for the choice of the quantiles and the time interval to focus on when comparing different algorithms with the bootstrap estimator. For example, if we set N = 100 and want to limit the potential relative error to 10%, we can focus on τ ∈ [0, 75] and p ∈ [10%, 95%], because most of the reliable intervals for p ∈ [10%, 95%] cover [0, 75].
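The two definitions above translate directly into code. The following minimal sketch (an illustrative addition; the function names are hypothetical and the inputs are assumed to be NumPy-compatible arrays) computes the relative error of Definition 3.1 on a grid of time points and extracts the reliable time interval of Definition 3.2 from it:

```python
import numpy as np

def relative_error(est_q, true_q, f_x0):
    """Definition 3.1: mean absolute deviation of the N_s quantile
    estimates, normalized by the incumbent's decrease f(x0) - true_q."""
    return np.mean(np.abs(np.asarray(est_q) - true_q)) / (f_x0 - true_q)

def reliable_interval(taus, errs, delta):
    """Definition 3.2: largest contiguous run of grid points whose
    relative error stays at or below delta, returned as (tau_lo, tau_hi).
    Returns a degenerate interval if no grid point qualifies."""
    ok = np.append(np.asarray(errs) <= delta, False)  # sentinel closes the last run
    best, start = (0, 0), None
    for i, flag in enumerate(ok):
        if flag and start is None:
            start = i                                  # a run of reliable points begins
        elif not flag and start is not None:
            if taus[i - 1] - taus[start] > taus[best[1]] - taus[best[0]]:
                best = (start, i - 1)                  # keep the longest run seen so far
            start = None
    return taus[best[0]], taus[best[1]]
```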
[Figure 3.7: Reliable time intervals for δ = 10% and τ_max = 400 of all test problems.]

Chapter 4: Methods to Compare Algorithms

In Chapter 2, we introduced a resampling method to estimate the distribution of incumbent solutions over time. Once the estimate is generated, the user can interpret the results in various ways. In this chapter we propose three methods of interpreting the estimators, along with numerical results from a real-world case. We use α to index algorithms, with α = 0 indexing the baseline algorithm. Given an algorithm α and fixing the parameters N and B, we let
\[ G_\alpha(y; \tau) = \hat G_{\hat\Gamma}(y; \tau, N, B). \tag{4.1} \]

We solved the road alignment optimization problem introduced in Section 1.2 on 10 test problems with 11 algorithms, including a baseline algorithm. For each algorithm on each test problem we solved the problem with N = 100, subject to the constraint of about one month's computer time for all the algorithms and test problems. Although more test data would increase the accuracy of the results, with a sample of size 100 the bootstrap estimator can already provide considerable prediction of the performance of the algorithms according to the numerical validation in Chapter 3. In addition, we set B = 10^5 as justified by the results of Chapter 3. Based on the results of Section 3.3, we also set
\[ \tau_{\max} = 75\,\bar T', \quad \text{where } \bar T' = \frac{1}{N} \sum_{i=1}^N T_i, \]
and focus on p ∈ [10%, 95%] when comparing algorithms. Throughout this chapter, τ is in units of seconds. In Sections 4.2 and 4.3, the figures only show the results of the first test problem, but the results follow a similar pattern across all test problems and are available on the Internet (same address as above).

All of the experiments in this chapter were performed on a Dell workstation with an Intel(R) Xeon(R) 2.40 GHz (2 cores) processor, 16 GB of RAM and a 64-bit Windows 7 Enterprise operating system. NOMAD 3.7.3 was used as the solver. The program was coded in C++.

In Section 4.1, we introduce the road design optimization algorithms being compared. Three comparison methods are introduced in the rest of the chapter.

4.1 Road Design Optimization Algorithms Compared

Next, we describe the road design optimization algorithms being compared. The parentheses after the name of an algorithm contain its abbreviation.

Algorithm 0: the Original Model (org)

This is the algorithm based on the model proposed by Mondal, Lucet and Hare [MLH15]. It is considered to be the best existing algorithm, so it is also referred to as the "baseline" algorithm. This algorithm only uses the true cost function, whereas all the other algorithms introduced below mix the use of the true cost function and a certain surrogate cost function, as described in Section 1.2.

Surrogate 1: Constant (const)

The constant function
\[ f(x) = 0 \tag{4.2} \]
is used as the surrogate. This is designed to test the behavior of the solver with the use of a surrogate that has extremely high speed, but extremely low accuracy.

Surrogate 2: True Cost Function as Surrogate (=true)

This algorithm uses the true cost function as the surrogate.
This is designed to test the behavior of the solver with the use of a surrogate that has extremely low speed, but extremely high accuracy.

Surrogate 3: Preprocessor Off (preOff)

Preprocessing of a mixed-integer linear program is the process of manipulating variables and constraints in the program before the solver starts to solve the problem. It can make the actual solving process more effective, but it also consumes extra computational resources. Therefore, there is a trade-off between effectiveness and efficiency in the use of preprocessors [Sav94]. The solver for the vertical alignment problem in the original model applies a preprocessor by default. In this surrogate, the preprocessor is turned off in order to achieve a different balance between effectiveness and efficiency for the problem. Compared to =true, this surrogate should have the same level of accuracy but different speed. It can serve as a supplement to Surrogate 2 for testing the behavior of the solver with the use of a surrogate that has extremely high accuracy.

Surrogates 4 and 5: Skip Sections (skip2 and skip4)

Although terrain is a continuous surface, for computational practicality its shape is sampled at discrete points as the input of the vertical alignment problem. In these surrogates, some of the sample points are skipped to reduce the problem size and therefore the time to solve the problem. In Surrogate 4, one section is skipped for every two consecutive sections. In Surrogate 5, three sections are skipped for every four consecutive sections. A sketch of this thinning step follows.
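The thinning step itself is simple to state in code. The following C++ sketch keeps one section out of every stride consecutive sections of the sampled ground profile; the Section struct and the function name are illustrative, not the thesis implementation.

    #include <cstddef>
    #include <vector>

    // Illustrative ground-profile sample at one cross-section.
    struct Section {
        double station;    // position along the alignment
        double elevation;  // sampled terrain elevation
    };

    // Keep one section out of every `stride` consecutive sections:
    // stride = 2 gives skip2 (skip 1 of every 2), stride = 4 gives
    // skip4 (skip 3 of every 4). The last section is always kept so
    // the road endpoint stays fixed.
    std::vector<Section> thinProfile(const std::vector<Section>& full,
                                     std::size_t stride) {
        std::vector<Section> reduced;
        for (std::size_t i = 0; i < full.size(); ++i)
            if (i % stride == 0 || i + 1 == full.size())
                reduced.push_back(full[i]);
        return reduced;
    }

The reduced profile is then passed to the same vertical alignment solver, whose mixed-integer linear program shrinks with the number of sections.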
Surrogates 6 and 7: Early Termination (term2 and term4)

When the vertical alignment problem is being solved, the cost function is evaluated at different points in the feasible set and the solver tracks the incumbent solution. The improvement in solution quality usually slows down as time elapses, so the final iterations contribute relatively little to the overall solution quality. In these surrogates, the vertical alignment solver is terminated early when the number of iterations reaches a certain proportion of the expected total number of iterations. We run the original model 10 times before early termination is applied and use the average number of iterations as the expected total number of iterations. For Surrogates 6 and 7 we terminate the solver at 1/2 and 1/4 of the expected total number of iterations, respectively.

Surrogate 8: Delete Earth Movement (noEM)

The earth movement cost accounts for only a small portion of the total cost, but makes the vertical alignment problem significantly larger and more time-consuming. In this surrogate, variables and constraints related to earth movement are deleted in order to speed up the algorithm. This surrogate is implemented by deleting all the earth movement related variables, constraints and objective function terms in the original mixed-integer linear program. After deleting the earth movement cost, the costs of non-adjacent sections become independent, which allows for the use of the next two surrogates.

Surrogate 9: Dynamic Programming (DP)

This surrogate is built upon Surrogate 8. In Surrogate 8, the vertical alignment problem is still formulated as a mixed-integer linear program and the cost function is evaluated at different feasible points. However, by exploiting the cost independence between different sections, we can avoid some unnecessary evaluations by using the following dynamic programming formulation.

Assume there are S sections in total and in each section there are L levels. Let f̃_i(v_0, v_1), where 2 ≤ i ≤ S, be the cost of section i if the road elevations are at levels v_0 and v_1 in sections i − 1 and i. Since the start and the end of the road are fixed, only one level is needed for sections 1 and S. Thus, we make all of their levels, except for one fixed level (labelled 0), infeasible by setting

    \tilde{f}^*_1(v_1) = \tilde{f}_S(v_0, v_1) = +\infty \quad \text{for all } v_1 \neq 0. \qquad (4.3)

Let f̃*_i(v_1), where 1 ≤ i ≤ S, be the optimal total cost over all the sections 1 through i if the road elevation is at level v_1 in section i. Since no cost occurs in section 1, we set

    \tilde{f}^*_1(0) = 0. \qquad (4.4)

Then, for all 2 ≤ i ≤ S, the principle of optimality is:

    \tilde{f}^*_i(v_1) = \min_{1 \le v_0 \le L} \left[ \tilde{f}^*_{i-1}(v_0) + \tilde{f}_i(v_0, v_1) \right]. \qquad (4.5)

Because of (4.3), the optimal total cost is

    \tilde{f}^*_S(0) = \min_{1 \le v_0 \le L} \left[ \tilde{f}^*_{S-1}(v_0) + \tilde{f}_S(v_0, 0) \right]. \qquad (4.6)

The horizontal alignment problem only needs the minimum, not the minimizer, of the vertical alignment problem. Therefore, we simply need to find the minimum by iterating through equation (4.5), without maintaining the solution table and the value table of the standard implementation of dynamic programming.

The level is formulated as a continuous variable in the original model, but needs to be discretized in this surrogate for computational tractability. After discretization, the algorithm searches for the optimum in a discrete subset of X, so the optimality of the vertical alignment optimization problem is not guaranteed. However, it should approximate the true cost function well as a surrogate. The number of discrete levels is a parameter of this surrogate. More levels make the surrogate more accurate but slower, and vice versa, so there is a trade-off in the choice of the number of levels. For testing, we set it equal to the number of vertical sample points in the input data.
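To make the table-free recursion concrete, the following C++ sketch computes the minimum of (4.5)-(4.6) directly, keeping only a rolling array of values. It is a minimal sketch under assumptions not fixed by the thesis: levels are re-indexed 0 to L − 1 with level 0 the single feasible level of the fixed endpoints, and the per-section costs f̃_i(v_0, v_1) are supplied through a callback; none of the names below come from the thesis code.

    #include <algorithm>
    #include <functional>
    #include <limits>
    #include <utility>
    #include <vector>

    // Minimum total cost via the recursion (4.5)-(4.6).
    // sectionCost(i, v0, v1) plays the role of f~_i(v0, v1): the cost of
    // section i (2 <= i <= S) when sections i-1 and i sit at levels v0, v1.
    double dpMinimumCost(int S, int L,
                         const std::function<double(int, int, int)>& sectionCost) {
        const double INF = std::numeric_limits<double>::infinity();
        // f~*_1: only level 0 is feasible at the start, cf. (4.3)-(4.4).
        std::vector<double> prev(L, INF);
        prev[0] = 0.0;
        for (int i = 2; i < S; ++i) {             // interior sections, recursion (4.5)
            std::vector<double> cur(L, INF);
            for (int v1 = 0; v1 < L; ++v1)
                for (int v0 = 0; v0 < L; ++v0)
                    cur[v1] = std::min(cur[v1], prev[v0] + sectionCost(i, v0, v1));
            prev = std::move(cur);
        }
        // The last section must return to level 0, cf. (4.6).
        double best = INF;
        for (int v0 = 0; v0 < L; ++v0)
            best = std::min(best, prev[v0] + sectionCost(S, v0, 0));
        return best;
    }

The two nested loops over levels for each section are also where the O(SL²) complexity discussed below comes from.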
Surrogate 10: Greedy Algorithm (grd)

Similar to Surrogate 9, this surrogate is also built upon Surrogate 8. Unlike Surrogate 9, which guarantees optimality in a discrete subset of X, this surrogate only tries to achieve optimality between two adjacent sections. Another difference from Surrogate 9 is that it keeps the level as a continuous variable.

To explain this surrogate, we still use f̃ and f̃* to represent the cost of the current section and the total cost up to the current section, but define them differently. Specifically, let f̃_i(v_0, v_1), where 2 ≤ i ≤ S, be the cost of section i if the elevation of section i − 1 is v_0 and the elevation of section i is v_1. If v_1 is infeasible given v_0, then f̃_i(v_0, v_1) = +∞. Let f̃*_i be the total cost of sections 1 to i; then

    \tilde{f}^*_i = \begin{cases} 0 & \text{if } i = 1, \\ \tilde{f}^*_{i-1} + \min_{v_1} \tilde{f}_i(v_0^*(i-1), v_1) & \text{otherwise}, \end{cases} \qquad (4.7)

where

    v_0^*(i) = \operatorname{argmin}_{v_1} \tilde{f}_i(v_0^*(i-1), v_1) \quad \text{for } i \ge 2, \qquad (4.8)

and v_0^*(1) is the fixed starting elevation of the road, given as an input. We calculate the greedy total cost f̃*_S based on both the original terrain data and the reversed terrain data, which starts from the last section and moves backwards. Finally, the output returned by this surrogate is the average of the two greedy total costs, for better accuracy.

The unconstrained version of (4.8) is a single-variable convex problem [HHLR14] and its minimizer is the elevation of the terrain. The minimizer of the constrained version is either at the current terrain elevation or at one of the bounds of the feasible interval. Thus, there is a constant number (3) of objective function evaluations per section, and the time complexity of this algorithm is O(S), where S is the number of sections. By contrast, the time complexity of Surrogate 9 is O(SL²), where L is the number of levels, because that algorithm must compare the costs of every pair of levels in each section. Therefore, this surrogate should be much faster than Surrogate 9.

Although modeling the level as a continuous variable may gain some accuracy, this surrogate should still be less accurate than Surrogate 9, as it only tries to achieve optimality between two adjacent sections.

4.2 Prediction Band with Median

Definition 4.1. Given p ∈ [0, 1], the p prediction band for Algorithm α is the set

    \left\{ (y, \tau) \;\middle|\; G_\alpha^{-1}\!\left(\frac{1-p}{2}; \tau\right) \le y \le G_\alpha^{-1}\!\left(1 - \frac{1-p}{2}; \tau\right) \right\}.

Intuitively, the p prediction band is the central region where the incumbent solution is estimated to appear with probability p. The edges of the p prediction band are the quantiles with probabilities (1 − p)/2 and 1 − (1 − p)/2, so the band characterizes the range of the incumbent solution distribution to some extent. A straightforward way of comparison is to visualize the prediction band along with the median; a sketch of how the band edges are read off the bootstrap output is given at the end of this section.

In Figure 4.1, the 80% prediction band and the median are shown for two algorithms on test problem A. This method gives an intuitive comparison between two algorithms. It can be observed that =true underperforms org in the median, the 10th percentile and the 90th percentile for most of the time, except for extremely small τ values.

A limitation of this approach is that it can only effectively compare two algorithms, as more overlapping prediction bands would be difficult to tell apart.

[Figure 4.1: objective value (×10^5, ranging from 1.54 to 1.68) against τ ∈ [0, 4500] for org and =true on Problem A.]

Figure 4.1: The 80% prediction band with median for two algorithms

One way to overcome this limitation is to keep the setup for the baseline algorithm and show only one quantile for all the other algorithms. In Figure 4.2, the 80% prediction band with median of the baseline is compared with the medians of the other algorithms on test problem A. In terms of the median, it can be observed that only const outperforms the baseline; noEM and grd performed the worst, as they stay outside the prediction band; the rest of the algorithms had similar performance and all stay within the prediction band.

A limitation of this approach is that it can only show one quantile for each algorithm other than the baseline. Although it allows for simultaneous comparison of multiple algorithms, limited information can be visualized.

[Figure 4.2: objective value (×10^5, ranging from 1.54 to 1.68) against τ ∈ [0, 7000] for org, const, =true, preOff, skip2, skip4, term2, term4, noEM, DP and grd on Problem A.]

Figure 4.2: The 80% prediction band with median for ten algorithms
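At each fixed time τ, the band edges of Definition 4.1 are simply two empirical quantiles of the B bootstrapped incumbent values, so reading them off the estimator is straightforward. A minimal C++ sketch, assuming the bootstrap values at one grid time are collected in a vector (the helper names are illustrative, not the thesis code):

    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Empirical quantile of the bootstrap incumbent values at one fixed
    // time tau. The vector is reordered in place by nth_element.
    double empiricalQuantile(std::vector<double>& values, double prob) {
        std::size_t k = static_cast<std::size_t>(prob * (values.size() - 1));
        std::nth_element(values.begin(), values.begin() + k, values.end());
        return values[k];
    }

    // Edges of the p prediction band at one time point (Definition 4.1):
    // the (1-p)/2 and 1-(1-p)/2 empirical quantiles.
    std::pair<double, double> bandEdges(std::vector<double> values, double p) {
        double lo = empiricalQuantile(values, (1.0 - p) / 2.0);
        double hi = empiricalQuantile(values, 1.0 - (1.0 - p) / 2.0);
        return {lo, hi};
    }

Sweeping these two quantiles, together with the 0.5 quantile for the median, over a grid of τ values produces curves such as those in Figures 4.1 and 4.2.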
4.3 Integrated P-P Plot

Although the prediction band with median method can provide a straightforward comparison, it has a limited capability of comparing multiple algorithms simultaneously. The method introduced in this section overcomes this limitation by generalizing the P-P plot method in statistics, which is used to compare two probability distributions by plotting their CDFs against each other.

In our case, we want to compare G_α(y; τ) for different α over τ ∈ [0, τmax]. To generate graphs similar to the P-P plot, we need to eliminate the time dimension by integrating G_α^{-1}(p; τ) with respect to τ. Since G_α^{-1}(p; τ) is decreasing in τ, an unweighted integration would over-amplify the estimate at small τ. In addition, different users may prefer different emphases over [0, τmax], so an unweighted integration does not suffice. Let w(τ): [0, τmax] → ℝ⁺ be the weight the user imposes on G_α^{-1}(p; τ) at time τ, for all α. We integrate the quantiles as follows:

    G_\alpha^{-1}(p) = \int_0^{\tau_{\max}} w(\tau)\, G_\alpha^{-1}(p; \tau)\, d\tau. \qquad (4.9)

Although there are various possible candidates for w(τ), here we suggest

    w(\tau) = \frac{1}{G_0^{-1}(0.5; \tau)}. \qquad (4.10)

This formulation of w(τ) is designed to impose similar levels of emphasis over [0, τmax] by balancing the decrease of G_α^{-1}(p; τ) in τ. Since all the algorithms are solving the same problem, the shape of G_α^{-1}(p; τ) should be similar for all α, and formula (4.10) should give a reasonable weighting function not only for algorithm 0 but also for all the other algorithms. A quadrature sketch of this integration is given at the end of this section.

We define the integrated CDF as

    G_\alpha(y) = P\left(G_\alpha^{-1}(p) \le y\right). \qquad (4.11)

By choosing one of the algorithms, in our case algorithm 0, as the baseline, we can simultaneously show the performance of all algorithms against the baseline in an integrated P-P plot. For each alternative algorithm α and for each p of interest we plot

    G_0\left(G_\alpha^{-1}(p)\right), \qquad (4.12)

which indicates the probability for algorithm 0 to return an output as good as algorithm α's best 100p% output.

[Figure 4.3: baseline CDF (vertical axis, 0 to 1) against alternative CDF (horizontal axis, 0 to 1) for org, const, =true, preOff, skip2, skip4, term2, term4, noEM, DP and grd on Problem A.]

Figure 4.3: Integrated P-P plot

Considering an algorithm α ≠ 0 for a minimization problem, small values in the P-P plot are favorable to α, because they mean the baseline has a low chance of returning outputs as good as the output of α. The baseline's integrated CDF is also plotted against itself as a reference line, which is the diagonal. If the P-P plot of an algorithm stays entirely below the reference line, then its empirical solution distribution stochastically dominates that of the baseline algorithm, indicating its performance is better than the baseline, and vice versa.

Figure 4.3 is a demonstration of the integrated P-P plot with our test data for test problem A. It can be observed that const outperforms the baseline algorithm over most of the probability range, except for the best cases. All the other algorithms are dominated by the baseline algorithm.
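As a minimal illustration of (4.9)-(4.10), suppose the quantile curve G_α^{-1}(p; ·) and the baseline median curve G_0^{-1}(0.5; ·) have been tabulated on a uniform time grid; the weighted integral then reduces to a quadrature sum. The C++ sketch below works under that assumption and is not the thesis implementation.

    #include <cstddef>
    #include <vector>

    // Weighted time-integral of a quantile curve, cf. (4.9)-(4.10).
    // qAlpha[k] tabulates G_alpha^{-1}(p; tau_k) and qBaseMedian[k] tabulates
    // G_0^{-1}(0.5; tau_k) on a uniform grid tau_k = k * dt over [0, tau_max].
    double integratedQuantile(const std::vector<double>& qAlpha,
                              const std::vector<double>& qBaseMedian,
                              double dt) {
        double sum = 0.0;
        for (std::size_t k = 0; k < qAlpha.size(); ++k) {
            double w = 1.0 / qBaseMedian[k];  // w(tau) = 1 / G_0^{-1}(0.5; tau)
            sum += w * qAlpha[k] * dt;        // rectangle-rule quadrature
        }
        return sum;
    }

One point of the integrated P-P plot is then obtained by locating the integrated value of algorithm α within the baseline's own integrated quantile curve, that is, by numerically inverting p ↦ G_0^{-1}(p) at the value G_α^{-1}(p), for example by interpolation.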
4.4 Speed Quantification

The previous two comparison methods are both visualizations, and do not disclose how much speedup or slowdown an algorithm has against another. In this section, we introduce a method to quantify the relative speed of multiple algorithms. When we compare algorithm α against algorithm 0, the general approach is to find λ ∈ ℝ⁺ such that G_0^{-1}(p; λτ) best matches G_α^{-1}(p; τ). The value of λ can then be interpreted as the relative speed of algorithm α versus algorithm 0. Mathematically, λ is defined as a solution to the following least squares integration problem:

    \lambda = \operatorname{argmin}_{\lambda' \in \mathbb{R}^+} \left\{ \frac{\max\{\lambda', 1\}}{\tau_{\max}} \int_0^{\tau_{\max}/\max\{\lambda', 1\}} \left[ G_0^{-1}(p; \lambda'\tau) - G_\alpha^{-1}(p; \tau) \right]^2 d\tau \right\}, \qquad (4.13)

where p is the probability corresponding to the quantile of interest. The upper bound of the integral is designed to ensure that all the data stays within the reliable time interval as λ′ changes. Specifically, let m be the upper bound of the integral; then solving (4.13) requires the values of G_0^{-1}(p; τ) for 0 ≤ τ ≤ λ′m and of G_α^{-1}(p; τ) for 0 ≤ τ ≤ m. To ensure data reliability, we need

    \max\{\lambda' m, m\} \le \tau_{\max}. \qquad (4.14)

Solving this inequality, we have

    m = \frac{\tau_{\max}}{\max\{\lambda', 1\}}. \qquad (4.15)

The term max{λ′, 1}/τmax = 1/m before the integral normalizes the integral by its length; it is therefore not affected by λ′ directly, but only through the effect of scaling G_0^{-1}.

Since all the algorithms are used to solve the same problem, the incumbent solution paths should have similar shapes. As a result, this method should closely match the scaled incumbent solution process of algorithm 0 with the incumbent solution process of algorithm α.

When a group of algorithms is tested over multiple problems, the harmonic mean of λ_1, λ_2, ..., λ_n,

    \bar{\lambda} = \frac{n}{\sum_{i=1}^{n} \frac{1}{\lambda_i}}, \qquad (4.16)

is an appropriate way to calculate the average speedup. To see this, consider the case where there are n problems that require the same amount of computation c, and the speed of algorithm α_i is λ_i for all i = 1, 2, ..., n. The harmonic mean calculates the speed λ̄ of an algorithm α* that solves all the problems from the same problem set in the same amount of total time as α_1, α_2, ..., α_n do. In particular, equating the total times gives

    \frac{nc}{\bar{\lambda}} = \sum_{i=1}^{n} \frac{c}{\lambda_i}, \qquad (4.17)

which leads to formula (4.16) by solving for λ̄.

For demonstration, we set p = 0.5 for the calculation of λ, i.e., we match the median of each algorithm to that of the baseline algorithm. The integral in (4.13) is evaluated numerically by discretizing [0, τmax] into 1000 equal segments; a sketch of the discretized objective is given below, before the tables. Table 4.1 shows the speed quantifications for this setup. All the values are solved for using the fmincon function in MATLAB with initial solution 1. According to the average relative speed, only =true and preOff outperform the baseline algorithm.

For advanced use of this method, problem (4.13) can be generalized to

    \lambda = \operatorname{argmin}_{\lambda' \in \mathbb{R}^+} \sum_{i=1}^{Q} q_i\, \frac{\max\{\lambda', 1\}}{\tau_{\max}} \int_0^{\tau_{\max}/\max\{\lambda', 1\}} \left[ G_0^{-1}(p_i; \lambda'\tau) - G_\alpha^{-1}(p_i; \tau) \right]^2 d\tau, \qquad (4.18)

where Q is the total number of quantiles to focus on, p_1, p_2, ..., p_Q are their corresponding probabilities and q_1, q_2, ..., q_Q are their corresponding weights.

There are many ways to choose the quantiles, depending on the user's needs and on the reliability of the quantiles, which is discussed in Section 3.3. For example, assuming all the quantiles with 10% ≤ p ≤ 95% are reliable, then by choosing

    Q = 81, with p_i = (i + 9)% and q_i = 1/Q for all 1 ≤ i ≤ Q,

the user imposes equal emphasis on each reliable percentile. This weight distribution is more reasonable if the algorithm is to be used a large number of times in practice, because the overall outcome is then closely tied to the average performance by the law of large numbers. However, if the algorithm is used only a few times, robustness becomes more important. The setup

    Q = 1, p_1 = 0.95 and q_1 = 1

can be used when the user is only concerned about the 5% value-at-risk. Alternatively,

    Q = 2, p_1 = 0.5, p_2 = 0.95 and q_1 = q_2 = 0.5 \qquad (4.19)

can be used when the user takes both the median performance and robustness into consideration with equal emphasis.

For demonstration, we apply the advanced framework with the configuration in (4.19). Table 4.2 shows the speed quantifications for this setup. The discretization and optimization are conducted in the same way as for Table 4.1. The results in Table 4.2 are similar to those of Table 4.1, indicating there is no significant difference in speedup between the 50th and the 95th percentiles. Although only two of the algorithms outperform the baseline algorithm, all of them performed robustly when the 5% worst-case scenario is taken into consideration.
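The discretized objective of (4.13) and the harmonic-mean average (4.16) used in both tables are short computations. The C++ sketch below evaluates them for a single probability p; q0 and qa stand in for interpolants of the tabulated quantile curves G_0^{-1}(p; ·) and G_α^{-1}(p; ·), and the one-dimensional minimization over λ (done with MATLAB's fmincon in our experiments) is left to any scalar optimizer. The names are illustrative, not the thesis code.

    #include <algorithm>
    #include <functional>
    #include <vector>

    // Objective of (4.13) for one probability p, with the integral over
    // [0, tau_max / max(lambda, 1)] discretized into nSeg equal segments.
    double speedObjective(double lambda, double tauMax, int nSeg,
                          const std::function<double(double)>& q0,
                          const std::function<double(double)>& qa) {
        double m = tauMax / std::max(lambda, 1.0);  // upper bound, cf. (4.15)
        double dt = m / nSeg;
        double sum = 0.0;
        for (int k = 0; k < nSeg; ++k) {
            double tau = (k + 0.5) * dt;            // midpoint rule
            double diff = q0(lambda * tau) - qa(tau);
            sum += diff * diff * dt;
        }
        return sum / m;                             // the 1/m normalization
    }

    // Harmonic mean (4.16) of per-problem speeds; used for the Average column.
    double harmonicMean(const std::vector<double>& lambdas) {
        double s = 0.0;
        for (double l : lambdas) s += 1.0 / l;
        return static_cast<double>(lambdas.size()) / s;
    }

Evaluating speedObjective over a grid of λ values is also the kind of scan that underlies Figure 5.1 in Chapter 5.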
Table 4.1: Speed quantification by the 50th percentile

    Algorithm   Average      A      B      C      D      E      F      G      H      I      J
    const          0.65   1.04   3.52   0.52   0.45   0.29   2.02   0.48   0.56   0.53   3.10
    =true          1.43   0.63   5.33   2.35   1.96   1.28  13.61   0.46   2.80   1.38   5.93
    preOff         1.30   0.50   4.73   1.50   1.34   1.00  12.43   1.08   0.88   1.37   6.44
    skip2          0.85   0.62   2.28   2.26   0.26   4.93  10.92   0.27   2.69   1.47   3.67
    skip4          0.69   0.51   1.76   0.36   0.31   1.32  10.67   0.30   1.70   1.07   4.96
    term2          0.52   0.49   2.16   0.23   0.60   0.19   7.98   0.31   3.95   0.89   1.75
    term4          0.45   0.54   0.98   0.20   0.28   0.19   8.10   0.30   1.97   0.91   4.06
    noEM           0.91   0.33   6.96   2.60   0.31   1.59  11.67   0.42   3.89   1.62   5.11
    DP             0.60   0.58   3.05   0.34   0.56   0.18   4.35   0.51   6.84   0.60   3.07
    grd            0.73   0.34   4.46   0.52   0.47   0.39   8.47   0.57   1.68   0.90   2.51

(Columns A through J give the relative speed on each test problem; the Average column is the harmonic mean (4.16) across the ten problems.)

Table 4.2: Advanced speed quantification by the 50th and 95th percentiles

    Algorithm   Average      A      B      C      D      E      F      G      H      I      J
    const          0.62   1.15   3.68   0.52   0.41   0.28   2.01   0.44   0.54   0.51   3.10
    =true          1.42   0.61   5.41   2.33   1.98   1.31  14.84   0.45   2.76   1.37   5.93
    preOff         1.33   0.50   5.22   1.70   1.33   1.00  12.53   1.09   0.95   1.29   6.72
    skip2          0.80   0.61   2.44   2.37   0.23   4.79  10.74   0.25   2.79   1.47   3.63
    skip4          0.65   0.50   1.84   0.31   0.30   1.32  11.13   0.28   1.70   1.05   5.06
    term2          0.50   0.56   2.22   0.21   0.59   0.18   8.57   0.29   3.83   0.87   1.73
    term4          0.43   0.55   0.96   0.18   0.25   0.19   8.18   0.28   1.95   0.90   4.14
    noEM           0.88   0.33   6.86   2.57   0.29   1.69  12.73   0.40   3.81   1.60   5.21
    DP             0.57   0.59   2.87   0.33   0.51   0.17   4.37   0.49   6.90   0.56   3.14
    grd            0.69   0.35   4.62   0.51   0.41   0.38   8.98   0.51   1.74   0.82   2.52

Chapter 5

Conclusion

5.1 Contribution

In this thesis, a methodology is proposed to compare the performance of stochastic optimization algorithms through bootstrapping of test data. The incumbent solution of a problem over multiple restarts is modeled as a stochastic process, and a bootstrap estimator of the incumbent solution process given a limited amount of test data is designed.

Then, various properties of the estimator are analyzed. From the theoretical aspect, the estimator is shown to be asymptotically consistent, and under some assumptions it is shown to have a non-positive bias. From the numerical aspect, an out-of-sample test is conducted to demonstrate the reliability of the estimator as different parameters are varied. We find that although the estimator tends to underestimate the probability distribution function of the incumbent solution process, it can provide reliable predictions when the time interval and the quantiles to focus on are properly chosen.

Finally, three methods to compare the performance of different stochastic optimization algorithms are proposed and demonstrated with test data from a road design optimization problem. The methods strike different balances between showing more detail on a single algorithm and simultaneously comparing more algorithms. The results are consistent across all methods. Ten surrogate cost functions for the road design optimization problem are designed and tested. Although two of the algorithms slightly outperform the baseline, none of them is considered a significantly better algorithm. A software package implementing the comparison methods is available online (https://www.dropbox.com/sh/sq4xhz51qrkibwe/AACCtNF3HT9A-hZ_8LQlWXE0a?dl=0).

5.2 Future Work

Several directions for future work present themselves.

First, the estimator might be improved through bias correction. Since the incumbent solution is the minimum of a collection of solutions, its distribution depends heavily on the left tail of the individual solutions' distribution. However, the empirical CDF has a limited ability to estimate the tails of the true distribution. The bias could be suitably corrected by better estimating the left tail of the individual solutions' distribution.
Smoothing techniques [Sim96] or the parametric bootstrap method [Che07] could be explored for bias correction.

Second, it is important to investigate whether Assumption 2.1 can be relaxed. According to Andersen et al. [ABGK93], the bootstrap estimator for the counting process is unbiased for the first and second order moments, but it remains to verify whether it is also unbiased for the whole distribution.

In addition, the uniqueness of the speed quantification λ as defined in equation (4.13) needs to be investigated. Figure 5.1 shows the normalized objective function values of the least squares integration problem in equation (4.13) for λ ∈ [0.01, 100]. For easy comparison, the function values are shifted and scaled to span [0, 1]. According to Figure 5.1, it appears that the least squares integration problem is convex in λ near λ = 1, so the global minimizer can be found by searching from λ = 1. A rigorous proof is still required to formally address the issue.

[Figure 5.1: nine panels, one for each of Problems A through I, plotting the shifted and scaled objective value (spanning [0, 1]) against ln(λ) ∈ [−5, 5].]

Figure 5.1: Uniqueness of λ

Another direction in which this thesis could be extended is to analyze the solution time distribution over different objective function values. Since both the speed and the solution quality of algorithms are analyzed, there are two variables: time and function value. By defining the incumbent solution process, we treat time as the independent variable and function value as the dependent variable. It would be interesting to explore the alternative setting, in which the distribution of the time for an algorithm to reach different values of the objective function is studied. However, a challenge is data inconsistency across different values of the objective function. For the methodology in this thesis, any finite time horizon can be covered by a bootstrap path, since time accumulates as multiple restarts are drawn from the sample. However, certain ranges of function values may not be covered by all the algorithms being compared, as some of them never reach certain function values.

Researchers could also explore how the resampling method might be generalized to the case of non-independent restarts. One possible case is random sampling without replacement. By using the bootstrap, i.e., resampling with replacement, we implicitly assumed the user also applies multiple restarts with replacement. However, once a particular restart has been employed, rerunning it will not increase the solution quality but will decrease the overall speed of the optimization process. Therefore, it would be more reasonable to assume the user resamples without replacement and to model the incumbent solution process accordingly; a sketch contrasting the two sampling schemes is given below. Another group of cases arises when adaptive sampling methods, such as those in scatter search [Glo98] or genetic algorithms, are applied across multiple restarts. The incumbent solution process would need to be modified to model the incumbent solution when such adaptive sampling methods are used.
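For illustration, the two sampling schemes differ only in how restart indices are drawn from the N recorded runs. A minimal C++ sketch of the contrast (names illustrative, not the thesis code):

    #include <algorithm>
    #include <random>
    #include <vector>

    // Resampling with replacement, as the bootstrap estimator assumes:
    // the same recorded restart may be drawn more than once.
    std::vector<int> drawWithReplacement(int N, std::mt19937& rng) {
        std::uniform_int_distribution<int> pick(0, N - 1);
        std::vector<int> idx(N);
        for (int& i : idx) i = pick(rng);
        return idx;
    }

    // Sampling without replacement: each recorded restart is used at most
    // once, matching a user who never reruns the same restart.
    std::vector<int> drawWithoutReplacement(int N, std::mt19937& rng) {
        std::vector<int> idx(N);
        for (int i = 0; i < N; ++i) idx[i] = i;
        std::shuffle(idx.begin(), idx.end(), rng);
        return idx;
    }

Modeling the incumbent solution process under the second scheme is the open question raised above.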
Moreover, more work can be done to study the use of surrogate cost functions in the road design optimization problem. Based on the test results, none of the surrogates significantly outperforms the original model. The absence of improvement may be due to the way NOMAD uses the surrogate cost functions and the way the surrogates are designed. NOMAD assumes the surrogate cost function is close to free to evaluate, but not necessarily accurate. As a result, the surrogate cost function is evaluated at numerous points before the true cost function is evaluated. The surrogates designed in this thesis may be too expensive for NOMAD to utilize effectively. One way to overcome this issue is to design surrogates that cooperate better with NOMAD's surrogate usage approach. A more promising way is to develop a new framework for black-box optimization algorithms to use the type of surrogates designed in this thesis. A good starting point can be found in [TY11].

Finally, the framework proposed in this thesis might be extended to analyze the speedup from parallelization of stochastic optimization algorithms. Emphasis should be put on the relationship between solution quality, speed and the number of processing units.

Bibliography

[AAC+] M. A. Abramson, C. Audet, G. Couture, J. E. Dennis, Jr., S. Le Digabel, and C. Tribes. The NOMAD project. Software available at https://www.gerad.ca/nomad/.

[ABGK93] P. K. Andersen, O. Borgan, R. D. Gill, and N. Keiding. Statistical Models Based on Counting Processes. Springer-Verlag, Berlin; New York, 1993.

[ABS05] A. Akay, K. Boston, and J. Sessions. The evolution of computer-aided road design systems. International Journal of Forest Engineering, 16(2), 2005.

[AD06] C. Audet and J. E. Dennis, Jr. Mesh adaptive direct search algorithms for constrained optimization. SIAM Journal on Optimization, 17(1):188–217, 2006.

[ALT09] C. Audet, S. Le Digabel, and C. Tribes. NOMAD user guide. Technical Report G-2009-37, Les cahiers du GERAD, 2009.

[Bab79] L. Babai. Monte-Carlo algorithms in graph isomorphism testing. Technical Report D.M.S. No. 79-10, Université de Montréal, 1979.

[BDF+99] A. J. Booker, J. E. Dennis, P. D. Frank, D. B. Serafini, V. Torczon, and M. W. Trosset. A rigorous framework for optimization of expensive functions by surrogates. Structural Optimization, 17(1):1–13, 1999.

[BF81] P. J. Bickel and D. A. Freedman. Some asymptotic theory for the bootstrap. The Annals of Statistics, 9(6):1196–1217, 1981.

[BHHR14] A. Butler, R. D. Haynes, T. D. Humphries, and P. Ranjan. Efficient optimization of the likelihood function in Gaussian process modelling. Computational Statistics & Data Analysis, 73:40–52, 2014.

[Can08] Transport Canada. Transportation in Canada: an overview 2008. Technical Report 2009-07612, Minister of Public Works and Government Services, Canada, 2008.

[Che07] M. R. Chernick. Bootstrap Methods: A Guide for Practitioners and Researchers, Second Edition. John Wiley & Sons, Inc., 2007.

[CSV09] A. Conn, K. Scheinberg, and L. Vicente. Introduction to Derivative-Free Optimization. Society for Industrial and Applied Mathematics, 2009.

[CW12] J. Currie and D. I. Wilson. OPTI: Lowering the barrier between open source optimizers and the industrial MATLAB user. In N. Sahinidis and J. Pinto, editors, Foundations of Computer-Aided Process Operations, Savannah, Georgia, USA, 8–11 January 2012.

[DM02] E. D. Dolan and J. J. Moré. Benchmarking optimization software with performance profiles. Mathematical Programming, 91(2):201–213, 2002.
[DN05] H. A. David and H. N. Nagaraja. Order Statistics, Third Edition. John Wiley & Sons, Inc., 2005.

[Don10] D. Donaldson. Railroads of the Raj: Estimating the impact of transportation infrastructure. Working Paper 16487, National Bureau of Economic Research, October 2010.

[ET93] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Number 57 in Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, Boca Raton, Florida, USA, 1993.

[FRK+08] K. R. Fowler, J. P. Reese, C. E. Kees, J. E. Dennis, Jr., C. T. Kelley, C. T. Miller, C. Audet, A. J. Booker, G. Couture, R. W. Darwin, M. W. Farthing, D. E. Finkel, J. M. Gablonsky, G. Gray, and T. G. Kolda. Comparison of derivative-free optimization methods for groundwater supply and hydraulic capture community problems. Advances in Water Resources, 31(5):743–757, 2008.

[FRS94] T. A. Feo, M. G. C. Resende, and S. H. Smith. A greedy randomized adaptive search procedure for maximum independent set. Operations Research, 42(5):860–878, 1994.

[Glo98] F. Glover. A template for scatter search and path relinking, pages 1–51. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.

[Gri78] V. A. Grishagin. Operating characteristics of some global search algorithms. Problems of Stochastic Search, 7, 1978.

[HEAEH98] Y. Hassan, S. Easa, and A. Abd El Halim. Highway alignment: Three-dimensional problem and three-dimensional solution. Transportation Research Record: Journal of the Transportation Research Board, 1612:17–25, 1998.

[HHLR14] W. Hare, S. Hossain, Y. Lucet, and F. Rahman. Models and strategies for efficiently determining an optimal vertical alignment of roads. Computers & Operations Research, 44:161–173, 2014.

[HJK96] C. R. Houck, J. A. Joines, and M. G. Kay. Comparison of genetic algorithms, random restart and two-opt switching for solving large location-allocation problems. Computers & Operations Research, 23(6):587–596, 1996.

[HS98] H. H. Hoos and T. Stützle. Evaluating Las Vegas algorithms: Pitfalls and remedies. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, UAI'98, pages 238–245, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.

[JP03] J. Jacod and P. E. Protter. Probability Essentials. Springer, 2nd edition, 2003.

[Kan08] M. W. Kang. An alignment optimization model for a simple highway network, 2008.

[KG09] L. Kocsis and A. György. Efficient Multi-start Strategies for Local Search Algorithms, pages 705–720. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.

[KGB13] S. R. Kuindersma, R. A. Grupen, and A. G. Barto. Variable risk control via stochastic optimization. The International Journal of Robotics Research, 32(7):806–825, 2013.

[Kie53] J. Kiefer. Sequential minimax search for a maximum. Proceedings of the American Mathematical Society, 4(3):502–506, 1953.

[MLH15] S. Mondal, Y. Lucet, and W. Hare. Optimizing horizontal alignment of roads in a specified corridor. Computers & Operations Research, 64:130–138, 2015.

[Mon14] S. Mondal. Horizontal alignment optimization in road design. Master's thesis, University of British Columbia, July 2014.

[MW09] J. Moré and S. Wild. Benchmarking derivative-free optimization algorithms. SIAM Journal on Optimization, 20(1):172–191, 2009.
[Ort12] G. A. Ortiz. Evolution strategies. http://www.mathworks.com/matlabcentral/fileexchange/35801-evolution-strategies–es-, May 2012 [cited Nov. 2012].

[RCS06] J. Rodrigue, C. Comtois, and B. Slack. The Geography of Transport Systems. Routledge, 2006.

[Sav94] M. W. P. Savelsbergh. Preprocessing and probing techniques for mixed integer programming problems. ORSA Journal on Computing, 6(4):445–454, 1994.

[Sim96] J. S. Simonoff. Smoothing Methods in Statistics. Springer, New York, 1st edition, 1996.

[SKM16] Y. D. Sergeyev, D. E. Kvasov, and M. S. Mukhametzhanov. Operational zones for comparing metaheuristic and deterministic one-dimensional global optimization algorithms. Mathematics and Computers in Simulation, 2016.

[SMG13] S. U. Stich, C. L. Müller, and B. Gärtner. Optimization of convex functions with random pursuit. SIAM Journal on Optimization, 23(2):1284–1309, 2013.

[Spa03] J. C. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Wiley-Interscience, 2003.

[TRC12] C. Truchet, F. Richoux, and P. Codognet. Prediction of parallel speed-ups for Las Vegas algorithms. CoRR, abs/1212.4287, 2012.

[TY11] J. Takaki and N. Yamashita. A derivative-free trust-region algorithm for unconstrained optimization with controlled error. Numerical Algebra, Control and Optimization, 1(1):117–145, 2011.

[VDHL17] K. K. Vu, C. D'Ambrosio, Y. Hamadi, and L. Liberti. Surrogate-based methods for black-box optimization. International Transactions in Operational Research, 24(3):393–424, 2017.

[VWD16] P. Vasant, G. Weber, and V. Dieu. Handbook of Research on Modern Optimization Algorithms and Applications in Engineering and Economics, 2016.
