A rubric for algorithm selection inoptimization of black-box functionsbyStefan SremacB.Ed., Canadian University College, 2008A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinTHE COLLEGE OF GRADUATE STUDIES(Mathematics)THE UNIVERSITY OF BRITISH COLUMBIA(Okanagan)June 2015c© Stefan Sremac, 2015AbstractWhen optimizing black-box functions, little information is available toassist the user in selecting an optimization approach. It is assumed thatprior to optimization, the input dimension d of the objective function, theaverage running time tf of the objective function and the total time T al-lotted to solve the problem, are known. The intent of this research is toexplore the relationship between the variables d, tf , and T and the perfor-mance of five optimization algorithms: Genetic Algorithm, Nelder-Mead,NOMAD, Efficient Global Optimization, and Knowledge Gradient for Con-tinuous Parameters. The performance of the algorithms is measured over aset of functions with varying dimensions, function call budgets, and startingpoints. Then a rubric is developed to assist the optimizer in selecting themost appropriate algorithm for a given optimization scenario.Based on the information available prior to optimization, the rubric es-timates the number of function calls available to each algorithm and theamount of improvement each algorithm can make on the objective functionunder the function call constraint. The rubric reveals that Bayesian GlobalOptimization algorithms require substantially more time than the competingalgorithms and are therefore limited to fewer function call budgets. How-ever, if the objective function requires a large running time, this differencebecomes negligible. With respect to improvement, the rubric suggests thatDerivative Free Optimization algorithms are preferred at lower dimensionsand higher budgets, while Bayesian Global Optimization algorithms are ex-pected to perform better at higher dimensions and lower budgets. A test ofthe claims of the rubric reveals that the estimate of function call budget isaccurate and reliable, but the improvement is not estimated accurately. Thetest data demonstrates large variability for the measure of improvement. Itappears that the variables d, tf , and T are insufficient for describing theexpected performance of the assessed algorithms, since variables such asfunction type and starting point are unaccounted for.iiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . viiiDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixChapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . 11.1 Motivation for Algorithm Comparison . . . . . . . . . . . . . 21.2 Definition of the Problem . . . . . . . . . . . . . . . . . . . . 31.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . 4Chapter 2: Gaussian Process Model . . . . . . . . . . . . . . . 5Chapter 3: Algorithms . . . . . . . . . . . . . . . . . . . . . . . 93.1 Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 103.2 Nelder-Mead . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.3 Pattern Search . . . . . . . . . . . . . . . . . . . . . . . . . . 123.4 Efficient Global Optimization . . . . . . . . . . . . . . . . . . 143.5 Knowledge Gradient for Continuous Parameters . . . . . . . . 15Chapter 4: Developing a Rubric . . . . . . . . . . . . . . . . . . 184.1 Theoretical framework . . . . . . . . . . . . . . . . . . . . . . 184.2 Estimating tA and InA . . . . . . . . . . . . . . . . . . . . . . 224.2.1 Time functions . . . . . . . . . . . . . . . . . . . . . . 234.2.2 Improvement Functions . . . . . . . . . . . . . . . . . 32iiiTABLE OF CONTENTSChapter 5: Using the Rubric . . . . . . . . . . . . . . . . . . . . 405.1 The Adjusted Improvement Criterion . . . . . . . . . . . . . . 405.2 Case Study: Small Time Budgets . . . . . . . . . . . . . . . . 435.3 Case Study: Large Time Budgets . . . . . . . . . . . . . . . . 455.4 Analysis of Rubric . . . . . . . . . . . . . . . . . . . . . . . . 48Chapter 6: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 51Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54ivList of TablesTable 4.1 A list of the test functions used. The second columnindicates the dimensions for which each function wasused and the third column provides a brief descriptionof relative function characteristics. . . . . . . . . . . . 24Table 4.2 Summary of linear regression results for the time func-tion of the EGO algorithm. . . . . . . . . . . . . . . . 27Table 4.3 Summary of linear regression results for the time func-tion of the EGO algorithm after removal of the inter-action term. . . . . . . . . . . . . . . . . . . . . . . . . 27Table 4.4 Estimated tA functions for each of the five algorithms. 29Table 4.5 The NE value for each tˆA as well as the sample vari-ance, average variance over all (d, n) pairs, and theaverage variance over all (d, n, f) triples. . . . . . . . . 31Table 4.6 Linear regression results for the improvement of theNM algorithm. . . . . . . . . . . . . . . . . . . . . . . 34Table 4.7 Estimated Ia functions for each of the five algorithms. 35Table 4.8 The NE value for each IˆA as well as the sample vari-ance, average variance over all (d, n) pairs, and theaverage variance over all (d, n, f) triples. . . . . . . . . 36Table 5.1 Results of the rubric for a small time budget of 1 hourand varying tf and d values. The best performingalgorithm for each tf , d pair can be seen in bold. . . . 43Table 5.2 Results of the rubric for a large time budget of 1 weekand varying tf and d values. The best performingalgorithm for each tf , d pair can be seen in bold. . . . 46vList of FiguresFigure 4.1 Plot of time used by the EGO algorithm as a functionof dimension and function call budget (left). Plot oftime used by the EGO algorithm after a log transfor-mation (right). . . . . . . . . . . . . . . . . . . . . . . 25Figure 4.2 Residual plot (left) and Q-Q plot (right) from thelinear regression of EGO time. . . . . . . . . . . . . . 26Figure 4.3 Plot of the log transformed time data and the regres-sion surface for the EGO algorithm. . . . . . . . . . . 28Figure 4.4 Plots of the regression functions for the algorithmtime of GA (top left), NM (top right), sNOMADr(middle left), EGO (middle right), and KGCP (bot-tom). . . . . . . . . . . . . . . . . . . . . . . . . . . . 30Figure 4.5 Plot of improvement as a function of d and n for theNM algorithm. . . . . . . . . . . . . . . . . . . . . . . 33Figure 4.6 Residual plot (left) and Q-Q plot (right) from thelinear regression of NM improvement. . . . . . . . . . 34Figure 4.7 Plot of improvement data and regression surface forthe NM algorithm. . . . . . . . . . . . . . . . . . . . . 35Figure 4.8 Plot of the regression surface for the improvement ofGA (top left), NM (top right), sNOMADr (middleleft), EGO (middle right), and KGCP (bottom). . . . 37Figure 5.1 Box plots of the percent difference between the timeused by each algorithm and the time budget of 3600s. 44Figure 5.2 Box plots of the measured improvement gained byeach algorithm and the predicted improvement shownin blue. . . . . . . . . . . . . . . . . . . . . . . . . . . 45Figure 5.3 Box plots of the percent difference between the timeused by each algorithm and the time budget of 604800s. 47viLIST OF FIGURESFigure 5.4 Box plots of the measured improvement gained byeach algorithm and the predicted improvement shownin blue. . . . . . . . . . . . . . . . . . . . . . . . . . . 47Figure 5.5 Box plot of the improvement gained by each of thealgorithms for the dataset described in Chapter 4.Only data where n ≤ 10d are considered. . . . . . . . 48Figure 5.6 Box plot of the improvement gained by each of thealgorithms on real functions where d ≤ 10 and n ≤ 10d. 49Figure 5.7 Box plot of the improvement gained by each of thealgorithms on real functions where d ≥ 10 and n ≤ 10d. 50Figure 6.1 Box plot of the improvement gained by the KGCPalgorithm for d = 10 and n = 200. . . . . . . . . . . . 52viiAcknowledgementsI would like to acknowledge the support of my loving, faithful, and pa-tient wife, Jodi, who has willingly chosen to endure the life of a student’swife. You deserve no small credit for my completion of this program.My journey toward this milestone would not have been possible withoutthe support and faith of my family: Mom and Dad, Lidja and Nick, Anaand Josh. Your unbridled support of my educational pursuits has made thisprogram significantly more enjoyable.The time I have spent in Kelowna has been overwhelmingly positive dueto some excellent friends and a loving church family.I want to express my gratitude to the amazing and caring faculty andstaff of the Mathematics and Statistics department at UBC’s Okanagancampus.I am thankful for the fellow graduate students and post-doctoral fellowswith whom I shared my time at this university. I have enjoyed learning withand from you.I appreciate Dr. Yves Lucet and Dr. Abbas Milani for taking the timeto be a part of my thesis committee.For the completion of this thesis, I am greatly indebted to the supervisionof Dr. Warren Hare and Dr. Jason Loeppky, whose guidance and input hasbeen extremely valuable. I appreciate the time you have invested in me andthe respect with which I have been treated.The research presented in this thesis would not have been possible with-out the generous funding received from the University of British Columbia,in the form of the Graduate Entrance Scholarship and the University Grad-uate Fellowship, and the funding received from the Natural Sciences andEngineering Research Council of Canada (NSERC) in the form of the Dis-covery Grants of Dr. Hare and Dr. Loeppky.Finally, I am thankful to God for the fulfilling life that I have been given,for the joys I experience each day, and for His unfailing love to me.viiiDedicationTo my parents, Sima and Brankica, and to my wife, Jodi. You havesacrificed the pursuit of your own interests for the sake of mine.ixChapter 1IntroductionThe last century has seen an incredible increase in computing power.Consider that in the early years of the 20th century, the challenge for inven-tors in computing was to create a compact machine to perform addition, sub-traction, multiplication, and division with ease and timeliness, [CKAEY14].Certainly, it has been a long time since such a machine was at the forefront ofcomputational innovation. In addition to the increase in computing power,innovations of the last decades have made this power commonly availablethrough personal computers and smart phones. One of the results of suchan environment is that computers have been used to create simulations ofprocesses that are quite complex in real life. In the first chapter of [SWN03]several examples of computer simulators are presented. Among them arecomputer simulators that model the evolution of fires in closed spaces, thedesign of prosthesis devices, and the design of helicopter blades. In manyinstances the computer simulation of an event is simpler, faster, and cheaperto perform than it is to observe the actual event. The intensive computa-tions required by a simulator, which were unfathomable only decades ago,are at present the best choice in certain applications.One application of computer simulations is for performing experimentswhich would be challenging or impossible to perform in reality. Considertesting the spread-rate of an infectious disease in a human population,[BBC+02]. For obvious reasons this test would be unethical to performwithout the use of a computer simulation. Such simulations will be referredto as computer experiments. While we may express computer experimentsas functions, the code that comprises them can not usually be expressedanalytically and thus very little is known about the nature of the function.For this reason, computer experiments are referred to as black-box functions.In some computer experiment applications finding the optimal outputvalue and the corresponding input values is the main objective. Consider acomputer experiment that simulates a car crash, such as the one described in[JSW98]. The car design that performs the best in the computer experimentis the one that should be tested in a real experiment and produced if theresults remain positive. The following subsections present the motivation11.1. Motivation for Algorithm Comparisonbehind this work and a formal definition of the problem.1.1 Motivation for Algorithm ComparisonBlack-box functions present two challenges to optimizers. The first isthat common optimization methods using gradients, convexity, linearity andother properties of the objective function do not apply to a black-box func-tion since it lacks an analytic expression. The second challenge is regardingthe time required to evaluate the black-box function. Some computer sim-ulation functions may require significant computational time, such as FordMotor Company’s crash simulation which can take up to 106 hours to run,[WS07]. There are, however, black-box functions which are far less expen-sive in terms of time. In instances where the black-box is very expensivean algorithm that is near the optimal value in as few function evaluationsas possible is desirable, while computationally cheap functions do not im-pose such a requirement. This demonstrates that the computational timeof a function should be taken into account when choosing an optimizationalgorithm in the black-box context.The motivation behind this thesis is in part due to claims made by[JSW98] and [SFP11] regarding Bayesian Global Optimization (BGO). In[JSW98] it is said that the method “often requires the fewest function eval-uations of all competing methods” and in [SFP11] the authors look for analgorithm that can “give satisfactory results with as few function evaluationsas possible.” The downside however is that the algorithm will “spend extratime deciding where [it] would like to evaluate the function next,” [SFP11].In addition, both of the works intend to minimize an expensive black-boxfunction, so that the time taken by the algorithm itself, even if very long,will be small relative to obtaining the function values. Naturally, severalquestions arise.− Are BGO algorithms really the best choice when the black-box func-tion is very expensive?− How expensive does a black-box function have to be in order for BGOalgorithms to be the preferred choice?− If the black-box function is not very expensive, is another family ofalgorithms a better choice?− Does the input dimension of the function have an effect on choice ofalgorithm?21.2. Definition of the Problem− Can a rubric be developed to select the best performing algorithmbased on function evaluation time and input dimension?To answer the above questions, the performance of several competing al-gorithms, including BGO algorithms, will need to be compared. This thesisfocuses on algorithms from three different families. The first family of al-gorithms used for optimizing black-box functions are heuristic methods. Bynature this family of optimizers does not have a strong theoretical frame-work and it may not even be guaranteed to converge. However, heuristicmethods are, in general, simple to understand and implement, and are prac-tically motivated. The second family of algorithms is the Derivative FreeOptimization (DFO) family. These algorithms tend to use function valuesto explore the geometry of the objective function and move towards the op-timal value. The third family of algorithms used for black-box optimizationis the BGO family. These algorithms use statistical tools to create a modelof the true function which is cheaper to compute. New points to explore,are then chosen based on this model function. Chapter 3 provides a deeperanalysis of each of the algorithms as well as the implementations used.1.2 Definition of the ProblemTo formally define the optimization problem let f : Rd → R be a con-tinuous black-box function with input variable x ∈ X ⊂ Rd, where X is theinput domain. In this thesis, the domain is the unit hypercube, X ≡ [0, 1]d.The black-box function is assumed to be deterministic, that is, for any x¯ ∈ Xthere exists a y¯ ∈ R such that f(x¯) = y¯ each time x¯ is evaluated. Determin-istic functions essentially have no measurement error. It is assumed that theuser aims to minimize the black-box function. Mathematically, the problemis written asminimize f(x),subject to x ∈ X.(1.1)Let X∗ denote the argmin of f , that is,X∗ = arg minx∈Xf(x) := {x ∈ X : f(x) ≤ f(y) for all y ∈ X}.The optimization domain X is a compact set in Rd and f is continuousfunction mapping X to R thus the generalized Extreme Value Theorem(refer to Chapter 5 of [DD02]) guarantees that X∗ is non-empty. Let f∗31.3. Thesis Outlinedenote the minimum value of f over the domain X, that is,f∗ = minx∈Xf(x) := {f(x∗) : x∗ ∈ X∗}.1.3 Thesis OutlineThis thesis is laid out as follows. In Chapter 2 some background infor-mation is provided regarding Gaussian Process models which are used bytwo of the algorithms considered in the experiment. Chapter 3 is a presen-tation of the algorithms used in this thesis. In Chapter 4 the developmentof the rubric is documented. Section 4.1 describes the framework of therubric and Section 4.2 details the estimation of some functions crucial tothe rubric. Chapter 5 contains a demonstration of the rubric, an analysisof implications arising from its results, and a test of the reliability of therubric. Finally in Chapter 6 some conclusions are drawn and future researchdirections discussed.4Chapter 2Gaussian Process ModelTwo of the algorithms used in this thesis adopt a model based approach,where the objective function f is replaced by an estimate of the function(often referred to as a surrogate or model.) One method for creating a modelof the true function is to use a Gaussian process. What follows is a briefdiscourse on model building with the Gaussian process. To be consistentwith notation commonly used with Gaussian processes, y will denote theobjective function instead of f .Assume that the objective function y has been evaluated N times at thepoints x1, x2, . . . , xN . This initial sample of N points should be sufficientlylarge and provide a good coverage of the domain so that a good model can bedeveloped. In [LSW09] it is argued that 10d is a sufficient size for an initialsample. Now the task is to predict the function y based on this sample ofN points. The next step in building a model of y requires some knowledgeof a stochastic process.Informally, a stochastic process Y = {Yt : t ∈ T} is a family of indepen-dent random variables specified in some domain T . When T is a singletona stochastic process is just a random variable and when T is a set of finitelymany elements the stochastic process is a vector of random variables. How-ever, when T is a set of infinitely many elements, the stochastic process is arandom function. A stochastic process can therefore be viewed as an exten-sion of the concept of random variables to functions. A Gaussian process,then, is a stochastic process where each random variable has a Gaussian dis-tribution. For a detailed presentation of GPs refer to Chapter 2 in [SWN03].The Gaussian assumption on the stochastic process is made for the ease ofworking with Gaussian distributions.Adopting a Bayesian approach, prior knowledge of y is specified by theGP, Y = {Yx : x ∈ X}. Then the random variables Yx1 , Yx2 , . . . , YxNassociated with the N sampled points have a multi-variate normal distri-bution with mean vector µYN and variance-covariance matrix ΣYNYN . LetyN = (y(x1), y(x2), . . . , y(xN ))T be the vector of N observed function val-ues. Then the posterior distribution of any Yx ∈ Y conditioned on the5Chapter 2. Gaussian Process Modelobserved values yN is normal and specified byµYx|yN := E[Yx|yN ] = µYx + ΣYxYNΣ−1YNYN(yN − µYN ) (2.1)andσ2YxYx|yN := cov[Yx, Yx|yN ] = ΣYxYx −ΣYxYNΣ−1YNYNΣTYxYN , (2.2)where ΣYxYN is an N -length vector of covariances between Yx and eachelement of YN ; ΣYxYx is the variance, σ2Yx , of the random variable Yx; µYxis the mean of the random variable Yx. The posterior distribution is reallyan estimate of the function y. The equations (2.1) and (2.2), as shown in[CMMY91], then becomeyˆ(x) = µYx + ΣYxYNΣ−1YNYN(yN − µYN ), (2.3)which is the approximation to y at x andσ2x = ΣYxYx −ΣYxYNΣ−1YNYNΣTYxYN , (2.4)which is the uncertainty of the estimation in (2.3).Models created using the GP provide an estimate of the true functionbut also provide a measure of the uncertainty that comes along with theestimate at each point. The uncertainty is manifested in the variance, (2.4),of yˆ and is heavily relied on by two of the algorithms presented in Chapter 3.Also note that the costly operation of inverting the matrix in (2.3) doesnot depend on the unobserved point x and therefore when Σ−1YNYN has beencomputed once, yˆ can be computed quickly for multiple points in the domain.Equations (2.1) and (2.2) are written for multiple points in [CMMY91]. Therequirements necessary for Σ−1YNYN to be invertible will be discussed shortly.Equations (2.3) and (2.4) require knowledge of the mean vectors, µYxand µYN , the covariance vector ΣYxYN , the covariance matrix ΣYNYN andthe variance σ2Yx . All of these values will have to be estimated and to makethe task simpler, in the prior specification of y (that is the GP) it will beassumed that each Yx ∈ Y will have the same mean, µ, and variance, σ2.Equations (2.3) and (2.4) then simplify toyˆ(x) = µ+ rYxYNR−1YNYN(yN − µ1) , (2.5)andσ2x = σ2(1− rYxYNR−1YNYNrTYxYN), (2.6)6Chapter 2. Gaussian Process Modelwhere 1 is an N -length vector of ones; rYxYN is an N -length vector of cor-relations; RYNYN is an N × N matrix of correlations. The variance wasfactored out of the covariance matrices to obtain the correlation matrices.Now only one µ and one σ2 need to be estimated which is a simpler taskthan estimating 2N + 2 parameters as is the case in equations (2.3) and(2.4).Two more terms from equations (2.5) and (2.6) remain to be discussed,the correlation vector and the correlation matrix. The initial assumption onthe objective function y is that it is continuous over X. This implies that ifx1, x2 ∈ X are close to one another their function values should be similar.In terms of correlation, two input points close to each other should be highlycorrelated. Points far away from each other will exhibit a low correlation.This leads to a correlation function of the formcorr(Yx1 , Yx2) =d∏i=1R(x1i − x2i ), (2.7)where R is a correlation function in one dimension and depends on thedistance between x1i and x2i . Recall that each x ∈ Rd, thus the product isover the dimensions. Equation (2.7) is referred to as a product correlationand allows for control and analysis of the effects of each dimension. Apossible choice for R is the squared-exponential correlationR(x1i − x2i ) = e−θi(x1i−x2i )2, (2.8)where θi ≥ 0 is an unknown parameter. Variants of this exponential correla-tion are used in [SFP11] and [JSW98]. In the above equation R(x1i −x2i ) willbe close to 1 when x1i − x2i is close to 0 which is in agreement with the con-tinuity assumption on y. Taking equation (2.8) into the product correlationof (2.7) givescorr(Yx1 , Yx2) =d∏i=1e−θi(x1i−x2i )2. (2.9)Each θi is a parameter specific to dimension i. Small values of θi can indicatethat input variable i has a negligible contribution to the function y. Con-structing the correlation matrix from (2.5) and (2.6) using the correlationfunction from (2.9) ensures that the matrix is positive definite and henceinvertible, [SWN03]. The squared-exponential correlation from (2.8) is onlyone type of correlation function and Chapter 2 in [SWN03] presents someother correlation functions.7Chapter 2. Gaussian Process ModelEquations (2.5) and (2.6) have d + 2 parameters which need to be esti-mated. The most commonly used method of estimating the parameters is bythe well known maximum likelihood estimation which will not be describedin this thesis. For information on this step of the process refer to Chapter3 of [SWN03]. Once the parameters are estimated, (2.5) and (2.6) are fullyknown and the model function yˆ is fully known.8Chapter 3AlgorithmsPrior to presenting each of the algorithms, the notation used, as well as,notions of convergence will be discussed. Let k ∈ N denote the iterationof the optimization algorithm. Let Sk = {x0, x1, . . . , x|Sk|} be the orderedset of all points, x ∈ X, that have been sampled from iteration 0 up toand including iteration k. The points in Sk are ordered chronologically sothat x0 is the first point sampled and x|Sk| is the most recently sampledpoint. Then following the completion of iteration k, the best function valueobtained isfkbest := minx∈Skf(x).Similarly, the best point obtained by the end of the kth iteration isxkbest := {x ∈ Sk : f(x) = fkbest}.If multiple points satisfy the requirements of xkbest, the one with the lowerindex in Sk is taken, thus keeping xkbest a singleton. An algorithm is said toconverge to the true minimum value iflimk→∞fkbest = f∗. (3.1)Let dist(xkbest, X∗) be the distance between xkbest and X∗, defined asdist(xkbest, X∗) := infx∈X∗∥∥∥xkbest − x∥∥∥2.Then, an algorithm is said to converge to a true minimizer iflimk→∞dist(xkbest, X∗) = 0. (3.2)In the following sections each of the algorithms is presented in greaterdepth along with some convergence results. All implementations are in theprogramming language R.93.1. Genetic Algorithm3.1 Genetic AlgorithmThe genetic algorithm is a heuristic approach that was formally intro-duced in [Hol75] and has since become a well-known option to optimizers.The basic idea behind this algorithm is to mimic biological evolution andusing “survival-of-the-fittest” obtain the optimal value of the function.At iteration k, the algorithm considers a set of points, sayP k = {x0, x1, . . . , x|Pk|} ⊂ Rd.The initial sample of points, P 0, can be chosen randomly, but each sub-sequent set is derived from the previous. Since the ideology behind thealgorithm is that of biological evolution, each point xi, i = 1, 2, . . . , |P k|, isviewed as a member of a population P k and the components of the vector xiare its ‘genes’. The constraint set X is the minimal requirement for survivalin this population and any individual xi /∈ X will fail to survive or pass itsgenes to the next generation. In the next step, we use the fitness function,g, to determine the fitness of each individual in P 0. For the sake of theapplication in this thesis we let g = −f so that smaller objective valuescorrespond to higher fitnesses. And of course we have f(x∗) = −g(x∗) forany x∗ ∈ X∗.Based on the fitness values of the members of P k, three steps will betaken to determine the next population, P k+1. In the first step, Selection,the fittest individuals from P k will be selected to survive into the nextgeneration P k+1. The second step, Cross-over, selects sufficiently manyparent points from P k and breeds them to create offspring. To stay true tothe biological motivation for this algorithm, in this step the parent pointsare viewed as chromosomes and the components of each parent chromosomeare its genes. As a result of the chromosomal motivation the individualsin P k are often encoded, with binary encoding being a common choice.Cross-over will create an offspring by using some genes from each parentand this process generally relies on probabilities to determine which genesare taken from which parent. If the chromosomes are not encoded, a convexcombination of the parents, where the parameter of convexity is chosenrandomly, is a common choice. The final step, Mutation, corresponds to themutation of genes which takes place in biological reproduction. Here theoffspring are perturbed by changing the value of some of their genes. Theprobability of a gene mutating is governed by a mutation parameter whichis generally small so as to not deviate from the biological model. Once allthe steps have been completed and P k+1 has been created, the process is103.2. Nelder-Meadrepeated to obtain future generations. A thorough description of the geneticalgorithm can be found in [Hol75].In order to determine a stopping point for the genetic algorithm, it willhelp to first present the theoretical convergence results. The introductionof random choices in the Cross-over and Mutation steps at each iterationplaces the genetic algorithm in the category of stochastic search methods.As such, the convergence shown in (3.1) and (3.2) is replaced by equationsinvolving probabilities. In [EAVH91], the authors show that under someassumptions the genetic algorithm will converge to the true minimum withprobability 1 as the number of iterations approaches infinity. That is,Prob[limk→∞fkbest = f∗]= 1 and limk→∞(Prob[P k ∩X∗ 6= ∅])= 1.With this in mind, there is no theoretically preferred event at which tostop the algorithm. One common way the algorithm may be stopped iswhen a function value budget, say Nfeval, has been reached. Alternativelythe algorithm may be stopped if the number of generations where xkbest hasfailed to improve, exceeds the parameter Nfix. The first criteria stops thealgorithm if it has taken too long while the second terminates it if too muchtime has passed without improvement to the optimal value.The implementation used in this thesis can be found in the R package GAwhich is documented in [Scr13]. The population size of max{10,√Nfeval}is used at each iteration and the top 5% from each population are assuredsurvival into the next generation. The mutation parameter is 0.1 and thealgorithm is stopped only when the function call budget has been exhausted.3.2 Nelder-MeadThe Nelder-Mead (NM) method was introduced in [NM65] and is a min-imization method utilizing only objective function values. The approachmoves a special type of polytope called a simplex around the domain spacein search of the best function value.A simplex in Rd is a d-dimensional polytope with d+ 1 vertices. In R2,for example, a simplex is formed by the edges and interior of a triangle. Atiteration k the NM method considers d + 1 points {x0, x1, . . . , xd} whichare the vertices of a simplex. The points are then ordered so that f(x0) ≤f(x1) ≤ · · · ≤ f(xd) and a search is performed to replace the worst point,xd. Up to four points may be evaluated along the line passing through xdand∑d−1i=0 xi/d. If a better function value is found the corresponding point113.3. Pattern Searchreplaces xd, otherwise the simplex shrinks towards the best point x0. Theapproach guarantees that the next point(s) chosen along with the ones thatremain are the vertices of a simplex in Rd and the process is repeated insubsequent iterations.For objective functions that are bounded below, the NM approach guar-antees convergence to some function value f¯ , [LRWE98]. That is, if X∗ 6= ∅,thenlimk→∞fkbest = f¯ .While the assurance of convergence is desirable, the accumulation pointis cause for some concern. It is important to note that the statement ofconvergence develops no relationship between f¯ and f∗. Most notably, f¯is not necessarily the minimum value. An example showing the failure ofthe method to converge to a stationary point on a reasonable function waspresented in [McK98]. A second concern is that the NM method is a localsearch method in that it makes decisions based on the local geometry of theobjective function. This may cause convergence to a local minimum whileignoring the global minimum.The simplicity of this method makes it readily implementable and thefact that gradients are not needed is appealing to many users. In practice,the algorithm performance can range from very good to quite poor, [Wri96].The implementation used in this thesis is that of [VB11] which is an Rimplementation of the algorithm presented in [Kel99].3.3 Pattern SearchThe pattern search method was first introduced as a direct search al-gorithm by [HJ61]. The name ‘direct search’ was used to indicate thatonly objective function values were to be used in the optimization process.Subsequent algorithms such as the one presented in [DT91] expanded on theoriginal method while staying true to the core ideas. As such, the term ‘pat-tern search’ does not refer to a specific algorithm, instead it encompasses arange of algorithms which share several intrinsic properties.At the kth iteration a pattern search algorithm has a best point xkbestand the corresponding best function value fkbest. The algorithm then explorespoints surrounding xkbest in a pattern of directions and a fixed step size. Ifa better function value is found, the corresponding point replaces xkbest andthe process is repeated in the next iteration. For the case where no betterfunction value is found, the step size is decreased and the same exploratorystep performed.123.3. Pattern SearchLike the NM algorithm, pattern search is a local optimization methodand this fact should be considered when global optimization is desired. Asimple way to combat local optimization is to start the algorithm at multiplelocations in the domain. The algorithm then explores the area local toeach starting point and the combined results provide information about theglobal optimum. The pattern search approach is simple, intuitive and moreimportantly it is flexible and has strong convergence results. In [Tor97], itis shown that if f ∈ C1, the level sets of f are compact, and some conditionsare placed on the directions being searched at each iteration, thenlimk→∞‖∇f(xkbest)‖ = 0.This of course is not a guaranteed convergence to the minimum of the func-tion, but there is at least a guarantee of convergence to a critical point.The power of this result is that it holds for any pattern search algorithmsatisfying the aforementioned conditions.With regards to the flexibility of pattern search, as mentioned earlier,[AD06] introduced a search step prior to the directional exploration whichallows for any type of finite global search. This step has one main benefitin that it allows the algorithm to ‘jump’ out of local modes if a betterfunction value is found elsewhere without losing the convergence resultsfrom [Tor97]. The pattern search algorithm used in this thesis is NonsmoothOptimization by Mesh Adaptive Direct Search (NOMAD), [ABLD08], whichis a modification of the Mesh Adaptive Direct Search (MADS) algorithm,[AD06].In this algorithm a Variable Neighbourhood Search (VNS) subroutine,[HM03], is used to aid NOMAD in global optimization. Once the step sizeused in pattern search has become sufficiently small, the VNS algorithm isused to perturb xkbest and obtain x′. Then a neighbourhood of x′ is exploredin a pattern of directions and increasing step sizes in search of x′′ such thatf(x′′) ≤ f(xkbest). If no such point is found then pattern search continuesaround xkbest otherwise xkbest = x′′. Coupling the local nature of patternsearch with the global nature of VNS creates an algorithm more effectivefor global optimization.The R implementation of NOMAD is sNOMADr and can be found inthe package crs, [RN14]. The algorithm will be referred to as sNOMADrthroughout this paper.133.4. Efficient Global Optimization3.4 Efficient Global OptimizationThe Efficient Global Optimization (EGO) algorithm was introduced in[JSW98] as an approach to minimizing expensive-to-compute black-box func-tions. The approach improves on several previous attempts at optimizationusing GP models, and claims to use a relatively small amount of functioncalls.Suppose only N function calls can be afforded throughout the entireoptimization process. Then the EGO algorithm begins by evaluating aninitial sample of points x1, x2, . . . , xm where m < N . Only a portion of theentire sampling budget should be used in this step. The suggestion of 10dpoints, mentioned in Chapter 2, is often used. Based on these m points,a GP model, fˆ , of the function is created. The authors then introduce animprovement criterion to select the next point to evaluate. The improve-ment criterion looks for an unsampled point x ∈ X where its objectivefunction value could be smaller than the current lowest function value. Theimprovement criterion isI(x) = maxx∈X(fbest − Y (x), 0), (3.3)where fbest is the smallest function value obtained so far, and Y (x) is therandom variable with mean and variance specified by (2.5) and (2.6) (thatis the GP). If Y (x) is smaller than the best function value obtained so far,then the improvement is the amount by which Y (x) is better than the bestfunction value. If Y (x) does not improve on fbest, then the improvementat x will be 0. The improvement function is always non-negative and itsmaximum is at the point where the greatest improvement in function valuewill occur. However, the criterion in (3.3) contains a random variable andcannot be computed. Naturally this leads to the expected improvement (EI)criterion,E[I(x)] = E[maxx∈X(fbest − Y (x), 0)], (3.4)which is just the expectation of the improvement criterion in (3.3). Thenext point to be sampled is the maximizer of (3.4). The authors showthat the expected improvement criterion can be expressed in closed formand optimized using a branch and bound algorithm. Once the maximizerhas been found, its function value is obtained, a new GP function modelis created using the m + 1 points and the EI criterion is computed again.These steps are repeated until the budget, N , is exhausted.143.5. Knowledge Gradient for Continuous ParametersThe EI criterion is the key element of the EGO algorithm. While earlierwork in optimization with GP function models relied on heuristics, [JSW98],this work introduced a criterion based on theory. The criterion tends to ex-plore regions of X where the GP function model exhibits low values, but alsoregions where the model exhibits high uncertainty. This balance betweenlocal and global search allows the algorithm to avoid stagnating at locallyoptimal points.The EGO approach is not guaranteed to converge in general as shownin [Bul11]. The problem in convergence comes mainly from the constantlychanging GP model by which the EGO approach makes most of its decisions.If the GP is not a good model of the true function, then convergence to thetrue minimum may not be likely.EGO was designed for black-box optimization and that is why it is ap-propriate for this thesis. The implementation used is found in the R packageDiceOptim by [GPR+13].3.5 Knowledge Gradient for ContinuousParametersThe Knowledge Gradient for Continuous Parameters (KGCP) was intro-duced in [SFP11] and is an extension of the previous work, [FPD09], whichwas considered in the discrete context. It is also an extension of the EGOalgorithm in that it was designed for non-deterministic data. The KGCPis designed for maximization problems, so the objective function f whichneeds to be minimized will be replaced by g = −f .The KGCP assumes that the data are noisy and that at any point x ∈ X,yˆ(x) ∼ N(g(x), λ(x)), (3.5)where yˆ(x) is the noisy observation at x, g(x) is the true objective value atx, and λ(x) is the noise variance at x. Like the EGO algorithm the KGCPtakes an initial sample of m points, x1, x2, . . . , xm. LetYˆ = {yˆ(x1), yˆ(x2), . . . , yˆ(xm)}be the set of observed values from the initial sample. Then the GP modelat some x ∈ X is the expected value of the true function at x ∈ X based onthe m observations,gm(x) = E[g(x)|Yˆ ]. (3.6)153.5. Knowledge Gradient for Continuous ParametersThe superscript m is used to index the progression of models. The algorithmthen considers sampling another point x ∈ X and predicts gm+1 (the nextmodel) without having the observation yˆ(x). That isgˆm+1(x) = E[g(x)|Yˆ , x]. (3.7)Use of the ‘hat’ notation in gˆm+1 is to indicate that (3.7) does not expressthe actual model of the function based on m + 1 sampled points, since the(m + 1)th point, x, has not yet been observed. The equation is instead away to ‘look ahead’ at what the model might be like if it was built using theadditional point x. Then the KGCP criterion at some x ∈ X isKGCP (x) =(E[maxi=1,... ,m+1gˆm+1(xi) ∣∣∣Yˆ]− maxi=1,... ,m+1gm(xi))xm+1=x.(3.8)This criterion measures the improvement in function value if x was used tocreate the next model. The maximizer of this criterion is then the point xwhich is expected to create a model with the most improvement in functionvalue over the current model. The KGCP function is non-negative andmulti-modal and similar to the EI criterion from EGO. A closed form ex-pression of (3.8) as well as detailed connections of the above equations andstatements can be found in [SFP11] and [FPD09].The KGCP criterion is differentiable and the derivative is available inclosed form. This allows for use of gradient-based optimization techniquessuch as steepest ascent or BFGS. The authors suggest steepest ascent anduse it in their implementation. It should be noted that the non-concavityof the KGCP function means that any maximizer of (3.8) obtained bygradient-based methods may only be a local maximizer. As such a multi-start approach is used to increase the probability of reaching the globalmaximizer, although in practice this is not crucial, [SFP11].The KGCP method continues sampling one point at a time until N − 1points have been sampled. By this time the model gN−1 should be veryaccurate and the last point is taken as the maximizer of this model. In[SFP11] if the KGCP is used for maximization, under some assumptions, itis shown thatlimm→∞V ar [gm] = 0. (3.9)This result implies that the KGCP sampling technique leads to GP modelswhich approach the true function g. Eventually, the model gm is indistin-guishable from g. This in turn leads to the maximizer of gm being indis-tinguishable from the maximizer of g. However, one of the assumptions for163.5. Knowledge Gradient for Continuous Parametersthis convergence result is that the parameters of the GP are known andfixed throughout the optimization process. In practice this is unlikely to betrue since the parameters of the GP are unknown and estimated at eachiteration of the algorithm. Thus the GP model is constantly changing andthe concerns regarding EGO, expressed in [Bul11], are valid for the KGCPalgorithm as well.The KGCP approach was implemented in R by the author, using GPmodels created with the package DiceOptim by [GPR+13]. The steepestascent algorithm used was also implemented in R and the BFGS optimizer isfrom the standard optim package in R. The implementation of KGCP usedin this thesis can be acquired by contacting the thesis author.17Chapter 4Developing a RubricIn Section 1.1 the motivating questions behind this thesis were presented.This chapter is concerned with the development of a rubric to help answerthose questions. Given the input dimension, the average time required toevaluate the objective function, and a time budget, the rubric should deter-mine which of the algorithms in question will provide the best result. Thefollowing sections present the theoretical framework and development of therubric.4.1 Theoretical frameworkIn black-box optimization it can be assumed that prior to optimizing,the user will have some basic information regarding the problem. In thisthesis it is assumed that the user will have an allotted time to perform theoptimization. This will be referred to as the time budget and denoted byT . It is also assumed that the user knows the average time required for theobjective function to evaluate a single point, denoted by tf . Lastly, it isassumed that the user will know the input dimension, d, of the objectivefunction. Now, for an optimization problem, the time, T , required to obtaina solution can be expressed asT = Tf + TA, (4.1)where Tf is the time spent evaluating the objective function and TA is thetime used by algorithm A to decide which point(s) to evaluate next. Thefirst term, Tf , is the time required for a single function evaluation multipliedby the number of function calls:Tf = ntf ,where n is the number of times the function is evaluated. The value nwill be referred to as the function call budget. The second term, TA, ofequation (4.1) is generally unknown, however, some relationships can bedeveloped. First, the time used by the algorithm will be affected by the184.1. Theoretical frameworkfunction call budget, n. If more function evaluations are available, thenmore iterations will also follow. This, in turn, will require the algorithm tomake more decisions, hence increasing TA. This relationship will be differentfor each algorithm due to a number of factors. One factor is the numberof function calls an algorithm may perform at each iteration. For example,the implementation of the GA used in this thesis will evaluate max(10,√n)points in each iteration. For n ≥ 100 this leads to√n iterations. The EGOalgorithm, on the other hand, will initially obtain 2d + 2 points and thenevaluate the function once per iteration, leaving one function call for theend. This leads to n− (2d+ 2)− 1 iterations. The second relationship thatcan be developed is that between TA and the input dimension d. As alreadynoted, the number of iterations performed by the EGO algorithm is affectedby the input dimension. The same result is true for the KGCP algorithm. Inthe NM algorithm the number of vertices in the simplex is a function of theinput dimension. A higher dimension will lead to a larger simplex which willincrease the decision-making computations at each iteration, leading to anincreased time TA. Similar arguments can be made for each of the algorithmsused in this thesis, concluding that the time used by each algorithm is inpart a function of n and d. The variables n and d may not be the only factorscontributing to TA, however, in this thesis it is assumed that these are theonly variables known prior to optimizing the objective function. While thefunctions TA are not known, they will be estimated in Section 4.2.The discussion from the previous paragraph leads toT ≈ ntf + tA(d, n), (4.2)which is a reformulation of (4.1), where tA : R2 → R is the time used bythe algorithm and depends on the input dimension and the function callbudget. In this equation T is a function of d, tf , and n. Assuming that thefunction tA is known for each algorithm A, the user can obtain n. Since tA isdifferent for each algorithm, n will also be different for each algorithm. Forexample, an optimization problem with a given T , d and tf , might lead tothe GA evaluating the function 200 times while allowing the EGO algorithmto sample just 20 points.This provides the user with at least a starting clue as to which algorithmmay perform better. However, the number of function calls allotted to analgorithm may not be an accurate measure of performance, since it doesnot take into account, the nature of the algorithm. This brings up the needfor a performance criterion. In this thesis the function value improvementapproach is adopted, where the criterion measures how much improvement194.1. Theoretical frameworkan algorithm makes on the function value over the course of the optimizationprocess.Let x0 ∈ X be the first point evaluated by algorithm A and xnbest ∈ X bethe best point evaluated by the end of the optimization process. Then theimprovement in function value by algorithm A after n function evaluationsisInA = − log10(f(xnbest)− f∗f(x0)− f∗), (4.3)where f∗ is the minimum of the function. In a good performance, f(xnbest)−f∗ should be much smaller than f(x0) − f∗, which will lead to a largeimprovement value InA. Using the base 10 logarithm means that InA can beloosely interpreted as the digits of accuracy gained by algorithm A. Theimprovement function is not known and most likely does not exist in thegeneral case, since it can be affected by a large number of different factors.First of all, “steep” objective functions may provide greater improvementvalues than “flat” objectives. Another factor regards the starting point,x0. Suppose that f(x) = ex and x0 = 10. Some algorithms may make largeimprovements on the function value starting at such a “steep” location on thegraph of the objective. If instead the same algorithm was used with startingpoint x0 = −10, the algorithm may not demonstrate a drastic improvementin function value, since it started on a flat spot of the function. To accountfor these issues, the improvement is normalized asInA = − log10(f(xnbest)− f∗f(xnrandom)− f∗), (4.4)where f(xnrandom) is the best function value obtained after n randomly sam-pled points from X. The assumption in this formulation, is that if algorithmA obtains abnormally large improvement on a given function, the randomsampling technique will also obtain abnormally large improvement. The ra-tio, therefore, is to stabilize the measure of improvement which can now beloosely interpreted as how many more or less digits of accuracy algorithmA can obtain compared to a random sample of n points.The question, now, is whether InA is related to the information availableat the beginning of the optimization process, namely d and n? In this sectionthe existence and strength of relationships is developed in the theoreticalcontext. In Section 4.2, the relationships are studied in an empirical setting.First, the variable n is considered.Let fnbest be the lowest function value obtained after n function calls.With knowledge of fnbest, a good optimization algorithm will not choose204.1. Theoretical frameworkfn+1best if it is not an improvement on the current best value. Such a policyensures that for any m ≥ 0fn+mbest ≤ fnbest.Manipulation of this inequality leads tofn+mbest − f∗ ≤ fnbest − f∗=⇒fn+mbest − f∗fnbest − f∗≤ 1.The above manipulations are only permitted if fnbest 6= f∗. Correspondingassumptions for the random sampling technique lead tofn+mrandom − f∗fnrandom − f∗≤ 1.Now if it is assumed thatfn+mbest − f∗fnbest − f∗≤fn+mrandom − f∗fnrandom − f∗(4.5)thenfn+mbest − f∗fn+mrandom − f∗≤fnbest − f∗fnrandom − f∗=⇒ log10(fn+mbest − f∗fn+mrandom − f∗)≤ log10(fnbest − f∗fnrandom − f∗)=⇒ − log10(fn+mbest − f∗fn+mrandom − f∗)≥ − log10(fnbest − f∗fnrandom − f∗)=⇒ In+mA ≥ InA.The implication is that improvement is non-decreasing with budget. Whichmight make sense intuitively, however, the normalization in the improvementcriterion presents a challenge for this intuitive idea. The main challenge iswith assumption (4.5). This assumption is that the amount by which algo-rithm A approaches f∗ over the additional m function calls is greater thanthe amount by which the random sample approaches f∗ over the additionalm function calls. For algorithms such as NM and sNOMADr, where there is214.2. Estimating tA and InAa possibility of converging to a local minimum, such an assumption cannotbe made.The previous assumption thatfn+mbest ≤ fnbestis certainly true if the the first n sampled points are identical, but it is nottrue in general. The improvement gained by an algorithm may be affectedby the starting point or randomness and the statement,fn+mbest ≤ fnbest,may not be true if the values are a result of two distinct algorithm runs.What has been demonstrated here, is that the assumptions required fora general theoretical statement about the relationship between improvementand function call budget are unwarranted. In the following section, the pres-ence and strength of this relationship will be analyzed using experimentaldata and a statistical approach.The other variable available at the beginning of the optimization processis the input dimension, d. The attempt to develop a relationship betweenimprovement and n revealed some challenges due to the improvement be-ing normalized. It is likely difficult to determine the type and amount ofeffect d will have on a given algorithm and the random sampling technique.Therefore theoretical analysis of this relationship will not be discussed inthis thesis. The following section provides an empirical analysis of the rela-tionship between improvement and the available variables d and n.4.2 Estimating tA and InATo estimate the time and improvement functions, a number of test func-tions with varying input dimensions and function call budgets were opti-mized using the algorithms from Chapter 3. Regression was then performedon the resulting data to approximate tA and InA. The test functions used arefrom [MS05] and [MGH81]. As was mentioned in the previous section, thefunction type could have an effect on the improvement function. Thereforethe functions chosen for this experiment exhibit a variety of types. There areconvex functions, multi-modal functions, functions with multiple minimiz-ers, and highly erratic functions which have the appearance of noise. Thefunctions in the test set have input dimensions varying from 2 to 30. A listof test function names and dimensions appears in Table 4.1. Note that some224.2. Estimating tA and InAof the functions, such as the Ackley and Rastrigin, are not dimension specificand can therefore be tested at multiple dimensions. Other functions, suchas the Branin and drop-wave, are dimension specific and are therefore usedat only one dimension. The descriptions of the key characteristics (relativeto optimization) of each of the functions can be found in the third columnof the table. For clarification, ‘flat bowl’ functions have a bowl shape andexhibit a very small slope in a relatively large neighbourhood of the min-imizer. The variably dimensioned function has a very large slope in somedirections and very small slope in other directions. The Extended Powelland Rosenbrock functions each have a single minimizer but have character-istics which are not easily categorized. The variety of test problems used isintended to minimize the effect of the test set on the results of this section.Functions were optimized based on function calls ranging from 10 to500. Every function was optimized 20 times by each algorithm. Each ofthe 20 repeats used a different starting point. The purpose of this approachwas to take into account the effect of the starting point. For each optimiza-tion problem values necessary for computing the improvement criterion werestored and the algorithm was timed. The experiment was performed on adedicated cluster so that the time values would be more reliable.Once the data is available regression can be performed. The choice ofregression technique for this data is linear regression. The main reason be-hind this choice is that linear regression provides a clean equation for theregression surface and can be used for statistical inference, [Far02]. Otheroptimization techniques such as spline regression or kernel regression, pro-vide complicated regression equations. In the following subsections, theregression surfaces for tA and InA are developed. Since both tA and InA willbe approximated by functions of d and n, tˆA(d, n) will be used to denotethe estimate of the time function for algorithm A and IˆA(d, n) will be usedto denote the estimate of the improvement criterion.4.2.1 Time functionsIt will be necessary to obtain a time function for each of the five algo-rithms from Chapter 3. A deeper analysis of the regression process will beshown for the EGO algorithm, while only the results will be presented forthe remaining four algorithms, since the process is analogous.Analysis of the data for the EGO algorithm begins with a three dimen-sional plot of the data shown in the left image of Figure 4.1.Note that the time (measured in seconds) increases with both dimensionand function call budget. The data appears to be “cut-off” somewhere above234.2. Estimating tA and InATable 4.1: A list of the test functions used. The second column indicates thedimensions for which each function was used and the third column providesa brief description of relative function characteristics.Function Name Dimension(s) DescriptionAckley3,7,10,14,22,24,28,30 Multiple local minimaBranin 2 Multiple global minimaBroyden banded2,3,4,6,7,8,10,14,22,24,28,30 Flat bowlBroyden tri-diagnoal2,3,4,6,7,8,10,14,22,24,28,30 BowlDiscrete integral2,4,5,6,8,9,10,14,18,22,26,30 Flat bowlDrop-wave 2 Multiple local minimaExtended Powell 4,8 OtherGoldstein-Price 2 Multiple local minimaLinear full rank2,4,5,6,8,9,10,14,18,22,26,30 BowlLinear rank one2,4,5,6,8,9,10,14,18,22,26,30 ValleyRastrigin5,9,10,14,18,22,26,30 Multiple local minimaRosenbrock 4 OtherRotated hyper-ellipsoid3,7,10,14,22,24,28,30 BowlShekel 10 4 Multiple local minimaSix-hump camel 2 Multiple local/global minimaSum of powers3,5,7,9,10,14,18,22,24,26,28,30 Flat bowlSum of squares3,7,14,22,24,28,30 BowlTrigonometric2,4,5,6,8,9,10,14,18,22,26,30 Multiple local minimaVariably dimensioned2,4,5,6,8,9,10,14,18,22,26,30 Steep/shallow valley244.2. Estimating tA and InAbudgetdimensiontime 50001e+041.5e+04Plot of time data for EGO 5 10 15 20 25 30100 200 3004000500 budgetdimensionlog-time100 200 30005004002468Plot of log-time data for EGO 5 10 15 20 25 30Figure 4.1: Plot of time used by the EGO algorithm as a function of dimen-sion and function call budget (left). Plot of time used by the EGO algorithmafter a log transformation (right).15,000s. This is due to a 5 hour (18,000s) limit imposed on the algorithmsthroughout the experiment. As dimension and budget increase the spread inthe data also increases. This is not a desirable quality for linear regressionsince it violates one of the assumptions of the process, namely that thevariance is constant. A log transformation of the data may lead to a datasetwith constant variance. Indeed, this appears to be the case as can be seenin the right image of Figure 4.1. This plot seems to indicate a logarithmicgrowth in both d and n. Therefore linear regression is performed with themodellog(tEGO) = β0 + β1 log(d) + β2 log(n) + β3 log(d) log(n), (4.6)where βi ∈ R are coefficients to be determined by regression for each i =0, 1, 2, 3. The third term is an “interaction” term and its necessity will bedetermined by the statistical inference provided by the regression.Figure 4.2 shows two diagnostic plots from the linear regression of EGOtime.On the left is a plot of the residuals against the fitted values. Ideally, theregression surface should pass through the “middle” of the data at all pointsand the plot should have close to half of the points above 0 and the rest of thepoints below 0. This is the case with this dataset. For larger fitted values,there is some deviation from this centrality and this is probably due to thesharp “cut-off” at 18,000s. The red curve is the centre of the residuals andin a good linear model should be constant and close to 0. In this particularplot, the red curve is satisfactory.254.2. Estimating tA and InAFigure 4.2: Residual plot (left) and Q-Q plot (right) from the linear regres-sion of EGO time.The plot on the right is a Normal Quantile-Quantile (Q-Q) plot. It isa plot of the standardized residuals against the quantiles of the standardnormal distribution. In linear regression, it is assumed that the variance isnormally distributed. The purpose of this plot is to verify this assumption.If the residuals are normally distributed about the regression surface thenthe Q-Q plot should be close to the identity line. In this instance the Q-Qplot does indeed adhere to the identity line and the normality assumptionis validated. The normality assumption makes valid the use of the t and Fstatistics for inference. These diagnostic plots indicate that the model from(4.6) is worth considering.Prior to analyzing the equation produced by linear regression, it shouldbe noted that the F statistic for the model was smaller than 2E-16. Thisvalue indicates the probability of observing an F statistic greater than theone observed under the assumption that all of the coefficients from (4.6) are0. In this instance the probability of such an event is significantly smallbringing into question the assumption of all coefficients taking on a value of0. In other words there is at least one valid explanatory term in the model(4.6). The least squares estimates of the coefficients from (4.6) are shown inTable 4.2.The table also shows the p-values for the t statistic for each coefficient.This p-value can be interpreted as the probability of observing a t statisticgreater than or equal to the one observed under the assumption that thecorresponding coefficient has a value of 0. Note that the p-values for the264.2. Estimating tA and InATable 4.2: Summary of linear regression results for the time function of theEGO algorithm.Estimate Std. Error t statistic p valueβ0 -2.9001E+00 7.6448E-02 -3.7935E+01 0β1 1.9749E+00 3.9860E-02 4.9547E+01 0β2 9.6312E-01 1.6824E-02 5.7246E+01 0β3 -4.9195E-03 8.1360E-03 -6.0466E-01 0.5454first three coefficients are extremely small, indicating, very strongly thatthe coefficients for these terms are unlikely to be 0. The coefficient β3 onthe other hand is simply not small enough to support the inclusion of theinteraction term log(d) log(n). This term can therefore be removed fromthe model and linear regression performed again, with the results found inTable 4.3.Table 4.3: Summary of linear regression results for the time function of theEGO algorithm after removal of the interaction term.Estimate Std. Error t statistic p valueβ0 -2.8585E+00 3.3279E-02 -8.5894E+01 0β1 1.9515E+00 9.1106E-03 2.1420E+02 0β2 9.5401E-01 7.4847E-03 1.2746E+02 0This time each of the coefficients is reported as 0. Therefore there isno reason to believe that any of the remaining terms have no effect on theresponse variable.Using the results from linear regression, an estimate of tEGO islog(tˆEGO(d, n))= −2.859 + 1.952 log(d) + 0.954 log(n),which can be simplified totˆEGO(d, n) = exp (−2.859 + 1.952 log(d) + 0.954 log(n)) . (4.7)Figure 4.3 shows a plot of the regression surface and the data.The regression was performed using only 15 of the 20 random restartsfor each algorithm and function. These 15 points are referred to as thetraining data, since they are used to create the model function. Data fromthe remaining 5 runs is used to assess the quality of the prediction from thelinear regression models. This subset of the data is referred to as the test274.2. Estimating tA and InAbudgetdimensionlog-timePlot of log-time data and regression surface for EGO 5 10 15100 200 3000500400246820 25 30Figure 4.3: Plot of the log transformed time data and the regression surfacefor the EGO algorithm.data, since it is used to test the model developed by the training data. Thisprocess is called model-validation. The idea is that a regression function isdesigned to pass through the “middle” of the data. The data obtained in thissection contains multiple time values for each (d, n) pair that was tested. Agood regression function might be expected to lie near the mean of the dataat each tested point (d, n). Now consider the test data which was not usedto develop the regression model. Two qualities are desirable when assessingthe model using test data. First, it is desirable that its values are notvery different from the training data. This implies a well-behaved dataset.Secondly, the test data should not be very different from the regressionfunction. These two ideas can be combined into a model-validation criterion.Before this criterion is presented, however, some notation is required. LetP be the set of elements (d, n) such that a test problem for the pair (d, n) isused in this thesis. Let NP denote the number of elements in P. Then thenormalized error (NE) isNE(tˆA) :=∑NPi=1∑Nij=1(ttestA (di, ni)− tˆA(di, ni))2∑NPi=1∑Nij=1(ttestA (di, ni)− t¯A(di, ni))2 , (4.8)where ttestA (di, ni) is a time value for a test problem with dimension di andbudget ni, tˆA(di, ni) is the estimated time value at (di, ni) based on thetraining data, t¯A(di, ni) is the average time value over all training data forthe specific (di, ni pair, and Ni is the number of observations at (di, ni) inthe test data. For NE, a value near 1 indicates that the sum of the squared284.2. Estimating tA and InAresiduals from the test data is similar to the sum of the squared distancesfrom the test data to the average values of the training data.For the EGO algorithm time function the NE is 1.18. This value isnot too different from 1, however, it indicates that the sum of the distancessquared from the test data to the estimated function is greater than thesum of the distances squared from the test data to the average values of thetraining data.The process used to produce (4.7) was performed on the remaining fivealgorithms with the resulting equations, including that of EGO, shown inTable 4.4.tˆGA(d, n) = exp (−3.474 + (8.254E-2) log(d) log(n))tˆNM (d, n) = exp(−6.411− 0.521 log(d) + 0.503 log(n)++0.178 log(d) log(n))tˆsNOMADr(d, n) = exp(−6.345− (8.906E-2) log(d) + 1.009 log(n)++0.161 log(d) log(n))tˆEGO(d, n) = exp (−2.859 + 1.952 log(d) + 0.954 log(n))tˆKGCP (d, n) = exp(−4.900− 0.103 log(d) + 2.453 log(n)++0.130 log(d) log(n))Table 4.4: Estimated tA functions for each of the five algorithms.The regression plots in Figure 4.4 show an increase in log(tA) with re-spect to both d and n for each algorithm. Notable differences do exist how-ever in the amount of increase exhibited. The genetic and NM algorithmsrequire the least amount of time, while EGO and KGCP require substan-tially more time as dimension and function call budget increase. The timerequired by sNOMADr is somewhere in between these two groups.Table 4.5 shows the NE and three measures of variance for each algo-rithm.To analyze the variance present in the time data for each algorithm threemeasures of variance are used. The first is the sample variance of the entire294.2. Estimating tA and InAbudgetdimension-3-2100200300400500-4-1Plot of log-time data and regression surface for GA log-time5 10 15 20 25 300budgetdimension 100200300400500-6-4-20Plot of log-time data and regression surface for NM log-time5 10 15 20 25 30budgetdimension-4-2021002003004005005 10 15 20 25 30 4log-timePlot of log-time data and regression surface for sNOMADr budgetdimensionlog-timePlot of log-time data and regression surface for EGO 5 10 15100 200 3000500400246820 25 30budgetdimension 150200250 1004682 4 6 8 10Plot of log-time data and regression surface for KGCP log-timeFigure 4.4: Plots of the regression functions for the algorithm time of GA(top left), NM (top right), sNOMADr (middle left), EGO (middle right),and KGCP (bottom).304.2. Estimating tA and InATable 4.5: The NE value for each tˆA as well as the sample variance, aver-age variance over all (d, n) pairs, and the average variance over all (d, n, f)triples.Algorithm NE S2 S2d,n S2d,n,fGA 1.71 0.449 0.155 0.0351NM 1.39 1.46 0.292 0.0558sNOMADr 1.36 3.14 0.156 0.0694EGO 1.18 5.16 0.260 0.172KGCP 1.11 5.70 0.0835 0.0688dataset,S2 :=∑NPi=1∑Nij=1(tA,j(di, ni)− t¯A)2∑NPi=1Ni − 1, (4.9)where t¯A is the average observed time for algorithm A over all test problemsand reruns. This measure of variance is a reflection of the range of timedata for a specific algorithm. The S2 column in Table 4.5 contains thesample variance for each algorithm. Larger variance is exhibited by themore expensive algorithms such as EGO and KGCP. The fastest of thealgorithms, GA, has a small variance.The second measure of variance isS2d,n :=1NPNP∑i=1∑Nij=1(tA,j(di, ni)− t¯A(di, ni))2Ni − 1, (4.10)where t¯A(di, ni) is the average over all time observations for algorithm Aand a specific (d, n) pair. Equation (4.10) measures the average variabilityover all (d, n) pairs. If the difference between S2 and S2d,n is small then thetime data is likely close to constant. This is the case for the GA algorithmwhere the ratio S2/S2d,n is close to 3. The plot of the time used by the GAalgorithm in Figure 4.4 demonstrates the near constant nature of the timeused by the algorithm as a function of d and n. For the KGCP on the otherhand the ratio S2/S2d,n is close to 68, thus indicating that the time used bythe KGCP is far from constant with respect to d and n. Smaller values ofS2d,n indicate less variability at a given (d, n) pair which in turn implies thatother variables such as function type and starting point have a smaller effecton the time used by the algorithm. By contrast, larger values of S2d,n indicatelarger variability at a given (d, n) pair and this in turn implies that othervariables may have a significant effect on the time used by the algorithm.314.2. Estimating tA and InAThe KGCP algorithm exhibits the smallest S2d,n value of all the algorithmsand is therefore least affected by variables other than d and n. On the otherhand, the EGO and NM algorithms exhibit the largest S2d,n and their timetherefore is most affected by other variables.The final measure of variance is the average sample variance over all 135test problems,S2d,n,f :=1135135∑i=1∑20j=1(tA,j(di, ni)− t¯A(di, ni))220− 1, (4.11)where there are 20 observations for each test problem and t¯A(di, ni) is theaverage observed time over those observations. This variance takes intoconsideration not only the dimension and function call budget but also thespecific function being optimized. The greatest variability according to thismeasure can be found for the EGO algorithm. This means that even if theexact same function is optimized multiple times, the observed time will varymore than for the other algorithms. The ratio S2d,n/S2d,n,f can be looselybe interpreted as the amount by which the time used by an algorithm isaffected by the type of function being optimized. For the GA and NMalgorithms this ratio is greater than 4 and certainly greater than that of theother algorithms. For EGO and KGCP this ratio is less than 1.5 indicatingthat the function being optimized has a minimal effect on the time used bythe algorithms. The larger values for GA and NM indicate a greater effectof function type on the time used by the algorithms.Overall, the time data indicates that the BGO algorithms require moretime for optimization but are less affected by function type. The GA and NMalgorithms can perform optimization much more quickly but are also moreaffected by function type. It should be noted that the KGCP algorithm wasnot able to complete optimization for some higher budgets and dimensionsand as such the time data used to develop the regression functions uses asubset of the problems from Table 4.1.4.2.2 Improvement FunctionsAs was the case in the time function regression, the complete regressionprocess will only be shown for one algorithm, NM, while only the results willbe shown for the remaining algorithms. Figure 4.5 shows a three dimensionalplot of the improvement data for the NM algorithm.The majority of the data is positive indicating that the NM algorithmoutperformed the random sample almost every time. The amount by which324.2. Estimating tA and InAbudgetdimensionimprovement0510Plot of improvement for NM 5 10 15 20 25 30100200300400500Figure 4.5: Plot of improvement as a function of d and n for the NM algo-rithm.NM outperforms the random sample, however, varies quite a bit. NM canin some instances provide as many as 10 more digits of accuracy than arandom sample but it can also provide no more digits of accuracy than arandom sample for the same dimension and function call budget. Thus it isclear that the performance of NM varies due to factors other than d and n.There does appear to be a trend, in that there is a greater differencein the performance of NM and the random sample at smaller dimensionsand larger budgets. The non-constant variance present in this dataset posesa challenge for linear regression. It is also difficult to determine the typeof relationship between the response and explanatory variables. In such aninstance it is appropriate to view the diagnostic plots for several models andselect the one that provides the best diagnostics. The model,INM (d, n) = β0 + β1 log(d), (4.12)provides good diagnostic plots which can be seen in Figure 4.6.The plot of residuals against fitted values indicates that the regressionsurface lies near the middle of the data. At two locations, d = 24 and d = 28,the variance is notably smaller than anywhere else in the dataset. This isbecause certain types of functions were not found in these dimensions. As itturns out the omitted functions provide greater variance to the data. Thissituation reinforces the idea that function type has an effect on improve-ment. Nonetheless, the residual plot does not indicate any violation of theassumptions necessary for linear regression. The Normal Q-Q plot, on theother hand, does indicate non-trivial deviation from normality. This can334.2. Estimating tA and InAFigure 4.6: Residual plot (left) and Q-Q plot (right) from the linear regres-sion of NM improvement.pose a problem in linear regression, although it may not be very cripplingin this instance. One reason is that the large datasets such as this one arerobust to small deviations from normality. The second reason is that a viola-tion of the normality assumption in linear regression impedes the statisticalinference based on the t and F statistics, but it does not affect the predictionof the regression model. With these remarks in mind, the linear regressionprocess continues with a summary of the results found in Table 4.6.Table 4.6: Linear regression results for the improvement of the NM algo-rithm.Estimate Std. Error t statistic p valueβ0 4.6000E+00 7.4084E-02 6.2091E+01 0β1 -1.1221E+00 3.1687E-02 -3.5411E+01 0The p-values from the t statistic and F statistic (less than 2E-16) stronglyindicate that there is a significant relationship between the response andexplanatory variables. However, the earlier discussion indicates that thesevalues should not be trusted entirely. Taking the coefficients from Table 4.6provides the estimate of the improvement function for the NM algorithmIˆNM (d, n) = 4.600− 1.122 log(d). (4.13)A plot of the regression surface and the data can be seen in Figure 4.7.344.2. Estimating tA and InAbudgetdimension0100200300400500improvementPlot of improvement and regression surface for NM 5 10 15 20 25 30510Figure 4.7: Plot of improvement data and regression surface for the NMalgorithm.The regression functions for each of the algorithms can be found in Ta-ble 4.7.Table 4.7: Estimated Ia functions for each of the five algorithms.IˆGA(d, n) = 1.389− (1.543E-2)d+ (1.397E-3)nIˆNM (d, n) = 4.600− 1.122 log(d)IˆsNOMADr(d, n) = −9.980 + 0.280d+ (7.061E-3)n+ 2.664 log(d)++4.352 log(n)− 1.772 log(d) log(n)IˆEGO(d, n) = 0.976 + (3.396E-2)dIˆKGCP (d, n) = 2.687− 0.450 log(d) + (4.100E-3)nNote that for the NM and EGO algorithms, the regression functions donot indicate that the function call budget has an effect on improvement. Thismay appear to be somewhat puzzling, however, recall that improvement isnormalized by the random sampling technique. Thus it is the amount bywhich NM and EGO outperform random sampling that is not expectedto improve with increased function call budgets. The regression functions354.2. Estimating tA and InAfor the GA, NM and KGCP algorithms indicate that an increase in inputdimension leads to a decrease in improvement. This is not the case forthe EGO algorithm where the estimated relationship between dimensionand improvement is positive. This variation in type of relationships helpsexplain why it was challenging to develop theoretical relationships betweenimprovement and d and n. The nature of the relationships appears to bespecific to each algorithm. For the sNOMADr algorithm the relationshipsare not as clear due to the log(d) log(n) term.Plots of the improvement data with the regression functions for eachalgorithm can be seen in Figure 4.8.Model validation for the improvement functions is performed using thethe NE criterion from (4.8) adapted to the improvement function. The NEvalue for the model (4.13) is 1.42. This value is not significantly differentfrom 1 indicating that the predictive power of the model IˆNM is acceptable.Table 4.8 shows the NE value for each algorithm as well as the three measuresof variance discussed in Section 4.2.1.Table 4.8: The NE value for each IˆA as well as the sample variance, aver-age variance over all (d, n) pairs, and the average variance over all (d, n, f)triples.Algorithm NE S2 S2d,n S2d,n,fGA 1.03 1.74 1.59 0.718NM 1.42 6.50 3.80 1.54sNOMADr 1.09 14.5 6.19 2.22EGO 1.04 2.12 1.93 1.12KGCP 1.11 2.8 2.17 0.702With the exception of the NM algorithm, the NE for the model IˆA isvery close to 1 for every algorithm. This is an indication of good predic-tive power for the GA, sNOMADr, EGO, and KGCP improvement models.For all of the algorithms, the lower bound on improvement is nearly thesame. However, the best improvement values are quite different among thealgorithms. The sNOMADr and NM algorithms in particular can obtain avery high value of improvement. The sample variance, S2, is therefore quitelarge for sNOMADr and NM and much smaller for the GA, EGO and KGCPalgorithms. The ratio S2/S2d,n is close to 1 for the GA, EGO, and KGCPalgorithms implying that there is a more or less constant trend in improve-ment with respect to d and n for these algorithms. For the sNOMADr andNM algorithms the ratio is slightly larger and thus the relationship between364.2. Estimating tA and InAbudgetdimension0100200300400500improvement 510Plot of improvement and regression surface for GA 5 10 15 20 25 30budgetdimension0100200300400500improvementPlot of improvement and regression surface for NM 5 10 15 20 25 30510budgetdimension 3004005000Plot of improvement and regression surface for sNOMADr 510155 10 15 20 25 30improvement100200 budgetdimension02100200300400500-2improvement 468Plot of improvement and regression surface for EGO 5 10 15 20 25 30budgetdimension0100 200300400Plot of improvement and regression surface for KGCP improvement 5105 10 15 20 25 30Figure 4.8: Plot of the regression surface for the improvement of GA (topleft), NM (top right), sNOMADr (middle left), EGO (middle right), andKGCP (bottom).374.2. Estimating tA and InAimprovement and d and n is less constant as can be seen in the images inFigure 4.8. The relatively large values of S2d,n indicate the high variability inthe improvement gained by each algorithm for a particular (d, n) pair. Sucha high variability will certainly effect the validity of the regression functionfor improvement. The large variability also indicates that the improvementis likely affected by other variables. This variability is highest for the NMand sNOMADr algorithms. While the variability decreases when functiontype is considered in S2d,n,f , the values are still relatively large. The ratioS2d,n/S2d,n,f is greatest for the KGCP algorithm implying that the specificfunction being optimized has a notable effect on the amount of improve-ment gained. This ratio is smallest for the EGO algorithm. The variance inimprovement gained when optimizing a specific function is greatest for thesNOMADr algorithm and smallest for the GA and KGCP algorithms. ThesNOMADr algorithm is inherently local and as such may converge to locallyoptimal points. This of course implies that starting point will have an ef-fect on improvement. However, the use of VNS combats the local natureand may lead to larger values of improvement when the algorithm is able toavoid a locally optimal point. The combination of local and global search inthis algorithm likely leads to the large variability in the improvement it canobtain. The large variability of the improvement in the NM algorithm evenfor a specific function can be attributed to the local nature of this algorithmand the effect of starting point. The EGO algorithm also exhibits a largervariance in improvement gained and this may be attributed to the qualityof the GP model. If the GP model is accurate and the parameter estimatesare accurate, the EGO algorithm may perform quite well, however, if theGP is not an adequate representation of the objective function, the EGOalgorithm may perform poorly.The large variance noted in the improvement plot of the NM algorithmis also seen in the other four algorithms. Several remarks can be maderegarding the linear regression functions. The plot of the regression surfacefor GA is almost constant. This implies that the behaviour of GA is similarto that of the random sample which is intuitive since the genetic algorithmis essentially a “clever” random search. The plot for sNOMADr indicatesa very notable difference between the improvement gained by sNOMADrand the improvement gained by random search. Out of all the algorithmssNOMADr is expected to provide the strongest performance for dimensionssmaller than 5 and function call budgets greater than 400. The regressionsurface for KGCP indicates that at smaller dimensions and greater budgets,KGCP will perform much better than random sampling. There are also some384.2. Estimating tA and InAnotable gaps in the data for the KGCP algorithm. At higher dimensions andlarger budgets, KGCP was not able to complete optimizing in the allotted10 hour time limit. As a result, the function IˆKGCP should be used carefullyin the region where d > 14 and n > 200. The plot for EGO shows a uniquetrend among this group of algorithms. The performance of EGO comparedto that of random sampling is greater at higher dimensions.In the following chapter, the regression functions developed in these sec-tions will be used for the rubric developed in Section 4.1.39Chapter 5Using the RubricChapter 4 saw the development of a rubric to aid algorithm selectionin black-box optimization. In Section 4.1 a framework for the rubric wasdeveloped and in Section 4.2 the time and improvement functions were esti-mated for each algorithm. The rubric is designed to be used in the followingway. Starting with a time budget, T , the average running time of the ob-jective function, tf , and the input dimension d, the time equations fromTable 4.4 can be used to obtain the function call budget, nA, for each ofthe 5 algorithms. Then the improvement functions from Table 4.7 can beused to compute the estimated improvement that can be obtained by eachalgorithm. Finally the algorithm with the largest estimated improvement issuggested for the optimization problem.The purpose of this chapter is two-fold. First, the results of the rubricwill be analyzed with the intent of answering the questions from Section 1.1.Secondly, the validity and accuracy of the rubric will be tested. In Section 4.2several concerns were brought up regarding the estimated time and improve-ment functions. In both cases, there appears to be an amount of varianceunaccounted for by the variables d and n. This is particularly noticeable inthe improvement functions. For the improvement data, there is generally alarge amount of variance. Both of these issues may cause the rubric to beless reliable.Before these analyses are presented, an adjusted improvement criterionis required.5.1 The Adjusted Improvement CriterionNote that the function call budget may not be the same for each algo-rithm. For example, consider a scenario with the following information:T = 3600s tf = 20s d = 5.Recall equation (4.2.1)T = ntf + tA(d, n).405.1. The Adjusted Improvement CriterionSubstituting the known information this becomes3600 = 20n+ tA(5, n).For the GA algorithm this equation is3600 = 20nGA + exp (−3.474 + (8.254E-2) log(5) log(nGA))and for the KGCP algorithm it is3600 = 20nKGCP + exp(−4.900− 0.103 log(5) + 2.453 log(nKGCP )++ 0.130 log(5) log(nKGCP )).These equations can be solved and rounded down to obtain nGA ≈ 179and nKGCP ≈ 104. The estimated improvement for these two algorithms,according to the equations in Table 4.7, isIˆGA(5, 179) = 1.389− (1.543E-2)5 + (1.397E-3)179 = 1.56andIˆKGCP (5, 104) = 2.687− 0.450 log(5) + (4.100E-3)104 = 2.39.It appears that the KGCP algorithm is preferred over the GA algorithm forthis optimization scenario. However, recall that the improvement criterionfrom (4.4) is normalized by the random sample having the same budget. Inthis instance, GA is compared to a random sample with 179 points while theKGCP is compared to a random sample with 104 points. It is likely thatthe random sample with 179 points will outperform the random samplewith 104 points. This will lead to the improvement criterion for GA to besmaller than it really is. What is needed is a comparison of each algorithm’simprovement to a random sample with a fixed function call budget for thegiven optimization scenario. The random sample technique requires a verysmall amount of time to make decisions about which points to sample. Itcan, therefore, be assumed that tr(d, n) = 0. Then the function call budgetpermitted for the random sample under a given time budget isnr =⌊Ttf⌋.The improvement for algorithm A after n sampled points, adjusted for thesampling budget of the random sampling technique, isAInA = − log10(f(xnAbest)− f∗f(xnrr )− f∗). (5.1)415.1. The Adjusted Improvement CriterionThe improvement functions developed in the previous chapter now need tobe adapted for this criterion. Equation (5.1) can be expressed in terms ofthe improvement criterion, InA, in the following wayAInA = − log10(f(xnAbest)− f∗f(xnrr )− f∗)= − log10(f(xnAbest)− f∗f(xnrr )− f∗f(xnAr )− f∗f(xnAr )− f∗)= − log10(f(xnAbest)− f∗f(xnAr )− f∗f(xnAr )− f∗f(xnrr )− f∗)= − log10(f(xnAbest)− f∗f(xnAr )− f∗)− log10(f(xnAr )− f∗f(xnrr )− f∗)= InA − log10(f(xnAr )− f∗f(xnrr )− f∗).The adjusted improvement can now be estimated as a function of d and nasAˆIA(d, n) = IˆA(d, n)− log10(f(xnAr )− f∗f(xnrr )− f∗). (5.2)the first term of this expression is modelled in Section 4.2.2 and the equationsIˆA(d, n) for each algorithm can be found in Table 4.7. The second term inthis expression is unknown. The presence of f and f∗ in this expressionpresent a computational challenge. The approach used here is to assumethatf(xnAr )− f∗f(xnrr )− f∗≈‖xnAr − x∗‖‖xnrr − x∗‖.Substituting into the second term of (5.2) yields− log10(‖xnAr − x∗‖‖xnrr − x∗‖).The assumption made earlier allows the removal of function values from theabove expression, thus simplifying the computation. Note that this newterm is a measure of how much closer a random sample can get to x∗ if nAfunction calls are allowed as opposed to nr function calls. Knowledge of x∗is not crucial here and can be replaced by a randomly selected point x¯ ∈ X.Then− log10(‖xnAr − x¯‖‖xnrr − x¯‖)can be quickly approximated using a Monte Carlo method. This methodevaluates the above expression a large number of times randomly and thencomputes the average value. Thus (5.2) can be computed.425.2. Case Study: Small Time Budgets5.2 Case Study: Small Time BudgetsIn the first case study a relatively small budget of T = 3600s (1 hour)with tf ∈ {5, 10, 20, 60} and d ∈ {2, 8, 15, 25}. The results of the rubric canbe seen in Table 5.1.Table 5.1: Results of the rubric for a small time budget of 1 hour and varyingtf and d values. The best performing algorithm for each tf , d pair can beseen in bold.tf (s) dGA NM sNOMADr EGO KGCPnA AIA nA AIA nA AIA nA AIA nA AIA5 2 719 2.4 719 3.8 719 18.1 697 1 160 2.710 2 359 1.9 359 3.8 359 13.3 354 1 144 2.820 2 179 1.6 179 3.8 179 9.9 178 1.1 116 2.860 2 59 1.4 59 3.8 59 5.6 59 1.1 56 2.65 8 719 2.3 719 2.3 717 7.2 480 1.2 123 2.110 8 359 1.8 359 2.3 359 4.3 286 1.2 114 2.120 8 179 1.5 179 2.3 179 2.5 159 1.2 98 2.160 8 59 1.4 59 2.3 59 0.9 57 1.3 54 25 15 719 2.2 719 1.6 716 3.6 261 1.5 110 1.910 15 359 1.7 359 1.6 359 1.3 190 1.5 103 1.820 15 179 1.4 179 1.6 179 0.4 123 1.5 91 1.860 15 59 1.2 59 1.6 59 0 51 1.5 53 1.75 25 719 2 719 1 713 1.7 121 1.8 101 1.610 25 359 1.5 359 1 358 0.2 103 1.8 96 1.620 25 179 1.3 179 1 179 -0.2 79 1.8 85 1.660 25 59 1.1 59 1 59 0.5 41 1.8 52 1.4For low dimensions, d ≤ 8, the improvement gained by sNOMADr isexpected to be significantly greater than that of the other algorithms. Thedifference in performance is more noticeabe at smaller dimensions and higherbudgets. As dimension increases, the performance of sNOMADr drops quitedrastically, so that at dimensions greater than 15, this algorithm is expectedto provide less improvement than any of the other algorithms. For higherdimensions, d ≥ 15, the BGO algorithms are expected to provide the mostimprovement. The difference between the two BGO algorithms is that theimprovement of the EGO algorithm is expected to increase as d increaseswhile for the KGCP algorithm the improvement is expected to decrease.Thus at dimension 25, the EGO algorithm outperforms KGCP, while at di-mension 15, KGCP is the preferred algorithm according to the rubric. Atlower dimensions, NM competes with sNOMADr with its expected improve-ment being greater than that of sNOMADr in one instance. The improve-ment expected to be gained by GA is never greater than that of the other435.2. Case Study: Small Time Budgetsalgorithms although it is a close competitor in some instances.The claims of the rubric can now be tested against real data. The algo-rithms were used to optimize functions from two scenarios found in Table 5.1:(tf , d) = (20, 2) and (tf , d) = (60, 25). Each algorithm optimized three func-tions: Rastrigin, sum of powers, and Rosenbrock. Three repeats were used,thus creating nine data points for each algorithm under a specific optimiza-tion scenario. The results are used to test both the time and improvementpredictions of the rubric. Figure 5.1 shows two box plots of the percentdifference between the time required by each algorithm and the 3600s timebudget.Figure 5.1: Box plots of the percent difference between the time used byeach algorithm and the time budget of 3600s.The time used by the GA, NM, and sNOMADr algorithms is within 1%of the time budget. For the KGCP algorithm, this difference is closer to10% in both plots. For the EGO algorithm, the difference is less than 1%at dimension 2 but close to 25% at dimension 25. Being more expensive,the KGCP and EGO algorithms require more time for each iteration. Usingup another iteration would likely take more time than is allotted and sothe algorithms quit sooner. The time required by the EGO algorithm tocomplete an iteration increases with dimension. This explains the largedifference between the time used by the EGO algorithm at dimension 2 andat dimension 25. Note that the test functions did not require 20 secondsto run. The values in the box plot were obtained by measuring the timerequired by the algorithm and then adding 20nA to each value. It shouldalso be noted that the time budget was not exceeded by any one of thealgorithms.Figure 5.2 shows two box plots of improvement measured for each algo-rithm tested for functions where (d, tf ) = (2, 20) and (d, tf ) = (25, 60).For the box plot on the left, the measured improvement reflects the445.3. Case Study: Large Time BudgetsFigure 5.2: Box plots of the measured improvement gained by each algorithmand the predicted improvement shown in blue.predictions of the rubric for the most part. One notable deviation is in thesNOMADr algorithm, where the measured improvement is smaller than thatpredicted by the rubric. The overall trend, however, is quite accurate. Themedian improvement values for NM and sNOMADr are very similar, but thehigher upper bound for the sNOMADr algorithm supports the choice of therubric. In the box plot on the left, the measured improvement for each of thealgorithms was lower than that predicted by the rubric. The most notabledifference is in the EGO algorithm, which was preferred by the rubric, butits measured improvement is certainly not greater than that of the otheralgorithms. The KGCP algorithm is the algorithm with the best measuredimprovement and the most accurately predicted by the rubric. Note thatthe dataset from which the time and improvement functions were estimatedin Chapter 4 does not contain any functions with dimension 25 and functioncall budget as low as 59. Using the rubric in this instance is done with theunderstanding that extrapolation is being used. This may in part explainthe disparity between the rubric and measured improvement.5.3 Case Study: Large Time BudgetsIn this section, a case study of optimization problems with large timebudgets and objective functions with long running times is considered. Thetime budget allotted here is one week (604800s). Objective functions areassumed to have average running times between 20 minutes and 2 hours.The same dimensions considered in the previous sections are consideredhere as well. Table 5.2 contains the improvement and budget as predictedby the rubric.The large time budget and average running time of the function make455.3. Case Study: Large Time BudgetsTable 5.2: Results of the rubric for a large time budget of 1 week and varyingtf and d values. The best performing algorithm for each tf , d pair can beseen in bold.tf (s) dGA NM sNOMADr EGO KGCPnA AIA nA AIA nA AIA nA AIA nA AIA3000 2 201 1.6 201 3.8 201 10.4 201 1.1 199 3.25000 2 120 1.5 120 3.8 120 8.2 120 1 120 2.98000 2 75 1.5 75 3.8 75 6.4 75 1 75 2.710000 2 60 1.4 60 3.8 60 5.6 60 1 60 2.63000 8 201 1.5 201 2.3 201 2.8 201 1.2 197 2.65000 8 120 1.4 120 2.3 120 1.8 120 1.3 120 2.28000 8 75 1.4 75 2.3 75 1.2 75 1.2 75 2.110000 8 60 1.3 60 2.3 60 1 60 1.3 60 23000 15 201 1.4 201 1.6 201 0.5 201 1.5 196 2.35000 15 120 1.3 120 1.6 120 0.1 120 1.5 120 28000 15 75 1.3 75 1.6 75 0 75 1.5 75 1.810000 15 60 1.2 60 1.6 60 0 60 1.5 60 1.73000 25 201 1.3 201 1 201 -0.2 199 1.8 194 25000 25 120 1.2 120 1 120 0 120 1.8 119 1.78000 25 75 1.1 75 1 75 0.3 75 1.8 75 1.510000 25 60 1.1 60 1 60 0.5 60 1.8 60 1.5trivial the time used by the algorithm. Thus all algorithms are more or lesson the same playing field with regard to function call budgets. A generalpattern can be found in the algorithms recommended by the rubric. Forsmaller dimensions, algorithms in the DFO category are expected to performbetter, while for higher dimensions, algorithms from the BGO family arepreferred. The trend in preferred algorithms is similar for large time budgetsand small time budgets.To test the rubric, a procedure identical to the one described for smalltime budgets in Section 5.2, was used. Figure 5.3 shows box plots of thepercent difference between the time used by each algorithm and the timebudget.In the box plots, the time used by each of the algorithms differs fromthe time budget by approximately 1% or less. The effect of the expensivecomputations required by the BGO algorithms is less prevalent in these boxplots than in Figure 5.1 due to the larger time budget. In the box plot onthe right the time used by the BGO algorithms exceeds the time budget,albeit not by much.Figure 5.4 shows a box plot of measured improvement for optimizationproblems with (d, tf ) = (8, 3000) and (d, tf ) = (15, 3000).In both of these box plots, the measured improvement is lower than that465.3. Case Study: Large Time BudgetsFigure 5.3: Box plots of the percent difference between the time used byeach algorithm and the time budget of 604800s.Figure 5.4: Box plots of the measured improvement gained by each algorithmand the predicted improvement shown in blue.predicted by the rubric. A very noticeable difference is for the EGO algo-rithm in the left box plot, where the measured improvement is significantlysmaller than the predicted improvement with very little variance. A clear-cut best algorithm cannot be determined from either of the box plots. Thereis a large amount of overlap in the results between the five algorithms.With the exception of the NM algorithm in left box plot of Figure 5.1,the rubric predicted a larger improvement than was measured in the data.As was noted in Section 4.2.2, function type seems to affect improvement.This could explain the optimistic predictions of the rubric with respect toimprovement. The functions used in the test could have properties that leadto below average measures of improvement.475.4. Analysis of Rubric5.4 Analysis of RubricUsing the rubric for the two case studies in Sections 5.2 and 5.3 revealedsome trends in preferred algorithms with respect to input dimension andfunction call budget. However, testing the claims of the rubric against realdata revealed some concerns regarding its reliability. In this section, thetrends and validity of the rubric will be discussed.The basic trend seen in the rubric is that DFO algorithms are preferredat “lower” dimensions and BGO algorithms are preferred at “higher” di-mensions. The definition of lower and higher dimensions is not exact basedon the data from Table 5.1 and Table 5.2, however, it appears that the cutoff is somewhere between d = 8 and d = 15. A possible conclusion is thatdimension is not the only determining factor. Since in the tables, func-tions optimized at higher dimensions were not given an appropriately largebudget. It may be that the ratio n/d is more important than the dimension.In the tables of the previous sections, BGO algorithms were preferredwhennd∈ (2, 13).Figure 5.5 shows box plots of measured improvement values from the datasetused in Section 4.2 for d ∈ [2, 30] and n ≤ 10d for each of the algorithms.Figure 5.5: Box plot of the improvement gained by each of the algorithms forthe dataset described in Chapter 4. Only data where n ≤ 10d are considered.485.4. Analysis of RubricThe median values for the BGO algorithms are greater than those of thecompeting algorithms, but the difference is not that significant, consideringthe large variances and the notable outliers. In Figure 5.6, box plots areshown for dimensions 10 and smaller.Figure 5.6: Box plot of the improvement gained by each of the algorithmson real functions where d ≤ 10 and n ≤ 10d.The difference in median values for this subset of the data is even lesssignificant. The median performance of KGCP is still slightly greater thanfor the other algorithms, but for EGO this is no longer true. Figure 5.7shows box plots for dimensions 10 and greater.For this subset of the data, the median performance of BGO algorithmsis greater but, once again, the large variances may render the differencesinsignificant.Regarding time, the amount required by BGO algorithms is far greaterthan that of the other algorithms. Therefore BGO algorithms are allowed asmaller function call budget when the running time of the objective functionis small. However, if the average running time of the objective function islarge, the function call budgets are similar for all of the algorithms. Thus,the disadvantage of fewer function calls faced by BGO algorithms is elim-inated. In Table 5.1 and Table 5.2, when the time required to run thefunction is 60 seconds or greater, the difference in function call budgets is495.4. Analysis of RubricFigure 5.7: Box plot of the improvement gained by each of the algorithmson real functions where d ≥ 10 and n ≤ 10d.minor. The difference in function call budgets is not always a limitationfor the BGO algorithms since there are examples, in Table 5.1, where theimprovement in KGCP is greater according to the rubric than the otheralgorithms in spite of a much lower function call budget.Sections 5.2 and 5.3 present specific case studies and the discussion pre-sented in this section is based on these case studies. While other case studiescould be considered it is likely that the results will be similar to those ob-served in this chapter. Namely the time prediction will be quite accurateespecially when T is large enough, the trends regarding the DFO and BGOimprovements will be similar, and the large variance in improvement will bepresent.50Chapter 6ConclusionThe rubric constructed in this thesis estimates the amount of functioncalls that can be obtained under a given time budget and it also predicts theimprovement that could be obtained within the time budget. How accurateare these estimates and is this rubric reliable?The assessment of time in Section 4.2.1 showed clear trends, strong re-lationships, and good fits for most of the algorithms. The case studies inChapter 5 showed that, with respect to time, the rubric did not deviatemuch from the real data. These observations suggest that the predictionsof function call budgets from the rubric are quite reliable. For the BGOalgorithms at small time budgets it is inconclusive whether the predictionof function call budgets is reliable or not. The large difference between thetime budget and the time used by the BGO algorithms could be due to theexpensive nature of these algorithms or due to the poor estimates by therubric.With regard to improvement, the rubric is much less reliable. In Sec-tion 4.2.2 it was noted that the improvement data for each algorithm exhib-ited large variance and in Chapter 5, there was a notable difference betweenthe improvement according to the rubric and the measured improvement.In some instances the algorithm suggested by the rubric was clearly out-performed by other algorithms in the measured improvement. The largevariance present in the improvement data makes it difficult to determinethe best choice for an algorithm. The rubric may be more reliable in in-stances where the difference in improvement values, as suggested by therubric, is large.The large variance present in the improvement values is likely due to fac-tors other than those considered: d and n. In Section 4.2.1 and Section 4.2.2it was shown that for both time and improvement, variance decreased whenmeasured for data from specific functions. Function type is a factor notconsidered in this analysis, but appears to have a considerable effect on im-provement and a lesser effect on time. Figure 6.1 shows the improvementvalues for the KGCP algorithm for a specific dimension and budget.A distinct break in the data can be seen at a value of 1. This break51Chapter 6. ConclusionFigure 6.1: Box plot of the improvement gained by the KGCP algorithm ford = 10 and n = 200.is determined solely by function type. Two functions that were more chal-lenging for this solver reported improvement values below the break, whileresults for the remaining function types are found above the break. Thisphenomenon is not unique to the KGCP algorithm, dimension 10, or thefunction value budget of 200. Similar results can be found throughout theexperimental data. In Table 4.8 the variance in improvement for the NM andsNOMADr algorithms was larger than for the other algorithms even whenfunction type was taken into account. The improvement of these local opti-mization algorithms is likely affected by the starting point. The conclusionis that the variables d and n are insufficient for describing the improvementin function value gained by a particular algorithm.This research was conducted using specific implementations of the algo-rithms from Chapter 3. For improvement, and more so for time, the resultswill almost surely differ with implementations. For example, the time usedby KGCP and EGO should be comparable, theoretically, due to the useof the expensive GP in both algorithms. However, in practice, at higherdimensions and budgets the time required by the KGCP is almost twice as52Chapter 6. Conclusionlarge as that of EGO. This is likely due to approaches in implementation.Each of the five algorithms were used with more or less default parameters.An experienced user of each algorithm may have knowledge of preferredparameter settings and thus improve the performance of the algorithm. Itis, however, unlikely that this improvement will be significant or affect thegeneral trends observed in this thesis.The data behind the results in this thesis considers a particular rangeof dimensions and function call budgets. The results of the rubric outsideof this domain cannot be fully trusted. Although, large deviations from theobserved trends are unlikely near the boundaries of the test domain. Fartheraway from the test domain, the results may not follow the relationshipsexhibited in this thesis.The somewhat inconclusive results of this experiment suggest some di-rections for further work. An analysis where function type is consideredmay increase the accuracy of the rubric. Optimization problem classifica-tion tools have been available for some time in the Dr. Ampl meta solver,[FGK87], and could be used in a rubric of this sort. For optimizers regularlyworking on black-box functions in specific applications, such an analysis maybe useful, since the black-box functions may be of a similar type.In Chapter 5, it was noted that the large variance made it challengingto determine a clear-cut winner. An analysis taking into account worstcase performance and best case performance may assist in “tie-breaking”scenarios.53Bibliography[ABLD08] C. Audet, V. Bechard, and S. Le Digabel. Nonsmooth op-timization through mesh adaptive direct search and variableneighborhood search. Journal of Global Optimization, 41:299–318, 2008. → pages 13[AD06] C. Audet and Jr. J. E. Dennis. Mesh adaptive direct searchalgorithms for constrained optimization. SIAM Journal on Op-timization, 17(1):188–217, 2006. → pages 13[BBC+02] R. A. Berk, P. Bickel, K. Campbell, R. Fovell, S. Keller-McNulty, E. Kelly, R. Linn, B. Park, A. Perelson, N. Rouphail,et al. Workshop on statistical approaches for the evaluation ofcomplex computer models. Statistical Science, pages 173–192,2002. → pages 1[Bul11] A. D. Bull. Convergence rates of efficient global optimizationalgorithms. J. Mach. Learn. Res., 12:2879–2904, November2011. → pages 15, 17[CKAEY14] M. Campbell-Kelly, W. Aspray, N. Ensmenger, and J. R. Yost.Computer: a history of the information machine. WestviewPress, Boulder, Colorado, 2014. → pages 1[CMMY91] C. Currin, T. Mitchell, M. Morris, and D. Ylvisaker. Bayesianprediction of deterministic functions, with applications to thedesign and analysis of computer experiments. Journal of theAmerican Statistical Association, 86(416):953–963, 1991. →pages 6[DD02] K. R. Davidson and A. P. Donsig. Real analysis with real ap-plications. Prentice Hall, 2002. → pages 3[DT91] Jr. J. E. Dennis and V. Torczon. Direct search methods onparallel machines. SIAM Journal on Optimization, 1(4):448–474, 1991. → pages 1254Bibliography[EAVH91] A.E. Eiben, E.H.L. Aarts, and K.M. Van Hee. Global con-vergence of genetic algorithms: A markov chain analysis. InH. P. Schwefel and R. Mnner, editors, Parallel Problem Solvingfrom Nature, volume 496 of Lecture Notes in Computer Science,pages 3–12. Springer Berlin Heidelberg, 1991. → pages 11[Far02] J. J. Faraway. Practical regression and anova using r., 2002.→ pages 23[FGK87] R. Fourer, D. M. Gay, and B. W. Kernighan. AMPL: A mathe-matical programming language. AT&T Bell Laboratories Mur-ray Hill, NJ 07974, 1987. → pages 53[FPD09] P. Frazier, W. Powell, and S. Dayanik. The knowledge-gradientpolicy for correlated normal beliefs. INFORMS journal onComputing, 21(4):599–613, 2009. → pages 15, 16[GPR+13] D. Ginsbourger, V. Picheny, O. Roustant, C. Chevalier, andT. Wagner. Package ‘diceoptim’. 2013. → pages 15, 17[HJ61] R. Hooke and T. A. Jeeves. “ direct search” solution of numeri-cal and statistical problems. Journal of the ACM, 8(2):212–229,April 1961. → pages 12[HM03] P. Hansen and N. Mladenovic. Variable neighbourhood search.Handbook of Metaheuristics, Dordrecht, Kluwer Academic Pub-lishers, 2003. → pages 13[Hol75] J. H. Holland. Adaptation in natural and artificial systems: Anintroductory analysis with applications to biology, control, andartificial intelligence. University of Michigan Press, Oxford,England, 1975. → pages 10, 11[JSW98] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global op-timization of expensive black-box functions. Journal of GlobalOptimization, 13:455–492, 1998. → pages 1, 2, 7, 14, 15[Kel99] C. T. Kelley. Iterative methods for optimization, volume 18.Siam, 1999. → pages 12[LRWE98] J. C. Lagarias, J. A. Reeds, M. H. Wright, and Wright P. E.Convergence properties of the nelder-mead simplex method inlow dimensions. SIAM Journal on Optimization, 9(1):112–147,1998. → pages 1255Bibliography[LSW09] J. L. Loeppky, J. Sacks, and W. J. Welch. Choosing the samplesize of a computer experiment: A practical guide. Technomet-rics, 51(4):366–376, 2009. → pages 5[McK98] K. I. M. McKinnon. Convergence of the nelder-mead simplexmethod to a nonstationary point. SIAM Journal on Optimiza-tion, 9(1):148–158, 1998. → pages 12[MGH81] J. J. More´, B. S. Garbow, and K. E. Hillstrom. Testing uncon-strained optimization software. ACM Transactions on Mathe-matical Software (TOMS), 7(1):17–41, 1981. → pages 22[MS05] M. Molga and C. Smutnicki. Test functions for optimizationneeds. 2005. → pages 22[NM65] J.A. Nelder and R. Mead. A simplex method for function min-imization. Computer Journal, 7(4):308–313, 1965. → pages11[RN14] J. S. Racine and Z. Nie. crs: Categorical Regression Splines,2014. R package version 0.15-23. → pages 13[Scr13] L. Scrucca. GA: A package for genetic algorithms in R. Journalof Statistical Software, 53(4):1–37, 2013. → pages 11[SFP11] W. Scott, P. Frazier, and W. Powell. The correlated knowledgegradient for simulation optimization of continuous parametersusing gaussian process regression. SIAM Journal on Optimiza-tion, 21(3):996–1026, 2011. → pages 2, 7, 15, 16[SWN03] T. J. Santner, B. J. Williams, and W. Notz. The design andanalysis of computer experiments. Springer Science & BusinessMedia, 2003. → pages 1, 5, 7, 8[Tor97] V. Torczon. On the convergence of pattern search algorithms.SIAM Journal on optimization, 7(1):1–25, 1997. → pages 13[VB11] R. Varadhan and H. W. Borchers. Package ‘dfoptim’. 2011. →pages 12[Wri96] M. H. Wright. Direct search methods: Once scorned, nowrespectable. Pitman Research Notes in Mathematics Series,pages 191–208, 1996. → pages 1256Bibliography[WS07] G. G. Wang and S. Shan. Review of metamodeling techniquesin support of engineering design optimization. Journal of Me-chanical Design, 129(4):370–380, 2007. → pages 257
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- A rubric for algorithm selection in optimization of...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
A rubric for algorithm selection in optimization of black-box functions Sremac, Stefan 2015
pdf
Page Metadata
Item Metadata
Title | A rubric for algorithm selection in optimization of black-box functions |
Creator |
Sremac, Stefan |
Publisher | University of British Columbia |
Date Issued | 2015 |
Description | When optimizing black-box functions, little information is available to assist the user in selecting an optimization approach. It is assumed that prior to optimization, the input dimension d of the objective function, the average running time tf of the objective function and the total time T allotted to solve the problem, are known. The intent of this research is to explore the relationship between the variables d, tf, and T and the performance of five optimization algorithms: Genetic Algorithm, Nelder-Mead, NOMAD, Efficient Global Optimization, and Knowledge Gradient for Continuous Parameters. The performance of the algorithms is measured over a set of functions with varying dimensions, function call budgets, and starting points. Then a rubric is developed to assist the optimizer in selecting the most appropriate algorithm for a given optimization scenario. Based on the information available prior to optimization, the rubric estimates the number of function calls available to each algorithm and the amount of improvement each algorithm can make on the objective function under the function call constraint. The rubric reveals that Bayesian Global Optimization algorithms require substantially more time than the competing algorithms and are therefore limited to fewer function call budgets. However, if the objective function requires a large running time, this difference becomes negligible. With respect to improvement, the rubric suggests that Derivative Free Optimization algorithms are preferred at lower dimensions and higher budgets, while Bayesian Global Optimization algorithms are expected to perform better at higher dimensions and lower budgets. A test of the claims of the rubric reveals that the estimate of function call budget is accurate and reliable, but the improvement is not estimated accurately. The test data demonstrates large variability for the measure of improvement. It appears that the variables d, tf, and T are insufficient for describing the expected performance of the assessed algorithms, since variables such as function type and starting point are unaccounted for. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2015-06-24 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
DOI | 10.14288/1.0166321 |
URI | http://hdl.handle.net/2429/53956 |
Degree |
Master of Science - MSc |
Program |
Mathematics |
Affiliation |
Irving K. Barber School of Arts and Sciences (Okanagan) Computer Science, Mathematics, Physics and Statistics, Department of (Okanagan) |
Degree Grantor | University of British Columbia |
GraduationDate | 2015-09 |
Campus |
UBCO |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2015_september_sremac_stefan.pdf [ 6.33MB ]
- Metadata
- JSON: 24-1.0166321.json
- JSON-LD: 24-1.0166321-ld.json
- RDF/XML (Pretty): 24-1.0166321-rdf.xml
- RDF/JSON: 24-1.0166321-rdf.json
- Turtle: 24-1.0166321-turtle.txt
- N-Triples: 24-1.0166321-rdf-ntriples.txt
- Original Record: 24-1.0166321-source.json
- Full Text
- 24-1.0166321-fulltext.txt
- Citation
- 24-1.0166321.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0166321/manifest