Applying fyworx juluy hhyory inWomvinutoriul cptimizution withAppliwution to EnvironmyntulgtutistiwsbyYu WangB.Sc., Simon Fraser University, 2016A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCEinThe Faculty of Graduate and Postdoctoral Studies(Statistics)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)August 2018c© Yu Wang 2018Wommittyy dugyThe following individuals certify that they have read, and recommend to theFaculty of Graduate and Postdoctoral Studies for acceptance of this thesis.Examining CommitteeTJames V. Zidek, Professor, Department of StatisticsgupyrvisorNhu D. Le, Adjunct Professor, Department of StatisticsCoAsupyrvisoriiAvstruwtWe consider the problem of optimal subset selection from a set of correlatedrandom variables. In particular, we consider the associated combinatorialoptimization problem of maximizing the determinant of a symmetric posi-tive semidefinite matrix that characterizes the chosen subset. This problemarises in many domains, such as experimental designs, regression modelling,and environmental statistics. In this thesis, we attempt to establish an effi-cient polynomial-time algorithm for approximating the optimal solution tothe problem. Firstly, we employ determinantal point processes, a specialclass of spatial point processes, to develop an easy-to-implement sampling-based stochastic search algorithm for the task of finding approximations tothe combinatorial optimization problem. Secondly, we establish theoreti-cal tools for assessing the quality of those approximations using statisticalresults from record value theory, the study of record values and relatedstatistics from a sequence of observations.iiiLuy gummuryThe increasing recognition of the association between adverse human healthconditions and many environmental substances as well as processes has ledto the need to monitor them. An important problem that arises in environ-mental statistics is the design of the locations of the monitoring stations forthose environmental processes of interest. One particular design criterionfor monitoring networks that tries to reduce the uncertainty about predic-tions of unseen processes is called the maximum-entropy design. However,this design criterion involves a hard optimization problem that is computa-tionally intractable for large data sets. The contributions of this thesis aretwofold. Firstly, we examine a probabilistic model that can be implementedefficiently to approximate the underlying optimization problem. Specifically,we compare its computational performance with exiting methods. Secondly,we attempt to establish statistically sound tools for assessing the quality ofthe approximations.ivdryzuwyThis thesis is an original work of the author, Yu Wang, under the supervisionof Dr. James V. Zidek and Dr. Nhu D. Le.Dr. Camila M. Casquilho-Resende provided her code, which was used atthe early stages of development of chapter 3. Dr. Jon Lee and Dr. MauriceQueyranne provided valuable suggestions on parts of chapter 3 and recom-mended a software for computational experiments conducted in the chapter.A version of chapter 3 was submitted for peer review. The manuscript isnamed “Stochastic Approximation Algorithms in Combinatorial Optimiza-tion” by Wang, Y., Le, N. D., Zidek, J. V. The idea was jointly developedby myself, Dr. Nhu D. Le, and Dr. James V. Zidek. I conducted all thecomputational experiments and the majority of the writing. An electronicversion can be found online at arxiv:1709.00151.vhuvly oz WontyntsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiifay mummary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivjreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vnable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . vifist of nables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixfist of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xfist of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv1 cntroduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Maximum-Entropy Design Problem . . . . . . . . . . . . . . 21.2 Spatial Point Processes . . . . . . . . . . . . . . . . . . . . . 41.3 Record Values Theory . . . . . . . . . . . . . . . . . . . . . . 5L iverview of the Combinatorial iptimization jroblem . . 62.1 Exact Methods for Finding the Optimum . . . . . . . . . . . 72.1.1 Branch-and-bound . . . . . . . . . . . . . . . . . . . . 72.2 Heuristics for Approximating the Optimum . . . . . . . . . . 102.2.1 Greedy algorithm . . . . . . . . . . . . . . . . . . . . 102.2.2 Genetic algorithm . . . . . . . . . . . . . . . . . . . . 12vihubly of Wontynts2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 Determinantal joint jrocesses . . . . . . . . . . . . . . . . . 143.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.1.1 General Poisson point processes . . . . . . . . . . . . 153.1.2 General determinantal point processes . . . . . . . . . 153.1.3 k-DPPs . . . . . . . . . . . . . . . . . . . . . . . . . . 173.2 Sampling from k-DPPs . . . . . . . . . . . . . . . . . . . . . 173.2.1 Independent sampling . . . . . . . . . . . . . . . . . . 183.2.2 MCMC sampling . . . . . . . . . . . . . . . . . . . . 213.3 Sampling-based solution strategy . . . . . . . . . . . . . . . . 233.4 Computational Performances . . . . . . . . . . . . . . . . . . 243.4.1 Optimizing maximum-entropy designs for monitoringnetworks . . . . . . . . . . . . . . . . . . . . . . . . . 243.4.2 Synthetic data with a large number of points . . . . . 263.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284 lecord palues nheory . . . . . . . . . . . . . . . . . . . . . . . 304.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 304.1.1 Definitions and notation . . . . . . . . . . . . . . . . 314.2 Basic Distributional Results . . . . . . . . . . . . . . . . . . 334.2.1 Record values from the classical model . . . . . . . . 334.2.2 Record times and related statistics . . . . . . . . . . . 354.2.3 Markov chains . . . . . . . . . . . . . . . . . . . . . . 374.3 k-DPP Approximations as Record Values . . . . . . . . . . . 384.3.1 Jittering the log-determinants . . . . . . . . . . . . . 384.3.2 Distribution of the jittered log-determinants . . . . . 394.3.3 Log-determinants records . . . . . . . . . . . . . . . . 434.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47O Concluding lemarks . . . . . . . . . . . . . . . . . . . . . . . . 495.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 505.1.1 MCMC k-DPP sampling . . . . . . . . . . . . . . . . 50viihubly of Wontynts5.1.2 Combination of k-DPP and other stochastic searchmethods . . . . . . . . . . . . . . . . . . . . . . . . . 50Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51AppynxiwysA l Code for mampling from a kGDjj . . . . . . . . . . . . . . 58B jython Code for aenetic Algorithm . . . . . . . . . . . . . . 62viiiList oz huvlys3.1 Results obtained from 100 realizations of k-DPP and GA forboth the real and synthetic data. Sample size refers to thenumber of DPP simulations and the number of generationsof GA for each realization; SD refers to standard deviation. . 284.1 The first two columns provides a summary of the occurrencetime of records and their values; the second and the thirdcolumns provides the conditional probabilities given the cur-rent record values and the conditional probabilities given thegreedy approximation respectively; the last column gives theexpected waiting times for the next record. . . . . . . . . . . 46ixList oz Figurys3.1 From left to right, realizations of a homogeneous Poissonpoint process, DPP, and DPP configuration with the high-est probability on a 2-D fine grid. From Kulesza et al. (2012). 163.2 Geometric interpretations of the determinant of a kernel ma-trix. The leftmost two graphs from left represent the deter-minants of a matrix consisting of two column vectors, andthe graph on the right represents the case with three columnvectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173.3 Trace of 10O 000 log-determinants generated by a 10-DPP. . . 233.4 Occurrence of maximum log–determinants of the restrictedconditional hypercovariance matrix when increasing the num-ber of simulated 10-DPP samples. The optimum solution ismarked by the red horizontal dashed red line. The inset showsa zoomed-in view of the first 100 samples. . . . . . . . . . . . 253.5 Occurrence of maximum log-subdeterminants of the synthetickernel matrix when increasing the number of simulated 60-DPP samples. The greedy and the GA solutions are markedby the green and purple solid horizontal lines, respectively.The inset shows a zoomed-in view of the first 1O 000 samples. 274.1 Top row: Histograms of k-DPP log-determinants with sam-ple sizes 100O 000 and 500O 000 respectively; bottom row: Ker-nel density estimates of k-DPP log-determinants with samplesizes 100O 000 and 500O 000 respectively. . . . . . . . . . . . . . 40xList of Figurys4.2 From top to bottom: histograms and estimated theoreticaldensity functions (left) and quantile-to-quantile plot (right)displayed for the data above 90lh percentile for GPD, Weibull,Normal distributions, respectively. . . . . . . . . . . . . . . . 43xiList oz Algorithms1 Branch-and-bound algorithm . . . . . . . . . . . . . . . . . . 102 Greedy algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 113 Genetic algorithm . . . . . . . . . . . . . . . . . . . . . . . . 134 Sampling from a DPP . . . . . . . . . . . . . . . . . . . . . . 195 Sampling from a k-DPP . . . . . . . . . . . . . . . . . . . . . 216 MCMC sampling from a k-DPP . . . . . . . . . . . . . . . . . 227 Sampling-based solution strategy using k-DPP . . . . . . . . 24xiiAwknowlyxgymyntsI am deeply grateful to my supervisors Prof. Jim Zidek and Dr. Nhu Le forthe opportunities and immense support they have provided me throughoutthe course of my graduate work at UBC. I am also indebted to my dearfriends at UBC and SFU who believed in me and gave me continuous mentalsupport to carry on this journey.xiiiDyxiwutionThis thesis is dedicated to my parents and YJX.xivWhuptyr EIntroxuwtionWe consider the situation where we wish to make inferences regarding a setof random variables from observations made on a subset of these randomvariables. This is closely related to the concept of experimental design inclassical statistics, in which we use a carefully selected experiment to be per-formed by the experimenter. The aim of this so-called design is to optimizea process or system by performing each experiment and to draw conclusionsabout the significant behaviour of the studied object from the results of theexperiment. In environmental statistics, for example, the experiment yieldsobservations of a certain environmental process (such as temperature, airpollution, rainfall, etc) taken from a set of monitoring stations. Maintainingall stations is costly and hence infeasible, and one may only need to selectonly a subset of them. Another example is given by variable selection inregression models, where the problem refers to the task of finding a smallsubset of the available independent variables that can effectively predict thedependent variable. In this chapter, we describe three fields in statisticsthat seem unrelated at first glance - a design strategy called the maximum-entropy design, a special class of spatial point processes called determinantalpoint processes, and the theory of record values. Because chapters 3 and 4contain detailed developments of the latter two fields, we forego these de-tails here. Later chapters then extends these methodologies, especially thoseconcerning determinantal point processes and record values, to attempt tosolve the maximum-entropy design problem.1EBEB aufiimumAEntropy Dysign droblymEBE aufiimumAEntropy Dysign drovlymWell-defined optimality criteria are required to evaluate designs. Formally,consider a set b of n points, called the design space, and a design size k,such that k ≤ n. Our goal is to select a subset K of k points, such that theobservations made at these points are maximally informative. Informationcan be measured by entropy, for example, and our goal is then to choosea subset that minimizes the resulting entropy, i.e., maximizes the amountby which uncertainty will be reduced by the information provided by theexperiment. This is referred to as the maximum-entropy design problem(or sometimes, the maximum-entropy sampling problem). The problem wasfirst introduced by Caselton and Zidek (1984) who applied it to the designof environmental monitoring networks and later considered by Shewry andWynn (1987); Guttorp et al. (1993); Wu and Zidek (1992).We provide a brief introduction into the problem and its application inspatial statistics. First note that the uncertainty about a random vector scan be represented by the entropy of its distributionH(s) = EY[− log(z(s)h(s))]Owhere h(s) denotes a reference density, which does not need to be integrable,allowing the entropy to be invariant under one-to-one transformations of thescale of s (Jaynes, 1963). In hierarchical models for environmental pro-cesses, s is usually defined conditionally on some hyperparameters, whichwe denote by . The total entropy can then be defined asH(sO ) = E[− log(z(sO )hY; (sO ))]= E[− log(z(s| )z( )hY(s)h ( ))]= H(s| ) +H( )NWe can define the total entropy in the context of the monitoring networkdesign by first partitioning the data ass = (s(u)Os(g)), wheres(u) denotes2EBEB aufiimumAEntropy Dysign droblymthe measurements at potential sites, currently ungauged, and s(g) relatesto the existing sites, referred to as gauged locations. We then haveH(s(u)Os(g)O ) = H(s(u)|s(g)O ) +H( |s(g)) +H(s(g))= H(s(u)O |s(g)) +H(s(g))NThe design criterion is based on minimizingH(s(u)O |s(g)), which mea-sures the uncertainty about s(u) after s(g) is observed. Since the totalentropy H(s(u)Os(g)O ) is fixed, an equivalent criterion is to maximizes(g). Moreover, the same criterion for maximizing s(g) would be similarlyobtained had we decomposed H(s(u)Os(g)) instead of H(s(u)Os(g)O ).Rarely are networks designed from scratch, so the redesign of an existingnetwork is often of interest. When the goal is to augment a network, theobjective is to find a subset of u+ sites among the u ungauged ones to addto the existing network. We denote the remaining sites that are not theselected as u−. The resulting network will then consists of (s(u+)Os(g)).We haveH(s(u+)Os(u−)Os(g)) = H(s(u+)Os(u−)|s(g)) +H(s(g))= H(s(u−)|s(u+)Os(g)) +H(s(u+)Os(g))NSince the total entropy H(s(u+)Os(u−)Os(g)) is fixed, it will be optimalto augment the network with the u+ sites so as to maximize H(s(u+)Os(g)).For Gaussian random fields (i.e., where s follows multivariate normal withcovariance ), it can be shown (details can be found in Le and Zidek (2006))thatH(s(u+)Os(g)) = H(s(u+)|s(g)) +H(s(g))∝12log det(u+|g) +12log det(gg)Oand hence it will be optimal to maximize )2 log det(u+|g), where u+|g isthe conditional covariance matrix of random fields measured by added sitesgiven those gauged sites. The entropy criterion for augmenting the network3EBFB gputiul doint drocyssysis thusargmaxu+⊂u12log det(u+|g)Owhere C is the set of candidate sites.Concretely, consider a monitoring network redesign problem where theobjective is to choose from a set of n ungauged sites a subset of s sites toaugment the existing network. With the maximum-entropy design strategy,one would seek an s × s sub covariance matrix that has the largest log-determinant from an n× n conditional covariance.The criterion described above coincides with what is termed the X-optimal design. The problem now is to find a design from the set of allfeasible designs that maximizes information. In classical regression models,the optimization criteria are generally related to the notion of the (Fisher)information matrix. In this context, the X-optimal design objective is tomaximize the determinant of the corresponding information matrix.EBF gputiul doint drowyssysSpatial point processes are useful as statistical models in the analysis ofobserved patterns of points, where the points represent the locations of someobject of study (e..g. trees in a forest, bird nests, disease cases, or pettycrimes). Point processes also play a special role in stochastic geometry, as thebuilding blocks of more complicated random set models (such as the Booleanmodel), and as simple instructive examples of random sets. Since chapter3 provides a thorough explanation of determinantal point processes as aspecial class of spatial point processes, and we forego this development hereand instead simply provide some useful references. The work of Moller andWaagepetersen (2003) offers an up-to-date, unified collection of theoreticaladvances and applications in simulation-based inference for spatial pointprocesses, while Lavancier et al. (2015) provides details about statisticalinference procedures using determinantal point processes.4EBGB fycorx Vuluys hhyoryEBG fyworx juluys hhyoryA record value or record statistic is a value that is larger or smaller thanthe preceding values obtained in a sequence of observations of random vari-ables. The theory of records is connected very closely to the theory of orderstatistics. The term was first introduced by Chandler (1952). The study ofrecord values was subsequently carried out by a relatively small but highlytalented group of individuals and has garnered the attention of researchersfrom various fields. For example, Re´nyi (1962) has extensiveky studied onthe limit theorems for record sequences, while Glick (1978) reviewed thework of Shorrock (1972a,b, 1974) who introduced the application of recordsin modelling the slowest cars in a traffic. Besides its use in real-world ap-plications, knowledge of certain distributional functions of the record valuesequence is adequate to determine the common distribution of the underlyingobservations. As is to be expected, the catalog parallels the correspondinglist of characterizations based on order statistics. Although the literatureis extensive, we provide details in chapter 4 only on the aspects of recordsthat are most closely related to the subject of this thesis.5Whuptyr Fcvyrviyw oz thyWomvinutoriul cptimizutiondrovlymIn this chapter, we formally define the underlying optimization problemthat arises in the maximum-entropy sampling problem. Furthermore, wepresent an overview of existing methods for finding (both exactly and ap-proximately) the solution. We start by providing the mathematical defini-tions and setting the notations for future discussions.Let b = {1O 2O 3O NNNO n} where n is a positive integer. We use C to denotea real symmetric positive semidefinite matrix indexed by elements in b .Further, let S be an s-element subset of b with 1 ≤ s ≤ n. Let CS;S = CSdenote the principal submatrix of C having rows and columns indexed byelements in S (CS1;S2 with S) ̸= S2 denotes a submatrix of C having rowsindexed by S) and columns indexed by S2). Our optimization problem is todeterminemaxS2|S|5s;S⊂Ndet(CS)O (2.1)and the associated maximizer S. For ease of future discussions, we letHN (S) = det(CS) denote the determinant of the matrix CS . We also letv(CO FO YO s) = maxS2|S|5s;S⊂Ndet(CS)O (2.2)where F is an z -element subset of b (0 ≤ z ≤ n) and Y is an y-elementsubset of b (0 ≤ y ≤ n− z) satisfying F ∩ Y = ∅. Any s-element subset Sof b satisfying F ⊂ S ⊂ F ∪ Y = b is a feasible solution to the problem6FBEB Efiuct aythoxs for Finxing thy cptimumin (2.1). The set F contains indices that are forced into the solution setor that are fixed in the solution set. The set Y contains indices that areeligible for consideration into a solution set. Usually, we start with F = ∅and Y = b .Note that this can be viewed as a subset selection problem, where theselection criterion is to maximize the determinant of the matrix indexedby the subset. This is a combinatorial optimization problem, whereby theproblem consists of finding an optimal object from a finite set of objects.Usually, an exhaustive search is not feasible. In fact, as proved by Ko et al.(1995), this particular combinatorial optimization problem is NP-hard. Togain an idea of the reason why the problem is hard, imagine having a setb with 50 elements and the goal is to find a subset of 25 elements thatmaximize HN (S). The solution space contains(5(25) ≈ 1N26 × 10)4 subsets,which is computationally intractable in both time and storage.FBE Efiuwt aythoxs zor Finxing thy cptimumAlthough its NP-hardness places the problem beyond those for which theo-retically efficient algorithms exist for general instances, the theoretical resultis based on worst-case analysis and are of an asymptotic nature. So hoperests in developing practical methods for solving the problem instances withmoderate sizes. For example, efficient software implementation, such as thatimplemented in the EnjircGhah j0.4-0 R package, can solve problem(5(25)within hours. Ko et al. (1995) first introduced a branch-and-bound algo-rithm for this problem that guarantees optimality. In the remainder of thissection, we provide the details of their algorithm.FBEBE VrunwhAunxAvounxBranch-and-bound is a technique originally developed for solving integerlinear programs (see Nemhauser and Wolsey (1989) for an example). Thisalgorithm works by forming the solution space as a rooted tree with thefull candidate set at the root, and exploring branches or regions of the tree,7FBEB Efiuct aythoxs for Finxing thy cptimumwhich represent subsets of the full solution set. Before enumerating the can-didate solutions of a branch, the branch is checked against the upper andlower estimated bounds of the optimal solution. it is then discarded if itcannot produce a better solution than the best one found so far by the al-gorithm. Accordingly, the algorithm depends on efficient estimation of thelower and upper bounds of regions or branches of the search space. Thebranch-and-bound algorithm can be thought of as an improved exhaustivesearch method, and if no good bounds are available, the algorithm degener-ates to a simple exhaustive search.Ko et al. (1995) used a spectral bounding method to compute the upperbound on the optimal value. Specifically, they established the spectral upperbound asv(CO ∅O YO s) ≤ v(CO ∅O YO s) =s∏i5)i(CE)O (2.3)and if F is non-empty and CF is invertible, thenv(CO FO YO s) ≤ v(CO FO YO s) = det(CF )s−f∏i5)i(CE|F )O (2.4)whereCE|F = CE −CE;FC−)F CF;E N (2.5)Note that if C is a covariance for a set of random variables indexed byb , then CE|F is the covariance matrix of random variables indexed by Y,conditioned on the random variables indexed by F .The branch-and-bound algorithm in this case starts with an initial so-lution S∗, which is typically obtained by using a heuristic method (see sec-tion 2.2 for a discussion). A branching strategy essentially tests differentsets F and Y with the aim to eventually discovering the optimal solution.The algorithm maintains a list of active subproblems, each of which is de-termined by its sets F and Y. Initially, the only subproblem is initializedto be the original problem (i.e., F = ∅, Y = b). At each stage, an activesubproblem is selected and an upper bound is calculated using the bound-ing methods described in Equation (2.4). If the upper bound is less than8FBEB Efiuct aythoxs for Finxing thy cptimumdet(CS∗), then the subproblem is discarded. Otherwise, an index j ∈ Y isselected, and the subproblem is replaced by two new subproblems: one withj adjoined to F and deleted from Y, and the other with j deleted from Y.Then, for each of the these two potential subproblems, we check if there isone and only one S′ having s elements with F ⊂ S′ ⊂ F ∪ Y. If there is,and this S′ satisfies HN (S′) R HN (S∗), then we replace S∗ with S′ and dis-card the subproblem. A more detailed pseudo-code of the implementation isprovided in Algorithm 1. Note that one may attempt to increase the lowerbound at any step by returning the heuristic based on information obtainedfrom the current state of L. This is necessary to speed up the algorithmwhenever the size of L becomes intolerably large. The algorithm is finiteand terminates with S∗ optimal when LV = iV. We can terminate earlyif we are willing to be satisfied when LV is sufficiently close to iV.The efficiency of the algorithm essentially depends on the sharpness ofthe theoretical upper bound. Ko et al. (1995) performed computational ex-periments using real data from environmental monitoring and solved a prob-lem of size(+626)within two hours. Meanwhile, Anstreicher et al. (1996, 1999)introduced a continuous relaxation of the problem and derived a continuousnonlinear programming upper bound on HN (S). They reported success insoving the optimization problem with an n as large as 63. Furthermore,Lee (2000) discussed how the problem can be relaxed as a semidefinite pro-gram and thus the corresponding semidefinite programming upper boundcan be derived. However, it does not seem that any computational resultshave been produced with that bound. Anstreicher et al. (1996, 1999) alsodemonstrated how any bound for the complementary problem of choosinga maximum entropy set of n− s points with respect to C−) translates to abound for the original problem. Hoffman et al. (2001) showed how the orig-inal spectral bound can be strengthened by combining the use of Fischer’sinequality (Fischer, 1908) and partitioning. In any case, the branch-and-bound algorithm has been found successful in solving real problems of a sizeas large as 75. Most recently, Anstreicher (2018) introduced a new bound“linx” based on a simple but previously unexploited determinant identity.The bound can be computed very efficiently and is superior to all previously9FBFB Hyuristics for Approfiimuting thy cptimumAlgorithm 1 Branch-and-bound algorithmcnputT An initial solution S∗ and an n× n kernel matrix C.Set LV = det(CS∗), iV = v(CO ∅O bO s), L = {(CO ∅O bO s)}.while iV R LV doSelect and remove (CO F ′O Y′O s) ∈ L and compute v(CO F ′O Y′O s).if v(CO F ′O Y′O s) R LV thenSelect j ∈ Y′.Set L = L ∪ {(CO F ′O Y′ \ {j}O s)} and compute v(CO F ′O Y′ \ {j}O s).if |F ′|+ |Y′| − 1 = s thenSet S = F ′ ∪ Y′.if det(CS) R LV thenSet S∗ = S and LV = det(CS).end ifend ifSet L = L∪{(CO F ′∪{j}O Y′\{j}O s)} and compute v(CO F ′∪{j}O Y′\{j}O s).if |F ′|+ |Y′| − 1 = s thenSet S = F ′ ∪ Y′if det(CS) R LV thenSet S∗ = S and LV = det(CS).end ifend ifend ifSet iV = maxL∈L v(L).end whileiutputT Set S∗ optimal with s elements.known bounds for the maximum-entropy sampling problem on most bench-mark test problems. With linx, the branch-and-bound algorithm can solvesome instances of the problem of size n = 124.FBF Hyuristiws zor Approfiimuting thy cptimumFBFBE Gryyxy ulgorithmFor large intractable problems, heuristics, though all lacking some degreeof generality and theoretical guarantees on the proximity to the optimum,10FBFB Hyuristics for Approfiimuting thy cptimumcan be used to find reasonably good solutions. One of the best known is theDETMAX algorithm (Mitchell, 1974), based on the idea of exchanges, whichis widely used by statisticians for finding approximate X-optimal designs.Although the algorithm does not guarantee X-optimality, it has performedadequately in cases where the optimal design is known (Mitchell, 1974).Applied to the case of this work, the DETMAX begins from a set S ofthe desired size, and while possible, choose i ∈ b \ S and j ∈ S so thatHN (S ∪ {i} \ {j}) R HN (S), and replace S with S ∪ {i} \ {j}.Due to a lack of readily available alternatives, Guttorp et al. (1993) usea greedy approach, which is summarized in Algorithm 2. Ko et al. (1995)experiment with a backwards version of Algorithm 2: start with S = b ,then, for j = 1O 2O NNNO n− s, choose l ∈ S so as to maximize HN (S \ {l}), andthen remove l from S.Algorithm L Greedy algorithmcnputT Size k and an empty set S = ∅.for i = 1O N N N O k doChoose s ∈ b \ S so as to maximize HN (S ∪ {s}).Set S = S ∪ {s}.end foriutputT Set S with k elements.The greedy-type algorithm has been popular in solving such optimizationproblems. Besides its simplicity, it attains good theoretical properties forspecial functions. Consider a function z : 2N → R defined by z(S) =log(detCS). Kelmans and Kimelfeld (1983) show that this is a submodularset function, that isz(S) + z(h ) ≥ z(S ∪ h ) + z(S ∩ h ) for all SO h ⊆ bNNemhauser et al. (1978) proved that the greedy algorithm offers a goodapproximation to the optimal solution of the NP-hard optimization probleminvolving a submodular objective function. In fact, Nemhauser et al. (1978)proved that any algorithm that is allowed to only evaluate z at a polynomialnumber of sets will be unable to obtain an approximation guarantee better11FBFB Hyuristics for Approfiimuting thy cptimumthan 1 − 1=y. Although this celebrated result does provide a guarantee onthe proximity of a class of approximations to their optimum, the error boundis loose and hence it is of little practical interest.FBFBF Gynytiw ulgorithmGenetic algorithms (GA) are a family of computational models inspired byevolution (Holland, 1992). In general, these algorithms encode candidate so-lutions to a specific problem into a sample chromosome-like data structure.The algorithms then seek to improve a population of potential solutions byapplying recombination operators to these structures while preserving crit-ical information. In particular, the recombination operators are inspired byusing principles of genetic evolution such as natural selection, crossover, andmutation. Although the range of problems to which GAs have been appliedis quite broad, GAs are often viewed as methods for function optimizationproblems. Moreover, they have been known to work well for optimizing hard,black-box functions with a potentially large number of local optima (Gold-berg and Holland, 1988; Whitley, 1994).Implementation of a GA begins with a population of (typically random)potential solutions to a problem (encoded on chromosomes). These struc-tures are then evaluated and reproductive opportunities are allocated in sucha way that those chromosomes which represent a better solution to the tar-get problem are given more chances to reproduce than those chromosomeswhich are poorer solutions. The goodness of a solution is typically definedwith respect to the current population.Recently, Ruiz-Ca´rdenas et al. (2010) proposed a stochastic search pro-cedure based on a GA for approximating the optimal design for environ-mental monitoring networks. The authors test the algorithm on a set ofsimulated datasets of different sizes, as well as on a real application involv-ing the redesign of a large-scale environmental monitoring network. The GAconsidered here consists of the general steps described in Algorithm 3.12FBGB DiscussionAlgorithm 3 Genetic algorithm1: Choose at random an initial population of size b(, that is, a set of b(possible solutions S)O NNNO SN0 .2: Compute the fitness, that is, the value of the objective function HN (Si),i = 1O NNNO b(, for each of the solutions in the population.3: Crossovyr : choose a proportion, pcrgss, of solutions from the population.These solutions are selected according to a fitness-dependent selectionscheme. Among these selected solutions, pairs of solutions are formedat random.4: autution: choose a proportion, pemlhrgh, of solutions from the popu-lation with equal probability. For each selected solution, each gaugedsite may be swapped, according to a mutation probability peml, with arandomly chosen ungauged neighbour site.5: Compute the fitness of the solutions obtained by crossover and mutation.Include these solutions in the current population, creating an augmentedpopulation.6: gylywtion: the population of solutions of the new generation will beselected from this augmented population. A proportion of solutionswith best fitness, called the elite, enter directly into the new generationwhile the remaining members of the new generation are randomly chosenaccording to a certain fitness-dependent selection scheme.7: Stop the algorithm if the stop criterion is met. Otherwise, return to step3.FBG DiswussionIn this chapter we have reviewed commonly used exact algorithms andheuristics for solving the combinatorial optimization problem of interest.Although the list of algorithms we discuss here is by no means exhaustive,it covers the mainstream of methods used in machine learning, statistics,and optimization communities for tackling similar problems and many othermethods are developed based upon it. In fact, the optimization of this typeof problems has always been a hot topic (see Krause and Golovin (2014);Nemhauser et al. (1978); Fisher et al. (1978) for a representative overview).This chapter mainly describes the theoretical development of important al-gorithms, and computational experiments involving the methods discussedhere will be presented in the following chapter.13Whuptyr GDytyrminuntul dointdrowyssysDeterminantal point processes (DPPs) constitute a special class of repulsivespatial point processes and offer efficient and exact algorithms for sampling,marginalization, conditioning, and other inferential tasks. These processwere first studied by Macchi (1975), as fermion processes, to model the dis-tribution of fermions at thermal equilibrium. Borodin and Olshanski (2000)and Hough et al. (2006) subsequently popularized the name “determinantal”and offered probabilistic descriptions of DPPs. More recently, DPPs haveattracted attention in the machine learning and statistics communities. Theseminal work of Kulesza et al. (2012) provides a thorough and comprehensiveintroduction to the applications of DPPs that are most relevant to machinelearning problems. Since then, DPPs have been gaining attention in themachine learning community where their repulsive character has been usedto enforce the idea of diversity in subset selection problems. These problemshave been encountered in a variety of applications such as image searching,neuroscience and wireless network configuration. More recently, Lavancieret al. (2015) studies rigorously statistical models and methods for DPPs andhow this class of point processes can be useful models for the description ofspatial point pattern datasets where nearby points repel each other.GBE DynitionsWe view a point process as a random lowully nity subset r of a Borel setS ⊆ Rd. Denoting by rB = r∩V, the restriction of r to a set V ⊆ S, andb(V) the number of events (random elements) in rB (b(V) also called14GBEB Dynitionsthe wounting vuriuvly), the local finiteness of r means that b(V) P ∞almost surely whenever V is bounded - any bounded region contains only afinite number of points. We also assume that the point process is simply:b({x}) ≤ 1 for all x ∈ Rd. That is, with probability 1, no two points of theprocess are coincident.GBEBE Gynyrul doisson point prowyssysAs an example of spatial point processes, recall that the Poisson processcan be generalized to a general x-dimensional space. To define a uniformPoisson point process in Rd, or an inhomogeneous Poisson process in Rd,or a Poisson point process on some other space S, the following generaldefinition can be used.Let S be a space, and Λ a measure on S (we require S to be a locallycompact metric space, and Λ a measure which is finite on every compact setand which has no atoms). The Poisson process on S with intensity measureΛ is a point process on S such that• For every compact set V ⊂ S, the count b(V) has a Poisson distribu-tion with mean Λ(V);• If V)O NNNO Vn are disjoint compact sets, then b(V))O NNNO b(Vn) are in-dependent.For example, for the inhomogeneous Poisson process on R2 with intensityfunction (u), u ∈ R2 is defined by taking S = R2 and Λ(V) = ∫B (u)xu.GBEBF Gynyrul xytyrminuntul point prowyssysThe DPPs define a point process on S ⊆ Rd, that is, a random point con-figuration r = {x)O NNNO xn} with xn ∈ S. Here we are only concerned withDPPs on a finite state space, S = {s)O NNNO sN}. Let C denote an b ×b pos-itive semidefinite matrix constructed as Cij = C(siO sj) with a covariancefunction C(siO sj). For a generic random point configuration r, defined (r = U) =det(CV)det(C + I)(3.1)15GBEB Dynitionsas a probability distribution on the 2N possible point configurations r.The above argument actually defines a subclass of DPPs known as L-ensembles. In practice, it is more convenient to characterize DPPs via theseL-ensembles (Borodin and Rains, 2005; Kulesza et al., 2012), which directlydefine the probability of observing each subset of S. It can be demonstratedthat the normalizing constant det(C+I) =∑V⊆S det(CV) and I is an n×nidentity matrix.An important characteristic of DPPs is that they assign higher proba-bility to sets of items that are diverse (unlike the Poisson point processesintroduced earlier, which present complete spatial randomness). An illustra-tion of the repulsive character of DPPs is shown in Figure 3.1. This is dueto the fact that the intensity function of a DPP depends on the determinantof a kernel matrix that defines a global measure of similarity between pairsof items. To illustrate the diversity idea, recall that one can interpret thedeterminant as the volume of a parallelotope spanned by the column vec-tors of CV - equal or similar column vectors span less volume than highlydiverse ones. Figure 3.2 provides a geometric view of this idea. Note thatthe column vectors can be thought of as features for items, and those withfeatures that are more different have larger probabilities of being sampledfrom the ground set of items.Figure 3.1: From left to right, realizations of a homogeneous Poisson pointprocess, DPP, and DPP configuration with the highest probability on a 2-Dfine grid. From Kulesza et al. (2012).16GBFB gumpling from kADddsFigure 3.2: Geometric interpretations of the determinant of a kernel matrix.The leftmost two graphs from left represent the determinants of a matrixconsisting of two column vectors, and the graph on the right represents thecase with three column vectors.GBEBG kADddsThe standard DPP models described above may yield subsets of any ran-dom size. A k-DPP on a discrete ground set is simply a DPP with fixedcardinality k, can be obtained by conditioning a standard DPP on the eventthat the set U has cardinality k, as followsPkC(U) = P(r = U||U| = k) =det(CV)∑|V′|5k det(CV′)O (3.2)where |U| denotes the cardinality of U. This notion is essential in the contextof our cardinality-constrained discrete optimization problem. In the follow-ing subsection we demonstrate how we can sample from this probabilisticmodel and approach the optimal solution based on the sampling results.GBF gumpling zrom kADddsEfficient sampling algorithms for DPPs have been the focus for many branchesin machine learning and applied mathematics. The first of these algorithmswas developed in Hough et al. (2006), and relies on eigendecomposition ofthe kernel matrix. Due to the high computational expense of the eigende-composition for large scale datasets, much research has been concentratedon developing faster sampling methods. Affandi et al. (2013) employ a17GBFB gumpling from kADddsNystrom approximation on the kernel matrix and develop an approximatesampling algorithm based on the approximated kernel matrix, which gener-ally has a lower dimension. Meanwhile, Li et al. (2015) explore the idea ofcoresets for the ground set of items and propose an approximate samplingalgorithm that uses the ground set with reduced size. More recently, Launayet al. (2018) propose an exact sampling methods for DPPs that does notrely on eigendecomposition but rather exploits the properties of the pointprocesses and construct a sequential sampling procedure based on thinning.Another alternative, recently studied by Anari et al. (2016), takes a dif-ferent approach and uses Markov chain Monte Carlo (MCMC) to generateapproximate samples for k-DPPs.In this section, we discuss sampling algorithms for DPPs, and morespecifically, for k-DPPs. We particularly focus on two of them: the exactsampling algorithm developed in Hough et al. (2006) and extensively dis-cussed in Kulesza et al. (2012); and the MCMC procedure discussed in Anariet al. (2016).GBFBE Inxypynxynt sumplingAlgorithm 4, after Hough et al. (2006), offers an efficient algorithm for gen-erating a point configuration m from a DPP. The algorithm takes an eigen-decomposition of the DPP kernel C as the input. Kulesza et al. (2012)proves that Algorithm 4 samples random point patterns that follow a DPPdistribution. Note that in the first loop of the algorithm, a subset of eigen-vectors is selected at random with selection probability depending on theassociated eigenvalue. In the second loop, a sample m is generated based onthe selected eigenvectors.Before providing further details, we introduce an important way of ex-pressing a DPP as a mixture of elementary DPPs (Kulesza et al., 2012),also referred to as determinantal projection processes (Hough et al., 2006).Elementary DPPs, denoted as Pk , are a particular type of DPP wherebyevery eigenvalue of its marginal kernel is either zero or one. Its marginal18GBFB gumpling from kADddsAlgorithm 4 Sampling from a DPPcnputT eigendecomposition {vnO n} of L.J ← ∅.for n = 1O N N N O b doJ ← J ∪ {n} with probability nn+)end forj ← {vn}n∈Jm ← ∅while |j | R 0 doSelect yi from m with probability given by)|k |∑v∈k (v⊤ei)2m ← m ∪ {yi}j ← j⊥, an orthonormal basis for the subspace of j orthogonal to eiend whileiutputT m .kernel can thus be decomposed asKk =∑v∈kvv⊤O (3.3)where j is a set of orthonormal vectors. From this decomposition, notethat elementary DPPs have their cardinality fixed as the cardinality of j .Furthermore, denoting jJ as {vn}n∈J , the mixture is (Kulesza et al., 2012)PL(m ) = 1det(L+ I)∑J⊆{);:::;N}PkJ (m )∏n∈JnN (3.4)From this notion of elementary DPPs, note that the normalization con-19GBFB gumpling from kADddsstant of the k-DPP is given by (Kulesza et al., 2012):∑|n ′|5kdet(Ln ′) = det(L+ I)∑|n ′|5kPL(m ′) (3.5)=∑|n ′|5k∑J⊆{);:::;N}PkJ (m ′)∏n∈Jn (3.6)=∑J⊆{);:::;N}|J |5k∏n∈Jn (3.7)≡ YNk O (3.8)which can be computed recursively, noting thatYNk = YN−)k + NYN−)k−) O (3.9)where )O N N N O N are the eigenvalues of the L-ensemble. Kulesza et al. (2012)have illustrated that the first phase of Algorithm 4 chooses an elementaryDPP according to its mixing coefficient, and the second phase samples areselected from the elementary DPP chosen in the previous phase.Since k-DPPs are simply DPPs conditioned on size, a comparably fastalgorithm which is similar to Algorithm 4 can be developed. Kulesza et al.(2012) shown that a k-DPP is also a mixture of elementary DPPs, but itonly gives nonzero weight to those of the desired dimension. As previouslymentioned, the second loop of Algorithm 4 can sample from any given el-ementary DPP; therefore, we can sample from a k-DPP if we can sampleindex sets J according to appropriate mixture components. From Kuleszaet al. (2012), Pr(d = J) = )eNk∏n∈J n when |J | = k, and zero otherwise.Therefore, the first loop of Algorithm 5 yields a realization of d and thesecond loop is identical to that of the DPPs.Kulesza et al. (2012) discussed thoroughly the computational complexityof sampling DPPs and k-DPPs. In summary, sampling an elementary DPP(the second loop of Algorithm 4) can be done in c(bk+) time, where k =|j | is the number of eigenvectors selected in phase one. For k-DPPs, theelementary symmetric polynomials can be precomputed and runs in c(bk)20GBFB gumpling from kADddsAlgorithm O Sampling from a k-DPPcnputT size k and {vnO n} eigenvectors and eigenvalues of L.J ← ∅.Compute elementary DPPs Yn) O N N N O Ynk , for n = 0O N N N O b .for n = bO N N N O 1 doSample u ∼ i [0O 1]if u PnEn−1k−1EnkthenJ ← J ∪ {n}k ← k − 1if k = 0 thenbreakend ifend ifend forj ← {vn}n∈Jm ← ∅while |j | R 0 doSelect yi from m with probability given by)|k |∑v∈k (v⊤ei)2m ← m ∪ {yi}j ← j⊥, an orthonormal basis for the subspace of j orthogonal to eiend whileiutputT m .time; the first loop of Algorithm 5 then iterates at most b times and requiresa constant number of operations. The remaining task is then to select Jeigenvectors with a fixed size and sample from the elementary DPPs. Sosampling k-DPP is c(bk+) overall in computational time, assuming that wehave an eigendecomposition of the kernel in advance. Therefore, samplinga k-DPP is no more expensive than sampling a standard DPP.GBFBF aWaW sumplingRecent work of Anari et al. (2016) has shown that a natural Metropolis-Hastings sampler for MCMC defined on the support of a homogeneousstrongly Rayleigh distribution mixes rapidly, and can be used to effi-ciently generate approximate samples from . The authors also indicated21GBFB gumpling from kADddsthat DPPs are special cases of strongly Rayleigh measures, so that the sameMarkov chain efficiently generates random samples of a k-DPP. The MCMCprocedure is summarized in Algorithm 6. The state space of the chain is theground set of items S = {1O N N N O b}.Algorithm P MCMC sampling from a k-DPPcnputT Size k and ground set S.Start with any U ⊆ S with |U| = kfor Maximum number of iterations doLet h = S\{i} ∪ {j}, where i ∈ S and j =∈ SSample u ∼ Uniform(0O 1)if u ≤ min(1O (i )(S) ) thenMove to a new state helseStay in Uend ifend foriutputT A Markov chain of random samples of a k-DPP.It is easy to see that the chain with the described transition is reversible,since the probability of moving forwards from the current state to the newstate is the same as moving in the reverse direction. The chain is alsoirreducible, with being the stationary distribution due to the Metropolis-Hastings step. The convergence of the Markov chain is measured via itsmixing time: the mixing time of the chain indicates the number of iterationst that we must perform (starting from S() before we can consider St as anapproximately valid sample from . Anari et al. (2016) proved that thechain described in Algorithm 6 mixes after polynomial steps. Here, insteadof stating the theorem, we show in Figure 3.3 the trace plot of an MCMCchain of 10O 000 log-determinants generated by a 10-DPP. The plot illustrateshow quickly the MCMC chain mixes practically.22GBGB gumplingAbusyx solution strutygy0 2000 4000 6000 8000 100007677787980Trace of Log−determinantsIterationsFigure 3.3: Trace of 10O 000 log-determinants generated by a 10-DPP.GBG gumplingAvusyx solution strutygyTo handle the NP-hard optimization problem described in Equation (2.1),the k-DPP sampling approach involves generating such k-DPP subsets re-peatedly and calculating the objective function HN (S), such that succes-sively better approximations, as measured by HN (S), can be found. Theapproximate solution to the problem is then given by the best attainedHN (S) up to a certain number of simulations and its associated indices ofpoints, as described in Algorithm 7.Note that eigendecomposition of the kernel matrix can be conducted asa pre-processing step and thus does not need to be performed before eachsampling step. Therefore, assuming that we have an eigendecomposition ofthe kernel in advance, sampling one k-DPP runs in O(bk+) time (Kuleszaet al., 2012) and the computation of the determinant of a submatrix typicallytakes O(k+) time. Overall, Algorithm 7 runs in O(bk+) time per iteration.23GBHB Wompututionul dyrformuncysAlgorithm 7 Sampling-based solution strategy using k-DPPcnputT Size k and the kernel matrix L.1: Sample k indices according to the k-DPP distribution specified by Lusing Algorithm 5.2: Compute determinant of the submatrix indexed by the sampled k indicessampled.3: Repeat steps 1 and 2 until reaching the maximum number of iterationsor the maximum computing resources.iutputT The maximum determinant and the associated set of indices.GBH Wompututionul dyrzormunwysIn this section, we compare the performances of the greedy algorithm, theGA, and the k-DPP approaches discussed above in two examples; one tractableand another intractable.GBHBE cptimizing mufiimumAyntropy xysigns zor monitoringnytworksFor the first study we consider the data supplied by the U.S. Global Histori-cal Climatology Network–Daily (GHCND), which is an integrated databaseof climate summaries from land surface stations across the globe. For il-lustrative purposes, we selected 97 temperature monitoring stations wherethe maximum daily temperature was recorded. A subset of 67 stations wasselected among the 97 stations to constitute a hypothetical monitoring net-work. An additional 30 stations were selected and designated as potentialsites for new monitors. Casquilho-Resende et al. (2018) has successfullyapproximated a maximum-entropy design using k-DPP for an instance ofthis problem. In this case study, the goal is to select a subset of 10 sta-tions from among the additional 30 to augment the network based on themaximum-entropy design criterion.Using the notation presented in Equation (2.1), C here is the estimatedcovariance matrix of 30 candidate sites and S is the subset of 10 sites thatmaximize HN (S). For tractable optimization problems, the maximal valueof the objective function (or equivalently the optimal design) can be obtained24GBHB Wompututionul dyrformuncysefficiently by the branch-and-bound algorithm detailed in Algorithm 1. Inthis study, the maximal value is 80N09011.For the comparison, we first performed the greedy algorithm discussed inAlgorithm 2, which yielded a solution of 80N07284. Using Algorithm 3 withthe tuning parameters suggested in Ruiz-Ca´rdenas et al. (2010) (b( = 100,pcrgss = 0N75, peml = 0N05, and a tournament selection scheme with four com-petitors), the GA yielded a solution of 80N09011 after 1000 generations. Sim-ilarly, the proposed 10-DPP achieved the optimal value after about 80O 000simulations. As illustrated in Figure 3.4, the maximal value of the log–determinant among the simulations increases as the number of simulationsescalates.76777879800 25 50 75 1007879800 20000 40000 60000 80000Number of simulationsMaximum log−subdeterminantFigure 3.4: Occurrence of maximum log–determinants of the restricted con-ditional hypercovariance matrix when increasing the number of simulated10-DPP samples. The optimum solution is marked by the red horizontaldashed red line. The inset shows a zoomed-in view of the first 100 samples.In terms of computation time, for this particular problem about 20 min-utes of wall clock time was taken to simulate 100O 000 subsets from the25GBHB Wompututionul dyrformuncys10-DPP using R programming language (R Core Team, 2018) on a laptopwith a 2.5 GHn Inhel 7cre i7 processor and a 16 G6 1,600 MHn 88R3RAM. In the same computational environment, 5 minutes of wall clock timewas needed to simulate 1O 000 generations of GA.GBHBF gynthytiw xutu with u lurgy numvyr oz pointsExact methods quickly become impractical for large data sets, and one mustthen resort to heuristics. Greedy heuristics are known to be fast and efficient,but they can be quite inaccurate. As an illustrative example, let an n × nreal symmetric positive definite matrix beU =x)) x)2 x)+ N N N x)nx2) x22 x2+ N N N x2nx+) x+2 x++ N N N x+n.......... . ....xn) xn2 xn+ N N N xnnOwhere the diagonal elements arexii =xO if i P n− k + 1Ox+ O otherwise.And the off-diagonal elements arexij =uO if i R jO i ̸= n− k + 1O n− k + 2O NNNO n− k + 10vO if i R jO i = n− k + 1OcO if i R jO i = n− k + 2O or n− k + 3O orNNNO n− k + 10uO if i P jO j ̸= n− k + 1O n− k + 2O NNNO n− k + 10vO if i P jO j = n− k + 1OcO if i P jO j = n− k + 2O or n− k + 3O orNNNO n− k + 10Owhere n is the size of the matrix and k is the desired size of the subset onewould like to select.26GBHB Wompututionul dyrformuncysSuppose we seek a 60-by-60 submatrix with maximal determinant froma 100-by-100 matrix, with u = 0N2, v = 0N9, c = 0N65, x = 7, and =1. For this particular matrix, running a greedy algorithm results in theselection of subsets {31O NNNO 40O 51O NNNO 100} at termination and an associatedlog–subdeterminant of 122N8217. We also ran GA with the same tuningparameters as in the previous section and the best solution obtained was123N6158 in 1O 000 iterations. For comparison, we simulated 100O 000 60-DPP samples and found a better solution of 123N7503, as shown in Figure3.5.121.5122.0122.5123.0123.50 250 500 750 10001211221230 20000 40000 60000 80000Number of SimulationsMaximum of Log−DeterminantsFigure 3.5: Occurrence of maximum log-subdeterminants of the synthetickernel matrix when increasing the number of simulated 60-DPP samples.The greedy and the GA solutions are marked by the green and purple solidhorizontal lines, respectively. The inset shows a zoomed-in view of the first1O 000 samples.The computational burden increases significantly when dealing with largematrices, but parallel simulations can be exploited to reduce the computa-tional time. For this example, 1 hour of wall clock time was required tosimulate 100O 000 samples of 60-DPP on a Compute Canada cluster with32 ccreg 2.1GHn Inhel 6rcadwell CPUs and 12, G6 RAM. In the samecomputational environment, running 1O 000 generations of GA required 127GBIB Discussionhour of wall clock time.In summary, the GA and the DPP approaches produced fairly compa-rable solutions that are more accurate than those produced by the greedyalgorithm. The DPP methods seemed to require more computing resources.To observe the variations from run to run, we repeated the above two stud-ies 100 times with the GA and DPP approaches. The results for DPP with100O 000 and 1O 000O 000 simulations and for the GA with 1O 000 and 10O 000generations are shown in Table 3.1. Overall, the performances of the twoapproaches are fairly comparable.Sample Sizek = 10 k = 60Maximum Mean SD Maximum Mean SDDPP-100,000 80.09 80.00 0.05 123.75 123.72 0.05DPP-1,000,000 80.09 80.07 0.01 123.89 123.80 0.02GA-1,000 80.09 80.07 0.06 123.62 123.40 0.15GA-10,000 80.09 80.08 0.01 124.02 123.89 0.08Table 3.1: Results obtained from 100 realizations of k-DPP and GA forboth the real and synthetic data. Sample size refers to the number of DPPsimulations and the number of generations of GA for each realization; SDrefers to standard deviation.GBI DiswussionThis chapter introduces a sampling-based approach for approximating thecombinatorial optimization problem of subdeterminant maximization. Bysampling from a k-DPP, which can be done in polynomial time, we ap-proach the optimal solution by using the maximum simulated value as anapproximation.We demonstrated the potential application of the k-DPP based algo-rithm for finding optimal designs of spatial monitoring networks, and foundit successful in obtaining the exact solution for small, tractable problems.When the size of the problem makes exact methods impractical, we showed(for a certain type of matrix) that our algorithm outperforms the greedy28GBIB Discussionalgorithm and is comparable to the GA for a relatively small cost in com-putational time. When solving a large problem where the exact methodsdo not work, the proposed DPP method is guaranteed to ultimately ap-proach the optimum with a sufficient number of iterations while the greedyand GA potentially converge to a local optimum after a certain numberof iterations. Moreover, although GA usually runs faster in terms of com-putational time than our DPP approximation and returns fairly accuratesolutions, it requires careful calibrations of its tuning parameters. In fact,finding the right balance between crossover/selection, which pulls the popu-lation towards a local maximum, and mutation, which explores potentiallybetter solution spaces, is a known issue for GA. Inappropriate choices oftuning parameters could adversely affect the convergence of the algorithm;see Goldberg and Holland (1988) and Whitley (1994) for more detailed dis-cussions. The DPP approximation, on the other hand, can be run naivelyto obtain comparably accurate solutions. Another major advantage of theDPP is that the algorithm can be easily implemented with parallelizationwithout material modifications, which enables potential usage of free–accesssupercomputers to further reduce computational time.In future work, approximate sampling algorithms for k-DPP will be ex-plored. Work has recently been published which attempts to reduce the di-mension of the matrix, such as the one introduced in Li et al. (2015). Otherstudies mainly focus on the approximations of the kernel matrix using somelower dimensional structures or alternate representations of the matrix inlower dimensional forms, such as those in Affandi et al. (2013) and Kuleszaet al. (2012). These methods could help to reduce the sampling complexityof the k-DPPs and eventually could further reduce the computational timerequired to obtain the approximate solutions. Currently, analytical theoryis being developed to describe the number of iterations needed for successiveimprovements in the approximate DPP solutions, as well as to estimate theexpected time duration, until an optimal solution is obtained.29Whuptyr Hfyworx juluys hhyory“Records, Record, Record!”, “The oldest Olympic Record was beaten!”,“New World Record result in 100-metre dash!”. Every day, we see such kindof headlines in newspapers, on the Internet, on the TV. Suppose we registerthe annual total inches of precipitation in Vancouver, what is the probabilitythat next year will be a new maximum? Or, if we collect data on annualamounts of precipitation in Vancouver in the previous one hundred years,how many of them set a record - being higher in value than all previouslyregistered values? It turns out that breakthroughs are less likely later thanearlier in a sequence of observations.In this chapter, we present statistical results concerning record valuestheory (also referred to as record statistics, or just simply records). Theliterature there is enormous. Here we only introduce results that are mostrelevant to the work reported in this thesis. In particular, we connect recordvalues to the k-DPP approximations discussed in previous chapters and tryto provide insights into the behaviour of those approximations.HBE IntroxuwtionThe study of record values dates back to the early nineteen-fifties. Chandler(1952) first introduced basic record models involving i.i.d. observations anddocumented many of the basic properties of the records. As one can imag-ine, record values are inextricably related to order statistics, and the studyof record values in many ways parallels the study of order statistics. In fact,several early books on order statistics and related processes have chapterson record values. Many of them emphasize the parallel between record valuetheory and the theory of sample maxima. In particular, Chapter 6 of Galam-30HBEB Introxuctionbos (1977) and Chapter 4 of Resnick (1987) provide extensive discussion ofrecord values. An early survey paper by Glick (1978) also merits attentionand introduces an interesting applications of record value theory in traffic -curiously, the simple model of random record values says something abouthow cars tend to bunch together behind a slow vehicle. Papers of similarspirit by Nevzorov (1988) and Nagaraja (1988) provide excellent resourcesfor anyone interested in the field. The book by Ahsanullah (1995) providesa limited coverage of certain topics in the domain of record statistics. Thecoverage in Arnold et al. (1998) is much more extensive and inevitably moreup to date. We follow the presentations in Arnold et al. (1998) to provide abrief overview of record values theory.HBEBE Dynitions unx notutionThe standard record value process corresponding to an infinite sequence ofi.i.d. observations is our focus. Let l)O l2O N N N be an infinite sequence ofrandom variables having the same distribution as the random variable l.Denote the cumulative distribution function (cdf) of l by F . Usually anassumption that F is continuous is invoked to avoid the possibility of tiesand allows a cleaner theoretical development.An observation lj is called an upper record value (or simply a record)if its value exceeds that of all previous observations, that is, lj R li forall i P j. The times at which records appear are often of interest. Forsimplicity, let us assume that lj is observed at time j. Then the sequenceof record time {hnO n ≥ 0} is defined as follows:hn = min{j : lj R lin−1} (4.1)for n ≥ 1 and h( = 1 with probability 1. The sequence of record values{fn} is then defined byfn = lin O n = 0O 1O 2O N N N N (4.2)Since the first observation in any sequence will always be a record, here f(31HBEB Introxuctionis sometimes referred to as the reference value or the trivial record.The above definition of record sequence implicitly assumes that the cdfF will not produce any unbreakable record. This will not be the case ifthere exists some value x( such that F (x() − F (x(−) R 0 and F (x() = 1,that is, if there is a largest possible real value that can be obtained withpositive probability. Though such cases are sometime of interest and in fact,this is the case in our study, we will eliminate them from our discussion fornow and assume that the cdf F cannot yield unbreakable records so that{fnO n ≥ 0} will be a strictly increasing non-terminating sequence (we willexplain in later sections why we still can apply such record values theory toour study). Note that this definition does not imply that {fn} is unbounded.For example, if the cdf F is uniform on the interval (0O 1), then {fn} satisfiesthe definition and is bounded above by 1.We may also define a record increment (jump) sequence {JnO n ≥ 0} byJn = fn −fn−) (4.3)for n ≥ 1 and J( = f(. An inter-record time sequence ∆n, is also of interestand can be defined as∆n = hn − hn−)O n = 1O 2O N N N N (4.4)Finally, the number of records may be tracked by a counting process {bnO n ≥1}, wherebn = {number of records among l)O N N N O ln}N (4.5)As we shall see, in the setup of the classical record model defined above,where l ′is are i.i.d. observations from a continuous distribution F , therecord counting statistics hn, ∆n, and bn will not be affected by F . Thedistributions of the record values fn are predictably affected by F .32HBFB Vusic Distributionul fysultsHBF Vusiw Distrivutionul fysultsIn this section, we discuss distributional results for record values and relatedstatistics (i.e., fn, hn, ∆n, bn) from the classical model. It turns out that,for the classical model, strong arguments can be made in favour of studyingi.i.d. exponentially distributed l ′is. In fact, we build relationship betweena standard exponential distribution and a general continuous distribution toderive distributional results for record values from a general continuous cdfF .HBFBE fyworx vuluys zrom thy wlussiwul moxylLet {l∗i O j ≥ 1} denote a sequence of i.i.d. exponentially distributed randomvariables with the rate parameter equal to 1. Due to the lack of memoryproperty of the exponential distribution, the differences between successiverecords will be i.i.d. standard exponential as well, i.e., {J∗n} = {f∗n−f∗n−)}are i.i.d. Exp(1) random variables. It follows that the nth record, f∗n, corre-sponding to an i.i.d. Exp(1) sequence, have a Gamma distribution (Arnoldet al., 1998), i.e.,f∗n ∼ Gamma(n+ 1O 1)O n = 0O 1O 2O N N N N (4.6)We may use this result to obtain the distribution of the nth record cor-responding to an i.i.d. sequence of random variables {li} with commoncontinuous cdf F . First note that if l has a continuous cdf F , thenH(l) = − log(1− F (l)) (4.7)has a standard exponential distribution. Consequently, ld= F−)(1 −exp(−l∗)) where l∗ is Exp(1). Since l is a monotone function of l∗,the nth record of the {li} sequence can be expressed as a function of thenth record of the {l∗i } sequence. Specifically, we havefnd= F−)(1− exp(−f∗n))O n = 0O 1O 2O N N N N (4.8)33HBFB Vusic Distributionul fysultsSince n is an integer, it is easy to show by repeated integration by partsthat the survival function for f∗n (Gamma(n+ 1O 1)) isd (f∗n R r∗) = y−r∗n∑k5((r∗)kk!O r∗ R 0N (4.9)We may then use the relation established in Equation (4.8) to derive thesurvival function of the nth record corresponding to an i.i.d. sequence ofrandom variables with cdf F :d (fn R r) = (1− F (r))n∑k5((− log(1− F (r)))kk!O (4.10)or equivalently, the incomplete Gamma function:d (fn ≤ r) = 1n!∫ − dgg()−F (r)(vny−vxv (4.11)If F is absolutely continuous with probability density function (pdf) z , wemay differentiate Equation (4.10) to obtain the pdf for fn:zgn(r) = z(r)(− log(1− F (r)))nn!N (4.12)The result presented in Equation (4.10) was first proved by Chandler(1952) using a different approach. The relationship between the distributionof record values and exponential distribution is not new. Ahsanullah (1978)proved that if a sequence of independent random variables follows commonexponential distribution with rate parameter , then the resulting nth recordis distributed as Gamma(n+ 1O ) and the nth jump statistic (fn − fn−))follows the same exponential distribution with rate parameter .Applying transformation (4.8) coordinstewise we obtain the joint pdf of34HBFB Vusic Distributionul fysultsthe set of records f(O f)O N N N O fn, i.e.,zg0;g1;:::;gn(r(O r)O N N N O rn) =∏ni5( z(ri)∏n−)i5( (1− F (ri))(4.13)= z(rn)n−)∏i5(h(ri)O (4.14)where h(r) = dH(r)dr =f(r))−F (r) represents the hazard function.HBFBF fyworx timys unx rylutyx stutistiwsAs mentioned earlier, under the assumption that F , the common cdf of l ′is,is continuous, the distribution of record times and counts (h ′ns,b ′ns,∆′ns)does not depend on F . In order to discuss their distributional properties,we introduce a sequence of record indicator random variables as follows:In = I(ln R max{l)O N N N O ln−)})O (4.15)for n R 1 and I) = 1 with probability 1. It is not difficult to verify that theI ′ns are independent variables withd (In = 1) =1nO n ≥ 1N (4.16)In other words, In is a Bernoulli random variable with success probabilityp = )n with n ≥ 1. It is more convenient to give an intuitive interpretationto this result: after the first record (which is the first observation in asequence with probability 1) the second observation has a probability of )2of beating the first record; the third observation could either smaller thanthe first observation, between the first and the second observation, or largerthan the second observation, which give it a probability of )+ for beatingthe previous record. In general, each random variable has the same chanceof being the largest and hence, the nth observation has a probability of )nfor being a record. With this reasoning, the results given by (4.16) easilyfollows.35HBFB Vusic Distributionul fysultsWith the help of In, we have the following facts about the record countingprocess {bnO n ≥ 1}:bn =n∑i5)IiN (4.17)Since I ′is are independent Bernoulli random variables, we immediately seethatEbn =n∑i5)1i≈ lnn (4.18)andVarbn =n∑i5)1i(1− 1i) ≈ lnnN (4.19)As we just confirmed, records are clearly not common. In a sequence of1000 observations we expect to see only about 7 records. Another fact thatimmediately follows is that the expected number of records goes to infinityas the number of samples gose to infinity, which is due to the divergenceof harmonic series. In fact, we also have that bn → ∞ as n → ∞ (seeGlick (1978) for a proof). The precise distribution of bn is complicated, butthe expressions have been given by several authors (Re´nyi, 1962; David andBarton, 1962; Karlin, 1966):d (bn = k) =Sknn!≈ ln(n)knk!(4.20)for large sample size n, where Skn is the Stirling number of the first kind.We may use the information regarding the distribution of bn and In todiscuss the distribution of the kth non-trivial record time hk. Note that theevents {hk} and {bn = K = 1O bn−) = k} are equivalent. Consequently,d (hk = n) = d (bn = K = 1O bn−) = k) = d (InO bn−) = k)N (4.21)Since the events {In = 1} and {bn−) = k} are independent, we haved (hk = n) =1nSkn−)(n− 1)! =Skn−)n!N (4.22)36HBFB Vusic Distributionul fysultsNote that hk grows rapidly as k increases. In fact, Glick (1978) verified thatEhk =∞O ∀k ≥ 1 (4.23)andE∆k = E(hk − hk−)) =∞O ∀k ≥ 1N (4.24)Neuts (1967) first developed an approach for computing the exact dis-tribution of ∆k. By conditioning on the value of f∗k−) (assuming for con-venience without loss of generality that the l ′s are standard exponentialrandom variable), we have, for j = 1O 2O N N N N , k = 1O 2O N N Nd (∆k R j) =∫ ∞(xk−2Γ(k)y−x(1− y−x)jxxN (4.25)The integration is readily performed if we expand the term (1−y−x)j , whichyieldsd (∆k R j) =j∑m5((jm)(−1)m 1(1 +m)kO (4.26)a result in Ahsanullah (1988). Using the above two equations, one mayverify thatd (∆k = j) =j−)∑m5((j − 1m)(−1)m 1(2 +m)kN (4.27)Note that summing the expression over j provides an alternative proof that,as reported earlier, E∆k =∞.HBFBG aurkov whuinsThere are several Markov chains lurking in the background of any discussionof record values and related statistics. We list some of them that are mosthelpful for characterizing the behaviour of record values and record times.First we observe that the record counting process {bnO n ≥ 1} is a non-stationary Markov process with transition probabilities d (bn = j|bn−) = i)37HBGB kADdd Approfiimutions us fycorx Vuluysgiven bypij =n−)n O j = i)n O j = i+ 1N(4.28)Next note that {hnO n ≥ 0} forms a stationary Markov chain with d (h( =1) = 1 and the transition probabilities d (hn = j|hn−) = i) given bypij =ij(j − 1) O j R iN (4.29)It is also obvious that {fn} is a Markov chain with f( ∼ F and the transi-tions governed byzgn|gn−1 =z(rn)1− F (rn−)) O rn R rn−)N (4.30)Finally, an interesting observation due to Strawderman and Holmes (1970)is that {(fnO∆n+))O n ≥ 0} is also a Markov chain. Given the sequence{fn}, the ∆n+)’s are conditionally independent geometric random variableswithd (∆n+) R k|{fnO n ≥ 0}) = d (∆n+) R k|fn = rn) = (F (rn))kN (4.31)HBG kADdd Approfiimutions us fyworx juluysIn this section, we use results from record value theory presented previouslyas tools to study the statistical behaviour of the log-determinants generatedfrom the k-DPP approximations developed in Section 3.HBGBE Jittyring thy logAxytyrminuntsAs most of the analytical results for records are developed under the assump-tion that the random sequence is independently generated from a commoncontinuous distribution, we would like to transform the random sequence oflog-determinants, which has finite support, to its continuous counterpart.Another justification, in favour of record value theory, is that we do not38HBGB kADdd Approfiimutions us fycorx Vuluyswant to concern ourselves about ties in the random sequence. The reasonis obvious: we would like to study how often strictly better approximationsappear.We refer the method we use for the transformation to Gaussian jitter-ing. Let {liO i ≥ 1} be a sequence of log-determinants generated usingAlgorithm 7. For each li, i = 1O N N N O n, where n is the sample size, wecreatemi|li ∼ Guussiun(liO /2)O i = 1O N N N O nO (4.32)so that we have a sequence of independent {mi} generated from a continuouscdf. By choosing /2 to be very small, the values of m ′i s and the behaviourof the random sequence should be identical to that of l ′is except in thatd (ml = mm) = 0O ∀l ̸= mO (4.33)and the corresponding random sequence of records will be strictly increasingand non-terminating.We argue based on heuristic reasoning that we can consider the {mi}a sequence of log-determinants with the corresponding subsets generatedfrom a k-DPP distribution. More importantly, we are able to instead studyany statistical behaviour of {mi} and generalize them to that of {li} withignorable errors.HBGBF Distrivution oz thy jittyryx logAxytyrminuntsThe next step before applying record value theory is to find an appropriatecommon cdf F for the jittered log-determinants. Since it would be unnec-essarily complicated to study the precise analytical cdf of m due to theGaussian jittering transformations, we fit a distribution from some para-metric family that best describes the data. Figure 4.1 shows histogramsand corresponding density estimates of log-determinants generated from ak-DPP using the kernel described in Section 3.4.1. Note that the two plotswith different sample sizes consistently suggest that m is distributed likeslightly skewed Gaussian. In fact, a member of the Gaussian distribution39HBGB kADdd Approfiimutions us fycorx Vuluysfamily would fit the data well for most parts except for the tails. However,the reason that we do not consider such a distribution is that the “right”extremes are the ones that most concern us. For example, for the particularkernel we used to generate our plots, the approximations are starting toget very close to the truth after two standard deviations from the centre.For reasons such as these, we consider distributions from the extreme valuetheory literature to fit our data. In particular, we use a method generallyreferred to as “Peaks-Over-Threshold” (Leadbetter, 1990), which relies onextracting, from a continuous phenomenon, the peak values reached for anyperiod during which values exceed a certain threshold.Histogram of Log−determinantsLog−determinantsDensity75 76 77 78 79 800.00.20.40.6Histogram of Log−determinantsLog−determinantsDensity75 76 77 78 79 800.00.20.40.675 76 77 78 79 800.00.20.40.6Density Estimate of Log−determinantsN = 100000 Bandwidth = 0.05481Density75 76 77 78 79 800.00.20.40.6Density Estimate of Log−determinantsN = 500000 Bandwidth = 0.03968DensityFigure 4.1: Top row: Histograms of k-DPP log-determinants with samplesizes 100O 000 and 500O 000 respectively; bottom row: Kernel density esti-mates of k-DPP log-determinants with sample sizes 100O 000 and 500O 000respectively.40HBGB kADdd Approfiimutions us fycorx VuluysExtreme value theory and peaksGoverGthresholdWe first provide a very brief introduction to extreme value theory andthe two most common distribution families to model extreme values. Letl)O N N N O ln be a sequence of i.i.d. random variables with a common cdf F .Let an = max{l)O N N N O ln}. Suppose there exists normalizing constantsun R 0 and vn such thatd (an − vnun≤ y) = Fn(uny + vn)→ G(y) (4.34)as n→∞ for all y ∈ R, where G is a non-degenerate distribution function.According to the Extremal Types Theorem (Fisher and Tippett, 1928), Gmust be either Fre´chet, Gumbel or negative Weibull. Jenkinson (1955) notedthat these three distributions can be merged into a single parametric fam-ily: the Generalized Extreme Value (GEV) distribution. The GEV has adistribution function defined byG(y) = exp[−(1 + , y − /)−)R+ ]O (4.35)where (O /O ,) are the location, scale and shape parameters respectively with/ R 0. Note that z+ = max{zO 0}. The Fre´chet distribution is obtained when, R 0, the negative Weibull case is obtained when , P 0, and the Gumbelcase is obtained when , → 0.From this result, Pickands III (1975) showed that the limiting distribu-tion of normalized excesses over a threshold as the threshold approachesthe endpoint end of the variable of interest, is the Generalized Pareto Dis-tribution (GPD). That is, if l is a random variable which satisfies (4.34),thend (l ≤ y|l R )→ H(y)O → end (4.36)withH(y) = 1− (1 + , y − /)−)R+ O (4.37)where again (O /O ,) are the location, scale and shape parameters respec-tively with / R 0. Note that the Exponential distribution is obtained by41HBGB kADdd Approfiimutions us fycorx Vuluyscontinuity as , → 0.In practice, these two asymptotic results motivated modelling block max-ima with a GEV, while peaks-over-threshold with a GPD.Fitting ajD to jittered logGdeterminantsWe now use the peaks-over-threshold scheme to fit a GPD to the tail ofthe jittered log-determinants data. Although many estimators have beenproposed in the literature (see Coles et al. (2001), for example, for a detaileddiscussion), for simplicity, we follow the maximum likelihood estimationprocedure for point estimation for the parameters (O /O ,).Note that the location parameter for the GPD or equivalently thethreshold is a particular parameter as most often it is not estimated asthe other ones, and it is possible to estimate the other two parameters usingMLE with varying threshold. The main goal of threshold selection is to selectenough events to reduce the variance; but not too much as we could selectevents coming from the central part of the distribution (i.e., not extremeevents) and induce bias. In practice, threshold selection is usually donewith the aid of exploratory tools, however, decisions are not so clear-cut forreal data example. In our study, since we have quite a lot of data we set to equal the 90lh percentile of the log-determinants. For experiments, wehave also tried the 70lh and 80lh percentiles for and found no materialdifference in the fitted distributions.The results for 100O 000 log-determinants generated from the 10-DPPwith the same kernel as the one described in Section 3.4.1 are shown inFigure 4.2. For comparison, we also fit a Weibull and a Normal distributionto the entire dataset. Note that the fitted GPD shows advantage in termsof the goodness of fit in the tail part (i.e., data above the 90lh percentile).Again the right tail of the data is what we are really interested in as thegood approximations appear in this region. As a result, we can ignore thegoodness of fit for parts of the data below our threshold.42HBGB kADdd Approfiimutions us fycorx VuluysEmpirical and Theoretical DensitiesDataDensity78.5 79.0 79.5 80.00.01.02.03.078.8 79.0 79.2 79.4 79.6 79.8 80.078.879.279.680.0Q−Q PlotTheoretical quantilesEmpirical quantilesEmpirical and Theoretical DensitiesDataDensity78.5 79.0 79.5 80.00.01.02.03.078.8 79.0 79.2 79.4 79.6 79.8 80.078.879.279.680.0Q−Q PlotTheoretical quantilesEmpirical quantilesEmpirical and Theoretical DensitiesDataDensity78.5 79.0 79.5 80.00.01.02.03.078.8 79.0 79.2 79.4 79.6 79.8 80.078.879.279.680.0Q−Q PlotTheoretical quantilesEmpirical quantilesFigure 4.2: From top to bottom: histograms and estimated theoretical den-sity functions (left) and quantile-to-quantile plot (right) displayed for thedata above 90lh percentile for GPD, Weibull, Normal distributions, respec-tively.HBGBG LogAxytyrminunts ryworxsThe end goal of analyzing the candidate distributions for modelling jitteredlog-determinants is to study the sequence of records generated from them.Recall from Equations (4.12) and (4.10) that the distribution of records can43HBGB kADdd Approfiimutions us fycorx Vuluysbe derived from the distribution of the original sequence. Formally, let fdwhere x ≥ 0 represents xth upper record from the jittered log-determinantssequence {mnO n ≥ 1}. Then, we havezgd(r) = z(r)(− log(1− F (r)))dx!O (4.38)wherez(y) = zn (y) =1/ˆ(1 + ,ˆy − ˆ/ˆ)−)R()+^) (4.39)andF (y) = Fn (y) = 1− (1 + ,ˆ y − ˆ/ˆ)−)R^N (4.40)Note that the support of the last two equations is such that y ≥ ˆ for ,ˆ ≥ 0and ˆ ≤ y ≤ ˆ− /ˆ=,ˆ for ,ˆ P 0. Here (ˆO /ˆO ,ˆ) are MLE for (O /O ,).Before delving further into the distributional results, many distribution-free results from record value theory can be applied to study the behaviourof our log-determinants approximations. Recall that the expected numberof records in a sample of size n can be approximated by log(n). In Fig-ure 3.4 where we show the progression of 10-DPP approximations using100O 000 samples, the black dots represent the occurrence of records whichcount to 12. Note that this is really close to the expected value, which islog(100O 000) ≈ 11N51. Observe in Figures 3.4 and 3.5 is that the occurrenceof records is more frequent at the beginning when sample sizes are smallthan when the sample sizes are big. This can be seen from the “diminishingreturn” property of the logarithm function and it is also an immediate ap-plication of the Markov proper of the record counting process. Recall that{bnO n ≥ 1} is a Markov process withd (bn = j|bn−) = i) =n−)n O j = i)n O j = i+ 1N(4.41)Therefore, the probabilities of record-breaking in the first few hundreds ofsamples are much higher than those in the later samples.44HBGB kADdd Approfiimutions us fycorx VuluysConditional probabilities of record incrementsA key distributional result we use to model the behaviour of the k-DPPapproximations is the conditional probability relating any two consecutiverecord values. From Equation 4.30, we haved (fd+) R r|fd = rd) = 1− F (r)1− F (rd) N (4.42)By letting r = (1 + ϵ)rd for some ϵ R 0, we haved = d (fd+) R (1 + ϵ)rd|fd = rd) = 1− F ((1 + ϵ)rd)1− F (rd) N (4.43)For ϵ P ϵsaedd and d P seadd, we can say that the probability of a smallincrement in the next record value is very small. In the context of k-DPPapproximations, this is equivalent to saying that the chance of getting abetter approximation is small. Although d (fd+) R rd) = 1 by definition,the ϵ term enables us to quantify how good the next approximation can be.One other use of Equation 4.43 is: if we have an approximation m fromanother algorithm, then by setting rd = m the conditional probability tellsus the chance that the next approximation from the k-DPP is better thanm. Intuitively, this probability is an increasing function of x. In fact, it ispossible to inversely compute x such that the probability of beating m ishigher than some user defined threshold.To illustrate how the conditional probabilities behave, we present nu-merical examples using the same sample of (jittered) log-determinants gen-erated from the 10-DPP in Section 3.4.1. Table 4.1 shows the occurrenceof records and the associated conditional probabilities of observing betterapproximations (with ϵ = 0N0005) given the current record values. Notethat the sequence of probabilities is decreasing with the number of records,with the first few of them being equal to or close to one - record-breaking ismore frequent at the beginning and the jumps are higher. The last recordobserved within this sample, though jittered, is actually the optimum valuefor the specific kernel. Note that the associated conditional probability is45HBGB kADdd Approfiimutions us fycorx Vuluysidentically 0. We also compute the conditional probabilities for the samesample but conditioning on the greedy approximation. The probabilities arenow increasing with the progression of records - the chance of achieving bet-ter approximations (better than the greedy approximation) is higher. Notethat the last record, which is better than the greedy approximation, has anassociated conditional probability of 1.No. Sims. Records Cond. Probs. Greedy Probs. E∆1 78.03975 1.0000 0.0000 15 78.04162 1.0000 0.0000 17 78.52475 0.8957 0.0000 116 78.85338 0.8691 0.0000 337 79.36103 0.7856 0.0000 34234 79.39688 0.7756 0.0000 43861 79.66752 0.6549 0.0001 3571758 79.89616 0.3823 0.0031 97073573 79.93738 0.2873 0.0408 2627644293 80.09011 0.0000 R 1N0000 1N3386× 10)(Table 4.1: The first two columns provides a summary of the occurrence timeof records and their values; the second and the third columns provides theconditional probabilities given the current record values and the conditionalprobabilities given the greedy approximation respectively; the last columngives the expected waiting times for the next record.Conditional probabilities of interGrecord timesRecall from Equation 4.31 that given the current sequence of record values,the next inter-record times are conditionally independent geometric variableswith 1 − p = F (rd) where p is the parameter for geometric distributionand rd is the xth record value. This distributional result gives us anotherway (besides the conditional probabilities of record increments) to designa stopping criteria for our k-DPP approximation algorithm. Specifically,we can stop sampling if the conditional probability of the waiting time forthe next better approximation higher than a pre-specified value is high, or46HBHB Discussionthe expected waiting time for the next better approximation is long. Forillustration, we include in Table 4.1 the expected waiting times given thecurrent sequence of record values. As anticipated, the last expected waitingtime is huge, which indicates a stopping point for the algorithm. In fact, asdiscussed before, the last record value is already the theoretical maximumand hence should be the stopping point.Essentially, a combination of the conditional probabilities and the ex-pected waiting times provides us with an incisive tool to analyze our k-DPPapproximations. In the numerical example above, these values tell us thatwe have reached an approximation such that the probability of obtaining abetter solution is very low and the waiting time for this hypothetical bet-ter solution is expected to be very long. In other words, we should stopsampling.HBH DiswussionIn this chapter we have explored mainly non-asymptotic properties of recordvalues and the related statistics and applied them to model jittered log-determinants generated by k-DPP. There are many intriguing asymptoticresults for record values (see Resnick (1973) for detailed discussions) thatconcern their behaviours when the number of records is large. Althoughthese results are quite mathematically elegant and do provide insightfulideas, we prefer to rely on exact non-asymptotic theory and, it is fairlyexpensive to sample a large number of records. However, we have had in-teresting results based only on simple distributional facts (i.e. distributionand density functions). In particular, many of the observed behaviours seenin the k-DPP approximations, such as the quick escalation at the begin-ning and the diminishing return near the end, can be explained by recordvalue theory. We also developed informative stopping rules for the searchalgorithm using a combination of distributional results for record values andinter-record times.In order for the log-determinants sampled from the k-DPP to fit intothe classical record model, we introduced the idea of Gaussian jittering.47HBHB DiscussionAlthough it worked quite well in our examples, in future work, we plan torigorously quantify the approximation errors resulted from the jittering. Inaddition to the classical records model, there are more general record models,such as the F record model (Yang, 1975) and the Pfeifer model (Pfeifer,1982) where records are generated from a dependent sequence determinedby the record history. Future research can focus on modelling the k-DPPapproximations using non-classical record models.48Whuptyr IWonwluxing fymurksThis thesis was motivated by the combinatorial optimization problem thatarises in the maximum-entropy approach for designing monitoring networks,which plays a critical role in the surveillance of environmental processes. Weintroduced a sampling-based stochastic search method using k-DPPs, anddescribed how the methodology is able to handle datasets with differentstructures and sizes. We illustrated how the k-DPP based method is able toprovide more accurate and more robust approximations than the determin-istic greedy-type algorithm, and how the former is comparable to anotherstochastic search method - the GA. The proposed method is easily paralleliz-able so that it takes advantage of the extensive high-performance computingresources that are available and saves computation time.More importantly, we explored the theory of record values in orderto assess the quality of the stochastic approximations made by k-DPPs.In particular, using basic distributional results for record values and thepeaks-over-threshold method from extreme value theory, we approximatedthe conditional distributions of the next-best k-DPP approximations giventhe current ones. With this result, we were also able to compare k-DPPapproximations with those produced by other methods and gain insightsinto their quality. In addition, we developed a guideline for developing stop-ping criteria for the proposed method using expected waiting times betweenrecords.49IBEB Futury korkIBE Futury korkIBEBE aWaW kADdd sumplingIn chapter 3, we developed the approximation method using independentsampling of k-DPPs. We relied on independent sampling mainly for the easeof applying the classical records models and the parallelizability of the sam-pling algorithm. However, the implementation quickly becomes expensivedue to the c(bk+) computational complexity of Algorithm 7. It has beenshown that the MCMC sampling algorithm described in the same chaptercan be run in c(b2k) time, as demonstrated by Li et al. (2017) using effi-cient calculation of the transition probabilities. This offers a huge reductionin computational time for large datasets. Future research then aims to com-pare the accuracies of the stochastic approximations based on independentk-DPP sampling and MCMC k-DPP sampling, respectively. Furthermore,we also aim to explore more general records models, again in order to assessthe quality of these stochastic approximations.IBEBF Womvinution oz kADdd unx othyr stowhustiw syurwhmythoxsAn observation from the numerical experiments presented in chapter 3 isthat approximations from the proposed method often accelerates quickly atthe beginning of the iterations. As confirmed by record value theory, ourmethod experienced more jumps at the beginning. in comparison, the GAin comparison needed more iterations to “warm up” but required less com-putational time in total. We saw that the k-DPP approximation methodachieved good solutions after a short time, while the GA started to outper-form the k-DPP solutions after warming up. The study of the combinationof the two methods and how well the combination compares to the originalmethods is a subject for future research.50VivliogruphyRaja Hafiz Affandi, Alex Kulesza, Emily Fox, and Ben Taskar. Nystromapproximation for large-scale determinantal processes. In Urtiwiul ]ntylAligynwy unx gtutistiws, pages 85–98, 2013.Mm Ahsanullah. Record values and the exponential distribution. Unnuls ozthy ]nstituty oz gtutistiwul authymutiws, 30(1):429–433, 1978.Mohammad Ahsanullah. ]ntroxuwtion to ryworx stutistiws. Ginn Press, 1988.Mohammad Ahsanullah. fyworx stutistiws. Nova Science Publishers, 1995.Nima Anari, Shayan Oveis Gharan, and Alireza Rezaei. Monte carlo markovchain algorithms for sampling strongly rayleigh distributions and determi-nantal point processes. In Conzyrynwy on Lyurning hhyory, pages 103–115,2016.KM Anstreicher. Efficient solution of maximum-entropy sampling problems.guvmittyx to cpyrutions fysyurwh, 2018.Kurt M Anstreicher, Marcia Fampa, Jon Lee, and Joy Williams. Continu-ous relaxations for constrained maximum-entropy sampling. In ]ntyrnuAtionul Conzyrynwy on ]ntygyr drogrumming unx Comvinutoriul cptimizuAtion, pages 234–248. Springer, 1996.Kurt M Anstreicher, Marcia Fampa, Jon Lee, and Joy Williams. Usingcontinuous nonlinear relaxations to solve. constrained maximum-entropysampling problems. authymutiwul progrumming, 85(2):221–240, 1999.Barry C Arnold, Narayanaswamy Balakrishnan, and Haikady N Nagaraja.fyworxs, volume 768. John Wiley & Sons, 1998.51VibliogruphyAlexei Borodin and Grigori Olshanski. Distributions on partitions, pointprocesses, and the hypergeometric kernel. Communiwutions in authymutAiwul dhysiws, 211(2):335–358, 2000.Alexei Borodin and Eric M Rains. Eynard–mehta theorem, schur process,and their pfaffian analogs. Journul oz stutistiwul physiws, 121(3-4):291–317,2005.William F Caselton and James V Zidek. Optimal monitoring network de-signs. gtutistiws : drovuvility Lyttyrs, 2(4):223–227, 1984.CM Casquilho-Resende, ND Le, JV Zidek, and Y Wang. Design of mon-itoring networks using k-determinantal point processes. Environmytriws,29(1):e2483, 2018.KN Chandler. The distribution and frequency of record values. Journul ozthy foyul gtutistiwul gowiytyB gyriys V (aythoxologiwul), pages 220–228,1952.Stuart Coles, Joanna Bawa, Lesley Trenner, and Pat Dorazio. Un introAxuwtion to stutistiwul moxyling oz yfitrymy vuluys, volume 208. Springer,2001.Florence Nightingale David and David Elliott Barton. Combinatorialchance. Technical report, 1962.E Fischer. U¨ber den hadamardschen determinantensatz. UrwhBauthB(Vusyl), 13:32–40, 1908.Marshall L Fisher, George L Nemhauser, and Laurence A Wolsey. An anal-ysis of approximations for maximizing submodular set functionsii. Indolyhyxrul womvinutoriws, pages 73–87. Springer, 1978.Ronald Aylmer Fisher and Leonard Henry Caleb Tippett. Limiting formsof the frequency distribution of the largest or smallest member of a sam-ple. In authymutiwul drowyyxings oz thy Cumvrixgy dhilosophiwul gowiyty,volume 24, pages 180–190. Cambridge University Press, 1928.52VibliogruphyFe´lix-Antoine Fortin, Franc¸ois-Michel De Rainville, Marc-Andre´ Gardner,Marc Parizeau, and Christian Gagne´. Deap: Evolutionary algorithmsmade easy. Journul oz auwhiny Lyurning fysyurwh, 13(Jul):2171–2175,2012.Janos Galambos. The asymptotic theory of extreme order statistics. In hhyhhyory unx Uppliwutions oz fyliuvility with Emphusis on Vuyysiun unxbonpurumytriw aythoxs, pages 151–164. Elsevier, 1977.Ned Glick. Breaking records and breaking boards. Umyriwun authymutiwulaonthly, pages 2–26, 1978.David E Goldberg and John H Holland. Genetic algorithms and machinelearning. auwhiny lyurning, 3(2):95–99, 1988.Peter Guttorp, Nhu D Le, Paul D Sampson, and James V Zidek. Usingentropy in the redesign of an environmental monitoring network. aultiAvuriuty ynvironmyntul stutistiws, pages 175–202, 1993.Alan Hoffman, Jon Lee, and Joy Williams. New upper bounds for maximum-entropy sampling. In mcXu JUxvunwys in aoxylAcriyntyx Xysign unxUnulysis, pages 143–153. Springer, 2001.John Henry Holland. Uxuptution in nuturul unx urtiwiul systymsN un inAtroxuwtory unulysis with uppliwutions to viology, wontrol, unx urtiwiul inAtylligynwy. MIT press, 1992.J Ben Hough, Manjunath Krishnapur, Yuval Peres, Ba´lint Vira´g, et al.Determinantal processes and independence. drovuvility survyys, 3:206–229, 2006.Edwin T Jaynes. Information theory and statistical mechanics (notes by thelecturer). In gtutistiwul dhysiws G, page 181, 1963.Arthur F Jenkinson. The frequency distribution of the annual maximum(or minimum) values of meteorological elements. euurtyrly Journul oz thyfoyul aytyorologiwul gowiyty, 81(348):158–171, 1955.53VibliogruphySamuel Karlin. U rst woursy in stowhustiw prowyssys. Academic press, 1966.Alexander K Kelmans and BN Kimelfeld. Multiplicative submodularity ofa matrix’s principal minor as a function of the set of its rows and somecombinatorial applications. Xiswryty authymutiws, 44(1):113–116, 1983.Chun-Wa Ko, Jon Lee, and Maurice Queyranne. An exact algorithm formaximum entropy sampling. cpyrutions fysyurwh, 43(4):684–691, 1995.Andreas Krause and Daniel Golovin. Submodular function maximization.,2014.Alex Kulesza, Ben Taskar, et al. Determinantal point processes for machinelearning. Founxutions unx hrynxs R© in auwhiny Lyurning, 5(2–3):123–286, 2012.Claire Launay, Bruno Galerne, and Agne`s Desolneux. Exact sampling ofdeterminantal point processes without eigendecomposition. urliv pryprinturlivNELDFBDLHFM, 2018.Fre´de´ric Lavancier, Jesper Møller, and Ege Rubak. Determinantal pointprocess models and statistical inference. Journul oz thy foyul gtutistiwulgowiytyN gyriys V (gtutistiwul aythoxology), 77(4):853–877, 2015.Nhu D Le and James V Zidek. gtutistiwul unulysis oz ynvironmyntul spuwyAtimy prowyssys. Springer Science & Business Media, 2006.Malcolm R Leadbetter. On a basis for’peaks over threshold’modeling. Tech-nical report, NORTH CAROLINA UNIV AT CHAPEL HILL CENTERFOR STOCHASTIC PROCESSES, 1990.Jon Lee. Semidefinite programming in experimental design. ]bhEfAbUh]cbUL gEf]Eg ]b cdEfUh]cbg fEgEUfCH UbX aUbU[EAaEbh gC]EbCE, pages 528–532, 2000.Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Efficient sampling for k-determinantal point processes. urliv pryprint urlivNEIDMBDEJEL, 2015.54VibliogruphyChengtao Li, Stefanie Jegelka, and Suvrit Sra. Polynomial time algorithmsfor dual volume sampling. In Uxvunwys in byurul ]nzormution drowyssinggystyms, pages 5038–5047, 2017.Odile Macchi. The coincidence approach to stochastic point processes. UxAvunwys in Uppliyx drovuvility, 7(1):83–122, 1975.Toby J Mitchell. An algorithm for the construction of d-optimal experimen-tal designs. hywhnomytriws, 16(2):203–210, 1974.Jesper Moller and Rasmus Plenge Waagepetersen. gtutistiwul inzyrynwy unxsimulution zor sputiul point prowyssys. Chapman and Hall/CRC, 2003.HN Nagaraja. Record values and related statistics-a review. CommuniwuAtions in gtutistiwsAhhyory unx aythoxs, 17(7):2223–2238, 1988.George L Nemhauser and Laurence A Wolsey. Chapter vi integer program-ming. Hunxvooks in cpyrutions fysyurwh unx aunugymynt gwiynwy, 1:447–527, 1989.George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An anal-ysis of approximations for maximizing submodular set functionsi. authAymutiwul progrumming, 14(1):265–294, 1978.Marcel F Neuts. Waitingtimes between record observations. Journul ozUppliyx drovuvility, 4(1):206–208, 1967.Valerii Borisovich Nevzorov. Records. hhyory oz drovuvility : ]ts UppliwuAtions, 32(2):201–228, 1988.Dietmar Pfeifer. Characterizations of exponential distributions by indepen-dent non-stationary record increments. Journul oz Uppliyx drovuvility, 19(1):127–135, 1982.James Pickands III. Statistical inference using extreme order statistics. thyUnnuls oz gtutistiws, pages 119–131, 1975.55VibliogruphyR Core Team. fN U Lunguugy unx Environmynt zor gtutistiwul Computing.R Foundation for Statistical Computing, Vienna, Austria, 2018. URLhhhdg.//www.R-drcjech.crg/.Alfred Re´nyi. The´orie des e´le´ments saillants dune suite dobservations. UnnBFuwB gwiB inivB ClyrmontAFyrrunx, 8:7–13, 1962.Sidney I Resnick. Limit laws for record values. gtowhustiw drowyssys unxhhyir Uppliwutions, 1(1):67–82, 1973.Sidney I Resnick. Efitrymy vuluys, rygulur vuriution unx point prowyssys.Springer, 1987.Ramiro Ruiz-Ca´rdenas, Marco AR Ferreira, and Alexandra M Schmidt.Stochastic search algorithms for optimal design of monitoring networks.EnvironmytriwsN hhy offiwiul journul oz thy ]ntyrnutionul Environmytriwsgowiyty, 21(1):102–112, 2010.Michael C Shewry and Henry PWynn. Maximum entropy sampling. Journuloz uppliyx stutistiws, 14(2):165–170, 1987.RW Shorrock. A limit theorem for inter-record times. Journul oz Uppliyxdrovuvility, 9(1):219–223, 1972a.RW Shorrock. On record values and record times. Journul oz Uppliyx drovAuvility, 9(2):316–326, 1972b.RW Shorrock. On discrete time extremal processes. Uxvunwys in Uppliyxdrovuvility, 6(3):580–592, 1974.William E Strawderman and Paul T Holmes. On the law of the iteratedlogarithm for inter-record times. Journul oz Uppliyx drovuvility, 7(2):432–439, 1970.Darrell Whitley. A genetic algorithm tutorial. gtutistiws unx womputing, 4(2):65–85, 1994.56Shiying Wu and James V Zidek. An entropy-based analysis of data from se-lected nadp/ntn network sites for 1983–1986. Utmosphyriw EnvironmyntBdurt UB [ynyrul hopiws, 26(11):2089–2103, 1992.Mark CK Yang. On the distribution of the inter-record times in an increasingpopulation. Journul oz Uppliyx drovuvility, 12(1):148–154, 1975.57Appynxifi Af Woxy zor gumpling zrom ukADdd7 k−Xdd gumplyr7 boty N Vusyx on Ulyfi Kulyzsu ' s sumpling u l go r i t hmsz o r Xdds7 aUhLUV woxy u v u i l u v l y on l iny ut7 h t t p NIIwyv B yyws B umiwh B yxuI~ ku l y s z uIwoxyIxpp B t g z7 Chywk Kulyzsu ' s t h y s i s (FDEF) z o r x y s w r i p t i o n o z thyu l go r i t hms7 hhy woxy ussumys y i g yn v y w t o r s Lj unx thy u s s o w i u t y x7 y i g ynvu l u y s lumvxu o z thy k y rny l in uxvunwy7777777777777777777777777777777777777777777777 Computy Elymyntury gymmmytriw dolynomiuls 7777777777777777777777777777777777777777777777Epoly = function ( lambda , k ) {N = length ( lambda )E = matrix (0 , nrow = k+1, ncol = N+1)E [ 1 , ] = 1for ( l in 2 : ( k+1) ) {for (n in 2 : (N+1) ) {E[ l , n ] = E[ l , n−1] + lambda [ n−1]E[ l −1,n−1]}}58Appynxifi AB f Woxy for gumpling from u kADddreturn (E)}77777777777777777777777777 gy l y w t k y i g ynvu l u y s 77777777777777777777777777ChooseEigenVals = function ( lambda , k ) {7 cvtuin y lymyntury symmytriw po lynomiu l sE = Epoly ( lambda , k )7 ] n i t i u l i z y to s t o r y k s y l y w t y x y i g ynvu l u y sS = array (0 , dim = k)i = length ( lambda )remaining = kwhile ( remaining R 0) {7 Computy murginul o z i g i vyn t hu t wy huvy whoosyn7 thy rymuining vu l u y s zrom EN ii f ( i == remaining ) {marg = 1}else {marg = lambda [ i ] E[ remaining , i ] I E[ remaining+1, i +1]}7 gumply murginuli f ( runif (1 ) P marg ) {S [ remaining ] = iremaining = remaining − 1}i = i−159Appynxifi AB f Woxy for gumpling from u kADdd}return (S)}77777777777777777777777777 gumpling zrom u k−Xdd 77777777777777777777777777DPPSample = function (L , k , lambda ,LV) {7 ] n i t i u l i z i n g vyw t o r to suvy k−Xdd sumplyy = array (0 , dim = k)7 gy l y w t i n g y i g ynvu l u y s z o r k−Xddv = ChooseEigenVals ( lambda , k )7 Corrysponxing wolumns in Lj mutrifiV = LV[ , v ]for ( i in ( k : 1 ) ) {7 Computy p r o v u v i l i t i y s z o r yuwh itymP = rowSums( t ( t (V) ) ˆ2)P = P I sum(P)7 Choosy u nyw itym to inw l uxyy [ i ] = which .max( runif (1 ) P= cumsum(P) )i f ( k == 1) break ;7 Choosy u vyw to r to y l im inu t y60Appynxifi AB f Woxy for gumpling from u kADddj = which .max(V[ y [ i ] , ] !=0)Vj = t ( t (V[ , j ] ) )V = t ( t (V[ ,− j ] ) )7 upxuty jV = V − Vj %% (V[ y [ i ] , ] IVj [ y [ i ] ] )7or t ho gonu l i z yi f (ncol (V) != 0) {normV = max(svd (V)$d)i f (normV != 1) {i f (ncol (V) != 1) {for ( a in which ( ( 1 : ( i −1) )R1) ) {for (b in which ( ( 1 : ( a−1) )R1) ) {V[ , a ] = t ( t (V[ , a ] ) ) −as .numeric ( t (V[ , a ] )%%t ( t (V[ , b ] ) ) )t ( t(V[ , b ] ) )}V[ , a ] = V[ , a ] Imax(svd (V[ , a ] ) $d)}}else { V[ , 1 ] = V[ , 1 ] Imax(svd (V[ , 1 ] ) $d) }}}}y = sort ( y )return ( y )}61Appynxifi Vdython Woxy zor GynytiwAlgorithmThe implementation is based on DEAP (Fortin et al., 2012), an evolutionarycomputation framework developed in Dyhhcn.7 hhis s w r i p t usys somy z u n w t i o n u l i t i y s o z XEUdBimport randomimport numpy as npfrom deap import basefrom deap import c r e a t o rfrom deap import t o o l sc r e a t o r . c r e a t e ( ”FitnessMax” , base . F i tness , we ights=(1 .0 , ) )c r e a t o r . c r e a t e ( ” Ind i v i dua l ” , l i s t , f i t n e s s=c r ea t o r .FitnessMax )too lbox = base . Toolbox ( )7 Ut t r i v u t y gynyru tor7 xy z i n y ' u t t r v o o l ' to vy un u t t r i v u t y ( ' gyny ' )7 whiwh worrysponxs to i n t y g y r s sumplyx uni zormly7 zrom thy rungy o D , E q ( i B y B D or E wi th yquu l7 p r o v u v i l i t y62Appynxifi VB dython Woxy for Gynytic Algorithmtoo lbox . r e g i s t e r ( ” a t t r b o o l ” , random . randint , 0 , 1)7 gtruw tu ry i n i t i u l i z y r s7 xy z i n y ' i n x i v i x u u l ' to vy un i n x i v i x u u l7 won s i s t i n g o z GD ' u t t r v o o l ' y lymynts ( ' gynys ' )too lbox . r e g i s t e r ( ” i nd i v i dua l ” , t o o l s . in i tRepeat ,c r e a t o r . Ind iv idua l , too lbox . a t t r boo l , 30)7 xy z i n y thy popu lu t i on to vy u l i s t o z i n x i v i x u u l stoo lbox . r e g i s t e r ( ” populat ion ” , t o o l s . in i tRepeat ,l i s t , too lbox . i nd i v i dua l )7 thy gou l ( ' z i t n y s s ' ) z unw t i on to vy mufiimizyxdef evalMaxDet ( i nd i v i dua l ) :Kernel = np . l oadtx t ( ”covEntropy30−86. txt ” )nonzeroInd = np . nonzero ( i nd i v i dua l ) [ 0 ]subKernel = Kernel [ nonzeroInd ] [ : , nonzeroInd ]i f sum( i nd i v i dua l )==10:return (np . l i n a l g . det ( subKernel ) , )else :return ( 0 . , )7−−−−−−−−−−7 cpyrutor r y g i s t r u t i o n7−−−−−−−−−−7 r y g i s t y r thy gou l C z i t n y s s z unw t i ontoo lbox . r e g i s t e r ( ” eva luate ” , evalMaxDet )7 r y g i s t y r thy w ro s sovy r opyru tortoo lbox . r e g i s t e r ( ”mate” , t o o l s . cxTwoPoint )63Appynxifi VB dython Woxy for Gynytic Algorithm7 r y g i s t y r u mutution opyru tor wi th u p r o v u v i l i t y to7 z l i p yuwh u t t r i v u t y Cgyny o z DBDItoo lbox . r e g i s t e r ( ”mutate” , t o o l s . mutFlipBit , indpb=0.05)7 opyru tor z o r s y l y w t i n g i n x i v i x u u l s z o r vryyx ing thynyfit7 gynyru t ion N yuwh i n x i v i x u u l o z thy wurryn tgynyru t ion7 i s ryp l uwyx vy thy ' z i t t y s t ' ( v y s t ) o z t h r y yi n x i v i x u u l s7 xruwn runxomly zrom thy wurryn t gynyru t ion Btoo lbox . r e g i s t e r ( ” s e l e c t ” , t o o l s . selTournament ,t ou rn s i z e=3)7−−−−−−−−−−def main ( ) :7 runxom B syyx (KID)7 wryu t y un i n i t i u l popu lu t i on o z GDD i n x i v i x u u l s( whyry7 yuwh i n x i v i x u u l i s u l i s t o z i n t y g y r s )pop = too lbox . populat ion (n=300)7 CldV i s thy p r o v u v i l i t y wi th whiwh twoi n x i v i x u u l s7 ury wrossyx77 aihdV i s thy p r o v u v i l i t y z o r mututing uni n x i v i x u u lCXPB, MUTPB = 0 .5 , 0 . 264Appynxifi VB dython Woxy for Gynytic Algorithmprint ( ” Star t o f evo lu t i on ” )7 Evuluuty thy yn t i r y popu lu t i onf i t n e s s e s = l i s t (map( too lbox . eva luate , pop ) )for ind , f i t in zip ( pop , f i t n e s s e s ) :ind . f i t n e s s . va lue s = f i tprint ( ” Evaluated %i i n d i v i d u a l s ” % len ( pop ) )7 Efit ruw t ing u l l thy z i t n y s s y s o zf i t s = [ ind . f i t n e s s . va lue s [ 0 ] for ind in pop ]7 juriuv l y kyyp ing t ruwk o z thy numvyr o zgynyru t i onsg = 07 Vygin thy y v o l u t i onwhile g P 1000 :7 U nyw gynyru t iong = g + 1print ( ”−− Generation %i −−” % g )7 gy l y w t thy nyfit gynyru t ion i n x i v i x u u l so f f s p r i n g = too lbox . s e l e c t ( pop , len ( pop ) )7 Clony thy s y l y w t y x i n x i v i x u u l so f f s p r i n g = l i s t (map( too lbox . c lone , o f f s p r i n g ))7 Upply w ro s sovy r unx mutution on thyo z z s p r i n gfor ch i ld1 , ch i l d 2 in zip ( o f f s p r i n g [ : : 2 ] ,o f f s p r i n g [ 1 : : 2 ] ) :65Appynxifi VB dython Woxy for Gynytic Algorithm7 wros s two i n x i v i x u u l s wi th p r o v u v i l i t yCldVi f random . random ( ) P CXPB:too lbox . mate ( ch i ld1 , ch i l d2 )7 z i t n y s s vu l u y s o z thy w h i l x r yn7 must vy r y w u l w u l u t y x l u t y rdel ch i l d1 . f i t n e s s . va lue sdel ch i l d2 . f i t n e s s . va lue sfor mutant in o f f s p r i n g :7 mututy un i n x i v i x u u l wi th p r o v u v i l i t yaihdVi f random . random ( ) P MUTPB:too lbox . mutate (mutant )del mutant . f i t n e s s . va lue s7 Evuluuty thy i n x i v i x u u l s wi th un i n v u l i xz i t n y s si n v a l i d i n d =[ ind for ind in o f f s p r i n g i f not ind . f i t n e s s .v a l i d ]f i t n e s s e s = map( too lbox . eva luate , i n v a l i d i n d )for ind , f i t in zip ( i nva l i d i nd , f i t n e s s e s ) :ind . f i t n e s s . va lue s = f i tprint ( ” Evaluated %i i n d i v i d u a l s ” % len (i n v a l i d i n d ) )7 hhy popu lu t i on i s y n t i r y l y r yp l uwyx vy thyo z z s p r i n gpop [ : ] = o f f s p r i n g66Appynxifi VB dython Woxy for Gynytic Algorithm7 [uthyr u l l thy z i t n y s s y s in ony l i s t unxp r i n t thy s t u t sf i t s = [ ind . f i t n e s s . va lue s [ 0 ] for ind in pop ]l ength = len ( pop )mean = sum( f i t s ) / l engthsum2 = sum( x∗x for x in f i t s )std = abs ( sum2 / length − mean∗∗2) ∗∗0 .5print ( ” Min %s” % min( f i t s ) )print ( ” Max %s” % max( f i t s ) )print ( ” Avg %s” % mean)print ( ” Std %s” % std )print ( ”−− End o f ( s u c c e s s f u l ) evo lu t i on −−” )be s t i nd = t o o l s . s e lBe s t ( pop , 1) [ 0 ]print ( ”Best i nd i v i dua l i s %s , %s ” %( bes t ind , b e s t i nd . f i t n e s s . va lue s ) )7 pr in t ( v y s t i n x B z i t n y s s B vu l u y s o D q )i f name == ” main ” :main ( )67
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Applying record value theory in combinatorial optimization...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Applying record value theory in combinatorial optimization with application to environmental statistics Wang, Yu 2018
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Applying record value theory in combinatorial optimization with application to environmental statistics |
Creator |
Wang, Yu |
Publisher | University of British Columbia |
Date Issued | 2018 |
Description | We consider the problem of optimal subset selection from a set of correlated random variables. In particular, we consider the associated combinatorial optimization problem of maximizing the determinant of a symmetric positive semidefinite matrix that characterizes the chosen subset. This problem arises in many domains, such as experimental designs, regression modelling, and environmental statistics. In this thesis, we attempt to establish an efficient polynomial-time algorithm for approximating the optimal solution to the problem. Firstly, we employ determinantal point processes, a special class of spatial point processes, to develop an easy-to-implement sampling-based stochastic search algorithm for the task of finding approximations to the combinatorial optimization problem. Secondly, we establish theoretical tools for assessing the quality of those approximations using statistical results from record value theory, the study of record values and related statistics from a sequence of observations. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2018-08-14 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0371004 |
URI | http://hdl.handle.net/2429/66768 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2018-09 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2018_september_wang_yu.pdf [ 2.14MB ]
- Metadata
- JSON: 24-1.0371004.json
- JSON-LD: 24-1.0371004-ld.json
- RDF/XML (Pretty): 24-1.0371004-rdf.xml
- RDF/JSON: 24-1.0371004-rdf.json
- Turtle: 24-1.0371004-turtle.txt
- N-Triples: 24-1.0371004-rdf-ntriples.txt
- Original Record: 24-1.0371004-source.json
- Full Text
- 24-1.0371004-fulltext.txt
- Citation
- 24-1.0371004.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0371004/manifest