On the Computational Asymptotics of Gaussian Variational Inference

by

Zuheng Xu

B.S., Sichuan University, 2018

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES

(Statistics)

The University of British Columbia (Vancouver)

September 2020

© Zuheng Xu, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

On the Computational Asymptotics of Gaussian Variational Inference

submitted by Zuheng Xu in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE in Statistics.

Examining Committee:

Trevor Campbell, Statistics (Supervisor)
Alexandre Bouchard-Côté, Statistics (Additional Examiner)

Abstract

Variational inference is a popular alternative to Markov chain Monte Carlo methods that constructs a Bayesian posterior approximation by minimizing a discrepancy to the true posterior within a pre-specified family. This converts Bayesian inference into an optimization problem, enabling the use of simple and scalable stochastic optimization algorithms. However, a key limitation of variational inference is that the optimal approximation is typically not tractable to compute; even in simple settings the problem is nonconvex. Thus, recently developed statistical guarantees—which all involve the (data) asymptotic properties of the optimal variational distribution—are not reliably obtained in practice. In this work, we provide two major contributions: a theoretical analysis of the asymptotic convexity properties of variational inference in the popular setting with a Gaussian family; and consistent stochastic variational inference (CSVI), an algorithm that exploits these properties to find the optimal approximation in the asymptotic regime.
CSVI consists of a tractable initialization procedure that finds the local basin of the optimal solution, and a scaled gradient descent algorithm that stays locally confined to that basin. Experiments on nonconvex synthetic examples show that compared with standard stochastic gradient descent, CSVI improves the likelihood of obtaining the globally optimal posterior approximation.

Lay Summary

Bayesian inference is a statistical methodology for obtaining insights from data. Variational inference is a formulation of Bayesian inference as an optimization problem. The optimization problem is generally quite difficult to solve reliably; thus, although there is a wealth of previous work on understanding the statistical properties of the optimal solution, these guarantees are not achievable in practice. This thesis provides two major contributions: first, it provides a theoretical analysis of the computational properties of the optimization problem given a large amount of data; and then it exploits those properties to provide a simple and efficient scheme to solve the optimization problem more reliably.

Preface

This thesis is original, unpublished work by the author, Zuheng Xu, under the supervision of Professor Trevor Campbell. T. Campbell proposed the methodology in Chapter 4 and proved Theorems 3.3.2, 3.3.5 and A.2.1; aside from these, all other results, software, and experiments were contributed by the author.

Table of Contents

Abstract ......... iii
Lay Summary ......... iv
Preface ......... v
Table of Contents ......... vi
List of Figures ......... viii
Acknowledgments ......... x
1 Introduction ......... 1
2 Gaussian Variational Inference ......... 5
3 Properties of the Optimization Problem ......... 8
  3.1 Statistical model and assumptions ......... 8
  3.2 Global optimum consistency ......... 11
  3.3 Convexity and smoothness ......... 13
4 Consistent Stochastic Variational Inference (CSVI) ......... 27
  4.1 Initialization via smoothed MAP ......... 28
    4.1.1 Smoothed MAP problem ......... 29
    4.1.2 Smoothed MAP optimization ......... 31
  4.2 Optimization via scaled projected SGD ......... 31
5 Experiments ......... 34
  5.1 Synthetic Gaussian mixture ......... 34
  5.2 Synthetic model with a nonconvex prior ......... 36
6 Conclusion ......... 39
Bibliography ......... 41
A Proofs ......... 46
  A.1 Proof for Theorem 4.1.1 ......... 46
    A.1.1 Gradient and Hessian derivation ......... 46
    A.1.2 Proof of 1st statement of Theorem 4.1.1 ......... 47
    A.1.3 Proof of 2nd statement of Theorem 4.1.1 ......... 57
  A.2 Proof for Theorem 4.2.1 ......... 62

List of Figures

Figure 3.1 Plots of the function fn(y) from Example 3.3.7. Each row of figures represents a single realization of the sequence (fn)n∈N for increasing sample sizes 5, 20, 100, and 1000. Each column includes three repetitions of fn under a single n. As n increases, the function fn(y) is more likely to be strongly convex and Lipschitz smooth with constants approaching 2. .........
24

Figure 4.1 Plots of the smoothed posterior density π̂n with increasing smoothing variance. ......... 30

Figure 5.1 The result of running 10 trials of CSVI (blue) and SVI (pink) with the Gaussian mixture target (grey) given in Eq. (5.1). The output of CSVI reliably finds the global optimum solution corresponding to the central mixture peak; SVI often provides solutions corresponding to spurious local optima. ......... 35

Figure 5.2 The Bayesian posterior density for increasing dataset sizes. Note the large number of spurious local optima, resulting in the unreliability of local optimization methods in variational inference. ......... 36

Figure 5.3 The smoothed Bayesian posterior density for the same dataset sizes as in Fig. 5.2. Black curves correspond to the smoothed posterior, red dots show local optima of the density, and the blue histogram shows the counts (over 100 trials) of the output of the smoothed MAP initialization. Note that there are fewer local optima relative to the original posterior density, and that the smoothed MAP initialization is likely to provide a mean close to that of the optimal variational distribution. ......... 36

Figure 5.4 Box-plots of the final ELBO for 100 trials of CSVI and SVI. ......... 37

Acknowledgments

First and foremost, I would like to express my deepest gratitude to my supervisor, Prof. Trevor Campbell, who provided an uncountable amount of guidance and encouragement over the last two years so that I was able to keep making progress and recover from frustrations. His enthusiasm and passion for research have constantly influenced me and motivated me to pursue a Ph.D. I feel extremely lucky to be his student and to keep working with him in the future.

I would also like to thank Prof. Alexandre Bouchard-Côté for being my second reader and for his insightful suggestions concerning my thesis. Not only that, I have really appreciated my interactions with Alex.
I have learned so much from his course and reading groups, which have brought me into the world of Bayesian computation. I am also grateful to the whole department. All the free sandwiches and interesting conversations that happened in the lounge have become good memories. I have to say, this is the best department in the world!

Finally, I would love to thank my family and my friends. Their unconditional support and love are, and will always be, the reason for me to move forward.

Chapter 1

Introduction

Bayesian statistical models are powerful tools for learning from data, with the ability to encode complex hierarchical dependence and domain expertise, as well as coherently quantify uncertainty in latent parameters. Unfortunately, for many modern Bayesian models, exact computation of the posterior is intractable (Blei et al., 2017, Section 2.1) and statisticians must resort to approximate inference algorithms. Currently, the most popular type of Bayesian inference algorithm in statistics is Markov chain Monte Carlo (MCMC) (Gelfand and Smith, 1990; Hastings, 1970; Robert and Casella, 2013), which provides approximate samples from the posterior distribution supported by a comprehensive literature of theoretical guarantees (Meyn and Tweedie, 2012; Roberts and Rosenthal, 2004).
However, in the setting of large-scale data, traditional MCMC methods tend to be computationally intractable due to the need to compute the full data likelihood in each step, which is required to maintain the Bayesian posterior as the stationary distribution.

Variational inference (Blei et al., 2017; Jordan et al., 1998; Wainwright and Jordan, 2008) is a popular alternative to classical MCMC methods that approximates the intractable posterior with a distribution chosen from a pre-specified family, e.g., the family of Gaussian distributions parametrized by mean and covariance. The approximating distribution is chosen by minimizing a discrepancy—such as the Kullback-Leibler (KL) divergence (Murphy, 2012, Section 2.8) or Rényi α-divergence (Van Erven and Harremos, 2014)—to the posterior distribution over the family, thus converting Bayesian inference into an optimization problem. This formulation enables the use of simple, efficient stochastic optimization algorithms (Bottou, 2004; Robbins and Monro, 1951) that require only a subsample of the data at each iteration, avoiding computation on the entire dataset.

But despite its computational tractability, variational inference has two key limitations that prevent its widespread adoption in the statistical community. First, one must select an appropriate parametric family of distributions from which to select the variational approximation. This choice of family presents a tradeoff: a simple family typically enables the design of fast local optimization algorithms, but limits the achievable fidelity of the posterior approximation. It is in general difficult to know how limited the family is before actually optimizing; and not only that, it is also often difficult to estimate the approximation error once the optimization is complete (Huggins et al., 2020).
For example, if one chooses a mean-field variational family in which variables are assumed to be independent, the resulting posterior approximation will typically underestimate their true posterior variances and cannot capture their covariances (Murphy, 2012, Section 21.2.2), two quantities of particular interest to statisticians. A more flexible variational family may result in a lower achievable approximation error, but the error is still not known in advance and typically results in more expensive computation. The second key limitation is that even if the family could be chosen carefully to have favourable computational properties and a global optimum with low approximation error, the optimization problem itself is typically nonconvex and the global optimum cannot be found reliably.

The key to addressing the first limitation of variational inference is to understand the minimum approximation error within a particular variational family. This is quite difficult given finite data; Han and Yang (2019) provide a non-asymptotic analysis of the optimal mean-field variational approximation, but extending these results to more general distribution families is not straightforward. However, multiple threads of research have explored the statistical properties of variational inference in an asymptotic regime by taking advantage of the limiting behavior of the Bayesian posterior. Wang and Blei (2019) exploit the asymptotic normality of the posterior distribution in a parametric Bayesian setting to show that the KL minimizer among the variational family to the posterior converges to the KL minimizer to the limiting distribution of the posterior under infinite samples—a normal distribution. Alquier and Ridgway (2020) analyze the rate of convergence of the variational approximation to a fractional posterior—a posterior with a tempered likelihood—in a high dimensional setting where the posterior itself may not have the ideal asymptotic behavior.
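The mean-field variance underestimation mentioned above can be made concrete when the target is itself Gaussian: the KL(q‖p)-optimal product-Gaussian factor for coordinate i has variance 1/Λii, the reciprocal of the precision-matrix diagonal, which never exceeds the true marginal variance Σii (a standard closed-form result; the 2-d target and correlation value below are purely illustrative):

```python
import numpy as np

# Hypothetical 2-d correlated Gaussian "posterior" with correlation rho.
rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Lambda = np.linalg.inv(Sigma)  # precision matrix

# For a Gaussian target, the KL(q||p)-optimal mean-field Gaussian gives
# coordinate i the variance 1 / Lambda[i, i], which never exceeds the true
# marginal variance Sigma[i, i] (equality only when rho = 0).
mf_var = 1.0 / np.diag(Lambda)
true_var = np.diag(Sigma)
# Here mf_var ≈ [0.19, 0.19] while true_var = [1.0, 1.0]: the stronger the
# correlation, the more severe the underestimation.
```

For this family, 1/Λii = (1 − ρ²)Σii, so the shortfall grows with the correlation that the mean-field family discards.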
Zhang and Gao (2020) study the contraction rate of the variational distribution for nonparametric Bayesian inference and provide general conditions on the Bayesian model that characterize the rate. Yang et al. (2020) and Jaiswal et al. (2019) build a framework for analyzing the statistical properties of α-Rényi variational inference, and provide sufficient conditions that guarantee an optimal convergence rate of the point estimate obtained from variational inference. But while this literature has built a comprehensive understanding of the asymptotic statistical guarantees of optimal variational posterior approximations, the nonconvexity of the optimization problem makes these guarantees difficult to obtain reliably in practice. In fact, Proposition 3.3.3 of the present work demonstrates that the problem is nonconvex even in the simple case of Gaussian variational inference with ideal asymptotic posterior behaviour. Therefore, addressing the nonconvexity of variational inference is meaningful for both computational and theoretical reasons.

In this work, we address the nonconvexity of Gaussian variational inference in the data-asymptotic regime when the posterior distribution admits asymptotic normality. However, rather than focusing on the statistical properties of the optimal Gaussian variational proxy, we investigate and exploit the asymptotic properties of the optimization problem itself (Chapter 3), and use these to design a procedure (Chapter 4) which enables tractable Gaussian variational optimization and hence makes theoretical results regarding the optimal solution applicable. We develop consistent stochastic variational inference (CSVI), an efficient and simple algorithm for Gaussian variational inference that guarantees that the probability of achieving the global optimum converges to 1 in the limit of observed data.
The two key ingredients of CSVI are the choice of initialization for the Gaussian mean (Section 4.1) and the design of a scaled projected stochastic optimization algorithm (Section 4.2). We use the mode of a smoothed posterior—the posterior distribution convolved with Gaussian noise—as the mean initialization, which can be solved by a tractable optimization formulated in Section 4.1. We show that this procedure initializes Gaussian variational inference in a local region where the optimization becomes convex with increasing sample size, and where the global optimum eventually lies. We then show that the novel scaled projected stochastic gradient algorithm is guaranteed to stay inside this local region and eventually converge to the global optimum. Experiments on synthetic examples in Chapter 5 show that CSVI provides numerically stable and asymptotically consistent posterior approximations.

Chapter 2

Gaussian Variational Inference

In the setting of Bayesian inference considered in this paper, we are given a sequence of posterior distributions Πn, n ∈ N, each with full support on R^d. The index n represents the amount of observed data; denote Π0 to be the prior. We consider the problem of approximating the posterior distribution via Gaussian variational inference, i.e.,

  arg min_{µ∈R^d, Σ∈R^{d×d}} DKL(N(µ, Σ) ‖ Πn)  s.t. Σ ≻ 0,

where the Kullback-Leibler divergence (Murphy, 2012, Section 2.8) is defined as

  DKL(Q ‖ P) := ∫ log(dQ/dP) dQ

for any pair of probability distributions P, Q such that Q ≪ P, and dQ/dP is the Radon-Nikodym derivative of Q with respect to P (Folland, 1999, Section 3.2). We further assume that each posterior Πn has density πn with respect to the Lebesgue measure, and use the standard reparametrization of Σ using the Cholesky factorization Σ = n^{-1} L L^T to arrive at the common formulation of Gaussian variational inference (Kucukelbir et al., 2017) that is the focus of the present work:

  µ*_n, L*_n = arg min_{µ∈R^d, L∈R^{d×d}}  −n^{-1} log det L − E[ n^{-1} log πn(µ + n^{-1/2} L Z) ]
  s.t. L lower triangular with positive diagonal,  Z ∼ N(0, I).   (2.1)

Denote the optimal Gaussian distribution N*_n := N(µ*_n, n^{-1} L*_n L*_n^T). Intuitively, this optimization problem encodes a tradeoff between maximizing the expected posterior density under the variational approximation—which tries to make L small and move µ close to the maximum point of πn—and maximizing the entropy of the variational approximation—which prevents L from becoming too small. It crucially does not depend on the (typically unknown) normalization of πn, which appears as an additive constant in Eq. (2.1); it is common to drop this constant and instead equivalently maximize the evidence lower bound (ELBO) (Blei et al., 2017). Note that there are a number of unconstrained parametrizations of the covariance matrix variable Σ (Pinheiro and Bates, 1996). We select the (unique) positive-diagonal Cholesky factor L as it makes the optimization problem Eq. (2.1) more amenable to both theoretical analysis and computational optimization.

One typically attempts to solve Eq. (2.1) using an iterative local descent optimization algorithm. In cases where the expectation in the objective is analytically tractable—e.g., in exponential family models with a mean-field variational approximation—coordinate descent is the standard approach (Bishop, 2006; Blei et al., 2017). However, the expectation is intractable in most scenarios, and one must instead rely on stochastic gradient estimates (Hoffman et al., 2013; Kingma and Welling, 2014; Kucukelbir et al., 2017; Ranganath, 2014). In particular, assuming one can interchange expectation and differentiation (see Section 3.1 for details), the quantities

  ∇̂µ,n(µ, L, Z) := −n^{-1} ∇ log πn(µ + n^{-1/2} L Z)
  ∇̂L,n(µ, L, Z) := −n^{-1} (diag L)^{-1} − n^{-3/2} tril[ ∇ log πn(µ + n^{-1/2} L Z) Z^T ],   (2.2)

are unbiased estimates of the µ- and L-gradients of the objective in Eq.
(2.1) given Z ∼ N(0, I), where the functions diag : R^{d×d} → R^{d×d} and tril : R^{d×d} → R^{d×d} set the off-diagonal and upper triangular elements of their arguments to 0, respectively. These unbiased gradient estimates may be used in a wide variety of stochastic optimization algorithms (Bottou, 2004; Robbins and Monro, 1951) applied to Eq. (2.1). In this paper, we will focus on projected stochastic gradient descent (SGD) (Bubeck, 2015, Section 3) due to its simplicity; we expect that the mathematical theory in this work extends to other related methods.

In general, Gaussian variational inference is a nonconvex optimization problem and standard iterative methods such as SGD are not guaranteed to produce a sequence of iterates that converge to µ*_n, L*_n. The goal of this work is to address this limitation by developing an iterative algorithm that uses only the black-box stochastic gradient estimates from Eq. (2.2) to reliably find the globally optimal solution of Eq. (2.1).

Chapter 3

Properties of the Optimization Problem

In this section, we investigate the properties of the Gaussian variational inference optimization problem Eq. (2.1). We take advantage of the theory of statistical asymptotics to show that as we obtain more data, the optimum solution of Eq. (2.1) converges to a fixed value and the objective function becomes locally strongly convex around that fixed value.

3.1 Statistical model and assumptions

As is common in past work (Ghosal et al., 2000; Kleijn and van der Vaart, 2012; Shen and Wasserman, 2001), we take a frequentist approach to analyzing Bayesian inference. We assume that the observations are independent and identically distributed, (Xi)_{i=1}^n i.i.d.∼ P_{θ0}, from a distribution P_{θ0} with parameter θ0 ∈ R^d selected from a parametric family {P_θ : θ ∈ R^d}. We further assume that for each θ ∈ R^d, P_θ has common support, has density p_θ with respect to some common base measure, and that p_θ(x) is a Lebesgue measurable function of θ for all x.
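As a quick numerical sanity check of the estimators in Eq. (2.2), the sketch below averages them over many draws of Z for a toy quadratic target, log πn(θ) = −n‖θ‖²/2 + const (a hypothetical stand-in chosen only because the exact µ- and L-gradients of the objective in Eq. (2.1) are then available in closed form), and compares the averages with the analytic gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 50

# Hypothetical toy target: log pi_n(theta) = -n * ||theta||^2 / 2 + const,
# so grad log pi_n(theta) = -n * theta.
def grad_log_pi(theta):
    return -n * theta

def grad_estimates(mu, L, Z):
    # Unbiased mu- and L-gradient estimates from Eq. (2.2).
    g = grad_log_pi(mu + L @ Z / np.sqrt(n))
    g_mu = -g / n
    g_L = -np.diag(1.0 / np.diag(L)) / n - np.tril(np.outer(g, Z)) / n**1.5
    return g_mu, g_L

mu = np.array([0.3, -0.2])
L = np.array([[1.0, 0.0], [0.5, 2.0]])

# Average the stochastic gradients over many draws of Z.
S = 50000
sum_mu, sum_L = np.zeros(d), np.zeros((d, d))
for _ in range(S):
    g_mu, g_L = grad_estimates(mu, L, rng.standard_normal(d))
    sum_mu += g_mu
    sum_L += g_L
avg_mu, avg_L = sum_mu / S, sum_L / S

# Analytic gradients of Eq. (2.1) for this quadratic target:
# grad_mu F = mu, and grad_L F = L / n - diag(1/L_ii) / n.
exact_mu = mu
exact_L = L / n - np.diag(1.0 / np.diag(L)) / n
```

The Monte Carlo averages agree with the closed-form gradients up to sampling noise, which is the unbiasedness that stochastic optimization of Eq. (2.1) relies on.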
Finally, we assume the prior distribution Π0 has full support on R^d with density π0 with respect to the Lebesgue measure. Thus by Bayes' rule, the posterior distribution Πn has density proportional to the prior density times the likelihood, i.e.,

  πn(θ) ∝ π0(θ) ∏_{i=1}^n p_θ(Xi).

In order to develop the theory in this work, we require a set of additional technical assumptions on π0 and p_θ given by Assumption 1. These are a collection of regularity conditions that are standard for parametric models, which guarantee that the maximum likelihood estimate (MLE) θMLE,n := arg max_{θ∈R^d} Σ_{i=1}^n log p_θ(Xi) is well-defined and asymptotically consistent for θ0 (Lehmann and Casella, 2006, Chapter 6, Thm 3.7), and that the Bayesian posterior distribution of √n(θ − θ0) converges in total variation to a Gaussian distribution; this is known as the Bernstein-von Mises theorem (van der Vaart, 2000, p. 141).

Theorem 3.1.1 (Bernstein-von Mises & MLE consistency). Under Assumption 1,

  θMLE,n → θ0 in P_{θ0}-probability, and DTV( Πn , N(n^{-1/2} ∆_{n,θ0} + θ0, n^{-1} H_{θ0}^{-1}) ) → 0 in P_{θ0}-probability,   (3.1)

where ∆_{n,θ0} = n^{-1/2} Σ_{i=1}^n H_{θ0}^{-1} ∇ log p_{θ0}(Xi).

Assumption 1 (Regularity Conditions).

1. {P_θ : θ ∈ R^d} is an identifiable family of distributions;

2. For all x, θ, the densities π0, p_θ are positive and twice continuously differentiable in θ;

3. For all θ, E_θ[∇ log p_θ(X)] = 0;

4. For all θ,

  H_θ := −E_θ[∇² log p_θ(X)] = E_θ[∇ log p_θ(X) ∇ log p_θ(X)^T],

and H_{θ0} ⪰ εI for some ε > 0. Further, for θ, θ′ in a neighbourhood of θ0, the map (θ, θ′) ↦ E_{θ′}[−∇² log p_θ(X)] is continuous in spectral norm;

5. There exists a measurable function g(x) such that for θ in a neighbourhood of θ0 and for all x,

  max_{i,j∈[d]} |[∇² log p_θ(x)]_{i,j}| < g(x),  E_{θ0}[g(X)] < ∞.

Note that the above conditions in Assumption 1 are stronger (van der Vaart, 2000, Lemmas 7.6 and 10.6) than the usual local asymptotic normality (van der Vaart, 2000, Section 7) and testability (van der Vaart, 2000, p. 141) conditions required for asymptotic posterior concentration and Gaussianity in the Bayesian asymptotics literature.
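The convergence in Eq. (3.1) is easy to see numerically in a conjugate model where the posterior is available exactly. The sketch below uses a Bernoulli likelihood with a uniform prior (so the posterior is a Beta distribution) and hypothetical data with empirical success rate 0.3; it estimates the total variation distance to the Bernstein-von Mises Gaussian on a grid and shows it shrinking as n grows:

```python
import numpy as np
from scipy.stats import beta, norm

def tv_to_gaussian(n, theta_hat=0.3):
    # Bernoulli likelihood with a uniform prior: the posterior after s
    # successes in n trials is Beta(s + 1, n - s + 1) exactly.
    s = round(theta_hat * n)  # hypothetical data with success rate theta_hat
    post = beta(s + 1, n - s + 1)
    # Bernstein-von Mises limit: N(theta_mle, H^{-1} / n), with Fisher
    # information H = 1 / (theta (1 - theta)) for the Bernoulli model.
    approx = norm(theta_hat, np.sqrt(theta_hat * (1 - theta_hat) / n))
    grid = np.linspace(1e-6, 1 - 1e-6, 200001)
    dx = grid[1] - grid[0]
    # Total variation distance = (1/2) * integral of |p - q| (Riemann sum).
    return 0.5 * np.sum(np.abs(post.pdf(grid) - approx.pdf(grid))) * dx

tvs = [tv_to_gaussian(n) for n in (20, 200, 2000)]  # shrinks as n grows
```

Note, as emphasized below, that this is total variation convergence of densities, not convergence of the log density or its derivatives; the latter is exactly what the convexity analysis of this chapter cannot take for granted.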
Many of the results in this work would still hold with these weaker conditions, but we prefer Assumption 1 for the present work as these conditions are much simpler to state and check in practice.

We require two conditions beyond the basic regularity conditions in Assumption 1. First, we require that the maximum a posteriori (MAP) estimate θMAP,n := arg max_{θ∈R^d} log πn(θ) converges at a √n rate to θ0. This is not a particularly strong assumption—the MAP estimate typically has the same asymptotic properties as the MLE—but we require this to ensure that we can initialize the optimization algorithm for variational inference in a manner that ensures convergence to the global optimum. Readers who are interested in sufficient conditions for ensuring MAP consistency may consult Bassett and Deride (2019); Dashti et al. (2013); Grendár and Judge (2009); Stefanski and Boos (2002). Second, we require control on the smoothness of the log posterior density log πn asymptotically. In this work, we impose a bound on the second derivative, but we conjecture that bounds on higher-order derivatives would also suffice; see Section 3.3 for details.

Assumption 2 (MAP √n-consistency). The maximum a posteriori point satisfies ‖θMAP,n − θ0‖ = O_{P_{θ0}}(1/√n).

Assumption 3 (Asymptotic Smoothness). There exists an ℓ > 0 such that

  P( sup_θ ‖n^{-1} ∇² log πn(θ)‖₂ > ℓ ) → 0,

where ‖·‖₂ denotes the spectral norm of matrices.

3.2 Global optimum consistency

The first important property of Gaussian variational inference is that the optimum solution µ*_n, L*_n converges in probability to θ0, L0 under the same conditions required for the Bernstein-von Mises theorem (Theorem 3.1.1), where θ0 is the true data-generating parameter and L0 is the unique positive-diagonal Cholesky factor of the inverse Fisher information matrix H_{θ0}^{-1} = L0 L0^T. In other words, the global solution of Gaussian variational inference is a statistically consistent estimator; if we can develop an algorithm that solves Eq.
(2.1) reliably, we therefore have an asymptotically consistent Bayesian inference procedure. Theorem 3.2.1 makes this statement precise; the proof follows directly from a result regarding the total variation consistency of the optimal variational distribution (Wang and Blei, 2019) and the continuity of the positive-diagonal Cholesky decomposition (Schatzman, 2002, p. 295).

Theorem 3.2.1. Under Assumption 1,

  ∀ε > 0, lim_{n→∞} P( ‖µ*_n − θ0‖ < ε, ‖L*_n − L0‖ < ε ) = 1.

Proof. We consider the KL cost for the scaled and shifted posterior distribution. Let Π̃n be the Bayesian posterior distribution of √n(θ − θ0). The KL divergence measures the difference between the distributions of two random variables and is invariant when an invertible transformation is applied to both random variables (Qiao and Minematsu, 2010, Theorem 1). Note that Π̃n is shifted and scaled from Πn, and that this linear transformation is invertible, so

  DKL(N(µ, Σ) ‖ Πn) = DKL( N(√n(µ − θ0), nΣ) ‖ Π̃n ).

Let µ̃*_n, Σ̃*_n be the parameters of the optimal Gaussian variational approximation to Π̃n, i.e.,

  µ̃*_n, Σ̃*_n = arg min_{µ∈R^d, Σ∈R^{d×d}} DKL( N(µ, Σ) ‖ Π̃n )  s.t. Σ ≻ 0,

and let

  Ñ*_n := N(µ̃*_n, Σ̃*_n) = N( √n(µ*_n − θ0), L*_n L*_n^T ).

Wang and Blei (2019, Corollary 7) show that under Assumption 1,

  DTV( Ñ*_n , N(∆_{n,θ0}, H_{θ0}^{-1}) ) → 0 in P_{θ0}-probability.

Convergence in total variation implies weak convergence, which then implies pointwise convergence of the characteristic function. Denote φ̃*_n(t) and φn(t) to be the characteristic functions of Ñ*_n and N(∆_{n,θ0}, H_{θ0}^{-1}). Therefore

  ∀t ∈ R^d, φ̃*_n(t)/φn(t) = exp( i(√n(µ*_n − θ0) − ∆_{n,θ0})^T t − (1/2) t^T (L*_n L*_n^T − H_{θ0}^{-1}) t ) → 1 in P_{θ0}-probability,

which implies

  µ*_n − (n^{-1/2} ∆_{n,θ0} + θ0) → 0  and  L*_n L*_n^T → H_{θ0}^{-1} = L0 L0^T  in P_{θ0}-probability.

Under Assumption 1, van der Vaart (2000, Theorem 8.14) states that

  ‖∆_{n,θ0} − √n(θMLE,n − θ0)‖ → 0 in P_{θ0}-probability,

and θMLE,n → θ0 in P_{θ0}-probability according to Eq. (3.1), yielding µ*_n → θ0 in P_{θ0}-probability.

Finally, since the Cholesky decomposition defines a continuous mapping from the set of positive definite Hermitian matrices to the set of lower triangular matrices with positive diagonals (both sets equipped with the spectral norm) (Schatzman, 2002, p. 295), we have L*_n → L0 in P_{θ0}-probability. ∎

3.3 Convexity and smoothness

The statistical consistency of the optimal parameters µ*_n, L*_n alone does not provide a complete analysis of the asymptotics of Gaussian variational inference; indeed, it is in general not tractable to actually compute or approximate the solution µ*_n, L*_n, which diminishes the utility of Theorem 3.2.1 in practice. In order to make use of the consistency result, we require that solving the Gaussian variational inference problem is tractable in some sense. In this section, we investigate the tractability of Gaussian variational inference as formulated in Eq. (2.1). Since we have access only to (stochastic estimates of) the gradient of the objective function in Eq. (2.1), and projected stochastic gradient descent is known to solve optimization problems with strongly convex and Lipschitz smooth objectives (Bottou, 2004; Rakhlin et al., 2012), this amounts to investigating the convexity and smoothness of the objective function.¹

We will begin by focusing on the expectation term in the objective of Eq. (2.1),

  fn : R^d → R,  fn(x) := −n^{-1} log πn(x),   (3.2)
  Fn : R^d × R^{d×d} → R,  Fn(µ, L) := E[ fn(µ + n^{-1/2} L Z) ],  Z ∼ N(0, I).

The first main result of this section is that convexity and smoothness of the log posterior density fn implies the same for Fn(µ, L). We begin with a generalization of the typical definitions of strong convexity and Lipschitz smoothness found in the literature (Boyd and Vandenberghe, 2004) in Definition 3.3.1, and then provide the precise theoretical statement in Theorem 3.3.2.

Definition 3.3.1 (Convexity and Smoothness).
Let g : X → R be a twice differentiable function on a convex set X ⊆ R^d, and let D : X → R^{d×d} be a positive definite matrix depending on x. Then g is D-strongly convex if

  ∀x ∈ X, ∇²g(x) ⪰ D(x),

and g is D-Lipschitz smooth if

  ∀x ∈ X, −D(x) ⪯ ∇²g(x) ⪯ D(x).

¹There are many other properties one might require of a tractable optimization problem, e.g., pseudoconvexity (Crouzeix and Ferland, 1982), quasiconvexity (Arrow and Enthoven, 1961), or invexity (Ben-Israel and Mond, 1986). We focus on convexity as it does not impose overly stringent assumptions on our theory and has stronger implications than each of the aforementioned conditions.

Theorem 3.3.2. Suppose fn is D-strongly convex (D-Lipschitz smooth) for positive definite matrix D ∈ R^{d×d}. Then Fn reinterpreted as a function from R^{(d+1)d} → R—by stacking µ and each column of L into a single vector—is D′-strongly convex (D′-Lipschitz smooth), where

  D′ = blockd(D, n^{-1}D, . . . , n^{-1}D) ∈ R^{(d+1)d×(d+1)d},

and blockd creates a block-diagonal matrix out of its arguments.

Proof. We provide a proof of the result for strong convexity; the result for Lipschitz smoothness follows the exact same proof technique. Note that if D′ does not depend on x, Fn(x) is D′-strongly convex if and only if Fn(x) − (1/2) x^T D′ x is convex. We use this equivalent characterization of strong convexity in this proof.

Note that for Z ∼ N(0, I),

  E[ (1/2) (µ + n^{-1/2} L Z)^T D (µ + n^{-1/2} L Z) ] = (1/2) µ^T D µ + (1/2) tr L^T (n^{-1} D) L.

Define λ ∈ [0, 1], vectors µ, µ′ ∈ R^d, positive-diagonal lower triangular matrices L, L′ ∈ R^{d×d}, and vectors x, x′ ∈ R^{(d+1)d} by stacking µ and the columns of L, and likewise µ′ and the columns of L′. Define x(λ) = λx + (1 − λ)x′, µ(λ) = λµ + (1 − λ)µ′, and L(λ) = λL + (1 − λ)L′. Then

  Fn(x(λ)) − (1/2) x(λ)^T blockd(D, n^{-1}D, . . . , n^{-1}D) x(λ)
  = Fn(µ(λ), L(λ)) − ( (1/2) µ(λ)^T D µ(λ) + (1/2) tr L(λ)^T (n^{-1}D) L(λ) )
  = E[ −n^{-1} log πn(µ(λ) + n^{-1/2} L(λ) Z) − (1/2) (µ(λ) + n^{-1/2} L(λ) Z)^T D (µ(λ) + n^{-1/2} L(λ) Z) ].

By the D-strong convexity of −n^{-1} log πn, the integrand is a convex function of (µ(λ), L(λ)) for each fixed Z, so the above is

  ≤ λ( Fn(µ, L) − (1/2) µ^T D µ − (1/2) tr L^T (n^{-1}D) L ) + (1 − λ)( Fn(µ′, L′) − (1/2) µ′^T D µ′ − (1/2) tr L′^T (n^{-1}D) L′ )
  = λ( Fn(x) − (1/2) x^T blockd(D, n^{-1}D, . . . , n^{-1}D) x ) + (1 − λ)( Fn(x′) − (1/2) x′^T blockd(D, n^{-1}D, . . . , n^{-1}D) x′ ). ∎

For example, if the posterior distribution Πn is a multivariate Gaussian distribution N(µn, n^{-1}Σn) with mean µn and covariance n^{-1}Σn, then the expectation component of the Gaussian variational inference objective function becomes

  Fn(µ, L) = (1/2) (µ − µn)^T Σn^{-1} (µ − µn) + (2n)^{-1} tr Σn^{-1} L L^T + const,

which is a jointly convex quadratic function in µ, L with Hessian matrix (for µ and columns of L stacked together in a single vector) equal to

  blockd(Σn^{-1}, n^{-1}Σn^{-1}, . . . , n^{-1}Σn^{-1}) ∈ R^{(d+1)d×(d+1)d}.

Combined with the convexity of the log determinant term −n^{-1} log det L (Boyd and Vandenberghe, 2004, p. 73), Gaussian variational inference with a strongly convex and Lipschitz smooth log posterior potential −n^{-1} log πn is itself strongly convex and Lipschitz smooth in any compact set contained in the optimization domain.

However, in a typical statistical model, the posterior is neither Gaussian nor strongly log-concave. But when the Bernstein-von Mises theorem holds (van der Vaart, 2000), the posterior distribution (scaled and shifted appropriately) converges asymptotically to a Gaussian distribution. Thus, it may be tempting to think that the Bernstein-von Mises theorem implies that Gaussian variational inference should eventually become a convex optimization problem. This is unfortunately not true, essentially because Bernstein-von Mises only implies convergence to a Gaussian in total variation distance, but not necessarily in the log density function or its gradients.
The second main result in this section—Proposition 3.3.3—is a simple demonstration of the fact that the Bernstein-von Mises theorem is not sufficient to guarantee the convexity of Gaussian variational inference.

Proposition 3.3.3. Suppose d = 1, fn is differentiable to the third order for all n, that there exists an open interval U ⊆ R and ε > 0 such that

  sup_{θ∈U} d²fn/dθ² ≤ −ε,

and that there exists η > 0 such that

  sup_{θ∈R} |d³fn/dθ³| ≤ η.

Then there exists a δ > 0 such that

  sup_{σ<δ, µ∈U} d²/dµ² DKL( N(µ, σ²) ‖ Πn ) < 0.

Proof. Note that by reparameterization,

  arg min_µ DKL( N(µ, σ²) ‖ Πn ) = arg min_µ E[ −n^{-1} log πn(µ + σZ) ] = arg min_µ E[ fn(µ + σZ) ],

where Z ∼ N(0, 1). Using a Taylor expansion,

  d²/dµ² E[ fn(µ + σZ) ] = E[ f″n(µ + σZ) ] = E[ f″n(µ) + f‴n(µ′) · σZ ],

for some µ′ between µ and µ + σZ. By the uniform bound on the third derivative and the local bound on the second derivative, for any µ ∈ U,

  E[ f″n(µ) + f‴n(µ′) · σZ ] ≤ −ε + ησ E|Z| ≤ −ε + ησ.

Since DKL(N(µ, σ²) ‖ Πn) equals n E[fn(µ + σZ)] plus terms that do not depend on µ, its second µ-derivative shares this negative sign, and the result follows for any 0 < δ < ε/η. ∎

Although Proposition 3.3.3 is a negative result about the global convexity of Fn, it does hint at a very useful fact: the local convexity (Definition 3.3.4) of Fn matches that of fn, assuming that we control the global behaviour of fn, e.g., through a uniform bound on the kth derivative. This is essentially due to the fact that the two functions differ only by a smoothing under a standard multivariate Gaussian variable, which has finite moments of all orders. The third main result of this section, Theorem 3.3.5—which we exploit later in Chapter 4 to develop a reliable variational inference algorithm—makes the general link between the local convexity behaviour of fn and Fn precise under global Lipschitz smoothness, i.e., a uniform bound on the Hessian.

Definition 3.3.4 (Local Strong Convexity). In the same setting as Definition 3.3.1, if there exists a convex subset Y ⊂ X such that the restriction of g to Y is D-strongly convex, then g is locally D-strongly convex.

Theorem 3.3.5.
Suppose there exist ε, ℓ, r > 0 and x ∈ Rd such that fn is globally ℓI-Lipschitz smooth and locally εI-strongly convex in the set {y : ‖y − x‖ ≤ r}. Define

Dn := blockdiag(I, n−1I, …, n−1I) ∈ R(d+1)d×(d+1)d
τn(µ, L) := 1 − χ²_{d+2}( n(r² − 2‖µ − x‖²) / (2‖L‖²_F) ),    (3.3)

where χ²_k is the CDF of a chi-square random variable with k degrees of freedom. Then Fn, reinterpreted as a function from R(d+1)d to R—by stacking µ and each column of L into a single vector—is ℓDn-Lipschitz smooth; and is (ε − τn(µ, L)·(ε + ℓ))Dn-strongly convex when ‖µ − x‖² ≤ r²/2.

Proof. Note that we can split L into columns and express LZ as

LZ = Σ_{i=1}^d Li Zi,

where Li ∈ Rd is the ith column of L, and (Zi)_{i=1}^d i.i.d.∼ N(0, 1). Denoting ∇²fn := ∇²fn(µ + n−1/2LZ) for brevity, the second derivatives in both µ and L are

∇²µµFn = E[∇²fn]
∇²LiLjFn = n−1E[ZiZj∇²fn]
∇²µLiFn = n−1/2E[Zi∇²fn],

where we can pass the gradient and Hessian through the expectation by dominated convergence because Z has a normal distribution and fn has ℓ-Lipschitz gradients. Stacking these together in block matrices yields the overall Hessian,

A = [I  n−1/2Z₁I  …  n−1/2Z_dI] ∈ R^{d×d(d+1)}
∇²Fn = E[Aᵀ∇²fn A] ∈ R^{d(d+1)×d(d+1)}.

Since fn has ℓ-Lipschitz gradients, for all x ∈ Rd, −ℓI ⪯ ∇²fn(x) ⪯ ℓI. Applying the upper bound and evaluating the expectation yields the Hessian upper bound (and the same technique yields the corresponding lower bound):

∇²Fn = E[Aᵀ∇²fn A] ⪯ ℓE[AᵀA] = ℓ · blockdiag(I, n−1I, …, n−1I) = ℓDn.

To demonstrate local strong convexity, we split the expectation into two parts: one where n−1/2LZ is small enough to guarantee that ‖µ + n−1/2LZ − x‖² ≤ r², and the complement.
Define

r²n(µ, L) := n(r² − 2‖µ − x‖²) / (2‖L‖²_F).

Note that when ‖Z‖² ≤ r²n(µ, L),

‖µ + n−1/2LZ − x‖² ≤ 2‖µ − x‖² + 2n−1‖LZ‖² ≤ 2‖µ − x‖² + 2n−1‖L‖²_F‖Z‖² ≤ r².

Then we may write

∇²Fn = E[1[‖Z‖² ≤ r²n(µ,L)]Aᵀ∇²fnA] + E[1[‖Z‖² > r²n(µ,L)]Aᵀ∇²fnA].

Since fn has ℓ-Lipschitz gradients and is locally εI-strongly convex,

∇²Fn ⪰ ε·E[1[‖Z‖² ≤ r²n(µ,L)]AᵀA] − ℓ·E[1[‖Z‖² > r²n(µ,L)]AᵀA].

Note that AᵀA has entries 1 and n−1Z²i along the diagonal, as well as n−1ZiZj, i ≠ j, and n−1/2Zi on the off-diagonals. By symmetry, since Z is an isotropic Gaussian, censoring by 1[‖Z‖² ≤ ·] or 1[‖Z‖² > ·] maintains that the off-diagonal expectations are 0. Therefore E[1[‖Z‖² ≤ r²n(µ,L)]AᵀA] is diagonal with coefficients 1 − αn(µ,L) and n−1βn(µ,L), and E[1[‖Z‖² > r²n(µ,L)]AᵀA] is diagonal with coefficients αn(µ,L) and n−1τn(µ,L), where

αn(µ,L) = P(‖Z‖² > r²n(µ,L))
βn(µ,L) = E[Z₁²1[‖Z‖² ≤ r²n(µ,L)]] = d−1E[‖Z‖²1[‖Z‖² ≤ r²n(µ,L)]]
τn(µ,L) = E[Z₁²1[‖Z‖² > r²n(µ,L)]] = d−1E[‖Z‖²1[‖Z‖² > r²n(µ,L)]].

Note that ‖Z‖² ∼ χ²_d; so αn(µ,L) = 1 − χ²_d(r²n(µ,L)), and

τn(µ,L) = ∫_{r²n(µ,L)}^∞ (1/(2^{(d+2)/2}Γ((d+2)/2))) x^{(d+2)/2−1} e^{−x/2} dx = 1 − χ²_{d+2}(r²n(µ,L)),
βn(µ,L) = 1 − τn(µ,L).

Therefore,

∇²Fn ⪰ diag(((1 − αn(µ,L))ε − ℓαn(µ,L))I, (n−1(1 − τn(µ,L))ε − ℓn−1τn(µ,L))I, …, (n−1(1 − τn(µ,L))ε − ℓn−1τn(µ,L))I)
= εDn − (ε + ℓ) diag(αn(µ,L)I, n−1τn(µ,L)I, …, n−1τn(µ,L)I)
⪰ εDn − (ε + ℓ) diag(τn(µ,L)I, n−1τn(µ,L)I, …, n−1τn(µ,L)I)
= (ε − τn(µ,L)·(ε + ℓ))Dn,

where the second inequality uses αn(µ,L) ≤ τn(µ,L).

The most complicated part of the statement of Theorem 3.3.5 is the function τn(µ, L), which characterizes how much the tails of fn can influence the local strong convexity of Fn around the point x. In particular, as long as µ is close to x and ‖L‖_F (which modulates the effect of noise) is sufficiently small, the argument of the χ²_{d+2} CDF is large, so τn is small, so ε − τn(µ,L)·(ε + ℓ) ≈ ε; thus we recover local strong convexity of the same magnitude as fn.
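The tail term τn(µ, L) is straightforward to evaluate from chi-square CDFs. The sketch below (helper name ours, assuming SciPy is available) illustrates how τn vanishes as n grows when µ is near x and ‖L‖_F is moderate:

```python
import numpy as np
from scipy.stats import chi2

def tau_n(mu, L, x, r, n):
    """Tail-mass term tau_n(mu, L) from Theorem 3.3.5 (sketch):
    tau_n = 1 - chi2_{d+2} CDF at n (r^2 - 2||mu - x||^2) / (2 ||L||_F^2).
    """
    d = len(mu)
    arg = n * (r**2 - 2.0 * np.sum((mu - x)**2)) / (2.0 * np.sum(L**2))
    # chi2.cdf returns 0 for negative arguments, giving tau_n = 1 there.
    return 1.0 - chi2.cdf(arg, df=d + 2)
```

For fixed µ near x, τn decays to 0 as n → ∞, so the strong convexity constant ε − τn(ε + ℓ) approaches ε, as described above.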
A further note is that although Theorem 3.3.5 requires a uniform bound on the Hessian of fn, we conjecture that a similar result would hold under the assumption of a uniform bound on the kth derivative. For simplicity of the result and ease of use later on in Chapter 4, we opted for the second derivative bound.

The final step of this section is to examine when the assumptions of Theorem 3.3.5 hold. This is where statistical asymptotics provide their benefit in variational optimization: although the function fn need not be locally strongly convex for any particular n, the probability (under the data-generating distribution) that it becomes locally strongly convex around θ0 converges to 1 under weak conditions on the statistical model as n → ∞. Lemma 3.3.6 states the result precisely, and Example 3.3.7 provides an example of a sequence of functions that individually have no guarantees regarding their convexity, but which are asymptotically locally strongly convex.

Lemma 3.3.6. Under Assumption 1, there exist r, δ > 0 such that

lim_{n→∞} P(fn is δI-strongly convex in the set Br(θ0)) = 1,

where Br(θ0) := {θ ∈ Rd : ‖θ − θ0‖ ≤ r}.

Proof. Given Assumption 1, we know fn is twice continuously differentiable. Thus, using the second-order characterization of strong convexity, it is equivalent to show the existence of r, δ > 0 such that

P(∀θ ∈ Br(θ0), ∇²fn(θ) ⪰ δI) → 1

as n → ∞. Note that by Weyl's inequality,

∇²fn(θ) = ∇²fn(θ) − Hθ + Hθ ⪰ λmin(∇²fn(θ) − Hθ)I + λmin(Hθ)I.    (3.4)

Condition 4 of Assumption 1 guarantees that Hθ0 ⪰ εI and that there exists a κ > 0 such that Hθ is continuous in Bκ(θ0). Hence there exists 0 < κ′ ≤ κ such that ∀θ ∈ Bκ′(θ0), Hθ ⪰ (ε/2)I.

We then consider λmin(∇²fn(θ) − Hθ). We aim to find a 0 < r ≤ κ′ such that |λmin(∇²fn(θ) − Hθ)| is sufficiently small.
Note that for any fixed r > 0,

sup_{θ∈Br(θ0)} |λmin(∇²fn(θ) − Hθ)|
≤ sup_{θ∈Br(θ0)} ‖∇²fn(θ) − Hθ‖₂
= sup_{θ∈Br(θ0)} ‖∇²fn(θ) − Eθ0[−∇² log pθ(X)] + Eθ0[−∇² log pθ(X)] − Hθ‖₂
≤ sup_{θ∈Br(θ0)} (‖∇²fn(θ) − Eθ0[−∇² log pθ(X)]‖₂ + ‖Eθ0[−∇² log pθ(X)] − Hθ‖₂).

Now we split fn into prior and likelihood, yielding that this is bounded by

sup_{θ∈Br(θ0)} ‖−n−1Σ_{i=1}^n ∇² log pθ(Xi) − Eθ0[−∇² log pθ(X)]‖₂
+ sup_{θ∈Br(θ0)} ‖−n−1∇² log π0(θ)‖₂ + sup_{θ∈Br(θ0)} ‖Eθ0[−∇² log pθ(X)] − Hθ‖₂.    (3.5)

Given Condition 2 of Assumption 1, π0(θ) is positive for all θ and ∇²π0(θ) is continuous; further, due to the compactness of Br(θ0), we have that

∀r > 0, sup_{θ∈Br(θ0)} ‖−n−1∇² log π0(θ)‖₂ → 0 as n → ∞.    (3.6)

Then it remains to bound the first and last terms of Eq. (3.5). For the first term, we aim to use the uniform weak law of large numbers to show its convergence to 0. By Condition 5 of Assumption 1, there exist 0 < r₁ ≤ κ′ and a measurable function g such that for all θ ∈ Br₁(θ0) and all x,

max_{i,j∈[d]} |(∇² log pθ(x))_{i,j}| < g(x),  Eθ0[g(X)] < ∞.

Then, by the compactness of Br₁(θ0), we can apply the uniform weak law of large numbers (Jennrich, 1969, Theorem 2), yielding that for all i, j ∈ [d],

sup_{θ∈Br₁(θ0)} |(−n−1Σ_{i=1}^n ∇² log pθ(Xi))_{i,j} − (Eθ0[−∇² log pθ(X)])_{i,j}| → 0 in Pθ0.

Since entrywise convergence of matrices implies convergence in spectral norm,

sup_{θ∈Br₁(θ0)} ‖−n−1Σ_{i=1}^n ∇² log pθ(Xi) − Eθ0[−∇² log pθ(X)]‖₂ → 0 in Pθ0.    (3.7)

For the last term of Eq. (3.5), by Condition 4 of Assumption 1,

lim_{r→0} sup_{θ∈Br(θ0)} ‖Eθ0[−∇² log pθ(X)] − Hθ‖₂ = lim_{r→0} sup_{θ∈Br(θ0)} ‖Eθ0[−∇² log pθ(X)] − Eθ[−∇² log pθ(X)]‖₂ → 0.

Thus, there exists a sufficiently small r₂ > 0 such that

sup_{θ∈Br₂(θ0)} ‖Eθ0[−∇² log pθ(X)] − Hθ‖₂ ≤ ε/8.    (3.8)

Then we combine Eqs. (3.6) to (3.8) and pick r = min(r₁, r₂) ≤ κ′, yielding that

P(sup_{θ∈Br(θ0)} |λmin(∇²fn(θ) − Hθ)| ≤ ε/4) → 1    (3.9)

as n → ∞. Then the proof is complete: note that we have already shown that for all θ ∈ Bκ′(θ0), Hθ ⪰ (ε/2)I. By Eqs.
(3.9) and (3.4), we conclude that for all δ ≤ ε/4,

lim_{n→∞} P(∀θ ∈ Br(θ0), ∇²fn(θ) ⪰ δI) = 1.

Example 3.3.7. Let fn(y) = y² + (n−1Σ_{i=1}^n Xi) cos 5y, where Xi i.i.d.∼ N(0, 1). Then

|d²fn/dy² − 2| = 25|cos(5y)| · |n−1Σ_{i=1}^n Xi|.

Therefore, by the law of large numbers and the fact that |cos(5y)| ≤ 1, for any ε > 0 the sequence (fn)n∈N is asymptotically (2−ε)-strongly convex and (2+ε)-Lipschitz smooth. Fig. 3.1 visualizes the asymptotic behaviour of fn as n increases.

Figure 3.1: Plots of the function fn(y) from Example 3.3.7. Each row of figures represents a single realization of the sequence (fn)n∈N for increasing sample sizes 5, 20, 100, and 1000. Each column includes three repetitions of fn under a single n. As n increases, the function fn(y) is more likely to be strongly convex and Lipschitz smooth with constants approaching 2.

The last result of this section—Corollary 3.3.8—combines Theorems 3.3.5 and 3.2.1 and Lemma 3.3.6 to provide the key asymptotic convexity/smoothness result that we use in the development of the optimization algorithm in Chapter 4.

Corollary 3.3.8. Suppose Assumptions 1 and 3 hold, and define Dn as in Theorem 3.3.5. Then there exist ε, ℓ, r > 0 such that Fn, reinterpreted as a function from R(d+1)d to R—by stacking µ and each column of L into a single vector—satisfies

P(Fn is (ε/2)Dn-strongly convex in Br,n and globally ℓDn-Lipschitz smooth) → 1

as n → ∞, where

Br,n = {µ ∈ Rd, L ∈ Rd×d : ‖µ − µ⋆n‖² ≤ r²/4 and ‖L − L⋆n‖²_F ≤ 4‖I − L⋆n‖²_F}.

Proof. We begin by verifying the conditions of Theorem 3.3.5 for fn. By Assumption 1 we know that fn is twice differentiable.
We also know by Lemma 3.3.6 that, under Assumptions 1 and 3, there exist ℓ, r′, ε > 0 such that

P(sup_θ ‖−n−1∇² log πn(θ)‖₂ > ℓ) → 0
P(inf_{‖θ−θ0‖<r′} λmin(−n−1∇² log πn(θ)) < ε) → 0.

By Theorem 3.2.1 we know that µ⋆n → θ0 in Pθ0, so there exists an r′ > r > 0 such that

P(inf_{‖θ−µ⋆n‖<r} λmin(−n−1∇² log πn(θ)) < ε) → 0.

Therefore, by Theorem 3.3.5, the probability that

∀µ, L:  −ℓDn ⪯ n−1∇²E[−log πn(µ + n−1/2LZ)] ⪯ ℓDn,    (3.10)

and

for all L and for ‖µ − µ⋆n‖² < r²/2:  n−1∇²E[−log πn(µ + n−1/2LZ)] ⪰ (ε − τn(µ,L)·(ε + ℓ))Dn,    (3.11)

hold converges to 1 as n → ∞, where Dn and τn(µ,L) are as defined in Eq. (3.3) with x = µ⋆n. Note that the gradient and Hessian in the above expressions are taken with respect to a vector in Rd(d+1) that stacks µ and each column of L into a single vector.

Then for all (µ, L) ∈ Br,n, we have

‖µ − µ⋆n‖² ≤ r²/4,  ‖L − L⋆n‖²_F ≤ 4‖I − L⋆n‖²_F  ⟹  ‖L‖_F ≤ 2‖I − L⋆n‖_F + ‖L⋆n‖_F,

yielding

n(r² − 2‖µ − µ⋆n‖²) / (2‖L‖²_F) ≥ nr² / (4(2‖I − L⋆n‖_F + ‖L⋆n‖_F)²).

Hence ∀(µ, L) ∈ Br,n, τn(µ,L) → 0 as n → ∞, yielding that for sufficiently large n,

ε − τn(µ,L)·(ε + ℓ) > ε/2.

Therefore, the probability that for all (µ, L) ∈ Br,n,

n−1∇²E[−log πn(µ + n−1/2LZ)] ⪰ (ε/2)Dn    (3.12)

converges to 1 as n → ∞. Combining Eqs. (3.10) to (3.12), the proof is complete.

Chapter 4
Consistent Stochastic Variational Inference (CSVI)

In this chapter, we use the strong convexity and smoothness results from Chapter 3 to develop an optimization algorithm—consistent stochastic variational inference (CSVI)—that asymptotically solves the Gaussian variational inference problem in Eq. (2.1), in the sense of Definition 4.0.1: the probability that the iterates converge to the global optimum converges to 1 in the asymptotic limit of observed data.

Definition 4.0.1.
An iterative algorithm asymptotically solves a (random) sequence of optimization problems indexed by n ∈ N, each with a single global optimum point x⋆n ∈ Rd, if the sequence of iterates (xk,n)k∈N produced by the algorithm satisfies

lim_{n→∞} P(lim_{k→∞} ‖xk,n − x⋆n‖ = 0) = 1.

As mentioned earlier, the CSVI algorithm is based on projected stochastic gradient descent (SGD) (Bubeck, 2015, Section 3). Since we know that the expectation component Fn of the Gaussian variational inference objective function is asymptotically locally strongly convex and globally Lipschitz smooth, there are three remaining issues to address to ensure that the algorithm satisfies Definition 4.0.1. First, we need to ensure that we can initialize the optimization algorithm within the locally strongly convex region, i.e., the set Br,n from Corollary 3.3.8. We address this challenge by solving a smoothed version of the maximum a posteriori (MAP) problem, which is formulated and shown to be asymptotically tractable in Section 4.1. Second, since the gradient estimates are noisy, we need to ensure that the iterates of CSVI stay in Br,n for all iterations so that the usual convergence guarantees of SGD apply. Finally, note that the regularization term −n−1 log det L in the objective of Eq. (2.1) is not itself Lipschitz smooth, making the overall optimization problem not Lipschitz smooth. We address both the second and third issues in Section 4.2 by applying a novel scaling matrix to the gradient steps, and by developing new theoretical results on the local confinement of projected stochastic gradient descent.

4.1 Initialization via smoothed MAP

The goal of this section is to design an algorithm that initializes the variational optimization within the asymptotically strongly convex local region Br,n.
Note that the challenging part of this problem is to set µ, as we can simply initialize L = I. Since we aim to initialize µ sufficiently close to µ⋆n—and µ⋆n converges to θ0 per Theorem 3.2.1—we might like to find something akin to the maximum a posteriori (MAP) value of log πn, due to its similar convergence to θ0. However, since log πn is typically not concave, obtaining the MAP point is generally intractable.

In this section, we formulate a tractable MAP-like problem by convolving the posterior distribution with Gaussian noise prior to finding the maximum point of the log density. This Gaussian noise essentially results in a smoothed log density with fewer spurious optima; hence, we call this the smoothed MAP problem. In Section 4.1.1, we show that the smoothed MAP problem is asymptotically convex—and hence tractable—under conditions similar to those that guarantee the Bernstein–von Mises theorem, and that its solution is asymptotically consistent for θ0, albeit at a slower-than-√n rate. In Section 4.1.2, we provide a stochastic optimization algorithm that depends only on black-box access to the gradients of the original log density function. Taken together, these results demonstrate that we can tractably initialize CSVI in the asymptotically strongly convex local region Br,n as required.

4.1.1 Smoothed MAP problem

Given the nth posterior distribution Πn, we define the smoothed posterior Π̂n with smoothing variance αn to be the θ-marginal of the generative process

W ∼ Πn,  θ ∼ N(W, αnI).

Alternatively, the smoothed posterior distribution Π̂n can be viewed as the distribution of the sum of independent realizations from Πn and N(0, αnI). The probability density function π̂n of Π̂n is given by the convolution of πn with a multivariate normal density,

π̂n(θ) = ∫ (2π)^{−d/2} αn^{−d/2} exp(−(1/(2αn))‖θ − w‖²) πn(w) dw
= E[(2π)^{−d/2} αn^{−d/2} exp(−(1/(2αn))‖θ − W‖²)].
(4.1)

Given these definitions, the smoothed MAP problem is the MAP inference problem for the smoothed posterior distribution, i.e.,

θ̂n = argmin_{θ∈Rd} −log E[exp(−(1/(2αn))‖θ − W‖²)].    (4.2)

Gaussian smoothing is commonly used in image and signal processing (Forsyth and Ponce, 2002; Haddad and Akansu, 1991; Lindeberg, 1990; Nixon and Aguado, 2012), and has previously been applied to reduce the presence of spurious local optima in nonconvex optimization problems, making them easier to solve with local gradient-based methods (Addis et al., 2005; Mobahi, 2013). The variance αn controls the degree of smoothing; larger values create a smoother density π̂n, at the cost of making π̂n a poorer approximation of the original function πn. Fig. 4.1 demonstrates how increasing αn increases the smoothing effect, resulting in fewer and flatter local optima in the objective.

Figure 4.1: Plots of the smoothed posterior density π̂n with increasing smoothing variance.

In general, although intuitively reasonable, Gaussian smoothing does not typically come with strong practical theoretical guarantees, essentially because the best choice of the smoothing variance αn is not known. Mobahi (2013) shows that for a continuous integrable function with quickly decaying tails (at rate ‖x‖^{−d−3} as ‖x‖ → ∞), the smoothed function is strictly convex for a large enough choice of αn. Addis et al. (2005) study the smoothing effect of a log-concave kernel on a special type of piecewise constant function, and prove that the smoothed function is either monotonic or unimodal. To the best of our knowledge, previous analyses of smoothed optimization are not sufficient to guide the choice of αn and do not provide bounds on the error of the smoothed optimum point versus the original.

In contrast to these previous studies, we use the asymptotic concentration of the statistical model as n → ∞ to address both of these issues.
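The smoothing effect itself is easy to reproduce numerically. The sketch below (an illustrative toy density of our own, not the thesis experiments) convolves a bimodal unnormalized density with Gaussian noise via a simple deterministic quadrature and counts local maxima before and after smoothing:

```python
import numpy as np

def density(theta):
    # Unnormalized two-component Gaussian mixture: a wide bump at 0
    # and a narrower, smaller bump at 6.
    return np.exp(-0.5 * theta**2) + 0.5 * np.exp(-0.5 * (theta - 6.0)**2 / 0.25)

def smoothed_density(theta, alpha, n_quad=201):
    # pi_hat(theta) = E[pi(theta - sqrt(alpha) W)], W ~ N(0, 1),
    # approximated by quadrature over a truncated standard normal.
    w = np.linspace(-5.0, 5.0, n_quad)
    weights = np.exp(-0.5 * w**2)
    weights /= weights.sum()
    return np.array([np.sum(weights * density(t - np.sqrt(alpha) * w))
                     for t in np.atleast_1d(theta)])

def count_local_maxima(vals):
    # Count strict interior local maxima on an evaluation grid.
    return int(np.sum((vals[1:-1] > vals[:-2]) & (vals[1:-1] > vals[2:])))

grid = np.linspace(-5.0, 10.0, 601)
raw = density(grid)
smooth = smoothed_density(grid, alpha=9.0)
```

With a smoothing variance that is large relative to the separation of the modes, the two raw modes merge into a single one, mirroring the behaviour shown in Fig. 4.1.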
In particular, Theorem 4.1.1 shows that if the sequence αn is chosen to decrease more slowly than n−1/3, the smoothed MAP problem is eventually strictly convex on any compact domain; and that the solution of the smoothed MAP problem is asymptotically consistent for θ0 at a √αn rate.

Theorem 4.1.1. Suppose Assumption 1 holds and nα³n → ∞. Then for all M > 0, the probability that the smoothed MAP optimization problem

min_{‖θ−θ0‖≤M} −log π̂n(θ)

is strictly convex converges to 1 as n → ∞ under the data-generating distribution. If additionally Assumption 2 holds, then

‖θ̂n − θ0‖ = OPθ0(√αn).

4.1.2 Smoothed MAP optimization

In practice, we use SGD to solve the smoothed MAP problem. The gradient of the smoothed MAP objective function in Eq. (4.2) is

∇(−log π̂n(θ)) = −∇ log E[exp(−(1/(2αn))‖θ − W‖²)]
= −E[exp(−(1/(2αn))‖θ − W‖²)(−1/αn)(θ − W)] / E[exp(−(1/(2αn))‖θ − W‖²)],

where W ∼ Πn. By a change of variables and reparametrization, the gradient can be reformulated as

∇(−log π̂n(θ)) = αn^{−1/2} E[W πn(θ − αn^{1/2}W)] / E[πn(θ − αn^{1/2}W)],

where W ∼ N(0, I). Note that the unknown normalization constant in πn cancels between the numerator and denominator. We obtain stochastic estimates of the gradient using a Monte Carlo approximation of the numerator and denominator with the same samples, i.e., self-normalized importance sampling (Robert and Casella, 2013, p. 95). It is known that the variance of this gradient estimate may be quite large or even infinite; although techniques such as truncation (Ionides, 2008) and smoothing (Vehtari et al., 2015) exist to address this, we leave the issue as an open problem for future work. The resulting SGD procedure with explicit gradient estimates is shown in Algorithm 1.

4.2 Optimization via scaled projected SGD

As shown in Chapter 3 and Section 4.1, Gaussian variational inference Eq. (2.1) is locally strongly convex and globally Lipschitz smooth, and the initialization µ = θ̂n, L = I is asymptotically tractable and lies in the locally convex region.
We now design a stochastic optimization algorithm and prove that it asymptotically solves Eq. (2.1) per Definition 4.0.1.

Algorithm 1 Smoothed MAP estimation
procedure SMOOTHEDMAP(πn, αn, x0, K, S, (γk)k∈N)
  for k = 0, 1, …, K − 1 do
    Sample (Zs)_{s=1}^S i.i.d.∼ N(0, I)
    ∇̂k ← αn^{−1/2} (Σ_{s=1}^S Zs · πn(xk − αn^{1/2}Zs)) / (Σ_{s=1}^S πn(xk − αn^{1/2}Zs))
    xk+1 ← xk − γk∇̂k
  end for
  return xK
end procedure

Given a sequence of step sizes (γk)k∈N with γk ≥ 0, noise draws (Zk)k∈N i.i.d.∼ N(0, I), and the initialization µ0 = θ̂n, L0 = I, the standard stochastic gradient update applied to the Gaussian variational inference problem is

µk+1 ← µk − γk∇̂µ,n(µk, Lk, Zk)
Lk+1 ← Lk − γk∇̂L,n(µk, Lk, Zk).

There are two major issues with this update: first, the regularization term −n−1 log det L is not globally Lipschitz smooth and has a large gradient when L is small, which is likely to produce an iterate outside the locally convex area; and second, the update might produce an infeasible iterate L with a nonpositive diagonal.

We resolve the first issue by applying a scaling to the L gradient prior to the update. In particular, define the scaled L gradient matrix ∇̃L,n(µ, L, Z) ∈ Rd×d via

[∇̃L,n(µ,L,Z)]ij =
  [∇̂L,n(µ,L,Z)]ij,                          j ≠ i
  (1/(1 + (nLii)−1)) [∇̂L,n(µ,L,Z)]ii,        j = i, Lii > 0
  −1,                                         j = i, Lii = 0.    (4.3)

This scaling prevents the gradient of L from diverging when diagonal elements of L approach 0, and also creates a well-defined gradient for L at the boundary of the feasible region. Given this scaled L gradient, we resolve the second issue by employing a simple projection step after each update: we set any negative diagonal entry of the current iterate Lk to 0. SGD with these two simple modifications is presented in Algorithm 2. The final theoretical result of this work, Theorem 4.2.1, is that

Algorithm 2 Consistent stochastic Gaussian variational inference
procedure CSVI(fn, g, (γk)k∈N, K)
  µ0 ← SmoothedMAP (Algorithm 1)
  L0 ← I
  for k = 0, 1, …, K − 1 do
    Sample Zk ∼ N(0, I)
    µk+1 ← µk − γk∇̂µ,n(µk, Lk, Zk)
    ∇̃L,n ← ∇̂L,n(µk, Lk, Zk)
    for i = 1, . . .
, d do
      if Lk,ii > 0 then
        [∇̃L,n]ii ← (1/(1 + (nLk,ii)−1)) [∇̃L,n]ii
      else
        [∇̃L,n]ii ← −1
      end if
    end for
    Lk+1 ← Lk − γk∇̃L,n
    for i = 1, …, d do
      Lk+1,ii ← max{0, Lk+1,ii}
    end for
  end for
  return µK, LK
end procedure

Algorithm 2 asymptotically solves Gaussian variational inference given a weak condition on the step size sequence (γk)k∈N.

Theorem 4.2.1. Suppose that we initialize L0 = I and µ0 such that ‖µ0 − θ̂n‖₂² ≤ r²/32. Then there exists a constant C > 0 such that if

γk = Θ(k−ρ) for some ρ ∈ (0.5, 1) and ∀k ∈ N, 0 < γk < C,    (4.4)

then Algorithm 2 asymptotically solves Gaussian variational inference.

Chapter 5
Experiments

In this chapter, we compare CSVI to standard Gaussian stochastic variational inference (SVI)—i.e., reparametrized Gaussian variational inference via projected stochastic gradient descent—on one-dimensional synthetic inference problems. We run all optimization algorithms for 50,000 iterations, and base the gradients for the smoothed MAP mean initialization (Algorithm 1) on 100 samples.

5.1 Synthetic Gaussian mixture

In the first experiment, we highlight the reliability of CSVI as opposed to SVI on the simple problem of approximating a Gaussian mixture target distribution Π,

Π = 0.7N(0, 4) + 0.15N(−30, 9) + 0.15N(30, 9).    (5.1)

We set n = 1 and αn = 10 in the implementation of CSVI, and initialize the smoothed MAP optimization and the mean of SVI uniformly in the range (−50, 50). For both methods, we initialize the log standard deviation log σ uniformly in the range (log 0.5, log 10). We set γk ≈ 0.1/(1 + k^0.85) for CSVI and γk = 0.5/(1 + k) for SVI.

Fig. 5.1 shows the result of 10 trials each of CSVI and SVI. The majority of the mass of the Gaussian mixture target distribution (grey) concentrates on the central mode with mean 0 and standard deviation 2; the optimal variational approximation has these same parameters. However, as shown in the plot, SVI

Figure 5.1: The result of running 10 trials of CSVI (blue) and SVI (pink) with the Gaussian mixture target (grey) given in Eq. (5.1).
The output of CSVI reliably finds the global optimum solution corresponding to the central mixture peak; SVI often provides solutions corresponding to spurious local optima.

with randomized µ0 often finds a spurious local optimum solution, corresponding to the two peaks centered at ±30. On the other hand, the smoothed MAP mean initialization helps guarantee that the CSVI optimization starts close enough to the central mode that it recovers the global optimum solution in each trial. The gradient scaling also aids the stability of the algorithm; whereas SVI has unstable behaviour when σ is initialized to a small value due to the log-determinant regularization term, the scaling of CSVI in Eq. (4.3) ensures that the algorithm is stable in this region.

Figure 5.2: The Bayesian posterior density for increasing dataset sizes. Note the large number of spurious local optima, resulting in the unreliability of local optimization methods in variational inference.

Figure 5.3: The smoothed Bayesian posterior density for the same dataset sizes as in Fig. 5.2. Black curves correspond to the smoothed posterior, red dots show local optima of the density, and the blue histogram shows the counts (over 100 trials) of the output of the smoothed MAP initialization. Note that there are fewer local optima relative to the original posterior density, and that the smoothed MAP initialization is likely to provide a mean close to that of the optimal variational distribution.

5.2 Synthetic model with a nonconvex prior

In this section, we compare the performance of CSVI and SVI on a synthetic Bayesian model across a range of observed dataset sizes (n = 10, 100, 1000, 10000). The model is as follows:

θ ∼ (1/5)N(0, 0.15²) + (1/5)N(1, 0.1²) + (1/5)N(−4, 0.3²) + (1/5)N(4, 0.3²) + (1/5)N(−8, 0.1²)
Xi | θ i.i.d.∼ N(θ, 5000),

where the data are truly generated from (Xi)_{i=1}^n i.i.d.∼ N(3, 10). We use a smoothing
We use a smoothingconstant of αn = 10n−0.3, and set γk = n10/(1 + k0.85) for CSVI and γk =36Figure 5.4: Box-plots of the final ELBO for 100 trials of CSVI and SVI.0.5/(1 + k) for SVI. We construct this model to have a posterior distribution thathas multiple modes and approaches a normal distribution in total variation as samplesize increases. As shown in Fig. 5.2, the posterior has a number of peaks when n issmall, and gradually converges to a unimodal distribution as n increases. However,even seemingly small peaks in the density may present significant local optima inthe log density that prevent SVI from obtaining the global optimum, as illustrated inprevious Gaussian mixture example. In contrast, Fig. 5.3 shows that the smoothedposterior tends to have fewer modes, smoothing smaller peaks and merging closebypeaks; thus the smoothed posterior can be used to find a reliable initialization forSVI and reduce the likelihood that SVI becomes trapped in bad local optimum. Andbecause the smoothed MAP usually finds dominating peaks of the posterior, e.g.,the larger peak in the last figure of Fig. 5.2, this initialization is usually closer to theoptimal mean and thus often results in a better final variational approximation.Fig. 5.4 provides a quantitative comparison of CSVI and SVI on this problem,confirming this intuition and demonstrating that CSVI more reliably finds low-costvariational approximations in comparison to SVI. In particular, Fig. 5.4 comparesthe final expectation lower bound (ELBO) Blei et al. (2017) for each method,which is equivalent to the negative KL divergence between posterior and variationaldistribution up to a normalizing constant. For comparison, we estimate ELBO using1000 Monte Carlo samples.The box-plots of Fig. 5.4 shows the results of running CSVI and SVI for 100trials. Note that larger ELBO means a better approximation and a less spread box-plot represents better numerical stability. 
It is clear that when n = 10000, SVI tends to become trapped in local optima while CSVI tends to find the global optimum reliably; under all sample sizes, CSVI tends to provide a higher ELBO than SVI, meaning a more accurate approximation. Moreover, CSVI is significantly more stable than SVI when the true posterior is peaky (e.g., n = 10, 100). We also find that training SVI on a peaky distribution can produce a diverging σ and hence numerical instability. CSVI, in contrast, shows great numerical stability because of the scaled gradient for σ.

It is worth noting that the performance of CSVI seems to degrade at n = 1000. We find that the posterior at n = 1000 in this example (third panel from the left in Fig. 5.2) happens to have two big peaks in the smoothed posterior density. Thus the smoothed MAP initialization finds these two peaks with similar probability (see the blue bars of Fig. 5.3 at n = 1000), leading to similar final ELBO values when CSVI/SVI converges to these local optima. However, even in this pathological setting, CSVI provides a benefit over SVI, as the smoothing kernel removes other spurious local optima. Finally, it is worth pointing out that, as with all local optimization methods, a careful choice of the learning rate for CSVI is important to ensure its performance. This matches the statement of Theorem 4.2.1, which guarantees that the output of CSVI is asymptotically consistent for θ0 as long as the learning rate satisfies Eq. (4.4). However, if the learning rate is too large, CSVI may jump out of the local basin found by its initialization due to the noise in the gradient, and eventually become trapped in a spurious local optimum.
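The σ-gradient scaling responsible for this stability can be sketched in one dimension as follows (a paraphrase of the diagonal case of Eq. (4.3) with hypothetical names, not the thesis code):

```python
def scaled_sigma_step(sigma, grad_sigma, n, gamma):
    """One scaled-and-projected SGD update for a 1-D scale parameter (sketch).

    Mirroring Eq. (4.3): the raw gradient is damped by 1 / (1 + (n*sigma)^{-1}),
    so the -n^{-1} log(sigma) regularizer cannot blow up the step as sigma -> 0;
    at the boundary sigma = 0 the gradient is defined to be -1, pushing the
    iterate back into the interior. Finally, project onto [0, inf).
    """
    if sigma > 0.0:
        scaled = grad_sigma / (1.0 + 1.0 / (n * sigma))
    else:
        scaled = -1.0
    return max(0.0, sigma - gamma * scaled)
```

Even when the raw gradient is of order 1/(nσ), the scaled gradient stays bounded, so a single step cannot throw the iterate far outside the locally convex region.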
But even when the learning rate is not carefully tuned, CSVI performs at least as well as SVI.

Chapter 6
Conclusion

This work provides an extensive theoretical analysis of the computational aspects of Gaussian variational inference, and uses the theory to design a general procedure that addresses the nonconvexity of the problem in the data-asymptotic regime. We show that under mild conditions, Gaussian variational optimization is locally asymptotically convex. Based on this fact, we develop consistent stochastic variational inference (CSVI), a scheme that asymptotically solves Gaussian variational inference. CSVI solves a smoothed MAP problem to initialize the Gaussian mean within the locally convex region, and then runs a scaled projected stochastic gradient descent to create iterates that converge to the optimum. The asymptotic consistency of CSVI is mathematically justified, and experimental results demonstrate its advantages over traditional SVI.

There are many avenues of further exploration for the present work. For example, we limit consideration to the case of Gaussian variational families due to their popularity; but aside from the mathematical details, nothing about the overall strategy necessarily relies on this choice. It would be worth examining other popular variational families, such as mean-field exponential families (Xing et al., 2002).

Furthermore, the current work is limited to posterior distributions with full support on Rd—otherwise, the KL divergence variational objective is degenerate. It would be of interest to study whether variational inference using a Gaussian variational family truncated to the support of the posterior possesses the same beneficial asymptotic properties and asymptotically consistent optimization algorithm as developed in the present work.

Another interesting potential line of future work is to investigate other probability measure divergences as variational objectives.
For example, the chi-square divergence (Csiszár, 1967; Liese and Vajda, 1987, p. 51), Rényi α-divergence (Van Erven and Harremos, 2014), Stein discrepancy (Stein, 1972), and more (Gibbs and Su, 2002) have all been used as variational objectives. Along a similar vein, we studied the convergence properties of only a relatively simple stochastic gradient descent algorithm; other base algorithms with better convergence properties exist (Duchi et al., 2011; Kingma and Ba, 2015; Nesterov, 1983), and it may be fruitful to see whether they have similar asymptotic consistency properties.

A final future direction is to investigate the asymptotic behaviour of variational inference with respect to other measures of optimization tractability. In particular, (local) pseudoconvexity (Crouzeix and Ferland, 1982), quasiconvexity (Arrow and Enthoven, 1961), and invexity (Ben-Israel and Mond, 1986; Craven and Glover, 1985) are all weaker than (local) convexity, but provide similar guarantees for stochastic optimization. These may be necessary to consider when examining other divergences as variational objectives.

Bibliography

Addis, B., Locatelli, M., and Schoen, F. (2005). Local optima smoothing for global optimization. Optimization Methods and Software, 20(4-5):417–437.

Alquier, P. and Ridgway, J. (2020). Concentration of tempered posteriors and of their variational approximations. The Annals of Statistics, 48(3):1475–1497.

Arrow, K. and Enthoven, A. (1961). Quasi-concave programming. Econometrica: Journal of the Econometric Society, pages 779–800.

Bassett, R. and Deride, J. (2019). Maximum a posteriori estimators as a limit of Bayes estimators. Mathematical Programming, 174(1-2):129–144.

Bauschke, H. and Combettes, P. (2011). Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer.

Ben-Israel, A. and Mond, B. (1986). What is invexity? The ANZIAM Journal, 28(1):1–9.

Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer.

Blei, D., Kucukelbir, A., and McAuliffe, J. (2017).
Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.

Bottou, L. (2004). Stochastic Learning. In Bousquet, O., von Luxburg, U., and Rätsch, G., editors, Advanced Lectures on Machine Learning: ML Summer Schools 2003, pages 146–168. Springer Berlin Heidelberg.

Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Bubeck, S. (2015). Convex Optimization: Algorithms and Complexity. Now Publishers Inc.

Craven, B. and Glover, B. (1985). Invex functions and duality. Journal of the Australian Mathematical Society, 39(1):1–20.

Crouzeix, J.-P. and Ferland, J. (1982). Criteria for quasi-convexity and pseudo-convexity: relationships and comparisons. Mathematical Programming, 23(1):193–205.

Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318.

Dashti, M., Law, K., Stuart, A., and Voss, J. (2013). MAP estimators and their consistency in Bayesian nonparametric inverse problems. Inverse Problems, 29(9):095017.

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7):2121–2159.

Folland, G. (1999). Real Analysis: Modern Techniques and their Applications. John Wiley & Sons.

Forsyth, D. and Ponce, J. (2002). Computer Vision: a Modern Approach. Prentice Hall Professional Technical Reference.

Gelfand, A. and Smith, A. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409.

Ghosal, S., Ghosh, J., and van der Vaart, A. (2000). Convergence rates of posterior distributions. The Annals of Statistics, 28(2):500–531.

Gibbs, A. and Su, F. E. (2002). On choosing and bounding probability metrics. International Statistical Review, 70(3):419–435.

Grendár, M. and Judge, G. (2009). Asymptotic equivalence of empirical likelihood and Bayesian MAP. The Annals of Statistics, pages 2445–2457.

Haddad, R. and Akansu, A. (1991). A class of fast Gaussian binomial filters for speech and image processing. IEEE Transactions on Signal Processing, 39(3):723–727.

Han, W. and Yang, Y. (2019). Statistical inference in mean-field variational Bayes. arXiv:1911.01525.

Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.

Hoffman, M., Blei, D., Wang, C., and Paisley, J. (2013). Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347.

Huggins, J., Kasprzak, M., Campbell, T., and Broderick, T. (2020). Validated variational inference via practical posterior error bounds. In International Conference on Artificial Intelligence and Statistics.

Ionides, E. (2008). Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295–311.

Jaiswal, P., Rao, V., and Honnappa, H. (2019). Asymptotic consistency of α-Rényi-approximate posteriors. arXiv:1902.01902.

Jennrich, R. (1969). Asymptotic properties of non-linear least squares estimators. The Annals of Mathematical Statistics, 40(2):633–643.

Jordan, M., Ghahramani, Z., Jaakkola, T., and Saul, L. (1998). An introduction to variational methods for graphical models. In Learning in Graphical Models, pages 105–161. Springer.

Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.

Kingma, D. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.

Kleijn, B. (2004). Bayesian asymptotics under misspecification. PhD thesis, Vrije Universiteit Amsterdam.

Kleijn, B. and van der Vaart, A. (2012). The Bernstein-von-Mises theorem under misspecification. Electronic Journal of Statistics, 6:354–381.

Kontorovich, A. (2014). Concentration in unbounded metric spaces and algorithmic stability. In International Conference on Machine Learning.

Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. (2017). Automatic differentiation variational inference. The Journal of Machine Learning Research, 18(1):430–474.

LeCam, L. (1960). Locally Asymptotically Normal Families of Distributions. Berkeley: University of California Press.

Lehmann, E. and Casella, G. (2006). Theory of Point Estimation. Springer Science & Business Media.

Liese, F. and Vajda, I. (1987). Convex Statistical Distances. Teubner.

Lindeberg, T. (1990). Scale-space for discrete signals. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(3):234–254.

Meyn, S. and Tweedie, R. (2012). Markov Chains and Stochastic Stability. Springer Science & Business Media.

Mobahi, H. (2013). Optimization by Gaussian smoothing with application to geometric alignment. PhD thesis, University of Illinois at Urbana-Champaign.

Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k^2). In Doklady Akademii Nauk.

Nixon, M. and Aguado, A. (2012). Feature Extraction and Image Processing for Computer Vision. Academic Press.

Pinheiro, J. and Bates, D. (1996). Unconstrained parametrizations for variance-covariance matrices. Statistics and Computing, 6:289–296.

Qiao, Y. and Minematsu, N. (2010). A study on invariance of f-divergence and its application to speech recognition. IEEE Transactions on Signal Processing, 58(7):3884–3890.

Rakhlin, A., Shamir, O., and Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In International Conference on Machine Learning.

Ranganath, R. (2014). Black box variational inference. In Advances in Neural Information Processing Systems.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.

Robert, C. and Casella, G. (2013). Monte Carlo Statistical Methods. Springer Science & Business Media.

Roberts, G. and Rosenthal, J. (2004). General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71.

Schatzman, M. (2002). Numerical Analysis: a Mathematical Introduction. Clarendon Press. Translation: John Taylor.

Shen, X. and Wasserman, L. (2001). Rates of convergence of posterior distributions. The Annals of Statistics, 29(3):687–714.

Stefanski, L. and Boos, D. (2002). The calculus of M-estimation. The American Statistician, 56(1):29–38.

Stein, C. (1972). A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability.

van der Vaart, A. (2000). Asymptotic Statistics. Cambridge University Press.

van der Vaart, A. and Wellner, J. (2013). Weak Convergence and Empirical Processes: with Applications to Statistics. Springer Science & Business Media.

Van Erven, T. and Harremos, P. (2014). Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820.

Vehtari, A., Simpson, D., Gelman, A., Yao, Y., and Gabry, J. (2015). Pareto smoothed importance sampling. arXiv:1507.02646.

Wainwright, M. and Jordan, M. (2008). Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc.

Wang, Y. and Blei, D. (2019). Frequentist consistency of variational Bayes. Journal of the American Statistical Association, 114(527):1147–1161.

Xing, E., Jordan, M., and Russell, S. (2002). A generalized mean field algorithm for variational inference in exponential families. In Uncertainty in Artificial Intelligence.

Yang, Y., Pati, D., and Bhattacharya, A. (2020). α-variational inference with statistical guarantees. The Annals of Statistics, 48(2):886–905.

Zhang, F. and Gao, C. (2020).
Convergence rates of variational posterior distributions. The Annals of Statistics, 48(4):2180–2207.

Appendix A

Proofs

In this appendix, we provide proofs of Theorems 4.1.1 and 4.2.1.

A.1 Proof for Theorem 4.1.1

A.1.1 Gradient and Hessian derivation

The gradient of the smoothed posterior is as follows,
\[
\nabla \log \hat{\pi}_n(\theta)
= \nabla \log\left\{\mathbb{E}\left[\exp\left(-\frac{1}{2\alpha_n}\|\theta-W\|^2\right)\right]\right\}
= \frac{\mathbb{E}\left[\exp\left(-\frac{1}{2\alpha_n}\|\theta-W\|^2\right)\left(-\frac{1}{\alpha_n}\right)(\theta-W)\right]}{\mathbb{E}\left[\exp\left(-\frac{1}{2\alpha_n}\|\theta-W\|^2\right)\right]},
\]
and the Hessian matrix is given by
\[
\nabla^2 \log \hat{\pi}_n(\theta)
= \frac{1}{\alpha_n^2}\,
\frac{\mathbb{E}\left[e^{\frac{-\|\theta-W\|^2-\|\theta-W'\|^2}{2\alpha_n}}\,W(W-W')^T\right]}{\mathbb{E}\left[e^{-\frac{\|\theta-W\|^2}{2\alpha_n}}\right]^2}
- \frac{1}{\alpha_n} I,
\tag{A.1}
\]
where $W, W' \overset{\text{i.i.d.}}{\sim} \Pi_n$.

A.1.2 Proof of 1st statement of Theorem 4.1.1

Proof of 1st statement of Theorem 4.1.1. To show that MAP estimation for the smoothed posterior is asymptotically strictly convex, we will show that
\[
\lim_{n\to\infty}\mathbb{P}\left(\sup_{\|\theta-\theta_0\|\le M}\lambda_{\max}\left(\nabla^2\log\hat{\pi}_n(\theta)\right) < 0\right) = 1.
\]
We focus on the first term of Eq. (A.1), and show that asymptotically it is uniformly smaller than $\alpha_n^{-1}$ so that the overall Hessian is negative definite. For the denominator of Eq. (A.1), define $B_n := \{W, W' : \max\{\|W'-\theta_0\|, \|W-\theta_0\|\} \le \beta_n\}$ for any sequence $\beta_n = o(\alpha_n)$. Then we have
\[
\begin{aligned}
\mathbb{E}\left[e^{-\frac{\|\theta-W\|^2}{2\alpha_n}}\right]^2
&= \mathbb{E}\left[e^{\frac{-\|\theta-W\|^2-\|\theta-W'\|^2}{2\alpha_n}}\,1_{B_n}\right]
+ \mathbb{E}\left[e^{\frac{-\|\theta-W\|^2-\|\theta-W'\|^2}{2\alpha_n}}\,1_{B_n^c}\right] \\
&\ge \mathbb{E}\left[e^{\frac{-\|\theta-W\|^2-\|\theta-W'\|^2}{2\alpha_n}}\,1_{B_n}\right] \\
&\ge \mathbb{E}\left[\left(\inf_{v,v'\in B_n} e^{\frac{-\|\theta-v\|^2-\|\theta-v'\|^2}{2\alpha_n}}\right)1_{B_n}\right]
= \left(\inf_{v,v'\in B_n} e^{\frac{-\|\theta-v\|^2-\|\theta-v'\|^2}{2\alpha_n}}\right)\mathbb{P}(B_n).
\end{aligned}
\]
By minimizing over $v, v' \in B_n$, the above leads to
\[
\mathbb{E}\left[e^{-\frac{\|\theta-W\|^2}{2\alpha_n}}\right]^2 \ge e^{-\frac{2(\|\theta-\theta_0\|+\beta_n)^2}{2\alpha_n}}\,\mathbb{P}(B_n).
\tag{A.2}
\]
For the numerator of the first term of Eq. (A.1), since $W, W'$ are i.i.d.,
\[
\mathbb{E}\left[e^{\frac{-\|\theta-W\|^2-\|\theta-W'\|^2}{2\alpha_n}}\,W(W-W')^T\right]
= \frac{1}{2}\,\mathbb{E}\left[e^{\frac{-\|\theta-W\|^2-\|\theta-W'\|^2}{2\alpha_n}}\,(W-W')(W-W')^T\right],
\]
and since $\lambda_{\max}\left((W-W')(W-W')^T\right) = \|W-W'\|^2$,
\[
\lambda_{\max}\left(\mathbb{E}\left[e^{\frac{-\|\theta-W\|^2-\|\theta-W'\|^2}{2\alpha_n}}\,W(W-W')^T\right]\right)
\le \frac{1}{2}\,\mathbb{E}\left[e^{\frac{-\|\theta-W\|^2-\|\theta-W'\|^2}{2\alpha_n}}\,\|W-W'\|^2\right].
\tag{A.3}
\]
With Eqs.
(A.2) and (A.3), we can therefore bound the maximal eigenvalue of the Hessian matrix,
\[
\lambda_{\max}\left(\nabla^2\log\hat{\pi}_n(\theta)\right)
\le \frac{1}{2\alpha_n^2\,\mathbb{P}(B_n)}\,
\mathbb{E}\left[e^{\frac{-\|\theta-W\|^2-\|\theta-W'\|^2}{2\alpha_n}}\,e^{\frac{2(\|\theta-\theta_0\|+\beta_n)^2}{2\alpha_n}}\,\|W-W'\|^2\right]
- \frac{1}{\alpha_n}.
\tag{A.4}
\]
We now bound the supremum of this expression over $\{\theta\in\mathbb{R}^d : \|\theta-\theta_0\|\le M\}$. Focusing on the exponent within the expectation,
\[
\begin{aligned}
&\sup_{\|\theta-\theta_0\|\le M}\frac{1}{\alpha_n}\left[2(\|\theta-\theta_0\|+\beta_n)^2 - \|\theta-W\|^2 - \|\theta-W'\|^2\right] \\
&\quad= \sup_{\|\theta-\theta_0\|\le M}\frac{1}{\alpha_n}\left[2(\|\theta-\theta_0\|+\beta_n)^2 - \|\theta-\theta_0+\theta_0-W\|^2 - \|\theta-\theta_0+\theta_0-W'\|^2\right] \\
&\quad\le \frac{1}{\alpha_n}\left[(2\beta_n^2+4M\beta_n) - \left(\|\theta_0-W\|^2+\|\theta_0-W'\|^2\right) + 2M\left(\|\theta_0-W\|+\|\theta_0-W'\|\right)\right],
\end{aligned}
\]
where the inequality is obtained by expanding the quadratic terms and bounding $\|\theta-\theta_0\|$ by $M$. We combine the above bound with Eq. (A.4) to show that $\alpha_n^2\lambda_{\max}\left(\nabla^2\log\hat{\pi}_n(\theta)\right)+\alpha_n$ is bounded above by
\[
\frac{\beta_n}{2\,\mathbb{P}(B_n)}\,e^{\frac{2\beta_n^2+4M\beta_n}{\alpha_n}}\,
\mathbb{E}\left[e^{\frac{2M(\|\theta_0-W\|+\|\theta_0-W'\|)-(\|\theta_0-W\|^2+\|\theta_0-W'\|^2)}{\alpha_n}}\,\frac{\|W-W'\|^2}{\beta_n}\right].
\tag{A.5}
\]
By multiplying and dividing by $\exp\left(\|W-W'\|/\sqrt{\beta_n}\right)$, one notices that
\[
\frac{\|W-W'\|^2}{\beta_n}
= \exp\left(\frac{\|W-W'\|}{\sqrt{\beta_n}}\right)\exp\left(-\frac{\|W-W'\|}{\sqrt{\beta_n}}\right)\frac{\|W-W'\|^2}{\beta_n}
\le 4e^{-2}\exp\left(\frac{\|W-\theta_0\|+\|W'-\theta_0\|}{\sqrt{\beta_n}}\right),
\]
where the inequality is by the fact that $x^2e^{-x}$ is maximized at $x=2$ with value $4e^{-2}$, and $\|W-W'\|\le\|W-\theta_0\|+\|W'-\theta_0\|$. If we combine this bound with Eq. (A.5) and note that $W, W'$ are i.i.d., Eq. (A.5) is bounded above by
\[
\frac{2e^{-2}\beta_n}{\mathbb{P}(B_n)}\,e^{\frac{2\beta_n^2+4M\beta_n}{\alpha_n}}\,
\mathbb{E}\left[e^{\left(\frac{M}{\alpha_n}+\beta_n^{-1/2}\right)\|W-\theta_0\|-\frac{1}{2\alpha_n}\|W-\theta_0\|^2}\right]^2.
\tag{A.6}
\]
To show that the Hessian is asymptotically negative definite, it suffices to show that Eq. (A.6) is $o_{P_{\theta_0}}(\alpha_n)$. For the terms outside the expectation, $\beta_n = o(\alpha_n)$ implies that $2e^{-2}\beta_n e^{\frac{2\beta_n^2+4M\beta_n}{\alpha_n}} = o(\alpha_n)$, and Assumption 1 and Lemma A.1.1 together imply that
\[
\mathbb{P}(B_n) = \Pi_n\left(\{W : \|W-\theta_0\|\le\beta_n\}\right)^2 \overset{P_{\theta_0}}{\to} 1,
\]
so
\[
\frac{2e^{-2}\beta_n}{\mathbb{P}(B_n)}\,e^{\frac{2\beta_n^2+4M\beta_n}{\alpha_n}} = o_{P_{\theta_0}}(\alpha_n).
\]
Therefore, in order to show that Eq. (A.6) is $o_{P_{\theta_0}}(\alpha_n)$, it is sufficient to show that
\[
\mathbb{E}\left[e^{\left(\frac{M}{\alpha_n}+\beta_n^{-1/2}\right)\|W-\theta_0\|-\frac{1}{2\alpha_n}\|W-\theta_0\|^2}\right] = O_{P_{\theta_0}}(1).
\]
The next step is to split the expectation into two regions—$\|W-\theta_0\|\le\beta_n$ and $\|W-\theta_0\|>\beta_n$—and bound its value within them separately.

1.
When ‖W − θ0‖ ≤ βn, the exponent inside the expectation is shrinking49uniformly since βn = o(αn):E[1{‖W−θ0‖≤βn}e(1αnM+β−1/2n)‖W−θ0‖− 12αn ‖W−θ0‖2]≤ E [1{‖W−θ0‖≤βn}] e( 1αnM+β−1/2n )βn= OPθ0 (1).2. When ‖W − θ0‖ > βn, we take the supremum over the exponent (a quadraticfunction), yielding ‖W − θ0‖ = M + αnβ−1/2n and the following bound,(1αnM + β−1/2n)‖W − θ0‖ − 12αn‖W − θ0‖2 (A.7)≤ sup‖v−θ0‖((1αnM + β−1/2n)‖v − θ0‖ − 12αn‖v − θ0‖2)=(1αnM + β−1/2n)(M + αnβ−1/2n)− 12αn(M + αnβ−1/2n)2=M22αn+Mβ1/2n+αn2βn.This yieldsE[1{‖W−θ0‖>βn}e(1αnM+β−1/2n)‖W−θ0‖− 12αn ‖W−θ0‖2]≤ Πn ({W : ‖W − θ0‖ > βn}) exp(M22αn+Mβ1/2n+αn2βn).Note that it is always possible to choose βn = o(αn) with βn = ω(α2n). Withthis choice of βn, the dominating term among the three of Eq. (A.7) is M22αn.Then by Lemma A.1.1, there exists a sequence βn = o(αn) with βn = ω(α2n)such that the following holds,Πn({W : ‖W − θ0‖ > βn}) = oPθ0(exp{−M22αn}),50which impliesE[1{‖W−θ0‖>βn}e(1αnM+β−1/2n)‖W−θ0‖− 12αn ‖W−θ0‖2]= oPθ0 (1).This finishes the proof.In the last step of the above proof, we require an exponential tail bound forthe posterior Πn. We provide this in the following lemma, following the generalproof strategy of van der Vaart (2000, Thm 10.3). The proof of Lemma A.1.1involves many probability distributions; thus, for mathematical convenience andexplicitness, in the proof of Lemma A.1.1 we use square bracket—P [X]—to denotethe expectation of random variable X with respect to a probability distribution P .When taking expectation to a function of n data points f(X1, . . . , Xn) , where(Xi)ni=1i.i.d.∼ Pθ, we still write Pθ[f ]; and Pθ here represents the product measure.Lemma A.1.1. Under Assumption 1, α3nn→∞, there exists a sequence βn satisfy-ing βn = o(αn), βn = ω(α2n) and βn = ω(n−1/2) such that for any fixed constantM ,Πn({W : ‖W − θ0‖ > βn}) = oPθ0(exp{−M22αn}).Proof of Lemma A.1.1. 
In order to show that βn satisfies the tail probability bound,it suffices to prove thate1αn Pθ0 [Πn({W : ‖W − θ0‖ > βn})]→ 0,due to Markov’s inequality (we absorb the M2/2 constant into αn because it doesnot affect the proof). To achieve this, we take advantage of the existence of a testsequence applied from Assumption 1. By van der Vaart (2000, Lemma 10.6), giventhe 1st and the 2nd conditions of Assumption 1 and the fact that the parameter spaceRd is σ-compact, there exists a sequence of tests φn : X n → [0, 1], where X n isthe space of (X1, . . . , Xn), such that as n→∞,Pθ0 [φn]→ 0, sup‖θ−θ0‖>εPθ [1− φn]→ 0.51Further, by Kleijn (2004, Lemma 1.2) and van der Vaart (2000, Lemma 10.3), underAssumption 1 and the existence of the above test sequence φn, for every Mn →∞,there exists a constant C > 0 and another sequence of tests ψn : X n → [0, 1] suchthat for all ‖θ − θ0‖ > Mn/√n and sufficiently large n,Pθ0 [ψn] ≤ exp{−Cn}, Pθ [1− ψn] ≤ exp{−Cn(‖θ − θ0‖2 ∧ 1)}. (A.8)Using ψn, we split the expectation as following,e1αn · Pθ0 [Πn({W : ‖W − θ0‖ > βn})]= e1αn · Pθ0 [Πn({W : ‖W − θ0‖ > βn})ψn]︸ ︷︷ ︸(I)+ e1αn · Pθ0 [Πn({W : ‖W − θ0‖ > βn})(1− ψn)]︸ ︷︷ ︸(II),and we aim to show both parts converging to 0.For term (I), the first statement of Eq. (A.8) implies that ∃C > 0 such thate1αn · Pθ0 [Πn({W : ‖W − θ0‖ > βn})ψn] ≤ e1αn Pθ0 [ψn] ≤ e1αn e−nC .where the first inequality follows by Πn({W : ‖W − θ0‖ > βn}) ≤ 1. Sincenα3n →∞, the last bound in the above expression converges to 0.For term (II), we work with the shifted and scaled posterior distribution. 
DefineZn =√n (W − θ0) and Bn = {Zn : ‖Zn‖ >√nβ2n }, and let Π˜0 be thecorresponding prior distribution on Zn and Π˜n be the shifted and scaled posteriordistribution, which yieldse1αn · Pθ0 [Πn({W : ‖W − θ0‖ > βn})(1− ψn)]= e1αn · Pθ0[Π˜n (Bn) (1− ψn)].(A.9)Let U be a closed ball around 0 with a fixed radius r, then restricting Π˜0 on Udefines a probability measure Π˜U0 , i.e., for all measurable set B, Π˜U0 (B) = Π˜0(B ∩U)/Π˜0(U). Write Pn,z for the joint distribution of n data points (X1, . . . , Xn)52parameterized under θ0 + z/√n and hence write the marginal distribution of(X1, . . . , Xn) under Π˜U0 for Pn,U =∫Pn,zdΠ˜U0 (z). The densities of these distri-butions will be represented using lower case, e.g., pn,U (x) =∫pn,z(x)p˜iU0 (z)dz isthe PDF of Pn,U . Here we abuse the notation that x represents (x1, . . . , xn).We replace Pθ0 in Eq. (A.9) with Pn,U . Under Assumption 1, by van der Vaart(2000, P. 141), Pn,U is mutually contiguous to Pθ0 (LeCam, 1960), that is, for anystatistics Tn (a Borel function of Xn), TnPθ0→ 0 iff Tn Pn,U→ 0. Thus, consideringΠ˜0 (Bn) (1 − ψn) as the statistics Tn, the convergence to 0 of the expression inEq. 
(A.9) is equivalent toe1αn · Pn,U[Π˜n (Bn) (1− ψn)]→ 0.Manipulating the expression of Pn,U and Π˜n (Bn) (we write Π˜n (Bn, x) in theintegral and write Π˜n (Bn, (Xi)ni=1) in the expectation to make the dependence ofposterior on the data explicit),Pn,U[Π˜n (Bn, (Xi)ni=1) (1− ψn)]=∫Π˜n (Bn, x) (1− ψn)dPn,U (x)=∫Π˜n (Bn, x) (1− ψn)pn,U (x)dx.Note that pn,U (x) =∫pn,z(x)dΠ˜U0 (z),=∫Π˜n (Bn, x) (1− ψn)(∫pn,z(x)dΠ˜U0 (z))dx.Recall that for all measurable set B, Π˜U0 (B) = Π˜0(B ∩ U)/Π˜0(U), thus=1Π˜0(U)∫Π˜n (Bn, x) (1− ψn)(∫Upn,z(x)dΠ˜0(z))dx.By using Bayes rule, we expand Π˜n (Bn, x) =∫1[Bn]pn,z(x)dΠ˜0(z)∫pn,z(x)dΠ˜0(z) ,=∫(1− ψn)(∫1[Bn]pn,z(x)dΠ˜0(z))(∫U pn,z(x)dΠ˜0(z))dxΠ˜0(U)∫pn,z(x)dΠ˜0(z).53Note that Π˜n (U, x) =∫U pn,z(x)dΠ˜0(z)∫pn,z(x)dΠ˜0(z) ,=1Π˜0(U)∫ (∫1[Bn]pn,z(x)dΠ˜0(z))(1− ψn)Π˜n (U, x) dx.By Fubini Theorem and Π˜n (U, x) ≤ 1,≤ 1Π˜0(U)∫Bn(∫(1− ψn)pn,z(x)dx)dΠ˜0(z)=1Π˜0(U)∫{‖z‖>√nβ2n }Pn,z[1− ψn]dΠ˜0(z).Note that Pn,z[1− ψn] ≡ Pθ[1− ψn] for θ = θ0 + z/√n and that√nβ2n → ∞due to βn = ω(n−1/2). Thus, we can use the second statement of Eq. (A.8) tobound Pn,z[1− ψn], yielding1Π˜0(U)∫{‖z‖>√nβ2n }Pn,z[1− ψn]dΠ˜0(z)≤ 1Π˜0(U)∫{‖z‖>√nβ2n }exp{−C(‖z‖2 ∧ n)}dΠ˜0(z).We then derive upper bounds for both the fraction and the integral to show the aboveis o(e−1αn). For the fraction, we define Un :={w ∈ Rd : √n (w − θ0) ∈ U},thenΠ˜0(U) =Π0(Un) ≥ pid2γ(d2 + 1)(n−1/2r)dinfw∈Unpi0(w).By Assumption 1, for all w ∈ Rd, pi0(w) is positive and continuous, and henceinfw∈Un pi0(w) is an increasing sequence that converges to pi0(θ0) > 0 as n→∞.Thus, there is a constant D > 0 such that for sufficiently large n,Π˜0(U) ≥ Dn−d/2, (A.10)54yielding that∃C > 0, s.t. 
1Π˜0(U)≤ Cnd/2.For the integral, by splitting Bn into {√nβ2n < ‖zn‖ ≤ k√n } and {‖zn‖ >k√n } for some positive k < 1,∫{‖z‖>√nβ2n }exp{−C(‖z‖2 ∧ n)}dΠ˜0(z)≤∫{k√n≥‖z‖>√nβ2n }exp{−C‖z‖2}dΠ˜0(z) + e−Ck2n.Then by change of variable to w = 1√nz + θ0,=∫{k≥‖w−θ0‖>βn}exp{−Cn‖w − θ0‖2}pi0(w)dw + e−Ck2n.Note that by Assumption 1, pi0(w) is continuous for all w ∈ Rd, we can choose asufficiently small k such that pi0(θ) is uniformly bounded by a constant C over theregion {k ≥ ‖w − θ0‖ > βn}. Thus, the above can be bounded above byC∫{k≥‖w−θ0‖>βn}exp{−Cn‖w − θ0‖2}dw + e−Ck2n= Cn−d/2∫{k√n≥‖z‖>√nβ2n }exp{−C‖z‖2} dz + e−Ck2n≤ Cn−d/2∫{‖z‖>√nβ2n }exp{−C‖z‖2} dz + e−Ck2n,where the equality is by change of variable back to z =√n (w − θ0). Then,consider the integral on RHS. Using spherical coordinates, there exists a fixedconstant D > 0 such that∫{‖z‖>√nβ2n }exp{−C‖z‖2} dz = D ∫{r>√nβ2n }e−Cr2rd−1dr= DC−d/2∫{s>Cnβ2n}e−ssd2−1ds,55where the second equality is by setting s = Cr2. Note that the integrand of RHS isproportional to the PDF of Γ(d2 , 1). Using the tail properties of the Gamma randomvariable (Boucheron et al., 2013, p. 28), we have that for some generic constantD > 0, ∫{‖z‖>√nβ2n }exp{−C‖z‖2} dz ≤ De−Cnβ2n .Therefore, for some generic constants C,D > 0,∫{‖z‖>√nβ2n }exp{−C(‖z‖2 ∧ n)}dΠ˜0(z) (A.11)≤ Dn−d/2e−Cnβ2n + e−Ck2n.Then we combine Eqs. (A.10) and (A.11), yielding that for some constantsC,D > 0independent to n,e1αn · Pn,U[Π˜0 (Bn) (1− ψn)]≤ e 1αn 1Πn,0(U)∫{‖z‖>√nβ2n }exp{−C(‖z‖2 ∧ n)}dΠ˜0(z)≤ e 1αnC√n de−Cn + e 1αnDe−Cnβ2n .Lastly, it remains to show that there exists a positive sequence βn satisfying bothβn = o(αn) and βn = ω(α2n) such that the RHS converges to 0. The first termalways converges to 0 due to α3nn → ∞. For the second term, we consider twodifferent cases. If αn = o(n−1/6), we pick βn = n−1/3, which is both o(αn) andω(α2n). Thene1αnDe−Cnβ2n = D exp{α−1n − Cn1/3}−→ 0,where the convergence in the last line is by nα3n → ∞ ⇔ 1αn = o(n1/3). 
If56αn = ω(n−1/6), we pick βn = α2n. Thene1αnDe−Cnβ2n = D exp{α−1n − Cnα4n}.Since αn = ω(n−1/6), 1αn = o(n1/6) and nα4n = ω(n1/3), yielding that the aboveconverges to 0 as n→∞.This completes the proof.A.1.3 Proof of 2nd statement of Theorem 4.1.1In this section, we show that the smoothed MAP estimator (θˆn) is also a consistentestimate of θ0, but with a convergence rate that is slower than the traditional√n . This is the case because the variance of the smoothing kernel satisfies αn =ω(n−1/3), and the convergence rate of θˆn is determined by αn via‖θˆn − θ0‖ = OPθ0 (√αn ). (A.12)Recall that θMAP,n := arg maxpin(θ) is the MAP estimator for the exact pos-terior, which is a√n -consistent estimate of θ0. Thus, it is sufficient to show∥∥∥θˆn − θMAP,n∥∥∥ = OPθ0 (√αn ).Note that θˆn and θMAP,n are maximals of stochastic process pˆin(θ) and pin(θ)respectively, which can be studied in the framework of M-estimator (van der Vaart,2000; van der Vaart and Wellner, 2013). A useful tool in establishing the asymptoticsof M-estimators is the Argmax Continuous Mapping theorem (van der Vaart andWellner, 2013, Lemma 3.2.1), which is introduced as follows.Lemma A.1.2 (Argmax Continuous Mapping (van der Vaart and Wellner, 2013)).Let {fn(θ)} and f(θ) be stochastic processes indexed by θ, where θ ∈ Θ. Let θˆ bea random element such that almost surely, for every open sets G containing θˆ,f(θˆ) > supθ/∈Gf(θ).57and θˆn be a random sequence such that almost surelyfn(θˆn) = supθ∈Θfn(θ).If supθ∈Θ |fn(θ)− f(θ)| = oP (1) as n→∞, thenθˆnd→ θˆ.The proof strategy of the 2nd statement of Theorem 4.1.1 is to apply Lemma A.1.2in a setting where fn is pˆin(θ) and f is a Gaussian density. Using the Bernstein-vonMises Theorem 3.1.1, we show that pˆin(θ) converges uniformly to this Gaussiandensity, which implies that the MAP of pˆin(θ) converges in distribution to the MAPof this Gaussian distribution by the Argmax Continuous Mapping theorem. 
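Before the detailed proof, the effect of the Gaussian smoothing underlying $\hat{\pi}_n$ can be checked numerically. The sketch below is purely illustrative and not part of the proof: the bimodal "posterior" samples, the smoothing levels, and the evaluation grid are hypothetical choices. It estimates the smoothed log-density $\log \mathbb{E}[\exp(-\|\theta-W\|^2/(2\alpha))]$ by a Monte Carlo average over samples $W$ and locates its maximizer for several smoothing levels $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bimodal "posterior" samples W ~ Pi_n: the dominant mode
# (80% of the mass) sits at theta0 = 0, a narrow spurious mode at 1.5.
W = np.concatenate([rng.normal(0.0, 0.10, 4000),
                    rng.normal(1.5, 0.05, 1000)])

def smoothed_log_density(theta, W, alpha):
    # log pi_hat(theta) = log E[exp(-(theta - W)^2 / (2 alpha))] + const,
    # estimated by a Monte Carlo average over the posterior samples and
    # computed with a log-sum-exp shift for numerical stability.
    z = -(theta - W[:, None]) ** 2 / (2.0 * alpha)
    m = z.max(axis=0)
    return m + np.log(np.exp(z - m).mean(axis=0))

grid = np.linspace(-1.0, 2.5, 701)
for alpha in (1.0, 0.1, 0.01):
    theta_hat = grid[np.argmax(smoothed_log_density(grid, W, alpha))]
    print(f"alpha={alpha}: smoothed MAP near {theta_hat:.2f}")
```

With heavy smoothing the estimated surface is effectively unimodal; as $\alpha$ decreases, the smoothed maximizer settles on the dominant mode near $\theta_0 = 0$, mirroring the consistency statement proved here.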
Thedetailed proof is as follows.Proof of 2nd statement of Theorem 4.1.1. Assumption 2 guarantees ‖θMAP,n−θ0‖=OPθ0 (1/√n ). Note that‖θˆn − θ0‖ ≤ ‖θˆn − θMAP,n‖+ ‖θMAP,n − θ0‖.Since√αn = ω(1/√n ), in order to get Eq. (A.12), it suffices to show∥∥∥θˆn − θMAP,n∥∥∥ = OPθ0 (√αn ) .Thus, in this proof, we aim to show∥∥∥θˆn − θMAP,n∥∥∥ = OPθ0 (√αn ) and it issufficient to prove1√αn(θˆn − θMAP,n) Pθ0→ 0.Let ξ = 1√αn (θ − θMAP,n), ξ∗n = 1√αn(θˆn − θMAP,n)and t = 1√αn (w − θMAP,n).58By expressing pˆin(θ), which is defined in Eq. (4.1),ξ∗n = arg maxξpˆi (√αn ξ + θMAP,n)= arg maxξ∫pin (√αn t+ θMAP,n) exp(− 12αn‖√αn ξ −√αn t‖2)dt= arg maxξ∫αd/2n pin (√αn t+ θMAP,n) exp(−12‖ξ − t‖2)dt.Definefn(ξ) =∫αd/2n pin (√αn t+ θMAP,n) exp(−12‖ξ − t‖2)dt,gn(ξ) =∫φ(t; 0,1nαnH−10)exp(−12‖ξ − t‖2)dt,f(ξ) =(2pi)d/2φ (ξ; 0, I) ,where φ(·;µ,Σ) denotes the PDF of N (µ,Σ).By adding and subtracting f(ξ),ξ∗n = arg maxξfn(ξ)= arg maxξ{fn(ξ)− f(ξ) + f(ξ)} .We then apply Lemma A.1.2 to show ξ∗nd→ arg maxξ f(ξ). We start by verifying acondition of the argmax continuous mapping theorem thatlimn→∞ supξ|fn(ξ)− f(ξ)| = 0. (A.13)By triangle inequality, for all n,supξ|fn(ξ)− f(ξ)| ≤ supξ|fn(ξ)− gn(ξ)|+ supξ|gn(ξ)− f(ξ)|. (A.14)Later we show both two terms on the RHS converging to 0.For the first term. Note that αd/2n pin(√αn t+ θMAP,n) is the probability density59function of Π√αn t+θMAP,n , which is the posterior distribution parameterized on t.Thus, for all n,supξ|fn(ξ)− gn(ξ)|= supξ{∫ ∣∣∣∣αd/2n pin(√αn t+θMAP,n)−φ(t; 0, 1nαnH−10 )∣∣∣∣exp(−12‖ξ − t‖2)dt}≤ DTV(Π√αn t+θMAP,n ,N(0,1nαnH−10)),where the inequality is by supξ,t exp(−12‖ξ − t‖2) ≤ 1. Under Assumption 1, theposterior distribution admits Bernstein-von Mises theorem (Theorem 3.1.1) thatDTV(Πn,N(0,1nH−10))= oPθ0 (1).With the invariance of total variation under reparametrization, we haveDTV(Π√αn t+θMAP,n ,N(0,1nαnH−10))= oPθ0 (1).This shows the uniform convergence from fn(ξ) to gn(ξ).For the second term in Eq. (A.14). 
Note that we can evaluate gn(ξ) since it is aconvolution of two Gaussian PDFs, that isgn(ξ) = (2pi)d/2φ(ξ; 0,1nαnH−10 + I).Comparing this to f(ξ) = (2pi)d/2φ (ξ; 0, I), one notices that 1nαnH−1θ0+ I → Idue to α3nn → ∞. And further for Gaussian distributions, the convergence ofparameters implies the uniform convergence of PDFs, yielding thatlimn→∞ supξ|gn(ξ)− f(ξ)| = 0.Thus, we have Eq. (A.14) converging to 0 as n→∞.Now we look at f(ξ) with the goal to apply Lemma A.1.2 and to obtain ξ∗nd→60arg maxξ f(ξ). Note thatarg maxξf(ξ) = 0 and supξf(ξ) = det (I)−1/2 = 1.To apply Lemma A.1.2, we need to ensure that for any open set G that contains 0,f(0) > supξ∈Gf(ξ). (A.15)This holds by the unimodality of standard Gaussian distirbution.Therefore, with both conditioins Eq. (A.13) and Eq. (A.15), we can applyLemma A.1.2 to conclude that1√αn(θˆn − θMAP,n) Pθ0→ 0.This completes the proof.61A.2 Proof for Theorem 4.2.1Proof of Theorem 4.2.1. In this proof, we aim to apply Theorem A.2.1 withX = {µ ∈ Rp, L lower triangular with non-negative diagonals},which is closed and convex. Note that in the notation of this theorem, x =(µT , LT1 , . . . , LTp )T ∈ R(d+1)d and V ∈ R(d+1)d×(d+1)d is set to be a diagonalmatrix with entries 2 for the µ components and r/(2‖I − L?n‖F ) for the L compo-nents. ThereforeJ(x) = J(µ,L) = 4‖µ− µ?n‖2 +r24‖I − L?n‖2F‖L− L?n‖2F .This setting yields two important facts. First, by Theorems 4.1.1 and 3.2.1,θˆnPθ0→ θ0 and µ?nPθ0→ θ0,yielding thatP(‖θˆn − θ0‖+ ‖µ?n − θ0‖ ≤r4√2)→ 1, as n→∞.For ‖µ0 − θˆn‖2 ≤ r232 , by triangle inequality, the probability that the followinginequalities hold converges to 1 in Pθ0 as n→∞,‖µ0 − µ?n‖ ≤ ‖µ0 − θˆn‖+ ‖θˆn − θ0‖+ ‖µ?n − θ0‖ ≤r2√2.Further with L0 = I , J(µ0, L0) ≤ 3r24 ≤ r2. Hence, if we initialize L0 = I and µ0such that ‖µ0 − θˆn‖22 ≤ r232 ,P(x0 ∈ {x : J(x) ≤ r2})→ 1, as n→∞. 
(A.16)62Second, if J ≤ r2 then µ is close to the optimal and ‖L‖F is not too large, i.e.,J(µ,L) ≤ r2 =⇒ ‖µ− µ?n‖2 ≤ r2/4J(µ,L) ≤ r2 =⇒ ‖L− L?n‖2F ≤ 4‖I − L?n‖2F=⇒ ‖L‖F ≤ 2‖I − L?n‖F + ‖L?n‖F ,yielding that {J(µ,L) ≤ r2} ⊆ Br,n. Recall thatBr,n ={µ ∈ Rd, L ∈ Rd×d : ‖µ− µ?n‖2 ≤r24and‖L− L?n‖2F ≤ 4‖I − L?n‖2F}.Then by Corollary 3.3.8, under Assumptions 1 and 3, the probability of the eventthatFn is2Dn-strongly convex in {J(µ,L) ≤ r2} (A.17)and globally `Dn-Lipschitz smoothconverges to 1 in Pθ0 as n→∞.For brevity, we make the following definitions for the rest of this proof: recallthe definition of fn, Fn in Eq. (3.2) (we state here again):In : X → R, In(x) := − 1nlog detLfn : Rd → R, fn(θ) := − 1nlog pin(θ)f˜n : (X ,Rd)→ R, f˜n(x, Z) := − 1nlog pin(µ+1√nLZ)Fn : X → R, Fn(x) := E[− 1nlog pin(µ+1√nLZ)]φn := In + f˜n, Φn := In + Fn.Here φn(x, z) is the KL cost function with no expectation, and Φn(x) is the costfunction with the expectation. To match the notation of Theorem A.2.1, we refor-63mulate the scaled gradient estimator defined in Eq. (4.3) as gn,gn(x, Z) ={Rn(x)∇φn(x, Z) x ∈ X olimy→xRn(y)∇φn(y, Z) x ∈ ∂X,for a diagonal scaling matrix Rn(x) ∈ Rd(d+1)×d(d+1). Define that Rn(x) hasentries 1 for the µ components, 1 for the off-diagonal L components, and 1/(1 +(nLii)−1) for the diagonal L components. Note that x→ ∂X means that Lii → 0,ensuring that gn(x, Z) has entries −1 for the Lii. Since Z is a standard normalrandom variable, under the event that− 1n log pin has Lipschitz gradient, the gradientcan be passed through the expectation so that the true gradient is defined as below,Gn(x) := E [gn(x, Z)] = Rn(x)∇Φn(x).Note that the projected stochastic iterationxk+1 = ΠX (xk − γkgn(xk, Zk)) , k = N ∪ {0},with ΠX (x) := arg miny∈X ‖V (x − y)‖2 is equivalent to the iteration describedin Algorithm 2. Note that the differentiability of φn only holds for x ∈ X o. 
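To make this iteration concrete, the following is a minimal sketch of a single projected, scaled stochastic gradient step in the coordinates $x = (\mu, L)$. It is our illustration, not Algorithm 2 verbatim: the score function `grad_log_pi`, the step size, and the dimensions are user-supplied assumptions, and strictly positive diagonals $L_{ii} > 0$ are assumed so that the boundary continuation of the scaled gradient is not needed.

```python
import numpy as np

def csvi_step(mu, L, gamma, n, grad_log_pi, rng):
    """One projected, scaled stochastic gradient step for x = (mu, L),
    i.e. x_{k+1} = Proj_X(x_k - gamma * g_n(x_k, Z_k)), assuming L_ii > 0.

    grad_log_pi(theta) is the score of the (unnormalized) posterior pi_n.
    """
    d = mu.shape[0]
    Z = rng.standard_normal(d)
    theta = mu + L @ Z / np.sqrt(n)             # sample from q via theta = mu + L Z / sqrt(n)
    s = grad_log_pi(theta)                      # score at the sampled point
    g_mu = -s / n                               # gradient of f~_n w.r.t. mu
    g_L = -np.tril(np.outer(s, Z)) / n**1.5     # gradient of f~_n w.r.t. L (lower triangle)
    g_L -= np.diag(1.0 / (n * np.diag(L)))      # gradient of I_n = -(1/n) log det L
    # Diagonal scaling R_n(x): 1/(1 + (n L_ii)^{-1}) on the diagonal of L,
    # 1 elsewhere; as L_ii -> 0 the scaled diagonal gradient tends to -1.
    r = 1.0 / (1.0 + 1.0 / (n * np.diag(L)))
    g_L[np.diag_indices(d)] *= r
    mu_new = mu - gamma * g_mu
    L_new = np.tril(L - gamma * g_L)
    # Projection onto X: keep L lower triangular with non-negative diagonals.
    np.fill_diagonal(L_new, np.maximum(np.diag(L_new), 0.0))
    return mu_new, L_new
```

For instance, with the hypothetical score `grad_log_pi = lambda t: -n * t` (a posterior $\mathcal{N}(0, I/n)$), repeated steps drive $\mu$ toward $0$ and the diagonal of $L$ toward $1$, i.e. the variational covariance $LL^T/n$ toward the posterior covariance, while the projection keeps every iterate inside $\mathcal{X}$.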
Forthe case where Lii = 0 for some i ∈ [d], we can use continuation via the limitlimLii→0− (nLii)−11+(nLii)−1= −1 to evaluate even though the gradient is not defined.For the following proof, we do not make special treatments to those boundary pointswhen applying Taylor expansion and taking derivative.Next we apply Theorem A.2.1 to carry out the proof. The rest of the proof con-sists two parts: to show the confinement result (statement 2. of Theorem A.2.1)and to show the convergence result (statement 3. of Theorem A.2.1) ). Weprove these two results under the event that Eqs. (A.16) and (A.17) hold; since theprobability that these events hold converges in Pθ0 to 1 as n→∞, the final resultholds with the same convergent probability.We first show the confinement result by analyzing (x), `2(x), and σ2(r),which are defined in Eqs. (A.23) and (A.24) respectively. We aim to obtain that64i. We can find sufficiently small γk > 0 such thatαk = 1 + 1[J(xk) ≤ r2](−2γk(r) + 2γ2k`2(r)) ∈ (0, 1]holds for all x ∈ X , i.e,∀x ∈ X : J(x) ≤ r2, 0 ≤ 2γk(x)− 2γ2k`2(x) ≤ 1. (A.18)ii. σ2(r)→ 0 as n→∞ to guarantee the SGD iterations are eventually locallyconfined as n→∞ (based on Theorem A.2.1).To show the statement i., Eq. (A.18), we start by deriving a lower bound for2γk(x)− 2γ2k`2(x). Examine the expression,2γk(x)− 2γ2k`2(x)= 2γkJ(x)−1(x− x?)TV TV Rn(x) (∇Φn(x)−∇Φn(x?))−2γ2kJ(x)−1 (∇Φn(x)−∇Φn(x?))T RT (x)V TV Rn(x) (∇Φn(x)−∇Φn(x?))=2γkJ(x)(V (x− x?))T V Rn(x)(∫· · ·)V −1 (V (x− x?))− 2γ2kJ(x)(V (x− x?))TV −T(∫· · ·)T(V Rn(x))2(∫· · ·)D−1 (V (x− x?)) ,where(∫ · · · ) = (∫ 10 ∇2Φn((1− t)x? + tx)dt). By splitting Φn into the regular-ization In(x) and the expectation Fn(x); and definingA(x) := V Rn(x)(∫ 10∇2In((1− t)x? + tx)dt)V −1B(x) := V Rn(x)(∫ 10∇2Fn((1− t)x? 
+ tx)dt)V −1v(x) := V (x− x?),65the above expression can be written as2γk(x)− 2γ2k`2(x)= 2γkv(x)T (A(x) +B(x))v(x)− γk‖(A(x) +B(x))v(x)‖2‖v(x)‖2≥ 2γk v(x)TA(x)v(x)+v(x)TB(x)v(x)−2γk‖A(x)v(x)‖2−2γk‖B(x)v(x)‖2‖v(x)‖2≥ 2γk{v(x)TA(x)v(x) + v(x)TB(x)v(x)− 2γk‖A(x)v(x)‖2‖v(x)‖2−2γk‖B(x)(V Rn(x)`DnV −1)−1‖2‖V Rn(x)`DnV −1v(x)‖2‖v(x)‖2}= 2γk{v(x)TA(x)v(x) + v(x)T (B(x)− V Rn(x) 2DnV −1)v(x)‖v(x)‖2+v(x)T(V Rn(x)2DnV−1) v(x)− 2γk‖A(x)v(x)‖2‖v(x)‖2−2γk‖B(x)(V Rn(x)`DnV−1)−1‖2‖DRn(x)`DnV −1v(x)‖2‖v(x)‖2}Note that by Corollary 3.3.8 that 2Dn ∇2Fn(x) `Dn and all the V , Rn(x)are positive diagonal matrices, leading toB(x)− V Rn(x) 2DnV−1 0I‖B(x)(V Rn(x)`DnV −1)−1‖2 ≤ 1.66Thus, the above expression can be bounded below by2γk(x)− 2γ2k`2(x)≥ 2γk{v(x)TA(x)v(x) + v(x)T(V Rn(x)2DnV−1) v(x)‖v(x)‖2−2γk‖A(x)v(x)‖2 − 2γk‖V Rn(x)`DnV −1v(x)‖2‖v(x)‖2}=2‖v(x)‖2 v(x)T{[γkA(x)− 2γ2kA2(x)]+12[γkRn(x)Dn − 4`2 (γkRn(x)Dn)2]}v(x)≥ 2λmin([γkA(x)− 2γ2kA2(x)]+12[γkRn(x)Dn − 4`2 (γkRn(x)Dn)2]).Now, notice thatA(x) ,Rn(x)Dn are all diagonal matrices with non-negative entries,γkA(x)− 2γ2kA2(x) = γkA(x) (I − 2γkA(x)) (A.19)γkRn(x)Dn − 4`2 (γkRn(x)Dn)2 = γkRn(x)Dn(− 4`2γkRn(x)Dn).As long as the entries of A(x), Rn(x)Dn are bounded above by a constant for all n,there exists a sufficiently small constant c such that for all γk < c, Eq. (A.19) areboth non-negative. Given that for all n and ∀x ∈ X ,A(x) = diag(0, · · · , (nLii)−11 + (nLii)−11L?ii, · · · , 0) 1mini∈[d] L?iiIRn(x)Dn I,we obtain the boundedness of the entries of A(x), Rn(x)Dn. Therefore, we con-clude that∀x ∈ X , 0 ≤ 2γk(x)− 2γ2k`2(x).67It remains to show the second inequality of Eq. (A.18), i.e.,supx∈X :J(x)≤r22γk(x)− 2γ2k`2(x) ≤ 1.This is true ifsupx∈X :J(x)≤r2(x) ≤ γ−1k .Since γk → 0 as k →∞, the above holds if supx∈X :J(x)≤r2 (x) is bounded aboveby a constant that is independent to n. Now we consider the upper bound forsupx∈X :J(x)≤r2 (x). 
Expanding (x),(x) = J(x)−1(x− x?)TDTDRn(x) (∇Φn(x)−∇Φn(x?))=v(x)T (A(x) +B(x))v(x)‖v(x)‖2≤ λmax(A(x) +B(x))= λmaxRn(x)1/2(∫ 10∇2Φn((1− t)x? + tx)dt)Rn(x)1/2.Split Φn into the regularization In(x) and the expectation Fn(x). For the expecta-tion, by Corollary 3.3.8 that ∇2Fn(x) `Dn and entries of Rn(x) are bounded by1, we haveRn(x)1/2∇2Fn(x)Rn(x)1/2 `I,and for the regularization, note that ∇2In is a diagonal matrix with 0 for µ andoff-diagonals of L and L−2ii /n for diagonals of L, soRn(x)1/2(∫ 10∇2In((1− t)x? + tx)dt)Rn(x)1/2= diag(0, · · · , (nLii)−11 + (nLii)−11L?ii, · · · , 0) 1mini∈[d] L?iiI(A.20)68By the fact that ∀i ∈ [d], L?ii > 0, we have Eq. (A.20) is bounded above by aconstant C. Use the Weyl’s inequality to bound the maximal eigenvalue of thesummation of two Hermitian matrices, we conclude thatsupx∈X :J(x)≤r2(x) ≤ `+ C.Therefore, we have completed the proof for statement i., Eq. (A.18).Then we show the statement ii. by getting upper bound on σ2(r). Recallthat σ2(r) is the upper bound of the fourth moment of‖V Rn(x) (∇φn(x, Z)−∇Φn(x))‖ .Since the regularizor is cancelled in this expression, we only consider the expectationpart. Note that V Rn(x) is a diagonal matrix with positive diagonals,E[∥∥∥V Rn(x)(∇f˜n(x, Z)−∇Fn(x))∥∥∥4]1/4≤ maxi∈[d(d+1)](V Rn(x))iiE[∥∥∥∇f˜n(x, Z)−∇Fn(x)∥∥∥4]1/4 .Let Z1, Z2 be independent copies, by tower property of conditional expectation,= maxi∈[d(d+1)](V Rn(x))iiE[∥∥∥E [∇f˜n(x, Z1)−∇f˜n(x, Z2)|Z1]∥∥∥4]1/4 .By the convexity of ‖ · ‖4 and Jensen’s inequality,≤ maxi∈[d(d+1)](V Rn(x))iiE[E[∥∥∥∇f˜n(x, Z1)−∇f˜n(x, Z2)∥∥∥4 |Z1]]1/4= maxi∈[d(d+1)](V Rn(x))iiE[∥∥∥∇f˜n(x, Z1)−∇f˜n(x, Z2)∥∥∥4]1/4 .By∥∥∥∇f˜n(x, Z1)−∇f˜n(x, Z2)∥∥∥≤∥∥∥∇f˜n(x, Z1)∥∥∥+∥∥∥∇f˜n(x, Z2)∥∥∥ and Minkowski’s69inequality,≤ maxi∈[d(d+1)](V Rn(x))ii{E[‖∇f˜n(x, Z1)‖4]1/4+ E[‖∇f˜n(x, Z1)‖4]1/4}= 2 maxi∈[d(d+1)](V Rn(x))iiE[‖∇f˜n(x, Z)‖4]1/4.Now we focus on bounding E[‖∇f˜n(x, Z)‖4]1/4. 
Examining ∇f̃ₙ(x, Z),

    ∇f̃ₙ(x, Z) = [ ∇fₙ(µ + n^{−1/2} L Z) ; (Z₁/√n) ∇fₙ(µ + n^{−1/2} L Z) ; … ; (Z_d/√n) ∇fₙ(µ + n^{−1/2} L Z) ] ∈ R^{d(d+1)},

yielding

    ‖∇f̃ₙ(x, Z)‖⁴ = ‖∇fₙ(µ + n^{−1/2} L Z)‖⁴ (1 + ZᵀZ/n)².

By the Cauchy–Schwarz inequality,

    E[‖∇f̃ₙ(x, Z)‖⁴]^{1/4} ≤ E[‖∇fₙ(µ + n^{−1/2} L Z)‖⁸]^{1/8} · E[(1 + ZᵀZ/n)⁴]^{1/8}.

We bound the two terms on the right-hand side separately. To bound the 8th moment of ‖∇fₙ(µ + n^{−1/2} L Z)‖ we use its sub-Gaussian property. First notice that ‖∇fₙ(µ + n^{−1/2} L Z)‖ is an (ℓ̄ max_{i∈[d]} L_ii / √n)-Lipschitz function of Z:

    | ‖∇fₙ(µ + n^{−1/2} L Z₁)‖ − ‖∇fₙ(µ + n^{−1/2} L Z₂)‖ |
    ≤ ‖∇fₙ(µ + n^{−1/2} L Z₁) − ∇fₙ(µ + n^{−1/2} L Z₂)‖
    = ‖ ( ∫₀¹ ∇²fₙ(µ + (1−t) L Z₂/√n + t L Z₁/√n) dt ) (L/√n) (Z₁ − Z₂) ‖.

Given that ∇²fₙ ⪯ ℓ̄ I, the above is bounded by

    (ℓ̄/√n) √((Z₁ − Z₂)ᵀ Lᵀ L (Z₁ − Z₂)) ≤ (ℓ̄/√n) √(λ_max(Lᵀ L)) ‖Z₁ − Z₂‖
    = (ℓ̄/√n) √( max_{i∈[d]} L_ii² ) ‖Z₁ − Z₂‖.

A Lipschitz function of Gaussian noise is sub-Gaussian (Kontorovich, 2014, Thm 1): for Z ∼ N(0, I_d) and an L-Lipschitz function ψ: R^d → R,

    P( |ψ(Z) − E[ψ(Z)]| > ε ) ≤ 2 exp( −ε² / (4L²) ).

Thus ‖∇fₙ(µ + n^{−1/2} L Z)‖ is (4ℓ̄²/n) max_{i∈[d]} L_ii²-sub-Gaussian. Then note that for a σ²-sub-Gaussian random variable X ∈ R and any integer k ≥ 2, E[|X|^k]^{1/k} ≤ σ e^{1/e} √k. Hence we obtain

    E[‖∇fₙ(µ + n^{−1/2} L Z)‖⁸]^{1/8} ≤ (2ℓ̄/√n) e^{1/e} √8 max_{i∈[d]} L_ii.

Together with the fact that a Gaussian random variable has finite moments of all orders, so that

    E[(1 + ZᵀZ/n)⁴] ≤ C

for some constant C, we obtain

    E[‖∇f̃ₙ(x, Z)‖⁴]^{1/4} ≤ 2 C^{1/4} (ℓ̄/√n) e^{1/e} √8 max_{i∈[d]} L_ii,

and hence

    E[‖V Rₙ(x)(∇f̃ₙ(x, Z) − ∇Fₙ(x))‖⁴]^{1/4}
    ≤ 2 max_{i∈[d(d+1)]} (V Rₙ(x))_ii · 2 C^{1/4} (ℓ̄/√n) e^{1/e} √8 max_{i∈[d]} L_ii.

Taking the supremum over J(x) ≤ r², the right-hand side is bounded above by a universal constant times n^{−1/2}; we therefore conclude that

    σ(r) = sup_{x∈X: J(x)≤r²} E[‖V Rₙ(x)(∇φₙ(x, Z) − ∇Φₙ(x))‖⁴]^{1/4} → 0 as n → ∞,

and hence σ²(r) → 0 as well. Therefore, since 0 ≤ 2γ_k ε(x) − 2γ_k² ℓ²(x) ≤ 1 for all x ∈ X with J(x) ≤ r², and σ²(r) → 0 as n → ∞, applying Theorem A.2.1 yields the confinement result:

    P( sup_{k∈N} J(x_k) ≤ r² ) → 1

in P_{θ₀} as n → ∞.

Lastly, using statement 3. of Theorem A.2.1, we prove the convergence result by checking that

    inf_{x∈X: J(x)≤r²} ε(x) > 0.

We expand the expression as before:

    ε(x) = J(x)⁻¹ (x − x⋆)ᵀ Vᵀ V Rₙ(x) (∇Φₙ(x) − ∇Φₙ(x⋆))
         = v(x)ᵀ (A(x) + B(x)) v(x) / ‖v(x)‖²
         ≥ λ_min(A(x) + B(x))
         = λ_min( Rₙ(x)^{1/2} ( ∫₀¹ ∇²Φₙ((1−t)x⋆ + tx) dt ) Rₙ(x)^{1/2} ).

Splitting Φₙ into the regularization and the expectation, we have

    Rₙ(x)^{1/2} ( ∫₀¹ ∇²Iₙ((1−t)x⋆ + tx) dt ) Rₙ(x)^{1/2}
    = diag( 0, …, ((nL_ii)⁻¹ / (1 + (nL_ii)⁻¹)) (1/L⋆_ii), …, 0 ) ⪰ 0,       (A.21)

and, using the lower curvature bound from Corollary 3.3.8 with constant ε̄ > 0,

    Rₙ(x)^{1/2} ( ∫₀¹ ∇²Fₙ((1−t)x⋆ + tx) dt ) Rₙ(x)^{1/2}
    ⪰ (ε̄/2) Rₙ(x)^{1/2} Dₙ Rₙ(x)^{1/2} ⪰ (ε̄/(2n)) I ≻ 0.                  (A.22)

Combining Eqs. (A.21) and (A.22) and using Weyl's inequality to bound the minimal eigenvalue of a sum of two Hermitian matrices yields

    inf_{x∈X: J(x)≤r²} ε(x) ≥ ε̄/(2n) > 0.

This gives the convergence result.

The proof is then completed by applying Theorem A.2.1. The quantity ξ_k is strictly positive; since inf ε(x) > 0 and sup ℓ²(x) is bounded above, there exist step sizes γ_k = Θ(k^{−ρ}) with ρ ∈ (0.5, 1) satisfying the conditions of the theorem. Since σ(r) → 0, statement 3. of Theorem A.2.1 becomes

    P( limsup_{k→∞} ‖V(x_k − x⋆)‖² = 0 ) → 1 in P_{θ₀} as n → ∞.

Even though D is a function of n, n is fixed while Algorithm 2 runs. Since D is invertible,

    P( limsup_{k→∞} ‖x_k − x⋆‖² = 0 ) → 1 in P_{θ₀} as n → ∞,

which is exactly the desired result: as the number of data points n → ∞, the probability that Algorithm 2 finds the optimum (as the number of iterations k → ∞) converges to 1. In other words, variational inference is solved asymptotically.

Theorem A.2.1. Let X ⊆ R^p be closed and convex, let g: X × Z → R^p be a function, let G(x) := E[g(x, Z)] for a random element Z ∈ Z, let x⋆ ∈ X be a point such that G(x⋆) = 0, let V ∈ R^{p×p} be invertible, let J(x) := ‖V(x − x⋆)‖², and let r ≥ 0. Consider the projected stochastic iteration

    x₀ ∈ X,   x_{k+1} = Π_X( x_k − γ_k g(x_k, Z_k) ),   k ∈ N ∪ {0},

with independent copies Z_k of Z, step sizes γ_k ≥ 0, and Π_X(x) := argmin_{y∈X} ‖V(x − y)‖². If
1. for all k ∈ N ∪ {0}, the step sizes satisfy

    ∀x ∈ X: J(x) ≤ r²,   0 ≤ 2γ_k ε(x) − 2γ_k² ℓ²(x) ≤ 1,                  (A.23)

where

    ε(x) := (1/J(x)) (x − x⋆)ᵀ Vᵀ V (G(x) − G(x⋆)),
    ℓ²(x) := (1/J(x)) ‖V(G(x) − G(x⋆))‖²;

2. for all x ∈ X, (E‖V(g(x, Z) − G(x))‖⁴)^{1/4} ≤ σ̃(x) for some σ̃: X → R_{≥0}, and

    σ(r) := sup_{x∈X: J(x)≤r²} σ̃(x);                                       (A.24)

then:

1. The iterate x_k is locally confined with high probability:

    P(J(x_k) ≤ r²) ≥ ξ_k² / (ξ_k² + 8σ²(r) ζ_k),

where

    ξ_k(r) := max{ 0, r² − J(x₀) − 2σ²(r) Σ_{j<k} γ_j² },
    ζ_k(r) := r² Σ_{j<k} γ_j² + σ²(r) Σ_{j<k} γ_j⁴.

2. The iterate x_k stays locally confined for all k ∈ N with high probability:

    P( sup_{k∈N} J(x_k) ≤ r² ) ≥ ξ² / (ξ² + 8σ²(r) ζ),
    ξ(r) := lim_{k→∞} ξ_k(r),   ζ(r) := lim_{k→∞} ζ_k(r).

3. If additionally

    inf_{x∈X: J(x)≤r²} ε(x) > 0   and   γ_k = Θ(k^{−ρ}), ρ ∈ (0.5, 1),

then the iterate x_k converges to x⋆ with high probability:

    P( limsup_{k→∞} J(x_k) = 0 ) ≥ P( sup_{k∈N} J(x_k) ≤ r² ).

Proof. To begin, we show that Π_X is non-expansive, i.e.,

    ‖V(Π_X(x) − Π_X(y))‖² ≤ ‖V(x − y)‖².

For all x, y ∈ R^p, define ⟨x, y⟩_V := xᵀ Vᵀ V y. Since V is invertible, Vᵀ V is symmetric and positive definite, and hence (R^p, ⟨·,·⟩_V) forms a Hilbert space. Any projection operator in a Hilbert space is non-expansive (Bauschke and Combettes, 2011, Prop. 4.4).

Since x⋆ = Π_X(x⋆) and the projection is non-expansive, expanding the squared norm yields

    ‖V(x_{k+1} − x⋆)‖² ≤ ‖V(x_k − x⋆)‖² − 2γ_k (x_k − x⋆)ᵀ Vᵀ V g(x_k, Z_k) + γ_k² ‖V g(x_k, Z_k)‖².

Adding and subtracting G(x_k) in the second and third terms, using the elementary bound ‖a + b‖² ≤ 2‖a‖² + 2‖b‖², and defining

    β_k(x) := −2γ_k (x − x⋆)ᵀ Vᵀ V (g(x, Z_k) − G(x))
              + 2γ_k² ‖V(g(x, Z_k) − G(x))‖² − 2γ_k² E[‖V(g(x, Z_k) − G(x))‖²],

with ε(x) and ℓ²(x) as defined in the theorem statement, we have that

    J(x_{k+1}) ≤ J(x_k) (1 − 2γ_k ε(x_k) + 2γ_k² ℓ²(x_k)) + β_k(x_k) + 2γ_k² σ̃²(x_k).

We now define the filtration of σ-algebras

    F_k := σ(x₁, …, x_k, Z₁, …, Z_{k−1}),

and, for r > 0, the stopped process

    Y₀ = J(x₀),   Y_{k+1} = { Y_k          if Y_k > r²,
                              J(x_{k+1})   otherwise.

Note that Y_k is F_k-measurable, and that Y_k "freezes in place" if J(x_k) ever jumps above r²; so for all t² ≤ r²,

    P(J(x_k) > t²) = P(J(x_k) > t², Y_{k−1} > r²) + P(J(x_k) > t², Y_{k−1} ≤ r²)
                   = P(J(x_k) > t², Y_k > r², Y_{k−1} > r²) + P(Y_k > t², Y_{k−1} ≤ r²)
                   ≤ P(Y_k > r², Y_{k−1} > r²) + P(Y_k > t², Y_{k−1} ≤ r²)
                   ≤ P(Y_k > t², Y_{k−1} > r²) + P(Y_k > t², Y_{k−1} ≤ r²)
                   = P(Y_k > t²).

Therefore a tail bound on Y_k provides the same bound on J(x_k). Substituting the stopped process into the original recursion and collecting terms,

    Y_{k+1} ≤ Y_k (1 + 1[Y_k ≤ r²] (−2γ_k ε(x_k) + 2γ_k² ℓ²(x_k))) + 1[Y_k ≤ r²] (β_k(x_k) + 2γ_k² σ̃²(x_k))
            ≤ Y_k (1 + 1[Y_k ≤ r²] (−2γ_k ε(x_k) + 2γ_k² ℓ²(x_k))) + 1[Y_k ≤ r²] β_k(x_k) + 2γ_k² σ²(r).

Using the notation of Lemma A.2.2, set

    α_k = 1 + 1[Y_k ≤ r²] (−2γ_k ε(x_k) + 2γ_k² ℓ²(x_k)),
    β_k = 1[Y_k ≤ r²] β_k(x_k),
    c_k = 2γ_k² σ²(r).

By the fourth-moment assumption, β_k has variance conditioned on F_k bounded above by τ_k², where

    τ_k² = 8γ_k² 1[Y_k ≤ r²] ‖V(x_k − x⋆)‖² σ̃²(x_k) + 8γ_k⁴ 1[Y_k ≤ r²] σ̃⁴(x_k)
         ≤ 8γ_k² r² σ²(r) + 8γ_k⁴ σ⁴(r).

Therefore, using the descent Lemma A.2.2 and the fact that 0 ≤ α_k ≤ 1,

    P( Y_k − Y₀ > t + 2σ²(r) Σ_{j<k} γ_j² )
    ≤ 8σ²(r) ( r² Σ_{j<k} γ_j² + σ²(r) Σ_{j<k} γ_j⁴ ) / ( t² + 8σ²(r) ( r² Σ_{j<k} γ_j² + σ²(r) Σ_{j<k} γ_j⁴ ) ).

Finally, defining

    ξ_k := max{ 0, r² − Y₀ − 2σ²(r) Σ_{j<k} γ_j² },   ζ_k := r² Σ_{j<k} γ_j² + σ²(r) Σ_{j<k} γ_j⁴

yields the first result:

    P(Y_k > r²) ≤ 8σ²(r) ζ_k / (ξ_k² + 8σ²(r) ζ_k).

Now since Y_{k+1} ≤ r² implies Y_k ≤ r² for all k ≥ 0, the sequence of events {Y_k ≤ r²} is decreasing. Therefore the second result follows from

    P( ∩_{k=0}^∞ {Y_k ≤ r²} ) = lim_{k→∞} P(Y_k ≤ r²)
    ≥ lim_{k→∞} 1 − 8σ²(r) ζ_k / (ξ_k² + 8σ²(r) ζ_k)
    = ξ² / (ξ² + 8σ²(r) ζ),

where ξ := lim_{k→∞} ξ_k and ζ := lim_{k→∞} ζ_k (or ∞ if ζ_k diverges). Finally, we analyze the conditional tail distribution of Y_k given that it stays confined, i.e., Y_k ≤ r² for all k ≥ 0. First define

    0 ≤ a_k := sup_{x∈X: J(x)≤r²} 1 − 2γ_k ε(x) + 2γ_k² ℓ²(x) ≤ 1,

i.e., a_k is the largest possible value of α_k when Y_k ≤ r².
So again applying Lemma A.2.2,

    P( Y_k − Y₀ Π_{j=0}^{k−1} a_j > t Π_{j=0}^{k−1} a_j + Σ_{j=0}^{k−1} c_j Π_{m=j+1}^{k−1} a_m | ∀k, Y_k ≤ r² )
    = P( Y_k − Y₀ Π_{j=0}^{k−1} a_j > t Π_{j=0}^{k−1} a_j + Σ_{j=0}^{k−1} c_j Π_{m=j+1}^{k−1} a_m, ∀k Y_k ≤ r² ) / P( ∀k Y_k ≤ r² )
    = P( Y_k − Y₀ Π_{j=0}^{k−1} α_j > t Π_{j=0}^{k−1} α_j + Σ_{j=0}^{k−1} c_j Π_{m=j+1}^{k−1} α_m, ∀k Y_k ≤ r² ) / P( ∀k Y_k ≤ r² )
    ≤ P( Y_k − Y₀ Π_{j=0}^{k−1} α_j > t Π_{j=0}^{k−1} α_j + Σ_{j=0}^{k−1} c_j Π_{m=j+1}^{k−1} α_m ) / ( ξ² / (ξ² + 8σ²(r) ζ) )
    ≤ ( 8σ²(r) ζ_k / (t² + 8σ²(r) ζ_k) ) / ( ξ² / (ξ² + 8σ²(r) ζ) ).

Setting t = s Π_{j=0}^{k−1} a_j⁻¹ for any s ≥ 0, this implies that

    P( Y_k > s + Y₀ Π_{j=0}^{k−1} a_j + Σ_{j=0}^{k−1} c_j Π_{m=j+1}^{k−1} a_m | ∀k Y_k ≤ r² )
    ≤ ( 8σ²(r) ζ_k / ( s² Π_{j<k} a_j⁻² + 8σ²(r) ζ_k ) ) / ( ξ² / (ξ² + 8σ²(r) ζ) ).

Now since γ_k = Θ(k^{−ρ}) with ρ ∈ (0.5, 1) and inf_{x∈X: J(x)≤r²} ε(x) > 0, we know that

    Π_{j<k} a_j⁻² = Θ( exp(C k^{ρ′}) )

for some ρ′, C > 0. Therefore, for any s ≥ 0,

    P( Y_k > s | ∀k Y_k ≤ r² ) = O( exp(−C k^{ρ′}) ).

By the Borel–Cantelli lemma, we have that P( lim_{k→∞} Y_k = 0 | ∀k Y_k ≤ r² ) = 1. Therefore

    P( lim_{k→∞} Y_k = 0 ) ≥ P( lim_{k→∞} Y_k = 0 | ∀k Y_k ≤ r² ) P( ∀k Y_k ≤ r² ) = P( ∀k Y_k ≤ r² ),

and the result follows. ∎

Lemma A.2.2 (Descent). Suppose we are given a filtration F_k ⊆ F_{k+1}, k ≥ 0, and

    Y_{k+1} ≤ α_k Y_k + β_k + c_k,   k ≥ 0,

where Y_k ≥ 0 and 0 ≤ α_k ≤ 1 are F_k-measurable, β_k is F_{k+1}-measurable with mean 0 and variance conditioned on F_k bounded above by τ_k², and τ_k², c_k ≥ 0 are F₀-measurable. Then

    P( Y_k − Y₀ Π_{i=0}^{k−1} α_i ≥ t Π_{i=0}^{k−1} α_i + Σ_{i=0}^{k−1} c_i Π_{j=i+1}^{k−1} α_j )
    ≤ Σ_{i=1}^{k−1} τ_i² / ( t² + Σ_{i=1}^{k−1} τ_i² ).

Proof. Solving the recursion,

    Y_k ≤ α_{k−1} Y_{k−1} + β_{k−1} + c_{k−1}
        ≤ α_{k−1} ( α_{k−2} Y_{k−2} + β_{k−2} + c_{k−2} ) + β_{k−1} + c_{k−1}
        ≤ …
        ≤ Y₀ Π_{i=0}^{k−1} α_i + Σ_{i=0}^{k−1} β_i Π_{j=i+1}^{k−1} α_j + Σ_{i=0}^{k−1} c_i Π_{j=i+1}^{k−1} α_j.

So

    P( Y_k − Y₀ Π_{i=0}^{k−1} α_i − Σ_{i=0}^{k−1} c_i Π_{j=i+1}^{k−1} α_j ≥ t Π_{i=0}^{k−1} α_i )
    ≤ P( Σ_{i=0}^{k−1} β_i Π_{j=i+1}^{k−1} α_j ≥ t Π_{i=0}^{k−1} α_i )
    = P( Σ_{i=0}^{k−1} β_i Π_{j=0}^{i} α_j ≥ t ).

By Cantelli's inequality and the fact that the i-th term of the sum is F_i-measurable,

    P( Σ_{i=0}^{k−1} β_i Π_{j=0}^{i} α_j ≥ t )
    ≤ Σ_{i=1}^{k−1} E[ β_i² Π_{j=0}^{i} α_j² ] / ( t² + Σ_{i=1}^{k−1} E[ β_i² Π_{j=0}^{i} α_j² ] )
    ≤ Σ_{i=1}^{k−1} E[β_i²] / ( t² + Σ_{i=1}^{k−1} E[β_i²] )
    ≤ Σ_{i=1}^{k−1} τ_i² / ( t² + Σ_{i=1}^{k−1} τ_i² ). ∎
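The projected stochastic iteration analyzed in Theorem A.2.1 can be illustrated numerically. The following is a minimal sketch, not code from the thesis: it uses a toy strongly convex problem with G(x) = x − x⋆, Gaussian gradient noise, a diagonal scaling V, a box constraint X (for which the V-weighted projection reduces to coordinate-wise clipping when V is diagonal), and step sizes γ_k = Θ(k^{−ρ}) with ρ ∈ (0.5, 1). All function and variable names here are our own, chosen for illustration.

```python
import numpy as np

def make_projection(lo, hi):
    """Projection onto the box X = [lo, hi]^p.

    For diagonal V, the projection minimizing ||V(x - y)|| over the box
    is separable across coordinates, so it coincides with plain clipping.
    """
    def proj(x):
        return np.clip(x, lo, hi)
    return proj

def run_iteration(x0, x_star, V, n_steps, rho=0.7, noise=0.1, seed=0):
    """Run x_{k+1} = Pi_X(x_k - gamma_k * g(x_k, Z_k)) and track J(x_k)."""
    rng = np.random.default_rng(seed)
    proj = make_projection(-5.0, 5.0)
    x = np.asarray(x0, dtype=float)
    x_star = np.asarray(x_star, dtype=float)

    def J(x):  # J(x) = ||V(x - x_star)||^2
        return float(np.sum((V @ (x - x_star)) ** 2))

    J_path = [J(x)]
    for k in range(1, n_steps + 1):
        gamma_k = 0.5 * k ** (-rho)                 # gamma_k = Theta(k^-rho), rho in (0.5, 1)
        # Unbiased stochastic gradient: G(x) = x - x_star plus Gaussian noise Z_k
        g = (x - x_star) + noise * rng.standard_normal(x.shape)
        x = proj(x - gamma_k * g)
        J_path.append(J(x))
    return x, J_path

V = np.diag([1.0, 2.0])
x_final, J_path = run_iteration([3.0, 3.0], [0.5, -0.25], V, n_steps=5000)
print(J_path[0], J_path[-1])
```

With ρ ∈ (0.5, 1) the step sizes are square-summable but not summable, matching the condition in statement 3. of the theorem; on this convex toy problem J(x_k) decays from its initial value toward zero, mirroring the confinement-then-convergence behavior established above.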