On Amortized Inference in Large-Scale SimulatorsbySaeid NaderipariziB.Sc., Sharif University of Technology, 2017A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES(Computer Science)The University of British Columbia(Vancouver)January 2020c© Saeid Naderiparizi, 2020The following individuals certify that they have read, and recommend to the Faculty of Graduate andPostdoctoral Studies for acceptance, the thesis entitled:On Amortized Inference in Large-Scale Simulatorssubmitted by Saeid Naderiparizi in partial fulfillment of the requirements for the degree of Master ofScience in Computer Science.Examining Committee:Frank Wood, Computer ScienceSupervisorDavid Poole, Computer ScienceAdditional examineriiAbstractMotivated by the problem of amortized inference in large-scale simulators, we introduce a probabilisticprogramming library that brings us closer to this goal. This library enables us to perform Bayesianinference on any simulator written in a wide variety of programming languages, with minimal mod-ification to the simulator’s source code. However, there are challenges in achieving this goal in itsmost general meaning. In particular, we address the obstacles caused by unbounded loops. Existingapproaches to amortized inference in probabilistic programs with unbounded loops can produce esti-mators with infinite variance. An instance of this is importance sampling inference in programs thatexplicitly include rejection sampling as part of the user-programmed generative procedure. We developa new and efficient amortized importance sampling estimator. We prove finite variance of our estimatorand empirically demonstrate our method’s correctness and efficiency compared to existing alternativeson generative programs containing rejection sampling loops and discuss how to implement our methodin a generic probabilistic programming framework.iiiLay SummaryOur goal is to invert simulators automatically. In other words, given a simulator and an instance of itsoutput, we want to know a distribution over the random choices inside the simulator that can lead to thegiven output instance being generated. In this work, we first introduce a framework that inverts existingsimulators by looking at them as probabilistic programs and point out an vital efficiency issue in suchframework: unbounded loops. We propose a method to solve the problem in case of rejection samplingloops, which are common in writing simulators.ivPrefaceChapter 3 is based on a joint project with Atılım Gu¨nes¸ Baydin, Lukas Heinrich, Wahid Bhimji, LeiShao, Andreas Munk, Jialin Liu, Bradley Gram-Hansen, Gilles Louppe, Lawrence Meadows, PhilipTorr, Victor Lee, Prabhat, Kyle Cranmer, Frank Wood, and others. This project led to two publica-tions, “Efficient Probabilistic Inference in the Quest for Physics Beyond the Standard Model”, Baydinet al., NeurIPS 2019 [20] and “Etalumis: Bringing Probabilistic Programming to Scientific Simulatorsat Scale”, Baydin et al., Super Computing conference 2019 [19]. These two projects have been designedand the group was working on them when I joined them. I was responsible for helping with maintain-ing pypyob package introduced in this project and one of the most important parts of these projects,implementing the diagnostic tests and verifying the results.Given the size of these two projects and how long they took, it is not easy to tell a part was doneby one of the collaborators. To mention the most important people for each part, the idea of these twoprojects were mainly introduced by my supervisor, Frank Wood, Kyle Cranmer and Atılım Gu¨nes¸ Bay-din. Providing and maintaining the simulator in this project, sherpa, was done by our Physicist collab-orators, Kyle Cranmer, Lukas Heinrich, Gilles Louppe, and Wahid Bhimji. Performing the experimentswere mostly done by Wahid Bhimji, Atılım Gu¨nes¸ Baydin and Lei Shao. Our collaborators from Intelwere mostly responsible for making the distributed training and the use of high-performance computingpossible. The main responsibility of our collaborators from the University of British Columbia and theUniversity of Oxford, including myself was in the Probabilistic Programming part of the project. I, inparticular, implemented different diagnostics and applied them to the results, contributed to the pyprobpackage and enhanced parts of it, and worked on validation the results of the projects, which led to themain subject of this thesis.Chapters 4 and 5 are based on “Amortized Rejection Sampling in Universal Probabilistic Program-ming”, Naderiparizi et al., published on arXiv in 2019 [71]. I was the primary author of this work.Designing the approach was done through discussions between my supervisor, Frank Wood, AdamS´cibior, Mehrdad Ghadiri, Andreas Munk and myself. Most of the implementation was done by mewith advise from Atılım Gu¨nes¸ Baydin. The result plots were created by Mehrdad Ghadiri and thepaper writing and editing was mostly done by Adam S´cibior, Andres Munk, Frank Wood and myself.Portions of the introductory text of this thesis is from the three aforementioned papers.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1 Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.2 Probabilistic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.3 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.4 Sequential Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.5 Inference Compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.6 Rejection Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103.1 Particle Physics and Probabilistic Inference . . . . . . . . . . . . . . . . . . . . . . . 103.2 Data Analysis in Particle Physics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.3 Probabilistic Inference in Large-Scale Simulators . . . . . . . . . . . . . . . . . . . . 123.3.1 Designing a PPL for Existing Large-Scale Simulators . . . . . . . . . . . . . . 123.3.2 A Cross-Platform Probabilistic Execution Protocol . . . . . . . . . . . . . . . 14vi3.3.3 Controlling SHERPA’s Simulation of Fundamental Particle Physics . . . . . . 154 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255.2 Gaussian Unknown Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265.3 Gaussian Mixture Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31A Supporting Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39A.1 Importance weight variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39A.2 Variable addresses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41viiList of TablesTable A.1 Examples of addresses in the τ lepton decay problem in SHERPA (C++). Onlythe first 6 addresses are shown out of a total of 24,382 addresses encountered over1,602,880 executions to collect statistics. . . . . . . . . . . . . . . . . . . . . . . . 41viiiList of FiguresFigure 2.1 Gaussian Mixture Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6Figure 3.1 Top: branching ratios of the τ lepton, effectively the prior distribution of the de-cay channels in SHERPA. Note that the scale is logarithmic. Bottom: Feynmandiagrams for τ decays illustrating that these can produce multiple detected particles. 11Figure 3.2 Top left: overall framework where the PPS is controlling the simulator. Bottom left:probabilistic execution of a single trace. Right: LSTM proposals conditioned on anobservation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13Figure 3.3 Interpretability of the latent structure of the τ lepton decay process, automaticallyextracted from SHERPA executions via the probabilistic execution protocol. Show-ing model structure with increasing detail by taking an increasing number of mostcommon trace types into account. Node labels denote address IDs (A1, A2, etc.)that correspond to uniquely identifiable parts of model execution such as those inTable A.1. Addresses A1, A2, A3 correspond to momenta px, py, pz, and A6 cor-responds to the decay channel. Edge labels denote the frequency an edge is taken,normalized per source node. Red: controlled; green: rejection sampling; blue: ob-served; yellow: uncontrolled. Note: the addresses in these graphs are “aggregated”,meaning that we collapse all instances it of addresses (at , it) into the same node inthe graph, i.e., representing loops in the execution as cycles in the graph, in order tosimplify the presentation. This gives us ≤60 aggregated addresses representing thetransitions between a total of approximately 24k addresses (at , it) in the simulator. 17Figure 4.1 (a) illustrates the problem we are addressing. Existing approaches to inference com-pilation use trained proposals for the importance sampler shown in (b), where w canhave infinite variance, even when each wk individually has finite variance, as k isunbounded. There exists a simplified program (c) equivalent to (a) and ideally wewould like to perform inference using the importance sampler in (d). While thisis not directly possible, since we do not have access to the conditional densitiesrequired, our method approximates this algorithm, guaranteeing finite variance of w. 18Figure 4.2 Dependency graph for a trace from Program 4.1a. Here, Ai is the event of acceptinga sampled z i.e. c(x,zi) holds and similarly, Ai is defined for rejection. . . . . . . . 20ixFigure 4.3 An illustration of annotations required by our system. To apply our method weonly require that entry and exit to each rejection sampling loop be annotated withrs start and rs end respectively. Our implementation then automatically han-dles the whole inference process, even if the annotated loops are nested. . . . . . . 23Figure 5.1 Inference results for the Gaussian unknown mean program. We estimate Ep(µ|y) [µ]which in this instance has an analytic ground truth value. The plots show an aggre-gation of 100 different runs of the experiment. (top) Estimation error of 3 differentmethods (IC, Biased, and (ours) ARS with different values of M). Our methodreaches estimation error approaching zero faster and with significantly less variancethan the alternatives. Larger values of M lead to lower variance and faster conver-gence. (bottom) Effective Sample Size (ESS) as a function of number of IS pro-posals for each estimator. The multiplicative inclusion of importance weights in theexisting IC approach significantly adversely affects ESS compared to our methodregardless of the chosen value of M. . . . . . . . . . . . . . . . . . . . . . . . . . 27Figure 5.2 Results of GMM experiment. We target estimating Ep(µ|y) [µ] which can be com-puted analytically in this model. The plots show aggregation of 100 different runsof the experiment. (a) With a fixed proposal (b) With perfect proposal for the basedistribution (c) With perfect proposal for the base distribution and a proposal for ac-ceptance distribution that leads to less rejection. Our method with any M achieveszero estimation error with lower variance compared to IC. As expected, larger Mleads to faster convergence, lower variance of the weights and higher ESS. In thisexample, IC does not converge to zero estimation error. . . . . . . . . . . . . . . . 29xGlossaryESS Effective Sample SizeHMC Hamiltonian Monte CarloIC inference compilationIS importance samplingKL Kullback–LeiblerLMH lightweight Metropolis–HastingsMCMC Markov chain Monte CarloNN neural networkPPL Probabilistic Programming LanguageRMH random-walk Metropolis–HastingsSIS sequential importance samplingSMC sequential Monte CarloSNIS self-normalized importance samplingxiAcknowledgmentsI would like to express my sincerest gratitude to my supervisor, Frank Wood, for his support and men-torship throughout my master’s research. Your encouragement, enthusiasm, and technical advice havebeen a great source of inspiration for me. I am also very grateful to my second examiner, David Poole,who took the time to read this thesis.I like to extend my thanks to Atılım Gu¨nes¸ Baydin, Mehrdad Ghadiri, Andreas Munk and AdamS´cibior for their technical input and discussions throughout my projects. I would also like to thankAndrew Warrington and Robert Zinkov for various discussions we had. Thanks must also extend to myother collaborators and colleagues in the lab, Boyan, Christian, Michael, Vaden, and William. It hasbeen a pleasure to meet and learn alongside all of you.I would also like to thank my family for their support throughout my life and the opportunities theyhave provided for me. I especially want to thank my brother, Sobhan, who has always had a directimpact on my professional life.Last but not least, I want to thank my wife, Setareh, who has always been there for me, encouragedme and supported me.xiiDedicationTo my beloved wife, Setareh, my parents, Robabeh and Mehdi and to the memory of my amazingfriends, Arash and Pouneh.xiiiChapter 1IntroductionComplex stochastic simulators are widely used to express causal generative models of data across thescientific community. They are often designed by domain experts and incorporate domain expertise.Simulators have applications as diverse as hazard analysis in seismology [51], supernova shock wavesin astrophysics [37], market movements in economics [79], blood flow in biology [77], and coupledreactions in chemistry [25, 81, 89].A critical question is how to “invert” simulators, i.e., to recover the internal state of a simulator, givenan instance of its output. This problem is essential because the internal state of a simulator encompassesall information about the event it simulates. For example, internal state of a CAPTCHA simulatorrepresents the content of the CAPTCHA and all different types of noise applied on it [62, 63, 66], orinternal state of a particle physics experiment simulator represents all the collisions, decay types anddetailed properties of each resulting particle [19, 20].More formally, our goal is to perform Bayesian inference on simulators, targeting the posteriordistribution over its latent variables (which is the set of all the random samples in the simulator), givenits output. Everything that is sampled in the simulator is a latent variable; in other words, all thestochastic parts of a model are for sampling latent variables. Therefore, if latent variables in a simulatorare fixed, its output is deterministic and as a result, inferring latent variables gives all the informationencoded in the simulator about an observation.The simulators are typically non-differentiable and lead to intractable likelihoods, which rendersmany traditional statistical inference algorithms irrelevant and motivates a new class of so-called likelihood-free inference algorithms [50]. A class of probabilistic inference, known as likelihood-free inference,concerns simulators that do not define an evaluable likelihood. There are two broad strategies forlikelihood-free inference problem. In the first, one uses a simulator indirectly to train a surrogate modelendowed with a likelihood that can be used in traditional inference algorithms, for example, approachesbased on conditional density estimation [56, 75, 82, 94] and density ratio estimation [29, 36]. Alter-natively, approximate Bayesian computation (ABC) [92, 96] refers to a large class of approaches forsampling from the posterior distribution of these likelihood-free models, where the original simulatoris used directly as part of the inference engine. While variational inference [22] algorithms are oftenused when the posterior is intractable, they are not directly applicable when the likelihood of the data1generating process is unknown [93].The class of inference strategies that directly use a simulator avoids the necessity of approximatingthe generative model. Moreover, using a domain-specific simulator offers a natural pathway for infer-ence algorithms to provide interpretable posterior samples. In this work, we take this approach, extendprevious work in universal probabilistic programming [46, 95] and inference compilation [62, 64] tolarge-scale complex simulators. In a nutshell, we approach existing simulators as probabilistic pro-grams written in a “universal” probabilistic programming language and perform inference on them.While the generality of universal probabilistic programming inference methods allows them to beapplicable to any stochastic simulator, this flexibility introduces complexities, some of surprising andsubtle character. For instance, nothing prohibits simulators that have rejection sampling loops specifyingall or part of them. While existing inference approaches may asymptotically produce correct inferenceresults for such simulators, the reality, which we discuss at length in this thesis, is murkier.The specific problem we address, that of efficient amortized importance-sampling-based inferencein models with user-defined rejection sampling loops, is more prevalent than it might seem on firstconsideration. Our experience suggests that rejection sampling within the generative model specifica-tion is actually the rule rather than the exception when programmers use universal languages for modelspecification. To generate a single draw from anything more complex than standard distribution effec-tively requires either adding a new probabilistic primitive to the language (beyond most users), hardconditioning on constraint satisfaction (inefficient under most forms of universal PPL inference), or auser-programmed rejection loop. A quick example of this is sampling from a truncated normal. If themodel specification language does not have a truncated normal primitive, then the most natural way togenerate such a variate is via user-programmed rejection. More sophisticated examples abound in sim-ulators used in the physical sciences [19, 20], chemistry [25, 81, 89], and other domains [91]. Note thatthe issue we address here is not related to hard rejection via conditioning which is addressed in [83] andrelated work. Ours is specifically about rejection sampling loops within the generative model program,whereas the latter is about developing inference engines that are reasonably efficient even when theuser-specified program has a constraint-like observation that produces an extremely peaked posterior.The first inference algorithms for languages that allowed generative models containing rejectionsampling loops to be written revolved around Markov chain Monte Carlo (MCMC) [45, 98] and sequentialimportance sampling (SIS) [99] using the prior as the proposal distribution and then mean-field varia-tional inference [97]. Those methods were very inefficient, prompting extensions of PPLs providing pro-grammable inference capabilities [66, 87]. Efforts to speed inference since then have revolved aroundamortized inference [41], where a slow initial off-line computation is traded against fast and accuratetest-time inference. Such methods work by training neural networks that quickly map a dataset either toa variational posterior [84] or to a sequence of proposal distributions for SIS [62]. This thesis examinesand builds on the latter inference compilation (IC) approach.Unbounded loops potentially require integrating over infinitely many latent variables. With IC eachof these variables has its own importance weight, and the product of all the weights can have infinitevariance, resulting in a divergent importance sampler. Furthermore, the associated self-normalizing2importance sampler can converge to an arbitrary point, giving wrong inference results without warning.It is, therefore, necessary to take extra steps to ensure consistency of the importance sampler resultingfrom IC when unbounded loops are present.In this thesis, we first introduce a framework for incorporating existing simulators in a ProbabilisticProgramming Language (PPL) system and performing inference on them. We then present the maincontribution of this thesis, a solution to the problem of unbounded loops for the common case of re-jection sampling loops. We establish, both theoretically and empirically, that computing importanceweights naively in this situation can lead to arbitrarily bad posterior estimates. To remedy this problem,we develop a novel estimator and prove its unbiasedness and finite variance.3Chapter 2Background2.1 Bayesian InferenceBayesian inference [32, 40] provides a framework for expressing beliefs about the parameters of amodel and, given an observation, updating them based on Bayes rule. More formally, let each state ofthe model be represented by a x ∈X, and each observation be a y ∈Y. If p(x) encodes our beliefs aboutthe world (called prior distribution) and p(y|x) provides the likelihood of y being observed from a modelwith parameter x, the updated belief (called posterior distribution) is defined by Bayes rule as followsp(x|y) = p(y|x)p(x)p(y)(2.1)An important feature of Bayes rule is that it can be reused in a similar way where the prior for the newobservation is the posterior achieved after last observation i.e.p(x|y,y′) = p(y′|x,y)p(x|y)p(y′|y) =p(y′|x,y)p(y|x)p(x)p(y′|y)p(y) (2.2)2.2 Probabilistic ProgrammingProbabilistic Programming Languages (PPLs) [27, 45, 46, 65, 69, 70, 78, 95, 99] allow for efficient in-ference in generative models defined as computer programs using a variety of built-in inference engines.PPLs can be divided into two categories; restricted PPLs such as Infer.NET [70], and STAN [27] andunrestricted PPLs such as Church [45], Venture [65] and Anglican [99]. Unrestricted PPLs facilitate“inversion” (in the sense of inference) in generative models defined in terms of higher-order functions,control flow, and recursion. Example applications include the recent success of performing inference inparticle physics simulators [19, 20] containing thousands of latent variables.We adopt the same notation as [62], identify the execution of a program by it’s “trace” which isa sequence of random variables (xt ,at , it), where t = 1, . . . ,T , with T being the trace length (whichmay vary between executions), xt denotes the tth sampled value, at is the associated address and it4represents the instance: the number of times the same address has been encountered previously, i.e.,it = ∑tj=i1(at = a j). Requiring a fixed number N of observations y = (y1, . . . ,yN) in each trace anddefining the set of latent variables as x = (x1, . . . ,xT ), we write the joint distribution asp(x,y) =T∏t=1fat (xt |x1:t−1)N∏n=1gn(yn|x1:τ(n)). (2.3)Here fat (·|x1:t−1) is the probability distribution at address at conditioned on all preceding values x1:t−1,and gn(·|x1:τ(n)) is the likelihood density given the sampled values preceding the observation yn, andτ(n) is a mapping from the nth observation to the index of the last latent variable sampled before theobservation yn.Once a joint distribution p(x,y) is expressed as a probabilistic program, we can perform Bayesianinference to get posterior distributions p(x|y) of latent variables x given some observation y. Inferencein PPLs is achieved by running the probabilistic model and using one of a selection of general-purposeinference engines. The inference engines typically found in higher-order probabilistic programs [95]include SIS [10, 34], lightweight Metropolis–Hastings [98], sequential Monte Carlo (SMC) [31], andHamiltonian Monte Carlo (HMC) [35].Inference engines of the MCMC family, designed to perform in the space of probabilistic execu-tion traces, constitute the gold standard for obtaining samples from the true posterior of a probabilisticprogram [61, 95, 98].Given a current sequence of latents x in the trace space, these work by making proposals x′ accordingto a proposal distribution q(x′|x) and deciding whether to move from x to x′ based on the Metropolis–Hasting acceptance ratio of the formα = min{1,p(x′)q(x|x′)p(x)q(x′|x))}. (2.4)However, MCMC methods have several issues; most importantly, given each new observation y, anew Markov chain should be run from scratch, which makes the method undesired for repeated and fastinference. Furthermore, because MCMC methods rely on running a Markov chain the samples shouldbe drawn sequentially and are not admissible to parallelization .Figure 2.1 shows a Gaussian mixture model implemented in a universal probabilistic program. Eachfat and a gn in equation 2.3 respectively corresponds to a sample and observe statement in thisprogram. Therefore, a sample from p(x,y) can be drawn by running the program and sampling fromthe given distributions in any sample or observe statement that is executed. To evaluate p(x,y)for a given value of x and y, we run the program but, once hit a sample statement at an address at ,we assume the value sampled at this point is xt and continue executing the program. On the side, weaccumulate the probability of all the sample and observe statements at the given values for xat andyn (which corresponds to fat (xt |x1:t−1,yγ(t)) and gn(yn|x1:τ(n),y1:n−1)) in x or y, which makes evaluationof p(x,y) possible. It is worth mentioning that all the stochasticity of a probabilistic program comesfrom sample and observe statements and every other part of the program is deterministic.5def GMM(y obs):K = sample(Poisson(3)) + 1mixtures = []for k in range(K):mu k = sample(Normal(0, 1))sigma k = sample(Uniform(1, 5))mixtures.append((mu k , sigma k))z list = []pi = Categorical([1/K for in range(K)])for n in range(len(y obs)):z list.append(sample(pi))for n in range(len(y obs)):mu, sigma = mixtures[z list[n]]observe(Normal(mu, sigma), y obs[n])return mixturesFigure 2.1: Gaussian Mixture Model2.3 Importance SamplingImportance sampling (IS) is a method for comuting integrals of the form Ih = Ep(x) [h(x)] when wecannot directly sample from p but we can evaluate it. If another distribution q (known as proposaldistribution) exists such that q(x)> 0 whenever h(x)p(x) 6= 0, we can equivalently writeIh = Eq(x)[p(x)q(x)h(x)](2.5)Therefore, having independent samples x1,x2, . . . ,xK from q, the following Monte Carlo estimator isconsistent and unbiased:Ih ≈ IISh =1KK∑k=1p(xk)q(xk)h(xk) (2.6)IS is also useful for variance reduction when h(x) is non-zero only on A⊂ X where X is the supportof p and P(X ∈ A) is small [73]. The set A may have small volume, or it may be in the tail of p. In thissituation, a plain Monte Carlo estimator for Ih, sampling from p, could fail to even have a single samplein the region A. We can get around this problem by defining a q that has large amount of mass on A andusing IS. This type of problem is common in Bayesian inference when the observation is unlikely.A variant of IS, called self-normalized importance sampling (SNIS) [73] exists for the cases wherep can be evaluated only up to a normalizing constant i.e. we can only evaluate pu(·) = Zp(·) for anunknown normalizing constant Z > 0. If q is a distribution that we can both evaluate and sample from,also, if q(x)> 0 whenever p(x)> 0.Ih =Eq(x)[pu(x)q(x) h(x)]Eq(x)[pu(x)q(x)] (2.7)6If we draw K independent samples x1,x2, . . . ,xK from q, we can estimate Ih in the following way whichis consistent, but biased.Ih ≈ ISNISh =∑Kk=1pu(xk)q(xk) h(xk)∑Kk=1pu(xk)q(xk)(2.8)Because, once the samples {xk}Kk=1 are fixed, the denominator in Equation is a constant, this equationcan equivalently be written as a weighted sum of h evaluated on sampled points, as shown below:ISNISh =K∑k=1Wkh(xk)Wk =wk∑Kj=1 w jwk =pu(xk)q(xk)(2.9)In all variants of IS, one way of measuring quality of the estimator is Effective Sample Size (ESS)which is defined on a set of weighted samples x1,x2, . . . ,xK with the weights w1,w2, . . . ,wK as thefollowingESS =(∑Kk=1 wk)2∑Kk=1(wk)2(2.10)2.4 Sequential Importance SamplingSequential importance sampling [11, 34] is one of the methods used for performing inference overexecution traces of a probabilistic program in universal probabilistic programming languages [99]. Inthis method, the expectations Ih = Ep(x|y) [h(x)] are computed similar to SIS,Ih ≈ ISISh = ∑Kk=1 wkh(xk)∑Kk=1 wk(2.11)Where the samples xk are drawn from a proposal distribution q(x) and the importance weights wk arecalculated for each xk according to wk = p(xk,y)/q(xk). The difference with SNIS is in the fact that inSIS it is assumed that both the target distribution (p(x|y) in this case) and proposal distribution (q(x) inthis case) are assumed to have factorization in a number of sequential lower-dimensional conditionals.In SIS for probabilistic programs, q(x) is factorized to a set of qa,i for each address a and instancecounter i corresponding to each sample statement in the program. Sampling a proposal trace is doneby running the program as usual, except, when encountered a sample statement at address at at time twith instance counter it , the value xt is drawn from qa,i(·|x1:t−1). Therefore, the weights wk are computedas follows:wk =T∏t=1fat (xkt |xk1:t−1,yγ(t))qat ,it (xkt |xk1:t−1)×N∏n=1gn(yn|xk1:τ(n),y1:n−1). (2.12)7If the proposal q(x) is close to the target p(x|y), we get better sample efficiency, lower variance inthe computed weights and, higher ESS. Crucially, consistency of the SIS estimator is only guaranteedwhen q(x) is absolutely continuous with respect to p(x|y) and the variance of the weights is finite. Whenthose conditions are not satisfied, SIS may converge to the wrong answer or fail to converge at all.2.5 Inference CompilationInference compilation [62] is an amortized inference [41] engine for probabilistic programs, which pro-vides efficient proposals for SIS. IC involves an upfront amortization phase in which it learns proposaldistributions q(x|y;φ), where, φ is the parameters of a deep recurrent neural network. Once the propos-als are learnt, repeated and fast inference can be done by SIS using efficient proposals provided by ICnetwork.In order for the proposals q(x|y;φ) to be suitable for SIS, they are trained to be close to the true pos-terior p(x|y) ∝ p(x,y) defined by the generative model. The measure of closeness in IC is the expectedKullback–Leibler (KL) divergence KL( p(x|y)||q(x|y;φ)). Inference is amortized by minimizing theexpected KL divergence under the marginal p(y),L (φ) = Ep(y) [KL( p(x|y)||q(x|y;φ))]= Ep(x,y)[logp(x|y)q(x|y;φ)]= Ep(x,y) [− logq(x|y;φ)]+ const. (2.13)The proposal q(x|y;φ) is trained using an unbiased estimator of the gradient of equation 2.13,∇φL (φ) = Ep(x,y)[−∇φ logq(x|y;φ)]≈− 1K K∑k=1∇φ logq(xk|yk;φ),where (xk,yk)∼ q(x|y;φ) for k = 1, . . . ,K.At inference time, rather than using q(x|y;φ) as a direct replacement for p(x|y), to adjust for thediscrepancy between them, the expectation under the posterior p(x|y) of a function h(x) is estimatedusing SIS where q(x|y;φ) is the proposal distribution.2.6 Rejection SamplingMany commonly employed stochastic simulators contain rejection sampling loops. A rejection sam-pling loop can be an instance of the well-known rejection sampling algorithm [72], but more generallyit involves repeatedly sampling one or more variables until they satisfy certain conditions, as shown inAlgorithm 1 where c is the a function that check the condition. The crucial property of rejection sam-pling loops is that they are stateless, which means that rejected samples do not have any bearing on thesubsequent iterations or the remainder of the program.In standard IC a separate proposal distribution is learned for each address in the program, so everyiteration of the rejection sampling loop would use a different proposal, trained to match the samples seenin that loop iteration during training. This effectively encourages the proposals to learn to approximate8the distribution of rejected samples in early iterations, which leads to proposed samples rarely beingaccepted, making the whole inference process inefficient.This problem has been addresses before, in [20], where they train the network by fixing the sameproposal across all the iterations of the loop and training it only to approximate the accepted samples,disregarding the rejected ones. While this procedure generates good proposal distributions, at inferenceit is necessary to include an importance sampling weight for all the rejected samples, which can lead toproblems with high weight variance.Algorithm 1 Rejection sampling algorithm1: for k ∈ N+ do2: xk ∼ p(x)3: if c(xk) then4: x = xk5: break9Chapter 3MotivationWe motivate the main contribution in this paper by another work of ours where we introduce probabilis-tic programming as a novel tool to determine the properties of particles at the Large Hadron Collider(LHC) [1, 28] at CERN [20]. This is achieved by coupling our framework with SHERPA1 [44], a state-of-the-art Monte Carlo event generator of high-energy reactions of particles, which is commonly usedwith Geant42 [4], a toolkit for the simulation of the passage of the resulting particles through detec-tors. In particular, we perform inference on the details of the decay of a τ (tau) lepton measured by anLHC-like detector by controlling the SHERPA simulation (with minimal modifications to the standardsoftware), extract posterior distributions, and compare to ground truth. To our knowledge this is the firsttime that universal probabilistic programming has been applied in this domain and at this scale, control-ling a code base of nearly one million lines of code. Our approach is scalable to more complex eventsand full detector simulators, paving the way to its use in the discovery of new fundamental physics.3.1 Particle Physics and Probabilistic InferenceOur work is motivated by applications in particle physics, which studies elementary particles and theirinteractions using high-energy collisions created in particle accelerators such as the LHC at CERN.In this setting, collision events happen many millions of times per second, creating cascading particledecays recorded by complex detectors instrumented with millions of electronics channels. These ex-periments then seek to filter the vast volume of (petabyte-scale) resulting data to make discoveries thatshape our understanding of fundamental physics.The complexity of the underlying physics and of the detectors have, until now, prevented the com-munity from employing likelihood-free inference techniques for individual collision events. However,they have developed sophisticated simulator packages such as SHERPA [44], Geant4 [4], Pythia8 [88],Herwig++ [16], and MadGraph5 [6] to model physical processes and the interactions of particles withdetectors. This is interesting from a probabilistic programming point of view, because these simulatorsare essentially very accurate generative models implementing the Standard Model of particle physics1Simulation of High-Energy Reactions of Particles. https://sherpa.hepforge.org/2Geometry and Tracking. https://geant4.web.cern.ch/10and the passage of particles through matter (i.e., particle detectors). These simulators are coded inTuring-complete general-purpose programming languages, and performing inference in such a settingrequires using inference techniques developed for universal probabilistic programming that cannot behandled via more traditional inference approaches that apply to, for example, finite probabilistic graph-ical models [57]. Thus we focus on creating an infrastructure for the interpretation of existing simulatorpackages as probabilistic programs, which lays the groundwork for running inference in scientifically-accurate probabilistic models using general-purpose inference algorithms.The specific physics setting we focus on in this paper is the decay of a τ lepton particle insidean LHC-like detector. This is a real use case in particle physics currently under active study by LHCphysicists [2] and it is also of interest due to its importance to establishing the properties of the recentlydiscovered Higgs boson [1, 28] through its decay to τ particles [12, 33, 48, 49]. Once produced, the τdecays to further particles according to certain decay channels. The prior probabilities of these decaysor “branching ratios” are shown in Figure 3.1.10−410−310−210−1100101102103104105106BranchingRatioντe− ν¯eντµ− ν¯µντpi−ντK−ντpi− pi0ντK− KSντK− KLντpi− KSντpi− KLντK− pi0ντpi− pi0 pi0ντpi+ pi− pi−ντpi− KSpi0ντpi− KLpi0ντK− pi+ pi−ντpi0 pi0 K−ντK− K+ pi−ντK− KSpi0ντK− KLpi0ντpi− KSKLντpi− KSKSντpi− KLKLντK− K− K+ντpi+ pi− pi− pi0ντpi− pi0 pi0 pi0ντK− pi0 pi0 pi0ντKSpi− pi0 pi0ντKLpi− pi0 pi0ντK− K+ pi− pi0ντpi− ηντK− ηντpi− ηηντηpi− pi0ντpi− pi− pi+ pi0 pi0ντpi− pi0 pi0 pi0 pi0ντK− pi0 pi0 pi0 pi0ντpi− pi− pi+ pi0 pi0 pi0ντpi− pi− pi− pi+ pi+ pi0τ−ντe/µ−νe/µW−τ−ντudddpi0γγW−pi−Figure 3.1: Top: branching ratios of the τ lepton, effectively the prior distribution of the decaychannels in SHERPA. Note that the scale is logarithmic. Bottom: Feynman diagrams for τdecays illustrating that these can produce multiple detected particles.3.2 Data Analysis in Particle PhysicsInference for an individual collision event in particle physics is often referred to as reconstruction [60].Reconstruction algorithms can be seen as a form of structured prediction: from the raw event data theyproduce a list of candidate particles together with their types and point-estimates for their momenta.The variance of these estimators is characterized by comparison to the ground truth values of the latentvariables from simulated events. Bayesian inference on the latent state of an individual collision is rare11in particle physics, given the complexity of the latent structure of the generative model. Until now,inference for the latent structure of an individual event has only been possible by accepting a drasticsimplification of the high-fidelity simulators [3, 5, 7–9, 15, 23, 26, 38, 39, 47, 58, 67, 68, 85, 90]. Incontrast, inference for the fundamental parameters is based on hierarchical models and probed at thepopulation level. Recently, machine learning techniques have been employed to learn surrogates for theimplicit densities defined by the simulators as a strategy for likelihood-free inference [24].Currently particle physics simulators are run in forward mode to produce substantial datasets thatoften exceed the size of datasets from actual collisions within the experiments. These are then reducedto considerably lower dimensional datasets of a handful of variables using physics domain knowledge,which can then be directly compared to collision data. Machine learning and statistical approachesfor classification of particle types or regression of particle properties can be trained on these large pre-generated datasets produced by the high-fidelity simulators developed over many decades [13, 55].The field is increasingly employing deep learning techniques allowing these algorithms to process high-dimensional, low-level data [14, 17, 30, 54, 80]. However, these approaches do not estimate the posteriorof the full latent state nor provide the level of interpretability our probabilistic inference frameworkenables by directly tying inference results to the latent process encoded by the simulator.3.3 Probabilistic Inference in Large-Scale SimulatorsIn this section we describe the main components of our probabilistic inference framework, including:(1) a novel PyTorch-based [76] PPL and associated inference engines in Python, (2) a probabilisticprogramming execution protocol that defines a cross-platform interface for connecting models and in-ference engines implemented in different languages and executed in separate processes, (3) a lighweightC++ front end allowing execution of models written in C++ under the control of our PPL.3.3.1 Designing a PPL for Existing Large-Scale SimulatorsA shortcoming of the state-of-the-art PPLs is that they are not designed to directly support existing codebases, requiring one to implement any model from scratch in each specific PPL. This limitation rules outtheir applicability to a very large body of existing models implemented as domain-specific simulatorsin many fields across academia and industry. A PPL, by definition, is a programming language withadditional constructs for sampling random values from probability distributions and conditioning val-ues of random variables via observations [46, 95]. Domain-specific simulators in particle physics andother fields are commonly stochastic in nature, thus they satisfy the behavior random sampling, albeitgenerally from simplistic distributions such as the continuous uniform. By interfacing with these simu-lators at the level of random number sampling (via capturing calls to the random number generator) andintroducing a construct for conditioning, we can execute existing stochastic simulators as probabilisticprograms. Our work introduces the necessary framework to do so, and makes these simulators, whichcommonly represent the most accurate models and understanding in their corresponding fields, subjectto Bayesian inference using general-purpose inference engines. In this setting, a simulator is no longera black box, as all predictions are directly tied into the fully-interpretable structured model implemented12SimulatorProbabilistic programming systemSHERPA (C++)PPL + (Python)......ExecuterequestExecutereplySamplerequestSamplereplyObserverequestObservereplySimulator executionProbprog trace recording and controlSTART A1 A2 A3 ENDSTART A1 A2 A3 ENDSamplerequestSamplereplyx1A2xN-1AN0A1LSTMq(x1|y) q(x2|x1,y) q(xN|x1:N-1,y)………3D-CNNyobsysimINFERENCEp(x1) p(x2|x1) p(xN|x1:N-1)…x1 x2 xNSampled random variables (trace)Prior distributionsor likelihoodsProposaldistributionsProposallayersLSTM inputs:- Observ. embed.- Address embed.- Sample embed.SIMULATIONAddresses A1 A2 AN……p(y|x1:N)Simulation outputObserved outputFigure 3.2: Top left: overall framework where the PPS is controlling the simulator. Bottom left:probabilistic execution of a single trace. Right: LSTM proposals conditioned on an observa-tion.by the simulator code base.To realize our framework, we implement a universal PPL called pyprob,3 specifically designed to ex-ecute models written not only in Python but also in other languages. Our PPL currently has two familiesof inference engines:4 (1) MCMC of lightweight Metropolis–Hastings (LMH) [98] and random-walkMetropolis–Hastings (RMH) [61] varieties, and (2) SIS [11, 34] with its regular (i.e., sampling from theprior) and IC [62] varieties. The IC technique, where a recurrent neural network is trained in order toprovide amortized inference to guide (control) a probabilistic program conditioning on observed inputs(Figure 3.2, right), forms our main inference method for performing efficient inference in large-scalesimulators. Because IC training and inference uses dynamic reconfiguration of NN modules [62], webase our PPL on PyTorch [76], whose automatic differentiation feature with support for dynamic com-putation graphs [18] has been crucial in our implementation. The LMH and RMH engines we implementare specialized for sampling in the space of execution traces of probabilistic programs, and provide away of sampling from the true posterior and therefore provide a baseline—at a high computational cost.In IC training, we may designate a subset of all addresses (at , it) to be “controlled” (learned) bythe NN, leaving all remaining addresses to use the prior fat as proposal during inference. Expressedin simple terms, taking an observation y (an observed event that we would like to recreate or explainwith the simulator) as input, the NN learns to control the random number draws of latents x during thesimulator’s execution in such a way that makes the observed outcome likely (Figure 3.2, right).3https://github.com/pyprob/pyprob4The selection of these families was motivated by working with existing simulators through an execution protocol (Sec-tion 3.3.2) precluding the use of gradient-based inference engines.13The neural network (NN) architecture in IC is based on a stacked LSTM [53] recurrent core thatgets executed for as many time steps as the probabilistic trace length. The input to this LSTM in eachtime step is a concatenation of embeddings of the observation f obs(y), the current address f addr(at , it),and the previously sampled value f smpat−1,it−1(xt−1). fobs is a NN specific to the domain (such as a 3Dconvolutional NN for volumetric inputs), f smp are feed-forward modules, and f addr are learned addressembeddings optimized via backpropagation for each (at , it) pair encountered in the program execution.The addressing scheme at is the main link between semantic locations in the probabilistic program [98]and the inputs to the NN. The address of each sample or observe statement is supplied over theexecution protocol (Section 3.3.2) at runtime by the process hosting and executing the model. The jointproposal distribution of the NN q(x|y) is factorized into proposals in each time step qat ,it , whose typedepends on the type of the prior fat . Sherpa simulator uses categorical and continuous uniform priors,for which IC uses, respectively, categorical and mixture of truncated Gaussian distributions as proposalsparameterized by the NN. The creation of IC NNs is automatic, i.e., an open-ended number of NNmodules are generated by the PPL on-the-fly when a simulator address at is encountered for the firsttime during training [62]. These modules are reused (either for inference or undergoing further training)when the same address is encountered in the lifetime of the same trained NN.A common challenge for inference in real-world scientific models, such as those in particle physics,is the presence of large dynamic ranges of prior probabilities for various outcomes. For instance, someparticle decays are ∼104 times more probable than others (Figure 3.1, appendix), and the prior distri-bution for a particle momentum can be steeply falling. Therefore some cases may be much more likelyto be seen by the NN during training relative to others. For this reason, the proposal parameters andthe quality of the inference would vary significantly according to the frequency of the observations inthe prior. To address this issue, we apply a “prior inflation” scheme to automatically adjust the measureof the prior distribution during training to generate more instances of these unlikely outcomes. Thisapplies only to the training data generation for the IC NN, and the unmodified original model prior isused during inference, ensuring that the importance weights (Equation 2.12) and therefore the empiricalposterior are correct under the original model.3.3.2 A Cross-Platform Probabilistic Execution ProtocolTo couple our PPL and inference engines with simulators in a language-agnostic way, we introduce aprobabilistic programming execution protocol (PPX)5 that defines a schema for the execution of prob-abilistic programs. The protocol covers language-agnostic definitions of common probability distribu-tions and message pairs covering the call and return values of (1) program entry points (2) samplestatements, and (3) sample statements (Figure 3.2, left). The implementation is based on flatbuffers,6which is an efficient cross-platform serialization library through which we compile the protocol into theofficially supported languages C++, C#, Go, Java, JavaScript, PHP, Python, and TypeScript, enablingvery lightweight PPL front ends in these languages—in the sense of requiring only an implementation to5https://github.com/pyprob/ppx6http://google.github.io/flatbuffers/14call sample and observe statements over the protocol. We exchange these flatbuffers-encoded mes-sages over ZeroMQ7 [52] sockets, which allow seamless communication between separate processes inthe same machine (using inter-process sockets) or across a network (using TCP).Connecting any stochastic simulator in a supported language involves only the redirection of callsto the random number generator (RNG) to call the sample method of PPX using the correspondingprobability distribution as the argument, which is facilitated when a simulator-wide RNG interface isdefined in a single code file as is the case in SHERPA (Section 3.3.3). Conditioning is achieved by eitherproviding an observed value for any sample at inference time (which means that the sample will befixed to the observed value) or adding manual observe statements, similar to Pyro [21].Besides its use with our Python PPL, the protocol defines a very flexible way of coupling any PPLsystem to any model so that these two sides can be (1) implemented in different programming languagesand (2) executed in separate processes and on separate machines across networks. Thus we present thisprotocol as a probabilistic programming analogue to the Open Neural Network Exchange (ONNX)8project for interoperability between deep learning frameworks, in the sense that PPX is an interoperabil-ity project between PPLs allowing language-agnostic exchange of existing models (simulators). Notethat, more than a serialization format, the protocol enables runtime execution of probabilistic modelsunder the control of inference engines in different PPLs. We are releasing this protocol as a separatelymaintained project, together with the rest of our work in Python and C++.3.3.3 Controlling SHERPA’s Simulation of Fundamental Particle PhysicsWe demonstrate our framework with SHERPA [44], a Monte Carlo event generator of high-energyreactions of particles, which is a state-of-the-art simulator of the Standard Model developed by theparticle physics community. SHERPA, like many other large-scale scientific projects, is implementedin C++, and therefore we implement a C++ front end for our protocol.9 We couple SHERPA to thefront end by a system-wide rerouting of the calls to the RNG, which is made easy by the existence ofa third-party RNG interface (External RNG) already present in SHERPA. Through this setup, we canrepurpose, with little effort, any stochastic simulation written in SHERPA as a probabilistic generativemodel in which we can perform inference.Random number draws in C++ simulators are commonly performed at a lower level than the actualprior distribution that is being simulated. This applies to SHERPA where the only samples are fromthe standard uniform distribution U(0,1), which subsequently get used for different purposes usingtransformations or rejection sampling. Sherpa simulator works with all uniform samples except for aproblem-specific single address that we know to be responsible for sampling from a categorical distribu-tion representing particle decay channels. The modification of this address to use the proper categoricalprior allows an effortless application of prior inflation (Section 3.3.1) to generate training data equallyrepresenting each channel.7http://zeromq.org/8https://onnx.ai/9https://github.com/pyprob/pyprob cpp15Rejection sampling [43, 72] sections in the simulator pose a challenge for our approach, as theydefine execution traces that are a priori unbounded; and since the IC NN has to backpropagate throughevery sampled value, this makes the training significantly slower. Rejection sampling is key to the appli-cation of Monte Carlo methods for evaluating matrix elements [59] and other stages of event generationin particle physics; thus an efficient treatment of this construction is primal.16START A11.000 A21.000 A31.000 A41.000 A51.000 A61.000 A261.0000.784A90.2160.464A100.5360.437A150.378A110.185A161.000A200.672A170.328A211.000 A221.000 A231.000 A241.000 A251.000 END1.000A121.000 A131.000 A141.000 1.000 0.500A180.500 A191.000 1.000(a) Latent probabilistic structure of the 10 most frequent trace types.START A11.000 A21.000 A31.000 A41.000 A51.000 A61.000A260.925A70.068A270.0070.788A90.2120.461A100.4190.1150.0060.470A150.319A110.208A310.0040.017A160.966A390.017A200.608A170.392A211.000 A221.000 A231.000 A241.000 A251.000 END1.000A121.000 A131.0000.149A140.8510.8510.1490.500 A180.500A191.0000.8510.149A81.0000.1250.875A401.000 A411.000 A421.000 A431.0001.000A281.000 A291.000 A301.0000.2500.7501.000(b) Latent probabilistic structure of the 25 most frequent traces types.START A11.000 A21.000 A31.000 A41.000 A51.000 A61.000A260.814A70.091A270.0950.789A90.2110.143A100.1310.0850.6410.510A150.266A110.192A310.0320.043A160.917A390.0400.012A200.599A170.3860.004 A211.000 A221.000 A231.000 A241.000 A251.000 END1.000A121.000 A131.0000.211A140.785A450.0050.7570.2090.0340.500A180.500A191.0000.7870.208A460.005A81.0000.1250.8750.057A400.943 A411.000 A421.0000.146A430.854 0.9430.057A281.000 A291.000 A301.0000.2210.7790.8460.077A320.077 A331.000 A341.000A350.250A440.750A361.0000.464A370.536 A381.0000.4000.6001.0000.5000.500A471.0000.5000.500(c) Latent probabilistic structure of the 100 most frequent traces types.START A11.000 A21.000 A31.000 A41.000 A51.000 A61.000A260.723A70.098A270.1790.803A90.1970.076A100.0640.0730.7870.548 A150.219A110.188A310.0450.084A160.842A390.075 0.030A200.562A170.3970.009A570.003A211.000 A221.000 A231.000 A241.000 A251.000 END1.000A121.000 A131.0000.249A140.724A450.019A560.007 0.6640.2460.0880.0020.500A180.500A191.0000.7180.2390.010A460.021A480.0070.005A81.0000.1250.8750.107A400.893A410.959A550.041A421.0000.177A430.8230.0080.8840.107A281.000 A291.000 A301.0000.2200.7800.7450.128A320.1280.0790.0350.0050.0050.876A471.0000.8330.1671.000A331.000 A341.000A350.196A440.804 A361.0000.559A370.441 A381.0000.5080.0480.4441.000A491.000 A501.0000.538A510.4620.538A520.4620.500 A530.500A541.0000.4620.5381.000A581.000 A591.000 A601.0000.3330.667(d) Latent probabilistic structure of the 250 most frequent traces types.Figure 3.3: Interpretability of the latent structure of the τ lepton decay process, automatically ex-tracted from SHERPA executions via the probabilistic execution protocol. Showing modelstructure with increasing detail by taking an increasing number of most common trace typesinto account. Node labels denote address IDs (A1, A2, etc.) that correspond to uniquelyidentifiable parts of model execution such as those in Table A.1. Addresses A1, A2, A3 cor-respond to momenta px, py, pz, and A6 corresponds to the decay channel. Edge labels denotethe frequency an edge is taken, normalized per source node. Red: controlled; green: rejectionsampling; blue: observed; yellow: uncontrolled. Note: the addresses in these graphs are “ag-gregated”, meaning that we collapse all instances it of addresses (at , it) into the same nodein the graph, i.e., representing loops in the execution as cycles in the graph, in order to sim-plify the presentation. This gives us ≤60 aggregated addresses representing the transitionsbetween a total of approximately 24k addresses (at , it) in the simulator.17Chapter 4Proposed MethodA preview of the problem and the proposed solution is shown in Figure 4.1.1: x∼ p(x)2: w← p(x)q(x|y)3: for k ∈ N+ do4: zk ∼ p(z|x)5: wk← p(zk|x)q(zk|x,y)6: w← wwk7: if c(x,zk) then8: z = zk9: break10: observe(y, p(y|z,x))(a) Original programx∼ q(x|y)w← p(x)q(x|y)for k ∈ N+ dozk ∼ q(z|x,y)wk← p(zk|x)q(zk|x,y)w← wwkif c(x,zk) thenz = zkbreakw← wp(y|z,x)(b) Inference compilation1: x∼ p(x)2: w← p(x)q(x|y)3: z ∼ p(z|x,c(x,z))4: w← w p(z|x,c(x,z))q(z|x,y,c(x,z))5: observe(y, p(y|z,x))(c) Equivalent to abovex∼ q(x|y)w← p(x)q(x|y)z ∼ q(z|x,y,c(x,z))w← w p(z|x,c(x,z))q(z|x,y,c(x,z))w← w p(y|z,x)(d) Our IS estimatorFigure 4.1: (a) illustrates the problem we are addressing. Existing approaches to inference com-pilation use trained proposals for the importance sampler shown in (b), where w can haveinfinite variance, even when each wk individually has finite variance, as k is unbounded.There exists a simplified program (c) equivalent to (a) and ideally we would like to performinference using the importance sampler in (d). While this is not directly possible, since wedo not have access to the conditional densities required, our method approximates this algo-rithm, guaranteeing finite variance of w.184.1 Problem FormulationOur formulation of the problem will be presented concretely starting from the example probabilisticprogram shown in Figure 4.1a. Even though both the problem and our solution are more general, ap-plying to all probabilistic programs with rejection sampling loops regardless of how complicated theyare, this simple example captures all the important aspects of the problem. As a reminder, inferencecompilation refers to offline training of importance sampling proposal distributions for all random vari-ables in a program [62]. Existing approaches to inference compilation for the program in Figure 4.1acorrespond to the importance sampler shown in Figure 4.1b where there is some proposal learned forevery random choice in the program. While the weighted samples produced by this method result inunbiased estimates, the variance of the weights can be very high and potentially infinite due to the un-bounded number of wk’s. To show this, we start by more precisely defining the meaning of the samplerin Figure 4.1b.Definition 4.1.1 (Naive weighing). Let p(x,y,z) be a probability density such that all conditional densi-ties exist. Let c(x,z) be a Boolean condition, and A be the event that c is satisfied, such that p(A|x,y)≥ εfor all (x,y) and some ε > 0. For each y, let q(x,z|y) be a probability density absolutely continuouswith respect to p(x,z|y) and q(A|x,y) ≥ ε for all (x,y). Let x ∼ q(x|y) and let zk ∼i.i.d. q(z|x,y) andwk = p(zk|x)q(zk|x,y) for all k ∈ N+. Let L = min{k|c(x,zk)}, z = zL and wIC =p(x)q(x|y) p(y|x,z)∏Lk=1 wk.We assert that definition 4.1.1 corresponds to the program in Figure 4.1b. A rigorous correspondencecould be established using formal semantics methods, such as the construction of S´cibior et al. [86], butthis is beyond the scope of this paper. Although the resulting importance sampler correctly targets theposterior distribution, the variance of wIC is a problem, and it is this specific problem that we tackle inthis paper.Intuitively, a large number of rejections in the loop leads to a large number of wk being includedin wIC and the variance of their product tends to grow quickly. In the worst case, this variance may beinfinite, even when each wk has finite variance individually. This happens when the proposed samplesare rejected too often, which is formalized in the following theorem.Theorem 1. Under assumptions of Definition 4.1.1, if the following condition holds with positive prob-ability under x∼ q(x|y)Ez∼q(z|x,y)[p(z|x)2q(z|x,y)2 (1− p(A|x,z))]≥ 1 (4.1)then the variance of wIC is infinite.Proof. In Appendix A.1.What this means is that importance sampling with proposals other than the prior hurt more thanhelp in the case of rejection sampling loops. Furthermore, existing IC schemes are effectively uselessunder the conditions of Theorem 1, since the consistency of the self-normalizing importance samplerdepends on the variance of weights being finite [42]. Worse, even when the variance is finite but large,it may render the ESS too low for practical applications, a phenomenon we have observed repeatedly19xzk· · ·z1z0yA0 A1 AkFigure 4.2: Dependency graph for a trace from Program 4.1a. Here, Ai is the event of accepting asampled z i.e. c(x,zi) holds and similarly, Ai is defined for rejection.in practice. What remains is to derive an alternative way to compute wIC that guarantees finite varianceand in practice leads to larger ESS than existing methods.4.2 ApproachA starting point to the presentation of our algorithm is to observe that the program in Figure 4.1a isequivalent to the program 4.1c, where z satisfying the condition c is sampled directly. Figure 4.1dpresents an importance sampler targeting 4.1c, obtained by sampling z directly from q under the condi-tion c. Note that the sampling processes in 4.1b and 4.1d are the same, only the weights are computeddifferently.Definition 4.2.1 (Collapsed weighing). Extending Definition 4.1.1, letwC =p(x)q(x|y)p(z|x,A)q(z|x,A,y)︸ ︷︷ ︸Tp(y|x,z) (4.2)Note that E [wIC] = E [wC], which we prove in Theorem 3. However, since wC only involves a fixednumber (three) of terms, we can expect it to avoid problems with exploding variance. Unfortunately, wecannot directly compute wC.In equation 4.2 we can directly evaluate all terms except T , since, p(z|x,A) and q(z|x,A,y) aredefined implicitly by the rejection sampling loop. Applying Bayes’ rule to this term gives the followingequality:T =q(A|x,y)p(A|x)︸ ︷︷ ︸1p(z|x)q(z|x,y)︸ ︷︷ ︸2p(A|z,x)q(A|z,x,y)︸ ︷︷ ︸3(4.3)The term 2 can be directly evaluated, since we have access to both conditional densities and theterm 3 is always equal to 1, since A is deterministic given x and z. However, the term 1 , which is theratio of acceptance probabilities under q and p, can not be computed directly and we need to estimate20it. We provide separate unbiased estimators for q(A|x,y) and 1p(A|x) .For q(A|x,y) we use straightforward Monte Carlo estimation of the following expectation:q(A|x,y) =∫q(A|z,x,y)q(z|x,y)dz=∫c(z,x)q(z|x,y)dz= Ez∼q(z|x,y) [c(z,x)] (4.4)For 1p(A|x) we use Monte Carlo estimation after applying the following standard lemma:Lemma 2. Let A be an event that occurs in a trial with probability p. The expectation of the number oftrials to the first occurrence of A is equal to 1p .It is important that these estimators are constructed independently of z being sampled to ensure thatwe obtain an unbiased estimator for wC specified in Equation 4.2. We put together all these elements toobtain our final method in Algorithm 2.Algorithm 2 Pseudocode for our algorithm applied to the probabilistic program from Figure 4.1a.1: x∼ q(x|y)2: w← p(x)q(x|y)3: for k ∈ N+ do4: zk ∼ q(z|x,y)5: if c(x,zk) then6: z = zk7: break8: w← w p(z|x)q(z|x,y)9: . estimate q(A|x,y) using Equation 4.410: K← 011: for i ∈ 1, . . .N do12: z′i← q(z|x,y)13: K← K+ c(z,x)14: . estimate 1p(A|x) using Lemma 215: for j ∈ 1, . . .M do16: for l ∈ N+ do17: z′′j,l ← p(z|x)18: if c(x,z′′j,l) then19: Tj← l20: break21: T ← 1M ∑Mj=1 Tj22: w← w KTN23: w← w p(y|z,x)More formally, the weight obtained in our method is defined as follows.Definition 4.2.2 (Our weighing). Extending Definition 4.1.1, let z′i ∼i.i.d q(z|x,y) for i ∈ 1, . . . ,N andK be the number of z′i for which c(x,z′i) holds. Let (z′′j,1, . . . ,z′′j,Tj) be sequences of potentially varying21length for j ∈ 1, . . . ,M with z′′j,l ∼i.i.d. p(z|x) such that for all j, Tj is the smallest index l for whichc(x,z′′j,l) holds. Let T =1M ∑Mj=1 Tj. Finally, letw =p(x)q(x|y)KTNp(z|x)q(z|x,y) p(y|x,z). (4.5)Throughout this section we have only informally argued that the three importance samplers pre-sented target the same distribution. With all the definitions in place we can make this argument precisein the following theorem.Theorem 3. For any N ≥ 1 and M ≥ 1, and all values of (x,y,z),E [wIC|x,y,z] = wC = E [w|x,y,z] . (4.6)Proof. For the second equality, use Equation 4.4, then Lemma 2, Equation 4.3, and finally Equation 4.2.E [w|x,y,z] = p(x)q(x|y)p(z|x)q(z|x,y) p(y|x,z)1NEz′ [K]Ez′′ [T ] (4.7)=p(x)q(x|y)p(z|x)q(z|x,y) p(y|x,z)q(A|x,y)1p(A|x) (4.8)=p(x)q(x|y)p(z|x,A)q(z|x,A,y) p(y|x,z) (4.9)= wC (4.10)For the first equality, use Equation A.4 in Appendix A.1 to getE [wIC|x,y,z] = p(x)q(x|y) p(y|x,z)wLEz1:L−1[L−1∏k=1wk](4.11)=p(x)q(x|y) p(y|x,z)p(z|x)q(z|x,y)q(A|x,y)p(A|x) (4.12)= wC (4.13)Since all three importance samplers use the same proposal distributions for (x,z), Theorem 3 showsthat they all target the same distribution, which is the posterior distribution specified by the originalprobabilistic program in Figure 4.1a.Finally, we can prove that our method handles inference in rejection sampling loops without intro-ducing infinite variance. Note that variance may still be infinite for reasons not having to do with therejection sampling loop, if q(x|y) and q(z|x,y) are poorly chosen.Theorem 4. For any N ≥ 1 and M ≥ 1, if wC from Definition 4.2.1 has finite variance, then w fromDefinition 4.2.2 has finite variance.22Proof. Note that conditionally on (x,y,z) K follows a binomial distribution, so Var[KN |x,y,z] < 1 < ∞,while T follows a geometric distribution and Var[T |x,y,z] < 1p(A|x)2 ≤ 1ε2 < ∞. Also, conditionallyon (x,y,z), KN and T are independent, so Var[KTN ] < B for some constant B < ∞. Then see that w =wCp(A|x)q(A|x,y)KTN . Then, using the law of total variance, we getVar[w] = E [Var[w|x,y,z]]+Var[E [w|x,y,z]] (4.14)= E[(wCp(A|x)q(A|x,y))2Var[KTN∣∣∣∣x,y,z]]+Var[wC] (4.15)≤ E[wC1ε2B]+Var[wC]< ∞ (4.16)4.3 ImplementationListing 4.1: Originalx = sample(P x)while True:z = sample(P z(x))if c(x, z):breakobserve(P y(x,z), y)return x, zListing 4.2: Annotatedx = sample(P x)while True:rs start()z = sample(P z(x))if c(x, z):rs end ()breakobserve(P y(x,z), y)return x, zFigure 4.3: An illustration of annotations required by our system. To apply our method we onlyrequire that entry and exit to each rejection sampling loop be annotated with rs start andrs end respectively. Our implementation then automatically handles the whole inferenceprocess, even if the annotated loops are nested.In the previous sections we have presented and experimentally validated our method in simple pro-grams, starting with the exemplar program in Figure 4.1, which only has a single rejection samplingloop and a very simple structure otherwise. However, our method applies much more broadly, in fact toall probabilistic programs with arbitrary stochastic control flow structures and any number of rejectionsampling loops, including nesting such loops to arbitrary degree. The only constraint is that observationscan not be placed inside the loops, i.e. no conditioning inside rejection sampling loops. We conjecturethat our method produces correct weights for all probabilistic programs satisfying this constraint, butproving that is beyond the scope of this paper and would require employing sophisticated machinery forconstructing formal semantics, such as developed by S´cibior et al. [86].To enable practitioners to use our method in its full generality we have implemented it in PyProb23[62], a universal probabilistic programming library written in Python. The particulars of such a generalpurpose implementation pose several difficult but interesting problems, including identifying rejectionsampling loops in the original program, addressing particular rejection sampling loops, and engineeringsolutions that allow acceptance probabilities bespoke to each loop to be estimated by repeatedly execut-ing the loops with different proposal distributions. In this section we describe some of these challengesand discuss our initial approach to solving them.The first challenge is identifying rejection sampling loops in the probabilistic program itself. Onemechanism for doing this is to introduce a rejection sampling primitive, macro, or syntactic sugar intothe probabilistic programming language itself whose arguments are two functions, one the acceptancefunction, the other the body of the rejection sampling loop. While this may be feasible, our approach liesin a different part of the design space, given our choice to implement in PyProb and its applicability toperforming inference in existing stochastic simulators. In this setting there are two other design choices:some kind of static analyzer that automates the labelling of rejection sampling loops by looking forrejection sampling motifs in the program (unclear how to accomplish this in a reliable and general way)or providing the probabilistic programmer functions that need to be carefully inserted into the existingprobabilistic program to demarcate where rejection sampling loops start and end.This is the intial approach we have taken. In the programs in Section 5 we nearly silently introducedthe functions rs start and rs end to tag the beginning and end of rejection sampling loops. Thepurpose of these functions was hinted at then, but now is explained as the way to inform an inferenceengine about the scope of each rejection sampling loop, in particular so that all sample statementsin between calls to rs start and rs end can be tracked. Programs in Figure 4.3 illustrates wherethese primitives have to be inserted in probabilistic programs with rejection loops to invoke our ARStechniques.The specific implementation details are beyond scope for this paper and require substantial reviewof PyProb internals, however, the key functionality enabled by these tags includes two critical things.First, we need to be able to execute additional iterations of every rejection sampling loop in the programto compute our estimator of q(A|x,y)p(A|x) . For every rs start we have to be able to continue the programmultiple times, executing the rejection loop both proposing from p and q, terminating the continuationonce the matching rs end is reached. Efficient implementations of this make use of the same forkingideas that made probabilistic-c possible [74]. Second, we have to be able to identify rejection samplerstart and end pairs in a way that is more strict than the uniqueness requirement on random variableaddresses in probabilistic programming.Our PyProb implementation addresses all of these issues and the details of our implementation willbe made public as part of its continuing open source development.24Chapter 5Experiments5.1 Experimental SetupListing 5.1: Gaussian unknown mean (GUM) modeldef GUM(mu 0 , sigma 0 , sigma, y obs):u = sample(Uniform(low=0, high=1))if u > 0.5:while True:rs start()mu = sample(Normal(mean=mu 0 , std=sigma 0))if mu > mu 0:rs end ()breakelse:while True:rs start()mu = sample(Normal(mean=mu 0 , std=sigma 0))if mu <= mu 0:rs end ()breaky = observe(Normal(mean=mu, std=sigma), y obs)return muWe illustrate our method by performing inference in two example probabilistic programs that includerejection sampling loops. We designed these programs so that baseline posteriors can be derived ana-lytically.As is now nearly standard in writing about probabilistic programming, in the following programssample means to draw a random value from the given distribution and observe means to condition25the program on having observed the given value under the given distribution.In our example models, Programs 5.1 and 5.2, we either explicitly introduce rejection samplingor replace some sample statements with rejection sampling loops that generate samples from the samedistribution as the one replaced. We evaluate the efficacy of our approach in several ways including com-puting the ESS of IS posterior estimates and empirically comparing the convergence rates of differentmethods to analytically computed ground truth values.The specific methods we compare are:• Naive (IC): Like the approach in Figure 4.1b this uses a proposal for all iterations of rejectionsampling loops. Final importance weights are computed by multiplying all the weights for allsamples in the trace, accepted and rejected.• Ours (ARS, M=m): Similar to the approach in Figure 4.1d this uses our method (Algorithm 2)with M = m and N = max(m,10) as defined in Definition 4.2.2, not multiplying in the weights ofall of the rejected samples on the trace.• Existing (Biased): Implements the incorrect variant of ARS that appears currently in PyProb. Itdoes not multiply in all the weights of the rejected samples, however, it does not estimate normultiple in the correction factor q(A|x,y)p(A|x) .5.2 Gaussian Unknown MeanOur first Gaussian unknown mean (GUM) model consists of one latent variable, the mean µ ∈R, whichis given a Gaussian prior with mean µ0 ∈ R and variance σ0 ∈ R+, and a single observation y governedby a normal observation model with fixed standard deviation σ ∈ R+.µ ∼ N(µ0,σ0)y|µ ∼ N(µ,σ)We introduce rejection sampling into this model by splitting the prior into two truncated normal distri-butions and sampling from these truncated normals via rejection sampling. This is shown in Program5.1. The reason that u is a random variable will be made apparent in due course; for now suffice it to saythat the full generality of our method extends to various compositions of rejection samplers, includingstate dependent rejection samplers that are have potentially different exit probabilities in each trace.We fix µ0 = 0,σ0 = 1,σ =√22 . The observation y is 0. The proposals are fixed; for the first branchin Program 5.1 the proposal is N(−2,2) whereas all other proposals are fixed to be equal to the prior. Inthis experiment we intentionally chose the proposal so as to not be close to the true posterior, yielding amore difficult problem.Figure 5.1 shows the aggregated results of running each IS inference technique 100 times. While ICconverges to the ground truth, it has poor ESS. The high variance of the IC estimator can also plainly beseen in comparison to our method. On the other hand, as our method does not multiplicatively update26importance sampling weights with weights from rejected samples both has has much higher ESS andconverges more rapidly to the ground truth.−10−1−10−2010−210−1EstimationError0 25 50 75 100 125 150Number of draws×103050100150ESS×103ICBiasedARS, M=1ARS, M=5ARS, M=10ARS, M=100Figure 5.1: Inference results for the Gaussian unknown mean program. We estimate Ep(µ|y) [µ]which in this instance has an analytic ground truth value. The plots show an aggregationof 100 different runs of the experiment. (top) Estimation error of 3 different methods (IC,Biased, and (ours) ARS with different values of M). Our method reaches estimation errorapproaching zero faster and with significantly less variance than the alternatives. Largervalues of M lead to lower variance and faster convergence. (bottom) ESS as a functionof number of IS proposals for each estimator. The multiplicative inclusion of importanceweights in the existing IC approach significantly adversely affects ESS compared to ourmethod regardless of the chosen value of M.5.3 Gaussian Mixture ModelOur second example program is a GMM with two components and a single observation. Its implemen-tation is shown in Program 5.2 where sampling from one of the mixtures is done through a rejectionsampling loop. The target distribution for the rejection sampling branch is N(µ1,σ1) and is implic-itly defined via a rejection-sampler with base distribution N(µ0,σ0). The implementation of alpha(x) isα(x) = 1MN(x;µ1,σ1)N(x;µ0,σ0) where the value of M is computed based on the base and the target distribution ana-lytically so as to satisfy the textbook “soft” rejection sampling requirement, i.e. M = argmaxxN(x;µ1,σ1)N(x;µ0,σ0) .27Listing 5.2: Gaussian Mixture Model (GMM)def GMM(pi 1 , mu 0 , sigma 0 , mu 1 , sigma 1 , mu 2 , sigma 2 , sigma, y obs):u = sample(Uniform(low=0, high=1))if u < pi 1:while True:rs start()mu = sample(Normal(mean=mu 0 , std=sigma 0))u2 = sample(Uniform(low=0, high=1))if u2 < alpha(x):rs end ()breakelse:mu = sample(Normal(mean=mu 2 , std=sigma 2))y = observe(Normal(mean=mu, std=sigma), y obs)return muIn our experiments we set the mixture probabilities to pi1 = 0.5 and pi2 = 0.5, the parameters of the firstGaussian to µ1 =−1,σ1 = 1, with µ2 = 2,σ2 = 1 for the second Gaussian. The observation is y= 0. Weset µ0 = 1,σ0 = 2. For inference, we use N(−2,2) as the proposal for the first Gaussian and every otherproposal is the same as the prior. Figure 5.2a shows aggregated results of running different inferencemethods 100 times each.We also run an experiment on the same GMM with a proposal for µ1 that will be perfect if all thesamples from it gets accepted, and every other proposal is equal to the prior. This corresponds to the ICvariant for handling rejection sampling proposed in [20] where the proposal is learned perfectly, as thiswould be the ideal proposal it learns for this particular random variable. In this experiment, we call thisproposal the perfect proposal. Figure 5.2b shows our results.In our next setup, besides having the aforementioned perfect proposal for µ1, we use N(−0.5,0.5)as a proposal for u2 (which is sampled from the acceptance distribution, Uniform(0,1) in the prior).This leads to less rejection at inference time and is a better overall proposal since, ideally, samplesfrom the perfect proposal should not be rejected as distribution of accepted samples would be imperfectotherwise. The results are shown in Figure 5.2c.In this experiment IC does not converge, even when we use the perfect proposal. This is in addi-tion to its poor sample efficiency. This is evidence of the weight variance problem with existing ICapproaches to amortized IS. They may not work even when the perfect proposal is used. Our method,as in the previous example, converges to the true expectation value with lower error, lower variance, andbetter sample efficiency (higher ESS).28−10−1−10−2010−210−1EstimationError−10−1−10−2010−210−1−10−1−10−2010−210−10 25 50 75 100 125 150Number of draws(a)×1030255075100ESS×1030 25 50 75 100 125 150Number of draws(b)×103050100150200×1030 25 50 75 100 125 150Number of draws(c)×103050100150200×103IC Biased ARS, M=1 ARS, M=5 ARS, M=10 ARS, M=100Figure 5.2: Results of GMM experiment. We target estimating Ep(µ|y) [µ] which can be computedanalytically in this model. The plots show aggregation of 100 different runs of the experi-ment. (a) With a fixed proposal (b) With perfect proposal for the base distribution (c) Withperfect proposal for the base distribution and a proposal for acceptance distribution that leadsto less rejection. Our method with any M achieves zero estimation error with lower variancecompared to IC. As expected, larger M leads to faster convergence, lower variance of theweights and higher ESS. In this example, IC does not converge to zero estimation error.29Chapter 6ConclusionsWe have addressed a significant issue in amortized and adapted importance-sampling-based inferencefor universal probabilistic programming languages. We have demonstrated that even simple rejectionsampling loops cause major problems for existing probabilistic programming inference algorithms. Inparticular, we showed empirically and theoretically that sequential importance sampling will performpoorly in presence of rejection sampling loops. Our proposed method is an unbiased estimator withlower variance than naı¨ve SIS estimators.The cost of taking our approach is somewhat subtle but involves needing to estimate the exit prob-abilities of all rejections sampling loops in the program under both the prior and the proposal. In manycases this can be amortized in the sense that these rejection sampling loops can be statically determinedand the relevant constants estimated at training time. Unfortunately if at test time a rejection samplingloop is encountered for which these constants have not been already estimated, the inference enginemust “pause” and estimate them on the fly. Not only can this be slow but it seriously increases theimplementation complexity of the probabilistic programming system.However, when all is said and done, this “fix” of amortized and adapted importance-sampling-basedinference for universal probabilistic programming systems is very significant, particularly as it pertainsto uptake of this kind of probabilistic programming system. Current users of systems who wisely userejection sampling in their generative models will experience the probabilistic programming system asconfusingly but simply not working. This will be due to slowness if non-amortized inference techniquesare utilized and, worse, poor sample efficiency and potentially non-diagnosable non-convergent behaviorif existing fast amortized inference techniques are utilized instead. Our work makes it so that efficient,amortized inference engines work for probabilistic programs that users actually write and in so doingremoves a major impediment to the uptake of universal probabilistic programming systems in general.30Bibliography[1] G. Aad, T. Abajyan, B. Abbott, J. Abdallah, S. Abdel Khalek, A. A. Abdelalim, O. Abdinov,R. Aben, B. Abi, M. Abolins, and et al. Observation of a new particle in the search for theStandard Model Higgs boson with the ATLAS detector at the LHC. Physics Letters B, 716:1–29,Sept. 2012. doi:10.1016/j.physletb.2012.08.020. → pages 10, 11[2] G. Aad et al. Reconstruction of hadronic decay products of tau leptons with the ATLASexperiment. Eur. Phys. J., C76(5):295, 2016. doi:10.1140/epjc/s10052-016-4110-0. → page 11[3] V. M. Abazov et al. A precision measurement of the mass of the top quark. Nature, 429:638–642,2004. doi:10.1038/nature02589. → page 12[4] J. Allison, K. Amako, J. Apostolakis, P. Arce, M. Asai, T. Aso, E. Bagli, A. Bagulya, S. Banerjee,G. Barrand, et al. Recent developments in geant4. Nuclear Instruments and Methods in PhysicsResearch Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 835:186–225, 2016. → page 10[5] J. Alwall, A. Freitas, and O. Mattelaer. The Matrix Element Method and QCD Radiation. Phys.Rev., D83:074010, 2011. doi:10.1103/PhysRevD.83.074010. → page 12[6] J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer, H.-S. Shao, T. Stelzer,P. Torrielli, and M. Zaro. The automated computation of tree-level and next-to-leading orderdifferential cross sections, and their matching to parton shower simulations. Journal of HighEnergy Physics, 2014(7):79, 2014. → page 10[7] J. R. Andersen, C. Englert, and M. Spannowsky. Extracting precise Higgs couplings by using thematrix element method. Phys. Rev., D87(1):015019, 2013. doi:10.1103/PhysRevD.87.015019.→ page 12[8] P. Artoisenet and O. Mattelaer. MadWeight: Automatic event reweighting with matrix elements.PoS, CHARGED2008:025, 2008.[9] P. Artoisenet, P. de Aquino, F. Maltoni, and O. Mattelaer. Unravelling tth via the Matrix ElementMethod. Phys. Rev. Lett., 111(9):091802, 2013. doi:10.1103/PhysRevLett.111.091802. → page12[10] M. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for onlinenonlinear/non-gaussian bayesian tracking. Ieee Transactions on Signal Processing, 50(2):174–188, 2002. ISSN 19410476, 1053587x. doi:10.1109/78.978374. → page 5[11] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for onlinenonlinear/non-gaussian bayesian tracking. IEEE Transactions on signal processing, 50(2):174–188, 2002. → pages 7, 1331[12] A. Askew, P. Jaiswal, T. Okui, H. B. Prosper, and N. Sato. Prospect for measuring the CP phasein the hττ coupling at the LHC. Phys. Rev., D91(7):075014, 2015.doi:10.1103/PhysRevD.91.075014. → page 11[13] L. Asquith et al. Jet Substructure at the Large Hadron Collider : Experimental Review. 2018. →page 12[14] A. Aurisano, A. Radovic, D. Rocco, A. Himmel, M. Messier, E. Niner, G. Pawloski, F. Psihas,A. Sousa, and P. Vahle. A convolutional neural network neutrino event classifier. Journal ofInstrumentation, 11(09):P09001, 2016. → page 12[15] P. Avery et al. Precision studies of the Higgs boson decay channel H→ ZZ→ 4l with MEKD.Phys. Rev., D87(5):055006, 2013. doi:10.1103/PhysRevD.87.055006. → page 12[16] M. Ba¨hr, S. Gieseke, M. A. Gigg, D. Grellscheid, K. Hamilton, O. Latunde-Dada, S. Pla¨tzer,P. Richardson, M. H. Seymour, A. Sherstnev, et al. Herwig++ physics and manual. The EuropeanPhysical Journal C, 58(4):639–707, 2008. → page 10[17] P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physicswith deep learning. Nature Communications, 5:4308, 2014. → page 12[18] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation inmachine learning: a survey. Journal of Machine Learning Research (JMLR), 18(153):1–43, 2018.URL http://jmlr.org/papers/v18/17-468.html. → page 13[19] A. G. Baydin, L. Shao, W. Bhimji, L. Heinrich, L. Meadows, J. Liu, A. Munk, S. Naderiparizi,B. Gram-Hansen, G. Louppe, et al. Etalumis: Bringing probabilistic programming to scientificsimulators at scale. arXiv preprint arXiv:1907.03382, 2019. → pages v, 1, 2, 4[20] A. G. Baydin, L. Shao, W. Bhimji, L. Heinrich, S. Naderiparizi, A. Munk, J. Liu,B. Gram-Hansen, G. Louppe, L. Meadows, et al. Efficient probabilistic inference in the quest forphysics beyond the standard model. In Advances in Neural Information Processing Systems,pages 5460–5473, 2019. → pages v, 1, 2, 4, 9, 10, 28[21] E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R. Singh,P. Szerlip, P. Horsfall, and N. D. Goodman. Pyro: Deep universal probabilistic programming.Journal of Machine Learning Research, 2018. → page 15[22] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians.Journal of the American Statistical Association, 112(518):859–877, 2017. ISSN 1537274x,01621459. doi:10.1080/01621459.2017.1285773. → page 1[23] S. Bolognesi, Y. Gao, A. V. Gritsan, K. Melnikov, M. Schulze, N. V. Tran, and A. Whitbeck. Onthe spin and parity of a single-produced resonance at the LHC. Phys. Rev., D86:095031, 2012.doi:10.1103/PhysRevD.86.095031. → page 12[24] J. Brehmer, K. Cranmer, G. Louppe, and J. Pavez. A Guide to Constraining Effective FieldTheories with Machine Learning. Phys. Rev., D98(5):052004, 2018.doi:10.1103/PhysRevD.98.052004. → page 12[25] X. Cai. Exact stochastic simulation of coupled chemical reactions with delays. The Journal ofchemical physics, 126(12):124108, 2007. → pages 1, 232[26] J. M. Campbell, R. K. Ellis, W. T. Giele, and C. Williams. Finding the Higgs boson in decays toZγ using the matrix element method at Next-to-Leading Order. Phys. Rev., D87(7):073005, 2013.doi:10.1103/PhysRevD.87.073005. → page 12[27] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker,J. Guo, P. Li, and A. Riddell. Stan: A probabilistic programming language. Journal of statisticalsoftware, 76(1), 2017. → page 4[28] S. Chatrchyan, V. Khachatryan, A. M. Sirunyan, A. Tumasyan, W. Adam, E. Aguilo, T. Bergauer,M. Dragicevic, J. Ero¨, C. Fabjan, and et al. Observation of a new boson at a mass of 125 GeVwith the CMS experiment at the LHC. Physics Letters B, 716:30–61, Sept. 2012.doi:10.1016/j.physletb.2012.08.021. → pages 10, 11[29] K. Cranmer, J. Pavez, and G. Louppe. Approximating likelihood ratios with calibrateddiscriminative classifiers. arXiv preprint arXiv:1506.02169, 2015. → page 1[30] L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, and A. Schwartzman. Jet-images – deeplearning edition. Journal of High Energy Physics, 2016(7):69, 2016. → page 12[31] P. Del Moral, A. Doucet, and A. Jasra. Sequential monte carlo samplers. Journal of the RoyalStatistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006. → page 5[32] A. P. Dempster. A generalization of bayesian inference. Journal of the Royal Statistical Society:Series B (Methodological), 30(2):205–232, 1968. → page 4[33] A. Djouadi. The Anatomy of electro-weak symmetry breaking. I: The Higgs boson in thestandard model. Phys. Rept., 457:1–216, 2008. doi:10.1016/j.physrep.2007.10.004. → page 11[34] A. Doucet and A. M. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later.Handbook of nonlinear filtering, 12(656-704):3, 2009. → pages 5, 7, 13[35] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid monte carlo. Physics letters B,195(2):216–222, 1987. → page 5[36] R. Dutta, J. Corander, S. Kaski, and M. U. Gutmann. Likelihood-free inference by penalisedlogistic regression. arXiv preprint arXiv:1611.10242, 2016. → page 1[37] E. Endeve, C. Y. Cardall, R. D. Budiardja, S. W. Beck, A. Bejnood, R. J. Toedte, A. Mezzacappa,and J. M. Blondin. Turbulent magnetic field amplification from spiral sasi modes: Implicationsfor core-collapse supernovae and proto-neutron star magnetization. Astrophysical Journal, 751(1):26, 2012. ISSN 15384357, 0004637x, 15384365, 00670049.doi:10.1088/0004-637X/751/1/26. → page 1[38] J. S. Gainer, J. Lykken, K. T. Matchev, S. Mrenna, and M. Park. The Matrix Element Method:Past, Present, and Future. In Proceedings, 2013 Community Summer Study on the Future of U.S.Particle Physics: Snowmass on the Mississippi (CSS2013): Minneapolis, MN, USA, July29-August 6, 2013, 2013. URL http://inspirehep.net/record/1242444/files/arXiv:1307.3546.pdf.→ page 12[39] Y. Gao, A. V. Gritsan, Z. Guo, K. Melnikov, M. Schulze, and N. V. Tran. Spin determination ofsingle-produced resonances at hadron colliders. Phys. Rev., D81:075022, 2010.doi:10.1103/PhysRevD.81.075022. → page 1233[40] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian dataanalysis. CRC press, 3 edition, 2013. → page 4[41] S. Gershman and N. Goodman. Amortized inference in probabilistic reasoning. In Proceedings ofthe Annual Meeting of the Cognitive Science Society, volume 36, 2014. → pages 2, 8[42] J. Geweke. Bayesian inference in econometric models using monte carlo integration.Econometrica: Journal of the Econometric Society, pages 1317–1339, 1989. → page 19[43] W. R. Gilks and P. Wild. Adaptive rejection sampling for Gibbs sampling. Applied Statistics,pages 337–348, 1992. → page 16[44] T. Gleisberg, S. Hoeche, F. Krauss, M. Schonherr, S. Schumann, F. Siegert, and J. Winter. Eventgeneration with SHERPA 1.1. Journal of High Energy Physics, 02:007, 2009.doi:10.1088/1126-6708/2009/02/007. → pages 10, 15[45] N. D. Goodman, V. K. Mansinghka, D. Roy, K. Bonawitz, and J. B. Tenenbaum. Church: Alanguage for generative models. Proceedings of the 24th Conference on Uncertainty in ArtificialIntelligence, Uai 2008, pages 220–229, 2008. → pages 2, 4[46] A. D. Gordon, T. A. Henzinger, A. V. Nori, and S. K. Rajamani. Probabilistic programming. InProceedings of the Future of Software Engineering, pages 167–181. ACM, 2014. → pages2, 4, 12[47] A. V. Gritsan, R. Ro¨ntsch, M. Schulze, and M. Xiao. Constraining anomalous Higgs bosoncouplings to the heavy flavor fermions using matrix element techniques. Phys. Rev., D94(5):055023, 2016. doi:10.1103/PhysRevD.94.055023. → page 12[48] B. Grzadkowski and J. F. Gunion. Using decay angle correlations to detect CP violation in theneutral Higgs sector. Phys. Lett., B350:218–224, 1995. doi:10.1016/0370-2693(95)00369-V. →page 11[49] R. Harnik, A. Martin, T. Okui, R. Primulando, and F. Yu. Measuring CP Violation in h→ τ+τ−at Colliders. Phys. Rev., D88(7):076009, 2013. doi:10.1103/PhysRevD.88.076009. → page 11[50] F. Hartig, J. M. Calabrese, B. Reineking, T. Wiegand, and A. Huth. Statistical inference forstochastic simulation models–theory and application. Ecology Letters, 14(8):816–827, 2011. →page 1[51] A. Heinecke, A. Breuer, S. Rettenberger, M. Bader, A. A. Gabriel, C. Pelties, A. Bode, W. Barth,X. K. Liao, K. Vaidyanathan, M. Smelyanskiy, and P. Dubey. Petascale high order dynamicrupture earthquake simulations on heterogeneous supercomputers. International Conference forHigh Performance Computing, Networking, Storage and Analysis, Sc, 2015-(January):7012188,3–14, 2014. ISSN 21674337, 21674329. doi:10.1109/SC.2014.6. → page 1[52] P. Hintjens. ZeroMQ: messaging for many applications. O’Reilly Media, Inc., 2013. → page 15[53] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. ISSN 1530888x, 08997667. → page 14[54] B. Hooberman, A. Farbin, G. Khattak, V. Pacela, M. Pierini, J.-R. Vlimant, M. Spiropulu, W. Wei,M. Zhang, and S. Vallecorsa. Calorimetry with Deep Learning: Particle Classification, EnergyRegression, and Simulation for High-Energy Physics, 2017. Deep Learning in Physical Sciences(NIPS workshop). https://dl4physicalsciences.github.io/files/nips dlps 2017 15.pdf. → page 1234[55] G. Kasieczka. Boosted Top Tagging Method Overview. In 10th International Workshop on TopQuark Physics (TOP2017) Braga, Portugal, September 17-22, 2017, 2018. URLhttp://inspirehep.net/record/1647961/files/arXiv:1801.04180.pdf. → page 12[56] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improvedvariational inference with inverse autoregressive flow. In Advances in neural informationprocessing systems, pages 4743–4751, 2016. → page 1[57] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MITpress, 2009. → page 11[58] K. Kondo. Dynamical Likelihood Method for Reconstruction of Events With MissingMomentum. 1: Method and Toy Models. J. Phys. Soc. Jap., 57:4126–4140, 1988.doi:10.1143/JPSJ.57.4126. → page 12[59] F. Krauss. Matrix elements and parton showers in hadronic interactions. Journal of High EnergyPhysics, 2002(08):015, 2002. → page 16[60] W. Lampl, S. Laplace, D. Lelas, P. Loch, H. Ma, S. Menke, S. Rajagopalan, D. Rousseau,S. Snyder, and G. Unal. Calorimeter Clustering Algorithms: Description and Performance.Technical Report ATL-LARG-PUB-2008-002. ATL-COM-LARG-2008-003, CERN, Geneva,Apr 2008. URL https://cds.cern.ch/record/1099735. → page 11[61] T. A. Le. Inference for higher order probabilistic programs. Masters Thesis, University of Oxford,2015. → pages 5, 13[62] T. A. Le, A. G. Baydin, and F. Wood. Inference compilation and universal probabilisticprogramming. In Proceedings of the 20th International Conference on Artificial Intelligence andStatistics (AISTATS), volume 54 of Proceedings of Machine Learning Research, pages1338–1348, Fort Lauderdale, FL, USA, 2017. PMLR. → pages 1, 2, 4, 8, 13, 14, 19, 24[63] T. A. Le, A. G. Baydin, R. Zinkov, and F. Wood. Using synthetic data to train neural networks ismodel-based reasoning. In 2017 International Joint Conference on Neural Networks (IJCNN),pages 3514–3521. IEEE, 2017. → page 1[64] M. Lezcano Casado, A. G. Baydin, D. Martinez Rubio, T. A. Le, F. Wood, L. Heinrich,G. Louppe, K. Cranmer, W. Bhimji, K. Ng, and Prabhat. Improvements to inference compilationfor probabilistic programming in large-scale scientific simulators. In Neural InformationProcessing Systems (NIPS) 2017 workshop on Deep Learning for Physical Sciences (DLPS),Long Beach, CA, US, December 8, 2017, 2017. → page 2[65] V. Mansinghka, D. Selsam, and Y. Perov. Venture: a higher-order probabilistic programmingplatform with programmable inference. page 78, 2014. → page 4[66] V. Mansinghka, D. Selsam, and Y. Perov. Venture: a higher-order probabilistic programmingplatform with programmable inference. arXiv preprint arXiv:1404.0099, 2014. → pages 1, 2[67] T. Martini and P. Uwer. Extending the Matrix Element Method beyond the Born approximation:Calculating event weights at next-to-leading order accuracy. JHEP, 09:083, 2015.doi:10.1007/JHEP09(2015)083. → page 12[68] T. Martini and P. Uwer. The Matrix Element Method at next-to-leading order QCD for hadroniccollisions: Single top-quark production at the LHC as an example application. 2017. → page 1235[69] B. Milch, B. Marthi, S. Russell, D. Sontag, D. L. Ong, and A. Kolobov. Blog: Probabilisticmodels with unknown objects. Ijcai International Joint Conference on Artificial Intelligence,pages 1352–1359, 2005. ISSN 10450823. → page 4[70] T. Minka, J. Winn, J. Guiver, Y. Zaykov, D. Fabian, and J. Bronskill. /Infer.NET 0.3, 2018.Microsoft Research Cambridge. http://dotnet.github.io/infer. → page 4[71] S. Naderiparizi, A. S´cibior, A. Munk, M. Ghadiri, A. G. Baydin, B. Gram-Hansen, C. S. de Witt,R. Zinkov, P. H. Torr, T. Rainforth, Y. W. Teh, and F. Wood. Amortized rejection sampling inuniversal probabilistic programming. arXiv preprint arXiv:1910.09056, 2019. → page v[72] V. Neumann. Various techniques used in connection with random digits. National Bureau ofStandards, Applied Math Series, 12:36–38, 11 1950. → pages 8, 16[73] A. B. Owen. Monte Carlo theory, methods and examples. 2013. → page 6[74] B. Paige and F. Wood. A compilation target for probabilistic programming languages. InProceedings of the 31st international conference on Machine learning, volume 32, pages1935–1943, 2014. → page 24[75] G. Papamakarios, T. Pavlakou, and I. Murray. Masked autoregressive flow for density estimation.Advances in Neural Information Processing Systems, 2017-:2339–2348, 2017. ISSN 10495258.→ page 1[76] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison,L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS 2017 Autodiff Workshop:The Future of Gradient-based Machine Learning Software and Techniques, Long Beach, CA, US,December 9, 2017, 2017. → pages 12, 13[77] P. Perdikaris, L. Grinberg, and G. E. Karniadakis. Multiscale modeling and simulation of brainblood flow. Physics of Fluids, 28(2):021304, 2016. ISSN 10897666, 10706631.doi:10.1063/1.4941315. → page 1[78] A. Pfeffer. Figaro: An object-oriented probabilistic programming language. Charles RiverAnalytics Technical Report, 137:96, 2009. → page 4[79] M. Raberto, S. Cincotti, S. M. Focardi, and M. Marchesi. Agent-based simulation of a financialmarket. Physica A: Statistical Mechanics and Its Applications, 299(1-2):319–327, 2001. ISSN18732119, 03784371. doi:10.1016/S0378-4371(01)00312-0. → page 1[80] E. Racah, S. Ko, P. Sadowski, W. Bhimji, C. Tull, S.-Y. Oh, P. Baldi, et al. Revealing fundamentalphysics from the daya bay neutrino experiment using deep neural networks. In Machine Learningand Applications (ICMLA), 2016 15th IEEE International Conference on, pages 892–897. IEEE,2016. → page 12[81] R. Ramaswamy and I. F. Sbalzarini. A partial-propensity variant of the composition-rejectionstochastic simulation algorithm for chemical reaction networks. The Journal of chemical physics,132(4):044102, 2010. → pages 1, 2[82] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. 32nd InternationalConference on Machine Learning, Icml 2015, 2:1530–1538, 2015. → page 136[83] D. Ritchie, B. Mildenhall, N. D. Goodman, and P. Hanrahan. Controlling procedural modelingprograms with stochastically-ordered sequential Monte Carlo. ACM Transactions on Graphics(TOG), 34(4):105, 2015. → page 2[84] D. Ritchie, P. Horsfall, and N. D. Goodman. Deep amortized inference for probabilisticprograms. arXiv preprint arXiv:1610.05735, 2016. → page 2[85] D. Schouten, A. DeAbreu, and B. Stelzer. Accelerated Matrix Element Method with ParallelComputing. Comput. Phys. Commun., 192:54–59, 2015. doi:10.1016/j.cpc.2015.02.020. → page12[86] A. S´cibior, O. Kammar, M. Va´ka´r, S. Staton, H. Yang, Y. Cai, K. Ostermann, S. K. Moss,C. Heunen, and Z. Ghahramani. Denotational validation of higher-order bayesian inference.Proceedings of the ACM on Programming Languages, 2(POPL):60, 2017. → pages 19, 23[87] A. M. S´cibior. Formally justified and modular Bayesian inference for probabilistic programs.PhD thesis, University of Cambridge, 2019. → page 2[88] T. Sjo¨strand, S. Mrenna, and P. Skands. Pythia 6.4 physics and manual. Journal of High EnergyPhysics, 2006(05):026, 2006. → page 10[89] A. Slepoy, A. P. Thompson, and S. J. Plimpton. A constant-time kinetic monte carlo algorithm forsimulation of large biochemical reaction networks. The journal of chemical physics, 128(20):05B618, 2008. → pages 1, 2[90] D. E. Soper and M. Spannowsky. Finding physics signals with shower deconstruction. Phys.Rev., D84:074002, 2011. doi:10.1103/PhysRevD.84.074002. → page 12[91] A. Stuhlmu¨ller and N. D. Goodman. Reasoning about reasoning by nested conditioning:Modeling theory of mind with probabilistic programs. Cognitive Systems Research, 28:80–99,2014. → page 2[92] M. Sunna˚ker, A. G. Busetto, E. Numminen, J. Corander, M. Foll, and C. Dessimoz. ApproximateBayesian computation. PLoS Computational Biology, 9(1):e1002803, 2013. → page 1[93] D. Tran, R. Ranganath, and D. Blei. Hierarchical implicit models and likelihood-free variationalinference. In Advances in Neural Information Processing Systems, pages 5523–5533, 2017. →page 2[94] B. Uria, M. A. Cote, K. Gregor, I. Murray, and H. Larochelle. Neural autoregressive distributionestimation. Journal of Machine Learning Research, 17:1–37, 2016. ISSN 15337928, 15324435.→ page 1[95] J.-W. van de Meent, B. Paige, H. Yang, and F. Wood. An introduction to probabilisticprogramming. arXiv preprint arXiv:1809.10756, 2018. → pages 2, 4, 5, 12[96] R. D. Wilkinson. Approximate Bayesian computation (ABC) gives exact results under theassumption of model error. Statistical Applications in Genetics and Molecular Biology, 12(2):129–141. → page 1[97] D. Wingate and T. Weber. Automated variational inference in probabilistic programming. arXivpreprint arXiv:1301.1299, 2013. → page 237[98] D. Wingate, A. Stuhlmueller, and N. Goodman. Lightweight implementations of probabilisticprogramming languages via transformational compilation. In Proceedings of the FourteenthInternational Conference on Artificial Intelligence and Statistics, pages 770–778, 2011. → pages2, 5, 13, 14[99] F. Wood, J. W. Meent, and V. Mansinghka. A new approach to probabilistic programminginference. In Artificial Intelligence and Statistics, pages 1024–1032, 2014. → pages 2, 4, 738Appendix ASupporting MaterialsA.1 Importance weight varianceTheorem 5. Under assumptions of Definition 4.1.1, if the following condition holds with positive prob-ability under x∼ q(x|y)Ez∼q(z|x,y)[p(z|x)2q(z|x,y)2 (1− p(A|x,z))]≥ 1 (A.1)then the variance of wIC is infinite.Proof. In this proof, we carry the assumptions and definitions from Definition 4.1.1, with the exceptionthat we use subscript for denoting weights and samples in iterations of the loop i.e. zk→ zk and wk→wk.Let Eq[∏L−1k=1 wk]denote the mean value of the product of all the weights corresponding to the rejectedsamples, and Eq[∏L−1k=1 w2k]is defined similarly. We will compute the variance of the importance weightof the rejected samples.First, we compute the mean of weights:Eq[L−1∏k=1wk]= q(A|x,y)∞∑L=1q(A|x,y)L−1L−1∏k=1Ez∼q[w|A] (A.2)Ez∼q[w|A]= Ez∼q(z|x,y)[ p(z|x)q(z|x,y) 1− c(x,z)q(A|x,y)]=1q(A|x,y)Ez∼p(z|x) [1− c(x,z)] =p(A|x)q(A|x,y) (A.3)Therefore,Eq[L−1∏k=1wk]= q(A|x,y)∞∑k=1p(A|x)k−1 = q(A|x,y)p(A|x) (A.4)Next, we compute the expected value of squared of weights:Eq[L−1∏k=1w2k]= q(A|x,y)∞∑L=1q(A|x,y)L−1L−1∏k=1Ez∼q(z|x,y)[w2|A] (A.5)39Ez∼q(z|x,y)[w2|A]= Ez∼q[ p(z|x)2q(z|x,y)2 1− c(x,z)q(A|x,y)]=1q(A|x,y)Ez∼q[p(z|x)2q(z|x,y)2 p(A|x,z)]:=Sp,qq(A|x,y)(A.6)Hence,Eq[L−1∏k=1w2k]= q(A|x,y)∞∑L=1(Sp,q)L−1 (A.7)Consequently, since the mean of the weights is finite (Equation A.4), if Equation A.7 is not finite, thevariance of the rejected weights will be infinite.if Sp,q ≥ 1⇒Var(k−1∏i=1wi)is infinite (A.8)Noting that if the variance of the weight of a subset of traces is infinite, the variance of the whole traceswould be infinite completes the proof.40A.2 Variable addressesAddress ID Full addressA1 [forward(xt:: xarray container¡xt:: uvector¡double, std:: allocator¡double¿ ¿, (xt:: layout type)1, xt:: svec-tor¡unsigned long, 4ul, std:: allocator¡unsigned long¿, true¿, xt:: xtensor expression tag¿)+0x5f; SherpaGen-erator:: Generate()+0x36; SHERPA:: Sherpa:: GenerateOneEvent(bool)+0x2fa; SHERPA:: Event Handler::GenerateEvent(SHERPA:: eventtype:: code)+0x44d; SHERPA:: Event Handler:: GenerateHadronDe-cayEvent(SHERPA:: eventtype:: code&)+0x45f; ATOOLS:: Random:: Get(bool, bool)+0x1d5; probprog RNG::Get(bool, bool)+0xf9] Uniform 1A2 [forward(xt:: xarray container¡xt:: uvector¡double, std:: allocator¡double¿ ¿, (xt:: layout type)1, xt::svector¡unsigned long, 4ul, std:: allocator¡unsigned long¿, true¿, xt:: xtensor expression tag¿)+0x5f;SherpaGenerator:: Generate()+0x36; SHERPA:: Sherpa:: GenerateOneEvent(bool)+0x2fa; SHERPA::Event Handler:: GenerateEvent(SHERPA:: eventtype:: code)+0x44d; SHERPA:: Event Handler:: Generate-HadronDecayEvent(SHERPA:: eventtype:: code&)+0x477; ATOOLS:: Random:: Get(bool, bool)+0x1d5;probprog RNG:: Get(bool, bool)+0xf9] Uniform 1A3 [forward(xt:: xarray container¡xt:: uvector¡double, std:: allocator¡double¿ ¿, (xt:: layout type)1, xt:: svec-tor¡unsigned long, 4ul, std:: allocator¡unsigned long¿, true¿, xt:: xtensor expression tag¿)+0x5f; SherpaGen-erator:: Generate()+0x36; SHERPA:: Sherpa:: GenerateOneEvent(bool)+0x2fa; SHERPA:: Event Handler::GenerateEvent(SHERPA:: eventtype:: code)+0x44d; SHERPA:: Event Handler:: GenerateHadronDe-cayEvent(SHERPA:: eventtype:: code&)+0x48f; ATOOLS:: Random:: Get(bool, bool)+0x1d5; probprog RNG::Get(bool, bool)+0xf9] Uniform 1A4 [forward(xt:: xarray container¡xt:: uvector¡double, std:: allocator¡double¿ ¿, (xt:: layout type)1, xt:: svec-tor¡unsigned long, 4ul, std:: allocator¡unsigned long¿, true¿, xt:: xtensor expression tag¿)+0x5f; SherpaGen-erator:: Generate()+0x36; SHERPA:: Sherpa:: GenerateOneEvent(bool)+0x2fa; SHERPA:: Event Handler::GenerateEvent(SHERPA:: eventtype:: code)+0x44d; SHERPA:: Event Handler:: GenerateHadronDe-cayEvent(SHERPA:: eventtype:: code&)+0x8f4; ATOOLS:: Particle:: SetTime()+0xd; ATOOLS:: Flavour::GenerateLifeTime() const+0x35; ATOOLS:: Random:: Get()+0x18b; probprog RNG:: Get()+0xde] Uniform 1A5 [forward(xt:: xarray container¡xt:: uvector¡double, std:: allocator¡double¿ ¿, (xt:: layout type)1, xt:: svec-tor¡unsigned long, 4ul, std:: allocator¡unsigned long¿, true¿, xt:: xtensor expression tag¿)+0x5f; SherpaGen-erator:: Generate()+0x36; SHERPA:: Sherpa:: GenerateOneEvent(bool)+0x2fa; SHERPA:: Event Handler::GenerateEvent(SHERPA:: eventtype:: code)+0x44d; SHERPA:: Event Handler:: GenerateHadronDe-cayEvent(SHERPA:: eventtype:: code&)+0x982; SHERPA:: Event Handler:: IterateEventPhases(SHERPA::eventtype:: code&, double&)+0x1d2; SHERPA:: Hadron Decays:: Treat(ATOOLS:: Blob List*, dou-ble&)+0x975; SHERPA:: Decay Handler Base:: TreatInitialBlob(ATOOLS:: Blob*, METOOLS:: Ampli-tude2 Tensor*, std:: vector¡ATOOLS:: Particle*, std:: allocator¡ATOOLS:: Particle*¿ ¿ const&)+0x1ab1;SHERPA:: Hadron Decay Handler:: CreateDecayBlob(ATOOLS:: Particle*)+0x4cd; PHASIC:: De-cay Table:: Select() const+0x76e; ATOOLS:: Random:: Get(bool, bool)+0x1d5; probprog RNG:: Get(bool,bool)+0xf9] Uniform 1Table A.1: Examples of addresses in the τ lepton decay problem in SHERPA (C++). Only thefirst 6 addresses are shown out of a total of 24,382 addresses encountered over 1,602,880executions to collect statistics.41
You may notice some images loading slow across the Open Collections website. Thank you for your patience as we rebuild the cache to make images load faster.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- On amortized inference in large-scale simulators
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
On amortized inference in large-scale simulators Naderiparizi, Saeid 2020
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | On amortized inference in large-scale simulators |
Creator |
Naderiparizi, Saeid |
Publisher | University of British Columbia |
Date Issued | 2020 |
Description | Motivated by the problem of amortized inference in large-scale simulators, we introduce a probabilistic programming library that brings us closer to this goal. This library enables us to perform Bayesian inference on any simulator written in a wide variety of programming languages, with minimal modification to the simulator's source code. However, there are challenges in achieving this goal in its most general meaning. In particular, we address the obstacles caused by unbounded loops. Existing approaches to amortized inference in probabilistic programs with unbounded loops can produce estimators with infinite variance. An instance of this is importance sampling inference in programs that explicitly include rejection sampling as part of the user-programmed generative procedure. We develop a new and efficient amortized importance sampling estimator. We prove finite variance of our estimator and empirically demonstrate our method's correctness and efficiency compared to existing alternatives on generative programs containing rejection sampling loops and discuss how to implement our method in a generic probabilistic programming framework. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2020-01-20 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0388330 |
URI | http://hdl.handle.net/2429/73351 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2020-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2020_may_naderiparizi_saeid.pdf [ 1.22MB ]
- Metadata
- JSON: 24-1.0388330.json
- JSON-LD: 24-1.0388330-ld.json
- RDF/XML (Pretty): 24-1.0388330-rdf.xml
- RDF/JSON: 24-1.0388330-rdf.json
- Turtle: 24-1.0388330-turtle.txt
- N-Triples: 24-1.0388330-rdf-ntriples.txt
- Original Record: 24-1.0388330-source.json
- Full Text
- 24-1.0388330-fulltext.txt
- Citation
- 24-1.0388330.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0388330/manifest