“One Weird Trick” for Advertising Outcomes: An Exploration of the Multi-Armed Bandit for Performance-Driven Marketing

by

Giuseppe Antonio Burtini
B.A., University of British Columbia, 2013

A thesis submitted in partial fulfillment of the requirements for the degree of
Master of Science
in
The College of Graduate Studies
(Interdisciplinary Studies)

THE UNIVERSITY OF BRITISH COLUMBIA
(Okanagan)

October 2015

© Giuseppe Antonio Burtini, 2015

Abstract

In this work, we explore an online reinforcement learning problem called the multi-armed bandit for application to improving outcomes in a web marketing context. Specifically, we aim to produce tools for the efficient experiment design of variations of a website with the goal of increasing some desired behavior such as purchases.

We provide a detailed reference, with a statistical lens, of the existing research in variants and associated policies known for the problem, then produce a set of theoretical and empirical analyses of specific application-area questions. Concretely, we provide a number of contributions:

First, we present a new standardized simulation platform integrating knowledge and techniques from the existing literature for the evaluation of bandit algorithms in a set of pre-defined worlds. To the best of our knowledge, this is the first comprehensive simulation platform for multi-armed bandits capable of arbitrary arms, parameterization, algorithms and repeatable experimentation.

Second, we integrate Thompson sampling into linear model techniques and explore a number of implementation questions, finding that both replication of Thompson sampling and adjusting for estimative uncertainty are plausible mechanisms for improving the results.

Third, we explore novel techniques for dealing with certain types of structural non-stationarity such as drift, and find that the technique of weighted least squares is a strong tool for handling both known and unknown drift.
Empirically, in the unspecified case, an exponentially decaying weight provides good performance in a large variety of cases; in the specified case, an experimenter can select a weighting strategy to reflect their known drift, achieving state-of-the-art results.

Fourth, we present the first known oracle-free measure of regret, called statistical regret, which utilizes intuitions from the confidence interval to produce a type of interval metric by replaying late-experiment knowledge over prior actions to determine how performant an experimenter can believe their results to be.

Fifth, we present preliminary results on a specification-robust and computationally efficient sampling technique called the Simple Optimistic Sampler, which shows promising outcomes via a technique that requires no modelling assumptions to implement.

Preface

This thesis is the original and independent work of the author, Giuseppe A. Burtini. The research was identified, designed, performed and analyzed by the author.

Sections 3.2 (Linear Model Thompson Sampling: LinTS) and 3.4 (Nonstationary Time Series Techniques) draw heavily from the published work Burtini et al. [36] (2015a), where Drs. Jason Loeppky and Ramon Lawrence provided an advisory role.

A variant of the work which appears in Chapter 2, in which Drs. Jason Loeppky and Ramon Lawrence provided an advisory and editorial role, has been submitted to Statistics Surveys (Burtini et al. [37]) and published on the preprint archive. The work which appears in Sections 3.3, 3.5 and 3.6 is intended to be submitted for external publication, either in whole or in part, at a future date.

All other work is unpublished as of this date.

The title “One Weird Trick” for Advertising Outcomes refers to a style of advertising popularized in 2013 after the acknowledgment of some influential experimental results in consumer psychology – highlighting just how fundamental, and even formulaic, the scientific approach of advertising has become.
The language of “One Weird Trick” itself has become memetic in online advertising and even in a minority of academic work [97]. This work discusses an approach to performance-driven experimentation appropriate for scientific advertising.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Symbols
Acknowledgments

Chapter 1: Introduction
  1.1 Problem Definition
  1.2 Motivation
    1.2.1 Specific Interest Area
    1.2.2 Other Applications
      Clinical Trials
      Adaptive Routing
      Portfolio Design
      Natural Resource Exploration
      Research and Development Investment
      Employee Resource Allocation
      Crowdsourcing
      General Real World Explore/Exploit Tradeoffs
  1.3 Research Contribution
    1.3.1 A Simulation Platform for Multi-Armed Bandits
    1.3.2 LinTS: The Regression Thompson Sampler
    1.3.3 Experiments in Thompson Sampling
    1.3.4 Time-Series Techniques for Non-Stationary Bandits
    1.3.5 Statistical Regret for Applied Bandit Models
    1.3.6 Simple Efficient Sampling for Optimistic Surrogate Models
  1.4 Outline

Chapter 2: Background
  2.1 The Stochastic Multi-Armed Bandit Model
    2.1.1 A Stylized Example
    2.1.2 Considerations
      Measures of Regret
      Variance and Bounds of Regret
      Higher Moments and Risk Measures of Regret
      Feedback Delay
      Problem Difficulty
      Stationarity of the Problem
        Change-Point Detection
        Kalman Filters
      Ethical and Practical Constraints
      Practical Significance
    2.1.3 Formalization
  2.2 Studied Problem Variants
    2.2.1 Traditional K-Armed Stochastic Bandit
      ε-greedy
        Constant ε
        Adaptive and ε-Decreasing
      ε-first
        Multiple Epoch
      UCB1
      UCB2
      UCB-Tuned
      MOSS
      Bayes-UCB
      KL-UCB
      POKER and Price of Knowledge
    2.2.2 K-Armed vs. Infinite-Armed Bandits
      Bandit Algorithm for Smooth Trees (BAST)
      Hierarchical Optimistic Optimization (HOO)
    2.2.3 Adversarial Bandits
      Hedge and Exp3
      Exp4
        Exp4.P
      Stochastic and Adversarial Optimal (SAO)
    2.2.4 Contextual Bandits
      LinUCB
      CoFineUCB
      Banditron and NeuralBandit
    2.2.5 Non-Stationary Bandits
      Discounted UCB(-T)
      Sliding-Window UCB(-T)
      Adapt-EvE
      Kalman Filtered Bandit
    2.2.6 Probability Matching and Thompson Sampling
      Optimism in Probability Matching
      The Bernoulli Approach to Nonparametric Thompson Sampling
      Bootstrap Thompson Sampling
      Change-Point Thompson Sampling (CTS)
  2.3 Application Area

Chapter 3: Towards the Use of Multi-Armed Bandits in Advertisement Testing
  3.1 An Extensible Platform for Simulating Bandit Problems
    3.1.1 Implementation
      Simulator
      Arms
      Policies
      Measurements
    3.1.2 Problems
  3.2 Linear Model Thompson Sampling: LinTS
    3.2.1 Optimistic Thompson Sampling in LinTS
  3.3 Experiments in Thompson Sampling
    3.3.1 Measure of Centrality
    3.3.2 Estimative Uncertainty
    3.3.3 Sampling Uncertainty
  3.4 Non-Stationary Time Series Techniques
    3.4.1 A Short Review of Stochastic Drift
      Generalized Linear Bandits
    3.4.2 Overview of the Approach
      Autoregression and Detrending
      Penalized Weighted Least Squares
    3.4.3 Simulation Environment
    3.4.4 Experimental Results
  3.5 Statistical Regret for Applications
    3.5.1 Traditional Parametric Statistical Regret
  3.6 Simple Efficient Sampling
    3.6.1 Simple Efficient Symmetric Sampling (SESS)
    3.6.2 Efficient Non-Symmetric Sampling (ENSS)
    3.6.3 A Short Background in Nonparametric Sampling
      Bootstrap Thompson Sampling
      The Eckles and Kaptein (2014) Model
    3.6.4 A Simple Efficient Nonparametric Sampler
      Simple Sampler
      Introducing Optimism to the Simple Sampler
      Experiments in Replication Strategies with SOS
    3.6.5 Using Categorical Contextual Variables in SOS
  3.7 Summary

Chapter 4: Conclusions
  4.1 Summary
  4.2 Future Work
    4.2.1 Theoretical Bounds in Low Sample Size Scenarios
    4.2.2 Prior Elicitation
      ...from Experts
      ...from Prior Experiments
    4.2.3 Risk-Aware Analysis
    4.2.4 Feedback Delay
    4.2.5 Contextual Variables
      Costs of Misspecification
      Clustering and PCA
    4.2.6 Speed and Computational Complexity

Bibliography

List of Tables

Table 2.1  A hierarchy of bandit problems, categorized by the adversarial bandits generalization in Audibert and Bubeck [11].
Table 3.1  Results from eliminating estimative uncertainty in the unbiased sampling case.
Table 3.2  Results from eliminating estimative uncertainty in the optimistic sampling case.
Table 3.3  Results of a selection of replication strategies.
Table 3.4  The robustness and performance of the distribution-free sampler compared to the traditional parametric sampler.
Table 3.5  Replication strategy experiments for SOS and Simple Sampler.

List of Figures

Figure 2.1  An example of playing an expected-suboptimal arm but achieving a high reward due to random variation.
Figure 2.2  An example showing how expected-expected regret, expected-payoff regret and suboptimal plays differ.
Figure 2.3  Thompson Sampling for Bernoulli Bandits.
Figure 3.1  Pseudocode of combined algorithm.
Figure 3.2  Adjusted average cumulative regret of selected algorithms over 1,000 replicates of all worlds and true drift forms.
Figure 3.3  The Simple Nonparametric Sampler.
Figure 3.4  The Fully Parameterized Simple Optimistic Sampler (SOS).
Figure 3.5  The Categorical-Contextual Simple Optimistic Bootstrap Sampler (SOS).

List of Symbols

This listing provides a reference for some of the common symbols used within this work.

x_{i,t}  The payoff received after selecting arm i of a multi-armed bandit process at time t.
x_i      The payoff received after selecting arm i of a multi-armed bandit process explicitly assumed to be stationary in time.
E_θ      Expectation taken over the distribution of an arm (equivalently, over the prior parameter θ for the arm distribution).
E        Expectation taken over both the random selection of an a priori fixed matrix of rewards and the actions of the player.
μ_i      The mean payoff from arm i. An alternative expression of E_θ(x_i).
μ*       The highest mean payoff of an arm. Alternatively, max E_θ(x_i).
K        The number of arms available in a multi-armed bandit. Usually a constant positive integer.
H        The horizon, or number of time periods to be played in a multi-armed bandit. A positive integer or infinity, often unknown.
n        The number of time periods consumed thus far when considering measures of regret taken prior to completion of the process.
n_j      The number of time periods thus far in which arm j has been selected. Equivalently, the number of observations of arm j.
S_t      The sequence of arm selections made by the player.
R        One of the many measures of regret. See Section 2.1.2.

Acknowledgments

An immense thank you to Drs. Ross Hickey and Jason Loeppky for their unwavering willingness (despite very busy schedules!) to consistently make time to hear my thoughts, listen to my concerns and contribute – in their interest, applications, excitement, ideas, motivation and encouragement – to my research and, in a broader sense, to my curiosity.

Of course, thank you to all my friends, but especially Graeme Douglas and Scott Fazackerley, who have made a regular effort to remain informed and interested in my research, critically examine my ideas and writing, and provide feedback of all varieties, positive and negative, as necessary.
In many ways, critical feedback is both the hardest to find and, to me, the most valuable, and these lifelong friends have been willing to provide it at every step of my work from inception to now.

Thank you to everyone (and my bicycle!) who put up with me when frustration, imposter syndrome or technical trouble pushed me to the edge and made me a little crazy, but especially to my parents, for reading my work after long hours of writing (even when it made no sense), for helping keep me on track in hard times and for their relentless lifelong support for me in absolutely everything I’ve ever tried to do.

Finally, thank you to Dr. Ramon Lawrence, my supervisor throughout the entirety of this work, for both the opportunity to work in his research group and the incredible motivation he provides through his drive, organization, attitude, assurances and encouragement. Ramon is an incredibly talented and driven individual, and his drive is contagious in a way which directly produces inspiration in those around him and indirectly produces a phenomenal network effect of mutual inspiration among his students.

Chapter 1: Introduction

The real world has a wealth of circumstances where one must simultaneously explore their surroundings, options or choices while also maintaining or maximizing some variable: their output, well-being or wealth. Indeed, the pursuit of a good life can be represented as finding a balance between “exploration,” the investigation of new options, and “exploitation,” the utilization of the knowledge one has already accrued.
People make these decisions on a regular basis, from career and education (“should I continue to invest in education, exploration, or begin exploiting the knowledge I have accrued?”) to shopping and selection (“should I purchase a product I have already experienced and am satisfied with, or should I explore other options?”) and everything in between.

These problems have been studied extensively in a human decision-making context, a game theory “maximizing” context, and the positive psychology “satisficing” context of increasing happiness through decision making. This type of decision making also arises in learning problems of the statistical and machine learning variety, and is represented within a broad class of problems called reinforcement learning. One common form of this exploration vs. exploitation tradeoff is the multi-armed bandit problem, which describes an environment where an unknowledgeable player in a casino must make repeated decisions about which slot machine (colloquially, a “one-armed bandit”) to play in order to maximize his or her total reward. This is the form of the sequential decision-making problem for exploration vs. exploitation that we will explore in this work.

Peter Whittle, an early researcher in the area, captured the elegant application of the multi-armed bandit and the importance of exploration vs. exploitation tradeoffs in his 1989 quote:

“Bandit problems embody in essential form a conflict evident in all human action: information versus immediate payoff.”

When the first efficient, tractable solution to the multi-armed bandit was presented [65], Whittle recalls [66] an early conversation with a colleague that highlighted the immense difficulty with which the problem was first seen:

Colleague 1: “What would you say if you were told that the multi-armed bandit problem had been solved?”

Colleague 2: “Sir, the multi-armed bandit problem is not of such a nature that it can be solved.”

Indeed, the problem was said [67] to have historically been seen as a weapon of war: Allied scientists proposed to deliver a related problem to Germany “as the ultimate instrument of intellectual sabotage.”

Before we begin with a more specific definition of the problem, a final quote to set the tone, and a perspective which inspires the theory of the multi-armed bandit:

“To consult the statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of.”
— R. A. Fisher, Presidential Address to the First Indian Statistical Congress, 1938.

1.1 Problem Definition

Often these exploration vs. exploitation problems arise in a business context, where the efficiency of such choices is a strong determinant of business success. This work addresses the challenges in multivariate experimental optimization in an online advertising firm environment. The theoretical lens which we develop for use within this context is a variant of the multi-armed bandit (MAB) problem. The multi-armed bandit problem, first¹ proposed by Robbins (1952), which built the new model on the sequential analysis work of Wald (1947) and others, has been extensively used to model the exploration-exploitation tradeoff in reinforcement learning and experiment design.

The traditional multi-armed bandit problem is the prime example of a sequential exploration-exploitation tradeoff problem. In the problem, an agent aims to balance gaining new knowledge with exploiting its current knowledge.

¹ Thompson (1933) provides an answer to a related question: how to identify the probability of a distribution being better than all others from a set of distributions, and has thus sometimes been credited as the origin of the multi-armed bandit. Further confounding the origins of the multi-armed bandit, Dr. Peter Whittle said in review of the 1979 paper of Gittins [67]: “As I said, the problem is a classic one; it was formulated during the war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage. In the event, it seems to have landed on Cardiff Arms Park. And there is justice now, for if a Welsh Rugby pack scrumming down is not a multi-armed bandit, then what is?” As World War II ended in 1945, this provides evidence that the problem was under discussion at least privately by the military, if not elsewhere, prior to the Robbins (1952) paper. Robbins (1952) is the first indexed paper to call the problem the multi-armed bandit, and provides a formulation similar to the formulation used to date.
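Purely as an illustrative sketch — not part of the simulation platform developed in this thesis, with arbitrary arm means and parameter values — this balance can be made concrete with a minimal ε-greedy player facing a simulated Bernoulli bandit:

```python
import random

def epsilon_greedy(true_means, horizon=10000, epsilon=0.1, seed=0):
    """Play a K-armed Bernoulli bandit with a constant-epsilon policy.

    With probability epsilon the player explores (a uniformly random arm);
    otherwise it exploits the arm with the best empirical mean so far.
    Returns the cumulative expected regret versus always playing the best arm.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k     # plays per arm
    sums = [0.0] * k     # total observed reward per arm
    best = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon or 0 in counts:
            arm = rng.randrange(k)  # explore
        else:
            # exploit: the arm with the highest empirical mean
            arm = max(range(k), key=lambda i: sums[i] / counts[i])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - true_means[arm]
    return regret

# Balancing exploration and exploitation accrues far less expected regret
# than exploring forever (epsilon = 1 is uniform random play).
print(epsilon_greedy([0.2, 0.25, 0.5]) < epsilon_greedy([0.2, 0.25, 0.5], epsilon=1.0))
```

Even this naive policy illustrates the central tension: too little exploration risks settling on a suboptimal arm, while too much squanders plays on arms already known to be poor.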
This applies in many variants in both commercial contexts – where vendors may wish to explore new products or new solutions while exploiting existing knowledge – and in numerous research contexts, from clinical trials to simulating animal behaviour to adaptive network routing.

The specific application explored in this thesis is motivated from the view of an online marketing firm aiming to maximize its revenue through an efficient balance of exploration (trying new solutions, learning) and exploitation (utilizing learned knowledge to profit). Specifically, the model with which we approach the problem is one where there is a hierarchical funnel of traffic, with each stage having its own vector of parameters and specific objective, but a payout not being achieved until the visitor has successfully performed an action in all the stages. The stages are outlined as follows.

1. T(t⃗): Traffic acquisition. This level is where most of the demographic information about the user is decided. Variables at this level are either individually fixed (gender, geographic location) or fluid (time of day) from the perspective of the system, and are either directly selectable (such as age or gender on some targeting platforms) by the targeting vector t⃗, or indirectly influenceable (such as factors related to state of mind, which may be influenced by the advertisement text and imagery, or immediately preceding thoughts, which may be influenced in aggregate by selection of a particular medium to purchase advertisements on). Every “trial” at this level has a cost which may or may not be fixed across repeated trials, depending on the particular media chosen. In one version of the model, traffic acquisition is assumed to come from a source that is much larger than the firm, where the firm cannot appreciably influence either the supply or price of available traffic.
In another version of the model, the firm is large in relation to the traffic source, and traffic can be exhausted or prices increased as demand increases.

2. P(T, w⃗): Pre-sales. During pre-sales, the user is shown text, video and graphical content on a website principally designed by a marketing professional with the assistance of both domain expertise and statistical testing. The controllable factors are denoted w⃗; however, the dimensionality of this vector is extremely high, with many of the variables going largely unexplored. This includes variables ranging from colour to word choice to ordering and page design and layout and much more. In many contexts, the pre-sales process may involve other interaction, including e-mail or phone call lead generation. In our model, the pre-sales function P is influenced by the result of the traffic acquisition step in a high-dimensional way: specifically, which users are selected affects the impact of each variable in the feature vector w⃗. This is the level at which we focus the majority of our experimentation; however, it is easy to extend this to testing at the targeting vector or product selection level.

3. C(P, β): Conversion. This is the final step, where the user becomes a customer. As the
Broadly,our task is to choose S = {−→t ,−→w , β} from a finite2 set of alternatives to maximize the sum of time-discounted revenue and satisfy competitive business objectives such as risk tolerance and ethicalfactors. In practice, this is generally achieved with experts selecting each of the choice variables.In our model, we explore the algorithms necessary to replace or augment this expert with anautomated experiment.In a real world application with risk management considerations, it is often important to considerthe problem within the context of “runs” and “horizons.” Specifically, for a given run of theproblem, the system will settle on a specific group of parameters that can be expected to runfor some horizon before exogenous factors change the state of the world or the business moveson. For a variety of reasons, it may not be optimal to maximize immediate profit with thatconsideration, if the selection of parameters that currently maximize profit will influence how futureusers respond. In modern business, this takes the form of ethical constraints (including “pricegouging” and discriminatory pricing) and an expectation of certain constraints on the pre-salesvector −→w . 
These competing objectives, from legal concerns to ethical constraints and other-than-profit business objectives, complicate the problem significantly.

Fundamentally, this problem (and many applications of the multi-armed bandit problem) is at the intersection and forefront of economics, operations research, game theory, decision theory, statistics and computer science in the way it combines computational learning and statistical estimates with an appropriate level of dynamic decision making and risk coordination in a real, applied business environment.

² In a variant of the problem, some of the variables are discrete and finite and others are theoretically continuous and unbounded, such as colour or font size selection; in practice, most selection parameters can be considered bounded to a relatively small range. Similarly, most observed factors about our users or experimental subjects (covariates or contextual variables) are categorical or easily discretized.

1.2 Motivation

1.2.1 Specific Interest Area

The area of online marketing can be considered one where a semi-random³ flow of users (each treated as an experimental subject) arrives at a website – each with its own context vector which describes properties of uncertain relevance to the objective function considered – the users are delivered a treatment (a variant of the “pre-sales” vector), and then at some time in the (near) future a reward (e.g., a purchase of a product) of variable value (possibly zero) is observed for that user. Each user has an explicit cost to the business that can range from very small (e.g., if they arrived via search engine traffic or word of mouth) to a large fraction of total expected revenue (e.g., if they arrived at the website via a pay-per-click or pay-per-lead service).

As a concrete, but by no means limiting, example, we can imagine the treatment being a selection between a red button and a blue button, with our objective (reward) being 0 or 1, indicating whether a user clicked the button or not.
The colour of the button could plausibly affect the likelihood of drawing the user's attention and thus attracting their click. The variants can range in scale from small micro-experiments (such as buttons, language choice or colours) to fully transformed versions of a website, as long as the objective (reward) is well-defined for all variants.

The motivation in exploring this area is one of improving business performance on the web. There is clear room for improvement in how testing is currently performed within the marketing context. The environment is well-modelled by the multi-armed bandit, but in the applied context, there is little application of the multi-armed bandit to this space.

Bandit strategies, in particular, appear especially useful for performing experimentation with the same type of complicating factors that arise in web experiments. In relatively small sample size environments, for example, by properly balancing exploration and exploitation, multi-armed bandit algorithms can achieve payoff rates that significantly exceed those of traditional testing models in general, but especially in the case where a traditional model would not yet have finished its experiment. Additionally, many of the explored bandit strategies do not require a fixed exploration period a priori, which is more appropriate in a business context where total users, horizon and other variables may not be predictable.

Importantly, the bandit model, as we will explore in future sections, generally does not require selection of any fixed significance level or other statistical assumptions prior to application.
This is important, as experiment designers in the online advertising context may not be statistically sophisticated and simply wish to apply a prebuilt solution to their exploration.

While web experimentation is the motivating field for the remainder of this work, we do not experiment specifically in a "real world" environment; rather, we work within a simulation environment to test and resolve a number of theoretical and empirical questions that are relevant to the application of the multi-armed bandit in a variety of environments.

³This level of the funnel can often be controlled, at least in part, by targeted advertising and selective user acquisition mechanisms.

1.2.2 Other Applications

Online advertising is not the only appropriate application of the multi-armed bandit problem. While not the focus of this work, the literature is rife with applications, many of which may benefit from a hierarchical or contextual analysis and provide further inspiration for immediate application. Most of the other applications discussed here are predictive or prescriptive in the sense that they act as a decision theory mechanism; however, there is a range of descriptive literature which utilizes bandit models as tools to attempt to explain empirically estimated behaviour [24, 25, 84, 91, 110, 127]. These are not explored in detail as they do not share much analytical similarity with the problem at hand.

Clinical Trials

One of the most widely discussed applications of the multi-armed bandit problem is that of differential allocation in clinical trials [82, 153]. In a clinical trial, real patients are being treated with medications with unknown properties – selecting among which is an exploration-exploitation tradeoff problem.
In the multi-armed bandit application, the goal is to simultaneously learn the properties of the medication (explore) and cure the patients (exploit): as a specific medication starts to show statistical promise, it can be prescribed to a larger fraction of the patients, creating more positive outcomes in expectation. The ethical implications of minimizing tail-effect losses, constraining total regret and monitoring for model error make this an interesting research area with high-value impact.

The medical domain is particularly amenable to a variant of the multi-armed bandit problem where there are covarying and confounding factors. In this variant, patient criteria for admission, selection into a particular trial or particular medication, and patient factors such as age, sex and other variables may influence treatment or interact with the medication in a meaningful way. The ability to consider this contextual information is also important when it can provide a more rapid or more accurate understanding of the world, as there is an extremely high cost associated with errors in this domain.

Adaptive Routing

The multi-armed bandit problem has been applied in the context of learning ideal routing paths through a computer network. This is specifically interesting as it can be phrased as a hierarchical (and hierarchical-contextual, if we have a vector of features for each node) problem traversing a directed (or undirected) graph. Simply, each route is tested and the median time, variance or other objective value of interest is monitored and minimized.

Portfolio Design

Hoffman, Brochu, and de Freitas [77] propose a mechanism that relies on a multi-armed bandit strategy to select among competing portfolio selection strategies called acquisition functions. This indicates a common theme in multi-armed bandit applications where competing mechanisms (or experts) are chosen via a bandit process.
Sorensen [136] explores modelling venture capital decisions within a bandit framework.

Natural Resource Exploration

In a number of natural resource exploration contexts, bandit models with side-information have been applied to balance "earning" (exploitation) and "learning" (exploration) within the search procedure. Given a fixed set of resources, a firm's decision between further exploration within their region and the harvest of the sites with maximal expected value based on currently observed information has been studied primarily in oil exploration [20, 21, 28]. There is room for similar application in this space for mineral exploration and optimal farmland utilization, as well as room to extend the model in general to support covariates of a planar type such as coordinate and geographic covarying parameters.

Research and Development Investment

Weitzman [158] introduced an example of the bandit problem in which a firm explores two competing technologies, the benefits of which are uncertain, to produce a commodity at a minimal cost. A substantial amount of work since has produced further development in how firms can optimally balance exploration and exploitation [45, 122] in research and development expenditure. Some of the literature extends the concept of the bandit model to have "switching costs" between arms [17], representative of the often extremely high cost of changing technique or investigating new production methods in a commercial context.

Employee Resource Allocation

Related to the research investment problem is the general problem of human resource management. Farias and Madan [58] produce a modification of the contextually informed model where arms cannot be discarded once pulled and utilize a packing-problem-type heuristic to admit a practical solution. This "cannot be discarded" trait approximates the hiring problem in a variable-skill, large-staff employment environment.
Many other papers have investigated the employee resource allocation problem within a variety of contexts: allocating tasks to staff, allocating staff to departments, selecting staff, and more. This is very different from the exploratory stochastic model we explore, but it demonstrates the generality of the model.

Crowdsourcing

Related to the employee resource allocation problem, crowdsourcing has recently become a popular solution for collecting and distributing human capital for a variety of creative and computational problems, from graphic design⁴ to microcomputation to research task allocation. Jain, Gujar, Bhat, Zoeter, and Narahari [80] introduce a multi-armed bandit model for assuring accuracy and minimizing total cost in a crowd-worker environment. The mechanism they propose, called Constrained Confidence Bound, identifies "quality consensus" to combine multiple competing advisors in a cost-optimal way. This problem arises again in our work when we introduce the expert system to our training function. The problem of combining competing expertise has been considered extensively in the psychology and management literature [85]; however, rarely have actionable decision mechanisms been proposed. Recently, a variety of other researchers have produced adaptive algorithms for multi-armed bandit treatment of the crowdsourcing and task assignment problems [2, 75, 146].

General Real World Explore/Exploit Tradeoffs

Dumitriu et al. [54] discuss an approach to the multi-armed bandit problem in the stylized context of professional golfers selecting a brand of golf ball. The real world is rife with exploration-exploitation tradeoffs, from selecting optimal fishing tackle for an angler to deciding upon an optimal career path for a student, many of which involve hierarchical or context-informed decisions.
A sufficiently simple algorithm to be performed by humans, or a decision-making toolset for professional management, could provide significant guidance in a variety of practical situations.

1.3 Research Contribution

In this section we enumerate and begin to motivate our research contributions, which include a standardized simulation platform, a set of new policies or algorithms for the multi-armed bandit problem and some insight into an online measurement of regret and so-called optimism in the face of uncertainty. Each of these contributions takes the form of concrete experimental, developmental or algorithmic work. In addition to these explicit contributions, the background section provides a significant contribution itself: an extensive survey of the literature, concisely enumerating and providing a taxonomy for much of the significant progress throughout the space of multi-armed bandit problems and their variants.

1.3.1 A Simulation Platform for Multi-Armed Bandits

To date, there is no standardized simulation platform for comparing stochastic multi-armed bandit policies. We produce a repeatable, parallelized and extensible platform for experimentation and a set of standardized worlds for comparisons. The count and associated parameters of the underlying payoff distributions are configurable and support non-stationarity in both changepoint and drift fashion, mixed distribution models and contextual covariates. Further, the platform supports integration with the PASCAL Exploration vs. Exploitation (EvE) Challenge dataset [79] and the Yahoo! real world news click-through data [103], two popular datasets for comparing the performance of stochastic bandit policies.

The importance of a preset standardized set of experiments is hard to overstate.

⁴For example, the online business 99Designs uses crowdsourcing to operate contests for producing logos, website layouts and other design elements.
Researcher-selected experiments are prone to an unintentional form of selection bias which may result in lower-quality results and a slower pace of research understanding and progress. While our available worlds aim to be general across many dimensions and assumption violations, the platform is extensible in a way that allows the creation of new worlds and new sets of experiments easily.

The platform computes as many meaningful empirical measures as we could define, including everything computable presented in Section 2.1 of this work. This includes all the computable variants of regret, variance and empirical confidence intervals of regret and other measures, divergence measures of difficulty, computation time and more. We include a small set of data visualization tools to transform the detailed iterate- and replicate-level outputs of the simulator into graphical representations of regret over iterates (for evaluation of small-sample and convergence behavior) and regret comparisons across varying categorizations of the underlying world (assumption violations, distributions, or size/scale of the problem) and policy or policy configuration.

This platform emerges as the tool which drives our further experimentation. Its extensible and repeatable nature allows us to rapidly develop tests for a variety of application-level and parameter-level questions, as well as experiment with new policies and variants of existing policies.

1.3.2 LinTS: The Regression Thompson Sampler

We first extend a popular policy called LinUCB [101] to solve a variant of multi-armed bandit problems with (linear) covarying factors called the contextual bandit problem.
This extension replaces a fixed substrategy called UCB (upper confidence bounds) with a technique from the efficient probability matching paradigm proposed by Thompson [144], producing an easily implemented, tractable, regression-based Thompson sampler which we call Linear Thompson Sampling (LinTS). We then utilize the simulation platform to show experimentally how various choices in LinTS affect the final results in terms of measures of success.

1.3.3 Experiments in Thompson Sampling

We return to analyzing a number of application area questions in Thompson sampling, especially those related to the impact of optimism in the face of uncertainty on the policy, finding some interesting trends in the performance of such a model. We show that excessive optimism, to an extent not previously considered in the literature, may provide a beneficial approach in certain probability matching environments, and we discuss the plausible implications of this result for the understanding of optimism. We further find a number of interesting results in this light, including a strict reduction in regret from a prescient uncertainty correction (to the extent it is available) which applies in the regression sampling case, and again consider the interaction of this effect with optimism in light of our understanding of optimistic exploration. Together, these results provide both guidance in implementation-level decision making with regard to the deployment of an efficient bandit process and an interesting theoretical direction for the understanding of optimism.

1.3.4 Time-series Techniques for Non-Stationary Bandits

Extending our work in defining LinTS, the regression-based Thompson sampler, we explore a number of time-series approaches to removing underlying trends and non-stationarities in the rewards process.
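One simple member of this family can be sketched as a per-arm discounted reward estimator. This is our own minimal illustration, not the exact weighted-least-squares formulation studied in the thesis; the class name, the intercept-only model and the decay rate γ are assumptions for exposition.

```python
class DiscountedMeans:
    """Exponentially-weighted reward tracking for drifting arms.

    Every observation's weight decays by `gamma` per update, so the
    per-arm estimate behaves like weighted least squares on an
    intercept-only model: recent rewards dominate and slow drift in the
    true means is tracked rather than averaged away.  gamma=1.0 recovers
    the ordinary sample mean.  Assumes one update call per time step."""

    def __init__(self, n_arms, gamma=0.99):
        self.gamma = gamma
        self.w = [0.0] * n_arms   # discounted effective sample size
        self.s = [0.0] * n_arms   # discounted reward sum

    def update(self, arm, reward):
        for i in range(len(self.w)):   # age every arm's statistics
            self.w[i] *= self.gamma
            self.s[i] *= self.gamma
        self.w[arm] += 1.0
        self.s[arm] += reward

    def estimate(self, arm):
        return self.s[arm] / self.w[arm] if self.w[arm] > 0 else 0.0
```

Plugged into any index or sampling policy in place of the plain sample mean, this gives the policy a bounded memory of the reward process, which is the essential idea behind the decay techniques discussed next.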
One especially promising technique is an exponentially-weighted decay process which allows the handling of a priori unknown forms of underlying drift and performs well on all tested variants of drift. Further, we show that if there is knowledge of the form of drift, one can perform near-optimally using an appropriately calibrated decay technique. We explore a number of other plausible approaches to time-series detrending that do not show significant promise in the unknown-drift-form application.

1.3.5 Statistical Regret for Applied Bandit Models

In general, the measure used to quantify the result of a bandit process is the regret – that is, a loss-function-type measure which captures the difference between the largest achievable payoff and the payoff actually achieved. To compute regret it is necessary to know the largest achievable payoff – that is, to have an oracle which provides the correct answer for comparison. Outside of a simulation environment, this is not a computable quantity: if an oracle were available, there would be no need to perform online exploration or the associated exploration vs. exploitation tradeoff decisions. We define a prediction-interval-based measure which allows the ex-post computation of a calibrated regret upper and lower bound. We show the interval behaves like a traditional confidence interval, allowing the user to configure γ, the interval's confidence level, which produces a tradeoff between tightness of the interval and the frequency with which the interval correctly includes the true value.

This is a significant contribution for the measurement, diagnostics and parameter selection of multi-armed bandit policies in an applied context. By the nature of the problem, there is little room for traditional experimentation to identify the ideal parameter structure, and an online analytical method is required.
Statistical regret shows that an evaluation process for the quality of parameter selection can be produced and that differing individual runs of a bandit process can be examined and compared quantitatively in practice.

1.3.6 Simple Efficient Sampling for Optimistic Surrogate Models

The principle of optimism in the face of uncertainty arises repeatedly in multi-armed bandit policies. Similarly, the principle of probability matching, especially the probability matching sampling technique of Thompson [144], is a regular tool used in the treatment of these problems. Combining the two, we provide a treatment of optimistic sampling with easily implemented, efficient techniques which both accurately capture the intended optimistic model and run in constant computational complexity. First, we present the simplest case, a technique for sampling optimistically from a symmetric distribution; we then extend that to a sampling technique for the optimistic surrogates of any generalized nonsymmetric distribution. These techniques, while both simple, are important, as some results in the literature have understated the effect of optimism by erroneously selecting a sampling technique which, while optimistic in the strict sense, does not accurately capture the intended surrogate model.

Finally, in this section, we provide a new technique for producing a distribution-free sampler which can take on both optimism and contextual variables as augmenting behaviors. We show with experimentation that this technique is at least as strong (in terms of regret) as the parametric, distribution-dependent sampler, while providing increased robustness to misspecification.

1.4 Outline

The rest of this thesis proceeds as follows.
In Chapter 2, we rigorously introduce the formal problem, introduce the different variables that are worthy of consideration for each potential selection policy, explore the policies that have been applied in the literature thus far, and finally discuss a selection of variants of the pure formal problem that are relevant to our application. When discussing policies, we have chosen to err on the side of intuitive and applicable explanations, leaving any exposition of correctness or asymptotics to the referenced papers. In Chapter 3, we discuss our research contributions from both a theoretical and applied perspective and integrate the results into the existing literature. Finally, in the last chapter, we discuss future work in a broad sense and explore avenues for further research that would both advance the state of the theoretical literature and provide large value in an economic sense.

Chapter 2

Background

Chronologically, Robbins [125] introduces the idea of the important tradeoff between exploration and exploitation in recurring decision problems with uncertainty, building on the prior sequential decision problem work of Wald [154] and Arrow et al. [8]. Lai and Robbins [99] produce the first asymptotic analysis of the objective function of such a decision process, showing a bound of O(log t) for the regret of the standard stochastic, finite-armed, multi-armed bandit, and produce a regret-efficient solution where the rewards of a given arm are stationary and independent and identically distributed (i.i.d.) with no contextual information or covariates. In the coming sections, we will explore some of the large number of algorithms which have been proposed since Lai and Robbins [99], including ε-exploration approaches [152, 156], upper-confidence bound techniques [16], probability matching techniques [6] and others.
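As a concrete sketch of the probability matching family just mentioned, a Beta-Bernoulli Thompson sampler fits in a few lines. This is our own illustration, not code from any cited work; the class name and the uniform Beta(1, 1) priors are assumptions.

```python
import random

class BetaBernoulliTS:
    """Thompson sampling for Bernoulli-reward arms.

    Each arm keeps a Beta(a, b) posterior over its success probability;
    each round we draw one sample per posterior and play the arm with the
    largest draw, so each arm is selected with probability matching its
    posterior probability of being the best."""

    def __init__(self, n_arms, seed=0):
        self.rng = random.Random(seed)
        self.a = [1.0] * n_arms   # successes + 1 (uniform prior)
        self.b = [1.0] * n_arms   # failures + 1

    def select(self):
        draws = [self.rng.betavariate(a, b) for a, b in zip(self.a, self.b)]
        return draws.index(max(draws))

    def update(self, arm, reward):   # reward is 0 or 1
        self.a[arm] += reward
        self.b[arm] += 1 - reward
```

Exploration here is implicit: an arm with a wide posterior still produces occasional large draws and so keeps being tried, without any fixed exploration schedule.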
Many variants of the initial problem have also been investigated in the literature, including the many- or infinite-armed, adversarial, contextual, and non-stationary cases.

Initially, bandit problems were discussed in the context of the sequential design of experiments [99, 125] and adaptive experiment design, as a natural progression of the experimental design literature which seeks to design efficient tools for identifying causal effects. Recently, a policy or decision-theoretic lens has been used, characterizing the problem as a recurrent policy selection to maximize some utility function or minimize some regret function.

The terminology "bandit" originates from the colloquialism "one-armed bandit" used to describe slot machines. It is important to note immediately that this stylized terminology suggests the negative expected value embodied in the word "bandit," despite this being neither a requirement nor common in most applications of the model. In one common formalization of the multi-armed bandit model, the player can sequentially select to play any arm from the K ≥ 1 arms and obtain some payoff x ∈ R with some probability p ∈ [0, 1]. This specific formalization has explicitly defined binomial arms (fixed payoff, random payoff probability); however, in our interpretation of the problem, the distribution on the arms can be (and often is) from any distribution.

In all variants of the bandit problem only the payoff for the selected arm at any time step is observed, and not the payoffs for non-selected arms. This is the "partial information" nature of the bandit problem which distinguishes the problem from generalized reinforcement learning, where "full information" (observing the payoff of all arms, whether selected or not) is usually assumed. In many variants of the problem, the reward distribution itself is not assumed; however, rewards are assumed to be i.i.d. across arms and across prior plays.
Some variants of the problem relax both of these assumptions. Similarly, it is generally assumed that both K, the number of machines (or "arms"), and x, the payoff function, are finite and stationary (time-invariant); however, variants relaxing these assumptions have also been considered. An important variant of the problem introduces the concept of a known, finite time horizon H to bound play time. Two important variants, the contextual problem and the adversarial problem, remove assumptions such that there are fewer or even no statistical assumptions made about the reward generating process.

In our motivating web development example, the "arms" in the multi-armed bandit become the variants of a website being experimented with. For example, there may be three arms: one where a red button is displayed to the user, one where a yellow button is displayed and a final variant where a blue button is displayed.

2.1 The Stochastic Multi-Armed Bandit Model

2.1.1 A Stylized Example

Throughout the rest of this work, an example will be valuable to elucidate variants, applications and algorithms. Imagine yourself walking into a casino and being presented with a row of slot machines. Prior to your visit, you had absolutely zero knowledge about the payoff probabilities of the machines: they may have negative, positive or neutral expected value, and they may all be the same or they may not. Your prior information is complete uncertainty¹. The challenge, broadly, is to identify how to play the machines to maximize your return. This is a challenge of balancing exploration (collecting new, high-value information to minimize uncertainty) with exploitation (using the knowledge you have already acquired to gain the expected reward). An obvious intuitive solution is to start recording results and play each machine some fixed number of times (or randomly) to get an estimate of the payoff rate of each machine, then, if the expected payoff is positive, play the strongest machine indefinitely into the future.
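The intuitive record-then-commit solution just described can be sketched directly. This is a toy illustration of ours; the `pull(arm)` callback and the per-machine exploration budget are assumptions, not part of any formal strategy.

```python
def explore_then_commit(pull, n_arms, n_explore_each, n_total):
    """Play every machine a fixed number of times, estimate each payoff
    rate from the recorded results, then play the empirically best
    machine for all remaining turns.  `pull(arm)` returns one observed
    payoff.  Returns the committed arm and the total payoff collected."""
    history = [[] for _ in range(n_arms)]
    rewards = []
    for arm in range(n_arms):                 # fixed exploration phase
        for _ in range(n_explore_each):
            r = pull(arm)
            history[arm].append(r)
            rewards.append(r)
    means = [sum(h) / len(h) for h in history]
    best = means.index(max(means))            # commit to the winner
    for _ in range(n_total - n_arms * n_explore_each):
        rewards.append(pull(best))
    return best, sum(rewards)
```

The weakness is visible in the code itself: the exploration budget is fixed up front, so a noisy estimate at the commit point is never revisited.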
This solution is very close to the strategy known as ε-first, which, while intuitive, has a number of undesirable properties and can be outperformed by more sophisticated strategies which we will explore in more detail later in this chapter.

¹However, in some Bayesian bandit contexts, perhaps you wish to consider the variant where you have prior information and wish to explore further.

2.1.2 Considerations

In the following section, we explore some of the important considerations and measures which vary across different problems and policies.

Measures of Regret

There are a number of possible objective functions to evaluate a bandit process that have been used within the literature. In general, regret-based objectives are loss functions which compare the performance of a process with an oracle (a policy which ex ante knows all information about the underlying distributions) over a finite and discrete time horizon t = 1, ..., H. The definitions vary on how the stochastic nature of the process is considered and what information and selections are made available to the oracle. The sequence of arms selected, S_t, can itself be a random variable (such as in the Thompson sampling case discussed later). Expectation E_θ is taken over the parameters of the arm distribution(s) θ, which are taken as fixed per arm and time a priori.

For most practical applications of multi-armed bandit models this is the most important conceptual variable to consider. Unfortunately, regret is not consistently defined in the literature as a single variable of interest, but generally takes one of a variety of forms. In a pragmatic sense, regret has the same meaning as it has in English: the remorse (which would be) felt as a result of dissatisfaction with the agent's choices. We enumerate some of the varieties of regret that are discussed (explicitly or implicitly) within the literature here.

1. Expected-Payoff (Strong-) Regret (\bar{R}_P).
This is the difference between the expected payoff an oracle would have earned, selected per play (\sum_t \max_i x_{i,t}), and what the algorithm being tested earned (\sum_t x_{S_t,t}). Formally,

    \bar{R}_P = \sum_{t=1}^{H} \Big( \max_{i=1,2,\ldots,K} E_{\theta_{i,t}}[x_{i,t}] \Big) - \sum_{t=1}^{H} x_{S_t,t}.    (2.1)

In this measure, regret is a random variable in both θ and S_t. It is possible to produce an \bar{R}_P that is negative (indicating that the policy outperformed the expectation), and it is possible for two runs to display regret incongruent with the best policy in expectation. Figure 2.1 shows a two-arm example where expected-payoff regret would compute a regret of negative 2 (for a single play), assuming the policy picked Arm 1 and drew a value at the solid vertical green line. The dashed lines indicate the expectation. This regret of −2 = 8 − 10 arises despite playing the suboptimal (in expectation) arm (arm 1). This measure has also been referred to as empirical regret by, for example, Eckles and Kaptein [55], but other works, such as Maillard [104] and Seldin, Szepesvári, Auer, and Abbasi-Yadkori [134], have used the term empirical regret with a range of different meanings.

[Figure 2.1: An example of playing an expected-suboptimal arm but achieving a high reward due to random variation. The solid vertical line indicates the payoff realized (x_{1,t} = 10) from the selected arm (S_t = 1).]

The main motivation for this form of regret is its ability to compute meaningful results appropriate for risk-aware bandits (such as an application in medical trials with risk limitations or portfolio design with maximum drawdown limitations). Some interesting statistics possible with this form of regret include the standard deviation and the γ-percentile.

2. Expected-Expected (Strong-) Regret (\bar{R}_E). Similar to expected-payoff regret, this form additionally takes the expectation over the selected arm's payoff.
In simulations with well-defined expected values, expected-expected regret can quantify the achievement of the policy without considering statistical variance. Formally,

    \bar{R}_E = \sum_{t=1}^{H} \Big( \max_{i=1,2,\ldots,K} E_{\theta_{i,t}}[x_{i,t}] \Big) - \sum_{t=1}^{H} E_{\theta_{S_t,t}}[x_{S_t,t}].    (2.2)

In general, existing regret proofs hinge on this definition of regret. Expected-expected regret may be a biased measure of the true parameter of interest in the case of a non-symmetric sampling process (S_t) when there is a focus on risk (such as in the case of a drawdown-sensitive financial market policy or another loss-mitigation objective function). In this measure, regret is no longer a random variable in θ. An arm selection fully determines the regret measure for a single draw, independent of the drawn value. In the stationary case, the expectation E_{\theta_{i,t}} can be written without the t in the subscript, as the distribution does not depend on time. In our prior example, if arm 1 (the suboptimal arm) is selected and a payoff of 10 is observed as shown, the expected-expected regret will still be E_θ[x_2] − E_θ[x_1] = 8 − 7 = 1.

3. Adversarial Regret (\bar{R}). Adversarial regret, also called "weak regret," admits variants constructed from each of the prior definitions of regret. In adversarial regret, the oracle must pick only one arm for all plays. Specifically, the maximization operator is moved outside the summation to select only the single arm which is optimal in aggregate over all plays. In summary, adversarial regret measures the difference between the single best arm being played repeatedly and the choices the algorithm made. Formally,

    \bar{R} = \Big( \max_{i=1,2,\ldots,K} \sum_{t=1}^{H} E_{\theta_{i,t}}[x_{i,t}] \Big) - \sum_{t=1}^{H} E_{\theta_{S_t,t}}[x_{S_t,t}].    (2.3)

With the maximization outside the summation, we select a single arm i for all t, a result that compares played arms only to the best in aggregate, naive to per-iterate changes in the ideal arm.

This is used most often in the adversarial bandits literature, where "best arm per play" regret results may no longer be ideal due to the assumptions of the model, but it is also sometimes used in non-adversarial literature for reasons of accident or provability. In particular, under traditional measures of regret the learning problem can be made arbitrarily hard in the adversarial setting: for example, a large number of low-payoff arms may be provided among one high-paying arm selected randomly. A traditional measure would report a high level of regret, while the adversarial measure considers only the arm which has the highest total payoff over all iterates.

4. Simple Regret (\tilde{R}). Like adversarial regret, simple regret variants can be constructed of each of the prior variants of regret. Simple regret contrasts with other regret measures by being non-cumulative; that is, it only considers the regret at the current time. Simple regret at time H is the difference between the selected arm and the best arm at that point in time, with no consideration for prior plays. Formally, for expected-expected simple regret,

    \tilde{R}_E = \Big( \max_{i=1,2,\ldots,K} E_{\theta_{i,H}}[x_{i,H}] \Big) - E_{\theta_{S_H,H}}[x_{S_H,H}].    (2.4)

The intuition of the value of simple regret lies in the limiting properties of a well-behaved bandit algorithm. It is expected that a bandit algorithm will, in the probability limit, converge to playing the best arm, and as such, simple regret will converge to zero. In some variants of the problem, such as infinitely-armed bandits [39], simple regret is available where other regret measures are not.
A common variant of simple regret which draws on the same intuitions is long-term average regret, defined as any traditional regret measure (say, expected-expected regret) divided by the horizon (\bar{R}_E / H), as the horizon tends to infinity.

5. Bayesian Regret (\bar{R}_{Bayes}). The election science literature (among others) gives us a technique called Bayesian regret that lends itself well to bandit analysis through the frequent application of Bayesian-derived strategies. Bubeck and Liu [31], Kaufmann et al. [89] and others provide the first use of variants of this measure in a bandit context. The difference between Bayesian regret and the traditional regret measures is only that the parameters of the distributions (θ_{i,t}) are taken explicitly to be themselves drawn from some prior distribution parameterized on Θ. Formally,

    \bar{R}_{Bayes} = E_\Theta[\bar{R}],    (2.5)
    \theta_{i,t} \sim f(\Theta).

Many bandit policies (especially sampling-based techniques) are inherently Bayesian in nature (despite often being analyzed in a frequentist framework), lending Bayesian regret credibility as a reliable formalization of the concept.

6. Suboptimal Plays (N_\vee). Suboptimal plays is a simple measure of regret accrued by counting the number of plays that were not of the best arm. This does not account for the scale of the difference. Formally,

    N_\vee = \sum_{t=1}^{H} \mathbf{1}\big[ S_t \ne \arg\max_{i=1,2,\ldots,K} E_{\theta_{i,t}}[x_{i,t}] \big].    (2.6)

In the simple (stationary, context-free) two-arm case, suboptimal plays is a sufficient measure for comparing algorithms, as the gap between all suboptimal decisions (there is only one) is the same. In the three-arm case shown in Figure 2.2, we can see that arm 3 is much worse than the next best arm (arm 1); however, if either arm is played, we increment a simple counter of suboptimality.

The biggest advantages of suboptimal plays over other measures are unit independence and scale invariance.
This simple measure provides a count of plays that were not the best, with no consideration of how suboptimal they were.

Two further definitions of regret show up in the literature, generally referred to as expected regret, denoted $ER$, and pseudo-regret, referred to as $\bar{R}$. These arise more frequently in work derived from the adversarial and non-stochastic approaches to representing bandit problems, as they require a different perspective on the problem than we take. Specifically, if a bandit process is represented as a fixed a priori matrix of dimensions H (time horizon) by K (number of arms), where each entry represents a payoff with possibly fixed structure², we can eliminate the underlying essential randomness from the consideration³. We use a capital X to denote a two-dimensional matrix of the form previously described.

[Footnote 2: Structure allows representation of stationary or non-stationary payoffs, distribution-derived payoffs and other complicating factors.]

[Figure 2.2: A play from arm 3 (the most suboptimal, with expectation 6.5) is drawn, earning a reward of 7.5, giving an expected-expected regret of 10 − 6.5 = 3.5, an expected-payoff regret of 10 − 7.5 = 2.5 and a suboptimal-plays regret of just 1, a counter for the number of times an arm other than the optimal was played. The solid vertical line indicates the payoff realized ($x_{3,t} = 7.5$) from the selected arm ($S_t = 3$). Arm means: $\mu_1 = 9$, $\mu_2 = 10$, $\mu_3 = 6.5$.]

7. Expected Regret ($ER_n$) [30]. Expected regret is treated as the gold standard in the literature which uses it. This is distinct from what we are calling expected-expected regret ($\bar{R}_E$) in what is meant by the word "expected." Formally,

$$ER_n = \mathbb{E}\left[\max_{i=1,2,\dots,K} \sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{S_t,t}\right]. \qquad (2.7)$$

In this case, expectation is taken both over any randomness in the creation of the rewards matrix $X_{i,t}$ and any randomness in the selection of $S_t$.
As such, this produces a measure of regret which requires the agent to be prescient to any purely random factors in the reward matrix in order to achieve zero regret. In other words, it is no longer sufficient to select the arm with the highest expected payoff (expected in what is possible to know); rather, one must select the arm with the highest actual payoff in order to achieve zero regret. This way of thinking eliminates any stochastic nature from the forecasting discussion, as we measure against the ideal world state for a particular instance of the game (inside the expectation) and then average that over all possible world states (outside the expectation).

[Footnote 3: This is in some sense an operationalization of a bandit process represented with true randomness, as in practice, the randomness is implemented with a pseudo-random number generator, which amounts to a vector of fixed a priori numbers derived from some seed.]

This perspective of evaluation is distinct from the techniques generally applied to perform forecasting in a stochastic bandit, where knowledge is expressed with uncertainty and the underlying world has some inherent (unexplainable) uncertainty.

8. Pseudo-Regret ($\bar{R}_n$) [30]. This measure is identical to the regret we refer to as adversarial regret with the exception that the definition of expectation differs. As in the previous definition, expectation $\mathbb{E}$ here is defined to be expectation both over the creation of the random rewards matrix and over any randomness in the selection process – that is, it produces a weighted average over all possible world states, while in adversarial regret expectation is pushed in to be over the reward distribution. Formally,

$$\bar{R}_n = \max_{i=1,2,\dots,K} \mathbb{E}\left[\sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{S_t,t}\right]. \qquad (2.8)$$

In this case, the optimal decision is defined only in expectation.
In the stationary, context-free stochastic setting, this reduces to a simpler equation where $\mu^*$ is the mean payoff of the best arm, that is, using our earlier definition of expectation, $\mu^* = \max_{i=1,2,\dots,K} \mathbb{E}_{\theta}[x_i]$. Formally,

$$\bar{R}_n = n\mu^* - \sum_{t=1}^{n} \mathbb{E}[\mu_{S_t}]. \qquad (2.9)$$

Furthermore, Audibert and Bubeck [11] relate these two quantities and show that $\bar{R}_n \leq ER_n$ via an argument reliant on Jensen's inequality; in the stationary, context-free stochastic case, they further show the gap is bounded by $ER_n - \bar{R}_n \leq \sqrt{\frac{n \log K}{2}}$.

In the rest of this work, we view the bandit problem as an exclusively stochastic problem (from the perspective of the agent) where arms are represented as unknown distributions, and thus avoid utilizing these two measures of regret.

As we will show in the following sections, in experiments, the selection of regret measure is paramount to the understanding of policy decisions.

The variable definitions of regret are a serious issue in comparing bandit algorithms in many contexts. In much of the published research, it is difficult to identify which definition of regret is utilized, as it is quite easy to phrase a plain-language definition of regret in a form that leaves the oracle criteria ambiguous⁴.

Lai and Robbins [99] demonstrated that the lower bound of regret for policies for the general stochastic multi-armed bandit problem was $O(\log t)$.
This lower bound does not hold for all variants of the problem, and indeed it is easy to construct synthetic variants of the problem with zero minimum regret.

[Footnote 4: As an example – though definitely not to single out these authors, as ambiguity of this sort is common – a sentence defining regret as "the difference between the sum of rewards expected after N successive arm pulls, and what would have been obtained by only pulling the optimal arm" appears in Granmo and Berg [70], and it is not in itself clear whether "only pulling the optimal arm" means the singular mean optimal arm (as in pseudo-regret), the variable per-play optimal arm (as in strong regret), or even the prescient form of expected regret.]

It is important to note that regret is further complicated in a number of variants of the problem. Broadly, regret is only a simply defined concept in the case where there is a single best arm (or a set of indistinguishable best arms). In any variant of the problem where the best arm changes over time, plays or context, or in any variant where the payoff choice is chosen adversarially or from a non-stationary distribution, there will be substantial complications to the analysis of regret, in particular with regard to what information the oracle has available to it.

Variance and Bounds of Regret

Similarly to the well-known bias/variance tradeoffs in the estimator selection literature, variance minimization is an important consideration for many applications. In the most extreme case, a high-variance algorithm will be completely unacceptable in a practical sense even if it had a minimal expectation regret.
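The point about regret variance can be illustrated with a Monte Carlo sketch (illustrative code, not the thesis's platform): an explore-then-commit run has low regret when it commits correctly and large regret when an unlucky exploration phase locks in the wrong arm, so the distribution of total regret across repeated runs has high variance even when its mean looks acceptable.

```python
import random
import statistics

def explore_then_commit_regret(means, explore, horizon, rng):
    """Total expected-expected regret of one explore-then-commit run."""
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    best = max(means)
    regret = 0.0
    for t in range(horizon):
        if t < explore:
            arm = t % k                       # exploration: round-robin plays
        else:
            arm = max(range(k), key=lambda i: sums[i] / counts[i])  # best empirical mean
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret

rng = random.Random(1)
runs = [explore_then_commit_regret([0.45, 0.55], 10, 1000, rng) for _ in range(200)]
mean_r = statistics.mean(runs)
var_r = statistics.pvariance(runs)  # large: some runs commit to the wrong arm
```

A high-probability bound summarizes exactly this spread: most runs sit near the minimum regret of the exploration phase, but a non-trivial fraction accrue near-linear regret.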
Certain choices in the measurement of regret are more appropriate than others if one chooses to measure variance. For instance, selecting expected-payoff regret allows one to measure the variance of an individual payoff (e.g., a next-play payoff variance), while expected-expected regret will only allow measurement of variance across repeated plays of the full sequence.

Often, rather than measuring variance of a policy, a high-probability bound on the regret is proposed. High-probability bounds arise from the "probably approximately correct" learning (PAC learning) literature [149], which provides a framework for evaluating learning algorithms with traditional asymptotic measures from computational complexity theory. The language PAC($\varepsilon$, $\delta$)-learnable provides that an algorithm exists such that it can learn the true value with an error rate less than $\varepsilon$ with probability of at least $1 - \delta$ in the limit or within a finite number of fixed iterates.

High-probability bounds introduce this PAC learning framework to regret analysis. The primary difference from the regret in expectation is that a high-probability bound considers the potential variation in return, which can be (and is, in many cases) large, a factor which is very important in medical and many financial contexts, where stakes are high. Intuitively, a high-probability bound provides a function that can be used to evaluate a bound at a given probability.
This is a statement of the form $P[R(n) > b_{\mathrm{high\ probability}}] \leq \delta$, where $R(n)$ is the regret after n plays, $b_{\mathrm{high\ probability}}$ is the high-probability bound and $\delta$ is the probability with which we evaluate the "high" probability. The parameter $\delta$ is often picked to be a function of the form $n^{-c}$ for a fixed constant c [9].

To provide an intuitive understanding of high-probability bounds compared to expectation regret, consider the slot-playing ε-first example: if we have two slot machines to pick between, we explore 10 times (5 each) to measure our empirical estimates of each arm, then exploit the best measured machine forever. In expectation the best machine will be picked; however, randomness and the small number of plays may result in a single win dominating the results and causing the estimate of the best machine to be incorrect. While the expectation of regret in this model is zero, the variance of regret is high: in many⁵ plays, the regret will tend to infinity. The high-probability bound, in general, will say that for p% or fewer of the repeated plays of the game, the regret will exceed b. In this specific example, it is likely the bound will be very weak for any reasonable δ, as in this particular strategy, a small number of lucky early explorations will result in suboptimal plays forever, accruing large regret.

Historically, prior to high-probability bounds being introduced for the multi-armed bandit problem, strict bounds were provided for certain algorithms. These bounds tend to be much weaker than the high-probability bounds found in more recent research.

Higher Moments and Risk Measures of Regret

In some contexts, other parameters of the shape of the regret distribution may prove relevant. There is very little work analyzing or applying moments higher than variance. Skewness and kurtosis may interest some applications, especially those where risk is being calculated upon the notion of regret.
In general, while important, risk-aware or risk-averse bandits are conjectured by Audibert and Bubeck [11] and others to be a significantly more complex problem, and few major publications explore these problems. In one of the few, Carpentier and Valko [38] propose a measure of regret that is based on the extreme value of the arm distribution, producing an algorithm which minimizes tail events in a constrained environment. There is a substantial literature of risk measurement and mitigation in financial market research from which the bandits literature may benefit in the future.

Feedback Delay

In many problems, there is a simultaneity issue with training: new trials may run before the rewards for older trials have been observed. In other problems, there is a tradeoff between computation time (to recompute decision parameters for the model) and model accuracy: a longer computation time creates more feedback delay and more accurate answers, but the feedback delay affects the currency of the answers. For example, a user may be delivered an advertisement and still be reading or navigating through the process, or a computation process for that user may not have completed, when another user arrives; this delay is called feedback delay. Feedback delay is an important, often unexplored, consideration for multi-armed bandit policies, with some policies being unviable in high-delay environments, producing an environment where fast suboptimal (in terms of theoretical regret) policies may outperform slower, theoretically-optimal policies.

Problem Difficulty

In the context of quantifying performance of algorithms in a distribution-agnostic sense, a measure of "hardness" is important. Specifically, a problem where the difference between the optimal arms and non-optimal arms is small seems to have some "intrinsic difficulty" that applies across algorithms.

[Footnote 5: Dependent on the gap between the two machines.]

In order to capture this intrinsic difficulty, Audibert et al.
[12] introduced two measures of hardness of identifying the best distribution in a set of K distributions, which they go on to prove are within a logarithmic factor of each other. The measures presented are

$$H_1 = \sum_{i=1}^{K} \frac{1}{\Delta_i^2} \qquad (2.10)$$

and

$$H_2 = \max_{i \in \{1,\dots,K\}} \frac{i}{\Delta_i^2}, \qquad (2.11)$$

where $\Delta_i$ is the difference between the $i$th arm's payoff and the optimal arm's payoff.

Further, there is existing work in the statistics literature aiming to quantify the difference between distributions. Many statistical tests are dependent on the idea of quantifying the distance between an assumed true distribution and the observed empirical distribution. One such measure, Kullback-Leibler (K-L) divergence, has been proposed [30, 99, 105] as an appropriate tool for the bandit problem. Broadly, Kullback-Leibler divergence is the logarithmic ratio between two distributions with regard to the first distribution. Formally,

$$D(P,Q) = \int_{-\infty}^{\infty} \ln\left(\frac{p(x)}{q(x)}\right) p(x)\,dx, \qquad (2.12)$$

where $p(x)$ and $q(x)$ are the probability density functions of P and Q respectively. While this is not a true metric, in that it does not satisfy the triangle inequality nor is it symmetric, for our purpose it can be used to quantify the distance between the optimal arm and all other arms. In the two-arm case, this is a valid measure of the desired property. In the $k > 2$ arm case, an arm-iterated, regret-weighted K-L divergence can be proposed of the form

$$D_{\mathrm{arms}} = \sum_{k=0}^{K} D(B_{\mathrm{best}}, B_k) \cdot \mu_{B_{\mathrm{best}}}. \qquad (2.13)$$

Stationarity of the Problem

One of the strongest assumptions of many statistical models, including most variants of the multi-armed bandit problem, is that the underlying distributions and parameters are stationary. In many contexts, including the context studied here, this is not a reasonable assumption: the state of the world is changing around the learning process and, in our context, the best arm in one time period may not be the best arm in another.
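As a minimal illustration of structural drift (names and parameters here are illustrative, not from the thesis), an arm whose mean follows a random walk can change which arm is best over time, so any policy that assumes fixed means is measuring a moving target:

```python
import random

def count_leader_changes(mu0, mu1, steps, sigma=0.02, seed=2):
    """Random-walk drift on arm 1's mean; count plays where static arm 0 is best."""
    rng = random.Random(seed)
    arm0_best = 0
    for _ in range(steps):
        mu1 += rng.gauss(0.0, sigma)      # unmodelled drift on arm 1
        mu1 = min(max(mu1, 0.0), 1.0)     # keep it a valid Bernoulli mean
        if mu0 > mu1:
            arm0_best += 1
    return arm0_best

# Arm 1 starts ahead (0.55 vs 0.50), but under drift the identity
# of the best arm is no longer fixed across the horizon.
wins = count_leader_changes(0.50, 0.55, steps=5000)
```

A stationarity-assuming policy averages over the whole history, so the plays counted by `wins` are exactly the ones where its accumulated estimate points at the wrong arm.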
Non-stationary problems are in general challenging for algorithms that make stationarity assumptions, whether explicit or implicit, as the real-world performance of any such policy can continuously degrade in response to unconsidered changes in the distribution. In particular, rapid changes of the distribution and switching-type models (day, night; seasonal; or any other repeatedly changing, but unmodelled, confounding factor) produce extremely poor performance for many fixed policies.

Some variants of the model, known generally as non-stationary bandit models, have been proposed with drifting or step-function changing parameters. A simple solution to deal with non-stationarity is to allow the data to "decay" out of the model with a time-weighted component; however, this solution requires an accurate model of the appropriate rate of decay to be efficient. Non-stationarity is very important in the advertising context, as there is no reason to believe the underlying distributions are static. This is discussed at depth in the non-stationary bandits section (2.2.5).

Change-Point Detection. For the step-function variant of the problem, change-point analysis (or step detection in signal theory) is a deeply researched statistical technique aimed at identifying the times when the underlying probability distribution of a stochastic process changes. Mellor and Shapiro [108] introduce an online Bayesian change-point detection algorithm appropriate for use in a variety of "switching environments" where the underlying arm distributions change in semi-structured ways.

Kalman Filters. For the drifting or noisy version of the problem, a Kalman filter "state transition model" technique has been applied.
Granmo and Berg [70] introduce a Bayesian technique that uses sibling Kalman filters to define the distributions for a Thompson sampling-type policy and demonstrate that it is empirically strong for both non-stationary and stationary variants of the problem, suggesting it could be a strong technique in cases of uncertainty. Later, we explore time series techniques for solving non-stationarity in the context of a changing world.

Ethical and Practical Constraints

In medical and research applications of multi-armed bandit models, exploratory phases may involve human subjects. These cases bring with them ethical constraints independent of the moments of the distribution. Specifically, in medical applications it may be desirable to bound the exploratory portion of a model without exception, or to favour exploiting new knowledge as soon as it passes some statistical significance. In a resource exploration context, some locations may be environmentally sensitive or unexplorable for legal reasons; these might take the form of a hole in our exploration function in a contextual bandit framework, or simply some bandits which are left unexplored in a pure bandit framework.

Practically, other constraints may exist. In some contexts, such as investment in research or private corporation stock, the exploration may be bounded by large minimum or insufficient maximum investments. In a financial portfolio or marketing context, due to model or stochastic error, it may never be desirable to expose a large portion of a firm's total capital to any one optimal result (or, in a contextual framework, any cluster of results).
We consider the application of risk-based measures to attempt to quantify this effect and later discuss state-of-the-art results with respect to satisficing and convex tinkering from risk management research on how to best deal with this within our marketing context.

Practical Significance

The multi-armed bandit specification discards one motivation of traditional experiment design: statistical significance or hypothesis testing⁶. Fortunately, the model maintains some sense of economic or practical significance: arms which have largely different rewards will still, to the extent possible, be identified as such in the multi-armed bandit, as a good, uncertainty-aware policy (such as UCB or Thompson sampling) must isolate the best arm from the others in a similar sense to a traditional experiment design in order to ensure it is appropriately balancing exploration and exploitation.

2.1.3 Formalization

To reiterate the nature of the problem we are considering, we provide the following summary formalization. A (finite-armed, stochastic) multi-armed bandit problem is a process where an agent (or forecaster) must choose repeatedly between K independent and unknown reward distributions (called arms) over a (known or unknown) time horizon H in order to maximize its total reward (or equivalently, minimize some total regret compared to an oracle strategy). At each time step t, the strategy or policy selects (plays) a single arm $S_t$ and receives a reward of $x_{S_t,t}$ drawn from the selected arm's distribution, which the policy uses to inform further decisions. In our application, individual arms represent webpage modifications with the goal of maximizing desired user behavior (sales, time engaged, etc.).

2.2 Studied Problem Variants

For each of the variants of the model, we introduce the variant, describe the state of the research and then explore how different algorithms can be applied to this specific problem variant.
In particular, we explore the common threads in terms of algorithm design, model assumptions and other factors that tie together the variants and known solutions to the problem. At the end of this section, the reader should have a strong intuition for both the state of the art and sufficient context to understand and approach new variants of the model.

[Footnote 6: While achieving strict statistical significance in the test of distinguishing arms from each other is not a direct goal in the multi-armed bandit model, [87] shows that the model does not exclude traditional significance testing, especially when the goal is only to identify that a given arm performs strongly.]

2.2.1 Traditional K-Armed Stochastic Bandit

The most studied variant of the model is the traditional model described earlier in this chapter, with a discrete, finite number of arms (K) for the agent to choose between at each time step t. There are a large number of techniques for solving this variant, many of which meet various notions of provably efficient.

A strategy or algorithm used to solve the multi-armed bandit problem is often called a policy. We discuss a set of policies, the ε policies [141], dependent on a parameter ε which determines how much exploration takes place; a set of policies called UCB strategies (upper confidence bound strategies), based on the observation by Auer, Cesa-Bianchi, and Fischer [15] on utilizing upper confidence bounds; a variety of standalone policies; and finally probability matching policies, which rely on the idea of matching the probability of success with the probability of drawing that arm. Strategies like the ε-based strategies that maintain an ongoing distinction between exploitation and exploration phases are called semi-uniform.

ε-greedy

The ε-greedy approach appears to be the most widely used simple strategy to solve the simple stochastic, i.i.d. form of the (discrete) multi-armed bandit model in practice.
The strategy, in which the agent selects a random arm some fraction $0 \leq \varepsilon \leq 1$ of the time, and the arm with the best observed mean so far otherwise, was first presented by Watkins [156] as a solution to the equivalent one-state Markov decision process problem⁷. The choice of ε and the strategy for estimating the mean are left to the application.

[Footnote 7: Watkins' motivation was in modelling learning processes in the real world, not machine learning. The distinction does not appear to be important for the particular policy he devises.]

ε-based strategies have been well studied. Even-Dar et al. [57] show that after $O\left(\frac{K}{\alpha^2}\log\frac{K}{\delta}\right)$ random plays an α-optimal arm will be found with probability greater than $1 - \delta$, a result that applies to all major ε strategies.

Constant ε. With a constant value of ε, a linear bound on regret can be achieved. Constant ε-greedy policies are necessarily suboptimal, as a constant ε prevents the strategy, in general, from asymptotically reaching the optimal arm [152]. That is, even after strong knowledge is acquired, the strategy will continue to behave randomly some ε fraction of the time.

Adaptive and ε-Decreasing. One of the more salient variants of ε-greedy is the ε-decreasing strategy. In a stationary, finite-horizon environment, it is logical to have a policy do more exploration early and more exploitation as it becomes more confident about its knowledge or as it gets closer to its horizon. This can be implemented with a variance-weighted strategy or by simply decreasing ε according to some rule (time, observations, etc.). In known-horizon environments, ε-decreasing policies can weight the rate of exploration as a function of the remaining horizon available, though no known work has explicitly defined the correct functional form to do so.

A simple ε-decreasing strategy is natural and given by Vermorel and Mohri [152], which defines $\varepsilon(t)$, the value of ε after t plays, as $\min(1, \varepsilon_0/t)$, where $\varepsilon_0$ is left as a choice to the user.
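The ε-decreasing schedule above can be sketched in a few lines. This is an illustrative implementation of the $\min(1, \varepsilon_0/t)$ rule (class and parameter names are ours, not from the cited papers); it explores uniformly while ε is large and increasingly exploits the best empirical mean as ε decays:

```python
import random

class EpsilonDecreasing:
    """epsilon-greedy with the decaying schedule min(1, eps0/t)."""
    def __init__(self, n_arms, eps0=10.0, seed=3):
        self.rng = random.Random(seed)
        self.eps0 = eps0
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms
        self.t = 0

    def select(self):
        self.t += 1
        eps = min(1.0, self.eps0 / self.t)
        if self.rng.random() < eps or 0 in self.counts:
            return self.rng.randrange(len(self.counts))      # explore
        return max(range(len(self.counts)),
                   key=lambda i: self.sums[i] / self.counts[i])  # exploit best mean

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

# Run against two Bernoulli arms; play should concentrate on the better arm.
policy = EpsilonDecreasing(2)
rng = random.Random(4)
means = [0.3, 0.7]
for _ in range(2000):
    arm = policy.select()
    policy.update(arm, 1.0 if rng.random() < means[arm] else 0.0)
```

Note the residual exploration at rate $\varepsilon_0/t$ never fully stops, which is what lets the policy recover from an unlucky early estimate, in contrast to a pure explore-then-commit scheme.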
A similar strategy, called GreedyMix, is analyzed in Cesa-Bianchi and Fischer [40], where $\varepsilon(t)$ (referred to as γ) is defined as $\min\left(1, \frac{5K}{d^2} \cdot \frac{\ln(t-1)}{t-1}\right)$, where $0 < d < 1$ is a constant picked by the user⁸. GreedyMix is shown to have regret on the order of $\ln(H)^2$ for H trials for Bernoulli- and normally-distributed bandits. Selection of d is left to the reader, and performance degrades if a suboptimal value of d is selected.

An interesting result regarding ε-decreasing policies is given by Auer, Cesa-Bianchi, and Fischer [15] with the simulations on a policy called $\varepsilon_n$-greedy. $\varepsilon_n$-greedy is the generalized form of ε-greedy where the fraction of exploration is a function of the time step. At each time step t, we set $\varepsilon_t = \varepsilon(t)$. By defining $\varepsilon(t) \equiv \min\left\{1, \frac{cK}{d^2 t}\right\}$ and correctly selecting an unknown parameter $c > 0$ and a lower bound $0 < d < 1$ on the difference between the reward expectations of the best and second best arms, we get a policy which has an expected regret of $O(\log H)$. Unfortunately, as noted in Auer et al. [15], this result is not of a lot of practical use, for the same reason GreedyMix lacks practicality: the selection of the constant factors c and d is dependent on the underlying distribution which we are trying to estimate, and the performance degrades rapidly in the incorrectly tuned case. A theoretical, but not particularly practical, extension of this strategy is one where $\varepsilon(t)$ is correctly chosen for each time step; this strategy is guaranteed to converge in an optimal number of trials in expectation.

ε-first

In the non-academic web optimization and testing literature, ε-first is used extensively, generally for 2-armed bandits, and is widely known as "A/B testing"⁹. In ε-first, the horizon, H, must be known a priori. The first $\varepsilon \cdot H$ plays are called the exploration phase, and the agent picks arms uniformly randomly, producing an estimate of each arm's payoff.
In the remaining $(1-\varepsilon) \cdot H$ plays, called the exploitation phase, the agent strictly picks the best empirically estimated arm.

An ε-first strategy is superior to an ε-greedy strategy when the horizon is fixed and stationarity on the arms can be assumed, as the estimates are expected to be better for a larger number of plays and thus fewer suboptimal plays will be likely. Compared to ε-greedy, ε-first is vulnerable to non-stationarity of the reward distribution, because all learning takes place "upfront". ε-first is also vulnerable to errors in estimating the time horizon, the number of trials remaining.

[Footnote 8: Note that by letting $\varepsilon_0 = \frac{5K}{d^2}$, GreedyMix is similar to the Vermorel and Mohri strategy, but not the same, as the rate of decrease is $\frac{\ln(t-1)}{t-1}$.]

[Footnote 9: The extensive toolsets available for automating this testing often perform "A/B testing" incorrectly. Specifically, they perform repeated testing without an appropriate multiple-testing significance adjustment. It is up to the user, who is generally not expected to be familiar with the statistics involved, to behave appropriately to maintain the assumptions of the model. Many researchers have addressed the multiple testing issue; for an overview of the problem see Jennison and Turnbull [82]; for a review of strategies for correcting multiple testing errors, see Hsu [78] or Westfall, Young, and Wright [159]. Indeed, the most fragile of these toolsets offer functionality to make decisions "upon reaching significance" (using an unadjusted measure of significance), which suggests a significance test after every trial: the worst form of the multiple-testing problem, resulting in a false positive rate which increases as the number of trials increases [82].]

Multiple Epoch. A variant of the ε algorithms is the multiple epoch approach. Multiple epoch approaches can be applied to many multi-armed bandit policies (e.g., Langford and Zhang [100]), but they are largely unstudied in non-ε approaches.
They may show promise in non-stationary bandit cases where the epoch length (and data decaying) can be used to control for the maximum deviation. In the multiple epoch approach, we divide our total time horizon (known or unknown, finite or infinite) into epochs of an integer length. The respective policy is then applied within the epoch. For example, in the ε-first strategy, this eliminates some of the vulnerability to non-stationarity and horizon-unawareness by allowing learning to take place at spaced periods within the total time.

UCB1

Much of the research in regret bounds demonstrates regret that is logarithmic ("optimal") only asymptotically. Auer et al. [15] present an algorithm originating in Agrawal [5] called UCB1 which achieves expected logarithmic regret uniformly over time, for all reward distributions, with no prior knowledge of the reward distribution required. UCB1 is the first strategy we have discussed that is not a semi-uniform strategy; that is, it does not maintain a distinction between an exploratory phase and an exploitation phase, choosing instead to optimize how exploration happens at each individual iterate. UCB1 belongs to the general family of upper confidence bound (UCB) algorithms, first proposed in Lai and Robbins [99] but developed extensively in Auer et al. [15]. UCB algorithms take the form of picking the arm which maximizes a surrogate function, i.e., they pick

$$i = \arg\max_i \left(\mu_i + P_i\right), \qquad (2.14)$$

where $\mu_i$ is the "average function" which estimates the mean payoff of arm i and $P_i$ is a padding function which generally takes the form of an approximation of the uncertainty on $\mu_i$. The primary contribution of variants of the UCB algorithms is the selection of $P_i$.

For convenience, let $\Delta_i$ be defined the same way as in Auer et al. [15]: $\Delta_i \equiv \mu^* - \mu_i$, where $\mu^*$ represents the mean reward expected from the optimal arm and $\mu_i$ represents the current reward expectation for arm i.

UCB1 begins by playing each arm once to create an initial estimate.
Then, for each iterate t, arm i is selected to achieve the maximum value

$$\max_i \; \bar{x}_i + \sqrt{\frac{2 \ln t}{n_i}},$$

where $\bar{x}_i$ is the average observed reward of arm i thus far (the empirical mean) and $n_i$ is the number of times arm i has been played. The second term in this equation acts as an approximation for "optimism" by treating arms which have been played less as more uncertain (and thus plausibly better) than arms that have been played frequently. In UCB1's strict formulation, the bound is derived from the Chernoff-Hoeffding bound [22, 44, 76] on the right tail distributions for the estimation of Bernoulli random variables, but the confidence bound model applies equally well to any distribution where an appropriate bound can be defined.

The second term in the maximization criterion has been extended, as in the MOSS algorithm [10] (discussed in an upcoming section), to consider the remaining horizon to create an "exploratory value" that is declining in finite time, or to improve the tightness of the bound on variance.

UCB1 as specified has a bounded regret at time t, for Bernoulli arms, given by the following formula, shown in the original paper:

$$8 \cdot \sum_{i:\mu_i<\mu^*} \left(\frac{\ln t}{\Delta_i}\right) + \left(1 + \frac{\pi^2}{3}\right)\left(\sum_{i=1}^{K} \Delta_i\right). \qquad (2.15)$$

UCB2

UCB2, an iterative improvement over UCB1, reduces the constant term in the fraction of time a suboptimal arm will be selected, reducing the overall regret, at the cost of only a slightly more complicated algorithm.

In UCB2, iterates are broken into epochs of varying size. In each epoch, arm i is selected to maximize

$$\bar{x}_i + \sqrt{\frac{(1+\alpha)\ln\left(et/(1+\alpha)^{r_i}\right)}{2(1+\alpha)^{r_i}}}$$

and then played exactly $\left\lceil (1+\alpha)^{r_i+1} - (1+\alpha)^{r_i} \right\rceil$ times before ending the epoch and selecting a new arm.
Here $r_i$ is a counter indicating how many epochs arm i has been selected in, and $0 < \alpha < 1$ is a parameter that influences the learning rate, discussed below.

The bound on regret for UCB2 is known for times $t \geq \max_{i:\mu_i<\mu^*} \frac{1}{2\Delta_i^2}$ and is given by

$$\sum_{i:\mu_i<\mu^*} \left(\frac{(1+\alpha)(1+4\alpha)\ln\left(2e\Delta_i^2 t\right)}{2\Delta_i} + \frac{c_\alpha}{\Delta_i}\right), \qquad (2.16)$$

where e is Euler's constant and $c_\alpha = 1 + \frac{(1+\alpha)e}{\alpha^2} + \left(\frac{1+\alpha}{\alpha}\right)^{1+\alpha}\left(1 + \frac{11(1+\alpha)}{5\alpha^2 \ln(1+\alpha)}\right)$ as proven in Auer et al. [15]. The important property of $c_\alpha$ to notice is that $c_\alpha \to \infty$ as $\alpha \to 0$, forcing a trade-off between the selection of α to minimize the first term towards $1/(2\Delta_i^2)$ and the second term. The original paper suggests optimal results from setting α such that it is decreasing slowly in t, but is not specific about the form of decrease; in practice, they also demonstrate, the choice of α does not seem to matter much as long as it is kept relatively small.

UCB-Tuned

A strict improvement over both UCB solutions can be made by tuning the upper bound parameter in UCB1's decision rule. Specifically, Auer et al. [15] further expand these solutions by replacing the second term $\sqrt{\frac{2\ln t}{n_i}}$ with the tuned term $\sqrt{\frac{\ln t}{n_i}\min\left(\frac{1}{4}, V_i(n_i)\right)}$, where $V_i$ is an estimate of the upper bound of the variance of arm i given by, for example,

$$V_i(n_i) \equiv \left(\frac{1}{n_i}\sum_{\tau=1}^{n_i} X_{i,\tau}^2\right) - \bar{X}_{i,n_i}^2 + \sqrt{\frac{2\ln t}{n_i}}, \qquad (2.17)$$

where $n_i$ is the number of times arm i has been played out of t total plays. UCB-Tuned empirically outperforms UCB1 and UCB2 in terms of frequency of picking the best arm. Further, Auer et al. [15] indicate that UCB-Tuned is "not very" sensitive to the variance of the arms. Simple experimentation shows that UCB-Tuned as defined above outperforms the earlier UCBs significantly in all tested worlds.

MOSS

MOSS [11], or the Minimax Optimal Strategy in the Stochastic case, produces a variant of UCB1 that is presented in a generalized context, such that it can apply to all known bandit variants or subproblems.
In MOSS, the $\ln t$ component of the padding function in UCB1 for arm $i$ is replaced with $\ln\frac{H}{K n_i}$, where $n_i$ is the number of times arm $i$ has been played, $H$ is the total number of iterates to be played (the horizon, known at the beginning) and $K$ is the number of arms available in a stochastic (non-adversarial) bandit problem. The work of Audibert and Bubeck [11] shows that expected regret for MOSS is bounded from above, by,

\[
\mathbb{E}R \leq 25\sqrt{HK} \leq \frac{23K}{\Delta} \log\left( \max\left( \frac{140H\Delta^2}{K}, 10^4 \right) \right), \tag{2.18}
\]

where $\Delta = \min_{i:\Delta_i > 0} \Delta_i$, the smallest gap between the optimal arm and the second best arm. Note that this calculation of regret applies continuously in the stochastic case, but we will see later in the adversarial discussion that it is marginally complicated in that environment due to non-unicity of the optimal arm.

Bayes-UCB

Combined with KL-UCB (covered in the next section), Bayes-UCB [88] — an explicitly Bayesian variant of UCB — represents the current state of the art of UCB algorithms. It is an asymptotically efficient advanced algorithm with promising empirical results. In the Bayesian approach to the multi-armed bandit problem, each arm is represented as an estimate of a distribution that is updated in the traditional Bayesian fashion. Kaufmann et al. [88] show that this Bayesian-derived UCB has a cumulative regret that empirically outperforms the strongest of the original UCB algorithms by a substantial margin in a handful of selected problems, while having the advantage of being distribution agnostic and showing the early-iterate flexibility of a Bayesian approach to knowledge acquisition. A computational complexity challenge is acknowledged but not explored in depth.

Bayes-UCB is similar to the probability matching strategies to be discussed later: quantiles of a distribution are estimated to increasingly tight bounds and the probability of a given arm "being the best" is used to determine the next step.
To perform Bayes-UCB, the algorithm requires a prior on the arms, $\Pi_0$, and a function to compute the quantiles of the expected distributions, $Q(\alpha, \rho)$, such that $P_\rho(X \leq Q(\alpha, \rho)) = \alpha$. At each time step $t$, Bayes-UCB draws the arm $i$ to maximize the quantile selected as follows. It picks

\[
i = \arg\max_i q_i(t) = Q\left( 1 - \frac{1}{t}, \lambda_i^{t-1} \right), \tag{2.19}
\]

where $Q$ meets the property described above and $\lambda_i^{t-1}$ is the estimated posterior distribution of arm $i$ at the previous time step. This is then updated according to the Bayesian updating rule and used as the prior for the next iteration.

In a theoretical analysis, Kaufmann et al. [88] show that Bayes-UCB achieves asymptotic optimality and a non-asymptotic finite-time regret in $O(H)$.

It is interesting to note that, by treating the quantile function and underlying model appropriately, Bayes-UCB can, in theory, represent any distribution and most subproblems of the multi-armed bandit. As a simple but valuable example, by representing the underlying model as a Bayesian regression, one can include contextual information in the bandit process.

KL-UCB

KL-UCB [105] presents a modern approach to UCB for the standard stochastic bandits problem where the padding function is derived from the so-called Kullback-Leibler (K-L) divergence. KL-UCB improves on the regret bounds of earlier UCB algorithms by considering the distance between the estimated distributions of each arm as a factor in the padding function. Specifically, define the Kullback-Leibler divergence [62, 98] (for Bernoulli distribution arms) as,

\[
d(p, q) = p \log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}, \tag{2.20}
\]

with the conventions $0\log 0 = 0$, $0\log\frac{0}{0} = 0$, and $x\log\frac{x}{0} = +\infty$ for $x > 0$.
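The Bernoulli divergence of Equation (2.20), including its boundary conventions, is small enough to state directly in code (an illustrative sketch of ours, not from the thesis):

```python
import math

def bernoulli_kl(p, q):
    """Kullback-Leibler divergence d(p, q) between Bernoulli(p) and Bernoulli(q),
    using the conventions 0 log 0 = 0, 0 log(0/0) = 0 and x log(x/0) = +inf."""
    def term(a, b):
        if a == 0:
            return 0.0        # 0 log 0 = 0 (also covers 0 log 0/0)
        if b == 0:
            return math.inf   # x log(x/0) = +inf for x > 0
        return a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)
```

KL-UCB uses this divergence, rather than a distribution-free Hoeffding-style bound, to shape each arm's padding function.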
The Kullback-Leibler divergence $d(p, q)$ provides a probability-weighted measure of the difference between two distributions which does not rely on collapsing the distribution to a midpoint (e.g., expectation). To pick an arm in each iteration of KL-UCB, we maximize

\[
i = \arg\max_i \max\left\{ M : n_i \cdot d(\bar{x}_i, M) \leq \log t + c\log\log t \right\}, \tag{2.21}
\]

where the inner maximum is taken over $M$, picked from the set of all possible reward distributions. The K-L divergence $d(x, M)$ is strictly convex and increasing in $[x, 1)$ [62], making this maximization tractable.

POKER and Price of Knowledge

A non-UCB algorithm, POKER [152], or Price of Knowledge and Estimated Reward, is a generalizable, economic-analysis-inspired approach to the problem.

The intuition behind POKER is to assign a "value" to the information (the "exploration bonus") gained while pulling a given arm. This value is estimated for each arm, and then the arm with the highest expected payoff plus expected value of information is played. Value of information is defined to maximize expected outcome over the horizon. To gain an intuition, first assume an oracle provides the best arm as arm $i^*$ with payoff $\mu^*$, that we have an estimate for each arm's payoff $\hat{\mu}_i$, and that we have an estimated best arm $\hat{i}$ with estimated payoff $\hat{\mu}^*$. Define the magnitude of the expected improvement as $\delta = E[\mu^* - \hat{\mu}^*]$; then the probability of an improvement for a given arm is $P[\mu_i - \hat{\mu}^* \geq \delta]$.

When there are $(H - t)$ plays left, any new knowledge found in this iterate can be exploited $(H - t)$ times. This means the expected improvement has a (non-discounted) value of $\delta \cdot (H - t)$.

A problem arises in computing $\delta$, as if $i^*$ and $\mu^*$ were known, there would be no need to explore. Instead, the ordered estimates of the means are used. Imagine an ordered list of the mean rewards as $\hat{\mu}_{i_1} \geq \cdots \geq \hat{\mu}_{i_q}$. Vermorel and Mohri [152] choose, based primarily on empirical results, to approximate $\delta$ as proportional to the gap between $\hat{\mu}_{i_1}$ and the $\hat{\mu}_{i_{\sqrt{K}}}$ arm. Specifically, they set $\delta = \frac{\hat{\mu}_{i_1} - \hat{\mu}_{i_{\sqrt{K}}}}{\sqrt{K}}$.
That is, if there are $K$ arms, the difference between the best and the $\sqrt{K}$-th best current estimate is proportional to the plausible gain. In the limit (as the number of arms approaches infinity), this approximation strategy ensures bias and variance minimization.

Additionally, one can observe that the whole probability $P[\mu_i - \hat{\mu}^* \geq \delta] = P[\mu_i \geq \hat{\mu}^* + \delta]$ is approximated (or identical, in the event of normally distributed means^10) by the cumulative probability of a reward higher than the best empirically expected reward plus expected improvement $\hat{\mu}^* + \delta$,

\[
P[\mu_i \geq \hat{\mu}^* + \delta] = \int_{\hat{\mu}^* + \delta}^{\infty} N\!\left( x; \hat{\mu}_i, \frac{\hat{\sigma}_i}{\sqrt{n_i}} \right) dx, \tag{2.22}
\]

where $N(x; \mu, \sigma)$ represents the normal distribution, $n_i$ is the number of times arm $i$ has been played, and mean $\mu_i$ and variance $\sigma_i$ take on their usual meanings.

This gives us sufficient information to define a decision criterion: select the arm which maximizes the expected sum of total rewards over the horizon $H$. Formally, at each time step $t$, select arm $i$ to play:

\[
\arg\max_i\; \hat{\mu}_i + \delta(H - t)\, P[\mu_i \geq \hat{\mu}^* + \delta] \tag{2.23}
\]

^10 This is true in the limit by the central limit theorem, but as there may be a small number of arms and trials, it may be a poor approximation in some environments.

Note that $\delta(H - t)$ is the total expected gain over the remaining horizon. By multiplying by the probability this arm will actually exceed the best known arm, we achieve a sensible expectation to maximize. This value could easily be time-discounted by introducing a sum of discounted payoffs if the time horizon were measured at a scale where time-discounting were of value.

POKER uses knowledge of the length of the horizon, or number of plays that remain, $(H - t)$, as a parameter that effectively determines how to weight exploration and exploitation. The authors make the claim that requiring the horizon explicitly is a more intuitive parameter than the parameters associated with many other algorithms.
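A minimal sketch of the POKER criterion (2.23) follows. It is our illustration, not the authors' reference implementation; it uses the normal-tail approximation of Equation (2.22) via the complementary error function, and `means`, `stds` and `counts` are hypothetical per-arm summary statistics:

```python
import math

def normal_tail(threshold, mean, std):
    """P[X >= threshold] for X ~ N(mean, std), via the complementary error function."""
    if std == 0:
        return 1.0 if mean >= threshold else 0.0
    return 0.5 * math.erfc((threshold - mean) / (std * math.sqrt(2)))

def poker_select(means, stds, counts, t, horizon):
    """Pick arg max_i  mu_i + delta * (H - t) * P[mu_i >= mu_star + delta],
    with delta estimated from the gap between the best and sqrt(K)-th best mean."""
    k = len(means)
    ordered = sorted(means, reverse=True)
    mu_star = ordered[0]
    sqrt_k_index = min(k, max(1, round(math.sqrt(k)))) - 1
    delta = (mu_star - ordered[sqrt_k_index]) / math.sqrt(k)
    best_arm, best_value = 0, float("-inf")
    for i in range(k):
        sem = stds[i] / math.sqrt(max(counts[i], 1))  # standard error of the mean
        p_improve = normal_tail(mu_star + delta, means[i], sem)
        value = means[i] + delta * (horizon - t) * p_improve
        if value > best_value:
            best_arm, best_value = i, value
    return best_arm
```

Note how the horizon enters directly: as `horizon - t` shrinks, the value-of-information term vanishes and the rule reduces to greedy exploitation.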
Additionally, the parameter can be set to a fixed value to simply use it to balance exploration and exploitation in the case the horizon is unknown or infinite.

Vermorel and Mohri [152] introduce the POKER policy and use the term zero-regret strategy to describe it. In their context, zero-regret means guaranteed to converge on the optimal strategy, eventually: that is, a strategy which has average per-play regret tending to zero for any problem which has a horizon tending to infinity. The term zero-regret will not be used in the rest of our discussion, preferring instead "guaranteed to converge to zero."

The authors compare POKER to $\epsilon$ strategies, Exp3 (discussed in a future section) and others on a real-world redundant retrieval^11 routing problem and find that POKER outperforms $\epsilon$ strategies by a factor of approximately 3. As of this writing, there has been no known finite-time analysis of regret for POKER.

2.2.2 K-Armed vs. Infinite-Armed Bandits

Expanding on the traditional model is a variant which treats the number of arms as an infinite or continuous range, with some functional form defining the relationship or a sufficient mechanism for discretizing the infinite space. This variant allows for substantial variation in problem difficulty by varying how much the agent knows about the arm relationship.

As an example of the infinite-armed bandit case, consider the case of picking a color for a purchase button to optimize clicks. Each user that views the purchase button is (possibly) influenced by its color, and color is a (theoretically) continuous quantity.
As it would be impossible to sample all the colors in the infinite-armed case, for the analysis of an infinite-armed bandit to be tractable there must exist an underlying well-behaved function defining the relationship between arms (colors) and the payoff function (clicks).

While the infinite-armed case is very interesting from a web optimization and online advertising perspective, it is not a major source of motivation for this work, which chooses to prefer a focus on discrete changes. As such, only a quick overview of the state of the art is provided here for completeness. One of the earliest works in this space was Agrawal [5].

^11 The problem is to identify the fastest source for a content delivery network with numerous redundant sources of requested data.

Recent work has studied such variants of the infinite-armed problem as high-dimensional spaces [147, 148], non-smooth spaces [47] and multiple objectives [151], although work on theoretical analysis of the existing algorithms is still ongoing. An approach to define a computationally and mathematically feasible regret in a generalized infinite-armed case is presented in Carpentier and Valko [39].

Bandit Algorithm for Smooth Trees (BAST)

Coquelin and Munos [48] present an analysis of the popular UCT (upper confidence bound for trees) algorithm, which combines Monte Carlo tree-based techniques from artificial intelligence^12 with the UCB1 algorithm discussed prior. UCT is popular in planning problems for game-playing artificial intelligence, but is not itself appropriate for infinite-armed bandit problems, as it applies in a space where the number of potential decisions is extremely large^13, not infinite.

The Bandit Algorithm for Smooth Trees introduces a mechanism related to continuity into the tree approach, choosing to, for example, represent a continuous space as a tree with repeated branches dividing that space. The fundamental assumption is that leaf nodes in the tree can be expected to have similar values in the payoff space.
Coquelin and Munos [48] represent this assumption rigorously, requiring that for any level $d$ in the tree, there exists a value $\delta_d > 0$, called the smoothness coefficient, such that for at least one (optimal) node $i$ in that level the gap between the optimal leaf node ($\mu^*$) and all leaf nodes below $i$ is bounded by $\delta_d$. Formally,

\[
\mu^* - \mu_j \leq \delta_d \quad \forall j \in \text{leaf}(i) \tag{2.24}
\]

Assumptions of this style are the core tool with which infinite-armed bandits are generally represented. The tree is then generally assumed to take a coarse-to-fine hierarchical representation of the space, down to some maximum tree height.

BAST performs selection at each level of the tree using a variant of the UCB algorithms which takes into consideration the estimates of nearby nodes. Specifically, for a given smoothness coefficient $\delta_d$ for each level of the tree, a selection mechanism for any non-leaf node is given as maximizing

\[
B_{i,n_i} = \min\left\{ \left(\max_j B_{j,n_j}\right),\; X_{i,n_i} + \delta_d + c_{n_i} \right\} \tag{2.25}
\]

(where the maximum is taken over the children $j$ of node $i$), and for any leaf node as the simple UCB criterion,

\[
B_{i,n_i} = X_{i,n_i} + c_{n_i} \tag{2.26}
\]

where $c_{n_i}$ takes the role of the padding function from UCB and is defined as

\[
c_n = \sqrt{\frac{\log\left(2Nn(n+1)\beta^{-1}\right)}{2n}} \tag{2.27}
\]

^12 A good survey of related algorithms for tree search is given by Browne et al. [29].

^13 Many games are well modelled as a tree structure where each decision reveals a new set of decisions until an end state (win or lose) is reached. These trees can grow very rapidly, but are not generally continuous. A number of techniques are proposed for dealing with them [120] and these techniques frequently overlap with some techniques proposed for infinite- or continuum-armed bandits.

A technique described in [49] and [64] is also explored in [48] which allows the online production of the tree with or without the assumption of a maximum tree height.
In the iterative growing variant of BAST, the algorithm starts with only the root node; then, at each iterate, it selects a leaf node via the selection mechanism described above and expands that node into two new child nodes, immediately choosing to play each child node once. This iterative technique still requires $O(n)$ memory to maintain the tree, but results in only optimal branches being explored in depth, a desirable property.

Hierarchical Optimistic Optimization (HOO)

Similar to BAST, Hierarchical Optimistic Optimization [34, 35, 92] attempts to build an estimate of the functional form $f$ by treating the problem as a hierarchical (coarse-to-fine) tree, with a particular focus on only maintaining a high-precision estimate of $f$ near its maxima. HOO builds and maintains a binary tree where each level is an increasingly precise subspace of the total space of arms, $X$. Each node in the tree tracks its interval range (subspace), how many times the node has been traversed and the empirical estimate of the reward, which it uses to compute an optimistic upper bound estimate, $B$, for this leaf's reward in a similar fashion to the UCB algorithms. At each time step, the algorithm traverses the tree, picking the highest-$B$ node at each junction until it reaches a leaf. At a leaf, it splits the node and creates a new point which is evaluated, and the upper bound estimate is updated up the tree accordingly.

This makes an assumption about the shape of $f$, but not one as strong as the BAST algorithm's. Rather than requiring continuity in a strongly defined sense, as in the $\delta_d$ existence assumption before, HOO requires only that a dissimilarity function exists which puts a lower bound on the mean-payoff function over the arbitrary space the arms exist in.

In HOO, the selection strategy for each node requires a measure $\mu_{d,i}$, which represents the empirical mean of the payoff from each time the node has been traversed, and $N_{d,i}$, the number of times that node has been traversed.
We use $d$ to denote depth in the tree, as in the BAST exposition, and $i$ to denote the specific node. Following again the UCB strategy, the corresponding upper confidence bound criterion can be given as

\[
\mu_{d,i} + \sqrt{\frac{2\ln n}{N_{d,i}}} + \upsilon_1 \rho^d, \tag{2.28}
\]

where $0 < \rho < 1$ and $\upsilon_1 > 0$ are parameters selected by the implementer.

Since their work, there has been work on an algorithm called Open-Loop Optimistic Planning (OLOP) [32, 150] and a combination work called hierarchical OLOP [157] which combines the methods. These are not explored in more detail here, as infinite-armed bandits do not arise significantly in the remainder of our work.

2.2.3 Adversarial Bandits

One of the strongest generalizations of the k-armed bandit problem is the adversarial bandits problem. In this problem, rather than rewards being picked from an a priori fixed distribution, rewards are selected, in the worst case per-play, by an adversary. The problem is transformed into an iterated three-step process: in step 1, the adversary picks the reward distributions (generally with full availability of the list of prior choices, though the constraints on the distribution are discussed below); in step 2, the agent picks an arm; in step 3, the rewards are assigned. This is a strong generalization because it removes the distribution dependence on the arms (and, as such, stationarity and other distribution-dependent assumptions); an algorithm that satisfies the adversarial bandits problem will satisfy more specific^14 bandits problems, albeit often sub-optimally.

As adversarial bandits are such a strong generalization, Audibert and Bubeck [11] provide a taxonomy of bandit problems that builds from the constraints on the adversary's selection process. Fundamentally, allowing the distribution to vary in each time step, they let $n$ represent the number of possible distributions available. They then provide five distinctions.
(1) The purely deterministic bandit problem, where rewards are characterized as a matrix of $nK$ rewards, where $K$ represents the number of arms and $n$ the number of time steps; in each time step, a single deterministic reward is set (fixed a priori) for each arm. (2) The stochastic bandit problem — the variant discussed in the majority of this work — in this taxonomy characterized by a single distribution for each arm, stationary in time, independent and bounded on some range, say, $x_i \in [0, 1]$. (3) The fully oblivious adversarial bandit problem, in which there are $n$ distributions for each arm, independent of each other (both through time and across arms) and independent of the actor's decisions, corresponding to changes selected by the adversary across time. (4) The oblivious adversarial bandit problem, in which the only constraint is that the distributions are selected independent of the actor's decisions. Finally, (5) the adversarial bandit, in their work referred to as the non-oblivious bandit problem, where the reward distributions can be chosen as a function of the actor's past decisions. For reference, we have presented the distinctions and how they map to the bandit taxonomy we have provided in Table 2.1.

In the majority of this work, we focus explicitly on the stochastic variants of the multi-armed bandit problem, choosing a lens by which deterministic or even simply non-oblivious bandits are not known to be deterministic or non-oblivious by the agent ahead of time. Our lens models the various forms of oblivious bandits as considerations to the stochastic nature of the problem, for example, treating contextual covariates and non-stationarity as a form of statistical misspecification, even when sometimes that misspecification will be impossible to resolve (as in the case of Knightian uncertainty [93], where the correct reward model has some immeasurable and incalculable component).

^14 Especially, contextual bandits.
This differs from the lens in Audibert and Bubeck [11], providing a perspective which applies more closely to the application area of interest in this work (one in which the true underlying model almost certainly consists of unknown or unknowable covariates, but is also partially approximated by variables we can observe), but comes at the cost of not generalizing to the pure non-oblivious adversarial problems.

In our slot machine example, this is a world where the casino selects which machines will have which payoff distributions after each play, in the worst case in an effort to minimize your reward. For example, if a player were consistently picking the same machine, the casino would move the worst payoff distribution to that machine. Indeed, the title of the Auer and Cesa-Bianchi [14] paper introducing the problem, "Gambling in a rigged casino: the adversarial multi-armed bandit problem," captures this stylized example well. In the general sense, the adversarial case makes no assumption on the underlying payoff distribution, and the adversary does not necessarily need to be a true minimizer.

The major caveat of adversarial bandits is that our definition of "performance" needs to be relaxed for any measures to be meaningful. Specifically, a strong performing algorithm must be defined using a measure of regret that compares our decisions solely to a fixed machine over time; that is, a strong adversarial bandit can still achieve logarithmic regret, but only if the "best arm" is defined at time $t = 0$ and does not vary across trials. To rephrase, that means that of our definitions of regret given earlier in this chapter, only the "weak-regret" notions can be meaningful within an adversarial context.

In the online advertising context of this work, the general form of adversarial bandits has little direct application other than as motivation for contextual (and distribution-free) bandits; as such, only a cursory overview of their evolution follows.
Further work may include competition with other advertisers as an adversarial case, though the specific implementation of the model is uncertain.

The majority of the efficient solutions to adversarial problems are variants of the Exp3 algorithm presented in Auer, Cesa-Bianchi, Freund, and Schapire [16] for the general, no-statistical-assumptions adversarial bandits case. Beygelzimer, Langford, Li, Reyzin, and Schapire [26] extend the work of Auer et al. [16] and McMahan and Streeter [107] to transform Exp4 to produce a high-probability bounded version called Exp4.P.

Hedge and Exp3

Auer and Cesa-Bianchi [14] present the first look at the adversarial bandit problem and include an algorithm with high-probability bounded regret called Exp3 — the exponential-weight algorithm for exploration and exploitation — based on an algorithm called Hedge for the full information problem. Exp3 [16] presents a readily understandable, simple algorithm for adversarial bandits. Given a "pure exploration" parameter $\gamma \in [0, 1]$, which measures the fraction of time the algorithm selects a purely random decision, the algorithm then spends $(1 - \gamma)$ of the time doing a weighted exploration/exploitation based on the estimated actual reward.

The estimation process for Exp3 is an exponentially updating probability-weighted sample. The arm weight is updated immediately after pulling a given arm and being delivered the reward $\rho_i$ with the formula

\[
w_{i,t} = w_{i,t-1} \cdot e^{\frac{\gamma \cdot \rho_i}{p_{i,t} \cdot K}}, \tag{2.29}
\]

where $w_{i,t}$ is the arm-$i$-specific weight at time $t$ and $p$ is our selection criterion. The probability of each specific arm to play in each iteration is selected according to $p$, which considers the arm weighting and $\gamma$ semi-uniformity, namely,

\[
p_{i,t} = (1 - \gamma)\frac{w_{i,t}}{\sum_{j=1}^{K} w_{j,t}} + \gamma \cdot \frac{1}{K}. \tag{2.30}
\]

In some sense, Exp3 combines the semi-uniformity in the parameterization of $\epsilon$ strategies with the "probability of best" weighted exploration/exploitation strategies of probability matching methods.

A computationally efficient version of Exp3 called Exp3.S is presented in Cesa-Bianchi and Lugosi [41].

Exp4

Exp3 does not include any concept of contextual variables or "expert advice". Auer et al. [16] develop an extension of Exp3, called Exp4 (Exp3 with expert advice). Exp4 is identical to Exp3, except the probability of play is selected with the addition of a set of $N$ context vectors $\xi$ per time step, and the weight function is similarly replaced to consider the context vectors. One should note that the weights $w$ are now computed per context vector, where a context vector can be viewed as an "expert" advising of a selection coefficient for each arm; we now use $j$ to indicate the index of the expert and continue to use $i$ to indicate the index of the arm, for clarity,

\[
w_{j,t} = w_{j,t-1} \cdot e^{\frac{\gamma \cdot \rho_j \cdot \xi_{j,t}}{p_{j,t} \cdot K}}. \tag{2.31}
\]

For the selection probability $p$, interpret $\xi_{j,t}(i)$ as the advice coefficient expert $j$ gives at time $t$ about arm $i$,

\[
p_{i,t} = (1 - \gamma)\sum_{j=1}^{N} \frac{w_{j,t}\,\xi_{j,t}(i)}{\sum_{k=1}^{N} w_{k,t}} + \gamma \cdot \frac{1}{K}. \tag{2.32}
\]

Note that $k$ represents an iterator over all experts in the normalizing sum. With a minor abuse of notation, this is equivalent to Exp3 where we update our weight vector with the context $\xi$, reward $\rho$, and selection probability $p$ according to $\xi \cdot \rho / p$ for the arm played at each time step, except that the weight vector is now the summed contextual weight vector.

Exp4.P

Exp4.P is a variant of the Exp4 algorithm presented in Beygelzimer, Langford, Li, Reyzin, and Schapire [26] with bounded^15 regret in the high-probability case of $\tilde{O}(\sqrt{KH\log N})$. The bound does not hold in high probability in the original Exp4 presentation, as the variance of the importance-weighted numerator term is too high [26]. Exp4.P modifies Exp4 such that the bound holds with high probability.
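Stepping back to Exp3 itself, the weight and probability updates of Equations (2.29) and (2.30) can be sketched compactly (our illustration, with γ as the exploration parameter; not code from the thesis):

```python
import math
import random

def exp3_probabilities(weights, gamma):
    """Equation (2.30): normalized weights mixed with uniform exploration."""
    total = sum(weights)
    k = len(weights)
    return [(1 - gamma) * w / total + gamma / k for w in weights]

def exp3_update(weights, probs, arm, reward, gamma):
    """Equation (2.29): exponentially reweight the played arm using its
    importance-weighted reward."""
    k = len(weights)
    weights[arm] *= math.exp(gamma * reward / (probs[arm] * k))

def exp3_step(weights, gamma, reward_fn, rng=random):
    """One Exp3 iterate: sample an arm from p, observe a reward, reweight."""
    probs = exp3_probabilities(weights, gamma)
    arm = rng.choices(range(len(weights)), weights=probs)[0]
    exp3_update(weights, probs, arm, reward_fn(arm), gamma)
    return arm
```

Because the observed reward is divided by the probability of the arm played, rarely played arms receive proportionally larger updates, which keeps the implicit reward estimate unbiased.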
The change in Exp4.P is only in how the weight vector is updated at time $t$. Rather than using Equation (2.31), Exp4.P uses an updating function,

\[
w_{j,t} = w_{j,t-1} \cdot e^{\frac{\gamma}{2}\left( \rho_j \cdot \xi_{j,t} + \hat{v}_{j,t}\sqrt{\ln(N/\delta)/KH} \right)}, \tag{2.33}
\]

where $\delta > 0$ is a parameter that defines the desired probability bound of the regret $(1 - \delta)$ and $\hat{v}_{j,t}$ is defined as

\[
\hat{v}_{j,t} = \sum_{i=1,\dots,K} \frac{\xi_{j,t}(i)}{p_{i,t}}. \tag{2.34}
\]

This modification allows Beygelzimer, Langford, Li, Reyzin, and Schapire [26] to bound the regret of the new algorithm, Exp4.P, with probability of at least $1 - \delta$, to $6\sqrt{KH\log(N/\delta)}$.

Stochastic and Adversarial Optimal (SAO)

Bubeck and Slivkins [33] introduce a testing technique that is capable of handling both the stochastic (non-adversarial) problem and the adversarial problem with near-optimal regret results. Stochastic problems generally use a different definition of regret than adversarial problems, so the analysis provided in this work takes place in two parts, assuming the model is either stochastic or adversarial, showing asymptotic regret of $O(\text{polylog}(n))$^16 in the stochastic case and the $O(\sqrt{n})$ pseudo-regret from Exp3 in the adversarial case.

SAO proceeds in three phases, making it a semi-uniform strategy: exploration, exploitation and the adversarial phase. The exploration and exploitation phases are largely as expected, interleaved to operate pairwise (arm 1 vs. arm 2) and rule out "suboptimal" arms as the algorithm progresses. For the remainder of this exposition, assume there are only two arms and arm 1 is strictly superior to arm 2. Further, let $C \in \Omega(\log n)$ be an arbitrary parameter which enforces consistency, selected specifically for the application area, for example $C = 12\log(n)$; let $\tilde{H}_{i,t}$ be the average observed reward for arm $i$; let $t$ represent time (the number of iterates so far); and let $\tau^*$ represent the point at which we switch from exploration to exploitation.

^15 The notation $\tilde{O}(n)$ is read "soft-O of n" and is equivalent to $O(n\log^k n)$, i.e., the big-O notation where logarithmic factors are ignored.

^16 The notation $O(\text{polylog}(n))$ means $O((\log n)^k)$ for some $k$. This is similar to the use of $\tilde{O}$ to indicate the insignificance logarithmic terms often bring to the analysis of algorithms.

We start in a state of exploration, where we pick an arm with equal probability for a minimum of $C^2$ rounds and until we find a "sufficiently superior" arm according to the condition,

\[
|\tilde{H}_{1,t} - \tilde{H}_{2,t}| < \frac{24C}{\sqrt{t}}. \tag{2.35}
\]

During the exploitation phase, the arms are drawn according to the probabilities $p_t(2) = \frac{\tau^*}{2t}$ and $p_t(1) = 1 - p_t(2)$; that is, the probability of drawing the suboptimal arm is decreasing asymptotically in time. A set of conditions is checked to see if the observed rewards still fit within the expected stochastic model. The conditions checked are referred to as consistency conditions and are as follows.

The first consistency condition checks if the observed rewards in exploitation are congruent with the findings of the exploration phase, that is, whether the rewards are bounded in a range consistent with our observation that arm 1 is better than arm 2 by approximately the observed amount. Concretely, the first consistency condition is

\[
\frac{8C}{\sqrt{\tau^*}} \leq \tilde{H}_{1,t} - \tilde{H}_{2,t} \leq \frac{40C}{\sqrt{\tau^*}}. \tag{2.36}
\]

The second consistency condition checks that arm $i$'s estimate is still within bounds of the expected estimate, consistent with the fact that during exploitation the suboptimal arm is drawn with low probability. Consider $\hat{H}_{i,t}$ to be the expected reward from arm $i$ at time $t$ given that the world is stochastic and the arm can be appropriately modelled, so that $\tilde{H}_{i,t} - \hat{H}_{i,t}$ represents the difference between the expected reward and the observed reward. Concretely, the second consistency conditions are,

\[
|\tilde{H}_{1,t} - \hat{H}_{1,t}| \leq \frac{6C}{\sqrt{t}}, \tag{2.37}
\]

\[
|\tilde{H}_{2,t} - \hat{H}_{2,t}| \leq \frac{6C}{\sqrt{\tau^*}}. \tag{2.38}
\]

All the magic numbers in these conditions are derived from the high-probability Chernoff bounds for the stochastic case.
The different denominators on the right-hand side of the equations account for the low probability of drawing the inferior arm (arm 2) during exploitation.

In the event any of the consistency conditions fail, we assume the model is non-stochastic and switch from the explore-exploit algorithm to that of Exp3. The work explores and proves properties of the conditions. Selection of the consistency parameters is important, as they would allow a carefully crafted adversary to maintain the conditions. Such conditions cannot allow the adversary to create a high level of regret for the application, yet must hold with high probability in the non-adversarial case.

This algorithm as described combines the asymptotic regret bounds of both UCB1 and Exp3 in a near-optimal (asymptotic) fashion for both stochastic and the most general form of adversarial bandits. There is no analysis of the finite-time regret.

Table 2.1: A hierarchy of bandit problems, categorized by the adversarial bandits generalization in Audibert and Bubeck [11].

Fully Oblivious
Each arm's distribution may change through time, but the distributions are fully independent of each other both across time and across arms, e.g., each arm changes to a randomly drawn new payoff every iterate.

– Deterministic
Arm payoffs are not random variables and are fully selected ahead of time, independently of each other, the player's decisions and time.

– Stochastic
Arm distributions are random variables drawn from a distribution for each arm which is fixed in time and independent of each other. In an equivalent alternative presentation, arm distributions are drawn from a single K-dimensional distribution which is i.i.d.

Oblivious
Each arm's distribution may change through time and may be correlated with itself, other arms, or observed or unobserved covarying variables, but not with the player's past decisions.
This appears as an "adversary" who is not informed of the player's choices.

– Nonstationary Stochastic
Arm distributions may have their parameters drifting in time, or have changepoints in time at which they suddenly change in parameters.

– Contextual Stochastic
Arm distributions have observed or unobserved covariates which partially determine their payoff. E.g., the web optimization environment described, where a user may have demographic or individual variables which make one arm better or worse for that specific context than another; or the case of a casino with slot machines half red and half blue, where red arms pay at a different rate than blue arms (but the player does not know that ex ante). Covariates can be received from the environment (in the described case of an individual observer having them) or from the arms themselves (as in the colored machines case).

Non-Oblivious
Arm distributions can be chosen as any function of the actor's past decisions, including one that meets the intuitive definition of "adversary," e.g., the arms may be "designed to trick" the actor, with a long string of wins followed by a large loss.

2.2.4 Contextual Bandits

The simple k-armed bandit problem performs sub-optimally by its design in the advertising context. In general, the contextual bandits framework is more applicable than the non-contextual variants of the problem, as it is rare that no context is available [100]^17. Specifically, the simplest form of the model selects from $k$ advertisements and then discovers the payout associated with that particular play.

The contextual bandit setting has taken many names, including bandits with context, bandit problems with covariates [121, 131, 160], generalized linear bandits, associative bandits [139] and bandit problems with expert advice [15].
The contextual bandit problem is closely related to work in machine learning on supervised learning and reinforcement learning; indeed, some authors [53] have referred to it as "the half-way point" between those fields because of the ability to construct algorithms of a reinforcing nature with convergence guarantees while considering relatively general models.

In this work, we further divide the context into both the advertisement (arm-context) and the user or world-state (e.g., the weather) being selected for (world-context), where arm-context can be used to learn shared properties across arms in the same way as in infinite-armed bandits, while world-context interacts with arm-context and is declared on a per-step basis. In a more complicated hierarchical model, we have even more information: we know at each step all the factors of the preceding step (hierarchy and world context). This allows a much richer learning process where the hidden vector of contextual variables can be used to guide learning, even if it is incomplete.

Broadly, the expected reward can be approximated by a model of the form

\[
Y_i = \alpha + \beta A_i + \gamma W_t + \xi A_i W_t + \epsilon, \tag{2.39}
\]

where $Y_i$ indicates the expected payoff of a given arm conditional on the context, $\beta$ indicates the coefficient vector as a function of the arm context, and $\gamma$ a coefficient vector of the world context. In the web search context, the world-context vector might be the words included in the search query, in which case we would expect our agent, in the limit, to learn a model that suggests ideal advertisements related to the query for any given search.

A slightly more general form of contextual or side-information bandits is referred to as associative reinforcement learning [139] in some statistical and machine learning literature.

Early research for the contextual bandit problem includes Wang et al. [155] and Pandey et al. [118] and makes additional assumptions about the player's knowledge of the distribution or relationship between arms.
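Equation (2.39) can be made concrete with a tiny simulation. The coefficients below are hypothetical choices of ours, not from the thesis; the point is only that a non-zero interaction term ξ lets the best arm change with the world-context:

```python
import random

def expected_payoff(a, w, alpha=0.0, beta=-0.1, gamma=0.05, xi=0.3):
    """Equation (2.39) without the noise term: alpha + beta*A + gamma*W + xi*A*W,
    for scalar arm-context a and world-context w (hypothetical coefficients)."""
    return alpha + beta * a + gamma * w + xi * a * w

def draw_reward(a, w, noise_sd=0.1, rng=random):
    """One noisy draw Y = E[Y] + epsilon, with Gaussian epsilon."""
    return expected_payoff(a, w) + rng.gauss(0.0, noise_sd)

def best_arm(arm_contexts, w):
    """The arm a rational agent should learn to prefer under world-context w."""
    return max(range(len(arm_contexts)),
               key=lambda i: expected_payoff(arm_contexts[i], w))
```

With these coefficients, arm-context 0 is better when w = 0, but the interaction flips the preference when w = 1 — exactly the structure a contextual policy must learn.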
One of the first practical algorithms to be discussed in the context of horizon-unaware side-information bandits was Epoch-Greedy, presented by Langford and Zhang [100]. (While it is rare that no context is available, it is not rare that the value of the context is entirely unknown: in the stylized example of a slot machine, the arms may be different colors, and whether that is a determining factor in the payoff probabilities may be a priori completely uncertain.) One of the most salient works in this space is that of Dudik et al. [53], which brings contextual learning to a practical light by producing an online learning algorithm with a running time in polylog(N) and regret that is additive in feedback delay. Additionally, the work of Chu, Li, Reyzin, and Schapire [46] produces an analysis of an intuitive linear model-type upper confidence bound solution called LinUCB [1, 50, 128], derived from the UCB solutions for non-contextual bandits, which provides good real-world performance.

Importantly, Exp4 [16] makes no statistical assumptions about the state of the world or arms and therefore can be applied to the contextual problem; however, the majority of research thus far derived from Exp-type algorithms has been focused on the adversarial problem discussed prior. One exception is the application of Exp4.P to the strictly contextual problem found in Beygelzimer et al. [26].

Returning briefly to the slot machine example, contextual bandits model the situation where the machines have properties (arm-context) we believe may affect their payoff: perhaps some machines are red, some machines are blue (categorical context); perhaps machines closer to the front of the casino seem to pay better (linear, continuous context).
This could also represent the situation where payoffs vary by day of week, time of day or another world-context: perhaps slot machines in general are set by the casino to pay more on weekdays than on weekends, in an effort to increase the number of plays during the week.

LinUCB

LinUCB is a strong, intuitive polynomial-time approach to the contextual bandits problem. Largely, LinUCB builds on the upper confidence bound work of the non-contextual bandits solution by synthesizing concepts captured by the associative reinforcement learning algorithm LinRel [13]. LinUCB introduces a feature vector to the UCB estimate, which is maintained with a ridge regression. In general form, LinUCB observes a set of d features per arm (i), x_{t,i}, at each time step (t), and then selects an arm by a maximization of the regularized upper confidence bound estimate,

i = arg max_i θ_t' x_{t,i} + α sqrt(x_{t,i}' A^{-1} x_{t,i}), (2.40)

where α is a positive regularization parameter and θ_t is the coefficient estimate for the arm's features (θ_t = A^{-1} b, where A and b are maintained via the ridge regression updating process after observing the reward).

LinUCB achieves regret in polylog(H). Specifically, the regret bound shown by Chu, Li, Reyzin, and Schapire [46] is O(sqrt(T d ln³(KH ln(H)/δ))) for d-dimensional feature vectors, with probability at least 1 − δ.
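The selection rule in Equation 2.40 and the associated ridge updates can be sketched as follows. This is our own illustrative implementation of the per-arm ("disjoint") form, not the authors' code; the class and variable names are our choices.

```python
import numpy as np

class LinUCBArm:
    """Per-arm ridge-regression state for a LinUCB-style policy."""
    def __init__(self, d, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(d)     # identity (ridge) prior plus accumulated X'X
        self.b = np.zeros(d)   # accumulated X'y

    def ucb(self, x):
        """Regularized UCB estimate: theta'x + alpha * sqrt(x' A^-1 x)."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        """Rank-one ridge update after observing the reward for features x."""
        self.A += np.outer(x, x)
        self.b += reward * x

def select_arm(arms, features):
    """Pick the arm maximizing the regularized UCB estimate (Equation 2.40)."""
    return int(np.argmax([arm.ucb(x) for arm, x in zip(arms, features)]))
```

A small matrix inversion per decision is acceptable for illustration; a practical implementation would maintain A^-1 incrementally.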
The algorithm's sensitivity to non-stationarity and feedback delay has not yet been investigated in depth, though it may perform adequately in feedback-delayed situations, as the effect (or "pull") of each additional observation should decrease with increasing trials. (Recall that in the regression minimization problem, θ̂ = (X'X)^{-1} X'y; let A = X'X and b = X'y, where y is the observed reward.)

We present a technique in Chapter 3 which extends LinUCB to the Thompson sampling paradigm for improved performance in handling variance in early pulls and in generalizing the distribution underlying the regression coefficients.

CoFineUCB

An interesting approach to the contextual bandits problem is to treat the exploratory contexts as a hierarchy. When this works, it can achieve logarithmic treatment of the features by treating them as a tree. Generalizing LinUCB, CoFineUCB approaches the estimation in a coarse-to-fine manner that allows increasing accuracy by drilling into a particular variable subspace. CoFineUCB extends LinUCB to fit a model strictly within a selected "coarse" subspace, with a regularization parameter for the "fine" regression.
The intuition provided is one of users' preferences: if preferences can be embedded in a coarse-fine hierarchy (e.g., movies (coarse), action movies (fine); or ice cream (coarse), vanilla ice cream (fine)), then an initial model on the coarse levels can be supplemented by a stronger model on only those within the class to predict the fine levels.

In practice, CoFineUCB has been used in a recommender system context and shows good performance on experimental measures of regret when the coarse subspace accurately reduces the prediction variation for most users.

Our technique presented in Chapter 3 for creating LinTS is applicable to the CoFineUCB results to produce CoFineTS, an information-theoretically improved variant of the coarse-fine strategy.

Banditron and NeuralBandit

The contextual bandits solutions explored so far require the effect of context to be linear in the parameters within the interval being estimated. While some flexibility exists in terms of acceptable error rate and interval estimation, the linear regression techniques are all subject to similar constraints. Banditron [86] and NeuralBandit [7] are recent algorithms for the non-linear contextual bandit which utilize insights from the multi-layer perceptron [126]. At a high level, these algorithms replace the (generally back-propagation based) updating process in the perceptron algorithm with a partial-information technique using only the bandit feedback. The specific update process differs in each algorithm.

As of the date of this work, neural network-based techniques lack much theoretical analysis and show significantly suboptimal regret in stationary and linear applications; however, they are robust to both non-stationarity and non-linearity (and do not require a convex cost function whatsoever), where they show superior results.

2.2.5 Non-stationary Bandits

The non-stationary bandit problem is currently a very active research area.
Slowly changing environments have been explored in depth in the Markov decision process literature [52, 140]. In their paper [63], Garivier and Moulines prove an O(√n) lower bound on regret for generalized non-stationary bandits.

Discounted UCB(-T)

Discounted UCB and Discounted UCB-Tuned [63, 94] build on the work of UCB1 and UCB-Tuned for the original stochastic bandit problem, modifying the uncertainty padding estimate (the second term in the maximizing condition) and using a "local empirical average" instead of the traditional average, considering older data in a discounted fashion. Effectively, Discounted UCB creates an exponentially decayed version of UCB parameterized by some discount factor γ ∈ (0, 1).

In the same fashion as the original UCB, in Discounted UCB, at time t we select the arm i that maximizes the form x̄_{i,t} + c_{i,t}, where c_{i,t} is a measure that "shifts" the estimate upward (often selected as a variance-adjusted estimator of the "exploratory value" of arm i). In Discounted UCB, however, we parameterize both terms of that equation with a discount factor γ. We use an indicator function 1_σ, defined as 1 if the condition σ is true and 0 otherwise, and a list A_t that indicates the arm selected at time t. Specifically,

x̄_{i,t}(γ) = (1 / N_t(γ, i)) Σ_{s=1}^{t} γ^{t−s} x_s(i) 1_{A_s = i}, (2.41)

where N_t(γ, i) = Σ_{s=1}^{t} γ^{t−s} 1_{A_s = i} is the discounted average denominator and x_s(i) is the payoff received from arm i at time s. This equation serves to capture the mean discounted payoff estimate for arm i at time t. And c_{i,t} is,

c_{i,t}(γ) = 2B sqrt( ξ log( Σ_{j=1}^{K} N_t(γ, j) ) / N_t(γ, i) ), (2.42)

where B is an upper bound on the reward, as in the general formulation of UCB1, and ξ is a parameter selected as 1/2 in their paper, but with little further exploration.

Garivier and Moulines [63] show that Discounted UCB achieves optimal non-stationary regret up to logarithmic factors, Õ(√n).
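The discounted statistics in Equations 2.41–2.42 can be sketched as below. The recursive maintenance (multiplying every running sum by γ at each step) is our own implementation choice for efficiency, the unplayed-arms-first rule is a common convention rather than part of the equations, and B and ξ take illustrative values.

```python
import math

class DiscountedUCB:
    """Illustrative Discounted UCB: discounted means plus a discounted padding term."""
    def __init__(self, k, gamma=0.99, B=1.0, xi=0.5):
        self.gamma, self.B, self.xi = gamma, B, xi
        self.weighted_sum = [0.0] * k   # sum over s of gamma^(t-s) * x_s(i) * 1[A_s = i]
        self.weighted_n = [0.0] * k     # N_t(gamma, i)

    def select(self):
        total_n = sum(self.weighted_n)
        best, best_score = 0, float("-inf")
        for i, n in enumerate(self.weighted_n):
            if n == 0:
                return i                # convention: play each arm once first
            mean = self.weighted_sum[i] / n
            pad = 2 * self.B * math.sqrt(self.xi * math.log(max(total_n, 1.0)) / n)
            if mean + pad > best_score:
                best, best_score = i, mean + pad
        return best

    def update(self, arm, reward):
        # Discount every arm's history by gamma, then add the new observation.
        for i in range(len(self.weighted_n)):
            self.weighted_sum[i] *= self.gamma
            self.weighted_n[i] *= self.gamma
        self.weighted_sum[arm] += reward
        self.weighted_n[arm] += 1.0
```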
By replacing the c_{i,t} term with the tuned term from UCB-Tuned, with additional time discounting in the same γ^{t−s} fashion, we get a variant of Discounted UCB, Discounted UCB-Tuned, that is expected to have the same empirical improvements as in the non-discounted case.

Sliding-Window UCB(-T)

Sliding-Window UCB (SW-UCB) [63] is an extension of Discounted UCB to use a sliding window rather than a continuous discount factor. A sliding window can be modelled as a discount factor of 100% for all data points older than some parameter τ representing the size of the window. To define the UCB functions for SW-UCB, [63] extends the same UCB1-type maximization process x̄_{i,t} + c_{i,t} for a τ-period window as,

x̄_{i,t}(τ) = (1 / N_t(τ)) Σ_{s=t−τ+1}^{t} x_s(i), (2.43)

where N_t(τ) = min(t, τ) is the total length of the set being summed over, eliminating the discounting consideration from above. The padding or optimism function, c_{i,t}, then, is,

c_{i,t}(τ) = B sqrt( ξ log(min(t, τ)) / N_t(τ, i) ), (2.44)

where N_t(τ, i) indicates the number of times arm i was played in the window of length τ. SW-UCB performs slightly better in their experimentation than the pure discounted approach and has the benefit of not requiring maintenance of data older than τ records. Both algorithms are strongly superior in regret to the Exp3-type algorithms and to UCB1 with no non-stationarity modifications for the non-stationary problems tested.

Adapt-EvE

Hartland, Gelly, Baskiotis, Teytaud, and Sebag [73] present an extension to the UCB-Tuned algorithm [15] to deal with abrupt changes in the distribution associated with each arm. Adapt-EvE is considered a meta-bandit algorithm in that it uses a bandit algorithm at a higher level of abstraction to determine which bandit algorithm parameterization to use at each time.
In particular, Adapt-EvE works by running the UCB-Tuned policy until a change-point in the underlying distribution is detected using one of many change-point detection algorithms; in their paper they use the Page-Hinkley test [18, 74, 116] with "discounted inertia" to trigger only in the change-point case, not the drifting case. (The rationale presented in Hartland et al. [73] for discounting the change-point statistic is that UCB-Tuned is capable of handling slowly drifting reward distributions on its own. We show in Section 3.4 that for certain forms of non-stationarity, UCB-Tuned is out-of-the-box insufficient to perform optimally and an informed detrending technique is preferred.) Upon detecting a change-point, a meta-bandit is initialized with two arms: one which continues using the trained version of UCB-Tuned, and another which resets all parameters and instantiates a new instance of Adapt-EvE. Training continues at the meta-bandit level (learning whether to continue using the trained data or to learn again) and at the selected sub-level. (A note on SW-UCB above: in their paper, N_t is erroneously provided as the same discounted version given for Discounted UCB. This cannot be correct, as γ is no longer provided and the average would be incorrect.)

Kalman Filtered Bandit

Kalman filtered bandits [23, 70, 71] have been investigated, in which the estimate of the mean payout of an arm is maintained by a recursive sibling Kalman filter parameterized by two a priori noise estimates: σ²_ob for observation noise or measurement error, and σ²_tr for transition noise (the non-stationarity error). Results are somewhat sensitive to these noise estimates. At each time step t, an estimate of the mean and variance for the arm played (with reward received x_{i,t}) is updated,

μ_{i,t} = ((σ²_{i,t−1} + σ²_tr) · x_{i,t} + σ²_ob · μ_{i,t−1}) / (σ²_{i,t−1} + σ²_tr + σ²_ob), (2.45)

σ²_{i,t} = ((σ²_{i,t−1} + σ²_tr) · σ²_ob) / (σ²_{i,t−1} + σ²_tr + σ²_ob).
(2.46)

The non-played arms all have σ²_tr added to their variance estimate at each time step, indicating how their uncertainty increases as time progresses. These equations and the general form of this model arise from the well-studied Kalman filter. The numerous published extensions to the Kalman filter for varying confounding factors can likely be applied in this space.

This approach performs very well in drifting and change-point cases; however, it is outperformed by Adapt-EvE in the well-defined change-point case. The resilience to the form of non-stationarity makes this a valuable approach in the event the parameters can be well predicted. This has not been explored within a contextual setting, with Thompson sampling or probability matching techniques, or with an optimistic approach.

2.2.6 Probability Matching and Thompson Sampling

W. R. Thompson (1933), "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples," produced the first paper on an equivalent problem to the multi-armed bandit, in which a solution to the Bernoulli-distribution bandit problem, now referred to as Thompson sampling, is presented.

The stochastic solution presented by Thompson [144] involves matching the probability of playing a particular arm with the arm's inherent "probability of being the best" given the data observed, by sampling from each distribution precisely once and selecting the maximum sample. The language probability matching arises from this intuition and seems to originate from Morin [112]. Probability matching is extensively used in the experimental psychology literature to describe the behavior of matching action probabilities to the probability of an outcome. This concept is distinct from the actual implementation of sampling precisely once from the posterior estimate to simulate the optimality pseudo-distribution, which we refer to as Thompson sampling.
A factor motivating this interplay of nomenclature is the increasingly common use of multi-armed bandit processes in the modelling of animal and human psychology and behavior [e.g., 123, 137].

Scott [132] applies a strictly Bayesian framework to presenting Thompson sampling and specifically calls it randomized probability matching. We will simply use the language of Thompson sampling through the rest of this discussion.

Thus far, we have not discussed any probability matching solutions. Probability matching is a technique that draws on the Bayesian literature and theoretically applies directly to any variant of the bandit problem where the probability of an arm being the best can be expressed as a distribution. Recently, it has been shown in many analyses that probability matching techniques are competitive with the state of the art in a variety of bandit and other learning contexts. We show later that a nonparametric form of probability matching is available and compare it to existing similar solutions (such as the quasi-parametric binomial probability matching technique of Agrawal and Goyal [6] and the bootstrap-derived approach of Eckles and Kaptein [55]).

Recent research in Thompson sampling has provided an information-theoretic analysis [130], various proofs and demonstrations of regret minimization [68, 72], a technique to apply Thompson sampling via the online bootstrap [55], exploration of the cold-start problem in recommender systems [115] and numerous applications of the technique [27, 68, 114, 135].

Strict bounds on regret were a hindrance to theoretical adoption of generalized Thompson sampling; however, recently, bounds for a single specific model (traditional K-armed bandits with beta prior distributions) have been discovered by Agrawal and Goyal [6]. For K = 2, their bound on regret is given as O(ln H / Δ); for K > 2 the bound is significantly worse, O((Σ_{i=2}^{K} 1/Δ_i²)² ln H).
Significantly, the information-theoretic work of Russo and Van Roy [130] proves efficient (O(log H)) regret bounds for Thompson sampling and shows convincingly that Thompson sampling performs comparably to a correctly-tuned UCB-type algorithm in general. This result had been expected, but it is significant because Thompson sampling is a more general solution than any particular implementation of a UCB-type algorithm.

Empirical research shows this strategy has excellent results both in traditional constrained variants and in variants with less strongly maintained assumptions.

Thompson sampling extends well to cases with generalized distributions for the arms. For example, a Bayesian approach to Thompson sampling for the common case of Bernoulli arms is made computationally efficient and simple to implement by sampling from the appropriate product of conjugate beta distributions, parameterized in such a way that only tracking the number of successes and the number of failures is necessary.

In order to formalize the matching of our play probabilities with the probability of a given play being the best play, we adopt a Bayesian framework and, in general, a parametric distribution over a parameter set θ. We can compute the probability at time t of a given arm providing optimal reward as,

∫ 1[S_t = arg max_{i=1,...,K} E_θ[x_{i,t}]] P(θ|x) dθ. (2.47)

(The cold-start problem is a particularly apt use of bandit modelling. Specifically, the problem models the issue of being unable to draw any recommendations for new users until sufficient data has been collected for said user to fit an appropriate model or prediction to his or her preferences. The multi-armed bandit model provides exploratory guidance in some contexts to help address this problem.)

Rather than computing the integral, Thompson [144] and others show that it suffices to simply sample from the estimated payoff distribution at each round and select the highest sampled estimated reward.
That is, the repeated selection of the maximum of a single draw from each distribution produces an estimate (and thus selection behavior) of the optimality distribution. This result, while long known, is surprising and valuable, turning an intractable problem into a computationally simple one.

Optimism in Probability Matching

A recurring perspective on the efficient use of uncertainty within the multi-armed bandit (and exploratory learning in general) has been that of "optimism in the face of uncertainty" [15, 99, 113, 142]. The idea is presented as a method for treating uncertainties and balancing exploration: when a statistical uncertainty is present, a small but consistent gain in outcome [42] can be achieved by simply remaining optimistic and assuming the value is in the "more desirable" portion of the distribution under uncertainty. We show in a later section that "small but consistent" may not be the right language to describe optimism; indeed, the gains from optimism are inconsistent and can be large.

This idea has been seen already in many of the static (non-probability matching) algorithms presented prior. For example, any UCB-type algorithm derives its action estimates from an "optimistic" surrogate for the state of the empirical estimate. This form of static optimism is the basis of most algorithms for multi-armed bandits, though the mechanism for defining optimism varies.

In probability matching, optimism is usually defined as mean-optimism, that is, sampling only from the portion of each distribution above its mean. Often we will refer to the distribution of values only above the mean as the optimistic surrogate distribution.
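For, say, Gaussian posterior estimates, both plain Thompson sampling (one draw per arm, play the argmax) and the mean-optimistic variant are simple to write down. The sketch below is our own naive illustration, using rejection sampling for the optimistic surrogate rather than any efficient sampler; the posterior parameters are assumed given.

```python
import random

def thompson_sample(posteriors, optimistic=False):
    """Draw once from each arm's Gaussian posterior and return the argmax arm.

    posteriors: list of (mean, sd) tuples, one per arm.
    With optimistic=True, draws are naively rejected until they fall above
    the posterior mean (the "optimistic surrogate" distribution).
    """
    best_arm, best_draw = 0, float("-inf")
    for i, (mu, sd) in enumerate(posteriors):
        draw = random.gauss(mu, sd)
        while optimistic and draw < mu:   # naive rejection sampling
            draw = random.gauss(mu, sd)
        if draw > best_draw:
            best_arm, best_draw = i, draw
    return best_arm
```

Repeating this selection matches play frequencies to each arm's estimated probability of being best.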
Later, we will show how to efficiently sample from this distribution for any tractable distribution, and then show how empirical sampling and resampling can be used to improve the results when the true underlying distribution is unknown.

The Bernoulli Approach to Nonparametric Thompson Sampling

In the case of bounded bandits (that is, arms which have a reward guaranteed to be in some range), Agrawal and Goyal [6] propose an algorithm which transforms any bounded distribution to look like a Bernoulli distribution for the sake of sampling. First, the Bernoulli implementation of the simplest form of Thompson sampling is presented in Figure 2.3.

Data:
  i: the set of available arms (i = 1, 2, ..., K)
  Ω: some stopping rule (e.g., exit when the horizon H is reached)
Result: Thompson Sampling for Bernoulli Bandits

// Set counters of wins and losses for each arm to their prior (or one, for ignorance).
initialize ∀i: W_i ← 1;
initialize ∀i: L_i ← 1;
while (Ω) do
  r̂* ← −∞;
  S_t ← nil;
  for each arm i do
    // Draw once from the beta distribution representing each arm.
    r̂_test ← draw from Beta(W_i, L_i);
    // Record the best draw.
    if r̂_test > r̂* then
      r̂* ← r̂_test;
      S_t ← i;
    end
  end
  reward ← play(S_t);
  if reward == 1 then
    W_{S_t} ← W_{S_t} + 1;
  else if reward == 0 then
    L_{S_t} ← L_{S_t} + 1;
  end
end

Figure 2.3: Thompson Sampling for Bernoulli Bandits

This implementation as described relies on the fact that the Beta(W + 1, L + 1) distribution behaves accurately as an estimator of the mean and variance of the binomial (repeated Bernoulli trials) distribution, and as such is only appropriate for arms drawn as a Bernoulli trial. When the rewards are generated from an arbitrary bounded distribution, Agrawal and Goyal [6] suggest a simple modification: assume the rewards are drawn from a distribution bounded between zero and one; then, when a reward is observed, perform a Bernoulli trial (i.e., a weighted coin flip) and update W_i or L_i according to the payoff.
The modified algorithm only augments the play function in the pseudocode above to become one which plays and then flips a coin based on the value of the result. This technique only works if the distribution is bounded in a range that can be treated (via scaling) as [0, 1]. In Chapter 3, we present both an empirical sampling technique which allows the implementation of an optimistic variant of this strategy and a novel technique which produces a distribution-free sampler without the boundedness requirement.

Bootstrap Thompson Sampling

Eckles and Kaptein [55] present an implementation of Bootstrap Thompson Sampling (BTS) which produces a scalable, robust approach to Thompson sampling. We show in this section that their implementation can be modified to support an arbitrarily scaled form of optimism and a completely nonparametric approach at minimal cost, at least for some subset of the problem space. We then show that the sampling strategies shown in the Sampling Uncertainty section can be applied here to avoid a number of poor edge-case performances and essentially control the risk of such a policy. Finally, we expand the implementation to a categorical-contextual policy which subsets the sampling space into categories to implement a refinement of the model.

The initial implementation of BTS involves selecting a parameter J which simultaneously controls the computational complexity and the relative greediness of the model. Upon receipt of each reward r_i, BTS trains J parametric models on the assumed underlying distribution by considering the reward in each model with probability one-half. To select the next arm to play, BTS, on a per-arm basis, chooses one of the J replicates and uses the expected value of that model to predict an empirical mean payoff. In the Thompson sampling style, the highest of those estimated payoffs is selected as the action for this iterate.
As the empirical mean is deterministic across the replicates, if J is too small the decision becomes fundamentally greedy, choosing to play only the best empirical arm prematurely. (Computing the expectation, rather than sampling from the subdistributions, gives us an efficient sampler in the bootstrap sampling distribution θ, but does not maintain the "uncertainty awareness" property in the individual distributions; a sufficiently large J recovers the uncertainty-awareness property of Thompson sampling.) This technique works well, showing performance competitive with the traditional model and even exceeding it in the case of heteroskedasticity.

Change-Point Thompson Sampling (CTS)

Mellor and Shapiro [108] present an algorithm based on a two-stage Thompson sampling technique combined with a particle-filter-based learning tool [3, 59, 60] for a specific form of non-stationarity in the distributions. The non-stationarity utilized is a constant-hazard-rate change-point problem, where the payoff distributions have sudden and complete changes at a constant (but unknown) rate.

They model the problem as a non-contextual bandit simulation where, because of the complete and independent nature of the change, after a change is found all prior data has no value. In order to utilize the randomized probability matching technique within this dual unknown, they first sample the posterior (produced from a particle filter technique) for the hazard rate, and then use the drawn sample to sample from the bandit arms. This idea that Thompson sampling can be applied as a recurring process inspires the basis for one of our linear-bandits Thompson sampling implementation experiments in Chapter 3.

A major contribution of the Mellor and Shapiro [108] paper is their simulation technique. They amalgamate the (non-stationary) PASCAL EvE challenge [79], the Yahoo! [102] real-world news click-through data and a foreign exchange simulation [109].
Their algorithm shows great real-world performance, but underperforms on the simulation data.

2.3 Application Area

The multi-armed bandit represents the advertising problem explored in this work well. This much can be seen simply from the size of the substantial related literature. Much of the literature in place considers the inverse problem: selecting advertisements to display or purchase for a given vendor to maximize its outcomes [117, 129], rather than the problem of maximizing an outcome given a set of advertisements. These problems are related, but the experimental design on the sales copy and web design side of the problem provides a substantially larger distribution of marketer-selected variables, as well as a significantly different sampling distribution to experiment over.

One of the most significant related works in the application area is that of Li et al. [101], where the algorithm known as LinUCB is presented. LinUCB presents the first work to provide a simple regression-based framework for multi-armed bandits and applies it explicitly to selecting the optimal (in terms of revenue, time on site, or other objective criteria) news story to display on the Yahoo! homepage. Their result is successful, applies the set of readily available contextual variables in a clear and easy to understand manner, and shows promise in extending to more complicated variants of the same problem.

Scott [132] and Scott [133] present perspectives on Thompson sampling as applied to the online advertising problem. Scott [133] in particular provides a number of designs for experiments intimately related to the problem under consideration in this work: explicit experiments in web design or content selection. For example, an experiment for selecting the color of a call-to-action button is presented, where the experimental configuration is a logistic regression with a boolean variable in place for each of the available options.
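An experimental configuration of that kind can be sketched with indicator variables, one per non-baseline option. The coefficient values, option names and baseline below are invented for illustration; they are not from Scott [133].

```python
import math

# Hypothetical log-odds coefficients for a call-to-action button experiment:
# an intercept plus one indicator coefficient per non-baseline color option.
COEF = {"intercept": -2.0, "red": 0.3, "green": 0.1}  # baseline option: "blue"

def conversion_probability(color):
    """Logistic model: P(action) = 1 / (1 + exp(-(intercept + beta_color)))."""
    logit = COEF["intercept"] + COEF.get(color, 0.0)  # baseline contributes 0
    return 1.0 / (1.0 + math.exp(-logit))
```

Each option's estimated probability can then feed a Thompson-sampling or UCB selection step in the usual way.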
Scott [133] also draws attention to the mixed modality of the objective function, showing that it is not always simply sales (or immediate profits) that is the correct variable to optimize; rather, many intermediaries, such as measures of quality and measures of satisfaction, are necessary to produce a long-term, sustainably profitable business.

A work by Tang et al. [143] produces a system for using the multi-armed bandit problem to select the appropriate advertisement layout in order to increase the effectiveness of an advertisement (measured either in click-through rate or total revenue in an online advertisement auction context) within an existing webpage. Their work relies both on a large corpus of work related to predicting click-through rates [4, 19, 43, 69, 124] and on a large corpus of available contextual variables arising from users being signed in to their LinkedIn accounts when displayed the advertising in question. Importantly, their work, as in all of this work, finds that Thompson sampling is a deceptively competitive tool for optimizing bandit problems of this nature.

Chapelle, Manavoglu, and Rosales [43] provide a comprehensive look at both the online advertising industry itself and the utilization of a logistic regression to predict the response to display advertising via a number of contextual variables related to both the advertisement itself and the context within which it is to be displayed. Such a prediction can be applied within a multi-armed bandit context (in a technique related to that of Li et al.
[101] or of our own LinTS model presented in Section 3) to produce a revenue- (or response-) maximizing, exploration-exploitation-aware online system.

As a quick review of the necessary background, the economic model under which our problem operates looks similar to many online eCommerce businesses: a product is for sale, and there is a user acquisition process which directs users to a pre-sales environment, where they are presented the marketing information and the opportunity to purchase the product. Three common forms of user acquisition exist for businesses of this sort, each having two parties: the advertiser, who wishes to acquire users for his marketing technique, and the publisher, who operates an existing medium with captive users to be advertised to. The three common user acquisition processes are: (1) a long-term approach called search engine optimization (SEO), where the goal is to increase relevancy in the search engines, in order to acquire traffic from users who were searching for target keywords; (2) a medium-term, branding-oriented approach where advertisers pay cost per mille (CPM) for views (priced per thousand), but not necessarily clicks, of their advertisements on a selected other website; (3) the short-term approach of purchasing clicks directly, paid for per click on the advertisement. There exists a fourth case where advertisers pay publishers of their advertisements per sale (CPA) of their product (or other action); it is not directly considered in our work, though our primary goal (increasing "actions" or sales) could then be seen from the perspective of the publisher.

In the first case, there is little short-term control over the number or cost of new users. In the second case, the advertiser must select an advertisement creative in order to have the maximum desired effect.
Often, CPM advertising will be used for branding campaigns, or advertising campaigns where the immediate goal is not necessarily increased sales but rather a longer-term relationship with the public. Our work does not consider the branding case. In the immediately-increased-sales variant, the advertisement creative is selected to maximize clicks through to the advertiser's website, at which point the marketer must select an appropriate website design and sales text in order to maximize sales or profit. In the third case, the advertiser pays only for clicks through to their website, in which case only the last step, selecting the appropriate website design and sales text, is paramount.

Chapter 3

Towards the Use of Multi-Armed Bandits in Advertisement Testing

3.1 An Extensible Platform for Simulating Bandit Problems

We constructed a highly-extensible, multi-paradigm simulation and experimentation platform for quantifying and comparing bandit strategies across a variety of contexts. The platform provides an efficient way of running comparable controlled experiments for the purpose of learning about policies or parameters in a risk-free environment. Fundamentally, the structure of the platform is made up of three components: simulators, arms and policies. The platform is highly parallel in world variants, replicates and policy or parameter choices, and is set up to immediately spawn separate threads to maximally utilize available CPU hardware.

3.1.1 Implementation

Simulator

A simulator is a container that holds the current state of the world in terms of available arms and maintains detailed statistics on the results throughout the simulation. The simulator is initialized with the set of information the experimenter wishes to retain and a pointer to the instruction files for the problem set (a collection of different world states, each made up of a number of arms) and the policy set being experimented upon (a collection of policies and their configurations).
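The three-component structure described above might be sketched with interfaces like the following; the class and method names are our own illustrative choices, not the platform's actual API, and the Bernoulli arm is just one example arm class.

```python
import random

class Arm:
    """Example Bernoulli arm: holds its own state and can produce draws."""
    def __init__(self, p):
        self.p = p
    def expected_value(self):
        return self.p
    def draw(self):
        return 1.0 if random.random() < self.p else 0.0

class Policy:
    """Base policy interface: choose an arm, then observe the reward."""
    def select(self, k):
        raise NotImplementedError
    def observe(self, arm, reward):
        raise NotImplementedError

class Simulator:
    """Container for world state; runs a policy and records the results."""
    def __init__(self, arms):
        self.arms = arms
        self.history = []                      # (chosen arm, reward) per step
    def run(self, policy, horizon):
        for _ in range(horizon):
            choice = policy.select(len(self.arms))
            reward = self.arms[choice].draw()
            policy.observe(choice, reward)
            self.history.append((choice, reward))
        return self.history
```

Separating the three concerns this way lets new arm classes (drift, change-points, context) and new policies be swapped in without touching the simulation loop.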
In the set of problems and policies, tags or shared identifiers are used to categorize worlds and policies in arbitrary dimensions in an easily comparable way.

Arms

An arm is an object which contains its own state information (usually distributional parameters such as mean and variance, but also state variables related to non-stationarity or contextual confounders), knows how to calculate its own expected value (if possible) and can produce a draw from its result distribution. Arms within this context can have context (provided by the simulator at each time step) at both an arm and a world level, can drift or have change-points, and can generally introduce any arm-level deviations of the problem desired via further extension of the prototypes.

Policies

A policy or "strategy" is a self-contained unit that interacts with the simulator to pull arms and receive rewards. Policies are implemented by the experimenter in their own class, which receives only the information necessary to inform the policy. We have included an implementation of many of the policies explored in Chapter 2. This is important to produce reproducible, real-world-applicable understanding of policies and their parameters. In certain contexts, modifications to the simulator are available (with appropriate selection at initialization) to allow policies the ability to observe information that they would not normally be able to access. This is used in particular later, when we explore the prescient measures to elevate our understanding of how particular modelling errors can affect the result.

A policy generally comes with an object which specifies its configuration.
This allows different parameters for a policy (such as ε in the ε-strategies, weight functions in the case of our own WLS implementation, or discount rates in strategies like POKER) to be explored in a way that allows easy comparison in the simulation output.

Measurements

The simulator is capable of reporting a wide variety of desirable measurements, including all the computable definitions of regret given in Chapter 2 (weak and strong regret; stochastic and expected value; non-optimal play count) and a number of measures of "divergence" between the arms (updated per pull, to handle cases of non-stationarity), including Kullback-Leibler (K-L) divergence [98] (symmetrized [83]), the resistor average [83] and the J-measure [81], computed distribution-wide (across all arms) and in the maximizing-, minimizing- and closest-arm subcases.

Included with the simulator is a small set of visualization tools intended to transform the iterate- and replicate-level variable outputs of the simulator into appropriate graphical representations for understanding, analysis and publication. These tools allow categorization over any classification variable available for the underlying problems or worlds examined, including assumption violations, distribution choice or size/scale of the problem, as well as categorization over policy selection and policy configuration.

3.1.2 Problems

While many problems can be simulated, we argue that the multi-armed bandit problem is extremely difficult to simulate in a way that generalizes effectively to applied problems. As such, one of the simulators we have provided is the "Yahoo! News Click-Through" simulator, which uses the technique of Li, Chu, Langford, Moon, and Wang [103] to allow the reconstruction of bandit state from the wealth of real Yahoo! data, and a simulator which uses live currency-pair data [109] in a similar method to simulate forex trading with bandit algorithms.
These produce real-world environments with which simulation-driven error can be detected by a cautious experimenter.

Additionally, we provide re-implementations of the PASCAL Exploration vs. Exploitation (EvE) 2006 environments. PASCAL EvE was a competition to identify new and experimentally optimal non-stationary bandit algorithms. The challenge provided six artificial simulations: frequent swap, where the best arm switches rapidly; long gaussians, where the payoff drifts in a gaussian fashion over a long time period; weekly variation, in which two sinusoidal components vary the payoff probability, with the longer being dominant; daily variation, in which two sinusoidal components vary the payoff probability, with the shorter being dominant; weekly close variation, in which response rates are very close together; and constant, in which there is no non-stationarity.

Finally, we have a set of our own synthetic simulations, covering features including: drift (options including: linear, exponential (varying size), sinusoidal, random walk, none); change-points (constant, Poisson, none); arm class (binomial, Gaussian, contextual); mean and variance; and number of arms. This set of base simulations can be extended by future authors to introduce further "world problems" (deviations from the assumptions) to improve the quality of simulation results.

This platform drives most of the experimentation in the following sections. Where appropriate, assumption violations and modifications to the problem have been implemented as novel problem sets (and even novel arm classes) in the simulation framework to produce research which is both reproducible and extensible to new, but related, questions.

3.2 Linear Model Thompson Sampling: LinTS

LinUCB [101], as discussed in Section 2.2.4, and most other regression-based UCB models can be transformed to use Thompson sampling if one can obtain estimates of the higher (non-mean) moments of the distribution.
Indeed, one almost always has sufficient information to do this, as most UCB mechanisms rely on a formulaic representation of variation to capture the definition of the upper confidence bound. We extend the model only to the linear regression case, similar to that of LinUCB.

At a high level, we extend the model of LinUCB to fit a model of the form

Y = α + β0C + β1A + γC·A + ε,   (3.1)

where Y is the expected reward for the given set of parameters, C = [Cn ∈ R^K] for n ∈ {0, ..., N} is the set of contextual variables and A ∈ {0, 1}^K is a set of arm dummy variables. The fitting technique used is left to the implementer; however, it must provide an adequate estimate of the moments of the (true) distribution. In the frequentist domain, penalized least squares techniques such as ridge regression show promise, especially in the case where the relevance of the contextual variables is uncertain and some level of overspecification is likely.

Without loss of generality¹, we assume the normally-distributed linear regression fit, with a coefficient (mean estimate) and standard error (variance estimate) provided for each contextual variable, arm and interaction effect. This scheme is inherently Bayesian in nature; however, to improve computability in a practical sense, we utilize a traditional linear regression model with penalized coefficients (ridge regression). Ridge regression can be represented as a Bayesian regression technique with a fixed prior on the βs, making some of the result interpretation more tractable.

The algorithm used is roughly as follows:

1. Fit the model. Using the fitting technique given, produce coefficients (µ) and standard errors (σ²), generating normally distributed random variables ξ ∼ N(µ, σ²).

2. Calculate a summed distribution Ỹ for each arm. For each arm i, set Ai = 1 and Aj = 0 for all j ≠ i, then sum the random variables multiplied by their observed context vector Cn to get an estimate of Ỹi,C ∼ N(µ̃, σ̃²) for arm i and context C.

3. Apply Thompson Sampling.
With each Ỹi in our given context, we now have a model of the (current) distribution of expected reward for each arm given our current knowledge. Using this, we sample each arm according to the probability that its reward is the maximizing reward, in the tradition of Thompson Sampling. We explore a few methods of doing this in a computationally tractable and affordable way.

This provides a result that is similar to that of LinUCB, but utilizes the Thompson Sampling strategy rather than the constrained upper confidence bound approach. In order to optimize the outcome, we test a number of uncertain implementation details within our simulation architecture, expecting that each question has a "correct" answer that can be generalized to all problems.

3.2.1 Optimistic Thompson Sampling in LinTS

We propose a technique that exploits the assumptions of the linear model and the probability matching technique of Thompson sampling. Based on the assumption of normality, the regression coefficients β̂ are normal and hence the predictions ŷt are normal. We then optimistically sample (drawing only values above the mean) from a normal distribution with mean Σi (β̂i · xi,t) and variance Σi (V̂ar(β̂i) · x²i,t) to approximate ŷt. A more general form of this fundamentally Bayesian algorithm can be constructed utilizing the techniques of Bayesian regression [111], at the cost of generally higher computational complexity.

In this section, we have presented a computationally efficient, flexible, linear regression-based Thompson Sampling implementation, LinTS, which is easily extensible to any contextual model, simple to implement, and provides low regret and easily interpretable results in practical experiments.

¹ As necessary, replace the distributional assumptions in our algorithm with those of your fitting technique.
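A minimal sketch of the LinTS decision step, assuming a ridge fit has already produced per-coefficient means and standard errors (the fitting itself is elided; the flat coefficient layout and names are illustrative):

```python
import random

def lints_select(coef_mean, coef_se, arm_contexts):
    """One LinTS decision: draw each coefficient xi ~ N(mu, sigma^2),
    score every arm's context vector with the sampled coefficients,
    and play the argmax -- probability matching via a single joint
    sample, in the tradition of Thompson sampling."""
    sampled = [random.gauss(m, s) for m, s in zip(coef_mean, coef_se)]
    scores = [sum(b * x for b, x in zip(sampled, ctx))
              for ctx in arm_contexts]
    return max(range(len(arm_contexts)), key=scores.__getitem__)
```

Repeated over many rounds, arms are played in proportion to the (posterior) probability that they are the reward-maximizing arm.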
Further research on LinTS is necessary to prove (theoretical) asymptotic bounds on regret and to better understand how results from the linear regression model can be interpreted for non-prediction-type tasks².

In the next section, we extend this model, explore the space of non-stationary regression models and show how existing time series techniques can be extended to LinTS in order to efficiently handle both slowly drifting and rapidly changing true world states.

3.3 Experiments in Thompson Sampling

In this section, we explore some of the considerations and questions that arise when implementing an optimistic sampler in a parametric (specifically, model-fitting) context. These questions are experimented upon within the context of the LinTS algorithm, but the experiments have been generalized to other forms of Thompson sampling via the simulator. Each of these questions is explored in more detail in the coming sections, but as a quick introduction and reference a summary is provided here.

• Q1: How does the measure of centrality affect optimism? When computing an optimistic sampler, we must choose a form of optimism. Indeed, optimism can take many forms: one could drop the least favorable observations, sample only from the top 10%, or otherwise be optimistic. In general, optimism with regard to Thompson sampling is tested in the context of mean-optimism, where the results from the portion of the distribution above the mean are used. We explore using the median in certain contexts to modify the impact of skewness and outliers, and find that there is no significant effect.

• Q2: How does the concept of estimative uncertainty differ from the standard error for sampling? We test this question by using our within-simulator prescience of the true underlying variance of an arm.
We find two surprising results: (1) the effect size of removing the true (or even computed) underlying variation from the estimate under sampling is large, and (2) this effect appears to be shared with the effect of optimism – that is, the benefit of removing estimative uncertainty is substantially smaller in optimistic sampling than it is in non-optimistic sampling. This result provides a piece of evidence towards understanding why optimism is beneficial to the exploratory process under uncertainty.

• Q3: How does k-sampling from Ỹi affect the distribution of regret? To compute the likelihood of a given arm being the best, we can simply draw k times from each distribution and combine the estimates in a variety of ways. We expect ex ante that higher k's learn faster, but have a marginally higher computation cost and are more likely to get stuck on incorrect solutions.

We explore different values of k and different combination strategies and find some particularly interesting results about the true nature of optimism: specifically, excessively optimistic sampling (taking the most optimistic of the k samples per arm) seems to perform asymptotically better in k, suggesting that the ratio of prospective best values is a determinant of bandit performance. This is a surprising result, as the surrogate (sampling) distribution is no longer influenced strongly by the true underlying mean, but only by the most optimistic forecast.

² Specifically, to interpret linear regression coefficients as causal in the econometric sense, or to derive many proofs of unbiasedness and efficiency, one must assume the regression samples are drawn randomly. In the linear bandit case, regression samples are drawn according to the Thompson Sampling process or UCB maximizing process on the previous time world state. How this affects the model is an open research question which requires further exploration.
A thought experiment with k = ∞ provides an interesting but poorly understood look at this result, showing that the result only holds if the ratio of two arms' most optimistic forecasts is well defined.

3.3.1 Measure of Centrality

The traditional implementation of optimistic Thompson sampling implements optimism as a surrogate distribution bounded at the mean or expectation of the distribution. This is intuitive, as our goal is to maximize total reward; however, in smaller samples drawn from distributions with sufficient skew, it is plausible that a different measure of centrality, such as the median, may be preferred. In general, in our model of optimism, there is an infinite set of degrees of optimism, each representing a portion of the distribution to be discarded. That is, median-optimism can be treated as dropping exactly the lower 1/2 of the distribution.

We compared mean-optimism and median-optimism using the beta distribution as the estimator for the binomial distribution. This gives us the cases of positively skewed, negatively skewed and non-skewed parameterizations. We selected the beta distribution approximation solely for the ease of varying the relevant shape parameters³. The results shown appear to generalize well to other distributions along their relevant skewness.

After extensive experimentation, including experimentation with artificially introduced error, we find the measure of centrality is not a significant factor in the performance of optimism, even in highly skewed distributions, when measured by expectation regret, total reward or suboptimal plays.

³ Recall, the beta distribution is parameterized on α and β, shape parameters. Kerman [90]'s approximation to the median is used, and distributional parameters are selected such that the numerically computed relative error of the approximation is < 10⁻⁶.
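The mean- vs. median-optimism comparison can be sketched as truncated sampling from the beta surrogate, bounded below at the chosen centre. The median uses Kerman's closed-form approximation as cited; the rejection-sampling mechanics are our illustrative choice, not necessarily the thesis's implementation.

```python
import random

def beta_mean(a, b):
    # Exact mean of Beta(a, b).
    return a / (a + b)

def beta_median_approx(a, b):
    # Kerman's closed-form approximation, accurate for a, b > 1.
    return (a - 1/3) / (a + b - 2/3)

def optimistic_draw(a, b, centre):
    """Draw from Beta(a, b) keeping only values at or above `centre`:
    mean-optimism or median-optimism depending on the centre passed."""
    while True:
        x = random.betavariate(a, b)
        if x >= centre:
            return x
```

Passing `beta_mean(a, b)` or `beta_median_approx(a, b)` as the centre switches between the two forms of optimism being compared.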
We do reproduce the small benefit accrued to optimism seen in prior work, and explore it in more depth going forward.

3.3.2 Estimative Uncertainty

When sampling with the goal of maximizing the expected return, it is (generally) functional but not optimal to sample from the distribution denoted by the mean and the standard error of the model's estimate (for example, a LinTS [36] regression-based model or the similar Bayesian regression technique described in Eckles and Kaptein [55]). The standard deviation of the estimate can be decomposed into two components: estimative uncertainty, caused by sampling error, and underlying variance, which exists absolutely in the underlying distribution. The goal of Thompson sampling is to exploit the estimative uncertainty where it could plausibly lead to higher rewards in future iterates, but it is wasted effort to play suboptimal arms if the variance is provided solely by underlying variance.

We attempt to remove the underlying variance in two ways. The first, where arms are of known distribution with known standard deviation, simply subtracts the known standard deviation from the estimated standard error, giving an accurate estimate of our estimative uncertainty⁴; in the second, we empirically maintain an online estimate of the variance and subtract that from the standard error of the model to produce our sampling distribution. Not surprisingly, in non-optimistic (unbiased) sampling, the prescient technique outperforms the empirical technique; however, both outperform using the raw standard error by a large margin.
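The two corrections can be sketched as follows. The direct subtraction and zero floor follow the description above; Welford's online algorithm is our (assumed) choice for maintaining the empirical variance estimate.

```python
def corrected_sampling_sd(standard_error, underlying_sd):
    """Subtract the known (prescient) or online-estimated (empirical)
    underlying standard deviation from the model's standard error,
    floored at zero (a negative spread has no meaning), leaving the
    estimative uncertainty that should drive Thompson sampling."""
    return max(standard_error - underlying_sd, 0.0)

class OnlineVariance:
    """Welford's online algorithm, one instance per arm, for the
    empirical variant of the correction."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def sd(self):
        # Sample standard deviation; zero until two observations exist.
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0
```

In use, each observed reward updates the arm's `OnlineVariance`, and its `sd()` feeds `corrected_sampling_sd` before each draw.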
Table 3.1 presents results, averaged across all our representative normal-distribution worlds at each of 50, 100 and 1000 iterates with 2,000 replicates each, demonstrating the benefit in the non-optimistic sampling case.

The gains did not maintain their magnitude when (symmetric) optimism was invoked, as optimism accrues a fairly large benefit itself, but does not appear to accrue much additional benefit from the uncertainty correction as later iterates' variance goes to zero. This may suggest the optimism result stems from systematic early underexploration in the Thompson sampling strategy. Further research is necessary to explain how optimism and the uncertainty correction interact to produce similar-scale results.

A further set of tests was conducted scaling the variance of the sampling process in arbitrary ways (squaring, dividing by constants) with no lucrative benefits. The unscaled variance always outperformed on average, with or without estimative uncertainty adjustments.

⁴ Note, we must floor this value at zero, as a negative variance does not have meaning, but could arise in this model due to the statistical properties of the distribution.

Table 3.1: Results from eliminating estimative uncertainty in the unbiased sampling case. *** denotes p < 0.01 (against the pairwise null hypothesis of no effect of changing from raw standard error).

                                     Average R̄E (SD)   Improvement
  Raw Standard Error                 534 (23.47)        –
  Prescient Uncertainty Correction   371 (31.69)        43.9% ***
  Empirical Uncertainty Correction   419 (27.92)        27.4% ***

Table 3.2: Results from eliminating estimative uncertainty in the optimistic sampling case.

                                     Average R̄E (SD)   Improvement
  Raw Standard Error                 404 (27.23)        –
  Prescient Uncertainty Correction   377 (31.99)        7.1%
  Empirical Uncertainty Correction   395 (27.49)        2.3%

3.3.3 Sampling Uncertainty

We can change the expected variance of the (optimistic or non-optimistic) sampler by sampling more than once.
The Thompson [144] strategy of drawing once from each positively biased distribution is distinct from, say, drawing twice and using the mean of the samples. Here we experiment with the variance of our sampler and its interaction with optimism. An initial intuition is to simply compare optimistic sampling under the single-sample case to a number of k-sample cases (say, k = 2, 5, 10) under different strategies for producing an action sample value. We experiment with a number of replication strategies:

1. Sampled mean. In this strategy, the k samples are averaged to produce our action value. This reduces the variance of the sampling strategy in both cases: around the mean in the non-optimistic case and around the biased mean in the optimistic case.

2. Most optimistic. In this strategy, the most optimistic element of the k samples is taken. In the optimistic case, this serves to create a more optimistic sampler than pure 1-sample optimism. In the non-optimistic case, this creates an optimistic sampler without the hard constraint at the measure of centrality.

3. Least optimistic. In this strategy, the least optimistic element of the k samples is taken. In the optimistic case, this serves to dampen optimism.

4. Least deviation. In this strategy, the sample which is closest to the measure of centrality is taken. This is equivalent to least optimistic in the optimistic case, but in the non-optimistic case it is simply a conservative sampler.

Our results are presented in Table 3.3. We see convincing evidence in favor of optimism, even excessive optimism, which shows both the least regret and the lowest standard deviation of that regret. When k = 1, we should be indifferent between replication strategies, as they will all simply return their first sample.
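The four replication strategies can be sketched against a normal surrogate as follows (a hedged illustration; `optimistic=True` truncates each draw at the mean, as in the optimistic sampler above).

```python
import random

def replicated_sample(mu, sigma, k, strategy, optimistic=False):
    """Draw k samples from the arm's surrogate N(mu, sigma^2) and
    combine them per the replication strategy; with optimistic=True
    each draw is rejection-sampled to lie at or above the mean."""
    def draw():
        while True:
            x = random.gauss(mu, sigma)
            if not optimistic or x >= mu:
                return x
    samples = [draw() for _ in range(k)]
    if strategy == "sampled_mean":
        return sum(samples) / k
    if strategy == "most_optimistic":
        return max(samples)
    if strategy == "least_optimistic":
        return min(samples)
    if strategy == "least_deviation":
        return min(samples, key=lambda x: abs(x - mu))
    raise ValueError(strategy)
```

The combined value stands in for the single Thompson draw when ranking arms, so the choice of strategy directly reshapes the effective sampling distribution.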
One can think of the Optimistic–Most Optimistic sampler as encouraging excessive optimism by sampling optimistically, then taking the most optimistic of those samples. We get a distribution that is, on average, further in the tail of the distribution (proportional to k). Similarly, one can think of the Unbiased–Most Optimistic sampler as a simulation implementation of optimistic sampling, and would expect that as k → ∞ that sampler will perform more like the Optimistic–Most Optimistic sampler (with more computation cost). It is interesting to note that setting k (finitely) higher appears to improve the performance of the Most Optimistic replication strategy asymptotically, suggesting that the increase in relative tail probabilities (across arms) continues to improve the result. The Sampled Mean and Least Deviation strategies act to reduce optimism (or possible pessimism, in the case of the unbiased sampler) by reverting samples towards the mean.

Table 3.3: Results of a selection of replication strategies.

  #   Sampling Policy  Replication Strategy  k   Average R̄E  SD        Σt xSt,t  SD
  1   Optimistic       Least Deviation       1   1.787940    2.812723  28.25736  6.919474
  2   Optimistic       Least Deviation       2   2.038760    3.147973  27.99885  7.006590
  3   Optimistic       Least Deviation       3   2.131280    3.288711  27.87352  7.077301
  4   Optimistic       Least Deviation       4   2.239480    3.419299  27.80768  7.107457
  5   Optimistic       Least Deviation       5   2.299420    3.503120  27.64148  7.135077
  6   Optimistic       Least Deviation       10  2.433740    3.657933  27.56664  7.260929
  7   Optimistic       Sampled Mean          1   1.794856    2.814387  28.22871  6.892612
  8   Optimistic       Sampled Mean          2   1.797440    2.831595  28.14065  6.899314
  9   Optimistic       Sampled Mean          3   1.774540    2.815696  28.20008  6.900778
  10  Optimistic       Sampled Mean          4   1.742480    2.808865  28.29840  6.892055
  11  Optimistic       Sampled Mean          5   1.782160    2.845401  28.22016  6.888765
  12  Optimistic       Sampled Mean          10  1.772120    2.842582  28.21684  6.874437
  13  Optimistic       Most Optimistic       1   1.782860    2.795673  28.23848  6.932394
  14  Optimistic       Most Optimistic       2   1.627420    2.594976  28.39532  6.841655
  15  Optimistic       Most Optimistic       3   1.539440    2.467145  28.42102  6.798019
  16  Optimistic       Most Optimistic       4   1.488580    2.410582  28.52400  6.769490
  17  Optimistic       Most Optimistic       5   1.454160    2.358807  28.52707  6.768880
  18  Optimistic       Most Optimistic       10  1.373060    2.249344  28.61777  6.743186
  19  Optimistic       Least Optimistic      1   1.801640    2.820853  28.19821  6.926717
  20  Optimistic       Least Optimistic      2   2.017980    3.129366  27.98708  6.994345
  21  Optimistic       Least Optimistic      3   2.133160    3.293516  27.91489  7.065552
  22  Optimistic       Least Optimistic      4   2.237860    3.415851  27.76209  7.112057
  23  Optimistic       Least Optimistic      5   2.303500    3.496363  27.65757  7.141841
  24  Optimistic       Least Optimistic      10  2.389940    3.633834  27.60138  7.230841
  25  Unbiased         Least Deviation       1   2.414180    3.317339  27.52917  7.156890
  26  Unbiased         Least Deviation       2   2.433260    3.483726  27.51896  7.196959
  27  Unbiased         Least Deviation       3   2.457000    3.575881  27.59020  7.233475
  28  Unbiased         Least Deviation       4   2.488600    3.635268  27.54201  7.226739
  29  Unbiased         Least Deviation       5   2.483780    3.664905  27.52593  7.264307
  30  Unbiased         Least Deviation       10  2.531200    3.758365  27.47851  7.351306
  31  Unbiased         Sampled Mean          1   2.419500    3.311818  27.62926  7.188822
  32  Unbiased         Sampled Mean          2   2.439500    3.445440  27.52135  7.195143
  33  Unbiased         Sampled Mean          3   2.428440    3.488071  27.59352  7.189119
  34  Unbiased         Sampled Mean          4   2.461000    3.550799  27.56638  7.213879
  35  Unbiased         Sampled Mean          5   2.444140    3.556658  27.54016  7.220508
  36  Unbiased         Sampled Mean          10  2.489640    3.653163  27.49565  7.206094
  37  Unbiased         Most Optimistic       1   2.408760    3.310960  27.61688  7.143456
  38  Unbiased         Most Optimistic       2   1.948180    2.920867  28.07839  6.966973
  39  Unbiased         Most Optimistic       3   1.797240    2.756331  28.16101  6.846418
  40  Unbiased         Most Optimistic       4   1.675020    2.627932  28.30523  6.862079
  41  Unbiased         Most Optimistic       5   1.596760    2.556773  28.41517  6.812572
  42  Unbiased         Most Optimistic       10  1.482580    2.393168  28.49253  6.744063
  43  Unbiased         Least Optimistic      1   2.429660    3.322913  27.54089  7.149666
  44  Unbiased         Least Optimistic      2   2.993740    3.845496  26.99108  7.440987
  45  Unbiased         Least Optimistic      3   3.332900    4.118677  26.63976  7.647496
  46  Unbiased         Least Optimistic      4   3.592980    4.287377  26.41372  7.746072
  47  Unbiased         Least Optimistic      5   3.745220    4.392518  26.23426  7.864180
  48  Unbiased         Least Optimistic      10  4.173680    4.586818  25.83976  8.000412

3.4 Non-Stationary Time Series Techniques

Change-point analysis, also known as change detection or structural breakpoint modelling, is a well-studied problem in the applied stochastic process literature. Intuitively, a change-point is a sudden, discrete or "drastic" (non-continuous) change in the shape of the underlying distribution. In an offline fashion, change-points may be detected efficiently, and with an adequate set of tuneable parameters, by clustering algorithms. For bandits, the problem is necessarily online, and offline algorithms for change-point detection are not feasible. The basic idea of online change-point bandits is to use a mechanism to detect change-points, generally parameterized for some acceptable false alarm rate, and then to use some mechanism to "forget" learned information after each change-point as necessary.

Hartland, Gelly, Baskiotis, Teytaud, and Sebag [73] propose an algorithm called Adapt-EvE based on the UCB-Tuned algorithm [15]. Adapt-EvE uses the frequentist Page-Hinkley test to identify change-points. Upon detection of a change-point, Adapt-EvE treats the problem as a meta-bandit problem. That is, a second layer of bandit optimization is instituted with two arms: (1) continue using the learned data or (2) restart the UCB-Tuned algorithm from scratch. This meta-bandit forms a hierarchical strategy that can be expected to efficiently evaluate the cost in regret of each detected change. This technique was the winning technique in the PASCAL Exploration vs.
Exploitation challenge in 2006 [79], demonstrating its ability to handle both drifting and change-point-type bandits.

Kocsis and Szepesvári [95] present a variant of UCB-Tuned called DiscountedUCB, which applies a continuous discount factor to the estimates in time. Garivier and Moulines [63] introduce Sliding Window UCB (SW-UCB), parameterized by a window length, and show it performs similarly to DiscountedUCB contingent on appropriately selected parameterizations.

Mellor and Shapiro [108] present an online Bayesian change-point detection process for switching (discrete-change) bandits with a constant switching rate – the frequency with which the distributions change – in the contexts where switching occurs globally or per-arm, and where switching rates are known or must be inferred. Their algorithm is probability-matching based but, as presented, does not support contextual variables. Further, their technique addresses a bandit with switching behavior, rather than the drifting behavior explored in this work.

3.4.1 A Short Review of Stochastic Drift

In time-series analysis, stochastic drift refers to two broad classes of non-stationarity in the population parameter being estimated: (1) cyclical or model-able drift that arises because of model misspecification, and (2) the random component. Often it is possible to detrend non-stationary data by fitting a model that includes time as a parameter. Where the function of time is well-formed and appropriate for statistical modelling, a trend-stationary model can be found with this detrending process. For some models, detrending is not sufficient to make a process stationary, but sometimes difference-stationary models can be fit, where the differences between values in time, Yt and Yt−n, can be represented as a well-formed function appropriate for statistical modelling.

Difference-stationary models are represented with autoregressive models.
The generalized representation of the simple autoregressive model is referred to as AR(n), where n is the number of time steps back upon which the current value maintains a dependency:

AR(n): Yt = α0 + α1·Yt−1 + α2·Yt−2 + · · · + αn·Yt−n + εt,   (3.2)

where εt is the error term with the normal characteristics of zero mean (E[εt] = 0), variance σ² and independence across times (E[εt·εs] = 0 for all t ≠ s) after fitting the autoregressive correlations. If these two detrending strategies are not sufficient to make a given process stationary, more complex filters such as a band-pass or Hodrick-Prescott filter may be applied.

Generalized Linear Bandits

Filippi, Cappé, Garivier, and Szepesvári [61] use generalized linear models (GLMs) for bandit analysis, extending the work of Dani et al. [51] and Rusmevichientong and Tsitsiklis [128] to utilize the UCB strategy of Auer, Cesa-Bianchi, and Fischer [15] and proving (high-probability) pseudo-regret bounds under certain assumptions about the link function and reward distributions. In some sense, our work extends the Filippi et al. result to an experimental analysis in the non-stationary case, as well as introducing a Thompson sampling-based strategy for integrating GLMs, rather than the UCB technique.

3.4.2 Overview of the Approach

The general technique we experiment with is to fit a regression model of varying form to the data and then to utilize the technique of optimistic Thompson sampling to predict arm payoffs in the next iteration of the algorithm. We explore and compare two primary models for handling non-stationarities within a regression framework: the autoregressive, time-detrended approach and the weighted least squares approach.

Autoregression and Detrending

Formally, we fit a model

Yt,i = αt + ARi(p) + Trendi(t) + At,i + εt,i,   (3.3)

where Trend(t) is a function representing the expected time trend, AR(p) is the autoregressive term of order p and Yt,i is the expected reward for arm i at time t.
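One row of the design matrix implied by Eq. (3.3) might be assembled as follows. This is a sketch: the flat layout, the trend callable and the per-arm interaction via dummies are our illustrative choices.

```python
def design_row(t, arm, n_arms, lagged_rewards, trend=lambda t: float(t)):
    """Build one design-matrix row: intercept, per-arm dummy variables,
    per-arm trend terms (the trend interacted with the arm dummies,
    since the change *between* arms is what matters for decision
    making) and the AR(p) lagged-reward terms."""
    dummies = [1.0 if i == arm else 0.0 for i in range(n_arms)]
    trends = [d * trend(t) for d in dummies]
    return [1.0] + dummies + trends + [float(r) for r in lagged_rewards]
```

Swapping the `trend` callable (e.g., a polynomial or sinusoid of t) changes the functional form of the detrending without touching the rest of the pipeline.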
In general, trends and AR terms must be considered on a per-arm basis (equivalently, as an interaction effect on the A matrix), as it is the change between arms that determines a meaningful non-stationarity for the sake of decision making. In practice, this model is generally fit as a model of Yt with binary ("dummy") variables At,i and relevant interaction terms indicating which arm is detected. In our experimental results, we explore how variations (especially overspecification of the functional form) in the "correctness" of the selection of Trend(t) affect the overall results. This model, fit with the ordinary least squares technique, the ridge regression technique [145] or the Bayesian conjugate prior technique, returns an estimated set of time-detrended, plausibly stationary⁵ coefficients β̂ and estimates of their standard errors ŜE(β̂). This model can be readily extended to contain any contextual variables, such as demographic information about the user (in the web optimization context) or grouping criteria on the arms, to improve the learning rate.

Combined, following standard experiment design terminology, we call the terms in our model (α, ARi(p), Trendi(t) and At,i) the design matrix and refer to it as X.

Penalized Weighted Least Squares

The weighted least squares (WLS) process introduces a multiplicative weighting of "reliability" for each observation, resulting in a technique which minimizes the reliability-adjusted squared errors. In the multi-armed bandit context with drifting arms (without any a priori knowledge of the functional form of the drift), the weights are set to the inverse of their recency, indicating that at each time step t, older data provides a less reliable estimate of the current state.

Intuitively, weighted least squares provides a simple, well-explored, highly tractable technique to discount the confidence of old data, increasing predictive uncertainty as time progresses.
This is a desirable quality within the context of restless bandits, as it appropriately accounts for the growing predictive uncertainty of old observations.

Formally, the weighted least squares procedure picks β̂, coefficients on a set of variables X, called the independent variables (or regressors), according to the equation β̂ = (XᵀWX)⁻¹(XᵀWy), where W is the matrix of weights and y is the rewards as observed (or, in general, the regressand). Standard errors of the coefficients are also computed, producing an estimate of the standard deviation of our estimators.

To apply the weighted least squares procedure, we extend LinTS (presented in the prior section), following the work of Pavlidis, Tasoulis, and Hand [119], which uses a standard linear regression to compute the estimates of each arm, and the work of the LinUCB algorithm [101], which applies a non-weighted penalized linear regression to compute estimates of the payoff for each arm. As we are a priori uncertain about the functional form of the non-stationarity in our bandit arms, we experiment with a variety of time-weighting techniques – logarithmic, with varying scale and base; linear, with varying polynomials; exponential, with varying coefficients; and sinusoidal – demonstrating the generality of this technique. In all cases we strictly decrease the weight of a sample as it becomes further in time from our current prediction time.
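The penalized WLS fit and its coefficient variances can be sketched directly from the closed form above. The ridge penalty λ and the simple residual variance estimate s² are our assumptions for the sketch; the linear recency weighting is one of the weighting functions described.

```python
import numpy as np

def wls_fit(X, y, weights, lam=0.0):
    """Penalized weighted least squares:
        beta_hat = (X'WX + lam*I)^-1 (X'Wy),
    with coefficient variances taken from the diagonal of
    s^2 (X'WX + lam*I)^-1. Setting the weights to the inverse of
    recency discounts old observations."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    W = np.diag(np.asarray(weights, dtype=float))
    A = X.T @ W @ X + lam * np.eye(X.shape[1])
    beta = np.linalg.solve(A, X.T @ W @ y)
    resid = y - X @ beta
    s2 = float(resid @ resid) / len(y)
    var_beta = s2 * np.diag(np.linalg.inv(A))
    return beta, var_beta

def linear_recency_weights(t):
    """Linear weighting: observation s in {0, ..., t-1} gets weight
    1 / (t - s) at prediction time t, so older samples count less."""
    return [1.0 / (t - s) for s in range(t)]
```

The returned means and variances are exactly what the optimistic Thompson sampler needs to form its per-arm normal surrogates.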
When additional information about the form of non-stationarity is available, the weights can be specified appropriately to reduce the predictive uncertainty.

⁵ As long as the detrending process successfully removed the non-stationarity.

3.4.3 Simulation Environment

To test our combined strategies and produce objective comparisons, we utilize the synthetic simulator described in Section 3.1 with a wide variety of "true worlds" (unobserved to the agent), varying arm distribution type and parameters, arm count, and drift type from a set of functional forms including random walk, exponential random walk, logarithmic, linear (of varying degree), exponential and periodic drift (sinusoidal over varying periods). Each form of drift is parameterized by a randomly drawn real number, constrained to be within the same order of magnitude as the arm payoffs in its simulation world, which determines the scale of the parameterization.

We present the combined algorithm, parameterized in degrees of autoregression, detrending and the functional form of our weighted least squares discounting process, in pseudocode in Figure 3.1. The first n iterations, where n is the number of model terms, must be performed using another method (uniformly at random, in our case) to provide enough degrees of freedom to fit the regression model.

3.4.4 Experimental Results

Specific details of the optimistic sampling procedure are shown in the next section, where we discuss techniques for sampling from an arbitrary distribution. In this case, as normality is assumed in the model, we can sample very easily.

In the results presented, we omit ε-greedy, UCB1, DiscountedUCB and others, as they were strictly outperformed by UCB-Tuned or SW-UCB for all parameter choices tested. Across all true worlds, we find in general that a detrending term congruent with the true drift form (e.g., a linear detrend in the linear drift quadrant of Figure 3.2)
outperforms all other strategies in the long run, producing a zero-regret strategy [152] for restless bandits where the functional form of restlessness is known. Similarly, we find that utilizing a weighting function which closely approximates the true drift performs well in most cases. Surprisingly, we find that linear detrending is an effective technique for handling the random walk, a result that is robust to variations in the step type and scale of the random walk. Unintuitively, WLS techniques also perform strongly even in the case when there is no drift.

In these experiments, we find no convincing evidence for a general application of detrending in polynomial degree greater than one or of autoregression of any level in our model. Both autoregression and higher-degree polynomials strictly reduce regret if the true world trend is autoregressive or determined, even partially, by the chosen form. We find the linear weighted least squares technique (weights set to the inverse of t) to be the most robust technique over all experiments, suggesting it is the strongest technique in the case of no a priori information on the form of drift: it has the lowest mean total regret (20.8), the lowest standard deviation across all drift types (11.8) and the lowest 75th (worst-) percentile regret (26.6).

Data:
    λ: the penalty/regularization factor for the ridge regression
    w(t): a function which defines the weighting strategy
    Ω: a function which determines whether we should play another round
Result: A discounted (in time) bandit algorithm proportional to the weighting strategy.

X ← y ← W ← [];
t ← 0;
while Ω do
    // Generate the weighting matrix according to our function w(t).
    W[t] ← w(t);
    // Get the model from the penalized weighted least squares (WLS) subroutine.
    β̂ ← (XᵀWX + λI)⁻¹(XᵀWy);
    s² ← (y − β̂X)²/n;
    // Compute the errors (estimated variance) necessary to perform the
    // sampling procedure.
    V̂ar(β̂) ← diag[s²(XᵀWX + λI)⁻¹];
    // End penalized WLS subroutine.
    r̂* ← −∞;
    St ← nil;
    for each arm i do
        // Draw from the estimated distribution for this arm.
        r̂test ← draw_optimistic(N(∑ᵢ(β̂ᵢ · Xᵢ,ₜ), ∑ᵢ(V̂ar(β̂ᵢ) · X²ᵢ,ₜ)));
        // Maintain a Thompson-like estimate of the best arm.
        if r̂test > r̂* then
            r̂* ← r̂test;
            St ← i;
        end
    end
    reward ← play(St);
    extend X, the design (history) matrix;
    append reward to the rewards history y;
    t ← t + 1;
end

Figure 3.1: Pseudocode of the combined algorithm.

[Figure: adjusted cumulative regret versus iteration over four panels (Linear Drift, Logarithmic, No Drift, Random Walk), comparing Linear Detrend, SWUCB, UCBTuned, WLS Exponential 0.8, WLS Exponential 2.0, WLS Linear and WLS Logarithmic.]

Figure 3.2: Adjusted average cumulative regret of selected algorithms over 1,000 replicates of all worlds and true drift forms.

3.5 Statistical Regret for Applications

All definitions of regret given in Chapter 2 are dependent on access to an oracle which ex-ante both knows the parameters of the underlying distributions (e.g., at a minimum, knows the best mean) and fully enumerates any non-stationarities or contextual factors. We propose an oracle-free regret interval which can be used in real world applications to quantify differences in bandit policies a posteriori.

Statistical regret is an important contribution in the case of real world computation. In applied bandits, no oracle is available⁶, but there is still a need to compare policies. If an oracle were available, a nontrivial bandit policy would be unnecessary, as the ideal arm would be fully known. In the event all arms' rewards are available after each play, the problem is better represented as general reinforcement learning than a multi-armed bandit.
A simpler technique, where samples are divided (equally, in a traditional experiment design) across multiple bandit algorithms, is available under certain distributional assumptions; however, in the case of non-stationarity in time and per-play regret reduction, this may not accurately capture the difference in policies.

⁶ It is conceivable that some non-simulation applications exist in which only regret is revealed after each round, in which case traditional regret measures are well-defined without an out-of-band oracle.

Our measure, presented in the next subsection, is derived from intuitions of the γ-confidence interval and the expected-expected and expected-payoff regrets described in the prior section, and provides a meaningful applied measure to compare algorithms under certain assumptions. We further suggest a plausible extension to statistical regret using the bootstrap [56] to compute confidence bounds, in an effort to tighten the bounds for a given confidence level and produce a distribution-free estimator.

3.5.1 Traditional Parametric Statistical Regret

To compute parametric statistical regret, begin by enumerating your set of predicted arm mean distributions from observations up to time t, X̃ᵢ. Then, assume a confidence interval exists of the form (L_{t,γ}(X̃ᵢ), U_{t,γ}(X̃ᵢ)) such that

    Pr(L_{t,γ}(X̃ᵢ) < X̃ᵢ < U_{t,γ}(X̃ᵢ)) = γ.

A confidence interval of the form above exists if and only if there were sufficient observations to accurately fit the assumed parametric distribution. From these values, we can compute the statistical best- and worst-case scenarios. Intuitively, for each play in our recorded history, we look at the best upper and best lower bound and accrue regret accordingly. Formally, the total stochastic statistical regret up to time t is an interval,

    R̃ᴾ_γ = [ ∑_{j=0}^{t} ( max_i L_{t,γ}(X̃ᵢ) − x_{S_j} ),  ∑_{j=0}^{t} ( max_i U_{t,γ}(X̃ᵢ) − x_{S_j} ) ].    (3.4)

This produces a regret interval with properties similar to expected-payoff regret; it is possible to accrue negative regret, and the resulting output is a random variable in both S_j and the X distributions' parameters θ. We can eliminate the stochasticity in the parameters θ by eliminating the actual received payoff from our consideration and instead using the empirical mean (computed from all data available at the time we compute, t) of the chosen arm in iteration j, X̄_{S_j}, with

    R̃ᴱ_γ = [ ∑_{j=0}^{t} ( max_i L_{t,γ}(X̃ᵢ) − X̄_{S_j} ),  ∑_{j=0}^{t} ( max_i U_{t,γ}(X̃ᵢ) − X̄_{S_j} ) ].    (3.5)

By utilizing the estimates at time t to evaluate the plays at times j = 0, ..., t, we utilize the best information available to judge actions chosen by the policy prior to that information being available; by utilizing the per-arm bounds we acknowledge that there is an uncertainty (for many bandit policies, a large uncertainty) in the knowledge we have. When considering statistical regret, it is important to note that, unlike in a traditional experiment, in a bandit experiment a tighter bound on statistical regret (somewhat equivalently, a tighter bound on the confidence intervals) is not the objective to be optimized.

We experiment extensively on varying world formats, horizon lengths and confidence bound widths and find that stochastic statistical regret accurately tracks the true expected-payoff regret with sufficiently tight bounds to enforce the γ confidence level. As γ → 1.0, the required sample size per arm tends to ∞.

Statistical regret combines properties of the traditional experiment with the multi-armed bandit to produce a powerful real world diagnostic and evaluation tool. Post-hoc analyses in an oracle-free, multi-armed bandit context are made possible by such a tool.
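The interval in Equation (3.5) can be computed directly from per-arm interval estimates. The sketch below assumes normal-approximation confidence bounds; the function name and data are illustrative, not from the thesis code:

```python
import math

def statistical_regret_interval(histories, plays, z=1.96):
    # Per-arm empirical mean and normal-approximation confidence bounds,
    # computed from all data available at time t (the end of the run).
    stats = {}
    for arm, xs in histories.items():
        n = len(xs)
        mean = sum(xs) / n
        var = sum((x - mean) ** 2 for x in xs) / max(n - 1, 1)
        half = z * math.sqrt(var / n)
        stats[arm] = (mean, mean - half, mean + half)
    best_lower = max(lo for _, lo, _ in stats.values())
    best_upper = max(hi for _, _, hi in stats.values())
    # Accrue regret for each recorded play against the best lower and
    # best upper bound, using the played arm's empirical mean.
    lo = sum(best_lower - stats[arm][0] for arm in plays)
    hi = sum(best_upper - stats[arm][0] for arm in plays)
    return lo, hi

histories = {"A": [1.0, 0.8, 1.2, 0.9], "B": [0.1, 0.3, 0.2, 0.2]}
plays = ["A", "B", "A", "A"]
lo, hi = statistical_regret_interval(histories, plays)
```

As in the definition, the lower end of the interval can be negative, and the width of the interval grows with the chosen confidence level.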
In particular, we conjecture that statistical regret will be a useful diagnostic for aiding in the applied calibration of common bandit policy parameters such as discount rate or egalitarianism.

3.6 Simple Efficient Sampling

When performing Thompson sampling, we always need a method to actually produce a sample from a distribution. In this section, we deal with how to perform this sampling in a way that is efficient in two dimensions: computationally efficient, as in not taking more computational instructions (or time) than necessary, and statistically accurate, as in accurately capturing the intended sampling distribution.

There are two interacting rationales for this work observed in the literature. A number of papers produce work which samples from an optimistic surrogate distribution in an iterated Monte Carlo-like strategy, sampling from the initial (non-optimistic) distribution until the intended results are found, or sampling from a black box to produce an estimate of the underlying distribution. This strategy is computationally inefficient, and we show that it can be done in a strictly more efficient way for most distributions. The other reason is statistical inaccuracy – optimism in (empirical) sampling research is often left undefined or as an exercise to the implementer. This has resulted in work which produces definitions of optimism with likely unintended behaviors, such as placing half of the samples at exactly the mean.

3.6.1 Simple Efficient Symmetric Sampling (SESS)

For symmetric distributions, finding a technique for sampling from the optimistic distribution efficiently (both in the sense of computational efficiency and statistical accuracy) is simple.
The process is: (1) center the distribution at zero, (2) get a sample, s, from the centered distribution, (3) add the measure of centrality (e.g., mean or median) of the uncentered distribution to |s|. For example, in the normal distribution, parameterized on µ and σ², we can draw each sample according to

    sᴬᵢ = µᵢ + |N(0, σ²ᵢ)|.    (3.6)

3.6.2 Efficient Non-Symmetric Sampling (ENSS)

For a general distribution Pᵢ with expectation µᵢ (including non-symmetric distributions), at least one paper presents a statistically inefficient technique (in the sense of not completely achieving the optimistic sampling process we intended) which produces a sample

    sᴮᵢ = max{µᵢ, Pᵢ(·)}.    (3.7)

As an example of the inefficiency of this technique, we take the case of a symmetric distribution, in which this produces a sample that is 50% biased to the mean (every P(·) < µ is selected at exactly µ). To perform (both statistically and computationally) efficient sampling in the non-symmetric case, we require a quantile function Q(p) = inf{x ∈ ℝ : p ≤ F(x)}, where F is the cumulative distribution function. When this is available, we can sample from the optimistic distribution in three steps.

1. Find the quantile, v, of the measure used to bound optimism (for example, for median-optimism, set v = 0.5).
2. Draw a random value from a uniform distribution, t ∼ UNIF(v, 1), indicating the quantile of the desired value.
3. Find the value in our original distribution at that point, sᶜᵢ = Q(t); this is our sample.

For the most common distributions used in multi-armed bandits, the required quantile and cumulative distribution functions are readily available.
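Under these definitions, both samplers are only a few lines each. The sketch below is illustrative: it uses the Python standard library's normal quantile function and guards the uniform draw away from 1, where quantile functions are undefined:

```python
import random
from statistics import NormalDist

def sess_normal(mu, sigma):
    # SESS for the normal case: fold a centered draw onto the optimistic
    # side, i.e. sample mu + |N(0, sigma^2)|, as in Equation (3.6).
    return mu + abs(random.gauss(0.0, sigma))

def enss(quantile_fn, v):
    # ENSS: draw t uniformly from [v, 1) and evaluate the quantile
    # function there; v = 0.5 gives median-optimism.
    t = v + (1.0 - v) * random.random()
    return quantile_fn(t)

# Median-optimistic draws from N(2, 1) via the standard library's
# normal quantile function (inv_cdf).
q = NormalDist(mu=2.0, sigma=1.0).inv_cdf
draws = [enss(q, 0.5) for _ in range(1000)]
```

Every draw lands at or above the median, with no rejection loop and no mass placed exactly at the cutoff.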
For example, the normal, Student, beta, and gamma distributions' quantile functions are explored in Steinbrecher and Shaw [138] and are available to programmers in the R programming language as qnorm, qt, qbeta and qgamma respectively.

3.6.3 A Short Background in Nonparametric Sampling

Bootstrap Thompson Sampling

Eckles and Kaptein [55] present an implementation of Bootstrap Thompson Sampling (BTS) which produces a scalable, robust approach to Thompson sampling. We show in this section that their implementation can be modified to support an arbitrarily scaled form of optimism and a completely nonparametric approach at minimal cost, at least for some subset of the problem space. We then show that the sampling strategies shown in the Sampling Uncertainty section can be applied here to avoid a number of poor edge-case performances and essentially control the risk of such a policy. Finally, we expand the implementation to a categorical-contextual policy which subsets the sampling space into categories to implement a refinement of the model.

The Eckles and Kaptein (2014) Model

The initial implementation of BTS involves selecting a parameter J which simultaneously controls the computational complexity and relative greediness of the model. Upon receipt of each reward rᵢ, BTS trains J parametric models on the assumed underlying distribution by considering the reward in each model with probability one-half. To select the next arm to play, BTS, on a per-arm basis, chooses one of the J replicates and uses the expected value of that model to predict an empirical mean payoff. In the Thompson sampling style, the highest of those estimated payoffs is selected as the action for this iterate. As the empirical mean is deterministic across the replicates⁷, if J is too small, the decision becomes fundamentally greedy, choosing to only play the best empirical arm prematurely.
This technique works well, showing performance competitive with the traditional model and even exceeding it in the case of heteroskedastic errors.

3.6.4 A Simple Efficient Nonparametric Sampler

Our goal is to produce a nonparametric sampler while simultaneously maintaining the benefit we have seen thus far with respect to optimism in the face of uncertainty. To do so, we first propose a simple non-optimistic sampler and then augment that sampler by adding optimism, producing the simple optimistic sampler we call SOS. In the section that follows, we will present a simple discretization observation that allows this model to be extended easily to a relatively small number of categorical contextual variables.

⁷ Computing the expectation, rather than sampling from the subdistributions, gives us an efficient sampler in the bootstrap sampling distribution θ, but does not maintain the "uncertainty awareness" property in the individual distributions. A sufficiently large J returns the uncertainty awareness property of Thompson sampling.

Simple Sampler

In the simple sampler, we maintain a history of observed rewards per arm, x̃ᵢ,ₜ, and sample one value (r̂test ∈ ℝ) uniformly from the history for each arm i at each iterate t, as presented in Figure 3.3. This r̂test value represents a single payoff from the history of the arm; of these samples, the maximum value seen (from the K values, each representing a single arm) determines the arm to be played in the next iterate, with ties broken randomly. As the distribution of our observed history limits to the true population distribution, this technique captures the true underlying distribution of the arms rapidly, at the cost of some low sample size misbehavior. In experimentation, if the assumed parametric distribution is correct, this model performs similarly to the standard unbiased Thompson sampler fitted with either an assumed beta-binomial (true binomial) or the least squares procedure (true normal).
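The per-round selection of the simple sampler described above (and presented in Figure 3.3) can be sketched as follows; the arm names and histories are illustrative:

```python
import random

def simple_sampler_choose(histories):
    # One round: draw one observed reward uniformly from each arm's
    # history; the largest draw wins, with ties broken at random.
    best, tied = float("-inf"), []
    for arm, xs in histories.items():
        r = random.choice(xs)
        if r > best:
            best, tied = r, [arm]
        elif r == best:
            tied.append(arm)
    return random.choice(tied)

# Illustrative histories: every reward seen on A beats every reward on
# B, so A is always selected in this round.
histories = {"A": [0.9, 1.1, 1.0], "B": [0.1, 0.2, 0.15]}
arm = simple_sampler_choose(histories)
```

The played arm's observed reward is then appended to its history, exactly as in the pseudocode.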
When modelling error is introduced to simulate real world misspecifications, we find that this model greatly outperforms the standard Thompson sampler, even early in the process. Further, if the early sample size behavior is deemed high risk in the application, the replication strategies described earlier can be applied to change the nature of the sampled distribution.

This technique draws from the same assumption that Thompson sampling itself draws from: the concept that sampling a single time from each distribution is enough to approximate the probability-of-optimality distribution. As such, it is expected to have similar considerations. We consider first optimism, and then the replication strategy questions answered in the previous section for traditional Thompson sampling.

Introducing Optimism to the Simple Sampler

Introducing optimism in the simple sampler requires maintaining a payoff-ordered list of historical rewards. In order to produce a sampler that is both computationally reasonable and robust to the binomial distribution (or any similar distribution where one distribution may be a clear winner (or tied) in the top half but not in expectation), we propose three additional optional parameters: a maximum sample size (B = 100); the optimism cutoff (γ = 0.5), indicating the percentile (fraction of the observed results) that should be discarded to get our optimistic surrogate distribution; and a minimum sample size (κ = 20). Despite this larger parameterization, we show that even eliminating these parameters (that is, setting B = ∞, κ = 0 and using the traditional mean cutoff, γ = F(µ), where F is the empirical CDF) we still have a sampler that performs well in all but the most degenerate distributions, even in the large sample binomial case.

From the sorted list of historical payoffs, we sample uniformly from the top n = max(κ, N − γN) observations. Upon receiving a reward, we append it to our list of rewards. If the total length of
the list of rewards, N, is more than our maximum size B, we uniformly randomly select one to drop from the sample. This can be trivially extended to the J-replicate system of the original Eckles and Kaptein (2014) sampler, but we find no benefit in doing so when B is sufficiently large.

Data:
    Sᵢ: the result of playing n > 1 purely random seed rounds for each arm i
    Ω: a function which determines whether we should play another round
Result: A simple distribution-free bandit sampling strategy.

initialize ∀i: x̃ᵢ ← Sᵢ;
while Ω do
    r̂* ← −∞;
    St ← [];
    // Draw once from each arm history and play the best draw, St.
    for each arm i do
        r̂test ← draw_uniformly(x̃ᵢ);
        if r̂test > r̂* then
            r̂* ← r̂test;
            St ← [i];
        else if r̂test == r̂* then
            St.append(i);
        end
    end
    // Break ties randomly and play the selected arm.
    S*t ← draw_uniformly(St);
    reward ← play(S*t);
    x̃_{S*t}.append(reward);
end

Figure 3.3: The Simple Nonparametric Sampler

This technique has a major advantage in that it is distribution-free. The standard optimistic/non-optimistic sampler performs very poorly in the case the model is misspecified. As an example, we have computed comparisons for SOS, the Simple Sampler and the traditional parametric model computed with the normal distribution sampler and computed with a beta-binomial sampler as described in Chapelle and Li [42]. As expected, in the event the rewards are congruent with the selected sampler, the performance is not much different (although the implementation is arguably simpler under SOS), but in the event the rewards are incongruent with the selected sampler (that is, a beta-binomial to approximate a normal reward, or a normal to predict a beta-binomial), we find that the SOS and Simple Sampler performance is significantly improved.
This is essential for applications where the true payoff distribution may not be generated from a known distribution.

Data:
    Sᵢ: the result of playing n > 1 purely random seed rounds for each arm i
    Ω: a function which determines whether we should play another round
    B: the maximum size of our sample history (positive integer or infinity)
    γ: the degree of optimism in [0, 1); γ = 0 is non-optimistic, γ = 0.5 provides an approximation to median-optimism
    κ: the minimum size of our sample, below which we retain the whole sample
Result: A simple distribution-free optimistic bandit sampling strategy.

initialize ∀i: x̃ᵢ ← sort(Sᵢ);
initialize ∀i: Nᵢ ← count(Sᵢ);
while Ω do
    r̂* ← −∞;
    St ← [];
    for each arm i do
        // Sample from a γ-optimistic surrogate distribution that is at
        // least κ in size (if κ samples exist).
        if (Nᵢ − γNᵢ) < κ then
            x̃ᵢ^optimistic ← subset(x̃ᵢ, Nᵢ − κ, Nᵢ);
        else
            x̃ᵢ^optimistic ← subset(x̃ᵢ, γNᵢ, Nᵢ);
        end
        r̂test ← draw_uniformly(x̃ᵢ^optimistic);
        if r̂test > r̂* then
            r̂* ← r̂test;
            St ← [i];
        else if r̂test == r̂* then
            St.append(i);
        end
    end
    S*t ← draw_uniformly(St);
    reward ← play(S*t);
    x̃_{S*t}.insert_sorted(reward);
    N_{S*t} ← N_{S*t} + 1;
    if N_{S*t} > B then
        // Randomly remove a reward above B from the samples.
        x̃_{S*t} ← delete(draw_uniformly(x̃_{S*t}));
        N_{S*t} ← N_{S*t} − 1;
    end
end

Figure 3.4: The Fully Parameterized Simple Optimistic Sampler (SOS)

Table 3.4: The robustness and performance of the distribution-free sampler compared to the traditional parametric sampler.

    Sampling Policy            | Distribution | Average R̄ᴱ
    Traditional Unbiased       | Incorrect    | 0.500
    Traditional Unbiased       | Correct      | 0.391
    Traditional Optimistic     | Incorrect    | 0.498
    Traditional Optimistic     | Correct      | 0.341
    Simple Sampler             | –            | 0.368
    Simple Optimistic Sampler  | –            | 0.323

We see in Table 3.4 that the simple samplers produce results comparable to the traditional parametric sampler in the case of correct specification, without any knowledge of the underlying distribution, producing a performant real-world sampling technique for arbitrary applications.
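The optimistic truncation at the heart of Figure 3.4 can be sketched as follows. This is an illustrative reduction only (it omits the insertion and B-capped deletion bookkeeping), not the thesis implementation:

```python
import random

def sos_choose(sorted_histories, gamma=0.5, kappa=20):
    # Sample uniformly from the top (1 - gamma) fraction of each arm's
    # ascending-sorted history, keeping at least kappa observations
    # (or the whole history when it is smaller than kappa).
    best, tied = float("-inf"), []
    for arm, xs in sorted_histories.items():
        n = len(xs)
        top = max(n - int(gamma * n), min(kappa, n))
        r = random.choice(xs[n - top:])
        if r > best:
            best, tied = r, [arm]
        elif r == best:
            tied.append(arm)
    return random.choice(tied)

# Illustrative sorted histories; with gamma = 0.5 and kappa = 2, each
# arm's surrogate is its top two rewards, so A always wins here.
histories = {"A": [0.2, 0.4, 0.9, 1.1], "B": [0.1, 0.2, 0.3, 0.35]}
arm = sos_choose(histories, gamma=0.5, kappa=2)
```

Keeping the history sorted makes the surrogate a contiguous slice, so the truncation costs nothing beyond the insertion-sort of each new reward.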
Further, we see that the traditional technique can perform very poorly (approximately 30% worse) under misspecification, providing more evidence for using the robust sampling technique.

Experiments in Replication Strategies with SOS

We reproduce the multiple-sampling replication experiments from Section 3.3 here in the context of SOS. This is significant, as the small sample performance of unreplicated SOS is not obviously performant, and there are plausible concerns about managing the distribution around optimism in certain degenerate cases. We find that neither of these concerns is a problem in expected regret or cumulative reward. For comparison, we provide the results from the traditional and traditional optimistic Thompson sampler in both matching (correct) and non-matching (incorrect) parametric models. We present the results of the replication strategy experiments in Table 3.5.

Using Categorical Contextual Variables in SOS

In many contexts, contextual variables are fundamental to the application of bandit methods. Historically, methods for computing contextual bandits have been computationally expensive. The independence of the sampling process in SOS allows us to produce a contextual bandit method for cases of a small number of independent categorical variables by subsetting the data to those records which belong to the appropriate subset. This method has the advantage of being easily and efficiently implemented within a relational database context, but suffers from the curse of dimensionality; there are two distinct considerations in handling that concern: (1) for a small number of classes (even misspecified classes), it performs well while remaining easily scalable, and (2) we can introduce a new parameter κc which sets a minimum size within a class. If the minimum size is not reached, a random class is dropped from the subset until the size is met.
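The subset-and-collapse rule described in point (2) can be sketched as follows; the record format, field names and the κc parameter spelling are illustrative assumptions:

```python
import random

def contextual_subset(records, context, kappa_c=20):
    # Keep only (context, reward) records matching the current context;
    # if fewer than kappa_c remain, drop one context category at random
    # and retry, until enough records match or no categories remain.
    ctx = dict(context)
    while True:
        matched = [r for c, r in records
                   if all(c.get(k) == v for k, v in ctx.items())]
        if len(matched) >= kappa_c or not ctx:
            return matched
        ctx.pop(random.choice(list(ctx)))

records = [({"device": "mobile"}, 0.9),
           ({"device": "desktop"}, 0.2),
           ({"device": "mobile"}, 1.1)]
subset = contextual_subset(records, {"device": "mobile"}, kappa_c=2)
```

The matching step is a simple equality filter, which is what makes the method natural to express as a relational database query.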
Our fully parameterized algorithm for categorical contextual bandits with SOS is detailed in Figure 3.5.

Table 3.5: Replication strategy experiments for SOS and Simple Sampler

    #   | Sampling Policy | Replication Strategy | k  | Average R̄ᴱ | SD       | ∑ₜ x_{Sₜ,t} | SD
    49  | Correct OTS     | None                 | -  | 0.341200   | 0.474160 | 2.683923    | 2.051690
    50  | Incorrect OTS   | None                 | -  | 0.492600   | 0.499995 | 2.512447    | 2.060076
    51  | SOS             | None                 | -  | 0.326600   | 0.469016 | 2.650144    | 2.042715
    52  | SOS             | Least Deviation      | 2  | 0.337000   | 0.472732 | 2.678071    | 2.049964
    53  | SOS             | Least Deviation      | 3  | 0.338000   | 0.473076 | 2.629016    | 2.069614
    54  | SOS             | Least Deviation      | 4  | 0.347000   | 0.476063 | 2.662008    | 2.058679
    55  | SOS             | Least Deviation      | 5  | 0.328800   | 0.469824 | 2.662791    | 2.034599
    56  | SOS             | Least Deviation      | 10 | 0.331200   | 0.470692 | 2.636481    | 2.027811
    57  | SOS             | Least Deviation      | 30 | 0.331000   | 0.470620 | 2.667166    | 2.028424
    58  | SOS             | Sampled Mean         | 2  | 0.330600   | 0.470476 | 2.711309    | 2.067525
    59  | SOS             | Sampled Mean         | 3  | 0.336600   | 0.472594 | 2.699488    | 2.047436
    60  | SOS             | Sampled Mean         | 4  | 0.324600   | 0.468272 | 2.656813    | 2.039116
    61  | SOS             | Sampled Mean         | 5  | 0.335800   | 0.472317 | 2.702142    | 2.101192
    62  | SOS             | Sampled Mean         | 10 | 0.332200   | 0.471049 | 2.711977    | 2.042467
    63  | SOS             | Sampled Mean         | 30 | 0.330600   | 0.470476 | 2.685058    | 2.043417
    64  | SOS             | Most Optimistic      | 2  | 0.341800   | 0.474360 | 2.653896    | 2.089214
    65  | SOS             | Most Optimistic      | 3  | 0.331200   | 0.470692 | 2.676421    | 2.095322
    66  | SOS             | Most Optimistic      | 4  | 0.327800   | 0.469458 | 2.646591    | 2.080160
    67  | SOS             | Most Optimistic      | 5  | 0.330200   | 0.470332 | 2.698665    | 2.042973
    68  | SOS             | Most Optimistic      | 10 | 0.336400   | 0.472525 | 2.695991    | 2.025342
    69  | SOS             | Most Optimistic      | 30 | 0.336800   | 0.472663 | 2.715561    | 2.063079
    70  | SOS             | Least Optimistic     | 2  | 0.343400   | 0.474891 | 2.617706    | 2.048963
    71  | SOS             | Least Optimistic     | 3  | 0.332800   | 0.471263 | 2.722110    | 2.055616
    72  | SOS             | Least Optimistic     | 4  | 0.338800   | 0.473349 | 2.665208    | 2.036815
    73  | SOS             | Least Optimistic     | 5  | 0.337800   | 0.473007 | 2.672261    | 2.055658
    74  | SOS             | Least Optimistic     | 10 | 0.347800   | 0.476320 | 2.644284    | 2.037160
    75  | SOS             | Least Optimistic     | 30 | 0.327800   | 0.469458 | 2.699044    | 2.043100
    76  | Correct TS      | None                 | -  | 0.391200   | 0.488068 | 2.640098    | 2.100612
    77  | Incorrect TS    | None                 | -  | 0.496200   | 0.500036 | 2.497952    | 2.074799
    78  | Simple Sampler  | None                 | -  | 0.368600   | 0.482473 | 2.643983    | 2.046037
    79  | Simple Sampler  | Least Deviation      | 2  | 0.362800   | 0.480856 | 2.644576    | 2.042220
    80  | Simple Sampler  | Least Deviation      | 3  | 0.358000   | 0.479460 | 2.622352    | 2.047739
    81  | Simple Sampler  | Least Deviation      | 4  | 0.361600   | 0.480512 | 2.675990    | 2.048984
    82  | Simple Sampler  | Least Deviation      | 5  | 0.362000   | 0.480627 | 2.634314    | 2.061458
    83  | Simple Sampler  | Least Deviation      | 10 | 0.365600   | 0.481646 | 2.601818    | 2.054081
    84  | Simple Sampler  | Least Deviation      | 30 | 0.367200   | 0.482090 | 2.620877    | 2.029750
    85  | Simple Sampler  | Sampled Mean         | 2  | 0.345800   | 0.475676 | 2.696116    | 2.055796
    86  | Simple Sampler  | Sampled Mean         | 3  | 0.327400   | 0.469312 | 2.675551    | 2.019245
    87  | Simple Sampler  | Sampled Mean         | 4  | 0.334800   | 0.471968 | 2.647391    | 2.050078
    88  | Simple Sampler  | Sampled Mean         | 5  | 0.324800   | 0.468347 | 2.671250    | 2.076719
    89  | Simple Sampler  | Sampled Mean         | 10 | 0.320600   | 0.466754 | 2.680171    | 2.044166
    90  | Simple Sampler  | Sampled Mean         | 30 | 0.314200   | 0.464243 | 2.663312    | 2.066480
    91  | Simple Sampler  | Most Optimistic      | 2  | 0.358600   | 0.479637 | 2.649196    | 2.061342
    92  | Simple Sampler  | Most Optimistic      | 3  | 0.355400   | 0.478682 | 2.637505    | 2.011785
    93  | Simple Sampler  | Most Optimistic      | 4  | 0.340000   | 0.473756 | 2.664240    | 2.059272
    94  | Simple Sampler  | Most Optimistic      | 5  | 0.330400   | 0.470404 | 2.619605    | 2.030020
    95  | Simple Sampler  | Most Optimistic      | 10 | 0.343200   | 0.474825 | 2.652753    | 2.070380
    96  | Simple Sampler  | Most Optimistic      | 30 | 0.325400   | 0.468571 | 2.657045    | 2.025362
    97  | Simple Sampler  | Least Optimistic     | 2  | 0.363400   | 0.481027 | 2.684557    | 2.068898
    98  | Simple Sampler  | Least Optimistic     | 3  | 0.344000   | 0.475089 | 2.619288    | 2.052010
    99  | Simple Sampler  | Least Optimistic     | 4  | 0.337200   | 0.472801 | 2.683669    | 2.053468
    100 | Simple Sampler  | Least Optimistic     | 5  | 0.334000   | 0.471687 | 2.681712    | 2.062158
    101 | Simple Sampler  | Least Optimistic     | 10 | 0.337600   | 0.472939 | 2.672240    | 2.037189
    102 | Simple Sampler  | Least Optimistic     | 30 | 0.341600   | 0.474294 | 2.672088    | 2.063609

Data:
    Sᵢ: the result of playing n ≥ κ > 1 purely random seed rounds for each arm i
    Ω: a function which determines whether we should play another round
    B: the maximum size of our sample history
    γ: the degree of optimism in [0, 1)
    κ: the minimum size of our sample, below which we retain the whole sample
    κc: the minimum size within a class, below which we will collapse the class
Result: A distribution-free categorical-contextual optimistic bandit sampling strategy.

initialize ∀i: x̃ᵢ ← sort(Sᵢ);
initialize ∀i: Nᵢ ← count(Sᵢ);
while Ω do
    r̂* ← −∞;
    St ← [];
    C⁰t ← Ct ← get the full context vector;
    for each arm i do
        repeat
            // Subset observed rewards to only the matching context category.
            // If insufficient values are available, remove a context category
            // randomly and repeat (minimum κc).
            x̃ᵢ^match ← subset(x̃ᵢ, Ct ∈ context(x̃ᵢ));
            Ct ← delete(draw_uniformly(Ct));
        until count(x̃ᵢ^match) > κc;
        Nᵢ^match ← count(x̃ᵢ^match);
        if (Nᵢ^match − γNᵢ^match) < κ then
            x̃ᵢ^matching-optimistic ← subset(x̃ᵢ^match, Nᵢ^match − κ, Nᵢ^match);
        else
            x̃ᵢ^matching-optimistic ← subset(x̃ᵢ^match, γNᵢ^match, Nᵢ^match);
        end
        r̂test ← draw_uniformly(x̃ᵢ^matching-optimistic);
        if r̂test > r̂* then
            r̂* ← r̂test;
            St ← [i];
        else if r̂test == r̂* then
            St.append(i);
        end
    end
    S*t ← draw_uniformly(St);
    reward ← play(S*t, C⁰t);
    x̃_{S*t}.insert_sorted_by_reward(reward, C⁰t);
    N_{S*t} ← N_{S*t} + 1;
    if N_{S*t} > B then
        x̃_{S*t} ← delete(draw_uniformly(x̃_{S*t}));
        N_{S*t} ← N_{S*t} − 1;
    end
end

Figure 3.5: The Categorical-Contextual Simple Optimistic Bootstrap Sampler (SOS)

3.7 Summary

Our research presented in this section draws a unifying story on the nature of optimism and contextual sampling strategies – answering a set of questions which arise in the implementation of an advertising system driven by a simple linear model Thompson sampler. In particular, we have presented a simulation platform which provides the infrastructure we use to run tests, compare algorithms across a wealth of dimensions and assumptions in a repeatable and tractable way, and answer questions as they arise. Then, we provided a definition of a new technique called LinTS, a linear model-driven Thompson sampler, for which we answered implementation questions and showed how it can be extended to the non-stationary (drifting) case in a reliable way.
In this process, we identify a number of interesting observations on the nature of optimism itself – providing evidence for an open question regarding the reason for the effectiveness, and the choice of specific type, of optimism in the face of uncertainty in exploratory problems. Finally, we present a set of considerations related to the nature of performing the sampling: efficient implementations, and a nonparametric, distribution-free model of a Thompson sampler which can support categorical covariates and performs similarly to the traditional parametric sampler while being robust to misspecification across model selection (in fact, requiring no model selection) in a way the traditional sampler is not.

Chapter 4

Conclusions

4.1 Summary

In this work we have explored some of the factors necessary to produce an efficient toolkit for online experiment design, especially with regard to application to the web optimization problem presented in the introduction. Our work takes a statistically-motivated perspective, relying on the fundamental tools of statistical analysis, from confidence intervals to regression analysis, while both explicitly and implicitly acknowledging the differences of the exploratory tradeoff inherent in bandits from traditional experiments. This perspective deviates slightly from some of the other perspectives presented in the literature, in a way which motivates the novelty of many of our results, especially with regard to the taxonomy of regret presented in the background.

We first presented an exploration of the types of problems and confounding factors typically considered in multi-armed bandits, an exploratory look at the work that has preceded this, and the first in-depth discussion of the considerations, especially the many definitions of regret, necessary to be understood when analyzing bandit problems.
Our background presentation focused heavily on breadth and understanding for the sake of implementation, while still providing knowledge of the existing bounds and an understanding of the positional "efficiency," in the considerations for which efficiency applies, of each algorithm within the literature.

We have considered a number of theoretical and implementation questions and provided evidence for the interpretation of concepts such as regret and optimism. More so, we presented efficient algorithms for both parametric and nonparametric sampling and for the simultaneous handling of the contextual and non-stationarity problems which are so fundamental to the problem of advertising and marketing decision theory.

Further, the presentation of the new metric of statistical regret creates the possibility of comparing algorithms ex-post in a real-world applied environment where there is no a priori oracle with which the experimenter can determine the correct answer. By treating the learning process as producing its own interval of correctness and confidence, we are able to compute measures in all environments which compare to the traditional measures that were only available in simulation. This allows a diagnostic and evaluative view of policy behavior in the real world to be taken, in a way not previously available, with an eye on tweaking and calibrating parameters for better results.

Especially interesting in our results are the outcomes with respect to the nature of empirically effective optimism, both with regard to how the prescient elimination of underlying variation in the arm distribution interacts with optimism to produce a similar benefit, and with regard to the excessively optimistic nature of optimism.
This is likely to be an area where further research will provide a more in-depth understanding, especially with regard to the differences and relative merits between probability matching strategies and upper-confidence bound strategies in general. The surprising result with regard to the exponential discounting strategy for non-stationary bandits is itself interesting even for applications outside of the multi-armed bandit: anywhere the tracking of an uncertain estimate of a moving variable is a requirement, this result should be considered and tested for application; indeed, even if the nature of the path or functional form of the drift is unknown, it seems the exponential discounting process works well and fits cleanly within a regression-oriented framework.

4.2 Future Work

4.2.1 Theoretical Bounds in Low Sample Size Scenarios

In terms of the immediate application in a marketing and web optimization context, it is often the case that researchers will want to apply an optimization technique where the search space greatly exceeds the available trials in terms of power. One such example is the case of headline analysis for advertisements, where a researcher may wish to test hundreds of distinct headlines in a diverse set of locations where the total traffic that will observe each headline is relatively small. Techniques from the natural language processing literature and other pre-heuristics may be applied at this level to attempt to provide context (side-information) to accelerate the learning process; however, the theoretical work in low sample size cases currently leaves much to be desired.

4.2.2 Prior Elicitation

In most models, some concept of prior knowledge is available. This is, in fact, nearly the whole purpose of the multi-armed bandit: at each step, utilizing (and balancing) the knowledge which has been accrued with the knowledge which is available to be accrued for future benefit.
Capturing early prior knowledge effectively is a clear path to improving the results of these methods.

...from Experts

It is often the case that a priori ignorance is not the correct assumption for the type of variables we are optimizing. Despite the enormous existing literature on eliciting priors from expert opinion, the inclusion of domain-expert knowledge in the described techniques remains an open problem.

...from Prior Experiments

While the contextual variable method of shared parameters across experiments may allow much of this prior elicitation to take place, a simple and coherent system for modifying and extending prior experiments to new variants, while retaining the valuable component of the prior data, is itself an area ripe for contribution. Such a result may allow a crowd-motivated transformation of how these experiments are conducted, with shared quantified knowledge across all experiment types (especially those from the medical or academic domain) becoming the norm.

4.2.3 Risk-Aware Analysis

Risk-aware methods for bandit optimization are essential in financial and medical domains. The risk comes in a multitude of forms, from model and specification error (especially in the likelihood of tail events, as seen extensively in financial analysis) to the ability of the process to absorb negative results earned through exploration. In the medical domain, a negative result may represent the unsuccessful treatment or even death of a patient, and depending on the stakes, such results may be entirely unacceptable.
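As one hedged illustration of what a risk-aware criterion might look like, an arm's empirical mean can be penalized by its empirical dispersion, in the style of a mean-variance trade-off; the penalty weight and reward sequences below are arbitrary illustrative choices, not a calibrated risk measure from the literature.

```python
import statistics

def risk_adjusted_value(rewards, risk_aversion=1.0):
    # Mean-variance style score: prefer arms with high mean reward and low
    # observed dispersion. Larger risk_aversion penalizes variance more.
    mean = statistics.fmean(rewards)
    spread = statistics.pstdev(rewards)
    return mean - risk_aversion * spread

safe = [1.0, 1.1, 0.9, 1.0]    # steady payoffs
risky = [0.0, 3.0, 0.0, 1.0]   # equal mean payoff, heavy spread

best = max(["safe", "risky"],
           key=lambda name: risk_adjusted_value({"safe": safe, "risky": risky}[name]))
```

Both arms have the same empirical mean here, but the penalized score prefers the steady arm; richer criteria such as maximal drawdown or spectral risk measures would replace `pstdev` in this role.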
Risk-aware bandits are conjectured [11] to be a substantially harder problem than the traditional models explored here, but some recent work has shown promise in this space. The application of models from financial analysis, especially with regard to maximal drawdown and spectral measures of risk, may prove a promising source of interdisciplinary integration of knowledge for the multi-armed bandit.

4.2.4 Feedback Delay

When a user accesses a website, traditionally, the request is logged in an access log for the purpose of compiling statistics. These statistics could be cross-referenced with sales or other action data to inform the multi-armed bandit process; this is how we expect many web optimization implementations to be built. Unfortunately, the time gap between the first access and the point where an action such as a signup or purchase is recorded could be on the order of minutes, hours or even days, during which time more users (and as such, more experimental subjects) are processed through the system.

This feedback delay, the gap between when the user is presented the arm and when the reward is observed, can be a source of many errors and can be expected to increase regret. In the worst case, where the stream of rewards arrives only after all available users have passed through the system, there is no learning taking place whatsoever. The interaction between feedback delay and the selected policy can be large even in reasonable cases; for instance, we conjecture that a sampling-based technique will outperform any static technique (such as UCB variants) in the case of high feedback delay, as sampling-based techniques are able to consider the current uncertainty in a way that does not overtrain an individual arm.
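The mechanism can be exposed in simulation by holding each reward in a queue until a fixed delay has elapsed; the ε-greedy learner, delay length, and arm probabilities below are illustrative assumptions only.

```python
import random
from collections import deque

def run_delayed(probs, horizon, delay, epsilon, rng):
    k = len(probs)
    counts, sums = [1] * k, [0.0] * k   # small pseudo-counts avoid divide-by-zero
    pending = deque()                    # (due_time, arm, reward) awaiting delivery
    pulls = [0] * k
    for t in range(horizon):
        # Deliver any feedback whose delay has elapsed.
        while pending and pending[0][0] <= t:
            _, arm, reward = pending.popleft()
            counts[arm] += 1
            sums[arm] += reward
        # Epsilon-greedy choice on the (delayed) information available so far.
        if rng.random() < epsilon:
            arm = rng.randrange(k)
        else:
            arm = max(range(k), key=lambda a: sums[a] / counts[a])
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        pending.append((t + delay, arm, reward))  # reward observed only later
        pulls[arm] += 1
    return pulls

pulls = run_delayed([0.2, 0.8], horizon=3000, delay=50, epsilon=0.1, rng=random.Random(3))
```

During the first `delay` steps the learner acts on no feedback at all, which is exactly the window in which regret accrues fastest; making `delay` approach `horizon` reproduces the no-learning worst case described above.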
This is an area that necessitates further exploration, both in quantifying the effects of feedback delay and in prescribing appropriate treatments for the problem.

4.2.5 Contextual Variables

While we have presented two solutions (the simple categorical sampler and the LinTS techniques) which support contextual covariates, the contextual problem is so fundamental to rapid learning in the exploratory problem that it warrants additional attention.

Costs of Misspecification

In particular, because of small sample size effects in both the categorical sampler and the LinTS technique, it can be very expensive (in regret) to perform a kitchen-sink regression, where all plausibly relevant variables are included in the model. Quantifying the cost of different specifications, to provide guidance to practitioners in selecting appropriate models that balance forecasting error and model-associated fixed costs, remains an important open question.

Clustering and PCA

Along the same line, where it is solely the fixed cost of early regression fits and the inability to appropriately group samples together that creates resistance to adding model variables, it may be of value to explore clustering and the use of dimensionality reduction techniques like principal component analysis. These techniques enable earlier fitting on small-sample trials via the reduction of the feature space, reducing both the regret accrued in the small-trial context and the feedback delay associated with model computation time.
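As a sketch of the reduction step, the leading principal component can be obtained by power iteration on the centered contexts and then used to compress each context to a single feature before model fitting; the three-dimensional contexts generated below are purely synthetic.

```python
import random

def top_component(rows, iters=200):
    # Power iteration for the leading eigenvector of the sample covariance
    # matrix -- the first principal component direction.
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    v = [1.0] * d
    for _ in range(iters):
        # Apply C = X^T X / n to v without forming C explicitly.
        proj = [sum(x[j] * v[j] for j in range(d)) for x in centered]
        w = [sum(p * x[j] for p, x in zip(proj, centered)) / len(rows)
             for j in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return v, means

def reduce_context(x, v, means):
    # One-dimensional feature: projection of the centered context onto v.
    return sum((xi - m) * vi for xi, m, vi in zip(x, means, v))

rng = random.Random(5)
# Hypothetical contexts where the first coordinate carries nearly all variance.
data = [[rng.gauss(0, 5), rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(300)]
v, means = top_component(data)
z = reduce_context(data[0], v, means)
```

Feeding `z` (or the top few such projections) into a linear bandit in place of the raw context shrinks the number of coefficients the early regression fits must estimate, at the cost of whatever variance the dropped directions carried.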
Further, it is suspected that in at least some contexts it may be possible to outperform the unclustered or high-dimensional model in general, even without consideration of the fixed-cost improvements.

4.2.6 Speed and Computational Complexity

In the online advertising context, millisecond-scale response times are essential for user satisfaction. Research from Google [106] finds that total load time cannot exceed 400 milliseconds before having an effect on users’ well-being and positive perception of the website. Google further found that a half-second increase in load time could result in a loss of up to 20% of their users for a particular request. Research from Amazon finds that every 100 milliseconds of additional load time reduces sales by 1% [96]. Hundreds of other research projects, both in-house and public, have found related results: response speed is an essential factor in user perception and behavior on the web. Marissa Mayer, CEO of Yahoo, is quoted as reaffirming this point: “Users really respond to speed.”

It is generally the case that a bandit model can be “frozen” in time for some number of iterations and updated only when computational resources are available. The problem with this strategy is that it interacts tightly with the problem of feedback delay: at early (smaller) sample sizes, where the model is still highly uncertain, regret may accrue very rapidly from a non-updated model. Further complicating the issue, the nature of the Internet is such that many service providers receive non-uniform flows of users, where spikes (often referred to as the Slashdot effect or a flash crowd) occur when a larger site links to the vendor, and the nature of these users is likely to be much different from that of prior users.
These factors complicate the adjustment in terms of the non-stationarity we have considered and require computationally efficient techniques for updating the model.

In particular, the utilization of an early-exit policy and continuous-tracking-type models for fitting the regression coefficients in LinTS or similar will prove invaluable in controlling the trade-off between recomputation cost and the negative effects of feedback/update delay.

Bibliography

[1] Y. Abbasi-Yadkori and C. Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Conference on Learning Theory, pages 1–26. Association for Computational Learning, 2011. → pages 43

[2] I. Abraham, O. Alonso, V. Kandylas, and A. Slivkins. Adaptive crowdsourcing algorithms for the bandit survey problem. In Conference on Learning Theory, volume 26, pages 1–29. Association for Computational Learning, 2013. → pages 8

[3] R. P. Adams and D. J. MacKay. Bayesian online changepoint detection. Technical report, University of Cambridge, 2007. → pages 51

[4] D. Agarwal, B.-C. Chen, and P. Elango. Spatio-temporal models for estimating click-through rate. In International Conference on World Wide Web, pages 21–30. Association for Computing Machinery, 2009. doi:10.1145/1526709.1526713. → pages 52

[5] R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078, 1995. ISSN 0001-8678. doi:10.2307/1427934. → pages 27, 32

[6] S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. Journal of Machine Learning Research, pages 1–26, 2012. → pages 12, 48, 49, 50

[7] R. Allesiardo, R. Féraud, and D. Bouneffouf. A neural networks committee for the contextual bandit problem. In Advances in Neural Information Processing Systems, pages 374–381. Springer, 2014. doi:10.1007/978-3-319-12637-1_47. → pages 44

[8] K. J. Arrow, D. Blackwell, and M. A. Girshick.
Bayes and minimax solutions of sequential decision problems. Econometrica: Journal of the Econometric Society, pages 213–244, 1949. → pages 12

[9] J. Aspnes. Notes on randomized algorithms. Course notes, 2014. → pages 20

[10] J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In Conference on Learning Theory, pages 773–818. Association for Computational Learning, 2009. → pages 28

[11] J.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11:2785–2836, 2010. → pages viii, 19, 21, 29, 35, 36, 41, 83

[12] J.-Y. Audibert, S. Bubeck, and R. Munos. Best arm identification in multi-armed bandits. In Conference on Learning Theory. Association for Computational Learning, June 2010. → pages 22

[13] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2003. → pages 43

[14] P. Auer and N. Cesa-Bianchi. On-line learning with malicious noise and the closure algorithm. Annals of Mathematics and Artificial Intelligence, 23(1-2):83–99, 1998. doi:10.1023/A:1018960107028. → pages 36

[15] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002. doi:10.1023/A:1013689704352. → pages 25, 26, 27, 28, 29, 42, 46, 49, 64, 65

[16] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. Journal on Computing, 32(1):48–77, 2002. doi:10.1137/S0097539701398375. → pages 12, 36, 37, 43

[17] J. S. Banks and R. K. Sundaram. Switching costs and the Gittins index. Econometrica: Journal of the Econometric Society, pages 687–694, 1994. doi:10.2307/2951664. → pages 7

[18] M. Basseville. Detecting changes in signals and systems: a survey. Automatica, 24(3):309–326, 1988. doi:10.1016/0005-1098(88)90073-8. → pages 46

[19] K. Bauman, A. Kornetova, V. Topinskii, and D. Khakimova.
Optimization of click-through rate prediction in the Yandex search engine. Automatic Documentation and Mathematical Linguistics, 47(2):52–58, 2013. doi:10.3103/S0005105513020040. → pages 52

[20] L. Benkherouf and J. Bather. Oil exploration: sequential decisions in the face of uncertainty. Journal of Applied Probability, pages 529–543, 1988. → pages 7

[21] L. Benkherouf, K. Glazebrook, and R. Owen. Gittins indices and oil exploration. Journal of the Royal Statistical Society: Series B (Methodological), pages 229–241, 1992. → pages 7

[22] V. Bentkus. On Hoeffding’s inequalities. Annals of Probability, 32(2):1650–1673, 2004. doi:10.1214/009117904000000360. → pages 28

[23] S. Berg. Solving dynamic bandit problems and decentralized games using the Kalman Bayesian learning automaton. Master’s thesis, University of Agder, 2010. → pages 47

[24] D. Bergemann and U. Hege. Venture capital financing, moral hazard, and learning. Journal of Banking and Finance, 22(6):703–735, 1998. doi:10.1016/S0378-4266(98)00017-X. → pages 6

[25] D. Bergemann and U. Hege. The financing of innovation: learning and stopping. RAND Journal of Economics, pages 719–752, 2005. → pages 6

[26] A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandit algorithms with supervised learning guarantees. In International Conference on Artificial Intelligence and Statistics, volume 15, pages 19–26, 2011. → pages 36, 38, 43

[27] D. Bouneffouf, R. Laroche, T. Urvoy, R. Féraud, and R. Allesiardo. Contextual bandit for active learning: Active Thompson sampling. In Advances in Neural Information Processing Systems, pages 405–412. Springer International Publishing, 2014. doi:10.1007/978-3-319-12637-1_51. → pages 48

[28] D. B. Brown and J. E. Smith. Optimal sequential exploration: Bandits, clairvoyants, and wildcats. Operations Research, 61(3):644–665, 2013. doi:10.2307/23474009. → pages 7

[29] C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S.
Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, March 2012. → pages 33

[30] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012. doi:10.1561/2200000024. → pages 18, 19, 22

[31] S. Bubeck and C.-Y. Liu. Prior-free and prior-dependent regret bounds for Thompson sampling. In Advances in Neural Information Processing Systems, pages 638–646, 2013. → pages 17

[32] S. Bubeck and R. Munos. Open loop optimistic planning. In Conference on Learning Theory. Association for Computational Learning, June 2010. → pages 35

[33] S. Bubeck and A. Slivkins. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, pages 1–23. Association for Computational Learning, 2012. → pages 38

[34] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári. Online optimization in X-armed bandits. In Advances in Neural Information Processing Systems, pages 201–208, 2008. → pages 34

[35] S. Bubeck, R. Munos, G. Stoltz, and C. Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12:1655–1695, 2011. → pages 34

[36] G. Burtini, J. Loeppky, and R. Lawrence. Improving online marketing experiments with drifting multi-armed bandits. In International Conference on Enterprise Information Systems, 2015. → pages iii, 60

[37] G. Burtini, J. Loeppky, and R. Lawrence. A survey of online experiment design with the stochastic multi-armed bandit. Statistics Surveys, submitted 2015. → pages iii

[38] A. Carpentier and M. Valko. Extreme bandits. In Advances in Neural Information Processing Systems, volume 27, pages 1089–1097, 2014. → pages 21

[39] A. Carpentier and M. Valko. Simple regret for infinitely many armed bandits. In International Conference on Machine Learning. Association for Computing Machinery, 2015.
→ pages 16, 33

[40] N. Cesa-Bianchi and P. Fischer. Finite-time regret bounds for the multiarmed bandit problem. In International Conference on Machine Learning, pages 100–108. Association for Computing Machinery, 1998. doi:10.1023/A:1013689704352. → pages 26

[41] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006. → pages 37

[42] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011. → pages 49, 75

[43] O. Chapelle, E. Manavoglu, and R. Rosales. Simple and scalable response prediction for display advertising. Transactions on Intelligent Systems and Technology, 5(4):61–95, 2014. doi:10.1145/2532128. → pages 52

[44] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23(4):493–507, 1952. doi:10.1214/aoms/1177729330. → pages 28

[45] P. D. Childs and A. J. Triantis. Dynamic R&D investment policies. Management Science, 45(10):1359–1377, 1999. doi:10.1287/mnsc.45.10.1359. → pages 7

[46] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In Conference on Artificial Intelligence and Statistics, volume 14, pages 208–214, 2011. → pages 43

[47] R. Combes and A. Proutiere. Unimodal bandits: Regret lower bounds and optimal algorithms. International Conference on Machine Learning, pages 521–529, 2014. → pages 33

[48] P.-A. Coquelin and R. Munos. Bandit algorithms for tree search. Conference on Uncertainty in Artificial Intelligence, pages 67–74, 2007. → pages 33, 34

[49] R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In Computers and Games, pages 72–83. Springer, 2007. → pages 34

[50] V. Dani, S. M. Kakade, and T. P. Hayes. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, pages 345–352, 2007. → pages 43

[51] V. Dani, T. P. Hayes, and S. M.
Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, pages 355–366. Association for Computational Learning, 2008. → pages 65

[52] R. W. Dearden. Learning and planning in structured worlds. PhD thesis, The University of British Columbia, 2000. → pages 45

[53] M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient optimal learning for contextual bandits. In Conference on Uncertainty in Artificial Intelligence, volume 28, pages 169–178. AUAI Press, 2011. → pages 42, 43

[54] I. Dumitriu, P. Tetali, and P. Winkler. On playing golf with two balls. Journal on Discrete Mathematics, 16(4):604–615, 2003. doi:10.1137/S0895480102408341. → pages 8

[55] D. Eckles and M. Kaptein. Thompson sampling with the online bootstrap. arXiv preprint arXiv:1410.4009, 2014. → pages 14, 48, 51, 60, 73

[56] B. Efron. Bootstrap methods: another look at the jackknife. Annals of Statistics, pages 1–26, 1979. doi:10.1214/aos/1176344552. → pages 70

[57] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006. → pages 25

[58] V. F. Farias and R. Madan. The irrevocable multiarmed bandit problem. Operations Research, 59(2):383–399, 2011. doi:10.1287/opre.1100.0891. → pages 7

[59] P. Fearnhead and P. Clifford. On-line inference for hidden Markov models via particle filters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(4):887–899, 2003. doi:10.1111/1467-9868.00421. → pages 51

[60] P. Fearnhead and Z. Liu. On-line inference for multiple changepoint problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4):589–605, 2007. doi:10.1111/j.1467-9868.2007.00601.x. → pages 51

[61] S. Filippi, O. Cappé, A. Garivier, and C. Szepesvári. Parametric bandits: The generalized linear case.
In Advances in Neural Information Processing Systems, pages 586–594, 2010. → pages 65

[62] A. Garivier and O. Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Conference on Learning Theory, volume 24, pages 359–376. Association for Computational Learning, 2011. → pages 30

[63] A. Garivier and E. Moulines. On upper-confidence bound policies for non-stationary bandit problems. arXiv preprint arXiv:0805.3415, 2008. → pages 45, 46, 64

[64] S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modification of UCT with patterns in Monte-Carlo Go, 2006. → pages 34

[65] J. Gittins and D. Jones. A dynamic allocation index for the sequential design of experiments. In Progress in Statistics, pages 241–266. North-Holland, Amsterdam, NL, 1974. → pages 2

[66] J. Gittins, K. Glazebrook, and R. Weber. Multi-armed Bandit Allocation Indices. John Wiley & Sons, 1989–2011. → pages 2

[67] J. C. Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society: Series B (Methodological), 41:148–164, 1979. → pages 2

[68] A. Gopalan, S. Mannor, and Y. Mansour. Thompson sampling for complex online problems. In International Conference on Machine Learning, pages 100–108. Association for Computing Machinery, 2014. → pages 48

[69] T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine. In International Conference on Machine Learning, pages 13–20. Association for Computing Machinery, 2010. → pages 52

[70] O.-C. Granmo and S. Berg. Solving non-stationary bandit problems by random sampling from sibling Kalman filters. In Trends in Applied Intelligent Systems, pages 199–208. Springer Berlin Heidelberg, 2010. doi:10.1007/978-3-642-13033-5_21. → pages 19, 23, 47

[71] O.-C. Granmo and S. Glimsdal. A two-armed bandit based scheme for accelerated decentralized learning.
In Modern Approaches in Applied Intelligence, pages 532–541. Springer Berlin Heidelberg, 2011. doi:10.1007/978-3-642-21827-9_54. → pages 47

[72] S. Guha and K. Munagala. Stochastic regret minimization via Thompson sampling. In Conference on Learning Theory, pages 317–338. Association for Computational Learning, 2014. → pages 48

[73] C. Hartland, S. Gelly, N. Baskiotis, O. Teytaud, and M. Sebag. Multi-armed bandit, dynamic environments and meta-bandits. In Advances in Neural Information Processing Systems, 2006. → pages 46, 64

[74] D. V. Hinkley. Inference about the change-point in a sequence of random variables. Biometrika, 57(1):1–17, 1970. doi:10.1093/biomet/57.1.1. → pages 46

[75] C.-J. Ho and J. W. Vaughan. Online task assignment in crowdsourcing markets. In Conference on Artificial Intelligence. AAAI, 2012. → pages 8

[76] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963. doi:10.1002/0471667196.ess2066. → pages 28

[77] M. D. Hoffman, E. Brochu, and N. de Freitas. Portfolio allocation for Bayesian optimization. In Uncertainty in Artificial Intelligence, pages 327–336, 2011. → pages 7

[78] J. Hsu. Multiple Comparisons: Theory and Methods. CRC Press, 1996. → pages 26

[79] Z. Hussain, P. Auer, N. Cesa-Bianchi, L. Newnham, and J. Shawe-Taylor. Exploration vs. exploitation PASCAL challenge, 2006. → pages 9, 51, 64

[80] S. Jain, S. Gujar, S. Bhat, O. Zoeter, and Y. Narahari. An incentive compatible multi-armed-bandit crowdsourcing mechanism with quality assurance. In Symposium on Learning, Algorithms and Complexity, 2015. → pages 8

[81] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London: Series A (Mathematical and Physical Sciences), 186(1007):453–461, 1946. doi:10.1098/rspa.1946.0056. → pages 55

[82] C. Jennison and B. W. Turnbull. Statistical approaches to interim monitoring of medical trials: A review and commentary.
Statistical Science, pages 299–317, 1990. doi:10.1214/ss/1177012099. → pages 6, 26

[83] D. H. Johnson and S. Sinanovic. Symmetrizing the Kullback-Leibler distance, 2001. → pages 55

[84] B. Jovanovic. Job matching and the theory of turnover. Journal of Political Economy, pages 972–990, 1979. doi:10.1086/260808. → pages 6

[85] D. Kahneman and G. Klein. Conditions for intuitive expertise: a failure to disagree. American Psychologist, 64(6):515–526, 2009. doi:10.1037/a0016755. → pages 8

[86] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari. Efficient bandit algorithms for online multiclass prediction. In International Conference on Machine Learning, volume 25, pages 440–447. Association for Computing Machinery, 2008. doi:10.1145/1390156.1390212. → pages 44

[87] M. Kaptein. The use of Thompson sampling to increase estimation precision. Behavior Research Methods, 47(2):409–423, 2015. → pages 24

[88] E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian upper confidence bounds for bandit problems. In International Conference on Artificial Intelligence and Statistics, pages 592–600, 2012. → pages 29, 30

[89] E. Kaufmann, O. Cappé, and A. Garivier. On the complexity of best arm identification in multi-armed bandit models. arXiv preprint arXiv:1407.4443, 2014. → pages 17

[90] J. Kerman. A closed-form approximation for the median of the beta distribution. arXiv preprint arXiv:1111.0433, 2011. → pages 59

[91] S.-J. Kim, M. Aono, and M. Hara. Tug-of-war model for the two-bandit problem: Nonlocally-correlated parallel exploration via resource conservation. BioSystems, 101(1):29–36, 2010. doi:10.1016/j.biosystems.2010.04.002. → pages 6

[92] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In Symposium on Theory of Computing, volume 40, pages 681–690. Association for Computing Machinery, 2008. doi:10.1145/1374376.1374475. → pages 34

[93] F. H. Knight. Risk, Uncertainty and Profit. New York: Hart, Schaffner and Marx, 1921. → pages 36

[94] L. Kocsis and C. Szepesvári.
Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer, 2006. doi:10.1007/11871842_29. → pages 45

[95] L. Kocsis and C. Szepesvári. Discounted UCB. EvE PASCAL Challenges Workshop, 2006. → pages 64

[96] R. Kohavi and R. Longbotham. Online experiments: Lessons learned. Computer, 40(9):103–105, 2007. doi:10.1109/MC.2007.328. → pages 84

[97] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014. → pages iii

[98] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, pages 79–86, 1951. → pages 30, 55

[99] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985. doi:10.1016/0196-8858(85)90002-8. → pages 12, 19, 22, 27, 49

[100] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824, 2007. → pages 27, 42, 43

[101] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In International Conference on World Wide Web, volume 19, pages 661–670. Association for Computing Machinery, 2010. doi:10.1145/1772690.1772758. → pages 9, 52, 53, 56, 66

[102] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In International Conference on Web Search and Data Mining, pages 297–306. Association for Computing Machinery, 2011. doi:10.1145/1935826.1935878. → pages 51

[103] L. Li, W. Chu, J. Langford, T. Moon, and X. Wang. An unbiased offline evaluation of contextual bandit algorithms with generalized linear models. In International Conference on Web Search and Data Mining, 2012. doi:10.1145/1935826.1935878. → pages 9, 56

[104] O.-A. Maillard. Robust risk-averse stochastic multi-armed bandits.
In Algorithmic Learning Theory, pages 218–233. Springer, 2013. doi:10.1007/978-3-642-40935-6_16. → pages 14

[105] O.-A. Maillard, R. Munos, and G. Stoltz. A finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In Conference on Learning Theory, volume 24, pages 497–514. Association for Computational Learning, 2011. → pages 22, 30

[106] M. Mayer. A glimpse under the hood at Google, 2008. → pages 84

[107] H. B. McMahan and M. J. Streeter. Tighter bounds for multi-armed bandits with expert advice. In Conference on Learning Theory. Association for Computational Learning, 2009. → pages 36

[108] J. Mellor and J. Shapiro. Thompson sampling in switching environments with Bayesian online change detection. In International Conference on Artificial Intelligence and Statistics, pages 442–450, 2013. → pages 23, 51, 64

[109] J. C. Mellor. Decision making using Thompson Sampling. PhD thesis, University of Manchester, 2014. → pages 52, 56

[110] R. A. Miller. Job matching and occupational choice. Journal of Political Economy, pages 1086–1120, 1984. → pages 6

[111] T. P. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001. → pages 58

[112] R. E. Morin. Factors influencing rate and extent of learning in the presence of misinformative feedback. Journal of Experimental Psychology, 49(5):343–351, 1955. doi:10.1037/h0042813. → pages 47

[113] R. Munos. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 7:1–129, 2014. doi:10.1561/2200000038. → pages 49

[114] J. Neufeld, A. György, D. Schuurmans, and C. Szepesvári. Adaptive Monte Carlo via bandit allocation. In International Conference on Machine Learning, volume 31, pages 1944–1952. Association for Computing Machinery, 2014. → pages 48

[115] H. T. Nguyen, J. Mary, and P. Preux. Cold-start problems in recommendation systems via contextual-bandit algorithms.
arXiv preprint arXiv:1405.7544, 2014. → pages 48

[116] E. Page. Continuous inspection schemes. Biometrika, pages 100–115, 1954. doi:10.1093/biomet/41.1-2.100. → pages 46

[117] S. Pandey and C. Olston. Handling advertisements of unknown quality in search advertising. In Advances in Neural Information Processing Systems, volume 19, pages 1065–1072, 2006. → pages 52

[118] S. Pandey, D. Agarwal, D. Chakrabarti, and V. Josifovski. Bandits for taxonomies: A model-based approach. In Statistical Analysis and Data Mining, volume 7, pages 216–227. SIAM, 2007. doi:10.1137/1.9781611972771.20. → pages 42

[119] N. G. Pavlidis, D. K. Tasoulis, and D. J. Hand. Simulation studies of multi-armed bandits with covariates. In UKSim, pages 493–498. IEEE, 2008. → pages 66

[120] J. Pearl. Heuristics: Intelligent Search Strategies for Computer Problem Solving. 1984. → pages 33

[121] V. Perchet, P. Rigollet, et al. The multi-armed bandit problem with covariates. Annals of Statistics, 41(2):693–721, 2013. doi:10.1214/13-AOS1101. → pages 42

[122] R. S. Pindyck. A note on competitive investment under uncertainty. American Economic Review, pages 273–277, 1993. → pages 7

[123] P. Reverdy, V. Srivastava, and N. E. Leonard. Modeling human decision-making in multi-armed bandits. In Proceedings of the IEEE, volume 102, page 544, 2014. doi:10.1109/JPROC.2014.2307024. → pages 47

[124] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: estimating the click-through rate for new ads. In International Conference on World Wide Web, pages 521–530. Association for Computing Machinery, 2007. doi:10.1145/1242572.1242643. → pages 52

[125] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952. doi:10.1007/978-1-4612-5110-1_13. → pages 12

[126] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958. doi:10.1037/h0042519. → pages 44

[127] M. Rothschild.
A two-armed bandit theory of market pricing. Journal of Economic Theory, 9(2):185–202, 1974. doi:10.1016/0022-0531(74)90066-0. → pages 6

[128] P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010. → pages 43, 65

[129] P. Rusmevichientong and D. P. Williamson. An adaptive algorithm for selecting profitable keywords for search-based advertising services. In Conference on Electronic Commerce, pages 260–269. Association for Computing Machinery, 2006. doi:10.1145/1134707.1134736. → pages 52

[130] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014. doi:10.1287/moor.2014.0650. → pages 48

[131] J. Sarkar. One-armed bandit problems with covariates. Annals of Statistics, 19(4):1978–2002, 1991. doi:10.1214/aos/1176348382. → pages 42

[132] S. L. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010. → pages 47, 52

[133] S. L. Scott. Multi-armed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31(1):37–45, 2015. doi:10.1002/asmb.2104. → pages 52

[134] Y. Seldin, C. Szepesvári, P. Auer, and Y. Abbasi-Yadkori. Evaluation and analysis of the performance of the EXP3 algorithm in stochastic environments. In European Workshop on Reinforcement Learning, pages 103–116. Microtome Publishing, 2013. → pages 14

[135] B. Shahriari, Z. Wang, M. W. Hoffman, A. Bouchard-Côté, and N. de Freitas. An entropy search portfolio for Bayesian optimization. arXiv preprint arXiv:1406.4625, 2014. → pages 48

[136] M. Sorensen. Learning by investing: Evidence from venture capital. In New Orleans Meetings Paper. AFA, 2008. doi:10.2139/ssrn.967822. → pages 7

[137] V. Srivastava, P. Reverdy, and N. Leonard. Optimal foraging and multi-armed bandits.
InAllerton Conference on Communication, Control, and Computing, pages 494–499, 2013.doi:10.1109/Allerton.2013.6736565. → pages 47[138] G. Steinbrecher and W. T. Shaw. Quantile mechanics. European Journal of AppliedMathematics, 19(02):87–112, 2008. → pages 73[139] A. L. Strehl, C. Mesterharm, M. L. Littman, and H. Hirsh. Experience-efficient learning inassociative bandit problems. In International Conference on Machine Learning, pages889–896. Association for Computing Machinery, 2006. doi:10.1145/1143844.1143956. →pages 42[140] R. S. Sutton. Integrated architectures for learning, planning, and reacting based onapproximating dynamic programming. In International Conference on Machine Learning,volume 7, pages 216–224. Association for Computing Machinery, 1990. → pages 4595[141] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 1998.→ pages 25[142] I. Szita and A. Lo˝rincz. The many faces of optimism: a unifying approach. In InternationalConference on Machine Learning, pages 1048–1055. Association for Computing Machinery,2008. doi:10.1145/1390156.1390288. → pages 49[143] L. Tang, R. Rosales, A. Singh, and D. Agarwal. Automatic ad format selection viacontextual bandits. In International Conference on Information and KnowledgeManagement, pages 1587–1594. Association for Computing Machinery, 2013.doi:10.1145/2505515.2514700. → pages 52[144] W. R. Thompson. On the likelihood that one unknown probability exceeds another in viewof the evidence of two samples. Biometrika, 1933. → pages 9, 11, 47, 49, 62[145] A. Tikhonov. Solution of incorrectly formulated problems and the regularization method. InSoviet Mathematics - Doklady, volume 5, pages 1035–1038. American Mathematical Society,1963. → pages 66[146] L. Tran-Thanh, S. Stein, A. Rogers, and N. R. Jennings. Efficient crowdsourcing ofunknown experts using multi-armed bandits. In European Conference on ArtificialIntelligence, volume 20, pages 768–773, 2012. doi:10.1016/j.artint.2014.04.005. 
→ pages 8[147] H. Tyagi and B. Ga¨rtner. Continuum armed bandit problem of few variables in highdimensions. In Workshop on Approximation and Online Algorithms, volume 8447, pages108–119. Springer International Publishing, 2013. doi:10.1007/978-3-319-08001-7 10. →pages 33[148] H. Tyagi, S. U. Stich, and B. Ga¨rtner. On two continuum armed bandit problems in highdimensions. Theory of Computing Systems, pages 1–32, 2014.doi:10.1007/s00224-014-9570-8. → pages 33[149] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142,1984. doi:10.1145/1968.1972. → pages 20[150] M. Valko, R. Munos, B. Kveton, and T. Koca´k. Spectral bandits for smooth graphfunctions. In International Conference on Machine Learning, volume 31, pages 46–54.Association for Computing Machinery, 2014. → pages 35[151] K. Van Moffaert, K. Van Vaerenbergh, P. Vrancx, and A. Nowe´. Multi-objective χ-armedbandits. In International Joint Conference on Neural Networks, pages 2331–2338. IEEE,2014. doi:10.1109/IJCNN.2014.6889753. → pages 33[152] J. Vermorel and M. Mohri. Multi-armed bandit algorithms and empirical evaluation. InEuropean Conference on Machine Learning, pages 437–448. Springer, 2005.doi:10.1007/11564096 42. → pages 12, 25, 26, 31, 32, 67[153] S. S. Villar, J. Bowden, and J. Wason. Multi-armed bandit models for the optimal design ofclinical trials: Benefits and challenges. Statistical Science, 30(2):199–215, 2015.doi:10.1214/14-STS504. → pages 696[154] A. Wald. Sequential Analysis. Wiley Series in Probability and Mathematical Statistics. J.Wiley & Sons, Incorporated, 1947. → pages 12[155] C.-C. Wang, S. R. Kulkarni, and H. V. Poor. Bandit problems with side observations.Transactions on Automatic Control, 50(3):338–355, 2005. doi:10.1109/TAC.2005.844079. →pages 42[156] C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge,1989. → pages 12, 25[157] A. Weinstein. Big-ol-bandit (BOB) implemented, tested., 2010.Accessed: June 2, 2015. 
→ pages 35[158] M. L. Weitzman. Optimal search for the best alternative. Econometrica: Journal of theEconometric Society, pages 641–654, 1979. → pages 7[159] P. H. Westfall, S. S. Young, and S. P. Wright. On adjusting p-values for multiplicity.Biometrics, pages 941–945, 1993. doi:10.2307/2532216. → pages 26[160] M. Woodroofe. A one-armed bandit problem with a concomitant variable. Journal of theAmerican Statistical Association, 74(368):799–806, 1979.doi:10.1080/01621459.1979.10481033. → pages 4297

