UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Multiagent learning and empirical methods Zawadzki, Erik P. 2008-12-31

You don't seem to have a PDF reader installed, try download the pdf

Item Metadata


24-ubc_2008_fall_zawadzki_erik.pdf [ 837.4kB ]
JSON: 24-1.0051290.json
JSON-LD: 24-1.0051290-ld.json
RDF/XML (Pretty): 24-1.0051290-rdf.xml
RDF/JSON: 24-1.0051290-rdf.json
Turtle: 24-1.0051290-turtle.txt
N-Triples: 24-1.0051290-rdf-ntriples.txt
Original Record: 24-1.0051290-source.json
Full Text

Full Text

Multiagent Learning and EmpiricalMethodsbyErik P. ZawadzkiB.Sc. Honours, The University of British Columbia, 2005A THESIS SUBMITTED IN PARTIAL FULFILMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinThe Faculty of Graduate Studies(Computer Science)The University Of British Columbia(Vancouver)October, 2008c Erik P. Zawadzki 2008AbstractMany algorithms exist for learning how to act in a repeated game and most have theoretical guaran-tees associated with their behaviour. However, there are few experimental results about the empir-ical performance of these algorithms, which is important for any practical application of this work.Most of the empirical claims in the literature to date have been based on small experiments, and thishas hampered the development of multiagent learning (MAL) algorithms with good performanceproperties.In order to rectify this problem, we have developed a suite of tools for running multiagent ex-periments called the Multiagent Learning Testbed (MALT). These tools are designed to facilitaterunning larger and more comprehensive experiments by removing the need to code one-off experi-mental apparatus. MALT also provides a number of public implementations of MAL algorithms?hopefully eliminating or reducing differences between algorithm implementations and increasingthe reproducibility of results. Using this test-suite, we ran an experiment that is unprecedented interms of the number of MAL algorithms used and the number of game instances generated. Theresults of this experiment were analyzed by using a variety of performance metrics?includingreward, maxmin distance, regret, and several types of convergence. Our investigation also drawsupon a number of empirical analysis methods. Through this analysis we found some surprising re-sults: the most surprising observation was that a very simple algorithm?one that was intended forsingle-agent reinforcement problems and not multiagent learning?performed better empiricallythan more complicated and recent MAL algorithms.iiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.1 Preference and Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.2 One-shot Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.2.1 Solution Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.3 Repeated Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.4 Multiagent Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 112.4.1 Fictitious Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.4.2 Determined . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.4.3 Targeted Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142.4.4 Q-learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142.4.5 Gradient Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.4.6 Previous Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 163 Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183.1 The Platform Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183.1.2 Platform Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193.1.3 Algorithm Implementations . . . . . . . . . . . . . . . . . . . . . . . . . 21iiiTable of Contents4 Empirical Methods and Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274.2 Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284.3 Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294.3.1 Kolmogorov-Smirnov Test . . . . . . . . . . . . . . . . . . . . . . . . . . 294.3.2 Spearman's Rank Correlation Test . . . . . . . . . . . . . . . . . . . . . . 314.4 Probabilistic Domination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315 Empirical Evaluation of MAL Algorithms . . . . . . . . . . . . . . . . . . . . . . . 335.1 Reward-Based Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335.1.1 Average Reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335.1.2 Maxmin Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535.2 Regret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565.3 Convergence-Based Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645.3.1 Strategic Stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675.3.2 Stage-Game Nash Equilibria . . . . . . . . . . . . . . . . . . . . . . . . . 715.3.3 Repeated-Game Nash Equilibria . . . . . . . . . . . . . . . . . . . . . . . 715.4 Links Between Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 735.4.1 Linking Reward With Maxmin Distance . . . . . . . . . . . . . . . . . . 735.4.2 Linking Reward With Regret . . . . . . . . . . . . . . . . . . . . . . . . 775.4.3 Linking Reward With Nash Equilibrium Convergence . . . . . . . . . . . 806 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 887 Future Work: Extension to Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 917.1 Wardrop Equilibrium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 927.2 Congestion Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 927.3 Our Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 937.3.1 Other Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 957.3.2 Experimental Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . 96AppendicesA Stratified Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99ivList of Tables2.1 Previous experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173.1 Design decisions for fictitious play . . . . . . . . . . . . . . . . . . . . . 223.2 Design decisions for determined . . . . . . . . . . . . . . . . . . . . . . . . . 223.3 Design decisions for meta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233.4 Design decisions for meta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233.5 Design decisions for GIGA-WoLF. . . . . . . . . . . . . . . . . . . . . . . . . . . 243.6 Design decisions for GSA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243.7 Design decisions RVsigma(t). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253.8 Design decisions for Q-learning. . . . . . . . . . . . . . . . . . . . . . . . . . 254.1 The number and name of each game generator. . . . . . . . . . . . . . . . . . . . 285.1 The different algorithms and their best-response sets . . . . . . . . . . . . . . . . 385.2 The proportion of grand distribution subsampled algorithm games where each al-gorithm was strictly or weakly dominated. . . . . . . . . . . . . . . . . . . . . . . 405.3 The set of best algorithms for each generator. . . . . . . . . . . . . . . . . . . . . 49A.1 Two schemes for sampling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97vList of Figures2.1 von Neummann and Morgenstern's six preference axioms. . . . . . . . . . . . . . 42.2 A partial ordering example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.3 A game of Prisoner's Dilemma in normal-form with canonical payoffs. . . . . . . 72.4 A game of Battle of the Sexes in normal-form with canonical payoffs. . . . . . . . 82.5 A game of Traveler's Dilemma with 100 actions. . . . . . . . . . . . . . . . . . . 112.6 A Sidewalk or Dispersion game, where two agents try to miscoordinate where theystep. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.7 A game with a convincing Stackelberg outcome . . . . . . . . . . . . . . . . . . . 143.1 The five steps in running an analyzing an experiment using MALT. . . . . . . . . . 204.1 Kolmogorov-Smirnov test example. . . . . . . . . . . . . . . . . . . . . . . . . . 305.1 Lens plot for reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355.2 Bootstrapped mean estimator distributions for Q-learning and RVsigma(t) . . . . . . 365.3 A heatmap showing the mean reward for each algorithm . . . . . . . . . . . . . . 375.4 The mean reward algorithm game . . . . . . . . . . . . . . . . . . . . . . . . . . 395.5 The mean reward algorithm game for D4 . . . . . . . . . . . . . . . . . . . . . . . 425.6 Reward distribution probabilistic domination between the algorithms . . . . . . . . 435.7 Self-play lens plot for reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445.8 Algorithm-generator heatmap for reward. . . . . . . . . . . . . . . . . . . . . . . 465.9 Normalized algorithm-generator heatmap for reward. . . . . . . . . . . . . . . . . 475.10 Bootstrapped mean estimate distribution for D7 . . . . . . . . . . . . . . . . . . . 485.11 Correlation between size and reward . . . . . . . . . . . . . . . . . . . . . . . . . 505.12 Similarity among different algorithms . . . . . . . . . . . . . . . . . . . . . . . . 525.13 The sign of the safety distance of each run, by algorithm. . . . . . . . . . . . . . . 545.14 The distribution of maxmin distances for AWESOME, minimax-Qand Q-learn-ing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555.15 The proportion of enforceable runs, blocked by opponent. . . . . . . . . . . . . . . 575.16 The distribution of negative maxmin distances for GIGA-WoLF and RVsigma(t). . . . . 585.17 Lens plot for regret. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605.18 The number of runs for each algorithm that have negative, zero, or positive regret. . 615.19 The distribution of regret for Q-learning and GIGA-WoLF. . . . . . . . . . . . 62viList of Figures5.20 Mean average regret, blocked by opponent. . . . . . . . . . . . . . . . . . . . . . 635.21 The number of opponents for which the algorithm on the ordinate probabilisticallydominates the algorithm on the abscissa. For example, Q-learning probabilis-tically dominates fictitious play on PSMs involving ten out of eleven pos-sible opponents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655.22 The number of generators for which the algorithm on the ordinate probabilisticallydominates the algorithm on the abscissa. . . . . . . . . . . . . . . . . . . . . . . . 665.23 Proportion of stationary runs, blocked on opponent . . . . . . . . . . . . . . . . . 685.24 Proportion of non-stationary runs, blocked on generator and protagonist . . . . . . 695.25 Convergences proportions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 705.26 Self-play convergences proportions . . . . . . . . . . . . . . . . . . . . . . . . . . 725.27 Proportion of PSMs with enforceable payoffs and payoffs profiles achieved, byalgorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745.28 Sign of correlation between reward and maxmin distance . . . . . . . . . . . . . . 755.29 The distribution of reward for RVsigma(t) when conditioning on different maxmin dis-tances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765.30 The distribution of reward for Q-learning when conditioning on different en-forceability. Runs from D10 were excluded. . . . . . . . . . . . . . . . . . . . . . 785.31 Bivariate histogram showing reward and maxmin distance for RVsigma(t) . . . . . . . . 795.32 The sign of correlation between reward and regret . . . . . . . . . . . . . . . . . . 815.33 GIGA-WoLF's CDF curves for positive and non-positive reward . . . . . . . . . . 825.34 Reward and regret for AWESOME . . . . . . . . . . . . . . . . . . . . . . . . . . . 835.35 Reward and regret for Q-learning . . . . . . . . . . . . . . . . . . . . . . . . 845.36 The sign of correlation between reward and lscriptinfinity-distance to the closest Nash equi-librium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 865.37 Reward and lscriptinfinity-distance to the closest Nash equilibrium . . . . . . . . . . . . . . 877.1 A sample road graph with four intersections. . . . . . . . . . . . . . . . . . . . . . 917.2 A sample road graph with two intersections that has no Wardrop equilibrium for asystem with two atomic drivers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 92viiAcknowledgmentsThis thesis would not have been possible without the excellent support that I received from allof my friends, colleagues, and professors at the University of British Columbia. I cannot nameeveryone that has helped me with my work in the past seven years and two degrees at UBC, but Iwill try to name those who directly helped with this dissertation.There were three students that worked on this project with me. All of this work stems froma previous thesis by Asher Lipson, and reading his thesis and paper helped me get a grip on theproblem. David Ludgate was very helpful with the coding of this version of MALT and bore withme when I was still figuring out the rudimentary details of this project. Alice Gao was an essentialpart of the traffic application of this work.My fellow students were a constant source of ideas, suggestions, and advice. I'm very thank-ful to Baharak Rastegari, Albert Xin Jiang, David Thompson, James Wright, Samantha Leung,Damien Bargiacchi, Frank Hutter, Lin Xu, and Tristram Southey for their insights. I am particu-larly thankful for James' careful read over the traffic part of the thesis. Both the GTDT and EARGreading groups listened and commented on version iterations of my talk and watched it grow froma summer project, to a course project and finally to a Master's thesis.The entire direction for the analysis of results changed after taking the empirical algorithmicscourse given by Dr. Holger Hoos, and I am indebted to him for his interesting presentation of thematerial and useful suggestions for my problem. Dr. Nando de Frietas, beyond agreeing to beinga reviewer for this thesis (on his sabatical, no less), was very helpful throughout the process?particularly with some of the statistics for this project.Dr. Kevin Leyton-Brown, my supervisor, showed me what it meant to do research and essen-tially taught me how to write from scratch. He constantly challenged me and was a great person towork with. Three pieces of his advice will always stick with me: be the step-mother of your work?not its mother; research is a dialog and you should see your work as adding to the conversation;and always use a sans serif font in presentations.viiiDedicationTo my parents, my Babcia, and Gwenn. Each one is as important as the last1.1This can be seen as a preference relationship that does not satisfy completeness. See Table 2.1ixChapter 1IntroductionUrban road networks, hospital systems and commodity markets are all examples of complicatedmultiagent systems that are essential to everyday life. Indeed, any social interaction can be seen asa multiagent problem. As a result of the prominence of multiagent systems, a lot of attention hasbeen paid to designing and analyzing learning algorithms for multiagent environments. Examplesinclude algorithms by Littman [30], Singh et al. [47], Hu and Wellman [24], Greenwald and Hall[19], Bowling [7], Powers and Shoham [38], Banerjee and Peng [5], and Conitzer and Sandholm[13]. As a result of this attention, a multitude of different algorithms exist for a variety of differentsettings.Before introducing a new algorithm, no matter the problem, a fundamental question needs tobe asked: how does it compare with previous approaches? In particular, there is little point insuggesting a method that does not offer some form of improvement. Such a question rarely can beanswered with theoretical guarantees alone, and a full answer typically involves an extensive set ofexperiments.This thesis follows the Artificial Intelligence agenda [46] for multiagent learning: a good multi-agent learning (MAL) algorithm is one that gets high rewards for its actions in a multiagent system.However, past work in MAL has primarily focused on proving theoretical guarantees about differ-ent algorithms and it has been difficult to prove results about raw average reward. Instead therehave been numerous theorems about a variety of alternative performance metrics?for which re-sults are provable?that are intended to `stand in' for reward. So how good are the numerous MALalgorithms in terms of the Artificial Intelligence criterion for performance?While most results for MAL algorithms have been theoretical, some experiments have beenconducted. Most of these experiments were small in terms of game instances, opposing algorithmsand performance metrics. The small number of metrics is particularly important. Authors havefocused on many different aspects of performance for both theoretical and empirical work. Theabundance of possible metrics and lack of comprehensive experiments mean that given any two al-gorithms it is unlikely that their empirical results can be meaningfully compared?the experimentsthat have been done investigated many incomparable aspects of performance. As a result, there arestill opportunities to expand our empirical understanding of how MAL techniques interact and howone could design an algorithm with improved empirical characteristics.Part of the reason for the relative paucity of large-scale empirical work is that neither a central-ized algorithm repository nor a standardized test setup exists. This is unfortunate, not only becauseneedless work has to be invested in designing one-off testbeds and reimplementing algorithms, butalso because a centralized and public repository increases reproducibility and decreases the poten-1Chapter 1. Introductiontial for implementation differences. Publicly available and scrutinized implementations will makeexperiments easier to run, reproduce, and compare.In this thesis we make two contributions. First, we describe the design and implementationof a platform for running MAL experiments (? 3). This platform offers several advantages overone-off setups and we hope that this initial investment in architecture will facilitate larger and morecomprehensive empirical work.Our second contribution is a large experiment that we set up and executed with our platformthat is, to our knowledge, unprecedented in terms of scale (? 4). Our discussion of this experimentfocuses on suggesting empirical methods for analyzing MAL performance data and engaging in adetailed discussion of the results. In particular, we show that there are some interesting relation-ships between the different performance metrics and that some very basic algorithms out performmore sophisticated algorithms on several key metrics.2Chapter 2Background and Related WorkGame theory is a way to model the interactions of multiple self-interested agents. Each gamemaps the decisions made by the agents to outcomes which are in turn mapped to rewards for eachof the agents. Everyone's reward depends on the decisions made by other agents and vice versa.This powerful representation not only captures everything that is commonly thought of as a game,like rock-paper-scissors and backgammon, but also many other complicated situations in politics,economics and other social interactions.Games are a mathematical object. While all game formalizations share the broad characteri-zation that the agent's rewards are tied to the decisions made by others, specific games can varygreatly in their detail. For instance, is the game played once, or many times? How much aboutthe game and the other players is known? While it is not the intention of this thesis to give anexhaustive survey of all game forms, we will describe the model for the type of game that we usein this dissertation?the repeated game. However, repeated games are built upon a more basic typeof game, the one-shot game, which in turn relies on having a crisp mathematical formalization ofthe intuitive notion of having preferences. We will give some background for each of these topics,starting with building a formal idea of preference.2.1 Preference and UtilityGame theory is founded on the idea that agents have preferences over the outcomes of the game?the state of the world after the game has been played. Game outcomes are a natural concept withmany examples. Rock-paper-scissors can be seen as having three outcomes: player one might win,player two might win, or they might both tie. Poker can be seen as having many more possibleoutcomes, one for each possible distribution of chips to the various players. Each agent has pref-erences over these outcomes. Using the example of rock-paper-scissors, each agent prefers anyoutcome where they win to one where they lose or tie. Since all agents prefer the outcome wherethey win there is some natural tension.Formalizing this intuition requires some light notation. Let O be a set of outcomes and followsequalpropersubsetO ? O be some relationship over them. The notation o1 followsequal o2 denotes that o1 is weakly preferredto o2. Strict preference is denoted o1 follows o2 and neutrality is denoted o1 similar o2. A game might havesome random component or the agents themselves might do something random, and this leads tothe idea of a lottery over outcomes. A lottery is simply a multinomial probability distributionsover a set of outcomes. We denote a lottery over outcomes as {o1,... ,on} as [p1 : o1,... ,pn : on]where pi is the probability that the outcome of the lottery will be oi.3Chapter 2. Background and Related WorkNot every relationship among the outcomes makes sense as a preference relationship. Wewant preference relationships to capture the intuitive sense of `preferring'. In particular, we wantto rule out relationships that make incomprehensible statements about preference. Consider therelationship `triangleright': o1 triangleright o2 and o2 triangleright o3, but not o1 triangleright o3?while `triangleright' is a perfectly valid relationship, butit is hard to imagine having non-transitive preference over outcomes so `triangleright' does not make sense asa preference relationship.The most accepted account of what special properties make a relationship a preference is due tothe six von Neummann and Morgenstern axioms [52]. These axioms are provided here for flavourand completeness. We will largely take them as writ in this thesis with one exception noted below.I Completeness: Pick two outcomes. Either one is preferred to the other or they are equallypreferred.II Transitivity: If outcome o1 followsequal o2, and o2 followsequal o3, then necessarily o1 followsequal o3.III Substitutability: If o1 similar o2, then[p : o1,p3 : o3,... ,pn : on] similar [p : o2,p3 : o3,... ,pn : on].IV Decomposability: If two lotteries give equal probabilities to all outcomes then they areequally preferred. Essentially there is ?no fun in gambling??a game of craps and a slotmachine are equally preferred if they pay out the same amounts with the same probabilities.V Monotonicity: If o1 followsequal o2 and p > qthen [p : o1,(1 -p) : o2] followsequal [q : o1,(1 -q) : o2].VI Continuity: If o1 follows o2, and o2 follows o3, then there has to be a lottery such that o2 similar [p :o1,(1 -p) : o3]Figure 2.1: von Neummann and Morgenstern's six preference axioms.Of course, some of these axioms could be disputed. For instance one might want to have amodel of preference that does not assert completeness. Such a system would be able to account forsituations where one might not be sure which of two outcomes they like better. This is different thanmerely being indifferent between two outcomes. One might be simultaneously trying to optimizealong multiple dimensions: for example, a hospital administrator wants to save both money andincrease the quality of care but might have trouble spelling off alternatives that trade off one againstthe other2. This idea of incomparability can be captured with a partial ordering over outcomes.As an example, if we define a partial ordering through a partial preference relationship followsequal overRfractur2 where outcomes are real-valued pairs and (a,b) followsequal (aprime,bprime) if and only if a greaterequal aprime and b greaterequal bprime. Withsuch an ordering, (2,4) is preferred to (1,1), but there is no preference relationship between (2,4)2Although, estimates for the Value of Statistical Life?the projected marginal cost of a human life [51]?are basedon the fact that money and care quality do get traded off in practice.4Chapter 2. Background and Related Work(2,2)(1,2) (2,1)(1,1)Figure 2.2: An example of a partial ordering over elements from Rfractur2, where the edges indicate thatthe source node is preferred to the sink node.and (3,0). Partial orderings can be conveniently visualized using a graph, such as in Figure 2.2. Wewill later argue that there are strong reasons for not having a preference between two algorithmseven when we have extensive experimental data for their performance?they might be good indifferent situations.If we have a preference relationship that satisfies the von Neummann and Morgenstern axiomsit can be shown that there exists a utility function u : O mapstoarrowright [0,1] that captures the preferencerelationship exactly [52]. This means that all of the structure of a preference relationship can bedistilled to a mapping of outcomes to real numbers such that o1 followsequal o2 if and only if u(o1) greaterequal u(o2);we can express our preference for any outcome in terms of a one-dimensional number.We will call the value of the utility function for a particular outcome the reward for that out-come. This is for historical reasons and is not a substantive claim about our setting. `Reward'is a more common term for the generic measure of worth in than `utility' in the single-agent re-inforcement learning and artificial intelligence communities, and so we will inherit this piece ofterminology.2.2 One-shot GamesAs mentioned earlier, agents might have different preference orderings over the outcomes of agame, so what happens when agents with different preferences interact? If the agents only interactonce then this conflict can be modeled as a one-shot game. The distinguishing characteristic of theone-shot game is that there is no notion of time or sequence of actions: the agents simultaneouslyact once and only once. Therefore the agents need not think about souring or fostering futurerelationships when acting in the game. They only need to concern themselves with how their5Chapter 2. Background and Related Workaction affect their immediate one-off reward.A one-shot game can be modeled with a 5-tuple: G = angbracketleftN,A,O,o,uangbracketright. Each element of thistuple has a simple interpretation:? N is the set of players.? A = producttextielementN Ai is the set of actions sets, where each action set Ai is set of possible actionthat agent i element N can take. These action sets do not have to be finite or even countable butwe will focus on finite actions sets in this thesis. A particular member a element A is called anaction profile, and is a joint action decision for all players. A useful piece of notation tohave is a-i element A-i which denotes the action choices by the opponents of a particular agent:A-i = producttextjelementN\{i} Aj.? O is the set of outcomes.? o is a function o : A mapstoarrowright O that maps action profiles to outcomes.? u is a vector function that map outcomes of the game to rewards for each of the players i.e.for each agent i element N, they have a utility function ui : O mapstoarrowright Rfractur. In this thesis, we are solelyconcerned with the reward associated with an outcome and so we will abbreviate ui(o(a)) asui(a).If a player decides to play a particular action ai element Ai in G, it is said to be playing a purestrategy. In many settings agents are allowed to randomize over decisions giving rise to mixedstrategies. These strategies are denoted as sigmai element delta(Ai), where delta(Ai) is the |Ai|-dimensionalprobability simplex?sigmai is a multinomial distribution over agent i's action set. Note that theseprobabilities are absolutely not allowed to depend on what the opponent does: the agents unfurltheir fully-specified pure or mixed strategies simultaneously and independently.The probability that any action ai will be played is sigmai(ai). Any action ai that has non-zeroprobability in sigmai is said to be in its support. When playing a mixed strategy in a one-shot game,agents are concerned with maximizing their expected reward: ui(sigma) = summationtextaelementA p(a|sigma)ui(a).If, given the opponents' strategies sigma-i, agent i plays a strategy sigmai such that ui(sigmai,sigma-i) =maxsigmaprimei ui(sigmaprimei,sigma-i), then sigmai is said to be a best response. The best response against a particularset of opponent strategies is not necessarily unique?there may be many strategies that attain themaximum reward.Simple one-shot games are represented in normal-form tables where the rewards are explicitlywritten in |N|-dimensional vectors (one utility entry for each player) in a table with one entry peroutcome. For two-players this is a `matrix' where each element belongs to Rfractur2. Game 2.3 is anexample of a famous game, Prisoner's Dilemma, in normal form.Explicitly writing out all the reward entries is wasteful in games that have highly-structuredpayoffs. While more parsimonious representations exist, the games examined in this thesis arelargely restricted to having two players and typically small. All the games that we look at in thisthese are in normal form, although the discussion of a family of large n-player traffic games lookedat in ? 7 will primarily focus on issues of representation.6Chapter 2. Background and Related WorkC DC 3,3 0,4D 4,0 1,1Figure 2.3: A game of Prisoner's Dilemma in normal-form with canonical payoffs.2.2.1 Solution ConceptsOnce a game has been defined, a natural question is: how it should the agents act? Game theoryis devoted to answering this question in the context of rational agents. A rational agent is an entitythat ruthlessly and perfectly maximizes expected utility given its knowledge of the game and itsopponent. Unfortunately, there is no single `solution' to a game, but rather different families ofsolution concepts which are sets of strategies or strategy profiles that satisfy some game theoreticproperty. The most famous of these is the Nash equilibrium, but we will first look at two othersolution concepts that will be important to later discussions: non-dominated outcomes and maxminstrategies.Non-dominated OutcomesSome actions make never make any sense to play. An important class of these are the dominatedactions. An action is (strictly) dominated if there is another pure or mixed strategy that yieldshigher reward regardless of what the opponents do. This is formally captured asai Dominates aprimei equivalence universalsigma-i,ui(ai,sigma-i) greaterequal ui(aprimei,sigma-i). (2.1)No rational agent ever plays a dominated action and we can discard any outcome that result fromdominated action profiles. The idea of weak domination is similar, but the agent can be neutral be-tween the dominating and dominated strategy for some (but not all) profiles of opponent strategies.For example, in Prisoner's Dilemma, the row player's action C is strictly dominated: if thecolumn player plays C, row is better off playing D than C, and if column plays D row would stillbe better playing D. There is no reason that row should ever play C and so C can be discarded asa possible action.While strict domination is easy to spot in a two-player two-action game, it generally has to befound with a series of linear feasibility programs?one for every action by every player. For anyagent and action, we can use a feasibility program to look for some mixed strategy over the otheractions that has higher expected utility for all of the opponent's actions. If there is a pure strategylike this, then the corresponding action can be thrown out.In a two-player game if we assume that the opponent believes that the protagonist agent isrational?or reward maximizing?then the opponent must also believe that the protagonist willnever play any strategy that has a strictly dominated action in its support. If we furthermore assumethat the opponent is, itself, rational, then it will never play any action that is strictly dominated withrespect to the protagonist's non-dominated actions. This i n turn means that the protagonist will7Chapter 2. Background and Related WorkA BA 2,1 0,0B 0,0 1,2Figure 2.4: A game of Battle of the Sexes in normal-form with canonical payoffs.never play any strategy that is strictly dominated with respect to the opponent's surviving actionsand so forth. This process of removing actions is called iterated domination removal (IDR) and isrepeated until no more actions can be removed for either agent. If an action has survived IDR it hasattained a minimal certificate of sensibility: there is at least some belief about what the opponentwill do for which this strategy is a best response3.Maxmin StrategiesIf an agent has no idea what the opponents will do it can still guarantee some level of reward. Oneway of thinking about this problem is in terms of playing against an adversary: no matter whatstrategy an agent picks, the opponent just-so-happens to be playing the action that minimizes theprotagonist's reward. The highest reward that an agent can get in such a situation is called themaxmin value or the security value, which is formally stated asmaxmini(G) = maxsigmaielementproducttext(Ai)mina-ielementA-iui(sigmai,a-i). (2.2)Any strategy that achieves the maxmin value is called a maxmin strategy. While for somegames the maxmin value is a pessimistic lower-bound, in other games agents are actually playingagainst opponents that are trying to minimize their reward and the maxmin strategy is well moti-vated. Indeed, any game with two agents competing for a common finite resource has this feature:if there is only one cake, any cake that you have is a slice that I do not have. In two-player constantsum games, the strategy profile where both players adopt a maxmin strategy is a member of thenext solution concept that we will discuss?the Nash equilibrium.Nash EquilibriaA Nash equilibrium is a strategy profile where all agents are playing a mutually optimal strategy?everyone is best responding to everyone else. Formally, the set of Nash equilibria isNE(G) = braceleftbigsigma element delta(A) | universali element N,universalsigmaprimei element delta(Ai),ui(sigma) greaterequal ui(sigmaprimei,sigma-i)bracerightbig . (2.3)This set is always non-empty [35], but is not necessarily a singleton. For example, the Battle ofthe Sexes (Game 2.4) has three Nash equilibria: both agents playing A, both playing B and somemixture between the two.3This is not necessarily true when the number of players is greater than two.8Chapter 2. Background and Related WorkThe Nash equilibrium is a positive concept and not a predictive or normative one. If there areseveral Nash equilibria there is no prescription for picking which one an agent should or will play.There has been a lot of work done on restricting the set of Nash equilibria to sets of equilibriathat seem `natural' or predictive. Much of the existing work in repeated game dynamics can beseen as describing restricted classes of Nash equilibria in terms of simple adaptation rules?see,for example, Kalai and Lehrer [25], Kalai and Lehrer [26], Hart and Mas-Colell [21] and Hart andMansour [20].One of Nash equilibrium restrictions that we look at in this thesis is the set of Pareto-optimalNash equilibria. Pareto-domination is a partial ordering over outcomes, where one outcome Pareto-dominates another if all the agents get weakly higher reward in the former outcome than the latter:aPareto Dominates aprime equivalence universali element N,ui(a) greaterequal ui(aprime). (2.4)Pareto-optimality refers to any outcome that is not Pareto-dominated by any other outcome: inevery other outcome at least one of the agents is worse off. Indeed, the partial ordering example inFigure 2.2 can be seen as an example of Pareto-domination for two agents.A Pareto-optimal Nash equilibrium is an equilibrium that is Pareto-optimal when outcomes arerestricted to the Nash equilibria. In a Pareto-dominated equilibrium, all agents would do better ifeveryone switched to another equilibrium. This suggests that Pareto-optimal Nash equilibria shouldbe a particularly stable appealing set of Nash equilibria for the agents. Note that a Pareto-optimalNash equilibrium is not necessarily a Pareto-optimal outcome: the difference can be seen in Pris-oner's Dilemma. (C,C) is a Pareto-optimal outcome, but (D,D) is the unique Nash equilibriumand so is also the Pareto-optimal equilibrium.2.3 Repeated GamesOne-shot games are the foundation of repeated games, in which two or more agents repeatedly (foreither a finite or infinite number of iterations) play a one-shot stage-game. Unlike the one-shotgame where the agents play the game once and then never interact again, the history of play in arepeated game is kept and agents can condition their strategies on it. This suggests that agents needto worry about how their current choice of action will effect future reward.Tit-for-Tat (TfT) is an example of a strategy in a repeated game and is perhaps the most famousrepeated-game strategy for Prisoner's Dilemma (Game 2.3). TfT begins by cooperating and thenplays whatever the opponent did the past iteration. Therefore, if the opponent is obliging andcooperates TfT will continue to cooperate. If ever the opponent defects TfT will defect the nextround, but will start cooperating whenever the opponent starts feeling cooperative again. TfT is arelatively simple strategy, but one could imagine constructing more and more elaborate strategieswith trigger conditions and complicated modes of behaving.Formally, let us denote the action played in iteration t by agent i as a(t)i , and the reward thati receives is r(t)i . For the model of repeated games that we will use, agents are allowed to submit9Chapter 2. Background and Related Workmixed strategies, denoted sigma(t)i , but agents do not receive the expected utility. Instead, an action issampled from the mixed strategy, and payoff is calculated using the sampled action.Agents in finitely repeated games are interested in maximizing their average reward, and agentsin an infinitely repeated game are interested in maximizing their limit average reward:?i = limTarrowrightinfinitybracketleftBiggsummationtextTt=1 r(t)iTbracketrightBigg. (2.5)In this thesis, we will exclusively focus on simulating infinitely repeated games in a finite numberof iterations: agent do not believe that the game will ever end even though it does after a set numberof iterations.One interpretation of these repeated-game strategies is that they are strategies in a one-shotsupergame where instead of making a decision each iteration conditioned on the past history,agents make a single policy decision. In this interpretation, TfT is a single action in the Prisoner'sDilemma supergame. There are many more policies than there are actions in the stage-game: if allplayers have n actions and the game is repeated for T, then there are O(nT2) possible pure-actionpolicies (policies that only ever make pure-strategy decisions). While the supergame is clearly nota compact way of representing a repeated game, it is an intuitive way of seeing that finitely repeatedgames must have Nash equilibria.Infinitely repeated games also necessarily have equilibria and, furthermore, each infinitely re-peated game has infinitely many equilibria. We cannot explicitly characterize every repeated gameequilibria but we can instead say something about the the payoff profiles that these equilibriaachieve. If all agents have an average reward above their respective security values (recall from? 2.2.1) then there exists some Nash equilibrium that attains the same profile of payoffs. This is thecelebrate Folk Theorem4. Any payoff profile where all agents are attaining reward higher that theirsecurity values is said to be enforceable.To gain some insight about the Folk Theorem, let us again examine the game of Prisoner'sDilemma and construct a repeated game Nash equilibrium where both players repeatedly play C.The security value for this game is 1 (if an agent always plays D, then it gets a utility of at least 1regardless of what the opponent does), and so (C,C) is clearly enforceable. Therefore, (3,3) is apayoff profile of the repeated game Nash equilibrium. What are the equilibrium strategies? Thereare a number of strategy profiles that can attain this outcome but one of the simplest profiles isGrim Trigger in self play.Grim Trigger plays C until the opponent plays D and then plays D forever: upon the firstsign of deviation from the (C,C) outcome the Grim Trigger strategy starts a merciless and unend-ing program of punishment. Grim Trigger in self play is an equilibrium because if either playertries to deviate to another strategy that ever plays D against Grim Trigger, the deviator will beworse off: the deviant strategy strategy can, at best, attain an average reward of 1. If either playerswitches to a strategy that never plays D against Grim Trigger, then this strategy differs only in the4There are actually several Folk Theorems, but we will only look at one for average rewards.10Chapter 2. Background and Related Work100 99 ... 1100 100,100 99 -delta,99 + delta 1 -delta,1 + delta99 99 + delta,99 -delta 99,99 1 -delta,1 + delta... ...1 1 + delta,1 -delta 1 + delta,1 -delta 1,1Figure 2.5: A game of Traveler's Dilemma with 100 actions.`off-equilibrium' punishment details and receives the sam e average reward as Grim Trigger?thedeviant strategy has not improved reward.The Folk Theorem does not claim that if two agents are achieving an enforceable payoff profilethen the agents are playing a Nash equilibrium. The idea of equilibrium is tied deeply to the threatof punishment. Grim Trigger in self play is an equilibrium while two strategies that blindly play Cis not, even if their behaviour looks the same to an outside observer. In the latter case if either agentswitched to the simple strategy of blindly playing D, than their average reward would be higher.While we might have been unsatisfied by the potential multitude of Nash equilibria in the one-shot game, the predictive power of the Folk Theorem is even less: there are games where nearlyevery outcome arises under a Nash equilibrium. For instance, consider an extension of Prisoner'sDilemma, the Traveler's Dilemma (Game 2.3; Prisoner's Dilemma is the special case with only2 actions?apparently travelers have more options available to them than prisoners). The securityvalue is 1 for both players, and so all outcomes?except for outcomes of the form (1 -delta,1 + delta) or(1 + delta,1 -delta)?are potentially the result of a Nash equilibrium.Again, repeated game Nash equilibria are positive statements and not normative, but in manycases we want a normative claim: how should we behave in a repeated game? How should wego about selecting a particular strategy for a multiagent system? This thesis is largely devoted toevaluating one approach that computer science has suggested for this problem: multiagent learningalgorithms.2.4 Multiagent Learning AlgorithmsMAL algorithms have been studied for a long time (57 years at the time of writing) and manydifferent algorithms exist. Not only is there a profusion of algorithms but there are also several dif-ferent settings for multiagent learning. Does an algorithm know the game's reward functions beforethe game starts? Some authors assume yes, while others assume that these reward functions needto be learned. There are other questions. What signals of the opponent actions can an algorithmobserve? Are stage-game Nash equilibria and other computationally-expensive game propertiesassumed to be computable? Each of these assumptions changes the learning problem. A settingwhere the rewards are known a priori is fundamentally different than a setting where the rewardsare not known a priori and algorithms have no ability to observe the opponents' rewards.The algorithms that we describe in this section were designed with a variety of different goals11Chapter 2. Background and Related Workin mind and this reflects a general disagreement over what these MAL algorithms should be tryingto do. Should they be trying to converge to a stage-game Nash equilibrium? Should they tryto avoid being exploited by other algorithms? Or are they trying to maximize their sequence ofreward? There is no single answer (but these issues are examined at length in Shoham et al. [46]and Sandholm [44]).Each of these goals poses different empirical questions. For instance, if we are primarily inter-ested the Bowling and Veloso [10] criterion?all algorithms should converge to a stationary strat-egy and if the opponent converges to a stationary policy all algorithms should converge to a bestresponse?one should analyze experiments using performance metrics that are sensitive to strategicstationarity and to the difference between the current strategy and the best response strategy.In this dissertation we focus on two-player repeated games with many (i.e. more than two)actions per player. Other learning settings have been investigated. Some of these settings arefurther restrictions that insist, for example, on two-action games [47] or constant-sum games [30].Other work looks at learning in generalizations of two-player repeated games: stochastic games orN-player games [53]. There are also MAL experiments that have been conducted in settings thatare neither generalizations nor restrictions, such as the population-based work by Axelrod [3] andAiriau et al. [2]. Of these games, the repeated two-player game setting is the best studied and thereare many recent algorithms designed for such games.In the remainder of this section we will discuss a selection of algorithms intended for two-player repeated games and look at some previous MAL experiments. We do not mean to give anexhaustive survey of this literature but we do want to build intuition about this set of algorithms,look at the assumptions that they make and indicate some of the relationships between them.2.4.1 Fictitious PlayFictitious play [11] is probably the earliest example of a learning algorithm for two-playergames repeated games. Essentially, fictitious play assumes that the opponent is playing anunknown and potentially mixed stationary strategy, and tries to estimate this strategy from the op-ponent's empirical distribution of actions?the frequency counts for each of its actions normalizedto be probabilities. Clearly, in order to collect the frequency counts fictitious play mustbe able to observe the opponent's actions. The algorithm then, at each iteration, best responds tothis estimated strategy. Because fictitious play needs to calculate a best response, it alsoassumes complete knowledge of its own payoffs.Fictitious play is guaranteed to converge to a Nash equilibrium in self play for a restricted setof games. These games are said to have the fictitious play property (see, for instance Mondererand Shapley [34]; for an example of a simple 2 ? 2 game without this property see Monderer andSela [33]). Fictitious play will also eventually best respond to any stationary strategy. Thisalgorithm's general structure has been extended in a number of ways, including smooth fictitiousplay [17], and we will see later that fictitious play provides the foundation for AWESOMEand meta, two more modern algorithms. These algorithms are described later in Section 2.4.3.Fictitious play is known to have miscoordination issues, particularly in self play. For12Chapter 2. Background and Related Workexample, consider the Sidewalk Game (Game 2.6), where two identical fictitious playagents are faced with the issue of trying to get by each other on the sidewalk by either passingto the West or the East. Since both algorithms are identical and deterministic, these algorithmswill cycle between (W,W) and (E,E). There are some clever measures that can be taken toavoid some of these kinds of problems (for instances, special best response tie-breaking rules andrandomization), but miscoordination is a general issues with the fictitious play approach.W EW -1,-1 1,1E 1,1 -1,-1Figure 2.6: A Sidewalk or Dispersion game, where two agents try to miscoordinate where theystep.2.4.2 DeterminedDetermined or `bully' (see, for example, Powers and Shoham [38]) is an algorithm that solvesthe multiagent learning problem by ignoring it. MAL algorithms typically change their behaviourby adapting to signals about the game. However determined, as its name suggests, stubbornlydoes not change its behaviour and relies on other algorithms adapting their strategies to it.Determined enumerates the stage-game Nash equilibria and selects the one that maximizesits personal reward at equilibrium. Certainly, determined is not a final solution to the MALproblem: for instance, two determined agents will stubbornly play different equilibria (unlessthere is a an equilibrium that is best for both agents), possibly leading to a situation where bothalgorithms receive sub-equilibrium reward. Additionally, enumerating all the Nash equilibria notonly requires complete knowledge of every agents' reward functions, but also is a costly compu-tational activity that is infeasible on anything but the smallest stage games. With that said, it iscertainly an interesting learning approach to test and compare. Slight variations of determinedare, like fictitious play, at the heart of meta and AWESOME.Using a stage-game Nash equilibrium is only one way of being stubborn and getting an op-ponent to adapt. One could also imagine aiming for convergence to other outcomes: for instancelooking for the outcome with the highest reward given that the opponent is best responding. Notethat this differs from a stage-game Nash equilibrium because the determined algorithm doesnot have to be best responding itself. This amounts to an equilibrium of the Stackelberg version ofthe game: imagine the same game, but instead of moving simultaneously, the determined agentmoves first. Clearly, a sensible opponent will best-respond to whatever determined does, andso determined should pick the action that gives maximum reward given that the opponent willbest respond.As an example of a Stackelberg outcome: in Game 2.7 the unique Nash equilibrium is (B,R).Indeed, this is the only outcome that survives IDR. However, there is something very appealing13Chapter 2. Background and Related Workabout the Pareto-optimal outcome at (T,L): if the row player can teach the column player that itwill, in fact, play T then the row player will be much better off.L RT 1 -epsilon1,1 0,0B 1,0 epsilon1,1Figure 2.7: A game showing a situation where a determined-style algorithm might be better offnot best responding to its opponent.2.4.3 Targeted AlgorithmsWe will next focus on a class of algorithms called the targeted algorithms. Targeted algorithmsfocus on playing against a particular class of opponents. For example, AWESOME [13] guaran-tees convergence when playing itself or any stationary opponent. Both these algorithms are basedaround identifying what the opponent is doing, with particular attention paid to stationarity andNash equilibrium, and then changing their behaviour based on this assessment.Meta [38] switches between three simpler strategies: a strategy similar to fictitiousplay (there are some small differences in how best responses are calculated), a determin-ed-style algorithm that stubbornly plays a Nash equilibrium, and the maxmin strategy. Averagereward and empirical distributions of the opponents' actions are recorded for different periods ofplay. Based on these histories one of the three algorithms is selected. Meta was theoretically andempirically shown to be nearly optimal against itself, close to the best response against stationaryagents, and to approach (or exceed) the security level of the game in all cases.AWESOMEalso tracks the opponent's behaviour in different periods of play and tries to maintainhypotheses about their play. For example, it attempts to determine whether the other algorithmsare playing a special stage-game Nash equilibrium. If they are, AWESOME responds with its owncomponent of that special equilibrium. This special equilibrium is known in advance by all imple-mentations of AWESOME to avoid equilibrium selection issues in self play. There are other situa-tions where it acts in a similar fashion to fictitious play, and there are still other discretemodes of play that it engages in depending on what hypotheses it believes.Because both of these algorithms switch between using simpler strategies depending on thesituation, these algorithms can be viewed as portfolio algorithms. Here, they both manage similarportfolios that include a determined-style algorithm and a fictitious play algorithm.2.4.4 Q-learning AlgorithmsA broad family of MAL algorithms are based on Q-learning [55]: an algorithm for finding theoptimal policy in Markov Decision Processes (MDPs; can be thought of as single-agent stochasticgames). This family of MAL algorithms does not explicitly model the opponent's strategy choices.14Chapter 2. Background and Related WorkThey instead settle for learning the expected discounted reward for taking an action and then fol-lowing some set policy: the Q-function. In order to learn the Q-function, algorithms typically takerandom exploratory steps with a small (possibly decaying) probability.Each algorithm in this family has a different way of selecting its strategy based on this Q-function. For instance, one could try a straight forward adaptation of the single-agent Q-learn-ing to the multiagent setting by ignoring the impact that the opponent's action makes on theprotagonist's payoffs. The algorithm simply updates its reward function whenever a new rewardobservation is made, where the new estimate is a convex combination of the old estimate and thenew information:Q(ai) = (1 -alphat)Q(ai) + alphatbracketleftBigr + gamma maxa Q(a)bracketrightBig. (2.6)This algorithm essentially considers the opponent's behaviour to be an unremarkable part of anoisy and non-stationary environment. The non-stationarity of the environment makes learningdifficult but this idea is not entirely without merit: Q-learning has been shown to work in othernon-stationary environments (see, for instance, Sutton and Barto [49]).Minimax-Q[30] is one of the first explicitly multiagent applications of this idea. The Q-function that it learns is based on the action profile and not just the protagonist action: it learnsQ(ai,a-i). Minimax-Q uses the mixed maxmin strategy calculated from the Q-function as itsstrategy:Q(ai,a-i) = (1 -alphat)Q(ai,a-i)+alphatbracketleftBiggr + gamma maxsigmaielementproducttext(Ai)bracketleftBiggmina-ielementA-isummationdisplayaisigmai(ai)Q(ai,a-i)bracketrightBiggbracketrightBigg. (2.7)It should be noted that since its maxmin strategies are calculated from learned Q-values, theymay not be the game's actual maxmin strategies and thus fail to attain the security value. LikeQ-learning, minimax-Q also takes the occasional exploration step.There are further modifications to this general scheme. NashQ [24] learns Q-functions for itand its opponents and plays a stage-game Nash equilibrium strategy for the game induced by theseQ-values. Correlated-Q [19] does something similar except that it chooses from the set ofcorrelated equilibria using a variety of different selection methods. Both of these algorithms assumethat they are able to observe not only the opponents' actions but also their rewards, and additionallythat they have the computational wherewithal to compute the necessary solution concept.2.4.5 Gradient AlgorithmsGradient ascent algorithms, such as GIGA-WoLF [7] and RVsigma(t) [5], maintain a mixed strategythat is updated in the direction of the payoff gradient. The specific details of this updating processdepend on the individual algorithms, but the common feature is that they increase the probabilityof actions with high reward and decrease the probability of unpromising actions. This familyof algorithms is similar to Q-learning because they do not explicitly model their opponent'sstrategies and instead treat them as part of a non-stationarity environment.15Chapter 2. Background and Related WorkGIGA-WoLF is the latest algorithm in the line of gradient learners that started with IGA [47].GIGA-WoLF uses an adaptive step length that makes it more or less aggressive about changingits strategy. It compares its strategy to a baseline strategy and makes the update larger if it isperforming worse than the baseline. GIGA-WoLFguarantees non-positive regret in the limit (regretis discussed in greater detail in ? 5.2) and strategic convergence to a Nash equilibrium when playingagainst GIGA [57] in two-player two-action games.There are two versions of GIGA-WoLF. The first version assumes prior knowledge of personalreward and the ability to observe the opponent's action?thi s is the version used in the proofsfor GIGA-WoLF's no-regret and convergence guarantees. There is also a second version?onwhich all the experiments were based?that makes limited assumptions about payoff knowledgeand computational power. Instead, like Q-learning, it merely assumes that it is able to observeits own reward.RVsigma(t) [5] belongs to a second line of gradient algorithms initiated by ReDVaLeR [4]. Thisalgorithm also uses an adaptive step size when following the payoff gradient, like GIGA-WoLF,but this is done on a action-by-action basis. This means that, unlike GIGA-WoLF, RVsigma(t) can beaggressive in updating some actions while being cautious about updating others, and it does thisby comparing its reward to the reward at a Nash equilibrium. Therefore, RVsigma(t) requires completeinformation about the game and sufficient computational power to discover at least one stage-gameNash equilibrium. RVsigma(t) also guarantees no-regret in the limit and additionally provides someconvergence results for self play for a restricted class of games.GIGA-WoLF and RVsigma(t) differ in the way that they ensure that their updated strategies arestill probabilities. GIGA-WoLF retracts: it maps an unconstrained vector to the vector on theprobability simplex that is closest in lscript2 distance. This approach has a tenancy to map vectors tothe extreme points of the simplex, reducing some action probabilities to zero. RVsigma(t) normalizes,which is less prone to removing actions from its support. This difference may explain some of theexperimental results later on.2.4.6 Previous Experimental ResultsSetting up a general-sum repeated two-player game experiment requires a number of design choices.Say that one has an algorithms to be evaluated in terms of a particular performance metric: whatset of games should these algorithms be run on? What other algorithms should this performancebe compared to? If one is dealing with randomized algorithms (which includes any algorithm thatis able to submit a mixed strategy), how many different runs should be simulated? For a partic-ular game, how many iterations should a simulation be run for? As can be seen in Table 2.4.6,existing literature varies in all of these dimensions. Additionally, some papers do not even discussparameters used which makes it difficult to reproduce experiments.Overall, most of the tests performed in these papers considered few algorithms. In most ofthese experiments, the newly proposed algorithms were only evaluated by playing against one ortwo opponents. Some papers?like Littman [30] and Greenwald and Hall [19]?seemed to usemany algorithms, but in fact these algorithms were quite similar to each other and varied only in16Chapter 2. Background and Related WorkPaper Algorithms Distributions Instances Runs IterationsLittman [30] 6 1 1 ? ?Claus and Boutilier [12] 2 3 1 - 100 ? 50-2500Greenwald and Hall [19] 7 5 1 2500 - 3333 1 ?105Bowling [8] 2 6 1 ? 1 ?106Nudelman et al. [36] 3 13 100 10 1 ?105Powers and Shoham [38] 11 21 ? ? 2 ?105Banerjee and Peng [5] 2 1 1 1 16000Conitzer and Sandholm [13] 3 2 1 1 2500Table 2.1: This table shows a summary of the experimental setup for a selection of papers. Thesummary includes the number of algorithms, the number of game distributions, the number of gameinstances drawn from these distributions, the number of runs or trials for each instance, and thenumber of iterations that the simulations were run for. In some cases, the setup was unclear, indi-cated with a `?'. In many cases, fewer than [Algorithms ?Distributions ?Instances ?Runs]runs were simulated, due to some sparsity in the experimental structures.some small details. For example, in Littman [30] two versions of minimax-Q and two versions ofQ-learning were tested and each version varied only by its training regime. In Greenwald andHall [19], four versions of Correlated-Q were tested against Q-learning and Friend-Qand Foe-Q(the last two are from Littman [29]). Powers and Shoham [38] implemented the greatestvariety of opposing algorithms out of these experiments. While four of the eleven tested were sim-ple stationary strategy baselines, the remaining seven were MAL algorithms including Hyper-Q[50], WoLF-PHC [10], and a joint action learner [12].Experiments have also tended to involve small numbers of games instances, and these instanceshave tended to have been drawn from an even smaller number of game distributions. For example,Banerjee and Peng [5] used only a single 3?3 action ?simple coordination game? and Littman [30]probed algorithm behaviour with a single grid-world version of soccer. For earlier papers, this par-tially reflected the difficulty of creating a large number of different game instances for use in tests.However with the creation of GAMUT [36], a suite of game generators, generating large game setsis now easy and involves little investment in time. Indeed, Nudelman et al. [36] performed a largeMAL experiment using three MAL algorithms (minimax-Q, WoLF [9], and Q-learning) on1300 game instances drawn from thirteen distributions. Some recent papers have taken full ad-vantage of the potential of GAMUT, such as Powers and Shoham [38], but adoption has not beenuniversal.Experiments have also differed substantially in the number of iterations considered rangingfrom 50 [12] to 1 ? 106 [8]. Iterations in a repeated game are usually divided into ?settling in?(also calleda ?burn-in? period) and ?recording? phases , allowing the algorithms time to settle oradapt before results are recorded. Powers and Shoham [38] recorded the final 20 000 out of 200 000iterations and Nudelman et al. [36] used the final 10 000 iterations out of 100 000.17Chapter 3PlatformUnfortunately, empirical experiments have largely been run with one-off code tailored to showing aparticular feature of an algorithm. This has a number of negative consequences. First, it decreasesthe reproducibility of experiments by, for instance, obscuring the details of algorithm implementa-tion. Even when source code for the original experiment is available, it might be difficult to extendto new experimental settings; having to recode apparatus reduces the flexibility of experimental de-sign. If one experiment hints at an unexpected result it is more difficult to flesh out this behaviourwith a new experiment if it involves recoding the platform. Finally, rewriting the same code againand again wastes time that could be spent running more comprehensive experiments.3.1 The Platform ArchitectureIn this section, we describe an open and reusable platform that we call MALT 2.0 (MultiagentLearning Testbed) for running two-player, general-sum, repeated-game MAL experiments. Basicvisualization and analysis features are also included in this platform. This is the second version ofthe platform (the original version is described in Lipson [28]) and this new version is a completerecoding of the platform5.What we intend with MALT is not a finished product, but a growing repository of tools, algo-rithms and experimental settings (such as N-player repeated games or stochastic games). Essen-tially we want this version of MALT to be a base upon which other researchers can add and sharetools.3.1.1 DefinitionsIn order to clarify our discussion of running a experiment on a particular game with a particularset of algorithms, it is useful to define some terms. An ordered pair of two algorithms is a pairing.This pair is ordered because many two-player games are asymmetric: the payoff-structure for therow player is different than the payoff structure for the column player. The case where an algorithmis paired with a copy of itself (but with different internal states and independent random seeds) isself play.5In this version, we recoded each one of the algorithms carefully from the original pseudo-code, completely re-designed the repeated game simulator, and created an entirely new visualization interface. In short, none of the originalsource code remains.18Chapter 3. PlatformWe largely concentrate on drawing games from distributions called game generators. A par-ticular sample from a game generator is a game instance. Prisoner's Dilemma is a game generatorand an example game instance is a particular set of payoffs that obey the Prisoner's Dilemma pref-erence ordering. However, Prisoner's Dilemma is a simple example and not all game instancesfrom the same generator are as closely tied.A pairing and a game instance, taken together, are called a match. A match with one of thealgorithms in the pairing left unspecified is a partially specified match (PSM). If two algorithmsplay the same PSM, we will conclude that any differences between their performances are due tothe algorithms themselves (including any internal randomization) because everything else was heldconstant between the two matches.A particular simulation of a match is called a run or trial. For deterministic algorithms, a sin-gle run is sufficient to understand the performance of that match, but for randomized algorithms(including any algorithm that plays a mixed strategy) multiple runs may each display different be-haviour. In such cases, the solution quality distribution (SQD)6?the empirical distribution of aperformance metric?for the match should be compared. Each run consists of a number of itera-tions. During an iteration, the algorithms submit their actions to a stage game and receive somefeedback?such as observing their reward or what action the opponent played. Algorithms areallowed to submit a mixed strategy in which case a single action is sampled from the mixing distri-bution by the game. The iterations are separated into settling-in iterations and recorded iterations.3.1.2 Platform StructureIn this section we give an overview of the structure of the platform. The five steps in running anexperiment with the platform are summarized in Figure 3.1. There are three major componentsto this platform: the configuration GUI, the actual experiment engine (the piece that simulates therepeated games) and the visualization GUI.The first step to running an experiment is to specify its parameters. There are three parts to thisand the configuration GUI guides the process. The first step is to pick a group of algorithms and settheir parameters. The second step is to select the GAMUT game distributions used and choose theparameters for these games. The third step is to establish general experimental parameters, such asthe number of iterations for each simulation.We have tried to make it as easy as possible to add new algorithms to MALT. Adding a newalgorithm to the GUI is as simple as providing a text file with a list of parameters. Adding thealgorithm to the actual engine requires minimal additional coding beyond the implementation ofthe algorithm.The performance of many algorithms are likely very dependent on these parameters, howeverit is out of the scope of this thesis to conduct a sensitivity analysis or to tune these parameters.Along with each algorithm, we have provided some default parameters for these implementations6We call these distribution ?solution quality distributions? despite the fact that in MAL there is no clear idea of a`solution' to a game. These distributions could be more meaningfully called `metric distributions', however, `SQD' isthe terminology traditionally used in empirical algorithmics and so we adopt this language.19Chapter 3. PlatformAgentsGamesSettingsComputeNEMaxminGenerate JobsSimulate Data Compute Metric VisualizeAnalyzeSet Up1. Set Up Experiment 2. Generate Jobs3. Run Jobs 4. Calculate Metrics 5. Interpret ResultsFigure 3.1: The five steps in running an analyzing an experiment using MALT.20Chapter 3. Platformthat were taken from the original papers or were set to values that seemed reasonable from thedescription of the algorithm when this was not possible. These parameter settings can be easilychanged using the GUI.Once the experimental setup has been finalized, text files are generated for the agent configu-rations and the game instances. These files are easy to edit and they are also easy to generate usingscripts that bypass the GUI. MALT uses GAMBIT's [32] implementation of Lemke-Howson [27]to find the set of Nash equilibria for each game instance and an internal linear program finds themaxmin strategies (however, MALT requires CPLEX to solve this and other linear programs). Ajob file is generated for each match. Each job file references the agent, game, equilibrium, andmaxmin-strategy files. These files are referenced and this makes altering the job files simple evenafter the job files have been generated.This set of job files may be run a number of ways. The most basic is to run them in a batchjob, however for large experiments this may be prohibitively expensive. Because each job is in-dependent, a cluster may be used. Each job creates an individual data file upon completion thatrecords the history of play, so they may be run in any order. For each recorded iteration and for eachagent in the pair, the strategy, sampled action, reward received, and beliefs about the opponents arerecorded.After the data files have been generated the performance metrics are calculated. A plain-textfile describes the metrics to be calculated: e.g. if we are looking at some kind of convergence,we might want to specify that the lscriptinfinity sense of distance should be used. Calculating the metricscan proceed in serial or it can be run on a cluster. MALT includes basic tools for analyzing andvisualizing these results, and there is a visualization GUI that guides the use of these tools.3.1.3 Algorithm ImplementationsTo carry out this study, we selected and implemented eleven MAL algorithms. A brief descriptionof each is useful for intuition.Fictitious playIn our implementation of fictitious play, the initial action frequencies are set to one foreach action, which is a uniform and easily overwhelmed prior. Tie-breaking (selecting amongmembers of the best-response set) favours the previous action to encourage stability. For instance,if fictitious play plays ai in iteration t and at iteration t+1 the best response set include ai,the algorithm will chose ai. If the best response set at t + 1 does not include ai, then the algorithmuniformly mixes between best responses.21Chapter 3. PlatformDesign Decision SettingBR Tie-Breaking Previous action if still BRUniform otherwiseInitial Beliefs Unit virtual action countTable 3.1: Design decisions for fictitious playDeterminedOur implementation of determined repeatedly plays the Nash equilibrium that obtains the high-est personal reward, but if there are multiple equilibria with the same protagonist reward, then theequilibrium with the highest opponent reward is selected. If there are any equilibria that are stilltied we use the one found first by GAMBIT's implementation of L emke-Howson.Design Decision SettingNE Tie-Breaking Highest opponent utilityTable 3.2: Design decisions for determinedAWESOMEAWESOME is implemented according to the pseduo-code in Conitzer and Sandholm [13]. We usethe parameter settings suggested in Conitzer and Sandholm [13] as its default. For picking the`special' equilibrium we use the first equilibrium found by G AMBIT's implementation of Lemke-Howson. It would be interesting to compare our implementation of AWESOME to one that used themore computationally expensive approach of picking, say, a socially optimal equilibrium.There is a a small performance difference between our implementation of AWESOME and theoriginal implementation from Conitzer and Sandholm [13]7. A small test?involving ten differentgame instances from a variety of generators and 100 runs against the random agent?showed thaton three instances there was a significant difference between their solution quality distributions. Atwo-sample Kolmogorov-Smirnov independence test (see ? 4.3.1) with alpha = 0.05 was used to checkfor significance. For these three game instances, our implementation probabilistically dominated(see ? 4.4) the original implementation in terms of reward (every reward quantile was higher for ourimplementation). We were not able to track down the source of this behaviour difference; howeverwe spend a considerable amount of time verifying our implementation against the pseudocode inthe paper, and we are convinced that it is correct.7The original implementation was in C and MALT 2.0 is written in Java, so the original implementation could not beused directly.22Chapter 3. PlatformDesign Decision SettingSpecial Equilibrium(piasteriskmathp) First foundEpoch period (N(t))ceilingleftBigg|A|?parenleftBig1- 12t-2parenrightBig(epsilon1te)2ceilingrightBiggEquilibrium threshold (epsilon1e(t)) 1t+2Stationarity threshold (epsilon1s(t)) 1t+1Table 3.3: Design decisions for metametaMeta is implemented according to the pseduo-code in Powers and Shoham [38]. The Powers andShoham [38] implementation of meta used a distance measure based on the Hoeffding Inequality,even though the pseudo-code called for using an lscript2 norm. We follow the pseudo-code and use thelscript2 norm. We do not adjust the default threshold level (epsilon13) for distance and left it at the originalvalue.Design Decision SettingSecurity threshold (epsilon10) 0.01Bully threshold (epsilon11) 0.01?Generous? BR parameter (epsilon12) 0.005Stationarity threshold (epsilon13) 0.025Coordination/exploration period (tau0) 90 000Initial period (tau1) 10 000Secondary period (tau2) 80 000Security check period (tau3) 1 000Switching probability (p) 0.00005Window (H) 1 000bardbl?bardbl lscript2Table 3.4: Design decisions for metaGradient AlgorithmsOur implementation of GIGA-WoLF follows the original pseudo-code and uses the learning rateand step size schedules from the original experiments as defaults. These step sizes, however, wereset for drawing smooth trajectories and not necessarily for performance. Additionally, the originalexperiments for GIGA-WoLF involved more iterations than we simulated: we used 105 iterationsin our experiments as opposed to 106 in Bowling [7]. It is possible that a more aggressive set ofparameters (e.g. larger etat) might improve some facets of performance. We, however, stuck with theoriginal parameter settings for our implementation and defer parameter tuning questions to futurework.23Chapter 3. PlatformFor GIGA-WoLF's retraction map operation (the function that maps an arbitrary vector in Rfracturnto the closest probability vector in terms of lscript2 distance) we used an algorithm based on the methoddescribed in Govindan and Wilson [18]. GIGA-WoLF has two variants: in one it assumes that itcan counterfactually determine the reward for playing an arbitrary action in the previous iteration,and in the other it only knows the reward for the the action that it played and has to approximatethe rewards for the other actions. The formula for this approximation is given byuniversal? element Ai ?(t+1)?a = (1 -alpha)r(t)I?a=a(t) + alpha(?(t)?a ). (3.1)In this equation, r(t) is the reward that the algorithm experienced while playing action a(t) initeration t. The vector ?(t) is an |Ai|-dimensional vector that reflects the algorithm's beliefs aboutrewards. We implemented the latter approach, as all of GIGA-WoLF's experimental results areproduced by this version.Design Decision SettingLearning rate (alpha(t)) 1radical t10 +100Step size (eta(t)) 1radical104t+108Table 3.5: Design decisions for GIGA-WoLF.We also tested an algorithm, GSA (Global Stochastic Approximation Spall [48]; to our knowl-edge this was first suggested for use in a MAL setting by Lipson [28]), which is a stochasticoptimization method that resembles GIGA, but takes a noisy, rather than deterministic, step. TheGSA strategy is updated according to Equation 3.2. Inx(t+1) = P(x(t) + eta(t)r(t) + lambda(t)zeta(t)), (3.2)xt is the previous mixed strategy, rt is the reward vector, zetat is a vector where each component issampled from the standard normal distribution (with variance controled by the parameter lambda(t)), andP(?) is the same retraction function used for GIGA-WoLF.Design Decision SettingLearning rate (alpha(t)) 1radical t10 +100Step size (eta(t)) 1radical104t+108Noise Weight (lambda(t)) 1radical105t+108Table 3.6: Design decisions for GSA.RVsigma(t) is a implementation of the algorithm given in Banerjee and Peng [5]. Some initialexperiments showed that the settings of the algorithm used in the paper performed poorly, and weconsequently used some hand picked parameter settings that were more aggressive and seemed toperform better.24Chapter 3. PlatformDesign Decision Settingsigma-schedule (sigma(t)) 11+ 125radicaltStep size (eta(t)) 1radical1000t+105Table 3.7: Design decisions RVsigma(t).Q-LearningOur implementation of Q-learning is very basic. Since in a repeated game there is only one`state', Q-learning essentially keeps track of Q-values for each of its actions. We use an epsilon1-greedy exploration policy (perform a random action with probability epsilon1) with a decaying epsilon1. 400exploration steps are expected for this epsilon1-schedule, and epsilon1 drops below a probability of 0.05 at ap-proximately iteration 2800. It is negligible at the end of the settling-in period (less than 3E-9).The learning rate (alpha) decays to 0.01 at the end of the settling in period. The discount factor ofgamma = 0.9 was set rather arbitrarily. There is no need to trade off current reward with future reward:all actions take the algorithm back to the same state.Design Decision SettingLearning rate (alpha(t)) parenleftbig1 - 12000 parenrightbigtExploration rate (epsilon1(t)) 15 parenleftbig1 - 1500 parenrightbigtFuture discount factor (gamma) 0.9Table 3.8: Design decisions for Q-learning.Minimax-Q and Minimax-Q-IDRFor minimax-Q, we solved a linear program to find the mixed maxmin strategy based on theQ-values. This program wasMaximize U1subject to summationtextjelementA1 u1(aj1,ak2) ? sigmaj1 greaterequal U1 universalk element A2summationtextsigmaj1 = 1sigmaj1 greaterequal 0 universalj element A1(see, for example, Shoham and Leyton-Brown [45]). The learning rate, exploration rate, andfuture discount factor are identical to Q-learning. We also look at a variant of minimax-Qcalled minimax-Q-IDR that iteratively removes dominated actions. In each step of the iterativeIDR algorithm a mixed-strategy domination linear program (see, for example, Shoham and Leyton-Brown [45]). Both programs are solved with CPLEX 3. PlatformRandomThe final algorithm, random, is an simple baseline that uniformly mixes over the available actions.Specifically, it submits a mixed strategy sigma where universala element A, sigma(a) = 1|A| .26Chapter 4Empirical Methods and SetupThe primary purpose of MALT is to facilitate the creation of experiments. The next two chaptersdemonstrate what MALT can do. In this chapter, we describe the setup of a large experiment?indeed this experiment is the largest MAL experiment on many dimensions?that is aimed at com-paring the performance of different MAL algorithms using a variety of different metrics. We alsofocus on building some tools that allow us to make our second contribution: comprehensively com-paring existing MAL algorithms using a variety of different metrics. This analysis also shows therelationship between reward and some of the alternative metrics that other authors have used. Thefollowing chapter (? 5) is devoted to explaining these results.4.1 Experimental SetupWe used MALT to set parameters for the eleven algorithms described previously. The goal of thisexperiment was to find the algorithms that are best against particular opponents on different typesof games. In order to do this, we ran each of the eleven algorithms on a number of matches, andcompared their results.To test a variety of game instances we used 13 game generators from the GAMUT game col-lection (see Table 4.1). From these generators we generated a total of 600 different game in-stances. The generators selected were diverse and created instances that belonged to many familiesof games. We do not describe the details of each generator (these descriptions are available inGAMUT's online documentation), but we do discuss their relevant features when they are im-portant for understanding the results. The game instances rewards were normalized to the [0,1]interval, in order to make the results more interpretable and comparable.We examined five different action set sizes: 2 ? 2, 4 ? 4, 6 ? 6, 8 ? 8 and 10 ? 10. For eachsize, we generated 100 game instances, drawing uniformly from the first twelve generators. Anadditional 100 instances were drawn from the last distribution, D13, which is a distribution of allstrategically distinct 2 ? 2 games [41]. The distribution induced by mixing over all 13 GAMUTgenerators is called the grand distribution.With eleven algorithms and 600 game instance there were 11 ? 11 ? 600 = 72 600 matches.Each match was run once for 100 000 iterations, although the first 90 000 iterations that were spentadapting were not recorded for analysis. Each match could have been run multiple times insteadof just once, and indeed doing so would have been essential to understanding how any particularmatch behaves if at least one of the algorithms is randomized. However, conducting more runsper matches would mean that for the same amount of CPU time we would have had to either27Chapter 4. Empirical Methods and SetupD1 A Game With Normal Covariant Random PayoffsD2 Bertrand OligopolyD3 Cournot DuopolyD4 Dispersion GameD5 Grab the DollarD6 Guess Two Thirds of the AverageD7 Majority VotingD8 Minimum Effort GameD9 Random Symmetric Action Graph GameD10 Travelers DilemmaD11 Two Player Arms Race GameD12 War of AttritionD13 Two By Two GamesTable 4.1: The number and name of each game generator.experiment with fewer games or fewer algorithms. We chose not to do this and traded a betterunderstanding of how particular matches behaved in return for more data from more varied gameinstances and algorithms. Furthermore, we show in Appendix A that not stratifying (holding oneexperimental variable fixed while varying another; as opposed to varying both) on game instancesreduces variance for many estimates of summary statistics like mean and median. Since we onlyran each algorithm once on each PSM (a partially specified match), we use the terms `run' and`PSM' interchangeably when we discuss the results.This experiment generated a lot of data. In order to interpret the results precisely we usedseveral different empirical methods. Each is motivated by a particular problem that we encounteredin the analysis.4.2 BootstrappingIf we conduct an experiment where two algorithms are run on a number of PSMs then a naturalway to compare their performance is to compare the sample means of some measure of their perfor-mance (average reward, for example). However, if we have the conclusion that `the sample meanof algorithm A is higher than the sample mean of algorithm B', how robust is this claim? If we ranthis experiment again are we confident that it would support the same conclusion?A good way to check the results of an experiment is to run it multiple times. If the conclusion isthe same each time than we can be fairly confident that the conclusion is true. Let's say we run anexperiment 100 times, and we found that 95% of the experiments had a sample mean for algorithmA of between [a, a], that 95% of the experiments had a sample mean for algorithm B of between[b, b]. If a > b (the lower bound of A's interval is great than the upper bound of B's) then wecan be confident that A is better in terms of mean. These intervals are the 95% percentile intervals28Chapter 4. Empirical Methods and Setupof the mean estimate distribution, and the fact that they do not overlap will be taken as sufficientevidence for there to be a significant relationship between the means.While this repeated experimentation is sufficient to allay concerns of insignificant results, it isalso expensive. To verify the summary statistics from one experiment, we had to run many more(99 in the above example). This is not always possible (our experiments took 7 days on a largecomputer cluster, so to rerun them a hundred more times would have taken the better part of twoyears) and is certainly never desirable. Is there a way to use the data from one experiment and stillconstruct confidence intervals of summary statistics? The answer is yes, and one way to do this isthrough the powerful technique of bootstrapping.Given an experiment with m data points, we can `virtually' rerun the the experiment by sub-sampling from the empirical distribution defined by those m points. For example, if we have asample with 100 data points, we could subsample 50 data points (with replacement) from these100 and look at the statistic for this subsample. We can cheaply repeat this procedure as manytimes as we like, creating a distribution for each estimated statistic. From these bootstrapped es-timator distributions we can form bootstrapped percentile intervals and check for overlap. Thisis exactly what we would do if we were rerunning experiments, although bootstrapping does notinvolve running a single new experiment and is just a manipulation of the data that was alreadycollected.There are two parameters that control the bootstrapped distribution: we form the distributionby subsampling l points from the original m, and we repeat this process k times. For this thesiswe will chose l to be floorleftm/2ceilingright and k to be around 2 500. These particular parameters were chosento ensure that there would be diversity among the subsamples (this explains the moderate size of l)and that the empirical distributions would be relatively smooth (this explains the large k).4.3 Statistical Tests4.3.1 Kolmogorov-Smirnov TestWhile bootstrapping is useful for seeing if summary statistics are significantly different or not, willalso want to check if two distributions are themselves significantly different. A beta distributionand a Gaussian distribution might coincidentally have the same mean, but they are are not the samedistribution. So how can we distinguish between distributions that are different?The most common way of doing this for general functions is a statistical test called the Kol-mogorovSmirnov (KS) independence test. This test is nonparametric, which means that it does notassume that the underlying data is drawn from some known probability distribution. The KS testchecks the vertical distance between two CDFs (see Figure 4.3.1) and if the maximum vertical dis-tance is large enough then the distributions are significantly different. `Large enough' is controlledby the significance level, alpha, and we will use the standard alpha = 0.05 unless otherwise noted.29Chapter 4. Empirical Methods and Setup-4 -3 -2 -1 0 1 2 3 400. of Probabilistic DominationABFigure 4.1: An illustration of the KS statistic in terms of two CDFs. The KS statistic is the verticaldotted black line between the CDF curves.30Chapter 4. Empirical Methods and Setup4.3.2 Spearman's Rank Correlation TestSpearman's rank correlation test is a way to establish whether or not there is a significant monotonicrelationship between two paired variables. For example, we might want to show that there is somesignificant monotonic relationship between the size of a game's action set size and the reward thata learning agent receives on it.This test is, like the KS test, non-parametric: it does not assume any parametric form of theunderlying data. The relationship between the two variables can be positive (high values of onevariable are correlated with high values of the other variable) or negative (high values of one vari-able are correlated with low values of the other).4.4 Probabilistic DominationSay that we have two algorithms and that one has both a higher mean and more `bad' runs than theother (say, runs below 0.1). Looking solely at the sample means would lead us to conclude thatthis algorithm is better. Should we be happy with the account of performance given by the samplemean?We might worry, for example, that the distribution of game instances that we have in the exper-iment is not representative of a practical problem that we want deploy a MAL algorithm on: the`bad' runs might be more common in practice. Or perhaps we do not know the reward functionbut instead we observe a monotonic transformation of the reward. For example, in traffic we mayobserve trip time but not the exact reward function?it probably is not the case that a 60 minuteroute is exactly 12 times worse than a trip that takes 5 minutes, but we do know that it is worse.However, sometimes performance results are clear enough that these sorts of objections do notmatter and we can claim that one algorithm is better than the other without exploring these issues.These are situations where performance conclusions can be drawn without making subjective andproblem-specific judgments, and they are particularly compelling. These situations can be capturedusing probabilistic domination?a robust partial ordering for distributions.A solution quality distribution A (SQD; the distribution of a performance metric) dominatesanother SQD B if universalq element [0,1], the q-quantile of A is higher than the q-quantile of B. If there are twoalgorithms, A and B, that are trying to maximize reward, and A's SQD probabilistically dominatesB then regardless of the reward value r that one picks, there are more runs of A than of B thatattain a reward higher than r. Notice that this means that probabilistic domination is unaffected ifall rewards are shifted by some monotonic function: there will still be more runs of A than B thatattain a higher reward than f(r). Probabilistic domination is stronger than a claim about the meanof the distributions: domination implies higher means.Checking for probabilistic domination between two samples, with a maximum size of n, canbe done in O(n log n) time (sorting the two samples is the dominating step), but it can also bechecked visually by looking at the CDF plots. If one of the CDF curves is below the other curveeverywhere, than the former dominates the latter. Intuitively, this is because the better SQD has31Chapter 4. Empirical Methods and Setupless probability mass on low solution qualities, and more mass on higher solution qualities: betterdistributions are right-shifted (in Figure 4.3.1, SQD A dominates SQD B).32Chapter 5Empirical Evaluation of MALAlgorithmsThe experiment described in Chapter 4 was big: recording all 72 600 matches for 10 000 iterationsgenerated 143GB of data. Furthermore, the experiment took approximately 126 CPU days torun. This is an incredible amount of information to sort through and extract meaning from, andto make sense of the experiment at all we had to summarize much of this information. First, weused performance metrics that map the 10 000 recorded iterations of each match into a singlenumber (such as average reward or average regret). Secondly, we tended to use summary statisticsto examine and analyze the distribution of these performance metrics. In this section, we onlycomment on properties of distributions as a whole when the properties are especially strong: e.g.probabilistic domination. It is an understatement to say that some information is lost in this process,but this is inevitable.Work in MAL has focused on many different metrics and our experiment evaluated differentalgorithms using several of these measures of performance. We used average reward and a selectionof other metrics that addressed other aspects of empirical performance. We took these metrics fromtwo broad families of metrics that either measure performance based on aggregated reward or checkfor various types of strategic convergence. For each of these metrics we, as much as possible, tryto relate our results to the agent's algorithmic structure.Additionally, many authors have proven results about these alternative metrics instead of di-rectly proving results about reward. However, the Artificial Intelligence [46] agenda for MALlearning states that the fundamental goal of all agents should be to achieve high reward. Becauseof the mismatch between this goal and existing results, not only do we evaluate the algorithmswith each metric, but we also investigate the general connections between reward and the alterna-tive performance metrics. For example, if an algorithm frequently attains low regret, does it alsofrequently attain high reward?5.1 Reward-Based Metrics5.1.1 Average RewardAs argued earlier, reward is the most fundamental of all metrics as agents are explicitly trying tomaximize reward given what the other agents are doing. Because of this, we will engage in adetailed discussion of the reward results. Merely looking at the average reward attained in all the33Chapter 5. Empirical Evaluation of MAL Algorithmsrun over the 10 000 iterations is an extremely coarse summary of a distribution of games, and soour analysis does not stop there. In particular, there are interesting trends when we consider varyingthe different opposing algorithms and game instance distributions. However, a broad summary ofthe results undifferentiated by run features is a sensible place to start and gives a gross ranking ofthe algorithms.The average reward that we look at is with respect to the sampled actions, and not the submittedmixed strategy. This is formally stated in Equation 5.1, where the iterations 1 to T refer to the10 000 recorded iterations:?(T)i =summationtextTt=1 r(t)iT . (5.1)Observation 1 Q-Learning and RVsigma(t) attained the highest rewards on the grand distribution.Q-learning had the highest mean reward at 0.714, although RVsigma(t) was close with an av-erage of 0.710 (see Figure 5.1). We noticed considerable variation within the reward data, and allof the other algorithms' sample means still were within one standard deviation of Q-learning,including random (which obtained a sample mean of 0.480).The distribution of reward was definitely not symmetric and tended to have negative skewness(randomwas the only exception). Q-learning's distribution had the highest skewness, -0.720,indicating that the proportion of runs that attained high reward was larger than the proportion ofruns that attained low reward.These ranking were not all significant. The slight difference in means between Q-learningand RVsigma(t) does not in fact indicate that Q-learning was a better algorithm (in terms of means)on the grand distribution of games and opponents. These two algorithms attained significantlyhigher reward than any other algorithm, however. This was determined by looking at the 95%percentile intervals on bootstrapped mean estimator distributions (see ? 4.2) and seeing whichintervals overlapped (see Figure 5.2). The distributions were obtained by subsampling 2 500 times,where each subsample had 6 600 runs (half of the 13 200 runs that each algorithm participated in).Observation 2 Algorithm performance depended substantially on which opponent was played.We blocked the runs based on the opponent for a more detailed analysis of the reward results.Not only is this useful from an algorithm design perspective (why is my algorithm particularlyweak against algorithm A?), but it is also useful for lifting results from our experiment to otherexperimental settings. For instance, if one was looking for an algorithm that only operated ongames where the payoffs were unknown (excluding, for example, RVsigma(t)), blocking could be usedto restrict attention to algorithms that satisfied this constraint.Figure 5.3 shows the mean reward for each algorithm against every possible opponent. Themost salient feature of this figure is that minimax-Q,minimax-Q-IDR and random were all relatively weak against a broad range of opponents.34Chapter 5. Empirical Evaluation of MAL Algorithmsq rvs gsa det giga awe fict meta mini min rand0. RewardFigure 5.1: A plot that shows the mean reward (bar) for each algorithm and one standard deviationin either direction (indicated by the size of the lens).35Chapter 5. Empirical Evaluation of MAL Algorithms0.69 0.695 0.7 0.705 0.71 0.715 0.72 0.725 0.73 0.73500. Mean Reward Estimate DistributionQRVsigma(t)Figure 5.2: The distribution of mean reward estimates for Q-learning and RVsigma(t), constructedby bootstrapping. The 95% confidence intervals are indicated by the dark circles and dashed-lines.36Chapter 5. Empirical Evaluation of MAL AlgorithmsProtagonist Mean Average RewardOpponentProtagonist  fict det meta awe q rvs gsa giga mini min randfictdetmetaaweqrvsgsagigaminiminrand00. 5.3: A heatmap showing the mean reward for each protagonist algorithm (ordinate) playingagainst each opposing algorithm (abscissa).37Chapter 5. Empirical Evaluation of MAL AlgorithmsOpponent Best-Response SetAWESOME GIGA-WoLF, GSA and RVsigma(t)Determined AWESOME, GIGA-WoLF, GSA,Q-learning and RVsigma(t)Fictitious play GSA, Q-learning and RVsigma(t)GIGA-WoLF determined, Q-learning and RVsigma(t)GSA determined, Q-learning and RVsigma(t)Meta determined, GIGA-WoLF, GSA and RVsigma(t)Minimax-Q Q-learningMinimax-Q-IDR Q-learningQ-Learning determined, Q-learning and RVsigma(t)Random determined, Q-learning and RVsigma(t)RVsigma(t) determinedTable 5.1: The different algorithms and their best-response setsAnother interesting feature is that fictitious play and determined tend to get lower re-ward against themselves (self-play) and each other than against other opponents. Meta?an algo-rithm that manages a profile of algorithms including fictitious play and determined?also inherited these performances issues, while AWESOME?the other portfolio algorithm?largelyavoided them.If we know what algorithm the opponent is using, which algorithm should we use? We con-structed best-response sets for each possible opponent using bootstrapped percentile intervals. Wecall the algorithm with the highest mean against a particular opponent a best response, but anyalgorithm with a overlapping bootstrapped 95% percentile interval was also in this set?we cannotsignificantly claim that these algorithms do worse than the apparent best algorithm. These bestreponse sets are summarized in Table 5.1.1. Q-learning and RVsigma(t) are most frequently bestresponses, while fictitious play, meta, minimax-Q, minimax-Q and random neverare best responses.An interesting interpretation of these best-response results is to consider the one-shot `algo-rithm' game where a player's action space is the set of algori thms and the player picks one of theseas a proxy for playing the repeated game. The payoff for using algorithm A against algorithm B isthe mean reward that algorithm A attained against B.Observation 3 Determined and Q-learning both participate in pure strategy Nash equilib-ria of the algorithm game.With this interpretation, what can we say about this algorithm game? There were three al-gorithms that were strictly dominated in this grand distribution algorithm game: minimax-Q,minimax-Q-IDR and random. Strict domination means that regardless of what algorithm theother player is using, we could use an algorithm that is strictly better than the one that we are using.38Chapter 5. Empirical Evaluation of MAL AlgorithmsAlgorithm GameOpponentProtagonist  fict det meta awe q rvs gsa giga mini min randfictdetmetaaweqrvsgsagigaminiminrand00. 5.4: Interpreting the mean reward results as a one-shot game. The cells that are cross-hatched are dominated and the `star's indicate pure-strategy Nash equilibria. The determined andQ-learning equilibrium shows up twice since the protagonist can either select Q-learningor determined, and we indicate this symmetry by making one of the corresponding stars hollow.39Chapter 5. Empirical Evaluation of MAL AlgorithmsAlgorithm Strictly Dominated Weakly DominatedAWESOME 10.8% 11.7%Determined 0.0% 0.0%Fictitious play 35.9% 36.4%GIGA-WoLF 54.1% 55.1%GSA 0.4% 0.4%Meta 28.8% 28.2%Minimax-Q 100.0% 100.0%Minimax-Q-IDR 100.0% 100.0%Q-Learning 0.0% 0.0%Random 100.0% 100.0%RVsigma(t) 0.0% 0.0%Table 5.2: The proportion of grand distribution subsampled algorithm games where each algorithmwas strictly or weakly dominated.The domination must be significant: we want to be confident that if this experiment were repeated,we would get a similar result. We used bootstrapping to check this: we subsampled 6 600 PSMs10 000 times and from these formed 10 000 `subsampled' games. We checked for strict dominationin each game, and considered an algorithm dominated if it was dominated in at least 95% of thesubsampled games. The proportion of subsampled games where each algorithm was dominated onis shown in Table 5.2.There were only two pure-strategy Nash equilibria that ever occurred in the subsampled gamesfor the grand distribution: Q-learning in self-play, and Q-learning against determin-ed. Q-Learning in self-play is particularly convincing because it was symmetric and did notrequire that the players coordinate to playing different strategies, and it occurred in 90.2% of thesub-sample games. The other equilbrium occurred in the remaining 9.8% of games.We also looked at the algorithm games formed by restricting attention to individual genera-tors. For these per-generator algorithm games Determined in self-play was the most commonsymmetric pure strategy Nash equilibrium. It was a significant Nash equilibrium in seven of thegenerator games, i.e. determinedin self play was a pure strategy Nash equilibrium in more than95% of the subsampled games for these each of these seven generator games. Q-Learning wasthe second most common symmetric pure strategy Nash equilibrium, and existed in four generatorgames.The generators varied substantially in their pure strategy Nash equilibria. For instance D1 (AGame with Normal Covariant Payoffs) had no significant pure strategy Nash equilibrium. D4,Dispersion Game, is the other extreme and had 22 pure strategy Nash equilibria (see Figure 5.5).Part of the reason for the vast number of equilibria in D4 is that majority of runs for many ofthe algorithms yielded reward of 1 (e.g. 84.6% of AWESOME's runs yielded a reward of 1). Thismeant that in many of the subsampled games, the majority of payoffs were exactly 1 and so manyof these Nash equilibria are weak. For example, both RVsigma(t) and Q-learning attained a re-40Chapter 5. Empirical Evaluation of MAL Algorithmsward of 1 against fictitious play, and fictitious play itself attained a reward of 1against RVsigma(t) and fictitious play. Therefore both RVsigma(t) and fictitious play, andQ-learning and fictitious playare Nash equilibria.Using the concept of probabilistic domination (? 4.4), we can make more robust statementsabout performance than we could by using mean reward.Observation 4 Q-Learning was the only algorithm that was never probabilistically dominatedby any other algorithm when playing any opponent.Determined and RVsigma(t) were the next-least dominated: determined was only proba-bilistically dominated by AWESOME against a fictitious play opponent, which was in turndominated by Q-learning. RVsigma(t) was dominated by Q-learning when playing against theminimax-Q variants, and also by determined when playing against RVsigma(t). Indeed, beingdominated by another algorithm in self-play seemed to be common: only AWESOME, determin-ed and Q-learning avoided being dominated by another algorithm when playing themselves.The fact that determined was not dominated should be seen as a property of the games distri-butions that we chose.It should be noted that while there are some strong domination relationships, these are theexceptions and ambiguity the rule: for most algorithm pairs on most opponents no probabilisticdomination relationship exists (see Figure 5.6). Furthermore, there is no opponent for which onealgorithm probabilistically dominates all others.Observation 5 Most algorithms were worse in self-play than in general.We noticed above in the probabilistic domination section (see ? 4.4) that self-play is a difficultsituation for many algorithms. We also noticed that there is a slight tendency towards `cool' cellson the main diagonal of Figure 5.3. A closer analysis shows that for most algorithms there isindeed a significant relationship between self-play instances and low reward. The distribution ofreward in the self-play runs for AWESOME, determined, fictitious play and meta wereprobabilistically dominated by the distribution of reward in the non-self-play runs.There were no domination results of this kind for the gradient algorithms because they hada tendency to get fewer low-reward runs in self-play, but their self-play means were significantlylower than their non-self-play means. We verified this by looking at the 95% bootstrapped per-centile intervals. There was no significant relationship for minimax-Q and minimax-Q-IDR,and this self-play trend was reversed for Q-learning: its self-play runs probabilistically domi-nated its non-self-play runs. Furthermore, Q-learning has the highest mean reward in self-play(see Figure 5.7).Interestingly, AWESOME was one of the algorithms with poorer self-play runs, despite its ma-chinery for converging to a special equilibrium in self-play. One might wonder whether this is be-cause AWESOME does not converge due to an overly conservative threshold for detecting whetherits opponent is playing part of an equilibrium, or an indication that AWESOME was converging tothe special equilibrium but it was just not associated with high reward (our implementation used41Chapter 5. Empirical Evaluation of MAL AlgorithmsAlgorithm Game for D4OpponentProtagonist  fict det meta awe q rvs gsa giga mini min randfictdetmetaaweqrvsgsagigaminiminrand00. 5.5: Interpreting the mean reward results for D4 (Dispersion Game) as a one-shot game.The cells that are cross-hatched are dominated, and the `star's indicate pure-strategy Nash equilibria.Some equilibria show up twice since some of the equilibria are asymmetric, and we indicate thissymmetry by making one of the corresponding stars hollow.42Chapter 5. Empirical Evaluation of MAL Algorithmsq awe det gsa giga rvs fict metamini minrand020406080100120140Reward Probabilistic Dominance, Block on OpponentOpponent-Candidate Pairs  DominatesNeitherDominatedFigure 5.6: The number of opponents and candidate algorithm that each algorithm dominates, isdominated by, or is in a non-dominance relationship with.43Chapter 5. Empirical Evaluation of MAL Algorithmsq rvs gsa giga awe det meta fict mini min rand0. Mean RewardFigure 5.7: A plot that shows the mean reward (bar) for each algorithm in self-play and one standarddeviation in either direction (indicated by the size of the lens).44Chapter 5. Empirical Evaluation of MAL Algorithmsthe first Nash equilibrium found by GAMBIT's implementation of Lemke-Howson). At the riskof keeping the reader in suspense, we defer the answer to ? 5.3.2 where we look at convergenceresults.Blocking on the opponent yields many trends and quirks but the `problem' faced by an algo-rithm is not completely described by the opponent. The instance generator also matters: one couldeasily imagine a case where a particular algorithm is excellent on one type of game and poor on an-other, and this information is lost if the data is solely examined in terms of the opponent (or, indeed,if the results are not blocked by any feature of the match). Analyzing each generator separately isparticularly useful from an algorithm design perspective. Weak distributions can be identified, andhopefully any poor behaviour can be isolated and rectified. Per-generator results are also usefulfrom an deployment perspective; there is little point in deploying a MAL algorithm on a particulardistribution of games instances if the algorithm is weak on it, even if the algorithm seems to begood `in general'.Observation 6 Q-Learning is the best or one of the best algorithms to use for most generators.Generators are an important part of the reward story. As can be seen in Figure 5.8, the resultsfor any algorithm vary considerably between the different game generators. However, it helps totake these per-generator reward results and normalize them, dividing the results for each algorithmon a particular generator by the maximum reward attained by any algorithm. Some familiar trendsemerge: minimax-Q, minimax-Q-IDR and random are all worse than the other algorithms ina broad range of generators, and Q-learning and RVsigma(t) tend to be good.Q-Learning was the best algorithm or was one of the best algorithms for 10 generators (seeTable 5.3). It was the only algorithm that was a unique best response of any generator in termsof mean reward, and was the only best response for generators D1, D4, and D9). FurthermoreQ-learningalso belonged to the set of best algorithms in generators D2, D3, D7, D10, D11, D12and D13?this is the set formed by algorithms whose bootstrapped mean estimator 95% percentileintervals overlaped with the algorithm with the best sample mean. While Q-learning mostfrequently was a member of a generator's best algorithm set, fictitious play and deter-mined were also frequently in these sets (6 and 7 generators respectively).The gradient algorithms were especially strong on D7 and this was the only generator whereall three gradient algorithms were in the best algorithm set (Figure 5.10). D5, D6, and D8 wereinteresting distributions for AWESOME and meta. In D5, neither AWESOME nor meta managedto be one of the best algorithms despite the fact that both fictitious play and deter-mined?two of the algorithms that they manage?were. In D6, AWESOME joins fictitiousplay and determined but meta does not, and in D8 the reverse happens: meta, fictit-ious play and determined were the three best algorithms. These three generators illustratesituations where at least one of the portfolio algorithms failed to capitalize on one of their managedalgorithms. It would be interesting to run further experiments to determine why this is the case forthese distributions and if this issues could be remedied.45Chapter 5. Empirical Evaluation of MAL AlgorithmsAgent Mean Average RewardGeneratorProtagonist  D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13fictdetmetaaweqrvsgsagigaminiminrand00. 5.8: A heatmap showing the reward for the protagonist algorithm playing PSMs from aparticular generator, averaged over both iterations and PSMs.46Chapter 5. Empirical Evaluation of MAL AlgorithmsAgent Mean Average Reward, NormalizedGeneratorProtagonist  D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13fictdetmetaaweqrvsgsagigaminiminrand 5.9: A heatmap showing the mean reward for the protagonist algorithm, playing againstthe opposing algorithm. These cells have been normalized. Each column has been divided by themaximum average reward attained by any algorithm on that particular generator.47Chapter 5. Empirical Evaluation of MAL Algorithms0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.7400. Mean Reward Estimate DistributionGIGA-WoLFGSAQRVsigma(t)Figure 5.10: The bootstrapped mean estimate distribution for D7. Four algorithms are shown; theyare the algorithms that have a 95% confidence interval that overlaps with Q-learning, the algo-rithm with the highest mean. The 95% confidence intervals for Q-learning and GIGA-WoLF(the algorithms with the highest and lowest mean) are indicated with a dashed black line and circles.48Chapter 5. Empirical Evaluation of MAL AlgorithmsGenerator Best AlgorithmD1 Q-learningD2 Q-learning and RVsigma(t)D3 AWESOME, determined, fictitious play, GSA,meta, Q-learning and RVsigma(t)D4 Q-learningD5 determined and fictitious playD6 AWESOME, determined and fictitious playD7 GSA, Q-learning and RVsigma(t)D8 determined, fictitious play and metaD9 Q-learningD10 fictitious play and Q-learningD11 determined, fictitious play,meta and Q-learningD12 determined and Q-learningD13 AWESOME, determined, GSA, Q-learning and RVsigma(t)Table 5.3: The set of best algorithms for each generator.Effect of Game Size on RewardHow does the size of a game's action set effect performance? L arger action spaces entail thepossibility of more complicated game dynamics that take longer to learn about and adapt to, so itis natural to assume that average reward will decrease as the size of the game increases. Are thereclear trends in this respect?Observation 7 There is no general relationship between game size and reward: on some gen-erators there is a strong positive correlation and on other generators there is a strong negativecorrelation.Our experiment shows that these intuitions do not always hold. For many algorithms on manygenerators we could not reject the null hypothesis of Spearman rank correlation test?that therewas no significant correlation between size and performance?at a significance level of alpha = 0.05.For instance, in D7 only GSA and GIGA-WoLF had significant trends (both exhibited negativecorrelation; reward was lower in larger games).As can be seen in Figure 5.11, D2 and D12 were the only two distributions on which we couldreject the null hypothesis for all algorithms, and they supported opposite conclusions. On instancesfrom D2, correlation was completely and strongly negative: the larger the game, the worse everyonedid. The least correlated algorithm was random with a Spearman's coefficient of correlationrho = -0.329. Correlation was entirely positive for D11, although some of the coefficients weresmaller. Fictitious play was the least sensitive to size (rho = 0.07), but it was anomalous.The algorithm with the next smallest coefficient was GIGA-WoLF, at rho = 0.267.49Chapter 5. Empirical Evaluation of MAL Algorithmsx x x xx x x x xx x x xx x x x xx x x x xx x x x x xx x xx xx x xx x x xx x x xCorrelation Between Size and RewardGeneratorAlgorithmD1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12fictdetmetaaweqrvsgsagigaminiminrandFigure 5.11: A heatmap summarizing the correlations between size and reward for different agentson different generators. A white cell indicates positive correlation, a black cell indicates negativecorrelation, and a gray cell with an `x' indicated an insigni ficant result.50Chapter 5. Empirical Evaluation of MAL AlgorithmsDistributions tended to either support entirely negative significant correlations, or entirely pos-itive significant correlations. The only exceptions to this were D1 and D6, which supported bothkinds of correlation. D2, D7, D8, D9 and D11 were negatively correlated distributions; the re-mainder were positive. These results were less clear than might be expected, and we are not surewhy. It could be the case that when the action spaces increase in size, important game features tiedwith high reward become more common, or it could be that larger actions spaces make it easier forMAL algorithms to miscoordinate, which is desirable for some games. Indeed, D4?DispersionGames?are show positive correlation between the number of actions and reward, and this is agame where agents need to miscoordinate to do well.Observation 8 Similar algorithms tended to have similar performance.Several of the algorithms that we implemented have common approaches to learning. Arethese similarities reflected in the reward results? There are three major blocks of algorithms withprogrammatic similarities: AWESOME and meta are similar because they both manage portfolioswith versions of fictitious play and determined; GIGA-WoLF, GSA and RVsigma(t) aresimilar as they are all variations on following the reward gradient; and minimax-Q and mini-max-Q-IDR are similar as the latter is the same as the former except for the addition of an IDRpreprocessing step. We call these the portfolio, gradient, and minimax blocks. We also mightsuspect that Q-learning, an algorithm that does not explicitly model the opponent, might bearsome performance similarities to the gradient algorithms.The algorithms were tested for similarity on PSMs that had the same generator and opponent,and the results are aggregated by summation. There are a possible 13?10 = 130 cases where sim-ilarity could occur?algorithms are of course similar to themselves and we did not bother checkingthese cases. Failing to reject the null hypothesis of the KS test (the hypothesis that both sampleswere drawn from the same population) is some evidence for the samples being similar. This rough-and-ready approach does not establish significant similarity and is merely suggestive of similarity;failing to reject a null hypothesis is not the same as having shown that the null hypothesis is true.However, with this caveat in mind, there are some interesting trends.All three predicted blocks emerge, as can be seen in Figure 5.12. Meta, AWESOME, fict-itious play and determined were all similar to each other on a number of opponent andgenerator pairs. Both meta and AWESOME are similar in more cases to determined than tofictitious play. For instance, AWESOMEis similar to determinedin 101 out of 130 caseswhile being similar to fictitious play in only 81 cases. Meta and AWESOMEalso look quitesimilar to each other, being similar in 88 cases. Q-learning is similar to the algorithms in thisblock, especially with determined and AWESOME, which we had not expected. AWESOME ismore similar to Q-learning than to any other algorithm: they were similar in 103 cases?evendetermined and AWESOME were only similar in 101 cases.Q-Learning also bears similarities to the gradient-algorithm block. The block of algorithmsconsisting of RVsigma(t), GIGA-WoLF and GSA were all similar in a number of instances and there isa particularly tight relationship between GIGA-WoLF and GSA (they were similar in 111 cases).51Chapter 5. Empirical Evaluation of MAL AlgorithmsAverage Reward KS TestsAlgorithmAlgorithm  fict det meta awe q rvs gsa giga mini min randfictdetmetaaweqrvsgsagigaminiminrand020406080100120140Figure 5.12: A heatmap that summarizes the number of opponent/generator pairs two algorithmsare similar on in terms of reward distribution. This relationship is symmetric, so only the lowerhalf of the plot is presented. The hotter the cell, the more situations the two algorithms are similarin.52Chapter 5. Empirical Evaluation of MAL AlgorithmsThe gradient-block algorithms also tended to be similar to determined and AWESOME on somecases.The connection between minimax-Q and minimax-Q-IDR was particularly strong. Thesetwo algorithms were similar in 118 cases. They were also the most similar algorithms to random.Indeed, these two algorithm were almost twice as likely to be similar to random than the nextmost similar algorithm (AWESOME: it was similar in 11 cases to minimax-Q's 21 cases).5.1.2 Maxmin DistanceLooking at the difference between the reward that an agent acquires and the maxmin value of theunderlying game instance is a way of placing reward results in context:MaxminDistance(vectori) =summationtextTt=1 r(t)iT - maxi min-i u(ai,a-i). (5.2)We call this difference maxmin distance despite the fact that it can be negative. One can always playa maxmin strategy without fear of exploitation, so getting above the maxmin value is a minimalrequirement of sensible MAL behaviour. Having a enforceable payoff (having a non-negativemaxmin distance) is also a necessary condition achieving payoffs consistent with some repeatedgame equilibrium, and we will examine this more in ? 5.3.Observation 9 Q-Learning attains an enforceable payoff more frequently than any other algo-rithm.Q-Learning is the algorithm that most frequently attained an enforceable payoff; it attaineda negative maxmin distance in only 1.8% of its runs. The runs where Q-learning failed toattain an enforceable payoff mostly came from either D4 (Dispersion Game; accounted for 37.6%of the unenforcable runs) or D13 (Two by Two Game; accounted for 33.3% of the runs). They alsooccured prodominently against random (29% of the unenforceable runs), minimax-Q (17.3%)and minimax-Q-IDR (16.0%). There is a sharp jump in the number of non-enforceable runsbetween Q-learning and the next best algorithm, AWESOME, which attained a negative maxmindistance in 7.4% of its runs.Minimax-Q and minimax-Q-IDR were the algorithms least likely to attain enforceablepayoffs (with the exception of random). They failed to attain enforceability in 28.9% and 27.7%of their runs respectively. While they look for the maxmin value of the game they do this withrespect to the payoffs they they have learned. This result suggests that they might have difficultyattaining their maxmin value due to having inaccurate beliefs.Minimax-Q and minimax-Q-IDR were especially poor in self-play, where conservativeplay can retard payoff learning. There is also a greater proportion of enforceable runs on 2 ? 2games (75.2%) than on 10?10 games (68.5%)?larger games have more payoffs to learn. Workingon a more sophisticated exploration scheme looks like an especially promising place to improveour implementation of minimax-Q and its variant.53Chapter 5. Empirical Evaluation of MAL Algorithmsq rvs det gsa awe giga meta fict mini min rand00. of Maxmin DistanceProportion of RunsAlgorithmsBetter than MaxminExactly MaxminWorse than MaxminFigure 5.13: The sign of the safety distance of each run, by algorithm.54Chapter 5. Empirical Evaluation of MAL Algorithms-1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 DistanceF(x)Distribution of Maxmin Distances  AWESOMEMinimax-Q-IDRQFigure 5.14: The distribution of maxmin distances for AWESOME, minimax-Q and Q-learn-ing.55Chapter 5. Empirical Evaluation of MAL AlgorithmsWhile Q-learning is successful against a broad range of opponents, there are other algo-rithms that have problems against certain algorithms. In particular, meta is quite good against allalgorithms except for fictitious play, determined, AWESOME and itself. It is especiallybad against fictitious play, where only 68.0% of its runs are enforceable. This shouldbe contrasted with its excellent performance against Q-learning: enforceability is attained in97.7% of it runs. Fictitious play also has issues playing against meta, determined anditself. We notice that AWESOME and determined did not share this problem.RVsigma(t) had problems attaining enforceable runs too, and although it received payoffs well abovethe maxmin value frequently (it had the second highest proportion of runs with strictly positive dis-tances at 68.8%) it had a large number of instances that were close to but below zero. This is incontrast to the minimax distance distribution of GIGA-WoLF, which had fewer non-enforceableruns with greater negative minimax distance (see Figure 5.1.2). We speculate that these runs werecaused by RVsigma(t) maintaining a small amount of probability mass on all of its actions, causing it to`tremble'. RVsigma(t), like all gradient algorithms, maintains a mixed strategy which is updated in thedirection of the reward gradient. This updated vector needs to be mapped back to the probabilitysimplex (the action weight might not sum to one after an update). RVsigma(t) does this by normalizingthe updated vector, while GSA and GIGA-WoLF use a retraction operator that is biased toward theextreme points on the probability simplex, and has a bias toward dropping actions from the mixedstrategy's support. An interesting tweak of RVsigma(t) would be to use GIGA-WoLF's retraction oper-ator instead of normalization, and see if this improves how frequently RVsigma(t) attains enforceability.5.2 RegretRegret is the difference between the reward that an agent could have received by playing the beststatic pure strategy and the reward that it did receive by using the algorithm:Regret(vectori,vector-i) = maxaelementAiTsummationdisplayt=1bracketleftBigr(a,a(t)-i) -EbracketleftBigr(sigma(t)i ,a(t)-i)bracketrightBigbracketrightBig. (5.3)The best pure strategy is chosen after the run assuming that the opponent's actions choices arefrozen. Note that we are using the expected reward formulation of regret?as opposed to one thatuses the actual actions that the algorithm played?following Bowling [7].Regret has been suggested as a measure of how exploitable an algorithm is. If an agent accruessignificant regret one possible explanation is that it has been `tricked' into playing the wrong actionby the opponent. However, there are situations, like in Game 2.7, where ignoring regret might leadbetter long-term reward.Some algorithms, including GIGA-WoLF and RVsigma(t), are no-regret learners: they have theo-retic results which guarantee that they will accrue zero regret as the number of iterations approachesinfinity. However, to our knowledge it has not been shown experimentally how the regret achievedby these algorithms compares to the regret that other algorithms achieve; nor has it been demon-strated whether these algorithms achieve better than zero regret in practice.56Chapter 5. Empirical Evaluation of MAL AlgorithmsProportion of Enforceable RunsOpponentProtagonist  fict det meta awe q rvs gsa giga mini min randfictdetmetaaweqrvsgsagigaminiminrand 5.15: The proportion of enforceable runs, blocked by opponent.57Chapter 5. Empirical Evaluation of MAL Algorithms-0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 DistanceF(x)Distribution of Maxmin Distances  GIGA-WoLFRVsigma(t)Figure 5.16: The distribution of negative maxmin distances for GIGA-WoLF and RVsigma(t).58Chapter 5. Empirical Evaluation of MAL AlgorithmsRather than looking at the total sum of regret over all 10 000 recorded iterations, we willdiscuss the average regret over these iterations. Since player payoffs are restricted to the [0,1]interval, averaged regret can give a better sense of the magnitude of regret with respect to possiblereward.Observation 10 Q-Learningwas the best algorithm in terms of minimizing regret. GIGA-WoLFwas the algorithm that most frequently had negative regret runs.Based on the distribution of games and opponents used in this experiment all algorithms hadpositive mean regret (Figure 5.17). All the means were significantly different, which was shownby checking the 95% percentile intervals for overlap (there was none). Of these, Q-learninghad the lowest regret, at 0.008. The gradient algorithms?GIGA-WoLF, GSA and RVsigma(t)?had thenext lowest regret after Q-learning. Among the gradient algorithms, RVsigma(t) had lower meanregret than GIGA-WoLF, but GSA had lower mean regret than either of them. These empiricalresults agree with GIGA-WoLF and RVsigma(t)'s theoretical no-regret guarantees?not only were theypredicted to get zero regret in the limit, but also they had good empirical regret results?althoughthe best algorithm in terms of mean regret, Q-learning, has no guarantees about regret.Mean regret masks an interesting difference between Q-learning and the gradient algo-rithms: they have low mean regret for different reasons. Most (89.5%) of Q-learning's runsattain zero regret. It has the fewest positive runs at 10.4% (the next lowest is AWESOME at 18.2%),and has the second-fewest negative runs (only fictitious play has fewer at 0.1%). The gra-dient algorithms, on the other hand, tended to have many negative runs; the three algorithms withthe most negative regret runs were GIGA-WoLF (5.8%), RVsigma(t) (3.2%) and GSA (3.0%). The gra-dient algorithms also have few zero runs. The algorithms, in order, with the fewest zero runs areRVsigma(t), GSA, random and GIGA-WoLF.The negative regret runs were only slightly negative: the run with the lowest regret had anaverage regret of -2 ? 10-6. The same cannot be said for positive regret: in 440 different runsan average regret of 1 was attained. These runs indicate disastrously poor play since one of thealgorithms has taken the exact wrong action at every possible step. 48.6% of these runs involveboth fictitious play or one of the algorithms that wrap around fictitious play( awe-some or meta) in self-play, and are on D4 (Dispersion Games), which are games that encouragemiscoordination. Indeed, Dispersion Games generalize the intuition of Game 2.6 to more than twoactions. This behaviour suggests that fictitious play becomes stuck in pathological cyclingbetween the symmetric outcomes (outcomes where both agents play the same action), which getsno reward in game instances from D4. This is a well known problem with fictitious playand a judicious application of noise to the fictitious play algorithm might break the abovelockstep cycle and improve performance.In terms of mean regret, Q-learning was the best algorithm to use for any generator exceptfor D13 (all strategically distinct 2?2 games)?RVsigma(t) was the best algorithm there. Q-learningwas also the best algorithm to use against almost every opponent. There were only two exceptions:one wants to use RVsigma(t) against Q-learning and AWESOME against itself. Another interesting59Chapter 5. Empirical Evaluation of MAL Algorithmsq rvs* gsa giga* awe det meta fict mini min rand-0.2- RegretFigure 5.17: A plot that shows the mean regret (bar) for each algorithm and one standard deviationin either direction (indicated by the size of the lens). Algorithms with an asymptotic no-regretguarantee are indicated with a `asteriskmath'.60Chapter 5. Empirical Evaluation of MAL Algorithmsq awe det fict meta mini min giga rvs gsa rand02000400060008000100001200014000Sign of RegretRunsAlgorithmsNegativeZeroPositiveFigure 5.18: The number of runs for each algorithm that have negative, zero, or positive regret.61Chapter 5. Empirical Evaluation of MAL Algorithms-0.2 0 0.2 0.4 0.6 0.8 1 UtilityF(x)Distribution of Regret for Q and GIGA-WoLFGIGA-WoLFQFigure 5.19: The distribution of regret for Q-learning and GIGA-WoLF.62Chapter 5. Empirical Evaluation of MAL AlgorithmsMean Average RegretOpponentProtagonist  fict det meta awe q rvs gsa giga mini min randfictdetmetaaweqrvsgsagigaminiminrand00. 5.20: Mean average regret, blocked by opponent.63Chapter 5. Empirical Evaluation of MAL Algorithmspairing was when Q-learning played against fictitious play: Q-learning attainedzero regret in every single game. This seems to indicate that Q-learning converged to a pure-strategy best response in every game against fictitious play. Q-Learning was the onlyalgorithm to do this.Observation 11 Q-learning, GIGA-WoLF, GSA and RVsigma(t) are rarely probabilistically dom-inated in terms of regret.When blocking on the opponent, some strong probabilistic dominance trends emerge betweenthe different distributions of regret. For example, the gradient algorithms were never dominatedby any other algorithm. Q-learning is also seldom dominated. It was only dominated once byAWESOME when playing against an AWESOME opponent. However, it is unsurprising that AWE-SOME attains lower regret than Q-learning in this case since it has special machinery for con-verging to a stage-game Nash equilibrium in self-play. In a Nash equilibrium both agents are best-responding so both accrue zero regret. Fictitious play, on the other hand, was frequentlydominated, especially by AWESOME, determined, Q-learning and to a lesser degree meta.Both determined and Q-learning dominated fictitious play against 10 opponents(Q-learning was the exception for determined and vice versa), and AWESOME dominatedfictitious play on 9 opponents (GIGA-WoLFand meta were the only opponents for whichAWESOME did not dominate fictitious play).Similar observations result from blocking on the game generators. Q-learningdominated other algorithms frequently?particularly fictitious play (on 9 gen-erators), meta (8 generators), and AWESOME (on 8 generators)?while avoiding domination byanother algorithm. Fictitious play was dominated frequently by Q-learning (9 genera-tors), determined (6), AWESOME (6) and meta (4) on many different instance generators.5.3 Convergence-Based MetricsIn this section we shift away from looking at metrics that are based on reward and instead look atmetrics that are based on empirical frequency of action. We focus on various ideas of convergence,from a weak form that merely insists that the empirical distribution of actions is stationary tomuch stronger forms that insist on convergence to a restricted class of stage-game Nash equilibria.We will also consider whether average payoofs are consistent with the infinitely repeated gameequilibrium.When we look at convergence, we will consider a sequence of action to be either converged ornot: we will not have an extended discussion about how some sequences are ?closer? than othersto convergence.One issue in studying convergence based on empirical data is dealing with runs that appear ?notquite? to have converged because of random fluctuations in the empirical action frequency. TheFisher exact test (FET) and Pearson's chi2-test can be used for checking whether two multinomialsamples are drawn from a distribution. For example, we might check whether a later empirical64Chapter 5. Empirical Evaluation of MAL AlgorithmsRegret Domination  fict det meta awe q rvs gsa giga mini min randfictdetmetaaweqrvsgsagigaminiminrand01234567891011Figure 5.21: The number of opponents for which the algorithm on the ordinate probabilisticallydominates the algorithm on the abscissa. For example, Q-learning probabilistically dominatesfictitious play on PSMs involving ten out of eleven possible opponents.65Chapter 5. Empirical Evaluation of MAL AlgorithmsRegret Domination, Blocked by Generator  fict det meta awe q rvs gsa giga mini min randfictdetmetaaweqrvsgsagigaminiminrand024681012Figure 5.22: The number of generators for which the algorithm on the ordinate probabilisticallydominates the algorithm on the abscissa.66Chapter 5. Empirical Evaluation of MAL Algorithmsaction distribution was drawn from the same distribution as an earlier sample (establishing thatthe empirical mixed strategies were stationary) or that an empirical action distribution profile wasdrawn from the same distribution as a stage-game Nash equilibrium.Each test was unfortunately inappropriate for the situations that we needed. The chi2 test does nothandle situations where some of the actions are rare or not present and the FET was computationallyexpensive, and the implementation of it that we used [40] failed on some of the larger and morebalanced action vectors (typically in the 10 ?10 case).Instead, we used the incomplete set of FET results to calibrate a threshold based on vectordistance where any two vectors that were closer than the threshold theta were considered to be thesame. We calibrated theta using a receiver operating characteristic curve. The incomplete FET resultswere used as ground truth, and we plotted the change in true positive rate and false positive rateas we varied theta. We picked the threshold that lead to equal number of false positives and falsenegatives. Based on this ROC analysis, we picked a theta of Strategic StationarityThe weakest form of convergence that we will look at is whether or not the algorithms converge toa stationary strategy profile. This is interesting in its own right, but is also a necessary conditionfor stronger forms of convergence. We consider a run to be stable if the joint distribution of actionsis the same in the first and second half of the recorded iterations, using lscriptinfinity-distance. This is a jointproperty of both algorithms, so while determined and random play stationary strategies theymay still participate in runs that are not stable.To check how successful our threshold criterion is at detecting stationarity we looked at the re-sults for two algorithms that always use stationary strategies. Determinedwas found to be stablein 99.5% of self-play matches and random was found to be stable in 92.0% of self-play matches.When playing each other, they were found to be stable in 94.8% of their runs. The differencesthat exist between these cases are likely because determined has weakly smaller supports thanrandom and mixed strategies with smaller supports are more likely to produce empirical actiondistributions that are close to the original strategy. We note that a false positive rate of between0.5% and 8% larger than might be hoped, but nevertheless defer improved criterion for empiricalconvergence to future work.GIGA-WoLF and GSA were the least likely to be stable?particularly in self-play, against eachother, or against meta (see Figure 5.23). Their striking instability with meta is potentially becausethey trip meta's internal stability test and change its behaviour. However, AWESOME also has aninternal check like this, but the stability of the GIGA-WoLF and GSA are not noticeably differentbetween matches with AWESOME and with Q-learning (which has no such check). RVsigma(t),the other gradient algorithm, was more stable than GIGA-WoLF and GSA. This might be becauseRVsigma(t) had a more aggressive step length: the parameters used in this experiment for GIGA-WoLFand GSA were taken from [7] and they were intended to produce smooth trajectories and ratherthan fast convergence.Meta, determined, fictitious play and AWESOME were, for the most part, quite67Chapter 5. Empirical Evaluation of MAL AlgorithmsOpponentProtagonistProportion of Stationary Runs  fict det meta awe q rvs gsa giga mini min randfictdetmetaaweqrvsgsagigaminiminrand0.70.750.80.850.90.951Figure 5.23: Proportion of stationary runs, blocked on opponent. This intensity map is symmetricand we removed redundant entries for clarity.)68Chapter 5. Empirical Evaluation of MAL AlgorithmsGeneratorProtagonistProportion of Non-stationary, Blocked on Generator  D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13fictdetmetaaweqrvsgsagigaminiminrand 5.24: Proportion of non-stationary runs, blocked on generator and protagonistgood at achieving stationarity with each other and in general. meta and fictitious playwere particularly strong against each other, and always reached a stationary strategy profile. Theonly exception to the stability in this group was AWESOME vs. meta; this pairing was unstable in10.3% of runs. We are not sure why this is, but it likely has to do with the discrete behaviouralchanges that both algorithms undergo when their internal states change.There were a number of problem generators for the different algorithms (see Figure 5.24). Forexample: generators D1, D2, and D10 created instances that were particularly difficult for thegradient algorithm in terms of strategic stability; Q-Learning was weak on both D5 and D7;and meta tended to be unstable on D5, D7 and D10. However these unstable instances were rareregardless of the algorithm paring. The vast majority of runs found a stationary strategy profile.Even GIGA-WoLF, which was the algorithm least likely to stabilize, found stationarity in 87.0%of its runs (see Figure 5.25).69Chapter 5. Empirical Evaluation of MAL Algorithmsawedet q metafict rvs minimingsagigarand00. of RunsRun Convergence  Unique NEPareto NENon-Pareto NEStableUnstableFigure 5.25: The proportion of runs that were stationary, converged to a Nash equilibrium or con-verged to a Pareto-optimal Nash equilibrium.70Chapter 5. Empirical Evaluation of MAL Algorithms5.3.2 Stage-Game Nash EquilibriaA subset of the stable runs settled on one of the stage-game Nash equilibrium. For some algorithms,Nash equilibrium convergence was reasonably common: AWESOME converged in 54.3% of itsruns, and determinedconverged in 53.1% of its runs. Determined was better at AWESOME atconverging to a Pareto-optimal Nash equilbrium (a Nash equilbrium that is not Pareto-dominated byany other Nash equilibrium). Indeed, AWESOME most frequently converged to a Pareto-dominatedequilibrium. This this likely has to do with the way that our implementation of AWESOMEpicked its`special' equilibrium. 8 It was simply the first equilibrium found by the Lemke-Howson algorithm,without attention to whether it was e.g. Pareto-dominated. AWESOME also tended to attain lowerreward when it is converged to a Pareto-dominated Nash equilbrium than when it did not convergeor converged to a dominated Nash equilibrium.Figure 5.26 gives the convergence results for self-play. One of the first things that jumps out ishow often determined manages to converge to a Nash equilibrium in self-play. This indicates thatthe games we choose had an important characteristic: many possessed a single Nash equilibriumthat was the best for both agents. Indeed, we can see that there is a surprisingly high number ofgames with a unique stage-game Nash equilibrium (58.5%). This is not a general property of allgames and so determined's convergence results could be radically different on another set ofgames. This property likely also affects the convergence properties for the other algorithms.We see that AWESOME nearly always attains a stage-game Nash equilibrium. Yet, if we re-call the discussion about self-play reward from Section 5.1.1, AWESOME received lower reward inself-play than non-self-play runs. Together, this indicates that while AWESOME was successful atconverging to a Nash equilibrium, this was not enough to guarantee high rewards in self-play. Aninteresting tweak to AWESOME would be to use its special self-play machinery to converge to otheroutcomes that are not stage-game Nash equilibria, such as the socially-optimal outcome of thestage game or the Stackelberg-game equilibrium. The aim of this adjustment would be to improveself-play reward results while keeping AWESOME resistant to exploitation by other algorithms.5.3.3 Repeated-Game Nash EquilibriaSo far, we have been looking at the equilibria for the stage game. The algorithms are actuallyplaying a repeated game, however, and we now turn to analyzing properties of this repeated game.We look at enforceability?achieving payoff profiles in which both payoffs exceed their respectivemaxmin values?since enforceability is a necessary condition for any repeated game Nash equi-librium. Unfortunately, a sufficient condition for repeated game equilibria would involve testingwhether each algorithm has correct off-equilibrium behaviour?punishments, in particular?builtinto its strategy that prevents profitable deviation by its opponent. While the algorithms that welooked at lack off-equilibrium punishments, it is still interesting to see how frequently the algo-rithms converge to payoff profiles that are realizable by repeated game Nash equilibria.8The original paper, Conitzer and Sandholm [13], left the method of picking the `special' equilibrium unspecified.71Chapter 5. Empirical Evaluation of MAL Algorithmsdetawe q fictmetarvs gsagigaminiminrand00. of RunsRun Convergence in Self-Play  Unique NEPareto NENon-Pareto NEStableUnstableFigure 5.26: The proportion of self-play runs that were stationary, converged to a Nash equilibriumor converged to a Pareto-optimal Nash equilibrium.72Chapter 5. Empirical Evaluation of MAL AlgorithmsObservation 12 Q-Learning was involved in matches with payoff profiles consistent repeatedgame Nash equilibrium more often than any other algorithm.Of the algorithms that we examined, Q-learning most frequently had runs that were consis-tent with a repeated game Nash equilibrium. It was consistent with a repeated game equilibrium in76.8% of its runs (see Figure 5.27). Determined and AWESOME were the next most frequentlyconsistent, and were consistent in 75.0% and 73.8% of their runs, respectively. Consistency witha repeated game Nash equilibrium is common, but not universal. Even after 90 000 adaptationiterations, no algorithm was completely successful at achieving enforceable payoff profiles. Wenote that if a protagonist algorithm is playing against a particularly unsuccessful opponent, such asrandom, it might fail to achieve an enforceable payoff profile simply because its opponent doesnot achieve an enforceable payoff. In particular, if the opponent fails to achieve an enforceablepayoff, it would be unfair to conflate runs where the protagonist was able to achieve an enforceablepayoff with runs where it also failed. We looked at the proportion of matches where each algorithmattained an enforceable payoff in Section 5.1.2, and we included these results in Figure 5.27 forcomparison. Again, no algorithm achieved payoffs above the maxmin value in all of its runs.5.4 Links Between MetricsWe argued earlier in this chapter that reward is the most fundamental metric and that the othermetrics, like regret, can be seen as `standing in' for reward. Therefore, it is important to comparereward results to the other metrics to see if these alternative metrics are reasonable substitutes forreward. For example, is high reward linked with converging to a Nash equilibrium? We saw in? 5.3.2 that while AWESOME was very good at converging to a Nash equilibrium in self-play, itdid not get especially high reward in these runs. Is this a general trend, or a special property ofAWESOME? What are the other links between the different aspects of performance?5.4.1 Linking Reward With Maxmin DistanceObservation 13 Algorithms tend to receive larger rewards when runs are also enforceable.For most of the algorithms there is a clear and strong relationship between enforceability andreward. This is to be expected, because a run is only enforceable for an algorithm if the algorithmattains high reward. Maxmin distance is positively correlated with reward for all algorithms. Thiswas tested with Spearman's rank correlation test ( ? 4.3.2) at a significance level of alpha = 0.05. Ifwe block on game generator (Figure 5.28), the results are largely the same with a few differences.There are a few insignificant results, mostly on D11. For example, minimax-Q is negativelycorrelated on D11. Interestingly, minimax-Q-IDR still exhibits positive correlation.For all algorithms the mean reward for enforceable runs is higher than the mean reward forunenforceable runs. However, when we compare reward SQDs based enforceable runs and unen-forceable runs, the former does not always probabilistically dominate the latter. (Figure 5.29). The73Chapter 5. Empirical Evaluation of MAL Algorithmsq det awe gsa meta fict giga rvs mini min rand00. of RunsAlgorithmEnforceability  Enforceable ProfileEnforceable PayoffNeitherFigure 5.27: Proportion of PSMs with enforceable payoffs and payoffs profiles achieved, by algo-rithm.74Chapter 5. Empirical Evaluation of MAL AlgorithmsxxxxxxCorrelation Between Reward and Maxmin DistanceGeneratorAlgorithmD1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12fictdetmetaaweqrvsgsagigaminiminrandFigure 5.28: The sign of correlation between reward and maxmin distance for each algorithm andeach game generator. A white cell indicates positive correlation, a black cell indicates negativecorrelation, and a grey cell with an `x' indicates insignificant correlation.75Chapter 5. Empirical Evaluation of MAL Algorithms0 0.2 0.4 0.6 0.8 and Enforceability for RVsigma(t)Above 0.2 Maxmin DistanceEnforceableNot EnforceableFigure 5.29: The distribution of reward for RVsigma(t) when conditioning on different maxmin dis-tances. For example, the `Above 0.2 Maxmin Distance' curve is the empirical CDF curve thatrepresents the distribution of reward for all runs that have a maxmin distance greater than 0.2.76Chapter 5. Empirical Evaluation of MAL Algorithmsexceptions here are the gradient algorithms and Q-learning. For both the gradient algorithmsand Q-learning there tend to be many enforceable zero-reward runs. The majority of thesezero-reward enforceable runs occur in D10?Traveller's Dilemma. All instances from D10 have asecurity value of zero so D10 is one of the few generators where algorithms can get an enforceablezero reward run. All algorithms frequently have zero-reward runs on instances of D10 but theytend to get unenforceable zero reward runs in other games more frequently than either the gradientalgorithms or Q-learning?the enforceable zero-reward runs stand out more for the gradientalgorithms or Q-learning since they have fewer unenforceable 0 reward runs. If we excluderuns from D10 then GIGA-WoLF, GSA and RVsigma(t) also exhibit domination, but Q-learningstill does not?though the cross-over is small (Figure 5.30). We also compared the reward SQDsfor runs with reward strictly higher than the maxmin value to the reward SQDs for unenforceableruns, and we found for all algorithm except Q-learning, the strictly enforceable maxmin rewardSQDs dominates the unenforceable SQDs.A more detailed explanation for this relationship can be see in a bivariate histograms such asFigure 5.31. This figure is a representative example of the relationship that exists between rewardand maxmin distance for all algorithms except random. Reward bounds maxmin distance: if onegets a reward of x, then maxmin distance must be between x - 1 (security value is 1) and x (se-curity value of 0). These constraints create a feasible regions that is a parallelogram with points{(1,1),(1,0),(0,0),(0,-1)}, where reward is the first coordinate and maxmin distance the sec-ond.There are two prominent ridges in this histogram. The first is the ridge formed by runs havingzero or close to zero maxmin distance, and the second is formed by runs with a reward of close to1. The three bins with the most runs are all close to zero maxmin distance with rewards of 0, 0.5,and 1. The first bin (reward of 0) largely corresponds to runs from D10 (although a few runs camefrom D7 and one run came from D13), the middle bin is mostly composed of runs from D6, andthe final bin(reward of 1) consists of instances drawn from either D3, D7, or D11. Again, theseobservations are for RVsigma(t) but are echoed in histograms for the other algorithms.These ridges tell us something rather surprising: runs tend to get either close to the maximumreward or to the security value. There seems to be little else in the way of a trend between the twometrics beyond the fact that one bounds the other. This observation is not limited to game instancesfrom one generator, and many generators contribute to these ridges.5.4.2 Linking Reward With RegretObservation 14 There is a link between obtaining large reward and low regret.There is also a link between having low regret and high reward. Regret and reward are neg-atively correlated for all algorithms (Spearman's rank correlation test; alpha = 0.05): high reward islinked with low regret. When blocking on generators, we see that D10 induces positive correlationfor all algorithms except determined, and this is sensible: algorithms get better reward whenthey are not best responding in this game (the unique Nash equilibrium is one of the worst outcomes77Chapter 5. Empirical Evaluation of MAL Algorithms0 0.2 0.4 0.6 0.8 and Enforceability for Q with D10 CensoredEnforceableNot EnforceableFigure 5.30: The distribution of reward for Q-learning when conditioning on different enforce-ability. Runs from D10 were excluded.78Chapter 5. Empirical Evaluation of MAL Algorithms00.51-0.50.310510Maxmin DistanceReward and Maxmin Distance for RVsigma(t)Rewardlog(runs)Figure 5.31: Bivariate histogram showing reward and maxmin distance for RVsigma(t). The 25 ? 25uniform bins were used. The height of the bins is shown on a log scale.79Chapter 5. Empirical Evaluation of MAL Algorithmsof the game).Since relatively few runs accrued significant negative regret (see ? 5.2), the most importantdivision is between runs that attained positive regret and runs that attained non-positive regret. Formost of the algorithms, their performance in non-positive runs probabilistically dominated theirperformance in positive regret runs. There were some exceptions; for example Q-learningand GIGA-WoLF experienced cross-overs. While for Q-learning the cross-over was relativelyminor, the cross-over for GIGA-WoLF was more significant: runs that attained positive regret lessoften attained zero reward (Figure 5.33).There is an even more dramatic result: the positive-regret runs dominated the non-positive runsfor GSA and RVsigma(t). These two gradient algorithms exhibited behaviour that none of the otheralgorithms displayed: runs with positive regret had better reward characteristics than runs withzero or negative regret. This phenomenon did not seem to be due to only one generator nor any oneopponent. We can note that the probabilistic domination visually seemed to be the weakest whenPSM involving Traveler's Dilemma were censored.The bivariate histograms for the different algorithms show that there is a ridge for all of themwhere regret is zero (the histogram for AWESOME is presented in Figure 5.34). This ridge has aprominent bin for runs with regret close to zero and reward close to one. This bin indicates goodruns where algorithms are best-responded and got the game's maximum reward. This bin is thelargest in terms of runs for most of the algorithms. However, there are other interesting bins thatare only observed in some of the algorithms. In particular, all algorithms except for the gradientalgorithms and Q-learning had a number of runs with reward close to zero and regret close toone. These runs are horrible: not only did the algorithms get close to the minimum reward possible,but also they could have switched to a pure strategy and potentially received a reward close to one.Of course, it is possible that a reward of 1 was not attainable with the new action?the opponentcould have adapt to the candidate's strategy?but it is hard t o imagine that the new action wouldhave done much worse: these runs were already getting close to the minimum possible reward.Furthermore, some algorithms were able to avoid these runs entirely. Q-Learning, for example,had no runs of this type and generally avoided high-regret low-reward runs (Figure 5.35). Theseruns should serve as a focal point for thinking about how existing algorithms should be improved.5.4.3 Linking Reward With Nash Equilibrium ConvergenceA lot of work in multiagent systems has focused on algorithms that try to converge to a stage-gameNash equilibrium. Indeed, many algorithms like determined and AWESOME explicitly try toconverge to some stage-game Nash equilibrium. But if one is primarily interested in getting highreward, is converging to an equilibrium desirable? Or, more generally, is proximity to a stage-gameNash equilibrium correlated with obtaining high reward?Observation 15 There is a link between obtaining large reward and being close to a stage-gameNash equilibrium for most algorithms.80Chapter 5. Empirical Evaluation of MAL Algorithmsxxx xx xx x xx xxx xCorrelation Between Reward and RegretGeneratorAlgorithmD1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12fictdetmetaaweqrvsgsagigaminiminrandFigure 5.32: The sign of correlation between reward and regret for each algorithm and each gamegenerator. A white cell indicates positive correlation, a black cell indicates negative correlation,and a grey cell with an `x' indicates insignificant correlation.81Chapter 5. Empirical Evaluation of MAL Algorithms0 0.2 0.4 0.6 0.8 and Regret for GIGA-WoLFNon-positive RegretPositive RegretFigure 5.33: A CDF plot showing GIGA-WoLF's performance conditioned on achieving eitherpositive or non-positive reward. Notice that there is less probability mass on 0 reward whenGIGA-WoLF attains positive regret.82Chapter 5. Empirical Evaluation of MAL Algorithms01010510RewardReward and Regret for AWESOMERegretlog(runs)Figure 5.34: A bivariate histogram showing reward and regret for AWESOME. Reward boundsregret: if one gets a reward of x, since reward is on [0,1], one can at most attain a regret of1 -x.83Chapter 5. Empirical Evaluation of MAL Algorithms00.5100.40.90510RewardReward and Regret for Q-learningRegretlog(runs)Figure 5.35: A bivariate histogram showing reward and regret for Q-learning. Notice that thereare fewer low-reward high-regret runs than in Figure 5.34?there is less mass on the right of theplot.84Chapter 5. Empirical Evaluation of MAL AlgorithmsAll algorithms have reward that was negatively correlated with lscriptinfinity-distance to the closest Nashequilibrium (Spearman's rank correlation test; alpha = 0.05). Furthermore, most algorithms werenegatively correlated even when we blocked on the game generators (Figure 5.36). There weresome exceptions. The most noticeable exceptions were on D6, D10, and D12 there were a numberof algorithms with positive correlation: it was better to be far away from the equilibrium. This isespecially true on D10 and algorithms received much higher reward if they participated in someother outcome.Bivariate histograms (Figure 5.37) reveal three major trends for all algorithms: runs are eitherhigh in reward, close to a Nash equilibrium, or far from a Nash equilibrium (this does not excludebeing high in reward and being either close or far from the equilibrium). For most algorithms,being close to a Nash equilibrium and being high in reward is the most common bin. Between8.1% (GIGA-WoLF) and 26.2% (determined) of runs have a reward greater than 0.96 and areless than 0.04 away from a Nash equilibrium (this is the right-most bin in Figure 5.34). Mostalgorithms have other strong modes at the other corners of the plots, but these are less promienentthat the close and high reward bin.85Chapter 5. Empirical Evaluation of MAL Algorithmsx xxxxx xxxCorrelation Between Reward and NE ProximityGeneratorAlgorithmD1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12fictdetmetaaweqrvsgsagigaminiminrandFigure 5.36: The sign of correlation between reward and lscriptinfinity-distance to the closest Nash equilib-rium for each algorithm and each game generator. A white cell indicates positive correlation, ablack cell indicates negative correlation, and a grey cell with an `x' indicates insignificant correla-tion.86Chapter 5. Empirical Evaluation of MAL Algorithms01010510RewardReward and NE Distance for DeterminedNE Distancelog(runs)Figure 5.37: A bivariate histogram showing reward and lscriptinfinity-distance to the closest Nash equilibriumfor determined.87Chapter 6Discussion and ConclusionIn this thesis, we argued for a standardized testbed for multiagent experimentation. This testbed,while initially more work to create, should offer lasting benefits: it will allow researchers to focuson experimental design and not implementation details. We built a version of this testbed, MALT2.0, and conducted a sample experiment: we used MALT to evaluate a set of MAL algorithms. Weanalyzed this experiment in depth and we suggested some analytical methods that are intended notonly to be useful for understanding and comparing existing algorithm behaviour but also will beuseful for empirically-minded algorithm design.We observed clear performance results. Firstly, minimax-Q and minimax-Q-IDR tendedto perform poorly from a number of perspectives. These included reward, regret, minimax dis-tance, strategic stationarity, and convergence to a stage-game Nash equilibrium. On the otherhand Q-learning had excellent results and was frequently better than more recent and sophis-ticated learning algorithms?such as GIGA-WoLF and meta?on most game instance generatorsfor most performance metrics. This was a surprising result, and it could be seen as embarrassingfor the MAL community: an off-the-self algorithm designed for single-agent learning handily beatthe latest multiagent learning algorithms in many respects. This suggests that there is a lot of roomfor improving the empirical performance of specially-designed MAL algorithm. Indeed, there area number of areas where efforts should be focused.Firstly, there simply needs to be more experimentation. Our experiment was large, but it isimpossible to answer all empirical MAL questions in one experiment. Some promising directionsfor extension are:? More careful examination of the relationship between performance and game properties likesize;? More detailed investigation of behaviour of instances from a single generator. This mightgive more insight into algorithm behaviour than an experiment with a broad focus like theone presented in this dissertation;? Further evaluation studies involving more algorithms like Hyper-Q [50] and Nash-Q [23];? Extension to N-player repeated games and stochastic games.Secondly, many of the more recent and sophisticated algorithms have a lot of tunable param-eters. It was beyond the scope of this paper to adjust them. Indeed, parameters might be one ofthe reason's that Q-learning did so well: it had only three parameters which were all easy toset. This does not excuse poor performance from algorithms with many parameters. Indeed, the88Chapter 6. Discussion and Conclusiononus of finding a good set of parameters belongs to the architect of the algorithm. However, itis possible that some algorithms may have hidden potential that can be release through rigoroustuning. This tuning will require further experimentation?hopefully MALT will be of assistancehere?and there are some interesting questions to ask:? Is one parameter setting good for many problems, or is it the case that one parameter settingwill be great on one set of matches and poor on another?? Which of, for instance, meta's parameters are the most important?? How much better would a rigorously-tuned setting be for AWESOME be than the defaultvalues presented in Conitzer and Sandholm [13]?? Does AWESOME's performance change radically when it selects the sociall y optimal Nashequilibrium as its special equilibrium? How about the `Stackelberg' equilibrium?? For gradient algorithms, is it better to ensure that mixed strategies are feasible through re-traction or normalization?? Is a set of parameters that are good for reward also good for regret?? What is the best way to automate parameter tuning?.Thirdly, we presented two different tweaks to existing algorithms:minimax-Q-IDR and GSA. In many situations, these algorithms offered improvements over theoriginal algorithm, and in many cases probabilistically dominated the originals. Other modifica-tions and preprocessing steps can be added to existing algorithms, and an interesting direction ofwould be to see which ones tend to work the best. For instance, does IDR-preprocessing alwaysimprove performance? When does it hinder rather than help? Hopefully, experiments of this naturewill also build intuition for how to rectify problematic situations and for building new algorithmsfrom scratch.Finally, managing a portfolio of existing algorithms also seems like a promising approach fordesigning new algorithms with good empirical properties. AWESOMEand meta both can be seen asportfolio algorithms: they look for features of their opponent's behaviour and switch between dif-ferent algorithms accordingly. A more general framework for building these portfolio algorithms?especially ones where the portfolio is not explicitly written into the algorithm?could be a way toreuse existing MAL methods. Such an algorithm would switch between the different algorithmsin a portfolio as the situation demands, depending on the empirical characteristics of the managedalgorithms. Again, this direction of research leads to a whole host of empirical questions. Whatfeatures of the game and game play are the best to look for? Does adding an algorithm to a portfoliostrictly improve performance?We leave the reader with a host of unanswered questions. This is a sign of vibrancy of the field:there is still a vast amount of research that needs to be done in this area. We will end this thesis witha discussion of urban traffic that shows that this field of research is not merely an abstract study oflearning algorithms. MAL research has practical value, especially as societal interactions becomemore numerous and more difficult to navigate. We hope that this thesis has piqued curiosity and left89Chapter 6. Discussion and Conclusionsome tools that are useful in addressing these many issues associated with learning and behavingin multiagent environments.90Chapter 7Future Work: Extension to TrafficIn this section, we will discuss how MAL algorithms can be used to understand urban traffic.There are two major goals of such work. The first is to characterize what behaviour occurs in atraffic system?called traffic modeling or prediction?and the second is to optimize various policytools?called traffic management. In this chapter we focus on the first problem, while noting thatthe second problem is the ultimate goal that we want to tackle once we have a satisfactory modelof traffic to work with.We are certainly not the first people to study traffic modeling. Civil engineers, for example,have worked extensively on the problem. There are multiple textbooks including May [31] andde Dios Ortzar and Willumsen [14] that are devoted solely to modeling issues. We take a slightlydifferent approach to the problem than is traditionally used in civil engineering. Specifically, weare more interested the incentives behind routing decisions rather than specific physical details ofthe system.Let us clearly articulate one of the problems of traffic modeling: route selection. We assumethat we start with a set of trips between two points in a road system (these trips are generatedin the earlier phases of modeling). This system can be modeled as a graph like Figure 7 wherenodes represent intersections and directed edges represent lanes. This is allowed to be a multi-graph where there are several distinct lanes between two intersections. What kind of behaviourwill intelligent drivers engage in? What system properties can we predict from their interaction?a bc dFigure 7.1: A sample road graph with four intersections.91Chapter 7. Future Work: Extension to Trafficad(x)=xd(x)=32 xbFigure 7.2: A sample road graph with two intersections that has no Wardrop equilibrium for asystem with two atomic drivers.7.1 Wardrop EquilibriumOne of the suggested route selection methods from the civil engineering community is to assign adelay function to each edge and look for a Wardrop equilibrium [54]. The Wardrop equilibrium issimilar to the Nash equilibrium, but assumes that there is an uncountably infinite number of agentsnavigating the networks. The users of the road system are modeled as infinitely divisible flows thattravel from a source node to a sink node. The delay for the flow along a particular path is equal tothe sum of delays on each edge.The traffic flow is assumed to be selfish so at equilibrium all paths with any flow must have thesame delay and any paths without flow must be worse?this is the Wardrop condition. If we assumenot, then there is some small fraction of flow on a more expensive path that would be better off ona cheaper path and so the system was not in equilibrium. Traffic engineers use these conditions toestimate traffic flow.This model assumes that commuters are infinitely divisible. This is not the case in practice,where each driver is a discrete and atomic object that cannot be subdivided. This is a substantialassumption and Wardrop's conditions do not work when drivers are atomic. Consider the systemin Figure 7.1 with two players: there is no way to make the two paths equal in cost. Additionally,atomic drivers could be in a Nash equilibrium but not a Wardrop equilibrium. In Figure 7.1 thereis a Nash equilibrium when one driver takes the top route and the second driver takes the bottombut notice that the route costs are 32 and 1 respectively. This shows that the condition for a Wardropequilibrium does not work in the natural atomic agent case.7.2 Congestion GamesCongestion games [43], roughly, are a discretization of the previous continuous commuter modelof traffic. Each atomic agent i element N picks a path through a graph. The delay for a user is the sumof the delay along each edge of the path. This is formally stated in:d(L) =summationdisplaylelementLdl(#l). (7.1)92Chapter 7. Future Work: Extension to TrafficHere, #l is the number of agents using lane l and dl(?) is the delay function. Notice that the delayfor a lane only depends on the number of agents using it; there is no inter-lane dependency.Congestion games assume that the traffic conditions in one intersection do not affect the condi-tions in neighbouring intersections; this is an unrealistic assumption. In real traffic networks, whenone lane becomes congested, cars entering from neighbouring lanes are impeded. This propagatescongestion to nearby lanes and causes delays to cascade throughout the road network. However, incongestion games the congestion in each lane is independent, and so we cannot accurately representthis phenomenon.7.3 Our GameThe inability to model interlane dependencies is a serious issue associated with both atomic andnon-atomic congestion games. To model the propagation of congestion more accurately, we sug-gest our own model of traffic where these inter-lane dependencies can be expressed. We representthe traffic network as an extensive form game where agents make a series of turning decisionsbased on whatever observations they can make about the state. This mapping of observations toturns is called a policy. We explicitly model the position of each agent. Because of this, driversmay be unable to enter a lane that is heavily congested and either need to bypass jammed lanesor wait out the congestion. This is exactly the kind of delay propagation that we are interested in,where a jam in one lane creates blockages in or strains on other near by lanes.While this model has the necessary richness to express inter-lane delay effects, it is also muchmore complicated than one-shot congestion games. In particular, our system dynamics make itpractically impossible to write a closed-form function that maps from a profile of strategies for theextensive form game to a profile of delays. As a consequence we need to simulate the road networkmodel to determine the outcome of a game.Our simulator models both space and time as discrete. We want a simulator that can efficientlysimulate large sections of a city and continuous models of time, space and motion are usuallycomplicated and expensive to simulate. However, the discretization needs to be done with care, orimportant details will be lost, fundamentally altering the problem. We will call a physical quantuma `cell' and a temporal quantum a `tick'. Cells are the fundam ental unit of progress: in each tickan unimpeded car should be able to move at least one cell. Also, a cell should be a sufficientdescription of a car's position: a car cannot be half-way through a cell.We define a cell to be large enough to hold one and only one car. Cells that are larger than carsabstract space more aggressively and cells that are smaller model space and movement with greaterprecision. While our model could easily be extended to use larger or smaller-grained discretization,vehicle-sized cells offer a good trade-off between more efficient super-vehicular cells and moreprecise sub-vehicular cells (this raises an interesting empirical question: how sensitive is the systemto more or less coarse discretization of space?).Each car has a velocity (in cells-per-tick or `CPT's) bounded by some upper limit and thesome acceleration function (in CPT2). For now, we take acceleration to be constant for simplicity93Chapter 7. Future Work: Extension to Trafficbut our model could trivially be extended to more realistic acceleration functions. We offer noformal advice for how these numbers should be set, but there should be a qualitative differencebetween CPTs for fast-moving cars and slow moving cars. Also, cars should not be able to movetoo far in one tick. Moving larger distances in each tick means that a car could potentially interferewith many other cars and the point of simulating is to make these conflicts simple to resolve andrelatively infrequent. Setting these kinematic numbers and a coarsening of space tacitly sets thecoarsening of time and we need not consider it as a special topic.The velocity of a car is bounded by the speed limit of each road, some internal maximumvelocity, and by the velocity of the cars in front of it. All agents are assumed to perfectly decelerateto avoid accidents. While accidents are an important part of real traffic jams we do not modelthem in this thesis. In general we assume agents to be flawless and lawful: they are perfect driversthat always follow laws. Beyond our assumption of flawlessness and lawfulness, the agents havecomplete freedom to choose routes and adapt to traffic conditions. At each intersection, agents areable to make a turning decision based on some local features of the state. However, this policyof mapping observations to turning actions is complicated and needs to be learned. We use MALalgorithms to do this.Indeed, the only difference between our traffic game and the earlier repeated game experimentsis simply the complexity of the stage game: rather than repeatedly playing a simple two-player one-shot game like Prisoner's Dilemma the algorithms are repeatedly playing a large extensive-formgame with N players. Because the game is large and the payoffs are unknown?we are simulatingexactly because we do not have a closed-form expression for the utility functions?we cannot usemany of the algorithms looked at in earlier sections. The only algorithms that are able to functionin such an environment are the gradient algorithms and Q-learning. Of these, we will focus onQ-learning because the gradient algorithms are only designed for one-shot games, and wouldrequires serious redesign to work well in extensive-form games. Additionally, in Section 5 we sawthat Q-learning was a good algorithm for learning in multigent systems and was better than theother MAL algorithms for most metrics.One of the problems for any learning algorithm is that even our simple traffic simulator has avast number of states. The state of the system is the position and velocity information for everyvehicle (ignoring traffic lights for the moment). If there are C cells, N vehicles, and a maximumvelocity of v, then there are C!(C-N)! ?vN possible states.This can be done efficiently through value approximation [6]. As an example, for Q-learn-ing to learn efficiently with so many states it estimates the value of Q(s,a) for new states bygeneralizing from similar states that it has seen before. Essentially, Q-learning needs to usenon-linear regression to get an approximate ?(s,a) based on past observations. Techniques likeforests of regression trees and Gaussian processes (see, for example, Rasmussen and Williams[42]) might be useful for this regression, but they need to be fast to update relative to the costof simulating each round: N learning algorithms need to update their model each round. Thereare a lot of interesting questions to explore: what is a good set of features? Are some regressiontechniques better than other for this problem? How should we explore to perform well in thisnon-stationary environment? Designing RL algorithms with good empirical properties for traffic94Chapter 7. Future Work: Extension to Trafficmodeling is an exciting topic, and we hope to work more on this problem in the future.7.3.1 Other ModelsLet us summarize our model in four main points:? Space is discrete and each cell is the same size as a car,? Cars may have different velocities, and can accelerate,? Agents are lawful and flawless drivers,? Drivers are adaptive and are able to learn from past runs.This model of traffic is not the only way to simulate simple urban road systems but it is aflexible model that can be extended in a number of directions. However, there have been othermodels of traffic suggested. In the following discussion we summarize a few of them and andindicate how they differ from ours.There are a number of commercial simulators for simulating traffic including VISSIM [1],Paramics [39], and CORSIM [16]. Most commercial simulators are continuous time and space:each vehicle is a physical object in Rfractur3 with real-valued velocity and acceleration. However, thismakes simulations expensive to run. Indeed, most commercial simulators are meant for simulatingshort stretches of highways or single intersections to ensure that proposed alterations can accom-modate predicted use. These simulators typically assume that the driver's routes are drawn fromsome static distribution. For example, a traffic engineer might want to simulate 100 drivers perminute heading down a highway from North to South while 10 drivers per minute attempt to enterthe highway via a new on-ramp to ensure that merging is smooth.Wiering [56], Porche and Lafortune [37] and de Oliveira et al. [15] all suggest discrete simu-lators that are intended for simulating intersections for optimizating light times. These simulatorsare much more closely aligned with our own goal of exploring policy tools than the commercialsimulators.Wiering [56] uses vehicle-sized cells but does not model acceleration. Vehicles move one cellforward every tick iff that cell is unoccupied. This understates the difference between an unimpededfast moving car and a slow car moving on a congested road. Agents either randomly pick one ofthe shortest lscript1-distance paths or co-learn: they pick the shortest lscript1-distance path with the lowestestimated waiting time. This estimated waiting time is not a subjective estimate learned by theagent. It is an estimate maintained by the system that each car has access to.This type of learning is unconvincing. There is no particular reason why any shortest lscript1-distance paths will be the best path. For example, one might want to skirt around the downtowncore of a city, even if it is directly between the start and end locations. Also, this model assumesthat all driving agents obediently follow the advice of some central congestion-tracking service.This ignores the question of whether it is always in the best interest of the agent to follow thisservice's advice.de Oliveira et al. [15] use a model very similar to our own: cells are vehicle-sized and carscan accelerate. Agents are assumed to be lawful but not flawless: indeed the agents are overly95Chapter 7. Future Work: Extension to Trafficenthusiastic about breaking and with probability p slow down by one CPT. Random shocks likethis might be an important part of why congestion forms, but we will ignore it for this phase of thework. The drivers in this model are simple path following algorithms that are unintelligent and donot adapt.Porche and Lafortune [37] use super-vehicular cells where one cell is an entire block. Carsmove at the constant rate of one CPT and they incur a delay that is not physically modeled: evenif a road is heavily congested all cars still move at one CPT. This bears some striking similaritiesto a congestion game. Like congestion games, the Porche and Lafortune simulator fails to modelhow delays can propagate through a network. As we argued earlier, this omits an important featureof the traffic problem. In particular, roads in this model can never `jam' and affect neighbouringroad segments. The drivers in this model are unintelligent and unadaptive. The vehicles' routes aresimply picked from some distribution and executed.We also note that the empirical experiments are small: Porche and Lafortune [37] conduct anexperiment with a uniform 4 ?4 grid of intersections, de Oliveira et al. [15] conduct a smaller onewith a 3?3 grid and Wiering [56] also conducts an experiment on a 3?3 grid. These experimentsare too small to realistically represent genuine route choices, although this does not matter becausethe drivers are largely unadaptive. Clearly, there is a lot of room for improvement in the empiricalsimulation of road systems, especially with adaptive and indeed strategic agents.7.3.2 Experimental DirectionsWe have not yet run any experiments on this model. However, we have indicated particularlypromising areas for empirical work.For the system:? How frequently does the system converge to a stationary or relatively stationary state?? Of these stationary states, are they in equilibrium?? How sensitive is the system to random accidents or driver mistakes?For the MAL algorithm:? For reinforcement algorithms, what is a good regression technique and what is a good set ofstate features?? What is a good way to explore without introducing too much noise into the system?? Can gradient learning algorithms be extended to extensive-form games like traffic?? What are good performance metrics for driving agents?? What are the best learning algorithm for different situations?All these questions are important and interrelated, and we look forward to eventually answeringthem.96Appendix AStratified SamplingIn this thesis we need to evaluate with a limited computational budget a large number of algorithmsand a large number of game instances. However, we want to get the maximum information fromthe simulations that we did perform. How should we go about running these experiments?For all of our experiments, we are concerned with the expected performance of a match, de-noted by f(?,zeta). Here, f is some metric function, ? similar M is a match, and zeta similar Z is a randomseed that completely determined any non-deterministic behaviour in both algorithms. The gameinstance/seed pairing uniquely define a run.When designing our experiment, we must choose whether or not to stratifying runs based on thematch. For instance, if we have enough computational time to run 100 simulations, we can eithersample 100 matches and perform a single run on each, or we can sample only 10 matches and run10 runs for each. Stratification clearly leads to a more detailed understanding of the role that ran-domization plays in each match and is critical information for algorithm design. However, for twokinds of common summary statistics?means and quantiles?one should avoid any stratification.Lemma A.0.1 If we are trying to obtain an estimate of Ebracketleftbigf(M,Z)bracketrightbig and we have a limited budget ofsamples, it reduces variance to sample from M and Z independently rather than to stratify basedon M.Proof Consider two schemes of sampling from M and Z, as seen in Table A.1. In the first scheme,M and Z are sampled separately each time. In the second scheme k samples are taken from Mand for each sample of M, Z is sampled si times.Independent {(M1,Z1),... ,(Mn,Zn)}Stratified {(M1,Z1,1),... ,(M1,Z1,s1),... ,(Mk,Zk,sk)}Table A.1: Two schemes for sampling.In both cases, the sample mean is used as the point estimator for the population. Since G andZ are sampled independently, both schemes yield unbiased estimators.However, the first scheme yields lower variance as can be seen in Equations A.1-A.3. Equa-tion A.2 follows from the fact that completely independent random variables have no covariance(Equation A.5) and so if two samples share the same strata (the same sample ? similar M) then theyhave weakly higher covariance (Equation A.4).97Appendix A. Stratified SamplingV arbracketleftBiggsummationdisplayif(Mi,Zi)bracketrightBigg=summationdisplayi,jCov [f(Mi,Zi),f(Mj,Zj)] (A.1)<=<summationdisplayi,j,k,lCov [f(Mi,Zi,j),f(Mk,Zk,l)] (A.2)= V arbracketlefttpbracketleftbtsummationdisplayi,jf(Mi,Zi,j)bracketrighttpbracketrightbt (A.3)Cov [f(Mk,Zk,l),f(Mk,Zk,m)] greaterequal Cov [f(Mi,Zi),f(Mj,Zj)] (A.4)Cov [f(Mi,Zi),f(Mj,Zj)] = Cov [f(Mk,Zk,l),f(Mm,Zm,n)] (A.5)Additionally, stratifying increases the variance of quantile point estimation. This result can befound in Heidelberger and Lewis [22], but it is provided without proof.98Bibliography[1] PTV AG. Vissim 5.00, 2008. URL www.ptvag.com.[2] St?phane Airiau, Sabyasachi Saha, and Sandip Sen. Evolutionary tournament-based compari-son of learning and non-learning algorithms for iterated games. Journal of Artificial Societiesand Social Simulation, 10(3):7, 2007. ISSN 1460-7425.[3] R. Axelrod. The evolution of strategies in the iterated prisoner's dilemma. In L. Davis, editor,Genetic Algorithms and Simulated Annealing, pages 32?41. Morgan Kaufman, Los Altos,CA, 1987.[4] B. Banerjee and J. Peng. Performance bounded reinforcement learning in strategic interac-tions. In AAAI 11, 2004.[5] B. Banerjee and J. Peng. RV: a unifying approach to performance and convergence in onlinemultiagent learning. In AAMAS '06, pages 798?800, 2006.[6] A.G. Barto, S.J. Bradtke, and S.P. Singh. Learning to act using real-time dynamic program-ming. Technical Report UM-CS-1993-002, University of Massachusetts, Amherst, 1993.[7] M. Bowling. Convergence and no-regret in multiagent learning. In NIPS 17, 2004.[8] M. Bowling. Convergence and no-regret in multiagent learning. Technical Report TR04-11,University of Alberta, 2004.[9] M. Bowling and M Veloso. Rational and convergent learning in stochastic games. In IJCAI17, August 4 ? 10 2001.[10] Michael H. Bowling and Manuela M. Veloso. Multiagent learning using a variable learningrate. Artificial Intelligence, 136(2):215?250, 2002.[11] G. Brown. Iterative solution of games by ficticious play. In Activity Analysis of Productionand Allocation, New York, 1951.[12] C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multiagentsystems. In AAAI 4, pages 746 ? 752, July 28 1997.99Bibliography[13] V. Conitzer and T. Sandholm. AWESOME: A general multiagent learning algorithm thatconverges in self-play and learns a best response against stationary opponents. MachineLearning, 67(1):23?43, 2007.[14] Juan de Dios Ortzar and Luis G. Willumsen. Modelling Transport. Wiley, 2001.[15] Denise de Oliveira, Ana L. C. Bazzan, Bruno Castro da Silva, Eduardo W. Basso, and LusNunes. Reinforcement learning based control of traffic lights in non-stationary environments.In EUMAS, volume 223 of CEUR Workshop Proceedings, 2006.[16] Center for Microcomputers in Transportation. TSIS-CORSIM 6.0, 2008. URL mctrans.ce.ufl.edu.[17] Drew Fudenberg and David M. Kreps. Learning mixed equilibria. Games and EconomicBehavior, 5(3):320?367, July 1993.[18] S. Govindan and R. Wilson. A global newton method to compute Nash equilibria. Journal ofEconomic Theory, 110(1):65 ? 86, 2003.[19] A. Greenwald and K. Hall. Correlated-Q learning. In ICML 20, 2003.[20] S. Hart and Y. Mansour. How long to equilibrium? The communication complexity of un-coupled equilibrium procedures . In ACM Symposium on Theory of Computing, volume 39,pages 345 ? 353, 2007.[21] Sergiu Hart and Andreu Mas-Colell. Stochastic uncoupled dynamics and nash equilibrium:extended abstract. In TARK '05: Proceedings of the 10th conference on Theoretical aspectsof rationality and knowledge, pages 52?61, 2005.[22] P. Heidelberger and PAW Lewis. Quantile estimation in dependent sequences. OperationsResearch, 32(1):185?209, 1984.[23] J. Hu and M. Wellman. Multiagent reinforcement learning: theoretical framework and analgorithm. In ICML 15, pages 242 ? 250, 1998.[24] Junling Hu and Michael P. Wellman. Nash Q-learning for general-sum stochastic games.Journal of Machine Learning Research, 4:1039?1069, 2003.[25] Ehud Kalai and Ehud Lehrer. Rational learning leads to Nash equilibrium. Econometrica, 61(5):1019?1045, 1993.[26] Ehud Kalai and Ehud Lehrer. Subjective games and equilibria. Games and Economic Behav-ior, 8(1):123?163, 1995.[27] C. Lemke and J. Howson. Equilibrium points of bimatrix games. In Journal of the Societyfor Industrial and Applied Mathematics, volume 12, page 413423, 1964.100Bibliography[28] A. Lipson. An empirical evaluation of multiagent learning algorithms. Master's thesis, Uni-versity of British Columbia, Vancouver, Canada, 2005.[29] M. Littman. Friend-or-foe Q-learning in general-sum games. In ICML 18, pages 322 ? 328,June 28 ? July 1 2001.[30] M. Littman. Markov games as a framework for multi-agent reinforcement learning. In ICML11, pages 157 ? 163, 1994.[31] A.D. May. Traffic Flow Fundamentals. Prentice-Hall, Upper Saddle River, N.J., 1990.[32] R.D. McKelvey, A.M. McLennan, and T.L. Turocy. Gambit: software tools for game theory.Version http://econweb.tamu.edu/gambit, 2004.[33] D. Monderer and A. Sela. A 2? 2 game without the fictitious play property. Games andEconomic Behavior, 14:144?148, 1996.[34] D. Monderer and L.S. Shapley. Fictitious play property for games with identical interests.Journal of Economic Theory, 68(1):258?265, 1996.[35] J.F. Nash. Equilibrium points in N-person games. Proceedings of the National Academy ofSciences of the United States of America, 36(1):48?49, 1950.[36] E. Nudelman, J. Wortman, K. Leyton-Brown, and Y. Shoham. Run the GAMUT: a compre-hensive approach to evaluating game-theoretic algorithms. In AAMAS 3, July 19 ? 14 2004.[37] I. Porche and S. Lafortune. Adaptive look-ahead optimization of traffic signals. Technicalreport, University of Michigan, 1997.[38] R. Powers and Y. Shoham. New criteria and a new algorithm for learning in multi-agentsystems. In NIPS, volume 17, pages 1089?1096, 2005.[39] Paramics Modeller. Quadstone Paramics Ltd., 21. URL www.paramics-online.com.[40] R Development Core Team. R: a language and environment for statistical computing.R Foundation for Statistical Computing, Vienna, Austria, 2006. URL http://www.R-project.org.[41] A. Rapoport, M. Guyer, and D. Gordon. The 2x2 Game. Univeristy of Michigan Press, 1976.[42] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning(Adaptive Computation and Machine Learning). The MIT Press, December 2005.[43] R.W. Rosenthal. A class of games possessing pure-strategy Nash equilibria. InternationalJournal of Game Theory, 2:65?67, 1973.101Bibliography[44] T. Sandholm. Perspectives on multiagent learning. Artificial Intelligence, 171(7):382?391,2007.[45] Yoav Shoham and Kevin Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic,and Logical Foundations. Cambridge University Press, New York, 2008.[46] Yoav Shoham, Rob Powers, and Trond Grenager. If multi-agent learning is the answer, whatis the question? Artificial Intelligence, 171(7):365?377, 2007.[47] S. Singh, M. Kearns, and Y. Mansour. Nash convergence of gradient dynamics in general-sumgames. In UAI 16, 2000.[48] J. C. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation andControl. John Wiley & Sons, Hoboken, New Jersey, 2003.[49] R.S. Sutton and A.G. Barto. Reinforcement Learning, An Introduction. The MIT Press,Cambridge, Massachusetts, 1999.[50] G. Tesauro. Extending Q-learning to general adaptive multi-agent systems. In NIPS 16, 2004.[51] W.K. Viscusi. The value of life: estimates with risks by occupation and industry. EconomicInquiry, 42(1):29?48, 2004.[52] J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. PrincetonUniversity Press, 1944.[53] T. Vu, R. Powers, and Y. Shoham. Learning against multiple opponents. In AAMAS, 2005.[54] J. Wardrop. Some theoretical aspects of road traffic research. Proceedings of the Institutionof Civil Engineers, Part II, 1(36):352?362, 1952.[55] C.H. Watkins and P. Dayan. Q-learning: technical note. Machine Learning, 8:279?292, 1992.[56] Marco Wiering. Multi-agent reinforcement learning for traffic light control. In Proc. 17thInternational Conf. on Machine Learning, pages 1151?1158. Morgan Kaufmann, San Fran-cisco, CA, 2000.[57] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. InICML'03, 2003.102


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items