C O N V E R G E N C E OF B E H A V I O U R RULES IN I T E R A T E D MATRIX GAMES by JONATHAN PATRICK B. Arts & Sc., McMaster University, 1997 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES Department of Mathematics Institute of Applied Mathematics We accept this thesis as conforming to the required standard THE UNIVERSITY OF BRITISH COLUMBIA October 1999 © Jonathan Patrick, 1999 In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Mathematics The University of British Columbia Vancouver, Canada Date (3/.e>/V'f Abstract This master's thesis reports on a foray into Game Theory, focusing solely on the twoperson (not necessarily zero-sum) game. Primarily, I am interested in the convergence properties of different behaviour rules and how one might proceed to introduce some form of learning into the strategies of the players involved in the game. Therefore, I begin with the introduction of some key equilibrium sets - namely the set of Nash Equilibria (NE), the set of correlated equilibria (CE) and the marginal best-response set (MBR). I briefly discuss the relationship between these three sets before moving on to describe some desirable properties of behaviour rules. From there, I introduce six behaviour rules (four from the literature, two original) that attempt to incorporate some form of learning into the game. The four from the literature are Fictitious Play, Exponential Fictitious play, Regrets 1 and Regrets 2. I have named the two original behaviour rules Past Response and Modified Regrets. I then move on to describe the convergence properties of each. This thesis was originally motivated by a talk given by Andreu Mas-Collel on the properties of the two Regrets-based behaviour rules. Thus, a fair amount of time is spent reworking the convergence proofs of both Regretsl and Regrets2 as they were developed by Mas-Collel and Sergiu Hart. I then suggest an alternative proof of the Regretsl convergence properties. I close off the paper with some numerical results from three games - a zero-sum game, a game developed by Lloyd Shapley (called the Shapley game) and a game called Battle of the Buddies. They are designed to give some numerical confirmation of the convergence theorems stated earlier in the paper as well as some indication as to where further study might be useful. ii Table of Contents Abstract ii Table of Contents iii Acknowledgements v Chapter 1. Introduction: Playing the Game 1.1 The one-shot game 1.2 Mixed Strategies: Best Response and Minimax 1.3 Co-ordinated Strategies (Measures on the Product Space) 1.4 Relationships between Equilibrium Concepts 1.5 Repeated Plays '• Chapter 2. Behaviour Rules 2.1 Overview 2.2 Properties of Adaptive Behaviour Rules 2.3 Forecasting and Response 2.4 Forecast-free Behaviour Rules 2.5 Modified Regrets Chapter 3. Blackwell's Approachability Theorem 3.1 Overview 3.2 Auxiliary Results 3.3 Concise Statement and Detailed Proof 1 :. 1 3 6 9 14 17 17 ;. ..18 • • • 20 24 ...29 ; — Chapter 4. Convergence Results 4.1 Fictitious Play, Past Response and /^-Exponential Fictitious Play — 4.2 Regrets 2 4.3 Regrets 1 4.4 Modified Regrets Chapter 5. Experimental Results 5.1 Scissors-Paper-Rock-Glass-Water . 5.2 The Shapley Game 5.3 The Battle of the Buddies Chapter 6. 31 31 34 • • • 41 45 . . . . . . 45 ....47 50 61 ®^ . . . . . . . 64 66 70 Conclusion 73 Appendix A: Graphical and Tabular Results iii 76 Appendix B: Programs Bibliography Acknowledgements First and foremost, I would like to thank my supervisor, Dr. Loewen, for his unceasing optimism and constant encouragement - not to mention his patience and willingness to give of his time. Thanks is also due to Dr. Puterman from the Management Science department for his willingness to help out in any way and Dr. Anstee for being my second reader. Equally worthy of thanks are David Urminsky and Marc Mailhot (my housemates) who had to put up with my sometimes less than congenial self on those days when the thesis writing got the better of me. (Dave also saved me from taking a bat to a few computers by providing much needed computer knowledge.) Chapter 1 Introduction: Playing the Game A mathematical game is an attempt to model the interaction between individuals or organizations. A game may have any number of "players", representing either individuals or groups. In this thesis, we will restrict ourselves to the most obviously practical and common case of a two player game. This is not a limiting restriction as most non-cooperative games can be reduced to two players simply by having each player view all the others as one unit. Thus each player responds to the outcome of all the other players' actions rather than looking at each action individually. We will refer to this second player, N, as "nature" or the "environment". The first player will be denoted by the letter M. We will, more often than not, take the position of player M with player N's point of view usually following analogously. 1.1 The one-shot game In order to illustrate the basic setup, consider the simple game called "matching pennies". In this game M tries to match the choice of heads or tails made by N, while iV tries to prevent M from doing so. Each receives a numerical pay-off of +1 for a success and —1, for a failure. Thus if JV chooses heads and M chooses tails then M receives a pay-off of —1 (penalty) and N receives a pay-off of +1 (reward). In the event that M plays strategy i, (in this game either heads, h, or tails, t), and N plays strategy j (again either heads or tails), the payoff (sometimes called utility) for player M is represented by for player N by u (i,j). N j) and Thus u (h,t) = -1 and u (h,t) = 1. Of course, if M knows M N nature's choice, in advance of making his own decision, then M will win every time ... leading to a rather uninteresting game: Even partial knowledge of iV's action will confer an advantage on M. The effects of what information is available to each player and how best to use that information is a key issue in game theory. Here we address the simplest case where both players act independently and in ignorance of the other's action. I will denote the set of possible strategies (or actions) available to player M by SM, and the set of possible actions available to nature by S^. In the matching pennies game, both M and N have only two possible actions, h and t. Their payoffs therefore can be represented in a very simple matrix form where the columns represent the strategies of nature and the rows represent the strategies of M, i.e., the top right hand entry of the matrix refers to the event that M chooses heads and N chooses tails. The entries in the matrix refer to the payoff received in each case. Thus, the payoff matrices of players M and N respectively are 1 -1 -1 1 -1 1 1 -1 It is easy to see that in any two person game where player M has m strategies, \SM\ = m, and player N has n strategies, \SN\ = n, we can represent their respective payoffs for each possible pair of strategies in two m x n matrices. Such a game will be called an m x n matrix game and is defined by its two payoff matrices. The Matching Pennies game is an example of what is called a zero-sum game. Such a game is characterized by the fact that UM = —u^. Much of the theory is simplified in this sub-class of games. 2 1.2 Mixed Strategies: Best Response and Minimax There are obviously many ways for a player in a game to choose an action. For instance, in the "matching pennies" game, M could choose to play heads, or to play tails, or, using a random number generator, to pick heads with probability p and tails with probability 1 — p, for any preselected p obeying 0 < p < 1. In more mathematical terms, this amounts to M placing a probability measure, p, on the set of possible strategies, SM- N, naturally, must do likewise, placing a probability measure, u, on the set of strategies, SN(Thus the case where M plays heads with certainty is simply the measure that places all its weight on heads and none on tails. Probability measures that place all their mass on a single action are called pure strategies; general measures are called mixed strategies.) I will denote the set of probability measures on a set S by P[S]. Once M and N have chosen their measures p G P[SM] and v G P[SN] respectively, player M's expected payoff is (1.1) where the summation is taken over all i G SM and all j G SNA natural question concerns how player M decides on a measure considering that the measure v of player N is unknown. One method is to make an (educated) guess as to how N is likely to act and then to respond in a manner that maximizes one's own expected payoff based on that guess. This is formalized in the following definition. Definition: We say that p, is an e-best response to v for M if "M(£> ) > v P M(H, t*€P[S ] S U U V) - e. M If e = 0 then we refer to p simply as a best response to v. Among the best-response measures, there is, necessarily, a measure that places all its weight on one strategy. This follows from the fact that 3 MM(A*> V) is a linear function in p. That is, it can be written as, Thus, one best-response measure, ft, places all its weight on the strategy i G S M has the greatest coefficient, J2jes M(i,j)v(j)- which It should be fairly obvious that the set u N of best-response measures to v,B[v] = argmax^p[s ]UM(p,^), M is never empty (since | SM I is finite) and may consist of more than one measure. For the purposes of this thesis, in cases where there is more than one best response we will use the arbitrary rule which chooses the best-response measure which places all its weight on the action with the smallest subscript. That is, we pick \x = 5k where k = mm{m : 6 m E B[u}}. The key question concerning the origins of this educated guess (since the actual measure, v, of player N is unknown) will be dealt with later. Of course, one need not proceed by always playing a best-response to some estimate. A more cautious player might want to maximize the worst-case outcome. Specifically, this means that player M would choose (x in order to maximize: min y). veP[s ] N The von Neumann minimax theorem asserts that, for a zero-sum matrix game, each player can insure that his pay-off does not dip below a certain value, v. In mathematical terms, this means that there exists a fi G P[SM] such that WJW(A> ) > u Vz/ G P[SN]- v Similarly, there exists a measure v G P\S^\ such that u (p,v) N Vyu G > —v 4 P[SN]. In a zero-sum game, this implies that there exists (/}, z>) G P[S ] x P[SN] and a scalar v M such that u (v, M ([1]). J>) < v < u (fr, v) V(//, v) e P[S ] x P[SN] M M Other perhaps more interesting methods of choosing a measure will be discussed later on once more complexity has been added to the game. The above discussion might lead one to question whether there might exist a pair of measures (one from each player) such that neither player would benefit from unilaterally changing measure (that is, by playing differently while assuming that the other player's measure remains the same). The following definition formalizes this concept. Definition: A pair of measures ji G P[SM] and v G P[SN] is a Nash equilibrium combination iff no single player can unilaterally increase his (expected) payoff by choosing a different measure. Thus (p,, i>) is a Nash equilibrium iff both: WM(A>^)> m a x %( u,i') J and UN(P,,I>)> H€P[SM] m a x UN(P,,U). veP[S ] N I will denote the set of Nash equilibria by NE. This is a (possibly empty) subset of the Cartesian Product space, P[SM] X JP[SW]- Note that if (/}, 0) e NE, then jj, must be a best response to v in the sense defined above. Similarly, <v must also be a best response to ft. In the matching pennies game, if v = [0.5,0.5], then UM(M, £ ) = 0 = u (fx, v) for all fj, G P[SM] N This includes the two pure strategies, \ih = [1,0] and fx = [0,1]. To find a NE pair, t (A, z>), involving the above i>, we must choose fi to ensure that 0 = UJV(A, £) > max veP[s ] UN(P-,^) N 5 - This is equivalent to finding \x such that 0 ^ max [(A(2) - A(1)M1) + (A(l) - A(2)M2)] = " max[/i(2)-/i(l),/l(l)-/2(2)]. Thus for (/}, ii) G NE, we are forced to choose fi(l) = fi(2), i.e., fi — [0.5,0.5]. Such a choice ensures that (/i, v) G NE. 1.3 Co-ordinated Strategies (Measures on the Product Space) So far, we have only considered placing separate measures on the two players' strategy sets. A natural extension is to place a measure, a, on the product space SM X SN instead. Thus, under this measure, each pair of strategies (i, j) G SM X SN is given a certain probability or weight. We will call such a measure a correlated measure. The expected value of the payoff function for player M (and similarly for player N), imposed by the measure a, is then defined by: (i,J)€S xS M N where a(i, j) is the weight placed on the pair of strategies (i,j) by the measure a. I will use the short-hand notation, UM{OI) for E (uM) a whenever possible. All this may also be written in matrix notation. The correlated measure, a, can be written as a \SM\ X \SN\ matrix A, with nonnegative entries, a(i,j), that sum up to one. In the case where a is a product measure, writing p and v as row vectors gives the matrix representation A = p v. A correlated measure thus represents a product measure T if and only if its matrix representation has rank 1. We have already seen that player M's (and player iV's) utility function can be represented in matrix form as UM (UN)- Thus M{ ) U A — tr(A UM)T 6 One might wonder why one would want to study correlated measures seeing as we are here only interested in non-cooperative games. After all a correlated measure seems to represent a form of cooperation between players. It will become clearer later as to why this is not necessarily so but an initial hint comes from the following fact. It should be evident that maxtiji(a) > m a x i t ^ / i , v) since the maximum on the right hand side is taken over a subset of the measures that are available on the left. Thus it seems possible that both players might gain more by considering product measures, even if they are not interested in cooperating. Once one has a correlated measure, it is a simple procedure to deduce the marginal measures for each player. Under a correlated measure a, the marginal measure for player M is given by: The marginal for player N, a , follows analogously. N However, a measure a e P[SM X SN] may not derive from the product of two measures, //. G P[SM] and v € PfSW]- The following example makes this clear. Consider a correlated measure that places the following probabilities on the nine pairs of strategies in a 3 x 3 game: .20 .30 .00 A = .00 .20 .05 .15 .00 .10 The marginal measure for player M is (.5, .25, .25) and for player N is (.35, .5, .15), but there are no marginals that, when multiplied together, will produce the measure, a, given above. An easy way to ascertain this is to note that the rank of the above matrix is 3. 7 The introduction of correlated measures leads to a redefinition of the concepts of best-response and equilibrium: Definition: A (correlated) measure a has the max UM{/J>,CXN) < % ( a ) marginal best-response property and max if, UN(O>M, V) < UN(CX). The marginal best-response property is called exact if the above two equations are actually equalities ([4]). I will denote the set of measures with the marginal best-response property (not necessarily exact) by MBR. Definition: A correlated measure a is called a correlated if the following two equilibrium equations are satisfied: U {.OL) > M u (a)> N ^ XI u (ip(i),j)a(i,j) M N(i,(l>(j))a(i,j) u VV> : S M V<> : S N -> S M S. N I will denote by CE, the set of correlated equilibria. In words, a correlated equilibrium is a probability distribution or'measure that assigns probability to all possible combinations of players' strategies in such a way that no one player can, by himself, increase his expected gain by redistributing the measure. One way of visualizing this is in terms of the payoff matrix. Suppose we hold the probability of each entry in the matrix A fixed. Then a correlated measure is in CE if and only if player M (for instance) cannot increase his expected payoff by relabelling the rows in any way. In the matching pennies game, the possibilities for relabelling are 1) to assign tails to the first row and heads to the second, 2) to assign tails to both or 3) to assign heads to both. 8 1.4 Relationships between Equilibrium Concepts The set of correlated measures with the marginal best-response property, MBR, contains but need not equal CE. To see this, recall that the best response set to any v G P[SN] includes a pure strategy. Hence the best response set to ajv includes a pure strategy, which we will assume places all its mass on A; G SM- Thus, for the constant function tp : S M -» S M = k for all % G S , we have: defined by M max UMIPICXN) i*eP[s ] = WM(&,Q!JV) = X M M(k,j)a (j) U N jes N If a G CE then RHS < % ( a ) , so the first condition defining MBR is satisfied. Since we could do as much taking the position of N (and therefore show the same property for player A T ) , this is enough to prove that CE C MBR. Moreover, max «M(A ,C*N) actually equals 2^ies ies M^{'i)- J) {i-,!J) t U M M N a l f ° some r constant function tp. But it is easy to produce examples where there exist non-constant ip which outperform every constant choice. Consider the correlated measure, a, which assigns probability as before: .20 .30 .00 A = .00 .20 .05 .15 .00 .10 Further, suppose that the utility matrix for player M is the identity matrix. Then %(d) = 0.5 and a M = (0.50,0.25,0.25) and a N 9 = (.35, .50, .15) Thus, max UM{p, Qijv) = 0.5. /*eP[s ] M Thus a is in MBR. However, if player M chooses -0(1) = 2,-0(2) 2 and ^(3) = 1, his expected payoff is: u (il>(i),j)a(iJ) M = .3 + .2 + .15 = .6S This is higher than % ( « ) = .5 so a 0 CE. In this example, the inequality is strict proving that CE C MBR. The difference between a Nash equilibrium and a correlated one is that a Nash equilibrium assigns probabilities to each player's possible actions on their own while a correlated equilibrium assigns probabilities to all possible pairs of actions of players M and N. To highlight the potential usefulness of correlated equilibria, consider, for example, the simple game called The Battle of the Buddies ([9]). To understand this game, consider two friends, Dave who wants to go to a hockey game and Marc who wants to go to a movie. Each, however, would rather go to his friend's choice than do his own thing alone. Thus Dave's first choice is for both of them to go to the hockey game but his second choice would be for both of them to go to a movie. (Of course, we are considering the situation in which they both choose independently.) Thus we can define the game by the following payoff matrices: 1 0 5 0 U N 0 1 0 5 As an aside, there is a legitimate question here about the plausibility or even the desirability of quantifying rewards. In order to be useful, game theory must necessarily reduce all rewards to a quantity. It cannot deal with qualitative rewards which may 10 nonetheless figure prominently in the decision process. This is an unfortunate limitation and one that needs very much to be kept in mind at all times. You cannot quantify a friendship. Nor can you quantify the value of a person's job, or the effect of an action on the environment. An excessive focus on the mathematics behind a problem may lead one to allow what is quantifiable to trump the qualitative aspects, to the detriment of all involved. Returning, to the above example, let us first determine the existence of Nash equilibria. Recall that for a pair of measures, (/*, u), to be a Nash equilibrium it must satisfy: > M A UMII^,^) X V£P[SM] =• 5£(1)*>(1) + (1 - A(l))(l - HI)) > max [5/i(l)i>(l) + (1 - /x(l))(l - P(l))] »eP[s ] M =>/i(l)(6i>(l)-l) > max >(1)(6/>(1)-1)]. V€P[S ] M There are therefore three possible cases allowing for the existence of a Nash Equilibrium. First, if 6i>(l) - 1 > 0 then p,(l) must be equal to 1. If, 6z>(l) - 1 < 0 then fi(l) must be equal to zero. The other case requires that z>(l) be equal to 1/6. Since we can do much the same using N, it is easy to verify that there exist only three Nash equilibria - with [>, v) = [(0,1), (0,1)], [(1,0), (1,0)] or [(5/6,1/6), (1/6,5/6)]. The corresponding payoffs are [UM,U ] N = [1,5], [5,1] and [5/6,5/6]. The following diagram illustrates the possible payoff region provided we restrict ourselves to measures derived from the multiplication of marginals. However, the set of correlated equilibria is much larger. Recall that for a measure, a, to be a correlated equilibrium, it must satisfy: %(a) > u (ip(i),j)a{i,j) M (.i,j)€S xS M 5d(l, 1) + d(2,2) N > M</>(1), 1.) + u (rp(\), 2)a(l, 2) + u (^(2), M + «M(^(2),2)a(2,2) 11 M l)a(2,1) Figure 1.1: Payoff region allowing only for measures derived from the multiplication of marginals as well as, u (&) N > X u (i,<j)(j))a(i,j) N (hj)es xS M => d(l, 1) + 5o(2,1) > N t/^(l, 0(l))o(l, 1) + *(2))a(l, 2) + ujr(2, #(l))a(2,1) + Utf(2,#2))a(2,2) for arbitrary </>, ^ : {1)2} —» {1,2}. This implies that if a measure a, is a correlated equilibrium then the following two conditions must be satisfied: o(l,l) > 6(2,2) > 5d(2,l) 5d(l,2) These lead to a set of measures rather than three points. The following picture depicts the possible payoff region provided we allow for any correlated measure. Note that these two diagrams highlight the potential usefulness of correlated measures that do not derive from the multiplication of two marginal measures. The payoff pair (3,3) which seems to be the most obvious compromise is included only in the second diagram and not the first. 12 Figure 1.2: Payoff region allowing for all measures Now, if we restrict ourselves to the set of correlated measures of product form, i.e., where then all three sets (CE, MBR and NE) are indistinguishable. That a — a a>N, M MBR and NE are equivalent under this restriction is obvious from the definitions. (MBR's definition is precisely that of NE in this case.) To show that CE and MBR are also equivalent, we need only show that some constant transformation, SM, is an upper bound on ip : SM 2~2ies jes M M{i>{i),j)oi.(i,j) u N if>(i) = k for all i G for all possible transformations SM- It would then follow that MBR=CE since the maximization in the definition of MBR turns out to be the largest constant transformation. Now, since we are in the case where a is a product measure, X u (ip(i),j)a{i,j) M = X = X ies M u (ii)(i),j)Q (i)a {j) M ( Yl \jes N U M N M(V>W> J W ( j ) ) / Oi {i) M Furthermore, since SM is finite, it follows that there exists an k G SM such that u (k,j)a (j) M jes N N > X M(i,j)®N(j) j&s u for all i G S M N Thus, letting ip(i) — k for all i G SM will give the desired upper bound and therefore C E 13 and MBR are equivalent. 1.5 Repeated Plays Of course, game theory is hardly confined to one-shot games, the application of which is fairly limited. In the more involved scenario of repeated play, player M must impose a measure on his set of strategies, SM, each time the game is played. (It is hopefully clear that actual actions are single-strategy choices, but that the choice of which action to play may very well be governed by a pre-chosen probability measure over the set of all possible actions.) In the simplest scenario, a player may decide to use the same measure for all rounds of the game. Thus his empirical distribution.of play (that is the distribution of the actual history of the game) derives from repeated draws from a fixed distribution. He may, on the other hand, decide to vary his mixed strategy at each round. (In fact, if he doesn't and the other player is at all rational then he is not likely to fare very well in the long run - unless he happens to have hit upon an equilibrium.) How, and for what reasons, a player might decide to vary his play is a question at the heart of this thesis. The most logical idea would be for each player to attempt to learn from the history of the game. So, for instance, if t rounds have already been played, then each player has the experience of t rounds from which to draw in order to better gauge the behaviour of his opponent. For ease of notation, we can define a history matrix, h : 1 at M s (l)...s (t) N S (l)...S (t) B b M N M N where SM{k) is the actual strategy played by M during round k (similarly for player iV). I will denote by S M x S , the set of all histories of length t. Thus, M's measure or N 14 behaviour rule for round t + 1 may be a function of h . We will assume, however, that 1 each player's payoff matrix is stationary with respect to time. The matching pennies example illustrates this notation perfectly. If M were to notice that, no matter what he chooses, nature seems always to choose heads (i.e. s = (h,h, ...,h)) then it would make N sense for player M to do likewise. Thus game theory becomes interested in the incorporation of learning into a game. This is, obviously, no trivial question. How do you incorporate into a mathematical model the idea that people learn from their past experience? We will look at some attempts to do just that later on in this paper. The bulk of this paper will be concerned with the long run performance of a number of different behaviour rules (policies used to determine action) that attempt to incorporate learning. More specifically, I am interested in each behaviour rule's differing convergence properties and whether or not they do actually converge to some stable equilibrium where each player eventually sticks to one measure (or set of measures) over his set of strategies. Of course, one might still wonder what purpose there is in discussing equilibria. What reason do we have for believing that players will eventually come to decide upon measures that lead to an equilibrium? The study of game theory itself assumes a certain amount of rationality in the players. (If you take issue with that assumption then game theory will have little to offer.) The idea behind the study of equilibria is that, given time, the players will come to some compromise. It is this idea of a compromise that equilibria are meant to represent. However, not all correlated equilibria result in great compromises. Recall, for example, the Battle of the Buddies where two equilibria resulted in one player caving in every time. So even if we can insure convergence to CE, this may not be all that desirable a result. Thus, even within CE, we may want to insure convergence to certain specific 15 distributions. Questions of how to incorporate learning into a game and how to insure convergence have, of course, been around for a while. A number of methods of deciding on one's play (behaviour rules) have been developed which will be discussed in the remainder of this thesis (along with some original ones). I will be focusing specifically on their differing convergence results - what they converge to (if at all) and under what circumstances. 16 Chapter 2 Behaviour Rules 2.1 Overview Though the concept of a behaviour rule has already arisen, we have until now left it somewhat informal. This needs now to be remedied. Definition: A behaviour rule for M is a measure-valued mapping, p : \J >o(S t M x S ) —> N P[SM] which returns a mixed strategy for M based on a given vector of observed plays. In other words, we imagine player M choosing a fixed mapping p, and choosing as the mixed strategy to play at time t + l, the measure, p t+l = p(s (l), s (l),SJV(£))- M N Of course, it is quite possible to have a behaviour rule that only takes a portion of the history into account or which even ignores the history completely. For instance, in Matching Pennies, one might choose to play heads all the time no matter what happens (a constant mapping) or to play heads in the morning (first n rounds) and tails in the afternoon (next n rounds). We will, however, concentrate on history dependent rules. Any given pair of behaviour rules (p, v) defines a probability distribution over all possible histories of any fixed length t. I will denote by p(p,u)[h ] - the probability t that the history h* will result given that M uses the fixed rule p and N uses the fixed rule v. Recall that we are assuming that each player acts independently. Similarly, if H CS L x S , we can define p(p, ^[H ] as the probability that a history h} G H will 1 M L N 17 result given that M uses the fixed rule \x and N uses the fixed rule v. One means of evaluating a behaviour rule is to determine whether it guarantees a certain lower bound pay-off. To that end the following two quantities are useful. First, for any history vector, b} = ( s ( l ) , s ( t ) , Sjv(l), M M s (t)), we can define the time-, N averaged realized pay-off, 1 * Unitf) = -^2U (S {T),S (T)). M M N T-l Second, a useful performance index associated with a particular history h} is UMih ) = max w (//,F(/i*)) 1 M where represents the empirical distribution. of play of player N up to time t. In words, this represents the pay-off to M if N plays the empirical measure of iV's past plays and M plays the corresponding best response. 2.2 Properties of Adaptive Behaviour Rules To express desirable properties for M's behaviour rule ji, several criteria have been proposed. Each involves a different assumption concerning the behaviour rule of player N and attempts to insure that player M's time-averaged realized pay-off exceeds some lower bound. The first criteria, called safety, allows player N to use any available behaviour rule, v, and seeks to guarantee that M receives at least his min-max pay-off. This property is obviously desirable, but perhaps not essential. It seems reasonable to think that in pursuing safety, one might be forced to ignore opportunities for much larger gains. . Safety: A behaviour rule, fi, is said to be e-safe for player M if there exists a t such that for any possible behaviour rule, u, available to nature and for any t>t 18 there is a subset of histories of length t, H ' c S ^ x S , with p(p, > 1 - e, such that N UM^) + e > min max'u (/x, i/) M V/i*e#* A behaviour rule is said to be safe if it is e-safe for every positive e ([4]). A second proposed criterion is consistency. This property is satisfied when player M does no worse than if he had played a best response against the empirical distribution. There are no less than three variations on the definition of consistency, each making different assumptions on the behaviour rule of player N. In the first definition given below, player N is assumed to employ a specific constant behaviour rule, v. In the second definition, the behaviour rule is again assumed to be constant but now player N can use any fixed measure, v £ P[SN]- Finally, in the definition of universal consistency, all restrictions on player TV's behaviour rule are lifted. In each case however, the goal is the same. Consistency: Let v be a fixed measure in P[SN}- A behaviour rule, p, is said to be e-consistent against 0 if there exists a t such that for any t > t there is a subset of histories of length t, H* C S M x S , with p(p, > 1 — e, and N U {h}) + e > U {ti) M M 'WeH 1 ([4])If the above holds true for all fixed measures v € -P[»SV] then p is said to be econsistent. In words, a behaviour rule is e-consistent if, given that the mixed strategy played by nature is constant, M does about as well as playing a best response against the empirical distribution of play. A behaviour rule is said to be consistent if it is e-consistent for all positive e. 19 Finally, a behaviour rule, ft, is said to be e-universally consistent if there exists a t such that for any behaviour rule v available to nature and for any t > t there is a subset of histories of length t, #* C S M x S , such that p(p, > 1 - e, N U {ti) + e>U {ti), M M WeHK The key change here is that if ft is e-universally consistent then v is allowed to vary with time (i.e. there is no guarantee that v = u for i ^ j). Again, a behaviour rule is said l j to be universally consistent if it is e-universally consistent for all positive e ([4]). It is easy to show, directly from the definitions, that universal consistency implies safety. Universal-consistency of a measure, ft, implies that for all e > 0 there exists a t and a set of histories H . C S l M xS N such that, for any behaviour rule v by player N and for any t > t, p(p, v^H } > 1 — e and 1 But max Mjvf(//, z7(/i )) > min^fmax^ UM{P, V)\, for all /i*ei?*. So the above inequality 4 M implies the defining inequality for safety. 2.3 Forecasting and Response It is useful to divide the larger set of all possible behaviour rules into two subsets. The first subset of behaviour rules attempts to predict TV's play and then proceeds accordingly. The first step in this process - prediction - is the function of a forecast. A forecast is an attempt to determine how the other player will act (possibly expressed probabilistically) in the next round. The most obvious forecast is the empirical average of the opponent's past play. Thus if the game has been played for t rounds and N has played strategy i, x number of 20 times then the forecast would simply predict that player N will play strategy i with a probability of x/t. Thus, we define 1 * ^ ) = 7 E 7 i M ^ . j = l,2,...,\S \, N '(2-1). r=l representing the empirical probability of playing strategy j observed in the history, h*. Here, Ij(p) = 1 if p == j, and zero otherwise. More generally, a forecast, 7, will generate, after round t of the game, a /c-tuple, v = (i>i(h ),Vkih})) t where fc = \S \ and is M's forecasted probability that N nature will play strategy j at time t + 1. The forecast, of course, depends on whatever factors player M deems relevant. It would seem reasonable, however, to let the forecast depend only on the past history, h} € S* x S . Thus a forecast is a function, 7 : l M U ~ i SM x N &N ~> P[SN]- Calibration: The idea behind a calibrated forecast is that eventually the empirical distribution of play of the opponent (nature) should converge to that which is predicted by the forecast (or alternatively, the forecast should adapt to fit the empirical distribution). If this occurs then the forecast is said to be well-calibrated. Naturally, this idea can be represented mathematically. Fix v £ P[SV] and a sequence of plays as recorded in history matrices h , h ,h},... 1 Let N(v,t) denote the number 2 of the first t rounds where M's forecast, 7(/i*), generated the measure v. This can be written as: iV(/>,t) = ^ J , ( ( / i ) ) t 7 Let p(i>,j,t) be the fraction of these rounds for which nature plays j. . . P\v,3,t)=\ ifAT(M = 0 Jo { N^f)^r=iW{h ))Ij{sN(T)) T 21 otherwise The forecast 7 is calibrated with respect to the history sequence h , h ,... 2 l limV VjeS | ^ £ ^= o t—>oo — ' if: N t z where the summation is taken over all possible v ([3]). In other words, calibration merely insists that, in the long term, the forecast agrees with reality. However, it is clear that a good forecast need not guarantee a good behaviour rule or even that a good forecast is necessary for a good behaviour rule. The following are a couple of forecast-based behaviour rules that seek to make good use of forecasts in order to incorporate some form of learning into a player's behaviour. Fictitious Play: One of the most popular behaviour rules is that of Fictitious Play (FP) which attempts to make an educated guess (updated after each round) at the measure chosen by the opponent and then plays a best response to this guess. Player M makes some initial guess at nature's measure, u , and then proceeds to modify it using the empirical proba0 bility distribution, F(/i*), which is accumulated as the game progresses. Mathematically, this "educated guess" or forecast takes the following form: rule, n +t n +t 0 0 where n is a fixed number ([4]). The FP behaviour rule is then given by 0 fi t+1 = argmax p s ]U (tJ','y(h )) t lie [ M M (see definition of best response, chapter 1, pg.4). In the cases where there is more than one argmax, recall from chapter 1, pg. 4, the rule for deciding on a pure strategy. As time goes on, F P places less and less weight on the initial guess, u , and more Q and more weight on the accumulated empirical distribution, Z/(/i*). The long term effect 22 is that the forecast rule eventually mirrors the empirical distribution of past play by the opponent (and is thus calibrated). As time goes on, therefore, FP begins to behave more and more like a best-response to the forecast 7(/i ) = V(h}), the empirical average of 4 play. Thus we have a form of learning incorporated into the behaviour rule. This rule is deterministic in the sense that although it forecasts a probability for each strategy available to the opponent, its output is a pure strategy. Though FP is consistent against certain measures it is not safe, as Fudenberg and Levine have shown ([4]). Therefore it is not universally-consistent either. Fudenberg and ' Levine did show that, given a situation where switching between actions becomes more and more infrequent, FP does turn out to be consistent (see the Shapley game discussed below). The lack of universal consistency has consequences for the convergence properties of FP which will be discussed later. Past Response: The past response behaviour rule, PR, is much like FP. The real difference occurs in the method of forecast. Suppose player M plays strategy j in round t. Whereas FP's forecast is based on the empirical distribution of the entire game up to the present, PR forecasts based only on the empirical distribution of those plays by N that followed a round in which M played j -that is, on observed "past responses" of player A?" to action j. Like FP, it would then place positive measure only on those actions which are a best response to this empirical distribution. Again an initial guess, v , is needed. Assuming 0 sjif(t) = j , PR's forecast rule for round t+1 has the following form: / v ' n +t no + t 0 ' 3 where no is a fixed number and u{h^) is the empirical distribution of play by N based 23 only on those rounds r < t, where s (r - 1) = j. Specifically, for all k € S , M N t (h))[k] = y X v t J i M T - WkMT)) where T = T=2 J2 k(s (r)). I M T=l PR therefore is an attempt to account for learning in the opponent while at the same time learning oneself. For the same reasons as FP, however, PR is neither safe nor universally-consistent. This being said, PR does seem to do better against the other behaviour rules (at least in the Shapley game - see Chapter 5) and may be an example of how insuring consistency requires that one forego certain opportunities for greater gain. PR, though doing better against the other behaviour rules, when played against itself has a much lower payoff than any of the others. Thus it has the possibility of much greater gain and the risk of much greater loss. 2.4 Forecast-free Behaviour Rules Both FP and P R depend on h} indirectly through an explicit forecast, 7(/i*). However, an intermediate forecast is not essential as the next four behaviour rules demonstrate. Though they all rely on the history of the game, they never directly attempt to forecast the measure of the other player. K-exponential Fictitious Play: K-exponential FP, an invention of Fudenberg and Levine, assigns probability to each strategy, i £ SM, in the following way: Wj exp [Kufavjh*))] t W where w(i,i/(/i*)) = J UMi^u^ih )) 1 U £,eSM™i PMi,^))] e x (see Chapter 1, equation 1.1). Here, w\,w\s \ M collection of positive weights independent of time t and chosen in advance ([4]). 24 is a At first glance, this behaviour rule bears scant resemblance to FP. Looking for similarities, the first most obvious is that, like FP, it also depends on the empirical distribution of player N's play, namely V(h}). In fact, closer examination shows that /^-exponential FP is a smoothing of the best-response strategy of FP that places equal weight on all maximizers. In other words, for K sufficiently large and for fixed weights, Wi > 0, this behaviour rule assures that player M will play an e-best response to with high probability and the remaining strategies with small probability. To see this, consider equal weights, IUJ, and define w(i,I7(/i')) = m for ease of notation. The best response would simply be the argmax over i of {itj : i = 1,.., |SM|}- ^-exponential play however now has the form: E i expf/cuj-] 1 V • exp[/c(Mj - Ui)) Ej: U j = U i 1 + Yty.UjKm exp[-K(«i - Uj)] + £j : t y > t t i eX [«(^ P Ui)} Now, if we let K go to infinity, we will have two options. If i is a best response then the third sum will be empty and the second sum will go to zero, leaving only the first sum. If i is not a best response then the third sum will be non-empty forcing to zero. Thus, lim = < I 0 1 if i is a best response otherwise So, as K goes to infinity, ^-exponential FP converges to the best response measure that places equal weight on all maximizers. Unlike FP, ^-exponential FP does allow for other strategies to be played if only with small probability. This is called a tremble from the best-response. Exponential FP does, in some sense, still use the empirical distribution of play as a forecast even though it is not explicitly stated as such. The major advantage of the ^-exponential FP over standard FP is that M's measure 25 now depends smoothly on the empirical distribution. Thus, since the empirical distribution adjusts as the inverse of the sample size, M's play cannot oscillate wildly. As Fudenberg and Levine have shown, it is this property of ^-exponential FP which causes it to be e-universally consistent ([4]). Regrets 1: This behaviour rule was developed by Andreu Mas-Collel and Sergiu Hart. The idea is to define a function that in some sense measures the "regret" that M has for having played one strategy in the past rather than another. Suppose the game has already been played for t rounds. Then, given any two strategies j, k G SM, the regret function for player M is defined in the following way: ri _ y RM(J,k) X - ( M{k,s (r)) u N U (J,S (T))) m N T<t:s {r)=j M where SJV(T) is the strategy played by N at time (or round) r and [z] = max2;, 0 ([8]). + This quantity measures the difference in payoff between what M received and what he could have received if he had consistently played the strategy k whenever he had, in the past, played strategy j. Note that R (j,j) M = 0. (This function, R , M obviously requires a certain amount of knowledge about M's payoff function which may or may not be reasonable depending on the application.) Computing this for every possible strategy j £ SM yields a x \SM\ \SM\ matrix, R , M where the (j, k)th entry is R (j,k) M as defined above. We can convert this into a stochastic matrix, P , by dividing by an appropriate M constant, K, and replacing the diagonal entries by 1 minus the sum of their corresponding 26 rows. In other words, PL = I+\{#M-diag{& e)} (2.2) M where e is a column of ones and K - which is independent of both time and history - is chosen to insure that the sum of the non-diagonal entries in each row is strictly less than one. The regrets 1 behaviour rule then proceeds in the following manner. Given that •SMOO = j, the measure or behaviour rule chosen by player M is simply the jth row of the matrix P . In detail, l M n (k) = » u) = i - t+l t+1 ±& (j,k) yk^j M E (2.3) (-) 2 4 Note that unlike FP, this rule assigns positive probability to all strategies having positive regret as well as to the previously played strategy. This differs from fictitious play which assigns positive probability only to those strategies that are a best response. This regret 1 rule however is not universally consistent, so Mas-Collel and Hart derived the following method, also based on regrets, that is universally consistent. Regrets 2: The idea behind this method is to find an invariant vector, q, to the probability matrix Pj^. That is, q must satisfy the following equation: t qtp* = q l (2.5) where q is a row vector of length \SM\ whose components sum to one. That such a vector l q exists is a result of the following theorem (taken from Isaacson and Madsen, [5]): l 27 Theorem: A l lfinitestochastic matrices P have 1 as an eigenvalue and, moreover, there exist non-negative eigenvectors corresponding to A = 1. In fact, Isaacson and Madsen actually show that there is a left-eigenvector corresponding to A = 1 that has non-negative components that sum up to one ([5]). (Note that the subscript M (or N) has been dropped on R, P and q for ease of notation.) By multiplying through by K, equation 2.5 can be written as: nq* = Kq I + q [R — diag(R e)\ l ^0 = l f t q* (R* - diag(R?e)) (2.6) The regrets 2 behaviour rule assigns probabilities to playing each strategy i G SM, based on solutions to the above equation. That is, v : G s.w //'• (0 r! ? for some q satisfying equation 2.6. l Both regrets behaviour rules have an interesting interpretation in terms of Continuoustime Markov Chain (CTMC) theory. If we let Q* = R* - diagiffe) then Q* has the properties of an infinitesimal generator of a CTMC. That is, -Q*M= Q%j) Vi G SM• So we can interpret the regret, i?*(j, k), as the rate of discarding strategy j in favour of strategy k. Therefore solving equation 2.6 is equivalent to solving for the stationary distribution of a CTMC. In essence, Regrets 2 runs a fictitious CTMC, between each round of the game, and uses its stationary distribution as the measure for the following round. In terms of this matrix Gf, the stochastic matrix, P*, has a very simple form:. P' - / : V . K 28 Interestingly, this is precisely the formulation for the stochastic matrix derived by the uniformization of the CTMC process (see [5] or [6]). Regrets 2 has been shown by Mas-Collel and Hart to be universally consistent. 2.5 Modified Regrets The idea for this behaviour rule stemmed from the results of the Battle of the Buddies game. When the game was played using any of the above behaviour rules (Regrets 2 not included due to programming difficulties), the result was that one of the two friends caved in to the wishes of the other every time. Which one ended up caving was entirely dependent on what happened in the first few rounds. This hardly seems satisfactory though not entirely unexpected. Regretsl for instance only changes action if there is a positive regret while holding the other player's action constant. Now if we assume that player iV chooses his preferred action then there will never be a positive regret for player M, as his choice of payoff is between one (if he caves in) and zero (if he doesn't). Thus, M will inevitably cave in even though his gain will remain one by not switching and might be 5 if he switched and convinced his friend to do likewise. In other words, all these behaviour rules do not take into account the ability of one player to affect the action of the other. This, I think, is a major defect but not one that is easily overcome. For instance, how does one go about including in one's idea of a regret, the fact that if one had played differently one might have been able to change the, action of one's opponent? In the standard Regrets-based behaviour rule, the regret for not having played action k when one actually played j is simply a matter of the difference between what one would have received if one had played k and what one actually did receive, assuming the other player does not deviate from his play. Somehow one would like to also incorporate into the idea of a regret, R(j, k), some means of determining how 29 much one could possibly receive if, by switching from strategy j to k, one also persuaded one's opponent to switch as well. It is this idea that motivated the following variant of Regrets 1 which I will call Modified Regrets (MR). MR also looks at the difference between what one could have received if one had played strategy k in the past when one actually played strategy j but now we allow the strategy of nature to vary over the whole set SN. Thus, if j was played in round t then the modified regret is defined in the following way: E {uM(k, S (T)) N - u (j, M S (T))} n T<US (T)=J M + IM!I + I M | £ 2 1 11 1 2 1 E where M = max(U ) and M = min(U )M 2 (2.7) r<t:s {r)=j ieS M x W M - « M ( J , * ) } M N Note that the first sum is simply the "normal" regret from the Regrets 1 behaviour rule. The second sum allows player M to account for losses arising because he has not tried to force the other player to switch strategy. The weighting on the second sum turns out to be useful since M R is of little benefit when the game is zero-sum. Thus, with the above weighting, MR reduces to normal regrets 1 when we have a zero-sum game. Unfortunately, the above MR behaviour rule causes both players to be too stubborn in the Battle of the Buddies game (whereas, with the other rules, they are too compliant) so some idea of a compromise has to be introduced. The necessary twist is to have player M use MR unless his own payoff has diminished over the last two rounds due to a change in strategy by player N. This turn of events is likely to occur because player N sees some sort of benefit to playing his new strategy. Thus in such a scenario, player M, for that round only, responds with a best response to the last strategy played by his opponent. Results from this version of MR will be discussed later but turn out to be fairly encouraging. 30 Chapter 3 BlackwelPs Approachability Theorem 3.1 Overview Before analyzing the performance of the different behaviour rules, it is useful to make a short digression into a versatile and abstract result called Blackwell's Approachability Theorem (BA.T.). Though only explicitly used in the proof of the convergence results of Regrets 2, B A . T . also formed the spring board for Regrets 1 which in turn led to Modified Regrets. In order to give an intuitive understanding of this theorem, we need to define a general "pay-off" function, WM(S*) which is the payoff to player M when the players play s* = s (t)). This WM(S') need not be a scalar, so in this general setting, we define N it as a vector (or matrix) in R . function, VM(^) = L We also define a generalized time-averaged pay-off \ J2 <t M(S ), also a vector in R . Thus, %(s ) is the realized pay- V L T T T off in round r and VM(^*) is the realized, time-averaged pay-off for a particular history of length t. An underlying assumption in the following discussion is that all pay-offs lie in the same bounded subset of R . L The question that B A . T . attempts to answer concerns whether or not it is possible for M to insure that V {h ) approaches a pre-determined set C with probability one. t M 31 Blackwell's approach is to impose a condition. Suppose that for any Vjvf(^*) £ C there exists a measure p 6 P ^ M ] (dependent on V A / ( ^ ' ) ) [«M(/i, j ) - p r o j V ( / i ) ] n < 0 t c where n = Vjf (h*) — projcVMih?) in C to V M ( ^ ) ) a n d v (l^,j) M = M S U C N T N A T : Vj e S N (that is, the outward normal of C at the closest point (see Chapter 1, equation 1.1). Geometrically, this implies that for all s N G Sjy, the expected-payoff in the next round, VM(IJ>,SN), will be, relative to the hyperplane through PTOJCVM^), on the same side as C (as demonstrated in Figure 3.1 below). This is called the Blackwell Condition. The dependence of the choice for p on VM(^*) and the set C is clear but will not be specifically indicated in order to ease the notation. Figure 3.1: Geometrical representation of Blackwell's condition Blackwell's theorem asserts that, given any predetermined set C, if the above condition can be satisfied then M can insure that VM(^*) approaches C with probability one. 32 Before delving into the actual proof, I will give a somewhat geometrical interpretation of why B.A.T. makes sense. Given any p e P[S ] and j e S , the expected value of M VM{h ), t+l when AT plays strategy j £ S N N and M uses measure p as a behaviour rule, can be written as follows: V (li, j)[h ) t+1 M = j^V^h*) + j^v (p,j). M Furthermore, if p is chosen to satisfy the Blackwell condition and t is sufficiently large, then for all j e SN, VM{h, j)[h ] is contained in the circle with center projcVM{h ) and t+l radius 1 1 — p r o j c V M i h } ) \ \ . of VM(^*) and v {^,j) M l This follows from the fact that it is a convex combination (which lies on the same side of the hyperplane as C) and t is large. Using this fact, it is easy to show that the euclidean distance between C and VM(IV), denoted by d(Vjvf(/i*), C), cannot increase very fast as t goes to infinity (since.larger t favours smaller "correction" from VM^*))- Indeed, d(V (h ),C) t+l M - d{V (ti),C) M < \\V (p,,s )[h ] t+1 M N -PTOJCVMWW = l l ^ y M ^ ) + j^jv (p,s ) M -\lV„(li') - N - ||V (tf) - p r o j V M c -PTOJCVMWW proj V (ti)i: c M This last expression obviously goes to zero as t goes to infinity (since all pay-offs are bounded). A precise computation shows that rf(14f(/i ), C) doesn't merely converge to t some limit greater than zero and remain bounded away from C but actually goes to zero itself. Thus by the Law of Large numbers, since the distance to C in expectation goes to zero so must the realized distance between the time-averaged pay-off and the set C. Blackwell's more detailed proof is outlined in the next two sections which may be skipped if the reader so desires. 33 M 3.2 Auxiliary Results Blackwell's proof of his Approachability Theorem depends heavily on a version of the Strong Law of Large Numbers (SLLN), due to Blackwell himself, which he uses in order to prove a key lemma. These (Blackwell's version of the SLLN and the lemma) I will state first before diving in to the actual proof of B A . T . Blackwell's version of the SLLN is as follows: Theorem: If Bi,B ,... is a sequence of random variables such that \Bk\ < 1 and there 2 exists a u > 0 such that ...,£*_!] < - u m a x ( | 5 * | | £ , . . . , £ _ i ) 1 fc then for all a e R Prob{Bi + ... + B > a for some k} < k I will not attempt to prove the above theorem here but refer the reader to the bibliography ([2]). For the following lemma, however, I will provide the proof, essentially as Blackwell presented it ([1]). Lemma 3.2: Let (5\<5 ,... be a sequence of random variables for which there exist con2 stants a, b and c such that, with probability one, 0'<<y*<o VteN |^+i_ t|<_A_ 5 E[6» \5\S*,...,S<]< l (3.1) vteiV (1-^)5* + ^ (3.2). '(S3)' VtGiV Then 5* -» 0 almost surely (a.s.). Indeed, for every e > 0, there exists a T depending 0 34 only on e, a, b and c such that for any 5 satisfying equations 3.1, 3.2 and 3.3, we have 1 Prob{8 > e for some i > T } < e. (3.4) 1 0 To see that this conclusion is truly equivalent to almost sure convergence, first assume almost sure convergence and fix n > 0. Then {UJ : 6 -> 0} = f = {UJ : Ve G (0,n) n [u e e ( 0 i 7 ? ) 3T s.t. Vi > T,6 t r{u; : S < e, Vi > T}] l TGA < e,} As e decreases, the above sets that form the intersection shrink in a nested fashion. Therefore, -> 0] = inf P [U {<5* < e, Vi > T}1 T (3.5) ee(0,v) And as T increases, the above sets that form the union also increase in a nested fashion so, P{5*->0}= inf supP{<J*<e, T Vi>T} Assuming 5 —> 0 a.s. implies that for all e > 0, 1 sup P{5* < e, T Hence there exists a T such that P{<5* < e, 0 Vi > T} = 1 Vi > T } > 1 — e which is equivalent to 0 equation 3.4. Conversely, if equation 3.4 holds, then for each e G (0,77) we can find T = T such e that P{S < e, t Vi > T J > 1 - e. Consequently, from equation 3.5, P{5* ->• 0} > inf [1 - el = 1 - 77. 35 Since this is true for any rj > 0, equation 3.4 implies almost sure convergence. Proof of Lemma: Fix any e > 0 and any t > 2. We first prove that there exists a t\ > t depending 0 0 only on t ,a,b and c such that 0 ProbiS > e/2 for t < t < h} < e/2. 1 (3.6) Q To see this, define for t > to, f a = < 5 l if 6 > 0 for t < i < t i 0 I 0 otherwise Then, cu* < e/2 implies that <5* < e/2 for some i e [t , t]. 0 By equation 3.3, for t > t , Q c It is clear, therefore, that this implies that the sequence of constants E{a ), eventually t decreases. To see that it does in fact converge to zero, let us define e* = Ela ]. Then, 1 <• <- Hy+? =>t(i-l)e' < (t-l)(t-2)e t 1 + ^~ ^. C 1 If we let 0* = t(t - l)e* then Fixing t , it follows that /?* - @ < (t - t )c. Thus, substituting for /?*, we get to 0 0 * 6 t (t -1)6*° 0 - (t- )c 0 t(t-i) 36 to + W ^ ) ( ( i . 1 which, of course, implies that Efa*] converges to zero as t -> oo at a rate depeding only on to, a, c. In other words, for all e > 0 there exists a T > 0 such that for all t>T, ( -Y > E(c?) > f e V / a*dP > J i V J{a >e/2} / t > e/2}. Z Therefore, there exists a ti > t depending only on to, a, c such that 0 Prob{a l1 < e/2} > 1 - e/2. This completes the first step in proving the lemma - demonstrating that equation 3.6 holds under the conditions stated in the lemma. In order to continue, we define the following double sequence. (A new random sequence with index k associated with each fixed t.) For every t,k with t < k we define the variable z k as follows: t 0 unless 5 k 5 Ztk 8 if ko t-i s < e t _ 1 / 2 < e/2 and 5 > e/2 1 a n d $i > /2 for all i such that e for k > h , if S ' 1 1 0 and 6 ko t<i<k < e/2 and 5 > e/2 for all i, such that t<i<k i 0 < e/2 Perhaps it would be helpful to unpack this definition a little bit. We can first break it down into two possibilities - either there is an upcrossing of the level e/2 by the function 8 between round t — 1 and round t or there isn't. If there isn't then z is equal to zero l tk for all k. If there is then again we have two possibilities - either 8 remains above e/2 for l all % such that t < i < k or else it dips below e/2 at some round k , t < k < k. These 0 0 two possibilities account for the second two parts of the definition of z k (see diagram t below). Thus, z monitors the upcrossings of e/2 made by 5 . If an upcrossing is made at k tk 37 time t, then z keeps track of the value of 8 up until it once again dips below e/2. z k tk tk then holds the constant value associated with the first value below e/2. epsildn/2 Figure 3.2: Three cases for z tk It follows from the above definition that if 8 > e for some k > t then either (i) k x 8 > e/2 for all t such that t < t < ti, or (ii) z k > e for some t € [to>^]- Indeed, if l 0 t (i) fails then we must have 5* < e/2 for some t with t < i < t± and there must exist Q an upcrossing somewhere between i and k. Therefore, there exists a t € [to,fc]such that zi = 8 > e, i.e., (ii) must hold. k k Now, we have already shown that outcome (i) is not very likely, i.e., Prob{8 > e/2 Vi G [t ,*i]} < e/2. 1 0 Therefore, if we can show that the same is true for case (ii), i.e., Prob{z k > e for some t >t and some k > t} < e/2 t 0 38 (3.9) then both options would be nicely bounded which would imply that Prob{5 k > e for some k > ti} < e, completing the proof. In order to prove equation 3.9, fix t > t . If t is not associated with an upcrossing, 0 then z k = 0 for all k > t. Hence such a f makes no contribution to the probability on t the left in equation 3.9 and so we may assume that t is associated with an upcrossing. In this case, define B = z — z ,k-i, k >t (with f3 = 0). Two cases arise. If z k-\ > e/2 k tk t t t; we have both z ,k-i — o~ ~ and z k = 5 , so @k = <5 k l k t $ and fc-1 — k t E[5 - 5 ~ \5\ k E\Pk\ztt,Pt,-,Pk-i. k 5 ~] l k l E[S \5 ,...,5 - ]-5 k t k 1 k 1 — max(|^||A,...,/? -i) fe since < b/k. (3.10) If z ,k-i < e/2 then B = 0 since we can't be in the second case in the definition of t k z . So, in either case, equation 3.10 is satisfied. tk We now use the aforementioned variation on the SLLN in the following manner. If we let Bk — (t/b)Pk+u and we define u = e/26 then, by equation 3.2: Pk < r Vfc > t B < k t < 1 Vk > t k+1 ~ and, by equation 3.10, E[B -i\B ,-Bfc-2. k t = ElUkWi+u-Pk-i] b < -^max(-|^ ||)9t+i,..,^jfe-i) = -^maxdS/t-xllBt,...,^) fc 39 Thus, this definition of B satisfies the hypotheses of the above theorem allowing us to k assert that, for u = e/26, Prob{B + ... + B _i > a for some k] < {^T~^j t k • But = l/3 B + ... + B ^ t t+l k + ... + lf3 k = ^[(Zt,t+1 - Zt,t) + ••• + (Zt,k - = ^( t,k - z ,t) Ztjc-l)] z t Thus, taking a = st/b above, we have (l-u\ ' I \l + uj l Prob{z tt Recall that z = 5K Therefore, since b where r = I — z > s for some k > t} < r ts tk < e/2 (by assumption) and t > t , we can tt 0 insure that z < 3e/4 (by taking £ large enough and using equation 3.2), so that z > e tt 0 tk for some k implies that z — z > e/4. Thus, tk Prob{z tt tk > e for some k>t}< (r )*. e/4 This, finally implies that, 0 Prob{z tk > e for some k>t,t>t }< 0 0 > (r )* = r e(*o/4) e/4 n t=t 0 Now r < 1, so for large enough £ , we have equation 3.9. Thus both options are nicely 0 bounded completing the proof of the lemma. One final theorem is required before we can complete the proof of B.A.T. Theorem 1: For every closed subset C G R , there exists a function TTC • R —> R , N N with the property that < z - 7r {z),y -ir {z) c c > < -^\y - TTC{Z)\ 2 40 Vy G C. N Indeed, it suffices to take for n any selection of the nonempty-valued (since C is closed) c multifunction z —> argmin{\z — y\ : y G C}. Proof of Theorem 1: The proof is a straightforward application of the properties of an inner product. If z G R and n = irc(z) is a nearest point in C to z, then n \z - 7r| <\Z- Vy £ C y\ 2 2 Thus, \ Z - K \ 2 < \{z — \z <z-Tr,y-ir> < - ir) — 7r| 2 + (n - y)\ 2 + 2< z — ]r\y - ir\ 7T, 7T — y > +\TT y\ 2 — Vy G C 2 completing the proof. 3.3 Concise Statement and Detailed Proof Before we can even make sense of the statement of BAT, we will need the following definition. Definition: Let C be a set in L-space. We shall say that C is approachable if there exists a behaviour rule (for player M) such that for every sequence of mixed strategies available to player N, the sequence {<5* : t G L} converges to zero almost surely. Here 5 is the l squared distance to C of the empirical V M ( ^ * ) , Le., S = min{\V (h )-TT\ f t 2 M :7i eC}. (3.11) That is, for every e > 0 (and every possible behaviour rule by player N) there is a To such that Prob^ > e for some t > T } < e. 0 41 Note that, as shown earlier, this is equivalent to almost sure convergence of 8 to zero. f This being said, we are finally in a position to actually give a concise statement of B.A.T. Blackwell's Approachability Theorem: Let C be any closed set in R . Suppose there L exists a map g : {R \C) L -> P[S ] such that for every z e R \C, the measure fjt = g(z) L M obeys <v (/j,,s ) M -Tr {z),z-ir (z)> N c < c 0 Then C is approachable. Indeed, the behaviour rule 7(/i*) := Vs N e S. (3.12) N #(VM(^*)) w u ^ serve. In other words, if we find ourselves at round t, then the Blackwell condition requires that there exists a measure fi t+1 %(s i+1 € P[S ] such that the expected value of the next payoff, M ) , is on the C-side of the hyperplane that separates this new payoff from the time- averaged payoff up to time t. The theorem then promises convergence of V M ( ^ * ) to the set C. Proof of B.A.T.: The method of this proof is to show that, given Blackwell's condition, the random sequence S defined by equation 3.11 satisfies the three inequalities of Lemma 3.2. It 1 then follows that 5* converges to zero almost surely, which is equivalent to C being approachable as shown earlier. If we denote by y*, then Blackwell's condition requires that, given the 7TC(VM(^*)) history up to time t, we have: < v {i/ \ + M s ) - y\ y (/i*) -y*> N M < 42 0 Vt G R and s N e S. N (3.13) Therefore, since y € C and using theorem 1, t S < \V {h ) t+l - yf t+1 M = \V (h )-V (h )\ t+l t M + 2 M 2<V (h )-y\V (h )-V (h )> t t+l M t M M + \V (ti)-y \ (3.14) t 2 M Now, , V (h ) - V {h<) = t+1 M M -. t m ^X>"(* ) - TE^( ) T S T T = l T= l " "*' T + iTTE»-M-it«-(0 < 1 r=l T=l = +~ v {s ) t+l M [ ^ 4 ) - (* + l)V {h})] M - M^) t+l Using this fact, we can now write both < V (h}) - y\V {h^) M - V {h}) > M = M ~ y*> Ms ) = m [< ^ -y > t+l l t+l M - t+l M M + < v (h*) M - y* > -£r\y* = <h< V {h}) - y\ v (s ) t+l M [< V {h}) - y\v {s ) y V (h})>\ M \ * - v (h}) >) y M - V {h})\ 2 M and v (s ) t+1 \V {h^ )-V {h )\ 1 t M M = 2 M - V {h}) < M t+l since all payoffs are assumed to be bounded. Thus, given the history up to time t — 1 and choosing // e P[SM] such that Blackwell's condition (equation 3.13) is satisfied, equation 3.14 implies: E[< VM^' ) +2 1 E^S ,...,* 1 t-n \V (h - )-y - \ t 1 t 1 2 M l^-vw^- )! 1 - 2 < 5 t - y'- ^ ,-,^- ] -tf-\v {*) 1 M 1 1 2 + E[\V (h )-V (h - )\ \8\...,6 - } t M t 1 2 t 1 M t - i _tV + 4t 1 2 1 - ^ +^ \/t £ R and Vsjy € 5^ 43 (3.15) Since all payoffs are bounded, we know that for some scalar a and 0 < <5* < a ViG R and VSJV £ £V (3.16) Finally, < ||V^r(/i*) - TT|| - MV^C/i*- - TT|| 2 < < 2 for all h* WVMM-VMih^W 2 < -4 ~ 1 t 2 - for some scalar b Vi G and Vsjy G (3.17) These last three numbered equations are equivalent to those required for the lemma. Thus B.A.T. follows. 44 Chapter 4 Convergence Results What then can be said about the convergence properties of these various behaviour rules? Do they converge at all? If so, can we characterize the set to which they converge independent of the game being played? In this chapter, I will focus on the convergence of the empirical distribution of play, namely the sequence in P[S x SN] defined by: M T<t 4.1 Fictitious Play, Past Response and /^-Exponential Fictitious Play Fictitious Play Foster and Vohra have shown that, given a certain condition, the set of all distributions which are limit points of z is equal to CE provided that the behaviour rules l involved are both a best-response to a calibrated forecast ([3]). Let us denote the set of limit points of this subset of behaviour rules by BR. Thus, Foster and Vohra proved that, given a certain condition, CE=BR. The required condition is that the set of measures over SN for which i is a best-response, M (i), must have a non-empty interior for all b i G SM- More specifically, this implies that Mb(i) ^ Mb(j) for all i j and i,j G SM- In' those cases where this does not occur then there may exist a correlated equilibrium which is not contained in BR but any point in BR will still be a correlated equilibrium (i.e. B R 45 is contained in but not necessarily equal to CE). Fortunately, the above condition can be shown to be true for almost every game. Since FP is a best-response to a calibrated forecast, it follows that if FP converges then it must converge to a correlated equilibrium. However, Foster and Vohra make no claim that any specific best-response behaviour rule based upon a calibrated forecast (such as FP) must converge. Their result merely implies that given any correlated equilibrium and given a game where the above condition is satisfied, there exists some best-response behaviour rule based on a calibrated forecast that will converge to it. The Shapley game described below is an example of a game in which FP does not converge to C E simply because there are no limit points. Thus the known convergence results for FP are entirely game specific. Past Response Past Response is a behaviour rule that I developed myself essentially as a potential opponent for the other rules. The idea behind it makes intuitive sense but I have made no attempt to determine any convergence results for it. Numerical results from three different games are given in the following chapter. These numerical results suggest that PR does converge to a correlated equilibrium when played against itself but not necessarily to a very desirable one (a defect that is common to most of the other behaviour rules as well). K-exponential Fictitious Play: Fudenberg and Levine [4] prove that if all players use a universally consistent behaviour rule then the empirical distribution of play will eventually remain within MBR. Moreover, if all players use ^-exponential FP then the conclusion can be strengthened to insure that the empirical distribution of play will eventually remain within the exact 46 MBR. To be a little more rigorous, assume that all players are using universally consistent behaviour rules. We can then derive a probability distribution, p (/x, u), over the set of T possible empirical distributions arising from the use of the given behaviour rules up to any time T. Now, there is no a priori reason to assume that this probability distribution will converge at all as time progresses. However, since the space of measures on a compact set is compact (in the topology of weak convergence), we can at least be assured of the existence of accumulation points. What Fudenberg and Levine prove is that, with probability one, these accumulation points will be found entirely within MBR. 4.2 Regrets 2 Mas-Collel and Hart make stronger claims for both Regrets 1 and 2. Their Regrets 2 convergence result depends heavily on the aforementioned B.A.T., which allows MasCollel and Hart to establish the following theorem. Theorem 2.3.1: Suppose that at every period t, M (or N) chooses strategies as outlined by Regrets 2. Then M's (or iV's) regrets i?*(j, k), converge to zero almost surely for every j, k £ S M Proof: Recalling the setup outlined in section 3.1, let L = {(j,k) £ SM X SM}, SO that R L can be viewed as the space of all \SM\ X \SN\ matrices, and let C be the non-positive orthant of R . We define M's vector payoff, as the following square matrix: L u (k, s (t)) - u (j, s (t)) M N M N if s {t) = j, M otherwise. (Thus, has only one non-zero row, namely row j 47 = SM(*)-) (4.1) So then, the regret function defined earlier is now equal to the positive part of the time-averaged pay-off. That is, R Mk)=[Wu(h W,k)] . t Moreover, VM(/I*) t + e C if and only if all of player M's regrets are zero. Now, suppose / i € P[SM] is an invariant vector of the stochastic matrix P associated l u with Pi . This is the same vector used in Regrets 2, to determine player M's measure. M As in Chapter 2, \i satisfies, li B} T M = i^diagiR^e). (4.2) Thus, to prove the theorem it would suffice to show that this \i satisfies Blackwell's condition for the set C since then, by B.A.T., V M ( ^ ) will approach C almost surely. (Hence R —¥ 0 almost surely.) l M Note that for any V {h}) e R \C, L M V {ti) M - pro V {ti) JC M = V {ti) - (V {ti) M - VM^Y) M = R* . M Moreover, proJcVMitf) • {V {ti) - projcVMih})) = V (tf)" • F ( ^ ) M M M + = 0. Thus Blackwell's condition can be re-written as, E K(^^)](i,A;)^(i,fc)<0 j,kes \/s eS , N N (4.3) M or, in matrix notation, where "•" represents the standard matrix inner product (A • B=tr(A B)), T v {^s ) M N • R < 0. l M 48 Now since [v (fj,, s )](j, k) = p(j)[u (k, M N M s ) - u (j, s )], N M N s) - u {j, equation 4.3 is equivalent to fi{j)[ M(k, X u N s )}R (j, M N k) < 0 M j,kes M ^Y, p(k)R (k,j)-p{j) H jes lkes M R {hk) M u (j,s )<0 M M N kes M M j€S M Let = [ Lt i? e ,-/i (eJ R ) ]. T / M : t r J T M So then /3 = p?R -p?diag(R e). M M But this we know to be equal to zero by Equation 4.2 . Thus the inequality demanded here actually holds as an equality in this case. So Equation 4.3, and thus Blackwell's condition, holds. It follows, therefore, that V (h ) M converges to the set C almost surely and therefore so t does R (j, k) = max{[VM(/i*)](j, k), 0}, for all j, k e S - That is, all regrets go to zero. M M Mas-Collel and Hart go on to show that having the regrets go to zero for all players is a necessary and sufficient condition for convergence of the empirical distribution to CE. (Note that CE need not be a single point. Thus, the guarantee is not that Regrets 1 and 2 will ensure convergence of the empirical distribution to a single element of C E but that, given either of these behaviour rules, the empirical distribution will eventually remain within a small neighbourhood of the set of correlated equilibria.) Proposition: Let (s*)^!^... be a sequence of plays such that Umsupt^ooRl < 0 for i = M and N. Then the sequence of empirical distributions, z , converges to the set CE. t Proof: Let a G P[SM X SN] be an accumulation point of the sequence, z*, of empirical distributions of play and consider arbitrary strategies j, k € SM- Then seS xS :s =J M N M 49 In the limit, this is bounded above by zero (since all regrets go to zero). Therefore, E limsup t—too _„ z {s)[u (k,s ) - u (s)] t M N < M 0. „ Thus taking the limit along the subsequence where z* converges to a gives, E o>(s)[u {k,s ) M N - u (s)} < 0 < 0 M S&SM'XS :SM=J' N E O (J, N)[u (i>(j),s ) a s M N - u (j,s )] M N where = k SES XSN:SM=J M Now, since this is true for any arbitrary pair j, k G SM, we get E ^MC0(SM), s )a(s) N - u {a) M < 0 (4.4) for all ip : SM —> SM- Since the same could be shown using N instead of M , this proves that a is a correlated equilibrium. 4.3 Regrets 1 The proof that, if all players use Regrets 1, the empirical distribution of play converges to CE is a little more complicated. (Note that this hypothesis is also more restrictive than Regrets 2, which only required that all players use an adaptive procedure that insures that their regrets go to zero.) It does share similarities with the Regrets 2 approach in that the idea is to show that for each individual player, the regret goes to zero as the game progresses. Thus, by the above proposition, we have convergence to CE. Theorem: If every player plays according to Regrets 1, then the empirical distribution of play z will converge almost surely as t —> oo to the set CE. l Proof: We will drop the subscript M whenever this will cause no confusion. Thus once again we define the set C to be the non-positive orthant of R , where L L 50 — {(j,k) £ SM*SM}- Recall that = \ J2 <t ( )v st T Regrets going to zero is then equivalent to V(h ) converging to the set C as t goes to infinity. A logical quantity to study therefore l is the distance between V(h}) and the set C. We define: p := t [distiVih')^)] . 2 This proof will depend on taking a subsequence, {i }n=o,i,2..., of the rounds of the game n and showing that along this subsequence all regrets go to zero. That is, p tn n —> oo. Then, by showing that all rounds in between t and i n n + 1 -> 0, as are bounded by the inverse of n, we will prove that all regrets go to zero. Thus, it is no use simply investigating the difference p \ — p . Instead, we must look at p t+ t t+s Pt where s = — Specifically we need to look at the expected value of p t+s t n + i — t and t = t . n n where only the history up to time t is known. The proof will depend on the fact that we can keep s small relative to t. This proof makes use of the standard "O" notation. For two real-valued functions, /(•) and #(•), defined on a domain X, "f(x) = 0(g(x))" means that there exists a constant K < oo such that < Kg(x) for all x e X. To simplify this process it is helpful to notice the following recursive relationship. Recalling the definition of v(s ) as given in Equation 4.1, it follows that: t (4.5) Keeping all this in mind, we may now proceed with the proof of the convergence of the empirical distribution arising from the regrets 1 behaviour rule to the set of correlated equilibria (essentially following the proof developed by Mas-Collel and Hart, [8]). 51 Step 1: With this background (equation 4.5) and since [^(ft*)] G C, we can now derive a - bound on p : t+s Pt+s |2 t < t + S t t +s (t + sfp w—1 (v(h<) - T ^- J2« s t + w s ) ~ w=l < t p + 2tJ2 {v(s ) - [V(h*)]-) • (Vitf) - \V{h*)]-) 2 t+s [v (/»*)]-) + t+w t w=l + E (v(s ) t+w - [Vih*)}-) < t p + 2tJ2v(s ) 2 • R*+ 0(s ) t+w since R = [V(h )} 2 t l t + < t p + 0(ts + s ) 2 2 t This last inequality follows from the fact that all payoffs are bounded and all strategy sets are finite. The above derivation leads to two useful equations: E[(t + sfpt^h } 1 <t 2 + 2tY E[v(s )\h ] t+w Pt t -R* + 0(s ), (4.6) < 0{ts + s ). (4.7) 2 / w=l (t + s) p -t p 2 2 t+s 2 t Step 2: Thus, we need to bound the expectation on the right hand side of equation 4.6, namely EMs^lh*] = E diag^M •SJVGSJV where <j) (j) = Pr[s t+W SN = (j, SJV)|/I*] and U (j,k) = [u (k,s ) SN 52 M N - u (j,s )] . + M N The difficulty is that the stochastic process (s* % ,i,2,.. * ° t stationary so the sn +u =0 probability that M plays strategy j in round t+w is dependent on the play of N in rounds t through t+w-1. Thus, given only the history up to time t, the above expectation is very difficult to analyse. We proceed, therefore, by introducing a second stochastic process (s' ) o,i,2...+w tu= This new stochastic process uses the stationary transition probability matrices P , Pj^ for,all' rounds greater than or equal to t. That is, s = s for all r < t T M T and Pr[s t+W = (j, s ) I s * , s ^ - } = P (s {t 1 N M + w- 1), j)Ph(SN(t M + w-l), s) N where P is the transition matrix defined by equation 2.2 on page 27. We can then M define <p for the process (s )u;=o,i,2...in an analogous way to (j). t+w We will proceed by first showing that the difference between these two processes can be bounded for fixed w and then prove that if we use the stationary process instead of the original one then we would be able to get a useful bound for Equation 4.6. Thus we will have proven that the original process gives a similar bound. Step 3: In round t + w, the original process will use the matrix P t+W while the new stationary process will use the matrix P*. Thus, we would like to get a bound on the difference between these two transition matrices. Towards this end consider, v{h^)-v(h}) = -J_v(fc') + _L-^; ,( *^)-v(/ «) t t +s ^ < v ' a l t+ s 0(s/t). (4.8) Therefore, it follows that the difference between the matrices P* and P t+S more than 0(s/t) either. 53 can be no Step 4: However, we are really interested in comparing 4> and </>. Based on step 3, we get that. Pr[s = {j, s )\s\ t+W s -} t+w N = P (s (t M -Pr[s 1 + w-l),j)P (s (t M N -P (s (t t = P {sM(t + N N + w- l),j)P (s (t M N - (P (s (t M = 0(w/t) + SN t+w S ~] l w-l),s ) N + w - 1), s^) N + w- M )| *,s N t +s M SjV w-l),s ) + w-l),j)P ( (t Us M = (j, t+W + 0(w/t))(P {s (t N + w-l), N s ) + 0(w/t)) N since w < t However, this only bounds the one-step difference where the history up to time t + w — 1 is known and is equivalent in the two processes, whereas to compare (p and (j) we need to bound the s-step difference where only the history up to time t is known and equivalent. To do this we proceed by induction. Consider the following proposition involving n: w P r [ ^ = (j, ,)|^..., S A t + S "]-Pr[ t + 5 - =0> £ ) 0 ( r / t ) , (4.9) )|^..., * "]< + A R S r—n+l We know this is true by the above, for n = w — 1. We need to show that this is true for n — 0. But this is proven by the following = £ * + n Pr[s = {j, SN^S , t+W < E ^( P r s is t + w = Pr [ **> 8 \ r s t + W U N)W, s ](Pr[s t+n t+n t+n (i,a )|s«,...,^- ] + ET 1 = J V = s \s\ = B 54 0(r/t) t+n s^" ] 1 1 t+n =s'+V,s 1 s ^ ] + 0{n/t)) = s \s\ t+n s ]Pr[s S t+n t+n = {j, SN^S*, = s -} t+n + X:r = n + i 0(r/t))Pr[§ t+n t+W = s \s\ t+n N s P t+n = (j, s )\s*,s ] < Y, t+n Pr[s < E s ^ s ]Pr[s 1 s t + n - } + Z7=n 0(r/t) 1 . Thus, by induction, this proves that equation 4.9 is true for n = 0. Therefore, w r-0 4>s (j) -4> {j) < SN N — E R FOR S O M E LAR § E K K (w{w + 1)' t v 2 K (w(w < T + iv) It is important to note that this step requires that player N also does not change strategy too quickly. That is, player N must use a behaviour rule which insures that the transition probabilities do not change too drastically between round t and round t + w. The statement of the theorem is that all players use the same Regrets 1 behaviour rule. This is somewhat more restrictive than necessary but it does, at least, insure the validity of this proof. Step 5: So then E[v{s )\h*] • PJ - B[u(s )|/i*] • R* t+w t+W J =H es diag((f) )U 8N = N ^s es N SN 9(k dia N N SN M < 4 m a x ( ) [T.jeSM 2 SN SN SN I • R f •R 1 SN 9(k dia N T diag((p )U S j v e S j v ~ <t> )U <2max(w ) J2SNesN UM - £ /SN - <t>s )u N SN </>** U) ~ <t>s (J) N by Step 4 It remains to be shown, therefore, that E[v(s* ')|/i*] • R can be bounded for all w. = 0(w /t) 2 +u 55 l Step 6: 4> (j) aN = Pr[s {t + w) = s \h }Pr[s (t = Pr[s (t + w) = - (Hag^e)) N N Therefore, since P = I+\(R l EHs^h^.R! t « ( J2 = t Af = J J - I + t (4.11) -diagiBJe) \ k (j)[u (k,s ) N t M • (iP SN N K (4.10) w)=j\h ] i i diag(4> )U ) SN + M s |/ ][P r(s (i),;) \s es N t N 1 - UMiJ^N^iJik) N M stjeSN \j,keSM X (by defn of U, ) N ) ^ (i)wM(i,s r)P (i,A;) &w0') Af(*:,Sjv)-P*C7,fc) u (Note that the second equality follows since I — ^diag(R e) t JV A is a diagonal matrix and f/ t 5jv has only zero elements along the diagonal.) Now, if we switch the indices in the first sum over j, k € SM and recognize that the second sum reduces to J2 s je <f>s (j)u (j, M N M (since £ s) N f c e 5 M P\j, k) = 1 for all j e 5 ) , M we get that ^ Thus, setting fi {j,s ) t,w N *. (A0/*(*J)-fc(i) w (4.12) equal to what is inside the square brackets and recalling equa- tion 4.11, we have KkeSM = Pr[s (t + w) = = Pr[s (t + w) = N N SHW {[P*r Mt)J) +1 ~ [PT(sM(t),j)} s \h }[(P r -(P ) ](s (t),j) N t t +1 t w M Therefore,if we can bound d^ then we will have bounded P[u(s )|/i*] • R as desired. w i+w 56 l Step 7: Step 7 depends on the following lemma: Let P be an m x m stochastic matrix with all of its diagonal entries positive. Then [pw+l _ p w ^ = 0(^-1/2) for a l l ^ k = 1 Given this lemma, it is clear that j3 t,w ) ^ m = 0(w~ / ) since P is designed so that the 1 2 l diagonal entries are positive. Step 8: Thus, we have now shown that £ ? [ u ( s ) | h ] -R = 0(w~ ) t+w t 1/2 l Therefore, by step 5, 2 Finally, returning to equation 4.6 from step 1, it follows that £[(i + ) S 2 A+s |/i<] < t 2 = Pt + 2t^O(^ + w- ' ) + 0(s ) 1 2 2 t p + 0(s + ts^ ) 2 s 2 t Step 9: Here then is where we need to make an intelligent choice for a subsequence, {£„}. Let t be equal to the largest integer not exceeding n 5 / 3 n . Therefore, by letting t = t and n s = t i — t , step 8 results in n+ n E[t ^} 2 = n+lPtn+ t +0{n ) 2 2 nPtn since s = 0(n / ) which implies that s = 0(n ) and i s / = 0{n ). 2 3 3 2 57 1 2 2 Step 10: Finally, we can use the following theorem by Loeve, to show that along this subsequence p goes to zero as n —> oo. tn Theorem: Let X be a sequence of random variables and b a sequence of real numbers n n increasing to infinity, such that the series J2 Var(X )/b converges. Then 2 n 1 n l n lim — y^(X - E[X \X ...,X _ ]) r r u r =0 1 a.s. r=l n If we let b = £ and X = b p — & -iPt _i then by equation 4.7 from step 1, it follows 2 n n that \X \ < 0(t s n n + n £ n tn n n s ) = 0(n / ). Therefore, 2 7 3 V«r(X )/6 = £ 0 ( n / ) / n ° / = £ 0 ( l / n ) < oo 2 14 3 2 3 2 n n n n Moreover, Step 9 implies that, rE ^l^ W = O(n" )E°( ) = ° ( " £ On 10/S r<n ra n 1B/V ) = °( "- ) '•' n /3 r<n Therefore, since this obviously goes to zero as n —> oo, it follows that a.s. Step 11: Since p = J2 [R 0 ' ^)} , step 10 implies that the subsequence R converges to the tn tn 2 tn jk set C a.s. as n ->• oo. When t <t< n t we have R - R f n+l tn < 0(n )/0(n ^) 2/3 5 = 0{l/n) by equation 4.8 in step 3. Thus the full sequence R converges to the set C a.s., thereby l completing the proof. 58 Alternative Proof for Regrets 1 Convergence The concept behind this alternative proof is to use what we already know concerning the convergence properties of the Regrets 2 behaviour rule. If we can show that eventually the Regrets 1 behaviour rule behaves similarly to that of Regrets 2 then we will have proven the desired result. The basic idea is to show that a subsequence of the transition matrices arising from Regrets 1 is essentially equivalent to the transition matrices arising from Regrets 2. Towards that end it is necessary to introduce a tremble. Recall that a tremble is introduced to insure that all transitions have positive probability. In this case, I will introduce a tremble of l/t l . l A That is, if S{ is the original sequence of transition matrices arising from the Regrets 1 behaviour rule, then the new sequence of transition matrices . has the following form: ^ ^ ( l - ^ f ^ S r forallx. where A is a matrix with each entry equal to one and r is large enough so that the coefficient in front of S{ is positive. Note that the difference between the two sequences: must always be of the order of l / r 1 / / 4 so that, for large enough r, dealing with one sequence is essentially the same as dealing with the other. Therefore, E[S{ - 5* |/i*] = 0{l/{t + s) ). +S Now, recall that R +s = [V {h )} t M 1/4 and that V M ( ^ * ) obeys the following recursion + M formula: V (h ) t+s M = -±-V {lt) t+s +- i -V t + s —' M z v (s ). t+w M w=l This, in turn led us to the fact that P ^ - P s M = 0(s/t). 59 : -(4.13) (see step 3 from the original proof for regrets 1.) Moreover, as t gets large, the above recursion formula makes it clear that the Regrets 1 behaviour rule essentially uses the same transition matrix over and over again. In other words, E[p£'\h*\ In fact, the difference is of the order of s/t. But, by theorem 2.15 in Markov Processes for Stochastic Modeling [7], the difference between [P ] and the stationary distribution associated with P can be bounded in the S M M following way: [PMY - 7T* 1 <C (4.14) where C is a constant and 7r* is the stationary distribution associated with the transition matrix P . Note that this is where it is necessary to assume that the transition matrices M have non-zero entries. The above discussion on the addition of a tremble however shows that the perturbation introduced by the tremble can be easily controlled. Therefore, if we are given an e > 0 and a S > 0, and we start with t large enough, then we can choose s large enough so that the bound in equation 4.14 is less that 5 without having the bound in equation 4.13 exceed e. The only difficulty would occur if as t increased the ratio s/t required in order to reduce the bound from equation 4.14 to a fixed number increased. However, it is easy to show computationally that this is not the case and that in fact this ratio decreases as t gets larger. Thus we can take a subsequence of the transition matrices arising from the Regrets 1 behaviour rule - (where t P ,P ,... +SL M +S2 i+x M p u + 1 + S i + 1 _ u p 60 =U + si) so that + S i < e ( 415) and E[P > +1+S M +1 - 7r* i + S i < [P T +S +1 M < 5+e ~ K U + S I + e (4.16) Equation 4.16 proves that this subsequence of Regrets 1 essentially behaves like that of Regrets 2. Now, we know that Regrets 2 forces the regret matrix to zero and therefore forces the transition matrix to the identity. Therefore along this subsequence, Regrets 1 also forces the regrets to zero. Moreover, the transition matrices (and therefore the regrets) between the members of the subsequence are restricted by equation 4.15 to be within e of the closest member of the above subsequence. Therefore, Regrets 1 must also force the regrets to zero and so insure convergence to CE. 4.4 Modified Regrets The convergence results for modified regrets (MR) are entirely numerical. In the zero-sum game discussed in the next chapter, MR does not converge to a correlated equilibrium. However, it does converge towards the "best" compromise position where both players get an equal payoff. In the other two games, neither of which are zero sum, M R converges at a very fast rate to a correlated equilibrium. More particularly, it converges to the correlated equilibrium that results in equal payoff to both players. None of the other rules show anything close to the same kind of convergence rate - at least not in all three games. (These results will be given in more detail in the next chapter.) Intuitively, this makes sense. MR is based on the idea of compromise. It is an attempt to incorporate into a behaviour rule the idea of accomodation. Thus, it would make sense that, in games where there is a possibility of compromise (i.e., in non-zero sum games), MR would work very well. Even in a zero-sum game, the idea of a compromise would simply mean that both players should receive approximately zero payoff. 61 This brings into question, in my mind, the whole idea of convergence to correlated equilibria in the first place. As will be demonstrated, in the Battle of the Buddies game all behaviour rules converge to a correlated equilibrium but it is very obvious that the result (except in the case of MR) is far from satisfactory for one of the two players (see results next chapter). Thus convergence to the set of correlated equilibria is not necessarily a desirable outcome; or at least not a sufficiently restrictive one. If instead, we could guarantee convergence to the "best" compromise then surely that would be a more useful result. How one defines the "best" compromise is not an easy question to answer. Certainly, in a zero-sum game the best compromise would result in a zero payoff for each player. For a nonzero-sum game though the "best" compromise may not be definable in the general setting but rather depend on the specifics of each individual game. There is one type of game that would most obviously cause the M R behaviour rule some difficulty. Consider a game with the following pay-off matrices: 1 0 0 UM 0 3 0 0 4 0 u N 0 0 1 = 0 0 1 1 0 0 In such a game, MR will lead player N to have a positive regret for not playing strategy 2. But if N plays strategy 2 then player M will play strategy 2 as well. Player N may continue to play strategy 2 (despite receiving zero pay-off) in the hopes of forcing player M to switch but in this case there is no reason why M ever would. Thus MR will only work well in those games in which both players can act in such a way as to create an adverse effect on the other player which might then influence him to switch strategy (such as Battle of the Buddies). Without knowing the pay-off matrix of one's opponent, it is not easy to see how this might be remedied. One possible option would be to introduce a weighting on the two 62 sums in equation 2.7 on page 30. Such a weighting might consider monitoring the average of the components of % and increasing the weight on the first sum (the normal regret) if the average of past payoffs dipped below this level. Such a weighting might then prevent player M from continuing to try to influence player N when previous play had shown such influence to be minimal. My own experience, however, with the introduction of a weighting system has made it very clear that this is a delicate task. A weighting that may work well for one game can prove disastrous in another. Whether there does exist a "optimal" weighting system that would allow MR to perform well in all types of games remains an open question. 63 Chapter 5 Experimental Results This chapter describes numerical experiments with the convergence results of various behaviour rules. I chose three different games to play, picking three whose forms varied quite dramatically. The first one is a zero-sum game which is an expanded version of the familiar scissors-paper-rock game. The second is a game developed by Lloyd Shapley and the third is the aforementioned Battle of the Buddies. I will use the following abbreviations to refer to the behaviour rules defined above: Regrets 1 - RI, Regrets 2 - R2, Modified Regrets - MR, Fictitious Play - FP and K exponential-fictitious play - EFP. 5.1 Scissors-Paper-Rock-Glass-Water This game (abbreviated SPRGW) works exactly the same way as the old game of scissorspaper-rock, except that now, instead of having three choices, each player has five ([9]). Which choice beats which is demonstrated in the following diagram. If the arrow goes from vertex u to vertex v then u beats out v. Thus, for instance, glass beats out stone (see diagram below). A loss means a penalty of —1 and a win means a reward of +1. Thus UN = —UM, 64 w S G Figure 5.1: SPRGW - rules of the game where u M is of the following form: 0 -1 1 1 -1 1 0 1 -1 -1 -1 -1 0 -1 1 -1 1 1 0 -1 1 1 -1 1 0 In this game, the convergence rate for FP and PR is extremely fast. FP and PR when played against themselves converge to the Nash equilibrium where the marginals are both (1/9,1/9,1/3,1/9,1/3). The resulting payoff for both players is, of course, zero. EFP converges to the same Nash equilibrium, though at a slower rate. RI seems to be doing the same though the convergence rate is extremely slow. Finally, M R played against itself leads to a fast convergence to the marginals (1/3,1/3,0,1/3,0) which results in zero payoff as well. However, this distribution is not an equilibrium. So, in each case, an "optimal" compromise is reached in which both players do equally well. When the different behaviour rules are played against each other, however, their convergence properties become much less obvious (see Appendix). Numerical results however did support Fudenberg and Levine's claim that EFP converges to MBR in all cases where 65 the opponent uses a consistent behaviour rule. 5.2 The Shapley Game Julia Robinson proved, as early as 1951, that, in a two person zero-sum game, FP does insure the convergence of the empirical distribution to NE ([10]). The Shapley game was originally conceived by Lloyd Shapley in order to prove that FP does not necessarily converge to a NE outside of the narrow context assumed by Robinson ([11]). Shapley's game is also useful here because it turns out to be a good example for demonstrating the previously discussed convergence properties (or lack thereof) of the various behaviour rules. The game is a nonzero-sum 3 x 3 game with the following payoff matrices: 1 0 0 U M = 0 1 0 0 1 0 , 0 0 1 u N = 0 0 1 1 0 0_ Foster and Vohra [3] have proven that, not only does FP not converge to an equilibrium but that, in fact, it doesn't converge at all. A fairly simple program will show that, in this game, FP oscillates between 6 states: (1,1), (1,2), (2, 2), (2,3), (3,3) and (3,1). Now the only correlated equilibrium with support on these states places equal weight, 1/6, on each of them. If we order the coordinates of the matrix by labelling the first row across numbers 1 through 3 and then the second 4 through 6 and so on, FP produces the following empirical distribution of play: After 1000 iterations: (.120, .281,0,0, .258, .058, .128,0, .086) After 30000 iterations: (.064, .095,0,0, .139, .203, .201,0, .298) 66 These values and the graphical results in Appendix A provide no indication that FP is converging to the correlated equilibrium. In fact, F P cycles through these six states and with each state change the length of time to the next change becomes longer. Note, however, that according to Fudenberg and Levine, due to the increase in time between state changes, FP is consistent in this game and thus should remain within MBR (the table below provides numerical confirmation of this). Thus this shows that Fudenberg and Levine's result cannot be strengthened to insure convergence to CE. The Ac-exponential FP behaviour rule also cycles through the same six states and produces the following empirical distribution of play after a certain number of iterations, again showing that universal consistency is only enough to insure convergence to MBR: After 1000 iterations: (.223, .329, 0, 0, .132, .066, .150, 0, .101) After 30000 iterations: . (.243, .061,0,0, .089, .131, .283,0, .193) EFP seems to do much the same as FP in that it cycles through the same six states with the rate of state change getting longer and longer. The Regret 1 behaviour rule again cycles through the same six states. The convergence to CE is obviously extremely slow. After 1000 iterations: (.218, .347, .000, .000, .237, .038, .093, .000, .068) After 30000 iterations: (.136, .149, .006, .015, .182, .197, .114, .004, .196) P R does something a little different. This behaviour rule cycles between all nine states and does seem to converge (though the evidence is entirely numerical) to the 67 correlated equilibrium which places the same weight on each of the six states mentioned above, and a little less weight on the remaining three states. After 1000 iterations: (.131, .117, .103, .097, .107, .112, .123, .092, .119) After 30000 iterations: (.114, .115, .108, .107, .112, .110, .113, .109, .112) This, once again, highlights the fact that not all correlated equilibria are desirable results. Here, the payoff is close to 1/3 for each player which is far from the optimal compromise of 1/2 each. Finally, MR not only converges to CE but does so at a rate much faster than that of Regrets 1. It also places all its weight on the same six states. After 1000 iterations: (.1668, .1668, 0, 0, .1668, .1668, .1668, 0, .1665) After 30000 iterations: (.1665, .1666,0,0, .1665, .1665, .1671,0, .1665) The following table summarizes the payoffs between behaviour rules after 30000 iterations. It also gives the defining quantities, max (^, a^) and max^(OJM, ^), for MBR M so that the results of Fudenberg and Levine can be easily confirmed. 68 Behaviour Rule max^ u (p,a ) M N u (a) max„ UM(OSM,^)u (a) M N (M vs N) PR vs PR .336 .337 .337 .338 FP vs PR .403 .402 .481 .593 FP vs FP .501 .501 .499 .499 EFP vs PR .418 .418 .469 .578 EFP vs FP .413 .413 .476 .583 EFP vs E F P .526 .525 .475 .475 RI vs PR .359 .451 .356 .470 RI vs FP .453 .537 .463 .463 RI vs E F P .447 .485 .462 .462 RI vs RI .399 .514 .395 .461 MR vs PR .333 .250 .333 .500 MR vs FP .333 .800 .333 .200 MR vs E F P .433 .525 .294 .392 MR vs RI .335 .850 .345 .101 MR vs MR .334 .500 .334 .500 Pathological as the Shapley game seems to be, it does, however ratify Fundenberg and Levine's result that any rule which is universally-consistent will eventually converge to the exact MBR (at least to a good approximation). Results show clearly that E F P converges to the exact MBR, no matter what behaviour rule is used by the opposing player (except MR). FP converges to the exact MBR except when played against MR or EFP. PR, on the other hand, converges to MBR only when played against itself. Recall that Fudenberg and Levine's result only insures convergence to MBR when all players are using a consistent behaviour rule. RI has been touted by Mas-Collel and Hart to 69 converge to CE in any game. While there is indication that this is true (see Appendix A) and theoretically the proof is sound, the rate of convergence is clearly slower than one might hope. Even after 30000 iterations, RI versus itself has yet to satisfy the criterion for MBR, let alone that of CE. Excluding MR, most of the above behaviour rule pairs do not converge to any distribution. The exceptions are PR against itself and RI against itself. Thus, the actual figures given above for payoffs received are not all that useful (except as ratification for Fudenberg and Levine's result). Closer examination actually shows that the payoffs continue to vary a fair bit, favouring player M at one iteration and player N at another. In contrast, MR causes rapid convergence when played against itself as well as against FP, RI and PR. MR vastly outplays both FP and Regrets 1 (see results in the above table) while being itself outplayed by PR. This last result is due to the fact tha,t P R will inevitably do well against a behaviour rule that uses a best response technique. PR when played against M R results in frequent switches causing MR to resort to the bestresponse option on a frequent basis. Against EFP, the empirical distribution oscillates in such a way that both players do about equally well. Against itself, MR converges rapidly to the correlated equilibrium that places equal weight (1/6) on the six states (1,1),(1,2),(2,2),(2,3),(3,1) and (3,3) which, of course, leads to both players receiving a payoff of 1/2. 5.3 The Battle of the Buddies Recall that the Battle of the Buddies game is defined by the following matrices: u M 5 0 1 0 10 1 0 5 70 For this game, the NE contains three strategy pairs - [(0,1), (0,1)],[(1,0), (1,0)] and [(5/6,1/6), (1/6, 5/6)] as shown earlier (chapter 1). Recall that, respectively, they result in payoffs of (1,5), (5,1) and (5/6,5/6). The first two equilibria result when one of the two friends caves in and agrees always to accede to his friend's choice. The last is more of a compromise but results in both of them getting even less of a reward than if either one had caved. Excluding MR, the results from this game are rather uninteresting. For any combination of behaviour rules, the result is always that one of the two players caves in to the choice of the other. Which player caves in is entirely conditional on the first few rounds of play. In any case, the result, even though it is a correlated equilibrium, is anything but a compromise with one player always receiving a payoff of 5 and the other a payoff of 1. Only MR was able to do differently. Playing this behaviour rule against any of the others (except PR) leads to the opponent caving in every time - regardless of the initial condition. When played against PR, MR does not do well as both players tend to be too stubborn (see the table on the next page). When played against itself, M R comes close to the correlated equilibrium which places equal weight on strategy pairs (1,1) and (2,2) and none anywhere else. (After only 3000 iterations, the empirical distribution was (.4389, .1256, 0, .4355), which is itself a correlated equilibrium.) This leads to a much more satisfactory compromise where both players receive a payoff of approximately three. After only 3000 iterations the following payoffs and best responses to the marginals were observed: 71 Behaviour Rule max,, MR vs FP u (a) maxj, u {ctM,v) u (a) 4.97 4.97 .999 .998 MR vs EFP 4.96 4.96 .998 .999 MR vs RI 4.94 4.94 .999 .994 MR vs PR .985 .203 .866 .837 MR vs MR 2.19 2.63 2.18 2.62 u (p,a ) M N M M N Note that the initial conditions were set so that player ./V had the initial advantage. If they are set so that player M has the initial advantage then all of the first four converge directly to the Nash equilibrium that places all its weight on strategy pair (1,1). Thus, MR will not only outplay all but PR in any game of the same form as Battle of the Buddies but will, if played against itself, result in the "optimal" compromise. 72 Chapter 6 Conclusion The problems related to the convergence of behaviour rules within game theory can be broken into three areas - whether or not a given behaviour rule can insure convergence of the empirical distribution, if it does, at what rate and, finally, to what point (or set of points) does it force convergence? The three games described in the last chapter show clearly that it is no simple matter to insure convergence of the empirical distribution even to a set, let alone a point. One solution is to relax the goal. Thus, Fudenberg and Levine developed ^-exponential FP which concentrates on insuring convergence to MBR. As the Battle of the Buddies game shows, however, this may lead to a less than satisfactory result, depending on the nature of the game. One can very easily play a best-response to the marginal of the other player and still fail to maximize one's payoff or even reach an acceptable compromise. Fudenberg and Levine's result for /c-exponential fictitious play is especially nice, however, in that it guarantees convergence to MBR no matter what behaviour rule is used by the opponent. Normal regrets, as developed by Mas-Collel and Hart, has the benefit of insuring convergence of the empirical distribution to a smaller set (CE) no matter what the game but again this does not address the fact that the set CE may contain points whose status as a compromise are less than ideal (e.g. Battle of the Buddies). Moreover, it is very clear 73 that though convergence is assured, the rate of convergence leaves much to be desired. Still there is something to be said for being able to insure convergence to C E as then, at least, one's payoff is assured. Unfortunately, Mas-Collel and Hart's result only insures convergence to CE if both players use normal regrets. Our numerical experiments showed that in the Shapley game, regrets 1 fails to converge to CE when played against anything but itself. What would be the ideal situation? Clearly one would want to insure a fast convergence, but to what? I am not convinced that either MBR or CE are necessarily the best answers to that question; or at least not sufficiently restrictive answers. In many games there are clearly correlated equilibria that do not lead to a satisfactory result for both players. An alternative would be the set of "optimal" compromises. Obviously, in zero-sum games this set would only include those distributions which lead to both players receiving a payoff of zero. However, in a nonzero-sum game, though the concept of an "optimal" compromise is fairly intuitive, it is not so clear what the set of "optimal" compromises would entail. Clearly it is not simply those for which both players do equally well as this would allow both players to do equally badly. For instance, in the Shapley game there are numerous distributions that would lead to both players getting zero. The set would include some correlated equilibria, but not all, and would also include some distributions outside of CE. It would hopefully avoid such unsatisfactory results as occurred in the Battle of the Buddies. Even more ideally (and perhaps unrealistically) one would like to be able to insure convergence to the above set even against other behaviour rules. In fact, one would, expand the set so that the goal is for the empirical distribution to converge to the set that insures at least the "optimal" compromise. This may, however, be too ambitious a task. 74 MR was an attempt to provide a behaviour rule that forced convergence to a more satisfactory set. While the results when MR is played against itself are generally excellent (and definitely superior to the other behaviour rules studied), there are games (as described on page 62) where currently MR fails to converge to a distribution resulting in anything that might reasonably be called an "optimal" compromise. Moreover, MR failed to do well against PR in any of the games discussed in the previous chapter. Whether or not there exists a useful weighting of the two sums involved that would lead to more satisfactory convergence, no matter what the game, is an open question. In view of our results above, it seems fairly clear that the three problems related to the convergence of behaviour rules in game theory (mentioned at the outset of this conclusion) have yet to be ideally answered. Certainly there is possible work to be done in determining a more satisfactory target set to replace MBR and CE. Even with C E as the target set, it would be beneficial to have a behaviour rule whose convergence rate was a little faster than that of Regrets 1; though as yet Mas-Collel and Hart, to their credit, are the only ones to guarantee convergence to CE at all. The above notwithstanding, there is much to be said for the work that has already been done. My hope is that this thesis has provided a clear understanding of the research so far and at least some indication of future potentialities. 75 Appendix A: Graphical and Tabular Results This appendix presents graphical and tabular evidence of convergence of the different behaviour rule pairs in the games described in the chapter 5. I have not included the graphs of all possible pairs in all three games as that would be both tedious and unnecessary. I have included some examples from the Shapley game as an indication of the various convergence rates. Also included are tables of the payoffs and empirical distributions after 30000 iterations, along with an indication of each pair's convergence properties. Legend: * obvious convergence + evidence of convergence — no evidence of convergence Battle of the Buddies - 3000 iterations (initial conditions set to favour player N) Convergence * * * * * * * * * * * * * * * Behaviour Rule RI vs FP RI vs EFP RI vs PR RI vs RI MR vs RI MR vs FP MR vs EFP MR vs PR MR vs MR FP vs FP FP vs EFP FP vs PR EFP vs EFP EFP vs PR PR vs PR 76 Empirical Distribution (0,0,0,1) (0,0,0,1) (0,0,0,1) (0,0,0,1) (.9887, .0103, .0000, .0010) (.9943, .0050, .0000, .0007) (.9917, .0070, .0000, .0013) (.0073, .8194, .0073, .1659) (.4389,1256, .0000,-.4355) (0,0,0,1) (0,0,0,1) (0,0,0,1) (0,0,0,1) (0,0,0,1) (0,0,0,1) Behaviour Rule RI vs FP RI vs EFP RI vs PR RI vs RI MR vs RI MR vs FP MR vs EFP MR vs PR MR vs MR FP vs FP FP vs EFP FP vs PR EFP vs FP EFP vs PR PR vs PR max it ( u,Q!jv) 1 1 1 1 4.94 4.97 4.96 .985 2.19 1 1 1 1 1 1 Behaviour rules RI vs FP RI vs E F P RI vs PR RI vs RI MR vs RI MR vs FP MR vs EFP MR vs PR MR vs M R FP vs FP FP vs EFP FP vs PR EFP vs E F P EFP vs PR PR vs PR /x M / u {a) M 1 1 1 1 4.94 4.97 4.96 .203 2.63 1 1 1 1 1 1 max„ u (a , 5 5 5 5 .999 .999 .998 .866 2.18 5 5 5 5 5 5 M M v) u (a) N 5 5 5 5 .994 .998 .999 .837 2.62 5 5 5 5• 5 5 Marginal for player 1 Marginal for player 2 (0,1) (0,1) (0,1) (0,1) (o, i) (0,1) (0,1) (0,1) (.9990, .0010) (.9887,.0113) (.9993, .0007) (.9943, .0057) (.9987,-0013)' (.9917, .0083) (.8267,.1733) (.0147, .9853) (.4389,-5611) (.5645, .4355) (0,1) (0,1) (0,1) (0,1) (0,1) (0,1) (0,1) (0,1) (0,1) (0,1) (0,1) (0,1) 77 SPRGW - 30000 iterations Behaviour Rule RI vs FP RI vs EFP RI vs PR RI vs RI MR vs RI MR vs FP MR vs EFP MR vs PR MR vs MR FP vs FP FP vs EFP FP vs PR EFP vs EFP EFP vs PR PR vs PR Behaviour rules RI vs FP RI vs EFP RI vs PR RI vs RI MR vs RI MR vs FP MR vs EFP MR vs PR MR vs M R FP vs FP FP vs EFP FP vs PR EFP vs EFP EFP vs PR PR vs PR max u (Ai, a ) .0471 .0739 .0821 .1565 .6453 .6666 .9027 .7866 .9993 .0101 .0066 .0018 .0061 .0115 .0101 M M N max„ u -.0411 .0424 -.0333 .0332 -.0695 .0399 +.0239 .0673 +.5678 .4192 -.0673 .3278 -.2456 .2513 -.5287 .5405 -.0003 .9987 +.0000 .0101 -.0038 .0123 -.0120 .0108 -.0068 .0048 -.0088 .0121 +.0000 .0101 u (a) M Marginal for player 1 (.1011, .1358, .2991, .0924, .3717) (.1029, .1093, .3533, .1029, .3266) (.0961, .1391, .3341, .0934, .3372) (.1186, .0791, .3739, .1004, .3279) (.2636, .2229, .2814, .2141, .0130) (.2203, .2203, .3334, .2204, .0056) (.1929, .1928, .3360, .1933, .0851) (.1917, .1175, .4187, .2558, .0164) (.3332, .3332, .0007, .3329, .0000) (.1102, .1117, .3300, .1153, .3329) (.1036, .1115, .3259, .1230, .3360) (.1120, .1073, .3348, .1147, .3314) (.1055, .1155, .3288, .1126, .3376) (.1159, .1087, .3298, .1173, .3283) (.1095, .1114, .3238, .1130, .3424) 78 {a , M M v) u (a) N +.0411 +.0333 +.0695 -.0239 -.5678 +.0673 +.2456 +.5287 +.0003 +.0000 +.0038 +.0120 +.0068 +.0088 +.0000 Marginal for player 2 (.0756, .1569, .3027, .1173, .3475) (.1239, .0861, .3269, .0896, .3735) (.1460, .0842, .3529, .1035, .3134) (.1426, .1152, .2355, .1342, .3726) (.2301, .2332, .0617, .2437, .2313) (.2592, .21350, .1940, .3334) (.2990, .2911, .0003, .3129, .0967) (.2887, .2129, .0002, .2851, .2131) (.3332, .3329, .0000, .3332, .0007) (.1102, .1117, .3300, .1153, .3329) (.1096, .1152, .3286, .1076, .3390) (.1108, .1112, .3342, .1100, .3338) (.1157, .1092, .3275, .1083, .3393) (.1178, .1241, .3260, .0918, .3404) (.1095, .1114, .3238, .1130, .3424) Convergence Behaviour Rule RI vs FP RI vs EFP RI vs PR + RI vs RI + MR vs RI + MR vs FP + MR vs EFP * MR vs PR * MR vs MR * FP vs FP + FP vs EFP * FP vs PR * EFP vs EFP + EFP vs PR * PR vs PR Empirical Distribution (.0101, .0142, .0287, .0146, .0335, .0191, .0058, .0434, .0241, .0434, .0384, .0616, .0741, .0295, .0955, .0044, .0060, .0390, .0123, .0307, .0037, .0692, .1174, .0369, .1145) (.0022, .0015, .0257, .0226, .0509, .0221, .0007, .0386, .0255, .0225, .0575, .0483, .0982, .0247, .1246, .0041, .0001, .0486, .0000, .0551, .038, .0355, .1158, .0168, .1205) (.0125, .0088, .0258, .0063, .0427, .0148, .0044, .0457, . .0372, .0371, .0511, .0389, .1141, .0299, .1001, .0180, .0019, .0435, .0025, .0276, .0496, .0302, .1238, .0276, .1059) (.0085, .0047, .0199, .0283, .0571, .0301, .0006, .0144, .0141, .0144, .0588, .0446, .01039, .0407, .1259, .0040, .0203, .0228, .0057, .0426, .0362, .0449, .0739, .0453, .1276) (.0128, .0177, .0439, .1715, .0177, .1687, .0139, .0016, .0238, .0149, .0329, .0249, .0118, .0304, .1815, .0112, .1690, .0044, .0150, .0145, .0047, .0077, .0000, .0029, .0027) (.0098, .1005, .0000, .0758, .0342, .0903, .0059, .0000, .0752, .0490, .0452, .0359, .0000, .0371, .2152, .1122, .0700, .0000, .0049, .0333, .0018, .0011, .0000, .0011, .0016) (.0588, .0577, .0001, .0587, .0175, .0607, .0529, .0000, .0566, .0226, .0966, .0984, .0002, .1091, .0316, .0577, .0596, .0000, .0571, .0188, .0252, .0225, .0000, .0313, .0061) (.0384, .0381, .0001, .0384, .0767, .0385, .0012, .0000, .0760, .0017, .0844, .1709,0, .0812, .0822, .1235, .0008, .0000, .0844, .0470, .0039, .0019, .0001, .0051, .0054) (.0000, .1666, .0000,-1666, .0000, .1666, .0000, .0000, .1666, .0000, .0003, .0000, .0000, .0000, .0003, .1663, . .1663, .0000, .0000, .0003, .0000, .0000, .0000, .0000, .0000) (.1102, .0000, .0000, .0000, .0000, .0000, .1117, .0000, .0000, .0000, .0000, .0000, .3300, .0000, .0000, .0000, .0000, .0000, .1153, .0000, .0000, .0000, .0000, .0000, .3329) (.0097, .0120, .0349, .0096, .0375, .0128, .0118, .0413, .0127, .0329, .0378, .0378, .1061, .0337, .1105, .0131, .0108, .0362, .0154, .0476, .0362, .0429, .1101, .0362, .1106) (.0087, .0119, .0381, .0135, .0397, .0135, .0121, .0351, .0110, .0355, .0381, .0396, .1089, .0380, .1102, .0125, .0131, .0373, .0126, .0392, .0380, .0345, .01148, .0350, .1092) (.0076, .0154, .0329, .0141, .0354, .016, .0076, .0371, .0149, .0399, .0392, .0352, .1116, .0353, .1075, .0145, .0152, .0370, .0078, .0382, .0385, .0358, .1088, .0362, .1183) (.0132, .0129, .0388, .0112, .0398, .0131, .0165, .0341, .0081, .0370, .0413, .0418, .1038, .0300, .1129, .0136, .0130, .0378, .0111, .0418, .0366, .0399, .1114, .0315, .1089) (.1095, .0000, .0000, .0000, .0000, .0000, .1114, .0000, .0000, .00(ft), .0000, .0000, .3238, .0000, .0000, .0000, .0000, .0000, .1130, .0000, .0000, .0000, .0000, .0000, .3424) 71 Shapley Game - 30000 iterations Convergence — — + + * * — * * — — — — — * Behaviour Rule RI vs FP RI vs EFP RI vs PR RI vs RI MR vs RI MR vs FP MR vs EFP MR vs PR MR vs MR FP vs FP FP vs EFP FP vs PR EFP vs EFP EFP vs PR PR vs PR Behaviour Rule RI vs FP RI vs EFP RI vs PR RI vs RI MR vs RI MR vs FP MR vs EFP MR vs PR MR vs MR FP vs FP FP vs EFP FP vs PR EFP vs EFP EFP vs PR PR vs PR Empirical Distribution (.1595, .0862, .0005, .0003, .1389, .1429, .2424, .0009, .2284) (.1892, .2093, .0000, .0376, .2678, .0677, .1216, .0018, .1050) (.1389, .1608, .0302, .0393, .1610, .1637, .1386, .0365, .1310) (.1363, .1492, .0058, .0153, .1818, .1975, .1141, .0038, .1961) (.2571, .0548, .0270, .0218, .2439, .0498, .0582, .0358, .2516) (.2667, .0667, .0000, .0000, .2667, .0667, .0667, .0000, .2666) (.1489, .2094, .0142, .0774, .2427, .0559, .1158, .0624, .0734) (.0833, .1666, .0833, .0833, .0834, .1667, .1667, .0833, .0834) (.1667, .1667, .0000, .0000, .1667, .1667, .1667, .0000, .1667) (.0644, .0945, .0000, .0000, .1386, .2033, .2013, .0000, .2980) (.0891, .1304, .0000, .0000, .1910, .2804, .0602, .0000, .2490) (.1137, .1833, .0013, .0015, .1835, .2962, .1135, .0011, .1059) (.2430, .0611, .0000, .0000, .0895, .1311, .2826, .0000, .1925) (.1084, .1754, .0017, .0017, .1756, .2848, .1082, .0015, .1426) (.1138, .1147,1083, .1072, .1124, .1104, .1126, .1092, .1115) max^ujw^, a ) .402 .479 .358 .399 .337 .333 .514 .333 .333 .501 .529 .403 .526 .429 .336 N %(«) .527 .562 .431 .514 .752 .800 .465 .250 .500 .501 .529 .403 .525 .427 .338 80 max„ u (ctM,v) .472 .399 .364 .395 .346 .333 .376 .333 .333 .499 .471 .481 .475 .462 .337 M u (a) N .472 .399 .463 .461 .163 .200 .381 .500 .500 .499 .471 .593 .475 .568 .338 Shapley Game: R1 vs R1 (30000 iterations) 0.4 0.35 H 0.3 H 0 10 20 30 40 SI 50 60 No. of Iterations 70 80 90 100 Shapley Game: F P vs F P (30000 iterations) 0.4 |I ' I 0 10 20 : I 30 . I 40 . I I 50 60 No. of Iterations tz I 70 l_ 80 _l _1 90 100 Shapley Game: M R vs M R (30000 iterations) 0.18 r 0.16 h 0.14 0.12h _o | 0.11- CO T3 O 'o. E 0.08 LU 0.06 0.04 0.02 10 20 30 40 50 60 No. of Iterations 35 70 80 90 100 Appendix B: Programs All programs were written in Matlab. Though not presented here, they are available for public viewing at the following website: http://www.iam.ubc.ca/theses/ for any who are interested. The main program is given first and is called game.m. Each behaviour rule has two separate programs - one for player M and one for player N. The line in program "game" which starts jl = ... determines what behaviour rule is used by player M while the line right below it determines what behaviour rule is used by player N. The program "game" will first ask the user to input the payoff matrices for both players. Thus, it will run any game the user wants. Moreover by changing two lines, the user can run any pair of behaviour rules against each other. The output includes the empirical distribution (after however many iterations) of the game and, for each player, the payoff, the marginal distribution and the best-response payoff against the opponent's marginal distribution. (In other words, all the information displayed in the tables in Appendix A.) 86 Bibliography [1] Blackwell, D. "An Analog of the Minimax Theorem for Vector Payoffs". (1954). [2] Blackwell, D. "On Optimal Systems". Ann. Math. Stat. 25(1954), 394-397. [3] Foster, D. and Vohra, R. "Calibrated Learning and Correlated Equilibrium". Games and Economic Behaviour 21(1997), 40-45. [4] Fudenberg, D. and Levine D. "Consistency and Cautious Fictitious Play". Journal of Economic Dynamics & Control 19(1995), 1065-1089. [5] Isaacson, D. Markov Chains: Theory and Applications. John Wiley & Sons. Toronto. 1976. [6] Kao,E. An Introduction to Stochastic Processes. Duxbury Press. Toronto. 1997. [7] Kijima, M . Markov Processes for Stochastic Modelling. Chapman & Hall. New York. 1997. [8] Mas-Collel, A. and Hart, S. "A simple Adaptive Procedure Leading to Correlated Equilibrium", preprint (1998). [9] Morris, Peter. Introduction to Game Theory. Springer-Verlag. New York: 1994. [10] Robinson, Julia. "An Iterative Method of solving a Game". Ann, of Math. 54(1951), 296-301. [11] Shapley, Lloyd. "Some Topics in Two-Person Games". Advances in Game Theory Princeton: Princeton University Press, 1964. 87
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Convergence of behaviour rules in iterated matrix games
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Convergence of behaviour rules in iterated matrix games Patrick, Jonathan 1999
pdf
Page Metadata
Item Metadata
Title | Convergence of behaviour rules in iterated matrix games |
Creator |
Patrick, Jonathan |
Date Issued | 1999 |
Description | This master's thesis reports on a foray into Game Theory, focusing solely on the twoperson (not necessarily zero-sum) game. Primarily, I am interested in the convergence properties of different behaviour rules and how one might proceed to introduce some form of learning into the strategies of the players involved in the game. Therefore, I begin with the introduction of some key equilibrium sets - namely the set of Nash Equilibria (NE), the set of correlated equilibria (CE) and the marginal best-response set (MBR). I briefly discuss the relationship between these three sets before moving on to describe some desirable properties of behaviour rules. From there, I introduce six behaviour rules (four from the literature, two original) that attempt to incorporate some form of learning into the game. The four from the literature are Fictitious Play, Exponential Fictitious play, Regrets 1 and Regrets 2. I have named the two original behaviour rules Past Response and Modified Regrets. I then move on to describe the convergence properties of each. This thesis was originally motivated by a talk given by Andreu Mas-Collel on the properties of the two Regrets-based behaviour rules. Thus, a fair amount of time is spent reworking the convergence proofs of both Regretsl and Regrets2 as they were developed by Mas-Collel and Sergiu Hart. I then suggest an alternative proof of the Regretsl convergence properties. I close off the paper with some numerical results from three games - a zero-sum game, a game developed by Lloyd Shapley (called the Shapley game) and a game called Battle of the Buddies. They are designed to give some numerical confirmation of the convergence theorems stated earlier in the paper as well as some indication as to where further study might be useful. |
Extent | 6468309 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2009-06-26 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0080029 |
URI | http://hdl.handle.net/2429/9763 |
Degree |
Master of Science - MSc |
Program |
Mathematics |
Affiliation |
Science, Faculty of Mathematics, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1999-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_1999-0586.pdf [ 6.17MB ]
- Metadata
- JSON: 831-1.0080029.json
- JSON-LD: 831-1.0080029-ld.json
- RDF/XML (Pretty): 831-1.0080029-rdf.xml
- RDF/JSON: 831-1.0080029-rdf.json
- Turtle: 831-1.0080029-turtle.txt
- N-Triples: 831-1.0080029-rdf-ntriples.txt
- Original Record: 831-1.0080029-source.json
- Full Text
- 831-1.0080029-fulltext.txt
- Citation
- 831-1.0080029.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0080029/manifest