Deep Learning for Predicting Human Strategic BehaviorbyJason Siyanda HartfordB. Science, University of Witwatersrand, 2008M. Economics, University of Witwatersrand, 2011A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Computer Science)The University of British Columbia(Vancouver)October 2016c© Jason Siyanda Hartford, 2016AbstractPredicting the behavior of human participants in strategic settings is an importantproblem for applications that rely on game theoretic reasoning to design mech-anisms or allocate resources. Most existing work either assumes that participantsare perfectly rational, or attempts to directly model each participant’s cognitive pro-cesses based on insights from cognitive psychology and experimental economics.In this work, we present an alternative, a deep learning-based approach that auto-matically performs cognitive modeling without relying on such expert knowledge.We introduce a novel architecture that allows a single network to generalize acrossnormal form games with varying numbers of actions. We show that the archi-tecture generalists the most successful existing models and that its performancesignificantly improves upon that of the previous state of the art, which relies onexpert-constructed features.iiPrefaceThe work presented in this thesis was performed in collaboration with James R.Wright and my advisor Kevin Leyton-Brown. It has not yet been published, buthas been accepted for publication at the Thirtieth Annual Conference on NeuralInformation Processing Systems, 2016.I worked with James and Kevin to devise the architecture described in Chapter3. I wrote all the code used in the experiments described in Chapter 5 with theexception of a software package written by James that presents the data in a man-ageable format. I wrote the first draft of all the text under the guidance of Jamesand Kevin who later edited and wrote additional text.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1 Prediction in Normal Form Games. . . . . . . . . . . . . . . . . . 42.1.1 Quantal Response . . . . . . . . . . . . . . . . . . . . . . 52.1.2 Iterative Strategic Reasoning . . . . . . . . . . . . . . . . 52.2 Deep Learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.2.1 Feed-Forward Neural Networks . . . . . . . . . . . . . . 72.2.2 Invariances in Neural Networks . . . . . . . . . . . . . . 92.2.3 Neural Networks in Strategic Settings . . . . . . . . . . . 103 Modeling Human Strategic Behavior with Deep Networks . . . . . . 113.1 Invariance-Preserving Hidden Units . . . . . . . . . . . . . . . . 133.1.1 Pooling Units . . . . . . . . . . . . . . . . . . . . . . . . 15iv3.1.2 Softmax Output . . . . . . . . . . . . . . . . . . . . . . . 173.2 Action Response Layers . . . . . . . . . . . . . . . . . . . . . . 183.3 Output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203.4 Relation to Existing Deep Models . . . . . . . . . . . . . . . . . 204 Representational Generality of Our Architecture . . . . . . . . . . . 224.1 Behavioral Models . . . . . . . . . . . . . . . . . . . . . . . . . 224.1.1 Level-k . . . . . . . . . . . . . . . . . . . . . . . . . . . 244.1.2 Cognitive Hierarchy . . . . . . . . . . . . . . . . . . . . 244.1.3 Quantal Level-k . . . . . . . . . . . . . . . . . . . . . . . 254.1.4 Quantal Cognitive Hierarchy . . . . . . . . . . . . . . . . 254.2 Game Theoretic Features . . . . . . . . . . . . . . . . . . . . . . 255 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 275.2 Training Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 285.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 295.4 Regular Neural Network Performance . . . . . . . . . . . . . . . 326 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . 34Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35A Supporting Materials . . . . . . . . . . . . . . . . . . . . . . . . . . 39A.1 Representing behavioural features . . . . . . . . . . . . . . . . . 39A.1.1 Max Max Payoff . . . . . . . . . . . . . . . . . . . . . . 40A.1.2 Max Min Payoff . . . . . . . . . . . . . . . . . . . . . . 41A.1.3 Max Max Efficiency . . . . . . . . . . . . . . . . . . . . 41A.1.4 Minimax Regret . . . . . . . . . . . . . . . . . . . . . . 42A.1.5 Min Min Unfairness . . . . . . . . . . . . . . . . . . . . 42vList of TablesTable 5.1 Our datasets. Each experiment had subjects play between 8 and20 games for a total of 128 games, 113 of which were unique. . 28viList of FiguresFigure 1.1 An example 3×3 normal form game. . . . . . . . . . . . . . 3Figure 3.1 This figure shows the direct application of a regular feed for-ward network to an example 3×3 game matrix. Two problemswith this approach are immediately clear. First, changing theordering of the actions ({T, M, B} and {L, C, R}) changes thefunction output; and second if we were to add an action (e.g.add a row to the above payoff matrix), we could no longer ap-ply the network. . . . . . . . . . . . . . . . . . . . . . . . . 12Figure 3.2 A schematic representation of our architecture. The featurelayers consist of hidden matrix units (orange), each of whichuse pooling units to output row- and column-preserving aggre-gates (blue and purple) before being reduced to distributionsover actions in the softmax units (red). Iterative response ismodeled using the action response layers (green) and the finaloutput, y, is a weighted sum of the row player’s action responselayers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14Figure 3.3 A graphical representation of our hidden matrix units. Eachnode is a weighted sum of matrices with a single scalar weightassociated with each matrix, followed by an element-wise ac-tivation function, φ(·) . . . . . . . . . . . . . . . . . . . . . . 15viiFigure 3.4 The above image shows the input and two hidden layers of anetwork with a 5×5 input matrix, three invariance preservinghidden units in the first hidden layer and one unit in the secondhidden layer. Without pooling units, each element of every hid-den matrix unit depends only on the corresponding elements inthe units from the layer below; e.g., the middle element high-lighted in red depends only on the value of the elements of thematrices highlighted in orange. . . . . . . . . . . . . . . . . . 17Figure 3.5 This figure shows the same architecture as Figure 3.4 withpooling units added at each layer in the network. Now eachelement of every hidden matrix unit depends both on the cor-responding elements in the units below and the pooled quan-tity from each row and column. E.g., the light blue and purpleblocks represent the row and column-wise aggregates corre-sponding to their adjacent matrices. The dark blue and purpleblocks show which of these values the red element dependson. Thus, the red element depends on both the dark- and light-shaded orange cells. . . . . . . . . . . . . . . . . . . . . . . 18Figure 3.6 AR layers model best response to beliefs by taking the dotproduct of weighted sums of preceding AR layers with weightedsums of hidden units. This is equivalent to computing the ex-pected value of an internal representation of utility given by aweighted sum of the hidden units with respect to the distribu-tion defined by the preceding opposition AR layers. . . . . . . 19Figure 5.1 Negative Log Likelihood Performance. The error bars rep-resent 95% confidence intervals across 10 rounds of 10-foldcross-validation. We compare various models built using ourarchitecture to a quantal cognitive hierarchy (QCH) model withuniform level-0 behavior (pink line) and QCH with a weightedlinear model of level-0 behaviour (blue line). . . . . . . . . . 30viiiFigure 5.2 Performance comparison with and without pooling units. Allfour models were fit with the same hyperparameters using dropoutunless otherwise stated, with the only difference being the num-ber of layers and hidden units and whether or not the modelsused pooling units. . . . . . . . . . . . . . . . . . . . . . . . 31Figure 5.3 Performance comparison as we vary the number of action re-sponse layers. The AR layers appear to result in overfitting inthe network as increasing the number of AR layers reduces theaverage training set loss while increasing the average test setloss. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32Figure 5.4 Performance comparison on 3× 3 games of a feed-forwardneural network (FFNet), a feed-forward neural network withdata augmentation at every epoch (FFNet (Permuted)), our ar-chitecture with two hidden layers of 50 units each (labeled (50,50)), and quantal cognitive hierarchy with four hand-craftedfeatures (QCH Linear4). . . . . . . . . . . . . . . . . . . . . 33ixAcknowledgmentsFirst and foremost, I’d like to thank my advisor Kevin Leyton-Brown for his end-less support and patience over the last two years. He has given me more oppor-tunities than I imagined possible and has pushed me to be a better researcher bymaking me think clearly about what makes a problem interesting.I also can’t thank James Wright enough for his contributions to this project.James was a second mentor to me, his research provided the solid ground on whichthis thesis is built, and with his probing and insightful questions he taught me tomethodically explore a space of ideas.Neil, Chris, Alice, Lars and Hedayat the rest of the Game Theory and DecisionTheory group provided both helpful research discussions and entertaining luncheswhen I needed a break from research.This thesis explores the intersection of game theory and machine learning.Much of my knowledge of machine learning comes from Mark Schmidt and theenlightening discussions at our Machine Learning Reading Group.Finally, I’d like to thank Nicole for joining me on this crazy journey to the otherside of the world. It wouldn’t be the same without you.xFor NicolexiChapter 1IntroductionGame theory provides a powerful framework for the design and analysis of multia-gent systems that involve strategic interactions [see, e.g., 26]. Prominent examplesof such systems include search engines, which use advertising auctions to generatea significant portion of their revenues and rely on game theoretic reasoning to an-alyze and optimize these mechanisms [10, 33]; spectrum auctions, which rely ongame theoretic analysis to carefully design the “rules of the game” in order to coor-dinate the reallocation of valuable radio spectrum [21]; and security systems, whichanalyze the allocation of security personnel as a game between rational adversariesin order to optimize their use of scarce resources [32]. In such applications, systemdesigners optimize their choices with respect to assumptions about the preferences,beliefs and capabilities of human players [22]. A standard game theoretic approachis to assume that players are perfectly rational expected utility maximizers and, in-deed, that they have common knowledge of this. In some applications, such as thehigh-stakes spectrum auctions just mentioned, this assumption is probably reason-able, as participants are typically large companies that hire consultants to optimizetheir decision making. In other scenarios that allow less time for planning or in-volve less sophisticated participants, however, the perfect rationality assumptionmay lead to suboptimal system designs. For example, Yang et al. [38] were ableto improve the performance of systems that defend against adversaries in securitygames by relaxing the perfect rationality assumption. Of course, relaxing this as-sumption means finding something else to replace it with: an accurate model of1boundedly rational human behavior.The behavioral game theory literature has developed a wide range of modelsfor predicting human behavior in strategic settings by incorporating cognitive bi-ases and limitations derived from observations of play and insights from cognitivepsychology [2]. Like much previous work, we study the unrepeated, simultaneous-move setting, for two reasons. First, the setting is conceptually straightforward:games can be represented in a so-called “normal form”, simply by listing the util-ities to each player in for each combination of their actions (e.g., see Figure 1.1).Second, the setting is surprisingly general: auctions, security systems, and manyother interactions can be modeled naturally as normal form games. The most suc-cessful predictive models for this setting combine notions of iterative reasoningand noisy best response [34] and use hand-crafted features to model the behaviorof non-strategic players [36].The recent success of deep learning has demonstrated that predictive accuracycan often be enhanced, and expert feature engineering dispensed with, by fittinghighly flexible models that are capable of learning novel representations. A keyfeature in successful deep models is the use of careful design choices to encode“basic domain knowledge of the input, in particular its topological structure. . . tolearn better features” [1, emphasis original]. For example, feed-forward neuralnets can, in principle, represent the same functions as convolution networks, butthe latter tend to be more effective in vision applications because they encode theprior that low-level features should be derived from the pixels within a small neigh-borhood and that predictions should be invariant to small input translations. Anal-ogously, Clark and Storkey [5] encoded the fact that a Go board is invariant torotations. These modeling choices constrain more general architectures to a subsetof the solution space that is likely to contain good solutions. Our work seeks todo the same for the behavioral game theory setting, identifying novel architecturalconstraints that extend deep learning to predicting behavior in strategic scenariosencoded as two player, normal-form games.A key property required of such a model is invariance to game size: a modelmust be able to take as input an m× n bimatrix game (i.e., two m× n matricesencoding the payoffs of players 1 and 2 respectively) and output an m-dimensionalprobability distribution over player 1’s actions, for arbitrary values of n and m,20 5 10 15 20 25 3010,105,38,183,520,2025,018,80,2515,15TMBTMBLCR Counts of ActionsFigure 1.1: An example 3× 3 normal form game. The row player choosesfrom actions {T,M,B} and the column player chooses from actions{R,C,L}. If the row player played action T and column player playedaction C, their resulting payoffs would be 3 and 5 respectively. Givensuch a matrix as input we aim to predict the distribution over the rowplayer’s choice of actions defined by the observed frequency of selectedactions shown on the right.including values that did not appear in training data. In contrast, existing deepmodels typically assume either a fixed-dimensional input or an arbitrary-length se-quence of fixed-dimensional inputs, in both cases with a fixed-dimensional output.We achieve this by leveraging the simplifying behavioral assumption that playersare indifferent to the order in which actions are presented.In Chapter 2, we briefly review the key ideas from the behavioral game theoryliterature and deep learning upon which this thesis builds, before proceeding toChapter 3 where we present an architecture that operates on matrices using scalarweights to capture invariance to changes in the size of the input matrices and topermutations of its rows and columns. Chapter 4 shows how this architecture gen-eralizes the models described in Chapter 2. In Chapter 5 we evaluate our model’sability to predict distributions of play given normal form descriptions of gameson a dataset of experimental data from a variety of experiments, and find that ourfeature-free deep learning model significantly exceeds the performance of the cur-rent state-of-the-art model, which has access to hand-tuned features based on ex-pert knowledge [36].3Chapter 2BackgroundThis work builds on ideas from both the behavioral game theory literature and deeplearning. In this chapter, we survey relevant ideas from both fields. Section 2.1describes key ideas from the behavioral game theory literature that lead to accurateout-of-sample predictions of human behavior in unrepeated normal-form games.Section 2.2 provides a background on deep learning and explains the representationlearning philosophy, how successful models encode invariances through modelingchoices and prior work applying deep learning to strategic settings.2.1 Prediction in Normal Form Games.The task of predicting actions in normal form games has been studied mostly in thebehavioral game theory literature. These models tend to have few parameters andto aim to describe previously identified cognitive biases and limitations [2]. Whilethis approach allows researchers to test theories that attempt to explain human be-havior in normal-form games, it fails to answer an important practical question: ifone wants to use these models to actually predict behavior, which of them providesthe best out-of-sample performance? Wright and Leyton-Brown [34, 37] answeredthis through an extensive meta-analysis comparing the popular approaches used inthe behavioral game theory literature. They found that there are two key ideas thatsupport the best performing models: the relaxation of best response to “quantalresponse” and the notion of “limited iterative strategic reasoning”.4We survey these two ideas here, as well as an extension from Wright andLeyton-Brown [36] that improves predictive accuracy by explicitly modeling “non-strategic” players. In doing so, they showed the importance of using hand-craftedbehavioral features to improve predictive accuracy. To the best of our knowledge,this approach represents the state-of-the-art method for modeling human behaviorin unrepeated normal-form games.2.1.1 Quantal ResponseThe first important behavioral observation is that human players tend to play anoisy best response strategy. Models that assume quantal response assume thatplayers select actions with probability increasing in expected utility instead of al-ways selecting the action with the largest expected utility [20]. This is expressedformally by assuming that players select actions, ai, with probability, si, given bythe logistic quantal response function,si(ai) =exp(λui(ai,s−i))∑a′i exp(λui(a′i,s−i)).This function is mathematically equivalent to the “softmax” function (whichis widely used in the deep learning literature) with an additional scalar sharpnessparameter λ that allows the function to output a strategy that uniformly mixesbetween the actions in the set of best responses as λ → ∞ (and plays the bestresponse with certainty if it is unique) and the uniform distribution as λ → 0. Thisrelaxation is motivated by the behavioral notion that if two actions have similarexpected utility then they will also have similar probability of being chosen.2.1.2 Iterative Strategic ReasoningThe second behavioral observation is that players are likely to be boundedly ra-tional in their decision making. Iterative strategic reasoning means that playersperform a bounded number of steps of reasoning in deciding on their actions,rather than always converging to fixed points as in classical game theory. Mod-els incorporating this idea typically assume that every agent has an integer level.Non-strategic, level-0 players choose actions uniformly at random; level-k players5best respond to the level-(k−1) players in Costa-Gomes et al.’s [8] Level-k modelor to a mixture of levels between level-0 and level-(k− 1) in Camerer et al.’s [3]cognitive hierarchy model.The two ideas can be combined, allowing players to quantally respond to lowerlevel players [30, 35]. We examine the relationship between these models in moredetail in Chapter 4.Non-Strategic FeaturesThe assumption that level-0 players play uniformly at random has an importanteffect on the performance of the models for two reasons: it has been shown thatthere tend to be a large number level-0 players [35], and more importantly, thelevel-0 distribution effectively parameterizes the distributions of play of higher-level players because the models are defined recursively starting from a base caseof level-0 behavior.Given this observation we might expect that the performance of all of the mod-els mentioned above can be improved by better modeling the non-strategic level-0players. Wright and Leyton-Brown [36] demonstrate this by modeling level-0 be-havior using a weighted linear model of hand-crafted “non-strategic features” thateach recommend one or more actions. Examples of such features include the ac-tion associated with the largest payoff in the matrix, the max max payoff, or theaction that maximizes a player’s worst case payoff, the max min payoff. Thesefeatures are combined by taking their linearly weighted sum and normalizing todefine a distribution over actions. Their most successful model (in terms of pre-dictive accuracy) combines quantal response and bounded steps of reasoning withthe weighted-linear model of non-strategic level-0 behavior. This represents thestate-of-the-art behavioral game theory model against which we benchmark ourcontributions.2.2 Deep Learning.The performance of most machine learning algorithms depends heavily on the rep-resentation of its input data [13]. The core idea of deep learning is that one shouldbuild hierarchical models that may learn more useful and complex representations6at every level of the the hierarchy [13]. This approach has had much recent successin solving supervised learning problems in vision, speech and natural languageprocessing [for recent surveys see, 17, 24].In this section we begin by reviewing feed-forward neural networks and thendescribe how they have been constrained using domain-specific invariances to im-prove performance for tasks in particular domains.2.2.1 Feed-Forward Neural NetworksFeed-forward neural networks (or multi-layer perceptrons) are composite functionsthat map from an input vector in Rn to an output vector in Rm through a sequenceof linear transformations and element-wise nonlinearities as follows,f (x) = Wnφ(Wn−1φ(. . .φ(W2φ(W1x+b1)+b2). . .)+bn−1)+bn= fn ◦ fn−1 ◦ · · · ◦ f2 ◦ f1 ◦ xwhere fi(y) = φ(Wiy+bi) and φ(x) is some non-linear activation function appliedelement-wise. Commonly used non-linearities are the sigmoid function, sig(x) =11+e−x , the hyperbolic tangent function, tanh(x) =ex−e−xex+e−x , and rectified linear units,relu(x) =max(0,x). Each function fi is referred to as a “layer” in a network, wherelayers are vector-to-vector maps. Each scalar element of the vector output of fi isreferred to as a “node” or “neuron” in the network.We can evaluate the performance of the neural network by specifying a lossfunction that evaluates the output of the neural network for a given input vector withrespect to the corresponding target values observed in the data. The loss function,L (θ |{(xi,yi)}Ni=1) where θ is the set of parameters of the model, (xi,yi) are theN tuples of training examples from the dataset, is typically derived by maximizingthe likelihood of the data with respect to the parameters of the network.Because each of the functions, fi, is continuous and differentiable (almost)everywhere,1 one can compute gradients of the loss function with respect to eachof the parameter vectors/matrices (Wi,bi) in the network using the chain rule. Thisoperation can be done efficiently using the backpropagation algorithm by noticing1relu(x) is not differentiable at x = 0 but a set of sub-gradients exists.7that because∂L∂θi=∂L∂ fi+1∂ fi+1∂ fi∂ fi∂θi,we can reuse previously computed values ∂L∂ fi+1 if we compute gradients “back-wards” from the output layer to the input layer. For each layer i, we only have tocompute ∂ fi+1∂ fi∂ fi∂θi and can reuse computation from the previous layer to get∂L∂ fi+1 .This procedure allows us to compute gradients with respect to all parameters in thenetwork in time proportional to the forward pass through the network. Given thesegradients, the parameters of the network can be optimized using stochastic gradientdecent.Neural networks are non-convex so gradient decent is not guaranteed to reachthe global optimum, but empirically this tends not to be a problem as researchershave found that for the high-dimensional models used in practice, most local op-tima tend to offer similar generalization performance [4, 9].A bigger challenge is overfitting: neural networks are very flexible models,which makes them prone to overfitting the training data and poor generalizationperformance. Researchers combat this using a selection of regularization tech-niques. The simplest of these limits the capacity of the model by adding a normpenalty term to the loss function as follows,L˜ (θ |{(xi,yi)}Ni=1) =L (θ |{(xi,yi)}Ni=1)+α‖θ‖ j∈{1,2}Both the the absolute value norm ( j= 1, known as L1 regularization) and Euclideannorm ( j = 2, known as L2 regularization or weight decay) are commonly used inpractice. The former tends to result in more sparse solutions, while the latter pe-nalizes larger weight values more than smaller values and hence results in smalleraverage parameter values.An alternative approach to avoiding overfitting is model averaging. Averagingthe output of a large number of neural networks tends to outperform any individualmodel because while individual networks that overfit tend to exhibit high variance,they are not necessarily biased and hence averaging their predictions reduces vari-ance thereby improving generalization performance. Unfortunately, building largeensembles is infeasible because each network takes too long to train.8Dropout [28] is an alternative approach that can be interpreted as an efficientform of model averaging. It works by “dropping out” a different random subset ofthe nodes in the network at every mini-batch of training by element-wise multiply-ing the output of each fi by a random binary mask. At test time, one can either takea Monte-Carlo average of the model’s predictions by sampling a large number ofbinary masks and averaging the resulting prediction, or one can approximate theMonte-Carlo average by removing the binary mask and scaling the outputs of thenodes by the probability that they are retained.2.2.2 Invariances in Neural NetworksWhile a sufficiently large feed forward network could, in theory, represent the samefunctions as those used to achieve most state-of-the-art results published in theliterature,2 in practice they tend to be outperformed by models that have a moreconstrained output space that limits the network to solutions that are “reasonable”in a given domain.Convolutional neural networks are the best example of this approach. Insteadof applying an arbitrary affine transformation at every layer of the network, con-volution networks instead apply multiple convolution filters at each layer and thenetwork is parameterized by the weights of their convolution kernels. The con-volution operation is linear so there exists an equivalent affine transformation, butit requires a far larger number of parameters and involves significant redundancybecause the affine transformation representation has a large number of weights tiedto the same value. By constraining the network in this fashion we encode the factthat natural images tend to be invariant to translation. In some domains we can takethis idea further. For example, Clark and Storkey [5] encode the rotational invari-ance of the playing surface in the board game Go by constraining the convolutionkernels to be symmetric.We would like to predict action distributions in normal-form games with anarbitrary number of actions. This requires invariance to the size of the network’sinput. The best known prior work that achieves an analogous goal is Long et al.[19]’s Fully Convolutional Network architecture which predicts a class label for2For domains that depend on sequences, recurrent connections are also necessary. We omit adiscussion of recurrent neural networks here because they are not relevant to this work.9each pixel in an arbitrarily sized image. They achieve this by replacing the “fullyconnected layers” (regular feed-forward layers) that are typically used in the finallayers of a convolutional network with convolutions that map to a filter for eachclass label.While regular convolutional structure is useful in images, our model is mostmathematically similar to “MLP Convolutions” developed in Lin et al. [18]’s Net-work in Network model, though we derived our architecture independently usinggame theoretic invariances. We discuss the relationships between the two modelsin Section 3.4 at the end of Chapter 3.2.2.3 Neural Networks in Strategic SettingsThere have been relatively few applications of deep learning to multiagent set-tings. Notable exceptions are the evaluation function in Clark and Storkey [5] andthe policy network used in Silver et al. [27]’s work in predicting the actions of hu-man players in Go. Their approach is similar in spirit to ours: they map from adescription of the Go board at every move to the choices made by human players,while we perform the same mapping from a normal form game. The setting differsin that Go is a single, sequential, zero-sum game with a far larger, but fixed, actionspace, which requires an architecture tailored for pattern recognition on the Goboard. In contrast, we focus on constructing an architecture that generalizes acrossgeneral-sum, normal form games.10Chapter 3Modeling Human StrategicBehavior with Deep NetworksIn this chapter we explore how one might use deep learning to model human be-havior in normal form games. A natural starting point in applying deep networksto a new domain is testing the performance of a regular feed-forward neural net-work. To apply such a model to a normal form game, we need to flatten the utilityvalues into a single vector of length mn+nm and learn a function that maps to them-simplex output via multiple hidden layers. Feed-forward networks can’t handlesize-invariant inputs, but we can temporarily set that problem aside by restrictingourselves to games with a fixed input size. We experimented with that approach andfound that feed-forward networks often generalized poorly as the network overfit-ted the training data (see Section 5.4 of Chapter 5). One way of combating overfit-ting is to encourage invariance through data augmentation: for example, one mayaugment a dataset of images by rotating, shifting and scaling the images slightly. Ingames, a natural simplifying assumption is that players are indifferent to the orderin which actions are presented, implying invariance to permutations of the payoffmatrix.1 Incorporating this assumption by randomly permuting rows or columns ofthe payoff matrix at every epoch of training dramatically improved the generaliza-tion performance of a feed-forward network in our experiments, but the network is1We thus ignore salience effects that could arise from action ordering; we plan to explore this infuture work.1110,105, 38, 183, 520,2025, 018, 80, 2515, 15TMBRCL......· · ·· · ·· · ·...TMBFigure 3.1: This figure shows the direct application of a regular feed forwardnetwork to an example 3×3 game matrix. Two problems with this ap-proach are immediately clear. First, changing the ordering of the actions({T, M, B} and {L, C, R}) changes the function output; and second ifwe were to add an action (e.g. add a row to the above payoff matrix),we could no longer apply the network.still limited to games of the size that it was trained on.Our approach is to enforce this invariance in the model architecture rather thanthrough data augmentation. We then add further flexibility using novel “poolingunits” and by incorporating iterative response ideas inspired by behavioral gametheory models. The result is a model that is flexible enough to represent the allthe models surveyed in Wright and Leyton-Brown [35, 36]—and a huge space ofnovel models as well—and which can be identified automatically. The model isalso invariant to the size of the input payoff matrix, differentiable end to end andtrainable using standard gradient-based optimization.The model has two parts: feature layers and action response layers; see Figure3.2 for a graphical overview. The feature layers take the row and column player’snormalized utility matrices U(r) and U(c) ∈Rm×n as input, where the row player has12m actions and the column player has n actions. The feature layers consist of multi-ple levels of hidden matrix units, H(r)i, j ∈Rm×n, each of which calculates a weightedsum of the units below and applies a non-linear activation function. Each layer ofhidden units is followed by pooling units, which output aggregated versions ofthe hidden matrices to be used by the following layer. After multiple layers, thematrices are aggregated to vectors and normalized to a distribution over actions,f(r)i ∈ ∆m in softmax units. We refer to these distributions as features because theyencode higher-level representations of the input matrices that may be combined toconstruct the output distribution. The rest of the model is a cognitive hierarchymodel, and so these can be interpreted as the L0 features in such an architecture.The behavioral game theory literature has shown that iterative strategic reason-ing is an important phenomenon in human decision making (see Section 2.1.2 ofChapter 2 for details); we thus want to allow our models the option of incorporatingsuch reasoning. To do so, we compute features for the column player in the samemanner by applying the feature layers to the transpose of the input matrices, whichoutputs f(c)i ∈ ∆n. The first action response layer for each player is built from aweighted sum of their feature layer outputs. Each subsequent action response layerfor a given player then takes the opposite player’s preceding action response lay-ers as input and uses them to construct distributions over the respective players’outputs. The final output y ∈ ∆m is a weighted sum of all action response layers’outputs.3.1 Invariance-Preserving Hidden UnitsWe build a model that ties parameters in our network by encoding the assump-tion that players reason about each action identically. Our assumption that the rowplayer evaluates each action in the same way implies that she applies the samefunction to each row in the utility matrices. Thus, in a normal form game repre-sented by the utility matrices U(r) and U(c), the weights associated with each rowof U(r) and U(c) must be the same. Similarly, the corresponding assumption aboutthe column player implies that the weights associated with each column of U(r) andU(c) must also be the same. We can satisfy both assumptions by applying a singlescalar weight to each of the utility matrices, computing wrU(r)+wcU(c). This idea13H(r)1,1H1,1 ↓H1,1 ↓...H(r)1,jH1,j ↓H1,j ↓H(r)2,1H2,1 ↓H2,1 ↓...H(r)2,jH2,j ↓H2,j ↓. . .. . .f1...fjFeature LayersOutputInputUnitsSoftmaxUnitsH(c)1,1H1,1 ↓H1,1 ↓...H(c)1,jH1,j ↓H1,j ↓H(c)2,1H2,1 ↓H2,1 ↓...H(c)2,jH2,j ↓H2,j ↓. . .. . .f1...fjU(r)U(r)↓U(r) ↓U(c)U(c)↓U(c) ↓ar(r)0ar(r)1 ...ar(r)k−1ar(r)kar(c)0ar(c)1 ...ar(c)k−1yAction Response LayersFigure 3.2: A schematic representation of our architecture. The feature lay-ers consist of hidden matrix units (orange), each of which use poolingunits to output row- and column-preserving aggregates (blue and pur-ple) before being reduced to distributions over actions in the softmaxunits (red). Iterative response is modeled using the action response lay-ers (green) and the final output, y, is a weighted sum of the row player’saction response layers.can be generalized as in a standard feed-forward network to allow us to fit morecomplex functions. A hidden matrix unit taking all the preceding hidden matrixunits as input can be calculated asHl,i = φ(∑jwl,i, j Hl−1, j +bl,i)Hl,i ∈Rm×n,where Hl,i is the ith hidden unit matrix for layer l, wl,i, j is the jth scalar weight, bl,iis a scalar bias variable, and φ is a non-linear activation function applied element-wise (see Figure 3.3 for a graphical representation).Notice that, as in a traditional feed-forward neural network, the output of eachhidden unit is simply a nonlinear transformation of the weighted sum of the pre-ceding layer’s hidden units. Our architecture differs by maintaining a matrix ateach hidden unit instead of a scalar. So while in a traditional feed-forward network14H1,1H1,2...H1,kH2,1H2,2...H2,k′. . .. . .. . .U(r)U(c)H2,1 = φ(∑k w2,k H1,k )Input payoffmatricesFigure 3.3: A graphical representation of our hidden matrix units. Each nodeis a weighted sum of matrices with a single scalar weight associatedwith each matrix, followed by an element-wise activation function, φ(·)each hidden unit maps the previous layer’s vector of outputs into a scalar output, inour architecture each hidden unit maps a tensor of outputs from the previous layerinto a matrix output.Tying weights in this way reduces the number of parameters in our networkby a factor of nm, offering two benefits. First, it reduces the degree to whichthe network is able to overfit; second and more importantly, it makes the modelinvariant to the size of the input matrices. To see this, notice that each hiddenunit maps from a tensor containing the k output matrices of the preceding layer inRk×m×n to a matrix in Rm×n using k weights. Thus our number of parameters ineach layer depends on the number of hidden units in the preceding layer, but noton the sizes of the input and output matrices. This allows the model to generalizeto input sizes that do not appear in training data.3.1.1 Pooling UnitsA limitation of the weight tying used in our hidden matrix units is that it forcesindependence between the elements of their matrices, preventing the network fromlearning functions that compare the values of related elements (see Figure 3.5(left)). Recall that each element of the matrices in our model corresponds to anoutcome in a normal form game. A natural game theoretic notion of the “relatedelements” which we’d like our model to be able to compare is the set of payoffs as-sociated with each of the players’ actions that led to that outcome. This correspondsto the row and column of each matrix associated with the particular element.15This observation motivates our pooling units, which allow information sharingby outputting aggregated versions of their input matrix that may be used by laterlayers in the network to learn to compare the values of a particular cell in a matrixand its row- or column-wise aggregates.H→{Hc,Hr}=maxi hi,1 maxi hi,2 . . .maxi hi,1 maxi hi,2 . . .......maxi hi,1 maxi hi,2 ,max j h1, j max j h1, j . . .max j h2, j max j h2, j . . .......max j hm, j max j hm, j . . .(3.1)A pooling unit takes a matrix as input and outputs two matrices constructed fromrow- and column-preserving pooling operations respectively. In principle, a pool-ing operation could be any continuous function that maps from Rn→R, but somefunctions stand out as more behaviorally plausible. One might imagine playerscomparing payoffs to the best possible payoff associated with an action, implyingthe max pooling operation, or to the average payoffs associated with an action, im-plying the mean pooling operation. In practice we found the max function offeredthe best performance.Equation (3.1) shows the output of a max pooling unit applied to some arbitrarymatrix H. The first of the two outputs, Hc, is column-preserving in that it selectsthe maximum value in each column of H and then stacks the resulting vector n-dimensional vector m times such that the dimensionality of H and Hc are the same.Similarly, the row-preserving output constructs a vector of the max elements ineach column and stacks the resulting m-dimensional vector n times such that Hrand H have the same dimensionality. We stack the vectors that result from thepooling operation in this fashion so that at the hidden units from the next layerin the network may take H,Hc and Hr as input. This allows these later hiddenunits to learn functions where each element of their output is a function both of thecorresponding element from the matrices below as well as their row and column-preserving maximums (compare Figure 3.4 with Figure 3.5).16Figure 3: Left: Without pooling units, each element of every hidden matrix unit depends only on thecorresponding elements in the units from the layer below; e.g., the middle element highlighted inred depends only on the value of the elements of the matrices highlighted in orange. Right: Withpooling units at each layer in the network, each element of every hidden matrix unit depends both onthe corresponding elements in the units below and the pooled quantity from each row and column.E.g., the light blue and purple blocks represent the row and column-wise aggregates corresponding totheir adjacent matrices. The dark blue and purple blocks show which of these values the red elementdepends on. Thus, the red element depends on both the dark- and light-shaded orange cells.JH: TODO: add level labelsAction Response Layers The feature layers described above are sufficient to meet our objective203of mapping from the input payoff matrices to a distribution over the row player’s actions. However,204this architecture is not capable of explicitly representing iterative strategic reasoning, which the205behavioral game theory literature has identified as an important modeling ingredient. We incorporate206this ingredient using action response layers: the first player can respond to the second’s beliefs,207the second can respond to this response by the first player, and so on to some finite depth. The208proportion of players in the population who iterate at each depth is a parameter of the model; thus,209our architecture is also able to learn not to perform iterative reasoning.210More formally, we begin by denoting the output of the feature layers as ar(r)0 =Pki=1 w(r)0i f(r)i ,211where we now include an index (r) to refer to the output of row player’s action response layer212ar(r)0 2 m. Similarly, by applying the feature layers to a transposed version of the input matrices,213the model also outputs a corresponding ar(c)0 2 n for the column player which expresses the row214player’s beliefs about which actions the column player will choose. Each action response layer215composes its output by calculating the expected value of an internal representation of utility with216respect to its belief distribution over the opposition actions. For this internal representation of utility217we chose simply a weighted sum of the final layer of the hidden layers,Pi wiHL,i, because each218HL,i is already some non-linear transformation of the original payoff matrix, and so this allows the219model to express utility as a transformation of the original payoffs. Given the matrix that results from220this sum, we can compute expected utility with respect to the vector of beliefs about the opposition’s221choice of actions, ar(c)j , by simply taking the dot product of the weighted sum and beliefs. When222we iterate this process of responding to beliefs about one’s opposition more than once, higher level223players will respond to beliefs, ari, for all i less their level and then output a weighted combination224of these responses using some weights, vl,i. Putting this together, the lth action response layer for the225row player (r) is defined as226ar(r)l = softmax l l1Xj=0v(r)l,j kXi=1w(r)l,i H(r)L,i!· ar(c)j!!, ar(r)l 2 m, l 2 {1, ...,K},where l indexes the action response layer, l is a scalar sharpness parameter that allows us to sharpen227the resulting distribution, w(r)l,i and v(r)l,j are scalar weights,HL,i are the row player’s k hidden units228from the final hidden layer L, ar(c)j is the output of the column player’s jth action response layer and229K is the total number of action response layers. We constrain w(r)li and v(r)lj to the simplex and use230l to sharpen the output distribution so that we can optimize the sharpness of the distribution and231relative weighting of its terms independently. We build up the column player’s action response layer,232ar(c)l , similarly, using the column player’s internal utility representation,H(c)L,i, responding to the row233player’s action response layers, ar(r)l . These layers are not used in the final output directly but are234relied upon by subsequent action response layers of the row player.2356Figure 3: Left: Without pooling units, each element of every hidden matrix unit depends only on thecorresponding elements in the units from the layer below; e.g., the middle element highlighted inred depends only on the value of the elements of the matrices highlighted in orange. Right: Withpooling units at each layer in the network, each element of every hidden matrix unit depends both onthe corresponding elements in the units below and the pooled quantity from each row and column.E.g., the light blue and purple blocks represent the row and column-wise aggregates corresponding totheir adjacent matrices. The dark blue and purple blocks show which of these values the red elementdepends on. Thus, the red element depends on both the dark- and light-shaded orange cells.JH: TODO: add level labelsAction Response Layers The feature layers described above are sufficient to meet our objective203of mapping from the input payoff matrices to a distribution over the row player’s actions. However,204this architecture is not capable of explicitly representing iterative strategic reasoning, which the205behavioral game theory literature has identified as an important modeling ingredient. We incorporate206this ingredient using action response layers: the first player can respond to the second’s beliefs,207the second can respond to this response by the first player, and so on to some finite depth. The208proportion of players in the population who iterate at each depth is a parameter of the model; thus,209our architecture is also able to learn not to perform iterative reasoning.210More formally, we begin by denoting the output of the feature layers as ar(r)0 =Pki=1 w(r)0i f(r)i ,211where we now include an index (r) to refer to the output of row player’s action response layer212ar(r)0 2 m. Similarly, by applying the feature layers to a transposed version of the input matrices,213the model also outputs a corresponding ar(c)0 2 n for the column player which expresses the row214player’s beliefs about which actions the column player will choose. Each action response layer215composes its output by calculating the expected value of an internal representation of utility with216respect to its belief distribution over the opposition actions. For this internal representation of utility217we chose simply a weighted sum of the final layer of the hidde layers,Pi wiHL,i, because each218HL,i is already some non-linear transformation of the origi al payoff matrix, and so this allows the219model to express utility as a tra sf r ati f t e riginal payoffs. Given the matrix that results from220this sum, we can compute expected utility with respect to the vector of beliefs about the opp sition’s221choice of actions, ar(c)j , by simply taking the dot product of the weighted sum and beliefs. When222we iterate this process of responding to beliefs about one’s opposition more than once, hig r level223pl yers will respond to beliefs, ari, for all i less their level and then output a weighted combination224of these responses using ome weights, vl,i. Putting this together, the lth action r sponse layer for the225row player (r) is defined as226ar(r)l = softmax l l1Xj=0v(r)l,j kXi=1w(r)l,i H(r)L,i!· ar(c)j!!, ar(r)l 2 m, l 2 {1, ...,K},where l indexes the action response layer, l is a scalar sharpness parameter that allows us to sharpen227the resulting distribution, w(r)l,i and v(r)l,j are scalar weights,HL,i are the row player’s k hidden units228from the final hidden layer L, ar(c)j is the output of the column player’s jth action response layer and229K is the total number of action response layers. We constrain w(r)li and v(r)lj to the simplex and use230l to sharpen the output distribution so that we can optimize the sharpness of the distribution and231rel tiv weighting of its terms i dependently. We build up the column player’s action response layer,232ar(c)l , similarly, using the column player’s internal utility representation,H(c)L,i, responding to the row233player’s action response layers, ar(r)l . These layers are not used in the final output directly but are234relied upon by subsequ nt action response layers of the row player.2356Input UnitsHidden Layer 1Hidden Layer 2Figure 3.4: The above image shows the input and two hidden layers of a net-work with a 5×5 input matrix, three invariance preserving hidden unitsin the first hidden layer and one unit in the second hidden layer. With-out pooling units, each element of every hidden matrix unit dependsonly on the corresponding elements in the units from the layer below;e.g., the middle element highlighted in red depends only on the value ofthe elements of the matrices highlighted in orange.3.1.2 Softmax OutputOur model predicts a distribution over the row player’s actions. In order to do this,we need to map from the hidden matrices in the final layer, HL,i ∈ Rm×n, of thenetwork onto a point on the m-simplex, ∆m. We achieve this mapping by applyinga row-preserving sum to each of the final layer hidden matrices HL,i (i.e. we sumuniformly over the columns of the matrix s d scribed above) and then applying asoftmax function to convert each of the resulting vectors hi into normalized distri-butions. This produces k features fi, each of which is a distribution over the rowplayer’s m actions:fi = softmax(h(i))where h(i)j =n∑k=1h(i)j,k for all j∈{1, ...,m}, h(i)j,k ∈H(i) i∈{1, ...,k}.We c n then produce the output of our features, ar0, using a weighted sum ofthe individual features, ar0 = ∑ki=1 wifi, where we optimize wi under simplex con-straints, wi ≥ 0, ∑i wi = 1. Because each fi is a distribution and our weights wi arepoints on the implex, the output of the feature layers is a mixture of distributions.17Figure 3.5: This figure shows the same architecture as Figure 3.4 with pool-ing units added at each layer in the network. Now each element of everyhidden matrix unit depends both on the corresponding elements in theunits below and the pooled quantity from each row and column. E.g.,the light blue and purple blocks represent the row and column-wise ag-gregates corresponding to their adjacent matrices. The dark blue andpurple blocks show which of these values the red element depends on.Thus, the red element depends on both the dark- and light-shaded or-ange cells.3.2 Action Response LayersThe feature layers described above are sufficient to meet our objective of map-ping from the input payoff matrices to a distribution over the row player’s actions.However, this architecture is not capable of explicitly representing iterative strate-gic reasoning, which the behavioral game theory literature has identified as an im-portant modeling ingredient. We incorporate this ingredient using action responselayers: the first player can respond to the second’s beliefs, the second can respondto this response by the first player, and so on to some finite depth. The proportionof players in the population who iterate at each depth is a parameter of the model;thus, our architecture is also able to learn not to perform iterative reasoning.More formally, we begin by denoting the output of the feature layers as ar(r)0 =∑ki=1 w(r)0i f(r)i , where we now include an index (r) to refer to the output of rowplayer’s action response layer ar(r)0 ∈ ∆m. Similarly, by applying the feature layersto a transposed version of the input matrices, the model also outputs a correspond-ing ar(c)0 ∈ ∆n for the column player which expresses the row player’s beliefs about18Hl,1Hl,2Hl,3∑j wjHl,jar0ar1ar2∑jwj arj•Dot product= fark+1AR layers modelbest response to be-liefs by multiplyingweighted sums ofpreceding AR layerswith weighted sumsof hidden units.f(·) = softmax(x)Weighted sum ofthe last hiddenlayerWeighted sumof oppositionAR layersFigure 3.6: AR layers model best response to beliefs by taking the dot prod-uct of weighted sums of preceding AR layers with weighted sums ofhidden units. This is equivalent to computing the expected value of aninternal representation of utility given by a weighted sum of the hiddenunits with respect to the distribution defined by the preceding oppositionAR layers.which actions the column player will choose. Each action response layer com-poses its output by calculating the expected value of an internal representation ofutility with respect to its belief distribution over the opposition actions (see Figure3.6). For this internal representation of utility we chose a weighted sum of thefinal layer of the hidden layers, ∑i wiHL,i, because each HL,i is already some non-linear transformation of the original payoff matrix, and so this allows the model toexpress utility as a transformation of the original payoffs. Given the matrix thatresults from this sum, we can compute expected utility with respect to the vec-tor of beliefs about the opposition’s choice of actions, ar(c)j , by simply taking thedot product of the weighted sum and beliefs. When we iterate this process of re-sponding to beliefs about one’s opposition more than once, higher-level playerswill respond to beliefs, ari, for all i less than their level and then output a weightedcombination of these responses using some weights, vl,i. Putting this together, the19lth action response layer for the row player (r) is defined asar(r)l = softmax(λl(l−1∑j=0v(r)l, j(k∑i=1w(r)l,i H(r)L,i)·ar(c)j)), ar(r)l ∈ ∆m, l ∈ {1, ...,K},where l indexes the action response layer, λl is a scalar sharpness parameter thatallows us to sharpen the resulting distribution, w(r)l,i and v(r)l, j are scalar weights,HL,i are the row player’s k hidden units from the final hidden layer L, ar(c)j is theoutput of the column player’s jth action response layer, and K is the total numberof action response layers. We constrain w(r)li and v(r)l j to the simplex and use λlto sharpen the output distribution so that we can optimize the sharpness of thedistribution and relative weighting of its terms independently. We build up thecolumn player’s action response layer, ar(c)l , similarly, using the column player’sinternal utility representation, H(c)L,i , responding to the row player’s action responselayers, ar(r)l . These layers are not used in the final output directly but are reliedupon by subsequent action response layers of the row player.3.3 OutputOur model’s final output is a weighted sum of the outputs of the row player’s actionresponse layers. This output needs to be a valid distribution over actions. Becauseeach of the action response layers also outputs a distribution over actions, we canachieve this requirement by constraining these weights to the simplex, thereby en-suring that the output is just a mixture of distributions. The model’s output is thusy = ∑Kj=1 w jar(r)j , where y and ar(r)j ∈ ∆m, and w j ∈ ∆K .3.4 Relation to Existing Deep ModelsOur model’s functional form has interesting connections with existing deep modelarchitectures. We discuss two of these connections here. First, our invariance-preserving hidden layers can be encoded as MLP Convolution Layers described inLin et al. [18] with the two-channel 1×1 input xi, j corresponding to the two play-ers’ respective payoffs when actions i and j are played (using patches larger than1×1 would imply the assumption that local structure is important, which is inap-20propriate in our domain; thus, we do not need multiple mlpconv layers). Second,our pooling units are superficially similar to the pooling units used in convolutionalnetworks. However, ours differ both in functional form and purpose: we use pool-ing as a way of sharing information between cells in the matrices that are processedthrough our network by taking maximums across entire rows or columns, while incomputer vision, max-pooling units are used to produce invariance to small trans-lations of the input image by taking maximums in a small local neighborhood.21Chapter 4Representational Generality ofOur ArchitectureOur work aims to extend existing models in behavioral game theory via deep learn-ing, not to propose an orthogonal approach. Thus, we must demonstrate that ourrepresentation is rich enough to capture models and features that have provenimportant in that literature. In this section we make explicit the connection be-tween our model and popular models from the behavioral game theory literatureby demonstrating how our architecture is able to express these models. At a highlevel, we express behavioral models using appropriately parameterized the actionresponse layers and we express non-strategic features using the invariance preserv-ing hidden layers with pooling units.4.1 Behavioral ModelsThe four behavioral models we consider are quantal cognitive hierarchy (QCH)[30, 36] and quantal level-k (QLk) [30], cognitive hierarchy (CH) [3] and level-k(Lk) [8]. They differ in their behavioral assumptions, but they are similar in theirmathematical descriptions. All involve some notion of a response (in the form ofa strategy or distribution over one’s own actions) to beliefs about one’s oppositionstrategy (in the form of a distribution over one’s opposition’s actions).22Responses We can divide these four models into two distinct classes: either theyassume players best respond or quantally respond. Players best respond by select-ing an action from the set of actions that maximizes their expected utility giventheir beliefs. Alternatively, they quantally respond by choosing actions with prob-ability proportional to the action’s expected utility.This is modeled formally as follows: let u¯i(s) = ∑nj=1 ui, js j denote a player’sexpected utility, given beliefs, s, about the opposition actions. A quantal responsestrategy is defined assi(sj) =exp(λ u¯i(s))∑mi=1 exp(λ u¯i(s)).Quantal response approaches best response as λ → ∞ in the sense that it defines astrategy where players uniformly randomize from the set of best responses1. Alter-natively, if λ → 0, players ignore their payoffs and uniformly randomize over theirset of actions.Beliefs The four models can also be categorized based on how they define beliefsabout one’s opposition, s j. All of the models rely on a notion of a “cognitive level”that differs among players. However, while Level-k models assume that a playerat cognitive level k only has beliefs about level (k−1) players, cognitive hierarchymodels assume that players respond to beliefs about the full distribution of playershaving cognitive level less than their own.We now show the connection between our neural network-based approach andthese behavioral models. Recall that action response layer l is defined asar(r)l = softmax(λl(l−1∑j=0v(r)l, j(k∑i=1w(r)l,i H(r)L,i)·ar(c)j)), ar(r)l ∈ ∆m, l ∈ {1, ...,K},and action response layer 0 is defined as a weighted sum of features, ar0 =∑ki=1 wifi.The behavioral models do not depend on transformed versions of the input1The claim that we can represent best response by letting parameters tend to infinity may appeardubious given that the models are optimized numerically. However because the exp(x) functionsaturates quickly using floating point numbers, in practice λ only needs to be moderately large tooutput a best response.23matrices or behavioral features, so we let the parameters of the network be set suchthatar(r)l = softmax(λl(l−1∑j=0v(r)l, j U(r) ·ar(c)j)), ar(r)l ∈ ∆m, l ∈ {1, ...,K},and let ar0,i = 1m for all i ∈ {1, . . . ,m}.4.1.1 Level-kThe Level-k model associates every player in the population with a particular cog-nitive level corresponding to the number of steps of strategic reasoning they com-plete (bounded by some fixed maximum level), and assumes that level-k playersbest respond to the strategy played by level-(k− 1) players and that level-0 play-ers select actions uniformly at random. Each level also has some probability εk ofmaking an “error” by selecting some action other than their best response.We model this with the action response layers, by settingv(r)l,i =1− εi if i = l−1εi if i = 00 otherwise.and letting λl → ∞ in order to simulate best response.4.1.2 Cognitive HierarchyCognitive Hierarchy is similar to level-k except it assumes a distribution over thelevels a player may take, and assumes they best respond without error to the nor-malized distribution of players below them.That is, there is some level distribution p where pi is the proportion of playersin the population who behave according to a particular level and players respondto a normalized distribution p[0:i−1].We model this with the action response layers, by settingv(r)l,i =pi∑i−1j=0 p j24and letting λl → ∞ in order to simulate best response.4.1.3 Quantal Level-kQuantal Level-k differs from the level-k model described above by allowing playersto quantally respond with each level having a different precision parameter λl . Theparameters remain as described in Section 4.1.1 except that we use λl instead ofletting λ → ∞4.1.4 Quantal Cognitive HierarchyQuantal Cognitive Hierarchy [30, 36] generalists cognitive hierarchy by allowingplayers to quantally respond and optimism the parameter λ . Similarly to the above,the parameters of our action response layers remain the same as in Section 4.1.2except we use λl = λ instead of letting λl → ∞.4.2 Game Theoretic FeaturesWright and Leyton-Brown [36] showed the importance of explicitly modeling non-strategic level-0 players. They investigated a wide range of models and found thatweighted linear combinations of nonstrategic features were most effective for im-proving predictive performance. In this section we argue that our model is suffi-ciently flexible to represent all the behavioral features used in their best performingmodels which allows us to generalize the quantal cognitive hierarchy with weightedlinear features model presented in Wright and Leyton-Brown [36]. We make thisclaim formally below.Claim: A network with two hidden layers, one hidden unit per layer, poolingunits at every layer and rectified linear unit activation functions can represent eachof the following normalized features• min max regret,• min min unfairness,• max min payoff,25• max max payoff,• max max efficiency.The proof of this claim is laborious but straight forward because it simplyamounts to selecting appropriate parameters to represent each features. It is givenin Section A.1 in the Appendix. The key idea is that all of the above functionscan be represented using a composition of max, min and absolute value operations.The pooling layers allow us to represent max and min functions, we can representthe absolute value function using a pair of rectified linear unit activations, and thenetwork allows function composition by design.26Chapter 5Experiments5.1 Experimental SetupWe used a dataset that combined observations from 9 human-subject experimentalstudies conducted by behavioural economists in which subjects were paid to selectactions in normal-form games. Their payment depended on the subject’s actionsand the actions of their unseen opposition who chose an action simultaneously.The subjects were shown the payment for each outcome using a payoff matrix thatlists each pair of actions and the respective player’s payments (see Figure 1.1 inChapter 1 for an example). Each experiment presented subjects with between 8and 20 different games and the number of subjects who selected each action isrecorded. Obtained outcomes were only shown at the end of each experiment toprevent learning effects. The games range in size from 2× 2 to 121× 121; in themajority, players have three actions each.Our model takes such payoff matrices as input and predicts the observed fre-quency with which each action was selected by the row player. We encode thepayoffs of each normal form game as a pair of utility matrices corresponding to therespective players’ payoffs, normalised such that the standard deviation of the pay-offs is approximately 1. For symmetric games we combine observations as thoughboth players were the row player and for asymmetric games we treat observationsof the column player’s choice of actions as though they had come from a game witha transposed payoff matrix, such that they become the row player. A few games ap-27Table 5.1: Our datasets. Each experiment had subjects play between 8 and 20games for a total of 128 games, 113 of which were unique.Source Games nStahl and Wilson [30] 10 400Stahl and Wilson [31] 12 576Costa-Gomes et al. [7] 18 1296Goeree and Holt [11] 10 500Cooper and Van Huyck [6] 8 2992Rogers et al. [23] 17 1210Haruvy et al. [15] 15 869Haruvy and Stahl [14] 20 2940Stahl and Haruvy [29] 18 1288All9 113 unique 12071pear in multiple experiments; we combined their observed frequencies into a singlecommon game.We evaluate performance of the models using 10-fold cross-validation. Werandomly partitioned our 113 unique games into 10 folds containing between 11and 15 games each. Our experimental results examine both the mean and varianceacross 10 different such 10-fold cross-validations, each time re-randomising theassignment of games into folds (requiring each model to be retrained 100 times).We are interested in the model’s ability to predict the distribution over the rowplayer’s action, rather than just its accuracy in predicting the most likely action. Asa result, we fit models to maximise the likelihood of training data P(D |θ) (whereθ are the parameters of the model andD is our dataset) and evaluate them in termsof negative log-likelihood on the test set.5.2 Training DetailsAll the models presented in the experimental section were optimized using Adam[16] with an initial learning rate of 0.0002, β1 = 0.9, β2 = 0.999 and ε = 10−8.The models were all regularised using Dropout with probability 0.2 and L1 regu-larization with parameter = 0.01. They were all trained until there was no trainingset improvement up to a maximum of 25 000 epochs and the parameters from the28iteration with the best training set performance was returned.Our architecture imposes simplex constraints on the mixture weight parame-ters. Fortunately, simplex constraints fall within the class of simple constraintsthat can be efficiently optimized using the projected gradient algorithm first pro-posed by Goldstein [12]. The projected gradient algorithm modifies standard gra-dient decent by projecting the relevant parameters onto the constraint set after eachgradient update.15.3 Experimental ResultsFigure 5.1 shows a performance comparison between a model built using our deeplearning architecture with only a single action response layer (i.e. no iterative rea-soning; details below) and the previous state of the art, quantal cognitive hierarchy(QCH) with hand-crafted features (shown as a blue line); for reference we also in-clude the best feature-free model, QCH with a uniform model of level-0 behaviour(shown as a pink line). We refer to an instantiation of our model with L hiddenlayers and K action response layers as an N +K layer network. All instantiationsof our model with 3 or more layers significantly improved on both alternatives andthus represents a new state of the art. Notably, the magnitude of the improvementwas considerably larger than that of adding hand-crafted features to the originalQCH model.Figure 5.1 considers the effect of varying the number of hidden units and lay-ers on performance using a single action response layer. Perhaps unsurprisingly,we found that a two layer network with only a single hidden layer of 50 unitsperformed poorly on both training and test data. Adding a second hidden layerresulted in test set performance that improved on the previous state of the art. Forthese three layer networks (denoted (20, 20), (50, 50) and (100, 100)), performanceimproved with more units per layer, but there were diminishing returns to increas-ing the number of units per layer beyond 50. The four-layer networks (denoted(50, 50, 50) and (100, 100, 100)) offered further improvements in training set per-formance but test set performance diminished as the networks were able to overfit1We adapted Schmidt et al. [25]’s implementation of the simplex projection operation to optimiseour model.2950 20, 20 50, 50 100, 100 50, 50, 50 100, 100, 100940960980100010201040NLL(TestLoss)50 20, 20 50, 50 100, 100 50, 50, 50 100, 100, 100Model Variations (# hidden units)75008000850090009500NLL(TrainingLoss)Figure 5.1: Negative Log Likelihood Performance. The error bars represent95% confidence intervals across 10 rounds of 10-fold cross-validation.We compare various models built using our architecture to a quantalcognitive hierarchy (QCH) model with uniform level-0 behavior (pinkline) and QCH with a weighted linear model of level-0 behaviour (blueline).the data.To test the effect of pooling units on performance, in Figure 5.2 we first re-moved the pooling units from two of the network configurations, keeping the restof the hyperparameters unchanged. The models that did not use pooling layersunderfit on the training data and performed very poorly on the test set. While wewere able to improve their performance by turning off dropout, these unregularisednetworks didn’t match the training set performance of the corresponding networkconfigurations that had pooling units. The test set performance for all the networkswe trained without pooling units remained significantly worse than our best per-forming networks that used pooling units. Thus, our final network contained twolayers of 50 hidden units and pooling units and used dropout.3050, 50(no pooling)50, 50 (no poolingor dropout)50, 50(pooling)100, 100, 100(no pooling)100, 100, 100 (nopooling or dropout)100, 100, 100(pooling)940960980100010201040NLL(TestLoss)50, 50(no pooling)50, 50 (no poolingor dropout)50, 50(pooling)100, 100, 100(no pooling)100, 100, 100 (nopooling or dropout)100, 100, 100(pooling)Pooling Comparison (# units)75008000850090009500NLL(TrainingLoss)Figure 5.2: Performance comparison with and without pooling units. All fourmodels were fit with the same hyperparameters using dropout unlessotherwise stated, with the only difference being the number of layersand hidden units and whether or not the models used pooling units.Our next set of experiments committed to this configuration for feature layersand investigated configurations of action-response layers, varying their number be-tween one and four (i.e., from no iterative reasoning up to three levels of iterativereasoning; see Figure 5.3 ). The networks with more than one action-response layershowed signs of overfitting: performance on the training set improved steadily aswe added AR layers but test set performance suffered. Thus, our final networkused only one action-response layer. We nevertheless remain committed to an ar-chitecture that can capture iterative strategic reasoning; we intend to investigatemore effective methods of regularising the parameters of action-response layers infuture work.311 2 3 4940960980100010201040NLL(TestLoss)1 2 3 4Action Response (# layers)75008000850090009500NLL(TrainingLoss)Figure 5.3: Performance comparison as we vary the number of action re-sponse layers. The AR layers appear to result in overfitting in the net-work as increasing the number of AR layers reduces the average trainingset loss while increasing the average test set loss.5.4 Regular Neural Network PerformanceFigure 5.4 compares the performance of our architecture with that of a regularfeed-forward neural network, with and without data augmentation, and the previ-ous state-of-the-art model on this dataset. It shows that the feed-forward networkdramatically overfitted the data without data augmentation. Data augmentationimproved test set performance, but it was still unable to match state of the art per-formance. A three layer instantiation of our model (two layers of 50 hidden unitsand a single AR layer) matched the previous state of the art but failed to improveupon it. We suspect that this may be because the subset of the data that containsonly 3×3 games is too small to take advantage of the flexibility of our model.32FFNet FFNet (Permuted) (50, 50) QCH Linear4500550600650700750800NLL(TestLoss)FFNet FFNet (Permuted) (50, 50) QCH Linear44600470048004900500051005200NLL(TrainingLoss)Figure 5.4: Performance comparison on 3×3 games of a feed-forward neuralnetwork (FFNet), a feed-forward neural network with data augmenta-tion at every epoch (FFNet (Permuted)), our architecture with two hid-den layers of 50 units each (labeled (50, 50)), and quantal cognitivehierarchy with four hand-crafted features (QCH Linear4).33Chapter 6Discussion and ConclusionsTo design systems that efficiently interact with human players, we need an accu-rate model of boundedly rational behavior. This thesis presented an architecture forlearning such models that both significantly improves upon state-of-the-art perfor-mance without needing hand-tuned features developed by domain experts. Inter-estingly, while the full architecture can include action response layers to explicitlyincorporate the iterative reasoning process modeled by Level-k-style models, ourbest-performing model did not need them to achieve set a new performance bench-mark. This indicates that the model is performing the mapping from payoffs todistributions over actions in a manner that is substantially different from previoussuccessful models.The model presented in this work can represent a large class of nonstrategicgame-theoretic features but it is not universal. For example, we can’t model behav-ior where players are attracted to particular actions that are salient because of theirlocation in the payoff matrix or if they avoid actions because they are dominated bysome other action. We would like to remove this limitation in future work. Somenatural future directions, are to extend our architecture beyond two-player, unre-peated games to games with more than two players, as well as to richer interactionenvironments, such as games in which the same players interact repeatedly andgames of imperfect information.34Bibliography[1] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A reviewand new perspectives. IEEE Transactions on Pattern Analysis and MachineIntelligence, 35(8), 2013. → pages 2[2] C. Camerer. Behavioral game theory: Experiments in strategic interaction.Princeton University Press, 2003. → pages 2, 4[3] C. Camerer, T. Ho, and J. Chong. A cognitive hierarchy model of games.Quarterly Journal of Economics, 119(3), 2004. → pages 6, 22[4] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. Theloss surface of multilayer networks. In AISTATS 2015, Proceedings of theEighteenth International Conference on Artificial Intelligence and Statistics.2015. → pages 8[5] C. Clark and A. J. Storkey. Training deep convolutional neural networks toplay go. In Proceedings of the 32nd International Conference on MachineLearning, ICML 2015, 2015. → pages 2, 9, 10[6] D. Cooper and J. Van Huyck. Evidence on the equivalence of the strategicand extensive form representation of games. Journal of Economic Theory,110(2), 2003. → pages 28[7] M. Costa-Gomes, V. Crawford, and B. Broseta. Cognition and behavior innormal-form games: an experimental study. Discussion paper, UCSD, 1998.→ pages 28[8] M. Costa-Gomes, V. Crawford, and B. Broseta. Cognition and behavior innormal-form games: An experimental study. Econometrica, 69(5), 2001. →pages 6, 22[9] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio.Identifying and attacking the saddle point problem in high-dimensional35non-convex optimization. In Z. Ghahramani, M. Welling, C. Cortes,N. Lawrence, and K. Weinberger, editors, Advances in Neural InformationProcessing Systems 27. Curran Associates, Inc., 2014. → pages 8[10] B. Edelman, M. Ostrovsky, and M. Schwarz. Internet advertising and thegeneralized second-price auction: Selling billions of dollars worth ofkeywords. The American Economic Review, 97(1), 2007. → pages 1[11] J. K. Goeree and C. A. Holt. Ten little treasures of game theory and tenintuitive contradictions. The American Economic Review, 91(5), 2001. →pages 28[12] A. A. Goldstein. Convex programming in hilbert space. Bull. Amer. Math.Soc., 70(5), 09 1964. → pages 29[13] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. Book inpreparation for MIT Press. Last accessed 01-10-2016, 2016. URLhttp://www.deeplearningbook.org. → pages 6, 7[14] E. Haruvy and D. Stahl. Equilibrium selection and bounded rationality insymmetric normal-form games. JEBO, 62(1), 2007. → pages 28[15] E. Haruvy, D. Stahl, and P. Wilson. Modeling and testing for heterogeneityin observed strategic behavior. Review of Economics and Statistics, 83(1),2001. → pages 28[16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. InThe International Conference on Learning Representations (ICLR), 2015.→ pages 28[17] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015. → pages7[18] M. Lin, Q. Chen, and S. Yan. Network in network. In InternationalConference on Learning Representations, volume abs/1312.4400. 2014. →pages 10, 20[19] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks forsemantic segmentation. In The IEEE Conference on Computer Vision andPattern Recognition (CVPR), June 2015. → pages 9[20] R. McKelvey and T. Palfrey. Quantal response equilibria for normal formgames. GEB, 10(1), 1995. → pages 536[21] P. Milgrom and I. Segal. Deferred-acceptance auctions and radio spectrumreallocation. In Proceedings of the Fifteenth ACM Conference on Economicsand Computation. ACM, 2014. → pages 1[22] D. C. Parkes and M. P. Wellman. Economic reasoning and artificialintelligence. Science, 349(6245), 2015. → pages 1[23] B. W. Rogers, T. R. Palfrey, and C. F. Camerer. Heterogeneous quantalresponse equilibrium and cognitive hierarchies. Journal of EconomicTheory, 144(4), July 2009. → pages 28[24] J. Schmidhuber. Deep learning in neural networks: An overview. NeuralNetworks, 2015. → pages 7[25] M. W. Schmidt, E. Berg, M. P. Friedlander, and K. P. Murphy. Optimizingcostly functions with simple constraints: A limited-memory projectedquasi-newton algorithm. In Proceedings of the Twelfth InternationalConference on Artificial Intelligence and Statistics (AISTATS-09), volume 5,page None. Journal of Machine Learning Research - Proceedings Track,2009. → pages 29[26] Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic,Game-theoretic, and Logical Foundations. Cambridge University Press,2008. → pages 1[27] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van denDriessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot,S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap,M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering thegame of go with deep neural networks and tree search. Nature, 529:484–503, 2016. → pages 10[28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov.Dropout: A simple way to prevent neural networks from overfitting. J.Mach. Learn. Res., 15(1), Jan. 2014. → pages 9[29] D. Stahl and E. Haruvy. Level-n bounded rationality and dominatedstrategies in normal-form games. JEBO, 66(2), 2008. → pages 28[30] D. Stahl and P. Wilson. Experimental evidence on players’ models of otherplayers. JEBO, 25(3), 1994. → pages 6, 22, 25, 28[31] D. Stahl and P. Wilson. On players’ models of other players: Theory andexperimental evidence. GEB, 10(1), 1995. → pages 2837[32] M. Tambe. Security and Game Theory: Algorithms, Deployed Systems,Lessons Learned. Cambridge University Press, New York, NY, USA, 1stedition, 2011. → pages 1[33] H. R. Varian. Position auctions. International Journal of IndustrialOrganization, 25, December 2007. → pages 1[34] J. R. Wright and K. Leyton-Brown. Beyond equilibrium: Predicting humanbehavior in normal-form games. In AAAI. AAAI Press, 2010. → pages 2, 4[35] J. R. Wright and K. Leyton-Brown. Behavioral game-theoretic models: ABayesian framework for parameter analysis. In Proceedings of the 11thInternational Conference on Autonomous Agents and Multiagent Systems(AAMAS-2012), volume 2, pages 921–928, 2012. → pages 6, 12[36] J. R. Wright and K. Leyton-Brown. Level-0 meta-models for predictinghuman behavior in games. In Proceedings of the Fifteenth ACM Conferenceon Economics and Computation, pages 857–874, 2014. → pages 2, 3, 5, 6,12, 22, 25, 40[37] J. R. Wright and K. Leyton-Brown. Evaluating, understanding, andimproving behavioral game theory models for predicting human behavior inunrepeated normal-form games. CoRR, abs/1306.0918, 2016. URLhttp://arxiv.org/abs/1306.0918. Last accessed 07-10-2016. → pages 4[38] R. Yang, C. Kiekintvled, F. Ordonez, M. Tambe, and R. John. Improvingresource allocation strategies against human adversaries in security games:An extended study. Artificial Intelligence Journal (AIJ), 2013. → pages 138Appendix ASupporting MaterialsA.1 Representing behavioural featuresClaim: A network with two hidden layers, one hidden unit per layer, pooling unitsat every layer and rectified linear unit activation functions can represent each of thefollowing normalised features,• min max regret,• min min unfairness,• max min payoff,• max max payoff• max max efficiency.Proof By expanding the sums from the definition of the network, we see thefirst hidden layer has the following functional form:H(1,1)= relu(w1,rU(r)+w1,cU(c)+w1,rcU(r)c +w1,rrU(r)r +w1,ccU(c)c +w1,crU(c)r +b1,1).where U(r) is the row player’s payoff matrix and U(r)c is the row player’s payoffmatrix aggregated using the column-preserving pooling unit where we use the max39function to perform the aggregation. Similarly, the second hidden layer can bewritten as,H(2,1) = relu(w2,1H(1,1)+w2,cH1,1c +w2,rH(1,1)r +b2,1).We denote H(1,1) as the output of the first hidden layer and H(1,1)c and H(1,1)r are itsrespective pooled outputs.Game theoretic features can be interpreted as outputting a strategy (a distribu-tion over a player’s actions) given a description over the game. We express featuresin a style similar to [36] by outputting a vector f such that fi ≈ 0 for all fi ∈ f if ac-tion i does not correspond to the target feature, and fi ≈ 1l where l is the number ofactions that correspond to the target feature (with l = 1 if the actions uniquely sat-isfies the feature; Wright and Leyton-Brown [36] instead used a binary encoding,but that does not fit naturally into our framework). We have approximate equal-ity, ≈, because we construct the features using a softmax function and hence ouroutput approaches fi = 0 or 1l as our parameters → ∞. Because features are allconstructed from a sparse subset of the parameters, we limit notational complexityby letting wi, j = 0 and bi, j = 0 for all i, j ∈ 1,2,r,c unless stated otherwise.A.1.1 Max Max PayoffRequired: f maxmax(i)=1l if i ∈ argmaxi∈{1,...,m}max j∈{1,...,n} ui, j ui, j ∈ U(r)0 otherwiseLet w1,r = 1,w2,r = c where c is some large positive constant and b1,1 = b whereis some scalar b ≥ mini, j U(r)i, j and all other parameters wi, j,bi, j = 0. Then H(1,1)reduces to,H(1,1) = relu(U(r)+b) = U(r)+b since U(r)+b≥ 0 by definition of bH(2,1) = relu(cH(1,1)r )⇒ h j,k = c(maxku j,k +b) ∀u j,k ∈ U(r), h j,k ∈H(2,1)That is, all the elements in each row of H(2,1) equal an positive affine transformation40of the maximum element from the corresponding row in U(r).f(1)i = softmax(n∑k=1h j,k)= softmax(n∑k=1c(maxku j,k +b))= softmax(nc(maxku j,k +b))Therefore, as c→ ∞, f(1)i → f maxmax(i) as required.A.1.2 Max Min PayoffRequired: f maxmin(i)=1l if i ∈ argmaxi∈{1,...,m}min j∈{1,...,n} ui, j ui, j ∈ U(r)0 otherwiseMax Min Payoff is derived similarly to Max Max except with w1,r = −1, andb1,1 = b where b≥maxi, j U(r)i, j ; we keep w2,r = c as some large positive constant.Then H(1,1) reduces to,H(1,1) = relu(−U(r)+b) =−U(r)+b since −U(r)+b≥ 0 by definition of bH(2,1)= relu(cH(1,1)r )⇒ h j,k = c(maxk(−u j,k+b))= c(minku j,k+b) ∀u j,k ∈U(r), h j,k ∈H(2,1)Since maxi−xi+b = mini xi+b. Thus,f(1)i = softmax(n∑k=1h j,k) = softmax(nc(minku j,k +b))Therefore, as c→ ∞, f(1)i → f maxmin(i) as required.A.1.3 Max Max EfficiencyRequired: f max max efficiency(i)=1l if i ∈ argmaxi∈{1,...,m}max j∈{1,...,n} u(c)i, j +u(r)i, j0 otherwiseMax Max Efficiency follow from the derivation of Max Max except with w1,r =1,w1,c = 1,w2,r = c and b1,1 = b where b≥mini, j(U(r)i, j +U(r)i, j ).41Following the same steps we get,f(1)i = softmax(n∑k=1h j,k) = softmax(n∑k=1c(maxk(u(r)+u(c)) j,k +b))= softmax(nc(maxk(u(r)+u(c)) j,k +b))= f max max efficiency(i) as c→ ∞A.1.4 Minimax RegretRegret is defined as r(i, j) = maxi ui, j−ui, j ui, j ∈ U(r)Required: f minimax regret(i)=1l if i ∈ argmini∈{1,...,m}max j∈{1,...,n} r(i, j)0 otherwiseLet w1,rc = 1, w1,r =−1, and b1,1 = 0; we keep w2,r = c as some large positiveconstant.Then H(1,1) reduces to,H(1,1) = relu(U(r)c −U(r)) = U(r)c −U(r) since U(r)c ≥ U(r) by definition of U(r)cH(2,1)= relu(cH(1,1)r )⇒ h j,k = c(maxk(maxju j,k−u j,k)) ∀u j,k ∈U(r), h j,k ∈H(2,1)Thus,f(1)i = softmax(n∑k=1h j,k) = softmax(nc(maxk(maxju j,k−u j,k)))Therefore, as c→ ∞, f(1)i → f minimax regret(i) as required.A.1.5 Min Min UnfairnessRequired: f min min unfairness(i)=1l if i ∈ argmaxi∈{1,...,m}min j∈{1,...,n} |u(r)i, j −u(c)i, j |0 otherwiseTo represent Min Min Unfairness, we an additional hidden unit in the first layer.42Let H(1,2) be defined in the same manner as H(1,1).For H(1,1), we let w11,r = 1,w11,c =−1 and b1,1 = 0 such that,H(1,1)= relu(U(r)−U(c))=max(U(r)−U(c),0) where the max is applied element-wiseFor H(1,2), we let w11,r =−1,w11,c =− and b1,1 = 0 such that,H(1,2) = relu(U(c)−U(r)) = max(U(c)−U(r),0)Now, notice that if w2,1 = 1 and w2,2 = 1,H(2,1) =H(1,1)+H(1,2) =max(U(r)−U(c),0)+max(U(c)−U(r),0) = |U(r)−U(c)|Which gives us a measure of “unfairness” as the absolute difference between thetwo payoffs.We can therefore simulate f min min unfairness(i) by letting w2,1 = −1 and w2,2 =−1, and using the output of H(2,1)r (which gives us min unfairness), then construct-ing fi by letting c→ ∞.43
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Deep learning for predicting human strategic behavior
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Deep learning for predicting human strategic behavior Hartford, Jason Siyanda 2016
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Deep learning for predicting human strategic behavior |
Creator |
Hartford, Jason Siyanda |
Publisher | University of British Columbia |
Date Issued | 2016 |
Description | Predicting the behavior of human participants in strategic settings is an important problem for applications that rely on game theoretic reasoning to design mechanisms or allocate resources. Most existing work either assumes that participants are perfectly rational, or attempts to directly model each participant's cognitive processes based on insights from cognitive psychology and experimental economics. In this work, we present an alternative, a deep learning-based approach that automatically performs cognitive modeling without relying on such expert knowledge. We introduce a novel architecture that allows a single network to generalize across normal form games with varying numbers of actions. We show that the architecture generalists the most successful existing models and that its performance significantly improves upon that of the previous state of the art, which relies on expert-constructed features. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2016-10-25 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-ShareAlike 4.0 International |
DOI | 10.14288/1.0319323 |
URI | http://hdl.handle.net/2429/59559 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2016-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-sa/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2016_november_hartford_jason.pdf [ 838.49kB ]
- Metadata
- JSON: 24-1.0319323.json
- JSON-LD: 24-1.0319323-ld.json
- RDF/XML (Pretty): 24-1.0319323-rdf.xml
- RDF/JSON: 24-1.0319323-rdf.json
- Turtle: 24-1.0319323-turtle.txt
- N-Triples: 24-1.0319323-rdf-ntriples.txt
- Original Record: 24-1.0319323-source.json
- Full Text
- 24-1.0319323-fulltext.txt
- Citation
- 24-1.0319323.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0319323/manifest