UBC Theses and Dissertations
Team LSTM: player trajectory prediction in basketball games using graph-based LSTM networks. Cohan, Setareh. 2020.


Team LSTM: Player Trajectory Prediction in Basketball Games using Graph-based LSTM Networks

by Setareh Cohan

BSc, Sharif University of Technology, 2017

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

January 2020

© Setareh Cohan, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Team LSTM: Player Trajectory Prediction in Basketball Games using Graph-based LSTM Networks

submitted by Setareh Cohan in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Examining Committee:

James J. Little, Computer Science (Co-supervisor)
Leonid Sigal, Computer Science (Co-supervisor)
Helge Rhodin, Computer Science (Additional Examiner)

Abstract

Autonomous systems deployed in human environments must be able to understand and anticipate the motion and behavior of dynamic targets. More specifically, predicting the future positions of agents and planning future actions based on these predictions is a key component of such systems. This is a challenging task because the motion behavior of each agent depends not only on its own goal intent, but also on the presence and actions of surrounding agents, social relations between agents, social rules and conventions, and environment characteristics such as topology and geometry.

We are especially interested in the problem of human motion trajectory prediction in real-world, social environments where potential interactions affect the way people move. One such environment is a basketball game, with dynamic and complex movements driven by various social interactions. In this work, we focus on player motion trajectory prediction in real basketball games.
We view the problem of trajectory prediction as a sequence prediction task where our goal is to predict the future positions of players using their past positions. Following the success of recurrent neural network models on sequence prediction tasks, we investigate the ability of these models to predict the motion trajectories of players. More specifically, we propose a graph-based pooling procedure that uses relation networks and incorporate it into long short-term memory networks. We study the effect of different graph structures on the accuracy of predictions.

We evaluate the different variations of our model on three datasets: the two publicly available pedestrian datasets ETH and UCY, as well as a real-world basketball dataset. Our model outperforms the vanilla LSTM and Social-LSTM baselines on both types of data.

Lay Summary

Motion trajectory prediction is an important task which aims to predict the future positions of agents given an observed sequence of their positions. This task has many application domains such as service robots, self-driving vehicles, and advanced surveillance systems. One of the major challenges for this task is the complexity of human behavior and the diversity of factors affecting the movements of agents, such as their own goal intent, interactions with surrounding agents, social rules, and environment characteristics. In this thesis, we look at the problem of player trajectory prediction in basketball games, a complex and dynamic environment. We propose a model that jointly predicts player trajectories while considering the interactions that take place in the game, including player-player and player-ball interactions. We evaluate the effectiveness of our model on two sets of data and outperform baseline models.

Preface

The entire work presented here is original work done by the author, Setareh Cohan, under the supervision of James J. Little and Leonid Sigal.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
2 Related Work
3 Background
  3.1 Neural Networks
  3.2 Activation Function
  3.3 Multi-Layer Perceptron
  3.4 Recurrent Neural Networks
  3.5 Long Short-Term Memory
4 Method
  4.1 Problem Formulation
  4.2 Social LSTM
    4.2.1 Position Prediction
  4.3 Approach
    4.3.1 Graph-based Pooling
    4.3.2 Position Prediction
    4.3.3 Inference for Trajectory Prediction
5 Experiments
  5.1 Datasets
    5.1.1 Basketball Data
    5.1.2 Pedestrian Data
  5.2 Metrics
  5.3 Results
6 Conclusions
Bibliography

List of Tables

Table 5.1: Quantitative results of methods across pedestrian datasets. We report two error metrics, Average Displacement Error (ADE) and Final Displacement Error (FDE), for the unobserved part of the trajectories, in meters. Our method consistently outperforms the S-LSTM method and is especially good for long-term predictions (lower is better).

Table 5.2: Quantitative results of methods on the basketball dataset. We report the two error metrics, Average Displacement Error (ADE) and Final Displacement Error (FDE), for the unobserved part of the trajectories. Our method consistently outperforms the Social-LSTM method and is especially good for long-term predictions (lower is better).

List of Figures

Figure 3.1: An artificial neuron with inputs x_1 to x_m, activation function φ, and output y. Each input x_i has a weight w_i which is learned over time from training data.

Figure 3.2: A 2-layer fully connected feed-forward neural network consisting of an input layer, a hidden layer, and an output layer.

Figure 3.3: An RNN unfolded into a full chain-structured network.

Figure 3.4: Two variations of LSTM cells. 3.4a shows an LSTM cell without peephole connections, which we call vanilla LSTM. 3.4b shows an LSTM cell with peephole connections. The concatenation operation is shown with the symbol || in these figures.

Figure 4.1: Social pooling for the person represented by a black dot.
The hidden states of the neighbors (shown in yellow, blue, and red) are pooled within a certain spatial distance.

Figure 4.2: The block diagram of our Team-LSTM model. The input embedding layer, a linear layer followed by a rectified linear unit (ReLU) non-linearity, computes the embedded input vector e_t^i. The graph-based pooling module computes the pooled hidden tensor H_t^i. It consists of g_θ and f_φ modules (shown in orange), both multi-layer perceptrons (MLPs). g_θ takes all hidden-state pairs of connected agents as input, and the summation of these vectors is fed into f_φ to compute H_t^i. The output linear layer derives the outputs given the current hidden state h_t^i.

Figure 4.3: The pooled hidden state of player 1 at time step t, H_t^1, is computed as shown.

Figure 5.1: At each time step, player and ball positions are stored as serial indices in a 400×360 grid representation of the half-court.

Figure 5.2: Average ℓ2 distances to ground-truth positions at different time steps on basketball data.

Glossary

GNN   graph neural network
LSTM  long short-term memory
MLP   multi-layer perceptron
ReLU  rectified linear unit
RNN   recurrent neural network
VRNN  variational recurrent neural network

Acknowledgments

I would like to express my gratitude to those who have helped me throughout this journey.

First, my supervisors, James J. Little and Leonid Sigal, who have guided me at every step of the way. Dr. Little, your cheerful attitude and earnest support have always been motivating. Dr. Sigal, your mentorship and invaluable feedback have made this work possible. I would like to thank my examining committee member, Helge Rhodin, who took the time to read this thesis.

Next, I would like to thank my lovely husband, Saeid, for his love and enduring support.
Next in line are my colleagues, Gursimran and Rayat, who have helped and inspired me.

Moreover, I want to thank my parents, whose unconditional love and support have encouraged me throughout my life. I am forever grateful for your tireless efforts and all the opportunities you have provided for me. Next, I would like to thank my brother, Arman, who has always been my role model. I would not be here if it was not for you.

I would like to thank my friends who make my life more enjoyable. Special thanks must go to my beloved Pouneh and Arash, who, although no longer with us, continue to inspire me to live better and happier.

Chapter 1

Introduction

Perceiving, understanding, and predicting human motion are essential for coexisting and interacting with humans. These predictions are the foundation on which we must plan our future actions and movements. For example, drivers on a road must be able to foresee the future positions of fellow road users and adjust their paths accordingly to avoid collisions while moving towards their destinations. Another example is pedestrians walking on a sidewalk, moving towards their destinations while avoiding obstacles and accommodating fellow pedestrians.

Humans have the natural ability to understand the motion of one another and make accurate predictions of these motions. While making these predictions, we consider many complicated common-sense rules, such as respecting personal space and yielding right of way. Despite being natural to humans, predicting the motion of human targets while considering such rules and social behaviors is an extremely challenging problem. Motion depends not only on one's goal, but also on the presence and actions of surrounding agents, social interactions between agents, social and common-sense rules, and environment characteristics such as topology and geometry. Most of these factors are not observable and need to be inferred from the available contextual information.
Moreover, to be effective, motion prediction needs to be accurate and to operate in real time.

Like humans, intelligent systems deployed in human environments must be able to make accurate predictions of the motion of all dynamic targets, including the humans around them. More specifically, predicting the future positions of these targets and planning future movements based on these predictions is a key component of such systems. Examples of such systems include socially-aware robots, self-driving vehicles, and advanced surveillance systems.

We are interested in the problem of human motion trajectory prediction in real-world, social environments. One such environment is a basketball game, with dynamic and complex motions where various interesting factors affect player motion. In a basketball game, player trajectories are affected by the players' personal goal intents, the goals of their teams, and the movements of surrounding players. In this work, we focus on predicting players' future motion trajectories in real basketball games.

Similar to Alahi et al. (2016), we view the problem of trajectory prediction as a sequence prediction task where the goal is to predict the future positions of players using their past positions. Following the success of recurrent neural network (RNN) models on sequence prediction tasks, we first investigate the ability of long short-term memory (LSTM) models to predict the trajectories of basketball players. Next, we incorporate game-structure information and player interactions into our LSTM-based model by proposing a graph-based pooling procedure that uses relation networks. This pooling allows the LSTMs of connected players to share their hidden-state information with each other.
This architecture, which we call "Team-LSTM", is capable of predicting the future positions of players while modeling and learning the interactions between connected players.

We evaluate our Team-LSTM model on two types of data: publicly available pedestrian trajectories (Pellegrini et al., 2009; Lerner et al., 2007), and real basketball trajectories rendered as a series of bird's-eye views of the court (Zheng et al., 2016). For the basketball dataset we consider poolings based on three different graph structures, and for the pedestrian datasets we consider one graph structure; we test all of these variations of our Team-LSTM model. Our experiments show that our Team-LSTM model outperforms the vanilla LSTM and Social-LSTM models on the pedestrian datasets. On the basketball dataset, all variations of our model achieve results comparable to both the vanilla LSTM and Social-LSTM models, and only one of the three variations outperforms the baselines.

The remainder of this thesis is organized as follows. In Chapter 2, we review related work on the topic of human behavior prediction, and human motion trajectory prediction in particular. Chapter 3 contains background information on sequence prediction with neural networks. In Chapter 4, we start by defining the problem we will be working on; we then review the Social-LSTM model in more depth; finally, we describe our graph-based pooling and our Team-LSTM model in detail. Chapter 5 contains an overview of the datasets used for evaluation; we describe the metrics used while evaluating our models and then provide the results of our experiments. Lastly, in Chapter 6, we discuss the limitations of our model and possible future directions for the work.

Chapter 2

Related Work

Human behavior prediction is an important and challenging problem. Different aspects of this problem have been studied by numerous scientific communities, such as computer vision, robotics, and self-driving vehicles.
Generally, research in forecasting human behavior can be grouped into predicting human-space interactions or human-human interactions (Gupta et al., 2018). The goal of the former is to learn scene-specific motion patterns (Ballan et al., 2016; Coscia et al., 2016; Hu et al., 2011; Kim et al., 2011; Kitani et al., 2012; Morris and Trivedi, 2011; Zhou et al., 2011). The latter, which models the dynamic content of scenes, including how people interact, is the focus of our work. In this chapter, we review some of the most closely related literature.

Human-Human Interaction

The pioneering work of Helbing and Molnar (1995) presented a human motion model called the Social Forces model. Social Forces models two types of forces: attractive forces that guide people towards their goals, and repulsive forces that encourage people to avoid collisions. This model has been shown to achieve competitive results even on modern pedestrian datasets (Lerner et al., 2007; Pellegrini et al., 2009). Social Forces was later extended to robotics (Luber et al., 2010) and activity understanding (Leal-Taixé et al., 2011, 2014; Mehran et al., 2009; Pellegrini et al., 2010; Yamaguchi et al., 2011). Tools popular in economics have also been used to model human behavior by Antonini et al. (2006). Moreover, Gaussian processes were used for modeling human motion (Tay and Laugier, 2008; Wang et al., 2007). All these methods use handcrafted energy potentials based on relative distances and scene-specific rules.
In contrast, generic data-driven approaches proposed more recently outperform these traditional methods.

RNNs for Human Motion Trajectory Prediction

RNNs are a powerful class of dynamic models that have been shown to be successful at sequence prediction in different domains such as machine translation (Chung et al., 2015), speech recognition (Chorowski et al., 2014; Chung et al., 2015; Graves and Jaitly, 2014), and image captioning (Karpathy et al., 2014; Vinyals et al., 2015; Xu et al., 2015). Following this success, Alahi et al. (2016) were the first to view the human trajectory prediction problem as a sequence prediction task and use RNNs. They show that using a single RNN per person fails to capture interactions between people, and propose a social pooling layer that enables modeling such interactions. Lee et al. (2017) propose an RNN encoder-decoder framework that exploits a variational autoencoder (VAE) to predict trajectories. Although successful, this model does not capture human-human interactions and fails in crowded, dynamic scenes. Motivated by this limitation, Gupta et al. (2018) propose the Social-GAN model, which consists of an RNN encoder-decoder generator and an RNN-based discriminator. Their model has a variety of losses which encourage it to spread its predictions and cover the space of possible paths while remaining consistent with the observed sequences. Social-GAN is also able to capture social interactions thanks to a new pooling mechanism they call global pooling, which uses a multi-layer perceptron (MLP) followed by a max pooling layer. They show that this pooling is more computationally efficient and works as well as or better than the social pooling of Alahi et al. (2016).

Graph-structured RNNs

Several papers combine standard RNNs with graph-structured representations. Chang et al. (2016) learn a physics simulator by proposing a procedure for factorizing object dynamics into pairwise interactions.
Battaglia et al. (2016) propose a model that can reason about how objects in complex systems interact. In addition, Sanchez-Gonzalez et al. (2018) introduce a new class of learnable models based on graph networks (GNs). Following these works and the introduction of stochastic RNNs, Sun et al. (2019) proposed a model called Graph-VRNN which incorporates variational recurrent neural networks (VRNNs) with graphs. This model is able to perform state estimation and future forecasting at the level of objects and relations directly from image pixels. Along a similar line, Yeh et al. (2019) proposed a model called GVRNNT for multi-agent trajectory prediction. This model incorporates graph neural networks (GNNs) and VRNNs for joint trajectory prediction suitable for sports. The main goal of using GNNs in GVRNNT is modeling permutation-equivariant interactions among players, while in the work of Sun et al. (2019) the main goal of using GNNs was learning the underlying graph structure of the game. We propose to combine RNNs with graphs for player trajectory prediction in basketball games. Our model is computationally efficient, as it only requires the position coordinates of players. Moreover, due to the use of relation networks (Santoro et al., 2017) as our graph support, our model can learn the graph structure.

Chapter 3

Background

Our task is structured as a standard sequence prediction problem, where we are interested in predicting the next values for a given input sequence. Our model is an RNN whose building blocks are LSTM units. In this chapter, we review some background material relevant to the rest of our discussion.

3.1 Neural Networks

Neural networks (NNs) are computational models that map an input to an output. Inspired by the biological neural connectivities in human and animal brains, neural networks "learn" to perform tasks by considering examples, without being specifically programmed with task-specific rules (Hinton and Salakhutdinov, 2006; McCulloch and Pitts, 1990).
For instance, neural networks are capable of learning to classify images into different groups using manually labeled data or without any prior knowledge (Krizhevsky et al., 2012). Analogous to the neurons and synapses in a biological brain, neural networks consist of a number of connected computational units called artificial neurons. An artificial neuron receives one or more inputs, where each input has a weight associated with it. This weight determines its relative importance compared to the other inputs and is adjusted as learning proceeds. Each neuron also has a non-linear function, called an activation function, through which it passes the weighted sum of its inputs. The purpose of this function is to introduce non-linearity into the otherwise linear computation of the neurons. Therefore, the output of each neuron is computed as a non-linear function of the weighted sum of its inputs. Figure 3.1 shows an example of a single neuron with m inputs x_1 to x_m, activation function φ, and output y. The output of this neuron is

    y = φ( ∑_{i=1}^{m} w_i x_i )    (3.1)

where w_i is the weight associated with input x_i.

Neurons of a neural network are typically organized into multiple layers: an input layer, multiple hidden layers, and an output layer. The input layer of a neural network is composed of input neurons that bring the initial data into the network for further processing by succeeding layers. A hidden layer is a layer between the input and output layers, where neurons take in a set of weighted inputs and produce an output through their activation function. The output layer produces the final output, and its neurons perform the same operations as the ones in the hidden layers. A neural network with multiple layers is also called a deep neural network.

Figure 3.1: An artificial neuron with inputs x_1 to x_m, activation function φ, and output y.
Each input x_i has a weight w_i which is learned over time from training data.

In typical neural networks, neurons of each layer connect only to neurons in the preceding or following layers. When the direction of these connections is from the inputs, through the hidden layers, to the outputs, the network is called a feed-forward neural network. Between two layers, multiple connection patterns are possible. They can be fully connected, with every neuron in one layer connected to every neuron in the next layer. Such networks are called fully connected neural networks, and Figure 3.2 is an example of one.

Figure 3.2: A 2-layer fully connected feed-forward neural network consisting of an input layer, a hidden layer, and an output layer.

Learning in neural networks amounts to adjusting the network parameters so that the network's performance on a task improves, given sample observations. This process, called training, involves finding connection weights of the network such that the cost function, the error between predicted outputs and ground-truth values, decreases over time. Backpropagation is the method used for training: it calculates the gradient of the cost function associated with a given output with respect to the weights. The weight updates can then be done using any optimization algorithm, such as gradient descent (GD) (LeCun et al., 2015).

3.2 Activation Function

As introduced in Section 3.1, the activation function of a neuron defines the output of that neuron given a set of input values. More specifically, to create the output of a neuron, the activation function is applied to the weighted sum of its inputs. The purpose of activation functions is to incorporate non-linearity into the computations. Without them, neural networks are essentially linear regression models, unable to perform the complex tasks known to be possible for neural networks (Hornik et al., 1989).

There are many forms of activation functions used in practice.
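As a concrete illustration, the single-neuron computation of Equation 3.1 together with the common activation functions discussed next can be sketched in NumPy (a minimal sketch; the input and weight values below are arbitrary examples, not from the thesis):

```python
import numpy as np

def sigmoid(x):
    # Eq. (3.2): maps any real input to a value in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Eq. (3.3): maps any real input to a value in (-1, 1)
    return np.tanh(x)

def relu(x):
    # Eq. (3.4): thresholds the input at zero
    return np.maximum(0.0, x)

def neuron(x, w, activation):
    # Eq. (3.1): activation applied to the weighted sum of the inputs
    return activation(np.dot(w, x))

x = np.array([0.5, -1.0, 2.0])   # arbitrary inputs x_1..x_m
w = np.array([0.2, 0.4, -0.1])   # arbitrary weights w_1..w_m
y = neuron(x, w, sigmoid)        # weighted sum is negative, so y < 0.5
```

Swapping `sigmoid` for `tanh` or `relu` changes only the output range of the neuron, not the weighted-sum structure.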
Sigmoid, hyperbolic tangent (tanh), and the rectified linear unit (ReLU) are three of the most widely used activation functions; they are described below.

- Sigmoid: a bounded and differentiable function defined for all real input values. It maps its real-valued input to a value between 0 and 1 and has a non-negative derivative at all points.

      f(x) = 1 / (1 + e^{−x})    (3.2)

- Tanh: a bounded and differentiable function defined for all real input values. It maps its real-valued input to a value between −1 and +1.

      tanh(x) = (e^{x} − e^{−x}) / (e^{x} + e^{−x})    (3.3)

- ReLU: takes a real-valued input and thresholds it at zero, replacing negative values with zero.

      f(x) = max{0, x}    (3.4)

3.3 Multi-Layer Perceptron

A multi-layer perceptron (MLP) is a special sub-class of neural networks: a feed-forward neural network with at least one hidden layer. The forward direction of connections means that connections start from the input layer and go towards the output layer. The hidden layers and the output layer of an MLP contain non-linear activation functions. Figure 3.2 is an example of a two-layer MLP (Murtagh, 1991).

3.4 Recurrent Neural Networks

A recurrent neural network (RNN) is a deep neural network where connections between neurons can form a loop. These looping connections allow the use of previous outputs in present computations. RNNs essentially consist of copies of a single network layer, each connected by a directed connection to the following copy. Each layer consists of an input neuron, a hidden neuron, and an output neuron, and is responsible for the computations at a particular time step (Cleeremans et al., 1989; Jordan, 1986; Pearlmutter, 1989).

Figure 3.3: An RNN unfolded into a full chain-structured network.

Figure 3.3 shows an RNN unfolded into a chain-structured network, where x_t, h_t, and ŷ_t denote the input, the hidden state, and the output of the RNN at time step t, respectively.
W_x, W_h, and W_y are the connection weights of each layer, which are shared across all layers of the RNN and are learned during training. At time step t, the hidden state h_t is first computed using input x_t, and then the output ŷ_t is computed using the freshly computed h_t, as below:

    h_t = φ_h(W_h h_{t−1} + W_x x_t + b_h)
    ŷ_t = φ_y(W_y h_t + b_y)    (3.5)

where φ_h is the non-linear activation function of the hidden neuron and φ_y is the non-linear activation function of the output neuron.

3.5 Long Short-Term Memory

One of the appealing features of an RNN is its capability to exploit previous information in present computations. In theory, RNNs can handle arbitrarily long-term dependencies in the input sequence. In practice, when the distance between the current time step and past relevant information is small, RNNs are successful. However, as this distance grows, RNNs become unable to learn to connect the past information. This problem, called the vanishing gradient problem, is caused by backpropagation through time when training RNNs. During backpropagation, gradients of later outputs have to propagate back to affect the computations of early outputs. As gradients pass back through many time steps, they tend to decrease exponentially, and their contributions to the computations of early steps become negligible. The vanishing gradient problem in RNNs is explored in depth by Bengio et al. (1994).

Long short-term memory (LSTM) is a special type of RNN designed to avoid the vanishing gradient problem when long-term dependencies exist. The original LSTM architecture was proposed by Hochreiter and Schmidhuber (1997) and is composed of a memory cell, an input gate, and an output gate. The cell remembers values over different time intervals, and the gates regulate the flow of information into and out of the cell.
The input gate regulates the extent to which a new input value flows into the cell, and the output gate regulates the extent to which the value in the cell is used to compute the output activation of the LSTM unit. Forget gates were added to the original LSTM structure by Gers et al. (1999). The forget gate controls the extent to which a value remains in the cell; without these gates, the cell state values of LSTMs may grow indefinitely and eventually cause the network to break down. A further modification was proposed by Gers and Schmidhuber (2000), who incorporated peephole connections from the internal cell to the gates of the same cell and omitted the output activation function. These connections help with learning the precise timing of events. The most commonly used LSTM architecture in the current literature can contain all these modifications and was proposed by Graves and Schmidhuber (2005), who introduced full backpropagation through time for LSTM networks, simplifying the training procedure and the implementation of LSTMs. Figure 3.4 shows two variants of this LSTM model: Figure 3.4a without peephole connections, which we refer to as vanilla LSTM, and Figure 3.4b with peephole connections.

Figure 3.4: Two variations of LSTM cells. 3.4a shows an LSTM cell without peephole connections, which we call vanilla LSTM. 3.4b shows an LSTM cell with peephole connections. The concatenation operation is shown with the symbol || in these figures.

The computations during a forward pass of a vanilla LSTM unit, as shown in Figure 3.4a, are as follows:

    f_t = σ(W_f [h_{t−1}, x_t] + b_f)      (forget gate)
    C̃_t = tanh(W_c [h_{t−1}, x_t] + b_c)   (candidate memory cell)
    i_t = σ(W_i [h_{t−1}, x_t] + b_i)      (input gate)
    C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t        (memory cell)
    o_t = σ(W_o [h_{t−1}, x_t] + b_o)      (output gate)
    h_t = o_t ∗ tanh(C_t)                  (hidden state)

where x_t is the input, C_t is the cell state, and h_t is the hidden state of the LSTM unit at time step t. σ denotes the sigmoid activation function, and ∗ indicates element-wise multiplication between two vectors.
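One forward step of the vanilla LSTM equations above can be sketched in NumPy (a minimal illustration, not the thesis implementation; the dimensions and random weights are arbitrary, and `[h_prev, x_t]` denotes the same concatenation as in the equations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_c, W_i, W_o, b_f, b_c, b_i, b_o):
    # [h_{t-1}, x_t]: concatenate previous hidden state and current input
    hx = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ hx + b_f)        # forget gate
    c_tilde = np.tanh(W_c @ hx + b_c)    # candidate memory cell
    i_t = sigmoid(W_i @ hx + b_i)        # input gate
    c_t = f_t * c_prev + i_t * c_tilde   # memory cell (element-wise ops)
    o_t = sigmoid(W_o @ hx + b_o)        # output gate
    h_t = o_t * np.tanh(c_t)             # hidden state
    return h_t, c_t

# Arbitrary sizes for illustration: input dimension 2, hidden dimension 3
rng = np.random.default_rng(0)
d_in, d_h = 2, 3
Ws = [rng.standard_normal((d_h, d_h + d_in)) for _ in range(4)]
bs = [np.zeros(d_h) for _ in range(4)]
h, c = np.zeros(d_h), np.zeros(d_h)
for x in [np.array([0.1, -0.2]), np.array([0.3, 0.0])]:
    h, c = lstm_step(x, h, c, *Ws, *bs)
```

Because h_t = o_t ∗ tanh(C_t) with both factors bounded in magnitude by 1, every entry of the hidden state stays strictly inside (−1, 1), while the cell state C_t itself is unbounded.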
For the LSTM structure with peephole connections of Figure 3.4b, the computations of a forward pass are the following:

    f_t = σ(W_f [C_{t−1}, h_{t−1}, x_t] + b_f)   (forget gate)
    C̃_t = tanh(W_c [h_{t−1}, x_t] + b_c)         (candidate memory cell)
    i_t = σ(W_i [C_{t−1}, h_{t−1}, x_t] + b_i)   (input gate)
    C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t              (memory cell)
    o_t = σ(W_o [C_t, h_{t−1}, x_t] + b_o)       (output gate)
    h_t = o_t ∗ tanh(C_t)                        (hidden state)

Another variation of the vanilla LSTM is the Gated Recurrent Unit (GRU), introduced by Cho et al. (2014). This architecture combines the forget gate and the input gate into a single update gate, and merges the cell state and the hidden state. The output gate is replaced by a reset gate, which specifies whether the current hidden state should ignore the previous hidden state. The resulting model has a simpler structure than the standard vanilla LSTM and is more computationally efficient.

Chapter 4

Method

The problem of human trajectory prediction is a challenging and important one. Basketball games are complex and dynamic environments where player trajectories are influenced by many interesting factors, including the goals of the teams, team structures, the roles of players, and interactions between players. The characteristics of basketball games motivate us to build a model that can predict a player's motion trajectory while incorporating game-related structures and taking the behavior of other players into account. In this chapter, we first describe the basis of our model, the Social-LSTM model. Following that, we describe our modification, a new graph-based pooling procedure that uses relation networks. An overview of our model, called Team-LSTM, can be viewed in Figure 4.2.

4.1 Problem Formulation

Following the convention of previous works, we assume that each scene is initially preprocessed to obtain the spatial x-y coordinates of all players at different time steps (Alahi et al., 2016; Gupta et al., 2018; Sun et al., 2019). At any time step t, the i-th player in the scene is represented by their x-y coordinates (x_t^i, y_t^i).
Our task is to predict the positions of all players from time steps T_obs+1 to T_pred, having observed their ground-truth positions from time step 1 to T_obs. This is essentially equivalent to a sequence prediction problem where the input sequence is the observed positions of a player and the output is a sequence identifying the future positions of this player at subsequent time steps.

4.2 Social LSTM

As discussed in Section 4.1, the problem of human motion trajectory prediction can be formulated as a sequence prediction task. Following the success of LSTM networks for learning and generalizing the properties of sequences such as handwriting (Graves, 2013) and speech (Graves and Jaitly, 2014), Alahi et al. (2016) were the first to use LSTMs for the task of human trajectory prediction. In particular, each person has their own LSTM whose input is the observed sequence of that person's positions. The output of each LSTM is the predicted positions of its corresponding person in future time steps. While this model works well for scenes with few people having little to no interaction, Alahi et al. (2016) show that it fails to capture interactions of people in crowded and interactive scenes.

Motivated by this limitation, Alahi et al. (2016) proposed the Social-LSTM model. In this model, the interactions of nearby people are incorporated by introducing a new pooling mechanism called social pooling. With social pooling, at any time step, each LSTM cell receives pooled hidden state information from the LSTM cells of its neighbors, allowing them to share motion trajectory information. Neighboring LSTMs are modeled using a grid-based pooling. The pooled social hidden state tensor of the i-th person at time step t, denoted H_t^i, is computed as:

H_t^i(m, n, :) = Σ_{j∈N_i} 1_{mn}[x_t^j − x_t^i, y_t^j − y_t^i] h_{t−1}^j        (4.1)

where h_t^i, x_t^i, and y_t^i are the hidden state and x-y position of the i-th person at time step t.
h_{t−1}^j corresponds to the hidden state of the j-th person at time step t−1, and 1_{mn}[x, y] is an indicator function determining whether (x, y) is located in the (m, n)-th cell of the grid. N_i is the set of neighbors of person i. H_t^i(m, n, :) is essentially the sum of the hidden states of all neighbors of person i that fall in the (m, n)-th grid cell. Figure 4.1 shows a visualization of this pooling.

Figure 4.1: Social pooling for the person represented by a black dot. The hidden states of the neighbors (shown in yellow, blue and red) are pooled within a certain spatial distance.

4.2.1 Position Prediction

The pooled social hidden tensor of person i at time step t is incorporated into the calculations of its LSTM cell as below:

e_t^i = φ(x_t^i, y_t^i; W_e)
a_t^i = φ(H_t^i; W_a)
h_t^i = LSTM(h_{t−1}^i, e_t^i, a_t^i; W_l)        (4.2)

where φ is an embedding function with ReLU non-linearity, W_e and W_a are the embedding weights, and W_l are the weights of the LSTM cell.

To predict the x-y position of person i at the next time step t+1, the hidden state at time step t is passed through a linear output embedding layer which returns the parameters of a bivariate Gaussian distribution: the mean μ_{t+1}^i = (μ_x, μ_y)_{t+1}^i, standard deviation σ_{t+1}^i = (σ_x, σ_y)_{t+1}^i, and correlation coefficient ρ_{t+1}^i. The predicted coordinates (x̂_{t+1}, ŷ_{t+1})^i are then sampled from this bivariate Gaussian distribution:

[μ_{t+1}^i, σ_{t+1}^i, ρ_{t+1}^i] = W_p h_t^i
(x̂_{t+1}, ŷ_{t+1})^i ∼ N(μ_{t+1}^i, σ_{t+1}^i, ρ_{t+1}^i)        (4.3)

The network is trained by minimizing the negative log-likelihood loss (NLL) of the ground-truth future positions given the predicted distribution parameters, for all trajectories in the training dataset.
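As a concrete sketch, the grid-based social pooling of equation 4.1 can be implemented as follows. The grid size and neighborhood radius below are illustrative values of our own, not the thesis settings.

```python
import numpy as np

def social_pool(i, positions, hidden, grid=4, radius=2.0):
    """Pooled social hidden tensor H_t^i of equation 4.1 (a sketch).

    positions: (N, 2) array of current x-y positions
    hidden:    (N, D) array of hidden states h_{t-1}
    A (grid x grid) spatial grid of side 2*radius is centered on
    person i; neighbours' hidden states are summed per cell.
    """
    N, D = hidden.shape
    H = np.zeros((grid, grid, D))
    cell = 2.0 * radius / grid
    for j in range(N):
        if j == i:
            continue
        dx, dy = positions[j] - positions[i]
        m = int((dx + radius) // cell)        # grid column of person j
        n = int((dy + radius) // cell)        # grid row of person j
        if 0 <= m < grid and 0 <= n < grid:   # the indicator 1_mn
            H[m, n] += hidden[j]
    return H

# Toy example: agent 1 is within range of agent 0, agent 2 is not.
positions = np.array([[0.0, 0.0], [0.5, 0.5], [5.0, 5.0]])
hidden = np.eye(3)
H = social_pool(0, positions, hidden)
```

In the toy example only agent 1 falls inside the grid around agent 0, so the pooled tensor contains exactly that agent's hidden state.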
The NLL of the trajectory of the i-th person (the i-th trajectory), L^i, is defined as below:

L^i(W_e, W_a, W_l, W_p) = − Σ_{t=T_obs+1}^{T_pred} log P(x_t^i, y_t^i | μ_t^i, σ_t^i, ρ_t^i)        (4.4)

4.3 Approach

Looking back at the social pooling mechanism explained in detail in section 4.2, we can see that it has the following properties:

• Hidden states are shared only among people who are close enough to fall within one another's neighborhood distance.
• Hidden states of people that fall within the same grid cell are not differentiated.

These properties show that the Social-LSTM model, which was designed for modeling interactions between pedestrians, is not a good match for modeling the interactions of basketball games. This is because in a basketball game:

• Interactions are not limited to spatially close players.
• Interactions within a grid cell differ in type; for instance, player-player interactions are different from player-ball interactions. Moreover, team members interact differently from opponents.
• Certain interactions are expected to be present in a basketball game, such as interactions between team members, between opponents, and between players and the ball.

Inspired by these differences, we propose an LSTM-based model called "Team-LSTM" for player motion trajectory prediction in basketball games. We introduce a new graph-based pooling method which uses relation networks and is capable of incorporating the structure of the game while modeling the interactions between players, and between players and the ball. The overall architecture of our model is shown in figure 4.2.

Figure 4.2: The figure shows the block diagram of our Team-LSTM model. The input embedding layer, which is a linear layer followed by a ReLU non-linearity, computes the embedded input vector e_t^i. The graph-based pooling module computes the pooled hidden tensor H_t^i. It consists of g_θ and f_φ modules (shown in orange), which are both MLPs.
g_θ takes as input all hidden state pairs of connected agents, and the summation of these vectors is then fed into f_φ to compute H_t^i. The output linear layer derives the outputs given the current hidden state h_t^i.

4.3.1 Graph-based Pooling

In a basketball game, every player has a different motion pattern; they move with their own velocities and accelerations while being affected by different game-structure factors such as the goal intent of their team and their interactions with their teammates, opponents, and the ball. Incorporating these interactions into an LSTM-based model is expected to boost the accuracy of the predictions. Therefore, we propose a new graph-based pooling structure using the relation networks proposed by Santoro et al. (2017). Our pooling allows each LSTM cell to receive pooled hidden state information from the LSTM cells of the players connected to it. Connected LSTMs are initially determined by a set of hidden state pairs; however, the relations between these pairs are later learned by the relation network modules during training. Given a hidden state dimension D, a neighborhood size N_o, and an embedding dimension R, we construct an N_o × R dimensional pooled hidden tensor for the i-th trajectory at time step t, denoted H_t^i:

H_t^i = f_φ( Σ_{j∈N_i} g_θ(h_{t−1}^i, h_{t−1}^j) )        (4.5)

where h_{t−1}^i is the hidden state of the i-th player at time step t−1, and N_i is the set of players connected to player i at time step t−1. The relation network modules g_θ and f_φ are functions with learnable parameters θ and φ, respectively. For our purpose, g_θ : 2×D → D and f_φ : D → R are MLPs whose parameters are learnable weights, making them end-to-end differentiable. The output of g_θ is called a "relation", and its role is to infer the ways its two hidden state inputs are related, or if they are related at all.

Figure 4.3: Pooled hidden state of player 1 at time step t, H_t^1, is computed as shown.
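A minimal NumPy sketch of the pooling in equation 4.5, with single-layer stand-ins for g_θ and f_φ (the thesis uses four- and three-layer MLPs); all dimensions here are toy values of our own:

```python
import numpy as np

def mlp(x, weights):
    """Tiny ReLU MLP used as a stand-in for both g_theta and f_phi."""
    for k, W in enumerate(weights):
        x = x @ W
        if k < len(weights) - 1:
            x = np.maximum(x, 0.0)      # ReLU between layers
    return x

def team_pool(i, hidden, neighbours, g_w, f_w):
    """Pooled hidden tensor H_t^i of equation 4.5.

    g_theta scores every (h^i, h^j) pair; the relations are summed
    and f_phi maps the sum to the R-dimensional pooled tensor.
    """
    relations = [mlp(np.concatenate([hidden[i], hidden[j]]), g_w)
                 for j in neighbours]
    return mlp(np.sum(relations, axis=0), f_w)

# Toy dimensions: D = 4 hidden units, R = 3 pooled units, 5 agents.
rng = np.random.default_rng(1)
D, R = 4, 3
hidden = rng.standard_normal((5, D))
g_w = [rng.standard_normal((2 * D, D))]   # g_theta: 2D -> D
f_w = [rng.standard_normal((D, R))]       # f_phi:   D -> R
H = team_pool(0, hidden, neighbours=[1, 2, 3, 4], g_w=g_w, f_w=f_w)
```

Because the sum over relations is permutation-invariant, the pooled tensor does not depend on the order in which connected players are listed.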
Therefore, our pooling model can learn the relations between pairs of hidden states, and the initial graph structure only provides the pairing. Figure 4.3 shows a visualization of our graph-based pooling.

4.3.2 Position Prediction

Incorporation of the newly designed pooled hidden tensor into the LSTM calculations is similar to that of Alahi et al. (2016) and is done by the following calculations:

e_t^i = φ(x_t^i, y_t^i; W_e)
h_t^i = LSTM(h_{t−1}^i, e_t^i, H_t^i; W_l)        (4.6)

where φ : 2 → R is an embedding function with ReLU non-linearity, W_e is the embedding weight, and W_l are the weights of the LSTM cell. H_t^i, our newly proposed pooled hidden tensor defined in equation 4.5, is concatenated with e_t^i and input to the LSTM cell. Comparing equations 4.2 and 4.6, we can see that instead of passing H_t^i through an embedding layer, we now directly use it as an input to the LSTM cell. Note that the LSTM cells used in both the Social-LSTM model and our model are the vanilla LSTM explained in section 3.5.

Similar to Alahi et al. (2016), positions are assumed to come from a bivariate Gaussian distribution. To predict the position of player i at time step t+1, the hidden state at time step t is passed through a linear output embedding layer Φ : R → 5 which returns the five parameters of the bivariate Gaussian distribution: the mean μ_{t+1}^i = (μ_x, μ_y)_{t+1}^i, standard deviation σ_{t+1}^i = (σ_x, σ_y)_{t+1}^i, and correlation coefficient ρ_{t+1}^i. The predicted coordinates (x̂_{t+1}, ŷ_{t+1})^i are then sampled from this bivariate Gaussian distribution:

[μ_{t+1}^i, σ_{t+1}^i, ρ_{t+1}^i] = W_p h_t^i
(x̂_{t+1}, ŷ_{t+1})^i ∼ N(μ_{t+1}^i, σ_{t+1}^i, ρ_{t+1}^i)        (4.7)

Along the same lines, training is performed by minimizing the NLL of the ground-truth positions given the predicted distribution parameters:

L^i(W_e, W_l, W_p, φ, θ) = − Σ_{t=T_obs+1}^{T_pred} log P(x_t^i, y_t^i | μ_t^i, σ_t^i, ρ_t^i)        (4.8)

4.3.3 Inference for Trajectory Prediction

During test time, we use the trained Team-LSTM model to predict the future position (x̂_t^i, ŷ_t^i) of the i-th player.
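The per-step loss and sampling of equations 4.7 and 4.8 amount to the following computation (a sketch; function and parameter names are ours, not from the thesis):

```python
import numpy as np

def bivariate_nll(x, y, mu_x, mu_y, s_x, s_y, rho):
    """Negative log-likelihood of one ground-truth position (x, y)
    under the bivariate Gaussian with the five predicted parameters."""
    zx = (x - mu_x) / s_x
    zy = (y - mu_y) / s_y
    z = zx**2 + zy**2 - 2.0 * rho * zx * zy
    # log of the normalisation constant of the bivariate Gaussian
    log_norm = np.log(2.0 * np.pi * s_x * s_y * np.sqrt(1.0 - rho**2))
    return log_norm + z / (2.0 * (1.0 - rho**2))

def sample_position(mu_x, mu_y, s_x, s_y, rho, rng):
    """Draw one (x_hat, y_hat) from the same bivariate Gaussian."""
    cov = [[s_x**2, rho * s_x * s_y],
           [rho * s_x * s_y, s_y**2]]
    return rng.multivariate_normal([mu_x, mu_y], cov)
```

The training loss of equation 4.8 is simply the sum of `bivariate_nll` over the unobserved time steps of a trajectory; at the distribution mean with unit standard deviations and zero correlation the per-step NLL equals log(2π).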
From time step 1 to T_obs, we input the ground-truth positions to our Team-LSTM cells. However, from time step T_obs+1 to T_pred, we use the predicted position from the previous Team-LSTM cell in place of the ground-truth coordinates, as they are unavailable.

Chapter 5
Experiments

Our implementation is done with PyTorch. Following Alahi et al. (2016), we set the hidden state dimension to 128 for all the Team-LSTM cells, and the embedding dimension is set to 64. We build the pooling modules of equation 4.5 following Santoro et al. (2017). g_θ is a four-layer MLP with 256 units per layer and ReLU non-linearities between every pair of layers. f_φ is a three-layer MLP with 256 units per layer, 50% dropout at the second layer, and ReLU non-linearities between all pairs of layers. The pairings provided to g_θ are determined by a graph structure given as input to the Team-LSTM model. We use a fully-connected graph structure for the pedestrian datasets; for the basketball dataset, we use three different graph structures: a fully-connected graph, a disconnected graph where each player is connected to their teammates, and a connected graph where each player is connected to their teammates and the ball. The input embedding layer of equation 4.6 is a linear layer followed by a ReLU non-linearity. The output embedding layer Φ of equation 4.7 is also a linear layer, mapping the D-dimensional hidden state to a 5-dimensional output. Following Alahi et al. (2016), we used a learning rate of 0.003 and the Adagrad optimizer (Duchi et al., 2011) with a weight decay of 0.0005 for training the model. We trained all the variations of our model and the baseline models for 30 epochs on the pedestrian datasets and for 12 epochs on the basketball dataset. We use the same hyperparameters described above when implementing and training the baseline models as well.

5.1 Datasets

We validate our Team-LSTM model on two types of data. The first is a dataset of real basketball trajectories, rendered as a series of bird's eye views of the court.
The second type consists of two publicly available pedestrian trajectory datasets: ETH and UCY. All these datasets are explained in detail below.

5.1.1 Basketball Data

We use the basketball dataset from Zheng et al. (2016), which contains basketball tracking data from a professional basketball league. The dataset includes a training set and a test set, each containing a number of possession files. Each of these possession files includes a sequence of game events during which one team has possession of the ball. Possessions end when the ball is captured by the opposing team, or if a shot is made. The training set contains 32,377 and the test set contains 3,953 possession files.

Each possession file is stored as a sequence of tuples, one for each time step. Each tuple contains 16 numbers formatted as:

(ball, i_1, ..., i_10, label_1, label_2, label_3, label_4, goal)

where the first 11 numbers are the positions of the ball and players, the next 4 numbers are the velocity labels of player 1 for 4 intermediate time steps, and the last number is the macro-goal label of player 1.

Players 1–5 are the offense team, defined as the team that has the ball in the possession. Players 6–10 are the defense team. For each time step, player and ball positions are stored as serial indices in a 400 × 360 grid representation of the half-court, indexed row-major, with (0, 0) in the top-left. Figure 5.1 shows a visualization of this positioning. The player and ball positions are sampled at 25/4 = 6.25 Hz, and each possession contains at most 70 sub-sampled frames.

For each time step, the velocity labels consist of the instantaneous velocity of player 1 for 4 intermediate time steps. A single velocity label is a row-major serial index in a 17 × 17 velocity grid centered at that player, where (0, 0) is in the top-left. The macro-goal label is a stationary point of player 1 along its trajectory.
Figure 5.1: At each time step, player and ball positions are stored as serial indices in a 400 × 360 grid representation of the half-court.

Macro-goals are weak labels for the long-term intent of the player, i.e., which region of the court the player wants to reach.

In order to build the trajectories, we only consider the positions and ignore the velocity and macro-goal labels. Below, we explain how the trajectories are extracted from the raw possession files.

Preprocessing

As explained earlier, each possession file includes the successive positions of the ball and all 10 players. The successive positions of each agent are called a trajectory, and we are interested in extracting the 11 trajectories of each possession file. We want each trajectory to consist of a number of examples with the format:

(frame_number, agent_id, y, x)        (5.1)

and we preprocess all possession files to extract all such trajectories.

For each possession file with 11 agents, we have 11 trajectories to extract. To extract the trajectory of the a-th agent of the p-th possession file, whose position value in the raw files is pos_value, assuming the possession file has n tuples, we follow the steps below:

• Build n examples of format 5.1.
• Set frame_number of example i to p × n + i for i ∈ [0, n).
• Set agent_id of all examples to p × 11 + a.
• Set y to ⌊pos_value / 360⌋.
• Set x to (pos_value mod 360).

Following the above steps, we will have 11 trajectories per possession file. Frame numbers and agent ids of trajectories of different possession files will be distinct as well. Overall, there are 356,147 training and 43,483 test examples in the final preprocessed dataset. For this dataset, following the experimental setup of Sun et al. (2019), we treat every 5 frames as 1 step. Therefore, we consider 10 steps in total, 6 observed and 4 unobserved. We consider all 11 agents in our model.
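The extraction steps above can be sketched as follows (function names are ours; the grid width of 360 and the decoding rules are taken from the description above):

```python
def decode_position(pos_value, width=360):
    """Convert a raw serial index in the 400 x 360 row-major grid
    to (y, x) half-court coordinates."""
    return pos_value // width, pos_value % width

def possession_to_examples(p, agent, positions, width=360):
    """Build the (frame_number, agent_id, y, x) examples for the
    agent-th trajectory of the p-th possession file.

    positions: the n raw serial indices of this agent, one per tuple.
    """
    n = len(positions)
    examples = []
    for i, pos_value in enumerate(positions):
        y, x = decode_position(pos_value, width)
        examples.append((p * n + i, p * 11 + agent, y, x))
    return examples
```

For example, the serial index 725 decodes to row y = 2, column x = 5, and every example of agent 3 in possession file 1 carries the agent id 1 × 11 + 3 = 14.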
The former contains 2 sets of data and the latter contains 3 sets of data. In order to make full use of the data while training our models, we use a leave-one-out approach: we train and validate our model on 4 sets and test on the remaining set, and we repeat this procedure for all 5 sets.

For both the ETH and UCY datasets, frames are sampled every 0.4 s and trajectories are 8 s long. Following the experiments of Alahi et al. (2016), we observe the first 3.2 s, or 8 frames, of each trajectory and predict the next 4.8 s, or 12 frames.

ETH

The ETH dataset, proposed by Pellegrini et al. (2009), contains human motion trajectory data from public scenes. This dataset includes two scenes, each with 750 different pedestrians, and is split into two sets of data: ETH and Hotel. The format of the trajectory entries is similar to that of the basketball data (format 5.1).

UCY

The UCY dataset was proposed by Lerner et al. (2007) and, similar to the ETH dataset, consists of real-world human trajectories from public scenes. This dataset contains two scenes with 786 people and has 3 components: ZARA01, ZARA02 and UCY. Again, the format of the trajectory entries is similar to that of the basketball data (format 5.1).

5.2 Metrics

Following the conventions of previous work (Alahi et al., 2016; Gupta et al., 2018; Pellegrini et al., 2009), we use the following metrics to evaluate our model:

• Average Displacement Error (ADE): the average ℓ2 distance between ground-truth positions and our predictions. The ADE over T consecutive frames containing N agents is calculated as

ADE = (1 / (T N)) Σ_{t=1}^{T} Σ_{i=1}^{N} √((x̂_t^i − x_t^i)² + (ŷ_t^i − y_t^i)²)        (5.2)

• Final Displacement Error (FDE): the distance between the final ground-truth position and our final predicted position at the end of the prediction period. The FDE at the last frame t = T_pred, averaged over N agents, is calculated as

FDE = (1 / N) Σ_{i=1}^{N} √((x̂_{T_pred}^i − x_{T_pred}^i)² + (ŷ_{T_pred}^i − y_{T_pred}^i)²)        (5.3)
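Both metrics can be computed directly from arrays of predicted and ground-truth positions. A sketch assuming arrays of shape (T, N, 2), i.e. frames × agents × x-y coordinates:

```python
import numpy as np

def ade(pred, true):
    """Average Displacement Error of equation 5.2: mean l2 distance
    over all frames and agents. pred, true: arrays of shape (T, N, 2)."""
    return float(np.mean(np.linalg.norm(pred - true, axis=-1)))

def fde(pred, true):
    """Final Displacement Error of equation 5.3: mean l2 distance
    at the last predicted frame only."""
    return float(np.mean(np.linalg.norm(pred[-1] - true[-1], axis=-1)))
```

For instance, a single agent predicted perfectly in the first of two frames but off by the vector (3, 4) in the final frame has an FDE of 5 and an ADE of (0 + 5) / 2 = 2.5.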
5.3 Results

In this part, we evaluate the accuracy of different variations of our model on the basketball and pedestrian datasets. The variations of our Team-LSTM model are:

• Graph-FC: our Team-LSTM model where the initial graph structure is a fully-connected graph.
• Graph-TMS: our Team-LSTM model where the initial graph structure is a disconnected graph with each player connected to all of their team members. This model is only applicable to the basketball dataset.
• Graph-TMSB: our Team-LSTM model where the initial graph structure is a graph with each player connected to all of their team members and the ball. This model is also only applicable to the basketball dataset.

We compare against the following baselines:

• Vanilla: a simple vanilla LSTM with no pooling mechanism.
• Social: the Social-LSTM method proposed by Alahi et al. (2016).

The accuracy results of our model and the baselines, computed with the ADE and FDE metrics on the pedestrian datasets, are reported in Table 5.1. As expected, our Team-LSTM model with an initial fully-connected graph structure outperforms both the vanilla LSTM and Social-LSTM models. The Social-LSTM model performs slightly better than the vanilla LSTM model. However, the Social-LSTM results are not as good as the results reported in the original paper (Alahi et al., 2016). The authors of that paper trained the model on a synthetic dataset and then fine-tuned it on real datasets. We do not use synthetic data to train the models, which could potentially lead to worse performance.

Table 5.2 shows the accuracy results of our model and the baselines, computed with the ADE and FDE metrics on the basketball dataset. Again, the vanilla LSTM model has the highest prediction error among all models. Social-LSTM performs slightly better than the vanilla model, but the improvements are very small. All variations of our model achieve errors lower than or equal to the errors achieved by the Social-LSTM and vanilla LSTM models.
Metric  Dataset  Vanilla  Social  Graph-FC
------  -------  -------  ------  --------
ADE     ETH      0.181    0.182   0.159
        HOTEL    0.158    0.153   0.139
        ZARA01   0.373    0.287   0.265
        ZARA02   0.233    0.232   0.193
        UCY      0.221    0.215   0.192
        Avg.     0.233    0.214   0.188
FDE     ETH      0.287    0.277   0.262
        HOTEL    0.251    0.254   0.232
        ZARA01   0.469    0.352   0.363
        ZARA02   0.331    0.294   0.264
        UCY      0.302    0.276   0.273
        Avg.     0.328    0.291   0.277

Table 5.1: Quantitative results of methods across pedestrian datasets. We report two error metrics, Average Displacement Error (ADE) and Final Displacement Error (FDE), for the unobserved part of the trajectories, in meters. Our method consistently outperforms the Social-LSTM method and is especially good for long-term predictions (lower is better).

Graph-TMS performs worse than Graph-FC and Graph-TMSB. This seems intuitive, as the location of the ball and its position relative to the players is valuable information when predicting the positions of all agents (players and the ball). Moreover, Graph-TMS performs worse than Graph-FC. This is also intuitive, as the fully-connected structure is able to capture more of the interactions. Graph-TMSB slightly outperforms Graph-FC in our setting. However, we expect Graph-FC to be the strongest variation, as it has the potential to learn all the pairwise relations between all the agents, while Graph-TMSB has access to only a subset of these relations. In general, the unimpressive performance of the Team-LSTM variations against the Social-LSTM model may be due to the restricted training time. As our pooling is more complex than social pooling, it requires a longer training time to learn the model parameters than the Social-LSTM model.
However, we have trained all of the variations of our Team-LSTM model for 12 epochs, while the Social-LSTM model was trained for 30 epochs. Although this argument is weak and needs further investigation, longer training could result in better performance for all the variations of our model.

Metric  Vanilla  Social  Graph-FC  Graph-TMS  Graph-TMSB
------  -------  ------  --------  ---------  ----------
ADE     0.196    0.193   0.193     0.194      0.191
FDE     0.264    0.261   0.259     0.259      0.257

Table 5.2: Quantitative results of methods on the basketball dataset. We report the two error metrics, Average Displacement Error (ADE) and Final Displacement Error (FDE), for the unobserved part of the trajectories. Our method consistently outperforms the Social-LSTM method and is especially good for long-term predictions (lower is better).

Figure 5.2 shows the average ℓ2 distance between the ground-truth and predicted positions of all agents from time step t = 1 to t = T_obs for the basketball dataset. As can be seen, the variation across the different models is quite small. We can see that the Graph-TMSB and Graph-FC errors generally go down over time, as the system integrates evidence from multiple frames. Similar to the results of table 5.2, Graph-TMSB and Graph-FC outperform the other methods, and Graph-TMSB slightly outperforms Graph-FC. As expected, the vanilla model generally has the highest error, and the Social-LSTM method outperforms the vanilla model. Graph-TMS starts with a higher initial error than the other methods, but manages to outperform both the vanilla and social methods. The errors of all three of these models start to increase after 3 time steps.
This shows that, unexpectedly, they all fail toutilize information from earlier observations.301 2 3 4 5Time step0.9000.9250.9500.9751.0001.0251.0501.0751.100Predictionerror×10−1Vanilla Social Graph-FC Graph-TMS Graph-TMSBFigure 5.2: Average `2 distances to ground-truth positions at different timesteps on basketball data.31Chapter 6ConclusionsIn this thesis, we explored a graph-based pooling method for LSTMs called theTeam-LSTM model for player motion trajectory prediction in basketball games.By looking at trajectories as a sequence of positions, we view trajectory predic-tion as a sequence prediction task. Following the success of LSTMs for sequenceprediction, the first solution would be to use an LSTM cell per person and predictthe trajectories of all players. However, Alahi et al. (2016) show that the naiveuse of one LSTM cell per person does not capture the interactions of people whoare close to each other. They address this limitation by proposing a new poolingstrategy called Social pooling. At every time step, the LSTM cell correspondingto each person receives pooled hidden states from the LSTM cells of its neighbors.This pooling allows the LSTMs of spatially close sequences to share their hiddenstates with each other, and can automatically learn typical interactions that takeplace among pedestrians.In our work, we focus on predicting the motion trajectories of basketball play-ers. In a basketball game, player trajectories are affected by a number of factorssuch as the interactions of each player with its team-members, opponents, as wellas the ball. Although Social-LSTM model can capture interactions between play-ers, it is designed for capturing pedestrian interactions. Pedestrian interactions arelimited to people who are spatially close and do not contain any game-related in-formation such as being a member of a team. 
We expect that incorporating such game-related information while pooling the hidden states can improve the accuracy of the predictions for basketball players. Therefore, we propose a new graph-based pooling structure using the relation networks proposed by Santoro et al. (2017). Our pooling allows each LSTM cell to receive pooled hidden state information from the LSTM cells of the players connected to it. Although connected LSTMs are initially determined by a set of LSTM pairs, the relations between these pairs are later learned by the relation network modules during training. Following this pooling procedure, we have a tensor per player which includes information from the hidden states of the players connected to it. These pooled hidden state tensors are then incorporated into the LSTM calculations similarly to Alahi et al. (2016), as shown in equations 4.6. We finally sample future locations from a bivariate Gaussian distribution whose parameters are the outputs of our LSTM cells.

We evaluate our model on three datasets: two publicly available pedestrian datasets, ETH (Pellegrini et al., 2009) and UCY (Lerner et al., 2007), and a basketball dataset containing real basketball trajectories (Zheng et al., 2016). Our model outperforms the Social-LSTM and vanilla LSTM models on the pedestrian datasets by a good margin. On the basketball dataset, the Graph-TMSB variation of our model also outperforms the Social-LSTM and vanilla LSTM models. We believe that we can achieve better results with longer training on the basketball dataset.

One line of future work is to train our model for longer and observe the performance of our Team-LSTM model. As our pooling adds a number of parameters to the base LSTM model, we expect it to require a longer training time in order to accurately learn these parameters.

Another direction would be to compare our model against other variations of RNNs, such as VRNNs (Chung et al., 2015), or to implement our pooling on top of such cells instead of vanilla LSTM cells.
One of the interesting recent works on trajectory prediction is the graph-structured variational RNN (Graph-VRNN), which predicts positions by taking the actual frames as input. It would be interesting to compare our model against this recent model as well.

Another line of future work is to include more data in our model. For instance, static-scene frame images could be embedded and used as an additional input to the LSTM cells. Another example would be to input player poses as an additional input to the LSTM cells. Moreover, we could use the teams of the players as an additional input to the LSTMs. This is a very interesting direction, as it preserves team membership information even when the graph structure does not enforce it.

Furthermore, we could incorporate more game-related structures into our pooling. For instance, we could consider three different relation functions g_θ for team-member interactions, opponent interactions, and player-ball interactions. Another example would be to extend our model to multi-class settings where different player roles such as center, forward and guard (or agent types in general, such as bicycles, skateboards, carts, and pedestrians) are considered in the pooling procedure.

Lastly, there is a need for a richer real-world basketball dataset that contains visual information along with position data. Although the currently used basketball dataset is a good step for evaluating our model, it does not contain useful information such as the roles of players, the positions of players over a whole game rather than a single possession, and the actual frame images. Also, there is a need for a deeper exploration of which architecture works best for our task. Currently, we use LSTMs as our base RNN, and the MLPs in our graph-based pooling are identical to the ones used by Santoro et al. (2017). A comparative study benchmarking different RNN architectures could also be very useful for future research in this direction.

Bibliography

A. Alahi, K. Goel, V. Ramanathan, A.
Robicquet, L. Fei-Fei, and S. Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–971, 2016. → pages 2, 5, 15, 16, 20, 21, 23, 26, 27, 28, 32, 33

G. Antonini, M. Bierlaire, and M. Weber. Discrete choice models of pedestrian walking behavior. Transportation Research Part B: Methodological, 40(8):667–687, 2006. → page 4

L. Ballan, F. Castaldo, A. Alahi, F. Palmieri, and S. Savarese. Knowledge transfer for scene-specific motion prediction. In European Conference on Computer Vision, pages 697–713. Springer, 2016. → page 4

P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pages 4502–4510, 2016. → page 5

Y. Bengio, P. Simard, P. Frasconi, et al. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994. → page 12

M. B. Chang, T. Ullman, A. Torralba, and J. B. Tenenbaum. A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341, 2016. → page 5

K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014. → page 14

J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio. End-to-end continuous speech recognition using attention-based recurrent NN: First results. arXiv preprint arXiv:1412.1602, 2014. → page 5

J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pages 2980–2988, 2015. → pages 5, 33

A. Cleeremans, D. Servan-Schreiber, and J. L. McClelland. Finite state automata and simple recurrent networks. Neural Computation, 1(3):372–381, 1989. → page 11

P. Coscia, F.
Castaldo, F. A. Palmieri, L. Ballan, A. Alahi, and S. Savarese. Point-based path prediction from polar histograms. In 2016 19th International Conference on Information Fusion (FUSION), pages 1961–1967. IEEE, 2016. → page 4

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011. → page 23

F. A. Gers and J. Schmidhuber. Recurrent nets that time and count. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, volume 3, pages 189–194. IEEE, 2000. → page 12

F. A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with LSTM. 1999. → page 12

A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013. → page 16

A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772, 2014. → pages 5, 16

A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005. → page 12

A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi. Social GAN: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2255–2264, 2018. → pages 4, 5, 15, 27

D. Helbing and P. Molnar. Social force model for pedestrian dynamics. Physical Review E, 51(5):4282, 1995. → page 4

G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006. → page 7

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. → page 12

K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators.
Neural Networks, 2(5):359–366, 1989. → page 9

W. Hu, N. Xie, L. Li, X. Zeng, and S. Maybank. A survey on visual content-based video indexing and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 41(6):797–819, 2011. → page 4

M. Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. In Proc. of the Eighth Annual Conference of the Cognitive Science Society (Erlbaum, Hillsdale, NJ), 1986. → page 11

A. Karpathy, A. Joulin, and L. F. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems, pages 1889–1897, 2014. → page 5

K. Kim, D. Lee, and I. Essa. Gaussian process regression flow for analysis of motion trajectories. In 2011 International Conference on Computer Vision, pages 1164–1171. IEEE, 2011. → page 4

K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert. Activity forecasting. In European Conference on Computer Vision, pages 201–214. Springer, 2012. → page 4

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012. → page 7

L. Leal-Taixé, G. Pons-Moll, and B. Rosenhahn. Everybody needs somebody: Modeling social and grouping behavior on a linear programming multiple people tracker. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 120–127. IEEE, 2011. → page 4

L. Leal-Taixé, M. Fenzi, A. Kuznetsova, B. Rosenhahn, and S. Savarese. Learning an image-based motion context for multiple people tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3542–3549, 2014. → page 4

Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015. → page 9

N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker. DESIRE: Distant future prediction in dynamic scenes with interacting agents.
InProceedings of the IEEE Conference on Computer Vision and PatternRecognition, pages 336–345, 2017. → page 5A. Lerner, Y. Chrysanthou, and D. Lischinski. Crowds by example. In Computergraphics forum, volume 26, pages 655–664. Wiley Online Library, 2007. →pages 2, 4, 27, 33M. Luber, J. A. Stork, G. D. Tipaldi, and K. O. Arras. People tracking with humanmotion predictions from social forces. In 2010 IEEE International Conferenceon Robotics and Automation, pages 464–469. IEEE, 2010. → page 4W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent innervous activity. Bulletin of mathematical biology, 52(1-2):99–115, 1990. →page 7R. Mehran, A. Oyama, and M. Shah. Abnormal crowd behavior detection usingsocial force model. In 2009 IEEE Conference on Computer Vision and PatternRecognition, pages 935–942. IEEE, 2009. → page 4B. T. Morris and M. M. Trivedi. Trajectory learning for activity understanding:Unsupervised, multilevel, and long-term adaptive approach. IEEE transactionson pattern analysis and machine intelligence, 33(11):2287–2301, 2011. →page 4F. Murtagh. Multilayer perceptrons for classification and regression.Neurocomputing, 2(5-6):183–197, 1991. → page 10B. A. Pearlmutter. Learning state space trajectories in recurrent neural networks.Neural Computation, 1(2):263–269, 1989. → page 11S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool. You’ll never walk alone:Modeling social behavior for multi-target tracking. In 2009 IEEE 12thInternational Conference on Computer Vision, pages 261–268. IEEE, 2009. →pages 2, 4, 27, 33S. Pellegrini, A. Ess, and L. Van Gool. Improving data association by jointmodeling of pedestrian trajectories and groupings. In European conference oncomputer vision, pages 452–465. Springer, 2010. → page 438A. Sanchez-Gonzalez, N. Heess, J. T. Springenberg, J. Merel, M. Riedmiller,R. Hadsell, and P. Battaglia. Graph networks as learnable physics engines forinference and control. arXiv preprint arXiv:1806.01242, 2018. → page 6A. 
Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia,and T. Lillicrap. A simple neural network module for relational reasoning. InAdvances in neural information processing systems, pages 4967–4976, 2017.→ pages 6, 19, 23, 33, 34C. Sun, P. Karlsson, J. Wu, J. B. Tenenbaum, and K. Murphy. Stochasticprediction of multi-agent interactions from partial observations. arXiv preprintarXiv:1902.09641, 2019. → pages 6, 15, 26M. K. C. Tay and C. Laugier. Modelling smooth paths using gaussian processes.In Field and Service Robotics, pages 381–390. Springer, 2008. → page 5O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural imagecaption generator. In Proceedings of the IEEE conference on computer visionand pattern recognition, pages 3156–3164, 2015. → page 5J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical modelsfor human motion. IEEE transactions on pattern analysis and machineintelligence, 30(2):283–298, 2007. → page 5K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, andY. Bengio. Show, attend and tell: Neural image caption generation with visualattention. In International conference on machine learning, pages 2048–2057,2015. → page 5K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg. Who are you with andwhere are you going? In CVPR 2011, pages 1345–1352. IEEE, 2011. → page 4R. A. Yeh, A. G. Schwing, J. Huang, and K. Murphy. Diverse generation formulti-agent sports games. In Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition, pages 4610–4619, 2019. → page 6S. Zheng, Y. Yue, and P. Lucey. Generating long-term trajectories using deephierarchical networks. In Advances In Neural Information Processing Systems,pages 1543–1551, 2016. → pages 2, 24, 33B. Zhou, X. Wang, and X. Tang. Random field topic model for semantic regionanalysis in crowded scenes from tracklets. In CVPR 2011, pages 3441–3448.IEEE, 2011. → page 439

