UBC Theses and Dissertations
Classification of Puck Possession Events in Ice Hockey. Moumita Roy Tora, 2017.


Full Text

Classification of Puck Possession Events in Ice Hockey

by

Moumita Roy Tora

B.Sc., Brac University, 2014

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Computer Science)

The University of British Columbia (Vancouver)

September 2017

© Moumita Roy Tora, 2017

Abstract

Group activity recognition in sports is often challenging due to the complex dynamics and interaction among the players. In this thesis, we propose a deep architecture to classify puck possession events in ice hockey. Our model consists of three distinct phases: feature extraction, feature aggregation, and learning and inference. For the feature extraction and aggregation, we use a Convolutional Neural Network (CNN) followed by a late fusion model on top to extract and aggregate different types of features, including handcrafted homography features for encoding the camera information. The output from the CNN is then passed into a Recurrent Neural Network (RNN) for the temporal extension and classification of the events. The proposed model captures the context information from the frame features as well as the homography features. The individual attributes of the players and the interactions among them are also incorporated using a pre-trained model and team pooling. Our model requires only the player positions in the image and the homography matrix, and does not need any explicit annotations for individual actions or player trajectories, greatly simplifying the input of the system. We evaluate our model on a new Ice Hockey Dataset and a Volleyball Dataset. Experimental results show that our model produces promising results on both these challenging datasets with much simpler inputs compared with previous work.

Lay Summary

Group activity recognition is the task of determining what a group of people are doing given a single image or a short clip of video. We have looked at group activity recognition in sports videos, particularly ice hockey. Thus, given a sequence of images, we aim to classify the sequence into a group activity or event. There are many possible events that can happen in ice hockey, but we have looked at a subset of only those events which involve the possession of the puck by the players; some examples include pass and shot. We have solved this problem by proposing a deep network architecture which takes into account player appearance and contextual information. These features from different sources are fused together and passed into a temporal model to learn the dependencies across the images in the given sequence.

Preface

This thesis is submitted in partial fulfillment of the requirements for a Master of Science Degree in Computer Science. The entire work presented here is original work done by the author, Moumita Roy Tora, performed under the supervision of Professor James J. Little. A version of this work has appeared in the following publication:

• "Classification of Puck Possession Events in Ice Hockey", The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Jul 2017, pp. 91-98.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
2 Related Work
    2.1 Group Activity Recognition
    2.2 Group Activity / Event Classification in Sports
        2.2.1 Formulating Ice Hockey Events
        2.2.2 Analogy of Our Problem with Other Team Sports
    2.3 Deep Learning
3 Overview of the Methodologies
    3.1 Convolutional Neural Network (CNN)
        3.1.1 Evolution of CNNs
    3.2 Recurrent Neural Network (RNN)
4 Our Method
    4.1 Individual Attributes
    4.2 Contextual Information from the Image
    4.3 Homography for the Spatial Feature Representation
        4.3.1 Computing Features from the Homography Matrix
    4.4 Fusion Model for Combining Homography Features and Frame Features
    4.5 Temporal Model
5 Experiments
    5.1 Evaluation Metric
    5.2 Baselines
    5.3 Implementation Details
    5.4 Ice Hockey Dataset
        5.4.1 Event Annotation
        5.4.2 Tracking Data
        5.4.3 Homography Transformation
        5.4.4 Quantitative Results
        5.4.5 Qualitative Results
    5.5 Volleyball Dataset
        5.5.1 Quantitative Results
        5.5.2 Qualitative Results
6 Conclusion
Bibliography

List of Tables

Table 5.1  Event descriptions and corresponding number of training examples in the Ice Hockey Dataset.
Table 5.2  The descriptions of the baselines.
Table 5.3  Performance of our model on the Ice Hockey Dataset compared to the baselines. In the fine-tuning column, w/o and w represent without and with fine-tuning, respectively.
Table 5.4  Per-class recall and precision of our model on the Ice Hockey Dataset.
Table 5.5  Event descriptions in the Volleyball Dataset. In the original dataset and all our experiments, these events are further classified into right and left, where right always refers to the team on the right side of the net and left refers to the team on the left side. For example, r-pass refers to the pass event occurring on the right side.
Table 5.6  The descriptions of the baselines.
Table 5.7  Performance of our model on the Volleyball Dataset compared to the baselines and previous work. w and w/o refer to with and without fine-tuning, respectively.
Table 5.8  Per-class recall and precision of our model on the Volleyball Dataset.

List of Figures

Figure 1.1  Example images of group activity.
Figure 1.2  Example images of puck possession events. In each image, the dashed red line is the potential trajectory of the puck, except for LPR, in which the red lines are the trajectories of potential player movements.
Figure 1.3  Schematics of puck possession events. This figure shows schematics of the five puck possession events that our system aims to classify. In some events, individual players' appearances/motions could be very similar, such as dump in and dump out.
Figure 1.4  CNN model with a late fusion on top. fc corresponds to a fully connected layer; fc7 is the second-to-last fully connected layer in AlexNet, just before the classifier. The numbers by the arrows correspond to the input/output feature vector dimensions. We have fine-tuned the last two layers of AlexNet on our datasets.
Figure 1.5  Overview of our model. Our method can classify puck possession events for ice hockey games from a sequence of images. We first extract scene features from the whole frame, individual features from the players, and homography features using the homography matrix. Team pooling is performed for the player features and all the features are then fused. Finally, we train a BLSTM model using the sequence of temporal features.
Figure 3.1  AlexNet illustration. The input is a 224 by 224 image that goes through several hidden layers before being classified by softmax in the final layer. The output from the last fully connected layer is a 1000-dimensional feature vector that goes to the classifier, and the output of the classifier is a score vector over the possible classes. Source: [2]
Figure 3.2  Block diagram of a simple RNN that unfolds with time, forming a chain structure. A refers to one unit of the RNN; x and h refer to the input and the output at a given time step. The hidden state values of one unit are passed to the next one sequentially, as represented by the arrows. Source: [5]
Figure 3.3  Building block of an LSTM unit. Each LSTM unit has three gates (input, output and forget), the input block and the output block. Different types of weights are associated with each component of the unit, and they are learned during training of the network. The gates are usually controlled by sigmoid functions. Source: [4]
Figure 4.1  Weight computation method for a player and his associated centers.
Figure 5.1  The transformation from the image to the template coordinate system by the homography matrix.
Figure 5.2  Training loss over time for the Ice Hockey Dataset.
Figure 5.3  Training accuracy over time for the Ice Hockey Dataset.
Figure 5.4  Confusion matrix for top-1 event prediction with a tolerance of 0.0 on the Ice Hockey Dataset. The rightmost column shows the precision and the bottom row shows the recall per class, in percentages. The green values represent the percentages of correct predictions by the model, whereas the red values show the percentages of incorrect predictions. For example, recall for lpr is 39% and its precision is 46.6%, which means that out of all the lpr events, 39% were correctly classified, and out of all the lpr predictions made by the model, 46.6% were correct. The blue rectangle in the bottom right corresponds to the overall accuracy.
Figure 5.5  Confusion matrix for top-2 event prediction with a tolerance of 0.15 on the Ice Hockey Dataset. The rightmost column shows the precision and the bottom row shows the recall per class, in percentages. The green values represent the percentages of correct predictions by the model, whereas the red values show the percentages of incorrect predictions. For example, recall for lpr is 74.3% and its precision is 78%, which means that out of all the lpr events, 74.3% were correctly classified, and out of all the lpr predictions made by the model, 78% were correct. The blue rectangle in the bottom right corresponds to the overall accuracy.
Figure 5.6  Training loss over time for the Volleyball Dataset.
Figure 5.7  Training accuracy over time for the Volleyball Dataset.
Figure 5.8  Confusion matrix for top-1 event prediction with a tolerance of 0.0 on the Volleyball Dataset. The rightmost column shows the precision and the bottom row shows the recall per class, in percentages. The green values represent the percentages of correct predictions by the model, whereas the red values show the percentages of incorrect predictions. For example, recall for r-set is 41.7% and its precision is 51.3%, which means that out of all the r-set events, 41.7% were correctly classified, and out of all the r-set predictions made by the model, 51.3% were correct. The blue rectangle in the bottom right corresponds to the overall accuracy.
Figure 5.9  Confusion matrix for top-2 event prediction with a tolerance of 0.15 on the Volleyball Dataset. The rightmost column shows the precision and the bottom row shows the recall per class, in percentages. The green values represent the percentages of correct predictions by the model, whereas the red values show the percentages of incorrect predictions. For example, recall for r-set is 59.9% and its precision is 70.9%, which means that out of all the r-set events, 59.9% were correctly classified, and out of all the r-set predictions made by the model, 70.9% were correct. The blue rectangle in the bottom right corresponds to the overall accuracy.
Figure 5.10  Qualitative results. The correct predictions made by our model with a tolerance of 0.0 on the Ice Hockey Dataset.
Figure 5.11  Qualitative results. The correct predictions made by our model with a tolerance of 0.15 on the Ice Hockey Dataset. True refers to the actual class, Pred 1 and Pred 2 refer to the predictions with tolerances of 0.0 and 0.15 respectively, and δ refers to the difference in probabilities between the two predictions.
Figure 5.12  Qualitative results. The images show the cases where our model failed on the Ice Hockey Dataset.
Figure 5.13  Qualitative results. The correct predictions made by our model with a tolerance of 0.0 on the Volleyball Dataset.
Figure 5.14  Qualitative results. The correct predictions made by our model with a tolerance of 0.15 on the Volleyball Dataset. True refers to the actual class, Pred 1 and Pred 2 refer to the predictions with tolerances of 0.0 and 0.15 respectively, and δ refers to the difference in probabilities between the two predictions.
Figure 5.15  Qualitative results. The figures show the cases where our model failed on the Volleyball Dataset.

Acknowledgments

I would like to express my greatest gratitude to a number of people who have been a continuous source of support during my Masters.

First of all, I would like to thank my supervisor, Professor James J. Little, who is not only a great mentor but also one of the kindest human beings I have come across. From the very beginning of my research, he has been very supportive of my work and interests. He appreciated and acknowledged every little idea that I came up with, and he helped me to gain new insight into any problem. I would also like to thank Professor Leonid Sigal for agreeing to read my thesis as a second reader.

Next, I would like to thank the University of British Columbia (UBC) for giving me a lifetime opportunity to enhance my knowledge. I have enjoyed every little piece of my 2 years here. There were times when I was exhausted with all the courses I had to take, but at the end of the day, I gained a new perspective on my life. I would like to thank all the great instructors I have come across, my peers, friends, family and all my lab-mates who always helped me whenever I needed it.

This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Institute for Computing, Information and Cognitive Systems (ICICS) at UBC, and enabled in part by WestGrid and Compute Canada. I would like to thank them for funding me and providing me the opportunity to travel to a number of conferences and meetings.

Lastly, my heartiest gratitude is for my parents, brother and husband, who believed in me more than I did myself. My parents taught me to dream big from my very childhood, and whatever little I have learned in my life is only because of their inspiration and blessings.

Chapter 1: Introduction

Computer vision has been widely used in many sports applications [43]. The range of applications has expanded from information extraction, such as player detection and tracking [38], to new visual information generation, such as free-viewpoint video generation [25], and further to prediction of shot location [63] and broadcast camera angle planning [12].

Among these applications, group activity recognition is an active research area. Group activity recognition refers to determining what a group of people are doing, providing semantic and abstract descriptions for a sequence of images. Figure 1.1 shows some examples of group activities collected from the Collective Activity Dataset [15]. Group activity recognition is similar to activity/action recognition, but the additional dynamics of the group interaction need to be considered in determining group activities. In most group activities, the human-human or the human-object interactions need to be modeled. In order to do that, we need to detect/segment the humans and the objects first, which is not trivial.
For video classification, additional cues such as poses, tracking, motion and camera movements play important roles. However, sports activity recognition can be very complex due to the rapid transitions between the events, occlusions and fast movements of the players, varied camera viewpoints, and camera motions. Moreover, the spatiotemporal structures of events vary greatly in different sports; for example, the locations of the players in volleyball are relatively static compared to players in ice hockey. Another key challenge is to distinguish between motion, action, activity and event in sports, because the transition between these four states is very rapid and complex. Our focus is on event classification. In ice hockey, sometimes the events are instants of just one frame, whereas at other times they can be much longer. Sometimes there is a long interval between two events and there is no clear boundary between them. There are also cases where two events can occur simultaneously in the same frame. All these factors make the recognition problem complex and hard to generalize across different sports. Most previous work has tried to address group activity classification in non-sports applications [7, 13, 14, 36]. Consequently, studying a particular sport with domain knowledge is valuable and complementary to general activity recognition.

[Figure 1.1: Example images of group activity.]

We are interested in the classification of puck possession events in ice hockey games. Puck possession events are those events that involve players taking over or losing control of the puck. For example, in the pass event, the control of the puck shifts from one player to another on the same team. Unlike other ball sports such as basketball and soccer, the puck in hockey can be in the possession of neither team for extended time periods, for example, when the puck moves out from the defensive zone or into the offensive zone. All the events we have considered involve interaction between the puck and the players. This is different from sports such as running or swimming, where there is no need to model the object-human or human-human interaction. However, modeling this interaction in ice hockey is difficult, as the puck is not very clear in most of the frames because of its size, color and rapid movement, unlike sports such as volleyball or soccer, where the ball is much bigger and more visible. The dataset, provided by SPORTLOGiQ (http://sportlogiq.com/), contains the annotations for puck possession events in National Hockey League (NHL) games. The playing surface (rink) is large and enclosed by the boards. Play-by-play commentary, provided in real time during a broadcast, annotates shooting, scoring and hit events at time intervals of typically 3 to 15 seconds, while the intervals between our annotated events range from 3 to 200 frames (at 30 FPS). The output of our system can greatly benefit the manual event annotation with some tolerance of error. The classification results may enable coaches and analysts to determine strategic concepts and support evaluation of individual players.

[Figure 1.2: Example images of puck possession events: (a) dump in, (b) dump out, (c) LPR, (d) pass, (e) shot. In each image, the dashed red line is the potential trajectory of the puck, except for LPR, in which the red lines are the trajectories of potential player movements.]

Our proposed model aims to classify five puck possession events, which are dump in, dump out, pass, shot and loose puck recovery (LPR). The descriptions of these events are in Table 5.1. In all the events but LPR, the puck goes from possession by a player to no possession; in the case of LPR, the puck is not in the possession of any player and goes from no possession to possession. Figure 1.2 shows example images of these events. In the images, the dashed red lines show the movements of the pucks or the players. Figure 1.3 shows the schematics of these events on the ice hockey strategy board.

[Figure 1.3: Schematics of puck possession events. This figure shows schematics of the five puck possession events that our system aims to classify. In some events, individual players' appearances/motions could be very similar, such as dump in and dump out.]

An important descriptive cue for the classification of possession events is the location of the puck and the players in the coordinates of the playing field. However, it is extremely hard to track the puck in images, as the puck is very small and moves very fast. Due to motion blur, the puck's color and texture can be merged into the background. Without any kind of puck information, the event classification problem is even more difficult.

Alternative cues for puck possession event classification are the player locations in the image and their appearance, their spatiotemporal information, and estimates of the player locations in playing-ground coordinates. Players are coached to keep the team shape and move to offense/defense together. But looking only at individual players can be ambiguous in some events; for example, the player appearance and action might be very similar in the two events pass and shot. In this case, additional cues such as context information are necessary to distinguish these two events.

Our model uses a deep architecture which requires the homography matrix, the detected bounding boxes of the players, and the corresponding frames. Figure 1.5 shows the pipeline of our method and Figure 1.4 shows the CNN model with a late fusion on top that we have trained. The input of our method is a sequence of images with player detection results (bounding boxes in the image) and the homography features. The detailed computation of these features is described in Section 4.3.1. Our method first extracts context features from the whole image, individual features from the player image patches, and homography features using the homography matrices. The homography features represent the spatial distribution of the players in rink coordinates. Because the number of detected players changes over frames, we use a max pooling layer to aggregate the individual features; the max pooling is done in a team-wise fashion. Then, we use a bidirectional LSTM to train an event classification model using features from the sequence of given images.

Building upon the existing work, our model takes advantage of the discriminative power of deep learning and captures the structural and spatiotemporal information in group activities. Moreover, it shows how combining contextual information and person-level features can improve accuracy for events that are very similar to each other. The main contribution of our work is four-fold. First, we propose a benchmark for event classification on a new, challenging Ice Hockey Dataset. Second, we extensively study the features from whole frames and individual players; we provide solid evidence that our model works best when the individuals' information is combined with the context of the events in ice hockey games. Third, we propose a new fusion approach to combine features from different sources. Lastly, we also test our method on a Volleyball Dataset and obtain reasonable results that show the generalization property of our approach across quite different sports.

The rest of the thesis is organized as follows: Chapter 2 surveys previous work in sports, group activities and deep learning; Chapter 3 gives an overview of the methodologies we have used in our experiments; Chapter 4 describes our architecture in detail; Chapter 5 illustrates the experiments and results, followed by Chapter 6, which summarizes our work and suggests some possible extensions for the future.

[Figure 1.4: CNN model with a late fusion on top. fc corresponds to a fully connected layer; fc7 is the second-to-last fully connected layer in AlexNet, just before the classifier. The numbers by the arrows correspond to the input/output feature vector dimensions. We have fine-tuned the last two layers of AlexNet on our datasets.]

[Figure 1.5: Overview of our model. Our method can classify puck possession events for ice hockey games from a sequence of images. We first extract scene features from the whole frame, individual features from the players, and homography features using the homography matrix. Team pooling is performed for the player features and all the features are then fused. Finally, we train a BLSTM model using the sequence of temporal features.]

Chapter 2: Related Work

2.1 Group Activity Recognition

Group activity recognition has been an active area of research over the years. Before the dramatic success of deep learning on a variety of tasks, researchers mostly relied on hand-crafted features [47] [15] [62]. When deep learning started gaining attention, group activity recognition, like most other applications, also started to take advantage of these neural networks [28] [45] [61] [8] [52]. In parallel, there was other interesting work which continued to explore handcrafted features with some notable amount of success. In most work, researchers break down the problem of group activity recognition into subproblems of tracking and identifying individual actions, key persons, or the interactions among the individuals; the knowledge from these smaller problems is then combined to learn the group activity [51] [14] [36]. Graphical models seemed to gain particular attention in the hierarchical approach [44] [7] [35].

Besides, there are other models where researchers combine deep models with other machine learning approaches that can boost the performance further, because deep models are sometimes hard to interpret and require lots of fine-tuning. Most recent work tries to explore advanced deep models such as deep reinforcement learning or deep graphical models [16] [45] [61] [8] [52] [64], but these concepts are still very new to the domain of activity recognition.

2.2 Group Activity / Event Classification in Sports

The domain of sports analysis is vast because each sport has its unique characteristics. As a result, a model from a particular sport is usually unable to achieve reasonable performance on other sports datasets. However, researchers have narrowed down the domain into specific sports, particularly the most challenging ones, and have tried to solve different aspects over the years. For example, Yue-Hei Ng et al. [65] proposed and evaluated different deep neural network architectures to classify longer sequences of sports videos. Meanwhile, various approaches have been proposed for different types of sports analysis [10, 42, 49]. Although there is some previous work on player tracking and action recognition in ice hockey [39-41], to the best of our knowledge, no previous work has been done on event classification in ice hockey.

2.2.1 Formulating Ice Hockey Events

Sports events can be broadly classified based on several characteristics: object vs. non-object sports (such as swimming and golf), instant events vs. sequences of actions, single-person actions/activities vs. team sports, game policies, and so on. In our problem, all the events are centered around the puck and what the players are doing with it, so we need to consider the human-object and human-human interactions. The object-human interaction in sports can be analogous to non-sports activities such as cutting fruit with a knife. This activity can be broken down into a sequence of shorter actions such as holding the knife, holding the fruit, touching the knife to the fruit, and cutting it. Modelling even this less complex activity requires attention to delicate details such as how the hand is holding the objects and what the shapes and textures of these objects are. Events in sports are usually much more complicated sequences of actions, and the transition between these actions is very fast. The actions might also look much more similar to each other, and there could be multiple actions occurring in the same frame.

We can further narrow down our problem to multi-person activities. Although some of the events, such as goal or shot, could be considered as sole actions, determining these events often requires looking at the surrounding players and their individual actions. For example, during the shot event, the opposing team's defenders will most likely try to dump the puck out. However, the duration of each event and the possession of the puck vary widely. The puck may be out of possession for a long period of time, unlike in other sports, and there is no annotation for the frames where the puck is not in the possession of any team or for when an event ends.

2.2.2 Analogy of Our Problem with Other Team Sports

Volleyball

Ibrahim et al. [28] built a hierarchical deep network to learn the individual actions using one LSTM, which is then combined with features extracted from CNNs and passed into another LSTM to predict the group actions. However, their method has difficulties with events which are very similar to each other. Very recently, Bagautdinov et al. [8] proposed a deep network that jointly detects the players, infers their social actions and estimates the collective actions; their method outperforms [28] on the Volleyball Dataset. However, both of these models need explicit labels for individual actions, which are expensive and hard to obtain for ice hockey. In our scenario, the events are similar to each other and there are no person-level annotations. Both of these factors make these models less suitable for our problem. Furthermore, volleyball is very different from ice hockey. Although both games involve possessing the puck/ball and modeling the object-human or human-human interactions, the puck is much smaller, faster and harder to track than the ball. Moreover, the ball is mostly in the air in volleyball, whereas the puck can be either in the air or on the ice. The players of the two teams in volleyball never mingle with each other and are clearly separated by a line/net, but in ice hockey they are almost never separated. The relative motions of the volleyball players are static compared to ice hockey, where the players skate at a fast speed all the time and hit and bounce off the surrounding boards. If the ball goes out of bounds in volleyball, it is usually a miss, whereas in ice hockey the puck can go in and out of the boundary line at any time while the game continues.

Basketball

Although there has been some work on event detection in basketball, not much focus has been given to the event classification problem. Ramanathan et al. [45] proposed a deep temporal model with an attention mechanism to detect and classify events on a new basketball dataset which they have made available to the public. Their model attends to a subset of players in a frame rather than all the players, because usually only a few players are participating in the event. Their unified approach can detect the key players and classify the activities simultaneously. However, the authors mention that the performance is usually poor for classes with fewer examples. Although the overall mean average precision of their approach is slightly better than other baselines, the per-class accuracy is still not satisfactory for many events. They also did not experiment on any other dataset, which makes it difficult to assess the generalization property of their approach. There are many fundamental differences between basketball and ice hockey. The ball is always in the possession of some player during an event and it is always in the air, whereas the puck may or may not be in the possession of any team and can be either in the air or on the ice. A player can hold, bounce or throw the ball with the hands, whereas the puck is always handled with the sticks, without any human touch. Moreover, similar to volleyball or soccer, the ball in basketball is much bigger than the puck and is generally easier to distinguish from the background. Nevertheless, each sport has its own challenges, which might require sport-specific, precise models.

Soccer

Compared to many other sports, research using soccer videos dates back a long time. Kong et al. [33] proposed a local motion-based approach to classify three different actions. Their model first computes SIFT features and then relative motion descriptors from the background and the foreground key-point sets using the bag-of-words strategy. Recently, Komori et al. [59] proposed a hierarchical deep architecture for recognizing three different activities in soccer. Other work in soccer includes field localization [27], player tracking [37], automatic camera planning [11], etc. Soccer fields are usually open and much bigger than ice hockey rinks, which are enclosed by boundaries. The movements of the players are much faster in ice hockey than in soccer. The soccer ball is bigger and easier to track than the puck. But in both sports, the ball or the puck may or may not be in the possession of either team and can be in the air or on the ground, and the group events are also very similar in the two sports.

2.3 Deep Learning

As more data became available, the success of Convolutional Neural Networks (CNNs) has been proven in numerous applications over the last decade on computer vision tasks such as image recognition [34] and video analysis [31, 53]. Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) [26], are widely popular models that are well suited to variable-length sequence inputs. LSTM has been successfully applied to speech and handwriting recognition [22], human action recognition [9, 17] and image caption generation [30, 60].

Researchers have shown different ways to combine LSTMs with CNNs or graphical models for group activity recognition. For example, Deng et al. integrated a graphical model with a deep neural network [16]. The network learns structural relations by representing individuals and the scene as nodes that pass messages among them, imposing a gating mechanism to determine the meaningful edges. But the method is not designed for sports activities, where interactions between individuals are generally more complicated. In [45], Ramanathan et al. argued that in many group activities redundant information can be ignored by concentrating on a subset of people who contribute to the group activity. Thus, they first extracted features from the individuals who are attending to the event, as well as global context features representing the entire scene, and then solved the problem of event classification using a deep network.

Chapter 3: Overview of the Methodologies

This chapter provides a high-level description of the algorithms we have used. Our core architecture is built on two widely popular supervised models: a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). In general, each of these networks takes some feature vectors and the desired output values as inputs, and the networks are trained using a loss function. During training, the network learns the weights and the hyper-parameters from the given inputs, and the trained network is then applied to an unseen dataset to map the inputs based on the learned function. However, there are some fundamental differences between the two networks, and they are usually used for different purposes: a CNN is used mainly for visual understanding, whereas an RNN is known for its temporal structure that can learn sequential problems.

3.1 Convolutional Neural Network (CNN)

A convolutional neural network is a feed-forward neural network which has an input, an output and multiple hidden layers of specific types. The input is a 3-dimensional image; the hidden layers can be pooling, convolutional, fully connected, normalization or ReLU (rectified linear unit) layers; and the output is usually a score for each target class (for classification problems). All of these are stacked together to form a convolutional neural network [1]. The layers in a CNN are made up of neurons that have associated weights and biases; certain hidden layers have activation functions and parameters; the last fully connected layer has a loss function; and the network is trained to learn visual features from the input image by optimizing the objective function and updating the weights. During the forward pass of training, the network predicts the output based on the current weights, and during backpropagation it computes the error and the gradient of the loss function; to minimize the error, the gradient is propagated from the output back through the network, updating the network weights.

[Figure 3.1: AlexNet illustration. The input is a 224 by 224 image that goes through several hidden layers before being classified by softmax in the final layer. The output from the last fully connected layer is a 1000-dimensional feature vector that goes to the classifier, and the output of the classifier is a score vector over the possible classes. Source: [2]]

Figure 3.1 shows an example image of AlexNet [34], which is a widely used CNN for classification problems. The input to the network is an image sample of fixed dimensions and the output is a probability score for each target class. The image goes through a series of hidden layers and passes through the classification layer, softmax, after the last fully connected layer. In general, all the layers take 3D activation maps as inputs and transform them into other activation maps by means of a differentiable function, representing higher-level features. In the first layer, the input is the raw pixels of an RGB image and the output is a number of activation maps obtained by convolving the image with the filters. The number and output dimensions of the activation maps depend on the number and the dimensions of the filters applied to the input during convolution, and the feature maps are independent of each other. The activation maps are generated when filters with fixed dimension, stride and padding are applied to the input. Stride refers to the amount by which a filter shifts during convolution, and padding refers to filling the input volume with some number (mainly zeros or ones) around the border to preserve the spatial dimensions. In the subsequent layers, the activation maps are given as inputs and the outputs are passed through a series of different types of layers to learn more complex features before being classified in the last layer. Other commonly used CNNs for image recognition tasks include VGGNet [54], ResNet [24] and GoogLeNet [56].
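To make the convolution arithmetic described above concrete, here is a minimal single-channel sketch (ours, purely illustrative, not the AlexNet implementation) showing how stride and zero padding determine the size of one activation map:

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """Naive single-channel 2D convolution illustrating stride and zero padding."""
    if pad:
        image = np.pad(image, pad)                 # zeros around the border
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1    # output height
    out_w = (image.shape[1] - kw) // stride + 1    # output width
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride : i * stride + kh,
                          j * stride : j * stride + kw]
            out[i, j] = np.sum(patch * kernel)     # one activation value
    return out

# A 224x224 input with an 11x11 filter and stride 4 (the geometry of AlexNet's
# first layer) yields one 54x54 activation map; a bank of filters yields a stack.
x = np.random.rand(224, 224)
k = np.random.rand(11, 11)
print(conv2d(x, k, stride=4).shape)  # (54, 54)
```

Each learned filter produces one such map, and stacking the maps of all filters gives the 3D activation volume passed to the next layer.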
3.1.1 Evolution of CNNs

Although the idea of convolutional neural networks (CNNs) dates back to the 1980s, they gained popularity from 2012 onwards through the dramatic success of AlexNet on the image recognition task [34]. The network was relatively simple, containing 5 convolutional layers, max-pooling layers, dropout layers, and 3 fully connected layers. This network achieved record-breaking performance on the ImageNet Dataset, with a top-5 error rate of 15.4%, and illustrated the benefits of techniques such as data augmentation and dropout for boosting performance. After this huge breakthrough in the computer vision community, researchers began to use CNNs for a whole range of different applications, and deeper networks came into the spotlight when VGGNet was proposed by Simonyan and Zisserman [54]. The network consists of 19 layers, but it only uses 3x3 filters with stride and pad of 1, along with 2x2 max-pooling layers with stride 2. This network achieved good performance on both image classification and localization tasks, and it was the first to emphasize the importance of deeper networks. Since then, CNNs have become deeper and deeper: GoogLeNet [56] and ResNet [24], proposed in 2015, have about 100 and 152 layers respectively. In parallel, R-CNN [20] and its extensions Fast R-CNN [19] and Faster R-CNN [46], Generative Adversarial Networks [21], Spatial Transformer Networks [29], etc. became widely successful for a wide variety of other applications [3].

3.2 Recurrent Neural Network (RNN)

A recurrent neural network is a neural network which is widely used for sequential problems such as speech recognition and image captioning. An RNN encodes temporal dependencies into the network and computes the hidden state values based on the current input as well as the previous step's hidden value. Figure 3.2 shows the basic structure of an RNN: a loop that unrolls with time to form a chain-like structure, where each unit A is identical and is connected to the next one, thus passing information along. However, in practice RNNs suffer from vanishing or exploding gradient problems. Similar to traditional neural networks, RNNs use non-linear activation functions such as tanh, whose gradients lie in the range [-1, 1], and during backpropagation the gradients are computed to learn the weights of the network. As the chain structure grows, the gradient becomes very small when it is computed with the chain rule and repeatedly multiplied by small numbers. The exploding gradient problem occurs when, for similar reasons, the gradient grows exponentially. These problems prevent plain RNNs from learning long-range temporal dependencies effectively.

In order to solve these problems, Long Short-Term Memory (LSTM) networks were proposed by Hochreiter and Schmidhuber [26]. LSTMs have special memory blocks, with internal gates, to store or forget information; the stored content is not overwritten at every step, allowing the network to remember values for a long time. Figure 3.3 shows the gates that are present inside a standard LSTM unit, the input block, and how they are operated. There are usually three types of gates inside an LSTM: the input gate controls the flow of information into the memory, the output gate controls the information flow from the memory to the network, and the forget gate controls which information to drop or keep in the memory. Different weight matrices are associated with each component of the unit to control the gate operations, and these weights are learned during training of the network. Figure 3.3 also shows that all the gates are controlled by sigmoid functions and that the output is connected back to the gates and the input block in a recurrent manner. There are different variations and extensions of LSTMs; for our work we have used the Bidirectional LSTM [50], which is described in Chapter 4.

[Figure 3.2: Block diagram of a simple RNN that unfolds with time, forming a chain structure. A refers to one unit of the RNN; x and h refer to the input and the output at a given time step. The hidden state values of one unit are passed to the next one sequentially, as represented by the arrows. Source: [5]]

[Figure 3.3: Building block of an LSTM unit. Each LSTM unit has three gates (input, output and forget), the input block and the output block. Different types of weights are associated with each component of the unit, and they are learned during training of the network. The gates are usually controlled by sigmoid functions. Source: [4]]
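To summarize the gating behaviour above in runnable form, the following NumPy sketch implements a single LSTM step. The fused weight layout and the random initialization are our own illustrative choices (Equation 4.3 in Chapter 4 gives the formal definition); the toy sizes match the 500 input features and 28 hidden nodes used later in Section 5.3:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [x_t; h_prev] to the four gate pre-activations."""
    z = np.concatenate([x_t, h_prev]) @ W + b      # shape: (4 * hidden,)
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    g = np.tanh(g)                                 # input modulation
    c_t = f * c_prev + i * g                       # memory cell update
    h_t = o * np.tanh(c_t)                         # hidden state
    return h_t, c_t

# Toy usage on a 24-frame sequence of 500-dim features with 28 hidden units.
rng = np.random.default_rng(0)
n_in, n_hid = 500, 28
W = rng.normal(scale=0.01, size=(n_in + n_hid, 4 * n_hid))
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for x in rng.normal(size=(24, n_in)):
    h, c = lstm_step(x, h, c, W, b)
```

A bidirectional LSTM simply runs a second, independent copy of this recurrence over the reversed sequence and combines the two hidden-state streams.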
Chapter 4: Our Method

Method overview. The input to our model is a sequence of images, along with the player bounding boxes and the homography matrix for each image. The bounding boxes of the players can be in any order, as our method does not require player trajectories. The output of the model is a group activity label for the entire sequence.

Our method has three stages: feature representation, feature aggregation, and event prediction. In the feature representation and aggregation stages, we extract and aggregate different types of features that represent different aspects of the game. In the event prediction stage, we use a bidirectional LSTM model to classify the sequence of events. Our main efforts are on integrating different types of features to improve the classification accuracy.

4.1 Individual Attributes

We use appearance features to model individual players. The appearance feature is extracted by the fc7 layer of AlexNet [34] using the sub-image of a player. We choose to use the pre-trained AlexNet (trained on the ImageNet object recognition task) because it has been successfully used in various computer vision tasks [17, 28] and we do not have any player-level annotations such as individual actions. The output of the CNN represents the appearance information of an individual player.

Interaction among the individuals is essential to determine group activities. We use max pooling of the individual players' features in a particular frame to incorporate player interactions. Max pooling is a widely accepted technique that has proved its effectiveness in a wide range of applications [28] [66]. We started by max pooling across all the players regardless of which team they are on. However, this makes little sense, because offending and defending players, or players on different teams, certainly behave and interact differently in any given sequence. This encouraged us to pool across the players by team. Since we did not have any annotation to distinguish between the teams, we applied a segmentation and thresholding approach followed by max pooling and concatenation, as shown in Figure 1.5.

For the segmentation task, we adopted the k-means clustering algorithm, which segments an image by color. The value of k is 2, which refers to the number of clusters we want: in our case, one cluster refers to the player and the other cluster refers to the background. The input to the model is a player bounding box, and the segmentation uses the jersey color of a particular player to separate him from the background. Since different games have different jersey colors worn by the players, we manually set a threshold for each game to classify a player into one of the two teams. For example, for black and white jersey colors, if the most frequent pixel value exceeded 70, it was usually the white team. This simple thresholding approach gave satisfactory results, as it did not matter for our task whether the color was correctly classified, as long as the players of the same team were correctly grouped together.
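A sketch of this segmentation-and-thresholding step is given below. The central-strip heuristic for picking the player cluster and all helper names are our assumptions; the threshold of 70 for a black/white jersey pair is the per-game constant described above:

```python
import numpy as np
from sklearn.cluster import KMeans

def team_of_player(crop_rgb, white_threshold=70):
    """Assign a player crop to team 0 (dark jersey) or 1 (white jersey)."""
    h, w, _ = crop_rgb.shape
    pixels = crop_rgb.reshape(-1, 3).astype(np.float32)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pixels)
    labels = labels.reshape(h, w)
    # Heuristic (our assumption): the cluster dominating the central region of
    # the box is the player; the border is mostly ice/background.
    center = labels[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4]
    player_cluster = np.bincount(center.ravel(), minlength=2).argmax()
    jersey = pixels[labels.ravel() == player_cluster]
    # Most frequent (rounded) intensity among the jersey pixels.
    intensity = np.bincount(jersey.mean(axis=1).astype(int), minlength=256).argmax()
    return int(intensity > white_threshold)

# Team-wise pooling of per-player fc7 features F (num_players, 4096), given
# teams = np.array([team_of_player(c) for c in crops]):
#   pooled = np.concatenate([F[teams == 0].max(axis=0), F[teams == 1].max(axis=0)])
```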
4.2 Contextual Information from the Image

We use deep features from the whole image to model the context information. In each frame, we use the fc7 activation of the fine-tuned AlexNet as the representation of the context. We have only fine-tuned the last two fully connected layers on our datasets and used pre-trained weights for the other layers. The intuition for adding this context is that some events can only be determined if we know the scene information. For example, if we consider the events dump in and dump out in ice hockey, they are almost the same except for the fact that they occur in different zones.

4.3 Homography for the Spatial Feature Representation

The homography matrix is a 3 by 3 matrix that can be applied to the input image pixels to get a warped version of the image. In other words, if two images are taken from the same camera but have different viewpoints, we can compute the homography matrix to find the feature correspondences between the images. Some widely used applications of homography estimation include camera calibration, 3D reconstruction, stereo vision, scene understanding, camera motion estimation, 3D modeling of objects, and image registration and rectification. The homography matrix can be computed for both static and moving cameras, and there is a wide range of literature proposed by researchers for computing the matrix [57] [23] [11] [18].

We took advantage of this homography matrix to incorporate the spatial arrangement of the players, which in turn adds context to our knowledge domain. As mentioned earlier, some events can only be distinguished in terms of the player location on the rink, i.e. the zone information. Consider Figure 1.2, where the relative player arrangement is very similar for two different events: dump in and dump out. Thus, looking at individual players or at the relative positions of the players does not help us in identifying the events. However, if we look at the spatial arrangement of the players with respect to each other as well as with respect to the rink positions, it is easier to distinguish those events. Moreover, humans look at the whole image to identify certain events rather than looking at only a fraction of the players. Keeping in mind that the camera is continuously moving, it is essential that we project the image coordinates to either the template or the world coordinate system, and the homography matrix allows us to do this conversion easily. Thus, given a homography matrix, our goal is to model the spatial player information that can be used as context features for our deep network.

4.3.1 Computing Features from the Homography Matrix

At first, we computed the template coordinates of each player in each frame using Equation 4.1, where p_i^t is the i-th player in frame t, H is the homography matrix, T is the template point and I is the image point. The image point corresponds to the bottom center of the player bounding box. The world coordinate is just a linear translation and scaling of the template coordinates, so in our case it did not matter whether we used the template or the world coordinate system; for visual convenience, we have used template coordinates. The template dimension of the ice hockey rink is 640x1440. In order to keep the same aspect ratio, we divided the template image into 4x9 bins, which gives us 36 bin centers in total. For each p_i we computed which 4 bin centers are its neighbors and what weight is associated with each center with respect to p_i. The neighbors of p_i are the bin centers that form a square around p_i. Equation 4.2 shows the weight computation for center c at time t; the weights are summed over the total number of players n associated with c. Figure 4.1 shows how we compute the weights for a center c_i with respect to a particular player. At first, we find the nearest centers around the player such that they form a square. Then we divide the square into smaller rectangles from the player to the center locations, and we measure the area of each smaller rectangle. Since the weights should be higher for centers that are closer to the player than for centers that are further away, we swap the areas as shown in the figure; for example, the weight for center C1 is equal to the area A1. For centers that do not have an associated player, the weights are 0. The closer a center is to p_i, the higher the weight for that center associated with p_i. The weights for each square are normalized to sum to 1, so the sum over all the centers equals the number of players in the particular frame. Note that since there are fewer players than the total number of bins, and players are usually clustered in a specific location of the rink, we get a very sparse matrix of weights.

[Figure 4.1: Weight computation method for a player and his associated centers.]
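Putting the projection of Equation 4.1 and the bilinear "swapped area" weighting together, a NumPy sketch of the homography feature computation might look as follows. The exact bin-center placement and the clipping at the rink boundary are our assumptions:

```python
import numpy as np

RINK_H, RINK_W = 640, 1440          # template dimensions
ROWS, COLS = 4, 9                   # 4 x 9 grid -> 36 bin centers

def to_template(H, image_xy):
    """Equation 4.1: map an image point (bottom center of a player box)
    to template coordinates with homography H (3x3)."""
    x, y = image_xy
    p = H @ np.array([x, y, 1.0])
    return p[:2] / p[2]             # homogeneous normalization

def homography_features(H, boxes):
    """Accumulate bilinear bin-center weights for all players in one frame.

    boxes: list of (x, y, w, h) player bounding boxes in pixel coordinates.
    Returns a (4, 9) weight map whose entries sum to the number of players.
    """
    cell_h, cell_w = RINK_H / ROWS, RINK_W / COLS
    W = np.zeros((ROWS, COLS))
    for (x, y, w, h) in boxes:
        tx, ty = to_template(H, (x + w / 2.0, y + h))   # bottom center of the box
        # Continuous position in "center" coordinates (centers at cell midpoints).
        cy = np.clip(ty / cell_h - 0.5, 0, ROWS - 1)
        cx = np.clip(tx / cell_w - 0.5, 0, COLS - 1)
        r0, c0 = int(cy), int(cx)
        r1, c1 = min(r0 + 1, ROWS - 1), min(c0 + 1, COLS - 1)
        fy, fx = cy - r0, cx - c0
        # Bilinear weights: nearer centers get the larger ("swapped") areas;
        # the four weights for each player sum to 1.
        W[r0, c0] += (1 - fy) * (1 - fx)
        W[r0, c1] += (1 - fy) * fx
        W[r1, c0] += fy * (1 - fx)
        W[r1, c1] += fy * fx
    return W
```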
The next step is to combine the weight matrices with the deep features. This process is not straightforward, because the weights computed from the homography are hand-crafted features. We constructed our own late fusion model to fuse these features together.

    T_{p_i^t} = H · I_{p_i^t}                                (4.1)

    W_c^t = Σ_{p=1}^{n} weight(p)                            (4.2)

4.4 Fusion Model for Combining Homography Features and Frame Features

Inspired by Karpathy et al. [31], we have used a late fusion approach to incorporate the contextual cues. We added fully connected layers to reduce all the feature vectors to similar dimensions and then performed late fusion on top. The model architecture is shown in Figure 1.4. We first fine-tuned AlexNet so that the output from the fc7 layer is reduced to 500 dimensions. The homography features were also passed through an fc layer to produce a feature vector of 500 dimensions, and the bounding box features were reduced to 1000 dimensions. These feature vectors were then concatenated and normalized before passing through two more fc layers. The dimensions of the feature vectors and the choice of the different layers were found empirically.

4.5 Temporal Model

We use a bidirectional Long Short-Term Memory (BLSTM) network [50] to model the temporal information of the images. At every timestep t, the basic LSTM includes a hidden unit h_t, an input gate i_t, a forget gate f_t, an output gate o_t, an input modulation gate g_t and a memory cell c_t. The LSTM formulation can be represented by the following equations:

    i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + b_i)
    f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + b_f)
    o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + b_o)
    g_t = φ(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
    h_t = o_t ⊙ φ(c_t)                                       (4.3)

where the W terms denote weight matrices (e.g., W_{xi} is the matrix of weights from the input to the input gate), the b terms are bias vectors, σ is the logistic sigmoid function, φ is the tanh function, and ⊙ is the element-wise product. The BLSTM differs from the LSTM by using two independent hidden layers, a forward one and a backward one: the input sequence is fed to the two hidden layers in opposite temporal directions, and both contribute to the output layer. We also tried more advanced LSTMs such as the LSTM with peephole connections [48]; there was no performance gain compared with the basic BLSTM model.

In this representation, the group dynamics evolve over time, and the event that occurs at a frame can be determined based on the hidden state computations from the preceding and future time steps as well as the current input x_t.
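The forward pass of this fusion stage can be sketched as below. The 8192-dimensional team-pooled vector (two teams times 4096), the ReLU non-linearity, the L2 normalization and the 500-dimensional output are our assumptions for illustration; Figure 1.4 specifies only the 500/500/1000-dimensional projections and the two fc layers on top:

```python
import numpy as np

def fc(x, W, b, relu=True):
    """A fully connected layer; ReLU on hidden layers is our assumption."""
    y = x @ W + b
    return np.maximum(y, 0.0) if relu else y

def late_fusion(frame_fc7, homog_feat, player_pooled, params):
    """Late fusion of the three feature streams.

    frame_fc7:     (4096,) AlexNet fc7 of the whole frame   -> 500 dims
    homog_feat:    (36,)   flattened 4x9 homography map     -> 500 dims
    player_pooled: (8192,) team-pooled player features      -> 1000 dims
    params:        dict mapping layer names to learned (W, b) pairs.
    """
    f = fc(frame_fc7, *params["frame"])         # (500,)
    g = fc(homog_feat, *params["homog"])        # (500,)
    p = fc(player_pooled, *params["players"])   # (1000,)
    z = np.concatenate([f, g, p])               # (2000,)
    z = z / (np.linalg.norm(z) + 1e-8)          # normalization before fusion layers
    z = fc(z, *params["fuse1"])
    return fc(z, *params["fuse2"], relu=False)  # e.g. 500 dims, fed to the BLSTM
```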
Chapter 5: Experiments

We have conducted experiments on two challenging sports datasets: an Ice Hockey Dataset provided by SPORTLOGiQ (http://sportlogiq.com/) and the publicly available Volleyball Dataset [28]. Our model can be divided into three distinct phases: feature extraction, feature aggregation, and learning and inference. We have used Matlab and TensorFlow [6] for all our experiments.

5.1 Evaluation Metric

We report both qualitative and quantitative results for the two datasets. The quantitative measures include the prediction accuracy with some tolerance, the recall and precision per class, and the confusion matrix. The prediction accuracy is a single-value measurement for prediction evaluation. The output of all our temporal baselines and of the final model is a probability distribution across all the classes. The general convention is to select the class having the maximum probability as the predicted class; this is known as selecting the top-1 value. If either the class having the highest probability or the one having the second highest probability matches the target class, we call it a top-2 prediction. In our case, we found that there is a huge performance gain if we choose the top-2 values. However, we only have a very small number of target classes, so top-2 alone would not be a very reliable measurement of the performance of our model. We did a short analysis of the probabilities the model predicts and found that in most cases the model is highly confident about two predictions, with a small probability difference between them, while the rest of the probabilities are very small numbers; this observation was particularly true for the minority classes that had very few training samples. This motivated us to pick the top-2 predictions with a small tolerance value: if the difference between the top-2 predictions is within some tolerance value and either of them matches the true class label, we consider it an accurate prediction. This is done because the Ice Hockey Dataset is highly imbalanced and, in many confusing cases, the model predicts two classes with very high probability; in those cases, the highest predicted probability is very close to the second highest. The tolerance value is set very low to ensure that we consider only those cases where the model is very confident about two classes; this also eliminates cases where the second prediction has a very low probability. For both datasets, we report accuracies for tolerances of 0.0 and 0.15 for a fair comparison with previous work. For the non-temporal baselines that use an SVM classifier, the tolerance is 0.0, i.e. a prediction is correct only if the predicted class matches the true class, because the output of the SVM is binary.

The confusion matrix helps to visualize the per-class recall and precision. This metric is particularly helpful for the Ice Hockey Dataset because the number of samples per class is highly imbalanced: if the model always predicts the majority class, the overall accuracy may be very high, but the result would not reflect the true performance of the system. With the help of the confusion matrix, however, it is easy to understand how the model performs with respect to each class. In our confusion matrices, the rightmost column shows the precision per class and the bottom row shows the recall per class, in percentages. In general, recall measures the completeness of a model and precision measures its exactness; high recall and precision are desired for any model, and together they are often a more reliable metric than the overall accuracy. In order to visualize the convergence of the network, we also include the training cost and accuracy curves.
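The following sketch states this accuracy rule precisely (a minimal interpretation of the procedure above; the function and variable names are ours):

```python
import numpy as np

def correct_with_tolerance(probs, target, tol=0.15):
    """Top-2 prediction with a tolerance, as described in Section 5.1.

    probs:  (num_classes,) predicted class probabilities.
    target: integer index of the true class.
    Correct if the top-1 class matches, or if the top-2 class matches and the
    top two probabilities differ by at most tol.
    """
    top2 = np.argsort(probs)[::-1][:2]
    if top2[0] == target:
        return True
    return probs[top2[0]] - probs[top2[1]] <= tol and top2[1] == target

# Example: the model is torn between two classes (typical for minority events).
p = np.array([0.02, 0.45, 0.41, 0.07, 0.05])
print(correct_with_tolerance(p, target=2, tol=0.15))  # True
print(correct_with_tolerance(p, target=2, tol=0.0))   # False
```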
Chapter 5

Experiments

We have conducted experiments on two challenging sports datasets: an Ice Hockey Dataset provided by SPORTLOGiQ (http://sportlogiq.com/) and a publicly available Volleyball Dataset [28]. Our model can be divided into three distinct phases: feature extraction, feature aggregation, and learning and inference. We have used Matlab and TensorFlow [6] for all our experiments.

5.1 Evaluation Metric

We report qualitative and quantitative results for both datasets. The quantitative measures include prediction accuracy with some tolerance, recall and precision per class, and the confusion matrix. The prediction accuracy is a single-value measurement for prediction evaluation. The output of all our temporal baselines and the final model is a probability distribution across all the classes. The general convention is to select the class having the maximum probability as the predicted class; this is known as selecting the top-1 value. If either the class with the highest or the class with the second highest probability matches the target class, we call it a top-2 prediction. In our case, we found a large performance gain when choosing the top-2 values. However, we have only a very small number of target classes, so top-2 alone would not be a very reliable measurement of our model's performance. Analyzing the probabilities the model predicts, we found that in most cases the model is highly confident about two predictions, with a small probability difference between them, while the remaining probabilities are very small. This observation was particularly true for the minority classes that had very few training samples. This motivated us to use top-2 predictions with a small tolerance value: if the difference between the top-2 predictions is within some tolerance value and either of them matches the true class label, we consider the prediction accurate. This is done because the Ice Hockey Dataset is highly imbalanced and, in many confusing cases, the model predicts two classes with very high probability, so the highest predicted probability is very close to the second highest. The tolerance value is set very low to ensure that we consider only those cases where the model is very confident about two classes; this also eliminates cases where the second prediction has very low probability. For both datasets, we report the accuracies for tolerances of 0.0 and 0.15 for a fair comparison with the previous work. For the non-temporal baselines that use an SVM classifier, the tolerance is 0.0, i.e., a prediction is correct only if the predicted class matches the true class, because the output of the SVM is binary. The confusion matrix helps to visualize the per-class recall and precision. This metric is particularly helpful for the Ice Hockey Dataset because the number of samples per class is highly imbalanced: if the model always predicts the majority class, the overall accuracy may be very high, but the result would not reflect the true performance of the system. With the help of the confusion matrix, it is easy to understand how the model performs with respect to each class. In our confusion matrices, the rightmost column shows the precision per class whereas the bottom row shows the recall per class, in percentages. In general, recall measures the completeness of a model and precision measures its exactness. High recall and precision are desired for any model, and together they are often a more reliable metric than the overall accuracy. In order to visualize the convergence of the network, we have also included the training cost and accuracy curves.
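The tolerance-based accuracy can be summarized by the following sketch (NumPy; the function name and exact tie-handling are our own illustrative choices):

```python
import numpy as np

def accuracy_with_tolerance(probs, labels, tol=0.15):
    """probs: (N, C) softmax outputs; labels: (N,) true class indices.

    A prediction counts as correct if the top-1 class matches the label,
    or if the top-2 class matches and the top-1/top-2 probabilities differ
    by no more than `tol`. With tol=0.0 this reduces to top-1 accuracy.
    """
    order = np.argsort(probs, axis=1)            # ascending order of probability
    top1, top2 = order[:, -1], order[:, -2]
    idx = np.arange(len(labels))
    p1, p2 = probs[idx, top1], probs[idx, top2]
    correct = (top1 == labels) | ((top2 == labels) & (p1 - p2 <= tol))
    return correct.mean()
```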
5.2 Baselines

We considered the following baseline models for the evaluation of the datasets:

1. Frame-level Classification with CNN (M1): This baseline extracts frame-level features from the target frames and classifies the event for the target frames using a Support Vector Machine (SVM).

2. Person-level Classification with CNN (M2): This baseline first extracts player-level features from the target frames, then max pools across the teams and classifies events for the target frames using an SVM.

3. Frame-level Temporal Model (M3): This is an extension of the first baseline (M1). Instead of using the target frames and SVM classification, this method feeds the frame-level features from the whole sequence into an LSTM to classify events for the whole sequence.

4. Person-level Temporal Model (M4): This is an extension of the second baseline (M2). It feeds the player-level features from a sequence of images into an LSTM to classify events for the whole sequence.

5. Frame-level Classification with fine-tuned CNN (M5): This baseline is similar to M1, but we fine-tuned AlexNet using the target frame events and classified using softmax.

6. Frame-level Temporal Model with fine-tuned CNN (M6): This baseline is the same as M3, but it uses fine-tuned AlexNet features rather than pre-trained features.

7. Frame and Person Fusion Model (M7): This baseline fine-tunes AlexNet and adds a late fusion on top of it to combine the person-level and the frame-level features. The classification is done using softmax.

8. Frame and Person Fusion Model with LSTM (M8): This baseline is a temporal extension of M7. The features from the fusion model are passed into an LSTM for temporal classification.

9. Our non-temporal method (M9): This baseline fine-tunes AlexNet and adds a late fusion on top of it to combine the person-level, the frame-level and the homography features. It does not encode any temporal information. The prediction is done using softmax.

We also compare our method with the C3D [58] network. The C3D network is pre-trained on the UCF101 action recognition dataset [55] and fine-tuned on our dataset. For the Volleyball Dataset, we additionally report the previous state-of-the-art results.

5.3 Implementation Details

We extracted deep features from the player bounding boxes in Matlab using an AlexNet pre-trained on ImageNet for the object recognition task. We also used Matlab for extracting the homography features from the homography matrices. All other experiments were conducted using the TensorFlow [6] framework. For the image feature extraction, we used the AlexNet fine-tuned on our datasets if not specified otherwise. For the prediction model, we used either a support vector machine (SVM) or a softmax classifier. Our LSTM network consists of 28 hidden nodes and 500 input features, and optimizes a weighted softmax cross entropy loss function. The weight for each class is inversely proportional to the frequency of the samples per class in the dataset, normalized to sum to 1. We used a learning rate of 0.0000005, a batch size of 256, 50% dropout, batch normalization and the Adam optimizer [32]. Only the predictions at the states that correspond to the target frames are used as the classification probabilities, and the loss is computed only on the target frames during training. Figures 5.2 and 5.3 show the training loss and accuracy, respectively, as a function of time for the Ice Hockey Dataset.

For the Volleyball Dataset, we used a batch size of 500. That dataset is not imbalanced like the Ice Hockey Dataset, so we used a regular softmax cross entropy loss function instead of the weighted loss. The other hyper-parameters were the same as for ice hockey. Figures 5.6 and 5.7 show the training loss and accuracy, respectively, as a function of time for the Volleyball Dataset.
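Returning to the weighted loss used for ice hockey: as a sketch, the class weights can be computed as follows. The event counts are those of Table 5.1, and the per-example weighting shown is an illustrative reading of the scheme rather than the exact training code.

```python
import numpy as np

# Training examples per event (Table 5.1): LPR, pass, shot, dump in, dump out.
counts = np.array([1412, 1688, 346, 219, 301], dtype=np.float64)

# Weights inversely proportional to class frequency, normalized to sum to 1.
inv = 1.0 / counts
weights = inv / inv.sum()

def weighted_xent(probs, label):
    """Weighted softmax cross entropy for one (softmax output, true label) pair."""
    return -weights[label] * np.log(probs[label])
```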
The accuracies for both tolerance 0.0 and29Event Description ExamplesLoose puck recovery(LPR)The player recoveredthe puck as it was out ofpossession of any player1,412Pass The player attempts apass to a teammate1,688Shot A player shoots on goal 346Dump in When a player sends thepuck into the offensivezone219Dump out When a defendingplayer dumps the puckup the boards withouttargeting a teammate fora pass301Table 5.1: Event descriptions and corresponding number of training exam-ples in the Ice Hockey Dataset.Baseline DescriptionsM1 Frame features from pre-trained AlexNet classified with SVMM2 Person features from pre-trained AlexNet classified with SVMM3 Temporal extension of M1 classified with softmaxM4 Temporal extension of M2 classified with softmaxM5 Frame features from fine-tuned AlexNet classified with softmaxM6 Temporal extension of M5 classified with softmaxM7 Fusion of frame and person features classified with softmaxM8 Temporal extension of M7 classified with softmaxTable 5.2: The descriptions of the baselines.30Figure 5.2: Training loss over time for the Ice Hockey Dataset.Figure 5.3: Training accuracy over time for the Ice Hockey Dataset.31Figure 5.4: Confusion matrix for top-1 event prediction with a tolerance of0.0 on the Ice Hockey Dataset. The rightmost column shows the pre-cision and the bottom row shows the recall per class in percentages.The green values represent the percentages of the correct predictions bythe model whereas the red values show the percentages of the incorrectpredictions. For example, recall for lpr is 39% and the its precision is46.6% which means out of all the lpr events, 39% were correctly clas-sified and out of all the lpr predictions made by the model, 46.6% werecorrect. The blue rectangle in the bottom right corresponds to the overallaccuracy.32Figure 5.5: Confusion matrix for top-2 event prediction with a tolerance of0.15 on the Ice Hockey Dataset. The rightmost column shows the preci-sion and the bottom row shows the recall per class in percentages. Thegreen values represent the percentages of the correct predictions by themodel whereas the red values show the percentages of the incorrect pre-dictions. For example, recall for lpr is 74.3% and the its precision is78% which means out of all the lpr events, 74.3% were correctly clas-sified and out of all the lpr predictions made by the model, 78% werecorrect. The blue rectangle in the bottom right corresponds to the overallaccuracy.33Method Fine-tuning Acc(%)Tol=0.0 Acc(%)Tol=0.15M1 w/o 32.4 -M2 w/o 30.5 -M3 w/o 40.1 58M4 w/o 35.4 65.6M5 w 46.9 75.8M6 w 40.4 60M7 w 48.4 79.7M8 w 40.0 62.6C3D [58] w 44.0 -Our non-temporal method w 45.7 77.3Our temporal method w 44.4 70.4Table 5.3: Performance of our model on Ice Hockey Dataset compared to thebaselines. In the fine-tuning column, w/o and w represent without andwith fine-tuning, respectively.Event Recall (%)Tol=0.0Precision (%)Tol=0.0Recall (%)Tol=0.15Precision (%)Tol=0.15LPR 39 46.6 74.3 78Pass 61.9 49.7 84.7 71.4Shot 23.1 20 30.8 33.3Dump in 5.6 10 16.7 33.3Dump out 12.5 20 31.2 55.6Average 28.42 29.26 47.54 54.32Table 5.4: Per class recall and precision of our model on the Ice HockeyDataset.tolerance 0.15 show similar patterns.If we compare the first six baselines with a tolerance of 0.0, the frame levelfeatures work better than the player level features. For example, M1, M3 and M5outperforms M2, M4 and M6. Adding frame level features is important becauseevents such as dump in and dump out can possibly be distinguished only by zoneinformation. 
5.4.2 Tracking Data

The dataset has annotations for the player bounding boxes in each frame. The bounding boxes are annotated in the x, y, width, height format to mark the location of each player in the pixel coordinate system. However, the referee bounding box is also annotated in most of the frames, and there is no annotation to distinguish the referee from the players. As a result, features from the referee are also included when we extract the deep features from the player bounding boxes. Some of the bounding box annotations are also very noisy, especially in cases of occlusion or when players are very close to each other; in these cases, one big bounding box surrounds multiple players. In our model, we only used the bounding box coordinates of all the annotated instances to crop the players out of the frame and extract deep features that incorporate their appearance information. We have not used any other tracking information.

5.4.3 Homography Transformation

The dataset contains, for each frame, the homography matrix that projects the video frame onto the rink template. We used these matrices to calculate the homography features described in Chapter 4. Figure 5.1 shows some examples of this transformation; the players in each frame are denoted by the small red dots in the template image.

Figure 5.1: The transformation from the image to the template coordinate system by the homography matrix.
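A minimal sketch of this projection in NumPy is shown below. Here H is the provided 3x3 frame-to-template homography; reducing a bounding box to its bottom-center point is our assumption for illustration.

```python
import numpy as np

def project_to_template(H, points):
    """Project Nx2 pixel coordinates onto the rink template with a 3x3 homography."""
    pts = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coordinates
    mapped = pts @ H.T                                    # apply the homography
    return mapped[:, :2] / mapped[:, 2:3]                 # divide out the scale

# Example: map a player's bottom-center bounding box point (x, y, w, h assumed given).
# template_xy = project_to_template(H, np.array([[x + w / 2.0, y + h]]))
```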
5.4.4 Quantitative Results

Table 5.3 shows the classification accuracy of our models and the baselines. Our non-temporal method outperforms all the baselines except M5 and M7 when the tolerance is 0.0, and all but M7 when the tolerance is 0.15. The temporal extension also produces competitive results. The accuracies for tolerance 0.0 and tolerance 0.15 show similar patterns.

Baseline   Description
M1         Frame features from pre-trained AlexNet classified with SVM
M2         Person features from pre-trained AlexNet classified with SVM
M3         Temporal extension of M1 classified with softmax
M4         Temporal extension of M2 classified with softmax
M5         Frame features from fine-tuned AlexNet classified with softmax
M6         Temporal extension of M5 classified with softmax
M7         Fusion of frame and person features classified with softmax
M8         Temporal extension of M7 classified with softmax

Table 5.2: The descriptions of the baselines.

Figure 5.2: Training loss over time for the Ice Hockey Dataset.

Figure 5.3: Training accuracy over time for the Ice Hockey Dataset.

Figure 5.4: Confusion matrix for top-1 event prediction with a tolerance of 0.0 on the Ice Hockey Dataset. The rightmost column shows the precision and the bottom row shows the recall per class in percentages. The green values represent the percentages of the correct predictions by the model whereas the red values show the percentages of the incorrect predictions. For example, the recall for lpr is 39% and its precision is 46.6%, which means that out of all the lpr events, 39% were correctly classified, and out of all the lpr predictions made by the model, 46.6% were correct. The blue rectangle in the bottom right corresponds to the overall accuracy.

Figure 5.5: Confusion matrix for top-2 event prediction with a tolerance of 0.15 on the Ice Hockey Dataset. The rightmost column shows the precision and the bottom row shows the recall per class in percentages. The green values represent the percentages of the correct predictions by the model whereas the red values show the percentages of the incorrect predictions. For example, the recall for lpr is 74.3% and its precision is 78%, which means that out of all the lpr events, 74.3% were correctly classified, and out of all the lpr predictions made by the model, 78% were correct. The blue rectangle in the bottom right corresponds to the overall accuracy.

Method                    Fine-tuning   Acc (%) Tol=0.0   Acc (%) Tol=0.15
M1                        w/o           32.4              -
M2                        w/o           30.5              -
M3                        w/o           40.1              58
M4                        w/o           35.4              65.6
M5                        w             46.9              75.8
M6                        w             40.4              60
M7                        w             48.4              79.7
M8                        w             40.0              62.6
C3D [58]                  w             44.0              -
Our non-temporal method   w             45.7              77.3
Our temporal method       w             44.4              70.4

Table 5.3: Performance of our model on the Ice Hockey Dataset compared to the baselines. In the fine-tuning column, w/o and w represent without and with fine-tuning, respectively.

Event      Recall (%) Tol=0.0   Precision (%) Tol=0.0   Recall (%) Tol=0.15   Precision (%) Tol=0.15
LPR        39                   46.6                    74.3                  78
Pass       61.9                 49.7                    84.7                  71.4
Shot       23.1                 20                      30.8                  33.3
Dump in    5.6                  10                      16.7                  33.3
Dump out   12.5                 20                      31.2                  55.6
Average    28.42                29.26                   47.54                 54.32

Table 5.4: Per-class recall and precision of our model on the Ice Hockey Dataset.

If we compare the first six baselines with a tolerance of 0.0, the frame-level features work better than the player-level features: M1, M3 and M5 outperform M2, M4 and M6 respectively. Adding frame-level features is important because events such as dump in and dump out can possibly be distinguished only by zone information. Another observation is that fine-tuned frame features perform better than the pre-trained frame features. AlexNet is pre-trained on an image recognition task in which sample images usually have an object in the middle of the frame; applied to our problem, the pre-trained model is therefore more suitable for extracting player-level features, since players are centered in their crops, whereas a whole frame contains multiple players and large areas of background, making the pre-trained AlexNet less suitable. That is why we fine-tuned AlexNet for frame-level classification using the target frame events. M5 and M6 show that the fine-tuned frame features outperform both M1 and M3. However, fine-tuning the bounding box features was challenging because we do not have person-level action annotations.

Another interesting observation is that although M3 and M4 perform better than M1 and M2 after encoding the temporal information, in all other cases the non-temporal versions perform significantly better than their temporal extensions; for example, M7 performs better than M8. When we dug deeper, we found that although the overall accuracy is higher for the non-temporal models, these models always predicted either the lpr or the pass event and never predicted the rest. When we encode temporal knowledge, the per-class accuracy is much better. This indicates that the features evolve over time as they approach the target events, making it easier for the model to predict the minority classes. When we provide only the target frames as inputs to our models, they find it difficult to separate the less frequent classes from the more frequent ones in the training set.

Next, we compare all the fusion models and C3D. Our non-temporal and temporal models perform better than M8 and C3D. The accuracy of M7, which fuses frame features with player features, is the highest among all the baselines. However, the confusion matrices show that the per-class recall and precision were best for the temporal model in which we fused homography, player and frame features, although its overall accuracy is lower.

Figures 5.4 and 5.5 show the confusion matrices of the events. Figure 5.4 shows that the model performs quite well in classifying lpr and pass but it fails seriously on the other three events. Looking at columns 3-5 and rows 1-2 of the matrix, it is clearly seen that these three events are mostly misclassified as lpr or pass; for example, 11 and 5 dump in events are misclassified as pass and lpr events, respectively. This is because the training dataset is highly imbalanced and the less frequent events perform poorly compared to the dominant events.

However, Figure 5.5, which considers the top-2 predictions with a very small probability difference, shows an overall performance gain of approximately 60%, and the per-class performance also improves significantly. Although the classification rate is still not very high, a comparison between the two confusion matrices shows that class imbalance is a major concern for this dataset: it forces the model to always predict the majority classes, even when it is highly confident about other classes. The class imbalance is present because we considered complete games of ice hockey video rather than handpicking certain clips and, in any ice hockey game, some events will always occur more frequently than others. One solution could be to ignore the majority events in certain games and handpick the minority classes, but that would be possible only if we had many available games in the dataset. Another way could be to add augmented clips, which is not as trivial as adding augmented images.

Table 5.4 shows the recall and precision values for each class. It is again seen that lpr and pass have much higher figures than the rest of the events. It is also seen that slightly increasing the tolerance produces a large shift in performance for all the classes.

5.4.5 Qualitative Results

Figures 5.10, 5.11 and 5.12 show the qualitative results of our model for event prediction on the Ice Hockey Dataset. Figure 5.10 shows the correct predictions with a tolerance of 0.0; Figure 5.11 shows the predictions that were wrong at tolerance 0.0 but corrected at tolerance 0.15, together with the change in probability between the two predictions; and Figure 5.12 shows the failed predictions. Our model successfully identified some challenging scenarios, such as the blurred image seen in Figure 5.10. On the other hand, Figure 5.11 shows that our model fails to predict the less frequent events with the highest confidence, although the prediction was accurate for some difficult cases when we consider the top-2 predictions.

Figure 5.10: Qualitative results. The correct predictions made by our model with a tolerance of 0.0 on the Ice Hockey Dataset.

Figure 5.11: Qualitative results. The correct predictions made by our model with a tolerance of 0.15 on the Ice Hockey Dataset. True refers to the actual class, Pred 1 and Pred 2 refer to the prediction with a tolerance of 0.0 and 0.15 respectively, and δ refers to the difference in probabilities between the two predictions.

Figure 5.12: Qualitative results. The images show the cases where our model failed on the Ice Hockey Dataset.

5.5 Volleyball Dataset

We have used the publicly available Volleyball Dataset provided by Ibrahim et al. [28]. The dataset contains annotations for individual players as well as group activities for 4830 randomly picked frames from 55 different games from YouTube. We have only used the event annotations and the bounding boxes of the players in our experiments. We used 5 frames before and 4 frames after the target frame for the temporal sequence. Since the homography matrix is not available for this dataset, we conducted our experiments without the homography features. Table 5.5 shows the event descriptions in the Volleyball Dataset.

Event      Description
Set        The setter, located in the center front, hits the ball high above the net so that a spiker can spike it across
Spike      When an offensive player attacks the ball with a one-arm motion done over the head, attempting to get a kill
Pass       Receiving a serve or the first contact of the ball, with the intent to control the ball to another player
Winpoint   When a team scores

Table 5.5: Event descriptions in the Volleyball Dataset. In the original dataset and all our experiments, these events are further classified into right and left, where right always refers to the team on the right side of the net and left refers to the team on the left side. For example, r-pass refers to a pass event occurring on the right side.
5.5.1 Quantitative Results

Table 5.7 summarizes the accuracy of all the experiments on the Volleyball Dataset. For this dataset, we could only compare against the baseline models and the previous state of the art because we did not have the homography matrices. As with the ice hockey results, a similar pattern in performance is observed for both tolerance 0.0 and tolerance 0.15.

The model proposed by Bagautdinov et al. [8] outperforms all the baselines by a significant margin. On the other hand, M2 performs surprisingly better than most other baselines, although it encodes only bounding box features and has no temporal knowledge. This shows the large benefit of team pooling in volleyball games, mostly because the teams are well defined and separated in volleyball. Another interesting finding is that fine-tuning with only the frame features gave a very low accuracy, but when we combined the fine-tuned frame features with the player features in M8, the accuracy was much higher.
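Team pooling, as used in M2 and in the person-level inputs of our fusion model, can be sketched as follows (NumPy); how players are assigned to teams is dataset-specific and assumed to be given here.

```python
import numpy as np

def team_pool(player_feats, team_ids):
    """Max-pool player features within each team, then concatenate.

    player_feats: (P, D) deep features for the P players in a frame.
    team_ids: (P,) 0/1 team assignment per player (assumes both teams visible).
    """
    pooled = [player_feats[team_ids == t].max(axis=0) for t in (0, 1)]
    return np.concatenate(pooled)  # (2*D,) frame-level person representation
```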
Baseline   Description
M1         Frame features from pre-trained AlexNet classified with SVM
M2         Person features from pre-trained AlexNet classified with SVM
M3         Temporal extension of M1 classified with softmax
M4         Temporal extension of M2 classified with softmax
M5         Frame features from fine-tuned AlexNet classified with softmax
M6         Temporal extension of M5 classified with softmax
M7         Fusion of frame and person features classified with softmax
M8         Temporal extension of M7 classified with softmax

Table 5.6: The descriptions of the baselines.

Method                           Fine-tuning   Acc (%) Tol=0.0   Acc (%) Tol=0.15
M1                               w/o           37.7              -
M2                               w/o           47.6              -
M3                               w/o           38.1              60.1
M4                               w/o           39                61.3
M5                               w             37                58
M6                               w             34                61.4
M7                               w             35.6              57.2
M8                               w             47.5              69.2
C3D [58]                         w             74                -
Two-stage Hierarchical Model [28]  w           51.1              -
Social Scene Understanding [8]   w             89.9              -

Table 5.7: Performance of our model on the Volleyball Dataset compared to the baselines and previous work. w and w/o refer to with and without fine-tuning, respectively.

Table 5.8 shows the per-class recall and precision results when we applied M8 to the Volleyball Dataset. Unlike the Ice Hockey Dataset, the Volleyball Dataset does not have any imbalance among the classes, and the results show that the per-class performance is similarly balanced. Only for the r-winpoint event did we find a low accuracy; however, when we increased the tolerance value to 0.15, the precision for this event jumped to the highest, with a notable increase in recall compared to all other events. Analyzing the confusion matrix shows that most r-winpoints are confused with l-winpoints.

Event        Recall (%) Tol=0.0   Precision (%) Tol=0.0   Recall (%) Tol=0.15   Precision (%) Tol=0.15
r-set        41.7                 51.3                    59.9                  70.9
r-spike      51.2                 43.2                    67.3                  63.1
r-pass       51.5                 46.1                    70.6                  63.7
r-winpoint   15.9                 36.1                    68.3                  80
l-winpoint   53.1                 46.4                    69.8                  72.8
l-pass       44.5                 45                      71.4                  69.5
l-spike      59.3                 52.5                    77.4                  70.3
l-set        48.8                 51.6                    68.1                  73.4
Average      45.75                46.525                  69.1                  70.46

Table 5.8: Per-class recall and precision of our model on the Volleyball Dataset.

Figures 5.8 and 5.9 show the confusion matrices of our model on the Volleyball Dataset. There was a significant performance gain of 68% when we increased the tolerance value. The classification accuracy, recall and precision were highly balanced across all the events, unlike the Ice Hockey Dataset.

Figure 5.6: Training loss over time for the Volleyball Dataset.

Figure 5.7: Training accuracy over time for the Volleyball Dataset.

Figure 5.8: Confusion matrix for top-1 event prediction with a tolerance of 0.0 on the Volleyball Dataset. The rightmost column shows the precision and the bottom row shows the recall per class in percentages. The green values represent the percentages of the correct predictions by the model whereas the red values show the percentages of the incorrect predictions. For example, the recall for r-set is 41.7% and its precision is 51.3%, which means that out of all the r-set events, 41.7% were correctly classified, and out of all the r-set predictions made by the model, 51.3% were correct. The blue rectangle in the bottom right corresponds to the overall accuracy.

Figure 5.9: Confusion matrix for top-2 event prediction with a tolerance of 0.15 on the Volleyball Dataset. The rightmost column shows the precision and the bottom row shows the recall per class in percentages. The green values represent the percentages of the correct predictions by the model whereas the red values show the percentages of the incorrect predictions. For example, the recall for r-set is 59.9% and its precision is 70.9%, which means that out of all the r-set events, 59.9% were correctly classified, and out of all the r-set predictions made by the model, 70.9% were correct. The blue rectangle in the bottom right corresponds to the overall accuracy.

5.5.2 Qualitative Results

Figure 5.13 shows example images where our model predicted the correct events with a tolerance of 0.0. In the bottom left image, both teams are wearing red and white jerseys, and it is hard to distinguish between them if we consider only the player features; our model nevertheless identified the correct event, which shows the importance of encoding the frame features. The dataset has many such examples where the team jerseys are very similar, and our model successfully distinguished and classified them. Figure 5.14 shows scenarios where our model made a wrong prediction with a tolerance of 0.0 but a correct prediction with a tolerance of 0.15. There are several interesting observations in this figure. In most cases, the model was confused between left and right; for example, in the first image, a left spike was confused with a right spike. Moreover, the probability difference is very small in most cases, which shows that the model is highly confident about the true events but confused about the side of the court, because we do not have any mechanism to distinguish between the left and the right teams. The events considered for this dataset can be merged into 4 events and subdivided into left and right, so it is important to incorporate some mechanism to distinguish the team sides. Homography features could be an important cue to resolve this ambiguity, as they incorporate the spatial player arrangements into the model. Lastly, Figure 5.15 shows some examples where our model failed.

Figure 5.13: Qualitative results. The correct predictions made by our model with a tolerance of 0.0 on the Volleyball Dataset.

Figure 5.14: Qualitative results. The correct predictions made by our model with a tolerance of 0.15 on the Volleyball Dataset. True refers to the actual class, Pred 1 and Pred 2 refer to the prediction with a tolerance of 0.0 and 0.15 respectively, and δ refers to the difference in probabilities between the two predictions.

Figure 5.15: Qualitative results. The figures show the cases where our model failed on the Volleyball Dataset.

Chapter 6

Conclusion

In this thesis, we proposed a deep learning model to classify group activities in ice hockey. We have shown that feature aggregation is essential in determining the game events. We have looked at five puck possession events and shown that they can be classified without the need for explicit labeling of individual actions or puck information.
We have proposed a fusion model to combine deep features from different sources with shallow features. We have also applied our partial model to the Volleyball Dataset and obtained competitive results.

Our problem and solution are quite general, and there are many possible future extensions. Any type of feature can be given as input to our fusion model and combined to see how it influences the accuracy. We would like to do a more detailed analysis of the homography features, because the weights that we have computed might be informative about the possible puck location in the rink. Rather than using the puck location to classify the events, one possible extension could be to use the homography features and event knowledge to locate the puck. Currently we do not have the evaluation results of our full model on the Volleyball Dataset due to the lack of homography information. We would like to build the homography matrices for the volleyball games and extract the homography features to run our final model on this dataset.

We have used the AlexNet architecture [34] for the deep feature extraction, which is quite old. One important extension of our work would be to try more recent CNNs such as ResNet [24] or VGGNet [54] and compare them with AlexNet. Future work will also focus on incorporating player motion and 3D pose information into this model. These features might be essential cues for motion-based events such as pass and shot, or for determining the appearance of the players. In this thesis, we have looked at a subset of the many possible puck possession events; we would also like to classify other events such as goal and carry in. Deep learning models require a huge number of training examples and our dataset was quite small. We would like to add more games and increase the dataset size, especially for the less frequent classes, so that the dataset is more balanced. Other possible extensions include finding the motion and position of the hockey sticks with respect to the players and taking advantage of the gaze information of the players.

Bibliography

[1] Convolutional neural networks for visual recognition. URL http://cs231n.github.io/convolutional-networks/.
[2] Example image of AlexNet. URL https://ischlag.github.io/images/alexnet.png.
[3] Popular CNN architectures. URL https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html.
[4] Example image of a long short term memory network (LSTM). URL https://devblogs.nvidia.com/parallelforall/wp-content/uploads/2016/03/LSTM.png.
[5] Example image of a recurrent neural network (RNN). URL http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png.
[6] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[7] M. R. Amer, P. Lei, and S. Todorovic. HIRF: Hierarchical random field for collective activity recognition in videos. In European Conference on Computer Vision (ECCV), 2014.
[8] T. Bagautdinov, A. Alahi, F. Fleuret, P. Fua, and S. Savarese. Social scene understanding: End-to-end multi-person action localization and collective activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[9] F. Caba Heilbron, W. Barrios, V. Escorcia, and B. Ghanem.
SCC: Semantic context cascade for efficient action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[10] D. Cervone, A. D'Amour, L. Bornn, and K. Goldsberry. POINTWISE: Predicting points and valuing decisions in real time with NBA optical tracking data. In 8th Annual MIT Sloan Sports Analytics Conference, 2014.
[11] J. Chen and J. J. Little. Where should cameras look at soccer games: Improving smoothness using the overlapped hidden Markov model. Computer Vision and Image Understanding, 2017.
[12] J. Chen, H. M. Le, P. Carr, Y. Yue, and J. J. Little. Learning online smooth predictors for realtime camera planning using recurrent decision trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[13] W. Choi and S. Savarese. A unified framework for multi-target tracking and collective activity recognition. In European Conference on Computer Vision (ECCV), 2012.
[14] W. Choi and S. Savarese. Understanding collective activities of people from videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1242–1257, 2014.
[15] W. Choi, K. Shahid, and S. Savarese. What are they doing?: Collective activity classification using spatio-temporal relationship among people. In Visual Surveillance Workshop, ICCV, 2009.
[16] Z. Deng, A. Vahdat, H. Hu, and G. Mori. Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[17] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[18] E. Dubrofsky. Homography estimation. Master's thesis, University of British Columbia (Vancouver), Kelowna, BC, Canada, 2009.
[19] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[20] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[21] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.
[22] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning (ICML), 2014.
[23] A. Gupta, J. Little, and R. Woodham. Using line and ellipse features for rectification of broadcast hockey video. In Computer and Robot Vision (CRV), 2011.
[24] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[25] A. Hilton, J.-Y. Guillemaut, J. Kilner, O. Grau, and G. Thomas. Free-viewpoint video for TV sport production. In Image and Geometry Processing for 3-D Cinematography, pages 77–106, 2010.
[26] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[27] N. Homayounfar, S.
Fidler, and R. Urtasun. Soccer field localization from a single image. arXiv preprint arXiv:1604.02715, 2016.
[28] M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori. A hierarchical deep temporal model for group activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[29] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[30] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[31] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[32] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[33] Y. Kong, W. Hu, X. Zhang, H. Wang, and Y. Jia. Learning group activity in soccer videos from local motion. In Proceedings of the Asian Conference on Computer Vision, 2009.
[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
[35] T. Lan, L. Sigal, and G. Mori. Social roles in hierarchical models for human activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1354–1361, 2012.
[36] T. Lan, Y. Wang, W. Yang, S. N. Robinovitch, and G. Mori. Discriminative latent models for recognizing contextual group activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(8):1549–1562, 2012.
[37] J. Liu, X. Tong, W. Li, T. Wang, Y. Zhang, and H. Wang. Automatic player detection, labeling and tracking in broadcast soccer video. Pattern Recognition Letters, 2009.
[38] J. Liu, P. Carr, R. T. Collins, and Y. Liu. Tracking sports players with context-conditioned motion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[39] W.-L. Lu and J. J. Little. Simultaneous tracking and action recognition using the PCA-HOG descriptor. In The 3rd Canadian Conference on Computer and Robot Vision, 2006.
[40] W.-L. Lu and J. J. Little. Tracking and recognizing actions at a distance. In Proceedings of the ECCV Workshop on Computer Vision Based Analysis in Sport Environments, 2006.
[41] W.-L. Lu, K. Okuma, and J. J. Little. Tracking and recognizing actions of multiple hockey players using the boosted particle filter. Image and Vision Computing, 2009.
[42] B. Macdonald. An improved adjusted plus-minus statistic for NHL players. In Proceedings of the MIT Sloan Sports Analytics Conference, 2011.
[43] T. B. Moeslund, G. Thomas, and A. Hilton. Computer Vision in Sports. Springer, 2014.
[44] V. Ramanathan, B. Yao, and L. Fei-Fei. Social role discovery in human events. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[45] V. Ramanathan, J. Huang, S. Abu-El-Haija, A. Gorban, K. Murphy, and L. Fei-Fei. Detecting events and key actors in multi-person videos.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[46] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
[47] M. Ryoo and J. Aggarwal. Recognition of composite human activities through context-free grammar based representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
[48] H. Sak, A. W. Senior, and F. Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014.
[49] O. Schulte, M. Khademi, S. Gholami, Z. Zhao, M. Javan, and P. Desaulniers. A Markov game model for valuing actions, locations, and team performance in ice hockey. Data Mining and Knowledge Discovery, pages 1–23, 2017.
[50] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
[51] T. Shu, D. Xie, B. Rothrock, S. Todorovic, and S. Chun Zhu. Joint inference of groups, events and human roles in aerial videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[52] T. Shu, S. Todorovic, and S.-C. Zhu. CERN: Confidence-energy recurrent network for group activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[53] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (NIPS), 2014.
[54] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[55] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[56] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[57] G. Thomas. Real-time camera tracking using sports pitch markings. Journal of Real-Time Image Processing, 2007.
[58] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In IEEE International Conference on Computer Vision (ICCV), 2015.
[59] T. Tsunoda, Y. Komori, M. Matsugu, and T. Harada. Football action recognition using hierarchical LSTM. In 3rd IEEE International Workshop on Computer Vision in Sports (CVsports), 2017.
[60] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[61] M. Wang, B. Ni, and X. Yang. Recurrent modeling of interaction context for collective activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[62] Y. Wang and G. Mori. Max-margin hidden conditional random fields for human action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[63] X. Wei, P. Lucey, S.
Morgan, and S. Sridharan. Forecasting the next shot location in tennis using fine-grained spatiotemporal tracking data. IEEE Transactions on Knowledge and Data Engineering, 28(11):2988–2997, 2016.
[64] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[65] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[66] M. Zeng, L. T. Nguyen, B. Yu, O. J. Mengshoel, J. Zhu, P. Wu, and J. Zhang. Convolutional neural networks for human activity recognition using mobile sensors. In 6th International Conference on Mobile Computing, Applications and Services (MobiCASE), 2014.
