UBC Theses and Dissertations
Learning to track and identify players from broadcast sports videos. Lu, Wei-Lwun (2011)


Full Text

Learning to Track and Identify Players from Broadcast Sports Videos

by Wei-Lwun Lu

B.Sc., National Tsing Hua University, 2002
M.Sc., University of British Columbia, 2007

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Computer Science)

The University of British Columbia (Vancouver)

December 2011

© Wei-Lwun Lu, 2011

Abstract

Tracking and identifying players in sports videos filmed with a single pan-tilt-zoom camera has many applications, but it is also a challenging problem. This thesis introduces the first intelligent system that tackles this difficult task. The system detects and tracks multiple players, estimates the homography between video frames and the court, and identifies the players.

The tracking system is based on the tracking-by-detection philosophy. We first localize players using a player detector, categorize detections based on team colors, and then group them into tracks of specific players. Instead of using visual cues to distinguish between players, we rely on their short-term motion patterns.

The homography estimation is solved using a variant of Iterated Closest Points (ICP). Unlike most existing algorithms, which rely on matching robust feature points, we propose to match edge points in two images. In addition, we introduce a technique to update the model online in order to accommodate the logos and patterns of different stadiums.

The identification system utilizes both visual and spatial cues, and exploits both temporal and mutual exclusion constraints in a Conditional Random Field. In addition, we propose a novel Linear Programming Relaxation algorithm for predicting the best player identification in a video clip. To reduce the amount of labeled training data required to learn the identification system, we pioneer the use of weakly supervised learning with the assistance of play-by-play texts.

Experiments show promising results in tracking, homography estimation, and identification. Moreover, weakly supervised learning with play-by-play texts greatly reduces the amount of labeled training data required: with merely 200 labels, it achieves accuracies similar to a strongly supervised approach that requires at least 20,000 labels.

Table of Contents

Abstract
Table of Contents
List of Figures
List of Algorithms
Glossary
Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Challenges
  1.3 Contributions
  1.4 Thesis Outline
2 Related Work
  2.1 Multi-Target Tracking
  2.2 Player Identification
  2.3 Learning from Videos and Texts
3 Player Tracking
  3.1 Introduction
  3.2 Detection and Team Classification
  3.3 Multi-target Tracking
  3.4 Experimental Results
  3.5 Summary and Future Work
4 Homography Estimation
  4.1 Background
  4.2 Iterated Closest Points
  4.3 Online Model Update
  4.4 Experimental Results
  4.5 Summary and Future Work
5 Player Identification
  5.1 Probabilistic Graphical Model
  5.2 Visual and Spatial Features
    5.2.1 Visual features
    5.2.2 Spatial features
  5.3 Inference
  5.4 Learning
    5.4.1 Supervised learning
    5.4.2 The play-by-play texts
    5.4.3 Weakly supervised learning
  5.5 Experimental Results
    5.5.1 Datasets
    5.5.2 Feature comparison
    5.5.3 Model comparison
    5.5.4 Learning comparison
  5.6 Applications
    5.6.1 Automatic collection of player statistics
    5.6.2 Star camera
  5.7 Summary and Future Work
6 Conclusion
  6.1 Future Work
Bibliography
Appendix

List of Figures

Figure 2.1 Hockey player tracking by the Boosted Particle Filter (BPF)
Figure 2.2 Sport player tracking by mean shift
Figure 2.3 Tracking-by-detection: an illustration
Figure 2.4 Number recognition by Ye et al. [YHJ+05]
Figure 2.5 Challenges of number detection
Figure 2.6 Face recognition in the TV program "Buffy the Vampire Slayer"
Figure 2.7 Face recognition in the TV program "Lost"
Figure 2.8 Automatic action and scene recognition in movies
Figure 2.9 Learning action models and grammars from videos and stories
Figure 3.1 The tracking-by-detection process
Figure 3.2 The Deformable Part Model (DPM) for basketball players
Figure 3.3 Results of automatic player detection
Figure 3.4 Results of team classification
Figure 3.5 Trajectories of players generated by the tracking system
Figure 3.6 Results of multi-target tracking
Figure 3.7 Comparison of two tracking algorithms
Figure 4.1 Model point set for homography estimation
Figure 4.2 ICP: first and final iteration
Figure 4.3 Edge map for homography estimation
Figure 4.4 Model updating for homography estimation
Figure 4.5 Accuracy of homography estimation in basketball
Figure 4.6 Homography estimation results
Figure 4.7 Comparison of two homography estimation methods
Figure 5.1 Conditional Random Field (CRF) for player identification
Figure 5.2 Visual features for player identification
Figure 5.3 Spatial features for player identification
Figure 5.4 The play-by-play texts of a basketball game
Figure 5.5 Comparison of different features in player identification
Figure 5.6 The most distinctive MSER visual words
Figure 5.7 Comparison of different graphical models for player identification
Figure 5.8 Comparison of different learning strategies for player identification
Figure 5.9 Player identification results I
Figure 5.10 Player identification results II
Figure 5.11 Application I: automatic statistics collecting
Figure 5.12 Application II: star camera

List of Algorithms

1 Multi-target tracking
2 ICP for homography estimation
3 EM for weakly supervised learning
Glossary

BPF Boosted Particle Filter
CRF Conditional Random Field
DPM Deformable Part Model
EM The Expectation-Maximization algorithm
GP Gaussian Processes
HD High Definition
HDTV High Definition TV
HMM Hidden Markov Models
HOG Histograms of Oriented Gradients
ICP Iterated Closest Points
JPDAF Joint Probabilistic Data Association Filter
KF Kalman Filter
LBP Loopy Belief Propagation
LOG Laplacian of Gaussian
LP Linear Programming
LSVM Latent Support Vector Machine
MCMC Markov Chain Monte Carlo
MHT Multiple Hypothesis Tracking
MSER Maximally Stable Extremal Regions
NBA National Basketball Association
NHL National Hockey League
OCR Optical Character Recognition
PF Particle Filter
RANSAC RANdom SAmple Consensus
SIFT Scale Invariant Feature Transform
SVM Support Vector Machine

Acknowledgements

I would like to thank my advisors, Prof. Jim Little and Prof. Kevin Murphy. Jim Little inspired me to pursue research in sports video analysis. He constantly provided technical insights from the computer vision perspective and also supported me financially through the years. Kevin Murphy inspired me to explore the world of machine learning. The ideas of graphical models and weakly supervised learning were all derived from our fruitful weekly meetings.

I would also like to thank my colleagues Jo-Anne Ting and Kenji Okuma. Jo-Anne Ting and I worked together to develop the first version of the identification system and to conduct many preliminary experiments. Kenji Okuma and I worked together to develop a video annotation program that served as the backbone for collecting ground-truth labels for the basketball and hockey datasets. I would also like to thank Misha Fine and Shervin Mohammadi Tari for helping me annotate some of the basketball and hockey datasets.

Finally, I would like to thank my parents for encouraging and supporting me, both emotionally and financially, in pursuing my dreams. I would also like to thank my wife, Yu-Hsuan Yang, for giving me the motivation to write this dissertation, accompanying me through all these difficult years far away from our home country, and giving me new dreams to pursue.

Chapter 1  Introduction

1.1 Motivation

Intelligent sports video analysis systems have many commercial applications and have spawned much research in the past decade. Early work focused on high-level analysis such as video summarization [CH06, ETM03] and shot event recognition [HSC06, SO05]. Recently, with the emergence of accurate object detection and tracking algorithms, the trend has shifted toward a more detailed analysis of sports videos, such as player tracking and identification.

Automatic player tracking and identification has many commercial applications. From the coaching staff's point of view, this technology can be used to gather game statistics (e.g., the average speed and running distance of players) in order to analyze their competitors' strengths and weaknesses. TV broadcasting companies also benefit by using such systems to create star-camera views: video streams that highlight star players. Since both tasks are currently performed by human annotators, automating these processes with an intelligent system would significantly increase production speed and reduce cost.

Most existing techniques tackle the problem of tracking and identifying players by utilizing video streams taken by multiple cameras. These cameras are usually stationary, calibrated, and synchronized [BBFF11]. Such techniques have two major drawbacks.
Firstly, they require installing a calibrated camera network in the stadium prior to the game. This process is expensive and usually impractical for amateur sports fields. Secondly, these techniques are inapplicable to existing sports video archives taken by uncalibrated moving cameras (e.g., old broadcast sports videos, or videos taken by hand-held cameras). Therefore, an intelligent video analysis system that can track and identify players from video streams taken by a single pan-tilt-zoom camera is needed, and it could benefit a wider range of users.

For these reasons, this thesis is devoted to developing an intelligent system that tracks and identifies players from a single broadcast video taken by a pan-tilt-zoom camera. The only input to the system is a single broadcast sports video. The system automatically localizes and tracks players over time, estimates their locations on the court, and finally recognizes the players' identities. To the best of our knowledge, the system proposed in this thesis is the first that is able to solve all of these challenging tasks together given only a single broadcast video taken from a pan-tilt-zoom camera.

1.2 Challenges

Tracking and identifying players in broadcast sports videos is a challenging task. Tracking multiple players in this scenario is hard for several reasons. Firstly, most tracking systems rely on an appearance model of the players (e.g., color histograms). However, in sports videos, players of the same team wear uniforms of the same colors, and they all have similar skin colors and body shapes. This usually confuses the tracking system, especially after players cross over one another. Secondly, in broadcast sports videos, occlusions are frequent and sometimes long. Without sufficient image evidence, tracking systems are likely to lose track of players during occlusions. Thirdly, many tracking systems assume that targets have a certain motion pattern, or that they enter and exit the scene at certain locations. These assumptions rarely hold in broadcast sports videos, because players usually exhibit complicated motion patterns.

Identifying players is an even harder problem. Most existing systems can only recognize players in close-up views, where facial features are clear or the numbers on jerseys are visible. In frames taken from the court view, identifying players is very challenging. Since the image resolution is very low, facial features are blurry and indistinguishable; in fact, it is difficult even for humans to identify players from faces alone in these low-resolution images. Recognizing the numbers on jerseys is possible, but still very hard. Since players constantly change their poses and orientations, numbers on jerseys are visible only in very limited cases. Even when they are visible, they are usually deformed and broken, because the numbers are printed on the players' shirts, which deform. Colors are also not strong cues for recognizing players, because players on the same team have the same jersey color, and many have very similar hair and skin colors.

One possible solution to the identification problem is to train a classifier that combines multiple weak cues from some labeled images, as we proposed in our previous paper [LTML11] (see Section 5.4.1 for details). However, this technique usually requires a large amount of labeled training data, and acquiring those labels is very time-consuming.
For example, using an interactive tracking program (e.g., [WC08]), an experienced annotator needs to spend about 50 hours to label a basketball game video of 5000 frames in order to obtain enough training data for a single basketball team. A learning strategy that requires less labeled data is thus preferable.

1.3 Contributions

This thesis introduces an intelligent system that tracks and identifies players in broadcast sports videos filmed by a single pan-tilt-zoom camera. We focus on video frames taken from the court view because they provide more information about players' locations on the court and thus enable more applications, such as generating game statistics and star-camera views. Although this thesis only shows experiments on basketball videos, the techniques can be easily generalized to other sports such as soccer or hockey. The contributions of this thesis are as follows:

• Multi-target tracking: We present an effective solution for tracking multiple players in broadcast sports videos. The tracking system is based on the tracking-by-detection philosophy: we first localize players with a detector, categorize detections based on team colors, and finally group detections into tracks corresponding to specific players. Experiments show that the proposed algorithm can reliably track multiple players even under occlusion, and that it is more accurate than existing techniques such as the Boosted Particle Filter (BPF) [OTdF+04].

• Homography estimation: We introduce a robust algorithm to estimate the homography matrices between video frames and the court. Unlike most existing techniques, which rely on matching robust feature points such as the Scale Invariant Feature Transform (SIFT) [Low04], the proposed system matches edge points in two images and utilizes a variant of Iterated Closest Points (ICP) [Zha94] to find the homography. Experiments show that the proposed algorithm outperforms existing techniques such as [GLW11] and runs faster.

• Player identification: Instead of relying on face or number recognition, this thesis introduces an algorithm that recognizes players as entities in video frames taken from the court view. The proposed system combines both visual and spatial cues in a classifier. Visual cues are extracted from image patches of players, and spatial cues are computed from the locations of players in court coordinates. We introduce a Conditional Random Field (CRF) [LMP01] that combines the classifier with temporal and mutual exclusion constraints. We also present an efficient inference algorithm based on Linear Programming Relaxation [NW99] to predict players' identities in test videos.

• Weakly supervised learning from play-by-play: Training the player identification system requires a lot of labeled training data, which is very expensive to acquire. This thesis instead leverages play-by-play texts (the logs of important events in the game) to train the identification system using weakly supervised learning. To the best of our knowledge, this thesis is the first to utilize play-by-play texts to train a video analysis system. Experiments also show that the system trained by weakly supervised learning achieves identification accuracy similar to that of a system trained on fully annotated data.

1.4 Thesis Outline

This thesis is organized as follows. In Chapter 2, we review relevant work on multi-target tracking, player identification, and weakly supervised learning for video analysis.
Chapter 3 describes the multi-target tracking algorithm, which is based on tracking-by-detection. In Chapter 4, we present the homography estimation algorithm and its results. In Chapter 5, we introduce the player identification system, including the graphical model, features, inference, and learning algorithms. We conclude the thesis in Chapter 6 and provide some suggestions for future extensions.

Chapter 2  Related Work

The literature on video analysis is vast. In this chapter, we review the most relevant parts, especially work that tackles sports videos. We first review recent work on multi-target tracking, followed by player identity recognition, and finally weakly supervised learning.

2.1 Multi-Target Tracking

Tracking is a broad field in computer vision, and numerous algorithms and systems have been developed over the past decade to tackle this problem. In this section, we review only the most relevant work. Interested readers can consult [YJ06] for a survey of recent tracking algorithms, or [DL10], which reviews soccer player tracking and soccer video analysis.

For tracking multiple targets in a video sequence, the most popular approach is to apply multiple independent single-target trackers, one per target. Specifically, one can initiate a single-target tracker at the initial location of a target and then ask it to follow the player throughout the video sequence. The initial location can be given by hand [PHVG02] or by an automatic object detection system [OTdF+04]. These trackers are independent in the sense that they are unaware of the other trackers, so there is no global coordination among them. The most popular single-target tracking algorithms are the Particle Filter (PF) [DGA00] and mean shift [CM02].

The color-based particle filtering framework developed by Perez et al. [PHVG02] can be directly applied to track sports players. However, since [PHVG02] used the transition distribution as the proposal distribution (also known as the bootstrap filter), the resulting tracker was usually slow and inaccurate. Later, Okuma et al. [OTdF+04] introduced the Boosted Particle Filter (BPF), an algorithm that combines particle filtering and object detection. In their influential work, they used a combination of the Viola-Jones detector [VJS05] and the transition distribution as the proposal distribution, which led to faster and more accurate tracking of hockey players. In follow-up work, Cai et al. [CdL06] utilized bipartite matching to ensure a one-to-one mapping between detections and trackers; they also improved the importance function by using a mean-shift proposal [CRM03]. Recently, Hess et al. [HF09] introduced an algorithm to discriminatively train importance weights in particle filtering, and applied the system to track American football players. These particle-filter-based approaches are usually able to track players successfully over a short period of time. However, when occlusions occur, a single-target tracker is likely to start following another player with a similar color distribution instead of the correct one. This error is mainly due to the lack of global coordination among trackers. Figure 2.1 shows some results of tracking hockey players using the Boosted Particle Filter [OTdF+04].

Figure 2.1: Hockey player tracking by the Boosted Particle Filter (BPF) [OTdF+04].
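To make the bootstrap-filter idea concrete, here is a minimal sketch of one predict-weight-resample step of a color-based particle filter in the spirit of [PHVG02]. It is not any cited author's implementation: the random-walk motion model, the noise scale, and the `frame_hist` patch-histogram callback are illustrative assumptions, and histograms are assumed to be L1-normalized.

```python
import numpy as np

def particle_filter_step(particles, frame_hist, ref_hist, motion_std=5.0):
    """One bootstrap-filter step for a single target. `particles` is an (N, 2)
    array of candidate positions; `frame_hist(p)` returns the normalized color
    histogram of the image patch at position p; `ref_hist` is the template."""
    n = len(particles)
    # Predict: sample from the transition distribution (random-walk prior),
    # which the bootstrap filter also uses as the proposal distribution.
    particles = particles + np.random.normal(0.0, motion_std, particles.shape)
    # Weight: Bhattacharyya coefficient between candidate and template histograms.
    weights = np.array([np.sum(np.sqrt(frame_hist(p) * ref_hist)) for p in particles])
    weights /= weights.sum()
    estimate = weights @ particles          # posterior-mean state estimate
    # Resample: draw particles in proportion to their weights.
    keep = np.random.choice(n, size=n, p=weights)
    return particles[keep], estimate
```

Because the proposal ignores the current image, many particles fall in low-likelihood regions; this is exactly the weakness that the detector-driven proposal of the BPF [OTdF+04] addresses.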
Another popular single-target tracking algorithm is mean shift [CM02] and its variants (e.g., Camshift [Bra98]). Given an initial template of the target (usually represented by color histograms), mean shift efficiently searches nearby regions to find the patch that is most similar to the template. For example, Comaniciu et al. [CRM03] used mean shift to track American football players, as shown in Figure 2.2a. Recently, Hu et al. [HCWC11] applied Camshift [Bra98] to track basketball players, as shown in Figure 2.2b. Like particle filtering, mean-shift-based trackers usually use colors as features to track players, and they lack a mechanism to coordinate different trackers. As a result, they are also prone to losing track after occlusions, because players of the same team wear uniforms of the same colors.

Figure 2.2: Sport player tracking by mean shift [CM02]. (a) American football player tracking by the kernel-based tracker [CRM03] © 2003 IEEE; the target is marked by a bounding box. (b) Basketball player tracking by Camshift [HCWC11] © 2011 IEEE; targets are marked by circles on the ground plane.

With the emergence of accurate object detectors [VJS05, DT05, FMR08], modern multi-target tracking systems usually apply a tracking-by-detection approach. These systems first run an object detector to locate people in images, and then utilize some mechanism to connect these detections into tracks: sets of detections of the same target over a short period of time. Figure 2.3 illustrates the idea. In this figure, detections are represented by circles, and numbers represent time. The problem can be thought of as partitioning detections into groups corresponding to different targets, shown in different colors. Most algorithms enforce mutual exclusion constraints that prevent a detection from being taken by more than one target; in addition, a target is allowed to have at most one detection at any time. Every partition is associated with a score that encourages smooth trajectories of detections and small changes in appearance. The goal of tracking-by-detection is to find the configuration that maximizes this score. Since the score measures the overall effectiveness across all targets, this approach usually performs better than using multiple independent trackers.

Figure 2.3: An illustration of the tracking-by-detection approach, courtesy of Oh et al. [ORS04] © 2004 IEEE. (a) Each circle represents a detection, and numbers represent time. (b) The goal of tracking-by-detection is to partition detections into groups of different targets, represented by different colors. Note that a detection can be assigned to only one target, and a target has at most a single detection at any time.

One way to search for the best tracking result is data-driven Markov Chain Monte Carlo (MCMC) [RC05]. Starting from an initial tracking result, MCMC explores the search space stochastically. In every iteration, MCMC proposes a new configuration by splitting, merging, shrinking, or extending existing tracks, and accepts the new configuration with a probability that depends on how much it improves the score; otherwise, the new configuration is dropped. The pioneering work in this family is that of Oh et al. [ORS04], who used data-driven MCMC to search for the best configuration; the approach is called data-driven because the moves that modify (split, merge, shrink, extend) existing tracks are guided by the data. Later, Liu et al.
[LTL+09] used the Viola-Jones detector [VJS05] to localize soccer players and then applied data-driven MCMC to find the best tracks of the players. Since the search space is huge, one trick to increase speed is to first group 5–10 detections into tracklets using simple heuristics, and then group tracklets into longer tracks. For example, Ge et al. [GC08] first used a simple mean shift tracker [CRM03] to create short tracklets, and then employed data-driven MCMC to connect the short tracklets into longer tracks; they also introduced an algorithm to automatically tune tracking parameters from training data. One can also exploit this technique hierarchically, first grouping detections into tracklets and then grouping tracklets into longer tracks, as in Huang et al. [HWN08].

Another approach is to transform the problem into a combinatorial optimization problem on a graph, where vertices represent detections and arcs represent relationships between detections. For example, Fleuret et al. [FBLF08] used Dynamic Programming to find shortest paths on such graphs, where a path corresponds to a track of a target; the distance between two vertices is measured by the spatial and appearance similarity between the two detections. Zhang et al. [ZLN08] formulated the tracking-by-detection problem as finding a min-cost flow in a graph. In their formulation, every arc is associated with two attributes, a flow and a cost: the flow enforces the mutual exclusion constraints, while the cost encourages or discourages the aggregation of nodes. They showed that the tracking-by-detection problem is equivalent to finding the network flow that minimizes the cost, which has polynomial-time complexity and can be solved efficiently. Another formulation is Linear Programming Relaxation. For example, Jiang et al. [JFL08] associated every arc with an indicator variable: if the indicator variable is one, the detections connected by the arc belong to the same track; otherwise, they belong to different tracks. Optimizing the score function with respect to all indicator variables, subject to the mutual exclusion constraints, generates the best tracks of multiple targets. Berclaz et al. [BFTF11] introduced a similar Linear Programming Relaxation formulation that is applicable to multiple synchronized videos taken from different cameras. Recently, Ben Shitrit et al. [BBFF11] extended [BFTF11] with person-specific score functions and showed promising tracking and identification results on basketball videos taken from 8 synchronized static cameras.

The score functions of these graph-based formulations can only model pairwise relationships between detections, such as spatial and appearance similarity. However, much useful information is not pairwise. A good example is the similarity of speed or moving direction [GC08, LTL+09], which requires interaction among at least three vertices. Another useful high-order cue is track length [GC08, LTL+09], where a longer track is always preferable to a shorter one. In these cases, the MCMC approach can be used for search, though it is usually slower than graph-based methods.
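As a toy instance of the indicator-variable formulations above [JFL08, BFTF11], the following sketch solves the LP relaxation of linking detections across two consecutive frames. The similarity matrix and all names are illustrative assumptions, not any cited system; real systems optimize over whole sequences with entry, exit, and occlusion arcs.

```python
import numpy as np
from scipy.optimize import linprog

def lp_associate(sim):
    """LP relaxation of detection linking between two consecutive frames.
    sim[i, j] scores the hypothesis that detection i (frame t) and detection
    j (frame t+1) belong to the same track. Each indicator variable x_ij is
    relaxed from {0, 1} to [0, 1]; the mutual exclusion constraints say that
    every detection joins at most one link."""
    m, n = sim.shape
    c = -sim.ravel()                   # linprog minimizes, so negate similarity
    A_ub, b_ub = [], []
    for i in range(m):                 # detection i in frame t linked at most once
        row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1.0
        A_ub.append(row); b_ub.append(1.0)
    for j in range(n):                 # detection j in frame t+1 linked at most once
        col = np.zeros(m * n); col[j::n] = 1.0
        A_ub.append(col); b_ub.append(1.0)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=(0.0, 1.0))
    # For these bipartite constraints the constraint matrix is totally
    # unimodular, so the relaxed optimum is integral: x[i, j] = 1 marks a link.
    return res.x.reshape(m, n).round().astype(int)
```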
Recently, people have introduced various methods to model the motion patterns of body parts. For example, Agarwal et al. [AT04] used a mixture of high-order auto-regressions to model the relationship between the current body configuration and its N previous ones. This complex relation can also be modeled by Gaussian Processes (GP), as in [GQ08, HGC+07, UFF06]. In this thesis, we introduce a technique to model the motion pattern of players and show its effectiveness in tracking basketball players (see Chapter 3 for details).

2.2 Player Identification

The goal of player identification is to recognize players in sports videos. Interestingly, all previous player identification systems focused on recognizing players in video frames taken from close-up views. The main reason for this restriction was the lack of image resolution: High Definition TV (HDTV) broadcasts became publicly available only in recent years, and without high-resolution video even humans have difficulty recognizing players.

Some systems relied on face recognition to identify players in close-up views, where facial features are distinguishable. These systems usually have two components: face localization and face recognition. For example, Bertini et al. [BBN05] first used the Viola-Jones face detector [VJ01] to extract the face region from a video frame. Then, they matched the face region to a database of known faces using nearest-neighbor search, where faces are represented by skin colors and Scale Invariant Feature Transform (SIFT) descriptors [Low04] centered on the two eyes. Similarly, Ballan et al. [BBBN07] first detected and extracted facial regions corresponding to the eyes, mouth, and nose; within these regions, they computed SIFT features and descriptors [Low04], and then used nearest-neighbor search to find the closest face in the training dataset, with similarity measured by the number of correct SIFT matches. Recently, Jie et al. [JCF09] introduced an algorithm that identifies players from their faces and body poses; the face was represented by nine SIFT descriptors centered on important regions (eyes, mouth, etc.), and a generative model was used to predict the identities of players in test images. All of these systems rely on an accurate localization of distinctive facial regions. However, the performance of facial region detectors is not guaranteed, and they are likely to fail under occlusion or view-point changes, when some distinctive facial regions are not visible.

Other systems recognize the numbers on jerseys to identify sports players. These systems also have two components: number localization and number recognition. For example, Bertini et al. [BBN05] trained multiple number detectors using Haar-like features [VJ01] and applied them to localize specific numbers on soccer players' jerseys. Since each detector was tuned for a specific number (e.g., the detector for number 11 can only detect number 11 in the scene), recognition was achieved automatically. However, this approach cannot handle situations where there are multiple number detections in a scene. In their following paper, Bertini et al. [BBN06] instead used Maximally Stable Extremal Regions (MSER) [MCUP02] to obtain candidate number regions.

Figure 2.4: Detecting and recognizing numbers on jerseys by Ye et al. [YHJ+05] © 2005 SPIE. Note that such systems only work on images taken from a close-up view. (a) Raw test image. (b) Candidate region detection by image thresholding, assuming that numbers have specific colors. (c) Final number detection and recognition; non-number candidates are dropped based on criteria of size, aspect ratio, and shape.

Then, these candidate regions were fed
to an Optical Character Recognition (OCR) system to recognize the numbers. Ye et al. [YHJ+05] first used image thresholding to divide image pixels into two groups, and then ran a connected-component algorithm to congregate pixels of the same group into locally cohesive segments. By assuming that the numbers on jerseys are white, segments with a lighter color are considered potential number regions, as in Figure 2.4b. The candidate regions were then matched to an image database of different numbers using k-nearest-neighbors. Saric et al. [SDPR08] also used image thresholding to detect candidate number regions; however, they found that processing images in the HSV color space gave better results than the RGB color space used in [YHJ+05]. In both [YHJ+05] and [SDPR08], the threshold for image thresholding was manually chosen. Recently, Ben Shitrit et al. [BBFF11] proposed to learn this threshold from some labeled training images. Specifically, they collected training pixels from both jersey regions and number regions and computed their average RGB colors. At test time, a pixel was classified as part of a jersey region if its color was closer to the average jersey color, and vice versa. They then used template matching to recognize the player, where the template was also learnt from training images. (A code sketch of the thresholding pipeline appears at the end of this section.)

Figure 2.5: Detecting numbers in sports videos is a challenging task. (a) Numbers are tilted and hard for a machine to recognize. (b) Numbers are occluded by body parts or other players. (c) There are also numbers in the background.

We observe two challenges for all these number recognition systems. Firstly, localizing numbers is hard. Most systems assume that numbers on jerseys have specific colors or shapes; however, regions of similar colors and shapes may appear elsewhere in the video frame and confuse the number localization system. For example, as shown in Figure 2.5c, numbers appear not only on jerseys but also in the background and on the super-imposed score board. One simple technique for rejecting these incorrect regions uses sizes [SDPR08, YHJ+05] or locations [BBFF11], but its performance is not guaranteed. Secondly, number recognition is challenging, especially when the numbers lie on deformable surfaces such as players' jerseys. Most OCR systems require text to be normalized to specific sizes, aspect ratios, and orientations; however, numbers on jerseys are usually distorted and tilted, as in Figure 2.5a. Sometimes numbers are even occluded by body parts or other players, as in Figure 2.5b. In these challenging situations, temporal aggregation of information can improve performance [BBFF11, YHJ+05]. In addition, a large number of labeled training images is usually required to learn an accurate number recognition system [BBN05, YHJ+05].

In the surveillance community, there is a related problem called re-identification, where the goal is to find a pedestrian of a given appearance over a network of cameras (e.g., [FBP+10, GT08, JSRS08]). However, since most of these systems rely on color, shape, or texture, they cannot be directly applied to sports videos due to the uniformity of jersey colors within a team. Some systems also use the geometric configuration of the cameras to help re-identify pedestrians (e.g., [JSRS08]), which is likewise not applicable in our case because we only have a single moving camera.
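As a sketch of the thresholding-plus-connected-components pipeline described above [YHJ+05, SDPR08], the following OpenCV code finds candidate number regions in a player patch. Every threshold and the size and aspect-ratio criteria are illustrative assumptions, not values from the cited papers.

```python
import cv2

def candidate_number_regions(bgr_patch, white_thresh=200,
                             min_area=30, max_area=500, max_aspect=2.0):
    """Return candidate jersey-number bounding boxes in a player patch,
    assuming light numbers on a darker jersey."""
    gray = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2GRAY)
    # Threshold splits pixels into two groups (number-like vs. jersey-like).
    _, binary = cv2.threshold(gray, white_thresh, 255, cv2.THRESH_BINARY)
    # Connected components congregate pixels into locally cohesive segments.
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    boxes = []
    for k in range(1, n):              # component 0 is the background
        x, y, w, h, area = stats[k]
        # Reject non-number blobs by size and aspect ratio, in the spirit of
        # the criteria used by [SDPR08, YHJ+05].
        if min_area <= area <= max_area and max(w / h, h / w) <= max_aspect:
            boxes.append((x, y, w, h))
    return boxes
```

The surviving boxes would then be passed to an OCR or nearest-neighbor recognition stage, as in the systems above.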
2.3 Learning from Videos and Texts

As mentioned in the previous section, using supervised learning to train an identification system usually requires a large amount of fully labeled images. In the domain of player identification, a dataset is fully labeled if both the images and their corresponding class labels are given. For example, Bertini et al. [BBN05] trained their face recognition system using a set of face images, each associated with a player's name. Similarly, the number recognition system of Ye et al. [YHJ+05] required training images of numbers, each associated with a single label: the corresponding number.

Labeled training images are very expensive to acquire. Given an unlabeled image, human annotators have to first localize the target of interest, specify the region with a bounding box or circle, and then give the specified region a class label. Annotating a single image may take several seconds to a few minutes, depending on how many targets are in the image and on how the region is specified (e.g., drawing bounding boxes is faster than tracing contours). To increase annotation speed, one possible solution is to divide images into micro-tasks and distribute them to crowd-sourced marketplaces, as in LabelMe [RTMF08]. Annotating videos is even more expensive due to the larger amount of data: a one-minute video may contain 1800 video frames, assuming 30 frames per second (fps). One trick to increase speed is to label a small number of key-frames and interpolate the annotations between them [YRLT09]. Interactive tracking algorithms also help shorten the annotation time, as in [WSTS07, WC08]. Recently, Vondrick et al. [VRP10] divided videos into small chunks and distributed them to crowd-sourced marketplaces for large-scale annotation. Although these techniques alleviate the problem, video annotation remains very time-consuming: for example, using an interactive tracking system similar to [WSTS07], we still needed to spend about 50 hours to annotate 5000 video frames, which is equivalent to a 2.5-minute video.

To reduce the annotation effort, people have sought learning strategies other than supervised learning. The first strategy is semi-supervised learning, where a small set of training images is labeled while the majority remains unlabeled. For example, Balcan et al. [BBC+05] used a graph-based semi-supervised learning approach to propagate labels from labeled training images to unlabeled ones, and applied this approach to recognize people in videos taken by a web camera. They first constructed a graph whose nodes represent the class labels of images and whose edges represent similarities between images. Then, they propagated labels from labeled images to unlabeled ones by encouraging similar images to have similar labels, as in the Gaussian field and harmonic function algorithm [ZGL03].
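The harmonic function step just mentioned [ZGL03] admits a compact closed form: the scores of the unlabeled nodes are f_u = (L_uu)^{-1} W_ul f_l, where L is the graph Laplacian. The sketch below is a minimal illustration, assuming a precomputed affinity matrix and that every unlabeled node is connected (possibly indirectly) to some labeled node so that L_uu is invertible.

```python
import numpy as np

def harmonic_label_propagation(W, labels, labeled_idx):
    """Gaussian field / harmonic function propagation on a similarity graph.
    W is a symmetric (n, n) affinity matrix; labels is an (n, k) array whose
    rows at labeled_idx are one-hot class indicators. Returns class scores
    for every node."""
    n = W.shape[0]
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    L = np.diag(W.sum(axis=1)) - W                      # graph Laplacian
    Luu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    Wul = W[np.ix_(unlabeled_idx, labeled_idx)]
    f = np.array(labels, dtype=float)
    # Harmonic solution: each unlabeled score is the weighted average of its
    # neighbors' scores, so similar images receive similar labels.
    f[unlabeled_idx] = np.linalg.solve(Luu, Wul @ f[labeled_idx])
    return f  # predicted class = f.argmax(axis=1)
```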
Cai et al. [CHH07] introduced Semi-supervised Discriminant Analysis (SDA) for face recognition. SDA is a variant of Linear Discriminant Analysis (LDA) [Bis06], which requires a set of fully labeled images to train the system; SDA, in contrast, uses the labeled images only to maximize the separability between classes, and takes advantage of the unlabeled images to estimate the intrinsic geometric structure of the data.

Many video sequences come with cheap but ambiguous labels, so another approach is to utilize these ambiguous labels to train the identification system. One example is subtitles (captions): textual versions of the dialogue in movies or TV programs. Subtitles specify the timestamps, the corresponding characters, and the written conversation of the dialogue. We can use this information to assign ambiguous labels to video frames. For example, if "Jack" appears in the subtitle of a video frame, we can assume that he also appears in that frame. However, there might be multiple people present in the same frame; the correspondences between image patches of people and their labels are still missing, and therefore supervised learning cannot be used to train the identification system.

Instead, people have used weakly supervised learning to tackle this challenging task. The pioneers in utilizing movies and subtitles to train an identification system were Everingham et al. [ESZ06]. They first used the Viola-Jones face detector [VJ01] to localize faces in movie videos. However, as mentioned before, there might be multiple faces in a video frame while only one name appears in the subtitle. To resolve this ambiguity, they applied a mouth detector to detect lip movements: the face with the most lip movement is taken to be the speaking person. They then trained a face recognition system [AZ05] using the images of speaking people and applied it to recognize the remaining faces. The system represented a face by image patches centered on important regions such as the eyes and mouth. Though effective, the system of Everingham et al. [ESZ06] only worked for frontal faces. To relax this restriction, Sivic et al. [SEZ09] proposed to use face tracking to collect examples of profile faces, and then trained a face recognition system based on the Histograms of Oriented Gradients (HOG) descriptor [DT05] of the entire face. This approach provided good recognition accuracy even when some facial features were occluded. Figure 2.6 shows some example results in the TV series "Buffy the Vampire Slayer".

Figure 2.6: Face recognition in the TV program "Buffy the Vampire Slayer", developed by Sivic et al. [SEZ09] © 2009 IEEE, based on the idea of training the system using videos and their subtitles [ESZ06]. The system localizes and recognizes both profile and frontal faces; the bounding box shows the face's location, and the tag shows the character's name.

Sometimes multiple people are involved in a dialogue (e.g., "Jack" and "Kate") and their faces all appear in the video frame. The correspondences between names and faces are then still missing and need to be resolved while training the identification system. To tackle this problem, Cour et al. introduced ambiguous label learning [CSJT09, CST11], in which every face is associated with a candidate set of labels, only one of which is correct. They formulated a convex learning strategy that penalizes assigning labels outside the candidate set to the training faces. Cour et al. applied this technique to train a face recognition system to identify characters in the TV series "Lost" and showed promising results; Figure 2.7 shows some examples. One noticeable fact is that the candidate sets were small (the average size is 2.17 faces in their "Lost" dataset, which has 10 possible characters). When the size of the candidate set increased, the classification accuracy dropped significantly, indicating that ambiguous label learning may not be effective when the average size of the candidate set is large. In their following work, Cour et al. [CSNT10] further loosened the constraints by using only the written dialogue as labels to train the identification system, instead of a candidate set of labels for every video frame [CSJT09, CST11]. In this challenging case, additional constraints such as video editing cues and temporal grouping cues were needed in order to
When the size of the candidate set increased, the classi- fication accuracy also dropped significantly. This indicates that ambiguous label learning may not be effective in cases when the average size of the candidate set is large. In their following work, Cour et al. [CSNT10] further loosened the constraints by using only the written dialogue as labels to train the identification system, instead of using a candidate set of labels for every video frame [CSJT09, CST11]. In this challenging case, some additional constraints such as video editing cues and temporal grouping cues were needed in order to 15 Figure 2.7: Face recognition in the TV program “Lost”, developed by Cour et al. [CSJT09, CST11] c©2009 IEEE. The system was trained by using the ambiguous label learning, given videos and their subtitles. We can observe that the system can only localize and recognize frontal faces, where there are merely 1-2 faces per frame. In this figure, the bounding box shows the face’s location,and the tag shows the character’s name. resolve the ambiguity. We could also train an action recognition system from movies and their subtitles and scripts. In this case, the input are video frames, and the output are action labels such as “GetOutCar”, “Kiss”, “Hug”. One approach is to utilize the verbs in the script to predict the action. However, since there might be multiple verbs in the script, this task is challenging. For example, the “GetOutCar” action labels is associated with the script “A black car pulls up. Two army officers get out.” We could see that there are two verbs “pull” and “get” in the script, and therefore the relationship between action labels and scripts becomes ambiguous. To cope with this problem, Laptev et al. [LMSR08] used a classifier to predict action labels from subtitles. The classifier was trained by using supervised learning with some training data, where movie scripts were represented by the bag-of-words features. After resolving the ambiguity in the script, they then trained an action recognition system using videos and their predicted action labels. Specifically, the classifier was a Support Vector Machine (SVM) [CV95] where video clips were represented by bag of visual words extracted by a space-time interest point detector [Lap05]. In their following paper, Marszalek et al. [MLS09] explored the effectiveness of using scene recognition to assist action recognition. The basic idea is to utilize the correlation of actions and places (e.g., action “eating” should occur in places such like “kitchen” or “cafe”). The scene recognition system was also trained by using movies and scripts, where scenes were represented by bag of visual words extracted by the Harris detector [MS04]. Figure 2.8 shows some example results. 16 (a) eating, kitchen (b) eating, cafe (c) running, road (d) running, street Figure 2.8: Automatic action and scene recognition in movies, developed by Marszalek et al. [MLS09] c©2009 IEEE. Both the action recognizer and scene recognizer were trained by using movies and subtitles. In the sub-captions, the first label represents the action, where the second one represents the place. Another challenge is the temporal ambiguity of action labels. For example, a 10-second video clip may be associated with the action label “GetOutCar”, but only 2 seconds of them contain the corresponding action. In order to tackle this problem, Duchenne et al. [DLS+09] used discriminative clustering to localize the most distinctive segments in video clips with a particular action label. 
Another challenge is the temporal ambiguity of action labels. For example, a 10-second video clip may be associated with the action label "GetOutCar" even though only 2 seconds of it contain the corresponding action. To tackle this problem, Duchenne et al. [DLS+09] used discriminative clustering to localize the most distinctive segments in video clips carrying a particular action label; they then used these distinctive segments to train the action recognition system of [LMSR08] and showed better results. Buehler et al. [BEZ09] instead used multiple instance learning to achieve the same goal. In multiple instance learning, a set of images shares a single label, and the label is positive if at least one of the images has that label; for example, a set of 30 video frames is associated with the action label "GetOutCar" if one of them contains that action. Multiple instance learning trains a classifier that predicts the action label given an image and, at the same time, produces labels for the ambiguous training images.

Recently, Gupta et al. [GSSD09] learnt action grammars from sports videos and their stories. The input is a video and a story describing it, such as "Pitcher pitches the ball and then the batter hits it. After hitting the ball, batter runs to the base." The objective is to learn classifiers that recognize the actions "pitching", "hitting", and "running" and, at the same time, figure out the grammar among these actions. To learn the action models and their grammar, Gupta et al. used the Expectation-Maximization (EM) algorithm [DLR77]: they started with simple action models trained from some labeled images, and then gradually refined the models using the videos and stories. Figure 2.9 shows two examples.

Figure 2.9: Learning action models and grammars from videos and stories, developed by Gupta et al. [GSSD09] © 2009 IEEE. In these two examples, we can see some video frames and their corresponding stories. The diagrams on the left-hand and right-hand sides are the action grammars, where boxes represent actions. The system is able to localize the regions of actions in images and recognize their categories.

In this thesis, we propose to utilize the play-by-play texts of sports videos as weak labels to train the identity recognition system. Play-by-play texts have previously been used for tactic recognition in sports videos [ZXH+09]; however, there the play-by-play served as one of the features in the classification task, not as the source of supervision. To the best of our knowledge, no existing work has utilized play-by-play as the source of weak supervision to train a player identification system.

Chapter 3  Player Tracking

Given video clips taken from a medium distance, our goal is to localize sports players (the object detection process) and to track the locations and sizes of players over time (the visual tracking process). Object detection and tracking have many applications in sports video analysis. For example, when collecting game statistics automatically, we can use the trajectories of players to estimate their average speeds and running distances and to analyze their moving styles. Star-camera views also require knowing the locations of players in the video in order to crop and enlarge specific players. In addition, tracking assists the other components of the proposed video analysis system. For instance, tracking provides image patches that contain appearance information about the players; the identification system later utilizes this information to recognize players. Tracking also provides temporal aggregation of information by grouping image patches over time into a single entity, which significantly boosts the performance of the identification system.

Although visual tracking is the core process in the system, it is also very challenging, especially in the domain of broadcast sports videos.
Since the camera is not stationary, we cannot use background subtraction techniques to reliably detect players, as in many previous systems [BBG06, BBFF11, JFL08]. Instead, we apply a shape-based player detector, which is less accurate and generates many false positives and missed detections. In addition, since players of the same team wear jerseys of the same colors, color-based tracking algorithms such as mean shift [CRM03] usually fail to keep tracking the same player when occlusions occur. Moreover, occlusions are frequent and sometimes long: in basketball, for example, a player may occlude another player for 5–10 seconds, which defeats most tracking systems. Players also have more complicated motion patterns: compared to pedestrians, who usually move in similar directions at similar speeds, players' motion is harder to predict. For these reasons, general-purpose tracking systems usually do not work well for tracking players in broadcast sports videos.

To reliably track players in broadcast sports videos, this thesis takes the tracking-by-detection approach. Specifically, we first use an object detector to localize players in video frames, and then we group detections of the same player into tracks. Unlike most tracking systems, which use appearance to distinguish one target from another, we instead utilize players' short-term trajectories. Section 3.1 provides a brief introduction to the proposed algorithm. Section 3.2 describes the techniques for localizing and categorizing players. Section 3.3 details how to group detections into tracks in order to solve the multi-target tracking problem. Section 3.4 presents the experimental results.

3.1 Introduction

In this thesis, we track players in broadcast sports videos using the tracking-by-detection approach. Tracking-by-detection starts by running an object detector to localize players in the video; the system then groups detections into tracks of different players. A track can be thought of as a temporal aggregation of detections of the same player. We prefer tracks with smooth changes of appearance, location, and size. Mutual exclusion constraints are also enforced: a detection can be assigned to at most one track, and a track can be associated with at most one detection at a time.

We give an example of the tracking-by-detection approach in Figure 3.1. In this figure, the horizontal axis represents time and the vertical axis represents the x-coordinate of detected bounding boxes. In Figure 3.1a, the object detector localizes players in a video clip of 400 frames, where the players' x-coordinates are represented by red dots. The goal of tracking-by-detection is to group the detections into tracks of specific players. Figure 3.1b shows the tracking results, with different colors representing different tracks. Note that a player's trajectory may be broken into several tracks of different colors.

The tracking-by-detection problem can be solved by data-driven Markov Chain Monte Carlo (MCMC) [GC08, LTL+09, ORS04], graph-based approaches [FBLF08, ZLN08], or Linear Programming Relaxation [JFL08, BFTF11, BBFF11]. All these methods try to find a partition of detections that maximizes an objective function favoring a smooth change
of appearance and locations, and at the same time satisfies the mutual exclusion constraints.

Figure 3.1: An illustration of the tracking-by-detection process. Both panels show an x-t plot, where the horizontal axis represents time and the vertical axis represents the x-coordinate of players. (a) Automatic player detections generated by an object detector; red dots represent detected bounding boxes. (b) The tracking-by-detection approach groups detections into tracks of specific players; different colors represent different tracks. Mutual exclusion constraints are enforced, i.e., a detection can be assigned to at most one track, and a track has at most one detection at a time.

Among these three methods, data-driven MCMC [GC08, LTL+09, ORS04] is the most general and allows us to model more complex relationships among detections, such as a player's speed and track length. Unfortunately, since MCMC is a sampling technique, it is also the slowest. Graph-based methods [FBLF08, ZLN08] and Linear Programming Relaxation [JFL08, BFTF11, BBFF11] are much faster; however, they lack the ability to model complex relationships among detections.

In this thesis, we take an approach similar to that of Cai et al. [CdL06]. Instead of finding a global partition of all detections in a video clip, we perform local grouping on a frame-by-frame basis. Specifically, we first find the best detection-to-track assignment in the first frame; then, based on that assignment, we search for the best detection-to-track assignment in the second frame, and we repeat this procedure until all frames in the video clip are processed. This approach is much faster than MCMC, graph-based methods, and Linear Programming, because we divide one expensive optimization problem into smaller ones that can be solved very efficiently. In addition, high-order features such as speeds and track lengths can be used in our framework to improve accuracy. We especially focus on developing an accurate motion model to distinguish between overlapping players. In practice, we found that the proposed system outperforms graph-based methods and MCMC in both speed and accuracy. The following sections detail the proposed tracking algorithm; a sketch of a single frame of the detection-to-track assignment appears below.
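The following is a minimal sketch of one frame of the local grouping step, using the Hungarian algorithm for the detection-to-track assignment. The pure-distance cost and the gating threshold are illustrative placeholders for the richer motion and appearance terms developed later in this chapter.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_detections(track_predictions, detections, gate=50.0):
    """One frame of detection-to-track grouping. `track_predictions` is a
    (T, 2) array of predicted track positions and `detections` is a (D, 2)
    array of detected positions. The optimal one-to-one assignment enforces
    the mutual exclusion constraints; pairs whose cost exceeds the gate are
    rejected (missed detection or a new track)."""
    cost = np.linalg.norm(
        track_predictions[:, None, :] - detections[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)     # Hungarian algorithm
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= gate]
    unmatched_tracks = set(range(len(track_predictions))) - {r for r, _ in matches}
    unmatched_dets = set(range(len(detections))) - {c for _, c in matches}
    return matches, unmatched_tracks, unmatched_dets
```

Running this step frame by frame, seeded by the previous frame's assignment, is what makes the approach so much cheaper than a global partition of all detections.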
In this thesis, we instead utilize shapes rather than colors to localize players. Specifically, we use the Deformable Part Model (DPM) detector developed by Felzenszwalb et al. [FMR08] to automatically locate sports players. DPM represents images of a particular object class by multiple templates, each of which consists of a root filter and several part filters. The root filter captures the rough shape of the object, while the part filters are responsible for the detailed shapes of particular regions. Both root and part filters are represented by features similar to the Histograms of Oriented Gradients (HOG) descriptor [DT05], which are orientation histograms of image gradients. DPM also allows parts to have flexible locations, and thus the object's shape is deformable. In order to detect basketball players, we use a DPM consisting of 3 templates with 6 part filters. Figure 3.2a shows the root filters. In this figure, we show the weights of the oriented gradient features. We can see that the templates approximate the shapes of basketball players, where the first template (first row) corresponds to frontal views and the second and third templates correspond to profile views. Figure 3.2b shows the 6 part filters, and Figure 3.2c shows their spatial histograms relative to the center of the player. We see that DPM uses the head and four limbs as parts for frontal-view images. On the other hand, DPM uses upper and lower bodies as parts for profile-view images because some limbs might be self-occluded.

In order to learn templates of players, DPM takes a weakly supervised approach. DPM requires users to prepare a set of positive and negative image patches, where positive patches contain players and negative patches have no players. DPM then uses a Latent Support Vector Machine (LSVM) [FGMR10] to simultaneously predict locations of parts and learn the template models. The algorithm is called Latent SVM because locations of parts are not given in the training data. To train the basketball player detector used in this thesis, we prepared 3000 positive patches that contain players, and 300 negative images that have no players. The original implementation of DPM¹ [FMR08] uses a data mining technique to automatically extract hard negative examples from those negative images to further improve the detection performance.

During test time, DPM uses template matching to locate parts of different locations and sizes. Specifically, it convolves shape templates of parts across the image pyramid to compute the confidence map. These part detections then vote for the central locations and sizes of players. Similarly, this is done by convolving the spatial templates of parts across the confidence map. The final detections are obtained by thresholding the confidence map. To eliminate duplicate detections, DPM uses non-maximum suppression. Specifically, the smaller of two overlapping detections is dropped if its overlapping area is greater than a threshold.

Figure 3.3 shows results of detected basketball players in broadcast videos, where player detections are represented by red bounding boxes. We can see that DPM successfully detects basketball players in most cases. However, there are three common errors. Firstly, there are duplicate detections in (a), (b), and (g). These duplicate detections usually occur during occlusions and cannot be removed by non-maximum suppression.
Secondly, there are false positive detections from spectators in (g) and (h), and false positive detections from referees in (a). Spectators and referees have similar shapes to basketball players, and thus it is difficult for DPM to distinguish between players and other people. Thirdly, DPM fails to detect some players in (d), (e), (g), and (h), because they are partially occluded by others and have irregular shapes.

¹Matlab code is available at http://people.cs.uchicago.edu/~pff/latent/

Figure 3.2: The Deformable Part Model (DPM) [FMR08] for basketball players. The DPM consists of 3 templates (shown in different rows), each of which has 6 parts. (a) The root filters. The visualization shows weights of oriented gradient features, where a larger weight has a longer tail. (b) The part filters. The visualization is the same as (a). (c) The spatial models. The visualization shows spatial histograms of parts relative to the root filter.

In order to reduce the number of false positive detections, we use the fact that players of the same team wear jerseys whose colors are different from those of the spectators, referees, and the other team. Specifically, given a detected bounding box $\mathbf{d}$, we use a classifier to predict its category. We use three possible categories: "Team A", "Team B", and "others". To represent a bounding box, we use its RGB color histogram. In our experiments, we use a 10-bin histogram for each of the R, G, B channels, and therefore the resulting color histogram has 30 dimensions. To emphasize the central regions that most likely contain jersey colors, we apply a Gaussian kernel centered in the detected bounding box to re-weight the color histogram. Then, we use multi-class logistic regression [Bis06] to predict the category $c$ given the weighted RGB color histogram $\mathbf{z}$:

$$p(c = k \mid \mathbf{z}, \boldsymbol{\theta}) = \frac{\exp(\boldsymbol{\theta}_k^T \mathbf{z})}{\sum_{j=1}^{3} \exp(\boldsymbol{\theta}_j^T \mathbf{z})} \qquad (3.1)$$

where $\boldsymbol{\theta}$ are the parameters of the classifier. The parameters $\boldsymbol{\theta}$ can be learnt by supervised learning. Specifically, given labeled training data $\{(\mathbf{z}_1, c_1), (\mathbf{z}_2, c_2), \ldots, (\mathbf{z}_N, c_N)\}$, we can find the best parameters by optimizing the log-posterior:

$$\max_{\boldsymbol{\theta}} \sum_{k=1}^{N} \log p(c_k \mid \mathbf{z}_k, \boldsymbol{\theta}) - \alpha \|\boldsymbol{\theta}\|_1 \qquad (3.2)$$

where $\alpha$ is a constant and $N$ is the number of training examples. Notice that we use an L1 regularizer $\|\boldsymbol{\theta}\|_1$ to encourage sparsity of the parameters $\boldsymbol{\theta}$, which is similar to feature selection. The log-posterior can be optimized by using the algorithm in [Ng04].

Figure 3.4 shows some team classification results in broadcast basketball videos. In this figure, we use green bounding boxes to represent Celtics players and yellow bounding boxes to represent Lakers players. We also discard those boxes that correspond to the category "others". We can see that team classification successfully eliminates detections of referees and spectators, and therefore reduces the number of false positive detections.

Figure 3.3: Automatic player detection results generated by the DPM detector [FMR08]. DPM estimates locations and sizes of players, which are represented by red bounding boxes in the figure.

Figure 3.4: Results of team classification. We classify detected bounding boxes into three categories: Celtics (green), Lakers (yellow), and others (not shown).
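To make the classification step concrete, the following Python sketch computes the Gaussian-weighted RGB histogram and evaluates the multi-class logistic posterior of Equation 3.1. This is an illustrative sketch, not the thesis implementation; the function names and the kernel bandwidth are our own assumptions.

```python
import numpy as np

def weighted_rgb_histogram(patch, bins=10):
    """Gaussian-weighted 30-dimensional RGB histogram of a detected box.

    patch: HxWx3 uint8 image crop of a detection."""
    h, w, _ = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Gaussian weights centered in the box emphasize jersey pixels
    # (bandwidth of half the box size is an assumption).
    weights = np.exp(-(((ys - h / 2) / (0.5 * h)) ** 2
                       + ((xs - w / 2) / (0.5 * w)) ** 2))
    feats = []
    for ch in range(3):
        hist, _ = np.histogram(patch[:, :, ch], bins=bins,
                               range=(0, 256), weights=weights)
        feats.append(hist)
    z = np.concatenate(feats).astype(float)
    return z / (z.sum() + 1e-8)      # normalize the histogram

def team_posterior(z, theta):
    """Multi-class logistic regression of Equation 3.1; theta is 3x30."""
    scores = theta @ z
    scores -= scores.max()           # for numerical stability
    p = np.exp(scores)
    return p / p.sum()               # [p(Team A), p(Team B), p(others)]
```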
3.3 Multi-target Tracking

After player detection and team classification, the next step is to group detections into tracks that correspond to different players. We use the team classification results to separate detections into two groups, "Team A" and "Team B", and then we perform multi-target tracking independently for each group. Instead of seeking a global partition of all detections in a video clip, we perform local grouping on a frame-by-frame basis, similar to Cai et al. [CdL06]. Specifically, we first find the best partition in the first frame. Then, the algorithm utilizes the estimated results in the first frame to find the best partition in the second frame. We repeat this procedure until all frames are processed.

In every frame, we execute three steps to perform multi-target tracking. We first associate detections with existing tracks, and at the same time ensure there is a one-to-one mapping between detections and tracks. Then, we update each track's current location by taking both its current and previous detections into account. We add new tracks if there are any unused detections. On the other hand, we remove existing tracks if they lack image evidence for too long. The details of these components are as follows.

Data association: The first task is to find the best association of detections and existing tracks and ensure that there is a one-to-one mapping between them. Assuming that there are $N$ detections and $M$ existing tracks, we represent the detection-to-track association by an $M \times N$ matrix $A$, where we define $A_{i,j} = 1$ if the $j$-th detection is assigned to the $i$-th track, and otherwise $A_{i,j} = 0$. We also use an $M \times N$ matrix $C$, where $C_{i,j}$ represents the cost of associating the $i$-th track with the $j$-th detection. To find the best one-to-one mapping, we optimize the following objective function:

$$\min_{A} \sum_{i=1}^{M} \sum_{j=1}^{N} C_{i,j} A_{i,j} \qquad (3.3)$$

subject to the following constraints:

$$\sum_{j=1}^{N} A_{i,j} = 1 \quad \text{for } i = 1 \ldots M \qquad (3.4)$$

$$\sum_{i=1}^{M} A_{i,j} \le 1 \quad \text{for } j = 1 \ldots N \qquad (3.5)$$

$$A_{i,j} \ge 0 \quad \text{for } i = 1 \ldots M,\ j = 1 \ldots N \qquad (3.6)$$

Equation 3.4 ensures that every track has exactly one detection. Equation 3.5 ensures that one detection is assigned to at most one existing track. Equation 3.6 prevents a trivial solution. The above problem is called bi-partite matching and can be solved efficiently by using the Hungarian algorithm [PS98].

To compute the association cost $C_{i,j}$, most previous tracking systems used both appearance and motion cues [CdL06, GC08, LTL+09, OTdF+04]. However, in tracking sports players, appearance cues are very weak because players have similar shapes and colors. Distinct features such as jersey numbers are very rare in the video and thus cannot be reliably used in the cost function. Instead, we use a cost function that focuses on motion cues. Let $\hat{\mathbf{s}}_i = [\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i]^T$ be the predicted bounding box of the $i$-th track, and let $\mathbf{d}_j = [x_j, y_j, w_j, h_j]^T$ be the $j$-th bounding box generated by the detector. We compute the cost function by

$$C_{i,j} = \sqrt{(\hat{\mathbf{s}}_i - \mathbf{d}_j)^T (\hat{\mathbf{s}}_i - \mathbf{d}_j)} \qquad (3.7)$$

We compute the predicted bounding box at time $t$ by the following equation (we drop the subscript $i$ for simplicity):

$$\hat{\mathbf{s}} = t\,\mathbf{a}_t + \mathbf{b}_t \qquad (3.8)$$

where $\mathbf{a}_t$ and $\mathbf{b}_t$ are both $4 \times 1$ vectors. We estimate $\mathbf{a}_t$ and $\mathbf{b}_t$ by utilizing the trajectory of the $i$-th track in the previous $T$ frames. Specifically, we optimize the following least-squares objective with respect to the parameters $\mathbf{a}_t$ and $\mathbf{b}_t$:

$$\min \sum_{k=1}^{T} (1 - \alpha)^k \| (t - k)\mathbf{a}_t + \mathbf{b}_t - \mathbf{s}_{t-k} \|^2 \qquad (3.9)$$

where $\mathbf{s}_{t-k}$ is the track's estimated bounding box at time $t-k$, and $0 \le \alpha < 1$ is a small positive constant. Equation 3.9 can be thought of as fitting a linear function that maps time $t$ to a bounding box $\mathbf{s}_t$². As shown in Figure 3.1, the relationship between time and player locations is approximately linear in a short time interval. We have also tried first-order auto-regression (i.e., $\hat{\mathbf{s}} = Q\mathbf{s}_{t-1}$) and second-order auto-regression (i.e., $\hat{\mathbf{s}} = Q\mathbf{s}_{t-1} + R\mathbf{s}_{t-2}$). However, the proposed method works best in practice.
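The data association step can be summarized in a short sketch. The following Python code fits the weighted linear motion model of Equations 3.8–3.9 and solves the bi-partite matching of Equations 3.3–3.7 with SciPy's Hungarian solver; the gating threshold corresponds to the test against λ described below. Function names and the default values are illustrative assumptions, not the thesis settings.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def predict_box(history, t, alpha=0.1):
    """Weighted least-squares fit of the linear motion model (Eqs. 3.8-3.9).

    history: list of (time, 4-vector bounding box) for one track."""
    times = np.array([k for k, _ in history], dtype=float)
    boxes = np.array([b for _, b in history], dtype=float)
    w = np.sqrt((1.0 - alpha) ** (t - times))[:, None]  # downweight old frames
    X = np.stack([times, np.ones_like(times)], axis=1)
    coef, *_ = np.linalg.lstsq(w * X, w * boxes, rcond=None)
    a, b = coef[0], coef[1]
    return t * a + b                                    # Equation 3.8

def associate(pred_boxes, det_boxes, lam=50.0):
    """Bi-partite matching (Eqs. 3.3-3.7) with gating threshold lam."""
    C = np.linalg.norm(pred_boxes[:, None, :] - det_boxes[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(C)               # Hungarian algorithm
    # Keep only associations whose cost is below the threshold.
    return [(i, j) for i, j in zip(rows, cols) if C[i, j] < lam]
```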
²Notice that we fit a linear model only when we have a sufficient number of training data. Otherwise, first-order auto-regression (i.e., $\hat{\mathbf{s}} = \mathbf{s}_{t-1}$) is used.

After solving the bi-partite matching problem in Equation 3.3, every track is associated with one detection. However, in reality, there might be a missed detection for a target due to occlusion. In this case, assigning one detection to every existing track would generate incorrect tracking results. To cope with this problem, we create a modified association matrix $\tilde{A}$, and let $\tilde{A}_{i,j} = 1$ if $(A_{i,j} = 1) \wedge (C_{i,j} < \lambda)$, and otherwise $\tilde{A}_{i,j} = 0$. In other words, we only accept an association if its corresponding cost is below a threshold $\lambda$. This prevents a track from associating with an incorrect detection when occlusions occur.

Note that there exist more advanced data association techniques such as Multiple Hypothesis Tracking (MHT) and the Joint Probabilistic Data Association Filter (JPDAF) [Cox93]. MHT maintains a pool of multiple hypotheses to help compute the best data association in the current frame, instead of just keeping the best hypothesis as described above. JPDAF computes the probability of every possible data association and its corresponding state estimate, and then combines them in a probabilistic formulation. Both approaches could be more robust than the proposed one, but they are also more computationally expensive. In our experiments on the sports video dataset, we found that bi-partite matching for data association is sufficient in most circumstances, and therefore we use this technique throughout this thesis.

State update: After data association, we use probabilistic filtering to update the state of a track. We represent the state of a track at time $t$ as a bounding box $\mathbf{s}_t = [x_t, y_t, w_t, h_t]$, where $(x_t, y_t)$ is the location in the image coordinate system, and $w_t$ and $h_t$ are the width and height, respectively. Since we encourage a smooth change in the location and size of the bounding box, we model the transition probability as:

$$p(\mathbf{s}_t \mid \mathbf{s}_{t-1}) = \mathcal{N}(\mathbf{s}_t \mid \mathbf{s}_{t-1}, \sigma_d^2 I) \qquad (3.10)$$

where $\mathcal{N}(\mathbf{s}_t \mid \mathbf{s}_{t-1}, \sigma_d^2 I)$ is a Gaussian distribution with mean $\mathbf{s}_{t-1}$ and covariance $\sigma_d^2 I$, and $I$ is the identity matrix. The transition probability assumes that the current bounding box will be the same as the previous one plus some Gaussian noise. Assuming that the track is associated with a detection $\mathbf{d}_t$, we model the emission probability (sometimes called the observation likelihood) as

$$p(\mathbf{d}_t \mid \mathbf{s}_t) = \mathcal{N}(\mathbf{d}_t \mid \mathbf{s}_t, \sigma_e^2 I) \qquad (3.11)$$

where $\mathcal{N}(\mathbf{d}_t \mid \mathbf{s}_t, \sigma_e^2 I)$ is a Gaussian distribution with mean $\mathbf{s}_t$ and covariance $\sigma_e^2 I$.

Algorithm 1 Multi-target tracking
Input: a set of video frames $I_1, I_2, \ldots, I_T$
1: $\mathcal{T}^A_0 = \{\}$
2: $\mathcal{T}^B_0 = \{\}$
3: for $t = 1$ to $T$ do
4:  $\mathcal{D}_t$ = DPMDetect($I_t$)  ▷ see [FMR08]
5:  $\mathcal{D}^A_t, \mathcal{D}^B_t$ = TeamClassify($\mathcal{D}_t$)  ▷ Equation 3.1
6:  for all $k \in \{A, B\}$ do
7:   $A^k_t$ = DataAssociate($\mathcal{T}^k_{t-1}$, $\mathcal{D}^k_t$)  ▷ Equation 3.3
8:   $\mathcal{T}^k_t$ = StateUpdate($\mathcal{T}^k_{t-1}$, $\mathcal{D}^k_t$, $A^k_t$)  ▷ Equation 3.12
9:   $\mathcal{T}^k_t$ = $\mathcal{T}^k_t \cup$ AddTarget($\mathcal{D}^k_t$, $A^k_t$)
10:  $\mathcal{T}^k_t$ = $\mathcal{T}^k_t \setminus$ RemoveTarget($\mathcal{T}^k_t$, $I_t$)
11:  end for
12: end for
Probabilistic filtering computes the posterior of the current state by using the recursion

$$p(\mathbf{s}_t \mid \mathbf{s}_{1:t-1}, \mathbf{d}_{1:t}) \propto p(\mathbf{d}_t \mid \mathbf{s}_t) \int_{\mathbf{s}_{t-1}} p(\mathbf{s}_t \mid \mathbf{s}_{t-1})\, p(\mathbf{s}_{t-1} \mid \mathbf{s}_{1:t-2}, \mathbf{d}_{1:t-1}) \qquad (3.12)$$

where $p(\mathbf{s}_{t-1} \mid \mathbf{s}_{1:t-2}, \mathbf{d}_{1:t-1})$ is the posterior in the previous frame, $p(\mathbf{s}_t \mid \mathbf{s}_{t-1})$ is the transition probability, and $p(\mathbf{d}_t \mid \mathbf{s}_t)$ is the emission probability. As shown in [Bis06], since both the transition and observation probabilities are Gaussian distributions, the resulting posterior is also a Gaussian, with mean $\boldsymbol{\mu}_t$ and covariance $\Sigma_t$ defined as follows:

$$\boldsymbol{\mu}_t = \boldsymbol{\mu}_{t-1} + K_t(\mathbf{d}_t - \boldsymbol{\mu}_{t-1}) \qquad (3.13)$$

$$\Sigma_t = \Sigma_{t-1} + \sigma_d^2 I - K_t(\Sigma_{t-1} + \sigma_d^2 I) \qquad (3.14)$$

$$K_t = (\Sigma_{t-1} + \sigma_d^2 I)(\Sigma_{t-1} + \sigma_d^2 I + \sigma_e^2 I)^{-1} \qquad (3.15)$$

where $\boldsymbol{\mu}_{t-1}$ and $\Sigma_{t-1}$ are the mean and covariance of the posterior in the previous frame, respectively. The above recursion is also known as the Kalman Filter (KF) [Kal60]. The parameters $\sigma_d$ and $\sigma_e$ can be manually set or automatically learnt from training data. In general, $\sigma_d$ and $\sigma_e$ should become larger as the image size and the player's moving speed increase. We manually set the parameters and use the same values throughout this thesis.

Add & remove tracks: If a detection $\mathbf{d}_u$ is not assigned to any existing track, we create a new track and associate the detection with it. We assume the initial probability is:

$$p(\mathbf{s}_0) = \mathcal{N}(\mathbf{s}_0 \mid \mathbf{d}_u, \sigma_d^2 I) \qquad (3.16)$$

where $\mathcal{N}(\mathbf{s}_0 \mid \mathbf{d}_u, \sigma_d^2 I)$ is a Gaussian distribution with mean $\mathbf{d}_u$ and covariance $\sigma_d^2 I$ (the covariance matrix is the same as that of the transition probability). However, the track is dropped if it fails to have a sufficient number of detections associated with it after a short period of time. This trick is similar to using track lengths as features, as in [GC08, LTL+09]. Existing tracks are also removed if they are close to the image boundaries, or if they fail to have any detections associated with them for a sufficient period of time (1 second in our experiments).
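Because both the transition and emission distributions are Gaussian, one filtering step reduces to a few matrix operations. Below is a minimal sketch of Equations 3.13–3.15; the noise values are illustrative, not the settings used in the thesis.

```python
import numpy as np

def kalman_update(mu_prev, Sigma_prev, det, sigma_d=4.0, sigma_e=8.0):
    """One Kalman filter step over the bounding-box state (Eqs. 3.13-3.15).

    mu_prev, det: 4-vectors [x, y, w, h]; Sigma_prev: 4x4 covariance."""
    I4 = np.eye(4)
    P = Sigma_prev + sigma_d ** 2 * I4            # predicted covariance
    K = P @ np.linalg.inv(P + sigma_e ** 2 * I4)  # Kalman gain (Eq. 3.15)
    mu = mu_prev + K @ (det - mu_prev)            # posterior mean (Eq. 3.13)
    Sigma = P - K @ P                             # posterior covariance (Eq. 3.14)
    return mu, Sigma
```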
3.4 Experimental Results

In order to evaluate the performance of the tracking system, we first have to prepare a collection of sports videos with ground-truth bounding boxes. To do so, we developed a smart annotation program similar to ViPER [DM00]. The program allows users to perform interactive tracking, similar to [WSTS07], to increase the annotation speed. Though we used an intelligent tool, annotation still took an enormous amount of time. In total, we annotated about 50,000 frames of basketball videos, which contain 350,000 bounding boxes. We spent more than 200 hours completing the annotation.

We use precision and recall to evaluate the performance of player detection, team classification, and multi-target tracking. Given an estimated bounding box $\mathbf{d}$ and its closest ground-truth bounding box $\mathbf{d}'$, we first compute the normalized area of overlap

$$a = \frac{\text{area}(\mathbf{d} \cap \mathbf{d}')}{\text{area}(\mathbf{d} \cup \mathbf{d}')} \qquad (3.17)$$

where $\text{area}(\mathbf{d} \cap \mathbf{d}')$ denotes the intersection area of the two bounding boxes $\mathbf{d}$ and $\mathbf{d}'$, and $\text{area}(\mathbf{d} \cup \mathbf{d}')$ denotes their union area. We consider the detection $\mathbf{d}$ a true positive if the normalized area of overlap $a > 0.5$. Otherwise, we consider the detection a false positive. The precision is defined as

$$\text{precision} = \frac{\#\text{ true positives}}{\#\text{ true positives} + \#\text{ false positives}} \qquad (3.18)$$

In other words, precision reflects the percentage of correct detections in an image. The recall is defined as

$$\text{recall} = \frac{\#\text{ true positives}}{\#\text{ ground-truth detections}} \qquad (3.19)$$

Recall shows the number of players successfully detected in an image. A good system needs both high precision and high recall.

The DPM detector [FMR08] had a moderate precision of 73% and recall of 78% on the test basketball dataset. Figure 3.3 shows some qualitative results. We can see that DPM detected most players, but it also produced false positives such as referees and spectators. Moreover, DPM usually performed poorly under occlusion. After team classification, the precision increases to 97% while a recall of 74% is retained on the basketball dataset. The precision is significantly improved because we utilized jersey colors to discard false positives generated by referees and spectators, who wore clothes of different colors. Figure 3.4 shows some qualitative results.

The tracking algorithm has a 98% precision with an improved recall of 82% on the basketball dataset. This is because the original detections are temporally sparse, and tracking helps to bridge the temporal gap between disjointed detections. As an example, we can see that the tracking system successfully locates an occluded player in Figure 3.6(e), while DPM fails to detect the player in Figure 3.3(e). Another example is Figure 3.5, where we show trajectories of players in a video of 200 frames. Every dot represents the x and y position of a player, and we use different colors for different tracks. We can see that the proposed algorithm works well even under occlusions and cross-overs of players³.

Figure 3.5: Trajectories of players generated by the tracking system. Every dot in the graph represents a detected bounding box, where (x, y) is the center of a bounding box, and t is the time. We use different colors to represent different tracks. We can see that the tracking algorithm is robust even under occlusions and cross-overs of players.

³Videos are also online at http://www.cs.ubc.ca/~vailen/thesis/thesis.shtml.

Figure 3.6: Results of multi-target tracking. We use different colors to represent different tracks. The text inside a bounding box is the track's id (not the player's identity).

Figure 3.7: Performance comparison of the proposed algorithm and the Boosted Particle Filter (BPF) [OTdF+04]. The first figure shows the precision over time (defined in Equation 3.18), and the second figure shows the recall over time (defined in Equation 3.19). We can see that the proposed algorithm has better precision and recall in most frames.

We compared the proposed tracking algorithm with the Boosted Particle Filter (BPF) [OTdF+04, LOL09]. BPF used multiple independent color-based trackers to track multiple players. Since the observation likelihood is not Gaussian, Okuma et al. used a Particle Filter (PF) [PHVG02] to update the state, instead of the Kalman Filter (KF) [Kal60] used in this thesis. Okuma et al. also utilized the player detections generated by the Viola-Jones detector [VJS05] for the proposal distribution in order to reduce the number of particles needed. We compared the performance of BPF and the proposed method on the same hockey dataset released by the authors [OTdF+04, LOL09], which consists of 1000 frames of a broadcast hockey game. Figure 3.7 shows the precision and recall of both the proposed algorithm and BPF. BPF has an average precision of 82.68% and an average recall of 64.14% over the 1000-frame hockey video (red lines). On the other hand, the proposed algorithm has a higher average precision of 91.77% and a higher average recall of 79.71% (blue lines).
From Figure 3.7, we can observe that in most frames the proposed algorithm substantially outperforms BPF in both precision and recall. The proposed algorithm outperforms BPF for several reasons. Firstly, the proposed algorithm uses DPM [FMR08] for player detection, and DPM has been shown to be more accurate and robust than the Viola-Jones detector at detecting the whole human body [FMR08]⁴. Secondly, the proposed algorithm performs bi-partite matching to ensure a one-to-one assignment between detections and tracks. On the other hand, Okuma et al. did not enforce one-to-one matching, and therefore one detection may be assigned to different targets. Finally, the proposed tracking algorithm has a better mechanism for adding new tracks and removing existing tracks when necessary. From our experiments, we conclude that having accurate player detections is the key to the success of multi-target tracking.

3.5 Summary and Future Work

This chapter introduces a multi-target tracking system specifically for tracking sports players in videos taken from a pan-tilt-zoom camera. The tracking system is based on the tracking-by-detection philosophy. We first localize players using a player detector, categorize detections based on team colors, and then group them into tracks of specific players. Instead of using visual cues to distinguish between players, we rely on their short-term trajectories. Experiments show promising tracking results in basketball videos of more than 50,000 frames. The proposed algorithm also outperforms the Boosted Particle Filter (BPF) [OTdF+04] in both precision and recall.

Future Work: Possible areas for future work include offline tracking-by-detection, player re-identification, and using the ground plane information to improve the quality of tracking:

• Offline Tracking-by-Detection: The current system can be run online because the estimation of the current frame only depends on the results of the previous frames. We may be able to get better tracking results if we apply offline methods such as linear programming [JFL08] or MCMC [LTL+09], which process the entire video all together. The benefit of such methods is that information can be propagated both forward and backward along the time-line, whereas the current system can only propagate information forward. Nonetheless, offline tracking algorithms are usually slower than online ones such as the proposed method.

• Player Re-identification: The current system breaks a player's trajectory into two tracks if he exits and re-enters the scene. We can use pedestrian re-identification techniques (e.g., [FBP+10, GT08, JSRS08]) to tackle this problem. Specifically, we can associate the new track with one of the removed tracks that has the most similar appearance. The association can be determined by a classifier learnt from labeled training data, as in [LHN09, KHN10]. Notice that player re-identification may generate longer tracks, but the identities of players still remain unknown.

⁴The Viola-Jones detector [VJS05] is particularly effective at detecting human faces.

• Ground Plane: The current system tracks players in the image coordinate system. The performance can be improved if ground plane information is provided (this can be done by using the techniques discussed in Chapter 4). For example, the size of a player in the image depends on his position on the ground plane. We can use this prior information to remove false positive bounding boxes with incorrect sizes.
In addition, the player's motion is easier to predict on the ground plane than in the image, because the ground-plane representation compensates for the camera's motion [CdL06]. Furthermore, it is easier to resolve occlusions among players because we can use the ground plane information to estimate the depth of players in the image [GROJ04] in order to reason about visibility.

Chapter 4

Homography Estimation

The tracking algorithm described in Chapter 3 estimates the locations and sizes of players in the image coordinate system. However, in many applications, the players' positions in the court coordinate system are needed. For example, when collecting game statistics automatically, the goal is to compute the player's average running speed, running distance, shooting locations, etc. This computation requires tracking players and balls in the court coordinate system, instead of the image coordinate system. In addition, since different players have different motion patterns, knowing the trajectories of players on the court may help us to identify them. These two reasons motivated us to develop a technique to estimate players' locations on the court, given the tracking results of Chapter 3 as the input data.

One possible solution is to estimate the camera's pose and internal parameters from images, i.e., the camera calibration process. If we can locate feature points on two planes in a single image, we can use Tsai's approach [Tsa87] to calibrate the camera, which is implemented in OpenCV [BK08]. One example is Hu et al. [HCWC11], who utilized the planes of the basketball court and backboards to calibrate the camera. Unfortunately, their algorithm is not globally applicable because backboards can only be seen from limited viewpoints. In fact, it is difficult to find two common planes in sports videos, especially in outdoor sports such as soccer or American football. If we could find at least two images where the same plane is seen from different views, Zhang's approach [Zha00] could be used, which is also implemented in OpenCV [BK08]. One restriction is that Zhang's approach requires the internal camera parameters (e.g., focal length) to remain the same across these images. However, since most sports videos are shot from pan-tilt-zoom cameras, the assumption of fixed internal camera parameters does not always hold (e.g., zooming in/out changes focal lengths), and thus Zhang's approach [Zha00] is not applicable.

In this thesis, we take a different approach. Since we only care about the players' locations on the court, we do not need to compute the camera's poses and internal parameters. Instead, we can compute only the homography matrix between the image and the court plane, and then project the players' locations to the court plane, as in [OLL04, HF07, GLW11]. In Section 4.1, we give readers a brief background on homography computation. In Section 4.2, we detail our homography computation algorithm, which is a variant of the Iterated Closest Points (ICP) algorithm [Zha94]. Section 4.3 describes how we adapt a common court model to different stadiums. Section 4.4 presents some experimental results.

4.1 Background

The homography matrix specifies the perspective transformation between two planes. In perspective geometry, we usually represent a 2D point in homogeneous coordinates.
Specifically, let $\mathbf{p} = [x, y]^T$ be a 2D point in an image; the homogeneous representation of $\mathbf{p}$ is $\tilde{\mathbf{p}} = [\tilde{x}, \tilde{y}, \tilde{f}]^T$, where

$$\mathbf{p} = \begin{bmatrix} x \\ y \end{bmatrix} = \frac{1}{\tilde{f}} \begin{bmatrix} \tilde{x} \\ \tilde{y} \end{bmatrix} \qquad (4.1)$$

The relationship between a point $\mathbf{p} = [x, y]^T$ in the first image and its corresponding point $\mathbf{p}' = [x', y']^T$ in the second image can be specified by a perspective transformation, or homography. Let us first represent the first point in homogeneous coordinates by setting $\tilde{f} = 1$, that is, $\tilde{\mathbf{p}} = [x, y, 1]^T$. The corresponding point $\tilde{\mathbf{p}}'$ (represented in homogeneous coordinates) in the second image is a linear transformation of $\tilde{\mathbf{p}}$, i.e.,

$$\tilde{\mathbf{p}}' = \begin{bmatrix} \tilde{x}' \\ \tilde{y}' \\ \tilde{f}' \end{bmatrix} = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = H\tilde{\mathbf{p}} \qquad (4.2)$$

where $H$ is the $3 \times 3$ homography matrix with 8 parameters:

$$H = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & 1 \end{bmatrix} \qquad (4.3)$$

By substituting Equation 4.2 into Equation 4.1, we can specify the relationship between $\mathbf{p}$ and $\mathbf{p}'$ in the image coordinate system by the function $\mathbf{p}' = f(\mathbf{p}; H)$, where

$$\mathbf{p}' = \begin{bmatrix} x' \\ y' \end{bmatrix} = \frac{1}{\tilde{f}'} \begin{bmatrix} \tilde{x}' \\ \tilde{y}' \end{bmatrix} = \frac{1}{h_7 x + h_8 y + 1} \begin{bmatrix} h_1 x + h_2 y + h_3 \\ h_4 x + h_5 y + h_6 \end{bmatrix} = f(\mathbf{p}; H) \qquad (4.4)$$

Given a correspondence $(\mathbf{p}, \mathbf{p}')$, we can see that $f(\mathbf{p}; H)$ is a nonlinear function of the parameters $[h_1, h_2, h_3, h_4, h_5, h_6, h_7, h_8]$, due to the denominator term in Equation 4.4.

Given a pair of images, the goal of homography estimation is to compute the homography matrix $H$ between the two images. The standard approach is to utilize point correspondences $(\mathbf{p}, \mathbf{p}')$ in the two images. From Equation 4.4, we can see that a pair of corresponding points provides two constraints on the homography matrix. Since there are multiple point correspondences between two images, we can estimate the optimal homography matrix by solving a nonlinear least-squares problem with respect to the parameters $[h_1, h_2, h_3, h_4, h_5, h_6, h_7, h_8]$ of the homography matrix:

$$\min_{H} \sum_i \left( f(\mathbf{p}_i; H) - \mathbf{p}'_i \right)^2 \qquad (4.5)$$

where $(\mathbf{p}_i, \mathbf{p}'_i)$ denotes the $i$-th point correspondence. The standard way to solve the above optimization problem is an iterative approach such as the Levenberg-Marquardt algorithm [NW99]. Since there might be incorrect correspondences, we can use RANdom SAmple Consensus (RANSAC) [FB81] to reject outliers and obtain a more accurate estimate. Interested readers can consult Szeliski's book [Sze11] for more details.

Since the computation of the homography matrix is well-established, the remaining challenge is to extract reliable point correspondences from two images. The standard approach is to extract and match interest points such as Scale Invariant Feature Transform (SIFT) features [Low04] in both images in order to establish the correspondences. For example, Hess et al. [HF07] extracted and matched SIFT features in video frames of American football. Gupta et al. [GLW11] also used SIFT features to establish correspondences in hockey videos. However, they discovered that using SIFT features alone was not sufficient because most SIFT features were generated by images of players, and SIFT features generated from the rink image were very sparse. In this thesis, we take a different approach by matching edge points in two images, as in Okuma et al. [OLL04]. This approach not only improves the accuracy and robustness of homography estimation, but also increases the speed. The following section describes the proposed algorithm in detail.
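For reference, the transformation $f(\mathbf{p}; H)$ of Equation 4.4 amounts to one matrix product followed by a perspective division. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def apply_homography(H, pts):
    """Apply the perspective transformation f(p; H) of Equation 4.4.

    H: 3x3 homography; pts: Nx2 array of 2D points. Returns an Nx2 array."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coordinates
    proj = pts_h @ H.T                                # Equation 4.2
    return proj[:, :2] / proj[:, 2:3]                 # divide by the third row
```

In practice, libraries such as OpenCV expose the combined least-squares/RANSAC estimation of Equation 4.5 through `cv2.findHomography(src_pts, dst_pts, cv2.RANSAC)`.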
4.2 Iterated Closest Points

In this thesis, we use the Iterated Closest Points (ICP) algorithm [Zha94] to estimate the homography matrix between two images. ICP was originally designed to find the similarity transformation (rotation, translation, and scaling) between two point clouds. The basic idea is simple. In every iteration, every point in the first point cloud finds its closest point in the second point cloud. These two points establish a single point correspondence between the two point clouds. Given all correspondences, the similarity transformation can be estimated in a least-squares formulation. ICP iterates this procedure until convergence to compute the final transformation. Many variants of ICP are available. Interested readers can consult Rusinkiewicz et al. [RL01] for a complete survey.

The proposed algorithm uses ICP to compute the frame-to-frame homography. Specifically, given a video sequence consisting of $N$ images $\{I_1, I_2, \ldots, I_N\}$, the goal is to estimate the homography $H_{t,t+1}$ between $I_t$ and $I_{t+1}$, where $t$ ranges from 1 to $N-1$. After computing the frame-to-frame homographies, the court-to-frame homographies can be easily derived. Given the first court-to-frame homography $H_1$, the subsequent homographies can be estimated by the recursion:

$$H_2 = H_{1,2} H_1, \quad H_3 = H_{2,3} H_2, \quad H_4 = H_{3,4} H_3, \quad \ldots, \quad H_{t+1} = H_{t,t+1} H_t \qquad (4.6)$$

where we use $H_t$ to represent the homography between the court and frame $t$, and $H_{t,t+1}$ to represent the homography between frame $t$ and frame $t+1$. We can see that the above recursion only requires the first-frame homography $H_1$ as input. The first-frame homography can be obtained from the user's hand annotation, or it can be estimated by matching SIFT features between the first frame and a set of key frames, as in Gupta et al. [GLW11]. In this thesis, we annotate the first-frame homography by hand.

In order to compute the frame-to-frame homography using ICP, we must first extract point clouds from frames $t$ and $t+1$. One possible way to compute the point cloud at frame $t$ is to utilize interest point detectors such as SIFT [Low04]. However, sports fields (e.g., hockey rinks or basketball courts) usually lack texture, and therefore the number of distinctive interest points is usually small. The insufficient number of interest points degrades the quality of homography estimation [GLW11].

Figure 4.1: (a) Model point cloud of a professional basketball court, where red dots represent model points. (b) The point cloud in (a) can be obtained by computing edge points in this court image taken from an overhead view. However, a better approach is to specify model points by hand, because it not only generates a more accurate model, but also ignores edge points specific to different stadiums, e.g., the "Staples Center" logo.
Moreover, by controlling the sampling rate, we can ensure that there is always a sufficient number of model points at frame t. To construct the model point set, one technique is to detect edge points in a basketball image taken from an overhead view, as shown in Figure 4.1b. We could also manually specify model points by hand based on the dimensions. In this thesis, we favor the second approach because it not only generates a more accurate model, but it also ignores edges points specific for different stadiums1, e.g., the “Staple Centre” logo in Figure 4.1b. In addition, since the model needs to be constructed once, the hand annotation time is amortized. 1We instead use a mechanism to learn these stadium-specific points online in Section 4.3. 43 (a) ICP (init) (b) ICP (final) Figure 4.2: The initial and final iteration of applying the Iterated Closest Point (ICP) al- gorithm [Zha94] for homography estimation. We use red dots to represent model points in both images. (a) When processing frame t + 1, we use previous court-to-frame ho- mography Ht to project the model point set M the the image It+1. Notice that there are mis-alignments between model points and the image. (b) We then use ICP to find the optimal transformation that minimize the mis-alignment. This figure shows model points transformed by the homography estimated by the ICP. We project all model points using the court-to-frame homography Ht. Specifically, for every model point mi, we use Equation 4.4 to find its projection m′i = f(mi;Ht). We then ignore those projections that are out of the image boundaries because they do not have any correspondence in the image. Note that the original ICP [Zha94] assumed the second point cloud is a noisy version of the first one, but there is no occlusion. However, in our case, most model points are occluded due to a partial camera view. This makes the problem more challenging than the original setting of ICP. Figure 4.2a shows an example of projecting model points to the frame t + 1 using the previous court-to-frame homography Ht, where red dots represent model points. We can see there are mis-alignments between model points and the image. The goal of ICP is to find the optimal transformation that minimizes the mis-alignments, as shown in Figure 4.2b. In order to extract point cloud in the second image, i.e., frame t + 1, we first convert the color image to gray-scale, and then use the Canny edge detector [Can86] to localize edge points. Other edge detectors such as Sobel or Laplacian of Gaussian (LOG) [GW02] are possible. However, we found that the Canny edge detector is the most reliable for detecting edges of different orientations, and it is more robust under illumination changes [GW02]. The Canny edge detector has two parameters: the high and low threshold for edge 44 intensities. In this thesis, we set the high threshold to 0.030 · Imax and the low threshold to 0.015 · Imax where Imax is the maximum intensity of the gray-scale image. We use a small threshold in order to detect weak edges in motion blur during fast camera movement. Figure 4.3b shows an example of edges detected by the Canny edge detector, where white dots represent edge points. From Figure 4.3b, we can observe that the edge map contains edges from the basketball court, players, spectators, or even the overlaid score bar. In order to improve accuracy and robustness, edges of non-court objects should be discarded. We use a user-defined mask to remove edges of score bar, as shown in Figure 4.3c. 
Since the score bar has a fixed position and size throughout the entire video, the mask only needs to be set up once per video. Then, we use the automatic player detections generated by the DPM detector [FMR08] to remove edges caused by players, i.e., edge points within detected bounding boxes are discarded. Figure 4.3d shows an edge map where edges caused by the score bar and players have been removed. We denote the remaining edge points by a set $\mathcal{E} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_M]$, where $\mathbf{e}_i = [x_i, y_i]^T$ is a 2D point in the image coordinate system, and $M$ is the number of remaining edge points.

In our preliminary experiments, we also tried to use color to remove non-court objects. Specifically, given some training images of the basketball court as in Figure 4.1b, we learnt the pixels' color model by fitting a mixture of Gaussians [Bis06]. At test time, we could use an image segmentation algorithm such as GrabCut [RKB04] to separate court and non-court objects. In practice, this approach did not work as well as we expected. The major problem is that court and non-court objects sometimes share similar colors. For example, the same yellow appears in both players' jerseys and the logo on the basketball court. Therefore, using color to remove edges caused by non-court objects usually leads to unsatisfactory results.

Figure 4.3: Edge map for homography estimation. (a) The raw color image of a basketball video. (b) The edge map computed by the Canny edge detector [Can86]. (c) Since the edge map in (b) contains many non-court objects, we first use a user-defined mask (represented by yellow bounding boxes) to remove edges caused by the score bar. (d) We further discard edges caused by players by using the DPM detection results [FMR08]. Specifically, edges within player detections (represented by yellow bounding boxes) are removed.
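The edge extraction just described can be sketched in a few lines of Python with OpenCV's Canny implementation; the helper name and the boolean-mask representation of the score bar are our own choices.

```python
import cv2
import numpy as np

def court_edge_points(frame, player_boxes, scorebar_mask):
    """Extract the court edge point set E, masking out non-court edges.

    frame: HxWx3 BGR image; player_boxes: list of (x, y, w, h) DPM detections;
    scorebar_mask: HxW boolean array, True where the score bar overlay sits."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    imax = int(gray.max())
    # Low thresholds keep weak, motion-blurred court edges (Section 4.2).
    edges = cv2.Canny(gray, int(0.015 * imax), int(0.030 * imax))
    edges[scorebar_mask] = 0                 # drop score-bar edges
    for (x, y, w, h) in player_boxes:        # drop edges inside player boxes
        edges[y:y + h, x:x + w] = 0
    ys, xs = np.nonzero(edges)
    return np.stack([xs, ys], axis=1)        # Mx2 edge point set E
```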
Given the transformed model points and the edge map, we can then use ICP to find the frame-to-frame homography. More specifically, in frame $t+1$, we first transform the model points to image coordinates using the court-to-frame homography $H_t$ of the previous frame. The transformation is done by using Equation 4.4, and it generates a transformed model point set $\tilde{\mathcal{M}} = [\tilde{\mathbf{p}}_1, \tilde{\mathbf{p}}_2, \ldots, \tilde{\mathbf{p}}_N]$. For image frame $t+1$, we extract edge points by running the Canny edge detector [Can86]. We discard edge points caused by players and the score bar, as described above. The remaining edge point set is $\mathcal{E} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_M]$. Then, for every model point $\tilde{\mathbf{p}}_i$, we find its closest point $\tilde{\mathbf{p}}'_i$ in the edge point set $\mathcal{E}$, where the metric is the Euclidean distance:

$$\tilde{\mathbf{p}}'_i = \operatorname*{argmin}_{\mathbf{e}_k \in \mathcal{E}} \|\tilde{\mathbf{p}}_i - \mathbf{e}_k\| \qquad (4.7)$$

The pair $(\tilde{\mathbf{p}}_i, \tilde{\mathbf{p}}'_i)$ establishes a correspondence between frame $t$ and $t+1$. In order to reject outlier correspondences, we apply two simple tricks. Firstly, correspondences with a large Euclidean distance are discarded. Specifically, we reject a correspondence if $\|\tilde{\mathbf{p}}_i - \tilde{\mathbf{p}}'_i\| > \theta_d$, where $\theta_d$ is the distance threshold. The distance threshold depends on the image resolution and camera motion. In this thesis, we set $\theta_d = 30$ pixels for all experiments. We further reject correspondences with inconsistent edge normal directions, as suggested in [GLW11, OLL04]. Let $\mathbf{n}$ be the edge normal of $\tilde{\mathbf{p}}$, and $\mathbf{n}'$ be the edge normal of $\tilde{\mathbf{p}}'$; the angle between the two edge normals can be computed by

$$\phi = \arccos\left( \frac{\mathbf{n}^T \mathbf{n}'}{\|\mathbf{n}\| \cdot \|\mathbf{n}'\|} \right) \qquad (4.8)$$

We discard correspondences with $\phi > \theta_r$, where $\theta_r$ is the angle threshold. We set $\theta_r = 15°$ in all our experiments.

After obtaining correspondences between frame $t$ and $t+1$, we use the Levenberg-Marquardt algorithm [NW99] to solve the nonlinear least-squares problem in Equation 4.5 to find the optimal homography. In order to improve robustness, we also use RANSAC [FB81] to further reject outliers. Then, we transform the model points using the new homography, and repeat the procedure. We iterate between finding closest points and estimating the homography until convergence. Algorithm 2 details the proposed algorithm.

Algorithm 2 ICP for homography estimation
Input: initial court-to-frame homography $H_1$
Input: model point set $\mathcal{M} = [\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_N]$
1: $\mathcal{M}_1 = \mathcal{M}$
2: for $t = 2 \to T$ do
3:  $\mathcal{E}_t$ = ExtractEdgePoints($I_t$)
4:  $H^{(0)} = H_{t-1}$
5:  $k = 0$
6:  repeat
7:   $\tilde{\mathcal{M}}_t$ = TransformModelPoints($\mathcal{M}_{t-1}$, $H^{(k)}$)  ▷ Equation 4.4
8:   $\tilde{\mathcal{E}}_t$ = FindClosestPoints($\tilde{\mathcal{M}}_t$, $\mathcal{E}_t$)  ▷ Equation 4.7
9:   $H_{t-1,t}$ = EstimateHomography($\tilde{\mathcal{M}}_t$, $\tilde{\mathcal{E}}_t$)  ▷ Equation 4.5
10:  $H^{(k+1)} = H_{t-1,t} H^{(k)}$  ▷ Equation 4.6
11:  $k = k + 1$
12: until convergence
13: $H_t = H^{(k-1)}$
14: $\mathcal{M}_t$ = UpdateModelPoints($\mathcal{M}$, $\mathcal{E}_t$, $H_t$)  ▷ Equation 4.10
15: end for

In practice, we apply two tricks to increase the speed of ICP. One of the speed bottlenecks is the search for closest points in Equation 4.7, which has a time complexity of $O(N^2)$, where $N$ is the number of edge points. Instead, we can construct a k-d tree [Ben75], which reduces the time complexity to $O(N \log N)$. Another trick is to use the bootstrap approach suggested by Stewart et al. [STR03]. Specifically, instead of directly finding the homography transformation by solving the expensive nonlinear least-squares problem in Equation 4.5, we first find a transformation that only involves translation and rotation. This transformation has only four parameters and can be easily estimated by solving a linear least-squares problem. Then, starting from this transformation, we use ICP to find the affine transformation, which has six parameters. Finally, we use the affine transformation to initialize the full version of ICP, which estimates the 8-parameter homography. This strategy significantly reduces the number of iterations required to solve the expensive nonlinear least-squares problem.
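A compressed sketch of the per-frame ICP loop of Algorithm 2 is shown below. It performs only the distance gating of Equation 4.7 (the normal-consistency test and the translation/affine bootstrap are omitted), and it substitutes OpenCV's RANSAC homography fit for the Levenberg-Marquardt solver used in the thesis; the function name and iteration cap are our own.

```python
import cv2
import numpy as np
from scipy.spatial import cKDTree

def icp_homography(H_prev, model_pts, edge_pts, theta_d=30.0, iters=20):
    """Estimate the court-to-frame homography for one frame via ICP.

    H_prev: 3x3 court-to-frame homography from frame t-1;
    model_pts: Nx2 court-coordinate model points; edge_pts: Mx2 image edges."""
    H = H_prev.astype(np.float64).copy()
    tree = cKDTree(edge_pts)                  # k-d tree for closest points
    for _ in range(iters):
        ph = np.hstack([model_pts, np.ones((len(model_pts), 1))]) @ H.T
        proj = ph[:, :2] / ph[:, 2:3]         # Equation 4.4
        dist, idx = tree.query(proj)          # closest edge point (Eq. 4.7)
        keep = dist < theta_d                 # distance gating
        if keep.sum() < 4:
            break                             # too few constraints for H
        # RANSAC homography fit, standing in for Levenberg-Marquardt + RANSAC.
        dH, _ = cv2.findHomography(proj[keep],
                                   edge_pts[idx[keep]].astype(np.float64),
                                   cv2.RANSAC)
        if dH is None:
            break
        H = dH @ H                            # compose as in Equation 4.6
    return H
```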
The transformed edge map has the property: Ẽt(p) = 1 if f(p;Ht) ∈ Et0 otherwise (4.9) where we uniformly sample p = [x, y]T in the court coordinate, Et are edge points ex- 48 (a) raw image (b) transformed edge points (c) score map (d) new model points Figure 4.4: The process of updating model points in homography estimation. (a) The raw video frame. (b) The transformed edge map Et, computed by using the estimated homog- raphy Ht. (c) The score map St that measures the consistency of edge points. (d) The new model points Mt. This is computed by thresholding the score map St and included the original model point setM. tracted by the Canny edge detector, and f(p;Ht) is the homography transformation defined in Equation 4.4. Ẽt can be thought as a back-projection of edge points Et to the court coor- dinate system. If p is a stable point, i.e., it is an edge point caused by the stadium-specific marks or logos, we will have Ẽt(p) = 1 in consecutive frames. On the other hand, if p is not an edge point on the court, we will have Ẽt(p) = 0 most of the time (note that some- times we still have Ẽt(p) = 1 due to players’ edges). Therefore, in order to detect stable points, we maintain a score map St such that St(p) = αẼt(p) + (1− α)St−1(p) (4.10) where the subscript t denotes the time, and 0 < α < 1 is a constant forgetting factor. 49 Initially, we set S0(p) = 0 for all possible points p. After processing several frames, the scores of stable points increases, while scores of non-stable points remain close to zero. Thresholding the score map gives us a new set of model points that are tuned for specific stadium. Figure 4.4 shows the model updating process. Figure 4.4a is the original video frame. Figure 4.4b shows the transformed edge map Ẽt computed by using the estimated court- to-frame homography Ht. We can notice that most players’ edges have been removed by using the detection results. However, there is one player not detected and thus the edges still remain in the edge map. Figure 4.4c shows the score map St computed by using Equation 4.10. Since the score map is a weighted average over previous frames, the edges of players receive low values while the edges of logos have high values. Figure 4.4d shows the new model point setMt. New model points are constructed by thresholding the score map St and include the original model point setM. We can observe that new model points not only have the common lines and circles, but they also have logos and marks of the specific stadium. 4.4 Experimental Results In order to evaluate the performance of the proposed homography estimation algorithm, we first hand-annotated seven basketball video clips. In total, we have 5696 frames with ground-truth homographies. Annotating ground-truth homography took a lot of time. We developed a software that allows users to interpolate homography between frames and in- teractively fix them. However, it still took more than 30 hours to annotate 5696 frames, which are equivalent to a 3-minute video. We measure the performance by computing the average distance between points trans- formed by the ground-truth homographies and points transformed by the estimated homo- graphies. Specifically, we divide the court into grid cells of size 1 × 1 inch. Since the size of the basketball court is 94 × 50 feet, there are in total 1128 × 600 grid cells. For every grid cell, we sample a point p from its centre. These points form a point collection P = [p1,p2, . . .]. 
4.4 Experimental Results

In order to evaluate the performance of the proposed homography estimation algorithm, we first hand-annotated seven basketball video clips. In total, we have 5696 frames with ground-truth homographies. Annotating ground-truth homographies took a lot of time. We developed software that allows users to interpolate homographies between frames and interactively fix them. However, it still took more than 30 hours to annotate the 5696 frames, which are equivalent to a 3-minute video.

We measure the performance by computing the average distance between points transformed by the ground-truth homographies and points transformed by the estimated homographies. Specifically, we divide the court into grid cells of size 1 × 1 inch. Since the size of the basketball court is 94 × 50 feet, there are in total 1128 × 600 grid cells. For every grid cell, we sample a point $\mathbf{p}$ at its centre. These points form a point collection $\mathcal{P} = [\mathbf{p}_1, \mathbf{p}_2, \ldots]$. Given the ground-truth homography $\hat{H}_t$ and the estimated homography $H_t$ at frame $t$, we compute the average error of the estimated homography by

$$\text{Error}_t = \frac{1}{\sum_{\mathbf{p} \in \mathcal{P}} \chi(\mathbf{p}; \hat{H}_t)} \sum_{\mathbf{p} \in \mathcal{P}} \chi(\mathbf{p}; \hat{H}_t) \left\| f(\mathbf{p}; \hat{H}_t) - f(\mathbf{p}; H_t) \right\| \qquad (4.11)$$

where $f(\mathbf{p}; H_t)$ is the homography transformation defined in Equation 4.4, and $\chi(\mathbf{p}; \hat{H}_t)$ is an indicator function that excludes points outside the image boundaries:

$$\chi(\mathbf{p}; \hat{H}_t) = \begin{cases} 1 & \text{if } f(\mathbf{p}; \hat{H}_t) \text{ is within the image boundary} \\ 0 & \text{otherwise} \end{cases} \qquad (4.12)$$

We use the indicator function because ground-truth homographies are only reliable within the image boundary. Measuring distances for those out-of-boundary points would produce misleading results.

The average error of the homography across all 5696 frames was 7.32 pixels. Since the image resolution is 1280 × 720 pixels, the error was about 1% of the image size, which is very small. Figure 4.5 shows the estimation errors of one selected test clip. We can see that errors are usually below 10 pixels, except in the region between the 200th and 400th frames, which has fast camera motion. Fast camera motions usually happen during offensive-defensive changes, and they cause significant motion blur. Motion blur reduces the chance of detecting edges of the court. Even if an edge is successfully detected, its location might be off due to the motion blur. Improving ICP's robustness under motion blur is thus a goal of future research.

Figure 4.5: Accuracy of homography estimation in one basketball test clip. The homography is estimated by using a variant of ICP [Zha94]. The error is measured by the average distance between points transformed by the ground-truth homography and points transformed by the estimated homography, as defined in Equation 4.11.

Figure 4.6: Homography estimation results. We use red dots to show the projection of model points in the image coordinate system using the homography estimated via a variant of ICP [Zha94]. (a) Zoom-in. Only a quarter of the court is visible. In addition, many model points are occluded by players. (b, c, d) Fast camera motion during an offensive-defensive change. We can see severe motion blur in (c). (e) Camera flash. The algorithm is robust under flashes due to the use of the Canny edge detector [Can86]. (f) Another zoom-in view.

Figure 4.6 shows some qualitative results on the basketball videos². We use red dots to show the projection of model points in the image coordinate system using the estimated homography. We can see that the proposed algorithm works in diverse situations. For example, our algorithm works in zoom-in views such as (a) and (f), where only a quarter of the model points are visible, and many of them are occluded by players. Our algorithm also works during fast camera motion, as in (b, c, d), though it is not as accurate. In (c), we can observe severe motion blur, which obscures many edges. Our algorithm still works in this challenging case. We can see a camera flash in (e), which changes the colors of the court. Our algorithm still works well because it relies on player detections instead of color segmentation to remove non-court objects.

Our algorithm successfully estimated homographies for more than 50,000 frames of basketball videos at a reasonable speed. With a Matlab implementation, our algorithm ran at 1.5 seconds per frame on a machine with a 2.8GHz CPU, including I/O and processing time. We expect close to real-time performance with a GPU implementation.

²Videos are available online at http://www.cs.ubc.ca/~vailen/thesis/thesis.shtml.
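For reference, the evaluation metric of Equations 4.11 and 4.12 used throughout these experiments can be computed as follows; the default image size matches the basketball videos, and the function name is ours.

```python
import numpy as np

def homography_error(H_gt, H_est, court_pts, img_w=1280, img_h=720):
    """Average projection error (Eqs. 4.11-4.12) over court grid points."""
    def project(H, pts):
        ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
        return ph[:, :2] / ph[:, 2:3]          # Equation 4.4
    gt = project(H_gt, court_pts)
    est = project(H_est, court_pts)
    # Indicator chi: keep points whose ground-truth projection is in the image.
    keep = ((gt[:, 0] >= 0) & (gt[:, 0] < img_w) &
            (gt[:, 1] >= 0) & (gt[:, 1] < img_h))
    return np.linalg.norm(gt[keep] - est[keep], axis=1).mean()
```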
Since most existing methods used domain-specific knowledge, it is difficult to run their algorithms on the basketball videos. Instead, we ran our algorithm on the hockey dataset of Gupta et al. [GLW11], and compared the performance of the two methods. In [GLW11], Gupta et al. first matched the test frame with a set of key frames by utilizing SFOP interest points [FDS09]. Then, they applied the Direct Linear Transform (DLT) method [HZ03] to estimate the homography between the test frame and the closest key frame. In order to refine the estimate, they sought the best homography that minimized the difference between the model and the test frame. The optimization was performed by minimizing the area between the lines and ellipses of the model and those of the test frame.

Figure 4.7 shows the comparison of Gupta et al. [GLW11] and the proposed approach on a hockey video of 1000 frames. Gupta et al. had an average error of 20.21 pixels on the hockey video (green line). In comparison, the average error of the proposed ICP-based approach was only 13.35 pixels (blue line), which reduced the error by 33%. Since the image resolution is 1920 × 980 pixels, the error was about 1.4% of the image size, similar to the basketball case. This demonstrates that the proposed algorithm is effective in different kinds of sports videos. The speed of ICP is also much faster. ICP ran at about 2 seconds per frame, due to the larger image size. However, [GLW11] took at least 60 seconds to process a frame, which is about 30 times slower than ICP.

Figure 4.7: Comparison of two homography estimation methods in a hockey video clip. The blue line shows the results of ICP [Zha94]. The green line shows the results of Gupta et al. [GLW11]. The error is measured by the average distance between points transformed by the ground-truth homography and points transformed by the estimated homography, as defined in Equation 4.11. We can see that ICP outperforms the method of Gupta et al. [GLW11] in most cases.

The proposed ICP-based approach produced better results than Gupta et al. [GLW11] for several reasons. Firstly, Gupta et al. matched SFOP points [FDS09] to estimate the homography between the test frame and the closest key frame. However, SFOP points are very sparse in images of the hockey rink, and therefore the homography estimate can be inaccurate. The area minimization of Gupta et al. is similar to the proposed ICP approach. However, their approach relies on the detection of lines and ellipses, which is usually unreliable during occlusions. On the other hand, the proposed approach first detects edge points, which are denser and more reliably detected. We also utilize an online model update method to add stadium-specific model points that improve the homography estimation. The use of the standard Levenberg-Marquardt solver [NW99] in ICP, rather than the non-standard sampling-based solver used in [GLW11], also increases the accuracy. These enhancements result in a faster and more accurate homography estimation algorithm than that of Gupta et al. [GLW11].

4.5 Summary and Future Work

This chapter presents an algorithm to estimate the homography matrix between the court and video frames. The homography estimation is solved by using a variant of Iterated Closest Points (ICP). Unlike most existing algorithms that rely on matching robust feature points, we propose matching edge points in two images. In addition, we also introduce a technique to update the model online to accommodate logos and patterns in different stadiums.
Experiments show promising homography estimation results on basketball videos of more than 50,000 frames. In addition, the proposed algorithm outperforms Gupta et al. [GLW11] in both accuracy and speed.

Future Work: Possible areas for future work include automatic initialization, fault detection and recovery, online parameter updates, and non-uniform weighting for robust homography estimation:

• Automatic Initialization: The proposed system requires the first-frame homography, which is currently given by a human annotator. Instead, we could automatically initialize the first-frame homography by utilizing a set of key frames. Specifically, we can first collect a set of key frames that roughly cover all possible viewpoints (usually 10 to 15 frames are sufficient). We ask users to manually annotate the homographies of these key frames. Given the first frame of a test video clip, we first find its closest key frame. Then, we compute the homography between the test frame and its closest key frame by matching SIFT points [Low04], as in Gupta et al. [GLW11]. This approach gives a rough estimate of the first-frame homography for initializing ICP.

• Fault Detection and Recovery: The proposed system cannot detect faults in the homography estimation, and it also lacks the ability to recover from failure cases. We could use the shape and appearance of the estimated court/rink to determine whether ICP fails or not. For example, a failing ICP would generate a very skewed court region whose colors are significantly different from the original. To recover from failure cases, we can use the same technique as in automatic initialization, i.e., finding the closest key frame and matching SIFT points to estimate the homography [GLW11].

• Online Parameter Updates: The proposed system uses the same parameters for all video frames. To get better performance, we could use different parameters in different situations. For example, during fast camera motion, the thresholds of the Canny edge detector should be lower in order to detect weaker edges. The distance threshold $\theta_d$ and angle threshold $\theta_r$ could also be adapted according to the camera's speed and moving direction.

• Non-uniform Weighting for Robust Homography Estimation: We estimate the homography by solving the nonlinear least-squares problem in Equation 4.5, where all point correspondences have the same weight. Using non-uniform weights for different correspondences may improve the accuracy and robustness of the system. For example, an isolated correspondence should receive a higher weight than one with many nearby correspondences. Point correspondences with higher appearance similarity (e.g., similar SIFT descriptors [Low04]) should also receive higher weights than other correspondences.

Chapter 5

Player Identification

Given the tracking and homography results, the next step is to automatically identify sports players in video shots taken from a court view. Notice that face recognition techniques [BBN05, BBBN07, JCF09] are infeasible in this domain for several reasons. Firstly, the image resolution is very low. For example, even in High Definition (HD) videos (with video resolution 1280 × 720 pixels), the average size of the face region is around 20 × 20 pixels¹. Secondly, due to video compression and camera motion, images are also very blurry, which makes facial features indistinguishable. Moreover, distinguishable facial features are only visible in frontal-view images.
Chapter 5

Player Identification

Given the tracking and homography results, the next step is to automatically identify sports players in video shots taken from a court view. Notice that face recognition techniques [BBN05, BBBN07, JCF09] are infeasible in this domain for several reasons. Firstly, the image resolution is very low: even in High Definition (HD) videos (with video resolution 1280×720 pixels), the average size of the face region is around 20×20 pixels (the average size of faces may vary depending on the camera's viewpoint). Secondly, due to video compression and camera motion, images are also very blurry, which makes facial features indistinguishable. Moreover, distinguishable facial features are only visible in frontal-view images. Unfortunately, frontal-view images are very rare because players constantly change their pose and orientation. In fact, it is very difficult even for human beings to recognize the faces of players in these images.

Recognizing the numbers on players' jerseys is possible, but it is still very challenging. Numbers have an even lower image resolution than faces: the average size of the numbers on the jerseys is around 12×12 pixels in high-definition videos (this may vary depending on the camera's viewpoint and the style of the jerseys). In many cases, numbers are also blurry due to fast camera motion and video compression. In addition, since numbers are printed on a deformable surface (the player's jersey), they are usually distorted and broken. A clear upright frontal view of a number is also very rare because players frequently change their poses and orientations. To tackle these problems, we tried using image thresholding to detect candidate regions of numbers and running an off-the-shelf Optical Character Recognition (OCR) system [Tes] on them, as in Ben Shitrit et al. [BBFF11]. However, we found that image thresholding cannot reliably detect numbers, and the off-the-shelf OCR also gave poor results because it was not trained on images with the same font and distortion as broadcast sports videos. Notice that Ben Shitrit et al. [BBFF11] can achieve good identification accuracy by number recognition because they have 8 synchronized cameras, two of which provide close-up views.

We adopt a different approach, ignoring face and number recognition and instead focusing on identifying players as entities. We extract visual features from the entire body of players; these features can capture faces, numbers on the jersey, and skin or hair colors. We also utilize spatial features such as the locations and speeds of players. By combining all these weak features in a novel Conditional Random Field (CRF) [LMP01], the system is able to automatically identify sports players, even in video frames taken from a single pan-tilt-zoom camera. Moreover, in order to reduce the amount of labeled training data required, we adopt a weakly supervised learning approach. Specifically, we utilize play-by-play texts as weak labels to train the identification system. Play-by-play texts are publicly available for most sports games, and we are the first to demonstrate how to use them to train a video analysis system.

This chapter is organized as follows. In Section 5.1, we describe the probabilistic graphical model of the identification system, which is a variant of the Conditional Random Field (CRF) [LMP01]. In Section 5.2, we discuss the visual and spatial features used in the CRF to recognize the players. Section 5.3 details the inference algorithm, which is based on Linear Programming Relaxation [NW99]. Section 5.4 describes how to learn the identification system from the play-by-play texts; the learning algorithm is a variant of the Expectation-Maximization (EM) algorithm [DLR77]. In Section 5.5, we present the experimental results, and Section 5.6 demonstrates some potential applications of the system.

5.1 Probabilistic Graphical Model

Given the tracking results generated in Chapter 3 and the homography estimated in Chapter 4, the next step is to identify the players based on this information.
In order to achieve this goal, we first introduce a model to characterize the relationship between videos and player identities. In this thesis, we utilize the Conditional Random Field (CRF) [LMP01], a kind of undirected probabilistic graphical model. Interested readers can consult the thesis of Murphy [Mur02] or the book of Bishop [Bis06] for an introduction to undirected graphical models. Probabilistic graphical models use a graph to represent a probability distribution over dependent random variables: nodes represent random variables, and the edges between nodes encode the probabilistic dependencies (and, by their absence, the conditional independencies) among variables.

Figure 5.1: Graphical model for training clips. x are detections. Mutex arcs exist between detection identities y in a frame. Temporal arcs exist between y nodes across frames.

In this thesis, we construct a CRF for the entire video clip, as shown in Figure 5.1. The CRF consists of N feature nodes x_{t,d} that represent the observed feature vectors of player detection d at time t, where N is the number of detections in a video clip. The CRF also has N identity nodes y_{t,d} that represent the player identity of detection d at time t, whose values will be estimated given all observed x. Each identity node y_{t,d} is a discrete random variable with |C| possible values, where C is the set of all possible player classes (e.g., |C| = 12 for a team in professional basketball). The number of detections N in a single video clip ranges from 5000 to 30000 in our test dataset.

We first connect each identity node y_{t,d} to the corresponding feature node x_{t,d}. The edge potential is defined as:

\psi_{feat}(y_{t,d}, x_{t,d}) = p(y_{t,d} \mid x_{t,d}, \theta)    (5.1)

where x_{t,d} are feature vectors and \theta are parameters. We model p(y_{t,d} \mid x_{t,d}, \theta) as a multi-class logistic regression [Bis06]:

p(y_{t,d} = k \mid x_{t,d}, \theta) = \frac{\exp(\theta_k^T x_{t,d})}{\sum_j \exp(\theta_j^T x_{t,d})}    (5.2)

In other words, given a feature vector x_{t,d}, we can predict its corresponding identity y_{t,d} by using the above equation. In Section 5.4, we will discuss how to train the parameters \theta in a discriminative way using either fully-labeled or weakly-labeled training data.

We then connect identity nodes y_{t,i} and y_{t+1,j} if they belong to the same track, where tracking is done by the algorithm described in Chapter 3. We use this edge to encourage temporal smoothness of identities within a track. The edge potential is defined as:

\psi_{time}(y_{t,i}, y_{t+1,j}) = \begin{cases} 1 - \epsilon & \text{if } y_{t,i} = y_{t+1,j} \\ \epsilon & \text{otherwise} \end{cases}    (5.3)

where 0 \le \epsilon \le 1 is a fixed parameter that reflects the tracking error. Setting \epsilon to 0 forces all identity nodes y within a track to have the same estimated value, while setting \epsilon to a positive value allows identity nodes y to change values within a track. In our previous work [LTML11], we set \epsilon to a small positive value to account for tracking errors, i.e., the tracking system may follow the wrong player after a cross-over. However, in this thesis, we set \epsilon = 0 because this simplifies the optimization problem and does not affect the identification accuracy in our experiments.

We also connect all pairs of identity nodes y_{t,i} and y_{t,j} that appear at the same time t, introducing an edge potential that enforces mutual exclusion:

\psi_{mutex}(y_{t,i}, y_{t,j}) = \begin{cases} 1 & \text{if } y_{t,i} \neq y_{t,j} \\ 0 & \text{otherwise} \end{cases}    (5.4)

This potential specifies the constraint that a player can appear only once in a frame. For example, if the i-th detection y_{t,i} has been assigned to Bryant, the j-th detection y_{t,j} cannot have the same identity, because Bryant cannot appear twice in a frame.
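To make the three potentials concrete, here is a small numerical sketch of Equations 5.1–5.4, assuming a trained weight matrix theta of shape (|C|, 1250) and, as in the thesis, epsilon = 0 in the temporal potential.

import numpy as np

def feature_potential(x, theta):
    """Multi-class logistic regression p(y | x, theta), Eq. 5.2."""
    scores = theta @ x            # one score per player class
    scores -= scores.max()        # subtract the max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

def temporal_potential(y_t, y_t1, eps=0.0):
    """Eq. 5.3: encourage one identity per track (hard constraint if eps=0)."""
    return 1.0 - eps if y_t == y_t1 else eps

def mutex_potential(y_i, y_j):
    """Eq. 5.4: two detections in one frame cannot share an identity."""
    return 1.0 if y_i != y_j else 0.0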
The joint posterior of the entire CRF is then a product of all potentials:

p(y \mid x, \theta) = \left[ \prod_{t=1}^{|T|} \prod_{d=1}^{|D_t|} \psi_{feat}(y_{t,d}, x_{t,d}) \right] \cdot \left[ \prod_{t=1}^{|T|} \prod_{d=1}^{|D_t|} \psi_{time}(y_{t,d}, y_{t+1,succ(t,d)}) \right] \cdot \left[ \prod_{t=1}^{|T|} \prod_{d=1}^{|D_t|} \prod_{j \neq d} \psi_{mutex}(y_{t,d}, y_{t,j}) \right]    (5.5)

where succ(t, d) is the next node (if it exists) that is connected to y_{t,d} in the track, |D_t| is the number of detections in frame t, and |T| is the total number of frames in the video clip. Given parameters \theta, the goal is to estimate the identities of players y in the whole sequence given all observed features x in a video clip.

5.2 Visual and Spatial Features

The feature vector x_{t,d} consists of two different kinds of cues: visual cues and spatial cues. Visual cues are computed from the image patches of players, as shown in Figure 5.2. These image patches are extracted using the bounding boxes generated by the tracking system in Chapter 3, and they represent the appearance of the player: the face, jersey numbers, skin and hair colors, etc. Spatial cues are computed from the trajectories of players in court coordinates, as shown in Figure 5.3. The trajectories are extracted by combining the multi-target tracking and homography estimation results, and we use them to encode the motion patterns of players. In total, the feature vector x_{t,d} has 1250 dimensions, of which 830 are visual cues and 420 are spatial cues. We detail these visual and spatial features in the following sections.

5.2.1 Visual features

We used three different visual cues to represent the appearance of the player: Maximally Stable Extremal Regions (MSER) [MCUP02], Scale Invariant Feature Transform (SIFT) features [Low04], and RGB color histograms. As described in the previous sections, visual features such as faces and numbers on the jersey are either unreliable or difficult to detect. Therefore, we relied on robust interest point detectors such as MSER and SIFT, and we used global features such as color histograms.

Figure 5.2: Visual features for player identification. We use three visual cues: MSER regions [MCUP02], SIFT points [Low04], and RGB color histograms. (a) Raw image patches extracted from the tracking system. (b) Green ellipses represent detected MSER regions [MCUP02]. (c) Red circles represent detected SIFT interest points [Low04].

MSER regions [MCUP02] are stable segments whose colors are either darker or lighter than their surroundings. They are useful for detecting text in natural scenes because text often has a uniform color and high contrast. In order to use MSER for player identification, we first detected MSER regions [MCUP02], as shown in Figure 5.2b; the MSER regions are the stable segments extracted by the watershed segmentation algorithm [GW02]. Since image segments have different shapes, sizes, and orientations, we have to normalize them before computing the image descriptor. We used the normalization technique suggested by Forssen et al. [FL07] to transform the image segments into square image patches with an upright dominant orientation. For every MSER region, a 128-dimensional SIFT descriptor [Low04] was then computed: the image patch is divided into a 4×4 grid, and an 8-bin orientation histogram of image gradients is computed in each grid cell.
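The following is a rough sketch of this MSER-to-visual-word pipeline, assuming OpenCV and a k-means codebook of shape (300, 128) learnt beforehand (the codebook learning is described next). For brevity, each region is approximated by a keypoint centred on its bounding box rather than by the exact normalization of Forssen et al. [FL07] used in the thesis.

import cv2
import numpy as np

def mser_bag_of_words(patch_gray, codebook):
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(patch_gray)

    sift = cv2.SIFT_create()
    bow = np.zeros(len(codebook))
    for region in regions:
        # Approximate the region by a keypoint so SIFT can compute a 128-D
        # descriptor on the surrounding patch.
        x, y, w, h = cv2.boundingRect(region)
        kp = [cv2.KeyPoint(x + w / 2.0, y + h / 2.0, float(max(w, h)))]
        _, des = sift.compute(patch_gray, kp)
        if des is None:
            continue
        # Quantize to the nearest visual word; presence is binary (Sec. 5.2.1).
        word = int(np.argmin(np.linalg.norm(codebook - des[0], axis=1)))
        bow[word] = 1.0
    return bow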
We used the SIFT descriptor because it is very distinctive and robust under small translation, rotation, and scale changes. We then quantized the SIFT descriptor of the MSER region into one of 300 visual words using a learnt codebook. The codebook is learnt by applying k-means clustering [JD88] to partition training image patches into groups. The final MSER representation of the image is a 300-dimensional bag-of-words bit vector, where a value of 1 indicates the presence of the corresponding visual word in the image, and 0 otherwise.

SIFT interest points [Low04] are stable local patches that are invariant to scale and affine transformation. They can be detected by finding the extremal points in an image pyramid computed using differences of Gaussians; Figure 5.2c shows some examples of detected SIFT points. For every SIFT point, we extracted its 128-dimensional SIFT descriptor from the surrounding region. These regions also need to be rotated and normalized according to their dominant orientation, as described in [Low04]. Similarly to MSER, we quantized each SIFT descriptor into one of 500 visual words; we used more visual words for SIFT because there were more SIFT interest points than MSER regions.

Although colors are weaker features (players of the same team wear the same uniform), skin color may provide some information for player identification. To account for the colors of skin and hair, we also extracted RGB color histograms from the entire image patch. In order to emphasize the central region, we used a Gaussian centered in the image patch to re-weight the color histogram. Specifically, we used 10 bins for each of the R, G, and B channels; we treat the three channels independently, so the full histogram has 30 bins in total.

Figure 5.2 shows an example of MSER regions and SIFT interest points. We see that faces are always blurred, while numbers can only be seen clearly in the last frame. Since we do not segment the player from the background, some MSER regions and SIFT points may be generated from the background. However, these features will not affect the identification results, because they are assigned lower weights in Equation 5.2 due to the use of L1 regularization.

5.2.2 Spatial features

Spatial features were extracted from player trajectories on the court. Given a detected bounding box s = [x, y, w, h]^T, we approximate the player's foot position in the image by p' = [x + w/2, y + h, 1]^T, appending a 1 to make it a homogeneous representation. The player's location on the court can then be computed as p = H^{-1} p', where H is the homography transformation matrix estimated in Chapter 4. To aggregate temporal information, we gather the player's foot locations over the past F frames and compute a spatial histogram. Specifically, we divide the court into a grid of 1×1 meter cells, and the histogram value of the cell the player stands on is incremented by one per frame. Repeating this process over the F frames yields the spatial histogram of the player's most recent trajectory on the court. This spatial histogram is sometimes called a heat-map.
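A minimal sketch of this computation is given below, assuming the homography H maps court coordinates to image coordinates (so its inverse maps image points to the court) and an illustrative court size of roughly 28×15 meters.

import numpy as np

def heatmap(boxes, H, court_w=28, court_h=15, F=100):
    """boxes: the last F bounding boxes [x, y, w, h] of one player's track."""
    hist = np.zeros((court_w, court_h))
    H_inv = np.linalg.inv(H)
    for x, y, w, h in boxes[-F:]:
        foot = H_inv @ np.array([x + w / 2.0, y + h, 1.0])  # homogeneous p'
        cx, cy = foot[:2] / foot[2]                         # court position p
        i, j = int(cx), int(cy)                             # 1x1 m grid cell
        if 0 <= i < court_w and 0 <= j < court_h:
            hist[i, j] += 1
    return hist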
Figure 5.3 shows the spatial histograms of three different types of players in professional basketball. In order to obtain a better visualization, these histograms are computed by aggregating player trajectories over 5000 frames. The first heat-map is extracted from the captain (Kobe Bryant), who is a versatile player. Offensively, he most often stays behind the 3-point arc to command his team members, but we can also see him attack from the low post (close to the basket). The second heat-map is from the center (Andrew Bynum). The center is usually the tallest player in a team, whose job is to grab rebounds and attack the basket from a short distance, and thus he usually stays closer to the basket. In contrast, the third heat-map is generated from a three-point shooter (Sasha Vujacic), whose job is to shoot the ball from a long distance, usually from the three-point line. We can observe that players in different positions have very distinguishable spatial histograms. The difference would be even more obvious in outdoor sports such as soccer and American football. For example, in soccer, we could easily use spatial histograms to classify players into left wing, right wing, midfield, left defence, right defence, forward, and goalkeeper, because they all have distinct heat-maps.

Summary: Given the tracking and homography results, we compute the MSER bag of words, the SIFT bag of words, color histograms, and spatial histograms. In total, the feature vector x_{t,d} has 1250 dimensions, of which 830 are visual cues and 420 are spatial cues.

Figure 5.3: The spatial histograms (heat-maps) of three professional basketball players. Players attack on the left-hand side and defend on the right-hand side. In all images, lighter colors mean higher values in the histogram. (a) Bryant: a versatile player and the captain of the team. (b) Bynum: a center whose job is to grab rebounds and attack the basket from a short distance. (c) Vujacic: a 3-point shooter whose job is to shoot from a long distance.

5.3 Inference

Given the features extracted from the video sequence, our goal is to estimate the identities of players. This can be achieved by maximizing the log posterior in Equation 5.5 with respect to the identity variables y, given that the feature vectors x are observed. One possible solution is to apply sum-product Loopy Belief Propagation (LBP) [FM98] to compute the marginal distributions p(y_{t,d} \mid x); taking the maximum independently for every marginal gives a sub-optimal configuration. We explored this solution in our previous work [LTML11]. However, we noticed that this approach sometimes produces a configuration that violates the mutual exclusion constraints in Equation 5.4 and is thus not favorable. Another choice is to use a Markov Chain Monte Carlo (MCMC) style stochastic local search, as shown in our early work [LTLM11]; however, this approach is very slow.

In order to resolve this issue, we convert the original unconstrained optimization problem in Equation 5.5 into a constrained one. To do so, we first represent the identity variable y by an auxiliary column vector z = [z_1, z_2, \ldots, z_{|C|}]^T, where z_c = 1 if y = c, and 0 otherwise. Then, we re-write the joint posterior as a Gibbs distribution:

p(z \mid x, \theta) = \frac{1}{Z(\theta)} \exp(E(z, x))    (5.6)

where Z(\theta) is the normalization constant and E(z, x) is a log-linear energy function:

E(z, x) = \sum_{t=1}^{|T|} \sum_{d=1}^{|D_t|} f_{t,d}^T z_{t,d}    (5.7)
        + \sum_{t=1}^{|T|} \sum_{d=1}^{|D_t|} z_{t,d}^T G z_{t+1,succ(t,d)}    (5.8)
        + \sum_{t=1}^{|T|} \sum_{d=1}^{|D_t|} \sum_{j \neq d} z_{t,d}^T J z_{t,j}    (5.9)

We represent the feature potentials by a column vector f_{t,d} = [f_{t,d,1}, f_{t,d,2}, \ldots, f_{t,d,|C|}]^T, where f_{t,d,c} = \theta_c^T x_{t,d}. The temporal constraints are encoded by the matrix G, where G(i, i) = \log(1 - \epsilon) and G(i, j) = \log(\epsilon) if i \neq j.
The mutual exclusion constraints are represented by the matrix J, where J(i, i) = -\infty and J(i, j) = 0 if i \neq j. We further simplify the problem by setting \epsilon = 0, so that both Equation 5.8 and Equation 5.9 become hard constraints. We can then re-write the optimization problem as:

\max_z \sum_{t=1}^{|T|} \sum_{d=1}^{|D_t|} f_{t,d}^T z_{t,d}    (5.10)

subject to the following constraints:

\sum_{c=1}^{|C|} z_{t,d,c} = 1    \forall t \in T, d \in D_t    (5.11)

z_{t,d,c} - z_{t+1,succ(t,d),c} = 0    \forall t \in T, d \in D_t, c \in C    (5.12)

z_{t,d,c} + z_{t,j,c} \le 1    \forall t \in T, c \in C, d \neq j    (5.13)

z_{t,d,c} \ge 0    \forall t \in T, d \in D_t, c \in C    (5.14)

where T is the set of all frames in the clip, C is the set of all possible players, and D_t is the set of detections at time t. Equation 5.11 ensures that exactly one variable in z_{t,d} has value 1. Equation 5.12 ensures that z_{t,d,c} and z_{t+1,succ(t,d),c} have the same value, and thus enforces the temporal constraint. Equation 5.13 prevents z_{t,d,c} and z_{t,j,c} from both being 1, which would violate the mutual exclusion constraint. Equation 5.14 ensures that all z are non-negative.

Since solving the above optimization problem with respect to binary variables z is hard, we relax the problem and allow z to take real values. Equation 5.10 then becomes a Linear Programming (LP) problem [NW99] with the linear constraints in Equations 5.11–5.14. This problem can be efficiently solved by standard optimization algorithms [NW99] (in our implementation, we used Matlab's interior-point algorithm to solve the LP). After solving the problem for real-valued z, the player identity y_{t,d} can be obtained by y_{t,d} = \arg\max_c z_{t,d,c}.
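The sketch below assembles this LP for a small example, assuming SciPy. Here f is an (N, |C|) array with one row of feature potentials per detection, tracks is a list of index pairs (d, succ(d)) linking consecutive detections of the same track, and frames is a list of detection-index lists, one per frame.

import numpy as np
from scipy.optimize import linprog

def lp_identities(f, tracks, frames, C):
    N = f.shape[0]                 # variables z flattened to shape (N * C,)
    A_eq, b_eq, A_ub, b_ub = [], [], [], []

    for d in range(N):             # Eq. 5.11: one identity per detection
        row = np.zeros(N * C); row[d * C:(d + 1) * C] = 1
        A_eq.append(row); b_eq.append(1)

    for d, s in tracks:            # Eq. 5.12: temporal constraint
        for c in range(C):
            row = np.zeros(N * C); row[d * C + c] = 1; row[s * C + c] = -1
            A_eq.append(row); b_eq.append(0)

    for dets in frames:            # Eq. 5.13: mutual exclusion within a frame
        for i in dets:
            for j in dets:
                if i < j:
                    for c in range(C):
                        row = np.zeros(N * C)
                        row[i * C + c] = 1; row[j * C + c] = 1
                        A_ub.append(row); b_ub.append(1)

    # linprog minimizes, so negate the objective of Eq. 5.10; the bounds give
    # Eq. 5.14 (the relaxation also caps each z at 1).
    res = linprog(-f.reshape(-1),
                  A_ub=np.array(A_ub) if A_ub else None, b_ub=b_ub or None,
                  A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, 1), method="highs")
    return res.x.reshape(N, C).argmax(axis=1)   # y_{t,d} = argmax_c z_{t,d,c}

This dense formulation is only for exposition; a practical implementation would use sparse constraint matrices, since N can reach 30000 detections per clip.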
5.4 Learning

The goal of learning is to find the best parameters \theta of the feature potentials \psi_{feat}. In other words, we want to train a classifier p(y_{t,d} \mid x_{t,d}, \theta) that maps feature vectors x_{t,d} to player classes y_{t,d}, given some labeled training data. (Notice that we only learn the parameters of the feature potentials, while the parameters of the temporal and mutual exclusion potentials are set by hand.)

5.4.1 Supervised learning

If we can afford a large amount of labeled training data, as in [LTML11], the most straightforward approach is supervised learning. Specifically, we find the best parameters \theta by maximizing the log-posterior of the labeled training data with an L1 regularizer, as in [Ng04]:

\max_\theta \sum_t \sum_d \log p(y_{t,d} \mid x_{t,d}, \theta) - \alpha \|\theta\|_1    (5.15)

where \alpha is a constant. The above optimization problem can be efficiently solved by the algorithm introduced by Schmidt et al. [SvFM09]. Here, we assume that all training data is labeled, i.e., every detected bounding box x_{t,d} has a corresponding label y_{t,d}.

The major problem of supervised learning is that it usually requires a large amount of labeled training data. For example, in our previous work [LTML11], more than 30000 labeled training examples were needed in order to train an accurate identification system. Unfortunately, labeled training data is very expensive to acquire. This thesis instead takes a different approach: starting with a small number of labeled training examples, we use semi-supervised learning to train the identification system, and we then take advantage of the play-by-play text that is available for most professional sports games to further reduce the number of labeled training examples required.

5.4.2 The play-by-play texts

Play-by-play texts are logs of important events in the game. They are usually available in real-time during professional games and can be freely downloaded from the Internet. Figure 5.4 shows an example of play-by-play text downloaded from the National Basketball Association (NBA) website. We see that it shows event types (e.g., "Dunk", "Foul", "Substituting"), player names (e.g., "Bryant", "Bynum"), and timestamps (e.g., "00:42.3").

Figure 5.4: The play-by-play texts of a basketball game. Play-by-plays show the time and people involved in important sport events. Play-by-play information is available for most broadcast sport games and can be freely downloaded from the Internet.

Take the 10th line in Figure 5.4 as an example: it shows that, at elapsed time 10:55, Fisher from the Lakers made a 3-point shot, and he was assisted by Gasol. Play-by-play texts are very rich and provide a lot of information about the game. Since the purpose of this chapter is identity recognition, we focus on a single type of event – the "substitution" event. Substitution events tell us when a player is replaced by another one. Since the starting players are also listed in the play-by-play texts, we can use this information to deduce which players are on the court at any given time.

Another problem is that the timestamps shown in the play-by-play text are measured by the game clock, which does not match the internal clock of the video recording device. Therefore, in order to use play-by-play data, we first have to synchronize it with the sports videos. To do this, we exploit the fact that there is usually an information bar overlaid on most broadcast sports videos showing the game time, team names, and scores (see Figure 5.9 for examples). We then run an off-the-shelf OCR system [Tes] on the overlaid clock region to recognize the game time of every video frame. The OCR system has nearly perfect accuracy because the clock region has a fixed location and the background of the clock is homogeneous.

5.4.3 Weakly supervised learning

The semi-supervised learning algorithm is a variant of Expectation-Maximization (EM) [DLR77]. We start with a small number of randomly labeled training examples. This is achieved by presenting random image patches of players to human annotators to label, where the image patches are generated from the training clips by the Deformable Part Model (DPM) detector [FMR08]. We then find the initial model parameters \theta_0 by maximizing the log-posterior with an L1 regularizer, as in Equation 5.15.

Algorithm 3 EM for weakly supervised learning
1: estimate \theta_0 by using labeled data x_L and y_L
2: k = 0
3: repeat
4:    k = k + 1
5:    \hat{y}_U = LinearProgramming(x_U, \theta_{k-1})    (Eqs. 5.10–5.14)
6:    \theta_k = MultiLogitRegFit(y_L, x_L, \hat{y}_U, x_U)    (Eq. 5.17)
7: until convergence
8: return \theta_k

Then, in the first iteration, we predict the identities of the unlabeled training data by solving the linear programming problem in Equation 5.10 with the initial parameters \theta_0. This is called the E-step, because we compute the expected value of y_{t,d} given the current model. Then, we optimize the log-posterior over all training data:

\max_\theta \sum_{l \in L} \log p(y_l \mid x_l, \theta) + \sum_{u \in U} \sum_{y_u} p(y_u \mid x_u, \theta^{old}) \left[ \log p(y_u \mid x_u, \theta) p(y_u) \right] - \alpha \|\theta\|_1    (5.16)

where L is the set of labeled data, U is the set of unlabeled data, and \theta^{old} are the parameters from the previous iteration. We approximate the summation over y_u by using the prediction \hat{y}_u generated from the LP, which is known as Perceptron Training [Col02]:

\max_\theta \sum_{l \in L} \log p(y_l \mid x_l, \theta) + \sum_{u \in U} \log p(\hat{y}_u \mid x_u, \theta) p(\hat{y}_u) - \alpha \|\theta\|_1    (5.17)

Specifically, for labeled data we use the groundtruth labels y provided by human annotators, and for unlabeled data we use the predicted labels \hat{y} computed in the E-step.
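A compact sketch of this EM loop is shown below, assuming scikit-learn for the L1-regularized multi-class logistic regression and reusing the lp_identities sketch from Section 5.3 as the E-step solver; labels are assumed to be encoded as 0, ..., |C|-1.

import numpy as np
from sklearn.linear_model import LogisticRegression

def em_weak_supervision(xL, yL, xU, tracks, frames, C, iters=10):
    # Initialize theta_0 from the small labeled set (Eq. 5.15).
    clf = LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=500)
    clf.fit(xL, yL)
    for _ in range(iters):
        # E-step: predict identities of unlabeled data by LP (Eqs. 5.10-5.14).
        # The play-by-play prior of Eq. 5.18 can be folded in by setting the
        # potentials of off-court players to a very large negative value.
        f = clf.predict_log_proba(xU)
        yU = lp_identities(f, tracks, frames, C)
        # M-step: refit on labeled plus predicted labels (Eq. 5.17).
        clf.fit(np.vstack([xL, xU]), np.concatenate([yL, yU]))
    return clf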
The above optimization problem can still be efficiently solved by the algorithm introduced by Schmidt et al. [SvFM09]. This step is called the M-step, because we maximize the log-posterior given the expected labels generated from the E-step. We repeat this process until convergence to obtain the best parameters; Algorithm 3 summarizes the EM-based semi-supervised learning algorithm.

In standard semi-supervised learning, we set the prior p(\hat{y}_{t,d}) = 1/|C|, where |C| is the number of all possible players. This means that the predicted label \hat{y}_{t,d} has a uniform probability of being any of the |C| possible players. When the play-by-play text is available, we instead set:

p(\hat{y}_{t,d} = c) = \begin{cases} 1/|P_t| & \text{if } c \in P_t \\ 0 & \text{otherwise} \end{cases}    (5.18)

where P_t \subset C is the set of players who appear in frame t. We call this strategy weakly supervised learning, because we are given additional constraints provided by the play-by-play texts; notice that the majority of the data still remains unlabeled, i.e., there is no one-to-one mapping between y_U and x_U. Taking professional basketball games as an example, a team usually has 12 players on its roster, and therefore |C| = 12; however, at any given time t, only 5 players are on the court, and thus |P_t| = 5. In the experiments, we will show that this play-by-play prior is crucial for training the identification system with a very small number of labeled training examples.
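As a concrete illustration, the following sketch derives the on-court set P_t and the prior of Equation 5.18; substitutions is assumed to be a time-sorted list of (game_time, player_out, player_in) events parsed from the play-by-play text, and starters is the set of players on the court at tip-off.

def on_court(starters, substitutions, t):
    """Return the set P_t of players on the court at game time t."""
    players = set(starters)
    for time, p_out, p_in in substitutions:
        if time > t:
            break                        # events are sorted by game time
        players.discard(p_out)
        players.add(p_in)
    return players

def pbp_prior(roster, players_on_court):
    """Eq. 5.18: uniform over P_t, zero probability for off-court players."""
    return {c: (1.0 / len(players_on_court) if c in players_on_court else 0.0)
            for c in roster}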
5.5 Experimental Results

We evaluated the performance of the player identification system on a basketball dataset of over 20,000 frames, where the video was shot from a single pan-tilt-zoom camera. The system achieves up to 88% identification accuracy in this challenging setting. The experiments also demonstrate that weakly supervised learning requires only 200 labeled images (20 labeled examples per player), together with the play-by-play text, to train the identification system, whereas the supervised learning approach needs at least 20,000 labeled images. The following sections give a more detailed analysis of the experimental results.

5.5.1 Datasets

We used data from the 2010 NBA Championship series (Los Angeles Lakers vs. Boston Celtics) for our experiments. The video consists of different kinds of shots (close-up, medium-distance, commercials), and we extracted clips from only the medium-distance shots (also known as court shots). Although each team has 12 players, only 10 players from the Lakers and 9 players from the Celtics played. As a result, the number of player class labels for the Lakers and Celtics was 10 and 9, respectively. The maximum number of player detections in a frame is 10. Also, using the highly accurate team classifier in Chapter 3, we were able to separate detections into teams and perform identification for each team individually.

We used a labeled training set twice as large as that of [LTML11]. Specifically, we used 153160 detections (77004 of Celtics players, 76156 of Lakers players) across 21306 frames. We evaluated the learnt models on 20 test clips, consisting of a total of 13469 frames with 90001 detections (45367 of Celtics players, 44634 of Lakers players). The test clips varied in length, with the shortest at 300 frames and the longest at 1400 frames; they also varied in level of identification difficulty. Labeling both the training and test sets took considerable effort, on the order of 200+ hours. The size of this training data set is comparable to or larger than others in the weakly labeled learning literature. For example, in previous work on high-resolution movies, Everingham et al. [ESZ06] trained/tested on 49447 faces, and Cour et al. [CSJT09] trained on about 100000 faces.

Figure 5.5: Player identification results for different features. We compare the effectiveness of RGB color histograms, the MSER bag of words [MCUP02], the SIFT bag of words [Low04], spatial features, and a combination of all features. Yellow bars show accuracies for Lakers players, while green bars show accuracies for Celtics players.

5.5.2 Feature comparison

We first compared the effectiveness of different features; Figure 5.5 shows the results. In these experiments, we randomly chose 30000 labeled image patches and used the supervised learning approach in Equation 5.15 to train the identification system. Inference was performed by solving the linear programming problem in Equations 5.10–5.14.

Appearance features were effective in identifying basketball players, with accuracies ranging from 56% to 85%. Among the three appearance features, the SIFT bag of words had the strongest discriminative power, followed by the MSER bag of words and RGB color histograms. Colors were weak in this domain because players of the same team wore uniforms of the same color, and many players had very similar skin and hair colors.

Compared with the three appearance cues, spatial features were not very useful for identifying basketball players, achieving only around 25% identification accuracy. After analyzing the data closely, we found that there are only three distinct types of heat-maps in professional basketball (see Figure 5.3 for examples). Since there are twelve possible players per team, several players may have very similar motion patterns (for example, the power forward and the center usually have similar heat-maps). This caused low accuracy when using only spatial features to identify basketball players. However, we anticipate better accuracy in sports such as soccer, where players have more diverse motion patterns.

Figure 5.6: The fifteen most distinctive MSER visual words after learning the parameters from training data (i.e., those having the highest weights in Equation 5.2). We can see that most of these visual words correspond to numbers on the players' jerseys. (a) The fifteen most distinctive MSER visual words for the Celtics. (b) The fifteen most distinctive MSER visual words for the Lakers.

Since we use an L1 regularizer in the learning phase (Equation 5.17), the majority of features receive zero weights, while the distinctive features have weights of large absolute values. To illustrate what those distinctive features look like, Figure 5.6 shows the fifteen most distinctive MSER features of the Celtics and Lakers after learning the parameters from training data. This figure is plotted by sorting the absolute values of the feature weights (\theta_{d,k} in Equation 5.2) in decreasing order and picking the first fifteen features. We can observe that most MSER features correspond to numbers on the jersey, because they are the most recognizable features of players. Another interesting observation is that most features correspond to single digits. Since we use the bag-of-words representation, the system cannot distinguish between different permutations of digits (e.g., "38" and "83" have identical bag-of-words representations). In order to tackle this problem, doublets could be used to encode spatially local co-occurrences of visual words [SRE+05], as sketched below.
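The following is a minimal sketch of the doublet idea of [SRE+05]: pairs of visual words that co-occur within a small spatial radius become features of their own, so that a "3" next to an "8" is distinguished from the two digits in isolation. Inputs are assumed to be (word_id, x, y) triples for the detected regions of one image patch, and the radius is illustrative.

import itertools
import numpy as np

def doublets(words, radius=20.0):
    pairs = set()
    for (w1, x1, y1), (w2, x2, y2) in itertools.combinations(words, 2):
        if np.hypot(x1 - x2, y1 - y2) < radius:
            pairs.add((min(w1, w2), max(w1, w2)))   # order-independent id
    return pairs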
We leave this improvement for future work.

Figure 5.7: Player identification results of different graphical models. We compare the effectiveness of using only feature potentials (IID), using feature and mutex potentials (MUTEX), using feature and temporal potentials (TEMP), and the full graphical model (FULL).

5.5.3 Model comparison

We then compared the effectiveness of the graphical model. Similarly to the previous experiments, we randomly chose 30000 labeled image patches and used supervised learning to train the identification system, with both appearance and spatial features; Figure 5.7 shows the results. The IID model assumed that there was no connection between any identity nodes y_{t,d}; in other words, we identified players using only the feature potential in Equation 5.1, without the temporal potential in Equation 5.3 and the mutual exclusion potential in Equation 5.4. The identification results were poor, with an accuracy of about 50% for the Lakers and 55% for the Celtics. This demonstrates the challenge of identifying players from a single-view camera: since players constantly change their poses and orientations, it is very difficult to identify players from a single image.

Adding temporal potentials to the model significantly boosts the performance: for the Lakers, the accuracy increased to 81%, while for the Celtics, the accuracy increased to 85%. Temporal edges in the graphical model were constructed by the tracking system. If the identification system has high confidence about the identity of even a single image in a track (e.g., a frontal-view image of a player), this information will be passed forward and backward to the entire track, helping to identify images of obscure views. Also notice that without the mutual exclusion edges, the graphical model becomes a set of independent Hidden Markov Models (HMMs) [Rab89], with one HMM for every single track. In this particular case, inference can be performed very efficiently by using the forward-backward algorithm [Rab89]. If speed is a concern, we suggest using this model (with only temporal and feature potentials) instead of the full graphical model, because the inference is faster.

Adding mutual exclusion potentials to the model improves the accuracy slightly further: for the Lakers, the accuracy increased to 83%, while for the Celtics, the accuracy became 87%. Although the improvements were minor, it is still necessary to include mutual exclusion constraints in order to prevent duplicate identities in a frame. In applications like automatic statistics gathering and the star-camera view, such duplications would significantly reduce quality and should thus be avoided.

5.5.4 Learning comparison

Figure 5.8 compares different learning strategies for player identification. We compared strongly supervised, semi-supervised, and weakly supervised learning with different numbers of labeled training examples. For all learning strategies, we repeated the experiments 10 times to compute the mean and standard deviation. Supervised learning utilized only the labeled training data y_L and x_L to train the model using Equation 5.15. We can observe that the identification accuracy converged slowly as the amount of labeled training data increased: for both the Celtics and the Lakers, accuracies converged only after using more than 20000 labeled training examples.
The semi-supervised approach used both the labeled training data y_L and x_L and the unlabeled training data x_U. This approach used the EM-based algorithm (Algorithm 3) to train the model parameters \theta. Since play-by-play texts were not provided, we set the prior to a uniform distribution over all possible players, i.e., p(\hat{y}_{t,d}) = 1/|C|. We can see that the accuracy of semi-supervised learning converged faster than that of the supervised approach: using only 2000 labeled examples, semi-supervised learning achieved identification accuracy similar to the fully supervised approach.

The weakly supervised approach also used both labeled and unlabeled training data, and it also applied the EM-based algorithm (Algorithm 3). However, the weakly supervised approach additionally took advantage of the constraints provided by the play-by-play texts, using the prior in Equation 5.18. We can observe that weakly supervised learning converged much faster than the semi-supervised approach, achieving accuracy similar to the fully supervised approach using merely 200 labeled examples. This is a significant reduction in labeled training data, compared with the 2000 labels needed for semi-supervised learning and the 20000 labels required for supervised learning.

Figure 5.8: Player identification results of different learning strategies. Both figures show identification results (mean and standard deviation) of strongly supervised, semi-supervised, and weakly supervised learning with different numbers of labeled training examples; note that the x-axis is in log-scale. The strongly supervised approach uses only labeled training data in Equation 5.15. The semi-supervised approach uses both labeled and unlabeled data in Equation 5.17, with a uniform prior over all possible players, i.e., p(\hat{y}_{t,d}) = 1/|C|. The weakly supervised approach also uses both labeled and unlabeled data in Equation 5.17, but with the prior in Equation 5.18 provided by the play-by-play texts. (a) Identification results for the Celtics. (b) Identification results for the Lakers.

For comparison, we also show results of the ambiguous label learning introduced by Cour et al. [CSJT09]. Ambiguous label learning assumes that every training image is associated with a set of labels, one of which is the correct label for the image. Since the facial features in the original implementation of [CSJT09] are not suitable in this case, we replaced them with the proposed appearance and spatial cues. We trained the classifier using all our training images, with corresponding label sets provided by the play-by-play texts (the set size is 1 for labeled images and 5 for unlabeled images). After training, we used the same LP inference algorithm to identify players, instead of the IID approach used in [CSJT09]. We can see that ambiguous label learning performs better than the proposed EM-based algorithm when using a very small amount of labeled data (10–30 labels, or 1–3 labeled images per player), because ambiguous label learning has a more sophisticated mechanism for dealing with weakly labeled data. However, after giving 30–50 labels (3–5 labeled images per player) to initialize the model in EM, the proposed weakly supervised approach quickly outperforms ambiguous label learning. This is because the proposed EM algorithm utilizes the temporal and mutual exclusion potentials to help deal with ambiguous images (e.g., profile views), while ambiguous label learning classifies every image independently in the learning phase ([CSJT09] used tracking to reduce the number of training examples, but did not utilize temporal consistency in the learning algorithm).
Interestingly, performance for the Celtics was better than for the Lakers (87% vs. 83%). Closer examination of the confusion matrix for the Lakers shows that one of the players, Lamar Odom (jersey 7), was often mistaken for his teammates Andrew Bynum (jersey 17) and Ron Artest (jersey 37). All three players not only had similar stature and appearance but also wore jersey numbers that, when not viewed entirely, could lead to mistaken identities.

Figure 5.9 shows tracking and identification results on the basketball sequence (videos are available online at http://www.cs.ubc.ca/~vailen/thesis/thesis.shtml). In this figure, we use green bounding boxes to represent Celtics players and yellow bounding boxes to represent Lakers players. Text within boxes gives the identification results (the player's name), while red boxes highlight misclassifications. We see that the proposed system is able to track and identify multiple basketball players effectively in both frontal and profile views, even under some occlusions. We can also see that, in most cases, jersey numbers are not visible; however, since the identification system utilizes temporal and mutual exclusion information, the identities of players can still be resolved. Figure 5.10 shows tracking and identification results on another basketball sequence. This sequence is slightly harder than the previous one because there are more occlusions among players, but the proposed algorithm still works reasonably well.

Figure 5.9: Automatic tracking and identification results in a broadcast basketball video. Green boxes represent Celtics players, and yellow boxes represent Lakers players. Text within boxes gives the identification results (player's name), while red boxes highlight misclassifications.

Figure 5.10: Automatic tracking and identification results in a broadcast basketball video. Green boxes represent Celtics players, and yellow boxes represent Lakers players. Text within boxes gives the identification results (player's name), while red boxes highlight misclassifications.

Figure 5.11: Application I: automatic statistics collection. We can use the proposed system to automatically gather player statistics. The figure shows trajectories of players, computed by combining multi-target tracking, homography estimation, and player identification.

5.6 Applications

The proposed tracking and identification system has many potential applications. In this thesis, we implement two of them: automatic collection of player statistics and the star-camera view. We believe the system would have even more applications when combined with ball tracking and action recognition systems.

5.6.1 Automatic collection of player statistics

Statistics are very important for analyzing the performance of sports players. For example, in basketball, people record how many points a player makes, how many rebounds a player grabs, how many assists a player has, etc. These statistics are recorded by human annotators during the game. In professional sports such as the National Basketball Association (NBA) or the National Hockey League (NHL), a team may hire 10 to 20 staff to perform this task, depending on how many items they want to collect. With a visual tracking system as proposed in this thesis, this expensive process can be automated using cameras and computers. Given a video sequence, we can first track and identify players. Then, we can use the homography transformation estimated in Chapter 4 to compute the players' trajectories on the basketball court or hockey rink, as in Figure 5.11. This enables us to collect statistics such as the players' running distances, average speeds, and spatial histograms (heat-maps), computed for example as sketched below.
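The following is a minimal sketch of such statistics, assuming traj is an (N, 2) array of court positions in meters (obtained as in Section 5.2.2) sampled at fps frames per second.

import numpy as np

def running_stats(traj, fps=30.0):
    steps = np.linalg.norm(np.diff(traj, axis=0), axis=1)  # meters per frame
    distance = steps.sum()                                 # total meters run
    duration = (len(traj) - 1) / fps                       # elapsed seconds
    return distance, distance / duration                   # and average m/s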
If a ball tracking system is available (e.g., [CTC+09]), we can easily extend the system to estimate ball possession, the number of rebounds grabbed, the number of shots taken, etc. Since the system only requires video taken from a pan-tilt-zoom camera, we can easily apply it to process games played by amateur players, who cannot afford to install an expensive camera network in their stadium. We could also use the system to analyze the styles of legendary players of the past, whose detailed statistics are not available.

Figure 5.12: Application II: star camera. We can use the proposed system to automatically crop video streams to show the star player. This is particularly useful for devices with a small screen, such as smart phones.

5.6.2 Star camera

TV broadcasting companies usually provide video streams taken from a court view. However, on devices such as smart phones that have smaller screens, the court view is unfavorable because players are too small and hard to watch. In these circumstances, it is better to have a view that focuses only on star players – the star-camera view. TV broadcasting companies currently achieve this using two approaches. The first is to install additional cameras in the stadium to follow the chosen star player (additional camera operators are also needed). The other is to hire human annotators to manually crop and enlarge the image region corresponding to the star player. Both tricks are expensive and require a lot of labor.

With automatic tracking and identification, this task can be easily automated by computers, as in Figure 5.12. Since the system can track and identify players, we utilize this information to automatically crop out the image region corresponding to the star player. More interestingly, we can even allow users to choose the player they want to watch. For example, in Figure 5.12, the user can savor the movements of Kobe Bryant; when bored, they can easily switch to other players such as Kevin Garnett or Paul Pierce. This would enable TV broadcast companies to provide a personalized viewing experience to viewers of sports games.

5.7 Summary and Future Work

This chapter introduces techniques for identifying players in broadcast sports videos. The identification problem is formulated as finding the maximum posterior configuration in a Conditional Random Field (CRF). The CRF utilizes both visual and spatial cues, and exploits both temporal and mutual exclusion constraints. We also propose a novel Linear Programming Relaxation algorithm for predicting the best player identification in a video clip. For learning the identification system, we introduce the use of weakly supervised learning with play-by-play texts. Experiments show that weakly supervised learning with merely 200 labels achieves accuracies similar to a strongly supervised approach that requires at least 20000 labels. We also show potential applications of the proposed techniques, such as automatic statistics gathering and the star-camera view.

Future Work: The proposed system utilizes several weak features to identify players. Possible ways to improve the weak features include:

• Visual Features: We can improve the visual features in several ways.
For example, the bag of visual words features currently contain no spatial information; this can be improved by dividing the image into cells and computing a bag of visual words in every individual cell, as in [LSP06]. The current system also does not model the co-occurrence of visual words, which can be tackled by using the doublets introduced by [SRE+05].

• Spatial Features: The proposed system only uses the heat-map (spatial histograms). Other spatial features, such as the player's speed and acceleration, could also be utilized. In addition, we could model the correlation of players' locations. For example, the center of a basketball team usually defends the center of the other team, and therefore the two are usually close to each other.

• Ball Features: If the ball can be tracked in the sports video, then we can derive ball-related features to help identify players. For example, ball possession time can be used to identify players, because certain players (e.g., the point guards in basketball) usually have longer ball possession times than others. In basketball, shooting locations also help to recognize players, because players in different positions usually have different shooting ranges (e.g., three-point shooters are more likely to shoot from the three-point line).

The learning algorithm can be improved in several possible ways:

• Ambiguous Label Learning: The proposed weakly supervised learning algorithm is based on EM, and it requires some labeled training data to initialize the model. When there is no labeled data, we could apply ambiguous label learning [CSJT09] to initialize the model, and then use the proposed EM algorithm to refine it. This approach may boost performance when no labels are available.

• Error Correction of Play-by-Play Texts: The proposed system assumes that there are no errors in the play-by-play texts. In reality, play-by-play texts may contain errors such as typos or incorrect timestamps, and these errors will affect the quality of parameter learning. This issue can be tackled by slightly modifying Equation 5.18: instead of assigning zero probability to players not on the court (according to the play-by-play), we assign them a small probability. Therefore, if the image evidence strongly suggests that a player appears on the court, we have a chance to correct the mistake. This would be particularly useful in hockey videos, because the timestamps of players entering and exiting the scene (time shifts) usually contain many errors.

Chapter 6

Conclusion

In this thesis, we introduce an intelligent system that tackles the challenging problem of tracking and identifying players in broadcast sports videos taken from a single pan-tilt-zoom camera. The system possesses the ability to detect and track multiple players, estimate the homography between video frames and the court, and identify the players.

• Multi-target tracking: The tracking system is based on the tracking-by-detection philosophy. We first localize players using a player detector, categorize detections based on team colors, and then group them into tracks of specific players. Instead of using visual cues to distinguish between players, we rely on their short-term trajectories. Experiments show promising tracking results in basketball videos of more than 50,000 frames. The proposed algorithm also outperforms the Boosted Particle Filter (BPF) [OTdF+04].
• Homography estimation: The homography estimation is solved by using a variant of Iterated Closest Points (ICP). Unlike most existing algorithms, which rely on matching robust feature points, we propose to match edge points in two images. In addition, we introduce a technique to update the model online to accommodate logos and patterns in different stadiums. Experiments show promising homography estimation results in basketball videos of more than 50,000 frames. In addition, the proposed algorithm outperforms Gupta et al. [GLW11] in both accuracy and speed.

• Player identification: The identification problem is formulated as finding the maximum posterior configuration in a Conditional Random Field (CRF). The CRF utilizes both visual and spatial cues, and exploits both temporal and mutual exclusion constraints. We also propose a novel Linear Programming Relaxation algorithm for predicting the best player identification in a video clip. For learning the identification system, we introduce the use of weakly supervised learning with play-by-play texts. Experiments show that weakly supervised learning with merely 200 labels achieves accuracies similar to a strongly supervised approach that requires at least 20000 labels. We also show potential applications of the proposed techniques, such as automatic statistics gathering and the star-camera view.

One disadvantage of the current implementation is the processing speed. The Deformable Part Model (DPM) detector [FMR08] used in this thesis is not efficient and requires 3–5 seconds to process a single image of resolution 1280×720. With the cascade structure [FGM10] and a GPU implementation, the processing time might be reduced to about 1 second per frame; increasing the speed and accuracy of object detectors remains a topic of future research. Data association and multi-target tracking run at a speed of 5–10 frames per second and are thus not the bottleneck of the proposed system. Homography estimation requires about 1 second per frame with our non-optimized MATLAB implementation, and could easily be made real-time with a more efficient GPU implementation. The current identification system builds a CRF for the entire video and is therefore only able to process the video offline; however, we could use a small look-ahead window (30–50 frames) and build a CRF over only those frames without sacrificing much accuracy. With all these improvements, we anticipate that the processing time can be reduced to 1–2 seconds per frame, with the DPM detector remaining the major bottleneck.

Compared with commercial sports video analysis software [Tho11], the system proposed in this thesis has several advantages. Firstly, the proposed system does not require calibrated cameras and thus can be used to analyze videos taken from hand-held cameras or existing video archives; this is particularly valuable for applications such as the star-camera view. Secondly, the proposed system is able to reliably track and identify multiple players in videos taken from a single camera and estimate their locations on the court. This task still remains very challenging in industry [Tho11], where current semi-automatic solutions require multiple calibrated cameras. Thirdly, current commercial software requires special hardware installation and configuration prior to the game [Tho11].
On the other hand, the proposed system can be easily prepared for a new setting, since most parameters of the system are learnt from training data (see the Appendix for a list of free parameters). Specifically, users need to prepare a few labeled images (usually 10–20 images per player are sufficient) and the corresponding play-by-play texts for training the identification system. The player detector and court model can be re-used for the same sport once trained. For videos of different sports, the court model must be specified, while the player detector can remain the same (although, for better accuracy, we recommend training a player detector specific to each sport). In our preliminary experiments, we have also applied the proposed system to videos of hockey games and obtained promising results. We anticipate that the proposed system can be easily generalized to videos of different sports, and that it will automate the process of tracking and identifying players from a single pan-tilt-zoom camera in the near future.

6.1 Future Work

Possible areas for future work include coupled tracking and identification, ball tracking, and action recognition:

• Coupled Tracking and Identification: The proposed system separates tracking and identification into two processes. However, it would be better to combine the two components into a single system where one helps the other. Specifically, we can use a probabilistic tracking algorithm such as data-driven Markov Chain Monte Carlo (MCMC) [ORS04, LTL+09] to generate a candidate track configuration. The candidate track configuration is then scored by the identification system to measure how well it explains the data. Then, based on the current track configuration and score, a new candidate track configuration is sampled to improve the tracking quality. This process can be repeated until convergence to obtain a better tracking and identification result.

• Ball Tracking: Tracking the ball is easier given the results of player tracking and homography estimation. The homography provides a rough estimate of the 3D location of the ball, and player tracking constrains the search space of the ball, because the ball usually travels along a parabolic path between two players. With this information, the problem of basketball or hockey puck tracking is greatly simplified (see Duan's thesis [Dua11] for an example).

• Action Recognition: This thesis only utilized the names of players in the play-by-play texts to train a player identification system. However, the play-by-play also contains other useful information, such as important actions and events. We believe that a similar technique can be applied to train an action recognition system using the play-by-play texts as weak supervision. In this case, we can replace the Scale Invariant Feature Transform (SIFT) and Maximally Stable Extremal Regions (MSER) bags of visual words with a space-time interest point detector such as [Lap05]. Ball-related features could also be used, because many important events and actions involve interactions between the players and the ball (e.g., pass, shoot, rebound, etc.).

Bibliography

[AT04] Ankur Agarwal and Bill Triggs, Tracking Articulated Motion Using a Mixture of Autoregressive Models, Proceedings of the 8th European Conference on Computer Vision, LNCS, vol. 3023, Springer, 2004, pp. 54–65.
[AZ05] Ognjen Arandjelovic and Andrew Zisserman, Automatic Face Recognition for Film Characters Retrieval in Feature-Length Films, Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.
[BBBN07] L. Ballan, Marco Bertini, Alberto Del Bimbo, and Walter Nunziati, Soccer players identification based on visual local features, Proceedings of the 6th ACM International Conference on Image and Video Retrieval, 2007, pp. 258–265.
[BBC+05] Maria-Florina Balcan, Avrim Blum, Patrick Pakyan Choi, John Lafferty, Brian Pantano, Mugizi Robert Rwebangira, and Xiaojin Zhu, Person Identification in Webcam Images: An Application of Semi-Supervised Learning, ICML 2005 Workshop on Learning with Partially Classified Training Data, 2005.
[BBFF11] Horesh Ben Shitrit, Jerome Berclaz, Francois Fleuret, and Pascal Fua, Tracking Multiple People under Global Appearance Constraints, Proceedings of the 13th IEEE International Conference on Computer Vision, 2011.
[BBG06] Michael Beetz, Jan Bandouch, and Suat Gedikli, Camera-based Observation of Football Games for Analyzing Multi-agent Activities, Proceedings of the 5th International Joint Conference on Autonomous Agents and Multiagent Systems, 2006, pp. 42–49.
[BBN05] Marco Bertini, Alberto Del Bimbo, and Walter Nunziati, Player Identification in Soccer Videos, Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2005, pp. 25–32.
[BBN06] Marco Bertini, Alberto Del Bimbo, and Walter Nunziati, Automatic Detection of Player's Identity in Soccer Videos using Faces and Text Cues, Proceedings of the 14th Annual ACM International Conference on Multimedia, 2006.
[Ben75] J. L. Bentley, Multidimensional binary search trees used for associative searching, Communications of the ACM 18 (1975), no. 9, 509–517.
[BEZ09] Patrick Buehler, Mark Everingham, and Andrew Zisserman, Learning sign language by watching TV (using weakly aligned subtitles), Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.
[BFTF11] Jerome Berclaz, Francois Fleuret, Engin Türetken, and Pascal Fua, Multiple Object Tracking Using K-Shortest Paths Optimization, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (2011), no. 9, 1806–1819.
[Bis06] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[BK08] Gary Bradski and Adrian Kaehler, Learning OpenCV, O'Reilly, 2008.
[Bra98] Gary R. Bradski, Computer Vision Face Tracking For Use in a Perceptual User Interface, 1998.
[Can86] J. Canny, A Computational Approach to Edge Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 8 (1986), no. 6, 679–698.
[CdL06] Yizheng Cai, Nando de Freitas, and James J. Little, Robust Visual Tracking for Multiple Targets, Proceedings of the 9th European Conference on Computer Vision, LNCS, vol. 3954, 2006, pp. 107–118.
[CH06] Chih-Chieh Cheng and Chiou-Ting Hsu, Fusion of Audio and Motion Information on HMM-Based Highlight Extraction for Baseball Games, IEEE Transactions on Multimedia 8 (2006), no. 3, 585–599.
[CHH07] Deng Cai, Xiaofei He, and Jiawei Han, Semi-supervised Discriminant Analysis, Proceedings of the 11th IEEE International Conference on Computer Vision, 2007.
[CM02] D. Comaniciu and P. Meer, Mean Shift: A Robust Approach Toward Feature Space Analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002), no. 5, 603–619.
[Col02] Michael Collins, Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, Conference on Empirical Methods in Natural Language Processing, 2002.
[Cox93] Ingemar J. Cox, A Review of Statistical Data Association Techniques for Motion Correspondence, International Journal of Computer Vision 10 (1993), no. 1, 53–66.
[CRM03] D. Comaniciu, V. Ramesh, and P. Meer, Kernel-Based Object Tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003), no. 5, 564–575.
[CSJT09] Timothee Cour, Benjamin Sapp, Chris Jordan, and Ben Taskar, Learning from Ambiguously Labeled Images, Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.
[CSNT10] Timothee Cour, Ben Sapp, Akash Nagle, and Ben Taskar, Talking Pictures: Temporal Grouping and Dialog-Supervised Person Recognition, Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
[CST11] Timothee Cour, Benjamin Sapp, and Ben Taskar, Learning from Partial Labels, Journal of Machine Learning Research 12 (2011), 1501–1536.
[CTC+09] Hua-Tsung Chen, Ming-Chun Tien, Yi-Wen Chen, Wen-Jiin Tsai, and Suh-Yin Lee, Physics-based ball tracking and 3D trajectory reconstruction with applications to shooting location estimation in basketball video, Journal of Visual Communication and Image Representation 20 (2009), 204–216.
[CV95] C. Cortes and V. Vapnik, Support-Vector Networks, Machine Learning 20 (1995), no. 3, 273–297.
[DGA00] Arnaud Doucet, Simon Godsill, and Christophe Andrieu, On sequential Monte Carlo sampling methods for Bayesian filtering, Statistics and Computing 10 (2000), 197–208.
[DL10] T. D'Orazio and M. Leo, A review of vision-based systems for soccer video analysis, Pattern Recognition (2010), in press.
[DLR77] A.P. Dempster, N.M. Laird, and D.B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B 39 (1977), no. 1, 1–38.
[DLS+09] Olivier Duchenne, Ivan Laptev, Josef Sivic, Francis Bach, and Jean Ponce, Automatic Annotation of Human Actions in Video, Proceedings of the 12th IEEE International Conference on Computer Vision, 2009.
[DM00] D. Doermann and D. Mihalcik, Tools and Techniques for Video Performance Evaluation, Proceedings of the 13th International Conference on Pattern Recognition, 2000.
[DT05] Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2005, pp. 886–893.
[Dua11] Xin Duan, Automatic determination of puck possession and location in broadcast hockey video, Master's thesis, University of British Columbia, 2011.
[ESZ06] Mark Everingham, Josef Sivic, and Andrew Zisserman, "Hello! My name is... Buffy" – Automatic Naming of Characters in TV Video, Proceedings of the British Machine Vision Conference, 2006, pp. 899–908.
[ETM03] Ahmet Ekin, A. Murat Tekalp, and Rajiv Mehrotra, Automatic Soccer Video Analysis and Summarization, IEEE Transactions on Image Processing 12 (2003), no. 7, 796–807.
[FB81] Martin A. Fischler and Robert C. Bolles, Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography, Communications of the ACM 24 (1981), no. 6, 381–395.
[FBLF08] Francois Fleuret, Jerome Berclaz, Richard Lengagne, and Pascal Fua, Multicamera People Tracking with a Probabilistic Occupancy Map, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2008), no. 2, 267–282.
[FBP+10] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani, Person Re-Identification by Symmetry-Driven Accumulation of Local Features, Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
[FDS09] W. Förstner, T. Dickscheid, and F. Schindler, Detecting interpretable and accurate scale-invariant keypoints, Proceedings of the 12th IEEE International Conference on Computer Vision, 2009.
[FGM10] Pedro F. Felzenszwalb, Ross B. Girshick, and David McAllester, Cascade Object Detection with Deformable Part Models, Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
[FGMR10] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan, Object Detection with Discriminatively Trained Part Based Models, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (2010), no. 9, 1627–1645.
[FL07] Per-Erik Forssen and David G. Lowe, Shape Descriptors for Maximally Stable Extremal Regions, Proceedings of the 11th IEEE International Conference on Computer Vision, 2007.
[FM98] B.J. Frey and D.J.C. MacKay, A revolution: Belief propagation in graphs with cycles, Advances in Neural Information Processing Systems 10, 1998.
[FMR08] Pedro Felzenszwalb, David McAllester, and Deva Ramanan, A Discriminatively Trained, Multiscale, Deformable Part Model, Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.
[GC08] Weina Ge and Robert T. Collins, Multi-target Data Association by Tracklets with Unsupervised Parameter Estimation, Proceedings of the British Machine Vision Conference, 2008.
[GLW11] Ankur Gupta, James J. Little, and Robert J. Woodham, Using Line and Ellipse Features for Rectification of Broadcast Hockey Video, The Eighth Canadian Conference on Computer and Robot Vision, 2011.
[GQ08] Feng Guo and Gang Qian, Monocular 3D Tracking of Articulated Human Motion in Silhouette and Pose Manifolds, EURASIP Journal on Image and Video Processing 2008 (2008), ID 326896.
[GROJ04] D. Greenhill, J. Renno, J. Orwell, and G.A. Jones, Occlusion Analysis: Learning and Utilising Depth Maps in Object Tracking, Proceedings of the British Machine Vision Conference, 2004.
[GSSD09] Abhinav Gupta, Praveen Srinivasan, Jianbo Shi, and Larry S. Davis, Understanding Videos, Constructing Plots: Learning a Visually Grounded Storyline Model from Annotated Videos, Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.
[GT08] Douglas Gray and Hai Tao, Viewpoint Invariant Pedestrian Recognition with an Ensemble of Localized Features, Proceedings of the 10th European Conference on Computer Vision, LNCS, vol. 5302, 2008, pp. 262–275.
[GW02] R.C. Gonzalez and R.E. Woods, Digital Image Processing, 2nd ed., Prentice Hall, 2002.
[HCWC11] Min-Chun Hu, Ming-Hsiu Chang, Ja-Ling Wu, and Lin Chi, Robust Camera Calibration and Player Tracking in Broadcast Basketball Video, IEEE Transactions on Multimedia 13 (2011), no. 2, 266–279.
[HF07] Robin Hess and Alan Fern, Improved Video Registration using Non-Distinctive Local Image Features, Proceedings of the 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007.
[HF09] Rob Hess and Alan Fern, Discriminatively Trained Particle Filters for Complex Multi-Object Tracking, Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.
[HGC+07] Shaobo Hou, Aphrodite Galata, Fabrice Caillette, Neil Thacker, and Paul Bromiley, Real-time Body Tracking Using a Gaussian Process Latent Variable Model, Proceedings of the 11th IEEE International Conference on Computer Vision, 2007.
[HSC06] Chung-Lin Huang, Huang-Chia Shih, and Chung-Yuan Chao, Semantic Analysis of Soccer Video Using Dynamic Bayesian Network, IEEE Transactions on Multimedia 8 (2006), no. 4, 749–760.
[HWN08] Chang Huang, Bo Wu, and Ramakant Nevatia, Robust Object Tracking by Hierarchical Association of Detection Responses, Proceedings of the 10th European Conference on Computer Vision, LNCS, vol. 5303, Springer, 2008, pp. 788–801.
[HZ03] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed., Cambridge University Press, 2003.
[JCF09] Luo Jie, Barbara Caputo, and Vittorio Ferrari, Who's Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation, Advances in Neural Information Processing Systems 22 (Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, eds.), 2009, pp. 1168–1176.
[JD88] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[JFL08] Hao Jiang, Sidney Fels, and James J. Little, Optimizing Multiple Object Tracking and Best View Video Synthesis, IEEE Transactions on Multimedia 10 (2008), no. 6, 997–1012.
[JSRS08] Omar Javed, Khurram Shafique, Zeeshan Rasheed, and Mubarak Shah, Modeling inter-camera space-time and appearance relationships for tracking across non-overlapping views, Computer Vision and Image Understanding 109 (2008), 146–162.
[Kal60] R.E. Kalman, A new approach to linear filtering and prediction problems, Journal of Basic Engineering 82 (1960), no. 1, 35–45.
[KHN10] Cheng-Hao Kuo, Chang Huang, and Ram Nevatia, Multi-Target Tracking by On-Line Learned Discriminative Appearance Models, Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
[Lap05] Ivan Laptev, On Space-Time Interest Points, International Journal of Computer Vision 64 (2005), no. 2/3, 107–123.
[LHN09] Yuan Li, Chang Huang, and Ram Nevatia, Learning to Associate: HybridBoosted Multi-Target Tracker for Crowded Scene, Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.
[LMP01] John Lafferty, Andrew McCallum, and Fernando Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, Proceedings of the 18th International Conference on Machine Learning (San Francisco, CA, USA), Morgan Kaufmann Publishers Inc., 2001, pp. 282–289.
[LMSR08] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld, Learning realistic human actions from movies, Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.
[LOL09] Wei-Lwun Lu, Kenji Okuma, and James J. Little, Tracking and Recognizing Actions of Multiple Hockey Players using the Boosted Particle Filter, Image and Vision Computing 27 (2009), no. 1–2, 189–205.
[Low04] David G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision 60 (2004), no. 2, 91–110.
[LSP06] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce, Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2006, pp. 2169–2178.
[LTL+09] Jia Liu, Xiaofeng Tong, Wenlong Li, Tao Wang, Yimin Zhang, and Hongqi Wang, Automatic player detection, labeling and tracking in broadcast soccer video, Pattern Recognition Letters 30 (2009), 103–113.
[LTLM11] Wei-Lwun Lu, Jo-Anne Ting, James J. Little, and Kevin P. Murphy, Semi-supervised Learning for Identifying Players from Broadcast Sports Videos with Play-by-Play Information, Tech. Report TR-2011-08, University of British Columbia, 2011.
[LTML11] Wei-Lwun Lu, Jo-Anne Ting, Kevin P. Murphy, and James J. Little, Identifying Players in Broadcast Sports Videos using Conditional Random Fields, Proceedings of the 2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2011.
[MCUP02] J. Matas, O. Chum, M. Urban, and T. Pajdla, Robust Wide Baseline Stereo from Maximally Stable Extremal Regions, Proceedings of the British Machine Vision Conference, vol. 1, 2002, pp. 384–393.
[MLS09] Marcin Marszalek, Ivan Laptev, and Cordelia Schmid, Actions in Context, Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.
[MS04] Krystian Mikolajczyk and Cordelia Schmid, Scale & Affine Invariant Interest Point Detectors, International Journal of Computer Vision 60 (2004), no. 1, 63–86.
[Mur02] Kevin P. Murphy, Dynamic Bayesian Networks: Representation, Inference and Learning, Ph.D. thesis, University of California, Berkeley, 2002.
[Ng04] Andrew Y. Ng, Feature selection, L1 vs. L2 regularization, and rotational invariance, Advances in Neural Information Processing Systems 16, 2004.
[NW99] Jorge Nocedal and Stephen J. Wright, Numerical Optimization, Springer, 1999.
[OLL04] Kenji Okuma, James J. Little, and David G. Lowe, Automatic Rectification of Long Image Sequences, Proceedings of the Asian Conference on Computer Vision, 2004.
[ORS04] Songhwai Oh, Stuart Russell, and Shankar Sastry, Markov Chain Monte Carlo Data Association for General Multiple-Target Tracking Problems, The 43rd International Conference on Decision and Control, vol. 1, 2004, pp. 735–742.
[OTdF+04] Kenji Okuma, Ali Taleghani, Nando de Freitas, James J. Little, and David G. Lowe, A Boosted Particle Filter: Multitarget Detection and Tracking, Proceedings of the 8th European Conference on Computer Vision (Tomáš Pajdla and Jiří Matas, eds.), LNCS, vol. 3021, Springer-Verlag, 2004, pp. 28–39.
[PHVG02] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet, Color-Based Probabilistic Tracking, Proceedings of the 7th European Conference on Computer Vision (A. Heyden, G. Sparr, M. Nielsen, and P. Johansen, eds.), LNCS, vol. 2350, Springer-Verlag, 2002, pp. 661–675.
[PS98] Christos H. Papadimitriou and Kenneth Steiglitz, Combinatorial Optimization: Algorithms and Complexity, unabridged ed., Dover Publications, 1998.
[Rab89] L.R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE 77 (1989), no. 2, 257–286.
[RC05] Christian P. Robert and George Casella, Monte Carlo Statistical Methods, 2nd ed., Springer, 2005.
[RKB04] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake, "GrabCut" – Interactive Foreground Extraction using Iterated Graph Cuts, ACM Transactions on Graphics 23 (2004), no. 3, 309–314.
[RL01] Szymon Rusinkiewicz and Marc Levoy, Efficient Variants of the ICP Algorithm, The 3rd International Conference on 3D Digital Imaging and Modeling, 2001.
[RTMF08] Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman, LabelMe: A Database and Web-Based Tool for Image Annotation, International Journal of Computer Vision 77 (2008), no. 1–3, 157–173.
[SDPR08] Matko Saric, Hrvoje Dujmic, Vladan Papic, and Nikola Rozic, Player Number Localization and Recognition in Soccer Video using HSV Color Space and Internal Contours, International Conference on Signal and Image Processing, 2008.
[SEZ09] Josef Sivic, Mark Everingham, and Andrew Zisserman, "Who are you?" – Learning person specific classifiers from video, Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.
[SO05] David A. Sadlier and Noel E. O'Connor, Event Detection in Field Sports Video Using Audio-Visual Features and a Support Vector Machine, IEEE Transactions on Circuits and Systems for Video Technology 15 (2005), no. 10, 1225–1233.
[SRE+05] J. Sivic, B.C. Russell, A.A. Efros, A. Zisserman, and W.T. Freeman, Discovering objects and their locations in images, Proceedings of the 10th IEEE International Conference on Computer Vision, vol. 1, 2005, pp. 370–377.
[STR03] Charles V. Stewart, Chia-Ling Tsai, and Badrinath Roysam, The Dual-Bootstrap Iterative Closest Point Algorithm With Application to Retinal Image Registration, IEEE Transactions on Medical Imaging 22 (2003), no. 11, 1379–1394.
[SvFM09] Mark Schmidt, Ewout van den Berg, Michael P. Friedlander, and Kevin P. Murphy, Optimizing Costly Functions with Simple Constraints: A Limited-Memory Projected Quasi-Newton Algorithm, International Conference on Artificial Intelligence and Statistics, 2009.
[Sze11] Richard Szeliski, Computer Vision: Algorithms and Applications, Springer, 2011.
[Tes] Tesseract-OCR, http://code.google.com/p/tesseract-ocr/.
[Tho11] Graham Thomas, Sports TV Applications of Computer Vision, ch. 28, pp. 563–579, Springer-Verlag, 2011.
[Tsa87] Roger Y. Tsai, A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses, IEEE Journal of Robotics and Automation 3 (1987), no. 4, 323–344.
[UFF06] R. Urtasun, D.J. Fleet, and P. Fua, 3D People Tracking with Gaussian Process Dynamical Models, Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2006, pp. 238–245.
[VJ01] Paul Viola and M.J. Jones, Rapid Object Detection using a Boosted Cascade of Simple Features, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2001, pp. 511–518.
[VJS05] Paul Viola, Michael J. Jones, and Daniel Snow, Detecting Pedestrians Using Patterns of Motion and Appearance, International Journal of Computer Vision 63 (2005), no. 2, 153–161.
[VRP10] Carl Vondrick, Deva Ramanan, and Donald Patterson, Efficiently Scaling Up Video Annotation with Crowdsourced Marketplaces, Proceedings of the 11th European Conference on Computer Vision, 2010.
[WC08] Xiaolin K. Wei and Jinxiang Chai, Interactive Tracking of 2D Generic Objects with Spacetime Optimization, Proceedings of the 10th European Conference on Computer Vision, LNCS, vol. 5302, Springer, 2008, pp. 657–670.
[WSTS07] Yichen Wei, Jian Sun, Xiaoou Tang, and Heung-Yeung Shum, Interactive Offline Tracking for Color Objects, Proceedings of the 11th IEEE International Conference on Computer Vision, 2007.
[YHJ+05] Qixiang Ye, Qingming Huang, Shuqiang Jiang, Yang Liu, and Wen Gao, Jersey number detection in sports video for athlete identification, Proceedings of SPIE, 2005, pp. 1599–1606.
[YJ06] Alper Yilmaz and Omar Javed, Object Tracking: A Survey, ACM Computing Surveys 38 (2006), no. 4, article 13.
[YRLT09] Jenny Yuen, Bryan Russell, Ce Liu, and Antonio Torralba, LabelMe video: Building a Video Database with Human Annotations, Proceedings of the 12th IEEE International Conference on Computer Vision, 2009.
[ZGL03] Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty, Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions, Proceedings of the 20th International Conference on Machine Learning, 2003.
[Zha94] Zhengyou Zhang, Iterative Point Matching for Registration of Free-form Curves and Surfaces, International Journal of Computer Vision 13 (1994), no. 2, 119–152.
[Zha00] Zhengyou Zhang, A Flexible New Technique for Camera Calibration, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000), no. 11, 1330–1334.
[ZLN08] Li Zhang, Yuan Li, and Ramakant Nevatia, Global Data Association for Multi-Object Tracking Using Network Flows, Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.
[ZXH+09] Guangyu Zhu, Changsheng Xu, Qingming Huang, Yong Rui, Shuqiang Jiang, Wen Gao, and Hongxun Yao, Event Tactic Analysis Based on Broadcast Sports Video, IEEE Transactions on Multimedia 11 (2009), no. 1, 49–66.

Appendix

The following is a list of the system's free parameters. These parameters can be either set manually or learnt from training data using cross-validation.

λ: threshold for the data association cost
σ_d: covariance of the transition probability in the tracking system
σ_e: covariance of the emission probability in the tracking system
M: model points for homography estimation
θ_d: distance threshold for rejecting incorrect point correspondences
θ_r: angle threshold for rejecting incorrect point correspondences
α: forgetting factor for online model update
N_SIFT: number of SIFT visual words
N_MSER: number of MSER visual words
N_RGB: number of channels for the RGB color histograms
N_spatial: number of grid cells for the spatial histogram of the basketball court
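To make the role of these parameters concrete, the following Python listing groups them into a single configuration object. This is purely illustrative: the field names and default values are placeholders invented for this sketch, not values used in our experiments, and in practice most of them would be chosen by cross-validation as noted above.

    from dataclasses import dataclass, field

    @dataclass
    class SystemConfig:
        """Free parameters of the tracking/identification system.

        Defaults are illustrative placeholders; in practice these are set
        manually or learnt from training data via cross-validation.
        """
        assoc_cost_threshold: float = 0.5   # lambda: data association cost threshold
        sigma_d: float = 1.0                # covariance of the transition probability
        sigma_e: float = 1.0                # covariance of the emission probability
        model_points: list = field(default_factory=list)  # M: court model points
        theta_d: float = 5.0                # distance threshold for correspondences
        theta_r: float = 0.5                # angle threshold for correspondences
        alpha: float = 0.9                  # forgetting factor for online model update
        n_sift: int = 500                   # number of SIFT visual words
        n_mser: int = 500                   # number of MSER visual words
        n_rgb: int = 8                      # channels per RGB color histogram
        n_spatial: int = 16                 # grid cells for the court spatial histogram

    # Example: override a single parameter while keeping the rest at defaults.
    config = SystemConfig(alpha=0.95)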
