Automatic initialization for broadcast sports videos rectification by Shervin Mohammadi Tari Bachelor of Science, University of Tehran, 2005 a thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in the faculty of graduate studies (Computer Science) The University Of British Columbia (Vancouver) December 2011 c Shervin Mohammadi Tari, 2011 Abstract Broadcast sport videos can be captured by a static or a moving camera. Unfortunately, the problem with a moving camera is that planar projective transformations (i.e., the homographies) have to be computed for each image frame in a video sequence in order to compensate for camera motions and viewpoint changes. Recently, a variety of methods have been proposed to estimate the homography between two images based on various correspondences (e.g., points, lines, ellipses matchings, and their combinations). Since the frame to frame homography estimation is an iterative process, it needs an initial estimate. Moreover, the initial estimate has to be accurate enough to guarantee that the method is going to converge to an optimal estimate. Although the initialization can be done manually for a couple of frames, manual initialization is not feasible where we are dealing with thousands of images within an entire sports game. Thus, automatic initialization is an important part of the automatic homography estimation process. In this dissertation we aim to address the problem of automatic initialization for homography estimation. More precisely, this thesis comprises four key modules, namely preprocessing, keyframe selection, keyframe matching, and frame-to-frame homography estimation, that work together in order to automatically initialize any homography estimation method that can be used for broadcast sports videos. The first part removes blurry images and roughly estimates the game-field area within remaining salient images and represents them as a set of binary masks. Then, those resulting binary masks are fed into the keyframe selection module in order to select a set of representative frames by using a robust dimensionality reduction method together with a clustering algorithm. The third module finds the closest keyframe to each input frame by taking advantage of three classifiers together with an artificial neural network to combine their results and improve the overall accuracy of the matching process. The last module takes the input frames, their corresponding closest keyframes, and computes the model-to-frame homography for all input frames. Finally, we evaluate the accuracy and robustness of our proposed method on one hockey and two basketball datasets. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Table of Contents ii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Overview of the System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Outline of Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1 Homography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Hierarchy of Linear Transformations . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.1 Euclidean Transformations . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.2 Similarity Transformations . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.3 Affine Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.4 Projective Transformations . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Homography Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4 Feature Extraction and Matching . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4.1 Keypoint Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4.2 Edge Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.5 3 Preprocessing Input Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.1 Blur Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.2 Game-Field Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2.1 Hue-masks as Visual Words . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2.2 Color Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 iii 3.3 Morphological Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Automatic Keyframe Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1 4.2 28 30 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.1.1 . . . . . . . . . . . . . . . . . . . 32 Clustering - Keyframe Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.2.1 35 Conventional PCA vs. Robust PCA K-means vs. X-means . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Keyframe Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5.1 Clustering Labels as the Measure of Similarity . . . . . . . . . . . . . . . . . . 43 5.2 RGB Histograms Grids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.3 Interest Point Detection and Matching . . . . . . . . . . . . . . . . . . . . . . 45 5.4 Classification Fusion - Finding the Closest Keyframe . . . . . . . . . . . . . . . 46 5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 5.5.1 RGB Histograms Grids Evaluation . . . . . . . . . . . . . . . . . . . . 51 5.5.2 Interest Point Detection and Matching Evaluation . . . . . . . . . . . . 52 5.5.3 Classification Fusion Evaluation . . . . . . . . . . . . . . . . . . . . . . 53 6 Frame to Frame Homography Estimation . . . . . . . . . . . . . . . . . . . . . 54 6.1 Initial Homography Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 6.2 ICP Homography Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 7 Experiments and Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . 63 7.1 Basketball Datasets Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 64 7.2 Hockey Dataset Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 8 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.1 75 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 8.1.1 Automatic Keyframe Selection . . . . . . . . . . . . . . . . . . . . . . 76 8.1.2 Automatic Keyframe Matching - Sophisticated Classifiers . . . . . . . . 77 8.1.3 Automatic Keyframe Matching - Ensemble Classification . . . . . . . . 78 8.1.4 Homography Refinement - Correspondence Prioritization . . . . . . . . 78 8.1.5 Homography Refinement - Structural Information Integration . . . . . . 78 8.1.6 Robust ICP - Gaussian Mixture Models for the ICP . . . . . . . . . . . 79 8.1.7 3D Homography Estimation - 3D Point Cloud Alignment . . . . . . . . 79 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv 80 List of Tables Table 7.1 Performance Analysis of SIFT vs. ICP Estimates - 2010 Basketball Dataset . 64 Table 7.2 Performance Analysis of SIFT vs. ICP Estimates - 2011 Basketball Dataset . 65 Table 7.3 Performance Analysis of SIFT vs. ICP Estimates - Failure Cases Removed . 69 v List of Figures Figure 1.1 Homography Estimates Visualization . . . . . . . . . . . . . . . . . . . . . 2 Figure 1.2 Overview of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Figure 2.1 SIFT Keypoint Features and Descriptors . . . . . . . . . . . . . . . . . . . 16 Figure 3.1 Blur Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Figure 3.2 HSV Color Representation - Example . . . . . . . . . . . . . . . . . . . . . 25 Figure 3.3 Hue-based Classification Failure . . . . . . . . . . . . . . . . . . . . . . . 26 Figure 3.4 Hue-based Classification Failure . . . . . . . . . . . . . . . . . . . . . . . 27 Figure 3.5 Color Classification Results . . . . . . . . . . . . . . . . . . . . . . . . . . 28 Figure 3.6 Morphological Refinements . . . . . . . . . . . . . . . . . . . . . . . . . . 29 Figure 4.1 Singular Values Spectrum Representation . . . . . . . . . . . . . . . . . . 33 Figure 4.2 Singular Values Spectrum Representation . . . . . . . . . . . . . . . . . . 34 Figure 4.3 Automatically Selected Keyframes - 2010 Basketball Dataset . . . . . . . . 38 Figure 4.4 Automatically Selected Keyframes - 2011 Basketball Dataset . . . . . . . . 39 Figure 4.5 Automatically Selected Keyframes - Hockey Dataset . . . . . . . . . . . . 40 Figure 5.1 Image Grid Representation . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Figure 5.2 SIFT Matching Inliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Figure 5.3 Classification Fusion Representation . . . . . . . . . . . . . . . . . . . . . 46 Figure 5.4 3-Layer Artificial Neural Network Representation . . . . . . . . . . . . . . 47 Figure 5.5 Classification Evaluation - Color Histograms . . . . . . . . . . . . . . . . . 51 Figure 5.6 Closely Similar Keyframes - 2010 Basketball Dataset . . . . . . . . . . . . 52 Figure 5.7 Closely Similar Keyframes - 2011 Basketball Dataset . . . . . . . . . . . . 52 Figure 5.8 Classification Evaluation - SIFT Inlier Matches . . . . . . . . . . . . . . . 53 Figure 5.9 Classification Evaluation - Classification Fusion . . . . . . . . . . . . . . . 53 Figure 6.1 SIFT Matching All . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 Figure 6.2 SIFT Matching Inliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Figure 6.3 SIFT Homography Estimates Visualization . . . . . . . . . . . . . . . . . . 58 Figure 6.4 DPM Detection Results Visualization . . . . . . . . . . . . . . . . . . . . . 59 Figure 6.5 Stable Points Visualization for 2010 Basketball Dataset . . . . . . . . . . . 60 Figure 6.6 SIFT versus ICP Homography Estimates . . . . . . . . . . . . . . . . . . . 62 vi Figure 7.1 RMSE Curve of The 2010 Basketball Dataset . . . . . . . . . . . . . . . . 64 Figure 7.2 RMSE Curve of The 2011 Basketball Dataset . . . . . . . . . . . . . . . . 65 Figure 7.3 RMSE Curve of The Hockey Dataset . . . . . . . . . . . . . . . . . . . . . 68 Figure 7.4 RMSE Curve of The Hockey Dataset . . . . . . . . . . . . . . . . . . . . . 69 Figure 7.5 Homography Estimation Failures Visualization - Hockey Dataset . . . . . . 70 Figure 7.6 SIFT Matching Failures Visualization - Hockey Dataset . . . . . . . . . . . 71 Figure 7.7 SIFT versus ICP Homography Estimates - The 2010 Basketball Dataset . . 72 Figure 7.8 SIFT versus ICP Homography Estimates - The 2011 Basketball Dataset . . 73 Figure 7.9 SIFT versus ICP Homography Estimates - The Hockey Dataset 74 vii . . . . . . Acknowledgements First and foremost, I would like to express my utmost gratitude to my supervisor Professor James J. Little for his thoughtful supervision, insightful guidance and endless support since the first day we started working together. I will remain eternally in his debt forever. I would like to thank Professor Robert J. Woodham not only for his academic assistance, but also for all his inspirations, thoughtful advices and motivational stories. To me, Professor Woodham exemplifies the best qualities of the consummate teacher and mentor. I would like to express my sincere gratitude to Professor David Lowe for his helpful comments and the fascinating SIFT not only for its undeniable revolutionary impacts on the computer vision research community, but also for the fact that it enables us to create beneficial vision applications with hope to make the people’s lives more convenient. I would like to extend my sincerest thanks to Wei-Lwun Lu and Kenji Okuma whom without their constant technical assistance and support I would have never achieved this work. I also would like to thank all my colleagues and fellow graduate students in the Laboratory of Computational Intelligence for all fruitful discussions we have had during my research, particularly Masrour Zoghi, Misha Fine and Misha Denil. I would like to express my gratitude to my best friends Bavand Keshavarz, Samad Kardan, Koosha Khalvati, Alireza Davoudi, Mohammadreza Tavakoli, Pouya Kamalinejad, Mohammad Mohammadnia, Ankur Gupta, Xin Chen, David Meger, Hamidreza Boostanimehr, Mohammad Khabbaz and Nima Hazar for all their constant support during the past two years. Special thanks go to my undergraduate adviser Professor Morteza Mohammad Noori who always motivated me to learn more. I also would like to thank Dr. Irmtraud M. Meyer for giving me the priceless chance of pursuing my Master’s at UBC. Finally, it is my pleasure to express my deepest gratitude to my parents and my beloved sister whom without their endless love, encouragement, support and sacrifice none of this have been even possible. SHERVIN MOHAMMADI TARI The University of British Columbia December 2011 viii Dedication To my Beloved Parents, From Whom I learned to be Patient, Work Hard and Make Sacrifices to Achieve My Goals and Realize My Dreams, And To All Curious Skeptics Whom Their Skepticism Leads Them to Learn More rather than Happily Living within The Darkness of Misconceptions and False Casualties Forever. ix Chapter 1 Introduction Intelligent video analysis in the sport domain is still an active and challenging area of research in computer vision, since many automatic and semi-automatic systems have been developed to analyze tennis [37], hockey [23], basketball [6], soccer [20], and American football [18] games based on computer vision technologies. Automatic analysis of these games requires understanding of game-specific events together with game rules and game-field (e.g., a basketball court, or a hockey rink) information. In fact, broadcast sport videos comprise all this information in a form of a vast amount of visually encrypted data that has to be interpreted using either manual or automatic annotation procedures. Unfortunately, the manual annotation approach for interpreting the massive visual data requires a significant amount of time and resources. However, using automatic or even semi-automatic systems to do so would definitely save us a lot of time and resources. Moreover, in real-time sport analysis applications, the annotation has to be performed immediately in order to analyze the game as it goes forward. Alas, it is almost impossible to produce annotated data instantly using manual annotations. These issues motivate computer vision researchers to profoundly investigate automatic systems that are able to understand sport scenes, and provide accurate functional analysis. Basically, sport scenes can be captured by a static or a moving camera. In fact, using a moving camera makes the scene understanding problem more complicated. The problem with a moving camera is that planar projective transformations (i.e., the homography) have to be computed for each image frame in a video sequence in order to compensate for camera motions and viewpoint changes. Then, the transformed data can be used to automatically understand and analyze sport video scenes without having any prior knowledge about camera’s poses and its parameters. Recently, a variety of methods have been proposed to estimate the homography between two images based on various correspondences (e.g., points, lines, ellipses matchings, and their combinations). Even though the homography estimation is an almost solved problem, finding the best set of these correspondences to compute the most accurate estimate is still 1 Figure 1.1: Homography Estimates Visualization. An accurate geometric representation of the game-field and its transformed version are illustrated for both basketball and hockey games. Transformed game-field models, which are determined by using the model to frame homography transformation, represented in red color. a challenging research problem. Particularly in the sport domain, the accurate model of the game-field is always available and people are interested in computing the homography between the model and each frame in a video sequence to transform the image coordinate system into the game-field coordinate system (Figure 1.1). Once the model to frame homography is computed for one image, the homography for the rest of images in the sequence can be estimated by computing a frame to frame homography iteratively. This estimation called the model-based homography estimation. To conclude, the game-field coordinate system is beneficial in automatic sport analysis applications where we need to track players [4; 26; 38] and the ball and shooting locations [6], and retrieve players’ locations and trajectories [30] on the game-field. Therefore, estimating the homography transformation automatically plays an important role in developing an automatic sport analysis system. All these reasons motivate us to probe this problem further, and essentially to automate the entire procedure. 2 1.1 Problem Definition As it is mentioned earlier, the homography estimation is an iterative process, so, it needs an initial estimate. Moreover, the initial estimate has to be accurate enough to guarantee that the method is going to converge to an optimal estimate. Although, the initialization can be done manually for a couple of frames, manual initialization is not feasible where we are dealing with thousands of images within an entire sport game. Thus, automatic initialization is an important part of the automatic homography estimation process. In this dissertation we aim to address the problem of automatic initialization for the homography estimation. More precisely, we want to compute the best possible homography estimate between the accurate model of the game-field and the first frame of all clips over an entire game of a sport video. We also want our estimates to have high geometric accuracy and be robust to feature detection errors, and motion blur. Furthermore, we make the following assumptions in order to restrict the scope of the problem: • All images are captured from a single camera that is allowed to pan, zoom and tilt, • All image frames correspond to real games, • The geometric model of the game-field is an accurate representation of the game-field in all images within an entire game. 1.2 Overview of the System This section provides an overview of this thesis work to help readers to have a better understanding of the automatic initialization process for the homography estimation. Hence, in this section, high level descriptions of required tasks are mentioned very concisely followed by a flowchart that demonstrates the scope the entire system. Basically, we break the automatic initialization problem down into smaller pieces, and aim to resolve them to solve the original problem. In a very first step we need to preprocess the data to remove blurry frames from the input frames since they are neither informative nor useful for the automatic initialization procedure. Then, the game-field area in remaining images has to be roughly estimated. The output of this part is an input to the clustering section that produces a set of keyframes. Once all keyframes are automatically selected, the closest keyframe to each input frame needs to be estimated using three different classifiers. Finally, the frameto-frame homography transformations have to be determined. The flowchart presented in 3 Input image sequence Background-Foreground Segmentation (Hue-mask / Color Classification) Morphological Refinements Clustering Image Frames into K Clusters (PCA + K-Means) Set of K Cluster Centers (Representative Frames) Classifier 1: Classifier 2: Find the set of closest keyframes by comparing RGB-Histogram Grids Find the set of closest keyframes by comparing Binary Masks (XOR) Classifier 3: Find the set of closest keyframes by counting the number of inlier SIFT correspondences Homography Estimation Classification Fusion: Finalize classification by combining classifiers’ votes using 3-Neural Network and Logistic Regression (Set of Closest Keyframes for all first frame) Compute Initial Homography Estimates (SIFT + RANSAC + DLT) Refine initial homography estimates using the ICP algorithm Keyframe Matching (Image Classification) Manual Homography Initialization Number of Keyframes K Keyframe Selection Blur Measurement Preprocessing First Frames of all Clips Detection results for all first frames uing the DPM method Final homography estimates for the first frame of all clips Figure 1.2: Illustration of the system overview. This flowchart provides an overview of the system that is designed and developed for the automatic homography initialization, where blue rectangular boxes represent main software components of the system, pink ellipses represent inputs and outputs, rounded rectangular boxes hatched with pink lines denote the data added to the system through manual settings, and finally the rounded rectangular box hatched with blue lines denotes the data produces by our system. Moreover, dashed arrow lines represent the data flow among components of the system, whereas solid arrow lines denote the program flow. 4 Figure 1.2 illustrates the overview of our system. Moreover, inputs of the system are image frames from broadcast video sequences of basketball and hockey games, and the output includes the estimated game-field to image homography for each image superimposed on it as well as accuracy measure curves. 1.3 Outline of Dissertation The automatic initialization system is consisted of three key procedures: keyframe selection, keyframe matching, and the frame to frame homography estimation. In following chapters we are going to precisely discuss how each of these procedures contributes to automatically initialize the homography estimation process. This dissertation is organized as follow: Chapter 2. Background and Related Work. A brief background about the homography computation and basic related concepts are reviewed in this chapter, along with the survey of state of the art homography estimation approaches in the sport domain. Chapter 3. Preprocessing input images. Since blurred image frames are neither informative nor useful in the automatic initialization process, a basic preprocessing is needed to remove such frames from inputs of the system. Moreover, to enhance the homography estimation process we need to have a rough estimate of the game-field area within each individual image frame. The outputs of this process would be a set of binary masks that helps us in keyframe selection, keyframe matching, and homography refinement procedures. This chapter describes how our system performs these tasks in more detail. Chapter 4. Automatic keyframe selection. This chapter demonstrates how we integrate machine learning and computer vision techniques in order to select a set of representative frames from entire images within the input sequence that satisfies our primary constraints. Thus, the objective is to perform the robust PCA on the binary masks of the entire image sequence along with the X-Means clustering algorithm in order to find K keyframes. K centers of clusters are selected, and corresponding images to those centers are chosen as keyframes. Chapter 5. Keyframe matching. In order to estimate the homography accurately, we need to find the most similar frame among all keyframes to the given frame. In this chapter we describe how we find the closest keyframe to each input frame among all K keyframes 5 using color histograms, inlier feature matchings and binary masks. Moreover, a classification fusion technique is used that integrates outputs of these three classifiers through an artificial neural network to come up with more accurate predictions. Chapter 6. Frame to frame homography estimation. After performing the keyframe selection process and finding the closest keyframe to each given frame, we have to compute the homography between these two frames. In this section we describe how we use SIFT keypoint matchings along with the RANSAC and the DLT algorithms in order to compute initial homography estimates. Moreover, we discuss how running the ICP over our initial estimates results in improving the accuracy of the homography estimation process. Chapter 7. Experiments and Evaluation Results. We design a set of experiments to evaluate our method using first images within all clips of two complete basketball games and one period of a hockey game. For all these sequences of images we have the corresponding groundtruth homography. Having groundtruth data and test results, we calculate RMSE (Root Mean Square Error), maximum and median errors in order to evaluate the accuracy of estimated results. Chapter 8. Discussion and Conclusion. This chapter provides a summary of contributions, achievements and obstacles, significant research challenges, open issues for further explorations, potential directions for further research and future works are discussed in this chapter. 6 Chapter 2 Background and Related Work This chapter provides a brief background about the homography transformation and its basic related concepts, feature matching and detection, the robust inlier selection, as well as a comprehensive survey about the state of the art homography estimation methods in the sport domain. More precisely, first the homography transformation and its mathematical representation are introduced. Then, we describe a hierarchy of linear transformations as well as a simple linear method, called DLT algorithm, to compute the homography using only point correspondences. The content provided in these sections is based on [16]. Afterward, in the feature matching and detection section, we discuss interest point detectors and descriptors that we use or have been used in previous works in order to measure the image similarity and compute the homography. Finally, we present a comprehensive review of the state of the art homography estimation methods in the sport domain. 2.1 Homography In this section, we explore the theoretical aspects of the homography and its mathematical representation. In order to provide an intuition about the homography, first we must introduce the homogeneous notation for 2D points and lines, and the 2D projective plane. Considering Rn as a vector space, any 2D point can be represented by a pair of coordinates (x,y) correspond to the vector x = (x, y)T in R2 . Similarly, any line in the 2D plane can be presented by a triple (a,b,c) correspond to the vector l = (a, b, c)T in R3 , as it is originally represented by an equation such as ax + by + c = 0. Moreover, since both ax + by + c = 0 and (ka)x + (kb)y + (kc) = 0 represent the same line for ∀k = 0, vectors (a, b, c)T and k(a, b, c)T related by an overall scaling factor, and can be considered as being equivalent. The equivalence class of vectors based on this equivalence relationship is called homogeneous vector. Furthermore, the set of equivalence classes of vectors in R3 , excluding the vector (0, 0, 0)T , makes the space known as the projective space P2 . Consequently, one can think of points and lines in the P2 space as lines and planes through the origin in the R3 space 7 respectively. Similar to lines, points can be also represented by the set of homogeneous vectors (kx, ky, k)T for ∀k = 0 in R3 such that the homogeneous vector x = (x1 , x2 , x3 )T represents the point (x1 /x3 , x2 /x3 ) in R2 . As a result, points can be also elements of P2 as homogeneous vectors are. After providing the basic intuition about homogeneous points and lines, and the projective plane, now we can introduce the homography. The homography, also known as planar projective transformation, is an invertible linear transformation that maps homogeneous 2D points and lines on the projective plane P2 : Definition 1 A homography is an invertible mapping h from P2 to itself, such that three points x1 , x2 and x3 represented by homogeneous 3-vectors, lie on the same line if and only if mapped points h(x1 ), h(x2 ) and h(x3 ) lie on the same line as well. In other words, the homography is a linear transformation on homogeneous 3-vectors that can be presented by a 3 × 3 non-singular matrix H such that x = Hx: h11 h12 h13 x1 x2 = h21 h22 h23 x2 x3 h31 h32 h33 x3 x1 (2.1) The matrix H is a homogeneous matrix that has eight degrees of freedom because only the ratio of its elements is significant and there are eight independent ratios among all its nine elements. As a result, the homography is a linear transformation that has eight degrees of freedom. Moreover, if (x, y) and (x , y ) represent 2D inhomogeneous points in the world and image plane respectively that can be directly measured from those two planes, the homography can be expressed in inhomogeneous form as follow: x = x1 h11 x + h12 y + h13 = x3 h31 x + h32 y + h33 (2.2) y = x2 h21 x + h22 y + h23 = x3 h31 x + h32 y + h33 (2.3) Obviously, each point correspondence results in two equations for elements of the homography matrix. Therefore, we need at least four exact point correspondences, such that no three of them are collinear, in order to compute the homography. 8 2.2 Hierarchy of Linear Transformations Projective transformations on n dimensions form a mathematical group called the projective linear group P L(n) that is the set of invertible n × n real matrices related by a scalar multiplier. Considering n = 3, P L(3) includes different subgroups such as the affine group, including matrices such that their last row is (0, 0, 1), and the Euclidean group, a subgroup of the affine group including matrices in which the upper left hand 2×2 matrix is orthogonal. In this section we present a hierarchy of these transformations from the most specific one, namely a displacement, to the most general one, which is the projective transformation. 2.2.1 Euclidean Transformations Euclidean transformations, also known as isometries, are a set of transformation in R2 that preserve the Euclidean distance and presented as: x cos θ − sin θ tx x y = sin θ cos θ ty y 1 0 0 1 1 where (2.4) = ±1. The representation apparently implies the Euclidean transformation is a composition of translations and rotations. This class of transformations can also be represented more succinctly in a form of a block matrix as x = HE x = R t 0 1 x (2.5) where R is a 2 × 2 rotation matrix, t is a translation 2-vector, and 0 is null 2-vector. Consequently, Euclidean transformations have three degrees of freedom, two for the translation and one for the rotation, and can be determined by using only two exact point correspondences. 2.2.2 Similarity Transformations One can think of a similarity transformation as a Euclidean transformation combined with an isotropic scaling that preserves the shape rather than the Euclidean distance. As a result, its matrix representation is very closely similar to the Euclidean transformation 9 representation: x x s cos θ −s sin θ tx y = s sin θ s cos θ ty y 1 0 0 1 1 (2.6) where the scalar s is the isotropic scaling factor. Similarly, similarity transformations can be represented in a form of a block matrix: x = HS x = sR t 0T 1 x (2.7) Because of the isotropic scaling factor, similarity transformations have one more degree of freedom compared to Euclidean transformations. Therefore, they have four degrees of freedom, but they can still be computed using only two exact point correspondences. 2.2.3 Affine Transformations An affine transformation, also known as an affinity, is a generalization of a similarity transformation such that a composition of an isotropic scaling and a rotation is replaced with a non-singular linear transformation, followed by a translation. Here is its matrix representation: x a11 a12 tx x y = a21 a22 ty y 1 0 0 1 1 (2.8) and a block matrix form: x = HA x = A t 0T 1 x (2.9) where A is a 2 × 2 non-singular matrix. Intuitively, one can think of A as a combination of rotations and non-isotropic scalings. Apparently, the matrix A has four degrees of freedom, added to two degrees of freedom of the translation results in six degrees of freedom for the affine transformation. Therefore, we need at least three exact point correspondences to compute the affine transformation matrix. 10 2.2.4 Projective Transformations Intuitively a projective transformation, also known as a projectivity, is a generalization of affine transformations. As we discussed in a first section of this chapter, a projectivity can be defined as a non-singular linear transformation of homogeneous coordinates and has eight degrees of freedom. Similarly, it can be presented in a block matrix form: x = HP x = A t vT v x (2.10) where v = (v1 , v2 )T . In order to have a better intuition about the projectivity matrix, one can decompose it into a chain of more specialized transformations that we discussed before. As a result, a projectivity matrix can be written as: H = HS HA HP = sR t K 0T 0T 1 0 I 1 vT 0 v = A t vT v (2.11) where v = 0, A = sRK + tvT is a non-singular matrix, and K is a normalized upper triangular matrix. More precisely, HS is a similarity matrix with four degrees of freedom, HA is an affinity with no translation and only has two degrees of freedom, and finally HP represents a linear transformation called elation that moves the line at infinity and only has two degree of freedom. By adding these numbers together, we will end up with eight degrees of freedom for the projective transformation as we expected. Therefore, we need at least four exact non-collinear point correspondences to compute a projectivity matrix as we mentioned earlier in this chapter. 2.3 Homography Estimation As we discussed in previous sections, we need at least four exact point correspondences, such that no three of them are collinear, in order to compute the homography. However, this is a minimal solution since we always have noise in our correspondences which makes them inexact. Therefore, we usually need more point correspondences and a reliable selection approach to choose the best subset from them to estimate the homography. Moreover, if we are given a set of inexact correspondences between two images that may not be perfectly compatible to any homography estimate, we have to figure out the best possible projective transformation with respect to the given data. It means that we have to find the H that 11 minimizes some cost function. In this particular domain, there are two main classes of cost functions: • Cost functions that minimize the algebraic error • Cost functions that minimize the geometric and statistical image distance In case of estimating homography between two different viewpoints, the reprojection error for both images can be used as a cost function that needs to be minimized: ˆ xi ∀i d(xi , x ˆi )2 subject to x ˆi = Hˆ d(xi , x ˆi )2 + i (2.12) i Optimizing this cost function is a Maximum Likelihood estimation problem that estimates ˆ and determines a set of perfectly matched point correspondences x the homography H ˆi and x ˆi . Having the set of n (n 4) perfectly matched 2D point correspondences, now we are going to introduce the basic linear algorithm called Direct Linear Transformation (DLT) [16] that computes the homography H. Since the homography transformation H is derived from the equation xi = Hxi and both xi and xi are homogeneous vectors, the 3-vectors xi and Hxi are not essentially equal. In fact, they have the same direction but may have different magnitudes. As the 3-vectors xi and Hxi have the same direction, one can express the transformation equation as xi × Hxi = 0 in terms of the vector cross product. Using the latter equation allows us to compute the homography matrix H by solving the following linear system: xi xi h11 h12 h13 xi = yi xi = yi H = h21 h22 h23 zi zi h31 h32 h33 (2.13) We assume that zi = zi = 1 since (xi , yi ) and (xi , yi ) are both measured in the image coordinate system. So h11 xi + h12 yi + h13 Hxi = h21 xi + h22 yi + h23 (2.14) h31 xi + h32 yi + h33 yi (h31 xi + h32 yi + h33 ) − (h21 xi + h22 yi + h23 ) xi × Hxi = 0 ⇒ (h11 xi + h12 yi + h13 ) − xi (h31 xi + h32 yi + h33 ) = 0 xi (h21 xi + h22 yi + h23 ) − yi (h11 xi + h12 yi + h13 ) 12 0 0 0 −xi −1 −yi yi xi yi yi ⇒ xi yi 1 0 0 0 −xi xi xi yi −yi xi yi yi −yi xi xi xi yi xi 0 0 h11 h12 h13 yi h21 h −xi 22 =0 h23 0 h31 h32 h33 (2.15) For the sake of simplicity, one can write this linear equation as Ai h = 0, where Ai is a 3 × 9 transformation matrix such that only its first two rows are linearly independent, and h is a column vector that contains all elements of the homography matrix H. Since the third row of Ai is linearly dependent of its two other rows, we can simply eliminate it and rewrite the previous equation as: ⇒ 0 0 0 −xi −1 −yi xi yi 1 0 0 0 yi xi yi yi yi −xi xi xi yi −xi h11 h12 h13 h21 h22 = 0 h23 h31 h32 h33 (2.16) where Ai is a 2 × 9 matrix. As a result, given a set of n (n ≥ 4) point correspondences form a linear system such that Ah = 0, where A is a matrix built from rows of Ai that derived from each correspondence. In fact, if we use (2.15) to derive linear equations from point correspondences, A is going to be a 2n × 9 matrix made up of stacking up n Ai matrices; however, using (2.16) results in a (2n − 1) × 9 matrix (i.e., it contains 2n − 1 linear equations). Having the set of linear equations, the objective is to find a non-zero solution for h. For both two cases mentioned earlier, the matrix A has rank 8 and has a one-dimensional null-space that gives us a solution for h. Obviously, the linear system derived from (2.16) is going to be over-determined for more than four point correspondences. However, using four correspondences with exact positions does not change the rank of matrix A, so, it maintains its rank and definitely has a one-dimensional null-space that provides an exact 13 solution for h. Even though, having four exact correspondences is almost impossible due to the noise as we discussed earlier. Therefore, typically the over-determined system Ah = 0 does not have an exact solution except h = 0. Nevertheless, an approximate solution for h can be determined that minimizes a proper cost function, as we mentioned earlier in this chapter. To avoid undesirable solutions such as h = 0, one can introduce a further constraint on h , for example h = 1. Consequently, an approximate solution for the over-determined systemAh = 0 can be determined by minimizing Ah with regard to the previous constraint h = 1. It can be shown that the this optimization problem is equivalent to the problem of finding the unit eigenvector of the AT A that corresponds to the smallest eigenvalue of A. This approximation algorithm that estimates the homography for more than four point correspondences is known the basic DLT algorithm. Images can be represented in different coordinate systems. For example, the origin of an image coordinate system is typically assumed to be at its top left hand; however, sometimes it is assumed to be at the center of the image. Having said that, the crucial disadvantage of the basic DLT algorithm is that its homography estimation results highly depend on the choice of an image coordinate system that point correspondences are extracted from. In other words, using different image coordinate systems result in different homography estimates. In fact, it can be shown the computed homography is not invariant to similarity transformation of the image. One solution to resolve these issues is to normalize point correspondences into the same coordinate system first; then, feed normalized correspondences to the DLT algorithm and estimate the homography. The normalization has two major benefits: • Results in more accurate homography estimates • Makes the DLT algorithm invariant to similarity transformations of the image To conclude, DLT algorithm has been dramatically improved recently by using not only point correspondences, but also line and ellipse correspondences [8; 14] to compute the homography. Furthermore, the notion of normalized line correspondences is also developed with respect to point correspondences [8]; however, in order to estimate the homography by a combination of point and line correspondences, we still require a suitable normalization approach with respect to both of them. We review these extensions to the DLT algorithm later in this chapter. 14 2.4 Feature Extraction and Matching Establishing a set of dense correspondences between image pairs is the one of key elements of the homography estimation and an accurate image similarity measurement. In fact, detecting, tracking and matching distinctive visual features can provide a set of dense correspondences that we are looking for. Hence, feature extraction, tracking and matching are substantial components of our system. Moreover, keypoint features and edges are two particular classes of features that we are interested in. From numerous features in the first class, we concisely review SIFT keypoint detector and descriptor and the KLT tracker as well as edges from the second class, since we use them in our method, and they have been also widely used in related approaches. 2.4.1 Keypoint Features Keypoint features have been significantly used in the computer vision for registering images by finding corresponding locations between them. In fact, KLT features are the very first keypoint features that were used to the estimate homography by using the KLT tracker system. According to [35], good features are ones that can be easily tracked, so, they propose a method that detects features and tracks detected features simultaneously based on affine changes in images instead of extracting features and tracking them using separated procedures. It also monitors the quality of detected features during tracking. More precisely, they search for a window of pixels with a certain size that contains sufficient amount of texture (keypoint detection); then, use an extended version of Newton-Raphson method that works under affine transformations to minimize the difference between two consecutive windows and track good features. Apart from KLT, among numerous distinctive keypoint features, we are interested in SIFT (Scale Invariant Feature Transform) [24] features due to their invariance to image scale, rotation, illumination and the camera viewpoint changes. Basically, SIFT uses the difference of Gaussian filters (DoG) to find interest points that are stable in location and scale. As a result, SIFT keypoint detectors produce a dense set of stable distinctive features at all ranges of image scales and locations. Having features extracted from images we need a local descriptor to find their correspondences in different images. Again, a local SIFT descriptor is the best choice for our application since it is invariant to scale and orientation. Basically, a SIFT descriptor is created by computing the gradient of all pixels within a certain area around detected SIFT keypoints (a 16 × 16 window of pixels) at the particular scale level. Moreover, gradient magnitudes for pixels that are far from a center of the window are penalized by using a Gaussian fall-off func- 15 Figure 2.1: SIFT Keypoint Features and Descriptors. SIFT feature points are represented by yellow circles that denote their location, scale and orientation. Green grids are SIFT descriptors at each SIFT keypoint. tion. Then, the gradient orientation histogram with eight bins is produced within each 4x4 quadrant. Afterward, values of resulting histogram are softly distributed to bins of the adjacent histogram in order to reduce the error of finding the wrong dominant orientation. Finally, the resulting 128-dimensial vector of non-negative elements is normalized and result in a vector called the SIFT descriptor (Figure 2.1). In fact, SIFT keypoint detectors and descriptors play an important role in the homography estimation process by finding point correspondences between pairs of images [9; 17; 18; 32]. 2.4.2 Edge Features Edge-like features are well known for providing information about structural properties of surfaces, objects’ boundaries, and indicating occlusion events within an image. Furthermore, connected isolated edge components represent curves, contours and lines in images. Essentially, edge features provide complementary information to interest point descriptors, since edge maps contain a great deal of structural information of images. Similar to keypoint features, we have to detect salient edges within an image, and devise a method to establish accurate edge correspondences between image pairs. To detect reliable edges with different orientation under different illumination conditions, 16 the Canny edge detector [5] is the one that is widely used in computer vision applications. Here is how the Canny edge detector works: first it convolves the grayscale image with a Gaussian filter with a certain standard deviation to make it smooth and reduce potential noises. In fact, larger standard deviations result in more smoothing and better edge detections whereas smaller ones result in less smoothing and better edge localizations. After smoothing, the image is convolved with first derivative of Gaussian filters to compute the gradient image. Since the Canny edge detector detects edges based on local extrema of the first derivative image, it uses two thresholds to determine potential edge components within an image. Intuitively, once edges are extracted from images, they can be simply matched based on their orientation and local appearance. Recently, edge maps have been widely used in homography estimation methods [14; 25; 32]. Moreover, [8; 9; 14] detect line and curves using the Canny edge detector together with the Hough transform [16] and RANSAC [11] to improve homography estimates. 2.5 Previous Work Even though the homography estimation between two images is a well-established problem, finding the best set of correspondences that result in the most accurate estimate is still a challenging research problem. Moreover, a variety of methods have been proposed recently in order to automatically estimate the homography between image pairs within a sport video sequence based on various correspondences, since the game-field coordinate system is incredibly beneficial in automatic sport analysis applications. In this section we provide a comprehensive review of the state of the art homography estimation methods in the sport domain. Okuma et al. [32] addresses the automatic homography estimation for a long image sequence by using the Kanade-Lucas-Tomasi (KLT) tracker system together with RANSAC and the normalized DLT algorithm. In fact, the KLT tracker computes local displacements of image features and finds an initial set of local point correspondences. Then, running RANSAC eliminates outliers and determines a consistent subset of correspondences between successive frames. Afterwards, the normalized DLT algorithm uses the inlier point correspondences to compute an initial homography estimate. Eventually, a model-based correction system is used to fit projected images to the model and minimize projection errors produced by the automatically estimated homography, based on performing a local search for both straight and circular line segments and distinguishes them by their orientation. Therefore, it does not require direct methods of conic detection or line detection. Unfortunately, this method avoids relying on visual distinctive features since they use the KLT tracking system to find 17 generic point correspondences together with edge-based model fitting. Therefore, it requires a manual initialization to compute the homography. In fact, manual initialization of video sequences can be a cumbersome task in the case of processing very long video sequences. Hess and Fern [18] develop a generic approach for the rectification of American football broadcast video sequences based on the work of Okuma et al. [32]; however, they take advantage of non-distinctive visual features as well as distinctive features. Basically, they introduce a model-based approach that addresses the broadcast sport video rectification problem using a static predefined model. In fact, they use point correspondences between each image frame within the input sequence and the model to compute the homography. Moreover, they use invariant local features fed into a matching technique based on local distinctiveness. In more precise terms, matching invariant image features within a video sequence to a set of features assembled from keyframes that represents the model determines video-to-model correspondences. The concept of local distinctiveness enables them to find model matches for almost all visual features in all video frames. They use the Harris-affine detector [29] and SIFT descriptor [24]to detect and describe visual features. Then, the 2NN heuristic is used to compute feature matchings. Since both DLT and least squares algorithm are very sensitive to outliers, a consistent set of inlier correspondences is found by using RANSAC algorithm. Finally, they compute the homography for each frame by using least squares. Later on, Hayet and Piater [17] introduce a method to estimate image-to-field homography for rectification of sport videos. Their method is based on three complementary techniques. First, an initial homography is estimated using line-feature matching (line detection module). Then, they re-estimate the homography with line features tracking (line tracking module). The line tracking module tracks lines whose positions can be estimated using classical registration techniques. It also exploits a robust RANSAC to remove outliers due to false local warpings. Finally, the motion across images is computed (i.e., incremental homography estimation) through point-feature tracking (visual-odometry module). This module tracks feature points between successive frames using the KLT tracking system. Their approach suffers from a couple of limitations. First, it is not robust to motion blur due to fast camera motions. Therefore, under a very fast camera motion, their system loses track until adequate lines become available for the homography re-initialization using the line detection module. Moreover, it is not robust to projection errors either. In more precise terms, whenever in the keypoint features tracking module the uncertainty level that corresponds to the projection of the center of the image point becomes too large, they simply switch back to the line detection module to estimate the homography. Furthermore, their system is not capable of dealing with very large image sequences. 18 Recently, the performance of the DLT algorithm has been dramatically improved by using not only normalized point correspondences, but also line and ellipse correspondences and their combinations. Here we review a couple these extensions to the DLT algorithm. Dubrofsky and Woodham [8] extend the normalized DLT algorithm using the combination of line and point correspondences to improve homography estimation results. Lately, only point correspondences are used in the DLT algorithm in; however, their method extends the DLT algorithm to use both point and line correspondences. They indicate that adding line correspondences to the DLT algorithm produces more accurate results. They also show that direct use of line instead of using their point intersections produces more accurate results. In fact, as their experiments demonstrate, more information produces more accurate results as long as the information is consistent. Consequently, using two line correspondences and their intersection point produces lower error than using just two line correspondences. However, as their extension is point-centric (i.e., lines are normalized to fit with the normalization used for points); it would be most useful if there are more point correspondences rather than line correspondences in input images. Later on, Gupta et al. [14] extend the linear method for point matches (the DLT algorithm), and introduce a new approach that combines point, line and ellipse correspondences in order to estimate the homography between image pairs. They incorporate an appearance model and a geometric model of the hockey rink to robustly estimate the homography over the time. They also use an optimization method called DIRECT to minimize the area-based geometric error measure to refine initial estimates. Their optimization was based on minimizing the residual area between the projected model of the rink and detected shapes. Even though using line and ellipse features together with point correspondences make the homography estimate more robust to occlusion and motion blur, their approach is too sensitive to errors in detecting those shapes. In fact, high sensitivity to detection errors causes severe problems in the homography estimation process when these shapes, as a part of geometric model of the game-field, are not visible or too difficult to detect (e.g., basketball broadcast videos). As methods for the keypoint matching between image pairs are much advanced and produce very reliable results, Fan et al. [9] propose a novel approach that automatically computes a line matching between a pair of images based on a set of supporting point correspondences. Basically, they use a set of temporary point correspondences that are obtained by using keypoint matching methods, and then use those point correspondences to improve line matching results. More precisely, their approach investigates an affine invariant from one line and two points, and then uses that invariant to match lines (i.e., calculate the similarity of two line segments). As they report, point correspondences for computing the invariant can be acquired by using any keypoint matching method. In fact, they use the DAISY descriptor [36] to describe keypoints and utilize the Nearest Neighbor Distance Ratio (NNDR) to match 19 them. They also define similarity based on the median statistic to handle the problem of incorrect correspondences in a set of point correspondences. Moreover, as their line matching method is based on the geometric configuration between lines and points, it is robust to scale, rotation, brightness, viewpoint and occlusion in point correspondences. Even though this approach outperforms line matching methods that we discussed earlier in this section, using it to generate a set of line and point correspondences to estimate the homography is not guaranteed to work where lines are severely occluded. In fact, lines of the game-field within broadcast sport video frames are almost always occluded by players, referees, or fans. Thus, this method may fail to accurately detect a line within an image that corresponds to the one of the game-field model’s line due to occlusions. As a result, we are looking for a method that utilizes points to produces a set of correspondences between image pairs, and uses additional structural information from accurately detected lines and other correspondences to improve the correspondence set if such accurate detection results are applicable. Lu [25] proposes a method based on edge point correspondences between image pairs to estimate the homography using the Iterated Closest Point (ICP) algorithm [2]. The ICP algorithm is usually used to iteratively find the similarity transformation between two point clouds and register them in a common coordinate system. Actually, it establishes a set of point correspondences between two point clouds, and estimates the similarity transformation between them using a least squares formulation. Furthermore, it iterates this procedure until it converges to the optimal similarity transformation estimate. As a result, estimating the frame to frame homography with ICP algorithm requires extracting a set of point correspondences between image pairs. To do so, they extract a point cloud from the first image by transforming edge points of the game-field model using the model-to-frame homography estimate; then, detect and localize edge points on the second image using Canny edge detector [5] to extract the corresponding point cloud. In order to remove irrelevant edge points caused by the score bar, they use a predefined fixed size mask. Moreover, they use automatic player detection results generated by the DPM detector [10] to reject edge points caused by players. Having corresponding point clouds extracted from two images, the Levenberg-Marquardt algorithm [22; 27] is used to estimate an optimal homography between image pairs by solving a nonlinear least-square problem together with RANSAC to remove outliers and improve the robustness of the method. Then model points are transformed using the new estimated homography and the procedure iterates until ICP converges to the optimal homography estimate such that the misalignment error between projected game-field model points and image pixels is minimal. Comparing to previous methods, ICP improves homography robustness, accuracy and running time. More precisely, since the Canny edge detector is used to extract point clouds from input images, detected edges 20 are robust to rotational and illuminations changes, and motion blurs. However, as other iterative techniques, the ICP algorithm suffers from following drawbacks. First of all, it is very slow since it has a O(n2 ) time complexity. Second, it requires an initial estimate to start iterations from. Finally, it may become trapped in a local minimum and not converge to the optimal homography. 21 Chapter 3 Preprocessing Input Images In order to automatically initialize the homography estimation process, the first task we have to do is to select a set of representative frames from the input sequence. In fact, the selection process has to be automatic and representative frames have to satisfy following conditions: • Keyframes have to be salient images • Keyframes must have overlapping features • A set of keyframes must cover the entire game-field • A homography between the game-field and each keyframe is known As a result, we need to preprocess the input data in order to guarantee that the first two conditions are satisfied. First of all, we must measure the image blur and remove blurry frames from the input sequence. In order to satisfy second and third conditions we need to identify the game-field area within each image. Thus, we propose a method that takes advantage of the difference between hue values of the game-field pixels and the rest of pixels to identify the game-field area within each image frame. While this approach highly depends on the color model of images, we introduce more generic approach to do so proposed by [31] called color classification. Since the rough estimate of court area within each image using any of these methods contains noise, we use morphological filters to refine the result and produce a clean binary mask that specifies the court area within the image. 3.1 Blur Measure At the very first step of preprocessing, blurry frames have to be removed from the input sequence since they are neither informative nor useful in the automatic initialization process. To distinguish blurry images from salient ones a metric needs to be defined that measures 22 (a) Blur Metric = 0.4611 (b) Blur Metric = 0.4945 (c) Blur Metric = 0.4218 Figure 3.1: Blur Measure Visualization. These images illustrate examples of blurry frames for both hockey and basketball images in which the blur metric value is greater than 0.4. the blur level of each image frame. In order to estimate the image blur, we use an approach proposed in [7] that processes a grayscale version of a given image and returns a quantitative metric in a range of [0, 1] called the blur metric. In fact, the blur metric indicates the perceptible blur level of each given image. More precisely, in this method the grayscale version of each input image is convolved with a low-pass filter in order to introduce synthetic blur to it. Then, normalized intensity variations of neighboring pixels within both original image and its blurred version are computed and compared in order to investigate the blur level of a given image. In fact, high variations indicate that the input image was salient whereas low variations suggest that it was blurry. Finally, the result of the blur measurement process is represented by a number in the range of [0, 1] where 0 denotes the least and 1 indicates the most blur levels. Using this approach, we have to find the best possible threshold value that helps us to distinguish blurred images from salient ones. According to our experiments, any image frame that has a blur metric greater than 0.4 has to be considered as a blurry image frame; thus, we remove it from the input sequence and use remaining frames for further processes (Figure 3.1). The next step of preprocessing is involved with a method that produces a naive estimate of the game-field area within each image frame that can be used for in the keyframe selection process. 23 3.2 Game-Field Identification Once undesirable blurred frames are filtered out from the input data, we have to come up with a rough estimate of the game-field area within each image frame to be able to select keyframe candidates that have overlapping features, and cover the entire game-field together. Thus, a method must be designed to classify each pixel within each image frame into court pixel and non-court pixel classes. In this section, we propose an approach that takes advantage of the difference between hue values of the game-field pixels and rest of image pixels in order to identify the game-field area within each image frame (Figure 3.2). While this approach highly depends on the relative difference between hue values of court and non-court pixels, we introduce a more generic approach proposed by [31] called color classification that uses a few samples of manually labeled pixels and automatically learns the color model of court pixels by using mixture models together with a boosting algorithm [12] (Figure 3.5). 3.2.1 Hue-masks as Visual Words While RGB representation of input image frames does not help us naively identify the game-field area, HSV representation looks discriminative, and so, helpful for the game-field identification task (Figure 3.2). However, taking advantage of the HSV color model by itself may not work for cases in which court- and non-court-pixels have very similar values (Figure 3.3). Using the hue-based pixel classification method, first each input frame must be converted from the RGB into the HSV color model (Figure 3.2a-b). Then, hue values of all pixels within each image need to be extracted from the HSV image and result in an image called hue-map in which all pixel values are in the range of [0, 1] (Figure 3.2c). Once the hue-map of an image is produced, we have to classify each of its pixels based on their hue values. Since in our dataset of basketball images, non-court pixels have significantly larger hue values (within the range of [0.5, 1] ) compared to court pixels (within the range of [0, 0.15]), all we need to do is to set a threshold on hue values in order to distinguish these two classes of pixels. Intuitively, we set a certain threshold on the pixel’s hue value to suppress pixels that have hue values larger than the threshold and classify them as non-court pixels. Consequently, the rest of pixels are classified as court pixels. The result of this classification process is a set of binary masks in which white pixels correspond to court pixels and black pixels denote non-court pixels (Figure 3.2d). Moreover, these masks need to get refined to be utilized as visual words that have to be clustered in the keyframe selection process. 24 (a) RGB Representation (b) HSV Representation (c) Hue-Map (d) Binary Mask Figure 3.2: From the HSV Representation to the Binary Mask. These four images illustrate (a) an image represented in RGB, (b) its converted version into HSV where bluish areas correspond to yellow-brown colors in the RGB representation, and cyan-pink areas corresponds to the spectrum of dark-brown to black colors in the RGB representation, (c) the extracted hue-map from the HSV image, and (d) an initial binarized hue-mask that can be used for the court identification task. Mask refinements by using morphological filters will be discussed later in this chapter. Although our method is not very complicated and works reasonably fast and accurately, it suffers from a couple of limitations since using a threshold is always problematic and involved with some potential trade-offs. The first issue is how to choose the best value for the threshold. In order to resolve this problem, we try different values and choose the one which works the best among all tested ones. According to our experiments, the threshold must set to be in the range [0.1, 0.15] for basketball images. Even though this number may not be the best choice since we do not test all possible values on all different datasets, it works reasonably well on our dataset of basketball images. The other issue is whether using one fixed threshold always work or we need more generic approach. More precisely, there might be some areas within the court area of an image that have hue values almost similar to non-court regions and/or vice versa which both may 25 (a) RGB Representation (b) HSV Representation (c) Hue-Map Figure 3.3: Hue-based Classification Failure. Using HSV color-map does not help to identify the game-field area within an image, since both court and non-court pixels have similar hue values. diminish the accuracy of our classification method (Figure 3.4). Moreover, in case of hockey images where all rink and non-rink pixels have very similar hue values, it is too difficult to distinguish one class of pixels from another since it is almost impossible to find any reliable threshold that allows us to do so (Figure 3.3). These issues bring us to the next section where we discuss how machine learning helps us to improve this classification task to be much more generic and robust without need of using any threshold. 3.2.2 Color Classification In order to overcome limitations of our initial hue-based pixel classification method, we take the advantage of a color segmentation method proposed by [31] that learns the color model of game-field pixels by using a boosting algorithm. In fact, we slightly modify its implementation for the purpose of our application. To use the color classification method, first a few samples of RGB pixel values from both court and non-court pixels must be collected. We use an interactive labeling tool that allows us to manually label court and non-court pixels for each given image frame to perform this task. More precisely, the color classification method feeds manually labeled pixels into an OpenCV implementation of a 26 (a) RGB Representation (b) HSV Representation (c) Hue-Map (d) Binary Mask Figure 3.4: Hue-based Classification Failure. Using HSV color-map by itself results in identifying the game-field area with noise, since some areas within the court area have hue values almost similar to non-court regions and/or vice versa. boosting algorithm called Gentle Adaboost [12] that trains 150 weighted decision trees in which the maximum depth is 104 in order to learn the color model of court pixels from labeled data. Moreover, the initial trained model can be improved by using more labeled pixels sampled from both court and non-court areas within an input image by updating the model. In fact, the updating process can be repeated by adding labeled pixels from few additional images (at most five images), that are not either in the keyframe set or our test set, until producing reasonably accurate results. Figure 3.5 illustrates the color classification results for both hockey and basketball images in which our previous method fails to accurately classify the court and non-court pixels. Similar to hue-based classification, outputs of the classification process are binary masks than can be used in the keyframe selection process, where white pixels denote court pixels and black pixels denote non-court ones. Since these masks inevitably contain noise, we have to come up with a procedure to remove noise and improve the quality of binary masks. This issue brings us to the next topic we are going to discuss in this chapter that is morphological refinements. 27 (a) RGB Representation (b) Background Segmentation - Binary Mask (c) RGB Representation (d) Background Segmentation - Binary Mask Figure 3.5: Color Classification Results. Illustration of final results (binary masks) of running the color classification algorithm on basketball (a-b) and hockey (c-d) images after few iterations. Obviously, both masks contain noise that can be removed by using morphological filters. 3.3 Morphological Refinement The final step in the preprocessing procedure is to refine binary masks produced by pixel classification methods to improve their quality by removing undesirable noises. In order to do so, first we run the MATLAB implementation of a morphological closing procedure on binary masks in order to fill undesirable black gaps within large white areas; then, find a set of dense connected components of white pixels that presumably correspond to the game-field regions within a given image. Removing sparse components with small areas is the next step. Afterwards, we use a morphological dilation on resulting binary masks of previous step by using a disk (radius = 25) as a structural element to enlarge white regions within each masks. Then, we re-compute connected components and calculate their corresponding size (i.e., the area within each connected component) in order to find small black regions surrounded by large white regions and/or vice versa. Finally, once the holes in both black and white regions are filled by using the Area ratio, we come up with a set of clean binary masks that represent a rough estimate of the game-field within each image (Figure 3.6). 28 The final output of this process is going to be used as the input of the keyframe selection process which we discuss in the next chapter. (a) RGB Representation (b) Initial Hue-Mask (c) Refined Hue-Mask (d) Final Binary Mask Figure 3.6: Morphological Refinements. Illustration of morphological refinements on a binary mask produced by hue-based classification method. (a-b) morphological closing + connected component analysis, (c) morphological dilation, (d) connected component analysis that results in the final mask. 29 Chapter 4 Automatic Keyframe Selection Having a set of binary masks that roughly represent the game-field areas within each image frame, we propose to cluster input images based on their corresponding binary masks in order to produce a set of keyframes with overlapping features that cover the entire gamefield. More precisely, since the blurred frames are removed from the inputs during the preprocessing stage, and also the binary masks represent rough estimates of the gamefield areas within remaining salient images, final results of the clustering process would produce keyframes such that they not only are salient images, but also have overlapping features that cover the entire game-field. However, clustering these high dimensional feature vectors (1280×720-vectors for basketball images and 1920×1080-vectors for hockey images) is neither timely efficient nor computationally feasible by using any clustering algorithm due the curse of dimensionality. Thus, we use Principal Component Analysis (PCA) to address this issue by linearly reducing the dimensionality of the input data (i.e., the binary masks). As we discussed in Chapter 3, the resulting binary masks produced by the color classification method are post-processed by using morphological filters to clean up noisy artifacts; however, they may still contain some noise even after the post-processing step. In fact, those noisy artifacts may not hurt the performance of the game-field identification task, but they harshly affect the performance of PCA, since PCA works based on least squares estimation algorithms that are very sensitive to outliers. To overcome this crucial issue, we take advantage of the robust linear dimensionality reduction method proposed by [21] called Robust PCA that works well even in the presence of noise within the input data. In the first section of this chapter we compare the performance of both basic PCA and its robust version on both hockey and basketball datasets, by providing further empirical performance evaluations. Once we produce the low dimensional data from the set of high dimensional binary masks for each dataset, we have to choose a proper clustering method to run over the dimensionality reduced data in order to produce a set of keyframes. Hence, we first decide to use the popular clustering algorithm called K-Means [3] for clustering since it is fast and produces tighter clusters compare to other clustering algorithms. Unfortunately, an undesirable shortcoming 30 of using K-Means is that the number of clusters K (i.e., the number of keyframes) has to be specified manually. In fact, one of the most important issues that we must consider, even before the clustering step, is how to estimate a sufficient number of keyframes we need to select from each dataset in order to guarantee that the selected keyframe set fulfills all predefined constraints that are discussed in Chapter 3. By using K-Means we have to manually set a large value for K which guarantees that selected keyframes (i.e., centers of clusters) necessarily represent overlapping frames that cover the entire game-field. However, using a novel variation of K-Means proposed by [33] called X-Means allows us to automate the value assignment process for K as well as accelerate the clustering by improving K-Means’s scalability with regard to the time it takes to perform each iteration within its iterative clustering process. In fact, X-Means takes the proper range in which the true value for K presumably lies, and returns cluster centers together with the best value for K based on using a particular scoring function that evaluates the model selection criteria in order to choose the best model. In other words, the X-Means clustering approach starts clustering inputs with K centroids such that K is equal to the lower bound of the given range, and then iteratively add more centroids to K until either the value of K reaches the upper bound of the range or the clustering performance does not change by adding new centroids. Even though by specifying the expected range for K in X-Means or even specifying the value of K for K-Means we guarantee these clustering methods to produce a sufficient number of keyframes (i.e., cluster centers) that satisfies the keyframe selection criteria, this number may not necessarily result in an optimal set of selected keyframes with regard to the task that those keyframes are going to perform in our system during later stages. More precisely, the objective function that X-Means attempts to optimize by choosing the best value for the K is different from objective functions we have used during the homography estimation process in order to compute the best possible estimates. Therefore, the resulting selected keyframes of our proposed keyframe selection process may contain one or more very similar images that may make the keyframe matching task very challenging for those cases. However, since the intermediate goal of our system is to compute frame-to-frame homography estimates between inputs and keyframes rather than accurately classify inputs based on keyframes, having more keyframes not only helps us to increase the chance of computing more accurate homography estimates between inputs and keyframes, but also reduces the effect of potential misclassifications that may occur during the keyframe matching process. Moreover, by setting a proper upper bound for the range of values that K can take, we avoid our keyframe selection module to come up with an infeasible number of keyframes with respect to the length of input image sequences within each of our three different datasets. In fact, an ideal keyframe selection approach would be the one that starts with an initial value of K, uses the resulting K centers of clusters as 31 keyframes and computes the homography for the entire input sequence, and then iteratively increases the value of K, redoes the entire process and compares the misregistration errors between pairs of resulting estimates, and continues this process until the error gets fixed or not increased. In this chapter, first we discuss conventional PCA and compare it with robust PCA with regard to purposes of our application; then we discuss the K-Means and X-Means clustering algorithms, and argue why we prefer to use X-Means rather than using K-Means with fixed K in order to select a set of keyframes from images within our three different datasets. Finally, after we select a set of keyframes for each dataset, we manually annotate the modelto-frame homography estimates for each of them in order to be used for the homography estimation stage. 4.1 Dimensionality Reduction In this section we discuss the approach that we use for reducing the dimensionality of the clustering input data. Hence, having a set of high dimensional binary masks with dimensionality D, the objective is to project them onto an optimal lower dimensional subspace with dimensionality M (M << D) in which M is inferred from the data itself while minimizing the projection cost which is defined as an average of squared distances between the high dimensional data and their projections. Equation 4.1 represents the mathematical formulation of the projection cost function where N is number of inputs, x denotes the original data, x ˜ denotes the projected data, and the value of the J is also known as the distortion [3]. 1 J= N 4.1.1 N xn − x ˜n 2 2 (4.1) n=1 Conventional PCA vs. Robust PCA Having N binary masks I bw 1 , I bw 2 , I bw 3 , . . . , I bw N with dimensionality D, corresponding to the N input image frames within each dataset, the aim of using PCA is to project these masks onto a lower dimensional subspace with dimensionality M while minimizing the projection cost. By using PCA we can infer the proper value of M from the input data itself. In order to do so, first we have to standardize the data to have zero mean and unit variance before using PCA. Then, the covariance matrix S is computed from the standard data. 32 Finally, we have to find the M eigenvectors of the covariance matrix S corresponding to it M largest eigenvalues [3]. Moreover, since in our case, the number of data (N ≈ 100, 000) is much less than the dimensionality of the data space (D ≈ 10, 000, 000) the maximum value of the M can be N − 1 as at least the D − N + 1 eigenvalues of the covariance matrix S are equal to zero. Figure 4.1 illustrates two plots representing the spectrum of singular values (i.e., the square root of eigenvalues) of covariance matrices that are computed for binary masks of images within the 2011 basketball dataset and the hockey dataset by using conventional PCA. Moreover, we intentionally choose these two datasets to evaluate the performance of conventional PCA since the binary masks corresponding to images within the hockey dataset are the noisiest ones whereas the masks corresponding to images within the 2011 basketball dataset are the cleanest. Singular Value Analysis − Hockey Dataset 2000 Singular Value Analysis − 2011 Basketball Dataset 1400 Singular Value Singular Value 1800 1200 1600 1000 Singular Values Singular Values 1400 1200 1000 800 600 800 600 400 400 200 200 0 0 10 20 30 40 50 60 70 Number of Principal Components 80 90 0 100 (a) 0 10 20 30 40 50 60 Number of Principal Components 70 80 90 (b) Figure 4.1: Singular Values Spectrum Representation for Hockey and Basketball Datasets. Plot (a) represents a plot of the singular values spectrum for the sorted 100-largest singular values of the covariance matrix of binary masks corresponding to images within the hockey dataset, and plot (b) illustrates a plot of the singular values spectrum for the sorted 90largest singular values of the covariance matrix of binary masks corresponding to images within the 2011 basketball dataset. As can be seen from illustrated plots in Figure 4.1, in case of the hockey masks, the singular values dropped off abruptly after first 20 principal components, while the rest still have large values, so they cannot be cut off from a set of basis of the input space without affecting the projection and reconstruction results. However, for the basketball masks, the singular values dropped off abruptly after the first 20 principal components, and then still decreases for the remaining components. As we discussed earlier in this chapter, since conventional PCA works based on the least squares formulation, it is very sensitive to outliers. To overcome this issue, we propose to use robust PCA proposed by [21] instead of conventional PCA. In fact robust PCA is a method that can be used for robust subspace learning where 33 the input data contaminated with outliers, by detecting and modeling outliers at the pixel level based on robust M-estimations. Figure 4.2 illustrates two plots that demonstrate the spectrum of singular values of covariance matrices of binary masks corresponding to images within the 2011 basketball dataset and the hockey dataset by using the robust PCA. Singular Value Analysis − Hockey Dataset 2000 Singular Value Analysis − 2011 Basketball Dataset 1400 Singular Value Singular Value 1800 1200 1600 1000 Singular Values Singular Values 1400 1200 1000 800 600 800 600 400 400 200 200 0 0 10 20 30 40 50 60 70 Number of Principal Components 80 90 0 100 0 (a) 10 20 30 40 50 60 Number of Principal Components 70 80 90 (b) Figure 4.2: Singular Values Spectrum Representation for Hockey and Basketball Datasets. Plot (a) represents a plot of the singular values spectrum for the sorted 100-largest singular values of the covariance matrix of binary masks corresponding to images within the hockey dataset, and plot (b) illustrates a plot of the singular values spectrum for the sorted 90largest singular values of the covariance matrix of binary masks corresponding to images within the 2011 basketball dataset. Dashed red lines denote principal components that are manually chosen to create a set of basis for the projected data. As can be seen from the plots in Figure 4.2, robust PCA performs much better in presence of noise within the binary masks corresponding to hockey images, compared to the performance of using conventional PCA for the same data. Moreover, using robust PCA slightly improves the performance of conventional PCA on the 2011 basketball dataset too. Having the spectrum of sorted singular values, now we should discuss how PCA infers the best value for M from the data itself such that it does not result in aliasing. In fact, the best for M is the number of largest eigenvalues of the covariance matrix S while minimizing the distortion. Moreover, the distortion can be reformulated (Equation 4.2) in terms of summation of removed eigenvalues. J= 1 N D λn (4.2) n=M +1 Hence, we can simply remove the eigenvectors corresponding to small eigenvalues of the S (M + 1 to D), and project the data onto the subspace that is constructed upon only M 34 eigenvectors corresponding to the M largest eigenvalues of the covariance matrix S while minimizing the distortion (Equation 4.2). As a result, the value of M is set to 90 for the hockey dataset, and set to 70 for the 2011 basketball dataset (Figure 4.2). 4.2 Clustering - Keyframe Selection Once we reduce the dimensionality of the binary masks, the next and the final task is to cluster the processed data in order to produce a set of keyframes for each dataset such that it satisfies the keyframe selection criteria discussed earlier. 4.2.1 K-means vs. X-means Once we produce the low dimensional data from the set of high dimensional binary masks for each dataset, we have to choose a proper clustering method to run over the dimensionality reduced data in order to produce a set of keyframes. In fact, one of the most important issues that we must consider before clustering is estimating a sufficient number of clusters. If the K denotes the number of clusters, the value of the K can be specified by using stochastic measures, determined manually by the user, specified within a range, or even determined later in the post-processing step. Moreover, we should consider the fact that the small values for K may result in a set of keyframes that does not cover the entire gamefield whereas the large value for K may result in a set of keyframes which may contain very similar images that would affect the accuracy of the keyframe matching process. Since the ultimate aim of the clustering stage is to produce a set of keyframes that facilitates the frame-to-frame homography estimation, selecting more keyframes not only helps us to increase the chance of computing more accurate homography estimates between inputs and keyframes, but also reduces the effect of potential misclassifications that may occur during the keyframe matching process. First we decide to use the popular clustering algorithm called K-Means to cluster the binary masks that represent the game-field areas within each images, since it is fast and produces tighter clusters compare to other clustering algorithms. In fact, in order to use K-Means, we must manually set the number of clusters K (i.e., the number of keyframes). Thus, we manually determine the value of K with respect to the number of images within each dataset such that the value of K does not violate the notion of the term ”selection”, and such that K must be large enough to guarantee that the selected keyframes (i.e., centers of clusters) all together can necessarily cover the entire game-field. Moreover, that K must be large enough 35 to maximize the chance that selected keyframes are sufficiently overlapped. As a result, we set K to 20 for all of our three datasets. Obviously, 20 is a magic number, and so, any other number greater than 10 may result in a set of keyframes for each dataset that may identically affect the homography estimation process. Since K-Means is prone to get trapped in local minimums as it uses the gradient descent-based methods in its optimization step, and so, its resulting clusters and cluster centers are highly dependent on its initialization process, we take advantage of a randomized initialization approach in which initial cluster centers are randomly chosen from the clustering input data, and then perform K-Means clustering. In fact, we repeat this entire process with the random initialization step for 30 different times (i.e., 30 different set of random seeds for the initialization step), and then use resulting cluster centers of the one that performs the best by producing clusters such that the sum of distances between all elements of each cluster and their corresponding centers is minimum among all 30 trails, and also the resulting cluster centers fulfill our predefined constraints about selecting keyframes. Furthermore, by using conventional K-Means we do not consider characteristics of the data itself in order to come up with a proper number of clusters (i.e., we chose the same number of clusters for both hockey and basketball datasets). In order to overcome these crucial shortcomings, we propose to use a novel variation of K-Means proposed by [33] called X-Means that allows us to not only automate the value assignment process for K with respect to the input data, but also accelerate the clustering process by improving K-Means’s scalability with regard to the time it takes to perform each iteration within its iterative clustering process. In fact, X-Means takes the proper range in which the true value for K presumably lies, and returns cluster centers together with the best value for K based on using a particular scoring function called Bayesian Information Criterion (BIC scoring) that evaluates the model selection criteria in order to choose the best model. In other words, the X-Means clustering approach starts clustering the inputs by using conventional K-Means with K centroids, in which K is equal to the lower bound of the given range, together with the random initialization step, and then iteratively adds more centroids to K until either the value of K reaches the upper bound of the given range or the clustering performance does not change by adding new centroids. In our experiments, we specify the expected range for K to [10, 20] for all three datasets since we believe that any number of clusters in this range guarantees X-Means to produce a sufficient number of keyframes (i.e., cluster centers) that satisfies the keyframe selection criteria. In fact, using X-Means results in 12 clusters for the 2011 basketball dataset, 14 clusters for the 2010 basketball dataset, and 15 clusters for the hockey dataset. Moreover, it estimates the center of each of these clusters such that their corresponding image frame can be used as a keyframe. Sets of selected keyframes for each dataset of images are illustrated in Figure 4.3, Figure 4.4 and Figure 4.5. 36 Once keyframes are selected for images within each dataset, we must manually annotate the model-to-frame homography for each of them in order to make them ready for use by the keyframe matching and frame-to-frame homography estimation modules of our system. 37 Figure 4.3: Automatically Selected Keyframes - 2010 Basketball Dataset. 38 Figure 4.4: Automatically Selected Keyframes - 2011 Basketball Dataset. 39 Figure 4.5: Automatically Selected Keyframes - Hockey Dataset. 40 Chapter 5 Keyframe Matching The last step before computing the frame-to-frame homography between image pairs is to find the most similar keyframe to each input frame. This procedure called keyframe matching process. The major challenge involved in this process is to devise a method that takes advantage of low level features that can be extracted from both input images and keyframes, such as color histograms, distinctive local keypoint features and color classification results (i.e., binary masks), together with a proper matching technique in order to accurately find the best matching keyframe for each given input frame where the keyframe set may contain several keyframes that look pretty similar to each other. In fact, one can think of the keyframe matching process as a classification task, where the goal is to label each input frame with a label of one of K keyframes based on evaluating their similarity. Consequently, using the insufficient amount of information that can be extracted from the classification input data and/or poor similarity measurements would result in severe misclassification errors and/or a large subset of keyframes that remain unused during the keyframe matching process. Once input images within each dataset are clustered based on their corresponding binary masks, each image is assigned a cluster label with respect to its distance to all final cluster centers (i.e., all keyframes). Thus, after performing clustering, each input image is labeled with a label of the cluster center that has the minimum distance to it, among all final cluster centers, in terms of comparing their corresponding binary masks. Even though these estimated labels capture the significance of similarity between the viewpoints of input images and keyframes and visible game-field areas within them (i.e., binary masks), this single similarity measure alone is not sufficient to accurately estimate the best matching keyframe for each input frame. Hence, we take advantage of clustering results together with two other measures such that they do not only produce a single guess of the closest keyframe for each input frame, but generate K-lists that represent rankings of all K choices of keyframes for each input frame based on evaluating their similarity in terms of three different extracted low level features from them, namely color histograms, distinctive local keypoint features and color classification results. Then we use resulting ranking K-vectors of input frames 41 as the high level representation of input frames in terms of keyframes, and feed them into a classifier that takes advantage of all those high level features (i.e., rankings) in order to accurately estimate the best matching keyframe for each input frame. More precisely, first we take advantage of clustering results such that in addition to estimate the best cluster label for each input frame, we compute distances of each input frame to all K final cluster centers by comparing their corresponding binary masks, and so, generate a K-list consisting of corresponding keyframe labels, that are ranked in decreasing order of similarity based on those measured distances, for each input frame. Moreover, we divide each image frame within the set of input frames and keyframes to a grid of predefined number of same sized cells, extract RGB histograms from each cell, and integrate all resulting histograms within each grid into a single feature vector that can represent the image frame. In fact, this new representation provides additional information about the color model of input images and keyframes compared to the previous representation (i.e., binary masks). Afterwards, we compute distances between each input frame and all K keyframes by calculating the Euclidean distance between their corresponding grids of RGB histograms, and generate a K-list consisting of corresponding keyframe labels for each input frame that are ranked in decreasing order of similarity in terms of the color model based on those calculated Euclidean distances. Furthermore, we extract SIFT features from all input frames and keyframes, calculate their corresponding descriptors, and then compute a set of inlier correspondences between each input frame and all keyframes. Afterwards, we count the number of inlier correspondences between each input frame and all keyframes, and generate a K-list consisting of corresponding keyframe labels for each input frame that are ranked in decreasing order of similarity in terms of the keypoint feature matchings based on those calculated number of inlier correspondences. Finally, all resulting rankings of those three procedures are fed into the classification fusion module that takes advantage of a three layer artificial neural network (ANN) to combine those ranks and make final classification decisions for inputs with high overall accuracy compared to using each of previously discussed processes individually. Moreover, in the classification fusion module we use a framework proposed by [34] in order to combine decisions (i.e., rankings) of those multiple complementary low level classifiers to improve the overall accuracy of the classification task. In this chapter, first we explain how we produce sets of high level features for each image based on extracting different low level features from them, and then discuss how we use the combination of those resulting high level features in order to classify input images into classes of keyframes. Moreover, we compare the performance of the classification task when we only use each classifier individually and when we use a combination of them through a classification fusion module. 42 5.1 Clustering Labels as the Measure of Similarity Once we cluster input images within each dataset based on their corresponding binary masks, each image is assigned a cluster label with respect to its distance to all final cluster centers (i.e., all keyframes). Thus, after the clustering stage each input image is labeled with a label of the cluster center that has the minimum distance to it among all final cluster centers. As we discussed earlier in this chapter, even though these estimated labels capture the significance of similarity between the viewpoints of input images and keyframes and visible game-field areas within them (i.e., binary masks), this single similarity measure alone does not provide sufficient information to accurately estimate the best matching keyframe for each input frame. In fact, capturing those properties (i.e., viewpoints and visible gamefield areas) of images is very beneficial to identify misclassification errors; however, for an accurate classification we need additional information about the color model and/or keypoint features of both inputs and keyframes which unfortunately is not available through using this representation of images (i.e., binary masks). To overcome this problem, we take advantage of clustering results together with two other measures, namely color histograms, distinctive local keypoint features, such that they do not only produce a single guess of the closest keyframe for each input frame, but generate K-lists that represent rankings of all K choices of keyframes for each input frame based on evaluating their similarity in terms of three different extracted low level features from them, and combine their decisions in order to perform the final classification. As a result, in this step we take advantage of clustering results such that in addition to estimate the best cluster label for each input frame, we compute distances of each input frame to all K final cluster centers by comparing their corresponding binary masks, and generate a K-list consisting of corresponding keyframe labels for each input frame that are ranked in decreasing order of similarity based on those measured distances. Then, the resulting sets of K-lists are fed into the classification fusion module in order to combine with results of two other low level classifiers and produce final classification decisions for each input frame. 5.2 RGB Histograms Grids In order to extract additional information about the color model of images within both input frames and keyframes from their corresponding RGB histograms, first we spatially decompose each image to a grid of particular number of cells that are determined from and 43 Figure 5.1: Image Grid Representation. This image visualizes the 3 × 5 grid of 15 equally sized cells that is produced for an image within the 2010 basketball dataset. cross validated on the input data (Figure 5.1). In fact, one can construct a grid of cells from each image such that each cell has a different size within a grid (as in [6]) in order to prioritize different regions within an image; however, in our experiments we use grids of equal-size cells since we do not make any assumption about the significance of particular regions within images in our datasets. Once we produce those grids for both input images and keyframes, next we create a feature vector from normalized RGB histograms for each cell within each grid such that each histogram is quantized by a specific number of bins that are determined, cross validated, and inferred from the input data similar to the number of cells. Once those vectors are produced for all cells within each grid, we concatenate the resulting vectors in order to generate one single vector such that each image can be represented and call it a color vector. Thus, instead of using input images and keyframes, we use their corresponding color vectors, and so, compare those vectors in order to find the closet keyframe to each input frame by computing the Euclidean distance between them. In fact, this new representation provides additional information about the color model of input images and keyframes compared to the previous representation (i.e., binary masks). As a result, by using this approach distances between each input frame and all K keyframes are computed by calculating the L2 distance between their corresponding color vectors, and generate a K-list consisting of corresponding keyframe labels for each input frame that are ranked in decreasing order of similarity in terms of the color model based on those calculated Euclidean distances. In our experiments, for images within both basketball datasets we use 3 × 5 gird of equally sized cells, also we set a number of the bins to 64. In fact, we cross validate all these 44 parameters on a subset of images within both of these datasets consisting 100 samples from each dataset. 5.3 Interest Point Detection and Matching (a) Input Frame Closest Keyframe (b) Inlier SIFT Matchings Figure 5.2: SIFT Matching Inliers Visualization. This image illustrates all inlier correspondences between an input image and one keyframe after running RANSAC on the putative set of all correspondences. One common approach to image classification is to extract a set of local distinctive keypoint features from each input image, and then match those extracted features with extracted features from a set of reference images and find the one that has a maximum number of inlier matches. We will discuss how to compute inlier matches in more detail in Chapter 6. In fact, this approach is highly beneficial when we are dealing with reasonably small images that may contain one or only a few objects within fairly uncluttered scenes. Moreover, the goal of those classification tasks is to classify images based on the visible object within them. However, in case of the sports video frames our objective for the classification is to classify input images into different classes of keyframes based on different criteria such as viewpoints, visible game-field areas, color, and also extracted interest points within the game-field areas. Moreover our system deals with large images cluttered with objects such as fans, players, referees, reserve players that may be neither informative nor helpful to accurately classify input images. However, those cluttered images may also contain beneficial visual features 45 that can be extracted from backboard, ad boards, and the ads on the game-field that can help us to classify input frames into classes of keyframes with a reasonable level of accuracy. Therefore, we still expect this approach to be helpful where an input image and a keyframe are closely similar to each other. Moreover, ranking choices of keyframes for each input frame based on matching features can be beneficial for cases that our previous approaches fail to discriminate accurately. In our experiment, we only extract SIFT features [24] from all input frames and all corresponding selected keyframes for each dataset, and then count the number of inlier correspondences between each input frame and all keyframes, and generate a K-list consisting of corresponding keyframe labels for each input frame that are ranked in decreasing order of similarity in terms of the keypoint feature matches based on those calculated number of inlier correspondences. 5.4 Classification Fusion - Finding the Closest Keyframe Low Level Classifiers Cluster Label Ranking (Binary Masks) Input Frame Rankings RGB Histograms Grids Ranking The High Level Classifier 3-Neural Network + Logistic Regression (Combined Decisions) Final Classification SIFT Inlier Correspondences Ranking Figure 5.3: Classification Fusion Representation. Once sets of rankings are produced based on using three different low level classifiers that we discussed earlier, finally we combine all those multiple complementary decisions together through a classification fusion module that works based on a framework proposed by [34] which takes advantage of a three layer artificial neural network (ANN) in order to combine those rankings and make final classification decisions (Figure 5.3). In fact, this approach is expected to classify input frames into classes of keyframes with higher overall accuracy compared to each of previously discussed methods since it takes advantage of additional information that may not be either available to each of those low level classifiers or well represented within them. In other words, the final high level classifier can correct some potential misclassification errors that are caused by each individual low level classifier by processing multiple decisions (i.e., rankings) of all of those classifiers together. In order to use an artificial neural network as a high level classifier that makes final classification decisions based on resulting high level features that are extracted from images by using low level classifiers, first we have to extract sets of rankings from a subset of images 46 z1 z2 x1 x2 z3 z4 y1 xK xK+1 y2 xK+2 y3 x2K yK x2K+1 zs-3 x3K zs-2 zs-1 zs Figure 5.4: 3-Layer Artificial Neural Network Representation. This Figure illustrates the structure of the 3-layer neural network that we use for classification. The input layer is composed of 3K input units correspond to the K-lists of ranks that are produced by 3 different classifiers. The hidden layer is composed of s hidden units in which the logistic regressions are performed on weighted combinations of all inputs based on using the set of weights (i.e., model parameters) for edges between the input and hidden layers. Finally, the output layer is composed of K units that represents a binary vector which indicates the estimated class label for each given data (i.e., rankings) based on performing logistic regressions on weighted combinations of outputs of hidden layer units by using the corresponding set of weights for edges between the hidden and output layers. within each dataset (i.e., neural network’s training set) that represent prioritized choices of keyframes for them in terms of color classification results (i.e., binary masks), RGB histograms grids (i.e., color vectors), and inlier SIFT correspondences. Moreover, we have to manually label each image within the training set with a label of its corresponding best matching keyframe. Once we create the training set, we must design a neural network architecture that can be used for an accurate multi-class classification of input images into classes of their corresponding keyframes. Hence, we design a 3-layer neural network such that its input layer 47 is composed of 3K input units correspond to the K-lists of ranks that are produced by 3 low level classifiers for each image, a hidden layer that is composed of s hidden units, and finally an output layer that is composed of K units such that one can think of it as a binary vector which indicates the estimated class label for each given data point (Figure 5.4). In fact, the number of units within the input and output layers depends on the number of classes K, whereas the number of units in the hidden layer is determined during the training phase through cross validation. Furthermore, we have to assign two sets of weights to edges between input and hidden layers, and also hidden and output layers. In fact, these sets of weights are parameters of the classification model that we initialize them by random values, and try to learn their actual correct values during the training phase. Afterwards, we must choose a discriminative activation function together with a proper learning strategy in order to train parameters of the neural network (i.e., classification model) that we designed. In fact, we use the logistic function as an activation function for our neural network such that each unit of the hidden layer within our network is actually a logistic unit that performs logistic regression on weighted combinations of elements of input vectors based on a set of weights for edges between the input and hidden layers, and also each unit within the output layer is a logistic unit as well that performs logistic regression on weighted combinations of outputs of logistic regressions that have been performed in units of the hidden layer based on a corresponding set of weights for edges between the hidden and output layers. Moreover, we take advantage of backpropagation learning as a learning strategy in order to learn accurate values for weights within our network during the training phase. More precisely, using backpropagation allows us to compute partial derivatives of the cost function by taking advantage of gradient-based optimization methods, and iteratively update values for those weights (i.e., model parameters), until they finally converge to a set of values that optimizes the neural network’s cost function (Equation 5.1). We also use the L2 regularizer in order to avoid overfitting in our classification. Cost (Ω) = λ 2N 1 N N K (i) yk log fΩ x(i) i=1 k=1 L−1 sl sl +1 (l) 2 Ωji (i) k + 1 − yk where fΩ (x) ∈ {0, 1}K l=1 i=1 j=1 log 1 − fΩ x(i) and fΩ (x) = k + (5.1) 1 1 + exp−ΩT x inputs: x ∈ {1 · · · K}3×K rankings, outputs: y ∈ {0, 1}K lables, weight matrix: Ω Equation 5.1 represents the cost function of our neural network where N is a total number of training examples, K denotes the number of output units, x denotes input vectors that represent all 3 K-lists of ranks that are produced by 3 different low level classifiers for each 48 training example, y represents the binary vector that indicate the estimated class label for each input (i.e., rankings), Ω is a matrix of weights (i.e., model parameters) for edges between input and hidden layers, and also hidden and output layers, and fΩ is a logistic function. Moreover, λ 2N L−1 l=1 sl i=1 sl +1 j=1 (l) 2 Ωji denotes a regularization term, where λ is a regularization parameter, L denotes the number of layers of our neural networks, and sl represents the number of units within layer l of our network. To conclude, during the training phase, first we assign random initial values to all of the model’s parameters, and then use them in order to perform the forward propagation step in which we estimate classification labels for training inputs by performing two sets of logistic regressions on weighted combinations of input units and hidden units respectively. Then, we compute the cost function, and perform backpropagation to compute partial derivatives of the cost function. Concretely, in each epoch of the training phase, we perform a series of forward propagations and backpropagations for all training examples and compute the corresponding cost function and its partial derivatives, then we take advantage of gradientbased optimization methods in order to update the values of those initial weights such that they eventually converge to a set of values that minimizes the cost function of our neural network. After the training phase, we construct our test set from first frames clips within each dataset, extract all high level features from images within it, and finally feed resulting high level feature vectors into the trained neural network in order to find the best matching keyframes for all images within the test set. In our experiments, for images within the 2010 basketball dataset, we design a 3-layer neural network that has 42 input units, 37 hidden units and 14 output units. Moreover, the regularization term λ is set to 0.1 and the maximum number of epochs is set to 10, 000. We train this network with a manually annotated synthetic training set consisting of 956 images within the 2010 basketball dataset, and test it with a sequence of 123 first images of all clips within that dataset. For images within the 2011 basketball dataset, we design a 3-layer neural network that has 36 input units, 30 hidden units and 12 output units. Moreover, the regularization term λ is set to 0.13 and the maximum number of epochs is set to 10, 000 times similar to the 2010 network. In fact, we train the corresponding neural network of images within the 2011 basketball dataset with a manually annotated synthetic training set consisting of 873 images within that dataset, and test it with a sequence of 97 first images of all clips within it. In fact, the number of hidden units and a value of the regularization parameters are determined by performing cross validation. Moreover, for performing optimization tasks during the training phase of both these neural networks we use the built-in optimization package in MATLAB that takes advantage of advanced 49 optimization methods such as Gradient Descent, Conjugate Gradient, BFGS, and L-BFGS. Finally, in the next section we evaluate the performance of classification for both low level classifiers and these neural networks on images within the corresponding test sets of the 2010 and 2011 basketball datasets, compare their empirical evaluation results, and discuss shortcomings of each approach and reasons that make the classification fusion module outperform all low level classifiers in terms of the accuracy of classifications. 5.5 Discussion In the remainder of this chapter we compare empirical evaluations of the performance of low level classifiers and also the high level classifier that we proposed in this chapter to perform the keyframe matching process. In fact, we evaluate all classifiers by using images within the test sets that are constructed from first frames clips within each dataset. Moreover, the performance of all those classifiers are evaluated in terms of confusion matrices. For each confusion matrix, the number of rows and columns denote the number of classes of keyframes that have to be used for finding the closest keyframes for images within each dataset. As a result, 14 × 14 confusion matrices represent the performance of different classifiers for images within the corresponding test set of the 2010 basketball dataset since we automatically select 14 keyframes for images within this dataset, and similarly, 12 × 12 confusion matrices represent the performance of all those classifiers for images within the corresponding test set of 2011 basketball dataset as only 12 keyframes are automatically selected for images within that. Moreover, for each confusion matrix, the value of off-diagonal elements denotes the number of mismatches for the particular class that is represented by the row/column number, and the value of the diagonal elements represents the number of correctly estimated matches for that class. In fact, in our representation, we normalize all these values by dividing them to the sum of all values within their corresponding row such that all elements of the confusion matrix have values within the range of [0, 1]. Moreover, we illustrate those numerical values within each matrix in terms of colors by taking advantage of a color map such that the values within the range of [0, 1] are visualized by the spectrum of colors which starts from dark blue, that corresponds to 0, and end with dark red, that corresponds to 1. Furthermore, in this section we discuss reasons that cause misclassification errors for each individual low level classifier, and how the classification fusion module takes advantage of additional information that are provided by them in terms of multiple high level rankings in order to improve the overall accuracy of classification. 50 5.5.1 RGB Histograms Grids Evaluation #Mismatches:26 −− Precision:0.78862 #Mismatches:21 −− Precision:0.78351 Estimation 0 0 1 2 3 4 5 6 7 8 Estimation 9 10 11 12 13 14 1 1 0 0.9 1 2 3 0.8 4 0.7 1 2 3 4 5 6 7 8 9 10 11 12 1 0.9 2 0.8 3 7 Groundtruth 0.6 6 0.5 8 0.4 9 0.7 4 5 Groundtruth 0 5 0.6 6 0.5 7 0.4 8 10 0.3 11 0.2 0.3 9 0.2 10 12 13 0.1 11 0.1 14 0 12 0 (a) Basketball 2010 Dataset (b) Basketball 2011 Dataset Figure 5.5: Classification Evaluation - Keyframe Matching by Using Color Histograms. Figure 5.5 illustrates empirical evaluations of using only RGB histograms grids for matching the closest keyframe to each input frame for images within 2010 and 2011 basketball datasets. In this approach, the closest keyframe to a given input frame is the one that its corresponding color vector has a minimum Euclidean distance to the corresponding color vector of the input frame. As can be seen from the confusion matrix in Figure 5.5 (a), this classifier works reasonably well for images within the 2010 basketball dataset and the corresponding set of automatically selected keyframes. In fact, this approach cannot accurately find the best matching keyframe for images that are similar to keyframes 13 and 14 since these two keyframes are very closely similar to each other, and so, this classifier is not sophisticated enough to correctly classify images that are close to either of those two keyframes (Figure 5.6). However, since the intermediate goal of our system is to compute frame-to-frame homography estimates between inputs and keyframes rather than accurately classify inputs based on keyframes, misclassification between keyframes that are pretty similar to each other does not hurt the accuracy of homography estimates between inputs and those keyframes. Moreover, Figure 5.5 (a) demonstrates that keyframes 12 and 10 are not used at all during the keyframe matching process for 2010 basketball dataset. Furthermore, Figure 5.5 (b) illustrates a resulting confusion matrix of our classification based on using RGB histograms grids for images within the 2011 basketball dataset and the corresponding set of keyframes. As can be seen from the confusion matrix, our proposed approach misclassifies all images that actually belong to the class of images within the keyframe 6 class to class of images that belongs to the keyframe 5. Moreover, it cannot accurately distinguish and classify images that are that similar to keyframes 11 and 12, 51 (a) Keyframe 13 (b) Keyframe 14 Figure 5.6: Closely Similar Keyframes - 2010 Basketball Dataset. since those two keyframes are very similar to each other (Figure 5.7). Furthermore, Figure 5.5 (b) demonstrates that keyframe 7 is not used at all during the keyframe matching process for 2011 basketball dataset. (a) Keyframe 5 (b) Keyframe 6 (c) Keyframe 11 (d) Keyframe 12 Figure 5.7: Closely Similar Keyframes - 2011 Basketball Dataset. 5.5.2 Interest Point Detection and Matching Evaluation Figure 5.8 illustrates empirical evaluations of using only SIFT inlier correspondences for matching the closest keyframe to each input frame for images within 2010 and 2011 basketball datasets. In this approach, the closest keyframe to a given frame is the one that has a maximum number of SIFT inlier correspondences with the input frame. As can be seen from those confusion matrices, the overall accuracy of this method is much lower than the previous approach we discussed; however, it works reasonably better than the previous one for the cases that the color histogram-based classifier fails to accurately classify. 52 #Mismatches:86 −− Precision:0.30081 #Mismatches:54 −− Precision:0.4433 Estimation 0 0 1 2 3 4 5 6 7 8 Estimation 9 10 11 12 13 14 1 1 0 0.9 1 2 3 0.8 4 0.7 1 2 3 4 5 6 7 8 9 10 11 12 1 0.9 2 0.8 3 7 Groundtruth 0.6 6 0.5 8 0.4 9 0.7 4 5 Groundtruth 0 5 0.6 6 0.5 7 0.4 8 10 0.3 11 0.2 0.3 9 0.2 10 12 13 0.1 11 0.1 14 0 12 0 (a) Basketball 2010 Dataset (b) Basketball 2011 Dataset Figure 5.8: Classification Evaluation - SIFT Inlier Matches. 5.5.3 Classification Fusion Evaluation Finally, Figure 5.9 illustrates empirical evaluations of the classification fusion module that combines decisions of all those three low level classifiers through a 3-layer neural network. As can be seen from confusion matrices, the estimated class labels that are produced by the classification fusion module for images within both 2010 and 2011 basketball datasets are considerably more accurate than the estimated labels that are produced by each individual low level classifier. Particularly, for images that belong to classes of keyframes 9, 13 and 14 within the 2010 dataset, and images that are similar to keyframes 6, 8, 11 and 12 within the 2011 dataset, the high level classifier works much better than individual low level classifiers. #Mismatches:16 −− Precision:0.86992 #Mismatches:13 −− Precision:0.86598 Estimation 0 0 1 2 3 4 5 6 7 8 Estimation 9 10 11 12 13 14 1 1 0 0.9 1 2 3 0.8 4 0.7 1 2 3 4 5 6 7 8 9 10 11 12 1 0.9 2 0.8 3 7 Groundtruth 0.6 6 0.5 8 0.4 9 0.7 4 5 Groundtruth 0 5 0.6 6 0.5 7 0.4 8 10 0.3 11 0.2 0.3 9 0.2 10 12 13 0.1 11 0.1 14 0 12 0 (a) Basketball 2010 Dataset (b) Basketball 2011 Dataset Figure 5.9: Classification Evaluation - Classification Fusion. In fact, using various types of visual features within more sophisticated classifiers such as SVM s [13] would definitely improve the overall accuracy and robustness of the keyframe matching results. We discuss possible directions to do so in Chapter 8. 53 Chapter 6 Frame to Frame Homography Estimation The ultimate goal of this dissertation is to automatically determine the homography transformations between the game-field model and each of the first images of all clips within complete basketball and/or hockey games by using a minimum amount of manually annotated data. Since the model-to-frame homography for each keyframe has to be known in order to compute the homography for each input frame, we use an interactive tool that allows us to manually annotate the game-field area within each keyframe and produces the model-to-frame homography. Once the closest keyframe to each input frame is determined, and the homography of all keyframes are manually initialized, an accurate homography estimate must be computed between each input frame and its best matching keyframe. As a result, in this section we are going to discuss how we estimate the frame-to-frame homography between image pairs within the system we have been developing so far. More precisely, first we describe the process of computing the initial homography estimate using SIFT matchings between image pairs together with RANSAC and normalized DLT ; then, we discuss how running the ICP algorithm over initial homography estimates together with the DPM detection results improves the accuracy of our initial estimates. 6.1 Initial Homography Estimates Given an input video sequence consisting N input frames {I1 , I2 , I3 , . . . , IN }, and a set of reference frames containing automatically selected K keyframes {F1 , F2 , F3 , . . . , FK } where (game-field) model-to-frame homography transformations Hf 1 , Hf 2 , Hf 3 , . . . , Hf K are known for all keyframes, the objective is to estimate the frame-to-frame homography Hi,j between an input frame Ij and its most similar matching keyframe Fi . Having said that, the model-to-frame homography for each input frame Hi 1 , Hi 2 , Hi 3 , . . . , Hi N can be simply obtained by using a series of matrix multiplications in which the frame-to-frame homography matrix Hi,j is multiplied to the model-to-frame homography matrix Hf i corresponds to the homography of the closest keyframe Fi to an input frame Ij , and results in the model-to54 (a) Input Image - Best Matching Keyframe (b) All SIFT Matchings Figure 6.1: SIFT Matching Visualization. (a) illustrates two best matching keyframes that our method proposed, and (b) illustrate all possible SIFT keypoint matching between them in order to construct a set of correspondences to estimates the homography between them. It is clear from the figure that the resulting correspondences contain many false matches (outliers) that will affect the homography estimation. frame homography matrix Hi j for an input frame Ij . Consequently, the model-to-frame homography transformations for all input frames Hi 1 , Hi 2 , Hi 3 , . . . , Hi N can be derived as follow: Hi 1 = Hj,1 Hf j Hi 2 = Hj,2 Hf j Hi 3 = Hj,3 Hf j ... Hi N = Hj,N Hf j (6.1) where j denotes the index of the closest keyframe to each input frame and ranges frame 1 to K accordingly. As we discussed in Chapter 2, a set of correspondences must be extracted from image pairs (i.e., a given input frame and its best matching keyframe) in order to compute the initial homography estimate between them. Since we propose a point-based homography estimation approach, first we need to use an interest point detector to extract a set of keypoints from both images; then, a local descriptor must be computed at each detected keypoint. In our initial homography estimates, we use the SIFT interest point detector 55 (a) Inlier SIFT Matchings Figure 6.2: SIFT Matching Inliers Visualization. This image illustrates all inlier correspondences between images that are presented in Figure 6.1 after running RANSAC on the putative set of all correspondences. to find keypoints in both images; then, compute the SIFT descriptor at each detected keypoint that allows us to find their potential correspondences within each image pairs. Having all keypoints and their corresponding descriptors for all input frames and keyframes, establishing a set of point correspondences between each image pair is the next step. Intuitively, a set of point correspondences can be derived by using a nearest neighbour method that computes the Euclidian distance between SIFT descriptors of each extracted feature within each input frame and its matching keyframe, and ranks potential correspondences based on calculated distances. Figure 6.1 illustrates all possible correspondences between pair of images using the approach we already discussed. This approach inevitably suffers from introducing undesirable incorrect matches to the set of correspondences. Since the homography estimation process is very sensitive to these outliers, they must be identified and removed from extracted correspondences before performing any further step. Hence, we run the RANdom SAmple Consensus (RANSAC) algorithm [11] on our initial set of correspondences to robustly estimate a set of inlier correspondences and an optimal homography estimate such that estimated inliers are consistent with. In fact, running the RANSAC algorithm over a set of putative correspondences between a pair of images produces a homography estimate between those images and also a set of inlier correspondences that are consistent with the estimated homography, and a set of outliers. In other words, the RANSAC algorithm uses the homography as a model, and randomly selects sample subsets from a set of putative correspondences that contains at least four correspondences, and then utilizes the normalized DLT [16] to compute the homography transformation between those samples and measures the reprojection error (discussed in Chapter 2). It repeats this procedure until the error reaches a value under the certain 56 threshold, and then it adds selected samples to the consensus set and repeats the entire process in order to construct a dense set of inlier correspondences from all putative ones together with an optimal homography estimate by using the Levenberg-Marquardt algorithm [22; 27] to minimize the reprojection error (Figure 6.2). In fact, the accuracy of the proposed approach that computes the initial homography estimates (Figure 6.3) can be improved by using a spatial weighting scheme such as a kernel density estimator that assigns a particular weight to each extracted point correspondence with regard to the density of inlier matches within the area of an image where it is extracted from such that correspondences that are extracted from areas of an image that have high density of inlier matches are assigned lower weights whereas correspondences within areas of the image that have lower density of matches are assigned higher weights. In other words, using a weighting scheme provides a constraint that puts emphasis on each extracted correspondence with respect to the density of extracted inlier correspondences within the area where it is extracted from. In fact, inlier correspondences within the areas of an image with high density of inliers result in more accurate model fitting rather than correspondences within the low density areas. Consequently, using the spatial weighting scheme helps inlier correspondences within areas of image with low density of inlier matches to result in more accurate model fitting, and so, facilitate more accurate homography estimates. Although the resulting homography estimates by using our proposed approach are not perfectly accurate and robust (Figure 6.3), they are precise enough to be used as initial estimates for the ICP algorithm in order to get refined into more accurate results. Hence, once the initial model-to-frame homography is estimated for each image within the input sequence, we have to post-process estimated results by using the ICP algorithm in order to improve the accuracy and robustness of the initial estimates, as the next and final step. 57 (a) (b) (c) (d) Figure 6.3: SIFT Homography Estimates Visualization. Images a-d illustrate model-toframe homography estimates based on using SIFT inlier correspondences by superimposing red lines and circles of an accurate geometric model of the basketball court on each image. 58 (a) (b) Figure 6.4: DPM Detection Results Visualization. Each red bounding box represents a detected player within each image by using the DPM player detector. 6.2 ICP Homography Estimates Having N input frames {I1 , I2 , I3 , . . . , IN }, and their initial model-to-frame homography estimates by using SIFT keypoint correspondences H1 , H2 , H3 , . . . , HN , the next task that must be performed is to take advantage of the ICP homography estimation method proposed by [25] to refine initial homography estimates and improve their overall accuracy. In other words, the ICP algorithm takes each SIFT model-to-frame homography estimate as an initial input and starts iterations, and eventually converges to optimal homography estimates Hicp 1 , Hicp 2 , Hicp 3 , . . . , Hicp N for each input frame after a number of iterations. Consequently, outputs of this process are final homography estimates produced by our system, and can be used for initializing any homography estimation method. Since the ICP homography estimation method uses the player detection results to remove false edges produced by players from the corresponding edge-map of each individual image frame, first we run the DPM detector on our input frames in order to detect players (Figure 6.4). Having detection results, the Canny edge detector [5] is used in order to extract edge components within both the game-field model and each image of the input sequence. In fact, using the Canny edge detector requires setting sensitivity thresholds in order to distinguish between strong and weak edges, and also the value of the sigma that is used in a derivative of Gaussian filter. Since we use three different datasets consisting hockey and basketball images, the values of these parameters have to be specified for each individual dataset to make the Canny edge detector work the best. For the hockey dataset, we set the high threshold value to 0.030 and set the sigma value to 3. Similarly the high threshold value and the value of the sigma for 2010 and 2011 basketball datasets are set to 0.015 and 1 respectively. Once edge components are extracted, we use the DPM detection results to remove false edges from the corresponding edge-map of each input frame. 59 w image (b) transformed edge points ore map Figure 6.5: Stable Points Visualization for 2010 Basketball Dataset. (d) new model points ess of updatingAfter model points in homography estimation. (a) The raw removing false edge components from the corresponding edge-map of each input frame, transformed edge Et , computed byhomography using theestimates estimated homogwe feedmap the initial model-to-frame into the ICP algorithm. The uses the model-to-frame homography of each input frame to project ore map St thatICP measures the consistency of edge points. (d) The newthe point cloud from edge components of the game-field model (as it is discussed in Chapter 2) his is computedextracted by thresholding the score map St and and include the to each frame, and removes projections that are outside the image boundaries. In addition set M . to edge components of the game-field model, the ICP method proposed by [25] uses a set of points called stable points to compute more accurate and robust homography estimates, since the model points are sparse in some frames because of the view point of the camera e transformedand/or edgeocclusions. map Et Incomputed by using estimated fact, stable points are thethe set of points that courtare extracted from the the ads, the logo of each edges teams, the stadiums’ and any other y Ht . We can texture noticeofthat most players’ have beenlogos removed by visual marks and/or textures on the actual game-field that have consistent homography transformations wever, there islike one not(Figure detected its using edges still remain theplayer model points 6.51 ). and Morethus precisely, stable points introduces additional constraints to regulate estimation process, and results5.10. in more accurate and gure 5.4c shows the score mapthe Sthomography computed by using Equation robust homography estimates accordingly. Moreover, his proposed method takes advantage is a weighted average previous frames, of players receive of an onlineover learning approach that learns edges and updates stable points within images of each individual dataset of Figure broadcast5.4d sport shows videos during processing theirpoint corresponding image es of logos have high values. the new model frames. points are constructed by thresholding the score map St and include Having the model points together with the stable points, ICP computes the similarity oint set M . We can observe thatthese newpoints model the from the edgetransformations between and points the point not cloudonly that ishave extracted map of the corresponding iteratively. Moreover, at each iteration ICP attempts cles, but also logos and marks ofinput the frame specific stadium. to find the closest point to each point in the first cloud from the second cloud in order to construct the set of correspondences between them. Afterward, it takes advantage of the Levenberg-Marquardt algorithm [22; 27] to estimate an optimal homography between 1 This picture is taken with permission from [25] 48 60 those correspondences by solving a nonlinear least-square problem. It also uses RANSAC [11] to remove outliers and improve the robustness of the solution. Nevertheless, ICP keeps iterating the entire process until it converges to the final locally optimal homography estimate for each input frame in which the misalignment error between projected game-field model points and image pixels is minimal. Since at each iteration ICP updates the modelto-frame homography estimate, new homography estimates are more accurate than the model-to-frame homography estimates that we fed into the algorithm earlier (Figure 6.6). As we discussed earlier, ICP improves the robustness, accuracy and running time of the homography estimation process compared to all previous approaches. Unfortunately, ICP only uses points to produce a set of correspondences between image pairs, and does not use any additional structural information within images such as lines, ellipses and any other correspondences. The rationale behind this approach lies in the fact that those nondistinctive structural features are almost always partially occluded by players, referees, or fans within the broadcast sport video frames and it is very difficult to extract and match them accurately. However, if the additional structural information from accurately detected lines, ellipses and other reliable correspondences can be added to the ICP method, it would definitely help ICP to establish better correspondences and result in more accurate and robust homography estimates accordingly. In fact, using the method proposed by [14] is one of the alternative approaches that can be used to accurately detect lines and ellipses together with other reliable correspondences even when they are occluded and/or in the extreme cases where recognizing those shapes is very difficult even by using our eyes. 61 (a) SIFT Homography Estimate (b) ICP Homography Estimate (c) SIFT Homography Estimate (d) ICP Homography Estimate (e) SIFT Homography Estimate (f) ICP Homography Estimate Figure 6.6: SIFT versus ICP Homography Estimates Visualization. SIFT versus ICP Homography Estimates Visualization. Images on the left side of the figure illustrate the visualization of SIFT homography estimates and images on the right side illustrate corresponding ICP homography estimates. As these images demonstrate running ICP on initial SIFT homography estimates considerably improves the accuracy of the estimation. The model-to-frame homography estimates are illustrated by superimposing red lines and circles of an accurate geometric model of the basketball court on each image. 62 Chapter 7 Experiments and Evaluation Results This chapter provides a set of empirical evaluations of the performance of our proposed method. We design a set of experiments using first images within all clips of two complete basketball games and one period of a hockey game, in which we also have the corresponding groundtruth homography estimate for each frame, in order to evaluate how close our estimates are. Having groundtruth data and estimated results, we calculate RMSE (Root Mean Square Error), maximum and median errors to evaluate the accuracy of estimated results. More precisely, we extract a grid of cells from game-field models such that each cell is a 5 × 5 square of pixels, and then use corner points of each cell as sample points. Once we extract sample points from the game-field model, we use the corresponding groundtruth homography Hgt and estimated homography Hest of each given frame in order to project extracted sample points to the image. Finally, projected sampled points that are outside of the image boundaries are removed from the set of projected points, and the Root Mean Square Distance (RMSD) between projected sample points using the groundtruth homography and the estimated homography is computed for each image as a measure of accuracy. This process can be formulated as follow: RMSE = 1 gt p∈P I(p, H i ) I(p, Hgt i ) T (p, Hgt i ) − T (p, Hest i ) (7.1) p∈P where I is an indicator function that checks whether each projected point is within the image boundaries or not, and T is a function that transforms sample points to each image by using the groundtruth and the estimated homography, and finally the index i denotes the index of the given input image. Basically, we use the image boundaries to get rid of projected points that lie outside of the image boundaries because computing the RMSD between points within and outside image is misleading, so, it does not help us to evaluate the precision of our estimated results. Using this measurement together with the Median, Mean and Max error, next we show how precise our estimates are compared to the groundtruth data for each of three different datasets (i.e., hockey, basketball 2010 and 2011). 63 7.1 Basketball Datasets Evaluation Table 7.1: Performance Analysis of SIFT vs. ICP Estimates - 2010 Basketball Dataset Method SIFT Homography ICP Homography Median Error (pixel) 5.557 5.090 Mean Error (pixel) 7.104 5.688 Max Error (pixel) 42.731 20.311 Figure 7.1: RMSE Curve of The 2010 Basketball Dataset. We evaluate the accuracy and robustness of our proposed method on two basketball datasets as well as one hockey dataset. The 2010 basketball dataset is an image sequence consisting of first frames of 123 clips within one complete basketball game, and the 2011 basketball dataset is an image sequence consisting of first frames of 97 clips within another game. In fact, computing the initial and final homography estimates for images within these basketball datasets is much less challenging than the hockey ones, since a reasonably large number of SIFT correspondences can be extracted from their consisting images, and this allows our method to produce reasonably accurate initial homography estimates that can feed into the ICP method and result in more accurate estimates with minimized error. Moreover, our empirical evaluation justifies the assumption that we made about running the ICP algorithm over initial homography estimates in order to improve their accuracy. The empirical results 64 illustrate that running the ICP method over the initial estimates considerably improves their accuracy and results in better estimates with much less error accordingly. Figure 7.1 and Figure 7.2 illustrate RMSE curves of both initial and final homography estimates for images within the 2010 and 2011 basketball datasets respectively. Moreover, Table 7.1 and Table 7.2 represent empirical evaluations of homography estimates for images those datasets in terms of the Median, Mean and Max error analysis. Table 7.2: Performance Analysis of SIFT vs. ICP Estimates - 2011 Basketball Dataset Method SIFT Homography ICP Homography Median Error (pixel) 7.639 5.898 Mean Error (pixel) 8.836 6.794 Max Error (pixel) 28.637 20.628 Figure 7.2: RMSE Curve of The 2011 Basketball Dataset. According to our empirical evaluations, in terms of the Mean and Median errors of both initial and final homography estimates, our proposed method works better for the 2010 basketball dataset rather than the 2011 dataset. However, in terms of refinements and reducing initial errors, our empirical analysis demonstrates that our method works better for the 2011 dataset rather than the 2010 dataset. In other words, in terms of the Mean and Median errors, the difference between the initial and final homography estimates in the 2011 dataset (it is about 2 pixels) is larger than the same measurement for 2010 dataset (it 65 is about 0.5 pixel). As a result, our method works better for the 2011 dataset in terms of reducing the estimation error by taking advantage of the ICP method. Moreover, in terms of comparing the Max errors of the initial and final estimates, our evaluation illustrates that ICP method still works the best by reducing the error even if the initial estimates suffer from a very large error. Since both of these datasets contain images with different levels of zoom, blur and illumination conditions (i.e., camera flash), our empirical evaluations clearly demonstrate the robustness of our proposed method under each of these extreme situations. In fact, severe camera motions that yields blurred images affect the accuracy of our initial estimates; however, taking advantage of the ICP method as a post-processing stage minimizes this effect. Furthermore, the experimental results justify that, since both out SIFT-keypoint-based homography estimation method and the ICP method are invariant to the illumination and scale changes, the different lighting conditions and zoom levels do not affect the accuracy of our estimates. Figure 7.7 and Figure 7.8 illustrate how well our proposed method works in terms of estimating both the initial and final homography for images within the 2010 and 2011 datasets under each of those extreme conditions. In terms of computing the initial homography estimates, as we discussed in Chapter 6, in our experiments we take advantage of all extracted SIFT correspondences within each input image and its closest keyframe in order to establish the initial homography estimate between them. However, since our ultimate goal is to estimate the model-to-frame homography for each input frame, we expect that extracting only SIFT correspondences that are on the game-field area may improve the accuracy of the initial estimates. To evaluate this assumption, we use the corresponding binary masks (discussed in Chapter 3) of both input images and their corresponding closest keyframes in order to select SIFT correspondences that are only extracted from the game-field areas. Then, we re-compute the initial homography estimates. Surprisingly, the resulting homography estimates are much less accurate than the estimates without performing the correspondence selection step. One reasonable interpretation of this observation is that masking out the non-game-field areas negatively affects the distribution of the extracted correspondences and results in less accurate estimates. Moreover, removing extracted correspondences within the non-game-field areas eliminates a large number of correct inlier correspondences that are on the planes and surfaces which are perpendicular to the game-field plane. In fact, those correspondences facilitate computing accurate homography estimates, so, removing them result in less accurate estimates. Nevertheless, we still expect that using a proper spatial weighting scheme such as a kernel density estimator would improve the accuracy and robustness of our initial estimates, since it provides a constraint that yields to put emphasis on each extracted correspondence 66 with respect to the density of extracted inlier correspondences within the area where it is extracted from. In terms of improving the accuracy of the initial estimates by using the ICP method, as we discussed in Chapter 6, we use the points (i.e., model points, stable points, and the edge components within each image) in order to produce a set of correspondences between image pairs. Moreover, in our experiments we eliminate the points that are extracted from the background objects for all sets of the extracted point clouds by taking advantage of the background color model of input images (discussed in Chapter 3). Unlike the SIFTkeypoint based homography estimates, using this selection approach improves the accuracy of the ICP homography estimates compared to its basic implementation. However, as we discussed in Chapter 3, since both foreground and background object within images in our datasets share a lot of similar colors, the resulting color model of both the background and the foreground does not perfectly distinguish objects that are associated to each of them. However, those models are still accurate enough in order to help us eliminate several wrong matches from the correspondence sets, and feed the refined sets into the ICP method in order to compute more accurate homography estimates accordingly. Furthermore, one additional approach that would eliminate more false matches from the correspondence sets and result in more accurate final estimates is to take advantage of additional structural information within images, such as lines and ellipses within the gamefield areas, in order to put more emphasis on the consistent edge points that are laid on these components rather than other edge points that might be extracted from the nongame-field areas or even noises. The rationale behind this approach lies in the fact that even though those non-distinctive structural features (i.e., lines and ellipses) are usually partially occluded by players, referees, or fans within the broadcast sport video frames and it is actually very difficult to extract and match them accurately, if they can be accurately extracted and added to the ICP method, then it would definitely help ICP to establish better correspondences and result in more accurate and robust homography estimates accordingly. In fact, using the method proposed by [14] is one of the alternative approaches that can be added the ICP in order to accurately detect lines and ellipses together with other reliable correspondences even if they are occluded and/or in the extreme cases where recognizing those shapes is very difficult even by using our eyes. 67 7.2 Hockey Dataset Evaluation In addition to the basketball datasets, we evaluate the accuracy and robustness of our proposed method on the hockey dataset as well. The hockey dataset is the most challenging dataset in terms of the homography estimation among all dataset we used to test the accuracy and robustness of our method, since it contains frames such that not only are severely blurred, but also do not have enough number of distinctive local features. Figure 7.3 illustrate the RMSE curve. Figure 7.3: RMSE Curve of The Hockey Dataset. Our empirical evaluation indicates that our method fails to accurately estimate the homography for four particular frames within the hockey dataset. Since our proposed method dramatically fails to accurately estimate the homography for those cases, we report the Median, Mean, and Max errors after removing those cases in the Table 7.3. Moreover, Figure 7.4 illustrates the RMSE curve after removing failure cases. According to the empirical results, not only our proposed method produces inaccurate initial estimates for images within the hockey dataset, but also running the ICP does not considerably improve the accuracy of the initial estimates compared to both basketball datasets. Hence, in the remainder of this section, we focus on further discussions about reasons that cause our proposed method to 68 fail to produce accurate estimates for the hockey dataset. Table 7.3: Performance Analysis of SIFT vs. ICP Estimates - Failure Cases Removed Method SIFT Homography ICP Homography Median Error (pixel) 13.014 11.172 Mean Error (pixel) 14.432 15.783 Max Error (pixel) 61.294 50.790 Figure 7.4: RMSE Curve of The Hockey Dataset - Failure Cases Removed. In order to further explore reasons that make our method fail to accurately estimate the homography for each of those four particular frames, we illustrate those frames with the final estimated homography superimposed on each of them in the Figure 7.5. As can be seen from images within the Figure, all failure cases are suffering from the severe blur level and/or lack of distinctive local features. Since our method estimates the initial homography transformations based on the SIFT correspondences, it probably fails to find a dense set of correspondences for these cases; thus, it ends up with a wrong homography estimate such that even the ICP cannot improve it. Figure 7.6 illustrates all the inlier SIFT matches between the frame 35 and its corresponding best matching keyframe. As can be seen from the Figure there are only a few inlier matches, and even correspondences that are estimated as inliers still contain outliers and false matches. 69 (a) Frame 13 (Blurred + Few Features) (b) Frame 18 (Blurred + Few Features) (c) Frame 27 (Few Features) (d) Frame 35 (Blurred + Few Features) Figure 7.5: Homography Estimation Failures Visualization. As we mentioned earlier, two major causes of those failures are: severe motion blur and lack of distinctive keypoints within those particular frames. In fact, we tackle the first problem by using the blur metric that we discussed in Chapter 3 to skip blurry images and move forward in the sequence until find the first salient image, and then replace it with the previous first image. It turns out that even removing blurred frames only slightly improves the performance of our estimation method, since we still end up with salient images that do not have enough distinctive keypoint features. Nevertheless, the blur measurement solves the first problem, but the second problem remains and does not allow our method to result in accurate estimates. In terms of initial homography estimates, in the absence of a set of reliable distinctive local features, extracting and matching non-distinctive features such as lines and ellipses within each image would provide invaluable additional information in order to compute more accurate and robust homography estimates [9; 14]. Therefore, one alternative solution to resolve the problem of dramatically inaccurate initial homography estimates is to take advantage of the approach proposed by [14] that uses lines and ellipses correspondences together with distinctive local feature in order to estimate the homography transformations. Moreover, as we discussed in previous section, adding the additional structural informa70 (a) Input Frame Closest Keyframe (b) Inlier SIFT Matchings Figure 7.6: SIFT Matching Failures Visualization. tion from accurately detected lines, ellipses and other reliable correspondences [14] to the ICP method would improve the overall accuracy of final homography estimates as well. Figure 7.9 illustrates how well our proposed method works in the presence of a sufficient number of distinctive local features and a reasonable amount of blur. 71 (a) SIFT Estimate (Correspondence Clutter) (b) ICP Estimate (Correspondence Clutter) (c) SIFT Estimate (Blurred) (d) ICP Estimate (Blurred) (e) SIFT Estimate (Blurred + Zoomed) (f) ICP Estimate (Blurred + Zoomed) (g) SIFT Estimate (Blurred + Zoomed) (h) ICP Estimate (Blurred + Zoomed) Figure 7.7: SIFT versus ICP Homography Estimates - The 2010 Basketball Dataset. The images in this Figure illustrate how well the ICP estimates works by using SIFT initial estimates under extreme conditions. The homography estimates are shown by superimposing red lines and circles of an accurate geometric model of the basketball court on each image. 72 (a) SIFT Estimate (Zoomed-out + Clutter) (b) ICP Estimate (Zoomed-out + Clutter) (c) SIFT Estimate (Blurred) (d) ICP Estimate (Blurred) (e) SIFT Estimate (Blurred + Clutter) (f) ICP Estimate (Blurred + Clutter) (g) SIFT Estimate (Blurred + Zoomed) (h) ICP Estimate (Blurred + Zoomed) Figure 7.8: SIFT versus ICP Homography Estimates - The 2011 Basketball Dataset. The images in this Figure illustrate how well the ICP estimates works by using SIFT initial estimates under extreme conditions. The homography estimates are shown by superimposing red lines and circles of an accurate geometric model of the basketball court on each image. 73 (a) SIFT Estimate (Salient - Few Features) (b) ICP Estimate (Salient - Few Features) (c) SIFT Estimate (Salient - Few Features) (d) ICP Estimate (Salient - Few Features) (e) SIFT Estimate (Blurred - Many Features) (f) ICP Estimate (Blurred - Many Features) Figure 7.9: SIFT versus ICP Homography Estimates - The Hockey Dataset. The images in this Figure illustrate how well the ICP estimates works by using SIFT initial estimates. The homography estimates are shown by superimposing red lines and circles of an accurate geometric model of the hockey rink on each image. 74 Chapter 8 Discussion and Conclusion To conclude, this thesis comprises four key modules, namely preprocessing, keyframe selection, keyframe matching, and frame-to-frame homography estimation, that work together in order to automatically initialize any homography estimation method that can be used for broadcast sport videos. More precisely, the first part of our system removes blurry images from the entire input frames; then, it roughly estimates the game-field area within each remaining salient image and represents it as a binary mask that corresponds to that image (Chapter 3). The resulting binary masks that are produced by this module are fed into the keyframe selection module in order to select a set of representative frames. Moreover, the blur measurement that is used in this module and the learned color model of the game-field areas which is produced by this module are both also used in the frame-to-frame homography estimation module in order to improve the accuracy and robustness of the resulting homography estimates (Chapter 6). The second part is the keyframe selection module. Having the set of binary masks that roughly represent the game-field area within corresponding images, this module uses a robust dimensionality reduction method (robust PCA) together with a clustering algorithm (X-Means) in order to cluster those masks. Then the corresponding images to the resulting centers of clusters are used as keyframes (Chapter 4). Having the selected keyframes and the input sequence, the third module of our system finds the closest keyframe to each input frame by taking advantage of three naive classifiers such that each of which produces results at the different levels of accuracy, but they all together produce results that are accurate enough to be used for frame-to-frame homography estimation. Therefore, we use an artificial neural network to combine the results of those naive classifiers and improve the overall accuracy of the keyframe matching process (Chapter 5). Finally, the last module takes the input frames, their corresponding closest keyframes, and the manually annotated model-to-frame homography estimates of all keyframes, and then computes the model-to-frame homography for all input frames. More precisely, it first estimates the frame-to-frame homography between each input frame and its corresponding closest keyframe, and then multiplies the resulting estimate to the model-to-frame homography of that keyframe in order to compute the model-to-frame homography estimate for the 75 given input frame. Moreover, this module produces the initial model-to-frame homography for all input frames by only using SIFT correspondences, and then refines those the initial estimates by taking advantage of an accurate and robust homography estimation method called ICP (Chapter 6). Even though our proposed approach is not completely automatic, compared to previous methods it considerably reduces the demand of the manually annotated and human supervision during the entire process, particularly when we are dealing with a large amount of input data. More precisely, despite the fact that the color classification method that is used in the preprocessing module automatically learns the color model of both game-field and non-game-field areas within input images, it still needs to use manually annotated pixels selected from each of those areas within a couple of images (at most five images) in order to produce fairly accurate color models. Moreover, although in the keyframe selection module, we do not specify the number of the keyframes to be selected from inputs and let the X-Means itself infer this number from the input data, we still have to set the upper threshold for the X-Means that specifies the maximum number of the keyframes we are looking for. Furthermore, computing the model-to-frame homography estimate for all input frames still requires the manually annotated model-to-frame homography data for all selected keyframes. 8.1 Future Work In the remainder of this chapter we are going to focus on alternative directions for the future work that may result in considerable improvements in each of principal modules of our system respectively. 8.1.1 Automatic Keyframe Selection As we discussed in the (Chapter 4), although using the X-Means instead of the K-Means improves the scalability of the clustering and diminishes the need to manually specify the K, it still suffers from two major shortcomings. The first one is that using the X-Means still requires us to manually specify the maximum number of the clusters we are looking for. Moreover, the second and the most important shortcoming of this approach is that the number of the resulting clusters may not be optimal, since the X-Means’s objective function works only based on comparing the binary masks with each other rather than comparing the accuracy of the resulting homography estimates by using a particular number of the 76 clusters. However, the selected keyframes by using this approach still satisfy the criteria of the keyframe selection process. In fact, one alternative solution that may improve the efficiency of the keyframe selection module is to set a new objective function for the XMeans algorithm such that it aims to reduce the homography estimation error by choosing a proper number of clusters. In other words, we need to change the X-Means algorithm in a way that it starts with an initial number of the clusters, and then compute the homography for the entire given sequence and measure the homography estimation error. Then it can iteratively increase the number of the clusters and perform this entire process again until the resulting homography estimation error does not considerably increase and/or change. 8.1.2 Automatic Keyframe Matching - Sophisticated Classifiers Having N input frames and K keyframes, as we discussed in Chapter 5, we can formulate the problem of finding the closest keyframe to each given input frame as a classification problem such that the index of each of the K keyframes have to be treated as class labels, and so, the ultimate goal of the keyframe matching module is to label the N given frames with all those K labels corresponding to the keyframes. Even though our proposed method uses naive discriminative techniques that work reasonably well based on comparing the squared distances between RGB histogram grids, the non-overlapping areas within the XORs of the binary masks, and the number of inlier SIFT correspondences, using more sophisticated classifiers such as SVM s would definitely improve the accuracy and robustness of the keyframe matching module. In fact, one alternative approach is to first reduce the dimensionality of both input frames and keyframes by using the robust PCA, and then feed the dimensionality reduced data into the kernel SVM with the pyramid matching kernel [13] and let it extract local distinctive features from those images, such as SIFT [24], MSER [28], and SURF [1] in order to precisely classify the N input frames to the K classes of keyframes. One other alternative approach would be using either the dimensionality reduced RGB histogram grids of both input frames and keyframes or their dimensionality reduced corresponding binary masks as the input for the SVM, and let the SVM classify the N input frames to the K classes of keyframes based on similarities among the RGB histogram grids and/or the binary masks (discussed in Chapter 3 and Chapter 5). Even though all these classifiers take images as the input and give us back a set of labels that represents the indices of the best matching keyframe for each given frame, one can modify them in order to produce the probability distribution over class labels instead, and make the estimated results much more soft and robust accordingly. 77 8.1.3 Automatic Keyframe Matching - Ensemble Classification In fact, using ensemble models [15] in order to combine the results of those classifiers (i.e., experts) we proposed in previous section may considerably improve the accuracy and robustness of the keyframe matching estimation process. One other alternative to improve the keyframe matching module is to take advantage of the hierarchical classification system such that first we rank our classifiers based on their precision and recall from the best one to the worst one, and then use the best classifier for running over entire input data and classify all of them. Then, the resulting failure cases of using the first classifier should be passed to the next best classifier, and so on, until all images are classified and/or there was no more classifiers to use. In case of using classifiers that produce probabilities rather than class labels, one can take advantage of mixture models together with the EM [3] methods in order to come up with more accurate and robust classification estimates. 8.1.4 Homography Refinement - Correspondence Prioritization In terms of improving the accuracy and robustness of the resulting homography estimates, there is room for many improvements; however, in this section and the following remaining sections, we can only mention ones we think that are more effective, and so, more important to be discussed. As we discussed in Chapter 6 and Chapter 7, using a proper spatial weighting scheme such as a kernel density estimator would improve the accuracy and robustness of our homography estimates, since it prioritizes extracted correspondences and intentionally put more emphasis on extracted correspondences within areas of image with low density of inlier matches rather than correspondences within high density areas in order to result in more accurate model fitting, and so, facilitate more accurate homography estimates. While using the weighting scheme may improve both the initial and final homography estimates, integrating it with the color model of the game-field area would definitely improve the overall accuracy and robustness of the final homography estimates. 8.1.5 Homography Refinement - Structural Information Integration One other important modification that can make a significant impact on the performance of the ICP method is to add additional structural information from accurately detected lines and ellipses correspondences to the ICP such that it can put more emphasis on the extracted points from those lines and ellipses correspondences rather than other correspondences and/or noises. In fact, the method proposed by [14] is one of the best alternatives that can be used in order to achieve this improvement. 78 8.1.6 Robust ICP - Gaussian Mixture Models for the ICP In these last two sections, we particularly focus on discussing two different alternatives that can be used for improving the performance of the ICP algorithm. In fact, one alternative approach that can be used to make the ICP method more robust to the significant amount of noise and outliers within the given set of point clouds is to take advantage of the method proposed by [19] in which each point is represented as a Gaussian model, and so, each point cloud is represented as a mixture of Gaussian models accordingly. Then, the problem of aligning pairs of point clouds can be reformulated into the problem of minimizing the statistical difference between their corresponding Gaussian mixture models which can be simply measured by computing the Euclidian distance between the Gaussian mixtures. 8.1.7 3D Homography Estimation - 3D Point Cloud Alignment Even though this thesis work is all about accurately estimating 2D homography transformations between image pairs, extending the 2D homography estimation to the 3D is one of the most interesting and challenging directions for the future work, since the 3D homography estimates can be used to facilitate the camera calibration and 3D scene reconstruction. However, computing the 3D homography estimate requires the point clouds that have to be extracted from at least two perpendicular planes within the image. In fact, this constraint makes this extension very challenging, since within the broadcast sport video frames there is no guarantee that such two (or more) planes exist, or if they exist, it might be very difficult to extract reliable points from them. Moreover, having the extracted 3D point clouds from images, another challenging problem would be how we can compute the homography between them. In order to address this issue, one can use the novel method proposed by [39] that automatically aligns the continuous video onto the 3D point clouds. 79 Bibliography [1] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In ECCV 2006: 9th European Conference on Computer Vision, 2006. [2] Paul J. Besl and Neil D. McKay. A method for registration of 3-d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14:239 – 256, 1992. [3] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, ISBN:0387310738, 1 edition, 2007. [4] Yizheng Cai, Nando de Freitas, and James J. Little. Robust visual tracking for multiple targets. In ECCV ’06: 9th European Conference on Computer Vision, pages 107–118, 2006. [5] John Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8:679 – 698, 1986. [6] Hua-Tsung Chen, Ming-Chun Tien, Yi-Wen Chen, Wen-Jiin Tsai, and Suh-Yin Lee. Physics-based ball tracking and 3d trajectory reconstruction with applications to shooting location estimation in basketball video. Journal of Visual Communication and Image Representation, 20:204–216, 2009. [7] Frederique Crete, Thierry Dolmiere, Patricia Ladret, and Marina Nicolas. The blur effect: perception and estimation with a new no-reference perceptual blur metric. In SPIE Electronic Imaging Symposium Conf Human Vision and Electronic Imaging, 2007. [8] Elan Dubrofsky and Robert J. Woodham. Combining line and point correspondences for homography estimation. In ISVC ’08 Proceedings of the 4th International Symposium on Advances in Visual Computing, 2008. [9] Bin Fan, Fuchao Wu, and Zhanyi Hu. Line matching leveraged by point correspondences. In CVPR 2010: IEEE Conference on Computer Vision and Pattern Recognition, 2010. [10] Pedro Felzenszwalb, David McAllester, and Deva Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR’08. IEEE Conference on Computer Vision and Pattern Recognition, 2008. 80 [11] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24:381–395, 1981. [12] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 38(2):337–407, 2000. [13] Kristen Grauman and Trevor Darrell. The pyramid match kernel: discriminative classification with sets of image features. In ICCV 2005. Tenth IEEE International Conference on Computer Vision, 2005. [14] Ankur Gupta, James J. Little, and Robert J. Woodham. Using line and ellipse features for rectification of broadcast hockey video. In CRV 2011; Eighth Canadian Conference on Computer and Robot Vision, 2011. [15] Lars Kai Hansen and Peter Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12:993 – 1001, 1990. [16] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004. [17] Jean-Bernard Hayet and Justus Piater. On-line rectification of sport sequences with moving cameras. In MICAI’07 Proceedings of the artificial intelligence 6th Mexican international conference on Advances in artificial intelligence, 2007. [18] Robin Hess and Alan Fern. Improved video registration using non-distinctive local image features. In CVPR ’07: IEEE Conference on Computer Vision and Pattern Recognition, 2007. [19] Bing Jian and Baba C. Vemuri. Robust point set registration using gaussian mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1633 – 1645, 2011. [20] Kihwan Kim, Matthias Grundmann, Ariel Shamir, Iain Matthews, Jessica Hodgins, and Irfan Essa. Motion field to predict play evolution in dynamic sport scenes. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, June 2010. [21] Fernando De la Torre and Michael J. Black. A framework for robust subspace learning. International Journal of Computer Vision, 54:117–142, 2003. [22] Kenneth Levenberg. A method for the solution of certain non-linear problems in least squares. The Quarterly of Applied Mathematics, 2:164–168, 1944. 81 [23] Fahong Li and Robert J. Woodham. Video analysis of hockey play in selected game situations. Image and Vision Computing, 27:45–58, 2009. [24] David Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004. [25] Wei-Lwun Lu. Learning to Track and Identify Players from Broadcast Sports Videos. PhD thesis, The University Of British Columbia, 2011. [26] Wei-Lwun Lu, Kenji Okuma, and James J. Little. Tracking and recognizing actions of multiple hockey players using the boosted particle filter. Image and Vision Computing, 27:189–205, 2009. [27] Donald W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11:431–441, 1963. [28] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22:761–767, 2004. [29] Krystian Mikolajczyk and Cordelia Schmid. Scale & affine invariant interest point detectors. International Journal of Computer Vision, 60:63–86, 2004. [30] Kenji Okuma. Automatic acquisition of motion trajectories: Tracking hockey players. Master’s thesis, The University of British Columbia, 2003. [31] Kenji Okuma. Active Exploration of Training Data for Improved Object Detection. PhD thesis, The University Of British Columbia, 2011. [32] Kenji Okuma, James J. Little, and David G. Lowe. Automatic rectification of long image sequences. In ACCV 04: Asian Conference on Computer Vision, 2004. [33] Dan Pelleg and Andrew Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 727–734. Morgan Kaufmann, 2000. [34] Steven K. Rogers and Dennis W. Ruck. Learning ranks with neural networks. In Applications and Science of Artificial Neural Networks, 1995. [35] Jianbo Shi and C. Tomasi. Good features to track. In Proceedings CVPR ’94., 1994 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1994. 82 [36] Engin Tola, Vincent Lepetit, and Pascal Fua. Daisy: An efficient dense descriptor applied to wide baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32:815 – 830, 2010. [37] Fei Yan, Alexey Kostin, William Christmas, and Josef Kittler. A novel data association algorithm for object tracking in clutter with application to tennis video analysis. In CVPR ’06 Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006. [38] Angela Yao, Dominique Uebersax, Juergen Gall, and Luc Van Gool. Tracking people in broadcast sports. In DAGM 2010 32nd Annual Symposium of the German Association for Pattern Recognition, 2010. [39] W. Zhao, D. Nister, and S. Hsu. Alignment of continuous video onto 3d point clouds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:1305 – 1318, 2005. 83
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Automatic initialization for broadcast sports videos...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Automatic initialization for broadcast sports videos rectification Mohammadi Tari, Shervin 2011
pdf
Page Metadata
Item Metadata
Title | Automatic initialization for broadcast sports videos rectification |
Creator |
Mohammadi Tari, Shervin |
Publisher | University of British Columbia |
Date Issued | 2011 |
Description | Broadcast sport videos can be captured by a static or a moving camera. Unfortunately, the problem with a moving camera is that planar projective transformations (i.e., the homographies) have to be computed for each image frame in a video sequence in order to compensate for camera motions and viewpoint changes. Recently, a variety of methods have been proposed to estimate the homography between two images based on various correspondences (e.g., points, lines, ellipses matchings, and their combinations). Since the frame to frame homography estimation is an iterative process, it needs an initial estimate. Moreover, the initial estimate has to be accurate enough to guarantee that the method is going to converge to an optimal estimate. Although the initialization can be done manually for a couple of frames, manual initialization is not feasible where we are dealing with thousands of images within an entire sports game. Thus, automatic initialization is an important part of the automatic homography estimation process. In this dissertation we aim to address the problem of automatic initialization for homography estimation. More precisely, this thesis comprises four key modules, namely preprocessing, keyframe selection, keyframe matching, and frame-to-frame homography estimation, that work together in order to automatically initialize any homography estimation method that can be used for broadcast sports videos. The first part removes blurry images and roughly estimates the game-field area within remaining salient images and represents them as a set of binary masks. Then, those resulting binary masks are fed into the keyframe selection module in order to select a set of representative frames by using a robust dimensionality reduction method together with a clustering algorithm. The third module finds the closest keyframe to each input frame by taking advantage of three classifiers together with an artificial neural network to combine their results and improve the overall accuracy of the matching process. The last module takes the input frames, their corresponding closest keyframes, and computes the model-to-frame homography for all input frames. Finally, we evaluate the accuracy and robustness of our proposed method on one hockey and two basketball datasets. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2012-01-10 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0052173 |
URI | http://hdl.handle.net/2429/39969 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2012-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2012_spring_mohammaditari_shervin.pdf [ 14.68MB ]
- Metadata
- JSON: 24-1.0052173.json
- JSON-LD: 24-1.0052173-ld.json
- RDF/XML (Pretty): 24-1.0052173-rdf.xml
- RDF/JSON: 24-1.0052173-rdf.json
- Turtle: 24-1.0052173-turtle.txt
- N-Triples: 24-1.0052173-rdf-ntriples.txt
- Original Record: 24-1.0052173-source.json
- Full Text
- 24-1.0052173-fulltext.txt
- Citation
- 24-1.0052173.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0052173/manifest