Robust Visual Tracking for Multiple Targets

by

Yizheng Cai

B.E., Zhejiang University, 2003

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Computer Science)

The University of British Columbia

September 2005

© Yizheng Cai, 2005

Abstract

We address the problem of robust multi-target tracking within the application of hockey player tracking. Although there has been extensive work in multi-target tracking, there is no existing visual tracking system that can automatically and robustly track a variable number of targets and correctly maintain their identities with a monocular camera, regardless of background clutter, camera motion and frequent mutual occlusion between targets. We build our system on the basis of the previous work by Okuma et al. [OTdF+04]. The particle filter technique is adopted and modified to fit into the multi-target tracking framework. A rectification technique is employed to map the locations of players from the video frame coordinate system to the standard hockey rink coordinates, so that the system can compensate for camera motion and the dynamics of players on the rink can be modelled more accurately by a second order autoregressive model. A global nearest neighbor data association algorithm is introduced to assign boosting detections to the existing tracks for the proposal distribution in the particle filters. The mean-shift algorithm is embedded into the particle filter framework to stabilize the trajectories of the targets for robust tracking during mutual occlusion. The color model of the targets is also improved by the kernel introduced by mean-shift. Experimental results show that our system is able to correctly track all the targets in the scene even if they are partially or completely occluded for a period of time.
Key words: Multi-target tracking, Homography, Sequential Monte Carlo methods, Data association, Mean-shift

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Challenges
    1.3.1 Target localization
    1.3.2 Target representation
    1.3.3 Non-rigidity of the target
    1.3.4 Tracking stabilization
    1.3.5 Mutual occlusion
  1.4 Contributions
  1.5 Thesis Outline
2 Previous Work
  2.1 Target Representation and Localization
    2.1.1 Representation
    2.1.2 Localization
  2.2 Filtering and Data Association
    2.2.1 Filtering
    2.2.2 Data association
    2.2.3 Mutual exclusion
  2.3 Summary
3 Particle Filter for Multi-target Tracking
  3.1 Particle Filter
    3.1.1 Monte Carlo simulation
    3.1.2 Sequential importance sampling
    3.1.3 Resampling
  3.2 Boosted Particle Filter
  3.3 Our Improvements
4 Target Dynamics Modelling
  4.1 Rectification
    4.1.1 Homography
    4.1.2 Multi-layer state space model
  4.2 Autoregressive Dynamics Model
5 Data Association
  5.1 Gating
  5.2 Linear Optimization
6 Mean-Shift Embedded Particle Filter
  6.1 Color Model
  6.2 Mean-Shift
  6.3 Adaptive Scaling
  6.4 Mean-shift Embedded Particle Filter
7 Experimental Results and Conclusion
  7.1 Results
  7.2 Conclusion
  7.3 Discussion and Future Extension
Bibliography

List of Tables

Assignment matrix for the assignment problem shown in Figure 5.1

List of Figures

1.1 The figure shows that different players can have quite different shapes, so that it is impossible to use a bounding box with a fixed edge ratio.
1.2 The above 4 frames show that the tracker of the target drifts significantly over time.
1.3 The figure shows the situation of complete occlusion and partial occlusion during the process. The rectangles show the regions where occlusions take place.
2.1 Graphical model representation of the relationship between an observed variable y_t and a latent variable x_t.
2.2 Graphical model representation of the Hidden Markov Model, where x_0 is the initial state.
3.1 The SIR process starts at time t-1 with a set of equally weighted particles {x̃_{t-1}^{(i)}, N^{-1}}_{i=1}^{N}, which represent the distribution p(x_{t-1} | y_{0:t-2}). Importance weights are updated with information at time t-1 to represent p(x_{t-1} | y_{0:t-1}) with {x_{t-1}^{(i)}, w_{t-1}^{(i)}}_{i=1}^{N}. Resampling is performed to make it an equally weighted particle set {x̃_{t-1}^{(i)}, N^{-1}}_{i=1}^{N} again, but still representing the same distribution. Finally, particles are propagated to {x_t^{(i)}, N^{-1}}_{i=1}^{N} at time t by the proposal distribution to represent p(x_t | y_{0:t-1}), and the iterations go forward [vdMAdFW00].
3.2 The final proposal is in the form of a mixture of Gaussians which combines the prior and the boosting detection [OTdF+04].
4.1 This shows how video frames are mapped to the rink. The top left image is the input video frame; the top right is the standard rink model; the bottom image shows how the image looks after being mapped onto the rink.
4.2 Graphical model representation of the three-layer structure of the state space model.
4.3 All the black rectangles are the boosting detections. Black dots on the bottom edge of each rectangle represent the positions of the players on the rink.
5.1 T1, T2, T3 are predicted target positions; O1, O2, O3, O4 are observations. O2 is assigned to T1; O1 is a candidate of T1, T2 and T3; O4 is a candidate of T2 and T3; O3 is not assigned to any existing tracks, so it is a candidate of being a new target.
5.2 T1, T2 are predicted positions of targets; V1, V2 are vectors that indicate the velocities of the targets.
6.1 Multi-part color histogram of two players from two different teams.
6.2 This figure shows how significantly the color histogram of the same target changes over time.
6.3 Comparison between the original track output and the mean-shift stabilized result. The left ones are the original track result and the right ones are the stabilized result.
6.4 This figure shows the tracking result of the system in which the particles are biased by the mean-shift algorithm after the deterministic resampling step in particle filtering. The top frame shows the overall view of the tracking result at frame 1. The six frames below show the tracking result with a close-up view of the three players in the center region. Particles from the same target are represented by rectangles with the same color.
7.1 This figure shows an example of the blur from interlacing.
7.2 This figure shows the overall view of the camera at the initial stage. The close-up view of the tracking results will be shown in Figure 7.3 by zooming into the rectangular region.
7.3 This sequence is a close-up view of the rectangular region in Figure 7.2. The left column is the tracking results of the system in [OTdF+04]. The right column is the tracking results of our system. Different targets are labelled with rectangles of different colors.
7.4 This figure shows the tracking result of the system that only has the camera motion compensated and the dynamics model changed, but without the mean-shift bias. The left column is the results of the system in which the deviation for the scale sampling is 0.02; the right column is with the deviation of 0.01.
7.5 This figure shows the particle representation of each target. The left column shows the particles before the mean-shift bias; the middle column shows the particles after the mean-shift bias; the right column shows the particles after the resampling. Each particle is represented with a rectangle and particles that belong to the same target are of the same color.
7.6 This figure shows the trajectories of the three targets shown in Figure 7.3. (b)(c)(d)(e) show the key frames during the process; (b) is the starting frame and (e) is the ending frame.
7.7 This figure shows the tracking result of the period when partial occlusion happens. (a) is the overall view of the tracking result at frame 78. (b)(c)(d)(e) show the close-up view of the four targets in the region specified in (a).
7.8 This figure shows how the video frame is mapped onto the standard rink.

Acknowledgements

I would like to acknowledge those who gave me great help during my thesis work. Without their support, I could never have accomplished this work. My sincere gratitude goes to my supervisors, Dr. James Little and Dr. Nando de Freitas, for their passion and patience in providing me with extraordinary guidance and invaluable ideas. I also would like to thank Dr. Kevin Murphy and Dr. David Lowe for their enlightening suggestions.
M y appreciation extends to m y colleagues, X i a o j ing W u , K e n j i O k u m a , Fahong L i , and Yongning Zhu for their beneficial discussions. M y special thanks go to my family for their indispensable support. F ina l ly , I wish to acknowledge the funding support of the Nat iona l Science and Engineering Research Counc i l of Canada and G E O I D E , a Canad ian Network of Centres of Excellence. Y I Z H E N G C A I The University of British Columbia September 2005 ix Chapter 1 Introduction 1.1 Motivation Tracking mult iple targets, although has its root i n control theory, has been of broad interest i n many computer vision applications for decades as well . People or vehicle t racking for public surveillance systems, autonomous robots and vehicles steering, and sports game video annotat ion are a l l promising applications. Objects of interest for t racking can be of any k ind , including landmarks, vehicles, people, and various kinds of v i sua l features. T h e number of objects can vary over t ime. T h e backgrounds of the scenes to be analyzed can either be static or dynamic. A l l these complexities, which arise i n the real world, make the problem non-tr ivial . Tracking players i n sports game videos is one of the most pract ical appl i -cations for multi-target t racking because of the popular i ty of the games and the pervasive demand for automatic annotation and analysis of such videos for broad-casting and coaching purposes. It is also a challenging applicat ion because of its inherent properties including the varying number of players i n the scene, the dynam-ical ly changing background due to camera motion, and non-rigid target shapes due to various player poses. In the world of computer vision research, such applications have already been investigated on the soccer games [ V D P 0 3 , K C M 0 3 ] and hockey games [OTdF+04]. Recently, research on more higher level of mot ion analysis has also been carried out. Efros et a l . [ E B M M 0 3 ] focus on the character of the mot ion of soccer players (i.e., walk ing , running, standing) rather than their t ranslat ional mot ion and assume that the targets are stably tracked so that the targets are roughly centered wi th in the bounding box of the tracker. L i et a l . [Li04] a i m to represent and reason about hockey behaviors for an even higher level of mot ion understanding. Such applications a l l require accurate trajectory data extracted from lower level t racking systems as their input . Therefore, acquisit ion of accurate trajectory data over long video sequences becomes crucial . To generate such required trajectory data, one of the most cr i t ica l properties of the t racking system is its robustness i n mainta ining 1 the correct identity of different targets under various conditions. However, existing systems either have difficulties i n maintaining a variable number of targets or have problems in keeping the correct identity of targets when occlusion happens between different targets. Be ing aware of the weaknesses i n the exist ing systems, we a i m to bu i ld a t racking system that can robustly track mult iple targets and keep their identities over long video sequences captured by a monocular camera. 1 . 
2 Problem Statement A s an extension of the previous endeavors devoted to bui ld ing robust multi-target t racking systems, and a preparation for researches on high level act ion analysis, this thesis addresses the task of bui ld ing a robust t racking system which can han-dle a variable number of targets and mainta in the accurate identity of each target regardless of background clutter, camera mot ion and frequent mutual occlusions between targets. The system is required to have the abi l i ty to track targets over long sequences and be robust under perturbations including i l lumina t ion condi t ion changes, shadows of the targets and so on. 1.3 Challenges A l t h o u g h extensive research on various tracking systems has already conquered many cr i t ica l obstacles i n target local izat ion and filtering i n tracking, there s t i l l remains challenging problems i n our applicat ion. We present i n this section the following major challenges: target local izat ion, target representation, non-rigidi ty of the targets, t racking stabil izat ion, and mutual occlusion. 1.3.1 Target localization Object detection has been a major focus of research in computer vis ion. Learning-based object detection [ R B K 9 8 , VJ04] is one of the most effective and efficient methods. Camera motions, including panning, t i l t , and zooming, make the video frame coordinates variable w i t h respect to the coordinate system, which is the hockey r ink in this case. M u l t i p l e cameras are able to solve the problem of local iz ing targets i n the real world coordinate system [KCM04] so long as the relative locat ion and configuration of different cameras are known and well synchronized. However, because the source data we use for this task is extracted from only one monocular T V broadcasting camera, it is very difficult to derive even the relative positions among the targets in the real world. A l though 2D or 3D geometry of the scene is one possible way to assist solving the problem for moving monocular camera, it is s t i l l a non- t r iv ia l problem. 2 1.3.2 Target representation Because of the inherent weakness i n video standards of Na t iona l Television System Commit tee ( N T S C ) , the video frames extracted from the recorded video clips are highly unsaturated especially i n hockey games. Therefore, the representation of the targets i n the color space becomes quite noisy and unstable so that i t is not salient enough to support t racking as the only information. In addit ion, the resolution of the video frames is so low that each target is sometimes as smal l as twenty pixels high. A s a result, a significant number of pixels that belong to one target are probably contaminated by the noise from the hardware device because of the l imi ted number of the to ta l pixels of the target. A l so , as is well known, hockey is a fast game where players tend to move so quickly that the targets that move i n the opposite direction of the camera motion appear b lur ry i n the video frames. Therefore the representation of those blurred targets i n the feature space w i l l become much different from the ground t ru th . 1.3.3 Non-rigidity of the target H u m a n bodies are one group of objects that are very difficult to detect or track because of their non-rigidity. Research has been done to detect or track human bodies as a whole [POP98, V J S 0 5 , IM01] or separately on different parts of the body w i t h some geometric constraints [MSZ04, R F 0 3 ] . 
As is mentioned in the previous section, because of the low resolution of the input video clips, tracking parts of the human body separately in the same way as [RF03] does is almost infeasible in this application. In the systems that detect or track people as a whole, human bodies are more or less upright and straight so that most of the bodies fit into a rectangular bounding box with a fixed length ratio of the edges. However, this causes problems in our application because it is impossible to find such a bounding box that captures the whole body of a hockey player, whose pose changes drastically during the game. Examples are shown in Figure 1.1. The optimal way is to use boxes of varying shape to capture the targets. However, as we use the detection results from the boosting detectors, we straightforwardly inherit the fixed shape bounding box as well. This brings some disadvantages because, by using bounding boxes with a fixed edge ratio, it is possible that important pixels are excluded from the box and pixels that do not belong to the target are included. Hence, the representation of the targets in the feature space will be contaminated.

Figure 1.1: The figure shows that different players can have quite different shapes, so that it is impossible to use a bounding box with a fixed edge ratio.

1.3.4 Tracking stabilization

Taking the bounding box tracker as an example, it is very difficult to stably fit the boxes on the targets. The tracker tends to drift off the targets more or less over time. Examples can be found in Figure 1.2, which shows the tracking output of the system by [OTdF+04]; only one track is shown for clear representation. Having a stable tracker is important because, as long as we need the trajectories of the targets to predict their locations, it is important that the trajectories are reliable. Otherwise, the prediction would lead the tracker to a region far away from the true target location. In most cases, such errors are irrevocable, so that the tracker loses the target or tracks onto a different target. Although, intuitively, people can tell which of two trackers is more stable, there is no broadly accepted concrete standard to define the concept of stability. In addition, even if it were possible to define a standard for our particular application, it would be difficult to evaluate the stability of the tracking result and perform corrections, because the ground truth is unknown.

Figure 1.2: The above 4 frames (frames 1, 8, 15 and 25) show that the tracker of the target drifts significantly over time.

1.3.5 Mutual occlusion

Compared to the challenges described in the previous sections, mutual occlusion is a higher level vision problem which requires understanding of the scene rather than of the video frames only. Because the source data is captured by a single monocular camera, only the 2D projection of the real world is captured. Consequently, there are always occasions when different targets occlude each other in the video frames as they cross over each other. This could be avoided if the camera recorded the games from a top-down view, but that is not the case in most broadcast videos. Occlusions can be partial or even complete for a certain period of time. Most detectors fail in those situations because they treat input video frames as independent and static images without any context information. Therefore, it is impossible to detect a target that is completely occluded. Partial occlusion is also problematic for detectors that use templates or training sets containing only complete objects. Consequently, trackers that only make use of the visual information or the output of the detector to locate targets fail during occlusion. Although tracking systems may make use of the spatial and temporal continuity properties of the video clips to predict the location of the targets, a system still has difficulties in correctly tracking the targets if a complete occlusion results in the absence of the target in one spot and its reappearance somewhere else. Partially occluded targets are also difficult to track because of the insufficient visual support from those pixels that are not occluded. Therefore, without understanding of the context and the spatial configuration of the scene, it is impossible to solve this problem. Figure 1.3 shows occasions where partial and complete occlusions take place.

Figure 1.3: The figure shows the situation of complete occlusion and partial occlusion during the process. The rectangles show the regions where occlusions take place.

1.4 Contributions

Challenges have been investigated and described in the previous section. In order to conquer those challenges and build a robust multi-target tracking system that correctly tracks hockey players during the games, we propose four improvements on the previous systems. Firstly, a rectification technique is employed to compensate for camera motion by mapping the locations of the players from the video frame coordinate system to the time invariant standard hockey rink coordinate system. Secondly, a second order autoregression model is adopted as the dynamics model to predict the location of the targets. Thirdly, a global nearest neighbor data association technique is used to correctly associate boosting detections with the existing tracks. Finally, the mean-shift algorithm is embedded into the particle filter framework [AMGC01] to stabilize the trajectories of the targets. By assembling all these contributions, the new multi-target tracking system is capable of achieving the goal stated in Section 1.2.

1.5 Thesis Outline

The remainder of the thesis will be organized as follows. In the next chapter, a review of modern tracking systems will be presented with emphasis on multi-target tracking. Previous work on object representation, motion modelling and data association will also be covered. Chapter 3 gives a general explanation of Sequential Monte Carlo methods [DdFG01], also known as particle filters, because they form the basic skeleton of our tracking system. Extensions of the conventional particle filter to support multi-target tracking in our system will be introduced as well. Chapter 4 describes how to improve the modelling of the dynamics of each target so that the motion information can provide enough support for the tracker while visual support is absent or unreliable. A sketch of the previous rectification technique [OLL04], which will be used in our work to compensate for camera motion, will be shown, and the autoregressive dynamics model will be presented in detail as well. Chapter 5 addresses the importance of the data association issue in our system.
A global nearest neighbor data association, which is designed par t icular ly for our applica-t ion, w i l l be introduced and explained in detail . Chapter 6 demonstrates how the representation of the targets i n the color space is improved by a kernel function. Mean-shift , known as a gradient descent algori thm, w i l l be introduced and how i t is embedded into the particle filter framework w i l l be described i n detai l . In Chapter 7, experiment results of our new tracking system w i l l be presented. W e conclude our work by summariz ing the contributions we have made and analyze strengths and weaknesses i n our system. F ina l ly , some possible extensions w i l l be discussed for the future work. 7 Chapter 2 Previous Work A typica l visual t racking system can be divided into two parts [CRM03] : one is Tar-get Representation and Localization, which has i ts root i n computer vision; the other is Filtering and Data Association, which comes from control theory. In this chapter, we present a brief int roduct ion of previous work i n bo th aspects w i t h emphasis on their strength and weakness. 2.1 Target Representation and Localization 2.1.1 Representation Target representation can be categorized into two major classes. One is for a group of generic objects, such as human faces or bodies, computer monitors, motorcycles, and so on. T h e other is for one specific target including a specific person, car, toy, bui ld ing and so on. T h e targets can be images, concrete objects or even abstract feature points. y For the first class, the representation should capture the common features among a l l the objects i n the same group. A s i t is beyond the scope of this thesis, we suggest readers refer to other related works. [ Y K A 0 2 ] is one such reference that summarizes the approaches i n face detection. However, some approaches also apply to the representation of other objects. In our system, hockey players are detected by a boosting algori thm, which is based on the work by V i o l a et a l . [VJ04]. It treats the hockey players as generic objects rather than specific objects. However, as we assume the boost ing detections already exist as the input of our t racking system, we skip the details of the detection algori thm. For the t racking system, in order to dist inguish different targets and keep their identity over t ime, i t is necessary to treat each ind iv idua l as a specific object. Therefore, we focus on the representation of specific objects i n this thesis. For specific object representation, salience and uniqueness are the most i m -portant characteristics. One popular approach is to represent an object or an image 8 by a set of local features, which are unique and distinguish the object from oth-ers. T h e Harris corners [HS88], the Kanade-Lucas-Tomasi (KLT) features [TK91] and the Scale Invariant Feature Transform (S IFT) [Low04] are three typica l ex-amples. These features can be applied to specific object recognition and tracking, image matching and rectification, 3D structure reconstruction from 2D images for scene or mot ion analysis, and even Simultaneous Localization and Mapping ( S L A M ) [DNCC01] for robotics. In our work, K L T features are used for image rectification [OLL04] . A s is mentioned i n Section 1.3.1, the locations of the hockey players i n the input video are i n temporal ly variant image coordinates because of the camera motion. Therefore it is difficult to model and predict the mot ion of the players i n the image coordinate system. 
M a p p i n g the locations of players from the image coordinate system to the standard hockey r ink coordinates, which is invariant over t ime, is necessary for the model l ing of the dynamics of the players. The K L T feature tracker is used to match the selected K L T features extracted from consecutive video frames so that the mapping between frames can be computed according to the geometrical properties i n the 2D plane-to-plane projection. A detailed int roduct ion of the plane-to-plane mapping and the mechanism to map between the frames and hockey rinks w i l l be given i n Chapter 4. A local feature based approach is effective for recognizing targets that have sufficient salient loca l features because pose changes and self-occlusions may result i n the absence of a significant proport ion of irreplaceable local features. However, as we mentioned i n Section 1.3.2, the resolution of a l l the players i n our appl icat ion is quite low. Therefore, it is difficult to extract sufficient local features to overcome the negative effect brought by occlusion. In addi t ion, because of the non-rigidi ty of the targets, the relative geometry between local features can not be modelled w i t h a static template. Tak ing into account a l l these factors, a global representation of the target is needed for our applicat ion to conquer the obstacles. Color-based representation is one of the most successful approaches i n the literature. B o t h Perez [ P H V G 0 2 , V D P 0 3 ] and Comanic iu [CRM03] have successfully used color histograms to represent targets for t racking. The a lgor i thm represents a specified image region w i t h a color histogram, which can be extracted from any color space. The histogram is represented by a vector and the value of each entry of the vector reflects the cumulative count of each b in i n the histogram. Because the purpose is to track a specific target i n the scene, a reference histogram of the part icular target is created at the in i t ia l iza t ion step. A metric based on Bhat tacharyya coefficient [Kai67] is adopted to evaluate the s imi lar i ty between two color histograms. There are two major weaknesses i n this color-based approach. One is the loss of spatial information because the histogram approach only takes into account the value of pixels i n the color space rather than the locat ion information. The 9 other is the interference from the background clutter and other targets. If some background regions or other targets had similar histograms as the target of interest, the tracker would probably get confused. To solve the first problem, a mult i -part color histogram representation is used to encode the spatial information. Par t ic le filters set up an excellent framework for the color-based approach to solve the second problem. Proper model l ing of the dynamics of the targets can assist i n resolving the ambiguity that arises when background clutter and similar targets overlap w i t h the target of interest. 2.1.2 Localization T h e a im of local izat ion is to detect the existence of the objects of interest and give their locat ion for each of them. Tak ing the human face detection as an example, various approaches have been applied to detect generic faces. Template based ap-proaches and learning based approaches are the two ma in strategies. Stat is t ical learning approaches, such as Neural Networks, Support Vector Machine and Boost-ing, have become popular. 
T h e V i o l a and Jones [VJ04] approach, wh ich makes use of AdaBoost, is known as one of the most successful classifiers and can be applied to many different classes of objects. O k u m a et al . [ O T d F + 0 4 ] incorporate the boosting detector into a particle filter based t racking system. In the new framework, which they called Boosted Particle Filter ( B P F ) , the three major functions of the boost-ing detector i n the t racking system are in i t ia l iz ing trackers, adding new targets, and improving the proposal density function for the particle filter. T h e y represent the proposal d is t r ibut ion w i t h a mixture of Gaussian model , which combines the boosting detections and the corresponding tracks. A s a result, the data association problem arises when searching for the correspondence between the boost ing detec-tions and the exist ing tracks i f mult iple targets are i n the scene. In [OTdF+04], the nearest neighbor association method is used, which only finds the local op t imal rather than the global op t imal solution. We improve their work by using a global nearest neighbor association method to reach the global op t imal observation-to-track assignment. T h e in-depth introduct ion can be found i n Chapter 5. 2.2 Filtering and Data Association Fi l t e r ing and data association have their theoretical basis i n control theory, where the signals of the targets are mostly from radar, sonar, laser and so on. In our applicat ion, visual information is the observed signal from the targets. B o t h Ba r -Shalom et a l . [BSF88] and B l a c k m a n et a l . [BP99] summarized modern techniques i n filtering and data association in their books. In this section, we break down the topic into three parts: filtering, da ta association and mutual exclusion. The last part discusses problems that arise from visual-based multi-target t racking. 10 Figure 2.1: Graph ica l model representation of the relationship between an observed variable yt and a latent variable xt 2.2.1 F i l t e r i n g In the field of object tracking, Bayesian Sequential Estimation [AM79], which is also called Bayesian Filtering, is the most widely accepted framework. In this section, we present the fundamental theory of Bayesian filtering as a preparation for its sequential Monte Car lo implementat ion which is also known as Particle Filters [ D d F G O l ] . We adopt it as the basic skeleton of our t racking system. In addit ion, we introduce some previous work that use two major implementations of Bayesian filtering: Kalman Filters and Par t ic le Fi l ters . Bayesian sequential estimation D u r i n g the t racking process, the evolution of the targets can be interpreted as a state t ransi t ion process, where each state can be represented by a vector that includes the locat ion of the target in the reference coordinate system, the scale and velocity of the target, and any values that are required to represent the complete property of the target. Here, we denote the state variable as xt € M.nx, where t 6 N indicates the t ime step and nx is the dimension of the state variable. In the real wor ld , the state variables are always difficult to observe or measure directly. For example, the a i m of a t racking system is to estimate the ground t ru th (or the state variables) of the targets through the output data of the observation or measuring devices. Therefore, such state variables are called hidden variables or latent variables. 
The observations of the hidden variables are denoted as y_t ∈ R^{n_y}, where n_y is the dimension of the observed variable. In most cases, x_t ∈ R^{n_x} and y_t ∈ R^{n_y} are in different spaces, therefore n_x and n_y are not necessarily the same. The relation between the two variables can be represented by a generative model, and the corresponding graphical model is shown in Figure 2.1. The directed graph structure shows that the observed variable only depends on its corresponding latent variable, and the conditional probability lying on the edge can be represented as p(y_t | x_t).

The Hidden Markov Model (HMM) [AM79] is known to be an appropriate model for sequential data (e.g., the tracking data). It links the data of each time step, which can be represented by a single node structure as is shown in Figure 2.1, as a chain to represent the dependence relationship between the nodes. The graphical model to represent such a chain structure is shown in Figure 2.2; x_0 is the initial state to start the whole process. In an HMM, the latent variables are discrete, so that the transition probability on the arrows between latent variables is an M x M matrix, if the number of possible values is M. However, in most tracking applications, the locations of the targets are often represented in a continuous 2D or 3D space. Therefore, the State Space Model (SSM) [AM79], which generalizes HMMs to the continuous space but shares the same graphical model structure as HMMs, is used to model the tracking process in our application.

Figure 2.2: Graphical model representation of the Hidden Markov Model, where x_0 is the initial state.

The goal of Bayesian sequential estimation is to estimate the posterior distribution p(x_{0:t} | y_{0:t}), where x_{0:t} = (x_0, ..., x_t) and y_{0:t} = (y_0, ..., y_t), or its marginal p(x_t | y_{0:t}). The transition between two consecutive latent variables is given by Equation 2.1 and the observation model is given by Equation 2.2:

x_t = f_t(x_{t-1}, v_{t-1}),    (2.1)
y_t = h_t(x_t, u_t),    (2.2)

where f_t : R^{n_x} x R^{n_v} -> R^{n_x} and h_t : R^{n_x} x R^{n_u} -> R^{n_y} can either be linear or nonlinear, v_t and u_t are sequences of i.i.d. process noise and measurement noise respectively, and n_v and n_u are the dimensions of the noise vectors.

The marginal posterior distribution can be computed by the following two step recursion, based on the Markov assumptions (for a first order Markov chain, x_t ⊥ y_{0:t-1} | x_{t-1} and y_t ⊥ y_{0:t-1} | x_t, which means the current state is independent of the past observations given the previous state, and the current observation only depends on the current latent state):

prediction step:  p(x_t | y_{0:t-1}) = ∫ p(x_t | x_{t-1}) p(x_{t-1} | y_{0:t-1}) dx_{t-1}
filtering step:   p(x_t | y_{0:t}) = p(y_t | x_t) p(x_t | y_{0:t-1}) / ∫ p(y_t | x_t) p(x_t | y_{0:t-1}) dx_t    (2.3)

where the recursion is initialized by the prior distribution p(x_0 | y_0) = p(x_0), the dynamics model p(x_t | x_{t-1}) is specified by Equation 2.1, and the likelihood model p(y_t | x_t) is specified by Equation 2.2.

Kalman Filter

The Kalman filter [BSF88] is one of the solutions to the Bayesian sequential estimation if both the function f(·) in Equation 2.1 and h(·) in Equation 2.2 are linear and Gaussian. It is widely used in visual tracking [GH96, KCM03] and robotics [Kel94]. In order to handle the nonlinear and non-Gaussian situations, extensions have been made to the standard Kalman filter. The Extended Kalman Filter (EKF) [AM79] is one of the extensions, which uses a first order Taylor series expansion of the nonlinear functions in Equation 2.1 and Equation 2.2 for the approximation. The EKF has been widely used in the field of robotics in the last decade [DNDW+00], especially in the sub-field of SLAM [SSC90]. The Unscented Kalman Filter (UKF) [vdMAdFW00] has even higher accuracy. It approximates the posterior p(x_t | y_{1:t-1}) directly instead of approximating the nonlinear functions f(·) and h(·). In particular, it uses a deterministically selected sample set and passes it through the true nonlinear function to capture the true distribution. Because of its capability of handling nonlinear models and its efficiency in computation, it has been successfully used in tracking [CHR02].

Particle Filter

The particle filter gained its popularity because of its ability to handle highly nonlinear and non-Gaussian models in Bayesian filtering with a clear and neat numerical approximation. The key idea is to approximate the posterior distribution with a set of randomly sampled particles that have weights associated to them. The set of particles can be represented as {x_t^{(i)}, w_t^{(i)}}_{i=1}^{N_s}, where x_t^{(i)} is the supporting point, w_t^{(i)} is the weight, and N_s is the number of sample points. The posterior distribution is approximated by iteratively propagating the particles with the proposal distribution and updating the weights. The details of particle filtering are covered in Chapter 3.
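To make the recursion in Equation 2.3 concrete, the following is a minimal sketch of its linear-Gaussian special case, the Kalman filter discussed above, applied to a 1D constant-velocity target. The model matrices and noise levels are illustrative assumptions, not values taken from the thesis.

```python
import numpy as np

# Linear-Gaussian state space model (illustrative values only):
#   x_t = F x_{t-1} + v_{t-1},  v ~ N(0, Q)     (cf. Equation 2.1)
#   y_t = H x_t     + u_t,      u ~ N(0, R)     (cf. Equation 2.2)
F = np.array([[1.0, 1.0],      # constant-velocity dynamics: position += velocity
              [0.0, 1.0]])
H = np.array([[1.0, 0.0]])     # only the position is observed
Q = 0.01 * np.eye(2)           # process noise covariance
R = np.array([[0.25]])         # measurement noise covariance

def kalman_step(mean, cov, y):
    """One prediction + filtering step of the Bayesian recursion (Equation 2.3)."""
    # Prediction step: p(x_t | y_{0:t-1}) = N(mean_pred, cov_pred)
    mean_pred = F @ mean
    cov_pred = F @ cov @ F.T + Q
    # Filtering step: condition on the new observation y_t
    innovation = y - H @ mean_pred
    S = H @ cov_pred @ H.T + R              # innovation covariance
    K = cov_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    mean_post = mean_pred + K @ innovation
    cov_post = (np.eye(2) - K @ H) @ cov_pred
    return mean_post, cov_post

# Usage: filter a short synthetic observation sequence.
mean, cov = np.zeros(2), np.eye(2)
for y in [0.9, 2.1, 2.8, 4.2]:
    mean, cov = kalman_step(mean, cov, np.array([y]))
print(mean)   # posterior estimate of [position, velocity]
```

When the dynamics and likelihood models are nonlinear and non-Gaussian, as in our application, this closed-form update is no longer available, which is what motivates the particle filter.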
T h e E K F has been widely used i n the field of robotics i n the last decade [ D N D W + 0 0 ] , especially i n the sub-field of S L A M [SSC90]. The Unscented Kalman Filter ( U K F ) [vdMAdFWOO] has even higher accuracy. It approximates the posterior p(xt\yi;t-i) directly instead of approximat ing the nonlinear functions /(•) and g(-). In partic-ular, it uses a determinist ical ly selected sample set and pass i t through the true nonlinear function to capture the true dis t r ibut ion. Because of its capabi l i ty of han-dl ing nonlinear models and its efficiency i n computat ion, i t has been successfully used i n tracking [CHR02] . Particle Fi l ter The particle filter gained its popular i ty because of its abi l i ty to handle highly nonlin-ear and non-Gaussian models i n Bayesian filtering w i t h a clear and neat numerical approximation. The key idea is to approximate the posterior d is t r ibut ion w i t h a set of randomly sampled particles that have weights associated to them. The set of particles can be represented as {x^,™^}^, where x\^ is the support ing point, (i) u>l is the weight, Ns is the number of sample points. T h e posterior d is t r ibut ion is approximated by i teratively propagating the particles w i t h the proposal d is t r ibut ion and updat ing the weights. The details of particle filtering are covered i n Chapter 3. 2 First order Markov chain has the following independency assumption Xt-LyO:t-l\Xt-l yt±-yo-.t-Axt which means the current state is independent of the past observations given the previous state and the current observation only depends on the current latent state. 13 Nowadays, particle filtering has been widely used i n computer vis ion and robotics. It was first introduced to visual t racking by Isard and Blake i n [IB98], where i t was known as the C O N D E N S A T I O N method. Perez et a l . [ H L C P 0 3 , V D P 0 3 ] extended the particle filter framework to track mult iple targets. O k u m a et al . [ O T d F + 0 4 ] further extended i t [VDP03] by incorporat ing a boosting detector [VJ04] into the particle filter in i t ia l iza t ion and the proposal d is t r ibut ion. B o t h [VDP03] and [OTdF+04] used a Mixture of Particle Filters ( M P F ) as the ma in framework to mainta in mult i-modali ty. The key contr ibut ion of M P F is to represent the overall posterior d is t r ibut ion w i t h a M-componen t mixture model where 2~3m=i ^ " i , * ~ ^, M is the number of targets or particle clouds. pm represents the dis t r ibut ion of each target and evolves independently over t ime. T h e only in -teraction between targets is through 7 r m i t . Par t ic le clouds are allowed to merge or split if different targets have significant overlap or overlapping targets split into two. Part icles can be associated to different targets during merging or spl i t t ing while the overall d is t r ibut ion s t i l l remains the same. However, when new targets come into the scene, the M P F is not able to associate particles to them because the target in i t ia l iza t ion only takes place at the first step. O k u m a et a l . [OTdF+04] adopted the framework of M P F and improved it by using boosting detection to ini t ial ize tar-gets. However, because the total number of particles i n the M P F framework is fixed, part of the particles from each target are allocated to the new targets. A s a result, i f more and more new targets came into scene, the particles of each target would decrease and not be able to properly approximate the dis t r ibut ion. 
Another major disadvantage of the M P F framework is that it can not handle a variable number of targets and mainta in their identities at the same time, because after the merge and split , there must be at least one target that loses its identity. Therefore, we track each target w i t h a set of particles independently instead of the M P F framework i n our system. A global mechanism is used to handle the b i r th and death of targets and the data association issue. 2.2.2 Data association D a t a association arises from the task of multi-target t racking. T h e a im of data asso-ciat ion is to find the correct correspondence among mult iple observations {ytn)m=i and the existing tracks {x™}^=l. M and N can be different because there might be outliers or mis-detections. Standard Bayesian filtering does not handle observation-to-track assignments. However, the color-based particle filter tracker [VDP03] solves the problem impl ic i t ly because, according to Equa t ion 3.9, the weight of each par-ticle is updated only by evaluating the l ikel ihood of color histogram at the locat ion M (2.4) m = l 14 specified by the particle. D a t a association becomes important i n our appl icat ion again because we adopted the idea of boosted particle filter [ O T d F + 0 4 ] . B P F uses a mixture of Gaussian that combines the boosting detections w i t h the existing tracks to improve the proposal d is t r ibut ion shown in Equa t ion 3.10. A s there are mult iple boosting detections and mult iple tracks at each t ime step, the association problem arises again. G a t i n g is one of the basic techniques to eliminate very unl ikely observation-to-track pairs. It defines a certain range around each existing track so that ob-servations that fall outside the ranges are not considered to be associated to them. However, gating is s t i l l a prel iminary procedure. M o r e sophisticated data association techniques are required. Linear optimization D a t a association can been seen as an auction process where the tracks b id for the observations, or vice versa. E a c h observation-to-track pair w i l l be associated w i t h a corresponding cost or profit. The goal is to optimize a linear objective function under several linear constraints. Therefore, da ta association can be considered as a linear opt imizat ion problem. B lackman and Popo l i give a detailed in t roduct ion of various opt imizat ion solutions i n their book [BP99]. T h e auction a lgor i thm [BP99] is found to be the most efficient method so far. However, the constraints i t requires do not satisfy the requirements of our applicat ion. In Chapter 5, we model our applicat ion w i t h a set of new constraints and introduce the solution to the new linear opt imizat ion setting. We adopt this approach because of three reasons. F i rs t ly , i t can handle the b i r th and death of targets, which is cr i t ica l i n our appl icat ion. Secondly, the a lgor i thm is easy to understand and implement. F ina l ly , it requires very l i t t le computat ion and is extremely fast. T h e major drawback of this approach is its inabi l i ty to handle error association. A s it is not a probabil ist ic method, there is only one op t imal solution at each step. A l so , because the associations at each step are independent, error association can not be corrected at later stages. 
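To illustrate the gating plus linear optimization idea described above, the sketch below builds a gated cost matrix between predicted track positions and detections and solves the resulting assignment problem with the Hungarian algorithm (SciPy's linear_sum_assignment). The gate size, dummy cost and coordinates are assumed values for illustration only; the actual formulation and constraints used in our system are given in Chapter 5.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def gated_gnn(track_preds, detections, gate=30.0):
    """Global nearest neighbor association of detections to predicted tracks.

    track_preds : (N, 2) predicted target positions (e.g., on the rink)
    detections  : (M, 2) detected positions (e.g., boosting detections)
    gate        : maximum allowed distance; pairs beyond it are forbidden
    Returns {track_index: detection_index} for the globally optimal assignment.
    """
    # Pairwise Euclidean distances between predictions and detections.
    cost = np.linalg.norm(track_preds[:, None, :] - detections[None, :, :], axis=2)
    BIG = 1e6                                  # effectively forbids gated-out pairs
    cost = np.where(cost <= gate, cost, BIG)
    rows, cols = linear_sum_assignment(cost)   # globally optimal assignment
    return {r: c for r, c in zip(rows, cols) if cost[r, c] < BIG}

# Usage with toy numbers (purely illustrative):
tracks = np.array([[10.0, 12.0], [40.0, 45.0], [80.0, 10.0]])
dets = np.array([[11.0, 14.0], [43.0, 44.0], [200.0, 200.0]])
print(gated_gnn(tracks, dets))   # {0: 0, 1: 1}; detection 2 is left unassigned
```

Detections left unassigned by such a scheme can then be treated as candidates for new targets, and tracks that receive no detection simply go without an observation at that frame, which is how the birth of targets can be accommodated.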
Joint probability data association filter

The Joint Probability Data Association Filter (JPDAF) [BSF88] is one of the most widely accepted probabilistic approaches to the data association problem. It can either be combined with Kalman filters [KCM03] or with particle filters [SBFC01, MB00]. The state space of the JPDAF is a concatenation of all the states of the existing tracks: X_t = {x_{1,t}, ..., x_{N,t}}. The observed variable is also a concatenation of all measurements: Y_t = {y_{1,t}, ..., y_{M,t}}. By introducing an association indicator θ, the joint posterior distribution can be written as

P(X_t | Y_{1:t}) ∝ ∫ P(X_t, θ | Y_{1:t}) dθ
             ∝ ∫ P(Y_t | X_t, θ) P(X_t, θ | Y_{1:t-1}) dθ
             = ∫ P(Y_t | X_t, θ) P(θ | X_t, Y_{1:t-1}) P(X_t | Y_{1:t-1}) dθ    (2.5)

It should be noted that, in most cases, the association is discrete and finite. Therefore, the integral can be replaced with a sum over all possible associations. With this association indicator θ, all the terms, especially the likelihood term, can be easily factorized and further computed conditioned on the association setting. A problem arises if the total number of observations and tracks is large, because it results in a huge amount of enumeration and computation. It is even worse if it is combined with particle filtering, because sampling in a high dimensional space will make the efficiency of the sequential Monte Carlo approximation decrease drastically. Another problem with the JPDAF is that it does not handle the birth and death of targets explicitly, because the dimension of the joint state space is determined at the initialization step by the total number of targets. Therefore, it does not satisfy the requirement of our application.

Multiple hypothesis tracking

Multiple Hypothesis Tracking (MHT) is a deferred decision logic in which alternative data association hypotheses are formed whenever there are uncertainties about observation-to-track assignments. The algorithm is built on the assumption that uncertainties can be resolved at subsequent stages. MHT was first called Reid's algorithm. An efficient implementation of Reid's approach was later presented by Cox and Hingorani [CH96]. Multiple hypotheses are maintained at each step. New hypotheses are generated according to the previous hypotheses. Each hypothesis in the previous stage will generate one or more children. Therefore, a pruning technique is required so that only the N best hypotheses will be selected; otherwise, after several steps, the total number of hypotheses will easily increase enormously. The selection process is similar to the linear optimization mentioned in the previous section; however, instead of maintaining the optimal solution only, it finds the N best solutions. All the hypotheses are evaluated based on their scores, which is the sum of all the component observation-to-track associations:

S(H_j) = Σ_{T_i ∈ H_j} L_{T_i}    (2.6)

where L_{T_i} is the score of each observation-to-track assignment and the T_i are the tracks that belong to hypothesis H_j. It can be evaluated using the likelihood p(y_m | x_n), or any other variations.
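For intuition about hypothesis scoring in the spirit of Equation 2.6, the brute-force sketch below enumerates feasible observation-to-track assignments for a tiny two-track problem and keeps the N best hypotheses ranked by the sum of their per-track scores. The log-likelihood values and the miss penalty are made up for illustration; practical MHT implementations such as [CH96] generate and prune hypotheses far more efficiently.

```python
from itertools import permutations

# score[t][o]: assumed log-likelihood that observation o was generated by track t.
score = [[-1.0, -4.0, -7.0],
         [-5.0, -0.5, -6.0]]
MISS = -3.0          # assumed penalty when a track is left without an observation

def n_best_hypotheses(score, miss, n_best=3):
    """Rank joint assignments; each hypothesis score is a sum of track scores (Eq. 2.6)."""
    n_tracks, n_obs = len(score), len(score[0])
    # Each track picks one observation index or None; drawing a permutation
    # guarantees that no observation is used by more than one track.
    options = list(range(n_obs)) + [None] * n_tracks
    hypotheses = set(permutations(options, n_tracks))
    scored = []
    for assign in hypotheses:
        total = sum(score[t][o] if o is not None else miss
                    for t, o in enumerate(assign))
        scored.append((total, assign))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:n_best]

for total, assign in n_best_hypotheses(score, MISS):
    print(total, assign)   # best hypothesis: track 0 -> obs 0, track 1 -> obs 1
```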
T h i s implementat ion is much more efficient because the track-oriented ap-proach does not generate redundant tracks while the hypothesis oriented approach has many redundant observation-to-track assignments i n different hypothesis. How-ever, it is s t i l l not quite suitable for particle filter trackers. A s uncertainty increases, more and more possible potential tracks need to be created. E v e n i f there are at most 15 targets on the r ink i n our applicat ion, the to ta l number of tracks grows exponentially w i t h the increasing possible associations. A s each created new track needs a set of particles to support the evolution of track, the required storage space for the particles grows exponentially as well . A s a result, the computat ion for the propagation of a l l the particles becomes extremely expensive. Therefore, s imply combining conventional M H T w i t h particle filters is not pract ical . T h e experiments show that the linear opt imiza t ion approach satisfies our accuracy requirement for data association. Therefore, the computat ional ly more expensive M H T approach is not used. Monte Carlo method for data association Unl ike the three approaches mentioned above, Monte Car lo methods t ry to sample i n the association space rather than enumerating a l l possible assignments. Rao-Blackwellized Particle Filtering ( R B P F ) is one of the Monte Car lo methods that are widely used i n the literature [SVL04, SFH03] . The key idea is to sample for the da ta association while the remaining t racking part is computed i n an analy t ica l way, for example, the K a l m a n filter. A n in-depth explanation of R B P F can be found in [DdFMROO]. Because the Monte Car lo methods sample in the state space, the "curse of dimensionali ty" is the major problem if the to ta l number of targets is large. Perez et al . t r ied to solve the problem by developing three algorithms: Monte Carlo Joint Probability Data Association Filter ( M C - J P D A F ) , Sequential Sampling Par-ticle Filter ( S S P F ) and Independent Partition Particle Filter ( I P P F ) i n [VGP05] . T h e fundamental motivat ion underlying these three algorithms is to factorize the joint state space into ind iv idua l components. However, in visual t racking systems, detections at each stage can be consid-ered as mult iple signals generated by a single sensor. Therefore, the relationship between signals and sensors i n visual t racking is not as significant as it is i n robotics and other applications. Consequently, there w i l l not be significant gain to use a state space model to represent the association and use Monte Car lo method for approx-imat ion. In addi t ion, as we use the particle filter as basic framework for t racking, 17 sampling i n both the filtering part and the data association part w i l l decrease the efficiency and effectiveness of the particle filter. 2.2.3 Mutual exclusion M u t u a l exclusion is important i n multi-target t racking, especially when different tar-gets overlap and occlude each other. Some exclusion rules are the basic constraints i n da ta association. For example, one signal can hot be generated.by mult iple targets and one target can generate mult iple signals. W h e n targets occlude each other, mu-tua l exclusion plays a cr i t ica l role for the evaluation of l ikel ihood function. Sul l ivan et a l . [SBIM01] address the global op t imal solution to associate pixels to objects and background i n a image w i t h a Bays ian approach. 
It is important for mul t i -target t racking because it makes the l ikel ihood evaluation of the targets accurate. M a c C o r m i c k and Blake introduced a probabil ist ic exclusion principle for t racking mult iple objects i n [MBOO]. M a c C o r m i c k also introduced a Bayesian mult iple-blob tracker [IM01], which expl ic i t ly address the issue of mutual exclusion i n the 3D space. W u et al . [WYH03] expl ic i t ly model overlapping to handle occlusion. How-ever, a l l the approaches in [MBOO, I M 0 1 , W Y H 0 3 ] require explici t model l ing of the object shape. A s is mentioned i n Chapter 1, the non-rigidity of the hockey players in our appl icat ion makes the shape model l ing difficult. W h i l e the mutua l exclusion is difficult to achieve, mult iple cues are used to resolve the ambiguity dur ing occlusion. W u and Huang [WH01] introduced a co-inference tracking algori thm to ut i l ize bo th color and shape cues to track single target. Perez et a l . [PVB04] fuse color, mot ion and sound cues to track mult iple objects w i t h a particle filter. In our applicat ion, according to the law of inert ia, it is easier to use the physical model to predict the mot ion of hockey players. Therefore, better model l ing of the motion of the targets w i l l help resolve the ambiguity. T h e stabil ized tra-jectories of the targets are important for accurate model l ing of dynamics. Shan et al . [SWTO04] have embedded Mean-Shift [Che95, C M 0 2 ] into the particle filter for hand tracking. The i r results show that the mean-shift improves the efficiency of the sampling i n particle filter. The mean-shift a lgori thm is also used i n our system to achieve the s tabi l izat ion by biasing a l l particles to the true locat ion of the target. 2.3 Summary T h e survey i n the literature shows that there is no existing system that can satisfy a l l the requirements of our applicat ion. In order to achieve the goal of our robust t racking system, we propose four major extensions on the previous work. F i rs t ly , a better model l ing of target dynamics is used to improve the dynamics model of the targets. Secondly, a global nearest neighbor data association method is developed to associate boosting detections w i t h existing tracks for a better proposal d is t r ibut ion 18 in the particle filtering. Thirdly, the likelihood model is improved by a isotropic kernel brought by the mean-shift algorithm. Finally, mean shift is embedded into the particle filter framework to stabilize the tracking trajectory by biasing each particle. This is critical to both the dynamics model and likelihood model. 19 Chapter 3 Particle Filter for Multi-target Tracking A s is mentioned in the previous chapter, particle filtering performs well in Bayesian sequential est imation w i t h non-linear, non-Gaussian target dynamics and observa-t ion model . In our applicat ion, the fast mot ion of hockey players is highly non-linear and non-Gaussian. In addit ion, because we employ the observation model i n [ P H V G 0 2 , V D P 0 3 , OTdF+04] , which is an exponential function based o n the Bhat tacharyya coefficient between two color histograms, the l ikel ihood function is non-linear and non-Gaussian as well . Moreover, particle filtering has been success-fully applied to multi-target t racking [ V D P 0 3 , H L C P 0 3 ] . Therefore, the particle filter framework is the ideal model to be the basic skeleton of our t racking system. 
T h i s chapter covers the basic theory of generic particle filters and the boosted par-ticle filters. How the boosted particle filter is modified to fit in our multi-target t racking system w i l l also be introduced. 3.1 Particle Filter A s is shown i n Equa t ion 2.3, in order to continue the recursion of the Bayesian sequential estimation, the integration i n both the predict ion and filtering steps need to be computed at every step. If bo th the dynamics model and the l ikel ihood model are linear and Gaussian, i t becomes the well known analyt ical K a l m a n F i l t e r . T h e details of K a l m a n F i l t e r ing are beyond the scope of this thesis. A n in depth introduc-t ion can be found i n [BSF88]. For the analyt ical ly intractable cases, where bo th the dynamics model and l ikel ihood model are non-linear and non-Gaussian, approxima-t ion techniques are required. The most popular numerical approximat ion technique for this problem is the Sequential Monte Ca r lo ( S M C ) method [ D d F G O l ] . It is also known as Par t ic le F i l t e r ing [ A M G C 0 1 ] or C O N D E N S A T I O N method [IB98]. 20 3.1.1 Monte Carlo simulation The basic idea of Monte Car lo s imulat ion is to represent the whole posterior dis-t r ibu t ion by a set of random samples w i t h associated weights. For example, the posterior density function can be approximated as where Ns is the number of samples, {x\ ,wt Ji=i * s the s e * of particles drawn from the posterior d is t r ibut ion and their weights. Here the weights of a l l particles are equal to because the samples are direct ly drawn from the true dis t r ibut ion and normalized. Situations where the weights are not equal w i l l be introduced later. <5(-) denotes the Di rac delta function Therefore, w i t h the Monte Car lo simulation, any integral can be approximated by a discrete form Accord ing to the law of large numbers, when the number of samples approaches infinity, the approximations of the posterior d is t r ibut ion and the integral are equiv-alent to the true dis t r ibut ion and integral. Similar ly, any query estimation computed from the samples and weights approaches the op t imal Bayesian sequential estimate. 3.1.2 Sequential importance sampling A s it is often impossible to directly sample from the posterior density function, the particles are sampled from a known and easy-to-sample proposal d is t r ibut ion q(xt\yo-.t) instead. Then , Equa t ion 3.1 can be rewrit ten as (3.1) (3.2) E ( / ( x t ) ) = Jf(xt)p(xt\y0:t)dxt « ±- JT f(x?). (3.3) 3 (3.4) i=i where (i) _ p(4 \yo-t) (3.5) i(xt)\yo-.t) and 3 (3.6) 21 The new sampling strategy is called Importance Sampling and the weights are called Importance Weights. A s is shown i n Equa t ion 3.5, i f the proposal d is t r ibut ion is the same as the posterior, the weights reduce to the same as i n Equa t ion 3.1. B y replacing the integral of the predict ion step i n Equa t ion 2.3 w i t h the importance sampling, the one step ahead predict ion can be wri t ten as It should be noted that the proposal density function just samples the current state from the previous one rather than drawing samples for the whole trajectory of the states because we assume the current state only depends on the previous one. One step further, the current posterior dis t r ibut ion can be updated as Here the te rm fp(yt\xt)p(%t\yo:t-i)dxt is omit ted for convenience because i t is just a normalizat ion constant. 
Eventually, the estimation evolves by sequentially updating the weights of each supporting particle, and the procedure is named Sequential Importance Sampling (SIS).

3.1.3 Resampling

One inherent problem in the sequential importance sampling method is the degeneracy phenomenon. It is observed that after several iterations, all but one particle have close to zero weights, which means the majority of the computation is wasted on particles with negligible weights. Even worse, it has been proved that the problem cannot be avoided because the variance of the weights increases over time. Therefore, a resampling technique is required to ignore particles with very low weights and concentrate on the more promising ones, which play an important role in the approximation. One of the most widely used resampling schemes, which is also used in our work, is the Sampling-Importance Resampling (SIR) technique. The basic idea is to map a set of weighted particles {x_t^{(i)}, w_t^{(i)}}_{i=1}^{N} to a set of equally weighted samples {x_t^{(i)}, N^{-1}}_{i=1}^{N} through a discrete cumulative distribution of the original sample set. As a result, the original particles that have higher weights have more replicates, while those with small weights have far fewer or even no replicates. Details can be found in [Gor94]. The whole process of particle filtering is shown in Figure 3.1 [vdMAdFW00].

Figure 3.1: The SIR process starts at time t-1 with a set of equally weighted particles {x_{t-1}^{(i)}, N^{-1}}_{i=1}^{N}, which represent the distribution p(x_{t-1} | y_{0:t-2}). Importance weights are updated with the information at time t-1 to represent p(x_{t-1} | y_{0:t-1}) with {x_{t-1}^{(i)}, w_{t-1}^{(i)}}_{i=1}^{N}. Resampling is performed to make it an equally weighted particle set {x_{t-1}^{(i)}, N^{-1}}_{i=1}^{N} again that still represents the same distribution. Finally, particles are propagated to {x_t^{(i)}, N^{-1}}_{i=1}^{N} at time t by the proposal distribution to represent p(x_t | y_{0:t-1}), and the iterations go forward [vdMAdFW00].

3.2 Boosted Particle Filter

In the standard particle filter, there are two major problems that need to be solved: the initialization of the particle filter and the proper design of the importance proposal. The Boosted Particle Filter (BPF), which was first introduced by Okuma et al. [OTdF+04], properly solves the two problems by incorporating the AdaBoost detection algorithm by Viola et al. [VJ04]. Most tracking systems that use particle filters (e.g., [PHVG02, VDP03, HLCP03]) initialize the particle filters by hand labelling to set the value for P(x_0). The boosted particle filter uses AdaBoost detections for automatic initialization of the starting states.

Figure 3.2: The final proposal is in the form of a mixture of Gaussians which combines the prior and the boosting detection [OTdF+04].

A well designed importance proposal distribution should be as close as possible to the posterior distribution. The proposal distribution should be able to shift the particles to regions with high likelihood if the mode of the prior distribution is far away from the mode of the likelihood distribution. The boosted particle filter assumes a Gaussian distribution for the dynamics and superimposes Gaussian distributions centered at the AdaBoost detection modes as well.
Then , it combines bo th parts to construct a proposal d is t r ibut ion w i t h a mixture of Gaussian model: q*B(xt\xt-i,y0:t) = aqada(xt\yt) + (1 - a)p(xt\xt^), (3.10) where a is the parameter which is dynamical ly updated according to the overlap between the Gaussian dis t r ibut ion of boosting detection and the dynamics prior. If there is no overlap, a is set to zero so that the boosting detection has no effect on the proposal d is t r ibut ion. T h e value of a can be fine tuned through experiments. In our applicat ion, it is set to 0.5. The mixture of Gaussians proposal can be best described by Figure 3.2 [OTdF+04]. 3.3 Our Improvements Part ic le filtering has proved i n [IB98, H L C P 0 3 ] to be successful i n t racking single or mult iple targets w i t h nonlinear and non-Gaussian dynamics and l ikel ihood models. In the original work of boosted particle filtering, a mixture of particle filters [VDP03] 24 was adopted and augmented by the A d a B o o s t detection to automatical ly track a variable number of targets. However, as i n the M P F , the B P F only uses a fixed number of particles for a l l the targets. Th i s causes cr i t ica l problems when the to ta l number of targets keeps increasing because new tracks have to steal particles from existing tracks so that the number of remaining particles for each target decreases un t i l i t is insufficient to correctly approximate the dis t r ibut ion. Moreover, when occlusion happens, the B P F merges overlapping particle clouds and splits them when they fall apart. Obviously, such a mechanism w i l l lose the unique identi ty of each target after the occlusion. Therefore, our work only retains the basic idea of the original boosted particle filter for ind iv idua l target and maintains mul t ip le boosted particle filters to track mult iple targets. Therefore, i t solves both problems stated above. O n one hand, i n the new structure, new tracks are created by assigning a new particle set for each of them rather than stealing particles from exist ing targets. Therefore, the new structure can dynamical ly handle an increasing number of targets without affecting the approximat ion accuracy of the exist ing tracks. O n the other hand, because a l l the tracks are maintained and removed independently by each particle set, there w i l l be no merge or split dur ing occlusion. B o t h of the new features are cr i t ica l i n maintaining correct identities of a l l targets i n our multi-target t racking task. 25 Chapter 4 Target Dynamics Model l ing A s is mentioned i n Section 1.3.5, i n visual t racking systems, mutua l occlusion be-tween different targets is one of the major factors that cause the tracker to lose the identity of different targets. Accurate model l ing of the target dynamics can improve the predict ion of the locations of the targets while visual support is insufficient due to occlusion. However, because of the camera mot ion i n our applicat ion, the i m -age coordinate system changes over t ime. Therefore, target mot ion model l ing and predict ion i n the image coordinates are difficult. Zhao et a l . [ZN04] uses camera cal ibrat ion to provide a transformation from the predefined ground plane in the scene to the image. It helps transform the invariant 3D model of the target to the 2D model i n the image to assist human segmentation and further mot ion modell ing. W e adopt the approach by O k u m a et al . 
[OLL04] to map the locations of the tar-gets from the image coordinates to the standard r ink coordinate system which is consistent over t ime. Therefore, according to the physical law of inert ia, the mo-tions of the players i n hockey games are easier to predict w i t h a constant velocity autoregressive model . 4 . 1 Rectification In this section, we w i l l give a brief int roduct ion to the concept of homography and how it acts as an invertible deterministic mapping between the image coordinates and the standard r ink coordinates. It is important that homography is an invertible transformation because mapping back and forth between the r ink and the image coordinate system is frequent during the tracking process. 4.1.1 Homography Homography, which is also known as projectivity or collineation, is defined by Har t ley and Zisserman i n [HZOO] as follows: Definition. A projectivity is an invertible mapping h from P 2 to itself such that 26 three points x i , X 2 and X 3 lie on the same line i f and only i f / i ( x i ) , / i ( x 2 ) and / i ( x 3 ) also lie o n the same line. Images recorded by cameras are 2D projections of the real wor ld through lenses. For any plane i n the world, its images from a camera, which can pan, t i l t , zoom or even move, are exactly the projection described above because any line i n the wor ld is s t i l l a line i n the images as long as there is no noticeable lens dis tor t ion. A s the hockey players are always moving i n the plane formed by the hockey rink, they are i n the same plane bo th i n the real world and the image space. Therefore, it is possible to project their locations between the two planes. The mathematical way to accomplish such projection is possible because of the the following theorem: Theorem. A mapping h: P 2 —> P 2 is a projectivity if and only if there exists a non-singular 3 x 3 matrix H such that for any point in P 2 represented by a vector x it is true that / i (x) = H x . Here, the vector x can be wri t ten as (x, y, w)T, which is a homogenous repre-sentation of a point i n the plane. The corresponding inhomogeneous representation of the point i n Eucl idean 2 D space is wr i t t en as (x/w,y/w)T, where x/w and y/w are the x- and y-coordinates of the point. Accord ing to the theorem, given the coordi-nates of a pair of matching points {(x, y), (x', y')}, i n two planes, the transformation can be wr i t ten as x ' = H x (4.1) where x ' , x and H are defined as: / x' / X \ / hu hn hi3 \ x ' = y' X = y H = h23 1 ) V 1 J h3i h32 h33 1 It needs to be noted that for a l l point pairs { x ^ , X j } from the two planes, Equa t ion 4.1 holds for the same mat r ix H . In order to compute H , at least three points are needed because there are 9 unknown entries. However, the homography between two video frames can be defined w i t h less parameters i f the camera rotates about its opt ical center, which is the case i n our applicat ion. The camera at each t ime step can be parameterized by 3 rotat ion angles 6 = {61,62,63] w i t h focal length / . Then , the pairwise homography H j j from frame j to frame i can be parameterized by the parameters of the camera in the two time steps, which is 8 i n a l l . Detai ls of how to form the homography w i t h the 8 parameters can be found i n [BL03]. W i t h this technique, it is possible to bui ld a whole panorama for the scene, including the audience, from a sequence of video frames. It allows many other techniques to assist i n improving the t racking result. 
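As a small illustration of the rotation-only case, the pairwise homography between two frames can be assembled from the camera parameters roughly as follows; the Euler-angle convention, the simple diagonal calibration matrix and all numbers are our own assumptions rather than the exact construction in [BL03]:

import numpy as np

def rotation(theta):
    """Rotation matrix from three camera angles (an assumed Euler convention)."""
    ax, ay, az = theta
    rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
    ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
    rz = np.array([[np.cos(az), -np.sin(az), 0], [np.sin(az), np.cos(az), 0], [0, 0, 1]])
    return rz @ ry @ rx

def pairwise_homography(f_i, theta_i, f_j, theta_j):
    """Homography H_ij mapping frame j to frame i for a camera that only rotates
    about its optical center: H_ij = K_i R_i R_j^T K_j^(-1)."""
    k_i = np.diag([f_i, f_i, 1.0])
    k_j = np.diag([f_j, f_j, 1.0])
    return k_i @ rotation(theta_i) @ rotation(theta_j).T @ np.linalg.inv(k_j)

H = pairwise_homography(1000.0, (0.0, 0.02, 0.0), 1000.0, (0.0, 0.0, 0.0))
x = H @ np.array([320.0, 240.0, 1.0])   # x' = Hx, as in Equation 4.1
print(x[:2] / x[2])                     # inhomogeneous coordinates of the mapped point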
For example, background subtract ion can be performed to eliminate interference from background clutter. 27 In [OLL04] , a set of point pairs are labelled to compute the in i t i a l video frame to r ink homography HQV. After the ini t ia l izat ion, the frame to r ink homography at each step is automatical ly computed i n a recursive way where H{L\ t is the transformation mat r ix from the previous frame to the current one. Different from the ini t ia l izat ion, the point pairs are automatical ly selected by using the K L T feature tracker to match the extracted K L T features from consecutive video frames. In order to improve the accuracy of the computed homography, a template-based correction step is added to further fit the predefined key points w i t h the standard r ink model . Therefore, the a lgori thm can automatical ly and accurately map a long sequence of video frames to the r ink. T h e whole process is explained i n detail i n [OLL04] . Not ice that this is an online algori thm, which perfectly satisfies the online requirement of the particle filter tracker. Another important property of the ho-mography is that i t is an invertible transformation, which makes it easy to map the points i n the images and r ink in bo th directions i n the following way w i t h the same transformation mat r ix H+R. where x £ and x £ are the points i n the r ink and video frames respectively. Figure 4.1 shows how the video frames are mapped to the standard r ink w i t h the homography. 4.1.2 Multi-layer state space model W i t h the frame to r ink homography, it is possible to perform the particle filter tracker only i n the r ink coordinate system. However, the locations of the targets i n the image coordinates are s t i l l required because, while evaluating the l ikel ihood function i n Equa t ion 3.9, the locations of the particles in the image coordinates are s t i l l needed to specify the regions where the color histograms are extracted. There-fore, one more layer, which represents the hidden states i n the image coordinates, is added into the original state space model. The graphical model of the new state space model is shown in Figure 4.2. Accord ing to the graphical model , there are two layers of latent variables. O n the top level, the variables contain locations of the targets i n the r ink coordinates. T h e y are the true state variables i n the particle filtering process. The middle layer states contain the locations of the targets in the image coordinates. The locations in the states between the top two layers have deterministic mapping function which is the same as Equa t ion 4.4. It should be noted that a l l state variables i n the middle - l (4.3) frame to rink: x £ = H{rx( rink to frame: x f = H(T (4.4) 28 Figure 4.1: Th i s shows how video frames are mapped to the r ink. The top left image is the input video frame; the top right is the standard r ink model; the bo t tom image shows how the image looks after being mapped onto the r ink. layer do not have any arrows between them, which means they are independent of each other. The state dynamics only applies to the top level state variables. The middle level acts as a communicat ion tunnel to bridge the true latent variable and the observed variables, which are i n the bo t tom layer. In our applicat ion, there are two kinds of observations. One is the boosting detection and the other is the color histogram l ikel ihood. 
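Since the middle layer is a purely deterministic mapping, it reduces to two small helpers in practice. The sketch below (ours, with a placeholder homography) shows the two directions of Equation 4.4, which are used to push detections onto the rink and to bring particles back into the frame for likelihood evaluation:

import numpy as np

def frame_to_rink(H_frame_to_rink, p_frame):
    """Map a frame point to rink coordinates (forward direction of Equation 4.4)."""
    v = H_frame_to_rink @ np.array([p_frame[0], p_frame[1], 1.0])
    return v[:2] / v[2]

def rink_to_frame(H_frame_to_rink, p_rink):
    """Map a rink point back into the frame with the inverse homography."""
    v = np.linalg.inv(H_frame_to_rink) @ np.array([p_rink[0], p_rink[1], 1.0])
    return v[:2] / v[2]

# Placeholder homography; in the system it comes from the recursive estimation above.
H = np.array([[0.5, 0.0, 10.0],
              [0.0, 0.5, 5.0],
              [0.0, 0.0, 1.0]])
p_rink = frame_to_rink(H, (400.0, 300.0))
print(p_rink, rink_to_frame(H, p_rink))   # the round trip recovers (400, 300)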
A t each t ime step, boosting detections provide a set of detected targets w i th their locations in the video frame and their scales. Therefore, each target can be represented by a bounding box w i t h a predefined edge length ratio. A s is shown i n the bo t tom image i n Figure 4.1, players are significantly distorted while only the feet of the players are on the right posit ion relative to the r ink because homography only guarantees accurate mapping between points from two planes. Here, we assume that the bounding box fits the target t ightly. Therefore, we represent the locat ion of each player i n the image, x£ v wi th the middle point on the bo t tom edge of the bounding box as is shown i n Figure 4.3. O n l y these points w i l l be mapped to the r ink. A t the in i t ia l iza t ion step, these transformed locations are used to create particle filter trackers. A t later stages, they are used for the da ta association between boosting detections and the tracks or the creation of incoming new tracks. There are also situations that require mapping 2 9 Figure 4.2: Graph ica l model representation of the three-layer structure of the state space model. locations on the r ink back to the image coordinates. W h e n the l ikel ihood function needs to be evaluated, the positions of particles i n the r ink are mapped back to the video frames. T h e y are assumed to be the positions of the players. Therefore, w i t h this locat ion value and the scale, a rectangular bounding box can be created. It is just the inverse of the operations w i t h which the boosting detections are mapped to the rinks. Color histograms are extracted from the regions specified by the bounding boxes and compared w i t h the reference color histograms, which are created at the in i t ia l iza t ion step. 4.2 Autoregressive Dynamics Model A n autoregressive process is a t ime series model l ing strategy which takes into ac-count the historical da ta to predict the current state value. In this model , the current state x 4 only depends on the previous states w i t h a deterministic mapping function and a stochastic disturbance. If the deterministic mapping is linear and the current state only depends on the previous p states, the model can be wri t ten as p x t = ^2 OiXt-i + Zt (4.5) i = i where {o:i}f_ 1 are the autoregression coefficients, and Zt is the stochastic disturbance which is normal ly a white noise process. The autoregression coefficients can either be learned or predefined i n an ad-hoc way. N o r t h et a l . [NBIROO] t r ied to determine the coefficients by learning the 30 Figure 4.3: A l l the black rectangles are the boosting detections. B lack dots on the bo t tom edge of each rectangle represent the positions of the players on the r ink. patterns of different mot ion classes through a t ra ining process where perfect t racking could be achieved. O n the other hand, Perez et a l . [PHVG02] used a second order constant velocity model to predefine the coefficients. However, this only works if the camera is stationary. In the work by O k u m a et a l . [ O T d F + 0 4 ] , the camera mot ion made it impossible to separate the motion of players. Therefore, it is difficult to predict the locat ion of players w i t h the constant velocity autoregressive model in the image coordinates. B y mapping the locations of players from the video frame coordinate system to the standard r ink coordinates, the motions of the players on the r ink are separated from the camera motion. 
Therefore, no matter how the camera pans, t i l ts or zooms, the motions of the players on the r ink coordinate system w i l l not be affected. In hockey games, because of the effect of inertia, a constant velocity model is more suitable to model the mot ion of the players. It is best described by the following second order autoregressive model xt = Axt-1 + Bxt-2 + CAr(0,i:) (4.6) where {A, B, C} are the autoregression coefficients, A/ r (0, S ) is a Gaussian noise w i t h zero mean and standard deviat ion of 1. Unl ike [ P H V G 0 2 , O T d F + 0 4 ] , state x< i n our model only contains the locat ion of the player i n the x— and y—coordinates i n the r ink rather than including the scale of player as well because experiments show that incorrect sampling of player scales is more difficult to undo at later stages of t racking in this constant velocity model . Therefore, we use mean shift, which w i l l be introduced i n Chapter 6, instead to find the most appropriate scale at each step. Therefore, the states are defined as 31 x t = (xt,yt)T and the coefficients are defined as -(;S)*--(".,-O1)"-(JS) <«> where A and £? are standard for all second order constant velocity model, C can be fine tuned through experiment because it defines the standard deviation of the noise. In our application, it is defined to allow 3-pixel deviation on both x— and y—coordinates of the rink. 32 Chapter 5 Data Association Chapter 2 gives a general introduct ion to data association and compares several typica l approaches i n the literature. In this chapter, we discuss i n detai l the gating technique and the solution to the global nearest neighbor ( G N N ) approach that solves the association problem in the applicat ion we investigate. In our applicat ion, the data association does not take place i n a conventional manner because there are two different kinds of observations i n the framework. One is the color histogram and the other is the boosting detections. In a standard Bayesian filtering framework, da ta association is performed to pair the observations and tracks for the evaluation of the l ikel ihood function p(y t" |x") . The particle filter framework i n our appl icat ion handles this level of data association in an impl ic i t way because color histograms are extracted from the regions specified by particles. It means observations and tracks are no longer independent because observations are condit ioned on the particles. However, i t is s t i l l necessary to determine the association of the pixels w i t h i n the specified region because the region might include pixels that do not belong to the corresponding target. However, i n practice, as we do not have any concrete shape model of the targets, it is impossible to accomplish this level of data association. Because the boosting detections are used to improve the proposal d is t r ibut ion i n particle filters as i n shown in Equa t ion 3.10, we perform data association at this level to assign boosting detections to the existing tracks. Therefore, the proposal d is t r ibut ion can lead the sampling to more promising regions. 5.1 Gating G a t i n g is a technique to eliminate very unlikely observation-to-track assignments. A t each t ime step dur ing the tracking process, a certain region, also named as gate, is defined around the predicted posit ion of each target. Observations that fall w i th in the region are considered as candidates to update the track. Figure 5.1 depicts the concept of gating. 
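For concreteness, one prediction step of this dynamics model can be sketched as below. The diagonal forms A = 2I, B = -I are our reading of the standard constant velocity choice, and C = 3I matches the 3-pixel deviation stated above; all of this is an illustration rather than the exact matrices used in the system:

import numpy as np

rng = np.random.default_rng(1)

A = 2.0 * np.eye(2)   # assumed constant-velocity coefficients: x_t = 2 x_{t-1} - x_{t-2} + noise
B = -1.0 * np.eye(2)
C = 3.0 * np.eye(2)   # scales unit-variance Gaussian noise to a 3-pixel deviation

def predict(x_prev, x_prev2):
    """One step of the second order autoregressive model (Equation 4.6), in rink coordinates."""
    return A @ x_prev + B @ x_prev2 + C @ rng.standard_normal(2)

x_t2 = np.array([50.0, 20.0])   # position two steps ago
x_t1 = np.array([52.0, 21.0])   # previous position, implying a velocity of (2, 1) per frame
print(predict(x_t1, x_t2))      # roughly (54, 22) plus noise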
A s is shown i n the figure, it is very l ikely that mult iple 33 Gate \ 03 • Gate Gate Figure 5.1: T I , T 2 , T 3 are predicted target positions; O l , 0 2 , 0 3 , 0 4 are observa-tions. 0 2 is assigned to T I ; 0 1 is a candidate of T I , T 2 and T 3 ; 0 4 is a candidate of T 2 and T 3 ; 0 3 is not assigned to any existing tracks so that i t is a candidate of being a new target. observations fall i n one gate or one observation falls i n mult iple gates. In addi t ion, there are observations that do not belong to any gate. It remains to be decided later if they are outliers or new tracks. Therefore, gating is just a pre l iminary process to reduce the computat ion burden for further data association processing. The sizes of the gates depend on the mot ion types of targets and the property of sensors. Because the camera is the only sensor for a l l targets, the effect from the sensor is homogeneous for a l l targets. Therefore, here we only consider the mot ion of targets. Targets that have high speed must have larger potential gates than slower targets. If the t racking were performed i n image coordinates, the sizes of the gates of the targets could depend on the relative distances between targets and the camera. For example, players that are closer to the camera apparently have higher speed than those far away even i f they are of the same speed i n the real world . A s a result, bo th target mot ion and camera motion affect the size of gate for each target, which makes the problem difficult. However, because of the three-layer state space model i n our applicat ion, particle filtering is performed i n the hockey r ink coordinate system after compensating for camera mot ion. Consequently, the sizes of gates only depend on the mot ion of targets on the r ink regardless of their locations because their speeds are evaluated homogeneously over the whole r ink rather than being evaluated w i t h respect to the camera posi t ion. There are different shapes of gates including rectangles and ellipsoids. It is reasonable to use ellipsoids i n our appl icat ion to reflect the direction and magnitude of the velocity of target as is shown i n Figure 5.2. However, as the t racking results 34 Ellipsoid gates Figure 5.2: T l , T 2 are predicted positions of targets; V I , V 2 are vectors that indicate the velocities of the targets. at each step might drift, the direction and magnitude computed from the previous steps could be very noisy and unstable. In addi t ion, boosting detections also might be off the targets w i t h several pixels. A s a result, el l ipsoidal gates might eliminate too many- potent ia l observation-to-track pairs that are correct. Therefore, i n our applicat ion, we adopt the simplest circular gates which place equal deviat ion on al l directions. The radi i of the circles are determined by the m a x i m u m possible magnitude of target speed, which is consistent for a l l targets. 5.2 Linear Optimization In [ O T d F + 0 4 ] , only nearest neighbor association is used local ly for the association, which means each track picks up the nearest boosting detection for update. Figure 5.1 shows that one observation can be i n mult iple gates. A s a result, the same observation might be associated to mult iple targets for update, which violates the basic constraint of data association that one observation can only be generated by one target. The global nearest neighbor method is one of the approaches to solve this problem. It maintains the single op t imal solution at each step. 
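A minimal sketch of this circular gating test (ours; the radius and coordinates are made up) is given below; detections that fall in no gate are the candidates for new tracks:

import numpy as np

def gate(predicted_tracks, detections, radius):
    """For each predicted track position, return the indices of the detections
    inside its circular gate. Both inputs are (x, y) rink coordinates; the radius
    is the maximum plausible player displacement between frames."""
    candidates = []
    for t in predicted_tracks:
        d = np.linalg.norm(detections - t, axis=1)
        candidates.append(np.nonzero(d <= radius)[0].tolist())
    return candidates

tracks = np.array([[10.0, 10.0], [40.0, 25.0]])
dets = np.array([[12.0, 11.0], [45.0, 25.0], [90.0, 60.0]])
print(gate(tracks, dets, radius=8.0))   # [[0], [1]]; detection 2 would seed a new track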
T h e assignment problem can be best represented by an assignment mat r ix shown i n Table 5.1. E a c h entry i n the table is the cost or gain of pair ing the corresponding track and observation. Assignments that are forbidden by gating are denoted by x i n the corresponding entries. Observations that are forbidden by the gat ing to be associated to any track are considered as a new track. Such assignment problems stem from economic theory and auct ion theory as well . Examples are assigning personnel to jobs and assigning goods to bidders. The objective of such problems is to minimize the cost or maximize the gain subject to a set of constraints. G i v e n the assignment mat r ix shown i n Table 5.1, the objective is to find a set X = {x{j} that maximize or minimize the objective function C = 35 Table 5.1: Assignment mat r ix for the assignment problem shown i n Figure 5.1 Observations Tracks 0 1 0 2 0 3 0 4 T I a n 0-12 X X T 2 « 2 1 X X T 3 ^31 X X « 3 4 2 ^ = 1 2~3j=i aijxij subject to the following constrains: where are b inary indicators _ J 1 i f observation j is assigned to track i 1 0 otherwise Linear programming was in i t ia l ly used to solve this problem. Later on, it was found that the auct ion algori thm [BP99] is the most efficient method so far to reach the op t imal solution or sub-optimal one without any pract ical difference. In the auction algori thm, normally, trackers are goods and observations are bidders. T h e objective is to maximize the global gain to satisfy a l l bidders. N o matter whether the values of a l l entries i n the assignment mat r ix are for maximiz ing gain or min imiz ing cost, it is always easy to convert them to satisfy a maximiza t ion problem by adding negative signs to a l l of them. T h e auct ion a lgor i thm is an iterative process that consists of a b idding phase and an assignment phase. Iterations keep going unt i l a l l bidders are satisfied. The detailed algori thm on a square association matr ix , which has the same number of targets as observations, is available i n [BP99]. T h e extended version of auction algori thm [Ber91] is able to solve the rect-angular mat r ix problems w i t h loosened constraints E i L i * y >l,Vi 1 ' These constraints allow one track to have mult iple observations, which often happens i n practice. However, i n our application, it is very l ikely that some tracks do not have any observation due to the mis-detection of the boosting detector. Therefore, even if there are some observations wi th in the gate of that track, it is s t i l l possible that none of the observations belongs to the track. Hence, the constraints are of the following form: E £ i * « > o , V i 1 j 36 A s a matter of fact, because xij are binary, the second constraint i n Equa t ion 5.4 is always satisfied. T o simplify the problem, the assignment mat r ix only has existing tracks and the observations that fall i n at least one gate. Observations that are not i n any gate are used to create new tracks separately. A s a result, the problem is significantly simplified. In our applicat ion, the values of a l l the entries i n the assignment mat r ix are defined to be the distance between the observations and tracks. It is mentioned i n [BP99] that the value of each entry can include bo th kinematic and attr ibute information to fuse different types of sensor signals. Color is one important potential at t r ibute i n pur applicat ion. 
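To make the constraints concrete, the following is a simple greedy sketch of global nearest neighbor assignment on a gated distance matrix. It is not the auction algorithm of [BP99], only an approximation for illustration: entries forbidden by gating are set to infinity, each observation is given to at most one track, and a track may be left with no observation.

import numpy as np

def greedy_gnn(cost):
    """Greedy global nearest neighbor assignment on a cost (distance) matrix.
    cost[i, j] is the track-to-observation distance, with np.inf for pairs
    ruled out by gating."""
    cost = cost.copy()
    pairs = []
    while np.isfinite(cost).any():
        i, j = np.unravel_index(np.argmin(cost), cost.shape)
        pairs.append((i, j))
        cost[i, :] = np.inf   # this track has taken its observation
        cost[:, j] = np.inf   # this observation now belongs to exactly one track
    return pairs

cost = np.array([[2.0, 7.0, np.inf],
                 [6.0, 1.5, np.inf],
                 [np.inf, np.inf, 3.0]])
print(greedy_gnn(cost))   # [(1, 1), (0, 0), (2, 2)]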
However, because the reference color histogram of each target is created only at the first step and not updated adaptively over t ime, there are some t ime steps at which the current color histogram of track i is more similar to the reference color model of track j than the current color histogram of track j itself. It w i l l produce incorrect assignment scores especially when targets w i t h s imilar reference color model are close to each other or even overlap. Therefore, only the distance is used as the evaluation. G i v e n the new assignment mat r ix and the constraints, the solution to this problem becomes straightforward. The goal is to minimize the objective function and the op t imal solution is just a collection of observation-to-track pairs that have the min ima l entry value i n the corresponding column i n the assignment matr ix . (5.5) 37 Chapter 6 Mean-Shift Embedded Particle Filter The motivation of embedding the mean-shift algorithm into the particle filter frame-work of our tracking system is to stabilize the tracking result. It is important for the dynamic model of targets because stabilizing trajectories improves the accuracy of the computed velocity of targets, which is critical for improving the prediction of the location of targets. It is also important for the likelihood model because accurate prediction leads sampling to more promising areas so that the influence from background clutter and mutual occlusion will be reduced. As a result, the particles of one target, which are biased by mean-shift, will be more concentrated on the true location of the targets in the scene. In this chapter, we explain how the mean-shift algorithm works using the color features and how to embed it into our particle filtering process for robust tracking. 6.1 Color Model Color-based tracking is successful in tracking non-rigid objects with unique colors. It produces robust tracking results even if the targets change shape or are partially occluded by other objects. We adopted the color model in [PHVG02, OTdF+04] to represent the hockey players we try to track in our application. The model is originally introduced by Comaniciu et al. [CRMOO] for the mean-shift based non-rigid object tracking. In addition, because the color histogram based approach discards the spatial information of the targets, it might have problems when the background clutter has a similar representation to the targets of interest in the color space. Therefore, we adopt the multi-part color model from [PHVG02] with small variation to fit it into the kernel-based color model we use in our application. The color histogram is created in the Hue-Saturation-Value (HSV) color space because it is robust to the illumination condition changes. The H S V his-togram comprises N = Nh~xNs + Nv bins, which discretize the whole H S V space. 38 Nh, Ns, Nv are the number of bins i n each dimension of the H S V color space and are set to 10 equally in our experiment. T h e color vector of a p ixe l located at posi t ion k is denoted as y(k) = [h(k), s(k),v(k)]T and each entry is mapped to an appropriate b i n i n the corresponding dimension. The b in representation of the color vector is denoted as [b^k), bs(k), bv(k)]T. The b i n vector is further mapped to a b i n i n the histogram w i t h the following index function. b ( k ) = { bs(k)xNh + bh(k) i f s(k) > 0.1 and v{k) > 0.2 { bv(k) + NhxNs otherwise where b(k) is the index of pixel in the histogram. 
T h e saturation and value of a pixel is set to be above 0.1 and 0.2 respectively because the hue information is only reliable under that condit ion. P ixe ls that are too dark or unsaturated are classified only according to their value. W i t h this b in indexing function, the color histogram of the candidate region i ? ( x t ) around the locat ion x t at t ime t is denoted as Q ( x t ) = {q(n]xt)}n=lt...}N, where q(n-xt) = C E M x t ) # ) - n ] „ = i q(n; x ( ) = 1 where S is the Kronecker delta function, C is a normalizat ion constant, k is any pixel w i t h i n the region i ? ( x t ) , i?(xt) is a 2D region centered at locat ion x t , which can either be rectangular or ell ipsoidal. B y normaliz ing the color histogram, Q(xt) becomes a discrete probabil ist ic dis t r ibut ion. The color-based t racking searches for a candidate region whose histogram is the most s imilar to the one of a reference target model . T h e target model color histogram and target candidate histogram are denoted as follows: target model: Q* = q*(n; x 0 ) n = 1 ^ £ n = i q*(n; x 0 ) = 1 target candidate: Q ( x t ) = q(n; x t ) n = 1 N J2n=i l(n; x t ) = 1 where Q* is constructed at the in i t ia l iza t ion step. The Bhat tacharyya coefficient [Kai67] is adopted to represent the s imilar i ty between two sample histograms as N P(x 4 ) = p [ Q ( x t ) , Q*] = £ v /g (n ;x t )g*(n ;x 0 ) (6.4) 71=1 The distance between two histograms is represented as d(xt, xo) = Vl-p[Q(*t),Q*] (6.5) There are three major reasons to use Bhat tacharyya coefficient in the s imi-lar i ty function. Fi rs t ly , it has a straightforward geometric interpretation because i t is the cosine of two vectors. Secondly, d(xt) is a metric. F ina l ly , because it uses a 39 normalized discrete dis t r ibut ion as evaluation variable, it is invariant to the scale of the target. G i v e n the distance function, the l ikel ihood dis t r ibut ion can be computed by passing the distance into an exponential function: p (y t | x t ) oc e - A d 2 ( x * ' x ° ) (6.6) where A = 20 is determined through experiment. Mul t i -pa r t color model splits the region i ? ( x t ) into r parts so that R(xt) = 2~Zj=i Rjfat)- The color histogram of each part is constructed independently. In [ P H V G 0 2 ] , the l ikel ihood is formulated as • p ( y t | x t ) a e - x ^ = i d ? ( x t ' X o ) (6.7) where <i 2 (x t ,xo) is the distance between the two corresponding parts i n the target model and the target candidate. In our work, the mean-shift a lgor i thm w i l l be embedded into the particle filter framework to bias the particles according to the color feature. If the distances are evaluated on the part basis, the mean-shift algo-r i t h m is very l ikely to move them i n different directions, which breaks the integrity of the whole target. Therefore, i n our work, the color histograms of a l l parts are concatenated i n parallel to form a new color histogram and the distance is evaluated according to the integral histogram. A s a result, the concatenated histogram not only encodes the layout of a l l parts but also guarantees the integrity of the target candidate dur ing the mean-shift procedure. The l ikel ihood function is evaluated i n the same way as Equa t ion 6.6. Figure 6.1 shows the comparison of two mult i-part histograms of players i n two different uniforms. Here, the target region is d iv ided vert ical ly into two parts to represent the upper and lower body of the player. E a c h histogram is a concatenation of the upper and lower parts. 
Figure 6.2 shows how significantly the color histogram of the same target changes over t ime w i t h respect to the reference model extracted at the first frame. T h e comparison indicates that i t remains a challenging problem to find an op t imal model for the l ikel ihood evaluation. Adap t ive ly changing the reference color model w i t h the current color histogram of the target is one possible approach. However, as there is no background subtract ion i n our system, the bounding box of the tracker may include pixels from the background clutter and make the color model unstable. Drif t of the tracker and the mutua l occlusion w i l l also introduce undesired pixels from background or other targets, which make the color model stable. Therefore we give up the attempt to use any adaptive color model . Instead, we incorporate both the mean-shift based stabi l izat ion and strong dynamics model of the targets to assist the color histogram based l ikel ihood model and improve the overall t racking performance of our system. 40 (b) Color histogram of the player in (a) (d) Color histogram of the player in (c) Figure 6.1: Mul t i -pa r t color histogram of two players from two different teams. 6.2 M e a n - S h i f t Mean-shift is a nonparametric statist ical method which seeks the mode of a density d is t r ibut ion i n an iterative procedure. It was first generalized and analyzed by Cheng i n [Che95] and later developed by Comanic iu el al . i n [CM02]. It can be seen as a clustering technique, of which the fc-means clustering algori thm is a special case. It is also a mode seeking procedure on a surface created by a "shadow" kernel in the feature space. Therefore, it is popular and successful in many applications including image segmentation [CM02] and object tracking [ C R M 0 3 , Bra98, Col03]. Because of its abi l i ty to analyze a discrete feature space, it is a perfect tool to analyze color histogram features i n our applicat ion. The objective of the mean-shift a lgori thm is to i teratively compute an offset from the current location x to a new location x ' = x + A x according to the following relation A x = -> 5^- - x (6.8) E ^ - ( ^ ( | | ^ | | 2 ) where { a j } ^ 1 are points w i t h i n the region i ? (x ) around the current locat ion x , w(o,i) is the weight associated to each pixel ai, and k(x) is a kernel profile of ker-nel K that can be wri t ten i n terms of a profile function k : [0, oo) —> R such that K(x) = A;( | |x | | 2 ) . Accord ing to [Che95], the kernel profile k(x) should be nonnega-tive, nonincreasing, piecewise continuous, and k(r)dr < oo. Mean-shift is a mode seeking algori thm because the theory guarantees that 41 100 200 300 400 500 600 Frame 1 0 50 100 150 200 Color histogram of the labelled target in Frame 1 100 200 300 400 500 600 Frame 32 0 50 100 150 200 Color histogram of the labelled target in Frame 32 50 100 150 200 250 300 350 400 450 i BI JJ 100 200 300 400 Frame 67 0 50 100 150 200 Color histogram of the labelled target in Frame 67 Figure 6.2: This figure shows how significantly the color histogram of the same target changes over time. 42 the mean-shift offset A x at location x is in the opposite direction of the gradient direction of the convolution surface M C(x) = G(ai ~ x)w(oi) (6.9) i = i where kernel G is called the shadow of kernel K and satisfies the following relation-ship g'(r) = -ck(r) (6.10) where g is the kernel profile of G, g' denotes the derivative of kernel profile g, and c is a constant. 
Therefore, it is clear that mean-shift is a gradient ascent algorithm that converges to the mode of the convolution surface. In order to utilize mean-shift to analyze a discrete density distribution, i.e. the color histogram, an isotropic kernel G with a convex and monotonically de-creasing kernel profile g(x) is superimposed onto the candidate region .R(xj). The target is represented by an ellipsoidal region bounded by a rectangle specified by the tracker. The target model is normalized to a unit circle by rescaling the row and column dimensions so that the shape of the ellipsoid does not influence the model representation. The normalized pixel locations in the region of the target model are denoted as {a*}i-it...tM- With this kernel G, the target model element <7*(n;xn) can be rewritten as follows: M q*(n^0) = C^2g(\\a*\\2)5[b(a*)~n} (6.11) i = i where the region center is 0, C is the normalization constant to guarantee the probabilistic property of the discrete distribution. The nonincreasing kernel assigns lower weights on the pixels that are on the edge of the target. This is reasonable because most edges are blurry or might not actually belong to the target. They are much less reliable than those pixels that are closer to the center. As in the target model, the normalized pixel locations of the target candidate centered at x t are denoted as {a,i}i=it...,Mh- Each element of the target candidate histogram can be rewritten as Mh i = l h 5[b(ai)-n] (6.12) where Ch is also a normalization constant that depends on h, and h is the bandwidth that determines the scale of the target candidate. The similarity measure and the distance between the new target models and target candidates are the same as in Equation 6.4 and 6.5. The goal of a tracking process is to localize the position of the target in the current frame. In other words, it aims to search the neighborhood for a location so that the new candidate feature 43 extracted from that locat ion has the m i n i m u m distance to the model feature. We assume that, at each t ime step, the search starts at locat ion x and the normalized discrete dis t r ibut ion Q ( x ) is already computed, x can be the target locat ion i n the last t ime step or any k ind of locat ion specified by the system. T h e goal is to itera-t ively update x ' to reach the mode location. Accord ing to Equa t ion 6.5, min imiz ing the distance is equivalent to maximiz ing the Bhat tacharyya coefficient p [Q(x ' ) , Q*}. Here, we assume that the new candidate feature does not have significant change from the feature at the start ing location, which is true i n most situations i n practice. W i t h this assumption, we can use the Taylor expansion around the value Q ( x ) to have a reasonably accurate linear approximat ion of the Bha t tacharyya coefficient p[Q(x ' ) ,Q*] « \ £ y/q(n; x)<?*(n; x 0 ) + i f ^ n ; x ' ) J^ f^ (6-13) n=l n=l V ^ ' ' Substitute g (n ;x ' ) w i t h its kernel representation i n Equa t ion 6.12, the above equa-t ion can be rewrit ten as p[Q(x'),Q*] ~ 2 E v / 9 ( n ; x ) g * ( n ; x o ) + ^ w ^ g n=l i=l \ x ' (6.14) where w ^ ) = t{\f^^mal)-n}. (6.15) Accord ing to Equa t ion 6.14, the first term does not depend on x ' . Therefore, to maximize the whole equation, we maximize the second term. 
It is noticed that the second te rm has the same formulation as Equa t ion 6.9, which means the mode of the density est imation can be reached by applying mean-shift procedure, which i teratively moves the current locat ion x w i t h offset A t according to Equa t ion 6.8. In our applicat ion, the Epanechnikov profile is selected to be the profile function of kernel G 9 { X ) = \0 otherwise ( 6 ' 1 6 ) where d is the dimension of the geometrical space of the kernel, Cd is the volume of the uni t d-dimension sphere. Because kernel G i n the second t e rm i n E q u a t i o n 6.14 is the shadow kernel of kernel K i n the mean-shift process i n Equa t ion 6.8, they satisfy the relationship specified i n Equa t ion 6.10. Because the derivative of Epanechnikov profile is constant, the mean-shift offset reduces to Ej=i <Hw(ai) E5. w(oi) In conclusion, the mean-shift iterative procedure can be presented as follows. A » = ^ » - ? 7 - x . (6.17) 44 • Input target model Q* = x n ) n = 1 N and the start location x • Initialize new location x' so that x' = x • D o — x = x' — At location x, evaluate Bhattacharyya coefficient p[Q(x),Q*] = E n = i V/g(n;x)q*(n;x0) — For each normalized pixel location aj within, region R(x), compute the weights using Equation 6.19. — Find the new location / _ E ^ aiwjai) — At new location x', evaluate Bhattacharyya coefficient p[Q(x'),Q*] = E n = i ^(n;x')^(n;x 0) — While p[Q(x'),C?*]<p[Q(x),C?*] * x7 = (x + x')/2 * Evaluate p[6}(x'), Q*] • W h i l e ||x' - x|| > e1 • Output : Location x' that minimizes the distance d(x',xn) As mean-shift is able to seek the mode of the density distribution, it is also capable of stabilize the tracking result by shifting the tracking output to the location with the maximum similarity to the model in the feature space. The algorithm is exactly the same as the one shown above. The start location x for the stabilization is given by the output of the tracking system. 6.3 Adaptive Scaling As is shown in Equation 6.12, the bandwidth h determines the scales of the target candidate. In our application, the scales of players in the video frames is determined by their relative distance to the camera and the zooming effect of the camera. Therefore, it is important to adaptively change the scale through out the tracking process. Wu et al. [Wu05] tried to stabilize the scales of the targets through a pyramid based template matching, which is a coarse-to-fine scale search process to le selects the stop criteria for the mean-shift iteration. Any value is appropriate as long as it is small enough relative to the application. It is chosen to be 1 pixel in our application. 45 find the best scale that match the template. In addition, rather than using only one top match, it takes multiple top matches to generate a best estimate for a consistent scale to avoid sudden jump. As we do not have any template to match, we need to search for the best estimate of the scale on the fly through the mean-shift algorithm. In [OTdF+04], target scale is one of the dimensions in the the state space. Experiments show that the sampling process might generate scales that have a drastic change from the previous scale. However, in the real situation, because of the temporal and spatial continuity, the scales of players seldom change drastically. One of the solutions is to separate the scale from the state space and evolve it independently for each particle. 
The adaptive strategy searches several bandwidths around the previous bandwidth hprev to find the optimal bandwidth hopt at current step. To reduce the computational cost, only three bandwidth options are computed in our application: hwtv, hprev — A^, hprev + A^. Taking into consideration of the scale continuity, A^ is defined to be A^ = 0.1hprev. The optimal bandwidth hopt selects the bandwidth that outputs the minimum distance (or the maximum Bhattacharyya coefficient). The new bandwidth hnew is updated with the following smoothed adaptation strategy Kew = ihopt + (1 - y)hprev (6.18) where 7 is a smoothing factor that controls the adaptation speed to avoid drastic scale change. It is chosen to be 0.1 in our application. By combining the mean-shift algorithm and the adaptive scaling strategy, the tracking results can be well stabilized. Figure 6.3 compares the the raw tracking output and the stabilized result. It shows that the scale of the original tracker often is too small so that it cannot capture the major body of the target. Also, it drifts off the targets in some time steps. Meanwhile, the mean-shift stabilized result is much more stable. The scale changes more smoothly and fits the bounding box onto the target more properly. 6 . 4 Mean-shift Embedded Particle Filter Applying the mean-shift algorithm directly to the tracking output only gives one deterministic offset at each step. It might not be able to capture the true location of the targets due to background clutter or mutual occlusion between targets in the image. Embedding it into the particle filter framework brings uncertainty to the deterministic method so that the statistical property can improve the robustness of the algorithm. The key idea of the mean-shift embedded particle filtering is to insert one more operation into the conventional particle filter framework. There are two places where we can insert the mean-shift biasing step. One is after the particles are propagated by proposal distribution and the other is the after the deterministic resampling step, which is introduced in Section 3.1.3. There 46 L (a) Frame 24 A r 1 (b) Frame 40 — 1 1 L (c) Frame 63 ' JL 1 V - A [II (d) Frame 84 Figure 6.3: Compaxison between the original track output and the mean-shift sta-bi l ized result. Left ones are the original track result and the right ones are the stabil ized result. 47 is one fatal disadvantage that undermines the second option. Because mean-shift is only based on the s imilar i ty between the reference model and the candidate model i n the feature space, it might bias particles to other s imilar targets that are around the target of interest or some background clutter that are s imilar to the target of interest i n the feature space. A s a result, those wrongly biased particles that are far away from true locat ion of the targets w i l l have low weights because of the penalty from the prior dis t r ibut ion. W i t h o u t the resampling step, those low weight particles are s t i l l propagated i n the next step and lead the proposal d is t r ibut ion to sample i n the wrong region and the tracker quickly loses the targets. F igure 6.4 shows the t racking result of inserting the mean-shift after the resampling step i n the particle filtering. It can be noticed that, i n Subfigure (b)(e)(f), some particles are moved to some locations that are far away from the true target locat ion or the other targets. 
A l t h o u g h they are assigned low weights, they keep on propagating wi thout being penalized by the resampling. A s a result, they increase the variance of the particle set of one target and cause the target to be el iminated i n the next frame because of the high uncertainty i n particle hypotheses. In our applicat ion, the mean-shift operation biases a l l the particles right af-ter the sampling from the mixture of Gaussians proposal d is t r ibut ion and before the resampling step in the particle filter framework. In our mult i- level state space model , particles propagated by the proposal d is t r ibut ion are mapped back to the video frame coordinate system. Then , each particle is biased ind iv idua l ly and inde-pendently by the mean-shift a lgori thm to seek the mode of the density d is t r ibut ion i n the neighborhood of each particle. The locations of particles i n the video frame coordinates are set to be the start ing locations for a l l the independent mean-shift procedures. After the bias step, new samples are drawn from a Gaussian dis t r ibut ion which is centered at the biased particle w i t h a predefined covariance mat r ix . The mean-shift searching step combined w i t h the o ld proposal d is t r ibut ion can be considered as a new proposal d is t r ibut ion <j(xt|xt_i, y ( ) . Mean-shift biases the samples {'X-[^}i=i:...,N that are propagated by the o ld proposal d is t r ibut ion to a new particle set { x ^ } ^ ^ . . . ^ . We denote mean-shift searching w i t h function tp(-) such that X j = y ( x t ) . T h e new proposal, which combines the old proposal and the mean-shift bias, superimposes a Gaussian dis t r ibut ion on the biased particle x j for sampling new particles. Therefore, the weight is updated as follows: flxMlxM.y,) ^ where ^ ( x ^ l x ^ j ^ y f ) = A / " ( x ^ | x ^ \ £ ) . Here, E is a diagonal 2x2 mat r ix and the value of the two entries are chosen to be the same, which is 0.3, i n our applicat ion. (i) (i) Note that we use a sample x ^ ' instead of the biased particle kt . Th i s ensures that the sequential importance sampler remains unbiased and val id . 48 (b) Frame 19 (c) Frame 20 (d) Frame 22 (e) Frame 24 (f) Frame 25 (g) Frame 26 Figure 6.4: T h i s figure shows the t racking result of the system i n which the particles are biased by the mean-shift a lgori thm after the deterministic resampling step i n particle filtering. The top frame shows the overall view of the tracking result at frame 1. The six frames below show the t racking result w i t h a close-up view of the three players in the center region. Particles from the same target are represented by rectangles w i t h same color. 4 9 The following pseudo-code depicts the overall structure of our tracking sys-tem, which includes all the contributions in our work. • Initialization: t = 0 - Map boosting detections from the video frame coordinates to the rink coordinates to get {x.k,o}k=i,...,M0-- For k = 1 , M o , create particle set {xj^ 0; j?}iLi by sampling from prior distribution ^(xfc^). - K0 <- M 0 . . ' • • For t = 1,...,T, 1. Targets adding and removing - Remove Dt targets with large particle set variance. - Map boosting detections from the video frame to the rink. - Data association * For s = l , S t , create particle set { x ^ _ 1 + g t ; jf}iLi by sampling from prior distribution p(x S )t) for each new target. 
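The whole propose, bias and re-weight cycle for one particle set can be sketched as follows. Everything plugged in here (the proposal, the mean-shift bias phi, the color likelihood, the dynamics prior and the toy numbers) is a stand-in supplied for illustration rather than the system's actual implementation; only the weight expression mirrors Equation 6.19:

import numpy as np

rng = np.random.default_rng(2)
SIGMA = 0.3 * np.eye(2)   # covariance of the Gaussian placed on each biased particle

def gaussian_pdf(x, mean, cov):
    d = x - mean
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def ms_embedded_step(particles, weights, propose, mean_shift, likelihood, prior):
    """One mean-shift embedded particle filter update for a single target."""
    new_particles, new_weights = [], []
    for x_prev, w_prev in zip(particles, weights):
        x_prop = propose(x_prev)                          # sample from the mixture proposal
        x_bias = mean_shift(x_prop)                       # mean-shift biasing, phi(.)
        x_new = rng.multivariate_normal(x_bias, SIGMA)    # sample around the biased particle
        w = w_prev * likelihood(x_new) * prior(x_new, x_prev) \
            / gaussian_pdf(x_new, x_bias, SIGMA)          # weight update as in Equation 6.19
        new_particles.append(x_new)
        new_weights.append(w)
    new_weights = np.array(new_weights)
    return np.array(new_particles), new_weights / new_weights.sum()

# Trivial stand-ins just to exercise the function.
parts, wts = ms_embedded_step(
    particles=np.zeros((3, 2)), weights=np.full(3, 1.0 / 3.0),
    propose=lambda x: x + rng.normal(0.0, 1.0, 2),
    mean_shift=lambda x: x,
    likelihood=lambda x: np.exp(-0.5 * x @ x),
    prior=lambda x, xp: np.exp(-0.5 * (x - xp) @ (x - xp)))
print(wts)

In the actual system the biased particles are obtained by running the mean-shift iterations of Chapter 6 from each propagated particle, and the likelihood and prior are the color histogram model and the second order dynamics model respectively.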
* Associate boosting detections to the existing Kt-\ tracks to con-struct Gaussian mixture proposal distribution <7(xfc,t|xfc,t-i> zfc,t)' where zj^t is boosting detection. * Kt <- Kt-i + St- Dt. 2. Importance sampling - For k = 1, ...,Kt, i = 1, ...,N, sample x j ^ ^ x ^ l x j ^ ^ z ^ ) . 3. Mean-shift biasing - For k = 1 , K t , i = 1, ...,N, bias the particles as xj^j = <p(xj t^). - For k = 1 , K t , i = 1 , N , sample xj.* ) t~g(x f c | t|xjj ! t ] ) t) 4. Weight update - For k = 1 , K t , map particle set { x j ^ } ^ from rink to frame coor-dinates. (i) - For A; = l,...,Kt, i = 1,...,N, update weights ih).^ according to Equation 6.19. - Normalize weights. 5. Deterministic resampling - Resample particles to get new sample set { x j ^ } ^ . - Update weights: wkt = jj. 6. Output - For k = 1 , K t , E(xk,t) = Eli " g x g 50 Chapter 7 Experimental Results and Conclusion A s a l l the important components of our t racking system have been introduced and explained i n detail , they are assembled to construct a robust t racking system that is capable of t racking mult iple targets accurately regardless of background clutter, camera mot ion and mutual occlusion between targets. In this chapter, t racking results of our system w i l l be presented and analyzed. T h e y w i l l be compared w i t h the results from previous work to support the val idi ty of our contributions. A summary of our work w i l l address the strengths and weaknesses of our work together w i t h some possible extensions for future work. 7.1 Results Firs t ly , it should be noted that, i n order to gather more visual information for the color model , the or iginal video tape is resampled i n our experiment to produce video frames w i t h the size of 640 x 480 pixels while, i n [ O T d F + 0 4 ] , the t racking is performed i n video sequences w i t h the frame size of 320x240 pixels. It is necessary to mention the issue of interlacing. The interlacing results from the inherent scanning mechanism of T V . In a T V , the whole picture is not in the frame, but instead a l l the odd rows are shown first, then a l l the even rows. Therefore the complete picture is only created after every second field. Th i s causes significant edge blur i f the object has significant offset between consecutive frames. Figure 7.1 shows an example of the b lur from interlace. T h i s w i l l affect the analysis of the loca l mot ion of the targets. However, it does not significantly affect our appl icat ion because we only care about first order mot ion and we only use the color histogram to represent the target, which does not 'emphasize spatial information. Figure 7.2 shows the in i t i a l configuration of our multi-target t racking system. E a c h target is labelled w i t h a colored rectangle. In order to have a clear view of the t racking process, we present the t racking result w i t h close-up view of the rectangular 51 Figure 7.1: Th i s figure shows an example of the blur from interlacing. Figure 7.2: T h i s figure shows the overall view of the camera at the in i t i a l stage. T h e close-up view of the t racking results w i l l be shown i n Figure 7.3 by zooming into the rectangular region. region i n Figure 7.2. Figure 7.3 shows the close-up view of the t racking results which focus on the three targets in the region. Images on the left column in the figure are the t racking results of the system i n [ O T d F + 0 4 ] . Those on the right co lumn are from our system. 
Notice i n Subfigure (c) and (e) of Figure 7.3 that two tracks merge into one track i n bo th frames because the M P F structure in [ O T d F + 0 4 ] merges particle sets when they have significantly overlap. Therefore, two targets are lost i n the t racking process. In Subfigure (g), a new target is created because the boosting detection detects the target and it is not in the gate of any other existing tracks. It is also noticed that the tracker of the player in the bo t tom part of Subfigure (a) i n Figure 7.3 migrates to another one, which was in the left part of Subfigure (a) i n Frame 30, i n Subfigure (g). In systems that do not have a strong prior dynamics model , such migra t ion happens frequently dur ing cross-over because, wi thout a strong dy-namics model , color l ikel ihood dominates the weights of particles and forces them 52 (a) Frame 30 (b) Frame 30 (c) Frame 39 (d) Frame 39 (g) Frame 58 (h) Frame 58 Figure 7.3: T h i s sequence is a close-up view of the rectangular region i n Figure 7.2. T h e left column is the t racking results of the system i n [ O T d F + 0 4 ] . The right co lumn is the t racking results of our system. Different targets are labelled w i t h rectangles of different colors. 53 to migrate onto targets with similar color features. However, the color model alone is not competent enough to distinguish different targets in our application because of the similarity between players and the mutual occlusion that makes the color model unreliable. The targets lose their identity after the cross-over in the BPF-based tracking system. However, in our tracking system, with the assistance of the prior dynamics model and the mean-shift stabilization mechanism, all the targets are accurately tracked and all their identities are well kept after the cross-over. In our system, particles do not easily migrate to nearby targets with similar color histograms because the dynamics model takes into account the trajectory of the target and assigns low probability to particles that are far away from the predicted location. Combining both the visual and motion information from the targets sig-nificantly improves the robustness of the tracking system. In order to demonstrate the advantages brought by the new dynamics model, Figure 7.4 shows the tracking result of the system in which the camera motion is compensated and the dynamics model is replaced with the new second order autoregressive model. The particles in this experiment are not biased by the mean-shift algorithm and the scale is still one dimension in the state space. In the sampling stage of the particle filtering, particles are sampled from a proposal distribution which is a mixture of Gaussians. It superimposes Gaussian distribution on the dynamics model and the boosting detection results. The value of the entry in the covariance of the Gaussian distribution that corresponds to the scale controls the deviation of the scale. The larger the value, the more possible the scale has a sudden jump. Figure 7.4 shows the tracking results of the experiment. The output scale is an average of the scale value in all the particles. It can be noticed that the two improvements do improves the accuracy of the tracking result. However, it can also be noticed that the result is very sensitive to the value in Gaussian covariance matrix that corresponds to the scale. The left column in Figure 7.4 shows the result with the value of 0.02 in the covariance, which makes the scale of the players getting smaller quickly over time. 
The right column shows the result with an entry value of 0.01, which is more reasonable given the temporal continuity of the scale. However, in the experiment with this more reasonable value, one tracker migrates onto another and misses its target. This indicates that, even with a strong dynamics model, if the particles are led into wrong regions because of a large variance around the trajectory of the particle cloud, they can still be trapped by another target with a similar color histogram. Although the result on the left keeps the correct identity of all tracks, the scales no longer reflect the true scale of the targets. Therefore, there is not enough evidence to completely trust the system without the mean-shift bias. Separating the scale from the state space reduces the dimension of the state space so that, with the same number of particles, the particles can capture the true distribution more accurately. Moreover, the mean-shift algorithm with adaptive scaling is able to evolve the scales in a more reasonable and effective way.

Figure 7.4: This figure shows the tracking result of the system that only has the camera motion compensated and the dynamics model changed, without mean-shift bias, at Frames 30, 50 and 58. The left column is the results of the system in which the deviation for the scale sampling is 0.02; the right column is with a deviation of 0.01.

Figure 7.5 shows the particle representation of the tracking results of our system. Although the particles are propagated in the rink coordinate system, they are mapped to the video frames for presentation purposes. In the pseudo-code in Section 6.4, the evolution of the particle set in each iteration can be divided into three steps. The figure compares the particle sets after each step. The left column of the figure shows the particles directly sampled from the mixture of Gaussians proposal. The middle column shows the particles after they are biased by the mean-shift algorithm. The right column shows the particles after the deterministic resampling stage of our particle filter. Generally, the mean-shift algorithm moves particles from different locations around the target to nearby locations that are most similar to the reference model in the color space; therefore, the particle sets appear more condensed after the mean-shift bias. The resampling step duplicates more children for particles with higher weights, and particles with much lower weights may not have any child; therefore, the resulting particle sets become even more compact. This does not mean that the number of particles for a target is reduced; the particles simply overlap with each other at a limited number of locations. The difference between Subfigures (h) and (i) of Figure 7.5 indicates that mean-shift might move particles onto other targets because of the similarity between the two targets in the color space. However, particles that are shifted onto the wrong targets are assigned low weights because of the regularization of the dynamics model, and as a result those particles have much fewer or no children after the resampling stage. For the same reason, particles that are biased into regions without any target, as shown in Subfigures (b) and (c) of Figure 7.5, are penalized as well.
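The color similarity that drives both the mean-shift bias and the weight update can be illustrated with a small sketch. This is a minimal illustration only: the Epanechnikov-style kernel profile, the 8x8x8 RGB binning and the function names below are assumptions for the example, while the actual kernel and color model are the ones defined in Chapter 6.

import numpy as np

def kernel_histogram(patch, bins=8):
    """Kernel-weighted RGB histogram of an image patch (H, W, 3), uint8.

    Pixels near the border of the patch receive smaller weights, which
    down-weights unreliable background pixels at the edges of a player's
    bounding box.
    """
    h, w, _ = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # squared normalized distance from the patch centre
    r2 = ((ys - h / 2.0) / (h / 2.0)) ** 2 + ((xs - w / 2.0) / (w / 2.0)) ** 2
    weight = np.maximum(1.0 - r2, 0.0)            # Epanechnikov-style profile
    idx = (patch // (256 // bins)).astype(int)    # per-channel bin index
    flat = idx[..., 0] * bins * bins + idx[..., 1] * bins + idx[..., 2]
    hist = np.bincount(flat.ravel(), weights=weight.ravel(),
                       minlength=bins ** 3)
    return hist / hist.sum()

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalized histograms."""
    return float(np.sum(np.sqrt(p * q)))

A candidate location scores highly when the Bhattacharyya coefficient between its kernel-weighted histogram and the reference histogram is large, and the mean-shift bias moves particles toward nearby maxima of exactly this kind of similarity surface.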
In summary, both the mean-shift algorithm and the dynamics model penalize erroneous particle hypotheses and improve the robustness of the overall tracking system.

Because the particles evolve in the rink coordinates in our system, it is easy to obtain the trajectories of the targets on the rink. Figure 7.6 shows the trajectories of the same three targets that are shown in Figure 7.3. Subfigure (a) shows the trajectories of the three players on the standard hockey rink; different trajectories are drawn with different node shapes. Subfigures (b)(c)(d)(e) are key frames from the sequence. Only the three targets we focus on are labelled in the key frames, for clarity of presentation; in the actual tracking results, all targets are tracked and labelled. It is easy to judge from the key frames that the trajectories reflect the actual locations of the three targets, and hence support the validity of the three-layer state space model described in Section 4.1.2.

Figure 7.7 shows the correct tracking result of our system during a period in which partial occlusion happens. The color of each target differs from its color in Figure 7.2 because the colors are changed here for easier presentation. Both the results shown in Figure 7.3 and those in Figure 7.7 provide evidence that our new tracking system is able to accurately track multiple targets regardless of partial or complete occlusion.

Figure 7.5: This figure shows the particle representation of each target. The left column shows the particles before the mean-shift bias; the middle column shows the particles after the mean-shift bias; the right column shows the particles after the resampling. Each particle is represented with a rectangle, and particles that belong to the same target are of the same color.

Figure 7.6: This figure shows the trajectories of the three targets shown in Figure 7.3. Subfigure (a) shows the trajectories of the three players on the rink; (b)(c)(d)(e) show the key frames during the process, where (b) is the starting frame and (e) is the ending frame.

Figure 7.7: This figure shows the tracking result of the period when partial occlusion happens. (a) is the overall view of the tracking result at frame 78; (b)(c)(d)(e) show the close-up views of the four targets in the region specified in (a).

7.2 Conclusion

In this thesis, we devote our efforts to building a tracking system that is able to robustly track multiple targets and correctly maintain their identities regardless of background clutter, camera motion and mutual occlusion between targets. Although there has been extensive research in the field of multi-target tracking, it is still a tough journey to achieve the goal we set because of the various visual challenges discussed in Section 1.3. Unreliable target representation, unpredictable camera motion, and frequent mutual occlusion between targets are the major challenges. We adopt the boosted particle filter framework of Okuma et al. [OTdF+04] as the basic skeleton of our tracking system because it successfully tracks a variable number of targets with fully automatic initialization.
However, we replace the MPF structure in the BPF with a set of independent particle filters so that the particle clouds of different targets never merge or split, since both actions result in the loss of targets. The new structure provides a simple mechanism for handling a varying number of targets: particle sets are simply added or removed. Its advantages become more prominent as more and more targets enter the scene. In the BPF framework, as more tracks are created, new targets steal particles from existing tracks, so the average number of particles per target decreases and becomes insufficient to accurately capture the overall distribution.

In addition, after analyzing the particle weight update mechanism shown in Equation 3.9, we try to improve all three terms in the equation to build a better model for our system. Firstly, by using the rectification technique, the locations of the targets can be mapped between the video frame and rink coordinates through a homography. Therefore, camera motion can be compensated, and the motion of the targets is easier to model and predict in the rink coordinate system. With the camera motion compensated, a second order autoregressive model is further adopted to predict the locations of the targets by taking the history of their trajectories into account. The experimental results shown in Subfigures (h) and (i) of Figure 7.5 support the fact that the dynamics model plays an important role in eliminating incorrect particle hypotheses, especially when the visual information is unreliable during occlusion. Secondly, a global nearest neighbor data association technique is employed to assign boosting detections to the existing tracks or to create new tracks. It improves the proposal distribution during mutual occlusion, where a simple nearest neighbor association fails to reach the globally optimal assignment when detections fall inside the gates of multiple tracks. Finally, the mean-shift algorithm is embedded into the particle filter framework to stabilize the trajectories of the targets and improve the dynamics model prediction. On one hand, mean-shift superimposes a kernel onto the target patch in the image and improves the color histogram model by down-weighting unreliable pixels on the edges of the targets. On the other hand, the mean-shift algorithm biases particles to new locations with high likelihood, so that the variance of the particle sets decreases significantly. The experimental results shown in Subfigures (d) and (e) of Figure 7.5 support the fact that mean-shift is able to bias particles back onto more promising regions rather than the background.

With all the improvements stated above, the new tracking system is able to keep tracking the targets robustly under all the challenging situations. The experimental results presented in the last section show the accurate tracking results of our system and support the validity of our contributions.

7.3 Discussion and Future Extension

The experiments demonstrate accurate tracking results under all the tough visual challenges mentioned in Section 1.3 and support the conclusion that our contributions do improve the robustness and accuracy of the system. However, there are still some weaknesses in our approach.
Here we focus on two major problems: one results from the heterogeneous distances between the video frame coordinates and the rink coordinates; the other is caused by the second order autoregressive dynamics model.

As shown in Figure 7.8, the five big circles on the rink are of the same size, yet the two that are closer to the camera appear much larger than the middle one in Subfigure (a). This indicates that distances that are equal in the image coordinates might be different in the rink coordinates, and vice versa. It affects the mean-shift bias of the particles because the bias is performed in the image coordinates: mean-shift may move two particles by the same distance from their origins in the image coordinates while the corresponding distances on the rink differ significantly. Since the dynamics prior is evaluated in the rink coordinates, the Gaussian distribution assigns low probability to particles that end up far from the predicted centroid. Therefore, although a biased particle may look perfectly plausible in the image, it is penalized by the prior. One possible extension is to take the heterogeneous distances of the two coordinate systems into account when evaluating the prior probability of the biased particles: the covariance of the Gaussian distribution should vary with the distance between the target and the camera, and the further the target, the larger the covariance should be.

Figure 7.8: This figure shows how the video frame is mapped onto the standard rink. Subfigure (b) shows the mapped video frame blended with the rink.

Although the second order autoregressive dynamics model takes into account the history of the trajectory of the target, it does not handle noisy data properly. It uses the velocity computed during the last time interval as the current velocity, relying on temporal continuity. However, without any smoothing, noisy data in the trajectory can produce an abnormal velocity, so that the prediction based on this abnormal velocity points to an incorrect region. This affects the importance sampling, as the proposal distribution in our system is a mixture of Gaussians that comprises both the prior distribution and the boosting detection, and it is not easily corrected by the mean-shift algorithm because the dynamics prior assigns low probability to particles that are shifted far from the predicted centroid. If the sampling is misled to a region near another target with a similar color histogram, the particles tend to migrate onto that target and keep tracking it; if the particles are propagated out of the scene, the target is eliminated. One possible extension is to replace the second order autoregressive model with a constant velocity model that maintains a history of velocities rather than of positions and adaptively updates the velocity, in a way similar to the adaptive scaling described in Section 6.3, to prevent drastic velocity changes.

Another critical inherent disadvantage of our tracking system comes from the online algorithm we use. Our system is sub-optimal because it is only a two-frame system that keeps a record of only the current and previous states. The tracking could be improved if it were performed over the entire sequence or over predefined episodes (windows in time). Information from the future may either smooth the trajectories of the tracks or resolve uncertainties in the past.
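As an illustration of the constant velocity extension suggested above, the following sketch keeps an exponentially smoothed velocity estimate instead of differencing only the last two positions. The class layout and the smoothing factor alpha are hypothetical choices made for the example; this extension is not part of the implemented system.

import numpy as np

class AdaptiveVelocityModel:
    """Constant velocity prediction with an adaptively smoothed velocity.

    Instead of using only the displacement over the last time interval,
    the velocity estimate is updated as a running average, which damps
    the abnormal velocities produced by noisy trajectory data.
    """

    def __init__(self, position, alpha=0.3):
        self.position = np.asarray(position, dtype=float)
        self.velocity = np.zeros_like(self.position)
        self.alpha = alpha                      # weight of the newest estimate

    def update(self, new_position):
        new_position = np.asarray(new_position, dtype=float)
        instant = new_position - self.position  # raw one-step velocity
        self.velocity = self.alpha * instant + (1.0 - self.alpha) * self.velocity
        self.position = new_position

    def predict(self):
        """Predicted location for the next time step, in rink coordinates."""
        return self.position + self.velocity

With a small alpha, a single noisy position perturbs the predicted location far less than a raw two-point velocity estimate would, which is exactly the behavior the proposed extension aims for.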
As for exploiting future information, forward filtering and backward smoothing can be performed within the particle filter framework to smooth the trajectories, and MHT can be used to help eliminate erroneous tracks given the data in a longer window of time (more than two time steps).

Bibliography

[AM79] B.D. Anderson and J.B. Moore. Optimal Filtering. Prentice-Hall, Englewood Cliffs, New Jersey, 1979.

[AMGC01] S. Arulampalam, S.R. Maskell, N.J. Gordon, and T. Clapp. A Tutorial on Particle Filters for On-line Nonlinear/Non-Gaussian Bayesian Tracking. IEEE Transactions on Signal Processing, 50(2):174-189, 2001.

[Ber91] D.P. Bertsekas. Linear Network Optimization: Algorithms and Codes. The MIT Press, Cambridge, 1991.

[BL03] M. Brown and D.G. Lowe. Recognising Panoramas. In International Conference on Computer Vision, pages 1218-1225, 2003.

[BP99] S. Blackman and R. Popoli. Design and Analysis of Modern Tracking Systems. Artech House, Norwood, 1999.

[Bra98] G.R. Bradski. Real Time Face and Object Tracking as a Component of a Perceptual User Interface. In Workshop on Application of Computer Vision, pages 214-219, 1998.

[BSF88] Y. Bar-Shalom and T.E. Fortmann. Tracking and Data Association. Academic Press, Boston, 1988.

[CH96] I.J. Cox and S.L. Hingorani. An Efficient Implementation of Reid's Multiple Hypothesis Tracking Algorithm and Its Evaluation for the Purpose of Visual Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(2):138-150, 1996.

[Che95] Y.Z. Cheng. Mean Shift, Mode Seeking, and Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790-799, 1995.

[CHR02] Y. Chen, T.S. Huang, and Y. Rui. Parametric Contour Tracking Using Unscented Kalman Filter. In International Conference on Image Processing, volume III, pages 613-616, 2002.

[CM02] D. Comaniciu and P. Meer. Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603-619, 2002.

[Col03] R.T. Collins. Mean-shift Blob Tracking through Scale Space. In International Conference on Computer Vision and Pattern Recognition, volume II, pages 234-240, 2003.

[CRM00] D. Comaniciu, V. Ramesh, and P. Meer. Real-time Tracking of Non-rigid Objects Using Mean Shift. In International Conference on Computer Vision and Pattern Recognition, pages 2142-2149, 2000.

[CRM03] D. Comaniciu, V. Ramesh, and P. Meer. Kernel-based Object Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):564-577, 2003.

[DdFG01] A. Doucet, J.F.G. de Freitas, and N.J. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York, 2001.

[DdFMR00] A. Doucet, J.F.G. de Freitas, K.P. Murphy, and S.J. Russell. Rao-Blackwellised Particle Filtering for Dynamic Bayesian Networks. In Uncertainty in Artificial Intelligence, pages 176-183, 2000.

[DNCC01] M.W.M.G. Dissanayake, P. Newman, S. Clark, H.F. Durrant-Whyte, and M. Csorba. A Solution to the Simultaneous Localisation and Map Building (SLAM) Problem. IEEE Transactions on Robotics and Automation, 17(3):229-241, 2001.

[DNDW+00] M.W.M.G. Dissanayake, P. Newman, H.F. Durrant-Whyte, S. Clark, and M. Csorba.
An Experimental and Theoretical Investigation into Simultaneous Localisation and Map Building. In International Symposium on Experimental Robotics, pages 265-274, 2000.

[EBMM03] A.A. Efros, A.C. Berg, G. Mori, and J. Malik. Recognizing Action at a Distance. In International Conference on Computer Vision, volume II, pages 726-733, 2003.

[Gor94] N.J. Gordon. Bayesian Methods for Tracking. PhD thesis, Imperial College, University of London, 1994.

[HLCP03] C. Hue, J.P. Le Cadre, and P. Perez. Tracking Multiple Objects with Particle Filtering. In IEEE Transactions on Aerospace and Electronic Systems, volume 38, pages 313-318, 2003.

[HS88] C. Harris and M.J. Stephens. A Combined Corner and Edge Detector. In Alvey Vision Conference, pages 147-152, 1988.

[HZ00] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.

[IB98] M. Isard and A. Blake. CONDENSATION - Conditional Density Propagation for Visual Tracking. International Journal of Computer Vision, 29(1):5-28, 1998.

[IM01] M. Isard and J.P. MacCormick. BraMBLe: A Bayesian Multiple-Blob Tracker. In International Conference on Computer Vision, volume II, pages 34-41, 2001.

[Kai67] T. Kailath. The Divergence and Bhattacharyya Distance Measures in Signal Selection. IEEE Transactions on Communication Technology, 15:52-60, 1967.

[KCM03] J. Kang, I. Cohen, and G. Medioni. Soccer Player Tracking across Uncalibrated Camera Streams. In Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS) In Conjunction with ICCV, pages 172-179, 2003.

[KCM04] J. Kang, I. Cohen, and G. Medioni. Tracking Objects from Multiple Stationary and Moving Cameras. In Asian Conference on Computer Vision, 2004.

[Kel94] A. Kelly. A 3D Space Formulation of a Navigation Kalman Filter for Autonomous Vehicles. Technical Report CMU-RI-TR-94-19, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, 1994.

[Li04] F. Li. Analysis of Player Actions in Selected Hockey Game Situations. Master's thesis, Department of Computer Science, University of British Columbia, 2004.

[Low04] D.G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.

[MB00] J.P. MacCormick and A. Blake. A Probabilistic Exclusion Principle for Tracking Multiple Objects. International Journal of Computer Vision, 39(1):57-71, 2000.

[MSZ04] K. Mikolajczyk, C. Schmid, and A. Zisserman. Human Detection Based on a Probabilistic Assembly of Robust Part Detectors. In European Conference on Computer Vision, pages 69-82, 2004.

[NBIR00] B. North, A. Blake, M. Isard, and J. Rittscher. Learning and Classification of Complex Dynamics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9):1016-1034, 2000.

[OLL04] K. Okuma, J.J. Little, and D.G. Lowe. Automatic Rectification of Long Image Sequences. In Asian Conference on Computer Vision, 2004.

[OTdF+04] K. Okuma, A. Taleghani, J.F.G. de Freitas, J.J. Little, and D.G. Lowe. A Boosted Particle Filter: Multitarget Detection and Tracking.
In European Conference on Computer Vision, volume I, pages 28-39, 2004.

[PHVG02] P. Perez, C. Hue, J. Vermaak, and M. Gangnet. Color-Based Probabilistic Tracking. In European Conference on Computer Vision, volume I, pages 661-675, 2002.

[POP98] C.P. Papageorgiou, M. Oren, and T. Poggio. A General Framework for Object Detection. In International Conference on Computer Vision, pages 555-562, 1998.

[PVB04] P. Perez, J. Vermaak, and A. Blake. Data Fusion for Visual Tracking with Particles. Proceedings of the IEEE, 92(3):495-513, 2004.

[RBK98] H. Rowley, S. Baluja, and T. Kanade. Neural Network-Based Face Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23-38, 1998.

[Rei79] D.B. Reid. An Algorithm for Tracking Multiple Targets. IEEE Transactions on Automatic Control, 24(6):843-854, 1979.

[RF03] D. Ramanan and D.A. Forsyth. Finding and Tracking People from the Bottom Up. In International Conference on Computer Vision and Pattern Recognition, volume II, pages 467-474, 2003.

[SBFC01] D. Schulz, W. Burgard, D. Fox, and A.B. Cremers. Tracking Multiple Moving Targets with a Mobile Robot using Particle Filters and Statistical Data Association. In International Conference on Robotics and Automation, pages 1665-1670, 2001.

[SBIM01] J. Sullivan, A. Blake, M. Isard, and J.P. MacCormick. Bayesian Object Localisation in Images. International Journal of Computer Vision, 44(2):111-135, 2001.

[SFH03] D. Schulz, D. Fox, and J. Hightower. People Tracking with Anonymous and ID-Sensors Using Rao-Blackwellized Particle Filters. In International Joint Conference on Artificial Intelligence, pages 921-928, 2003.

[SSC90] R. Smith, M. Self, and P. Cheeseman. Estimating Uncertain Spatial Relationships in Robotics, pages 167-193, 1990.

[SVL04] S. Sarkka, A. Vehtari, and J. Lampinen. Rao-Blackwellized Monte Carlo Data Association for Multiple Target Tracking. In International Conference on Information Fusion, volume I, pages 583-590, 2004.

[SWTO04] C. Shan, Y. Wei, T. Tan, and F. Ojardias. Real Time Hand Tracking by Combining Particle Filtering and Mean Shift. In International Conference on Automatic Face and Gesture Recognition, pages 669-674, 2004.

[TK91] C. Tomasi and T. Kanade. Shape and Motion from Image Streams: A Factorization Method Part 3 - Detection and Tracking of Point Features. Technical Report CMU-CS-TR, 1991.

[vdMAdFW00] R. van der Merwe, A. Doucet, J.F.G. de Freitas, and E. Wan. The Unscented Particle Filter. Technical Report CUED/F-INFENG/TR 380, Cambridge University Engineering Department, Cambridge, England, 2000.

[VDP03] J. Vermaak, A. Doucet, and P. Perez. Maintaining Multi-modality through Mixture Tracking. In International Conference on Computer Vision, volume II, pages 1110-1116, 2003.

[VGP05] J. Vermaak, S.J. Godsill, and P. Perez. Monte Carlo Filtering for Multi-Target Tracking and Data Association. Accepted for publication in the IEEE Transactions on Aerospace and Electronic Systems, 2005.

[VJ04] P. Viola and M.J. Jones. Robust Real-Time Face Detection. International Journal of Computer Vision, 57(2):137-154, 2004.

[VJS05] P. Viola, M.J. Jones, and D. Snow.
Detecting Pedestrians Using Patterns of Motion and Appearance. International Journal of Computer Vision, 63(2):153-161, 2005.

[WH01] Y. Wu and T.S. Huang. A Co-inference Approach to Robust Visual Tracking. In International Conference on Computer Vision, volume II, pages 26-33, 2001.

[Wu05] X. Wu. Template-based Action Recognition: Classifying Hockey Players' Movement. Master's thesis, Department of Computer Science, University of British Columbia, 2005.

[WYH03] Y. Wu, T. Yu, and G. Hua. Tracking Appearances with Occlusions. In International Conference on Computer Vision and Pattern Recognition, volume I, pages 789-795, 2003.

[YKA02] M.H. Yang, D.J. Kriegman, and N. Ahuja. Detecting Faces in Images: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):34-58, 2002.

[ZN04] T. Zhao and R. Nevatia. Tracking Multiple Humans in Complex Situations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1208-1221, 2004.
