@prefix vivo: .
@prefix edm: .
@prefix ns0: .
@prefix dcterms: .
@prefix skos: .
vivo:departmentOrSchool "Science, Faculty of"@en, "Computer Science, Department of"@en ;
edm:dataProvider "DSpace"@en ;
ns0:degreeCampus "UBCV"@en ;
dcterms:creator "Cai, Yizheng"@en ;
dcterms:issued "2009-12-11T18:11:26Z"@en, "2005"@en ;
vivo:relatedDegree "Master of Science - MSc"@en ;
ns0:degreeGrantor "University of British Columbia"@en ;
dcterms:description """We address the problem of robust multi-target tracking within the application of
hockey player tracking. Although there has been extensive work i n multi-target
tracking, there is no existing visual tracking system that can automatically and
robustly track a variable number of targets and correctly maintain their identities
with a monocular camera regardless of background clutter, camera motion and frequent
mutual occlusion between targets. We build our system on the basis of the
previous work by Okuma et al. [OTdF⁺ 04]. The particle filter technique is adopted
and modified to fit into the multi-target tracking framework. A rectification technique
is employed to map the locations of players from the video frame coordinate
system to the standard hockey rink coordinates so that the system can compensate
for camera motion and the dynamics of players on the rink can be improved by
a second order auto-regression model. A global nearest neighbor data association
algorithm is introduced to assign boosting detections to the existing tracks for the
proposal distribution i n particle filters. The mean-shift algorithm is embedded into
the particle filter framework to stabilize the trajectories of the targets for robust
tracking during mutual occlusion. The color model of the targets is also improved
by the kernel introduced by mean-shift. Experimental results show that our system
is able to correctly track all the targets in the scene even if they are partially or
completely occluded for a period of time."""@en ;
edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/16480?expand=metadata"@en ;
skos:note "Robust Visual Tracking for Multiple Targets by Y i z h e n g C a i B . E . , Zhejiang University, 2003 A T H E S I S S U B M I T T E D I N P A R T I A L F U L F I L L M E N T O F T H E R E Q U I R E M E N T S F O R T H E D E G R E E O F Master of Science i n T H E F A C U L T Y O F G R A D U A T E S T U D I E S (Computer Science) The University of British Columbia September 2005 © Y i z h e n g C a i , 2005 Abstract We address the problem of robust multi-target t racking wi th in the appl icat ion of hockey player tracking. A l though there has been extensive work i n multi-target t racking, there is no existing visual t racking system that can automatical ly and robust ly track a variable number of targets and correctly mainta in their identities w i t h a monocular camera regardless of background clutter, camera mot ion and fre-quent mutua l occlusion between targets. We bu i ld our system on the basis of the previous work by O k u m a et a l . [ O T d F + 0 4 ] . T h e particle filter technique is adopted and modified to fit into the multi-target t racking framework. A rectification tech-nique is employed to map the locations of players from the video frame coordinate system to the standard hockey r ink coordinates so that the system can compensate for camera mot ion and the dynamics of players on the r ink can be improved by a second order auto-regression model . A global nearest neighbor data association a lgor i thm is introduced to assign boosting detections to the existing tracks for the proposal d is t r ibut ion i n particle filters. T h e mean-shift a lgor i thm is embedded into the particle filter framework to stabilize the trajectories of the targets for robust t racking dur ing mutua l occlusion. T h e color model of the targets is also improved by the kernel introduced by mean-shift. Exper imenta l results show that our system is able to correctly track a l l the targets i n the scene even i f they are par t ia l ly or completely occluded for a period of t ime. Key words: Mul t i - target tracking, Homography, Sequential Monte Car lo methods, D a t a association, Mean-shift i i Contents Abstract ii Contents iii List of Tables v List of Figures vi Acknowledgements ix 1 Introduction 1 1.1 Motivation 1 1.2 Problem Statement 2 1.3 Challenges 2 1.3.1 Target localization 2 1.3.2 Target representation 3 1.3.3 Non-rigidity of the target 3 1.3.4 Tracking stabilization 3 1.3.5 Mutual occlusion 4 1.4 Contributions 6 1.5 Thesis Outline 6 2 Previous Work 8 2.1 Target Representation and Localization 8 2.1.1 Representation 8 2.1.2 Localization 10 2.2 Filtering and Data Association 10 2.2.1 Filtering 11 2.2.2 Data association 14 2.2.3 Mutual exclusion 18 2.3 Summary 18 iii 3 Particle Filter for Multi-target Tracking 20 3.1 Part ic le F i l t e r 20 3.1.1 Monte Car lo s imulat ion 21 3.1.2 Sequential importance sampling 21 3.1.3 Resampling 22 3.2 Boosted Par t ic le F i l t e r 23 3.3 Our Improvements 24 4 Target Dynamics Modelling 26 4.1 Rect if icat ion 26 4.1.1 Homography 26 4.1.2 Mul t i - l ayer state space model 28 4.2 Autoregressive Dynamics M o d e l 30 5 Data Association 33 5.1 G a t i n g 33 5.2 Linear Opt imiza t ion 35 6 Mean-Shift Embedded Particle Filter 38 6.1 Color M o d e l 38 6.2 Mean-Shif t 41 6.3 Adap t ive Scaling 45 6.4 Mean-shift Embedded Part ic le F i l t e r 46 7 Experimental Results and Conclusion 51 7.1 Results 51 7.2 Conclus ion 60 7.3 Discussion and Future Extension 61 Bibliography 64 iv List o f Tables Assignment mat r ix for the assignment problem shown i n Figure 5.1 . 36 v List of Figures 1.1 The figure shows that different players can have quite difference shapes so that i t is impossible to use a bounding box w i t h a fixed edge ratio. 4 1.2 The above 4 frames show that the tracker of the target drifts signifi-cantly over t ime 5 1.3 The figure shows the si tuat ion of complete occlusion and par t ia l oc-clusion dur ing the process. The rectangles show the regions where occlusions take place 6 2.1 Graph ica l model representation of the relationship between an ob-served variable yt and a latent variable xt 11 2.2 Graph ica l model representation of the H idden M a r k o v M o d e l , where XQ is the in i t i a l state 12 3.1 T h e S I R process starts at t ime t — 1 w i t h a set of equally weighted par-ticles T V - 1 } ^ , which represent the dis t r ibut ion p(xt-i\\yo-.t-2) • Importance weights are updated w i t h information at t ime t — 1 to rep-resent p(xt-i\\yo:t-i) w i t h {Xfli,Wt-i}iLi- Resampl ing is performed to make it a equality weighted particle set {x[ll1: N^1}^ again but s t i l l represent the same dis t r ibut ion. F ina l ly , particles are propogated to {x[%\\ N~l}?L1 at t ime t by the proposal d is t r ibut ion to represent p{xt\\yo-.t-i) and iterations go forward [vdMAdFWOO] 23 3.2 The final proposal is i n the form of a mixture of Gaussian which combines the prior and the boosting detection [OTdF+04] 24 4.1 T h i s shows how video frames are mapped to the r ink. The top left image is the input video frame; the top right is the standard r ink model; the bo t tom image shows how the image looks after being mapped onto the r ink 29 4.2 Graph ica l model representation of the three-layer structure of the state space model 30 v i 4.3 A l l the black rectangles are the boosting detections. B lack dots on the bo t tom edge of each rectangle represent the positions of the players on the r ink 31 5.1 T I , T 2 , T 3 are predicted target positions; 0 1 , 0 2 , 0 3 , 0 4 are obser-vations. 0 2 is assigned to T I ; 0 1 is a candidate of T I , T 2 and T 3 ; 0 4 is a candidate of T 2 and T 3 ; 0 3 is not assigned to any existing tracks so that i t is a candidate of being a new target 34 5.2 T I , T 2 are predicted positions of targets; V I , V 2 are vectors that indicate the velocities of the targets 35 6.1 Mul t i -pa r t color histogram of two players from two different teams. . 41 6.2 T h i s figure shows how significantly the color histogram of the same target changes over t ime 42 6.3 Compar ison between the original track output and the mean-shift stabil ized result. Left ones are the original track result and the right ones are the stabilized result 47 6.4 T h i s figure shows the tracking result of the system i n which the par-ticles are biased by the mean-shift a lgor i thm after the deterministic resampling step i n particle filtering. The top frame shows the overall view of the t racking result at frame 1. The six frames below show the tracking result w i t h a close-up view of the three players i n the center region. Part icles from the same target are represented by rectangles w i t h same color 49 7.1 Th i s figure shows an example of the blur from interlacing 52 7.2 T h i s figure shows the overall view of the camera at the in i t i a l stage. T h e close-up view of the t racking results w i l l be shown i n Figure 7.3 by zooming into the rectangular region 52 7.3 T h i s sequence is a close-up view of the rectangular region i n F i g -ure 7.2. The left column is the t racking results of the system i n [ O T d F + 0 4 ] . The right column is the t racking results of our system. Different targets are labelled w i t h rectangles of different colors. . . . 53 7.4 Th i s figure shows the tracking result of the system that only has the camera mot ion compensated and the dynamics model changed but without mean-shift bias. The left column is the results of the system i n wh ich the deviat ion for the scale sampling is 0.02; the right co lumn is w i t h the deviat ion of 0.01 55 v i i 7.5 T h i s figure shows the particle representation of each target. The left co lumn shows the particles before the mean-shift bias; the middle co lumn shows the particles after the mean-shift bias; the right co lumn shows the particles after the resampling. E a c h particle is represented w i t h a rectangle and particles that belong to the same target are of the same color 57 7.6 T h i s figure shows the trajectories of the three targets shown i n Figure 7.3. (b)(c)(d)(e) show the key frames dur ing the process, (b) is the start ing frame and (e) is the ending frame 58 7.7 T h i s figure shows the tracking result of the per iod when par t ia l occlu-sion happens, (a) is the overall view of the t racking result at frame 78. (b)(c)(d)(e) show the close-up view of the four targets in the region specified i n (a) 59 7.8 T h i s figure shows how the video frame is mapped onto the standard r ink 62 v i i i Acknowledgements I would like to acknowledge those who gave me great help dur ing my thesis work. W i t h o u t their support, I can never accomplish this work. M y sincere grati tude goes to my supervisors, D r . James L i t t l e and D r . Nando de Freitas for their passion and patience i n providing me w i t h extraordinary guidance and invaluable ideas. I also would like to thank D r . K e v i n M u r p h y and D r . D a v i d Lowe for their enlightening suggestions. M y appreciation extends to m y colleagues, X i a o j ing W u , K e n j i O k u m a , Fahong L i , and Yongning Zhu for their beneficial discussions. M y special thanks go to my family for their indispensable support. F ina l ly , I wish to acknowledge the funding support of the Nat iona l Science and Engineering Research Counc i l of Canada and G E O I D E , a Canad ian Network of Centres of Excellence. Y I Z H E N G C A I The University of British Columbia September 2005 ix Chapter 1 Introduction 1.1 Motivation Tracking mult iple targets, although has its root i n control theory, has been of broad interest i n many computer vision applications for decades as well . People or vehicle t racking for public surveillance systems, autonomous robots and vehicles steering, and sports game video annotat ion are a l l promising applications. Objects of interest for t racking can be of any k ind , including landmarks, vehicles, people, and various kinds of v i sua l features. T h e number of objects can vary over t ime. T h e backgrounds of the scenes to be analyzed can either be static or dynamic. A l l these complexities, which arise i n the real world, make the problem non-tr ivial . Tracking players i n sports game videos is one of the most pract ical appl i -cations for multi-target t racking because of the popular i ty of the games and the pervasive demand for automatic annotation and analysis of such videos for broad-casting and coaching purposes. It is also a challenging applicat ion because of its inherent properties including the varying number of players i n the scene, the dynam-ical ly changing background due to camera motion, and non-rigid target shapes due to various player poses. In the world of computer vision research, such applications have already been investigated on the soccer games [ V D P 0 3 , K C M 0 3 ] and hockey games [OTdF+04]. Recently, research on more higher level of mot ion analysis has also been carried out. Efros et a l . [ E B M M 0 3 ] focus on the character of the mot ion of soccer players (i.e., walk ing , running, standing) rather than their t ranslat ional mot ion and assume that the targets are stably tracked so that the targets are roughly centered wi th in the bounding box of the tracker. L i et a l . [Li04] a i m to represent and reason about hockey behaviors for an even higher level of mot ion understanding. Such applications a l l require accurate trajectory data extracted from lower level t racking systems as their input . Therefore, acquisit ion of accurate trajectory data over long video sequences becomes crucial . To generate such required trajectory data, one of the most cr i t ica l properties of the t racking system is its robustness i n mainta ining 1 the correct identity of different targets under various conditions. However, existing systems either have difficulties i n maintaining a variable number of targets or have problems in keeping the correct identity of targets when occlusion happens between different targets. Be ing aware of the weaknesses i n the exist ing systems, we a i m to bu i ld a t racking system that can robustly track mult iple targets and keep their identities over long video sequences captured by a monocular camera. 1 . 2 Problem Statement A s an extension of the previous endeavors devoted to bui ld ing robust multi-target t racking systems, and a preparation for researches on high level act ion analysis, this thesis addresses the task of bui ld ing a robust t racking system which can han-dle a variable number of targets and mainta in the accurate identity of each target regardless of background clutter, camera mot ion and frequent mutual occlusions between targets. The system is required to have the abi l i ty to track targets over long sequences and be robust under perturbations including i l lumina t ion condi t ion changes, shadows of the targets and so on. 1.3 Challenges A l t h o u g h extensive research on various tracking systems has already conquered many cr i t ica l obstacles i n target local izat ion and filtering i n tracking, there s t i l l remains challenging problems i n our applicat ion. We present i n this section the following major challenges: target local izat ion, target representation, non-rigidi ty of the targets, t racking stabil izat ion, and mutual occlusion. 1.3.1 Target localization Object detection has been a major focus of research in computer vis ion. Learning-based object detection [ R B K 9 8 , VJ04] is one of the most effective and efficient methods. Camera motions, including panning, t i l t , and zooming, make the video frame coordinates variable w i t h respect to the coordinate system, which is the hockey r ink in this case. M u l t i p l e cameras are able to solve the problem of local iz ing targets i n the real world coordinate system [KCM04] so long as the relative locat ion and configuration of different cameras are known and well synchronized. However, because the source data we use for this task is extracted from only one monocular T V broadcasting camera, it is very difficult to derive even the relative positions among the targets in the real world. A l though 2D or 3D geometry of the scene is one possible way to assist solving the problem for moving monocular camera, it is s t i l l a non- t r iv ia l problem. 2 1.3.2 Target representation Because of the inherent weakness i n video standards of Na t iona l Television System Commit tee ( N T S C ) , the video frames extracted from the recorded video clips are highly unsaturated especially i n hockey games. Therefore, the representation of the targets i n the color space becomes quite noisy and unstable so that i t is not salient enough to support t racking as the only information. In addit ion, the resolution of the video frames is so low that each target is sometimes as smal l as twenty pixels high. A s a result, a significant number of pixels that belong to one target are probably contaminated by the noise from the hardware device because of the l imi ted number of the to ta l pixels of the target. A l so , as is well known, hockey is a fast game where players tend to move so quickly that the targets that move i n the opposite direction of the camera motion appear b lur ry i n the video frames. Therefore the representation of those blurred targets i n the feature space w i l l become much different from the ground t ru th . 1.3.3 Non-rigidity of the target H u m a n bodies are one group of objects that are very difficult to detect or track because of their non-rigidity. Research has been done to detect or track human bodies as a whole [POP98, V J S 0 5 , IM01] or separately on different parts of the body w i t h some geometric constraints [MSZ04, R F 0 3 ] . A s is mentioned i n the previous section, because of the low resolution of the input video clips, t racking parts of the human body separately i n the same way as [RF03] does is almost infeasible i n this applicat ion. In the systems that detect or track people as a whole, human bodies are more or less upright and straight so that most of the bodies fit into a rectangular bounding box w i t h a fixed length ratio of the edges. However, it w i l l cause problems i n our appl icat ion because it is impossible to find such a bounding box to capture the whole body of the hockey players that has drastic pose changes dur ing the game. Examples are shown i n Figure 1.1. The opt imal way is to use varying shape boxes to capture the targets. However, as we use the detection results from the boosting detectors, we straightforwardly inherit the fixed shape bounding box as well . T h i s brings some disadvantages because, by using such k i n d of bounding boxes w i t h fixed edge ratio, it is possible that important pixels are excluded from the box and pixels that do not belong to the target would be included. Hence, the representation of the targets i n the feature space w i l l be contaminated. 1.3.4 Tracking stabilization Taking the bounding box tracker as an example, i t is very difficult to s tably fit the boxes on the targets. The tracker tends to drift off the targets more or less over t ime. Examples can be found in Figure 1.2. Th i s is the t racking output of the system 3 Figure 1.1: The figure shows that different players can have quite difference shapes so that it is impossible to use a bounding box with a fixed edge ratio. by [OTdF + 04]. Only one track is shown for clear representation. Having a stable tracker is important because as long as we need the trajectories of the targets to predict their locations, it is important that the trajectories are reliable. Otherwise, the prediction would lead the tracker to a region far away from the true target location. In most cases, such errors are irrevocable so that the tracker loses the target or tracks on to a different target. Although, intuitively, people can tell which tracker is more stable than the other one, there is no broadly accepted concrete standard to define the concept of stability. In addition, even though it is possible to define a standard for our particular application, it is difficult to evaluate the stability of the tracking result and perform certain corrections because the ground truth is unknown. 1.3.5 Mutual occlusion Compared to the challenges described in the previous sections, mutual occlusion is a higher level vision problem which requires the understanding of the scene rather than the video frames only. Because the source data is captured by a single fixed monocular camera, only the 2D projection of the real world is captured by the camera. Consequently, there are always occasions when different targets occlude each other in the video frames as they cross over each other. This can be avoided if the camera records the games from the top down view. However, it is not the case in most broadcasting videos. Occlusions can be partial or even complete for a certain period of time. Most detectors fail in those situations because they treat input video frames as independent and static images without any context information. Therefore, it is impossible to detect a target that is completely occluded. Partial 4 i tfsaJB. ' tut 1 1 1 ' p i (a) Frame 1 (b) Frame 8 (c) Frame 15 (d) Frame 25 Figure 1.2: T h e above 4 frames show that the tracker of the target drifts significantly over t ime occlusion is also problematic in detectors that use some templates or t ra ining sets w i t h only complete objects for detection. Consequently, trackers that only make use of the visual information or the output of the detector to locate targets fail dur ing occlusion. A l t h o u g h tracking systems may make use of spatial and temporal continuity properties of the video clips to predict the locat ion of the targets, a system s t i l l has difficulties in correctly t racking the targets i f the complete occlusion results i n the absence of the target i n one spot and its reappearance somewhere else. Pa r t i a l ly occluded targets are also difficult to track because of the insufficient visual support by those pixels that are not occluded. Therefore, without understanding of the context and the spatial configuration of the scene, i t is impossible to solve this problem. Figure 1.3 shows occasions that par t ia l and complete occlusions take place. 5 (a) (b) Figure 1.3: The figure shows the situation of complete occlusion and partial oc-clusion during the process. The rectangles show the regions where occlusions take place. 1 . 4 Contributions Challenges have been investigated and described in the previous section. In order to conquer those challenges and build a robust multi-target tracking system to cor-rectly track hockey players during the games, we propose four improvements on the previous systems. Firstly, a rectification technique is employed to compensate for camera motions by mapping the locations of the players from the video frame coordinate system to the time invariant standard hockey rink coordinate system. Secondly, a second order autoregression model is adopted as the dynamics model to predict the location of the targets. Thirdly, a global nearest neighbor data associ-ation technique is used to correctly associate boosting detections with the existing tracks. Finally, the mean-shift algorithm is embedded into the particle filter frame-work [AMGC01] to stabilize the trajectories of the targets. By assembling all the contributions, the new multi-target tracking system is capable of achieving the goal stated in Section 1.2. The details of all the techniques employed will be described in the following chapters. 1.5 Thesis Outline The remainder of the thesis will organized as follows. In the next chapter, a review of modern tracking systems will be presented with emphasis on multi-target track-ing. Previous work on object representation, motion modelling and data association will also be covered. Chapter 3 gives a general explanation of Sequential Monte Carlo Methods [DdFGOl], also known as particle filters, because it forms the basic skeleton of our tracking system. Extensions on the conventional particle filter to support multi-target tracking in our system will be introduced as well. Chapter 4 6 describes how to improve the model l ing of the dynamics of each target so that the mot ion information can provide enough support for the tracker while visual support is absent or unreliable. A sketch of the previous rectification technique [OLL04] , which w i l l be used i n our work to compensate camera motion, w i l l be shown and the autoregressive dynamics model w i l l be presented i n detail as well . Chapter 5 addresses the importance of the data association issue i n our system. A global nearest neighbor data association, which is designed par t icular ly for our applica-t ion, w i l l be introduced and explained in detail . Chapter 6 demonstrates how the representation of the targets i n the color space is improved by a kernel function. Mean-shift , known as a gradient descent algori thm, w i l l be introduced and how i t is embedded into the particle filter framework w i l l be described i n detai l . In Chapter 7, experiment results of our new tracking system w i l l be presented. W e conclude our work by summariz ing the contributions we have made and analyze strengths and weaknesses i n our system. F ina l ly , some possible extensions w i l l be discussed for the future work. 7 Chapter 2 Previous Work A typica l visual t racking system can be divided into two parts [CRM03] : one is Tar-get Representation and Localization, which has i ts root i n computer vision; the other is Filtering and Data Association, which comes from control theory. In this chapter, we present a brief int roduct ion of previous work i n bo th aspects w i t h emphasis on their strength and weakness. 2.1 Target Representation and Localization 2.1.1 Representation Target representation can be categorized into two major classes. One is for a group of generic objects, such as human faces or bodies, computer monitors, motorcycles, and so on. T h e other is for one specific target including a specific person, car, toy, bui ld ing and so on. T h e targets can be images, concrete objects or even abstract feature points. y For the first class, the representation should capture the common features among a l l the objects i n the same group. A s i t is beyond the scope of this thesis, we suggest readers refer to other related works. [ Y K A 0 2 ] is one such reference that summarizes the approaches i n face detection. However, some approaches also apply to the representation of other objects. In our system, hockey players are detected by a boosting algori thm, which is based on the work by V i o l a et a l . [VJ04]. It treats the hockey players as generic objects rather than specific objects. However, as we assume the boost ing detections already exist as the input of our t racking system, we skip the details of the detection algori thm. For the t racking system, in order to dist inguish different targets and keep their identity over t ime, i t is necessary to treat each ind iv idua l as a specific object. Therefore, we focus on the representation of specific objects i n this thesis. For specific object representation, salience and uniqueness are the most i m -portant characteristics. One popular approach is to represent an object or an image 8 by a set of local features, which are unique and distinguish the object from oth-ers. T h e Harris corners [HS88], the Kanade-Lucas-Tomasi (KLT) features [TK91] and the Scale Invariant Feature Transform (S IFT) [Low04] are three typica l ex-amples. These features can be applied to specific object recognition and tracking, image matching and rectification, 3D structure reconstruction from 2D images for scene or mot ion analysis, and even Simultaneous Localization and Mapping ( S L A M ) [DNCC01] for robotics. In our work, K L T features are used for image rectification [OLL04] . A s is mentioned i n Section 1.3.1, the locations of the hockey players i n the input video are i n temporal ly variant image coordinates because of the camera motion. Therefore it is difficult to model and predict the mot ion of the players i n the image coordinate system. M a p p i n g the locations of players from the image coordinate system to the standard hockey r ink coordinates, which is invariant over t ime, is necessary for the model l ing of the dynamics of the players. The K L T feature tracker is used to match the selected K L T features extracted from consecutive video frames so that the mapping between frames can be computed according to the geometrical properties i n the 2D plane-to-plane projection. A detailed int roduct ion of the plane-to-plane mapping and the mechanism to map between the frames and hockey rinks w i l l be given i n Chapter 4. A local feature based approach is effective for recognizing targets that have sufficient salient loca l features because pose changes and self-occlusions may result i n the absence of a significant proport ion of irreplaceable local features. However, as we mentioned i n Section 1.3.2, the resolution of a l l the players i n our appl icat ion is quite low. Therefore, it is difficult to extract sufficient local features to overcome the negative effect brought by occlusion. In addi t ion, because of the non-rigidi ty of the targets, the relative geometry between local features can not be modelled w i t h a static template. Tak ing into account a l l these factors, a global representation of the target is needed for our applicat ion to conquer the obstacles. Color-based representation is one of the most successful approaches i n the literature. B o t h Perez [ P H V G 0 2 , V D P 0 3 ] and Comanic iu [CRM03] have successfully used color histograms to represent targets for t racking. The a lgor i thm represents a specified image region w i t h a color histogram, which can be extracted from any color space. The histogram is represented by a vector and the value of each entry of the vector reflects the cumulative count of each b in i n the histogram. Because the purpose is to track a specific target i n the scene, a reference histogram of the part icular target is created at the in i t ia l iza t ion step. A metric based on Bhat tacharyya coefficient [Kai67] is adopted to evaluate the s imi lar i ty between two color histograms. There are two major weaknesses i n this color-based approach. One is the loss of spatial information because the histogram approach only takes into account the value of pixels i n the color space rather than the locat ion information. The 9 other is the interference from the background clutter and other targets. If some background regions or other targets had similar histograms as the target of interest, the tracker would probably get confused. To solve the first problem, a mult i -part color histogram representation is used to encode the spatial information. Par t ic le filters set up an excellent framework for the color-based approach to solve the second problem. Proper model l ing of the dynamics of the targets can assist i n resolving the ambiguity that arises when background clutter and similar targets overlap w i t h the target of interest. 2.1.2 Localization T h e a im of local izat ion is to detect the existence of the objects of interest and give their locat ion for each of them. Tak ing the human face detection as an example, various approaches have been applied to detect generic faces. Template based ap-proaches and learning based approaches are the two ma in strategies. Stat is t ical learning approaches, such as Neural Networks, Support Vector Machine and Boost-ing, have become popular. T h e V i o l a and Jones [VJ04] approach, wh ich makes use of AdaBoost, is known as one of the most successful classifiers and can be applied to many different classes of objects. O k u m a et al . [ O T d F + 0 4 ] incorporate the boosting detector into a particle filter based t racking system. In the new framework, which they called Boosted Particle Filter ( B P F ) , the three major functions of the boost-ing detector i n the t racking system are in i t ia l iz ing trackers, adding new targets, and improving the proposal density function for the particle filter. T h e y represent the proposal d is t r ibut ion w i t h a mixture of Gaussian model , which combines the boosting detections and the corresponding tracks. A s a result, the data association problem arises when searching for the correspondence between the boost ing detec-tions and the exist ing tracks i f mult iple targets are i n the scene. In [OTdF+04], the nearest neighbor association method is used, which only finds the local op t imal rather than the global op t imal solution. We improve their work by using a global nearest neighbor association method to reach the global op t imal observation-to-track assignment. T h e in-depth introduct ion can be found i n Chapter 5. 2.2 Filtering and Data Association Fi l t e r ing and data association have their theoretical basis i n control theory, where the signals of the targets are mostly from radar, sonar, laser and so on. In our applicat ion, visual information is the observed signal from the targets. B o t h Ba r -Shalom et a l . [BSF88] and B l a c k m a n et a l . [BP99] summarized modern techniques i n filtering and data association in their books. In this section, we break down the topic into three parts: filtering, da ta association and mutual exclusion. The last part discusses problems that arise from visual-based multi-target t racking. 10 Figure 2.1: Graph ica l model representation of the relationship between an observed variable yt and a latent variable xt 2.2.1 F i l t e r i n g In the field of object tracking, Bayesian Sequential Estimation [AM79], which is also called Bayesian Filtering, is the most widely accepted framework. In this section, we present the fundamental theory of Bayesian filtering as a preparation for its sequential Monte Car lo implementat ion which is also known as Particle Filters [ D d F G O l ] . We adopt it as the basic skeleton of our t racking system. In addit ion, we introduce some previous work that use two major implementations of Bayesian filtering: Kalman Filters and Par t ic le Fi l ters . Bayesian sequential estimation D u r i n g the t racking process, the evolution of the targets can be interpreted as a state t ransi t ion process, where each state can be represented by a vector that includes the locat ion of the target in the reference coordinate system, the scale and velocity of the target, and any values that are required to represent the complete property of the target. Here, we denote the state variable as xt € M.nx, where t 6 N indicates the t ime step and nx is the dimension of the state variable. In the real wor ld , the state variables are always difficult to observe or measure directly. For example, the a i m of a t racking system is to estimate the ground t ru th (or the state variables) of the targets through the output data of the observation or measuring devices. Therefore, such state variables are called hidden variables or latent variables. T h e observation of the hidden variables are denoted as yt € K™», where ny is the dimension of the observed variable. In most cases, Xt G M \" 1 and yt € M \" 1 ' are in different spaces, therefore nx and ny are not necessarily the same. The relation between the two variables can be represented by a generative model and the corresponding graphical model can be represented as is shown i n 2.1. The directed graph structure shows that the observed variable only depends on its corresponding latent variable and the condit ional probabi l i ty ly ing on the edge can be represented as p(yt\\xt). The Hidden Markov Model ( H M M ) [AM79] is known to be an appropriate 11 Figure 2.2: Graph ica l model representation of the H idden Markov M o d e l , where XQ is the in i t i a l state model for sequential da ta (e.g., the t racking data). It l inks the data of each t ime step, which can be represented by a single node structure as is shown i n Figure 2.1, as a chain to represent the dependence relationship between the nodes. T h e graphical model to represent such k ind of chain structure is shown i n Figure 2.2. xo is the in i t i a l state to start the whole process. In an H M M , the latent variables are discrete so that the t ransi t ion probabil i ty on the arrows between latent variables is an M x M matr ix , i f the number of possible values is M. However, i n most t racking applications, the locations of the targets are often represented i n a continuous 2D or 3D space. Therefore, the State Space M o d e l (SSM) [AM79], which generalizes H M M s to the continuous space but shares the same graphical model structure as H M M s , is used to model the t racking process i n our applicat ion. The goal of Bayesian sequential est imation is to estimate the posterior dis-t r ibu t ion pixo.tlyo-.t)1 or its marginal p(xt\\yo-.t)- The transi t ion between two consec-utive latent variables is given by Equa t ion 2.1 and the observation model is given by Equa t ion 2.2 xt = ft(xt-i,vt-i), (2.1) yt = ht(xt,ut), (2.2) where ft : Rnx x R 7 1 \" a n d ht : RUx xM.nu^>Rnv can either be linear or nonlinear, vt and Ut are sequences of i . i .d . process noise and measurement noise respectively, and nv and nu are the dimensions of the noise vectors. T h e marginal posterior d is t r ibut ion can be computed by the following two 1X0:t = (XQ, -., Xt), 1/0:1 = (VO, ••; Vt) 12 step recursion based on the Markov assumptions 2 . predict ion step: p(xt\\y0:t-i) = J' p{xt\\xt-i)p(xt-i\\yQ:t-i)dxt-i filterine- sterv n(rJ?;,,,) = ,. P^\\xt)p{xt\\yo:t-x) K2-S) nitering step. p(xt\\yo-.t) - T K ^ ^ N M i o T ^ i ) ^ where the recursion is ini t ia l ized by the prior dis t r ibut ion p(xo\\yo) = p(xo), the dynamics model p{xt\\xt~\\) is specified by Equa t ion 2.1, and the l ikel ihood model p(yt\\xt) is specified by Equa t ion 2.2. K a l m a n Fi l ter The K a l m a n filter [BSF88] is one of the solutions to the Bayesian sequential esti-mat ion i f both function /(•) i n Equa t ion 2.1 and h(-) i n Equa t ion 2.2 are linear and Gaussian. It is widely used in visual t racking [GH96, K C M 0 3 ] and robotics [Kel94]. In order to handle the nonlinear and non-Gaussian situations, extensions have been made to the standard K a l m a n filter. The Extended Kalman Filter ( E K F ) [AM79] is one of the extensions which uses a first order Taylor series expansion of the nonlinear functions i n Equa t ion 2.1 and Equa t ion 2.2 for the approximation. T h e E K F has been widely used i n the field of robotics i n the last decade [ D N D W + 0 0 ] , especially i n the sub-field of S L A M [SSC90]. The Unscented Kalman Filter ( U K F ) [vdMAdFWOO] has even higher accuracy. It approximates the posterior p(xt\\yi;t-i) directly instead of approximat ing the nonlinear functions /(•) and g(-). In partic-ular, it uses a determinist ical ly selected sample set and pass i t through the true nonlinear function to capture the true dis t r ibut ion. Because of its capabi l i ty of han-dl ing nonlinear models and its efficiency i n computat ion, i t has been successfully used i n tracking [CHR02] . Particle Fi l ter The particle filter gained its popular i ty because of its abi l i ty to handle highly nonlin-ear and non-Gaussian models i n Bayesian filtering w i t h a clear and neat numerical approximation. The key idea is to approximate the posterior d is t r ibut ion w i t h a set of randomly sampled particles that have weights associated to them. The set of particles can be represented as {x^,™^}^, where x\\^ is the support ing point, (i) u>l is the weight, Ns is the number of sample points. T h e posterior d is t r ibut ion is approximated by i teratively propagating the particles w i t h the proposal d is t r ibut ion and updat ing the weights. The details of particle filtering are covered i n Chapter 3. 2 First order Markov chain has the following independency assumption Xt-LyO:t-l\\Xt-l yt±-yo-.t-Axt which means the current state is independent of the past observations given the previous state and the current observation only depends on the current latent state. 13 Nowadays, particle filtering has been widely used i n computer vis ion and robotics. It was first introduced to visual t racking by Isard and Blake i n [IB98], where i t was known as the C O N D E N S A T I O N method. Perez et a l . [ H L C P 0 3 , V D P 0 3 ] extended the particle filter framework to track mult iple targets. O k u m a et al . [ O T d F + 0 4 ] further extended i t [VDP03] by incorporat ing a boosting detector [VJ04] into the particle filter in i t ia l iza t ion and the proposal d is t r ibut ion. B o t h [VDP03] and [OTdF+04] used a Mixture of Particle Filters ( M P F ) as the ma in framework to mainta in mult i-modali ty. The key contr ibut ion of M P F is to represent the overall posterior d is t r ibut ion w i t h a M-componen t mixture model where 2~3m=i ^ \" i , * ~ ^, M is the number of targets or particle clouds. pm represents the dis t r ibut ion of each target and evolves independently over t ime. T h e only in -teraction between targets is through 7 r m i t . Par t ic le clouds are allowed to merge or split if different targets have significant overlap or overlapping targets split into two. Part icles can be associated to different targets during merging or spl i t t ing while the overall d is t r ibut ion s t i l l remains the same. However, when new targets come into the scene, the M P F is not able to associate particles to them because the target in i t ia l iza t ion only takes place at the first step. O k u m a et a l . [OTdF+04] adopted the framework of M P F and improved it by using boosting detection to ini t ial ize tar-gets. However, because the total number of particles i n the M P F framework is fixed, part of the particles from each target are allocated to the new targets. A s a result, i f more and more new targets came into scene, the particles of each target would decrease and not be able to properly approximate the dis t r ibut ion. Another major disadvantage of the M P F framework is that it can not handle a variable number of targets and mainta in their identities at the same time, because after the merge and split , there must be at least one target that loses its identity. Therefore, we track each target w i t h a set of particles independently instead of the M P F framework i n our system. A global mechanism is used to handle the b i r th and death of targets and the data association issue. 2.2.2 Data association D a t a association arises from the task of multi-target t racking. T h e a im of data asso-ciat ion is to find the correct correspondence among mult iple observations {ytn)m=i and the existing tracks {x™}^=l. M and N can be different because there might be outliers or mis-detections. Standard Bayesian filtering does not handle observation-to-track assignments. However, the color-based particle filter tracker [VDP03] solves the problem impl ic i t ly because, according to Equa t ion 3.9, the weight of each par-ticle is updated only by evaluating the l ikel ihood of color histogram at the locat ion M (2.4) m = l 14 specified by the particle. D a t a association becomes important i n our appl icat ion again because we adopted the idea of boosted particle filter [ O T d F + 0 4 ] . B P F uses a mixture of Gaussian that combines the boosting detections w i t h the existing tracks to improve the proposal d is t r ibut ion shown in Equa t ion 3.10. A s there are mult iple boosting detections and mult iple tracks at each t ime step, the association problem arises again. G a t i n g is one of the basic techniques to eliminate very unl ikely observation-to-track pairs. It defines a certain range around each existing track so that ob-servations that fall outside the ranges are not considered to be associated to them. However, gating is s t i l l a prel iminary procedure. M o r e sophisticated data association techniques are required. Linear optimization D a t a association can been seen as an auction process where the tracks b id for the observations, or vice versa. E a c h observation-to-track pair w i l l be associated w i t h a corresponding cost or profit. The goal is to optimize a linear objective function under several linear constraints. Therefore, da ta association can be considered as a linear opt imizat ion problem. B lackman and Popo l i give a detailed in t roduct ion of various opt imizat ion solutions i n their book [BP99]. T h e auction a lgor i thm [BP99] is found to be the most efficient method so far. However, the constraints i t requires do not satisfy the requirements of our applicat ion. In Chapter 5, we model our applicat ion w i t h a set of new constraints and introduce the solution to the new linear opt imizat ion setting. We adopt this approach because of three reasons. F i rs t ly , i t can handle the b i r th and death of targets, which is cr i t ica l i n our appl icat ion. Secondly, the a lgor i thm is easy to understand and implement. F ina l ly , it requires very l i t t le computat ion and is extremely fast. T h e major drawback of this approach is its inabi l i ty to handle error association. A s it is not a probabil ist ic method, there is only one op t imal solution at each step. A l so , because the associations at each step are independent, error association can not be corrected at later stages. Joint probabil i ty data association filter The Joint P robab i l i ty D a t a Associat ion F i l t e r ( J P D A F ) [BSF88] is one of the most widely accepted probabil ist ic approach to the data association problem. It can either be combined w i t h K a l m a n filters [KCM03] or w i t h particle filters [ S B F C 0 1 , M B 0 0 ] . The state space of J P D A F is a concatenation of a l l the states of the existing tracks: Xt = {^i,t, %N,t}- The observed variable is also a concatenation of a l l measurements: Yt = {yi,t, •••,yM,t}- B y introducing an association indicator 6, the 15 joint posterior dis t r ibut ion can be wri t ten as P(Xt\\Y1:t) oc P(XuO\\Y1:t)dO P(Yt\\Xt,9)P(Xue\\Y1:t-1)d0 P(Yt\\Xt,9)P(e\\Xt,Y1:t^)P(Xt\\Yl:t^)de (2.5) It should be noted that, in most cases, the association is discrete and finite. There-fore, the integral can be replaced w i t h a sum over a l l possible associations. W i t h this association indicator 9, a l l the terms, especially the l ikel ihood term, can be easily factorized and further computed conditioned on the association setting. A problem arises i f the total number of observations and tracks is large, because it results i n a huge amount of enumeration and computat ion. It is even worse i f i t is combined w i t h particle filtering because sampling i n a h igh dimension space w i l l make the efficiency of sequential Monte Car lo approximat ion decrease drastically. Another problem w i t h J P D A F is that i t does not handle the b i r t h and death of targets explicit ly, because the dimension of the joint state space is determined at the in i t ia l iza t ion step by the to ta l number of targets. Therefore, it does not satisfy the requirement of our applicat ion. Multiple hypothesis tracking Multiple Hypothesis Tracking ( M H T ) is a deferred decision logic i n which alterna-tive data association hypotheses are formed whenever there are uncertainties about observation-to-track assignments. The algori thm is bui l t on the assumption that uncertainties can be resolved at subsequent stages. M H T was first called Reid ' s algori thm. A n efficient implementat ion of Reid 's approach is later presented by C o x and Hingoran i [CH96]. M u l t i p l e hypotheses are maintained at each step. N e w hypotheses are generated according to the previous hypotheses. E a c h hypothesis i n the previous stage w i l l generate one or more children. Therefore, a pruning tech-nique is required so that only the N best hypotheses w i l l be selected. Otherwise, after several steps, the number of total hypotheses w i l l easily increase enormously. The selection process is s imilar to the linear opt imizat ion mentioned in the previous section, however, instead of maintaining the opt imal solution only, i t finds the N best solutions. A l l the hypotheses are evaluated based on their scores, which is the sum of a l l the component observation-to-tracks association. where Lxt is the score of each observation-to-track assignment and Tj are tracks that belong to hypothesis Hj. It can be evaluated using the l ikel ihood p(ym\\xn), or any other variations. (2.6) 16 Cox ' s implementat ion of Reid 's a lgor i thm [Rei79] is a track oriented solu-t ion to M H T . Instead of maintaining a set of hypotheses, i t creates a set of tracks. Hypotheses are generated at each step after the tracks are created. P r u n i n g is performed both i n the creation of tracks and hypotheses. Detai ls can be found i n [CH96]. T h i s implementat ion is much more efficient because the track-oriented ap-proach does not generate redundant tracks while the hypothesis oriented approach has many redundant observation-to-track assignments i n different hypothesis. How-ever, it is s t i l l not quite suitable for particle filter trackers. A s uncertainty increases, more and more possible potential tracks need to be created. E v e n i f there are at most 15 targets on the r ink i n our applicat ion, the to ta l number of tracks grows exponentially w i t h the increasing possible associations. A s each created new track needs a set of particles to support the evolution of track, the required storage space for the particles grows exponentially as well . A s a result, the computat ion for the propagation of a l l the particles becomes extremely expensive. Therefore, s imply combining conventional M H T w i t h particle filters is not pract ical . T h e experiments show that the linear opt imiza t ion approach satisfies our accuracy requirement for data association. Therefore, the computat ional ly more expensive M H T approach is not used. Monte Carlo method for data association Unl ike the three approaches mentioned above, Monte Car lo methods t ry to sample i n the association space rather than enumerating a l l possible assignments. Rao-Blackwellized Particle Filtering ( R B P F ) is one of the Monte Car lo methods that are widely used i n the literature [SVL04, SFH03] . The key idea is to sample for the da ta association while the remaining t racking part is computed i n an analy t ica l way, for example, the K a l m a n filter. A n in-depth explanation of R B P F can be found in [DdFMROO]. Because the Monte Car lo methods sample in the state space, the \"curse of dimensionali ty\" is the major problem if the to ta l number of targets is large. Perez et al . t r ied to solve the problem by developing three algorithms: Monte Carlo Joint Probability Data Association Filter ( M C - J P D A F ) , Sequential Sampling Par-ticle Filter ( S S P F ) and Independent Partition Particle Filter ( I P P F ) i n [VGP05] . T h e fundamental motivat ion underlying these three algorithms is to factorize the joint state space into ind iv idua l components. However, in visual t racking systems, detections at each stage can be consid-ered as mult iple signals generated by a single sensor. Therefore, the relationship between signals and sensors i n visual t racking is not as significant as it is i n robotics and other applications. Consequently, there w i l l not be significant gain to use a state space model to represent the association and use Monte Car lo method for approx-imat ion. In addi t ion, as we use the particle filter as basic framework for t racking, 17 sampling i n both the filtering part and the data association part w i l l decrease the efficiency and effectiveness of the particle filter. 2.2.3 Mutual exclusion M u t u a l exclusion is important i n multi-target t racking, especially when different tar-gets overlap and occlude each other. Some exclusion rules are the basic constraints i n da ta association. For example, one signal can hot be generated.by mult iple targets and one target can generate mult iple signals. W h e n targets occlude each other, mu-tua l exclusion plays a cr i t ica l role for the evaluation of l ikel ihood function. Sul l ivan et a l . [SBIM01] address the global op t imal solution to associate pixels to objects and background i n a image w i t h a Bays ian approach. It is important for mul t i -target t racking because it makes the l ikel ihood evaluation of the targets accurate. M a c C o r m i c k and Blake introduced a probabil ist ic exclusion principle for t racking mult iple objects i n [MBOO]. M a c C o r m i c k also introduced a Bayesian mult iple-blob tracker [IM01], which expl ic i t ly address the issue of mutual exclusion i n the 3D space. W u et al . [WYH03] expl ic i t ly model overlapping to handle occlusion. How-ever, a l l the approaches in [MBOO, I M 0 1 , W Y H 0 3 ] require explici t model l ing of the object shape. A s is mentioned i n Chapter 1, the non-rigidity of the hockey players in our appl icat ion makes the shape model l ing difficult. W h i l e the mutua l exclusion is difficult to achieve, mult iple cues are used to resolve the ambiguity dur ing occlusion. W u and Huang [WH01] introduced a co-inference tracking algori thm to ut i l ize bo th color and shape cues to track single target. Perez et a l . [PVB04] fuse color, mot ion and sound cues to track mult iple objects w i t h a particle filter. In our applicat ion, according to the law of inert ia, it is easier to use the physical model to predict the mot ion of hockey players. Therefore, better model l ing of the motion of the targets w i l l help resolve the ambiguity. T h e stabil ized tra-jectories of the targets are important for accurate model l ing of dynamics. Shan et al . [SWTO04] have embedded Mean-Shift [Che95, C M 0 2 ] into the particle filter for hand tracking. The i r results show that the mean-shift improves the efficiency of the sampling i n particle filter. The mean-shift a lgori thm is also used i n our system to achieve the s tabi l izat ion by biasing a l l particles to the true locat ion of the target. 2.3 Summary T h e survey i n the literature shows that there is no existing system that can satisfy a l l the requirements of our applicat ion. In order to achieve the goal of our robust t racking system, we propose four major extensions on the previous work. F i rs t ly , a better model l ing of target dynamics is used to improve the dynamics model of the targets. Secondly, a global nearest neighbor data association method is developed to associate boosting detections w i t h existing tracks for a better proposal d is t r ibut ion 18 in the particle filtering. Thirdly, the likelihood model is improved by a isotropic kernel brought by the mean-shift algorithm. Finally, mean shift is embedded into the particle filter framework to stabilize the tracking trajectory by biasing each particle. This is critical to both the dynamics model and likelihood model. 19 Chapter 3 Particle Filter for Multi-target Tracking A s is mentioned in the previous chapter, particle filtering performs well in Bayesian sequential est imation w i t h non-linear, non-Gaussian target dynamics and observa-t ion model . In our applicat ion, the fast mot ion of hockey players is highly non-linear and non-Gaussian. In addit ion, because we employ the observation model i n [ P H V G 0 2 , V D P 0 3 , OTdF+04] , which is an exponential function based o n the Bhat tacharyya coefficient between two color histograms, the l ikel ihood function is non-linear and non-Gaussian as well . Moreover, particle filtering has been success-fully applied to multi-target t racking [ V D P 0 3 , H L C P 0 3 ] . Therefore, the particle filter framework is the ideal model to be the basic skeleton of our t racking system. T h i s chapter covers the basic theory of generic particle filters and the boosted par-ticle filters. How the boosted particle filter is modified to fit in our multi-target t racking system w i l l also be introduced. 3.1 Particle Filter A s is shown i n Equa t ion 2.3, in order to continue the recursion of the Bayesian sequential estimation, the integration i n both the predict ion and filtering steps need to be computed at every step. If bo th the dynamics model and the l ikel ihood model are linear and Gaussian, i t becomes the well known analyt ical K a l m a n F i l t e r . T h e details of K a l m a n F i l t e r ing are beyond the scope of this thesis. A n in depth introduc-t ion can be found i n [BSF88]. For the analyt ical ly intractable cases, where bo th the dynamics model and l ikel ihood model are non-linear and non-Gaussian, approxima-t ion techniques are required. The most popular numerical approximat ion technique for this problem is the Sequential Monte Ca r lo ( S M C ) method [ D d F G O l ] . It is also known as Par t ic le F i l t e r ing [ A M G C 0 1 ] or C O N D E N S A T I O N method [IB98]. 20 3.1.1 Monte Carlo simulation The basic idea of Monte Car lo s imulat ion is to represent the whole posterior dis-t r ibu t ion by a set of random samples w i t h associated weights. For example, the posterior density function can be approximated as where Ns is the number of samples, {x\\ ,wt Ji=i * s the s e * of particles drawn from the posterior d is t r ibut ion and their weights. Here the weights of a l l particles are equal to because the samples are direct ly drawn from the true dis t r ibut ion and normalized. Situations where the weights are not equal w i l l be introduced later. <5(-) denotes the Di rac delta function Therefore, w i t h the Monte Car lo simulation, any integral can be approximated by a discrete form Accord ing to the law of large numbers, when the number of samples approaches infinity, the approximations of the posterior d is t r ibut ion and the integral are equiv-alent to the true dis t r ibut ion and integral. Similar ly, any query estimation computed from the samples and weights approaches the op t imal Bayesian sequential estimate. 3.1.2 Sequential importance sampling A s it is often impossible to directly sample from the posterior density function, the particles are sampled from a known and easy-to-sample proposal d is t r ibut ion q(xt\\yo-.t) instead. Then , Equa t ion 3.1 can be rewrit ten as (3.1) (3.2) E ( / ( x t ) ) = Jf(xt)p(xt\\y0:t)dxt « ±- JT f(x?). (3.3) 3 (3.4) i=i where (i) _ p(4 \\yo-t) (3.5) i(xt)\\yo-.t) and 3 (3.6) 21 The new sampling strategy is called Importance Sampling and the weights are called Importance Weights. A s is shown i n Equa t ion 3.5, i f the proposal d is t r ibut ion is the same as the posterior, the weights reduce to the same as i n Equa t ion 3.1. B y replacing the integral of the predict ion step i n Equa t ion 2.3 w i t h the importance sampling, the one step ahead predict ion can be wri t ten as It should be noted that the proposal density function just samples the current state from the previous one rather than drawing samples for the whole trajectory of the states because we assume the current state only depends on the previous one. One step further, the current posterior dis t r ibut ion can be updated as Here the te rm fp(yt\\xt)p(%t\\yo:t-i)dxt is omit ted for convenience because i t is just a normalizat ion constant. Eventually, the est imation evolves by sequentially updat ing the weights for each support ing particle and the procedure is named Sequential Importance Sampling (SIS). 3.1.3 R e s a m p l i n g One inherent problem in the sequential importance sampling method is the degen-eracy phenomenon. It is observed that after several iterations, a l l but one particle have close to zero weights, which means the majori ty of the comput ing is wasted on those particles w i t h negligible weights. Even worse, it has been proved that the problem can not be avoided because the variance of the weights increases over t ime. Therefore, a resampling technique is required to ignore particles w i t h very low weights and concentrate on the more promising ones, which take an important role i n the approximation. One of the most widely used resampling schemes, which is also used i n our work, is named the Sampling-importance resampling (SIR) technique. The basic idea is to map a set of weighted particles {x[1^, wf^ } ^ 1 to a set of equally weighted samples {xf\\ - / V \" 1 } ^ through a discrete cumulative dis t r ibut ion of the original sample set. A s a result, the original particles that have higher weights would have more replicates while those w i t h smal l weights would have much less or even no replicates. Detai ls can be found in [Gor94]. The whole process of particle filtering is shown i n Figure 3.1 [vdMAdFWOO] . (3.7) (3.8) where (3.9) 22 Figure 3.1: The S I R process starts at t ime t — 1 w i t h a set of equally weighted par-ticles f x f _ | , i V - 1 } ^ , which represent the dis t r ibut ion p(xt-i\\yo-.t-2) • Importance weights are updated w i t h information at t ime t — 1 to represent p(xt-i\\yo-.t-i) w i t h {x[l}l,w[l}1}fL1. Resampl ing is performed to make it a equallly weighted particle set {^t-i' -^ _ 1 } i= i a g a i n D u t s t i l l represent the same dis t r ibut ion. F ina l ly , particles are propogated to {x[l\\ J V - 1 } ^ at t ime t by the proposal d is t r ibut ion to represent p(xt\\yo:t-i) and iterations go forward [vdMAdFWOO]. 3.2 Boosted Particle Filter In the standard particle filter, there are two major problems that need to be solved: the in i t ia l iza t ion of the particle filter and proper design of the importance pro-posal. The Boosted Part ic le F i l t e r ( B P F ) , which is first introduced by O k u m a k et a l . [ O T d F + 0 4 ] , properly solves the two problems by incorporat ing the A d a B o o s t detection a lgor i thm by V i o l a et a l . [VJ04]. Mos t t racking systems that use particle filters (i.e., [ P H V G 0 2 , V D P 0 3 , H L C P 0 3 ] ) ini t ial ize the particle filters by hand labell ing to set the value for P(XQ). The boosted particle filter uses A d a B o o s t detections for automatic in i t ia l iza t ion of the start ing states. 23 m A Figure 3.2: The final proposal is i n the form of a mixture of Gaussian which combines the prior and the boosting detection [ O T d F + 0 4 ] . A well designed importance proposal d is t r ibut ion should be as close as pos-sible to the posterior dis t r ibut ion. The proposal d is t r ibut ion should be able to shift the particles to the regions w i t h h igh l ikel ihood i f the mode of the prior d is t r ibut ion is far away from the mode of l ikel ihood dis t r ibut ion. The boosted particle filter assumes Gaussian dis t r ibut ion for the dynamics and superimposes Gaussian dis t r i -butions centered at the Adaboost detection modes as well . Then , it combines bo th parts to construct a proposal d is t r ibut ion w i t h a mixture of Gaussian model: q*B(xt\\xt-i,y0:t) = aqada(xt\\yt) + (1 - a)p(xt\\xt^), (3.10) where a is the parameter which is dynamical ly updated according to the overlap between the Gaussian dis t r ibut ion of boosting detection and the dynamics prior. If there is no overlap, a is set to zero so that the boosting detection has no effect on the proposal d is t r ibut ion. T h e value of a can be fine tuned through experiments. In our applicat ion, it is set to 0.5. The mixture of Gaussians proposal can be best described by Figure 3.2 [OTdF+04]. 3.3 Our Improvements Part ic le filtering has proved i n [IB98, H L C P 0 3 ] to be successful i n t racking single or mult iple targets w i t h nonlinear and non-Gaussian dynamics and l ikel ihood models. In the original work of boosted particle filtering, a mixture of particle filters [VDP03] 24 was adopted and augmented by the A d a B o o s t detection to automatical ly track a variable number of targets. However, as i n the M P F , the B P F only uses a fixed number of particles for a l l the targets. Th i s causes cr i t ica l problems when the to ta l number of targets keeps increasing because new tracks have to steal particles from existing tracks so that the number of remaining particles for each target decreases un t i l i t is insufficient to correctly approximate the dis t r ibut ion. Moreover, when occlusion happens, the B P F merges overlapping particle clouds and splits them when they fall apart. Obviously, such a mechanism w i l l lose the unique identi ty of each target after the occlusion. Therefore, our work only retains the basic idea of the original boosted particle filter for ind iv idua l target and maintains mul t ip le boosted particle filters to track mult iple targets. Therefore, i t solves both problems stated above. O n one hand, i n the new structure, new tracks are created by assigning a new particle set for each of them rather than stealing particles from exist ing targets. Therefore, the new structure can dynamical ly handle an increasing number of targets without affecting the approximat ion accuracy of the exist ing tracks. O n the other hand, because a l l the tracks are maintained and removed independently by each particle set, there w i l l be no merge or split dur ing occlusion. B o t h of the new features are cr i t ica l i n maintaining correct identities of a l l targets i n our multi-target t racking task. 25 Chapter 4 Target Dynamics Model l ing A s is mentioned i n Section 1.3.5, i n visual t racking systems, mutua l occlusion be-tween different targets is one of the major factors that cause the tracker to lose the identity of different targets. Accurate model l ing of the target dynamics can improve the predict ion of the locations of the targets while visual support is insufficient due to occlusion. However, because of the camera mot ion i n our applicat ion, the i m -age coordinate system changes over t ime. Therefore, target mot ion model l ing and predict ion i n the image coordinates are difficult. Zhao et a l . [ZN04] uses camera cal ibrat ion to provide a transformation from the predefined ground plane in the scene to the image. It helps transform the invariant 3D model of the target to the 2D model i n the image to assist human segmentation and further mot ion modell ing. W e adopt the approach by O k u m a et al . [OLL04] to map the locations of the tar-gets from the image coordinates to the standard r ink coordinate system which is consistent over t ime. Therefore, according to the physical law of inert ia, the mo-tions of the players i n hockey games are easier to predict w i t h a constant velocity autoregressive model . 4 . 1 Rectification In this section, we w i l l give a brief int roduct ion to the concept of homography and how it acts as an invertible deterministic mapping between the image coordinates and the standard r ink coordinates. It is important that homography is an invertible transformation because mapping back and forth between the r ink and the image coordinate system is frequent during the tracking process. 4.1.1 Homography Homography, which is also known as projectivity or collineation, is defined by Har t ley and Zisserman i n [HZOO] as follows: Definition. A projectivity is an invertible mapping h from P 2 to itself such that 26 three points x i , X 2 and X 3 lie on the same line i f and only i f / i ( x i ) , / i ( x 2 ) and / i ( x 3 ) also lie o n the same line. Images recorded by cameras are 2D projections of the real wor ld through lenses. For any plane i n the world, its images from a camera, which can pan, t i l t , zoom or even move, are exactly the projection described above because any line i n the wor ld is s t i l l a line i n the images as long as there is no noticeable lens dis tor t ion. A s the hockey players are always moving i n the plane formed by the hockey rink, they are i n the same plane bo th i n the real world and the image space. Therefore, it is possible to project their locations between the two planes. The mathematical way to accomplish such projection is possible because of the the following theorem: Theorem. A mapping h: P 2 —> P 2 is a projectivity if and only if there exists a non-singular 3 x 3 matrix H such that for any point in P 2 represented by a vector x it is true that / i (x) = H x . Here, the vector x can be wri t ten as (x, y, w)T, which is a homogenous repre-sentation of a point i n the plane. The corresponding inhomogeneous representation of the point i n Eucl idean 2 D space is wr i t t en as (x/w,y/w)T, where x/w and y/w are the x- and y-coordinates of the point. Accord ing to the theorem, given the coordi-nates of a pair of matching points {(x, y), (x', y')}, i n two planes, the transformation can be wr i t ten as x ' = H x (4.1) where x ' , x and H are defined as: / x' / X \\ / hu hn hi3 \\ x ' = y' X = y H = h23 1 ) V 1 J h3i h32 h33 1 It needs to be noted that for a l l point pairs { x ^ , X j } from the two planes, Equa t ion 4.1 holds for the same mat r ix H . In order to compute H , at least three points are needed because there are 9 unknown entries. However, the homography between two video frames can be defined w i t h less parameters i f the camera rotates about its opt ical center, which is the case i n our applicat ion. The camera at each t ime step can be parameterized by 3 rotat ion angles 6 = {61,62,63] w i t h focal length / . Then , the pairwise homography H j j from frame j to frame i can be parameterized by the parameters of the camera in the two time steps, which is 8 i n a l l . Detai ls of how to form the homography w i t h the 8 parameters can be found i n [BL03]. W i t h this technique, it is possible to bui ld a whole panorama for the scene, including the audience, from a sequence of video frames. It allows many other techniques to assist i n improving the t racking result. For example, background subtract ion can be performed to eliminate interference from background clutter. 27 In [OLL04] , a set of point pairs are labelled to compute the in i t i a l video frame to r ink homography HQV. After the ini t ia l izat ion, the frame to r ink homography at each step is automatical ly computed i n a recursive way where H{L\\ t is the transformation mat r ix from the previous frame to the current one. Different from the ini t ia l izat ion, the point pairs are automatical ly selected by using the K L T feature tracker to match the extracted K L T features from consecutive video frames. In order to improve the accuracy of the computed homography, a template-based correction step is added to further fit the predefined key points w i t h the standard r ink model . Therefore, the a lgori thm can automatical ly and accurately map a long sequence of video frames to the r ink. T h e whole process is explained i n detail i n [OLL04] . Not ice that this is an online algori thm, which perfectly satisfies the online requirement of the particle filter tracker. Another important property of the ho-mography is that i t is an invertible transformation, which makes it easy to map the points i n the images and r ink in bo th directions i n the following way w i t h the same transformation mat r ix H+R. where x £ and x £ are the points i n the r ink and video frames respectively. Figure 4.1 shows how the video frames are mapped to the standard r ink w i t h the homography. 4.1.2 Multi-layer state space model W i t h the frame to r ink homography, it is possible to perform the particle filter tracker only i n the r ink coordinate system. However, the locations of the targets i n the image coordinates are s t i l l required because, while evaluating the l ikel ihood function i n Equa t ion 3.9, the locations of the particles in the image coordinates are s t i l l needed to specify the regions where the color histograms are extracted. There-fore, one more layer, which represents the hidden states i n the image coordinates, is added into the original state space model. The graphical model of the new state space model is shown in Figure 4.2. Accord ing to the graphical model , there are two layers of latent variables. O n the top level, the variables contain locations of the targets i n the r ink coordinates. T h e y are the true state variables i n the particle filtering process. The middle layer states contain the locations of the targets in the image coordinates. The locations in the states between the top two layers have deterministic mapping function which is the same as Equa t ion 4.4. It should be noted that a l l state variables i n the middle - l (4.3) frame to rink: x £ = H{rx( rink to frame: x f = H(T (4.4) 28 Figure 4.1: Th i s shows how video frames are mapped to the r ink. The top left image is the input video frame; the top right is the standard r ink model; the bo t tom image shows how the image looks after being mapped onto the r ink. layer do not have any arrows between them, which means they are independent of each other. The state dynamics only applies to the top level state variables. The middle level acts as a communicat ion tunnel to bridge the true latent variable and the observed variables, which are i n the bo t tom layer. In our applicat ion, there are two kinds of observations. One is the boosting detection and the other is the color histogram l ikel ihood. A t each t ime step, boosting detections provide a set of detected targets w i th their locations in the video frame and their scales. Therefore, each target can be represented by a bounding box w i t h a predefined edge length ratio. A s is shown i n the bo t tom image i n Figure 4.1, players are significantly distorted while only the feet of the players are on the right posit ion relative to the r ink because homography only guarantees accurate mapping between points from two planes. Here, we assume that the bounding box fits the target t ightly. Therefore, we represent the locat ion of each player i n the image, x£ v wi th the middle point on the bo t tom edge of the bounding box as is shown i n Figure 4.3. O n l y these points w i l l be mapped to the r ink. A t the in i t ia l iza t ion step, these transformed locations are used to create particle filter trackers. A t later stages, they are used for the da ta association between boosting detections and the tracks or the creation of incoming new tracks. There are also situations that require mapping 2 9 Figure 4.2: Graph ica l model representation of the three-layer structure of the state space model. locations on the r ink back to the image coordinates. W h e n the l ikel ihood function needs to be evaluated, the positions of particles i n the r ink are mapped back to the video frames. T h e y are assumed to be the positions of the players. Therefore, w i t h this locat ion value and the scale, a rectangular bounding box can be created. It is just the inverse of the operations w i t h which the boosting detections are mapped to the rinks. Color histograms are extracted from the regions specified by the bounding boxes and compared w i t h the reference color histograms, which are created at the in i t ia l iza t ion step. 4.2 Autoregressive Dynamics Model A n autoregressive process is a t ime series model l ing strategy which takes into ac-count the historical da ta to predict the current state value. In this model , the current state x 4 only depends on the previous states w i t h a deterministic mapping function and a stochastic disturbance. If the deterministic mapping is linear and the current state only depends on the previous p states, the model can be wri t ten as p x t = ^2 OiXt-i + Zt (4.5) i = i where {o:i}f_ 1 are the autoregression coefficients, and Zt is the stochastic disturbance which is normal ly a white noise process. The autoregression coefficients can either be learned or predefined i n an ad-hoc way. N o r t h et a l . [NBIROO] t r ied to determine the coefficients by learning the 30 Figure 4.3: A l l the black rectangles are the boosting detections. B lack dots on the bo t tom edge of each rectangle represent the positions of the players on the r ink. patterns of different mot ion classes through a t ra ining process where perfect t racking could be achieved. O n the other hand, Perez et a l . [PHVG02] used a second order constant velocity model to predefine the coefficients. However, this only works if the camera is stationary. In the work by O k u m a et a l . [ O T d F + 0 4 ] , the camera mot ion made it impossible to separate the motion of players. Therefore, it is difficult to predict the locat ion of players w i t h the constant velocity autoregressive model in the image coordinates. B y mapping the locations of players from the video frame coordinate system to the standard r ink coordinates, the motions of the players on the r ink are separated from the camera motion. Therefore, no matter how the camera pans, t i l ts or zooms, the motions of the players on the r ink coordinate system w i l l not be affected. In hockey games, because of the effect of inertia, a constant velocity model is more suitable to model the mot ion of the players. It is best described by the following second order autoregressive model xt = Axt-1 + Bxt-2 + CAr(0,i:) (4.6) where {A, B, C} are the autoregression coefficients, A/ r (0, S ) is a Gaussian noise w i t h zero mean and standard deviat ion of 1. Unl ike [ P H V G 0 2 , O T d F + 0 4 ] , state x< i n our model only contains the locat ion of the player i n the x— and y—coordinates i n the r ink rather than including the scale of player as well because experiments show that incorrect sampling of player scales is more difficult to undo at later stages of t racking in this constant velocity model . Therefore, we use mean shift, which w i l l be introduced i n Chapter 6, instead to find the most appropriate scale at each step. Therefore, the states are defined as 31 x t = (xt,yt)T and the coefficients are defined as -(;S)*--(\".,-O1)\"-(JS) <«> where A and £? are standard for all second order constant velocity model, C can be fine tuned through experiment because it defines the standard deviation of the noise. In our application, it is defined to allow 3-pixel deviation on both x— and y—coordinates of the rink. 32 Chapter 5 Data Association Chapter 2 gives a general introduct ion to data association and compares several typica l approaches i n the literature. In this chapter, we discuss i n detai l the gating technique and the solution to the global nearest neighbor ( G N N ) approach that solves the association problem in the applicat ion we investigate. In our applicat ion, the data association does not take place i n a conventional manner because there are two different kinds of observations i n the framework. One is the color histogram and the other is the boosting detections. In a standard Bayesian filtering framework, da ta association is performed to pair the observations and tracks for the evaluation of the l ikel ihood function p(y t\" |x\") . The particle filter framework i n our appl icat ion handles this level of data association in an impl ic i t way because color histograms are extracted from the regions specified by particles. It means observations and tracks are no longer independent because observations are condit ioned on the particles. However, i t is s t i l l necessary to determine the association of the pixels w i t h i n the specified region because the region might include pixels that do not belong to the corresponding target. However, i n practice, as we do not have any concrete shape model of the targets, it is impossible to accomplish this level of data association. Because the boosting detections are used to improve the proposal d is t r ibut ion i n particle filters as i n shown in Equa t ion 3.10, we perform data association at this level to assign boosting detections to the existing tracks. Therefore, the proposal d is t r ibut ion can lead the sampling to more promising regions. 5.1 Gating G a t i n g is a technique to eliminate very unlikely observation-to-track assignments. A t each t ime step dur ing the tracking process, a certain region, also named as gate, is defined around the predicted posit ion of each target. Observations that fall w i th in the region are considered as candidates to update the track. Figure 5.1 depicts the concept of gating. A s is shown i n the figure, it is very l ikely that mult iple 33 Gate \\ 03 • Gate Gate Figure 5.1: T I , T 2 , T 3 are predicted target positions; O l , 0 2 , 0 3 , 0 4 are observa-tions. 0 2 is assigned to T I ; 0 1 is a candidate of T I , T 2 and T 3 ; 0 4 is a candidate of T 2 and T 3 ; 0 3 is not assigned to any existing tracks so that i t is a candidate of being a new target. observations fall i n one gate or one observation falls i n mult iple gates. In addi t ion, there are observations that do not belong to any gate. It remains to be decided later if they are outliers or new tracks. Therefore, gating is just a pre l iminary process to reduce the computat ion burden for further data association processing. The sizes of the gates depend on the mot ion types of targets and the property of sensors. Because the camera is the only sensor for a l l targets, the effect from the sensor is homogeneous for a l l targets. Therefore, here we only consider the mot ion of targets. Targets that have high speed must have larger potential gates than slower targets. If the t racking were performed i n image coordinates, the sizes of the gates of the targets could depend on the relative distances between targets and the camera. For example, players that are closer to the camera apparently have higher speed than those far away even i f they are of the same speed i n the real world . A s a result, bo th target mot ion and camera motion affect the size of gate for each target, which makes the problem difficult. However, because of the three-layer state space model i n our applicat ion, particle filtering is performed i n the hockey r ink coordinate system after compensating for camera mot ion. Consequently, the sizes of gates only depend on the mot ion of targets on the r ink regardless of their locations because their speeds are evaluated homogeneously over the whole r ink rather than being evaluated w i t h respect to the camera posi t ion. There are different shapes of gates including rectangles and ellipsoids. It is reasonable to use ellipsoids i n our appl icat ion to reflect the direction and magnitude of the velocity of target as is shown i n Figure 5.2. However, as the t racking results 34 Ellipsoid gates Figure 5.2: T l , T 2 are predicted positions of targets; V I , V 2 are vectors that indicate the velocities of the targets. at each step might drift, the direction and magnitude computed from the previous steps could be very noisy and unstable. In addi t ion, boosting detections also might be off the targets w i t h several pixels. A s a result, el l ipsoidal gates might eliminate too many- potent ia l observation-to-track pairs that are correct. Therefore, i n our applicat ion, we adopt the simplest circular gates which place equal deviat ion on al l directions. The radi i of the circles are determined by the m a x i m u m possible magnitude of target speed, which is consistent for a l l targets. 5.2 Linear Optimization In [ O T d F + 0 4 ] , only nearest neighbor association is used local ly for the association, which means each track picks up the nearest boosting detection for update. Figure 5.1 shows that one observation can be i n mult iple gates. A s a result, the same observation might be associated to mult iple targets for update, which violates the basic constraint of data association that one observation can only be generated by one target. The global nearest neighbor method is one of the approaches to solve this problem. It maintains the single op t imal solution at each step. T h e assignment problem can be best represented by an assignment mat r ix shown i n Table 5.1. E a c h entry i n the table is the cost or gain of pair ing the corresponding track and observation. Assignments that are forbidden by gating are denoted by x i n the corresponding entries. Observations that are forbidden by the gat ing to be associated to any track are considered as a new track. Such assignment problems stem from economic theory and auct ion theory as well . Examples are assigning personnel to jobs and assigning goods to bidders. The objective of such problems is to minimize the cost or maximize the gain subject to a set of constraints. G i v e n the assignment mat r ix shown i n Table 5.1, the objective is to find a set X = {x{j} that maximize or minimize the objective function C = 35 Table 5.1: Assignment mat r ix for the assignment problem shown i n Figure 5.1 Observations Tracks 0 1 0 2 0 3 0 4 T I a n 0-12 X X T 2 « 2 1 X X T 3 ^31 X X « 3 4 2 ^ = 1 2~3j=i aijxij subject to the following constrains: where are b inary indicators _ J 1 i f observation j is assigned to track i 1 0 otherwise Linear programming was in i t ia l ly used to solve this problem. Later on, it was found that the auct ion algori thm [BP99] is the most efficient method so far to reach the op t imal solution or sub-optimal one without any pract ical difference. In the auction algori thm, normally, trackers are goods and observations are bidders. T h e objective is to maximize the global gain to satisfy a l l bidders. N o matter whether the values of a l l entries i n the assignment mat r ix are for maximiz ing gain or min imiz ing cost, it is always easy to convert them to satisfy a maximiza t ion problem by adding negative signs to a l l of them. T h e auct ion a lgor i thm is an iterative process that consists of a b idding phase and an assignment phase. Iterations keep going unt i l a l l bidders are satisfied. The detailed algori thm on a square association matr ix , which has the same number of targets as observations, is available i n [BP99]. T h e extended version of auction algori thm [Ber91] is able to solve the rect-angular mat r ix problems w i t h loosened constraints E i L i * y >l,Vi 1 ' These constraints allow one track to have mult iple observations, which often happens i n practice. However, i n our application, it is very l ikely that some tracks do not have any observation due to the mis-detection of the boosting detector. Therefore, even if there are some observations wi th in the gate of that track, it is s t i l l possible that none of the observations belongs to the track. Hence, the constraints are of the following form: E £ i * « > o , V i 1 j 36 A s a matter of fact, because xij are binary, the second constraint i n Equa t ion 5.4 is always satisfied. T o simplify the problem, the assignment mat r ix only has existing tracks and the observations that fall i n at least one gate. Observations that are not i n any gate are used to create new tracks separately. A s a result, the problem is significantly simplified. In our applicat ion, the values of a l l the entries i n the assignment mat r ix are defined to be the distance between the observations and tracks. It is mentioned i n [BP99] that the value of each entry can include bo th kinematic and attr ibute information to fuse different types of sensor signals. Color is one important potential at t r ibute i n pur applicat ion. However, because the reference color histogram of each target is created only at the first step and not updated adaptively over t ime, there are some t ime steps at which the current color histogram of track i is more similar to the reference color model of track j than the current color histogram of track j itself. It w i l l produce incorrect assignment scores especially when targets w i t h s imilar reference color model are close to each other or even overlap. Therefore, only the distance is used as the evaluation. G i v e n the new assignment mat r ix and the constraints, the solution to this problem becomes straightforward. The goal is to minimize the objective function and the op t imal solution is just a collection of observation-to-track pairs that have the min ima l entry value i n the corresponding column i n the assignment matr ix . (5.5) 37 Chapter 6 Mean-Shift Embedded Particle Filter The motivation of embedding the mean-shift algorithm into the particle filter frame-work of our tracking system is to stabilize the tracking result. It is important for the dynamic model of targets because stabilizing trajectories improves the accuracy of the computed velocity of targets, which is critical for improving the prediction of the location of targets. It is also important for the likelihood model because accurate prediction leads sampling to more promising areas so that the influence from background clutter and mutual occlusion will be reduced. As a result, the particles of one target, which are biased by mean-shift, will be more concentrated on the true location of the targets in the scene. In this chapter, we explain how the mean-shift algorithm works using the color features and how to embed it into our particle filtering process for robust tracking. 6.1 Color Model Color-based tracking is successful in tracking non-rigid objects with unique colors. It produces robust tracking results even if the targets change shape or are partially occluded by other objects. We adopted the color model in [PHVG02, OTdF+04] to represent the hockey players we try to track in our application. The model is originally introduced by Comaniciu et al. [CRMOO] for the mean-shift based non-rigid object tracking. In addition, because the color histogram based approach discards the spatial information of the targets, it might have problems when the background clutter has a similar representation to the targets of interest in the color space. Therefore, we adopt the multi-part color model from [PHVG02] with small variation to fit it into the kernel-based color model we use in our application. The color histogram is created in the Hue-Saturation-Value (HSV) color space because it is robust to the illumination condition changes. The H S V his-togram comprises N = Nh~xNs + Nv bins, which discretize the whole H S V space. 38 Nh, Ns, Nv are the number of bins i n each dimension of the H S V color space and are set to 10 equally in our experiment. T h e color vector of a p ixe l located at posi t ion k is denoted as y(k) = [h(k), s(k),v(k)]T and each entry is mapped to an appropriate b i n i n the corresponding dimension. The b in representation of the color vector is denoted as [b^k), bs(k), bv(k)]T. The b i n vector is further mapped to a b i n i n the histogram w i t h the following index function. b ( k ) = { bs(k)xNh + bh(k) i f s(k) > 0.1 and v{k) > 0.2 { bv(k) + NhxNs otherwise where b(k) is the index of pixel in the histogram. T h e saturation and value of a pixel is set to be above 0.1 and 0.2 respectively because the hue information is only reliable under that condit ion. P ixe ls that are too dark or unsaturated are classified only according to their value. W i t h this b in indexing function, the color histogram of the candidate region i ? ( x t ) around the locat ion x t at t ime t is denoted as Q ( x t ) = {q(n]xt)}n=lt...}N, where q(n-xt) = C E M x t ) # ) - n ] „ = i q(n; x ( ) = 1 where S is the Kronecker delta function, C is a normalizat ion constant, k is any pixel w i t h i n the region i ? ( x t ) , i?(xt) is a 2D region centered at locat ion x t , which can either be rectangular or ell ipsoidal. B y normaliz ing the color histogram, Q(xt) becomes a discrete probabil ist ic dis t r ibut ion. The color-based t racking searches for a candidate region whose histogram is the most s imilar to the one of a reference target model . T h e target model color histogram and target candidate histogram are denoted as follows: target model: Q* = q*(n; x 0 ) n = 1 ^ £ n = i q*(n; x 0 ) = 1 target candidate: Q ( x t ) = q(n; x t ) n = 1 N J2n=i l(n; x t ) = 1 where Q* is constructed at the in i t ia l iza t ion step. The Bhat tacharyya coefficient [Kai67] is adopted to represent the s imilar i ty between two sample histograms as N P(x 4 ) = p [ Q ( x t ) , Q*] = £ v /g (n ;x t )g*(n ;x 0 ) (6.4) 71=1 The distance between two histograms is represented as d(xt, xo) = Vl-p[Q(*t),Q*] (6.5) There are three major reasons to use Bhat tacharyya coefficient in the s imi-lar i ty function. Fi rs t ly , it has a straightforward geometric interpretation because i t is the cosine of two vectors. Secondly, d(xt) is a metric. F ina l ly , because it uses a 39 normalized discrete dis t r ibut ion as evaluation variable, it is invariant to the scale of the target. G i v e n the distance function, the l ikel ihood dis t r ibut ion can be computed by passing the distance into an exponential function: p (y t | x t ) oc e - A d 2 ( x * ' x ° ) (6.6) where A = 20 is determined through experiment. Mul t i -pa r t color model splits the region i ? ( x t ) into r parts so that R(xt) = 2~Zj=i Rjfat)- The color histogram of each part is constructed independently. In [ P H V G 0 2 ] , the l ikel ihood is formulated as • p ( y t | x t ) a e - x ^ = i d ? ( x t ' X o ) (6.7) where * 5^- - x (6.8) E ^ - ( ^ ( | | ^ | | 2 ) where { a j } ^ 1 are points w i t h i n the region i ? (x ) around the current locat ion x , w(o,i) is the weight associated to each pixel ai, and k(x) is a kernel profile of ker-nel K that can be wri t ten i n terms of a profile function k : [0, oo) —> R such that K(x) = A;( | |x | | 2 ) . Accord ing to [Che95], the kernel profile k(x) should be nonnega-tive, nonincreasing, piecewise continuous, and k(r)dr < oo. Mean-shift is a mode seeking algori thm because the theory guarantees that 41 100 200 300 400 500 600 Frame 1 0 50 100 150 200 Color histogram of the labelled target in Frame 1 100 200 300 400 500 600 Frame 32 0 50 100 150 200 Color histogram of the labelled target in Frame 32 50 100 150 200 250 300 350 400 450 i BI JJ 100 200 300 400 Frame 67 0 50 100 150 200 Color histogram of the labelled target in Frame 67 Figure 6.2: This figure shows how significantly the color histogram of the same target changes over time. 42 the mean-shift offset A x at location x is in the opposite direction of the gradient direction of the convolution surface M C(x) = G(ai ~ x)w(oi) (6.9) i = i where kernel G is called the shadow of kernel K and satisfies the following relation-ship g'(r) = -ck(r) (6.10) where g is the kernel profile of G, g' denotes the derivative of kernel profile g, and c is a constant. Therefore, it is clear that mean-shift is a gradient ascent algorithm that converges to the mode of the convolution surface. In order to utilize mean-shift to analyze a discrete density distribution, i.e. the color histogram, an isotropic kernel G with a convex and monotonically de-creasing kernel profile g(x) is superimposed onto the candidate region .R(xj). The target is represented by an ellipsoidal region bounded by a rectangle specified by the tracker. The target model is normalized to a unit circle by rescaling the row and column dimensions so that the shape of the ellipsoid does not influence the model representation. The normalized pixel locations in the region of the target model are denoted as {a*}i-it...tM- With this kernel G, the target model element <7*(n;xn) can be rewritten as follows: M q*(n^0) = C^2g(\\\\a*\\\\2)5[b(a*)~n} (6.11) i = i where the region center is 0, C is the normalization constant to guarantee the probabilistic property of the discrete distribution. The nonincreasing kernel assigns lower weights on the pixels that are on the edge of the target. This is reasonable because most edges are blurry or might not actually belong to the target. They are much less reliable than those pixels that are closer to the center. As in the target model, the normalized pixel locations of the target candidate centered at x t are denoted as {a,i}i=it...,Mh- Each element of the target candidate histogram can be rewritten as Mh i = l h 5[b(ai)-n] (6.12) where Ch is also a normalization constant that depends on h, and h is the bandwidth that determines the scale of the target candidate. The similarity measure and the distance between the new target models and target candidates are the same as in Equation 6.4 and 6.5. The goal of a tracking process is to localize the position of the target in the current frame. In other words, it aims to search the neighborhood for a location so that the new candidate feature 43 extracted from that locat ion has the m i n i m u m distance to the model feature. We assume that, at each t ime step, the search starts at locat ion x and the normalized discrete dis t r ibut ion Q ( x ) is already computed, x can be the target locat ion i n the last t ime step or any k ind of locat ion specified by the system. T h e goal is to itera-t ively update x ' to reach the mode location. Accord ing to Equa t ion 6.5, min imiz ing the distance is equivalent to maximiz ing the Bhat tacharyya coefficient p [Q(x ' ) , Q*}. Here, we assume that the new candidate feature does not have significant change from the feature at the start ing location, which is true i n most situations i n practice. W i t h this assumption, we can use the Taylor expansion around the value Q ( x ) to have a reasonably accurate linear approximat ion of the Bha t tacharyya coefficient p[Q(x ' ) ,Q*] « \\ £ y/q(n; x) e1 • Output : Location x' that minimizes the distance d(x',xn) As mean-shift is able to seek the mode of the density distribution, it is also capable of stabilize the tracking result by shifting the tracking output to the location with the maximum similarity to the model in the feature space. The algorithm is exactly the same as the one shown above. The start location x for the stabilization is given by the output of the tracking system. 6.3 Adaptive Scaling As is shown in Equation 6.12, the bandwidth h determines the scales of the target candidate. In our application, the scales of players in the video frames is determined by their relative distance to the camera and the zooming effect of the camera. Therefore, it is important to adaptively change the scale through out the tracking process. Wu et al. [Wu05] tried to stabilize the scales of the targets through a pyramid based template matching, which is a coarse-to-fine scale search process to le selects the stop criteria for the mean-shift iteration. Any value is appropriate as long as it is small enough relative to the application. It is chosen to be 1 pixel in our application. 45 find the best scale that match the template. In addition, rather than using only one top match, it takes multiple top matches to generate a best estimate for a consistent scale to avoid sudden jump. As we do not have any template to match, we need to search for the best estimate of the scale on the fly through the mean-shift algorithm. In [OTdF+04], target scale is one of the dimensions in the the state space. Experiments show that the sampling process might generate scales that have a drastic change from the previous scale. However, in the real situation, because of the temporal and spatial continuity, the scales of players seldom change drastically. One of the solutions is to separate the scale from the state space and evolve it independently for each particle. The adaptive strategy searches several bandwidths around the previous bandwidth hprev to find the optimal bandwidth hopt at current step. To reduce the computational cost, only three bandwidth options are computed in our application: hwtv, hprev — A^, hprev + A^. Taking into consideration of the scale continuity, A^ is defined to be A^ = 0.1hprev. The optimal bandwidth hopt selects the bandwidth that outputs the minimum distance (or the maximum Bhattacharyya coefficient). The new bandwidth hnew is updated with the following smoothed adaptation strategy Kew = ihopt + (1 - y)hprev (6.18) where 7 is a smoothing factor that controls the adaptation speed to avoid drastic scale change. It is chosen to be 0.1 in our application. By combining the mean-shift algorithm and the adaptive scaling strategy, the tracking results can be well stabilized. Figure 6.3 compares the the raw tracking output and the stabilized result. It shows that the scale of the original tracker often is too small so that it cannot capture the major body of the target. Also, it drifts off the targets in some time steps. Meanwhile, the mean-shift stabilized result is much more stable. The scale changes more smoothly and fits the bounding box onto the target more properly. 6 . 4 Mean-shift Embedded Particle Filter Applying the mean-shift algorithm directly to the tracking output only gives one deterministic offset at each step. It might not be able to capture the true location of the targets due to background clutter or mutual occlusion between targets in the image. Embedding it into the particle filter framework brings uncertainty to the deterministic method so that the statistical property can improve the robustness of the algorithm. The key idea of the mean-shift embedded particle filtering is to insert one more operation into the conventional particle filter framework. There are two places where we can insert the mean-shift biasing step. One is after the particles are propagated by proposal distribution and the other is the after the deterministic resampling step, which is introduced in Section 3.1.3. There 46 L (a) Frame 24 A r 1 (b) Frame 40 — 1 1 L (c) Frame 63 ' JL 1 V - A [II (d) Frame 84 Figure 6.3: Compaxison between the original track output and the mean-shift sta-bi l ized result. Left ones are the original track result and the right ones are the stabil ized result. 47 is one fatal disadvantage that undermines the second option. Because mean-shift is only based on the s imilar i ty between the reference model and the candidate model i n the feature space, it might bias particles to other s imilar targets that are around the target of interest or some background clutter that are s imilar to the target of interest i n the feature space. A s a result, those wrongly biased particles that are far away from true locat ion of the targets w i l l have low weights because of the penalty from the prior dis t r ibut ion. W i t h o u t the resampling step, those low weight particles are s t i l l propagated i n the next step and lead the proposal d is t r ibut ion to sample i n the wrong region and the tracker quickly loses the targets. F igure 6.4 shows the t racking result of inserting the mean-shift after the resampling step i n the particle filtering. It can be noticed that, i n Subfigure (b)(e)(f), some particles are moved to some locations that are far away from the true target locat ion or the other targets. A l t h o u g h they are assigned low weights, they keep on propagating wi thout being penalized by the resampling. A s a result, they increase the variance of the particle set of one target and cause the target to be el iminated i n the next frame because of the high uncertainty i n particle hypotheses. In our applicat ion, the mean-shift operation biases a l l the particles right af-ter the sampling from the mixture of Gaussians proposal d is t r ibut ion and before the resampling step in the particle filter framework. In our mult i- level state space model , particles propagated by the proposal d is t r ibut ion are mapped back to the video frame coordinate system. Then , each particle is biased ind iv idua l ly and inde-pendently by the mean-shift a lgori thm to seek the mode of the density d is t r ibut ion i n the neighborhood of each particle. The locations of particles i n the video frame coordinates are set to be the start ing locations for a l l the independent mean-shift procedures. After the bias step, new samples are drawn from a Gaussian dis t r ibut ion which is centered at the biased particle w i t h a predefined covariance mat r ix . The mean-shift searching step combined w i t h the o ld proposal d is t r ibut ion can be considered as a new proposal d is t r ibut ion zfc,t)' where zj^t is boosting detection. * Kt <- Kt-i + St- Dt. 2. Importance sampling - For k = 1, ...,Kt, i = 1, ...,N, sample x j ^ ^ x ^ l x j ^ ^ z ^ ) . 3. Mean-shift biasing - For k = 1 , K t , i = 1, ...,N, bias the particles as xj^j = *