Background Subtraction with a Pan/Tilt Camera

by

Egor Tsinko

B.Sc., The University of Calgary, 2006

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

December 2010

© Egor Tsinko, 2010

Abstract

Background subtraction is a common approach in computer vision to detect moving objects in video sequences taken with a static camera. Some background subtraction algorithms have been extended to support a pan/tilt camera, but they are required to provide a background mosaic for motion estimation purposes. This thesis describes a system that avoids this requirement by separating the motion estimation system and the background subtraction system into two separate modules. The first module performs motion compensation by employing a feature-based image registration method and the RANSAC algorithm. The second module detects moving objects in rectified image frames using a background subtraction algorithm designed for a static camera.

Three background subtraction algorithms were extended in the course of this project: mixture of Gaussians, non-parametric kernel density estimation and codebook. This thesis demonstrates the usefulness of separating motion estimation from the background subtraction system, as it allows us to extend any background subtraction algorithm designed for a static camera to support a pan/tilt camera. Detection results are presented for both indoor and outdoor video sequences taken with a pan/tilt camera.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
2 Related Work
  2.1 Static camera
    2.1.1 Parametric methods
    2.1.2 Non-parametric methods
    2.1.3 Spatial methods
  2.2 Moving camera
3 Motion Model Estimation and Mosaic Building
  3.1 Homography
  3.2 Direct linear transformation
    3.2.1 Normalization
  3.3 Minimizing geometric error
  3.4 Feature points and inlier selection
  3.5 Decreasing the number of outliers
  3.6 Keypoints update strategies
    3.6.1 Recalculating keypoints
    3.6.2 Appending keypoints
    3.6.3 Comparison of strategies
4 Background Subtraction
  4.1 Mixture of Gaussians
  4.2 Non-parametric kernel density estimation
  4.3 Codebook
    4.3.1 Color model improvement
5 Results and Discussion
  5.1 Detection of moving objects in outdoor scenes
    5.1.1 Sequence 1
    5.1.2 Sequence 2
    5.1.3 Sequence 3
  5.2 Detection of moving objects in indoor scenes
    5.2.1 Sequence 4
    5.2.2 Sequence 5
    5.2.3 Sequence 6
  5.3 Discussion
6 Conclusions and Future Work
Bibliography

List of Tables

Table 4.1  Intersection rule for short-term and long-term models [36]
Table 4.2  Comparison of the codebook algorithm with three different metrics without using the brightness statistics Î, Ǐ
Table 4.3  Comparison of the codebook algorithm with three different metrics with the brightness statistics Î, Ǐ
Table 5.1  Comparison of the three pan/tilt background subtraction algorithms for Sequence 1
Table 5.2  Comparison of the three pan/tilt background subtraction algorithms for Sequence 2
Table 5.3  Comparison of the three pan/tilt background subtraction algorithms for Sequence 3
Table 5.4  Comparison of the three pan/tilt background subtraction algorithms for the sequence by Xu
Table 5.5  Comparison of the three pan/tilt background subtraction algorithms for Sequence 5
Table 5.6  Comparison of the three pan/tilt background subtraction algorithms for Sequence 6

List of Figures

Figure 1.1  Two frames from a typical surveillance sequence and the output of a background subtraction algorithm
Figure 3.1  Background mosaics built with and without using a Hough transform
Figure 3.2  Background mosaics built by recalculating keypoints and by appending keypoints to the list
Figure 3.3  Background mosaic with misaligned areas outlined in red
Figure 3.4  Number of keypoints in the list L
Figure 4.1  Four situations that a pan/tilt background subtraction algorithm might encounter
Figure 4.2  Frames 100, 200 and 300 of an image sequence taken with a pan/tilt camera
Figure 4.3  The color model of Kim et al.
Figure 4.4  Frame 400 and detection results from an artificial image sequence
Figure 4.5  Frames 0 and 251 of the Wallflower Camouflage video sequence
Figure 4.6  Classification results of frames 0 and 251 with e = 8
Figure 4.7  Classification results of frames 0 and 251 with e = 1
Figure 4.8  The color model for the codebook (CB) algorithm used in this thesis
Figure 4.9  Classification results of frames 0 and 251 with the threshold for q = 2
Figure 4.10  Normalized vector distance (adapted from [19])
Figure 5.1  Frames 501, 604, 700 and 949 of Sequence 1
Figure 5.2  Frames 625, 720, 850 and 930 of Sequence 1 with ground truth and detection results
Figure 5.3  Background model for the mixture of Gaussians (MOG) algorithm at frames 625 and 720
Figure 5.4  Frames 78, 180, 413 and 829 of Sequence 2
Figure 5.5  Frames 400, 460, 660 and 900 of Sequence 2 with ground truth and detection results
Figure 5.6  Classification results for frames 660 and 900 of Sequence 2 produced by non-parametric kernel density estimation (KDE)
Figure 5.7  Frames 350, 430, 500 and 570 of Sequence 3 with ground truth and detection results
Figure 5.8  Frames 1 and 52 of Sequence 4 by Xu [41]
Figure 5.9  Frames 65, 95, 125 and 150 of the Xu sequence with ground truth and detection results
Figure 5.10  Frames 400, 600, 720 and 999 of Sequence 5
Figure 5.11  Example of radial distortion in Sequence 5
Figure 5.12  Frames 400, 600, 720 and 999 of Sequence 5 with ground truth and detection results
Figure 5.13  Frames 239, 314, 406 and 442 of Sequence 6
Figure 5.14  Frames 239, 314, 406 and 442 of Sequence 6 with ground truth and detection results

Glossary

BG  background
BS  background subtraction
CB  codebook
DLT  Direct Linear Transformation
EM  expectation maximization
FG  foreground
FP  false positive
FN  false negative
GT  Ground Truth
KDE  non-parametric kernel density estimation
MLESAC  maximum likelihood estimation sample consensus
MOG  mixture of Gaussians
MRF  Markov random field
NVD  normalized vector distance
PDF  probability density function
RANSAC  Random Sample Consensus
SIFT  Scale Invariant Feature Transform

Acknowledgments

I would like to acknowledge the profound contribution to this work made by a number of individuals. First of all I would like to thank Dr. Bob Woodham for being my supervisor, and for his guidance, encouragement and financial support. I would also like to express my gratitude to Dr. Jim Little for his help in getting me accepted into the Masters program at UBC. Special thanks go to Tomas Hoffman for his help in creating test video sequences. Finally, I would like to thank my wife Yulia for all the help and support she offered me.

Chapter 1

Introduction

The work presented in this thesis extends a group of background subtraction algorithms designed to segment moving objects in video sequences taken with a static camera. The extension relaxes the camera's motion constraints and allows background subtraction algorithms to build the background model and subtract the background from video sequences taken with a camera that rotates about its optical center, thus widening the application range of these algorithms.

One of the active problems in the field of computer vision is to segment an image into “interesting” and “uninteresting” regions for future processing.
The “interesting” regions might be defined by object categories such as “cars” or “humans”. Alternatively, they might be defined by the objects' physical properties, such as a change of position or orientation, i.e., “moving objects”.

Background subtraction is a group of algorithms whose task is to separate moving objects from the rest of the scene in a camera's visual field. All stationary objects of a scene are considered to be “uninteresting”, and the goal of the algorithm is to segment moving objects from the “uninteresting” static objects. Usually there is no notion of an object in background subtraction; the algorithm just locates groups of pixels that are different from the background model. These pixels can be grouped together as objects at later stages of image processing.

Usually background subtraction is used as a first step in high-level systems [35]. The most common ones are surveillance [32] and tracking algorithms [16]. Each frame of a video sequence is segmented and moving (i.e., “interesting”) objects are located. These objects are then tracked and classified. For example, the algorithm by [32] uses background subtraction to segment the moving objects and then classifies them into different categories based on activity patterns. Another algorithm [37] employs background subtraction to track human bodies in complex scenes. Alternatively, background subtraction can be used to remove moving objects from the scene to generate a static background scene.

Essentially, background subtraction can be seen as a classification problem: the goal of such an algorithm is to classify each pixel of a video sequence as “foreground” or “background”, where “foreground” means that the pixel belongs to some moving (i.e., “interesting”) object and “background” means that the pixel belongs to a static, or “uninteresting”, object. Some apparent motion can still be considered “uninteresting”; for example, the repetitive motion of swaying tree branches, flags waving in the wind or a flashing light at an intersection is usually not considered “interesting”. Depending on the task, these types of motion can be filtered out.

In order to perform this classification the algorithm needs to build a model, an internal representation of the background. To detect moving objects, the algorithm then compares each new frame to this model. In the simplest case this background model is a reference image that contains the scene without any moving objects; the algorithm calculates the difference between the reference image and the current frame to find the “interesting” pixels. In other algorithms the background model is more complex and incorporates probability distributions of the values for each pixel.

In most cases there is no training set of pre-labeled images from which to build a background model. Therefore, the key assumption in background subtraction is that most of the pixel values in the image sequence belong to the background. This usually results in a higher number of mislabeled pixels at the beginning of the process, as some moving objects are classified as background while there is not yet enough evidence. Eventually the model stabilizes as more samples are collected, and the number of misclassified pixels decreases.
To avoid this problem, some algorithms perform a training step during which they collect enough samples over some period of time to build a background model containing the pixel values that reappeared often during training.

In surveillance scenarios the camera usually monitors a scene for a long period of time, during which the background scene might change. For example, in an outdoor scene the lighting conditions change as the sun travels across the sky. New objects can also be introduced into or removed from the background, such as parked cars leaving the scene or new cars parking in the parking lot. The background subtraction algorithm should be able to adapt to such a changing background, and therefore the background model should be able to change with time.

A serious limitation of background subtraction algorithms is that each pixel in an image frame is typically considered to be independent [4]. In practice, the brightness and color of neighboring pixels that belong to the same object are similar. Therefore these pixels should not be labeled as background or foreground independently; the probability of one pixel having a specific label should depend on how other pixels belonging to the same object were labeled. Several solutions have been proposed for this problem, ranging from Markov random fields [40] to multi-scale image pyramids [42]. However, these solutions significantly increase computation time and, therefore, are currently impractical when real-time processing is needed.

Another limitation of background subtraction algorithms is the common requirement that the camera remain static. There is no notion of a spatial position for each pixel in a background subtraction algorithm. Therefore, if the camera's orientation changes, the algorithm will not know the position of the new pixels relative to the background model. This severely limits the application range of background subtraction algorithms.

Given the availability of cheap pan/tilt cameras, surveillance can be performed using rotating setups where the camera sweeps over a large area. This means that background subtraction algorithms used in this kind of application must be able to accommodate camera movement. This thesis addresses the case of pan/tilt camera motion in background subtraction. A typical surveillance scene taken with a pan/tilt camera can be seen in Figure 1.1. Figure 1.1a and Figure 1.1b show frames 50 and 100 respectively. In this sequence a pan/tilt camera monitors an intersection. During the sequence, a car and a pedestrian pass by. The panning motion of the camera, along with some background objects (in this case trees and poles) that partially occlude the moving objects, makes this sequence challenging and beyond the scope of a background subtraction algorithm designed for a static camera. Figure 1.1c and Figure 1.1d show the ground truth for these two frames, where “uninteresting” pixels are labeled black and “interesting” pixels are white. The segmentation produced by the algorithm described in this thesis can be seen in Figure 1.1e and Figure 1.1f.

Figure 1.1: Two frames from a typical surveillance sequence and the output of a background subtraction algorithm. Figure 1.1c and Figure 1.1d show manually produced ground truth, where white pixels denote foreground and black pixels denote background. The results produced by the algorithm described in this thesis are shown in Figure 1.1e and Figure 1.1f.
Note that the orientation of the camera changed between these two frames; the algorithm was able to recognize this and successfully distinguish the camera motion from the motion of the foreground object. The pole and the tree in Figure 1.1f are also correctly classified as background.

Since the assumed camera motion only involves rotation about the camera's optical center, there is no motion parallax [20]. This simplifies the problem of estimating camera motion, since the apparent motion of all objects in the scene doesn't depend on their distance from the camera. With this assumption the motion parameters can be estimated from the video sequence and the frames can be warped to a common background coordinate system. This reduces the problem to the stationary camera case, where each frame of the sequence covers only some part of the overall background model [33].

The goal of the described algorithm is to detect object motion independent of the motion introduced by camera pan/tilt. The algorithm is a two-step process: first, the camera motion is estimated and compensated for by transforming the new frame into the coordinate system of the background model; second, background subtraction is performed normally in the background model's coordinate system using the warped frame.

There are several ways to compensate for camera motion. One approach is to build and maintain an explicit background mosaic image. Each time a new frame is registered, each matching pixel in the mosaic image is updated by averaging its value with the value of the corresponding pixel in the warped frame. This approach works for both direct and feature-based image registration techniques. In the first case the new frame is directly matched to the mosaic using one of the existing techniques. In the second case keypoints are calculated from both the background mosaic and the new frame, and the transformation between them is then calculated. Unfortunately, there is a problem with this approach: the presence of moving objects in some image frames can decrease the quality of the mosaic, and the registration can be thrown off by an incorrect mosaic image [20].

To solve this problem, the algorithm described in this thesis uses feature-based image registration with SIFT keypoints [18]. The algorithm maintains a list of SIFT keypoints instead of keeping an explicit background mosaic image. For each new frame the matching keypoints are found in the list of background keypoints and the motion parameters are estimated using the RANSAC [8] algorithm. All the keypoints in the new frame that don't have matching keypoints in the list are then added to the list. The idea is that adding keypoints that already have matches would only create redundancy and quickly increase the size of the list. By adding only unmatched keypoints, the list size doesn't increase rapidly and the keypoints from previously unseen areas of the scene get incorporated into the list. Some keypoints from moving objects also get added to the list, but experiments have shown that in surveillance scenarios, where the moving objects occupy a relatively small portion of the field of view, these keypoints don't affect the registration and motion estimation.
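Putting the two modules together, each incoming frame is processed as sketched below. This is a minimal illustration assuming a hypothetical estimate_homography helper built on the SIFT matching just described, and a static-camera subtractor object with an apply method; it shows the structure of the pipeline, not the exact interfaces used in this thesis.

```python
import cv2

def process_frame(frame, keypoint_list, bg_subtractor, mosaic_size):
    """One iteration of the two-step pan/tilt pipeline (illustrative only)."""
    # Step 1: motion compensation. Estimate the homography H that maps the
    # new frame into the background model's coordinate system.
    H = estimate_homography(frame, keypoint_list)  # hypothetical helper

    # Warp the frame into the background coordinate system, removing the
    # apparent motion caused by the camera's pan/tilt.
    warped = cv2.warpPerspective(frame, H, mosaic_size)

    # Step 2: run an unmodified static-camera background subtraction
    # algorithm on the rectified frame.
    return bg_subtractor.apply(warped)  # hypothetical interface
```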
For the background subtraction step, the algorithm described in this thesis uses variations of three state-of-the-art algorithms: mixture of Gaussians [31], non-parametric kernel density estimation [6] and the codebook approach [15]. Initially all these algorithms were designed to work with a static camera. In a pan/tilt scenario some areas of the scene that were not part of the training sequence may become visible later as the camera rotates. Therefore the background subtraction algorithm must be able to grow the model as previously unseen areas of the scene become visible. This thesis extends each of the three background subtraction algorithms to support dynamically growing the background model.

A new method to compare codewords is suggested for the codebook algorithm. To match two codewords, the original algorithm developed by Kim et al. [15] calculates the Euclidean distance between the codewords' color vectors. It is shown that calculating angles instead of a distance produces more robust results under different illumination and actually improves the detection rate of the original codebook algorithm.

The experimental results presented in Chapter 5 show that all three background subtraction algorithms extend to handle a pan/tilt camera and show satisfactory performance in surveillance scenarios. This suggests that the motion compensation approach demonstrated in this thesis can be used to extend any static camera background subtraction algorithm to a pan/tilt configuration.

The rest of the thesis is organized as follows. Chapter 2 reviews related techniques for both static and pan/tilt cameras. Descriptions of the motion model and of the algorithm used to estimate camera motion are given in Chapter 3. Chapter 4 describes the three background subtraction (BS) algorithms used in this thesis. The results of the experiments and the comparison of the three algorithms are given in Chapter 5. Chapter 6 concludes the thesis and suggests future work.

Chapter 2

Related Work

Background subtraction characterizes a family of algorithms used to detect moving objects in a video sequence. It is a well studied area of computer vision with a history of over 25 years. The core of every background subtraction algorithm is a background model, an internal representation of the static objects in the scene. Each time a new image frame is obtained it is compared to the model, and pixels that represent moving objects according to the current model are marked as foreground. Many background subtraction algorithms model the value of each pixel in an image as a probabilistic distribution, and these algorithms can be classified by the type of distribution used to represent the background model. Parametric methods assume that the background can be modeled by a given parametric distribution, while non-parametric methods make no such assumption; that is, they don't use parametric distributions to model the background.

2.1 Static camera

In the simplest case, background subtraction is performed on an image sequence taken with a static camera. Since there is no significant camera motion between consecutive frames in the image sequence, it is reasonable to assume that the same small area of the scene is always observed by the same sensor element in the camera. Therefore each pixel in a video sequence consistently represents the same part of the scene. Once a moving object passes through the scene, the values of the pixels that correspond to the parts of the scene occluded by the moving object will change. Indeed, the simplest way to detect changes in the scene, described in [17, 35], is to compare the pixel values of two adjacent frames in the video sequence and mark the pixels whose difference is greater than some predefined threshold as foreground.
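A minimal sketch of this interframe differencing (the threshold value is an arbitrary, illustrative choice):

```python
import numpy as np

def frame_difference(prev_frame, curr_frame, threshold=25):
    """Label pixels whose interframe change exceeds a threshold as foreground.

    Both frames are grayscale images of identical shape; the returned mask
    is True where a pixel is classified as foreground.
    """
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold
```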
Although simple and fast, this approach does not retain any history about the scene, and only the edges of moving objects are detected, as objects rarely move fast enough to cover an entirely new area of the scene in the next frame [17].

2.1.1 Parametric methods

Parametric methods assume that each pixel in the scene can be modeled as a statistical process and that this process can be approximated by a given parametric distribution or a mixture of distributions. In most cases [9, 26, 31, 39] a Gaussian distribution is used for this purpose. Parametric models are a simple and computationally inexpensive way to represent the process that generates the background. However, they cannot quickly adapt to a changing background [6]. If the background scene suddenly changes, for example when a light is switched on or off in an indoor environment, these algorithms fail to classify pixels correctly until the background model finally adapts to the new background.

A simple background model was used in the straightforward statistical approach of Wren et al. [39]. In this algorithm each pixel in the visual field is represented by a single Gaussian distribution in the background model. To classify a new pixel, the algorithm compares it to the mean and, depending on how far it is from the mean, marks it as either foreground or background. Each time a new pixel is classified as background, the mean of the corresponding Gaussian is updated using a running average scheme. The advantage of this method over simple interframe differencing is that it can adapt to a slowly changing background, at a rate controlled by the learning rate used in the updating mechanism. Therefore this method can work for scenes in which the lighting changes slowly and dynamically, such as outdoor scenes. The problem with this approach is that it assumes the background is unimodally distributed, and so it cannot represent multimodal backgrounds such as a flashing light or waving trees.
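A minimal sketch of this kind of single-Gaussian model with a running average update follows; the learning rate, the initial variance and the 2.5-sigma match test are illustrative choices, not the parameters of [39]:

```python
import numpy as np

class RunningGaussianModel:
    """Per-pixel single-Gaussian background model (illustrative sketch)."""

    def __init__(self, first_frame, alpha=0.02, k=2.5):
        self.mean = first_frame.astype(np.float64)
        self.var = np.full_like(self.mean, 36.0)  # arbitrary initial variance
        self.alpha = alpha                        # learning rate
        self.k = k                                # match threshold in std devs

    def apply(self, frame):
        frame = frame.astype(np.float64)
        foreground = np.abs(frame - self.mean) > self.k * np.sqrt(self.var)
        background = ~foreground

        # Selective update: only pixels classified as background adapt,
        # via a running average controlled by the learning rate.
        self.mean[background] += self.alpha * (frame - self.mean)[background]
        self.var[background] += self.alpha * ((frame - self.mean) ** 2 - self.var)[background]
        return foreground
```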
Several comparable approaches that use a Kalman filter to update the background model have been proposed [14, 16, 27]. Similarly to the previous algorithm, the background model is represented by an image of the static objects in the scene (i.e., the means of the background Gaussians). To detect moving objects, these algorithms subtract the current image from the background image and threshold the difference. The common idea in these techniques is the use of a Kalman filter to update the background model, which means that a pixel value in the background image still gets updated even if the corresponding pixel in the new image frame is classified as foreground. This allows the background model to recover in cases where the algorithm mistakenly classifies foreground pixels as background. Nevertheless, like [39], these approaches use a unimodal distribution to model the background and consequently are not able to represent repetitive motion in the background scene such as a flashing light.

To capture complex backgrounds with repetitive motion, another type of model is needed. A number of approaches that use a mixture of several distributions [9, 25, 31] to represent multimodal background processes have been introduced. In all the cited cases a mixture of Gaussian distributions is used to model the background.

One of the earliest algorithms that used a mixture of parametric distributions is by Friedman and Russell [9]. In a traffic monitoring task, the authors modeled the background process as a mixture of three Gaussian distributions, one of which represented the road, another the shadows on the road, and the last one the vehicles. Pixels that belong to the road and shadow distributions could therefore be classified as background. Although tailored to a very specific task, this algorithm was one of the first steps towards background models represented by multimodal distributions.

Stauffer and Grimson introduced a more general algorithm that models each pixel in the background model as a weighted mixture of several Gaussian distributions [31]. The expectation maximization (EM) algorithm used for parameter learning in [9] was too slow for real-time performance; consequently, this approach uses a K-means approximation to find the parameters and the weights of each Gaussian distribution. In principle, the algorithm allows any number of Gaussians in the mixture model to belong to the background, thus permitting the model to capture complex backgrounds. The decision of which Gaussian a new pixel belongs to is based on the means and variances of the different Gaussians. One of the major advantages of this algorithm is its ability to have more than one distribution represent the background. This thesis uses Stauffer and Grimson as one of the modules for background subtraction with a pan/tilt camera. A more detailed description of the algorithm is given in Chapter 4.

Since the first publication by Stauffer and Grimson, other researchers have improved and extended aspects of the original algorithm [13, 25]. However, the algorithm, along with the whole group of parametric algorithms, suffers from one important drawback: in many cases a background process is very complex and cannot be modeled by a single distribution or a mixture of Gaussian distributions. Another problem parametric approaches have is that they are difficult to adapt to high frequency variations in the background, such as leaves shaking in the wind [6].
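To make the matching step concrete, the following sketch shows the classification of a single pixel against its mixture. The update equations are omitted, and the 2.5-sigma test and cumulative-weight selection of background components follow the usual description of [31] rather than being a verbatim transcription; T and k are illustrative parameters.

```python
import numpy as np

def classify_pixel(value, means, variances, weights, T=0.7, k=2.5):
    """Classify one pixel against its mixture of Gaussians (sketch).

    means, variances, weights: 1-D arrays describing the pixel's mixture,
    assumed already sorted by the weight/sigma ranking of Stauffer and
    Grimson. The first B components whose cumulative weight exceeds T are
    treated as background.
    """
    cumulative = np.cumsum(weights)
    B = min(int(np.searchsorted(cumulative, T)) + 1, len(weights))

    for j in range(B):
        if abs(value - means[j]) < k * np.sqrt(variances[j]):
            return "background"  # matched one of the background Gaussians
    return "foreground"
```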
2.1.2 Non-parametric methods

Parametric models can handle complex backgrounds, but their ability to successfully adapt to a changing background depends strongly on the learning rate. If the learning rate is low, the model fails to detect a sudden change in the background. On the other hand, if the learning rate is high, the model adapts too quickly and pixels that belong to slow moving foreground objects become incorporated into the model [15]. The high dependence on the learning rate and the inability of parametric mixture models to handle fast variations in the background [6, 15] led researchers to look for background subtraction approaches that avoid these problems. One such group of background subtraction algorithms consists of non-parametric methods. These approaches assume that the background process is complex and cannot be modeled by a parametric distribution; instead they use non-parametric techniques to represent the background model. Non-parametric models are able to adapt to rapid changes in the background and to detect objects with higher sensitivity [6, 15].

Elgammal et al. developed a method [6] that uses a non-parametric kernel density estimation algorithm [23] to perform background subtraction. During the training phase the algorithm only collects sample values for each pixel in a training data set. To classify a pixel in each new frame, the probability density function is estimated using the Parzen window method [23] and the probability of the pixel being foreground or background is calculated. To accommodate a changing background, the authors suggest maintaining two separate background models. One model is updated blindly in FIFO manner; the other employs a more selective update mechanism in which only the pixels that are classified as background are updated. This thesis uses this algorithm; it is explained in greater detail in Chapter 4. This approach proved to be more sensitive to low contrast moving objects than the mixture of Gaussians (MOG) [31] and could quickly adapt to changes in the background [6]. However, considerable computational resources are required to store all the samples and to estimate a probability density function (PDF) for each pixel. This limits the ability of the method to model long background sequences, as the model quickly “forgets” samples in the distant past as new samples come in. Therefore this approach may not be practical when the background requires significant sampling over long time periods [15]. In addition, this approach is very sensitive to the kernel bandwidth, a parameter that affects the shape of the estimated PDF [24].

Several attempts [24, 36, 37] have been made to develop algorithms that retain the advantages of non-parametric kernel density estimation (KDE) while decreasing the computational load and improving efficiency.

The approach by Wang and Suter [37] was inspired by RANSAC [8]. Their algorithm, called SACON, keeps the N most recent pixel samples in a background model. To classify a pixel in a new image frame, the algorithm counts the number of samples in the model that agree with the new pixel (i.e., have values close to the new pixel value); if this number is larger than some predefined threshold the pixel is labeled as background, otherwise it is labeled as foreground. This simple non-parametric background subtraction approach requires fewer parameters than MOG [31] and is computationally efficient [37].

To overcome the high sensitivity to kernel bandwidth and the very high computational load of the KDE approach [6], Piccardi and Jan proposed a new method based on the mean-shift algorithm [24]. The original mean-shift technique is an iterative approach that can obtain the modes of a multivariate distribution from a set of samples [10]. In the case of background subtraction, the pixel's recent values are used as the samples. Since this process is iterative, it is very computationally intensive and, therefore, cannot be used directly in real-time applications [24]. Due to particular characteristics of the background subtraction problem (e.g., pixels have only integer values), the authors were able to introduce several optimizations that reduce the computational cost and make the algorithm perform background subtraction in real time. The authors didn't explain the mechanisms used to update the background model; it is unclear how the algorithm updates the samples in the background model.
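For reference, the per-pixel classification step shared by these kernel density approaches can be sketched as follows; the Gaussian kernel, bandwidth and threshold below are illustrative choices rather than the exact settings of [6]:

```python
import numpy as np

def kde_is_background(value, samples, bandwidth=10.0, threshold=1e-3):
    """Parzen-window background test for a single pixel (sketch).

    samples: 1-D array of the pixel's N most recent stored values. The
    estimated density at `value` is the average of Gaussian kernels
    centered on the stored samples; a low density means foreground.
    """
    z = (value - samples) / bandwidth
    density = np.mean(np.exp(-0.5 * z * z) / (bandwidth * np.sqrt(2 * np.pi)))
    return density > threshold
```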
The inability of the KDE approach [6] to model long background sequences motivated Kim et al. to develop a method that uses a codebook to model the background [15]. This approach builds a highly compressed background model that can learn long background sequences without making parametric assumptions [15]. To learn the background model during the training phase, the algorithm builds a list of codewords for each pixel. Each codeword contains information including the average color value, the minimum and maximum pixel brightness, and the codeword access frequency. If the background contains repetitive motion, like waving trees, pixels can have multiple codewords, with each codeword representing a different object that was seen by the pixel at some time during training. During the testing phase the algorithm compares a pixel in a new image frame to the existing codewords associated with the pixel position. If the pixel matches any codeword, it is labeled as background. Initially, this algorithm was not able to adapt to a slowly changing background or to incorporate new information after training was complete. The authors proposed extensions to the original algorithm that let it adapt to a slowly changing background and incorporate new objects into the background model. More information about this approach, along with an improvement, is presented in Chapter 4.
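A sketch of the codeword matching test follows. The fields and thresholds below follow the general shape of the model of Kim et al., but are simplified: in particular, the color test shown is a stand-in for their color distortion measure, and eps, alpha and beta are illustrative parameters.

```python
import numpy as np

class Codeword:
    """One codeword of a pixel's codebook (simplified sketch)."""
    def __init__(self, rgb, i_min, i_max):
        self.rgb = np.asarray(rgb, dtype=np.float64)  # average color
        self.i_min = i_min                            # min brightness seen
        self.i_max = i_max                            # max brightness seen

def matches(codeword, pixel, eps=10.0, alpha=0.7, beta=1.3):
    """Return True if `pixel` is explained by `codeword` (sketch).

    A pixel matches when its color is close to the codeword's average color
    and its brightness lies within a scaled version of the brightness range
    observed during training.
    """
    pixel = np.asarray(pixel, dtype=np.float64)
    brightness = np.linalg.norm(pixel)
    color_ok = np.linalg.norm(pixel - codeword.rgb) < eps  # simplified distance
    bright_ok = alpha * codeword.i_min <= brightness <= beta * codeword.i_max
    return color_ok and bright_ok
```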
Toyama et al. developed an algorithm that processes each frame of a sequence at three different levels: pixel, region and frame [35]. At the pixel level the algorithm assumes that each pixel is independent of the others and performs background subtraction using a one-step Wiener filter: it stores a number of past pixel values and tries to predict the next value of the pixel, and if the observed value differs from the prediction, the pixel is marked as foreground. At the region level the algorithm takes interpixel relationships into account, thus making the algorithm spatial; at this level the algorithm tries to detect regions in the frame that could belong to one object. Finally, at the frame level the algorithm keeps track of the number of pixels labeled as foreground in each frame. When this number exceeds a threshold, the background model is switched to a new one. This keeps the number of misclassified pixels to a minimum when there is a sudden change in the background (e.g., lights turning on in a room).

2.1.3 Spatial methods

Usually parts of an image that represent the same object have similar color values. This implies that there exists dependence among the pixels. The background subtraction algorithms discussed so far make the assumption that pixels in every image frame are independent of each other. Algorithms that try to use this additional information have been described. Background models that consider neighboring pixels are called block-based or spatio-temporal [36]. Some consider blocks of pixels instead of each pixel individually [19, 22, 29, 35]. In the work by Wu and Peng [40], a Markov random field (MRF) is used to represent interpixel correlations. Recently, authors have described spatio-temporal extensions to existing background subtraction algorithms [36, 40].

Matsuyama et al. suggest partitioning an image into a set of N×N pixel blocks, each of which can be represented by an N²-dimensional vector [19]. To build the background model the algorithm calculates a median image of a training video sequence and then computes vectors for each pixel from this background median image. To perform background subtraction, a variant of vector distance is calculated between the vectors in a new image frame and the vectors in the background image. Whole blocks are classified as foreground or background based on the calculated distance.

Similarly, the algorithm by Seki et al. [29] subdivides the image into blocks which are then represented as N²-dimensional vectors. A covariance matrix is calculated for each block and decomposed into eigenvalues and eigenvectors. An eigenspace is then formed by taking the eigenvectors with large eigenvalues. To detect foreground moving objects, the algorithm subdivides a new image frame into blocks and projects them onto the eigenspace. The background subtraction is then performed using points in the eigenspace instead of the N²-dimensional vectors. Although these approaches are fast, the quality of the result is degraded and appears “blocky”, because each block is classified as foreground or background in its entirety.

Instead of looking at smaller image blocks, a method developed by Oliver et al. considers the whole image frame as a single block [22]. To learn a background model, this algorithm calculates a covariance matrix of several sample video frames taken with a static camera. The covariance matrix is then diagonalized using eigenvalue decomposition, and the M eigenvectors that correspond to the M largest eigenvalues are stored in an eigenbackground matrix Φ_M which represents the background model. Since moving objects are always in a new location in each frame, they don't contribute significantly to the background model and, therefore, are allowed during the learning phase [22]. To detect moving objects during the background subtraction phase, each new image frame is projected onto the eigenspace using the matrix Φ_M, and then a distance D between the frame and its projection is calculated. All the pixels where D is greater than a predefined threshold are marked as foreground [22]. By calculating the covariance matrix, this algorithm essentially takes into account the spatial dependencies between all the pixels in an image. However, this approach cannot incorporate new information into the background model, such as objects being relocated in the background or new objects becoming part of the background. Depending on the training set and the frequency, any repetitive motion in the background will not be learned, resulting in a large number of misclassified pixels.

Vemulapalli and Aravind [36] extended the KDE model [6] to the spatio-temporal domain. To consider interpixel correlations, the authors use the 3-by-3 blocks around each pixel, represented as 9-dimensional vectors, to represent each pixel in the background model. Since a 9-dimensional Gaussian kernel significantly increases the computational load of non-parametric kernel estimation, the authors propose a 9-dimensional hyperspherical kernel instead of a Gaussian kernel. Similarly to [6], the authors use a global threshold on the estimated PDF to label each pixel as foreground or background. Vemulapalli and Aravind argue that their approach is significantly faster than the original KDE algorithm and produces better results for noisy image sequences and dynamic backgrounds [36]. One drawback of this approach is that switching to the hyperspherical kernel leads to worse foreground detection, as the Gaussian kernel provides a better estimate of a pixel's PDF [36].

Similarly, Wu and Peng [40] developed a spatio-temporal extension to the codebook (CB) algorithm [15]. The extension consists of two parts. The first is called the spatial codebook. The authors make the assumption that, due to image noise and camera parameters, background pixels can fluctuate spatially [40]. Therefore, during the background subtraction phase, for each pixel in a new frame the algorithm considers not only the codewords at the corresponding location in the background model but also the codewords at neighboring locations.
The second improvement employs an MRF to model the dependencies between the current pixel labels and the previous frame's labels. The MRF consists of three layers: the current pixel values and the previous pixel labels form the observable layers, and the current pixel labels make up a hidden layer. The algorithm employs an energy minimization routine to find the best labels for the current image frame. Although the results produced by this approach are superior to those obtained with the original codebook algorithm, they come at the price of a large computational load.

2.2 Moving camera

Given the availability of pan/tilt cameras, the assumption that the camera is stationary restricts the application range of background subtraction algorithms [30]. In recent years researchers have been attempting to overcome this limitation by extending background subtraction algorithms to support moving cameras. A common theme [7, 12, 20, 26] is to utilize some form of motion compensation algorithm that rectifies each frame of a video sequence to the coordinate system of a background model, and then to perform background subtraction in the normal fashion. Usually, to simplify the motion compensation, these algorithms impose a pan/tilt motion constraint on the camera.

Sugaya and Kanatani introduced a method [33] that estimates camera motion and performs background subtraction on the entire video stream off-line. The authors assume that the camera model is affine and that there is no significant change in camera orientation. The algorithm extracts feature points from all the images in the video sequence and creates motion trajectories for each point. To find the camera motion between frames, the algorithm employs an iterative procedure that finds the affine transformation that most of the trajectories agree on. A background mosaic is then created by mapping all the frames to a common reference frame. For pixels that have multiple values coming from different frames, the median value is used. Finally, for each frame in the sequence, the labeling is done by mapping the mosaic image into the frame's coordinate system and performing background subtraction with this mosaic image. There are two major limitations of this approach. First, the motion constraints used are unfeasible for real life applications. Second, the simple background model does not support changing backgrounds, resulting in many misclassified pixels.

Sheikh et al. developed an algorithm [30] that performs background subtraction on video sequences taken with a camera that can undergo any type of motion, including rotation and translation. To make this possible, the authors make two important assumptions: first, the camera projection is considered to be orthographic, and second, the background is rigid and dominates the whole scene. The first assumption removes the problem of motion parallax, the physical phenomenon whereby the apparent motion of objects closer to the image plane is larger than that of objects further away. Because of these two assumptions, as the camera moves, all static rigid objects in the background have identical trajectories. As the first step, the algorithm extracts and tracks salient points in the image sequence. Because of the above assumptions, the majority of the salient points will be produced by background objects, and these salient points will move along similar trajectories. The algorithm uses RANSAC [8] to find the background trajectory that most of the points agree on.
All the salient points are then labeled as foreground or background depending on their agreement with the background trajectory. An MRF representing the probability of each pixel label given the salient point labels is used to produce a pixelwise labeling. Generalizing beyond pan/tilt constraints makes this approach useful for handheld devices, although the assumption that the camera projection is orthographic limits the approach [30].

To accommodate camera rotation, many approaches suggest building a background mosaic image to use as a reference frame for estimating camera motion and performing the background subtraction. A problem with background mosaics is that the total field of view cannot be larger than 180° [7]. To solve this problem one can use spherical or cylindrical coordinate representations for the background mosaic, but these representations are computationally intensive [7]. In Farin et al. [7], a solution to this problem is presented. The idea is to split the background mosaic into a set of independent images, called multisprites, each covering a different part of the whole scene. This decreases both the area of each multisprite, as they are less distorted, and the execution time, and it finally allows camera rotation angles greater than 180° [7]. After the motion parameters have been estimated by a conventional method, the training video sequence is partitioned into a set of segments using an iterative approach that minimizes the area of the multisprite produced by each segment. Moving objects are then removed in cases where a single pixel in a multisprite corresponds to several overlapping pixels in different image frames; this is done by applying a temporal median filter. Moving object detection is then performed by employing a static camera background subtraction algorithm with the warped background sprite that corresponds to the current image frame.

Ren et al. describe a pan/tilt background subtraction algorithm called spatial distribution of Gaussians. The algorithm performs motion segmentation in two steps. First, the motion parameters are estimated and motion compensation is performed on the image frame. Second, the background is subtracted using a background model resembling other mixture models [9, 31]. The background model used in this algorithm allows any number of Gaussian distributions in the mixture, although only one of them represents the background. A precise match between a new frame and the background mosaic is not required; this approach always assumes a registration error. To minimize the effect of the error during the background subtraction stage, the algorithm considers a window around each pixel. If at least one pixel in the window matches the background distribution in the background model, the pixel is labeled as background. The size of this window can change as the algorithm progresses, depending on the registration error from the motion estimation step. A key limitation of this approach is that only one Gaussian distribution in the mixture is considered to be background; essentially, the background is modeled as a unimodal distribution. More complex scenes that have repetitive motion in the background cannot be described by this model [31].
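The windowed test used by this family of methods can be sketched as follows; matches_background is a hypothetical per-pixel test standing in for the Gaussian match of [26], and the window radius is an illustrative parameter:

```python
def label_with_window(frame, model, x, y, r=1):
    """Label pixel (x, y) using an r-pixel search window (sketch).

    Registration error is tolerated by accepting a match anywhere in the
    (2r+1) x (2r+1) neighborhood of the background model.
    """
    h, w = frame.shape[:2]
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h:
                if matches_background(frame[y, x], model[ny, nx]):  # hypothetical
                    return "background"
    return "foreground"
```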
Due to the complex motion of a mobile robot, background subtraction generally cannot be used while the robot is in motion. However, when the robot is stationary, background subtraction can be used to detect moving objects in the field of view of a robot head pan/tilt camera, if available. Hayman and Eklundh [12] developed a background subtraction algorithm for this case. As the basis of their approach the authors used the MOG [31] approach for background subtraction. Motion between a new frame and the background mosaic is estimated using corner features and the MLESAC algorithm [34]. Since there are registration errors, in addition to sub-pixel motion and motion blur, there can be cases where a pixel is misclassified because it is not compared with the right mixture of distributions. To address this problem the authors propose a statistical model very similar to the model by Ren et al. [26]: it considers neighboring pixel distributions in the background model in order to label a pixel in a new frame.

During the background model initialization stage, slowly moving objects can become incorporated into the background model, which leads to a large number of misclassified pixels until the model stabilizes. Although this may not matter for algorithms that run over a long period of time, and thus eventually stabilize themselves, it is unsuitable when background subtraction is intended for a mobile robot, in which case valid segmentation results are required quickly [12]. Hayman and Eklundh address this problem by developing a bootstrapping procedure to set up the background model during the initialization stage. As with other approaches that use an MOG [9, 13, 20, 26, 31], a shortcoming of this algorithm is that the background model cannot adapt to a rapidly changing background.

Similarly to other approaches [7, 12, 26, 33], the method by Mittal and Huttenlocher [20] combines motion compensation with static camera background subtraction. The authors assume that there is no motion parallax, which implies that the camera motion is restricted to pan/tilt. An MOG [31] is used to represent the background model. The highest weighted Gaussian in each pixel's mixture approximates the most probable background value for the pixel, and the means of these Gaussians determine the background image for the registration step. To locate moving objects, each new image frame is registered against the background mosaic and a transformation matrix is obtained. Each pixel in the new frame is then warped to the coordinate system of the background mosaic and background subtraction is performed normally. To accommodate registration errors or a changing background, the algorithm compares the new pixel to a small neighborhood in the mosaic, essentially making it a spatial approach.

Most of the pan/tilt camera background subtraction approaches reviewed here require a complete background mosaic for all possible camera orientations before they can actually perform background subtraction. This severely limits their application range, as a video sequence containing all possible camera orientations must be available for training the algorithm, which is not always feasible. A background subtraction algorithm for a pan/tilt camera should be able to grow its background model on the fly as new parts of the scene become visible.

Many approaches developed for a pan/tilt camera use an MOG as the background model. One of the reasons for this is that it is very easy to extract a background image representing the static objects in the scene from an MOG model. This image is then used to calculate the camera motion parameters for a new frame. It is not so straightforward to extract the background image from other static camera background subtraction algorithms such as KDE [6] or CB [15].
Therefore another desired property of a pan/tilt background subtraction algorithm is the independence of the background image used to estimate motion parameters from the background model used for background subtraction.

The work described in this thesis separates background image maintenance from the background model used by a given background subtraction algorithm. They are kept as two different modules, one that performs motion compensation and another that performs background subtraction. This allows the background subtraction algorithm to be a “plug-in”, which lets any background subtraction algorithm designed for a static camera be used with a pan/tilt camera with minimal changes.

Chapter 3

Motion Model Estimation and Mosaic Building

This chapter describes the part of the system that deals with motion compensation. The goal of this subsystem is to register a new frame to the background model's coordinate system and provide the background subtraction subsystem with the warped frame.

We assume that all frames of the image sequence are taken with a pan/tilt camera that rotates about its optical center. In reality a typical camera will rotate about some point that is not aligned with its optical center, but our experiments have shown that this is not a major concern and does not result in significant registration errors for scenes in which objects are sufficiently far away.

In the case of a pan/tilt camera, all the frames are related by a homography [11]. We take the background model's coordinate system to be the coordinate system of the first frame of the sequence. This means that the homography for the first frame of the sequence is the identity matrix I. For each consecutive frame we calculate a homography H that transforms the new frame to the background model's coordinate frame. The new frame and the homography H are then passed to the background subtraction module. It is the responsibility of the background subtraction subsystem to maintain and grow the background model given the new frame and the homography H.

To estimate H we employ a feature-based approach using Scale Invariant Feature Transform (SIFT) keypoints [18]. The motion estimation module maintains a list of background SIFT keypoints, henceforth called L. Each time a new frame is obtained, a set of SIFT keypoints is extracted from it and these keypoints are matched to those in L. The homography H is then estimated in two steps. First, an initial guess for H and a list of inlier matches are found using the Direct Linear Transformation (DLT) [2] and Random Sample Consensus (RANSAC) [8] algorithms. Second, the final matrix H is estimated using a non-linear least squares minimization algorithm that minimizes the Sampson error [28]. Finally, the list L is updated with the keypoints from the new frame according to one of the strategies described later in this chapter.

3.1 Homography

In order to perform background subtraction with a pan/tilt camera, we estimate the camera motion between frames. Since the camera's motion is assumed to be a pure rotation about its optical center, the transformation from any frame to another frame is a homography, an 8 d.o.f. transformation of the projective plane that preserves collinearity [11].

Let $\mathbf{x} = [x, y, 1]^T$ be the coordinates of a given keypoint in one image frame, represented as a homogeneous column vector. Similarly, let $\mathbf{x}' = [x', y', 1]^T$ be the coordinates of the corresponding keypoint in another image frame. In this thesis we use the symbol $\sim$ to denote similarity up to scale.
If both of these images are taken with the same camera rotated about its optical center, then the transformation between these points is defined up to scale by a homography H:

\[ \mathbf{x}' \sim H\mathbf{x} \tag{3.1} \]

Since a homography is an 8 d.o.f. transformation and each pair $\mathbf{x} \leftrightarrow \mathbf{x}'$ provides constraints on 2 d.o.f., we need at least 4 pairs $\mathbf{x}_i \leftrightarrow \mathbf{x}'_i$ to estimate the H between two image frames.

3.2 Direct linear transformation

Since both $\mathbf{x}'_i$ and $H\mathbf{x}_i$ are homogeneous vectors and H is only defined up to scale, $\mathbf{x}'_i$ might not be equal to $H\mathbf{x}_i$. Therefore we solve the equivalent equation $\mathbf{x}'_i \times H\mathbf{x}_i = \mathbf{0}$ [11], where $\times$ denotes the cross product. One algorithm that can find a homography H from such a similarity relation is the DLT algorithm, originally developed by Abdel-Aziz and Karara [2]. Here, the variant of the DLT described by Hartley and Zisserman [11] is used.

Assume that we have 4 2D keypoint correspondences between two images, $\mathbf{x}_i \leftrightarrow \mathbf{x}'_i$, where $\mathbf{x}'_i$ and $\mathbf{x}_i$ are represented, not necessarily with unit third coordinate, by $\mathbf{x}'_i = [x'_i, y'_i, z'_i]^T$ and $\mathbf{x}_i = [x_i, y_i, z_i]^T$, with $i \in \{1, \dots, 4\}$. The goal is to find a homography H such that $\mathbf{x}'_i \sim H\mathbf{x}_i$ for all $i \in \{1, \dots, 4\}$.

The product $H\mathbf{x}_i$ can be written as follows:

\[ H\mathbf{x}_i = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x_i \\ y_i \\ z_i \end{bmatrix} = \begin{bmatrix} h_{11}x_i + h_{12}y_i + h_{13}z_i \\ h_{21}x_i + h_{22}y_i + h_{23}z_i \\ h_{31}x_i + h_{32}y_i + h_{33}z_i \end{bmatrix} = \begin{bmatrix} \mathbf{h}_1^T \mathbf{x}_i \\ \mathbf{h}_2^T \mathbf{x}_i \\ \mathbf{h}_3^T \mathbf{x}_i \end{bmatrix}, \tag{3.2} \]

where $\mathbf{h}_1 = [h_{11}\ h_{12}\ h_{13}]^T$, $\mathbf{h}_2 = [h_{21}\ h_{22}\ h_{23}]^T$ and $\mathbf{h}_3 = [h_{31}\ h_{32}\ h_{33}]^T$. The cross product $\mathbf{x}'_i \times H\mathbf{x}_i = \mathbf{0}$ can be written as

\[ \mathbf{x}'_i \times H\mathbf{x}_i = \begin{bmatrix} y'_i\,\mathbf{h}_3^T\mathbf{x}_i - z'_i\,\mathbf{h}_2^T\mathbf{x}_i \\ z'_i\,\mathbf{h}_1^T\mathbf{x}_i - x'_i\,\mathbf{h}_3^T\mathbf{x}_i \\ x'_i\,\mathbf{h}_2^T\mathbf{x}_i - y'_i\,\mathbf{h}_1^T\mathbf{x}_i \end{bmatrix} = \mathbf{0}. \tag{3.3} \]

Using the fact that $\mathbf{h}_j^T\mathbf{x}_i$ is a scalar and $\mathbf{h}_j^T\mathbf{x}_i = \mathbf{x}_i^T\mathbf{h}_j$, Equation 3.3 can be rewritten as

\[ \begin{bmatrix} y'_i\,\mathbf{x}_i^T\mathbf{h}_3 - z'_i\,\mathbf{x}_i^T\mathbf{h}_2 \\ z'_i\,\mathbf{x}_i^T\mathbf{h}_1 - x'_i\,\mathbf{x}_i^T\mathbf{h}_3 \\ x'_i\,\mathbf{x}_i^T\mathbf{h}_2 - y'_i\,\mathbf{x}_i^T\mathbf{h}_1 \end{bmatrix} = \mathbf{0} \tag{3.4} \]

Equation 3.4 can be represented as a linear system:

\[ \begin{bmatrix} \mathbf{0}^T & -z'_i\,\mathbf{x}_i^T & y'_i\,\mathbf{x}_i^T \\ z'_i\,\mathbf{x}_i^T & \mathbf{0}^T & -x'_i\,\mathbf{x}_i^T \\ -y'_i\,\mathbf{x}_i^T & x'_i\,\mathbf{x}_i^T & \mathbf{0}^T \end{bmatrix} \begin{bmatrix} \mathbf{h}_1 \\ \mathbf{h}_2 \\ \mathbf{h}_3 \end{bmatrix} = \tag{3.5} \]

\[ \begin{bmatrix} 0 & 0 & 0 & -z'_i x_i & -z'_i y_i & -z'_i z_i & y'_i x_i & y'_i y_i & y'_i z_i \\ z'_i x_i & z'_i y_i & z'_i z_i & 0 & 0 & 0 & -x'_i x_i & -x'_i y_i & -x'_i z_i \\ -y'_i x_i & -y'_i y_i & -y'_i z_i & x'_i x_i & x'_i y_i & x'_i z_i & 0 & 0 & 0 \end{bmatrix} \mathbf{h} = A_i \mathbf{h} = \mathbf{0}, \tag{3.6} \]

where $\mathbf{h} = [h_{11}\ h_{12}\ h_{13}\ h_{21}\ h_{22}\ h_{23}\ h_{31}\ h_{32}\ h_{33}]^T$. Since the third row of $A_i$ is a linear combination of the first and second rows, the rank of the matrix $A_i$ is 2 [11].

Given 4 corresponding keypoint pairs $\mathbf{x}_1 \leftrightarrow \mathbf{x}'_1$, $\mathbf{x}_2 \leftrightarrow \mathbf{x}'_2$, $\mathbf{x}_3 \leftrightarrow \mathbf{x}'_3$ and $\mathbf{x}_4 \leftrightarrow \mathbf{x}'_4$, we can stack the 4 matrices $A_i$ to obtain a 12×9 matrix A [11]:

\[ \begin{bmatrix} A_1 \\ A_2 \\ A_3 \\ A_4 \end{bmatrix} \mathbf{h} = A\mathbf{h} = \mathbf{0}. \tag{3.7} \]

The rank of each $A_i$ is 2. If no three points are collinear, the rank of the matrix A is 8; hence the solution of this system is the null space of A [11].

There can be a case where 3 or more of the 4 keypoints in one image frame lie on a line (i.e., are collinear) while the corresponding keypoints in the other image frame are not collinear. In this case there exists no homography H, since H by definition preserves collinearity [11]. Another situation arises when 3 keypoints are collinear in both image frames. In this case H is degenerate and is not uniquely defined by the 4 correspondences [11]. No 3 of the 4 corresponding keypoints $\mathbf{x}_i \leftrightarrow \mathbf{x}'_i$ can be collinear in order for the DLT to find the homography H.

In practice we are usually able to obtain more than 4 keypoints for each image, and as a consequence there are more than 4 correspondences $\mathbf{x}'_i \leftrightarrow \mathbf{x}_i$. The positions of these points typically are not exact, and therefore $\mathbf{h} = \mathbf{0}$ is the only exact solution to Equation 3.7 [11]. Instead, the problem is reformulated as the minimization of $\|A\mathbf{h}\|$ subject to $\|\mathbf{h}\| = 1$. The solution to this optimization problem is the eigenvector of $A^T A$ corresponding to the smallest eigenvalue [11].
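A minimal NumPy sketch of this DLT step is shown below. It stacks only the first two rows of each $A_i$ (the third being linearly dependent) and assumes points are given with unit third coordinate; the right singular vector of A for the smallest singular value is the minimizer of $\|A\mathbf{h}\|$ subject to $\|\mathbf{h}\| = 1$, equivalent to the smallest eigenvector of $A^T A$:

```python
import numpy as np

def dlt_homography(pts, pts_prime):
    """Estimate H such that x' ~ H x from >= 4 correspondences (sketch).

    pts, pts_prime: (n, 2) arrays of matching image coordinates.
    """
    rows = []
    for (x, y), (xp, yp) in zip(pts, pts_prime):
        xi = np.array([x, y, 1.0])
        # First two rows of A_i from Equation 3.6, with z' = 1.
        rows.append(np.concatenate([np.zeros(3), -xi, yp * xi]))
        rows.append(np.concatenate([xi, np.zeros(3), -xp * xi]))
    A = np.array(rows)
    _, _, Vt = np.linalg.svd(A)
    # Right singular vector for the smallest singular value, as h.
    return Vt[-1].reshape(3, 3)
```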
3.2.1 Normalization

Because the DLT algorithm minimizes an algebraic error, it is not invariant to the coordinate frame of the keypoints [11]. Therefore, the sets of matching keypoints in the new image frame and in the list L are normalized individually using coordinate transformations denoted T' and T respectively. The normalized keypoint coordinates are then x̃_i = T x_i and x̃'_i = T' x'_i. The DLT algorithm is then applied to find the H̃ that relates the normalized keypoints x̃_i <-> x̃'_i such that x̃'_i = H̃ x̃_i. Finally, the homography H for the unnormalized points is calculated as follows:

    H = T'^{-1} \tilde{H} T    (3.8)

To calculate the normalization matrices T and T', Hartley and Zisserman [11] suggest translating each set of coordinates so that its centroid is located at the origin. Each set is then scaled such that the mean distance from the points to the origin is sqrt(2). Hence the matrix T for the coordinates {x_i} has the following form:

    T = \begin{bmatrix} \frac{\sqrt{2}}{m_{dst}} & 0 & -\frac{\sqrt{2}}{m_{dst}} m_x \\ 0 & \frac{\sqrt{2}}{m_{dst}} & -\frac{\sqrt{2}}{m_{dst}} m_y \\ 0 & 0 & 1 \end{bmatrix},    (3.9)

where m_dst is the mean distance from the set of coordinates to the origin, and m_x and m_y are the x and y coordinates of the mean of the set. The matrix T' for the coordinates {x'_i} is determined similarly.
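A sketch of this normalization step, again assuming inhomogeneous (n, 2) point arrays; Equation 3.8 then composes the final homography from the normalized estimate:

```python
import numpy as np

def normalization_matrix(pts):
    """Build T of Equation 3.9: move the centroid of the points to the
    origin and scale so their mean distance to the origin is sqrt(2)."""
    pts = np.asarray(pts, dtype=float)
    mx, my = pts.mean(axis=0)
    m_dst = np.linalg.norm(pts - [mx, my], axis=1).mean()
    s = np.sqrt(2.0) / m_dst
    return np.array([[s, 0.0, -s * mx],
                     [0.0, s, -s * my],
                     [0.0, 0.0, 1.0]])

# Normalized DLT: estimate H_tilde on normalized points, then denormalize:
# H = np.linalg.inv(T_prime) @ H_tilde @ T      (Equation 3.8)
```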
3.3 Minimizing geometric error

Since we are performing background subtraction, it is important that each pixel in the image frame is compared to the correct pixel in the background model. Hence each new image frame must be registered as precisely as possible to the background model.

The quality of a homography estimate obtained by minimizing algebraic error depends on the choice of coordinate systems for each image [5, 11]. This is partially improved by the normalization described in the previous section. A better estimate of the homography is obtained by minimizing a geometric distance over all pairs x'_i <-> x_i:

    \sum_i d(x_i, \hat{x}_i)^2 + d(x'_i, H \hat{x}_i)^2,    (3.10)

where d(x, y) is the Euclidean distance between the two points x and y, H is the homography being estimated, and x̂_i denotes the noise-free coordinates of the ith keypoint [5]. Although difficult to optimize, this geometric error is invariant to the choice of coordinate system for the images [5, 11]. In this thesis we use a first order approximation of the geometric error called the Sampson error [28]. A complete derivation of this error metric and a proof of its invariance to the choice of coordinate system are presented in [5].

To estimate the homography that minimizes the Sampson error we employ a non-linear least squares algorithm initialized with the homography H produced by the normalized DLT algorithm.

We have found experimentally that minimizing the Sampson error does not improve homography estimation significantly in our examples, and therefore this step is considered optional.

3.4 Feature points and inlier selection

The DLT algorithm needs to be provided with a set of correspondences x'_i <-> x_i. Hence we need to detect keypoints in each image such that two keypoints representing the same location in the scene in two different images can be matched to produce a pair x' <-> x. The DLT algorithm needs at least 4 correct matches.

We use the SIFT keypoints developed by Lowe [18]. The SIFT algorithm produces a large number of distinctive scale and rotation invariant features, located at the extremum points of difference-of-Gaussian functions convolved with the image at different scales. Furthermore, SIFT keypoints demonstrate low sensitivity to affine viewpoint changes of up to 30° [18].

In typical video sequences hundreds of SIFT keypoints are extracted from each image frame, but only a portion of these keypoints represent static background objects, since some keypoints are typically also extracted from moving objects. Similarly, the list L that contains the background model keypoints can also include keypoints from moving objects that once were in the camera's field of view. Furthermore, when the keypoints are matched and matching pairs x'_i <-> x_i are established, there can be incorrect matches. All these cases produce outliers: matches that do not fit the correct homography H between the image frame and the background model's coordinate frame.

We employ the RANSAC algorithm to find inliers and to estimate the homography H while ignoring outliers. This algorithm randomly samples data points, which in the case of homography estimation are corresponding matches x'_i <-> x_i, and tries to fit a model, the homography H, to these data points. After that, an error is calculated between the remaining data points and the model, and the data points are classified as outliers or inliers based on some threshold t. This process is repeated until the number of outliers is sufficiently small. A more detailed overview of this algorithm applied to homography estimation is presented in [11].

In order to distinguish inliers from outliers, RANSAC has to be able to calculate how well the estimated homography H fits all data points. For this purpose we use the symmetric transfer error as the distance measure:

    E(x_i, x'_i) = d(x_i, H^{-1} x'_i)^2 + d(x'_i, H x_i)^2,    (3.11)

where d(x, y) is the Euclidean distance between the two points x and y, x_i and x'_i are the coordinates of the corresponding keypoints, and H is the estimated homography matrix. A pair x'_i <-> x_i is considered an inlier if E(x_i, x'_i) < t for some predefined constant threshold t. The RANSAC algorithm used for this thesis is presented in Algorithm 3.1.

Require: set of matches M = {x'_i <-> x_i}, threshold t, desired probability p
  Normalize the x'_i and x_i, producing T', T and M̃ = {x̃'_i <-> x̃_i} as in Equation 3.9
  c <- 0
  N <- infinity
  M̃_best <- empty set
  while c < N do
    Randomly select 4 corresponding pairs x̃'_i <-> x̃_i
    Calculate the homography H̃ from the 4 random pairs by solving Equation 3.7
    Find the inliers M̃_in in M̃ by calculating the symmetric transfer error E_i for each pair x̃'_i <-> x̃_i as in Equation 3.11 and comparing it to the threshold t
    if |M̃_in| > |M̃_best| then
      M̃_best <- M̃_in
      N <- log(1 - p) / log(1 - (|M̃_best| / |M̃|)^4)
    end if
    c <- c + 1
  end while
  Perform a final least squares fit to find H̃ using all correspondences in M̃_best, solving Equation 3.7
  Denormalize H̃ to obtain H using T and T' as in Equation 3.8

Algorithm 3.1: RANSAC algorithm used for homography estimation
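Algorithm 3.1 can be sketched as follows. For brevity this version skips the normalization/denormalization steps and reuses dlt_homography from the earlier sketch; the threshold and probability defaults are illustrative, not the thesis settings.

```python
import numpy as np

def symmetric_transfer_error(H, x, xp):
    """E = d(x, H^-1 x')^2 + d(x', H x)^2 (Equation 3.11), vectorized."""
    def project(M, p):
        q = (M @ np.column_stack([p, np.ones(len(p))]).T).T
        return q[:, :2] / q[:, 2:3]
    Hinv = np.linalg.inv(H)
    return (np.sum((x - project(Hinv, xp)) ** 2, axis=1)
            + np.sum((xp - project(H, x)) ** 2, axis=1))

def ransac_homography(x, xp, t=9.0, p=0.99, rng=np.random.default_rng(0)):
    """Adaptive RANSAC over matches x <-> xp, both (n, 2) arrays."""
    best_inliers = np.zeros(len(x), dtype=bool)
    n_trials, trial = np.inf, 0
    while trial < n_trials:
        idx = rng.choice(len(x), 4, replace=False)
        H = dlt_homography(x[idx], xp[idx])
        inliers = symmetric_transfer_error(H, x, xp) < t
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
            w = best_inliers.mean()                      # inlier ratio
            n_trials = np.log(1 - p) / np.log(1 - w ** 4 + 1e-12)
        trial += 1
    # Final least squares fit over all inliers of the best model.
    return dlt_homography(x[best_inliers], xp[best_inliers]), best_inliers
```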
3.5 Decreasing the number of outliers

When the proportion of outliers grows above 50%, RANSAC often fails to produce a correct estimate in a reasonable amount of time [18]. Hence, in some cases where foreground objects occupy a large portion of an image and provide many keypoints, the number of inliers can be lower than the number of outliers. In order to decrease the number of outliers we employ a Hough transform [3] before applying RANSAC. The Hough transform is a method that uses a voting procedure to fit a parametric model to data points that contain outliers. It is not feasible to use the Hough transform to estimate all the parameters of a homography matrix; therefore we only utilize it to remove outlying matches x'_i <-> x_i.

Since we assume that background objects are static and sufficiently far away, we can also assume that all the keypoints representing these objects move with approximately the same speed as the camera pans across the scene. We use the Hough transform to find clusters of matching pairs x'_i <-> x_i that have similar vertical and horizontal displacements v_x and v_y, calculated by taking the difference between the x and y components of the points x'_i and H_p x_i. The homography H_p that relates the previous image frame to the background coordinate system is used to predict the motion of the camera.

To test the performance of the Hough transform we used the pan/tilt dataset found in [41]. In this video sequence, taken indoors, a pan/tilt camera follows a person's hand holding a box. A particular difficulty with this dataset is that both the hand and the box occupy a large portion of most frames, producing many outliers. Every 4th frame of the sequence was loaded and a background mosaic was built by overlaying each matched frame on top of the existing mosaic. Every time a new image frame was overlaid, the mosaic was saved to a corresponding image file.

Figure 3.1 shows two mosaics after frame 128 has been overlaid. The mosaic built with the Hough transform applied before RANSAC is shown in Figure 3.1a, while the mosaic built without the Hough transform is shown in Figure 3.1b. The new image frame that is matched and overlaid on the mosaic is outlined in blue. All keypoints extracted from the new image frame that were matched to keypoints in the background mosaic and then passed to RANSAC are marked by red and green crosses. The red crosses denote keypoints that were determined to be outliers by RANSAC, while the green crosses denote inliers.

The mosaic built using the Hough transform, displayed in Figure 3.1a, shows that fewer keypoints were passed to RANSAC compared to the mosaic shown in Figure 3.1b. Almost all the keypoints extracted from the hand and from the box were filtered out by the Hough transform, since their motion is not consistent with the motion of the keypoints representing the background objects. All the background objects in the mosaic show no signs of distortion and align properly with the background objects in the new frame outlined in blue.

Conversely, the new frame outlined in blue in Figure 3.1b contains many keypoints that were extracted from both the hand and the box. As the green crosses show, only a small percentage of the matched pairs were determined to be inliers by RANSAC, due to the large number of outlying matches from the hand and the box. The homography obtained by RANSAC using only a small number of inliers is inaccurate, as background objects, including the chair and the cabinet, are not matched precisely to the same objects in the mosaic.

These observations suggest that the Hough transform improves the ability of RANSAC to estimate the homography in cases where there is a high proportion of outliers (i.e., foreground objects occupying a large portion of an image frame).

Figure 3.1: Background mosaics built (a) with and (b) without the Hough transform as a pre-filter. Red and green crosses denote all the keypoints that were passed to the RANSAC algorithm. Keypoints that are classified as outliers by RANSAC are red, while keypoints that are inliers are green.
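The displacement voting described above can be sketched as a coarse 2-D histogram; the bin size and the rule of keeping only the single dominant (v_x, v_y) bin are simplifications of our own.

```python
import numpy as np

def hough_prefilter(x, xp, H_prev, bin_size=4.0):
    """Keep only matches whose displacement v = x' - H_prev * x falls in
    the dominant (vx, vy) bin, assuming background points share one motion.

    x, xp: (n, 2) matched coordinates; H_prev: homography of the previous
    frame, used to predict the camera motion.  Returns a boolean mask.
    """
    xh = np.column_stack([x, np.ones(len(x))])
    pred = (H_prev @ xh.T).T
    pred = pred[:, :2] / pred[:, 2:3]
    disp = xp - pred                                    # (vx, vy) per match
    bins = np.floor(disp / bin_size).astype(int)
    # Vote: find the most populated displacement cell.
    cells, inverse, counts = np.unique(bins, axis=0,
                                       return_inverse=True,
                                       return_counts=True)
    return inverse == np.argmax(counts)
```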
A complete workflow of the algorithm that matches a new image frame to the background coordinate system and estimates the homography H between the new frame and the background coordinate system is summarized in Algorithm 3.2.

Require: new frame F_t, list of background keypoints L
  if |L| = 0 then
    (this is the very first frame of the sequence)
    L <- all keypoints from F_t
    H <- Identity
  else
    P <- all keypoints from F_t
    Find all matches between keypoints, M_all = P <-> L
    Pre-filter M_all to remove outliers using the Hough transform
    Find the inliers M_in in M_all and the homography H_init using RANSAC and the normalized DLT as in Algorithm 3.1
    Optional: find the homography H by minimizing the Sampson error, using H_init and M_in to initialize the algorithm
    Update the list of background keypoints L
  end if

Algorithm 3.2: The complete homography estimation algorithm

3.6 Keypoints update strategies

Algorithm 3.2 specifies that the list of background keypoints L has to be updated. There are several methods to update the keypoints in the background model. First, the list of keypoints L can be recalculated every time a new frame has been matched, using the image representing all the static objects in the scene as obtained via background subtraction. Second, the keypoints from the new frame can be warped using the estimated homography H and appended to the list of background model keypoints. Each of these two approaches has advantages and disadvantages.

3.6.1 Recalculating keypoints

One way to acquire keypoints that represent the background is to create an image representing all the static objects (i.e., a background mosaic) and to recalculate keypoints from this image every time a new image frame is obtained. All the keypoints in the list L are then replaced with the ones calculated from the background mosaic.

This approach has several advantages. First, the number of keypoints representing the background is kept at its minimum, as the list L only contains keypoints obtained from the mosaic. The number of keypoints increases only when the background mosaic grows as new scene areas become visible. This is useful when the background subtraction algorithm runs for a long period of time and there is a constraint on the memory available for the list L. Second, since the background mosaic only contains background objects, all the keypoints in the list L represent the actual background. Hence there will be no outliers due to incorrect matches with foreground objects.

However, there are two significant drawbacks to this approach. First, the background mosaic obtained from the background subtraction module is a combined image consisting of all warped image frames that have been seen by the background subtraction algorithm so far. Parts of this mosaic can represent viewpoint angle changes larger than 30° and be warped to such a degree that SIFT keypoints extracted from these parts do not match accurately with keypoints calculated from an image frame representing the same part of the scene but taken with the camera pointing directly at it. This can lead to an incorrectly estimated homography.

Second, the requirement to explicitly create an image that represents the static objects in the scene imposes additional functionality constraints on the background subtraction algorithm. A goal is to make background subtraction algorithms pluggable. Therefore the motion estimation system must be decoupled from the background subtraction system.
In addition, some background subtraction algorithms represent the background model in such a way that there is no reliable way to create an image that only represents background objects. For example, the KDE background subtraction algorithm presented in the next chapter represents the background model as a collection of samples taken at different times, and there is no reliable or quick way to find the samples that represent static objects (an example of a slow approach is the mean-shift algorithm [10]).

3.6.2 Appending keypoints

Another way to maintain the list of keypoints that represent the background is to append some of the keypoints from each new image frame to the list as the frame is processed. To do this we extract all SIFT keypoints from the new frame and calculate the homography H as described above. Finally, the keypoints' coordinates are transformed by the homography H and all or some of the new keypoints are added to the list L.

There are several possible policies to determine which of the new frame's keypoints get added to the list L:

• Blindly add all keypoints from the new frame. This policy leads to the list L growing very quickly. In addition, the list L will eventually contain many redundant keypoints representing both background and moving objects that consistently appear in the camera's frustum. This leads to a large number of outliers for the RANSAC algorithm, affecting its performance. A large number of keypoints representing foreground objects might eventually lead to RANSAC failing to estimate a homography correctly. Due to these issues we consider this policy unsuitable for the required task.

• Only add keypoints that have been matched. This policy is also unfit, since most of the keypoints obtained from a new image frame that represent previously unseen parts of the scene will not have matching keypoints in the list L. Therefore these keypoints will never be incorporated into the list L, and eventually the algorithm will not be able to calculate a homography for an image frame that mostly contains previously unseen parts of the scene.

• Only add keypoints that do not have matches. This policy allows keypoints representing new parts of the scene to be added to the list L, as these keypoints are not matched initially. Once these keypoints are in the list L they will not be added again in subsequent frames, hence there will be no redundant keypoints representing the same objects. Although some keypoints representing foreground objects always get added to the list L, this only happens once per foreground object, and the Hough transform and/or RANSAC can successfully filter them out as outliers. A sketch of this policy is given below.

Appending keypoints eliminates both disadvantages associated with recalculating the keypoints each time a new image frame is obtained. First, since the keypoints in the list L are not extracted from a warped background image but come from undistorted image frames, the homography H is estimated equally well for all camera angles. Second, the background subtraction algorithm does not need to produce an explicit image of the background, decreasing the coupling between these modules.
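The selected policy can be sketched as follows, under the assumption that keypoints are stored as parallel coordinate and descriptor arrays; the data layout and names are illustrative.

```python
import numpy as np

def append_unmatched(L_xy, L_desc, new_xy, new_desc, matched_mask, H):
    """Append the new frame's unmatched keypoints to the background list L,
    after warping their coordinates into the background coordinate frame.

    L_xy: (m, 2) coordinates and L_desc: (m, k) descriptors of the list L;
    new_xy, new_desc: the new frame's keypoints; matched_mask[i] is True
    if new keypoint i was matched to some entry of L.
    """
    unmatched = ~matched_mask
    pts = np.column_stack([new_xy[unmatched], np.ones(unmatched.sum())])
    warped = (H @ pts.T).T
    warped = warped[:, :2] / warped[:, 2:3]             # back to inhomogeneous
    return (np.vstack([L_xy, warped]),
            np.vstack([L_desc, new_desc[unmatched]]))
```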
However, appending keypoints also has disadvantages. First, there is no way to detect whether a keypoint represents a background or a foreground object. Therefore some foreground keypoints will eventually be added to the list of background keypoints L. The experiments that we conducted showed that these keypoints do not affect the quality of the estimated homography, as they represent only a minority of the matches seen by the RANSAC algorithm. Second, the number of keypoints in the list L grows large as the video sequence progresses and more keypoints are added to the list.

3.6.3 Comparison of strategies

We performed an experiment to compare the strategy that recalculates keypoints against the strategy that appends unmatched keypoints from the new image frame to the list of background keypoints L.

Figure 3.2 shows two mosaics created from an image sequence consisting of 2122 frames. This image sequence was taken with a pan/tilt camera in an outdoor environment during the daytime. Every 10th frame was used in building each mosaic, hence 212 frames were used altogether. Each background mosaic was built by warping each new image frame of the sequence and overlaying it on the existing mosaic. The mosaic shown in Figure 3.2a was built using the keypoint update strategy that recalculates keypoints from the mosaic before registering each new image frame. The mosaic shown in Figure 3.2b was created using the strategy that appends the unmatched keypoints of each new frame to the list of background keypoints.

Figure 3.2: Background mosaics built by (a) recalculating keypoints and (b) appending keypoints to the list

There are visible artifacts in Figure 3.2a. Parts of the building and the tree in the left part of the mosaic do not match. This is also seen clearly along the curb of the road. There is a similar mismatch in the right part of the mosaic. Figure 3.3 shows the locations of the artifacts outlined in red. Conversely, the mosaic in Figure 3.2b does not contain such artifacts. As discussed above, the reason for the mismatches seen in Figure 3.2a is that the mosaic becomes warped at the extremes as the horizontal angle of the mosaic widens. The keypoints extracted from these parts of the scene do not match well to the keypoints calculated from the new image frame, resulting in an incorrectly estimated homography.

Figure 3.3: Background mosaic with misaligned areas outlined in red

Figure 3.4: Number of keypoints in the list L (total keypoints versus frame number) for the appending and recalculating strategies

Figure 3.4 plots the number of keypoints in the list L at each frame of the sequence for both strategies. At the last frame of the image sequence the lists L contained 3197 and 10510 keypoints for the recalculating and appending strategies respectively. From the plot in Figure 3.4 it is clear that recalculating keeps the number of keypoints in the list L at a minimum, while the number of keypoints in the list L updated using the appending strategy increases with every frame of the sequence. Although the size of the list L for the appending strategy grows with each frame, the rate at which it increases drops. This is because, even though keypoints from new areas are being added to the list, keypoints from subsequent frames that represent features already present in the list are not added. Thus the number of keypoints increases rapidly as unseen areas become visible; subsequently, only a few new keypoints are added.

The experiment demonstrates that although appending keypoints to the list significantly increases the number of keypoints at the beginning of a sequence, the rate of growth decreases as ultimately all parts of the scene are seen.
As demonstrated in Figure 3.2, the appending strategy produces a better homography estimate, which results in better matches between each new image frame and the background model; this is critical for correct background subtraction. For this reason the strategy that appends unmatched keypoints to the list was selected for the final algorithm.

Chapter 4

Background Subtraction

The previous chapter describes the part of the system that deals with motion compensation and with aligning a new image frame to the background model's coordinate system. This chapter covers the part of the system that performs background subtraction and maintains the background model for a pan/tilt camera. The background model is likely to be larger than the camera's field of view, since it incorporates all the previously seen parts of the scene.

From the image registration subsystem the background subtraction system receives a new image frame and a homography matrix H that transforms the new frame to the background model's coordinate frame. Depending on the matrix H, the current background model and the new frame, there are several possible scenarios that the background subtraction subsystem might face:

1. There is a one-to-one correspondence between each pixel in the new image frame and each pixel in the background model.

2. The new image frame completely or partially overlaps the background model and, therefore, contains previously unseen parts of the scene.

3. The new image frame is smaller than the background model and thus fits entirely inside the background model, with some parts of the background model not seen by the new frame.

4. The new image frame does not overlap the background model at all. Thus, it completely lies outside of the background model's boundaries.

Figure 4.1: Four situations that a pan/tilt background subtraction algorithm might encounter. One-to-one correspondence is shown in Figure 4.1a; the new image frame partially or completely overlaps the BG model in Figure 4.1b; the new image frame fits inside the BG model in Figure 4.1c; and the new image frame lies outside of the BG model in Figure 4.1d.

The first scenario, displayed in Figure 4.1a, is the simplest. It happens when the camera is stationary, and it is the usual assumption made by background subtraction algorithms designed for a static camera. In this case each pixel of the background model is always visible in every frame of the image sequence and there is a one-to-one correspondence between pixels in each image frame and pixels in the background model.

Figure 4.1b shows the second case. This situation can happen when the camera starts panning and previously unseen areas of the scene become visible. If this scenario occurs during the training phase, the background subtraction algorithm must be able to grow the background model and incorporate the new information into the model. This case becomes more difficult when it happens during the testing phase, after the training of the background subtraction algorithm has finished. The algorithm must be able to recognize this case and to train parts of the model on the fly during the testing phase.

The third case, shown in Figure 4.1c, happens when the model already contains a large portion of the scene and, therefore, is larger than a single frame. There are no previously unseen parts of the scene in the frame.
All pixels in the new frame correspond to some pixels in the background model, but not all pixels in the background model correspond to pixels in the new frame. In this case the background subtraction algorithm must be able to update only those pixels in the model that are in the camera's field of view, during either the training or the testing phase.

The final case, shown in Figure 4.1d, should be unlikely. If the camera's motion is slow compared to the frame rate, each image frame will contain parts of the previous frame and the new frame will overlap the background model at least partially. This situation can also happen if the homography H is not estimated correctly by the image registration module. Due to its improbability, this case is not considered by the background subtraction module.

Therefore, to address the three cases of interest, a pan/tilt background subtraction algorithm must be able to:

• grow the background model as previously unseen parts of the scene become visible, and train these new parts of the model on the fly during the testing phase;

• update pixels in the model selectively, such that only the pixels corresponding to the currently visible parts of the scene are updated.

Extensions to support pan/tilt cameras were developed for three state of the art background subtraction algorithms, as described in the remainder of this chapter.
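Common to all three extensions is the check for Case 2: deciding whether, and how far, the model must grow reduces to warping the corners of the new frame by H and comparing the resulting bounding box with the current model extent. A minimal sketch, with the bookkeeping for shifting the model origin left implicit:

```python
import numpy as np

def warped_frame_bounds(H, w, h):
    """Bounding box of a w x h frame after warping by H, in model coords."""
    corners = np.array([[0, 0, 1], [w, 0, 1], [w, h, 1], [0, h, 1]], float)
    p = (H @ corners.T).T
    p = p[:, :2] / p[:, 2:3]
    return p.min(axis=0), p.max(axis=0)      # (min_xy, max_xy)

def needs_growing(H, w, h, model_min, model_max):
    """True if the warped frame leaves the current model extent (Case 2)."""
    lo, hi = warped_frame_bounds(H, w, h)
    return bool((lo < model_min).any() or (hi > model_max).any())
```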
First, the distribution N (xjmk;t 1;Sk;t 1) that is nearest to Xt is deter-mined by finding the smallest Mahalanobis distance between the pixel sample Xt1 Usually the value of K is between 3 and 5 [31]41and the mean mk;t 1:i = argmink2f1:::Kgq(Xt mk;t 1)TS 1k;t 1(Xt mk;t 1) (4.2)Then an indicator variable Mk;t is calculated for each distribution:Mk;t =8<:1; if k = i andq(Xt mk;t 1)TS 1k;t 1(Xt mk;t 1) <D0; otherwise; (4.3)where D is a global threshold that determines when a pixel is considered to bea match to a distribution. For K distributions representing a single pixel in thebackground model, at most one Mk;t can be equal to 1.In a case when there is no matching distribution, i.e., Mk;t = 0 for all k 2f1:::Kg, the least probable mixture distribution is replaced with a new distribu-tionN (xjXt;s2initI), with Xt as mean and variance set to some large value s2init andweight is set to some small value winit. There are several ways to define the leastprobable distribution. Here, the following formula that prefers a distribution witha large variance and a small weight is used:argmink2f1:::Kgwk;t 1s2k;t 1 (4.4)The weights of all distributions in the mixture are then adjusted using a constantlearning factor a:wk;t =(1 a)wk;t 1 +aMk;t; (4.5)and renormalized so they still add up to 1:wk;t = wk;t Kl=1 wl;t: (4.6)Finally parameters of the ith distribution for which Mi;t = 1 are updated asfollows:mi;t =(1 r)mi;t 1 +rXt; (4.7)s2i;t =(1 r)s2i;t 1 +r(Xt mt)T(Xt mt); (4.8)42where r is a learning factor which is constant for all pixels in all image frames.For distributions with Mk;t = 0 the parameters remain the same:mk;t = mk;t 1 (4.9)s2k;t =s2k;t 1 (4.10)After the mixture parameters are adjusted the next task is to determine whichmixture distributions represent background processes. All K Gaussian distributionsare sorted by wk;ts2k;tin decreasing order. The distributions that are the most probableand have the largest evidence (i.e., have a large weight and a small variance) comebefore distributions that have a low weight and a large variance. Then the first B(B K) distributions are selected such that:B = argminb2f1:::Kg b k=1wk;t > T!; (4.11)where T;0 T  1 is a threshold that determines how much evidence should beconsidered as the background model. If T is closer to 1 then more than one distri-bution can be assigned to represent the background model thus allowing for multi-modal backgrounds.Finally the pixel Xt is classified as foreground or background based on whetherthe distribution that it matched in Equation 4.3 is determined to be foreground orbackground.When Case 2 occurs the size of the background model must be increased torepresent newly seen parts of the scene. A MOG background model extends easily.As the model grows, each of K Gaussians that represent a new pixel is initializedwith random means and a large s2init. The weights w are set to 0, except for thefirst distribution which is set to 1. Usually one of the distributions gets replacedimmediately after initialization during the background subtraction. After havingprocessed several frames the algorithm gains enough evidence to reweigh distribu-tions to produce valid classification results. 
The pan/tilt MOG algorithm used here is given as Algorithm 4.1.

Require: homography H, new frame F_t, background model M
  F̂_t <- F_t transformed by H
  if F̂_t overlaps M then
    Grow M:
    for all new pixels in M do
      Randomly initialize μ_k for the K distributions
      Set the variances to σ²_init
      Set w_k = 0 for k ∈ {2…K}
      Set w_k = 1 for k = 1
    end for
  end if
  for all pixels X_t in F̂_t do
    Find the corresponding mixture P_{t-1} in M
    Find i as in Equation 4.2
    Calculate M_{k,t} for each distribution as in Equation 4.3
    if M_{k,t} = 0 for all k ∈ {1…K} then
      Find the least probable distribution as in Equation 4.4
      Replace it with the new distribution N(x | X_t, σ²_init I) and w = w_init
    end if
    for all k ∈ {1…K} do
      Calculate w_{k,t} as in Equation 4.5
    end for
    Renormalize the weights as in Equation 4.6
    if M_{i,t} = 1 then
      Calculate μ_{i,t} and σ²_{i,t} as in Equation 4.7 and Equation 4.8
    end if
    Sort the distributions by w_{k,t}/σ²_{k,t}
    Find the B distributions that represent the background as in Equation 4.11
    if M_{b,t} = 1 for any b ∈ {1…B} then
      X_t is background
    else
      X_t is foreground
    end if
  end for

Algorithm 4.1: Mixture of Gaussians for a pan/tilt camera

4.2 Non-parametric kernel density estimation

A modified version of the non-parametric kernel density estimation algorithm described by Elgammal et al. [6] is used as the second background subtraction module. Some aspects of the Elgammal approach are not included in this thesis.

Non-parametric kernel density estimation background subtraction is based on the Parzen window method [23]. Each pixel in the background model is represented by a probability density function P(x) calculated from the N most recent pixel samples {X_1, …, X_N} as follows:

    P(X_t) = \frac{1}{N} \sum_{i=1}^{N} K(X_t - X_i),    (4.12)

where X_t is a pixel sample in a new image frame that needs to be classified and K is a kernel function. Elgammal et al. use a Gaussian kernel K(x | μ, Σ) with mean μ and covariance matrix Σ. As in other approaches [12, 20, 26, 31], Σ is assumed to be diagonal, so all color channels are considered independent. Unlike the algorithm of Section 4.1, however, the covariance matrix is not a scalar matrix. The probability of a new pixel X_t is:

    P(X_t) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma_j^2}} e^{-\frac{1}{2}\frac{(X_{t_j} - X_{i_j})^2}{\sigma_j^2}},    (4.13)

where d is the number of color channels in each pixel, σ_j is the standard deviation of the jth channel, and X_{i_j} is the jth channel of the ith pixel sample in the model.

To estimate Σ, the median of the absolute deviations between consecutive samples is used:

    \sigma_j = \frac{m_j}{0.68\sqrt{2}},    (4.14)

where σ_j is the standard deviation of the jth channel and m_j is the median of the absolute deviations of the jth channel over N consecutive samples [6].

Assuming that most of the samples X_1, …, X_N used to build the model belong to the background, a new pixel X_t is classified as foreground if P(X_t) < T, where T is a global threshold [6].
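The density of Equation 4.13, with the bandwidth estimate of Equation 4.14, can be sketched per pixel as follows; the small floor added to σ is our own guard against a zero deviation.

```python
import numpy as np

def kde_background_prob(x, samples):
    """P(x) from N recent samples (Equation 4.13), diagonal Gaussian kernel.

    x: (d,) pixel; samples: (N, d) sample history for this pixel.
    Bandwidth per channel comes from the median absolute deviation of
    consecutive samples (Equation 4.14).
    """
    dev = np.abs(np.diff(samples, axis=0))      # |X_i - X_{i+1}| per channel
    m = np.median(dev, axis=0)
    sigma = m / (0.68 * np.sqrt(2.0)) + 1e-6    # Eq. 4.14
    z = (x - samples) / sigma
    kernels = np.exp(-0.5 * z ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    return kernels.prod(axis=1).mean()          # product over channels, mean over N

# Classification: foreground if kde_background_prob(x, samples) < T.
```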
To accommodate a changing background, the background model needs to be updated. There are several ways to do this. One way is to add only pixels that are classified as background. This approach improves the detection of foreground pixels, since only background samples are collected. It can, however, lead to a problem if some pixels are misclassified: these pixels will be consistently misclassified in future image frames. Another method involves blindly adding samples to the model regardless of their classification label. This approach avoids consistently misclassified pixels, but leads to a larger number of foreground pixels being classified as background, since foreground pixels can get incorporated into the background model.

As with Elgammal et al., two background models are maintained: short-term and long-term. These models are essentially FIFO lists, each keeping at most N_max pixel samples. The short-term model stores the most recent samples that are classified as background according to Table 4.1. This model adapts very quickly to changes in the background process, which results in a higher sensitivity to a changing background. The long-term model uses a blind update mechanism and stores samples taken over a large window in time. The size of the window depends on the background sampling period τ: a pixel sample is added to the model every τ frames, regardless of whether the pixel is classified as foreground or background. This model adapts to changes more slowly than the short-term model.

Short-term model    Long-term model    Final result
foreground          foreground         foreground
foreground          background         foreground
background          foreground         special*
background          background         background

* The pixel is classified as background if at least one pixel in its 8-neighborhood is classified as background in both models.

Table 4.1: Intersection rule for short-term and long-term models [36]

This algorithm is trained by collecting N samples for each pixel in the model. With a pan/tilt camera, both Case 2 and Case 3 shown in Figure 4.1 can lead to a situation where N is not the same for all pixels in the model. For instance, during the training phase the camera can pan across the scene, resulting in some parts of the scene being visible only for a short period of time. An example of such a situation is shown in Figure 4.2. The right side of the building and the intersection visible in frame 100 (Figure 4.2a) are not visible in frames 200 and 300 (Figure 4.2b and Figure 4.2c). Hence, assuming smooth camera motion between these frames, the part of the model that represents the white building in frames 200 and 300 will contain more samples after training than the part of the model that represents the intersection and the right part of the building. Similarly, previously unseen areas of the scene can become visible after training has been performed, leaving the model without any samples to classify pixels in these new areas.

Figure 4.2: Frames 100, 200 and 300 of an image sequence taken with a pan/tilt camera. Frames 200 and 300 show parts of the building not seen in frame 100.

To solve this problem the algorithm must be able to collect evidence and update the background model after training has been performed. We do this by blindly updating the short-term model for pixels for which N < N_max, where N_max is the maximum allowed number of samples in the model. This lets the background model quickly collect enough evidence to start successfully classifying pixels in previously unseen parts of the scene. The long-term background model is updated as usual. The complete algorithm is presented in Algorithm 4.2.

Require: homography H, new frame F_t, background model M
  F̂_t <- F_t transformed by H
  if F̂_t overlaps M then
    Grow M:
    for all new pixels in M do
      List of short-term samples <- empty
      List of long-term samples <- empty
    end for
  end if
  for all pixels X_t in F̂_t do
    Find the corresponding long-term samples {Ẋ_1 … Ẋ_N} in M
    Estimate Σ̇ for the long-term samples as in Equation 4.14
    Calculate Ṗ(X_t) for the long-term samples as in Equation 4.13
    if Ṗ(X_t) < T then
      L̇ = foreground
    else
      L̇ = background
    end if
    Find the corresponding short-term samples {X̂_1 … X̂_M} in M
    Estimate Σ̂ for the short-term samples as in Equation 4.14
    Calculate P̂(X_t) for the short-term samples as in Equation 4.13
    if P̂(X_t) < T then
      L̂ = foreground
    else
      L̂ = background
    end if
    Classify the pixel X_t using L̇ and L̂ as in Table 4.1
    if X_t is background or |{X̂_1 … X̂_M}| < N_max then
      Add X_t to the short-term list of samples
    end if
    if the last update was more than τ frames ago then
      Add X_t to the long-term list of samples
    end if
  end for

Algorithm 4.2: Non-parametric kernel density estimation for a pan/tilt camera
Like the other47Require: homography H, new frame Ft, background model MˆFt  Ft transformed by Hif ˆFt overlaps M thenGrow M:for all new pixels in M doList of short term samples /0List of long term samples /0end forend iffor all pixels Xt in ˆFt doFind corresponding long term samplesf˙X1 ::: ˙XNgin MEstimate ˙S for long term samples as in Equation 4.14Calculate ˙P(Xt) for long term samples as in Equation 4.13if ˙P(Xt) < T then˙L =foregroundelse˙L =backgroundend ifFind corresponding short term samplesfˆX1 ::: ˆXMgin MEstimate ˆS for short term samples as in Equation 4.14Calculate ˆP(Xt) for short term samples as in Equation 4.13if ˆP(Xt) < T thenˆL =foregroundelseˆL =backgroundend ifClassify pixel Xt using ˙L and ˆL as in Table 4.1if Xt is background orjfˆX1 ::: ˆXMgj< Nmax thenAdd Xt to a short-term list of samplesend ifif Last update was done in more than t frames ago thenAdd Xt to a long-term list of samplesend ifend forAlgorithm 4.2: Non-parametric kernel density estimation for a pan/tilt cam-era48two background subtraction algorithms described in this chapter the CB algorithmneeds to be modified in order to be able to handle both Case 2 and Case 3.Each pixel in the background model is represented by a codebook C which is alist of codewordsfc1;c2;:::cLg. Each codeword ci =(vi;auxi), where vi is a colorvector ( ˜Ri; ˜Gi; ˜Bi) and auxi is a tuple that contains all the additional information forthe codeword ci. Kim et al. store the following variables in auxi: ˇIi; ˆIi are the minimum and the maximum brightness values that are associatedwith the codeword ci. These values are used to match a new pixel to thecodeword. Kim et al. use the magnitude of (R;G;B) vector as the measureof brightness. fi is a number of times the codeword has been accessed during the trainingphase. li is a maximum negative run-length, i.e., the number of frames during thetraining period for which the codeword ci has not been matched. pi;qi are the numbers of the first and the last frames during the training phasefor which the ci has been matched.Again, there can be cases during the training phase when some parts of the scenestay in the camera’s frustum for longer than other parts of the scene. To take thisinto account we add an additional variable a to the codebook C which stores thetotal time (i.e., the number of frames) the current pixel appeared in the camera’sfield of view. Every time a codebook is accessed its corresponding a is increasedby 1. Hence in our extended model the codebook C is represented by a tuple(a;fc1;:::cLg).Two functions p :(R3;R3)!fTrue;Falsegand r :(R;R;R)!fTrue;Falsegare defined to match a pixel to a codeword. The first function p takes two colorvectors and returns True if these two vectors are close to each other according tosome metric and False otherwise. Similarly, r accepts a pixel’s brightness valueand maximum and minimum codeword brightness values, ˆIi and ˇIi respectively, andreturns True if the pixel’s brightness matches the codeword brightness accordingto some metric and False otherwise. Hence a pixel Xt = (Rt;Gt;Bt) matches acodeword ci if both p(vi;Xt)= True and r(pR2t +G2t +B2t ; ˆIi; ˇIi)= True.49The function r(It; ˇIi; ˆIi)!fTrue;Falsegchecks whether the pixel’s brightnessIt is within a range specified by Ilow and Ihi and is defined as follows:r(It; ˇIi; ˆIi)=8<:True; if Ilow It  IhiFalse; otherwise; (4.15)where Ilow = a ˆIi and Ihi = min b ˆIi; ˇIia . 
I_low and I_hi depend on the global constants α < 1 and β > 1, which specify the valid brightness range for a match between a codeword and a pixel. Decreasing the parameter α allows matches between dark pixels and bright codewords, while increasing the parameter β allows matches between bright pixels and dark codewords.

The function π(v_i, X_t) measures the similarity between a codebook color vector v_i and a pixel value X_t. In Kim et al. this function is defined as follows:

    \pi(v_i, X_t) = \begin{cases} \text{True}, & \text{if } d = \sqrt{\|X_t\|^2 - \frac{\langle X_t, v_i \rangle^2}{\|v_i\|^2}} < \varepsilon \\ \text{False}, & \text{otherwise,} \end{cases}    (4.16)

where d represents the distance between the two color vectors and ε is a global threshold that determines the match boundary.

The complete color model is shown in Figure 4.3. The distance between the pixel color vector X_t and the codeword vector v_i is d, and I_low = α Î_i and I_hi = min(β Î_i, Ǐ_i / α) determine the valid match range for the projection p of X_t onto v_i. If d is less than the threshold ε and the length of the vector X_t is within the range defined by I_low and I_hi, then the pixel is considered to match the codeword.

Figure 4.3: The color model of Kim et al. Vectors v_i and X_t denote a codeword and a pixel in (R, G, B) space. The decision boundary, outlined in blue, is determined by the threshold (radius) ε and the upper and lower limits I_hi and I_low; d denotes the distance between the pixel and the codeword vector. In this case d > ε and the pixel is not matched to the codeword.

During the training phase, for each new pixel X_t = (R_t, G_t, B_t), where 1 ≤ t ≤ N and N is the total number of frames in the training sequence, the algorithm tries to find a match between the existing codewords in the codebook C and the pixel, using the functions π and ρ. If a matching codeword c_i = (v_i, aux_i) is found, its vectors v_i = (R̃_i, G̃_i, B̃_i) and aux_i = (Ǐ_i, Î_i, f_i, λ_i, p_i, q_i) are updated as follows:

    v_i \leftarrow \left( \frac{f_i \tilde{R}_i + R_t}{f_i + 1}, \frac{f_i \tilde{G}_i + G_t}{f_i + 1}, \frac{f_i \tilde{B}_i + B_t}{f_i + 1} \right)    (4.17)
    \check{I}_i \leftarrow \min(I_t, \check{I}_i)    (4.18)
    \hat{I}_i \leftarrow \max(I_t, \hat{I}_i)    (4.19)
    f_i \leftarrow f_i + 1    (4.20)
    \lambda_i \leftarrow \max(\lambda_i, a - q_i)    (4.21)
    p_i \leftarrow p_i    (4.22)
    q_i \leftarrow a,    (4.23)

where I_t = sqrt(R_t² + G_t² + B_t²). Otherwise, if no matching codeword is found, the algorithm adds a new codeword c_j to C:

    v_j \leftarrow (R_t, G_t, B_t)    (4.24)
    \check{I}_j \leftarrow I_t    (4.25)
    \hat{I}_j \leftarrow I_t    (4.26)
    f_j \leftarrow 1    (4.27)
    \lambda_j \leftarrow a - 1    (4.28)
    p_j \leftarrow a    (4.29)
    q_j \leftarrow a    (4.30)

Finally, regardless of whether a match was found, the active frame count a for the codebook C is increased by 1. After all N frames are processed, each codeword c_i in the codebook C has its maximum negative run-length wrapped around:

    \lambda_i \leftarrow \max(\lambda_i, a - q_i + p_i - 1)    (4.31)

This follows from the assumption that the training sequence for each pixel is viewed as a looped sequence of a consecutive frames. Note that a typically has different values for different pixels.
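The matching tests and the codeword update (Equations 4.15 to 4.23) can be sketched directly; the dict-based codeword layout is illustrative, and the α, β defaults simply mirror the outdoor-experiment settings reported in Chapter 5.

```python
import numpy as np

def rho_match(I_t, I_min, I_max, alpha=0.7, beta=1.5):
    """Brightness test rho of Equation 4.15 (I_min, I_max are I-check, I-hat)."""
    return alpha * I_max <= I_t <= min(beta * I_max, I_min / alpha)

def pi_match(v, x, eps=10.0):
    """Color test pi of Equation 4.16: distance of x from the line through v."""
    d2 = np.dot(x, x) - np.dot(x, v) ** 2 / np.dot(v, v)
    return np.sqrt(max(d2, 0.0)) < eps

def update_codeword(cw, x, a):
    """Update a matched codeword as in Equations 4.17-4.23.

    cw: dict with keys v (np array), I_min, I_max, f, lam, p, q.
    a: frames this pixel has spent in the field of view so far.
    """
    I_t = float(np.linalg.norm(x))
    cw['v'] = (cw['f'] * cw['v'] + x) / (cw['f'] + 1)   # Eq. 4.17
    cw['I_min'] = min(I_t, cw['I_min'])                 # Eq. 4.18
    cw['I_max'] = max(I_t, cw['I_max'])                 # Eq. 4.19
    cw['f'] += 1                                        # Eq. 4.20
    cw['lam'] = max(cw['lam'], a - cw['q'])             # Eq. 4.21
    cw['q'] = a                                         # Eq. 4.23 (p stays fixed)
```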
Since moving objects can also be present in the training sequence, the algorithm will also create codewords representing these moving objects. Therefore, the last step of the training phase consists of removing such codewords. Kim et al. call this the temporal filtering step. All codewords that are not accessed for a long time during the training phase (i.e., have λ larger than some threshold) are removed from the codebook. Kim et al. suggest N/2 as the threshold. This is not appropriate for a pan/tilt camera, as some codewords representing the background could be removed just because they were not in the camera's frustum for a long enough period of time.

We use a threshold T_m, 0 < T_m < 1, representing a fraction of the total number of frames during which the pixel was in the camera's field of view, to find the codewords that have not been accessed often:

    C \leftarrow \{ c_i \mid c_i \in C \wedge \lambda_i \le \lfloor T_m \, a \rfloor \}    (4.32)

The complete codebook training algorithm for a pan/tilt camera and the temporal filtering routine are presented in Algorithm 4.3 and Algorithm 4.4 respectively.

Require: homography H, new frame F_t, background model M
  F̂_t <- F_t transformed by H
  if F̂_t overlaps M then
    Grow M:
    for all new pixels in M do
      C <- (0, empty)
    end for
  end if
  for all pixels X_t = (R_t, G_t, B_t) in F̂_t do
    I_t <- sqrt(R_t² + G_t² + B_t²)
    Find the corresponding codebook C in model M
    Find c_i in C such that π(v_i, X_t) = True and ρ(I_t, Ǐ_i, Î_i) = True
    if no match then
      Create a codeword c_j as in Equations 4.24–4.30
      C <- C ∪ c_j
    else
      Update c_i as in Equations 4.17–4.23
    end if
    Update a <- a + 1
  end for

Algorithm 4.3: Codebook training algorithm for a pan/tilt camera

Require: background model M
  for all codebooks C_j in M do
    for all codewords c_i in C_j do
      Wrap λ_i as in Equation 4.31
      if λ_i > ⌊T_m a_j⌋ then
        C_j <- C_j \ c_i
      end if
    end for
  end for

Algorithm 4.4: Codebook temporal filtering algorithm

To perform background subtraction, the algorithm matches a new pixel X_t against all codewords in the corresponding codebook C. If for some codeword c_i both π(v_i, X_t) and ρ(I_t, Î_i, Ǐ_i) are True, the codeword c_i is updated as in Equations 4.17–4.23 and the pixel X_t is classified as background. If no match is found, the pixel is classified as foreground and the codebook remains unchanged. On its own, this only lets the algorithm perform background subtraction and update the background model for those parts of the scene that were present in the training sequence (i.e., Case 1 and Case 3).

As with MOG and KDE, in order to accommodate Case 2, where new parts of the scene become visible during the testing phase, the algorithm must be able to incorporate this new information into the background model. In [15] Kim et al. suggest an improvement to their CB algorithm that lets changed parts of the scene be added to the background model during the testing phase. This improvement also allows the model to handle Case 2 for a pan/tilt camera. An additional background model H maintains all the codewords that are not found in the main background model M. Pixels that have matching codewords in the model H are still labeled as foreground. This model only serves as a temporary cache, and all the codewords that gain enough evidence (i.e., are accessed often enough) are moved to the main model M.

The interaction of the two models can be described as follows. When there is no matching codeword in the main background model M for a new pixel, a matching codeword is found or created in the additional model H. Then, to maintain the size of the model H, all the codewords that have not been accessed for a time specified by the global constant T_H are removed from H. If a codeword stays in H for the number of frames specified by T_add, it is assumed to belong to the background and is therefore moved to the main model M. Likewise, the size of the main model M is maintained by removing the codewords that have not been accessed for the number of frames specified by T_delete.

The algorithm for the testing phase of the pan/tilt camera CB approach is presented in Algorithm 4.5.

Require: homography H, new frame F_t, background models M and H
  F̂_t <- F_t transformed by the homography H
  if F̂_t overlaps M then
    Grow M and H:
    for all new pixels in H do
      C <- (0, empty)
    end for
    for all new pixels in M do
      C <- (0, empty)
    end for
  end if
  for all pixels X_t = (R_t, G_t, B_t) in F̂_t do
    Find the corresponding codebook C_M in model M
    Find the corresponding codebook C_H in model H
    Find a matching codeword c_i in C_M using π and ρ
    if c_i is found then
      Label X_t as background
      Update c_i as in Equations 4.17–4.23
    else
      Label X_t as foreground
      Find a matching codeword h_i in C_H using π and ρ
      if h_i is found then
        Update h_i as in Equations 4.17–4.23
      else
        Create a codeword h_j as in Equations 4.24–4.30
        C_H <- C_H ∪ h_j
      end if
    end if
    Remove codewords from the cache model:
      C_H <- C_H \ {h_i | h_i ∈ C_H ∧ λ_i > T_H}
    Move codewords:
      C_M <- C_M ∪ {h_i | h_i ∈ C_H ∧ a - p_i > T_add}
      C_H <- C_H \ {h_i | h_i ∈ C_H ∧ a - p_i > T_add}
    Remove codewords from the main model:
      C_M <- C_M \ {c_i | c_i ∈ C_M ∧ a - q_i > T_delete}
  end for

Algorithm 4.5: Codebook testing phase for a pan/tilt camera
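The model maintenance at the end of Algorithm 4.5 can be sketched as follows, reusing the dict-based codeword layout from the earlier sketch; the threshold defaults mirror the outdoor-experiment values reported in Chapter 5.

```python
def maintain_codebooks(CM, CH, a, T_H=20, T_add=50, T_delete=200):
    """One pixel's model maintenance step from Algorithm 4.5.

    CM: list of main-model codewords; CH: list of cache codewords; each
    codeword is a dict with keys lam, p, q as in the previous sketch.
    a: number of frames this pixel has spent in the field of view.
    """
    CH[:] = [h for h in CH if h['lam'] <= T_H]          # drop stale cache words
    promoted = [h for h in CH if a - h['p'] > T_add]    # enough evidence
    CM.extend(promoted)                                 # cache -> main model
    CH[:] = [h for h in CH if a - h['p'] <= T_add]
    CM[:] = [c for c in CM if a - c['q'] <= T_delete]   # prune unused words
```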
4.3.1 Color model improvement

The problem with the distance measure of Kim et al., introduced in Equation 4.16 and shown in Figure 4.3, is that the value of d strongly depends on the lengths of both the input pixel's and the codeword's vectors X_t and v_i. These lengths in turn depend on the brightness of the pixel and the codeword.

Consider the artificial example shown in Figure 4.4. In this sequence two solid colored disks move against a background. A brighter disk with the color value (100, 128, 0) moves against a brighter background with the color value (128, 128, 0), and a darker disk with the color value (50, 64, 0) moves against a darker background with the color value (64, 64, 0). Since there is no noise and the background does not change, after the algorithm has been trained all codewords that correspond to the brighter part of the scene contain the value v_i = (128, 128, 0) and all codewords that correspond to the darker part of the scene have the value v_i = (64, 64, 0).

The color distance d between the brighter disk and the brighter background is 19.79; similarly, d between the darker disk and the darker background is 9.899. Hence, if the global threshold ε used for testing is set between these two values, the darker circle will not be detected as a moving object. An actual frame from the sequence and the classification result with the global threshold ε = 12 are shown in Figure 4.4b. Black pixels represent parts of the scene classified as background and white pixels represent foreground. Since the distance d between the dark disk and the dark background is less than ε = 12, the pixels that belong to this disk were classified as background.

Figure 4.4: Frame 400 and detection results from an artificial image sequence. In Figure 4.4b black pixels denote background and white pixels denote foreground. The CB algorithm failed to detect the dark moving circle against the dark background with ε = 12.

The described scenario is unlikely to happen in many applications. An equivalent situation can occur in surveillance when, for example, the lighting conditions change over time and the scene becomes darker. The algorithm will then have to detect darker moving targets against a darker background.

To detect such targets, the threshold ε might be lowered in advance to accommodate a future decrease of the distance d.
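The two distances quoted above follow directly from Equation 4.16 and are easy to reproduce:

```python
import numpy as np

def color_distance(v, x):
    """d of Equation 4.16."""
    return np.sqrt(np.dot(x, x) - np.dot(x, v) ** 2 / np.dot(v, v))

print(color_distance(np.array([128.0, 128.0, 0.0]),
                     np.array([100.0, 128.0, 0.0])))    # ~19.799
print(color_distance(np.array([64.0, 64.0, 0.0]),
                     np.array([50.0, 64.0, 0.0])))      # ~9.899
```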
Lowering ε in this way, however, leads to a situation where slight pixel value variations due to image noise cause pixels to be classified as foreground, since these variations increase d significantly compared to the smaller ε. Figure 4.5 shows two frames of the Camouflage video sequence from the Wallflower dataset (see [35] for a detailed explanation of the datasets). This sequence contains 352 frames taken with a static camera. Some image noise is present in every frame. The camera is pointed at a computer located on a desk. In frame 241 a person wearing a dark blue sweater enters the camera's frustum and obstructs the computer. To effectively ignore the noise, the threshold ε has to be set to a higher value, but increasing the threshold decreases the algorithm's sensitivity to darker pixels. The classification results with ε = 8 can be seen in Figure 4.6. As with the previous example, black pixels represent background while white pixels represent foreground. Both frames contain a few pixels misclassified due to noise, but the frame shown in Figure 4.6b contains a large part of the person's body classified as background, due to the close match between the dark sweater and the monitor.

Figure 4.5: Frames 0 and 251 of the Wallflower Camouflage video sequence

Figure 4.6: Classification results for frames 0 and 251 with ε = 8. The algorithm successfully detected the monitor as background in frame 0. Due to the color similarity between the computer's screen and the person's sweater, there are many misclassified pixels in frame 251.

To detect the person as a foreground object, the algorithm must be able to differentiate the dark pixels of the person's sweater from the monitor's dark pixels when the person stands in front of the computer. Hence a lower value for the threshold ε is required. The classification results with a lowered ε can be seen in Figure 4.7. In this case lowering the threshold increased the algorithm's sensitivity to image noise, producing a large number of misclassified pixels.

Figure 4.7: Classification results for frames 0 and 251 with ε = 1. Lowering the threshold ε allowed the algorithm to detect the person correctly in frame 251, but at the same time image noise in both frames and the computer screen in frame 0 were misclassified as foreground.

Our solution to this problem is to threshold the cosine of the angle θ between the two color vectors, instead of thresholding the distance d with ε, as the measure of "closeness" of a pixel color to a codeword color value. The model is shown in Figure 4.8. The cosine of the angle θ is calculated as follows:

    \cos\theta = \frac{\langle X_t, v_i \rangle}{\|X_t\| \, \|v_i\|}    (4.33)

Figure 4.8: The color model for the CB algorithm used in this thesis. Vectors v_i and X_t denote a codeword and a pixel in (R, G, B) space. The decision boundary, outlined in blue, is determined by the angle of a right circular cone and the upper and lower limits I_hi and I_low; θ denotes the angle between the pixel and the codeword vectors. In this case θ is greater than the cone angle and the pixel is not matched to the codeword.

The benefit of using cos θ instead of d is that it is not as sensitive to noise and brightness changes. For example, in the artificial scene in Figure 4.4 the value of cos θ is exactly the same for the bright and dark discs when they are compared to the bright and dark parts of the scene respectively. Figure 4.9 shows the classification results for the image sequence of Figure 4.5 obtained by thresholding cos θ instead of d. The resulting images contain considerably less noise compared to the results in Figure 4.7, while successfully displaying the person as foreground.

Figure 4.9: Classification results for frames 0 and 251 with the threshold on θ set to 2°. Using the cosine metric, CB was able to correctly detect the computer screen as background in frame 0 and the person as foreground in frame 251.

As θ approaches 0, the sensitivity of cos θ degrades [19]. To avoid this problem, a metric called the normalized vector distance (NVD) was introduced in [21]. NVD is a simple metric that uses the distance between two vectors normalized to unit length as a measure of their similarity.
In Figure 4.10 the NVD is denoted by Δ, while the new pixel's color vector and the codeword's vector are denoted X_t and v_i respectively. To match a pixel value X_t to a codeword v_i, the NVD is calculated as follows:

    \Delta_t = \left\| \frac{v_i}{\|v_i\|} - \frac{X_t}{\|X_t\|} \right\|    (4.34)

Figure 4.10: Normalized vector distance (adapted from [19]). Vectors v_i and X_t denote a codeword and a pixel in (R, G, B) space. The NVD, denoted Δ_t, is the length of the difference of the vectors obtained by normalizing v_i and X_t to unit length.

As with the color distance d and the cosine of the angle θ, a global threshold is applied to Δ_t to decide whether a pixel is "close" to a codeword's color vector. If Δ_t is smaller than the threshold, the pixel is considered "close" and the brightness test of Equation 4.15 is performed to determine whether the pixel matches the codeword.
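Both metrics, and the relation between them, can be sketched in a few lines; for unit vectors u and w, ||u - w||² = 2 - 2 cos θ, which the small check below confirms.

```python
import numpy as np

def cos_theta(v, x):
    """Equation 4.33."""
    return float(np.dot(v, x) / (np.linalg.norm(v) * np.linalg.norm(x)))

def nvd(v, x):
    """Normalized vector distance of Equation 4.34."""
    return float(np.linalg.norm(v / np.linalg.norm(v)
                                - x / np.linalg.norm(x)))

v = np.array([128.0, 128.0, 0.0])
x = np.array([100.0, 128.0, 0.0])
assert abs(nvd(v, x) ** 2 - (2.0 - 2.0 * cos_theta(v, x))) < 1e-12
```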
As with [35], experiments were performed on the Wallflower datasets. These datasets were chosen because they cover most of the situations occurring in surveillance scenarios, both indoor and outdoor, including smooth and sudden illumination changes, an object being introduced into the scene, and repetitive motion in the background.

First, for each dataset and each metric the optimal threshold was estimated experimentally by finding the threshold that minimized the total number of misclassified pixels. In order to find this threshold the brightness statistics Î and Ǐ were not used: each pixel was classified entirely by its distance to the codebook vector when the color distance or NVD metric was used, or by the angle when the cosine metric was used. The classification results and the number of errors for these thresholds are presented in Table 4.2 for each metric. False positives (FPs) are background pixels that have been classified as foreground, and false negatives (FNs) are foreground pixels that have been classified as background.

The same thresholds were then used with the same datasets again, this time with the brightness function ρ(I_t, Î_i, Ǐ_i) enabled. For the brightness function ρ the parameters α and β were set to the values that minimized the total number of misclassified pixels in each test sequence. The detection results are presented in Table 4.3. The number of FPs increased in all cases, while the number of FNs decreased. These changes occur because including the brightness statistics Î and Ǐ during the classification phase imposes additional constraints that a pixel value must satisfy to be classified as background, leading to more pixels being classified as foreground and hence an increased number of FPs. The number of FNs goes down for the same reason: with ρ enabled the classification process becomes stricter, and foreground pixels that had matched codewords on the color threshold alone can fail the brightness check and thus be classified correctly.

Both Table 4.2 and Table 4.3 show that the number of FPs is generally slightly greater for the cosine and NVD metrics. This is because neither the cosine nor the NVD metric relies on the lengths of the color vectors to match a pixel to a codeword. Hence (dark pixel, bright codeword) and (bright pixel, dark codeword) pairs are matched more often than when the color distance metric is used. For the same reason the number of FNs is lower in the experiments using the cosine and NVD metrics.

The results for the Camouflage image sequence in both tables show that the number of FPs is about 3 times greater for the color distance metric than for the cosine and NVD metrics. This supports the claim made earlier in this section that a metric that does not depend on the lengths of the color vectors produces better results when dark objects move against a dark background.

The number of misclassified pixels for the Light Switch image sequence increased significantly for all metrics after the brightness statistics were enabled. This image sequence shows a dark room in which a light is suddenly turned on in frame 1853. Since the lighting change is sudden, the background model fails to adapt to the brighter background and classifies most of the pixels as foreground: they fail the intensity check even though they pass the color threshold check, as the comparison with Table 4.2 shows.

These experiments can be summarized as follows. For most sequences with no large difference between dark and light objects (e.g., Bootstrap, Waving Trees) both the cosine and NVD metrics improve the detection results only marginally. On the other hand, for a sequence with a large difference between dark and light objects (e.g., Camouflage) using either the cosine or the NVD metric results in considerably fewer misclassified pixels. The experiments also show no significant difference between the results obtained with the cosine and NVD metrics.

    Metric           Error   Boot-   Camou-  FG Aper-  Light   Time    Waving
                     type    strap   flage   ture      Switch  of Day  Trees
    Color distance   FP        194    1956      555      288     112     214
                     FN       2197     642     3392     2714     967     607
                     FP+FN    2391    2598     3947     3002    1079     821
    Cosine           FP        121     733      627      339     186     333
                     FN       1971     443     1329     2452     583     205
                     FP+FN    2092    1176     1956     2791     769     538
    NVD              FP        142     657      659      343     242     433
                     FN       1965     517     1314     2445     528     156
                     FP+FN    2107    1174     1973     2788     770     589

Table 4.2: Comparison of the codebook algorithm with three different metrics, without using the brightness statistics Î, Ǐ.

    Metric           Error   Boot-   Camou-  FG Aper-  Light   Time    Waving
                     type    strap   flage   ture      Switch  of Day  Trees
    Color distance   FP        356    2116      739     8140     263     279
                     FN       1537     174      984     1282     473     123
                     FP+FN    1883    2290     1723     9422     736     401
    Cosine           FP        385     839      833     7580     247     383
                     FN       1360     165      766     1703     424     110
                     FP+FN    1745    1004     1599     9283     671     493
    NVD              FP        414     774      862     7582     295     448
                     FN       1335     188      763     1703     379      85
                     FP+FN    1749     962     1625     9285     674     533

Table 4.3: Comparison of the codebook algorithm with three different metrics, with the brightness statistics Î, Ǐ.

Chapter 5

Results and Discussion

The detection results obtained by the background subtraction algorithms described in Chapter 4, adapted for the pan/tilt framework presented in Chapter 3, are demonstrated in this chapter.
Six test image sequences were used to analyze the performance of the implemented pan/tilt background subtraction algorithms in different environments. These sequences fall into two categories: three were taken outdoors and three were taken indoors.

The parameters for each algorithm were found experimentally by minimizing the total number of misclassified pixels over all sequences in each category; the obtained parameters were then used for every image sequence in that category. All the image sequences and the detection results produced by the three background subtraction algorithms are available on-line [1].

The detection results are presented as follows. For each test sequence we give a short description, several representative frames, and a figure showing test frames, ground truth (GT) and the detection results produced by each algorithm. In all figures the detection results and ground truth are presented as black and white images in which white pixels denote foreground objects and black pixels denote background. Furthermore, a table summarizing the number of errors for each algorithm is provided. In these tables FPs denote background pixels that were classified as foreground and FNs denote foreground pixels that were classified as background. Finally, each algorithm's detection results are discussed.

5.1 Detection of moving objects in outdoor scenes

To test the performance of the algorithms in an outdoor environment we used two outdoor sequences taken with a tripod-mounted camera and one taken with a handheld camera. All outdoor test sequences were taken with a Panasonic SDR-S100 digital camera at 25 fps with a pixel resolution of 704x576. Lower quality image sequences were then produced by scaling the originals down to a resolution of 160x131 pixels.

All the outdoor sequences used the same sets of parameters for the MOG, KDE and CB algorithms. For the CB algorithm the following parameters were used: the threshold that specifies the size of the codebook T_m = 0.5, T_H = 20, T_add = 50, and T_delete = 200. The angle metric described in Chapter 4 used a threshold of 7°. The parameters α and β that match a pixel's brightness to one of the codewords were set to 0.7 and 1.5 respectively.

The following parameters were used for the MOG algorithm. The number of mixture components K was set to 5. Both learning factors α and ρ were set to 0.01. σ²_init was set to 3 and w_init was set to 0.00001. The threshold T that determines how many distributions represent background was set to 0.6. Finally, the threshold D that determines whether a pixel matches a distribution was set to 15.

For the KDE algorithm the background threshold T was set to 0.02 and the background sampling rate τ was set to 1. Additionally, for this algorithm we used the normalized color model proposed by Elgammal et al. [6] to reduce the number of channels per pixel from 3 to 2 as follows:

\[ R_N = 255\,\frac{R}{R+G+B} \tag{5.1} \]

\[ G_N = 255\,\frac{G}{R+G+B} \tag{5.2} \]

where R, G, B ∈ (0…255) are the original color values of the pixel and R_N, G_N are the two normalized components of the pixel.

Both the KDE and CB algorithms required training. The first 200 frames of each sequence were used to train the algorithms.
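For reference, the conversion in Equations 5.1 and 5.2 vectorizes naturally over a whole frame. A sketch follows (illustrative, not the thesis code; it assumes an H×W×3 image array):

    import numpy as np

    def normalized_rg(frame):
        """Convert an RGB frame to the two-channel normalized color
        space of Equations 5.1 and 5.2: each chromaticity component
        is the channel's fraction of the pixel's total intensity,
        scaled to the 0..255 range."""
        rgb = frame.astype(np.float64)
        total = rgb.sum(axis=2, keepdims=True)
        total[total == 0] = 1.0   # avoid division by zero on black pixels
        chroma = 255.0 * rgb / total
        return chroma[..., 0], chroma[..., 1]   # R_N, G_N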
[Figure 5.1 — panels (a)–(d): frames 501, 604, 700 and 949]
Figure 5.1: Frames 501, 604, 700 and 949 of Sequence 1. Figure 5.1a shows an empty scene, in Figure 5.1b a person is entering the camera's field of view, in Figure 5.1c the person is immobile, and Figure 5.1d shows the person leaving the scene.

5.1.1 Sequence 1

Figure 5.1 shows several frames of a 1090 frame image sequence (called Sequence 1) taken with a pan/tilt camera mounted on a tripod. The scene contains a pedestrian pathway bordered by a lawn, a building in the distance, and several trees planted in front of the building. The branches of one of the trees are swaying in the wind. The whole scene is illuminated by direct sunlight and all images are rich in contrast. In this sequence the camera mostly pans from side to side and tilts occasionally. The frames in Figure 5.1d and Figure 5.1a show the leftmost and rightmost extreme angles of the camera. In frame 584 a target, a walking person, appears on the right; the person can be seen in Figure 5.1b. After having walked across the scene the person stops in frame 657 and then continues walking in the same direction in frame 800. A frame with the standing person can be seen in Figure 5.1c. Finally, in frame 981 the person leaves the camera's field of view.

The detection results for frames 625, 720, 850 and 930 produced by the three pan/tilt background subtraction algorithms are presented in Figure 5.2. The first row shows the actual frames of the sequence. The GT for each frame is shown in the second row. The results produced by the MOG, KDE and CB algorithms are presented in rows three, four and five respectively. Table 5.1 summarizes the total number of incorrectly classified pixels for each algorithm.

[Figure 5.2 — 5x4 image grid, panels (a)–(t): frames 625, 720, 850 and 930; their ground truth; and the MOG, KDE and CB masks]
Figure 5.2: The first row shows actual frames from Sequence 1. The ground truth is shown in the second row, where white pixels denote foreground and black pixels denote background. The results produced by MOG are shown in the third row, while the results produced by KDE and CB are shown in the fourth and fifth rows respectively.

    Algorithm  Error   Frame 625      Frame 720       Frame 850       Frame 930        Total
    MOG        FP       667 (3.18%)    128 (0.61%)    1778 (8.48%)    4701 (22.43%)    7274 (8.68%)
               FN       860 (4.10%)   2181 (10.4%)     485 (2.31%)     317 (1.51%)     3843 (4.58%)
               FP+FN   1527 (7.28%)   2309 (11.01%)   2263 (10.79%)   5018 (23.94%)   11117 (13.26%)
    KDE        FP       138 (0.66%)    147 (0.7%)      177 (0.84%)    1866 (8.9%)      2328 (2.78%)
               FN       136 (0.65%)    208 (0.99%)     296 (1.41%)     228 (1.09%)      868 (1.04%)
               FP+FN    274 (1.31%)    355 (1.69%)     473 (2.25%)    2094 (9.99%)     3196 (3.82%)
    CB         FP       340 (1.62%)     94 (0.45%)     404 (1.93%)    3179 (15.17%)    4017 (4.79%)
               FN       169 (0.81%)   2148 (10.25%)    266 (1.27%)     220 (1.05%)     2803 (3.34%)
               FP+FN    509 (2.43%)   2242 (10.7%)     670 (3.2%)     3399 (16.22%)    6820 (8.13%)

Table 5.1: Comparison of the three background subtraction pan/tilt algorithms for Sequence 1. For each algorithm there are three rows representing FPs, FNs and the combined number of errors. Each column shows the number of misclassified pixels in a frame and a percentage relative to the total number of pixels in the frame. The final column shows the total number of errors summed across all four frames.

The results produced by MOG and CB, presented in Figure 5.2j and Figure 5.2r respectively, show that both algorithms failed to detect the now immobile target in frame 720. In the case of MOG, this is because the learning factors α and ρ, which control the rate at which new pixel values are incorporated into the model, were set to higher values, resulting in faster adaptation of immobile objects into the background model. Figure 5.3 shows composite images of the algorithm's background model; these images show the means of the most probable component in the model for each pixel. At frame 720, as seen in Figure 5.3b, the pixels representing the target are already included in the most probable component.

[Figure 5.3 — panels (a), (b): background model snapshots at frames 625 and 720]
Figure 5.3: Snapshots of the background model for the MOG algorithm at frames 625 and 720. Each snapshot was created by taking the mean of the most probable component in each pixel's mixture. At frame 720 there is a recognizable silhouette of a standing person.

Similarly, in the case of CB the rate T_add at which codewords move from the cache codebook to the background codebook was set to 50. This means that if a codeword is still in the cache codebook after 50 frames it is considered to represent background and is moved to the background codebook. Increasing T_add would keep the algorithm from incorporating the target into the background model, but it would also slow the rate at which previously unseen areas are incorporated. Since KDE incorporates new information relatively slowly, the classification results produced by KDE in Figure 5.2n do not have this problem.

Figure 5.2k, produced by the MOG algorithm at frame 850, shows a white silhouette incorrectly classified as foreground where the target used to be in frame 720. Since the target spent a large amount of time in the same location, the distributions
that represented the background behind the target in that location were eventually outweighed by the distributions that represented the immobile target. When the target quickly left the location, the algorithm had to relearn and reweigh the background distributions. As seen in Figure 5.2o and Figure 5.2s, neither KDE nor CB had this issue. In the case of CB, a codeword stays in the codebook for T_delete frames if it is not accessed. Since the target spent fewer than T_delete frames in the same location (T_delete = 200 in our case), the codewords that represented the background behind the target were not removed from the codebook. In the case of KDE, the slow rate at which the algorithm incorporates new data into the model did not let the target pixels be absorbed into the background model.

The detection results for frame 930 produced by all three algorithms, seen in Figure 5.2l, Figure 5.2p and Figure 5.2t, show a large portion of the background on the left incorrectly classified as foreground (i.e., FPs). This part of the scene had not been seen by the algorithms before, and they classified it as foreground while collecting enough samples and learning the background.

Since the normalized color model was used for the KDE algorithm, the results produced by KDE in Figure 5.2m, Figure 5.2n, Figure 5.2o and Figure 5.2p show that the algorithm did not classify the target's shadow as foreground.

5.1.2 Sequence 2

Another outdoor sequence (Sequence 2) was used to test the performance of the algorithms in a case where there are many interacting foreground objects, some
of which occupy a large portion of the frame. This 1018 frame sequence was taken at a busy intersection, with several cars stopping at a red light and resuming movement once the light changed to green. As with Sequence 1, this sequence was taken with a camera mounted on a tripod with a pan/tilt head. Several representative frames from this sequence are shown in Figure 5.4. Figure 5.4b and Figure 5.4d show two frames taken at the extreme left and right angles of the camera's trajectory. A complex interaction between foreground objects is shown in Figure 5.4a, where there are several foreground objects (i.e., pedestrians and cars) at different distances from the camera. An example of a large foreground object moving in front of the camera is shown in Figure 5.4c.

[Figure 5.4 — panels (a)–(d): frames 78, 180, 413 and 829]
Figure 5.4: Frames 78, 180, 413 and 829 of Sequence 2. Figure 5.4a and Figure 5.4b show two frames that are part of the training sequence for both KDE and CB. Figure 5.4c shows a large object quickly moving in front of the camera. Figure 5.4d shows some moving objects in the background.

The GT and detection masks for some of the frames are presented in Figure 5.5. Table 5.2 summarizes the FP and FN errors in each case.

The detection results for frames 400 and 460 are similar for all three algorithms — all moving objects were detected relatively accurately. On the other hand, the results for frames 660 and 900 differ. Figure 5.6 shows the detection results produced by KDE for frames 660 and 900; an area that was consistently misclassified by the algorithm as foreground is outlined in red. A similar misclassified area is present in the result for frame 660 produced by CB, as seen in Figure 5.5s. This is caused by an immobile object that eventually starts moving.

The frames shown in Figure 5.5a and Figure 5.5b contain a grey car waiting at the intersection. Since the car had been immobile since the beginning of the sequence, none of the three algorithms classified it as a foreground object. In frame 590 the street light changed to green and the car began to move. By frame 660 the car had already traveled some distance, and it was outside the camera's field of view by frame 900. Since the KDE algorithm adapted to the changing background very slowly, it was not able to incorporate the new background information into the short-term model once the car started moving. By frame 900 there was still not enough evidence in the model for the new background, and the area that had been occupied by the car was still classified as foreground. In contrast, the MOG algorithm was able to completely incorporate the new background into the model by frame 660, and the CB algorithm by frame 900.

[Figure 5.5 — 5x4 image grid, panels (a)–(t): frames 400, 460, 660 and 900; their ground truth; and the MOG, KDE and CB masks]
Figure 5.5: The first row shows actual frames from Sequence 2. The ground truth is shown in the second row, where white pixels denote foreground and black pixels denote background. The results produced by MOG are shown in the third row, while the results produced by KDE and CB are shown in the fourth and fifth rows respectively.
    Algorithm  Error   Frame 400      Frame 460      Frame 660      Frame 900       Total
    MOG        FP       136 (0.65%)    202 (0.96%)     88 (0.42%)     57 (0.27%)     483 (0.58%)
               FN       267 (1.27%)    114 (0.54%)     43 (0.21%)     30 (0.14%)     454 (0.54%)
               FP+FN    403 (1.92%)    316 (1.5%)     131 (0.63%)     87 (0.41%)     937 (1.12%)
    KDE        FP        77 (0.37%)     84 (0.4%)     110 (0.52%)    117 (0.56%)     388 (0.46%)
               FN       422 (2.01%)    291 (1.39%)     66 (0.31%)     36 (0.17%)     815 (0.97%)
               FP+FN    499 (2.38%)    375 (1.79%)    176 (0.83%)    153 (0.73%)    1203 (1.43%)
    CB         FP        57 (0.27%)     78 (0.37%)     35 (0.17%)     24 (0.11%)     194 (0.23%)
               FN       281 (1.34%)    175 (0.83%)     81 (0.39%)     32 (0.15%)     569 (0.68%)
               FP+FN    338 (1.61%)    253 (1.2%)     116 (0.56%)     56 (0.26%)     763 (0.91%)

Table 5.2: Comparison of the three background subtraction pan/tilt algorithms for Sequence 2. For each algorithm there are three rows representing FPs, FNs and the combined number of errors. Each column shows the number of misclassified pixels in a frame and a percentage relative to the total number of pixels in the frame. The final column shows the total number of errors summed across all four frames.

[Figure 5.6 — panels (a), (b): KDE results for frames 660 and 900]
Figure 5.6: Classification results for frames 660 and 900 of Sequence 2 produced by KDE. In both frames the parts where the grey car used to be are incorrectly classified as foreground; they are outlined in red for clarity.

5.1.3 Sequence 3

An image sequence similar to Sequence 1 was taken with the same camera in the same location. This time, however, the camera was not mounted on a tripod but was handheld and manually moved in an approximate pan/tilt fashion. The goal of this experiment was to evaluate the performance of the background subtraction algorithms when the frames are not exactly related by a homography and, as a consequence, are not registered exactly.

In this image sequence, as in Sequence 1, the handheld camera sweeps across the same scene and the target walks in from the right side of the scene, stops, and then continues walking to the left. Figure 5.7 shows several frames and the classification results produced by each of the three BS algorithms. A summary of the errors is presented in Table 5.3.

[Figure 5.7 — 5x4 image grid, panels (a)–(t): frames 350, 430, 500 and 570; their ground truth; and the MOG, KDE and CB masks]
Figure 5.7: The first row shows actual frames from Sequence 3. The ground truth is shown in the second row, where white pixels denote foreground and black pixels denote background. The results produced by MOG are shown in the third row, while the results produced by KDE and CB are shown in the fourth and fifth rows respectively.

The detection results presented in Figure 5.7 show that all three algorithms were able to produce recognizable silhouettes of the target in all of the frames. Since the relation between the frames was not exactly a homography, the frames were not registered precisely, and as a consequence the results produced by MOG, which is a more sensitive algorithm, show a considerable number of misclassified pixels. In all of the masks produced by MOG (Figure 5.7i, Figure 5.7j, Figure 5.7k and Figure 5.7l) there are visible lines coinciding with the edges of some of the background objects. These lines correspond to pixels located at the edges of solid-colored background objects.
Due to the imprecise registration these objects overlap with some offset in some of the frames, and these areas were detected as foreground by MOG. The large number of FPs seen in the left and right parts of Figure 5.7i and in the lower part of Figure 5.7k results from gradual changes in the color of the pavement produced by the camera adjusting to the lighting conditions.

As with Sequence 1, both MOG and CB incorporated parts of the target into the background model while the target was stationary. This can be seen in Figure 5.7j, Figure 5.7k and Figure 5.7s. Again, this is due to the high learning rates α and ρ in the case of MOG and the parameter T_add in the case of CB.

    Algorithm  Error   Frame 350       Frame 430      Frame 500       Frame 570      Total
    MOG        FP      2626 (12.53%)   584 (2.79%)    2910 (13.88%)   525 (2.5%)     6645 (7.93%)
               FN       194 (0.93%)   1484 (7.08%)    1779 (8.49%)    312 (1.49%)    3769 (4.5%)
               FP+FN   2820 (13.46%)  2068 (9.87%)    4689 (22.37%)   837 (3.99%)   10414 (12.43%)
    KDE        FP       245 (1.17%)     84 (0.4%)      970 (4.63%)    110 (0.52%)    1409 (1.68%)
               FN       441 (2.1%)     252 (1.2%)      312 (1.49%)    265 (1.26%)    1270 (1.51%)
               FP+FN    686 (3.27%)    336 (1.6%)     1282 (6.12%)    375 (1.78%)    2679 (3.19%)
    CB         FP        39 (0.19%)     15 (0.07%)     420 (2.0%)      34 (0.16%)     508 (0.61%)
               FN       129 (0.61%)    267 (1.27%)    1549 (7.39%)    297 (1.42%)    2242 (2.67%)
               FP+FN    168 (0.80%)    282 (1.34%)    1969 (9.39%)    331 (1.58%)    2750 (3.28%)

Table 5.3: Comparison of the three background subtraction pan/tilt algorithms for Sequence 3. For each algorithm there are three rows representing FPs, FNs and the combined number of errors. Each column shows the number of misclassified pixels in a frame and a percentage relative to the total number of pixels in the frame. The final column shows the total number of errors summed across all four frames.

5.2 Detection of moving objects in indoor scenes

In indoor scenes we cannot assume that all objects are at a reasonable distance (i.e., at infinity). Hence, in order for frames to be related by a homography the camera needs to rotate exactly about its optical center. Lens distortion is another problem, often more noticeable in an indoor environment. Indoor scenes are usually taken with a wide angle lens and contain objects that are close to the camera. The effects of lens distortions, such as radial distortion, are stronger; thus estimating a homography indoors is more difficult than outdoors, since straight lines in the real world are not always represented by straight lines in the images.

For all the indoor scenes presented in this section the parameters for the CB algorithm were exactly the same as for the outdoor scenes. For the MOG algorithm all parameters were the same except for the threshold D, which was increased to 50, allowing a pixel to lie farther from the mean of a component distribution and still match it. Similarly, the sensitivity of the KDE algorithm was decreased for the indoor scenes by setting the threshold T to 0.01.

5.2.1 Sequence 4

One of the indoor sequences comes from a dataset created by Xu [41]. This image sequence contains 154 frames with a resolution of 320 by 240 pixels. In this video sequence a tripod-mounted camera pans and tilts, following a person's hand that holds a box at a close distance to the camera. Two frames from this sequence are shown in Figure 5.8.

[Figure 5.8 — panels (a), (b): two frames from Sequence 4]
Figure 5.8: Frames 1 and 52 of Sequence 4 by Xu [41]. Figure 5.8a shows the first frame of the sequence; note the lack of objects that could produce many features. Figure 5.8b shows a typical frame of the sequence, where the box and the arm occupy a large portion of the frame.
A particular difficulty with this image sequence is that the hand rotates and moves the box erratically, sometimes bringing it closer to the camera so that the box occupies a large portion of the screen, as seen in Figure 5.8b. The hand occupies the left portion of the screen almost all the time, making it difficult to learn the background model in that part of the scene. Moreover, there are few distinct background objects in the scene to extract keypoints from, and the foreground objects sometimes cover them completely, as seen in Figure 5.9c.

Detection results and ground truth are presented in Figure 5.9. The number of FPs and FNs for each shown frame and each algorithm is presented in Table 5.4. The total number of errors in Table 5.4 is considerably larger than for the other image sequences presented in this chapter. This is caused by the larger proportion of foreground pixels in each image frame compared to the other sequences.

[Figure 5.9 — 5x4 image grid, panels (a)–(t): frames 65, 95, 125 and 150; their ground truth; and the MOG, KDE and CB masks]
Figure 5.9: The first row shows actual frames from Sequence 4. The ground truth is shown in the second row, where white pixels denote foreground and black pixels denote background. The results produced by MOG are shown in the third row, while the results produced by KDE and CB are shown in the fourth and fifth rows respectively.

Both the MOG and the KDE algorithms produced a recognizable checker pattern on the box. This is caused by the close similarity between the color of the wall and the color of the white checkers on the box.

The KDE algorithm consistently failed to classify very dark areas of the image sequence as background. A definite outline of the dark areas of the chair and the desk can be seen in Figure 5.9m, Figure 5.9n and Figure 5.9p. The color model used for this algorithm represents each R, G, B component as a fraction of the sum of all the components. As a result, noisy pixels have a very large variation in very dark areas. Consider, for instance, a very dark pixel with R, G, B values (1, 1, 0). According to Equation 5.1 and Equation 5.2 both of its values R_N and G_N are equal to about 127. If in the next frame the same pixel, due to noise, has R, G, B values (1, 0, 0), its R_N and G_N become 255 and 0 respectively. To accommodate such a large variation the background threshold T would need to be lowered, which would result in a large number of foreground pixels being classified as background.

The results produced by the MOG algorithm shown in Figure 5.9i, Figure 5.9j and Figure 5.9k contain a large number of pixels at the top and right sides of the respective frames erroneously classified as foreground (i.e., FPs). This is caused by the addition of new randomly initialized mixtures to the model as new parts of the scene are encountered. The speed with which the MOG algorithm learns new background depends on the learning factors α and ρ.
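The dark-pixel instability described above is easy to reproduce numerically. A toy check (floating point here; the thesis quotes the rounded integer values):

    def rn_gn(r, g, b):
        total = max(r + g + b, 1)
        return 255.0 * r / total, 255.0 * g / total

    # One unit of sensor noise in a nearly black pixel swings the
    # normalized coordinates across their entire range ...
    print(rn_gn(1, 1, 0))       # (127.5, 127.5), i.e., about 127
    print(rn_gn(1, 0, 0))       # (255.0, 0.0)
    # ... while the same noise in a bright pixel barely moves them.
    print(rn_gn(200, 200, 0))   # (127.5, 127.5)
    print(rn_gn(200, 199, 0))   # (~127.8, ~127.2)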
    Algorithm  Error   Frame 65        Frame 95         Frame 125        Frame 150       Total
    MOG        FP      6210 (8.09%)    13850 (18.03%)    7718 (10.05%)    157 (0.2%)     27935 (9.09%)
               FN      3234 (4.21%)     4272 (5.56%)    11697 (15.23%)   6738 (8.77%)    25941 (8.44%)
               FP+FN   9444 (12.3%)    18122 (23.59%)   19415 (25.28%)   6895 (8.97%)    53876 (17.53%)
    KDE        FP      4699 (6.12%)     4874 (6.35%)     1131 (1.47%)    5279 (6.78%)    15983 (5.2%)
               FN      9450 (12.3%)     5599 (7.29%)    14431 (18.79%)   4056 (5.28%)    33536 (10.92%)
               FP+FN  14149 (18.42%)   10473 (13.64%)   15562 (20.26%)   9335 (12.16%)   49519 (16.12%)
    CB         FP       817 (1.06%)      448 (0.58%)      328 (0.43%)     214 (0.28%)     1807 (0.59%)
               FN      6850 (8.92%)     7463 (9.72%)    16162 (21.04%)   5543 (7.22%)    36018 (11.72%)
               FP+FN   7667 (9.98%)     7911 (10.3%)    16490 (21.47%)   5757 (7.5%)     37825 (12.31%)

Table 5.4: Comparison of the three background subtraction pan/tilt algorithms for the sequence by Xu. For each algorithm there are three rows representing FPs, FNs and the combined number of errors. Each column shows the number of misclassified pixels in a frame and a percentage relative to the total number of pixels in the frame. The final column shows the total number of errors summed across all four frames.

5.2.2 Sequence 5

To test the detection of multiple targets in an indoor environment we ran the algorithms on a hockey game broadcast video taken with a tripod-mounted pan/tilt camera. The original image sequence contains 1000 frames at a high definition resolution of 1920 by 970 pixels. For testing purposes each image was scaled down to a resolution of 300 by 152 pixels.

In this image sequence a pan/tilt camera tracks play during a hockey game. Several frames from the image sequence are shown in Figure 5.10. The foreground objects in this sequence are the hockey players; the background objects are the rink and boards, with spectators behind them.

[Figure 5.10 — panels (a)–(d): frames 400, 600, 720 and 999]
Figure 5.10: Frames 400, 600, 720 and 999 of Sequence 5. Figure 5.10a and Figure 5.10b show complex interactions between players. Figure 5.10c and Figure 5.10d show the left and right parts of the rink, demonstrating the range of the camera's motion.

[Figure 5.11 — single high resolution frame with red reference lines overlaid]
Figure 5.11: Example of radial distortion in Sequence 5. The distortion is noticeable where the edge between the ice and the boards curves up and the glass edge curves down. Red lines are overlaid for reference.

Since the camera uses a wide angle lens, some distortion is present in the resulting images. Consider, for example, the high resolution version of frame 1 shown in Figure 5.11, in which red lines were drawn on top of the image to display the distortion of the rink boards. The far board, which is straight in real life, curves up at the left and right sides of the frame. Similarly, the portion of the glass visible at the bottom of the frame curves down at the right and left sides. This kind of distortion contributed to the RANSAC algorithm not being able to estimate some homographies very accurately. As a consequence, some image frames were not aligned precisely to the background model.

The ground truth and the detection results for the frames shown in Figure 5.10 are presented in Figure 5.12, and the detection errors for each algorithm are summarized in Table 5.5. Even though all algorithms were able to detect the players on the rink, the KDE algorithm significantly outperformed both the CB and the MOG algorithms. For frame 400, shown in Figure 5.12a, both the CB and the MOG algorithms produced a large number of FPs while KDE was able to detect the targets correctly.

Frame 400 was taken while the camera was quickly panning to the right, revealing previously unseen parts of the rink. When discovering previously unseen parts of the scene, KDE blindly updates both the short-term and the long-term background models until the number of samples in the model reaches full capacity. This lets KDE learn previously unseen parts of the scene very quickly. As a result, the KDE algorithm almost never misclassifies pixels from previously unseen areas of the scene, as shown in Figure 5.12m.
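A sketch of this "update blindly until full" policy is shown below. The class, field names and single-buffer structure are illustrative only; the actual KDE model maintains separate per-pixel sample buffers for the short-term and long-term models.

    import random

    class PixelSamples:
        """Per-pixel sample buffer for a KDE-style background model."""
        def __init__(self, capacity=100):
            self.capacity = capacity
            self.samples = []

        def update(self, value, is_background):
            if len(self.samples) < self.capacity:
                # Newly revealed pixel: accept every sample
                # unconditionally so the model reaches full capacity
                # as quickly as possible.
                self.samples.append(value)
            elif is_background:
                # Learned pixel: replace an old sample only
                # selectively, which is why relearning after a
                # background change is slow.
                self.samples[random.randrange(self.capacity)] = value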
For both MOG and CB the rate at which new parts of the scene are incorporated into the background model depends on the parameters α and ρ for MOG and T_add for CB. Since the camera was panning quickly, these two algorithms were not able to learn the background for the new parts of the scene as quickly as KDE. This resulted in a large number of FPs, as seen in Figure 5.12i and Figure 5.12q.

    Algorithm  Error   Frame 400        Frame 600      Frame 720      Frame 999       Total
    MOG        FP      17974 (39.42%)   3597 (7.89%)   1605 (3.52%)    735 (1.61%)   23911 (13.11%)
               FN        550 (1.21%)     603 (1.32%)    370 (0.81%)    597 (1.31%)    2120 (1.62%)
               FP+FN   18524 (40.63%)   4200 (9.21%)   1975 (4.33%)   1332 (2.92%)   26031 (14.73%)
    KDE        FP       2202 (4.83%)    2273 (4.98%)    987 (2.16%)    817 (1.79%)    6279 (3.44%)
               FN       1384 (3.04%)     438 (0.96%)    448 (0.98%)    616 (1.35%)    2886 (1.58%)
               FP+FN    3586 (7.87%)    2711 (5.94%)   1435 (3.14%)   1433 (3.14%)    9165 (5.02%)
    CB         FP      17758 (38.94%)   3982 (8.73%)    687 (1.51%)    583 (1.28%)   23010 (12.62%)
               FN       1024 (2.25%)     392 (0.86%)    404 (0.89%)    685 (1.5%)     2505 (1.37%)
               FP+FN   18782 (41.19%)   4374 (9.59%)   1091 (2.4%)    1268 (2.78%)   25515 (13.99%)

Table 5.5: Comparison of the three background subtraction pan/tilt algorithms for Sequence 5. For each algorithm there are three rows representing FPs, FNs and the combined number of errors. Each column shows the number of misclassified pixels in a frame and a percentage relative to the total number of pixels in the frame. The final column shows the total number of errors summed across all four frames.

[Figure 5.12 — 5x4 image grid, panels (a)–(t): frames 400, 600, 720 and 999; their ground truth; and the MOG, KDE and CB masks]
Figure 5.12: The first row shows actual frames from Sequence 5. The ground truth is shown in the second row, where white pixels denote foreground and black pixels denote background. The results produced by MOG are shown in the third row, while the results produced by KDE and CB are shown in the fourth and fifth rows respectively. Both Figure 5.12i and Figure 5.12q show a large number of FPs due to the inability of the MOG and CB algorithms to quickly learn the background model for new areas of the scene.

5.2.3 Sequence 6

An indoor video sequence was taken with a handheld camera, similar to Sequence 3. Again, this sequence was taken with a Panasonic SDR-S100 digital camera at a resolution of 704x576 pixels; each frame was then scaled down to a resolution of 160x131 pixels.

In this 454 frame sequence the camera sweeps the empty scene for the first 200 frames, then a person enters the visible part of the scene from the right, walks towards the camera, and leaves the scene in frame 419. A second person appears in frame 345 in the middle of the scene and also walks towards the camera. This person leaves the camera's field of view in frame 450 when the camera rotates away. Since the camera did not rotate about its optical center, the frames are not necessarily related by homographies.

Several frames from this sequence are shown in Figure 5.13.

[Figure 5.13 — panels (a)–(d): frames 239, 314, 406 and 442]
Figure 5.13: Frames 239, 314, 406 and 442 of Sequence 6. In Figure 5.13a the first person enters the camera's field of view. In Figure 5.13b the first person is walking towards the camera. In Figure 5.13c the first person is leaving the camera's field of view while a second person is seen walking in the back. In Figure 5.13d the second person is leaving the camera's field of view.
In Figure 5.13a the first person enters the visible part of the scene from the right. In frame 314, shown in Figure 5.13b, the camera is sweeping to the left and the first person is walking towards the camera. In Figure 5.13c the first person is leaving the camera's field of view while the second person is walking in the back. Figure 5.13d shows frame 442, in which the camera is sweeping to the right and the second person is leaving the camera's field of view.

The results show that all three background subtraction algorithms were able to detect the moving objects. As with Sequence 3, the resulting images in Figure 5.14 contain some misclassified pixels due to improper registration, visible as straight lines at the borders of background objects in the mask images produced by the algorithms. For the same reason there is a noticeable number of improperly classified pixels in the frames where a large plant is visible (i.e., frames 314 and 406). Since this plant is at a close distance to the camera and the camera's motion is not a pure rotation about its optical center, there is considerable motion parallax; as a result the plant is not always mapped to the same location in the background model.

[Figure 5.14 — 5x4 image grid, panels (a)–(t): frames 239, 314, 406 and 442; their ground truth; and the MOG, KDE and CB masks]
Figure 5.14: The first row shows actual frames from Sequence 6. The ground truth is shown in the second row, where white pixels denote foreground and black pixels denote background. The results produced by MOG are shown in the third row, while the results produced by KDE and CB are shown in the fourth and fifth rows respectively.

    Algorithm  Error   Frame 239      Frame 314       Frame 406      Frame 442       Total
    MOG        FP       134 (0.64%)    964 (4.6%)       99 (0.47%)    725 (3.46%)    1922 (2.29%)
               FN       158 (0.75%)    341 (1.63%)    1833 (8.75%)    346 (1.65%)    2678 (3.19%)
               FP+FN    292 (1.39%)   1305 (6.23%)    1932 (9.22%)   1071 (5.11%)    4600 (5.48%)
    KDE        FP        36 (0.17%)    379 (1.81%)     554 (2.64%)   1108 (5.29%)    2077 (2.48%)
               FN        77 (0.37%)    107 (0.51%)     297 (1.42%)    115 (0.55%)     596 (0.71%)
               FP+FN    113 (0.54%)    486 (2.32%)     851 (4.06%)   1223 (5.84%)    2673 (3.19%)
    CB         FP       182 (0.87%)    144 (0.69%)     310 (1.48%)    950 (4.53%)    1586 (1.89%)
               FN        59 (0.28%)     99 (0.47%)     356 (1.7%)     150 (0.72%)     664 (0.79%)
               FP+FN    241 (1.15%)    243 (1.16%)     666 (3.18%)   1100 (5.25%)    2250 (2.68%)

Table 5.6: Comparison of the three background subtraction pan/tilt algorithms for Sequence 6. For each algorithm there are three rows representing FPs, FNs and the combined number of errors. Each column shows the number of misclassified pixels in a frame and a percentage relative to the total number of pixels in the frame. The final column shows the total number of errors summed across all four frames.

The results produced by the MOG algorithm contain more pixels incorrectly classified as background (i.e., FNs) than the results produced by the KDE and CB algorithms. For example, in frame 239, seen in Figure 5.14i, the MOG algorithm failed to detect the upper part of the person's body. In frame 406, shown in Figure 5.14k, some parts of the target's body, such as the left arm, are not classified as foreground. These problems are caused by the similarity of the colors of the background and foreground pixels: in the first case the color of the target's t-shirt is very similar to the color of the stairs in the background, and in the second case the color of the arm is very similar to the color of the wooden wall.
The sensitivity of the MOG algorithm can be increased by decreasing the threshold D. However, since the same threshold is applied to every pixel in the image, decreasing it would remove these FNs while increasing the number of FPs in other areas of the image.

5.3 Discussion

The three algorithms showed different degrees of success in different situations, with the CB and the MOG algorithms being better suited for situations where the background changes. The KDE algorithm performed best for the sequences where the background did not change.

The KDE algorithm had the lowest number of errors in three sequences — Sequences 1, 3 and 5. In the other three sequences the CB algorithm had the fewest errors. In Sequence 2 the KDE algorithm had the largest number of misclassified pixels, for two main reasons. First, KDE produced many FNs by failing to detect parts of a moving car against dark pavement in frame 400 (shown in Figure 5.5m). This happened because the normalized color values of noisy dark pixels have very large variation and therefore do not always match the pixels in the model. While the normalized color model of Elgammal et al. [6] helps to avoid classifying shadows as foreground, it fails in cases where the objects or background are indeed dark and contain some noise. Second, KDE relearns the background model very slowly once objects that belonged to the background start moving. This resulted in an increased number of FPs in frames 660 and 900 of Sequence 2.

The large number of FPs produced by MOG and CB in Sequences 1 and 5 was the result of these algorithms' inability to quickly learn a background model for newly discovered parts of the scene. For these algorithms there is a balance between how fast new areas are incorporated into the background model and how fast stopped targets become background. If these algorithms incorporate new information quickly, they can sometimes erroneously incorporate stopped targets into the background model as well. This was the case for the MOG and CB algorithms in Sequence 1, as seen in Figure 5.2j and Figure 5.2r. On the other hand, if the rate at which the background is learned is low, large parts of the scene might be misclassified because there is no complete background model for these parts. This happened with the MOG and CB algorithms in Sequence 5, as seen in Figure 5.12i and Figure 5.12q. Therefore it might be useful for these algorithms to first build a complete background model by sweeping through the whole scene.

The KDE algorithm has no problem quickly building the model for new areas, but since there is no direct way to control the rate at which the algorithm adds immobile targets to the background model, it is very slow to change the model once it has been learned. The KDE algorithm might therefore be very good in cases where the background scene is not expected to change (such as sporting events).

Based on the detection results for both the indoor and outdoor handheld image sequences (Sequence 3 and Sequence 6) we conclude that it is possible to perform background subtraction on image sequences taken with a handheld camera when the camera moves in a fashion similar to pan/tilt. For the background subtraction algorithms tested it is not required that the frames be exactly related by a homography; some misalignment is allowed.
The multimodal nature of the BS algorithms allows them to model boundary situations in which one pixel in the background model represents several objects from the scene due to improper registration.

Chapter 6

Conclusions and Future Work

This thesis extends background subtraction algorithms designed for a static camera. To support camera pan/tilt, a combination of the Hough transform and the RANSAC algorithm is used to estimate the camera motion for each frame. The homography that transforms each new frame to the background model is estimated; the new frame is transformed to the background model's coordinate system and passed to a BS algorithm. This design allows any BS algorithm designed for a static camera to be extended to support a pan/tilt camera. Three background subtraction algorithms were extended in this way: the MOG, KDE and CB algorithms. These extended algorithms were then tested in different indoor and outdoor environments, and the results showed satisfactory performance in these environments. Furthermore, the experiments with Sequences 3 and 6 described in Chapter 5 have shown that it is possible to use this approach with a handheld camera whose motion only approximates that of an ideal pan/tilt camera; in these two sequences the algorithms were still able to detect moving targets.

A major issue inherent in most BS algorithms is the rate that determines how fast the algorithm accepts new information into the background model. If the algorithm adapts too fast, some foreground objects become part of the background once they are stationary for a short period of time or simply move slowly. If the algorithm adapts too slowly, changes in the background will continuously be misclassified as foreground. This trade-off is especially important in the case of a pan/tilt camera, since previously unseen parts of the scene become visible due only to the camera's motion. Because new parts of the scene have not been learned, a BS algorithm initially classifies them as foreground, producing many FPs. To expand the background model the algorithm must be able to incorporate new information quickly; consequently, it also becomes prone to incorporating slow moving or temporarily stopped targets into the background model.

One possible solution to this problem is to make the parameters responsible for the learning rate adapt to the estimated camera motion. For example, the size of the previously unseen area in the current frame can be calculated by the background subtraction module described in Chapter 4 when it needs to grow the background model.

In the case of MOG, depending on how much the model is grown, the algorithm may change the learning parameters α and ρ to facilitate quicker learning without incorporating slowly moving targets into the background model. Large growth of the model means that the camera is viewing a previously unseen area; therefore the parameters α and ρ may be increased for several frames to let the algorithm quickly learn the new background. If there is no model growth, the camera is viewing previously seen areas that are already part of the model, and the parameters α and ρ may be lowered for better detection of slowly moving targets.
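A minimal sketch of this proposed coupling between model growth and the MOG learning factors follows. The growth measure, the threshold and the rate values are hypothetical choices for illustration; none of this exists in the thesis implementation.

    def adapt_learning_rates(new_area_fraction,
                             alpha_slow=0.002, alpha_fast=0.05,
                             growth_threshold=0.05):
        """Pick MOG learning factors from the fraction of the current
        frame that the background model had to grow to accommodate.

        new_area_fraction -- previously unseen pixels divided by the
        total pixels in the frame, reported by the background
        subtraction module when it grows the model (hypothetical
        interface)."""
        if new_area_fraction > growth_threshold:
            # Camera is revealing new scenery: learn quickly for a while.
            alpha = rho = alpha_fast
        else:
            # Revisiting known scenery: learn slowly so that slow or
            # briefly stopped targets are not absorbed into the model.
            alpha = rho = alpha_slow
        return alpha, rho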
Similarly, in the case of the CB algorithm, the pixels in the background model that correspond to newly discovered areas will not contain any codewords for the number of frames specified by the parameter T_add. One possible solution is to add the first seen pixel value as a codeword whenever there are no other codewords for that pixel in the background model. If this pixel turns out to represent a moving object it will eventually be deleted from the model after the number of frames specified by the parameter T_delete; on the other hand, if it is indeed a background pixel it will persist in the model.
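One way this could look for a newly grown model pixel is sketched below. The Codeword class and the bookkeeping fields are illustrative, not the thesis data structures; only the seeding rule and the T_delete staleness rule follow the text above.

    class Codeword:
        def __init__(self, color, frame_no):
            self.color = color
            self.created = frame_no
            self.last_access = frame_no

    def update_new_pixel(codebook, color, frame_no, t_delete=200):
        """Background codebook update for a pixel in a newly discovered
        area: seed an empty codebook with the first observed color
        instead of waiting T_add frames, and let the usual staleness
        rule (T_delete) remove it later if it belonged to a moving
        object."""
        if not codebook:
            codebook.append(Codeword(color, frame_no))
            return
        # Usual path: drop codewords not accessed for t_delete frames.
        codebook[:] = [cw for cw in codebook
                       if frame_no - cw.last_access <= t_delete]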
Another drawback of the algorithms presented in this thesis is that the computational requirements can be high. The current implementation in Python takes up to 40 seconds to process a single 160x131 pixel frame on a quad core Intel i5-750 processor. Warping a new frame to the background model's coordinate frame takes most of the computation time. Due to the modularity of the algorithm it is possible to parallelize the process so that the two modules described in Chapter 3 and Chapter 4 run in parallel: the motion estimation module warps frames and pushes them into a queue, while the background subtraction module reads the warped frames from the queue and performs classification.

There are several ways the work can be extended and improved. With the advent of high frame rate cameras that can shoot sequences at speeds of up to 1000 fps, it is possible to perform BS on the same image sequence at different rates to locate objects that are moving at different speeds. The learning rate controls how fast objects get incorporated into the background model; this allows BS algorithms to be tailored to detect objects moving at a specific speed. It is also possible to perform background subtraction on different "time layers", where each "layer" contains fewer frames than the layer before. Objects moving at different speeds will be detected at different "layers".

Usually the colors of neighboring pixels are similar if the pixels belong to the same object. Several algorithms have proposed different approaches for dealing with this spatial dependency. Some of these approaches consider neighboring pixels or blocks of pixels when classifying a pixel in an image [19, 22, 29, 35]. Another approach, proposed by Wu and Peng [40], uses an MRF to represent the dependency between pixels. A possible spatial extension to a BS algorithm is to represent the background model as a Gaussian pyramid at different scales and perform background subtraction at each level individually. The final classification result can be obtained by combining the outputs from all levels of the pyramid. This approach avoids the blocky results produced by methods that consider blocks of pixels and decreases the computational load in comparison to methods that use an MRF.

A RANSAC algorithm fails when there are many outliers. Pre-processing using the Hough transform helps to eliminate outlying feature points whose trajectories are not at all consistent with the homography estimated in the previous step. The approach taken does not yet take camera motion into account; nevertheless, satisfactory results were obtained for all cases presented in Chapter 5. One issue with this filtering step is that it could lead to problems if the camera's motion is both rapid and erratic: in this case the filtering step could erroneously remove inliers, since they would not be consistent with the homography obtained in the previous step. A useful improvement would be to incorporate a prediction technique, such as a Kalman filter, to obtain a better initial estimate of the transformation between consecutive frames.

Bibliography

[1] http://www.cs.ubc.ca/nest/lci/thesis/etsinko/. → pages 65

[2] Y. Abdel-Aziz and H. Karara. Direct linear transformation from comparator coordinates into object space coordinates in close-range photogrammetry. Amer. Soc. Photogrammetry, pages 1–18, 1971. → pages 22, 23

[3] D. Ballard. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13(2):111–122, 1981. → pages 28

[4] G. Bradski and A. Kaehler. Learning OpenCV. O'Reilly, 2008. → pages 3

[5] O. Chum, T. Pajdla, and P. Sturm. The geometric error for homographies. Computer Vision and Image Understanding, 97(1):86–102, 2005. → pages 26

[6] A. Elgammal, D. Harwood, and L. Davis. Non-parametric model for background subtraction. Lecture Notes in Computer Science, 1843:751–767, 2000. → pages 6, 9, 11, 12, 13, 15, 20, 45, 46, 66, 92

[7] D. Farin, P. de With, and W. Effelsberg. Video-object segmentation using multi-sprite background subtraction. In Proc. IEEE International Conference on Multimedia and Expo (ICME), pages 343–346. Citeseer, 2004. → pages 16, 17, 18, 19

[8] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981. → pages 6, 12, 17, 22

[9] N. Friedman and S. Russell. Image segmentation in video sequences: A probabilistic approach. In Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence (UAI), pages 175–181, 1997. → pages 9, 10, 18, 19

[10] K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1):32–40, 1975. → pages 12, 32

[11] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003. → pages 21, 22, 23, 24, 25, 26, 27

[12] E. Hayman and J. Eklundh. Statistical background subtraction for a mobile observer. In Ninth IEEE International Conference on Computer Vision, 2003. Proceedings, pages 67–74, 2003. → pages 16, 18, 19, 40, 41, 45

[13] P. KaewTraKulPong and R. Bowden. An improved adaptive background mixture model for real-time tracking with shadow detection. In Proc. European Workshop on Advanced Video Based Surveillance Systems, volume 1. Citeseer, 2001. → pages 11, 19

[14] K. Karmann and A. von Brandt. Moving object recognition using an adaptive background memory. Time-Varying Image Processing and Moving Object Recognition, 2:289–296, 1990. → pages 10

[15] K. Kim, T. Chalidabhongse, D. Harwood, and L. Davis. Real-time foreground–background segmentation using codebook model. Real-Time Imaging, 11(3):172–185, 2005. → pages vi, 6, 11, 12, 13, 15, 20, 47, 49, 50, 51, 53, 54

[16] D. Koller, J. Weber, and J. Malik. Robust multiple car tracking with occlusion reasoning. Computer Vision — ECCV'94, pages 189–196, 1994. → pages 1, 10

[17] A. Lipton, H. Fujiyoshi, and R. Patil. Moving target classification and tracking from real-time video. In IEEE Workshop on Applications of Computer Vision, volume 14, page 3. Citeseer, 1998. → pages 9

[18] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004. → pages 6, 22, 26, 27

[19] T. Matsuyama, T. Ohya, and H. Habe. Background subtraction for non-stationary scenes. In Proceedings of the Asian Conference on Computer Vision, pages 662–667, 2000. → pages vi, 14, 59, 61, 62, 96

[20] A. Mittal and D. Huttenlocher. Scene modeling for wide area surveillance and image synthesis. In CVPR, page 2160. Published by the IEEE Computer Society, 2000. → pages 5, 16, 19, 40, 41, 45

[21] S. Nagaya, T. Miyatake, T. Fujita, W. Ito, and H. Ueda. Moving object detection by time-correlation-based background judgement method. IEICE Trans. Inf. Syst., 79:568–576, 1996. → pages 59
[22] N. Oliver, B. Rosario, and A. Pentland. A Bayesian computer vision system for modeling human interactions. Computer Vision Systems, pages 255–272, 1999. → pages 14, 15, 96

[23] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962. → pages 11, 45

[24] M. Piccardi and T. Jan. Mean-shift background image modelling. In 2004 International Conference on Image Processing, ICIP'04, pages 3399–3402, 2004. → pages 12

[25] P. Power and J. Schoonees. Understanding background mixture models for foreground segmentation. In Proceedings Image and Vision Computing New Zealand, volume 2002. Citeseer, 2002. → pages 10, 11

[26] Y. Ren, C. Chua, and Y. Ho. Motion detection with nonstationary background. Machine Vision and Applications, 13(5):332–343, 2003. → pages 9, 16, 18, 19, 40, 45

[27] C. Ridder, O. Munkelt, and H. Kirchner. Adaptive background estimation and foreground detection using Kalman-filtering. In Proceedings of the International Conference on Recent Advances in Mechatronics, pages 193–199. Citeseer, 1995. → pages 10

[28] P. Sampson. Fitting conic sections to "very scattered" data: An iterative refinement of the Bookstein algorithm. Computer Graphics and Image Processing, 18(1):97–108, 1982. → pages 22, 26

[29] M. Seki, T. Wada, H. Fujiwara, and K. Sumi. Background subtraction based on cooccurrence of image variations. 2003. → pages 14, 96

[30] Y. Sheikh, O. Javed, and T. Kanade. Background subtraction for freely moving cameras. In IEEE International Conference on Computer Vision, 2009. → pages 16, 17

[31] C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 246–252, 1999. → pages 6, 9, 10, 11, 12, 18, 19, 41, 45

[32] C. Stauffer and W. Grimson. Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):747–757, 2000. → pages 1, 2

[33] Y. Sugaya and K. Kanatani. Extracting moving objects from a moving camera video sequence. In Proceedings of the 10th Symposium on Sensing via Imaging Information, pages 279–284. Citeseer, 2004. → pages 5, 16, 19

[34] P. Torr and A. Zisserman. MLESAC: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding, 78(1):138–156, 2000. ISSN 1077-3142. → pages 19

[35] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers. Wallflower: Principles and practice of background maintenance. In ICCV, page 255. Published by the IEEE Computer Society, 1999. → pages 1, 9, 13, 14, 58, 62, 96

[36] R. Vemulapalli and R. Aravind. Spatio-temporal nonparametric background modeling and subtraction. In IEEE 12th International Conference on Computer Vision, 2009. → pages v, 12, 14, 15, 46

[37] H. Wang and D. Suter. A consensus-based method for tracking: Modelling background scenario and foreground appearance. Pattern Recognition, 40(3):1091–1105, 2007. → pages 2, 12

[38] F. Wauthier. Motion tracking. http://www.anc.ed.ac.uk/demos/tracker/. → pages 41

[39] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780–785, 1997. → pages 9, 10

[40] M. Wu and X. Peng. Spatio-temporal context for codebook-based dynamic background subtraction.
AEU — International Journal of Electronics and Communications, 2009. → pages 3, 14, 15, 16, 96

[41] C. Xu. Datasets used for background subtraction. http://userweb.cs.utexas.edu/~changhai/icra10-datasets/video4.avi. → pages v, vii, 29, 80, 83

[42] W. Xu, Y. Zhou, Y. Gong, and H. Tao. Background modeling using time dependent Markov random field with image pyramid. In Proc. of the IEEE Workshop on Motion and Video Computing, Breckenridge, USA, pages 8–13. Citeseer, 2005. → pages 3

