Object Recognition with Many Local Features by Scott Helmer B.Sc. University of Toronto, 2002 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L L M E N T OF T H E R E Q U I R E M E N T S F O R T H E D E G R E E OF M a s t e r o f S c i e n c e in T H E F A C U L T Y OF G R A D U A T E STUDIES (Department of Computer Science) We accept this thesis as conforming to the required standard T h e U n i v e r s i t y o f B r i t i s h C o l u m b i a September 2004 © Scott Helmer, 2004 THE UNIVERSITY OF BRITISH COLUMBIA FACULTY OF GRADUATE STUDIES 3 Library Authorization In present ing this thesis in partial fulf i l lment of the requi rements for an advanced degree at the Universi ty of British Columbia, I agree that the Library shall make it f reely avai lable for reference and study. I further agree that permission for extensive copy ing of this thesis for scholar ly purposes may be granted by the head of my depar tment or by his or her representat ives. It is understood that copying or publ icat ion of this thesis for f inancial gain shall not be a l lowed wi thout my wri t ten permiss ion. N a m e of Author (please print) Date (dd/mm/yyyy) Tit le of Thesis: D e 9 r e e : NV*k/~ r>(~ $s<r2ej*c$ iluirib Year: Depar tment ( The Universi ty of British Co m ia Vancouver , BC Canada grad.ubc.ca/forms/?formlD=THS page 1 of 1 last updated: 20-Jul-04 A b s t r a c t There has been a great deal of attention focused on part-based approaches to ob-ject classification in recent research in computer vision, and some approaches have achieved a surprising amount of success. However, learning models with a large num-ber of parts has been a particular challenge. One of the most successful approaches is that of Fergus et al. [5] who have developed a generative model for recognition that achieves excellent results on a variety of datasets. The learning method that they present to learn the parameters for the model, however, requires an exponential amount of time to train as the number of parts increase. The primary contribution of this thesis is the extension of their generative model, and the development of a learning algorithm that can learn a large number of parts in a reasonable amount of time. In particular, we have developed an incremental learning algorithm where the • model.is initialized intelligently with a small number of parts, and parts are added to the model one at a time. By taking such an approach we are able to learn models with a large number of parts in nearly a linear amount of time in the number of parts. The approach is validated on a number of datasets, including cars, motorbikes, and faces, and demonstrates excellent recognition results along with large models learned in a reasonable amount of time. • • ' ' . -ii C o n t e n t s Abstract ii Contents iii List of Figures v List of Tables vi Acknowledgements vii 1 Introduction 1 2 Part-Based Approaches to Object Recognition 5 2.1 Local versus global approaches 5 2.2 Parts 6 2.2.1 Representation of an image 7 2.3 Recognizing an object class with parts 9 2.3.1 Probabilistic approaches to recognition 9 2.4 The constellation model 11 2.4.1 Matching features to parts 11 3 Model 13 3.1 Background model 16 3.1.1 Background scale . 17 3.1.2 Background location 17 3.1.3 Background appearance 17 3.2 Foreground model 18 3.2.1 Part appearance 19 3.2.2 Part location 19 3.2.3 Part scale 20 3.2.4 Part statistics 20 3.3 Approximating object center and scale 20 3.4 Probability on sets 22 iii 4 Learning 24 4.1 Parameter estimation by Expectation Maximization 25 4.1.1 Expectation 27 4.1.2 Maximization 28 4.1.3 Appearance updates 29 4.1.4 Location updates 29 4.1.5 Scale updates 31 4.1.6 Part weight updates 31 4.2 Incremental learning 32 4.3 Intelligent initialization for models and parts 34 4.3.1 Image matching 34 4.3.2 Initializing a small model and a sample set 37 5 Model Computation 39 5.1 False matches and enforcing geometry 39 5.2 Search space techniques 40 5.2.1 Trimming the search space 43 5.3 Search space techniques for E M 44 6 Experiments 47 6.1 Feature detectors and descriptors 47 6.2 Experimental setup •• • 48 6.2.1 Datasets ".' 49 6.3 Computation time required to learn object models 50 6.4 Results 50 6.4.1 Comparison to previous methods . 51 6.4.2 Incremental learning 52 6.4.3 Learned models 54 7 Conclusion and Future Work 63 Bibliography 66 iv L i s t o f F i g u r e s 1.1 Example of a set of a part-based approach to object recognition . . . 2 2.1 Modelling an object by appearance versus by the underlying 3D structure 6 2.2 Representation of an image using local features 8 2.3 A n example of a matching h 10 3.1 A n example of an object generating an image in a part-based approach 14 3.2 The graphical model for the random variables h, A, X , S, c, and Bh 16 4.1 Example of image matching 35 5.1 Searching through a hypothesis space H to find matchings h that con-tribute to p(A, X, S) 41 5.2 Pseudo algorithm for approximating p(A, X, S) . 42 5.3 Techniques to reuse information from E M in order to compute p(A, X, S) efficiently in learning 45 6.1 Example of a SIFT feature descriptor 48 6.2 These ROC curves demonstrate that as the number of parts increase the R O C curve improves 53 6.3 Graph of R O C equal rate versus number of parts in the model . . . . 54 6.4 Examples of images that are easily recognized and those that are not 55 6.5 A 10 part model for motorbikes 58 6.6 A 10 part model for cars viewed from the rear 59 6.7 A 10 part model for cars viewed from the side 60 6.8 The locations for the parts in a 25 part model for faces 61 6.9 The appearance for parts in 25 part model for faces 62 L i s t o f T a b l e s 6.1 R O C equal error results comparing our results to that of Fergus et al. 52 vi A c k n o w l e d g e m e n t s First and foremost I'd like to thank my supervisor Prof. David Lowe for his patience, kindness, and for his many insights that allowed me to complete this thesis. Thanks also goes out to my co-supervisor Prof. Nando de Freitas for guidance and for teaching me that the math and the ideas behind the math go hand in hand. Many thanks also goes out to the Orphanage, where my generous house mates fed me when I was working frantically, and also provided me with much needed cheer. Long live the Lord of Catan, who ever he or she may be. I'd also like to thank my family, who have provided me with support through this whole process. - • - . SCOTT HELMER The University of British Columbia October 2004 vii C h a p t e r 1 I n t r o d u c t i o n Of all the tasks that humans perform with intelligence, visual object recognition seems to be one of the most effortless, done mostly without conscious thought. Yet to solve this problem consciously is far from effortless. The primary source of the difficulty that underlies solving the problem of object recognition is the notion of similarity. To see this, note that an object class can be defined by its functionality, such as chairs, or by its shape, such as cylinders, or by its intrinsic nature, such as mammals. As a result, the possible features that are relevant in object recognition are many, and vary in importance for each class. In many cases, even defining an object class explicitly in terms of necessary and sufficient characteristics is nearly impossible. Even more important is that many of the primary characteristics of an object class are not manifested visually, at least not directly. For example, an image of a dog is also an image of a living thing, yet the class of living things displays no regular visual characteristics. However, if we first judged the image as containing a dog, we could also infer that it is also contains a living thing. As a result, solving the object recognition problem requires significant machinery that is unrelated to visual phenomena. A great deal of success has been achieved in recognizing specific objects [8], such as a particular teddy bear, rather than an object class, such as teddy bears. The reason for this is that recognizing an object class is arguably a more difficult task in computer vision, and this is because there is variation in the appearance and shape across an object class, whereas this is not the case for single objects. Moreover, some success has been achieved in recognizing an object class visually with specific ob-ject classes. Examples of such success are face recognition [18] and digit recognition [10]. However, the approaches developed for these object classes generally don't ex-tend to general object classes because they make restrictive assumptions, require vast amounts of training time, or utilize visual features that are only useful in classifica-tion for a few object classes. Applications such as image labelling on the Internet and robot navigation require a vision system that can recognize a wide variety of object classes, while remaining computationally tractable. The concern of this thesis is the 1 2 development of an approach that can recognize general object classes, such as cars, faces, and motorbikes, and generally from canonical views. Image features are matched to model parts Matching is evaluated according to model to produce a measure used for recognition Figure 1.1: A n example of finding an optimal matching from features in the image to parts in the model. There can be more that one matching that contributes to the final decision as to whether the image contains an instance of the object class. The primary job of a vision system designed to recognize an object class from a 2D image is to determine what visual characteristics distinguish the object class from others. Such visual characteristics could be the colour, shape, or texture of the object 3 or parts of the object. However, the appearance of an image containing the object is dependent upon the what angle the object is viewed at, the distance the object is from the camera, its location and illumination, and other occluding objects. Yet whether an object is in a particular class is independent of all of these variables. Any successful vision system designed to recognize objects from 2D images must also be at least partially invariant to these variables. As a result we have taken a part-based approach, where we model the object as a collection of parts. By doing this, if the object is occluded, the occlusion only affects the occluded parts, and the object can still be recognized based on the parts that are not occluded. Moreover, translation and scale invariance become computationally tractable, as outlined later. A general part-based approach to object recognition extracts a set of features from an image, determines a matching of these features to parts in the model, sometimes multiple matchings, and produces a score that indicates how likely it is that the image contains an instance of the object class. See Figure 1.1 for an example. Our approach in particular evaluates a matching based both upon the spatial arrangement of the parts, the scale of these parts, and the appearance. A motivation for taking a part-based approach is that many object classes have characteristics that manifest themselves as distinct visual parts. A face for example is composed of eyes, a nose, and a mouth, and each of these are local in the sense that their appearance is captured in a region that is smaller than the whole object. Thus, modelling an object class by its parts can decompose the problem into recognizing pieces of the object, each of which will contain less variability than the variability of object as a whole, which makes it significantly easier to model. The approach presented in the thesis is an extension of the constellation model presented in Fergus et al.[5], with a number of novel extensions and explorations. The object class is modelled using a generative model, with distributions over the appear-ance, location and scale of parts. The measure of how likely an image is to contain an object is the sum of the likelihood of seeing the features of the image over all possible matchings of features to parts of the model. The difficulty with this model is that the number of matchings is exponential in the number of parts of the model, which makes evaluation and learning difficult. The Fergus et al.approach learns the parameters for the probability model simply by using Expectation Maximization (EM). In E M it is required that the probability is evaluated for all non-zero matchings, and with a random initialization almost all possible matchings must therefore be evaluated. By doing this they are restricted to models with less than 7 parts for reasons of efficiency. The primary contribution of this thesis is the development of a tractable approach 4 to learning models with more than a few parts. This is done by intelligently initial-izing a model with a small number of parts, and then adding parts incrementally to the model and updating parameters using E M after each addition. One of the primary advantages of such an approach is that not all possible matchings need to be reevaluated when adding a new part. From previous iterations we know the non-zero matchings for the previous parts, so the search space is greatly reduced. Another contribution is a new approach to scale and translation invariance that results in more a accurate geometric model on parts, which accelerates learning and improves recognition results. To achieve the scale and translation invariance, the location and scale of image features are first transformed into a model coordinate space before'being evaluated by the model. The particular transformation chosen results in the features being transformed to the most likely model coordinates given the model parameters. In the remainder of this thesis we will first describe a variety of related work and background related to object recognition. This will be followed by the description of the probabilistic model, in Chapter 3, and how the parameters are learned, in Chapter 4. Chapter 5 will describe a variety of techniques that are used to make the approach tractable. The approach is then validated with a set of experiments on a variety of object classes, including motorbikes, faces, and cars. The final chapter will then discuss some short falls of the approach and future work. Note A portion of material presented in the introduction and succeeding chapters appeared in the Generative-Model Based Vision Workshop 2004, Helmer et al. [7] C h a p t e r 2 P a r t - B a s e d A p p r o a c h e s t o O b j e c t R e c o g n i t i o n The computer vision literature is rich with approaches to solving the object recogni-tion problem, and approaches can be divided along many different lines. A complete overview of the field however is beyond the scope of this thesis. As a result, we outline the approaches that deal with the issues that were relevant in the design of the system outlined later in the thesis. The bulk of this chapter discusses part-based approaches to object recognition, and in Section 2.2.1 onwards some notation is introduced that is used throughout the thesis. 2.1 Local versus global approaches Two distinctive methodologies exist to solving the object recognition problem: global and local approaches. Global approaches extract features of the object that are generally properties of the object as a whole, such as colour histograms [15] or texture and shape [12] [16] . Local approaches, on the other hand, divide the object into smaller regions, and these are then used to recognize the object [13] [17] [14] [8] [5]. It should be noted however that it is not necessary that a vision system for object recognition take a strictly local or strictly global approach. In fact, certain object classes are devoid of any clear notion of parts, such as lakes or walls, whereas some object classes are little more than the sum of their parts, at least visually. Much of the current research in object recognition is devoted to part based ap-proaches because of several of its natural advantages. One of the primary advantages is that the background and occluding objects can be more easily discarded by simply looking only at the features on the object. In global approaches, however, difficult segmentation, or other restrictive assumptions, such as a uniform background, must be made so that the background and occluding objects do not influence the classifica-tion of the foreground. Another advantage is modelling the appearance of an object 5 2.2. Parts 6 class as a whole is a more complex task than modelling a subset of its parts. 2.2 Parts Using a part based approach first requires asking the question, how should a part be represented? Early research using a part-based approach by Biederman et al. focused on mod-elling the 3D structure of an object class, and inferring the 3D structure of an object from an image in order to do classification [2]. In Biederman's work, he theorizes that objects can be modelled using geons, or geometric ions, which are small 3D struc-tures such as cylinders, cubes, etc. A n object class in this case is a collection of geons with some sort of geometric relations and determining whether an object is present reduces to inferring the geons present, their geometric relations, and determining if this coincides with the object class model. A prior definition of parts such as geons may seem attractive, but determining their presence is difficult since it involves inferring 3D structure from a 2D signal. A n alternative approach uses an appearance based approach Where a part is modelled only as the 2D signal it presents in an image. In this way inferring the 3D structure can be avoided entirely. In an appearance based representation a part is modelled upon the pixel values of image patch directly, or on some transformation performed on the pixel values, such as P C A . 2 2 2 2 2 2 2 0 0 2 2 2 2 2 2 2 2 0 2 2 4 4 4 4 4 4 4 2 2 4 4 4 4 4 4 4 2 2 4 4 4 4 4 4 4 0 2 4 4 4 4 4 4 4 0 0 4 4 4 4 4 4 4 Figure 2.1: Representing and modelling parts based directly on its 2D appearance is an easier task then having to infer 3D structure. Many current approaches to object recognition take an appearance based ap-2.2. Parts 7 proach, but there is still the question of how to obtain parts that can be used for recognition. There are a variety of supervised approaches, Burl et al. [3] for example, where humans hand label parts on an image dataset. By taking such an approach, the question can be avoided, but labelling is time consuming, and it is desirable to be able to learn a model of an object class with as little supervision as possible. Another approach is to learn the parts from a dataset of images containing the object class in an unsupervised manner. However, any region of an image can serve as a possible part, and searching through regions at every location and at multiples sizes in all images in the dataset is prohibitively expensive. Shimon Ullman et al. [17] take this approach to part selection, incrementally selecting features that reduce the entropy of classification. However, as mentioned, searching over all locations and multiple scales is computationally expensive. 2.2.1 Representation of an image One way to overcome this problem is to select interesting regions in the image, en-code these local regions, and then to use only these to learn parts and for recognition. The motivation behind such an approach is that an image of an object class is gen-erally distinguished by regions that contain texture rather than those that do not. Thus, by representing an image as a selection of features, we reduce the search space to interesting regions. Moreover, many of these local feature approaches can help with dealing with invariance to scale, translation, and rotation. Additional discus-sion concerning feature detectors and descriptors can be found in Section 6.1 of the Experiments chapter, where we outline the particular feature detector and descriptor we used in our experiments. A n input image i is represented as a set of N features T\ which are extracted as a preprocessing step. For notational convenience we drop the reference to the image i in some sections of the thesis. Each feature / G T is composed of a d dimensional appearance vector A / , which describes a local region of scale S/ , centred at location vector X / which is the 2 dimensional tuple (Xfti,Xft2). The scale and location are described relative to image pixel coordinates. See Figure 2.2 for an illustration. 2.2. Parts 8 Figure 2.2: A feature / is describe by an image location X / = ( X / t l , X / ) 2 ) , a scale S and an appearance vector A / , which is some description based upon the pixel values within the circular region. 2.3. Recognizing an object class with parts 9 2.3 Recognizing an object class with parts When an image is given to an object recognition system, the system outputs a score that is then used to determine whether the object is present. Exactly how this score is produced and what it means varies widely across part-based approaches. In general, part based approaches are concerned with finding a matching, or matchings, that matches features to parts in the model, and the score is based upon this matching. We define a matching as h, where h(p) = f means that a feature / of the image matches part p. If a part is occluded or not present then h(p) = 0. A n example can be found in Figure 2.3. Most part-based approaches to recognition have a model over the appearance of an object class' parts and a model over the geometric relationship between the parts. One approach to recognition by Schmid et al. [14], uses the the model over the appearance to determine a possible matching h, then uses the geometric model to refine and score this matching. To find an initial matching they use a nearest neighbour approach, based upon the Mahalanobis distance, to extract matches from features to parts. Moreover, their geometric model encodes for each part the angles between neighbouring parts, and in this manner they can achieve scale invariance in their geometric model. After finding the initial matching, the matches that don't agree with the geometric model are removed from the matching.. The final score is then the number of matches remaining in the matching. Although this is conceptually an effective approach, the uncertainty of the matches is not represented, and it does not represent the relationship between false matches and model size. 2.3.1 Probabilistic approaches to recognition Most recent approaches take a probabilistic approach to object recognition instead of basing the measure on the number of matches. Given the object class C that we seek to recognize, the most obvious measure as to whether an object class is present in an image is the probability that the object class is in the image. That is, p(C\T) (2.1) Pope et al. [13] propose an approach to recognize objects using p(C\!F) by first finding a good matching h, and a transformation, t, that transforms image feature coordinates into model coordinate space. They then estimate ^(ClJ 7 ) based up this single matching and transformation. Object classes are represented as containing 2.3. Recognizing an object class with parts Features of the Image 10 Parts of the Model •Background (0) Figure 2.3: A n example of a matching h. Note that h does not have to be the true matching. For each of the features that are matched to a point it is implied that the remaining features are matched to the background parts of various types with with a Gaussian distribution over the part locations in model coordinate space. In their approach, p(C\!F) is approximated as p{C\T) « where they assume that part locations and appearance are independent given the transformations t and matching h, and that p(h,t\C) and p(h, t) are uniform. In their matching procedure they iteratively add a match of a feature to a part such that the match, along with the updated transformation t, improves the probability of the match the most. The difficulty with this approach is that it is designed for recognition of a single object, and it is unclear how to learn parts effectively for a general object class. A n alternate approach is to learn the density p(T\C), the probability of seeing these features given that the object is present. In an approach by Weber et al. [19] they model p(F\C), and use the likelihood ratio p{FM,t\C) p(A|h,t,g)p(X|h,t)p(h,t|C) p(JP|h,t)p(h,t) P(C) (2.2) (2-3) (2.4) 2.4- The constellation model 11 R = Whc) ( 2 - 5 ) to compute a measure for whether the object is in the image. This ratio compares the likelihood of seeing this set of features given that the object class is present against the likelihood of seeing the features if the object class was not present. The main advantage of using the ratio R for recognition is that the parameters 9 for p(.F|C) can be learned using E M , which requires little supervision. 2.4 The constellation model One of the most successful approaches to general object recognition is that of Fergus et al.[5]. The approach they present is an extension of the constellation model that was developed by Burl et al. [3] and extended by Weber et al. [19]. Although they attain good results on a variety of datasets, it takes an enormous amount of time to learn and is limited to just 6 or 7 parts to keep learning tractable. Since the approach presented in this thesis builds on the constellation model, it is useful to introduce some of the details. 2.4.1 Matching features to parts Suppose an image contains features T, and the model contains P parts with pa-rameters 6, there still is the problem of matching features to parts. Most previous approaches first match features to parts based solely on appearance and then later remove false matches by utilizing the learned geometric relations between parts. The approach taken by Fergus et ai, on the other hand, is to evaluate the value of a matching h jointly based upon the appearance and location. Now, they propose that if they had the matching h then p{F\h) = p(A,X,S|h) (2.6) = p(A|h)p(X|h)p(S|h) (2.7) They assume that the appearance, scale, and the location of a part are independent of each other when the matching h is given. In general, h is unknown, so they integrate it out of the joint p(A, X, S, h) to obtain p(A, X, S). That is, they define 2.4- The constellation model 12 p(A,X,S) - £ p ( A | h ) p ( X | h ) p ( S | h ) p ( h ) (2-8) hew where the hypothesis space TC contains all possible matchings of parts to features. The distributions p(A|h), p(X|h), and p(S|h) are all Gaussian distributions with distinct parameters for each part. However, they evaluate X and S in model coordi-nate space rather than image coordinate space. To achieve this scale and translation invariance, they use parts p that are matched to a feature / as a landmark point, and translate and scale the other features locations to model coordinate space by using the parameters of part p. This poses a difficulty because which feature is chosen dictates where the other features are transformed to in model coordinate space and their is no clear way to chose the landmark point. One thing to note in Equation 2.7 is that \H\ G 0 ( | F | P ) , so evaluating p(A, X , S) exactly is intractable for models with more than 6 or 7 parts. If the model is accurate than A * can be used to approximate p(A- X, S). However, the manner in which they learn the parameters for the model is intractable for models with more than 7 parts. The reason for this will be discussed more fully in Section 4.1.1. One of the primary contributions of this thesis is the introduction of a learning scheme that remains tractable even for models with a large number of parts. C h a p t e r 3 M o d e l The approach we take to the object recognition problem is to learn a distribution p(F\C), and use the likelihood ratio R — p(F\C) / p(F\->C) to recognize an object. Given a scene with an instance of the object class, a 2D image is generated by the foreground object, and by the background scene. This naturally leads to the choice of modelling p(^"|C) as a generative model, where the features T are generated by the parts of the object and the background. The model presented in this chapter builds upon the constellation model of Fergus et ai; the similarities and differences will be noted as they are presented. If a part p generates a particular feature / , then h(p) = / , so h is the matching from parts to features. If a feature / is not assigned to a part then it is assigned to the background, the set of background features for a particular matching h is denoted as Bh . In addition, if the object is in the image, then it will occur at some location and scale, denoted c = {XC,SC}. Naturally X c G TZ2 such that X c does not exceed the dimensions of the image, and S c £ TZ such that it does not exceed the maximum dimension of the image. The joint space to which c belongs is referred to as C. So, for each of the parts that generated a feature in the image, the feature will occur at some location and scale relative to c, and with some appearance. A n example of how the model is generative is illustrated in Figure 3.1. For notational convenience, we refer top(^-"|C) as the density p(A, X, S|0) where in most instances 9 is dropped. The background density p{!F\-<C) is sometimes referred to as p (A,X, S|h©) where b.0 is the matching that matches all features in the image to the background. Let the probability of seeing the features of an image, given that we have a match-ing h, and the object's centre and scale, c, and model parameters 9, be p(A, X, S|h, c) = p(A|h)p(X|h, c)p(S|h, c) (3.1) That is, if we know c and h, then the appearance, location, and scale of the features are all independent. This assumption is similar to that of Fergus et al, with the exception that Fergus et al. don't introduce the parameter c. Although the indepen-13 1 4 \ — i — | OS? • A 2 Features in model coordinates JOB \ M B . . . . A . . . x ' 3 Features in image coordinates Figure 3.1: When an object class is in an image, its constituent parts that are present generate features, where each feature / has an appearance Af, a location and scale in model coordinate space Yf and £». The object then generates a center and scale of the object in image coordinate space c = {X c, Sc}. As a result, the image location of feature / is X / = ^ Y / ~ c X c ^ and the scale is Sj = g£ 15 dence assumption makes modelling easier, it is easy to conjure up examples where the appearance, scale or location of a part are not independent. One clear example of this is that of a truck, where the location of a truck door is dependent upon the size of the wheel. Modelling these dependencies, however, is more difficult, and it is not clear that these dependecies are crucial in object classification. Although we have assumed independence between A , X , and S when h and c are given, it is not generally the case that these variables are known in object classification. In order to overcome this we can integrate h and c out of the joint distribution p(A, X , S, h, c |C) to determine p(A, X , S). So, p(A,X,S) = E / P(A ,X ,S ,h,c)cic hen ^CGC = Y . I P(A,X>S|h,c)p(h)p(c)dc (3.2) hen J c e C = £ p ( A | h ) p ( h ) / p(X|hIc)p(S|hlc)p(c)dc hen J c € C is the probability of seeing this set of features given that the object is in the image. However, the integral over C is not analytically tractable with most distributions, so we instead use a point mass estimate to approximate the integral. p(A, X , S) « V p(A|h)p(h)max p(X, S, c|h) (3.3) ' * ceC hen The intuition behind this approximation is that we find the best transformation of feature locations and scales into model coordinate space, and approximate p(X, S|h) based solely on this best transformation, rather than all possible transformations. This is accurate in the normal case when there is one transformation that dominates all others. This approximation will be discussed in detail in Section 3.3. Moreover, as with the Fergus et al. approach \H\ = 0(\T\P), so evaluating the sum in Equation 3.3 is expensive. Tractable approximations to compute p(A, X , S) are presented in Chapter 5. The-remainder of this chapter descibes the foreground and background model. . ' 3.1. Background model 1 6 © x, Figure 3.2:'The graphical model for the random variables h, A, X, S, c, and Bh 3.1 Background model Any given scene is composed of a number of objects, which belong to different object classes, and it is these scene elements, under illumination, that generate a 2D image. In a generative model approach, the primary concern is modelling of the foreground. However, in order to come up with the likelihood ratio, R, it is necessary to have some model of p(J r | - 'C) — p(A, X , Sjho), so that there is some measure as to how likely it is to see the features T in general. We assume that if a feature is generated by the background, then it is independent of other features, and that the appearance, location, and scale are also independent of one another. So, p (A,X, S|h0) becomes p(A, X, S|h0) = J] p(A / |h 0 )p(X / |h 0 )p(S / |h 0 ) (3.4) where h 0 is the matching where all features are matched to the background. Moreover, when the object is in the image it is usually the case that not all of the features in T are generated by that object. As a result, it is necessary that the distribution p(A, X, S) also models those features that are part of the background. 3.1. Background model 1 7 3.1.1 Background scale In general, a feature generated by the background could be any size. However, for most feature detectors that operate at multiple scales, the probability of a random feature / having scale of S/ is proportional to . The reason for this is that feature detectors generally construct an image pyramid, where the image at each level Sf is a Gaussian blur with o — Sf of the original image. This is then resampled according to Sf, resulting in an image ~^ the size of the original image. As a result, the number of features detected at a particular scale is ^ the original image size. If the largest scale at which a feature can be detected for an image is a, then • -p(Sf\f eB) = (3.5) This background distribution on scale differs from that of Fergus et al. who represents the background distribution simply as a uniform distribution. The motivation for the change is that feature detections are not usually generated uniformly across scales. 3.1.2 Background location In general it can be expected that a feature from the background can be found uni-formly throughout the image, so in this case the location is modelled as p(Xf\feB) = ±. (3.6) where 7 is the area of the image. This distribution is in accordance with the that chosen by Fergus et al. 3.1.3 Background appearance The appearance of the average feature returned by the feature detector is highly dependent upon both the detector and descriptor. As a result the appearance is modelled with a kernel density estimator with a Gaussian kernel. In the thesis, we primarily work with Gaussian distributions where the coovariance matrix is restricted to be diagonal. If a d dimensional random variable X is distributed according to a Gaussian distribution with mean n and standard deviation tr, ie. that X ~ A/"(/x, (tr)2), the probability of X is described as 3.2. Foreground model 1 8 G^^=wr^r{p-^) (37) The probability of the appearance given that it was generated by the background is p { A < l f B ) = \BackGround\ (3"8) where BackGround are a set of features that are extracted from a set of background images, and where <xbj is chosen experimentally by cross validation. This distribution differs from the model presented by Fergus et al, which uses a single Gaussian for all features. The motive for this change is that for high dimensional features, a single Gaussian does not provide an adequate model for the appearance of features generated by the background. This issue is a complicated one, and constructing a better model for the background distribution of appearances is left for future research. 3.2 Foreground model As mentioned previously, given a matching h and c we assume that appearance, location, and scale of the features are independent. We also assume that given h and c that all the attributes for each part are independent. In this sense, given that an object is part of a particular class, and its center c is known, then which parts are generated, where these parts are generated, and their appearance are all independent of each other. Clearly a lot of information is lost by this independence assumption. A face model that assumes such independence would probably consider a woman's face with a beard as likely as a man's face with a beard. However, in terms of object classification, these examples are pathological in the sense that real data rarely presents cases where the dependence needs to be modelled in order to do correct classification. In cases where modelling this dependence is necessary, the object class can be split into subclasses. One possible way to do this is during learning, detect when there are large segments of the training data that are not being recognized by the current model. At this point the algorithm could then learn a seperate model for this segment of the training data, thereby automatically creating a model for different subclasses. Development of a technique to do this is left for future research. 3.2. Foreground model 1 9 3.2.1 Part appearance If a part p is present in an image as feature / , then we assume that the appearance of the part Af ~ M(^pp, ( t r ^ ) 2 ) . That is that, p(Af\h(p) = / ) = G(Af\npPP, (<T;PP)2) ( 3 . 9 ) We assume that each dimension of the appearance is independent. There is no reason to believe that the dimension of each part is independent, but estimating a full covariance is prohibitely expensive. Given a matching h, we evaluate the appearance of a feature / according to the density for part p only if h(p) = f, and if / is assigned to no part then it is evaluated according the background distribution, p(Af\f £ Bh)- So, for the entire set of appearances A, p(A|h) = J] W , ( f l ) I I p ( A / | / e B ) ( 3 . 1 0 ) /|h(P)=/ /|/eSh 3.2.2 Part location If a feature / is generated by a part p, then we assume that in model coordinate space that the location of the part Y / ~ J\f(iil°c, (<rl°c)2). The i th element of al°c is referred to as cr l°f. The problem is that since X / is in image coordinate space, it must be transformed into model coordinate space. If we are given the object center and scale c, then Yf = S cXy + X c = (Yf - Xc) which implies that X ; ~ Af ( ^ ^ , ( ^ ) 2 ) - As a result p(X /|h(p)=7,c) = G X ; W T - X c ) l a loc ( 3 . 1 1 ) ( 3 . 1 2 ) ( 3 . 1 3 ) However, if / is part of the background, f € Bh, then we evaluate its location accord-ing to a background distribution. The resulting probability of seeing the locations X is therefore 3.3. Approximating object center and scale 20 p(X|h ,c ) = II G I X / p|h(p)=/ II p ( X , | / e B ) (3.14) /l/eflh 3.2.3 Part scale If h(p) = / , then the scale of part p in model coordinate space, is generated ac-cording to a Gaussian with, with mean ps^ale and standard deviation o-pCale. Since the scales S are given in image coordinates, they are transformed into model coordinates, tf = ScSf (3.15) ..scale —.scale This implies that Sf ~ J V ( \ - , ( \ - ) 2 ) -If / is part of the background, / e Sh, then we evaluate its location according to a background distribution. As a result, p (S |h ,c )= J] GlSf Plh(p)=/ \ 3.2.4 Part statistics We model the distribution p(h) as p^=w\ n WP n v-vp) (3-i?) 1 1 p|h(p)^0 p|h(p)=0 where wp is the probability that the part is present. A more complicated model is utilized in Fergus et ai, but as the number of parts increase, the number of parameters increases quadratically, which poses a problem for our approach. Implementing a more complicated model is left for further research. ..scale / fjScale\' . V , Hf" II P ( S / l / e ' B > (3-16) 3.3 Approximating object center and scale Integrating c out of the joint p(A, X , S, h, c), as mentioned previously, is not analyt-ically tractable. Instead, we make the approximation 3.3. Approximating object center and scale 21 p(A,X,S|h) « maxp(X,S,c|h) (3.18) That is, we replace the integral with a point mass estimate at the maximum, since when X and S are fixed then p(X,S,c|h) is a peaked distribution. In essence, this means that we are primarily concerned with the measurement of p(X, S, c|h) at the location where the parts are most likely to be centered, given the matching h. By making such an assumption we can find the copt that maximizes p(X, S,c|h), and use this to transform X and S in to model coordinate space. To find the maximum, notice that ' (3.19) (3.20) ( I I P( X / I / 6 W S / I / e BJ) (3.21) The second term in 3.20 is independent of c, and as a result is a constant when X and S are fixed. The first term, which we will denote with the function V(c|h, X , S) is the term that must be maximized. Taking a closer look at this function reveals that the value copt which maximizes log(V) also maximize V and thus p(X, S,c|h). To determine copt we maximize 3.4- Probability on sets 22 log(V) + + log(Sc) + constant + log(Sc) + log(Sc) (3.23) (3.22) (3.24) (3.25) (3.26) The most straightforward way to find the set of parameters copt that minimizes log(V) is to take the derivative and set it to Oand solve for the parameters. There is a closed form solution, where the solution requires solving a quadratic equation in S c, but the actual solution is omitted for brevity. The technical details addressed in this section deal with defining probabilities on sets with a variable size, and can be skipped without loss of continuity. In the model proposed by Fergus et ai, the number of features extracted from an image, k, are fixed apriori to some number, usually less than 20. That is, despite the fact that a highly textured image could contain 500 features, the image is represented by only a small subset of these features. The difficulty that this poses is that if a face appears in a very cluttered scene, then it's possible that very few of the features chosen to represent the image actually fall on the face. To overcome such difficulty, we do not fix the number of features k that represent an image. The number of features is entirely dependent upon the feature detector. However, if k varies per image, there are a number of technical details that need to be addressed. First, p^lC), as defined above is not a probability since 3.4 Probability on sets 3.4- Probability on sets 23 V f If p(A,X,S)dAdSdX = oo (3.27) That is, if we sum over all feature sets of all possible sizes, the distribution does not sum to 1, but to oo. However, this can be easily over come if we define p{F,k\C)=p(F\C)p(k) (3.28) and if we define p(k) to be uniform over [0, M], where M in this case is some large number. If we implicitly redefine p{T\C) to be p{T', k\C), then p(.F|C) is once again a proper probability. In practice p(k) can be ignored in both learning and recognition. The reason is that p(k) is a constant and thus cancels out in the likelihood ratio R and does not affect parameter estimation in learning. C h a p t e r 4 L e a r n i n g If we are given a training set D for which every image in the dataset contains an instance of the object class, we would like to learn a set of 6 for the posterior dis-tribution p(A,X,S\d). This chapter presents a method to learn these parameters quickly. The general learning algorithm is as follows 1. Extract a set of candidate parts J from the dataset 2. Initialize a small model from these candidate parts 3. For parts p — k + 1 to P (a) Maximize the parameters for the parts 1 to p using E M (b) For each potential part j G J that has not been used i . Determine the increase in the likelihood, Kj, of the data if part j is part of the model i i . Select the part j with maximum Kj and add it to the model The rest of the chapter is devoted to explaining each of these steps. The first section describes the maximum likelihood framework and the E M algorithm, and discusses why learning all P parts jointly from a random initialization is intractable. Section 4.2 describes incremental learning, its motivation and its pitfalls. The final section describes how to initialize a small model and how to select a set of candidate parts from the dataset. Readers familar with E M can skip section 4.1. 24 4-1. Parameter estimation by Expectation Maximization 25 4.1 Parameter estimation by Expectation Maximiza-tion Given a set of training data D, we'd like to determine the posterior distribution p(A,X, S|£>), which we usually denote p ( A , X , S|D), using the information in D. Ideally, we'd like to determine p(A,X,S\D) = fp(A,X,S,9\D) (4.1) Je = fp(A,X,S\9)p(9\D) (4.2) Je However, this is computationally intractable for the model outlined, so instead we approximate the posterior where the integral is estimated by a point mass, p ( A > X , S ) « p ( A , X , S | 0 ) (4.3) that is, we use a single set of parameters to evaluate p(A, X , S). Within the machine learning community, there are a variety of approaches to chose this set of parameters, and one of the most common methodologies is to select the set of parameters 9 that maximizes the likelihood of the data p(D\9). In this case, 9ML = argmax p(D\9) (4.4) o ' ' One of the pitfalls of using 9ML is that the parameters may overfit to the data. That is, the parameters may also fit not only to object class present in the images, but also to the noise or background in the data. This entails that 9ML may not generalize well for the object classification problem. Another methodology for selecting a set of model parameters is to select the maximum a posteriori (MAP) parameters 9MAP: OMAP = argmax p(9\D) (4.5) e p{D\9)p(9) , N = argmax^ ^ ; 4.6 e P{D) = argmax p(D\9)p(9) (4.7) e This set of parameters are the most likely given the data and the prior, where p(9) is a prior over the parameters. It is this prior that can be used to prevent the overfitting 4-1- Parameter estimation by Expectation Maximization 2 6 that can plague learning the maximum likelihood parameters 9 ML- The prior basically acts as a constraint over the set of parameters. For the model we are learning we found it was only necessary to impose a prior on the parameters crloc. The specification of the prior is stated in Section 4 . 1 . 4 . So we seek to find the set of parameters 9 that maximizes C(6\D) = p(D\0)p{d) ( 4 . 8 ) which is referred to as the penalized likelihood. The first thing to note is that C(9\D) is a function with many local maxima for the particular model we are interested in. As a result, any numerical technique that seeks to find the set parameters to maximize C(9\D) will settle on a set of parameters that is but a local maxima of C(9\D). One of the standard techniques to learn the parameters 0MAP is the E M algorithm [ 4 ] , which iteratively improves the parameters until a local maximum is achieved. To explain E M a few facts must first be noted. One is that maximizing C(9\D) is equivalent to maximizing l(9\D) = logC{9\D) because of the monotonicity of the logarithmic function. In addition, l(9\D) = £ l o g $ > ( . F , h | 0 ) + l o g p ( 0 ) ( 4 . 9 ) i hen = E l o s E ^ ^ ) § S ^ + l o g ^ ) ( 4 - 1 0 ) i hen ^ 1 ' ' i hen H\ \ ) = k(0\D) ( 4 . 1 2 ) where Equation 4 . 1 1 holds true by the Jensen inequality only if q sums to 1 . So, if we fix the values of the function q, then for all 9,1{9\D) > lb(9\D). Also note that if we fix q to p{h\!Fi,9), then it is easy to see in Equation 4 . 1 1 that lb{9\D) = 1{9\D). Now, on the tth iteration of E M we have parameters 9l. E M first fixes q to p (h | . F j , (9*) , so that lb(9t\D) — l(9t\D), and this is referred to as the Expectation step. On the Maximization step the parameters 9' are found that maximizes lb(9\D) are found, so now it is the case that lb{9'\D) > h^D). However, since /b(c9*|£>) = l^D), it is the case that 1{9'\D) > i(c9*|D). In other words, by increasing the .lower bound lb 4-1- Parameter estimation by Expectation Maximization 27 we also increase I. At this point the parameters of the model are updated to 9t+l = 9'', and the process is repeated. In this way the algorithm updates the parameters in such a way that it is guaranteed that l(9t+1\D) > Z(^*|D). The algorithm is as follows E M Algorithm For t = 1 to T 1. Calculate p(h\F\ 9f) (E step) 2. Calculate parameters 9t+1 that maximize 1{9\D) (M step) 4.1.1 Expectation On the expectation step of the algorithm p(h\T%, 9*) needs to be calculated for all i and h. By applying Bayes rule p(.T|h,fl«)p(h) p(A,X,S|h)p(h) .E f c 6 W P(A,X,S |h )p (h) (4.13) (4.14) (4.15) The standard method to compute the normalization term Z = ^2henp{A., X, S|h)p(h) is to first compute p(A, X, S|h)p(h) for all h G H. If the model parameters 9l are at all accurate, then for most h G Ti the value p(A,X,S|h) w 0, at least in terms of how much it contributes to Z. As a result Z can be approximated and p(h|^r, 0*) need only be calculated for a small subset of Ti.. The problem is that if the parts are initialized randomly then techniques such as A * cannot reduce the magnitude of the problem greatly. The reason for this is that there needs to be a large number of matchings h for which p(h\T, 9l) is non-zero, otherwise E M will never be able to converge onto a satisfactory solution. The primary contribution of this thesis is the development a method to overcome this very problem. If the parts of the model are initialized intelligently with a small number of parts, then learning an accurate distribution for a small number of parts is easy. By 4-1. Parameter estimation by Expectation Maximization 28 adding parts incrementally we avoid the combinatorial explosion because the number of matchings where p(A, X , S|h) actually needs to be evaluated for the Expectation step is minimized. Techniques that are used to calculate Z efficiently are discussed in Chapter 5. 4.1.2 Maximization In the maximization step of E M we need to determine 9' that maximizes lb{9\D) when q is fixed. The lower bound lb can be re-expressed as Q(0\D) lb(6\D) = logp(c9) + ^ ^ c 2 ( h | ^ ) l o g p ( ^ , h | t 9 ) i hen1 + J ] ^ g ( h | ^ ) l o g c 7 ( h | ^ ) » hen1 It is easy to see that maximizing lb is equivalent to maximizing Q[9\D) since the term H is only dependent upon the data D and not the model parameters. Q(9\D) can be rewritten as Q(9\D) = l o g p ^ + ^ ^ ^ h l ^ O o g ^ l h . ^ + logKhlc?)) (4.18) i hen = logp(0) + E E 9(h |^) ( logp(A i | h ) + logpCX'lh) + logp(§ i |h) + logp(h)) i hen In order to find the set of parameters that maximizes Q, we take the derivative with respect to the parameters, set this to 0, and solve for the parameters. Since each of the distributions are Gaussian, except for p(h), it turns out that the parameter updates for the means are basically weighted averages of the data, and that the variances are the weighted average of the variance of the data. The weights are the probability that each feature is a manifestation of a part. In order to describe the parameter updates a few terms need to be introduced. Let Hp be the set of hypotheses where for every h e Wp there exists / G J* such that h(p) = / . In other words W is the set of hypotheses that have some part that matches p. (4.16) (4.17) 4-1. Parameter estimation by Expectation Maximization 29 4.1.3 Appearance updates For part p, if we wish to obtain the parameters fipPP then we first take the derivative of Q with respect to /j,pPP Now if we set this to zero we can obtain the maximum since Q (4.19) (4.20) is a convex function. 0 = E E ' i W - A " * ) ) ^ (4-2 1) « r - E E where ZP(D) = ^hen* QO^lF1)- For the variances (o~pPP)2 we can perform a similar operation. rfo-APP E^ ^."PP ~*~ ^ ( h l ^ ) ( ^ P P U (4-23) Setting this to 0 we get 4.1.4 Location updates For a particular matching h we fix Ch before determining the parameters that max-imize Q. That is, we are going to set Y / f h = S C h X / + X C h , where Y / . h is the projection of X / into model coordinate space with matching h. Now if we note that p(Yf,h\h(p) = /) = p(X.f\h(p) = f, Ch), then we can replace the probabilities over X in Q to probabilities over Y . Now if we want to maximize Q with respect to /x' o c , then we take the derivative 4-1. Parameter estimation by Expectation Maximization 30 dnl°c d[iloc v dY.i Ehewt 9(h|.P) l o g p ( Y h ( p ) ) h | h ) (4.25) (4.26) dnlpoc = E E 9 ( h l ^ ) W ° C - y h ( P ) , h ) (4.27) Which results in the parameter update if we set the above to 0, < = E E ^ZT* ( 4 2 8 ) Now, if we want to maximize Q with respect to al°° we first have to describe the prior that we have placed on al°c. One pitfall to adding parts to the model incrementally is that in some cases there is a bias such that the parts learned earlier will have a smaller variance than parts learned later. If a new part is added to the model, the most likely centre, copt, according to the model will be determined primarily by the parts that have a small variance. As a result, new parts added to the model will have less influence on where the copt should be, and so their location will be noisier. In order to combat this bias we introduce an inverse Gamma prior on the alpoc in order to prevent variances from becoming too small. This prior has the form that p(a) oc (a)-pe~^ (4.29) We will first derive the parameter updates and then explain the intuition behind the choices for the parameters of 7 , and ft. dQ E i E h e W * <?( hl^) l o g p ( Y h { p ) ) h | h ) + l o g p ( c r ^ ) dcrlpoc dal°c (4.30) i h€Hlp aioc - • / ( < T / 0 C ) 3 aioc ( o . / o c ) a Setting this to 0 and solving for the parameter we get 4-1. Parameter estimation by Expectation Maximization 31 { < T p ) = 2^2 . Z p ( £ ) ) + p ( 4 3 2) In essence, the choice of 7 is prior belief of what the variance on the location of a part should be, and (3 is the strength of that belief. These parameters essentially add a set of (3 pseudo-data that are 7 in squared distance from the mean /J,1°C. The particular choices of these parameters were set by experimentation. 4.1.5 Scale updates Similar to the location updates, we fix Ch for h before parameter updates. So, we set £/,h = S c h S / . dQ d E i E h e W * ? ( h | ^ ) l o g p ( S 4 | h ) do-scale d(J° cale dJ2i J2h£Hi l o g P ( £ h ( p ) , h l h ) (JO-scale (4.33) (4.34) = E E ? ^ ) ( ( W b - 0 ( 4 - 3 5 ) i h€Hj, Setting this to 0 and solving we get « r = E E " { h ] z m T h ( 4 ' 3 6 ) For the variances {at)cale)'1 we can do the same, arriving at the parameter updates scale sr sr g(hiJFI)(^h(P),h - fj>spcale)2 a* =2^2^— zjD) ( 4 3 7 ) 4.1.6 Part weight updates If we seek to determine the parameters wp for p(h) take the derivative of Q with respect to wp. 4-2. Incremental learning 3 2 dQ _ ^ E i E h 6 H g ( h l ^ ) l o g P ( h ) ( 4 3 g ) dwp dwp E E mn+ E mn (4.39) » h€?i|/i(p)=0 " p hGK|/i(p)#0 p Setting this to 0 results in the parameter updates g ( h | . P ) «v = E E ^ ( ^ ) 4.2 Incremental learning Learning a model with random initialization with all P parts is generally not tractable for models with a large number of parts. The primary innovation in this thesis is the development of an incremental scheme to adding parts to the model to make learning models with a large number of parts tractable. Now if we have an accurate model with p — 1 parts, then if we add a new part, it is assumed that the parts we have learned so far will remain roughly the same. For example, if we have a model for motorbikes with parts that correspond to the front and back wheel, then adding a new part is unlikely to change the first 2 parts significantly enough that they no longer correspond to the front and back wheel. With this insight, learning a model with 3 parts, given the model with 2 parts, basically reduces to learning the parameters for the new part. The combinatorics of E M is reduced primarily because when we add a new part p we already have a good indication of which features match the first p — 1 parts. As a result, we know which h G Ti. are irrelevant, which is generally a vast majority. Moreover, we also have a good estimate for the object's centre and scale, c, which is the average object centre over all matchings c = J > ( h | A , X , S ) c h ( 4 . 4 1 ) hen To add a new part to the model we have a set of potential parts J, and for each j G J we have an estimate for the appearance distribution, that is we have mean ixfv and a variance <T"pp for the potential part. The following algorithm describes how we add a new part to the model. To set nspcale and fil°c we find the image that j 4-2. Incremental learning 33 originated from and use the image center c from that image to find an approximate location and scale in model space for the part j. We set <J1°C to be the average variance of parts thus far, and we set crspcale to be the scale of the part in model coordinate since we found that this generally varies with the scale of the part. Adding a new part 1. For each j £ J (a) Initialize a part p • Set appearance mean to fj,pPP — /x" p p and variance to a°^p = aapv • Find the image that the sample came from, determine the expected object centre, c, and set filpoc = S e X / + X s . Set the <rl°c to be the average variance of the parts learned thus far. • Using c, set fxpcale = ScSf and set aspcale = SCS/. 2. Determine the increase in the likelihood of the data with this part. That is calculate Kj = p(D\9) with this part included as part of the model. 3. Initialize the new part of the model to be the part that produced the maximum likelihood Kj Initially we only have a small number of parts in the model, and as a result the model will only be able to recognize only a fraction of the dataset. For example, if we have a model with 3 parts, for many images in the dataset these parts may note be present, or the feature detector may not have produced a feature at these parts. However, as we add new parts to the model, more and more images in the dataset will be recognized by the model. By adding a new part that produces the maximum likelihood, we usually add that part that occurs in the most images in the data set. The bias that this creates is that if the dataset is divided into distinct subsets, then it is possible that this method may learn only one subset. Dealing with this issue will be left for future work. 4-3. Intelligent initialization for models and parts 34 4.3 Intelligent initialization for models and parts In order to be able to learn the model incrementally we need a small model to start with, and we also need a set of potential parts that can be added to the model. The parts of the small model and the set of potential parts are all selected using image matching in the dataset. In image matching we match the features in one image to the features in another image based upon appearance, and remove those that don't agree geometrically. If there are enough features still left, we assume the match is correct, and we retain the features involved in that matching. The small model is chosen to be from the image that had the highest number of matches to other images in the dataset. The parts that are selected from this image are those features that were involved in the most matches to other images. To select the sample set of parts, we simply take those features that are involved in a high number of matches. A n example of this can be found in Figure 4.1 4.3.1 Image matching If two images contain the same object class, then it may be the case that they have features in common. In the image matching algorithm presented, the intent is to find those features in one image that match to the features in the other image. We obviously want to maximize the number of matches, but we also want to minimize the number of false matches since these will introduce additional noise into the initial-izations. In order to achieve this end we first match features in terms of appearance, and then find the subset of matches that agree geometrically. To better explain the procedures some additional notation must be introduced. For images i and image j, we denote a m(fl) = fj to be a match from /* G to P G where m(f) = p if (Afi - Ap)2 < (Afi - Ag)2 for all g e&. We denote Q as the set of features from image % in a match. We say that the match agrees geometrically if the best least squares linear trans-formation has an error less than ^)\Q\. The best least square linear transformation is the set of parameters (a, s) that minimizes the error In essence it means that we find the best linear transformation of feature locations from image i to the image coordinate system of j , and then measure the error by (4.42) 4-3. Intelligent initialization for models and parts 35 Figure 4.1: Each pair of images represents the final image matching, after geometric constraints have removed false matches. We want features that occur frequently in the dataset both for potential parts and for initializing a small model. In this case, those features connected by the solid lines would be good potential parts, where as those with the dotted lines occur infrequently. 4-3. Intelligent initialization for models and parts 36 squared distance from each feature to its match in the image coordinate system of j. For notational convenience we denote this error as Eg = minEg(a, s). In addition we denote Mitf as the set of features from other images that are judged to be a match to / G J-1. This is not used directly in the image matching, but is used later for initializing the model and initializing the set of potential parts. To determine an image matching between i and j 1. Find the closest matches in appearance for each feature in /* G Tl to a feature in p G TK That is find m(f) 2. Initialize Qopt = 0 3. For N iterations (a) Randomly select a set of 3 features in Fl, denote the set as Q (b) While Eg > (ie. add the best features to Q until the geometric error is too great) i . For each feature f \ let Qp = Q U fl and determine Eg i i . Select that feature fl that has minimum Eg i and add it to Q, that is g = g u /* (c) lf\gopt\<\g\ then let gopt = Q 4. If \Qopt\ > k then for each /* G gopU Mitfi = Mitfi U m( / i ) 5. If \Gopt\ > k return gopt, otherwise there is no matching. The parameter ip is set by hand, and is meant to represent how far away on avera a feature fl G g can be from m{fl) in the best linear transformation to the image coordinate space of j. In practise we set it tp to be about 0.1 times the minimum dimension of the image, assuming that the object takes up about 1/3 to 2/3 of the image. Clearly it is also possible that the matching found is a false match, and the fewer number of features in gopt, the more likely it is that we have a false match. This is where the parameter k comes in, for it determines when there is a matching. In the experiments we set k = 4, which was determined experimentally to prevent most 4-3. Intelligent initialization for models and parts 37 false matchings between images from affecting the initialization of the potential parts or initializing the small model. For values of k < 4 we found there were many more image matchings where the matching was incorrect. 4.3.2 Initializing a small model and a sample set To initialize the sample set J we match all images to each other and extract those features that were involved in the most matches to other images. By this we mean that we use those features that had a high number of both geometric and appearance matches to features in other images. Once we have these features we find the average appearance of these features across the other features they matched, and use these to initialize both /j,pPP and cr<-vpp for a potential part p. The motivation behind choosing these features is that these features are probably the most common and are likely to be part of a good model anyway. To initialize a small model with 5, we first find that image % that matched the most other images in the dataset. At this point we select the top 5 features from that image that were involved in the most matches to other images. These features are then used to initialize a small model. Similar to initializing potential parts for J% we also find the average appearance of these features, across the other features they matched in other images, and use this to initialize both the appearance /j,pPP and 0-app The reason that we choose one image and use the features only from that image is that it allows us to initialize both the appearance ^p,a<£)p, and the location fil°c, and scale [ispcale for a part p. Otherwise we would not have a coordinate system to initialize the parameters for location and scale. 4-3. Intelligent initialization for models and parts 38 Initializing a small model and a sample set J • •' 1. For each image % (or a subset) (a) Initialize Mitp = 0 for every feature fl G J-1 (b) For every other image j (or subset) determine Qopt by the matching pro-cedure described above 2. Find the r\ features that are involved in the most matches and put in set J 3. For each of fl G J, where i is the image / comes from y i M . • • Set \J?PP = — \ M ^ ~ \ — ' t n e average appearance of features in Mip • bet afi -4. Select that image i that matched the most other images 5. For 5 top features fl G .F 1 that are involved in the highest number of matches, (a) Set /j,a^p = — \ M I F ' \ — > ^ n e a v e r a g e appearance of features in Mjji, (b) Set <rapp - E / 6 ^ . / i ( A r ^ f ) 2 ( j b t o~fi - |M^.! (c) Set nfc = X / i (d) Set = V> (e) Set nsfTle = Sfi (f) Set o)?u = Sfi 6. Return these top features as the parts for a small model C h a p t e r 5 M o d e l C o m p u t a t i o n In general, calculating p(A, X , S) exactly is intractable for a large number of parts as mentioned previously. There are, however, a variety of techniques that can be utilized to approximate p(A, X, S). In general, if we have an accurate model of an object class then we expect that there is only one true matching h, and one true center c for the object in an image. As a result, there will generally be very few matchings that contribute to p(A, X , S). In practice this is also the case, and usually there is only one matching h that con-tributes significantly to p(A, X , S). So, the task in object classification is to find these few matchings. In this chapter, we outline a few techniques to find the most likely matchings. 5.1 False matches and enforcing geometry The density p(X, S|h) « max p(X, S, c|h) as defined earlier is unconstrained. How-c ever, if we impose one constraint on this probability then computation is much more tractable. The constraint is that if p(X / ,S / |h(p) = / ) < p ( X / , S / | h 0 ) (5.1) for some / and p, then we set p(A, X, S|h) = 0. What this means is that if / is not at all where part p should be according to the object model, then we discard the matching. The reason for making this restriction is two-fold. The first is that it can help reduce the amount of computation to approximate p(A, X, S) significantly. The second is that for high dimensional appearance vectors Af, the distribution for p(A/|h) can range over a much larger range of numbers than p(X, S|h), and as a result P(A|h) has a much greater influence on p(A, X, S|h) than p(X, S|h). One way to avoid false matches due to this is to enforce the constraint that for each feature / matched to a part, the feature's location should agree at least in some degree to where the part should be according to the object model. For example, consider a 39 5.2. Search space techniques 40 theoretical object model for cars, which has two parts that correspond to the wheels of the car. If an image with two wheels in random locations is given to the system, this constraint would prevent a false match. The problem with imposing such a constraint is that p(X.f, S/|h) is no longer a probability distribution. In practice, however, this constraint rarely affects the end evaluation since p(A, X , S) relies primarily upon only a few matchings, and these generally satisfy this constraint. It should be mentioned however, that this constraint would have a significant effect upon learning if the parts of the model were initialized randomly. The reason for this is that if the model is not'at all accurate, then many of the matchings that do in fact contribute to p(A, X , S) have no influence in the parameter updates because of this additional constraint. However, in the learning scheme presented, it is never the case that a part is initialized randomly, so the above approximation is reasonable. 5.2 Search space techniques Given a set of features J7, we can represent the hypothesis space Ti as a tree, where each level is a part, and nodes on level p represent a match from a feature to part p. Given this tree we can do a depth first search through the tree to determine p(A, X , S), trimming branches that are unlikely to contribute to p(A, X , S) and exploring those that are likely to contribute to this probability. Figure 5.1 illustrates the search space and the trimming techniques. One basic trimming technique is to not explore further down a branch in instances when p ( X / , S/|h(p) = / ) < p ( X / , S/lh©), which is trimming based upon geometry. Another basic trimming technique, based upon appearance, is to not explore further down a branch in instances when p(h(p) = /)p(A/ |h(p) = / ) < p ( A / ( X 7 > S / | / e B)p(h(p) = 0) (5.2) In both of these cases it is always better that the feature / be assigned to the back-ground, that is h(p) = 0. This will be explained in Section 5.2.1. The technique presented in Figure 5.2 to approximate p(A, X , S), computes p(A, X , S|h) for the top matchings h, neglecting many matches that would not contribute to p(A, X , S). In practice, computing the above approximation p(A, X , S) grows roughly linearly with the number of parts in the model. For example, for a model with 5 parts it takes much less than a second, but with a model with 25 parts, it takes between 2 and 10 seconds. There are additional techniques that can be utilized to find only the optimal matching hopt in much less than a second for even models with large parts. 5.2. Search space techniques 41 Figure 5.1: This diagram illustrates an example of the search space and the trimming techniques with a face model. A matching at a particular node is defined by the path from the root to the node, where the root assigns all parts to the background, and each node assigns a part to feature, (g) are cuts that have been made because of geometry, and X are cuts that have been made because of appearance. 5.2. Search space techniques 42 Approximating p(A, X , S) 1. Let h r be the matching where hr(p) = 0 for all parts p 2. SumAll = 0 3. For every / € T (a) SumAll — SumAll + computeChild(hr, 1, / ) 4. p(A, X , S) « SumAll Function computeChild(h,p, f) 1. Let Sumh' = 0, h ' = h, and set h'(p) — f 2. If p(h'(p) ± Q)p{Af\U{p) = f) < p(Af,Xf,Sf\f 6 B)p(h(p) = 0) return 0 (Trimming based upon appearance) 3. If p(Xf, Sf\b!) < p(Xf, S/|hg) return 0 (Trimming based upon geometry) 4. If p < P then (Descending down child) • For every / ' 6 Bw, the features matched to the background in h ' — Sum^i = Sumh/ + computeChildih!,p + 1, / ' ) 5. return p(A, X, S|h')p(h') + Sumw Figure 5.2: Pseudo algorithm for approximating p(A,X, S) 5.2. Search space techniques 43 In practice, approximating p(A, X , S) « p(h o p t)p(A, X , S | / i o p t ) is accurate enough to obtain the same recognition results as using the full technique described above. In learning, however, we need p(A, X , S) to be as accurate as possible since E M requires p (h |A ,X , S) for multiple h for parameter updates, so such additional techniques are of little use. In Section 5.3 we discuss how to reuse the search tree from previous iterations of E M to reduce computation. 5.2.1 Trimming the search space If we have a node n at level p, we know the matching h n specified at this node. If p(X, S|h n ) < p(X, S|h.0), we know that for each matching h further down the tree that it is always the case that p(X h ( p ) , Sh( p)|h n) < p(Xy, S/|h0) for some p. So, according to our constraint outlined in Section 5.1, all matchings that stem from this node do not contribute to p(A, X , S) and thus the subtree does not need to be explored. To see this, note that when p(X, S|h n ) < p(X, S|ri0), there must be at least one part p, where h n(p) = / , such that p ( X / , S / | h „ ) < p(X/,S/|h.0). For the matching h,j, where K is the set of features assigned to parts, the C h n that determines where X x and Sx are transformed into model coordinate space is the best transformation for these features, that is p(Xx;,Sx;|h n) « p ( X x : , S x | h n , c o p i ) (5.3) > p ( X x ; , S J C | h i , c ) V c (5.4) For each matching h below node n, it is the case that the matchings for the features in JC are the same as in h n . As a result, p(Xjc, Sjc|h n) > p (X^ , S^|h), and so it is the case that there exists a feature / £ /C such that p ( X / , S/|h) < p ( X / , S/|hg). Another technique to trim the subtree at a node n is to evaluate the appearance of the feature / given that it is matched to a part p and compare it to the likelihood of the appearance given that the feature is generated by the background. If it is a lot more likely that the feature is generated by the background than by the part, then we do not need to explore the subtree below the node. In particular if p(h(p) = /)p(A / |h(p) = / ) < p(Af, X / , S,\f E £)p(h(p) = 0) (5.5) then the subtree can in practice be safely be removed. The reason for this is that for any h where h(p) = / , p (X/ ,S / |h ) < 1, since we are using Gaussians and the 5.3. Search space techniques for EM 44 variances are never small enough to produce probabilities greater than 1, and as a result p(h(p) = f)P(Af\h(p) = /) < p ( A / > X / s St\f G B)p(h(p) = 0) op (h (p) = / ) p ( A / ) X / , S / | h ( p ) = / ) < p(A / l X / ) S / | /6B)p(h(p)=0)(5.6) This basically means that it is better that the feature be matched to the background in all matchings h where h(p) = / . . ' . ,. ' 5.3 Search space techniques for EM In computing the parameters 9 for the model, we repeatedly search through the hypothesis space Ti for matches that contribute to p(A, X , S ) . If we have p — 1 parts in the model, then if we run E M on the parameters then E M will converge to some local maximum after some number of iterations. At this point there is very little change in the parameters, so the matchings that are found to contribute to p(A, X, S) on each iteration will essentially be the same. Moreover, when we add a new part p to the model, we don't expect that the matchings for parts p — 1 to change a great deal. Given this, we can utilize information from previous iterations of E M in order to reduce the search space. Let 4>vj be a boolean variable, where 4>pj — TRUE indicates that there still re-mains a matching h where h(p) = / and p(A, X, S|h) contributes to p(A, X , S). cf>pj = FALSE indicates that / definitely does not match p. We can use this in the approximating technique in Section 5.2 by checking that <f)pj — TRUE in computeChild(h,p, f) and returning 0 immediately if it's false. From previous iterations, we have determined p(A, X, S|h) for a set of matches h G W that contributed to p(A, X , S). Let * P i / = J2h\hew & h( P )=/P( h )K A > X, S|h), that is, the total contribution of matching / to p to p ( A , X , S). Now, if the contribution of matching p to / is sufficiently small then it's unlikely that this match is ever going to contribute to p(A, X, S). In particular if p J <A (5.7) p ( A , X , S ) then we set 4>pj = FALSE, that is that in future iterations we do not need to evaluate an matches h where h(p) = / . This check can be performed at the end of every iteration. 5.3. Search space techniques for EM 45 O Matchings where h ( l )= f t don' t contribute to p ( A , X , S ) Match ings where h(2)=f 2 don' t contribute to p ( A , X , S ) Hypothesis Tree after i+1 iterations o f E M Figure 5.3: This diagram illustrates how we use information from a previous iteration of E M to trim the search space in order to approximate p(A, X , S) 5.3. Search space techniques for EM 46 The question is, how do we set A? With small models there really is no need to utilize previous iterations to reduce computation. However, for models with more than 10 parts, even the approximation scheme in the previous Section 5.2.1 is costly to compute for some model classes, like motorbikes. The key insight to setting A is that p(A, X , S) is roughly proportional to the number of matches in the optimal match hopt for an image. If there are a high number of matches in h o p t , then it is assumed that the matches from parts to features are correct, and that for any additional parts added to the model, these parts will continue to be matched to the same features in the dominant matching. Clearly we want to reduce computation as much as possible, but we don't want to prematurely exclude possible matches of parts to features since the parameters for parts change on every iteration. So, we set A to be the average error of optimal matches that contain 4 matches. The intuition is that if the optimal matching hopt has 6 matches, then matchings h with only two matches are unlikely to ever contribute to p(A, X , S) since we already have a very good matching of parts to features in hopt. In practice it was found that this setting produced the same results as learning without making any use of previous iterations to reduce the search space, except that we could learn large models of 25 parts in 25 minutes instead of 2 or 3 hours. C h a p t e r 6 E x p e r i m e n t s In order to validate the learning method and the models learned, a variety of ex-periments have been conducted on 4 different datasets. The first few sections of the chapter describe some parameter choices and the datasets,' while later sections describe actual results. 6.1 Feature detectors and descriptors Up until this point we have not restricted the model to any particular type of feature detector or descriptor. At this point we ask, what regions of the image are interesting? There are a huge variety of interest point detectors in the computer vision literature, some examples being [6] [11] [9]. However, there are a number of properties that are required for robust general object classification that narrows the number of choices. The first is that the feature detector should be scale and rotation invariant, that is, the same region will be chosen as a feature if the image is scaled or rotated. The particular feature detector we chose for our experiments is that of Lowe [9] , where interest points in this case are located at peaks in the difference of Gaussian function convolved with the image in scale space. In other words, they are located in regions and at scales where there are large gradients in all directions when compared to neighbouring scales and locations. By selecting such regions we are guaranteed to select highly textured regions of the image. Aside from determining which regions of the image are interesting, there is also the question of how to represent this region. As a first step the region should be blurred and resampled according to the scale so that all feature descriptors are of the same size. The simplest approach from this point would be to simply use the resampled region, but this usually produces a very high dimensional vector. The particular approach that Fergus et al. take is to first apply P C A , and represent the region as the first 10-15 principle components. There have also been a number of feature representations that have been developed by the computer vision community that are invariant to rotation, scale, and brightness 47 6.2. Experimental setup 48 Image gradients Keypoint descriptor Figure 6.1: Example of a feature descriptor when the image region is divided into 2x2 patches, with the histogram of image gradients in each patch having K = 8 bins. changes, and are less sensitive to changes in 3D viewpoint and noise. The particular approach adopted in this thesis is to use the approach developed by Lowe [9] in order to determine feature vectors A / . In this approach an image region is first blurred and resampled at the appropriate scale and then transformed into a patch of gradients. From here, the region is divided into mxm patches. Each patch is then described as a histogram of image gradients, with K bins. See Figure 6.1 for an example. The advantages of such a representation are that it is invariant to brightness changes, scale, and somewhat invariant to affine distortions. There is the option of making the feature rotationally invariant, but we have dropped this extension for simplicity. We have chosen to divide the image into a 3x3 patch, m = 3, and to have 4 gradient bins per patch, K = 4. This results in feature vectors A / that are 36 dimensional. We have experimented with other parameter choices but these choices tended to produce good results in a reasonable amount of time. 6.2 Experimental setup Each dataset was split into a test dataset and a training dataset. The training dataset was used in learning an object model and the test dataset was reserved for testing the model. Preprocessing for each of the datasets consisted primarily of extracting features, where an image usually had anywhere from 50 features to 700 features detected. This is very much in contrast to the method proposed by Fergus et ai, where the the maximum number of features was 20. Despite this difference, training 6.2. Experimental setup 49 our models were an order of magnitude faster. For each object class model, a set J of 100 potential parts, and a small initial model with 4 parts are extracted from the training data. Following this preprocessing step, the iterative learning procedure is then used to add parts to the model until the model contains P parts. 6.2.1 Datasets There are 4 different datasets from Fergus et al. on which the method presented in the thesis is validated: frontal views of faces, side views of cars, rear views of cars, and side views of motorbikes. The face dataset contains 440 views of faces, mostly of the same scale, similar lighting conditions, and varying backgrounds. The dataset only consists of 27 distinct individuals. In learning a model on this dataset, the dataset was randomly split into 300 training images, and 140 test images. This split was sometimes done numerous times to obtain averages. The motorbike dataset contains 826 images of side views of motor bikes, with varying scale, similar lighting conditions, and varying backgrounds, although most were against a light-coloured background. The test sets consisted of 200 images while the training sets consisted of the remaining 626 images The rear view car dataset contains 1155 images of rear views of cars, with varying scale, varying lighting conditions, and varying backgrounds, although all were taken on roads. The test sets consisted of 200 images and the training sets consisted of the remaining 955 images. The main side view car dataset contains 540 images of various cars viewed from side. The scale of the cars were identical, but the background varied. The test sets consisted of 140 images and the training sets consisted of the remaining 400 images. However, for comparison with results presented in Fergus et al. there is also an additional test set that contains test images of cars, sometimes more than one. The task with this dataset is to locate the object class, not only classify the image as a car. There are also two background datasets. The first contains 900 images of various sizes of various scenes. The second set contains 1155 images of various sizes of road scenes without the presences of a car. This is used for recognition results related to the dataset with cars viewed from the rear. In both cases, these datasets were used to compute the false positive rate, which is the percentage of the time that the object 6.3. Computation time required to learn object models 50 recognition system classified a background image as being part of the object class. 6.3 Computation time required to learn object models One of primary contributions of this thesis is a method to learn a part based model with a large number of parts quickly. In Fergus et al., models of 6 or 7 parts take 24 to 36 hours to train, and the time to learn the parameters for this method is generally exponential in the number of parts P, so larger models are not feasible. With the learning procedure described in this thesis, the amount of computation time to learn an object model is roughly linear in the number of parts. The learning is composed of two stages, a preprocessing step that initializes a small model and a set J of potential parts, and the incremental learning that adds these parts to the model. It generally takes about 10 minutes for the preprocessing step, but it is not optimized, so this compute time can be reduced. In general it takes about 1 to 3 minutes to add a new part to the model and run E M til l convergence on the parameters. Learning an object model of 25 parts usually takes 30 minutes or less. There are a variety of measures that can be utilized to measure how effective an object model is as an object classifier. Given that we have learned parameters for p(A, X , S|#), we can use the ratio R = p(A, X , S|#)/p(A, X , Sjh©) to recognize an image. If R > u>, where u> is some threshold, then we classify the image as having the object class. If R < u> then the image does not contain an instance of the object class. One of the most useful measures for evaluating an object recognition system is the Receiver-Operator curve (ROC), which measures how the true positive rate TP^ varies with the false positive rate FPW when u is varied. A true positive is when an image containing the object is identified as containing the object class. A false positive, on the other hand, is when an image is identified as containing the object class when it does not. The true positive rate is defined as 6.4 Results TP„ = Number of true positives (6.1) Number of images containing the object class whereas the false positive rate is defined as 6.4- Results 51 Number of false positives . w Number of images not containing the object class It it fairly common to compare different object classifications systems by utilizing the R O C equal error rate, which is defined to be TP^ = 1 — FP^ which is determined by varying the threshold u. This measurement is meant to convey the tradeoff be-tween classifying images containing the object class correctly and falsely classifying images that do not contain the image class as containing the image class. 6.4.1 Comparison to previous methods Our comparisons are restricted to comparing our results to the results presented by Fergus et al, since few approaches achieve the same amount of success on the datasets we use to validate our approach. The results that they present are primarily ROC equal error rates, with the exception of the dataset of cars viewed from the side. With the dataset of cars viewed from the side, they utilize a recall-precision curve (RPC), which is similar to the R O C curve. The task, however, is not to classify an image but to locate the object class in the image. So, in order to utilize our model to accomplish this we output all distinct matchings h and the associated object centers copt where p(A, X , S|h) > u. A true positive is defined to be the case when c occurs within some r from an instance of the object class, and a false positive is when c does not occur within r of an instance of an object class. Recall^ is similar to the true positive rate, ^ „ Number of true positives Recall^ = —— -f- (6.3) Number of instances of the object class The precision is defined to be _ . . Number of true positives Precision — (6 4) w Number of false positives + true positives So, given this experimental setup the result reported for the dataset of cars viewed from the side is recall-precision equal error, which is the value of Recall^ when Recall^ — 1 — Precision^. More information on this type of experiment can be found in [1] By the results in Table 6.1, it is unclear whether the approach proposed in this thesis is any better at object recognition than the one presented by Fergus et al. However, this could very be well an artifact of the datasets. For the datasets in 6.4- Results 52 Dataset Ours Fergus et al. Motorbikes Faces Cars (Side) Cars (Rear) 90.2 (20 part model) 94.7 (9 part model) 90.5 (6 part model) 96.2 (20 part model) 92.5 (6 part model) 96.4 (6 part model) 88.5 (6 part model) 90.3 (6 part model) Table 6.1: ROC equal error results comparing our results to that of Fergus et al. question, it turns out that most of the object classes can be recognized with just a few parts. For example, we found that a two part model with the two eyes was able to produce an R O C equal error rate of 0.8. As a result, learning models with more parts does not dramatically improve recognition results. This issue will be further discussed in the next section. 6.4.2 Incremental learning The main contribution of this thesis is an approach to learning a part-based object recognition that is able to learn a large number of parts in a reasonable amount of time. To test the effectiveness of the learning approach, we trained models of varying sizes for each object class to determine the improvement or lack of improvement in object recognition results. The ROC curves in Figure 6.2 demonstrate that learning models with more object parts definitely improves results, although the improvement in some cases is not large. These R O C curves are the average obtained over 10 different splits of the dataset into testing set and training set. Perhaps an easier way to assess how much having larger models improves recog-nition results is to look at the equal error rate for models of various sizes. In Figure 6.3 we present the average ROC equal error rate for models of various sizes. The first thing to note is that for all of the data sets the recognition results tend to peak at some point and performance tends to degrade or plateau as additional parts are added to the model. The explanation for this is somewhat complicated but enlightening. A n inspection of the datasets reveals that there is a portion of the each dataset that is quite different in character from the average image in the dataset. For example, in the face dataset, about 5 to 10 percent of the images contain an image of a face taken under lighting conditions that differ considerably from the rest of the dataset (see Figure 6.4). These faces appear much darker, and as a result there is very little contrast, which results in fewer feature detections on the face. In addition, the appearance of these features 6.4- Results 53 are significantly different than features in similar regions on the face in the average image in the dataset. Now, in our initializing procedure, we select a sample set J of potential parts, where a feature can act as a potential part if it is involved in lots of image matches. As a result few potential parts would come from these images, and as a result the learning method would have a bias and thus neglect this segment of the dataset, A similar analysis can be applied to the motorbike dataset. In this case a majority of the motorbikes are against a light background. As a result, many of the parts the learning algorithm learned were on the outside of the motorbike, so they included the lighter blackground. However, for a small portion of the dataset the background is quite dark, and as a result these images are hard to recognize using Motorb ikes Faces 0.8 0.6 tr CD > O o- 0.4 CD 0.2 .. f. .f\. 1 / V / 5 part mode l — 10 part mode l 2 0 part mode l 0.2 0.4 0.6 0.8 False Posi t ive Rate 0.8 rr g 0.6 '55 o Q- 0.4 CD 0.2 ^ — — 5 part mode l 10 part mode l 2 0 part mode l 0.2 0.4 0.6 0.8 Fa lse Posi t ive Rate C a r s (side) C a r s (rear v iew) 03 0.8 tr CD > 0.6 '55 o Cu 0.4 CD 3 I— 0.2 0 JJ tsL—--— 5 part mode l — 8 part m o d e l — 15 part mode l 0.2 0.4 0.6 0.8 False Posi t ive Rate 1 B 0.8 « tr CD > 0.6 '55 o CL 0.4 CD H 0.2 0 5 part mode l 10 part mode l 2 0 part mode l 0.2 0.4 0.6 0.8 Fa lse Posi t ive Rate Figure 6.2: These R O C curves demonstrate that as the number of parts increase the R O C curve improves. 6.4- Results 54 Figure 6.3: This figure outlines how the R O C equal error rate improves as more parts are added to the model. Note that for each model the curve seems to plateau, indicating that results don't improve with larger models. This is somewhat due to the nature of the datasets. these parts. Moreover, since these images are only a small portion of the dataset, very few parts in the potential part set J would come from these images. As a result, few parts would ever be added to the motor bike model that could help in modelling these darker images. 6.4.3 Learned models In Figures 6.5,6.7,6.6, 6.9, 6.8 are visualizations of the models that were learned by the approach outlined in this thesis. In each figure, the model is split into two parts, 6.4- Results 55 Figure 6.4: The first are images that are not recognized, where as the second column are easily recognized. The features in the optimal match are highlighted. Notice how in the first column the images are substantially different in character from those in the other column, indicating why they may be difficult to recognize. 6.4- Results 56 the appearance model and the location of the parts model. In the appearance section, for each part, is the blurred image of the top four features from the dataset that match the part based solely on appearance. The blurring is proportional to the scale of the feature, which is performed on the original image patch before encoding the feature descriptor A / . The 5th image patch is an example of a feature without the blurring. This is meant to illustrate which part of the object it comes from if it is not clear from the blurred image patches. Also with each part is the weight wp for the part p, which is roughly the proportion of the dataset that this part appeared in in the training data. Also included is the AppVar which is E t <7> the sum of the standard deviations over all dimensions of the appearance for the part p. This is meant to illustrate how wide the variance is for the appearance of each part. In the locations section is displayed various ellipses, where the centre of each ellipse corresponds to part p's location in model coordinate space. The x and y radius are 2a l° c. Each ellipse is connected to a line that points to some image patch, which is an example of that feature from the dataset. The scale of the feature is represented by the scale of the image patch. As can be seen in the figures, some interesting properties are picked up by the model. Of particular interest is the model of motorbikes, which are primarily de-scribed by the rounded curves in a particular geometric configuration, so it appears that it is modelling the wheels primarily. With the motorbikes it was generally the case that most of the parts were concentrated on the front wheel, even for large mod-els. A partial explanation for this is that in the initialization procedure, the initial small model tended to be composed of parts from the front wheel. As a result, when we trying to add new parts to the model, there is a bias towards parts being somewhat close to the parts already in the model. The reason for this is that if a majority of matches in h in an image are in one particular area, then these tend to dictate where the center copt is. As a result, the location for any potential part that is far from the parts in the model thus far will appear to be more noisey, and so the increase in the likelihood when this part is added to the model will be less than it would have been if it were closer to the other parts already in the image. As a result, parts that are closer to the parts already learned for the model are preferred by the incremental learning procedure. Another interesting thing to note is that many of the parts that were learned for these models were on the edge of the object in the image. This means that it is picking up the shape of the object, however, this relies on the background being a 6.4- Results 57 particular type. For example, the cars viewed from the rear tended to be against a light grey background, usually a road scene, so the parts that included part of the background will not be detected in instances where a car is on a very dark or very light background. However, this is not a problem of the approach, but of the dataset. Clearly if the dataset has a regularity, then an unsupervised approach to learning parts will pick up on this. In order to deal with this problem, datasets should include objects against a great variety of background scenes so that parts that are learned do not rely on the fact that the object class is against the same type of background as in the training data. 6.4- Results 58 Part 1: w = 0.52 AppVar=1206 Part 2: w = 0.32 AppVar=889 ^^ jjpjj| ^ piRlj ^^(P Part 3: w = 0.55 AppVar=l 139 Part 6: w = 0.43 AppVar=1023 /^r- Cr Part 7: w = 0.72 AppVar=l 195 Part 8: w = 0.43 AppVar=1044 4 ^ ^ ^ ^ Part 4: w = 0.42 AppVar=l 156 Part 5: w = 0.46 AppVar=l 101 Part 9: w = 0.55 AppVar=l 124 Part 10: w = 0.31 AppVar=871 Figure 6.5: A 10 part model for motorbikes 6.4- Results 59 Part 1: w = 0.61 AppVar=913 Part 6: w = 0.44 AppVar=325 Part 7: w = 0.69 AppVar=1251 Part 1 Figure 6.6: A 1 0 part model for cars viewed from the rear Part 1: w = 0.67 AppVar=867 ^Sm, j & k J | H H ^ 0 0 0 9 0 Part 2: w = 0.70 AppVar=966 € % O O f 1 ^ Part 3: w = 0.59 AppVar= 1240 | H | E§| J||p Part 4: w = 0.60 AppVar=1050 ^^^flHkk. jj^ g^ j^^SSSfc Part 5: w = 0.66 AppVar=1065 ^Hk Part 6: w = 0.66 AppVar=1065 .JJS J * * *SSfci W Part 7: w = 0.43 AppVar=909 Part 8: w = 0.41 AppVar=1046 Part 9: w = 0.66 AppVar=1057 J L - d t j Part 10: w = 0.48 AppVar= 114 ^PHW^ V M n ^ ^ ^ ^ ^ Figure 6.7: A 10 part model for cars viewed from the side Figure 6.8: The locations for the parts in a 25 part model for faces 6 Z L Part 3: w = 0.83 AppVar=738 r ^Br Part 4: w = 0.66 AppVar=1059 Part 1: w = 0.86 AppVar=733 • ••• Part 2: w = 0.42 AppVar=639 t < f * Q Part 3: Part 4: Part 5: w = 0.62 AppVar= • • € r Part 13: w = 0.55 AppVar=912 • Part 6: w = 0.62 AppVar=802 Part 7: w = 0.48 AppVar=706 - 0 Part 8: w = 0.69 AppVar=943 Part 9: w = 0.15 AppVar=1013 Part 10: w = 0.43 AppVar=697 Part 11: w = 0.75 AppVar=l 119 Part 12: w = 0.65 AppVar=976 Part 14: w = 0.53 App ••• Part 15: w = 0.48 App • • • Part 22: w = 0.47 AppVar=842 Part 23: w = 0.51 AppVar=958 Part 24: w = 0.48 AppVar=924 Part 25: w = 0.31 AppVar=600 •••• Figure 6.9: The appearance for parts in 25 part model for faces C h a p t e r 7 C o n c l u s i o n a n d F u t u r e W o r k There has been a great deal of attention focused on part-based approaches to ob-ject classification in recent research in computer vision, and some approaches have achieved a surprising amount of success. However, learning models with a large num-ber of parts has been a particular challenge. One of the most successful approaches is that of Fergus et al, who have developed a generative model for recognition that achieves excellent results on a variety of datasets, including cars, motorbikes, cats, faces, and air planes. The learning method that they present to learn the parame-ters for the model, however, requires an exponential amount of time to train as the number of parts increase. They cite a 6 part model takes about 24 hours to train. The reason for this is that in the learning algorithm they have to evaluate nearly all possible matchings in the initial stages, and the number of matchings increases exponentially with the number of parts in the model. The primary contribution of this thesis is the extension of their generative model, and the development of a learning algorithm that can learn a large number of parts in a reasonable amount of time. In particular, we have developed an incremental learning algorithm where the model is initialized intelligently with a small number of parts, and parts are added to the model one at a time. By initializing the parameters of a small model, we don't need to evaluate a great number of matchings in order to update the parameters for the model. Moreover, by adding parts one at a time we can utilize information from previous iterations, so the number of matchings that need to be evaluated to update the parameters can be greatly reduced. The end result, as demonstrated in Chapter 6, is that we are able to learn models with a large number of parts, over 25, in much less than an hour. Moreover, there is no reason to suggest that larger models cannot be learned. The recognition results, however, do not directly indicate that results improve a great deal with models with more parts. However, with the datasets on which we evaluated our approach, it turns out that a majority of the dataset can be recognized with only a few parts, and that the remaining are extremely difficult to recognize due to properties in these images that weren't present in a majority of the dataset. 63 64 As a result, recognition results, like the R O C equal error rate are not necessarily a good indicator of the power of our method. In future we.would like to evaluate our approach on more natural datasets, since this is more likely to reveal the advantage of having models with a large number of parts. The learning method presented, however, has a number of pitfalls and interesting extensions that we would like to address in future work. The first is the improvement of the procedure for initialization of a small model and for the initialization of a sample set. One improvement is to remove the bias towards selecting parts that are close together in an object, for both the small model and for the sample set of potential parts. The effect of this bias can be see in the model for motorbikes where a majority of the parts learned were near the front wheel. Another interesting improvement would be to be able to automatically split an object class into distinct subclasses in the cases where the subclasses are sufficiently distinct. For example, the class of vehicles is composed of motorbikes, cars, trains, but these classes do not share a large number of parts that would manifest themselves visually. It should be possible for the learning algorithm to detect if there is some segment of the dataset that it currently does not recognize and then try to create a separate object model for this segment. In addition, when we select a new part to add to the dataset, we select it from the set of potential parts that was obtained before learning, and we select it so that it maximizes the likelihood of the data the most. The problem with this is that the learning algorithm will build a richer model primarily for the segment of the training data that it already recognizes very well. There is no constraint that learning algorithm has to be able to recognize the training data evenly. As a result, a common result is that the model learned can recognize a majority of the dataset very well, but cannot recognize a few images of the dataset at all. A n interesting extension would be to impose a constraint, or bias the selection of a new part, so that the model can recognize each of the images in the dataset evenly. One approach to do this would be to instead select a potential part from the segment of training data that is not recognized well. It should be noted, however, that the learning method presented in this thesis is not restricted to the model presented. In general, a part-based approach has to learn the parts of the model, and learning models with a large number of parts will be difficult regardless of the model. The key insight provided by this thesis is that it is not necessary to learn all the parts at once. In fact, by beginning with a smaller model and building a more complex model slowly it is possible to arrive at a good 65 model while avoiding the computational cost that is associated with learning all of the parts at once. B i b l i o g r a p h y [1] S. Agarwal and D. Roth. Learning a sparse representation for object detection. In European Conference on Computer Vision, pages 113-1130, 2002. [2] I. Biederman. Human image understanding: Recent research and theory. Com-puter Vision, Graphics, and Image Understanding, 32:29-73, 1985. [3] M . C. Burl and P. Perona. Recognition of planar object classes. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1996. [4] A . Dempster, N . Laird, and D. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society, 39(B): 1-38, 1976. [5] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2003. [6] C. Harris and M . Stevens. A combined corner and edge detector. In Fourth Alvey Vision Conference, pages 147-151, 1988. i • [7] S. Helmer and D.G. Lowe. Object recognition with many local features. In Generative-Model Based Vision (held in conjunction with Proc. IEEE Conf. Computer Vision and Pattern Recognition), 2004. [8] D. G. Lowe. Object recognition from local scale-invariant features. In Interna-tional Conference on Computer Vision, pages 1150-1157, 1999. [9] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Interna-tional Journal of Computer Vision, 60(2):91-110, 2004. [10] G. Mayraz and G. E. Hinton. Recognizing hand-written digits using hierarchical products of experts. In Adv. in Neural Information Processessing Systems, pages 953-959, 2000. [11] K . Mikolajczyk and C. Schmid. A n affine invariant interest point detector. In European Conference on Computer Vision, pages 128-142, 2002. 66 Bibliography 67 [12] H. Murase and S. K . Nayar. Visual learning and recognition of 3d objects from appearance. International Journal of Computer Vision, 14(l):5-24, 1995. [13] A . R. Pope and D. G. Lowe. Probabilistic models of appearance for 3-d object recognition. International Journal of Computer Vision, 40(2):149-167, 2000. [14] C. Schmid and R. Mohr. Local greyvalue invariants for image retrieval. In IEEE Trans, on Pattern Analysis and Machine Learning, pages 530-534, 1997. ^ > - • [15] M . J. Swain and D. H. Ballard. Color'indexing. International Journal of Com-puter Vision, 7( l ) : l l -32, 2004. [16] M . A . Turk and A . P. Pentland. Face recognition using eigenfaces. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 586-591, 1991. [17] S. Ullman and E. Sali. Object classification using a fragment-based representa-tion. In T. Poggio S.W. Lee, H. H. Bulthoff, editor, First IEEE International Bio-logically Motivated Computer Vision Workshop (BMCV), pages 73-87. Springer, 2000. [18] P. Viola and M . Jones. Rapid object detection using a boosted cascade of simple features. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2001. [19] M . Weber, M . Welling, and P. Perona. Unsupervised learning of models for recognition. In European Conference on Computer Vision, pages 18-32, 2000.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Object recognition with many local features
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Object recognition with many local features Helmer, Scott 2004
pdf
Page Metadata
Item Metadata
Title | Object recognition with many local features |
Creator |
Helmer, Scott |
Date Issued | 2004 |
Description | There has been a great deal of attention focused on part-based approaches to object classification in recent research in computer vision, and some approaches have achieved a surprising amount of success. However, learning models with a large number of parts has been a particular challenge. One of the most successful approaches is that of Fergus et al. [5] who have developed a generative model for recognition that achieves excellent results on a variety of datasets. The learning method that they present to learn the parameters for the model, however, requires an exponential amount of time to train as the number of parts increase. The primary contribution of this thesis is the extension of their generative model, and the development of a learning algorithm that can learn a large number of parts in a reasonable amount of time. In particular, we have developed an incremental learning algorithm where the model is initialized intelligently with a small number of parts, and parts are added to the model one at a time. By taking such an approach we are able to learn models with a large number of parts in nearly a linear amount of time in the number of parts. The approach is validated on a number of datasets, including cars, motorbikes, and faces, and demonstrates excellent recognition results along with large models learned in a reasonable amount of time. |
Extent | 8197989 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-11-24 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051737 |
URI | http://hdl.handle.net/2429/15695 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2004-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2004-0484.pdf [ 7.82MB ]
- Metadata
- JSON: 831-1.0051737.json
- JSON-LD: 831-1.0051737-ld.json
- RDF/XML (Pretty): 831-1.0051737-rdf.xml
- RDF/JSON: 831-1.0051737-rdf.json
- Turtle: 831-1.0051737-turtle.txt
- N-Triples: 831-1.0051737-rdf-ntriples.txt
- Original Record: 831-1.0051737-source.json
- Full Text
- 831-1.0051737-fulltext.txt
- Citation
- 831-1.0051737.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0051737/manifest