L E A R N I N G T O R E C O G N I Z E O B J E C T S IN I M A G E S : ACQUIRING A N D USING PROBABILISTIC MODELS OF A P P E A R A N C E by ARTHUR R . POPE B . S c , T h e U n i v e r s i t y O f B r i t i s h C o l u m b i a , 1982 S . M . , H a r v a r d U n i v e r s i t y , 1989 A THESIS SUBMITTED IN PARTIAL F U L F I L L M E N T O F T H E REQUIREMENTS F O R T H E D E G R E E O F D O C T O R OF PHILOSOPHY in T H E F A C U L T Y O F G R A D U A T E STUDIES D E P A R T M E N T OF C O M P U T E R SCIENCE W e accept t h i ^ h e s i s as c o n f o r m i n g to the r e q u i r e d s t a n d a r d T H E UNIVERSITY O F BRITISH COLUMBIA N o v e m b e r 1995 © A r t h u r R . P o p e , 1995 In presenting degree this thesis in at the University of partial fulfilment British Columbia, of the requirements of department this or thesis by for scholarly his publication of this thesis or her may be representatives. It is permission Department The University of British Columbia Vancouver, Canada advanced granted for extensive by the head of understood that for financial gain shall not be allowed without permission. DE-6 (2/88) purposes an I agree that the Library shall make it freely available for reference and study. I further agree that copying for copying my or my written Abstract We present a method of recognizing three-dimensional objects in intensity images of cluttered scenes, using models learned from training images. By modeling each object with a series of views, representing each view with a large variety of features, and describing each feature probabilistically, our method can learn to recognize objects of unusual complexity. An object is modeled by a probability distribution describing the range of possible variation in the object's appearance. This distribution is organized on two levels. Large variations are handled by partitioning training images into clusters corresponding to distinctly different views of the object. Within each cluster, smaller variations are represented by distributions characterizing uncertainty in the presence, position, and measurements of various discrete features of appearance. Many types of features may be used, ranging in abstraction from segments of intensity edges to perceptual groupings and regions. Our learning method combines two activities: (a) an incremental conceptual clustering algorithm identifies groups of training images corresponding to distinct views of the object; (b) a generalization algorithm consolidates each cluster to produce a summary description characterizing its most prominent features. The method produces models that are both economical and representative, balancing these competing goals through an application of the minimum description length principle. Recognition requires matching features of a model with those of an image. Our matching method is a form of alignment: from hypothesized feature pairings it estimates a viewpoint transformation; from that it finds additional pairings, and so on. The method uses constraints based on features' positions, numeric measurements, and relations, assigning to each an importance commensurate with its uncertainty as recorded by the model. Decisions are ordered so that more certain features are paired sooner, while ambiguous choices are postponed for later resolution, when additional constraints may be known. We also describe a system implemented to test our recognition learning method. Experiments demonstrate the system's performance at tasks including learning to recognize complex objects in cluttered scenes, and learning to discriminate two quite similar objects. 11 Table of Contents Abstract n List of Tables vii List of Figures viii Acknowledgments xi Dedication xn 1 Introduction 1 1.1 Background 3 1.2 Research Goals 5 1.3 Outline of the Approach 6 1.3.1 Representation schemes 7 1.3.2 Recognition method 8 1.3.3 Learning method 9 1.3.4 Empirical evaluation 11 1.3.5 Discussion 11 1.4 Outline of the Thesis 12 2 Survey of Object Recognition 13 2.1 Introduction 13 2.2 Representation Schemes 15 2.2.1 What makes a good representation? 17 2.2.2 Locating features in three dimensions 19 iii 2.3 2.4 2.2.3 Viewpoint invariants 23 2.2.4 Common features 25 2.2.5 Organization of features 30 2.2.6 Representation of uncertainty in models 31 Recognition 34 2.3.1 Grouping features 35 2.3.2 Indexing the model database 38 2.3.3 Matching features 2.3.4 Verifying the match result : 41 51 Discussion 52 2.4.1 Scope of methods 52 2.4.2 Robustness of methods 53 2.4.3 Efficiency 54 3 Survey of Learning in Object Recognition 56 3.1 Learning Appearance Features 56 3.2 Learning an Appearance Classifier 57 3.3 Learning a Structural Model 58 3.4 Learning a Set of Characteristic Views 60 3.5 Learning a Recognition Strategy 61 3.6 Theoretical Learnability 61 4 Representation Schemes 4.1 63 Image Representation 64 4.1.1 Feature positions 65 4.1.2 Feature attributes 66 4.1.3 Feature relations 67 4.1.4 Image notation 68 iv 4.2 4.3 4.4 5 Implemented Feature Repertoire 69 4.2.1 Edge detection and curve segmentation 74 4.2.2 L-junction features . 78 4.2.3 Pairs and triples of L-junctions 80 4.2.4 Pair of parallel curve segments 80 4.2.5 Convex regions 82 Model Representation 83 4.3.1 Implications of using a multiple-view representation 83 4.3.2 Simplifying approximation of feature independence 84 4.3.3 Model view representation 85 4.3.4 Model notation 87 Coordinate Systems and Viewpoint Transformations 88 4.4.1 Similarity transformations 89 4.4.2 Affine transformations 90 4.4.3 Summary 92 Matching and Recognition Methods 93 5.1 Match Quality Measure 94 5.1.1 Conditional probability of a feature pairing 97 5.1.2 Prior probability of a feature pairing 98 5.1.3 Probability distribution over feature positions 99 5.1.4 Probability distribution over feature attributes 105 5.2 5.3 Matching Procedure 109 5.2.1 Probabilistic alignment 110 5.2.2 Estimating the viewpoint transformation 113 5.2.3 Checking consistency of candidate pairings 117 5.2.4 N-way ambiguous features 118 Curve Segment Matching 120 v 5.4 6 7 8 5.3.1 Curve segment representation 121 5.3.2 Evaluating a curve segment pairing 122 5.3.3 Updating the transformation estimate 127 Verification 128 Learning Procedure 132 6.1 Generalization Procedure 134 6.2 Clustering Procedure 138 6.2.1 Conceptual clustering 138 6.2.2 Clustering criteria 139 6.2.3 Minimum description length principle 140 6.2.4 Applying the MDL principle to object model induction 142 6.2.5 Clustering algorithm 146 Empirical Investigations 151 7.1 Illustrative Experiment 151 7.2 Classes of Similar Objects 170 7.3 Additional Experiments 177 7.3.1 Effects of feature distribution 177 7.3.2 Articulate objects 180 7.3.3 Difficult objects 180 Conclusion 184 8.1 Main Features of the Approach 184 8.2 Limitations of the Approach 185 8.3 Contributions 186 8.4 Topics for Further Research 187 Bibliography 189 vi List of Tables 4.1 Implemented feature repertoire 69 4.2 Definitions of various forms of Curve feature 78 4.3 Probability distributions estimated from model feature information 86 6.4 Example of generalizing model graph from image graphs 138 6.5 Coding scheme for object model 143 6.6 Coding scheme for unmatched portion of image graph 145 7.7 Clustering of bunny training images 157 7.8 Matches between bunny model graphs and test image graphs 164 7.9 Discrimination of clothespins . 175 7.10 Generalization among clothespins 176 7.11 Clusters of boat training images 182 vii List of Figures 1.1 Example of learning and recognition 2 1.2 Components of the learning method 10 2.3 Approximate invariance of apparent angle 25 2.4 Application-specific features 29 2.5 Graph representation of a model 31 2.6 Perceptual grouping criteria 37 4.7 Feature repertoire illustration (part 1 of 4) 70 4.8 Feature repertoire illustration (part 2 of 4) 71 4.9 Feature repertoire illustration (part 3 of 4) 72 4.10 Feature repertoire illustration (part 4 of 4) 73 4.11 Edgel grouping example 75 4.12 LJct feature 79 4.13 CrankLR, CrankRL, and Cap features 80 4.14 CapPair, Region, and Ellipse features 81 4.15 PCurves feature 81 5.16 Comparison of image and model feature positions 100 5.17 Approximation of image feature's uv distribution 101 5.18 Monte Carlo simulation of image feature's uv distribution 102 5.19 Example of locally-adaptive probability density estimate 107 5.20 Probabilistic alignment procedure 112 5.21 Consistent procedure 118 5.22 ConsistentWithParts procedure 119 viii 5.23 Curve segment matching 124 5.24 Examples of Support measure distribution 131 6.25 Example of generalizing model graph from image graphs 137 6.26 Clustering algorithm . 147 6.27 Effect of presentation order on clustering of training images 150 7.28 Bunny training images 152 7.29 Features of a bunny training image 154 7.30 Bunny training image clusters 155 7.31 Bunny model graph C 158 7.32 Bunny model graph D 159 7.33 Bunny test image 1 161 7.34 Match of bunny model graph D with test image 1 162 7.35 Match of bunny model graph D with test image 1 (cont.) 163 7.36 Match of bunny model graph N with test image 1 165 7.37 Bunny test image 2 166 7.38 Match of bunny model graph D with test image 2 167 7.39 Bunny test image 3 168 7.40 Match of model graph based on test image 1 with test image 3 169 7.41 Clothespin training images 171 7.42 Clothespin models 172 7.43 Clothespin test images 174 7.44 Shoe training images 178 7.45 Shoe model graph A 178 7.46 Shoe model graph B 179 7.47 Shoe recognition example 179 7.48 Boat training images 181 ix 7.49 Boat model graphs 7.50 Difficult objects Acknowledgments I was supervised in this work by David Lowe, and benefitted tremendously from his guidance, encouragement, and generosity. To a beginning researcher such as myself, the most useful advice is that concerning what problems are important and what solutions appear promising. David has been an invaluable source of this advice. It was he who suggested that I seek ways to improve object recognition performance through learning, and, on numerous occasions over five years, he has provided crucial suggestions to guide my efforts along fruitful avenues. The other members of my supervisory committee were James Little, Alan Mackworth, and Robert Woodham. Each has brought a unique and important perspective to my work, asked essential questions, and prompted necessary clarification of my ideas and presentation. I had many helpful discussions at UBC with fellow students Esfandiar Bandari, Jeffrey Beis, Daniel McReynolds, Richard Pollock, and Ronald Rensink, and with postdoctoral researcher Alun Evans. My work was supported by a productive laboratory environment created by Dan Razzell, Stewart Kingdon, and Rod Barman. Valerie McRae provided vital administrative support and plenty of encouragement. Harvard classes taught by James Clark, John Daugman, David Mumford and Alan Yuillefirstattracted me to the fascinatingfieldof computer vision. My wife Susan Steinbrecher, her family, and my family have provided the one thing I have relied upon most: their encouragement. I am grateful for their support, both emotional and material. I especially thank Susan for abiding my immoderate devotion to this work. My research was supported by the Institute for Robotics and Intelligent Systems (IRIS), one of the Canadian Networks of Centres of Excellence. Additional support was provided by the Natural Sciences and Engineering Research Council of Canada and the Canadian Institute for Advanced Research. xi This dissertation is dedicated to my parents, Helen and Robert Pope. Chapter 1 Introduction To recognize an object in a scene one must know something about how that object may appear. In effect, one must have an internal model of the object's form or appearance. This dissertation shows how a computer system can acquire a detailed model of an object's appearance directly from intensity images, and then use that model to recognize the object in complex scenes. Figure 1.1 illustrates the task this recognition learning system performs. One presents to the system a series of example images showing a particular object under various viewing conditions. Each of these training images is labeled with the identity of the object, nothing more. From the training images the system develops a model of the object's appearance. This training is repeated for each of the objects the system is to recognize. When then given a test image, the system can identify apparent instances of known objects in the image, reporting what objects are present, where in the image, and with what degree of certainty. The contributions of the research described here include a way of representing a model of an object's appearance as a series of probabilistic views, a method of learning such models from images, and a method of using these models to recognize objects. The research has sought to develop general methods rather than address specific needs of particular applications. Nevertheless, one can easily imagine applications in which these methods might prove important. A robot roaming in a dynamic, unknown environment could learn landmarks and other objects as they are first encountered, and later recognize them when they are encountered again. A factory robot, rather than having to be reprogrammed for each new part it is to recognize and 1 Chapter 1. Introduction 2 (<•) Figure 1.1: This example illustrates the task performed by the system described in this dissertation. The system learns a model of an object from a series of training images, including (a) and (b). With its model, the system can recognize the object in a complex scene (c), even when the object is partly occluded by others. Shown highlighted are some of the features recognized as part of the object. Chapter 1. Introduction 3 manipulate, could be taught by simply being shown the parts. And a researcher, having identified an object of interest in a handful of images, could use those as examples in instructing a computer to search among other images for occurrences of similar objects. 1.1 Background Object recognition is neither a new problem nor one that we can yet consider largely solved. Indeed, work on object recognition and related problems has proceeded at a quickening pace for the past thirty years. Yet even the best and latest object recognition systems remain capable of recognizing only simple objects under carefully controlled conditions. Why has object recognition proven to be so difficult? A detailed answer to this question will unfold in chapter 2 of this dissertation, where we shall survey object recognition methods, their strengths and limitations. For now, we need only remind ourselves that an object's appearance can have a large range of variation. It varies with changes in viewpoint, in lighting, and, if the object is flexible, with changes in shape. When the object is viewed through a camera, further variation due to optical distortion and pixel quantization is evident. Moreover, when recognizing a class of similar objects, one often finds variation among individual members. Accommodating this variation is a central problem in the design of any object recognition system, and one whose solution is likely to have implications throughout the system. The scheme used to describe image content must be stable so that small changes in a scene induce only small changes in its description. The scheme used to model an object's appearance must describe just what variations should be expected. If the system learns its models from example images, the learning procedure must generalize enough to overcome insignificant variation among the examples, yet it must also retain important distinctions needed to correctly differentiate objects. And the procedure used to recognize modeled objects in images must tolerate the degree of mismatch between model and image inevitably due to noise, clutter and occlusion. Researchers have used several strategies in their efforts to cope with the variable appearance Chapter 1. Introduction 4 of objects. These are summarized in the following paragraphs. Invariants. Recognition can employ invariant properties;—ones that vary little or not at all as viewing conditions change. For example, the shape of intensity edges outlining the boundary of an object remains largely fixed despite changes in lighting, and parallel edges on an object appear nearly parallel from a wide range of viewpoints. However, while invariant properties can provide important clues that an object is present, they cannot describe a 3-D object's appearance completely from all viewpoints. Recognition must employ other strategies as well. Explicit models. Researchers have tried to explicitly model some sources of appearance variation, particularly changes in viewpoint. A system using this common strategy is given its object models in the form of three-dimensional shape descriptions. During recognition, an object's shape description is combined with a model of the image formation process in order to predict the object's appearance and determine whether something of similar appearance can be found in an image. Thus successful recognition depends on having a good model not only of the object's shape, but also of scene illumination, surface reflectance, optical projection, and the sensor. However, because it has proven difficult to model shape and image formation comprehensively, object recognition systems using this approach have been restricted to simple objects and simple models. The object shown in figure 1.1, for example, would be far too complex for most systems of this type. Multiple views. An object can be modeled by a collection of views showing how the object appears under various conditions. The system stores all of these views, and recognizes the object in an image when it is able to match one of the stored views to some part of the image; presumably that view and the image show the object under similar, although perhaps not identical, conditions. This approach can allow one to avoid having to estimate the actual appearance of an object from unsatisfactory models of shape and image formation by having the system instead acquire its stored views directly from images. Chapter 1. Introduction 5 However, a key problem in applying this strategy has been that of determining how many views are needed to model an object and what those views should be. Noise. Any appearance variations not accounted for by invariants, explicit models, or stored views are usually lumped under the category of "noise", and recognition methods are designed to tolerate some degree of it. Researchers have often treated all features of an object's appearance as being equally affected by noise. Clearly, however, some features are more likely to be detected, can be more accurately localized, or are otherwise more stable than other features. Perhaps because these differences are difficult to quantify, researchers have often disregarded them and assumed that all features shared the same degree of uncertainty. These strategies have produced object recognition systems that are still severely limited in the variety of objects and viewing situations with which they can cope. I shall argue that these limitations are, in considerable part, due to a reliance on models that contain too little information. Models of three-dimensional structure and image formation have been too crude to adequately account for all of the phenomena important in identifying and distinguishing complex objects. Models of uncertainty have generally treated all features of an object's appearance as identical, not accounting for significant differences among them. Note, however, that any effort to adopt more informative models must also address the question of where the additional information may be obtained. 1.2 Research Goals The research described in this dissertation has been motivated by two primary and related goals. One goal has been to devise improved object recognition methods that are able to recognize more diverse objects in more complex situations. The other goal has been to devise associated methods of learning object models directly from intensity images. This work adopts the thesis that better recognition performance will come from using more informative object models, and Chapter 1. Introduction 6 the best way for a system to acquire such models is to learn them from images. The following practical concerns have guided the work: Efficiency. The processes involved in learning and recognition can potentially require large amounts of search. This work has sought highly efficient methods so that these processes can be accomplished in a matter of seconds or minutes per image. Ease of use. Various formulations of the training process are possible. For instance, one formulation would require that the system be told the pose of the object in each example image; another would require that the system be shown where certain features of the object appear in each example image. This work has sought a formulation that keeps training as simple as possible. It requires only that each example image be labeled as to the object it depicts. 1 Incremental learning. Learning could be formulated as a batch process, in which case all example images must first be acquired and then supplied collectively to the system. Or it could be formulated as an incremental process, in which case example images are supplied individually and recognition performance improves as each is incorporated by the system. Incremental learning gives the greatest flexibility. It allows a trainer to monitor recognition performance during training, discontinuing the process when it is no longer yielding improvements. It also allows an object recognition system to perform learning as a continuous background process, using its recognition results to refine its object models. This work has sought learning methods that are incremental. 1.3 Outline of the Approach This dissertation adopts the view that an object's varying appearance can be described by a probability distribution over a space of possible appearances. The true distribution for any Should it be required that the system report the pose or other parameters of an object it has recognized, those parameters would also have to be supplied during training. Learning and estimation of such parameters is beyond the scope of this dissertation. 1 Chapter 1. Introduction 7 particular object is unknown and—for practical purposes—probably unknowable. However, the training images are assumed to constitute a representative sampling of that distribution, drawn at random from it. In learning a model, the system uses the training images to estimate an approximation of the unknown distribution. In recognizing the object, it identifies a subset of the image that, according to the estimated distribution, has a high probability of being an appearance of the object. 1.3.1 Representation schemes Images are described in terms of discrete properties called features. Each has a particular type, a measured position within the image, and perhaps other numeric attributes that further characterize it. A feature may, for example, be a segment of intensity edge, a particular arrangement of such segments, or a region of uniform texture or colour. This dissertation shows how to represent a certain set of intensity edge features that are useful for describing many objects. Moreover, the approach allows this set to be extended, according to the specific needs of any particular application, to include various types of features covering a wide range of abstraction, abundance, complexity, and scale. Whatever features are provided, the approach can determine which features are best for recognizing any particular object in the course of learning a model for that object. An object model, describing the range of appearance an object can assume, is organized on two levels. Large variations in appearance are handled by subdividing the entire range of variation into discrete subranges corresponding to distinctly different views of the object. Within each of these model views, smaller variations are described by probability distributions that characterize the presence, position, and attributes of individual features. These distributions are estimated using position and attribute samples obtained from features found in the training images. This approach improves on the strategies outlined in the previous section. Explicit models of 3-D shape and image formation are avoided. Instead, multiple-view models are used that can Chapter 1. Introduction 8 accurately describe the appearance of complex objects, including articulate or flexible objects. Moreover, individual views are highly informative: Each describes not just a single appearance of the object, but rather a neighbourhood of similar appearances. This neighbourhood is described in considerable detail by according each feature its own uncertainty information rather than requiring that all features share a common, nonspecific "noise" distribution. 1.3.2 Recognition method An object is recognized in an image by finding a match between a portion of the image and a portion of one of the object's model views. Specifically, the match consists of a pairing of image and model features, plus a viewpoint transformation—a 2-D rotation, scaling, and translation— that brings those features into close alignment. Of all the possible matches, we wish to find one that pairs as many features as possible, pairs features that are as similar as possible, and aligns those paired features as closely as possible. To make these goals precise, a match quality measure is defined. Using individual feature distributions recorded by the model and background statistics compiled from a large collection of general images, the measure evaluates the significance of each of the match's feature pairings. For example, particularly high significance is accorded a pairing involving a feature that, according to the model, is often seen when the object is present, yet, according to the background statistics, is seldom seen in general. Evaluations of each of a match's feature pairings are combined to score the overall match. A matching procedure finds matches that, in terms of the match quality measure, are nearly optimal. It uses an alignment process that starts with an initial, hypothesized pairing between model and image features. From this pairing, a viewpoint transformation aligning the image with the model is estimated. That estimate allows additional pairings to be identified, ranked, adopted, and used to refine the transformation estimate in an iterative process that continues until all valid pairings have been found. Pairings involving model features whose positions, as described by the model, are especially uncertain are postponed until late in the alignment Chapter 1. Introduction 9 process, when those pairings can be most effectively constrained by the less-ambiguous pairings adopted earlier. Matching is particularly efficient because of the way the match quality measure and viewpoint transformation have been formulated. During alignment, the viewpoint transformation needed to align paired features in a near-optimal manner can be estimated quickly using a linear, least squares estimator. The calculations required to evaluate potential feature pairings in terms of the latest transformation estimate are also simple. And a bound on the attainable match quality measure can be computed for an ongoing evaluation, and perhaps early termination, of the alignment process. 1.3.3 Learning method The learning method is a combination of two procedures (see figure 1.2). One procedure clusters the training images into groups corresponding to distinct views of the object; the other procedure is applied to each group to construct a model view representing it. This latter procedure is called generalization because the model view it produces represents a generalization of a group of training images, accounting not just for those images of the object but also for other, similar ones. The clustering procedure must determine the number of groups to form and the assignment of training images to groups while balancing two somewhat-competing goals. We would like clustering to yield a model that represents the training images well. However, if unchecked, this goal leads to a model with numerous views, each formed from a single training image. We must balance it with a second goal, to produce a simple model comprising no more views than necessary. To balance these goals the clustering procedure employs a measure of cluster quality based on the minimum description length principle. Terms of this measure quantify and balance the model's complexity and its ability to account for the training images. As training images are received by the system, the clustering procedure assigns them to clusters so as to optimize the Chapter 1. Introduction T ^ ^ e s • 10 • • • • • • • • • Clustering Clusters Generalization Model Views Figure 1.2: The learning method is a combination of two procedures. One clusters similar training images. The other generalizes each cluster to form a model view describing a range of viewing conditions. measure. Experiments have shown that this local optimization method produces results that are quite satisfactory; in particular, the ordering of the training images has little adverse effect on the quality of the model produced. The generalization procedure forms a model view incrementally, by merging successive training images. The first training image assigned to a cluster forms the initial model view for that cluster. Subsequent training images assigned to the same cluster are matched with the cluster's model view using the matching procedure described above. Using the results of that match, the model view is then extended to encompass the additional training image. Matched model features acquire position and attribute samples from their matching image features, unmatched image features may be added to the model, and seldom-matched model features are eventually dropped. After several training images have been processed in this way, the model view nears an equilibrium: It contains the features most commonly found in a range of similar views of the object, along with representative distributions characterizing each feature. Chapter 1. Introduction 1.3.4 11 Empirical evaluation A system has been built to evaluate this approach. The system, called OLIVER, learns to recognize 3 - D objects in 2 - D intensity images, employing a repertoire of features designed primarily for describing the appearance of manufactured objects. The lowest-level features are straight, circular, and elliptical segments of intensity edges. These are augmented by features representing various perceptually-significant groupings, including junctions, pairs and triples of junctions, pairs of parallel segments, and convex regions. Within the same framework, this repertoire could be extended to allow robust recognition of a broader class of objects—for example, by including features attuned to colour and texture. 1.3.5 Discussion Several qualities of the approach should be clearly noted. 1. The models learned are models of appearance, not 3 - D shape. Indeed, because the relationship between shape and appearance is complex, this approach does well by avoiding any need to recover shape. 2. The model reflects whatever variety is presented in the training images. Consequently the recognition learning system cannot recognize an object from a side it has never seen before, nor infer new configurations of aflexibleobject from those it has seen. This confers an advantage since it means that the system is less likely to extrapolate erroneously from the training it has received. It also means that the model will incorporate any idiosyncrasies strongly represented among training images, such as the shadows beneath the object shown in figure 1.1. This quality would, in some situations, be considered an advantage. 3. For the system to be able to distinguish objects, those distinctions must be represented during training. If two similar objects are presented as distinct objects during training (with images of each distinctly labeled), the system will learn a model for each and have some basis for distinguishing them. On the other hand, if they are presented as the same Chapter 1. Introduction 12 object (with images of both identically labeled), the system will learn a single model encompassing the appearances of both and be unable to distinguish them. Again, this quality is advantageous, for it allows the system to be taught either to distinguish objects or to recognize them as belonging to a common class. 4. Training employs positive examples only: When teaching the system to recognize an object, one provides examples of how that object can appear, not, as in some types of learning, examples of how it does not appear. The approach thus avoids the difficulty of determining how to define and supply negative examples. It requires an assumption, however, that objects are well separated in the space of possible appearances. This assumption seems justified because the appearance space is enormous: its dimensions correspond to each measurement of each of numerous features. 5. This work does not address the issue of how a database of acquired models should be organized and accessed. It appears likely, however, that model database methods reported elsewhere (e.g., Beis and Lowe 1994) can be combined with the recognition and learning methods described here in order to produce a complete recognition system. 1.4 Outline of the Thesis In understanding this work it will be helpful to have some knowledge of object recognition issues and methods; a survey of that topic is provided in chapter 2. Chapter 3 surveys the particular role that machine learning techniques have played in object recognition systems. Chapters 4, 5 and 6 detail the new approach to recognition learning that constitutes the main contribution of this dissertation; the chapters cover representation schemes, the recognition method, and the learning method, respectively. Chapter 7 reports experiments performed to evaluate the approach. Chapter 8 closes the dissertation with a summary of the work and some ideas for further research. Chapter 2 Survey of Object Recognition 2.1 Introduction The object recognition problem is essentially this: Given some knowledge of how certain objects may appear, plus an image of a scene possibly containing those objects, report which objects are present in the scene and where. This chapter examines this problem and surveys approaches devised to solve it. The general object recognition problem has a variety of specific formulations. They differ principally in the following three ways: 1. Formulations differ according to what sort of knowledge is employed. We shall be concerned with model-based object recognition, in which the knowledge of an object's appearance is provided by an explicit model of its shape; with appearance-based methods that use an explicit model of an object's appearance, such as might be provided by a stored series of images; and with generic object recognition, in which the model describes a generic class of objects rather than one specific object. However, we shall stop short of considering how recognition may employ other types of knowledge, such as knowledge of the object's context (Strat and Fischler 1991), movement (Polana and Nelson 1994), or function (Stark and Bowyer 1991). 2. Formulations differ according to the class of objects recognized. Objects may be twodimensional—like symbols printed on traffic signs—or fully three-dimensional. Object 13 Chapter 2. Survey of Object Recognition 14 shape may be restricted to a simple form—e.g., polyhedra—or allowed to be more complex. Objects may be entirely rigid, composed of a few parts that can move rigidly with respect to each other ("articulate"), or capable of deforming throughout a whole range of shapes. This chapter, like our own recognition learning method, shall be concerned with all of these object classes. 3. Formulations differ according to the type of image used. Some find objects in intensity images, and others, in range images or a combination of range and intensity images. Recognizing a 3-D object in an intensity image of an unrestricted scene is, in many respects, the most difficult form of the problem: It requires dealing with the loss of depth information due to projection, with the possibility that objects may occlude each other, and with the fact that image intensity is only indirectly related to object shape. This is the form of recognition addressed by this chapter and by our own recognition learning method. Recognition is accomplished by finding a correspondence between certain features of the image and comparable features of the model. The two most important issues that a method must address are, What constitutes a feature? and, How is the correspondence found between image features and model features? Some methods use global features that summarize information about the entire visible portion of an object. Examples of such features are area, perimeter length, compactness, and Euler number. Global feature methods are popular for those applications where lighting can be precisely controlled and occlusion does not occur, but because they presuppose perfect segmentation of objects from their background and from each other, these methods do not serve well in more general situations. This entire dissertation focuses, instead, on the use of local features, such as edge segments and junctions. In discussing object recognition methods, we ought to have in mind some criteria forjudging how well a method performs. Most researchers have been primarily concerned with the following criteria, as summarized by Grimson (1990): Chapter 2. Survey of Object Recognition scope 15 What kinds of objects can be recognized, and in what kinds of scenes? robustness Does the method tolerate reasonable amounts of noise and occlusion in the scene, and does it degrade gracefully as those tolerances are exceeded? efficiency Recognition requires that an enormous space of alternatives be considered. How much time and memory are required to search that space? The rest of this chapter is in three parts. Section 2.2 is about how object models and images are represented for recognition. Section 2.3 is about the recognition process, which involves finding suitable matches between models and images. In closing, section 2.4 considers how well existing methods have performed and how they might be improved. Other object recognition surveys have been published. Extensive ones by Besl and Jain (1985) and by Chin and Dyer (1986) are now largely out-of-date, and omit research areas of recent importance such as geometric invariants and alignment matching. A survey by Brady, Nandhakumar, and Aggarwal (1989) is particularly about recognition in range images, and a recent one by Suetens, Fua, and Hanson (1992) covers a wide variety of matching techniques; however, both surveys also omit some of the most active areas of current research, including invariants, grouping, and indexing. This chapter's survey covers major issues and ideas across a broad range of object recognition methods, while paying particular attention to topics relevant to the recognition learning approach presented in later chapters. 2.2 Representation Schemes We shall survey needs and methods for two representation schemes: one to represent a model of an object, and the other, the content of an image. In any object recognition system the two schemes must be closely related to facilitate matching models and images. Ideally, individual primitives used to describe the model and those used to describe the image will be related in a straightforward way. If the object is described by a wireframe model, for example, then the image might be described in terms of intensity edge segments, each of which may be matched Chapter 2. Survey of Object Recognition 16 directly to one of the model's "wires". But often, as in this example, the primitives used for models and images have distinctly different meanings. This is particularly true of model-based recognition in intensity images; there, objects are often modeled only in terms of 3-D shape (e.g., the 3-D configuration of surfaces) while images are often described in terms of appearance (e.g., the 2-D configuration of intensity edges). Where this semantic gap exists, a recognition system must bridge it, and to do so, the system must know something about the image formation process: how shape properties are related to appearance properties. Image formation is a complex phenomenon, however, so systems that use different terms for representing models and images are generally restricted to simple primitives. Nevertheless, it is possible to represent both models and image in the same terms, allowing the two to be easily compared. Many systems that recognize 3-D objects in intensity images, including our own, do so by representing both models and images in terms of appearance; they thus avoid any need to account for image formation. This issue of whether an object should be modeled in terms of its 3-D shape or in terms of its appearance is examined more closely in section 2.2.2. As this discussion illustrates, object recognition is generally concerned with representation of both shape and appearance. This survey addresses both, and often without distinction since many of the same issues and ideas pertain to both. (Indeed, appearance can even be represented in terms of shape: the shape of 2-D intensity edges, for example.) Henceforth we shall use the term form to mean either appearance or shape. We shall follow Marr and Nishihara (1978) in their use of the terms shape, representation, and description. Shape means the geometry of a locus of points, which will typically be the points on the surface of a physical object, or the points on an intensity edge or region in an image. A shape representation is a language for describing shape or some aspects of shape. It includes a set of shape descriptions, and a mapping between shape descriptions and shapes. By appearance, we mean a pattern of image intensity; the terms appearance representation and Chapter 2. Survey of Object Recognition 17 appearance description are defined analogously. 2.2.1 What makes a good representation? A representation must be considered in light of the criteria it is meant satisfy. Those who have proposed shape and appearance representations (e.g., Marr and Nishihara 1978; Binford 1982; Brady 1983; Woodham 1987; Haralick, Mackworth, and Tanimoto 1988; Mokhtarian and Mackworth 1992) generally agree upon the following criteria as being important in many applications: scope and sensitivity The representation must be able to describe all relevant forms while preserving all important distinctions among them. stability Small changes in a form should produce small changes in the form's description. By ensuring that similar forms will have similar descriptions, this also simplifies the problem of comparing forms, as when comparing an object model with an image. efficiency There must be efficient methods of computing a description from an input image, and comparing that description with object models. uniqueness For any particular form there should be a unique description. By requiring that identical forms have identical descriptions, this criterion is meant to simplify the problem of comparing forms. These criteria lead most researchers to several common conclusions about the nature of a good representation. First, form should be described by a combination of local features, each pertaining to a specific region of an image or object. Local features can be computed relatively efficiently as each is based on a limited portion of the input data. Moreover, a description composed of local features can be relatively stable as only some of the features need be affected by any small change in appearance. And, of particular importance for object recognition, partial occlusion of an object will only partly affect recovery of a local feature description. Chapter 2. Survey of Object Recognition 18 Also for the sake of stability, a representation should describe form over a range of scales while somewhat decoupling the descriptions that pertain to different scales. Two forms that are largely similar would then have similar descriptions, even if the forms differed in smallscale details. A multiple-scale representation is often obtained by devoting features not only to various regions, but also to various levels of detail. Features are frequently classified further into various categories of shape or appearance. This allows the most appropriate representation to be chosen for each category, and it may consequently permit descriptions to be compared more efficiently. For example, features describing surface regions can be classified according to whether they describe planar, cylindrical, elliptical, or hyperbolic surfaces so that appropriate parameterization and comparison methods can be used for each. We shall later survey some commonly used categories of shape and appearance features. Altogether, discrete features are generally defined by partitioning the continuum of form along three dimensions: region, scale, and category. However, this partitioning can itself be a source of instability. For example, a surface that is classified as planar, might, with a small change of shape, be classified as cylindrical, and thus a small change in shape can produce a larger change in description. One way to reduce this instability is to allow some overlap and redundancy among the discrete features. Then a surface that lies near the threshold between planar and cylindrical would be represented by both types of features at once, and as the surface deforms to become more planar or more cylindrical, one of the two features will remain to provide some measure of stability. Note, however, that although it may provide greater stability, a redundant representation does not meet the uniqueness criterion, for various subsets of a redundant description may constitute alternate descriptions of the same form. Fortunately, the uniqueness criterion is not of great importance in the context of object recognition. Recall that the purpose of that criterion is to permit straightforward tests for form equivalence. In object recognition, however, we usually compare forms to judge not whether Chapter 2. Survey of Object Recognition 19 they are identical, but whether they are sufficiently similar, with some allowance for mismatch due to noise and occlusion. In comparing forms, then, the mismatch that may result from having alternate descriptions for a single form can often be accommodated with little additional effort. 2.2.2 Locating features in three dimensions A representation employing local features must have some means of specifying the locations of those features. In the context of object recognition, the method used to locate features of a 3-D form is particularly important, as it can largely determine how easy it is to compare a 3-D model with a 2-D image. Principally two methods have been used: one represents 3-D structure explicitly while the other represents it implicitly, using a set of 2-D views. Three-dimensional representation A 3-D representation affixes a single coordinate system to the object and uses it to locate various features. Equivalently, the object may be represented by a hierarchical arrangement of parts, each carrying its own coordinate system that is located with respect to the coordinate system of its parent part. A 3-D representation yields the most concise and precise descriptions, while providing a convenient way to describe articulate objects. However, when a 3-D representation is used to describe models for object recognition, either of two difficulties must be faced: 1. One can attempt to derive a similar, 3-D description from the image and match that description with various models. It is difficult, though, to segment objects or parts in an intensity image and fix coordinate systems to them reliably. The strategy of fixing a coordinate system to an object's centroid, for example, will be foiled whenever the object is partly occluded. 2. Alternatively, one can derive a 2-D description from the image, and use a matching procedure that accounts for the projection of a 3-D object onto a 2-D image. Because it must consider the effects of self-occlusion and perspective, this 3-D/2-D matching procedure faces a more difficult task than a 2-D/2-D or 3-D/3-D procedure. Chapter 2. Survey of Object Recognition 20 Despite these difficulties, however, both approaches have been demonstrated experimentally. For example, Dickinson and Pentland (1992) have recovered 3-D volumetric primitives from intensity images before recognizing objects in terms of those primitives, and Lowe (1987a) has matched 3-D models directly to 2-D features found in intensity images. Multiple-view representation A multiple-view representation describes an object with a set of 2-D characteristic views, each depicting the object under particular viewing conditions or a small range of conditions. The views may sample different viewpoints as well as different configurations of a flexible object. Matching is simpler than with a 3-D representation because it involves comparing descriptions that are both 2-D and may both be expressed in the same terms—there is no need to project the model during matching or otherwise account for the image formation process. Several systems have used multiple-view models to recognize 3-D objects in intensity images (e.g. Breuel 1992b; Burns and Riseman 1992; Camps, Shapiro, and Haralick 1991). 1 For an object that is even moderately complex, however, many qualitatively distinct views may be needed. Thus, a multiple-view representation requires considerably more space than a 3-D one. Space requirements can be reduced somewhat by allowing views to share common structures (Minsky 1975; Burns and Riseman 1992) and by merging similar views after discarding features too fine to be reliably discerned (Petitjean, Ponce, and Kriegman 1992). Another consequence is that there are, in effect, many more models to be considered during recognition as each characteristic view is a separate model. However, this may be more than compensated for by the fact that testing each model requires only a 2-D/2-D match rather than a 3-D/2-D or 3-D/3-D one, and 2-D/2-D matches can be performed much more quickly (Breuel 1992b). In general, an object's form is only approximated by a multiple-view description, as each Although the question of how the human visual system models objects is by no means settled, there is interesting evidence to suggest that it sometimes uses a multiple-view representation (Edelman and Bulthoff 1992; Tarr 1995; Ullman 1989). Humans are able to recognize some objects more accurately and rapidly when they are seen from familiar viewpoints, implying that those object views are readily available while others must be computed as needed. 1 Chapter 2. Survey of Object Recognition 21 view must represent an entire range of viewpoints over which the appearance of the object actually varies somewhat. As more views are used, each can cover a smaller range and do so more accurately. In this way there is a trade-off between the size of a multiple-view description and its accuracy. Breuel (1992b) has quantified this trade-off in terms of the number of views needed to represent a point feature model to within a specified accuracy: at most 900 views if each feature is required to lie within 10 percent of the object's apparent diameter. (With the approach presented in this dissertation, fewer views are needed because the features are more descriptive and matching is constrained by factors in addition to feature position.) Fortunately, when an object is viewed from a range of different viewpoints some relations among its features appear to remain nearly constant. This sort of invariance can extend the range of viewpoints covered by a single characteristic view, thus improving the trade-off between accuracy and number of views. Certain relations between lines in three dimensions, such as cotermination (the proximity of endpoints), parallelism, and collinearity, appear approximately invariant when seen from various viewpoints (Lowe 1985). Even angles between general pairs of lines (not necessarily parallel), and ratios of line lengths, are relatively stable with respect to viewpoint (Ben-Arie 1990; Burns, Weiss, and Riseman 1993). Another technique for improving the space/accuracy trade-off of a multiple-view representation is to interpolate among views. Ullman and Basri (1991) have shown that with three views of a rigid object whose contours are denned by surface tangent discontinuities, one can interpolate among the three views with a linear operation to produce a fourth view. If the object has smooth contours instead then the appearance of its rim is somewhat harder to predict; nevertheless, six views allow a good interpolation. In comparing model representation schemes we must also consider how models are to be acquired. When a 3-D model is already available—from a CAD system, for example—it can be used to build a multiple-view model by either empirical or analytic means. To build it empirically, a sphere of viewing positions about the object is sampled uniformly or stochastically. From each viewpoint the object is rendered under orthographic projection, and similar views are f •J Chapter 2. Survey of Object Recognition 22 clustered to obtain the characteristic views (e.g., Pathak and Camps 1993; Sato, Ikeuchi, and Kanade 1992; Zhang, Sullivan, and Baker 1993). To build a multiple-view model analytically, the view sphere is subdivided into regions by identifying boundaries where the object's selfocclusions begin and end. Algorithms exist for performing such analyses on restricted classes of shape, such as solids of revolution (Eggert and Bowyer 1993) and algebraic surfaces (Petitjean, Ponce, and Kriegman 1992). The regions correspond to qualitatively distinct appearances of the object, and there may be thousands of them for an object of moderate complexity. However, not all of these appearances can be distinguished in images of limited resolution, and some have vanishingly small probabilities of occurring. Thus the qualitatively distinct appearances must be reduced to a smaller, more practical set of characteristic views; methods for this are not yet evident. Alternatively, one may wish to construct an object model from intensity images depicting the object. As this dissertation shows, a multiple-view model can be built by clustering descriptions extracted from images. In contrast, building a 3-D model requires solving the apparently more difficult problem of recovering 3-D structure from intensity images. Combining 3-D and multiple-view representations Researchers has sought ways to combine some of the advantages of both 3-D and multiple-view representations. Dickinson, Pentland, and Rosenfeld (1992) first recognize generic parts by their characteristic views, and then assemble those parts into a 3-D description for comparison with object models. With this approach, a small number of characteristic views can support recognition of many different objects provided each object can be expressed as some composition of the generic parts. Zhang, Sullivan, and Baker (1993) use both 3-D and multiple-view descriptions of each object for recognition. The multiple-view descriptions are first used to find a quick, approximate match; that match is then verified using the 3-D description. Since the views are only used to suggest possible matches, this approach requires relatively few characteristic views and each view only has to contain a few prominent features. Chapter 2. Survey of Object Recognition 2.2.3 23 Viewpoint invariants Because recognition requires identifying an object under varying conditions of viewpoint and lighting, it is helpful to have features that are at least somewhat invariant with respect to changes in these conditions. This is especially true when recognizing 3-D objects in 2-D intensity images, where the effects of lighting and viewpoint are confounded. Recently, considerable effort has been directed at identifying and employing features that are completely invariant with respect to viewpoint (e.g., Mundy and Zisserman 1992; Weiss 1993). These are based on properties of geometric structures, called geometric invariants, that remain constant over an entire class of transformations. Each is defined in terms of a particular kind of geometric structure and a particular class of transformations. One example of a geometric invariant is the cross ratio. For any four distinct, collinear points A, B, C, and D, the value (AC • BD)/(AD • BC), where AC denotes the distance from A to C, etc., is unchanged by a perspective transformation, or, for that matter, by any composition of perspective transformations (i.e., a projective transformation). Thus, in pinhole camera images of a particular set of collinear points, the value of this ratio will appear constant regardless of viewpoint; moreover, a different set of points will likely produce a different value. Of most interest for object recognition, of course, are invariants for the Euclidean, affine, and projective transformations that are often used to model the image formation process. The cross ratio, like certain other geometric invariants, is computed from the coordinates of a small group of features. Choosing that group is itself a challenging problem, and it is discussed in section 2.3.1. Interesting invariants can also be computed from the coefficients of planar algebraic curves, from a set of higher-order derivatives taken at one point on a curve, or from certain combinations of feature coordinates, algebraic curve coefficients, and derivatives. However, whereas the use of coordinates entails the grouping problem, the use of algebraic curve coefficients or derivatives entails the problem of accurately estimating these quantities. Some current research is directed at finding invariants, along with associated grouping and estimation methods, that optimize the tradeoff between having to choose groups and having to estimate Chapter 2. Survey of Object Recognition 24 coefficients or derivatives (e.g., Weiss 1993). It has been shown that there is no invariant for 2-D projections of a finite, unconstrained set of 3-D points (Clemens and Jacobs 1991b; Burns, Weiss, and Riseman 1993; Moses and Ullman 1991). Consequently, attention has now focused on invariants for suitably constrained structures, particularly coplanar sets of points, lines, and curves. We shall give four examples here to illustrate the nature of this research. 1. Three coplanar, non-collinear points define two vectors, which can be used as a basis for representing the locations of other points lying in the same plane. The resulting representation is invariant to 2-D affine transformations, which can approximate perspective projections as a combination of orthographic projection and scaling (Lamdan, Schwartz, and Wolfson 1988). 2. A pair of coplanar conies yields two projective invariants. These can be used to identify objects like gaskets, which are essentially coplanar and typically composed of circles and ellipses (Forsyth et al. 1991). 3. A rotationally symmetric object has a silhouette that is essentially planar. Therefore invariants of coplanar features can be used to identify rotationally symmetric objects by their silhouettes (Forsyth et al. 1992). 4. There is an invariant of orthographic projections of four 3-D points that applies only if the four points define three orthogonal vectors. A test can be used to determine which projected sets of points meet this condition so that the invariant is only relied upon where appropriate (Wayner 1991). Another approach is to use properties that, although not truly invariant, are nearly so. BenArie (1990) and Burns, Weiss, and Riseman (1993) have developed probabilistic models that describe how certain image properties, such as junction angles and ratios of distances, vary with viewpoint. They have found that these properties are sufficiently stable over a wide enough range of viewpoints to be useful for recognition (see figure 2.3). Practically speaking, even true Chapter 2. Survey of Object Recognition 25 LOG (BETA/ALFA) (UNITS) Figure 2.3: Approximate invariance of apparent angle. These twofigureillustrate how the apparent angle between a pair of lines, as measured in an image, is often close to their true angle in 3-D space. Left: Burns'figure(reprinted from Burns, Weiss, and Riseman 1993) depicts a 90° angle viewed under orthographic projection from various 3-D positions. It appears to be within ±15° of 90° over approximately 42 percent of the viewsphere. Right: Ben-Arie's figure (reprinted from Ben-Arie 1990) plots the probability density of the log ratio of apparent angle to true angle, for 3-D angles that are acute, obtuse, and unrestricted. The peaks represent high probability that, from a randomly chosen viewpoint, the apparent angle approximates the true angle. invariants must be considered only approximately invariant because the image measurements from which they are computed can only be obtained with limited precision. Because invariants and quasi-invariants exist only for certain structures, they alone cannot completely describe objects. In a multiple-view model, however, they can extend the range of viewing conditions effectively spanned by a single view. Also, by providing clues as to what objects may be present in an image regardless of viewpoint, they can be useful in solving the indexing problem, which we shall survey in section 2.3.2. 2.2.4 Common features Almost any characteristic of form could be considered a feature, and, indeed, many alternatives have been suggested. How can we decide, then, which features are best? According to Saund (1988), features should achieve two goals. First, features should make Chapter 2. Survey of Object Recognition 26 explicit whatever information is required for the task at hand. In particular, features used for object recognition must somehow capture all distinctions needed to differentiate objects from each other and from other parts of the scene. Second, features should reflect the regularities and structures of the external world. For example, they should exploit properties that are invariant with respect to changing conditions. A common example is the intensity edge visible at a surface crease, which remains detectable across a range of illumination conditions; features based on intensity edges can exploit this illumination invariance. Both of these qualities suggest that the choice of features must depend on the nature of the application: what task is being performed, and in what environment. Saund observes that, by designing appropriate features, one is effectively embedding in a system some important knowledge about the intended application. By the same token, no feature can be considered ideal for all applications—the adequacy of each feature must be judged in light of the application for which it is being considered. We shall now briefly survey the features that are commonly used by object recognition systems. It is helpful to think of these features as being of three general types: 1. Segments and patches A 2-D curve, such as an intensity edge, can be approximated by a series of straight line segments, and a surface can be approximated by a series of polygonal faces. Although these features are convenient, they produce unstable descriptions when used to describe curved shapes. To allow stable descriptions to be produced for a larger class of shapes, including some curves, the set of features can be extended by adding higher-order analytic curves or surfaces. By adding elliptical arcs, for example, one can obtain reasonably stable descriptions of the projections of a circle. Segments of parabolas, circles, ellipses, and general conies have been used to describe 2-D curves, and spherical, cylindrical, and spline patches have been used to describe 3-D surfaces. However, because they involve more parameters, higher-order curves are more difficult to extract reliably from sensor data, and more difficult to compare. Furthermore, while they may extend the class of er 2. Survey of Object Recognition 27 shapes that can be represented well with few features, higher-order curves will still fail to yield stable descriptions of some shapes. Many systems have been demonstrated that use segments or patches as shape features, including HYPER (Ayache and Faugeras 1986), which uses polygonal approximations of 2-D curves, and 3D-POLY (Chen and Kak 1989), which uses quadric approximations of 3-D surfaces. . Parts With a limited set of parts one can construct a large variety of objects, especially if each part can be customized somewhat by choosing values for free parameters. Generalized cylinders, generalized cones, and superquadrics are three parameterized families of parts used to describe volumes. A generalized cylinder or cone is a volume swept out by running a closed planar figure along a space curve. The figure, the space curve, and the relation between them are all restricted so that just a handful of parameters are needed to determine the part's shape. Two-dimensional analogs of the generalized cylinder are ribbons, symmetric axis transforms, and smoothed local symmetries. Generalized cylinders and their two-dimensional counterparts describe only certain elongated shapes well. Systems that have used them include ACRONYM (Brooks 1981), PARVO (Bergevin and Levine 1993), and a system by Dickinson, Pentland, and Rosenfeld (1992). A superquadric is an equation of a particular form, whose solutions define a closed surface (Barr 1981; Pentland 1986). As the equation's two parameters are varied, the surface deforms through a range of shapes that includes cubes, diamonds, pyramids, and smooth, intermediate forms. Six more parameters are added to specify size along each axis, bending along two axes, and tapering along one axis, producing a family of parts more expressive than generalized cones and cylinders. Because so many parameters are involved, however, it has proven difficult to recover these superquadrics reliably, even from range images (Whaite and Ferrie 1991). Chapter 2. Survey of Object Recognition 28 While some objects can be described with elegant simplicity by parts (e.g., airplanes), many other objects admit no obvious part decomposition or parameterization (e.g., faces, mountains, shoes). It seems unlikely, therefore, that any part family could provide complete and stable descriptions of most objects. 3. Distinguished features A third approach is to describe form only in terms of a limited set of distinguished features. The curvature primal sketch (CPS) is one example of this approach (Asada and Brady 1986). It represents 2-D curves only in terms of the significant changes of curvature—including corners and inflections—that are detected at each scale. Among the motivations for the CPS is a hypothesis due to Attneave (1954) that, for the human visual system, a contour's extrema of curvature are more significant for recognition than are its other features. A second motivation for the CPS is that curvature changes provide a relatively stable description, as they remain detectable over a wide range of 2-D affine transformations. In some cases, distinguished features have been designed to represent very restricted classes of objects. To demonstrate how knowledge can be brought to bear in the design of a shape representation, Saund (1992) designed a representation for fish fin profiles using a vocabulary of approximately thirty features (see figure 2.4). Each feature captures just one fragmentary aspect of fin shape, such as the angle or curvature of a fin's leading edge. Collectively, though, the features provide a rich representation that is capable of distinguishing a wide variety of fin shapes. Some face recognition systems also use very application-specific features—for example, Gordon (1992) chose to represent faces only in terms of the geometry of selected eye, nose, and cheek features. One way in which distinguished features can be chosen for a particular application is to learn them from training images depicting the set of objects to be recognized. Segen (1989) has described a system that constructs a hierarchy of features by grouping distinctive Chapter 2. Survey of Object Recognition 29 LEADING-EDGE-ANGLE LEA DING-EDGE-RELATIVE-LENGTH TOP-CORNER-SKEW TOP-CO RNER-ROUNDNESS+FLA TOP-CORNER-FLA RE POSTERIOR-CORNER-VERTEX-ANGLE TOP-ARC-CURVATURE TOP-ARC-HEIGHT-BASE-WIDTH-RATIO BACK-EDGE-CURVATURE / ^ 11EIG1IT-MAXIMUM-WIDTH-RATIO RE /(> f~*~~5 Figure 2.4: Application-specific features. To illustrate how features can be tailored to applications, Saund designed approximately thirty descriptors for describing and distinguishing dorsal fin shape, including these ten. Each descriptor represents one shape deformation parameter. Excerpted from Saund 1992. points in training images, measuring the geometry of each group, and clustering these measurements to identify categories of common groups. This produces features corresponding to configurations of distinctive points that are particularly prevalent among the objects to be recognized. Two dimensional patterns of intensity derivatives can be used as distinguished features for describing appearance in intensity images. In an approach loosely analogous to Segen's, Weng (1993) has used unsupervised learning techniques to automatically construct a hierarchy of such iconic features from a set of intensity images. So far we have considered what types of features a representation might employ. Also important is the issue of what sizes or scales those features should have. A detail considerably smaller than the smallest feature simply cannot be represented, and a property that is only Chapter 2. Survey of Object Recognition 30 evident at a scale much larger than the largest feature cannot be represented explicitly. Consequently, whatever their types, features must be available at a range of scales if they are to make explicit all important qualities of form. 2.2.5 Organization of features There have been two common methods for organizing features into some kind of structure. The first is to arrange them hierarchically according to part/whole relations. In some part/whole hierarchies, levels of the hierarchy represent degrees of descriptive accuracy so that a single feature at one level decomposes into several finer features at the next (Marr and Nishihara 1978; Brooks 1981). In other part/whole hierarchies, levels of the hierarchy represent degrees of grouping, and primitive features occur only at the hierarchy's lowest level (Ettinger 1987). A second method of organizing features, not necessarily incompatible with the first, is to arrange them according to adjacency relations so that each feature is related to others nearby (Connell and Brady 1987; Shapiro 1980; Wong, Lu, and Rioux 1989). Both part/whole and adjacency organizations are often represented as graphs (or hypergraphs) in which nodes denote features or groups of features, and arcs denote the relations among them (see figure 2.5). An advantage of these organizations is that they provide a convenient framework for including additional information about the relative geometry of features. Arcs can be annotated with attributes that record the position of a part with respect to a whole, or the position of a feature with respect to its neighbour. For articulate objects, it is especially convenient to represent geometry this way since the configuration of each joint need be recorded in only one place. Moreover, a part/whole decomposition supports an object recognition strategy, described in section 2.3.3, whereby object parts are first recognized, and then whole objects are recognized as configurations of those parts. Chapter 2. Survey of Object Recognition 31 squat straight object straight squat piece piece PLANE have piece straight ENGINE join oblique BODY piece piece squat straight medium-long medium-long parallel^^pin straight curved \ ^ straight curved squat piece acute piece straight squat stubby snort very short Figure 2.5: Graph representation of a model. In this model description, nodes printed in bold denote a whole aircraft and its parts; arcs denote composition and adjacency relations among them. Attributes of these parts and relations are denoted by additional nodes shown in italics. Reprinted from Connell and Brady 1987. 2.2.6 Representation of uncertainty i n models Much of the discussion so far in this survey has concerned the representation of both images and object models. For models in particular, however, it is often important to be able to represent variable or uncertain forms. An object we seek to model may have a form that can vary continuously within certain limits, can assume any of a class of similar forms, or is not known exactly. Here we survey techniques for representing this uncertainty. A multiple-view representation can describe varying form quite simply: using multiple views to depict not only how the object appears from various viewpoints, but also how it appears Chapter 2. Survey of Object Recognition 32 when assuming various forms. Unfortunately, since every combination of viewpoint and form must by described separately, this approach multiplies both the complexity of the model and the work involved in acquiring it. Another approach, applicable to both 3-D and multiple-view representations, is to use a parameterized model in which free variables or quantifiers are used to specify certain measurements (e.g., Brooks 1981; Grimson 1990; Lowe 1991; Goldberg 1994). A model for pencils, for example, might describe the pencil body with a generalized cylinder whose length is a free variable; one for scissors might use a free variable to describe the rotation of each rigid blade with respect to the other. The model may include constraints limiting each parameter to a set of reasonable values ("pencil length is less than 20 cm"), and, in some cases (e.g., Brooks 1983; Goldberg 1994; Hel-Or and Werman 1994), the constraints can even specify relations among parameters ("wing width is no more than half of wing length"). Learning parameterized models from images requires identifying where parameters are needed and what their constraints and relations are; to our knowledge, no research has yet demonstrated learning of general, parameterized models, although work on motion decomposition may prove relevant to this problem. By characterizing features using probability distributions rather than specific values or hard constraints, a model can represent uncertainty about an object's form or, equivalently, a distribution of possible forms for the object. Wong and You (1985) introduced a structure they called a random graph: an attributed graph in which nodes represent features, attributes represent the properties of each feature, and a distribution characterizes each attribute. McArthur (1991) has used this structure to represent a model of a 3-D, rigid object in terms of point features whose uncertain locations are characterized by Gaussian distributions. Camps, Shapiro, and Haralick (1992) have used a similar structure to represent 2-D characteristic views, also with Gaussian distributions. Besides describing shape and perhaps its expected variation, an object model might also include information about the reliability or diagnosticity of various features and the expected Chapter 2. Survey of Object Recognition 33 cost of detecting them in images. A feature will be less useful for recognizing an object if it is not always present with that object, if the feature has a poor chance of being detected even when it is present, if it is frequently found among other objects, or if is costly to detect. A good strategy to use in recognizing an object is to first seek those features that are most reliable and least costly. Reliability can be characterized by the probability of detecting the feature when the object is present, versus the probability of detecting it otherwise or when the object is absent. Researchers have devised methods of assembling a probabilistic description of an object's features by analyzing a 3-D model of the object. Typically, the 3-D model is rendered from various viewpoints to determine what features are visible and what their apparent properties are. One approach simply equates a feature's detectability with the proportion of viewing positions from which the feature is visible (e.g., Goad 1983; Kuno, Okamoto, and Okada 1991). Some have argued, however, that much more complete models of lighting, sensors, and surfaces are needed to obtain useful estimates (e.g., Camps, Shapiro, and Haralick 1991; Chen and Mulgaonkar 1992; Sato, Ikeuchi, and Kanade 1992; Wheeler and Ikeuchi 1995). This dissertation describes an alternate approach whereby probabilistic descriptions of feature position and reliability are learned directly from real images. A model used to recognize generic objects must describe a class of objects that are perhaps only qualitatively similar. The model for a class of mallets, for example, might only specify, roughly, that there should be a cylindrical handle and a head joined at right angles. In most systems meant to recognize generic objects (e.g., Bergevin and Levine 1993; Dickinson et al. 1993; Havaldar, Medioni, and Stein 1994), qualitative models are described by small numbers of coarse, large-scale features such as generalized cylinders. In effect, by severely restricting the types of features available, the designers of these systems predetermine the scope of each generic class—what qualities must be shared by elements of the class, and what qualities may differ within a class because there are no features to represent them. This dissertation describes an alternate approach whereby a system endowed with many types of features learns which are Chapter 2. Survey of Object Recognition 34 relevant for recognizing any particular object or class of objects. 2.3 Recognition This section surveys methods of recognizing an object by finding a match between model features and image features. The inputs to the recognition task are a library of object models and an image in which objects are to be recognized; the outputs specify the identity, pose, and perhaps certainty of any objects recognized in the image. This task is made difficult by several factors, including: not knowing which of many possible objects might be present in the image; not knowing what pose an object might assume in the image; the possibility that an object might be partly occluded in the image; the possibility that the image may be cluttered with unknown objects; complex effects of the imaging process, such as shadows, specular highlights, and interreflections; and limits to the reliability and accuracy of sensor measurements. To recognize a single object in an image, most methods perform something like the following sequence of steps: detection signal processing to detect features in the image and represent them as symbols grouping identify new features as stable groupings of more primitive ones indexing use features to select a likely model out of a library of object models matching find the best match between features of the image and those of the selected model verification decide whether that match suggests that the modeled object is present in the image To recognize any number of objects in an image, methods generally recognize single objects repeatedly until no further objects are found. After each object is recognized, its image features are deleted and the next recognition cycle is performed using whatever features remain. Chapter 2. Survey of Object Recognition 35 The grouping, indexing and matching steps all involve choosing among alternatives—alternative features, models, or matches. Thus the entire process is essentially a large search through several stages of choices. One way that methods differ is in how they balance these stages in an attempt to minimize the overall workload. Some methods emphasize effective grouping and indexing to minimize the number of alternatives that must be considered by later stages, while other methods emphasize fast matching or verification so that grouping and indexing need not be so selective. Feature detection, which includes such processes as image segmentation and edge detection, is a very extensive topic and one that is not peculiar to object recognition. For these reasons we shall not survey it here (although features commonly used for object recognition were surveyed in section 2.2.4). The remaining four steps, from grouping to verification, are each discussed individually in the following sections. 2.3.1 Grouping features The purpose of the grouping step is to identify groups of features that, collectively, are more informative than individual features, and therefore better able to guide the selection and matching of models. The process is also referred to as perceptual organization, a term borrowed from psychology that refers to the perceived grouping of visual elements resulting from their proximity, their similarity, and other factors. The term perceptual organization has been widely applied to activities ranging from segmenting curves to recognizing geometric figures. What qualifies all these processes as forms of perceptual organization is their generality: They are based not on knowledge of specific objects, but rather on general assumptions that hold for most objects and situations encountered. Segmentation may be based on the assumptions that most surfaces are smooth and most objects convex, for example. Insofar as the underlying assumptions are valid, the feature organization produced will be descriptive and stable; where those assumptions fail, the organization may be Chapter 2. Survey of Object Recognition 36 spurious or unreliable. Because it uses no specific knowledge, perceptual organization is predominantly a bottom-up process that proceeds iteratively from low-level groupings to high-level ones. Features at each level are selected, grouped, and/or abstracted to form those at the next level. One important issue in this area is how to decide what features to group in an image. A decision about whether to group certain features is typically made by measuring various aspects of their relative geometry and by comparing those measurements to thresholds; sometimes the features' context is also considered. So, for example, if two line segments are sufficiently close to each other, if they are approximately parallel, and if no prominent segments lie between them then a decision may be made to group the two segments, producing a new feature denoting a pair of parallel lines. To be useful for object recognition, a group must not span objects; instead, its features must be derived entirely from a single object. While grouping methods are intended to produce groups that satisfy this condition, it cannot be guaranteed that all groups will for that would require a perfect segmentation of the image into objects before grouping. The goal, therefore, is to produce groups that are likely to involve a single object, yet unlikely to arise otherwise, due to an accidental arrangement of objects. Subsequent object recognition stages must allow for missing and spurious groups by expecting that only some—or occasionally none—of the groups are correct. Lowe (1985, 1990) first explicitly addressed the role of perceptual organization in object recognition. He suggested the following rationale (somewhat simplified here) for grouping decisions. Because a viewpoint-invariant 3-D structure projects the same pattern over a wide range of viewpoints, that pattern is more likely to occur in images than other, non-invariant patterns. Moreover, an occurrence of the pattern is more likely to be due to the presence of the 3-D structure than some accidental alignment of objects. Thus any arrangement of features that matches the pattern closely enough ought to be grouped. How close must the match be? Given a set of features, a viewpoint-invariant pattern, and some assumptions about how features are Chapter 2. Survey of Object Recognition 37 Figure 2.6: Perceptual grouping criteria. Lowe estimates the probability of a chance occurrence of proximity, parallelism, or collinearity from these measurements, plus assumptions about a random distribution of line segments. If the chance occurrence probability is low, the line segments are grouped as being likely to have come from a common object. Reprinted from Lowe 1987a. distributed due to chance, one can estimate the probability that the features would match the pattern at least as well as they do, due to chance alone (see figure 2.6). This is essentially the probability of the feature arrangement arising by accident. If that probability is below some threshold, then the features are grouped. Except for work by Jacobs (1989) on grouping pairs of convex contours, there seems to have been no other effort like this to found grouping decisions upon first principles. Instead, it has been common practice to develop ad hoc criteria for deciding which sets of features should be grouped (e.g., Bergevin and Levine 1992; Horaud, Veillon, and Skordas 1990; Mohan and Chapter 2. Survey of Object Recognition 38 Nevatia 1992; Sarkar and Boyer 1990; Saund 1990; Stein and Medioni 1992a). Researchers concerned with the problem of how to identify groups efficiently have produced data structures for quickly locating related features (Sarkar and Boyer 1990; Saund 1990; Stein and Medioni 1992a), parallelizable algorithms (Mahoney 1987; McCafferty 1990), and other process improvements (Huttenlocher and Wayner 1992). Sarkar and Boyer (1993) have represented the grouping process using a Bayes network in order to allow a control strategy that is moreflexiblethan the usual bottom-up process. With their approach, a grouping hypothesis may lead to a top-down search for additional evidence in the image, and hypotheses may compete with one another to resolve choices among alternate groupings. 2.3.2 Indexing the model database Given a library of object models and some features found in an image by feature detection and grouping, we want to select a model that is likely to match those features. In this, the indexing problem, the goal is to do much better than simply trying each model in turn. Solutions generally involve a table that is indexed either by individual features or by small groups of them. Each table entry indicates an object (and perhaps viewpoint) that could produce the corresponding feature or group. Before recognition, the table entries are created for various viewpoints of each object by analyzing the model library, by rendering each object model from a sampling of viewpoints, or by processing a representative set of training images. During recognition, features chosen from the image are used to index the table, thus producing hypotheses about what objects are present in the image. Each hypothesis denotes a possible, partial match between model and image, which must be further tested to determine the full extent and quality of the match. Many variations of this basic scheme are possible, and several have been investigated. One might index the table using all features from the image (Breuel 1990), a randomly chosen subset of features (Lamdan, Schwartz, and Wolfson 1988), or just features that are judged particularly likely to have been derived from single objects (Clemens and Jacobs 1991a). One might test Chapter 2. Survey of Object Recognition 39 every hypothesis retrieved from the table for a possible match (Clemens and Jacobs 1991a), or instead treat the hypotheses as votes, and only test those that receive the most votes (Lamdan, Schwartz, and Wolfson 1988) or some minimum number of votes (Stein and Medioni 1992b). Votes may bear equal weight, or they may be weighted according to the size of the lookup feature or how well the lookup feature matches each table entry (Beis and Lowe 1994; Rigoutsos and Hummel 1993; Sarachik and Grimson 1993). If one does not take the exhaustive approach of using all features to index the table, or if voting is used and only those hypotheses receiving the most votes are tested, then one must have some way of deciding when enough features have been examined or enough hypotheses tested. One can estimate the probability that each test will miss an actual occurrence of a modeled object, and repeat testing until the cumulative probability of missing an object drops below any desired threshold (Lamdan, Schwartz, and Wolfson 1988). How well indexing performs depends on how well entries are distributed throughout the table. The entries associated with any single object should occupy a relatively small portion of the entire table, and each entry should refer to relatively few objects. For these reasons, invariants and quasi-invariants are favoured as indexing features (Mundy, Zisserman, and Forsyth 1994), and poor distributions occur when objects are symmetrical or when they share common features (Flynn 1992). Indexing performance is also greatly affected by uncertainty in feature measurements. Uncertainty may be accommodated by coarsely quantizing feature dimensions, by replicating table entries (with a range of duplicate entries representing a range of possible measurements), or by sampling a range of index values at lookup time. All three approaches reduce the selectivity of the index, and replication greatly increases its storage requirement as well. Jacobs (1992) has shown that, for indexing based on groups of point features and a particular type of projection, the table's storage requirement can be reduced by factoring the table into two tables, each having half as many dimensions. Chapter 2. Survey of Object Recognition 40 A markedly different approach to indexing has been developed using networks of neuronlike units that compute functions called generalized radial basis functions or hyper-basis functions (Poggio and Edelman 1990; Brunelli and Poggio 1991; Edelman and Poggio 1990), which are essentially multidimensional kernel functions. A modeled object is represented by a network in which individual units represent distinct prototypical or characteristic views of the object. The input to the network is a vector of selected image feature measurements. The output represents either a normalized view of the object (which must be further tested by comparison with some standard view), or a graded yes/no recognition response. A library of object models is represented by a collection of networks, one for each model; for recognition, all networks are applied in parallel and the object's identity is indicated by the network producing the strongest response. Although these networks have been described as performing object recognition, we consider them to be essentially an indexing scheme performing just one component of the complete object recognition task. Like other indexing schemes, these networks consider only a small, fixed-sized subset of image features, and they must be applied to all such subsets to generate all hypotheses. They alone do not resolve the question of which image features are associated with an object. As indexing schemes, moreover, these networks do not compare well with other indexing schemes because, at one network per object, their space and time complexities scale linearly with the size of the model library. And although the networks have the advantage that they can be trained using images, there appear to be other trainable classifiers that can perform more quickly and about as accurately (e.g., the nearest-neighbour classifier described by Brunelli and Poggio 1992). Another recently proposed indexing approach organizes the model library hierarchically. Models of similar objects are clustered, and each cluster is represented in the library by a single prototype. This clustering may be repeated several times to form a model abstraction hierarchy. Recognition proceeds by descending this hierarchy while refining an object's identity along the way. However, a full search for a match between image and model features need only Chapter 2. Survey of Object Recognition 41 be performed at the hierarchy's top level—at each lower level, the match result from the level above provides an advanced starting point in the search for a more complete match. Variations of this approach have been described by Basri (1993) and by Sengupta and Boyer (1995). 2.3.3 Matching features The goal of matching is to identify a partial match between features of a given image and those of a given object model, and to also estimate how the modeled object is positioned in the image. The match solution must satisfy the viewpoint consistency constraint (Lowe 1987b), which requires that the locations of the object's features in the image be consistent with some pose of the object. Moreover, because there are generally many consistent solutions to any matching problem, it is usually further stipulated that the desired solution must maximize some appropriate measure of match quality. Such measures are often based on error models that describe how an image feature may differ from what the object model has predicted, and, for simplicity, on an assumption that errors affect each feature independently. Three types of error models are most commonly used: 1. A bounded error model requires each matching image feature to be within some fixed range of its predicted location. The corresponding match quality measure is often just a count of the matching features. Note that this model specifies nothing about the error distribution other than its bound. 2. A bounded uniform error model specifies that each image feature is distributed uniformly within some fixed range of its predicted location. The corresponding match quality measure is often a count of the matching features. 3. A Gaussian error model specifies that each image feature is distributed normally about its predicted location. The corresponding match quality measure usually considers both the number of matching features, and the sum of the squares of their normalized errors. Chapter 2. Survey of Object Recognition 42 Of these three, a Gaussian error model is often considered the most representative. Some justification for this comes from the central limit theorem and the fact that there may be many independent sources of error. Wells (1993, ch. 3) has found empirical evidence suggesting that a Gaussian error model is an accurate model of feature location distribution, at least for some features. Match quality measures are now often defined using Bayesian probability theory (e.g., Wells 1993). Given such factors as the prior probability that each object is present in the image, the prior distribution of its pose if present, the conditional probability that each model feature is matched if its object is present, and the conditional distribution of matching errors as described by an error model, one can estimate the posterior probability that a particular object is present with a given pose and a given set of feature matches. This posterior probability can then serve as a match quality measure. Assumptions of feature independence are needed to keep the approach tractable. One can classify various matching methods according to whether they search for a solution in correspondence space, transformation space, or both. Correspondence space is the space whose elements are sets of pairings between model features and image features. Transformation space is the space of possible object-to-camera or camera-to-object transformations, which are called viewpoint transformations or object poses. Under the viewpoint consistency constraint and an appropriate error model, the two spaces are closely related: Each set of pairings is consistent with zero or more transformations, and each transformation, with zero or more sets of pairings. 2 The interpretation tree method (Grimson 1990) exemplifies those methods that search entirely in correspondence space. Its name refers to a search tree of choices regarding the interpretation of each image feature. Proceeding from the root of the tree to its leaves, the match search examines an additional image feature at each level of the tree. Branches at each level represent different choices among model features that can be paired with that image feature, In the literature, the term match sometimes refers to a set of pairings between model features and image features; at other times it refers to an an individual pairing. Hence we shall avoid using the term in this survey except where its meaning is clear. 2 Chapter 2. Survey of Object Recognition 43 plus the choice of pairing nothing at all with it. A complete interpretation of the object's presence in the image, assigning some subset of image features to corresponding model features, is associated with each of the tree's leaves. The search can readily incorporate two kinds of matching constraints: unary constraints, which require that paired image and model features have similar properties, and binary constraints, which require that pairs of feature pairings be geometrically consistent. Although these local constraints alone are not sufficient to ensure viewpoint consistency and consequently each complete interpretation must be subsequently verified, the local constraints do effectively prune the search. According to Grimson (1990), higher-order constraints provide no real improvement because of their higher computational cost. A branch-and-bound technique can also be used to prune the search tree. This technique requires a function that evaluates a score for a complete interpretation, plus an estimator that gives an upper bound for the scores of any complete interpretations that might follow from a given partial interpretation. The search tree is pruned wherever the estimator shows no improvement possible over the best score yet obtained for a complete interpretation. Correspondence space search has often been cast as a problem of graph matching (e.g., BenArie and Meiri 1987; Bergevin and Levine 1993; Fan 1990; Shapiro and Haralick 1981; Wong 1992; Yang, Snyder, and Bilbro 1989; Zhang, Sullivan, and Baker 1992). In this framework, the task is to find a common subgraph isomorphism between two attributed graphs, one representing the image and the other, the model. Graph nodes represent features, graph edges or hyperedges represent the geometrical relations among them, and nodes and edges have attributes recording their properties or measurements. Usually an inexact match is sought, where the attributes of matching nodes and edges are allowed to differ somewhat to accommodate noise and distortion in the image. The search for an optimal graph match is guided by some measure of graph match quality, which must evaluate both how well the two graph's structures match and how well their corresponding attribute values match. This measure serves the same purpose as the match quality measure discussed above, and it too may be based on Bayesian probability theory. Chapter 2. Survey of Object Recognition 44 The biggest difficulty with correspondence space methods has been their computational cost, which is generally exponential in the number of image or model features. One way to avoid considerable computation is to relax the requirement that an optimal match be found and search instead for a near-optimal one. This is the tactic employed by heuristic search termination (Grimson 1990), which terminates the search as soon as a solution meeting some minimum requirement has been found. It is also employed by methods that cast the matching problem in the form of a constraint satisfaction network solved by relaxation labeling or gradient descent (e.g., Bhanu and Faugeras 1984; Bolle, Califano, and Kjeldsen 1992; Bray 1990; Kitchen 1980; Wheeler and Ikeuchi 1995). Transformation space search The generalized Hough transform is an example of a method that searches transformation space. It uses an array of bins indexed by parameters of the viewpoint transformation. First the bins are initialized as empty. Then, for each possible pairing between one image feature and one model feature, transformations consistent with that pairing are determined and votes are cast in the bins corresponding to those transformations. Finally, when votes have been placed on behalf of all pairings, the array is scanned to identify and verify those transformations that have received the most votes. Usually a bounded error model is used so that a finite range of transformations and a corresponding finite range of bins are consistent with each feature pairing; for reasonable error bounds, however, these ranges can be quite large. The principal advantage of this and other transformation space methods is that, unlike correspondence space methods, they avoid exponential search. Unfortunately, however, not every bin collecting a large number of votes represents a correct match solution. In some cases, due to an accident of how the array tessellates transformation space, a bin may collect many votes that are not entirely consistent with any single transformation. Also, when the image contains a great deal of clutter, random clusters of votes may overshadow a correct solution so that many of the fullest array bins must be examined before a correct solution is found (Grimson Chapter 2. Survey of Object Recognition 45 and Huttenlocher 1990). Breuel (1992a) and Cass (1992) have developed transformation space algorithms that avoid the problems due to tessellation. Like the generalized Hough transform, their algorithms use a bounded error model to associate a range of transformations with each possible pairing between image and model features. By designing their features, error model, and transformation space carefully, they are able to ensure that these transformation ranges are of a particularly simple form. For example, if the features are points, if error bounds are convex polygons, and if transformation space spans 2-D translation, rotation, and scaling, then the region of transformation space associated with each feature pairing is simply a convex polytope; each region can therefore be expressed as the intersection of several half-spaces, which in turn are defined by hyperplanes. To identify maximal sets of consistent feature pairings, Breuel and Cass search transformation space to find areas where maximal numbers of regions intersect. Because the regions are expressed in a convenient form (using hyperplanes), and because the number of regions is only polynomial in the numbers of image features and model features, this search can be performed relatively efficiently. Wells (1993) has shown how the transformation space search can be cast as an iterative estimation problem solved by an algorithm analogous to Newton's method. Using Bayesian theory and a Gaussian error model, he defines the posterior probability of a particular set of pairings and a transformation, given some input image. This probability is then integrated over all possible sets of pairings to produce a marginal probability of transformation that, for a given input image, is a function only of the transformation. Limited experiments suggest that this function is relatively smooth, and that its maximum is usually near the correct transformation. With an initial guess provided by indexing, the iterative Expectation-Maximization algorithm locates this maximum efficiently. Chapter 2. Survey of Object Recognition 46 Using both search spaces Instead of searching solely in either correspondence space or transformation space, some methods do portions of their search in each. One general strategy, called hypothesize-and-test, hypothesizes specific viewpoint transformations and tests those hypotheses by finding feature correspondences that are consistent with each. This strategy was used in the first object recognition system, by Roberts (1965), and it has been used in many other systems since then (e.g., Brooks 1981; Bolles and Cain 1982; Goad 1983; Lowe 1985; Chen and Kak 1989). The alignment method (Ullman 1989) is a specific implementation of hypothesize-and-test. It begins its search in correspondence space, where it pairs just enough "anchor" features to determine a viewpoint transformation. For example, three point features are needed if the transformation is restricted to being an orthographic projection and a scaling (the so-called weak perspective approximation of perspective projection); fewer features are needed if they have associated orientations (as junctions do, for example); and more features are needed if the object to be recognized is somewhatflexible.Once the transformation is determined, it is used to project the remaining features of the model into the image, where additional pairings are sought for each projected feature. Because there may be many combinations of anchor feature pairings, this method relies heavily on having efficient techniques for computing and verifying transformations (Huttenlocher 1988). The alignment method estimates a transformation once for each set of anchor feature pairings, and an error in localizing an anchor feature in the image yields an error in the transformation estimate. That error perturbs the locations of projected features, contributing to errors in pairing those additional features. Whereas the alignment method requires that at least one set of anchor features produces a sufficiently accurate estimate of the transformation, other methods attempt to overcome the inaccuracy of the initial estimate. Using additional pairings involving projected features, they refine that estimate iteratively. Lowe (1987a, 1987b) has described this iterative alignment approach and demonstrated it with his S C E R P O system. As with the alignment method, a transformation isfirstestimated from Chapter 2. Survey of Object Recognition 47 a small set of feature pairings. This transformation is used to predict the visibility and image location of each remaining model feature. For each of these projected model features, potential pairings with nearby image features are identified and ranked. This ranking considers both the likelihood that an image feature could occur by accident so close to its projected model feature, and the degree to which the pairing is rendered ambiguous by the presence of another, nearby image feature. The best ranked pairings are then adopted, all pairings are used to produce a refined estimate of the transformation, and the process is repeated until acceptable pairings have been found for as many of the model features as possible. Although backtracking can be used to try alternate pairings for projected features, Lowe reports that this is seldom necessary because the ranking scheme is usually successful in eliminating ambiguous pairings, and because errors in the initial estimate of the transformation are eliminated as additional pairings are incorporated. A similar method of iteratively refining a viewpoint transformation estimate has been used by Ayache and Faugeras (1986) in their HYPER system. This dissertation describes an enhanced version of iterative alignment that also uses feature uncertainty information. Another way of employing both correspondence and transformation spaces is first to identify interesting regions of transformation space by means of something like a coarse Hough transform, and then, within each region, to perform a correspondence search search while considering only pairings consistent with that range of transformations. This approach was used by Grimson and Lozano-Perez (1987), and again by Kuno, Okamoto, and Okada (1991). Their system begins with a Hough transform that accumulates evidence for various possible viewpoint transformations. As various pairings are hypothesized and each contributes evidence for certain transformations, the system assesses the uncertainty of the accumulating evidence. When that uncertainty drops below some threshold (i.e., a limited range of transformations begins to appear particularly likely) the system begins searching instead for model features within restricted image regions predicted by the transformation distribution. The threshold at which this transition takes place is determined according to the costs of the two search methods so that the overall search cost is minimized. Chapter 2. Survey of Object Recognition 48 Estimating a viewpoint transformation from correspondences The alignment method and its variants require that a viewpoint transformation be estimated from a set of correspondences between image features and model features. Also, when matching is done by correspondence space search, a viewpoint transformation must usually be estimated before a completed match can be verified. This problem is called pose estimation, and it occurs in several contexts besides object recognition, including camera and robot calibration, navigation, and photogrammetry. Pose estimation problems occur in forms that differ according to: • whether the feature correspondences are 3-D/2-D (between 3-D model features and 2-D image features), 2-D/2-D, or 3-D/3-D; • what form of projection transformation associates model features with image features (e.g., weak perspective or "full" perspective projection); • what kinds of features are used, with common examples including point features, planar faces and their edges, and smooth surfaces and their contours; • whether the object is rigid, deformable, articulate, or otherwise parameterized, and how its parameters are constrained; • whether a unique transformation (or a finite number of transformation solutions) is being estimated from a minimal set of correspondences, or whether an optimal transformation is being estimated from an overconstraining set; and • when estimating an optimal transformation from an overconstraining set of correspondences, how errors are modeled and where those errors are measured (e.g., in the 2-D image plane or in the 3-D scene). Difficulty of pose estimation increases with the generality of the shape representations being matched. There are efficient, well established 3-D/2-D pose estimation methods for point Chapter 2. Survey of Object Recognition 49 correspondences found by matching distinguished features (e.g., Huttenlocher 1988; Tsai 1987), and for line correspondences found by matching edges of a polyhedral object (e.g., Lowe 1987a). Until recently perhaps, no comparable method has existed for objects that are smooth but otherwise unrestricted, primarily due to the difficulty of predicting the 2-D curves projected by the contours of a smooth object (Lavallee and Szeliski 1995). In object recognition, because pose estimation is often part of the cycle of hypothesizing and verifying matches, computational efficiency is important. Recursive least squares estimators such as the Kalman filter have been used to estimate pose, updating the estimate at minimal cost as successive feature pairings are adopted (e.g., Ayache and Faugeras 1986; Hel-Or and Werman 1995; this dissertation). Also important is robustness in the face of incorrect assumptions about feature pairings, error models, and object models. One robust estimation approach attempts to identify data points that fail these assumptions—so-called outliers—in order to assign them less importance before refining the estimate. Another approach is to use more appropriate error models that, in comparison to the Gaussian distributions associated with least squares estimation, have longer "tails" that more accurately represent the appreciable likelihood of outliers. Ordering and structuring the search The efficiency of a correspondence space search is significantly affected by the order in which features are considered for matching. At each stage of the search one would like to choose an unpaired model feature that is likely to be found in the image when the object is present, is relatively unique among features of the model, is not likely to be encountered when the object is not present, can effectively constrain other pairings, and, according to what is already known about the object's pose, can be expected to lie within a small region of the image. Goad (1983) suggested how one might use criteria like these to choose the next model feature for each stage of an interpretation tree search. To measure the criteria he proposed sampling a uniform distribution of viewpoints, estimating the object's appearance from each viewpoint Chapter 2. Survey of Object Recognition 50 i using a 3-D model, and tallying each feature's visibility and position. The tallies would be used to rank each feature according to how likely it is to be visible, how accurately its position can be predicted, and how much information about the object's pose would be provided by pairing it. This analysis would be used to pre-determine the entire search tree for a given model before any attempt at recognition. Goad's experiments demonstrated the idea of pre-ordering the search tree but they did not test his ranking criteria (features were ordered subjectively, by hand). Kuno, Okamoto, and Okada (1991) have described a similar approach that also considers the likelihood of features arising accidentally, the cost of detecting them, and the degree to which they may be distorted by perspective. Another variation, the local feature focus method (Bolles and Cain 1982), involves analyzing the entire database of object models to select for each model one or more focus features that, due to their specificity, can be used to initiate matching by alignment. There has been some investigation of how to organize the match search hierarchically. In his SAPPHIRE system, Ettinger (1987) uses both a structure hierarchy for dividing object models into parts, and a scale hierarchy for representing each part at various levels of detail. Before recognition, models are decomposed automatically into parts according to heuristics that look for "necks" and other shape properties. SAPPHIRE'S recognition algorithm first recognizes the parts in an image and then recognizes objects in terms of those parts (i.e., treating entire parts as features). When recognizing a part, it first uses coarse features and then proceeds to finer ones. By decomposing the search in these ways, a large search is replaced by numerous smaller searches, altogether requiring less computation. The idea of decomposing the search according to a structure hierarchy and using that hierarchy to guide a bottom-up process is also found in systems by Burns and Riseman (1992) and by Dickinson, Pentland, and Rosenfeld (1992). Our own system uses hierarchical structure to guide matching by employing features at various scales and degrees of abstraction, but in a less rigid way: it is free to move opportunistically among different levels of the hierarchy as matching progresses. Chapter 2. Survey of Object Recognition 2.3.4 51 Verifying the match result The last step in the recognition process is to decide whether an optimal match found by the match search actually represents an instance of an object in the image. The factors considered in this decision and the weight accorded each factor may depend strongly on the nature of the application. The consequences of a false positive (declaring the object present when it is not) or a false negative (declaring the object not present when it is), the reliability of feature detectors, and the likelihood of occlusion should all determine what evidence is needed before the object is declared present. Other sources of knowledge may be used as well, such as prior expectations of whether and where the object is to be found and hypotheses that may explain away missing or extraneous features. Decision theory provides a suitable framework for combining these factors. Most research systems, however, have simply required that some fraction of the model's features be matched, and/or that some fraction of the edges projected from the model lie near image edges (e.g., Chen and Kak 1989; Gottschalk, Turney, and Mudge 1989; Lamdan, Schwartz, and Wolfson 1990). Some also assess negative evidence, such as image edges that cross projected model edges at large angles (e.g., Hansen and Henderson 1989; Huttenlocher and Ullman 1990). Match solutions are verified by testing these measures against empirically determined thresholds, and then ranked according to the measures to select the best, mutuallyconsistent solutions. Grimson and his associates (Grimson and Huttenlocher 1991; Sarachik and Grimson 1993) have developed analytic models for estimating the probability that a particular match may be accidental—i.e., due to a conspiracy of random features rather than the actual presence of the object. Only if this probability is below some threshold is the match accepted. Chen and Mulgaonkar (1992) have used a similar analysis not only to accept or reject matches, but also to halt the match search as soon as a sufficiently reliable decision can be made. Recently, Breuel (1993) has suggested that verification consider not only how much of a model is matched, but also the spatial distribution of its unmatched parts. In a valid match, the features that remain unmatched do so primarily because of occluding objects that typically cover Chapter 2. Survey of Object Recognition 52 contiguous regions of the image. Verification, therefore, should test how well these unmatched features can be explained away by hypothesizing a small number of contiguous occlusions. 2.4 Discussion Early in this survey we listed the criteria commonly used to evaluate object recognition methods: scope, robustness, and efficiency. We now ask, how well have existing methods met these requirements, and what factors have contributed to difficulty or success in these areas? 2.4.1 Scope of methods Most methods are applicable only to highly restricted classes of objects, and none can cope with the full range of objects found in natural environments. Usually objects are required to have smooth contours and be rigid or articulate so that they can be readily matched against a simple,fixedmodel of shape. These restrictions may not be a problem in special situations, but they effectively rule out recognition of most objects in most environments. Often underlying this scope restriction is a reliance on some small, fixed set of features for describing any object. Matching strategies that must apply models of the image formation process are especially limited to few and simple features. Since these methods require that there be a straightforward relation between 3-D model features and 2-D image features, they generally have been restricted to simple representations, such as polygonal models and models with few parameters. Feature variety has also been limited by the difficulty of detecting or recovering features reliably. For example, methods that represent objects as being composed of a small number of generic parts, such as generalized cylinders or superquadrics, must use simple parts so that they can be segmented and identified in images prior to recognition; as a consequence, however, the parts are of limited representational power and they are unable to portray many natural objects or object markings. The scope problem will not be solved byfindingsome small, universal set of features capable of representing any object. We should expect, instead, that a rather broad range of features Chapter 2. Survey of Object Recognition 53 will be needed. Many of these features would pertain only to certain objects but, collectively, they would be able to generate rich descriptions of a large class of objects. So that features need not be restricted to those for which we can readily model the image formation process, we should seek matching strategies that can avoid the need for such models—for example, by employing multiple-view models. The work reported in this dissertation develops this approach. Finally, since it would be impractical for us to try to anticipate all features that could prove useful for recognition, a general purpose recognition system should be able to coin new features as needed, based on the images it encounters and its own recognition experience. 2.4.2 Robustness of methods Because most methods have been tested only by their proponents, and apparently with just small, hand-picked sets of objects and images, it is difficult to assess how robust these methods truly are. Trials comparing alternate methods under similar conditions are extremely rare. But we would suggest that the following ideas and improvements can contribute most to robustness: • Improvements in feature detection. Without robust and stable feature detection, no recognition method can succeed regardless of what other strengths it may have. Although feature detection is outside the scope of this survey, it should be noted that improvements in this area would undoubtedly aid recognition systems. • Effective use of redundant information. An image of an object usually contains a considerable amount of redundant information. In some ways, this redundancy can compensate for missing or inaccurate features, just as indexing methods use redundant information to identify possibly-correct interpretations even when a large proportion of features are missing or inaccurate. Another use of redundancy is in representations that span a range of scales: Large-scale features make explicit certain information that is already implicitly available in small-scale ones, but the large-scale features provide good starting points for a match search, and they allow objects to be easily recognized as similar even when the objects differ in some details. Chapter 2. Survey of Object Recognition 54 • Accurate models. When we use object and image formation models that are inaccurate, we must compensate for their inaccuracies by relaxing the constraints used to accomplish recognition (Wheeler and Ikeuchi 1995). Both efficiency and reliability are compromised as a result. For example, it is common to model an object as a single, idealized form; any departure of its actual form from this ideal is simply regarded as "noise", with all features sharing a common noise distribution. In practice, however, individual features differ greatly in their likelihood of being detected and in other aspects of their uncertainty. Probabilistic models like those described in section 2.2.6 try to represent these kinds of variations more accurately, and their use in conjunction with methods of learning variations directly from training images should permit more reliable recognition. • Use of probabilistic reasoning. Increasingly, matching methods are being formulated in terms of similarity measures based on probability theory. This provides a framework for integrating a probabilistic model, the observations of many noisy features, and other sources of knowledge. The framework contributes to robustness by allowing appropriate importance to be accorded each of many information sources. 2.4.3 Efficiency One of the most effective ways of improving efficiency in object recognition has been to permit some relaxation of the assurance that a solution will be exactly correct or entirely optimal. This tactic has been responsible, in part, for the enhanced performance of indexing methods like geometric hashing, and of matching methods like alignment. As in these cases, recognition can sometimes be made fast by considering solutions that are approximate, near-optimal, or correct only with high probability. Efficiency remains an important problem despite continuing improvements in representations, algorithms, and computing power. Still impractical are systems that recognize complex objects in a fraction of a second, and those that can work with a database of hundreds or thousands of complex object models. However, some interesting ideas have been proposed for Chapter 2. Survey of Object Recognition 55 improving efficiency and, although these have been tried out individually, the task now appears to be to find effective ways of combining them. For matching, different strategies such as coarse-to-fine, part-to-whole, and abstract-to-specific have been proposed. Each may be good in certain situations, but the best overall performance will come from integrating them and being able to select among them as needed. Chapter 3 Survey of Learning in Object Recognition In this chapter we shall survey the roles that machine learning techniques have served in various object recognition systems. Where possible, we shall also draw comparisons between these and our own recognition learning method, which is fully presented in subsequent chapters. The final section of this chapter summarizes efforts to establish theoretical bounds on the learnability of certain formulations of the object recognition task. 3.1 Learning A p p e a r a n c e Features An object recognition system may learn new types of shape or appearance features to use in describing objects. Typically this is done by clustering existing features or groups of them, and then associating new features with the clusters that have been found. New features thus represent particular configurations or abstractions of existing ones. The new configurations may improve the representation's descriptive power, and the new abstractions may allow more appropriate generalizations. Segen (1989) has demonstrated this approach with a system that learns a representation for 2-D contours. The system's lowest-level features are distinguished points, such as curvature extrema, found on contours in training images. Nearby points are paired, each pair is characterized by a vector of measurements, the measurement vectors are clustered, and a new feature is invented for each cluster of significant size. Consequently, each new feature describes a commonly observed configuration of two distinguished points. The induction process is repeated with these new features to generate higher-level features describing groups of four, eight, 56 Chapter 3. Survey of Learning in Object Recognition 57 and more distinguished points. The Cresceptron system by Weng and his associates (1993) is analogous in that it induces a hierarchy of features within a pre-programmed framework. Since these features are essentially templates, invariance to translation, rotation, and scaling in the image must be provided by prior segmentation, by an attentional mechanism, or by searching extensively over possible transformations. We would expect both these methods to be sensitive to clutter in the training images and to parameters of the clustering algorithm. Delanoy (1992), Fichera (1992), and their associates have described object recognition systems that induce fuzzy predicates from training images. Their systems represent models as logical formulae, and, therefore, the systems need appropriate predicates. These are invented by clustering measurements of low-level features that have been recovered from training images. Turk and Pentland (1991) and Murase and Nayar (1993) have induced features using principal component analysis. They compute the several most significant eigenvectors, or principal components, of the set of training images. Since these few eigenvectors span much of the subspace containing the training images, they can be used to concisely describe those images and others like them. However, as this is a global representation, it does not support recognition of occluded objects. 3.2 Learning an Appearance Classifier The object recognition task can be characterized, in part, as a classification problem: instances represented as vectors, sets, or structures must be classified into categories corresponding to various objects. A wealth of techniques has been developed for classifying, and for inducing classifiers from training examples. For the purposes of object recognition, the important considerations distinguishing these techniques include the expressiveness and" complexity of the input representation (e.g., vectors are easier to classify than structures), the generality of the categories learned, the ability to cope with noisy features, the number of training examples needed, and the sensitivity to the order in which examples are presented. Jain and Hoffman (1988) describe a system that learns rules for classifying objects in range Chapter 3. Survey of Learning in Object Recognition 58 images. The instances classified by their system are sets of shape features with associated measurements. The classifier applies a series of rules, each contributing evidence for or against various classifications. Each rule applies to a particular type of shape feature and a particular range of measurements for that feature. These rules are learned from training images by extracting features from the images, clustering them according to their measurements, and associating rules with the clusters that derive primarily from a single object. Because this system does not learn constraints governing the relative positions of the shape features, it appears to have little ability to distinguish among objects that have different arrangements of similar features. Neural networks, including radial basis function networks, have been used as trainable classifiers for object recognition (e.g., Brunelli and Poggio 1991). In this role, the network approximates a smooth function that maps a vector of feature measurements to an object identifier, or to a vector of graded yes/no responses, one per object. Nearest neighbour classifiers are also commonly used. The most difficult aspect of these approaches seems to be deriving appropriate feature vectors from cluttered images, which requires consistently identifying and ordering a set of features. For this reason, these methods are best suited solely for indexing, as in the indexing method of Beis and Lowe (1994). 3.3 Learning a Structural Model A structural model explicitly represents both shape features and their spatial relationships. Its structure is thus analogous to that of the modeled object, and it is often represented as a graph or, equivalently, as a series of predicates. In general, a structural model can be learned from training images by first obtaining a structural description from each image, and then inducing a generalization covering those descriptions. Connell and Brady (1987) have described a system that learns structural models for recognizing 2-D objects in intensity images. Their system incorporates many interesting ideas. They use graphs to represent the part/whole and adjacency relations among object regions described by ribbon shapes (see figure 2.5 on p. 31). Chapter 3. Survey of Learning in Object Recognition 59 An attribute of a region, such as its elongation or curvature, is encoded symbolically by the presence or absence of additional graph nodes according to a Gray code. A structural learning procedure forms a model graph from multiple example graphs, most commonly by deleting any nodes not shared by all graphs (thus applying the "dropping rule" for generalization). Similarity between two graphs is measured by a purely syntactic measure: simply by counting the nodes they share. Consequently, this system accords equal importance to all features, and it uses a somewhat arbitrary metric for comparing attribute values. The approaches described below involve more powerful models that represent probability distributions over graphs. A Markov or independence condition is assumed so that the highorder, joint probability distribution over all graphs can be approximated by a product of loworder distributions, one per node or arc. The probability of a particular graph instance is then defined according to a partial match between that instance and the model. Our approach is similar in that it uses probability distributions over graphs to model objects; it differs in many aspects of how the probability distributions are defined, learned from training images, and used to accomplish matching. Wong and You (1985) represent a model as a random graph in which nodes represent shape features, arcs and hyperarcs represent relations among them, and both have attribute values characterized by discrete probability distributions. An attributed graph (i.e., a random graph's outcome) is treated as just a special case of random graph. An entropy measure is defined on random graphs, and the distance between two random graphs is defined as the increment in entropy that would result from merging the two (the minimal increment over all possible mergings). A random graph is synthesized from examples by repeatedly merging the two nearest graphs. This learning method seems to have been demonstrated only with 2-D recognition problems and clean, synthetic images. However, McArthur (1991) has extended the random graph formalism to allow continuous attributes, and used it to learn 3-D models from images acquired at known viewpoints. In comparison, our formalism incorporates continuous attributes more easily by using likelihoods rather than entropies, and our learning method does not require Chapter 3. Survey of Learning in Object Recognition 60 knowledge of viewpoint or feature correspondences. In Segen's system (1989) for recognizing 2-D shapes, a graph node represents a relation among shape features, which are oriented point features (see section 3.1). But instead of specifying a particular relation, a node in the model graph specifies a probability distribution over possible relations. For example, relations involving two features may include one that holds when the features are oriented away from each other, and another that holds when they are oriented in the same direction; a model node may specify that the first relation holds with probability 0.5, and the second, with probability 0.3. When a graph instance is matched with the model, the instance's probability can be computed by assessing the probability of each instance node according to the distribution of its corresponding model node. Recognition involves finding the most probable match. A model is learned incrementally from instances by matching it with each one successively; following each match the probabilities recorded in matching model nodes are adjusted and any unmatched instance nodes are added to the model. Whereas Segen's system reduces all measurements to global categories found by clustering, our method retains numeric measurements as attribute and position distributions that are learned individually for each model feature. As a result, we would expect our method to be capable of more precise discriminations among objects, and more appropriate generalizations. 3.4 Learning a Set of Characteristic Views In learning a model that is to be represented as a set of characteristic views, part of the task is to choose those views. This can be done by clustering training images, and choosing one characteristic view to represent each cluster. Although several researchers have clustered images rendered from CAD models and thus avoided the feature correspondence problem, Gros, Seibert and Waxman have clustered real images. Gros (1993) measures the similarity of an image pair as the proportion of matching shape features, whereas Seibert and Waxman (1992) use a vector clustering algorithm with fixed-length vectors encoding global appearance. Our method, in comparison, uses a clustering measure based on objective performance criteria (accuracy and Chapter 3. Survey of Learning in Object Recognition 61 efficiency), and an appearance representation less affected by occlusion. 3.5 Learning a Recognition Strategy Draper (1993) has considered how a system equipped with a variety of special-purpose representations and algorithms might learn strategies for employing those techniques to recognize specific objects. A typical recognition task would be to locate a tree by fitting a parabola to the top of its crown. For this task, an appropriate strategy is to segment the image, extract regions that are coloured and textured like foliage, group these into larger regions, smooth region boundaries, and fit parabolas to the boundaries. A human supplies training examples by pointing out the desired parabolas in a series of images. The system then evaluates various strategy choices (e.g., whether to smooth region boundaries) and parameter choices (e.g., the degree of smoothing) for cost and effectiveness. From these evaluations it constructs a strategy, including alternatives, for performing the task on new images. By learning what order to try alternative recognition methods, Draper's system differs from those that just select a single set of features useful for recognition. Difficulty remains in limiting the search among strategies and parameters to achieve acceptable performance. 3.6 Theoretical Learnability There have been some efforts to identify theoretical limits on what can be learned from images. These efforts are based on the analogy that learning to recognize an object from images is like learning a concept from examples. Valiant (1984) has provided a useful definition for characterizing the class of concepts that can be learned by a particular algorithm: Informally, a class is probably approximately correct (PAC) learnable by an algorithm if, with high probability, the algorithm learns a concept that correctly classifies a high proportion of examples using polynomially bounded resources (and, consequently, number of training examples); the bound is a polynomial of both the accuracy and some natural parameter of the concept class (e.g., vector length for concepts defined on a vector space). Significantly, the algorithm has no prior Chapter 3. Survey of Learning in Object Recognition 62 knowledge about the distribution of examples. Shvayster (1990) has shown that some classes of concepts defined on binary images are not PAC-learnable from positive examples by any algorithm. For example, suppose a template is said to match an image when every black pixel in the template corresponds to a black pixel in the image (although not necessarily vice versa). Then the concept consisting of all instances not matching some unknown template is not PAC-learnable. Shvayster speculates that some nonlearnable concepts may become learnable, however, if some prior knowledge about the distribution of examples is available. Edelman (1993) has argued that Shvayster's negative result is not applicable to object recognition because it uses an instance representation, the binary image, that is inappropriate. If instead instances are represented by vectors of point feature locations, Edelman shows, then recognition of an object can be learned from a polynomial number of positive examples. He concludes that model learning may be practical, provided an appropriate representation is chosen. Chapter 4 Representation Schemes This chapter begins a detailed presentation of our recognition learning method by specifying the three principal representation schemes used: those for images, object models, and viewpoint transformations. Images presented for training or for recognition must first be described in terms of local features of various types. Our approach is not reliant upon any particular set of these features, but it does require that the features be designed and represented according to certain principles. Section 4.1 discusses these principles and specifies how image features, regardless of type, are to be represented. In order to implement and evaluate our recognition learning method, however, we have also devised a specific set of features. These features and the techniques we use to detect them are described in section 4.2. An object is modeled by a series of views. The reasons for this choice and the implications of it are discussed in section 4.3. Also in that section, the model representation scheme is formally specified. The representation schemes for the positions of features and for viewpoint transformations must be closely related because viewpoint transformations are used to align feature positions during matching. Both are specified in section 4.4. 63 Chapter 4. Representation Schemes 4.1 64 Image Representation An image is represented in terms of discrete properties called features. Each has a particular type, a position within the image, and a vector of numeric attributes that further characterize it. A feature may, for example, be a segment of intensity edge, a particular arrangement of such segments, or (perhaps in a future implementation) a region of uniform colour or texture. Typically, low-level features will be found as the responses to feature detectors, such as edge or corner detectors, and additional, higher-level features will be found by grouping or abstracting the low-level ones. The features found in an image are represented by an image graph. Graph nodes represent features, while directed arcs represent grouping and abstraction relations among them. A typical image will be described by a graph containing hundreds or thousands of features of various types. The repertoire of features must be sufficient to provide a description of any relevant image properties, as distinctions among objects can only be made if there are features capable of expressing those distinctions. It is important, moreover, that the repertoire cover a wide range of properties and scales. As we shall see, large-scale features and features that are highly specific (e.g., those representing particular arrangements of intensity edges) can often provide useful starting points for recognition. However, small-scale features and commonplace features (e.g., individual segments of intensity edge) are needed to complete the repertoire so that it can represent a wide variety of objects and small distinctions among objects. The best features are those that can be detected reliably, are relatively invariant with respect to modest changes in viewpoint or lighting, and, like the feature groups discussed in section 2.3.1, are more likely to describe some portion of one object than an accidental conjunction of multiple objects. In the remainder of this section we shall specify how features are represented and describe additional principles governing the design of any features our recognition learning method is meant to use. These are the only requirements our method imposes on features. We shall not Chapter 4. Representation Schemes 65 prescribe specific features, as no one feature repertoire can be considered best for all applications. Nevertheless, we have implemented a specific repertoire in order to evaluate our method; we shall describe it in section 4.2. Most of the principles presented here are ideals that no feature detection method can achieve perfectly. In practice, features will inevitably be somewhat unreliable and "noisy", not entirely stable and dependable as we might wish. Consequently, our recognition learning method is designed to cope with features' inevitable shortcomings. The method's performance will benefit, however, to whatever extent these ideals can be achieved. 4.1.1 Feature positions A feature's position is represented by a point location in the image, an orientation, and a scale. This design is preferred because it corresponds well with the way viewpoint transformations are represented, as we shall explain in section 4.4. However, because feature types differ widely, there can be no single, specific convention governing how position parameters are defined in all cases. Instead, position parameters must be defined appropriately for each feature type. For example, the location of an edge segment junction might best be defined as the location of the junction's vertex, whereas the location of a region might be defined as the region's centroid. Ideally, the position of a feature describing some object would depend only on the placement of that object, and be unaffected by such things as partial occlusion of the feature or image quantization noise. Our goal, then, in designing a feature must be to define its position in such a way as to enhance stability and relevance. Consider, for example, a feature representing a junction of two straight lines. The junction's orientation could be defined as that of one of its lines (say, the clockwise one), but this would allow any error in estimating that line's orientation to wholly perturb the junction's orientation. A better choice (assuming independent errors) would be to define the junction's orientation as the average of the two lines' orientations, thereby obtaining a more stable measure of junction orientation. It may not be possible to define sufficiently stable measures of location, orientation, and Chapter 4. Representation Schemes 66 scale for all features. In some cases, there may be no meaningful definition available—as there is no way to define the orientation of a circle, for example. In other cases, any definition chosen will be sensitive to partial occlusion of the feature. For example, it is difficult to imagine a measure of line segment location or scale that is not sensitive to occlusion at either end of the segment. Because these kinds of limitations are inevitable for certain feature types, our learning and recognition methods are prepared to accommodate them by using only position components that are well defined and relatively stable, as we shall explain in section 5.1.3. 4.1.2 Feature attributes Features have numeric attributes augmenting the information provided by their types and positions. What attributes a feature has is determined by its type. For example, a feature representing a junction of edge segments may be given an attribute recording the junction's angle, whereas one representing a uniformly coloured region may have attributes describing the region's shape and colour. Attributes considerably enhance the descriptive power of features. So that features can be easily compared regardless of their image positions, each attribute must be measured in a way that renders its value independent of its feature's position. In other words, attributes must be invariant with respect to viewpoint transformations, which in our approach comprise translation, rotation, and scaling of features within the image. To accomplish this, an attribute measuring a distance, such as a region's width, is normalized by dividing it by its feature's overall scale (what Saund (1992) calls a scale-normalized measure). One measuring an orientation, such as the orientation of one part of a feature, is normalized by subtracting (modulo 27r) the feature's overall orientation. In designing features, we should try to include attributes that emphasize only relevant properties of appearance. For example, when objects are to be recognized under varying levels of illumination, an attribute that records intensity averaged over some region of an image will be of little use: it could assume any value for any object. However, an attribute that records the ratio of intensities of two adjacent regions could be quite useful: it would convey information Chapter 4. Representation Schemes 67 about contrast while remaining largely insensitive to illumination level. This goal of attribute relevance and stability, like other ideals of feature design, is often only partly attainable, and methods of learning and recognition must be prepared to accommodate "noisy" attributes. We sometimes face a choice in designing features to represent some continuum of possible appearances: Different regions of that continuum can be denoted by different types of features, or the entire continuum can be denoted by a single feature with attributes providing the additional specificity. For example, junctions can be represented using different types of features for acute, right, and obtuse angles, or with a single feature whose attribute specifies any angle. Because our recognition learning method generally consider features comparable only if they share the same type, the latter approach should be preferred as it introduces fewer arbitrary distinctions. Where a continuum must be subdivided into ranges denoted by different feature types—because different attributes are appropriate for the different ranges, for example—then stability can be improved by allowing some overlap among ranges and by using redundant features where overlaps occur, as was discussed in section 2.2.1. 4.1.3 Feature relations Features acquire a hierarchical organization as a result of some features being found as groupings or abstractions of others. Except for the lowest-level features at the bottom of the hierarchy, each feature has one or more parts; except for the highest-level features, each may be part of any number of groups. Grouping and abstraction relations provide constraints on how features are matched, although these constraints depend on the types of features involved. For some features, parts have distinctive and well-defined roles. The parts of a junction, for example, may be the line segments constituting its left and right arms, distinguishable from each other due to the junction's geometry; when two junctions are matched, their arms should also correspond. For other types of features, there may be a variable number of parts whose order and exact number may or may not be meaningful. For example, one might define a feature denoting a series of parallel Chapter 4. Representation Schemes 68 lines so that each line is one part; in matching features of this type, the approximate number and sequence of lines might be important, but matched features may be permitted slightly different numbers of lines. The matching procedure must know about these different forms of grouping relations so that appropriate constraints can be applied in each case. Here we define two forms. A template feature has a fixed number of parts, each distinguishable from the others; when two template features are matched, their respective parts should correspond directly. A sequence feature has a variable number of parts whose order alone is significant; when two sequence features are matched, any matching parts must be correctly ordered, but each may have additional, unmatched parts. Other forms of grouping relations could be defined also. 4.1.4 Image notation An image graph G is a tuple (F, R), where F is a set of image features and R is a relation over elements of F. A feature k € F is represented by a tuple of the form (tk, a ^ , bfc, Cfc). Feature /c's type is represented by tk, whose value is one of a set of symbols denoting different types of features. For example, in one possible repertoire of features these symbols might include EdgeSegment, Corner, and ConvexRegion. Featurefc'sattributes are represented by a vector of real numbers, afc, whose dimension and interpretation depend onfc'stype. Continuing our feature repertoire example, a Corner feature might have a single attribute representing a corner angle, whereas a ConvexRegion feature might have several attributes representing various shape measures. A feature's position in the image, with its associated uncertainty, is modeled by a multidimensional Gaussian distribution. The coordinate system in which this distribution is defined shall be described in section 4.4. Feature fc's position distribution is characterized by two parameters that are included in A;'s tuple: a mean vector bfc, and a covariance matrix Cfc. From featurefc'stype tk, one can determine whether k represents a grouping or abstraction of other features. If so then the relation R will contain a single element of the form (k, l\,..., l ). n 69 Chapter 4. Representation Schemes Table 4 . 1 : Implemented feature repertoire FEATURE Curve LJct CrankLR CrankRL Cap CapPair PCurves Region Ellipse TYPE REPRESENTED APPEARANCE straight, circular, or elliptical edge segment junction of two curve segments two junctions joined by a common curve segment two junctions joined by a common curve segment two junctions joined by a common curve segment three junctions joined by two curve segments pair of parallel curve segments convex region denned by curve segments nearly closed circle or ellipse This element specifies thatfc'sparts are the features l\ through /„. Because features can have varying numbers of parts, n may depend on tk or, even more specifically, on k. In our feature repertoire example, each Corner feature might have two parts and each ConvexRegion feature might have any number of parts. 4.2 Implemented Feature Repertoire OLIVER, the system we have implemented to test our recognition learning method, uses a reper- toire of features representing various geometric arrangements of intensity edges. Figures 4 . 7 through 4 . 1 0 illustrate the features it finds in a typical image, and table 4 . 1 summarizes their types. The lowest-level features are straight, circular and elliptical segments of intensity edges. Additional, higher-level features represent groupings of these, such as junctions, groups of adjacent junctions, pairs of parallel segments, and convex regions. These features have been designed according to the principles discussed in the previous section. OLIVER'S features are useful for representing the appearance of a wide range of objects while providing some invariance to illumination changes. Because the features are based on straight, circular, and elliptical edge segments, they are especially able to represent objects with edges and markings that are predominately straight and circular, including many manufactured Chapter 4. Representation Schemes 70 Figure 4.7: Feature repertoire illustration (part 1 of 4). This is the first of four figures illustrating the feature repertoire implemented to test our recognition learning method. Above: Features illustrated are derived from this intensity image. Below: Curve segments are found by fitting straight lines, circular arcs, and elliptical arcs to intensity edge pixels, which are drawn lightly in these figures. Figure 4.8: Feature repertoire illustration (part 2 of 4). Above: Junctions are found at curve segment intersections. Below: Pairs of junctions sharing common curve segments are grouped. Chapter 4. Representation Schemes 72 Figure 4.9: Feature repertoire illustration (part 3 of 4.). Above: Certain junction triples are found among groups of related junction pairs; each triple is depicted here by drawing curve segments joining the three junctions. Below: Pairs of nearby, parallel curve segments are grouped. Chapter 4. Representation Schemes 73 Figure 4.10: Feature repertoire illustration (part 4 of 4). Above: Convex regions are found by identifying perimeter chains of curves and junctions. Below: Nearly-closed circles and ellipses are found among circular and elliptical arcs. Chapter 4. Representation Schemes 74 objects. More than that, though, the features can represent an intensity edge of any shape by approximating it with a series of segments, and on this basis they can represent a much broader class of objects. But to effectively represent most objects, particularly natural ones, the small feature repertoire would have to be extended to include features that make explicit other important properties of appearance, such as colour, texture, symmetry, and repetition. Additional feature types are not disruptive because the recognition learning method will choose an appropriate subset for modeling each object. The features are found by a bottom-up, data-driven process of edge detection, segmentation, and grouping. The following paragraphs describe each step of this process. Readers interested only in the general recognition learning method, not the features used in our specific implementation, can skip these paragraphs. Most groups are found by exhaustively testing combinations of parts, although better performance and more reliable features could be obtained by adopting such grouping efficiencies and other improvements as are referenced in section 2.3.1. Moreover, most grouping decisions involve thresholds whose values have simply been chosen by hand to achieve acceptable results. We do not claim that these methods and threshold values are optimal. Indeed, to the extent that they yield somewhat-unreliable features, the tests of our recognition learning method will be appropriately and realistically challenging. So although we document our choices of grouping methods and threshold values here, we do not substantiate or otherwise emphasize those choices. Proper analysis of such grouping methods is beyond the scope of this dissertation. 4.2.1 E d g e detection and curve segmentation Canny's method (Canny 1986) is used to identify image pixels that represent intensity edges. These edgels are then grouped to form continuous curve segments, each representing a portion of edge that is fit reasonably well by a straight line, circular arc, or elliptical arc. We have chosen to use these curve segments as features because they can describe many edges well, they can be detected in a stable manner by the method to be described here, and they represent Chapter 4. Representation Schemes 75 Figure 4.11: Examples of straight lines, circular arcs, and elliptical arcs found by grouping edgels. edges at a level of abstraction that makes them useful for forming higher-level features such as junctions and convex regions. Figure 4.11 illustrates the result of grouping edgels found in a small region of an image. Note that not all edgels are necessarily assigned to curve segments, and some curve segments may share edgels. This is because the method does not insist on producing a complete, optimal, and exclusive grouping of all edgels. Instead, it seeks a grouping in which a minimum number of curve segments account for a maximum number of edgels while fitting those edgels as well as possible. We have found this form of grouping to be more stable than one that must account for every edgel exactly once. Grouping edgels to form curve segments Edgels are grouped to form curve segments according to a method devised by Leonardis (1993). Grouping occurs in two phases. In the first phase, a large number of curve segment candidates Chapter 4. Representation Schemes 76 are recovered by subdividing the image into an array of "seed" windows and then "growing" curve segments, starting separately from the edgels contained in each window. The group of edgels in a seed window are initially fit by a straight line segment using least squares estimation. If an acceptable fit having a sufficiently low error residual is obtained, the group is enlarged by seeking additional edgels lying near the endpoints of the line segment. This fit-and-extend process is repeated as long as acceptable fits and additional edgels are found. Once the line fit error becomes too large or additional edgels cannot be found near the line endpoints, the process switches to circular arcs, which are fit using a least squares formulation due to Thomas and Chan (1989). Finally, when circular arcs no longer provide good fits, general conies are fit by solving a generalized eigenvalue problem, as suggested by Taubin (1991); of the various forms of conic obtained by this method, only single lines, circular arcs, and elliptical arcs are retained. Among all the acceptable fits obtained during the fit-and-extend process begun from one seed window, the largest line segment, circular arc, and elliptical arc are recorded as candidates. In the second phase, some of the candidates are selected to become features while the rest are discarded. Initially, all candidates are considered unselected. Selection is then performed by evaluating each unselected candidate, selecting that scoring highest, and repeating this process until no remaining candidate achieves a positive score. The measure used to score candidates is derived from the minimum description length principle so that it favours candidates that group many edgels, fit their edgels well, and require few parameters (i.e., preferring straight lines over circular arcs, and circular arcs over elliptical arcs). The measure also considers pairwise interactions among candidates so that a candidate's score is lowered if some of the edgels it groups have already been grouped by another candidate already selected. This edgel grouping method has several qualities that make it well suited for our needs. It produces curve segments that bridge the small gaps often found in edges detected under near-threshold conditions. It produces a mix of straight, circular, and elliptical segments while choosing the (nearly) simplest combination adequate in each situation. Weights of the various factors used in scoring candidates can be adjusted to achieve a desired balance between coarse Chapter 4. Representation Schemes 77 descriptions with few segments, and accurate ones that may break complex curves into many segments. And although our serial implementation requires one to two minutes of CPU time (on a current-generation workstation) to process a complex image, a high degree of parallelism could be achieved with appropriate hardware by processing each seed window concurrently. Other curve segmentation methods tried Another family of curve segmentation approaches we tried proved less satisfactory. Edgels were first linked to form extended chains by linking each edgel with its neighbours; small gaps in a chain were bridged by searching beyond an edgel's immediate neighbours to more distant ones in the direction of the estimated edge tangent. We then tried various ways of segmenting these chains, including Lowe's method (1987a), which segments a chain into approximating straight line segments, and an extension of that method, by Rosin and West (1989), which first segments the chain into straight line segments, and then replaces some of those lines with circular arcs. These and similar methods that segment edgel chains are faster than Leonardis's method. However, they are less stable, apparently for two reasons. Because they assign each edgel to exactly one segment, the chain segmenting methods have much less latitude in choosing which edgels to include in each segment. And because they bridge gaps at the time they form a chain, the chain segmenting methods cannot, in making bridging decisions, use knowledge of the curve segment that will ultimately be fit to that portion of the chain. Consequently, the chain segmenting methods produce less reliable segmentations, particularly for edges that are closely spaced, contain small gaps, or cannot be decomposed cleanly into extended arcs and straight lines. Curve segment representation A curve segment feature, or Curve, is modeled by a straight line segment, circular arc, or elliptical arc. These are defined parametrically as specified by table 4.2. A Curve's location is defined as the location of the curve segment's midpoint. Its orientation, Chapter 4. Representation Schemes 78 Table 4.2: Definitions of various forms of Curve feature ' CURVE TYPE PARAMETRIC — X straight line XQ yo y circular arc X XQ = . y. elliptical arc FORM . y°. X XQ y y • 0 + cosf — smtf [0 sin# + - + cosf? s] cos 6 — sin 9 sin# cos 9 r cos s r sm s cos 9 — sin 9 sin 6 cos 9 a cos s- b sins ] Note: The point [x y] sweeps out the curve segment as the arc length parameter s ranges over [so,si]. [x y ] and 9 determine the segment's location and orientation in the image; r is the radius of a circular arc; a and b are the lengths of the major and minor axes of an elliptical arc. 0 Q if it is a circular or elliptical arc, is the direction of the curve normal at the midpoint, directed away from the arc's centre. The orientation of a straight line is the direction of either of the line's two normals, chosen arbitrarily. The feature's scale is the segment's arc length. A Curve has attributes so, s i , x , y , and 9 (defined in table 4.2). A circular arc feature Q Q has the additional attribute r, and an elliptical arc feature has the additional attributes a and b. Curves are matched by a special procedure for reasons we shall explain in section 5.3. Consequently, unlike other features, Curves do not have their attributes expressed in positioninvariant terms. 4.2.2 L-junction features An LJct is a template feature representing one of the angles enclosed by an intersection of a pair of Curves. Where two Curves cross, as many as four LJcts are produced, one for each valid combination of the ordering and traversal directions of the two Curves (see figure 4.12(a)). By defining an LJct to be a component of an intersection rather than an entire intersection, we simplify subsequent grouping processes. Chapter 4. Representation Schemes (a) (b) 79 (c) Figure 4.12: LJct feature. An LJct represents an angle enclosed by a curve segment intersection, (a) Up to four may describe one intersection, (b) Each LJct must satisfy various tests described in the text, (c) An LJct is assigned a location and orientation as denoted by the circle and arrow. Points of intersection between pairs of conies defined by Curves are located using algebraic methods (Smith 1906, pp. 253-4). Each angle enclosed by each intersection is tested, and an LJct is produced if the following conditions are met (see figure 4.12(b)): • the angle a enclosed by the junction is in the range | ± 1.3 radians; • the point p at which the Curves intersect—or would intersect if extended—is no more than 8 pixels from either Curve; • point p is no more than 0.1s beyond the end of either Curve, where s is that Curve's scale; • each "arm" of the junction extends for at least 0.2s along its respective Curve; • the scales of the two Curves differ by a factor of at most 10; and • each Curve's radius of curvature, measured at p, is at least 2.5 pixels. An LJct has a single attribute recording the angle enclosed by the junction. Its location is defined as the point of intersection (see figure 4.12(c)), its orientation bisects the enclosed angle, and its scale is the sum of the scales of the two Curves. An LJct is considered to have a stable location and orientation; its scale, however, is not used in matching as it is sensitive to occlusion. Chapter 4. Representation Schemes 80 CrankRL (a) (b) (c) Figure 4.13: CrankLR, CrankRL, and Cap features. These three features represent the various ways that a pair of LJcts can share a common Curve. (The three Curves constituting each feature may be of any type, not just the combinations of straight lines and circular arcs shown here.) Each feature has a location and orientation as shown, and a scale equal to the distance between the two LJcts. 4.2.3 Pairs and triples of L-junctions Three template features shown in figure 4.13 represent pairs of LJcts that share common Curves. The LJcts must be at least 0.2s apart along the Curve, where s is the Curve's scale. A junction pair feature has two attributes for recording the angles of its two LJcts. It's location, orientation, and scale are denned by a directed line segment joining the two LJcts' locations, and all are considered stable. For a CrankLR or CrankRL, there are two ways to order the two LJcts; one is chosen arbitrarily, and the resulting ambiguity is accommodated during matching. A CapPair (see figure 4.14(a)) is a template feature representing two Caps sharing a common LJct, or, equivalently, three LJcts. Its two attributes are the angle of the central junction and the logarithm of the ratio of the two arc lengths delimited by the three LJcts. Its location and orientation are those of the central junction, and its scale is the arc length between the first and last LJcts; all are considered stable. 4.2.4 P a i r of parallel curve segments A PCurves (seefigure4.15) is a template feature grouping two Curves that are approximately parallel and have few edgels interposed. The following conditions must be met: • the distance between the Curves, d, is at most 0.8 times the length of their overlap, I; Chapter 4. Representation Schemes (a) 81 (b) (c) Figure 4.14: CapPair, Region, and Ellipse features, (a) A CapPair represents a pair of Caps sharing a common, central LJct. (b) A Region represents a closed region whose perimeter is composed of LJcts and Curves, (c) An Ellipse represents a circular or elliptical arc denning a nearly closed circle or ellipse. In each case, a circle denotes the feature's location, an arrow denotes its orientation, and a dotted line denotes its scale. Figure 4.15: PCurves feature. A PCurves represents two nearby, nearly parallel Curves, (a) The pair must satisfy various tests described in the text, (b) A PCurves is assigned a location, orientation, and scale as illustrated. • the angle between the Curves, a, which for arcs is approximated using their minimum and maximum separation distances, is at most 0.15 radians; • at least 30 percent of each Curve must be overlapped by the other; • a fraction of at least 60 percent of one Curve must be overlapped by the other; and • at most 0.8/ edgels lie between the Curves. A PCurves has no attributes. Its location is the midpoint of the overlap region, its orientation bisects the tangents of the two Curves at the midpoint, and its scale is the two Curves' separation distance at the midpoint. There are two ways to order the two Curves; one is chosen arbitrarily, Chapter 4. Representation Schemes 82 and the resulting ambiguity is accommodated during matching. The orientation and scale are considered stable; the location is not because it is sensitive to any occlusion of either Curve. 4.2.5 C o n v e x regions A Region (see figure 4.14(b)) is a sequence feature grouping LJcts and Curves that form a closed, approximately-convex perimeter. Suitable perimeters are found by searching from each LJct for a counterclockwise sequence of alternating Curves and LJcts such that: • each side of the region is at least 5 pixels long; • each junction defines an angle interior to the region of at least 0.5 radians; and • no junction or curved side turns clockwise (outward from the region) by more than 0.1 radians. The search is ordered so that only the smallest region containing any particular LJct is identified. A Region's attributes and position are computed from the region's principal axes of inertia (Ballard and Brown 1982, p. 255). Its location is the region's centroid, its orientation is the direction of the region's major axis, and its scale is the length of that axis; all are considered stable, although the orientation has a two-way ambiguity that must be accommodated during matching. Its two attributes are measures of the region's elongation (the logarithm of the ratio of its length to width) and compactness (perimeter /area). 2 An Ellipse (see figure 4.14(c)) abstracts a single Curve that approximates a closed circle or ellipse. The Curve must be a circular or elliptical arc spanning an angle of at least IT radians and having an eccentricity in the range [0.2, 5] (to eliminate highly eccentric arcs sometimes fit to essentially linear bands of edgels). The Ellipse's location is the arc's centre (not midpoint); its orientation is undefined; its scale is the radius, r, of a circular arc, or an "average radius", Vab, of an elliptical one. An Ellipse has no attributes. Chapter 4. Representation Schemes 4.3 83 M o d e l Representation An object model is organized on two levels so that it can describe the object's range of appearance both fully and accurately. At one level, large variations in appearance are handled by subdividing the entire range of variation into discrete subranges corresponding to distinctly different views of the object; this is a multiple-view representation. At a second level, within each of the independent views, smaller variations are described by probability distributions that characterize the presence, position, and attributes of individual features. In effect, a model represents a probability distribution over image graphs, assigning highest probability to those image graphs that best depict the object. The only form of appearance variation not represented by the model is that due to the varying position of the object within the image plane. We presume that the object may appear anywhere in the image, at any orientation and any of a range of scales. Two mechanisms accommodate this positional variation. One is the viewpoint transformation, which aligns a model view with an appropriate region of the image; we shall describe it in section 4.4. The other is the use of position-invariant representations for attributes, which allow features' attributes to be compared regardless of the features' positions; these representations were described in section 4.1.2. All remaining forms of appearance variation—concerning what features are present, their relative positions, and their attributes—are represented by the model itself. 4.3.1 Implications of using a multiple-view representation Multiple-view and 3-D representations are the two principal methods of modeling a 3-D object for the purpose of recognition. Section 2.2.2 discussed these alternatives and their implications for recognition. Briefly, a multiple-view representation avoids the need to model image formation, thus allowing the use of efficient 2-D/2-D matching methods and a wide variety of appearance features; both factors permit the use of more complex objects and models. On the other hand, a multiple-view representation requires more storage and many views to represent most objects. To address this disadvantage, extend the range representable by each model Chapter 4. Representation Schemes 84 view, and reduce the number of views required, our method uses probability distributions to characterize features, as we shall describe in the following sections. When models are to be learned from training images, the choice between multiple-view and 3-D representations entails two additional implications, both favouring a multiple-view representation. First, that choice avoids the problem of recovering 3-D structure from training images; recovering structure would be difficult because the viewpoint of each image is unknown, there is uncertainty in the measurement of each feature's image location, and there may be errors in matching features among images. Secondly, multiple views can represent not only viewpoint variation but also variation in object shape and variation among different members of an object class. In constructing a multiple-view model there is no need to distinguish these different sources of variation. A 3-D representation, on the other hand, must invoke different mechanisms for different kinds of variation: whereas viewpoint variation is handled by explicitly representing 3-D structure and modeling image formation, other forms of variation must be handled by other means (e.g., parameterized object models or alternate models). Learning a 3D representation is correspondingly more difficult because it requires distinguishing viewpoint variation from other forms of variation. 4.3.2 Simplifying approximation of feature independence Our method lets each model view describe a range of possible appearances by having it define a joint probability distribution over image graphs. However, because the space of image graphs is enormous, it is not practical to represent or learn this distribution in its most general form. So instead, the joint distribution is approximated by treating its component features as though they were independent. This approximation allows the joint distribution to be decomposed into a product of marginal distributions, thereby greatly simplifying the representation, matching, and learning of models. One consequence of this simplification is that statistical dependence (association or covariance) among model features cannot be accurately represented within a single model view. Chapter 4. Representation Schemes 85 Consider, for example, an object whose features are divided among two groups, only one of which appears in any instance. With its strongly covariant features, this object would be poorly represented by a single view due to the simplifying approximation of feature independence. However, where one view cannot capture an important statistical dependence, multiple views can. In this example, two model views, each containing one of the two subsets of features, could represent perfectly the statistical dependence among them. Indeed, by using a large enough set of views we can model any object as accurately as we wish. For economy, however, we would prefer to use relatively few views and let each represent a moderate range of possible appearances. The model learning procedure described in chapter 6 offers one method of balancing the competing aims of accuracy and economy. 4.3.3 M o d e l view representation A single model view is represented by a model graph. Like an image graph, a model graph has nodes that represent features, and arcs that represent composition and abstraction relations among features. Each node records the information needed to estimate three probability distributions characterizing its feature (see table 4.3): 1. The probability of observing this feature in an image depicting the modeled view of the object. The node records the number of times the model feature has been identified in training images by being paired with a similar image feature. A count is also kept of the training images used to learn the overall model graph. The probability is estimated from these two numbers as described in section 5.1.1. 2. Given that this feature is observed, the probability of it having particular attribute values. This is characterized by a probability distribution over vectors of attribute values. Little can be assumed about the form of this distribution because it may depend on many factors: the type of feature, how its attributes are measured, possible deformations of the object, and various sources of measurement error. Thus we use a non-parametric density estimator that makes relatively few assumptions. To support this estimator, which is Chapter 4. Representation Schemes 86 Table 4.3: Probability distributions estimated from model feature information M O D E L F E A T U R E INFORMATION ESTIMATED PROBABILITY DISTRIBUTION Number of times model feature j has been paired with training image features. Probability of observing j in an image of the object. For example, Series of attribute vector samples. Probability that j, if observed, will have particular attribute values. For example, Series of position vector samples. Probability that j, if observed, will be at a particular position. For example, P(observed) = 0.6. Note: Column 1 lists the information associated with a model feature j. Column 2 describes the probability distribution estimated from that information. Chapter 4. Representation Schemes 87 described in section 5.1.4, the model graph node records sample attribute vectors acquired from training images. 3. Given that this feature is observed, the probability of it having a particular position. This is characterized by a probability distribution over feature positions. We approximate this distribution as Gaussian to allow use of an efficient matching procedure based on least squares estimation. The parameters of the distribution are estimated from sample feature positions acquired from training images. 4.3.4 M o d e l notation An object's appearance is modeled by a set of model graphs {Gi}. A model graph Gi is a tuple (JP, R, m), where F is a set of model features, R is a relation over elements of F, and m is the number of training images used to produce Gi. A model feature j € F is represented by a tuple of the form (tj,mj, Aj, Bj). Feature j's type is represented by tj, whose value is one of a set of symbols denoting different types of features (see section 4.1.4). The element rrij specifies in how many of the m training images feature j was found. The series Aj contains the attribute vectors of those training image features that matched j. The dimension and interpretation of these vectors depend on j's type. The series Bj contains the mean positions of the training image features that matched j . These positions, although drawn from separate training images, are expressed in a single, common coordinate system, which is described in the following section. From j's type tj, one can determine whether j is a feature that represents a grouping or abstraction of other features. If so then R will contain a single element, (k, l\,..., l ), specifying n j's parts as being l\ through l . The number of parts n may depend on j . Moreover, any ^ n may be the special symbol J_, which indicates that the part is not defined, and perhaps not represented in the model graph. Chapter 4. Representation Schemes 4.4 88 C o o r d i n a t e Systems and V i e w p o i n t Transformations A feature's position is specified by a 2-D location, orientation, and scale. Image features are located in an image coordinate system of pixel rows and columns. Model features are located in a model coordinate system shared by all features within a model graph. The absolute positions of these coordinate systems are not important as they are used only to measure features' relative positions. Two different schemes are used to describe a feature's position in either coordinate system: xyOs The feature's location is specified by [x y], its orientation by 6, and its scale by s. xyuv The feature's location is specified by [xy], and its orientation and scale are represented by the direction and length of the 2-D vector [u v]. We shall prefer the xyOs scheme for measuring feature positions, and the xyuv scheme for aligning features in the course of matching a model with an image. They are related by 9 = ian (v/u) and s = Vu + v . Where it is not otherwise clear we shall indicate which scheme _1 2 2 we are using with the superscripts x y 6 s and v . x uv The task of matching a model with an image includes that of determining a viewpoint transformation that closely aligns image features with model features or vice versa. In our method, this transformation is used to account for certain forms of appearance variation, such as displacement of the object within the image plane, while other forms of variation are represented by the object model itself. The viewpoint transformation, T, is a mapping from 2-D image coordinates to 2-D model coordinates—it transforms the position of an image feature to align it with that of a model feature. The result of applying T to the position b^ is denoted T(b^). We shall now describe two classes of viewpoint transformation. Both allow the transformation to be expressed as a linear operation with the advantage that it can then be estimated from a set of feature pairings by solving a system of linear equations. Chapter 4. Representation Schemes 4.4.1 89 Similarity transformations A 2-D similarity transformation can account for translation, rotation, and scaling of an object's projected image; it does not account for effects of rotation in depth, nor changes in perspective as an object moves closer to, or farther from, the camera. A 2-D similarity transformation decomposed into a rotation by 9t, a scaling by st, and a translation by [x yt], in that order, can be expressed as a linear operation using the xyuv t scheme, as Ayache and Faugeras (1986), among others, have done. The linear operation has two formulations in terms of matrices. We shall present both formulations here, and have occasion to use both in chapter 5. We shall develop the formulations by first considering the transformation of a point location from [xk yk] to [x' y' ]. We can write it as k k 1 k x = cos &t — sm 9 Xk sin 9 Vk t s t cos 9 t t (4.1) + Defining u = s cos 9t and v = s sin 9 allows us to rewrite this as either t t t t t -vt _ Xk v Xt + (4.2) Vk t or 1 0 x -yk 0 1 yk x k k x t (4-3) y J t u t v t Now consider a vector [iifc'Ufc] whose direction represents.an orientation and whose magnitude represents a length. When mapped by the same transformation, this vector must be rotated by 9t and scaled by st to preserve its meaning. Continuing to use u = St cos 9 and v = t we can write the transformation of [u k t t s sm9 , t as either u -vUt t 111. u Vt u t Vk t k (4-4) t Chapter 4. Representation Schemes 90 or 0 0 u x 0 0 v yt k (4.5) t k Equations 4.3 and 4.5 together give us one complete formulation of the transformation. We can write it with a matrix representing the position = [xk y Uk Vk] being transformed, k and a vector t representing the transformation: 'A I 0 Xk ~Vk x y' 0 1 Vk Xk yt < 0 0 Uk ~Vk u 0 0 Vk Uk v k t (4.6) = Afct. t t Equations 4.2 and 4.4 together give us another complete formulation. We can write it with a matrix A representing the rotation and scaling components of the transformation, and a t vector x representing the translation components: 4 K= 4.4.2 y'k < — u t -v t 0 0 x x v t u t 0 0 Vk 0 0 u t -v t Uk 0 0 0 v t u t v 0 t k k + yt Ab t fc + x. t (4-7). Affine transformations The 2-D affine transformations are a superset of the 2-D similarity transformations, able to account for one additional effect: the apparent shearing that accompanies rotation in depth of a planar surface. Affine transformations model perfectly the apparent deformations of a planar object viewed under orthographic projection (Huttenlocher 1988); for nonplanar objects viewed under perspective projection, they provide an approximation that may be poor. Chapter 4. Representation Schemes 91 Whereas a 2-D similarity transformation preserves angles, a 2-D affine transformation does not. Consequently, 2-D affine transformations cannot be used to align feature positions that have been defined with any recourse to angle measurement. Only collinearity and parallelism, which affine transformations do preserve, can be used to define feature orientation. For example, an angle's bisector does not necessarily remain its bisector after being transformed by an affine transformation. Thus we cannot define a junction feature's orientation as that of the junction's bisector, and still use affine transformations to align junction features. Our only choice is to define the junction feature's orientation as the direction of one of the junction's arms. Provided feature positions are appropriately defined, however, a 2-D affine transformation can be expressed as a linear operation using the sented by the six parameters x ,yt,a ,bt,Ct, t x y u v scheme. The transformation is repre- and d , where [xt t t denotes a translation and the yt] remaining parameters encode a rotation, scaling, and shearing as the entries of a non-singular 2x2 matrix. The affine transformation's two matrix formulations are: 4 I 0 Xk Vk 0 0 x y' 0 1 0 0 Vk Xk yt < 0 0 Uk Vk 0 0 at 0 0 0 0 Vk k b' fc = t = Afct. b t c t d t and bt 0 0 Xk c dt 0 0 yk 0 0 at bt Uk o 0 0 c d Vk o at bl y'k v'k t t t x t + yt = Ab t k +x . t Chapter 4. Representation Schemes 4.4.3 92 Summary Given this choice between similarity and affine transformations, we have elected to use similarity transformations because they impose fewer restrictions on how feature positions are denned and they engender simpler mathematics. Affine transformations would allow certain planar objects to be modeled by somewhat fewer views, but since few objects are planar, we do not consider this a compelling advantage. Because it can be expressed as a linear operation, the viewpoint transformation can be estimated easily from a set of feature pairings. Given a model feature at hj and an image feature at bfc, the transformation aligning the two features can be obtained as the solution to the system of linear equations hj = T(b^). With additional feature pairings, the problem of estimating the transformation becomes over-constrained; then the solution that is optimal in the least squares sense can be found by least squares estimation. We shall describe a solution method in the next chapter. Chapter 5 Matching and Recognition Methods Recognition requires finding a consistent set of pairings between some model features and some image features, plus a viewpoint transformation that brings the paired features into close correspondence. The pairings and transformation together shall henceforth be called a match. The pairings will often be incomplete, with some image features not explained by the model (perhaps there are other objects in the scene) and some model features not found in the image (perhaps due to shadows or occlusion). Nevertheless, the desired match should be a good one that jointly maximizes both the number of features paired and the resemblance among paired features. These qualities are evaluated by a match quality measure described in section 5.1. Identifying good matches requires searching among many possible combinations of pairings and transformations. Although features' positions, attributes, and relations provide constraints for narrowing this search, a complete search is still impractical. Instead the goal is to order the search so that it is likely to find good matches sooner rather than later, stopping when an adequate match has been found or when many of the most likely candidates have been examined. Information about feature uncertainty can help by determining which model features to search for first, over what size image neighbourhoods to search for them, and how much to allow each to influence an estimate of the viewpoint transformation. A matching procedure that operates this way is described in section 5.2. The match quality measure and matching procedure compare features on the basis of their types, positions, and attributes. However, features representing curve segments fit to intensity edges cannot be compared so easily—similar curves may have quite dissimilar parameters, for 93 Chapter 5. Matching and Recognition Methods 94 example. Thus curve segments are handled specially by the matching procedure, as described in section 5.3. There is more to recognizing an object than simply identifying a match between model and image features. Matching must be preceded by an indexing step that chooses, from a potentially large database of model views, some selections that ought to be tried. Although indexing is beyond the scope of this dissertation, we note that our representation schemes and matching procedure appear to be compatible with common indexing approaches: attributes of high-level image features can serve as index keys, and the entries retrieved from the index can provide starting points for the matching procedure. Matching must be followed by a verification step that decides whether it can be concluded, from match evidence, that an object is present. Depending on the nature of the application, this decision might consider such factors as how much of the object's appearance has been matched, and how well the matched features correspond. Because the match quality measure confounds these and other factors in a single measure, matches cannot usually be accepted or rejected simply on the basis of their match quality measures. Section 5.4 describes an alternate measure we have found useful for verifying matches during learning, when it can be presumed that the object is present unoccluded in the scene. 5.1 M a t c h Quality Measure A match is a set of pairings between some image features and some model features, plus a viewpoint transformation closely aligning paired features. Image features are nodes of an image graph describing an image's content, model features are nodes of a model graph describing one model view, and the viewpoint transformation is a 2-D similarity transformation. A match is evaluated by a match quality measure that considers what features are paired, how significant those features are, how similar their attribute values are, and how well their positions correspond. These factor are judged according to past matching experience as recorded by the model, and combined using Bayesian theory to estimate the probability that a particular Chapter 5. Matching and Recognition Methods 95 match represents a true occurrence of the object in the image. The match quality measure is associated with this probability. Pairings are represented by the ordered set E = (ei, e 2 , . . . ) , where ej — k if model feature j is paired with image feature k, and ej =J_ if it is not paired. H denotes the hypothesis that the object, as represented by the model view, is present in the image. The probability that this hypothesis is correct given a set of pairings E and a viewpoint transformation T can be written m\E,T) = ^ 1 ^ 1 % . (5.8) Because there is no practical way to represent the high-dimensional, joint probability distributions P ^ | T,H) and P ( £ A T) in their most general form, we approximate them using the feature independence simplification discussed previously (section 4.3.2). This reduces equation 5.8 to a product of marginal probability distributions: The approximation is a perfect one provided two independence properties hold: (a) {ei} is collectively independent given knowledge of T and H, and (b) {ei,T} is collectively independent in the absence of any knowledge of H. In practice we can expect these properties to hold at least somewhat. Given that an object is present at a particular pose, features detected at widely separate locations on the object will be independently affected by occlusion and noise; such features would satisfy property (a). And in a random scene cluttered with unknown objects, even nearby features may be largely independent because they could come from any of numerous objects; such features would satisfy property (b). On the other hand, the independence properties fail to the extent that there is any redundancy or covariation among features. Consider a feature representing a perceptually-significant grouping, which provides redundant information about other features—its parts—by asserting Chapter 5. Matching and Recognition Methods 96 something about their relative positions. Equation 5.9 will overstate the significance of pairing these features because it separately counts both the grouping feature and its individual part features. A matching procedure seeking to maximize equation 5.9 may thus favour pairings of those features at the expense of pairings of others that are not grouped. The independence properties are also violated in the way that the detectability, positions, and attributes of whole collections of features vary systematically with shifts in viewpoint, and in the way that occlusions tend to affect whole collections of nearby features rather than isolated ones. For our present purposes, however, equation 5.9 provides a suitable approximation if its use in a match quality measure results in appropriate matches being found. Experiments, some of which are described in chapter 7, indicate that this is indeed the case. In other words, the biases that result from ignoring dependencies among features seem to have little effect on the outcome of any particular matching problem. Consequently, we have adopted equation 5.9 as the basis for our match quality measure. We define the measure using log-probabilities to simplify calculations. Moreover, we assume that all positions of a modeled object within an image are equally likely, and thus P(T | H) = P(T). Invoking these simplifications, we define the match quality measure as (5.10) P(H), the prior probability that the object is present in the form represented by the model view, can be estimated from the proportion of training images used to construct that model view. The remaining terms require estimates of the conditional and prior probabilities of individual feature pairings, P(ej | T,H) and P(ej). Our methods of obtaining these estimates are described in the next two subsections, where we use the following notation to denote specific random events within the universe of matching outcomes: e~j = k is the event that model feature j is paired with image feature k e~j =_L is the event that model feature j is paired with nothing aj = a is the event that model feature j is paired with a feature whose attributes are a 97 Chapter 5. Matching and Recognition Methods b j = b is the event that model feature j is paired with a feature at position b 5.1.1 C o n d i t i o n a l probability of a feature pairing There are two cases to consider in estimating P(ej | T,H), the conditional probability of a pairing involving model feature j. ij =J- When j is not paired, the conditional probability is estimated by considering how often j also failed to be paired with image features during training. We use a Bayesian estimator, a uniform prior, and the statistics m and m,j recorded by the model: = P(ej=±\T,H) 1 - P ( e ^ ± | T,tf) 1- ^ - t l . = (5.11) gj = k When j is paired with image feature k, the conditional probability is estimated by considering how often j was paired with image features during training, and how those training features compare with k. We assume that the distribution of a feature's attributes does not depend on its position in the image, nor on the 2-D translation, rotation, and scaling of the object in the image. This allows us to adopt the simplification that P(a, = a | b , = T(b ),ij fc = P(a,- = a | ij ^±iT,H) k fc #±,#), which, in turn, allows us to write the desired estimate as a product of marginal probabilities that we can readily estimate: P(ej = k\ T,H) = P(e,-#J_ A a j = a = P(a, = a f c A hj f c f c f c P ( a , = a * | ej P(e ^±\T,H). 3 k | bj = T ( b ) , e P(bj - T(b ) | = =T(b ) 3 lj ±L,T,H) | T,H) T, H) P(e i T,H) H) P ( b j = T ( b ) | e, #J_, T, H) f c (5.12) Chapter 5. Matching and Recognition Methods 98 P(a.j = afc | ej ^A-,H) is estimated using the series of attribute vectors Aj recorded with model feature j, and a non-parametric density estimator we shall describe in section 5.1.4. Estimation of P(bj = T(bfc) | e~j =fi±.,T,H), the probability that model feature j is paired with an image feature at position bfc under viewpoint transformation T, shall be described in section 5.1.3. P(ej ^-L| T,H) is estimated as specified by equation 5.II. 5.1.2 1 P r i o r probability of a feature pairing Estimates of the prior probabilities, P(ej), are based on measurements of a collection of images typical of those in which the object will be sought. This milieu collection is used to estimate "background" probability distributions that describe what sorts of features are found regardless of whether any particular object is present. In other words, these distributions describe what can be expected in the absence of any knowledge of H or T. By an analysis similar to that underlying estimates of the conditional probabilities, we obtain estimates for two cases of ej. e-j = J. The probability of j remaining unpaired regardless of H and T is P(ij =±) = 1 - P(ej #JL). (5.13) The latter term is estimated from the frequency with which features of j's type, tj, occur in the milieu collection. ej = k The probability of j being paired with k regardless of H and T is P(ej = k) = = Pfe^lAa^atAbj-T^)) P( a j = a | ej # ± ) P(bj = T(b ) | ej #±) fc P(e; /J-)- fc (5.14) For simplicity, our notation does not distinguish probability mass and probability density. P(ej) is a mass because ij assumes discrete values, whereas P(a,) and P(bj) are densities because ay and hj are continuous. But because equation 5.9 divides each conditional probability mass by a prior probability mass, and each conditional probability density by a prior probability density, here we can safely neglect the distinction. x Chapter 5. Matching and Recognition Methods 99 P(a.j = at | e~j ^J_) is estimated using samples of attribute vectors drawn from the milieu collection, and the density estimator described in section 5.1.4. P(bj = T(bfc) | e~j is a constant estimated by assuming a uniform distribution of features throughout a bounded region of model coordinate space. 5.1.3 P r o b a b i l i t y distribution over feature positions The probability that a model feature and an image feature are paired depends, in part, on their positions and on the viewpoint transformation used to align them. A P(bj = T(bjt) | ej term in the match quality measure represents this dependency. To estimate the ^z±.,T,H) term, we transform the image feature's position into model coordinates, and there compare it with the model feature's position (see figure 5.16). The positions and transformation are each characterized by Gaussian probability density functions (pdfs) so that the estimate may consider not just their expected values, but also the uncertainty of each. This section describes the estimation method in detail. Characterizing the image feature position The position of the image feature k is originally represented by a Gaussian pdf in xyOs image coordinates with mean b^ y6s = [xk yk @k «fc] and covariance matrix C . x ySs k However, many feature detectors and grouping processes can provide only a point (or best) estimate of a feature's position, not a distribution. Given point estimates of location, orientation, and scale, [x Vk 9 Sk], we extrapolate a mean and covariance matrix by k k Xk ixyOs k (5.15) Vk &k h 0 0 0 0 *? 0 0 0 0 0 0 0 0 Chapter 5. Matching and Recognition Methods Viewpoint d transformation 100 Transformation s p a c e Model feature position pdf Image feature position pdf Image feature position pdf Model coordinates Image coordinates Figure 5.16: Comparison of image and model feature positions. An image feature's position is transformed from image coordinates (left) to model coordinates (right) according to an estimate of the viewpoint transformation (centre). A model feature's position is estimated in model coordinates (right). Uncertainty in the positions and the transformation are characterized by Gaussian distributions. Overlap of the two distributions in model coordinates corresponds to the probability that the two features match given the viewpoint transformation and their respective positions. where o~g — &o/§k, and the constants <x/, a$, and a are our estimates of the standard deviations k s in measurements of location, orientation, and scale. The orientation variance includes a factor based on the feature's scale, s , because the orientation of a large feature can usually be k measured more accurately than that of a small one. So that it can be transformed readily into model coordinates, the image feature's position pdf is re-expressed in xyuv image coordinates. An approximation must be used to obtain a Gaussian pdf within the uv subspace, but the approximation is adequate for realistically small 9 and s variances. The approximating pdf has a mean b^"" at the same position as b^ , and y9s a covariance matrix C yuv k that aligns the Gaussian envelope radially, away from the [uv] origin (seefigure5.17). We get Q°j? uv from Q* yes by using a matrix S that scales the 9 dimension by Sfc, and a matrix R(a) that rotates by a in the 9s/uv plane. Xk Vk s cos 9 k k s sm9 k k (5.16) Chapter 5. Matching and Recognition Methods k 101 H Figure 5.17: Approximation of image feature's uv distribution. The Gaussian distribution of an image feature's orientation and scale in 8s coordinates (left) is approximated by a Gaussian distribution in uv coordinates (right), with the parameters of the approximating distribution determined as shown. xyuv = Q u„(0 R f c 10 where 0 TL (a) uv _!) 0 g xy0s g o R u«(_^ 1 n 0 1 0 0 0 0 cos a — sin 0 0 sin a cos a + and = S a 0 u n 1 0 0 o o s o k 0 0 0 1 Figure 5.18 demonstrates the accuracy of this approximation. Transforming the image feature position to model coordinates The viewpoint transformation T is characterized by a Gaussian pdf over [x yt u v ] vectors; its t t t mean t and covariance matrix C are estimated from feature pairings, as shall be described in t section 5.2.2. Using T to transform the image feature position from xyuv image to model coordinates again requires an approximation. The reason is as follows. If we were to disregard the uncertainty in k's position, we could, by applying equation 4.6 (p. 90), obtain a Gaussian pdf in model coordinates with mean A^t and covariance A f c C A . Alternatively, if we were to disregard T f Chapter 5. Matching and Recognition Methods 102 130 Actual distribution in 6s coordinates 120 110 100 90 80 20° (a) 30° 50° 90 90 80 80 70 70 •••••:.?:\i.-r; J - 60 50 40 40 50 60 —i 70 — 180 90 70° Approximating distribution in uv coordinates .- '.X'.: •* 100 110 .i:-. •. 60 50 (b) 60° 100 Actual distribution in uv coordinates 100 40° 50 60 70 80 , 90 100 110 (c) Figure 5.18: Monte Carlo simulation of image feature's uv distribution. These scatter plots, each of 400 samples, show distributions of a typical image feature's orientation and scale, as modeled originally in 6s coordinates and approximately in uv coordinates, (a) The original Gaussian distribution in 6s coordinates with mean orientation 6 = 40°, orientation s.d. ag = 6°, mean scale Sk = 100, and scale s.d. a k = 5. (b) The distribution obtained by transforming each instance from 6s coordinates to uv coordinates. Although this distribution is not Gaussian, it can be approximated as Gaussian as described in the text, (c) The approximating Gaussian distribution in uv coordinates. The accuracy of the approximation is evident from the similarity of plots (b) and (c). k s k Chapter 5. Matching and Recognition Methods 103 the uncertainty in T, we could, by applying equation 4.7 (p. 90), obtain a Gaussian pdf in model coordinates with the same mean but covariance AtC^^Aj. With Gaussian pdfs for both the feature position and the transformation, however, the transformed position's pdf is not of Gaussian form. At best we can approximate it as Gaussian, which we do with a mean and covariance given in xyuv model coordinates by b = fct C A t, (5.17) fe = kt A C A +A C Aj. T k t x yuv k t k Characterizing the model feature position Model feature j's position is also described by a Gaussian pdf in xyuv model coordinates. Its mean hj and covariance Cj are estimated from the series of position vectors Bj recorded by the model. Two practical considerations enter into the estimation of Cj. First, when Bj contains too few samples for a reliable estimate of Cj, the estimate that Bj yields is blended with another determined by system parameters. Second, minimum variances are imposed on Cj in case some dimension of Bj has zero variance. Comparing positions in model coordinates The desired probability—that j matches k given their respective position distributions and the transformation T aligning them—can be estimated by integrating over all xyuv model coordinate positions r G 5R, the probability that both the transformed image feature is at r 4 and the model feature matches something at r: P(b, = T(b ) | e fc The random variables itj and r 3 kt T, H) = j P(Vj = r) P(f fct = r) dr. (5.18) are drawn from the Gaussian distributions N(bj,Cj) and N(bfc,Cfct). It may be helpful to think of this integral as one component in a convolution of t two Gaussians. Evaluating it by sampling at various r would be costly, but fortunately the Chapter 5. Matching and Recognition Methods 104 integral can be rewritten in closed form as P(bj -T(b ) | ej j^±,T,H) k = G(b, - b ku C + C ), 3 kt (5.19) where G(x, C) is a Gaussian with zero mean and covariance C. In this form, the desired probability estimate is easily computed. Features with undefined locations, orientations or scales Equation 5.19 is used to estimate P(bj = T(b )) for features j and k whose locations, orienk tations, and scales are all well defined. As mentioned in section 4.1.1, however, there are some feature types for which not all position components can be reliably defined. The distributions for these features must be compared only in terms of those position components that are reliable. Effectively, this requires projecting their full position distributions onto a subspace of just the reliable components before comparison. For example, if location and orientation are well defined but scale is not (as for an LJct feature), then position distributions must be compared in an xyO rather xyuv space. A simple way to estimate the xyO distribution is to approximate it with a product of two marginal distributions: P(b, = T(b )) fc « P([x y ] = T([x y ])) P{6 = T(0 )). 3 3 k k 3 k The location term is then estimated by the method described previously in this chapter, but with the transformations and integration performed in an xy subspace rather than the full xyuv space. The orientation term is only slightly more difficult to estimate. We assume that the image and model feature's orientation distributions are concentrated in a small portion of the circle, allowing us to treat orientation as a linear rather than circular measure. The mean and variance of the transformed image feature's orientation are extracted from its full position pdf as represented by b kt and C . This requires a uv —> ds transformation, essentially reversing the kt Chapter 5. Matching and Recognition Methods 105 8s —• uv transformation performed previously (equation 5.16 andfigure5.17), followed by a projection to extract the 8 components of the mean and covariance: 9 kt = tan *(—), Ukt a 2 ekt = s R«»(-0 fet + §) C R "(0 u f c t f c t -f) s. T Matrix H (a), defined in equation 5.1.3, rotates by a in the ds/uv plane. Vector s scales and uv extracts the 8 dimension: s = where 0 s t — \Ju\ + v k t The transformed image feature's orientation is now characterized by a Gaussian pdf with parameters 8 kt and c| f c t . A corresponding 8j and Og- are obtained for the model feature by measuring statistics of its sample population, Bj. The two distributions are then compared by integrating their product over all orientation's, which, as in equation 5.19, can be written P0j=T{e )) k The difference 8j — 8 kt = G ^ - ^ e r ^ + aL). is "wrapped" around 2ix so that its value is in [—7r,7r). Similar techniques can be used to compare feature positions within other subspaces. We compare Ellipse features on the basis of their locations and scales—but not orientations, as these features include circular arcs with no well-defined orientations. Similarly, line segments could be compared on the basis of their orientations and components of their locations perpendicular to one of the lines, thus avoiding any dependence on possibly-unreliable line length. 5.1.4 P r o b a b i l i t y distribution over feature attributes The P(aj = afc) term in the match quality measure represents a measure of the probability that model feature j might have attribute vector afc. Known are other attribute vectors, Aj, of features found to have matched j during training. The problem of estimating this probability is therefore of the following form: Given a sequence of d-dimensional observation vectors {x^ : 1 < Chapter 5. Matching and Recognition Methods 106 i < n) drawn at random from an unknown distribution, estimate the probability that another vector drawn from that same distribution would have the value x. This problem could be solved by assuming that the distribution has some parameterized form (e.g., normal), and then estimating the distribution's parameters from the observations x^ But the distribution may indeed be complex; it is determined not only by sensor noise and measurement error, but by the compounding effects of object shape, lighting, and viewpoint. And unlike the position distributions that must be transformed and compared during matching, attribute distributions are not manipulated so there is no need to simplify them for mathematical convenience. Hence we use a non-parametric density estimation method that assumes relatively little about the distribution. Here we shall describe that method and our application of it; further explanation of the method may be found in Silverman 1986. Non-parametric density estimator In its simplest, form, the density estimator sums contributions from a series of overlapping kernels centred on the observations. The density at x is given by (5.20) where h is a smoothing factor, a constant, chosen to strike a balance between the smoothness of the estimated distribution and itsfidelityto the observations X j . For the kernel function, K, we use an Epanechnikov kernel because it hasfinitesupport and can be computed quickly; its definition is i ^" (^ + 2)(l-x x) 1 C 0 where T ifx x<l T (5.21) otherwise is the volume of a d-dimensional sphere of unit radius. Adaptive density estimator Although this simple estimator is adequate for some distributions, we have obtained better results with a method that adapts h to local conditions of the distribution. With / (equation 5.20) Chapter 5. Matching and Recognition Methods 107 Figure 5.19: Example of a locally-adaptive probability density estimate for attribute vectors. The spikes denote the samples from which the density estimate was computed. as a first density estimator, a second estimator, f , is created whose smoothing factor varies a according to the first estimator's density estimate: /•W = iX>,-*(^f), (5.22) 1 where A, = * and v = (j[f(xi) The various Ai incorporate the first density estimates at the points X j , and v is a normalizing factor (a geometric mean). This adaptive estimator smoothes more in low-density regions than in high-density ones; thus a sparse outlier is thoroughly smoothed while a central peak is accurately represented in the estimate (see figure 5.19). Choosing the smoothing factor The adaptive density estimator still requires that a mean smoothing factor h be chosen. Intuitively, h should increase with higher dimension (larger d), and with fewer observations (smaller n). Silverman (1986) recounts a method, due to Epanechnikov and others, that chooses an optimal smoothing factor /i t by minimizing the mean integrated square error between the estimate op Chapter 5. Matching and Recognition Methods 108 and an unknown density whose second derivatives are bounded and continuous. If the unknown density is assumed to be a multivariate Gaussian, its derivatives have known bounds that can be used to obtain Kt P = h n- ^ \ l (5.23) d+A K where, for the Epanechnikov kernel, h Although this h opt = {8c^ (d + 4)(2 ^) } 1 K / d v 1/(d+4) . (5.24) is ideal for a Gaussian distribution, it may somewhat oversmooth the density of a multimodal one or undersmooth the density of a uniform one. We have obtained our best results with a smaller h for model feature distributions, which are generally wellconcentrated (we use 0.75/i t)) quite uniform (we use 1.5/i ). op a n d a larger h for milieu distributions, which are generally opt Other considerations Because the kernel is spherical, density estimation for a multidimensional distribution is most accurate when the distribution has the same scale in all dimensions (scale meaning, loosely, the size of the distribution's important features). Thus we try to normalize dimensions, scaling each by the inverse of its standard deviation as measured in {x^}; using the full covariance matrix to properly "whiten" the distribution would be even better. When for a model feature based on few training samples {xi} is too small to produce useful scaling factors, we obtain them instead from the milieu collection's attribute samples. The density estimator assumes some smoothness in the unknown density. Consequently, feature attributes must be designed so that they indeed have reasonably smooth distributions. Consider, for example, an attribute that measures the ratio of two distances, each varying uniformly over [1,10]. This ratio attribute would range over [0.1,10], but often it would be concentrated much more heavily in [0.1,1] than elsewhere. If instead this attribute measured the log atio of the two distances, it would vary over [—1,1], and it would usually be more r 10 Chapter 5. Matching and Recognition Methods 109 smoothly distributed, allowing better density estimates. This is the reason that logarithms are used for certain attributes of the CapPar and Region features. Smoothness often fails at the bounds of attribute domains. For example, if a feature's attribute measures a junction angle and the feature's definition requires that that angle be in [20°, 70°], then the attribute's distribution may end abruptly at 20° and 70°. Near such bounds, the density estimator we have described will underestimate the true density. Although techniques exist for coping with bounded domains (e.g., reflecting observations across the domain bounds) we have not yet adopted them; nevertheless adequate performance is achieved because relatively few attribute distributions are adjacent to domain bounds. 5.2 Matching Procedure Recognition and learning require the ability tofind,between a model graph and an image graph, a match that maximizes the match quality measure defined in section 5.1. It does not seem possible to find an optimal match through anything less than exhaustive search. Nevertheless, good matches can usually be found quickly by a procedure that combines qualities of both graph matching (p. 43) and iterative alignment (p. 46). Recall that iterative alignment will hypothesize initial pairings between model and image features, use them to estimate an aligning viewpoint transformation, use the transformation to evaluate and choose additional pairings, and so on, pairing as many features as possible. Whereas alignment enforces the constraint that pairings be consistent with some hypothesized viewpoint, graph matching pairs model and image graph nodes subject to various local geometric and topological constraints represented by graph arcs. In our version of iterative alignment, we explicitly represent the uncertainty in the position of each feature and the resulting uncertainty in the transformation estimate. On the basis of uncertainty information acquired from training images and recorded in the model, features that are localized well contribute most to the transformation estimate and those whose positions vary most are sought over the largest image neighbourhoods. We use not only the transformation Chapter 5. Matching and Recognition Methods 110 estimate to constrain pairings, but also the geometric and topological constraints represented by graph arcs; to be adopted, a pairing must be consistent with the transformation estimate and with previously adopted pairings of related features. The procedure is called probabilistic alignment, emphasizing its use of uncertainty information to evaluate pairings and guide the search. 5.2.1 Probabilistic alignment To choose the initial pairings, possible pairings of high level features are rated according to the contribution each would make to the match quality measure. The pairing (j, k) receives the rating gj(k) = max logP(gj = k\ T,H)- logP(ej = k). (5.25) This rating favors pairings in which j has a high likelihood of matching, j and k have similar attribute values, and the transformation estimate obtained by aligning j and k has low variance. 2 The maximum over T is easily computed because P(ij = k \ T, H) is a Gaussian in T. Alignments are attempted from these initial pairings in order of decreasing rank. Each alignment begins by estimating a transformation from the initial pairing, and then proceeds by repeatedly identifying additional consistent pairings, adopting the best, and updating the transformation estimate with them until the match quality measure cannot be improved further. At this stage, pairings are selected according to how each might improve the match quality measure; thus (j, k) receives the rating g {k;E,T) 3 = \ogP{e = k\ T,H) - logP(e = k). 3 i (5.26) This favors the same qualities as equation 5.25 while also favoring pairings that are aligned closely by the estimated transformation. In order for (j, k) to be adopted, it must rate at least as well as the alternative of leaving j unmatched, which receives the rating g (±;E,T) 3 2 = logP(e = L | i T,H) - logP(e = ± ) . i (5.27) More precisely: according to j's attribute vector distribution, k's attribute vector has high probability. Chapter 5. Matching and Recognition Methods 111 Figure 5.20 and the remainder of this section describe the matching procedure in additional detail. Subsequent sections address related topics: how the viewpoint transformation is estimated from adopted pairings, how new pairings are checked for consistency with the previously adopted ones, and how certain ambiguities are resolved. Significant computation is involved in rating and ranking the pairings needed to extend an alignment. Consequently, pairings are adopted in batches so that this computation need only be done infrequently. Moreover, in the course of an alignment, batch size is increased as the transformation estimate is further refined so that each batch can be made as large as possible. A schedule that seems to work well is to start an alignment with a small batch of pairings (we typically use five), and to double the batch size with each batch adopted. Thus, for example, after about 40 pairings have been adopted and used to estimate a viewpoint transformation, that transformation estimate is used to evaluate pairings in the selection of another 40 pairings; when that batch has been adopted, 80 pairings are selected, and so forth. A consequence of evaluating pairings in batches is that, at the time they are adopted, pairings have not necessarily been evaluated according to the best transformation estimate available. In the preceding example, the 80 pairing to be adopted will have been evaluated th using a transformation estimate based on only the first 40 pairings. Nevertheless, because the transformation estimate becomes increasingly stable as more pairings are adopted, this optimization has little effect on the matching outcome; yet it eliminates a considerable amount of computation. By using a priority queue, the matching procedure can identify a batch of the highest ranked pairings without having to sort a list of all pairings. The priority queue is implemented using a heap data structure, which is made large enough to retain only the desired number of the best pairings. The queue is initialized as empty, and then each consistent pairing is rated and placed on the queue according to its rating. After all pairings have been processed this way, the queue is left holding those pairings that are of highest rank; these constitute the desired batch. Chapter 5. Matching and Recognition Methods 112 Inputs: initial pairing (j, k) Outputs: final set of pairings E and transformation T E { 0',*) } <- m < — i n i t i a l batch size (e.g., 5) while true do begin Update the viewpoint transformation estimate: T <— new viewpoint transformation estimate based on E Queue pairings: q < — empty priority queue for each pairing (j,k) £ E do if gj(k;E,T) > gj(±;E,T) and Consistent ((j, k) ,E) then insert (j,k) in q with the rating gj(k;E,T) if q is empty then return E and T Adopt a batch of pairings: «- 0 while i < m and q is not empty do begin {j, k) < — the head entry popped from q if (j, k) is inconsistent with some entry in q and {j, k) has not already been downrated then i Downrate an ambiguous pairing: downrate (j, k) by reinserting it in q with a lower rating else Adopt a pairing: begin E «- E U { 0',*) } remove from q any pairings inconsistent with (j, k) i <- i + 1 end end ra <— 2m end Figure 5.20: Probabilistic alignment procedure. Consistent is a procedure denned in figure 5.21. Chapter 5. Matching and Recognition Methods 113 Once a batch of pairings has been identified with the aid of the priority queue, the pairings are adopted and used to update the transformation estimate. At this point, the procedure checks for ambiguous pairings. Queued pairings that propose alternate, mutually exclusive interpretations of the same feature are considered ambiguous, and they are downrated by a constant factor so that their selection will be postponed in favor of less ambiguous pairings. Backtracking could be performed wherever ambiguity forces a choice among conflicting pairings. (By backtracking, we mean not alternate choices of an initial pairing, but alternate choices of additional pairings once an initial pairing and its corresponding alignment have been hypothesized.) We have observed, however, that subsequent matches found through backtracking are usually little better than the initial match found by choosing the highest ranked pairings. Consequently, for the experiments reported in this dissertation, no backtracking was performed during matching. Each initial pairing and its corresponding alignment leads to a single match (or perhaps multiple matches if backtracking is permitted). Verification criteria, such as the one we shall describe in section 5.4, may be used to decide which of these matches represent an actual instance of the object in the image. If it is known that the image contains at most one instance of the object (as is the case with training images), then the search can be halted as soon as an acceptable match has been found. In the absence of an acceptable match, however, searching continues from successive initial pairings until resource limits (e.g., time limits) have been reached, or until all promising initial pairings have been considered. 5.2.2 Estimating the viewpoint transformation The matching procedure requires that we compute, from one or more feature pairings, a viewpoint transformation that maximizes the match quality measure for those pairings. In other words, we need to solve for a viewpoint transformation that optimally aligns given sets of model features and image features. Fortunately, feature positions and viewpoint transformations are represented in a manner that allows a good approximation of the desired transformation to be Chapter 5. Matching and Recognition Methods 114 obtained as the solution to a linear, least squares (LS) estimation problem. In the following paragraphs we shall first describe a formulation of this problem, and then describe a recursive estimator we use to solve it. Weighted least squares formulation Suppose that the desired optimal viewpoint transformation for the match is t = [x yt u v ], t t t and that (j, k) is any pairing of model and image features included in the match. Then t will transform fc's mean position to lie near j's mean position, leaving a residual error we shall denote e. Using the formulation developed in section 4.4 (p. 88), we can write this as A t f c = bj + e. (5.28) As the match may not align all feature pairs perfectly, e may be nonzero. Thus we consider e to be a random vector that assumes different values for different matches, and we estimate its covariance matrix as follows. Associated with j , and recorded in the model, is a series Bj containing the position vectors of image features found to have matched j during training. More precisely, each element of Bj is of the form Afc' t', where Afc' represents the mean position of an image feature k' that matched j during training, and t' is the viewpoint transformation associated with that match. We observe that Bj should have a distribution similar to that of Afc t. Thus the covariance of Bj, which we shall denote Cj, gives us an estimate of the covariance of e. Using the covariance estimate Cj, we scale equation 5.28 so that each scalar component of the residual is more nearly independent and identically distributed, and each has unit variance. This is done by dividing through by the upper triangular square root of Cj, in a process called whitening, to give U^Afct = U^bj+e', where C, = U , - U j and e' is now a residual error whose covariance matrix is I. (5.29) Chapter 5. Matching and Recognition Methods 115 A series of feature pairings produces a series of such relations, together forming an overdetermined linear system. The LS solution of this system gives us an estimate of the optimal viewpoint transformation t and its covariance matrix C . Fortunately, as a consequence of the t whitening, the estimate will be influenced most by those model features whose positions have exhibited the least uncertainty. A pairing of two features with well-defined locations, orientations, and scales produces four scalar equations for the linear system. When not all position components are well defined, fewer equations can still be obtained by projecting onto an appropriate subspace of the full xyuv space, as was done in section 5.1.3 to compare feature positions. We shall illustrate this technique in section 5.3 when describing how curve segments are matched. Use of a recursive estimator Each time the probabilistic alignment procedure adopts a batch of feature pairings, it must compute a new transformation estimate based upon those pairings and all previously adopted ones. This can be done most efficiently using a recursive LS estimator, which requires that only a small, constant amount of computation be performed with each pairing adopted and each recomputation of the estimate. Among such estimators, the square root information filter (SRIF) is well suited for this task. Compared to the conventional Kalman filter, for example, the SRIF has better numerical stability and it is more efficient at processing batched measurements (Bierman 1977). As its name implies, the SRIF works by updating the square root of the information matrix, which is the inverse of the estimate's covariance matrix. The initial square root R i and state vector zi are obtained from the first pairing (j, k) of model and image features: Ri = VJ A 1 and zi = U J b,-. 1 k (5.30) Then, with each subsequent pairing (j,k), the estimate is updated by triangularizing a matrix Chapter 5. Matching and Recognition Methods 116 composed of the previous estimate and data from the new pairing: A U^Ajfe UJ ^ 1 Ri Zi 0 ei (5.31) When needed, current estimates of the viewpoint transformation and its covariance matrix are obtained from the triangular R , by back substitution: ^ = R ~ 1 Z J and C ti = R ^ R ^ ( 5 - 3 2 ) Alternate least squares formulations The LS formulation of A t = b that we have just presented considers only the uncertainty of b , the model feature positions, and not the uncertainty of A , the image feature positions. Thus a total least squares (TLS) formulation, which would consider uncertainty in both A and b , might seem more appropriate. It may not be, however. When the solution t is being used to predict new values of b from given values of A , a t obtained by LS is expected to yield more accurate predictions than one obtained by TLS (Cheng and Hess 1990). Informally, this is because LS minimizes the distance A t — b , which is comparable to the prediction error, whereas TLS minimizes something else (the orthogonal distance between the data [A; b] and the subspace t ) . In our case, t is used to predict the model coordinate positions of image features in order to identify potential pairings for those image features, so we expect a t obtained by LS to produce more appropriate pairings. Also, most reports suggest that TLS is generally less stable than LS. Theoretical analysis 3 indicates that, in general, TLS is more sensitive to outliers (Ammann and Ness 1989), and more sensitive to perturbations in A and b when there is not an exact linear relation between the true but unknown values of A and b (Van Huffel and Vandewalle 1991, p. 5). In our case, outliers can arise from inappropriate pairings, and although the relation between image and One exception was reported by Chaudhuri and Chatterjee (1991), who compared LS and T L S estimators for two problems of motion estimation from point correspondences. In these specific cases, T L S was found to be more stable against outliers that occur because of point mismatches. 3 Chapter 5. Matching and Recognition Methods 117 model feature positions is approximated as linear (within the scope of one model view), it is, in reality, nonlinear. Moreover, empirical comparisons have found TLS more susceptible to noise, apparently because its solution method, which requiresfindingeigenvalues, is more affected by round-off errors (Ward 1984). These considerations all suggest that, in our case, LS may be a better choice than TLS. An empirical investigation is still needed to settle the matter, but that investigation is beyond the scope of this dissertation. For the work reported here, we have adopted the LS formulation. Within the LS framework, however, we could choose to model errors in either model or image feature positions. Most researchers using this framework have chosen to model image measurement errors, weighting the contribution of each feature pairing according to the estimated magnitude of image feature uncertainty (e.g., Ayache and Faugeras 1986; Hel-Or and Werman 1995). In contrast, we have chosen to emphasize model feature uncertainty, which in our case is likely to be larger and more informative than image feature uncertainty: it characterizes model feature positions over a range of viewing conditions, using feature-specific information acquired during training. 5.2.3 C h e c k i n g consistency of candidate pairings Before a pairing of features is adopted in the course of a match search, that pairing is examined to ensure that it is consistent with any previously adopted pairings of related features. For example, if the search has already adopted pairings involving parts of the proposed pair of features, but those pairings do not link parts that correspond properly, then, in certain cases, the search must reject the proposed pairing. Feature relations determine the constraints that are imposed. In our implementation we distinguish two forms of grouping relations, corresponding to the template feature and sequence feature classes described in section 4.1.3. Briefly, a template feature has a fixed set of parts; two template features can be paired only if all paired parts of one are paired with corresponding parts of the other. A sequence feature has a variable-sized, ordered set of parts; two sequence Chapter 5. Matching and Recognition Methods procedure Consistent ((j, k) ,E) begin c <— ConsistentWithParts((j, 118 k),E) E' <- E U { O'.fc) } f o r each (j',k') € E such that (j i s a part of j') and (A; i s a part of k') do c <— c and ConsistentWithParts ((j', k'),E') return c end Figure 5.21: Consistent procedure. Consistent ((j, k) ,E) checks whether proposed pairing (j,k) is consistent with previously adopted pairings E. ConsistentWithParts is a procedure defined infigure5.22. Both procedures return a value of true or false. features can be paired as long as their paired parts correspond in the same order. A proposed pairing {j, k) is tested in two ways, both implemented by the Consistent procedure specified infigure5.21. First, (j, k) is tested to ensure that it is consistent with any pairings already adopted among parts of j or k. This check, implemented by the ConsistentWithParts procedure specified infigure5.22, differs according to whether j and k are template or sequence features because of the difference in constraints outlined previously. The second test of (j, k) ensures that it is consistent with pairings already adopted among groups containing either j or k. This check is done by temporarily adopting (j,k) while performing the ConsistentWithParts check on each pairing involving each group containing j or k. Only if all tests are passed is (j, k) considered consistent, and a candidate for adoption. 5.2.4 N-way ambiguous features Matching must be able to cope with features whose interpretations or positions are ambiguous in the sense discussed in section 4.1.1. For example, among OLIVER'S features, the CrankLR, CrankRL, Region and PCurves features have orientations that are two-way ambiguous due to symmetries in the grouping arrangements those features represent. A horizontally-aligned Region, for example, can be assigned an orientation of either 0° or 180°, and its sequence of parts Chapter 5. Matching and Recognition Methods procedure ConsistentWithParts 119 ((j, k) ,E) begin number of p a r t s of j number of p a r t s of k if j,k are template f e a t u r e s begin f o r p <— 1 t o rij do (in which case rij = n ) then k (part p of j i s p a i r e d i n E, (part p of k i s p a i r e d i n E, return false r e t u r n true if but not t o p a r t p of k) or but not t o part p of j) then end i f j,k are sequence f e a t u r e s then begin p,q <- 1 while true do begin while (p < n,j) and (part p of j i s not p a i r e d i n E) do p <— p + 1 while (9 < nfc) and (part g of k i s not p a i r e d i n E) do if (p > rij) and (g > n ) return true if k then (all paired parts correspond) (p > ny) or (q > rij) or (part p of j i s not p a i r e d i n E t o part q of A;) then return false (one has a pairing the other does not) p <— p + 1 q ^ q +1 end end end Figure 5.22: ConsistentWithParts procedure. ConsistentWithParts ((j, k),E) checks whether proposed pairing {j, k) is consistent with existing pairings in E involving parts of j and k. Chapter 5. Matching and Recognition Methods 120 will differ accordingly. An n-way ambiguous feature found in the image is recorded only once in the image graph, with an interpretation chosen arbitrarily from among the n possible interpretations. An nway ambiguous feature that is part of the model is also recorded only once, but with a fixed interpretation (determined by the learning procedure from previous matches involving it and related features). During matching, when pairings are considered between the n-way ambiguous features j and k, the image feature is treated as though it were n distinct features, k\ through k . Each of these n features may have different positions and different sequences of parts (as n in the case of a Region, for example). Each pairing (j,ki) through (j,k ) is evaluated and queued as though k\ through k were n n distinct features, although only one of the pairings is permitted to be adopted. Thus, by virtue of adopting a pairing between j and one of the k\ through k , the matching procedure ultimately n assigns an appropriate interpretation to k. 5.3 C u r v e Segment M a t c h i n g By curve segment, we mean a feature denoting a segment of algebraic curve fit to intensity edges detected in an image. One example is OLIVER'S Curve feature, which represents a group of edgels that have been fit by a straight line segment, circular arc, or elliptical arc. There are particular difficulties in comparing image and model curve segments, and in using curve segment pairings to estimate a viewpoint transformation: • Estimates of parameters obtained by fitting curves to edgels, such as the centre of a circular arc or the eccentricity of an elliptical arc, are often unreliable. • With only slight deformation, a single edge can give rise to different types of curve segments. A segment recovered as an elliptical arc in one image may be recovered as a line in another, very similar image. Chapter 5. Matching and Recognition Methods 121 • A segment that appears complete in one image may be truncated in another due to occlusion. One consequence is that there is no reliable way of associating a point location with a line segment. Thus curve segments are not matched very effectively by simply assigning them locations, orientations, scales, and other attributes, and comparing them on that basis. Nevertheless, curve segments meeting certain requirements can be matched within the existing framework, alongside other features and by analogous methods. Briefly, two curve segments are compared by sampling one at regular intervals and evaluating the distance from each of those sample points to the nearest point on the other segment, as determined by the current transformation estimate. If the two curve segments are paired, these point correspondences are used to update the transformation estimate. Matching curve segments in terms of sample points avoids any need to compare their types or parameters, and it easily handles situations where the segments overlap The requirements of this method are that each curve segment can be represented as a parametric curve, and that, given any point, its closest point on the curve can be located. For a Curve feature, curve points closest to a given point p are located by examining the curve normals containing p; there are at most four distinct ones in the case of an ellipse, and they are associated with the roots of a quartic polynomial, as described by Spain (1957, p. 42). 5.3.1 C u r v e segment representation A curve segment's type and parameters are represented by a tuple c, which subsumes the role of the attribute vector a and position vector b used in the representation of other general features. In the case of a Curve feature, for example, c specifies whether the curve is a straight line, circular arc, or elliptical arc, and it includes any applicable parameters among xo, yo, 9, so, Si, r, a, and b (see table 4.2 on p. 78). There exists a function mapping the arc length parameter s to points [x(s) y(s)] on the curve represented by c, and an interval which s sweeps out the entire curve segment. [so,si] over Chapter 5. Matching and Recognition Methods 122 The edge pixels fit by a curve segment are modeled as being uniformly distributed along the curve segment while randomly deviating to either side of it by small distances. The term displacement is used here to mean the signed distance from a point to a curve; points on one side of the curve have negative displacements and those on the other side, positive displacements. The edge pixel displacement distribution is modeled as having a zero mean; its standard deviation is denoted by a . c An image curve segment k is thus described by c and a , both measured in the image k ck by fitting a curve to a group of edgels. Similarly, a model curve segment j is described by Cj and o j. These are obtained by fitting a curve to the collection of all edgels of all image curve c segments successfully paired with that model feature during training. In a sense, therefore, Cj represents a "mean" of certain training image curve segments, and o j describes the distribution c about that mean. In the case of OLIVER'S Curve features, a model curve segment is obtained by selecting the best among separate fits of a straight line, a circular arc, and.an elliptical arc. Straight lines and circular arcs are fit by linear least squares estimation (Thomas and Chan 1989). An elliptical arc is fit by first fitting a general conic using Taubin's direct method (Taubin 1991); if that conic turns out not to be an elliptical arc, one is instead fit using Leonardis's method (Leonardis 1993), which requires a more time-consuming series of Levenberg-Marquardt optimizations but is guaranteed to produce an elliptical arc. 5.3.2 Evaluating a curve segment pairing Recall that a pairing (j, k) of image and model features is evaluated in the match quality measure by a term of the following form (equation 5.26, p. 110): (k;E,T) 9j = logP(e =k\T,H)-logP(e =k). J J Chapter 5. Matching and Recognition Methods 123 When j and k are curve segments, these probabilities are factored as follows (cf. equation 5.12, p. 97, and equation 5.14, p. 98): P{e = k\T,H) 3 = P(CJ = T(c ) | tj ^J_,T,H) P{e # ± | T , ) , P(e, = A;) = fc (5.33) 3 P(c, = T(c ) | e ^J_) P(ej ^_L). (5.34) fc The two P(e T^JL) probabilities are defined and estimated as they are for other features (see 3 equations 5.11 and 5.13), whereas the two P(CJ = T(cjt)) probabilities are given the following interpretations: Conditional probability of curve pairing. P(CJ = T(c ) | <=j ^±,T,H) k is the prob- ability that model curve j is paired with image curve k, given that the object as modeled is present in the image, j is paired with something, and the viewpoint transformation is T. Prior probability of curve pairing. P(CJ = T(c ) \ e"j #-L) is the probability that k model curve j could be found paired to an image curve like k, regardless of whether the modeled object is present. In section 5.1, terms like P(aj = a^) and P(bj = T(b )) denoted probabilities defined k by distributions over the spaces of attribute vectors a and position vectors b. Here similar notation is used for a slightly different, although analogous, concept. P(CJ = T(c )) is defined k not by a distribution over a space of curve segment parameters c, but rather a distribution of edge pixel displacements. Specifically, several points are chosen along the image curve segment they are transformed into model coordinates according to T, and their displacements are measured from the model curve segment Cj\ P ( C J = T(c )) is defined by the distribution of k these displacements. 4 However the prior and conditional probabilities of curve pairing are defined, their ratio can be employed directly in the match quality measure provided they are both defined in terms of the same measure space. Consequently, both are defined in terms of a displacement distribution, although the original motivation for using a displacement distribution applies only to the conditional probability. 4 Chapter 5. Matching and Recognition Methods (a) 124 (c) (b) Figure 5.23: Curve segment matching, (a) An image curve segment is sampled at several points fj. (b) Each ^ is transformed to model coordinates as gi, and paired with its nearest point hi on the mean model curve, (c) Displacement distributions about gi and hi are aligned along their common line. Conditional probability of curve pairing The image curve segment c is represented by n sample points whose image locations are k characterized by Gaussian pdfs. These pdfs are centred at points fi that are distributed somewhat uniformly along c^; each pdf is circularly symmetric with standard deviation a ck (see figure 5.23(a)). Curve features, for example, are sampled atfivearc length positions distributed uniformly throughout [SQ, SI]; more samples would be needed for curves of higher degree. Each sample point pdf is transformed from image coordinates to model coordinates, where it is approximated as Gaussian using the technique described in section 5.1.3. There, the mean and covariance matrix of the i th sample point are denoted by gi and Ci. The point closest to gi on the model curve (not bounded by the segment endpoints) is identified and denoted hi. At this point, we adopt a simplifying assumption that, in the neighbourhood of gi and hi and at the scale represented by Ci, the transformed image curve and model curve are nearby and approximately straight and parallel, like the situation illustrated infigure5.23(b). This assumption allows us to consider the displacement distributions of gi and hi as being oriented along the common line joining gi and hi (see figure 5.23(c)). Only if the two curve segments Chapter 5. Matching and Recognition Methods 125 are poorly aligned does this assumption fail badly, but in that case at least some of the gi will be far enough from the model curve to ensure that this method assigns P ( C J = T{c ) | ej T^J_) k a low value. Alignment of the displacement distributions for gi and hi allows us to estimate the probability that an edge pixel in that neighbourhood belongs to both the image and model curves, an event denoted ej. This estimate is obtained by integrating, over all positions r £ K along the (unbounded) line gihi, the probability that both the image edge pixel is at displacement r and the model curve includes an edge pixel at displacement r P{e | ij = J P(fj = r) P(f = r) dr. T, H) t (5.35) k The random variables fj and f are drawn from the one-dimensional displacement distributions k that are centred on hi and gi (respectively) and aligned along gihi. The distribution of fj, representing displacements about the model curve segment, has standard deviation u j. The c distribution of f , representing displacements about the transformed image curve segment, has k standard deviation o i obtained by projecting C i onto the gihi line: c where R(OJ) = [1 0 = cos a sin R(-0) C ] t — sin cos a R(<P) [ 1 0 ] , a (5.36) (5.37) a and <j> is the direction of gihi. Equation 5.35 is analogous to equation 5.18 and, as in that previous case, the integral can be rewritten in a more easily evaluated form: P(e \~e ^,T,H) t 3 = G(\\ - h ||,o gi t 2 cl + o%). (5.38) The n values of P(ej) obtained by equation 5.38 at intervals along k must then be combined to produce a single estimate of the probability that j and k are paired. Here, our approach must become somewhat more improvised. If the P(ej) were mutually independent and n were Chapter 5. Matching and Recognition Methods 126 equal to the number of degrees of freedom in the parameterizations of j and k, we might use n =T(cfc) | tj ^±,T,H) = JJP(ei | tj ^±,T,H). However, these conditions do not hold except in some special cases (for example, perhaps when n = 2 and all curve segments are lines); generally we will have more sample points than degrees of freedom. So instead we use a subset S of the q smallest P(ti), corresponding to the worst sample points (those deviating farthest from the model curve): P(c =T(c )\e ?±,T,H) j k j = Y[P(ei | e #J_,T,#). ies 3 (5.39) In comparing Curves, for example, we use q — 3 sample points. Finally, we would like the result to reflect not only how close the curve segments are aligned, but also their degree of overlap. Consequently, if more than some fraction (we use 0.5) of the hi fall outside the [sn,si] bounds of the model curve segment, P(CJ = T(c ) \ tj ^±,T,H) k is assigned a value of zero. Prior probability of curve pairing If the prior probability is to be defined in terms of the same measure space as the conditional probability, then it too should be based on the displacements of some q sample points. As for the conditional probability, the q points are treated as independent so that probabilities for each of the points can be simply multiplied together. In this case, each probability concerns the event that a sample point on the image curve has a particular displacement from the model curve regardless of whether the modeled object is actually present in the image. The prior distribution of these displacements actually depends on several factors, including the dimensions of the image, the viewpoint transformation, and the nature of the model curve (if the model curve is a circle, for example, displacement is bounded on one side by the circle's radius). For simplicity, however, we assume that displacement is uniformly distributed over a range equal to the image dimension P (we use P = 400 pixels). Under this assumption, the probability density of any particular displacement is P~ , and the probability density of any l Chapter 5. Matching and Recognition Methods 127 particular set of q displacements about q independent sample points is (3~ . This gives us the q following estimate of the prior probability that model curve j might be found paired with image curve k: (5.40) 5.3.3 U p d a t i n g the transformation estimate When a pairing of curve segments has been adopted, correspondences between n sample points on the image curve and their nearest points on the model curve provide measurements used to update the viewpoint transformation estimate. As when curve segments are compared, the current transformation estimate is used to map each sample point fj to model coordinates, where it is denoted gi and paired with its nearest point hi on the model curve (see figures 5.23(a) and (b)). By equating the locations of gi and h; we can obtain two scalar equations for the system used to estimate the transformation: 1 0 x -y 0 1 y/ x f + f f yt Vh where fi = [xj yf] and hi = [xh Vh]- But employing this relation would minimize both the distance between the curves and the relative shifts along the curves of corresponding points. Because we are interested only in minimizing the curves' separation, this two-dimensional relation is instead rotated and projected to become a one-dimensional relation representing the component of the distance between gi and h'i that lies in the direction, d>, of the line joining those points. The resulting scalar equation is divided by the model curve's displacement uncertainty, a j, so that the expected variance of the residual error, e, is 1. The equation is further c weighted by q/n because, as in the previous section, the n equations are partially redundant. In summary, each pair of corresponding points gi and hi contributes to an estimate of the Chapter 5. Matching and Recognition Methods 128 viewpoint transformation a scalar equation of the form W 1 0 x -y x 0 1 y/ x Vt f f t f W Xh (5-41] Vh u t v t where W = 5.4 n CT, cos(/> sin< Verification Once a match has been found between a model graph and an image graph, it must be decided whether the match represents an actual instance of the modeled object in the image. A general approach to this problem would use decision theory to weigh prior expectations, evidence derived from the match, and the consequences of an incorrect decision. However, as explained in section 2.3.4, these factors depend largely on the particular nature of the application and, consequently, this dissertation does not attempt to advocate any particular definition of them. In the course of model learning, however, a match with a training image can be verified using a simpler, less-general method. In that context, where the object is presumed to be present unoccluded in the image, a valid match can be expected to account for most features of the model graph (whereas it may not when occlusion is possible). Thus it is sufficient to simply require that some minimum fraction of the model graph's features be paired. We shall 5 describe a verification test based on this idea after first addressing the issues of what paired features should be considered, how much importance should be assigned each paired feature, and how many paired features should be required before a match is accepted. Similar verification criteria have been described elsewhere, including Chen and Kak 1989, Gottschalk, Turney, and Mudge 1989, Grimson and Huttenlocher 1991, and Lamdan, Schwartz, and Wolfson 1990. 5 Chapter 5. Matching and Recognition Methods 129 Features considered The match quality measure used to guide matching provides one indication of a match's significance. A simple way to accept or reject matches, then, might be to require that this measure exceeds some threshold. However, the measure is unsuitable for this use because its range differs widely among objects according to what high-level features they have. As explained in section 5.1, high-level features that represent groupings of low-level ones violate the feature independence assumption; consequently, the match quality measure is biased by an amount that depends on what high-level features are present in the model. Whereas this bias seems to have no adverse effect on the outcome of matching any one model graph, it makes it difficult to establish a single threshold for testing the match quality measure of any model graph. Thus the verification method we present here considers only the lowest-level features of the model graph—those that do not group any other model graph features. We shall denote this set of features C. Among OLIVER'S features, for example, C includes only Curves. Importance assigned to each feature When counting paired model features, we weight each one according to its likelihood of being paired, thereby assigning greatest importance to the features that contribute most to the likelihood that the object is present. For model feature j , the likelihood of being paired, L(e L±\T,H) j7 - p ( g _ , is estimated using statistics recorded for feature j , as described in sections 5.1.1 and 5.1.2. One consequence of the weighting is to assign greater weight to a longer curve segment because it generally has a higher likelihood of being paired. The count of each model feature is also weighted according to how well it is fit by image features. When j is a curve segment, this weighting component is based on the fraction of j matched by nearby image curve segments. The fraction is estimated using a simple approximation: The lengths of image curve segments matching j are totaled, the total length Chapter 5. Matching and Recognition Methods 130 is transformed into model coordinates, and the transformed value is divided by the length of j. With st denoting the scaling component of the viewpoint transformation T, and Sj and s 6 k denoting the lengths of j and k in model and image coordinates, respectively, the fraction of j covered by image curve segments is defined as Comparable weighting schemes may be defined for other types of features included in C. Number of paired features required If we were to accept matches that paired a fixed number of model features regardless of model complexity, then with greater model complexity we would have an increased likelihood of accepting incorrect matches. For example, requiring that ten model features be paired may make sense for a model of twenty features, but for a model of a thousand features, any incorrect match is likely to contain at least that many "accidental" pairings. Thus we have chosen instead to require that some minimum fraction of the elements of C be paired. We define this fraction as 51 ^ ( e j #J_| T , H) S u p p o r t ( j , E, S u p p o r t ^ , T) = jec T) (5.42) £L(e #J_|T,ff) i jec A match (E,T) is accepted if Support(i£, T) achieves a certain threshold r. To validate this verification method and to determine a suitable value for r, we have measured the distribution of Support(i?, T) for correct and incorrect matches between various model graphs and their respective training image graphs (see figure 5.24). The distributions are well separated, with most correct matches achieving Support(.E, T) > 0.6 and most incorrect matches achieving Support^,! ) < 1 0.3. Hence we conclude that the verification method is effective, that it is not Since this method ignores the degree to which image curve segments overlap or extend beyond the model curve segment, it can overestimate the true fraction. Although more precise methods are easily envisioned, the present method has proven adequate. 6 Chapter 5. Matching and Recognition Methods 131 Figure 5.24: Examples of Support measure distribution. Support(E,T) (equation 5.42) was evaluated for correct and incorrect matches between various model graphs and their respective training image graphs for two objects. Distributions for the two objects are plotted separately here as (a) and (b). Distributions for correct matches are shown by solid lines; those for incorrect matches, by dashed lines. (Correctness of matches was judged subjectively, but was rarely ambiguous.) (a) Simple object comprising 30-40 curve segments, 69 correct matches; 38 incorrect matches, (b) Complex object comprising 130-150 curve segments; 46 correct matches; 337 incorrect matches. unduly sensitive to the choice of r, and that r = 0.5 is an appropriate, conservative threshold for distinguishing correct matches involving training images. Chapter 6 Learning Procedure When we assume general knowledge about an entire class, having been given specific facts about certain members of that class, we are learning by induction. This is the type of learning required of a system that learns to recognize objects: The system must learn to recognize an object under general viewing conditions, having observed it under particular conditions. In other words, the system must learn a general concept of the object's appearance from specific examples of its appearance. Induction raises the problem of how to select one concept among many that may be consistent with any particular set of observations (consistent in the sense that the concept implies the observations). For example, on observing the numbers 3, 7, and 31, we might conclude that we are seeing examples of the concept of odd numbers, the concept of prime numbers, or the concept of numbers of the form 2 " — 1 for integers n. Clearly, the choice among these concept hypotheses must be based on more than the numbers alone. In making our choice, we might consider how simple each hypothesis is (perhaps preferring the odd-number hypothesis as being the simplest), or how specific each hypothesis is (perhaps preferring the 2 — 1 hypothesis n because it applies to fewer numbers than the other hypotheses). To rank the hypotheses consistent with our observations and to choose a favourite among them, we must apply preferences and other factors that are collectively referred to as our inductive bias. Because a concept can be learned only if it can be represented, one source of inductive bias in any inductive learning system is the concept representation language. Our recognition learning system's concepts are object models represented according to the scheme described in section 4.3 132 Chapter 6. Learning Procedure 133 (and, indirectly, the class of viewpoint transformations described in section 4.4). What features are represented, how their attributes and positions are denned, how attribute and position probability distributions are estimated from training samples, what forms of independence are assumed among model views, among model features, and among samples, what class of viewpoint transformations is considered—all are sources of inductive bias attributable to the model representation. Inductive bias is also found in the way the match quality measure is denned, in the way the matching procedure is designed to favour certain match solutions over others, and in the way the learning method is designed to transform training images into object models. In effect, these sources of inductive bias are all forms of knowledge about the qualities desired in a good object model, knowledge that the learning method uses to choose one model among many. The learning method is a combination of two procedures, each performing a different type of inductive learning (figure 1.2 on p. 10). 1. A clustering procedure divides the training images into groups, each corresponding to a distinct view of the object. This task is a form of unsupervised learning. The concepts learned are the groups, each of which is defined by the training images assigned to it. 2. A generalization procedure merges the members of one group into a single model view represented by a model graph. This task is one of learning a concept from examples, where the examples (the group members) are all positive examples of the concept to be learned (the model graph). The generalization procedure is applied to each of the groups formed by the clustering procedure. The learning method couples these two procedures so that a choice among clustering alternatives is influenced, in part, by how well the resulting clusters can be generalized. Clustering and generalization are performed incrementally, with each image graph assigned to a group as it is received, and each model graph updated whenever an image graph is assigned to its group. The remainder of this chapter describes the two procedures in more detail; the generalization procedure, which is invoked repeatedly by the clustering procedure, is described first. Chapter 6. Learning Procedure 6.1 134 Generalization Procedure The generalization procedure's task is to produce a model graph from a group of image graphs. These image graphs will be similar to each other, having already been grouped by the clustering procedure because of their similarity. From them, the generalization procedure must produce a model graph that matches each image graph well, and also matches many of the features of each image graph (a trivial model could match any image well while matching only few features). Some versions of this problem—learning concepts by induction from examples—have been well studied. In particular, within the statistical pattern recognition and neural network fields, methods have been developed for inducing classifiers for n-dimensional vectors. In terms of our present discussion, these methods learn concepts denoting regions of n-dimensional space. Many have the important advantage of coping well with noisy data, but they are not applicable to relational structures, such as image graphs, because they require all instances to have a fixed number of vector components or attributes (Mitchell 1982). Methods developed within the artificial intelligence and machine learning fields do include some that operate on relational structures; several of these have been reviewed by Dietterich and Michalski (1981). Most represent structural instances and concepts using some form of predicate calculus, giving the concept space a straightforward organization that simplifies the search for suitable concepts. Consequently, however, concepts are definite rather than probabilistic: a concept either covers a particular instance or it does not, with no middle ground. Moreover, these methods generally do not cope well with noisy data. As these methods are not appropriate for inducing model graphs from image graphs, we have devised a new generalization procedure that resembles previous structural learning-methods but is better suited for our purpose. The generalization procedure's inductive bias is intended to favour model graphs that are both selective and economical. A selective model graph is one that matches not all image graphs, but only those that are quite like the image graphs used in its construction. An economical model graph omits any features that are too unreliable to benefit recognition significantly. Because each of these qualities is achieved at the expense of the other, the two must be balanced Chapter 6. Learning Procedure 135 somehow. The generalization procedure balances them by including in the model graph only those features that have been found among some minimum fraction of the image graphs. The generalization procedure operates incrementally, creating a model graph from the first image graph assigned to the group, and then updating that model graph with each subsequent image graph. To create the initial version of the model graph from the first image graph, the procedure uses each image feature k to create a model feature j. A model feature's type, tj, is set to its corresponding image feature's type, t ; its series of attribute and position vectors, Aj k and Bj, are initialized with the image feature's attribute and position vectors, afc and bfc; and its pairing count, rrij, is initialized to 1. In summary, tj <- t , k Aj <- {a }, fc Bj <- {b }, fc and rrij <- 1. (6.43) Also, each relation among image features is replicated among the corresponding model features. When a second or subsequent image graph is assigned to the group, that image graph is matched with the current version of the model graph. The search for this match proceeds until one is found that achieves some minimum value of the Support measure defined in section 5.4 (we use the threshold 0.75), or until a set number of alignment hypotheses have been considered (we examine at most 20 hypotheses). The resulting best match, which consists of a set of pairings E and a viewpoint transformation T, is then used to update the model graph in the following ways. 1. Paired model features receive additional samples. A model feature j that was paired with an image feature k receives additional attribute and position vector samples from k; it also has its pairing count nij incremented. The update operation is Aj <— Aj U {a }, fc Bj <— Bj U {T(b )}, fc and rrij <— nij + (6.44) 1. 2. Unpaired image features are used to extend the model graph. Unpaired image features suggest possible new model features, but to regulate growth of the model graph only some of these possible features are instantiated. The preferred features are those whose Chapter 6. Learning Procedure 136 relations may provide useful constraints during matching. Thus a new model feature j is created from an unpaired image feature k only if (a) k is related, directly or indirectly, to a paired image feature; and (b) j's relations will not conflict with those of existing model features. Specifically, if k is the p th part of some image group k', and k! was matched to some model group j', then the p part of j ' must be ± so that, without conflict, that part can be redefined th to be j . For each k satisfying these criteria, a corresponding j is created as specified by equation 6.43. A relation defining j's parts is added to the model graph, and j is included as a part in any groups j ' referred to by item (b). 3. Unpaired model features are eventually deleted. A model feature j that has been paired in less than a certain fraction of matches is deleted (we use the fraction 0.3). To avoid deleting features just created, only those matches performed since j's creation are counted. When j is deleted, so is the relation defining j's parts, and j is replaced by J_ in any relations defining j's groups. As the model graph is updated by successive image graphs, its stability increases for the following reason. A model feature j that is consistently paired with image features accumulates a high value of m,j; consequently, j is given high priority in matching, increasing the likelihood that an appropriate image feature will be paired with it if present. Conversely, a model feature that is paired too infrequently will eventually be pruned from the model graph. When all image graphs have been processed, the model graph contains the most consistent features, with representative populations of attribute and position vectors for each. Figure 6.25 and table 6.4 illustrate a typical progression of this process. Chapter 6. Learning Procedure (c) 137 (d) Figure 6.25: Example of generalizing model graph from image graphs. Image graphs from nine images spanning 20° of viewing angle, from (a) to (b), were generalized to produce a single model graph. The model graph's Curve (curve segment) features are shown in (c); its LJct (L-junction) and PCurves (parallel curve segments) features are shown in (d), where ellipses depict 2 s.d.'s of feature location uncertainty. Table 6.4 summarizes the generalization process. Chapter 6. Learning Procedure 138 Table 6.4: Example of generalizing model graph from image graphs TRAINING I M A G E 1 2 3 4 5 6 7 8 9 INITIAL M O D E L I M A G E F E A T U R E FEATURES FEATURES PAIRINGS 0 258 313 385 486 483 542 605 630 258 245 296 378 370 399 376 330 290 n.a. 196 237 295 351 345 319 300 309 M O D E L Added 258 55 72 101 33 76 74 52 27 FEATURES Deleted n.a. 0 0 0 36 17 11 27 31 Final 258 313 385 486 483 542 605 630 626 Note: With the second and subsequent training images, a model graph (col. 2) is matched with an image graph (col. 3), yielding a match that includes a set of feature pairings (col. 4). According to that match, features are added to the model (col. 5) and deleted from it (col. 6) See figure 6.25 for illustrations of the training images and final model. 6.2 Clustering Procedure The clustering procedure partitions image graphs into groups suitable for generalization. Its goals are to assign each image graph to one group, grouping those that denote similar appearances; to create as many groups as necessary, but no more; and to compose groups so that model graphs generalized from those groups meet the previously stated goals of being selective and economical. 6.2.1 C o n c e p t u a l clustering Conceptual clustering is the term used to describe the desired form of clustering (Michalski 1980). Whereas conventional clustering methods consider only the separation of instances in some metric space, conceptual clustering also considers concepts induced to represent each group. For example, it may group instances to produce concepts that favour concept simplicity, the fit of concepts to instances, or discrimination among concepts. Concept formation refers to conceptual clustering that is performed incrementally, with concepts updated as each instance Chapter 6. Learning Procedure is acquired. Thompson and Langley 139 (1991) review several conceptual clustering and concept formation methods, with emphasis on those that learn probabilistic concepts and concepts covering relational structures. Because conceptual clustering requires many comparisons in the course of assigning instances to groups and inducing generalizations from groups, and because it is costly to compare relational structures, methods that cluster structures sometimes restrict the form of those structures to permit efficient comparisons. L A B Y R I N T H (Thompson and Langley 1991), a concept formation system for probabilistic concepts covering relational structures, requires an instance to be a hierarchy of parts (or any analogous hierarchy); it induces a concept hierarchy for parts at each level of the part hierarchy, and matches structures bottom-up, proceeding from parts to whole structures. Segen's system clusters "layered graphs" that are analogous to our image graphs (Segen 1 9 9 0 ) ; to match two layered graphs it selects an assignment of their leaves, then adjusts that assignment toward a local optimum. In our case, with instances and concepts represented by image graphs and model graphs, the required matching can be performed efficiently by probabilistic alignment. 6.2.2 Clustering criteria The criteria used to guide conceptual clustering decisions represent a source of inductive bias whose nature is properly determined by the clustering's intended purpose. For example, WITT'S stated goal in forming probabilistic concepts among attribute vectors is to maximize pairwise attribute correlation within clusters and minimize correlation between clusters (Hanson and Bauer 1989); COBWEB'S predicted (Fisher 1987); goal is to find concepts for which missing attributes can be readily and Segen's system seeks a clustering that allows the concepts and instances to be represented as concisely as possible (Segen 1 9 9 0 ) . Our procedure for clustering image graphs should base its decisions on the following two goals, both directly related to recognition performance: Chapter 6. Learning Procedure 140 simplicity There should be no more groups than necessary; moreover, each group should, upon generalization, yield a model graph that is as economical as possible, containing only those features that are reliable enough to benefit recognition significantly. By limiting the number and complexity of model graphs, this goal promotes recognition efficiency. accuracy Each image graph must be included in some group; moreover, the model graph generalization of that group should characterize the image graph as completely and accurately as possible. By encouraging a close fit between model graphs and image graphs, this goal favours detailed models that are likely to discriminate objects effectively. 6.2.3 M i n i m u m description length principle The qualities of simplicity and accuracy can be neatly combined in a single framework provided by Rissanen's minimum description length (MDL) principle (Rissanen 1978), which holds that the best statistical model M for a set of data X is that yielding the shortest description of M and X combined. Similar decision criteria have been independently proposed by Wallace and Boulton (1968) for classification, by Segen (1979) for clustering, and by others. Applying the MDL principle requires schemes for efficiently coding models and data. A coding scheme is a one-to-one function C(x) that maps x, which may be model parameters or data, to a string of length L(x) bits. The coding schemes of interest are those satisfying the prefix property, which requires that no C(x) be a prefix of C(x') for any x ^ x'. This property is significant for two reasons. First, it ensures that an encoded string is self-contained, and need not be accompanied by a length or delimiter; thus a concatenation of encodings, such as C(x)C(x'), can be decoded. Second, it ensures that L(x) satisfies the Kraft inequality, X which allows information theoretic results to be used to establish the following links between Chapter 6. Learning Procedure 141 a coding scheme satisfying the prefix property and a probability distribution. Given a probability distribution P(x), we can always define an efficient coding scheme, one that satisfies the prefix property, achieves essentially L(x) — — log P(x), and has the minimum expected length 2 Y!, P(x)L(x). Conversely, given a code satisfying the prefix property, we can obtain a proper x probability distribution by defining 0-L(x) P(x) = — 77-r. (6.46) This equivalence between coding schemes and probability distributions means that MDL estimation and Bayesian estimation are, in some ways, equivalent. The MDL estimator is M = arg min L{M) + L(X | M). M (6.47) L(M) characterizes a coding scheme for models that are drawn from some predefined set of possible models. That set may comprise various parameter choices for some particular model; it may even include models of varying degree or complexity, such as mixtures of various numbers of Gaussians. L(X | M) characterizes a coding scheme that encodes the data X while assuming knowledge of the model M. The MDL estimator can be viewed as selecting the model that allows the most efficient description of the data as a "two-part" message, one that contains first the selected model and then the data encoded according to that model. Thus it favours a model that is both simple (so that its description is concise) and a good summary of the data (so that the subsequent description of the data is concise). The equivalent Bayesian estimator is M = arg max P(M) P(X | M), M (6.48) where P(M) is a prior distribution on models, and P(X \ M) is the likelihood of the data X given the model M. Despite this equivalence of MDL and Bayesian estimators, however, the MDL formulation has some advantages. It is often difficult to conceive of a suitable prior distribution on a set of models, yet easier to conceive of a way of efficiently encoding those models. In this situation, the Chapter 6. Learning Procedure 142 MDL formulation guides us to a suitable prior. Moreover, because we seem naturally inclined to choose coding schemes that require more bits to encode more complex models, this prior will usually have the property, often desirable, of favouring simple models over complex ones. In some cases the MDL formulation allows us to avoid any assumption of either a specific encoding scheme or a specific prior. One such case of relevance to us here involves encoding positive integers for which no distribution is known. By minimizing the expected code length over all possible distributions of the integers (a minimax approach), Rissanen (1983) arrives at a universal prior whose code length is given by L{n) = c + log (n) + log log (n) + . . . . 2 2 2 (6.49) The sum includes all positive recursions of the base-2 logarithm, and c « 2.865 is chosen so that 2" (") defines a proper distribution. L The MDL estimator cannot be used to estimate model parameters whose values are real numbers because an encoding of a real number may require infinitely many bits. This does not appear to be a serious restriction, however. In practical situations, the data X have only finite precision; consequently, the model parameters need only have finite precision too. Real-valued model parameters can be truncated to finite precision while including the truncation length itself as one parameter over which description length is optimized (Rissanen 1983). Alternatively, a truncation length or quantization interval can be predetermined, or real values can be restricted to rational numbers, which can be encoded as pairs of integers. Pednault (1991) compared these approaches in one surface reconstruction problem, finding in that case that predetermined quantization intervals gave the best results. 6.2.4 A p p l y i n g the M D L principle to object model induction The goal of image graph clustering, as stated previously (p. 140), is to obtain object models that are both simple and accurate. We shall now describe how those qualities are measured and combined through use of the MDL principle. Here the following notation is used: X is the set Chapter 6. Learning Procedure 143 Table 6.5: Coding scheme for object model 1. A model M — {Mi} is encoded by first encoding and then encoding each model graph Mi. with respect to the universal prior, 2. A model graph Mi = (F,R,m) is encoded by first encoding and m with respect to the universal prior. Then, for each feature j € F, j and any relation specifying j ' s parts are encoded. 3. A model feature j = {tj,rrij,Aj,Bj) is encoded by encoding tj, rrij, the series of rrij attribute vectors Aj, and parameters summarizing the series of position vectors Bj. 4. A feature type tj is encoded with respect to a suitable prior distribution over the range of feature types. (We simply use a uniform distribution, although a more realistic distribution could be determined readily from the milieu collection.) 5. A feature occurrence count rrij is encoded with respect to a uniform distribution on [l,m]. Note that m was previously encoded. 6. An attribute vector a £ Aj is encoded in L ( a ) bits. The function L is defined in the text (p. 146). a a 7. For a general model feature j, the four-dimensional Gaussian distribution summarizing Bj is characterized by 14 parameters (4 for the mean and 10 for the covariance matrix); similarly, an appropriate number of parameters are used to characterize the position distribution of a curve segment feature or a feature having some undefined position components. For simplicity, each parameter is quantized and encoded in a fixed number of bits (we use 10 bits per parameter). 8. A relation specifying j ' s parts is encoded by encoding identifiers for each of the parts with respect to the uniform distribution on [1, If j is a sequence feature, these identifiers are preceded by a count of them, encoded with respect to the universal prior. Note: Model graph notation is defined in section 4.3.4 (p. 87). Equation 6.49 specifies the number of bits needed to encode an integer with respect to Rissanen's universal prior. An integer encoded with respect to a uniform distribution on [l,n] requires [log n] bits. 2 of image graphs; Xi C X is a cluster of image graphs; Ai is the set of model graphs; Mi € Ai is the model obtained by generalizing the cluster Xf, M and X are elements of Ai and X. To evaluate model simplicity we specify a coding scheme for object models and associate the simplicity of a model with the length of its encoding. This coding scheme must satisfy the prefix property, but it need only be specified in enough detail to determine the length of encodings. The coding scheme we use is specified at that level of detail in table 6.5. The accuracy of a model is judged by how well it matches each of the image graphs used in its construction. Hence each image graph is encoded in terms of its best-matching model Chapter 6. Learning Procedure 144 graph, and the accuracy of the model is associated with the total length of these encoded image graphs. An image graph X is encoded in terms of a model graph M by considering X to be an outcome of a random variable whose probability distribution is characterized by M. Each possible match between X and M denotes an interpretation that assigns X a probability according to the distribution characterized by M. Because the match quality measure is defined as the logarithm of this probability, a match with a high match quality measure assigns X a high probability. Moreover, because of the relation between probability and code length, a good match suggests a short encoding for X. Thus, to determine a code length for X in terms of M, we first use the matching procedure to find a near-optimal match between them. From this match (E,T) and equation 5.9 on p. 95, we obtain p, the probability that X represents an outcome of M under the interpretation {E,T} (or, in other words, the probability that the object represented by M is present in the image represented by X, given E and T). The portion of X matched by M has a code length of " — [ log p~\ bits. The portion of X not matched by M does not influence p, and thus it must be 2 encoded separately. Table 6.6 specifies the scheme used to encode this unmatched portion. Because each image graph is encoded in terms of its best-matching model graph, the coding scheme must communicate the association between image graphs and model graphs. To simplify this association, it is assumed that each image graph's best-matching model graph is the one that was obtained by generalizing that image graph. In other words, we assume that if X € X is assigned to the cluster Xi, and Mi € M is the generalization of Xi, then Mi is the element of M that best matches X. This is almost certainly the case because Mi is the sole generalization of X. With this simplification, the association between image graphs and model graphs can be described by simply encoding the model graphs Mi and the corresponding clusters of image graphs Xi in order of increasing i. In turn, each Xi can be encoded by simply concatenating the encodings of its members; there is no need to encode the cardinality of Xi because it is equal to the value of m already encoded as part of Mj. Chapter 6. Learning Procedure 145 Table 6.6: Coding scheme for unmatched portion of image graph 1. The unmatched portion of an image graph (F,R) is encoded by first encoding with respect to the universal prior. Then, for each unpaired feature k € F, k and any relation specifyingfc'sparts are encoded. 2. An image feature k = (£fc,afc,bfc) is encoded by encoding t , &k and b^. k 3. A feature type tk is encoded with respect to a suitable prior distribution over the range of feature types. (We simply use a uniform distribution, although a more realistic distribution could be determined readily from the milieu collection.) 4. An attribute vector a*, is encoded in L (a/t) bits. The function L is defined in the text (p. 146). a a 5. A general image feature's position vector bjt is specified by four parameters; similarly, an appropriate number of parameters specifies the position of a curve segment feature or a feature having some undefined position components. For simplicity, each parameter is quantized and encoded in a fixed number of bits (we use 10 bits per parameter). 6. A relation specifying fc's parts is encoded by encoding identifiers for each of the parts with respect to the uniform distribution on [1, ||{F}||]. If A; is a sequence feature, these identifiers are preceded by a count of them, encoded with respect to the universal prior. Note: Image graph notation is defined in section 4.1.4 (p. 68). Equation 6.49 specifies the number of bits needed to encode an integer with respect to Rissanen's universal prior. An integer encoded with respect to a uniform distribution on [l,n] requires [log n~\ bits. 2 We can now write an expression for the code length of a model M. and its set of training images X. Decisions involved in clustering X are made so as to minimize this code length, which we shall denote L (M, X): C L (M,X) C = L (M) + m E *( L X I ')- (-°) M 6 5 Lm(-M) is the code length of M according to the scheme specified in table 6.5. L (X | M) is X the code length of image graph X in terms of model graph M, given by L (X\M) X = -log P(H\E,T) 2 + L (X,E). u Here E and T represent the best match found between X and M by probabilistic alignment. The probability that X is an outcome of M under the interpretation (E,T) is P(H \ E,T), which is defined by equation 5.9 on p. 95. (When evaluating P(H \ E,T) in this context, P(H) 146 Chapter 6. Learning Procedure is treated as a constant rather than estimated from the size of each cluster; this eliminates a bias that would tend to assign training images to large clusters (Medin 1983; Fisher and Pazzani 1991).) L (X,E) U is the code length of the unmatched portion of X according to the scheme specified in table 6.6. In tables 6.5 and 6.6, we postponed definition of the attribute vector coding scheme. We shall now show that an attribute vector's code length has little or no influence on clustering decisions, and thus it can be ignored. Consider feature k of image graph X G Xi, with attribute vector afc. According to the following reasoning, afc is likely to be encoded once in the encoding of M and X regardless of how X is partitioned. If k is matched by feature j of model graph Mi, it is very probable that afc will have been included in Aj by the generalization procedure; in that case, afc will be encoded as part of Mi. Ii k is not matched by any feature of model graph Mi, then it is very probable that afc has not been included in Aj by the generalization procedure; in that case, afc will not be encoded as part of Mi, but it will be encoded with the unmatched portion of X. In either case, afc is encoded once, and hence its code length, L (afc), appears as a constant in L (A4,X), a C independent of how X is clustered. Consequently, we simply define L (a) to be zero for all attribute vectors a. a 6.2.5 Clustering algorithm The clustering algorithm must assign training images to clusters while minimizing the measure of cluster quality, L (M, X), defined by equation 6.50. We use a simple algorithm that processes C each training image in turn, either assigning it to an "existing cluster or creating a new cluster with the training image as the initial element. This algorithm is summarized infigure6.26. An assignment to an existing cluster i is evaluated by firstfindinga match between the image graph X derived from the training image, and the model graph Mi obtained by generalizing the cluster's existing contents Xi. The search.for this match uses the same stopping criteria as the generalization procedure's match search (p. 135). If the match found is a good one (we require it to have a Support measure of at least 0.5), that match is used to further generalize the cluster's Chapter 6. Learning Procedure 147 Inputs: sequence of training image graphs X Outputs: number of clusters n, clusters Xi, and model graphs Mi n ^ 0, .M <- 0, J > < - 0 for each successive training image graph X begin y <- y u { x } € X do Try assigning X to each existing cluster: for i < — 1 to n do begin find the best match (E,T) between X and Mj if (E,T) is acceptable (e.g., Support(£,T) > 0.5) then begin M[ <— Mi generalized to include X according to (E,T) evaluate L (M \ { Mi } U { M[ }, y) end end C Try creating a new cluster containing just X: Mo <— generalization of { X } evaluate L (M U { M }, y) C 0 Adopt the best assignment of X: among the alternatives just considered, choose the alternative i that yielded the lowest value of L if i = 0 (the new cluster) then begin n < - n + l , M <- M , M ^ M U M , X <{X} end else begin M <\ { M } u { M ; }> Mi <- M ; , Xi +- xt u { end end c n 0 n 1 n x} Figure 6.26: Clustering algorithm. Members of a set of image graphs X = {X} are clustered incrementally into some number of subsets Xi, each of which is generalized by a model graph Mj. Clustering seeks to minimize the cluster quality measure L (M,X) defined by equation 6.50. A\P> denotes the set of members of A that are not in B. Not shown is the analogous procedure, described in the text, for reassigning isolated image graphs following this initial assignment procedure. C Chapter 6. Learning Procedure 148 model graph so that it will then encompass the new image graph. This new generalization M[ is obtained from Mi, X, and the match (E,T) by the generalization procedure described in section 6.1. Finally, L (M,X) C is evaluated with the cluster i represented by its new model graph Ml and the extended set of image graphs Xi U {X}. An assignment to a new cluster is evaluated by hypothesizing a new cluster XQ = {X}, obtaining its model graph Mo using the generalization procedure, and evaluating L (M,X) C with this extended set of clusters. Of these various assignments, that which yields the lowest value of L (M,X) C is adopted. This assignment process is repeated for each training image to produce a complete clustering. Occasionally this clustering will include some clusters that each contain a single, isolated training image. Consequently, a second phase of clustering is performed during which assignments are re-attempted for these isolated training images, often successfully. Reducing the reliance on matching In its simplest form, this algorithm performs considerable matching. Matching is used to compare each training image to each cluster, and to evaluate L (A4, X) for each possible assignment C of each training image. Fortunately, however, most of this matching can be avoided by simple improvements to the algorithm. An indexing method used to select subsets of likely model graphs from the model database during recognition can also be used to select subsets of likely clusters during clustering so that not all clusters need be considered for assignment of each training image. Although O L I V E R does not presently include an indexing method, any of various methods could be adopted for this purpose (e.g., Beis and Lowe 1994). To evaluate L (M, X) after generalizing a model graph from Mj to M/, it is not necessary to C perform general matching between Ml and each X 6 Xi, testing various alignment hypotheses. Instead, a previously obtained match between Mj and X suggests a single alignment hypothesis and an advanced starting point for obtaining the new match needed between Ml and X. Chapter 6. Learning Procedure 149 This new match is found by extending the partial set of feature pairings and the viewpoint transformation suggested by the previous match. Order sensitivity Incremental clustering algorithms are often sensitive to the order in which training examples are presented. Some algorithms (e.g., C O B W E B , Fisher 1987) attempt to reduce order sensitivity by reassigning examples, splitting clusters, and merging clusters in the course of training. These additional operations greatly increase the clustering algorithm's search space and, consequently, its computational complexity. Fortunately, however, order sensitivity is of little consequence in our case, for although it may affect the number and content of clusters and the coverage of individual model views, it does not appreciably affect the collective coverage of all model views. In other words, presentation order may determine whether a particular training image is assigned to one cluster or to an adjacent cluster, but in either case the model views obtained from those clusters will together cover essentially the same range of appearances. Figure 6.27 illustrates this point by showing the different clusterings produced by our simple, order-sensitive algorithm given various presentation sequences of a set of training images. A more sophisticated clustering algorithm—one that reassigns examples, splits clusters, and merges clusters to produce optimal clusterings regardless of presentation order—could produce somewhat better models that generally contain fewer model views. We suspect, however, that the improvement obtained would not justify the additional clustering effort required. Chapter 6. Learning Procedure 0° azimuth 150 40° azimuth Sequence: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 Clusters: 0 ... 8 10 . . . 20 22 . . . 34 36 . . . 40 Cluster sizes: 5 images 6 images 7 images 3 images Sequence: 0 10 20 30 40 2 12 22 32 4 14 24 34 6 16 26 36 8 18 28 38 Clusters: 0 ... 8 10 . . . 16 18 . . . 28 30 . . . 38 40 Cluster sizes: 5 images 4 images 6 images 5 images 1 image Sequence: 0 2 4 6 8 20 18 22 16 24 14 26 12 28 10 30 32 34 36 38 40 Clusters: 0 ... 8 10 . . . 12 14 . . . 24 26 . . . 40 Cluster sizes: 5 images 6 images 8 images 2 images Figure 6.27: Effect of presentation order on clustering of training images. Twenty-one training images were acqinred at 2° increments of viewing angle; the first and last images of this series, acquired at 0° and 40°, are shown. These were presented to the clustering algorithm in three different presentation sequences, producing slightly different clusterings in each case. Note that although cluster boundaries vary somewhat with presentation sequence, each cluster covers a contiguous range of azimuth, and hence of appearance; moreover, the number of clusters produced varies little. Thus the accuracy and economy of the resulting model is largely unaffected by presentation order. Chapter 7 Empirical Investigations In this chapter we describe several experiments involving a system implemented to test our recognition learning method. This system, called OLIVER, learns to recognize 3-D objects in 2-D intensity images while employing the repertoire of features described in section 4.2 (table 4.1 on p. 69 summarizes these features). We shall first present in detail an experiment that demonstrates how OLIVER learns to recognize a complex, rigid object. We shall then describe an experiment showing its ability to discriminate between similar objects, and its ability to learn models encompassing entire classes of objects. Finally, we shall demonstrate noteworthy properties of the system by briefly presenting additional experiments with various objects. OLIVER has been implemented within the framework of Vista, a versatile and extensible software environment designed to support computer vision research (Pope and Lowe 1994). Both OLIVER and Vista are written in C to-run on UNIX workstations. The execution times reported here were measured on a Sun SPARCstation 10/51 processor (SPECint92 rating: 65.2; SPECfp92 rating: 83.0). Source code for the system is available on the Internet (Pope 1995). 7.1 Illustrative E x p e r i m e n t The experiment described in this section demonstrates typical performance of the system in learning to recognize a complex object. That object is the toy bunny shown in figure 7.28. Training images of the bunny were acquired at 5° increments of camera elevation and azimuth over 25° of elevation and 90° of azimuth. Due to a data collection error, two training 151 Chapter 7. Empirical 0° elevation, 9 0 ° a z i m u t h Investigations 152 0° elevation, 10° a z i m u t h Figure 7.28: Bunny training images. Images were acquired at 5° intervals over camera elevations of 0° to 25° and azimuths of 0° to 90°. Shown here are four of the 112 images. Chapter 7. Empirical 153 Investigations images were omitted from this range: the images at 0° elevation, 0° and 5° azimuth. Four of the 112 training images are shown in figure 7.28. Features were detected in each training image using the procedure described in section 4.2. Feature detection, including edge detection, curve segmentation, and grouping, required about 30 seconds of CPU time per training image. Figure 7.29 depicts some of the features found in one image, which include 4475 edgels, 81 straight lines and 67 circular arcs (together comprising the Curve, or curve segment, features), 114 LJct features (L-junctions), 99 CrankLR, CrankRL, and Cap features (connected pairs of L-junctions), 17 CapPair features (connected triples of L-junctions), 19 PCurves features (pairs of parallel curve segments), 2 Region features (closed chains of L-junctions), and 16 Ellipse features (nearly-closed circular or elliptical arcs). The training images were presented to the system's learning procedure in various sequences. The models learned with these various sequences were all qualitatively similar in that each contained approximately the same number of groups and the same distribution of group sizes. The results reported here were obtained with the following sequence (read left-to-right, top-tobottom): (0°el.,10°az.), (0°,15°), (5°,0°), (25°, 0°), (0°,90°), (5°,90°), (25°,90°). During the first phase of clustering, the system divided the training images among 19 clusters. In the second phase, it reassigned two training images that remained the sole members of their clusters, leaving the 17 clusters shown in figure 7.30. Because this object's appearance varies smoothly with changes in viewpoint across much of the viewing range, it is not surprising that the clusters generally occupy contiguous regions of the viewsphere. Table 7.7 details the process by which some of the training images were assigned during the first phase of clustering. It is interesting to note that, because most clusters are well separated, Chapter 7. Empirical (a) Investigations 154 (b) " V? o ° o - -* in » \\ o OO O p (c) o O (d) Figure 7.29: Features of a bunny training image. Shown here are selected features found in the 0° elevation, 10° azimuth training image (itself shown in figure 7.28). (a) Edgels. (b) Curve features, (c) LJct features, (d) PCurves features (depicted by parallel lines), Region features (depicted by rectangles), and Ellipse features (depicted by circles). Chapter 7. Empirical 155 Investigations 0 5 10 15 20 25 Elevation Figure 7.30: Bunny training image clusters. Seventeen clusters, designated A through Q, were formed from the 112 training images. Contours delineate the approximate scope of the view classes associated with some of the clusters. Note, however, that because the view classes are defined probabilistically, their boundaries are actually indefinite. Chapter 7. Empirical Investigations 156 a training image's graph typically matched the model graphs of only one or two clusters (a "-" in the table indicates that no match was found). Hence, in assigning each training image, the system usually had only a few alternatives to consider beyond the initial match search. When training images were presented to the system in other sequences, the system produced different clusterings than that shown in figure 7.30. However, although cluster boundaries varied, qualities such as the number, extent, and cohesiveness of the clusters remained largely unaffected. In terms of these qualities, the clustering shown here is typical of the clusterings obtained with other presentation sequences. Altogether, 19.4 hours of CPU time were required to cluster the training images and to induce a model graph generalization of each cluster. Approximately half of this time was spent by the generalization procedure in iterative Levenberg-Marquardt optimizations, as it fit curve segments to collections of edgels to obtain the "mean" model curves described in section 5.3.1 (p. 122). Considerable time was also spent matching each training image graph to each model graph; however, as noted previously (p. 148), this time could be reduced through the use of indexing. Figures 7.31 and 7.32 show features of the model graphs representing two of the clusters. The following remarks pertain to these figures and to the model graphs they depict: • Some eccentric elliptical arcs evident among the Curve features shown in figures 7.31(a) and 7.32(a) are the result of fitting curve segments to edgels that are distributed in wide bands because they have been obtained from several training images. That these are eccentric elliptical arcs rather than shallow arcs or straight lines is of no detriment to the system's performance. • Ellipses are drawn for certain features to show two standard deviations of location uncertainty. To reduce clutter and to give some indication of feature significance, they are drawn only for those features that were found in a majority of training images. Considerable variation in location uncertainty is evident. Some LJct features have particularly large uncertainty and, consequently, they will be afforded little importance during matching. Chapter 7. Empirical Investigations - 157 Table 7.7: Clustering of bunny training images CLUSTER IMAGE El. 0 5 10 15 Az. 10 15 20 25 30 35 40 0 5 10 15 20 25 30 35 40 0 5 10 15 20 25 30 35 40 0 5 10 15 20 25 30 35 40 New 30.5 35.7 28.7 29.7 29.2 36.5 33.4 33.9 33.9 31.1 33.1 34.5 29.2 37.5 34.4 28.6 33.1 29.6 33.5 24.9 34.1 36.9 31.1 31.4 B C - n - n - - 20.9 13.0 9.0 - 38.0 - 29.7 32.2 30.6 30.2 33.9 34.6 27.5 32.0 - 8.9 -0.6 6.0 -1.0 - 6.3 - 16.9 - - - - - - 15.5 - - 14.6 0.5 1.1 - - - - - J - - - G 15.1 8.1 14.6 14.8 - C n 10.4 - - 1.6 0.5 -0.7 5.7 3.9 30.5 ASSIGNMENT A n 12.0 18.6 - 1.3 0.8 n - - 14.2 8.4 - 5.3 - - n 15.4 2.0 - - 16.5 5.0 4.6 - 0.4 -1.6 1.7 - - Note: Each row documents the assignment of one training image to some cluster. Images were clustered in the order they are listed, although not all are listed here. The New column reports the score (i.e., the change in cluster quality measure L (M, X)) associated with assigning the image to a new cluster. Columns labeled with cluster designators A, B, and so on report the scores associated with assigning the image to those existing clusters. Each image was assigned to the lowest-scoring cluster, nagged here using italics. A "-" indicates that a circumscribed search found no match between the training image and the cluster's current model graph generalization. An "n" indicates formation of a new cluster. Cluster C was merged with cluster C during the subsequent reassignment phase of the clustering algorithm. C Chapter 7. Empirical (c) 158 Investigations (d) Figure 7.31: Bunny model graph C. Shown here are selected features of the model graph obtained by generalizing the training images assigned to cluster C (delineated in figure 7.30). Each feature is drawn at its mean location, (a) Curve features, (b) LJct features, (c) CrankLR, CrankRL, and Cap features, (d) PCurves features, Region features (depicted by rectangles), and Ellipse features (depicted by circles). In (b) through (d), ellipses showing 2 s.d.'s of feature location uncertainty are drawn for those features found in a majority of training images. Chapter 7. Empirical (c) 159 Investigations (d) Figure 7.32: Bunny model graph D. Shown here are selected features of the model graph obtained by generalizing the training images assigned to cluster D (delineated in figure 7.30). Each feature is drawn at its mean location, (a) Curve features, (b) LJct features, (c) CrankLR, CrankRL, and Cap features, (d) PCurves features, Region features (depicted by rectangles), and Ellipse features (depicted by circles). In (b) through (d), ellipses showing 2 s.d.'s of feature location uncertainty are drawn for those features found in a majority of training images. Chapter 7. Empirical Investigations 160 • Large numbers of CrankLR, CrankRL, and Cap features appear to be superimposed. What may appear to be duplicates in this depiction, however, are actually distinct features representing different combinations of pairs of adjacent L-junctions. • Model graph C is of typical size, and, for example, it contains the following features: 17 straight line segments, 18 circular arcs, and 110 elliptical arcs (together comprising all Curves); 256 LJcts; 318 CrankLRs, CrankRLs, and Caps; 76 CapPairs; 66 PCurves; 8 Regions; and 25 Ellipses. The rest of this section describes tests in which the bunny model was used for recognition. The first and simplest of these uses test image 1, which is shown in figure 7.33. Although test image 1 is not identical to any of the training images, it most closely resembles the 5° elevation, 50° azimuth training image, which was assigned to cluster D during training. Table 7.8 (p. 164) reports the results of matching test image l's image graph with each of the bunny model graphs. For this test, match searches were allowed to examine all alignment hypotheses. Typically, there were 10-20 hypotheses examined for each pair of model and image graphs, and about five seconds of CPU time were needed to extend and evaluate each one. The matches reported here are those achieving the highest match quality measure. The model graph generalizing cluster D provides the best match with the image graph (as judged by each match's support measure). This is to be expected as the test image was acquired from a viewpoint surrounded by that cluster's training images. Moreover, other model graphs that match the image graph (although not as well) are all from neighbouring regions of the viewsphere. Image features included in the best match, that with model graph D, are shown in figures 7.34 and 7.35; a much poorer match, that with model graph N, is shown in figure 7.36. For test image 2, shown in figure 7.37, the bunny, camera, and lights occupied the same positions as for test image 1, but additional objects were present. Table 7.8 reports the result of matching test image 2's image graph with each of the bunny model graphs. This time each match search was limited to 20 alignment hypotheses and about six seconds of CPU time were needed to extend and evaluate each hypothesis. Due to the additional clutter in the image, only Figure 7.33: Bunny test image 1. Above: Image. Below: Curve features detected in the image. Additional features, such as LJcts and Regions, were also detected, but they are not shown here. Chapter 7. Empirical Investigations 162 Figure 7.34: Match of bunny model graph D with test image 1. Shown here are selected image features included in the match. Above: Curve features. Below: LJct features. Model graph D is that which best matches test image 1. Chapter 7. Empirical 163 Investigations II It Figure 7.35: Match of bunny model graph D with test image 1 (cont.). Shown here are selected image features included in the match. Above: CrankLR, CrankRL, and Cap features. Below: PCurves features, and Ellipse features (depicted by circles). The alignment process that found this match began by pairing the CrankRL feature defining the inside edge of the bunny's left ear. Chapter 7. Empirical Investigations 164 Table 7.8: Matches between bunny model graphs and test image graphs MODEL GRAPH A B C D E MATCH Correct Yes Yes F G H I J K L M N 0 P Q WITH T E S T Quality Yes Yes Yes Yes 998 596 1952 2347 841 298 147 1175 356 357 912 958 1024 613 151 206 204 IMAGE Pairings 164 141 252 253 160 100 88 201 91 87 195 175 200 177 72 102 86 1 MATCH Support 0.35 0.22 0.56 0.62 0.33 0.15 0.18 0.41 0.19 0.17 0.46 0.38 0.46 0.39 0.14 0.26 0.19 Correct Yes WITH TEST Quality 862 726 1634 1502 1043 275 278 438 879 768 422 521 649 593 187 579 344 IMAGE Pairings 191 166 278 207 205 115 122 139 186 197 158 165 177 162 90 161 116 2 Support 0.34 0.24 0.45 0.57 0.39 0.20 0.27 0.29 0.35 0.29 0.33 0.25 0.33 0.29 0.23 0.30 0.31 Note: Each row documents the results of matching a model graph with the two image graphs, those describing bunny test images 1 and 2. Reported for each match are the following. Correct: whether the match correctly identified most features of the object visible in the image, as judged by the experimenter. Quality: the match's match quality measure, g(E,T). Pairings: the number of image features paired. Support: the match's support measure, Support(.E, T). model graph D correctly matched the image. Figure 7.38 shows some of the features included in that successful match. The match was found on the second alignment attempt, which began by pairing a Region feature defining the bunny's right ear. The match includes 457 feature pairings, involving 204 of 897 model features and 207 of 1131 image features (Curve features are each permitted multiple pairings). Among the pairings, 354 involve Curve features, 74 involve LJct features, 8 involve features that group pairs or triples of L-junctions, 13 involve PCurves features, 1 involves Region features, and 7 involve Circle features. The initial pairing of Region features produced a transformation estimate of x t = -207 ±22, y t = -75 ± 2 0 , d t = 0.6° ±2.5°, s t = 1.3 ±0.1, 165 Chapter 7. Empirical Investigations Figure 7.36: Match of bunny model graph N with test image 1. Shown here are all model Curve features, whether included in the match or not. Each is projected onto the image at the position predicted by the feature's mean location and the match's viewpoint transformation estimate. Model graph N is that which matched test image 1 poorest among all matching model graphs. where the numbers preceded by ± are standard deviations. The final transformation estimate, based on all pairings, is x t = -203 ± 1 , y t = -43 ± 1 , 9 = -2.2° ±0.2°, t s t = 1.2 ±0.0. The additional clutter present in test image 3 (figure 7.39) is enough to prevent the system from finding correct matches with any of the bunny model graphs, even with no restriction on the number of alignment hypotheses examined. Interestingly, however, a model graph constructed solely from test image 1—showing the bunny in the same pose—does correctly match test image 3. We suspect that the better performance obtained with the test image 1 model graph is due to two reasons. First, test image 1 resembles the bunny's appearance in test image 3 better than any of the training images; hence, its model graph should be a more accurate representation of that particular appearance. Secondly, the model graph constructed from test image 1, being Chapter 7. Empirical Investigations 166 Figure 7.37: Bunny test image 2. Above: Image. Below: Curve features detected in the image. Additional features, such as L-junctions and convex regions, were also detected but they are not shown here. Chapter 7. Empirical Investigations 167 Figure 7.38: Match of bunny model graph D with test image 2. Shown here are image Curve features included in the match. derived from a single image, describes a very small range of appearance variation, whereas the other model graphs, as generalizations of larger groups of training images, must describe larger ranges of variation. A model graph that must describe a larger range of variation is able to do so less accurately because of simplifying approximations, such as that of independence among features (see section 4.3.2). These considerations suggest that more reliable recognition may be achieved by using more training images, and by adjusting thresholds so that more model views are induced, each covering a smaller range of variation. However, this improvement in reliability would be achieved at the cost of additional training and a more complex model. Further experiments are needed to investigate the relationship between number of training images, model complexity, and recognition performance. A simple way to verify a match, and thus decide whether the match represents a true instance of an object in an image, is to require that the match's Support measure exceed some threshold. For certain objects and applications, this simple criterion may suffice. Indeed, a Chapter 7. Empirical Investigations Figure 7.39: Bunny test image 3. Support threshold of 0.5 yields correct decisions about the presence of the bunny for all matches found with test images 1 though 3. (None of the incorrect matches found with test image 3 have a Support measure exceeding 0.47; matches with test images 1 and 2 are documented in table 7.8.) This section has demonstrated the system's typical performance in learning models of complex objects, and in using those models to accomplish recognition. Other objects that the system can learn to recognize are shown in figures 1.1 (p. 2), 4.7 (p. 70), 6.25 (p. 137), 6.27 (p. 150), 7.41 (p. 171), 7.44 (p. 178), and 7.48 (p. 181). Chapter 7. Empirical Investigations 169 Figure 7.40: Match of model graph based on test image 1 with test image 3. None of the bunnymodel graphs A through Q obtained by clustering training images successfully matched test image 3. However, a model graph constructed solely from test image 1 yielded a correct match. Shown here are image Curve features included in that match. Chapter 7. Empirical 7.2 170 Investigations Classes of Similar Objects The experiment described in this section tests two propositions: (a) that the system can learn to distinguish objects differing only slightly in appearance, and (b) that the system can learn a model representing an entire class of similar—not necessarily identical—objects. The objects used in this experiment were drawn from the two classes of clothespins shown in figure 7.41. To test proposition (a), the system learned separate models for the two classes of clothespins, and then it used those models to classify clothespins in test images. To test proposition (b), the system learned a single model graph encompassing both classes of clothespins, and then it used that model graph to recognize clothespins of either class. (Using two or more model graphs to represent the union of classes A and B would also allow the system to satisfy proposition (b), but in a trivial and uninteresting way.) Five images of class A clothespins and five of class B clothespins were used in training (see figure 7.41). Models were produced from the five class A training images (the class A model), from the five class B training images (the class B model), and from all ten images together (the class A/B model). In each case, the learning procedure assigned all training images to a single cluster, yielding a single model graph (although by manipulating parameter settings or the training image presentation sequence, we found we could induce formation of two clusters in the class A/B model). Model graphs for the three models are shown in figure 7.42. Note that 1 the class A/B model represents a generalization of the appearance of both classes of clothespins. Two test images, each containing all ten clothespins, provided 20 subimages with which to evaluate the recognition performance of the three models. The test images are shown in figure 7.43. Table 7.9 reports the result of matching each subimage's image graph with the model graphs from the class A and class B models. In all 20 cases the image graph "correctly" matched both model graphs, and by choosing the match with the higher match quality measure one can From more training images covering a wider range of viewing conditions, the system would learn more complete models, perhaps requiring multiple views each. 1 Chapter 7. Empirical Investigations 171 Class B clothespins Figure 7.41: Clothespin training images. Five clothespins of each class were used in training. Chapter 7. Empirical Investigations 172 Class B model Class A/B model Figure 7.42: Clothespin models. Shown here are the Curve features of models learned from class A clothespins, class B clothespins, and both classes of clothespins. Chapter 7. Empirical Investigations 173 reliably classify each test subimage as depicting either a class A clothespin or a class B one. (Direct comparison of the match quality measures obtained with different model graphs is only possible here because the model graphs are of comparable complexity.) In 19 out of 20 cases, the number of image features included in the match also provides a reliable indication of the object's class (in the exceptional case, subimage 2h, the same number of features were involved in both matches). The support measure provides a less reliable indicator of object class: it leads to correct classification in 17 out of 20 cases: We conclude that the class A and B models are complete and accurate enough to allow reliable discrimination between the two classes of clothespins, particularly on the basis of match quality measure. This establishes proposition (a). Of course, these results depend on our choice of objects: Were there less distinction between the two classes of objects, discrimination between them would be less reliable. However, in the case we have presented here, distinctions between the two classes are indeed minor. Each subimage's image graph was also matched with the model graph from the class A/B model, with the results that are reported in table 7.10. A correct match was obtained in each case, suggesting that a single model graph can be general enough to encompass both class A and class B clothespins, yet reliable enough for recognition. This establishes proposition (b). Again, the result depends on our choice of objects: Were there greater distinction between the two classes of objects, they would simply require separate model graphs and, thus, a somewhat larger model database. Chapter 7. Empirical Investigations Test image 2 Figure 7.43: Clothespin test images. 174 Chapter 7. Empirical 175 Investigations Table 7.9: D i s c r i m i n a t i o n of clothespins TEST ACTUAL SUBIMAGE CLASS la lb lc Id le If lg lh li lj 2a 2b 2c 2d 2e 2f 2g 2h 2i 2j A B A B A B A B A B A A A B B A B B A B MATCH WITH MODEL A Quality 610 306 782 292 476 87 562 137 532 169 735 401 531 419 299 635 398 291 631 517 Pairings 62 44 85 45 53 24 60 27 63 38 88 54 58 61 45 82 . 48 41 63 62 Support 77.2 69.5 82.6 77.3 59.6 17.8 69.9 46.5 65.7 45.6 78.8 52.1 76.0 66.1 38.2 67.7 65.1 70.0 82.6 69.5 M A T C H WITH M O D E L B Quality 285 518 363 437 305 376 297 237 202 554 462 229 327 749 537 370 582 387 182 702 Pairings 46 50 59 47 42 41 48 30 35 52 65 52 43 70 61 58 55 41 39 64 Support 74.5 86.2 84.9 88.9 65.5 74.1 69.1 63.9 39.3 69.7 70.7 69.2 68.8 82.5 85.3 54.4 85.8 76.0 56.5 84.6 Note: Each row documents the results of matching an image graph with two model graphs. The image graph is that describing the portion of test image 1 or 2 indicated by the Test Subimage column. The true classification of the clothespin depicted by the image graph is reported in the Actual Class column. The two model graphs are those for class A clothespins and for class B clothespins. Reported for each match are the following. Quality: the match's match quality measure, g(E,T). Pairings: the number of image features paired. Support: the match's support measure, Support(.E,T). Chapter 7. Empirical Investigations 176 Table 7.10: Generalization among clothespins TEST ACTUAL SUBIMAGE CLASS Quality Pairings Support la lb lc Id le If A Ig lh li A lj 2a 2b 2c 2d 2e 2f B 2g 2h 2i B B 2j B 589 591 731 496 535 416 572 268 574 580 766 633 520 796 568 606 641 416 617 733 60 54 70 49 56 43 58 31 60 55 82 71 57 75 63 72 59 40 56 70 74.1 67.1 78.7 76.8 63.4 62.6 71.6 64.8 67.6 61.5 74.6 83.7 68.8 65.0 71.7 61.0 74.0 69.6 80.0 74.4 B A B A B B A A A A B B A A M A T C H WITH M O D E L A / B Note: Each row documents the results of matching an image graph with the class A / B model graph, which was obtained by training with images of both class A and class B clothespins. The image graph describes the portion of test image 1 or test image 2 indicated by the Test Subimage column. The true classification of the clothespin depicted by the image graph is reported in the Actual Class column. Reported for each match are the following. Quality: the match's match quality measure, g(E,T). Pairings: the number of image features paired. Support: the match's support measure, Support(S, T). Chapter 7. Empirical Investigations 7.3 177 A d d i t i o n a l Experiments In this section, some additional experiments are briefly described in order to illustrate certain noteworthy aspects of the system's behaviour. 7.3.1 Effects of feature distribution When some regions of an object are much richer in stable features than others, those regions can dominate the matching processes that underlie learning and recognition. For example, most features of the shoe shown in figure 7.44 are concentrated near its centre. Moreover, as the shoe rotates about its vertical axis, features near the shoe's centre shift by small amounts while those near its heel and toe undergo much larger changes. Thus, when training images of the shoe are clustered during model learning, the many stable features near the shoe's centre are used to match training images over a large range of rotation, while the few variable features defining the heel and toe are dropped as being unreliable. The result is a model graph, like that shown in figure 7.45, with relatively few features defining the shoe's extremities. If the dropped features are deemed important, we can encourage the system to retain them in the models it produces by setting a higher standard for acceptable matches. For example, requiring a higher Support measure ensures that matches will include more of an object's Curve features. Thus, fewer of those features will be judged unreliable and more will be retained by the model. Figure 7.46 shows a model graph that was produced with a Support threshold of 0.6 rather than the usual value of 0.5; it provides somewhat more accurate representation of the shoe's heel and toe. Figure 7.47 shows this model graph being used for recognition. 178 Chapter 7. Empirical Investigations 12° elevation, 60° azimuth 12° elevation, 0° azimuth 0° elevation, 60° azimuth 0° elevation, 0° azimuth Figure 7.44: Shoe training images. Images were acquired at 6° intervals over camera elevations of 0° to 12° and azimuths of 0° to 60°. Shown here are four of the 33 images. Figure 7.45: Shoe model graph A. It generalizes 14 training images spanning 12° of elevation and 36° of azimuth. Left: Curve features. Right: LJct and PCurves features. This model graph was produced with a match Support threshold of 0.5. Chapter 7. Empirical Investigations 179 Figure 7.46: Shoe model graph B. It generalizes 7 training images spanning 12° of elevation and 18° of azimuth. Left: Curve features. Right: LJct and PCurves features. This model graph was produced with a match Support threshold of 0.6. Figure 7.47: Shoe recognition example. Above: Test image. Below: Image Curve features included in a match found with shoe model graph B. Chapter 7. Empirical Investigations 7.3.2 180 A r t i c u l a t e objects Just as the system will use multiple views to model an object's appearance over a range of viewpoints, it will use additional views to model a flexible or articulate object's appearance over a range of configurations. In general, the number of views needed increases exponentially with the number of dimensions along which the object's appearance may vary. The toy boat shown in figure 7 . 4 8 has a sail that rotates about the boat's vertical axis. Training images were acquired at camera elevations of 0 ° , 5 ° , and 1 0 ° ; camera azimuths of 0 ° , 5 ° , . . . , 3 5 ° ; and sail angles of 0 ° , 1 0 ° , . . . , 4 0 ° . The system's learning procedure clustered these 120 images to produce 6 4 model views, some of which are described in table 7.11. 2 Features of two of the model graphs are shown in figure 7 . 4 9 . In comparison^ only 1 3 views were needed to cover the same range of viewpoints when the sail angle was keptfixedat 0 ° . 7.3.3 Figure Difficult objects 7.50 shows two objects that OLIVER is not presently capable of learning to recognize well. The pinecone's appearance gives rise to myriad features that are closely spaced and essentially indistinguishable. Attempting to model each of these features, as the present system does, produces overly complex models whose features are so similar to each other that they provide no useful starting points for alignment hypotheses. A system able to recognize the pinecone well would have to do so on the basis of a statistical pattern of features, rather than specific, individual features. A feature for representing regions of uniformly distributed L-junctions would, if included in the present system's feature repertoire, perhaps allow it to recognize pinecones and similarly textured objects. The stuffed toy is tooflexibleto be represented by any practical number of model views T h e boat's sail contains few features compared to its hull. Thus, for reasons described in the previous section, the matches sought during model learning were required to achieve a Support threshold of 0.75 so that the sail would be fully represented in all model views. As a consequence, however, training images were divided among a relatively large number of model views. 2 Chapter 7. Empirical Investigations 181 0° elevation, 0° azimuth, 0° sail angle 0° elevation, 0° azimuth, 40° sail angle 0° elevation, 35° azimuth, 0° sail angle 10° elevation, 0° azimuth, 0° sail angle Figure 7.48: Boat training images. Images were acquired at elevations of 0°, 5°, and 10°; azimuths of 0°, 5°, . . . , 35°; and sail angles of 0°, 10°, and 20°. Shown here are four of the 120 images. Chapter 7. Empirical Investigations 182 Table 7 . 1 1 : Clusters of boat training images CLUSTER TRAINING El. A B Az. IMAGE CLUSTER Sail TRAINING El. Az. IMAGE Sail 0° 0° 0° 0° 20° 0° 0° 5° 0° 0° 20° 10° 0° 25° 0° 0° 25° 10° 0° 30° 0° 0° 20° 20° 0° 35° 0° 5° 0° 20° 0° 30° 10° 5° 0° 30° 5° 0° 40° 10° 5° 0° C D 0° 35° 10° 0° 25° 20° 0° 30° 20° 10° 10° 0° 0° 35° 20° 10° 5° 10° F Notes: Each table entry lists the training images assigned to one cluster. Not all clusters are reported. Figure 7 . 4 9 : Boat model graphs. Shown here are Curve features of model graphs that have been generalized from two clusters of boat training images. The clusters are described in table 7 . 1 1 . Left: Generalized from cluster B. Right: Generalized from cluster D . Chapter 7. Empirical Investigations 183 Figure 7.50: Difficult objects. The present system cannot learn to recognize these objects, for reasons discussed in the text. of precisely localized features, as the present system must try to do using its repertoire of features based closely on intensity edges. It would be more appropriate to model the toy at the level of parts—for example, a head, a torso and four limbs—whose geometry could then be specified somewhat loosely by the model. Approaches whose strategy is to recognize parts (e.g., Bergevin and Levine 1993; Dickinson, Pentland, and Rosenfeld 1992) are presumably better able to recognize such objects. The present system would perhaps learn to recognize the toy if its feature repertoire were extended to include higher-level features denoting the necessary parts. Chapter 8 Conclusion We have presented a method of modeling the appearance of objects, of acquiring such models from training images, and of using the models to accomplish recognition. This method can handle complex, real-world objects better than previously known methods. Indeed, in principle, it can be used to recognize any object by its appearance, provided it is given a sufficient range of training images, sufficient storage for model views, and an appropriate repertoire of feature types. This method can thus be considered a completely general approach to recognition by appearance. 8.1 M a i n Features of the A p p r o a c h The main features of the method are as follows: (a) Objects are modeled in terms of their appearance, rather than shape, to avoid any need to model the image formation process. This allows unusually complex objects to be modeled and recognized efficiently and reliably. (b) Appearance is described using discrete features of various types, ranging widely in scale, complexity, and specificity. We have adopted a basic, yet versatile repertoire of feature types comprising intensity edge segments and their groupings. This repertoire can be extended considerably, still within the framework of the approach, to accommodate a large variety of objects. 184 Chapter 8. Conclusion 185 (c) An object model represents a probability distribution over possible appearances of the object, assigning high probability to the object's most likely manifestations. Thus, learning an object model from training images amounts to estimating a distribution from a representative sampling of that distribution. (d) An object model divides the range of possible appearances into a number of discrete views. Within each view, independent probability distributions characterize the occurrence, position, and intrinsic attributes of individual features. This two-fold decomposition renders tractable the problems of learning a model and applying the model to recognition tasks. (e) A match quality measure provides a principled means of evaluating a match between a model and an image. It combines probabilities that are estimated using distributions recorded by the model. The measure leads naturally to an efficient matching procedure, probabilistic alignment, used to accomplish both learning and recognition. To identify good matches quickly, the procedure employs constraints based on features' relations, positions, and intrinsic attributes, each commensurate with its uncertainty as recorded by the model. (f) The model learning procedure has two components. One component identifies clusters of training images that ought to correspond to distinct model views. It does so by maximizing a measure that, by application of the minimum description length principle, combines the qualities of model simplicity and accuracy. The second component induces probabilistic generalizations of the images within each cluster. Working together, the two components construct a model by clustering training images, and, within each cluster, generalizing the images to form a model view. 8.2 Limitations of the Approach This recognition learning method, like any, is necessarily limited by the adequacy of its feature repertoire. It cannot learn to recognize objects for which it lacks appropriate features. Unlike Chapter 8. Conclusion 186 many others, however, this method uses an extensible feature repertoire, and it is able to ascertain, through learning, which elements of that repertoire may be relied upon in recognizing any particular object. Modeling a multifarious or highly flexible object with this method may require an impractically large number of model views. For these objects, a more effective strategy may be first to recognize parts, and then to recognize the whole object as a configuration of those parts. The present method could perhaps be extended to employ this strategy by assigning parts the role of high level features. This method does not suggest how information describing relations among views of an object—e.g., that two views are from neighbouring viewpoints—may be provided with the training set, included in the model, and used during recognition. Nor, however, does it preclude such an extension, which could aid recognition of moving objects. This method cannot distinguish objects on the basis of scale; instead it tries to recognize objects regardless of their scaling in the image. The method does, however, report the apparent scale of an object as measured directly in the image; under restrictive assumptions of viewing conditions, objects of different scale may be distinguished on that basis. Changes in perspective distortion of an object's appearance, if they are significant, must be included in the training set, as this method does not attempt to infer possible ranges of distortion. Similarly, illumination changes must be covered by the training set if those changes affect features appreciably (although features based on intensity edges have reduced sensitivity to illumination changes). 8.3 Contributions This work has centred on a few key principles that we consider applicable to many object recognition problems: • objects should be modeled in terms of their appearance; Chapter 8. Conclusion 187 • appearance should be described using a comprehensive repertoire of features; • an object should be modeled by a series of views, each characterizing features in probabilistic terms. The contributions of this work have been to articulate these ideas, translate them into a concrete method, and demonstrate their validity with a working system. Among the techniques developed by this work are representations for models and images, a repertoire of features for describing appearance in terms of intensity edges, a measure for evaluating the quality of a match between a model and an image, a search procedure for identifying good matches, a criterion for deciding whether a match represents an actual instance of an object in an image, a measure for evaluating how well a model represents a set of training images, and a learning procedure for automatically acquiring models from training images. Experiments have shown that the method works well for recognizing complex objects in cluttered scenes, and it is sensitive enough to distinguish very similar objects. Moreover, it can be used to recognize articulate objects and entire classes of similar objects. 8.4 Topics for Further Research There are several ways in which this method can be improved or augmented for greater efficiency and broader scope. Speed in both learning and recognition tasks would be greatly improved by the addition of an indexing component, which would examine image features and suggest likely model views for the matching procedure to consider. Existing indexing methods could be used, with the attribute vectors of high-level features serving as index keys. A substantial improvement in learning speed would also be achieved by adopting a faster procedure for estimating "mean" model curves from collections of edgels. The present method spends considerable time in a Levenberg-Marquardt optimization procedure as it iteratively fits elliptical arcs to edgels. Furthermore, extending the feature repertoire would allow the method to work more effectively with a broader class of objects. Useful would be features representing Chapter 8. Conclusion 188 additional groupings of intensity edges, such as symmetric arrangements and repeated patterns, and features representing regions of uniform colour or texture. Further experiments are needed to investigate the relationship between number of training images, model complexity, and recognition performance. There remains a family of challenging issues regarding how to organize a large collection of acquired models for greater efficiency. Savings in both storage and recognition time could be achieved by identifying parts or patterns common to several objects, factoring those parts out of their respective models, and recognizing the parts individually prior to recognizing their aggregates. Associating new feature types with some of the common parts and patterns would provide a means of automatically extending the feature repertoire and adapting it to the objects encountered during training. Furthermore, the same techniques of identifying and abstracting parts could be used to decompose flexible objects into simpler components, allowing those objects to be modeled with fewer views. The present work provides a foundation for investigating issues, such as these, concerning the analysis of appearance models acquired through learning. Bibliography Ammann, L. and J. V. Ness (1989). Standard and robust orthogonal regression. Communications in Statistics: Simulation and Computation 18(1), 145-162. Asada, H. and M. Brady (1986). The curvature primal sketch. IEEE Trans. Patt. Anal. Mach. Intell. PAMI-8(l), 2-14. Attneave, F. (1954). Some informational aspects of visual perception. Psych. Rev. 61, 183193. Ayache, N. and O. D. Faugeras (1986). HYPER: A new approach for the recognition and positioning of two-dimensional objects. IEEE Trans. Patt. Anal. Mach. Intell. PAMI-8(l), 44-54. Ballard, D. H. and C. M. Brown (1982). Computer Vision. Prentice-Hall. Barr, A. (1981). Superquadrics and angle-preserving transformations. IEEE Computer Graphics and Applications 1(1), 1-20. Basri, R. (1993). Recognition by prototypes. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 161-167. Beis, J. S. and D. G. Lowe (1994). Learning indexing functions for 3-D model-based object recognition. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 275-280. Ben-Arie, J. (1990). The probabilistic peaking effect of viewed angles and distances with application to 3-D object recognition. IEEE Trans. Patt. Anal. Mach. Intell. 12(8), 760774. Ben-Arie, J. and A. Z. Meiri (1987). 3D objects recognition by optimal matching search of multinary relations graphs. Computer Vision, Graphics, and Image Processing 37, 345361. Bergevin, R. and M. D. Levine (1992). Extraction of line drawing features for object recognition. Patt. Recognit. 25(3), 319-334. Bergevin, R. and M. D. Levine (1993). Generic object recognition: Building and matching coarse descriptions from line drawings. IEEE Trans. Patt. Anal. Mach. Intell. 15(1), 19-36. Besl, P. J. and R. C. Jain (1985). Three-dimensional object recognition. Computing Surveys 17(1), 75-154. Bhanu, B. and O. D. Faugeras (1984). Shape matching of two-dimensional objects. IEEE Trans. Patt. Anal. Mach. Intell. PAMI-6(2), 137-156. Bierman, G. J. (1977). Factorization Methods for Discrete Sequential Estimation. Academic Press. 189 BIBLIOGRAPHY 190 Binford, T. O. (1982). Survey of model-based image analysis systems. Int. J. of Robotics Research 1(1), 18-64. Bolle, R. M., A. Califano, and R. Kjeldsen (1992). A complete and extendable approach to visual recognition. IEEE Trans. Patt. Anal. Mach. Intell. 14(5), 534-548. Bolles, R. C. and R. A. Cain (1982). Recognizing and locating partially visible objects: The local-feature-focus method. Int. J. of Robotics Research 1(3), 57-82. Brady, J. P., N. Nandhakumar, and J. K. Aggarwal (1989). Recent progress in object recognition from range data. Image and Vision Computing 7(4), 295-307. Brady, M. (1983). Criteria for representations of shape. In B. H. Jacob Beck and A. Rosenfeld (eds.), Human and Machine Vision, pp. 39-84. Academic Press. Bray, A. J. (1990). Object recognition using local geometric constraints: A robust alternative to tree-search. In Proc. European Conf. on Computer Vision, pp. 499-515. Breuel, T. M. (1990). Indexing for visual recognition from a large model base. AI Memo 1108, Mass. Inst. Technol., A.I. Lab. Breuel, T. M. (1992a). Fast recognition using adaptive subdivisions of transformation space. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 445-451. Breuel, T. M. (1992b). Geometric Aspects of Visual Object Recognition. Ph. D. thesis, Mass. Inst. Technol. Breuel, T. M. (1993). Higher-order statistics in object recognition. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 707-708. Brooks, R. A. (1981). Symbolic reasoning among 3-D models and 2-D images. Artificial Intell. 17, 285-348. Brooks, R. A. (1983). Model-based three-dimensional interpretations of two-dimensional images. IEEE Trans. Patt. Anal. Mach. Intell PAMI-5(2), 140-150. Brunelli, R. and T. Poggio (1991). HyperBF networks for real object recognition. In Proc. Int. Joint Conf. Artificial Intell, Volume 2, pp. 1278-1284. Brunelli, R. and T. Poggio (1992). Face recognition through geometrical features. In Proc. European Conf. on Computer Vision, pp. 792-800. Burns, J. B. and E. M. Riseman (1992). Matching complex images to multiple 3D objects using view description networks. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 328-334. Burns, J. B., R. S. Weiss, and E. M. Riseman (1993). View variation of point-set and linesegment features. IEEE Trans. Patt. Anal. Mach. Intell. 15(1), 51-68. Camps, O. I., L. G. Shapiro, and R. M. Haralick (1991). PREMIO: An overview. In Proc. Workshop on Directions in Automated CAD-Based Vision, Maui, Hawaii, pp. 11-21. IEEE Computer Soc. Press. Camps, O. I., L. G. Shapiro, and R. M. Haralick (1992). Object recognition using prediction and probabilistic matching. In Proc. IEEE/RSJ Raleigh, North Carolina, pp. 1044-1052. Int. Conf. Intell. Robots and Systems, BIBLIOGRAPHY 191 Canny, J. (1986). A computational approach to edge detection. IEEE Trans. Patt. Anal. Mach. Intell. 8, 679-698. Cass, T. A. (1992). Polynomial-time object recognition in the presence of clutter, occlusion, and uncertainty. In Proc. DARPA Image Understanding Workshop, pp. 693-704. Chaudhuri, S. and S. Chatterjee (1991). Performance analysis of total least squares methods in three-dimensional motion estimation. IEEE Trans. Robotics and Automation 7(5), 707-714. Chen, C. H. and A. C. Kak (1989). A robot vision system for recognizing 3-D objects in loworder polynomial time. IEEE Trans. Systems, Man, and Cybernetics 19(6), 1535-1563. Chen, C. H. and P. G. Mulgaonkar (1992). Automatic vision programming. CVGIP: Image Understanding 55(2), 170-183. Cheng, C.-L. and J. W. V. Hess (1990). Bounded influence errors-in-variables regression. In P. J. Brown and W. A. Fuller (eds.), Statistical Analysis of Measurement Error Models and Applications, Volume 112 of Contemporary Mathematics, pp. 227-241. Providence, Rhode Island: American Mathematical Society. Chin, R. T. and C. R. Dyer (1986). Model-based recognition in robot vision. Computing Surveys 18(1), 67-108. Clemens, D. T. and D. W. Jacobs (1991a). Model group indexing for recognition. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 4-9. Clemens, D. T. and D. W. Jacobs (1991b). Space and time bounds on model indexing. IEEE Trans. Patt. Anal. Mach. Intell. 13(10), 1007-1018. Connell, J. H. and M. Brady (1987). Generating and generalizing models of visual objects. Artificial Intell. 31, 159-183. Delanoy, R. L., J. G. Verly, and D. E. Dudgeon (1992). Automatic building and supervised discrimination learning of appearance models of 3-D objects. In SPIE Proc. Vol. 1708: Applications of Artificial Intell. X: Machine Vision and Robotics, pp. 549-560. Dickinson, S. J., R. Bergevin, I. Biederman, J.-O. Eklundh, R. Munck-Fairwood, and A. P. Pentland (1993). The use of geons for generic 3-D object recognition. In Proc. Int. Joint Conf. Artificial Intell., Volume 2, pp. 1693-1699. Dickinson, S. J. and A. P. Pentland (1992). A unified approach to the recognition of expected and unexpected geon-based objects. In Applications of Artificial Intell. X: Machine Vision and Robotics, Volume 1708, pp. 614-627. SPIE. Dickinson, S. J., A. P. Pentland, and A. Rosenfeld (1992). 3-D shape recovery using distributed aspect matching. IEEE Trans. Patt. Anal. Mach. Intell. 14(2), 174-198. Dietterich, T. G. and R. S. Michalski (1981). Inductive learning of structural descriptions. Artificial Intell. 16, 257-294. Draper, B. (1993). Learning Object Recognition Strategies. Ph. D. thesis, Univ. of Massachusetts, Amherst, Mass. BIBLIOGRAPHY 192 Edelman, S. (1993). On learning to recognize 3-D objects from examples. IEEE Trans. Patt. Anal. Mach. Intell. 15(8), 833-837. Edelman, S. and H. H. Biilthoff (1992). Orientation dependence in the recognition of familiar and novel views of three-dimensional objects. Vision Research 32(12), 2285-2400. Edelman, S. and T. Poggio (1990). Bringing the grandmother back into the picture: A memory-based view of object recognition. AI Memo 1181, Mass. Inst. Technol., A.L Lab. Eggert, D. and K. Bowyer (1993). Computing the perspective projection aspect graph of solids of revolution. IEEE Trans. Patt. Anal. Mach. Intell. 15(2), 109-128. Ettinger, G. J. (1987). Hierarchical object recognition using libraries of parameterized model sub-parts. Master's thesis, Mass. Inst. Technol. Fan, T. J. (1990). Describing and Recognizing 3-D Objects Using Surface Properties. Springer- Verlag. Fichera, O., P. Pellegretti, F. Roli, and S. B. Serpico (1992). Automatic acquisition of visual models for image recognition. In Proc. IAPR Int. Conf. Patt. Recognit., Volume 1, pp. 95-98. Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning 2, 139-172. Fisher, D. H. and M. J. Pazzani (1991). Computational models of concept learning. In D. H. Fisher, M. J. Pazzani, and P. Langley (eds.), Concept Formation: Knowledge and Experience in Unsupervised Learning, Chapter 1, pp. 127-161. Morgan Kaufmann. Flynn, P. J. (1992). Saliencies and symmetries: Toward 3D object recognition from large model databases. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 322-327. Forsyth, D. A., J. L. Mundy, A. P. Zisserman, C. Coelho, A. Heller, and C. A. Rothwell (1991). Invariant descriptors for 3-D object recognition and pose. IEEE Trans. Patt. Anal. Mach. Intell. 13(10), 971-991. Forsyth, D. A., J. L. Mundy, A. P. Zisserman, and C. A. Rothwell (1992). Recognising rotationally symmetric surfaces from their outlines. In Proc. European Conf. on Computer Vision, pp. 639-647. Goad, C. (1983). Special purpose automatic programming for 3D model-based vision. In Proc. DARPA Image Understanding Workshop. Goldberg, R. R. (1994). Constrained pose refinement of parametric objects. Int. J. Computer Vision 13(2), 181-211. Gordon, G. G. (1992). Face recognition based on depth and curvature features. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 808-809. Gottschalk, P. G., J. L. Turney, and T. N. Mudge (1989). Efficient recognition of partially visible objects using a logarithmic complexity matching technique. Int. J. of Robotics Research 8(6), 110-131. Grimson, W. E. L. (1990). Object Recognition by Computer: The Role of Geometric Con- straints. MIT Press. BIBLIOGRAPHY 193 Grimson, W. E. L. and D. P. Huttenlocher (1990). On the sensitivity of the Hough transform for object recognition. IEEE Trans. Patt. Anal. Mach. Intell. 12(3), 255-274. Grimson, W. E. L. and D. P. Huttenlocher (1991). On the verification of hypothesized matches in model-based recognition. IEEE Trans. Patt. Anal. Mach. Intell. 13(12), 1201-1213. Grimson, W. E. L. and T. Lozano-Perez (1987). Localizing overlapping parts by searching the interpretation tree. IEEE Trans. Patt. Anal. Mach. Intell. 9(A), 469-482. Gros, P. (1993). Matching and clustering: Two steps towards automatic object model generation in computer vision. In Proc. A A AI Fall Symp.: Machine Learning in Computer Vision. AAAI Press. Hansen, C. and T. C. Henderson (1989). CAGD-based computer vision. IEEE Trans. Patt. Anal. Mach. Intell. 11(11), 1181-1193. Hanson, S. J. and M. Bauer (1989). Conceptual clustering, categorization, and polymorphy. Machine Learning 3, 343-372. Haralick, R. M., A. K. Mackworth, and S. L. Tanimoto (1988). Computer vision update. In P. R. C. Avron Barr and E. A. Feigenbaum (eds.), Handbook of Artificial Intelligence, Volume 4. Addison-Wesley. Havaldar, P., G. Medioni, and F. Stein (1994). Extraction of groups for recognition. In Proc. European Conf. on Computer Vision, pp. 251-261. Hel-Or, Y. and M. Werman (1994). Constraint-fusion for interpretation of articulated objects. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 39-45. Hel-Or, Y. and M. Werman (1995). Pose estimation by fusing noisy data of different dimensions. IEEE Trans. Patt. Anal. Mach. Intell. 17(2), 195-201. Horaud, R., F. Veillon, and T. Skordas (1990). Finding geometrical and relational structures in an image. In Proc. European Conf. on Computer Vision, pp. 374-384. Huttenlocher, D. P. (1988). Three-Dimensional Recognition of Solid Objects from a Two- Dimensional Image. Ph. D. thesis, Mass. Inst. Technol. Huttenlocher, D. P. and S. Ullman (1990). Recognizing solid objects by alignment with an image. Int. J. Computer Vision 5(2), 195-212. Huttenlocher, D. P. and P. Wayner (1992). Finding convex edge groupings in an image. Int. J. Computer Vision 8(1), 7-29. Jacobs, D. W. (1989). Grouping for recognition. AI Memo 1177, Mass. Inst. Technol., A.I. Lab. Jacobs, D. W. (1992). Space efficient 3D model indexing. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 439-444. Jain, A. K. and R. Hoffman (1988). Evidence-based recognition of 3-D objects. IEEE Trans. Patt. Anal. Mach. Intell. 10(6), 783-801. Kitchen, L. (1980). Relaxation applied to matching quantitative relational structures. IEEE Trans. Systems, Man, and Cybernetics 10(2), 96-101. BIBLIOGRAPHY 194 Kuno, Y., Y. Okamoto, and S. Okada (1991). Robot vision using a feature search strategy generated from a 3-D object model. IEEE Trans. Patt. Anal. Mach. Intell. 13(10), 1085- 1097. Lamdan, Y., J. T. Schwartz, and H. J. Wolfson (1988). Object recognition by affine invariant matching. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 335-344. Lamdan, Y., J. T. Schwartz, and H. J. Wolfson (1990). Affine invariant model-based object recognition. IEEE Trans. Robotics and Automation 6(5), 578-589. Lavallee, S. and R. Szeliski (1995). Recovering the position and orientation of free-form objects from images contours using 3D distance maps. IEEE Trans. Patt. Anal. Mach. Intell. 17(A), 378-390. Leonardis, A. (1993). Image Analysis Using Parametric Models. Ph. D. thesis, Univ. of Ljubljana, Ljubljana, Slovenia. Lowe, D. G. (1985). Perceptual Organization and Visual Recognition. Kluwer. Lowe, D. G. (1987a). Three-dimensional object recognition from single two-dimensional images. Artificial Intell. 31, 355-395. Lowe, D. G. (1987b). The viewpoint consistency constraint. Int. J. Computer Vision 1, 57-72. Lowe, D. G. (1990). Visual recognition as probabilistic inference from spatial relations. In A. Blake and T. Troscianko (eds.), AI and the Eye, pp. 261-279. John Wiley k. Sons Ltd. Lowe, D. G. (1991). Fitting parameterized three-dimensional models to images. IEEE Trans. Patt. Anal. Mach. Intell. 13(6), 441-450. Mahoney, J. (1987). Image chunking: Defining spatial building blocks for scene analysis. Tech. Rep. 980, Mass. Inst. Technol., A.I. Lab. Marr, D. and H. K. Nishihara (1978). Representation and recognition of the spatial organization of three-dimensional shapes. Proc. Roy. Soc. London B 200, 269-294. McArthur, B. A. (1991). Incremental Synthesis of Three-Dimensional Object Models Using Random Graphs. Ph. D. thesis, Univ. of Waterloo. McCafferty, J. D. (1990). Human and Machine Vision: Computing Perceptual Organization. New York: Ellis Horwood. Medin, D. L. (1983). Structural principles in categorization. In Perception, Cognition, and Development: Interactional Analyses, Chapter 7, pp. 203-230. Hillsdale, New Jersey: Lawrence Erlbaum Assoc. Michalski, R. S. (1980). Knowledge acquisition through conceptual clustering: A theoretical framework and algorithm for partitioning data in conjunctive concepts. Int. J. Policy Analysis and Information Systems -4(3), 219-244. Minsky, M. (1975). A framework for representing knowledge. In P. H. Winston (ed.), The Psychology of Computer Vision, pp. 211-277. McGraw-Hill. Mitchell, T. M. (1982). Generalization as search. Artificial Intell. 18(2), 203-226. BIBLIOGRAPHY 195 Mohan, R. and R. Nevatia (1992). Perceptual organization for scene segmentation and description. IEEE Trans. Patt. Anal. Mach. Intell. 14(6), 616-635. Mokhtarian, F. and A. K. Mackworth (1992). A theory of multiscale, curvature-based shape representation for planar curves. IEEE Trans. Patt. Anal. Mach. Intell. 14(8), 789-805. Moses, Y. and S. Ullman (1991). Limitations of non model-based recognition schemes. AI Memo 1301, Mass. Inst. Technol., A.I. Lab. Mundy, J. L. and A. P. Zisserman (eds.) (1992). Geometric Invariance in Computer Vision. MIT Press. Mundy, J. L., A. P. Zisserman, and D. A. Forsyth (1994). Introduction and chapter summary. In J. L. Mundy, A. P. Zisserman, and D. A. Forsyth (eds.), Applications of Invariance in Computer Vision, pp. 3-8. Springer-Verlag. Proc. Second Joint European-US Workshop on Applications of Invariance in Computer Vision, Ponta Delgada, Azores, Oct. 1993. Murase, H. and S. K. Nayar (1993). Learning and recognition of 3-D objects from brightness images. In Proc. A A AI Fall Symp.: Machine Learning in Computer Vision, pp. 25-29. AAAI Press. Pathak, A. and O. I. Camps (1993). Bayesian view class determination. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 407-412. Pednault, E. P. D. (1991). Minimial-length encoding and inductive inference. In G. PiatetskyShapiro and W. J. Frawley (eds.), Knowledge Discovery in Databases, pp. 71-92. MIT Press. Pentland, A. P. (1986). Perceptual organization and the representation of natural form. Artificial Intell. 28, 293-331. Petitjean, S., J. Ponce, and D. J. Kriegman (1992). Computing exact aspect graphs of curved objects: Algebraic surfaces. Int. J. Computer Vision 9(3), 231-255. Poggio, T. and S. Edelman (1990). A network that learns to recognize three-dimensional objects. Nature 343, 263-266. Polana, R. and R. Nelson (1994). Recognition of nonrigid motion. In Proc. DARPA Image Understanding Workshop, pp. 1219-1224. Pope, A. R. (1995). World Wide Web page. URL http://www.cs.ubc.ca/spider/pope. Pope, A. R. and D. G. Lowe (1994). Vista: A software environment for computer vision research. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 768-772. Rigoutsos, I. and R. Hummel (1993). Distributed Bayesian object recognition. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 180-186. Rissanen, J. (1978). Modeling by shortest data description. Automatica 14, 465-471. Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Annals of Statistics 11(2), 416-431. Roberts, L. G. (1965). Machine perception of three-dimensional solids. In J. Tippett (ed.), Optical and Electro-Optical Information Processing, pp. 159-197. MIT Press. BIBLIOGRAPHY 196 Rosin, P. L. and G. A. W. West (1989). Segmentation of edges into lines and arcs. Image and Vision Computing 7(2), 109-114. Sarachik, K. B. and W. E. L. Grimson (1993). Gaussian error models for object recognition. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 400-406. Sarkar, S. and K. L. Boyer (1990). An efficient computational structure for computing perceptual organization. Tech. Rep. SAMPL-90-06, Ohio State Univ., Signal Analysis and Machine Perception Lab. Sarkar, S. and K. L. Boyer (1993). Integration, inference, and management of spatial information using bayesian networks: Perceptual organization. IEEE Trans. Patt. Anal. Mach. Intell. 15(3), 256-274. Sato, K., K. Ikeuchi, and T. Kanade (1992). Model based recognition of specular objects using sensor models. CVGIP: Image Understanding 55(2), 155-169. Saund, E. (1988). The Role of Knowledge in Visual Shape Representation. Ph. D. thesis, Mass. Inst. Technol. Saund, E. (1990). Symbolic construction of a 2-D scale-space image. IEEE Trans. Patt. Anal. Mach. Intell. 12(8), 817-830. Saund, E. (1992). Labeling of curvilinear structure across scales by token grouping. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 257-263. Segen, J. (1989). Model learning and recognition of nonrigid objects. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 597-602. Segen, J. (1990). Graph clustering and model learning by data compression. In Proc. Seventh Int. Conf. Machine Learning, pp. 93-100. Morgan Kaufmann. Segen, J. and A. C. Sanderson (1979). A minimal representation criterion for clustering. In Proc. 12th Annual Computer Science and Statistics Symp., pp. 332-334. Seibert, M. and A. M. Waxman (1992). Adaptive 3-D object recognition from multiple views. IEEE Trans. Patt. Anal. Mach. Intell. 14(2), 107-124. Sengupta, K. and K. L. Boyer (1995). Organizing large structural modelbases. IEEE Trans. Patt. Anal. Mach. Intell. 17(4), 321-332. Shapiro, L. G. (1980). A structural model of shape. IEEE Trans. Patt. Anal. Mach. Intell. 2, 111-126. Shapiro, L. G. and R. M. Haralick (1981). Structural descriptions and inexact matching. IEEE Trans. Patt. Anal. Mach. Intell. 3(5), 504-519. Shvayster, H. (1990). Learnable and noniearnable visual concepts. IEEE Trans. Patt. Anal. Mach. Intell. 12(5), 459-466. Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. London: Chap- man and Hall. Smith, C. (1906). Conic Sections. Macmillan & Co. Spain, B. (1957). Analytical Quadrics. Pergamon Press. 197 BIBLIOGRAPHY Stark, L. and K. Bowyer (1991). Achieving generalized object recognition through reasoning about association of function to structure. IEEE Trans. Patt. Anal. Mach. Intell. 13(10), 1097-1104. Stein, F. and G. Medioni (1992a). Recognition of 3D objects from 2D groupings. In Proc. DARPA Image Understanding Workshop, pp. 667-674. Stein, F. and G. Medioni (1992b). Structural indexing: Efficient 3-D object recognition. IEEE Trans. Patt. Anal. Mach. Intell. 14(2), 125-145. Strat, T. M. and M. A. Fischler (1991). Context-based vision: Recognizing objects using information from both 2-D and 3-D imagery. IEEE Trans. Patt. Anal. Mach. Intell. 13(10), 1050-1065. Suetens, P., P. Fua, and A. Hanson (1992). Computational strategies for object recognition. Computing Surveys 24(1), 5-61. Tarr, M. J. (1995). Rotating objects to recognize them: A case study on the role of viewpoint dependency in the recognition of three-dimensional objects. Psychonomic Bulletin and Review 2(1), 55-82. Taubin, G. (1991). Estimation of planar curves, surfaces, and nonplanar surface curves defined by implicit equations with applications to edge and range image segmentation. IEEE Trans. Patt. Anal. Mach. Intell. 13(11), 1115-1138. Thomas, S. M. and Y. T. Chan (1989). A simple approach for the estimation of circular arc center and its radius. Computer Vision, Graphics, and Image Processing 45, 362-370. Thompson, K. and P. Langley (1991). Concept formation in structured domains. In D. H. Fisher, M. J. Pazzani, and P. Langley (eds.), Concept Formation: Knowledge and Experience in Unsupervised Learning, Chapter 5, pp. 127-161. Morgan Kaufmann. Tsai, R. Y. (1987). A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Trans. Robotics and Automation RA-3(A), 323-344. Turk, M. A. and A. P. Pentland (1991). Face recognition using eigenfaces. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 586-591. Ullman, S. (1989). Aligning pictorial descriptions: An approach to object recognition. Cognition 32, 193-254. Ullman, S. and R. Basri (1991). Recognition by linear combination of models. IEEE Trans. Patt. Anal. Mach. Intell. 13(10), 992-1006. Valiant, L. G. (1984). A theory of the learnable. Commun. ACM 27, 1134-1142. Van Huffel, S. and J. Vandewalle (1991). The Total Least Squares Problem: Computational Aspects and Analysis, Volume 9 of Frontiers in Applied Mathematics. Philadelphia: Soc. for Industrial and Applied Mathematics. Wallace, C. S. and D. M. Boulton (1968). An information measure for classification. Computer Journal 11(2), 185-194. BIBLIOGRAPHY 198 Ward, R. (1984). Comparison and diagnosis of errors for six parameter estimation methods. Int. J. Systems Science 15(7), 745-758. Wayner, P. C. (1991). Efficiently using invariant theory for model-based matching. In Proc. Conf. Computer Vision and Patt. Recognit., pp. 473-478. Weiss, I. (1993). Geometric invariants and object recognition. Int. J. Computer Vision 10(3), 207-231. Wells III, W. M. (1993). Statistical Object Recognition. Ph. D. thesis, Mass. Inst. Technol. Weng, J. J., N. Ahuja, and T. S. Huang (1993). Learning recognition and segmentation of 3-D objects from 2-D images. In Proc. Int. Conf. Computer Vision, pp. 121-128. Whaite, P. and F. P. Ferrie (1991). From uncertainty to visual exploration. IEEE Trans. Patt. Anal. Mach. Intell. 13(10), 1038-1049. Wheeler, M. D. and K. Ikeuchi (1995). Sensor modeling, probabilistic hypothesis generation, and robust localization for object recognition. IEEE Trans. Patt. Anal. Mach. Intell. 17(3), 252-265. Wong, A. K. C , S. W. Lu, and M. Rioux (1989). Recognition and shape synthesis of 3D objects based on attributed hypergraphs. IEEE Trans. Patt. Anal. Mach. Intell. 11(3), 279-290. Wong, A. K. C. and M. You (1985). Entropy and distance of random graphs with application to structural pattern recognition. IEEE Trans. Patt. Anal. Mach. Intell. 7(5), 599-609. Wong, E. K. (1992). Model matching in robot vision by subgraph isomorphism. Patt. Recognit. 25(3), 287-304. Woodham, R. J. (1987). Stable representation of shape. In Z. Pylyshyn (ed.), Computational Processes in Human Vision. Norwood, N.J.: Ablex. Yang, B., W. E. Snyder, and G. L. Bilbro (1989). Matching oversegmented 3D images to models using association graphs. Image and Vision Computing 7(2), 135-143. Zhang, S., G. D. Sullivan, and K. D. Baker (1992). Using automatically constructed viewindependent relational model in 3D object recognition. In Proc. European Conf. on Computer Vision, pp. 778-786. Zhang, S., G. D. Sullivan, and K. D. Baker (1993). The automatic construction of a viewindependent relational model for 3-D object recognition. IEEE Trans. Patt. Anal. Mach. Intell. 15(Q), 531-544.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Learning to recognize objects in images : acquiring...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Learning to recognize objects in images : acquiring and using probabilistic models of appearance Pope, Arthur R. 1995
pdf
Page Metadata
Item Metadata
Title | Learning to recognize objects in images : acquiring and using probabilistic models of appearance |
Creator |
Pope, Arthur R. |
Date Issued | 1995 |
Description | We present a method of recognizing three-dimensional objects in intensity images of cluttered scenes, using models learned from training images. By modeling each object with a series of views, representing each view with a large variety of features, and describing each feature probabilistically, our method can learn to recognize objects of unusual complexity. An object is modeled by a probability distribution describing the range of possible variation in the object's appearance. This distribution is organized on two levels. Large variations are handled by partitioning training images into clusters corresponding to distinctly different views of the object. Within each cluster, smaller variations are represented by distributions characterizing uncertainty in the presence, position, and measurements of various discrete features of appearance. Many types of features may be used, ranging in abstraction from segments of intensity edges to perceptual groupings and regions. Our learning method combines two activities: (a) an incremental conceptual clustering algorithm identifies groups of training images corresponding to distinct views of the object; (b) a generalization algorithm consolidates each cluster to produce a summary description characterizing its most prominent features. The method produces models that are both economical and representative, balancing these competing goals through an application of the minimum description length principle. Recognition requires matching features of a model with those of an image. Our matching method is a form of alignment: from hypothesized feature pairings it estimates a viewpoint transformation; from that it finds additional pairings, and so on. The method uses constraints based on features' positions, numeric measurements, and relations, assigning to each an importance commensurate with its uncertainty as recorded by the model. Decisions are ordered so that more certain features are paired sooner, while ambiguous choices are postponed for later resolution, when additional constraints may be known. We also describe a system implemented to test our recognition learning method. Experiments demonstrate the system's performance at tasks including learning to recognize complex objects in cluttered scenes, and learning to discriminate two quite similar objects. |
Extent | 22356221 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2009-02-18 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051288 |
URI | http://hdl.handle.net/2429/4774 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1996-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_1996-091473.pdf [ 21.32MB ]
- Metadata
- JSON: 831-1.0051288.json
- JSON-LD: 831-1.0051288-ld.json
- RDF/XML (Pretty): 831-1.0051288-rdf.xml
- RDF/JSON: 831-1.0051288-rdf.json
- Turtle: 831-1.0051288-turtle.txt
- N-Triples: 831-1.0051288-rdf-ntriples.txt
- Original Record: 831-1.0051288-source.json
- Full Text
- 831-1.0051288-fulltext.txt
- Citation
- 831-1.0051288.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051288/manifest