Embodied Object Recognition

by Scott Helmer

B.Sc., University of Toronto, 2002
M.Sc., The University of British Columbia, 2004

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Faculty of Graduate Studies (Computer Science)

The University of British Columbia (Vancouver)

June 2012

© Scott Helmer, 2012

Abstract

The ability to localize and categorize objects via imagery is central to many potential applications, including autonomous vehicles, mobile robotics, and surveillance. In this thesis we employ a probabilistic approach to show how utilizing multiple images of the same scene can improve detection. We cast the task of object detection as finding the set of objects that maximizes the posterior probability given a model of the categories and a prior for their spatial arrangements. We first present an approach to detection that leverages depth data from binocular stereo by factoring classification into two terms: an independent appearance-based object classifier, and a term for the 3D shape. We overcome the missing data and the limited fidelity of stereo by focusing on the size of the object and the presence of discontinuities. We go on to demonstrate that even with off-the-shelf stereo algorithms we can significantly improve detection on two household objects, mugs and shoes, in the presence of significant background clutter and textural variation. We also present a novel method for object detection, both in 2D and in 3D, from multiple images with known extrinsic camera parameters. We show that by also inferring the 3D position of the objects we can improve object detection by incorporating size priors and reasoning about the 3D geometry of a scene. We also show that integrating information across multiple viewpoints allows us to boost weak classification responses, overcome occlusion, and reduce false positives. We demonstrate the efficacy of our approach, over single-viewpoint detection, on a dataset containing mugs, bottles, bowls, and shoes in a variety of challenging scenarios.

Preface

All of the work presented in this thesis has been performed under the supervision of David Lowe. A portion of this work has appeared in a number of publications, sometimes with the aid of various co-authors.

• A significant portion of the research presented in Chapter 4 was presented as an oral at the International Conference on Robotics and Automation (Helmer and Lowe, 2010). In particular, the core probabilistic method is the same, although its description in the thesis is more refined.

• A significant portion of the research and results presented in Chapter 5 was presented as an oral at the Asian Conference on Computer Vision (Helmer et al., 2010). The co-authors David Meger and Marius Muja were both integral in the collection and annotation of the UBC Visual Robot Survey dataset that we utilized in this chapter. David Meger was also integral in implementing the system that determined the extrinsic camera parameters for the dataset via the use of fiducial markers. Section 5.3.1, about the dataset collection, was largely written by David Meger as well. The core ideas in this chapter were greatly aided by discussions with David Meger. Kenji Okuma was also instrumental in providing the interface code for the Deformable Parts Model (DPM).

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments

1 Introduction
  1.1 Motivation - Curious George
  1.2 Approach and contributions
    1.2.1 Using depth information for object detection
    1.2.2 Multiple viewpoint recognition
    1.2.3 Chamfer matching for object detection
  1.3 Thesis organization

2 Background
  2.1 Introduction
  2.2 Object detection without context
    2.2.1 Appearance-based recognition
    2.2.2 Recognition with 3D sensors
  2.3 Object detection with context
    2.3.1 3D geometric context in object detection

3 Appearance-based Object Detection
  3.1 Recognition with contour-based chamfer matching
    3.1.1 Chamfer distance
    3.1.2 Chamfer distance for detection
  3.2 Recognition with deformable parts model
  3.3 Sparse object detection model
    3.3.1 Finding optimal labelling Y^opt
  3.4 Experimental validation
    3.4.1 Experiment setup
    3.4.2 Results
  3.5 Concluding remarks

4 Stereo for Recognition
  4.1 Introduction
  4.2 Stereo vision
  4.3 Utility of stereo depth information for detection
  4.4 Probabilistic model for object detection
    4.4.1 Scale from depth
    4.4.2 Surface variation
  4.5 Evaluation
    4.5.1 Experimental setup
    4.5.2 Results and analysis
  4.6 Concluding remarks

5 Multiview Recognition
  5.1 Introduction
  5.2 Method
    5.2.1 Probabilistic model for a scene
    5.2.2 Inferring 3D objects
    5.2.3 Single-view object detection
  5.3 Results
    5.3.1 Collecting images from multiple registered viewpoints
    5.3.2 Experiments
  5.4 Concluding remarks

6 Conclusions
  6.1 Future directions

Bibliography

List of Tables

Table 3.1 The average precision for the ETH dataset comparing chamfer matching with and without scale invariance.
Table 4.1 Comparing average precision with and without a scale prior and stereo depth information.
Table 5.1 The parameter values for scale.
Table 5.2 The average precision for various object detectors for a varying number of viewpoints.
Table 5.3 The average precision for various object detectors with and without the scale prior.

List of Figures

Figure 1.1 Curious George: Visual search robot.
Figure 1.2 Fusing stereo depth information with appearance.
Figure 1.3 Multiple viewpoints provide more information, which improves object detection.
Figure 2.1 Image datasets for object recognition.
Figure 3.1 NMS applied to sliding window classifier responses produces a set of sparse detections.
Figure 3.2 Oriented chamfer distance is based on spatial and orientation differences between template and edge image.
Figure 3.3 Filters and spatial model for a DPM for bicycles.
Figure 3.4 Sample detections from the ETH dataset for the contour-based model.
Figure 3.5 Recall-precision curve for the category mugs from the ETH dataset with the VOC criterion for correct detection, IoU ≥ 50%. The values in the brackets are average precision.
Figure 3.6 Recall vs. FPPI graphs comparing contour-based approaches on the ETH dataset.
Figure 3.7 Edge images generated via Canny (1986) and Martin et al. (2004).
Figure 3.8 Histogram of the scale of detections with and without scale invariance.
Figure 4.1 Process of combining colour images, size priors, and stereo depth images to improve detection.
Figure 4.2 The geometry underlying binocular stereo.
Figure 4.3 The accuracy of depth estimation as depth increases for Point Grey's Bumblebee 2 camera. Accuracy here refers to the standard deviation.
Figure 4.4 The depth, via stereo, of sampled points on mugs.
Figure 4.5 A diagram illustrating the relationship between depth and object size.
Figure 4.6 Examples of mug detections with and without a scale prior.
Figure 4.7 Examples of mug detections with and without a scale prior.
Figure 4.8 Recall vs. precision graph comparing detection with and without a scale prior and stereo depth information.
Figure 5.1 Image of detection results comparing single-view vs. multi-view recognition.
Figure 5.2 Multiple views provide more appearance information along with the ability to localize objects in 3D.
Figure 5.3 The 3D width of the object as a function of viewpoint and the object's scale.
Figure 5.4 Sample of images comparing single-view with multi-view detection.
Figure 5.5 Recall-precision curves comparing our multi-view (3 viewpoints) approach and the single-view approach.
Figure 5.6 The performance of our system generally increases as the number of views for each scene is increased.

Glossary

Acronyms used throughout this thesis.

IoU   Intersection over Union
SRVC  Semantic Robot Vision Challenge (Rybski, December 2010)
VOC   PASCAL Visual Object Challenge (Everingham et al., 2010a)
AP    Average Precision
NMS   Non-Maximum Suppression
DPM   Deformable Parts Model of Felzenszwalb et al. (2010)
FPPI  False Positives Per Image
HOG   Histogram of Gradients

Acknowledgments

I would first like to express my gratitude to David Lowe, whose insight, guidance, and patience allowed me to see this thesis through to completion. I have a great deal of gratitude for the collegial environment at UBC. I am grateful for all the inspiration and hard work of those that worked with me on Curious George and the Semantic Robot Vision Challenge (SRVC), particularly to David Meger and Marius Muja. The Robuddies reading group has also proven a valuable source of inspiration and information, with special thanks to Jim Little, Sancho McCann, Bob Woodham, Ankur Gupta, and those previously mentioned for their insight. I would also like to thank various lab-mates who have been a great resource along the way.

I also would like to thank all of the friends that have supported and entertained me along the way. Special thanks to Dima and the rest of the Orphanage for providing a home away from home. Thanks go to my dinner and adventure partners, particularly Katie, Andrew, and Mike. And also to my friends and family back home.

Chapter 1
Introduction

Endowing computers with the ability to navigate and interact with our world has long been a goal of computer vision and robotics. Vehicles that can autonomously navigate roads or mobile robots that can clean houses, assist in elder care, or just find our coffee mug are all potential applications that rest on the ability of computers to interpret our world visually. In particular, all of these applications require the ability to recognize and localize objects. A great deal of progress has been made in many technologies that are important for scene understanding, such as tracking, structure from motion, and stereo vision. However, the ability to describe the high-level properties of the objects (e.g. category, location, pose) in a scene through vision is something that comes easily to humans, but remains a challenge for computer vision.

Consider the task of determining whether a set of image pixels is generated by some object in an un-occluded scenario, setting aside how this semantic segmentation occurs for a moment. Even in this restricted version of the problem the challenges are many. Some challenges are intrinsic to object categorization, such as intra-category variation and inter-category confusion.
For example, the variation in appearance for birds is vast, whereas the visual distinction between a bird in flight and an airplane is far more subtle. Moreover, for many objects, their defining characteristics are semantic rather than visual, and these characteristics often do not present obvious visual cues.

The challenge of inferring the properties of objects in an image is compounded by the imaging process itself. A significant amount of information is lost in the projection from 3D to 2D, and recovering some 3D properties with certainty is not a well-posed problem (Bertero et al., 1988). There are a host of techniques that can infer some of the underlying shape, 3D position, colouring and texture of surfaces, but this information is uncertain. As a result, attempts to utilize this evidence for object classification must deal with ambiguity in both categorization and the properties used for categorization. The imaging conditions may also impoverish the data available for classification. Lighting conditions may alter the colour or wash out detail due to the dynamic range of the camera. Moreover, the resolution and focus of the camera can both alter the amount of information available for classification.

Imaging conditions pose a more significant problem in the more general task of object detection, where the segmentation is not provided. For a single instance in image I, object detection is the task of determining the optimal region R* such that

R^* = \arg\max_{R} f_c(I, R)    (1.1)

where f_c is some classification function for a category c, and R is typically defined as a bounding box. Defining the form of the scoring function f_c requires addressing a number of questions.

How modular is the object detection problem? If the image is a pre-segmented object of unknown category, the problem may seem easier since we only need to determine its score rather than also searching for R*. However, image regions outside of the object may provide important clues. For example, Torralba et al. (2008) shows that human vision can accurately recognize objects in isolation with a resolution as low as 32 × 32 pixels. They go on to show that humans can detect objects at much smaller scales when that object appears in an appropriate context. An example would be detecting a keyboard on a desk in an office scene. Additional cues, such as the type of scene, 3D geometry, and object co-occurrence can all be helpful in attaining higher recognition rates.

The utility of additional context for object detection depends upon the application. An application such as optical character recognition in documents may make use of natural language processing to inform character recognition, but 3D structure is clearly of little use. For autonomous vehicles, however, inferring or sensing 3D scene structure can be beneficial in detecting relevant objects such as cars and pedestrians (Leibe et al., 2007). Here, 3D structure and geometry can inform where objects are likely to occur, potentially increasing both accuracy and computational efficiency. The central focus of this thesis is the integration of additional cues, available from the underlying 3D geometry, into object detection, and a demonstration that these cues can improve detection.
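To make the search implied by Eq. 1.1 concrete, the sketch below enumerates candidate bounding boxes over positions and scales and keeps the highest-scoring one. It is a minimal illustration rather than any detector used in this thesis: score_fn is a hypothetical stand-in for the classification function f_c, and the base box size, scales, and stride are arbitrary choices.

# Minimal sketch of the search in Eq. 1.1 (illustrative only).
# `score_fn(image, region)` is a hypothetical stand-in for f_c(I, R).
def detect_single_instance(image, score_fn, base_size=64,
                           scales=(1.0, 1.5, 2.25), stride=8):
    height, width = image.shape[:2]
    best_region, best_score = None, float("-inf")
    for s in scales:
        box = int(round(base_size * s))            # square candidate regions
        for y in range(0, height - box + 1, stride):
            for x in range(0, width - box + 1, stride):
                region = (x, y, box, box)          # candidate R as (x, y, w, h)
                score = score_fn(image, region)
                if score > best_score:
                    best_region, best_score = region, score
    return best_region, best_score                 # approximates R* of Eq. 1.1

In practice the classifiers used in this thesis are applied in exactly this sliding-window fashion, so the cost of the search grows with the number of positions and scales considered, which is one motivation for the contextual constraints explored in later chapters.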
1.1 Motivation - Curious George

A motivating experience for the approaches we explore in this thesis is our work with the robot Curious George (Meger et al., 2008, 2010), depicted in Figure 1.1, and in particular as an entrant in the Semantic Robot Vision Challenge (SRVC) (Rybski, December 2010). In this challenge, a robot is given a list of objects to look for in an unknown indoor environment. It must return a set of images that identify bounding boxes for the objects on that list within a modest time constraint. We are not given the list until the competition, but are told a subset of the categories, and for the remaining categories our robot must utilize the internet overnight to autonomously collect images and build object detectors.

Figure 1.1: Curious George: A Powerbot mounted with a peripheral-foveal camera system for use in visual search. The top right sensor is LIDAR, used for navigation and to collect point clouds. The bottom right is our visual system, with a Point Grey Bumblebee 2 stereo rig (1074x768) on the bottom and a G7 Canon (10 megapixel) on top.

The objects on that list span a range of types. Some are specific objects such as a particular book, where recognition is often referred to as instance recognition in the literature. We have found that we can build an effective object detector by finding a set of geometrically consistent feature point matches to a single training image (Meger et al., 2008), utilizing Lowe (2004)'s Scale Invariant Feature Transform (SIFT) features. The more interesting types were categorical objects, such as bottles or soccer balls. In the case where we could utilize pre-trained state-of-the-art object detectors, such as Felzenszwalb et al. (2010)'s Deformable Parts Model (DPM), issues such as occlusion and viewpoint made it challenging to find objects within the time constraint. In the case where we could only rely on a few training images we utilized a contour-template approach, where viewpoint and weak detection responses proved to be problematic.

As we push towards near real-time visual search capabilities, the trade-off between accuracy and efficiency becomes important, a trade-off that this challenge highlights. In an unknown scene, multiple images may be required to identify and locate all the objects of interest due to occlusion, clutter, and the limitations of our object detectors. However, collecting and analyzing a set of images that provides complete visual coverage is not efficient. The fact that the vision system is embodied, where we have some knowledge and control of where the images are taken, means that we can leverage this information to improve both the accuracy and efficiency of object detection. It is this insight that led to much of the work in this thesis.

1.2 Approach and contributions

The central argument of this thesis is that knowing the 3D geometry of the underlying scene can be an important cue for the task of object detection. The underlying 3D geometry of a scene can provide information concerning the presence, location, and size of the objects in an image. It also allows us to integrate the appearance information from multiple images taken from different viewpoints. It follows from information theory that our ability to infer the presence and location of objects will remain the same or improve with this additional information.
In this thesis we outline approaches that integrate appearance information and context from 3D geometry, and we demonstrate a significant improvement in categorical object detection for household objects in challenging scenarios.

1.2.1 Using depth information for object detection

There are a variety of sources of depth information for a scene from visual information. Humans, on one hand, can make use of monocular cues such as texture variations, interposition, occlusion, known object sizes, light and shading, and defocus (Bulthoff et al., 1998). In addition, humans also make use of depth cues available through binocular vision, motion parallax, and implicit knowledge via perspective. Computationally, inferring depth via the monocular cues has achieved some success (Hoiem et al., 2007; Saxena et al., 2008), but these approaches tend to be primarily successful in outdoor scenarios, and are still very much an active area of research. Approaches such as Hoiem et al. (2006) that make use of perspective, ground plane assumptions, and object sizes have proven successful in outdoor street scenes. However, these approaches contain assumptions that do not always hold in indoor settings, so we advocate more direct methods of obtaining depth information.

Figure 1.2: A high resolution camera captures the appearance of a scene, while a stereo camera captures depth information. An appearance-based sliding window detector utilizes edges to determine likely locations of a mug. A scale prior on the 3D size of the object, in conjunction with the depth image, allows us to update our posterior, thereby reducing false positives.

One particular technology that we demonstrate is effective in assisting object detection is stereo vision. It has the advantage of being a mature technology, with a variety of commercial cameras available, along with a bevy of algorithms to process stereo images. Another advantage is that stereo vision is effective in both indoor and outdoor settings, unlike projected-light technologies such as the Kinect (2010). One contribution of our work with stereo is that we show that even low-resolution stereo depth data can improve object detection on a challenging dataset of mugs and shoes. We note that, due to missing data and the limited fidelity of the data, the depth information may be insufficient to depict distinctive surface characteristics for household objects. Instead, the approach we develop robustly utilizes priors on object size and variation to improve both accuracy and efficiency when combined with object detectors, as depicted in Figure 1.2. We formulate the problem as one of determining the set of objects that maximizes the posterior probability, factoring the likelihood as a product of an appearance classifier and a term for the 3D shape of the category. In this way, we can train the terms separately and swap in different appearance-based object classifiers.
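The mechanism depicted in Figure 1.2 can be sketched in a few lines of code. The following is a simplified illustration, not the probabilistic model developed in Chapter 4: focal_px, the mug-sized Gaussian size prior, and the use of the median depth inside each box as a guard against missing stereo data are all assumptions made for the example.

import numpy as np

# Illustrative re-scoring of appearance-based detections with a stereo depth
# image and a Gaussian prior on 3D object size (a sketch, not Chapter 4's model).
def size_prior_weight(box_height_px, depth_m, focal_px,
                      size_mean_m=0.10, size_std_m=0.03):
    implied_size_m = box_height_px * depth_m / focal_px   # pinhole camera model
    z = (implied_size_m - size_mean_m) / size_std_m
    return float(np.exp(-0.5 * z * z))                    # size prior weight

def rescore_with_depth(detections, depth_image, focal_px):
    rescored = []
    for (x, y, w, h), appearance_score in detections:
        patch = depth_image[y:y + h, x:x + w]
        valid = patch[np.isfinite(patch) & (patch > 0)]   # stereo often has holes
        if valid.size == 0:
            rescored.append(((x, y, w, h), appearance_score))  # no depth support
            continue
        depth_m = float(np.median(valid))                 # robust depth estimate
        weight = size_prior_weight(h, depth_m, focal_px)
        rescored.append(((x, y, w, h), appearance_score * weight))
    return rescored

The design choice illustrated here is the same one argued for above: the depth data is used only for coarse geometric reasoning about size, which tolerates holes and low fidelity, rather than as a source of discriminative surface features.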
1.2.2 Multiple viewpoint recognition

There are a variety of applications where images from multiple viewpoints of the same scene are available, including those involving mobile robotics and surveillance. Imagery from multiple viewpoints provides not only more complete appearance information about the objects in the scene, but also the ability to infer the 3D geometry of the scene if the extrinsic camera parameters are known. We propose an approach where we integrate images I_1, ..., I_n from n viewpoints for the purpose of determining a set of objects O, and their locations both in 2D and 3D, that maximizes the posterior probability p(O | I_1, ..., I_n). The novelty is primarily that we also infer the 3D location of detections.

There are a number of advantages to multiple viewpoint detection. One advantage is that multiple viewpoints allow us to recover from situations where a single viewpoint is not sufficient for recognition due to occlusion, or because the underlying detectors are not confident due to the challenging viewpoint or category. In addition, it also helps eliminate false positives, both in the case of spurious clutter and in the case where an object may look similar to others from particular viewpoints. Another advantage is that we can reason about object size and location, which can reduce false positives as well. Figure 1.3 shows a sample of where our approach succeeds in detecting more objects due to integrating multiple viewpoints.

The primary contribution of our work in multi-view object detection is a novel approach to integrating images from multiple viewpoints for the purpose of identifying and localizing objects both in the images and in 3D. We demonstrate the efficacy of our approach to multi-view object detection on a challenging set of scenes containing household objects, where we find significant improvement over an approach that considers each image separately.

Figure 1.3: Sample of true positive (solid) and false positive detections (dotted) of a bowl (green) and mug (blue) in different viewpoints. The rays are projections of the detections from known camera parameters, whereby true positives agree on a 3D location, and false positive detections do not have supporting evidence from other viewpoints.

1.2.3 Chamfer matching for object detection

At the foundation of our work on integrating stereo and multiple viewpoints into object detection is an appearance-based object classifier. In this thesis, we utilize two approaches: a contour-template-based approach and the Deformable Parts Model (DPM) of Felzenszwalb et al. (2010). Our contour-based approach is a classifier that utilizes chamfer matching to compare the edges in an image region to a template contour. In this thesis we introduce an improved version of chamfer matching that is scale-invariant and demonstrate that it is more effective than non-scale-invariant versions.

1.3 Thesis organization

The thesis begins with a discussion of the literature and ideas that are relevant in Chapter 2. We follow this in Chapter 3 with an overview of the appearance-based object detectors that we utilize in our research. The contour-based approach that we present in Chapter 3 also introduces an improvement to chamfer matching that makes it scale invariant. In Chapter 4, we describe the model that integrates stereo depth information with appearance-based object detectors. In the following chapter, Chapter 5, we describe a model for multi-view object detection and show how to efficiently utilize this model to infer the location of objects. Experimental validation for our contributions is provided at the end of each chapter. In the final chapter, Chapter 6, we summarize our work and discuss future directions.

Chapter 2
Background

2.1 Introduction

The central topic of this thesis is categorical object detection, which relies on the ability to recognize a pattern and to find instances of that pattern in an image.
The quantity and variety of research that is relevant to addressing this task is vast, spanning sensing technologies, machine learning, pattern recognition, image representation, and scene understanding, to name just a few areas. An important research question is how independent the above fields are in relation to object detection. Indeed, debate concerning this question has continued since computer vision's infancy. As Roberts (1965) notes, very early work took a reductionist approach, focusing only on pattern recognition of characters, with the hope that this work would extend to more sophisticated tasks. However, he notes that these approaches ignored the underlying 3D geometry of a scene, which can also be fruitfully utilized to determine the presence and location of an object, if the underlying geometry can be recovered.

There are many approaches that explore the spectrum between pure pattern recognition, which simply treats an image as a vector and encodes no prior knowledge about how an image is formed, and approaches that rely heavily on explicit reasoning about the 3D structure of a scene. A considerable amount of research in the 70s and 80s explored this spectrum, a good review of which can be found in Besl and Jain (1985). A prominent example is the ACRONYM system of Brooks et al. (1979). This system represents an object as a collection of 3D parts, the base element being generalized cones, with constraints on the size and shape of the cones, and constraints on the relative positions, orientations, and composition of these parts. The manifestations of these generalized cones in the image take the form of "ribbon" and "ellipse" features, so the recognition process proceeds by extracting these image features and then determining if they satisfy the constraints of the object model. A significant problem in this approach, and in many early approaches, is that the systems were built with an eye towards what an object should look like in an image, rather than what an object does look like. The difficulty in providing an object model a priori, and the use of heuristics and human engineering to close the gap between the model and the image, resulted in brittle systems that did not scale.

With the increasing availability of image datasets, the pendulum swung more towards appearance-based approaches. Rather than represent an object category in 3D, the appearance of a category in 2D could be learned directly from data using recently developed machine learning techniques, resulting in systems that were less brittle. Early datasets such as the MNIST digits (LeCun et al., 1998) carried significant intra-category variation, but little variation in imaging conditions such as background and viewpoint. The CMU-MIT face dataset (Rowley et al., 1998), Caltech-4 (Fergus et al., 2003), and the UIUC car dataset (Agarwal and Roth, 2002) brought in a wider variety of realistic imaging conditions in the form of background clutter and lighting conditions, encouraging more robust algorithms and models. However, these datasets still lacked viewpoint variation, and contained only a few categories. The Caltech-101 dataset (Fei-Fei et al., 2006) includes many more categories, but again lacks viewpoint variation. Moreover, the lack of generic background images, and the fact that object instances occupy most of the image, led to questions as to whether success in classification on Caltech-101 would extend to object detection in more general scenarios.
All of these datasets, a sample of which can be seen in Figure 2.1, were important as benchmarks but tended to encourage techniques that could not scale to the full complexity of generic object detection (Ponce et al., 2006). The success of recognition in more constrained scenarios, like digit recognition (LeCun et al., 1998) or particular object categories from a single viewpoint, such as the frontal face detector of Viola and Jones (2001), has pushed the field towards attempting more generic categorical recognition in less constrained scenarios. This is exemplified by the current dominant benchmark, the PASCAL Visual Object Challenge (VOC), a sample of whose images can be seen in Figure 2.1 (Everingham et al., 2010a). VOC is an annual competition that compares approaches on a challenging set of images and 20 diverse categories, ranging from various animals and vehicles to potted plants and chairs. Moreover, there is more occlusion and a wider variation in viewpoint, both conditions that are somewhat absent from previous benchmarks. The challenge of this dataset has encouraged the pendulum to swing back towards the middle ground, resulting in more efforts to integrate both appearance-based modelling and contextual reasoning.

Figure 2.1: A sample of images from prominent datasets in object detection (from top to bottom: MNIST, CMU-MIT, Caltech-4 and UIUC, Caltech-101, and PASCAL VOC), where the difficulty in category and imaging conditions increases from earlier (top) to more recent examples (bottom).

We begin this review of previous work in Section 2.2 by discussing research into object detection that does not utilize context. The bulk of this work is concerned with appearance-based recognition, which operates on single images only, although we do briefly discuss object recognition that utilizes 3D sensors as well in Section 2.2.2. This is followed by a review of object detection that leverages context in Section 2.3. The literature review we offer in this chapter is intended to place our work in context, and for the sake of brevity is not meant to be exhaustive, only representative.

2.2 Object detection without context

2.2.1 Appearance-based recognition

The challenge with appearance-based object detection is not only intra-category variation and inter-category confusion; the detector must also be robust to imaging conditions such as illumination, background clutter, and viewpoint, all of which can obscure distinctive visual features. One tactic is to learn this robustness purely from the data itself, whereby we do not encode any prior knowledge about the nature of images, such as the high correlation between neighbouring pixels or the fact that minor differences in translation and scale are generally irrelevant. Moreover, these types of approaches typically ignore the fact that images arise from a 3D world and thus make no attempt to model the 3D nature of an object. There are ongoing research efforts that attempt to begin at the ground floor, such as the work on deep generative models of Hinton et al. (2006) or the convolutional neural networks of Kavukcuoglu et al. (2010). The recent work of Goodfellow et al. (2009) and Kavukcuoglu et al. (2009) shows that learning features that are robust to some imaging conditions is possible. However, these approaches have yet to scale to more general categorical detection. The general theme in modern approaches to recognition is to attempt to achieve some robustness with a little engineering, and let machine learning do the rest.
In this case, image processing techniques are employed to provide a representation that is somewhat robust to imaging conditions. Ideally the internal representation of an image should not change greatly under changes in illumination, or as an object undergoes minor changes in scale, translation, or rotation. With an image representation robust to imaging conditions, approaches to modelling an object category via these image representations can then focus on robustness to inter-category confusion and intra-category variation. As a result, fewer parameters are required and less training data is needed, since the training data only needs to capture the variation in the category.

A significant trend in image representation has been to aggregate information over local regions, either on the image plane or in scale. Pooling information can increase the robustness to translation and scale. The types of information aggregated include edges (Belongie et al., 2002) and image gradients, such as Lowe (1999)'s SIFT features and Dalal and Triggs (2005)'s Histogram of Gradients (HOG), both of which also offer some invariance to brightness changes due to normalization. Mid-level features can be aggregated as well, including filter responses (Mutch and Lowe, 2008; Serre et al., 2005; Torralba et al., 2004) and local features quantized into elements of a codebook (Lazebnik et al., 2006; Zhang et al., 2006). This type of mid-level information provides some robustness to intra-category variation. There are also a variety of ways to pool this information, including averaging via convolution or as histograms, which can often be computed efficiently using integral images (Viola and Jones, 2001).

Another approach is to utilize interest point detectors and represent the image via a set of region descriptors at these interest points (Fergus et al., 2003; Lazebnik et al., 2006; Lehmann et al., 2011; Leibe et al., 2004; Lowe, 1999; Zhang et al., 2006). This approach is particularly successful for instance recognition, at least for objects with internal texture (Lowe, 1999). There are a bevy of interest point detectors that are invariant to Euclidean and affine transformations, which provides some robustness to imaging conditions (Tuytelaars and Mikolajczyk, 2008). However, interest point detectors are not necessarily invariant across a category, which means approaches that rely on them may fail when their interest point detectors do not "fire" on the relevant regions.

Another approach to image representation is to utilize edge detectors to produce edge images. The advantage of this approach is that it is somewhat invariant to illumination. It also focuses attention on object boundaries, which are often the most distinctive feature for some categories, such as mugs and bottles, where the surface colouring is irrelevant. Edge images were also amongst the first alternative image representations, since many of the early approaches modelled objects in terms of geometric structures which could, in theory, be inferred from boundary information. Edge images, however, tend to be noisy, often with edges that can best be described as clutter and with some object boundaries missing. A recently popular approach in object detection to overcome this is to utilize chamfer matching (Gavrila, 2007; Opelt et al., 2006; Shotton and Blake, 2007) and represent a category as a collection of contours to capture shape variability.
The approach to contour-based recognition that we develop in Chapter 3 falls under this umbrella, employing a version of chamfer matching that is scale-invariant and is effective for scenarios where training data is limited.

One of the most prominent approaches to modelling an object category is the part-based approach, which utilizes a collection of parts with a model for the appearance of the parts and their spatial relationship to one another. Some of the early approaches (Burl et al., 1998; Fischler and Elschlager, 1973; Mohan et al., 2001) were limited not only in their access to large image datasets, but also by their reliance on human annotation of the "relevant" parts and on descriptors with limited invariances. Works such as Agarwal and Roth (2002), Fergus et al. (2003), and Weber et al. (2000) extend this part-based framework to rely instead upon interest point detectors, removing the human from the loop. More recent approaches abandon interest point detectors altogether (Felzenszwalb et al., 2010; Opelt et al., 2006; Shotton and Blake, 2007). The primary advantage of a part-based representation is that we can model the appearance of a part and its spatial relationship to other parts separately, which is considerably easier than modelling them jointly. We also adopt the part-based approach of Felzenszwalb et al. (2010) as an appearance-based object detector.

Viewpoint-invariant recognition

The review up until this point focused upon recognition from a single viewpoint, often the most "typical" and distinctive view, such as the side profile of a bike or the frontal view of a face. This kind of recognition is often referred to as canonical viewpoint recognition, placing emphasis on robustness across category rather than across viewpoint. This approach makes sense for applications where one can control the orientation of the category under consideration or where the viewpoint varies little from the canonical view. However, in general this assumption does not hold.

In the related task of instance recognition, approaches that build on the ideas of Lowe (2001) and Rothganger et al. (2006) are able to successfully recognize textured objects, particularly packaged products that tend to have logos and text. Here, they utilize hard geometric constraints between local features, which are already invariant to various imaging conditions, in order to determine whether a particular image contains the object in any pose. However, these approaches rely on interest point detectors and distinctive feature descriptors, which are less robust when there is significant intra-category variation.

Chakravarty and Freeman (1982) tackle viewpoint invariance by assembling a set of classifiers, each trained independently on a number of characteristic viewpoints, and then utilizing the maximal response from that set. Fan and Lu (2005), Schneiderman and Kanade (2000), and Torralba et al. (2007) have all extended this idea for categories such as cars and faces, predefining the characteristic views ahead of time. Felzenszwalb et al. (2010) utilize a mixture model to capture different viewpoints, sorting the data into characteristic views according to the aspect ratio of the bounding box for the objects, which is not particularly effective for categories where significant changes in shape due to viewpoint do not result in differences in aspect ratio. The advantage of assembling a set of independent classifiers is that it can leverage existing work on canonical-viewpoint recognition.
However, a naive combination of n single-view classifiers can potentially suffer from an O(n) explosion in the number of false positives.

Many earlier approaches to recognition attempt to model a category in 3D rather than 2D, and perform recognition by mapping the 2D image features to the 3D model (Brooks et al., 1979; Lowe and Binford, 1985; Roberts, 1965). These approaches are naturally viewpoint-invariant, but suffer from the challenge of specifying the model and modelling variation. Many of the more recent approaches to 3D object recognition attempt to create a more unified object model, where each viewpoint is not treated as an independent entity. The works of Liebelt et al. (2008) and Thomas et al. (2009) try to unify appearance across viewpoints on a per-feature basis, where viewpoints can share features. The works of Kushal et al. (2007), Savarese and Fei-Fei (2007), Su et al. (2009), and Sun et al. (2009) try to determine groupings of local features that are geometrically stable across a range of viewpoints, essentially identifying planar regions. Within a group, local features have strong constraints that are robust to Euclidean transformations, while between groups there are weaker constraints that are robust to projective transformations. Recognition then amounts to determining whether sets of local groupings can be found where both the within-group and between-group constraints are satisfied. It is unclear whether this work scales to categories with significant intra-class variation, or to categories with little planarity.

One of the most significant challenges in viewpoint invariant recognition is that training data is thus far fairly limited. The recent success in recognition is mostly of the canonical variety, and even then the quantity of training data required is significant, on the order of hundreds of images for the most successful approaches. The variation across viewpoint is typically more significant than across category, so the quantity of training data must likewise increase.

2.2.2 Recognition with 3D sensors

There has been a considerable amount of progress and success in recent years in image-based object detection, both inside and outside of the research laboratory. The progress of object detection that relies on directly acquired 3D data, on the other hand, has not been as rapid. This discrepancy is due in no small part to the internet and the ubiquity of digital cameras, which have led to the increasing availability of image datasets. For 3D systems, dataset collection is still very much undertaken by research labs, thus limiting the variety and number of training instances available. That is not to say that no effort has been expended; very recently the number of public datasets has increased substantially, with examples such as the wire mesh models of Google 3DWarehouse (Google, 2009) and RGB-D datasets collected using the Kinect (Janoch et al., 2011; Lai et al., 2011a). As a result of the dearth of datasets, most work has focused on the related tasks of instance recognition and pose estimation rather than categorical object detection. In fact, a few recent approaches that leverage both 3D data and images have found that 3D information is not particularly discriminative on its own (Janoch et al., 2011; Lai et al., 2011b; Quigley et al., 2009). In combination with appearance-based approaches, however, detection does improve. Some 3D features that have proven successful include HOG applied to a depth image (Hattori et al., 2009; Janoch et al., 2011; Lai et al., 2011b) and SPIN image descriptors (Lai et al., 2011b).
A more thorough review is beyond the scope of this thesis, since we instead focus on the contextual cues that stereo imagery can reliably provide. The reason for this is that stereo imagery is often not accurate enough to extract discriminative features for detection for most objects. There are a number of approaches that use similar contextual cues derived from 3D data as well, but we leave that discussion to Section 2.3.1.

2.3 Object detection with context

In human vision, the role of context can be seen in both our ability to detect the presence of an object in a scene and our efficacy in search. Torralba et al. (2008) establishes that we are significantly better at recognizing objects in context when the resolution of the object is low. The logic of this work suggests that in low resolution situations, challenges due to occlusion, illumination, and viewpoint would also take their toll on recognition performance. However, in higher resolution situations humans have little difficulty recognizing familiar objects, with or without context (Parikh et al., 2008). The role of context is much greater when our ability to detect objects is constrained by time or cognitive resources, as when our attention is focused elsewhere. The early works of Biederman et al. (1973) and Hock et al. (1975) establish that the speed at which a human can detect an object is related to whether the object belongs in a scene and how coherent its placement is. A fire hydrant on top of a mailbox will be found later in a search task, since the manner in which humans fixate in an image depends somewhat on scene context (Ehinger et al., 2009). In situations where our attention is diverted, we sometimes fail to notice objects that would otherwise capture our attention. Simons and Chabri (1999) dramatically demonstrate this: when experimental subjects were told to count the number of times a ball was passed between players in white in a short video, half the subjects failed to notice the gorilla-costumed person that walked through the scene.

The gorilla experiment highlights the challenges of object detection and visual search under constraints. In an embodied vision system, like a robot, time, energy, and computational resources are all limited. Object context in this case becomes increasingly important because it can provide cues that are relevant in determining the presence, location, size, appearance, and other properties. This is particularly important in the localization aspect of object detection, since object classifiers are often only robust to minor changes in scale and translation, and so search must be performed at numerous scales and locations. The computational burden of this search is compounded when there are multiple objects of interest or when the robot is also tending to other tasks.

The use of context for object recognition has been explored periodically, with numerous scene understanding systems in the 70s such as Hanson and Riseman (1978)'s VISIONS, and again in the early 90s with systems similar to Strat and Fischler (1991). More recently, object recognition systems tend to be more modular, incorporating a subset of a wide variety of sources of context (Divvala et al., 2009). The co-occurrence and co-location in the image plane of image textures (Shotton et al., 2006), semantic labels (Carbonetto et al., 2004; Kumar and Hebert, 2005), and objects (Desai et al., 2009) have all been leveraged with some success to improve object detection.
The semantic context of a scene has similarly been explored, with the type of event (Li and Fei-Fei, 2007), scene category, and place (Torralba et al., 2003) all proving informative for detection. The temporal context has also been explored in the form of video, including numerous tracking-via-detection approaches and optical flow.

2.3.1 3D geometric context in object detection

In this thesis we focus on utilizing the 3D geometric context of a scene in object detection, where the 3D geometric context of an image is the 3D spatial arrangement of objects and surfaces. The 3D geometric context is of course not unrelated to the 2D geometric context that we mentioned briefly in the previous section. The advantage of reasoning explicitly about 3D context is that it is more robust to viewpoint variation and less prone to biases in training data.

The most informative cue for object detection is surface layout, which can inform the presence, location, scale, and pose of an object. Early in computer vision research, Roberts (1965) notes the relationship between supporting surfaces and an object's size both in 3D and 2D, whereby with knowledge of any two the third can be inferred. The prominent and more recent work of Hoiem et al. (2006) extends this idea by utilizing the horizon line and lack of camera roll to infer the ground plane, and thereby infer likely locations of objects in the image plane using prior knowledge of object size. This approach is particularly effective for outdoor scenes, and has been expanded in a variety of subsequent works to analyze traffic scenes (Ess et al., 2009; Leibe et al., 2007; Wojek et al., 2010). Inferring the supporting surfaces in indoor scenes can be more challenging due to the lack of a horizon and the failure of geometric context operators on indoor scenes (Hoiem et al., 2007). Bao et al. (2010) get around this by jointly modelling the supporting surface(s) and the objects' locations, sizes, and poses. They demonstrate that jointly inferring these properties is more effective than inferring them independently, resulting in better detection results and a set of supporting surfaces. This approach holds promise, but their reliance on object detection that also reports pose is problematic given the current state of the art. Moreover, they validate this work on toy scenarios with little clutter and several easy-to-recognize instances.

There are also a variety of approaches that use the depth of the surface to inform object presence and the size in the image. This includes the work of Sudderth et al. (2006), who propose modelling the location of objects jointly in 3D. They show that if an object is spatially related to another object, say a monitor and a desk, knowing the scale and location of one object provides a good prior on the scale and location of the other. However, spatial relationships tend to be weak, so this approach fails to generalize. A more direct approach is to use depth sensors to determine surface layout, and this is the approach we take in Chapter 4. Gavrila and Munder (2007) effectively use stereo depth information and a prior on object size to improve pedestrian detection, although the system is coupled with the appearance detector, so it is unclear how well it generalizes to other categories. The approach of Gould et al. (2008) uses a LIDAR coupled with a traditional camera on office scenes, showing that properties such as height from the ground, object size, and surface variation can all be learned from data and can improve detection.
However, this requires a significant amount of training data, and the reliance on LIDAR restricts its applicability. The work that is most similar to our own in Chapter 4, though developed independently, is that of Schindler et al. (2010). In their work they construct a probabilistic model that is similar to ours, where they factor classification into terms that deal with appearance and terms that deal with object size via stereo depth data. They also make use of robust estimators to overcome missing data in stereo. They exploit the relationship between depth and object size to improve detection, and they also make use of surface variation to suppress false positives. The primary difference is that they utilize assumptions about the ground plane that are effective for outdoor scenarios but do not necessarily hold for indoor scenarios.

Multiview recognition

The final research contribution of our thesis, in Chapter 5, concerns multiple viewpoint recognition, which integrates information collected from different viewpoints. We consider this work as also falling under 3D context, since we can jointly reason about the 3D arrangement of objects and their appearance in the image plane. Integrating information across many images has been a major focus of the active vision research community, where the goal is to determine from what viewpoints to acquire images for the purpose of recognition (Dickinson et al., 1997; Laporte and Arbel, 2006; Schiele and Crowley, 1998; Wilkes and Tsotsos, 1992). Laporte and Arbel (2006) describe Bayesian strategies for combining uncertain information between views. They suggest the use of a generative model of object appearance conditioned on the object label and other properties such as pose and lighting, along with a sequential update strategy, in order to solve this problem in a scenario of actively acquiring images. However, active vision has typically been used to recognize specific objects, thus avoiding the wide degree of uncertainty in categorical object detection. We extend some of these ideas to categorical object detection, where accurate pose estimation is still a challenge.

Several authors have also recently considered fusing information temporally over several frames of a video sequence for object detection. Andriluka et al. (2010) use a bank of viewpoint-dependent human appearance models and combine these with a learned motion prior in order to gather consistent information across motion tracks. Also, Wojek et al. (2009) infer the location of pedestrians simultaneously with the motion of a vehicle in order to achieve localization in 3D from an on-board camera. This differs from our own work primarily in that our viewpoints vary significantly, and thus inferring which detections belong to the same object is more of a challenge. The recent work by Coates and Ng (2010) most closely resembles our own, although it was developed independently. Here, they first use multiple images and rough registration information to determine possible corresponding detections. We both determine the posterior probability for each set of corresponding detections across viewpoints, using non-maximum suppression to discard errant correspondences. Their work differs from ours most significantly in that their posterior probability for a correspondence is based solely on appearance, whereas our work includes geometric information as well.
In addition, their experimental validation is limited, presenting multi-view results on a single category, where the difference in viewpoint is not particularly significant. Our work presents a formulation that is more general, with more extensive experiments to demonstrate the utility of multi-viewpoint detection.

Chapter 3
Appearance-based Object Detection

Object detection is a task that involves both category recognition and localization. Localization is a computational challenge dependent upon the structure of the classification function. Recognition is dependent upon the classification function itself, where the main focus is accuracy, with less attention given to the efficiency of that function. In this thesis, we explore the use of 3D geometric context for improving the quality of both recognition and localization, but we also note that this context provides constraints that can improve efficiency.

Towards the end of exploring the use of 3D geometric context for object detection, we first discuss our approach to appearance-based detection. There are two approaches to single-view recognition that we utilize in this thesis. In Section 3.1 we discuss a contour-based approach that utilizes a contour template to represent the object boundary and a variant of chamfer matching to determine if the edges in an image region are sufficiently similar to that object boundary. We explore this approach for situations where training data is sparse, such as the SRVC. In situations where the collection of large amounts of annotated training data is possible, we utilize the Deformable Parts Model (DPM) of Felzenszwalb et al. (2010), which we briefly overview in Section 3.2.

Figure 3.1: Object classifiers are applied at different locations and scales (left), producing detection responses (middle), and NMS produces a sparse set of detections (right).

Our contour-based approach and the DPM are classifiers that are typically applied to the image in a sliding window manner, producing a dense set of detection responses. For the applications we envision, however, a sparser set of detections is more useful. In order to determine a sparse set of detections we utilize a variant of Non-Maximum Suppression (NMS). In Section 3.3 we describe this approach in the language of probability, where we select the responses that maximize the posterior probability. See Figure 3.1.

Along with presenting the classifiers that we later utilize in this thesis, there are a number of contributions that stand on their own in this chapter. In particular, we develop a new approach to chamfer matching that is more invariant to scale than previous approaches. We show in Section 3.4 that our new approach significantly outperforms chamfer matching that is not scale invariant, and that it is competitive with or outperforms other single contour-template approaches on the challenging ETH Shape dataset (Ferrari et al., 2006). We also show that the form of Non-Maximum Suppression (NMS) can make a significant difference.
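For reference, the plain greedy form of NMS, against which the probabilistic selection of Section 3.3 can be compared, can be sketched as follows. This is an illustrative baseline only: boxes are assumed to be (x1, y1, x2, y2) tuples, and the 0.5 IoU threshold is the conventional default rather than a tuned value.

# A plain greedy non-maximum suppression baseline (illustrative sketch).
# `detections` is a list of (box, score) pairs with box = (x1, y1, x2, y2).
def greedy_nms(detections, iou_thresh=0.5):
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)               # highest-scoring detection left
        kept.append(best)
        remaining = [d for d in remaining     # suppress heavily overlapping boxes
                     if iou(best[0], d[0]) < iou_thresh]
    return kept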
Its key features are that it is a smooth measure, is robust to missing information, and that with algorithmic techniques, such as distance transforms, it can be evaluated efficiently. In its most basic form, the chamfer distance is the total distance from all the points in a template point set T to their respective closest points in a point set E, where these points are referred to here as edgels. A threshold τ is often used to limit the penalty for missing edgels. To measure the distance between the template T at position x, each element of T is shifted by x, where the thresholded chamfer distance is defined as

d_{cham}(x, τ, E, T) = \frac{1}{τ|T|} \sum_{x_t \in T} \min\left(τ,\; \min_{x_e \in E} \left\|(x_t + x) - x_e\right\|_2\right)    (3.1)

This measure is inherently biased towards cluttered images as noted by Gavrila and Munder (2007), where a high density of edgels is likely to have a small chamfer distance despite the fact that the pattern of edgels "looks" nothing like template T. To overcome this, we adopt the approach of Shotton and Blake (2007) and Stenger et al. (2006), who add a term that incorporates the difference in orientation between matching edgels to the chamfer distance. This acts as a penalty for matches due to clutter, since the orientations are less likely to agree. Explicitly, if we denote e_{t,x} to be the x_e ∈ E closest to x_t + x, then we can define two disjoint sets, T′ and T″, where T′ = \{x_t \mid τ > \|e_{t,x} - x_t\|_2\} and T″ = \{x_t \mid τ ≤ \|e_{t,x} - x_t\|_2\}. These sets denote nothing more than splitting T into those edgels with a match less than τ away and those that are too far away, i.e. potentially missing edgels. Let φ(x) denote the orientation of an edgel modulo π, and define the orientation distance, in which the unmatched edgels in T″ receive the maximal orientation difference of π/2, as

d_{orient}(x, τ, E, T) = \frac{2}{π|T|}\left(\frac{π}{2}\,|T″| + \sum_{x_t \in T′} \left|φ(x_t) - φ(e_{t,x})\right|\right)    (3.2)

and the total oriented chamfer distance is

d(x, τ, E, T) = λ\, d_{cham}(x, τ, E, T) + (1 - λ)\, d_{orient}(x, τ, E, T)    (3.3)

where λ weights the contribution of the orientation difference to the chamfer score. Figure 3.2 sheds some light on the oriented chamfer distance.

This description thus far has not touched upon the issue of scale. In order to compare the contour template to the edge image at different scales, Borgefors (1988) takes the approach of building an image pyramid, a multi-scale representation of an image, and applies the contour successively to each of the images in the pyramid. The key challenge with this approach is that it is unclear how to scale an edge image. In the case of applying edge detection separately to each of the rescaled images, this can result in edge images that look quite different at different scales, with more edgels at higher resolutions. Moreover, this approach is computationally intensive since it requires computing edge images and distance transforms for each scale. The alternative is to scale the template rather than the edge image, a common approach in modern applications of chamfer matching (Leibe et al., 2004; Liu et al., 2010; Shotton and Blake, 2007). For a bounding box b, which can alternatively be defined as a tuple (s, a, ℓ) of scale, aspect ratio, and centre respectively, scaling the template means multiplying the position of the edgels in T by s/s_m, where s_m is the height or width of the contour template T, which can be arbitrary. It should be noted that previous approaches that scale the template do not normalize the chamfer distance by the template scale (Leibe et al., 2004; Liu et al., 2010; Shotton and Blake, 2007). The result is that the chamfer distance is not scale invariant.
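To make the preceding definitions concrete, the following is a minimal Python sketch of the thresholded oriented chamfer distance of Equations 3.1-3.3, assuming a boolean edge image with per-edgel orientations and a template given as a list of edgel coordinates and orientations. The closest edgels are found with a Euclidean distance transform computed once per image (the strategy we describe in Section 3.1.2); the function and parameter names are ours, the unmatched-edgel penalty follows our reading of Equation 3.2, and the scale normalization introduced below is not yet included.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def oriented_chamfer_distance(template_pts, template_ori, edge_map, edge_ori,
                              offset, tau, lam=0.25):
    """Sketch of the thresholded oriented chamfer distance (Equations 3.1-3.3).

    template_pts : (N, 2) array of template edgel coordinates (row, col).
    template_ori : (N,) template edgel orientations, modulo pi.
    edge_map     : (H, W) boolean edge image E.
    edge_ori     : (H, W) edgel orientations of E, modulo pi.
    offset       : (row, col) position x at which the template is placed.
    """
    # Distance transform of the edge image: for every pixel, the distance to
    # (and the index of) the nearest edgel.  Computed once per image.
    dists, inds = distance_transform_edt(~edge_map, return_indices=True)

    pts = np.round(np.asarray(template_pts) + np.asarray(offset)).astype(int)
    pts[:, 0] = np.clip(pts[:, 0], 0, edge_map.shape[0] - 1)
    pts[:, 1] = np.clip(pts[:, 1], 0, edge_map.shape[1] - 1)

    d_match = dists[pts[:, 0], pts[:, 1]]      # distance to the closest edgel e_{t,x}
    matched = d_match < tau                    # the set T'; the remainder is T''

    # Equation 3.1: thresholded chamfer term, normalized to [0, 1] by tau*|T|.
    d_cham = np.minimum(d_match, tau).sum() / (tau * len(pts))

    # Orientation difference (modulo pi) between template edgels and their closest edgels.
    rows = inds[0, pts[:, 0], pts[:, 1]]
    cols = inds[1, pts[:, 0], pts[:, 1]]
    d_phi = np.abs(np.asarray(template_ori) - edge_ori[rows, cols])
    d_phi = np.minimum(d_phi, np.pi - d_phi)   # wrap so the difference lies in [0, pi/2]

    # Orientation term: edgels in T'' receive the maximal penalty pi/2.
    d_orient = (2.0 / (np.pi * len(pts))) * ((np.pi / 2.0) * (~matched).sum()
                                             + d_phi[matched].sum())

    # Equation 3.3: weighted combination, with lambda = 0.25 as in our experiments.
    return lam * d_cham + (1.0 - lam) * d_orient
```

Because the distance transform is shared across all placements, evaluating the template at another position only costs O(|T|), which is what makes dense evaluation over the image practical.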
In fact, the distance between the template and the edge image is directly proportional to the scale of the template. To overcome this we propose to normalize the chamfer distance by sτ /sm . In order to capture some shape variation, for a particular b, we take the minimum chamfer distance over the template when scaled by the aspect ratios {0.9, 1.0, 1.1}, and under reflection in the y-axis. We let the set A denote the set of templates that represent these transformations, and so our chamfer distance becomes g(b, E, T ) = min  T ∈A  3.1.2  sd( , τ, E, sT /sm ) sm  (3.4)  Chamfer distance for detection  So now that we have a distance metric, how can we use it for detection? We use a single contour exemplar in order to keep things simple and effective for situations with minimal training data. We proceed by computing the chamfer distance densely over the image. This can be efficiently computed by first using the Euclidean distance transform, which computes the closest edgel for every location in an image. This only needs to be computed once at a cost of O(n), where n is the number of pixels in the image. Leveraging the distance transform, computing the chamfer distance at one location is O(|T |). We compute the chamfer distance over a range of scales, utilizing the same distance transform, with a ratio of 1.2 separating each scale. In our experiments, denser sampling in scale space does not seem to produce 28  better results. In this chapter, we can simply utilize the chamfer distance itself as a score or response, and pass this onto the approach we outline in Section 3.3. This has the advantage of not requiring training data aside from our contour exemplar. However, in later chapters we use a probabilistic approach for detection, so we employ a logistic classifier to produce a probability out of the chamfer distance, where p(c|b, T , E) = 1 + exp(w0 + w1 g(b, E, T )  −1  (3.5)  where the model parameters w0 , w1 are learned from training data. For a fixed category and template T , the weights w0 , w1 for the logistic classifier can easily be learned using gradient descent if we have a set of positive instances and negative instances. The only complication is that we are in a detection setting rather than a classification setting, where in detection we encounter far more negative instances than positive instances. Ideally we would like to have balanced training data, an equal number of positive and negative instances. However, we would also like our negative instances to be challenging, since the vast majority of negative instances will be easy to distinguish from our positive instances. We use a standard technique of bootstrapping for hard negatives (Dalal and Triggs, 2005), where we first train our classifier with a random sample of negative instances so that the dataset is balanced, and then use the classifier to select a set of “hard” instances from the negative images and retrain. There are a number of relevant parameters that are constant across categories, which we tuned on a hold-out data set. λ, which modulates the influence of orientation differences on the chamfer distance, was set to 0.25. The parameter τ was set to 0.15sm , where the model scale sm is arbitrary. Both of these parameters are similar to values used by Shotton and Blake (2007).  3.2  Recognition with deformable parts model  Felzenszwalb et al. 
(2010) is a state-of-the-art approach to object detection, 29  the top entrant in the 2007-8 VOC contests, and forms the basis of many later top entrants (Everingham et al., 2010b). Their approach is particularly effective for categories like pedestrians, vehicles, bottles, and bikes. Moreover, the code for their work is publicly available, and is readily usable, which has also led to its adoption throughout the community. Another significant advantage of their work is that it is also efficient, with a lot of attention given to the use of distance transforms and dynamic programming, producing detection responses in less than a second for a 1 megapixel image. We provide a brief description here, and refer the reader to Felzenszwalb et al. (2010) for full details. The approach builds on the pictorial structures framework of Felzenswalb and Huttenlocher (2005); Fischler and Elschlager (1973) and the Histogram of Gradients (HOG) representation of Dalal and Triggs (2005). A category is represented as a mixture model, where each mixture is intended to capture different aspects of the category, such as the side and front profile of a bike. Each mixture is a collection of filters that are spatially related to each other, see Figure 3.3 for an example. A filter is a vector of weights that determines the similarity of an image region by utilizing the dot product between the image representation in that region to these weights. The most important filter is the root filter, which coarsely represents the shape of the category in this mixture. The other filters focus on smaller regions and capture finer detail, typically placed in areas of significant gradient magnitude. A star model represents the spatial relationships between parts, with each part having a set of weights that captures how it is allowed to vary from its “anchor” position, the “anchor” being relative to the root filter. The underlying image representation is a variant of HOG of Dalal and Triggs (2005). HOG is an image representation that aggregates image gradients first into overlapping spatial bins, and then into bins based upon gradient orientations. The spatial binning achieves some invariance to affine transformations. Each spatial bin is normalized independently, with 4 different types of normalization, achieving some invariance to lighting. Each spatial bin is a 36 vector histogram, where the entire image is composed of an overlapping grid of these spatial bins. Each part in the model is a set of 30  Figure 3.3: A DPM model of a bicycle (courtesy of Felzenszwalb et al. (2010)), with two components (rows). The left images are the root filters, where the white lines indicate the weights on various orientations in particular spatial bins. The middle images contain the part filters, which capture finer details and cover less spatial extent. The right image represents the penalty for the location of a part deviating from its anchor position.  weights over a rectangular configuration of these bins, so the similarity of an image to a part filter is just the dot product. In Figure 3.3, the image representation is similar to the image on the left. The training procedure for DPM consists of annotating hundreds of instances of a category with a bounding box, along with a providing a large number of challenging images without the category. The discriminative model, a latent-SVM, takes hours to train, a procedure we will not cover here. 
In this thesis, the categories for which we train a DPM are mugs, bottles, and bowls, all from an elevation angle near 0 (i.e. from the side). We chose these categories because the objects from these categories vary little over azimuth. For our purposes they are challenging enough to demonstrate the effectiveness of our work in later chapters. To train DPMs we collected and annotated several hundred examples of each category from the internet, utilizing the annotated images of PASCAL VOC 2008 for background images. Later in this thesis we employ a DPM both for our work in stereo in Chapter 4 and multiview detection in Chapter 5. The classifier natively produces a response from an SVM. We train a logistic classifier, whose form is similar to Equation 3.5, using gradient descent in order to transform the DPM response into a probabilistic classifier.

3.3 Sparse object detection model

The goal of object detection in this thesis is to produce a sparse set of locations that designate the location of objects and some measure of the confidence. The typical approach in the literature is to evaluate a number of locations in the image using a classifier, and then post-process these responses to remove those that are likely generated by the same image features. Options for determining locations at which to apply a classifier include the popular sliding window technique, which samples locations and scales densely. More efficient options include local feature approaches which use a sparse set of regions that "vote" for locations via the generalized Hough transform (Leibe et al., 2004). Other approaches use branch and bound techniques such as efficient sub-window search (Lampert et al., 2008) or generalized distance transforms and dynamic programming (Felzenswalb and Huttenlocher, 2005). Options for producing a sparse set of detections include non-maximum suppression, mean-shift mode estimation (Comaniciu and Meer, 2002), or applying maximum a posteriori (MAP) inference to conditional random fields (CRF) (Desai et al., 2009) or to generative models (Fergus et al., 2003).

The ideal is to generate a single detection only on those regions that tightly bound instances of a category. The tolerance for false positives, localization errors, and multiple detections for the same object depends on the application. In a surveillance setting where detection may be used as an attention operator for human interpretation, localization errors and multiple detections are not as problematic as missed detections. In a setting with less reliance on a human operator, such as mobile robotics, accurate localization and fewer false positives are probably more important.

The approach we take to produce a sparse set of detections is a variant of Non-Maximum Suppression (NMS). Non-maximum suppression is an approach that selects responses that are local maxima, and suppresses responses that are near these local maxima. We explain our approach in the language of probability, where we seek to find a set of detections that maximizes the posterior probability. Formally, we seek the optimal labelling, Y^{opt} = \{Y_i = (c_i, b_i) \mid i = 1..N\}, of an image I, where b = (ℓ, w, h) is a bounding box with centre, width, and height respectively. We consider only the two-category scenario here, where c_i denotes our target category and \bar{c} denotes the non-target or background category. Y^{opt} explicitly labels target category locations, where we denote B as the set of bounding boxes in Y^{opt}, but it also implicitly labels all regions outside of B, which we denote as \bar{B}, as the non-target \bar{c}. We cast sparse object detection as the task of finding Y^{opt} that maximizes the posterior probability,

Y^{opt} = \arg\max_{Y} p(Y \mid I)    (3.6)
        = \arg\max_{Y} \frac{p(Y)\, p(I \mid Y)}{p(I)}    (3.7)
        = \arg\max_{Y} p(Y)\, p(I[\bar{B}] \mid \bar{c}) \prod_{Y_i \in Y} p(I[b_i] \mid Y_i)    (3.8)

where Equation 3.7 follows by Bayes rule, and the notation I[b] refers to the image region within b. Equation 3.8 follows by an assumption that the appearance of each region is independent given the category. In essence, the optimal set of labels Y^{opt} is one where the objects are more likely to have generated the corresponding image regions than not, and where the co-location of Y^{opt} agrees with our prior. Our independence assumption can be problematic if the regions overlap significantly, but we utilize the prior distribution p(Y) to prevent significant overlap.

The first term in Equation 3.8 is a prior on the co-occurrence and co-location of objects in the image plane. There is some benefit to modelling the 2D arrangement of categories as shown by the work of Desai et al. (2009), since some objects are often co-located with one another. We take a simpler approach, where we impose the prior that bounding boxes should not overlap significantly. In essence, we are encoding our prior knowledge about how images are formed, whereby a pixel is generally the projection from one object. We do not want to prevent overlap entirely, since the edges of a bounding box will often contain other surfaces that do not belong to an object. To prevent overlap, we define

ψ(b_i, b_j) = \frac{\mathrm{intersection}(b_i, b_j)}{(1 - ζ)\,\mathrm{area}(b_i) + ζ\,\mathrm{union}(b_i, b_j)}    (3.9)

so that

p(Y) = \begin{cases} 0 & \text{if } \exists\, i, j \text{ s.t. } ψ(b_i, b_j) > ϕ \text{ and } i \neq j \\ Z & \text{otherwise} \end{cases}    (3.10)

where Z is a constant, and ϕ is a parameter governing how much overlap is permissible. The function ψ is the intersection of the bounding boxes over the area of b_i if ζ = 0, or intersection over union if ζ = 1.

In the first case, ζ = 0, our prior is asymmetric and functionally suppresses detections where the image region is largely accounted for by a detection with a higher probability. This enforces the notion that an image region is generated by the projection of one object. This has the advantage of suppressing false positives, since detections with higher probability suppress other explanations. However, it does have the disadvantage of discounting the possibility of occlusion if ϕ is too low. In the second case, ζ = 1, ψ functions as non-maximum suppression in a manner similar to how detections are evaluated in VOC. This is more forgiving in that it functionally suppresses detections that are likely based upon the same image features. However, it allows detections in the case where the scales of the bounding boxes are drastically different, which suggests
In common variants of NMS, a straightforward best-fit first greedy heuristic is employed. Here, an over-complete set of detection responses are produced (Y ) and are sorted from most likely to least. The highest remaining detection in Y is added to Y opt until Y is empty, removing those detections that overlap to any added to Y opt along the way. This heuristic will often produce the optimal solution when the image contains only a few well separated regions that look like the target object category. In the case where an image contains a number of regions that overlap and are close in appearance to the target category, determining the true Y opt requires more sophisticated searching strategies. The problem can be cast as determining the least cost path, where each node is adding a detection Y to Y opt , and techniques like A* or branch-and-bound can be used to efficiently determine Y opt . The complexity of the search is approximately bounded by the size of the power-set of the maximal number of distinct overlapping detections with reasonably high responses. In our experiments, the number of overlapping detections with high responses was small, so we simply employed an exhaustive search strategy to find Y opt . In Equation 3.8, notice that in order to determine the maximum we need to be able to compute the likelihood, p(I[bi )|ci ), and it appears that ¯ ) . Many we need to have a background model to compute the term p(I[B]| of the state-of-the-art approaches to object detection, however, are trained discriminatively and return a response that corresponds to p(ci |I[bi ]). This is also the case for the object detectors that we use. We can still work with these if we assume that,  35  ¯ ) p(I| ) = p(I[B]|  p(I[bi ]| )  (3.11)  Yi ∈Y  which is essentially the assumption that the appearance of image regions we think might be our category are independent of the rest of the image. Now, we can multiply the numerator of the likelihood term p(I|Y) by the left side of Equation 3.11 and the denominator by the right side because of equality, giving p(I|Y) = p(I| )  p(I[bi ]|ci ) Yi ∈Y p(I[bi ]| ) Yi ∈Y  (3.12)  where with an application of Bayes becomes, p(I|Y) = p(I| )  p(ci |I[bi ])(1 − p(ci )) Yi ∈Y (1 − p(ci |I[bi ]))p(ci )  Yi ∈Y  (3.13)  which no longer involves likelihood terms. Notice that in essence this reduces to using p(c|I[bi ]) instead of p(I[bi )|ci ), with the exception that we also have the category prior p(ci ) as well. However, rather than maximize the product, we advocate taking the logarithm, and instead use,  Y opt = argmax log(p(Y) + log(p(I|Y))  (3.14)  Y  = argmax log(p(Y)) + |Y|(log(1 − p(c)) − log(p(c))) Y  log(p(ci |I[bi ])) + log(1 − p(ci |I[bi ]))  +  (3.15)  Yi ∈Y  we can then proceed to use the same techniques to find the optimum as we outlined above. We note that p(c) implicitly determines the threshold for whether a detection belongs in Y opt . If log(p(ci |I[bi ])) + log(1 − p(ci |I[bi ])) < log(1 − p(c))−log(p(c)), then p(Y ∪Yi |I) < p(Y|I), which means that Yi should not be Y since the posterior probability is less. In our experiments, Section 3.4, we utilize the above to determine a sparse set of objects, but we evaluate a 36  detection, Yi , based upon the detection response, p(c|I[bi ]). As a result, we set p(c) to be very small so that it becomes the detection score p(c|I[bi ]) that determines whether it is a real detection or not. In deployment situations, however, a more reasonable prior could be used to reduce the number of false positives.  
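As an illustration of the greedy variant described above, the following Python sketch selects a sparse set of detections best-fit first, using the overlap measure ψ of Equation 3.9 and reading the per-detection contribution of Equation 3.15 as the log-odds of the classifier response measured against the category prior. It is a sketch rather than our actual implementation; the box format, helper names, and default thresholds are assumptions made here.

```python
import numpy as np

def psi(b1, b2, zeta):
    """Overlap measure of Equation 3.9; boxes are (x1, y1, x2, y2)."""
    iw = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    ih = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = iw * ih
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = area1 + area2 - inter
    return inter / ((1.0 - zeta) * area1 + zeta * union)

def greedy_sparse_detections(boxes, probs, prior=1e-6, phi=0.5, zeta=0.0):
    """Best-fit-first greedy approximation of Y_opt (Section 3.3.1).

    boxes : candidate bounding boxes (x1, y1, x2, y2).
    probs : p(c | I[b]) for each candidate, e.g. from the logistic classifier.
    prior : the category prior p(c); with a very small value the appearance
            response alone effectively decides whether a detection survives.
    """
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                     # most probable first
    threshold = np.log(prior) - np.log(1.0 - prior)     # log-odds of the prior
    selected = []
    for i in order:
        p = np.clip(probs[i], 1e-12, 1.0 - 1e-12)
        if np.log(p) - np.log(1.0 - p) < threshold:
            break                                       # adding it lowers the posterior
        # The prior p(Y) of Equation 3.10 forbids overlap greater than phi.
        if all(psi(boxes[i], boxes[j], zeta) <= phi for j in selected):
            selected.append(i)
    return selected
```

With ζ = 0 this reproduces the aggressive, asymmetric suppression discussed earlier, while ζ = 1 gives the VOC-style intersection-over-union behaviour; replacing the greedy loop with a branch-and-bound search recovers the exact maximization when many strong responses overlap.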
3.4 3.4.1  Experimental validation Experiment setup  We validate our contributions on the ETH Shape dataset, first introduced by Ferrari et al. (2006). This dataset is composed of 255 challenging images with five different categories of interest: apple logos, giraffes, swans, mugs, and bottles. Examples can be found in Figure 3.4. The dataset also comes with an edge image for each colour image, generated via the Berkeley natural boundary detector of Martin et al. (2004), which utilizes local brightness, texture, and colour cues to generate an edge map. The dataset also comes with an exemplar object boundary for each category, which is used as the contour template in this section in order to compare it to Ferrari et al. (2006). In order to evaluate a detection response, we use the intersection over union (IoU) ratio that is common in the literature, most notably in the PASCAL VOC (Everingham et al., 2010a). IoU is the intersection of the detection bounding box with the ground-truth bounding box, divided by the union of these two boxes. This is equivalent to Equation 3.9 when ζ = 1. If a detection does not exceed the IoU threshold, then it is considered a false positive. Only a single detection that exceeds the IoU threshold can be a true positive, all other responses are false positives. Finally, a detection can also only be a true positive if the response or score exceeds a threshold. To evaluate an algorithm on a dataset, there are a few metrics that are employed. Recall is defined as T P/N P , where N P is the number of positive instances in the dataset and T P is the number of true positives. Precision is defined as T P/(F P + T P ), where F P is the number of false 37  Figure 3.4: A sample of detections (at 0.7 recall) utilizing the scale invariant chamfer matching on the ETH dataset. Categories are apple logos (dark green), bottles (yellow), giraffe (black), swan (blue), and mugs (cyan). False positives and misplaced detections can be seen in the left most column. Right column is the contour template for each category, provided with the ETH dataset.  38  positives, so precision is the proportion of detections that are true positives. False Positives Per Image (FPPI) is another metric, F P/N , where N is the number of images in the dataset. All of these metrics vary according to how we threshold the detection responses. For a particular threshold, there is an associated recall and precision value. We sometimes refer to a set of results as at recall x, which means the detection threshold that produces a recall of x. In terms of summarizing results into a single number, the most common is the average precision (AP), which is the average precision across recall, sometimes referred to as the area under recall-precision curve in the literature.  3.4.2  Results  The first experiment is to determine the efficacy of different approaches to our probabilistic version of non-maximum suppression. The intent of NMS is to produce a sparse set of detections, so that each detection corresponds to a separate instance. The threshold ϕ determines how aggressive the suppression is, with fewer detections with a smaller ϕ. The results did not vary significantly between 0.3 ≤ ϕ < 0.7, a range that essentially removes many of the false positives that overlap significantly with a true positive. The most interesting result depends on whether ζ is 0 or 1. In Figure 3.5 it can be seen that the aggressive approach, ζ = 0 is more effective. 
Aggressive NMS in this case reduces the chance of spurious false positives that are due to overlap with an instance of the category, but which cannot be judged true positives because another detection response already overlaps with the instance. In other words, aggressive NMS reduces the chance of two responses on the same underlying instance. The difference is much greater for DPM, indicating that DPM produces strong responses over a larger range of scales, which in turn indicates that DPM is much less sensitive to scale. An example of this type of false positive, where non-aggressive NMS fails, can be seen in the upper-left image of Figure 3.4: the two mug responses.

[Figure 3.5 plots precision (y-axis) against recall (x-axis) for four configurations; the legend reports Intersection Over Union - Chamfer (0.607), Intersection Over Area - Chamfer (0.670), Intersection Over Area - DPM (0.764), and Intersection Over Union - DPM (0.874).]
Figure 3.5: Recall-Precision curve for the category mugs from the ETH dataset with the VOC criterion for correct detection IoU ≥ 50%. The values in brackets are average precision.

The second experiment is to compare different approaches to contour-based detection from a single exemplar. The baseline is oriented chamfer matching, which is not scale invariant. The approach we highlight is the scale invariant oriented chamfer matching. We also compare our work with Ferrari et al. (2006), another approach that utilizes a single contour template for recognition. We also compare the effectiveness of scale invariant oriented chamfer matching when we use an edge image generated by Canny edge detection (Canny, 1986) rather than the boundary detector of Martin et al. (2004). These results are summarized in Figure 3.6.

The approach of Ferrari et al. (2006) takes an edge image and connects the disconnected contours into a network. Detection proceeds by doing a depth-first search through this network for a path that resembles the exemplar, with a number of hand-tuned parameters to deal with deformation. Their approach relies on the edge image being relatively uncluttered to avoid combinatorial explosion, thus necessitating an edge detection approach that is computationally expensive, such as Martin et al. (2004), which takes on the order of several seconds for 640x400 images. We manage to get similar or better performance with a far simpler approach that does not suffer computationally by its reliance on the Berkeley natural boundary detector.

[Figure 3.6 shows Recall vs. FPPI curves for the five ETH categories; the legend APs are — mug: Scale Inv. Ori. Chamfer Matching (0.668), Scale Inv. Ori. Cham. w/ Canny (0.673), Ori. Chamfer Matching (0.400), Ferrari et al. ECCV 2006 (0.509); swan: (0.766), (0.743), (0.574), (0.723); applelogo: (0.854), (0.837), (0.842), (0.775); giraffe: (0.686), (0.520), (0.563), (0.591); bottle: (0.923), (0.838), (0.724), (0.858).]
Figure 3.6: Comparison between various approaches to detection via a single contour exemplar. We use a Recall vs. FPPI graph to compare to Ferrari et al. (2006) with an IoU of 20%. The values in the parentheses are the average precision.

[Figure 3.7 shows example images in three columns: the colour image, the Berkeley edge image, and the Canny edge image (panel titles: Color, Berkeley, Canny).]
Figure 3.7: Edge images generated via Canny (1986) and Martin et al. (2004).

In order to determine how robust our approach is to how the underlying edge image is generated, we compare the Berkeley edge detector of Martin et al. (2004) with that of Canny (1986) edge detection, a much faster approach. It can be seen in Figure 3.6 that the Berkeley boundary detector produces edge images that are better for contour-based detection. However, note that the results are only markedly different for the categories of giraffes and bottles, with an improvement in AP from 0.52 to 0.686 for giraffes and from 0.838 to 0.923 for bottles. In the case of giraffes, Canny edge images tend to be cluttered, as can be seen in Figure 3.7, which obscures the actual object of interest, particularly when there is deformation and thus the template will match clutter edgels whose orientation differs. In the case of bottles, it is sometimes the case that edges are not detected due to the translucency of the object, as well as clutter edgels. The trade-off we highlight here is between computational efficiency and performance. In later chapters we utilize Canny edge detection because our goal is embodied recognition on a mobile robot, where computational efficiency is important.

Demonstration of scale-invariance

In Figure 3.6 it can also be seen that scale invariant oriented chamfer matching is far more effective than oriented chamfer matching alone. This result is not unexpected since the dataset does contain instances over a wide range of scales. The oriented chamfer matching is biased towards producing higher responses for smaller regions because of the lack of normalization. It is the scale of the improvement that is remarkable, as Table 3.1 shows.

                        mug    swan   applelogo   giraffe   bottle
  No scale invariance   0.40   0.57   0.84        0.56      0.72
  Scale invariant       0.67   0.76   0.85        0.69      0.92

Table 3.1: The average precision for the ETH dataset comparing chamfer matching with and without scale invariance.

Figure 3.8 provides a more nuanced view, showing a histogram of the scale of all detections at recall 0.7, both true and false positives. It can be seen that without scale invariance there are far more detections at smaller scales. The scale of ground truth instances in the dataset is generally uniformly distributed, so this increase is due to false positives. The reason for this stems from the fact that for the same set of edges at a larger scale, the distance between the edges and the template (when it is also scaled appropriately) will be larger, thus resulting in a larger chamfer distance due to the lack of normalization with respect to scale. Therefore, in order to capture true positive detections at larger scales, a larger number of false positives at smaller scales come along with them. The lack of scale invariance means that the detector is biased towards smaller bounding boxes.

3.5 Concluding remarks

In this chapter we presented an approach to object detection that we utilize later in this thesis. We focus on our scale invariant contour-matching approach to detection because it requires little training data.
This makes it more suitable for detection where the collection of annotated training data is problematic, such as when a mobile robot is searching for an object that 43  0.018  With Scale Invariance Without Scale Invariance  0.016  % of Detections  0.014 0.012  0.010 0.008  0.006  0.004 0.002 0.000  75  125  175  225  275  Scale of Detections  325  375  Figure 3.8: A normalized histogram comparing the scale of all detections (at recall 0.7) for the category mugs with and without scale invariance for oriented chamfer matching. it does not have a prior model for. This approach is most effective for categories that do not undergo a great deal of deformation and where the object boundary is the most distinctive characteristic of the category. Categories such as mugs and bottles are suitable, whereas more than a single exemplar is needed to detect categories such as giraffes or swans. The primary contribution of this chapter is the introduction of a more scale invariant approach to chamfer matching. In previous approaches that employ chamfer matching, the distance metric between a contour and an edge does not normalize with respect to the contours size, and as a result the distance varies with the scale of the template. We introduce a normalization step and demonstrate that there is a marked improvement for object detection in the ETH shape dataset. Although the model we employ that utilizes chamfer matching is quite simple by design, there are more sophisticated approaches that utilize a collection of contours for recognition (Gavrila, 2007; Shotton and Blake, 2007). Our improvement holds for other approaches that require a metric that measures the distance between contours. 44  We also presented an approach that can take dense object detection responses and produce a set of sparse detections. We demonstrated that it is effective for two very different classifiers, illustrating that we can treat the underlying appearance-based object detectors almost as a black box. This partially fulfills our goal of an embodied approach to object detection that is modular, where new research can be plugged in accordingly. In the next chapter, we build on this approach, enhancing the underlying object detection with depth information acquired from stereo imagery.  45  Chapter 4  Stereo for Recognition 4.1  Introduction  There is a rich literature and considerable success on object detection utilizing images, but somewhat less success on detection from 3D sensors alone. This is partly due to the challenges of acquiring sufficient training data, as mentioned in previous chapters. It is also due to the various limitations that are specific to each sensing technology, particularly missing data and limited fidelity. One approach to overcoming these limitations is to integrate 3D sensor information with appearance information from images. In this chapter, we focus on utilizing stereo imagery to aid in single view object detection. As we argue in Section 4.3, the most reliable feature of binocular stereo depth images for object detection is the 3D geometric context that it provides rather than shape information. We focus on the 3D scale of an image region and the existence of discontinuities, both of which can often be reliably determined from a depth image. In conjunction with priors on the real size of an object, we show how to improve detection and efficiency in Section 4.4. There are a number of recent results in object detection that also makes use of 3D geometric context inferred from depth images. 
The work of Ess et al. (2007); Gavrila and Munder (2007); Schindler et al. (2010) all leverage stereo and ground plane constraints to improve pedestrian detection. Their 46  Figure 4.1: A peripheral-foveal camera system provides a highresolution colour image and sparse stereo data. Appearance based detectors (contour-based in this example) and a prior on object size constrains where the object can be in depth, resulting in fewer false positives. reliance on ground plane constraints, however, does not necessarily translate into success for other scenarios, in particular for smaller objects in the indoor scenes. The probabilistic model of Ess et al. (2007); Schindler et al. (2010) is similar to our own, but we present a simpler approach that requires considerably less training and demonstrate that it still remains effective without ground plane constraints. The work of Gould et al. (2008); Quigley et al. (2009) tackles the indoor scenario, using boosting to jointly learn appearance features and 3D features. They consider more contextual features such as height from ground, and the distance from camera, as well as geometric features of the object such as its 3D dimensions and surface variance. Their work relies on point clouds derived from high accuracy laser sensors, which means that considerable effort must be employed to collect training data for every category, and that their approach is limited to static scenes. We explore a simpler approach, where we separate appearance from the 3D geometric context. Moreover, the 3D size of a region is available directly via dense accurate point clouds in their work. Our work explores inferring object size from lower accuracy stereo depth images rather than from point clouds.  47  The approach that we present and explore in this chapter makes a number of contributions. The first contribution is that we show how to robustly infer object size from stereo depth images in the presence of missing data. We also show that significant improvements can be achieved for household objects with little training data, increasing the average precision (AP) of shoes from 0.67 to 0.81 and for mugs from 0.54 to 0.72. Moreover, previous approaches tend to explore intricate integrated systems (Gavrila and Munder, 2007), and we show that a simpler approach where appearance and 3D terms can be learned separately can be effective as well.  4.2  Stereo vision  Stereo vision is a technology that utilizes multiple images and epi-polar geometry to determine the underlying 3D points that correspond to points on the image plane(s). In its most common variant, binocular stereo, if image points (u1 , v1 ) and (u2 , v2 ) are projections of the same 3D point p = (x, y, z) in their respective images, then we can determine p. If we assume that the epi-polar lines are horizontal and aligned vertically, with known focal length (f ) and baseline (B) between cameras, we can verify geometrically via similar triangles in Figure 4.2 that the depth is, z=  fB d  (4.1)  where d = u1 −u2 denotes the disparity. Using this, and projective geometry, it is straightforward to determine the remaining coordinates as x = u1 z/f and y = v1 z/f . The primary challenge in stereo vision is the correspondence problem, where it is initially unknown which pixels in each image correspond to a 3D point. This necessitates searching for these correspondences and specifying a metric that can determine if two pixels correspond to the same 3D scene point from image data. 
The search is simplified with a properly calibrated stereo camera where we can assume that for any pixel (u, v) in one image, the corresponding pixel in the other image is in the same row, (u + d, v), where d is unknown. 48  u1 d= u1-u2  θ  z  B θ  p  u2 f  Figure 4.2: The geometry underlying binocular stereo Techniques to solve the stereo correspondence problem abound, a sample of which are reviewed in Scharstein and Szeliski (2002). The simplest and most efficient are scan-line algorithms that consider the intensity values in a small window around a pixel in one image, and try to determine a corresponding pixel on the same row where the intensity values in its window are most similar. These approaches vary mostly in the size of the window and the measure of similarity they utilize to compare the windows. This approach has proven to be effective for textured regions. However, in textureless regions the disparities are often left as missing values due to ambiguity. Sophisticated approaches for stereo correspondence utilize higher order relations to overcome these issues. For example, in textureless regions we can often assume that the disparity is similar to the disparity of nearby pixels with a similar texture. Algorithms that utilize additional constraints to propagate information to more uncertain regions can result in denser depth images. However, this propagation comes at a higher computational cost. Algorithms that utilize strictly local constraints to spread disparity information can utilize GPU implementations, achieving more complete depth maps in a fraction of a second (Cornelis and Van Gool, 2005). Approaches that optimize a global energy function can result in more accurate depth images, but can take seconds to compute a depth image. The trade off here 49  is between denser stereo depth images and computational efficiency. For the features that we advocate utilizing in the next section, primarily the average depth and presence of surface discontinuities, accurate and denser stereo is not particularly useful.  4.3  Utility of stereo depth information for detection  What kind of cues does perfect depth information offer for object detection? With a sufficiently dense depth image the shape of the surface can be inferred. The presence of object boundaries can be determined via depth discontinuities. In addition, it is possible to determine the size of a surface via projective geometry. There are also a variety of higher level contextual cues that are available. For example, it is possible to infer he location of supporting surfaces. Moreover, the 3D spatial relationship to other objects in the scene is also available. Another question to pose is how much of this information can be reliably determined from stereo depth images? In the context of detection, this is largely determined by three factors: the presence of missing values, the fidelity of the depth image, and the ability of a depth image to characterize a 3D shape. Stereo depth images are constructed from the relationship between disparity, focal length and baseline as in Equation 4.1. This implies that the precision of our depth values is related to these same quantities, wherein disparity is taken to be the only quantity with some uncertainty. In the case where a correspondence can be accurately determined, precision is inherently limited by the effects of discretization. 
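The following Python sketch illustrates both halves of this pipeline on a rectified pair: a toy scan-line block matcher based on the sum of absolute differences, and the recovery of depth and a 3D point via Equation 4.1. It is purely illustrative — the dataset in Section 4.5 uses Point Grey's stereo algorithm — and the window size, the ambiguity test, and the sign convention for the disparity are assumptions of this sketch.

```python
import numpy as np

def scanline_sad_disparity(left, right, max_disp=64, win=5, ratio=0.8):
    """Toy scan-line block matcher (sum of absolute differences).

    Assumes a rectified pair with the left image as reference, so the match
    for pixel (u, v) is searched at (u - d, v) in the right image.  Ambiguous
    pixels are left as NaN, mimicking the missing values discussed above.
    """
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    H, W = left.shape
    half = win // 2
    disp = np.full((H, W), np.nan)
    for v in range(half, H - half):
        for u in range(half + max_disp, W - half):
            patch = left[v - half:v + half + 1, u - half:u + half + 1]
            costs = np.array([np.abs(patch - right[v - half:v + half + 1,
                                                   u - d - half:u - d + half + 1]).sum()
                              for d in range(max_disp)])
            best = int(costs.argmin())
            second = np.partition(costs, 1)[1]
            if costs[best] < ratio * second:   # keep only clearly unambiguous matches
                disp[v, u] = best
    return disp

def point_from_disparity(u, v, d, f, B):
    """Equation 4.1 plus reprojection; (u, v) are measured from the principal point."""
    # Because d is quantized to (sub)pixel steps, the achievable depth
    # resolution becomes coarser as z grows.
    z = f * B / d
    return np.array([u * z / f, v * z / f, z])
```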
If we refer to the variance of a disparity estimate to be σd2 , then the variance of our depth estimate zˆ, via propogation of uncertainty, is approximately f 2 B 2 σd2 /d4 , which implies that our uncertainty grows quadratically as the depth of the object increases. In our experiments we use Point Grey’s Bumblee 2 stereo camera, where the accuracy of depth estimation is illustrated in Figure 4.3. Another significant issue is missing depth values, which usually arise 50  0.8 0.7  Accuracy (m)  0.6 0.5 0.4 0.3 0.2 0.1 0.0 0  5  10  Depth (m)  15  20  Figure 4.3: The accuracy of depth estimation as depth increases for Point Grey’s Bumblebee 2 camera. Accuracy here refers to the standard deviation. from either untextured image regions or because of occlusion. In the case of untextured regions, the underlying surface can often be assumed to be varying smoothly. More sophisticated stereo algorithms can recover somewhat from this case, but a smoothly varying surface is not a particularly distinctive characteristic for recognition. Objects will often at least provide image texture at their boundaries, thus there will be some depth information available regardless of the stereo algorithm. Missing values due to occlusion, on the other hand, can be an issue if we are trying to characterize the shape of the surface. However, typically some depth information is still available for an object. The other significant issue for recognition is the resolution of a depth image and its ability to characterize a surface. If we consider the total area of the visible surface in 3D, and the area to which it is projected to in the image, more information is lost in the projection of a surface with significant variation than a surface parallel to the image plane. For example, in Figure 4.4, notice that the curvature of a mug is not reflected accurately in the point cloud, and that for the distinctive part of the shape, near the edges in each graph, no point cloud information is recovered. In general, those 51  1.949 1.916 1.883 2.154 2.118 2.082  1.694 1.670 1.646  Figure 4.4: The graphs on the right indicate the depth of surface points along a single scan line that passes through the middle part of each of the respective mugs. surfaces that are not parallel to the image plane are underrepresented, and thus it is challenging to infer surface properties for those regions where the surface is changing. This means that we are losing much of the information that characterizes an object. Although the limitations of depth images from stereo seem to be many, there are still cues that are recoverable. The first is that depth information can provide important information about the real world size of the object. Another cue that may be available is the existence and location of discontinuities in a region. These discontinuities could indicate either object boundaries, occlusion, or significant surface variation. Moreover, the lack of depth information in a region may indicate that no object lies in that region, or at least no object of interest. We explore utilizing some of these features in our approach.  4.4  Probabilistic model for object detection  In the previous chapter we outlined an approach to sparse object detection that leveraged classifier responses generated by applying the classifier at different locations and scales in the image. Widening the scope we introduce the depth image Z and utilize it primarily for the 3D geometric context  52  it provides. 
We follow similar logic as outlined in the previous chapter, where we seek to find the set of Y opt detection responses that maximize the posterior probability,  Y opt = argmax p(Y|Z, I)  (4.2)  Y  = argmax p(Y)p(Z, I|Y)  (4.3)  Y  ¯ I[B]| ¯ ) = argmax p(Y)p(Z[B], Y  p(Zi , Ii |Yi )  (4.4)  i  where we assume independence of the image and depth regions when we know the category. For notational simplicity, we let Ii denote I[bi ], the image region inside of bi , and the same goes for Zi . The first term is our familiar prior for how bounding boxes are arranged in the image plane, which we defined previously in Section 3.3, intended to prevent bounding box overlap. The last term term in Equation 4.4 is the likelihood, p(Zi , Ii |Yi ). Note that Zi contains the same information as (Zi − g(Zi ), g(Zi )), where g is a scalar function. We will later define g(Zi ) as an estimator of the mean of Zi . For notational simplicity, we denote Vi = Zi − g(Zi ). It can be shown that g(Zi ) and Vi are independent. We take this step so that we can separate, to some degree, the shape and the size of the surfaces. The likelihood term becomes p(Ii , Zi |Yi ) = p(Vi |Yi )p(Ii |Yi )p(g(Zi )|Yi )  (4.5)  where the additional assumption of independence between g(Zi ), Vi and Ii will generally hold if we do not try to infer some sort of depth information from the Ii . The first term in Equation 4.5 is our distribution on surface variation for the category that we discuss in Section 4.4.2. The second term is the appearance likelihood, which is dependent upon our appearance-based object classifier. The final term is related to the prior on the size of the object which we discuss in Section 4.4.1. 53  In order to find Y opt we can either use the best fit-first greedy heuristic, or utilize a search strategy such as branch-and-bound, both of which we have discussed previously in Section 3.3.1. In the case where we do not have a likelihood term for the image, p(Ii |ci ), but instead have p(ci |Ii ), we can use the same trick that we employed in Equation 3.12. Finding Y opt is equivalent to maximizing,  Y opt = argmax log(p(Y) + log(p(I, Z|Y))  (4.6)  Y  log(p(Vi |Yi )) + log(p(g(Zi )|Yi ))  = argmax Y  Yi ∈Y  + log(p(Y)) + |Y|(log(1 − p(c)) − log(p(c))) log(p(ci |Ii )) + log(1 − p(ci |Ii ))  +  (4.7)  Yi ∈Y  where the last two lines in Equation 4.7 come from Equation 3.12. The probability for a particular detection is p(c|g(Zi ), Ii , Vi , bi ) =  p(c|Ii )p(g(Zi )|Yi )p(Vi |Yi ) p(g(Zi )|bi )p(Vi |bi )  (4.8)  via Bayes rule and our previous independence assumptions. We assume that without any prior knowledge of the underlying category that p(Vi |bi ) is uniform. Moreover, we assume that objects are distributed uniformly in space, so by the nature of projective geometry this informs us that the closer that an object is, the more of the image it will occupy, so p(g(Zi )|bi ) ∝ g(Zi )−1 . Using this, we approximate the probability of a detection, score(Yi ) = g(Zi )p(c|Ii )p(g(Zi )|Yi )p(Vi |Yi ).  (4.9)  We use Equation 4.9 for the score of the detections our system outputs. Our approach to detection is the same as in last chapter, where we select a set of windows at which to apply our detector. There is the appearance based term from our object classifier, p(Yi |Ii ), the surface variance term, p(Vi |Yi ), and the term that involves the scale prior, p(g(Zi )|Yi ). In the 54  exhaustive case we apply these to every window, although we can use branch and bound techniques to reduce computation. 
In this case, we first determine p(g(Zi )|Yi ) and p(Vi |Yi ), which are very efficient, and then determine if we should apply the appearance-based classifier. For the dataset we explore in Section 4.5, we can reduce the classifier application by 13% for the variation prior, 75% for the scale prior, and 81% for both.  4.4.1  Scale from depth  In the simplest scenario, where the object is planar and fronto-parallel to the image plane, the relation hI = z −1 f hr holds, where h is the height of the object in the image and real world respectively, as in Figure 4.5. For a given category, hr is a random variable with distribution D, denoted as hr ∼ D, and thus z ∼ h−1 I f D. The size of many categories roughly follow a Gaussian distribution, which we adopt, so that z ∼ h−1 I f G(µr , σr ). In practice, however, the scale of our bounding box, hb , and our depth estimate, g(Zi ), are only estimators, which introduces additional uncertainty and bias. The first source of bias is generated by assumptions about the extrinsic camera parameters. We implicitly assume that the image plane is parallel to the axis along which the scale property is defined. For example, if we are using the height to denote scale then we assume that our viewing direction is parallel to the ground plane. If this is not the case then foreshortening can lead to the hb ≤ hI . However, this bias is typically small when the object is more than a metre away from the camera. Another factor is that hb may not even be a projection of hr at all. For example, if the elevation angle of the viewpoint deviates significantly from the canonical viewpoint, the points that define the height of an object may not lie along the rays projected out from b. We do not explore accounting for this bias here since preliminary experiments indicate that it has little effect upon detection performance for canonical viewpoint recognition for the categories we consider. We stress, however, that the plane along which the scale property is defined should be roughly perpendicular to viewing direction, and thus different viewpoints may require a different scale property.  55  Another source of bias can arise from the detection responses generated by the appearance-based detectors. It has been observed by ourselves and others (Ess et al., 2007; Leibe et al., 2007) that the image region for which an object classifier might respond most strongly can often differ from the true boundaries of the object in the image. One reason for this is that many object classifiers, including the ones we utilize, rely on features of an object’s boundary for recognition. These features, however, may only be apparent when the image region is larger than the object itself, so that hb > hI . We explicitly learn and remove this bias on a set of separate training data, finding that the bias is similar across category for the same detection approach. We found that the size of bounding box was about 15% larger for the DPM, and about 5% larger for our contour-based approach. Another source of uncertainty is introduced by the fact that we sample across image scale rather than search for the optimum bounding box. Therefore, where we apply our classifier may never tightly bound the object. Moreover, object detectors are often robust to minor changes in scale, and so the response from a detector may be similar for a range of scales around the true scale. 
To account for this uncertainty we introduce a noise variable ε ∼ G(0, σb ), so that hb + ε = hI and z=  f hr hb + ε  (4.10)  We approximate the distribution of z as a Gaussian. Moreover, we assume that g(Zi ) is an unbiased estimator of z, with an error that is small relative to the uncertainty of the category size so that p(g(Zi )|Yi ) = G(  √ f µr f (σr + µr σb ) , ) hb hb  (4.11)  where the variance comes from the propagation of error. We set σb = 0.1 for both detectors and all categories, which is related to the fact that both detectors seem to be robust to about 10% changes in scale. The original intent of this work was to aid object detection for a setting where we are given very little training data, such as our work with Curious George and the SRVC. Here, we are given an object to find and given access 56  f b  hb hr  γb  g(Z) z Figure 4.5: A diagram illustrating the relationship between depth and object size. to the internet to acquire relevant data regarding that category. The size of a category can often be parsed from web-data, particularly for manufactured objects available via online retailers such as Amazon or Walmart Meger et al. (2010). The original idea was to use this size data with a generous uniform distribution around that size data to reduce false positives and reduce computational burden by focusing only on image regions that agreed with the size prior. In our experiments we utilize a Gaussian distribution, where we set the parameters by hand according to our estimate of what the actual distribution of the height of the object category is. Another approach would be to collect 3D data and learn these parameters directly, but this contradicts our goal and is somewhat overkill when this can just as easily be supplied directly. Fritz et al. (2010) suggest using 2D imagery and the metadata that commonly comes with images. Here, if the category of interest is the focus of the image so that the focal distance corresponds to the depth, then we can use the camera intrinsics and the bounding box on the image to determine its real size. This is an intriguing way to learn both the appearance of a category and its real size, but one we do not explore this in this thesis.  57  Estimation of object depth In our reasoning about the scale of an object, the depth of an object is a single value z, generally defined to be the distance along the principal ray of the camera to the centre of the object. This value is not directly observable given a depth image, instead it must be estimated from the depth values in Zi . This estimation is the function g(Zi ). The first thing to note is that not all values in Zi correspond to the object, in particular those values near the boundary of bi may fall on more distant surfaces that do not belong to the object. Moreover, the object might be occluded to some degree. As a result, reducing the effect of “outliers” is desirable. There are at least two ways to achieve this. One is to restrict attention only to those depth values nearer to the centre of bi , reducing the effect of background clutter. Another is to use the median value. In our experiments, there was no noticeable difference in accuracy between utilizing the median or mean value. The primary difference between the two approaches is computational efficiency. In isolation, computing the mean is O(n), where n is the number of values in a region. Computing the median, on the other hand, is O(n log(n)). 
However, if we compute the mean or median for all locations and at multiple scales in the image, we can achieve greater efficiency. The amortized cost of computing the median is √ O( n), where as we can make use of integral images to compute the mean with an amortized cost that is O(1). Given the advantages in computational efficiency, we choose to estimate z with the mean. We define  ρ(Z, b, γ) = {(x, y)|Z[x, y] =  , |xb − x| < γhb , |yb − y| < γwb }  (4.12)  which is the set of image locations that are not missing ( ) and lie within an inner region inside b that is γ the size of b. We set γ = 0.75. With this, we define a robust estimate of the mean depth of a region as,  58  zˆ =  1 N  Z[x, y]  (4.13)  (x,y)∈ρ(Z,b,γ)  How effective is Equation 4.13 as an estimate of the depth of an object’s centre? If we consider the visible surface as a distribution from which we are drawing samples, and the mean of that distribution is some α from true object centre, then zˆ has a bias of α from the true object depth, as in Figure 4.5. The variance of the estimator zˆ decreases as the number of points in ρ(Z, b, γ) increases, which mitigates the effect of our original error in disparity in Section 4.3, which is vanishingly small when compared to the the variance of a few centimetres for the size of a typical housefold category. In fact zˆ is somewhat like a sample mean, although the samples are not truly i.i.d. because the disparity at one point is often dependent upon its neighbours due to the nature of stereo algorithms. Given the biased nature of zˆ, we instead define, g(Zi ) =  1 N  Z[x, y] − α  (4.14)  (x,y)∈ρ(Z,b,γ)  where N = |ρ(Z, b, γ)|. In practice, α can be safely ignored in cases where z is significantly larger than α, which is typically the case for small household objects a few meters away from the camera. For the categories we consider in our experiments we use α = 0.05m, although very little difference in performance was seen when α is ignored entirely.  4.4.2  Surface variation  The remaining term to do be discussed in our model is the likelihood of the surface variation given the category, p(Vi |Yi ). We do not expect a stereo depth image to provide enough information to be distinctive since much of the interesting surface variation is lost in the projection. What can be determined from Vi is the existence of discontinuities, which we do not expect for many household objects. The existence of discontinuities suggests that the surface is generated by something other than an instance  59  of our object category. Utilizing this insight, we define p(Vi |obj) =  1 ξ  1 N  0  otherwise  2 (x,y)∈ρ(Z,b,γ) Vi [x, y]  <ξ  In essence, we expect the variance of the depth values within b to be less than ξ, and thus we exclude instances when it is not. In our experiments we set ξ to be 0.25 m, the maximum expected interior depth for an object category. The surface variance for a region can be efficiently computed by making use of integral images, in much the same manner that the mean is calculated in the previous section. The amortized cost of computing the Vi is O(1).  4.5  Evaluation  To validate our approach we collected a dataset of stereo and still images for a variety of scenes containing mugs and shoes, examples of which can be found in Figure 4.7, Figure 4.6. Using these images, we then compare the performance of our contour-based object detector versus the performance when this base detector is augmented with a prior on scale and surface variation. 
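For concreteness, the two depth-based terms used in this comparison — the robust mean depth estimate g(Z_i) of Equation 4.14 and the surface-variation test of Section 4.4.2 — can be sketched as follows, assuming missing stereo values are marked with NaN. The integral-image formulation gives the O(1) amortized cost per window noted above; the function names and box conventions are ours, and for simplicity the variance is taken about the region mean rather than about g(Z_i).

```python
import numpy as np

def _integral(a):
    """Summed-area table with a zero row and column prepended."""
    return np.pad(a, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def depth_terms(Z, box, gamma=0.75, alpha=0.05, xi=0.25):
    """Robust mean depth (Equation 4.14) and the surface-variation test (Section 4.4.2).

    Z   : depth image in metres, with np.nan marking missing stereo values.
    box : bounding box (x1, y1, x2, y2) in pixels, assumed to lie inside the image.
    Returns (g, low_variation): the bias-corrected mean depth g(Z_i), or None if
    the inner region contains no depth values, and whether the depth variance
    falls below the threshold xi.
    """
    valid = np.isfinite(Z)
    Zf = np.where(valid, Z, 0.0)

    # Integral images of the valid-pixel counts, the depths, and the squared
    # depths: the mean and variance of any rectangle then cost O(1) per query.
    I_n = _integral(valid.astype(np.float64))
    I_z = _integral(Zf)
    I_zz = _integral(Zf * Zf)

    # Shrink the box to the inner region rho(Z, b, gamma) of Equation 4.12.
    x1, y1, x2, y2 = box
    cx, cy = 0.5 * (x1 + x2), 0.5 * (y1 + y2)
    hw, hh = 0.5 * gamma * (x2 - x1), 0.5 * gamma * (y2 - y1)
    c0, c1 = int(round(cx - hw)), int(round(cx + hw))
    r0, r1 = int(round(cy - hh)), int(round(cy + hh))

    def box_sum(I):
        return I[r1, c1] - I[r0, c1] - I[r1, c0] + I[r0, c0]

    n = box_sum(I_n)
    if n == 0:
        return None, False          # no depth evidence at all in this window
    mean = box_sum(I_z) / n
    var = box_sum(I_zz) / n - mean ** 2
    g = mean - alpha                # Equation 4.14: remove the surface-to-centre bias alpha
    return g, var < xi              # variance threshold xi as in Section 4.4.2
```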
We also include experiments with the DPM for mugs; we did not train a DPM for shoes because of a lack of readily available training data. There are a number of works that fuse stereo depth information with appearance information (Ess et al., 2007; Gavrila and Munder, 2007; Schindler et al., 2010). However, these focus predominantly on pedestrian detection and leverage both ground plane constraints and temporal constraints in the case of video (Ess et al., 2007; Schindler et al., 2010). As a result, we do not offer a comparison of our work to these previous approaches.  4.5.1  Experimental setup  The intent of our data collection was to produce a set of images that would be challenging for an appearance based classifier. With this in mind, the objects were placed at a variety of depths (1 to 7.5 m), with varying amounts of background clutter. In addition, the shapes and texture of the objects 60  Figure 4.6: Examples of mug detections at recall rate of 0.7, where green signifies true positives while red signifies false positives. The top row are detections generated without a scale prior using the contour-detector . The middle row are detections generated with a scale prior.  61  Figure 4.7: Examples of shoe detections at recall rate of 0.7, where green signifies true positives while red signifies false positives. The top row are detections generated without a scale prior using the contour-detector . The middle row are detections generated with a scale prior.  62  themselves varied. The images were of the side profile of the object category, although the object could be left or right facing. For the mug dataset we collected 20 images from different indoor scenes, with 15 different mugs, with about 3 mugs per scene. For the shoe dataset we also collected 20 images of different scenes, with 8 different shoes, with about 3 shoes per scene. The objects were placed in the scene to make recognition from appearance challenging, so the images are not always naturalistic. The camera setup consists of a Canon G7, using 1216x912 images, and a Bumblebee 2 stereo camera, using 1024x768 images, with the Canon camera on top of the Bumblebee as in Figure 4.1. The stereo algorithm we use is the Point Grey’s stereo algorithm provided with the camera, which provides fast, accurate depth maps for textured regions, and annotates ambiguous regions as missing information. These depth images, however, can often be very sparse. For the dataset we collected, the stereo algorithm was only able to determine a depth value for 31% of the pixels on average, although this improved to 82% for the inner region of the objects of interest. The motivation for the two camera approach is that the quality of the Bumblebee’s gradient image, upon which both appearance-based detectors rely, is poor in comparison to those of the Canon camera. The average precision for contour-based detection for mugs, for example, is only 0.4 for stereo images and improves to 0.54 for the Canon images. In addition, utilizing a higher resolution camera allows us to search for objects from a greater distance due to limitations of object classifiers to recognize objects from limited resolution. Utilizing a two camera system, however, introduces an additional complication since the depth image is in the Bumblebee’s coordinate system. To overcome this we find a set of point correspondences between one Bumblebee image and the Canon image using SIFT features and geometric constraints Lowe (2004). 
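A minimal sketch of this correspondence step, assuming OpenCV's SIFT implementation and Lowe's ratio test (the function name and thresholds are illustrative rather than the exact settings used in our experiments); the fundamental matrix is then estimated from these matches as described next.

```python
import cv2
import numpy as np

def match_bumblebee_to_canon(img_stereo, img_canon, ratio=0.8):
    """SIFT correspondences between the stereo reference image and the Canon
    image, filtered with Lowe's ratio test.  The returned point pairs are the
    input to the fundamental-matrix estimation described in the text."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img_stereo, None)
    kp2, des2 = sift.detectAndCompute(img_canon, None)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = []
    for pair in matcher.knnMatch(des1, des2, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    return pts1, pts2
```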
With this set of correspondences we utilize the Gold standard algorithm to compute the fundamental matrix, F , an algorithm that produces the Maximum Likelihood estimate on the assumption of a Gaussian error model (Hartley and Zisserman, 2004). This allows us to map the point cloud from the stereo images to image points in the Canon image, giving us 63  our depth image Z. Although this introduces additional errors in practice this had no noticeable effect upon the results due to our use of a robust estimator for depth. The contour-based detector that we utilize in our experiments is described in the previous chapter. The template contour that we utilize was acquired by selecting an exemplar image from the internet, applying foreground/background segmentation, and retaining the contour that surrounds the foreground segment. We collected a set of 50 positive instances from the internet for each category and used this, along with a background dataset, to learn the logistic classifier to produce detection responses. The DPM model for mugs that we utilize was trained on a separate set of training data, but we trained a logistic classifier to produce probabilistic responses utilizing the same data we collected to train the contour-detector. The Gaussian size prior for mugs was G(0.14, 0.04) and G(0.12, 0.03) for shoes. The uniform prior was U (0.09, 0.20) for mugs, and U (0.06, 0.2) for shoes.  4.5.2  Results and analysis  The first set of experiments is to determine the effect of different ways that the scale prior can be applied in detection. The primary variation is to apply both the scale prior and appearance detector jointly on every window, or to treat the appearance detector as a black box that returns a sparse set of detections and then apply the scale prior as a post-processing step. In Table 4.1 we can see that processing the image first via the appearance detector and then applying the scale prior as a post process is less effective than when we apply the appearance classifier and scale prior jointly. For mugs there is a 0.08 improvement in AP and for shoes there is an improvement of 0.04. One explanation for this is that in the post-process approach a number of true positives were not passed on as detections because a nearby window had a higher appearance response, thus suppressing the true positive. With the joint-process this errant false positive did not agree with the scale prior, and as a result the true positive came to the surface during the  64  Algorithm Chamfer Matching w/ post Uni. Scale Prior w/ post Gauss. Scale Prior w/ joint Variance Prior w/ joint Gauss. Scale Prior w/ joint Variance and Gauss. Prior DPM DPM w/ post Uni. Scale Prior DPM w/ post Gauss. Scale Prior  Shoes 0.67 0.73 0.77 0.71 0.81 0.81  Mugs 0.54 0.59 0.64 0.59 0.72 0.73 0.81 0.85 0.86  Table 4.1: Comparing average precision with and without a scale prior and stereo depth information. non-maximum suppression. Due to the nature of the DPM code, we could use it as a black box only, so could not explore the utility of applying the size prior jointly. We also present results for when we used only a crude uniform scale prior rather than a Gaussian scale prior. The uniform scale prior essentially suppresses all appearance detections that are not within a predetermined size range for each category. 
We explore this approach because in some scenarios sufficient prior knowledge is not available to define or learn a Gaussian size prior, such as in SRVC, and we wanted to see if even a crude prior would offer some improvement. There is some improvement in detection results with just a uniform prior, as seen in Table 4.1, but significantly less than with a more accurate Gaussian prior. The variance term that we introduce in Section 4.4.2 has far less impact on results than the Gaussian scale prior. With just the variance term alone there is a slight improvement in results over the appearance detector alone, although there is no statistically significant improvement when combined with the Gaussian scale prior. This result is not surprising, since most of the obvious false positives are eliminated with the scale prior. The variance term, however, can still be useful in reducing computation. The graphs in Figure 4.8, and the detection examples in Figure 4.7, illustrate the overall efficacy of our approach. For our contour-based approach, 65  0.8  0.8  0.6  0.6  Precision  1.0  Precision  1.0  0.4 0.2 0.00.0  0.4  DPM (0.81) DPM w/ post Scale Prior (0.86) Cham. w/ joint Scale Prior (0.72) Chamfer Matching (0.54) 0.2  0.4  Recall  0.6  0.2  0.8  0.00.0  1.0  (a) Mugs  Chamfer Matching (0.67) Cham. w/ joint Scale Prior (0.81) 0.2  0.4  Recall  0.6  0.8  1.0  (b) Shoes  Figure 4.8: Recall vs. Precision comparing detection with and without a scale prior. The difference between post and joint prior is explained in the text. the AP improves from 0.67 to 0.81 for shoes and from 0.54 to 0.72 for mugs, both significant improvements. The contour-based approach to recognition relies on a single contour-template as a model, so a scale prior can have a greater impact since the appearance based detector is relatively weak. The much smaller improvement, from 0.81 to 0.86 for mugs, when a scale prior is combined with the DPM is due to the fact that the DPM is already a very effective object detector. However, DPM requires a significant amount of training data, and can be computationally expensive, so a scale prior can be useful to reduce computation. A scale prior allows us to utilize a weaker appearance detector, which is useful in scenarios where training data is limited or in scenarios where we utilize a cascade of increasingly more sophisticated approaches for object detection. Another interesting note is that for the DPM model, most of the improvement comes from removing false positives since the curve is pushed upwards rather than to the right. Presumably the scale prior is removing false positives that are due to clutter, where the depths in these image regions do not agree with the prior. However, the contour-based approach is somewhat different. Here, the curve is shifted to the right as well, indicating that the scale prior is allowing us to capture more true positive instances. A 66  key difference is that we can apply both the scale prior and the appearance detector to jointly determine a score for a window, and then hand this off to our non-maximum suppression approach to select the final set of detections for an image. In the post-processing approach, that we use for the DPM, if the appearance based detector fails to return a response for true positive, our scale prior has no chance to improve the score of this detection.  
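The difference between the two pipelines can be made explicit with a short sketch; the window representation, scoring callbacks, and the greedy non-maximum suppression below are illustrative stand-ins rather than the exact components used in our experiments.

```python
import numpy as np

def box_iou(a, b):
    """Overlap of two (x1, y1, x2, y2) boxes, used here only for suppression."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(dets, overlap=0.5):
    """Keep the highest-scoring detection, suppress overlapping ones, repeat."""
    kept = []
    for d in sorted(dets, key=lambda d: d["score"], reverse=True):
        if all(box_iou(d["box"], k["box"]) < overlap for k in kept):
            kept.append(d)
    return kept

def detect_joint(windows, appearance_logp, scale_logp):
    """Joint variant: every candidate window is scored with appearance and
    scale prior together *before* suppression, so a correctly sized true
    positive can outrank a stronger but wrongly sized neighbour."""
    scored = [dict(w, score=appearance_logp(w) + scale_logp(w)) for w in windows]
    return greedy_nms(scored)

def detect_post(black_box_detector, scale_logp):
    """Post-process variant: the appearance detector (e.g. the DPM) is run as a
    black box and the scale prior only rescores the detections that survive
    its own non-maximum suppression."""
    return [dict(d, score=d["score"] + scale_logp(d)) for d in black_box_detector()]
```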
4.6  Concluding remarks  In this chapter we have shown that the 3D geometric context available via stereo imagery can be useful for object detection, even for generic household objects, where both missing data and fidelity of the available depth data can be significant. Household objects often do not exhibit distinctive features that can be detected from stereo, but the approach we present instead uses the size information the stereo imagery provides. The approach we present efficiently estimates the depth of image regions and uses this information in concert with size priors to improve both the accuracy and efficiency of object detection. Moreover, our approach is modular, treating context and appearance separately, which allows for easy integration of future appearance-based object detectors.  67  Chapter 5  Multiview Recognition 5.1  Introduction  Object detectors are typically evaluated in scenarios where only a single image is available. However, for many applications and domains, ranging from mobile robotics and surveillance to community photo collections on the internet, more than a single image is available from the scene. In the previous chapter we explored using stereo depth information to provide cues regarding the size of the underlying objects to improve detection. In this chapter we propose a new approach to integrating imagery from multiple viewpoints for detection, a task we refer to as multiple-viewpoint detection. One motivation for integrating information across multiple viewpoints arises from the fact that state-of-the-art techniques have a false positive rate for many categories, particularly for viewpoints that deviate from the canonical view. Moreover, background clutter and occlusion can both result in weaker detection responses. By integrating multiple viewpoints, it becomes possible to leverage multiple weak detection responses to come up with a stronger response. An example where integrating multiple viewpoints allows us detect challenging examples can be seen in Figure 5.1, where we are able to detect bowls that were too challenging from a single viewpoint. The primary contribution of our work in this chapter is a novel algorithm that integrates imagery from multiple viewpoints to produce detections both 68  (a) Single View  (b) Multi View  Figure 5.1: Detection results at 0.7 recall of bowls (green), mugs (red) and bottles (white), using the Deformable Parts Model from Felzenszwalb et al. (2010) as an appearance based detector. The left image contains results from single-view detection, whereas the right image is from our multi-view detector, with 3 viewpoints available, resulting in fewer false positives and more true positives. in the 2D images and in 3D. Although integrating information across multiple viewpoints has been explored explicitly in the active vision community, such as in Laporte and Arbel (2006), the focus has been primarily on instance detection in an uncluttered environment. We build on some of this work and focus on categorical detection in cluttered environments. Our work is similar in spirit to the work of Coates and Ng (2010), however our score for detection incorporates 3D geometric information as well as appearance, whereas theirs includes appearance only. Our experimental validation is also more thorough, presenting a more convincing case for the utility of multiple viewpoint detection.  
5.2  Method  In previous chapters we define object detection as the task of determining a set of responses where each response is the category c, score, and bounding box, b = ( , w, h), defined by the location of the centre , width w, and height h. In this chapter, we widen the scope and tackle the task of multiple 69  viewpoint object detection, or multi-view object detection, which we define as inferring the presence and location of objects in each of the images as well as inferring which detections refer to the same underlying object. We achieve this by inferring a set of objects, O, whereby each object oi has a 3D location (xi ), scale (si ), and pose (φi ), along with the category ci , and a bounding box bi,j for each of the images Ij . We assume that all objects are upright, so the pose of the object, φi , is the horizontal angle between some world-coordinate frame and the axis that defines the azimuth of the canonical pose for the object category, as in Figure 5.2. We again adopt a probabilistic approach, whereby we infer O that maximizes the posterior probability, p(O|I1 , ..., In ), where n is the number of images. Our approach to doing this is outlined in Algorithm 1. Algorithm 1: Multiple viewpoint object detection  1. Collect images {Ii |1 ≤ i ≤ n} from multiple viewpoints 2. Register images to get camera matrices Pi for each image Ii 3. Run single-view detectors on each image Ii , producing a set of responses Ri = {(bj,i , vj , cj , φj )|1 ≤ j ≤ ni } 4. Find a set of objects O = {oi = (xi , ci , si , φi , bi,{1,...,n} )|1 ≤ i ≤ M } that maximizes p(O|I1 , ..., In ). We discuss our algorithm in greater detail in subsequent sections, beginning first with our model of a scene in Section 5.2.1 which is necessary for Step 4 of our algorithm. We then discuss how we infer objects O that maximize our scene model in Section 5.2.2, since in general finding the optimal is intractable. We leave discussion of the actual single-view detectors until Section 5.2.3. How images are collected and registered is described in our experiments in Section 5.3, since these steps are generally application specific.  70  rcamj Φi  fj xcamj  θi,j Φ i,j  li,j  ri,k xi ri,j Canonical pose  li,k xcamk Figure 5.2: Multiple views provide more appearance information along with ability to localize objects in 3D.  5.2.1  Probabilistic model for a scene  In multi-view object detection, the unknown variables that we wish to infer are Oopt = {(xi , ci , si , φi , bi,{1,...,n} )|1 ≤ i ≤ m}. The score that we propose to maximize in order to determine Oopt is the posterior probability,  Oopt = argmax p(O|I1 , ..., In )  (5.1)  O  = argmax O  p(O)p(I1 , ..., In |O) p(I1 , ..., In )  (5.2)  n  p(Ii |O)  = argmax p(O) O  (5.3)  i  where we assume that each image is independent given the objects O, and we do not need to model p(I1 , ..., IN ) because it does not involve O. We implicitly assume that the extrinsic parameters of the cameras for each image are also known. The term p(Ii |O) is the likelihood of the image Ii given the set of objects 71  O. We assume that each image region Ii [bj,i ] is independent given the object oj , that is, the appearance of each image region is independent of the others given the category object presumably depicted in that region. 
As a result, m  ¯ ) p(Ii |O) = p(Ii [B]|  p(Ii [bj,i ]|oj )  (5.4)  j  We model the second term as p(Ii [bj,i ]|oj ) = p(I[bj,i ]|cj , φj,i ), a classifier that returns the likelihood of the image given the category and the viewpoint angle (φj,i ) relative to the canonical viewpoint. Using the camera matrix Pj for image Ij , we can find the ray (ri,j ) passing through the camera j and i,j ,  as well as determining the angle (θi,j ) of the ray relative to the world-  coordinate frame as in Figure 5.2. In turn, we can determine the angle of the object’s canonical viewpoint relative to the camera as φi,j = φi − θi,j . ¯ ), is the likelihood of the image The first term in Equation 5.4, p(Ii [B]| regions not accounted for by our objects given that these regions are generated by something other than the categories of interest. We denote this ¯ We note that for each object, the second term in Equation 5.4 area as B. actually decreases since it is a product of probabilities which are less than 1, and a similar phenomena occurs for our prior p(O). In theory, this is ¯ ) if we had an accurate model for this counterbalanced by the term p(Ii [B]| term. That is, for each object oj ∈ O, n  n  p(Ii [B¯ ∪ bj,i ]| ) < p(O)  p(O − oj ) i  p(Ii [bj,i ]|oj )  (5.5)  i  which means that the presence of an object oj is a better explanation for ¯ ), we ignore this term the data than not. Rather than try to model p(Ii [B]| and instead propose using a threshold to remove objects that do not have enough evidence to support their inclusion in O. The term that we threshold on is defined later in Equation 5.11. The term p(O) from Equation 5.3 is our prior on co-occurrance and colocation. For notation, let the bolded properties without subscript denote all properties of that type across the set of objects. For example, s is the  72  Parameter µc (m) σc (m)  Mug 0.16 0.05  Bowl 0.20 0.12  Bottle 0.20 0.07  Shoe 0.30 0.05  Table 5.1: The parameter values for scale scales {si |1 ≤ i ≤ M }. We define our prior as, n  p(si |ci )  p(O) = p(c, φ)p(x|c, s) i  p(  i,j |xi , si )p(wj , hj |si , φi )  (5.6)  j  where we model p(c, φ) with a uniform distribution, and will describe the other terms in turn. The first term, p(x|c, s) captures co-occurance and co-location statistics, an approach that numerous researchers have explored in the context of 2D detection as we noted in Section 2.3. However, learning these relationships for 3D scenes requires a significant amount of training data, which we leave for future work. For now we propose a simple prior that enforces the notion that distinct objects cannot occupy the same physical space. This prior assigns 0 to O when the distance between any two objects is less than half the size of the largest of the two objects, more formally  p(x|c, s) =  if ∃ i,j s.t. |xi − xj |2 < 0.5∗max(si , sj )  0 const.  otherwise  (5.7)  The third term, p(si |ci ), in Equation 5.6 is our scale prior for a category. The scale of an object is meant to denote the volume of the object, but in practice the scale of a category is modelled by utilizing a function of its dimensions: the width, height, and length. For our experiments we use a Gaussian with parameters µc , σc for each category, and use the length of the object (its largest horizontal dimension) to represent scale. The parameters are outlined in Table 5.1. 
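A sketch of the prior terms introduced so far, namely the co-location constraint of Equation 5.7 and the per-category scale prior with the parameters of Table 5.1. The object representation (a dictionary with a 3D position, scale, and category) is illustrative; the location and bounding-box terms described next are handled separately.

```python
import numpy as np

SCALE_PRIOR = {                 # (mu_c, sigma_c) in metres, from Table 5.1
    "mug": (0.16, 0.05), "bowl": (0.20, 0.12),
    "bottle": (0.20, 0.07), "shoe": (0.30, 0.05),
}

def log_scene_prior(objects):
    """log p(O) restricted to the two terms described so far: no two objects
    may occupy the same physical space (Equation 5.7), and each object's
    scale follows a per-category Gaussian."""
    # co-location: reject scenes where two objects are closer than half the
    # larger of their two scales
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            if np.linalg.norm(a["x"] - b["x"]) < 0.5 * max(a["s"], b["s"]):
                return -np.inf
    # per-category Gaussian scale prior
    logp = 0.0
    for o in objects:
        mu, sigma = SCALE_PRIOR[o["category"]]
        logp += -0.5 * ((o["s"] - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
    return logp
```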
The term p(  i,j |xi , si )  is our distribution on where the object is in Ij given 73  rcamj di,j  Φi  wi,j  qμ,c (s,i Φi,j )  Φi,j  xcamj  xi ri,j si  Figure 5.3: The 3D width of the object as a function of viewpoint and the object’s scale that we know its position in 3D (xi ) and the camera projection matrix Pj . With Pj we can determine the projection of xi in Ij , a position we denote as µx . In essence this distribution allows the bounding box of the object’s projection to be off-center to account for some of the noise in our estimation of the camera’s projection matrix. We use a truncated Gaussian to model our term, giving  p(  i,j |xi , si )  =  exp − (  i,j  − µx ) 2  di,j 2 2fj si  if  0  i,j  − µx  2  <  f si di,j  otherwise (5.8)  where fj is the focal length of the camera, and di,j is the depth of the object relative to the camera, as in Figure 5.3. The intuition here is that if we expect the ray (ri,j ) that passes through the  i,j  to pass within si of the  object, i.e. the ray intersects the 3D object if it has a scale of si . The last term, p(wj , hj |si , φi , ci ), is our prior on the shape and size of the bounding box in the image, given the scale, pose, and category of the object. For many object categories, the shape of an object in the image will change depending on the viewpoint, even if the distance between the object and the camera remains the same. For example, for the shoe category, the aspect ratio of the bounding box changes significantly as the azimuth of the viewpoint changes. We chose to model the size of the bounding box with a 74  single Gaussian on its width, where p(wj , hj |si , φi , ci ) = G(  di,j wj µc,w , σc,w ) fj  (5.9)  The Gaussian parameters are actually functions of the object’s 3D size si , such that µc,w = qµ,c and σc,w = qσ,c , which are the real 3D size of the object along the axis that is perpendicular to the current viewpoint, as illustrated in Figure 5.3. We transform the b width into the real world width by multiplying it by di,j /fj . For the categories of mugs, bottles, and bowls, the scale is the same across azimuth, so that qµ,c (si , φ) = si and qσ,c (si , φ) = σci . For the category of shoes we set these parameters by hand for the same 8 azimuths for which we trained our detectors, and then we use linear interpolation between these azimuth angles. Thus far we have described the model for a scene, and note that we seek to find a set of objects, O, that maximizes Equation 5.3. We define the score of an object (vi ), regardless of viewpoint, as n  p(Ij |oi )  vi = p(oi )  (5.10)  j n  = p(si |ci )  p(Ij [bi,j ]|ci , φi,j )p(  i,j |xi , si )p(wj , hj |si , φi )p(ci , φi )  (5.11)  j  where all of these terms have been discussed previously.  5.2.2  Inferring 3D objects  In the previous section we described our model for a scene that defines the posterior probability in Equation 5.3. We seek a set of objects, O, that maximizes this posterior probability. Oopt  is intractable.  In general, however, finding  In this section we outline our approach to finding  an approximate solution. The outline of our solution is to first propose a set of over-complete objects O∗ , and then incrementally refine and reduce this set. We begin by using the single-view detection responses Ri =  75  {(bj,i , p(Ii [bj,i ]|cj ), cj , φj,i )|1 ≤ j ≤ ni }. For each response we use the camera matrices to generate rays rj,i that pass through camera centre xcami of camera i and bounding box centre  j,i ,  as in Figure 5.2. 
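The geometry of this step can be sketched as follows, assuming calibrated cameras with P = K[R|t]; the closest point between two such rays is the "near-intersection" used next to initialize candidate objects. Function names and conventions are illustrative.

```python
import numpy as np

def detection_ray(P, centre_px):
    """Back-project a bounding-box centre through camera matrix P = K[R|t],
    returning the camera centre and a unit ray direction in world coordinates.
    (The direction may need its sign flipped to point in front of the camera.)"""
    M, p4 = P[:, :3], P[:, 3]
    cam_centre = -np.linalg.inv(M) @ p4            # C = -M^{-1} p4
    u, v = centre_px
    d = np.linalg.inv(M) @ np.array([u, v, 1.0])   # direction of the viewing ray
    return cam_centre, d / np.linalg.norm(d)

def near_intersection(c1, d1, c2, d2):
    """Midpoint of the closest approach between two rays and their minimum
    distance; candidate 3D object locations are initialised at such points."""
    w = c1 - c2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w, d2 @ w
    denom = a * c - b * b
    if abs(denom) < 1e-9:                          # near-parallel rays
        return None, np.inf
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1, p2 = c1 + s * d1, c2 + t * d2
    return 0.5 * (p1 + p2), np.linalg.norm(p1 - p2)
```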
The objects are  initialized to lie at the “near-intersections” between rays. Determining a reasonable set of near-intersections can be challenging depending on the nature of the scene and the false positive rate of the object detectors. Before we describe this, we introduce a bit more notation. Let D(ri,j , rk, ) be the minimum distance between rays ri,j , rk, , recalling that for ray ri,j , i refers to the detection identity and j refers to the image or camera. Let t(ri,j , rk, ) be the 3D point that is closest to both rays. Let di (x) be the depth of the point x in reference to camera i. We start by building an adjacency matrix, A, for all pairs of detection rays ri,j , rk, , where they are adjacent if they satisfy these five properties. Adjacency Properties  1. Same category: ci = ck 2. Near intersection: D(ri,j , rk, ) < 0.5µci 3. Agreement on object height: | log(hi,j dj (t(ri,j , rk, ))/fj ) − log(hk, d (t(rk, , ri,j ))/f )| < log(1.5) 4. Agreement with scale prior: |wi,j dj (ri,j , rk, )/fj −µci | < 2σci and |wk, d (ri,j , rk, )/f ) − µci | < 2σci 5. Pose Agreement: ∃φ s.t |φi,j − (φ − θi,j )| < π/4 and |φk, − (φ − θi,j )| < π/4 Using A, for all adjacent pairs of rays, ri,j and rk, , we compute the 3D ˜ that minimizes the squared reprojection error between x ˜ and i,j , k, . point x ˜ We sample scales and poses to determine s˜, φ that maximizes Equation 5.11 ˜ , and use these to define a potential object o ˜ , where we keep a given x ˜ . Following this, for each o ˜ we utilize list of detections that agree with o 76  A to determine potentially agreeing detections, adding those detections to ˜ if the geometric properties agree as outlined above, refining s˜, φ˜ as new o ˜ . We then score each of these potential objects detections are added to o using Equation 5.11. This set of objects is the over-complete set O∗ . With this large set of potential objects we eliminate candidates by the following greedy process, constructing our set O along the way. We first sort ˜ them according to score v. We then take the maximum scoring object o out and place it in O. We then go through each of the remaining potential ˜ and re-score those objects and remove the detections that belonged to o objects. We then remove those potential objects that overlap significantly in 3D according to our prior Equation 5.7. We repeat this process until O∗ is empty. The resulting set of objects O is then our approximation to the maximum of p(O|I1 , ..., In )  5.2.3  Single-view object detection  In our experiments we utilize 2 different kinds of object detectors. The first is the Deformable Parts Model (DPM) of Felzenszwalb et al. (2010), an approach that we discuss previously in Section 3.2. We utilize this object detector for the objects that do not vary (much) across azimuth, namely mugs, bowls, and bottles. Although mugs do vary across azimuth, we observed that the detection response of a DPM is relatively uniform across azimuth. For these categories, we ignore pose altogether. We first train the object detector discriminatively on training data from the internet. However, at this point the detectors return a classification response p(cj |Ii [bj ]), rather than the desired likelihood response p(Ii [bj ]|cj ). To achieve this we utilize a separate set of training data that is drawn from a similar distribution as our test data in order to learn an empirical distribution that maps p(cj |Ii [bj ]) to p(Ii [bj ]|cj ). We leave training a proper generative model as future work. 
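One simple way to realize such an empirical mapping, sketched here for concreteness rather than as the exact procedure used in our experiments, is to estimate the class-conditional density of the detector score on validation windows that truly contain the object.

```python
import numpy as np

def fit_score_likelihood(val_scores_pos, n_bins=20):
    """Estimate an empirical likelihood p(score | object) from the detector
    scores of validation windows that truly contain the object; the returned
    function then stands in for p(I[b] | c) when scoring detections."""
    hist, edges = np.histogram(val_scores_pos, bins=n_bins,
                               range=(0.0, 1.0), density=True)
    def likelihood(score):
        idx = np.clip(np.searchsorted(edges, score) - 1, 0, n_bins - 1)
        return max(hist[idx], 1e-6)       # floor to avoid zero likelihoods
    return likelihood

# usage (hypothetical variable names):
#   lik = fit_score_likelihood(scores_of_true_mugs_on_validation)
#   p_img_given_mug = lik(dpm_score_for_window)
```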
We also wanted to explore the possibility of using object pose in our method, so we implemented a simple object detector that outputs the azimuth pose as well. We have discretized the azimuth viewing range into 8  77  bins, and represent each viewpoint as a separate classifier. Thus, our object detector for shoes is actually a bank of 8 detectors. The underlying detector is our contour-based detector that we outline in Section 3.1, where the detector for one viewpoint has as its model a single boundary contour. Like the DPM, our contour approach outputs detection responses of the form p(cj |Ii [bj ]), so again we utilize a validation set to build an empirical distribution to map p(cj |Ii [bj ]) to p(Ii [bj ]|cj ). The training and validation data we use for the shoe detector comes primarily from Savarese and Fei-Fei (2007). Both approaches return a sparse set of object responses for each image Ii , Ri = {(bj , vj = p(Ii [bj ]|cj ), cj , φj )|1 ≤ j ≤ ni }, except DPM does not return pose information.  5.3  Results  We evaluate the performance of our technique using the University of British Columbia Visual Robot Survey (UBC VRS) dataset, a collection of wellregistered imagery of numerous real-world scenes containing instances of the categories of interest: Mugs, bowls, bottles, and shoes. Section 5.3.1 will describe this dataset briefly. In Section 5.3.2 we present the results that illustrate the performance of our technique on this dataset.  5.3.1  Collecting images from multiple registered viewpoints  Numerous interacting factors affect the performance of a multi-view object detection system. Many are scene characteristics, such as the density and type of objects present, the appearance of each instance and the environment lighting. Others are artifacts of the image collection process, such as the number of images in which each object instance is visible and whether its appearance is occluded. Ideally, we would like to evaluate our technique on scenes that are challenging but are somewhat naturalistic. There are few publicly available datasets in which to evaluate our work, since existing datasets either lack significant clutter or are generated synthetically. Therefore, we have collected a new dataset, the UBC VRS, containing 78  Figure 5.4: Sample of images comparing single-view (left column) with multi-view (right column) detection at 0.7 recall, for bowls (yellow), mugs (red), shoes (blue), and bottles (white).  79  a variety of realistic indoor scenes imaged from a variety of viewpoints. The setting for many scenes is naturalistic, although we often augmented these scenes with additional instances of objects, both from our categories of interest and other categories. The physical settings present in the dataset include 11 desks, 8 kitchens and 2 lounges. We have augmented the highly realistic scenes with several “hand-crafted” scenarios, where a larger than usual number of objects were placed in a simple setting. We have assembled 7 shoe-specific, and 1 bottle-specific scene of this nature. A sample of images from these scenes can be seen in Figure 5.4. For each scene we collected anywhere between 8 and 18 images, where each image contained most of the objects in the scene, and where the primary viewpoint change was in azimuth angle. The elevation angle is nearly the same and close to 0 so that the objects are imaged mostly from the side. We made this decision because the state-of-the-art is most effective for canonical viewpoint recognition, which is often the side profile. 
A promising future direction is to utilize object detectors that are effective for all viewpoints so that we can integrate appearance information about objects from a greater range of viewpoints. As mentioned, each scene has been imaged from a variety of viewpoints, and each image has been automatically registered into a common coordinate frame using a fiducial target of known geometry. Fiducial markers are a common tool for tasks ranging from motion capture for the movie industry to 3D reconstruction. Our target environment involves highly cluttered, realistic backgrounds, and so simple coloured markers or uniform backgrounds (i.e. green screens) are not desirable. Instead, we have constructed a 3D target from highly unique visual patterns similar to those described in Fiala (2005); Poupyrev et al. (2000); Sattar et al. (2007). This target can be robustly detected with image processing techniques, and image points corresponding to known 3D positions (marker corners) can be extracted to sub-pixel accuracy. For the experiments in this paper, we have employed pre-calibrated cameras, so these 2D-3D correspondences allow the 3D pose of the camera to be recovered. In principle, sufficiently many successfully detected points on the known target would also enable un-calibrated pose 80  estimation. When evaluating practical inference techniques aimed at realistic scenarios, repeatability and control of experiments is of highest importance. In order to allow other researchers to repeat our experiments, we have released the entire set of imagery used for to generate all of the following results as part of the UBC VRS dataset1 .  5.3.2  Experiments  Although it would be useful to also determine the accuracy of our method’s ability to infer the 3D location of the objects, it is a significant challenge to produce ground truth annotations for the position of objects in 3D. Instead we focus on evaluating our approach by determining how well our 2D bounding boxes overlap with the ground truth. We again use the intersection over union (IoU) ratio that we utilized in previous chapters, using 0.50 as in PASCAL VOC. As a reminder, a true positive is when the ratio between the intersection of a detection bounding box with the ground truth to the intersection of these two boxes exceeds 0.50. Moreover, as in previous chapters, we also make use of the precision-recall curves and the average precision (AP) to summarize results across the dataset. We define these metrics in Section 3.4. Our first experiment compares our method, which detects objects for the images of a scene jointly, to a baseline where object detection is performed independently for each image. Both the baseline and our method rely on the same underlying single-view object detectors. In each trial we select a subset of 3 images obtained from well-separated viewpoints. Trials are made somewhat independent by randomizing the starting location for this viewpoint selection, so that the labelling procedure sees mostly non-overlapping sets of images between trials. The results of all trials over all scenes in the testing database are shown in the Recall-Precision graphs in Figure 5.5, and Figure 5.6, as well as Table 5.2. 
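For reference, the evaluation protocol can be summarized in a short sketch of IoU-based matching and average precision for a single image and category. This is a generic PASCAL-style computation with illustrative data structures, and the AP integration shown is one common variant.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_precision(detections, ground_truth, iou_thresh=0.5):
    """Sort detections by score, greedily match each to an unused ground-truth
    box with IoU >= 0.5, then integrate precision over recall."""
    detections = sorted(detections, key=lambda d: d["score"], reverse=True)
    matched, tp = set(), []
    for det in detections:
        best, best_iou = None, iou_thresh
        for gt_id, gt_box in ground_truth.items():
            overlap = iou(det["box"], gt_box)
            if gt_id not in matched and overlap >= best_iou:
                best, best_iou = gt_id, overlap
        tp.append(best is not None)
        if best is not None:
            matched.add(best)
    tp = np.array(tp, dtype=float)
    recall = np.cumsum(tp) / max(len(ground_truth), 1)
    precision = np.cumsum(tp) / (np.arange(len(tp)) + 1.0)
    # monotone precision envelope (from the right), integrated over recall
    envelope = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recall, envelope):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```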
The multi-view approach significantly outperforms the baseline of detec1  Available at the address http://www.cs.ubc.ca/labs/lci/vrs/index.html (Accessed in March 2012)  81  1  0.8  0.8  0.6  Precision  Precision  1  Single view only Multiview Integration  0.6  0.4  0.4  0.2  0.2  0  0  Single view only Multiview Integration  0  0.2  0.4  0.6  0.8  1  0  0.2  0.4  Recall  0.6  0.8  1  Recall  (a) Bottles  (b) Mugs  1  Precision  0.8  0.6  Single view only Multiview Integration  0.4  0.2  0  0  0.2  0.4  0.6  0.8  1  Recall  (c) Bowls  Figure 5.5: Recall-precision curves compare our multi-view (3 viewpoints) approach and the single view approach. tion on single images. This is not surprising given that more information is available in the multi-view approach. The appearance information from multiple viewpoints reduces the number of false positives due to clutter. Moreover, the additional appearance information allows us to succeed in detecting instances that are challenging because the available viewpoints lack distinctive visual characteristics. For example, for single viewpoints the AP for bowls is just 0.71, but increases to 0.86 when we integrate 3 viewpoints using our method. There are often instances of clutter or objects that appear bowl-like from a particular viewpoint, but fail to retain this over multiple viewpoints. We have analyzed the situations where the multi-view procedure fails to detect some objects and note that it is mostly due to the failure  82  1 1 0.9  Single View 2 Views 3 Views 6 Views  0.9 0.8 0.8 0.7  0.7  0.6  Precision  Precision  0.6  0.5  0.4  0.5  0.4  0.3  0.3  0.2  0.2  0.1  0.1  0 0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1  0 0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1  Recall  Recall  (a) Bowls  (b) Shoes  Figure 5.6: The performance of our system generally increases as the number of views for each scene is increased. of the appearance based detectors. We have also studied the contribution that several system components make to multiple viewpoint detection. First, we varied the number of viewpoints available for each scene. In each case, we select the viewpoints so that they uniformly span the entire range of viewpoints.These results can be seen in Figure 5.6 and in the Table 5.2. The general trend is that additional viewpoints lead to better detection results. There is a notable difference in the behaviour between categories detected with the DPM detector (mug, bowl, bottle) and those identified with the contour detector (shoe). For the mug, bowl and bottle, the addition of a second view yields a significant increase in performance, a third view gives less improvement, and further additional views to yields less and less improvement. Our analysis of this trend is that the DPM detector gives sufficiently strong single-view performance, and that after aggregating information across only a small number of images, nearly all the object instances with reasonably recognizable appearances are detected. In this case, the instances we fail to detect are too difficult for our appearance based detectors to recognize from any viewpoint. On the contrary, the results from the shoe detector are interesting in that the performance for two viewpoints is little better than for a single 83  Number of Views Mugs Bottle Bowl Shoe  1 0.57 0.67 0.71 0.1  2 0.60 0.75 0.79 0.13  3 0.65 0.76 0.86 0.18  6 0.67 0.75 0.90 0.28  Table 5.2: The average precision for various object detectors for varying number of viewpoints. 
image, but the performance continues to show significant improvement with the higher number of viewpoints. The other categories tend of peak at 3 viewpoints. One explanation for this is due to the fact that the shoe detector is a bank of 8 independent classifiers, and thus there are 8 times as many false positives leading to lower precision. Moreover, the classifier is more likely to fail to detect the object, and as a result, more viewpoints are needed to provide enough evidence that a detection is a true positive. Another experiment that we perform is to determine the effect of the scale prior on the system. We can disable the scale prior by replacing p(si |ci ) with a uniform distribution. However, even without a scale prior, we can still reason about the scale of the object since detections of the same object from different viewpoints should agree on what size the object is based upon the size of their respective bounding boxes and the distance between the cameras and the object in 3D. This means the term Equation 5.9 can still be used in our prior p(O), at least for the symmetric categories. Table 5.3 compares the results for when no scale prior is available. The scale prior is particularly effective for mugs and bottles since these objects tend to have tighter distributions on scale, unlike bowls where σbowl = 0.12m. Scale Prior Mugs Bottle Bowl  Disabled 0.60 0.69 0.84  Enabled 0.65 0.76 0.86  Table 5.3: The average precision for various object detectors with and without the scale prior.  84  5.4  Concluding remarks  Multiple viewpoint object detection has significant advantages over working with isolated images for object detection. Not only can the additional appearance information help overcome occlusion, but it can also eliminate false positives and improve relatively weak object detectors. For categories with few distinctive characteristics, such as bowls, the improvement is remarkable. For example, we can recognize 70% of the bowls without a single false positive with 3 viewpoints, whereas with only a single viewpoint half of our detections would be false positives. Moreover, the ability to reason about the size of the object can improve detection significantly as well, improving the AP from 0.69 to 0.76 for bottles, and 0.60 to 0.65 for mugs. The approach we have presented is also novel in that it also offers the ability to infer the 3D location of objects as well. We did not get a chance to evaluate our work in this regard, but there are many applications where this 3D information would be useful. The system we have presented is just a first step in the direction of embodied object detection. Another advantage of our work is that the model is modular. Arbitrary object detectors can be used in a black-box fashion, at least if some validation data is available to learn an empirical distribution for the image likelihood terms. We have demonstrated this with the discriminatively trained Deformable Parts Model (DPM) and our own contour-based object detector. Moreover, the scene prior p(O) we propose can include richer interactions such as the co-location of objects of different categories.  85  Chapter 6  Conclusions Human vision can easily detect objects in isolated images, but remains a difficult challenge for computational systems. Intra-category variation, intercategory confusion, and imaging conditions such as viewpoint, illumination, and background clutter are all confounding factors that make object detection challenging in practice. 
In this thesis we have sought to improve object detection by leveraging our knowledge of the imaging conditions in order to minimize the effect of these conditions. Our work on chamfer-matching that we outlined in Chapter 3 is an example of minimizing the effect of imaging conditions. Previous approaches to multi-scale chamfer-matching are not invariant to scale because of the manner in which the chamfer distance is normalized. We show that additional normalization can correct this, and we have demonstrated that this can result in significant improvements in object detection. In general, we cannot necessarily control the scale of objects in imagery, and thus our metrics should be invariant to scale as well. Depth information offers us direct knowledge of some of the imaging conditions. However, with stereo imagery, missing depth information and the fidelity of depth estimates for surfaces just a few metres from the camera can pose a significant challenge in utilizing stereo depth data for object detection. The modular detection system we present in Chapter 4 overcomes these issues by focusing on using stereo depth data to provide the geometric 86  context for an image rather than providing accurate surface characteristics. We show how to use a robust estimator for the depth of an image region to infer the size of the underlying surface depicted in that image region. In conjunction with a size prior for an object category, we have shown that we can improve detection rates of an appearance-based object detector. Moreover, we show that a size prior can also be utilized to improve efficiency by restricting the application of appearance-based object classifiers to likely image-space scales. Finally, we show that with robust estimators of depth we can make effective use of efficient off-the-shelf stereo algorithms, thus reducing computational burden and system complexity. Chapter 5 presents a novel approach to object detection that integrates imagery from multiple viewpoints. By jointly inferring a set of 3D objects from a set of images of the same scene, we can improve object detection in those images as well as provide the 3D location of those objects. Inferring the 3D layout of the objects allows us to integrate appearance information across multiple viewpoints, along with providing the 3D geometric context of the objects in the images. This allows us to apply size priors, a technique that we also found useful in Chapter 4, along with priors that restrict the colocation of objects in the same physical space. As we discuss in Section 6.1, additional priors about 3D layout are also possible with the model we have presented. The challenge of object detection can be significantly ameliorated with 3D geometric context. We have shown that even course stereo information can significantly improve a weak object detector, such as our easy-to-train contour-based approach, and offer some improvement in accuracy and efficiency for the harder-to-train Deformable Parts Model (DPM). We have also demonstrated that the 3D location information available via multiple viewpoints improves detection not only in that we can enforce consistency between views but also because it allows us to utilize size priors. Incorporating both stereo and multiple viewpoints offers important advantages for applications such as robot vision, not only in improving detection accuracy but also in reducing the computational burden of appearance-based object classifiers by focusing attention on likely regions in an image. 
87  6.1  Future directions  The bulk of our work attempts to leverage 3D geometric context to improve object detection. The methods we have proposed explore only a portion of what is possible, and we outline here a sampling of avenues that extend our work. • Learning Size Priors In this thesis we repeatedly make use of size priors for object detection, where the parameters of these size priors are specified by hand. These size priors and how they depend upon viewpoint, however, can be learned from data as some researchers have done (Fritz et al. (2010); Janoch et al. (2011)). Learning richer models of a category’s size is one way to improve the accuracy of localization and to improve detection responses. • Integrating Stereo with Multiple Viewpoints The underlying 3D layout of objects is available from multiple viewpoints indirectly in that it must be inferred from the appearance information. One future direction would be to incorporate depth data, acquired via stereo or other 3D sensing technology, into our multi-view approach. For example, it would be straightforward to include a factor in our probabilistic model that enforces the constraint that 3D objects should be consistent with surface measurements. Ess et al. (2007); Schindler et al. (2010) explore this in the context single-view images. • Occlusion Reasoning In both our stereo and multi-view work we have ignored the issue of occlusion. However, being able to detect partially occluded objects is essential for many applications. Occluded objects are particularly challenging not only because less information is available for classification, but also because we do not know a priori that the object is occluded. Many object detection systems fail for occluded objects because the score of the object is based upon the assumption that the entire object  88  is visible. Knowing the presence and location of occlusion can allow a detection system to be more forgiving in those circumstances. One avenue to explore would be to utilize depth information, via some 3D sensing technology, to infer the presence of occlusion. Often the presence of discontinuities in depth suggests occlusion. Moreover, our multi-view work is also amenable to occlusion reasoning, since it is possible to determine instances of occlusion for a particular object from the 3D location of other objects and the camera matrices. The work of Meger et al. (2011) is an example of an approach that explores this avenue, building on our multi-view work and incorporating pointcloud data from Kinect. • Object Location Priors Objects are not randomly distributed in space, and human vision presumably exploits this given that the manner in which we direct our gaze in searching for an object is similar to a prior (Biederman et al. (1973); Hock et al. (1975)). In fact, numerous approaches have been proposed to learn and use location priors for object detection, such as the notion that objects are typically found on supporting surfaces (Bao et al. (2010)), in certain scene types (Torralba et al. (2003)), and often co-located with other objects (Sudderth et al. (2006)). A similar extension can be made to both our stereo and multi-view work since the score we maximize in each approach includes a factor concerning the location of the objects. • Beyond Fiducial Markers In our multi-view work we were able to accurately estimate the camera projection matrices for each of the images via the use of fiducial markers. 
Many applications, however, are not amenable to the use of fiducial markers or other assumptions that ensure that accurate camera projection matrices can be estimated. For the application that inspired much of our work, visual search on a mobile robot, techniques like Simultaneous Localization and Maxmization (SLAM) and Structure-from-Motion (SfM) can provide estimates of the camera ma89  trices, but these are typically noisier than what we have experimented with in Chapter 5. One fruitful direction would be to jointly estimate the extrinsic camera parameters, objects in the scene, and other scene structures such as the supporting surfaces. A similar approach is taken taken by Bao et al. (2010); Hoiem et al. (2006), but only for single view scenarios. • Evaluation of inferred 3D locations We were only able to evaluate our work on multiple viewpoint detection by analyzing our results in 2D. However, our multi-view system is capable of inferring the 3D location of objects. Future work could look into the accuracy of this inference and into the numerous applications that can make use of this information. • Solution refinement for multiview detection In our multi-view work we seek to find a set of objects O that maximizes the posterior distribution p(O|I1 , ..., IN ). However we detail a solution that is an approximation, utilizing only a sparse set of detections returned by our underlying object detectors due to the computational complexity and non-linearity of the original problem. One future direction would be to refine this solution, beginning with the approximated set of objects, and then iteratively improve these object locations. This could produce more accurate localization and higher responses for true positives.  90  Bibliography S. Agarwal and D. Roth. Learning a sparse representation for object detection. In European Conference on Computer Vision, 2002. → pages 11, 16 M. Andriluka, S. Rother, and B. Schiele. Monocular 3D pose estimation and tracking by detection. In Computer Vision and Pattern Recognition, 2010. → pages 23 S. Y.-Z. Bao, M. Sun, and S. Savarese. Toward coherent object detection and scene layout understanding. In Computer Vision and Pattern Recognition, 2010. → pages 21, 89, 90 H. Barrow, J. Tenenbaum, R. Bolles, and H. Wolf. Parametric correspondence and chamfer matching: Two new techniques for image matching. In International Joint Conference on Artificial Intelligence, 1977. → pages 26 S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):509–522, 2002. → pages 14 M. Bertero, T. A. Pogcio, and V. Torre. Ill-posed problems in early vision. In Proceedings of the IEEE, pages 869–889, 1988. → pages 2 P. J. Besl and R. C. Jain. Three-dimensional object recognition. ACM Computing Surveys, 17(1), 1985. → pages 11 I. Biederman, A. L. Glass, and W. Stacy. Searching for objects in real-world scenes. Experimental Psychology, 97:22–27, 1973. → pages 19, 89 G. Borgefors. Hierarchical chamfer matching: A parametric edge matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(6):849–865, 1988. → pages 27 91  R. Brooks, R. Greiner, and T. Binford. The acronym model-based vision svsterm. In International Joint Conference on Artificial Intelligence, pages 105–113, 1979. → pages 11, 17 I. Bulthoff, H. Bulthoff, and P. Sinha. Top-down influences on stereoscopic depth-perception. Nature Neuroscience, 1:254–257, 1998. → pages 5 M. Burl, M. 
Weber, and P. Perona. A probabilistic approach to object recognition using local photometry and global geometry. In European Conference on Computer Vision, 1998. → pages 16 J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6): 679–698, 1986. → pages viii, 40, 42 P. Carbonetto, N. de Freitas, and K. Barnard. A statistical model for general contextual object recognition. In European Conference on Computer Vision, 2004. → pages 20 I. Chakravarty and H. Freeman. Characteristic views as a basis for three-dimensional object recognition. In In Proceedings of The Society for Photo-Optical Instrumentation Engineers Conference on Robot Vision, volume 336, pages 37–45, 1982. → pages 17 A. Coates and A. Y. Ng. Multi-camera object detection for robotics. In International Conference on Robotics and Automation, 2010. → pages 23, 69 D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002. → pages 32 N. Cornelis and L. Van Gool. Real-time connectivity constrained depth map computation using programmable graphics hardware. In Computer Vision and Pattern Recognition, 2005. → pages 49 N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. → pages 14, 29, 30 C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. In International Conference on Computer Vision, 2009. → pages 20, 32, 34 92  S. Dickinson, H. Christensen, J. Tsotsos, and G. Olofsson. Active object recognition integrating attention and viewpoint control. Compter Vision and Image Understanding, 67(3):239–260, 1997. → pages 22 S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and M. Hebert. An empirical study of context in object detection. In Computer Vision and Pattern Recognition, 2009. → pages 20 K. Ehinger, B. Hidalgo-Sotelo, A. Torralba, and A. Oliva. Modeling search for people in 900 scenes: A combined source model of eye guidance. Visual Cognition, 17(6-7):945–978, 2009. → pages 19 A. Ess, B. Leibe, and L. V. Gool. Depth and appearance for mobile scene analysis. In International Conference on Computer Vision, 2007. → pages 46, 47, 56, 60, 88 A. Ess, B. Leibe, K. Schindler, and L. V. Gool. Robust multi-person tracking from a mobile platform. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10), 2009. → pages 21 M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge. International Journal of Computer Vision, 88(2):303–338, June 2010a. → pages x, 12, 37 M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge workshop 2010, 2010b. URL http: //pascallin.ecs.soton.ac.uk/challenges/VOC/voc2010/workshop/index.html, Accessed in March 2012. → pages 30 Z.-G. Fan and B.-L. Lu. Fast recognition of multi-view faces with feature selection. In International Conference on Computer Vision, 2005. → pages 17 L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006. → pages 11 P. Felzenswalb and D. Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision, 61(1):55–79, 2005. → pages 30, 32  93  P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. 
Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 2010. → pages x, 4, 8, 16, 17, 24, 29, 30, 31, 69, 77 R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Computer Vision and Pattern Recognition, 2003. → pages 11, 15, 16, 32 V. Ferrari, T. Tuytelaars, and L. Van Gool. Object detection by contour segment networks. In European Conference on Computer Vision, 2006. → pages 25, 37, 40, 41 M. Fiala. ARTag: A fiducial marker system using digital techniques. In Computer Vision and Pattern Recognition, volume 1, 2005. → pages 80 M. Fischler and R. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computer, 22(1):67–92, 1973. → pages 16, 30 M. Fritz, K. Saenko, and T. Darrell. Size matters: Metric visual search constraints from monocular metadata. In Neural Information Processing Systems, 2010. → pages 57, 88 D. M. Gavrila. A bayesian, exemplar-based approach to hierarchical shape matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(8):1408–1421, 2007. → pages 15, 44 D. M. Gavrila and S. Munder. Multi-cue pedestrian detection and tracking from a moving vehicle. International Journal of Computer Vision, 73(1): 41–59, 2007. → pages 21, 26, 46, 48, 60 I. J. Goodfellow, Q. V. Le, A. M. Saxe, H. Lee, and A. Y. Ng. Measuring invariances in deep networks. In Neural Information Processing Systems, 2009. → pages 14 Google. 3d warehouse, 2009. URL http://sketchup.google.com/3dwarehouse, Accessed in March 2012. → pages 18 S. Gould, P. Baumstarck, M. Quigley, , A. Y. Ng, and D. Koller. Integrating visual and range data for robotic object detection. In ECCV Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications (M2SFA2), 2008. → pages 21, 47 94  A. Hanson and E. Riseman. Visions: A computer system for interpreting scenes. Computer Vision Systems, 1978. → pages 20 R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004. → pages 63 H. Hattori, A. Seki, M. Nishiyama, and T. Watanabe. Stereo- based pedestrian detection using multiple patterns. In British Machine Vision Conference, 2009. → pages 18 S. Helmer and D. Lowe. Object recognition using stereo. In International Conference on Robotics and Automation, 2010. → pages iii S. Helmer, D. Meger, M. Muja, J. Little, and D. Low. Multiple viewpoint recognition and localization. In Asian Conference on Computer Vision, 2010. → pages iii G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006. → pages 14 H. S. Hock, G. P. Gordon, and R. Whitehurst. Contextual relations: the influence of familiarity, physical plausibility, and belongingness. Attention, Perception and Psychophysics, 16:4–8, 1975. → pages 19, 89 D. Hoiem, A. A. Efros, and M. Herbert. Putting objects in perspective. In Computer Vision and Pattern Recognition, 2006. → pages 5, 21, 90 D. Hoiem, A. Efros, and M. Hebert. Recovering surface layout from an image. International Journal of Computer Vision, 75(1), 2007. → pages 5, 21 A. Janoch, Y. J. Sergey Karayev, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell. A category-level 3-d object dataset: Putting the kinect to work. In International Conference on Computer Vision, 2011. → pages 18, 88 K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. 
K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierarchies for visual recognition. In Neural Information Processing Systems, 2010. → pages 14
Microsoft Kinect. www.xbox.com/kinect, 2010, Accessed in March 2012. → pages 6
S. Kumar and M. Hebert. A hierarchical field framework for unified context-based classification. In International Conference on Computer Vision, 2005. → pages 20
A. Kushal, C. Schmid, and J. Ponce. Flexible object models for category-level 3d object recognition. In Computer Vision and Pattern Recognition, 2007. → pages 17
K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view RGB-D object dataset. In International Conference on Robotics and Automation, 2011a. → pages 18
K. Lai, L. Bo, X. Ren, and D. Fox. Sparse distance learning for object recognition combining RGB and depth information. In International Conference on Robotics and Automation, 2011b. → pages 18
C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In Computer Vision and Pattern Recognition, 2008. → pages 32
C. Laporte and T. Arbel. Efficient discriminant viewpoint selection for active Bayesian recognition. International Journal of Computer Vision, 68(3):267–287, 2006. → pages 22, 69
S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006. → pages 15
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. → pages 11, 12
A. Lehmann, B. Leibe, and L. Van Gool. Fast PRISM: Branch and bound Hough transform for object class detection. International Journal of Computer Vision, 94(2):175–197, 2011. → pages 15
B. Leibe, A. Leonardis, and B. Schiele. Combined object categorization and segmentation with an implicit shape model. In ECCV Workshop on Statistical Learning in Computer Vision, 2004. → pages 15, 28, 32
B. Leibe, N. Cornelis, K. Cornelis, and L. Van Gool. Dynamic 3d scene analysis from a moving vehicle. In Computer Vision and Pattern Recognition, 2007. → pages 3, 21, 56
L.-J. Li and L. Fei-Fei. What, where and who? Classifying events by scene and object recognition. In International Conference on Computer Vision, 2007. → pages 20
J. Liebelt, C. Schmid, and K. Schertler. Viewpoint-independent object class detection using 3d feature maps. In Computer Vision and Pattern Recognition, 2008. → pages 17
M.-Y. Liu, O. Tuzel, A. Veeraraghavan, and R. Chellappa. Fast directional chamfer matching. In Computer Vision and Pattern Recognition, 2010. → pages 28
D. G. Lowe. Object recognition from local scale-invariant features. In International Conference on Computer Vision, 1999. → pages 14, 15
D. G. Lowe. Local feature view clustering for 3d recognition. In International Conference on Computer Vision, 2001. → pages 16
D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004. → pages 3, 63
D. G. Lowe and T. Binford. The recovery of three-dimensional structure from image curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(3):320–326, 1985. → pages 17
D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using brightness and texture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5):530–549, 2004. → pages viii, 37, 40, 42
D. Meger, P.-E. Forssén, K. Lai, S. Helmer, S. McCann, T. Southey, M. A. Baumann, J. J. Little, and D. G. Lowe. Curious George: An attentive semantic robot. Robotics and Autonomous Systems, 56(6):503–511, 2008. → pages 3
D. Meger, M. Muja, S. Helmer, A. Gupta, C. Gamroth, T. Hoffman, M. Baumann, T. Southey, P. Fazli, P. Viswanathan, W. Wohlkinger, J. J. Little, D. G. Lowe, and J. Orwell. Curious George: An integrated visual search platform. In Canadian Robot Vision Conference, 2010. → pages 3, 57
D. Meger, C. Wojek, B. Schiele, and J. J. Little. Explicit occlusion reasoning for 3D object detection. In British Machine Vision Conference, 2011. → pages 89
A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:349–361, 2001. → pages 16
J. Mutch and D. G. Lowe. Object class recognition and localization using sparse features with limited receptive fields. International Journal of Computer Vision, 80(1):45–57, 2008. → pages 14
A. Opelt, A. Pinz, and A. Zisserman. A boundary fragment model for object detection. In European Conference on Computer Vision, 2006. → pages 15, 16
D. Parikh, L. Zitnick, and T. Chen. From appearance to context-based recognition: Dense labeling in small images. In Computer Vision and Pattern Recognition, 2008. → pages 19
J. Ponce, T. L. Berg, M. Everingham, D. A. Forsyth, M. Hebert, S. Lazebnik, M. Marszalek, C. Schmid, B. C. Russell, A. Torralba, C. K. I. Williams, J. Zhang, and A. Zisserman. Dataset issues in object recognition. In J. Ponce, M. Hebert, C. Schmid, and A. Zisserman, editors, Lecture Notes in Computer Science: Toward Category-Level Object Recognition. Springer-Verlag, 2006. → pages 12
I. Poupyrev, H. Kato, and M. Billinghurst. ARToolKit user manual, version 2.33. Human Interface Technology Lab, University of Washington, 2000. → pages 80
M. Quigley, S. Batra, S. Gould, E. Klingbeil, Q. Le, A. Wellman, and A. Y. Ng. High-accuracy 3d sensing for mobile manipulation: Improving object detection and door opening. In International Conference on Robotics and Automation, 2009. → pages 18, 47
L. G. Roberts. Machine perception of three-dimensional solids. Optical and Electro-optical Information Processing, pages 159–197, 1965. → pages 10, 17, 21
F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce. 3d object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. International Journal of Computer Vision, 2006. → pages 16
H. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:22–38, 1998. → pages 11
P. E. Rybski. Semantic robot vision challenge, December 2010. URL http://www.semantic-robot-vision-challenge.org/index.html, Accessed in March 2012. → pages x, 3
J. Sattar, E. Bourque, P. Giguere, and G. Dudek. Fourier tags: Smoothly degradable fiducial markers for use in human-robot interaction. In Computer and Robot Vision, Montreal, Quebec, Canada, May 2007. → pages 80
S. Savarese and L. Fei-Fei. 3d generic object categorization, localization and pose estimation. In International Conference on Computer Vision, 2007. → pages 17, 78
A. Saxena, S. H. Chung, and A. Y. Ng. 3-d depth reconstruction from a single still image. International Journal of Computer Vision, 76(1):53–69, 2008. → pages 5
D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47:7–42, 2002. → pages 49
B. Schiele and J. Crowley. Transinformation for active object recognition. In International Conference on Computer Vision, 1998. → pages 22
K. Schindler, A. Ess, B. Leibe, and L. Van Gool. Automatic detection and tracking of pedestrians from a moving stereo rig. ISPRS Journal of Photogrammetry and Remote Sensing, 65(6):523–537, 2010. → pages 22, 46, 47, 60, 88
H. Schneiderman and T. Kanade. A statistical approach to 3d object detection applied to faces and cars. In Computer Vision and Pattern Recognition, 2000. → pages 17
T. Serre, L. Wolf, and T. Poggio. Object recognition with features inspired by visual cortex. In Computer Vision and Pattern Recognition, 2005. → pages 14
J. Shotton, A. Blake, and R. Cipolla. Multi-scale categorical object recognition using contour fragments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007. → pages 15, 16, 27, 28, 29, 44
J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition. In European Conference on Computer Vision, 2006. → pages 20
D. J. Simons and C. F. Chabris. Gorillas in our midst: sustained inattentional blindness for dynamic events. Perception, 28:1059–1074, 1999. → pages 19
B. Stenger, A. Thayananthan, P. Torr, and R. Cipolla. Model-based hand tracking using a hierarchical Bayesian filter. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1372–1384, 2006. → pages 27
T. M. Strat and M. A. Fischler. Context-based vision: Recognizing objects using information from both 2d and 3d imagery. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(10):1050–1065, 1991. → pages 20
H. Su, M. Sun, L. Fei-Fei, and S. Savarese. Learning a dense multi-view representation for detection, viewpoint classification and synthesis of object categories. In International Conference on Computer Vision, 2009. → pages 17
E. B. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky. Depth from familiar objects: A hierarchical model for 3d scenes. In Computer Vision and Pattern Recognition, 2006. → pages 21, 89
M. Sun, H. Su, S. Savarese, and L. Fei-Fei. A multi-view probabilistic model for 3d object classes. In Computer Vision and Pattern Recognition, 2009. → pages 17
A. Thomas, V. Ferrari, B. Leibe, T. Tuytelaars, and L. Van Gool. Using multi-view recognition and meta-data annotation to guide a robot's attention. International Journal of Robotics Research, 2009. → pages 17
A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin. Context-based vision system for place and object recognition. In International Conference on Computer Vision, 2003. → pages 20, 89
A. Torralba, K. Murphy, and W. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In Computer Vision and Pattern Recognition, 2004. → pages 14
A. Torralba, K. Murphy, and W. T. Freeman. Sharing features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(5), 2007. → pages 17
A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008. → pages 2, 19
T. Tuytelaars and K. Mikolajczyk. Local invariant feature detectors: a survey. Foundations and Trends in Computer Graphics and Vision, 3:177–280, July 2008. → pages 15
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. → pages 12, 15
M. Weber, M. Welling, and P. Perona. Unsupervised learning of models for recognition. In European Conference on Computer Vision, 2000. → pages 16
D. Wilkes and J. Tsotsos. Active object recognition. In Computer Vision and Pattern Recognition, 1992. → pages 22
C. Wojek, S. Walk, and B. Schiele. Multi-cue onboard pedestrian detection. In Computer Vision and Pattern Recognition, 2009. → pages 23
C. Wojek, S. Roth, K. Schindler, and B. Schiele. Monocular 3D scene modeling and inference: Understanding multi-object traffic scenes. In European Conference on Computer Vision, 2010. → pages 21
H. Zhang, A. Berg, M. Maire, and J. Malik. SVM-KNN: Discriminative nearest neighbour classification for visual category recognition. In Computer Vision and Pattern Recognition, 2006. → pages 15

