ODMAS: Object Discovery through Motion, Appearance and Shape

by

Tristram Southey
B.Comp., The University of Guelph, 2003

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in
The Faculty of Graduate Studies (Computer Science)

The University of British Columbia
October 7, 2005
© Tristram Southey 2005

Abstract

In this thesis we examine the problem of object discovery, the autonomous acquisition of object models, using a combination of shape, appearance and motion. We propose a new technique for detecting rigidly moving objects and constructing models of their appearance and shape called the ODMAS (Object Discovery through Motion, Appearance and Shape) system. Our technique is a multi-stage approach. First, a stereo camera is used to acquire a sequence of images and shape maps of a given scene. Then the scene is oversegmented using normalized cuts based on a combination of shape and appearance. SIFT image features are matched between sequential pairs of images to identify groups of moving features, and the three-dimensional locations of these moving features in the scene are identified using the shape map from the stereo camera. The moving features are then further individuated into objects by identifying groups where the motion of the features is rigid. These rigid feature groups are used to determine which regions in the segmentation of the scene correspond to objects, grouping together oversegmented regions as necessary. Additional features are extracted from these regions and combined with the rigidly moving image features to create snapshots of the object's appearance and shape. Finally, these snapshots are grouped together over time to produce models of the objects.

Contents

Abstract
Contents
List of Figures
Acknowledgements

1 Introduction
  1.1 Research Motivation
    1.1.1 What is Object Discovery?
    1.1.2 What does Object Discovery do?
  1.2 Research Objectives
  1.3 Research Contributions
    1.3.1 Motion-based Individuation
    1.3.2 Post-motion Static Individuation
    1.3.3 Normalized Cut based Scene Segmentation

2 Related Work
  2.1 Fundamental Object Discovery Concepts
    2.1.1 What is an Object?
    2.1.2 What is an Object Model?
    2.1.3 Motion and Object Discovery
    2.1.4 Passive vs. Active Object Discovery
  2.2 Appearance and Shape: The Two Inputs to Object Discovery
    2.2.1 Appearance-based Object Discovery
    2.2.2 Shape-based Object Discovery
  2.3 Comparison of Image and Shape-based Approaches to Object Discovery
    2.3.1 Individuation
    2.3.2 Object Tracking
    2.3.3 Object Modeling
    2.3.4 Recognition
  2.4 Existing Object Discovery Systems
    2.4.1 Bootstrap Learning for Object Discovery
    2.4.2 Is Bottom-Up Attention Useful for Object Recognition?
    2.4.3 Object Level Grouping for Video Shots
    2.4.4 Significant Non-Object Discovery Research
    2.4.5 Scale Invariant Feature Transform
    2.4.6 Normalized Cuts for Image Segmentation

3 A Combined Shape and Appearance-based Approach to Object Discovery
  3.1 Object Discovery using both Appearance and Shape
  3.2 Combined Shape/Appearance Individuation
    3.2.1 Static Depth and Intensity Individuation
    3.2.2 Rigid Body Motion Individuation
  3.3 Object Region Voting
  3.4 Object Region Voting for Combined Motion-based and Static Individuation
    3.4.1 Object Region Voting for Static Object Modeling
  3.5 Additional ODMAS Subsystems
    3.5.1 Tracking
    3.5.2 Modeling
    3.5.3 Recognition

4 The ODMAS System
  4.1 System Overview
  4.2 Primary Motion-based System Components
    4.2.1 Input
    4.2.2 Normalized Cut Static Individuation
    4.2.3 Rigid Body Motion Individuation
    4.2.4 Moving Object Region Voting
    4.2.5 Moving Object Modeling
  4.3 Secondary Static Object Discovery System
    4.3.1 Static Object Region Voting
    4.3.2 Static Object Modeling

5 Experiments and Results
  5.1 Experiment Overview
  5.2 Experiment 1: Multiple Moving Object Discovery
  5.3 Experiment 2: Object Discovery using a Moving Camera and the Static Object Modeling System
  5.4 Experiment 3: ODMAS for 3D Object Modeling
  5.5 Summary of Results

6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work
    6.2.1 3D Feature Localization through Stereo SIFT Matching
    6.2.2 Learning Object Features
    6.2.3 Non-rigid Object Detection
    6.2.4 Improved Scene Segmentation using New Image Properties

Bibliography

List of Figures

3.1 Segmentation using only Intensity
3.2 Segmentation using only Depth
3.3 Segmentation using Intensity and Depth
4.1 Diagram of the ODMAS System Flow
4.2 An Image and Depth Map Input to ODMAS
4.3 Segmentation using Normalized Cuts on Depth and Intensity
4.4 Feature Matching
4.5 Rigidly Moving Feature Individuation
4.6 Region Voting to find Additional Object Features
5.1 Input to Experiment 1
5.2 Segmentation in Experiment 1
5.3 Moving Features in Experiment 1
5.4 Rigid Feature Individuation in Experiment 1
5.5 Object Snapshots from Experiment 1
5.6 First Example of Object Recognition in Experiment 1
5.7 Second Example of Object Recognition in Experiment 1
5.8 Input to Experiment 2
5.9 Segmentation in Experiment 2
5.10 Moving Features in Experiment 2
5.11 Rigid Feature Individuation in Experiment 2
5.12 Object Snapshots from Experiment 2
5.13 Static Object Feature Matching in Experiment 2
5.14 Static Snapshots from Experiment 2
5.15 3D Models from Experiment 3

Acknowledgements

To Qixing, You are Poetry and Light.

Chapter 1

Introduction

1.1 Research Motivation

1.1.1 What is Object Discovery?

Autonomous learning is an ability possessed by all intelligent creatures. As humans, we are continually acquiring information about our environment, processing it and remembering it for later use. However, most robots currently in use in the world perform little or no autonomous learning, relying instead on pre-programmed data "taught" to them by their programmers. This makes robotic systems difficult to implement because of the high complexity of programming them with all the necessary information they will need to perform their duties. In any artificially intelligent system the ability to learn without supervision is useful, since often there is simply too much information necessary for a particular task for it all to be taught explicitly.

An important ability that needs to be acquired is autonomous object-level learning, also known as object discovery. Object discovery [Modayil and Kuipers, 2004] is an unsupervised learning technique that allows an agent to generate models of objects through observations of its environment. The object models learned by the agent are created using input data about objects that it observes in its environment.
The agent in an object discovery system is the device responsible for acquiring the input to the system. Potential agents for object discovery systems include robots, security cameras, camera-equipped wearable computers or any other computer system with some ability to sense its environment.

1.1.2 What does Object Discovery do?

Object discovery is an ability with numerous applications in both robotics and computer vision. It performs the essential step of assimilating the complex inputs available to such systems, reasoning about them and producing models of objects. A key step in the object discovery process is to determine a one-to-one mapping between each element of the input and either an object or the background [Modayil and Kuipers, 2004]. In object discovery this is called individuation. Such a mapping can assist other systems by dividing the input data into data corresponding to objects and data corresponding to the background, allowing the system to focus on the data it requires.

Object discovery is also valuable since it is designed to accumulate data about objects over time, allowing for the creation of more complete object models. This is done using object tracking, which allows the system to correlate multiple views of an object. However, despite its usefulness, not all object discovery systems perform object tracking, preferring to focus just on the problem of individuation.

1.2 Research Objectives

We believe that there are three properties on which object discovery should be based: motion, appearance, and shape. Each of these three factors is useful in the task of autonomously learning models of objects, and they complement each other well. The goal of the research presented in this thesis was to design and implement an object discovery system that combines all these elements. To this end, we created the ODMAS system, which stands for Object Discovery through Motion, Appearance and Shape, to demonstrate the efficacy of our approach to object discovery. We hope to show that our approach can accurately model the appearance of a moving object, such that the object can be recognized in a new scene. We would also like to be able to acquire 3D models of the objects, but this is not a primary goal of this research. The effectiveness of the system will be shown through three experiments that demonstrate how the system performs under a variety of circumstances and the usefulness of the models it produces for object recognition.

1.3 Research Contributions

In this thesis we propose and test three new techniques which are of benefit to the area of object discovery.

1.3.1 Motion-based Individuation

In this thesis, we describe and test a new multi-stage technique for performing individuation in object discovery that incorporates motion, appearance and shape information. Individuation refers to the process of dividing input data into data corresponding to objects and data corresponding to the background. Our new approach involves segmenting the scene, using both appearance and shape data, into regions corresponding to objects and regions corresponding to background. Because of the difficulty in accurately segmenting the scene, there are usually multiple regions in our segmentation that correspond to a single object.
We use rigid motion data acquired from moving objects to determine which regions in the scene correspond to the same object, and combine them together to find all the data associated with a moving object.

1.3.2 Post-motion Static Individuation

We also propose and test a technique that allows us to perform individuation on a moving object after the motion has ceased. This technique is based on using a model of the object acquired through motion-based object discovery and then matching features from that model against non-moving features to determine the location of the same object in the scene. Then, as was described above, we segment the scene based on both appearance and shape into regions corresponding to objects and regions corresponding to the background. Using the locations of the matching features, we are able to determine which regions correspond to the modeled object. New data about the object can then be acquired from those regions.

1.3.3 Normalized Cut based Scene Segmentation

Both of the contributions described above involve segmenting the scene based on a combination of shape and appearance data. In this thesis, we propose a way of combining shape and appearance data using a technique called normalized cuts [Shi and Malik, 2000]. This technique allows for a more accurate segmentation of background and objects by incorporating cues from the appearance of the objects and the difference in depth between them and the background.

Chapter 2

Related Work

2.1 Fundamental Object Discovery Concepts

2.1.1 What is an Object?

As was stated in the previous chapter, object discovery is an unsupervised learning technique that allows an agent to generate models of objects through observations of its environment. Before we can discuss how we can perform object discovery, we must define three terms: object, background and environment. The environment of an agent is its immediate, visible surroundings and is made up of objects and background. The background is everything in the environment that is not an object. The difficulty lies in defining what an object is.

As early as 1734, the philosopher George Berkeley noted the difficulty in determining the separation between object and background and the significance of physical structure in the definition of object [Berkeley, 1975]. Subsequent research has emphasized the importance of coherence under motion in the definition of an object [Yu et al., 2001]. Intuitively, for many people this makes sense: an object is a structure that, when a force is applied, moves together. We define an object as a 3D structure or group of structures where all points on the structure move rigidly when a force is applied to them (see Section 2.3.1 for a formal description of rigidity). The restriction that we only consider rigid structures as objects is a necessity of the object discovery techniques used in our system. Since a crucial point in our definition of object is rigid movement, an object must be observed in motion before it can be identified as an object.

To illustrate the above three definitions, picture a normal office scene. The whole of the visible area of the office is the environment. The walls, floor and ceiling are obviously not objects as they cannot move, so they are part of the background. The desks, shelves and other large heavy things move very rarely and so will usually be considered part of the background.
If they are seen moving (e.g., someone decided to reorganize the office), then they would become objects. The same applies for items like ornaments, pictures and other portable yet rarely moved items. The remaining items, which can move by themselves or that are moved by other entities, and which are rigid, are considered objects. In an office, such items would include coffee mugs, books, people's faces and garbage cans.

2.1.2 What is an Object Model?

An object model is a description of an object [Fitzgibbon and Zisserman, 2003]. A good model can condense the large quantities of information produced through individuation and tracking into a useful and manageable form. There are two types of models, generative and discriminative. A generative object model is one that contains a sufficiently accurate and complete description of an object that the original data used to create the model can be approximately recovered. Generative models are important in applications such as movies or games where the model needs to encode enough data that an image of the object can be reconstructed for the viewer. A discriminative object model is one which can be used to determine whether some set of data points from an object is similar to that of the original object. Discriminative models are often used in recognition systems to identify the presence or location of an object in some input [Fergus et al., 2004].

Once object models are created, all that is required is a label, a semantic description of what object the model is based on, like "coffee mug" or "optical mouse". With a database of labeled discriminative models, a recognition system could answer queries like, "Where is the coffee mug?" or "Can you see the optical mouse?". The process of attaching labels to models is outside the scope of this work.

The two types of data found in object models are appearance and shape. Appearance is a description of the visual properties of an object, such as color and texture, and is detected using a camera. Shape is a description of the 3D structure of an object and is usually detected using devices like laser range-finders and stereoscopic depth cameras.

2.1.3 Motion and Object Discovery

Our analysis of object discovery techniques has revealed three basic ways of performing individuation in object discovery: explicit motion, implicit motion and static. Object motion is important for object discovery as it is the only way of determining whether a structure is movable and whether its components will move rigidly, and both of these properties are part of the definition of an object given above. Object motion is also a useful property for performing object individuation since if a region of input can be determined to be moving separately from the background, it is likely to correspond to an object (see Section 2.3.1 for more details).

Object discovery techniques based on explicit object motion use tracking techniques to detect the motion of an object over sequential input frames. They require that the object move while within the view of the agent. Also, they require temporal knowledge about when input frames occur relative to each other (e.g., a video sequence from a camera) to detect object motion. The object-level grouping work [Sivic et al., 2004] discussed in Section 2.4.3 is an example of an explicit object motion discovery approach.
Rather than detecting explicit object motion, implicit object motion uses reasoning techniques to determine that an object must be able to move without actually seeing it do so. Implicit object motion individuation generally relies on identifying two views of a scene where one contains the object and the other does not. If the two views are from the same position, then this is evidence that the object may have moved between these frames of input. Interestingly, knowledge of the temporal relationship between the two frames is unnecessary since the ordering does not matter. Background subtraction can be used to find implicit object motion [Ballard and Brown, 1982, Lee, 2005], and so can occupancy grids [Modayil and Kuipers, 2004] (see Section 2.4.1 for an example of occupancy grids being used for individuation).

Techniques that do not detect any object motion, either implicit or explicit, we call static object discovery. These approaches are much less reliable since they rely on statistical information about the appearance of objects relative to a background [D. Walther and Perona, 2005]. For example, a common assumption in these systems is that regions with a uniform brightness correspond to objects [Shi and Malik, 2000]. While this is certainly a property of some monochromatic or lightly textured objects, it is not sufficient for individuating all objects. Using the definition of object given in Section 2.1.1, the best that these approaches can hope to achieve is that what they detect might be an object, since they cannot determine whether an object is rigid or whether it is attached to the background.

2.1.4 Passive vs. Active Object Discovery

Any agent performing object discovery is either a passive or an active observer. Passive object discovery techniques are ones where the observing agent does not intentionally physically interact with the objects in the environment. Any object motion comes either from the object itself (e.g., a person walking) or some other external stimulus (e.g., a thrown brick). The major drawback of passive motion-based approaches is that many objects cannot move under their own power and are only rarely moved by other entities, and as such are considered part of the background. Lack of object motion makes object individuation much more difficult and less reliable. Passive object discovery is used in situations where the agent is either incapable of interaction with its environment, like a security camera, or when it is in an environment where physical interaction is either dangerous or destructive (which is the case in most human-inhabited environments) [D. Walther and Perona, 2005, Modayil and Kuipers, 2004, Sivic et al., 2004].

An active object discovery technique is one where the agent interacts physically with the objects it is trying to model. Usually such interactions are restricted to robots pushing or prodding objects in the environment with their body or an arm to create object motion. Fitzpatrick et al. [Fitzpatrick and Metta, 2003] have worked on this problem using a robot equipped with a movable arm. This allows their agent to generate the object motion that is used for motion-based object discovery, which is obviously of benefit in individuating and modeling objects that rarely move. Unfortunately, there are serious problems with active object discovery that need to be addressed before it can be employed more generally.
The most serious concern with active object discovery is that the exploring agent will damage itself or the objects it is investigating. A robot that prods all objects it wants to examine, even gently, could still cause havoc. The easiest solution is simply to modify the environment to remove breakable and valuable objects, effectively "robot-proofing" it, but this is not feasible in most situations. Other solutions are needed that either allow the robot to reliably manipulate objects without breaking them or identify and avoid breakable objects and object materials.

2.2 Appearance and Shape: The Two Inputs to Object Discovery

There are two major sources of input to object discovery systems: shape and appearance. The next section will explore these two approaches side by side, comparing them at each of the four major steps of object discovery. This section will highlight the relative strengths and weaknesses of each approach and lead into the next chapter, which discusses our approach based on a combination of the two. It should be mentioned that motion is not an input to an object discovery system; motion is a property derived from either appearance or shape data.

2.2.1 Appearance-based Object Discovery

Most approaches to object discovery use appearance-based systems that acquire data from devices like digital cameras. Appearance data, more commonly called images, are 2D light intensity functions I(x,y) where the x and y values are the spatial coordinates of a pixel, which is a representation of the light reflected by a region in the scene. Monochromatic pixels have a single value denoting the average brightness of a region. Color pixels typically have three values representing the intensity of the red, green and blue light reflected by the region. Images provide a rich source of information about the world, but they are correspondingly complex to analyze. The whole field of computer vision and image analysis is devoted to this problem of interpreting appearance data. A major problem encountered when using images is that they are a projection of the 3D world onto a 2D plane. Fortunately, techniques like structure from motion allow the reconstruction of the 3D scene shape from multiple images [Ballard and Brown, 1982].

Object discovery using vision is a high-level learning task and requires a synthesis of techniques from computer vision. The high dimensionality of image inputs can be simplified using image features, and these form the basis of many appearance-based object discovery systems. Image features are regions of an image that can be easily re-detected in another image under different scales, orientations, affine changes and lighting. Features have a descriptor, which is a descriptive model of the region of the image upon which the feature is based. They usually also have an (x,y) position in the image, either at the pixel or sub-pixel level, that corresponds to the center of the region upon which the descriptor is based. Image features allow systems to focus only on a small number of useful areas in the image and ignore the rest.

2.2.2 Shape-based Object Discovery

Shape-based object discovery requires an input device that is able to acquire data about the physical structure of a scene. Typically, this is done using range-finding devices. A collection of range data about a scene is called a depth map.
A depth map is a collection of data points where each point gives the distance from the camera to a point on the surface of a structure in the scene. Depth maps can be converted into shape maps. A shape map is a collection of 3D points that correspond to points on the surfaces of visible structures within the scene, with the origin generally at the viewpoint of the acquisition device. The spatial configuration of the range points or 3D shape points in a depth map or shape map varies between different input devices.

The two most commonly used range-finding devices in object discovery are planar laser range-finders and stereo depth cameras. Planar laser range-finders provide highly accurate range data about the scene, but only on a single plane. A stereo depth camera can provide a dense depth map of a scene, but the data is often noisy and inaccurate, especially in regions with little or no texture. Dense laser range-finders exist and can produce highly accurate dense shape maps, but they are not practical for object discovery because their refresh rates are slow and their range limited.

Laser range-finders work by firing a spread of laser light beams and measuring the length of time it takes for them to bounce back. Those systems useful for individuation typically fire the beams in an arc along a single horizontal plane with the beams at equidistant angles. The (x,y,z) points in the shape map produced by a laser range-finder correspond to the points in the world where the beams encounter a solid surface. The beams are highly accurate at determining distance along their length.

Stereo depth cameras use two or more cameras to generate a dense shape map of the scene. They work on the same principle as human stereo vision. The 3D location of a pixel in one image can be determined by finding the horizontal difference between it and the corresponding pixel in the other camera. Here the corresponding pixel refers to the pixel that is a reflection of light from the same point in the scene. Once the distance between the pixels, called the disparity, is known, it is straightforward to determine the distance of that point from the camera.

In a two-camera stereo system, images are acquired from each of the two cameras simultaneously. A process known as rectification is then performed to establish epipolar constraints between the two images: after rectification, pixels that are reflections of light from the same point in the scene lie along the same horizontal line in each image [Ballard and Brown, 1982]. The disparity between pixels must then be determined. One camera is used as the reference camera, and it provides the image that is the basis for the disparity, depth and shape maps. For each pixel in the reference image, the corresponding pixel along the epipolar line in the other image is found. A commonly used approach for doing this is called correlation stereo. This approach places a correlation mask, which is a square region of pixels, around each pixel in the reference image to match against the pixels on the corresponding epipolar line [Scharstein and Szeliski, 2002]. The correlation mask on the pixel in the reference image is then compared against the correlation mask of each pixel on the corresponding epipolar line in the other image to find the best match. The disparity is the difference in horizontal position between the reference pixel and the best match. All the disparity values for the image form the disparity map.
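As an illustration of the correlation stereo approach just described, the following is a minimal sketch that computes a disparity map by sliding a square correlation mask along each epipolar line of a pair of rectified grayscale images. The sum-of-absolute-differences cost, the window size and the disparity search range are illustrative assumptions, not the parameters of any particular stereo camera.

```python
import numpy as np

def sad_disparity(ref, other, max_disp=64, half=3):
    """Correlation-stereo sketch: for each pixel in the (left) reference
    image, slide a square correlation mask along the same epipolar
    (horizontal) line in the other (right) image and keep the horizontal
    offset with the lowest sum-of-absolute-differences cost."""
    h, w = ref.shape
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            mask = ref[y - half:y + half + 1, x - half:x + half + 1]
            best_cost, best_d = np.inf, 0
            for d in range(max_disp):
                cand = other[y - half:y + half + 1,
                             x - d - half:x - d + half + 1]
                cost = np.abs(mask.astype(np.int32) - cand).sum()
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp
```

Converting each disparity d to depth then follows from Z = fB/d, where f is the focal length in pixels and B the displacement (baseline) between the cameras, as the text describes next.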
The disparity between the two pixels can then be used to determine the depth of that pixel using the focal length of the lens, the resolution of the image and the displacement between the cameras. The depth map can then be used to determine the shape map of the scene. The shape map and depth map can then be used for shape-based object discovery.

Shape-based object discovery bears a strong resemblance to shape-based Simultaneous Localization and Mapping (SLAM), a term coined by Leonard and Durrant-Whyte [Leonard and Durrant-Whyte, 1991], which is one of the most active research areas currently in robotics. SLAM systems allow a robot in an unknown environment to both map that environment and situate itself at all times within that map. Typically, the goal of the mapping is to have an accurate description of the permanent features of the environment like walls, desks and bookcases. The majority of research into SLAM uses laser range-finders or stereo depth cameras. SLAM is similar to object discovery since SLAM is concerned with mapping the background and ignoring transient objects, while object discovery aims to model objects and ignores the background. Therefore, it is logical to think that shape-based approaches would make sense in object discovery, given their success in SLAM.

2.3 Comparison of Image and Shape-based Approaches to Object Discovery

The problem of object discovery can be broken down into the following four steps:

Individuation: Segmenting the raw input data into object snapshots.

Object Tracking: Matching snapshots over time and producing object snapshot sequences.

Modeling: Constructing a model of the object based on a snapshot sequence.

Recognition: The detection of previously modeled objects.

The implementation of these steps is very different in shape and image-based approaches. Both approaches can be best understood by comparing them at each step.

2.3.1 Individuation

Individuation is the process of segmenting input data and identifying parts of the input that make up individual objects. Individuation greatly reduces the difficulty of managing complex inputs by grouping input elements into objects and allowing systems to ignore the regions that belong to the background.

The input data points clustered using individuation could be pixels, depth points, image features, edges, or other elements of the input. The goal of individuation is to find a mapping from these data points to either objects in the scene or the background. A perfectly accurate mapping can rarely be achieved and there is usually either oversegmentation or undersegmentation. Oversegmentation occurs when the points that belong to one object map to two or more. Undersegmentation occurs when the points mapping to an object actually belong to two or more objects, or to an object and the background. Oversegmentation is usually preferable to undersegmentation, as it is much easier to combine oversegmented groups together than it is to redetermine a new correct set of mappings in an undersegmented group [Patras et al., 2001]. Once individuation is performed, the elements of the input that map to a particular object are grouped together into a set of data points called an object snapshot, which we denote as O.

Static individuation is typically performed on data acquired over an extremely short period of time, like one frame from a camera or a single scan from a laser range-finder, so there is no time for motion to occur.
Static individuation is used in motion-less object discovery systems and typically relies on finding homogeneous regions of the input that are spatially contiguous or close. This homogeneity is in one or more of the properties of the input, such as color in images or distance from the camera in depth maps. The theory behind this is that these properties are more likely to be consistent within an object than between it and something else. Homogeneous regions are segmented from the rest of the input, hopefully dividing the individual objects from the background.

Motion-based individuation uses input data acquired over time and employs explicit and/or implicit object motion detection techniques. Object snapshots in motion-based systems group together data points within a single input frame or across multiple frames and over time, based on those points being in motion. As was described in Section 2.1.3, object motion is useful for individuation because it provides a means for discriminating between the static background and moving objects.

In motion-based individuation, it is often necessary to determine the 3D position of the image data points given their 2D position [Sivic et al., 2004]. The 3D position can be used to find groups of rigidly moving features, corresponding to a rigid object. An object is said to be rigid if the motion of every point follows the same rigid transform. A rigid transform between points p and q takes the form

q = Rp + t

where R is an n-dimensional rotation matrix and t is an n-dimensional translation vector. To find rigid body motion from appearance data, the 2D image points must be mapped back to their original 3D real-world positions [Ballard and Brown, 1982]. This is, of course, unnecessary for shape-based methods since the 3D positions of their data points are already known.

Static Appearance-based Individuation

There are several properties that can be used to segment an image into object snapshots. Homogeneous regions of color can be used to find monochromatic objects, but this tends to oversegment objects with multiple colored regions. Image intensity or brightness, which is an average of the three color channels, is an image property commonly used as a basis for individuation [Felzenszwalb and Huttenlocher, 2004, Shi and Malik, 2000]. Intensity-based individuation relies on each object having a similar brightness level over all of its visible surface, one that is sufficiently different from the background. There are other image properties that can be used for individuation, like color, texture and edges. A wide variety of techniques are used to segment images using these properties, such as normalized cuts [Shi and Malik, 2000] (see Section 2.4.6), region growing [Ballard and Brown, 1982] and image saliency [D. Walther and Perona, 2005]. However, the problem of individuating objects from a single image remains very difficult. One obstacle faced by many systems is that, even after the images have been segmented, it is unclear whether a region corresponds to the background or an object. The approach we employ to solve this problem is to use object motion to identify regions that move and associate them with objects. However, this requires multiple frames of data and so it is not a form of static individuation.
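To make the normalized cuts technique referenced above concrete, the following is a minimal sketch of a single cut on a small grayscale image, in the spirit of Shi and Malik [2000]: pixels become graph nodes, edge weights combine intensity similarity with spatial proximity, and the second-smallest generalized eigenvector gives a bipartition. The parameter values and the median-threshold split are simplifying assumptions (Shi and Malik search over splitting points, and ODMAS's variant adds a depth term to the weights); the dense affinity matrix limits this sketch to small images.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(img, sigma_i=0.05, sigma_x=6.0, radius=5):
    """One normalized cut on a small grayscale image in [0, 1]."""
    h, w = img.shape
    coords = np.array([(y, x) for y in range(h) for x in range(w)], dtype=float)
    vals = img.reshape(-1).astype(float)
    n = h * w
    W = np.zeros((n, n))
    for i in range(n):
        d2 = ((coords - coords[i]) ** 2).sum(axis=1)
        near = d2 < radius ** 2          # only connect nearby pixels
        W[i, near] = (np.exp(-(vals[near] - vals[i]) ** 2 / sigma_i ** 2)
                      * np.exp(-d2[near] / sigma_x ** 2))
    D = np.diag(W.sum(axis=1))
    # Second-smallest generalized eigenvector of (D - W) y = lambda * D * y.
    _, eigvecs = eigh(D - W, D)
    y = eigvecs[:, 1]
    return (y > np.median(y)).reshape(h, w)  # boolean segment mask
```

Recursively re-cutting each resulting region produces the kind of oversegmentation that ODMAS later merges using motion evidence.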
Motion-based Appearance-based Individuation

A simple, well-known approach to motion-based appearance-based individuation is background subtraction [Ballard and Brown, 1982], which uses implicit object motion for detection. In its most common form, it requires that the camera be fixed and the background stationary. First, the background intensity image, I_b, is acquired for that camera position while no objects are present. This image is then subtracted from each new image I that may contain objects to get I_s:

I_s(x,y) = I(x,y) - I_b(x,y)

Regions of high magnitude in I_s correspond to a difference in pixel intensity between the new image and the original background image, which may in turn correspond to a new object. However, other causes of intensity difference, like changes in room illumination or semi-moving background objects like wind-blown trees, may also cause responses. This approach works well so long as the camera remains stationary, especially if a model of the background image is acquired over time to deal with problems like changing illumination or new stationary objects [Lee, 2005].

When the camera is moving, the problem of finding moving objects becomes more difficult because the background is no longer static. One approach to solving this problem is to individuate objects by finding rigidly moving bodies. The first step is to identify regions of the image that are moving. A common approach to this is feature tracking, where distinctive image features are tracked between pairs of frames to find the motion [Sivic et al., 2004]. Given two temporally adjacent images I_1 and I_2, image features are extracted from each, creating two sets of features F_1 and F_2. Feature matching is then performed between the features in the two sets. For each feature in F_1, the nearest neighbor is found in F_2, where nearest neighbor is defined using some distance measure based on feature descriptors or other properties such as scale and orientation. Since many features in F_1 will not have a match in F_2, there needs to be a mechanism for discarding poor matches. Some systems employ a global threshold, while others only accept a match if the descriptor distance for the best match is more than a threshold percentage closer than that of the second best. The results of the feature tracking are pairs of features, which can be linked over multiple frames to produce longer feature tracks.

These tracked features can be individuated into object snapshots by finding groups that move rigidly [Sivic et al., 2004]. To do this, it is first necessary to find their 3D position in each frame. The problem of reconstructing 3D shape from 2D feature motion is called structure from motion. Given feature/point correspondences from multiple viewpoints, it is possible to compute the 3D location of points from three projections [Fitzgibbon and Zisserman, 2003]. Given the feature tracks and the features' 3D positions, the next step is to find rigidly moving bodies. The 3D distance between each pair of features on a rigidly moving body remains constant under any transformation. Groups of features that move rigidly can therefore be found by comparing, for each pair of features, the distance between their previous 3D positions and the distance between their current 3D positions. Once the features are clustered into rigidly moving bodies, it is easy to remove the background if the camera motion is known. The movement of the camera will apply a rigid transform to all points in the image.
To find the background, all that need be done is to find the group of rigidly moving features whose rigid transform is closest to the one expected given the camera motion. Once the background feature group has been removed, each remaining group of rigidly moving features can become an object snapshot.

Static Shape-based Individuation

There are several techniques for performing static shape-based individuation, but they all rely on the same basic idea: that objects are solid masses in the environment, and so segmentation can divide the scene into 3D volumes that might correspond to objects. A simplistic way of performing static individuation with shape data is to cluster the points spatially using some measure of closeness. This can be done with laser range data using a greedy clustering algorithm which groups points that are within a certain threshold distance of each other [Modayil and Kuipers, 2004]. A similar technique that can be used with dense depth maps is region growing [Ballard and Brown, 1982]. More complex techniques exist that use normalized cuts to individuate 3D points into objects based on their proximity to each other and their surface normals [Yu et al., 2001]. Another approach to static shape-based individuation with a depth map is to find discontinuities in the depth map that signal the edges of objects and use them as a basis for segmentation of the depth map [Izquierdo and Ghanbari, 1999]. This generally works so long as the object is sufficiently distant from its immediate background and does not overlap with other objects at a similar depth.

There are several difficulties that are commonly encountered in static shape-based individuation. Two objects that are close together are very difficult to distinguish from one object and are likely to be grouped together, causing undersegmentation in the individuation. Another common problem when using planar laser range-finders is that a single object that intersects the viewing plane only at widely spaced intervals (like table legs) becomes several separate objects, causing oversegmentation. Also, as with all static individuation processes, it is impossible to determine whether an individuated object will move rigidly or whether the object is separate from the background. Finally, static individuation using either sparse laser range-finders or dense stereo depth systems still has the problem of determining which clusters are objects and which are background. One approach to this is to use motion to identify regions that correspond to moving objects, as in appearance-based segmentation.

Motion-based Shape-based Individuation

Motion-based shape individuation uses a fundamentally different approach than motion-based appearance-based individuation because of its lack of distinguishable features. However, it is possible to track moving objects using shape data, without the requirement that they be rigid, using techniques like potential fields [Nanda and Fujimura, 2004] and Kalman filters [Mendes et al., 2004]. These techniques rely on tracking the object as a whole, rather than tracking features as is usually done in image-based tracking. Implicit object motion has also been employed with some success in motion-based shape individuation in the form of occupancy grids [Modayil and Kuipers, 2004].
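The rigid-grouping step described above can be sketched as follows: since pairwise 3D distances are preserved under the transform q = Rp + t, features whose mutual distances stay nearly constant between frames can be clustered into rigid bodies. This is an illustrative simplification (greedy clustering with a fixed tolerance), not the exact procedure of Sivic et al. or of ODMAS.

```python
import numpy as np

def rigid_groups(prev_pts, curr_pts, tol=0.02):
    """Group tracked features into rigidly moving bodies.

    prev_pts, curr_pts: (N, 3) arrays giving each feature's 3D position in
    the previous and current frame. Two features are rigidly compatible if
    the distance between them is (nearly) unchanged, since rigid transforms
    preserve pairwise distances. tol is an assumed tolerance in scene units.
    """
    n = len(prev_pts)
    d_prev = np.linalg.norm(prev_pts[:, None] - prev_pts[None, :], axis=2)
    d_curr = np.linalg.norm(curr_pts[:, None] - curr_pts[None, :], axis=2)
    compatible = np.abs(d_prev - d_curr) < tol

    groups, unassigned = [], set(range(n))
    while unassigned:
        seed = unassigned.pop()
        group = {seed}
        # Greedily add features compatible with every member of the group.
        for j in sorted(unassigned):
            if all(compatible[j, g] for g in group):
                group.add(j)
        unassigned -= group
        groups.append(sorted(group))
    return groups
```

With known camera motion, the group whose motion best matches the camera's expected rigid transform is taken as the background, and each remaining group becomes an object snapshot, as described above.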
2.3.2 Object Tracking

Tracking is an important component of any object discovery system, as it temporally correlates multiple views of an object, allowing them to be integrated into a single model. After individuation, each input frame at time t contains a set of individuated object snapshots O_t, which are themselves sets of input data. The goal of tracking is to determine a mapping between the object snapshots O_t and the snapshots found in the previous x frames that correspond to the same object. If the system allows for oversegmentation in the individuation component, then the mapping between snapshots may be many-to-many. These mappings between snapshots over multiple frames form a chain called an object sequence. Commonly used pair-wise tracking techniques compare pairs of snapshots together. Each snapshot in O_n is matched against each snapshot in O_{n-1} to try and find a mapping. In more advanced tracking techniques, each snapshot in O_n which fails to find a mapping in O_{n-1} is then compared against snapshots in O_{n-2}. This is repeated until O_{n-x} is reached.

Appearance-based Tracking

There is a large body of research on object tracking in images and the results have been impressive, even for extremely small, low-resolution image regions [Hue et al., 2001, Isard and MacCormick, 2001, Koller et al., 1994, Okuma et al., 2004]. Tracking in image-based object discovery systems is relatively simple using image features if the objects are high resolution and there is good lighting [Lowe, 2004]. Objects that are very small or do not contain enough features for tracking are unlikely to contain enough information to create a good model and so are ignored by some appearance-based object discovery systems (such as ODMAS).

A common approach to object tracking in images is to use feature tracking as the comparison function (see Section 2.3.1 for details of feature tracking). If a threshold number of features are successfully matched between two snapshots, then they correspond to the same object. It is important that the features in the tracking system be sufficiently discriminable in order to prevent false positive object matches. Another important property of the features is that they be invariant to affine changes, which can occur very quickly in rotating objects. SIFT features are excellent in both these areas (see Section 2.4.5) [Lowe, 2004].

Shape-based Tracking

Tracking based purely on depth can be a more difficult task than in image-based systems because of the lack of discriminable features. Without discriminable features it is difficult to match parts of individuated objects, and instead the comparison functions rely on matching the snapshot as a whole. However, shape-based tracking approaches have the advantage of being able to determine the approximate 3D location of an object within a scene, which can help in tracking the object. A simple approach employed by Kuipers and Modayil in their Bootstrap Learning for Object Discovery uses the 3D position of the object's center of point mass and the distance to its furthest point as a comparison function [Modayil and Kuipers, 2004]. Other techniques that have been employed in shape-based tracking include potential fields [Nanda and Fujimura, 2004] and Kalman filters [Mendes et al., 2004].

2.3.3 Object Modeling

Object modeling is the process of taking an object sequence generated by individuation and tracking and merging the data in that sequence into a model, a
description of the properties of the object. Object models are the final output of the object discovery system and are the basis of object recognition. In object discovery, modeling is often not a one-time process. A brief object sequence is first converted into a model. Later, additional snapshots are added to that sequence and the model is modified to incorporate the new data. Models should easily accommodate new data and ideally be able to verify old data points using new data points.

Appearance-based Object Modeling

Appearance-based object models are particularly useful for many applications because both the appearance and the shape of an object can be determined from images and then encoded into the model. The exact nature of the model is determined by what information about the object needs to be stored.

Discriminative appearance-based models can employ the "bag of features" model, which is simple and effective for object recognition. In this model, all image feature descriptors from the object sequence are extracted and stored as the model. This approach, which ignores each feature's position, is effective for object recognition if the image features are sufficiently discriminable [D. Walther and Perona, 2005] and enough of them are in the model and target image. To prevent the number of feature descriptors in the model from constantly growing, each new descriptor that is added to the feature list is compared against all previous descriptors and only added if there is no match. Unfortunately, this approach lacks any data on the overall structure of the object, data that might be important to other systems making use of the models.

Generative appearance models require more information than simply positionless image features; they also need to describe the object's shape. This requires that the features' 3D positions have been determined through structure from motion or some other technique. In that case, an improvement on the "bag of features" model is to store all the features' descriptors from the sequence as well as their 3D positions, effectively turning the appearance model into an appearance and shape hybrid model. Each new feature added to the model must first have its 3D position registered. Registration is the process of taking points in 3D space and transforming them into a different frame of reference [Trucco and Verri, 1998]. In this case, the positions of the features in the new snapshot must be registered so that they are in the same frame of reference as the features in the model. In the next section we will discuss how registration between groups of 3D points can be achieved. If the registration of the points is found, then a 3D model of the image features in the object snapshot sequence can be created.

If a more accurate model of the object's appearance and shape is required, then it is possible to produce a 3D model with the surface appearance supplied as a surface texture [Yu et al., 2001]. This requires first that 3D points on the object surface be found, as in the example above. Then the texture patches for each triangle on the object's surface can be calculated from multiple viewpoints.

Shape-based Object Modeling

Depth models derived from a depth camera can produce a more complete model than planar laser range-finders, since the individuated snapshots contain dense 3D shape data. However, the approach is similar for data from either device.
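A minimal sketch of the "bag of features" bookkeeping described above: new descriptors are only appended when no stored descriptor matches within a distance threshold, so the model does not grow without bound. The Euclidean metric, the threshold value and the class interface are illustrative assumptions; a real system would use the distance measure appropriate to its descriptors (e.g., SIFT).

```python
import numpy as np

class BagOfFeaturesModel:
    """Positionless appearance model: a growing list of feature descriptors."""

    def __init__(self, match_threshold=0.4):
        self.descriptors = []            # list of 1-D numpy arrays
        self.match_threshold = match_threshold

    def _matches(self, desc):
        # A candidate matches if any stored descriptor is within threshold.
        return any(np.linalg.norm(desc - d) < self.match_threshold
                   for d in self.descriptors)

    def add(self, desc):
        # Only keep descriptors that add new information to the model.
        if not self._matches(desc):
            self.descriptors.append(np.asarray(desc, dtype=float))

    def recognition_score(self, snapshot_descs):
        """Count how many snapshot descriptors match the model; recognition
        declares a hit when this count exceeds some threshold."""
        return sum(self._matches(d) for d in snapshot_descs)
```

The hybrid generative variant discussed above would additionally store each descriptor's registered 3D position alongside it.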
The creation of pure shape-based models is similar to the approach described above for producing shape and appearance hybrid models. As in the other object modeling problems, the goal here is to correlate the multiple views of the object into a single model. First, the surface measurements from the depth maps must be registered, or transformed so that they are in a common frame of reference. Algorithms like Iterative Closest Point (ICP) can find a rigid transform between two groups of 3D points, allowing depth maps to be merged together [Besl and McKay, 1992]. After registration, the combined surface representations undergo geometric fusion, where they are combined into a 3D model. The two best known techniques for this are mesh integration and volumetric fusion. In mesh integration [Rutishauser et al., 1994], the multiple registered surfaces are transformed into 3D meshes and then combined. In volumetric fusion, an implicit surface is found using a volumetric field function [Curless and Levoy, 1996, Hilton et al., 1996].

2.3.4 Recognition

Recognition is the process of matching an object model against object data from some other source [Trucco and Verri, 1998]. In the context of object discovery, the other source of data is usually either raw input data, individuated snapshots, or other models. In model-to-model and model-to-snapshot recognition, the task is one of identification: to determine whether the two correspond to the same object. In model-to-input recognition, the task is both identification and localization: to determine the object's presence and its location in the input.

Image-based Recognition

A common approach to appearance-based object recognition, which closely resembles the appearance-based tracking techniques described previously, relies on image feature matching [Lowe, 2004]. The processes for matching between a database object model and an input, snapshot or other model are very similar. First, let us assume that the object model stores the image features of the object or that these features can be easily extracted. Image features in a new model or snapshot are compared against the image features of each database model, and if more than a threshold number match, then the two are the same. If the comparison is between an input image and a database object model, then the input image features are matched against the features of each model, and if a threshold number are matched, then the input contains the model. To find the object's location, it is possible to use a bounding box that encompasses all the image features in the input which matched the model database object features [Lowe, 2004]. The simplicity of these approaches is a testimony to the effectiveness of highly discriminative image features for matching and recognition.

Shape-based Recognition

Shape-based recognition, like shape-based tracking, usually relies on matching based on the object as a whole, rather than on discriminative features. Fortunately, the number of points on an object's surface is often larger in a model than in a snapshot, allowing for more robust comparison in recognition than in tracking. A simple approach for shape recognition is to take the object model and translate it in the comparison space to minimize the difference in shape between the two surfaces. If the difference is below the threshold, then the two are a match.
This is simple to achieve when there is no localization to perform, such as when comparing a model against a model or snapshot [Modayil and Kuipers, 2004]. However, it is computationally much more expensive to search an entire input space to recognize an object by its shape, especially in the presence of a large database of object models [Grimson, 1986]. Another fundamental problem with depth-based recognition is that human environments are designed such that objects are recognizable by their appearance [D. Walther and Perona, 2005]. Distinctively textured and colored objects, which are common in human environments, are best recognized using appearance-based systems [Carmichael and Hebert, 2004].

2.4 Existing Object Discovery Systems

This section will present three recently developed object discovery systems, two of which are image-based and one shape-based. These systems will give the reader an idea of how other object discovery techniques are performed.

2.4.1 Bootstrap Learning for Object Discovery

Modayil and Kuipers' work on Bootstrap Learning for Object Discovery [Modayil and Kuipers, 2004], which uses a shape-based approach, is the original paper that inspired the ODMAS system. The key concept on which their system is based is that object models can be learned using simple heuristics if they are applied in a bootstrap fashion, where each simple algorithm feeds into the next. This approach is reflected in the ODMAS system.

Modayil and Kuipers' work follows a bootstrap approach where the results of each step feed into the next. The reason they give for this is that bootstrap processes make learning in a high-dimensional representation space, like the one acquired over time from a laser range-finder, much easier. Bootstrap learning applies successive, high-bias, unsupervised techniques, each of which can be very simple but which are cumulatively significant.

The robot used in their experiments was equipped with a SICK PLS laser range-finder that provided a series of range points on a 120 degree plane in front of the robot. This provided shape maps with high accuracy in the location of the 3D points, but only on a single plane around a foot off the ground. As their system showed, the problems inherent to object discovery when acquiring data on only a single plane are serious. Their input data was passed through a series of steps: individuation, tracking, modeling and categorization. The step that they called categorization is the same as the recognition step described previously, since it compares new object models to old ones to determine whether they correspond to the same object. The other steps are the same as their counterparts presented earlier. Since their system demonstrates each of the four stages of object discovery very clearly, each will be examined.

Individuation

The first step of their system aims to group together the range points from their laser range-finder into hypothetical object snapshots. The problems in this step are determining which points are on objects and which are background, and separating the objects from each other. To do this they first use static shape-based individuation, employing a measure of range point closeness to group together features. All "close" 3D data points are clustered spatially to produce a snapshot. Two points are considered close if their distance is less than a
threshold value δ:

close(x_i, x_j) = ||x_i - x_j|| < δ

This approach is fairly successful, though objects close to each other can be combined into one, and objects that intersect the viewing plane at multiple points (like table legs or people) can be divided into separate objects. Next, to determine which of these snapshots correspond to objects and which to background, they use an occupancy grid. An occupancy grid is a shape map of an environment that shows which areas are filled with parts of the background, like walls and non-moving furniture, and which areas are empty. In this case their occupancy grid was planar because of their laser range-finder. Each individuated group of "close" features is matched against the occupancy grid, and if it is not in a region that is part of the background, it is an object. Their occupancy grid was acquired through the laborious process of removing all objects from the environment and then having the robot map the room's shape with its laser range-finder.

Tracking

The tracking system they employ is based on the position and extent of the object that needs to be tracked. Each object snapshot has a location p that is the center of its 3D point mass and an extent r, which is the distance from the center to the furthest point. The comparison function between snapshots is

d_s(s, t) = ||p_s - p_t|| + |r_s - r_t|

Modayil and Kuipers claim that this function is robust to random noise and allows them to track moving objects that vary in shape. They employ a variant of pair-wise matching they call unique clear successor, which is based on matching a new snapshot with those from previous frames. If a unique clear successor cannot be found within a two-frame window, then the tracker terminates. Every time a large, new snapshot appears, a new tracker is created. Small snapshots are treated as noise and ignored. It is difficult to determine how successful their tracker was, since their final system only modeled static objects.

Modeling

While their ultimate goal was to model amorphous and moving objects, they admit that this is too difficult at their current stage of work and so they limited themselves to static, rigid objects. This is a pity, as their system up to this point tried hard to accommodate moving and non-rigid objects, and that seems somewhat of a wasted effort.

Their approach creates a shape model by accumulating distinctive snapshots while they are certain that their target is non-moving. Distinctiveness is determined using a non-symmetric dissimilarity function that compares the snapshots. They only combine snapshots that are sufficiently distinct from each other. This distinctiveness measure allows them to reduce the number of snapshots that need to be integrated into the model while still preserving the shape of the object. It also allows them to prune out inaccurate snapshots, since any with a very high distinctiveness value are discarded, it being assumed that the target has either moved or been occluded.

Acquisition is considered complete when the robot completes a full 360 degree circuit of the target. The snapshots are then registered, translated and fitted together to form a model. It is interesting to note that the model acquisition fixates on a single object while it is constructing the model, and if it cannot complete a full rotation, it discards the attempt. Presumably this is because without a completed model it is too difficult to recognize an object.
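The spatial clustering step at the start of their pipeline, built on the close() predicate above, can be sketched as a simple union-find pass over the scan points. The threshold value and the union-find formulation are illustrative choices, not necessarily Modayil and Kuipers' exact implementation.

```python
import numpy as np

def cluster_scan(points, delta=0.15):
    """Greedy spatial clustering of range points into snapshot candidates.

    points: (N, 2) or (N, 3) array of scan points; delta is the closeness
    threshold (assumed here to be in metres). Points connected by a chain
    of close() relations end up in the same cluster.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):                         # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) < delta:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

Each resulting cluster is then checked against the occupancy grid; clusters falling outside known background regions become object snapshots, as described above.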
Modeling

While their ultimate goal was to model amorphous and moving objects, they admit that this is too difficult at their current stage of work, and so they limited themselves to static, rigid objects. This is a pity, as their system up to this point tried hard to accommodate moving and non-rigid objects, and that effort seems somewhat wasted.

Their approach creates a shape model by accumulating distinctive snapshots while they are certain that their target is non-moving. Distinctiveness is determined using a non-symmetric dissimilarity function that compares the snapshots. They only combine snapshots that are sufficiently distinct from each other. This distinctiveness measure allows them to reduce the number of snapshots that need to be integrated into the model while still preserving the shape of the object. It also allows them to prune out inaccurate snapshots, since any with a very high distinctiveness value are discarded on the assumption that the target has either moved or been occluded.

Acquisition is considered complete when the robot completes a full 360 degree circuit of the target. The snapshots are then registered, translated and fitted together to form a model. It is interesting to note that the model acquisition fixates on a single object while it is constructing the model and, if it cannot complete a full rotation, it discards the attempt. Presumably this is because without a completed model it is too difficult to recognize an object.

Categorization

The step they called categorization, described previously as recognition, matches new models against the database of known models to prevent duplicate models from being created. This is done after the complete shape model has been constructed and uses an online clustering algorithm. They use a simple distance dissimilarity function d(X, Y), built from a symmetric distance measure $d_0(x, y)$ between two 3D points [Modayil and Kuipers, 2004], to compare the new object against all previously identified groups. If the target is within a threshold distance C of a single known model, then it is identified as belonging to that category. If it is within C of multiple objects, it is discarded as uncertain. If the distance to all known objects is greater than C, it is added as a new model.

They do not use recognized models to improve on their previously created models, something that could easily be added by simply aligning and merging the two shape models. In this way they could gradually detect and remove outliers by finding points which are distant from all other points in other models.

Overall Analysis

It is clear from the work in this paper that a sparse planar laser range-finder is a difficult tool to use for performing object discovery. This can be seen in the limitations they imposed on themselves: requiring an occupancy grid, only tracking static targets, and needing a complete 360 degree circumnavigation of their target for modeling and categorization. The lack of distinguishable features and the 2D viewing plane seem to be the biggest problems with the device. However, much of their work was extremely useful both in the ODMAS system and in the creation of this thesis. Their division of the tasks of object discovery into the four groups was very important, since it provided insight into the similarities between all such systems. Also, their use of bootstrap learning and simple, high bias heuristics provided a basis for the ODMAS system.

2.4.2 Is Bottom-Up Attention Useful for Object Recognition?

Rutishauser et al. have worked on the problem of learning object representations of multiple objects from unlabeled images in Is Bottom-Up Attention Useful for Object Recognition? [Rutishauser et al., 2004]. Unlike some object discovery systems, they do not check that the objects are separate from the background or that they move rigidly. Instead, they make the assumption that visually distinctive image elements are found only on objects. Their approach uses selective visual attention based on image saliency to detect regions likely to contain objects. Some elements of their system resemble ODMAS, namely the use of static segmentation and SIFT features for object recognition. However, their approach is based on a series of non-sequential, single camera images.

Saliency-based Region/Feature Selection for Individuation

The object discovery system proposed in this work by Rutishauser et al. does not attempt any form of object tracking, and their modeling is minimal. The approach is almost purely focused on the task of individuation. To do this, their system identifies salient regions in the image (i.e., regions with high contrast in color, texture or brightness). The theory behind this is that in human environments we naturally produce objects that can easily be seen and discriminated from the background to aid us in our everyday tasks.
Whether this is true, or whether it is simply a product of the large number of materials used by manufacturers, which is in turn influenced by the large number of tasks the objects must perform, is questionable. However, their results do indicate that this technique can be used to detect regions that are likely to contain objects.

Their saliency model is based on that of Itti et al. [Rutishauser et al., 2004], which is in turn based on that of Koch and Ullman [Koch and Ullman, 1985]. Unlike the original system, which simply returned the coordinates of the point of highest saliency in the image, their system instead identifies groups of features with a common saliency cause, in effect segmenting the features into objects.

Briefly, their saliency model employs a sub-sampled Gaussian pyramid where each level is divided into red, green, blue, yellow, intensity and orientation channels. Normalized, center-surround "conspicuity" maps are then calculated for each of the respective channels. These maps are then normalized in relation to each other and summed into a single saliency map. A winner-take-all network of integrate-and-fire neurons is then used to determine the location of highest saliency. Walther and Rutishauser expand on this to include the concept of the extent of saliency within the image by finding the map that contributes the most energy and segmenting that map using region growing and adaptive thresholding. This allows them to attend to multiple regions in order of decreasing saliency. Tsotsos et al. [Tsotsos et al., 1995] also designed a mechanism that allows them to trace back activations through the winner-take-all network to locate regions of common high saliency. This allows them to group together features with high saliency that have a common root (e.g., all features from something bright red and heavily textured). The goal of this is to identify the features belonging to a single object. The theory behind this is that the disparity, as described by the saliency map, between objects and background and between objects is higher than the disparity within an object. Groups of features with a common saliency cause, therefore, are segmented and considered as belonging to a single object.

The theory that it is possible to segment objects by locating image disparities (and depth disparities) is key to the individuation process used in the ODMAS system. However, approaches like these suffer from problems of both over and undersegmentation. The results from this paper show that the segmentation around objects is inaccurate, missing large portions of segmented objects and including pixels from the background and from other objects. That criticism aside, their results show that the saliency metric can be used to identify the approximate location of objects in images and as a basis for approximate segmentation.

2.4.3 Object Level Grouping for Video Shots

The goal of Object Level Grouping for Video Shots [Sivic et al., 2004] by Sivic, Schaffalitzky, and Zisserman is the construction of object models from the frames of a Hollywood movie, in their case "Groundhog Day". To do this, objects are individuated from the background by tracking and finding rigidly moving bodies. This is done using structure from motion performed on image features tracked over a large number of frames (20-80).
A movie is a more difficult input source than a camera-equipped robot because there is no knowledge of camera motion. Also, since they are using a single camera, they are unable to use stereo vision techniques to quickly determine scene depth. However, movies do present certain advantages over the robot camera, because the camera operator does a lot of the work: they tend to focus on a target object or group of objects, keeping the majority of them in focus, up close and near the center of the image, even when moving. The lighting quality is almost always excellent and the resolution is very high. The background is typically more out of focus and often motion blurred, decreasing the number of background image features that need to be identified and removed.

They use a different set of affine invariant features than the ODMAS system: interest point neighborhoods [Mikolajczyk and Schmid, 2002] and maximally stable extremal regions [Matas et al., 2002]. The first looks for regions with high variation in multiple directions (like corners), the other for regions of extended high contrast. Their justification for using two feature types is that it provides them with better coverage, as some regions may not evoke one of the feature types. Our hope is that SIFT features will provide sufficient coverage for the ODMAS system. These features are then matched pair-wise between consecutive frames.

Their tracking system is much more complex than in other object discovery systems (such as ODMAS), since they need very long feature tracks to achieve robust structure from motion. They employ a pair-wise tracking technique with various additional algorithms to perform short range and long range track repair.

Their short range track repair is particularly interesting, as it highlights the difficulties in performing structure from motion. It is inevitable that most features cannot be tracked across a large number of frames because of problems like motion blur, temporary feature changes and occlusions. Also, having a sufficiently wide baseline is important if a feature's position is going to be accurately localized. To prevent degeneracy in their structure from motion process, they need to extend each track for as long as possible. This is done by first performing short range track repair, which works by propagating unmatched features forward. Any feature track that terminates after being matched for n previous frames is propagated into the next frame by estimating the translational motion from the previous n matches. If there is a region at the propagated point that approximately corresponds to the tracked feature, then the feature is propagated forwards. They use a refinement technique from Ferrari et al. [Ferrari et al., 2003] to fit the region in the new frame. Their results show that this technique can clearly improve the average track length.

Their long range track repair uses feature-based object recognition, as described in Section 2.3.4, to detect when duplicate models are created for the same object during a scene. This often happens because of cuts between shots. They use image feature matching between models as the basis of recognition and then combine the models together.

This work is an excellent example of how effective structure from motion can be for object individuation.
However, it also clearly highlights the problems associated with that approach, namely the work that needs to be done in long term feature tracking and the number of frames that need to be processed before 3D feature positions can be determined. The ODMAS system manages to avoid many of these problems by using a stereo camera, which provides a mechanism for determining depth in each frame.

2.4.4 Significant Non-Object Discovery Research

There are two other research topics that should be discussed in this chapter, since they play significant roles in the ODMAS system. The first is the Scale Invariant Feature Transform (SIFT), which produces the image features that the system uses. The second is Normalized Cuts, which is a significant component of the static individuation system used in ODMAS.

2.4.5 Scale Invariant Feature Transform

David Lowe's Scale Invariant Feature Transform (SIFT) [Lowe, 2004] is widely regarded as producing some of the best image features currently available for image matching. The features are designed to be invariant to scale and orientation changes and semi-invariant to lighting and viewing angle (affine) changes. The features are highly distinctive, with a simple, accurate and fast to compute dissimilarity function. This allows a feature to be reliably and correctly matched against an extremely large number of candidates. Fairly low-resolution images (500x500) will contain approximately 2000 image features (depending on lighting, scene contents, etc.), and these tend to be evenly dispersed over the whole image and at multiple scales, except in untextured regions.

The SIFT feature transform has four basic steps, each optimized for speed and efficiency. First, the system looks for candidate locations by finding scale-space extrema. The goal is to find locations and scales that can be reliably located under different viewpoints. Locations that are scale-invariant are found by searching for image features that are stable across scales using scale-space, which is a continuous function over scale. The scale-space of an image is found by convolving a multi-scale Gaussian $G(x, y, \sigma)$ with the input image $I(x, y)$:

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$$

where $*$ denotes convolution in x and y, and

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/2\sigma^2}$$

Stable keypoints are then found by computing the difference of two scales separated by a fixed multiplicative factor k,

$$D(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma) \quad (2.1)$$

Local extrema are then detected by comparing each candidate location in $D(x, y, \sigma)$ to its eight immediate neighbors and its eighteen neighbors in the scales above and below, and selecting those that are always greater or lesser. These candidate locations are then fitted for scale, location, and ratio of principal curvatures. This allows poor candidates to be rejected, namely those with low contrast or lying upon an edge and therefore poorly localized. The keypoints are more accurately localized by fitting a 3D quadratic function to the data around the candidate region to find the interpolated location of the maximum. A simple threshold on minimum contrast is then applied to remove low contrast points. Edge responses are then detected by finding the principal curvatures using a Hessian matrix and finding responses that are strong across the edge but weak perpendicular to it. Features at this point are still not invariant to changes in orientation.
To deal with this, they need to find a consistent orientation for each keypoint that will allow them to ignore image rotation. First they use a Gaussian smoothed image of the point, $L(x, y)$, at the located scale, so that the histogram is scale invariant. They then apply an orientation histogram with 36 buckets covering the 360 degrees of gradient orientation, and each sample from $L(x, y)$ is added to the appropriate bucket. Each sample is weighted according to its gradient magnitude and a Gaussian circular window around the point. They then find the highest peak in the orientation histogram, as well as any other peak within 80% of it, and these are used as the orientation basis for constructing a keypoint.

They now have the location, scale and orientation of all the keypoints, but they still need a local image descriptor. This is necessary because the other values are not distinctive enough to provide effective feature matching. The local image descriptor needs to be as invariant as possible to lighting and affine changes. To achieve this, they use an approach based on results from biological vision. Image gradients and magnitudes are sampled from around the keypoint, with the orientations taken relative to the overall keypoint orientation to achieve orientation invariance. The samples' magnitudes are weighted using a Gaussian function with a $\sigma$ equal to half the width of the descriptor window. They then create a 4x4 array of histograms, each with 8 orientation bins, based on the sampled points, producing a 128 (4x4x8) element feature descriptor.

Feature matching between images A and B is done relatively simply by finding the Euclidean distance between the image descriptor of each feature in A and each feature in B. For each feature in A, if the distance to the best match in B is less than a threshold percentage of the distance to the next best match, the best match is considered correct. Otherwise the feature is considered unmatched.
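This nearest-neighbor ratio test is easy to state in code. The sketch below operates on arrays of descriptors; the 0.8 ratio threshold is an illustrative value, not one taken from this thesis, and at least two candidate features in B are assumed.

```python
import numpy as np

def match_features(desc_a, desc_b, ratio=0.8):
    """Nearest-neighbor ratio matching between two sets of SIFT descriptors.

    desc_a: (m, 128) array, desc_b: (n, 128) array with n >= 2.
    Returns a list of (i, j) pairs where feature i in A matches feature j in B.
    The 0.8 ratio threshold is an illustrative choice.
    """
    matches = []
    for i, d in enumerate(desc_a):
        # Euclidean distance from descriptor d to every descriptor in B.
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        best, second = order[0], order[1]
        # Accept only if the best match is clearly better than the runner-up.
        if dists[best] < ratio * dists[second]:
            matches.append((i, int(best)))
    return matches
```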
2.4.6 Normalized Cuts for Image Segmentation

Normalized Cuts are the basis of one of the best image segmentation algorithms currently known. The original system, proposed by Shi and Malik in Normalized Cuts and Image Segmentation [Shi and Malik, 2000], segments images using their intensity values. Their system is computationally efficient, produces excellent results and is sufficiently general that changes like the addition of depth information to the pixel dissimilarity function are easy to implement.

The concept of normalized cuts has application outside image segmentation. It is an approach for partitioning any graph into disjoint sets, based on a comparison function d(x, y) that measures the affinity (similarity) between any two nodes x and y in the graph. The graph is partitioned by cuts that separate regions of the graph and break the bonds between the pixels in the two regions. A cut between regions A and B has an associated broken bond energy value,

$$\mathrm{cut}(A, B) = \sum_{x \in A, \, y \in B} d(x, y)$$

Cuts with minimal energy occur between regions that are highly dissimilar, and so minimizing cut energy can be used as the basis for image segmentation. However, simply finding the minimal energy cut does not produce good results, because it favors cuts that isolate small regions: cuts where one region is very small and the other very large almost always have a lower energy than cuts where both sides are of comparable size.

To deal with this, Shi and Malik proposed normalized cuts, where the cut energy is normalized by the size of the cut regions,

$$NCut(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{cut}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{cut}(B, V)}$$

where A and B are regions and V is the whole graph. With this function, small cuts no longer reliably have a smaller energy than large cuts. Determining the exact minimum normalized cut for a graph has been proven to be NP-complete. However, they provide a computationally efficient technique for finding an approximate discrete solution for a specific number of cuts.

Chapter 3

A Combined Shape and Appearance-based Approach to Object Discovery

3.1 Object Discovery using both Appearance and Shape

Chapter 2 examined shape-based and appearance-based approaches to object discovery, describing them in terms of the four steps involved: individuation, tracking, modeling and recognition. Both approaches had advantages and disadvantages relative to each other, and neither was clearly superior. All object discovery techniques examined in Section 2.4 used either shape maps or images as their input data, but none of these approaches attempted to combine them. The ODMAS system was designed to take advantage of both types of input. The goal of this chapter is to discuss the object discovery techniques we use in the ODMAS system and how we employ a combination of appearance and shape data.

All of the object discovery techniques presented in this chapter are passive because, without more work on active exploration protocols, active object discovery is simply too risky to be performed in most environments. Both static and motion-based object discovery approaches will be described, as well as a combination of the two. Since the techniques employed require both image and dense depth information, we use a stereo camera as the input source.

3.2 Combined Shape/Appearance Individuation

As was previously mentioned in Section 2.3.1, individuation is a technique for finding a mapping between input data points and objects in the scene or the background. Individuation using a fusion of shape and appearance can be applied to both static and motion-based individuation techniques.

In the ODMAS system, individuation is done in three stages. First, we use static depth and intensity individuation to segment the image provided by the stereo camera. Since it is rarely possible for a static segmentation algorithm to segment an image and maintain a one-to-one relationship between segmented regions (also called object regions) and objects, we adjust the parameters of the static individuation component to oversegment. This leads to a many-to-one relationship between segmented regions and objects in the scene. Next, we use Rigid Body Motion Individuation to find groups of image features that move rigidly in the scene. These rigidly moving feature groups, which correspond to rigid moving objects in the scene, are then used as the basis for Object Region Voting, which combines oversegmented object regions from the static individuation component that correspond to the same object.

3.2.1 Static Depth and Intensity Individuation

Many appearance-based [D. Walther and Perona, 2005, Shi and Malik, 2000] and depth or shape-based [Modayil and Kuipers, 2004, Yu et al., 2001] static object individuation approaches rely on clustering and segmentation techniques to group together data points that correspond to objects.
For any such system, there are situations where object individuation is impossible because of the layout of the scene or the appearance of the objects. For example, segmentation techniques based purely on appearance fail when objects have a similar appearance to their background (see Figure 3.1). Depth-based approaches can fail when the depths of points on two objects are insufficiently distinct from each other (see Figure 3.2). Notice that the two techniques fail under different circumstances and on different parts of the scene.

Figure 3.1: (a) is an image of a scene containing two objects on a desk. (b) is a segmentation of that image using static object individuation based only on pixel intensity. Note that the similarity in color between the white box and the background has caused both to appear in the same region.

Figure 3.2: (a) is a depth map of a scene showing two boxes on a desk. All depth maps we use have a valid surface constraint applied: a valid surface must contain a minimum of 100 pixels with a fixed maximum disparity between them. (b) is a segmentation of that depth map using static object individuation based on the depth of the points. Note that the two objects are grouped together in one region, since they are at approximately the same depth.

A combined static individuation technique that responds to cues from both depth and appearance might be able to avoid both of these problems. Since oversegmentation is preferable to undersegmentation (see Section 2.3.1), a simple technique for combining image and shape-based segmentation is to overlay the two segmentations on top of each other. Groups of pixels where there is disagreement between the two segmentations would become new regions. Such an approach undersegments only if neither technique had sufficient data to segment a region, but it will usually lead to more oversegmentation than either of the originals.

A better solution, one that causes less oversegmentation, is to incorporate cues from both shape and image into the functions used to determine the initial segmentation. This has two major advantages over the simple overlaying approach. Firstly, it can decrease unnecessary oversegmentation, because it does not create a new segment for every region where there is disagreement. Secondly, by combining the responses from two different properties, this approach can segment along boundaries where neither property is sufficiently confident to segment on its own. Figure 3.3 shows an example of a segmentation that incorporated data from the image in Figure 3.1 and the depth map in Figure 3.2. These two types of scene data were combined using normalized cuts, which is the approach we use in the ODMAS system (see Section 4.2.2 for details).

Figure 3.3: A segmentation using the image from Figure 3.1 and the depth map from Figure 3.2. These two data sources are combined using the normalized cuts based approach described in Section 4.2.2.

An important factor for such an approach is that there is a mapping between the shape and appearance data.
By describing how the data points from each input relate to each other, this mapping makes it possible to generate an initial segmentation that uses both sources of data. For example, if the shape and appearance data are acquired by a stereo camera, then there is a mapping between image pixels and the (x, y, z) coordinates of the shape map, because the image was used as the basis for finding the stereo depth map. We employ such a combined response technique, incorporating both depth and intensity data from a stereo camera, to perform static individuation in the ODMAS system (see Section 4.2.2).

3.2.2 Rigid Body Motion Individuation

The next step in our individuation approach is to find rigidly moving groups of points in the scene. Given our definition of object, motion is a necessary component of any reliable individuation technique, because without it we have no mechanism for determining whether an object will move rigidly (see Section 2.2). Using feature tracking to track points in the image and determine which are moving is a well understood technique. The main difficulty we found in motion-based individuation using image feature tracking is in segmenting the moving features in the presence of multiple moving objects and a moving camera. We perform the moving object segmentation by finding groups of features that move rigidly in 3D space, and we compensate for a moving background by determining camera motion through robot odometry. This difficulty in determining a segmentation between multiple moving objects in the scene is the main reason we restrict the system to finding only rigid objects.

In other approaches to finding rigidly moving objects from image data, image features are extracted and tracked, and then structure from motion is applied to find their 3D positions. These approaches then search for rigidly moving bodies and treat those as objects (see Section 2.3.1) [Sivic et al., 2004]. In the ODMAS system we use a stereo depth camera, which provides the 3D location of every pixel in the image, making structure from motion unnecessary. Each point in the shape map D(x, y) gives an (x, y, z) location in the scene that corresponds to pixel I(x, y) in image I. Many image features have a central origin point with either a pixel or sub-pixel location in their original image. If the location of a feature in image I is $I(x_1, y_1)$, then the (x, y, z) location of that point in the scene is given by $D(x_1, y_1)$. If the location of the feature is at the sub-pixel level, then interpolation between the four surrounding depth points can be used to approximate its depth. These techniques allow the positions of image features to be determined from a single frame of input, unlike conventional monocular structure from motion. We call this process of using a depth map to determine feature position stereo feature position reconstruction.
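A minimal sketch of stereo feature position reconstruction is given below, assuming the shape map is stored as an (H, W, 3) array of (x, y, z) scene coordinates; the bilinear interpolation for sub-pixel feature locations follows the description above, and all names are illustrative. In practice, pixels with invalid depth would also need to be handled.

```python
import numpy as np

def feature_scene_position(shape_map, u, v):
    """Look up the 3D scene position of an image feature at (u, v).

    shape_map: (H, W, 3) array giving the (x, y, z) scene location of each pixel.
    u, v: sub-pixel column and row of the feature in the image.
    Returns a length-3 array, bilinearly interpolated from the four
    surrounding depth points when (u, v) is not on the integer grid.
    """
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1 = min(u0 + 1, shape_map.shape[1] - 1)
    v1 = min(v0 + 1, shape_map.shape[0] - 1)
    fu, fv = u - u0, v - v0
    # Bilinear blend of the four neighboring (x, y, z) samples.
    top = (1 - fu) * shape_map[v0, u0] + fu * shape_map[v0, u1]
    bottom = (1 - fu) * shape_map[v1, u0] + fu * shape_map[v1, u1]
    return (1 - fv) * top + fv * bottom
```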
Given the (x, y, z) position of each image feature in the current and previous frame, the problem then becomes one of finding groups of features that move rigidly in relation to each other. First, we match features between the current and previous images. The 3D positions of these features in the current and previous time steps provide information on how the features in the scene have moved between the last frame and the current one.

Next, it is necessary to remove static features, as these belong to the background. If the camera is static, this is simply a matter of finding features that have not moved between this frame and the last. If the camera is moving and the motion of the camera is known, then we can find the expected rigid transform between non-moving features in the current frame and the previous frame. Any features which match this rigid transform are actually static features and correspond to the background. These features contain valuable information, since they have a known correspondence to the background, and they are used later in Object Region Voting to help deal with undersegmentation.

What remains are all the features that have moved between the current frame and the previous one. Our goal here is to divide these features into groups where the motion of all the features in a group follows a single rigid transform (i.e., they belong to a rigid object). The technique we employ for finding rigidly moving groups of features relies on the property that the distance between any two points on a rigid object remains constant under any transformation. To find rigidly moving bodies, we compare the distances between the (x, y, z) positions of pairs of features in the current and previous frames, and find groups of features for which these pairwise distances remain constant, as sketched below.

These rigidly moving features are used as the basis for the rest of the individuation technique, because there is high confidence that they all correspond to the same moving object. We refer to these rigidly moving features as core features, since they form the basis for the other segmentation techniques.
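The sketch below illustrates the pairwise distance-preservation test in its simplest form: growing a group around a seed feature by admitting features whose distance to the seed is unchanged between frames. The RANSAC-based version actually used is described in Section 4.2.3; the tolerance and data layout here are illustrative.

```python
import numpy as np

def rigid_group(seed, others, prev_pos, curr_pos, tol=0.02):
    """Grow a rigidly-moving group around a seed feature index.

    prev_pos, curr_pos: dicts mapping feature index -> (x, y, z) position
    (numpy arrays) in the previous and current frames. A feature joins the
    group when its distance to the seed is (nearly) the same in both frames,
    which is the defining property of points on a single rigid body.
    """
    group = [seed]
    for f in others:
        d_prev = np.linalg.norm(prev_pos[f] - prev_pos[seed])
        d_curr = np.linalg.norm(curr_pos[f] - curr_pos[seed])
        if abs(d_curr - d_prev) < tol:   # distance preserved => same rigid body
            group.append(f)
    return group
```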
3.3 Object Region Voting

Motion-based individuation can be used to individuate objects from the background by finding groups of features that move rigidly together. However, many image features on an object cannot be tracked between pairs of frames because of occlusion or affine changes, and so they cannot be included as core features. Other image features are discarded because of the noisy nature of stereo depth camera results, since the 3D positions of many features cannot be determined with high enough accuracy for them to correspond to rigid body motion. Our observations of motion-based individuation in action have shown that as few as 10% of the features on an object can be tracked between frames and used to find rigid body motion.

Additionally, a major drawback of motion-based approaches is that they require objects to move in order to be individuated. The world contains many more static objects than moving objects, so it would be useful if we could perform individuation on objects that are not moving. However, if we are to maintain our constraint that the objects we find and model must move rigidly, object motion is necessary as an initial cue to begin individuation. What we need is a technique that allows objects which have been observed to move rigidly to continue to be individuated after their movement has ceased. The static individuation technique given above oversegments the image too much to be used by itself as a basis for object individuation. Both the problem of finding additional features after rigid body motion and that of reliably applying static individuation to individuate static objects can be solved using a technique we call Object Region Voting.

3.4 Object Region Voting for Combined Motion-based and Static Individuation

A problem with static individuation is that it is almost impossible to individuate a scene and create a one-to-one relationship between segmented object regions and objects. However, the parameters can often be adjusted such that the scene will consistently be either undersegmented or oversegmented. Object region voting relies on using static individuation with parameters set to oversegment the image of the scene into object regions. Then, groups of rigidly moving features found through rigid body motion are used to determine which regions belong to the same object and should be aggregated to compensate for the oversegmentation.

We use a voting scheme as the basis for recombining the oversegmented regions. In each object region we count the number of core features from each core feature group with a 2D projection into that region. The core feature groups used are all those found using rigid body motion, including the group that corresponds to the background. If one group of core features has at least n features in a region and k times more features within that region than the core feature group with the second highest count, then that region is associated with the first core feature group. Otherwise, the region is classified as uncertain and is associated with the background.

The n feature threshold is there to prevent problems with outlier features. Outlier features are features belonging to an object that, when projected onto the 2D image plane, are not in a region corresponding to that object. These usually occur either around the edges of objects, where features are incorrectly projected into neighboring regions, or because features were incorrectly found to be moving rigidly with a moving body because of depth map noise. Since there are rarely more than a couple of such features associated with the wrong region, the n threshold prevents that region from being associated with the wrong group. The k threshold is in place to deal with situations where there is undersegmentation in the individuation. It is difficult to set the static individuation parameters in such a way that we can guarantee that there will be no undersegmentation and yet still have regions large enough that enough points will be projected into each region for the voting system to work. In the situation where two moving rigid objects are in the same region, there will be many core features from two groups in that region and, to prevent the creation of a model which combines information from the two objects, that region is not associated with either group. Even if the undersegmentation groups together the background and an object, this can still be detected by allowing the rigid features belonging to the background to vote in each region, associating it with the background.

Once all the regions have been associated with either objects or the background, object snapshots are made. We create an object snapshot for each core feature group that was associated with at least one image region. The snapshot contains all the image features with a 2D projection that places them inside one of the regions associated with that group. These features are broken into two categories: the core features, which are the rigidly moving features that were the basis of the vote, and adjunct object features, which are all the non-core features within an associated region.
These adjunct object features were not detected as core object features either because they were not tracked during the rigid motion phase, or because their 3D positions were insufficiently accurately determined by the stereo camera for them to correspond to rigid motion. The adjunct object features are less certain to belong to an object than the core object features, because they failed to meet the motion and rigidity constraints, but they are still useful for matching because they are typically more numerous and can be detected on regions of the object that have not been seen before.

3.4.1 Object Region Voting for Static Object Modeling

The ability to detect adjunct object features through object region voting can also be employed in another way: to allow the system to continue to acquire adjunct object features from an object after the object's motion has ceased. In the previous section we described how it is possible to use object region voting to counteract both over and undersegmentation of the static input. With no object motion, no rigidly moving core feature groups are found, and so the previously described system cannot acquire new information about objects. However, we can use the known core features in the object models as a basis for finding new adjunct object features.

The approach described here incorporates elements of both static individuation and object recognition. Static object modeling is applied after the combined motion-based and static individuation steps described above. Our goal here is to find a mapping between object regions that were associated with the background and existing object models. Again we use a voting scheme, but here we compare the features within each object region against the core features belonging to objects in the database, rather than the core feature groups found through rigid body motion. The same voting rules apply, except that the background does not get a vote (since all features here correspond to the background).

It is very important for this technique that only core image features from the model be matched against the features in the object regions. If adjunct image features from the model are used and the image is undersegmented, image features that correspond to the background can be added to the model as new adjunct image features. These incorrectly added adjunct features will then match against regions in the next frame that actually should correspond to the background, adding more inaccurate image features. This can create a degenerative cycle, causing large parts of the background to be added to the object model. By restricting the matching to features in object regions and model core features, we can prevent this from occurring. However, a downside of this technique is that completely novel views of the object cannot be incorporated into the model, because there will be no core features in the model that match features in those regions.

3.5 Additional ODMAS Subsystems

The previous sections described the individuation process of the ODMAS system, which was the main focus of the research. This section will discuss the other three parts of the ODMAS system: tracking, modeling and recognition.
3.5.1 Tracking

The goal of object tracking in the context of object discovery is to correlate over time the object snapshots generated through individuation (see Section 2.3.2). We perform tracking using SIFT image features and feature tracking. Feature tracking tends to work well on large and clearly visible objects [Lowe, 2004, Sivic et al., 2004]. Rather than match object snapshots against each other in a pair-wise manner to track objects, we track through object recognition. Features in new object snapshots are compared against known features on all object models and, if a threshold number match against only one model, then the snapshot is associated with that model. If there are more than a threshold number of matches against multiple models, then the system classifies the snapshot as uncertain. Otherwise the snapshot is used as the basis for a new model. Objects without enough image features for tracking will not generate good feature-based models and so are ignored by the ODMAS system. Our hope was that appearance data alone would be more than sufficient for the purposes of tracking and that it would be unnecessary to incorporate shape information.

3.5.2 Modeling

As was previously discussed in Section 2.3.3, in the context of object discovery, modeling is the process of taking an object sequence and integrating the data in that sequence into a model, a description of the properties of the object. Constructing useful and accurate object models is important, since they are the final output of the object discovery system. The models used in the ODMAS system are comprised of SIFT features with an associated (x, y, z) position in the model space. These models are primarily designed to be discriminative, since they are based on image features that are used mostly for recognition tasks. However, we also wanted to encode some amount of shape information into the model, hence the associated 3D position of each feature is stored. Our hope was that, on sufficiently textured objects with a large number of features, feature positions alone would be sufficient for reconstructing the shape of the object.

3.5.3 Recognition

In the ODMAS system, the object recognition component allows object models to be matched against raw input data (see Section 2.3.4). As with tracking, SIFT features and feature matching have been shown to be very successful for appearance-based recognition [Lowe, 2004], and this is the approach we employ. We treat the object recognition component as a separate system that matches features on objects in the model database against features in the image to locate known objects.

Chapter 4

The ODMAS System

4.1 System Overview

ODMAS stands for Object Discovery through Motion, Appearance and Shape. The system performs passive object discovery using a combination of explicit motion and static object discovery techniques. The input device is a binocular stereo camera that provides both high resolution images and a dense depth or shape map. These two sources of data are combined in the individuation step and used to produce object models containing both shape and image data.

Figure 4.1 is a system flow diagram for ODMAS. The primary system flow uses motion-based cues for object segmentation. In the Input stage, data is acquired by the stereo camera and an image and depth map are produced.
The image is then oversegmented into object regions using Normalized Cut Depth and Intensity Segmentation. The 3D positions of the SIFT features in the image are then determined, and the features are grouped together into core feature groups using Rigid Body Feature Individuation. The rigidly moving core features are then used for Moving Object Region Voting to determine which regions correspond to which objects. These regions are grouped together and all the new features within those regions are classified as adjunct features. The core features and adjunct features are used to construct object snapshots, which in turn are used for Moving Object Modeling to either improve existing object models or create new ones.

The other part of the system is the optional static object discovery component. The Static Object Recognition stage adds new adjunct features to existing models using the core features from the moving object models produced by the primary system, the background image features and the normalized cut static individuation. The core features from the models are matched against the background image features, and then the matches are used to vote on which regions correspond to known objects. New features in these regions are then used for Static Object Modeling, allowing more adjunct features to be added to existing object models.

Figure 4.1: An overview of the ODMAS system. The primary branch performs object discovery using motion-based cues, while the secondary branch allows the system to model static objects.

4.2 Primary Motion-based System Components

4.2.1 Input

The ODMAS system has two sources of input. The primary source of input is a Bumblebee™ stereoscopic depth camera. This device uses a pair of digital cameras to capture high resolution images of the scene and then determines the corresponding shape map and depth map. The right image is the one actually used by the rest of the system, hereafter referred to as image I. A depth map D is acquired through correspondence stereo on the two images (see Section 2.2.2) and then converted into the shape map S. This depth map, like all those used in the ODMAS system, has a surface validity constraint applied: a valid surface must contain a minimum of 100 pixels with a maximum disparity of 0.5 between them. This constraint is applied to remove noisy and inaccurate regions, as well as responses to very small objects that would not produce good models. See Figure 4.2 for an example of an image and depth map from the camera. Both of these inputs will be used throughout the chapter to demonstrate each of the stages of the ODMAS system. Image and depth maps like these can be acquired and processed at rates of up to 5 frames a second at resolutions of 640x480. The camera is mounted on an ActiveMedia Powerbot™, an autonomous robotic research platform.

The secondary source of input is the odometry from the Powerbot, which is the approximate position of the robot relative to its starting point.
Figure 4.2: (a) is an example of an image from the stereo camera. (b) is an example of the corresponding depth map.

The odometry is determined using position encoders attached to the robot's drive shaft. Such approaches are notoriously inaccurate over long periods of time due to slippage in the wheels. However, the accuracy of odometry over very short periods of time is sufficient for the ODMAS system.

4.2.2 Normalized Cut Static Individuation

The first stage of the individuation process is to perform static object individuation on the input image I using data from both that image and the shape map S. Ideally, the image would be segmented such that there is a one-to-one relationship between image regions and objects in the scene, with one remaining region containing all pixels corresponding to the background. Since we know of no approach that can achieve these results, we employ a static individuation procedure where the parameters are adjusted to oversegment the image rather than undersegment it. This means that there should be a many-to-one relationship between image regions and objects. For our approach, oversegmentation is preferable to undersegmentation, since it is easier to group together regions corresponding to the same object than it is to divide a region containing two or more objects.

For scene segmentation we use normalized cuts based on both image intensity and depth. As was previously described in Section 2.4.6, normalized cuts is a technique for segmenting a graph based on a dissimilarity function between the nodes. Image segmentation can be performed by treating the image as a graph where the nodes in the graph are the pixels in the image. The combination of both intensity and depth can provide cues about object boundaries in situations where each fails individually (see Section 3.2.1), and so we employ a cost function in the normalized cuts based on both the input image I and the shape map S.

Fast algorithms exist for approximating the ideal normalized cut of a graph for a fixed number of segments. What is needed for these fast methods is a weighted graph G where each node is a pixel, with edges between every pair of nodes. The edge weight $w_{ij}$ is the likelihood that the two pixels i and j belong to the same object. In our system, $w_{ij}$ is based on the difference in intensity, the difference in depth and the distance between pixels i and j. The function used to find $w_{ij}$ in the ODMAS system is based on one presented by Shi and Malik [Shi and Malik, 2000], modified to include pixel depth information:

$$w_{ij} = \begin{cases} e^{-\|I(i) - I(j)\|^2 / \delta_I} \cdot e^{-\|D(i) - D(j)\|^2 / \delta_S} \cdot e^{-\|P_i - P_j\|^2 / \delta_P} & \text{if } \|P_i - P_j\| < r \\ 0 & \text{otherwise} \end{cases}$$

where, for pixel i, I(i) is the intensity, D(i) is the depth and $P_i$ is the (x, y) coordinates in the image. $\delta_I$, $\delta_S$ and $\delta_P$ are control parameters for each property, and these values allow us to adjust how much the image is under or oversegmented. These values were set by hand but were not changed during our experiments. r is the maximum 2D distance between pixels that can belong to the same object; past this distance the edge weight is 0, indicating that the two pixels cannot belong to the same object.

The output of the static individuation stage is a segmentation of the input image I, called Z, which gives a mapping from each pixel I(x, y) to an object region R. Figure 4.3 contains the resulting segmentation from the image and depth map in Figure 4.2.

Figure 4.3: The resulting segmentation of the image and depth map in Figure 4.2, using normalized cut segmentation based on both the image intensity and the depth map. Each region is represented by a constant grey level.
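A sketch of this combined edge weight is shown below. It assumes intensity and depth values normalized to comparable ranges, and the delta and r values are illustrative placeholders rather than the hand-tuned parameters used in the experiments.

```python
import numpy as np

def edge_weight(i, j, intensity, depth, coords,
                delta_i=0.1, delta_s=0.1, delta_p=16.0, r=10.0):
    """Combined intensity/depth/proximity affinity between pixels i and j.

    intensity, depth: per-pixel values; coords: per-pixel (x, y) positions.
    Follows the Shi-Malik style weight with an added depth term; the delta
    and r values here are illustrative, not the thesis' hand-tuned settings.
    """
    spatial = np.linalg.norm(coords[i] - coords[j])
    if spatial >= r:
        return 0.0  # too far apart to belong to the same object
    w_int = np.exp(-(intensity[i] - intensity[j]) ** 2 / delta_i)
    w_dep = np.exp(-(depth[i] - depth[j]) ** 2 / delta_s)
    w_pos = np.exp(-spatial ** 2 / delta_p)
    return w_int * w_dep * w_pos
```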
4.2.3 Rigid Body Motion Individuation

The goal of this step is to find the rigidly moving core features of each moving object in the scene (see Section 3.2.2 for details on core features and rigid body individuation). The first step is to use the Scale Invariant Feature Transform (SIFT) on input image $I^t$ at time t to extract a set of image features $F^t$. Stereo feature position reconstruction, based on the shape map $S^t$, is used to determine the 3D position of each feature relative to the camera (see Section 3.2.2 for details). All features contain the following values: $F_P$, the (x, y) sub-pixel location of the feature in the image; $F_S$, the (x, y, z) position of the feature in the scene; and $F_D$, a histogram descriptor of the image feature.

The next step in Rigid Body Motion Individuation is to determine a mapping between all features from the current image $F^t$ and the features $F^{t-1}$ in image $I^{t-1}$ from the previous frame. This is done through feature matching based on each feature's descriptor (see Section 2.4.5). The output is two sets of image features: $T^t$, a set of tracked features in the current frame, and $T^{t-1}$, a set of the corresponding features in the previous frame. From here on, based on these tracked features, for any set of features at time t there exists a matching set of features at time t − 1 (i.e., feature set $X^t$ has a matching feature set $X^{t-1}$).

After this, we need to determine which features in $T^t$ are moving and which are static. Our robot uses odometry data to determine approximations $\delta_P$ and $\delta_Q$, the robot's change in position and orientation between the previous and current frame. Given $\delta_P$ and $\delta_Q$, we determine the translation P and rotation R of the corresponding rigid transform (see Section 2.3.1). This rigid transform is applied to all features from the previous frame. To determine which features are moving, we find the distance they have moved between the previous and current frame. A feature is determined to be moving if this distance is greater than the minimum movement distance $\delta_{mmd}$. All moving features are grouped together in a set of features W:

$$W^t = \{ f \in T^t : \| f_S^t - (P + R \cdot f_S^{t-1}) \| > \delta_{mmd} \}$$

Since SIFT features have a tendency to move slightly in position between frames, even if the object they are on is static, $\delta_{mmd}$ is usually set slightly above 0. This value was set by hand but remained constant in all our experiments. See Figure 4.4 for an example of moving feature detection.

Figure 4.4: Two sequential frames of input images to the ODMAS system, with the current image on the bottom and the image from the previous frame on top. The lines indicate pairs of moving features tracked between the two images.

Additionally, we want to determine which features we can be confident are static and so can be associated with the background in this frame. These features form a set B. We use the same approach as above, but instead find features which have moved less than $\delta_{msd}$, the maximum static distance. This value is set very close to 0.

$$B^t = \{ f \in T^t : \| f_S^t - (P + R \cdot f_S^{t-1}) \| < \delta_{msd} \}$$
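The moving/static split can be written directly from these definitions. The sketch below applies the odometry-derived rigid transform (R, P) to the previous positions and thresholds the residual motion; the threshold values are illustrative.

```python
import numpy as np

def split_moving_static(curr, prev, R, P, d_mmd=0.05, d_msd=0.005):
    """Classify tracked features as moving (W) or static background (B).

    curr, prev: (n, 3) arrays of feature positions at times t and t-1.
    R, P: rotation matrix and translation of the camera motion, from odometry.
    Residual motion above d_mmd => moving; below d_msd => static background.
    Features in between are left unclassified, since neither test is confident.
    """
    predicted = prev @ R.T + P            # where static points should be now
    residual = np.linalg.norm(curr - predicted, axis=1)
    W = np.where(residual > d_mmd)[0]     # moving features
    B = np.where(residual < d_msd)[0]     # confident background features
    return W, B
```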
Given W, we next want to determine which of these features belong to rigidly moving bodies. In a rigidly moving body, all 3D points on the object's surface remain equidistant from each other under any transformation. We use this property to find all rigidly moving sets of features with the RANSAC algorithm [Fischler and Bolles, 1987]. Our RANSAC process follows these steps:

1. Select a random feature x from W and create an empty set E.

2. Find all the features whose distance to x remains the same between the current and previous frame and add them to set E:

$$E = E \cup \{ y \in W : \big| \, \|x_S^t - y_S^t\| - \|x_S^{t-1} - y_S^{t-1}\| \, \big| < \delta_d \}$$

$\delta_d$ was set by hand but remained constant through our experiments.

3. Repeat steps 1 and 2 Q times and keep the largest set of features E.

4. If there are more than v features in E, add them as a set to C, remove them from W, and repeat from step 1.

The result of this is C, which is a set of i sets of rigidly moving features. See Figure 4.5 for an example of this rigidly moving feature selection. These sets of image features are then used with the segmented image Z as the basis for Object Region Voting.

Figure 4.5: The positions of two groups of rigidly moving image features. Rigidly moving features on the object to the left are indicated with Xs, while rigidly moving features on the object to the right are indicated with Os.

4.2.4 Moving Object Region Voting

The goal of this step is to use the segmented image Z from the static individuation component and the sets of moving image features C to produce object snapshots O (see Section 3.4 for a description of adjunct object features and Object Region Voting).

For each region R in Z, object region voting will associate it with either a set of core features in C or the background B. The voting is based on the number of features in the feature sets with an image position within a given region. Given a function num(R, F), which returns the number of features in a feature set F that have a sub-pixel location $F_P$ lying within region R, we perform the following steps to classify each region r:

1. Create a set of zeros V of a length equal to i + 1, where i is the number of sets in C. These will store the votes from the core feature sets and the background for region r.

2. Set V(0...i) = num(r, C(0...i)). These are the votes for that region from each of the rigidly moving core feature sets.

3. Set V(i + 1) = num(r, B). This is the vote for that region from the background.

4. Find the maximum value in V and set $x_1$ equal to its index and $n_1$ equal to its value.

5. Find the second highest value in V and set $n_2$ equal to its value.

6. If $x_1 = i + 1$, the background had the largest number of votes and therefore $r \mapsto B$. Otherwise, given threshold constants N and K, if $n_1 > N$ and $n_1 > K \cdot n_2$ then $r \mapsto C(x_1)$; otherwise $r \mapsto B$. For our experiments we use N = 5 and K = 2.

This process is repeated for each region in Z until all regions are associated with either a core feature set in C or the background. Any set of features in C which was not associated with at least one region is removed from C, since it has no correspondence in the scene.
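A compact sketch of this voting rule is given below, operating on per-region feature counts. The region and group representations are illustrative, and N and K default to the values quoted above.

```python
import numpy as np

def vote_region(counts, background_idx, N=5, K=2):
    """Associate one region with a core feature group or the background.

    counts: votes per candidate, i.e. counts[g] = num(r, C(g)) for each core
    feature group g, with counts[background_idx] = num(r, B).
    Returns the winning group index, or background_idx if the vote is
    uncertain (too few features, or no clear K-to-1 winner).
    """
    counts = np.asarray(counts)
    order = np.argsort(counts)[::-1]
    x1, n1 = order[0], counts[order[0]]
    n2 = counts[order[1]] if len(counts) > 1 else 0
    if x1 == background_idx:
        return background_idx            # background won outright
    if n1 > N and n1 > K * n2:
        return int(x1)                   # clear winner among core groups
    return background_idx                # uncertain: default to background
```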
The next step is to construct an object snapshot O for each of the remaining rigid feature sets in C. Each object snapshot contains a set of core features C found through rigid body motion and a set of adjunct features A found through region voting. The C in a snapshot contains all image features in the core feature set with a sub-pixel image location within one of the regions associated with O. A contains all image features in F with a sub-pixel location within one of the regions associated with O that are not already in the snapshot's core feature set C. The features associated with the two groups of rigidly moving object features are shown in Figure 4.6. These object snapshots contain all the information about the shape and appearance of the individuated objects in this frame.

Figure 4.6: The position of every feature within the regions corresponding to the rigidly moving features in Figure 4.5, both core features and adjunct features. Features belonging to the object on the left are indicated with Xs, while those belonging to the object on the right are indicated with Os.

4.2.5 Moving Object Modeling

Given the object snapshots O produced by the motion-based individuation process, we next want to use them to either create new object models or improve existing ones. An object model M resembles an object snapshot in that both consist of two sets of features: core features C and adjunct features A. Features in the model contain a descriptor and a 3D position with an origin relative to the camera at the time of the first snapshot in the model.

The first step in converting object snapshots to models is to determine, for each snapshot, whether or not it corresponds to an already modeled object in the model database. This is done by matching all the feature descriptors in a snapshot against all the features in each of the models, using both core and adjunct features. If there are more than $\delta_m$ matches between the snapshot and multiple models, then the classification is uncertain and the snapshot is discarded. For our experiments we use $\delta_m = 10$. If none of the models have $\delta_m$ matches, then the snapshot is classified as unknown and a new model is created. If there are more than $\delta_m$ matches between the snapshot and only one model, then the snapshot is classified as belonging to that model and is incorporated into it.

A new model M is created from a snapshot O by removing the sub-pixel image position $O_P$ from all the features in O, since these values are meaningless for model features. The core feature set and adjunct feature set from the snapshot then become the C and A for the model. This, of course, means that the 3D coordinates of the first features in the model are all relative to the camera and object at the time of the first snapshot. All features added in the future must be registered into this frame of reference.

Incorporating new data into an existing model is a more complex procedure, since we want to avoid adding features that are already known into the model, and the (x, y, z) positions of the features in the snapshot must be rectified to correspond to the features already known in the model. Feature matching between the features in O and M gives a set of features H which are common to both, and a set of features J that are not in the model.
4.3 Secondary Static Object Discovery System

As described at the beginning of this chapter, the ODMAS system has both a motion-based and a static object discovery component. This section describes the static object discovery system, which employs object models found using the motion-based component and continues to individuate objects corresponding to those models after their motion has ceased. This component is optional and the system can function without it.

4.3.1 Static Object Region Voting

This step occurs after the motion-based modeling component of the system has completed. This is because the motion-based object discovery system is typically more reliable than the static object discovery system, so it makes sense to acquire as much data through motion as possible before data is acquired through static object discovery. After the moving object region voting component of the ODMAS system has been performed, some regions in the statically individuated image Z will be associated with moving objects while others are associated with the background. However, there may be novel views of known but static objects in the scene, and the goal of this stage is to match regions that were associated with the background against known models in the database, in order to extract new adjunct object features for those models.

For each region R in Z that was not associated with a set of core features C, this component will associate it with either some model M or the background B. This is done through region voting, as in the motion-based component. As before, the voting is based on the number of features with an image position within a given region. First, given q objects in the database, we find Y, a set of sets of features, where each set contains all the core features from one model that match against a feature in B. Only core model features are used for matching, to prevent problems resulting from undersegmentation (see Section 3.4.1 for details). Given a function num(R, F), which returns the number of features in a feature set F that have a sub-pixel location Fp lying within region R, we perform the following steps to associate each region r with a model or the background:

1. Create a set of zeros V of length q, the number of models in the database. These will store the votes for each model.

2. Set V(0...q) = num(r, Y(0...q)). These are the votes for that region from the core features of each model.

3. Find the maximum value in V and set x1 equal to its index and n1 equal to its value.

4. Find the second highest value in V and set n2 equal to its value.

5. Given threshold constants N and K, if n1 > N and n1 > K * n2 then r ↦ M(x1); otherwise r ↦ B. For our experiments we use N = 5 and K = 2, the same values that were used in the motion-based voting system.

For each model with at least one region associated with it, we create a static snapshot S. This snapshot contains a single set of adjunct features, since only adjunct features were found.
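A short sketch of how the matched feature sets Y might be assembled before this vote; the match predicate stands in for whatever descriptor comparison the system uses, and the .core attribute is a hypothetical name for a model's core feature set.

```python
def matched_model_features(models, B, match):
    # Y: for each of the q models, the subset of its core features that
    # match some feature in the current background feature set B
    return [[f for f in model.core if any(match(f, b) for b in B)]
            for model in models]

# The vote itself mirrors vote_regions from Section 4.2.4, with Y in place
# of C and no explicit background column: a region falls to the background
# whenever the winning model fails the n1 > N and n1 > K * n2 test.
```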
4.3.2 Static Object Modeling

Static object modeling is functionally almost identical to moving object modeling, except that the snapshots contain only adjunct object features and are already associated with a model. All that is required is that the features be registered using ICP against their counterparts in the model; any new features can then be added to the adjunct features already in the model.

Chapter 5

Experiments and Results

5.1 Experiment Overview

To test our approaches to object discovery, we performed three experiments on the ODMAS system, each constructed to test a different aspect of the system. The first experiment was designed to test the system's ability to detect multiple rigidly moving objects and construct descriptive models of them using a static camera. The second experiment was designed to test the system's ability to detect and construct descriptive models of a rigidly moving object while the camera is in motion; it also tests the system's ability to model static objects using the optional secondary static object discovery system. The third experiment was designed to test whether our system is able to produce accurate 3D models.

Our descriptions of the first two experiments contain a sequence of images showing each of the stages used in the experiment. In the first experiment, we show the effectiveness of the models for object recognition by using them to locate the modeled objects in a novel scene. For the third experiment, we show the resulting 3D model of the features. In all these experiments, the parameters used remain fixed, to demonstrate the system's robustness to different situations.

5.2 Experiment 1: Multiple Moving Object Discovery

The first experiment is designed to demonstrate the ODMAS system performing object discovery on a scene containing multiple moving objects. In the video sequence used, two people reach onto a desk and simultaneously remove two boxes from it. The goal of the experiment is for the system to acquire descriptive models of the two moved objects which can be used for object recognition and localization. The features from these models are then used to detect the locations of the two boxes in other images. For this experiment, the camera position is kept fixed. Most of the core components of the system are used for this sequence.

Figure 5.1 shows an image and depth map from the video sequence used. These are used to perform static segmentation of the scene using normalized cuts; see Figure 5.2. SIFT feature tracking is used between the image from the current frame and the previous frame to find moving features; see Figure 5.3. The moving features are then divided into objects by finding rigidly moving core feature groups using RANSAC; see Figure 5.4. Since the camera is not moving, there is no need to remove a core feature group corresponding to the background. The rigidly moving groups of features are then used for region voting, which in turn finds adjunct features for the objects. Figure 5.5 shows all the features used in the object snapshots for each of the objects. The core features and adjunct features in these snapshots are then used to create models of the objects.

The models produced in this experiment can be used to detect and localize the modeled objects in an image. For this, SIFT features are extracted from the target image and matched against the features in the model. Figures 5.6 and 5.7 show the object models being recognized in the target images. Notice that although both objects are partially occluded, their locations in the image can be determined using the model features.
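The thesis does not specify the matching criterion used here; a common choice, following Lowe [2004], is the nearest versus second-nearest neighbour ratio test sketched below. The 0.8 ratio is the value suggested by Lowe, not a parameter taken from this system.

```python
import numpy as np

def match_features(target_desc, model_desc, ratio=0.8):
    """Match SIFT descriptors from a target image against a model.

    target_desc -- (n, 128) array of descriptors from the target image
    model_desc  -- (m, 128) array of descriptors from the object model
    Returns a list of (target_index, model_index) pairs.
    """
    matches = []
    for i, d in enumerate(target_desc):
        dists = np.linalg.norm(model_desc - d, axis=1)
        j = np.argsort(dists)
        # accept only if the best match is clearly better than the runner-up
        if len(j) > 1 and dists[j[0]] < ratio * dists[j[1]]:
            matches.append((i, int(j[0])))
    return matches
```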
Figure 5.2: This is the resulting segmentation of the image and depth map from experiment 1.

Figure 5.3: This figure shows the features matched between two sequential images from experiment 1.

Figure 5.4: This figure shows groups of rigidly moving features on each of the objects in experiment 1.

Figure 5.5: This figure shows all the features in the snapshots for each of the moving objects in experiment 1. These features were found through region voting using the features in Figure 5.4 and the segmentation in Figure 5.2.

Figure 5.6: This figure shows an example of object recognition using the model of the left box produced in this experiment.

Figure 5.7: This figure shows an example of object recognition using the model of the right box produced in this experiment.

This experiment demonstrates that the ODMAS system is able to individuate multiple moving rigid objects and construct a recognition model from that data. Almost all features in the models come from the objects, though some adjunct features detected near the edges of an object are actually based on both the object and the background. However, since the core features were reliably matched between two frames of input, we can say with high confidence that they correspond to the object itself and not just a temporary arrangement of object and background.

5.3 Experiment 2: Object Discovery using a Moving Camera and the Static Object Modeling System

The second experiment is designed to demonstrate the ODMAS system using a moving camera and performing static object modeling. As previously described in Section 3.4.1, static object modeling allows the ODMAS system to continue acquiring adjunct features from objects after the object's motion has ceased. This is done by matching core image features in the known object models against static features in the input image, performing region voting based on the matched features, and extracting new adjunct features from the identified regions. Obviously, to acquire novel features from a static object, the camera must have moved.

The video sequence used for this experiment contains a moving box on a desk with an assortment of clutter in the background. Halfway through the sequence, the box stops moving, but the camera continues around the box to acquire a novel view of its surface using static object modeling. During the moving part of the sequence, motion-based individuation is performed as in experiment 1. Input data is acquired (see Figure 5.8) and used to determine a static segmentation of the scene (see Figure 5.9). Feature matching is then used to find all moving features in the image (see Figure 5.10). Since the camera is in motion, moving features are found on both the background and the objects.

Figure 5.8: (a) is an example of an image from the video sequence in experiment 2. (b) is an example of the corresponding depth map.

These features are divided into rigidly moving feature groups using RANSAC (see Figure 5.11), and odometry is then used to determine which core feature group corresponds to the background; one plausible form of this test is sketched below. The remaining core feature group corresponds to the moving box.
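The thesis does not detail the odometry test, so the following is a speculative sketch under stated assumptions: the camera's motion between frames is approximately known as (R_cam, T_cam) in the previous frame's camera coordinates, and rigid_transform is the Procrustes sketch from Section 4.2.5. The idea is that static world points appear to move by the inverse of the camera motion, so the background group is the one whose fitted motion is closest to that apparent motion.

```python
import numpy as np

def background_group_index(groups, prev_pos, cur_pos, R_cam, T_cam):
    # Under the stated convention, a static world point p (previous-frame
    # camera coordinates) appears at p' = R_cam.T @ (p - T_cam) in the
    # current frame, so static points share this apparent rigid motion:
    R_bg, T_bg = R_cam.T, -R_cam.T @ T_cam
    best, best_err = None, np.inf
    for idx, g in enumerate(groups):
        P = np.array([prev_pos[f] for f in g])
        Q = np.array([cur_pos[f] for f in g])
        R, T = rigid_transform(P, Q)          # fitted per-group motion
        err = (np.linalg.norm(R - R_bg)       # crude distance between the
               + np.linalg.norm(T - T_bg))    # fitted and apparent motions
        if err < best_err:
            best, best_err = idx, err
    return best                               # index of the background group
```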
This group is then used as the basis of region voting to find the core and adjunct features belonging to the box (see Figure 5.12), which then become a snapshot and are added to the object model.

For the static object segmentation system, core features from the object model found through object motion are matched against features from an image taken after the box has ceased to move. Figure 5.13 shows the features in the new image that match features from the previously acquired model. Region voting is then applied to find new adjunct features for the model, in this case acquiring features from two new sides of the box (see Figure 5.14).

Figure 5.9: This is the resulting segmentation of the image and depth map from experiment 2.

Figure 5.10: This figure shows the moving features matched between two sequential images from experiment 2. Since the camera is in motion, moving features are found on both moving and static objects.

Figure 5.11: This figure shows groups of rigidly moving features in experiment 2. Notice that there is one rigidly moving feature group associated with the background and another with the moving box.

Figure 5.12: This figure shows all the features in the snapshot of the moving box in experiment 2. These features were found through region voting using the features in Figure 5.11 and the segmentation in Figure 5.9.

Figure 5.13: This figure shows features in an image that matched features in the model of the white box in experiment 2. These features are used as the basis for performing static object segmentation.

Figure 5.14: This figure shows adjunct features found on the box using region voting and the features found in Figure 5.13.

This experiment illustrates that the ODMAS system can handle a moving camera and that static object segmentation can acquire novel adjunct features from an object after its motion has ceased. It should be noted that it is not always possible to acquire data from different sides of an object through static object individuation: it requires that the static segmentation of the scene not oversegment the object into a separate region for each side. In this case the segmentation put all three visible surfaces of the object into the same region, so when that region was selected as belonging to the object through region voting, new features were extracted from the new surfaces.

5.4 Experiment 3: ODMAS for 3D Object Modeling

The third experiment is designed to demonstrate the ODMAS system's ability to model the 3D shape of an object. The target object in this experiment is a large cylinder covered in newspaper to provide texturing, so that many SIFT features can be extracted. The simple geometric shape of this object was chosen to make it easier to see the structure of the final model. During the video sequence, the camera position is static and the cylinder is rotated on its axis. Since the object is constantly in motion, object discovery is performed using the primary motion-based system.
The individuation and tracking steps involved in this experiment are not shown here, since they are very similar to those in the first experiment, only involving a single moving object rather than multiple moving objects. The real difficulty in this experiment lies in registering each new snapshot of the object with the points already in the model. As previously described in Section 4.2.5, the first step in registering snapshot features is matching them against the features already in the model based on their descriptors. The iterative closest point algorithm is then used to find a rigid transform between the 3D locations of the matching snapshot and model features. That transform is applied to the features in the snapshot that did not match features in the model (i.e. new features for the model), and they are added into the model.

Unfortunately, we found that we were unable to compute an accurate registration between the features in the model and the snapshot, and this made modeling the entire surface of the cylinder impossible. The problem with registration lies in the fact that the image location of a SIFT feature arising from a point on an object can vary across the object's surface from one frame to another, especially when the viewing angle changes between frames. This means that the corresponding 3D location of the feature will not be at the same point on the object in both the snapshot and the model. As a result, the rigid transform between the points in the snapshot and in the model is not sufficiently accurate for new features to be added at the correct locations in the model.

Figure 5.15: (a) shows the side of the resulting model of the cylinder. (b) shows the top of the resulting model of the cylinder.

Figure 5.15 shows two views of the resulting model of the cylinder after a full 360 degree rotation. What has happened is that each successive snapshot of the object has been registered such that they are all approximately overlapping. With the locations of the points varying along the surface of the cylinder, ICP has found an incorrect registration and determined that each snapshot should lie on top of the last one.

5.5 Summary of Results

The ODMAS system succeeded in two of the three experiments that we conducted. In the first experiment, we showed that the system is able to individuate multiple moving rigid objects and create appearance-based models of those objects that can be used for object recognition. This was one of the primary goals of the ODMAS system. The second experiment showed that the same ODMAS system was able to create appearance-based models of objects in the presence of camera motion, and that features could still be extracted from objects after their motion had ceased. This makes the ODMAS system much more generally useful, since it can be used on a moving platform and no longer requires that objects keep moving in order for features to be extracted. Finally, in the third experiment we discovered that our approach to 3D object modeling was inadequate for producing accurate shape models of objects. However, this was never a primary goal of the system, and it is questionable how useful models based on the 3D positions of SIFT features would be: SIFT features do not occur in textureless regions of an object, so 3D points are unlikely to be acquired there.
Only highly textured objects contain enough features to make complete surface models; however, even sparsely textured objects can provide enough SIFT features for the objects to be individuated and recognition models to be created.

Chapter 6

Conclusion and Future Work

6.1 Conclusion

The ODMAS system was designed to test our theories about how to perform object discovery using a combination of appearance, shape and motion. We proposed a technique for detecting rigidly moving objects and constructing models of their appearance and shape. This technique, described in Chapters 3 and 4, used a multi-stage approach to object discovery. A stereo camera was used to acquire a sequence of images and shape maps of a given scene. Then, using normalized cuts based on a combination of shape and appearance, the scene was oversegmented into object regions. SIFT image features were matched between the current and previous frames to identify groups of moving features, and the locations of these features in the scene were identified using the shape map. These features were then further individuated into objects by identifying groups whose motion was rigid. These feature groups were then used to determine which regions in the segmentation of the scene correspond to objects, grouping oversegmented object regions together as necessary. Features were extracted from these regions and combined with the rigidly moving image features to create snapshots of the object's appearance and shape. Finally, these snapshots were grouped together into models of the objects.

We also proposed and implemented a secondary subsystem in ODMAS that allows objects to continue to be segmented after their motion has ceased. This system relies on matching non-moving image features against known features on object models in the object database and then using these features to determine which object regions correspond to the unmoving objects. These object regions can then be used to identify new features to add to the object models.

We discovered through our experiments that our system was able to acquire appearance-based models of moving rigid objects both while they were moving, using the primary motion-based system, and after their motion had ceased, using the secondary static object segmentation system. We were able to use both static cameras and cameras with approximately known odometry. The models generated can be used for object recognition and localization by matching features from the target image against features from the object models. Unfortunately, we were unable to register the points on the surface of an object with sufficient accuracy to acquire 3D models of its entire surface. However, as previously stated, this was not a primary goal of the system.

6.2 Future Work

The ODMAS system was designed to test our theories about how it might be possible to combine shape, appearance and motion data to perform object discovery. Our successes and failures in these experiments have indicated a number of new directions we are considering for solving this difficult problem. Here are some of the improvements or alternate solutions we feel would benefit any future system we create.

6.2.1 3D Feature Localization through Stereo SIFT Matching

We determined the 3D location of SIFT features in the scene by comparing each feature's subpixel location to the shape map from the stereo camera. A more accurate technique for determining the 3D position of a SIFT feature using a stereo camera involves performing feature matching between the images from each camera and computing the disparity between the positions of the matched SIFT features [Se et al., 2001]. Greater accuracy in the 3D feature positions should allow us to find more rigidly moving core features and possibly to perform the registration needed to create 3D object models.
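For reference, the rectified-stereo relation this proposal relies on is sketched below, assuming calibrated and rectified cameras; the focal length and baseline are generic parameters rather than values from the system.

```python
def depth_from_disparity(x_left, x_right, focal_px, baseline):
    """Depth of a SIFT feature matched between rectified stereo images.

    x_left, x_right -- sub-pixel x coordinates of the matched feature
    focal_px        -- focal length in pixels
    baseline        -- distance between the two camera centres
    Standard relation for rectified stereo: Z = f * b / d.
    """
    d = x_left - x_right          # disparity; must be nonzero (finite depth)
    return focal_px * baseline / d
```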
6.2.2 Learning Object Features

In the ODMAS system, image features are treated as independent structures with descriptors acquired from a single frame of input. In reality, however, the location on the object that an image feature describes persists through multiple frames. To prevent duplicate features from appearing in our models, we only accept features that do not match any existing feature in the model; all features matching an existing feature are discarded. An improvement to the ODMAS system would be to use the data available in these similar but not identical features of the same point on an object, both to improve the feature descriptor and to more accurately localize the feature's position on the object. Previous work on this has involved using a Kalman filter to model 3D landmarks [Se et al., 2001]. This might allow us to perform the registration necessary to create 3D models. It would entail further research into effectively combining two or more feature descriptors to create a representation that both captures all the different views of the feature and is still computationally efficient to match against other features.

6.2.3 Non-rigid Object Detection

A major limitation of the ODMAS system for performing object discovery is its requirement that the objects modeled be rigid and moving. Removing the movement limitation is very problematic, because we cannot confirm that an object is separate from the background without detecting some form of motion, either implicit or explicit. However, it should be possible to remove the rigidity limitation, allowing articulated or even completely non-rigid objects to be modeled. As previously mentioned, the main reason we find rigidly moving feature groups is to separate groups of moving features into multiple moving objects. As an improvement to the ODMAS system, we would like to explore other options for performing this separation based on existing non-rigid object tracking techniques.

6.2.4 Improved Scene Segmentation using New Image Properties

Our approach to scene segmentation, which used normalized cuts based on image intensity and depth, was mostly effective for our purposes. However, the approach failed when trying to model objects for which the depth maps of the scene were incomplete. Acquiring shape using correspondence stereo requires that both cameras have a reasonably clear view of the object. When dealing with objects like boxes, which have sharp angles between their surfaces, we found that a side would sometimes be visible from only one camera, meaning that there would be no depth data for it. Without depth information for that side, it would generally not be segmented into the same region as the sides with depth information. To compensate for this problem, we are considering different image properties on which to base our segmentation of the scene. In particular, we are interested in looking at color and texture as a basis for segmentation.
Bibliography

D. H. Ballard and C. M. Brown. Computer Vision. Prentice Hall Professional Technical Reference, 1982. ISBN 0131653164.

G. Berkeley. An essay towards a new theory of vision, 1975.

P. J. Besl and N. D. McKay. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239-256, 1992.

O. Carmichael and M. Hebert. Shape-based recognition of wiry objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(12):1537-1552, 2004.

B. Curless and M. Levoy. A volumetric method for building complex models from range images. Computer Graphics, 30 (Annual Conference Series):303-312, 1996. URL citeseer.ist.psu.edu/curless96volumetric.html.

D. Walther, U. Rutishauser, C. Koch, and P. Perona. Selective visual attention enables learning and recognition of multiple objects in cluttered scenes. Computer Vision and Image Understanding, Special Issue on Attention and Performance in Computer Vision, 100:41-63, 2005.

P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167-181, 2004.

R. Fergus, P. Perona, and A. Zisserman. A visual category filter for Google images. In Proceedings of the European Conference on Computer Vision, pages 242-256, 2004.

V. Ferrari, T. Tuytelaars, and L. J. V. Gool. Wide-baseline multiple-view correspondences. In Proceedings of Computer Vision and Pattern Recognition, pages 718-728, 2003.

M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. In M. A. Fischler and O. Firschein, editors, Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, pages 726-740. Kaufmann, Los Altos, CA, 1987.

A. Fitzgibbon and A. Zisserman. Automatic camera tracking. In M. Shah and R. Kumar, editors, Video Registration, pages 18-35. Kluwer, 2003.

P. Fitzpatrick and G. Metta. Grounding vision through experimental manipulation. Philosophical Transactions of the Royal Society: Mathematical, Physical, and Engineering Sciences, 361(1811):2165-2185, 2003.

W. E. L. Grimson. The combinatorics of local constraints in model-based recognition and localization from sparse data. Journal of the Association for Computing Machinery, 33(4):658-686, 1986.

A. Hilton, A. J. Stoddart, J. Illingworth, and T. Windeatt. Reliable surface reconstruction from multiple range images. Lecture Notes in Computer Science, 1064:117-125, 1996. URL citeseer.ist.psu.edu/hilton96reliable.html.

C. Hue, J.-P. Le Cadre, and P. Prez. Tracking multiple objects with particle filtering using multiple receivers. In Proceedings of the International Seminar on Target Tracking: Algorithms & Applications, pages 61-64, Twente, The Netherlands, October 2001.

M. Isard and J. MacCormick. BraMBLe: A Bayesian multiple-blob tracker. In International Conference on Computer Vision, pages 34-41, 2001.

E. Izquierdo and M. Ghanbari. Video composition by spatiotemporal object segmentation, 3D-structure and tracking. In IV, pages 194-199, 1999.

C. Koch and S. Ullman. Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology, 4:219-227, 1985.
D. Koller, J. Weber, and J. Malik. Robust multiple car tracking with occlusion reasoning. In Proceedings of the European Conference on Computer Vision, pages 189-196, 1994.

D.-S. Lee. Effective Gaussian mixture learning for video background subtraction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):827-832, 2005.

J. J. Leonard and H. Durrant-Whyte. Simultaneous map building and localization for an autonomous mobile robot. In Proceedings of the International Workshop on Intelligent Robots and Systems, pages 1442-1447, May 1991.

D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.

J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of the British Machine Vision Conference, 2002.

A. Mendes, L. Bento, and U. Nunes. Multi-target detection and tracking with a laser scanner. In Proceedings of the Intelligent Vehicle Symposium, pages 796-801, 2004.

K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, pages 128-142. Springer, 2002. URL http://perception.inrialpes.fr/Publications/2002/MS02.

J. Modayil and B. Kuipers. Bootstrap learning for object discovery. In Proceedings of the International Conference on Intelligent Robots and Systems, pages 742-747, June 2004.

H. Nanda and K. Fujimura. Visual tracking using depth data. Computer Vision and Pattern Recognition Workshop, 27(02):37-37, 2004.

K. Okuma, A. Taleghani, N. de Freitas, J. J. Little, and D. G. Lowe. A boosted particle filter: Multitarget detection and tracking. In Proceedings of the European Conference on Computer Vision, pages 28-39, 2004.

I. Patras, R. L. Lagendijk, and E. A. Hendriks. Video segmentation by MAP labeling of watershed segments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(3):326-332, 2001.

M. Rutishauser, M. Stricker, and M. Trobina. Merging range images of arbitrarily shaped objects. In Proceedings of Computer Vision and Pattern Recognition, pages 573-580, 1994. URL citeseer.ist.psu.edu/rutishauser94merging.html.

U. Rutishauser, D. Walther, C. Koch, and P. Perona. Is bottom-up attention useful for object recognition? In Proceedings of Computer Vision and Pattern Recognition, pages 11-37, 2004.

D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1-3):7-42, 2002.

S. Se, D. Lowe, and J. Little. Local and global localization for mobile robots using visual landmarks. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 414-420, Maui, Hawaii, October 2001.

J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000. URL citeseer.ist.psu.edu/shi97normalized.html.

J. Sivic, F. Schaffalitzky, and A. Zisserman. Object level grouping for video shots. In Proceedings of the European Conference on Computer Vision, pages 85-98, 2004.
E. Trucco and A. Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1998. ISBN 0132611082.

J. K. Tsotsos, S. M. Culhane, W. Y. K. Wai, Y. Lai, N. Davis, and F. Nuflo. Modeling visual attention via selective tuning. Artificial Intelligence, 78(1-2):507-545, 1995.

Y. Yu, A. Ferencz, and J. Malik. Extracting objects from range and radiance images. IEEE Transactions on Visualization and Computer Graphics, 7(4):351-364, 2001.