Recognition of Shape and Orientation using Binocular Vision by Raymond Stanley Burge B.A.Sc (Engineering Physics) University of British Columbia M.A.Sc (Electrical Engineering) University of British Columbia A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the Faculty of Graduate Studies Department of Electrical and Computer Engineering We accept this thesis as conforming to the required standard The University of British Columbia July 1999 © Raymond Stanley Burge, 1999 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of 6 " U c [ r ^ j . Lr ««• »^ The University of British Columbia Vancouver, Canada Date D^o fS~ l i l t DE-6 (2/88) Abstract This thesis examines the problem of building primitive vision elements for the remote operation of heavy equipment using minimal supervision. The ultimate goal of the vision system is to provide the robot with a world view that is compatible with the operator's view such that the operator sets goals and the robot executes low level tasks needed to accomplish intermediate objectives. A system for perceiving shape is developed that allows the robot to identify the ground surface on which it moves and to recognize obstacles and simple objects within the workspace. Stereo vision is used for this purpose. Disparity between stereo images contains distance information about a scene. The information in stereo images and the recovery of dense disparity maps is studied. A procedure for matching stereo images is given. The disparity gradient is used as the fundamental evidence of shape perceived by stereo vision and the properties for shape are analyzed within disparity gradient space. These attributes are exploited for navigation and recognition of simple shapes. The techniques developed are used to recognize cylinders and to estimate their location, size and orientation within a robot workspace. The vision process includes three phases: (1) recovery of a dense disparity map that includes sub-pixel interpolation of local surfaces, (2) deciding the location of objects within the robot workspace, and (3) confirmation of a cylinder and determining its location, radius, and pose. Example of these procedures are provided for real images. The algorithm provides a means for navigation on plane surfaces and for recognition and pose estimation of cylinders within the robot workspace. Examples described here demonstrate the feasibility and reliability of this approach. Future work will consider improvements to the low-level processes and increasing the resolution of disparity surfaces. Table of Contents Abstract ii List of Tables vi List of Figures vi 1. Introduction 1.1 Motivation 1 1.2 Stereo Vision 3 1.3 Object Identification and Pose Estimation 6 1.4 Contributions Made in this Thesis 6 1.4.1 SAD vs. SSD 7 1.4.2 Sampling Noise and Image Matching 7 1.4.3 Matching over Occlusions 8 1.4.4 Disparity Gradient Space 8 1.4.5 Surface Energy 8 •1.4.6 Normalized Global Shape Models 9 1.4.7 Local Shape Estimation 9 2. Binocular Vision 2.1 Camera Images 12 2.2 Stereo Images 13 2.3 The Epipolar Constraint 16 2.4 The Perception of Planar Surfaces in Stereo Images 17 2.4.1 The Perception of Surface Orientation 18 2.4.2 The Disparity Gradient of Horizontal Ramp Surfaces 19 2.5 Producing Dense Disparity Maps 20 2.5.1 The SSD and SAD Matching Metrics 21 2.5.2 A Reason Why SSD and SAD Give the Same Results 23 2.6 Sub-Pixel Disparity 26 2.6.1 Image Models 26 2.6.2 Sub-Pixel Matching with SSD 27 2.6.3 Novel Technique for Sub-pixel Matching with SAD 28 2.7 Summary 30 3. Producing Dense Disparity Maps 3.1 Structural Limitations in Image Matching 31 3.1.1 A Model for Image Matching 3 3 3.2 Why'Good'Images Don't Match 36 3.3 A Wiener Filter Approach for Matching Sampled Data 39 3.4 A New Preprocessing Filter for matching Images 41 i i i 3.5 Matching Along Epipolar Lines 44 3.6 Surface Models and Correlation Windows 45 3.7 Left-Right Match Confirmation 47 3.8 Matching Over Occlusions 48 3.9 A Novel Technique using Directional Matching over Occlusions 51 3.10 Matching with Asymmetric Correlation Filters 55 3.11 The Match Rules 58 3.12 Summary 62 4. Navigation, Recognition, and Orientation 4.1 Global Shape Model 65 4.2 Disparity Gradient Space: a new way to evaluate shape 67 4.2.1 The Disparity Gradient of Planar Surfaces 67 4.2.2 Reevaluating Surface Normals 69 4.3 Qualitative Shape 71 4.3.1 The Disparity Gradient of Cylindrical Surfaces 71 4.3.2 Identifying Cylinders in Disparity Gradient Space 73 4.4 Shape and Orientation of Local Surface Patches 76 4.5 Quadratic Surfaces 78 4.6 Surface Curvature 79 4.7 Surface Energy 80 4.8 Surface Energy and the Membrane Model of Surfaces 82 4.9 Simple Object Recognition using Surface Energy 83 4.10 Identifying Local Shape and orientation using SVD 84 4.11 Summary 87 5. Implementation Issues and Results 5.1 Navigation and Obstacle Detection 88 5.1.1 Navigation on an Infinite Plane 8 8 5.1.2 Navigation in a Corridor 96 5.2 Recognizing Logs 98 5.2.1 What to Measure 99 5.2.2 Simulation of Noise Effects 99 5.3 Where to Measure 102 5.4 How to Measure 102 5.5 Drum Image 103 5.6 Tree Images 106 5.7 Summary 111 6. Summary and Future Work 6.1 Summary 113 6.1.1 Contributions 114 6.2 Future Work 118 A p p e n d i c e s Appendix 1 Camera Errors 119 Appendix 2 Matching Random Images 123 Appendix 3 Recursive Edge Detectors 126 Appendix 4 An Approximately Separable 2D Recursive Filter 128 Appendix 5 Other Estimators for Cylinder Orientation 132 Appendix 6 Comparison of Signal Matching with Power Cepstrum and SSD 137 R e f e r e n c e s 148 v List of Tables Table 5.1 Eigenvalues associated with surface energies on a plane in the presence of noise 100 Table 5.2 Eigenvalues associated with surface energies on a cylinder in the presence of noise 100 Table 5.3 Eigenvalues associated with surface energies on a sphere in the presence of noise 101 List of Figures Figure 1.1 Excavator in the robotics laboratory 2 Figure 2.1 Projective geometry of a pinhole camera 12 Figure 2.2 Stereo images where nearby objects in the left view appear shifted to the right 14 Figure 2.3 Projective geometry of a stereo pinhole camera 15 Figure 2.4 Side view of 5 planar ramps that intercept the Z = 0 Plane at - Y 0 20 Figure 2.5 Orientation of surface patches 23 Figure 2.6 Cumulative probability of disparity change in an image region 24 Figure 2.7 Sub-pixel interpolation of SAD matching 29 Figure 2.8 Using SAD to match raw stereo images and SAD interpolation 30 Figure 3.1 Parallel systems with asynchronous sampling 32 Figure 3.2 Bode plot showing reduction of useful matching Bandwidth by wideband noise 39 Figure 3.3a Using the symmetrical difference filter to recover image texture 42 Figure 3.3b Using the thesis filter to recover image texture 42 Figure 3.4a Test image used in this chapter 43 Figure 3.4b Disparity using image texture from symmetrical difference 43 Figure 3.4c Disparity using image texture from thesis filter 43 Figure 3.5 Construction of an epipolar match surface 45 Figure 3.6a Match surface formed from comparing horizontal line 100 in left and right images 46 vi Figure 3.6b Contour plot of Figure 3.6a with best match solution on match surface 47 Figure 3.7 Left-right correspondence on match array occurs when both the vertical and diagonal search confirms the same match 48 Figure 3.8a Positive distance step 48 Figure 3.8b Correspondingpatternonmatchsurfa.ee 48 Figure 3.9a Negative distance step 49 Figure 3.9b Corresponding pattern on match surface 49 Figure 3.1 Oa Search region when previous match is known to be correct 50 Figure 3.10b Search pattern on match surface with no knowledge of previous match 50 Figure 3.11 Forward and backward correlation filters on different surfaces 52 Figure 3.12 Matching over an occlusion 53 Figure 3.13 Failure of symmetrical filter in region with step changes in disparity 54 Figure 3.14 Diagonal search line on backward filtered match surface 55 Figure 3.15 Data vector of diagonal search results 56 Figure 3.16 First evidence of negative disparity step on diagonal search 57 Figure 3.17a Right view of aerial photograph 60 Figure 3.17b Disparity recovered with symmetrical filter 60 Figure 3.17c Disparity map recovered using only Rule 1 60 Figure 3.17d Disparity map recovered using Rule 1 and Rule 2 61 Figure 3.18 Right Pentagon image and recovered disparity map 62 Figure 4.1 Camera orientation for floor navigation 65 Figure 4.2 Camera view of infinite plane and disparity image of plane 66 Figure 4.3 Line of intersection of a planar surface with the Z = 0 plane 68 Figure 4.4 Disparity Gradient Space 68 Figure 4.5 Geometry of vertical cylinder with respect to the Z = 0 plane 70 Figure 4.6 Definition of a plane by axes intercepts 71 Figure 4.7 Projection of surface normals of a vertical cylinder on the Z = 0 plane 72 Figure 4.8 Projection of surface normals of a non-vertical cylinder on the Z = 0 plane 73 Figure 4.9 Disparity gradient at random points on a 45 gallon drum 74 Figure 4.10 Disparity gradient at random points on a nearby tree surface 76 vii Figure 4.11 Surface patch with three normals 77 Figure 4.12 Surface normals on a cylinder 78 Figure 4.13 Surface normals on a cylindrical section 80 Figure 4.14 SVD error in estimating cylinder orientation vs. number of measurements 85 Figure 4.15 Normalized eccentricity vs. number of measurements 86 Figure 5.1 Log on concrete floor 89 Figure 5.2 Side view and top surface of disparity map of log on concrete floor 89 Figure 5.3a Residual after floor subtraction 91 Figure 5.3b Residual after floor normalization 91 Figure 5.4 Log and back wall in Figure 5.1 detected by floor normalization 93 Figure 5.5a Log oriented toward observer to examine disparity along the vertical lines 93 Figure 5.5b Raw disparity along vertical lines shown in Figure 5.5a 94 Figure 5.5c Disparity after sub-pixel interpolation using method in Section 2.6.3 94 Figure 5.6 Dense disparity map recovered using method given in Chapter 3 94 Figure 5.7 Contour plot of interpolated disparity image 95 Figure 5.8 Normalized obstacle map using threshold of 1.1 95 Figure 5.9 Synthetic hallway image 97 Figure 5.10 Obstacle detection in hallway 97 Figure 5.11 Cylinder geometry 98 Figure 5.12 Drum image at 2.5 meters, f = 400 pixels, b = 17.4 cm. 103 Figure 5.13 Mesh plot of disparity surface shows only a few pixels change over the drum 104 Figure 5.14 Convergence of estimated drum orientation vs. number of iterations 104 Figure 5.15 Running average of drum radius estimate in camera baseline units vs. number of iterations 105 Figure 5.16 Cylindrical signature of drum using running average of eigenvalue ratio vs. number of iterations 105 Figure 5.17 Tree image at 2 meters, f = 400 pixels, b = 17.4 cm. 106 Figure 5.18 Stereo detail of tree image; Z = 2 meters, f = 800 pixels, b = 17.4 cm. 106 Figure 5.19 Section of tree disparity surface 107 Figure 5.20 Convergence of tree orientation estimate vs. number of iterations 107 viii Figure 5.21 Running average of tree radius estimate in camera baseline units vs. number of iterations 108 Figure 5.22 Cylindrical signature of tree using running average of eigenvalue ratio vs. number of iterations 108 Figure 5.23 Tree image 109 Figure 5.24 Stereo detail of tree; Z = 2.2 meters, f = 400 pixels, b = 17.4 cm. 106 Figure 5.25 Section of tree disparity surface 109 Figure 5.26 Convergence of tree orientation estimate vs. number of iterations 110 Figure 5.27 Running average of tree radius estimate in camera baseline units vs. number of iterations 110 Figure 5.28 . Cylindrical signature of tree using running average of eigenvalue ratio vs. number of iterations 111 ix Chapter 1 Introduction /. 1 Motivation The motivation behind this research is to examine basic issues related to stereo vision interface for supervisory control interface of heavy machinery, like the excavator shown in Figure 1.1. In this system, the human operator would assign tasks and oversee their execution and some low level task activities such as local navigation and primitive object recognition would be performed autonomously by the robot. This thesis investigates how stereo vision could be used in this application. The contributions made in this thesis result from the investigation of basic stereo vision problems and from studies of how stereo vision can be used for navigation and simple object recognition. Humans use stereo vision to manipulate objects within a 'personal work space' and these methods are studied within the excavator workspace where cylindrical objects such as trees and logs are to be manipulated. The task is to develop a means for the robot to visually interpret the world such that the operator and robot can cooperate in navigation, obstacle avoidance, object recognition, and pose estimation. A command such as "Pick up that log." does not require that the robot find a log in a visual scene. The operator points it out. However it does require that the robot be able to confirm the presence of an object and characterize that object as a log. To act as a semi-autonomous agent and pick up the log, the robot must also be able to navigate to the log and then estimate the size and orientation of the log. An analogy for the motivation behind this thesis is the cooperation between man and animals (autonomous agents). Historically, humans worked with horses to accomplish heavy tasks however a horse is not driven like a truck or car; it is guided. One characteristic of this cooperation 1 between man and horse is, to some extent, a common perception of the world. For example, in plowing a field both the horse and the human can perceive the ground and agree on some common attributes of this visual surface. They can both locate themselves on this surface and can perceive obstacles. This agreement constitutes a primitive visual lexicon within which cooperative tasks are possible. At present, the level of performance achieved by biological vision systems eludes passive computer vision. However, in applications like remote supervisory control (often called teleoperation), a less sophisticated vision system can help the machine and operator to agree on objects and their location within the workspace. Figure 1.1: Excavator in the robotics laboratory. Manipulating objects by heavy machinery requires considerable operator skill (Stenz and Singh 1998). While human operators see their tasks as repetitious, in most cases, the work environment is unstructured and the operations are not repeated exactly. Grasping and lifting 2 operations occur within a ten meter radius from the operator and as an initial step the vision system needs to determine which data is within the workspace, distance of objects from an observer. Depth provides a low level cue by which distant objects, while plainly visible, can be eliminated from the data processing. This thesis builds two primitive visual elements to allow cooperation between a robot and a human operator. This is done through the paradigm of shape; global shape and local shape. Global shape defines a workspace such as a perception of the ground that is shared by the human and the robot. Local shape allows a human and robot to define objects within the workspace. There are several ways to perceive shape. This thesis studies how stereo vision can be used to perceive shape for supervisory control of semi-autonomous machinery. 1.2 Stereo Vision In his book on vision, Marr (1981) suggests that an intermediate goal of vision systems is to identify surfaces and from this surface information one can proceed to identify objects. In binocular stereo, surfaces are usually identified in a two-step process. The first step is to find corresponding points in the stereo images and produce a disparity map. The second step recovers surfaces from the disparity map. Numerous algorithms for binocular stereo vision have been proposed to find corresponding points in stereo images. Some systems (Grimson 1982, Clark 1985) solve the correspondence problem by feature matching. These techniques seek to extract unambiguous features such as intensity edges. Features that are unique in both views can be matched unambiguously, however these features are localized in the images and result in sparse disparity maps. Blake and Zisserman (1987) argue that sparse disparity maps do not produce unique surfaces. Fua (1991) achieves a much denser disparity map using a correlation technique which matches blocks of image intensity data. This method performs the correlation twice, once from the left to right image and once from 3 the right to left image. False matches occur randomly whereas true matches are confirmed by both correlations. Other block matching techniques use phase irrformation rattier than intensity data. Sanger (1988) and Fleet (1990) use Gabor filters to determine phase shifts between regions in the left and right images. Disparity is estimated from the phase shift and the pass band of the Gabor filter. Bandari and Little (1991) use a cepstrum technique to estimate block displacements in images. While these techniques can produce dense depth maps, they require more computation and do not produce significantly better results than intensity correlation. Techniques to estimate disparity produce incomplete depth maps and some form of interpolation is used to recover continuous surfaces. Grimson (1982) proposed an elaboration of the relaxation algorithm to fit membrane surfaces to area boundaries defined by edge matching. In this technique, a mesh of points is introduced in the region. The points are asked to conform to the finite difference equation that describes a membrane surface. Each iteration of the process adjusts the mesh positions according to the membrane equation until the boundary conditions are satisfied. Terzopulos (1988) extended Grimson's method by using multi-grid techniques to speed up die relaxation. As an alternative to multi-grid techniques, Szeliski (1991) used conjugate gradient descent in conjunction with a hierarchical set of basis functions to provide fast surface interpolation. This work was extended by Pentland (1992) who achieved fast, approximate solutions by transforming the problem to a wavelet basis. It achieves its speed because transformation to the appropriate wavelet basis allows approximate calculations at different levels of resolution that run virtually independent of each other. Marr's theory of vision seeks to build a world model of smooth surfaces that are bounded by depth discontinuities (Marr 1981). Terzopulos, Szeliski, and Pentland all use the same basic technique to handle surface discontinuities. This method monitors stress in the membrane or thin plate that is fit to the data. When the stress exceeds a threshold, the process is interrupted and a 4 local discontinuity is introduced to relieve the stress and the smoothing regions are segmented to accommodate the discontinuities in further iterations. These methods have been demonstrated with both sparse and dense depth data. Two other techniques for surface recovery have been demonstrated to work with dense depth maps. Fua (1991) uses 'adaptive smoothing', an iterative process that varies the weighting of the smoothness constraint according to a piecewise linear function of the image intensity gradient (edges) and the depth gradient (strain). Fua does not report on convergence speeds or on the stability and accuracy of this technique. Blake and Zisserman (1987) developed the GNC algorithm that changes the form of the smoothness constraint as a function of the depth gradient. The algorithm allows surfaces to break by limiting the smoothing constraint so that large values of depth gradient can be accepted. Blake and Zisserman also modify the smoothing constraint according to different scales or level of detail in the image. The GNC algorithm and Fua's 'adaptive smoothing' both work with dense depth maps such as those produced from stereo correspondence by area correlation. Several hardware configurations have been used to build stereo vision systems. One configuration uses three cameras arranged in a triangle: 'trinocular stereo' (Point Grey Research). Binocular vision has only one baseline between the two cameras. Trinocular vision has three baselines. Any depth map that is recovered by a camera pair on one baseline must be confirmed by the other baselines. Steward and Dyer (1988) use a least squares technique to determine the depth map from the three baselines. Trinocular stereo has fewer errors at the cost of a more complicated camera system and more processing. Matthies and Okutomi (1989) propose a variable baseline camera to solve for depth maps. When the cameras are beside each other, the two views are almost identical and it is easy to find correspondence between the images. As the cameras gradually separate, the depth map is updated. This system is usually implemented with an series of cameras in fixed positions along a baseline and is subject to alignment and calibration problems. 5 Little work has been done using a zoom lens. Wiley and Wong (1990) show that a zoom lens introduces stable distortions that can be detennined during camera calibration. They demonstrate that a zoom lens can be used in computer vision applications. 1.3 Object Identification and Pose Estimation Several techniques for object identification and pose estimation have been reported that work with monocular views and exploit the contour features of isolated objects. Karhunen-Loeve expansions (Sirovich and Kirby, 1987) and Fourier descriptors (Arbter 1989, 1991) have been used for object identification. Tensor based moment functions have been used by Cyganski and Orr (1985) to find the affine transformation relating projections of planar contour features. Faber and Stokely (1988) extended this method to study 3D medical images. Lo and Don (1989) developed 3D moment forms that can be used for object identification and estimating orientation. The disadvantage of these techniques is that they require a view of the entire object boundary to determine either the contour Fourier coefficients or the area moments. This is difficult to achieve in environments where occlusion is common. Horn (1987, 1988) and Arun (1987) have proposed a least squares matrix method to recover the rotation and translation matrices relating two views of the same object. This technique does not require contours or an image of the whole object but it needs point correspondence to match the regions of unknown orientation with the same regions in the reference view. Wildes (1991), Jones and Malik (1992), and Devernay and Faugeras (1994) use the stereo disparity map to recover surface orientation. This technique is described in the next chapter. 1.4 Contributions Made in this Thesis This thesis addresses the development of a primitive perception of the world in which a robot and a human operator can agree on some basic elements of a scene that serve to define simple 6 tasks. Stereo vision is applied to this problem and several issues are examined as part of this work. Some implementation issues such as camera calibration are well documented elsewhere (Devernay and Faugeras, 1996) and are not dealt with in this thesis. Instead, emphasis is placed on other questions like: "Why is there no discernable difference between SAD and SSD matching metrics?" and "Why do false matches occur during stereo correlation?". Ground plane stereo (Burt et al. 1995) is used for navigation in this thesis and by other researchers (Williamson 1998), however in this thesis ground plane stereo is treated as a special case of a larger concept; a global shape model. With this approach it is possible to solve navigation problems not only in log sort yards but also in other specialized environments like hallways, city streets, and tunnels. 1.4.1 SAD vs. SSD In Section 2.5 a justification is given for why the least absolute value (SAD) and least squares (SSD) matching metrics give virtually the same results when matching blocks of image data. In Section 2.6 a justification for linear interpolation methods that are commonly used on stereo depth maps is given. An equation is proposed based on linear interpolation to generate sub-pixel disparity maps from the sum of absolute differences (SAD) matching metric. 1.4.2 Sampling Noise and Image Matching Many signal processing applications significantly oversample the raw input signal, often by a factor often or more. It is pointed out in this thesis that practical images are not oversampled in that the point spread function of the lens is closely matched to the pixel size. A typical value for Gaussian blur in camera optics is given as cr= 0.6 pixels (Nalwa and Binford, 1986). The problems that can arise from this are discussed in Chapter 3 where a texture filter is proposed to resolve the conflicting requirements of high frequency information for matching and band limiting to reduce spatial sampling noise. It is shown that spatial sampling noise is systemic to sampled data 7 systems and this makes high frequency information unique to a single image and not useful for matching or object recognition, even in stereo pairs where there are few scale and orientation differences. Two strategies are suggested; upsampling the data to achieve higher resolution stereo and lowpass filtering to accept only image data that has a positive signal to noise ratio. A two dimensional recursive filter is developed for this second strategy. 1.4.3 Mat chins over Occlusions Chapter 3 also provides a directional filter for matching stereo images with step discontinuities. A match surface is constructed and the behaviour of stereo information on this surface is examined. This results in a simple data staicture and two simple matching rules that conform to the visual information present in stereo images. 1.4.4 Disparity Gradient Space Chapter 4 introduces disparity gradient space as a simple way to work out global shape models and the perception of planar surface elements in stereo vision. It is shown that planes map to a point in this space and that cylinders map to a curve in this space. Other environments like corridors, hallways, and tunnels have predictable signatures in this space and can be used for navigation in same way as a global plane model is presently used. 1.4.5 Surface Energy Chapter 4 also introduces a technique using the energy in the surface nonnals on a local surface patch that achieves the viewpoint invariant membrane surface model proposed by Blake and Zisserman, 1987. Section 4.9 indicates how simple shapes can be recognized by the eccentricity of the eigenvalues of an energy matrix derived from the surface normals and how the 8 eigenvectors identify surface orientation. This is different from standard Gaussian curvature techniques and does not use a second derivative. 1.4.6 Normalized Global Shape Models In Chapter 5, navigation and obstacle detection is seen in terms of a global shape model. Examples in Chapter 5 indicate that there is less foreground noise in the detection process when the disparity map is normalized with respect to the floor rather that when the floor is subtracted from the disparity map. A justification for this result is given and empirical evidence for work with log manipulators is shown in Section 5.1. A example is given in which the concept of a global model is extended to navigation in corridors where obstacles and hallways opening onto the corridor are made evident. 1.4.7 Local Shape Estimation Chapter 5 develops the technique to characterize local shape without regular tesselation using the energy matrix formed from the surface normals. As mentioned in the proceeding paragraphs, this is different from standard Gaussian curvature techniques. It does not use a second derivative and the information normally available in the Hessian matrix is lost. Local patches are assumed to be planar and as such, the smallest eigenvalues of the their surface energy provide an estimate of the orientation noise in the surface normal measurements. This is used to improve the eigenvalue eccentricity estimates used to characterize the cylinder shapes. The eigenvector associated with the smallest corrected eigenvalue is taken as the orientation of the object. Since the energy matrix is different from the Hessian matrix, a technique to estimate radius is also given. This and the orientation vector can be used to work out how to pick up a tree or a drum. 9 The emphasis in this thesis is on some basic issues that arise in stereo vision. Using stereo vision, the world is perceived in certain ways and this is examined as a medium within which cooperative tasks can occur. Contributions made in this thesis reflect this principal emphasis but more work remains to address detailed implementation issues. 10 Chapter 2 Binocular Vision This chapter examines and builds upon some basic principles underlying stereo vision and the relationship between three-dimensional object geometry and the visual surfaces that are extracted from stereo views. The goal is to examine basic mechanisms needed to implement a stereo vision system that can identify surfaces. These surfaces are used to classify objects as planes and cylinders. The emphasis is on low-level techniques that produce dense disparity maps. Three basic questions are asked: 1. What information is needed to determine surface orientation. 2. What matching metric is appropriate to recover this information. 3. What form of sub-pixel interpolation can be used. Sections 2.1 and 2.2 introduce basic assumptions used in binocular vision. Section 2.3 describes the epipolar constraint. Section 2.4 discusses the mechanism for perceiving surface orientation with stereo vision. Sections 2.5 explores why there is little or no difference between implementations of least absolute value and least squares matching techniques. Section 2.6 examines a technique for sub-pixel disparity interpolation for surface reconstruction based on work from the field of high definition television (HDTV). Working assumptions about visual surfaces are introduced from which limits of stereo vision as a mechanism to recover surface orientation are derived. The description of stereo vision developed in this chapter begins with homogeneous coordinates as a way to describe projective camera geometry. The problems of determining position and surface orientation from stereo images are discussed using an ideal stereo camera. This simplifies the formulas to common expressions found in most papers on the subject. Some 11 aspects of camera alignment, and how it affects the perception of position, surface orientation and surface curvature, are reserved for the appendix. 2.1 Camera Images A camera is a passive sensor that measures the reflectance of a scene. The brightness pattern on the image plane is formed by rays of light that, in a non-distorting medium, travel in straight lines from the surfaces of objects in the scene to the camera lens. An ideal pinhole in front of an image plane can model the projective geometry of a typical camera. Such an arrangement is shown in Figure 2.1 Image P l a n e — • ) T P i n h o l e M a s k < f •! X .A CD Figure 2.1: Projective geometry of a pinhole camera. The image can be considered to be a mapping of points in the world onto a two dimensional surface. As Figure 2.1 shows, the projection onto the imaging surface does not correspond to unique world coordinates. All world positions along a particular line of sight are mapped onto the same point on the imaging surface. In other words, each pixel on an image can represent a line of sight stretching from the lens surface to infinity. Light reflected by an object and 12 imaged by the camera to a point on the image plane can be from any position on this line. It can be seen by simple projective geometry that the (x,y) position on the imaging surface is given by: * = /• X y = f Y (2.1) Z - - Z where/is the focal length of the camera, (x,y) is a position in the image plane, and (X,Y,Z) are world coordinates. A more convenient approach uses homogeneous coordinates (Roberts 1965) to describe this imaging process. X Y Z 1 The matrix is called the perspective projection transform. In homogeneous coordinates, the image Z vector and the world coordinate vector are augmented by a scale term, a ~ ~ ^ ' where/is the focal length of the camera lens. The image scale term describes the apparent size of the image in relation to the camera focal length and the distance of an object from the camera lens. In this example, the position of the pinhole lens is taken as the origin for measuring world coordinates, and the optical axis is defined by a line from the center of the imaging surface (x - 0 , y = 0 ) through the pinhole. ax 1 0 0 0 ay 0 1 0 0 a 0 0 1// 0 2.2 Stereo Images Suppose a viewer stands on a highway so that his right eye looks exactly along the centerline of the road. The left eye would be somewhere over the left lane of the highway and objects in the left view, like the centerline of the road, would appear shifted to the right, depending on their distance from the observer. This apparent shift in the positions of objects is called the disparity between the stereo images. Objects at infinity have zero disparity while objects near the 13 observer have large disparities. Objects very close to the observer can be seen only in one view; their disparity is outside the stereo field of view. aRXR 1 0 0 0 aRyR = 0 1 0 0 aR 0 0 VA 0 In a stereo camera system, there are two views of the same scene. In most cases, the cameras emulate human vision, producing a left and right view. If we designate one camera as a right view and the other as a left view and specify that the world coordinates are measured with respect to the pinhole of the right camera then the two images can be described as follows. X Y Z 1 ~X Y Z 1 The right camera is assigned the reference view of the world and the left camera is displaced with respect to the right camera and it may be rotated with respect to the orientation of the right camera (Appendix 1). The matrix, [T], transforms the world coordinates as defined by the right camera to the apparent positions seen by the left camera. If we can identify the same world <*LxL " l 0 0 0" aL)!L = 0 1 0 0 [T] 0 0 1/A 0 14 point in both images and if [T] includes a displacement term then it is possible to recover the world coordinates. Left Image Plan e , L Right Image Plane 1 • T~X"1 Figure 2.3: Projective geometry of stereo pinhole cameras. t X+b In its ideal form, both cameras have the same focal length and [T] is a one dimensional translation, as shown in Figure 2.3. The cameras are separated by a distance b; the stereo baseline. Objects seen in both cameras have the same vertical position but they appear to be shifted horizontally. The stereo image can be combined in a single expression: ClRXR 1 0 0 0 aRyR 0 1 0 0 ~X aR 0 0 0 Y 1 0 0 b Z a L y L 0 1 0 0 1 aL _ 0 0 0 Translation terms put entries in the right hand column of the matrix; in this case a single translation b, along the X axis. This translation tenn makes the system invertable. When the two camera focal lengths are equal, fi =f^ and CIT, = <7^ , the solution for the distance from the observer 15 is Z = = — where D is the image disparity; the apparent shift in an object's position in xL-xR D the two views. A normalized system for binocular stereo measures the location of objects as multiples of the stereo baseline. This baseline measurement defines the stereo field of view and it defines what objects are considered nearby; local to the observer. The stereo equations for tins ideal camera configuration are: X = —, Y = A Z = ^ (2.2) D D D where X, Y, and Z are measured as multiples of the stereo baseline. Other forms of stereo cameras include trinocular stereo (Yachida et al. 1986, Point Grey Research) and motion stereo. These can be analyzed in the same way as binocular stereo. In trinocular stereo, there are three cameras; each located at the vertex of a triangle. This produces three views with three different baselines. The trinocular system is over-determined and the world positions can be found by least squares techniques (Hansen et al. 1988). Motion stereo is possible with a single, moving camera where the second, translated image is displaced in time, rather than space, from the first image. 2.3 The Epipolar Constraint In this thesis the right image of the stereo pair is used as the reference view. Each pixel in the right image represents a line of sight that terminates at an object. Since the camera looks along this line of sight, it sees only a point. The pixel is, in effect, the cross section of this line of sight. Depth can be recovered with binocular vision because the other, translated view, can 'see' some of the line of sight of the reference camera. This provides a side view of the line of sight which can be traced as a line on the left image; the epipolar line. When an object is visible in both views, a pixel in the right view that is the termination of the line of sight at the surface of some object, has a corresponding pixel somewhere on the epipolar line in the left view. While there is no depth 16 information in the reference view, the second view allows us to 'see' where the object resides along the reference line of sight. The longer the base line, the better view we get of the reference camera's line of sight and the more accurately an object can be located along this line. If straight lines are not distorted in either view then for a given pixel in the right view, finding a corresponding pixel in the left view involves a one dimensional search along the epipolar line. This is called the epipolar constraint. In this thesis, the epipolar lines are horizontal lines in the stereo image pairs. The epipolar constraint can also apply to structured light systems, which behave much like stereo vision run backwards. In these systems, a line of sight in the right view is replaced by the path of a laser beam which sends light out toward some object. In the left view, the illuminated spot on the surface can be located as a displacement along the epipolar line. Although it takes time to completely cover a scene, this arrangement simplifies the stereo correspondence problem. Structured light systems are also subject to alignment errors that are similar to those encountered in binocular vision. In some scenes, the laser can illuminate a spot that is hidden in the camera view by some obstacle or edge. In this case, the arrangement of objects in the scene occludes some regions from the observer. Such occlusions are a problem in stereo vision. 2.4 The Perception of Planar Surfaces in Stereo Images A useful vision system is more than a mechanism for finding the distance to objects, such as described in the previous sections. A stereo vision system has the opportunity to work with a disparity image of the world. If a disparity value can be assigned to each pixel location defined by one of the stereo images then this becomes a disparity image of the scene. Finding disparity is a primitive step in stereo processing and it presents some advantages that are exploited in this thesis. Some characteristics of the disparity image are introduced in this section. These basic characteristics are known in computer vision and they are presented here as part of a framework 17 upon which a vision system can be built. This work is expanded in later chapters of this thesis where this section is referenced. 2.4.1 The Perception of Surface Orientation The perception of shape is determined by the ability to detect different surface orientations. In stereo images, surface orientation is measured in terms of perceived depth changes which, in turn, are related to disparity changes. A vision system that recovers a dense disparity map assigns disparity values to most pixel locations in the reference image. Describing the disparity in some neighborhood of (x0, y0) by a first order Taylor expansion is equivalent to fitting a planar surface to the region. D = D0+Dx(x-x0) + Dy(y-y0) = Dxx + Dyy + (D0-Dxx0-Dyy0) where D0 is the disparity at the image position {xg, y0) and Dx and Dy are the disparity gradients in the x andy directions about the image position (XQ, y0). Dividing through by D and substituting the world coordinates from Equation 2.2; D X + D Y + —(D^-D x.-D yf?\Z = l (2.3) x y / v 0 x 0 y 0 J This is the equation of a plane in world coordinates X, Y, Z with surface normal n=Dxi+Dyj + j{D0-Dxx0-Dyy0)k (2.4) where i , j, and k are unit vectors in the X, Y, and Z world direction coordinates. It can be seen that the first two terms of the normal vector are the disparity gradients in the image coordinates x andy. The disparity gradients are projections of the surface normal onto the image plane. Higher order surfaces can have higher order Taylor expansions of stereo disparity (Devemay and Faugeras, 1994). These involve higher order derivatives of disparity. However, an estimate of the tangent plane that is fit to each point on a smooth surface follows the first order 18 expansion. The perceived orientation of points on a smooth visible surface is constrained by the ability to recognize changes in disparity as described by Equation 2.4. Estimation of surface orientation from stereo disparity gradients has been proposed by several researchers (Grimson 1981; Wildes 1991; Jones and Malik 1992; Devernay and Faugeras 1994). 2.4.2 The Disparity Gradient of Horizontal Ramp Surfaces An important surface for mobile robots is the floor or ground on which it moves. If a robot stands on a flat horizontal plane or if it rests on an inclined ramp then the ramp or plane will intercept the Z = 0 plane as defined by the pinhole mask of the camera system (Figure 2.4). Assuming this surface does not change in the X direction then it intercepts the Z = 0 plane at some constant value, Y 0 . This is substituted into Equation 2.3 with the values Z = 0 , Dx = 0. The result for any horizontal ramp isDyY0 = 1. In other words, all horizontal ramps that intersect the Z = 0 plane at Y Q have the same disparity gradient: In Figure 2 .4, all of the ramp surfaces have the same vertical disparity gradient, Dy ; a value that is determined by where the ramps intercept the pinhole mask plane. However, the disparity value is unique along a particular line of sight since each ramp is intercepted at a different distance from the pinhole. Along the line of sight, the image coordinates are constant and the uniqueness of each surface normal (Equation 2.4) is determined by the different disparity values that correspond to the Z value of each point. Each ramp in Figure 2.4 has the same disparity gradient but each has a different disparity value, D0. When substituted into Equation 2 .4, each ramp will have a unique surface normal. 19 Figure 2.4: Side view of 5 planar ramps that intercept the Z = 0 plane at -YQ A robot on an infinite ramp surface would see disparity as a linear function of vertical position in the image. This has been used by (Burt et al. 1995) to differentiate between the floor and navigation obstacles. This section has shown that the disparity gradient is easily determined in terms of where the projected plane intercepts the Z = 0 plane. This is extended in Chapter 4 to provide an simple technique that relates shape and orientation with the intersection of surfaces and the Z = 0 plane. 2.5 Producing Dense Disparity Maps The remainder of the chapter deals with aspects of producing dense disparity images. Most dense disparity maps are produced by matching blocks of pixels in the stereo image pair (Fua 1991). Two matching metrics are usually used: sum of squared differences (SSD) and sum of absolute differences (SAD). Kanade (1995) reports that there is no discernible difference in performance between these matching metrics. This section asks if there should be a difference and, if so, why is there no difference? 20 2 5.1 The SSD and SAD Matching Metrics A common method to find the relative displacements in regions between two images is to compare blocks of image data . This technique is used in high definition television (HDTV) and it is used in some stereo vision algorithms to recover dense disparity maps. In both these applications, blocks of 'gray scale' data are compared rather than only matching certain features such as edges (Grimson 1981). Two popular metrics for matching blocks of image data are the sum of squared differences (SSD) and the sum of absolute differences (SAD). Matthies (1992) used an SSD of the form: J(d) = fj(lr(xi-d)-Il(xi))2 <=i other researchers, such as Fua (1993), use a normalized score in which the average intensity data is removed from the window. fl((rr(x,-d)-ir)-(fl(xl)-i,)y j{d) = j 1 ^til^-d)-!^^^)-!^ This structure is similar to the linear correlation expression used in Numerical Recipes (Press et al. 1992). In general, the SSD is a least squares metric of the form: R where ez- are the fit errors in matching the two regions. The minimum is found by searching over all possible ej in the region R. At the minimum the variation in J is zero. & / = 0 = 0 R The SSD chooses a point where the mean matching error is zero and assigns this disparity value to the center of the window that defines the matching region, R. 21 On the other hand, the sum of absolute differences can be simpler to calculate but it is difficult to analyze. The simplest form of the SAD is: N A normalized SAD can be written as: Z | ( / r ( W W > f r ( * , ) - / , ) | J(d) N N 1-1 V i - l The SAD chooses a point where the median matching error is zero as seen by the variation of the SAD matching metric. The SSD finds a solution where the mean error is zero while the SAD solution is where the median error is zero. Both the SAD and the SSD assign a constant disparity value to the center of the matching window. If the intensity information is not distributed symmetrically across the window, then the SSD will tend to give more weight to the region of the window with the most texture. When the disparity is constant across the window, this pulling effect is not a problem. The SAD, with it's median matching characteristic, is more resistant to outliers and wild points. Even so, Kanade reports that there is virtually no difference in performance between nearest pixel matching with SSD and nearest pixel matching with SAD (Kanade 1995). This behavior is examined in the next section. • / = Zhl = S e ' s i S n ^ ' ) R R <5J=0^2]sign(e j) = 0 R 22 2.5.2 A Reason Wliy SSD and SAD Give the Same Results Most matching mechanisms that compare blocks of image intensity values attempt to find a constant disparity value between regions in a pair of stereo images even though the disparity may not be constant over the matching window. When the window size is small it is more probable that the actual depth values are nearly constant across the window indicating that the surface patch within the periphery of the window has little or no slope. The probability of perceiving a sloped surface with stereo vision can be estimated as follows: Figure 2.5: Orientation of Surface Patches For a flat surface which is tilted by an angle <f>xabout the Taxis, the disparity change over the image of the surface in the x direction varies as: dD__dD_dZ_dX___\m^x dX~~dZlx~dX~ Z~~ ' ' where —— = tan cf>x, the slope of the surface in the X direction. Cist Over some circular region in the image, the maximum disparity gradient is: dD dr tan^ Z where </> is the maximum angle between the surface normal and the line of sight. Points on a surface patch that face toward the observer are more visible than points on the surface that are not oriented toward the observer. If all surface orientations are equally probable, (a uniform distribution), the probability that a visible point on a surface is centered on a small patch 23 with orientation angle <f> depends on cos$, the dot product between the surface normal and the line of sight of the observer. The probability that the orientation is less than ^ (0 < </>Q < n/2) is: Jcos^ dtj> - sin <p0 Jcos^ d<f> If the matching regions have a width W, then all the pixels in the window could be assigned a constant disparity if the disparity change across the window is small. For example, when W < —, the maximum disparity change across the window would be less than 1/2 pixel. How Ar 2 likely this is depends the distance to the surface and the maximum slope of the surface, (fo. Using Z Equation 2.5 and a maximum disparity change of 1/2 pixel: 0O = tan - 1 and the probability that the disparity change across the window is less than 1/2 pixel is: f W< n = sin if)0 - sin f . z } tan" V Ar 2, 2W) 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 Probability of Disparity Change < 1/2 Pixel \ Z=30 Z=10 ^ 0 5 10 15 20 Window Size in Pixels Figure 2.6: Cumulative probability of disparity change in an image region. 24 The graph indicates that when block matching stereo images, most visible regions will be perceived as surface patches that are oriented toward the observer. Only nearby surfaces (Z < 20) are expected to have significant disparity gradients across the matching window. In most cases, a matching metric that assigns a constant disparity value over the data window conforms to the expected behaviour of the data. Within a small matching window, the expected disparity is virtually constant and non-stationary image intensities within the window do not introduce significant differences between SAD and SSD matching metrics. Even though regions with more texture have more weight in the matching window, when there is little or no disparity change across the window the solution will not be skewed. Most researchers assign a constant disparity to a matching window, Devernay et al. (1994) point out that higher order correlation surfaces are necessary to avoid matching bias, particularly when matching larger nearby regions. They also point out that this increased accuracy comes at a much higher computational cost. One result of this section is that a constant disparity matching model used by most researchers is acceptable. Another reason that SSD and SAD give the same results is that local image data is highly correlated and changes little over small windows. On average, image intensity is highly correlated across small image blocks and image texture does not change radically across the block. This indicates that there are few outliers in the data and even though the matching metrics use different statistics, the results are similar. This correlation between neighbouring samples is examined in the next section. In this section, the expected perception of surfaces using stereo was derived to explain why SSD and SAD matching metrics give the same solutions. It was later found that a similar expression for the expected perception of surfaces has been proposed as a stereo matching constraint (Arnold and Binford 1980; Ma et al. 1995) but this was not applied to study the similar performance in matching metrics. 25 2.6 Sub-Pixel Disparity To extract surface curvature from stereo images it is often necessary to interpolate disparity data. Most researchers use linear interpolation models for disparity (Matthies 1992). The final sections of this chapter asks why linear interpolation works as opposed to other interpolation techniques? An implementation of sub-pixel interpolation for the SSD and SAD matching metrics is described. 2.6.1 Image Models There is considerable agreement on the form of general image models in the literature (Juang and Mumford, 1998). Various studies show that images conform to a Markov covariance model of the form e °M. Pratt (1978) quotes a horizontal correlation coefficient p = 0.953 and a slightly higher vertical correlation coefficient p = 0.965. On the diagonals, p = 0.938; higher than the product of the horizontal and vertical coefficients. In the Markov model, a= -\n(p). Clarke (1985), after testing many images, developed a nonseparable model: Clarke's model has been confirmed by Akansu and Haddad (1992) as being the most accurate although for most calculations Akansu and Haddad use a simple, isotropic value ofp = 0.95. These models indicate that, on average, images are dominated by low frequency information and that, statistically, there is little information extending out to the bandwidth allowed by the pixel sampling frequency. This body of work shows that, on average, neighbouring pixel intensity values are highly correlated and there is little expected difference between adjacent pixel values. In some small neighbourhood of the image field, the expected intensity value is nearly constant. A continuous 26 local model would be of the form e 0 0 5 ' J r ' « (l - 0.05|x|), which projects a linear intensity function across the pixel surface. The image models developed by Clarke and others indicate that when images are matched to the nearest pixel, we should expect the residual difference in intensities to change linearly on a sub-pixel basis. In high definition television many of the proposed transmission schemes involve block matching to sub-pixel resolution. HDTV studies such as Girod (1993) and Bellifemine etal. (1992), report that bilinear interpolation is superior to a sine interpolation filter. The bilinear interpolation filter is more closely matched to the average image spectrum predicted by the Markov image models. As a result, Girod finds that bilinear interpolation rejects noise better than sine interpolation however, Girod also finds that a Wiener interpolation filter matched to a individual images is superior to both bilinear interpolation and sine interpolation. The first order autoregressive image model and the experience of HDTV researchers both suggest that a sub-pixel matching can be treated as linear interpolation. Match scores from SSD and SAD can be expanded in terms of a linear intensity field across the pixel surface. The goal is to interpolate the lowest of the match scores in order to estimate where the minimum would be to a sub-pixel resolution. 2.6.2 Sub-pixel mat chins with SSD The SSD matching metric evaluates the sum of the squared differences of pixel intensities in two blocks of data. Near the best match, the difference in intensities is expected to be linear with displacement along the epipolar line and the residual SSD value is expected to be quadratic. This can be interpolated with a parabolic interpolation algorithm such as can be found in Numerical Recipes (Press et. al.). The sub-pixel displacement is estimated using Newton-Raphson. Letting 27 J(n,k) be the SSD score for pixel n and disparity k where both n and k are integers. As k is varied on a sub-pixel basis, the model predicts that: J(n,k + A) = J{n,k) + ~ A + ^ ^2 2 = J(n,k) + e at the minimum: = 0 => A = -dJ_ dk d2J dk2 This is evaluated using the minimum value of J achieved by matching on integer pixel boundaries, J(b) and the two closest neighbours, J(a) and J(c). Note that the numerator is the symmetrical form of the first derivative. The second derivative in the denominator is also symmetric. Parabolic interpolation of the SSD matching metric, as used by Matthies (1992) and other researchers, conforms accepted image models. According to this model, near the best match the residual is linear with displacement. The SSD matching metric squares this to produce a quadratic function that can be interpolated as shown. 2.6.3 A Technique for Sub-pixel Matching with SAD The SAD matching metric evaluates the sum of the absolute differences of pixel intensities in two blocks of data. Since the image model predicts that near the best match the difference in intensities is expected to be linear with displacement along the epipolar line, the residual SAD value is expected to vary as the absolute value of displacement from the minimum as shown in Figure 2.7. 1/2 (7(c) -J(g)) J(a) + J(c)-2J(b) 28 a b c Figure 2.7: Sub-pixel interpolation of SAD matching. In Figure 2.7, the three best match scores occur for disparity values a, b, and c. Near the best match, the difference in intensities is expected to be linear with displacement such that the absolute slope of the SAD matching metric is equal about the minimum. The line between J(c) and the minimum goes through a point with the same value as J(b). The horizontal distance between J(c) and this point is 1-2A such that, with equal absolute slopes: J (a)-Jib) = J(c)-J(b) 1 ~ 1-2A Notice that J(b) is the smallest of the three match scores and that the left side of the equation is the difference between the largest value and the smallest value. The largest value is either J(a) ovJ(c), depending on whether the sub pixel displacement is positive or negative about b. The general formula for sub-pixel displacement becomes: 1 J(a)-J(c) (2.6) 2MAX(J(c),J(a))-J(b) where M A X is a function that chooses the larger of the two values J(a) or J(c). The figure below shows the disparity surface of a cylinder recovered using SAD to match raw stereo images. Figure 2.8a is the integer disparity surface given by block matching and Figure 2.8b is the inteqjolated surface using Equation 2.6. This result is in agreement with the proposition 29 that surface detail can be recovered by linear interpolation and that Equation 2.6 will interpolate disparity data from SAD matching. (a) nearest pixel disparity surface using SAD (b) sub-pixel interpolated surface using Eqn 2.6 Figure 2.8: Using SAD to match raw stereo images and SAD interpolation 2.7 Summary This chapter has reviewed some of the aspects of low level stereo vision. The emphasis has been on block matching techniques that can produce dense disparity maps. Section 2.4 shows that recovering surface infonnation involves a derivative of the disparity map. This is an noisy process and higher order surfaces imply higher order derivatives. The next chapter deals with these issues and with issues of false matches. Section 2.5 looks at how the disparity infonnation is extracted prior to surface recovery and how block matching can be affected by a combination of changing disparities and changing intensities. Specifically, an explanation of why least absolute differences and least squared differences do not produce significant difference in performance is offered. Most matching algorithms impose a constant disparity on the match window and the results in Section 2.5 confirm this as an acceptable technique. Finally, in Section 2.6, sub-pixel matching is discussed. It is shown that interpolation methods are a function of the matching metric and local image data that, in the absence of other infonnation, confonns to general data models derived from a large ensemble of images. 30 Chapter 3 Producing Dense Disparity Maps This chapter describes the algorithms and procedures used to extract stereo information from an image pair. The goal is to produce dense disparity maps that can be used to characterize visual surfaces as described in Chapter 4. Blake and Zisserman (1987) point out that dense range maps are needed to produce accurate surface data. For this application, 'gray scale' matching is preferred (Fua 1991, Matthies 1992, Devemay 1994) since it yields dense disparity maps and better estimates of the shape of the visible surface. This chapter builds on the methods described in these papers. The central questions in this chapter are: 1. How are match errors generated and what is the useful information for producing a dense disparity map from a pair of stereo images? 2. What are the mechanics of matching stereo images and how can occlusions be handled? Sections 3.1 and 3.2 examine where useful information for gray scale matching is expected in generalized images. An estimate is made of the number of unambiguous matches that can be expected across an epipolar line. This leads to a texture filter in Section 3.4. Later sections in this chapter use the energy surface that results when matching two horizontal lines from a pair of stereo images to determine how to match over occluded regions in stereo images. 3.1 Structural Limitations in Image Matching A preliminary step in finding disparity surfaces is to identify the same features of a scene in each of the stereo images. This process is subject to error. The conditions necessary to reduce false matches restrict the minimum size of matchable regions in an ideal pair of stereo images. This 31 section studies two mechanisms that cause errors when matching the intensity values in blocks of pixels. These mechanisms are: 1. Analog to Digital Converter (ADC) Quantization forces the pixel values to be integers between 0 and 2N-1. The entire scene must be translated into a limited number of ADC values. In practical images the number of pixels is much greater than the number of ADC states and so, unique matches can only be achieved by comparing groups of pixels. This limits the resolution of the matching process to be less than the resolution of the image. 2. Stereo vision relies on the fact that scenes are sampled differently in the two images however corresponding features of an external scene are not imaged on integer pixel boundaries in each of the stereo images. As a result, the closest matching pixel locations in stereo images usually contain different information. This is illustrated in the sampled data system shown in Figure 3.1. x(0—• Low-pass Filler H A/D Dl(n) *f D/A Low-pass Filter ->Y(t) Data Clock 1 A/D D2(n) D/A -W T Low-pass Filter ->Y(t) Data Clock 2 Figure 3.1: Parallel systems with asynchronous sampling. The two sampled data systems shown in Figure 3.1 obey the Nyquist criteria. They sample the same data and achieve identical results at their respective outputs. In each system the signal path starts with a common low pass antialiasing filter followed by an A/D section that samples the input data and converts the signal to digital fonn. The data is converted back to analog form using 32 a D/A and identical reconstruction filters. However, the two systems sample the data asynchronously and the digital data stream is not the same in each system. The Nyquist criteria states that the information can be recovered exactly from the digital data but it makes no claim that the digital data is exactly the same in each of the parallel systems shown in Figure 3.1. The Nyquist criteria holds if the sampling frequencies are slightly different or, as in the case of ideal stereo cameras, exactly the same but with an with arbitrary phase difference. In each case, the digital data will be different in the parallel paths. Modern image analysis is based on digital processing and it is the digital data that is compared. 3.1.1A Model for Image Matching Studies of real images show that neighbouring pixel values are, on average, highly correlated. Such images are usually modeled as first order Markov processes. Akansu and Haddah (1992) use p = 0.95 as an average correlation coefficient for typical images. Pratt (1978) quotes p= 0.953 as an average horizontal correlation coefficient and 0.965 as the vertical correlation coefficient for typical images. Clarke (1985) gives similar results. As discussed in Section 2.5, the autocorrelation function of such a process is: R(x) = rt(0)jcH = R(0)e^ln(p) * R(0)e-005 This is an average statistic derived over an ensemble of images. In Clarke's work several thousand images were studied including photographs, X-ray images, and Fax documents. The statistical model places no conditions on the camera and electronics used to capture the images. According to Akansu and Haddah (1992), a one dimensional equation for this model driven by a random source is: x(n) = px(n-\) + {\-p)u{n) In an 8 bit system the random source, u, has the full ADC range (0 to 255). 33 When matching gray scales, a group of pixels in one image is compared to a similar group of pixels in the other stereo image. The four pixel values in a 2x2 window can be considered as a 4 dimensional vector that points to a unique position in 4D space. The number of elements in this space is determined by the number of different ADC values each pixel can have raised to the power of the number of pixels in the window. Each element in this space can be considered as a unique state of the 2x2 window. If the number of distinct states is large, the probability that another group of 4 pixels will point to the same state within a finite search range is small. When noise is added to the ADC values, the 4 pixels values in the window no longer point to a single, unique element in 4D space. Noise forces the vector to move in some volume of this space. In average images where local data tends to be highly correlated, the range of ADC states available to neighbouring pixels within a matching window is restricted. If a given pixel has a value of 100 then the image model restricts neighbouring pixels to values near 100. This decreases the volume of available states within a given matching window and so increases the ambiguity of the matching process. In an image with 256 gray levels, no noise, and p = 0.95, the amplitude of the source information becomes (l-p) x 256 = 13. The strong correlation between neighbouring pixels limits the average range of new information available to adjacent pixels. The number of states available for matching is 13^ where Wis the number of pixels in the window. Since 13 is almost the square root of 256, we should get almost the same matching performance as a random image when the number of pixels in the window is doubled. Experience has shown this not to be the case. The main problem in matching practical images that follow this auto regressive model is the presence of additive noise; rj(n) in the expression below. x(n) = px(n-\) + {\-p)it(n)+ r/(n) When there are only 13 states that describe information, additive noise hollows out a proportionally larger region of uncertainty in the matching space. With fewer states available due highly correlated data, and larger unambiguous state sizes defined by system noise, the effective 34 number of elements a given matching volume decreases. With an additive noise ±1 ADC count, a pixel can take on 3 values as the noise changes form +1 to 0 to -1. The number of distinct, non overlapping source states falls from 13 when the pixels were fixed in position to 4 and the window size must increase further (and the match resolution decrease) to provide unambiguous matching. The number of ambiguous matches across an epipolar line can be estimated by: where: Ng = number of matching errors across an epipolar line R = search range = number of unambiguous information states (N$ < 4) W= number of pixels in the matching window L = number of pixels in the epipolar line In practice, two dimensional matching windows greater than 7x7 pixels are used to get reasonable results (Fua 1993). Using to the above approximation (Equation 3.1), a 256x256 pixel image will have one matching error per line using a 50 pixel matching window over a search range of 20 pixels when the number of unambiguous information states between adjacent pixels is in the order of 1.3. According to the image model used in this section, this low number of distinct states between adjacent pixels corresponds to an additive noise that can range over +4.5 ADC counts. In terms of Gaussian noise, 95% of the noise values are with 2o of the mean. A range of ±4.5 ADC counts corresponds to noise with o = 2.3 ADC counts; a number that is not unreasonable. Random dot stereograms have flat spectra. Local regions are uncorrelated and have good characteristic for matching. Local regions in practical images are highly correlated and present limited texture for gray scale matching. This makes the matching process sensitive to noise. One would like to enhance useful image texture for matching, producing a flat spectrum but not amplify N, R-l (Appendix 2) 35 the image noise. Random noise is present in images but it is dependent on conditions that are difficult to estimate. Two systemic noise sources that can be estimated are ADC quantization noise and spatial sampling noise. The ADC quantization noise is described in text books with variance: _2 A 2 <T = —, a* ±0.3 ADC counts 12 To achieve reasonable matching success, the variation between adjacent pixels should be large compared to noise fluctuations. This information can be enhanced by filtering however spatial sampling noise limits the high frequency data that can be used for matching stereo images. 3.2 Why 'Good' Images Don't Match Visual surfaces appear in the stereo images as two differently sampled versions of the same intensity pattern. Such a system will not produce meaningful matches of all frequency components since each image would sample the visual surface using different pixel boundaries. If the signal of interest is a sine wave with frequency just slightly less than the half the sampling frequency, the relative phase of two sampled versions of the signal can be anywhere within a ± 90° range of each other at this frequency. As an extreme example, one camera could locally sample the peaks and valleys of the sine wave and the other system could sample the same data but almost 90° out of phase and see values near zero. These are the parallel sampled data systems described in Figure 3.1. Even though the Nyquist criteria is obeyed, sampling the same data asynchronously will produce different data streams. In robot stereo vision, this data is not reconstructed but remains in digital form where it is compared to recover depth information. At the highest frequency of interest, the sampling offset is assumed to be uniformly random between ± 90° with standard deviation of 52°. At half this frequency, the sampling offset is uniformly random between ± 45° with standard deviation of 26° etc. The standard deviation of signal phase due to ideal stereo sampling is approximately: 36 <fi&-^-— where cog is the sampling frequency and cok < (3.2) cos 7 2 The difference ,s,between two sine waves with a sampling offset <j> is: £- = ^ 4sin cokn + —\-As'm\cokn- — =2^sin[— cos(cokn)&0Acos(cokn)& ——Acos{a>kn) \ 2) \ 2) \2) 6)s 7 Using the convention where the frequency of sampled signals varies between 0 and n, and cos =2n; the stereo sampling error of a signal A sm(cokri) is of the form: SK^^LAo,os{cokn), cok^n (3.3) Given a sine wave, the sampling noise appears as a cosine wave. The error introduced by stereo sampling is similar to a derivative operator. Since derivatives can be calculated as the difference between adjacent pixels, spatial sampling noise can be considered a subpixel difference between image elements and therefore follows a derivative-like frequency response. As with a derivative operator, the amplitude of the error is proportional to the amplitude and frequency of the signal component being matched. At the highest frequency allowed by Nyquist (co^.« K), the standard deviation of the sampling error is approximately 2TT/7 « 1. In other words, at the Nyquist limit the expected amplitude of spatial sampling noise is about equal to the signal magnitude. Since the distribution is taken to be uniform, the probability is about 60% that the sampling error is within the bounds of the standard deviation and there is approximately a 40% probability that the sampling error is greater than the signal information. An approximate upper bound to sampling noise is given by the worst case phase error induced by asynchronous sampling, </>< n^^. The worst case matching cos error is: s<—^Acos(cokn), ak^n 37 In a worst case, the sampling error can be about 1.5 times the signal information. The maximum error is about equal to the available signal at o\ « 2n/3 or one third the sampling frequency. Even signal frequencies less than n/2 (half Nyquist) can have significant sampling ambiguity and degrade the matching process. When combined with other noise sources the high frequency components do not provide reliable matching information yet these are the frequency components most emphasized in standard digital correlation formulas such as Pearson product-moment correlation (Numerical Recipes, 1992). A similar matching metric, the SSD was introduced in Chapter 2. This expression evaluates local regions of data in which the average value within the window is subtracted. It acts like a high pass filter on the data within the window and emphasizes the frequency components that are most subject to stereo sampling errors. Large data windows will reduce the weighting of high frequency terms in the window at the cost of reducing the resolution of the matching process. The effects of sampling artifacts can also be reduced by lowpass filtering the data and not subsampling the result or by upsampling and interpolating the data. Data that is lowpass filtered before using the SSD effectively becomes bandpass filtered data in the SSD calculation. If the scene is lowpass filtered and then subsampled, it can be reconstructed but sampling artifacts can also be reintroduced by the subsampling process. This can make matching at coarse resolutions problematic if a strict Nyquist criteria is used at each scale. Unless each scale is oversampled, noise is needlessly introduced into the matching process. SSD = 38 3.3 A Wiener Filter Approach for Matching Sampled Data As pointed out in Section 3.1.1, high frequency information (visible texture) is important for matching local image regions. The amplitude of the image spectrum predicted by the autocorrelation model is of the form: 1 Vl + (20<y)2 Sampling noise acts like a derivative and increases with signal frequency. In a Bode plot (Figure 3.2), the sampling noise increases linearly in the region where the image spectrum is flat and is constant in the region after the first order pole of the image model. Spatial sampling noise sits on a pedestal of random noise + ADC noise. As mentioned in the previous section, by itself, spatial sampling noise reduces the noise margin for matching to zero at the high frequencies. Based on this approximation, one can develop a 'rule of thumb' for the usable image bandwidth available for gray scale recognition and matching. image spectrum Figure 3.2: Bode plot showing reduction of useful matching bandwidth by wideband noise. An interesting concept in the Bode plot approximation is that the actual 3dB bandwidth of the image spectrum is not important since the shape of the noise and signal Bode plots track each other and the high frequency intercept occurs at the same frequency in the Bode plot. Sampling 39 noise is derived from the signal and the relative slope between the signal and the sampling noise pedestal is fixed at +1 and the two will intercept at the same frequency regardless of the order and bandwidth of the input signal. However, the reduction of useful bandwidth by wideband noise depends on the slope of the image spectrum as shown in Figure 3.2. The basic idea in reducing matching ambiguity is similar to a Wiener filter where the signal is bandlimited at a point where noise masks useful information. Wideband noise is assumed to be uncorrelated with the spatial sampling noise. In the Bode plot of the first order model of the image spectrum, the high frequency components vary as 1/20© so that the change useful bandwidth due to the wideband noise pedestal in Figure 3.2 appears as: where rj, the estimate of wide band noise in the image and coc is the frequency where the signal information is equal to the stereo sampling error. For this calculation coc is taken as 27i/3, the frequency where the worst case sampling error is about equal to the signal amplitude. The factor of 20 follows from the average image model. The 'rule of thumb' (rough approximation) for the degradation of useful image bandwidth with 8 bit quantization is: where a is the wideband noise in ADC counts in an average image ADC value of 128. In this equation, 40TI » 128; the number used as the average ADC value in this thesis. Expressing coc as a fraction of In is used to scale the noise to a value in ADC counts. In this case, to get a useful bandwidth of 7t/2, the wideband noise should not be much more than 2 to 3 ADC counts in an average ADC value of 128. Since the goal is to compare two pieces of image data, the wideband noise is the noise difference between the images and, on average, this will be larger than wideband noise in a single image. 1 1 {20coBJ- (20a)J (3.4) 40 The degradation of matching bandwidth will be different for images that do not conform to the image model. Raw images that contain more high frequency information will have a wider spectrum than depicted in Figure 3.2 and the factor of 20 in Equation 3.4 (from the average image model) might be replaced with values as low as 10 or 8. Images that agree with the average image model and that also suffer from more wideband noise will have lower useful bandwidths available for matching. Equations (3.3) and (3.4) indicate that average images should be over sampled by at least a factor of 2 before matching. Sampling uncertainty affects coarse to fine matching strategies, edge detector bandwidths, and it affects the bandwidth of filters used to extract image texture. 3.4 A New Preprocessing Filter for Match ins Images Image texture along the epipolar line can be extracted by a symmetrical difference operator of the form: This directional operator is similar to the sin(co) function (0 < co < %); the symmetrical derivative. It acts as a bandpass filter with zeros at co = 0 and co = n and a maximum response at co = 7t/2. Deriche (1987) introduced a recursive edge detector which is analyzed in Appendix 3. It is shown that this implementation is equivalent to a low pass filtered symmetrical derivative. This concept is used to develop a texture filter that is better matched to the expected image parameters: Unlike Deriche's filter, which is used to extract edges, this one does not need tuning. The lowpass term is realized by a second order Butterworth filter run forward and then backward over the data, -1 0 1 - 2 0 2 -1 0 1 41 making it phase linear. A second order Butterworth lowpass filter is used in this way because it is virtually separable in two dimensions and it conforms to an accepted surface model (Appendix 4). The image is lowpass filtered in two dimensions and then the symmetrical differencee is taken of each horizontal scan line. This allows the texture to be efficiently extracted in one dimension, for matching along epipolar lines, while spatial sampling noise is removed in 2 dimensions. By itself, the symmetrical derivative filter is a good way to recover image texture. It ignores low frequencies with little useful texture for local matching and acts to cancel the first order pole in the image model and so flattens the spectrum, making the texture information more like a random signal and better for matching. The filter also has zero response at 7C and therefore attenuates spatial sampling noise. Even so, it still accepts signals with negative signal to noise ratio as can be seen in the lower right corner of Figure 3.3a. In this thesis the symmetrical difference operator is combined with the 2-D Butterworth lowpass filter that is set for a frequency of about OAn (Figure 3.3b). The recovered texture information shows positive signal to noise ratios. 10' 10" Symmetrical Difference Filter 10' Texture Filter Image Model 10" 10 10 Figure 3.3a: Using the symmetrical difference Figure 3.3b: Using the thesis filter to recover to recover image texture. image texture. Figure 3.4a shows the test image used in this chapter. It has both large and small surfaces to test disparity resolution and it also contains shadows with very little visible texture. This makes stereo matching difficult and it makes the image useful for studying fundamental problems that 42 arise from poor texture and occlusions in vertical surfaces. Just as with human vision, the algorithm developed in this chapter solves aerial stereo images, stereo images with vertical surfaces such as drums and trees, and ground based stereo images used for navigating on plane surfaces. Figure 3.4a: Test image used in this chapter. The next figures show the disparity maps resulting from simple block matching of image texture recovered using a symmetrical derivative filter and texture recovered with the thesis filter. Left-Right confirmation is used to validate the matches (described later in this chapter). Black areas occur when no match is found. Bright white regions are false matches. Figure 3.4b: Disparity using image texture from Figure 3.4c: Disparity map using image texture symmetrical derivative: 13977 match failures from thesis filer: 11988 match failures 43 The black banded region on the extreme left in Figures 3.4b and 3.4c is difficult to match using simple block matching and it is not included when counting regions where Left-Right confirmation fails. A noticeable difference in the images can be seen in the lower left and in the upper right where the symmetrical difference filter fails to validate as many regions as recovered using the thesis filter. Overall the symmetrical difference filter has about 16% more match failures. Even though it attenuates signal frequencies greater than nil, it can accept data with negative signal to noise ratios as shown in Figure 3.3a. The filter used in this thesis is designed as a general purpose image filter to recover texture that has a positive signal to noise ratio for gray scale matching. It consists of two elements; a 1-D symmetrical difference operator to flatten the useful image spectrum along the epipolar line and a two-dimensional separable lowpass filter to reject spatial sampling noise. This is described in Appendix 3 and Appendix 4. 3.5 Matching A long Epipolar Lines In this thesis, one image is designated as the reference image so that disparity values can be assigned to locations in this image. A match surface of two epipolar lines is constructed from the absolute difference of the pixel values in a line of the reference image WRT the corresponding epipolar line in the other image. In most stereo images, including those in this thesis, the epipolar lines are horizontal lines in the stereo image. A horizontal row of pixels from an image constitutes an epipolar line. A match surface can be constructed by comparing horizontal line 100 in the right image with horizontal line 100 in the left image. In Figure 3.5, the bottom row on the match surface is the zero shift absolute difference between the pixel values of the two image lines. The second row in the match surface is the absolute difference of a one pixel shift between the two lines, etc. This process forms an array of match scores. The number of horizontal elements in the array is 44 determined by the number of width of the images while the number of vertical elements is determined by the disparity search range. i t i i i i i i i i i i i i i r r r n Right Epipolar Line Left Epipolar Line to be Shifted and Matched Array of Match Scores for Corresponding Epipolar Lines Figure 3.5: Construction of an epipolar match surface. Using SAD (sum of absolute differences) as a matching metric, the overlapping regions in this matrix structure contain the absolute difference of the gray scale values for a particular pixel pair after the texture filter has been applied to the images. The match scores for a given position in the right (reference) epipolar line appear in a vertical column on the match surface. Match scores for a given position in the left epipolar line appear along a 45° line on the match surface. Ideally, match positions would be indicated by a contiguous line of zeros through the match surface. 3.6 Surface Models and Correlation Windows In practical applications of machine vision, visual surfaces arise from physical surfaces and a physical surface model that describes how adjacent surface elements interact is used to constrain estimation of visible surfaces. A surface can be though of as a mosaic of infinitesimal planar surfaces. In most physical surfaces the orientation of neighbouring differential surface elements does not change quickly. One can always find exceptions, like the sea urchin, but walls, 45 cylinders, and other simple visual elements are relatively smooth and appear discontinuous only at the boundary of the visual surface. Blake and Zisserman deal with visual surface models in their book 'Visual Reconstruction'. They point out that "the higher order the model, the greater the range of interaction between cells, and the more intractable the problem of signal estimation becomes. In practice, anything above first order is more or less computationally infeasible". Their work favours the weak membrane model; a first order surface model, and the weak string model which is the one dimensional equivalent. The epipolar match scores in Figure 3.6 have been low pass filtered along the constant disparity lines using a simple first order recursive filter that conforms to the weak string model. In this way, the matching data is weighted according to the modeled local interaction between elements on the surface. 0.5^ Disparity 0 n 300 200 100 Horizontal Pixel Position Figure 3.6a: Match surface formed from comparing horizontal line 100 in left and right images. Figure 3.6a shows a match surface for a pair of typical epipolar lines. The y axis has a 20 pixel disparity range. The x axis is the pixel position in a 256 pixel epipolar line and the z axis is the match score. The best match appears as a minimum energy path though this surface. False 46 matches can occur in regions with little intensity texture and in regions with occlusions. The path is expected to be discontinuous over occluding surfaces. Figure 3.6b: Contour plot of Figure 3.6a with best match solution on match surface. 3.7 Left-Right Match Confirmation Left-Right confirmation is often used to reject false matches (Prazady 1985, Fua 1991). In Figure 3.7, match scores for a given position on the right epipolar line appear in a vertical column on the match surface while match scores for a given position on the left (shifted) epipolar line appear along a 45° line on the match surface. Using Left-Right confirmation, a match is confirmed when the best score on a search along the vertical line is confirmed as the best match by a search along the intersecting 45° line. 47 Left Image Pixel 15 Scores < ^ n Right Image Pixel 10 Scores 10 15 Figure 3.7: Left-Right correspondence on match array occurs when both the vertical and diagonal search find a best match for a 5 pixel disparity. 3.8 Matching Over Occlusions Occlusions occur when some region of the scene is visible in one view but is not visible in the other view. Regions on either side of the occlusion are visible in both views and, ideally, can be correctly matched. If there is a positive jump in disparity then the left view will contain some of the scene that is not visible in the right view. These extra pixels in the left view have no matching pixels in the right view. In the example below, a continuous solution that can match up to pixel 8 in the right (reference) image and also correctly match the next pixel must jump over some interval of pixels on the left epipolar line. The left view contains pixels that are not present in the right view. Valid matching information is locally continuous along the right epipolar line but it is discontinuous along the left epipolar line. Left Right unmatched pixels in left view Figure 3.8a: Positive distance step. Figure 3.8b: Corresponding pattern on match surface. 48 If there is a negative jump in disparity then the right (reference) view will contain some of the scene that is not visible in the left image. The match information is continuous on the left epipolar line and it is discontinuous on the right epipolar line. In the figure below, there is a correct match at position 6 on the right epipolar line and the next possible match lies along the 45° line that corresponds to the next contiguous pixel on the left epipolar line. Righ t Figure 3.9a: Negative distance step. •••• \ X s; s 11 t ! unmatched p ixe ls in right v iew Figure 3.9b: Corresponding pattern on match surface. These rules conform to the disparity gradient limit (Pollard et al. 1985). As pointed out in a following publication (Trivedi and Lloyd, 1985), the disparity gradient limit is a geometric constraint that results from camera geometry and the different information available in the two views of the scene. In the case of a positive disparity step, the match is continuous on the right epipolar line while in the case of the negative disparity step, the match is continuous on the left epipolar line. This constrains how the disparity changes across the occlusion gap and it also conforms to the monotonicity constraint (Geiger et al. 1995) which states that, except for very special illusions, a match discontinuity does not appear at the same location in both images. One view or the other will have continuous match data. When indexing across the match array using these rules, i f the previous match is known to be correct then the next correct one to one match is in the region defined by a disjointed line as shown. 49 r n e x t c a n d i d i a t e m a t c h e s c o r r e c t m a t c h e s ^ i n d e x d i r e c t i o n Figure 3.10: Search region when previous match is known to be correct. Figure 3.10 indicates the next possible candidates on the match surface. If there is no disparity change then the correct match is at the next horizontal element in the match surface . If there is a positive disparity change then the next match is on a vertical line above the previous correct match. If there is a negative disparity change then the next correct match is on the diagonal line below the previous match. If there is no information about previous correct matches then the score associated with the best match is in the triangular region indicated below. A pixel that is visible on both epipolar lines will have a correct match score in the vertical column at the index location. When the right index pixel is not visible in the left view the correct match is right of the index position. A pixel in the right view that is identified as being in an occlusion region can be assigned to the background disparity (Jones and Malik, 1992). This is the best match in the region to the right of the index position . .... ... .... 2. i n d e x l o c a t i o n Figure 3.11: Search region on match surface with no knowledge of previous match. 50 This illuminates two points: 1. When no occlusion is present (no disparity step) or when there is a positive disparity step, information is continuous in the right view. The correct match is in a vertical column at the pixel index. This is a small search region. 2. Only when there is a negative disparity step does the search region expand to include the triangular region shown. This is a lower probability event than that described in the previous situation (point 1 above ). Using the left view as the reference view, the conditions become reversed for the stated disparity changes. It is possible to achieve reasonable matching by ignoring the occlusions mentioned in Point (2) since half the disparity steps are automatically handled by (1). As described in Sections 3.1.1, matching compares groups of pixels in some region of the image. To compare valid regions of the images it is best to match on each side of the occlusion boundary rather than across the occlusion. Such a delineation of matching regions is described in the next section. 3.9 A Novel Technique using Directional Matching over Occlusions Blake and Zisserman (1987), Terzopolulos (1988), Fua (1991), and other researchers use a two step process to extract surfaces from stereo images. The first step is to extract range data by determining the disparity between the two stereo images. In the second step, this information is then smoothed according to some surface model; a process which also smoothes data across depth discontinuities. An energy score is calculated to measure how well the range or disparity data fits the surface model. In the region of a discontinuity, the energy is high and the smoothing is relaxed or abandoned in order to preserve the discontinuity. On each side of the discontinuity, smoothing remains in effect. In the first step, which extracts disparity, correlation scores are supported by a region of data determined by the shape of the correlation window. The correlation window acts as a filter 51 which weights and sums the match scores of individual image elements in the window. At this level, correlation matches individual pixels but, as discussed earlier, noise and poor image texture require that the individual scores be summed over some region before a decision can be made. In this respect, smoothing stereo range data according to a surface model and correlation use similar operations in that they form spatial weighted averages within the image space. In performing the second operation, smoothing depth data according to a surface model, the underlying information for depth is smoothed twice, once in the correlation process where the depth estimation is derived from an average match score over an image region and a second time when the depth data itself is smoothed over some region. An advantage of the second step is the ability to handle discontinuities where the range data does not conform to the surface model. However, since the surface model implies some spatial support for this decision, the correlation process can also use be adapted to detect discontinuities with respect to a surface model. This is done with directional correlation filters. Figure 3.11 shows the basic implementation of the correlation filters used with SAD matching in this thesis. In Section 3.5, a symmetrical correlation filter was used. In this section simple fonvard and backward first order recursive filters are used to average matching scores along the epipolar lines. This is a one dimensional search and the filters weight data according to the weak string surface model which can be formed from the addition of two first order Butterworth recursive filters. Figure 3.11: Forward and backward correlation filters on different surfaces. 52 When the filtered data is processed separately the storage requirements are doubled; yielding two matching surfaces, each similar to the one shown in Section 3.5 except that one surface is filtered in the forward direction and another surface filtered in the backward direction. At each pixel location, the forward and backward filters are compared. In Figure 3.12 the backward filter cannot fit on the surface and straddles the transition between two visible surfaces while the fonvard filter is continuous on a surface and has a better match score. At each pixel in Figure 3.12, the match algorithm decides which filtered surface to adopt. As the forward filter advances one pixel to the right, the backward filter retreats one pixel. Eventually only the backward filter can find a continuous surface and the forward filter begins to cross the discontinuity. At this point, the backward filter has a lower score and best represents the surface at that pixel location. The basic idea is that if the fonvard and backward filters agree on a disparity value then they are on a surface that conforms to the membrane model. If they disagree then they are near a discontinuity. In traveling across the match surface by increasing horizontal pixel number (from left to right), the best match in the region just before an occlusion region is on the forward filter surface. The best match in a region just after a disparity transition is on the backward filter surface. In effect, the backward filter surface looks ahead onto future surfaces whereas the forward t pixel position Figure 3.12: Matching over an occlusion 53 moving filter looks back over past data. If the forward filter data shows a good match score, the disparity value should be within ±1 pixel of the previous disparity value for the data to be part of a known surface. This can occur because: 1. there has been no disparity transition. 2. there has been a disparity transition but the fonvard filter is now settled on the new surface and the previous disparity values were obtained from the backward filter. These effects can be seen in Figure 3.13 below where a single symmetrical filter and Left-Right confirmation is used to match the surface in an aerial photograph. In scanning from left to right, the leading and trailing edges of the building appear as disparity steps. The symmetrical filter cannot find confirmed matches in these regions which appear black in the disparity image. Poor visual texture also limits matching but it is the effects of disparity steps and occlusions that asymmetrical filters address in this thesis. a: Right view of aerial photograph. b: Disparity recovered with symmetrical filter. Figure 3.13: Failure of symmetrical filters in regions with step changes in disparity. 54 Most of the scene in Figure 3.13 has continuous visible surfaces. 27% of the pixels in the disparity image cannot be matched using the symmetrical correlation filter and Left-Right confirmation (i.e. more than one quarter of the image). 3.10 Matching with Asymmetric Correlation Filters Consider searching along a constant pixel position in the right view; the vertical line in each forward and backward match surface shown in Figure 3.14. Away from a discontinuity, both searches confirm a best match at the same disparity value. In case of a positive disparity step the backward filter will have the better match score and the disparity must be greater than the old forward disparity since only positive disparity steps are continuous in the right view. The only condition that is not handled correctly by this simple vertical search is a negative disparity step. In the left view, negative disparity steps are continuous. A search along a constant pixel position in the left view occurs along a 45° line on the match surface. At the boundary of a negative disparity step, the backward filter would be on a new surface and the forward filter (on the old surface) would be entering an occlusion region. In this case, the backward disparity is less than the old forward disparity and the backward disparity value is assigned to this pixel (which is visible only in the right view). This procedure of filling in of occlusion regions with the background disparity is discussed by Jones and Malak (1992). I, JL ; — 1 1 1 .... -* . . . . . . index locat ion - 1 Figure 3.14: Diagonal search line on Backward filtered match surface 55 Figure 3.14 illustrates a match surface generated by a ten pixel disparity search range, (i.e. there are ten vertical elements in the match surface.) With each increment of the pixel position in Figure 3.14, the triangular region moves to the right; ten new elements arrive and ten old elements leave. The discarded elements are the match scores for the previous constant pixel position in the right epipolar line. The new elements are the match scores for a constant pixel position in the left epipolar line. These appear on the 45° line in the triangular region. A search along this vector will identify the best match associated with a pixel on the left epipolar line that is 10 pixels away from the current position in the right view. As the index advances, only ten new elements need to be searched to record the best left match candidate. The two pieces of infonnation needed from this search are the value of the best match score along the 45° line and the position of the best match score along the line. A vector of the best left view solutions is maintained, shifted and updated with the new match candidate. In effect, this vector is the W T A solution (Winner Take A l l ; Little and Gillet, 1990) seen from the left view. The length of this vector is equal to the disparity search range. ~'i ' *-> ,,,, , ^ ,1 n "T"["' | j EE i ... ... Figure 3.15: Data vector of diagonal search results. In Figure 3.15, the latest best match score from the diagonal search and its vertical position are stored in the top locations of the data vector. The previous data has been shifted down one position in the vector. It can be seen that it is possible for the second top position to have a solution that points to the maximum disparity value; a region that is not part of the triangle. Similarly, the third top position in the data vector could have a solutions that point to the top two 56 disparity values; also outside the triangle. One criteria for valid matches becomes very simple: in the data vector, the correct match has a disparity value equal to its index position in the vector. For example, the top entry in the data vector might have a best match at position 8 along the diagonal search. After two shifts down (which occurs after two shifts to the right along the epipolar line), the index in the vector and the disparity value agree. This represents the best match seen from the Left view. If this match was to be confirmed in the Right view by a vertical search, it would also reside at position 8. A necessary condition for Left-Right confirmation is that the vector index and disparity value agree. A l l the elements in the vector where the disparity value and the index position agree meet this criteria for information that is visible in both views. p r e s e n t s u r f a c e f u t u r e s u r f a c e index location - T Figure 3.16: First evidence of a negative disparity step on diagonal search. An analogous situation holds for negative disparity steps where a search along the diagonal line would show a match at a position less than the current disparity value. Figure 3.16 shows the first evidence of this event on the diagonal search. This information is stored in the top of the data vector shown in Figure 3.15. As the index position advances along the epipolar line, the first evidence of the negative disparity step is shifted down in the data vector. The pointer to the future surface in the data vector coincides with the disparity value of the present surface when the present surface ends. By the time that the index location in Figure 3.16 had advanced to the end of the upper surface, the top four elements of the data vector would contain information about the new, lower disparity surface. The future surface is confirmed by continuity of disparity values in the top elements of the data vector which are the search results in the left view only. It cannot be confirmed 57 by data in the right view since a negative disparity step produces unmatchable pixels in the right view. Left-Right confirmation will fail at surface boundaries with step discontinuities such as those in the test image (Figure 3.13). Asymmetrical correlation filters and the diagonal search technique provides a look at future surfaces and effectively steps over information that is unmatchable in the reference image. When using the right view as the reference image, negative disparity steps generate unmatchable pixels in the reference image and Left - Right confirmation must fail in such regions of the reference view. When this happens, the technique adopted in this thesis is to look for future surface continuity and assign the corresponding disparity values to the unmatchable pixels in the reference image. 3.11 The Match Rules Two match surfaces are constructed for each epipolar line in the right image as described in Section 3.5. One surface is filtered in the fonvard direction and the other surface is filtered in the backward direction as described in Section 3.9. For each pixel in the horizontal image line from the right view, each match surface is searched vertically and diagonally as described in the previous section and the best diagonal match scores are stored in a data vector as shown in Figure 3.15. The data vectors, designated Uf and Ub contain the energy and disparity values that result from the diagonal searches. Rule 1: The best vertical match score of the fonvard match surface, ef, and of the backward match surface, eb, are compared according to this rule: accept the disparity, df, that is associated with ef IF: (ef< eb) 58 AND UffdfJ) < Ub(dfJ) AND abs(Uf(df,2) - df) < 2 In this case the best forward match score is less than the best backward match score, the disparity value associated with the forward match score is confirmed to be within +1 pixel of the best fonvard surface scores stored in the diagonal vector and the corresponding forward match score in the diagonal vector is less than the backward match score. ELSE accept the disparity, db, that is associated with eb IF: (eb<=ej) AND Ub(db,l) < Uf(db,l) AND abs(Ub(db,2) -db)<2 This rule handles regions away from discontinuities and regions near a positive disparity step. In these regions the match data is continuous in the right view which is taken as the reference view in this thesis. Experience has shown that about 95% of the pixels in a typical stereo reference image can be assigned a disparity value using Rule 1. When this rule fails a solution is accepted that corresponds to a negative disparity step. In this case, some regions visible in the right image are not visible in the left image and cannot be matched. Rule 2: For pixels that fail Rule 1, a future surface with a disparity given by the average of the top elements in the diagonal search vector is taken as the solution. This corresponds to a continuous surface visible beyond the occlusion region and automatically fills in occluded regions with the background surface. 59 The results of this procedure are shown below. Two of the pictures (Figures 3.17a and 3.17b) are simply reproduced from Figure 3.13. They show the right image of the stereo pair and the disparity recovered by a symmetric matching filter. The third and fourth pictures show the dense disparity map recovered using asjnimetric filters and the matching rules outlined above. Figure 3.17a: Right view of aerial photograph. Figure 3.17b: Disparity recovered with symmetrical filter. Figure 3.17c: Disparity map recovered using only Ride 1. 60 The disparity map in Figure 3.17c shows most unmatched regions near negative disparity steps. These are most evident along the back of buildings. The right view can see the backs of buildings whereas these regions are not visible in the left view and should remain unmatched in the right view. Figure 3.17c shows unmatched regions in the right view after Rule 1 is applied. All the unmatched regions in this figure are assigned disparity values according to Rule 2 outlined above. The result is shown in Figure 3.17d. Figure 3. J Id: Disparity map recovered with Rale 1 and Ride 2. The dense disparity map in Figure 3.17d is recovered in a single pass using the technique described in this chapter. There are few wild points. The most noticeable are in a white region in the lower left corner of the image. Comparing Figures 3.17c and 3.17d shows that this wild point passes left-right confirmation as described by the matching Rule 1 above. Figure 3.18 shows the result when the same code is applied to the Pentagon image. The Pentagon image shows the limitations of this technique in resolving small surfaces. The directional correlation filters can preserve step disparity changes between large surfaces but have trouble when the image has many small surfaces. Better results are obtained by Geiger et al. (1995) and by 61 Birchfield and Tomasi (1998). Both use dynamic programming however Geiger's algorithm is much more time consuming. In regions with little texture, Birchfield's results are much better than those achieved with the algorithm presented in this chapter. However, Birchfield's algorithm tends to be brittle and not well suited for sub-pixel interpolation of disparity surfaces. In the next chapter, sub-pixel interpolation will be shown to be essential for perceiving local shape with stereo vision. Figure 3.18: Right Pentagon image and recovered disparity map. 3.12 Summary This chapter examines an implementation of gray scale matching stereo images. The emphasis is on solving problems that arise when dense disparity maps are extracted from images. Sections 3.1 and 3.2 investigate the statistical information that is presented in images. It is shown that, when matching regions in two images, there is a minimum region size necessary to find an unambiguous match and that, due to spatial sampling noise, high frequency information does not contribute to successful matching of these regions. Most signal processing systems intentionally oversample the input data. This is important in applications such as heart monitors that arc used to detect abnormalities. In contrast, most 62 images are not oversampled since, to humans, they would appear blurred or out of focus. Section 3.2 shows that high frequency components in critically sampled data do not contribute to gray scale matching and identification. This can also affect some coarse to fine matching schemes where the coarse images are critically subsampled. Two strategies are proposed to use the available image data. The one outlined in this thesis is to achieve oversampling by using a lowpass filter. The other strategy is to upsample the image data prior to matching. This second strategy could be used to high resolution stereo information in local regions of interest where the region is visible in both views. This remains an area for future work. In Sections 3.4 and 3.5, a texture filter is introduced to extract the useful matching information in the images. Sections 3.6 to 3.10 introduced a match surface to study the interaction of occlusions and correlation windows. Forward and backward directional filters are introduced to match over occlusions. This information is combined in Section 3.11 in an algorithm to produce dense disparity maps. The match rules are not heuristic but are systematically developed from geometric constraints that result from camera geometry and the different information available in the two views of the scene. The result is a single pass algorithm (no post processing) used to extract dense disparity maps in this thesis. 63 Chapter 4 Navigation, Recognition, and Orientation This chapter examines shape in stereo vision and develops teclmiques for robot navigation as a shape paradigm. Using the concept of a global shape model; in this case one that describes the floor on which a robot moves, (Burt et al. 1995), and other researchers (Kumar et al. 1994, Zhang et al. 1994), detect obstacles that arise in robot navigation problems. This technique identifies the floor as an object with global shape attributes whereas quantitative shape techniques measure local curvature in a scene and are more subject to noise. Navigation occurs within the context of a global shape whereas objects are recognized by local shape. The goal of this thesis is not only to determine how to navigate toward a log or cylindrical drum, but to determine how to recognize the object and its orientation with respect to the robot. Towards this end, Section 4.2 introduces disparity gradient space as the fundamental space in which stereo vision determines surface orientation; beginning by a description of planar surfaces in this space. In Section 4.3, this is extended to recognition of simple shape. In this case, cylinders are described in terms of their projection into this space. In Section 4.3 some real world examples are given and the limits of shape perception by stereo are discussed. The concluding sections of this chapter develop a formal mechanism to recognize local shape and the extract the orientation of simple objects using stereo vision. Previously, in Section 2.4, surface normals were extracted by taking the derivative of disparity. In recognizing a curved surface, a vision system perceives changes in surface orientation which is equivalent to a second derivative of the disparity field. Section 4.5 develops a techniques to recognition based on the null space of the surface normals. 64 Recognition of objects by computer requires a model of the problem domain in which objects can be classified according to how their attributes conform to some stored model. In a robust recognition system, these attributes need to be invariant under translation and rotation, within the limits of the robot work space. This chapter uses concepts from Extended Gaussian Images (EGI) to develop a method for recognizing cylinders and estimating their size and orientation. The basic ideas are built on previous work by Little (1985), Boyle and Copper (1986), and Kang and Ikeuchi (1993). However, this chapter begins with the navigation problem which involves recognition of a surface on which a robot can move: a global plane. 4.1 A Global Shape Model Burt et al. (1995) described mobile robot navigation on planar terrain where obstacles were made evident as deviations from the planar surface. The basis for this technique relies on the behaviour of planar surfaces in stereo vision as described in Section 2.4. For navigation, a global shape model is imposed on the scene; in this case a horizontal plane. The whole scene is organized in terms of the model but the floor is not explicitly recognized. This organization occurs in a disparity image of the scene. — Zmin Figure 4.1 Camera orientation for floor navigation The implementation in this thesis is designed for use indoors where the floor is flat such that the cameras coordinates do not roll with respect to the floor. The robot cameras are tilted 65 toward the floor so that, if the robot was situated on an infinite plane, the top row of the camera would just perceive the horizon. The bottom row of the image sees the floor a short distance away. Using fixed camera geometry, the disparity value of the bottom row is pre calibrated into the system and, if necessary, updated by the robot. When the robot can see the floor at the bottom row of the image, it can measure the disparity value of this row. In practice, the algorithm measures the median disparity value of the bottom 5 rows that have unobstructed views of the floor. If this value changes significantly from the precalibratcd value then a new disparity value for the top row can also be calculated in terms of the camera height above the floor and the camera's vertical field of view (FOV) as shown in Figure 4.1. Given that the disparity gradient of a planar surface is constant (Section 2.4) and knowing the disparity values for the top row and the bottom row, the disparity value of the floor is a straight line function of the image row number. Ideally, the top row of the image just perceives the horizon and the disparity value at this row is zero. The bottom shows the closest region of the plane where the disparity value is maximum. In Figure 4.2, the disparity image shows high values (white) where the disparity is highest. A side view of the disparity values associated with the infinite plane appears as a straight line. -+— Row 1 Figure 4.2: Camera view of an infinite plane and Disparity Image ofplane. The global shape model is continuous and exists in the disparity image. When the disparity map of a scene is adjusted with respect to the model, obstacles are detected as regions that do not 66 conform to the model. Examples of this are shown in Chapter 5 where the technique is extended to navigation in hallways. In this case the global model becomes more complex. Such shape models can also be developed to navigate in a pipe or a tunnel. A mechanism for this is given in the next section. 4.2 Disparity Gradient Space: a new way to evaluate shape In the previous section, the navigation problem was constrained to vehicles that travel on a flat surface such that the cameras coordinates do not roll with respect to the floor. A robot might need to navigate along a wall and find obstacles on the floor and doorways, hallways, and obstacles that protrude from the vertical wall. In other applications, it may be necessary to travel inside a pipe or tunnel. In some instances, the camera coordinates may not be aligned with the flat surface model. This section develops a space to facilitate working such problems (Burge et. al. 1998). 4.2.1 The Disparity Gradient of Planar Surfaces Section 2.4 dealt with the disparity gradient of horizontal ramps. A robot that stands on an infinite plane can consider that the plane intercepts a virtual plane defined by the pinhole mask of the camera. Where this horizontal plane intercepts the Z = 0 plane, as defined by the pinhole mask, determines the perceived disparity gradient of the horizontal plane. A robot that looks off a tall building sees the ground with a disparity that changes slowly in the image. A robot that is very near the floor, glances along the floor and perceives a rapidly changing disparity over the surface of the floor. In both cases the disparity gradient is constant. The equations in Section 2.4 are symmetrical about the optical axis i.e. the basic characteristics of the plane do not change as the camera rotates about the optical axis. Visible planes that intercept the plane of the pinhole mask at a distance R from the pinhole aperture are all 67 tangent to a circle of radius R centered at the origin of the Z = 0 plane (the pinhole mask plane) as shown in Figure 4.3. l ine of intersection of plane with the pinhole mask sin 9 A , Figure 4.3: Line of intersection of a planar surface with the 2=0 plane. The point at which a plane intercepts the X and Y axis of the pinhole plane is: X - —^—• = —^— and Y - ^ = —!— where 9 is the surface orientation with respect to the camera cos 0 Dx sin0 Dy F rotation about the optical axis. In the inverse space of the Z = 0 plane, Dx2 + D 2 - -Xr , as shown y Rz in Figure 4.4. Figure 4.4: Disparity Gradient Space. 68 In Figure 4.4, disparity gradient space has 3 regions: 1. Region A: nearly vertical surfaces like a wall; an observer looking directly at a notice board or poster. In this region, visible planes intersect the Z = 0 plane at large distances from the pinhole aperture and the resulting disparity gradients are almost zero. The disparity is constant. 2. Region B: measurable non-zero disparity gradients. 3. Region C: low visibility, grazing incident surfaces with large disparity gradients. Since all points on a given planar surface will have the same disparity gradient; in disparity gradient space, all points on a that planar surface will map to a single point. As shown in Section 2.4, this point is not unique. Many planes can map to the same point, however, all points on a given plane must map to the same point in the disparity gradient space. The location of the point is given by Dx2 + Dy2 = where R is the shortest distance between the pinhole and where the extended plane would intercept the Z = 0 plane (the pinhole mask plane). 4.2.2 Reevaluating Surface Normals In Section 2.4, local disparity was described by a first order Taylor expansion; which is equivalent to fitting a planar surface to a local region in the image. This was used to derive the following expression for surface normals in terms of disparity gradients: An example of this expression is the geometry of a section through a vertical cylinder with respect to the pinhole plane as shown in Figure 4.5. The vertical disparity gradient, Dy is zero on the surface of a vertical cylinder. This simplifies the example to accommodate the limitations of the figure. (4.1) 69 optical axis(x=0) X Pinhole Mask Plane z ^ / T a n g e n t Plane disparity .= D disparity = Do Vertical Cylinder Section Figure 4.5: Geometry of a Vertical Cylinder with respect to the Pinhole Mask Plane All points on the tangent plane have the same surface normal and therefore the same disparity gradients. A point where the plane intercepts the optical axis of the camera would have no x or y displacement in the image such that Equation 4.1 for the surface normal at this point simplifies to: planar surface patch at the center of the image in order to evaluate the Z component of the surface nonnal. It is interesting to note in Equation 4.2 that the Z component of the normal at the center of the image is a disparity divided by the focal length of the camera, producing a gradient term in the Z dimension. f Using the results of the previous section and the basic stereo equation, D - ^- s the surface nonnal becomes: n= DX,D £ . - _L 1 J_ / IX'Y'Z, 70 where X, Y, and Z are the are the intercepts of the tangent plane on the world coordinates. These 'world coordinates' are defined by the camera orientation. The Z axis is the optical axis of the camera and the X Y plane is defined by the pinhole mask. This shows that estimating surface orientation using Equation 4.1 is consistent with estimating where the tangent plane intercepts these axes. It agrees with definitions of planar surface given in geometry texts that show a plane as a triangular region with vertices on the X, Y, and Z axes. . Z Figure 4.6: Definition of a plane by axes intercepts. 4.3 Qualitative Shape Planar surfaces provide useful information in navigation problems, however higher order surfaces are usually needed to identify an object. When an object is modeled as a continuous tiling of small planar surface patches, shape will have some characteristic in disparity gradient space. In this section cylinders are recognized as curves in this space. 4.3.1 The Disparity Gradient of Cylindrical Surfaces In Section 2.4, the perceived disparity gradient is shown to be the projection of the surface normal onto the Z = 0 plane as defined by the camera imaging surface. On the surface of a vertical 71 cylinder whose axis of symmetry is parallel to the Y axis; the projection of the surface normal can only vary in the X direction. Since the nonnal vectors will have no vertical component then, on the disparity gradient plane, the disparity gradients of this cylinder all map to points on a horizontal line that goes through the origin; the x axis. In the case of a horizontal cylinder whose axis of rotation is parallel to the X axis, the disparity gradients map to points on the y axis, a vertical line that goes through the origin. The disparity gradients on any cylinder that is parallel to the Z = 0 plane will map to points on a straight line that goes through the origin in the disparity gradient plane. In the case of vertical and horizontal cylinders, the disparity gradients map onto the x and y axes respectively. Figure 4.8 shows the intuition behind this concept. Figure 4.7: Projection of surface normals of a vertical cylinder on the Z=0 plane Using Figure 4.7, a normal vector can be pictured as a perpendicular pin in a flat piece of cardboard. When placed on a cylinder, the cardboard becomes the tangent plane of the cylindrical surface and the pin is normal to the surface. As the cardboard is moved around the curve of the cylinder, the pin traces a straight line on the pinhole mask. In the disparity gradient plane, this line is independent of vertical position because the nonnal vector of this cylinder has no vertical component. The disparity gradient of any point on the cylinder must lie somewhere on a horizontal 72 line and since the surface normal has no vertical component (Dy = 0), all points lie on the x axis in the disparity gradient plane. Using this intuition, the normal to a vertical cylinder that is sloped away from the observer will trace a line on the disparity gradient plane that does not go through the origin. At some point as the piece of cardboard is moved around the cylinder, the pin will be oriented parallel to the x axis and the vertical component of the disparity gradient will be zero. When the cardboard is moved over the surface to face toward the pinhole aperture, the vertical component of the disparity gradient will be a maximum. Figure 4.8: Projection of surface normals of a non vertical cylinder on the Z=0 plane 4.3.2 Identifying Cylinders in Disparity Gradient Space The previous sections have looked at how simple surfaces can be found using stereo vision. In Section 4.1, a plane was imposed on the disparity map. The differences from this model were identified as obstacles. In an alternate technique (Section 4.2), points are grouped in some space and then characterized as belonging to a surface depending on how the points group in this space. This technique is similar to a search in which the final decision depends on finding sufficient data points with which to unambiguously identify the grouping. 73 Stereo vision can perceive depth from disparity and determine shape from changes in disparity. A change in disparity is the fundamental evidence of shape in stereo vision and disparity gradient space is just this; a fundamental space in which stereo vision determines shape. The previous sections show how, in disparity gradient space, planes map to a single point and cylinders map to a curve in this space. Every point on a given planar surface, to which a disparity gradient can be assigned, will confirm the existence of the plane. A plane is singular in this space. Similarly, on a small scale, some section of a cylinder will appear like a planar surface. The existence of a cylinder is confirmed by a locus of points that trace out a curve in disparity gradient space. As such, the support for a cylinder is not as dense as that for a plane. Since any point on a cylinder will map onto a curve in disparity gradient space, the location of the point on the cylinder is not important providing that the surface is adequately sampled. Figure 4.9 shows the disparity gradients taken at random points on the surface of a 45 gallon dram. Disparity Gradient Space of Vertical Drum 0.2 •£ CD 1 0.1 O >, .*± a o b 1-0.1 CD > -0.2 i ^ j m w w w i mini i i***i»ag! j f f i . . . -0.2 -0.1 0 0.1 0.2 Horizontal Disparity Gradient Figure 4.9 Disparity gradient at random points on a 45 gallon drum. The figure above shows a nearby 45 gallon dram and the disparity gradient measured on its surface. The dram is not exactly vertical and the line in disparity gradient space is slightly off 74 horizontal. In later sections, cylinders are identified in terms of their null space, indicated by the condition of an eigenvalue matrix. In this section a cylinder is mapped to a curve. It will be seen that the condition of the eigenvalue matrix is related to the width of the curve. In the limit, the width of the curve is zero; no noise exists and the curve has only length. It is length that distinguishes a curve from a point and hence a cylinder from a plane. The length of the curve depends on the ability of stereo vision to measure disparity gradients across the surface of the cylinder. Section 2.6 looked at expected disparity changes with respect to distance and surface orientation. On the vertical drum, 70% of the visible surface has a slope of less than 45° with respect to the optical axis of the camera. Accordingly, 70% of the information about the disparity gradients has a value: dD dr tan <b 1 Z Z where Z is measured in baseline units. A common baseline used in this thesis is 10 cm. with a typical workspace greater than 2 meters (Z > 20). Figure 4.9 has very little noise, such that the line can be recognized over a horizontal disparity gradient of ± 0.1. ; an angle of 63° of the drum surface at Z = 20 or about 90% of the cylinder image. In order to estimate disparity derivatives over this surface, the cylinder must close to camera and be physically large; filling most of the image. In this example, where 90% of the surface can be used to estimate curvature, the remaining 10% of the visible surface is close to the edge of the cylinder. The derivative estimate needs finite support and data in a region at the periphery of the cylinder is not always on the surface of the cylinder. For this reason, it is difficult to measure the curvature of small cylinders. The disparity gradient of a smaller cylinder is shown in the next image. In Figure 4.10, the cylindrical surface is smaller than the 45 gallon drum but the same distance away from the camera. The tree is sloped away from the observer and to the left, distorting the curve from a straight horizontal line. The measurable range of the disparity gradient 75 is less than that of the 45 gallon drum surface and noise becomes more of an issue. Even so, the surface is best characterized as a cylinder rather than a plane (a point) or a sphere (a symmetrical region on the disparity gradient plane). Figure 4.10 shows the disparity gradients of points chosen randomly on the surface of the tree. Disparity Gradient Space of Tree Surface Patch 0.05 S Figure 4.10 Disparity gradient at random points on a nearby tree surface. In comparison to the technique of forcing a continuous model on the disparity map, measurement of disparity gradients is less robust since it must search for data conforming to a model. This data can be ambiguous since more than one object can exist in the image. In forcing a continuous floor model on the disparity map, all the data in the disparity map can take part in the search. As shown in Section 2.4, many horizontal ramps can conform to this model but in practical circumstances only one surface in the image conforms to the global model. 4.4 Shape and Orientation of Local Surface Patches As shape becomes more complicated, the characterization of shape becomes dependent on localized measurements. It was pointed out in the previous section that the disparity gradient is the fundamental evidence of shape in stereo vision. From the disparity gradient it is possible to 76 estimate a surface nonnal using the results from Section 2.3; a formulation that relies on a tangent plane approximation to the local surface. When a surface is smooth, without creases and folds, it has a tangent plane at every point; a condition for smoothness (Horn 1989). The orientation of the tangent plane describes the orientation of the surface at that point. Local shape can be determined by differences in tangent planes at different points on a surface patch (Horn 1989, Koenderick 1990). In describing such surfaces, the basic notion is that tangent planes are like the first derivatives of shape; linear functions that approximate the surface. Using this heuristic, curvature is like the second derivative of shape (Koenderick 1990) and Gaussian Curvature is a measure of surface shape that is described by the divergence of the nonnal directions over a unit surface area (Koenderick 1990). In this respect, curvature is a two dimensional space that is orthogonal to the tangent plane. In a 3 dimensional world, curvature can be zero (a plane where the normals don't diverge), one dimensional (a cylinder), or two dimensional (an ellipsoid). In die following sections, local shape is determined by the way that unit normals on a surface patch diverge from a planar fit Local surfaces are identified by the asymmetry in surface curvature and orientation is determined by the principal direction of this asymmetry. Figure 4.11: surface patch with 3 normals As a point moves on a curved surface, the nonnal vector rotates about a vector that is tangent to the surface. In regions with high curvature, the local nonnal vectors are all rotated to point in slightly different directions. Each normal vector in a small surface patch can be described 77 by a component in the average normal direction and tangent components along the surface. In a simple example, normals on the cylindrical surface in Figure 4.12 diverge in a direction around the cylinder; not along it. There is no divergence of surface normals in a direction parallel to the cylinder's axis of symmetry. When traveling over this surface, the surface normals all rotate about a fixed direction which describes the orientation of the cylinder. In a stereo image of a cylinder, points in the image where the surface normals no longer conform to this model can be considered to be outside the cylinder boundary (not including the planar ends of the cylinder). n 1 Figure 4.12: Surface normals on a cylinder. 4.5 Quadratic Surfaces On a locally smooth surface, the derivatives are continuous and well behaved. Such functions, with continuous second derivatives, are often treated like quadratics over a sufficiently small region (Scales 1987, Lewis 1986). In optimization and control problems, this local quadratic behavior is exploited in Newton-Raphson algorithms to find local minimums (Numerical Recipes). Such approximations are common over locally smooth regions of a function (Koenderick 1990). Using differential geometry, local surface curvature is determined by fitting a quadratic surface to a small surface patch (Scales 1987, Elber&Cohen, 1993). A quadratic surface centered on some 3 space position, x0, can be written as: F = i ( x - x 0 ) T A ( x - x 0 ) + b T (x-x 0 ) + c (4.3) 78 The surface normal is the gradient of this function, g = A ( x - x 0 ) + b (4.4) and the Hessian matrix is A. In Equation (4.3), the quadratic surface is described in terms of: • a constant term, c; some average position of the points on the surface. • a planar term fit through the surface. • a curvature term that corrects the plane to better fit the surface. When x = x 0 , at the center of the patch, the surface normal, g, is equal to the normal of the tangent plane, b. The quadratic surface is symmetric such that the normals of a quadratic surface patch have a constant term and a term that changes as a point moves over the local surface. At different positions on the surface patch the tangent plane normal, b, is adjusted by the product of the Hessian and the displacement vector. How the surface normals behave in a region about x 0 is described by the Hessian matrix (which is sometimes called the curvature matrix (Numerical Recipes, Lewis 1986)). If the Hessian is unknown then the ensemble of normals on the surface patch contains information about the orientation of the tangent plane and about the quadratic curvature of the surface. This is the basis of gradient descent techniques (numerical recipes, Scales 1987). 4.6 Surface Curvature Noise and the effects of surface curvature cause the normals on the surface patch to point in slightly different directions. For simplicity, the surface normals in Equation (4.4) are considered to be perturbations of the unit normal to the tangent plane at the center of the surface patch. At different locations, the unit normal at the center of the patch, b, is modified by three vectors; one that reduces the component in the same direction as b and two that point along the surface in the directions of surface curvature. The dot product of b and a neighbouring unit normal, g, is the 79 cosine of the angle swept out by a radius of curvature in the direction (x - x 0 ) and the cross product of these vectors is the sine of the angle; the strain vector in the tangent plane. Figure 4.13 shows a slice through a surface patch and two associated surface normals. The radius of curvature is a minimum when the slice is perpendicular to the longitudinal axis of the surface patch. Figure 4.13: Surface normals on a cylindrical section. . Figure 4.13 shows that an estimate of the distance between points on the surface is needed to determine the radius of curvature, R. A unit displacement on the surface as given by (x - x 0) will produce a change in the surface normal (as determined by the Hessian matrix) that reflects the radius of curvature. In figure 4.14, let g] and g 2 be the components of the vector g as shown such that g 2 « g]. For small angles, the radius of curvature is approximately |gi| / |g2|. In this case, where a 'unit displacement' is used, the radius of curvature is defined directly by the values in the Hessian matrix. Since all infonnation about the visible surface is derived from the image, what constitutes a unit displacement on the surface is image dependent. Unit displacements are measured in terms of pixels which are related to the world coordinates by stereo geometry. The perceived curvature of objects in the world is detennined by this conversion. 4.7 Surface Energy On the surface described in Equation (4.3), the outer product of a local surface normal at some position in 3 space, x, is a 3x3 matrix. 80 gig? = [A(x 0 - ) + b][A(x 0 - x,.) + b] T = [A(x0 -x,.)][A(x0 - X / ) ] T +[A(x0 -x,)]bT +b[A(x0 -x,.)]T +bbT The autocorrelation matrix of the surface normals on a quadratic surface patch is the expected value of the outer products. R = E,Jg(j0 - x,.)g(x0 - x,)T] = - G G T Tliis expression measures the correlation between the 3 components of the surface normal over the visible surface. On a small surface patch, the surface normals all point in the same general direction such that the surface patch of a curved surface appears as a warped plane. If the columns of an mx3 matrix, G , are m unit surface normals, (m > 3), then G G T is a 3x3 matrix that measures the average directional energy associated with the normal vectors. The eigenvectors of G G T form the basis that describes the principal directions of surface normals on the patch. For a point on a surface with corresponding unit normal g 7 , the directional cosine with respect to some unit test vector, u , is the inner product, g , T u . When m surface normals, (m > 3) make up the columns of a ;??x3 matrix, G , then G T u = e is a column vector of the m directional cosines of the surface normals with respect to the test vector u. The total energy in the directional cosines is E = eTe = u T G G T u where G G T is a 3x3 symmetric, positive definite matrix. G G T = SgigiT; the sum of the autocorrelation matrices for each surface normal. Regardless of the number of measurement vectors used, G G T is always a 3x3 matrix that can be updated as more surface normals become available. The vectors that have the largest and smallest directional energy are the eigenvectors of G G T . On a quadratic surface patch, the three eigenvectors describe the normal to the tangent plane, b , and the principal directions of curvature on the tangent plane respectively. The eigenvalues give the energy associated with each of these directions. 81 4.8 Surface Energy and the Membrane Model of Surfaces Finding the energy of the surface normals is consistent with the viewpoint invariant membrane surface model. In this model, the surface is described by the smooth shape of a membrane that has been stretched out of the tangent plane. In this respect, deformed membrane surface corresponds to the quadratic surface. The energy functional for a physical membrane over some small surface patch is (Blake & Zisserman 1987, Terzopoulos 1988, Whitten 1993): J = \\\hJ+W\dZdl (4-5) where £and n are vectors in the tangent plane and v is the displacement from the tangent plane; and are the components of the surface gradient that are in the tangent plane. The energy in the membrane when it is stretched out of the tangent plane is given by the energy of the components of the surface gradients that are in the tangent plane. Equation (4.5) measures this energy and this is the same energy expressed by two of the three eigenvalues found in Section 4.8. The membrane model is often used to smooth an image depth map d(x,y). The problem is usually formulated not in the tangent plane {£,,rj) but in the image coordinates (x,y). Each location in the image can be assigned a depth value and the membrane model is used constrain the solution. The membrane surface v(x,y) is the one that best satisfies: J = \\\{v-dy +A2[{vJ +(vyf]dxdy (4.6) When the problem is set in the image coordinates, the solutions favour surfaces that are parallel to the image plane. This is a good strategy for aerial photographs (Whitten, 1993) but, as suggested in section 2.4, a mobile robot will see the horizontal surface of the ground as the predominant plane. A constraint that favours solutions parallel to the image plane will fight a horizontal plane solution where the depth is continuously changing along the observers line of sight. Blake and 82 Zisserman (1987) show that the image plane constraint works well when the surface is oriented toward the observer, within some small angle. However they show much better results for cylinders and other curved shapes when a viewpoint invariant algorithm is used. The problem with this technique is to find the angle between the observers line of sight and the tangent plane normal on the surface patch. As Blake and Zisserman point out, a viewpoint invariant constraint needs an estimate of the orientation of the tangent plane since the membrane energy is described in terms of a deflection away from the tangent plane. Their technique is to estimate the orientation of the surface patch and compensate the apparent surface gradients accordingly. The method outlined in the previous section uses the autocorrelation matrix of the local surface.normals. One of the eigenvectors of this matrix is normal to the surface tangent plane and the other two eigenvectors are in the principal directions of the surface curvature. While the usual membrane constraint is formulated in the tangent plane (a two dimensional problem), using this three dimensional system automatically recovers the viewpoint invariant information. 4.9 Simple Object Recognition using Surface Energy Three simple shapes, flat plates, cylinders, and spheres can be characterized by the three eigenvalues of the equation G G T x = X2x. On the surface of a flat plate, the surface normals point only in one direction and the null space of the surface normals is the two dimensional surface of the flat plate. The matrix G G T would have one large eigenvalue and two that, in the absence of noise, are equal to zero. On a cylindrical surface patch, one eigenvalue is zero and the other two are non zero. On a spherical patch, all eigenvalues would be non zero since the surface normals of a sphere point equally in all directions (but most of the visible surface energy will be directed toward the observer). In practice, noise will make all three eigenvalues non zero, however, a decision boundary to recognize simple shapes can be constructed from the ratios of the eigenvalues. 83 This can also be related to the classification of shapes in disparity gradient space as discussed in Section 4.3. The null space of the plane is 2 dimensional such that all points on the plane map to a single point in disparity gradient space. The null space of a cylinder is one dimensional such that all points on a cylinder map to a curve in disparity gradient space. Similarly, the spherical surface has no null space and points on a sphere map to a circular region in the disparity gradient plane. Recognizing these displacements of disparity derivatives is, in effect, a qualitative measure of the second derivative of disparity. (Strang 1988) points out that this eigenvalue problem defines an ellipsoid in three space. The eigenvalues define the dimensions of the ellipsoid along the axis defined by the eigenvectors. Since the eigenvalues are detennined by how the surface nonnals change with respect to each other in some region on the surface, the eccentricity of the ellipsoid is determined by the shape of the visible surface that produced the surface nonnals. Finding the space defined by the surface normals can determine the shape of visible surfaces without explicitly evaluating the directional derivatives of the surface nonnals. The ratio of the square roots of the eigenvalues determine the radius of curvatures in terms of'unit displacements' on the visible surface. 4.10 Identifying Local Shape and Orientation using SVD This thesis needs to recognize at least two surfaces, planes (for navigation) and cylinders. Both of these are singular in disparity gradient space. Several successful techniques have been developed in course of this thesis to recognize the orientation of cylinders by their one dimensional null space. The routines are recursive and update estimates of the cylinder's null space as each new point is added. They can converge quickly (within a few samples) however they can have some problems with 2 dimensional null spaces and so the candidate nonnals used to evaluate the surface energy must be selected. A more robust estimator is provided by singular value decomposition (SVD). The original techniques appear in the appendix. 84 The SVD algorithm has several advantages in this application: • SVD is stable in systems with 1 and 2 dimensional null space in the presence of noise. This allows surface elements to be classified by the eccentricity of their eigenvalues • The teclmique for object recognition formulated in the previous section uses surface energy to characterize shape. This energy formulation has positive eigenvalues and the SVD produces only positive eigenvalues so nothing is lost in using SVD. • The matrix is 3x3 which is easy for SVD . Both matrices output from the SVD are equal. • SVD is readily available in software libraries. • Internally, the SVD routine is iterative and so it may be possible to update estimates as new data becomes available rather than re running the full routine each time a new data point becomes available. Press (Numerical Recipes) cautions against tinkering with the published SVD routine however, a recursive SVD would be a valuable algorithm in this application. This is reserved for future work. Even so, the SVD converges quickly (within a few data samples). Figure 4.14: SVD error in estimating cylinder orientation vs. number of measurements A minimum of 3 surface normals need to be accumulated in the energy matrix, G G T , in order to characterize a surface element. Experience has shown that more than 5 normal 85 measurements are needed in the energy matrix to get a useful estimate of the cylinder orientation. After this, a single application of the SVD algorithm can give information on the cylinder orientation. This is examined in Chapter 5. Figure 4.15 shows the normalized eccentricity given by the ratio of the smallest eigenvalues that is used to identify the cylinder. Even with such an obvious cylindrical shape that fills most the image, the local ratio in this example is about 9:1 indicating that there is significant noise in the null space and that most of the visible surface elements are oriented toward the observer and shown low disparity gradients. There are implementation issues that affect the signal and noise sources in perceiving shape with stereo. Chapter 5 shows how better results are achieved. 0 10 20 30 40 Figure: 4.15 Normalized eccentricity vs. number of measurements As shown in Section 2.5.2, even with a uniform distribution of oriented surface elements, most visible stereo surface elements appear oriented toward the observer. This limits the available 'signal' for determining shape while no such limits are placed on noise. 86 4.11 Summary This chapter examined the role of shape in stereo vision. The concept of global shape has been used to solve navigation problems using stereo vision (Burt 1995), (Kumar 1994). This global shape model exists in the disparity image of the scene. In Section 4.2, disparity gradient space is introduced as the fundamental space for shape in stereo vision. It is shown that the perception of local and global shape can be predicted in this space. In this space planes map to a point. The concept of a global plane is robust because all of the image contributes to the continuous definition of a single point in disparity gradient space. Obstacles are detected as regions that do not conform with this model. It is then shown that cylinders map to a line in this space. A global cylinder such as the interior of a pipe will map to a line in this space. This is not as robust a shape as a global plane since the information in the image becomes distributed in this space. This line signature is shown to be able to characterize local cylinders in Section 4.3 however to estimate cylinder orientation one must fit an ellipse to this data. A more robust estimator uses the surface energy associated with local shape. Sections 4.4 and 4.5 discusses how the local behavior of surface normals are characteristic of shape. The curvature energy in a surface patch is used to characterize surfaces in Section 4.7. This is shown to provide a viewpoint invariant membrane model of the surface. This has been approximated by Blake and Zisserman but the technique in Section 4.7 provides this solution directly. The goal of this thesis is to develop a technique to identify a log; to navigate towards it; and the find the orientation of the log so that it can be picked up. Section 4.10 identifies an estimator for shape and orientation of cylinders. This is combined with the global plane model for navigation to stereo vision can be used in this task. 87 Chapter 5 Implementation Issues and Results This chapter examines the primitive vision elements developed in earlier chapters for remote supervisory control (teleoperation). Robot recognition and navigation should be as autonomous as possible but in this application they are used more for confirmation; providing terms for communication about the world upon which both the operator an remote robot agree. Both need to recognize a navigable plane and locate obstacles and both need to recognize a target; in this case cylinders. This chapter tests the stereo vision process and the extraction of shape elements that allow the robot to cooperate with a remote operator. 5.1 Navigation and Obstacle Detection This section examines the implementation of global models for navigation. Two models are used, a global plane to solve part of the problem posed in this thesis, and a model of a hallway. The hallway example is introduced by a synthetic image to show how the ideal global model concept can be extended to more complex environments. The basic technique is similar to the principles of perspective in art, namely the intersection of lines and planes that radiate from a vanishing point. 5.1.1 Navigation on an Infinite Plane Modem log sort yards are concrete surfaces upon which large machines move about, picking up logs and moving then to designated piles. In this outdoor, industrial environment the concrete surface is marked with dirt and bark strips giving it a better visual texture than a clean indoor surface. Such visual texture helps reduce noise in estimating the concrete floor as a surface. 88 The following example detects a log as an obstacle on a indoor concrete surface using the global plane model. It does not attempt to characterize it as a log. Figure 5.1: Log on concrete floor. 2 0 0 1 5 0 1 0 0 5 0 0 Figure 5.2: Side view and top surface of disparity map of a log on concrete surface. Figure 5.2 shows a mostly side view of a disparity surface in which a log rests on the floor about 5 meters in front of the camera (between rows 50 and 100 in Figure 5.2). In this region, it is evident that the log violates the downward slope of the flat surface of the floor. This particular disparity map was obtained in real time (~10Hz rate) using a commercial stereo camera system from Point Grey Research. The camera works extremely well however there are often holes and spikes in the disparity map, particularity on this concrete surface. Part of the reason for using this 89 image was to examine the technique of assigning unmatched regions into the model of the floor rather than assigning them a disparity of value of -1. It was felt that, for this task, forcing unknown disparities to conform to the floor model made few assumptions and simplified processing. The cement floor has poor visual texture and Figure 5.2 shows the perceived disparity surface of the flat cement floor has noise and noise spikes. Some top rows are noisy and appear as a small ridge even though no obstacles were present at this location in the image. The back wall is also visible in this disparity surface as a second region that also violates the floor model. Notice the back wall appears as a flat horizontal region where the disparity values are small. The wall is vertical and has constant disparity and appears as a flat horizontal region in Figure 5.2. Visual artifacts in the roll-up door and featureless concrete surfaces of the back wall also generated disparity spikes. When the disparity map is adjusted with respect to the floor model, obstacles are detected by values greater than some threshold. It has been reported that obstacle detection was very dependent on model errors and on the detection threshold (Zhang et al. 1994, Burt et al. 1995). The problem is that noise can distort the estimate of the floor model or appear as false obstacles. A one dimensional section of the disparity map is used in Figure 5.3 to examine the problem of estimating the floor and finding obstacles in the presence of noise. Two techniques are used to detect obstacles. In Figure 5.3a, the floor model is subtracted from the disparity map and in Figure 5.3b, the disparity map is divided by the floor model. In their simplest form, obstacles appear as constant disparity surfaces that violate the disparity gradient model of the floor. This simple obstacle (a vertical surface facing toward the camera) has a constant disparity value whereas the disparity model of the floor decreases linearly as the row number (Figure 5.2). At the base of the obstacle, the object disparity and the model disparity have the same value and there is no height difference between the floor and the point at which the obstacle joins the floor. Working directly with the disparity infonnation, the obstacle is 90 20 15 10 5 0 Obs tac le Detection by Subtraction Obstac le Detection by Normalization • A V 50 100 150 200 Imaae R o w # 250 100 150 200 Imaae R o w # 250 Figure 5.3a: Residual after floor subtraction. Figure 5.3b: Residual after floor normalization. detected as a deviation from the floor model. In the case of a constant disparity surface, the signal to be detected is u = ay. Specifically: 1. subtracting the floor: (obstacle + noise) - model = (k+ n) - (k- ay) -ay+n where k is the constant disparity of a vertical obstacle, y is the vertical displacement in the image and a is slope of the floor disparity model. As expected, the signal increases linearly with the obstacles height from the floor. In this expression, noise is assumed to be independent of position in the disparity map. To be detected, the signal must exceed the uniform global noise, 77. A low object near the middle of the screen can be masked by foreground noise as in Figure 5.3a since detection thresholds tend to be global. Assuming that it is more important to detect nearby obstacles than objects in the background, normalizing with respect to the floor model can improve navigation. 2. normalizing the floor: (obstacle + noise) k + n _ (k- ay) + ay+ 77 _ ^ + ay n model k- ay k- ay k-ay k-ay 91 If disparity noise is considered to be random and independent of position in the disparity map then normalization of noise by the floor model can be simplified by considering a noise spike at a given row position in the image. Assuming that the model disparity of the image is approximately zero: (obstacle + noise) model 1 + + (4.1) k orrow# where k is the constant disparity of a vertical obstacle and a is slope of the floor disparity model. In this expression, k and arow# are equal at the horizontal position where the obstacle meets the floor. In Figure 5.3b, k is approximately half the row number at which the obstacle joins the floor {i.e. a « 0.5). An obstacle is appears when the normalized value is greater than 1. If the signal to noise ratio of floor subtraction is designated, iS7v7?0 = — then for a Disparity noise spikes in the foreground appear at a large row# and do not as easily mask obstacles in the background. Using normalization, an obstacle near the middle of the screen is less likely to be masked by disparity noise generated in the bottom of the screen (Figures 5.3a & 5.3b). A real obstacle in the center of the image is less likely to by masked by nearby disparity noise. Since the disparity of nearby objects is larger than the disparity of distant objects, nearby mole hills can have the same disparity as distant mountains. It makes sense to normalize the disparity image with respect to the floor model. This restores the relative size of the obstacles. In this respect, normalization is similar to working in world coordinates of height (Zhang et al. 1994) but it does not require the extra step of conversion into this. As the row number decreases, the sensitivity to noise increases and it becomes more difficult to detect the point at which the obstacle normalized floor, the SNR with respect to SNR0 is: SNR _ arow# SNR0 k ^ J 92 joins the floor. The figure below shows the log and part of the back wall detected using threshold = 1.1. 50 100 150 200 50 100 150 200 250 300 Figure 5.4: Log and back wall in Figure 5.1 detected by floor normalization. In Figure 5.4, the back wall had poor visual texture and the disparity surface is not well defined, nevertheless it appears as a significant obstacle in the image. In another example, the log is oriented directly toward the observer. The disparity along the 2 black lines in Figure 5.5a is shown in the next figures. Figure 5.5a: Log oriented toward observer to examine disparity along the vertical lines. 93 In Figure 5.5a, one vertical line lies along the floor and the other runs along the top of the log. Using the method in Chapter 3, the disparity that is associated with these lines is shown in Figure 5.5b. The floor line shows noise associated with poor texture but the disparity follows a straight line until it hits the back wall. In the second line, the vertical face of the log closest to the observer is very evident. This is the flat, constant disparity region at the top end of the line. Figure 5.5b: Raw disparity along vertical lines shown in Figure 5.5a. Figure 5.5c: Disparity after subpixel interpolation using method in Section2.6.3. 94 As expected, the line associated with the floor shows a constant disparity gradient. Ideally, all points on the floor will map to a single point in disparity gradient space. Figure 5.7: Contour plot of interpolated disparity image. Figure 5.8: Normalized obstacle map using threshold of 1.1. For simple navigation and obstacle detection the disparity map need not be interpolated. The obstacle map in Figure 5.8 is virtually the same in both cases. Instead of an obstacle map, an elevation contour map (Chauvin et. al. 1998) can be created by simply varying the normalization threshold. This can form a topographical map of the local environment. 95 5.1.2 Navigation in a Corridor In this application of global shape models, a corridor is defined by the intersection of 3 or 4 planar surfaces. It can be an interior hallway or an urban street defined by a horizontal road surface and the vertical surface of buildings. When these surfaces are projected into the distance, they converge to a point. A simpler problem is robot navigation along a vertical wall which is the intersection of two planar surfaces. The horizontal floor has only vertical disparity gradient and the vertical wall has only horizontal disparity gradient. Along a line that marks the intersection of the wall and the floor, the disparity of the floor and the disparity of the wall must agree. When the disparity gradients on each plane are measured then , in tenns of image row and column coordinates, the description of the floor disparity can be given as D = a row + a, and the description of the wall disparity be D = ficol + b. The intersection of the wall and floor is a.line in the image given by: P i b-a row = — col H a a This defines a line that is the intersection of two planes by row and column coordinates in the image. It does not rely on an edge algorithm that detects the image intensity changes at this boundary. All points on the planes can contribute to this estimate. In a corridor model, the image is segmented using intersecting planes in the disparity image. Previously, the robot expected to find itself on a geometric plane and could use this information to define obstacles in tenns of this model. It rejected walls as not conforming to the expected model and so walls became obstacles with respect to the horizontal plane. Possessing a model for a wall and floor, the robot can define obstacles in the floor segment while doorways are defined in tenns of the wall segment. Similar image segmentation is possible with 3 surfaces defining a maze or 4 surfaces defining a hallway. The visual image is segmented according to models imposed on the disparity image. 96 Figure 5.9 shows a synthetic image of a hallway. Figure 5.10 shows the disparity image of the hallway where the planes that make up the hallway intersect at a point. Since all points on a given plane have the same disparity gradient, it is straight forward to segment the image according to the planes of the hallway. Obstacles and doorways become evident using the same techniques described in Section 5.1.1. Figure 5.9: Synthetic Hallway Image Figure 5. JO Obstacle Detection in Hallway In Figure 5.10, the ceiling is not detected since is not used as a target surface. Hallways are evident as negative obstacles and protrusions from the floor and walls are shown as obstacles. 97 5.2 Recognizing Logs n 1 Figure 5.11: Cylinder geometry. If points on a log's surface can be located in an image then the surface normals at these points can be estimated from the stereo disparity gradient. Let «, and n2 be two different vectors that are normal to the surface at visible points P\ and Pj on the log. The unit vector which gives the direction of the axis of the log is perpendicular to both these vectors. «, x n2 r , -. r Wo = i - - . = < ¥ + boJ+cok K x,h\ Given that points P\{X\,Y\,Z\) and i ^ O ^ ^ z ^ ) n e o n m e 1°§ s u r f a c e a n ( l m a t w, and « 2 are their outward unit normals then (P, - ) and (P2 - Rn2) define two points on the log's axis where R is the radius of the log. Using the vector n0 defined above, the straight line connecting these two points along the log axis satisfies the relationship: AX _ A7 _ AZ ao bo co If we write nl = aj + bj + cxk and n2 = a2i + b2j + c2k then the log radius, R, is: R_b0(X]-X2)-a0(Yl-Y2) In ideal teleoperation, identifying two points on the log surface provides the information needed to estimate the orientation and radius of the log. In practice, noise does not allow this. 98 5.2.1 Wliat to Measure. In this thesis, cylinders are identified by the eccentricity of the surface normal eigenvalues. The orientation of the cylinder corresponds to the eigenvector associated with the smallest eigenvalue. The signal of interest is the surface energy that is evident in the divergence of the surface normals. Noise, on the other hand, exists in all components of the surface normals and serves to mask surface curvature on a scale where the curvature is small with respect to noise. The technique for measuring surface normals has been shown to be an approximation to the tangent plane at some point on a visual surface. Normals gathered on a small enough surface patch will appear oriented in the same direction and will offer little evidence of an object's shape. If the surface patch is small then the smallest eigenvalues associated with the nonnals on the small patch are more a measure of isotropic noise than an indication of shape. The technique proposed is not to measure shape as the divergence of three (or more) nonnals defined by points on the surface but to measure shape as the divergence of three (or more) nonnals defined by patches on the surface. 5.2.2 Simulation of Noise Effects A measure of apparent curvature is developed from the ratio of the largest eigenvalue to the sum of the other two eigenvalues. This will be largest for a planar surface. In the presence of noise the smaller eigenvalues can be well populated making it difficult to distinguish shape. When the small surface patches are treated as planar elements then a noise estimate can be derived and the shape estimate corrected for noise. In the following examples the SVD was run on three simulated surface patches each of which used five sample nonnals. The final estimate of shape used all fifteen nonnals. This is compared to a single SVD of three nonnals where each nonnal is the sum of the five normals measured on a patch (five times the average of the normals on the patch). 99 1 three patches on a plane with normals: n = [(rand - .5) (rand - .5) , (1 + rand -5)] SVD 1 SVD 2 SVD 3 SVD of all no noise SVD ofav. eigenvalue 1 .14 .22 .35 .85 0 .22 eigenvalue 2 .54 .68 .41 1.78 0 1.24 eigenvalue 3 4.28 4.10 4.24 12.37 15 61.4 apparent curvature 6.29 4.55 5.57 4.7 oo 42.0 eigenvalues(l+2) .68 .90 .76 2.63 eigenvalues( 1 +2)-noise 0 0 0 .29 corrected curvature 42.6 00 Table 5.1: Eigenvalues associated with surface energies on a plane in the presence of noise. In the example, noise populates the minimum eigenvalues of the surface energy in each of the three patches such that each patch appears curved. The total surface energy of all fifteen normals shows the accumulation of this effect. When each patch is treated as a plane, the smallest eigenvalues should be zero. The total energy in the two smallest eigenvalues is a measure of the noise in the surface estimates and is subtracted from the sum of the two smallest eigenvalues in the SVD of total column. The result is that the curvature of the planar surface is much better defined. 2. three patches on a cylinder with surface nonnals: nl = [(rand - .5), (rand - .5), (1 + rand - .5)] n2 = [(.5 + rand - .5), (rand - .5), (.866 + rand - .5)] n3 = r(-.5 + rand - .5). (rand - .5). (.866 + rand - .5)1 SVD 1 SVD 2 SVD 3 SVD of all no noise SVD ofav. eigenvalue 1 .28 .22 .10 .62 0 .03 eigenvalue 2 .52 .41 .20 3.98 2.5 15.49 eigenvalue 3 4.20 4.37 4.70 10.40 12.5 50.67 apparent curvature 5.25 6.94 15.66 2.26 5.0 3.26 eigenvalues(l+2) .80 .63 .30 4.6 eigenvalues( 1+2)-noise 0 0 0 2.87 corrected curvature 3.62 5.0 Table 5.2: Eigenvalues associated with surface energies on a cylinder in the presence of noise. 100 3. three patches on a sphere with surface nonnals: nl = [(rand - .5), (.5 + rand - .5), (.866 + rand - .5)] n2 = [(.433 + rand - .5), (-.25 + rand - .5), (.866 + rand - .5)] SVD 1 SVD 2 SVD 3 SVD of all no noise SVD ofav. eigenvalue 1 .80 .03 .09 3.31 1.87 5.23 eigenvalue 2 .45 .54 .53 2.17 1.87 12.23 eigenvalue 3 3.75 4.43 4.38 9.52 11.25 44.99 apparent curvature 3.0 7.77 7.06 1.73 3.0 3.26 eigenvalues(l+2) 1.25 .57 .62 5.48 eigenvalues( 1 +2)-noise 0 0 0 3.04 corrected curvature 3.13 3.0 Table 5.3: Eigenvalues associated with surface energies on a sphere in the presence of noise. The tables for the sphere and cylinder show how the ratio of the smallest eigenvalues is used to distinguish between a sphere and a cylinder. These shapes are differentiated from a plane which has very little energy in the smallest eigenvalues when compared to the largest eigenvalue. The effect of SVD on each surface patch (SVD 1 etc.) is a least squares value for the surface nonnal which corresponds to the average of surface nonnals on a planar patch. Since only five normals are used in each estimate, the residual noise in the three estimates prevents the corrected curvature of the plane from reaching infinity and distorts the corrected curvature of the sphere and cylinder. Similar noise improvements are obtained by directly averaging the 5 surface normals evaluated on each surface patch. In this case, the SVD calculation is run once using the three average surface normals. The noise estimate is the total noise in all three surface patches. Each surface patch is assumed to be small enough to be planar and the noise is estimated by how the five local normals on a patch deviate from this model. This total noise in all three surface patches is used to correct the smallest eigenvalues in the SVD and recover their ratio for characterizing a shape. 101 5.3 Where to Measure All visible points on a plane equally support the identification of the plane since, in disparity gradient space, all points on a plane map to a single point. There is no a priori information that says one point has a better signal to noise ratio than another in recognizing a plane. This is not the case for cylinders and spheres. As pointed out in Section 2.5.2, most surface elements appear oriented toward the observer and contribute little evidence of shape. Characterization of shape as the divergence of surface normals will show good signal to noise ratio when the divergence is large. On a cylinder, this occurs when comparing the surface orientation on one side with the orientation of surface elements on the other side. In the previous section the cylinder normals were located at -30°, 0° , and +30°. Half the visible surface of a vertical cylinder is within ±30° of a normal vector oriented toward the observer. Within this range the normal component oriented toward the observer varies from 1 to 0.866 indicating that normals vectors clustered in this region will not yield strong curvature signatures. Larger signals would be derived from comparing locations at ±45° on the cylinder surface where the vertical and horizontal components of the surface normal are equal. The highest signal to noise ratio lies along 2 lines at ± 45° on the cylinder surface and parallel to the axis of the log. The projection of the null space of the log on the image plane is the preferred search direction and displacements in the direction of curvature will locate measurement points with high signal to noise ratio. On the surface of a sphere the highest signal to noise ratio is in circle at 45° to the normal oriented toward the observer. 5.4 How to Measure Global shape models, such as the floor, can be used to organize a scene. Regions that do not conform to the model can be evaluated in terms of local shape models. In teleoperation, it is assumed that the operator will point to the object of interest and the robot will be expected to recognize the object. Such a scenario does not require that the robot autonomously locate cylinders. 102 A location on the visible surface is used to find two other nearby locations with similar disparity. These three locations are used to estimate the directions of minimum and maximum curvature associated with the two smallest eigenvalues in the surface energy. Along the direction of maximum curvature a new point is found that is closest to the observer. This is indicated by a maximum disparity value. This location and two on each side of it in the direction of maximum curvature are used in a second estimate of shape. The third set of measurements are taken in the direction defined by the minimum curvature. These surface energies are added to the previous set and new estimate is generated by SVD. A radius estimate is also generated. The process is repeated as permitted. In the case of logs and cylinders, estimates of the radius, the orientation vector, and the point closest to the observer allows the log to be outlined without finding its actual boundaries, some of which can be occluded by other obstacles. 5.5 Drum Image Figure 5.12: Drum image at 2.5 meters, f= 400 pixels, b =17.4 cm. This image was used in Chapter 4 to demonstrate cylindrical signatures in disparity gradient space and to show uncorrected surface energies in the presence of noise. In this example, noise is estimated and the eigenvalues show a better definition of shape. Even with such a well defined image, a mesh plot of the disparity surface shows that only a few pixels of disparity 103 information are available. This shows the limits of the ability of stereo vision to determine shape and highlights the importance of subpixel interpolation in characterizing shape. 30\ 0 0 Figure 5.13: Mesh plot of disparity surface shows only a few pixels change over the drum. Normalized Orientation Residual using S V D 0 1 1 •• • ! j ....A, j / v /. \ / ! V 0 5 10 15 20 25 30 Figure 5.14: Convergence of estimated drum orientation vs. number of iterations. Figure 5.14 plots the length of the residual in the orientation estimate. Both the known orientation and the estimated orientation are nonnalized and the residule error vector is calculated as the difference. The pose estimate for the drum as given by the eigenvector associated with the 104 smallest eigenvalue. This orientation vector points in the direction of least curvature provides information to orient a gripping arm to pick up the drum. Radius Estimate in Units of Camera Separation 2.5 1.5 1 - + +++--++++- -++ ++-I 5 10 15 20 25 30 Figure 5. J 5: Running average of drum radius estimate in camera baseline units vs. number of iterations. The estimated radius of the drum is about 1.9 baseline units. With a camera separation of 17.4 cm. this corresponds to a diameter of 66 cm. (26 inches). The measured diameter of the 45 gallon drum is 61 cm. (24 inches). 50 r Ratio of Smallest Eigenvalues 40 30 20 10 : i i +++" + + * f ^ + T + ' i F ' + T T f - - + + + + --++ + + -i 5 10 15 20 25 30 Figure 5.16: Cylindrical signature of drum using running average of eigenvalue ratio vs. number of iterations. 105 5.6 Tree Images 0 0 Figure 5.19: Section of tree disparity surface. Figure 5.19 shows a mesh plot of the tree disparity surface. It also shows the limit of stereo in the perception of shape. At a distance of 2 meters only about 3 or 4 pixels are available to describe shape. This makes the process dependent on subpixel interpolation and vulnerable to noise and surface smoothness. Normalized Orientation Residual using S V D 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 0 5 10 15 20 25 30 Figure 5.20: Convergence of tree orientation estimate vs. number of iterations. 1 1 i i i : i i i i h r \ ! i * " V — 107 1 0.8 0.6 0.4 Radius Estimate in Units of Camera Separation + + +. ... ±-+- 4--+- +-f- *-+. ± .+. i + - L + .+4 +4-0.2 R-Qi -J 1 1 i i 1 5 10 15 20 25 30 Figure 5.21: Running average of tree radius estimate in camera baseline units vs. number of iterations. The estimated radius of the tree is about 0.6 baseline units. With a camera separation of 17.4 cm. this corresponds to a diameter of 21 cm. The measured diameter of the tree approximately 23 cm. and varies over the surface of the tree. Ratio of Smallest Eigenvalues 50, . . 1 . . , 40 30 20 10 + ++ 10 15 20 25 30 Figure 5.22: Cylindrical signature of tree using running average of eigenvalue ratio vs. number of iterations. 108 Figure 5.23: Tree image Figure 5.24: Stereo detail of tree; distance =2.5 meters,/ = 400 pixels, b = 17.4 cm. 0 0 Figure 5.25: Section of tree disparity surface. 109 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 Normalized Orientation Residual using SVD ...V • \ I \ I I Li VL I i , > 0 5 10 15 20 25 30 Figure 5.26: Convergence of tree orientation estimate vs. number of iterations. Radius Estimate in Units of Camera Separation 2.5 1.5 1 1 1 ; " 7"+"^"+""+"^ [:++-+"+-f-+T"++"H r++++q -++ + + -i i i i 5 10 15 20 25 30 Figure 5.27: Running average of tree radius estimate in camera baseline units vs. number of iterations. The estimated radius of this tree is about 2.5 baseline units. With a camera separation of 17.4 cm. this corresponds to a diameter of 87 cm. The measured diameter of the tree approximately 82 cm. and varies over the surface of the tree. 110 50, Ratio of Smallest Eigenvalues 40 30 20 10 + + ++ + + + + -j-++ + + A 5 10 15 20 25 30 Figure 5.28: Cylindrical signature of tree using running average of eigenvalue ratio vs. number of iterations. The cylindrical signature of this second tree was not as definite as with the previous example in Figure 5.22. This is expected since this tree is larger in diameter and does not exhibit as much curvature with respect to the residual noise in the smallest (minimum curvature) eigenvalue. 5.7 Summary This chapter examined implementation and performance issues associated with navigation and recognition of cylinders using stereo vision. Section 5.1 demonstrates obstacle detection where the robot workspace is nonnalized with respect to a global shape model; an infinite plane. This is similar to work by Burt (1995) and Zhang (1994). In this thesis, navigation is considered in terms of global shape, hi Section 5.1.2 a global shape model for corridors is introduced. This can be used for navigation in hallways and urban streets where vertical walls are known entities and can be included in the robot's global shape model of the environment. The four planar surface of a corridor map to four points in disparity gradient space. This facilitates prediction of the robot 111 environment and detection of obstacles and passages. Future work could include other models to detect passages and obstacles in cylindrical environments like the pipes and tunnels found on space stations or in sewer systems. Section 5.2 deals with noise issues in characterizing shape. It is shown that noise can be measured and that shape information obtained by singular value decomposition (SVD) can be corrected for noise. This increases the effective signal to noise ratio in shape detection. Example are given of the technique's performance in characterizing drums, trees, and logs. The goal of this thesis is to develop a primitive visual interpretation that is shared by a human operator and a mobile robot for a specific task; manipulation of cylinders. A command such as "Move that log." does not require that a robot find a log in a visual scene since such a command requires that the operator point out the log. However it does require that the robot be able to confinn the presence of an object and then characterize that object as a log. To pick up the log, the robot must also be able to estimate the orientation and size of the log. This chapter demonstrates that the techniques developed in this thesis can accomplish this task. 112 Chapter 6 Summary and Future Work 6.1 Summary This thesis addresses the development of a primitive perception of the world. The ultimate goal is a remote supervisory control system (teleoperation) in which a robot and a human operator can agree on some basic elements of a scene and cooperate in obstacle detection, navigation, object identification, and pose estimation. Stereo vision was applied to this task and mechanisms were developed to organize a scene in tenns of a global shape model and to characterize objects as local shape that exists within the larger context of the global shape model. This thesis solves problems of navigation, pose estimation and object recognition at a low level; working in the disparity image. Disparity gradients are fundamental to this process and provide evidence of shape in stereo vision. In disparity gradient space, all points on a visible plane map to a single point in this space; which makes the recognition of a plane a robust proposition. On the other hand, all points on a cylinder map onto a curve in this space and the data becomes more difficult to characterize. All points on a plane support the singular character of the plane in this space whereas the character of a cylinder is distributed in this space. A formulation where stereo disparity can be used to recognize local surfaces; planes and cylinders was developed. Given a disparity map, the complexity of the recognition task was related to the null space of the surface normals. The method introduced in this thesis takes advantage of this null space. Shape is characterized by the eccentricity of the eigenvalues of the surface energy matrix and orientation is given by the eigenvectors. The technique relies on high resolution disparity surfaces and makes use of sub-pixel interpolation to calculate the surface normals. The 113 largest stereo baseline used in this thesis is 17.4 cm. In outdoor environments the baseline can be increased to 50 cm. and so increase the resolution of the disparity surfaces. Other techniques for the perception of shape include photometric stereo and shape from shading. The method used in this thesis and these other methods all work best with large images of simple shapes. They all have resolution limitations. Even so, it is possible to characterize complex surfaces using an extended Gaussian image (EGI) however the simple local shapes associated with logs, trees, and drums can avoid this complexity and work at a low level without the regular tesselation and mesh relaxation associated with EGI techniques. The method worked out in this thesis has been shown to be usable in applications where the operator might say "Move this log." In such a case the robot can be equipped with the primitive vision elements capable of locating the indicated object in its workspace, identifying it as a log, and deciding how to pick it up. 6.1.1 Contributions This thesis emphasizes some basic issues that arise in stereo vision. The motivation is to find primitive visual elements that can allow a robot to share some of the same understanding of the world as that held by a remote operator. Using stereo vision, the world can be perceived in certain ways and this was examined as a medium for cooperative tasks. The contributions made in the course of this work reflect this emphasis but do not stress some implementation issues. /. Matching Metrics In Section 2.5 a justification is given for the similarity in performance when block matching stereo images using sum of squared differences (SSD) and sum of absolute differences (SAD). This justification is based on published image models and on the expected behaviour of visual surfaces in stereo vision. The reasons given are: 114 1. Most surface patches appear oriented toward the observer and have constant disparity across a small patch so that SSD and SAD give the same result independent of texture distribution within a small block of image data. 2. On average, image intensity is highly correlated across small image blocks and image texture does not change radically across the block. There are few outliers in the data and even though the matching metrics use different statistics, the results are similar. 2. Surface Interpolation In Section 2.6 a justification for linear interpolation methods that are commonly used on stereo depth maps is given. An equation (Equation 2.6) is proposed based on linear interpolation to generate sub-pixel disparity maps from the sum of absolute differences (SAD) matching metric. Figure 2.8 shows the interpolated disparity surface of a cylinder using Equation 2.6. 3. Spatial Sampling Noise Sections 3.1 and 3.2 examine the fundamental problems in comparing gray scale images. It is shown that spatial sampling noise is systemic to sampled data systems and this makes high frequency infonnation unique to a single image and not useful for matching or object recognition, even in stereo pairs where there are few scale and orientation differences. In fact, high frequency data acts as noise that interferes with stereo matching. It is shown that stereo matching can be improved by a texture filter that enhances the useful image data. A rule of thumb is provided to estimate the useful bandwidth available for matching image data. Two strategies are suggested; upsampling the data to achieve higher resolution stereo and lowpass filtering to accept only image data that has a positive signal to noise ratio. A two dimensional recursive filter is demonstrated for this second strategy. 115 The analysis in Section 3.3 shows that there is little or no useful matching information beyond a 7c/2 bandwidth and that the preprocessing filter used in this thesis gave approximately 15% matching improvements in Figure 3.4. 4. Matching Rules for Directional Filters Chapter 3 also introduces a directional filter for matching stereo images with step discontinuities. A match surface is constructed and the behaviour of stereo information on this surface is examined. This results in two simple matching rules that conform to the visual information present in stereo images. The results of this process are shown in Figure 3.17 and Figure 3.18. 5. Disparity Gradient Space Chapter 4 introduces disparity gradient space as a simple way to work out global shape models and the perception of planar surface elements in stereo vision. It is shown that planes map to a point in this space and that cylinders map to a curve in this space. Other environments like corridors, hallways, and tunnels have predictable signatures in this space and can be used for navigation in same way as a global plane model is used. It was difficult to determine the orientation of cylinders in this space. Pose information was more readily provided by considering the surface energy of local shape. 6. Viewpoint Invariant Membrane Model Chapter 4 shows that finding the energy in the surface normals on a local surface is consistent with the viewpoint invariant membrane surface model. Two of the three eigenvectors of the autocorrelation matrix made up of the local normals are the principal strain vectors of the 116 membrane model. Section 4.9 indicates how simple shapes can be recognized by the eccentricity of the eigenvalues of the energy matrix and how the eigenvectors identify surface orientation. 7. Normalized Obstacle Detection and Corridor Models In Chapter 5, navigation and obstacle detection is seen in terms of a global shape model. Experiments in Chapter 5 indicate that there is less foreground noise in the detection process when the disparity map is normalized with respect to the floor rather that when the floor is subtracted from the disparity map. A justification for this result is given and empirical evidence for work with log manipulators is shown in Section 5.1. A example is given in which the concept of a global model is extended to navigation in corridors where obstacles and hallways opening onto the corridor are made evident. 8. Characterization of Shape and Orientation using Surface Energy Chapter 5 shows the practical application of the energy matrix formed from the surface normals for identifying local shape and orientation. Local patches are assumed to be planar and as such their smallest eigenvalues provide an estimate of the orientation noise in the surface normal measurements. This is used to improve the eigenvalue eccentricity estimates used to characterize the cylinder shapes. The eigenvector associated with the smallest eigenvalue is taken as the orientation of the object. The examples in Chapter 5 show reasonably fast convergence of the orientation estimate to the true orientation. Since the energy matrix is different from the Hessian matrix, a technique to estimate radius is also given. This and the orientation vector can be used to work out how to pick up a tree or a drum. 117 6.2 Future Work One topic for future work is the sampling noise that appears when matching stereo images and other digitized information. Coarse to fine matching has been used for finding stereo disparity but in many implementations the scales are critically sampled and follow Nyquist's criteria. This will introduce sampling noise at each scale. Instead, each scale should be oversampled in accordance with the useable bandwidth for matching sampled data given in Section 3.2. Most multiscale image operations are designed for decomposition and reconstruction. Reconstruction is not necessary for image matching and, once free of this constraint, each scale can be configured to minimize matching noise. In addition, it was proposed that upsampling could allow all the information in the original image to be used for matching. This is interesting for increasing the relatively coarse resolution of disparity maps and for recovering shape in regions of interest in the image. For this application an FIR filter might preferable to the recursive Butterworth lowpass filter used in this thesis. Another topic for future work is to develop a recursive SVD algorithm. The calculation is iterative and if the previous state could be preserved then the algorithm could proceed from the previous state and incorporate a new measurement into the eigenvalue and eigenvector estimates. Global shape models can be investigated to detect obstacles and passageways in cylindrical environments such as the pipes and tunnels in space stations and in sewer systems. Such structured environments can be analyzed in terms of global shape and its signature in disparity gradient space. In Section 2.3 it was suggested that structured light systems are similar to stereo. Using this analogy, the concepts of navigation by global shape using structured light systems can be investigated. 118 Appendix 1 Camera Errors A single line of sight in the reference view is visible only as a fixed pixel location (xRyR) in the reference image plane. The epipolar line in the left view can be determined by solving for the allowed x^ and yr. values that can correspond to a position in the reference view which is fixed by some world position as defined in Equation 2.2. This requires knowledge of the focal length of each camera and the relative orientation of the cameras. ~aLxL "1 0 0 0" aLyL = 0 1 0 0 [T] 0 0 V / L 0 aRxR aRyR z 1 '(a.l) The scale factor for points in the reference image is aR = — . The [T] matrix can be formed from /R the product of a displacement matrix, [D] and a rotation matrix, [R]; [T] = [D] [R] . 1 0 0 1 0 1 0 0 0 1 0 0 0 1 cos a -sin a 0 0 cos/J 0 -sin/J 0 " l 0 0 0 sin a cos a 0 0 0 1 0 0 0 cos y - s i n / 0 0 0 1 0 sin/J 0 cos/J 0 0 svay cosy 0 0 0 0 1 0 0 0 1 0 0 0 1 When a stereo camera with geometry as shown in figure 2.2 is constructed with minimum alignment errors then the rotation angles a (roll), P (yaw), and y (pitch) are small and the displacement errors sy and ez are small. In the displacement matrix, the horizontal baseline, b, has 119 a value of one unit and all world distances are measured as multiples of this value. The [T] matrix uses the small angle approximation such that cos(a) « 1 and sin(a) « a. Higher order terms, 0(e2), are neglected such thatrsin(«) sin(/J) « 0 and ey sin(a) « 0. The [T] matrix is approximately: [ T ] , 1 -a -p 1 a 1 -y sy P Y \ sz 0 0 0 1 In evaluating xL and V/, as defined in Equation a. 1, terms like: 1 + g ; + g 2 a r e e v a j u a t e ( j a s + + £ - 2 ) ( i _ ( £ - 3 + f 4 ) ) « l + f j + f 2 - ( f 3 + f 4 ) . 1 + f3 + £4 Substituting: £> = « fR{l + Sf), the approximate values for x^ andj^ are: xL*xR+D-pfR +{xR+D) V fli /R) -ayR (a.2) yL*yR+yR £f-P^-r^r +ccxR + syD-7fR (a.3) JR JRJ With reasonable care, the cameras can be positioned to within a millimeter on a baseline of 10 cm. This gives sy and sz values on the order of 1%. We are interested in Z values greater than 10 so the term — is omitted in (a.2) and in (a.3). The values for y^ are considered independent of Z when eyD < 1. When this condition is valid then, for a given pixel in the reference image, the numbers for_V£ are all constant. The epipolar search is along a horizontal line in the left image whose y position can change with XR but for any given pixel in the reference image, the search is along a horizontal line in the left image. The approximation for x^ shows that the disparity effects in xr, can be canceled by the yaw angle, p. This is the basis of vergence where the line of sight of one or both cameras is rotated to 120 fixate on an object. In this case, P~ — and, in general, y^ changes with distance and the epipolar lines are no longer horizontal. The Effects of Camera Alignment Errors Estimating the position of visible objects requires accurate estimates of stereo disparity, however, camera alignment errors cause the apparent stereo disparity, D, to be different from the ideal disparity, D. Equation a.4 approximates the horizontal position in the left image as: xL*xR+D-l3fR +(xR+D) /R JRJ ayR (a.4) Z approximately: where D = ^ — is the ideal disparity and fifR is the vergence error term. The apparent disparity is -4 D = xL - xR « D + XD WR -PfR (a.5) IR /R . Alignment errors cause gain and offset errors in D, the measured difference in horizontal position between the left and right images. Errors in the apparent disparity are simplest in a region near the center of the image where the values for xR and.y^ divided by the focal length are small. This region corresponds to a small field of view of the scene. Alternately, i f a camera has a long focal length such that the entire field of view is small then the pixel position divided by the focal length is also small. With this long focal length / narrow field of view assumption, the apparent disparity given in (a.5) can be further approximated as: D «(l + ef)D+efxR -ayR-/3fR 121 Even with the long focal length approximation for the uncalibrated disparity, useful estimates of the world position using the ideal stereo equations would be dependent on pixel position and require estimates of the camera alignment errors, ay, a, and p. In section 2.3, the equation for the estimated surface normal is: n = DJ + Dy] + j (D0 - DxxR - DyyR ) k Substituting the long focal length approximation of D for D in this equation, the estimated surface normal, h is approximately: «« (1 + sf)h + [sf,-a, -/J] In the case of the surface orientation estimate, the long focal length approximation makes the offset in the surface normal estimates almost constant, independent of pixel position. When the curvature of visual surfaces is measured, the vision system estimates how the surface orientation changes in some region of the image. The long focal length approximation has an almost constant error in the estimated surface normal over the whole image. Since curvature estimates are concerned with changes in the surface normal, this constant orientation error does not affect estimates of surface curvature. However, the magnitude error term is still present. 122 A p p e n d i x 2 Matching Random Images A procedure that matches gray scale values in images compares the ADC values assigned to individual pixels. This process can be studied by considering the problems in matching random dot stereograms which are a synthetic stereo images of a uniformly random intensity distribution where regions are shifted on exact pixel boundaries. In this image all 2N ADC states are equally probable so that single pixel value exists among 2 -^1 other values. The smallest element that can be matched is a l x l window (a single pixel). Over a limited search range of R pixels there is one target pixel and R-l chances for false matches. The probability of finding no ambiguous matches in R-l pixels is: P(e = 0) = 1 R-l 2N For random stereogram images with an 8 bit resolution (2^ = 256 gray values) and a search range of R = 20 pixels, the probability of an unambiguous match is approximately 93% when matching single pixels. Across an epipolar line of 255 pixels, single pixel matching over this search range has a probability of no matching errors: P(e= 0) = 0.93255 « 0 Matching is improves when groups of two adjacent are compared. The values in 2 adjacent pixels presents a unique code in 2562 possible states. The probability of finding no ambiguous matches in a 20 pixel search range across a 255 pixel epipolar line is: P(e = 0 ) « s x255 ( 1 9 ^ 1 ,2 = 0.93 ( 2 5 6 ) 2 A 2x2 pixel window (containing 4 pixels) would give almost zero errors (P(e = 0) = l) . Since the value of neighbouring pixels cannot be predicted, random images have the highest possible information content and, in this ideal environment, a 2x2 pixel window is the smallest square window that gives nearly no matching errors. 123 The four pixels in a 2x2 window can be considered as a 4 dimensional vector that points to a unique position in 4D space. The number of elements in this space is determined by the number of different ADC values each pixel can assume raised to the power of the number of pixels in the window. Each element in this space can be considered as a unique state of the 2x2 window. Since the number of distinct states is large, the probability that another group of 4 pixels will point to the same state within a finite search range is small. When noise is added to the ADC values, the 4 pixels values in the window no longer point to a single, unique element in 4-D space. Noise forces the vector to move in some volume of this space. The presence of additive white noise causes the ADC to overflow its boundaries and the values attained by the pixels in the window point to a high dimensional region of some radius determined by the standard deviation of the noise. Matching becomes ambiguous when theses regions overlap. When the l x l (single pixel) window is considered, a the target pixel has a larger 'footprint' than when no noise is present and it is easier to find a second match for this pixel. The probability of finding only one match over a search of R pixels is: P(e = 0) = \ 1 + 21^ *-'^ (l + 2|7p(*-l) 2N 2N where rj is the additive noise in the system in terms of ADC counts. If the noise is small such that individual pixel values vary within a range of ±1 ADC count then an unambiguous pixel state would occupy 3 ADC counts. Its nearest non-overlapping neighbour would be 3 ADC counts away. In a gray scale range of 256 values, there would be 85 distinct pixel values. Pixel values are quantized on fixed boundaries where as Gaussian noise models distribute noise continuously. The calculations are simpler if the noise is considered as range of ADC counts within which the matching pixel exists such that: , . , . . , number of accessible ADC states number of unambiguous states per pixel « r—. l + 2|//j where 77 = ±1, ±2, etc. 124 Estimating the Number of Matching Errors on and Epipolar Line This section began with the probability of finding no ambiguous matches over a search range, R. In making a single comparison, the probability of finding the same value is: 1 oo-where Ns is the number of unambiguous ADC states after noise is taken into account and w is the number of pixels in the window. The first expression then becomes: P(e = 0 ) = 1-1 .R-l O O " 1- R-l O O " Across an epipolar line of L pixels, the probability of no errors is approximately: R-l P(e = 0 ) \ R-l^ 1- -L The probability of an error is approximately R-l far matching errors on an epipolar line is: R-l (N.Y L and a rough estimate of the number of L2 125 Appendix 3 Recursive Edge Detectors (Deriche 1987) developed an optimal recursive edge detector that is based an Canny's design criteria. Canny's optimum edge detector is /(x) = ae~a^ sin(yx) which has a Fourier transform: k F(co) - ico (a2 + y2j2+2co2(a2-y2) + co4 According to Deriche, Canny's final form for the operator had a = / in which case the Fourier transform becomes: F(co) = ico (a 2 + y2)2 + coA = JCO-K 1 + CO (Or the derivative of a second order Butterworth function. In practice, Canny chose to implement the edge detector as the first derivative of a Gaussian. Deriche achieves a higher overall performance figure using a bandpass filter of the form f{x) = axe~a^ which has a Fourier transform: Jr F(co) = ico-f ( \ 2\ 1+ CO V ) the derivative of the square of a first order Butterworth function. Deriche constructed this filter as the sum of a forward recursive filter and a backward recursive filter. Using Deriche's notation, the filter is of the form: F(Z) = F.(Z) + F+(Z->) OO where: F+{Z~l) = YJAe~azrl e~a7rx ar1 «=o and F_(Z): -aZ \ + b\Z + b2Z 126 F+(Z']) is the forward filter, shown in the usual form of a recursive filter. FJZ) is the backward filter. The numerators show that one filter is one sample in the past from which is subtracted the other filter located one sample in the future. This can be rewritten as the symmetrical difference of two low pass filters, one run fonvard and the other run backward where the symmetrical difference operator is of the form y(n) = x(n+l) - x(n-l). This is the basis of the Sobel filter. The edge detector can be realized as the symmetrical difference of two low pass filters or as a symmetrical difference operator followed by a low pass filter. The filters can be a second order Buttenvorth function or the square of a first order Buttenvorth function. Also in this appendix is a demonstration of the second order Buttenvorth function which is nearly separable in two dimensions and gives better symmetry than the square of a first order Butterworth function. Second Order Buttenvorth Edge Detector 127 Appendix 4 An Approximately Separable 2D Recursive Filter The recursion formula for a second order Buttenvorth filter is: y(n) = 2by{n -1) - cy(n - 2) + K(x(n) + 2x(n -1) + x(n - 2)) The filter uses only past and present data points. This asymmetry produces phase distortions in the output that are removed when the filter is run forward and then backward over the same data segment. The result of this fonvard and backward filtering is the square of the absolute value of the transfer function. In the case of a second order Butterworth, this squared transfer function corresponds to the kernel of the Euler equation for a bending beam. In 2 dimensions, the squared 2nd order Buttenvorth filter corresponds to the plate constraint model for surface recovery in images. The stiffness of the plate is related to the lowpass cutoff frequency of the filter. A plate structure is more self supporting than an membrane structure and is a candidate for a 2D separable filter. The impulse response and contour plots indicate a useful 2D filter can be realized from two independent one dimensional filters. The figures show useful radial symmetry in the point spread function of the two separate filters. 1 5 1 10-0 0 128 Impulse Response for coc = 0.03 n 20 40 60 80 100 120 Contour Plot of Impulse Response with 5% Resolution 2nd Order Buttenvorth Filter 1. The normalized 2nc* order Butterworth filter is given in terms of the low frequency cutoff, coc. 1 H{S): s2+42s+l where s = — 2. frequency warping of normalized cutoff frequency is needed so that the s —> Z bilinear approximation will work for large values of coc. The cutoff frequency, fc is given in terms of the sampling frequency, fs, where 0 < fc < fs/2. coc = — tan bilinear transformation s = • 2 1-Z - l 2 tan T - l f ,0</< l J T 1 + Z~l ' a 1 + Z Z transform of 2nc* order Butterworth filter H(Z) = s = — -—^-y where a = tan(—/) a2(\ + r1)2 (1 - Z _ 1 ) 2 + f2o(\ - Z _ 1 )(1 + Z _ 1 ) + a2 (1 + Z _ 1 ) 2 129 _ Y(Z) 1 + 2Z" 1 +Z" 2 where: 0= t a n ( j / ) b = l + fla + a1 \-2brx+cZ~2 _ l-42a + a2 \ + 42a + a2 K = l + Jla+a2 a ,2 in the data domain: y(n) = 2by(n -1) - cy(n - 2) + K(x(n) + 2x(n -1) + x(n - 2 •)) If the filter is used to smooth a relative functions such as the difference energy between two shifted images and if the cutoff frequency remains constant then the filter gain, K can be set to unity to slightly simplify the calculation. Pseudo Code of '2D'Buttenvorth Filter I* this routine overwrites input data in array u(row,col) with filtered data. input: 0<w<0.99 lowpass frequency as a ratio of half the sampling frequency */ pi=3.14159; a=tan(pi/2*w); b=(l-a*a)/(l+sqrt(2)*a+a*a); c=( 1 -sqrt(2)*a+a*a)/( l+sqrt(2)*a+a*a); K=a*a/( 1+sqrt(2) *a+a*a); /* 1. process data horizontally */ for col=l:Num_of_Cols /* get initial conditions in filter */ temp(col)=u( 1 ,col); endfor(col) for row=l:Num_of_Rows for col=3 :Num_of_Cols /* forward pass */ temp(col)=2*b*temp(col-l)-c*temp(col-2)+K*(u(row,col)+2*u(row,col-l)+u(row,col-2)); endfor (col) for col=Num_of_Cols-2:-l: 1 /* backward pass */ u(row,col)=2*b*u(row,col+l)-c*u(row,col+2)+K*(temp(col)+2*temp(col+l)+temp(col+2)); endfor (col) temp(l)=u(row,l); temp(2)=u(row,2) endfor (row) /* 2. process data vertically*/ forrow=l:Num of Rows 130 temp(row)=u(row, 1); endfor(row) for col=l:Num_of_Cols for row=3 :Num_of_Rows /* forward pass */ temp(row)=2*b*temp(row-l)-c*temp(row-2)+^ endfor (row) for row=Num_of_Rows-2:-l: 1 /* backward pass */ u(row,col)=2*b*u(row+l,col)-c*u(row+2,col)+K*(temp(rovv)+2*temp(row+l)+temp(row+2)); endfor (row) temp(l)=u(l,col); temp(2)=u(2,col); endfor (col) endproc 131 Appendix 5 Other Estimators for Cylinder Orientation Two techniques to recover this information from the disparity map are compared. Convergence is tested against the LMS algorithm. One of the disadvantages of the LMS filter is its slow convergence. This is because it operates on the average characterization of the data rather than the actual data acquired. An advantage of the LMS algorithm is that it can be very stable and will eventually approach an optimum solution. LMS Filter This approach recursively estimates « 0 , the null space vector, by formulating an adaptive filtering problem using the least mean squares (LMS) algorithm. In this configuration, ni is the surface normal, « 0 is the current estimate of the cylinder orientation (orthogonal to the surface normal), and n0 • hi - et is the projection of the orthogonal estimate onto a measured surface normal (which should be zero). Some fraction of the measured surface normal provides an 'average' correction to the estimate. Using the LMS algorithm, the update equation for the orientation of the log axis is nok = - aeknk. 0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 50 100 150 200 250 300 Orientation residual with LMS vs. number of measurements 132 Each iteration of the LMS filter subtracts some fraction, a, of the projection of a new surface normal vector, hk on the previous orthogonal estimate, nQ. The LMS algorithm expects to converge to a stationary residule that is evenly distributed among all components of the orientation vector. An initial estimate of « 0 is given by the cross product of the first two normal vectors as described in Section 5.2.1. The rate of convergence and the residual of the estimate is related to a. It is possible to make a large at the beginning and then decrease as the number of samples increases but this is not done in this thesis. The LMS algorithm provides an easy mechanism to reject invalid data. The orientation residual is a single number that, when large, can be rejected and the algorithm can continue. A Novel Recursive Estimate for Loss The basic methodology in this estimator is to find a unique characteristic of the surface and exploit that characteristic in a detection scheme. Cylindrical surfaces have a null space which means that a matrix description of these surfaces should be poorly conditioned such that the inverse acts like a singularity. The idea is to take advantage of this and to recognize such surfaces in terms of their null space. Any vector in 3 space can be constructed from a linear combination of the eigenvectors of a symmetrical, positive definite 3x3 matrix, U , whose columns are linearly independent. If A, \ is an eigenvalue of U then the matrix [U-A-II] only spans two space and the eigenvector associated with A-i lies in the null space of [U-A-II]. For any vector, y, the expression [U-A-iI]"ly effectively divides by zero any eigenvector component of y associated with X\. If we perturb X\ by a small amount, 6, then [U-(Aq+e)I] is ill-conditioned but not singular. [U-(A.i+s)I]"ly will not blow up but will amplify the eigenvector component of y that is associated with X\. This works like a matched filter that extracts the desired eigen direction from the vector y. 133 Consider the equation Ux = Xx. Multiplying both sides by the inverse of U and dividing by A,, we get TJ_1x = — x . The gain of the operator TJ'l is l/X. When we consider an arbitrary vector, X y= axj+bx2+cx3 that is a linear combination of the eigenvectors {x ,^X2,X3} of U then T T - 1 a b c U y = — X J + — x 2 + — x 3 /1| ^3 The eigenvector associated with the smallest eigenvalue is amplified the most. When the eigenvalues are distinct and well separated, the vector component with the minimum energy of the system will dominate recursive operations. The gain of each operation is increased when U is poorly conditioned or when TJ is made nearly singular. [ U - al]x - Ux - ax = (X - a)x [U-aI]- 1 x = - ^ — x X-a For the general vector y= axj+bx2+cx3 , T T T T I - I a b c [ U - a l ] y = - X l + - x 2 + - x 3 /I, - a X2-a X3-a This technique can be used to get a recursive least squares estimate of the log orientation. As more surface normal measurements are incorporated into U, better estimates of the log orientation become available. For fast convergence an initial estimate of the minimum eigenvalue and a good candidate vector for the log orientation is needed. When the first 2 surface normals are available, their cross product provides the best estimate of the log orientation. Since only 2 vectors are available, the 3x3 correlation matrix, U will be singular and the eigenvalue associated with the initial estimate of log orientation will be zero. When a third, noisy surface normal measurement is incorporated into U, the matrix will no longer be singular but the minimum eigenvalue will be small, near the initial estimate of zero. The recursion formula is: vk=?[Uk-ak_,l]" lyk_ 1 134 where the new estimate of the orientation vector is then normalized y k = At every step the goal is to find a vector y such that Uy = A-iy. The equation [U-aI]y ; (A,i-a)y is satisfied by the recursion equation [U-aI]v = y if (k\-a)vjt= y^ . Since y^ is normalized, this leads to the eigenvalue update: 1 v k -y k- l When a new normal vector, n^ becomes available: U k = U k _ , + n X 0 5 10 15 20 25 30 35 Recursive Eigenvector estimation vs. number of measurements A Comparison with Recursive Least Squares The least squares recursion formula is x(0 = x(r -1) + P(f)a(0[y, - a T (r)x(f -1)] where x is the estimate of the log orientation, a(t) is the latest surface nonnal measurement, the desired output, yj- = 0, and P = i H . Rearranging the tenns gives x(0 = x(r- l ) -a (0 [P(Ox(r- l ) ] T a (0 135 The term P(r)x(r-1) acts as a filter to refine the estimate of the minimum eigenvalue. In poorly conditioned systems the gain in this operation can be large. This vector is projected on the latest surface normal measurement. Since x should be orthogonal to a(t), the component of x in the direction of a(t) is subtracted off. This follows the Gramm-Schmidt process for finding orthogonal basis vectors. P(t)x(t -1) finds the vector that minimizes the energy of the system. Convergence of Recursive Least Squares (RLS) can be slow for two reasons: • The algorithm does not normalize the eigenvalue filter. When the matrix is poorly conditioned, the value is amplified by a factor The m § b initial gain can force the estimate into a poor state. • As more data becomes available, the total energy in U increases, the total noise energy in the orientation vector represented by increases and the gain of the eigenvector filter decreases. This makes it difficult to recover from a poor state estimates that are generated earlier. This can be corrected by a forgetting factor. If the eigenvalue is tracked as in the previous section then the gain of the filter can be held constant or confined to a small range. This serves a similar function to the usual forgetting factor that is introduced to track non-stationary parameters. Handling Invalid Data The estimates of the orientation vector, n0 should be stationary. If a point is selected that is not on the log surface then the orientation filter will see a discontinuity and that data point is discarded. To evaluate surface points, it is necessary to keep an estimate of the error variance. Points with error values greater than 2 or 3 a can be discarded. An easy estimate of a is: ojfc = o*-i + a ( k | - o i t - i ) 136 Appendix 6 Comparison of Signal Matching with Power Cepstrum and SSD Many stereo vision systems use Sum of Square Differences (SSD) to find corresponding regions in stereo images. The Power Cepstrum is sometimes used to estimate disparity between stereo images, although it is not as popular as SSD. As a disparity estimator, the Cepstrum is most sensitive when the data windows are perfectly aligned and it's sensitivity decreases as the windowed data is shifted. Yeshurun and Swartz suggest that the power Cepstrum has superior signal to noise properties when compared to standard correlation techniques. This should be most evident when the Cepstrum is used as a zero shift disparity detector because this produces the largest peaks in the power Cepstrum. More recently, Smith and Nandhakumar (1996) have proposed using the zero shift power Cepstrum as a stereo matching detector. This section examines the mechanics of the power Cepstrum as a zero shift disparity detector and compares it's performance to the SSD. The Cepstrum used in stereo matching is usually of the form: Ceps = FFT- ,[log|FFT(s)|] The technique proposed by Yeshurun and Swartz selects data from two stereo images using windows N samples wide. These two data segments, si and S 2 , are concatenated to make a single data structure that is 2N pixels long. One can view this as the sum of two 2N wide data segments each of which has N samples of zero padding. SJ SI S2 + s i 137 Since each segment is padded with N zeros, the FFT of each segment will be interpolated to 2N complex components. Segment 2 has zero padding in first half and image data is shifted to the second half of the segment. The FFT of this second segment is: e-ik"S2(cok) = {-\fSJ<ok) The FFT of the concatenated data is: (Banadari) S = S,{tok) + e-ik"S2(cok ) = S](o>k) + (-1)* S2 K ) For matching, S2 is modeled as a noisy version of s^ shifted by x pixels. S = Sl(cok) + (-\f[e-i^Sx(cok)+ ricok)] Each of the original data samples is N pixels that have been zero padded to 2N data points such that the spectrum of the concatenated signals has 2N components. When x equals zero, the odd components of the interpolated spectrum cancel except for the noise terms. The result is a picket fence effect; signal + noise for the even components, and just noise for the odd components. Suppose we ignore the noise term and try to match ideal random signals that have flat power spectra. When T equals zero, the power spectrum of the concatenated ideal signals appears as a large square wave with a period of 2 pixels. Even terms add, and odd terms cancel, approaching zero. After taking the log of this power spectrum, the odd terms approach negative infinity. This greatly amplifies the picket fence effect. The FFT (or inverse FFT) of this new situation, results in a new spectrum (the Cepstrum) that is dominated by a single high frequency peak that measures the average peak to peak amplitude of the amplified picket fence effect. Since the odd values approach log(O), the peak to peak amplitude is very large. This value is the zero shift peak of the Cepstrum. The mechanics of the second FFT in the Cepstrum calculation calculates the highest frequency component as the sum of the differences between the even samples and the odd samples. It adds up all the even samples and then adds up all the odd samples and then subtract the two results. This gives us a number for the highest frequency component. In the Cepstrum, these 138 sample values are logarithms. The additions are equivalent to multiplications and the subtraction is equivalent to a division. If the logarithm operation is postponed; multiply all the even terms together and then divide by the product of all the odd tenns and then take the logarithm of this result, we get the same value for the zero shift peak as does the nonnal Cepstrum calculation. The difference is that the second FFT is not calculated and only a single logarithm is used in the calculation; but even this single logarithm is not needed to detect the correlation peak. In fact, noise rejection is improved by not taking this logarithm. Evaluating logarithms has been an expensive part of cepstrum calculations and may be unnecessary for some applications. The rest of this appendix section looks at the details of this process and compares this detector to other zero shift detection criteria. The Zero Shift Cepstrum Peak The goal is to analyze the cepstrum process and identify it's advantages as a zero shift disparity detector. At the beginning of the process the signals are concatenated as described earlier. One signal is modeled as a noisy, shifted version of the other so that the concatenated signals have a Fourier transform of the form: 5 = 5 , K ) + ( - l ) i [ ^ K ) ] + ^ ) In its simplest form, this expression models the original signals as having constant magnitude Fourier coefficients that slowly rotate with respect to each other as the data windows become displaced. In practical applications, the local magnitude of the Fourier coefficients changes with window position. To be accurate, the noise tenn should represent both additive noise and how the signal changes as the window moves. The noise magnitude increases as the data windows share less and less signal infonnation. The next step is to separate the spectrum of the concatenated signals into even and odd coefficients. 139 for k = even S(k) = S(a)k)(l + e ) = 2S(cok)e for k+1 = odd S(k +1) = S(cok+l )(l - e****) + n(<oM ) = 2iS(aM )e T = 2iS(ojk+l)e It is assumed that, on average, the noise terms are uncorrelated with the signal components in a region where T — » 0. The normal Cepstrum calculation determines the absolute value of the neighboring coefficients and then takes the logarithm. To calculate the zero shift term, the final FFT subtracts one log value from the neighboring log value. This is equivalent to dividing one coefficient by its neighbor and then taking the log of the result. If we postpone the log operation, the result of the division is: The effect of zero padding is to interpolate the spectrum. This tends to produce frequency coefficients that are averages of their neighboring coefficients. For most images, S(cok+1) and S((ay) are not too different in value. As expected, the trigonometric tenns disappear when x = 0 except for a residual term that is related to the signal to noise ratio. The cepstrum calculation takes the log of tins value which greatly amplifies the sensitivity of this zero detector. Since the magnitude of the logarithm is the same for the inverse (i.e. log(l/x) = - log(x)), we can divide the even coefficients 'S(cok)\ 140 by the odd coefficients. As x—» 0, this inverse expression acts like a cotangent function of the shift term. r S(k) = 2S(cok)e 2cos(a)k-) + n(a)k) -i<a t + 1-/ < T T T T "\ = 2S(a>k)e 2\cos(cok+x - ) c o s ( o - r ) + s i n ( f i > f c + I - ) s i n ( © , - ) J + « ( « * + , ) Dividing the even components by the odd components: Sk \S(cok) jk+\\ \S(a>k+l)\ f T\ T T cot (k + \)cox— cos(a>,— ) + sin(« 1— ) + rk V 2 J 2 2 As T — > 0, this expression produces a large peak that doesn't rely on the log(0) effect. When we take the logarithm of the cotan peak, we compress it and get the corresponding Cepstrum term. In effect, the final FFT operation that calculates the zero shift peak is behaves like: Elog|S(*)|-2>g|S(*)| = log odd Ylsck) Y[s(k) odd log n : k even \S(k)\ |.ST*+ 1)| r-+0 log Y\ cot(n</>) nodd It can be seen that, as x—> 0, the Cepstrum evaluates a function similar to the log of the product of the cotan terms. It doesn't matter for this calculation if tangent functions or cotan functions are used. The final step in the Cepstrum calculation takes the absolute value of the log and this yields a positive peak of the same value. If tangent terms are evaluated, then the internal mechanics of the Cepstrum behaves like a null detector that measures how well the two image segments match. The SSD also behaves as a null detector and it shares some characteristics with the Cepstrum. When cotan terms are evaluated, the effect of the logarithm is to compress the dynamic range of the data. To detect a peak in a noise field, compressing the dynamic range of the data is not necessarily the best strategy. By not using any logarithms in the calculation, we can build a 141 zero shift detector that more efficient than the usual Cepstrum calculation. For this particular application, the final FFT and all the log calculations can be eliminated. A Tariff) Zero Shift Detector To simulate the zero shift Cepstrum, a tan(cj)) detector without a concatenated data structure can be built. Using two N pixel data windows Sj and s2 such that s 2 is a shifted version of S] , we can evaluate the FFT of the sum and difference of these signals and then divide the magnitude of one set of coefficients by the other to get tan(cj)) terms. F (s, - s2) = (1 - e-"")Sx {co) = 2/<r'W 2 sin(ffl-)5,0) F (5, + s2) = (1 + e-im)S, O) = 2e~M2 cos( co^)S, (co) Tests The following forms of correspondence detectors are compared with the zero shift Cepstrum peak. log(tan((|))) detector = log-nis>K)+s2(^)i 2 multCotan = = 7 i inverse SSD = — ; k In the following figures the different fonns of correspondence detectors are used to find the alignment of two, one dimensional first order Guass-Markov process with no additive noise. A small constant, 8, in the calculations protects against overflow and limits the peak values. The search windows are 32 pixel wide. In a noise free environment, all the detectors perform well and the log(tan(((>)) detector performance is very similar to the zero shift Cepstrum detector. 142 200 150 100 50 Zero Shift Cepstrum Peak 10 20 30 window position 40 200 150 100 50 Log(Tan(phi)) Zero Shift Detector 10 20 30 window position 40 x 1 o 8 1 Cotan(phi) Zero Shift Detector 1.5 0.5 10 20 30 window position 40 2000 1500 1000 500 Inverse S S D Detector 10 20 30 window position 40 Noise is modeled as the cause of false matches; a combination of added disturbance and the effect of how the signals change locally when the window moves. When external added noise is about the same magnitude as the signals to be matched then false matches become more likely. The residual at zero shift is due only to the added noise which can be greater than the residual at some other shift. This depends on, how the signal changes when shifted, the effect of noise terms added to the shifted signal, and on the matching metric. The figures below show how the different detectors react when the added noise energy is about the same as the signal energy. The zero shift Cepstrum detector and the log(tan(<j>)) perform about the same . 143 30 25 20 15 10 5 0 Zero Shift Cepstrum Peak 10 20 30 window position 40 35 30 25 20 15 10 5 0 LogfTan(phi)) Zero Shift Detector 10 20 30 window position 40 These peaks are compressed by the logarithm. The next figure shows that the multCotan detector produces well defined peaks even in large noise fields. It is this property that makes the Cepstrum attractive for matching signals. For the remainder of this section, the performance of the inverse SSD and the multCotan detector are compared. x - | g 1 3 Cotan(phi)Zero Shift Detector Inverse SSD Detector 10 20 30 window position 40 10 20 30 window position 40 The figures below show how the inverse SSD and the multCotan respond to noise; uniform noise (rand(N)) and exponential noise (-alog(rand(N)). Exponential noise can model unexpected disturbances like that caused by specularities and occlusions in images. The figures below use 16 pixel, one dimensional data windows where the correct match is centered on pixel 8. The SSD and multCotan were tested with identical data sets. In both cases, the SSD outperforms the multCotan detector. Similar results were obtained using the zero shift Cepstrum detector. 144 SSD with Uniform Noise 16 14 12 | 10 'in & 8 o I 6 4 2 0 A I V \ I 20 40 60 run# 80 100 16 14 12 I 1 0 '</) a 8 SZ o j2 6 4 2 0 Cot(phi) detector with Uniform Noise I I I I I II f] / V 20 40 60 run # 80 100 16 14 12 c o «5 10 a 6 4 2 SSD with Exponential Noise A V V V 20 40 60 run # 80 100 16 14 12 m 10 o Cot(phi) detector with Exponential Noise 1 A \ A 1 1 1 20 40 60 run # 80 100 While the multCotan produces well defined peaks for matching, it does not appear to be as robust as the SSD. This is because the multCotan detector is a product of terms while the SSD is a sum of terms. In the presence of noise, this makes the product detector more susceptible to errors. If chance makes one term in a finite series nearly zero then the product of the terms will be close to zero and the inverse will produce a large peak. However, the sum of the series will be more resistant to a spurious zero term. When noise is added to the series, a false zero is more likely to mask the correct match in the product rather than in the sum. 145 Most test images have little visible noise yet false matches are frequent. Real images also exhibit self similarity under translation. This combined with occlusions and illumination changes act like burst of noise in some part of the image that can obscure correct matches. The performance of the multCotan detector in matching one dimensional segments a real image is tested. Below is a stereo image pair of Pepsi cans in which the disparity is between 1 and 4 pixels. The stereo matching test uses windows of 32 contiguous horizontal pixels centered on random locations in the one image. A 16 pixel window is taken from the other image, centered on the same location. All matches should be within 4 horizontal pixels. Before running the stereo matching we test self matching where all matches should be centered on zero. In this test, we don't know if a match that gives a disparity between 1 and 4 pixels is the correct match but matches outside this region are known to be incorrect. These spurious errors appear like shot noise. run # run # 146 All matches less than zero and greater than 4 are known to be false matches. The tests indicate that the SSD produces fewer false matches. The SSD and the multCotan detector have about the same ratio of false matches as that shown with the exponential noise model. Conclusion There are several versions of SSD detectors reported in literature. The version tested would produce false matches about 6% of the time. The zero shift power Cepstrum produced false matches about 15% of the time and the multCotan about 18% of the time. The Cepstrum is calculated using an interpolated spectrum. This averages neighbouring terms and it may be the reason that the power Cepstrum was found to be more noise resistant than the multCotan detector (see Section 3.1.2). The tests indicate that the SSD is a better detector, having has less than half the error rate of others. The second FFT in the power Cepstrum calculation can be viewed as a weighted sum of logarithms. In this respect, the Cepstrum belongs to a class of product detectors whose output is the product of terms in a series. These results indicate that product detectors in a noisy environment are vulnerable to conditions that will produce a null at some point in the data series whereas summation detectors, like least squares, are not as vulnerable to this effect. 147 References E. Adelson and E. Simoncelli, "Orthogonal Pyramid Transforms for Image Coding," SP1E Visual Communications and Image Processing II, Vol.845, 1985 A. Akansu and A.Haddad, "Multiresolution Signal Decomposition," Academic Press, 1992. K. Arbter, "AfTine-Invaraint Fourier Descriptors," From Pixels to Features, Elsevier Science Publishers pp. 153-164, 1989. K. Arbter, W.Snyder, H. Burkhardt, and G. Hirzinger, "Application of Affine Invariant Fourier Descriptors to Recognition of 3-D Objects," IEEE Trans. PAM, Vol. 12 pp. 640-646, 1991. R.D. Arnold and T.O. Binford, "Geometric constraints in stereo vision," Proceedings ofSPIE Image Processing for Missile Guidance, Vol. 238, pp. 281-292, San Diego 1980. K.S. Arun, T.S. Huang, and S.D. Blostein, "Least-squares fitting of two 3-D point sets," IEEE PAMI, Vol. 9, No. 5, pp. 698-700, September 1987. E. Bandari and J. Little, "Cepstral analysis of optical flow," The University of British Columbia, Department of Comuter Science, Technical Report 92-6, November 1991. E.Bandari and J. Little, "Spatial-quefrency approach to optical echo analysis," IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1992 pp. 850-852. A. Blake and A. Zisserman, "Visual Reconstruction," MIT Press 1987. S. Birchfield and C. Tomasi, "Depth Discontinuities by Pixel-to-Pixel Stereo," IEEE International Conference on Computer Vision, January, 1998. pp. 1073-1080 S. Birchfield and C. Tomasi, "A Pixel Dissimilarity measure That is Insensitive to Image Sampling," IEEE Trans. PAMI, Vol. 20, No. 4, pp. 401 - 406, 1998 148 R. Burge, J. Mulligan, and P. Lawrence, "Using Disparity Gradients for Robot Navigation and Registration," IEEE Conf. on Intellegent Robots and Systems, October, 1998. pp. 539-544 P. Burt, P. Anandan, and K. Hanna, "An Electronic Front End Processor for Active Vision," SPIE Conf. on Intellegent Robots, November, 1992. P. Burt, L. Wixson, and G. Salgian, "Electronically Directed "Focal" Stereo", Proceedings of the Fifth International Conference on Computer Vision (ICCV95), June 1995, pp 94-101. N . Chauvin, G. Marti, and K. Konolige, "A Countour Method for Real-Time Range Image Parsing," IEEE Conf. on Intellegent Robots and Systems, October, 1998. J. Clark, " Multi-Resolution Vision System with Application to the Automated Measurement of Logs," PHD Thesis Electrical Engineering, U B C , September 1985 R. Clarke, "Transform Coding of Images," Academic Press, 1985. J. Cyganski and J. Orr, "Applications of Tensor Theory to Object Recognition and Orientation Determination," IEEE Trans. RAMI, Vol. 7 pp. 662-673, 1985. R. Deriche, "Using Canny's Criteria to Derive a Recursively Implemented Optimal Edge Detector," International Journal of Computer Vision, pp. 167-187, 1987. F. Devernay and O.D. Faugeras, "Computing Differential Properties of 3-D Shapes from Stereoscopic Images without 3-D Models," INRIA Report, June 1994. F. Devernay and O.D. Faugeras, "From Projective to Euclidean Reconstruction," Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR'96), 1996. G. Elber and E. Cohen, "Second-Order Surface Analysis Using Hybrid Symbolic and Numeric Operators", A CM Transactions on Graphics, Vol. 12, No. 2, April 1993, pp 160-173 149 T. Faber and E. Stokely, "Orientation of 3-D Structures in Medical Images," IEEE Trans. PAMI, Vol. 10 pp. 626-633, 1988. A. Fenske and H. Burkhardt, "Affine Invariant Recognition of Gray Scale Objects by Fourier Descriptors," SPIE Applications of Digital Image Processing, Vol. 14 pp. 53-64, 1991. D.J. Fleet and A.D. Jepson, "Computation of component image velocity from local phase information," International Journal of Computer Vision, 5:1, pp. 77-104, 1990. P. Fua, "A Parallel Stereo Algorithm that Produces Dense Depth Maps and Preserves Image Features," INRIA Technical Report 1369, January, 1991. D. Geiger, B. Ladendorf, A. Yuille, "Occlusions and Binocular Stereo," Internation Journal of Computer Vision , Vol. 14, pp. 211-226, 1995. W. Grimson, "From Images to Surfaces," MIT Press, 1981 C. Hansen, N . Ayache, and F. Lustman "Efficient Depth Estimation Using Trinocular Stereo" Optical Society of America, Vol. 1003, pp. 124-131, Sensor Fusion: Spatial Reasoning and Scene Interpretation, 1988. B. K.P. Horn, "Closed-form solution of absolute orientation using unit quaternions," Optical Society of America, Vol. 4, No. 4, pp. 629-642, April 1987. B. Horn, H. Hilden, and S. Negahdaripour, "Closed-form solution of absolute orientation using orthonormal matrices," Optical Society of America, Vol. 5, No. 7, pp. 1127-1135, July 1988. B. Horn, "Robot Vision," McGraw Hill, 1989. J. Huang and D. Mumford, "Statistics of Natural Images and Models," http://www. dan. brown, edu/people/mumford/researchjiew. html, 1998 D. Jones and J. Malik, "Determining the Three-Dimensional Shape from Orientation and Spatial Frequency Disparities," Proc. of the European Conference on CompiHer Vision, May, 1992. 150 D. Jones and J. Malik, "A Computational Framework for Determining Stereo Correspondence from a Set of Linear Spatial Filters," Proc. of the European Conference on Computer Vision, May, 1992. T. Kanade, "Toward Frame Rate Stereo Vision", http://www-cgi.cs.cmu.edu/afs/cs/usr/tk/www/ Projects_www/IUW94stereo.fm.html J. Koenderink, "Solid Shape", MIT Press, 1990. (Math QA641 K745 1990) R. Kumar, P. Anandan, and K. Hanna, "Direct recovery of shape from multiple views: a parallex-based approach" ICPR, pp. 685-688, 1994. F.L. Lewis, "Optimal Control", John Wiley, 1986. J. Little and W. Gillet, "Direct Evidence for Occlusion in Stereo and Motion," Image and Vision Computing, Vol. 8 no. 4 pp. 328-340, 1990. L. Lipton, "Foundations of the Stereoscopic Cinema," Van Norstrand Reinhold, 1982. C. Lo and H. Don, "3-D Moment Fonns: Their Construction and Application to Object Identification and Positioning,"/£££ Trans. PAMI, Vol. 11 pp. 1053-1064, 1989. B. Lucas and T. Kanade, "An Iterative Registration Technique with Application to Stereo Vision," Proc. of the Seventh IJCAI, 1981, pp.674-679. D. Marr, "Vision," W.H. Freeman, 1981. L. Matthies, "Stereo Vision for Planetary Rovers: Stocastic Modeling to Near Real Time Implementation," Internation Journal of Computer Vision, Vol. 8:l,pp. 71-91, 1992. S. Mitra, S.L. Lim, D.J. Lee, and B.S. Nutter, "Depth estimation from disparity of stereo images," . SPIE, Vol. 1349, Applications of Digital Image Processing VIII, pp. 216-226, 1990. 151 I. J. Mulligan, A.K. Mackworth, and P.D. Lawrence, "A model-based vision system for manipulator position sensing," Proceedings of the Workshop on Interpretation of 3D Scenes, Austin, TX, pp. 186-193, November 27-29, 1989. V. S. Nalwa and T.O. Binford, "On detecting edges," IEEE Trans. PAMI, Vol. 8 pp. 699-714, 1986. M . Okutomi and T. Kanade, "A locally adaptive window for signal matching," International Journal of Computer Vision, 7:2, pp. 143-162, 1992. A. Pentland, "Spatial and Temporal Surface Interpolation using Wavelet Bases," SPIE Geometric Methods in Computer Vision, pp.43-62, 1992. Point Grey Research, www.ptgrey.com. S. Pollard, J. Mayhew, and J. Frisby, "PMF; A Stereo Correspondence Algorithm Using a Disparity Gradient Limit," Preception, Vol 14 pp.449-470, 1985. W.K. Pratt, "Digital Image Processing", John Wiley & Sons, 1978. W.H. Press, S.A. Teukolsky, W.T. Vettering, and B.P. Flannery, "Numerical Recipes in C" Cambridge University Press, Second Edition, 1992. T. Reagan, T. Abatzoglou, J. Saghri, and A. Tescher, "Sub-Pixel Resolution for Target Tracking," SPIE Applications of Digital Image Processing, Vol. 15 pp.2-19, 1992. T.D. Sanger "Stereo disparity computation using gabor filters," Biol. Cybern. 59, pp. 405-418, 1988. L.E. Scales, "Introduction to Non-Linear Optimization", Springer-Verlag, 2nd Ed. 1987. P. Scheiwiller, S. Murphy, and A. Dumbreck, "A Compact Zoom Lens for Stereoscopic Television", SPIE Stereoscopic Displays and Applications, pp. 2-8, 1991. 152 E. Simoncelli, W. Freeman, E. Adelson, and D. Heeger, "Shiftable Multiscale Transforms," IEEE Trans, on Information Theory, Vol. 38 pp.587-607, 1992. S. Singh and H. Cannon, "Multi-Resolution Planning for Earthmoving," International Conf. on Robotics and Automation, May, 1998, www.frc.ri. emu. edu~ssingh/pubs-conf.html. I. Sirovich and M . Kirby, "Low-dimensional procedure for the characterization of human faces," 1987 Optical Society of America, Vol. 4, No. 3, pp. 519-524, March 1987. M . Starks, "Stereoscopic Video and the Quest for Virtual Reality," SPIE Stereoscopic Displays and Applications, pp. 327-342, 1991. A. Stentz, J. Bares, S. Singh, and P. Rowe, "A Robotic Excavator for Autonomus Truck Loading," IEEE Conf. on Intellegent Robots and Systems, October, 1998., www. frc. ri. emu. edu~ssingh/pub s-conf. ht ml R. Szeliski, "Fast Surface Interpolation using Hierarchical Basis Functions," IEEE Trans. RAMI, Vol. 12 pp. 513-528, 1991. D. Terzopolulos, "The Computation of Visible Surface Representations, "IEEE Trans. RAMI, Vol. 10pp.417-439, 1988. H. Trivedi, and S. Lloyd, "The Role of Disparity Gradient in Stereo Vision," Preception, Vol 14 pp.685-690, 1985. Y. Tse and R. Baker, "Camera Zoom/Pan Estimation and Compensation for Video Compression," SPIE #1452 Image Processing Algorithms and Techniques II, pp. 468-479, 1992. M . Turk and A. Pentland, "Face processing: models for recognition," SPIE Vol. 1192, Intelligent Robots and Computer Vision VIII: Algorithms and Techniques, pp. 23-32, 1989. S. Umeyama, "Least-squares estimation of transformation parameters between two point patterns," IEEE PAMI, Vol. 13, No. 4, pp. 376-380, April 1991. 153 R. Vaz and D. Cyganski, "3-D object Orientation from Partial Contour Feature Data," SPIE Vol. 1349, Applications of Digital Image Processing XIII, pp. 452-459, 1990. G. Van der Wai, P. Burt, "A VLSI Pyramid Chip for Multiresolution Image Analysis," International Journal of Computer Vision, Vol. 8.3 pp.177-189, 1992. G.Whitten, "Scale Space tracking and Deformable Sheet Models for Computational Vision," IEEE PAMI, Vol. 15, No. 7, pp. 697-706, July 1993. R. Wildes, "Direct Recovery of Three-Dimensional Scene Geometry from Binocular Stereo Disparity," IEEE PAMI, Vol. 13, No. 4, pp. 761-774, August 1991. A. Wiley and K. Wong, "Metric Aspects of Zoom Visioa" SPIE #1395 Close Range Photogrammetry Meets Machine Vision, pp. 112-118, 1990. T.Williamson, "A High Performance Stereo Vision System for Obsatcle Detection," DoctoralThesis CMU-Rl-TR-98-24, Carnegie Mellon University. September 25 1998. M . Yachida, Y. Kitamura, and M . Kimachi, "Trinocular Vision: New Approach for Correspondence Problem" International Conference on Pattern Recognition, pp. 1041-1044, October 1986. A. Yahja, S. Singh, and A. Stentz, "Recent Results in Path Planning for Mobile Robots Operating in Vast Outdoor Environments," Proc. 1998 Symposium on Image, Speech, Signal Processing and Robotics, September, 1998, WvVW.frc.ri.cmu.edu~ssingh/pubs-conf.html. Z. Zhang, R. Weiss, and A. Hanson, "Qualitative Obstacle Detection" CVPR, pp. 554-559, 1994. 154
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Recognition of shape and orientation using binocular...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Recognition of shape and orientation using binocular vision Burge, R. 2000
pdf
Page Metadata
Item Metadata
Title | Recognition of shape and orientation using binocular vision |
Creator |
Burge, R. |
Date Issued | 2000 |
Description | This thesis examines the problem of building primitive vision elements for the remote operation of heavy equipment using minimal supervision. The ultimate goal of the vision system is to provide the robot with a world view that is compatible with the operator's view such that the operator sets goals and the robot executes low level tasks needed to accomplish intermediate objectives. A system for perceiving shape is developed that allows the robot to identify the ground surface on which it moves and to recognize obstacles and simple objects within the workspace. Stereo vision is used for this purpose. Disparity between stereo images contains distance information about a scene. The information in stereo images and the recovery of dense disparity maps is studied. A procedure for matching stereo images is given. The disparity gradient is used as the fundamental evidence of shape perceived by stereo vision and the properties for shape are analyzed within disparity gradient space. These attributes are exploited for navigation and recognition of simple shapes. The techniques developed are used to recognize cylinders and to estimate their location, size and orientation within a robot workspace. The vision process includes three phases: (1) recovery of a dense disparity map that includes sub-pixel interpolation of local surfaces, (2) deciding the location of objects within the robot workspace, and (3) confirmation of a cylinder and determining its location, radius, and pose. Example of these procedures are provided for real images. The algorithm provides a means for navigation on plane surfaces and for recognition and pose estimation of cylinders within the robot workspace. Examples described here demonstrate the feasibility and reliability of this approach. Future work will consider improvements to the low-level processes and increasing the resolution of disparity surfaces. |
Extent | 20058623 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-07-28 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0065379 |
URI | http://hdl.handle.net/2429/11383 |
Degree |
Doctor of Philosophy - PhD |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2000-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2000-486117.pdf [ 19.13MB ]
- Metadata
- JSON: 831-1.0065379.json
- JSON-LD: 831-1.0065379-ld.json
- RDF/XML (Pretty): 831-1.0065379-rdf.xml
- RDF/JSON: 831-1.0065379-rdf.json
- Turtle: 831-1.0065379-turtle.txt
- N-Triples: 831-1.0065379-rdf-ntriples.txt
- Original Record: 831-1.0065379-source.json
- Full Text
- 831-1.0065379-fulltext.txt
- Citation
- 831-1.0065379.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0065379/manifest