Patchlets: a method of interpreting correlation stereo 3D data

by Donald R. Murray
B.Sc. Elec. Eng., University of Alberta, 1987
M.Sc. Elec. Eng., University of Alberta, 1992

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
December 2003
© Donald R. Murray, 2003

Abstract

This thesis describes methods for fitting local planar surface elements, that we call patchlets, to 3D data obtained from correlation stereo images. We use these patchlets to robustly extract and estimate bounded planar surfaces from complex and noisy stereo scenes. The patchlet element is a small planar surface the size of a pixel projected onto a world surface. It has a position, a normal direction, a size, and confidence metrics on its position and orientation. The confidence metrics are generated from the noise model of the stereo vision system, which is propagated first to the 3D point data and then to the patchlet parameters. The patchlets are used to extract larger bounded planar surfaces that are useful for environment modeling. We use a region-growing approach to identify how many surfaces exist in a stereo image and to obtain an initial estimate of the surface parameters. We then use Expectation-Maximisation (EM) to refine these surface parameters to an optimal estimate using a probability maximisation approach. The confidence metrics of the patchlet parameters allow proper weighting of patchlet contributions to the probability maximisation solution. We verify by experimental means the accuracy of correlation stereo matching and demonstrate that the patchlet confidence metrics obtained from that accuracy fit expected normal distributions.
We compare the results of the patchlet-based surface segmentation to manually constructed ground-truth segmentation and find that the segmentation accuracy for a scene ranged from 82% to 93%. We also present a method for filtering noise from correlation stereo disparity images.

Contents

Abstract ii
Contents iii
List of Tables vii
List of Figures viii
Acknowledgements xiv

1 Introduction 1
1.1 Overview 1
1.2 Thesis contributions 5
1.2.1 Thesis scope 6
1.3 Surface estimation challenges 6
1.4 Thesis summary 11
1.4.1 Patchlets 11
1.4.2 Surface extraction from stereo images 14
1.5 Thesis chapter organisation 18

2 Literature Review 19
2.1 Robot navigation and mapping 20
2.2 3D structure from scanning 21
2.2.1 Triangulations 22
2.2.2 Volumetric 23
2.2.3 Particle methods 24
2.2.4 Surface modeling 25
2.2.5 Differential Geometry 26

3 Stereo vision 27
3.1 Overview of stereo vision 27
3.2 Stereo vision to 3D data 29
3.3 Error analysis of stereo vision 31
3.3.1 Mismatch errors 31
3.3.2 Sub-pixel interpolation bias 35
3.3.3 Stereo accuracy 38
3.4 Uncertain points 41

4 Model estimation 43
4.1 Overview 43
4.2 Probability distribution models 43
4.2.1 Gaussian probability distribution 44
4.2.2 Lorentzian probability distribution 44
4.2.3 Fisher probability distribution 46
4.3 Mixture models 47
4.4 Maximum likelihood estimation 48
4.4.1 Robust estimation 50
4.5 Expectation-Maximisation 50
4.6 Application of EM to mixture models 51
4.6.1 Example of EM on a 1D Gaussian mixture model 53
4.6.2 Example of EM on a 1D Lorentzian mixture model 55

5 The Patchlet 57
5.1 Overview 57
5.2 Patchlet representation 59
5.3 Estimating patchlet parameters 60
5.3.1 Initialisation and outlier rejection 61
5.3.2 Fitting the patchlet plane 62
5.3.3 Creating the patchlet local coordinate system 66
5.3.4 Propagating patchlet uncertainty 67
5.4 Results 69
5.4.1 Patchlets from synthetic data 69
5.4.2 Patchlets from real data 72

6 Surface segmentation and estimation 76
6.1 Motivation 76
6.2 Approach 79
6.3 Surface model 82
6.4 Initial surface identification 85
6.4.1 RANSAC extracted surface results 86
6.5 Surface refinement with EM 88
6.5.1 Surface bounding 93
6.6 Example results 94
6.6.1 Future model selection strategies 96
6.7 Stability over noise 96

7 Experimental Results and Discussion 99
7.1 Experimental equipment 99
7.2 Analysis of correlation accuracy 101
7.3 Patchlet parameter regression test 103
7.4 Image test sets 103
7.5 Patchlet construction 107
7.6 Surface extraction 122
7.6.1 Example of model selection process 135
7.7 Ground truth comparisons 136
7.8 Fitting: points vs. patchlets 145
7.9 Lorentzian vs. Gaussian 146
7.10 The role of scale 146
7.11 Computation timing 147

8 Conclusions 149
8.1 Patchlet model 149
8.2 Surface extraction from stereo images 150
8.3 Stereo noise filtering 151
8.4 Correlation stereo error propagation 152
8.5 Future work 153
8.5.1 Merging multiple views 153
8.5.2 Model scale 153
8.5.3 Parametric surfaces 154

Bibliography 155

Appendix A Derivation of Expectation-Maximisation updates for mixture models 165
A.1 Gaussian 1D models 165
A.2 Lorentzian 1D models 166
A.3 Fisher 1D models 168
A.4 Patchlet-Surface models 169
A.4.1 Gaussian positional model - Surface uncertainty only 170
A.4.2 Gaussian positional model - Surface and patchlet uncertainty 172

Appendix B Geometry and coordinate transforms 173
B.1 Whitening 173
B.2 Unwhitening 174
B.3 Projection of a point to a plane 174

Appendix C Independence of orientation and offset 175

Appendix D Additional confidence measure images 178

List of Tables

4.1 Simulated 1D Gaussian mixture model results 53
7.1 Image 1: summary of segmentation accuracy 136
7.2 Image 2: summary of segmentation accuracy 139
7.3 Image 3: summary of segmentation accuracy 143

List of Figures

1.1 Stereo image: (a) is the view from the reference camera of a stereo camera system, (b) is the disparity or depth view of the scene. In the disparity image, the brighter a pixel the closer it is to the camera. 3
1.2 Patchlet example - close-up: a close-up of the patchlets shown in Figure 1.7; the individual patchlets are clearly visible. 4
1.3 Challenges in fitting surfaces to data: (a) is the ground truth solution; (b) shows the effects of not enough models; (c) shows the poor solutions possible if the right number of models fall into an incorrect local minimum; (d) shows the case when both model selection is poor and a poor local minimum is found; (e) shows the effects of selecting too many models to fit. 8
1.4 It is important to bound the surface. 9
1.5 Surface gaps - when is a gap between points large enough to break a surface into two? With points the question is ill-posed. If we have a size to the points, we can answer this question. 10
1.6 The patchlet model - uncertainty measures are indicated by the grey zones. 12
1.7 Patchlet example: for (a) and (c) the patchlets are colored by the intensity value of the source image; for (b) and (d) the patchlets are white with lighting. 13
1.8 Stereo image used for example surface segmentation. 14
1.9 Example of greedy RANSAC-based surface segmentation - each candidate surface is shown in black. Unselected but valid data is shown in grey. As each surface is identified, its associated data is removed from the data set for the next iteration. 15
1.10 A rendered 3D view of the RANSAC-based segmented surfaces - the different greyscales represent the patchlet-surface associations. 16
1.11 Relationship between surface and patchlets. 17
3.1 Illustration of trinocular stereo matching and disparity. 28
3.2 Pin-hole camera model: geometric system relating 3D position (x, y, z) to pixel position (u, v). The optical centre is the origin of the camera coordinate system.
3.3 Extracting a 3D point from stereo matching: the origin of the stereo system is the optical centre of the right camera (although it need not be in general). The two matching rays projecting from the cameras intersect at a 3D location. The rays are defined by a pixel location in each image plane.
3.4 An example scene.
3.5 The disparity image and corresponding top-down occupancy grid map result from the scene in Figure 3.4.
3.6 The morphologically filtered disparity image and corresponding top-down occupancy grid map result from the scene in Figure 3.4.
3.7 The disparity image filtered with our surface continuity-based "spike removal" filter and corresponding top-down occupancy grid map result from the scene in Figure 3.4. Removed pixels are marked in black.
3.8 Spike removal: effect on 3D point clouds. (a) shows a side view of the point cloud generated by the disparity image given in Figure 3.1 without spike removal validation; (b) shows the same disparity image after spike removal has been applied. 36
3.9 Sub-pixel interpolation bias. 37
3.10 Stereo triangulation error characteristics: (a) shows the typical aligned covariance matrix determined through equal weighting of errors in the left and right cameras; (b) shows the covariance matrix we obtain through use of a separate pointing and matching accuracy. 39
4.1 The Lorentzian distribution - the parameters that control the function are the mean μ and the spread parameter Γ. Γ is the width of the distribution at one-half of the maximum height. 45
4.2 A comparison of the Gaussian and Lorentzian distributions. Note the heavy tails of the Lorentzian. 45
4.3 The Fisher distribution. 46
4.4 Polar coordinates for points on a sphere. 48
4.5 Graphical model for data obtained from a mixture model. 52
4.6 The results of 1D EM fitting to a mixture model at various iterations. A histogram of the data is shown with the dashed line. The solid line represents the various PDFs of the sub-models within the mixture model. 54
4.7 The results of 1D EM fitting to a mixture model using the incorrect number of sub-models is shown here, comparing the behavior of the Gaussian with the Lorentzian. (a) shows the Gaussian fit. The left model is correctly identified but the right model is spread between the centre and the right underlying models. (b) shows the same system with the Lorentzian sub-models. The left model is correctly identified. The right model locks on to the underlying right model and the centre model is relegated to the status of outlier. 56
5.1 Surface continuity and correlation stereo: in the continuous case, (a), both cameras perceive largely the same textured region and consequently the stereo match will be more likely to be successful and will have a high confidence. In the discontinuous case, (b), both of the cameras will include similar views of the foreground object but will clearly include very different regions from the background. This will reduce the quality of the correlation match. 58
5.2 The patchlet surface element and its various parts. 59
5.3 Constructing the patchlet local coordinate system: the cone represents the projection of the disparity pixel. The plane represents the fitted patchlet plane. 66
5.4 Synthetic disparity image of a 5m deep by 2m wide by 2m tall box. In the appearance image, each side of the box is mapped to a constant greyscale. 69
5.5 View of the 3D points extracted from the synthetic disparity image. (b) shows a close-up of the upper, right, back corner of the box. Notice the sides of the box are much less densely sampled than the back. 70
5.6 Shaded view of the patchlets that match the views in Figure 5.5. Some patterns are visible in (a) due to aliasing in the rendering; this is due to the large number of patchlets (76800) and the limited area of the rendering. The patchlets are rendered at 90% size so that the gaps between patchlets are visible. 71
5.7 This figure shows the standard deviation and κ values of the patchlets displayed in greyscale. In both cases brighter means more confidence: high κ, low σ. In (a) the square root of the patchlet variance values is displayed graphically. The values range from 1mm to 1cm. The back of the box is much more uncertain than the sides. In (b) the Fisher model κ parameter is shown. The back of the box has a low value of κ ≈ 1 whereas the sides of the box are quite constant with values of κ ≈ 8.8. (The banding seen on the box sides in (a) and (b) is caused by slight perturbations in the 3D points caused by the quantization of the disparity image by 1/128 of a pixel.) 71
5.8 The effect of covariance/surface orientation on confidence. 72
5.9 Image of steps and associated disparity image. 73
5.10 Display of variance and κ values of patchlets created from the stereo image in Figure 5.9. 73
5.11 Several 3D renderings of the patchlets created from the stereo image in Figure 5.9. 75
6.1 Challenges in fitting surfaces to data: (a) is the ground truth solution; (b) shows the effects of not enough models; (c) shows the poor solutions possible if the right number of models fall into an incorrect local minimum; (d) shows the case when both model selection is poor and a poor local minimum is found; (e) shows the effects of selecting too many models to fit. 77
6.2 Unbounded surfaces attract outliers. 78
6.3 Surface gaps - when is a gap between points too large? 80
6.4 Relationship between surface and patchlets. 83
6.5 RANSAC-based greedy surface segmentation on synthetic data - black indicates pixels associated with the surface, grey are valid pixels, white are invalid pixels or pixels that have already been assigned to a different surface. 87
6.6 RANSAC-based greedy surface segmentation - black indicates pixels associated with the surface, grey are valid pixels, white are invalid pixels or pixels that have already been assigned to a different surface. 89
6.7 Full initial segmentation from greedy surface associations as shown in Figure 6.6, including a 3D view from a different viewpoint. 90
6.8 A 3D view of the segmented patchlets in Figure 6.7 seen from a different viewpoint. 90
6.9 Graphical model for generation of sensed data from an underlying mixture model of surfaces. 91
6.10 Simplification of graphical model. 92
6.11 Example surface extraction. 95
6.12 Surface segmentation on synthetic images with noise. 97
7.1 Digiclops trinocular stereo rig. 100
7.2 Example of rectification. 100
7.3 Correlation calibration image. 101
7.4 Experimental comparison of correlation accuracy with unit Gaussian. 102
7.5 Experimental verification of patchlet confidence metrics. 104
7.6 Images 1-6. 105
7.7 Images 7-12. 106
7.8 Carpet image confidence measures. 108
7.9 Image 1: steps close-up. 110
7.10 Image 2: front steps. 111
7.11 Image 3: front steps - side view. 112
7.12 Image 4: front steps - side view - far. 113
7.13 Image 5: brick corner - close-up. 114
7.14 Image 6: brick corner. 115
7.15 Image 7: brick corner - far. 116
7.16 Image 8: courtyard. 117
7.17 Image 9: dirt pile - low. 118
7.18 Image 10: dirt pile - high. 119
7.19 Image 11: walkers and tree. 120
7.20 Image 12: statue. 121
7.21 Image 1 surface extraction. 123
7.22 Image 2 surface extraction. 124
7.23 Image 3 surface extraction. 125
7.24 Image 4 surface extraction. 126
7.25 Image 5 surface extraction. 127
7.26 Image 6 surface extraction. 128
7.27 Image 7 surface extraction. 129
7.28 Image 8 surface extraction. 130
7.29 Image 9 surface extraction. 131
7.30 Image 10 surface extraction. 132
7.31 Image 11 surface extraction. 133
7.32 Image 12 surface extraction. 134
7.33 Image 3: model selection - attempts at using 3 through 8 models. 135
7.34 Image 1: comparison of ground truth and EM segmented planes. 137
7.35 Image 1: surface-by-surface comparison of segmentation. 138
7.36 Image 2: comparison of ground truth and EM segmented planes. 140
7.37 Image 2: surface-by-surface comparison of segmentation. 141
7.38 Image 3: comparison of ground truth and EM segmented planes. 142
7.39 Image 3: surface-by-surface comparison of segmentation. 143
7.40 Image 3: surface-by-surface comparison of segmentation. 144
7.41 Image 3: comparison of fitting the full patchlet model and fitting positional information only, without the patchlet orientation. 145
8.1 Patchlet example. 150
8.2 Surface extraction from stereo images. 151
8.3 Stereo noise filtering. 152
8.4 Correlation stereo error. 153
C.1 A fitted line to a set of points. 175
D.1 Image 1 confidence measures. 179
D.2 Image 2 confidence measures. 180
D.3 Image 3 confidence measures. 181
D.4 Image 4 confidence measures. 182
D.5 Image 5 confidence measures. 183
D.6 Image 6 confidence measures. 184
D.7 Image 7 confidence measures. 185
D.8 Image 8 confidence measures. 186
D.9 Image 9 confidence measures. 187
D.10 Image 10 confidence measures. 188
D.11 Image 11 confidence measures. 189
D.12 Image 12 confidence measures. 190

Acknowledgements

There are many people to whom I am grateful for their assistance and support during the course of this thesis, but to two in particular I owe special thanks. The first is Jesse Hoey, a fellow student and lab-mate, to whom I am indebted for his great patience and many hours spent revealing to me the art of probabilistic methods, and also for his good cheer and positive attitude. The second is my wife, Gloria, whose continuous support and encouragement made this work possible. I thank you both.

DONALD R. MURRAY

The University of British Columbia
December 2003

Chapter 1

Introduction

1.1 Overview

This thesis describes a method for fitting local planar surfaces to the 3D data obtained from correlation stereo imaging and, using these local planar surfaces as the fundamental sensed-data primitive of the depth image, detecting and segmenting the 3D image into higher-level surfaces. We call the local planar surface elements patchlets.
Stereo vision is a method of range scanning that is gaining popularity as computer vision cameras become cheaper and computers become fast enough to process stereo vision at near video rates for images as large as 320 x 240 pixels. Real-time stereo vision has been used for applications that require coarse depth sensing, such as robot navigation [ML98] or foreground/background segmentation [KYO+96], as well as for systems that combine depth information with other more precise cues, such as the use of stereo combined with color and shape for gesture recognition [Jen99]. In the past few years more attempts have been made to use stereo data for complex scene 3D modeling. These attempts are generally made with many cameras (from 5 to hundreds) and with many viewpoints to generate overlapping point clouds [SKYT01]. Our interest in the interpretation of correlation stereo for environment modeling originated from our work on visually guided mobile robots and robot mapping [MJ97, ML98, ML00, JML99]. In this work we used stereo vision to generate occupancy grid maps [Elf89] for robot navigation and localisation. The robot would collect stereo data from multiple viewpoints as it moved along a trajectory and merge the data into a 2D floor-plane map of the robot's environment. These maps would be used for navigation and spatial-task planning for the mobile robot. Two-dimensional occupancy grids are simple environment models: they are flat and coarse (depending upon the grid size; in our work we used a grid size of between 2 and 10cm square). These attributes make them of little use for environment modeling tasks other than robot navigation. We wished to investigate using stereo data obtained from a mobile robot to create more complete environment models that are useful for complicated tasks.
The challenges in using stereo vision data are that it

- produces an enormous quantity of data
- produces data that is less precise than the typical sensors used
- often experiences systematic noise that is difficult to filter

The quantity of 3D points that can be collected through correlation stereo is quite large. Typical cameras cover a 90° x 60° field of view and can acquire approximately one million data points per second. (This is based on approximately 60,000 data points obtainable from a 320 x 240 stereo image being processed at 15 frames per second. If we ignore the processing bottleneck, current stereo cameras can easily acquire images that will produce from 7,000,000 to 18,000,000 depth pixels per second.¹) We realised that to use this large amount of 3D information in an environment model, it would be advantageous to use a compact model representation. Stereo vision simply generates too much data for a model to explicitly represent each acquired data element. As well, stereo is less precise than data acquisition methods that have been traditionally used for 3D modeling, such as structured light [SW90, VM98] or laser range finders [Per93]. The two sources of errors in stereo depth images are the slight errors in pixel-disparity values of correctly matched pixels, and gross errors due to stereo "mismatches". Mismatches are errors that occur when the stereo matching process fails and incorrect pixels in the separate camera images are determined to be looking at the same point in space. This happens when there are two features in the scene that appear similar and are confused in the matching process, or when the true match is occluded from the view of one camera. These errors are systematic and are stable over time because they are caused by the scene content, not by small changes in lighting or camera sensor parameters. The mismatch errors will cause a region of pixels to have a consistent but incorrect disparity value.
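The data rates quoted above follow from simple arithmetic. The sketch below checks them; the valid-pixel fraction is an illustrative assumption (real disparity images lose some pixels to failed or invalidated matches, which is presumably why the quoted figures sit below the raw pixel rates).

```python
# Back-of-envelope check of the stereo data rates quoted above.

def depth_pixels_per_second(width, height, fps, valid_fraction=1.0):
    """Upper bound on depth pixels produced per second."""
    return int(width * height * fps * valid_fraction)

# 320x240 at 15 Hz: ~1.15 million raw pixels/s; with roughly 60,000
# valid depth points per frame this is ~0.9 million points/s, i.e. the
# "approximately one million data points per second" quoted above.
raw_320 = depth_pixels_per_second(320, 240, 15)       # 1,152,000
points_320 = 60_000 * 15                              # 900,000

# Higher-resolution cameras, ignoring the processing bottleneck:
raw_640 = depth_pixels_per_second(640, 480, 30)       # 9,216,000
raw_1024 = depth_pixels_per_second(1024, 768, 30)     # 23,592,960

print(raw_320, points_320, raw_640, raw_1024)
```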
This consistency of the error value over a region makes the error appear to be "signal" rather than "noise" to most filtering methods. One method to combat mismatch errors is to combine data from multiple viewpoints: mismatch errors will not be persistent over a change in viewpoint, while real structure will be. Consequently, mismatch errors can be detected as data that does not receive multiple-viewpoint support. Another method to filter mismatch errors is presented in Chapter 3 and relies on the insight that correlation stereo is a surface-sensing technology that can only obtain accurate depth values for surfaces that meet certain criteria of texture, size, position and orientation. Since correlation stereo works by pattern matching, it only works well on textured surfaces that can be seen clearly from at least two of the cameras in the stereo rig. Although the resultant disparity image has a depth value associated with each pixel, that depth value is really a result of many pixels in a local neighbourhood. This implies that there must be a surface patch that fills this neighbourhood of pixels with texture information for the correlation to match. So correlation stereo requires the scene to be made up of textured surface patches. This insight, that correlation stereo is a surface-sensing method, led to the development of the primary contribution of this thesis, the patchlet. The patchlet is a surface element that represents a stereo pixel projected onto its sensed surface. It has a position, a size and a surface normal. The patchlet is a sensed-data primitive that is more information-rich than 3D points, and is based on our knowledge of the constraints of the surfaces that correlation stereo can detect.

¹ Approximately 7,000,000 depth pixels per second would be generated by a 640x480 resolution camera acquiring images at 30 Hz, while 18,000,000 would be generated by a 1024x768 system. Both types of systems have been commercially available since 2000.
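The surface-continuity insight can be turned into a simple filter. The sketch below is an illustrative approximation, not the algorithm of Section 3.3.1: it invalidates disparity pixels whose neighbourhood does not contain enough pixels of similar disparity, i.e. pixels that do not sit on a sufficiently large continuous surface patch. All thresholds are made-up illustration values.

```python
# Hedged sketch of a surface-continuity ("spike removal") style filter:
# a depth pixel is kept only if enough of its neighbours have a similar
# disparity, on the assumption that valid correlation matches come from
# continuous textured surface patches.

def spike_filter(disparity, window=2, max_diff=1.0, min_support=6,
                 invalid=0.0):
    """Return a copy of `disparity` with poorly supported pixels removed.

    disparity   : 2D list of floats (`invalid` marks missing matches)
    window      : half-width of the square support neighbourhood
    max_diff    : neighbours within this disparity range count as support
    min_support : minimum number of supporting neighbours to keep a pixel
    """
    rows, cols = len(disparity), len(disparity[0])
    out = [row[:] for row in disparity]
    for r in range(rows):
        for c in range(cols):
            d = disparity[r][c]
            if d == invalid:
                continue
            support = 0
            for dr in range(-window, window + 1):
                for dc in range(-window, window + 1):
                    rr, cc = r + dr, c + dc
                    if (dr or dc) and 0 <= rr < rows and 0 <= cc < cols:
                        n = disparity[rr][cc]
                        if n != invalid and abs(n - d) <= max_diff:
                            support += 1
            if support < min_support:
                out[r][c] = invalid  # isolated "spike": remove it
    return out
```

A lone pixel whose disparity disagrees with everything around it (a classic mismatch spike) loses all its support and is invalidated, while pixels on a continuous surface keep theirs.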
(a) Reference camera image (b) Disparity (depth) image

Figure 1.1: Stereo image: (a) is the view from the reference camera of a stereo camera system, (b) is the disparity or depth view of the scene. In the disparity image, the brighter a pixel the closer it is to the camera.

Figures 1.1 and 1.2 are intended to briefly give the reader an intuition about patchlets, what they are and where they come from. Figure 1.1(a) shows the view from a stereo camera. Figure 1.1(b) shows the disparity or depth image obtained from correlation stereo. In this mapping, the closer a pixel is to the camera the brighter it is. Figure 1.2 shows a close-up view of this scene after it has been converted to patchlets and rendered from a novel view. Each of the small rectangles that make up the woman's face in Figure 1.2 is a patchlet.

Figure 1.2: Patchlet example - close-up: a close-up of the patchlets shown in Figure 1.7; the individual patchlets are clearly visible.

The depth image shown in Figure 1.1(b) has been converted to 3D position and sensor accuracy information, and then small surface elements have been fit to this data, resulting in the patchlets seen in Figure 1.2. Each patchlet has been colored by the greyscale value of the image in Figure 1.1(a) to provide a simple texture mapping. For environment modeling, we have developed methods for extracting planar surfaces from complex stereo scenes using the patchlet as the basic data primitive. Planar surfaces are models that fulfill our compact representation requirements and are the most prevalent surfaces in man-made environments. As well, since a planar patch is a very basic surface type, one can often represent more complex shapes through a combination of planar surfaces. However, the methods developed in this thesis are not restricted to planar surfaces and can be extended to other kinds of surfaces such as cylinders and spheres; it is just a matter of correctly parameterising the surface.
We demonstrate that the extraction of planar surfaces is made easier and more robust through the use of patchlets as the basic data element. For the remainder of this chapter, we first present a list of contributions of this thesis to the fields of stereo vision and environment modeling in Section 1.2. Section 1.3 describes many of the challenges one is faced with in extracting surfaces from 3D point clouds and how the extension of 3D points to 3D patchlets can address many of these challenges in a natural and logical way. Section 1.4 gives a short overview of the main contributions of this thesis, patchlets and patchlet-based surface segmentation. Finally, Section 1.5 gives an overview of the organisation of the thesis.

1.2 Thesis contributions

This section lists the significant contributions made by this thesis in descending order of importance.

Patchlet model In this thesis we develop the patchlet model, which is a surface element data primitive for representing sensor data obtained from correlation stereo vision. The patchlet is made up of a 3D position, surface orientation or normal, and physical size. In addition, there are confidence parameters on both the surface normal and the 3D position that represent the sensor accuracy of the patchlet. These confidence measures are propagated from the accuracy in the underlying stereo measurements. The construction of patchlets is described in Chapter 5.

Surface extraction from complex scenes Using patchlets, we demonstrate successful extraction of planar surfaces from complex stereo images. The surface extraction method uses the additional parameters available from the patchlets, compared to 3D point based methods. The method allows the user to specify the resolution at which surfaces are to be extracted. These surfaces have rectangular bounds on their extent. Surface extraction is described in Chapter 6.
Stereo noise filtering based on surface continuity Based on the insight that correlation stereo only performs well on continuous, textured surfaces, we develop a filter for removing stereo mismatch errors. This filter works by detecting sensor data within the stereo result that does not conform to the surface constraint and removing it. This filter has been very successful and has been adopted for commercial stereo systems. It is described in Section 3.3.1.

A method for estimating the error of 3D points derived from correlation stereo Error analysis for stereo-generated 3D points is normally done with the cameras in the stereo rig having equal weight or pixel accuracy. This is true for feature-based stereo, in which the features are extracted from each image independently and then matched. However, for correlation stereo this is not the case. The reference camera in the stereo rig is the privileged camera and its error is only the error of the camera calibration. The contributions of the other cameras include both camera calibration errors and correlation matching uncertainty. In Section 3.3.3 we present a modified method for estimating the variance of a 3D point generated by correlation stereo that is both new and more suited to the correlation algorithm.

1.2.1 Thesis scope

It should be added that although a long-term goal of these methods is to combine data from multiple viewpoints, and to automatically determine the natural resolution or scale of the reconstruction, these objectives are outside of the scope of this thesis.

1.3 Surface estimation challenges

Surface segmentation from range data is a challenging area of research in its own right. (See Section 2.2 for more discussion on the current state of the art.) We quickly realised that we wanted to avoid using image structure as much as possible in developing our approach for constructing compact scene representations. By "image structure" we mean the knowledge of a data element's position in the source image.
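To give a feel for the error analysis discussed above, the sketch below propagates disparity noise to depth noise for the standard equal-weight triangulation model z = fB/d; it is the textbook first-order model, not the modified per-camera weighting of Section 3.3.3, and the focal length, baseline and noise values are illustrative.

```python
# First-order propagation of disparity noise to depth noise for the
# standard triangulation model z = f*B/d (f in pixels, B the baseline
# in metres, d the disparity in pixels). Illustrative numbers only.

def depth_and_sigma(f_px, baseline_m, disparity_px, sigma_d_px):
    """Return (depth, 1-sigma depth uncertainty) for one stereo match."""
    z = f_px * baseline_m / disparity_px
    # |dz/dd| = f*B/d^2 = z^2/(f*B): depth error grows quadratically
    # with range, which is why far stereo matches are so uncertain.
    sigma_z = (z * z / (f_px * baseline_m)) * sigma_d_px
    return z, sigma_z

f_px, B = 400.0, 0.1            # 400-pixel focal length, 10 cm baseline
for d in (40.0, 10.0, 4.0):     # near, mid and far matches
    z, s = depth_and_sigma(f_px, B, d, sigma_d_px=0.1)
    print(f"d={d:5.1f} px  ->  z={z:5.2f} m  sigma_z={s * 100:6.2f} cm")
```

With 0.1-pixel disparity noise, the same matching accuracy gives roughly 0.25 cm of depth uncertainty at 1 m but 25 cm at 10 m.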
For example, when one does a region growing method, growing the region through adjacent pixels, this is using image structure. Conversely, standard clustering, dividing the data elements into groups that have low variance in 3-space, would not make use of this information. We want to avoid using image structure because we want methods that can incorporate images from multiple viewpoints into the same representation. A pixel from one image does not share the image structure of a pixel from a different image; therefore these two pixels could not be used together in a method that depended on image structure. For this reason, we decided that an approach that uses the parameters of the data elements independent of their source image is preferable. In this section we will limit our discussion of surfaces to planar surfaces.

The task of extracting surfaces from range data can be decomposed into three sub-tasks:

1. determine how many surfaces there are in the scene - this is called the model selection problem
2. determine which data elements belong to which underlying surfaces - this is called the data association problem
3. estimate the surface parameters, given the associated data - this can be thought of as the model fitting problem.

The model fitting problem is the easiest one of the three and is generally well understood. The challenges associated with it are issues such as outlier rejection, but these are well defined and there are many methods available. The other two problems, however, are more difficult. Typical solutions are to use methods such as region growing within the image, or robust clustering such as expectation-maximisation. Often the techniques require iterative solutions. Figure 1.3 shows some of the pitfalls encountered when these problems are not solved correctly. The ideas are illustrated with lines fitted to 2D points, where the lines represent planar surfaces.

(a) Desired solution (b) Poor model selection (c) Poor association (d) Poor model selection and association (e) Too many models

Figure 1.3: Challenges in fitting surfaces to data: (a) is the ground truth solution; (b) shows the effects of not enough models; (c) shows the poor solutions possible if the right number of models fall into an incorrect local minimum; (d) shows the case when both model selection is poor and a poor local minimum is found; (e) shows the effects of selecting too many models to fit.

Figure 1.3(a) shows five surfaces and a set of points that have been sampled from them. Figure 1.3(b) shows an example of poor model selection. There are five underlying surfaces but we have only chosen to fit two. The two models must account for all the data and so each surface is a combination of multiple underlying surfaces. Figure 1.3(c) shows an example of a poor solution to the data association problem. In this case several points that are associated with one underlying surface have been associated with two fitted surfaces. This is a common problem when an iterative solution has been poorly initialised. Iterative solutions are usually trying to minimise a cost function, and this cost function is rarely unimodal. Poor initialisation will make the task of finding a global optimum much more difficult. Figure 1.3(d) shows the kind of situation that can occur when both the model selection and data association are done poorly. In this case there are effectively two surfaces overlapping each other and attempting to fit to the same structure. In this case, they simply divide the data between them. Finally, Figure 1.3(e) shows the possibility if too many models are selected. Actual surfaces can be broken, and broken again, into smaller and smaller units.

Figure 1.4: It is important to bound the surface

Another challenge we encountered was that unbounded surface models will always tend to pick up outliers that corrupt the accuracy of the model parameter estimation.
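The model fitting sub-task for a planar surface (here a line in 2D, as in Figure 1.3) can be sketched as an orthogonal least-squares fit; the same toy data also makes the lever-arm effect of a distant outlier concrete. The data below are illustrative, not taken from the thesis.

```python
# Total-least-squares line fit to 2D points (the "model fitting"
# sub-task, with lines standing in for planar surfaces).
import math

def fit_line(points):
    """Fit a line by orthogonal regression.

    Returns (centroid, normal): the line passes through the centroid,
    and `normal` is the unit normal minimising the summed squared
    point-to-line distance (smallest eigenvector of the 2x2 scatter).
    """
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    sxx = sum((p[0] - cx) ** 2 for p in points)
    syy = sum((p[1] - cy) ** 2 for p in points)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points)
    # Principal-axis angle of the scatter matrix [[sxx, sxy], [sxy, syy]]
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # line direction
    nx, ny = -math.sin(theta), math.cos(theta)    # its unit normal
    return (cx, cy), (nx, ny)

# Points on y = 0: the fitted normal is (0, 1), as expected.
flat = [(float(x), 0.0) for x in range(10)]
_, (nx, ny) = fit_line(flat)

# A single distant point near the (unbounded) line exerts a long
# rotational "lever arm" on the fit and tilts the recovered normal.
_, (ox, oy) = fit_line(flat + [(30.0, 10.0)])
```

On the clean data the normal comes out as (0, 1); adding the single far-away point visibly rotates it, which is exactly the unbounded-surface failure mode described above.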
Figure 1.4 illustrates this problem. The dotted line indicates the actual surface we would hope to fit. However, if the surface is unbounded, it is quite likely that somewhere along its unbounded extent there will be some unrelated data points that will, by chance, be close enough to the surface to be associated with it. When that happens, because these distant points have a very long "lever arm" from the centroid of the main basis of support, they will exert a rotational influence on the surface parameter estimation. In the figure, this rotated surface is shown by the solid line. As one collects more data and models larger and larger environments, this problem only becomes worse. Our solution was to bound the size of surfaces so that only local support would be used to estimate surface parameters. Bounded surfaces add extra degrees of freedom to the parameters of the surface model, as the spatial extent of the surface must be represented. There is also the requirement of limiting the size of the surface during parameter estimation. It is important to have a mechanism within an iterative parameter estimation method to restrict the surface size, otherwise the algorithm may incrementally increase the surface size ad infinitum, and the problems with unbounded surfaces will begin to appear. Figure 1.5 illustrates how the patchlet data representation helps solve this problem. Figure 1.5(a) shows a situation where there is a set of points that may or may not be associated with the same surface. There are two noticeable gaps that are larger than the rest. When dealing with point data it is difficult to know when a gap between points is large enough to classify the far points as outliers. Since patchlets have a size, however, we can determine whether any gap is bridged or not. By projecting the patchlets onto the surface we can determine whether they overlap continuously.

Figure 1.5: Surface gaps - when is a gap between points large enough to break a surface into two? (a) Are all these points on the same surface? There are two significant gaps of different size; is either of them too large? (b) Sized data points: here one gap is bridged but the other is not. (c) Both gaps are bridged. (d) Even though the patchlet is large enough, it does not match the surface parameters. With points the question is ill-posed; if we have a size to the points, we can answer it.

Figure 1.5(b) shows a set of possible surface element sizes where the left gap is bridged but the right one is not. If the patchlets were larger, as shown in Figure 1.5(c), both gaps are bridged and we can confidently associate all of the data with the same surface model. Finally, Figure 1.5(d) illustrates the additional discrimination available from the patchlet normal. In this case, the right-most patchlet clearly does not belong to this surface, regardless of gaps, because the normal of the patchlet does not match the normal of the surface model. By including the orientation in our data element parameters, we gain extra dimensionality in which to detect the quality of fit between patchlets and surfaces.

1.4 Thesis summary

In this section we give a brief overview of the main contributions of the thesis: patchlets and surface extraction from stereo images.

1.4.1 Patchlets

The fact that patchlets have an intrinsic size and orientation proved to be a great advantage in overcoming many of the challenges of surface segmentation and extraction presented above. One of the issues not yet touched upon, however, is the use of confidence measures on patchlet parameters. In the previous discussion we described how we intend to use patchlets to represent noisy stereo data, and that we will apply robust estimation techniques to them for extracting surface parameters. The robust estimation techniques point to the use of probabilistic methods such as Expectation-Maximisation.
However, another issue with stereo range data is the large variation in scale of the sensed data, depending on its distance from the camera. The projective nature of a camera means that the surface area of a pixel will project to a size that is proportional to the square of the distance. So a pixel at 5m will project to an area twenty-five times larger than a pixel at 1m. The errors in the 3D accuracy of stereo grow even more dramatically. Error in the disparity image is constant in pixels, but disparity is inversely proportional to the distance from the camera. For example, the 3D accuracy for a typical stereo camera would be ±4mm at 1m, while a point sensed at 5m would be ±10cm and a point sensed at 10m would be ±38cm (for a stereo rig with a focal length of 250 pixels, a baseline of 10cm, and an assumed disparity accuracy of ±0.1 disparity pixels). This large difference in the confidence of 3D values obtained from stereo vision indicates that any algorithm that treats these values as equally weighted will be very naive. Instead of equally weighting each patchlet, we propagate the confidence measures of the 3D points to confidence measures of the patchlet parameters. These measures can be used to properly weight the patchlet attributes for probabilistic methods. To paraphrase a statement made by Sebastian Thrun during his tutorial at NIPS 2001 (Neural Information Processing Systems): we've tried to use perfect models with sensor-based data uncertainty, and we've tried to use probabilistic models with perfect data, but for real-world systems to work robustly it is important to use both probabilistic models and sensor-based data uncertainty.

Figure 1.6: The patchlet model - uncertainty measures are indicated by the grey zones

The patchlet model is illustrated in Figure 1.6. The patchlet has a 3D position and its own local coordinate frame. The 3D point specifying its position is set as the origin of its coordinate frame.
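The depth-accuracy figures quoted earlier can be reproduced from the stereo relation z = fb/d (depth = focal length × baseline / disparity). Below is a minimal sketch using the parameters given in the note (f = 250 pixels, b = 10cm, disparity error of 0.1 pixels); the function name is ours, not the thesis's:

```python
def depth_error(z, f=250.0, b=0.10, dd=0.1):
    """Depth shift caused by a dd-pixel disparity error at depth z (metres),
    using z = f * b / d with f in pixels and b in metres."""
    d = f * b / z                 # disparity (pixels) of a point at depth z
    return z - f * b / (d + dd)   # depth change if disparity is off by dd

for z in (1.0, 5.0, 10.0):
    print(f"{z:4.1f} m -> {100 * depth_error(z):.1f} cm")
```

At 1m the error is about 0.4cm, at 5m about 9.8cm, and at 10m about 38.5cm, consistent with the figures in the text. Note the roughly quadratic growth with depth: the error scales as z² · Δd / (fb).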
Its normal is the local Z axis, and there are size dimensions in the local X and Y axes. There are confidence measures that map to a variance in both the normal direction and the position along the normal. These uncertainties are shown by the grey area in Figure 1.6. The patchlet is formed by fitting a local planar surface to the 3D point data in the neighbourhood of its associated stereo pixel. Each 3D point has an uncertainty ellipse described by its 3-space covariance matrix. This matrix is determined from the stereo sensor model, and the patchlet confidence measures are these matrices propagated to the patchlet parameter space. An example of rendered patchlets is shown in Figures 1.7 and 1.2. These patchlets were created from the stereo image shown in Figure 1.1. Figures 1.7(a) and (c) show rendered views of the patchlets where each patchlet is colored by the intensity of the original image. Figures 1.7(b) and (d) show rendered views where the patchlets are white and shaded.

Figure 1.7: Patchlet example: for (a) and (c) the patchlets are colored by the intensity value of the source image; for (b) and (d) the patchlets are white with lighting

1.4.2 Surface extraction from stereo images

To extract surfaces from range images, we use a combination of region-growing and probabilistic techniques. The first step is to extract a good initial guess of the surface parameters. To do this we use a modification of the RANSAC (RANdom SAmple Consensus) algorithm. A random patchlet is selected and used as a candidate surface seed. Using region growing within the image, we incrementally add patchlets to a set of patchlets whose parameters match those of the candidate surface. When there are no more patchlets to be added, we score the success of this candidate as the number of patchlets that belong to its set.
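The patchlet fitting step described above - a local planar fit to the 3D points in the neighbourhood of a stereo pixel - can be sketched with a least-squares (principal component) fit. This is an illustration of the idea only; the thesis additionally propagates the per-point covariance matrices, which are omitted here:

```python
import numpy as np

def fit_patchlet(points):
    """Fit a plane to a small neighbourhood of 3D points.
    Returns (origin, normal): the centroid of the points and the
    direction of least variance (the plane normal, up to sign)."""
    pts = np.asarray(points, dtype=float)
    origin = pts.mean(axis=0)
    # SVD of the centred points: the last right singular vector is the
    # direction in which the points vary least, i.e. the plane normal.
    _, _, vt = np.linalg.svd(pts - origin)
    return origin, vt[-1]

# Noise-free points on the plane z = 0.5 * x:
pts = [(x, y, 0.5 * x) for x in range(3) for y in range(3)]
origin, normal = fit_patchlet(pts)
```

For these points the recovered normal is parallel to (1, 0, -2)/√5 (up to sign), and the origin is the centroid, giving the patchlet its position and local frame.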
We do this many times (e.g., 100 times) and select the surface with the highest score as a good initial guess. All the patchlets within the successful surface's set are removed from the image and the algorithm is repeated on the remainder until a set number of surfaces have been extracted or no further surfaces are identified above a threshold score.

Figure 1.8: Stereo image used for example surface segmentation: (a) reference camera image; (b) stereo disparity image

This approach is greedy, as it assigns data to the best surface first, and it will never go back to re-evaluate that data association during this phase of the algorithm. Figure 1.8(a) shows an example scene that includes 3 concrete steps. There are six obvious surfaces in this image. Figure 1.8(b) shows the stereo disparity image that is the result of correlation stereo. The darker a disparity pixel is in this figure, the farther it is from the camera. For more description of correlation stereo and the disparity image, see Chapter 3. This disparity image was converted to patchlets and the RANSAC-based segmentation algorithm was run. It automatically detected and segmented five of the six surfaces. Figure 1.9(a-e) shows each iteration of the algorithm. The highest-scored surface is shown in black. The grey values indicate valid patchlet pixels that were not selected in this iteration. One can see that as the algorithm progresses, the available patchlets decrease until finally not enough are left to create a large enough surface.

Figure 1.9: Example of greedy RANSAC-based surface segmentation - each candidate surface, (a) first through (e) fifth and last, is shown in black. Unselected but valid data is shown in grey. As each surface is identified, its associated data is removed from the data set for the next iteration.
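The seed-and-grow loop just described can be sketched as follows. Here `neighbours` and `fits` are hypothetical callables standing in for image adjacency and the patchlet-to-candidate-surface compatibility test; the trial count and minimum score are illustrative:

```python
import random

def segment_greedy(patchlets, neighbours, fits, trials=100, min_size=5):
    """Greedy RANSAC-style segmentation sketch: repeatedly grow regions
    from random seed patchlets, keep the best-supported candidate
    surface, remove its patchlets, and repeat on the remainder."""
    remaining, surfaces = set(patchlets), []
    while remaining:
        best = set()
        for _ in range(trials):
            seed = random.choice(sorted(remaining))
            region, frontier = {seed}, [seed]
            while frontier:                       # region growing
                p = frontier.pop()
                for q in neighbours(p):
                    if q in remaining and q not in region and fits(seed, q):
                        region.add(q)
                        frontier.append(q)
            if len(region) > len(best):           # score = size of its set
                best = region
        if len(best) < min_size:                  # no surface above threshold
            break
        surfaces.append(best)
        remaining -= best                         # greedy: never re-evaluated
    return surfaces

# Toy 1D "image" of 20 patchlets forming two mutually compatible runs:
surfaces = segment_greedy(range(20),
                          neighbours=lambda p: [p - 1, p + 1],
                          fits=lambda seed, q: q // 10 == seed // 10)
```

On the toy input the sketch recovers the two surfaces {0..9} and {10..19}, illustrating both the greediness (data assigned to the best surface first) and the stopping criterion.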
Figure 1.10: A rendered 3D view of the RANSAC-based segmented surfaces - different greyscales represent the patchlet-surface associations

The top step's horizontal surface was not detected because this surface is too oblique to the cameras. Not many patchlets could be constructed on this step because of the surface orientation with respect to the camera. This can be seen in the strip of data missing in the images in Figure 1.9 where the top step appears. Figure 1.10 shows a 3D rendered view of the patchlets created from this stereo image. The patchlets are color-coded to indicate the surface to which they are assigned. The RANSAC-based method is successful in its own right in solving both the model selection problem and the data association problem. However, the greediness of the algorithm can cause a few mistakes in data association. More importantly, it is based on image structure which, as we discussed in Section 1.3, is a disadvantage when combining images from multiple viewpoints. To refine these estimates we make use of Expectation-Maximisation (EM), a probability maximisation technique that is well suited to optimising both the data association problem and the model fitting problem simultaneously. This method makes no use of the image structure. It is based on the probabilistic relationship between a surface and a set of patchlets, regardless of their source image. We describe a strategy for using RANSAC to bootstrap the system on the first image and to use EM thereafter to incorporate additional images. This alleviates the problem of using image structure with the RANSAC-based approach.

Figure 1.11: Relationship between surface and patchlets

Figure 1.11 illustrates the basis of the relationship between patchlets and surfaces. The grey zones indicate the confidence measures of the patchlet based on the propagated sensor errors.
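The patchlet-surface relationship of Figure 1.11 can be sketched by normalising the position and pointing differences by the patchlet's own confidence measures, so that uncertain patchlets contribute less. The names and the simple two-term form below are ours, not the thesis's exact formulation:

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_residual(n, d, p_pos, p_normal, sigma_d, sigma_a):
    """Squared disagreement between a plane (unit normal n, n.x = d) and a
    patchlet with position p_pos and unit normal p_normal. The position
    error is divided by sigma_d (positional std along the normal) and the
    pointing error by sigma_a (angular std), so each term is measured in
    standard deviations of that patchlet's own uncertainty."""
    pos = (_dot(n, p_pos) - d) / sigma_d
    cos_a = max(-1.0, min(1.0, _dot(n, p_normal)))
    ang = math.acos(cos_a) / sigma_a
    return pos * pos + ang * ang

# A patchlet two position-sigmas off the plane, perfectly aligned with it:
r = weighted_residual((0.0, 0.0, 1.0), 0.0, (0.0, 0.0, 0.02),
                      (0.0, 0.0, 1.0), sigma_d=0.01, sigma_a=0.1)
```

Here r = 4.0: two standard deviations of position error and none of orientation. In a probabilistic weighting scheme such a normalised residual would feed a Gaussian likelihood, e.g. exp(-r/2).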
The differences between the position of the surface and the patchlet, as well as the difference between the normal pointing directions, are normalised by the confidence metrics. This normalisation weights each patchlet-surface relationship appropriately.

1.5 Thesis chapter organisation

Chapter 2 is a review of relevant previous work in 3D modeling and robot vision. Chapter 3 is a review of stereo vision. It discusses the kinds of structured errors from which correlation stereo often suffers, and presents a novel and successful filter for removing these errors. The other kind of error encountered in disparity images is the small error in disparity associated with camera calibration and the resolution of stereo correlation accuracy. We characterise this error and demonstrate how to propagate these errors into sensor accuracy measurements for the 3D points obtained from disparity pixels. Chapter 4 is an overview of probabilistic models and methods for finding best parameter estimates from large data sets. Chapter 5 describes the patchlet, its parametric representation, how to construct a set of patchlets from a disparity image, and how to propagate stereo sensor accuracy to confidence measures of patchlet parameters. Chapter 6 presents an illustrative example application of patchlets: extracting planar surfaces from outdoor scenes. It shows how a simple RANSAC-based approach can obtain fairly good surface estimates from the scene, and how these initial estimates can be refined into an optimal estimation using Expectation-Maximisation. Chapter 7 presents 14 image data sets, with the results for patchlet and planar surface extraction. Chapter 8 gives the conclusions of this work based on the results shown in Chapter 7.

Chapter 2

Literature Review

In this chapter we briefly review related areas of research that have laid the groundwork for this thesis. We divide this literature into four topics.
These are: robot mapping, camera motion, 3D modeling from point data, and range image segmentation. Descriptions of related work in correlation stereo algorithms are found in the Stereo Vision chapter. Although our goal is environment modeling from stereo images obtained from a mobile robot, this thesis does not make a contribution towards robot navigation. Robot navigation is the area of work where this research originated and is pertinent as it is linked to work in robot localisation and pose estimation. It is becoming evident that work in this area in recent years has effectively solved the mobile robot localisation problem for indoor environments. Camera ego-motion work, while not a global localisation solution, is creating the ability to register camera positions from a series of images taken from a moving camera, even in an outdoor environment. Automatic camera position registration over multiple viewpoints will greatly assist environment modeling work in the future. There is a tremendous amount of work done in 3D modeling and in interpreting and segmenting dense range images. This work covers topics such as meshing, implicit surfaces, oriented particles and segmentation of range scans into various higher-order structures ranging from planes to 3D objects. The primary difference between this thesis and the area of 3D modeling is in the sensor model and the fundamental data primitives. The vast majority of existing 3D modeling methods work with point data as the fundamental data element. Our work is concerned with extracting a different type of underlying sensed-data primitive, the patchlet, from the sensor data, and seeing how we can incorporate this into modeling approaches. Our hope is that we can tap into the point-based 3D modeling work in the future and adapt these methods in order to use the patchlet data element for more complex surface and object modeling tasks.
2.1 Robot navigation and mapping

This thesis was developed with the view of using it on a mobile robot equipped with stereo vision guidance. Our work with Spinoza and Jose, two robot systems that could autonomously navigate, explore and create occupancy grid maps based on stereo vision, was reported previously in [MJ97, ML98]. Occupancy grid maps were pioneered by Moravec and Elfes in the 1980s [ME85, Elf89] and were used by other successful robot navigation teams such as the University of Bonn [TFB98]. Occupancy grids are probably the most common form of environment modeling for robot navigation today. The maps are simple 2D floor-plan grids where each grid element stores a value that represents the estimated probability that the grid cell is occupied by or free from obstacles. Range sensor readings reduce the probability that an obstacle exists in the grid cells between the sensor and the sensed obstacle, as these grid cells must be empty to have allowed the sensor to "see through" them (this is also referred to as "space carving" [KS00]). The grid cells in the neighbourhood of the sensor-reported obstacle then have their occupancy probability values increased. This simple, elegant mapping approach was very successful, especially since it was a natural way to combine sensors of different types and readings from different viewpoints. Other voxel-based approaches include Torres-Mendez and Dudek's work with inferring occupancy based on sparse laser scans and single-camera video intensity images using Markov Random Fields [TMD03a, TMD03b]. The problem was that the utility of occupancy grid systems for long-term mapping projects was limited without the ability to localise the robot within an existing world coordinate frame. Without this ability, the estimate of the robot's position in the world coordinate system would accumulate incremental errors, and this drift would make merging data taken at different times ineffective.
This lack of robot localisation made it difficult to re-use maps, to share maps, and to maintain maps of environments over time. This localisation problem is important for environment modeling for the same reasons as it is for environment mapping. It is a common requirement when merging data from multiple viewpoints that the sensor positions be registered in a common coordinate frame. Robot localisation has been addressed for different kinds of robots, sensors and world representations [Cro89, SC86, CL85, CL90]. Leonard and Durrant-Whyte used active landmarks and tracked the position of their robot using Extended Kalman Filters in [LDW92]. Thrun, Fox and Burgard's work in robot mapping and localisation for occupancy grids and robots equipped with laser scanners [TFB98] has gone a long way towards solving the problem for realistic and complex environments. Their system operates very much like the occupancy grid model, except that rather than maintaining an occupancy probability, the grid cells maintain a "probability the robot is here" value. Vision-based localisation was also developed, as shown by Dudek with robot localisation based on recognised visual features in [SD99]. Although the above works were important for robot navigation, they did not provide positional information accurate enough for 3D modeling, for which the registration of camera positions must be highly accurate. Registered camera position work was being carried on by the vision community during the same time with the goal of extracting the ego-motion of a camera from a sequence of images. The problem of extracting camera motion through a scene is directly linked to the recovery of 3D positions of features within the scene (references to this are numerous, but some examples are Zhang [Zha93] and Zisserman, Beardsley et al. in Oxford [BTZ96]).
This work was often made more complex by the use of uncalibrated cameras, so that the methods would have to solve for the camera calibration as well as the ego-motion and the structure-from-motion problem. The problem is simplified if a calibrated stereo rig is used, as demonstrated by Stein and Shashua in [SS98]. More recently, the use of features that are stable over varying viewpoints has been gaining popularity, such as the SIFT (Scale Invariant Feature Transform) features developed by Lowe in [Low99]. These have been used towards robot localisation quite successfully in [SLL01, SLL03]. This development of robust camera localisation methods indicates that the camera position registration problem has been virtually solved, which is an important stepping stone towards improved environment modeling systems.

2.2 3D structure from scanning

This thesis develops methods for creating high-level surface representations from stereo range scans and consequently is related to research conducted in 3D modeling. A cursory review of some of the major areas of 3D modeling from point-based range scanners is given here. The scope of this review cannot do full justice to the vast quantity of literature available in the area of 3D modeling from range data. We considered many of these methods for environment modeling when we embarked on this research. In the end, we elected to use a different kind of data representation than the 3D point, on which the methods discussed below are predominantly based. However, it is good to recall that most of the researchers working on modeling algorithms do so with the understanding that their range data is very dense and highly accurate. Their goals are different in that they are often attempting to reconstruct CAD-quality models. Conversely, we acknowledge we are constructing uncertain models based on uncertain data, with the object of managing that uncertainty.
The work described here that is most closely related to our approach is the particle-based methods described in Section 2.2.3.

2.2.1 Triangulations

Triangulations, or meshes, are a common approach to the modeling problem. This is a modeling method where surfaces are constructed as a mesh of planar faces created by connecting common vertexes into triangular shapes. These meshes have been the basis of computer graphics models for many years. The challenges in creating these meshes from range scanner data are: how to make a triangulation compact while preserving high-frequency details, how to construct a triangulation that minimizes modeling artifacts and most accurately represents the underlying surfaces, and how to combine multiple range scans into a single model. A common technique is to create a triangulation that connects all points in the data set, which is typically not sufficiently compact, and then to decimate the mesh by merging redundant faces. The reverse approach, as pioneered by Fowler and Little [FL79] and further developed by Garland and Heckbert in [GH95], is to incrementally add faces until the surface represents the 3D points within a certain error tolerance. They use the 2 1/2 D constraint in this work, which is in many ways suited to image-based scanners that obtain only a 2 1/2 D data set per image scan. The 2 1/2 D constraint means that with respect to some plane, usually the XY plane, the 3D data is monotonic: for each position in the XY plane there can be only one Z value. This representation is also called a height-map. Stereo vision conforms to this 2 1/2 D constraint. This constraint is very useful in constructing meshes, as the triangulation can be done in the 2D image space. Two-dimensional triangulations are much simpler and less prone to geometric problems than three-dimensional ones.
Methods have been developed for combining meshes obtained from multiple scans that have been meshed with the 2 1/2 D constraint, such as Turk and Levoy's "mesh zippering" [TL94] and Roth's overlapping meshes [Rot99]. Conversely, work has been done on meshing collections of 3D points that are "unorganized", by which it is meant that there is no additional information known about them other than their positions; for example, the image structure information is not known. Good examples of this are the works by Hoppe et al. [HDD+92, EDD+95, Hop96]. The problem with these approaches is that they are based on the assumption that all the data presented to them is valid and without considerable positional error. When applying these methods to data obtained from a noisy sensor such as stereo vision, the resultant meshes are, in fact, dominated by the noise. That is, the salient points and much of the high-frequency elements are actually due to noise rather than signal. This behavior makes direct meshing approaches ill-suited for noisy data. As well, meshes do not lend themselves well to models that include explicit representations of uncertainty. Since a subset of the 3D point data is usually used as the vertexes in a mesh, adding uncertainty to these vertexes makes the geometry of the mesh potentially unstable.

2.2.2 Volumetric

Another popular and successful approach to building 3D models from range data is the use of volumetric approaches. Volumetric approaches are similar to occupancy grids in that the 3D space is divided into volume elements, or voxels, and the 3D information is represented by the occupancy of these voxels. Many methods use voxel-based approaches to organize and collect 3D points before approximating them with a mesh surface [PDH+97, Pul99, LPC+00].
A method for utilising voxel approaches to combine multiple range images and obtain greater accuracy than the scale of the tessellation of the voxel space was developed in the mid-1990s by Hilton et al. [HSIW96] and improved on by Curless and Levoy [CL96]. This method uses implicit surfaces, that is, surfaces that are not explicitly represented. The idea is similar to the occupancy grids mentioned above in that it uses a volumetric representation of space. It uses the space carving approach of occupancy grids to clear empty voxels in the 3D volume along the line-of-sight from a sensed point to the sensor origin. However, rather than marking the voxel in which the sensed point lands as occupied, it instead marks the nearby voxels with a signed distance in front of or behind the surface boundary. These distances are accumulated over multiple scans. The final surface is extracted into a mesh using a method such as Marching Cubes [LC87]. The surface position is interpolated between the volumetric distance-to-surface values of the voxels to find the zero-crossings in three space. Consequently the approach allows sub-voxel surface position estimates. This method works very well and is a keystone of 3D modeling today. It does have problems with noise and outliers due to the unforgiving nature of space carving. This has been somewhat overcome by modifications to the algorithm such as the Consensus Surfaces of Wheeler et al. [WSI98]. The main reason why this method is unsuitable for correlation stereo used over widely varying ranges is that it uses a point-based sensor representation, and it is not clear how to extend this method to use point confidence measures, point size, and, more importantly, point orientation. Because it is volumetric, it also favours bounded scenes and it can be memory intensive. It is sensitive to unconnected surfaces and to gross errors such as stereo mismatch errors.
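The signed-distance accumulation described above can be illustrated in one dimension. The sketch below is our own simplification of the idea (truncation, per-voxel averaging, and zero-crossing interpolation), not the published implementation:

```python
def integrate_reading(tsdf, weights, hit, voxel=1.0, trunc=2.0):
    """Fold one range reading (surface at position `hit`, sensor at 0)
    into a 1D voxel array of truncated signed distances: positive in
    front of the surface (carved free space), negative just behind it."""
    for i in range(len(tsdf)):
        d = hit - i * voxel          # signed distance of voxel i to surface
        if d < -trunc:
            continue                 # far behind the surface: not observed
        d = min(d, trunc)            # truncate large free-space distances
        tsdf[i] = (tsdf[i] * weights[i] + d) / (weights[i] + 1)
        weights[i] += 1

tsdf, weights = [0.0] * 8, [0] * 8
integrate_reading(tsdf, weights, hit=4.3)
# The zero-crossing between voxels 4 (+0.3) and 5 (-0.7) interpolates to a
# sub-voxel surface estimate:
i = next(k for k in range(7) if tsdf[k] >= 0 > tsdf[k + 1])
surface = i + tsdf[i] / (tsdf[i] - tsdf[i + 1])
```

Averaging the signed distances over many scans and interpolating the zero-crossing is what gives the method its sub-voxel accuracy; the unforgiving carving of every cell in front of a reading is also why gross outliers such as stereo mismatches are so damaging.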
2.2.3 Particle methods

The previous work most related to the patchlet data element originated with Szeliski and Tonnesen's oriented particles [ST92, STT93]. The oriented particle idea was to include orientation with point data that represents a 3D surface. Points in a local neighbourhood could interact and adjust their positions slightly to minimise certain cost functions, such as the thin-plate restriction on curvature. In this respect, oriented particles acted like a form of 3D active contours, such as the snakes of Kass, Witkin and Terzopoulos [KWT88]. Witkin and Heckbert, among others, have used oriented particles for surface representations in [WH94]. Oriented particles were at that time really a surface representation method completely separate from data or surface acquisition. Szeliski and Tonnesen presented methods for using them to smoothly interpolate surfaces from key points and to allow users to manually edit and manipulate computer-generated surfaces in a natural and intuitive way, but the oriented particles would be sampled from a priori surface information. Pascal Fua, in [Fua96], developed methods to combine stereo vision data from multiple viewpoints and to use this data in generating oriented particles very similar to Szeliski's. Fua's particles extended Szeliski's to include a size parameter similar to that of patchlets. However, although his particles were created from sensed data, it was done in a very different manner from patchlets. Fua collected all the stereo data from registered camera parameters into a dense point cloud. He then tessellated his space into voxels and selected the voxels that contained more than a threshold number of 3D points as being of interest. For these voxels, he took all the points within them and generated a best-fit plane in order to create the particle orientation. The particle size was based on the size of the voxel grid.
This was an important step forward: generating particles with size and orientation from sensor data. He relied on the fact that noise data would tend not to collect in large enough concentrations to flag a voxel as being of interest, and the fitting of a surface to the stereo data provided regularisation to smooth minor noise and slight errors in camera registration. He then allowed the particles to interact in a way similar to Szeliski's, in the manner of active contours. Fua's work is the most similar work we could find to patchlets. There are several differences, however. Fua generates his surfaces from data collected from multiple stereo images and binned into voxels. Patchlets are created from a single stereo image and are based on the constraints that allow correlation stereo to work. Fua used equally sized and spaced particles, whereas patchlets change size and spacing depending on the scale of the sensed data. And patchlets contain confidence metrics on the positional and orientation information that allow the differences in the accuracies of patchlets to be properly represented when used with probabilistic methods. In work inspired by Fua's particle work, Sara and Bajcsy [SB98a] developed their fish scales paradigm, which is also based on interpretation of stereo data. In their method, rather than fitting planes to the voxeled point data, they create a fuzzy set that represents the data. This fuzzy set has a position and a covariance matrix representing the variance of the distribution it represents in various directions. By considering the point data variance, they can attempt to classify a fuzzy set into a class of shape, such as a ball, a plate, or a line. Again, their approach does not use the constraints of stereo sensing and combines data from multiple viewpoints directly into the point cloud before analysis. Use of oriented particles with size parameters for surface representation has gained support recently in the computer graphics community.
Pfister et al. make a case for surfels, or surface elements, as rendering primitives, especially for highly complex shapes that would require a very large number of triangles to represent with traditional polygonal mesh methods [PZvBG00]. Surfels are generated by sampling surfaces of a given model with rays extending from 3 orthogonal viewpoints. Much work has been done with surfels to organize them in structures for fast access and rendering, using either layered depth cubes (LDCs) [LR98] or 3 layered depth images (LDIs) [SGHS98], each corresponding to the view from one of the three orthogonal viewpoints. Statistical analysis of such surface elements has been used for object recognition experiments [WHH03].

2.2.4 Surface modeling

There is a large body of work in the area of interpreting structure from range images that is outside of the scope of this review, but we direct the reader to the review done by Hoover et al. on this area [HJBJ+96]. The common approaches described are clustering, region growing, Hough transform, least-squares fitting and robust estimation approaches. It should be pointed out, however, that the work compared in this review was performed on data obtained from laser scanners and is exceptionally clean compared to correlation stereo range images. The issues that they are working to resolve in range image segmentation are details currently beyond the scope of surface segmentation with the noise levels of stereo vision. Regarding environment modeling from mobile robots, some impressive work has been done in this area by Martin and Thrun [MT02], in which they use a robot with a plane-based laser scanner to construct compact 3D planar models of their robot environment. This paper, in fact, contains many of the ideas we used in our surface segmentation approach with patchlets. They use probabilistic methods such as Expectation-Maximisation to cluster data into bounded planar surfaces.
The details of the implementation are, of course, quite different, as they are using a very different kind of sensor data and their noise is generated primarily from errors in the robot position rather than errors in the scanner data. In this work, Martin and Thrun make use of a "phantom" or "outlier" class to deal with noise outliers that do not fit any stable surface, and we have adapted this technique to our implementation. Details of this are given in Chapter 6. 2.2.5 Differential Geometry These methods are designed to infer a smooth surface from scattered point data by fitting local parametric surfaces to neighbourhoods of points. Peter Sander's Ph.D. dissertation [San88] proposed methods for inferring points along curves on a surface and parameterising the local surface properties at these points by the use of augmented Darboux frames. The parameterisation of the surface is position, orientation and surface curvature. This work made use of 3D edge operators [ZH81] for initialising the frames and then refining them via variational relaxation [San88]. Ferrie, Lagarde and Whaite extended this work by aggregating the resultant Darboux frames into continuous surface curves in [FLW89]. Chapter 3 Stereo vision 3.1 Overview of stereo vision Stereo vision is the process of extracting three-dimensional information from cameras that are physically offset. This is the same process that works in human visual systems to achieve depth perception. It is well understood how stereo vision relates to 3D geometry. By identifying pixel locations in two cameras that are known to correspond to the same 3D position, that 3D position can be extracted via triangulation. The primary task of stereo vision algorithms is to obtain these pixel matches. This is known as the correspondence problem. There are many stereo algorithms.
One approach is to find sparse features such as corner or interest points, lines, or recognisable structures in the two cameras and match them in order to find a sparse set of 3D points [ZF92, SB98b]. In the early 1990's, dense stereo vision gained increasing popularity due to the significantly larger amount of range data that it could provide [LG90, Fua93, OK93]. Most dense stereo algorithms obtain corresponding matches using correlation. This method takes a small image region, or mask, from the reference image and correlates it in the other images along a 1D curve defined by the epipolar line. The location along the curve with the best correlation to the reference image mask is selected as the correspondence point [Fua93]. Other methods attempt to obtain solutions to the correspondence that optimise the solution over more than one pixel at a time. The Dynamic Programming approaches to binocular stereo [BT98, GLY95] attempt to optimise the correspondence matching over an entire epipolar line at the same time. Graph-cuts techniques [KZ01] attempt to find a globally optimal solution for dense stereo correspondences. Phase-based methods for solving correlation are given in [FJJ91, FB96]. A complete overview of all dense stereo methods is out of the scope of this document, but we would direct the reader to the comprehensive work by Scharstein, Szeliski, and Zabih [SSZ01], an overview, summary and evaluation paper in which they present a taxonomy for comparing dense stereo algorithms. Figure 3.1: Illustration of trinocular stereo matching and disparity: (a) disparity image, (b) top image, (c) left image, (d) right image. The methods proposed in this thesis are for extracting and interpreting surface elements from stereo data and aggregating larger surfaces. These methods do not depend on a particular stereo correspondence algorithm, except that it must be a dense stereo method. We elected to use Sum of Absolute Differences (SAD) correlation stereo.
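To make the correlation step concrete, the following is a toy SAD matcher for a rectified binocular pair with the right image as reference. It is an illustrative sketch only; the function name, window size and search range are assumptions, and real implementations (such as the commercial library used in this work) add validation, sub-pixel interpolation and incremental computation for speed.

```python
import numpy as np

def sad_disparity(left, right, max_disp, half_win=2):
    """Toy SAD correlation stereo for rectified images (right = reference).

    For each reference pixel, slide a (2*half_win+1)^2 mask along the
    epipolar line (the same row) of the left image and keep the shift d
    with the smallest Sum of Absolute Differences.  The left-image match
    for reference column c is at column c + d.
    """
    rows, cols = right.shape
    disp = np.zeros((rows, cols), dtype=np.int32)
    w = half_win
    for r in range(w, rows - w):
        for c in range(w, cols - w):
            ref = right[r - w:r + w + 1, c - w:c + w + 1].astype(np.float64)
            best, best_d = np.inf, 0
            for d in range(0, max_disp + 1):
                if c + d + w >= cols:
                    break  # mask would fall off the left image
                cand = left[r - w:r + w + 1,
                            c + d - w:c + d + w + 1].astype(np.float64)
                score = np.abs(ref - cand).sum()
                if score < best:
                    best, best_d = score, d
            disp[r, c] = best_d
    return disp
```

On a synthetic pair where the left image is the right image shifted by a constant disparity, the interior of the returned map recovers that shift.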
This is the method most used for real-time stereo due to its low computational cost. Additionally, it is available in a commercial library that is supplied with our stereo vision camera rigs. For our experiments, we used a trinocular stereo camera. Figure 3.1 shows an example of the stereo matching and dense disparity results for this stereo rig. Additional information and data sets may be found at http://www.middlebury.edu/stereo. The right camera, which is the reference camera in this case, is shown in 3.1(d). The white box indicates a particular image region or "mask" and the white lines indicate the epipolar lines for the horizontal and vertical camera pairs. Images 3.1(b) and 3.1(c) show the top and left images. The filled white box indicates the image location of the mask in the right image. The white open box indicates the location of the corresponding mask in the left and top images. The difference between the positions of the filled and open boxes is the disparity value for this pixel. The disparity distance for the top-right pair matches the disparity distance for the right-left pair. Image 3.1(a) shows the dense disparity image. Higher disparities (the brighter pixels) represent positions closer to the camera. In this particular image the valid disparity values range from near 0 to 64. Full black pixels are ones where no match was found. These matches were rejected by the stereo software library. The disparity image is registered with the right image, pixel-by-pixel. Since the matching was done with trinocular stereo, the bottom portion of the disparity image could not obtain good matches due to the lack of overlap between the right and top images at close disparities. This is the reason for the noise on the bottom portion of the foreground person. 3.2 Stereo vision to 3D data Correlation stereo as described in the preceding section yields a disparity image as output.
The disparity value in each pixel of the image allows one to extract the matching pixel locations in all the cameras of the stereo rig. For example, for a binocular stereo vision camera with the right camera as the reference camera, a given row (r), column (c), disparity (d) value in the disparity image maps to the left image as:

[r, c]_left = [r, c + d]  (3.1)

Each pixel in a stereo disparity image corresponds to a 3D position. This 3D position is located at the intersection of the rays emerging from the cameras that are associated with the matched pixels. Determining the 3D position of this stereo data is fairly straightforward, but first we will provide the general projective camera equations on which it is based. The pinhole projective camera equations are based on the pinhole camera model [Fau93] as shown in Figure 3.2. In this camera model we use (u, v) to denote the position of a pixel in the image plane. These coordinates have their origin where the optical axis of the camera system intersects the image plane. Figure 3.2: Pin-hole camera model: geometric system relating 3D position (x, y, z) to pixel position (u, v). The optical centre is the origin of the camera coordinate system. In terms of row and column in an image, there is a row-column position, (r_0, c_0), that indicates the position where u = 0, v = 0. Given the arrangement of the camera coordinate axes in Figure 3.2, with increasing column corresponding to increasing x and similarly for row and y, the relationship between row (r), column (c), u, and v is

u = c - c_0,  v = r - r_0  (3.2)

Given a 3D position, its projected pixel location in the image plane will be

u = f x / z,  v = f y / z  (3.3)

where f is the focal length of the system. For a stereo rig such as we assume, with aligned coordinate systems offset along the X axis, with parallel epipolar lines, the z location for a disparity pixel can be calculated by

z = f B / d  (3.4)

and consequently

x = B u / d,  y = B v / d  (3.5)

Figure 3.3: Extracting a 3D point from stereo matching: The origin of the stereo system is the optical centre of the right camera (although it need not be in general). The two matching rays projecting from the cameras intersect at a 3D location. The rays are defined by a pixel location in each image plane. 3.3 Error analysis of stereo vision Stereo errors can be classified into two kinds of error. The first is the "mismatch" problem. This is where the stereo algorithm has failed to correctly solve the correspondence problem and the resultant disparity value is completely unrelated to the correct value. The second is the accuracy of a correct match. In this case, the correspondence problem has been solved successfully, but there is an error in the disparity value obtained. We will first address the mismatch problem and develop methods for detecting and removing these mistakes. We follow this with an analysis of the accuracy issue and illustrate how to determine the covariance of a 3D position extracted from a disparity pixel. 3.3.1 Mismatch errors Mismatches in stereo vision are a serious problem and greatly increase the amount of noise in a disparity image. Indoor scenes containing specular surfaces, repetitive patterns, and time-varying light sources can cause errors that are more or less uniformly distributed across the disparity range of the stereo system. These mismatches can be reduced by validation through comparing left-to-right and right-to-left best matches [Fua93] or by other validation methods such as thresholding the texture content on the support region [Res99]. They can also be reduced by increasing the number of cameras in a multi-baseline system [OK93]. However, even with a trinocular stereo system, these errors appear.
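For reference, Equations (3.2), (3.4) and (3.5) of Section 3.2 together give a direct mapping from a disparity pixel to a 3D point. A minimal sketch, in which the focal length f (pixels), baseline B (metres) and optical centre (r0, c0) are illustrative parameters rather than the calibration values of the thesis rig:

```python
import numpy as np

def disparity_to_3d(r, c, d, f, B, r0, c0):
    """Map a disparity pixel (row r, column c, disparity d) to a 3D point:
    u = c - c0, v = r - r0 (Eq. 3.2), z = f*B/d (Eq. 3.4),
    x = B*u/d, y = B*v/d (Eq. 3.5)."""
    u = c - c0
    v = r - r0
    return np.array([B * u / d, B * v / d, f * B / d])
```

For example, with f = 500 pixels, B = 0.1 m and a disparity of 10, a pixel 40 columns right of the optical centre maps to z = 5 m and x = 0.4 m.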
Since the value of the disparity pixel is unrelated to the true value, these errors usually appear as "spikes" in the reconstructed 3D surface: spikes towards the camera if the disparity value is too high, or away from the camera if the disparity value is too low. To overcome the problem of spikes we tried common image filtering techniques such as median filtering or morphological filtering. These filters can improve the results, but not as significantly as we would like. These techniques fail because they are designed to remove or reduce noise from inputs that have unstructured or evenly distributed noise, i.e., noise that appears randomly and consequently will appear most often as single pixels in the image. In short, "white noise". Mismatch errors in disparity images do not have white noise properties. They occur when two similar image features are in proximity to each other. The algorithm then matches one of the features in one image to the wrong feature in the other image. A classic example of this problem is the so-called "picket fence" problem. When looking at a picket fence with binocular stereo, there will be regularly spaced, nearly identical image patches that can be mismatched. When these errors occur, they often occur over the entire region occupied by the feature. Thus, the errors do not appear in isolated pixels but in a dense, connected region. These coherent errors confound the above filtering approaches, as they appear as a stable signal, one to be preserved instead of rejected. We developed an approach using surface connectivity to overcome the problem of noise rejection for coherent errors.
To remove spikes caused by feature mismatches, we took into account the attributes of these errors: • they are locally stable • they are not large • they have no support from surrounding surfaces (since they are spikes) This kind of surface has a sharp disparity discontinuity at all borders, whereas true surfaces in the stereo image are not only locally consistent, but globally part of a larger 3D surface. By segmenting the image into continuous disparity surfaces, we can establish a good hypothesis, based on the size of the surface, whether it is a real 3D surface or a noise artifact. To segment the image into surfaces of continuous disparity we apply the following logic:

l_i = L given ∃ j ∈ N(i) with l_j = L and |d_j - d_i| < 1

where i is any given pixel, L is a surface label, N(i) is a neighbourhood of pixels around i and d_i is the disparity value at location i. Entire surfaces are invalidated from the disparity image if the number of pixels that have a given label does not pass a threshold. This approach has two significant benefits: it can reject cohesive spikes that may fool noise rejection filters, and it can preserve thin structures that are part of a coherent structure. Figure 3.4: An example scene. Two examples of the effectiveness of this approach are shown in the series of Figures 3.4, 3.5, 3.6, and 3.7 and Figure 3.8. The filtering method was originally devised to improve robot mapping using stereo vision [ML98]. This was very important in robot mapping, as the method calls for projecting all 3D data onto the ground plane on which the robot is moving. This has the effect of magnifying the noise of a stereo image, as much of the data that is closest to the robot is noise. Figure 3.4 shows a typical stereo scene from our mobile robot operating in an office environment. Figure 3.5(a) shows the disparity image obtained from this stereo image, and Figure 3.5(b) shows the top-down robot map for this view.
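The labelling rule above can be sketched as a flood fill over the disparity image. This is an illustrative re-implementation, not the thesis code: it uses 4-connectivity, integer disparities with a tolerance of one disparity level (the thesis applies |d_j - d_i| < 1 to sub-pixel disparities), and an assumed region-size threshold.

```python
from collections import deque
import numpy as np

def remove_spikes(disp, min_region=50, invalid=0):
    """Segment a disparity image into surfaces of continuous disparity and
    invalidate segments smaller than min_region pixels (the "spikes")."""
    rows, cols = disp.shape
    labels = -np.ones((rows, cols), dtype=np.int32)
    out = disp.copy()
    next_label = 0
    for sr in range(rows):
        for sc in range(cols):
            if disp[sr, sc] == invalid or labels[sr, sc] >= 0:
                continue
            # flood-fill one continuous-disparity surface
            region = [(sr, sc)]
            labels[sr, sc] = next_label
            queue = deque(region)
            while queue:
                r, c = queue.popleft()
                for nr, nc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and labels[nr, nc] < 0
                            and disp[nr, nc] != invalid
                            and abs(int(disp[nr, nc]) - int(disp[r, c])) <= 1):
                        labels[nr, nc] = next_label
                        region.append((nr, nc))
                        queue.append((nr, nc))
            if len(region) < min_region:
                for r, c in region:
                    out[r, c] = invalid  # reject the whole small surface
            next_label += 1
    return out
```

Because the filter reasons about whole connected surfaces, a four-pixel spike is removed even though a median filter would see it as a stable block, while a large thin structure at consistent disparity survives.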
In Figure 3.5(b), black indicates detected obstacles, white indicates clear space and grey indicates unsensed regions. The process of creating this top-down view is to project all 3D points sensed in the stereo map onto the 2D floor-plane. In implementation, this means taking the closest sensed point in each column of the disparity image as the dominant point in that column. This algorithm favors the inclusion of noise, as any spike towards the camera will become the closest point and will be included in the 2D floor map, making the algorithm discount the valid data in that column. Figure 3.5: The disparity image and corresponding top-down occupancy grid map result from the scene in Figure 3.4. Figure 3.7: The disparity image filtered with our surface continuity-based "spike removal" filter and corresponding top-down occupancy grid map result from the scene in Figure 3.4. Removed pixels are marked in black. This effect can be seen clearly in Figure 3.5(b). The sensed points (in black) are mostly very close to the robot (which is at the centre of the bottom edge of the map). Figure 3.6 shows the disparity and top-down map result after the disparity image has been morphologically filtered with an erosion filter. The idea was that erosion would eliminate small regions and leave larger regions relatively untouched. We can see that Figure 3.6(b) is an improvement over Figure 3.5(b), but there are still noise issues occurring. The disparity image after our spike removal filter is applied is shown in Figure 3.7(a) with the filtered-out regions marked in black. The resultant top-down view is markedly cleaner. The left-hand side of the map, which is closer to the robot, matches the computer desk in the left-hand side of Figure 3.4. The more distant right-hand side matches the counter/cabinet that is further back. Figure 3.8 shows a close-up side-view of the point cloud generated from the face of the person in Figure 3.1.
Figure 3.8(a) shows the unfiltered point data while Figure 3.8(b) shows the filtered. The result is a dramatic improvement. This point cloud is made from the same stereo data set used in Figures 1.7, 1.2 and 3.1 earlier. Figure 3.8: Spike removal: effect on 3D point clouds. (a) shows a side view of the point cloud generated by the disparity image given in Figure 3.1 without spike removal validation, (b) shows the same disparity image after spike removal has been applied. 3.3.2 Sub-pixel interpolation bias The other kind of errors that occur in stereo vision are errors in the numerical accuracy of the disparity correspondence. Larry Matthies has probably done the most rigorous investigations into errors of triangulation and stereo vision. Among other papers, in [MS87] he showed how to obtain 3D covariance matrices for feature-based stereo. With Grandjean [MG94] he provided a rigorous investigation of the statistical distributions of errors in sub-pixel disparity values of correlation stereo. They found the disparity estimates to have Gaussian distributed random errors with standard deviations that were consistent over different resolutions of images. In recent work with Yalin Xiong [XM97], Matthies investigated the sources of sub-pixel disparity errors. Sub-pixel interpolation is the method of determining the disparity value of a pixel to sub-pixel accuracy and varies with the correlation algorithm. It is very important in obtaining the best depth estimates possible with stereo, as the quantisation of integer disparity values is very limiting in depth resolution. Xiong and Matthies discovered that, among other sources, there was a systematic bias in their sub-pixel interpolation results, which were based on the Sum of Squared Differences (SSD) algorithm. Since then, Shimizu and Okutomi have also investigated sub-pixel interpolation biases in [SO01]. Figure 3.9: Sub-pixel interpolation bias.
It is very interesting that, although these two groups of researchers have pursued this problem from very different avenues (Xiong and Matthies relied heavily on strong empirical evidence, while Shimizu and Okutomi based their work on a theoretical analysis of a variety of correlation algorithms), their discoveries were almost identical and are shown in Figure 3.9. The dashed line in Figure 3.9 shows this interpolation bias. This curve indicates the bias in sub-pixel estimation (the vertical axis) given a true disparity value (the horizontal axis). The shape of the curve and the magnitude of its peaks change depending on the correlation algorithm and the actual image content. Shimizu and Okutomi found the magnitude of the peaks to be generally below 0.1 and often as low as 0.01 pixels. The curve is symmetric about d = 0 and its two halves are "almost" symmetric about d = ±0.25. They found this bias to apply equally to SSD stereo and SAD (Sum of Absolute Differences) stereo.
Shimizu's cancellation function is implemented by taking the shifted-pair disparity value, subtracting 0.5 from it to bring it to the same disparity as the nominal value (except for bias), and then averaging the two results. Since the two biases will be opposite and equal, they will cancel out. The correction comes at the cost of having to process the stereo disparity values twice, once for the reference images and once for the shifted images. With the bias present, sub-pixel values vary systematically along a sensed surface. The point estimates will "roll" in a wave shape, with the wavelength of this shape being the distance between points at integral disparity values. For scenes where the distances to surfaces are large (over 2 m), this sub-pixel bias correction makes a dramatic improvement in the accuracy of the results. This accuracy for our equipment is determined experimentally in the results chapter in Section 7.2. 3.3.3 Stereo accuracy Analysis of stereo errors in the past was based on feature-based stereo, especially for landmark acquisition and tracking for the purposes of robot navigation, such as the work of Shafer and Matthies [MS87]. Since they were for feature-based stereo, the position of each feature was determined independently in each image. This led to the logical conclusion that the errors in the feature position within each image were independent of each other. Consequently, the covariance of the 3D point was determined through identical and balanced pixel errors, as shown in Figure 3.10(a). This resulted in covariance matrices that are aligned with the stereo camera coordinate system. Our treatment of the covariances associated with a 3D point obtained from stereo matching is different from those feature-based approaches.
The difference is that we do not treat the left and right images equally. Figure 3.10: Stereo triangulation error characteristics: (a) shows the typical aligned covariance matrix determined through equal weighting of errors in the left and right cameras, (b) shows the covariance matrix we obtain through use of a separate pointing and matching accuracy. The basis for this difference is that the accuracy is not based on two similar measurements. Instead, it is based on the position of the pixel in the reference camera, and the accuracy of the disparity correspondence between the reference and other cameras. In correlation stereo, one camera is the reference camera. This is the camera that the disparity image is registered with, pixel-to-pixel. The other camera images are matched against this reference camera image. The accuracy of the location of a pixel in the reference camera is determined by the accuracy of the camera calibration. The accuracy of the disparity is determined by the accuracy of the disparity score. Figure 3.10(b) illustrates how we do not treat these two accuracies equally. The accuracy of the reference pixel we call the pointing error, since this is the accuracy of the primary ray along which the 3D point should lie. The accuracy of the disparity value we call the matching error. This determines how far along the reference pixel ray the 3D point is. The pointing error is, of course, two dimensional; it varies in both row and column. It is the result of calibration and will vary from pixel to pixel. It is a constant error, but unfortunately it is not possible to measure the value exactly for each pixel. Camera calibration is determined via bundle adjustment, where the camera intrinsic parameters (usually a small number, as low as six) are adjusted to minimise the observed errors in a large number of calibration target feature positions.
This is an over-constrained problem and typically the errors are minimised in a least-squares sense. After the bundle adjustment is complete, one can evaluate its accuracy by taking a second set of calibration data and evaluating the residual error of the corrected feature positions against their ideal positions. From this residual error one can develop the mean and variation of the error of the pixel locations, but since the exact errors are associated with a very sparse set of features, we simply take the variation of the error and apply it to all pixels in the calibrated image equally. There is no knowledge from this method about the directions of errors in different regions of the image, so we simply make the errors unbiased and apply the variance equally in both the u and v axes. Consequently, since there are a large number of pixels in the reference image and we have only this average error statistic, we assume the independence of the pointing error in row and column. The equation that determines the 3D position of a (u, v, d) pixel is Equation (3.3). The pointing error translates to error in u and v while the matching error represents uncertainty in d. Thus, the covariance of the parameter set (u, v, d) is:

Λ_u = [ p 0 0 ; 0 p 0 ; 0 0 m ]

where p is the pointing error and m is the matching error. (u, v, d) space is not a convenient space to work in. In practice we want to have all points in a common 3D coordinate system. We want a covariance matrix Λ_x that describes the covariance of the 3D point in (x, y, z), and what we have is a covariance matrix Λ_u in (u, v, d). Following the methods in Faugeras [Fau93], we propagate Λ_u to Λ_x by using the Jacobian of (x, y, z) with respect to (u, v, d).
From Equations (3.5) and (3.4) the Jacobian is:

J = [ ∂x/∂u  ∂x/∂v  ∂x/∂d ]   [ B/d   0    -uB/d² ]
    [ ∂y/∂u  ∂y/∂v  ∂y/∂d ] = [ 0    B/d   -vB/d² ]   (3.7)
    [ ∂z/∂u  ∂z/∂v  ∂z/∂d ]   [ 0     0    -fB/d² ]

Propagating the covariance matrix is simply

Λ_x = J Λ_u Jᵀ  (3.8)

3.4 Uncertain points Given the above analysis of the accuracies of correlation stereo and the equations regarding its geometry from Section 3.2, we can determine for each valid disparity pixel in a disparity image a 3D position and a 3 x 3 covariance matrix that represents the accuracy of the 3D point. We will refer to this 3D point - covariance matrix pair as an uncertain point, since its exact position is uncertain. The disparity pixel is defined by three parameters, its position (row, column) in the image and its disparity: (r, c, d). The uncertain point is defined by a 3D position, X = [x y z]ᵀ (which is extracted from (r, c, d) by Equations (3.2), (3.4) and (3.5)), and a 3 x 3 covariance matrix Λ_x which is determined by Equation (3.8). The covariance matrices of uncertain points are not constrained to be aligned in any way. The major and minor axes of the covariance matrices for different pixels are not aligned in general. Data with unrelated covariances such as this are referred to as heteroscedastic data (as opposed to data with aligned covariances, which is homoscedastic). Solving problems with heteroscedastic data is also referred to as the errors-in-variables problem. Heteroscedasticity can affect computer vision problems in different ways. Peter Meer's group at Rutgers has developed some general methods to approach these problems in [BM01, MM00]. For our situation, however, the important effect of heteroscedastic data is that we will not be able to obtain a closed-form solution to any model-fitting that we may wish to do on the data. Heteroscedasticity means that we will have to use an iterative minimisation solution for any fitting problem such as a least-squares plane fit to a set of uncertain points.
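The covariance propagation of Equations (3.7) and (3.8) can be sketched directly. The function and the numeric parameters below are illustrative; p and m stand for the pointing and matching variances on the diagonal of Λ_u:

```python
import numpy as np

def point_covariance(u, v, d, f, B, p, m):
    """Propagate the (u, v, d) covariance Λ_u = diag(p, p, m) to the
    3D (x, y, z) covariance via Λ_x = J Λ_u Jᵀ, with J the Jacobian of
    x = Bu/d, y = Bv/d, z = fB/d with respect to (u, v, d)."""
    J = (B / d) * np.array([
        [1.0, 0.0, -u / d],
        [0.0, 1.0, -v / d],
        [0.0, 0.0, -f / d],
    ])
    Lu = np.diag([p, p, m])
    return J @ Lu @ J.T
```

Note that for a pixel off the optical axis (u or v nonzero), Λ_x acquires off-diagonal terms, which is exactly the unaligned, heteroscedastic behaviour described above.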
For homoscedastic data, a least-squares planar fit could be achieved in closed form via methods such as use of the pseudo-inverse or principal component analysis of the moments of inertia. Chapter 4 Model estimation 4.1 Overview As discussed in Section 1.4.1, probabilistic methods are important when working with noisy real-world data and models that include uncertainty. In this thesis, for each piece of information that we work with, be it 3D points, patchlets, or surfaces, we want to incorporate a statistical or probabilistic understanding. As well, when clustering surface elements into surfaces, we need to make use of Gaussian mixture models, a representation that encloses multiple sub-models. This chapter describes the mathematical concepts for representing data as probability distributions and the estimation of the parameters of those distributions. This includes a description of probability density functions for modeling distributions, of mixture models, of maximum likelihood estimation and of mixture model parameter estimation through the use of Expectation-Maximisation techniques. 4.2 Probability distribution models Both in creating patchlets (Chapter 5) and in aggregating patchlets into larger surface descriptions (Chapter 6), we make use of probabilistic methods. These require the use of probability models or probability density functions (PDFs). These PDFs are described in this section. A PDF represents a probabilistic distribution. The PDF describes the probability that a sample has a particular value. PDFs can have any form, so long as the integral of the PDF over the range of the random variable sums to one. PDFs are good tools for modeling noisy sensor data, and have been exploited for this purpose to a great extent. In our investigations we have considered various PDFs for modeling different parts of our problem. The two distributions for ℝⁿ that we considered are the Gaussian distribution and the Lorentzian distribution.
In addition to Euclidean ℝⁿ spaces, we have to deal with representing and clustering normals of 3D surfaces. Normals are orientations that can be represented by Euler angles or unit vectors. Euclidean spaces do not work well with this kind of data, as it is circular in nature. For this data we use the Fisher distribution, a distribution that is specifically designed for circular spaces. We provide further discussion of these three distribution models in the following sections. 4.2.1 Gaussian probability distribution If one has underlying knowledge of the sensor model, one can use a particular PDF that represents the sensor model best. In general, however, when one does not have an explicit sensor model representation, the Gaussian probability distribution is typically used. This is due to a proof called the central limit theorem [Kal97], which shows that a set of variates with various random perturbations affecting them will tend towards a Gaussian distribution. This leads to the conclusion that, if one has no a priori knowledge of the error processes for a sensor, the Gaussian is the best assumption. The Gaussian distribution is defined by its mean and its covariance. The probability density function is

p(x | μ, Λ) = (1 / ((2π)^(d/2) |Λ|^(1/2))) exp(-(1/2) (x - μ)ᵀ Λ⁻¹ (x - μ))  (4.1)

where μ is the model mean, Λ is the covariance matrix, x is a given sampled data point, and d is the number of dimensions of the space. 4.2.2 Lorentzian probability distribution The Lorentzian (or Cauchy) distribution [PTVF92] is a heavy-tailed distribution. It is termed heavy-tailed because the tails of the distribution are significantly larger than the Gaussian's. By tails, we mean the portion of the distribution that is outside the main mode of the distribution and stretches out towards infinity, approaching 0. For the Gaussian this is approximately the region farther than three standard deviations from the mean.
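The heavy-tail contrast can be made concrete numerically. Below is a small sketch comparing the one-dimensional Gaussian of Equation (4.1) with the Lorentzian density of Equation (4.2) in the next section; the parameter choices (σ = 1, Γ = 2) are illustrative only:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1-D Gaussian density (Equation 4.1 with d = 1)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def lorentzian_pdf(x, mu, gamma):
    """Lorentzian (Cauchy) density; gamma is the full width of the
    distribution at one-half of its maximum height."""
    half = gamma / 2.0
    return (1.0 / math.pi) * half / ((x - mu) ** 2 + half ** 2)
```

Evaluating both at five standard deviations from the mean shows the Lorentzian density several orders of magnitude above the Gaussian, which is precisely why M-estimators built on it tolerate outliers.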
The equation for a Lorentzian distribution is

p(x | μ, Γ) = (1/π) (Γ/2) / ((x - μ)² + (Γ/2)²)  (4.2)

The Γ term describes the width of the distribution at one-half of the maximum height of the distribution. This is illustrated in Figure 4.1. Figure 4.2 shows the difference in shape of the Lorentzian compared to the Gaussian, and the heavy tails are clearly visible. Figure 4.1: The Lorentzian distribution - the parameters that control the function are the mean μ and the spread parameter Γ. Γ is the width of the distribution at one-half of the maximum height. Figure 4.2: A comparison of the Gaussian and Lorentzian distributions. Note the heavy tails of the Lorentzian. Heavy-tailed distributions like the Lorentzian are used in M-estimators or "robust estimation". These techniques are less sensitive to outliers when extracting model parameters from sensor data. It is this outlier insensitivity that gives rise to their description as "robust". See Section 4.4.1 for a description of how the Lorentzian is used in Maximum Likelihood Estimation. 4.2.3 Fisher probability distribution Figure 4.3: The Fisher distribution. Gaussian and Lorentzian distributions are defined in ℝⁿ space. Unfortunately, much of the data we need to handle are surface normals. Surface normals are unit vectors. The space of these vectors is circular or spherical in nature. The simplest illustration of this is an angle in 2D space. As the angle increases it will "wrap around" and intersect its previous positions. An angle α is equivalent to α + 2πn where n is any whole number. Higher dimensional orientations have similar "wrap around" characteristics. The "space" of normal vectors is the surface of the unit sphere. There are a variety of ways to represent data on the unit sphere. The most common are Euler angles or unit vectors. If the ℝⁿ
space has n dimensions, the unit vector has n − 1 degrees of freedom, as its length is constrained to be 1. For 3D surfaces, the normals are unit vectors in n = 3 space and so have a dimensionality of 2. At times we represent normals as a 3D unit vector, at others we use 2 Euler angles, depending upon the circumstances. Distributions on the unit sphere do not map well to Gaussian models since the sphere does not have unbounded extent. This means that the sum under the curve of a Gaussian distribution on a circular space will be less than 1, and that it will vary for different covariances. It also causes trouble for parameter estimation because the direction of the shortest distance between two points may reverse as one of them is modified. Luckily, the statistics of circular distributions have already been developed, although most of us are not as familiar with them as with linear space models. There are several different models for spherical data. The simplest, and also the best match to the Gaussian model for spherical spaces, is the Fisher model [Fis53]. The PDF of the Fisher model [Fis87] is

P(x|μ, κ) = κ e^{κ x^T μ} / (4π sinh(κ))   (4.3)

In this equation, x is a vector data point, μ is the mean vector of the distribution and κ is the concentration parameter. Both x and μ, as directional vectors, are 3 dimensional variables. The concentration parameter κ is a scalar. The higher the concentration parameter, the more clustered the distribution is. In this respect, one can think of the concentration parameter as an inverse of the spread of the distribution. For circular values close to the mean, κ is equivalent to κ = 1/σ^2 of the equivalent Gaussian distribution (see Equation (5.4) in Section 5.3.4). The κ/(4π sinh(κ)) factor is necessary for normalising the function. Note that in some references x^T μ (which is the dot product of the vectors x and μ) is also written as cos(ψ), where ψ is the angular difference between x and μ. These two representations are identical.
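The density of Equation (4.3) can be checked numerically; the sketch below is our own illustration (names are ours) and verifies that the normalising constant makes the density integrate to 1 over the unit sphere.

```python
import numpy as np

def fisher_pdf(x, mu, kappa):
    """Fisher density of Equation (4.3) for unit vectors x and mu."""
    return kappa * np.exp(kappa * np.dot(x, mu)) / (4.0 * np.pi * np.sinh(kappa))

# Sanity check: integrate the density over the unit sphere. By symmetry
# about mu, integrate over the angle psi between x and mu, with the area
# element 2*pi*sin(psi) d(psi).
kappa = 3.0
psi = np.linspace(0.0, np.pi, 20001)
dens = kappa * np.exp(kappa * np.cos(psi)) / (4.0 * np.pi * np.sinh(kappa))
total = float(np.sum(dens * 2.0 * np.pi * np.sin(psi)) * (psi[1] - psi[0]))
# total is approximately 1.0
```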
The Fisher model is sometimes referred to as the Von Mises-Fisher model, as it is an extension of the Von Mises distribution, a circular distribution on the unit circle. Alternatively, we can express the exponential term in Equation (4.3) in terms of spherical polar coordinates (see Figure 4.4). In this case, if x ≡ (θ_x, φ_x) and μ ≡ (θ_μ, φ_μ), the exponential term becomes

x^T μ = cos ψ = sin θ_x sin θ_μ cos(φ_x − φ_μ) + cos θ_x cos θ_μ   (4.4)

Figure 4.4: Polar coordinates for points on a sphere

4.3 Mixture models

Often a system is too complex to be represented by a single probability density function like those presented previously. For example, a typical PDF such as the Gaussian will not be able to accurately model a multi-modal distribution. In this case, we may use a mixture model. A mixture model is a set of sub-models, each with an associated model probability. Mixture models are a good framework in which to estimate the parameters of multiple models or clusters of data where one knows that the underlying distribution is likely multi-modal. We will use mixture models to represent the plurality of surfaces that are sensed by the stereo sensor, and handle the patchlet-to-surface data association problem in the context of a mixture model. In the previous sections we have discussed various PDFs that give a representation for P(x|θ), where θ is the vector of model parameters. In the mixture model case, θ is a set of M models, each with its own parameters θ_j. Additionally there is the probability that the jth model gave rise to x. This is represented as π_j. Clearly, since one and only one model gave rise to x, ∑_{j=1}^M π_j = 1. And,

P(x|θ) = ∑_{j=1}^M π_j P(x|θ_j)   (4.5)

4.4 Maximum likelihood estimation

In order to estimate model parameters from sensor data, one can maximise the probability of the model parameters given the data set, P(θ|X).
Other ways to estimate model parameters, such as error minimisation methods using least-squares or χ² metrics, can still be interpreted as probability maximisation methods for particular probability models (as we will see shortly). Using Bayes' rule, we know that

P(θ|X) = P(X|θ) P(θ) / P(X)   (4.6)

The denominator P(X) is a constant and does not affect the location of the maximum, and so can be dropped. So,

argmax_θ P(θ|X) = argmax_θ P(X|θ) P(θ)   (4.7)

This is the maximum a posteriori or MAP solution. If we assume uniform priors, the P(θ) term can be removed and the expression becomes

argmax_θ P(θ|X) = argmax_θ P(X|θ)   (4.8)

which is known as the maximum likelihood estimate or MLE. We approach the parameter estimation problem from a maximum likelihood formulation with the knowledge that we can follow the same path with a MAP formulation given a judicious choice of priors. In solving the maximum likelihood problem, we make use of the log likelihood; that is, argmax P = argmax log P. So, assuming the data samples are independent of each other, we can say

argmax_θ P(θ|X) = argmax_θ log P(X|θ)   (4.9)

P(X|θ) = ∏_{i=1}^N P(x_i|θ)   (4.10)

log P(X|θ) = ∑_{i=1}^N log P(x_i|θ)   (4.11)

So

argmax_θ P(θ|X) = argmax_θ ∑_{i=1}^N log P(x_i|θ)   (4.12)

In the case of a Gaussian model,

log P(x_i|θ) = −(1/2)(x_i − μ)^T Λ^{−1} (x_i − μ) − log((2π)^{d/2} |Λ|^{1/2})   (4.13)

If we substitute Equation (4.13) into Equation (4.12) and determine the best mean (μ) that satisfies this optimisation, we get

argmax_θ log P(X|θ) = argmin_μ ∑_{i=1}^N (x_i − μ)^T (x_i − μ)   (4.14)

This is our old friend, the least-squares solution. In fact, the least-squares estimate is the MLE estimate of the mean for a Gaussian model.

4.4.1 Robust estimation

When a heavy-tailed distribution like the Lorentzian is used for MLE, the approach is called robust estimation [PTVF92].
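Before turning to the robust case, the Gaussian result of Equation (4.14) can be checked numerically: the minimiser of the least-squares sum is the sample mean. This is our own sketch, not thesis code.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=0.5, size=10_000)

# Equation (4.14): minimise sum_i (x_i - mu)^2 over mu. Setting the
# derivative -2 * sum_i (x_i - mu) to zero gives mu = mean(x).
mu_mle = data.mean()

# Brute-force confirmation on a grid of candidate means.
candidates = np.linspace(4.0, 6.0, 201)
sse = np.array([np.sum((data - c) ** 2) for c in candidates])
mu_grid = candidates[np.argmin(sse)]
```

The grid minimiser agrees with the closed-form sample mean up to the grid spacing, confirming that the Gaussian MLE of the mean is the least-squares estimate.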
For the Lorentzian, Equations (4.12) and (4.13) change to

log P(X|θ) = ∑_{i=1}^N log P(x_i|θ) = ∑_{i=1}^N [ log(Γ/(2π)) − log((x_i − μ)^2 + (Γ/2)^2) ]   (4.15)

Similarly, holding Γ constant and solving for μ, the solution simplifies to

argmax_θ log P(X|θ) = argmin_μ ∑_{i=1}^N log((x_i − μ)^2 + K)   (4.16)

where K is a constant based on Γ. One can see that any single outlier data point x has a much stronger effect on the Gaussian-based solution than on the Lorentzian-based solution. Let us reparameterise x − μ to z. In Equation (4.14), the contribution of this one point to the sum that is to be minimised is z^2. If we move this point an incremental distance away from the mean, the residual error for this point changes by

d(z^2)/dz = 2z   (4.17)

This shows that a point's effect on the solution grows linearly with its distance from the mean. Therefore outliers, which by definition are far from the mean, exert a strong influence; this is what makes the least-squares approach so sensitive to outliers. With the Lorentzian, we have the following derivative

d(log(z^2 + K))/dz = 2z / (z^2 + K)   (4.18)

Here, the effect of the point with increasing z actually asymptotically approaches 0, as in the limit this expression approaches 2/z. Therefore, an outlier that is a considerable distance from the mean has a relatively constant effect on the solution. If the mean moves slightly in one direction or the other, it makes relatively little difference to the effect of the outlier on the solution. This makes Lorentzian-based MLE solutions relatively insensitive to outliers.

4.5 Expectation-Maximisation

There are many articles and books that discuss Expectation-Maximisation (EM) for various problems. The two references we relied on for EM were Minka's description of EM as a lower-bound method [Min98] and D'Souza's technical note on implementing EM for Gaussian mixture model parameter estimation [D'S99].
Minka's work provides good background theory, while D'Souza's tech note is an excellent "how-to" paper that includes all the details of deriving the 'E' and 'M' update equations. EM is an iterative technique used to maximise a probability-based function. It is a local technique and can be caught in local minima; however, for some classes of problem it can provide faster convergence than other local methods such as gradient descent or Newton's method. EM has the advantage of being well-suited to estimating the parameters of mixture models (more about this below). Another advantage of EM is its ability to work with problems that explicitly represent hidden or nuisance parameters, and to marginalise these parameters in order to find the optimal solution. When formulating a probability maximisation problem, there are often parameters that, when explicitly represented, make the problem more tractable, but for which one does not require an exact solution. An example of this is the data association parameters in a mixture model. Knowing the association of sensed data to sub-models simplifies the problem from a mixture model to a set of single-model parameter estimation problems. However, the goal is to estimate the parameters of each of the sub-models of the mixture; if we can do this without solving the data association problem, we will still achieve our objective. Another example is when data is being fused from multiple sources with unknown error characteristics. In this case, knowing the error characteristic of each data source allows unbiased weighting of the data to improve any combination of it. However, the knowledge of each data source's error characteristic is only a means to achieve the goal of correctly fusing the data. In EM, these hidden parameters are included in the problem formulation, but the method marginalises them by solving for their expected values.
By using the statistical expectancy of these hidden parameters, we can find a most probable solution to our objectives without committing to a definite solution for those hidden parameters. Determining the expected values of the hidden parameters is the 'E', or 'Expectation', step. The 'M', or 'Maximisation', step is the maximisation of the probability function while holding the expected values of the hidden parameters fixed.

4.6 Application of EM to mixture models

For a mixture model, the hidden parameters over which we are marginalising are the data association parameters. These are the parameters that identify which sub-model gave rise to a particular data point. During the 'E' step, we determine the expected value of the data association variable. This is equivalent to determining the probability that a sensed data point came from a particular sub-model. In the 'M' step, we estimate each of the sub-model parameters based on the data, weighted by the data association probability.

Figure 4.5: Graphical model for data obtained from a mixture model

To demonstrate the use of EM for mixture model parameter estimation with application to different probability models (Gaussian, Lorentzian, Fisher), we briefly present some simple 1D examples. For these examples, we followed the problem model presented by D'Souza in [D'S99]. We modified his derivation of the EM update equations to fit our particular models. Figure 4.5 shows a graphical model of the relationship between data from a mixture model (as described in Section 4.3) and the parameters of those models. There are M models, each with a set of parameters θ_j; there are N sensed data points X = {x_1 … x_N}. Each data point x_i has an associated z_i parameter that indicates which sub-model gave rise to it: Z = {z_1 … z_N}, z_i ∈ {1 … M}. z_i is the data association parameter. It is a nuisance parameter, in that we do not need a solution for it; our goal is to estimate the model parameters.
However, it is necessary to estimate the data association in order to estimate the parameters we are interested in. π is the vector of the sub-model probabilities. The probability we are trying to maximise is the joint distribution of θ and X, P(θX). Unfortunately, it is necessary to determine the data association parameters Z in order to do this. Consequently, we solve for

P(θX) = ∫_Z P(θXZ)   (4.19)

Here, we re-define the data association parameter Z as a 2D boolean variable ξ. Without changing the problem, we expand Z from Z = {z_1 … z_N}, where z_i ∈ {1 … M}, to ξ = {ξ_ij} where i ∈ {1 … N}, j ∈ {1 … M}, ξ_ij ∈ {0, 1} and ∑_j ξ_ij = 1. According to Minka (and based on Jensen's inequality), with this formulation we can write

P(θXZ) ≥ ∏_i ∏_j [P(x_i|θ_j) P(ξ_ij = 1)]^{ξ_ij}   (4.20)

If we consider the log-likelihood of (4.20), we write

log P(θX) ≥ log ∏_{i=1}^N ∏_{j=1}^M [P(x_i|θ_j) P(ξ_ij = 1)]^{ξ_ij}   (4.21)

= ∑_{i=1}^N ∑_{j=1}^M ξ_ij [log P(x_i|θ_j) + log π_j]   (4.22)

Minka shows that by finding the expected values of ξ_ij we are setting a lower bound on P(θX), and by maximising θ with respect to those expected values we are finding the maximum of that lower bound. By iterating, we continually raise the lower bound until we have converged on a solution. In the following sections, we show how this is achieved in practical terms for simple 1D examples using different PDF-based mixture models.

4.6.1 Example of EM on a 1D Gaussian mixture model

(a) Actual values
Sub-model            1      2      3
Mean (μ)             2.0    5.0    8.0
Standard Dev. (√λ)   0.5    0.3    0.2
Probability (π)      0.44   0.22   0.33

(b) EM estimated values
Sub-model            1      2      3
Mean (μ)             1.987  4.990  7.985
Standard Dev. (√λ)   0.479  0.272  0.189
Probability (π)      0.461  0.207  0.332

Table 4.1: Simulated 1D Gaussian mixture model results

In this section we present the results of EM for a mixture of Gaussians on a simulated 1D data set. The derivation of the update equations is given in Appendix A.1.
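The 'E' and 'M' updates for this kind of example can be sketched as follows. This is our own minimal 1D Gaussian-mixture implementation, not the thesis code; the quantile-based initialisation is our choice, and the simulated data below is ours (chosen in the spirit of Table 4.1), not the thesis data set.

```python
import numpy as np

def em_gmm_1d(x, n_models, n_iter=50):
    """Minimal EM for a 1D Gaussian mixture (Equations (4.20)-(4.22) style)."""
    # Crude but stable initialisation: evenly spaced quantiles of the data.
    mu = np.quantile(x, (np.arange(n_models) + 0.5) / n_models)
    var = np.full(n_models, np.var(x))
    pi = np.full(n_models, 1.0 / n_models)
    for _ in range(n_iter):
        # E step: expected data-association weights (responsibilities xi_ij).
        resp = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2.0 * np.pi * var)
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: association-weighted parameter updates.
        nj = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nj
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nj
        pi = nj / len(x)
    return mu, var, pi

# Simulated three-component mixture (our own parameters, 1000 points).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(2.0, 0.5, 440),
                    rng.normal(5.0, 0.3, 220),
                    rng.normal(8.0, 0.2, 340)])
mu, var, pi = em_gmm_1d(x, 3)
```

With well-separated components the estimated means, spreads and weights recover the generating parameters closely, mirroring the Table 4.1 behaviour.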
These results are shown to allow comparison with the results for the Lorentzian in the following section. For the first simulation, the mixture was set to have three sub-models. Their attributes are stated in Table 4.1(a). For the simulation, 1000 data points were generated from the ground truth mixture model to create the input data set. EM was run with the correct number of models, and the parameter estimation results are shown in Table 4.1(b). Figure 4.6(a-e) shows the results of the model estimation after various iterations of the EM process. The process converged in 18 iterations. As well, Figure 4.6(f) shows the evaluation of the log-likelihood at each iteration. The log-likelihood increases monotonically towards a maximum, as we expect.

Figure 4.6: The results of 1D EM fitting to a mixture model at various iterations: (a) results of fitting after 1 iteration; (b) results after 5 iterations; (c) results after 10 iterations; (d) results after 15 iterations; (e) final results after 18 iterations; (f) log-likelihood values. A histogram of the data is shown with the dashed line. The solid line represents the various PDFs of the sub-models within the mixture model.

To see how the method performs with outliers, or with an incorrect number of sub-models provided in the mixture model, the system was run again, but this time EM was performed with a mixture of only two Gaussians. Figure 4.7(a) shows the results of this parameter estimation. The left model has fitted correctly and identified one of the ground-truth generative models. The other tries to account for all of the remaining data. The result is that it is stretched across the two remaining ground-truth models and obtains a very poor fit. This illustrates the sensitivity to outliers and to correct model complexity initialisation that EM has with a Gaussian mixture model.
4.6.2 Example of EM on a 1D Lorentzian mixture model

In this section we present the same experiment of under-estimating the number of models in the mixture, except that the underlying probability distribution estimated is the Lorentzian rather than the Gaussian. The derivation of the EM update equations is given in Appendix A.2. Solving the 'E' and 'M' steps given in the previous section numerically, we ran the Lorentzian-based EM on the same ground truth models as in the Gaussian simulation example, but with only 2 models in the estimated mixture. The results are shown in Figure 4.7(b). One can see that in this case the right model has correctly identified a ground-truth model. However, the left model, rather than being split between the two remaining ground-truth distributions, has centred on the left one of the pair. This is due to the Lorentzian's outlier insensitivity. As the fitted distribution is pulled towards one underlying distribution, the other becomes treated more and more like a set of outliers, and contributes less and less to the parameter estimation. The only effect it has in this case is to spread the fitted model wider than it would otherwise be, but the model mean is in the right place. This illustrates the Lorentzian's tendency to lock onto one underlying model, even when the model selection problem has been incorrectly solved.

Figure 4.7: The results of 1D EM fitting to a mixture model using the incorrect number of sub-models, comparing the behaviour of the Gaussian with the Lorentzian. (a) shows the Gaussian fit: the left model is correctly identified, but the right model is spread between the centre and the right underlying models. (b) shows the same system with Lorentzian sub-models: the left model is correctly identified, the right model locks on to the underlying right model, and the centre model is relegated to the status of outlier.
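The lock-on behaviour just described follows from the derivatives in Section 4.4.1: the least-squares influence 2z is unbounded, while the Lorentzian influence 2z/(z² + K) decays towards zero. A small numeric sketch of ours (the constant K = 1 is purely illustrative):

```python
import numpy as np

K = 1.0  # illustrative constant derived from Gamma in Equation (4.16)

def influence_ls(z):
    """Derivative of the least-squares residual z**2."""
    return 2.0 * z

def influence_lorentzian(z):
    """Derivative of the robust residual log(z**2 + K)."""
    return 2.0 * z / (z ** 2 + K)

z = np.array([0.5, 5.0, 50.0])        # inlier, moderate point, gross outlier
ls = influence_ls(z)                   # grows linearly with distance
robust = influence_lorentzian(z)       # decays for distant points
```

A point 50 units out pulls the least-squares solution a hundred times harder than an inlier, while its pull on the Lorentzian solution is nearly negligible, which is exactly why the fitted Lorentzian can treat an entire unmodelled cluster as outliers.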
Chapter 5

The Patchlet

5.1 Overview

The primary contribution of this work is the concept of the patchlet. The patchlet is a new way of interpreting dense 3D scans, particularly from stereo vision. The idea of the patchlet arose from the intuition that correlation stereo vision is a region matching technique and consequently senses surfaces, not points. It is based on matching image regions of texture that represent small "patches" of texture on physical surfaces in the world. The patchlet is a method of interpreting 3D scans as a set of surface elements sampled from an underlying surface, rather than the traditional interpretation as a set of 3D points. These surface elements have a position, a normal and a size. Due to the projective nature of a camera, the size of a projected pixel, and of its associated surface element, will vary depending on the distance of the surface from the camera. In addition to the position, orientation and size of the patchlet, we also represent a confidence measure on the position and orientation. This allows each patchlet's individual confidence characteristics to be included in larger surface estimation. Correlation texture matching implicitly assumes underlying surface continuity. This statement is a simplification; correlation stereo does not always fail on highly discontinuous scenes. But correlation stereo is based on matching two regions of texture, and for the regions to match well it is beneficial that the texture lie largely on a surface that is not too oblique, occluded or discontinuous. This is illustrated in Figure 5.1. In practice, we find that the noisy or invalid stereo pixels are in regions that are occluded from one camera, discontinuous, or poorly textured. In fact, it is the assumption of continuous surfaces in disparity space that has led to the very successful noise rejection filter described in Section 3.3.1.
The goal of patchlets is to use surface elements rather than 3D points as the fundamental perception primitive in analysing 3D data from correlation stereo vision.

Figure 5.1: Surface continuity and correlation stereo. In the continuous case, (a), both cameras perceive largely the same textured region, and consequently the stereo match is more likely to be successful and will have a high confidence. In the discontinuous case, (b), both cameras will include similar views of the foreground object but will clearly include very different regions of the background. This reduces the quality of the correlation match.

Figure 5.2: The patchlet surface element and its various parts

5.2 Patchlet representation

A patchlet is a local surface element which has a 3D position and surface normal, similar to Szeliski and Tonnesen's oriented particles [ST92], with the addition of a physical size attribute like Fua's particles in [Fua96]. In contrast to these techniques, however, we do not try to create patchlets that are evenly sampled across a surface or space (Fua), or from already known surfaces (Szeliski). We create patchlets, one for each stereo pixel, on the basis of sensor data and with the understanding that each patchlet will have a different size depending on its parent pixel's projection onto the sensed surface. As well, the patchlet has confidence measures for its position and orientation which are equivalent to a positional and pointing accuracy. These parameters are illustrated in Figure 5.2. The parameters of a patchlet are:

• origin (x): a 3D position indicating the centre of the patchlet

• normal (ψ or [θ, φ]): a normal direction.
We use ψ when referring to the 3-space normal vector, or the (θ, φ) angle pair when using the polar coordinates illustrated previously in Figure 4.4.

• local coordinate system (T_w^l, T_l^w): a local pair of X-Y axes within the patchlet plane that define the directions of the size parameters. The 4 × 4 homogeneous transform T_w^l (world-to-local) and its inverse, T_l^w (local-to-world), define a relationship between the world coordinate system and the local patchlet coordinate system that includes the definition of the local X and Y axes. These transforms actually represent the origin, normal, and X and Y axes together, although it is usually convenient to refer to the normal and origin outside of that context.

• size (s_x, s_y): a size parameter in both the local X and Y directions

• positional confidence (λ or Γ): represented by a variance or gamma parameter (depending on whether Gaussian or Lorentzian positional models are being used; see Sections 4.2.1 and 4.2.2) indicating the confidence of the patchlet origin in the direction of the normal

• directional confidence (κ): represented by the κ parameter of the Fisher probability distribution (see Section 4.2.3)

It is important to note that some of these parameters are interdependent. The transform T_l^w contains the 3D position and the normal information. As well, we often use the normal/offset representation of a plane. This representation comes from the standard plane equation

Ax + By + Cz + D = 0   (5.1)

In this equation, (A, B, C) can be constrained to have a magnitude of 1, and is then the normal vector of the plane. In this case, −D is the perpendicular distance between the plane and the origin and is referred to as the offset. The offset can be determined from the patchlet normal and origin by taking the dot product: offset = ψ^T x. The positional confidence measure is constrained to be in the normal direction and so is effectively an offset confidence measure.
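The normal/offset computation above can be sketched directly. The function name and interface below are ours, not from the thesis.

```python
import numpy as np

def plane_from_patchlet(origin, normal):
    """Return (A, B, C, D) of Equation (5.1) from a patchlet origin and normal."""
    n = np.asarray(normal, float)
    n = n / np.linalg.norm(n)                 # enforce |(A, B, C)| = 1
    offset = n @ np.asarray(origin, float)    # offset = psi^T x, so D = -offset
    return n[0], n[1], n[2], -offset

# Example: a patchlet on the plane z = 2 with normal along +z.
A, B, C, D = plane_from_patchlet(origin=[0.0, 0.0, 2.0], normal=[0.0, 0.0, 1.0])
# The patchlet origin itself satisfies A x + B y + C z + D = 0.
```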
5.3 Estimating patchlet parameters

To extract a patchlet at a particular location in the disparity image, we use a neighbourhood method that uses the disparity pixels in a region of the image around the location of interest. Given the assumption that the sensed surfaces are piecewise continuous and densely sampled, we could use a simple "adjacent pixel" operation. For example, it is possible to estimate the normal of the surface at any pixel given the 4 adjacent pixels: one can fit a second order curve through the three pixels in each direction (row and column) and estimate the surface orientation from these quadratics. However, since individual points are noisy and often missing due to invalid stereo data, a region-based method is more robust. Using more points provides a smoothing or regularisation [Gri81] of the estimated surface and is less sensitive to Gaussian noise. It allows us to apply simple outlier rejection to discount points within the region that are clearly either outliers or pixels from a different surface (in the case where a neighbourhood overlaps two surfaces). It also allows us to make surface estimates even when one of the four immediately adjacent pixels is invalid. We chose a simple square neighbourhood, or mask, around the pixel of interest. To create a set of patchlets from a disparity image, for each pixel location that contains a valid disparity pixel we perform the steps outlined below. In this description, when we refer to "points" we mean uncertain points as described in Section 3.4.

1. Create the set of points that are associated with the disparity pixels that fall within the mask. The mask can be any size depending on the amount of regularisation necessary; in our experiments we used 5 × 5 masks in the disparity image.

2. Remove any obvious outliers; abort if insufficient valid pixels remain (see 5.3.1).

3. Create a best-fit plane to the remaining points (see 5.3.2).

4.
Project the centre pixel onto the plane to determine the patchlet origin, and determine the local coordinate frame and size of the patchlet (see 5.3.3).

5. Determine the confidence parameters by propagating the uncertainty of the points to the patchlet parameter space (see 5.3.4).

5.3.1 Initialisation and outlier rejection

Like most parameter estimation techniques, the above is sensitive to initialisation and outliers. For outlier rejection, we have the advantage of having some notion of scale or size in our 3D point cloud. We calculate the size of a frontal-parallel pixel at the centre point, and this gives us a rough idea of the expected dispersion of 3D points within the disparity neighbourhood. We then use a threshold to reject points that are more than a certain number of "pixel sizes" away from the centre point. The idea is to remove gross outliers that are completely off the surface we are about to fit. If we use a threshold of, say, 100 times the pixel size away from the centre point, we can be quite confident that any rejected pixel is not an inlier. In addition to problems with outliers, if an optimisation algorithm is poorly initialised, it may either converge to the wrong local minimum (in the case where the initialisation is not in the watershed of the global minimum) or it may take a much larger number of iterations to converge. To obtain a quick and reasonable initialisation, we calculate the plane parameters that satisfy the unweighted least-squares criteria. This can be solved efficiently by using principal component analysis on the matrix of moments of the point cloud about its centroid. This invariably starts the optimisation method within the correct local minimum watershed and reasonably close to the optimal solution. The plane refinement step from this point must take into account the heteroscedastic attributes of the point covariance matrices. This outlier rejection and initialisation is applied before estimating any of the patchlet parameters.
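The unweighted least-squares initialisation via principal component analysis can be sketched as follows: the plane passes through the centroid, and its normal is the eigenvector of the moment matrix with the smallest eigenvalue. This is our own illustration on synthetic data, not the thesis implementation.

```python
import numpy as np

def init_plane_pca(points):
    """Unweighted least-squares plane through a point cloud (initialisation).

    Returns (centroid, normal), where normal is the eigenvector of the
    moment matrix about the centroid with the smallest eigenvalue.
    """
    pts = np.asarray(points, float)
    centroid = pts.mean(axis=0)
    moments = (pts - centroid).T @ (pts - centroid)
    eigvals, eigvecs = np.linalg.eigh(moments)   # eigenvalues in ascending order
    normal = eigvecs[:, 0]                       # direction of least scatter
    return centroid, normal

# Noisy samples of the plane z = 1 (synthetic data for illustration).
rng = np.random.default_rng(0)
xy = rng.uniform(-1.0, 1.0, (200, 2))
pts = np.column_stack([xy, 1.0 + 0.01 * rng.standard_normal(200)])
c, n = init_plane_pca(pts)   # n is close to (0, 0, +/-1)
```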
The final threshold test is to ensure that sufficient uncertain points remain within the disparity image neighbourhood. If fewer than the threshold, typically 50%, remain, we reject this neighbourhood as being too inconsistent to fit a patchlet.

5.3.2 Fitting the patchlet plane

To fit a plane to a set of uncertain points we use the maximum likelihood estimate (MLE). Following our discussion of MLE in Section 4.4, the MLE solution for the plane parameters is the one that maximises the probability of the plane model θ given the patchlet data X. With a Gaussian uncertainty model for the uncertain points, this is equivalent to the least-squares solution, or minimising the sum of the Mahalanobis distances for all the data points. The data set is X = {X_1 … X_n}, where X_i = {[x_i y_i z_i]^T, Λ_i} represents the 3D position of the point and the covariance matrix representing its uncertainty. For a given plane H, there will be an associated squared error for each of the uncertain points defined by the Mahalanobis distance

e_i^2 = (X̂_i − X_i)^T Λ_i^{−1} (X̂_i − X_i)   (5.2)

where X̂_i is the point on the plane that is closest to X_i. Unfortunately, this point X̂_i is not necessarily the perpendicular projection of X_i onto H; it is the point on H which minimises Equation (5.2). Recall from Section 3.4 that the uncertain points making up X are heteroscedastic. We cannot use any global coordinate system rotation and scaling to align all the covariance matrices with the coordinate axes, because the data is heteroscedastic. Consequently, in order to evaluate the Mahalanobis distance between a plane and an uncertain point, we must perform a different whitening transform for each uncertain point. The whitening (or sphering) transform [KWV95] is a rotation and rescaling of the coordinate system so that a covariance matrix becomes a sphere, or the identity matrix: equally weighted in all directions.
It is called "sphering" because of this direction-independent constant weighting. It is also called "whitening" because it removes bias from the data elements, making them all equally weighted.

Evaluating point-to-plane error

To find the point X̂_i on plane H that is closest to X_i given the influence of the covariance matrix Λ_i, we must apply a whitening transform to realign and scale the world coordinate system so that the covariance matrix Λ_i becomes spherical (i.e., the identity matrix), and consequently equally weighted in all directions. (See Appendix Sections B.1 and B.2, Equations (B.1) and (B.2), for the mathematical details.) This is done by rotating the world coordinate frame to align with the covariance matrix principal axes, then scaling to make the principal components of equal size. This transform is applied to the parameters of H and to X_i to map them into the whitened space. The point in whitened space, X̂_i^w, that is the closest point on the whitened plane H^w to the whitened data point X_i^w, is the perpendicular projection of X_i^w onto H^w (see Equation (B.3)). The point X̂_i^w is then mapped back to the unwhitened space to obtain X̂_i. The details of the whitening and unwhitening transforms can be found in Appendix B.1. The important consequence of this is that we must perform these mappings and projections for each data point and for each iteration of the plane fitting.

Plane fitting with Levenberg-Marquardt

Because the point data is heteroscedastic, we could not obtain a closed-form solution to the MLE of the plane parameters, but required an iterative optimisation method. We initially experimented with Newton's method [PTVF92]. With Newton's method, one linearises the slope (i.e., assumes the rate of change of the derivative function landscape is constant, which implies a second order surface) and predicts where the derivative of the function to be minimised will be zero.
The method then updates the current estimate of the solution to be at that location and repeats. For functions that approximate a second order shape, it has the advantage of very quick convergence. When applied to our problem, Newton's method works by iteratively applying updates to the current parameter estimate of the plane H to bring it closer and closer to the optimal solution. Therefore H_{i+1} = H_i − u, where H_i represents the estimate of plane H at iteration i and u is the update applied to the parameters of H_i at that iteration. To apply Newton's method to this problem, we linearise the function, take the first derivative, and predict how much update needs to be applied to H to bring the error (non-squared) to zero. This can be expressed as

J u = e   (5.3)

where J is the N × M Jacobian matrix formed by

J_ij = ∂e_i / ∂h_j

h_j is the jth parameter in the plane model H, M is the number of parameters in the model, and N is the number of points or errors. We can solve for u by applying the pseudo-inverse:

J u = e
J^T J u = J^T e
u = (J^T J)^{−1} J^T e

However, the aggressive update strategy of Newton's method may have disadvantages depending on the function that is being optimised. In our case, we discovered that the function landscape was a long, narrow valley. This causes trouble for Newton's method, as it often over- or under-shoots the optimal step size. This leads to the algorithm spending too much time oscillating back and forth across the valley while very slowly moving along the valley towards the optimum. To overcome this, we selected the Levenberg-Marquardt (LM) method [PTVF92]. LM is a method with an automatically adapting step size that shortens when improvements are not being made and lengthens towards Newton's optimal step size as it gains confidence that the steps are not overly large. David Lowe provides an excellent discussion of the solution and stabilisation of this method in [Low91].
The discussion below follows his description and adapts it to our particular problem. LM applies a prior to the Newton method that specifies that the update step size should be small (or zero). The equations then become

[ J  ]       [ e ]
[ λI ] u  =  [ 0 ]

(JᵀJ + λ²I) u = Jᵀe

u = (JᵀJ + λ²I)⁻¹ Jᵀe

Note that an extra set of linear equations, [λI]u = [0], has been added to the system. This adds a set of artificial points to the minimisation that are always perfectly satisfied before each update. They act to reduce the update (and so reduce the step size) from the Newton iteration. The strength with which they restrict the update is controlled by λ: the higher λ is, the more it restrains the Newton update step. Levenberg and Marquardt observed that with this formulation, for λ = 0 the solution is identical to Newton's method, while as λ increases the solution approaches gradient descent with smaller and smaller step sizes. λ is adjusted by either dividing or multiplying it by a user-supplied constant K, typically between 2 and 10. We selected K = 4, but the actual value of K is not very important. The algorithm, then, is

1. select an initial value for λ
2. evaluate the initial error E_0 = |e|
3. iterate:
   (a) calculate the update vector u
   (b) calculate the error E_{i+1} = |e| for H_{i+1} = H_i − u
   (c) if E_{i+1} < E_i then accept the update and set λ = λ/K
   (d) else set λ = λK and keep H_{i+1} = H_i

For small λ, the method approaches Newton's method, which has the advantage of fast convergence. For large λ, the method approaches gradient descent with small step sizes for maximum robustness. After each iteration the error is tested. If it is not improving, the step size is assumed to be too big, λ is increased, and the iteration is repeated from the previous starting position. If the error is decreasing, λ is decreased so that the minimisation speeds up. This results in an adaptive minimisation that automatically finds suitable values for λ. To use this algorithm, we need to populate Equation (5.3).
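The adaptive-λ schedule just described can be sketched as follows (an illustrative Python sketch assuming NumPy, not the thesis implementation; the damped normal equations (JᵀJ + λ²I)u = Jᵀe match the formulation above, but all names are our own):

```python
import numpy as np

def levmar(residual_fn, h, lam=1e-3, K=4.0, iters=50, eps=1e-6):
    """Levenberg-Marquardt with the schedule described above: accept a
    step and divide lambda by K when the error drops, otherwise
    multiply lambda by K and retry from the previous estimate."""
    def jacobian(h, e):
        J = np.empty((e.size, h.size))
        for j in range(h.size):
            hp = h.copy()
            hp[j] += eps
            J[:, j] = (residual_fn(hp) - e) / eps
        return J
    e = residual_fn(h)
    E = np.linalg.norm(e)
    for _ in range(iters):
        J = jacobian(h, e)
        # Damped normal equations: (J^T J + lambda^2 I) u = J^T e
        u = np.linalg.solve(J.T @ J + lam**2 * np.eye(h.size), J.T @ e)
        h_new = h - u
        e_new = residual_fn(h_new)
        E_new = np.linalg.norm(e_new)
        if E_new < E:                    # improvement: accept, speed up
            h, e, E, lam = h_new, e_new, E_new, lam / K
        else:                            # overshoot: reject, damp harder
            lam *= K
    return h
```

On a long narrow error valley, the rejected steps automatically raise λ until the update is short enough to make progress along the valley floor.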
Each row in the equation requires the residual error of a point with respect to the current plane parameters, as well as the partial derivatives of that error. The error for each point, e_i, is evaluated by taking the square root of the squared-error equation given in (5.2). To evaluate (5.2), we need to determine X̂_i, which was discussed in the previous subsection. To determine the partial derivatives of e_i with respect to H, we used a numerical estimate:

∂e_i/∂h_j ≈ (e_i(H′) − e_i(H)) / ε

where H′ is identical to H except that h′_j = h_j + ε.

Figure 5.3: Constructing the patchlet local coordinate system: the cone represents the projection of the disparity pixel. The plane represents the fitted patchlet plane.

5.3.3 Creating the patchlet local coordinate system

After determining the plane parameters of the patchlet, we need to determine the local coordinate frame of the patchlet: the patchlet origin and the X-Y axes of the local coordinate system. We will refer to these local coordinate axes as X_l, Y_l, and Z_l, where the l subscript denotes the axis as local. The local origin is set to the intersection of the patchlet plane with the ray projecting from the camera optical centre through the centre pixel of this neighbourhood. The coordinate system is constructed as illustrated in Figure 5.3. If we consider the centre pixel projection through space to be a cone emanating from the camera, the intersection of this cone with the patchlet plane is an ellipse. We choose to set the X_l axis to the major axis of the ellipse. The Z_l axis is set to the surface normal. The Y_l axis is the normalised cross product of the viewing direction and the surface normal. The X_l axis is the cross product of the Y_l axis and the Z_l axis. Or, if we label the camera optical centre as O, the patchlet origin as O_l, and the normal as N, then

Z_l = N
Y_l = N × OO_l (normalised)
X_l = Y_l × Z_l

To determine the size of the patchlet, we calculate the size of this ellipse in the major and minor axis directions.
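The axis construction above can be sketched as follows (an illustrative Python sketch assuming NumPy; the argument names O, Ol, N mirror the symbols in the text but the code is our own, not the thesis implementation):

```python
import numpy as np

def patchlet_frame(O, Ol, N):
    """Local axes for a patchlet with origin Ol and normal N, viewed
    from optical centre O (cf. Figure 5.3):
        Z_l = N,  Y_l = N x OO_l (normalised),  X_l = Y_l x Z_l.
    Illustrative sketch only."""
    view = Ol - O
    view = view / np.linalg.norm(view)      # unit viewing direction OO_l
    Zl = N / np.linalg.norm(N)
    Yl = np.cross(Zl, view)                 # N x OO_l
    Yl = Yl / np.linalg.norm(Yl)
    Xl = np.cross(Yl, Zl)
    return Xl, Yl, Zl
```

The result is a right-handed orthonormal frame with Z_l along the surface normal and X_l lying along the major axis of the projected pixel ellipse.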
The minor (or Y_l) axis is always perpendicular to the viewing axis, and so its size is the frontal-parallel pixel size

s_y = z / f

(here z is the z coordinate of the patchlet origin in the camera coordinate system and f is the focal length). The size in the major (or X_l) axis direction is

s_x = s_y / cos α

where α is the angle between the normal and the viewing vector, so cos α = N · OO_l. Once we have the patchlet origin and coordinate system, it is a simple matter to construct the homogeneous transforms. The local-to-world transform is

T_l^w = [ X_l  Y_l  Z_l  OO_l ]
        [  0    0    0    1   ]

with the axis vectors as columns, and the world-to-local transform, T_w^l, is its inverse.

5.3.4 Propagating patchlet uncertainty

The last step in estimating the parameters of the patchlet is to propagate the uncertainty of the basis 3D points to the patchlet parameter space. This determines the confidence parameters for the normal direction and the patchlet position. The covariance matrix of the parameters estimated by the Levenberg-Marquardt method described in Section 5.3.2 is Λ⁻¹ = JᵀJ, where J is evaluated at the converged parameter values of plane H [Low91]. However, in spherical systems it is always best to avoid the singularities that polar coordinates often cause (i.e., when θ = 0, φ is unconstrained or meaningless; for a globe, this is the situation where the normal points towards the north or south pole, and the globe can rotate, changing its φ parameter, without any change in the axis direction). Consequently, before evaluating J, we first transform plane H and all the uncertain points that make it up so that the patchlet origin is the world origin and the normal of the patchlet points in the positive X direction. This is a position where the polar coordinates for the normal, θ and φ, are equally weighted, and it is as far as possible from the "north pole" singularity.
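The size formulas and transform construction above can be sketched as follows (an illustrative Python sketch assuming NumPy and a camera at O looking along +Z, so that the depth z is simply the third coordinate of O_l; names are our own, not the thesis implementation):

```python
import numpy as np

def patchlet_size_and_transform(Xl, Yl, Zl, Ol, O, f):
    """Patchlet sizes s_y = z/f and s_x = s_y/cos(alpha), plus the 4x4
    local-to-world transform with the axes as columns.  f is the focal
    length in pixels; assumes the camera looks along +Z from O."""
    z = Ol[2]                        # depth of origin in the camera frame (assumption)
    sy = z / f                       # frontal-parallel pixel size
    view = (Ol - O) / np.linalg.norm(Ol - O)
    cos_a = abs(Zl @ view)           # cos(angle between normal and viewing vector)
    sx = sy / cos_a                  # foreshortening along the major axis
    T = np.eye(4)                    # local-to-world: axes and origin as columns
    T[:3, 0], T[:3, 1], T[:3, 2], T[:3, 3] = Xl, Yl, Zl, Ol
    Tinv = np.linalg.inv(T)          # world-to-local
    return sx, sy, T, Tinv
```

For a frontal patch (normal along the viewing ray) the two sizes coincide, and they diverge as 1/cos α as the patch tilts away, matching the oblique-surface behaviour discussed in the synthetic results below.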
After transforming the parameters and calculating the Jacobian, the covariance matrix for the 3 parameters {θ, φ, offset} is obtained from (JᵀJ)⁻¹.

As in Section 3.3.3, we apply the methods for propagating uncertainty given by Faugeras [Fau93], but with some modifications. Since this method is based on a Gaussian model, for the normal direction we first propagate the uncertainty into a variance for the polar coordinates that represent the normal direction, and then transform this variance into the Fisher model confidence measure κ. Recall from Section 3.3.3 that propagating variance from parameter space B to A is simply

Λ_A = J Λ_B Jᵀ

where J is the Jacobian, J_ij = ∂a_i/∂b_j. In this case, the uncertain points exist in 3D space and the patchlet parameters of interest are those of an unbounded plane, which we represent with polar coordinates for the orientation plus an offset parameter, or {θ, φ, o}. The elements of the Jacobian are calculated numerically. The resulting covariance matrix has the form

    [ A B C ]
Λ = [ B D E ]
    [ C E F ]

where the sub-matrix

[ A B ]
[ B D ]

is the covariance of the angles {θ, φ} and F is the variance of the offset parameter. C and E are much smaller than the other elements in the matrix due to the independence of the angles and the offset. A justification of the independence of the angle and offset parameters is given in Appendix C. In order to convert the 2 × 2 angle covariance to the Fisher model, we first find the primary eigenvalue of the matrix, which indicates the largest variance of the system, Λ. We then convert this to the equivalent κ value. The forms of the Gaussian and Fisher distributions match very closely within the −π to π range for κ > 3, and so we set the magnitudes of the two curves to be equal at the centre of the distribution, μ.
Equating the two curves gives

(1/√(2πΛ)) e^{−(x−μ)²/(2Λ)} = κ e^{κ cos ψ} / (4π sinh κ)

We evaluate these at the centre peak, (x − μ) = 0 and ψ = 0, and it becomes

1/√(2πΛ) = κ e^κ / (4π sinh κ)

Substituting the exponential equivalent for sinh, i.e., sinh κ = (e^κ − e^{−κ})/2, then

1/√(2πΛ) = κ e^κ / (2π (e^κ − e^{−κ})) = κ / (2π (1 − e^{−2κ}))

For values of κ greater than 3, which is expected, e^{−2κ} < 0.0025, so (1 − e^{−2κ}) ≈ 1. Making this substitution, we get

1/√(2πΛ) = κ / (2π)

or

κ = √(2π/Λ)     (5.4)

5.4 Results

While we present more details about the results of generating patchlets from stereo data in Chapter 7, in this section we show a limited set of patchlet results on synthetic and real data to illustrate the patchlet features.

5.4.1 Patchlets from synthetic data

To test the creation of patchlets on synthetic data, we created a synthetic disparity image. The synthetic object we chose to image was a simple square box. The box sides were aligned with the camera coordinate system and lie 1 m to each side of the optical axis. The end of the box was 5 m away from the optical centre. The disparity image was generated by determining the corresponding 3D point on the synthetic surface for each pixel in the image. The 3D values were then mapped backwards through the stereo rig parameters to generate a disparity value. Note that this method does not simulate the imaging process except for the geometry, so there is no smoothing due to optics or stereo algorithms.

(a) Appearance image (b) Disparity image
Figure 5.4: Synthetic disparity image of a 5 m deep by 2 m wide by 2 m tall box. In the appearance image, each side of the box is mapped to a constant greyscale.

Figure 5.4 shows the disparity image as well as the appearance image, which indicates how the sides of the box would appear in the camera view. The camera parameters were taken from the stereo rig used to acquire the other stereo test data.
There are only 10 integer levels of disparity (5 through 15) in this disparity image, but the disparity values are calculated to sub-pixel accuracy, up to 1/128 of a pixel.

(a) Oblique view of point cloud (b) Close-up of upper-right back corner
Figure 5.5: View of the 3D points extracted from the synthetic disparity image. (b) shows a close-up of the upper, right, back corner of the box. Notice the sides of the box are much less densely sampled than the back.

Figures 5.5 and 5.6 show the resultant points and patchlets generated from the synthetic disparity image. These results were generated with a neighbourhood mask size of 5 × 5. In Figure 5.5(a) one can see that the bottom front of the box is truncated. This is due to the field-of-view of the camera. In Figure 5.5(b) we can see a corner where the top, side and back of the box are all visible. It is clear that the side and top surfaces are sampled much more sparsely than the back, because those surfaces are viewed at an oblique angle. This is also visible in Figure 5.6(b), where one can see that the patchlets on the side and top are considerably larger in order to cover the more sparsely sampled surface. The smoothing of the patchlet neighbourhood operator is also visible where the surfaces meet in Figure 5.6(b). Figure 5.7 shows a mapping of patchlet confidence measures to greyscale to give an idea of the levels of the κ and Λ parameters of the patchlets. In both cases, brighter indicates more confidence (higher κ, lower Λ). In both cases we can see that where two planes meet there is a reduction in confidence, which is to be expected due to the mixing of data from different surfaces. The lowest confidence in both cases is on the back wall.
This is in part due to this surface being the farthest from the camera, but it is also due to the orientation of the back wall with respect to the camera.

(a) Oblique view of patchlets (b) Close-up of upper-right back corner
Figure 5.6: Shaded view of the patchlets that match the views in Figure 5.5. Some patterns are visible in (a) due to aliasing in the rendering, caused by the large number of patchlets (76800) and the limited area of the rendering. The patchlets are rendered at 90% size so that the gaps between patchlets are visible.

(a) Standard deviation (√Λ) (b) Fisher κ parameter
Figure 5.7: This figure shows the standard deviation and κ values of the patchlets displayed in greyscale. In both cases brighter means more confidence: high κ, low Λ. In (a) the square roots of the patchlet variance values are displayed graphically. The values range from 1 mm to 1 cm. The back of the box is much more uncertain than the sides. In (b) the Fisher model κ parameter is shown. The back of the box has a low value of κ ≈ 1, whereas the sides of the box are quite constant with values of κ ≈ 8.8. (The banding seen on the box sides in (a) and (b) is caused by slight perturbations in the 3D points caused by the quantisation of the disparity image to 1/128 of a pixel.)

(a) ε = 0.25 (b) ε = 0.3 (c) ε = 0.7
Figure 5.8: The effect of covariance/surface orientation on confidence

Figure 5.8 illustrates the effect of the orientation between the uncertain point's covariance ellipse and the fitted surface. The heavy, dark, horizontal line represents an estimated plane. The line between the point and the plane indicates the error contribution for that particular point. The ellipse represents a single standard deviation of the point's covariance matrix. Since distance errors are weighted by the standard deviation, the contributed error is measured in units of standard deviation.
We can see that as the standard deviation ellipse is rotated away from the "straight on" position, the error value for the point increases, even though the absolute distance between the point and the plane remains constant. This is an illustrative example; in actual practice, the standard deviation ellipse is much more acute for stereo data, and this effect is more pronounced than shown here. This increasing error with increasing angle of incidence causes an amplification in both the error and the error gradient. An increased error gradient results in a decrease in the variance of the parameter estimates, or in other words an increase in the confidence of the parameter estimate. This effect can be observed in Figure 5.7, especially in the values of κ. There is a sharp decrease in confidence in the transition from the side walls to the back wall. The angle between the viewing direction and the box wall is only 11 degrees at this point.

5.4.2 Patchlets from real data

Figures 5.9, 5.10 and 5.11 show an outdoor scene captured with a stereo camera and the associated stereo, 3D point, and patchlet images. The scene is the front steps of the CICSR building at the University of British Columbia. Figure 5.9(b) shows the disparity image obtained from the image shown in Figure 5.9(a). As in the previous section, we have mapped the confidence measures of the patchlet parameter estimates into the greyscale values displayed in Figure 5.10. As before, the surfaces that are more oblique to the camera tend to have higher confidence, and the patchlets that are farther from the camera are less accurate than those that are closer. However, since we are no longer dealing with synthetic data, there is considerably more variation in the values.
(a) Appearance image (b) Disparity image
Figure 5.9: Image of steps and associated disparity image

(a) Patchlet standard deviation image (b) Patchlet κ image
Figure 5.10: Display of variance and κ values of patchlets created from the stereo image in Figure 5.9.

Figure 5.11(a-f) shows the patchlet information rendered from several viewpoints. (a) shows the point cloud generated from the stereo image. (b) shows the patchlet data with the patchlet colour obtained from the rectified image, while (c) shows the patchlets shaded white. (d) shows the far end of the steps where they join the concrete border; this view is from the front of the steps looking towards the building. We can see the smoothing that occurs due to correlation stereo between the steps and the vertical border surface. (f) shows that some of the steps are missing their top surface. The upper two steps are missing this surface because it is occluded from the stereo camera (as shown in Figure 5.9(a)). The lower steps have some smaller holes in their upper surfaces caused by stereo failing in regions of the image where the CCD has saturated, so that no texture is available to the correlation stereo algorithm.

(a) Point cloud (b) Texture mapped patchlet view
Figure 5.11: Several 3D renderings of the patchlets created from the stereo image in Figure 5.9.

Chapter 6

Surface segmentation and estimation

6.1 Motivation

The problem of extracting accurate, bounded, higher-level surfaces from noisy range images is a difficult one, and one that is well suited to demonstrating the usefulness of the extra dimensions patchlets have over 3D points. This work was prompted by the goal of environment modeling: creating compact environment representations from large amounts of noisy stereo data. Although patchlets lend themselves to the fitting of any parametric surface, to limit the scope of this thesis we elected to restrict our environment model to planar surfaces.
This was done not only for scope reasons, but also because planar surfaces are the most fundamental 3D surface, the most common one encountered in man-made environments, and one that can be used to construct more complex shapes. Ideally we would like to segment and estimate all surfaces simultaneously. This alleviates issues encountered with greedy algorithms, where an over-aggressive segmentation of one surface can corrupt the estimation of a later surface. However, to solve a simultaneous estimation problem like this we must solve the following sub-tasks simultaneously:

1. determine how many surfaces there are - the model selection problem
2. determine which data elements belong to which surfaces - the data association problem
3. estimate the surface parameters - the model fitting problem.

In general, if one has solved the first and second of these problems, the third is very straightforward. If we have a set of data assigned to a particular surface model, we can fit a surface to that data fairly easily. The difficult tasks are the model selection and data association problems.

(a) Desired solution (b) Poor model selection (c) Poor association (d) Poor model selection and association (e) Too many models
Figure 6.1: Challenges in fitting surfaces to data: (a) is the ground-truth solution; (b) shows the effects of too few models; (c) shows the poor solutions possible if the right number of models fall into an incorrect local minimum; (d) shows the case when both model selection is poor and a poor local minimum is found; (e) shows the effects of selecting too many models to fit.

Figure 6.1 illustrates several difficulties one encounters when these two tasks are not solved satisfactorily. This figure shows the data as 2D points and the surfaces as lines for simplicity. Figure 6.1(a) shows the ground-truth surfaces and some data generated from them.
Figure 6.1(b) shows the kind of problem one encounters when the model selection problem is not solved satisfactorily: multiple surfaces are associated with one model and the resulting fit is poor. Figure 6.1(c) shows the kind of problem one encounters when the data association problem is solved poorly: multiple models may try to fit the same underlying surface, leading to the splitting of that data between the two models, and poor fits. This is also the problem shown in Figure 6.1(d). Over-segmenting the data is another common problem, shown in Figure 6.1(e).

Figure 6.2: Unbounded surfaces attract outliers

Another decision to be made was whether to put size constraints on our planar surfaces, i.e., whether to use bounded surfaces or to fit unbounded planes. Unbounded planes are mechanically much simpler: a plane has only three degrees of freedom. The number of parameters for a bounded plane depends on the shape used to limit its extent. We chose a rectangle. This adds two size parameters and an orientation angle within the plane, making the bounded surface a six-parameter object. Our decision to use bounded planes came from early experiments conducted with unbounded ones. Figure 6.2 illustrates the difficulties we found with unbounded surfaces. In general there is a local region of support for each surface, represented in the figure by the cluster of points on the left. However, for large data sets and unbounded planes it is inevitable that the plane will pass near some other points in the data set, shown here as the cluster of points on the right. Since the distance of a point from the plane is the perpendicular distance, outlier points such as these can be any distance from the local surface support group and still be close enough to the plane to be classified as inliers. However, since these points are rarely in fact associated with that surface, they have a corrupting effect on the parameter estimation.
This corrupting effect is often quite strong, as the in-plane distance of these points from the centroid of the real support points can be quite large, and this distance acts as a "lever arm", magnifying the contribution of these points to the orientation of the surface. The choice to use bounded surfaces requires that we develop a mechanism in our model fitting methods to limit the size of the plane. There must be some "cost" associated with larger surfaces to keep the size of the surface within reasonable limits. This can be done with connectivity constraints: all points within the surface must be connected in the image, and the surface size is then restricted to the smallest size that still encloses all assigned points. This connectivity restriction, however, has the limitation of using image structure, which we decided earlier to restrict as much as possible. The reason we wish to restrict the use of image structure (i.e., knowledge of the {row, column} position of each patchlet within a source image) is that we wish to develop methods that are amenable to combining stereo images from multiple viewpoints. As soon as one imposes image structure as a requirement, one limits the ability to directly compare data elements from different stereo images, and consequently different image structures. However, if our data elements have a size, as patchlets do, then the concept of "connectivity" does not require image structure, as we can determine whether two data elements contact or overlap based on their size. This concept is illustrated in Figure 6.3. Figure 6.3(a) shows a set of points with two significant gaps in their connectivity along a surface. Figures 6.3(b-d) show the resolution of the ambiguity of these gaps if we replace the points with patchlets. In Figure 6.3(b) we can see that the left gap is closed or bridged and the right gap is not. In Figure 6.3(c) both gaps are bridged.
In Figure 6.3(d) the right gap is not bridged, and we can also see that the rightmost data element is unlikely to belong to this surface due to the large difference in normal vectors.

6.2 Approach

One of the more difficult aspects of extracting surfaces from large amounts of data is the data association problem. By "data association" we mean the assignment of each sensed data point to a generative surface model. Once one has the assignments of data to surfaces, the task of estimating the surface parameters is a well-conditioned problem. The data association problem can be thought of as data clustering: clearly, if one can "cluster" the data into surface groups successfully, this is a solution to the association problem. There are several ways one could go about this. The two logical approaches we considered were a greedy approach that fits one plane at a time, and an approach that tries to fit all surfaces simultaneously. In the end we used a combination of these two approaches, the first for initialisation and the second for surface refinement.

(a) Are all these points on the same surface? There are two significant gaps of different size; is either of them too large? (b) Sized data points: here one gap is bridged (c) Both gaps are bridged (d) Even though the patchlet is large enough, it does not match the surface parameters
Figure 6.3: Surface gaps: when is a gap between points too large?

The greedy algorithm is conceptually:

loop(i):
    find the surface S_i with the most support from data set X
    find the subset of data X_i associated with S_i
    if the subset X_i is smaller than a threshold:
        end loop
    remove X_i from X
    save S_i
    call loop(i + 1)

The idea of the greedy method is to find one plane that has good support from the data, then remove the data associated with that plane and repeat.
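The greedy loop above can be sketched as follows (an illustrative Python sketch, not the thesis implementation; the caller supplies a `fit_surface` routine, and all names are our own):

```python
def greedy_extract(data, fit_surface, min_support):
    """Greedy surface extraction as in the pseudocode above: repeatedly
    fit the best-supported surface, remove its supporting data, and
    stop when support falls below a threshold.

    fit_surface(data) must return (surface, support_subset)."""
    surfaces = []
    data = set(data)
    while data:
        surface, support = fit_surface(data)
        if len(support) < min_support:
            break                      # no sufficiently supported surface left
        surfaces.append(surface)
        data -= set(support)           # remove the surface's supporting data
    return surfaces
```

The quality of the result hinges entirely on how well `fit_surface` separates inliers from outliers, which is exactly the weakness discussed next.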
If the method works, plane after plane will be fit and their associated data removed from the available set until there is no data left or the plane fit cannot find sufficient support. We use this method for our initialisation approach, discussed in Section 6.4. However, this method is not robust when used with noisy points (not patchlets, but unoriented points). One of the main issues when applying this method to point data is outlier rejection. Clearly, if one fits a single plane to all the data in a complex scene, the fit will not be accurate, as there is more than one plane in the scene. To work, we must reject the points belonging to all surfaces other than the one being fit. In practice this is difficult: either too few points are marked as outliers and too many points are associated with the surface, or too many points are marked as outliers and only a subset of the truly associated points are connected to the surface. In the first case, the estimate of the plane parameters is poor due to corruption by outlier points, and in addition, when one removes the points associated with the plane for the next iteration, too many points are removed. This leads to the removal of points that should have been left in the data set, so later planes will not have the support they should have had. If too few points are associated with the fitted surface, then the next iteration of plane fitting may fit the same plane again, using the points from that surface which remained after the previous solution, leading to finding the same surface again and again. Our approach is to use RANSAC (RANdom SAmple Consensus) [FB81, Tor95] to find candidate surfaces. We then evaluate the success of each candidate and choose the one with the most support from the patchlet data. We apply the above greedy algorithm, removing patchlets that were used to support previous surfaces.
We continue this until either a user-set maximum number of surfaces is reached or no more sufficiently supported surfaces can be found. We then apply an EM-based surface refinement to all initialised surfaces. The details of these RANSAC- and EM-based methods are discussed in the following sections.

6.3 Surface model

To begin our description of the surface extraction methods, we first explain the surface model and the patchlet-to-surface metrics that we use. The surface model is virtually identical to the patchlet model, as both represent a bounded plane with confidence measures. The difference is that the surface parameters are not obtained from the camera sensor definition but are determined by the aggregate patchlets that make up the surface. The parameters are:

• origin (X_s): a 3D position indicating the centre of the surface
• normal (ψ_s or [θ_s, φ_s]): a normal direction. As with the patchlet, we use ψ_s when referring to the 3-space normal vector, or the (θ_s, φ_s) angle pair when using the polar coordinates illustrated previously in Figure 4.4.
• local coordinate system (T_w^s, T_s^w): a local pair of X-Y axes within the surface plane that define the directions of the size parameters. The 4 × 4 homogeneous transform T_w^s (world-to-surface) and its inverse T_s^w (surface-to-world) define a relationship between the world coordinate system and the surface coordinate system that includes the definition of the local X and Y axes.
• size (S_sx, S_sy): a size parameter in both the local X and Y directions
• positional confidence (Λ_s): represented by a variance
• directional confidence (κ_s): represented by the κ parameter of the Fisher probability distribution (see Section 4.2.3)

Patchlet-to-surface metrics

We will often need to make comparisons between surfaces and patchlets, in order to evaluate how likely it is that a given patchlet is related to a given surface. There are two ways that we do this. The first is to use a Mahalanobis distance between the surface and the patchlet.
In this case, we are evaluating the distance between the patchlet and the surface, both in position and orientation, normalised by a standard deviation.

Figure 6.4: Relationship between surface and patchlets (surface and patchlet normals, pointing and position errors, and their associated uncertainties)

Since the offset and orientation errors are independent (see Section 5.3.4 and Appendix C), the Mahalanobis distance is evaluated as

mahal_dist(S, X)² = offset_dist(S, X)² + angle_dist(S, X)²     (6.1)

where offset_dist(S, X) is the perpendicular distance between the surface and the patchlet normalised by the patchlet variance, and angle_dist(S, X) is the angular difference between the surface normal and the patchlet normal normalised by the equivalent variance of the patchlet Fisher κ parameter given in Equation (5.4) in Chapter 5. Explicitly this is:

offset_dist(S, X)² = (X_s · ψ_s − X_x · ψ_s)² / Λ_x     (6.2)

angle_dist(S, X)² = (κ cos⁻¹(ψ_s · ψ_x))² / (2π)     (6.3)

where X_s and ψ_s are, as defined previously, the surface coordinate system origin and normal, and X_x and ψ_x are the patchlet coordinate origin and normal. This Mahalanobis distance is direct but simplistic, as it does not take into account issues such as the grouping of patchlets associated with surfaces. It is only used in the RANSAC initialisation process; for EM we need a more rigorous comparison. In that case, we evaluate a probability value based on the probability density function (PDF) of a surface generating patchlets, or P(X|S). This can be expressed as

P(X|S) = P_x(X|S) P_ψ(X|S) P_s(X|S)     (6.4)

where P_x is the positional probability model, P_ψ is the orientation or Fisher probability model and P_s is the bounded surface probability model. P_x is either the Gaussian or the Lorentzian model applied to the offset parameters. P_s is a model based on the distance between the patchlet and the bounded surface measured in the surface plane.
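The distance metric of Equations (6.1)-(6.3) can be sketched as follows (an illustrative Python sketch assuming NumPy; the argument names stand in for X_s, ψ_s, X_x, ψ_x, Λ_x and κ_x but the code is our own, not the thesis implementation):

```python
import numpy as np

def mahal_dist_sq(Xs, ns, Xx, nx, Lam_x, kappa_x):
    """Squared patchlet-to-surface Mahalanobis distance: a perpendicular
    offset term normalised by the patchlet variance, plus an angle term
    normalised via the Fisher kappa (Equations (6.1)-(6.3))."""
    # Offset term: difference of projections onto the surface normal
    offset_sq = (Xs @ ns - Xx @ ns) ** 2 / Lam_x
    # Angle term: angle between normals, scaled by kappa, normalised by 2*pi
    angle = np.arccos(np.clip(ns @ nx, -1.0, 1.0))
    angle_sq = (kappa_x * angle) ** 2 / (2.0 * np.pi)
    return offset_sq + angle_sq
```

A patchlet lying exactly on the surface with a matching normal scores zero; the RANSAC initialisation described later accepts patchlets whose (unsquared) distance is within two standard deviations.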
So

P_x(X|S) = (1/√(2πΛ)) e^{−(x−μ)²/(2Λ)}   (Gaussian)

P_x(X|S) = (Γ/2) / (π[(x−μ)² + (Γ/2)²])   (Lorentzian)

where

x = X_x · ψ_s
μ = X_s · ψ_s
Λ = Λ_s + Λ_x

For the Lorentzian model, we do not have a method for combining the patchlet and surface Γ parameters, so we use only the surface Γ. The Fisher PDF, as described in Section 4.2.3, is

P_ψ(X|S) = κ e^{κ(ψ_s · ψ_x)} / (4π sinh κ)

where

κ = κ_s κ_x / (κ_s + κ_x)

Finally, for the bounded surface probability we use the following heuristic:

P_s(X|S) = 1                     if X̄_x is inside S
           1 − d(X̄_x, S)/B      if 0 < d(X̄_x, S) ≤ B
           0                     if d(X̄_x, S) > B

where X̄_x is the perpendicular projection of X_x onto S, d(X̄_x, S) is the distance of X̄_x outside the bounded area of S, and B is a user-specified threshold. With this model, if a patchlet projects to within the surface, no penalty is applied to the overall probability score. If it falls within a threshold outside the surface boundary, a penalty is applied that is linearly proportional to the distance away from the surface, and if the patchlet is beyond the boundary by more than the threshold, it is considered completely unrelated to the surface. This encourages locality of the patchlets assigned to surfaces and is discussed further in Section 6.5.1.

6.4 Initial surface identification

The initial estimates of the location, parameters and size of the surfaces in the scene are based on the RANSAC (RANdom SAmple Consensus) algorithm [Tor95], with some modifications to make the algorithm suitable for our data types. RANSAC was designed to solve the problem of extracting a global transformation on a set of feature points between two images in the presence of outliers. The typical example would be two images of a largely static scene taken from a moving camera. The outliers are present due to either mismatched features between the two images or features associated with an independently moving object. The idea of the algorithm is to randomly select the minimum number of points required to solve for the transformation.
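The bounded-surface heuristic P_s above can be sketched as follows (an illustrative Python sketch; the in-plane distance d is assumed to be computed elsewhere, and the names are our own, not the thesis implementation):

```python
def surface_boundary_prob(d, B):
    """Bounded-surface heuristic P_s: probability 1 for a patchlet whose
    projection lies inside the surface (d == 0), decaying linearly to 0
    as the in-plane distance d outside the boundary grows to the
    user-specified threshold B, and 0 beyond B."""
    if d <= 0.0:
        return 1.0            # projects inside the bounded surface
    if d <= B:
        return 1.0 - d / B    # linear penalty within the threshold band
    return 0.0                # beyond the band: unrelated to this surface
```

The linear ramp is what produces the "locality" pressure on patchlet assignments described above.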
One then solves with this minimal set, and compares the resulting solution to all the points. If one takes sufficiently many random sets, it is very likely that one of the sets will contain no outliers, and its solution will be very good. When applied to all the points, the majority of points will fit quite well and the outliers will be easily detected as the points for which the fit fails.

In our case, our minimum point set is actually "one patchlet", since one patchlet is a full candidate for an unbounded plane. We then perform region growing about that patchlet, including any patchlet that is adjacent to the grown region and whose distance is within a threshold. The region growing method ensures connectivity and thus bounds the surface through image structure constraints. The distance is measured by applying Equation (6.1) between the candidate plane and the potential member patchlet. If the Mahalanobis distance is within two standard deviations (2σ, which contains 95% of the Gaussian PDF), the patchlet is included in the set of patchlets that make up this candidate surface. After a user-specified number of patchlets are included in the surface (in practice, 50), the surface parameters are re-estimated to improve the surface fit to the neighbourhood of patchlets. From that point, the region growing continues until no more patchlets are found to include in the candidate surface. Because the patchlets are organised in a 2D matrix associated with the pixels of the image from which they were derived, this flood-fill process is computationally efficient.

After the aggregation of patchlets is complete, the size of the surface is determined to be the minimum size that includes all member patchlets. The orientation of the X and Y axes of the surface are aligned with the major and minor axes of variance about the surface centroid.

The above process is repeated a number of times to generate a number of single-surface solutions, each based on a random seed patchlet. The surface with the highest score is retained.
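The seed-and-grow step can be sketched as a flood fill over the image grid of patchlets. This is an illustrative skeleton, not the thesis code: `dist_fn` stands in for the Mahalanobis test of Equation (6.1), `refit_fn` for the re-estimation after 50 members, and the 2σ gate appears as a threshold of 4 on the squared distance.

```python
from collections import deque

def grow_surface(patchlets, dist_fn, seed, refit_fn, thresh=4.0, refit_at=50):
    """Region-grow a candidate surface from a seed patchlet.

    patchlets: dict mapping (row, col) -> patchlet (missing = invalid pixel)
    dist_fn(plane, patchlet): squared Mahalanobis distance (Eq. 6.1)
    refit_fn(members): re-estimate the plane from the member patchlets
    thresh: 2-sigma gate => 2**2 = 4 on the squared distance
    """
    plane = patchlets[seed]      # one patchlet fully defines a candidate plane
    members, frontier, seen = [seed], deque([seed]), {seed}
    refitted = False
    while frontier:
        r, c = frontier.popleft()
        for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if nb in seen or patchlets.get(nb) is None:
                continue
            seen.add(nb)
            if dist_fn(plane, patchlets[nb]) <= thresh:
                members.append(nb)
                frontier.append(nb)
                # One-time re-estimation once enough members are gathered.
                if not refitted and len(members) >= refit_at:
                    plane = refit_fn([patchlets[m] for m in members])
                    refitted = True
    return plane, members
```

Because growth only proceeds through adjacent grid cells, the returned member set is connected in the image, which is what bounds the surface through image structure.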
All member patchlets are removed from the data set and the process is repeated until the highest scoring surface does not exceed a preset limit. For scoring, the simplest and most successful method we used was the count of the number of member patchlets. We also tried methods such as the sum of the patchlet-to-surface PDF values from Equation (6.4). This worked equally well, but since there was (a) no appreciable improvement and (b) the units are not in pixels, making it more difficult to set a reasonable threshold value, we used the simpler approach.

The final surface is then "bounded" to determine the additional parameters beyond an unbounded plane. To determine these parameters we do the following:

• find the centroid of all member patchlets within the surface and project this centroid onto the surface for the surface centre position
• compute the moment matrix of covariance of the patchlet positions around the centroid, and determine this moment matrix's eigenvectors and eigenvalues
• set the local surface X axis as the primary eigenvector (or the direction of most spread in the surface plane)
• set the local surface Z axis to the surface normal
• set the local surface Y axis to the cross product of the Z and X axes

To determine the size of the surface we experimented with a few methods. The first is to set the x and y size of the rectangle to completely enclose all patchlets that are associated with the surface. The second is to set the surface to have a total area equal to the sum of the areas of all member patchlets. Both methods worked well for obtaining an initial estimate for surface refinement. The first method tends to slightly over-estimate the surface size and the second tends to slightly under-estimate it. For the experimental results we elected to use the second method.
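The bulleted procedure maps directly onto an eigen-decomposition of the in-plane scatter. Below is a minimal sketch under illustrative names; it assumes the surface normal is already known, and projects the scatter into the surface plane before taking moments so that the primary eigenvector lies in the plane.

```python
import numpy as np

def surface_frame(points, normal):
    """Derive the bounded surface's local (X, Y, Z) frame from member
    patchlet positions (sketch of the bulleted procedure above)."""
    normal = normal / np.linalg.norm(normal)
    centroid = points.mean(axis=0)
    # Remove the out-of-plane component of the scatter before taking
    # the moment matrix, so eigenvectors lie in the surface plane.
    d = points - centroid
    d_in_plane = d - np.outer(d @ normal, normal)
    cov = d_in_plane.T @ d_in_plane / len(points)
    evals, evecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    x_axis = evecs[:, -1]                    # direction of most spread
    z_axis = normal                          # surface normal
    y_axis = np.cross(z_axis, x_axis)        # completes the frame
    return centroid, x_axis, y_axis, z_axis
```

For a set of patchlet positions spread mostly along one in-plane direction, the recovered X axis aligns with that direction and stays orthogonal to the normal.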
6.4.1 RANSAC extracted surface results

To illustrate the algorithm, we present in this section the results of the RANSAC-based initial surface estimation, with figures for each step of the process. Figure 6.5 shows the RANSAC surface estimation performed on the synthetic data described previously in Figure 5.4. Since this data is "perfect", we get a very consistent result. The segmentation was done with a priori surface standard deviations of 2cm and 7.5 degrees. The algorithm proceeds iteratively. In each iteration the best surface is located (shown in black) and removed from the data set. Figure 6.5 (f) shows the final segmentation of the patchlet data into 5 surfaces, with black representing the undecided patchlets that are not clearly associated with one surface.

Of more interest is to see how this method performs on real stereo data from a complex scene. We repeat the information shown in Figure 6.5 on real data and the results are shown in Figure 6.6.

[Figure 6.5: RANSAC-based greedy surface segmentation on synthetic data - black indicates pixels associated with the surface, grey are valid pixels, white are invalid pixels or pixels that have already been assigned to a different surface]

This figure uses the same stereo data as presented in the previous chapter in Figure 5.9. The RANSAC algorithm runs to completion and identifies the presence of 10 surfaces in the model. Each surface identified is shown in Figure 6.6 (a) through (j). We can see that the method begins by extracting the larger surfaces and then proceeds to the smaller ones as the data available becomes progressively more sparse. One can see in Figure 6.6 (b) and (d) that occasionally the surfaces extend into regions which ideally would not be included in that surface. There are strips of data on the vertical concrete border at the far end of the stairs.
These patchlets are included in the surface because the patchlets on this border, which are quite far from the camera, have very low confidence values. The low confidence value on the patchlets' orientation allows these patchlets to be incorporated into a surface, even though the orientations do not match very well. However, on the whole the method works. Figure 6.7 (a) shows the rectified source image and (b) shows the segmented image. Figure 6.8 shows the same clustering in a 3D view from a different viewpoint. The segmented image is created by taking the full set of surfaces and computing, for each patchlet, which surface it is most likely associated with. The "outlier class" described in Section 6.5 below is also used, which is why not every valid patchlet has an associated surface.

6.5 Surface refinement with EM

The RANSAC surface extraction method described in Section 6.4 provides a reasonable method for both model selection and model initialisation. However, to refine the solution to ensure we have obtained an optimal surface estimation and assignment, we have developed a surface refinement method based on Expectation-Maximisation (EM), which was described in Section 4.5.

When solving a problem with EM, it is important to first fully express the probability that one means to maximise. We begin by constructing a graphical model to demonstrate the associations between the different variables in the model. This is shown in Figure 6.9. In this graphical representation, the symbols represent:

• θ_j - the parameters of the model for surface j
• x_i - the positional parameters for data patchlet i
• ε_i - the variance characteristics for patchlet i; this represents the noise process of sensing the patchlet and is its measurement accuracy
• z_i - the data association parameter; z_i = j indicates that patchlet i originated from surface j
• Π - a vector of variables {π_1, ..., π_M} that indicates the likelihood of each model, with Σ_j π_j = 1. A larger surface that has more surface area, and therefore more patchlets associated with it, will have a larger π value.

[Figure 6.6: RANSAC-based greedy surface segmentation - black indicates pixels associated with the surface, grey are valid pixels, white are invalid pixels or pixels that have already been assigned to a different surface]

[Figure 6.7: Full initial segmentation from greedy surface associations as shown in Figure 6.6 - (a) rectified image, (b) segmented image]

[Figure 6.8: A 3D view of the segmented patchlets in Figure 6.7 seen from a different viewpoint]

[Figure 6.10: Simplification of graphical model]

Figure 6.10 shows a more compact version of the full graphical model that is more commonly used, but it does not illustrate the relationships as completely. These graphical models show the relationships between the models and the data. The data association variable z_i is generated from the model likelihoods π_j. The patchlet data points x_i are generated from the data association variables z_i and the associated model parameters θ_{z_i}, and are corrupted by sensor noise ε_i. This model is very similar to the standard mixture-model graph shown in Figure 4.5, but it has the addition of the measurement error terms ε_i. The standard graph omits these terms because, in basic problems of this kind, one does not have explicit information about the measurement error for each data point. In our formulation we do have that information and should use it accordingly.
Following the application of EM to mixture models as described in Section 4.6, the quantity we are trying to maximise is

log P(X|Θ) ≥ Σ_{i=1}^{N} Σ_{j=1}^{M} ξ_ij (log P(x_i|θ_j) + log π_j)    (6.5)

where P(x_i|θ_j) is given in Equation (6.4). The "E" and "M" steps for this model, without reference to the surface boundary, are given in Appendix A.4. A.4.1 gives the derivation for the system without considering the patchlet confidence measures ε_i; A.4.2 gives the derivation including them. In both cases the update for the surface offset parameter has a closed-form solution, while the surface normal requires a numerical minimisation. The numerical minimisation selected was Downhill Simplex [PTVF92], which is a robust and very general method, although not exceptionally efficient.

Using EM naively does not necessarily result in an improvement to the surface estimation and can lead to quite poor results. The most important issue that can have a strong effect on the result is that of outliers in the sensor data. EM applied to a mixture model makes the assumption that all sensed data was generated from one of the present models. This simplified view of the world causes trouble because outliers will inevitably be assigned to some model, regardless of the low quality of fit. Since the weights of the EM are normalised per data element, even if the natural PDF result was a very low attraction to all models in the mixture, the normalisation will often make the patchlet strongly associated with one of the models.

A common solution to the outlier problem is to introduce an "outlier class". An example of this is the work done by Liu et al. [LEC+01] (although they describe it as a "phantom" rather than an "outlier" class). This is a uniform distribution that covers the entire patchlet parameter range with a constant but small attraction or likelihood value. Patchlets that fit extremely poorly to all members of the mixture model will be more attracted to the outlier class.
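The per-patchlet normalisation and its outlier-class fix are easiest to see in the E-step. The sketch below appends a uniform outlier component (the text quotes a model likelihood of 5% and a constant density of 0.05) before normalising; the function name and array layout are illustrative, not the thesis code.

```python
import numpy as np

def e_step(likelihoods, pis, outlier_pi=0.05, outlier_pdf=0.05):
    """E-step responsibilities for the surface mixture, augmented with a
    uniform "outlier class" (illustrative sketch).

    likelihoods: (N, M) array, entry [i, j] = P(x_i | theta_j) from Eq. (6.4)
    pis: (M,) model likelihoods pi_j
    Returns an (N, M+1) array; the last column is the outlier class.
    """
    n = likelihoods.shape[0]
    # Constant attraction of the outlier class for every patchlet.
    outlier_col = np.full((n, 1), outlier_pi * outlier_pdf)
    weighted = np.hstack([likelihoods * pis, outlier_col])
    # Per-patchlet normalisation: each row sums to one.
    return weighted / weighted.sum(axis=1, keepdims=True)
```

A patchlet that fits one surface well gets nearly all of its weight there; a patchlet with vanishing likelihood under every surface is captured by the outlier column instead of being forced onto the least-bad surface.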
This prunes poorly fitting elements from the data set and prevents them from distorting the final fit. In our experiments we chose an outlier class that had a model likelihood of 5% (that is, it should account for roughly 5% of the patchlet data) and a uniform attraction of 0.05.

6.5.1 Surface bounding

As discussed earlier, it was important for the success of the presented methods to have the surfaces bounded. The reason was to avoid the problem of associating unrelated patchlets to unbounded planes. An unbounded plane inevitably passes near some data that is not actually related to its primary basis of support. These data become incorrectly associated with the surface and the parameter estimation is corrupted.

Surface bounds were updated after each "M" step of the EM refinement. The overall surface area was set as the combined projective area of all patchlets associated with the surface. The area contribution of a patchlet to each surface was weighted by the probability that the patchlet belongs to that surface, or

A_j = Σ_{i=1}^{N} ξ_ij a_i

where A_j is the area of surface j and a_i is the area of patchlet i. In this respect, the total area of all surfaces in the mixture model was constrained to be equal to or less than the combined patchlet area. The origin of the surface was set to the centroid of the weighted patchlet positions.

Since the normal of the surface was updated during the "M" step, the only thing remaining is to determine the X and Y axes of the surface and the associated size parameters s_x and s_y. This was done in a three-parameter gradient descent search, the parameters being {s_x, s_y, φ}, where φ is a rotation parameter in the plane of the surface. The evaluation function for this search was the number of patchlet positions that fall within the rectangle, weighted by their data association probability ξ_ij. This improved the surface boundaries significantly over the RANSAC principal component analysis method of determining the primary axes of the surface.
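The two boundary updates above - the responsibility-weighted area A_j = Σ_i ξ_ij a_i and the evaluation function for the {s_x, s_y, φ} search - can be sketched as follows. Names are illustrative; `uv` is assumed to hold the patchlets' in-plane coordinates relative to the surface origin.

```python
import numpy as np

def surface_area(resp_j, patchlet_areas):
    """A_j = sum_i xi_ij * a_i: responsibility-weighted patchlet area."""
    return float(np.dot(resp_j, patchlet_areas))

def rect_score(uv, resp_j, sx, sy, phi):
    """Evaluation for the {s_x, s_y, phi} bound search: the
    responsibility-weighted count of patchlets whose in-plane coordinates
    fall inside the rotated sx-by-sy rectangle centred on the origin."""
    c, s = np.cos(phi), np.sin(phi)
    # Rotate in-plane coordinates into the candidate rectangle's frame.
    u = c * uv[:, 0] + s * uv[:, 1]
    v = -s * uv[:, 0] + c * uv[:, 1]
    inside = (np.abs(u) <= sx / 2) & (np.abs(v) <= sy / 2)
    return float(resp_j[inside].sum())
```

A gradient descent (or even a coarse grid search) over (sx, sy, phi) then picks the rectangle maximising `rect_score`, which is the role the text assigns to the three-parameter search.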
6.6 Example results

Figure 6.11 shows an example taken from Section 7.6 with the results of the RANSAC-based algorithm and the EM refinement. The white lines indicate the surface boundaries rendered into image space. The greyscale values in Figure 6.11 (c) and (d) indicate the most likely data association for the patchlet association parameter. In this example one can see that the data association along the boundaries of the surfaces has been improved by the EM refinement. In particular, the surface boundaries have been made significantly more accurate by the gradient descent search on the bounding parameter space.

[Figure 6.11: Example surface extraction - (c) RANSAC-based surface estimation, (d) after EM refinement]

In general the EM refinement only has to run a few iterations (five or fewer) before it converges to the optimal value, as the RANSAC initialisation has already seeded the initial guesses close to the local optimum.

6.6.1 Future model selection strategies

In the algorithm described so far, we have relied on the RANSAC initialisation method to determine the model selection solution. While this produces a satisfactory result, it relies on the image structure and, looking ahead to when we will be combining results from multiple images, this will be a problem.

One possible avenue to solving the model selection problem would be to use data-driven methods that attempt to maximise the information gain while minimising the model complexity, such as Minimum Description Length (MDL) [Ris78], the Akaike Information Criterion (AIC) [Aka74] or the Bayesian Information Criterion (BIC) [Sch78]. In general, these methods attempt to determine the minimum number of classes required to achieve the maximum separation of information based on information-gain theory. However, in practice the clusters we are investigating are so close in the standard plane space that it is very difficult to separate them.
The separation really needs the locality provided by methods like the RANSAC approach.

Our plan, then, for combining multiple images is to use RANSAC for the model selection of the first image data set. When subsequent data sets are attached to the current body, we intend to use the current surface estimates and EM refinement to expand the current surface mixture into the new data. After this is complete, any data that remains in the new image set can be subjected to a RANSAC run to determine whether any new surfaces exist in the unfitted data. The idea is to use the RANSAC model selection approach to initially bootstrap the system, and then to use it on any "outlier class" data that remains with each added image.

6.7 Stability over noise

To test the sensitivity of the surface segmentation under varying noise conditions, we performed the full algorithm (RANSAC followed by EM) on the synthetic data presented in Figure 5.4. This time we introduced artificial Gaussian noise to both the pointing and matching errors with equal standard deviations. We created patchlets from the resultant noisy disparity images and proceeded to segment the surfaces. The results are given in Figure 6.12. The pixels are coloured according to the surface segmentation.

[Figure 6.12: Surface segmentation on synthetic images with noise - (a) σ = 0.05 pixels, (b) σ = 0.1 pixels, (c) σ = 0.2 pixels, (d) σ = 0.4 pixels]

As described in Section 5.4.1, the far surface of this synthetic image is actually 5m away from the camera. Therefore, although it looks like an equal-sided box, in fact the camera is looking down a long corridor. The far surface is, therefore, affected much more by this noise than the sides of the corridor that are closer to the virtual camera. One can see that the method works successfully in all but the most extreme case. In Figure 6.12(d) the noise introduced is 0.4 pixels in both pointing and matching errors.
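The range dependence of stereo noise follows from the inverse relation between disparity and depth: z = fB/d, so a fixed disparity error produces a depth error that grows roughly quadratically with range. A sketch, using an illustrative focal-length-times-baseline product of 25 metre-pixels (an assumed value, not the thesis rig's):

```python
def depth_from_disparity(d, fb):
    """Pin-hole stereo depth: z = f*B / d, with f*B in metre-pixels."""
    return fb / d

def depth_error(z, delta_d, fb):
    """Depth shift caused by a disparity error of delta_d pixels at range z.
    To first order this grows as z**2 * delta_d / (f*B)."""
    d = fb / z                            # true disparity at range z
    return depth_from_disparity(d - delta_d, fb) - z

# With the assumed f*B = 25 metre-pixels, a 5 m point has disparity 5 px,
# so a 0.1 px disparity error shifts it by roughly a decimetre.
FB = 25.0
```

Note the asymmetry of the exact relation: subtracting a fixed disparity error moves the point further than adding it, which is why sub-pixel errors at low disparities are so damaging.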
This leads to an extremely noisy surface; the patchlet data was very chaotic, and consequently no surface could be detected on the far wall. For comparison, an error of ±0.1 pixels at 5m yields a positional error of ±10cm, while an error of ±0.4 pixels yields a positional error of ±37cm.

It should be mentioned that stereo error is not truly "salt and pepper" noise. Although it is zero mean and Gaussian distributed, it does not fluctuate independently pixel-by-pixel but appears to vary more smoothly from region to region. This is due to issues that affect regions rather than individual pixels, such as errors in the lens distortion removal. Consequently, the local surfaces will be more regular with real data of a given error characteristic than they will be for synthetic data of this kind. But even though the noise is quite extreme, the sides of the corridor are still successfully extracted.

Chapter 7

Experimental Results and Discussion

7.1 Experimental equipment

In this chapter we present:

• our experimental verification of the correlation algorithm accuracy
• a verification of the improvements to correlation accuracy when applying Shimizu and Okutomi's cancellation function (see Section 3.3.2)
• a regression test that verifies the patchlet confidence parameters are within the expected statistical bounds
• several test image sets and the resultant patchlets
• surface extraction results on the test image sets
• a discussion of computation requirements
• a discussion of the use of outlier classes vs. Lorentzian models
• insights on the role of surface and patchlet scale in the process

The stereo data was obtained with a Point Grey Research Digiclops 4.0mm focal length trinocular stereo rig, shown in Figure 7.1. Raw images were taken at 640 x 480 pixel resolution and rectified to 320 x 240 resolution.
The rectification process remaps the raw images to fit a pin-hole camera model image that is aligned row-by-row and column-by-column with the epipolar lines of the stereo system.

[Figure 7.1: Digiclops trinocular stereo rig]

Figure 7.2 shows an example of the results of the rectification process. Figure 7.2(a) shows the reference camera image as it is delivered by a camera with high lens distortion. Figure 7.2(b) shows the same image after rectification. Notice that lines that are actually straight in the real scene, such as the door frames, can appear curved in the unrectified image. This is due to the "barrel" distortion of a wide angle camera lens. In the rectified image, all the lines that are straight in the world appear straight in the image, which matches the projective transformation of a pin-hole camera.¹

[Figure 7.2: Example of rectification - (a) unrectified, (b) rectified]

¹The images in Figure 7.2 were taken with a 2.0mm Digiclops camera, a wider angle Digiclops than was used for the test images in the rest of the results section. The 2.0mm Digiclops images were used for this figure as the lens distortions are more apparent and so better illustrate the issue.

7.2 Analysis of correlation accuracy

In Section 3.3.3 we developed the methods for propagating sensor errors to 3D covariance matrices that represent the certainty of 3D points extracted from the disparity pixels. Aside from the disparity images, this process requires two sensor-based error values: the pointing error and the matching error. The pointing error comes from the accuracy of the calibration of the stereo rig. Point Grey Research provides information on the accuracy of the calibration of each of their camera systems; for our system the calibration accuracy was an RMS error of 0.08 pixels with a maximum error of 0.30 pixels.
These errors are for the system generating rectified 640 x 480 pixel images; we produced our results at one-half that resolution, so the pointing error of our system has a standard deviation of 0.04 pixels. The matching error, however, is a measurement of how well the correlation algorithm can match two image regions, and more specifically how accurate the sub-pixel interpolation of that process is. This information is not provided by Point Grey Research, so we had to determine this accuracy ourselves. The method used to determine this error follows in the remainder of this section; the error was determined, with sub-pixel interpolation bias correction (see Section 3.3.2), to be 0.05 pixels for our 320 x 240 stereo correlation test images.

[Figure 7.3: Correlation calibration image - (a) unrectified planar surface, (b) disparity for planar surface]

To determine this matching or correlation accuracy we made use of the error model developed in Section 3.3.3, but set the matching error as an unknown parameter. We took an image of a known planar surface with good texture; in this case, an office floor covered by carpet, as shown in Figure 7.3. As one can see from Figure 7.3(b), the well-textured carpet works as an ideal pattern for stereo. The right and bottom edges of the image are noise due to the lack of overlap between the reference-top and the reference-left camera pairs. We trimmed the right and bottom noise regions from the image and retained the centre, left and top portions of the disparity image for testing.

We then processed this disparity image for several sample settings of matching error in the following manner:

1. given the provided matching error, generate uncertain points as described in Section 3.4
2. solve for a best-fit plane to the uncertain points - a plane that minimises the Mahalanobis distances between the points and the plane
3.
histogram the square root of the Mahalanobis distances between the points and the best-fit plane

Because a Mahalanobis distance is normalised by the variance of each uncertain point, this distance is unitless, but all of the distances are on the same scale. The square root of a Mahalanobis distance is effectively in units of "standard deviation". If our provided matching error correctly fits the correlation accuracy obtained in the disparity image, our histogrammed distances should fit a unit Gaussian. That is to say, 67% of the data points should fall within one standard deviation of the plane, 95% of the data points should fall within two standard deviations of the plane, and so on.

[Figure 7.4: Experimental comparison of correlation accuracy with unit Gaussian - (a) biased correlation accuracy, (b) corrected correlation accuracy]

We performed this processing on a series of correlation accuracies from 0.03 to 0.2 pixels. The values and curves matching the unit Gaussian best are shown in Figure 7.4. Figure 7.4(a) shows the results for standard sub-pixel interpolation stereo.² The curves clearly match the standard Gaussian in form, and the matching error that fits the unit Gaussian is just above 0.1 pixels, approximately 0.12 pixels. What is more interesting, however, is the evidence of the improvement in sub-pixel interpolation when using the sub-pixel bias correction described in Section 3.3.2. With this correction, a matching error of 0.05 pixels almost exactly matches the unit Gaussian. Since the bias-corrected sub-pixel interpolation values were so dramatically more accurate than the uncorrected values, we used bias-corrected sub-pixel interpolation for the rest of our experiments. The matching and pointing errors for uncertain point generation were therefore set to 0.05 and 0.04 pixels respectively.
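The logic of this check can be simulated directly: if each point's residual really is Gaussian with its own predicted sigma, the per-point normalised distances behave as a unit Gaussian (about 68% within one sigma and about 95% within two; the text rounds the former to 67%). A sketch with synthetic residuals, under illustrative names:

```python
import numpy as np

def fraction_within(bound, normalised_dists):
    """Fraction of |distance|/sigma values within the given bound."""
    return float(np.mean(np.abs(normalised_dists) <= bound))

def check_unit_gaussian(point_plane_dists, point_sigmas):
    """Sketch of the calibration check: normalise each point-to-plane
    residual by that point's predicted sigma, then measure the mass
    within one and two standard deviations."""
    norm = point_plane_dists / point_sigmas
    return fraction_within(1.0, norm), fraction_within(2.0, norm)

# Synthetic stand-in for the carpet data: heteroscedastic residuals whose
# per-point sigmas are known exactly, so the check should pass.
rng = np.random.default_rng(0)
sigmas = rng.uniform(0.5, 2.0, 100_000)   # per-point uncertainties
dists = rng.normal(0.0, sigmas)           # plane residuals
f1, f2 = check_unit_gaussian(dists, sigmas)
```

If the assumed matching error were too small, the predicted sigmas would shrink, the normalised distances would spread, and both fractions would fall below the unit-Gaussian values - which is exactly how the sweep over candidate matching errors locates the right one.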
²This sub-pixel interpolation is provided by the Point Grey Research libraries and uses the linear "V" sub-pixel interpolation approach, as opposed to the quadratic-fit sub-pixel interpolation.

7.3 Patchlet parameter regression test

After setting the pointing and matching errors for our stereo rig, we wanted to verify that the patchlet confidence metrics do, in fact, represent the reality of the situation. To test this, we applied a similar test to that given in the previous section. Taking a picture of our planar carpet, we calculated the disparity image, uncertain points and corresponding patchlets. We took the best-fit plane (the plane that best fits the uncertain points) and compared our patchlet values to the surface parameters. Again, what we expect is that 67% of the generated patchlets are within 1 confidence measure of the surface parameters, that 95% are within 2 measures, and so on.

Figure 7.5 shows the result of this test. The dashed line again represents the unit Gaussian curve. The solid line is the histogram of normalised distances of patchlets to the best-fit surface. Although the histogram curve has slightly heavier tails than the Gaussian, the accuracy of the match is surprising. From this we concluded that the confidence metrics of patchlets do, in fact, accurately represent the accuracy of the patchlet parameters.

[Figure 7.5: Experimental verification of patchlet confidence metrics]

7.4 Image test sets

The remainder of the results presented in this chapter are based on a set of 12 test images taken on the grounds of UBC near the CICSR building. The rectified images of these test sets are shown in Figures 7.6 and 7.7. The first eight image sets are of man-made scenes made up almost entirely of planar surfaces. These scenes are the cases where we expect very good results due to the structure of the scene. There are some challenges, however.
In Images 2, 3 and 4, there are some regions in the scene that are quite distant from the camera, at ranges where we can expect considerable noise in the stereo results. As well, Images 3 and 4 show a very complex scene with many surfaces, including many very similar surfaces: the steps have the same orientation and are offset by only a few centimetres, making them very close objects in our clustering space. Image 7, the farthest-range "brick corner" scene, also has some regions that are at a distance, as well as a non-rectangular ground plane. Image 8 was added as a scene to determine how much structure could be extracted at long distance.

The remaining images are used to see how the methods perform on scenes that are not made up exclusively of planar surfaces. Images 9 and 10 have a conical mound of dirt that is not planar. Image 12 has a statue with many curves, and is also at some distance from the camera. Image 11 is a scene where there are no objects near the camera, and there are some very irregular objects in the scene - specifically two walkers and a tree.

[Figure 7.6: Images 1-6 - (c) Image 3: front steps, (d) Image 4: front steps, (e) Image 5: brick corner, (f) Image 6: brick corner]

[Figure 7.7: Images 7-12 - (k) Image 11: walkers, (l) Image 12: statue]

7.5 Patchlet construction

In this section we show the image test sets converted to stereo disparity images and to patchlet sets. Patchlets throughout this section were created with a 5 x 5 patchlet neighbourhood mask.

Figure 7.8 is an image of the planar carpet again. This time, however, we have displayed images that encode the confidence metrics of each patchlet generated. Figure 7.8(c) shows the patchlet positional variance mapped to greyscale, while Figure 7.8(d) shows a similar display of the normal confidence parameter κ.
The positional confidence is mapped relative to the patchlet positional standard deviation (not the variance) from 0 to 3mm, where 0 maps to white and standard deviations of 3mm and higher map to black. We can see a fairly smooth increase in confidence as we move from the top of the image (farthest from the camera) towards the bottom of the image. The κ image is mapped between κ = 3 and 40, with black representing κ values of 3 or less. We can see some increase in confidence in this image, but it is primarily dominated by the slight wobbles in the surface. As discussed in Section 5.4.1 and illustrated in Figure 5.8, the relative orientation between the camera and the surface has a strong impact on the confidence evaluation. We found that the surface orientation had a much stronger impact on the orientation confidence than the distance from the camera. The structured variations in surface orientation so apparent in the κ image are caused by calibration and rectification artifacts, which are systematic.

The confidence metric images were deemed not to be of as much interest to the reader as the other images we will show, such as rendered 3D patchlets and extracted surfaces, and so they are not included in the results chapter; they are, however, included in Appendix D.

The remainder of this section is taken up with figures showing the results of patchlet creation from the stereo test image sets presented in Section 7.4. In each figure there are 4 images:

(a) Rectified image. This image shows the rectified image of the reference camera from the image test set.

(b) Disparity image. This image shows the stereo disparity image of the image test set. Black pixels indicate pixels that are marked invalid by the stereo algorithm. Each disparity image has a different range of valid disparity and is scaled in that range for viewing. The disparity images are only represented at the integer disparity level, not at the sub-pixel level, which is why some false contouring is visible.
[Figure 7.8: Carpet image confidence measures - (c) positional variance, (d) Fisher κ]

(c) 3D rendered patchlets: coloured. This image shows the generated patchlets rendered in 3D from a novel viewpoint. In each image the viewpoint has been altered by moving the virtual camera to the right and lifting it somewhat. The amount of this translation and rotation varies from data set to data set; it was selected by hand to give the viewer a good impression of the patchlet set. Each patchlet is coloured by the greyscale value from its corresponding rectified image pixel.

(d) 3D rendered patchlets: shaded. Texture-mapped 3D models have a tendency to look good even when their geometry leaves something to be desired. This image has the patchlets coloured white and flat shaded to allow the viewer a more impartial representation of the 3D structure obtained by the patchlet determination.

We also present comments on each individual image set result. As one can see, the 3D reconstruction is significantly better when the scene is closer to the camera. However, the goal of this research is to investigate using low-resolution as well as high-resolution data. Many of the images here were taken with low disparity ranges and maximum (closest) disparities of less than 15. To contrast this, the image of the woman's face shown in the Introduction had a maximum disparity of 64.

[Figure 7.9: Image 1: steps close-up - (a) rectified image, (b) disparity image, (c) 3D rendered patchlets: coloured, (d) 3D rendered patchlets: shaded]

This image was taken at relatively close range. Points range from about 75cm from the camera to 125cm. The effective range of disparities is only 12, from 21 to 33. There are some invalid disparity pixels caused by saturation of the CCD and the consequent loss of texture in that image region. These data holes are propagated to the patchlet set.
Figure 7.10: Image 2: front steps

This image had an active disparity range of 20, from disparity 6 at the doors in the background to 26 at the front step. It also suffered some loss of data from saturated CCD values. The patchlet set has missed the top two horizontal surfaces, as these surfaces are too oblique to the camera. One can see that the lower steps, which occupy proportionately more of the image and are closer to the camera, have a much cleaner patchlet construction than the top step and the doors in the background.

Figure 7.11: Image 3: front steps - side view

This image had an active disparity range of 21, from 3 to 24. However, the range is more interesting because there are surfaces that are visible as they transition through disparity space. In the shaded patchlet image some oscillations can be seen on the distant surfaces. These are the remnants of the sub-pixel bias discussed in Chapter 3, which the Shimizu cancellation function could not completely eradicate. One can note that the step edges in this image are more rounded than in the previous two, due to the greater distances and consequently greater smoothing by the stereo and patchlet masks.

Figure 7.12: Image 4: front steps - side view - far

This image has only 12 active disparity values, from 1 to 13. It is taken at a greater distance than the previous front-step images and this is apparent in the patchlet reconstruction. The oscillation and smoothing effects are more pronounced, as they increase with lower disparity values. What is significant is that the top two steps almost disappear into a single surface due to the smoothing effects. This is primarily caused by the correlation stereo, not the patchlet creation.
Figure 7.13: Image 5: brick corner - close-up

This image has 23 active disparity values, ranging from 23 to 46. Due to the high texture of the brick, the disparity image is very dense. However, as can be seen in the shaded image, the regular brick pattern does cause some systematic errors in the stereo.

Figure 7.14: Image 6: brick corner

This image has only 11 active disparity values, from 15 to 26. The results are very similar to the previous image, except that the slightly higher stereo errors are apparent in the shaded image.

Figure 7.15: Image 7: brick corner - far

Because this entire scene is quite distant, the active disparity range is fairly small, only from 5 to 13. As can be seen from the patchlet view, there is a large region on the oblique wall where patchlets could not be obtained due to invalid disparity values. As well, the most distant wall has considerable noise. Nevertheless, the overall structure of the scene is apparent through the patchlets.

Figure 7.16: Image 8: courtyard

The courtyard image was taken at a very long range to test the utility of extracting distant structure using patchlets. The active disparities range from 0 to 12. The patchlet view illustrates the fact that the 3D reconstruction degrades with distance.

Figure 7.17: Image 9: dirt pile - low

This image was taken of a non-man-made shape to experiment with non-planar surfaces. The active disparity ranges from 4 to 19.
Figure 7.18: Image 10: dirt pile - high

This image was taken of a non-man-made shape to experiment with non-planar surfaces. The active disparity ranges from 0 to 14. Some portions of this image were much too far away to attempt patchlet construction (i.e., the building viewed in the distance).

Another image taken of non-man-made objects. The active disparity range here was very low, only 1 through 9. This limited disparity range severely impacts the quality of the reconstruction.

Figure 7.20: Image 12: statue

This image had an active disparity range of 1 through 14.

7.6 Surface extraction

In this section we present the results of the RANSAC- and EM-based segmentation on the 12 image test sets. The results are shown by colour coding the pixels by the surfaces they are associated with and by rendering the rectangular surfaces into the image space. These results were all produced with Gaussian positional models and with outlier classes as described in Section 6.5. The surface accuracy parameters were provided a priori and, unless otherwise specified, had an accuracy of 2cm positional standard deviation and 5 degrees directional standard deviation. The model selection was solved by the RANSAC-based algorithm (see Section 7.6.1 for results specific to the model selection problem). In general the EM methods took fewer than 10 iterations, as the RANSAC initialisations were very close to the optimal result. The segmentation is colour coded; the outlier-class patchlets are the darkest class and also do not have an associated rendered rectangle. In general, the EM classification has cleaner separation between surface classes and more accurate surface boundaries.
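To illustrate how the outlier class enters the EM classification, the E-step can be sketched as follows. This is a deliberately simplified, position-only version: the actual method also scores patchlet orientation against a Fisher distribution and uses the per-patchlet confidence metrics, and the plane representation and flat outlier likelihood value here are assumptions of the sketch.

```python
# Minimal E-step sketch: each patchlet position is scored against every plane
# under a Gaussian point-to-plane residual, plus a constant-likelihood
# "outlier" class. Position only; the full method also weights orientation.
import numpy as np

def responsibilities(points, planes, sigma=0.02, outlier_density=0.5):
    """points: Nx3; planes: list of (unit normal n, offset d) with n.p + d = 0.
    Returns an Nx(K+1) array of soft assignments (last column = outlier class)."""
    lik = []
    for n, d in planes:
        r = points @ n + d                             # signed point-to-plane distance
        lik.append(np.exp(-0.5 * (r / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi)))
    lik.append(np.full(len(points), outlier_density))  # flat outlier likelihood
    L = np.column_stack(lik)
    return L / L.sum(axis=1, keepdims=True)            # normalise per patchlet
```

Because the Gaussian falls off so quickly, a patchlet far from every plane receives essentially all of its responsibility from the outlier class, which is the behaviour exploited in the segmentations shown below.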
Figure 7.22: Image 2 surface extraction: (a) rectified image, (b) RANSAC-based surface estimation, (c) after EM refinement

The doors in the back have all been picked up by the outlier class in this image set. Note that the invalid pixels limited the extent of the RANSAC segmentation on the second step, while this has been correctly fixed in the EM-based segmentation. Some settings of the surface confidence parameters caused successful extraction of the doors in the back; for more comment on this see Section 7.10.

Figure 7.23: Image 3 surface extraction

This scene ended with 8 selected planes. Some settings of the surface confidence parameters caused successful extraction of the concrete balustrade; for more comment on this see Section 7.10. In this result some of the surfaces had a tendency to bleed into the balustrade region. This is in part due to the weak confidence parameters for the balustrade: since it is quite far from the camera it has weaker confidence values and consequently more freedom to join surface classes.

Figure 7.24: Image 4 surface extraction

As seen in Figure 7.12, the top steps of this image were excessively smoothed in the stereo processing. Consequently, the individual steps have largely disappeared and the top few steps appear more as a ramp than as separate surfaces. This is reflected in the surface fitting result: a ramp-like surface fits to all but the bottom few steps.
Figure 7.25: Image 5 surface extraction

Figure 7.26: Image 6 surface extraction

Figure 7.27: Image 7 surface extraction

Although from the patchlet result we could see that the noise on the back wall was quite extensive, the surface segmentation still successfully identifies all the major planar surfaces in the scene with fairly good accuracy. The user-specified surface parameters were 5cm positional and 10 degree directional standard deviations.

Figure 7.28: Image 8 surface extraction

Although the distant objects in this scene, such as the pillar, are quite far from the camera, the method can still extract the dominant features. The pillar is 8m from the camera. The user-specified surface parameters were 10cm positional and 12 degree directional standard deviations.

Figure 7.29: Image 9 surface extraction

Figure 7.30: Image 10 surface extraction

As is to be expected when fitting planar surfaces to a curved surface, the planar positions are not necessarily stable between the RANSAC- and EM-based algorithms. However, both solutions are reasonable approximations to the curved surface. The method works quite well as long as the curvature is not too sharp.

Figure 7.31: Image 11 surface extraction

This image is a difficult one and was processed with standard deviations of 5cm and 10 degrees.
The farther walker is not identified by the RANSAC initialisation algorithm as she is too small and the patchlets vary too much. However, surfaces are identified in the tree and the nearer walker, as well as the ground plane.

Figure 7.32: Image 12 surface extraction

The planar surfaces in the scene are detected well: the ground plane and the major surfaces on the pedestal. The statue is a non-planar object and the planes are not necessarily stable, yet the fit is a reasonable approximation given the algorithm's constraints.

7.6.1 Example of model selection process

To illustrate the model selection process, we present the full results for Image 3, the most complicated scene with the highest number of models. Figure 7.33 shows the process going from 3 models to the stopping point of 8 models. In each iteration the RANSAC- and EM-based algorithms were run, with the results from the previous iteration being used to initialise the next attempt with one extra model. The RANSAC approach thus only has to identify one additional model per iteration, and the EM approach converges quickly as most of the models are already at their optimal values.

Figure 7.33: (a) 3 models, (b) 4 models

7.7 Ground truth comparisons

To test the validity of the planar surface segmentation, we performed a comparison of the EM-segmented planes against a manually determined ground-truth segmentation. The 3 staircase images, Images 1, 2 and 3, were each classified by hand into their major surfaces and compared to the results shown in Section 7.6. The results of this comparison are presented in this section. In Figures 7.34, 7.36 and 7.38, we show the rectified image in sub-figure (a), the ground-truth image in sub-figure (b) and the EM-based segmentation result in sub-figure (c).
In the "ground truth" image we display the borders of each segmentation surface overlaid onto the rectified image to make these images easier for the reader to interpret. Figures 7.35, 7.37, 7.39 and 7.40 show visual comparisons of the segmentation results between the ground-truth and EM-based results. The white pixels indicate the locations where the EM and ground-truth segmentations agree. The black regions indicate where the EM segmentation does not match the ground truth. The dark grey pixels indicate the remainder of the ground-truth region for each surface.

It is understandable that the EM segmentation will not completely extract the ground-truth region. There are occasional invalid pixels within the stereo disparity image, and there are regions along the right and bottom edges of the disparity image where there is insufficient overlap between the camera pairs to extract meaningful disparities. Consequently we count only the false positives, that is, the pixels that the EM segmentation claimed were part of a surface but that the ground truth denied. The percentages of the EM-segmented pixels that match the ground truth are given in Tables 7.1, 7.2 and 7.3.

Surface   Size (pixels)   % correct
1         13224           98.9%
2         8454            96.2%
3         10824           91.9%
4         8729            92.2%
5         4336            90.5%
6         6480            88.5%
Average                   93.0%

Table 7.1: Image 1: summary of segmentation accuracy

The segmentation of Image 1 successfully identified the same number of models as the ground-truth segmentation. This is to be expected, as this is the least complex of the three images. We can see from Figure 7.35 and Table 7.1 that the segmentation was largely successful.
The EM segmentation averaged 93% correct pixel-by-pixel when compared with the ground truth.

Figure 7.34: Image 1: comparison of ground truth and EM segmented planes

Figure 7.35: Image 1: surface-by-surface comparison of segmentation

Surface   Size (pixels)   % correct
1         4791            94.8%
2         10329           83.1%
3         3811            99.4%
4         10047           71.9%
5         7560            79.0%
Average                   85.8%

Table 7.2: Image 2: summary of segmentation accuracy

The EM segmentation in Image 2 missed two surfaces that were identified by the ground-truth segmentation. The reasons for this are the small size of the missed surfaces in the image and their distance from the camera. The RANSAC method that is used for model selection depends on a surface occupying enough of the image to be detectable: the smaller the region, the less likely it is to be detected by the RANSAC segmentation. Also, the accuracy of farther surfaces is considerably less than that of closer surfaces. Since the RANSAC method relies on the patchlets being within the specified tolerances of the surface accuracy parameters, there is less chance of obtaining successful surfaces at farther distances. The EM segmentation was quite successful with the 5 surfaces it correctly detected and segmented. Figure 7.37 shows the surface-by-surface accuracy, and this information is summarised with accuracy percentages in Table 7.2. The average accuracy for this image was 86%, and this average was pulled down by the rates for surfaces 4 and 5. These surfaces border the surfaces that were not detected by the model selection process, and consequently these EM surface classes "bloomed" into the unclaimed image region. This blooming is to be expected when there is no other surface competing for that area's patchlets. One can notice that the surface borders are at their cleanest where two surfaces meet.
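The false-positive-only accuracy measure used in these tables can be stated compactly. The following is a sketch; the label images and surface identifiers are hypothetical stand-ins for the manual ground-truth and EM segmentation maps.

```python
# Sketch of the accuracy measure in Tables 7.1-7.3: for one surface, the
# percentage of pixels the EM segmentation claims for that surface that the
# ground truth agrees with. Only false positives count against the result;
# ground-truth pixels the segmentation fails to claim are not penalised.
import numpy as np

def surface_accuracy(em_labels, gt_labels, surface_id):
    claimed = (em_labels == surface_id)          # pixels EM assigns to this surface
    if not claimed.any():
        return 0.0
    agreed = claimed & (gt_labels == surface_id) # of those, pixels ground truth agrees on
    return 100.0 * agreed.sum() / claimed.sum()
```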
Finally, we repeated this test for Image 3, which is a much more complex scene than either of the previous two. These results are shown in Figures 7.38, 7.39 and 7.40. Image 3 has eight surfaces identified by the RANSAC model selection process. Again, two surfaces were missed compared with the ground-truth segmentation, and for the same reasons: size in the image and distance from the camera. Again, the surfaces adjacent to the missing classes bloomed into the unclaimed territory, in this case Surfaces 6 and 7. As well, Surfaces 4, 6 and 7 crept into the distant stair balustrade. This was explained earlier and is due to the low confidence values of the patchlets in that region, as they are quite distant from the camera.

Figure 7.36: Image 2: comparison of ground truth and EM segmented planes

Figure 7.37: Image 2: surface-by-surface comparison of segmentation

Figure 7.38: Image 3: comparison of ground truth and EM segmented planes

Figure 7.39: Image 3: surface-by-surface comparison of segmentation

Surface   Size (pixels)   % correct
1         4313            99.9%
2         3164            69.7%
3         6532            99.1%
4         6490            78.8%
5         4283            99.4%
6         6649            66.7%
7         5714            61.9%
8         2415            80.9%
Average                   82.0%

Table 7.3: Image 3: summary of segmentation accuracy

Figure 7.40: Image 3: surface-by-surface comparison of segmentation: (a) Surface 7, (b) Surface 8

The percentage accuracies are shown in Table 7.3. The low-percentage surfaces were 2, 4, 6 and 7. Surface 4 was low due to its inclusion of the balustrade. Surfaces 6 and 7 were low due to their inclusion of the balustrade and their blooming into adjacent unclaimed space.
Surface 2 had a small amount of blooming, but because this surface is so narrow it has a high ratio of border to area, and consequently a small error in border position resulted in a large error in the segmentation percentage.

These results are a comparison of image segmentation and pixel classification and do not address the accuracy of the surface parameter extraction. Unfortunately, we do not have ground truth available for the surface parameters. However, the comparison does address the two problems of model selection and data association which, as discussed in Section 6.1, are actually the most difficult and most important issues in the task of surface segmentation and estimation.

7.8 Fitting: points vs. patchlets

To investigate the effectiveness of the orientation parameter of patchlets in segmentation, we ran the same segmentation experiment with the orientation parameter of the patchlets suppressed. This is effectively fitting surfaces to point data only. We used the full patchlet model for the RANSAC initialisation but suppressed the orientation for the EM refinement.

Figure 7.41: Image 3: comparison of fitting the full patchlet model and fitting positional information only, without the patchlet orientation: (a) with full patchlet model, (b) with position only

Figure 7.41 shows the result of this experiment on the Image 3 test set. Because the RANSAC model selection uses the EM segmentation at the previous model count (see Section 7.6.1), the model selection is not the same for the two cases. The position-only experiment stopped with a model selection of six, as opposed to eight for the full patchlet model experiment. The results of the position-only experiment are shown in Figure 7.41(b). One can see that the localisation of the surface boundaries is clearly not as good as in the full patchlet model segmentation. This is due to planes fitting across steps, as can be seen especially in the lower left-hand class.
This demonstrates the improved cluster discrimination available when the parameter space of the 3D data is expanded to include orientation.

7.9 Lorentzian vs. Gaussian

Although we developed methods for use with the Lorentzian distribution model for patchlet position, in actual practice this model was not required and proved to be a difficult system to get working well. The idea of the Lorentzian distribution is that it is insensitive to outliers and can lock onto valid structure in a complex scene without concern about whether the outlier data is associated with the distribution or not. This should obviate the need for an "outlier" class as described in Section 6.5.

In practice, the "outlier" class does an excellent job of achieving what the Lorentzian distribution was intended for. The Lorentzian proved to be not as insensitive to outliers as one would hope, and was typically drawn off the ideal optimum by incorrect data association, whereas the Gaussian distribution falls off so quickly that it would quickly relinquish an outlier to the "outlier" class. Consequently, the Lorentzian was not used in generating the final set of results.

7.10 The role of scale

We use surface confidence measures associated with our surface parameters in the same manner as we use them for patchlets. However, these confidence measures are supplied a priori to each experiment. It should be possible to estimate the values during EM, to obtain the natural scale of each feature; this is, in fact, what most EM applications do. In practice the clusters in our systems were too close together and overlapped too much to make this approach practical. However, it would be interesting to investigate this issue further. For example, we found that for coarse or loose surface confidence parameters, we would obtain some of the surfaces in the scenes with poor confidence and many inaccuracies in the patchlet positions, such as the doors in the background of Image 1.
However, for these settings, the closer, more accurate surfaces tended to be merged into one surface. Clearly there is a trade-off between the scale or accuracy of the surface and the model selection problem. For example, at a ridiculously loose accuracy specification of the surface, all the data would be lumped into one enormous surface. One can visualise a family of surfaces for a 3D scene based on scale: the lowest scale would have all the data merged into one large ill-fitting surface, and as the fitting requirements became more and more stringent, this large surface would gradually break into smaller and smaller pieces. We hoped to develop some connection between this family of surfaces and our surface extraction processes.

We performed some experiments to investigate this, using the surface confidence measures as a "tuning knob" to move the surface segmentation up and down this family of surfaces. We found that in general there is a natural scale for each surface that is stable over a large range of "tuning" settings. This natural scale is likely associated with the confidence metrics that would be estimated by the EM process, and we would like to extend this research to investigate this phenomenon.

7.11 Computation timing

We should give a brief word on the computation required to perform the processing reported in this section. It should be noted, however, that in many parts of the processing we utilised numerical search strategies based on the Downhill Simplex search algorithm. This algorithm is very robust and problem-insensitive, but it is not particularly efficient; the efficiency of many of these steps could be improved dramatically. The timings reported here were obtained on an Intel Pentium 4 2GHz processor running the Linux operating system.
Stereo disparity processing was done by the Point Grey Research optimised stereo algorithm and was the fastest part of the computation, taking a fraction of a second to generate the 320x240 disparity image. Generating the uncertain points from the disparity image was also quite fast, again taking a fraction of a second.

The processing of the uncertain points into patchlets, however, took considerably longer. It must be remembered that the patchlet fitting uses the Levenberg-Marquardt iterative fitting algorithm for each patchlet involved. For our data sets this was typically 45,000 local fits. These fits took 0.005 seconds per patchlet, for a total computation time of approximately 4 minutes per image. This is the heaviest computation of the method and could be improved dramatically by forcing local uncertain points to conform to homoscedastic error characteristics. This would result in a fairly small change in the best fit and would allow the use of closed-form solutions.

The RANSAC-based estimation takes approximately 1.5 seconds per model attempted; the time for a complete image must be multiplied by the number of models present. The EM refinement showed the most variation in timing, generally taking less than 10 seconds, but this is highly dependent on the quality of the initialisation and the number of models in the system, and in some cases it could take up to 3 minutes to complete.

Chapter 8

Conclusions

The main thrust of this thesis is to explore a new method of interpreting correlation stereo 3D sensing and to apply this method towards compact modeling of 3D environments. To do this, we made contributions in the areas of filtering stereo disparity results, extracting surface elements from stereo images, and robustly extracting and estimating bounded planar surfaces in complex stereo scenes.
The main contributions of the thesis are:

• the patchlet model
• surface extraction from complex stereo scenes
• stereo noise filtering based on surface continuity
• estimation of 3D point error based on correlation stereo error propagation

These contributions are briefly summarised below. As well, we include one illustrative figure for each contribution.

8.1 Patchlet model

The patchlet model is described in Chapter 5 and is the primary contribution of this thesis. The patchlet is a surface element model used to represent the results of correlation stereo depth images. It has a position, normal direction and size, as well as confidence metrics on the position and normal direction. These confidence metrics are propagated using knowledge of the accuracy of camera calibration and correlation.

The patchlet is inspired by the concept that correlation stereo is a surface-sensing rather than a point-sensing technology. Dense correlation stereo requires there to be a surface with sufficient local variation, and so the constraint that sensed values lie on surfaces is consistent with the sensor model.

Figure 8.1: Patchlet example

The patchlet contains additional dimensions of information beyond the 3D point, and this information can assist the interpretation of stereo depth images. Connectivity of patchlets can be maintained through surface size even if the compared patchlets come from different viewpoints or image sets. The normal information is quite important for additional discrimination in clustering methods, for example when identifying surfaces. The confidence metrics allow the propagated sensor error to be used in probabilistic frameworks, ensuring that the patchlets are weighted according to their accuracy. We show in Section 7.3 that, given the sensor model with the pointing and matching accuracies of the stereo rig, the patchlet parameters weighted by the confidence metrics conform to expected distributions.
This shows that the uncertainty of the patchlet model matches our experience with real-world data.

8.2 Surface extraction from stereo images

A method is developed in Chapter 6 that automatically segments bounded planar surfaces from noisy stereo scenes by taking advantage of the additional information available in patchlets. In fitting many surfaces to noisy data we must solve three sub-problems:

1. the model selection problem - how many models to use
2. the data association problem - which data belongs to which model
3. the model fitting problem - what is the best-fit surface

Figure 8.2: Surface extraction from stereo images: (a) rectified image, (b) segmented surfaces

Two methods are presented: a RANSAC-based approach that requires the use of image structure but solves the model selection and model fitting problems and provides a good estimate for the data association problem, and a probability maximisation approach that uses Expectation-Maximisation to obtain a locally optimal parameter estimate.

We demonstrate that the combination of these two techniques can successfully extract planar surfaces even when the noise is considerable and the scene complex. The method successfully segments 82% to 93% of classified pixels when compared with the ground-truth segmentation. The approach also works reasonably well on natural scenes, which have curved surfaces, but the surface boundaries are, understandably, not stable.

We provide experimental results on the effectiveness of this method when confronted with varying noise characteristics in Section 6.7. These experiments show that the surface segmentation can work even in the presence of considerable noise. As well, we demonstrate in Section 7.8 that the orientation parameter is very important in achieving robust surface segmentation in the face of noisy data.

8.3 Stereo noise filtering

We investigate the problem of stereo mismatch errors in Chapter 3, and how to filter them effectively.
The problem with mismatch errors is that they are systematic and locally coherent. This coherence means the errors are stable in a local neighbourhood and so appear more as signal than noise. This behaviour confounds typical noise filters, which expect zero-mean uncorrelated errors.

Figure 8.3: Stereo noise filtering: (a) unfiltered, (b) filtered

An important cue for detecting these errors is the overall size of contiguous connected regions in the disparity image. Valid stereo tends to blend together with adjacent surfaces, whereas mismatch errors appear as "spikes" in the range image. These spikes are not connected anywhere around their periphery to the rest of the disparity image. By detecting and removing these spikes we can remove much of the "scatter" noise that plagues stereo-vision-based point clouds. This filter has been very successful and has been incorporated as a standard feature in Point Grey Research's commercial stereo vision library.

8.4 Correlation stereo error propagation

Traditionally, error analysis for stereo-generated 3D points has been done with the cameras in the stereo rig having equal weight or pixel accuracy. This is true for feature-based stereo, in which the features are extracted from each image independently and then matched. However, for correlation stereo this is not the case. The reference camera in the stereo rig is the privileged camera and its error is only the error of the camera calibration. The contributions of the other cameras include both camera calibration errors and correlation matching uncertainty. In Section 3.3.3 we present a modified method for estimating the variance of a 3D point generated by correlation stereo that is both new and better suited to the correlation algorithm.

Figure 8.4: Correlation stereo error: (a) traditional, (b) calibration and correlation error approach
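The basic first-order behaviour that any such propagation must capture follows from Z = fB/d: a disparity uncertainty sigma_d maps to a depth uncertainty that grows with the square of depth. The sketch below uses illustrative values of f and B; the full method of Section 3.3.3 additionally separates the reference camera's calibration error from the matching cameras' combined calibration and correlation error.

```python
# First-order depth error propagation for correlation stereo.
# Z = f*B/d  =>  sigma_Z ~ |dZ/dd| * sigma_d = (f*B/d**2) * sigma_d
#             = Z**2 * sigma_d / (f*B).
# f (focal length, pixels) and B (baseline, metres) are assumed values.
def depth_and_sigma(d, sigma_d, f=400.0, B=0.12):
    Z = f * B / d                      # depth from disparity
    sigma_Z = (f * B / d ** 2) * sigma_d  # linearised uncertainty
    return Z, sigma_Z
```

This quadratic growth of depth uncertainty with distance is consistent with the degradation seen in the low-disparity test images of Section 7.5.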
8.5 Future work

8.5.1 Merging multiple views

The methods of this thesis were developed with the purpose of combining data from multiple views into a single environment model. This is something we intend to do, in combination with camera-based localisation methods being developed at the Laboratory for Computational Intelligence using SIFT features. Although in theory our methods should be well prepared for this stage, it is often in the doing that new issues or problems crop up. We look forward to this next step in the research process.

8.5.2 Model scale

As mentioned in Section 7.10, the issue of surface scale is an interesting one that our current methods seem well suited to dealing with. It would be very interesting to investigate how our techniques could be used to generate model reconstructions at given resolutions. Some of the issues are:

• determining the natural scale of a particular surface or feature,
• determining the range of confidence metrics over which an extracted surface is stable, and
• determining whether it is feasible to construct a family of surfaces that represents the scene through a continuous space of resolutions.

We would like to investigate whether we could extend scale-space concepts to surface features in 3D.

8.5.3 Parametric surfaces

Planar surfaces are well suited to most man-made environments, but the concept is limiting when one considers object modeling applications. We would like to consider incorporating techniques from the 3D modeling literature for fitting parametric surfaces to patchlet data. As well, particle-based modeling systems have in the past been used to create snake-like active contours in three dimensions; patchlets could be used in a similar fashion.

Bibliography

[Aka74] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716-723, 1974.

[BM01] J. Bride and P. Meer.
Appendix A

Derivation of Expectation-Maximisation updates for mixture models

A.1 Gaussian 1D models

Following the description in Section 4.5, we can see that our complete log-likelihood function will be:

⟨l_c⟩ = Σ_{i=1}^N Σ_{j=1}^M ⟨ξ_ij⟩ log P(x_i|θ_j) + ⟨ξ_ij⟩ log π_j    (A.1)

The "M" step

In the "M" step we want to maximise the expected complete log-likelihood with respect to the model parameters θ_j and π_j. We differentiate the likelihood to get:

∂⟨l_c⟩/∂μ_j = Σ_{i=1}^N ⟨ξ_ij⟩ (∂/∂μ_j) log P(x_i|θ_j) = 0    (A.2)

and

(∂/∂μ_j) log P(x_i|θ_j) = (∂/∂μ_j) log [ e^{-(x_i - μ_j)² λ_j^{-1}/2} / √(2πλ_j) ]    (A.3)
  = (∂/∂μ_j) [ -(x_i - μ_j)² λ_j^{-1}/2 - log √(2πλ_j) ]    (A.4)
  = (x_i - μ_j) λ_j^{-1}    (A.5)

Substituting Equation (A.5) into (A.2) we get

Σ_{i=1}^N ⟨ξ_ij⟩ (x_i - μ_j) λ_j^{-1} = 0    (A.6)

μ_j = Σ_{i=1}^N ⟨ξ_ij⟩ x_i / Σ_{i=1}^N ⟨ξ_ij⟩    (A.7)

Next we solve for λ_j. Starting from Equation (A.1) again,

∂⟨l_c⟩/∂λ_j = 0    (A.8)
  = Σ_{i=1}^N ⟨ξ_ij⟩ (∂/∂λ_j) [ -(x_i - μ_j)² λ_j^{-1}/2 - log √(2πλ_j) + log π_j ]    (A.9)
  = Σ_{i=1}^N ⟨ξ_ij⟩ [ (x_i - μ_j)² λ_j^{-2}/2 - λ_j^{-1}/2 ]    (A.10)

λ_j = Σ_{i=1}^N ⟨ξ_ij⟩ (x_i - μ_j)² / Σ_{i=1}^N ⟨ξ_ij⟩    (A.11)
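The closed-form updates (A.7) and (A.11), combined with the expectation step, yield the familiar EM iteration for a one-dimensional Gaussian mixture. A minimal, illustrative sketch follows; the quantile initialisation and the variance floor are implementation choices, not part of the derivation:

```python
import numpy as np

def em_gaussian_1d(x, M, iters=50):
    """EM for a 1D mixture of M Gaussians, using the closed-form
    M-step updates (A.7) and (A.11) and the E-step responsibilities.
    Quantile initialisation and the variance floor are implementation
    choices, not part of the derivation."""
    mu = np.quantile(x, (np.arange(M) + 0.5) / M)   # spread initial means
    lam = np.full(M, np.var(x))                     # variances lambda_j
    pi = np.full(M, 1.0 / M)                        # mixing weights pi_j
    for _ in range(iters):
        # E step: responsibilities <xi_ij>.
        sq = (x[:, None] - mu[None, :]) ** 2
        p = pi * np.exp(-0.5 * sq / lam) / np.sqrt(2.0 * np.pi * lam)
        xi = p / p.sum(axis=1, keepdims=True)
        # M step: (A.7) for mu_j, (A.11) for lambda_j, plus mixing weights.
        n = xi.sum(axis=0)
        mu = (xi * x[:, None]).sum(axis=0) / n
        lam = (xi * (x[:, None] - mu) ** 2).sum(axis=0) / n
        lam = np.maximum(lam, 1e-9)                 # numerical floor
        pi = n / len(x)
    return mu, lam, pi
```

For the Lorentzian and Fisher models that follow, only the density in the E step changes, while some of the M-step parameters must be found numerically.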
The "E" step

During the "E" step of EM, we update the expected values ⟨ξ_ij⟩. This is determining the probability of ξ given the data and model parameters:

⟨ξ_ij⟩ = P(z_i = j | x_i, θ) = P(x_i | z_i = j, θ) P(z_i = j) / Σ_{k=1}^M P(x_i | z_i = k, θ) P(z_i = k)
       = P(x_i | z_i = j, θ) π_j / Σ_{k=1}^M P(x_i | z_i = k, θ) π_k    (A.12)

A.2 Lorentzian 1D models

Following the discussion in the previous section, the complete log-likelihood function is:

⟨l_c⟩ = Σ_{i=1}^N Σ_{j=1}^M ⟨ξ_ij⟩ log P(x_i|θ_j) + ⟨ξ_ij⟩ log π_j    (A.13)

The "M" step

In the "M" step we want to maximise the expected complete log-likelihood with respect to the model parameters θ_j and π_j. We differentiate the likelihood to get:

∂⟨l_c⟩/∂μ_j = Σ_{i=1}^N ⟨ξ_ij⟩ (∂/∂μ_j) log P(x_i|θ_j) = 0    (A.14)

and

(∂/∂μ_j) log P(x_i|θ_j) = (∂/∂μ_j) log [ (Γ_j/2) / (π ((x_i - μ_j)² + (Γ_j/2)²)) ]    (A.15)
  = (∂/∂μ_j) [ log Γ_j - log 2 - log π - log((x_i - μ_j)² + (Γ_j/2)²) ]    (A.16)
  = 2(x_i - μ_j) / ((x_i - μ_j)² + (Γ_j/2)²)    (A.17)

Substituting Equation (A.17) into (A.14) we get

Σ_{i=1}^N ⟨ξ_ij⟩ 2(x_i - μ_j) / ((x_i - μ_j)² + (Γ_j/2)²) = 0    (A.18)

Similarly, for Γ_j,

∂⟨l_c⟩/∂Γ_j = 0    (A.19)
  = Σ_{i=1}^N ⟨ξ_ij⟩ (∂/∂Γ_j) [ log Γ_j - log 2 - log π - log((x_i - μ_j)² + (Γ_j/2)²) ]    (A.20)
  = Σ_{i=1}^N ⟨ξ_ij⟩ [ 1/Γ_j - (Γ_j/2) / ((x_i - μ_j)² + (Γ_j/2)²) ]    (A.21)

Unfortunately, we were unable to determine a closed-form solution for either Equation (A.18) or Equation (A.21). However, if we accept the added computational burden, we can solve for these values numerically.

The "E" step

The "E" step is based on evaluations of P(x_i | z_i = j, θ). The formulation for determining the expected values ⟨ξ_ij⟩ does not change from that shown in the previous section; only the probability functions are switched to the Lorentzian.

A.3 Fisher 1D models

The complete log-likelihood function is:

⟨l_c⟩ = Σ_{i=1}^N Σ_{j=1}^M ⟨ξ_ij⟩ log P(x_i|θ_j) + ⟨ξ_ij⟩ log π_j    (A.22)

The "M" step

As previously,

∂⟨l_c⟩/∂μ_j = Σ_{i=1}^N ⟨ξ_ij⟩ (∂/∂μ_j) log P(x_i|θ_j) = 0    (A.23)

To determine the differential expressed in Equation (A.23), we make use of an alternate form of the Fisher model expressed in Equation (4.4). From this substitution, and given that sinh κ = (e^κ - e^{-κ})/2, we obtain

(∂/∂μ_j) log P(x_i|θ_j) = (∂/∂μ_j) log [ κ e^{f(x_i,θ_j)} / (4π sinh κ) ]    (A.24)
  = (∂/∂μ_j) [ log κ + f(x_i, θ_j) - log 2π - log(e^κ - e^{-κ}) ]    (A.25)

where f(x_i, θ_j) = κ(cos α_i cos α_j + sin α_i sin α_j cos(β_i - β_j)) and μ_j = [α_j β_j]'. Taking the partials with respect to α_j and β_j we obtain

(∂/∂α_j) log P(x_i|θ_j) = κ sin α_i cos α_j cos(β_i - β_j) - κ cos α_i sin α_j    (A.26)
(∂/∂β_j) log P(x_i|θ_j) = κ sin α_i sin α_j (sin β_i cos β_j - cos β_i sin β_j)    (A.27)

Substituting Equations (A.26) and (A.27) respectively into (A.23) we get

Σ_{i=1}^N ⟨ξ_ij⟩ [ κ sin α_i cos α_j cos(β_i - β_j) - κ cos α_i sin α_j ] = 0    (A.29)
Σ_{i=1}^N ⟨ξ_ij⟩ κ sin α_i sin α_j (sin β_i cos β_j - cos β_i sin β_j) = 0    (A.30)

From these we can solve for α_j and β_j:

tan α_j = Σ_{i=1}^N ⟨ξ_ij⟩ sin α_i cos(β_i - β_j) / Σ_{i=1}^N ⟨ξ_ij⟩ cos α_i    (A.33)
tan β_j = Σ_{i=1}^N ⟨ξ_ij⟩ sin α_i sin β_i / Σ_{i=1}^N ⟨ξ_ij⟩ sin α_i cos β_i    (A.34)

Similarly, for κ,

∂⟨l_c⟩/∂κ_j = 0    (A.36)
  = Σ_{i=1}^N ⟨ξ_ij⟩ (∂/∂κ) [ log κ + f(x_i, θ_j) - log 2π - log(e^κ - e^{-κ}) ]    (A.37)
  = Σ_{i=1}^N ⟨ξ_ij⟩ [ 1/κ + cos α_i cos α_j + sin α_i sin α_j cos(β_i - β_j) - coth κ ]    (A.38)

Unfortunately, we were unable to determine a closed-form solution for Equation (A.38). However, if we accept the added computational burden, we can solve for this value numerically.

The "E" step

The "E" step is based on evaluations of P(x_i | z_i = j, θ). The formulation for determining the expected values does not change from that shown in Section A.1; only the probability functions are switched to the Fisher.

A.4 Patchlet-Surface models

The complete log-likelihood function is:

⟨l_c⟩ = Σ_{i=1}^N Σ_{j=1}^M ⟨ξ_ij⟩ log P(x_i|θ_j) + ⟨ξ_ij⟩ log π_j    (A.39)

Our model, θ_j, is composed of both a Fisher model for evaluating normals and a Gaussian or Lorentzian model for evaluating positional offsets. So

P(x_i|θ_j) = P_f(x_i|θ_j) P_o(x_i|θ_j)    (A.40)

where P_f(x_i|θ_j) and P_o(x_i|θ_j) are the Fisher (orientation) and positional (Gaussian or Lorentzian) models respectively.
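The closed-form direction updates (A.33)-(A.34) of the Fisher M step above can be sketched numerically as follows. The weights w stand in for the responsibilities ⟨ξ_ij⟩, and the use of arctan2 to resolve quadrants is an implementation choice:

```python
import numpy as np

def fisher_mean_direction(alpha, beta, w):
    """Weighted mean direction (alpha_hat, beta_hat) on the unit sphere,
    following the closed-form updates (A.33)-(A.34).  alpha: polar angles
    of the sample directions, beta: azimuths (radians); w stands in for
    the responsibilities <xi_ij>."""
    s, c = np.sin(alpha), np.cos(alpha)
    # (A.34): azimuth from the weighted components in the XY plane.
    beta_hat = np.arctan2(np.sum(w * s * np.sin(beta)),
                          np.sum(w * s * np.cos(beta)))
    # (A.33): polar angle, using the azimuth just estimated.
    alpha_hat = np.arctan2(np.sum(w * s * np.cos(beta - beta_hat)),
                           np.sum(w * c))
    return alpha_hat, beta_hat
```

This is equivalent to taking the weighted vector sum of the unit normals and converting the result back to spherical coordinates.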
The surface model θ_j is made up of the Fisher model parameters (φ, κ), the mean direction and concentration term, and either the Gaussian parameters (μ, λ) or the Lorentzian parameters (μ, Γ). The models are:

P_f(x|θ) = κ e^{f(x,θ)} / (4π sinh κ)    (A.41)
P_o(x|θ) = e^{-(x - μ)² λ^{-1}/2} / √(2πλ)    (A.42)
P_o(x|θ) = (Γ/2) / (π ((x - μ)² + (Γ/2)²))    (A.43)

In this section we will develop the equations for the "E" and "M" steps for the above model, with either the Gaussian or the Lorentzian positional model. To begin, we derive the equations when considering the surface uncertainty measures only, and then we repeat the derivation incorporating both the surface and the patchlet uncertainties.

A.4.1 Gaussian positional model - Surface uncertainty only

The "M" step

We want to maximise the expression for the expected complete log-likelihood ⟨l_c⟩ with respect to the model parameters μ, λ, φ, κ. To do this we take the partial derivative with respect to each of the parameters of each sub-model in the mixture model, set it equal to zero and solve for the parameter of interest. Substituting (A.40), (A.41) and (A.42) into (A.39) yields

⟨l_c⟩ = Σ_{i=1}^N Σ_{j=1}^M ⟨ξ_ij⟩ [ log P_f(x_i|θ_j) + log P_o(x_i|θ_j) + log π_j ]    (A.44)
  = Σ_{i=1}^N Σ_{j=1}^M ⟨ξ_ij⟩ [ log κ_j + f(x_i, φ_j) - log(4π) - log(sinh κ_j)
      - (φ_j'x_i - μ_j)² λ_j^{-1}/2 - log √(2π) - log √λ_j + log π_j ]    (A.45)

The update equation for μ_j is determined by

∂⟨l_c⟩/∂μ_j = Σ_{i=1}^N ⟨ξ_ij⟩ (∂/∂μ_j) log P_o(x_i|θ_j) = 0    (A.46)

with

(∂/∂μ_j) log P_o(x_i|θ_j) = (φ_j'x_i - μ_j) λ_j^{-1}    (A.47)

Equation (A.47) is almost the same as Equation (A.5) in Section A.1, except that the offset used, φ_j'x_i, is not the same as the x_i offset from the origin. The update equation for μ_j is

μ_j = Σ_{i=1}^N ⟨ξ_ij⟩ φ_j'x_i / Σ_{i=1}^N ⟨ξ_ij⟩    (A.48)

The update equation for λ_j is determined by

∂⟨l_c⟩/∂λ_j = Σ_{i=1}^N ⟨ξ_ij⟩ (∂/∂λ_j) log P_o(x_i|θ_j) = 0    (A.49)
  = Σ_{i=1}^N ⟨ξ_ij⟩ (∂/∂λ_j) [ -(φ_j'x_i - μ_j)² λ_j^{-1}/2 - log √(2πλ_j) ] = 0    (A.50)
  = Σ_{i=1}^N ⟨ξ_ij⟩ [ (φ_j'x_i - μ_j)² λ_j^{-2}/2 - λ_j^{-1}/2 ] = 0    (A.51)

Solving for λ_j we obtain

λ_j = Σ_{i=1}^N ⟨ξ_ij⟩ (φ_j'x_i - μ_j)² / Σ_{i=1}^N ⟨ξ_ij⟩    (A.52)

Updates for φ_j and κ_j are more difficult since we could not derive a closed-form solution.
The partial derivative with respect to κ_j is

∂⟨l_c⟩/∂κ_j = Σ_{i=1}^N ⟨ξ_ij⟩ [ 1/κ_j + f(x_i, φ_j)/κ_j - coth κ_j ]    (A.53)

Neither this equation nor the corresponding partial for φ_j gave a closed-form solution, and consequently, rather than selecting a numerical optimisation for the partials, we opted for a direct numerical optimisation of the ⟨l_c⟩ equation with respect to φ_j and κ_j.

The "E" step

The "E" step is based on evaluations of P(x_i | z_i = j, θ). The formulation for determining the expected values ⟨ξ_ij⟩ does not change from that shown in Section A.1; only the probability functions are switched to the full surface/patchlet probability.

A.4.2 Gaussian positional model - Surface and patchlet uncertainty

The difference between this derivation and the one given in Section A.4.1 is that the base model equations given in (A.41) and (A.42) are modified to include the patchlet uncertainty as well:

P_f(x_i|θ_j) = κ_ij e^{f(x_i,θ_j)} / (4π sinh κ_ij)    (A.55)
P_o(x_i|θ_j) = e^{-(φ_j'x_i - μ_j)² λ_ij^{-1}/2} / √(2πλ_ij)    (A.56)

Here κ_ij and λ_ij are values that take into account the parameters of both the model θ_j and the patchlet x_i. The values are combined as:

λ_ij = λ_j + λ_i    (A.57)
κ_ij = κ_i κ_j / (κ_i + κ_j)    (A.58)

These relationships are based on the fact that the correlation of one Gaussian with another results in a Gaussian whose variance is the sum of the variances of the first two.

The "M" step

The update for μ_j is exactly the same as in the surface-only case. For λ_j, the partial derivative is

∂⟨l_c⟩/∂λ_j = Σ_{i=1}^N ⟨ξ_ij⟩ (∂/∂λ_j) [ -(φ_j'x_i - μ_j)² λ_ij^{-1}/2 - log √(2πλ_ij) ] = 0    (A.59)
  = Σ_{i=1}^N ⟨ξ_ij⟩ [ (φ_j'x_i - μ_j)² λ_ij^{-2}/2 - λ_ij^{-1}/2 ] = 0    (A.60)

(Note that ∂λ_ij/∂λ_j = 1.) Because λ_ij = λ_j + λ_i varies with each patchlet, λ_j cannot be isolated, so there is no closed-form solution. Similarly, for the updates of φ_j and κ_j there is no closed-form solution, as the terms only become more complex than in the surface-only case.

The "E" step

The "E" step is unchanged.

Appendix B

Geometry and coordinate transforms

Whitening, or sphering, is a transform that takes a point and a covariance matrix and remaps the space so that the point is at the origin and the covariance matrix becomes a sphere. It is called whitening because usually this covariance matrix represents expected error bounds on the point.
When the transform is complete the error becomes "white" noise: it is not biased in any direction.

B.1 Whitening

Given a point X and a covariance matrix Λ, we determine the steps required to transform another point Y into the whitened space defined by {X, Λ}. Define E as the matrix of eigenvectors of Λ and e as the array of eigenvalues. Perform the following steps. First, centre the space around X:

Y' = Y - X

Next, rotate the space so the eigenvectors of the covariance matrix make up the primary world coordinate axes:

Y'' = E Y'

Finally, scale the axes by the standard deviations in each cardinal axis of the covariance matrix. These standard deviations are the square roots of the eigenvalues. To do this we create the "whitening matrix" W:

W = diag(1/√e_1, 1/√e_2, 1/√e_3)

and multiply the whitening matrix into the current transformed point:

Y''' = W Y''

Y''' is the result of applying the whitening transform to point Y. This equation can also be represented in concatenated form (below), but I find the above description easier to follow.

Y''' = W E (Y - X)    (B.1)

B.2 Unwhitening

Unwhitening reverses the whitening transform. In this case the "unwhitening matrix" is the inverse of the whitening matrix:

U = diag(√e_1, √e_2, √e_3)

The process is reversed as:

Y = (E^{-1} U Y''') + X    (B.2)

B.3 Projection of a point to a plane

At several points we perform a projection of a point onto a plane. We do this as follows. Given a plane H and a point X, the plane can be defined by the planar equation:

Ax + By + Cz + D = 0

where |(A, B, C)| = 1, N = (A, B, C) is the plane normal, and -D is the perpendicular distance of the plane from the origin in the direction of the normal. X is equivalent to a vector from the origin to the point X.
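The whitening (B.1) and unwhitening (B.2) transforms above can be sketched numerically; this assumes the covariance matrix is symmetric positive definite, and uses the convention that the rows of E are the eigenvectors:

```python
import numpy as np

def whitening_matrices(Lam):
    """Eigendecompose the covariance Lam = V diag(e) V'.  E's rows are
    the eigenvectors; W scales each eigen-axis by 1/sqrt(e_i)."""
    e, V = np.linalg.eigh(Lam)
    return np.diag(1.0 / np.sqrt(e)), V.T

def whiten(Y, X, Lam):
    """Y''' = W E (Y - X), Equation (B.1)."""
    W, E = whitening_matrices(Lam)
    return W @ E @ (Y - X)

def unwhiten(Yw, X, Lam):
    """Y = E^{-1} U Y''' + X, Equation (B.2), with U = W^{-1} and
    E^{-1} = E' since E is a rotation."""
    W, E = whitening_matrices(Lam)
    U = np.diag(1.0 / np.diag(W))   # diag(sqrt(e_i))
    return E.T @ U @ Yw + X
```

A useful property: the Euclidean norm of a whitened point is its Mahalanobis distance from X under Λ.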
The signed distance d of X from the plane H can then be computed from the dot product of the plane normal and the vector X:

d = X·N + D

From this we can construct a vector of magnitude d in the direction opposite the plane normal. The projection X' of the point X onto plane H is then

X' = X - dN    (B.3)

Appendix C

Independence of orientation and offset

[Figure C.1: A fitted line to a set of points]

One of the assumptions we make in our treatment of the variation of patchlet parameters is the independence of orientation and 3D position. In this section we investigate the validity of that assumption.

Figure C.1 shows a 2D example of a line fitted to a set of points. The coordinate system has been transformed so that the origin is at the centroid of the set of points and the fitted line lies along the X axis. This transformation corresponds to the world-to-local transform of the patchlet parameters. Because the origin is at the centroid of the point set, we know the following constraints are true:

Σ_{i=1}^N x_i = 0    (C.1)
Σ_{i=1}^N y_i = 0    (C.2)

To investigate the determination of the covariance matrix for the line's position (offset in the Y direction) and orientation (rotation off the X axis), we make use of the equation given in Section 5.3.4:

Λ^{-1} = J'J

where Λ is the covariance matrix and J is the Jacobian matrix of the point-surface error. In our illustration the error for each point is e_i = y_i - ŷ_i, where (x_i, ŷ_i) is the perpendicular projection of the point (x_i, y_i) onto the line. Clearly the nominal value of all ŷ_i is 0. In this two-dimensional case the Jacobian J is an N×2 matrix where J(i,j) = ∂e_i/∂θ_j. The θ are the line parameters, so we assign θ_1 to be the offset and θ_2 to be the orientation. The offset is initially 0, but an incremental change in the offset is a change in ŷ_i, so

∂e_i/∂ŷ_i = (∂/∂ŷ_i)(y_i - ŷ_i) = -1

For a small rotation in orientation Δφ, ŷ_i changes by x_i Δφ.
So

∂ŷ_i/∂φ = x_i

and from the chain rule

∂e_i/∂φ = (∂e_i/∂ŷ_i)(∂ŷ_i/∂φ) = -x_i

Now that we have the elements of the Jacobian, we can see that

J'J = [ Σ_{i=1}^N 1       Σ_{i=1}^N x_i
        Σ_{i=1}^N x_i     Σ_{i=1}^N x_i² ]

From Equation (C.1) we know that the off-diagonal values in this matrix, Σ_{i=1}^N x_i, equal 0. Therefore in the inverse of the matrix the off-diagonal values will also be 0.

Unfortunately, this case is a simplification of the true situation. With patchlets, each point has a weight w_i assigned by the cross-section of its covariance value. Now the constraint equations become

Σ_{i=1}^N w_i x_i = 0    (C.3)
Σ_{i=1}^N w_i y_i = 0    (C.4)

and the error equation is

e_i = w_i (y_i - ŷ_i)

If we push this formulation through, we obtain

J'J = [ Σ_{i=1}^N w_i²       Σ_{i=1}^N w_i² x_i
        Σ_{i=1}^N w_i² x_i   Σ_{i=1}^N w_i² x_i² ]

In this case, the off-diagonal values are not set to 0 by the constraint equations. However, the diagonal elements of this matrix will only increase with additional data elements, while the off-diagonal values will tend to near-zero values, since the positive and negative x_i values continually negate each other. So in practice the off-diagonal values are not guaranteed to be zero, but they tend to be much smaller than the diagonal elements. This has been verified experimentally during patchlet formation: the off-diagonal elements tended to be at least two orders of magnitude less than the diagonal ones, and often five or six orders of magnitude less. Consequently, the assumption of a diagonal matrix, and the independence this assumption provides, was considered a reasonable one.

Appendix D

Additional confidence measure images

In this section we provide a complete listing of the image test sets and their mapped confidence measures. Each of Figures D.1 through D.11 shows, for one test image, (a) the rectified image, (b) the positional variance, and (c) the Fisher κ confidence measure.

[Figures D.1-D.11: Image 1 through Image 11 confidence measures]
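The near-diagonality argument of Appendix C can also be checked numerically. The points and weights below are hypothetical, chosen only to illustrate the two cases; `jtj` builds J'J for the two-parameter line fit:

```python
import numpy as np

def jtj(x, w=None):
    """J'J for the two-parameter line fit of Appendix C.  The columns
    of J are de_i/d(offset) = -w_i and de_i/d(phi) = -w_i * x_i
    (w_i = 1 in the unweighted case)."""
    w = np.ones_like(x) if w is None else w
    J = np.column_stack([-w, -w * x])
    return J.T @ J

# Unweighted, centred points (constraint C.1): off-diagonal exactly zero.
x = np.arange(-5, 6, dtype=float)
A = jtj(x)

# Weighted points, recentred so that sum(w_i x_i) = 0 (constraint C.3):
# the off-diagonal is nonzero but small relative to the diagonal.
w = 1.0 + 0.02 * x
xc = x - np.sum(w * x) / np.sum(w)
B = jtj(xc, w)
```

With these mild weights the off-diagonal term is roughly two orders of magnitude below the corresponding diagonal term, consistent with the behaviour reported in Appendix C.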
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Patchlets : a method of interpreting correlation stereo...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations