Rigidity Checking For Matching 3D Point Correspondences Under Perspective Projection by Daniel Peter Roland McReynolds B. Eng. (Electrical) Carleton University M . Sc. (Computer Science) The University of British Columbia A THESIS SUBMITTED IN PARTIAL F U L F I L L M E N T O F T H E REQUIREMENTS FOR T H E D E G R E E O F Doctor of Philosophy in T H E F A C U L T Y OF G R A D U A T E STUDIES (Department of Computer Science) we accept this thesis as conforming to the required standard The University of British Columbia October 1997 © Daniel Peter Roland McReynolds, 1997 In presenting this thesis/essay in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Computer Science The University of British Columbia 2366 Main Mall Vancouver, B C Canada V6T 1Z4 Date: I°y n?7~ Abstract A solution is proposed for the problem of determining the correspondence between sets of point features extracted from a pair of images taken of a static scene from disparate viewpoints. The relative position and orientation between the viewpoints as well as the structure of the scene is assumed to be unknown. Point features from a pair of views are deemed to be in correspondence if they are projectively determined by the same scene points. The determination of correspondences is a critical sub-task for recovering the stmcture of the world from a set of images taken by a moving camera, a task usually referred to as stmcture-from-motion, or for determining the relative motion between the scene and the observer. A key property of a static world, assumed by the proposed method, is rigidity. Rigidity of the world and knowledge of the intrinsic camera parameters determines a powerful constraint on point correspondences. The main contribution of this thesis is the rigidity checking method. Rigidity checking is a tractable and robust algorithm for verifying the potential rigidity of a set of hypothesized three-dimensional correspondences from a pair of images under perspective projection. The rigidity checking method, which is based on a set of stmcture-from-motion constraints, is uniquely designed to answer the question, "Could these corresponding points from two views be the projection of a rigid configuration?" The rigidity constraint proposed in this thesis embodies the recovery of the extrinsic (relative orientation) camera parameters which determine the epipolar geometry - the only available geometric constraint for matching images. The implemented solution combines radiometric and geometric constraints to determine the correct set of correspondences. The radiometric constraint consists of a set of grey-level differential invariants due to Schmid and Mohr. Several enhancements are made to the grey-level differential invariant matching scheme which improves the robustness and speed of the method. The specification of differential invariants for grey-scale images is extended to color images, and experimental results for matching point features with color differential invariants are reported. Contents Abstract ii Contents iii List of Tables viii List of Figures ix Acknowledgements xv Dedication xvi 1 Introduction 1 1.1 Background and contribution 1 1.2 Applications of rigidity checking 7 1.2.1 Recognition from a single view 7 1.2.2 Stereo correspondence 8 1.2.3 Image matching 8 1.3 Matching point features using rigidity checking 9 1.3.1 Introduction 9 1.3.2 Motivation 10 1.3.3 Key point detectors 15 iii iv 1.3.4 Grey-level differential invariants 15 1.3.5 Rigidity checking 16 1.4 Rigidity checking estimator terminology 17 1.5 Contribution from rigidity checking application 17 1.6 Thesis structure 18 2 Background 19 2.1 Introduction 19 2.2 Camera models 20 2.2.1 Projective camera 20 2.2.2 Affine camera 22 2.2.3 Perspective camera 22 2.2.4 Weak perspective, orthographic and para-perspective cameras 24 2.2.5 The affine camera revisited 26 2.3 2.4 2.5 2.6 Camera calibration 26 2.3.1 Camera calibration with a calibration object 28 2.3.2 Self-calibration 30 2.3.3 Summary 31 3D correspondence constraints for two views of a static scene 32 2.4.1 Coplanarity constraint 34 2.4.2 Collinearity constraint 34 2.4.3 Invariance of distance and angle constraint 35 2.4.4 Relationship to structure-from-motion constraints 37 On the uniqueness of correspondence 38 2.5.1 Sources of matching ambiguity at keypoints 39 2.5.2 Additional matching constraints 41 2.5.3 Stable recovery of motion and structure 42 Nonlinear least squares estimation 42 V 3 Rigidity checking 49 3.1 Introduction 49 3.2 Related work for the task of verifying point correspondences 52 3.2.1 Correspondence verification and orthographic projection 52 3.2.2 Correspondence verification and perspective projection 54 3.3 3.4 3.5 3.6 3.7 3.8 Related methods for the task of structure-from-motion 57 3.3.1 Structure-from-motion and orthographic projection 57 3.3.2 Structure-from-motion and perspective projection 58 Derivation of the rigidity checking equations 61 3.4.1 Rigidity checking camera model 61 3.4.2 Least squares solution of nonlinear equations 62 Implementation issues 65 3.5.1 Algorithm overview 65 3.5.2 Computing an initial estimate 67 3.5.3 Estimator for the nonlinear camera model 73 Experimental results 83 3.6.1 Monte Carlo simulation 83 3.6.2 Real image sequence 86 Comparison to other methods 89 3.7.1 Horn's algorithm for relative orientation 89 3.7.2 Comparison to essential matrix methods 90 Conclusions 4 Rigidity checking for matching point correspondences 4.1 91 93 Introduction 93 4.1.1 94 From brightness images to verified matches 4.2 Background 95 4.3 Keypoint detectors 97 vi 4.4 Equivalence of objective functions 99 4.5 Biased versus unbiased objective function 102 4.5.1 Minimization of the unbiased objective function 106 4.5.2 ROC performance comparison - biased versus unbiased objective functions . . 106 4.6 The x test statistic 109 4.7 Characterizing the brightness function at keypoints 110 4.7.1 116 2 Computing the covariance matrix for differential invariants 4.8 Nearest neighbour search withfc-dtrees 117 4.9 Integrating the affine and perspective camera models 119 4.10 Ranking match validity 4.10.1 How many subsamples? 5 Experimental results for matching point features 120 123 &124 5.1 Introduction 124 5.2 Camera calibration 125 5.3 Images taken with an SLR camera 127 5.3.1 Playground scene 128 5.3.2 Construction scene 131 5.3.3 CICSR building 134 5.4 5.5 5.6 Images taken with a CCD camera 138 5.4.1 138 Laboratory scene Color Differential Invariants 139 5.5.1 Color feature transformation due to Ohta 140 5.5.2 RGB color features 143 5.5.3 Construction scene color differential invariants 144 5.5.4 Nitobe Garden scene: color and grey-level differential invariants 145 5.5.5 Summary 148 Image matching rotational invariance 151 vii 5.7 6 Conclusions Conclusions and future directions 6.1 6.2 154 157 Issues for further study 160 6.1.1 Issues arising from the image matching application 160 6.1.2 Rigidity checking issues 161 6.1.3 Robust estimation of the epipolar geometry 162 Public availability of rigidity checking implementation Bibliography 162 163 List of Tables 3.1 Hessian Condition Values 79 5.1 Camera calibration results 127 5.2 Grey-Level differential invariant matching summary 127 5.3 Rigidity checking summary 128 5.4 Matching results for color and grey-level differential invariants. Percent proportion of correct matches for three scenes 148 viii List of Figures 1.1 Two views of a Lego object with some correspondences noted 4 1.2 Depiction of epipolar geometry between a pair of views. 5 1.3 Comer detection output after non-maximum suppression and threshold for 500 strongest TA/ responses 10 1.4 Same as previousfigurefor second image 10 1.5 Hypothesized matches from k-d tree nearest neighbour search over space of grey-level differential invariants 11 1.6 Same as previousfigurefor second image 11 1.7 Ranked outliers relative to the affine camera model. The outlier ranked 1 is the least likely to be a valid correspondence for the affine camera model. Unranked matches fit the affine camera model to a 1.8 X0.95 confidence level Rigidity checking verified matches for thefirstimage. Verified matches are marked with *'s 1.9 12 13 Rigidity checking verified matches for the second image. Verified matches are marked with *'s 13 ix X 3.1 Plan and front elevation views of a pair of orthographic projections of two scene points. The image plane is rotated by ninety degrees from position 1 to 2. The rotation induces an epipolar geometry between the views with parallel epipolar planes. Axes V and V are defined in their respective image coordinate systems. The rotation of the camera (or equivalently the object) can be decomposed into a rotation about axis V followed by a rotation about the camera's viewing direction that maps axis V into axis V 3.2 69 Images 1 and 2 respectively fromfigure3.1. The rotation of the 3D edge whose projection is e can be decomposed into two rotations. Thefirstrotation about the axis V 1 maps e into e such that its projection onto V is invariant. The second rotation about 1 2 the viewing direction takes e into e , e 's corresponding edge in the second view. 2 3.3 3 1 . . 70 Rigidity verification with and without translation in depth initial estimate. Receiver operating characteristic (ROC) for observations of six points with a noise variance of 1 pixel. The curves are generated for the scenario with expected large perspective distortions. Close up of the curves at a small false positive rate. Each curve generated from 2000 trials for each of rigid and nonrigid configurations 3.4 73 Receiver operating characteristic (ROC) as a function of the global depth offset. MonteCarlo scenario is for observations of six points with a noise variance of 1 pixel. The curves are generated for the standard scenario. Each curve generated from 2000 trials for each of rigid and nonrigid configurations 3.5 79 Receiver operating characteristic (ROC) for depth and disparity parameterization tradeoff. Monte-Carlo scenario is for observations of six points with a noise variance of 1 pixel. The curves are generated for the standard scenario. Each curve generated from 2000 trials for each of rigid and nonrigid configurations 3.6 81 Receiver operating characteristic (ROC) for observations with a noise variance of 1, 4 and 9 pixels. Motion and structure variables are for the standard scenario. Close up of the curve's knee. The curve for the noisiest data is furthest to the 3.7 right 84 Test with large perspective distortion. Receiver operating characteristic (ROC) for observations with a noise variance of 1 pixel 84 xi 3.8 First frame of Lego sequence with labeled correspondences 86 3.9 Second frame of Lego sequence 86 3.10 Bar graph of sorted residual error for permutations of six corresponding points. Only the lowest residual values are shown, approximately 10 pixels and less. For an expected noise variance of 1 pixel, 35 permutations out of 720 (4.9 percent) are classified as potentially rigid. Filled bar is correct correspondence 87 3.11 Similar to the previousfigure,but with seven correspondences. For an expected noise variance of 1 pixel, 25 permutations out of 5040 (0.5 percent) are classified as potentially rigid. Filled bar is correct correspondence 87 3.12 First frame of laboratory scene with labeled correspondences 88 3.13 Second frame of laboratory scene 88 3.14 Bar graph of sorted residual error for permutations of six corresponding points for the laboratory scene. For an expected noise variance of 1 pixel, 15 permutations out of 720 (2.1 percent) are classified as potentially rigid. Filled bar is correct correspondence 89 3.15 Similar to the previousfigure,but with seven correspondences. For an expected noise variance of 1 pixel, 4 permutations out of 5040 (0.1 percent) are classified as potentially rigid. Filled bar is correct correspondence. 89 3.16 A comparison with Horn's method. ROC for noisy observations with a variance of 1, 4, and 9 pixels. Horn's method are the 3 dashed curves. The curve for the noisiest data is furthest to the right for each method. Motion and structure variables are for the standard scenario 91 3.17 A comparison with Horn's method. Close up of the knee of the ROC curves from the previous figure 4.1 91 ROC plot for the biased and unbiased nonlinear rigidity checking estimators. The unbiased estimator curves are denoted by There are three curves for each estimator corresponding to three different Gaussian noise levels with a standard deviation of 1,2, and 3 pixels. Each curve was generated from 100000 trials of the standard scenario for six points (see text) 109 xii 5.1 First calibration image. Frame id 22 126 5.2 Second calibration image. The relative motion between the camera and target board is 24 inches. Frame id 23 5.3 126 Third calibration image. The relative motion between the previous image and this one is again 24 inches. Frame id 24 5.4 126 First frame of playground scene image pair with candidate matches from grey-level differential invariant matching 5.5 130 Second frame of playground scene image pair with candidate matches from grey-level differential invariant matching 5.6 130 Ranked order of outliers relative to affine camera model. The point ranked 1 is least likely to fit the affine camera model. Unlabeled matches satisfy the xi 9 5 (95% confidence level) significance test. Points ranked 1 and 3 are incorrect matches and the remaining labeled points are correct matches but do not satisfy the Xo 9 5 t e s t - This is due in part to larger perspective effects, as, for example, the match ranked 2, would be expected to have 5.7 131 Rigidity checking verified matches for the playground image pair. These matches are for the unbiased estimator for the nonlinear camera model. The localization error standard deviation is assumed to be one pixel. Verified matches are marked with *'s. The biased nonlinear estimator also correctly verified match 16 5.8 5.9 132 Rigidity checking verified matches for the playground image pair. Same as previous figure for non-reference image 132 Grey-Level differential invariant matches at scale-level 0,firstframe 133 5.10 Grey-Level differential invariant matches at scale-level 0, second frame 133 5.11 First frame of construction scene image pair with candidate matches from grey-level differential invariant matching 135 5.12 Second frame of construction scene image pair with candidate matches from grey-level differential invariant matching 135 xiii 5.13 Rigidity checking verified matches for the construction scene image pair. The localization error standard deviation is assumed to be one pixel. Verified matches are marked with *'s 136 5.14 First frame of CICSR building image pair with candidate matches from grey-level differential invariant matching 137 5.15 Second frame of CICSR building image pair with candidate matches from grey-level differential invariant matching 137 5.16 Rigidity checking verified matches for the CICSR building image pair. These matches are from the unbiased estimator for the nonlinear camera model. The localization error standard deviation is assumed to be one pixel. Verified matches are marked with stars. The biased estimator also correctly verified matches 5 and 11 138 5.17 Playground scene with candidate matches from color differential invariant matching. Color feature components II and 12 from Ohta are used yielding a 14 component differential ;•• invariant vector. Seventy percent of the matches are correct 5.18 Same as previousfigurebut for non-reference image 141 141 5.19 Playground scene with candidate matches from color differential invariant matching. RGB color feature components yielding a 21 component differential invariant vector. 74 percent of the matches are correct 5.20 Same as previousfigurebut for non-reference image 144 144 5.21 First frame of the image pair of the Nitobe Garden at UBC with candidate matches from grey-level differential invariant matching. Of the 7 matches, 4 are correct 146 5.22 Second frame of the image pair with candidate matches from grey-level differential invariant matching 146 5.23 First frame of the image pair of the Nitobe Garden at UBC with candidate matches from RGB color differential invariant matching. The differential invariant vectors have 21 components, 7 from each color feature. Of the 12 matches, 11 are correct 5.24 Second frame with candidate matches from RGB color differential invariant matching. 147 147 xiv 5.25 First frame of the image pair of the Nitobe Garden at UBC with candidate matches from color differential invariant matching using 1112 color features due to Ohta. The differential invariant vectors have 14 components, 7 from each color feature. Of the 9 matches, 5 are correct, however, match 2 is mis-localized by approximately 5 pixels 149 5.26 Second frame with candidate matches from color differential invariant matching using II12 color features 149 5.27 Rigidity checking verified correspondences from the hypothesized RGB color differential invariant matches for the Nitobe Garden scene. These verified matches are from the unbiased estimator for the nonlinear camera model. The biased estimator also correctly verified matches 2, 4, and 8. Verified matches are marked with *'s 150 5.28 First frame of playground image pair with verified candidate matches from grey-level differential invariant matching. Differential invariantfilterresponses computed using . ; central difference. The rigidity checking verified matches are marked with *'s and are from the biased estimator for the nonlinear camera model 152 5.29 Second frame of playground image pair. Differential invariantfilterresponses computed using central difference. The rigidity checking verified matches are marked with *'s. . 152 5.30 First frame of playground image pair with verified candidate matches from grey-level differential invariant matching. Image rotated counter-clockwise by ninety degrees. Differential invariant filter responses computed using central difference. The rigidity checking verified matches are marked with *'s and are from the biased estimator for the nonlinear camera model 5.31 Second frame of playground image pair. See caption from previous figure 153 153 Acknowledgements My first acknowledgement of gratitude must go to my thesis advisor David Lowe. David has been a constant source of inspiration and encouragement through all the ups and downs of this truly challenging endeavour. I consider myself very fortunate to have had the opportunity to work with him as my mentor for this effort as well as my M.Sc. Our collaboration spans many enjoyable years indeed. I am indebted to the remaining members of my thesis committee, Uri Ascher, Jim Little and Bob Woodham, for their helpful advice and suggestions - especially with respect to the final draft of my thesis. Thank-you. My work has benefitted greatly from discussions with Jiping Lu, Jeff Beis, Richard Pollock, Art Pope, Esfandiar Bandari, Ying Cui, Bob Price, and Michael Sahota. I am grateful for the staff at the Laboratory for Computational Intelligence who have helped me over the years - Rod Barman, Stuart Kingdon, and Dan Razzell. A special thank-you to Valerie McRae for her friendship, help, conversation, and encouragement. The faith, love, and support of my parents Lucette and Tony constitutes a wonderful foundation upon which I stand. My family and I have been, and continue to be, truly blessed with the fellowship of our friends at Brighouse United Church in Richmond. Thank-you for your prayers, encouragement and support. For my beloved wife Maureen, my deepest felt expression of gratitude for your love, your encouragement, and your perseverance. And to my daughters Sarah and Emily - thank-you for your rambunctious hellos and hugs and love always but especially at the end of many of those long days. This research was supported financially by the National Science and Engineering Research Council of Canada and a Graduate Research Assistantship under the supervision of Dr. David Lowe for which I express my thanks. D a n i e l Peter R o l a n d The University of British Columbia October 1997 xv M c R e y n o l d s To Maureen, Sarah, and Emily. Chapter 1 Introduction 1.1 Background and contribution Computational vision is concerned with the information-processing task of producing useful descriptions of the three-dimensional world from images of that world. An important component of this task is the recovery of various geometric and material properties of three-dimensional visible surfaces from two-dimensional brightness images. The desired properties include, for example, the distance between surfaces and the camera, surface shape, reflectance, color, and texture. Recovering these physical quantities from two-dimensional images requires a thorough understanding of the image formation process [40]. Recovering the geometry of visible surfaces is a task that is necessary for accomplishing many other tasks from a variety of application areas such as robotics, computer graphics or image processing. For example, in robotics, knowledge of the geometry and location of surfaces is often necessary to avoid obstacles or to perform navigation tasks. For a robotic system to recognize a part in a bin of different parts, the sub-task of recovering the shape of the object is often required. In computer graphics, the three-dimensional structure of objects is required before they can be projected and rendered - threedimensional object models can be acquired from a sequence of two-dimensional images taken from a variety of viewpoints. In image processing, video data can be compressed based on knowledge about 1 CHAPTER 1. INTRODUCTION 2 the three-dimensional structure of the surfaces and the camera motion, thus reducing transmission bandwidth requirements. To recover the structure of the scene from multiple images it is necessary to establish the equivalence of certain geometric information obtained from each image. This sub-task involves solving the correspondence problem which is the problem addressed by this thesis. Point features from a pair of views are deemed to be in correspondence if they are projectively determined by the same scene points. The correspondence problem can be defined as follows. Given a pair of two-dimensional images, for every point in space that projects onto both images determine the location of the projected scene point in each image. Determining correspondences is a critical sub-task for recovering the structure of the world from a set of images taken from different viewpoints, usually referred to as structure-from-motion, or for determining the relative motion between the scene and the observer. Establishing feature correspondences between images of an object is also a critical sub-task for a variety of object recognition tasks, in particular, those methods that match views or images of objects, e.g., linear combinations of models [107]. The structure-from-motion problem has not been solved in its full generality since most methods assume that the problem of determining correspondences has been solved. Some structure-from-motion algorithms solve the correspondence problem under the assumption that the viewpoint change is small, thus simplifying the problem. Such a restriction is not assumed by this research. The image formation process involves a projection from the three-dimensional world to the twodimensional image plane. This mapping from three dimensions to two is not uniquely invertible so additional constraints are required to reduce the space of possible worlds that could have produced the image. Additional constraints on the geometry of the world can be provided by a set of images of the same scene from differing viewpoints. If we assume that the images are taken simultaneously or are of a static scene then the relative depths of the scene points and the epipolar geometry can be uniquely determined from a pair of views given the correct correspondences and some additional geometric information about the image formation process - information that is related to the internal geometry of the camera. The problem of determining correct correspondences is also under-constrained for a pair of views unless scene structure and the epipolar geometry are given. This situation defines a classic chicken and egg problem where determining correct correspondences requires knowledge of the epipolar geometry CHAPTER 1. INTRODUCTION 3 and scene structure and, conversely, recovering the unknown epipolar geometry and scene structure requires knowledge of the correct correspondences. A classic strategy for solving problems like this involves iterating on a cycle of hypothesizing and verifying. The main contribution of this thesis is a method, called rigidity checking, for verifying the rigidity of a set of hypothesized three-dimensional correspondences from their two-dimensional perspective projections. To demonstrate the effectiveness of the rigidity checking method, a system is described that automatically determines a set of point correspondences from a pair of views of a static scene. The accurate verification of the rigidity of an hypothesized set of three-dimensional point correspondences from two views solves the correspondence problem. The rigidity checking method, which is based on a set of structure-from-motion constraints, is a solution uniquely designed to answer the question, "Could these corresponding points from two views be the projection of a rigid configuration?" It is argued that the answer to this question involves solving a problem which is well-posed, and is the criticalfirststep for a general solution to the often ill-posed structure-from-motion problem. Figure 1.1 is a depiction of a pair of views of a 3D Lego™object taken by a camera from disparate viewpoints . A small number of correspondences with a sketch of the 3D wireframe model are 1 shown. The camera is assumed to be calibrated, that is, the transformation between the camera and image coordinate systems is known. Knowledge of the camera calibration is required to satisfy the geometric model of the projection process from the camera coordinate system to the image coordinate system. This model specifies that the scene point, image point and center of projection are collinear. The additional conditions that the scene is static and the world and camera coordinate systems are related by a Euclidean transformation completes the rigidity specification. Figure 1.2 is a depiction of the epipolar geometry for a single correspondence. The rigid body motion of the camera induces a vector between the centers of projection and the projection vectors of the scene point which all lie in the same plane, called the epipolar plane. Where the vector between the two centers of projection intersects an image plane defines the epipole. For coplanar image planes the epipoles lie at infinity. The problem we are posing is to automatically identify point correspondences 'The case of a single body undergoing a rigid motion relative to a static camera that takes a picture at two different time instants is geometrically equivalent. CHAPTER 1. 4 INTRODUCTION • °- Cantor of Projection 1 Figure 1.1: Two views of a Lego object with some correspondences noted. from disparate viewpoints such as the ones shown in Figure 1.1. It is necessary but not sufficient for correct correspondences to lie on their respective epipolar lines. Sufficiency is not achievable from two views since any point on the epipolar line will satisfy the epipolar constraint and there are no other geometric constraints available that uniquely specify the correct correspondence. Basri [5] has shown that the correspondence between a pair of points from two views under the orthographic projection of the same scene point is determined up to a straight line. Hence, from two views, correspondence is not uniquely determined, i.e., the correspondence of points along the lines cannot be determined. Basri goes on to show that the three-dimensional correspondence of points from their two-dimensional images is not unique for any number of images under orthographic or perspective projection. The relation between correspondences for more than two views is a line in a space of dimension proportional to the number CHAPTER 1. INTRODUCTION 5 of images. The correspondence of a scene point can only be determined up to this line in the higher dimensional space. The specification of a depth for a projected scene point in one camera coordinate frame determines a unique location in each of the other images by the linear constraint. The correspondence of points along this line cannot, therefore, be uniquely determined. This implies that the rigidity of the scene is also not verifiable with absolute certainty from two-dimensional views. Given m and n points in two images, respectively, the cost of verifying their rigidity and, hence, their correspondence, is 0 ( m n ) since a minimum of six points is required to perform the verifica6 6 tion with rigidity checking. The cost of evaluating all these combinations with exhaustive search is prohibitive. The use of heuristic radiometric constraints to yield a set of predicted matches reduces this complexity substantially as demonstrated by the rigidity checking application presented here. Figure 1.2: Depiction of epipolar geometry between a pair of views. CHAPTER 1. INTRODUCTION 6 The correctness of the approach hinges on the assumption that for a known noise distribution the probability of selecting the rigid hypothesis when the datafitsthe model can be determined with confidence that is proportional to the accuracy of localizing the correct correspondences . Ideally, the 2 probability of an incorrect match satisfying the rigidity checking constraint would also be an estimable value, near or equal to zero, for a given localization error bound. However, establishing this probability distribution is not a straight-forward process. To do so would require knowing all the ways the data can be misfit to the model - not an easy problem to solve [116]. Instead, we invoke therigidityassumption due to Ullman [105]. The rigidity assumption is: Any set of elements undergoing a two-dimensional transformation which has a unique interpretation as a rigid body moving in space should be interpreted as such a body in motion. Two conditions are assumed to hold true. Thefirstis that the probability that arigidinterpretation can be found given that the accurately known correspondences are not from arigidconfiguration is zero. The second condition is that the probability that a rigid interpretation is not found when the configuration is rigid, is also zero for accurately known correspondences. These conditions have been promulgated by others [64, 4, 8]. The uniqueness of the rigid interpretation for accurately known correspondences yields a zero probability for interpreting as nonrigid a rigid configuration. This applies to three views of four or more points under orthographic projection. In the two view case, this applies only to perspective projection of six or more points under a heuristic proof of uniqueness. Ullman cites a statistical proof that the probability of arigidinterpretation for a nonrigid configuration is zero for accurately known correspondences. Bennett et al. [8] give an alternative proof of the same result using the Lebesgue measure. The situation for noisy observations is quite different, however, and appears to require a complex analysis involving several factors, as noted by Bennett etal. [8]. Whaite and Ferrie [116] make a similar observation for a related data classification problem using the residuals offittingthe data to a given model. Such an analysis does not yet appear to have been undertaken in the computational vision literature. Other factors do influence this confidence measure, however. For example, bilateral or higher order symmetry in the object structure can yield correspondences which satisfy rigidity exactly but in fact are incorrect. The transparency of surfaces and specularly reflecting surfaces are other factors that confound the image matching process and lead to non-unique solutions for correspondence. It is assumed that surfaces are opaque and reflect light diffusely. 2 CHAPTER 1. INTRODUCTION 7 The empirical results for rigidity checking reported here are for noisy observations under scaled orthographic and perspective projection. The results suggest that, for noisy observations, the probability for either type of error is small but is not zero. A detailed analysis of the probability distributions for these errors requires several factors to be addressed separately and together as a system - it is required to have a detailed model of the image formation process as well as a model of the image acquisition system in order to characterize the source of localization error in the image measurements. In addition, the analysis must address the numerical aspects of the estimation process including the choice of objective function, the type of estimator used to perform the optimization and the expected distributions for all of the random variables for the stochastic processes inherent in these models. The initial set of verified correspondences can be used to estimate the epipolar geometry to reduce the search space for more correspondences. Given a larger set of correspondences, the epipolar geometry may be reliably estimated and a full stereo matching process could then be applied to yield a dense matching between the images. 1.2 1.2.1 Applications of rigidity checking Recognition from a single view Kontsevich discusses the application of his linear structure-from-motion algorithm for weak perspective images to the problem of correspondence checking and recognition. As a corollary to the "viewpoint consistency constraint" [60] he states the following assumption for pairwise comparisons: "If, for some pair of projections, correct [rigid] correspondence exists, the projections are views of the same object." The relation between correspondence and structure referred to by Kontsevich as the "structural theory" (citing earlier work due to Grzywacz and Yuille [31]) describes the mutually supporting processes of correspondence and 3D interpretation that operate simultaneously. Huttenlocher and Lorigo [48] have independently developed a method for recognizing objects from a set of feature points from a single two-dimensional view using a verification procedure which is based on the same structure-from-motion constraint due to Kontsevich. The rigidity checking method incorporates the stmcture-from-motion constraint due to Kontsevich and extends the formulation of the rigidity component of the "structural theory" CHAPTER 1. INTRODUCTION 8 hypothesis of Grzywacz and Yuille to images formed under full perspective projection. The recognition problem from pair-wise comparison of images is therefore feasible for perspective images using rigidity checking. Bennett et al. [10] derive recognition polynomials that assume point correspondences have been established between a single two-dimensional previous view of an object and a novel view under orthographic projection. Their polynomials can be constructed for different transformations between views, e.g., similarity or affine. They have not, however, derived a polynomial for perspective projection although they claim that, in principle, such a polynomial can be constructed. The rigidity checking method solves the correspondence problem under the given assumptions for perspective projection and could be used as a method for recognizing objects given one or more model views and a novel view. An example of this application could be image retrieval from a database. An example query would be to find all the images containing different views of the desired object from an example image of the object. 1.2.2 Stereo correspondence In stereo conespondence, the use of the epipolar constraint reduces the search space to a one dimensional search along the corresponding epipolar lines. In the absence of extrinsic camera calibration, the problem of stereo conespondence is equivalent to the motion conespondence problem. Hence, if the epipolar geometry is unknown or poorly estimated, then the rigidity checking method would be a suitable module for disambiguating matches and together with other binocular matching rules should prove to be a reliable approach to solving the stereo camera calibration problem. 1.2.3 Image matching A camera, whose intrinsic parameters are assumed to be known, takes two pictures of the same scene under similar illumination conditions from disparate viewpoints. It is assumed that the views of the scene overlap so that some minimal amount, say fifty percent, of the surface area is visible in both views. The task, then, is to determine a set of point conespondences between the two views. This application is further developed as part of this thesis and is introduced in the following section. CHAPTER 1. INTRODUCTION 9 1.3 Matching point features using rigidity checking 1.3.1 Introduction The proposed task is to determine a set of three-dimensional correspondences from a pair of perspective images obtained with a calibrated camera from disparate viewpoints. The approach to this recovery problem begins with the detection and localization of hypothesized corresponding point features from the image pair. Image matching can be defined as the process of determining a mapping between two images such that all image points are either matched or designated as occluded or the correspondence is unknown. Feature point matching with rigidity checking can be viewed as the importantfirststep in deriving this dense representation. Our approach to establishing a set of correspondences can be summarized as follows: • Point features are extracted from each image using the Harris-Stephens comer detector [34]. Figures 1.3 and 1.4 are images of an indoor scene marked with the location of 500 comers from the comer detection output. • The comer features are characterized by a set of grey-level differential invariants due to Schmid and Mohr [86]. A matching scheme based on k-d tree nearest neighbour search produces a set of hypothesized matches. • The hypothesized matches are partitioned into two sets according to their fit to an affine camera model in order to facilitate sequential verification using the rigidity checking method. The sequential application of rigidity checking yields the final set of verified matches. Figures 1.5 and 1.6 show the hypothesized matches. In Figure 1.7 the correspondences are ranked according to their likelihood of being correct relative to the affine camera model from least to most likely. Unranked matches fit the affine camera model within a 0.95 confidence level for the x 2 one-tailed test statistic for a noise standard deviation of 2 pixels. In Figures 1.8 and 1.9, comers marked with *'s are the rigidity checking verified matches. Rigidity checking verified all of the correct correspondences, however, match 17 was incorrectly accepted. This is attributed to a disparity that lies parallel to the epipolar line for the correct match, thus yielding a low residual. CHAPTER 1. INTRODUCTION 10 Figure 1.4: S a m e as previous figure for second image. 1.3.2 Motivation The approach could be described as a mixture of feature and region based methods. The primary motivation behind this is to provide a generalized approach to image matching that is also efficient and could CHAPTER 1. INTRODUCTION 11 Figure 1.5: Hypothesized matches from k-6 tree nearest neighbour search over space of grey-level differential invariants. Figure 1.6: S a m e as previous figure for second image. possibly be implemented in a real-time system. Point features are extracted based on a measure of "cornerness," a notion related to a steep intensity gradient in an image region with both principal curvature CHAPTER 1. INTRODUCTION 12 Figure 1.7: Ranked outliers relative to the affine camera model. The outlier ranked 1 is the least likely to be a valid correspondence for the affine camera model. Unranked matches fit the affine camera model to a xf confidence level. 9 S components of the local autocorrelation function approximately equal at some location in the region. The matching process begins with a region comparison between pairs of comer features which is an implementation of a local matching constraint. The region comparison is based on a kind of cross-correlation measure. The measure is the Euclidean distance between two grey-level differential invariant representations of the brightness function at the corresponding keypoint from each image. These choices are motivated by the assumption that a large enough subset of features (greater than five) will have approximately the same appearance under the viewing assumptions implicit in our model of the image formation process and the physical properties of the world. Correlation is used to quickly filter out unlikely correspondences using a metric that measures the similarity of the brightness values in a localized region of each image. This helps to reduce the combinatoric explosion of possible correspondences that must be verified with the rigidity checking criterion. Global consistency for hypothesized sets of correspondences is achieved by rigidity checking. Our method is further motivated by the desire to develop a system which is best suited for general CHAPTER 1. INTRODUCTION 13 Figure 1.9: Rigidity checking verified matches for the second image. Verified matches are marked with * ' s . applicability to a wide variety of scenes. Images of scenes taken by a stereo rig are difficult to analyze for determining correspondences when the fronto-parallel assumption is violated, as when, for example, the CHAPTER 1. INTRODUCTION 14 scene has many surfaces with a significant variation in surface orientation and depth which leads to surface occlusion and significant distortion of the local geometry of the brightness pattern for corresponding surface patches. The scenario we are attempting to deal with is considerably more challenging in that, in addition to handling significant variation in the surface geometry of the scene, the relative orientation between cameras is unknown and the viewpoints can be significantly different. However, in another sense, the problem we pose is easier since we are not trying to recover a dense disparity map. Only six corresponding points are required to uniquely recover the epipolar geometry , which means that the pre3 diction phase of the matching process only needs to produce a small number of unambiguous matches to be verified. Once the epipolar geometry is well established, conventional stereo techniques can be utilized to recover a dense disparity map. This approach relies onfindinga small set of unambiguous matches from a region based, local constraint method that can be rapidly and accurately verified with a feature based global constraint method - the rigidity checking method. This sequence of operations is similar to that of Zhang et al. [121] who have developed an approach for matching images taken by an uncalibrated camera. In their approach they use a correlation method to generate an initial set of correspondences. They then use a local relaxation method to disambiguate matches by applying a matching consistency constraint to the neighbours of each candidate match. Incorrect and poorly localized matches are eliminated by applying the epipolar constraint to the matches in a least-median-of-squares framework. Our proposed method is able to handle a larger range of viewpoints by utilizing a differential invariant matching procedure and by exploiting rigidity. For small sets of corresponding points, the epipolar constraint solution that their method is based on often yields low residual solutions, hence rigid interpretations, for nonrigid configurations, whereas the rigidity checking method is shown to be more robust. Consequently, it is not initially required for rigidity checking to have a large number of correct correspondences to yield a well-conditioned problem. The following subsections elaborate on the key components of the proposed approach for image matching. The rigidity checking method is over-constrained by six correspondences and a heuristic expectation is that the solution is unique [46]. 3 CHAPTER 1. INTRODUCTION 1.3.3 15 Keypoint detectors Keypoint is the more general term which is used here interchangeably with the term corner. A comer is defined to be a two-dimensional discontinuity of the image brightness function usually associated with the projection of surface junctions in the scene. Comers are popular tokens for image analysis because of the ease of their detection and their localization attributes in two dimensions which make them useful for matching purposes, e.g., across image sequences. Detecting the same comers arising from the same scene point from images taken under different illumination or from different viewpoints is known to be unreliable, however. Recent work by Merron and Brady [67] provides new insight into why this is the case. This point will be revisited in Chapter 4. In our approach, the presence of two-dimensional brightness discontinuities is used as an attentional mechanism for applying a region based matching criterion. The Harris-Stephens comer detector is implemented using derivative of Gaussianfilterkernels to estimate the brightness derivatives as suggested by Schmid and Mohr. Since the brightness gradient is changing rapidly in the vicinity of a comer, a local characterization of the brightness function should possess a unique set of attributes. The drawback of a local characterization at a comer is that the comer may be due to a discontinuity in depth or surface orientation and the support size of the characterization could overlap non-contiguous surfaces leading to instability of the characterization with respect to changes in viewpoint. This instability can lead to mis-matches or match failures. The best comers for matching purposes are due to a discontinuity of the surface reflectance function, which occurs, for example, when surfaces have markings on them. 1.3.4 Grey-level differential invariants As an alternative to normalized cross-correlation we make use of the grey-level differential invariant scheme due to Schmid and Mohr [86] to locally characterize the brightness function over a set of point features. The differential invariants are used to determine an initial set of hypothesized matches. Local differential invariants of the image brightness surface as defined by Koenderink and van Doom [50] are invariant to the similarity group of two-dimensional rotations and translations. A multi-scale representation affords scale invariance, thus, the use of grey-level differential invariants allows a much larger CHAPTER 1. INTRODUCTION 16 space of two-dimensional image transformations to be handled without resorting to brute force search to match images. Under a set of conditions specified by Weng et al. [Ill] we can expect local differential invariants to be invariant to motion. Our working assumption when matching images is that there will be a sufficient number of projected scene points that satisfy the motion invariance assumptions. A sufficient number in our case is six points, the minimum number required for rigidity checking. 1.3.5 Rigidity checking Rigidity checking is a tractable and robust method for verifying the rigidity of an hypothesized set of 3D point conespondences from two-dimensional perspective views. Therigiditychecking problem is different from the structure-from-motion problem because it is often the case that two views cannot provide an accurate structure-from-motion estimate due to ambiguity and ill-conditioning, whereas it is still possible to give an accurate yes/no answer to therigidityquestion since the question defines a well-posed problem in the sense of Hadamard. Rigidity checking verifies point conespondences using 3D recovery equations as a matching condition. The proposed method is the only algorithm for this verification problem that makes full use of the disparity of the correspondences, works with as few as six corresponding points under full perspective projection, handles correspondences from widely separated views, and is integrated with a linear algorithm for 3D recovery. The contribution of the rigidity checking method can be interpreted as a reformulation of therigiditycomponent of the so called "structural theory" of Grzy wacz and Yuille [31] for conespondence. This theory states that correspondence and 3D interpretation processes operate simultaneously with mutual support. Rigidity checking yields a significant improvement in the implementation of therigidityassumption of the theory relative to other methods because it makes full use of the image disparity which necessarily involves scene structure and extends the implementation to full perspective images. Wei et al. [110] claim that the conespondence problem for less than eight points is intractable for the approach that uses 3D recovery equations as a matching condition. Test results of the performance of the algorithm refute this claim. CHAPTER 1. INTRODUCTION 1.4 17 Rigidity checking estimator terminology z Therigiditychecking method incorporates two different estimators, one for the linear affine or weakperspective camera model and one for the nonlinear perspective projection camera model. Although, strictly speaking, it is not correct to refer to the estimators as linear or nonlinear, on occasion, for the sake of brevity, the estimator for the affine or weak-perspective camera model will be referred to as the linear estimator and the estimator for the nonlinear perspective camera model will be referred to as the nonlinear estimator. There are two versions of the estimator for the nonlinear, perspective projection, camera model. At the risk of confusing the precise technical definition of bias from the statistical literature with the 4 usage of the term here, therigiditychecking biased nonlinear estimator refers to the formulation of the objective function which is biased towards minimizing the residuals with respect to one image only. The objective function for the unbiased nonlinear estimator is symmetric in that the residuals are minimized for both images simultaneously by incorporating a set of measurement bias parameters in the camera model for the reference image. 1.5 Contribution from rigidity checking application Our approach offers several improvements over existing methods. These include a novel integration of the estimator for the motion parameters of an affine camera model with therigiditychecking method for sequentially verifying points correspondences and three enhancements to the grey-level differential invariant matching scheme due to Schmid and Mohr. The three enhancements are, a scale-space matching approach which improves the robustness of the method, k-d tree nearest neighbour search which reduces the query completion time to logarithmic expected time, and a method for determining the covariance matrix for normalizing differential invariants at runtime. In order to establish a set of hypothesized matches, a comer detector finds a set of keypoints which are characterized by the differential invariant representation. Differential invariants are invariant 4 A n estimate is unbiased if the expected value of the error of the estimate is zero, otherwise it is biased. CHAPTER 1. INTRODUCTION 18 to rotation and a multi-scale matching scheme affords scale invariance over a fixed bound. The use of feature points leads to translational invariance. Thus, a greater range of imaging situations can be handled relative to classical correlation methods. An empirical analysis of the unbiased version of the objective function for the nonlinear camera model reveals a negligible performance difference relative to the biased version. However, we show that the unbiased estimator allows for a unified treatment of the residual error of the linear and nonlinear estimators when deciding the rigidity question - the x test statistic. We show that this unified error 2 treatment follows from the result that the objective functions minimized by the estimators for the linear and nonlinear camera models are equivalent. The specification for grey-level differential invariants is extended to include color differential invariants. Results for two different sets of color features are reported: RGB (red, green, blue) and a color feature set due to Ohta [71]. The use of color differential invariants for image matching has not been reported in the literature to date. 1.6 Thesis structure Chapter 2 provides an introduction to the camera models and parameter estimation methods relevant to this dissertation. The issue of camera calibration is also reviewed. Chapter 3 develops the rigidity checking algorithm and reports an initial set of experimental results. The application of rigidity checking to the problem of feature point matching is presented in Chapter 4 and Chapter 5 contains the experimental results for image matching with grey-level and color differential invariants. Chapter 6 presents the thesis conclusions and ideas for future work. Chapter 2 Background 2.1 Introduction The image formation process is comprised of two components, geometric and radiometric. The geometric component determines where a surface element in the scene appears in the image and the radiometric component determines the image brightness of the surface element. The primary focus of this thesis is on the geometric component of the image formation process. The geometry of this process is captured by the camera model. Section 2.2 provides an overview of a set of camera models. Rigidity checking incorporates two camera models, the perspective camera and the scaled orthographic (weak perspective) camera model. These camera models are referred to as calibrated models. Section 2.3 discusses the problem of camera calibration and some of the proposed solutions in the literature. Section 2.4 discusses three categories of constraint formulation derived from the epipolar geometry between different views of the same scene. Also discussed is the uniqueness of the solution for determining correspondences and sources of ambiguity that confound the correspondence process. Therigiditychecking method uses 3D stmcture-from-motion constraints as a matching condition to verify correspondences. Stmcture-from-motion is an inverse mapping problem from the 2D image measurements to the 3D motion and structure parameters. The inverse mapping problem for parametric models is generally solved by using methods derived from parameter estimation or optimization 19 CHAPTER 2. 20 BACKGROUND theory where a function must be minimized according to some constraints. Section 2.6 reviews estimation methods that are relevant to the rigidity checking method including the Gauss-Newton, Newton, and Levenberg-Marquardt methods. 2.2 Camera models The geometry of the image formation process can be represented by a set of canonical camera models that describe the projection of scene points onto the image plane [40, 39,90]. The classic model for a camera is a pinhole at afixeddistance from an image plane. Light travels in a straight line, hence, each point in the image specifies a ray which extends towards the scene. This gives us the standard perspective projection model which is only an approximation to the optical physics of a physical camera - an excellent approximation in most applications. The other camera models are specializations of this model. Shapiro [90] discusses six camera models which are in two classes, calibrated and uncalibrated. The projective and affine camera models are in the uncalibrated class and the perspective, para-perspective, weak perspective (scaled orthographic), and orthographic models are the calibrated models. These models are summarized below from the most general model to the least for each class, noting that knowledge of only the affine, scaled-orthographic and perspective camera models is required for this thesis - the other models are discussed to provide an overall context for these core models. 2.2.1 Projective camera The projective transformation maps a three dimensional point in the world to the two dimensional retinal, or image, plane. It is convenient, from a mathematical point of view, to consider the world as embedded in a three dimensional projective space, V and the image plane as embedded in a projective space of 3 dimension two V as this facilitates the expression of the projective transformations [30, 22]. Points 2 in the scene and image projective spaces are represented as vectors in homogenous coordinates. The 21 CHAPTER 2. BACKGROUND projective transformation from V to V is given by 3 2 Wl T Tn Ti T14 T21 T22 T23 T24 T31 T32 T33 T34 n = w 2 w 3 3 Pi (2.1) p2 p4 where T is a 3 x 4 projection matrix, that maps the scene point p = \p , p ,p ] x T y z to image point w = [u, v] expressed in homogenous coordinates. T The homogenous coordinates \pi,P2,P3,P4] T a ° d [wi,w , W3] are related to the inhomogenous T 2 coordinates of p and w by P = [Px,Py,Pz] = T \Pl/P4,P2/P4,P3/P4] 2 and w = [u, v] T = [wi/w , w /w ] . T 3 2 3 Since homogenous coordinates are scale invariant, excluding scaling by zero, the projective transformation matrix T is also defined uniquely up to a scale factor, i.e., only the ratios of the elements Tij are significant, hence, there are 11 independent degrees of freedom. There are no restrictions on the world or image coordinate frames in which p or w are measured as a result of equation (2.1) including orthogonality. Note, however, that this projection is projective linear, i.e., straight lines project to straight lines, and consequently this model does not take into account lens distortion or image plane defects which can affect the accuracy of an estimate of the projective transformation. The transformation matrix T can be decomposed as follows [90]: C C\2 C13 10 C21 C22 C23 0 n T = CPG = 0 0 C33 0 0 0 10 0 0 10 G11 G12 G13 G14 G21 G22 G23 G24 G31 G32 G33 G34 G41 G42 G43 G44 (2.2) The matrix C is an affine transformation of the camera coordinates to the set of pixel coordinates, hence C31 — C32 — 0. The components of this array are termed the intrinsic parameters of the camera and CHAPTER 2. 22 BACKGROUND usually has up to five independent degrees of freedom depending on the level of complexity of the intrinsic camera model. Under the assumption that there is no shear in the image plane, then C , which has four degrees of freedom, can be written as, -K r 0 0 u 0 -/wo 0 0 (-) 2 3 1 where £ is the pixel aspect ratio, / is the effective focal length, and [wo^o] is the coordinate vector of T the principal point of the camera, the point where the optical axis intersects the image plane. The camera is said to be "calibrated" when the components of C are known. The 3 x 4 matrix P is the projection operator, and G is the mapping from the world coordinate system to the camera coordinate system. 2.2.2 Affine camera The affine camera specializes the projective camera by locating the optical center at an infinite distance from the image plane. In this case, all of the world points are projected in parallel onto the image plane. The transformation is denoted T / / and is given by a T12 = w 3 T 13 T14 T23T 2 i 0 T 0 0 Pi (2.4) p2 2 4 T34 p4 from the world to the image in There are eight independent degrees of freedom. The affine projection inhomogenous coordinates can be written w = Mp + t where M = [Mjj] is a 2 x 3 matrix with elements M , j = Tij/T (2.5) 2.2.3 and t = [T14/T34, T24/T 4] . r 34 3 Perspective camera The perspective camera is a specialization of the projective camera from knowledge of the intrinsic parameters of the camera, i.e., the camera is calibrated. The transformation from the world to camera co- 23 CHAPTER 2. BACKGROUND ordinate systems can be written as (2.6) P = Rp + d where P is the inhomogenous coordinate vector of the world point in the camera frame, R is the 3 x 3 rotation matrix with row vectors R^, R , and R , and d = [d , d , d ] is the displacement between T y 2 x y z coordinate frames . Hence, r G R d 0 1 T and using equation (2.3) the perspective transformation matrix denoted T can be written as p -f£d + ud -fd + vd x -fRy + R vR 0 0 y z 0 z (2.7) z z The expression that relates the inhomogenous world and image coordinates is w = - / ^(R^P + d )/(R p x (R y P z + d) z + oy/(R p + d) 2 z UQ + (2.8) v 0 The familiar camera centered perspective projection equation u p -f p x z V is determined when the rotation matrix is the identity matrix and the displacement is zero, i.e., the world and camera coordinate systems are aligned, the camera aspect ratio is one and the principal point is [0,0] yielding the transformation matrix -/ 0 0 0 0 - / 0 0 0 0 10 The transformation to image coordinates is nonlinear in the scene coordinates due to the scaling by inverse depth. The recovery of motion and structure from perspectively projected images is a T CHAPTER 2. 24 BACKGROUND difficult problem because this nonlinear transformation from the scene to image coordinates can lead to non-unique and ill-conditioned solutions. Closed form solutions for nonlinear systems of equations are generally not known or do not exist, and, when the perspective effects are small compared to the precision of the observations the inverse problem becomes ill-conditioned [33]. 2.2.4 Weak perspective, orthographic and para-perspective cameras Weak perspective projection is a specialization of the perspective projection model by approximating the depth of a set of scene points by their average depth. A parallel projection of the scene points is made onto an intermediate image plane parallel to the image plane of the camera where the intermediate image plane is at the average depth. The intermediate projected points are then perspectively projected onto the image plane of the camera which is equivalent to a uniform scaling of the intermediate image coordinates. After the rigid body transformation from the scene coordinate frame to the camera frame by equation (2.7), the depth of point i in the camera frame is given by F = R Pi + d . Let the centroid of a set 2 J z z of scene points be p „ - When thefield-of-viewof the camera is small and the depth range of the points a e in the camera frame, P i — P , , z z ave is small relative to P then the depths P \ can be approximated ZAVe z by P ,ave- This approximation is referred to as the weak perspective or scaled orthographic camera. The z transformation is given by: r Ra;P + d •^^37 x w 2 w 3 =C P Q P T ,ave 1 Pi 2 = c RyP + dy P ^37 u 1 Z z,ave p4 from which is derived 10 0 0 0 10 0 0 0 0 1 R d 0 P ,av T z (2.9) CHAPTER 2. 25 BACKGROUND The difference between the P matrices of equations (2.2) and (2.9) is in the replacement of P constant P . z<ave wp by the The mapping from the world to the image in inhomogenous coordinates is given by / w where M. 28 R„ P + £ 4 / UQ + = M w p p (2.10) +1wp V 0 is a 2 x 3 matrix of uniformly scaled rotation matrix rows, and t wp is a two component vector that gives the projection of the scene coordinate frame origin, p = 0. If the world and camera coordinate frames are aligned and the camera aspect ratio is one then ^ wp — C -/ 0 0 0 0 -/ 0 0 0 0 0 p 1 z,ave which yields the coordinate mapping, Ui f p 1 The scaling by 1/P , z ave z,ave Pxi (2.11) Pyi accounts for the changes in image size as the scene or object moves closer or further away from the camera. The weak perspective approximation is considered reasonably accurate when the average depth P ve exceeds the depth variation A P = P — P z<a i.e., P ,ave z 2 z by an order of magnitude, > 101 A P | . The absolute error in the image position is given by Z P 1 z,ave which shows that a narrow field of view (small values for P /P x tion, AP /P , z ve z<a Zjave z 1 P z and P /P ) y z and a small depth varia- improves the fidelity of the weak perspective model. Note also that the errors are not uniform over the field-of-view [90]. The orthographic camera specializes the weak perspective camera by fixing the ratio f/P ve z<a to unity. The para-perspective camera model is similar to the weak perspective model but specializes the projection by introducing a fixed, para-perspective, projection direction which is usually specified in 26 CHAPTER 2. BACKGROUND terms of an angle between the orthogonal projection onto the average depth plane and the para-perspective direction. All scene points are projected parallel to this fixed direction. Shapiro [91] gives an expression for the ID para-perspective mapping from a scene point in the camera frame to the image as / [P - [Pz- Pz,ave)cot{9)), z,ave X where 6 is the angle between the projection direction and the positive x axis. In the case of two dimensions, two angles are specified, one angle in the x-z plane and the other in the y-z plane. The transformation matrix is given by 1 0 —cotO x — C 0 1 0 0 —COtOy cotd x R COtOy (2.12) 0 Note that this is a calibrated camera model. 2.2.5 The affine camera revisited Shapiro [91] discusses the properties of the affine camera noting the following points. The affine camera generalizes the weak perspective and para-perspective cameras in two ways. The affine camera model subsumes an affine 3D transformation of the scene coordinates and secondly calibration is not required, as can be seen by noting the absence of the calibration matrix C . The affine camera, therefore, can be considered as an uncalibrated weak perspective camera. Shapiro argues that several affine measurements such as parallelism, ratios of parallel lengths, ratios of areas on parallel planes, are useful and obtainable without requiring camera calibration, an often ill-conditioned recovery process. Distances measured in the image plane, he notes, are still required for computations that require noise to be minimized. 2.3 Camera calibration Camera calibration is the process which determines the geometric relationship between the scene and images of that scene. The geometric relationship describes the transformation between coordinates of CHAPTER 2. BACKGROUND 27 points in the world and their coordinates in the image plane. The purpose of camera calibration is to be able to predict accurately the image position of surface elements given their position in the world as well as the inverse relationship; given the projected image position of the surface element, determine accurately the direction of the ray on which this surface element must lie in the world. In most situations camera calibration refers only to the geometric component of the process but there is also a radiometric component, for example, when sampling a composite video signal, errors in detecting the blanking level can alter the brightness level of the image [11]. The geometric aspect of camera calibration will be the focus of the following discussion. The rigidity checking method assumes that the camera is calibrated, so the objective of this section is to see what the camera calibration process involves and to suggest an approach to calibration based on results in the literature. Two calibration processes are looked at. The first approach requires images of calibration targets with known control points in the world coordinate frame. The second approach is called self-calibration and relies only on establishing correspondences between views. The advantages and disadvantages of these two approaches are discussed. Camera calibration involves specifying a mathematical model of the image formation process. This parametric model, in general, includes two sets of calibration parameters, the so called intrinsic and extrinsic parameters. The intrinsic parameters determine the internal geometry of the camera and the extrinsic parameters determine the external geometry of the position and orientation of the camera relative to the world coordinate system. An additional set of parameters are often specified which model the aberration of the lens system. These parameters are used to determine an optical geometric correction [49]. The projective camera model described by equation (2.2) contains the affine transformation matrix C referred to as the calibration matrix. The components are referred to as the intrinsic parameters of the camera and the matrix usually has up to five degrees of freedom. An example of the physical parameters encoded by the matrix is given by equation (2.3). A more comprehensive model of the intrinsic parameters of the camera is given by CHAPTER 2. 28 BACKGROUND C = -fk u fk cot{9) u 0 —fk cosec(6) VQ 0 where k ,k , u u v 0 (2.13) 1 0 are the horizontal and vertical scale factors that relate the pixel size to the world coordinate v frame units for length, and 9 is the angle between the directions of image coordinate axes. This factor is introduced to account for possible non-orthogonality in the pixel grid. In practice, its value is typically very close to 7r/2. Since / cannot be distinguished from k or k , Faugeras et al. [23] suggest defining u a = - fk u u 2.3.1 and a v = - fk v which yields five degrees of freedom for the calibration matrix. v Camera calibration with a calibration object Tsai's [103] classic paper on camera calibration details a two stage process to determine the calibration of a camera using a planar calibration object. The calibrated parameters include the extrinsic parameters, i.e., orientation and position of the camera relative to the world coordinate frame, the focal length, radial lens distortion parameters, and a scaling factor for the horizontal coordinates which accounts for image acquisition and sampling errors. Two radial lens distortion parameters are specified but Tsai notes that, from experimental evidence, one parameter accounts for the majority of the radial distortion error. Tangential lens distortion, which is primarily the result of the non-collinearity of the optical centers of the lens elements in a compound lens leading to both radial and tangential displacement of the light rays , 1 is not considered significant enough to be modeled. This contrasts with the analysis due to Weng et al. [113], who account for tangential lens distortion in their model, and report a significant improvement in their calibration precision metric when tangential distortion is accounted for as well as radial distortion. 2 Their geometric distortion model consists of radial distortion, decentering distortion which contributes to radial and tangential distortion, and thin prism distortion which also contributes to radial and tangential distortion. The thin prism distortion is due to manufacturing and lens design errors and lens assembly misalignment. These errors are modeled by the addition of a thin prism model which yields additional 'called decentering distortion 2 which they refer to as the normalized stereo calibration error CHAPTER 2. BACKGROUND 29 radial and tangential distortion parameters. A calibration procedure generally consists of the following steps [49]. 1. Place the calibration object in the camera's fields of view such that the reference points are accurately known relative to some world coordinate frame. 2. Measure the reference points in the acquired image. 3. Specify a parameterized camera model that relates the projection of calibration reference points in world coordinates to the observed image coordinates in terms of the model parameters. 4. Perform an estimation procedure on some objective function, e.g., least squares estimation. The objective function could, for example, specify the residual error between the model's prediction of where the scene point projects and the observed location. Kanatani [49] notes that typically the objective function is nonlinear in the model parameters and iterative numerical techniques must be used to determine the minimizer by a linearization process. The conditioning of the linearized system of equations can be poor due to the presence of parameters that are not sufficiently constrained, thus leading to an ill-conditioned problem. For example, Tsai notes that for typical industrial machine vision applications only radial distortion need be considered and only the first term in the series expansion [103]. Therefore, special attention must be paid to the model, the methodology for determining the initial estimate, and the implementation of the iterative minimization process [103, 113]. For CCD type cameras producing an analog video signal, attention must be paid to errors introduced by the sampling process performed by the frame-grabber. Depending on the type of synchronization that exists between the camera and frame-grabber, errors in detecting various voltage levels such as the zero reference and range reference (which determines the black and peak white reference), start of a new line (horizontal synch), start of a newfield/frame(vertical synch), and if interlaced, start of first or second field, can lead to brightness level changes and geometric distortion of the image. Beyer [11] analyzes these error sources and proposes methods for the detection and calibration of these errors. Willson and Shafer ask the question "What is the center of the image?" [118]. Their thesis is that the conventional answer, the intersection of the optical axis with the image plane, is not a well defined CHAPTER 2. 30 BACKGROUND geometric construction. There are, they argue, several answers, and they do not all coincide. They describe a taxonomy of sixteen different image centers and describe methods for measuring them all. They also show that some of these image centers change position with the lens parameters, e.g., focus or zoom. 2.3.2 Self-calibration Faugeras et al. [23] introduced the concept of calibrating the same camera, i.e., determining the intrinsic parameters (2.13), from a set of point correspondences identified over several views taken from different viewpoints. The method relies on solving a set of six quadratic constraints from three images for five unknown parameters. The intrinsic parameters a « „ , w , v , and 6 are then determined from the five U ) 0 0 intermediate parameters. The system of six quadratic equations are called the Kruppa equations and are solved by a homotopy continuation method. Hartley [37] describes the methods as requiring extreme accuracy of computation, and unworkable for greater than three images due to the rapid growth in the number of possible solutions. Determining point correspondences from several views is a difficult problem due to occlusion, aspect changes, and brightness changes with respect to a changing viewpoint due to non-lambertion reflectance components of the surface. Other methods for self-calibration require known camera motions. Hartley [37] proposes an algorithm which does not require the location of the viewpoint to change over several views but only the camera's orientation. No other information concerning camera motion or initial estimates of the intrinsic parameters are required. The method, however, still requires the correspondence problem to be solved. Since the viewpoint does not change, the problem is simplified somewhat and simple brightness correlation works well after an initial set of four correspondences are manually identified to determine the similarity transformation between the images. The method proceeds byfirstfindinga set of point correspondences from a set of three or more images. The camera intrinsic parameters and the camera's location are assumed to be constant. One of the views is considered the reference and the projective transformation matrices (see equation (2.1)) between the reference view and each of the subsequent views are determined. Using a non-iterative method, an upper-triangular calibration matrix is computed from the least squares solution of an over-determined system of equations in a set of intermediate parameters derived from each of the projection matrices. If CHAPTER 2. BACKGROUND 31 desired, an iterative method can be applied to refine the estimate of the calibration matrix. Experimental results for synthetic data indicate that the iterative refinement process yields a substantial improvement in the accuracy of the estimated parameters. Another advantage of Hartley's projective method for self-calibration is the ability to perform geometric correction due to lens distortion. Hartley reports that an additional term for radial lens distortion was included in the iterative refinement method and tested with inconclusive results, i.e., the distortion effect was estimated to be negligible. This result agrees with a result due to Zhang [122]. Zhang generalizes the projective camera model to include parameters for lens distortion. Lens distortion leads to epipolar curves rather than lines and the estimation process determines the parameters of this curve under the projective constraint for point matches. Zhang reports that the estimation process works best when localization noise is small and the lens distortion is relatively large, otherwise the estimate for the fundamental matrix can be worse than the estimate assuming no lens distortion. This is due to the poor 3 conditioning of the system of constraints when the lens distortion parameters are included. An important observation Zhang makes is the necessity to identify correspondences near the image borders where the distortion effects are maximal. Image overlap for all border regions for a rotating camera, except for rotation about the optical axis, is unlikely. This is an important consideration for Hartley's method when including lens distortion. Also, Hartley notes that his method does not perform well for rotation only about the optical axis. 2.3.3 Summary Self-calibration by Hartley's method appears to be a promising approach to calibrating the intrinsic parameters. Hartley's correspondence method requires manual intervention to determine an initial set of correspondences, a definite drawback for autonomous systems. The grey-level differential invariant matching method due to Schmid and Mohr incorporated in the rigidity checking image matching application would allow for a fully autonomous process since the matching method is invariant to 2D similarity transformations of the image features and is scale invariant over afixedbound. The drawback of the differThe fundamental matrix encodes the position of the epipoles and the pencil of epipolar planes for a pair of images for the projective camera model. 3 CHAPTER 2. BACKGROUND 32 ential invariant method is the requirement that the image pixels are square. It is often the case that the approximate pixel aspect is known and would be enough to do an initial matching. Alternatively, a bootstrap phase for calibration could be implemented so that the pixel aspect ratio could be determined. This could easily be done by taking an image of a sphere andfittingan ellipse to the image of the contour from which the ratio of the major and minor axis length could be determined. The calibration images could then be resampled to yield square pixels before searching for correspondences. For mobile robotics and other situations where frequent camera calibration is desirable the classical methods using calibration targets are not feasible except for those parameters that are relatively stable with respect to changes in the lens parameters, such as skew and pixel aspect ratio. Note that a reduced set of calibration parameters such as only the magnification factor can be estimated simultaneously with the motion and structure parameters for the rigidity checking method at the expense of increasing the minimum required number of correspondences by one. The other calibration parameters would assumed to be known. This may be adequate for a large number of applications and suggests that a combination of the classical and self-calibration methods should yield best results. 2.4 3D correspondence constraints for two views of a static scene There are three basic formulations of the epipolar constraint for two calibrated images of a static scene for point correspondences • coplanarity • collinearity • invariance of distance and angle. These constraints can be used as matching conditions for correspondence and several methods for determining correspondences have been developed based on these constraints for calibrated cameras [10, 47, 52,101, 110], including the rigidity checking method from this dissertation, and similarly for uncalibrated cameras [91, 121]. These formulations differ with respect to the parameters that are estimated. CHAPTER 2. BACKGROUND 33 In the case of the coplanarity constraint for the perspective camera model, the depth parameters have been eliminated leaving five motion parameters, three for rotation and two for translation - translation can only be determined up to a scale factor eliminating one degree of freedom. Eliminating the depth parameters has the advantage of reducing the computational complexity of the estimation process. Other advantages include proofs for the existence and uniqueness of the solution. The main disadvantage is that the objective function should be weighted by a function that relates the image measurement error to the particular formulation of the coplanarity constraint. In some formulations of the coplanarity constraint this weighting is not included leading to poorly conditioned solutions and a sensitivity to measurement noise. When the weight term is included, further assumptions are required in order to simplify an otherwise complicated objective function due to the presence of the parameters to be estimated in the weight term. The collinearity constraint for the perspective camera model includes both the motion and the structure parameters. The advantage of this constraint is that the objective function can be formulated very simply in terms of directly minimizing the residual of the observation relative to the model prediction. This residual is just the localization error of the image measurements which is where the error originates and is therefore the ideal measure of departure from the bestfit.The main disadvantage is the increase in the computational complexity of the minimization due to the larger parameter space and the inability, to date, to derive existence and uniqueness proofs for the solution. The formulation that implements the invariance of angle and inter-point distance for a rigid body for the perspective camera model appears to be the weakest of the three formulations. Typically, the formulation does not include a weighting that relates image measurement noise to the distance and angle invariance constraints. Experimental results reported in the literature suggest a sensitivity to measurement noise and poor conditioning of the solution [68, 109]. The iterative minimizer also requires the initial estimate to be very close to the correct solution. Another disadvantage is the inability to prove the uniqueness of the solution since the potential number of solutions is so large [46]. This section reviews the constraint formulations for the two calibrated cameras relevant to the rigidity checking method, the perspective and scaled orthographic camera models. CHAPTER 2. 2.4.1 34 BACKGROUND Coplanarity constraint Perspective camera. The coplanarity constraint describes the coplanar relationship that exists between the two rays from a scene point to the two projection centers and the vector that joins the two centers of projection. This constraint can be written as l'-(bxr) = 0 (2.14) where b is the translation vector, 1' is the projection vector in the left camera frame rotated into the right camera's coordinate frame, and r is the projection vector in the right camera frame. Horn [41] describes this constraint as the scalar triple product [l'br]. Scaled orthographic camera. Since the centers of projection are at infinity for this camera model, the coplanarity constraint is expressed in terms of the rotation in variance of the projection of the scene point onto the rotation axis projected onto the image plane,firstdescribed by Kontsevich [52]. This constraint can be written as e • v = e' • v' (2.15) where e and e' are corresponding vectors defined by the projection of the vector defined by a pair of correspondences . The vectors v and v' encode the orientation of the rotation axis with respect to their 4 image coordinate systems. All corresponding points satisfy equation (2.15), hence, all epipolar planes are parallel. Figure 3.1 depicts the epipolar geometry for two scaled orthographic cameras. Note that v and v' are parallel in the rotation frame of reference, which follows from the rotation invariant. 2.4.2 Collinearity constraint Perspective camera. The collinearity constraint follows from the fact that the scene point, image point and the center of projection all lie on the same ray. From similar triangles this constraint is expressed as u V -/ Pz P X . p y . See Figure 3.2 where e corresponds to e and e' corresponds to e . The vector e is the projection of a vector r in space such that r = P - P where P and P are 3D world coordinate vectors and P is the centroid of the set of correspondences or one of the correspondences designated as the origin of the world coordinate frame. 4 1 0 0 3 0 35 CHAPTER 2. BACKGROUND where P is the scene coordinate vector, w is the image coordinate vector and the image plane is given by z = —/. The relation between a pair of corresponding points for two views for the collinearity constraint can be written as (w(m) - w) + (w'(m') - w') = 0 where w and w' are the models that relate the observations w and w' to the motion and structure parameters m and m'. Scaled orthographic camera. In Chapter 4 it is shown that the coplanarity constraint (2.15) due to Kontsevich is equivalent to the collinearity constraint for a pair of affine cameras. This equivalence holds for the scaled orthographic camera as well, since the proof does not make use of the internal composition of the camera matrices M , M ' , t and t'. The following expression for the collinearity constraint for the affine camera, and equivalently the scaled orthographic camera for a different decomposition of the camera parameters M , M ' , t and t', is due to Shapiro [90] : t , t , X ) = ( x - ( M X + t)) + ( x ' - (M'X + t')), , , 1 (2.16) where M and M ' , t and t' are the camera model parameters, x and x' are the observations and X is the scene structure vector. The cost function subscript is tk since Reid and Murray [80] have shown that the factorization approach due to Tomasi and Kanade [102] minimizes an objective function equivalent to this constraint. 2.4.3 Invariance of distance and angle constraint Perspective camera. The principle of invariance of distance for rigid objects can be exploited to determine 3D structure and motion of objects in space from two or more images [68, 109]. The distance between a pair of scene points P; and Pj is given by (X - X ) t 3 2 + (Yi - Y ) 2 3 + (Zi - Z). 2 3 CHAPTER 2. 36 BACKGROUND The relation used to describe the position of a scene point from its projection is based on the collinearity constraint and, for the first camera frame, C\, is given by Xi = XiXi Yi = Xijji Zi = (1 - Xi)fi where / i is the focal length for C\, the optical axis points in the negative Z direction, the center of projection is at [0, 0, fi] and A- > 1. Similarly for the second camera, C , T t 2 Ui = jiUi Vi = jiVi Wi = (1 - 7i)/ . 2 These expressions are referred to as the scalar form of the projection relations. The distance between two points in C\ using the projection relations is given by <f > = (XiXi - XjXj) 2 { A similar expression, + (X,yi - Xjyj) 2 + (A,- - Xj)f . 2 can be written for C . Assuming that the same units of measurements are 2 used, the two distance expressions can be equated, df> = df? or (XiXi - XjXj) 2 + (X yi - Xjyj) 2 t + (Xi - Xj)f 2 = ( 7 ^ - jjUj) 2 + ( 7 ^ - jjVj) 2 + (7,- - 7 j ) / • 2 2 This expression is an homogenous quadratic in the unknowns. Five point correspondences yields ten equations in nine unknowns since the solution is known up to a scale factor, hence, one of the unknowns can befixedto set the scale of the solution. Huang and Netravali [46] note that by Bezout's theorem there can be at most 2 = 512 solutions and attempts tofixthe bound on the number of solutions have 9 not been successful due to the large dimensionality of the solution space. Scaled orthographic camera. For two views, the constancy of distance constraint is undefined since the structure is only recoverable up to a single parameter family. However, for three views, the constancy of distance constraint is defined and is determined by the intersection of two families of structure for the two sets of image pairs from three views. Kontsevich [52] utilizes the intersection method to recover rigid structure from three views under scaled orthographic projection. CHAPTER 2. BACKGROUND 2.4.4 37 Relationship to structure-from-motion constraints Huang and Netravali [46] review a set of algorithms for recovering motion and structure from two or more views of a scene. They classify the algorithms according to the dimension of the correspondence mapping, e.g., 2D to 2D or 3D to 2D, the order in which parameters are recovered, the camera model and the feature type. Among other factors, they emphasize the nature of the objective function, existence and uniqueness of the solutions and the sensitivity of the solutions to noise. For point correspondences they partition the solutions into two groups according to the order the parameters are solved for, i.e., structure first or motionfirst.These groups correspond respectively to the distance/angle invariance constraint and coplanarity constraint groups discussed above. The algorithms that implement the collinearity constraint are subsumed by the motion-first category by treating them as least squares (over constrained) versions of the (minimally constrained) coplanarity constraint relations. This obscures the geometric difference between coplanarity and collinearity which in turn confounds the superior results of the collinearity based methods with those based on coplanarity. The experimental evidence suggests that the best results for structure-from-motion in terms of robustness and precision are due to the minimum variance estimator versions of algorithms which implement the collinearity constraint [65, 112, 3, 99]. To summarize, given a set of methods for verifyingrigiditysuch that each of the three epipolar constraint formulations discussed above are represented, the experimental evidence from the structurefrom-motion literature would suggest that the best performance would be expected from those methods based on the collinearity constraint. Note that the rigidity checking method described by this thesis is based on the collinearity constraint formulation for the linear and nonlinear camera models. As noted in Chapter 1, the rigidity assumption requires that the solution for structure be unique, where uniqueness is necessary for proving that the probability of falsely rejecting a rigid configuration is zero. Huang and Netravali note that forfivepoint conespondences the coplanarity constraint yields no more than ten real solutions. Six or more point conespondences over-determines the solution, so they "heuristically" expect that the solution is almost always unique. However, they note that Horn [41] and Longuet-Higgins [59] both describe a degeneracy of the solution for points that lie on a quadric surface that includes the two projection centers, thus yielding multiple solutions. An equivalent uniqueness proof 38 CHAPTER 2. BACKGROUND for the collinearity condition formulation does not appear to have been derived or does not exist. For six or more point correspondences the number of parameters is over-constrained so the heuristic expectation is that the solution is unique. 2.5 On the uniqueness of correspondence Basri [5] has analyzed the constraints on correspondence that result from rigid body motion under orthographic or perspective projection. The objective of his analysis was to use the constraints to establish point correspondences for depth recovery. The analysis is extended to include general affine transformations of the scene and to rigid objects with smooth surfaces that yield a smoothly varying occluding contour with a change in viewpoint. The main results can be summarized as follows. Rigid body motion induces a set of epipolar lines in the images such that their correspondence is determined by the transformation between views. Correspondences along the lines cannot be determined for any number of images. Epipolar lines can be determined from a small set of corresponding points, four in the case of orthographic projection and seven in the case of perspective projection. Note that the requirement of seven is determined by the objective function due to Tsai and Huang [104] based on the so called essential matrix. The essential matrix represents the Euclidean motion parameters for the coplanarity constraint formulation of the epipolar geometry. The coplanarity constraint given by equation (2.14) can be rewritten as r • (b x R 1) = 0. This can be written in matrix form as (r) E 1 = 0 T where E, the essential matrix, is a 3 x 3 matrix defined as E = ei e e 4 e e 7 e e 3 5 e 6 8 e 9 2 39 CHAPTER 2. BACKGROUND where R is a rotation matrix and G is denned as G = f 0 -6 63 0 -61 -62 h 0 3 6 2 , a skew-symmetric matrix that implements the vector cross-product as a matrix product. Seven correspondences is a sufficient but not a necessary minimum required for a unique solution for the essential matrix in the absence of a set of side constraints described in [46]. The uniqueness of 5 the essential matrix leads to the uniqueness of the transformation variables and the epipolar geometry for their formulation of the objective function which is a variant of the essential matrix method due to Longuet-Higgins [58]. The results for rigid body motion carry over to the case of general affine transformations of the scene. This parallels the results for the affine camera due to Shapiro [90]. In some cases correspondence can be determined uniquely. This occurs for certain degenerate motion cases. Under orthographic projection, degenerate motion is defined by a rotation about the projection direction and an arbitrary translation. Under perspective projection a rotation about the center of projection is degenerate. The existence of multiple images does not yield a unique solution for correspondence, however, heuristic methods exist to solve the problem uniquely by helping to eliminate spurious solutions. The basic idea for three images is to establish an hypothesized set of matches between a pair of images and check for their presence in the third image at locations predicted by the epipolar geometry recovered by the hypothesized matches. This method follows an hypothesis verification heuristic. 2.5.1 Sources of matching ambiguity at keypoints The radiance of a surface element in the direction of the camera can be approximately modeled by a continuous function in three components, the surface irradiance, the normal vector to the surface element, and the surface reflectance . Image contours usually conespond to discontinuities in one or more of these 6 The side constraints issue from the particular structure of the essential matrix. The inclusion of the side constraints heuristically leads to the expectation of a unique solution as noted earlier. Note, however, that there exist degenerate configurations for seven or more correspondences [41]. 5 6 Light sources visible in the scene will not be considered here. CHAPTER 2. BACKGROUND 40 components over some angular interval of thefieldof view. The four principal types of image contours are [2]: 1. occlusion: these contours are caused by a discontinuity in surface orientation, reflectance or irradiance, or a combination of discontinuities in these three physical attributes, due to the presence of a surface partially interposed between the camera and another surface. 2. ridge: caused by a discontinuity of the surface normal of a continuous surface. 3. marking: caused by a discontinuity of the surface reflectance. 4. shadow: caused by a discontinuity of the magnitude of the surface irradiance. Ridge, marking and shadow contours are independent of the viewpoint and thus give rise to useful correspondence tokens. Occlusion contours are not independent of the viewpoint, and thus give rise to tokens with unstable attributes when the viewpoint changes, hence, they are not suitable for correspondence. The problem of matching images along boundaries defined by depth discontinuities stems from the problem of localizing the scene points in the images at these boundaries. Depth discontinuities occur between opaque surfaces at the occluding contour of the nearer surface which is defined by those viewing directions which are tangential to the surface. For a smooth surface the scene points that projectively define the occluding contour in one image are generally not the same ones in another image taken from a different viewpoint. This situation leads to considerable ambiguity in matching since the contour not only changes position and shape due to the change in viewpoint but also changes according to the surface curvature [5]. However, tokens from occluding boundaries can be relatively stable if the occluding and occluded surfaces are close together relative to the distance of the occluding surface from the camera and may yield a token reliable enough for correspondence. Correspondence tokens labeled 2 and 13 from Figures 1.5 and 1.6 are examples of relatively stable correspondence tokens at an occlusion boundary for the given camera motion. The keypoint labeled 13 is due to the desk panel occluding a wire hanging down from the workbench such that there is a small distance, one or two feet, between the panel and the wire. The occlusion boundary of the desk panel is orthogonal to the wire, hence, for the given camera motion, which is a pure translation in depth and a CHAPTER 2. BACKGROUND 41 small rotation down, the point of tangency at the desk panel boundary of the projection ray to the wire is relatively motionless, i.e., the tangent ray almost lies in the plane defined by the wire direction (vertical) and the translation vector. In general, the stability of occlusion boundary correspondence tokens depends equally on the motion and the relative geometry of the occluding and occluded surfaces and the initial viewpoint. 2.5.2 Additional matching constraints The following constraints can be useful to disambiguate hypothesized matches if the epipolar geometry has been well established. To the extent thatrigiditychecking does not assume knowledge of the epipolar geometry, thus defining a much more difficult problem, these constraints can only be applied at a much later stage where, for example, a dense matching between images is desired. Most of these heuristically derived constraints from the structure of the physical world are well established in the stereopsis literature [64]. The second constraint is more appropriate for continuous matching rather than for sparsely distributed features. 1. Order. Assuming that the object surface is opaque and continuous, contour segments and points maintain their relative spatial order from different viewpoints. 2. Disparity gradients. The disparity gradient is useful for determining search bounds for an assumed range of depths for the scene point. For point tokens, the similarity of the disparity for neighbouring tokens is a useful constraint. This constraint is violated, however, for adjacent point tokens on different surfaces spanning a large depth range under perspective projection. Occlusion also complicates the use of this constraint. 3. Parallelism and symmetry. Scene structures that are parallel or symmetrical about some axis tend to remain that way over significant viewpoint changes. This is a strong constraint under orthographic projection since orthographic projection preserves parallelism that exists in the scene. 4. Uniqueness. A correspondence token can only match one token from the other image. This is violated only rarely, e.g., a token on a transparent surface aligned with a token on an opaque surface have the same location in the image but could match two different tokens at different locations in the 42 CHAPTER 2. BACKGROUND other image. Another situation that violates the uniqueness constraint occurs when four spatially separate points are aligned so that only two are visible from each viewpoint and all the points project to the same epipolar line for each image [2]. One constraint that is applicable without knowledge of the epipolar geometry when a static scene is assumed, is token orientation [108] which, for point features, can be determined from the smoothed gradient direction at the feature point. Under a set of conditions that define the invariance of the twodimensional brightness pattern with respect to three-dimensional motion, the component of rotation about the optical axis should change the orientation of all tokens equally. For a set of token correspondences, if the orientation change of a token is not approximately the same as that of the remaining tokens then a heuristic expectation is that the token correspondence is not likely to be correct. This matching constraint has proven to be a useful part of the rigidity checking method. 2.5.3 Stable recovery of motion and structure As a side note, Basri remarks that for distant compact objects the transformation variables under perspective projection are not well conditioned since the perspective distortion that disambiguates translation from rotation and determines scene structure can be very small. Poor conditioning of the transformation variables also occurs when the translation component in depth vanishes in which case the epipolar lines are parallel. These situations are well approximated by the orthographic projection model and the epipolar geometry can be stably recovered. The structure of the rigidity checking algorithm is motivated, in part, by these observations. 2.6 Nonlinear least squares estimation The nonlinear least squares problem is given by 1 1 1 m minimize </>(m) = -||F(m)|| = - F ( m ) F ( m ) = - V / ^ m ) meR n 2 2 T 2 I tr-i where m > n and the residual function F : R —r R n m 2 (2.17) 43 CHAPTER 2. BACKGROUND is nonlinear in the parameters m, and /,• (m) denotes the ith component of F. Dennis and Schnabel [20] discuss the nonlinear least squares minimization problem and give two formulations for its solution based on classical results from unconstrained minimization and the solution of systems of nonlinear equations. The first approach is generally referred to as the Gauss-Newton method and an improvement to the basic Gauss-Newton method is given which is called the LevenbergMarquardt method. The Levenberg-Marquardt method can be considered as a constrained form of the linear least squares solution for nonlinear least squares estimation [69]. The second approach is related to solutions for unconstrained minimization and leads to a Newton's method for nonlinear least squares incorporating secant methods. Dennis and Schnabel note that general purpose solutions for unconstrained minimization applied to nonlinear least squares problems are not recommended since they do not take advantage of the special structure of the objective function (2.17). The treatment of nonlinear least squares given below follows that of Dennis and Schnabel [20] and More [69]. In the context of rigidity checking, for the unbiased nonlinear estimator , m in equation (2.17) is 7 the set of model parameters to be estimated for motion, structure and the reference image measurement biases and the residual function F is to fit the given image observations w and w with a model C(m). The residual component /; in this case can be written / ; = w; — C (m). The computational objective is tofindthe m that yields the bestfitaccording to the criterion that the sum of the squares of the residuals is minimized. The Gauss-Newton method is derived from the affine approximation of the residual function F about the point m 8 c M (m) = F(m ) + J ( m ) ( m - m ) c c c c (2.18) where J is the Jacobian matrix of the residual function F and is given by _ dfjjm) J t 13 dnij Recall from Chapter 1 that the term "unbiased" or "biased" nonlinear estimator refers to whether or not the measurement bias terms for the reference image are incorporated in the nonlinear camera model. Dropping the measurement bias terms gives the "biased" version of the objective function. 7 As noted in [20] this may also be referred to as the linear model, but affine makes explicit the property that the approximation does not necessarily contain the origin. 8 CHAPTER 2. 44 BACKGROUND where the dependency of F on the observation variables is dropped for notational convenience. In general, it is not expected that there exists an m» such that M (m*) = 0 exactly, since the c linear system of equations is over-determined . One way to proceed is to minimize the residuals of the 9 affine approximation which is a linear least squares problem and is given by minimize - | | M ( m ) | | . m£R" 2 (2.19) 2 c Let m + denote the next iterate as the solution to the linear least-squares problem (2.19) which will be used to solve the nonlinear least-squares problem. Assuming that J has full column rank the solution, m , that minimizes (2.19) is given by + m+ = m - ( j ( m ) J ( m ) ) J ( m ) F ( m ) T c -1 c (2.20) T c c c The geometric proof follows from the full column rank assumption for J . The point m which + minimizes the distance to the point J ( m ) m - F(m ) relative to the Euclidean norm, is the m which c c c + orthogonalizes the error vector J (m ) m - ^ J (m ) m - F (m ) ^ relative to the column span of J (m ). c + c c c c Thus, m must satisfy + J ; ( m ) ^ J ( m ) m - (j(m )m - F(m ))^ = 0, T c c + c c c for all i where J;(m ) is the ith column of the Jacobian matrix J . Equivalently, this can be written as c J ( m ) ( j ( m ) m - (J(m )m - F(m ))) = 0 (2.21) T c c + c c c and expanding and simplifying this expression leads to the solution (2.20). The proof concludes by noting that if the columns of J(m ) are linearly independent then the m c + that yields the point J ( m ) m c + is unique and is the solution to the system of equations (2.21). Algebraically, an alternative proof of the same result is arrived at, upon noting that it is sufficient that the Hessian of | | | M ( m ) | | is positive definite, if the partial derivative of | | | M ( m ) | | with respect 2 2 c c to m is set to zero and the resulting system is solved for m , a necessary condition for m + + minimizer of (2.19). 9 Note also that the optimization problems addressed by this thesis are not expected to be zero residual problems. to be a CHAPTER 2. 45 BACKGROUND The iterative solution that uses (2.20) on each iteration is called the Gauss-Newton method. The Gauss-Newton method, with the Levenberg-Marquardt extensions, is used to perform the minimization of the rigidity checking objective function. Before proceeding with a discussion of the LevenbergMarquardt method a brief summary of the full Newton solution for nonlinear least-squares will be given which will help to provide some insight into the trade-offs between Gauss-Newton and Newton's method. Dennis and Schnabel note that a Newton type method with an appropriate secant update for the second order terms of the Hessian matrix is considered most desirable for problems with expected largeresiduals. Gauss-Newton and Levenberg-Marquardt methods are considered more suitable for zeroresidual and small-residual problems. As we shall see, the trade-offs between these methods favour the Levenberg-Marquardt method for the nonlinear least squares minimization problem addressed by this thesis. The minimization of (2.17) by Newton's method proceeds from a local quadratic model of </>(m) at the current point m , and it is given that <f>(m) is twice continuously differentiable. The quadratic c model, M ( m ) can be written as c M (m) c = <£(m ) + V ( £ ( m ) ( m - m ) + i ( m - m ) V ^ ( m ) ( m - m ) = i F ( m ) F ( m ) + F ( m ) J ( m ) (m - m ) + T c T c c T c c T c i(m- 2 c c c c c m ) (j(m ) J(m ) + S(m ))(m- m ) T c (2.22) T c c c c where m S(m) = £ / - ( m ) V / , - ( m ) »'=i (2.23) 2 t is the second order information of V ^(m) and V0(m) = J(m) F(m). 2 T Newton's method determines the step SJV that makes the affine model of V^>(m) at the current point m equal to zero. The Newton step corresponds to the root of the local affine model of the gradient c of the quadratic approximation of the objective function, i.e., find the root of V M ( m c c + SN). This constraint follows from the necessary condition that for m -|-s AT to be a minimizer of M (m) the gradient c VM C must equal zero at that point. c 46 CHAPTER 2. BACKGROUND Now, V M ( m + s ) = V(£(m ) + V <f>(m )s 2 c c N c c N and setting this to zero yields V <f>(m )s = - V ^ ( m ) 2 c N c thus SAT = - Newton's method applied to <f>(m) is therefore given by iterating m + = m, - ( j ( m ) J ( m ) + S(m )) T c c c c 'J(m ) F(m ) T c c For m + sjv to be the unique minimizer of M ( m ) , it is sufficient that V 0(m ) is positive definite 2 c c c and it is necessary that V 0 ( m ) be positive semi-definite. It is known that the rate of convergence is 2 c quadratic, when certain conditions are met, that is, | | m j i — m*|| < c||mj — m*|| if mj is sufficiently 2 + close to m», where m.j is the jth iterate, and m» is the minimizer of <^(m). Newton's method differs from Gauss-Newton by the presence of the second order information S(m). This follows from the difference in the quadratic models (2.19) for the Gauss-Newton method and (2.22) for Newton's method. Dennis and Schnable prove that for the zero-residual problem, i.e., S(m„) = 0, the term S(m) is not important to the Gauss-Newton method and under the standard conditions Gauss-Newton is quadratically convergent near the minimizer. If S(m*) is small relative to J(m ) J(m*) where small is defined by |/;(m)| ||V /;(m)|| <C A ; for i = 1,2,..., m where \ in T 2 J|> m ra m is the smallest eigenvalue of J(m, ) J(m*), then the Gauss-Newton method is locally linearly converT ! gent. It is generally agreed [76,20,95] that if the residuals are small then the S(m) term can be ignored with little or no loss in performance. The residual function components, /,-, for the rigidity checking nonlinear camera model, are assumed, upon convergence, to be uncorrected, have a small magnitude, and have a random sign value so that the expected value of their sum should be small, hence, S(m*) from equation (2.23) is expected to be small as well. CHAPTER 2. 47 BACKGROUND Levenberg-Marquardt The Levenberg-Marquardt algorithm can be viewed as a model-trust region modification of the GaussNewton method. The "model-trust region" approach to minimization requires specifying a value, S , the c maximum length of a successful step likely to be taken from x , the current model point. The value S c c provides a region in which the current local quadratic model, | | M | | , can be trustedto adequately model 2 C the objective function <f>. Levenberg-Marquardt steps, are determined by solving sub-problems of the form minimize |^||F(m ) + F'(m )s || : ||D*s || < A 2 fc where is a scaling matrix and A fc fc fc > 0. The trust region radius k f c | (2.24) is adjusted according to the agreement between the predicted residual reduction and the actual reduction from step s^. One such test is given by ||F(m )|| -||F(m 2 fe 9 -4 + Sfc )|| 2 ||F(m )|| - ||F(m ) + F'(m )s || ' 2 2 fc If P > 1 0 fc fc fc fc (or a similarly small number) the step is accepted otherwise the trust region is decreased and a new step is calculated. If p « 1 then the trust region radius is increased. It is known that for D/t = I, where I is the identity matrix, the solution to the constrained minimization problem (2.24) is given by s = -(j(m ) J(m ) +Aa) J(m ) F(m ) T k where Ak = 0 if _ 1 fc fc (2.25) T f c f c ||(J(mA;) J(mfc)) J(mjt) F(mfc)|| < A^, otherwise, A^ > 0 such that T -1 T ||s&|| = Afc. The result given by equation (2.25) is derivable from optimization theory where A^ is a Lagrange multiplier. Various methods exist for updating A^ and A t and the reader is referred to [20] for details. The convergence properties of Levenberg-Marquardt method are similar to those for the GaussNewton method. The Levenberg-Marquardt method is recommended due to its improved robustness when the Jacobian matrix does not have full column rank. If the Gauss-Newton step is too long then the Levenberg-Marquardt method moves closer to the steepest descent direction —J(mfc) F(mfc) as a T function of A&. Equation (2.25) is the solution for the Levenberg-Marquardt step from the normal equations CHAPTER 2. 48 BACKGROUND for the linear least squares problem J(m ) F(mjb) -\AfeDfc 0 fc (2.26) written here to include the scaling matrix D^, i.e., I in (2.25) is replaced with diagonal matrix DjTDfc upon forming the least squares normal equations. The scaling matrix Dyt scales the step so that each component contributes equally towards reducing the residual for the same step length in each component direction. More [69] notes that typically da, the ith component of the diagonal matrix D^, is given by d\V = ||djF(mfc)||, k > 0. He notes that this works in general as long as ||c?;F(mfc)|| does not increase with k. However, if it does increase, then the scaling in the ith direction, A^/da, should decrease since the ith parameter is becoming less reliable for large steps in that direction due to the increased sensitivity of the residual function in that direction. The geometric model of the scaling matrix D is an hyper-ellipsoid with axes in the coordinate directions m,- with semi-axis length Ak/du. More suggests an update for such that 4> = mzx{4*- \\\d F(m )\\} 1 i k k>l. Note that scaling is not an issue for a Newton step since the step is determined byfindingthe root of the affine approximation to the local quadratic model, given by the solution to the system V / = —V / 2 or similarly with the Gauss-Newton (or Taylor series) step determined from J JS = — J F . It can T T t be shown that the Newton step is independent of the scaling of the variable space [20]. Intuitively, the minimum of a quadratic model is independent of the scaling of the independent variable space. However, this is not the case for steepest descent since steepest descent depends on how a unit step is defined in the component directions, and, since Levenberg-Marquardt varies continuously between an approximate Newton step and steepest descent, the importance of scaling becomes apparent. Chapter 3 Rigidity checking 3.1 Introduction An algorithm is given for accurately and rapidly verifying the potential three-dimensional (3D) rigidity of point correspondences from a pair of two-dimensional views under perspective projection. Our motivation comes from the problem of finding corresponding point features between two or more disparate views of an object. The output of the method is a simple yes or no answer based on the residual error of a minimum variance estimator. The rigidity verification approach proposed here is shared by other methods for verifying point correspondences using 3D recovery equations. The algorithm, however, subt stantially improves upon other methods because it works with as few as six corresponding points under full perspective projection, handles correspondences from widely separated views, makes full use of the disparity of the correspondences, and is integrated with a linear estimator based on a weak perspective model. The matching condition for verifying rigidity is based on a set of 3D scene recovery constraints whose satisfaction minimizes the residual error of an iterative optimization algorithm for the nonlinear camera model. Although iterative methods can be computationally intensive, an accurate answer to the rigidity question is computed quickly for two main reasons. First, a reasonably good initial parameter estimate is computed from an algorithm for a linear camera model, and secondly, the nonlinear model 49 CHAPTER 3. RIGIDITY CHECKPSG 50 for 3D recovery makes full use of the image disparity. The 3D recovery equations proposed here are derived from the collinearity condition of the scene and image points under perspective projection, which provides a natural and integrated approach to the simultaneous estimation of relative motion and scene structure. Matching point-feature patterns between images is fundamental to computational vision as evidenced by the large body of literature addressing this problem. It is intrinsic to such tasks as stereo, structure-from-motion and model based recognition. Investigation into the correspondence problem for biological vision has a long history as well [14, 1]. Applications which require point correspondences include, for example, the linear combination of models approach to object recognition due to Ullman and Basri [107] or the view interpolation scheme due to Seitz and Dyer [88]. Bennett et al. [10] derive recognition polynomials that assume point correspondences have been established between a single 2D previous view of an object and a novel view under orthographic projection. The motion segmentation problem has been addressed by methods which determine point correspondences. The approach of Thompson etal. [101] exploits the rigidity constraint under orthographic projection to find sets of points that are inconsistent with the motion of the camera relative to the static environment. Nishimura et al. [70] use the epipolar constraint to determine point correspondences and segment out differently moving objects under weak perspective projection. The determination of disparities in stereoscopic images also requires finding point correspondences [28]. The rigidity checking method is inspired by recently published ideas for recognition by Bennett etal. [10] and Kontsevich [52] and has similar goals to the work of Wei, He and Ma [110]. The algorithm proposed here embodies the rigidity assumption first postulated by Ullman [105]. The rigidity assumption forms an organizing principle for the analysis of changing images due to the relative motion between the observer and the scene. The rigidity assumption can also be used to select corresponding points in stereoscopic images. Bennett cites evidence that the rigidity of motion is a key principle in organizing the world into discrete objects [10]. Marr provides a good overview of the utility of the rigidity assumption and places its history in context up to circa 1982 [64]. There is a body of work in the psychology literature which also deals with the perception of rigid and nonrigid 3D point configurations (see [15] and references therein). CHAPTER 3. RIGIDITY CHECKING 51 The matching condition for verifying the accuracy of point correspondences under the rigidity assumption is based on a set'of 3D recovery equations that are generally referred to as stmcture-frommotion equations. Accurately estimating the stmcture-from-motion parameters from a small set of point correspondences under perspective projection is an ill-posed problem. In general, if less than eight correspondences are available from two views then nonlinear equations must be solved to yield the parameter values. The resulting system of equations is inherently unstable given noisy observations, requires an initial estimate near the global minimum, and yields multiple solutions. Also, if the point features are projected approximately orthographically then a family of solutions exist for motion and structure. Given that the problem is ill posed, it follows that a measure of goodness for matching should be based on a criterion which is not a direct function of the estimated parameters or an estimate of the variance of the error in the estimated parameters. The problem of verifying the potential rigidity of a configuration of 3D points from a pair of views is reduced to an algorithm which minimizes the residual of a nonlinear objective function in the image coordinates. By only considering the residuals of the data fit to the stmcture-from-motion model in the verification process, potential difficulties that could result from the use of the estimated parameter values are avoided. This verification process we call rigidity checking and it is worth noting that rigidity checking provides a mechanism forfindinggeometric invariance from 2D views. It is assumed that images are formed by a perspective projection and the camera model assumes knowledge of a scale factor that is related to the camera's image size and focal length. It is also assumed that the observation variances are known or can be measured. The rigidity decision is based on the residual of an iterative least squares estimator based on the Levenberg-Marquardt method [56] [63]. The convergence rate of the nonlinear estimator is improved substantially by an initial estimate of the rotation about the optical axis and a single-parameter family of relative depths computed from a linear stmcture-from-motion algorithm due to Kontsevich [52]. Kontsevich's algorithm assumes images are formed by a scaled orthographic projection. Recent studies provide analysis and experimental evidence of the reliability of estimating the component of rotation about the optical axis [70, 93, 43]. By only relying on the residual offittingthe image observations to the stmcture-from-motion model for verifying potentialrigidity,with the ability to handle ill-conditioned solutions and widely separated viewpoints, CHAPTER 3. RIGIDITY CHECKING 52 and by working with a minimal number of correspondences, rigidity checking is seen to be different from traditional structure-from-motion. 3.2 Related work for the task of verifying point correspondences 3.2.1 Correspondence verification and orthographic projection Nishimura et al. [70] apply the epipolar constraint to the problem of correspondence and motion segmentation for the orthographic camera model. They use an hypothesize-and-verify framework for subsets of four point conespondences to determine sets of matches consistent with the same epipolar geometry by accumulating estimates of the fronto-parallel rotation and scale change that are derived from the epipolar geometry in a Hough transform scheme. They assume that the viewpoint change between views is small to reduce the combinatorial complexity of generating subsets of hypothesized matches. f Sudhir et al. [98] present related work for finding point correspondences of a moving object un-r • der affine (not necessarily rigid) motion for weak perspective projection and a minimum of three views. The affine motion assumption yields a sufficient condition for determining conespondences uniquely for three views under scaled-orthographic projection by verifying that a set of hypothesized correspondences uniquely satisfies an affine structure criterion. Two algorithms are given for finding a maximal set of correspondences. Thefirstalgorithm is a search (hypothesize and verification) method that assumes at most k possible matches for each point in the reference image in each of two other images. The worst-case complexity of this algorithm is 0(nk ) w where n is the total number of feature points from the object. The second algorithm uses a relaxation labeling framework solved with a cooperating team of learning automata. They assume a small motion between views to reduce the combinatorial complexity of matching real images of an object. Test results show that the relaxation labeling scheme is the slower method since floating point computations are required for generating random samples and for updating a set of probability vectors. Huttenlocher and Kleinberg [47] have published a theoretical result for the problem of deciding whether two noise-free views of a 3D point set are of the samerigidobject under orthographic and central projection. Their verification step for an hypothesized labeling of the points is based on exactly the same CHAPTER 3. RIGIDITY CHECKING 53 constraint as Kontsevich's for orthographic projection. Jacobs and Chennubhotla [19] describe an algorithm that determines the structural consistency of a set of correspondences sharing the same 3D affine motion tracked over a sequence of images for the affine camera model. The method assumes that image measurement error is bounded by a known value. The set of all noise-free affine coordinates for any instance of a 3D scene can be described by two parallel lines, each in a one parameter affine coordinate space. For noisy image coordinates, the lines are represented by axial boxes. The structural consistency test consists of finding a pair of parallel lines that are transverse over the entire set of axial boxes in both affine coordinate spaces defined by a sequence of tracked image points. The problem of finding the pair of parallel transverse lines is solved using linear programming in 0(n) time for n images of a fixed number of tracked points. The use of an affine coordinate space requires that the points that define the affine coordinate basis vectors exist over the entire sequence. It is not necessary that the other tracked points be visible over the entire sequence. The method yields good results over a real image sequencefindingmost but not all structurally inconsistent matches. The method, however, is sensitive to the choice of points that define the affine coordinate basis vectors. Grzywaczand Yuille [31] propose two solutions for the problem of determining correspondences for long-range apparent motion. Long-range apparent motion refers to the psycho-physical phenomena of determining the correspondence of tokens, and consequently the scene structure, when the views are separated in space and time by a large amount, i.e., the viewpoint or object orientation and location changes by a non-trivial amount. Thefirstsolution is a modification of the minimal mapping theory due to Ullman [105] and the second solution is referred to as the structural theory. The structural theory hypothesis is that rigidity alone (the basic assumption used to recover the 3D structure from motion) is sufficient to solve the correspondence problem (and simultaneously the structure-from-motion problem) for images formed under orthographic projection. They utilize an incremental rigidity scheme that minimizes the deviation of the rigidity of the recovered structure between views over a sequence of views. The deviation of rigidity is formulated as N AR = J2(LiAt)-^(t-8t)) 2 hi CHAPTER 3. RIGIDITY CHECKING 54 where L{j is the distance between scene points i and j. The structural theory simultaneously solves the stmcture-from-motion and correspondence problems by finding the correspondences which, upon recovering structure according to the incremental rigidity scheme, yields the minimal AR. They report that the algorithm that implemented the structural theory rarely converged to the correct set of correspondences unless given some initial indication of the correct matches. Note that the method assumes that the depths of the feature points are known in thefirstview. Grzywacz and Yuille note that the solution space explored by the structural theory is complex, i.e., has many local minima. This is not surprising for two views since it is known that two views of orthographically projected scene points yields a one parameter family of solutions for structure and motion. The structural theory works with two views at a time, incrementally adjusting the structure to maximize rigidity, but given the non-uniqueness of the solution, it may be too much to expect a well conditioned problem. Three views yields a unique solution for stractufe and motion and modifying the theory to work with three views at a time may yield a much better conditioned problem at the expense of increasing the combinatorial complexity of the matching problem. 3.2.2 Correspondence verification and perspective projection The discipline of photogrammetry is closely related to computational vision. They differ primarily in their goals: photogrammetry is motivated by the requirements of cartography and aerial surveying while computational vision is concerned with discovering and understanding the constraints that make vision possible in the biological domain or for providing machines the sense of vision [97]. The stmcture-frommotion constraints used by the rigidity checking method and numerous other methods in computational vision have been used for decades by photogrammetrists [94,119]. Chen [16] describes a method for verifying the correctness of correspondences which is incorporated in a photogrammetry workstation system for providing high precision, real-time, automated, photogrammetric three-dimensional measurements. The thesis due to Chen is an example of the growing trend in photogrammetry, and especially in the area of close-range photogrammetry, to incorporate results from machine, robot, and computational vision [120] into their systems, largely driven by the emergence of digital photogrammetry and the desire CHAPTER 3. RIGIDITY CHECKING 55 to automate a variety of tasks. The flow of results from photogrammetry into computational vision is not so apparent and is acknowledged by relatively few researchers with the exception, for example, of [40,41,77, 97]. The basis for the verification method developed by Chen is a 3D intersection constraint. An iterative scheme is proposed for 3D target matching which improves the estimation of the camera relative orientation parameters by combining the 3D intersection constraint with the bundle adjustment method. 1 The 3D intersection constraint is based on the collinearity constraint also used by the rigidity checking method. Chen notes that the collinearity constraint can include parameters for lens distortion, which improves the accuracy of the estimated structure parameters, an important performance goal of the system. Chen does not make clear the minimum number of correspondences required to perform the bundle adjustment. The number must be relatively large since several views are assumed available and the intrinsic camera parameters and lens distortion parameters are also estimated. The tolerance value for target matching in space is closely related to the error in the 3D target coordinates. The tolerance value for matching is the RMS error of the computed 3D point coordinates. Knowledge of the covariances of the image measurements is assumed. His verification method differs fromrigiditychecking in that he assumes knowledge of the relative orientation of the cameras from a previous matching stage with a known target configuration - what he calls a three side rectangle method. Also, his matching decision is based on a 3D RMS error value whereas therigiditychecking decision is predicated on the residual of the model fit of the image data. The rigidity checking criterion is the more appropriate one unless the covariance of the motion and image measurement error is propagated to the structure covariance estimate and the problem is well-conditioned - often the situation forrigiditychecking is that the problem is ill-conditioned since the method assumes only two views are available and the number of correspondences is at, or near, the required minimum. This error propagation is subsumed by the bundle adjustment method which is not fully described by Chen so it not known if the error propagation is fully specified, and, since Chen assumes the availability of many views and correspondences, the problem is likely to be well-conditioned. 'Bundle adjustment refers to the simultaneous optimization of a set of parameters that relate the observations to the observation model, usually by weighted least-squares methods. CHAPTER 3. RIGIDITY CHECKING 56 A similar constraint for verifying correspondences for multiple images and known camera orientations has also been reported by Bedekar and Haralick [6] and Cheng et al. [17]. Bedekar and Haralick specify a decision statistic based on the residual of the model fit of the image observations in a maximum a posteriori estimation process for the scene structure variables which is similar to the rigidity checking decision model. Cheng et al. also define a similar residual function for the image observations but propose two different methods for the optimization. Thefirstmethod uses a proximity matrix idea from Scott and Longuet-Higgins modified to include known camera orientations and the second method determines a maximum network flow. A class of structure-from-motion algorithms for images under perspective projection solves a system of linear equations for a set of motion parameters. These algorithms are based on the coplanarity constraint of corresponding points from two views. The classic linear algorithm wasfirstdeveloped by Longuet-Higgins and requires a minimum of eight point correspondences to fully constrain the unknown parameters [58]. Wei et al. employ the linear formulation developed by Tsai and Huang [104] to solve the correspondence problem and estimate motion simultaneously. Their method, similar to the rigidity checking method, uses an hypothesize and test approach to determine the correct correspondences with the three-dimensional (3D) recovery equations as a matching condition. Their method, however, requires a minimum of nine correspondences for a least squares estimate compared to six for the rigidity checking method. The reliability of the matching constraint utilized by Wei et al. is reduced for images that are projected nearly orthographically or are noisy because of the inherent limitations of the epipolar constraint for verifying point matches under these conditions. Weng et al. [112] discuss the limitations of the epipolar constraint for accurately estimating the motion and structure parameters. The main point of their discussion is that the epipolar constraint recovers the motion parameters independently of the scene structure. This independence has the advantage of yielding a linear system for estimating the motion parameters but has the disadvantage that the solution space accommodates a scene structure space which includes many physically impossible scene solutions, i.e., the motion solution leads to a violation of the "depths positive" criterion for the scene structure. The "depths positive" criterion simply states that the CHAPTER 3. RIGIDITY CHECKING 57 estimated structure should yield a set of depths which are all in front of the camera. 2 The violation of the "depths positive" criterion is a result of the ambiguity of the motion solution in the presence of even a small amount of noise or when the images are projected nearly orthographically. This large scene space that includes physically impossible scenes accounts for the experimentally observed large false positive rate exhibited by epipolar constraint based methods for verifying correspondences. The rigidity checking method is based on the collinearity constraint for perspective projection and therefore makes full use of the image disparity which necessarily involves the scene structure. This constraint reduces the ambiguity of the motion solution by helping to decouple the effects of rotation from translation on the observations by embedding local structure constraints on the image disparities within the global motion constraints. Also, because the "depths positive" criterion is implicitly enforced in the simultaneous solution of motion and structure, the search in the motion parameter space is also appropriately restrained to a valid region. 3.3 Related methods for the task of structure-from-motion The rigidity checking problem is claimed not to be the same as the structure-from-motion problem since most structure-from-motion solutions assume that the correspondence problem has been solved and the rigidity checking method solves the correspondence problem for an hypothesized set of point correspondences under the rigidity assumption. The rigidity checking method is based on a matching condition derived from structure-from-motion models for the scaled-orthographic and perspective camera models. This section reviews solutions to the problem of stmcture-from-motion that use a similar formulation of the structure-from-motion model incorporated in the rigidity checking method. 3.3.1 Structure-from-motion and orthographic projection The estimator for the nonlinear camera model is initialized with a motion and structure estimate provided by a linear camera model estimator based on Kontsevich's shape-constancy constraint for a pair of If the camera coordinate system system is left-handed with the x-axis horizontal and a vertical y-axis, then the viewing direction and z (depth) axis is positive. 2 CHAPTER 3. RIGIDITY CHECKING 58 scaled orthographically projected images [52]. Kontsevich's constraint is similar to the one described by Koenderink and Van Doom [51] and Basri [5] in that it is based on the same geometric arguments concerning recovery of scale and the rotation component in the image plane of the motion between a pair of views. The main contribution of Kontsevich's algorithm is its simple linear formulation for recovering the epipolar geometry as a change of basis transformation. The algorithm avoids computing rotation angles and is easily implemented. Koenderink and van Doom do not specify an algorithm but outline the geometric constraints that determine a construction that will yield the scale and rotation parameters for the rotation component in the image plane. Kontsevich and Koenderink and Van Doom both derive the same expression for depth recovery. Basri derives the same set of linear constraints for orthographically projected images as Kontsevich, but he determines the epipolar geometry by solving explicitly for the planar rotation angles. Also, Basri does not derive an expression for depth recovery. Basri also cites Ullman [106] as having earlier derived essentially the same transformation as his but with a minimum of five corresponding points rather than four. The derivation due to Ullman cited by Basri assumes knowledge of the 2D velocity field under orthographic projection of five points from the first view and the location of the corresponding points from a second view which is sufficient to yield an unique 3D structure up to a reflection ambiguity about the image plane [106]. Other derivations for recovering the epipolar geometry and scene structure from weak perspective views are reported by Huang and Lee [45], Lee and Huang [55], Bennett et al. [9], Harris [33], and Hu and Ahuja [43]. 3.3.2 Structure-from-motion and perspective projection Methods for nonlinear objective functions The rigidity checking estimator for the nonlinear camera model is based on the scaled nonlinear least squares Levenberg-Marquardt algorithm. The estimator for the nonlinear camera model was developed earlier [65,66] to solve the stmcture-from-motion problem given a correct set of point and/or line correspondences. The estimator was based on a simplified version of the Levenberg-Marquardt method that did not incorporate the scaling matrix and stabilization methodology developed for rigidity checking nor any method for computing an initial estimate of the motion and structure. CHAPTER 3. RIGIDITY CHECKING 59 Szeliski and Kang [99] have recently and independently developed an algorithm for structurefrom-motion which is based on an objective function for the nonlinear camera model similar to the one for the rigidity checking method. Their formulation differs with the rigidity checking formulation in several ways. They use quaternions to represent rotation whereas the rigidity checking formulation represents rotation with an orthonormal rotation matrix. They incorporate and solve for a parameter that describes the global offset of the scene points from the camera coordinate frame and they also solve for a global scale factor related to the focal length and global depth offset parameter. We assume a known image scale factor which is related to the image size and focal length of the camera under the assumption that pixels are square. Hence, afixedvalue for the global depth offset is assumed as this cannot be determined from the images. Since they assume that there are several views and many correspondences available to overconstrain the problem, the use of stabilization methods to improve the performance of the algorithm is not required. Also, they incorporate sparse matrix methods to reduce the computational complexity of the estimation process. * Azarbeyjani and Pentland formulate a recursive least squares estimator for recovering the structure and motion of a moving camera from tracked point correspondences [3]. They describe their formulation as using the so called "relative orientation" constraints and, in terms of the objective function and parameterization, their formulation is most similar to the rigidity checking formulation of the known methods in the literature. Their parameterization differs from rigidity checking in that it includes the focal length of the cameras which is assumed to be constant over the image sequence and the use of quaternions to represent rotation. An extended Kalmanfilterprovides the computational framework for the parameter estimation. Since many views and correspondences are available yielding a highly overconstrained problem, methods for stabilizing the estimation process are not required. Poelman and Kanade [73] describe an algorithm for estimating structure-from-motion under perspective projection that uses the Levenberg-Marquardt nonlinear least squares estimator for point correspondences. Their objective function is similar to the objective function for the rigidity checking unbiased estimator. Their method is designed to handle many views and correspondences. The complexity of inverting the approximated Hessian matrix for such a large parameter space leads them to perform the estimation in a cooperative framework. The parameter space is divided into motion and structure pa- CHAPTER 3. RIGIDITY CHECKING 60 rameters. Two separate estimators are implemented such that the motion estimator assumes the structure parameters arefixedand the structure estimator assumes the motion parameters are fixed. For multiple frames the motion can be estimated separately for each frame and, similarly, the stmcture can be estimated for each point independently. Other recent related work in nonlinear least squares estimation of stmcture-from-motion is from Weng et al. [112]. Their minimum variance optimal estimator minimizes an image based error function for point correspondences similar to the one described here. The major difference with the rigidity checking estimation method concerns the stmcture parameters which, for rigidity checking, are solved for simultaneously with the motion parameters but are factored out into a separate optimization step in their formulation. They argue for this decomposition because it reduces the computational load and more importantly because their minimum variance estimator only involves the minimum number of parameters for the stmcture-from-motion problem. The dimensionality of the rigidity checking parameter space is not large because* rather than solving for 3D scene coordinates, only the scene depths in a reference coordinate frame are determined in addition to the motion parameters. Although not strictly relevant for the rigidity verification task for two views, it is worth noting that their results for the Levenberg-Marquardt batch solution method are generally better than their iterated extended Kalmanfilter(IEKF) results. The better performance of batch techniques over IEKF methods for stmcture-from-motion is also supported by other experiments [53]. Azarbeyjani and Pentland argue, however, that due to their parameterization, which minimizes the number of parameters while retaining a set of stmcture estimates that serves to stabilize the estimator, their IEKF algorithm yields results comparable with batch techniques. Weng et al. [112] also analyze the limitation of small interframe motion and derive the CramerRao lower error bound for the motion estimates. This lower bound predicts the instability and large expected error in the estimation of the motion and stmcture parameters from views that are close together. Discussed in a later section is a stabilization method incorporated into the rigidity checking algorithm which addresses this issue. CHAPTER 3. RIGIDITY CHECKING 3.4 61 Derivation of the rigidity checking equations The following section specifies therigiditychecking camera model, for which the rigidity checking objective function is formulated. The minimization of the objective function is then specified and an estimator to perform the minimization is described. 3.4.1 Rigidity checking camera model The standard perspective projection model is assumed. There are two coordinate systems relevant to this formulation, one camera-centered and the other object-centered. The object-centered frame can also be thought of as an intermediate camera frame which is identified with a reference camera frame by a translation along the optical axis. The object stmcture is recovered in this stable reference frame. Experiments show that estimating the motion and stmcture in an object-centered coordinate system noticeably improves the stability of the recovery method. This is especially true when the object motion is dominated by rotation. Intuitively, working in an object-centered frame helps to decouple the effects of rotational motion from translational motion since object rotations require smaller compensating translations. The transformation from the intermediate camera frame to the camera centered frame is given by a translation t cz along the optical axis. The rigid body motion of point pj from the reference object frame to a subsequent camera frame is given by Pj = R + t + t Pj (3.1) c where t maps the intermediate camera frame to the camera centered frame and is given by t = [0,0, t ] , T c c cz R is a 3 by 3 orthonormal rotation matrix from the intermediate camera frame to the frame of the second image, t is the translation vector and P j is the object point in the second camera frame. Image coordinates of the point Pj are given by the familiar perspective projection equation Uj = . i . 5 Z P L p p (3.2) where / is a known scale factor related to the image size and focal length of the camera. The center of projection is defined to be at the origin of the camera frame and the image plane is given by z = —/. CHAPTER 3. RIGIDITY CHECKING 62 The camera is assumed to be calibrated, hence, the origin of the image coordinate frame is defined to be at the center of the image which is assumed to be coincident with the principal point of the camera, and the axes are orthogonal and equivalently scaled to have the same magnification / . 3.4.2 Least squares solution of nonlinear equations We seek to minimize the standard nonlinear least squares objective function 0(m) = i | | F | | 2 = ±£||f(m)J| 3 (3.3) i over a parameter space m. The image based residual function f (m) is the residual of the fit between a point location in the second image and the model of the transformed point from the reference image: f (m) = w(w, m) - w'. (3.4) Here, m is the set of parameters for motion and the depths of the scene points in the reference object frame, w is the coordinate vector of the point in the reference image frame and w' is the conesponding point in the second image frame. The function w(w, m) maps the image coordinates in the reference image frame into the nonreference image frame by first back-projecting the image coordinates into the reference camera centered frame, i.e.,findP and P from (3.2) and t x y cz where P = p +t . This is followed by a transformation to z z cz the reference object frame by a translation along the optical axis, then to the nonreference camera frame according to the latest motion estimate by equation (3.1) andfinallyprojecting into the nonreference image frame by (3.2). Because of the smoothness and well behaved properties of the nonlinear projection equation (3.2), minimization of the objective function (3.3) by the Gauss-Newton method is, in principle, a good choice. The Gauss-Newton method, however, in its basic form, is not appropriate for the estimation problem addressed by this thesis. Since the rigidity checking problem is often ill-conditioned, which leads to a singular Jacobian, we use a variant of the model-trust region method known as the LevenbergMarquardt algorithm to make the problem better conditioned. CHAPTER 3. RIGIDITY CHECKING 63 A method which directly solves (3.3) is not known or does not exist, so the minimization is accomplished with the Gauss-Newton method described in Section 2.6, augmented by the LevenbergMarquardt extensions which will be discussed later in this chapter. Computing the partial derivatives The error measure is the difference between the observed image point and the image point from the reference view mapped into the nonreference (or current) view, Wj — w'j = fj, where \VJ is a function of the depth of the point in the reference object-centered frame and the translation and rotation between the two views. The partial derivative of f, with respect to h reduces to the partial derivatives of Wj which k are given by ^ dh z i ( J ^ _ ^ p \ dh p d = k jZ k l A dh ) d jZ k n d k The partial derivatives of P , P jX 9*i _ ~f ( >y dh p . \ dh dP a jy and P jZ z p k p 3 jZ v A dh J • d P k with respect to h are the components of a set of dik rectional derivatives in a camera centered frame. These parameters are the depth values of the points in the reference object frame, pj , the three translation parameters t , t and t and the parameterizaz x y z tion of the rotation component. Determining the partial derivatives with respect to rotation poses some difficulty, since rotation has only three degrees of freedom and any formulation with a rotation operator usually involves more than three terms, e.g., the nine elements of an orthonormal rotation matrix. Furthermore, the formulation makes no suggestion as to the appropriate representation in terms of its underlying parameters. If we compose three rotations about individual axes to compute an arbitrary 3D rotation, singularities can occur if the sequential composition fails to specify independent directions of rotation. Therefore, we represent full three-degree-of-freedom rotations with a 3 by 3 orthonormal rotation matrix and compute corrections about each of the coordinate axes to be composed with this rotation. These corrections are approximately independent for small angles. These derivatives are analytic closed form expressions. An important advantage of this fact is that the Jacobian matrix can be computed efficiently and precisely. Hence, we are not required to use computationally expensive finite difference methods. CHAPTER 3. RIGIDITY CHECKING 64 Partial derivatives of u and v The partial derivative of Uj with respect to depth is duj -f = dp (dP P z P dP \ ]X jZ ]X \ d P Pz jZ jZ d Pz J ' Let the rows of the rotation matrix R be R , R , and R respectively. Rewrite (3.1) as x y Pj = R((p jz 2 + t )P cz 3 - t ) +1 + t . c c Then, P and P can be written as )X JZ Pjx — (,Pjz ~~r* tc )~RxPj Rx^c ~l~ t Z x Pj — {.Pjz ~t~ ^cz)Rz-Pj Rz^c "F t -\- t z z cz where P j is given by Pi = and pj + t z cz J L / 1 / = P . Then, jz ^ when P = P j else dPj/dp = R,Pj and ^ = R P, 2 = 0. Therefore, z duj _ -f Wz Pj = z when P = P j else it equals 0. Rotations are parameterized by incremental rotations about the coordinate axes. Let these parameters be designated by <f> , <j> and <j> . Then the partial derivative with respect to <p is x y z d<t> P? V 3Z x x d<t> x d<f> P x 2 where dPj /d<f> = — p' = —R pj, recalling that R ^ ^ are row vectors. z x jy y The partial derivative with respect to <f> is y du 3 -f f A Pf \ z OP JX t>y d< p <t>y 1JXd = (' P JZ P p' jx JX 3 CHAPTER 3. RIGIDITY CHECKING where dP /d(f> = -p' jx y jz = -K pj z 65 and dPj /d(f> = p' = z y jx K pj. x The partial derivative with respect to <f> is z dui "3 d<f> z where dP /d<t> = p' = jx z jy H Pj. y The partial derivatives of Uj with respect to the three translation components t , t and t are x y z now given. The partial with respect to t is x d u dt x 1 _ z l ( pf \ z Jj±_f> dt Jj± d P j z d 3 X x \ - Zl dt )-p x ] Z Similarly, the partials with respect to t and t are y z ^ =0and^ = ^ dty 8t . z The partial derivatives of VJ are derived similarly. The solution for motion and depth can only be recovered up to a global scale factor. To reduce the number of degrees of freedom it is convenient to fix the global scale by either fixing one of the depth parameters or by constraining the translation to have a unit norm. In our implementation, the value of one of the depth parameters is set to zero in the object frame. Hence, from a simple counting argument for corresponding points, a minimum of two views and five points are required to yield a system of equations of full rank where each match contributes two independent constraints. 3.5 Implementation issues In this section are discussed the implementation details. The section begins with an overview of the rigidity checking algorithm followed by a description of the initial parameter estimate computation. Several issues are then discussed including the numerical conditioning of the nonlinear least-squares equations. 3.5.1 Algorithm overview The algorithm's input consists of the coordinates of a set of point correspondences from a pair of images. A linear estimator provides initial values for a set of motion and structure parameters under a weak per- CHAPTER 3. RIGIDITY CHECKING 66 spective projection assumption. The nonlinear estimator is initialized with these estimated motion parameters augmented by reasonable choices for the unknown motion parameters. If the nonlinear estimator returns a residual error which is judged to be consistent with a nonrigid configuration then a second set of initial values is provided by the linear estimator and the nonlinear estimator is remn. The final yes/no rigidity checking answer is then determined. The linear estimator: • estimates two, from a set of three, rotation parameters - the third parameter, a rotation in depth, is a free variable, • changes the sign of the free variable to determine a second set of initial parameter values - a byproduct of the Necker reversal ambiguity of the orthographic projection model, • estimates the global scale change which is used to generate an initial estimate for the translation in depth, and, • provides a set of relative depths parameterized by the free variable. The nonlinear estimator: • is initialized with the camera's image magnification factor and pixel aspect ratio, and assumes that the image data has been corrected for the camera's other intrinsic parameters, • is initialized with the linearly estimated motion and stmcture parameters augmented by reasonable choices for the unknown motion parameters, in particular, fixing the angle for rotation in depth fixes the stmcture initial estimate since orthographic projection leaves both unspecified - this is often referred to as the bas-relief ambiguity for orthographic projection, • is re-run with the second initial estimate only if the first initial estimate yields a decision of nonrigidity. The residual error of the nonlinear estimator determines the rigidity checking decision - a simple yes/no answer. CHAPTER 3. RIGIDITY CHECKING 3.5.2 67 Computing an initial estimate A linear structure-from-motion estimator Kontsevich [52] describes a fully linear algorithm for recovering the epipolar geometry of a set of corresponding points under scaled orthographic projection by formulating the problem in terms of a change of basis in Euclidean coordinates. The change of basis transformation maps the corresponding points into rectified image frames with an aligned epipolar geometry, i.e., the epipolar lines are horizontal and coincident between the rectified views. Hence, the change of basis transformation excludes the scaling component and the rotation component about the viewing direction. The translation component of the motion is excluded immediately by considering the transformation of the edge vectors formed by joining the object points. The recovery problem is reduced to the standard problem of binocular stereo where the unknown rotation component is about the (new) vertical axis. A one parameter family of rigid object structure is recovered, parameterized by the angle of rotation about the vertical axis. A minimum of four non-coplanar point conespondences are necessary but not sufficient to uniquely satisfy the constraints. Rotation initial estimate The geometric interpretation, expounded by Kontsevich, of the constraint that determines the rotation component of a motion observed by orthographic projections is depicted infigures3.1 and 3.2 (where figure 3.2 is similar to one given in [52]). Kontsevich assumes that the camera observes a moving object where motion includes not only rotation and translation but also a dilation that follows the motion and leaves the object shape invariant. In this description a 3D edge is defined as the vector between a pair of object points. Kontsevich notes that the orthogonal projection of a 3D edge onto a rotation axis (which is identified with the moving object) is invariant to the rotation. An arbitrary rotation can be decomposed into a rotation about some axis V in thefirstimage plane and a rotation about an axis orthogonal to the image plane. The rotation component about the viewing direction maps axis V in thefirstview to axis V in the second view. The norm of the projection onto V of a 3D edge is equal to the norm of its projection onto V in the second image. Under orthographic projection this equality also holds for the 2D CHAPTER 3. RIGIDITY CHECKING 68 projections of the 3D edge. Algebraically, the expression is given by equation (2.15), i.e., e' • v'. Once the axes v and v' are determined according to the method described in Chapter 4, Section 4.4 using singular value decomposition, the global scale factor is also determined and a basis transformation can be formed for each image which, when applied to the two point sets, yields a binocular stereo problem. The object stmcture is uniquely determined by the rotation angle about the vertical axis which is a free variable, hence, there is a one parameter family of solutions for object stmcture. The expression for object stmcture is given by H i where + U «J" + A- 'i,j ~ u 1 i,j u and u' - are the horizontal image coordinates for the projected edge vector defined by two i} image points i and j, and A = cot(a) — l/sin(a), where a is the rotation angle in depth, can be any non-zero value that parameterizes the space of all possible interpretations for two given images. The axes v and v' can only be determined unambiguously (up to a sign value) if the two sets of image measurements are not related by an affine transformation (which would be the case for a rotation only about the viewing direction or if the 3D configuration is coplanar). The constraint that the norm of the orthogonal projection of the 2D projection of any 3D edge onto axis V equals the norm of the projection of the corresponding 2D edge onto axis V ' can be used as a consistency criterion for verifying point correspondences under scaled orthographic projection - as described by Kontsevich. This consistency criterion has also recently and independently been described by Huttenlocher and Kleinberg [47] in a paper that addresses the combinatoric problem of verifying point correspondences under orthographic and central projection for noise free observations. Huttenlocher and Kleinberg describe a new algorithm with a low order polynomial time complexity for the problem of labeling and verifying the correspondences between two sets of projected points. Their verification criterion for orthographically projected point correspondences is identical to Kontsevich's (note that the criterion due to Kontsevich is more general since it handles scaled orthographic images). It is an open question whether the complexity of the algorithm due to Huttenlocher and Kleinberg for noisy observations approaches the bounds for noise free observations. Further insight into this question, however, CHAPTER 3. RIGIDITY CHECKING 69 Figure 3.1: Plan and front elevation views of a pair of orthographic projections of two scene points. The image plane is rotated by ninety degrees from position 1 to 2. The rotation induces an epipolar geometry between the views with parallel epipolar planes. Axes V and V are defined in their respective image coordinate systems. The rotation of the camera (or equivalently the object) can be decomposed into a rotation about axis V followed by a rotation about the camera's viewing direction that maps axis V into axis V . 70 CHAPTER 3. RIGIDITY CHECKING F i g u r e 3.2: I m a g e s 1 a n d 2 r e s p e c t i v e l y from figure 3 . 1 . T h e rotation of t h e 3 D e d g e w h o s e projection is e c a n b e d e c o m p o s e d into t w o rotations. T h e first rotation a b o u t t h e a x i s V m a p s e into e s u c h that its projection o n t o V is invariant. T h e s e c o n d rotation a b o u t t h e v i e w i n g d i r e c t i o n t a k e s e into e , e ^ s c o r r e s p o n d i n g e d g e in t h e s e c o n d v i e w . 1 1 2 2 3 comes from the application of the constraint to recognition problem due to Huttenlocher and Lorigo [48]. In order to deal with noisy observations, a rasterization of the parameter search space that defines parameter ranges is defined. The parameter space includes the image rotation for each image and a scalar offset that establishes concurrent epipolar lines. The rasterization granularity is specified to befinebut an hierarchical matching process is utilized which can rale out large blocks of small grid cells if larger grid cells are ruled out. A bipartite-graph matching algorithm determines the maximal set of matches for a given set of parameters. The overall complexity of the algorithm is claimed to be (without formal proof) 0(m + n) - for m and n matches in each of two images, rather than 0(n ) for n matches in both im2 5 3 ages formally proven by Huttenlocher and Kleinberg. The discrepancy appears to arise from a match symmetry condition overlooked in the original result. Due to the choice of algorithm for performing the bipartite-graph matching, the actual complexity is claimed to be 0(min(m, n)e) where e is the number of edges in the graph. The number of edges is claimed to be small and is large only for a certain rarely occurring geometric degeneracy so that the upper-bound of mn is usually not attained. Kontsevich's algorithm is applicable to multiple frames. In the case of three frames it is sufficient to compare the structure estimates from two pairs of comparisons. If the structure is unique then real scalars, A and fi, can be found which satisfy the structure equalities. There is a problem, however, with the solution strategy suggested by Kontsevich for a special motion case. The special motion case occurs, CHAPTER 3. RIGIDITY CHECKING 71 for example, when there is no rotation about the viewing direction between any pair of views. The details of the problem will be omitted here since we are primarily concerned with the two view case, however, it should be noted that the solution for the two scalars is not a simple linear problem. In fact, solving for the scalars involvesfindingthe real roots of a fourth order polynomial. The solution is further complicated when the image measurements are noisy. Structure initial estimate The linear estimation algorithm provides a single parameter family of solutions for depth parameterized by the rotation angle about the vertical axis. For the purpose of initializing the depth values a rotation angle of approximately 10 degrees is chosen if the residual error from the linear estimator exceeds a threshold determined from the observation variances. Otherwise, the depths are computed for an assumed rotation in depth of 45 degrees. A large residual error produced by the linear estimator for a rigid configuration could indicate that the orthographic model is inappropriate for the observations. Such would be the case if the images exhibited a large amount of perspective distortion. The smaller rotation angle for a given set of disparities produces a larger depth range since depth is proportional to the product of disparity and the cosecant of the rotation angle. The mean value of the cosecant function over the range of 1 to 45 degrees corresponds to an angle of about 10 degrees. The depth scaling corresponding to a 10 degree rotation angle was found experimentally to give the best convergence results for images with small or large perspective distortions. Translation initial estimate The default initial estimate for translation is zero except in the special case where the scale factor estimated by the linear algorithm suggests that the scene receded from the camera. In this case, the image scale factor is used to estimate the translation component in depth, t . The assumption is that the global z scale factor between views estimated by Kontsevich's algorithm is a reasonable indication of the motion of the object in depth. From Kontsevich's algorithm, the scale factor s is given by s|r| = |r'| where r is a 3D edge vector and r' is the corresponding edge after the view transformation. The value of s is deter- 72 CHAPTER 3. RIGIDITY CHECKING mined from the average ratio of the projected edge vector magnitudes before the view transformation to the magnitudes after the view transformation. To see the relationship between translation in depth and s under perspective projection, the scale factor can be written in terms of, say, the u coordinates of a pair of corresponding points as s = u'/ u where u under perspective projection is given by equation (3.2). The u component from equation (3.2) can be rewritten as —kx nz + 1 0 where k = //t , rj = i " , z is the point's depth in the object frame, and t 1 cz Q cz is the global depth offset. Now the scale factor can be rewritten as _ {-kx') (nz + 1) 0 (-kx) (r,z' +l)- ^ 0 Under the assumption that the global scale change is due to a motion dominated by translation in depth (i.e., the rotation component is negligible), assuming that there is no translation parallel to the image plane (the effect of translation parallel to the image plane can be minimized by considering the distances between points), and noting that the scaling is applied after the scene is orthographically projected, then x' ~ x and z' ~ z + t . With these assumptions equation (3.5) simplifies to Q 0 z _ (nz + !)(!0 i — z s) s rj Under the assumption that nz << 1, which holds reasonably well for distant compact objects, the ex0 pression for t reduces to z t z = ^ ST] = ^ t. cz (3.6) S Equation (3.6) is the expression that is used in the algorithm's implementation. The initial estimate for t is only bound to the given expression if s is less than 1 (indicating that the object translated z away from the camera). There are two reasons for this: thefirstis that experimental results indicated that the nonlinear estimator converged to the correct solution more effectively if the object loomed rather than receded, and secondly, the global offset is set to a small value and initializing the motion to bring the object closer to the camera increased the risk of violating the "depths positive" criterion upon convergence. CHAPTER 3. RIGIDITY CHECKING 73 ROC Curve For Rigidity Verification / 1 i /./.... ; J 1 \ i With translation in depth initialization Without translation in depth initialization UJ 0 i 0.05 i i 0.1 0.15 False Positive Rate i I 0.2 0.25 Figure 3.3: Rigidity verification with and without translation in depth initial estimate. Receiver operating characteristic (ROC) for observations of six points with a noise variance of 1 pixel. The curves are generated for the scenario with expected large perspective distortions. Close up of the curves at a small false positive rate. Each curve generated from 2000 trials for each of rigid and nonrigid configurations. Monte Carlo test results for the scenario with a large amount of perspective distortion and additive noise with a variance of one pixel indicated that the correct verification rate improved by approximately 7 percent at the 2 percent false positive rate with this initialization for t incorporated in the algorithm. The z test scenario was simply to disable the initialization of t and compare the resulting ROC plot with the z plot for the initialization enabled. This result is from Figure 3.3 which shows the two ROC plots. The figure is zoomed in to show the curves for lower values of the false positive rate. The improvement is negligible above a false positive rate of approximately nine percent. 3.5.3 Estimator for the nonlinear camera model Stabilizing the nonlinear least squares solution As long as there are significantly more constraints on the solution than unknowns, the Gauss-Newton method, as described earlier, will usually converge in a stable manner from a wide range of starting positions. However, in motion analysis and stmcture recovery, it is often the case that the observations CHAPTER 3. RIGIDITY CHECKING 74 can lead to an ill-conditioned solution even when the parameters are over-constrained. Viewing situations that can lead to ill-conditioning include views that are closely spaced together or that are projected approximately orthographically. The minimization of the rigidity checking objective function is accomplished with a stabilized form of the Gauss-Newton method. The stabilized form of the Gauss-Newton method follows the approach due to Lowe [61] and is equivalent to implementing the scaled linear least squares version of the Levenberg-Marquardt method due to More [69]. The following discussion of a stabilized form of the Gauss-Newton method is instructive in the context of the estimation problem addressed by this thesis. The derivation below yields results equivalent to those discussed in Chapter 2, Section 2.6 on the Levenberg-Marquardt method. Specifying Prior Constraints. Convergence problems can be ameliorated by introducing prior expectations on the solution that specify the corrections to make in the absence of further data. For our problem, the prior expectation will simply.be to solve for zero corrections to the current parameter estimates. Prior knowledge of the expected value of the parameters, even if only in terms of order of magnitude, allows each parameter to be scaled according to its contribution to the global error measure. One appropriate measure of a parameter's contribution to the residual error is its prior variance. Each parameter should be weighted so they all contribute equally according to their distance from their expected value in standard deviations. Weighted prior constraints on the solution can be incorporated by adding rows to the linear system constraining the value that is to be assigned to each parameter correction: J w h= -F 0 The matrix W adds one row for specifying the weighted value of each parameter correction, and we specify a zero a priori value for the correction. Each iteration of the Gauss-Newton method determines a new correction to be added to the current parameter estimate. The prior constraint is set to zero and stipulates that the correction should be to leave the parameter estimate at its current value if the system is poorly conditioned. Each diagonal element of matrix W is set to the inverse standard deviation <7; for CHAPTER 3. RIGIDITY CHECKING 75 parameter i Wu 1 <Ti' Also note that it is assumed that the observations have been scaled to have a unit variance. This derivation of the weighted Gauss-Newton solution is equivalent to the constrained optimization approach which solves the linear least squares problem (2.26) where the value of the Lagrange multiplier, Afc, is equal to one. Choices for the weight values can be informed by knowledge about the scene geometry which aids in setting bounds on the range of expected parameter values which in turn can be used to specify standard deviations. Rotations for example will have a standard deviation of at most 7r/2 and translations must be limited to keeping the object in thefieldof view. We make an initial choice for the location of the object-centered frame relative to the camera frame. The standard deviation of the depth parameters is proportional to this distance as a function of the degree of stabilization desired. A relatively small standard deviation corresponds to a greater degree of stabilization which will cause the depth estimates to remain closer to their initial values. These deviations may be large relative to the deviations arising from the image measurements, but they still play an important role in stabilizing the solution for ill-conditioned problems. Efficient Computation of Stabilization. The over-determined system (3.7) can be minimized by solving the corresponding normal equations J J W ' T T J W ' T w T which multiplies out to (j J + W W ) h = - J F . T T (3.8) T Since W is a diagonal matrix, W W is also diagonal but with each element on the diagonal squared. r This means that the computational cost of the stabilization is negligible, as we canfirstform J J and T then simply add small constants to the diagonal that are the inverse of the variance of each parameter. The scaled Levenberg-Marquardt method, discussed in Section 2.6, can be seen to be an example of regularization using a Tikhonov-Arsenin stabilizing functional since the normal equations of (2.26) CHAPTER 3. RIGIDITY CHECKING 76 have the form (J J + X B T B )s T K K k = - J F (3.9) T k which minimizes ||F(m*) + F'(m )s || + A ||D s || . 2 fc 2 fc fc fc fc Note that equation (3.8) is equivalent to the least squares normal equations given by equation (3.9) for A = 1. fc The rigidity checking implementation of Levenberg-Marquardt sets the scaling for motion to zero, i.e., no trust region bounds for the motion parameters since the experimental results suggested that these bounds were not necessary, and sets a fixed non-zero value for the trust region bound for the stmcture and observation bias parameters, i.e., the trust region bounds are not modified during convergence since the experimental results suggested that unconstrained steps for these parameters were most likely to lead to non-convergence of the minimization algorithm andfixingthe values was considered adequate for the problem thus saving the computations required to update the trust region radius and the scaling matrix elements. The rigidity checking implementation of the Levenberg-Marquardt algorithm follows the implementation due to Press et al. [76]. Their implementation does not specifically address the issue of variable scaling, i.e., inclusion of the scaling matrix D^, however, an interpretation of their implementation effectively calls for rewriting equation (3.9) as ( J J + B B )s T = -J T K k k T F (3.10) and scaling this matrix according to the following scheme, ( J J + D D ):T T fc t ( J J + DjCD*);,T = ( J J + D l D ) » ( A ; + l) = ( J J + BJB^ T f c T i+ (3.H) The Levenberg-Marquardt scaling parameter A^. is used to increase the stabilization weights whenever the norm of the residual error increases. Increasing the value of A^ will essentially freeze the parameters having the lowest standard deviations and therefore solve first for those with higher standard 77 CHAPTER 3. RIGIDITY CHECKING deviations. For our problem, ill-conditioning of the linear approximation to the nonlinear system typically stems mainly from the depth parameters. The depth parameters are therefore scaled by a relatively small standard deviation. This implies that the convergence path will initially proceed by relatively large steps in the translation and rotation directions with subsequent iterations characterized by steps of equal magnitude in all directions. The difference between the implementation due to Press et al. and that derived in Section 2.6 is significant since the additional scaling of J J+D^Dfc by A^+l serves to move the approximate Newton T step closer to the steepest descent direction compared to the solution provided by equation (3.9). Also, by setting the Lagrange multiplier in equation (3.9) to one, the Levenberg-Marquardt step from equations (3.10) and (3.11) is always biased closer to the steepest descent direction. The Levenberg-Marquardt algorithm due to Press et al. sets A^. to a very small initial value. The Levenberg-Marquardt step given by equations (3.10) and (3.11) is computed and if the norm of the residual vector increases \' is increased by a factor of 10 and the step is recomputed. If the computed step k leads to a smaller residual norm then the step is accepted and \' is reduced by a factor of 10. The adk justment of the Levenberg-Marquardt parameter A^. serves the same purpose as adjusting the Lagrange multiplier A^ from equation (3.9), i.e., to adjust the step size to remain within the model-trust region. Although not strictly in keeping with the theoretical derivation given in Section 2.6 this implementation has proven to work well in practice since the matrix J J + A D D with A <C 1 is poorly T T conditioned, in general, for the nonlinear estimation problem addressed by this thesis for all the reasons cited earlier, thus, ( J J + D D ) (1 + A) remains relatively well conditioned even for A <C 1. An impler T mentation of rigidity checking using the form J J + A D D proved to be highly unstable during Monte T T Carlo testing and converged more slowly. Scaling the image coordinates and the global depth offset The translation vector t in equation (3.1) that transforms scene points from the object-centered frame c to the camera-centered frame can be viewed as a global depth offset. This offset is an arbitrary choice but its value can affect the convergence of the nonlinear estimator. Its value determines the scale of the 78 CHAPTER 3. RIGIDITY CHECKING structure and translational motion estimate and in principle it can be set to any nonzero finite number with the appropriate sign value. In practice its value is assigned in relation to the scaling of the initial depth estimates as well as the scaling affect it has on the numerical conditioning of the nonlinear least squares normal equations. Experimental results, an example of which is given below, suggests that the numerical conditioning of these equations, although data dependent, is on balance appreciably improved when: • the initial depths from the linear estimator are obtained from the observations scaled by 1//, the known image magnification factor, and, • the scale of the depth parameters is preserved by assigning the global depth offset to a value of 2. Linearly estimated depths from observations scaled by 1// were found experimentally to yield partial derivatives of similar scale for all parameters when the global depth offset is set to 2. This improves the numerical conditioning of the approximated Hessian matrix and therefore improves the convergence of the nonlinear estimator. Experiments with synthesized data for a large range of object sizes and distances suggests that a value in the range 1/2 to 4 yields approximately the same performance for images with moderate to large perspective effects. For scenes that project approximately orthographically, performance degrades rapidly for values less than 2. A value of 2 for the global depth offset is a good compromise. As an example, the approximate Hessian with stabilization, J J + W W , was analyzed as a T T function of the global depth offset for the verification of eleven correspondences from the image pair Playground shown in Figures 5.4 and 5.5. The correspondences verified by the biased estimator for the nonlinear camera model are 0, 1, 2, 3, 4, 5, 6, 9, 10, 14, and 15. Table 3.1 shows the condition number and the largest and smallest values of the squared norm of the partial derivative of the residual function, 11 d{ F11 . The matrix condition number is determined by the ratio of the largest to smallest singular values 2 of the Hessian matrix. Four values for the global depth offset were tried, -1, -2, -4 and -8. The best condition number is determined by t cz = -2. Note also that the range of partial derivative magnitudes increases with an increase in the global depth offset. This result mirrors the unreported results for the scaling and conditioning of the Hessian observed with synthetic data gathered from the ROC test results 79 CHAPTER 3. RIGIDITY CHECKING Table 3.1: Hessian Condition Values -1 Condition Number 7.9 x in Min ||<9iF|| Max ||<9iF|| 6 x 10 2 x 10 2 2 s 7 4 -4 -2 1.1 x 10 3 x 10 1 x 10 5.1 x 10 4 x 10 2 x 10 4 e b 4 7 7 -8 9.1 x 10 1 x 10 6 x 10 B 4 7 ROC Curve For Rigidity Verification 1 i 0 . 1- • — 0.02 0.04 1 0.06 False Positive Rate •••>. 0.08 >. 0.1 0.12 Figure 3.4: Receiver operating characteristic (ROC) as a function of the global depth offset. Monte-Carlo scenario is for observations of six points with a noise variance of 1 pixel. The curves are generated for the standard scenario. Each curve generated from 2000 trials for each of rigid and nonrigid configurations. documented by this thesis. Figure 3.4 shows four ROC curves corresponding to various values of the global depth offset for initial depths from the linear camera model estimator obtained from observations scaled by 1//. The curves were generated for the standard scenario described in section 3.6 for six points with a noise variance of one pixel. The global depth offset values range from -1 to -8 and from the plots it is evident that the value of -2 yields the best performance. Further insight into the parameter scaling issue, especially for the rotation and translation parameters, is provided by Wheeler and Ikeuchi [117]. The reiterate the importance of parameter scaling for iterative estimation, similar to the discussion given earlier on nonlinear parameter estimation, for their CHAPTER 3. RIGIDITY CHECKING 80 problem of estimating the rotation component of a model pose using gradient descent techniques. They note that the space of rotation and translation parameters is a non-metric space since there is no scaling that yields a strict metric space for these parameters. The gradient of a scene coordinate vector with respect to rotation depends on the scaling of the data as can be seen from the partial derivatives with respect to rotation, i.e., the partial derivatives consist of the set of coordinates themselves. The gradient with respect to translation is identity, i.e., is independent of the data. The difference in the scale dependency of the partial derivatives for rotation and translation leads to two problems: gradient descent directions can be biased towards the rotation direction and the numerical conditioning of the approximate Hessian matrix can be very poor. To ameliorate these problems, they suggest scaling the data to a canonical frame, e.g., the unit cube. Our scaling solution for the initial depth estimates and global depth offset achieves the same desired result. An issue related to the global offset is the choice of the object frame origin for the stmcture estimated by the nonlinear algorithm. The idea is to determine which scene point is closest to the camera in order not to solve for its depth. This implies that all of the other estimated depths should be further away from the camera. This helps to reduce the possibility that the stmcture estimate will violate the "depths positive" criterion due to an under-estimate of the global offset. The closest scene point determined by the linear estimator is assigned afixeddepth value - the global depth offset - by the nonlinear estimation process. Estimation with disparity rather than depth A study of the parameterization of the nonlinear estimator suggested that a small but noticeable improvement in the stability and rapidity of convergence when inverse depth in the camera frame (which we call disparity) is estimated rather than depth. Figure 3.5 is an example ROC plot of a test for the standard scenario for six points and a noise variance of one pixel. Thefigureshows two ROC plots for the biased estimator for the nonlinear camera model parameterized by depth and inverse depth, i.e., disparity. The disparity parameterization yields a better detection curve, e.g., at the 2 percent false positive rate, the tme positive rate has increased by approximately 1.5 percent. CHAPTER 3. RIGIDITY CHECKING 81 ROC Curve For Rigidity Verification r i /' 0.98 : ^—* >- H i j '• I /./...; £0.96 CO £T c o i j — i / ca depth dispar ty j £ 0.94 i > '/ <u 5 0.92 O i[ i 1 11 : ij 0.9 A 0.88 •UL 0 L: 0.02 0.04 0.06 0.08 False Positive Rate 0.1 0.12 Figure 3.5: Receiver operating characteristic (ROC) for depth and disparity parameterization trade-off. Monte-Carlo scenario is for observations of six points with a noise variance of 1 pixel. The curves are generated for the standard scenario. Each curve generated from 2000 trials for each of rigid and nonrigid configurations. The conditioning of the approximated Hessian matrix is improved since the scaling of the disparity parameters is now commensurate with the scaling of the other parameters - especially for distant points. For example, from the ROC plot data generated for Figure 3.5 the condition number over 5 trials after ten iterations for the depth parameterization ranged from a low of 5.7 X 10 to a high of 2.3 x 10 3 5 and the square of the norm of the partial derivative of the residual function with respect to each depth, ||d;F|| , ranged as low as 12 and as high as approximately 10 . On the other hand, the condition number 2 5 for the disparity parameterization, for the same number of trials, ranged from 10 to 10 and ||<9iF|| for 3 4 2 disparity ranged from 2 x 10 to 2 x 10 . The value of ||<%F|| for the other parameters is in the range 4 6 2 of approximately 2 x 10 to 8 x 10 . Despite the small sample size, these values are indicative of the 4 5 population distribution. The ROC plots are a better indication of the overall performance benefit. Harris and Pike also estimate inverse depth to avoid a nearness bias for their estimation process due to their use of ellipsoids to model structure uncertainty [35]. The form of the partial derivatives remains the same for the change of variable from depth z to 82 CHAPTER 3. RIGIDITY CHECKING disparity d. For example, the partial derivative of Uj with respect to disparity Pd is frjt = f fd 7d p (Pjd xPj P R ~ PjxR-zPj) • For the remaining partial derivatives Pjd is substituted for P~ . z Stopping criteria for convergence The algorithm iterates until one of the following stopping criterion is met. 1. The norm of the parameter correction vector h is less than 10~ . 2 2. The residuals do not decrease by more than the relative amount 1 0 -3 over 2 successive iterations. 3. The number of iterations exceeds an upper bound of 10. 4. The residual is less than some threshold. The following expression for determining a decision threshold, based on the expected variance of the data, was derived by over-specifying the number of constraints, i.e., effectively treating the observations from both images as contributing constraints rather than the correct assignment of a single constraint for each correspondence: T F = ky/(2Nm- n)a 2 , where N is the number of images, A; is a factor setting the confidence level and is set to 2 for the Monte Carlo trials described below (unless otherwise noted), a is the observation variance and 2 is assumed to be equal for all measurements, m is the number of correspondences and n is the number of estimated parameters. In practice this expression has proven to work well. A better solution for choosing a threshold is to compute the receiver operating characteristic (ROC) for an assumed distribution of camera and viewing geometries and scene structures in a Monte Carlo simulation. The threshold is given immediately for a choice of either the desired true positive rate or the tolerable false positive rate. The above expression for Tp gives a value for six points and two images which yields a tme positive rate of approximately 99 percent at an approximate false positive rate of 2 percent as determined from the ROC plot for the standard scenario described in the following section. CHAPTER 3. RIGIDITY CHECKING 3.6 83 Experimental results Monte Carlo simulations were run to determine the parameter space bounds over which the algorithm is effective. The algorithm was also tested on real images with manually selected matches. For the simulated data, unless stated otherwise, the camera focal length was set to unity and the field of view was specified by the size of the image frame, i.e., the image frame is s by s where s is 0.7. The number of corresponding points is 6 over 2 frames. Normally distributed random values were added to the image coordinates to simulate the effects of noise. If the resolution of the camera i s m x m then the image coordinates vertically and horizontally are rounded to m discrete values. In all synthetic data cases m was set to 512 pixels. For synthetic and real image data, the elements of W for stabilizing the disparity parameters used a disparity standard deviation of 1/50 focal lengths. For the standard scenario the correct verification rate typically decreases by about 3 percent with the absence of stabilization. For the scenario which yields image data with large amounts of perspective distortion the lack of disparity stabilization caused the correct verification rate to decrease by about 10 percent. The stabilization values for the motion parameters were set to zero as it was found to be unnecessary to stabilize these parameters. The linear algorithm supplies two initial estimates because of the ambiguity in the rotation sense for rotation in depth. The second initial estimate is only computed if the nonlinear estimator fails to verify potential rigidity given thefirstlinear estimate. 3.6.1 Monte Carlo simulation The standard scenario for camera and scene geometry. 10,000 trials for the following scenario were run. The scene consists of 6 feature points. Translation was uniformly distributed between -500 and 500 focal lengths. Rotation about the optical axis was uniformly distributed in the interval ±180 degrees and rotation in depth was uniformly distributed in the interval ± 9 0 degrees. Object size was in the range 10 to 5000 focal lengths with the closest object point 2 to 5000 focal lengths away from the camera with both variables uniformly distributed in their respective ranges. The motion is applied to the points in the object centered frame. The global offset that defines the scene coordinates in the object centered frame is CHAPTER 3. RIGIDITY CHECKPrfG 84 ROC Curve For Rigidity Verification ROC Curve For Rigidity Verification 0.05 0.1 False Positive Rate 0.1 0.2 0.3 0.4 0.5 0.6 False Positive Rate 0.7 0.8 0.9 Figure 3.6: Receiver operating characteristic Figure 3.7: Test with large perspective distortion. (ROC) for observations with a noise variance of 1, Receiver operating characteristic (ROC) for obser4 and 9 pixels. Motion and structure variables are vations with a noise variance of 1 pixel, for the standard scenario. Close up of the curve's knee. The curve for the noisiest data is furthest to the right. a uniformly distributed random variable in the range 2 to 5000 focal lengths. Image noise was simulated by adding normally distributed random values to the exact image coordinates with a zero mean and a variance of 1 pixel. On average, 1000 trials required a total of 3.9 seconds to verify at a convergence rate of 97.9 percent with double precision floats on a Sparc 10 processor. The threshold Tp was computed for an observation variance of 1 pixel for all correspondences. The nonlinear estimation algorithm is typically bypassed for 50 to 60 percent of the trials for the standard scenario because the norm of the residual error for the linear estimator is below the computed threshold. A set of trials was run to test the algorithm's performance for randomly generated correspondences. For these "nonrigid" trials, image data was generated by taking samples from a uniform distri3 bution of image coordinates over a512by512 range of integer values. The convergence rate for 10,000 nonrigid trials was 1.3 percent taking an average of 13.6 seconds to complete 1000 trials for the same Note that for some of these trials, due to the random sampling process used to generate observations for both images, the hypothesized 2D matches could in fact correspond to the projection of a rigid 3D configuration. The probability of this, however, is considered to be quite small. CHAPTER 3. RIGIDITY CHECKING 85 threshold, Tp, used for the rigid trials. Receiver operating characteristic. The ROC curve, in the context of our problem, is a plot of the graph of the probability of correctly choosing the rigid hypothesis when the configuration is rigid versus the probability of incorrectly choosing the rigid hypothesis when the configuration is nonrigid. Given a set of observations, the observation noise variance, and a threshold value for the residual error norm, the probability that the configuration is rigid or nonrigid can be predicted from the ROC curve. The receiver operating characteristic (ROC) curve is given infigure3.6 for three image noise levels. The noise is normally distributed with a mean of 0 and a standard deviation of 1, 2 and 3 pixels. The curves for increasingly noisy observations shift progressively to the right. 100,000 trials for rigid configurations and 100,000 trials for nonrigid configurations wereranto generate each curve. The camera and scene geometry random variables for the standard scenario described above were used. Images with large perspective distortion. Figure 3.7 is an ROC plot for images with large amounts of perspective distortion. The closest scene point is nowfixedin the range 2 to 50 focal lengths while the remaining points lie in a depth range between 10 and 5000 focal lengths from the closest depth. Some of the samples from this scenario may be somewhat unrealistic since some of the points probably would not be in focus given a real camera with a 40 degreefieldof view (the value used here). Added noise has a variance of 1. 100,000 trials for rigid and nonrigid configurations wereranfor this curve. Approximately 15 percent of the rigid object trials resulted in the linear estimator's residual error falling below therigiditythreshold for a noise variance of 1 pixel. Note that the nonlinear estimator was not bypassed for the ROC trials just described. The intent was to assess the end to end performance of the verification process. Bypassing the nonlinear estimator could be detrimental to the system's performance if the linear estimator yields a residual error norm below the rigidity threshold for correspondences not arising from arigidconfiguration. However, for the nonrigid trials described above, the frequency of this occurrence was approximately 0.2 percent, a relatively insignificant value. In general, therefore, if the linear camera model fits the data then the configuration should be judgedrigidsince the nonlinear camera model estimator cannot provide better discrimination for this case. CHAPTER 3. RIGIDITY CHECKING Figure 3.8: First frame of Lego sequence with labeled correspondences. 3.6.2 86 Figure 3.9: Second frame of Lego sequence, Real image sequence Two images of a Lego object on a motion table were taken by a monochrome camera with a 480 by 512 pixel image. The object's center was approximately 13 inches from the camera's projection center. The image magnification factor was determined to be 609 pixels with a pixel aspect ratio of 1.25. The observations were not corrected for lens distortion or the remaining intrinsic camera parameters. The total motion between the two views was a translation of 2.5 inches and a rotation of 15 degrees. Figure 3.8 is the first frame marked with seven manually determined feature points. Figure 3.9 is the second frame with the corresponding points marked. As with the synthetic data the global offset is set to a value of 2. For the correctly matched correspondences the nonlinear estimator was bypassed since the residual error from the linear estimator was below the threshold for an assumed noise variance of 1 pixel. This is a strong indication that the object projection is approximately orthographic. For the first experiment only the correspondences labeled one to six in both views are considered. Given the six correspondences in the first frame, the same label set for the correspondences in the second image was permuted and verified. Figure 3.10 shows a portion of the bar graph of sorted residual errors 87 CHAPTER 3. RIGIDITY CHECKING Real Image Residuals - 6 Correspondences 20 25 30 Permutation Index Figure 3.10: Bar graph of sorted residual error for permutations of six corresponding points. Only the lowest residual values are shown, approximately 10 pixels and less. For an expected noise variance of 1 pixel, 35 permutations out of 720 (4.9 percent) are classified as potentially rigid. Filled bar is correct correspondence. Real Image Residuals - 7 Correspondences Permutation Index Figure 3.11: Similar to the previous figure, but with seven correspondences. For an expected noise variance of 1 pixel, 25 permutations out of 5040 (0.5 percent) are classified as potentially rigid, Filled bar is correct correspondence, for the 720 possible permutations of the label set in the second image. The figure shows the number of permutations that yielded residual errors below 10 pixels. For an expected noise variance of 1 pixel, 34 incorrect correspondence label sets fall below the rigidity threshold, a 4.7 percent false positive rate. The residual for the linear camera estimator for the correct correspondence was 1.0 pixels. Only one of the incorrect correspondences yielded a residual below this value. The permutation with the lowest residual value of 0.77 pixels is (5,3,4,6,1,2). For the second experiment all seven correspondences are considered. Given the seven correspondences in the first image, the same label set for the correspondences in the second image was permuted and verified. Figure 3.11 shows a portion of the bar graph of sorted residual errors for the 5040 possible permutations of the label set in the second image. The figure shows the number of permutations that yielded residual errors at or below 10 pixels. For an expected noise variance of 1 pixel, 24 incorrect correspondence label sets fall below the rigidity threshold, a 0.48 percent false positive rate. The correct correspondence yielded the lowest residual of 1.0 pixels. The next highest residual is 1.36 pixels for the CHAPTER 3. RIGIDITY CHECKING Figure 3.12: First frame of laboratory scene with labeled correspondences. 88 Figure 3.13: Second frame of laboratory scene, permutation (1,2,7,4,5,6,3), i.e., points 3 and 7 swapped. The epipolar lines for these two points appear to be in close proximity and therefore a mismatch between them could lead to a low residual solution. In another experiment with real images the corresponding scene points span a large depth range relative to the distance to the camera, thus emphasizing the performance of the nonlinear estimator. The experimental method is the same as for the Lego object. The image size is 640 by 480 pixels, the image magnification factor was determined to be 650 pixels and the pixels were found to be square. Figures 3.12 and 3.13 show 7 manually determined correspondences. The ratio of the depth range to the closest scene point distance is approximately 6:1. Figures 3.14 and 3.15 show a portion of the bar graph of sorted residual errors for the 720 and 5040 possible permutations, respectively, of the label set in the second image. For the six correspondences numbered 1 to 6 the norm of the residual of the estimator for the linear camera model is 16.6 pixels. The residual norm falls to 1.9 pixels for the nonlinear estimator. For all seven correspondences the norm of the residual of the linear estimator is 18.9 pixels with the residual norm falling to 2.4 pixels for the nonlinear estimator. Although these residuals may appear to be large, it should be noted that the observations are not corrected for lens distortion or the remaining intrinsic camera parameters and, most importantly, the estimator stops iterating as soon as the residual falls below the rigidity threshold. CHAPTER 3. RIGIDITY CHECKING 89 Real Image Residuals - 7 Correspondences Real Image Residuals - 6 Correspondence 3 10 12 Permutation Index Figure 3.14: Bar graph of sorted residual error for permutations of six corresponding points for the laboratory scene. For an expected noise variance of 1 pixel, 15 permutations out of 720 (2.1 percent) are classified as potentially rigid. Filled bar is correct correspondence. 3.7 3.7.1 4 5 Permutation Index 6 7 Figure 3.15: Similarto the previous figure, but with seven correspondences. For an expected noise variance of 1 pixel, 4 permutations out of 5040 (0.1 percent) are classified as potentially rigid. Filled bar is correct correspondence, Comparison to other methods Horn's algorithm for relative orientation Horn's algorithm for determining the rotation and translation between a pair of perspective views of a scene can be parameterized in terms of quaternions [42]. The objective function is based on the well known coplanarity constraint for conesponding points and is formulated in terms of a scalar triple product appropriately weighted by a term based on the estimated observation variances. It is interesting to note that Weng et a/.'s epipolar improvement method [112] is based on the same objective function for a different parameterization. Horn's algorithm was implemented in Matlab and tested. Figure 3.16 is the ROC curve for Horn's method compared to the rigidity checking method for three different image noise levels and figure 3.17 is a close up of the knee of the curves. The same standard scenario was used as described earlier. Linear estimates are provided as initial guesses to Horn's iterative algorithm. This contrasts with Horn's approach which involves running a large number of initial guesses to find all of the solutions. Ten iterations maximum were allowed for each initial estimate. 10,000 trials for both rigid and CHAPTER 3. RIGIDITY CHECKING 90 nonrigid 3D configurations were run for each of Horn's ROC curves. The high false positive rate exhibited by Horn's method results from the property of the epipolar constraint that allows for a larger space of motion solutions which are determined independently of the physical plausibility of the corresponding stmcture estimate. In addition, the potential ambiguity of the motion estimate for nearly orthographic images or noisy image measurements also contributes to the algorithm's weaker performance. A discussion of the drawbacks of the epipolar constraint for rigidity verification can be found in the earlier section on related work. In contrast to epipolar constraint based methods, the rigidity checking method converges to fewer local minima with low residual values primarily because the collinearity constraint formulation makes use of all the image measurement information. Rigidity checking searches a rigid transformation space while implicitly enforcing a simultaneously consistent and physically plausible depth estimate. Deviation from rigidity is reliably signaled by a large residual. Of the three sets of 10,000 nonrigid trialsranon Horn's algorithm, approximately 55 percent of the trials in each set were discarded because the "depths positive" criterion was violated. If this criterion was not applied the false positive rate would, in fact, be much higher. This consideration accounts for the blip in the ROC curve at around the 0.45 false positive rate. 3.7.2 Comparison to essential matrix methods Wei et al. employ the linear formulation developed by Tsai and Huang [104] to solve the correspondence problem and estimate motion simultaneously. Monte Carlo testing with a Matlab implementation of their algorithm indicates a large false positive rate for the standard Monte Carlo scenario described earlier. Monte Carlo testing for nine corresponding points and a noise variance of one pixel yielded a true positive rate of approximately 36 percent versus 99 percent for the rigidity checking method and 85 percent for Horn's method at a false positive rate of 5 percent. The performance contrast is greater still considering that the results for Horn's method and the rigidity checking method are for only six correspondences. Note that Wei et al. do not check if the motion solution corresponds to a "depths positive" reconstruction. Such a check would improve their performance. CHAPTER 3. RIGIDITY CHECKING 91 ROC Curve For Rigidity Verification ROC Curve For Rigidity Verification 1 0.9 0.95 0.8 0.9 a, CO 0 13 tr 8 5 rr = 0.6 o % 0.8 5=0.5 f > | 0.4 o O 0.75 ID > S fc: 0.3 8 0.7 0.65 0.2 0.6 0.1 0.55 0.1 0.2 0.3 0.4 0.5 0.6 False Positive Rate 0.7 0.8 0.5! j 0.05 0.1 0.15 0.2 0.25 False Positive Rate 0.3 0.35 0.4 Figure 3.16: A comparison with Horn's method. Figure 3.17: A comparison with Horn's method. ROC for noisy observations with a variance of 1, Close up of the knee of the ROC curves from the 4, and 9 pixels. Horn's method are the 3 dashed previous figure, curves. The curve for the noisiest data is furthest to the right for each method. Motion and structure variables are for the standard scenario. 3.8 Conclusions An algorithm has been described that reliably and rapidly verifies the potential rigidity of threedimensional point correspondences from a pair of two-dimensional views under perspective projection. The method, which we call rigidity checking, is useful for finding corresponding point features between two or more views of an object. The rigidity decision is based on the residual of the fit to the image observations of an integrated pair of stmcture-from-motion estimators for a linear and nonlinear camera model. The matching condition is based on a set of 3D recovery equations derived from the collinearity condition of points under perspective projection. This choice for the 3D recovery model contributes significantly to the performance improvement of the algorithm relative to other methods because, unlike recovery based on the epipolar constraint, the collinearity condition uses all of the information in the image measurements. This improvement in performance comes at a small additional cost in computational complexity due to the choice of parameterization. In Monte Carlo simulations over the entire set of rigid and nonrigid trials, a single trial took on the order of 10 milliseconds to execute on a Sparc 10 CHAPTER 3. RIGIDITY CHECKING 92 processor. In summary, rigidity checking works with as few as six corresponding points under weak or full perspective projection, handles correspondences from widely separated views, makes full use of the disparity of the correspondences, and is integrated with an initial parameter estimator based on a linear weak perspective camera model. Results from extensive Monte Carlo simulations and from real images were presented. A comparison was made with methods based on the epipolar constraint such as Horn's nonlinear algorithm for stmcture-from-motion and Wei et al.'s linear method that illustrated the disadvantages of the epipolar constraint as a matching condition. Applications of this algorithm as a module for performing rigidity checking are numerous; 3D recognition from a single previous view, motion segmentation and stereo correspondence are but a few. Chapter 4 Rigidity checking for matching point correspondences 4.1 Introduction This chapter describes an application of the rigidity checking method to the problem of matching point features from a pair of images. The proposed solution can be applied to structure-from-motion, mobile robot navigation, image database query processing, and recognition from a single view. The experimental data consists of a class of images which are taken over a relatively large unknown viewpoint range. The computational complexity of matching these types of images is generally very large for classical methods because of one or more of the following factors: images differ by a non-trivial rotation about the viewing direction, the scale change of the brightness function is spatially variant and large, the spatial variation of the disparity due to perspective effects can be significant, and occlusion. The approach can be summarized as follows: point features are detected and localized between views, the features are characterized in order to establish candidate matches, and lastly, candidate matches are verified to determine the rigidity of the projected 3D configuration. The overall solution is built upon rigidity checking, affine structure-from-motion [90], and grey-level differential invariants [86][87]. 93 CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 4.1.1 94 From brightness images to verified matches Our objective is to verify the rigidity of 3D point correspondences from 2D brightness images taken from a wide range of viewpoints. To accomplish this task we make use of the grey-level differential invariant scheme due to Schmid and Mohr [86] to locally characterize the brightness function over a set of point features. The differential invariants are used to determine an initial set of hypothesized matches. A novel sequential verification scheme based on the affine camera model and rigidity checking then determines the set of correspondences that could have arisen from the projection of a rigid configuration. Approaches for matching point features generally rely on detecting and characterizing the image brightness at the feature and applying a set of constraints on the feature attributes to determine the best matches. The image brightness at point features can be characterized locally by its pattern (e.g., as a template), curvature of the brightness function, color and a variety of other attributes. A constraint based scheme is then employed to find the best matches between images. Therigiditychecking method like other geometric constraint based methods for verifying 3D correspondences from 2D images exploits the epipolar constraint to determine match validity. It is the choice of the objective function, however, that sets the various methods apart. We show that the estimator for the weak perspective camera model due to Kontsevich [52] that is incorporated in therigiditychecking method is equivalent to the estimator for the affine camera model due to Shapiro [90] and, consequently, the integrated pair ofrigiditychecking estimators share the same objective function for the linear and nonlinear camera models. Conclusions from the analysis of this objective function by Shapiro carry over to the nonlinear estimator, and vice versa, yielding new insights into their respective performance. The advantage of the approach we propose lies in the ability of the algorithm to quickly establish a candidate set of matches without resorting to exhaustive search by focusing on keypoints [84, 44, 121, 86]. Search time complexity is also reduced through the use of nearest neighbour searching with k-d trees over the space of differential invariant vector representations of the comer points. The use of grey-level differential invariants allows a much larger space of 2D transformations to be handled without searching the space of 2D rotations. The multi-scale representation yields scale invariance over a range of scales. A sequential verification scheme over the set of candidate matches that combines a linear estimator for the CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 95 affine camera model with rigidity checking yields good results in the presence of incorrect matches with a low computational cost. The linear estimator is used to preprocess the hypothesized set of matches to rank the validity of the matches relative to the affine camera model. This preprocessing step substantially improves the robustness of sequentially verifying the matches with rigidity checking. In the following sections we review keypoint detection and grey-level differential invariant matching and explore the relationship between the objective functions for the affine camera model due to Shapiro [90] and the linear and nonlinear camera models incorporated in the rigidity checking method. But first, a review of some the related methods from the literature. 4.2 Background Feature based approaches for matching images are prevalent in the literature. The most closely related work includes that of Zhang et al. [121] and Hu and Ahuja [44]. Feature point matching differs from intensity based matching [111] and optical flow methods [78] in that the matching is for point features only. Another class of feature-based matching extracts a variety of correspondence tokens from image contours, e.g., points, lines or curves, and constructs an edge-vertex graph of the feature space. Nodes are defined by the tokens and the geometric relations between tokens define the edges. The features are matched byfindingthe sub-graph isomorphism between the graphs from a pair of images. Methods for finding sub-graph isomorphism include relaxation, tree searching, and maximal clique detection [121]. The problem of automatically determining the correspondence of feature points has received considerable attention in the photogrammetry literature as well [32, 100, 38]. Tang and Heipke [100] describe an approach for automatically computing the relative orientation of a stereo-pair of digital aerial images. They assume that the camera(s) are calibrated and that the viewpoint conditions are consistent with aerial images, e.g., small rotation angles of each image and a small relative scale change. They also assume knowledge of the approximate scale of aerial photography and the approximate x and y overlap between the images. They utilize an image pyramid to speed up the processing of extracting features and computing the relative orientation with increasing precision. At the highest level of the pyramid an estimate of the 2D registration of the stereo-pair is made using the approximately known scale, overlap CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 96 and rotation parameters. Point features are then extracted and registered to a reference image using the transformation parameters derived from the previous pyramid level. Brightness cross-correlation is used tofindthe best match in the neighbourhood of the transformed feature point. Using the best matches thus found, a robust least squares bundle adjustment is performed to compute the relative orientation parameters and eliminate outliers. Least squares matching is used at the lowest pyramid level, if necessary, to achieve the desired estimate precision. Least squares matching refers to a method that combines crosscorrelation with edge based matching in a statistical estimation framework [32]. Gruen and Baltsavias [32] describe a matching model called least squares matching with geometrical constraints. Their method for sub-pixel matching simultaneously solves a local radiometric (intensity correlation) and geometric (projection collinearity) constraint for a set of surface elements defined for each feature point in a set of image observations. The objective function unknowns include the local image shift (modeled as an affine transformation in the radiometric model), radiometric parameters, and scene coordinates. The model, in principle, allows for simultaneously estimating any number of relevant parameters, including for example, relative orientation and intrinsic camera parameters for which a unique solution requires the space coordinates of a set of control points. Rule based matching decisions based on the estimate variances can be incorporated to eliminate incorrect or poorly localized matches. Similar least squares formulations that combine radiometric and geometric constraints for multiple views include Heipke [38] and Bosemann [13]. Rather than model the local intensity variation of the projected scene point as a function of an affine change of basis with respect to a change in viewpoint, Heipke [38] recovers the canonical brightness of a surface element (described as an orthophoto pixel value) and models the intensity variation between images by a linear transformation that accounts for the brightness and contrast differences. Heipke models the object surface as a piecewise smooth set offiniteelements that define a mesh, a representation often used for digital terrain models. The radiometric model defines a set of surface elements of constant size within each grid mesh. Several simplifying assumptions are made with respect to the radiometric model, including lambertion reflectance and constant illumination. The linear radiometric transformation model is included to compensate for deviations from these assumptions. The application of the method was to recover a precise, dense, digital surface model from sub-pixel matches given several CHAPTER 4. RIGIDITY CHECKPSG FOR MATCHING POPNT CORRESPONDENCES 97 views and known extrinsic camera parameters and several control points. The issue of incorrect matching was not fully addressed, but the implication from the estimate variances for the recovered surface height was that very few, if any, incorrect matches were made. 4.3 Keypoint detectors The detection of keypoints is primarily motivated by the fact that rigidity checking is defined in terms of scene points and their projections. This is the pragmatic reason. However, keypoints or corner features have several properties which make them useful representations for scene and image brightness geometry as well. Keypoints are useful as correspondence tokens in the sense described by Marr [64]: tokens, and correspondence tokens in particular, represent attributes of the image which correspond to physical events in the scene, e.g., a discontinuity in surface reflectance or depth. Tokens should be derivable in a reliable and repeatable fashion. They should possess several attributes such as position, size, orientation, and brightness. Keypoints and their attributes satisfy many of these requirements and, therefore, their detection has become very important to computational vision. Keypoints are useful for stereo matching, displacement vector measurements, and image matching. Two broad groups of comer and vertex detectors are identified by Deriche and Giraudon [21]. Thefirstgroup extracts keypoints, from edges represented as chain codes, by searching for points having curvature maxima orfittingpolygonal approximations to the code and determining curvature maxima from these [83]. The second group detects keypoints directly from the grey-level signal, either by applying "heuristic" interest operators or measuring brightness gradients and curvatures. Several detectors for identifying and localizing interest points have been developed and many of them are reviewed by Deriche and Giraudon [21]. The comer detector of Harris and Stephens [34], which is from the second group described above, has been compared to several others and shown to give the best results based on several criteria such as repeatability with respect to scale, illumination and viewpoint changes [85]. Nevertheless, the Harris-Stephens comer detector does exhibit repeatability problems despite having the best performance relative to other tested comer detectors. Recently, Merron and Brady [67] have investigated this lack of comer detection reliability, and CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 98 have attributed a significant portion of it to the spatial anisotropicity of the typical corner detector gradient estimator. They have developed a better gradient estimator that is based on measuring the gradient in four directions. They note that the gradient estimate of the smoothed image reduces the anisotropy, which may explain, in part, the success of the Harris-Stephens stabilized comer detector due to Schmid and Mohr [85] - their results for a derivative of Gaussian version of the Harris-Stephens detector shows a noticeable improvement in repeatability over the standard implementation especially with respect to rotations in the image plane. The Harris-Stephens comer detector was developed as a set of enhancements to the Moravec interest operator. The problem of detecting comers can be analyzed in terms of the curvature properties of the local image brightness autocorrelation function where the curvature information can be represented by the Hessian matrix. The autocorrelation function is useful for characterizing how the brightness values change in the neighbourhood of a location. At a comer or an isolated brightness peak all shifts will result in a large change in the autocorrelation function at that location. A version of the brightness spatial change function for a small shift (x, y) can be written [34] u,v Afirstorder approximation of E is given by E(z, y) — Ax 2 + By 2 + 2Cxy where A, B, and C are approximations of the second order directional derivatives of the Gaussian smoothed brightness image (see the equations below). This can be rewritten as E(z, y) = [x, y] M [x, y] where the matrix M is an T approximation of the Hessian matrix for E and E is closely related to the image's local autocorrelation function. The principal curvatures of the image brightness autocorrelation function at a point can be approximated by the eigenvalues of the approximated two by two Hessian matrix, M , defined at that point. A comer is signaled if the principal curvatures are approximately equal and sufficiently large. The determinant of the approximated Hessian matrix, det[M], is proportional to the product of the principal curvatures. The Harris-Stevens comer detector is given by the following operator where large values of R signals the presence of comers. CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES R(x, y) = detTMl^)] - k trace [M| 2 (XiJ/) ] 99 (4.1) where, M(x,y) M \(x,y) C l(*,v) = _ \(x,y) \(x,y) C A =X 2 B <g> w, B = Y ®w, 2 C = XY <g> w X = I®[1,0,-1]« Y = I ® [1,0,-1] T 01 di Oy I is the brightness image, and, 1 w = 27ra 2 e A positive value for the scalar k can be used to ensure that the Gaussian curvature approximation det[M] is valid by suppressing the detector response for image regions with very high contrast (as signaled by a large mean curvature). The Gaussian curvature approximation is only valid for surfaces of low inclination ([40] page 271). Zhang et al. specify a value for k of 0.04 to discriminate against high contrast pixel step edges [121]. 4.4 Equivalence of objective functions In this section, the objective functions for the linear and nonlinear camera models are shown to be equivalent. This result establishes a common basis for evaluating the residuals of the fit to the different camera models and, in principle, allows for the evaluation of matches to be performed independently of the camera models. Attention can be focused on the objective function itself - is it "optimal" in some sense and what are the best estimators for it? CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POFrfT CORRESPONDENCES 100 Therigiditychecking method consists of an integrated pair of linear and nonlinear structurefrom-motion estimators as described in Chapter 3. The constraint on which the linear estimator is based is due to Kontsevich. We recall the following description of Kontsevich's rotation invariant for the scaled orthographic projection of a scene edge (defined as the vector between two scene points and serves to eliminate the translation component of motion between views). The geometric interpretation of the constraint that determines the rotation component of a motion observed by orthographic projections is depicted in Fig. 3.2. Kontsevich assumes that the camera observes a moving object where motion includes not only rotation and translation but also a dilation that follows the motion and leaves the object shape invariant. Kontsevich notes that the orthogonal projection of a 3D edge onto a rotation axis (which is identified with the moving object) is invariant to the rotation. An arbitrary rotation can be decomposed into a rotation about some axis V in thefirstimage plane and a rotation about an axis orthogonal to the image plane. The rotation component about the viewing direction maps axis V in thefirstview to axis V in the second view. The norm of the projection onto V of a 3D edge is equal to the norm of its projection onto V in the second image. Under orthographic projection this equality also holds for the 2D projections of the 3D edge. Kontsevich states that under the condition that V and V are determined by nonzero vectors v and v' scaled so that |v'| = |v/s| where s is the scaling of the object after the motion then the following relation holds (see Figure 3.2 and note that e = ei from thefigureand similarly e' = 6 3 ) e • v = e' • v'. (4.2) Therigiditychecking solution for determining v and v' consists of concatenating e and e' yielding the constraint [e -e' ] [ v v ' f T T T T = 0. (4.3) For N correspondences the following homogeneous system of equations is solved written as the linear matrix system Et/ = 0. (4.4) The solution to this system is the unit eigenvector associated with the least eigenvalue. In practice, equation (4.4) is solved by singular value decomposition of the matrix E . The vector v spans the null space CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POmT CORRESPONDENCES 101 of E and specifies a set of basis vectors and a scale factor that determines the Euclidean transformation between the two viewpoints up to a one parameter family of rotations about the projected rotation axis, i.e., a rotation in depth. The solution for the in plane rotation is unique [52, 5,51] and if given one solution in the family all other solutions are known except at two points for non-degenerate configurations of scene points. The scene structure is degenerate if there are fewer than three non-coplanar edges [77]. Now, Reid and Murray [80] have shown that the objective function minimized by Tomasi and Kanade [102] is equivalent to minimizing the distance between the projection of the estimated points and the observed points for the affine camera. This cost function is given by n-1 e (M,M',t,t',X ) tk n-1 = ^|x -MX,-t| t + ^ 2 2 t=0 |xj - M % - t'| , 2 (4.5) i=0 where M and M ' , t and t' are the camera model parameters, x- and x' and X , are the observations and 4 the scene structure vector in inhomogenous coordinates. Next we show the equivalence of minimizing (4.5) and the solution of (4.4) by using a decomposition arrived at by Shapiro [90] which is summarized next. By minimizing (4.5) directly over t and t' and centering the datasets x,-, x', and X ; with respect to their centroids (4.5) can be written as n-1 e (M,M',X ) = ifc n-1 | Ax, - M A X , | t i=0 2 + | AxJ - M ' A X ; | (4.6) i=0 This expression can be rewritten as n-1 (L,S) = ] P | v i - L s i | , 2 (4.7) i=0 where v,- = Ax; ,L = / j _ M , and s,- is the 3D affine structure vector. By inspection v is the M ' L ' J M vector [p — p ' ] from Kontsevich up to a reflection and an arbitrary translation. T T T The full system of equations is given by V = LS where we note that V is equivalent to E from (4.4) (up to a reflection and an arbitrary translation). Singular value decomposition of V , an M x N matrix, can be written as V = L*S*, 102 CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POUVT CORRESPONDENCES where L* i s a n M x M orthogonal matrix with columns £{ (i=l..M), and S* is an M x N matrix with mutually orthogonal rows. The £{ vectors form an orthonormal basis arranged in order of decreasing singular values. L* and S* can be partitioned such that V = [LL ] 1 where L is the M x r matrix [£± \ £ | ... | £ ] and L is the M x (M - r) matrix [£ +i | ... | £M\ If x 2 r r rank(V) is r then S-- = 0 and ~L is the left null space of V and is therefore independent of the structure 1 L S, i.e., it depends only on the motion. For the case of two views where (M, N) = (4, n) the system should have rank 3 since L and S both have maximum rank 3; L has dimension M X 3 and S is 3 x N and rank(LS) < rank(L) and rank(LS) < rank(S). The solution that minimizes is given by the vector that spans the left null space of V which is £4. £4 is the same vector as v given above, up to the sign value, and is determined only by the motion. £4 defines a 4D hyper-plane in the concatenated centered image coordinates which Shapiro calls the CI space. The projections of the v, onto £4 define the residual vectors of the fit and have squared lengths (v, • £ ) . Thus, minimizing e k = Z~2i I « 2 v 4 _ t is equivalent to minimizing J2 (v • £ ) . 2 i 2 4 We see, then, that Kontsevich's constraint is equivalent to minimizing the distance in the image plane between the model and the observations. Several advantages accrue from this fact. The constraint involves the minimum number of parameters for the model and the objective function is seen to be minimizing the noise term of the image observations which is where the localization error originates. We also note that the sum of the squared residual errors satisfies the x distribution for N - 4 degrees of 2 freedom. 4.5 Biased versus unbiased objective function We note that the original rigidity checking nonlinear estimator is a biased estimator in the sense that the residual of the fit is found in one image only, i.e., there are no residual components with respect to the image coordinates in the reference image coordinate frame. We show empirically via Monte Carlo simulation and the receiver operating characteristic (ROC) that the unbiased version of the estimator does 1 2 CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 103 not improve the performance of rigidity checking in any significant way. What the unbiased estimator does, however, is provide a unified framework between the linear and nonlinear estimators for establishing a significance test for rigidity - the x significance test. This symmetry is actually presaged by the 2 equivalence of the objective functions for the linear and nonlinear camera model estimators. The biased nonlinear estimator minimizes the following objective function in image coordinates, (4.8) where m is the parameter vector for rotation, translation, and depth scalars, and w and w' are the image coordinate vectors for two views. This estimator minimizes the distances between the projection of the estimated scene model and the observations by simultaneously estimating the stmcture and motion between views. However, this formulation assumes that there is no localization error in the reference image observations. This leads to an anisotropic noise distribution for the residual errors and makes it very difficult to determine analytically a significance test. The rigidity checking significance test critical value for the norm of the residual vector is given by Tp = k^/{2Nm — n)a which is an approximation with confidence factor k for the 2 estimator of the variance of data. This value has been shown to work well in practice as it corresponds to a false positive rate of approximately 2 percent and a true positive rate of approximately 99 percent as determined from the ROC plot for the standard scenario described in Section 3.6.1 for six points. The unbiased nonlinear estimator explicitly models the noise bias terms in the reference image and is motivated by the noise formulation due to Azarbayejani and Pentland [3]. The objective function for two views is given by / ( m , m') = ^2 ( W ( W J , m) - w;) 2 (w'( w;,m') + - w ') (4.9) t where the vectors of model parameters, m and m', now include the bias terms b Ui and b for image Vi coordinate components u and v in the reference frame. Note that W ( W J , m) is the observation equation for the transformation Rpj + t + t c (4.10) CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 104 that describes the mapping from the world coordinate frame to the camera frame for the reference image and W'(WJ, m') is the observation equation for the transformation p;-= R ' P i +1'+ t (4.ii) c which describes the mapping from the world coordinate frame to the non-reference camera frame. Partial derivatives with respect to measurement bias parameters The minimization of (4.9) requires the partial derivatives of the models w(w, m) and w'(w, m') with respect to the bias parameters. The bias terms model the additive noise that corrupts the observed locations of the image observations. Image coordinates of the point P j are given by the perspective projection equation with an additional term consisting of the bias parameters. - P Uj 1 JX P + (4.12) byj With the additional bias terms, P , from Section 3.4.2 becomes Pi = L / Poj + Pbj- T T / The partial derivative of w'(w, m') with respect to the bias parameters leads to finding the partial derivative of u'j with respect to b , which is given by u du' -/ 3 (dP' \~8b^ 3X db~ ~pj u where dP'j /db x u z P' dP' P' db u is given by dP' JX db u db (pjzR-'x d (PjzR'x [ 0 j + P-bi] + tj;) db d u - P u {PjzR-'xPbj} • db u CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 105 Therefore, dP' gjf = P M f (4-13) where R' is the row 1, column 1, element of the rotation matrix R'. Similarly, n dP' 3Z —— = JP = R'31/f, l jJZ (4.14) db u dP' -at = P M - L (4 15) and dP' 3 Z db = P R' /f. JZ (4.16) 32 v Therefore, upon substitution, p 8% db u pi J_ K (4.17) 2 and db ~ 2 Similarly, du'du^ db 3 _— v (4.18) P' u p p1 /_ JZ 3Z pi ( -pi 2 t 32 P' R' ) Z 12 , (4.19) Jz and dv^ Pjz db P' (4.20) 2 v jz The partial derivatives of w(w, m) with respect to the bias parameters are much simpler due to the fact that the motion for this model, which is identified with the reference camera frame, is zero, i.e., since the position and orientation of the world coordinate frame is arbitrary, the relative orientation problem is simplified by assuming that the reference camera frame and world frame are aligned. This leads to zero partial derivatives for the motion and structure parameters. The partial derivatives of w(w, m) with respect to the bias parameters reduce to db db u u and dv du or- = o r = 0db db - u v (4-22) CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 106 4.5.1 Minimization of the unbiased objective function The minimization proceeds in the same fashion given in Section 3.4.2. The motion parameters for the reference frame model w are the identity transformations for rotation and translation R = I and t = 0 where I is the identity matrix. These are the motion parameters for the mapping from the world coordinate frame to the reference camera frame given by equation (4.10). This is equivalent to assuming that the reference camera frame is aligned with the world frame. In addition to the partial derivatives of w' the Jacobian matrix now contains the partial derivatives of w which are all identically zero with respect to the motion and depth variables and nonzero for two of the bias terms. The sparsity of this form of the Jacobian can be exploited to reduce the cost of forming the approximate Hessian, J J, and the gradient vector, J F . T T 4.5.2 ROC performance comparison - biased versus unbiased objective functions The sum of the square of the residuals for the unbiased nonlinear estimator, like the sum of the square of the residuals of the linear estimator, satisfy the x statistic, and have been shown to be minimizers 2 of the same unbiased objective function in the image observations. This affords a principled method for selecting the significance test critical value and places the linear and nonlinear estimators on an equal footing for assigning confidence levels to the verified points for their respective camera models. The use of the x test assumes that the localization error is additive and satisfies an independent, 2 isotropic Gaussian distribution. The use of singular value decomposition forfindingthe minimizer of the linear estimator for the affine camera assumes the same error distribution. The localization error results from several factors that affect the accuracy of the feature detector and the initial feature matcher including scene stmcture and illumination, accuracy of the system calibration, image resolution and the performance of the feature detector and matcher. The common assumption is that the total of these errors by the central limit theorem yields an approximate Gaussian distribution. Noise characteristics of linear least squares estimation was investigated by Weng et al. [ 114,112] and the use of the x statistic for an2 alyzing the goodness offitfor a nonlinear estimator is described by Whaite and Ferrie [116]. Whaite and Ferrie make a noteworthy point about the types of error encountered when judging a model's fit to the CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 107 observations using the estimate of the variance of the data. Estimating the frequency of the so called type II errors, i.e., accepting a model when the data do not actuallyfitthe model, requires knowledge of the 1 error distribution of thefitwhen the model does notfitthe data. Determining this conditional probability distribution requires some measure of all the ways the data can be misfitted to the model, an impractical task in their view. In the context of rigidity checking, a type II error occurs when a set of correspondences is deemed to be potentially rigid when, in fact, they are not due to a rigid configuration. Shapiro [90] has shown that, for the affine camera model estimator, there is a spatial variation of the estimated variance of the residual error of observations i.e., the estimated variance of the localization error of a point depends on the point's location in both images. The estimated variance consists of a constant term and a subtracted term that is a function of the observation coordinate vector. When this variance is estimated in the rigidity checking sequential verification process, the upper bound given by the constant term is used and yields a conservative threshold for detecting outliers. The use of the x statistic is well founded. Nevertheless, it is necessary to include a confidence 2 factor for scaling the critical value for the unbiased nonlinear estimator, since a limit is placed on the number of iterations and the minimizer may not be reached before the iterations are terminated. Convergence can be slow if the least squares system of equations is poorly conditioned and terminating the iterative process before convergence can lead to inflated residuals. The upper bound on the number of iterations has been implemented in order to minimize computation when a large proportion of the trials consist of verifying nonrigid configurations which would otherwise continue iterating for too long before other halting criteria are met. The confidence factor is equivalent to scaling the expected noise variance that scales the square of the residual. The only distinction is that the noise variance is used by the linear estimator x test, and should therefore reflect the best estimate available, and the confidence 2 factor makes explicit the adjustment of the x critical value to account for the implicit behaviour of the 2 nonlinear estimator. The drawback of unbiased estimation is the increase in the cost of inverting the approximate stabilized Hessian matrix, J J + W W , on each iteration of the nonlinear least squares estimation. There T T are 2N bias terms for N correspondences and therefore the cost goes up significantly. The cost of in1 Also referred to as false positives. CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POPST CORRESPONDENCES 108 verting a matrix is, in general, 0(n ) where n is the dimension of the square matrix. For example, if 3 there are 6 correspondences the approximate Hessian for the biased nonlinear estimator has dimension 11, while the dimension for the unbiased estimator is 23, so, the relative cost of adding an additional <5iV correspondences is (23 + 3(5A ) /(ll + SN) . Asymptotically, the cost per point of inverting the r 3 S Hessian increases by a factor of 27 relative to the biased estimator. However, the special structure of the Jacobian allows this upper bound to be lowered. Methods for doing this are described by Slama [94], Hartley [36, 37], and Szeliski and Kang [99]. Our implementation of sequential verification, described later, minimizes this cost by using the unbiased estimator to "polish off' the biased estimator's solution if the residual from the biased estimator passes the significance test. The unbiased estimator is initialized with the estimated parameters from the biased estimator. To assess the performance trade-offs between biased and unbiased estimation a receiver operating characteristic (ROC) plot was generated for both estimators and is given by Fig. 4.1. A Monte Carlo simulation was used to generate these plots for three different Gaussian noise levels of 1,2, and 3 pixels standard deviation. The standard scenario described in Section 3.6.1 for six scene points was run with 100000 trials executed to generate each curve. Based on these ROC plots the unbiased estimator provides a slightly higher true positive rate only for large false positive rates. For lower false positive rates, the biased estimator slightly outperforms the unbiased estimator at all noise levels. This could be attributed to two factors. First, the unbiased estimator is only allowed to perform 10 iterations. Since the parameter space is larger than the space for the biased estimator this may not be a sufficient number of iterations to converge to a low residual solution. Secondly, the observation noise is Gaussian distributed and is uncorrelated. The sum of the measurement bias terms should be approximately zero hence the approximate Hessian is expected to be very close to the exact Hessian which should not penalize the performance of the biased estimator. Also, since the ROC plot is a ratio of probability distributions, the expected anisotropicity of the error distribution for the biased estimator may not necessarily lead to a poorer detection performance. However, the performance differences appear to be small. To summarize, the unbiased estimator allows a principled analysis to be made of the residual error in terms of the x statistic, and a small 2 CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 109 R O C Curve For Rigidity Verification i i _i i i : : : : _ i — y ^ ('/'://: 0.98 1 1 1 ! h 0.96 1 S rx i .8 0.94 i c o i 1 1 i i /' /' y ji I i \l ji i jr. < i j\ i i i i I I 1 0.88 0 ,. I|...: Ii i i i i i i 0.02 0.04 0.06 0.08 0.1 0.12 False Positive Rate Figure 4.1: ROC plot for the biased and unbiased nonlinear rigidity checking estimators. The unbiased estimator curves are denoted b y T h e r e are three curves for each estimator corresponding to three different Gaussian noise levels with a standard deviation of 1,2, and 3 pixels. Each curve was generated from 100000 trials of the standard scenario for six points (see text). performance gain is afforded at higher false positive rates, but the complexity trade-off favors the use of the unbiased estimator with little or no penalty in verification performance. 4.6 The x test statistic 2 Fitzgibbon and Fisher [25] have argued that the x 2 t e s t statistic has several disadvantages as a test for a model's fit to sets of observations typical in computer vision, e.g., feature positions. The x distri2 bution is for an exact observation corrupted by an isotropically and identically distributed zero-mean Gaussian noise process of variance a . The results of Fitzgibbon et al. show that the x test statistic is 2 2 very conservative because, they argue, it is based on a very unrealistic noise model, i.e., Gaussian. An- CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POWT CORRESPONDENCES 110 other drawback is that the noise variance of the process is seldom known. For our problem, the noise model for the localization error for various comer detectors is known with some confidence [18] [21]. The use of the x statistic is well established and has been shown to work well in practice for an orthog2 onal regression estimator for a linear camera model [90]. The conservativeness of the test, i.e., falsely rejects valid matches, can be controlled by a confidence factor that can be determined from ROC curves produced from Monte Carlo simulations for an assumed range of scene and camera geometries and localization error variance. Test results for real images reported in the next chapter, however, substantiate the claim that the x statistic is conservative, but only with respect to the estimator for the nonlinear camera 2 model - the decision statistic appears to be valid for the linear camera model estimator. The inclusion of a confidence factor helps to reduce the conservativeness of the test, but it appears that, for the rigidity checking nonlinear least squares estimator and for the data sets processed by this thesis, issues related to the conditioning of the problem and the convergence properties of the estimator need to be addressed in order to use the x test statistic in an appropriate way. 2 4.7 Characterizing the brightness function at keypoints The representation and matching strategy developed by Schmid and Mohr [86] is invariant to changes in the two-dimensional brightness function with respect to scale, orientation and position. A normalized version of their representation is also invariant to brightness scaling and translation. The representation is fairly robust with respect to rotation in depth that leads to foreshortening of surface patches, i.e., in general, a local affine distortion of the brightness surface is induced by a rotation in depth. Their representation is based on earlier work by Koenderink and van Doom [50] and Romeny et al. [81] on differential invariants. The local jet, J, of order N is defined at a point x = (x\, x , • • •, XD) (D = 2 for an image) as t 2 J [I]{x, a) = {L .., {*, N ix n a)|(x, a) G / x R+; n = 0,..., N] where / is the image array and a is a given scale. Li „j 1 Gaussian derivatives ...,-„ (x, a). n (4.23) (x, a) is the convolution of image I with the CHAPTER 4. RIGIDITY CHECKING FOR MATCHPSG POINT CORRESPONDENCES 111 That is, L,-,...,•„(x,ff) = (G ... ll in * /)(X,CT) where 1 G(x,a) = x D P(-7r^) ex d n 2 a n d G ,.. {x,a) = ^ h in 2 ( 7 d ...di h where n = 0,1,2 . . . and —G(x,cr) n 6 {xi, ^ 2 , . . . , xp} for k = 1,..., n. The local jet represents the truncated Taylor series expansion of the image function and is useful for encoding the position dependent geometry of the brightness surface. The Taylor series expansion requires the computation of the partial derivatives which is an ill-posed problem for noisy images in the sense of Hadamard. Differentiation is discontinuous in the presence of noise, that is, it does not vary continuously with perturbations of the data, e.g., given a signal f(x) and f(x) = f(x) + esin(ojx) for a small e, f'(x) and f'(x) can differ significantly for large values of the frequency UJ. The calculation of the derivative requires the specification of an operator. The choice of the derivative of the Gaussian is motivated by the properties of the Gaussian kernel: it is smooth and localized in the spatial and frequency domains, it is the unique function which has the smallest product of localization uncertainty in space and frequency, and, because of its smoothness properties, it is the least likely operator to introduce artifacts in thefilteredimage that are not present in the original image [64,40]. As an operator, it can be applied efficiently in the sense that it is the unique rotationally symmetric smoothing operator that is decomposable into the product of two one-dimensional operators [40]. The Gaussian function has the property of scale dependency which is given by a, and it is this scale property that affords the multi-scale approach for differential invariants. CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 112 Adopting the notation of Schmid and Mohr [86], the differential invariant vector V, is given by L xL x LXX LxLx ~f" L y L y -^-^Lxy LxLy-j-LyyLyLy LxX~\~Lyy l LxX LXXX LxXX L LyLy Lxxy ( Ly Ly Ly~\SLxyy -\~Lxxy X (—2 L Lx LxLx~\~2Lx X L Ly-\- L X ( — 2Lx ^^LxxyLx X Lxy~\~Lyy (4.24) Lyy ~~3Lxxy Ly Ly Ly)-\~ X X Lx L ~\~2Lxy Lx LxLy LyLy^~\~L yy LxXX Lxx L yy X LxLy LxLy LxLyLy-~~LyyyLxLxLx (— 2LxLyLy-\-LxLxLx)~\~LyyyLxLxLy -\-LyLyLy ) — LyyyL Ly Ly-\-L X -\~3LxyyLxLyLy XXX L x L x L y -\~LyyyLyLyLy where the 1/^,...,^, are the "local jet" elements defined by equation (4.23) and ii,..., i £ {x, y} where n x, y are the Cartesian coordinates on which the image brightness function, I, is defined. For example, L is the average brightness, L L -\-L x x L is the gradient magnitude squared, and L -\-L y y xx yy is the Laplacian of the brightness function. The components of the vector, V, of invariants, are the complete and irreducible set of differential invariants up to third order [87]. These functions are rotationally symmetric which follows from the invariance specification. Differential invariants can be written that are also invariant to an affine transformation of the brightness function given by I(x,y) = al(x,y) + b. These invariants are the last seven components of the differential invariant vector (4.24) normalized by an appropriate power of the gradient magnitude squared, i.e., (L L X determined so that the ratio a /a^ + LL), p X y y where the power p is equals one where k is the power of the scaling coefficient a for a k particular product of the Li , i . For example, thefirstcomponent of the brightness affine invariant lr n n vector is given by I LL j xx x x -\- 2L L L xy {L L X X x + -\~ y L LyL yy y L Ly) ! 3 2 y where the scaling coefficient of the numerator due to a scaling of the brightness function by a can be shown to be a . Note that differential invariants by dint of differentiation are invariant to the brightness 3 CHAPTER 4. RIGIDITY CHECKPSG FOR MATCHING POINT CORRESPONDENCES 113 translation b, except for the invariant L in (4.24). Most feature based matching methods that compute local brightness attributes assume that the local geometric transformation of the brightness surface can be modeled by a rigid transformation. It is also assumed that the local surface reflectance is diffuse. Weng et al. have analyzed the perspective projection of a locally rigid 3D transformation and concluded that the local 2D projected motion at a non-boundary point xn can also be modeled by a rigid transformation if the change in depth is small compared to the original depth, the 3D rotation in depth, i.e., out of the image plane, is not large, and the depth range in a small neighbourhood of xo is not large relative to the depth of xo. Based on this analysis, Weng et al. develop a "cornerness" attribute for matching point features which is motion invariant. This analysis also lays the foundation for justifying the use of local differential invariants for characterizing the local brightness distribution at a feature point. Local differential invariants of the image brightness surface as defined by Koenderink and van Doom [50] are invariant to the similarity group of 2D rotations and translations. A multi-scale representation affords scale invariance. Under the conditions specified by Weng et al. we can expect local differential invariants to be invariant to motion. Our working assumption when matching images is that there will be a sufficient number of projected scene points that satisfy the motion invariance assumptions. A sufficient number in our case is six points, the minimum number for rigidity checking. The advantages of the approach by Schmid and Mohr over a related method due to Rao and Ballard [79] for our task have been confirmed by our experiments with implementations of both methods. Rao and Ballard specify a local characterization of the brightness function at keypoints using a vector of steerable Gaussian derivativefiltersof various orders over a range of scales. The vectors are used to characterize images of scenes suitable for indexing for an image database application. Steerability refers to the property of any n-th order continuously differentiable function that determines the value of a directional derivative as a linear combination of other related directional derivatives. For example, Ballard and Rao specify a set of steerable Gaussian derivativefiltersas CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES where j is the derivative order of the filter, n — j + 1, and for G(x, y) — e ( v ), G^ x2+ k ° ^ = l^ ' ' G{x y) H i ) = 2 114 is given by («'" ! W ( i + !)• The interpolation functions for j = 1, for example, are given by kn(6) = l/2[cos(0 - <f>(i))], * = 1,2. Note that the number of basis functions is one more than the filter order. Ballard and Rao characterize the local brightness function by a multi-scale set of Gaussian directional derivatives. They specify 5, octave separated, scales and usefiltersup to order 3. The filter responses are concatenated to form a vector of dimension 45. The steering direction is determined by the gradient direction of the brightness function determined by thefirstorderfilterresponses. The vector offilterresponses is steered in that direction to yield the local description of the brightness function. Two key factors mitigating against the representation of Ballard and Rao for our application is the necessity to compute the gradient angle to determine the steering direction of thefilters,which we have found to be unstable with respect to changes in viewpoint, and the inherent difficulty of matching 2 feature vectors over non-trivial scale changes. This is due to the high correlation betweenfilterresponses over scale changes and the awkwardness of the representation which requiresfindingthe correct vector component overlap. Also, for large scale changes for the vector matching problem, several components from each vector are necessarily eliminated reducing the discrimination power of the representation. We have developed a set of enhancements to the basic grey-level differential invariant method that, for our matching task, yields noticeable performance gains. These enhancements include, • tracking matches through scale space tofilterout unstable matches, • in situ determination of the covariance matrix used to normalize the vectors of differential invariants, and lastly, • k-d tree search over the normalized feature space for rapid matching of differential invariant vectors. Also, Deriche and Giraudon [21] have noted that the brightness gradient direction is not reliably estimated near corners, but see the analysis of Merron and Brady [67] for a proposed solution. 2 CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 115 The criteria used to select the appropriate filter scale is not addressed in their paper [87]. Their technical report [86] reveals that increasing thefilterscale up to a given bound improves their algorithm's performance for the image matching task. For their image database recognition task thefilterscale was set to one. The multi-scale representation consists of a set of differential invariant vectors computed over a range of scales centered on a base scale <7o that vary by a factor of 1.2™ for some integer range of n. The 20 percent factor is empirically derived and reflects the expected differential scale range over which the invariants do not change appreciably. The complete set of scales is given by <r = §Vo, where t i € {—n,..., —1,0,1,..., n}. A value of four for n yields the scale factor range 0.48 to 2.07, hence there are nine differential invariant vectors for each keypoint. To increase the ratio of true to false matches for the image database recognition task semi-local and geometric constraints are applied tofilterout unlikely or ambiguous matches. The semi-local constraint consists of confirming that a potential match has a minimum number of nearest neighbours which are also matched. The geometric constraint imposes a similarity transformation constraint on the gradient angle at the keypoint and its matched neighbours. The gradient angle difference between matches should be locally consistent. We also have found this geometric constraint to be a useful part of the rigidity checking verification process. Differential invariants are computed at three reference scales that differ by multiples of ten percent of the reference base scale, i.e., abase (1 + k * 0.1) for k = 0,1,2. The value of ten percent is half of the expected range of scale insensitivity for differential invariants. We have observed that nearest neighbour matching with differential invariant vectors is scale sensitive over moderate changes in viewpoint. This value was found experimentally to yield good results. Too large a value eliminates most matches and too small a value does not effectively eliminate scale unstable matches. Computing thefilterresponses and differential invariants is currently the most time consuming part of the algorithm. Use of recursivefilteringfor computing the Gaussianfilterresponses on a multi-resolution pyramid will speed up this part. Note also that this part of the algorithm is parallelizable. CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 4.7.1 116 Computing the covariance matrix for differential invariants It is necessary to compute the covariance matrix required to normalize the differential invariants for k-d tree search. The covariance matrix describes the expected variance of the differential vector components with respect to changes in viewpoint, illumination, sensor properties, and noise. Schmid and Mohr do not fully address the issue of how to determine the covariance matrix for their image database recognition task [87] or image matching task [86]. In an extended version of [87] the covariance matrix is determined by averaging together over all keypoints the covariance matrix generated for each keypoint by tracking observations over an extended image sequence. It is not clear, however, if this is suitable for the image matching problem. First, we assume that tracking is generally not an option and secondly the resulting covariance matrix may not be suitable for all image pairs to be matched, except, of course, the images from the sequence itself. Computing and combining covariance matrices over several different sequences, thereby enlarging the sample size, may be sufficient for a variety of matching tasks. We adopt a classical perturbation approach to estimating the covariance matrix for the differential invariant vectors. Such an approach is frequently used when an analytic solution is not feasible. For example, this approach is used to determine datum/model compatibility for the RANSAC algorithm [24]. The RANSAC procedure uses data perturbation to estimate implied error bounds for a given model. An analytic approach for determining the covariance of the differential invariant vectors would require a complete model of the image formation process including, for example, the surface reflectance properties, which are generally unknown a priori, and the transfer function of the image acquisition system. Rather than undertake such an analysis, the covariance matrix is estimated by averaging together a large number of covariance matrices generated from differential invariant vectors and their noise perturbed deviates. We compute a covariance matrix directly from the given image pair by modeling brightness variation due to all the factors mentioned above by normally distributed noise added to the images. The noise variance is some fraction of the variance of the images. A value of 25 percent of the brightness variance has been used in the experiments described below. This is an empirical value that has yielded good results. A principled method for selecting this value requires further investigation. CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 117 A covariance matrix is determined for each comer found by the Harris detector in the non-reference image. The covariance matrix captures the perturbation of the observations between the non-referenceplus-noise and non-reference images over the range of differential scales 0{. This process is intended to characterize the sensitivity of the differential invariants to scale and noise. The global covariance matrix is the average of the covariance matrices over all comers in the non-reference image. 4.8 Nearest neighbour search with A>d trees Given the differential invariant covariance matrices for each scale-level, a k-d tree is built from the differential invariant vectors for both images for each reference scale level. A k-d tree is a generalization of the binary tree used for sorting and searching [27]. The k-d tree implemented here is a binary tree in which each node represents a subfile of the records in the file. The root is the entire set of records, and each non-terminal node has two successors. The successors represent each of the two sub-files after partitioning. The terminal nodes represent mutually exclusive small subsets of the data records, which all together define a partition of the record space. The terminal subset of records are called buckets. For the rigidity checking application, terminal buckets contain only a single record, i.e., a differential invariant vector. In k dimensions, a record is represented by k keys, i.e., the set of keys defines the record, which, forrigiditychecking, is a vector in K . Any one of the k keys can be designated the discriminator for park titioning the subset of records. For a given discriminator key value for dimension k* 6 {1,..., k}, left successor records all have key values less than or equal to the discriminator value and allrightsuccessor records have values greater than the discriminator value. The depth of a k-d tree is \log N] where N is 2 the total number of records. The optimized k-d tree decides the partition key and partition value to yield maximal informational value for a binary tree, which occurs when a record is equally likely to be found on either sub-tree. The median value of the marginal distribution of keys is suggested as the partition value in [27]. Computing the variance and mean of the marginal distribution of keys is less costly than the median and is the partitioning method used by rigidity checking, i.e., the partition key is the dimension with the largest variance and the partition value is the mean of the marginal distribution of keys for that dimension. CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 118 The k-d tree variant used here assumes the points are de-correlated and satisfy the Euclidean norm [96]. The transformation to Euclidean coordinates by definition yields orthogonal, linear components and these components are de-correlated by virtue of using a decomposition of the inverse covariance matrix which defines the Mahalanobis distance. This decomposition of the inverse covariance matrix that yields the change of basis to Euclidean coordinates is described in [87]. An approximate expected value for the number of records examined is given by Sproull [96] as R(k,b)^b{[G{k)/b]^ k + l} k where k is record dimension, b is the number of records in a bucket, and G(k) > 1 is a constant that accounts for the geometric properties of the norm used for distance measurement. Sproull assumes G (k) = 1. The dimension k for the rigidity checking application is seven and the number of records in a bucket is one, therefore, R(7,1) = 128, hence, a nearest neighbour query is expected tofinishin logarithmic expected time since this is much less than the total number of records N which is typically 4500. For each base scale level a bi-directional matching process is executed by k-d tree nearest neighbour look-up. The matching process determines those matches which are the same for each of the three base scale levels. This is a test of the scale stability of the feature. The total scale range spanned by the base scales is only 20 percent which is the expected range of differential invariant scale insensitivity. Small changes in the differential invariant scale, even as little as ten percent, often yield large changes in the nearest neighbour assignments over non-trivial changes in viewpoint. Hence, to improve the ratio of correct to incorrect matches it has proven beneficial to identify those matches which exhibit scale space stability. The smallest base scale level is typically about 3 pixels. This is due to the improved matching results that have been observed with larger scales. Increasing the scale beyond 6 pixels generally leads to poorer performance. Matching at larger scales reduces the sensitivity of the differential invariants to localization and digitization error and brightness surface deformation due to the change in viewpoint. The last step in the image matching process applies rigidity checking to hypothesized set of matches returned by the grey-level differential invariant matching process. CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 4.9 119 Integrating the affine and perspective camera models The problem of verifying a single small set of matches using rigidity checking (RC) was studied in Chapter 3. In this chapter the verification problem is generalized to n correspondences where n is typically much larger than the minimum six matches required for rigidity checking. The set of n matches can contain points which are projected approximately orthographically or are poorly localized or are simply wrong, i.e, gross outliers or blunders. Methods called "robust estimators" have been developed to deal with outliers. Two popular methods are the M-estimators and Least Median of Squares (LMS) [82] [ 123]. M-estimators are reducible to a weighted least squares formulation and are therefore reasonably cost effective but recent investigations have found them less reliable than LMS for detecting gross outliers for a task similar to ours [121]. On the other hand LMS is known to be computationally expensive on sequential processors. We propose a sequential verification approach that is less computationally expensive than LMS. Since RC belongs to the class of least squares estimators it is susceptible to the influence of outliers. The reliability of outlier detection is improved by preprocessing the point correspondences under the affine camera model. The idea is to defer verification of those correspondences least likely to be valid. This preprocessing step serves to rank the likelihood of the validity of the matches. It is motivated in part by results from the grey-level differential invariant matching process. This matching scheme provides hypothesized correspondences, the majority of which often satisfy the affine camera model. Given this behaviour, the ranking process is expected to provide a valid set of matches which, in turn, is expected to provide a sufficient degree of over-constraint that subsequent outliers will be reliably flagged by a large residual error. We show that the preprocessing step followed by sequential RC verification is less costly than the LMS method utilizing the RC algorithm. Shapiro analyzes the affine camera model and applies it to the problem of tracking point features in an image sequence while detecting and eliminating poor or incorrect matches. The affine camera is a generalization of the orthographic, scaled orthographic and para-perspective camera models. It can also be viewed as being equivalent to the uncalibrated scaled orthographic camera [90]. We describe a method that incorporates Shapiro's regression diagnostic method for outlier detection to rank the validity CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 120 of matches relative to the affine camera model. 4.10 Ranking match validity Shapiro and Brady describe a set of methods called regression diagnostics for detecting outliers in an orthogonal regression framework applied to the task of matching points for a tracker for the affine camera model [90]. As noted by Shapiro and Brady the method only works well when the number of outliers and their distance is small. We adopt the regression diagnostic method of Shapiro but apply it in a novel way to yield a ranked set of matches partitioned into two classes. The ranking method ranks the fit of all matches relative to the affine camera for rigid motion. A Monte Carlo procedure, similar to the method given in [82](chapter 5) for the least median of squares method, is employed that takes randomly selected subsets of size seven from the total set of hypothesized matches. The ranking procedure determines the robust orthogonal regression (OR) fit of the subset using a method Shapiro calls eigenvalue-identity to 3 remove at most one outlier. The idea is to improve the coverage of the set of matches for a fixed number of subsets while decreasing the probability of missing a valid subset of matches due to an outlier. If the OR fit of the six or seven matches satisfies a x test then an accumulator is updated that 2 counts the number of times each match participates in a successful trial. After a sufficient number of trials areranto adequately explore the space of combinations of the matches, a histogram is determined over the set of all matches of the number of trials in which each match participated in a successful x fit. 2 The histogram is sorted from largest to smallest number of trials where matches with large counts implies that they are the best candidates forfittingthe affine camera model. The sorted histogram will be used to determine the sequence of matches to be sequentially verified relative to the affine camera model. The OR estimator is applied to thefirstsix matches and the residual error tested against the x statistic. If the 2 first six points satisfy the x statistic at a 0.95 confidence level then the next match on the list is added 2 to the set and the OR estimation process is reran. This process continues until the list is exhausted or a failure of the x test. If the set of matches from this process in non-empty then the motion estimate is 2 retained as an initial estimate for the nonlinear estimator. Note that the rigidity checking linear estimator is equivalent to Shapiro's estimation by orthogonal regression. 3 CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 121 This scheme is very similar to a Hough transform method for parameter estimation and could therefore be considered as a type of maximum likelihood estimator for determining correct correspondences relative to the affine camera model; the difference is that, rather than accumulating the motion parameter estimates, the correspondence labels are accumulated. Hough transform estimators are known to be relatively robust [123]. At this stage in the verification process the set of hypothesized matches has been partitioned into two sets: matches which satisfy the affine camera model and those that do not. Note that either set could be empty. The set of points notfittingthe affine camera model are ranked according to their position in the sorted histogram. Matches with larger counts are more likely to be valid matches than those with smaller counts. Gross outliers are expected to be ranked lowest while the higher ranked points could be valid matches with large perspective distortion. The next stage of the verification process involves taking the ranked list of matches and sequentially building a set of matches verified by rigidity checking. A popular alternative for robust estimation, least median of squares [82], is not directly applicable to the rigidity checking method. RC computes a depth estimate simultaneously with the motion estimate. Given a motion and stmcture estimate from a subset of the matches, an additional optimization step would be required tofindthe median residual over all matches byfindingthe optimal depth estimate for all the remaining matches. The RC framework can accommodate this additional optimization step but the issue of computational complexity favors the sequential approach. The computational complexity of sequential verification is considerably less than that of LMS, even with the additional computation required by the match ranking procedure. The expected complexity of the ranking procedure for m trials over a total of k matches is the product of the number of trials m with the complexity of applying singular value decomposition (SVD), which has a computational complexity, for an x X y matrix, approximately cubic in the maximum of x or y, to a 4 by 4 scatter matrix, multiplied by a small constant that accounts for the constant time to compute the scatter matrices and a single deletion per trial which, as a worst case, amounts to a factor of two. The expected complexity of sequentially verifying k correspondences for the biased nonlinear least-squares estimator is proportional to (k + 5) which is the cost of inverting the approximate Hessian 3 CHAPTER 4. RIGIDITY CHECKFSIG FOR MATCHING POPtfT CORRESPONDENCES 122 matrix where matrix inversion is known to have a complexity which is proportional to the cube of the matrix dimension and k + 5 is number of parameters. The special structure of the Jacobian matrix can be exploited, however, to reduce this computational cost, as mentioned earlier. LMS for our problem would be applied with subsets of six matches. For six matches the nonlinear estimator solves for 11 parameters, 6 for the motion and 5 depth estimates since one of the depths is given a fixed value. The complexity of LMS is the product of the number of trials m and the complexity of inverting a matrix of size 11, times a constant that accounts for the expected number of iterations to converge plus the cost of estimating the residual error for the remaining correspondences which involves inverting a k — 6 matrix to determine the remaining depth values times a constant for the expected number of iterations to converge. Since the number of trials required to find at least one valid set of matches is generally much larger than the number of correspondences, the comparison favors the use of the OR match ranking and sequential verification with rigidity checking. • An alternative approach to robust estimation are the M-estimators but these methods are generally considered less reliable than LMS. On the other hand M-estimators reduce to a weighted least squares optimization process which affords a low complexity solution. The trade-offs between these methods for our task is reserved for future work. The sequential verification approach for RC is expected to be stable and robust when applied to the ranked list of matches. If the majority of correspondences satisfy the affine camera model then the correct verification of a small number of valid correspondences with large perspective distortion is not likely to occur since the full solution for the camera motion and scene structure may not be sufficiently constrained. In this case, RC is expected to correctly verify those correspondences that satisfy the scaled orthographic camera model. Note that LMS is not an option for ranking the correspondences relative to the affine camera model because, in general, greater than 50 percent of the correspondences are not expected to satisfy this camera model. The breakdown point for LMS is when 50 percent of the data consists of outliers. CHAPTER 4. RIGIDITY CHECKING FOR MATCHING POINT CORRESPONDENCES 4.10.1 123 How many subsamples? As in LMS, an exhaustive assessment of all possible combinations of points for some given subsample size is not computationally feasible. As mentioned above, the method proceeds by determining a suitable number of randomly selected subsets of size 7 that probabilistically ensures that all matches are included a sufficient number of times to assess their validity. A baseline value for the number of trials toranis based on the following formula due to Rousseeuw and Leroy [82], P = 1 - [1 - [1 - ] ] , p € m where m is the number of trials that with probability P will contain at least one valid subset of p matches assuming the total set of matches contains e fraction of outliers. P is set to 0.99, p was set at 6 (rather than 7 since a single outlier is eliminated), and we assumed (n — 6)/n fraction of outliers for n matches for the affine camera model as a worst case proportion. An upper bound of 3000 trials was set, a value that was determined empirically to yield reasonably stable results when there are greater than 12 matches. For fewer than 12 matches the number of combinations of 6 from a set of n matches was used. These values are in-line with the scheme proposed by Rousseeuw for his implementation of LMS ([82], chapter 5). Subsets of matches are randomly selected according to a distribution based on inter-point distances which ensures that points for any trial are probabilistically well distributed over thefieldof view to avoid selecting subsets of matches which will yield solutions which are poorly conditioned. The distribution based on inter-point distances introduces a bias for points that are "least crowded" when points are selected at random. Chapter 5 Experimental results for matching point features 5.1 Introduction This chapter presents the results for a series of experiments with real image data for the task of matching feature point correspondences. The chapter begins with a description of the camera calibration exercise that was applied to the SLR camera used to take a series of print color pictures of a variety of outdoor scenes. The film was subsequently digitized according the proprietary KodakPhotoCD™ process. Results for an indoor scene are also given for a monochrome CCD camera. Lastly, results for color differential invariant matching and a test of the rotational invariance assumption are reported. Implementation notes Discrete derivative-of-Gaussian filters can be implemented as either forward or central difference operators. The results reported here, unless otherwise noted, are for the forward difference implementation, since they yielded better grey-level differential invariant matching results. The drawback of using the forward difference operator is that it introduces a 1/2 pixel phase shift in the derivative of the function relative to the basis for the original function. This tends to make the differential invariants less robust 124 CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 125 with respect to 2D rotations in the image plane. An example of this sensitivity is given in Section 5.6. The confidence factor that scales the critical value for the unbiased nonlinear estimator residual test statistic is 16. The x critical values are for a 95 percent confidence level for both the linear and 2 nonlinear estimator residual. The residual for the linear estimator is not scaled by a confidence factor. For each correspondence that is added to the set of verified correspondences, the iteration limit is increased by one. This proved to be of great benefit, especially for the unbiased nonlinear estimator, as the dimension of the parameter space increased. 5.2 Camera calibration A simple calibration scheme is given in this section for determining the focal length of the camera from the constraint of similar triangles for perspective projection. The derivation assumes accurate knowledge of the distance between two points on a planar target in the scene that moves with a known motion parallel to the optical axis. The aspect ratio of the pixels is simply found by taking the ratio of the horizontal and vertical projection of a equal length segment on the target plane. It is further assumed that the target plane is parallel to the image plane and that the center of the image is assumed to be the principal point of the camera. Let — f %a/ Z\ai Via = fy /z\a = fx /z , V2a = fy /z2a Ulb = fxb/zib, "16 = fyb/zu Ub = fXb/Z2b, V2b U 2a 2 a 2a a a = fyb/z b 2 where / is the image magnification factor that will be estimated, u and v are the image coordinates measured with respect to the origin at the center of the image, the subscripts a and b are for two different points on the target board and subscripts 1 and 2 are for two different depth positions of the planar target relative to the camera. Then, CHAPTER 5. EXPERIMENTAL RESULTS = Zla ~ Z2a FOR MATCHING 1 fx \ POINT 1 FEATURES 126 Az a fU ' a 1 1 Az "la V2a fV ' a 1 _ J _ Az fUb' U2b and f t Z\b - Z b = fyb{ where U = l/ui a a — l/u 2a 1 1 2 m \ ) A z Vb = -777, ^26 fVb and similarly for V , Ub and Vj,. a The squared distance between scene points a and 6 is given by (x + (y - Vb) = d - x) 2 a 2 2 a b hence, Az 2 I 2 Ub-Uay u -u ) a Figure 5.1: First calibration image. Frame id 22. b (Vb-Vay + yv -v ) a (5.1) b Figure 5.2: Second calibration image. The relative motion between the camera and target board is 24 inches. Frame id 23. Figure 5.3: Third calibration image. The relative motion between the previous image and this one is again 24 inches. Frame id 24. Several pictures o f a planar target board were obtained at the Laboratory for Computational Intelligence. Figure 5.1 to Figure 5.3 are the calibration images. The target points on the board are 2 inches apart center to center. The array o f target points has dimension 11 x 11. The target board and camera focal plane were arranged to be approximately parallel. Between each frame the board was advanced 12 127 CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES inches as was the camera for a combined relative motion of 24 inches parallel to the optical axis. Proper calibration using the available motion stages was not possible owing to the desire to calibrate the camera focused at infinity. The minimum depth offieldfor the lens and an aperture setting of /2.8 was about 16 feet when focused at infinity. The pixel aspect ratio was determined to be one by measuring in the image the square sides of the target board. Note that the camera aperture outdoors was in the range / l l to /22 which will lead to a slightly different result for the magnification factor but the difference is assumed to be negligible. Table 5.1 gives the focal length calibration results for two points, top left and bottom right, such that their distance d — 2(20 ). Frame pair refers to the image pair that provides the observations. 2 2 Three frames yields three frame pairs denoted 22-23, 22-24, and 23-24. The mean of the magnification factors is 1452. Table 5.1: Camera calibration results. Frame pair Az 22-23 22-24 23-24 24 48 24 2 U u V v -3.6066e-03 -7.1795e-03 -3.5729e-03 9.4764e-04 -2.7389e-03 -4.5822e-03 -1.8433e-03 1.5152e-03 3.8462e-03 2.3310e-03 a 2 2 2 2(20 ) 2(20 ) 2(20 ) 2 2 2 b 1.7885e-03 8.4087e-04 a / b 1.4266e+03 1.4365e+03 1.4945e+03 Table 5.2: Grey-Level differential invariant matching summary. Scene Total Matches:Number Correct Scale-level Base l.l*Base 1.2*Base Playground Construction Cicsr Laboratory 73:20 72:24 102:32 76:14 5.3 58:16 68:14 98:32 65:13 Multi-scale Total Matches:Number Correct 58:19 68:16 86:32 62:19 18:16 18:14 38:24 19:12 Average k-d tree Search 106 104 109 115 ^Harris 1/2 1/2 1/2 1 0~Base 3 5 3 3 Images taken with an SLR camera A set of images were taken with a Minolta SLR camera equipped with a Minolta 50mm fl.7 lens. All images were taken with the focus set to infinity. Thefilmwas ASA 400 color print film digitized with CHAPTER 5. EXPERTMENTAL RESULTS FOR MATCHING POINT FEATURES 128 Table 5.3: Rigidity checking summary. Scene Noise <T Number Diff.Inv. Matches Number Correct Verified Linear Playground Construction Cicsr Laboratory Laboratory 1 1 1 1 2 18 18 38 19 19 16 14 24 12 12 11 14 9 6 9 Verified NonLinear Bias Unbias 5 0 14 6 3 Verified Incorrect Linear 4 0 12 1 3 0 1 3 1 0 Ver. Incorrect NonLinear Bias Unbias 0 0 7 2 4 0 0 7 0 1 Total Verified Bias:Unbias 16:15 14:14 23:21 12:7 12:12 the KodakPhotoCD™ digital imaging system. Conversion from the PhotoCD YCC image color coding scheme to RGB was in accordance with Rec. 709 [75] using the popular image format/processing conversion tool package pbm+ and routine "hpcdtoppm" due to Hadmut Danisch. The conversion to grey-scale images from nonlinear RGB was according to the Rec. 601 standard y' 6 5.3.1 01 = 0.299 R' + 0.587 G' + 0.114 B'. Playground scene Figures 5.4 and 5.5 show an outdoor playground scene. The images are 768 columns by 512 rows in size. Only the camera calibration described in the previous section was performed. The motion consists of a translation parallel to the ground and a small panning rotation. The comer detector smoothing scale for the Gaussianfilterwas set to a standard deviation of 1/2 pixel. The labeled correspondences are determined from the multi-scale differential invariant matching scheme. The base level standard deviation of the differential invariant Gaussianfilterfamily was set to 3 pixels. To compute the covariance matrices for the grey-level differential invariants normally distributed noise was added to the non-reference image. The noise was zero mean with a variance 1/4 of the variance of the reference image pixel brightness. Eighteen matches were found of which fourteen are correct (78%) although matches 14 and 15 are very close to correct and, in fact, are subsequently verified as rigid. Decreasing the noise variance leads to fewer matches and more inconect matches. For a noise variance 1/64 of the variance of the reference image only 11 matches were found of which 3 were inconect. The CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 129 k-d tree for each image consists of 500 x 9 leaves, where 500 corresponds to the number of comers detected by the Harris comer detector [34] [86], and 9 corresponds to the number of differential invariants for each comer. There are three k-d trees, one for each scale level. For this image pair, the average number of records searched was 106 from a total search space of 4500 records. This value is actually less than the expected value of 128 determined in Section 4.8 and confirms the expected logarithmic behaviour. The localization error variance was set to one pixel for rigidity checking. The global depth offset for the nonlinear estimator was set to the standard value of —2. Figure 5.6 shows the ranking of the outliers relative to the affine camera model for an assumed localization error standard deviation of one pixel. Both outliers are correctly detected as well as scene points with large disparity and perspective distortion. Figures 5.7 and 5.8 show the rigidity checking verified matches for an assumed localization error standard deviation of one pixel. Matches 7 and 8 are correctly rejected as outliers. The unbiased estimator incorrectly rejected match 16, whereas the biased estimator did correctly verify it to be part of the rigid configuration. Scale-space differential invariant matching Three base scale levels were employed for the multi-scale matching process. For the following example the scale levels were set to 3, 3.3 and 3.6 pixels. The covariance matrix for normalizing the differential invariants was computed from the second frame using a noise standard deviation of 1/2 * 56.9 where 56.9 is the standard deviation of the non-reference image brightness. The Harris comer detector was applied to both images. Five hundred comers from each image were selected based on their comer strength. A bucketing method was employed to ensure as uniform a distribution as possible over the entire image. For thefirstbase scale, 73 matches were found. Figures 5.9 and 5.10 are the labeled matches for scale-level 0 for both views. The second and third base scales yielded 58 matches in each. Of these matches only 18 were consistently matched through scale space. Of the 73 matches found for scale-level 0, only four more than the subset that survive scale-space matching are correct - matches 41, 52, 61, and 72. This is typical of the other base scales as well. Scale-space matching at the expense of three times the level of effort for matching improves the ratio of true to false CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES Figure 5.4: First frame of playground scene image pair with candidate matches from grey-level differential invariant matching. Figure 5.5: Second frame of playground scene image pair with candidate matches from grey-level differential invariant matching. 130 CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 131 Figure 5.6: Ranked order of outliers relative to affine camera model. The point ranked 1 is least likely to fit the affine camera model. Unlabeled matches satisfy the X 0 . 9 5 (95% confidence level) significance test. Points ranked 1 and 3 are incorrect matches and the remaining labeled points are correct matches but do not satisfy the X 0 . 9 5 test. This is due in part to larger perspective effects, as, for example, the match ranked 2, would be expected to have. matches from 27 percent to 89 percent. This suggests that efficient methods that determine the scale and scale-space stability of features would be useful for determining these attributes for correspondence tokens. 5.3.2 Construction scene Figures 5.11 and 5.12 show a construction scene. Some of the comers are mismatched due to the high symmetry of the stmcture for a smoothing scale of 3 pixels. The mismatching is reduced for a larger smoothing scale and a scale of 5 pixels was chosen. This follows from having a larger filter support size which improves the signal detection at the risk of becoming more sensitive to an affine distortion of the brightness pattern due to the change in viewpoint. The lack of comers on the shadowed side of the building can be attributed to the corresponding CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES Figure 5.7: Rigidity checking verified matches for the playground image pair. These matches are for the unbiased estimator for the nonlinear camera model. The localization error standard deviation is assumed to be one pixel. Verified matches are marked with * ' s . The biased nonlinear estimator also correctly verified match 16. Figure 5.8: Rigidity checking verified matches for the playground image pair. S a m e as previous figure for non-reference image. 132 CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES Figure 5.10: Grey-Level differential invariant matches at scale-level 0, second frame. 133 CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 134 decrease in the gradient magnitude due to the reduced brightness contrast relative to the sunlit side and the fact that after the comer bucketing process, which ensures comers are selected over the entire fieldof-view, the resulting comer list is ranked and only the strongest responses retained. Alternative schemes for selecting comers could be implemented to ensure a more uniform distribution over the field-of-view. For example, uniformly sub-sampling the buckets, which number 4096 for the results reported here, to produce the required number would ensure a uniform distribution of comers over the image. Figure 5.13 shows the rigidity checking verified matches for an assumed localization error standard deviation of one pixel. Outliers 6, 1 and 13 are correctly rejected and outlier 5 is falsely accepted. The remaining matches are correctly verified by the linear estimator - the nonlinear estimator was bypassed due to the sufficiently low residual error of the linear estimator. The outliers were correctly rejected by the nonlinear estimator. The match labeled 13 was rejected as an outlier since the change in its brightness gradient direction (the change in the direction of the brightness gradient between images due to the viewpoint change) did not agree closely enough with the change in the gradient direction derived from the previously verified matches. The gradient direction test determines the largest absolute difference in the gradient direction change of the match to be verified relative to the other matches and this value cannot exceed the threshold value which was set to 45 degrees. 5.3.3 CICSR building Figures 5.14 and 5.15 show two views of the CICSR building at UBC marked with the hypothesized greylevel differential invariant matches. Like the construction scene, some of the comers are mismatched due to the high symmetry of the stmcture details. Figure 5.16 shows therigiditychecking verified matches for an assumed localization error standard deviation of one pixel. As summarized in Table 5.3, nine of the correctly predicted matches are verified by the estimator for the linear camera model and fourteen of the correctly predicted matches are verified by the biased estimator for the nonlinear camera model. The unbiased estimator for the nonlinear camera model verified twelve of the correctly predicted matches, incorrectly rejecting matches 5 and CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES Figure 5.12: Second frame of construction scene image pair with candidate matches from grey-level differential invariant matching. 135 CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 136 Figure 5.13: Rigidity checking verified matches for the construction scene image pair. The localization error standard deviation is assumed to be one pixel. Verified matches are marked with *'s. 11. The unbiased estimator for the nonlinear camera model incorrectly rejected match 16. Ten of the fourteen incorrectly predicted matches are verified as part of a rigid configuration. The motion consists of a translation parallel to the ground and a moderate rotation. The epipolar lines will therefore be approximately horizontal. The disparity of correctly rejected matches 3 and 15 are approximately orthogonal to the expected horizontal direction of the epipolar lines and these matches would therefore be expected to yield a large residual. Less obvious are correctly rejected matches 6 and 13 which have disparities approximately parallel to the expected direction of the epipolar lines. Note, however, that these matches are the two closest to the field-of-view boundary on the right side. Because of the camera rotation, the epipolar lines will not be exactly parallel and matches located near the image boundaries will be sensitive to localization error and camera lens distortion. Incorrectly accepted matches, 9,10,12,22,24,26,30, and 33, all have disparities approximately parallel to the expected epipolar line direction. Most of them are also very close to their correct positions. CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POPNT FEATURES Figure 5.15: Second frame of CICSR building image pair with candidate matches from grey-level differential invariant matching. 137 CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 138 Rigidity ch 100 200 Figure 5.16: Rigidity checking verified matches for the C I C S R building image pair. These matches are from the unbiased estimator for the nonlinear camera model. The localization error standard deviation is assumed to be one pixel. Verified matches are marked with stars. The biased estimator also correctly verified matches 5 and 11. Incorrectly accepted matches, 17, and 18, are mis-localized by small number of pixels, in the order of 4 pixels for match 17 and approximately 10 pixels for match 18. 5.4 5.4.1 Images taken with a CCD camera Laboratory scene Figures 1.3 to 1.9 give the results for an image pair taken at the Laboratory for Computational Intelligence at U B C . The image dimensions are 480 rows by 640 columns. The camera was a monochrome C C D camera providing an analog video signal digitized with an S G I frame-grabber. The camera was calibrated using a procedure similar to the one described for the S L R camera. The image magnification factor was estimated to be 650 pixels with an aspect ratio of one. CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 139 The motion consists of large translation in depth and a small rotation about the horizontal axis. The scale change was manually determined to be approximately 1.5. This type of scale change is not generally handled by classical correlation matching schemes such as the one utilized by Zhang et al. [121]. The rigidity checking linear estimator determined the scale change to be 1.55. The standard deviation of the Harris-Stephens comer detector was set to one pixel and the base level standard deviation of the differential invariant Gaussianfilterfamily was set to 3 pixels. The variance of the zero mean Gaussian noise added to the reference image to determine the differential invariant covariance matrix was 1/4 of the variance of the non-reference image brightness. The number of hypothesized matches is 19 of which 12 are correct. Matches 0, 9, 12, 14, 15, 17, 18 are incorrectly hypothesized as corresponding. The results of the outlier ranking procedure are displayed in Figure 1.7. Matches ranked 1 to 3 are outliers and the match ranked 4 is a correct match. Match 4 is near the periphery of the field of view and consequently has a large disparity and potentially greater perspective distortion. Also, being near the edge of thefield-of-view,lens distortion could be contributing to its higher residual error. The sequential verification process results are given by Figures 1.8 and 1.9. Nine of the twelve correctly verified matches were verified by therigiditychecking linear estimator, i.e., the residual error for the linear estimator satisfied the x significance test with an expected 2 localization error standard deviation of 2 pixels. Three of the four incorrectly verified matches, 0, 9, and 18, from the biased nonlinear estimator were rejected by the unbiased estimator (recall that the unbiased nonlinear estimator is used to "polish off" the estimate provided by the biased version ifrigidityis detected). Match 17 was incorrectly verified by the biased and unbiased nonlinear estimators. Both nonlinear estimators correctly verified matches 1, 13, and 6. The matches labeled 14 and 15 were rejected as outliers since the change in their brightness gradient direction was not consistent with the previously verified matches. 5.5 Color Differential Invariants Several color coding schemes exist. Two were tried, RGB and a color coding scheme designated here as {11,12,13}, due to Ohta [71]. The color of an image is usually given as three values corresponding to CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 140 the tri-stimulus values R (red), G (green), and B (blue). Color features, such as intensity (D), saturation (S), and hue (H), can be calculated from {R,G,B} by using linear or nonlinear transformations. The various color coding schemes have different characteristics. For example, human color perception is conveniently coded as {D,S,H}, {Y,I,Q} is used to efficiently encode color and luminance information in television signals, and the normalized color set {r,g,b} is used to represent the color plane [71]. The matching performance of color differential invariants proved to be better than that of greylevel invariants for some of the scenes in the test set. The additional complexity of computing the larger dimensioned invariant vectors can be ameliorated by parallelizing thefiltercomputations. Due to the high dimensionality of the color differential invariant vectors the k-d tree query completion time is not logarithmic. The use of a variant of the k-d tree that does approximate nearest-neighbour look-up due to Beis and Lowe [7] is a potential solution for achieving expected logarithmic completion time with a very low probability of missing a match. 5.5.1 Color feature transformation due to Ohta Ohta [71] describes an algorithm for segmenting images using color information. Ohta notes that nonlinear color transformations such as {D,S,H} have non-removable singularities, such that a small perturbation in the {R,G,B} value can cause a large jump in the transformed value. Also, the distribution of nonlinearly transformed values can have spurious modes and gaps. For these reasons, the linear transformations such as {Y,I,Q} are considered more appropriate for computational purposes. Ohta, using a recursive region segmentation algorithm, computes the Karhunen-Loeve transformation on the RGB data of the region to determine a salient set of color features. Several images were analyzed to determine the color feature set. In pattern recognition theory, a feature is said to have large discriminant power if its variance is large. To determine the discriminant power of color features, Ohta uses the Karhunen-Loeve transformation. Let A be the covariance matrix of the R, G, and B, distributions. Let Ai, A2, and A3 be the eigenvalues of the covariance matrix, such that, A1 > A2 > A . Let W - = 3 t [uiRi wci u>Bi] , T for i=l,2, and 3, be the eigenvectors of A corresponding to A , respectively. Ohta defines the color features Xi, 4 CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 141 f * > * » n c « imagv-Qrt'.a cocrdft«r«n:Bf irwaranfs Figure 5.17: Playground scene with candidate matches from color differential invariant matching. Color feature components 11 and 12 from Ohta are used yielding a 14 component differential invariant vector. Seventy percent of the matches are correct. * * T v - t * > - » f K * image - O h t a c o b i dKtonrrliil imuranfe Figure 5.18: Same as previous figure but for non-reference image. X , and X3 as 2 Xi = wRiR +wG Gi + WBiB 142 CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES such that ||W,|| = 1, for i=l,2, and 3. It is known that X\, X , and X are uncorrected and Xi is 2 3 "best" in the sense that it has the largest variance which is given by by Ai. X is the best feature dimen2 sion orthogonal to X\. From his analysis of several images, Ohta found that W i was proportional to [1/3, 1/3, 1/3] for all scenes, W was dominated by [1/2, 0, - 1 / 2 ] or [-1/2, 0, l / 2 ] a n d W b y T T T 2 3 [-1/4, 1/2, -1/4] . From this he concluded that h = (R + G + B)/3,I T 2 = (R-B)/2 or (B - R)/2, and 7/3 = (2G — R — B)/4 are important components for representing color information. The recursive K-L segmentation experiments reported by Ohta revealed that the most significant color feature for determining histogram cutoff values was I\, which corresponds to the image brightness component. The next most significant component was I . Component I\ was used 76 percent of the time for deciding the 2 histogram cutoff value, and I was used 20 percent of the time. 2 The playground scene image pair was used for testing the color feature components due to Ohta. The covariance of the RGB values was computed and the largest eigenvalue accounted for approximately 93 percent of sum of the eigenvalues with the corresponding eigenvector [0.6406, 0.5881, 0.4936] ; The T next largest eigenvalue accounted for approximately 6 percent of the sum and its corresponding eigenvector is [-0.6894,0.1576,0.7070] and the last eigenvector is [0.3380, -0.7933,0.5064] . These vecT T tors agree fairly well with the results reported by Ohta, noting, however, the change of sign of the last eigenvector. Given the low variance of the last color feature component only a 14 component color differential invariant was computed from features 7i and I . The covariance matrix for the II feature, which 2 corresponds closely with the standard transformation from RGB to grey-scale, was determined from the reference image and the reference image perturbed with additive noise with a variance of 1/4 the variance of this component for the reference image as is done for the grey-level invariants. The covariance of the 12 invariants, however, were computed without additive noise, and therefore, only captured the perturbations due to scale change. Figures 5.17 and 5.18 show the predicted matches. Of the 37 matches, 26 are correct (approximately 70 percent), compared with 89 percent using grey-level vectors. The k-d tree average number of records checked was 1114 from a space of 4500 records which is substantially higher than for a 7 component differential invariant vector. This is keeping with expected non-logarithmic behaviour for larger CHAPTER 5. EXPERIMENTAL RES ULTS FOR MATCHING POINT FEATURES 143 vector dimensions when the total number of records is not large. 5.5.2 RGB color features A test similar to the one for Ohta was performed using the standard RGB color features. The color differential invariant vector has dimension 21, seven invariants for each color channel. The additive noise statistics for perturbing the reference image to compute the covariance matrix for the invariants was determined from statistics gathered equally from all three channels. However, very poor performance was realized when the additive noise had a variance of 1/4 the variance of the brightness values determined from all three channels. A smaller value of 1/64 the image variance yielded better results and all three channels were perturbed accordingly. Figures 5.19 and 5.20 show the predicted matches from the RGB color differential invariant matching scheme. Of the 35 matches, 26 are correct (approximately 74 percent). The k-d tree search examined an average number of 1180 records. Given the similarity in the average k-d tree search length and in the number of matches found versus number correct it would appear that the color features due to Ohta and the RGB features as well as the corresponding differential invariants share very similar characteristics. The same kinds of scene points are matched, however, the two sets of correct matches do not coincide exactly with the same scene points. In fact, 16 of the 26 correct matches (62 percent) from each set are the same, the remainder yield a total of 20 correct matches. Now, if the incorrect match sets do not overlap than we have perfect correspondence, however, one incorrect match from the two sets of incorrect matches coincide, and, unfortunately, they also spatially coincide in the non-reference image, otherwise this incorrect match could be discarded as well. By comparing the location of the color differential invariant matches for the two color feature sets, a set of predicted matches with only one incorrect match is obtained. In this case, 16 out of 17 correct matches are obtained which is comparable to the grey-level invariant performance which is 16 out of 18 are correct. Checking the intersection of all three sets, i.e., RGB, Ohta, and grey-level, yields a set of 11 matches all of which are correct. CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 144 ftfte w nc* imag* — RGB color d iilarenial in va 1131\ Figure 5.19: Playground scene with candidate matches from color differential invariant matching. RGB color feature components yielding a 21 component differential invariant vector. 74 percent of the matches are correct. iMDn-mtaranc* imjga - R G B c o b r dtfvwrriial invjrante Figure 5.20: Same as previous figure but for non-reference image. 5.5.3 Construction scene color differential invariants The same test was performed with the construction scene image pair. The RGB invariant matching scheme yielded 33 matches of which 29 were correct (88 percent), compared with 14 out of 18 correct matches CHAPTER 5. EXPERIMENTAL RES ULTS FOR MATCHING POINT FEATURES 145 (78 percent) for the grey-level invariants. The k-d tree search examined an average number of 1320 records. The number of hypothesized matches using the color features due to Ohta was only 17 of which 11 are correct (65 percent) and the k-d tree search length averaged 1190 records. The number of correct matches in the intersection set between RGB and Ohta was 8 with an empty intersection between the incorrect match sets. Finally, intersecting the RGB and Ohta matches with the grey-level differential invariant matches (of which there is a total of 18, 14 of them correct) yielded only 4 matches all of which are correct. In this case, the low number of matches from the color feature set due to Ohta reduced the likelihood of a large number of overlapping correct matches. 5.5.4 Nitobe Garden scene: color and grey-level differential invariants For the Nitobe garden scene, deep shadows and the relativelyflatpond surface poses potential difficulties for comer detection and false matches. As with the construction scene, the reduced contrast of any 2D brightness discontinuities in the shadowed regions means that these comers are rejected in favor of the high contrast comers from the strongly illuminated scene areas. The specular reflectance component of the surface of the pond can lead to false matches since the scene brightness will be indistinguishable from the mirror image of the scene - assuming the pond surface is planar and an ideal specular reflector the reflected image will have the same radiance as the scene surface element. False matches involving features reflected through the surface of the pond are not found in the tests reported below, probably since the surface does not appear to fit the given assumptions, for example, the reflection appears to be distorted in most areas. The experimental results indicate that the RGB color differential invariants have a decisive advantage over grey-level invariants for this image pair. The color features due to Ohta yield marginally better results than the grey-level invariants. Figures 5.21 and 5.22 show the predicted matches from the grey-level differential invariant matching scheme. Of the seven matches, four are correct. The k-d tree search examined an average number of 123 records. The filter characteristic length for the Harris-Stephens' comer detector was one-half pixel and the base scale level for the multi-scale matching was three pixels. These values apply to the color differential invariant matches as well. CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 146 Figure 5.21: First frame of the image pair of the Nitobe Garden at UBC with candidate matches from grey-level differential invariant matching. Of the 7 matches, 4 are correct. Figure 5.22: Second frame of the image pair with candidate matches from grey-level differential invariant matching. The RGB invariant matching scheme yielded 12 matches of which 11 are correct and are given by Figures 5.23 and 5.24. The A:-d tree search examined an average number of 1684 records. The number of hypothesized matches using the color features {II ,12} from Ohta was nine of which five are correct, CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 147 Figure 5.23: First frame of the image pair of the Nitobe Garden at UBC with candidate matches from RGB color differential invariant matching. The differential invariant vectors have 21 components, 7 from each color feature. Of the 12 matches, 11 are correct. Figure 5.24: Second frame with candidate matches from RGB color differential invariant matching. although match 2 is mis-localized by approximately 5 pixels relative to the reference image. The k-d tree search length averaged 1189 records. The predicted matches for the Ohta color features are given CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 148 in Figures 5.25 and 5.26. The number of correct matches in the intersection set between RGB and Ohta is four with an empty intersection between the incorrect match sets. Finally, intersecting the RGB and Ohta matches with the grey-level differential invariant matches yielded only three matches all of which are correct. In this case, the low number of correct matches from the color feature set due to Ohta and the grey-level invariants again reduces the likelihood of a large number of overlapping matches. Figure 5.27 shows the rigidity checking verified matches for the unbiased estimator for the nonlinear camera model. The biased estimator also correctly verified matches 2,4, and 8. Match 11 was incorrectly rejected and match 1 was incorrectly accepted. The camera motion consists of a translation parallel to the ground plane and a small rotation. The epipolar lines are expected to be approximately horizontal and the disparity of outlier match 1 is horizontal so a low residual for this match could be expected. 5.5.5 Summary The use of multiple color feature sets to disambiguate color differential invariant matches by a set intersection method has been demonstrated. The results are preliminary and further investigation is required to determine a general principle for matching point features by comparing matches from a set of color feature sets and a grey-level feature. These preliminary results, however, indicate that this may be worth pursuing. The intersection method yielded sets of correspondences that contained a significantly greater proportion of correct matches relative to the parent sets. Table 5.4: Matching results for color and grey-level differential invariants. Percent proportion of correct matches for three scenes. _ _ Scene RGB Ohta Grey-level Playground 74 70 89 78 Construction 88 65 Nitobe 92 56 57 Test results with three image sets for comparing the performance of color and grey-level differential invariants were reported. A summary of the matching results is given in Table 5.4. Color differential invariants generally yielded a much larger set of matches than those provided by the grey-level invariants, and, in the case of the RGB color feature set, a consistently higher proportion of correct matches CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 149 Reference image - 1112 differential invariant matches 100 200 300 400 500 600 700 Figure 5.25: First frame of the image pair of the Nitobe Garden at UBC with candidate matches from color differential invariant matching using 1112 color features due to Ohta. The differential invariant vectors have 14 components, 7 from each color feature. Of the 9 matches, 5 are correct, however, match 2 is mis-localized by approximately 5 pixels. Figure 5.26: Second frame with candidate matches from color differential invariant matching using 1112 color features. CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 150 Rigidity checking verified matches from R G B color invariants 100 200 300 400 500 600 700 Figure 5.27: Rigidity checking verified correspondences from the hypothesized RGB color differential invariant matches for the Nitobe Garden scene. These verified matches are from the unbiased estimator for the nonlinear camera model. The biased estimator also correctly verified matches 2, 4, and 8. Verified matches are marked with *'s. relative to grey-level differential invariant matches were produced with one exception. For that case, the RGB matching performance was still acceptable. Differential invariant matching with the color feature set due to Ohta did not perform as well as either the RGB or grey-level invariants. Given that one of the color features from the Ohta set is approximately equivalent to the grey-level brightness value this is somewhat surprising. Possible causes may include a poorly specified covariance matrix, since noise was not added to the the second color feature, or the color feature {h,h} may not yield differential invariant vectors that are unambiguous or sufficiently invariant with respect to the change in viewpoint. RGB color feature invariants generally yielded a higher proportion of correct matches than the color feature set due to Ohta. The main drawback of using color differential invariants is the added cost of computing the extra vector components and the greatly increased k-d tree search complexity due to the higher dimensionality of the invariant vector. As described earlier, however, methods exist for overcoming these drawbacks such as the use of efficient algorithms for computing filter responses which can also be run in parallel, and the use of approximate nearest-neighbour look-up algorithms. With these CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 151 enhancements, the use of color differential invariants would be advisable over the grey-level variety. 5.6 Image matching rotational invariance The differential invariants are rotationally symmetric and should, therefore, be invariant to an arbitrary rotation about the optical axis. We have seen matches, albeit incorrect, which were made as a result of the rotational invariance property in the results from the construction and laboratory image pairs. In those cases the incorrect matches were detected since their change in orientation did not agree with the expected value from the other matches. To test the effectiveness of therigiditychecking method for a large rotation about the optical axis the non-reference image from the playground image pair was rotated counter-clockwise by ninety degrees such that the original image brightness values are retained. The receiver operating characteristic results for the standard scenario lead us to expect that the rigidity checking method will be robust with respect to this type of rotation and this is shown to be true for the following test. The comer detection and grey-level differential invariant matching results show that, for this 2D image rotation, there is no difference compared to the image pair without the 2D rotation, as expected. Note that the grey-level differential invariants are computed using the central difference form of the derivative operation. This ensures that rotational invariance is achievable. Use of the forward difference, which introduces a one-half pixel shift in the resulting derivative, would, in principle, render the differential invariant representation not rotationally invariant. However, in practice, it can be considered as another noise source. Figures 5.28 and 5.29 show the predicted matches from the grey-level differential invariant matching scheme for filter responses computed using the central difference. Figures 5.30 and 5.31 show the predicted matches from the grey-level differential invariant matching scheme for filter responses computed using a central difference for the rotated reference image. Thesefiguresare also marked with the rigidity checking verified matches for the biased estimator for the nonlinear camera model. The hypothesized matches from the grey-level differential invariant matching scheme are identical. The verification results for the biased estimator for the nonlinear camera model are virtually identical CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 152 Reference image — verified greylevel differential invariants Figure 5.28: First frame of playground image pair with verified candidate matches from grey-level differential invariant matching. Differential invariant filter responses computed using central difference. The rigidity checking verified matches are marked with *'s and are from the biased estimator for the nonlinear camera model. Figure 5.29: Second frame of playground image pair. Differential invariant filter responses computed using central difference. The rigidity checking verified matches are marked with *'s. CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 153 Figure 5.30: First frame of playground image pair with verified candidate matches from grey-level differential invariant matching. Image rotated counter-clockwise by ninety degrees. Differential invariant filter responses computed using central difference. The rigidity checking verified matches are marked with *'s and are from the biased estimator for the nonlinear camera model. Figure 5.31: Second frame of playground image pair. See caption from previous figure. CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 154 in that matches 9 and 10 from Figures 5.30 and 5.31 are correctly verified whereas they are incorrectly rejected for the image pair without the 2D rotation. The matching results, not illustrated here, for the unbiased estimator for the nonlinear camera model do not agree as well. For Figures 5.30 and 5.31, matches 7,4 and 19, are incorrectly rejected whereas they are correctly verified in Figures 5.28 and 5.29 - labeled as matches 8,13, and 20. Also, match 16 is incorrectly accepted for the image pair with rotation whereas it is correctly rejected for the image pair without the 2D rotation where it is labeled as match 15. This test serves to point out that the comer detector and differential invariants are behaving as expected. It can also be concluded that the rigidity checking method for the biased estimator for the nonlinear camera model appears to be robust to this type of 2D transformation which satisfies the expectation drawn from the ROC test results. However, we note that the unbiased estimator for the nonlinear camera model does not achieve the same performance as the biased estimator. This is probably due to the increased dimensionality of the parameter space for the unbiased estimator and the conservativeness of the x 2 t e s t statistic. It is alsojworth noting that the minimizer for a nonlinear least-squares system reached from a given starting point in the parameter space may not necessarily be the same as the one reached when starting from some other point. It may be the case that the much larger parameter space for the unbiased estimator for the nonlinear camera model relative to the biased estimator leads to a greater sensitivity to the initial estimate for the minimization and increases the likelihood of not converging to the same minimizer - especially for ill-conditioned problems due to the presence of several observations that satisfy the linear camera model. 5.7 Conclusions We have described an approach to matching point features from a pair of images by using differential invariants of the brightness function and rigidity checking. A multi-scale extension to the basic method of Schmid and Mohr led to a large improvement in the ratio of correct to incorrect matches. Nearest neighbour search with k-d trees answered queries in logarithmic time as expected, and calculating the covariance matrix at run-time yielded a relatively good matching performance. Ranking the correspondences relative to the affine camera model has yielded a robust verification process leading to a viable CHAPTER 5. EXPERIMENTAL RESULTS FOR MATCHING POINT FEATURES 155 sequential verification scheme. The method, however, suffers from the inability of the grey-level differential invariant matching scheme to handle moderate to large viewpoint changes. This sensitivity to viewpoint change leads to the preponderance of matches that satisfy the affine or scaled-orthographic camera model which in turn leads to a poorly conditioned nonlinear system of equations solved by the rigidity checking method. The robustness of therigiditychecking method is thus emphasized by the correspondences provided by the grey-level differential invariant matching scheme. Overall, the biased estimator for the perspective camera model performed better than the unbiased estimator. For a somewhat higher risk of accepting invalid matches, the biased estimator successfully verified all of the correctly hypothesized matches that did not satisfy the linear camera model with one exception, a single match for the CICSR building scene was incorrectly rejected. The number of incorrectly accepted matches is small or none at all, with the exception of the CICSR building results. In that case, due to the symmetry and repetitiveness of the stmcture details, several incorrect matches yielded disparities approximately parallel to the expected epipolar line direction. This situation, however, should not significantly reduce the accuracy of an estimate of the epipolar geometry since the matching errors are along the expected epipolar lines. The slightly poorer performance of the unbiased estimator for the nonlinear camera model is mainly attributed to the conservativeness of the x test statistic and possibly to a greater sensitivity to 2 the conditioning of the nonlinear least squares normal equations due to the much higher dimensionality of the parameter space. Further study of the effect of the stabilization weights on the performance and conditioning of the unbiased estimator could shed more light on this issue. Increasing the number of iterations does, in some cases, increase the number of correctly verified matches, but this was not tested formally. Results from the test to observe the performance of the system with respect to 2D image plane rotations showed the comer detection and grey-level differential invariant matching processes to be invariant to this transformation, as expected. Therigiditychecking results for the biased nonlinear estimator were relatively unaffected by the transformation, however, two correct correspondences incorrectly rejected for one set of views were correctly verified for the other set. This indicates that the nonlinear estimator does exhibit some sensitivity to the initial estimate. The verification results for the unbiased CHAPTER 5. EXPERIMENTAL RES ULTS FOR MATCHING POflVT FEATURES 156 estimator were not as good, indicative of a much greater sensitivity to the initial estimate provided by the biased estimator in the sequential verification process. Chapter 6 Conclusions and future directions To recover the structure of the scene from multiple images, or to recognize an object given a single previous view, it is necessary to establish the equivalence of certain geometric information obtained from each image. This sub-task involves solving the correspondence problem which is the problem addressed by this thesis. The main contribution of this thesis is the method of rigidity checking. The rigidity checking method verifies the potential rigidity of 3D point configurations from a pair of views under perspective projection. The accurate verification of the rigidity of an hypothesized set of three-dimensional point correspondences from two views solves the correspondence problem. An example of an application of the method was implemented and tested - an automated process for the matching of image point features. Establishing feature correspondences between images of an object is a critical sub-task for a variety of applications. Potential applications include 3D object recognition from a single previous view and correspondence verification for stereo or motion over widely separated views. Related applications from the area of photogrammetry, include, for example, automating the relative orientation process for aerial or satellite imagery. In this application it is often the case that the intrinsic camera parameters are known, a prerequisite of the rigidity checking method, hence, the automation of this task can benefit from including the rigidity checking algorithm in the workstation software suite for verifying hypothesized point correspondences. Our analysis began with the observation that it is often the case that two views cannot provide 157 CHAPTER 6. CONCL USIONS AND FUTURE DIRECTIONS 158 an accurate structure-from-motion estimate because of ambiguity and ill-conditioning. However, it was argued that an accurate yes/no answer to the rigidity question is possible and experimental results supported this assertion with as few as six pairs of corresponding points over a wide range of scene structures and viewing geometries in a Monte-Carlo simulation for various noise levels. The performance of the method was also tested using manually selected correspondences from real images. The rigidity checking method was evaluated as arigiditydetector where the detection performance was captured by the receiver operating characteristic (ROC) curve from data generated by the Monte-Carlo simulation. The ground-truth for tests with real images was determined by manually selecting the correct correspondences. Rigidity checking verifies point correspondences by using 3D recovery equations as a matching condition. The proposed algorithm improves upon other methods that fall under this approach for the following reasons: the algorithm works with as few as six corresponding points under full perspective projection, handles correspondences from widely separated views, makes full use of the disparity of the correspondences, and is integrated with a linear algorithm for 3D recovery due to Kontsevich. Rigidity checking was applied to the problem of automatically determining a set of feature point correspondences from a pair of brightness images using an hypothesize and verify heuristic. Automating the task of matching feature points requires a mechanism to perform the initial hypothesizing sub-task. The method of grey-level differential invariant matching due to Schmid and Mohr was implemented to accomplish this sub-task. The verification sub-task was accomplished by a sequential verification process using therigiditychecking algorithm with a match pre-filtering stage that ranked the likelihood of the correspondences being correct with respect to the affine camera model. Experimental results for tests on several real image pairs with ground-truth provided by manually verifying the rigidity checking correspondence results proved therigiditychecking method to be tractable and robust for this task. However, some limitations were revealed by experiments with the implemented solution to the automated image matching problem. With the initial feature point selection scheme, the selection of comers was biased to those with the strongest comer detection responses, hence, very few comers are detected or matched from shadowed surfaces. The grey-level differential invariant matching scheme is CHAPTER 6. CONCLUSIONS AND FUTURE DIRECTIONS 159 sensitive to the scale of the point features and to moderate or large viewpoint changes. As a result of this failure mode, a large proportion of the hypothesized correspondences were biased towards distant scene points whose brightness pattern remained stable with respect to the given viewpoint change. The rigidity checking unbiased nonlinear estimator was sensitive to the initial estimate provided by the biased estimator, and generally verified fewer correct matches, primarily due to the conservativeness of the x test statistic. Another contributing factor may be due to the much larger dimension of the 2 parameter space which can reduce the convergence rate of the unbiased estimator relative to the biased estimator and increase the susceptibility of converging to a local minimum. These problems are exacerbated by ill-conditioning of the least-squares normal equations which frequently occurs when several of the correspondences satisfy the linear camera model and the viewpoint changes are relatively small, as was the case for some of the PhotoCD test image pairs. However, the results show that the conservativeness of the unbiased estimator ensured that the verified set of correspondences was almost assuredly error free. Due to the nature of the comer detection and correspondence prediction steps in the fully automated application, only a limited test of the rigidity checking method could be made. Nevertheless, the point feature matching application did exemplify the robustness of the rigidity checking method. Rigidity checking experimental results obtained from Monte Carlo testing and from real images with manually determined correspondences helped to establish a complete picture of the performance of the method. Finally, a novel extension of the grey-level differential invariant specification to color differential invariants was given and preliminary test results were reported. Two sets of color features were specified, RGB and a subset of the color features proposed by Ohta. The RGB color feature set yielded the best performance, generally out-performing the grey-level invariants in terms of proportion of correct correspondences as well as total number of matches. A method that simply intersects correspondence sets from the differential invariant matching scheme for different sets of color features produced sets of correspondences with a significantly higher proportion of correct matches relative to the parent sets. The number of elements in the parent correspondence sets can be quite small, however, so the method may be impractical unless the cardinality of the original correspondence sets is substantially increased. Given appropriate algorithms for efficiently computing the invariant vectors and for performing CHAPTER 6. CONCLUSIONS AND FUTURE DIRECTIONS 160 nearest-neighbour look-up, the use of color differential invariants with the RGB color feature set is recommended. 6.1 Issues for further study In this section several issues that were not fully explored are reviewed with suggestions for future work. 6.1.1 Issues arising from the image matching application A version of point feature matching using rigidity checking with least median of squares could be compared with one or more M-estimator versions and sequential verification. The study could look at the trade-offs, for example, between computational cost and robustness. The grey-level differential invariant matching scheme performed somewhat below expectations. Several factors contribute to this including, for example, the localization error of the detector, the sensitivity of the differential invariant representation to discretization error, the appropriate scale for characterizing the brightness function and the type of physical event in the scene that gave rise to the comer. Alternative keypoint detection methods such as the recent work by Lowe [62] onfindingpeak responses in scale-space of the Laplacian of the brightness function and work by Lindeberg [57] on detecting the scale and scale-space stability of blob and comer features could provide better features for characterizing with differential invariants for predicting correspondences. The use of color differential invariants was recommended if given tractable algorithms for computing the invariants and nearest-neighbour lookup. The existence of such algorithms could be determined and the approach pursued if they exist. A good starting point for this investigation is the approximate nearest-neighbour lookup method using k-d trees due to Beis and Lowe [7]. Several theoretical results have been reported for recursivefilteringimplementations of the Gaussian kernel which are known to be efficiently computable. The possibility of combining the matching results derived from different sets of color features to disambiguate the matches could be investigated further. CHAPTER 6. CONCLUSIONS AND FUTURE DIRECTIONS 6.1.2 161 Rigidity checking issues The proposed approach assumes that the camera has been calibrated, i.e., the intrinsic parameters of the camera are known. This may be considered an undesirable requirement for a general approach to image matching. Recent results from projective geometry concerning camera calibration suggest that a minimum of three views by the same camera and known conespondences are required to estimate the five calibration parameters [36]. The discussion in Chapter 2 on camera calibration suggests that the collinearity constraint is the most appropriate model for the geometric component of the image formation process. A study of how to extend the rigidity checking method to incorporate camera calibration and lens distortion parameters could focus on such issues as the minimum number of correspondences required for an over-constrained solution for the extended camera model, which additional parameters are expected to be well constrained, and what benefits, if any, are accrued by incorporating intrinsic camera and lens distortion parameters. A good starting point is the stmcture-from-motion method due to Azarbayejani and Pentland which incorporates the focal length of the camera into the parameter space [3] and methods from the photogrammetry literature, e.g., [32, 38]. Other issues that arose from the experiments include: • the suitability of the x distribution for the unbiased nonlinear objective function. An investigation 2 could be made into what the actual noise distribution is and the impact the choice of the estimation algorithm has on the residual distribution. • what are the probability distributions that determine type I, rejecting a valid match, and type II, accepting an invalid match, decision errors. • the convergence behaviour of the unbiased estimator for the nonlinear camera model, e.g., the potential problem of converging to local minima, and a sensitivity to the initial parameter estimate, especially for ill-conditioned problems. CHAPTER 6. CONCL USIONS AND FUTURE DIRECTIONS 6.1.3 162 Robust estimation of the epipolar geometry Given the set of verified correspondences, numerous methods exist for determining the epipolar geometry. The basic rigidity checking method could be modified to accomplish this or a method such as the one developed by Zhang et al. [121] could be used. The rigidity checking method can provide an estimate for the relative orientation of the cameras between two views. The epipolar geometry is given directly from the relative orientation parameters [5]. Given a robust estimate of the epipolar geometry, stereo matching techniques could be applied to increase the number of correspondences, with the goal of achieving a dense matching between the pair of images. 6.2 Public availability of rigidity checking implementation The rigidity checking algorithm has been implemented in C and is freely available 5 at ftp://ftp.cs.ubc.ca/pub/local/danm. > Bibliography [1] Anstis, S.M., The perception of apparent motion, Phil Trans. R. Soc. London B 290,153-168,1980. [2] Ayache, N., Artificial Vision for Mobile Robots, The MIT Press, 1991. [3] Azarbayejani, A., Pentland, A., Recursive estimation of motion, structure, and focal length, IEEE PAMI, (17), No. 6, June 1995, pp. 562-575. [4] Ballard, D.H., Brown, C M . , Computer Vision, Prentice Hall, 1982. \; [5] Basri, R., On the uniqueness of correspondence under orthographic and perspective projections, Proc. Image Understanding Workshop, pp. 875-884,1992. Also published as A.I. Memo No. 1333, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, December 1991. [6] Bedekar, A.S., Haralick, R.M., Finding corresponding points based on bayesian triangulation, IEEE Computer Society Conference On Computer Vision and Pattern Recognition, pp. 61-66, 1996. [7] Beis, J.S., Lowe, D.G., Shape indexing using approximate nearest-neighbour search in highdimensional spaces, IEEE Computer Society Conference On Computer Vision and Pattern Recognition,^. 1000-1006,1997. [8] Bennett, B.M., Hoffman, D.D., Prakash C , Observer Mechanics: A Formal Theory of Perception, Academic Press, 1989. [9] Bennett, B.M., Hoffman, D.D., Nicola J.E., Prakash C , Structure from two orthographic views of rigid motion, J. Opt. Soc. Am. A, 6(7), 1052-1069,1989. [10] Bennett, B.M., Hoffman, D.D., Prakash, C , Recognition Polynomials, J. Opt. Soc. Am. A, 10(4), 759-764, April 1993. [11] Beyer, H.A., Linejitter and geometric calibration of CCD cameras, ISPRS, J. photogrammetry and remote sensing, 45, pp. 17-32, 1990. 163 [12] Blaszka, T., Deriche, R., Recovering and characterizing image features using an efficient model based approach, Rapport de Recherche 2422, INRIA Sophia-Antipolis, France, Nov 1994. [13] Bosemann, W., Geometric models in object based multi-image matching, Spatial information from digital photogrammetry and computer vision, International Society for Photogrammetry and Remote Sensing, pp. 61-68,1994. [14] Braddick, O.J., Low-Level and high-level processes in apparent motion, Phil Trans. R. Soc. London 5 290,137-151,1980. [15] Braunstein, Myron L.,Hoffman, Donald D., Pollick, Frank E., Discriminating rigid from nonrigid motion: minimum points and views,Perception and Psychophysics, 47(3), 205-214, March 1990. [16] Chen, J., The Use of Multiple Cameras And Geometric Constraints For 3-D Measurement, unpublished Doctoral dissertation, The City University, Dept. of Electrical, Electronic, and Information Engineering, 1995. [17] Cheng, Y-Q., Collins, R.T., Hanson, A.R., Riseman, E.M., Triangulation without correspondences, Proceedings, ARPA Image Understanding Workshop, (Monterey, CA, November 13-16, 1994), Morgan Kaufmann, San Francisco, CA, pp. 993-1000,1994. [18] Cui, Y , Lawrence, P.D., Detecting scale-space consistent comers based on comer attributes, Int. Conf. Systems, Man and Cybernetics, Vol.4, pp. 3459-3554,1995 [19] Jacobs, D.W., Chennubhotla, C , Finding structurally consistent motion correspondences, Proc. Int'l Conference on Pattern Recognition, pp. 650-653, 1994. [20] Dennis, J.E., Schnabel, R.B., Numerical methods for unconstrained optimization and nonlinear equations, Prentice-Hall series in computational mathematics, Englewood Cliffs, NJ, 1983. [21] Deriche, R., Giraudon, G., A computational approach for comer and vertex detection, International Journal of Computer Vision, 10:2,101-124, 1993. [22] Faugeras, O., Stratification of three-dimensional vision: projective, affine, and metric representations, J. Opt. Soc. Am. A, 12(3), 465-484, March 1995. [23] Faugeras, O., Luong, Q.-T., Maybank, S.J., Camera self-calibration: theory and experiments, European conference on computer vision, pp. 321-334, 1992. [24] Fischler, M.A., Bolles, R.C., Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Comm. of the ACM, Vol. 24, No. 6, pp.381-395, June 1981. 164 [25] Fitzgibbon, A.W., Fisher, R.B., Lack-of-fit detection using the run-distribution test, European conference on computer vision, Lecture Notes in Computer Science, Vol. 801, pp.173-178,1994. [26] Freeman, W.T., Exploiting the generic view assumption to estimate scene parameters, Proc. 4th International Conference on Computer Vision, 347-356, 1993. [27] Friedman, J.H., Bentley, J.L., Finkel, R.A., An algorithm for finding best matches in logarithmic expected time, ACM Trans. Math. Software, Vol. 3, No. 3, pp. 209-226, 1977. [28] Frisby, J.P., Pollard, S.B., Computational issues in solving the stereo correspondence problem, in Computational Models of Visual Processing, Landy, M.S., Movshon, J.A., eds., The MIT Press, Cambridge, MA, 1991. [29] Fua, P., A parallel stereo algorithm that produces dense depth maps and preserves image features, Machine Vision and Applications, 6, 35-49, 1993. [30] Gonzalez, R.C., Woods, R.E., Digital Image Processing, Addison-Wesley Publishing Company, Reading, MA, 1992. [31] Grzywacz, N.M., Yuille, A.L., Massively parallel implementations of theories for apparent motion, Spatial Vision, Vol. 3, No. 1, 15-44, 1988. [32] Gruen, A.W., Baltsavias, E.P., Adaptive least squares correlation with geometrical constraints, Computer Vision for Robots,Society of Photo-optical and Instrumentation Engineers, vol. 595, pp. 72-82,1985. [33] Harris, C , Structure-from-motion under orthographic projection, Proc. of 1st European Conference on Computer Vision (ECCV90), 118-123, 1990. [34] Harris, C , Stephens, M., A combined comer and edge detector, Proceedings 4thAlvey Vision Conference, pp. 147-151, 1988. [35] Harris, C.G., Pike, J.M., 3D positional integration from image sequences, Image And Vision Computing, Vol. 6, No. 2, 87-90,1988 [36] Hartley, Richard I., Euclidean reconstmction from uncalibrated views, Applications of invariance in computer vision : second joint European-US workshop, Joseph L. Mundy, Andrew Zisserman, David Forsyth (eds), Ponta Delgada, Azores, Portugal, 237-256, October 9-14,1993. [37] Hartley, Richard I., Self-calibration of stationary cameras, International Journal of Computer Vision, 22(1), pp. 5-23,1997. [38] Heipke, C , A global approach for least-squares image matching and surface reconstruction in object space,Photogrammetric Engineering & Remote Sensing, vol. 58, no. 3, pp. 317-323, 1992. 165 [39] Horaud, R., Christy, S., Dornaika, R, Object pose: the link between weak perspective, para perspective, and full perspective, Rapport de Recherche, N. 2356, INRIA Sophia-Antipolis, France, September 1994. [40] Horn, B.K.P., Robot vision, Cambridge, MA, The MIT Press, 1986. [41] Horn, B.K.P, Relative orientation, International Journal of Computer Vision, 4, 59-78,1990. [42] Horn, B.K.P, Relative orientation revisited, J. Opt. Soc. Am. A, 8(10), 1630-1638, Oct 1991. [43] Hu, X., Ahuja, N., Motion estimation under orthographic projection, IEEE Transactions on Robotics and Automation,Yo\. 7, No. 6, 848-853, 1991. [44] Hu, X., Ahuja, N., Matching point features with ordered geometric, rigidity and disparity constraints, IEEE Trans. PAMI, Vol. 16, No. 10, pp. 1041-1049, October 1994. [45] Huang, T.S., Lee, C-H., Motion and stmcture from orthographic projections, IEEE Trans. PAMI, Vol. 11, No. 5, 536-540,1989. [46] Huang, T.S., Netravali, A.N., Motion and stmcture from feature correspondences, Proceedings of the IEEE, Vol. 82, No. 2, pp. 252-268, February 1994. [47] Huttenlocher, D.P, Kleinberg, J.M., Comparing point sets under projection, ACM-Symposium on discrete algorithms, pp. 1-7, 1994. [48] Huttenlocher, D.P, Lorigo, L.M., Recognizing three-dimensional objects by comparing twodimensional images, IEEE Computer Society Conference On Computer Vision and Pattern Recognition,^. 878-884,1996. [49] Kanatani, K., Geometric Computation for Machine Vision, Oxford University Press, New York, 1993. [50] Koenderink, J.J., van Doom, A.J., Representation of local geometry in the visual system, Biological Cybernetics, 55, pp. 367-375,1987. [51] Koenderink, J.J., van Doom, A.J., Affine stmcture-from-motion, J. Opt. Soc. Am. A, 8(2), 377-385, Feb 1991. [52] Kontsevich, L.L., Pairwise comparison technique: a simple solution for depth reconstruction, J. Opt. Soc. Am. A, 10(6), 1129-1135, June 1993. [53] Kumar, R.V.R., Tirumalai, A., Jain, R.C., A non-linear optimization algorithm for the estimation of stmcture and motion parameters, IEEE Computer Society Conference On Computer Vision and Pattern Recognition, 136-143,1989. 166 [54] Lane, R.A., Thacker, N.A., Seed, N.L., Stretch-correlation as a real-time alternative to featurebased stereo matching algorithms, Image and Vision Computing, 12(4), 203-212, May 1994. [55] Lee, C-H., Huang, T., Finding point correspondences and determining motion of a rigid object from two weak perspective views, Comp. Vis. Graph, and Image Processing 52, 309-327,1990. [56] Levenberg, K., A method for the solution of certain nonlinear problems in least squares, Quart. Appl. Math., vol. 2, pp. 164-168,1944. [57] Lindeberg, T., On scale selection for differential operators, Proc. 8th Scandinavian Conf. Image Analysis, Tromso, Norway, pp. 857-866,1993. [58] Longuet-Higgins, H.C., A computer program for reconstructing a scene from two projections, Nature, vol. 293, 133-135, Sept 1981. [59] Longuet-Higgins, H.C., The reconstruction of a scene from two projections - configurations that defeat the 8-point algorithm, Proc. 1st IEEE conference on artificial intelligence applications, Denver, CO, pp. 395-397, December 1984. [60] Lowe, D.G., Perceptual organization and visual recognition, Kluwer, Boston, MA, 1985. [61] Lowe, D.G., Fitting parameterized three-dimensional models to images, IEEE Trans. RAMI, vol. 13, no. 5, pp. 441-450, 1991. [62] Lowe, D.G., Detection of stable keypoints in scale-space, in preparation. [63] Marquardt, D.W., An algorithm for least squares estimation of nonlinear parameters, SIAM J. of applied math, vol. 11, no. 2, pp. 431-441, 1963. [64] Marr, D., Vision: a computational investigation into the human representation and processing of visual information, Freeman, San Francisco, Calif., 1982. [65] McReynolds, D.P., Solving for relative orientation and depth, MSc thesis, Department of Computer Science, University of British Columbia, October 1988. [66] McReynolds, D.P., Determining the motion of a remotely piloted vehicle from a sequence of images, IEEE Proceedings Of The National Aerospace And Electronics Conference (NAECON 89), vol. 3, 1097-1104,1989. [67] Merron, J., Brady, M., Isotropic gradient estimation,/£'££ Computer Society Conference On Computer Vision And Pattern Recognition, pp. 652-659, 1996. [68] Mitiche, A., Aggarwal, J.K., A computational analysis of time-varying images, in Handbook Of Pattern Recognition, Fu, K.S., and Young, T., (Eds.), Academic Press, Inc., 1986. 167 [69] More, J.J., The Levenberg-Marquardt algorithm: implementation and theory, Numerical Analysis, G.A. Watson, ed., Lecture notes in math 630, Springer-Verlag, Berlin, pp. 105-116, 1977. [70] Nishimura, E., Xu, G., Tsuji, S., Motion segmentation and correspondence using epipolar constraint, Asian Conference on Computer Vision (ACCV 93), 199-204,1993. [71] Ohta, Y , Knowledge-based Interpretation of Outdoor Natural Scenes, Pitman Publishing, Inc., 1985. [72] Ozanian, T , Approaches for stereo matching - A review, Modeling, Identification and Control, vol. 16, no. 2, 65-94, 1995. [73] Poelman, C.J., Kanade, T , A Paraperspective Factorization Method for Shape and Motion Recovery, CMU-CS-93-219, School of Computer Science, Carnegie Mellon University, 11 December 1993. [74] Poelman, C.J., Kanade, T , A Paraperspective Factorization Method for Shape and Motion Recovery, IEEE Trans. PAMI, Vol. 19, No. 3, pp. 206-218, March 1997. [75] Poynton, Charles A., Poynton's Colour FAQ, available as http://www.inforamp.net/poynton/Poynton color.html. [76] Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T., Numerical recipes in C, Cambridge University Press, Cambridge, 1988. [77] Pritt, M.D., Stmcture and motion from two orthographic views, J. Opt. Soc. Am. A, 13(5), pp. 916921, May 1996. [78] Rangarajan, K., Shah, M., Establishing motion correspondence, CVGIP: Image Understanding, Vol. 54, No. 1, pp. 56-73, July, 1991. [79] Rao, R.P.N., Ballard, D.H., Object indexing using an iconic sparse distributed memory, Proc. 5th International Conference on Computer Vision, Cambridge, MA, June 20-23, pp. 24-31, 1995. [80] Reid, I.D., Murray, D.W., Active tracking of foveated feature clusters using affine stmcture, International Journal of Computer Vision, 18,41-60, 1996. [81] ter Harr Romeny, B.M., Florack, L.M.J., Salden, A.H., and Viergever, M.A., Higher order differential stmcture of images, Image and Vision Computing, 12(6), July/August 1994. [82] Rousseeuw, P.J., Leroy, A.M., Robust regression and outlier detection, John Wiley & Sons, New York, 1987. 168 [83] Rutkowski, W.S., Rosenfeld, A., A comparison of comer-detection techniques for chain-coded curves, TR-623 (MCS-76-23763), Computer Science Center, University of Maryland, College Park, MD, January 1978. [84] Salari, V., Sethi, I.K., Feature point correspondence in the presence of occlusion, IEEE Trans. PAMI, Vol. 12, No. 1, pp. 87-91, 1990. [85] Schmid, C , Appariement a"images par invariants locaux de niveaux de gris, unpublished Doctoral dissertation, lTnstitut National Polytechnique de Grenoble, July 1996. [86] Schmid, C , Mohr, R., Matching by local invariants, Rapport de Recherche, N 2644, INRIA, August 1995. [87] Schmid, C , Mohr, R., Combining greyvalue invariants with local constraints for object recognition, IEEE Computer Society Conference On Computer Vision and Pattern Recognition, pp. 872-877, 1996. [88] Seitz, S.M., Dyer, C.R., Physically-Valid View Synthesis by Image Interpolation, IEEE Workshop on Representation of Visual Scenes, pp. 18-25, Cambridge, MA, June 23, 1995, [89] Shapiro, L.S., Brady, M., Rejecting outliers and estimating errors in an orthogonal-regression framework, Phil. Trans. R. Soc. Lond. A, 350, pp. 407-439,1995. [90] Shapiro, L.S., Affine analysis of image sequences, Cambridge University Press, 1995. [91] Shapiro, L.S., Zisserman, A., Brady, M., 3D motion recovery via affine epipolar geometry, International Journal of Computer Vision, 16, 147-182,1995. [92] Shashua, A., Correspondence and affine shape from two orthographic views: motion and recognition/!./. Memo No. 1327, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, December 1991. [93] Sinclair, D., Motion segmentation and local structure, Proc. 4th International Conference on Computer Vision, 366-373, 1993. [94] Manual of Photogrammetry, Slama, C.C., (Ed.), American Society of Photogrammetry, 4th Edition, 1980. [95] Sorenson, H.W., Parameter estimation: principles and problems, Control and systems theory; v. 9. New York: M. Dekker, 1980. [96] Sproull, R.F., Refinements to nearest-neighbour searching in k-dimensional treesAlgorithmica, 6, pp. 579-589,1991. 169 [97] Strat, T.M., Photogrammetry and knowledge representation in computer vision, Spatial information from digital photogrammetry and computer vision, International Society for Photogrammetry and Remote Sensing, pp. 784-792,1994. [98] Sudhir, G., Banerjee, S., Zisserman, A., Finding point correspondences in motion sequences preserving affine stmcture, Proceedings British Machine Vision Conference, 359-368, 1993. [99] Szeliski, R., Kang, S.B., Recovering 3D shape and motion from image streams using non-linear least squares, IEEE computer society conference on computer vision and pattern recognition, 752753,1993. Also published as CRL 93/3, Cambridge Research Lab, Digital Equipment Corp., March 1993. [100] Tang, L., Heipke, C , An approach for automatic relative orientation, Proc. Optical 3D Measurement Techniques II: applications in inspection, quality control, and robotics, A. Gruen and H. Kahmen, eds., 4-7 Oct 1993, Society of Photo-optical and Instrumentation Engineers, vol. 2252, pp. 347-354, 1994. [101] Thompson, W.B., Lechleider, P., Stuck, E.R., Detecting moving objects using the rigidity constraint, IEEE Trans. PAMI, 15(2), 162-166, Feb 1993. [102] Tomasi, C , Kanade, T , Shape and motion from image streams under orthography: a factorization method, International Journal of Computer Vision, 9(2), pp.137-154, November, 1992. [103] Tsai, R.Y., A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses, IEEE Journal of Robotics and Automation, RA-3, N4, pp. 323-344, August 1987. [104] Tsai, R.Y., Huang, T.S., Uniqueness and estimation of 3-D motion parameters ofrigidbodies with curved surfaces, IEEE Trans. PAMI, 6(1). 13-27, 1984. [105] Ullman, S., The interpretation of visual motion, MIT Press, Cambridge, Mass., 1979. [106] Ullman, S., Recent computational studies in the interpretation of stmcture from motion, In: Human and Machine Vision, Rosenfeld, A., and Beck, J., (Eds.), Academic Press, New York, NY, 1983. [107] Ullman, S., Basri, R., Recognition by linear combinations of models, IEEE Trans. PAMI, 13(10), 992-1006,1991. Originally published as technical report A.I. Memo No. 1152, MIT, Aug 1989. [108] Venkateswar, V., Chellappa, R., Hierarchical feature based matching for motion correspondence, IEEE Workshop on Visual Motion, pp. 280-285,1991. 170 [109] Y.F. Wang, N. Karandikar, J.K. Aggarwal, Analysis of video image sequences using point and line correspondences, Pattern Recognition, Vol. 24, No. 11, pp. 1065-1084,1991. [110] Wei, G.-Q., He, Z.~ Ma, S.D., Fusing the matching and motion estimation of rigid point patterns, Proceedings 1990 IEEE International Conference on Robotics and Automation, Cincinnati, OH, USA, 2017-22 vol.3, 13-18 May 1990. [Ill] Weng, J., Ahuja, N., Huang, T.S., Matching two perspective views, IEEE Trans. RAMI, Vol. 14, No. 8, pp. 806-825, August 1992. [112] Weng, J., Ahuja, N., Huang, T.S., Optimal motion and structure estimation,//?/?/? Trans. PAMI, vol.15, no.9,pp. 864-884, 1993. [113] Weng, J., Cohen, P., Herniou, M., Camera calibration with distortion models and accuracy evaluation, IEEE Trans. PAMI, vol.14, no.10, pp. 965-980, October 1992. [114] Weng, J., Huang, T.S., Ahuja, N., Motion and structure from two perspective views: Algorithms, error analysis, and error estimation,//?/:/? Trans. PAMI, vol.11, no.5, pp. 451-476, May 1989. [115] Weng, J., Huang, T.S., Ahuja, N., Motion and structure from image sequences, Springer series in information sciences 29, Springer-Verlag, 1993. [116] Whaite, P., Ferrie, F.P., Active exploration: Knowing when we're wrong, Proceedings, Fourth International Conference on Computer Vision, (Berlin, Germany, May 11-14,1993), IEEE Computer Society Press, Los Alamitos, CA, pp. 41-48, 1993. [117] Wheeler, M.D., Ikeuchi, K., Iterative estimation of rotation and translation using the quaternion, CMU-CS-95-215, School of Computer Science, Carnegie Mellon University, 10 December 1995. [118] Willson, R.G. and Shafer, S.A., What is the center of the image?,//?/?/? Computer Society Conference On Computer Vision and Pattern Recognition, pp. 670-671,1993. [119] Wolf, P.R., Elements of Photo grammetry, Second Edition, McGraw-Hill Inc., 1983. [120] Wong, K.W., Machine vision, robot vision, computer vision, and close-range photogrammetry, Photogrammetric Engineering & Remote Sensing, vol. 58, no. 8, pp. 1197-1198, 1992. [121] Zhang, Z., Deriche, R., Faugeras, O., Quang-Tuan L., A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry, Artificial Intelligence, vol.78, no.1-2, pp. 87-119, 1995. [122] Zhang, Z., On the epipolar geometry between two images with lens distortion, Proc. Int'l Conference on Pattern Recognition, Vol.1, pp. 407-411, August 1996. 171 [123] Zhang Z., Parameter estimation techniques: a tutorial with application to conic fitting, Image and Vision Computing, 15, pp. 59-76, 1997. 172
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Rigidity checking for matching 3D point correspondence...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Rigidity checking for matching 3D point correspondence under perspective projection McReynolds, Daniel Peter Roland 1997
pdf
Page Metadata
Item Metadata
Title | Rigidity checking for matching 3D point correspondence under perspective projection |
Creator |
McReynolds, Daniel Peter Roland |
Date Issued | 1997 |
Description | A solution is proposed for the problem of determining the correspondence between sets of point features extracted from a pair of images taken of a static scene from disparate viewpoints. The relative position and orientation between the viewpoints as well as the structure of the scene is assumed to be unknown. Point features from a pair of views are deemed to be in correspondence if they are projectively determined by the same scene points. The determination of correspondences is a critical sub-task for recovering the structure of the world from a set of images taken by a moving camera, a task usually referred to as structure-from-motion, or for determining the relative motion between the scene and the observer. A key property of a static world, assumed by the proposed method, is rigidity. Rigidity of the world and knowledge of the intrinsic camera parameters determines a powerful constraint on point correspondences. The main contribution of this thesis is the rigidity checking method. Rigidity checking is a tractable and robust algorithm for verifying the potential rigidity of a set of hypothesized three-dimensional correspondences from a pair of images under perspective projection. The rigidity checking method, which is based on a set of structure-from-motion constraints, is uniquely designed to answer the question, "Could these corresponding points from two views be the projection of a rigid configuration?" The rigidity constraint proposed in this thesis embodies the recovery of the extrinsic (relative orientation) camera parameters which determine the epipolar geometry - the only available geometric constraint for matching images. The implemented solution combines radiometric and geometric constraints to determine the correct set of correspondences. The radiometric constraint consists of a set of grey-level differential invariants due to Schmid and Mohr. Several enhancements are made to the grey-level differential invariant matching scheme which improves the robustness and speed of the method. The specification of differential invariants for grey-scale images is extended to color images, and experimental results for matching point features with color differential invariants are reported. |
Extent | 24752550 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2009-04-17 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051261 |
URI | http://hdl.handle.net/2429/7366 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1997-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_1997-251144.pdf [ 23.61MB ]
- Metadata
- JSON: 831-1.0051261.json
- JSON-LD: 831-1.0051261-ld.json
- RDF/XML (Pretty): 831-1.0051261-rdf.xml
- RDF/JSON: 831-1.0051261-rdf.json
- Turtle: 831-1.0051261-turtle.txt
- N-Triples: 831-1.0051261-rdf-ntriples.txt
- Original Record: 831-1.0051261-source.json
- Full Text
- 831-1.0051261-fulltext.txt
- Citation
- 831-1.0051261.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051261/manifest