SOLVING FOR RELATIVE ORIENTATION AND DEPTH

By Daniel Peter McReynolds
B.Eng., Carleton University, 1981

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES, Department of Computer Science

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
October 1988
© Daniel Peter McReynolds, 1988

Abstract

A translating and rotating camera acquires images of a static scene from disparate viewpoints. Given a minimum number of corresponding image tokens from two or more images, the rotation and translation (referred to as the relative orientation) of the camera, and the relative depths of the tokens, can be computed between the views. Image tokens, which can be points or lines, are deemed corresponding if they arise from the same physical entity. The process of determining corresponding tokens is assumed to be known. This thesis poses the problem of solving for relative orientation and depth. The solution to this problem has applications in building object or scene models from multiple views and in passive navigation. A minimum of five corresponding pairs of points, measured from two perspective projection images, is required.
In the case of lines, the measurements of six corresponding sets of lines, from a minimum of three images, are necessary for a unique solution. The algorithm simultaneously determines the rotation and translation of the camera between the images and the three dimensional position of the matched scene tokens. The translation and depth can only be determined up to a global scale factor. It is assumed that the motion of the camera or object is rigid. The equations that describe the motion and scene structure are nonlinear in the camera model parameters. Newton's method is used to perform the nonlinear parameter estimation. The camera model is based on the collinearity condition, well known in the area of photogrammetry. This condition states that the focal point, image point and object point must all lie on the same line under perspective projection. This approach differs from other similar approaches in the choice of model parameters and in the formulation of the collinearity condition. The formulation for line correspondences turns out to be an easy extension of the point formulation. To improve the range of convergence and the robustness of the algorithm, the Levenberg-Marquardt extension for discrete least squares is applied to the over-determined system. Results from a Monte Carlo simulation with noise-free images indicate a range of convergence for rotation of approximately plus or minus sixty degrees. Simulations with noisy images (image measurements perturbed by plus or minus three pixels) yield a range of convergence for rotation of approximately plus or minus forty degrees. The range of convergence in translation is on the order of twice the length of the translation vector in any direction parallel to the image plane. For translation orthogonal to the image plane, the range of convergence is approximately eighty percent of the length of the translation vector.
Simulations with the same noisy images generated for the rotation tests indicate that the range of convergence for translation is not appreciably affected by these noise levels. Tests with real images, in which a 3D model of an object is derived from multiple views with human assistance in determining line correspondences, yield results that agree reasonably well with the noisy image simulations.

CONTENTS

Abstract
CONTENTS
LIST OF FIGURES
ACKNOWLEDGEMENTS
1 Introduction
1.1 Problem statement
1.2 Nomenclature
2 The issues
2.1 Establishing a context for the problem
2.2 Exploring the issues
2.2.1 Type of motion assumed
2.2.1.1 Rigid translation
2.2.1.2 Rigid Rotation
2.2.1.3 Rigid general motion
2.2.1.4 Nonrigid Motion
2.2.2 Rotation Representation
2.2.3 Token Type and correspondence
2.2.4 Iterative and noniterative solution methods
2.2.4.1 Iterative solution - initial parameter estimates
2.2.5 Recovery of scene structure
2.2.6 Assumptions regarding scene or motion properties
2.2.7 Necessary and sufficient conditions for uniqueness
2.2.8 Parameter estimation error and robustness
2.2.9 Multiple objects
2.2.10 Perspective and orthographic projection
2.3 Conclusion
2.3.1 Conservation of distance or angular configuration
2.3.2 Coplanarity
2.3.3 Collinearity
2.3.4 The best error term?
3 Point correspondences
3.1 Mathematical derivation
3.1.1 Description of the imaging model
3.1.2 What is sought
3.1.3 Assumptions
3.1.4 Closed form solutions
3.1.5 Iterative solutions
3.1.5.1 Newton's method
3.1.5.1.1 Convergence criteria and rate of convergence
3.1.5.1.2 Application of Newton's method
3.1.6 Least squares
3.1.6.1 Damped least squares
3.1.6.1.1 Levenberg-Marquardt
3.2 Solving for relative orientation and depth
3.3 The algorithm
4 Line correspondences
4.1 Computational model
4.1.1 Error term derivation
4.2 The algorithm
5 Experimental results and discussion
5.1 Algorithm implementation
5.2 Analysis of real images
5.2.1 Image acquisition
5.2.2 Monte Carlo method for determining range of convergence of real images
5.3 Monte Carlo simulation
5.3.1 Data set definition
5.3.2 Exploring the parameter space
5.3.2.1 Depth
5.3.2.2 Rotation
5.3.2.3 Translation
5.3.3 Image noise and algorithm robustness
5.3.3.1 Noise mensuration
6 Conclusion
6.1 Directions for further research
Bibliography
Appendix A
Appendix B

LIST OF FIGURES

Fig. 2-1. The processing of visual information
Fig. 2-2. Coplanarity condition
Fig. 2-3. Image geometry for reconstruction and reprojection method [ToFa87]
Fig. 2-4. Coplanarity constraint and error term from Gennery
Fig. 3-1. Imaging model and camera coordinate systems
Fig. 3-2. Perspective Projection
Fig. 3-3. The partial derivatives of x', y', and z' with respect to rotation
Fig. 4-1. Error term for line correspondences
Fig. 5-1. Behavior of residuals with respect to the number of iterations
Fig. 5-2. Range of convergence with respect to x for 6 lines, noise free synthetic image
Fig. 5-3. Range of convergence with respect to x for 12 lines, noise free synthetic image
Fig. 5-4. Range of convergence with respect to Tx and Ty for 6 lines, noise free synthetic image
Fig. 5-5. Range of convergence with respect to Tz for 6 lines, noise free synthetic image
Fig. 5-6. Range of convergence with respect to Tx and Ty for 12 lines, noise free synthetic image
Fig. 5-7. Range of convergence with respect to Tz for 12 lines, noise free synthetic image
Fig. 5-8. Images of office staple remover from four viewpoints
Fig. 5-9. Sample screen from program to designate corresponding line segments
Fig. 5-10. Range of convergence with respect to Tx and Ty, real image sequence
Fig. 5-11. Range of convergence with respect to Tz, real image sequence
Fig. 5-12. Range of convergence with respect to x, real image sequence
Fig. 5-13. Range of convergence with respect to x, for 12 lines and no damping
Fig. 5-14. Range of convergence with respect to Tx and Ty, noisy synthetic image
Fig. 5-15. Range of convergence with respect to Tz, noisy synthetic image
Fig. 5-16. Range of convergence with respect to x, noisy synthetic image

ACKNOWLEDGEMENTS

I wish to thank David Lowe for his guidance and support in helping me to accomplish my goals. He helped to make this work a truly enjoyable endeavor. Thanks also to Allan Mackworth for his helpful comments and corrections on the final draft of this document. To friends and acquaintances in the department that I've come to know during my time here, I wish you all the best of luck and success. To Jim and Celine and the girls, thanks for helping to make Vancouver a fun place to be. To my parents and family, thank you for your love and support (now really, it didn't take that long, did it?!!). To the memory of Peter Hall, and in memory of Elzéar Ménard. And with much love and happiness I'd like to thank my fiancée Maureen. Thanks to Prof. David Thomas, School of Computer Science, Carleton University, for arranging access to the School's facilities during May of 1988 and for his help while I was a special student there in 85-86. Financial support in the form of a summer RA from David Lowe was very much appreciated. Financial support in the form of a research contract from Fred Potts, Computing Devices Company, Ottawa, is gratefully acknowledged. When the final crunch hit, Moyra Ditchfield came through with a gratefully acknowledged equipment loan, thanks! Thanks also to Marc Majka for help with his graphics package.

CHAPTER 1

1 Introduction

The purpose of computational vision is to produce useful symbolic descriptions of the world from images.
This thesis deals with the specific visual task of computing a description of the motion of imaged objects and their structure in three dimensional space. The process that accomplishes this task takes, as its input, a sequence of two dimensional intensity images and produces a mathematical description of the rotation and translation of the objects and the three dimensional coordinates of features on the objects. This work was originally motivated by the problem of navigation for autonomous robots. An autonomous robot requires accurate visual maps of its environment to successfully maneuver within it. Structure from motion is one method for building such maps. Algorithms for computing structure from motion still lack robustness and have limited ability to deal with multiple and nonrigid objects [Huang87]. The problem of "structure from motion" has been investigated extensively in the computational vision community. Numerous solutions have been derived from a variety of premises. The solution methods are classified as either optic flow methods, image intensity gradient methods, or token based matching methods. This thesis describes a token based method. Tokens can be either points or lines. The problem of solving for relative orientation and depth is posed and solved. The algorithm developed here is robust and efficient, due to the simplicity of its derivation from the essential physical model parameters. These parameters describe the rotation and translation of the camera centered coordinate frame and the depths of the imaged scene features. Many other solutions solve for more parameters than exist in the physical model. This contributes to poor efficiency and to a lack of robustness. The use of damped least squares techniques contributes significantly to the robustness of the algorithm developed here. The equations that describe the perspective projection process and the coordinate transformations between successive views are nonlinear in the model parameters.
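To make the nonlinearity concrete, the following minimal sketch projects a scene point through a rigid transformation under central projection. The function name, the unit focal length, and the specific numbers are illustrative assumptions, not the thesis's implementation; note how the division by depth makes the image coordinates nonlinear in the motion and depth parameters.

```python
import numpy as np

def project(p_world, R, t, f=1.0):
    """Central projection of a scene point after a rigid transformation:
    p_cam = R p + t, then divide by the depth z for image coordinates."""
    p_cam = R @ p_world + t
    x, y, z = p_cam
    return np.array([f * x / z, f * y / z])

# Identity rotation and a translation that puts the point in front of
# the camera at depth 5; the projection is (x/z, y/z) = (0.2, 0.4).
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])
u, v = project(np.array([1.0, 2.0, 0.0]), R, t)
```

Even in this small example, the unknowns R, t, and the depth z enter through products and a quotient, which is why a linearizing, iterative solution is needed.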
No general method exists for solving systems of nonlinear equations in closed form; closed form solutions exist only in a small set of special cases. The solution for nonlinear equations can instead be determined by iterative methods. These methods linearize the equations and iteratively determine corrections to the initial parameter estimates. Iteration continues until the error term, describing the difference between the dependent variable and its parameter based estimate, is sufficiently small. The method developed here linearizes the equations by expressing them in a truncated Taylor series form and is, therefore, equivalent to Newton's method. Newton's method requires the calculation of the partial derivatives of the model equations with respect to the model parameters. These partial derivatives are easy to determine in the case of translation and depth, but this is not the case for rotation. Rotation is difficult, since there are many formulations for expressing the transformation in terms of its three underlying parameters. Newton's method, however, does not require that the partial derivatives be calculated with respect to the model parameters. It is only necessary to be able to correct a model parameter as a function of some variable for which the partial derivatives can be calculated. Here the formulation parameterizes the rotation in terms of three, mutually orthogonal, correction rotations. These incremental rotations occur about each of the coordinate axes. The partial derivatives of the model equations are quite simple with respect to these parameters. The minimum number of corresponding image features necessary for a unique solution is determined. Additional correspondences lead to an overdetermined set of equations. These equations are solved using the Levenberg-Marquardt extension to least squares minimization. The solution for point correspondences is extended for line correspondences.
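The iterative scheme just described can be sketched generically. The following toy update is a damped Gauss-Newton (Levenberg-Marquardt) step for a residual vector r and Jacobian J; the function names, the toy problem, and the damping value are illustrative assumptions, not the thesis's algorithm.

```python
import numpy as np

def lm_step(residual, jacobian, params, lam):
    """One damped least squares update: solve
    (J^T J + lam I) delta = -J^T r  for the correction delta."""
    r = residual(params)
    J = jacobian(params)
    JtJ = J.T @ J
    delta = np.linalg.solve(JtJ + lam * np.eye(JtJ.shape[0]), -J.T @ r)
    return params + delta

# Toy one-parameter problem: minimize (x - 3)^2 starting from x = 0.
residual = lambda p: np.array([p[0] - 3.0])
jacobian = lambda p: np.array([[1.0]])
p = np.array([0.0])
for _ in range(20):
    p = lm_step(residual, jacobian, p, lam=0.1)
```

With lam = 0 this reduces to an undamped Gauss-Newton step; increasing lam shortens the step, trading convergence speed for stability, which is the role damping plays in widening the range of convergence.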
Line correspondences are easier to determine than point correspondences, since low-level vision routines can compute the transverse location of features more reliably than their terminations. By writing the equation of an image line in terms of its slope and distance from the origin, an error term can be written which is a function of the distance between an observed line and the line predicted by the current parameter estimates. A line is defined by its endpoints. The use of line correspondences facilitates the solution for relative orientation and depth, since occlusion becomes less of a problem when the images are taken from very different viewpoints. With line correspondences, it is only necessary to find corresponding line segments which arise from the same physical entity. The results from a Monte Carlo simulation are presented for noise-free and noisy images. The simulation is based on the object model and motion results from an analysis of a sequence of real images. This permits a comparison of the results from the real image analysis and the Monte Carlo simulation. The real image sequence is of a common office object, a staple remover. With human assistance, a set of corresponding lines is selected from a sequence of images. A 3D model of the object is then computed along with the motion parameters of the camera.

The thesis is organized in the following manner. A statement of the problem solved in this thesis is given in the remainder of this chapter. Chapter 2 discusses the issues relevant to the solution. In Chapter 3, the solution for point correspondences is derived. Chapter 4 contains the solution derived for line correspondences. Chapter 5 describes the results of a Monte Carlo simulation. Also described are the results from an analysis of a sequence of real images. The chapter begins with a discussion of the algorithm's implementation and implementation specific extensions. Chapter 6 contains the conclusion and a discussion of further research.
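The line-based error term sketched earlier in this chapter can be illustrated as follows: represent the observed image line in normal form and measure the perpendicular distance of a predicted endpoint from it. The representation and names here are illustrative assumptions; the thesis's own derivation appears in Chapter 4.

```python
import numpy as np

def line_from_endpoints(p1, p2):
    """Normal form of the infinite image line through two endpoints:
    points q on the line satisfy n . q = d, with n a unit normal."""
    direction = p2 - p1
    n = np.array([-direction[1], direction[0]])
    n = n / np.linalg.norm(n)
    return n, float(n @ p1)

def endpoint_error(n, d, q):
    """Signed perpendicular distance from a predicted endpoint q to the
    observed line; this is the quantity the least squares step reduces."""
    return float(n @ q - d)

# Observed horizontal line y = 1, predicted endpoint 0.25 above it.
n, d = line_from_endpoints(np.array([0.0, 1.0]), np.array([1.0, 1.0]))
e = endpoint_error(n, d, np.array([0.5, 1.25]))
```

Because only the infinite line is compared, the predicted and observed segments need not share endpoints, which is why occlusion is less of a problem.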
1.1 Problem statement

Given a set of known correspondences between points or lines in two or more two-dimensional images, determine the unknown rotation and translation that describe the coordinate transformation between the views. Also determine the depths of the points in the case of point correspondences. In the case of line correspondences, determine the depths of the line endpoints.

1.2 Nomenclature

Several notational conventions have been adopted. Capitalized bold face characters signify matrices and lower case bold face characters identify vectors, e.g. R, z. Plain characters refer to scalars, e.g. u, v. A vector is assumed to be a column vector. The transpose of a vector is denoted by a superscript T, e.g. p^T, (u,v)^T. For the remainder of the thesis, the terms motion and relative orientation are used interchangeably. Unless noted otherwise, motion will not refer to instantaneous translational and angular velocities, but will refer to a discrete displacement in translation and orientation, independent of the temporal dimension.

CHAPTER 2

2 The issues

In this chapter, it is argued that solutions for relative orientation and depth often incorporate assumptions which oversimplify the problem. Also, in many cases, the solution formulation includes more parameters than exist in the basic physical model. The solution developed in this thesis only solves for the fundamental underlying physical parameters. No assumptions are made regarding the scene structure, and the only assumption regarding motion is that it should be rigid. Those algorithms which report good results incorporate methods for dealing with over-determined systems of equations. Least squares, singular value decomposition and eigenvalue analysis are the most common techniques for handling overconstrained systems of equations. Over-constraining the solution is useful for ameliorating the effects of noisy measurements.
The method developed here utilizes a least squares technique to solve the over-determined system of equations.

2.1 Establishing a context for the problem

Determining the relative depth of imaged objects can be accomplished by a variety of means, as depicted in fig. 2-1. This figure serves to illustrate the processes and the information paths that transform the input information into a symbolic description. The computation of relative orientation would appear to be accomplished by a relatively small number of processes, whereas depth can be computed from several processes. Although these processes are interdependent, the explicit determination of the unknown relative orientation and depth parameters can be computed in a sequential fashion. Relative depth can be computed from shape. There is, for example, shape from shading, texture or motion. Pentland has also developed a method for recovering depth from information provided by focal gradients [Pent85]. Relative orientation can be computed without solving explicitly for depth. This is done in photogrammetry, for example [Tho59][Wolf83][Horn87]. Conversely, relative depth can be computed without explicitly solving for relative orientation by using the constancy of distance condition with respect to rigid body motion [MiSeAg88]. The results of both computations can be correlated, and discrepancies accounted for by relaxing the assumptions, most notably rigid motion. Hildreth and Grzywacz investigate Ullman's incremental rigidity scheme to accomplish this [GrHi87]. The scheme maximizes the rigidity of the model when updated with new information, by minimizing the changes to the 3D position of the object points. It should be pointed out that the condition of rigid motion greatly reduces the complexity of determining relative orientation and depth. One of the first to demonstrate the power of this condition was Ullman [Ull79]. Before proceeding, there are a few additional comments concerning fig. 2-1 to be made.

1.
To build up a 3D model of the imaged objects, it will be necessary to establish an object-centered coordinate frame. This facilitates the mapping of new depth information back into the object's frame of reference. This object-centered coordinate frame can be arbitrarily set to the camera coordinate frame of the first image taken from a sequence of images. Alternatively, if additional information is available, the standard frame of reference can be some world coordinate system external to any camera centered frame.

2. The process that computes image tokens can be recognized as computing Marr's primal sketch [Marr82]. The processes that compute corresponding tokens, depth from stereopsis, relative depth and relative orientation are involved in computing Marr's 2 1/2 dimensional sketch, which also has some relation to Barrow and Tenenbaum's depth and orientation intrinsic images [BaTe78]. The process that computes the scene model is related to the set of processes defined by Marr for computing 3D model representations.

3. Optical flow has been defined as the instantaneous positional velocity field [Marr82], or as the apparent motion of brightness patterns [Horn86]. This apparent motion or velocity field is induced on the retina or camera image plane by the motion of the observer or camera relative to the imaged objects. The structure of imaged objects and the relative motion of the observer can be computed from the flow fields. Horn draws attention to the difference between the optical flow and the motion field [Horn86]. The motion field is computed from the derivative of the perspective projection equation. Optical flow is the apparent motion of the brightness pattern. Optical flow is not uniquely computable from local image information. This notion is captured by the aperture problem. Computing optical flow requires estimating the derivatives of the image intensity function with respect to the image space and time. This is an ill-posed problem.
Regularization has been applied to this problem in the context of edge detection to enable a unique solution [ToPo86]. In the context of fig. 2-1, computing the optic flow would be part of the correspondence process, since fully specifying the optic flow enables one of the correspondence problems to be solved for images separated by a very short time or spatial interval [Marr82]. With this said, no more attention will be paid to optical flow, except for the occasional aside. (Interested readers are directed to [Hil83][Horn86].)

4. This thesis assumes the correspondence problem has been solved. The correspondence problem is defined to be the problem of identifying features in two images which are projections of the same entity in the 3D world. Ullman argues that the matching process, for motion perception in the human vision system, operates on sets of correspondence tokens. Correspondence tokens, for example, can be points, lines or edge fragments. These types of correspondence tokens, however, contribute to the difficulty of the problem. Between two images there may be many candidate matches between such primitive image features [Ull79]. Horn summarizes three basic methods for solving the correspondence problem in the context of stereo vision [Horn86]. These methods are grey-level matching, correlation and edge-matching. Grey-level matching methods assume that imaged surfaces are smooth and have relatively little tilt with respect to the image plane. Given these assumptions, it may be reasonable to assume that neighboring points on the object's surface will map to neighboring points in a pair of stereo images. If this is true, then their grey-levels will be approximately the same in both images. Thus, a scheme can be formulated that matches grey levels between the stereo pair. The basic problem with this method is that the grey-level functions are generally not very similar, since the object is being imaged from two different viewpoints.
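The grey-level matching idea above can be sketched with a toy one-dimensional disparity search: compare a patch from one image against every position along the corresponding scanline of the other, keeping the position with the smallest sum of squared grey-level differences. The helper names and the numbers are illustrative assumptions, not a method from the thesis.

```python
import numpy as np

def ssd(patch_a, patch_b):
    """Sum of squared grey-level differences; small values mean the
    two patches have nearly the same brightness pattern."""
    d = patch_a - patch_b
    return float((d * d).sum())

def best_match(left_patch, right_row, width):
    """Slide a window along a scanline of the right image and return
    the column whose patch is most similar to the left patch."""
    scores = [ssd(left_patch, right_row[c:c + width])
              for c in range(len(right_row) - width + 1)]
    return int(np.argmin(scores))

left = np.array([3.0, 7.0, 5.0])
row = np.array([0.0, 1.0, 3.0, 7.0, 5.0, 2.0])
col = best_match(left, row, 3)
```

The failure mode described in the text is visible even here: if the viewpoint change alters the grey-levels in the right image, the minimum can move to a wrong column.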
Correlation methods assume that corresponding image patches have similar brightness patterns. If this holds, then a correlation process can be formulated that finds corresponding patches by finding the largest correlation value. This method suffers from a sensitivity to foreshortening effects due to surface tilt. Deciding on the size of the patch to correlate also poses problems. For example, a patch that is too small may lead to many false matches. Marr claims that there is direct evidence that grey-level correlation does not form the basis for the correspondence process in the human visual system [Marr82]. Edge matching methods work with low level symbolic entities computed from the intensity image rather than with the image intensity function directly. These symbolic entities are edges, which can be defined as curves in the image where rapid changes occur in the image brightness function or its spatial derivatives. Problems can arise from such things as false matches or occlusion. Occlusion causes an edge in one image to not have a corresponding edge in the other. False matches can be resolved by introducing additional information such as grey-level matching.

Fig. 2-1. The processing of visual information. The processes, symbolized as bubbles, accomplish the data transformations. The labelled lines represent the data flows; open-ended boxes represent data stores. Shape-from-x can be shape-from-shading, texture, contour, or focal gradient, for example.

It should be noted that a distinction between the correspondence problem for stereopsis and the motion correspondence problem has been made [Marr82]. The essential difference is that the stereo version of the problem is cast in the spatial domain, whereas the motion correspondence problem is in the temporal domain. The distinction between stereo and motion becomes even more pronounced when the motion problem involves nonrigid bodies.
The two problems are equivalent for rigid bodies, but for nonrigid bodies the entirely new problem of object constancy comes into play. Object constancy is the consistency of an object's identity through time [Marr82]. No more will be said concerning this; for an introduction, see [Marr82][Ull79][Horn86]. Note that the data flow diagram of fig. 2-1 says nothing about the temporal sequence of the processes. The algorithm developed in this thesis computes relative orientation and depth simultaneously. Other methods compute motion first then structure, and vice versa. The question then is, what is the best method for computing relative orientation and depth? (1) depth first, (2) relative orientation first, (3) simultaneously? Secondary questions are: Is the best method task dependent? What criteria will be applied to evaluate the methods? The standard criteria of correctness, robustness and efficiency will be used here. In general, task dependency is relevant. Some visual tasks only require an estimate of the motion, with little need for an accurate structure estimate (e.g. obstacle avoidance). Other tasks may only require a good estimate of the structure, with little or no concern with the presence of any motion (e.g. constructing models of objects). To simplify the discussion, the goal will be to compute an accurate estimate of the relative orientation and depth. Accuracy and robustness will have a greater importance than efficiency. The answer to this question partially hinges on an analysis of the error term that each method seeks to minimize. Methods that compute depth first seek to minimize the difference in distance between object points computed before and after motion. This error term is based on the "conservation of distance with respect to rigid body motion" condition. This condition is isomorphic to the "conservation of angular configuration with respect to rigid body motion" condition, which has also been used to solve for structure and motion.
This condition exploits the invariance of the 3D orientation of lines with respect to rigid body motion. It is the line dual of the conservation of distance condition. Methods that solve for relative orientation first seek to minimize an error that is a function of the divergence of the measurements from the coplanarity condition [Tho59][Horn87][Wolf83]. Fig. 2-2 depicts the coplanarity condition. From the geometry of central projection, it can be seen that the two vectors defined by the projection of the object point and the translation vector must all lie in the same plane. Methods that estimate relative orientation and depth simultaneously seek to minimize the error between image measurements and image measurements predicted from an estimate of the relative orientation and depths. This is known in the photogrammetry literature as the collinearity condition [Wolf83]. The collinearity condition states that the focal point, image point and scene point all lie on the same line. This line is the ray that includes the focal point and the scene point and whose intersection with the image plane is the central projection of the scene point (see fig. 2-2, where r1 and r2 are the central projection rays). So the question now becomes, what is the best error term to minimize:

1. conservation of distance or angular configuration,
2. coplanarity,
3. collinearity?

This question will be returned to after a number of algorithms that incorporate these conditions have been reviewed and other related issues have been discussed.

Fig. 2-2. Coplanarity condition. Vectors defined by the projection of an object point onto the image plane of two camera stations and the translation between the stations define a plane in space. In this case t, r1 and r2 are coplanar. In general, due to measurement noise and a poor estimate of relative orientation, r1 and r2 do not intersect at the scene point. The collinearity condition is also depicted here.
The line from the origin of the camera coordinate system (also coincident with the focal point in this figure) to the scene point intersects the image plane at a point. The focal point, image point and scene point all lie on the same line.

2.2 Exploring the issues

To help arrive at an answer to the question of which error term is best, the following issues will be explored within each error term category:

1. Type of motion assumed
2. Representation of rotation
3. Type of token for correspondence
4. Iterative versus noniterative solution methods, and initial estimates for iterative methods
5. Recovery of scene structure
6. Assumptions regarding scene or motion properties
7. Necessary and sufficient conditions for solution uniqueness
8. Parameter estimation error and robustness
9. Multiple objects
10. Central and orthographic projection

2.2.1 Type of motion assumed

In almost all cases, rigid general motion is assumed. The rigidity condition allows the problem to be tractable for general motion, as first demonstrated by Ullman under orthographic projection [Ull79]. Attention is drawn here to a variety of assumptions concerning the type of motion to be analyzed. These assumptions often lead to a great simplification in the formulation, and often closed form solutions can be written; however, these assumptions represent a large oversimplification of the problem and are generally not useful for real world situations.

2.2.1.1 Rigid translation

Collinearity

More constrained types of rigid motion, such as pure translation, are assumed in certain cases. For example, Lawton [Law83] assumes pure translation and solves for the direction by an exhaustive search of the translation space that minimizes the error in image feature matching. The minimization is performed through global search at a coarse resolution. Steepest descent is then used in that region thought to contain the global minimum.
He reports a smooth error function with a single minimum in a large neighborhood around the correct translation axis. By assuming pure translation, his error term is not strictly the collinearity error term first described in the photogrammetry literature. However, it is classified as such, since the error minimization is with respect to a difference in the image space between the model prediction and the observation. The computation of relative motion from line tokens under central projection is simplified when assuming pure translation, as shown by [YeHu83]. They are able to solve, in closed form, for the relative depths of single pairs of corresponding lines and the amount of translation relative to the depth of one of the corresponding lines. In general, very few methods are developed which can only deal with pure translation, since they will have limited application. Also, the collinearity condition with the implicit assumption of rigid motion is sufficient to determine the constraints that make the problem tractable.

Coplanarity and distance conservation

Methods based on these conditions generally subsume pure translation.

2.2.1.2 Rigid Rotation

Collinearity

Kanatani explores the invariant properties of perspective images under central projection [Kan88]. By restricting the motion to pure rotation about an arbitrary axis through the focal point of the camera, a closed form solution for depth is derivable. Knowledge of the true lengths of object lines or their relative 3D orientations is sufficient to constrain one degree of freedom: the depth of one of the points that defines the 3D line. The introduction of a simplifying hypothesis constrains the other degree of freedom: the depth of the second point. Although interesting in and of itself, there does not seem to be any potential for applying the results to general motion conditions. This method may be useful as a test to detect pure rotation, but even this seems prohibitive, as a priori knowledge of the scene structure is required. Yen and Huang apply their methods under the assumption of pure rotation [YeHu83]. As in the case for pure translation, they demonstrate the advantages of the Gaussian sphere projection under highly constrained types of motion.

Coplanarity and distance conservation

The methods based on these conditions are not applied, in general, to restricted types of motion.

2.2.1.3 Rigid general motion

Collinearity, coplanarity, and distance conservation

Almost all analyses and algorithms assume rigid general motion. Any of the three conditions is sufficient to formulate constraints to solve uniquely for relative orientation and
This method may be useful as a test to detect pure rotation, but even this seems prohibitive, as a priori knowledge of the scene structure is required. Yen and Huang apply their methods under the assumption of pure rotation [YeHu83]. As in the case of pure translation, they demonstrate the advantages of the Gaussian sphere projection under highly constrained types of motion.

Coplanarity and distance conservation

The methods based on these conditions are not, in general, applied to restricted types of motion.

2.2.1.3 Rigid general motion

Collinearity, coplanarity, and distance conservation

Almost all analyses and algorithms assume rigid general motion. Any of the three conditions is sufficient to formulate constraints to solve uniquely for relative orientation and depth (up to a scale factor). Note that a static environment and a moving observer attached to a rigid coordinate frame yield the same motion as a static camera and associated coordinate frame observing a rigid body in motion. This relativity is evidenced by the interchangeability of the terms rigid camera motion with a static environment and rigid body motion with a static coordinate frame.

2.2.1.4 Nonrigid motion

Collinearity, coplanarity, and distance conservation

The amount of work in this area is increasing. Huang has described it as an area deserving more attention [Huang87]. Hildreth and Grzywacz extend Ullman's research into his incremental rigidity scheme [GrHi87][U1184]. Ullman's incremental rigidity scheme seeks to maximize the rigidity of an object's 3D model, as new information about its structure is added, by minimizing the magnitude of the changes to its structure. He uses the conservation of distance condition to compute the depth values of imaged points under orthographic projection. Hildreth and Grzywacz conclude from their investigations that velocity based formulations can be used to rapidly develop a rough 3D model of the scene, but that these models are not robust over time.
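The conservation of distance condition that the incremental rigidity scheme relies on can be stated directly: a rigid motion leaves every pairwise 3D distance unchanged, so any change in those distances measures the departure from rigidity. A small illustrative check, using hypothetical points rather than data from any cited experiment:

```python
import math

def pairwise_distances(points):
    """All pairwise Euclidean distances among a set of 3D points."""
    n = len(points)
    return [math.dist(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)]

def rigidity_residual(before, after):
    """Conservation of distance: a rigid motion leaves every pairwise 3D
    distance unchanged, so this residual is zero for rigid motion and
    grows with deformation or measurement noise."""
    return max(abs(a - b) for a, b in
               zip(pairwise_distances(before), pairwise_distances(after)))

pts = [(0, 0, 5), (1, 0, 5), (0, 1, 6)]
# Pure translation by (2, -1, 3) is rigid: the residual is zero.
moved = [(x + 2, y - 1, z + 3) for x, y, z in pts]
print(rigidity_residual(pts, moved))  # 0.0
```

Noise perturbing `after` produces a nonzero residual, which is exactly why noisy image measurements mimic a deforming ("jello-like") body in these formulations.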
Hildreth and Grzywacz further conclude that, for a stable solution, the use of disparate views that differ significantly is necessary. The difficulty in computing projected velocities is also cited as a reason for not recovering structure from motion solely from velocity information. Terzopoulos, Witkin and Kass develop a physically based model to construct 3D models of deformable objects from image intensity data [TeWiKa88]. They are able to compute the structure and motion of deformable objects from a temporal sequence of binocular image pairs. An interesting aside to this issue is the effect of noise on structure from motion computations. As pointed out by [GrHi87] and [RoAg80], noise in the image measurements has the same effect as deforming bodies ("jello-like" as described by Roach and Aggarwal). The algorithms which can detect a lack of rigidity through inconsistencies in the model equations are probably not going to perform well with noisy images. Solution methods that compute the variances in the parameter estimates, such as least squares or extended Kalman filtering, will presumably be able to handle noisy data and give some measure of the departure from rigidity.

2.2.2 Rotation Representation

Rotation can be represented in a variety of ways. The choice of representation is usually dictated by the formulation of the problem. For example, in one formulation, a small angle approximation to an orthonormal rotation matrix is used, which leads to a uniqueness proof for the solution. The proof hinges on the small angle approximation [FaHu84]. In many cases the rotation is parameterized by the nine elements of a 3x3 orthonormal matrix. These parameters are solved for explicitly, which has the disadvantage of over-parameterizing the solution. Rotation has only three underlying parameters, and additional parameters lead to inefficiency, sensitivity to noise, and a loss of numerical precision.

Collinearity

Rotations are generally represented as 3x3 orthonormal matrices.
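The three underlying parameters of rotation (a unit axis and an angle) map to an orthonormal matrix through Rodrigues' formula. The following is a generic sketch of that mapping, not the parameterization of any particular cited method:

```python
import math

def axis_angle_to_matrix(axis, theta):
    """Rodrigues' formula: rotation by angle theta about a unit axis n
    gives R = I + sin(theta) K + (1 - cos(theta)) K^2, where K is the
    skew-symmetric cross-product matrix of n.  Three parameters capture
    the three degrees of freedom of rotation."""
    nx, ny, nz = axis
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    nx, ny, nz = nx / norm, ny / norm, nz / norm    # ensure a unit axis
    c, s, v = math.cos(theta), math.sin(theta), 1.0 - math.cos(theta)
    return [
        [c + nx * nx * v,      nx * ny * v - nz * s, nx * nz * v + ny * s],
        [ny * nx * v + nz * s, c + ny * ny * v,      ny * nz * v - nx * s],
        [nz * nx * v - ny * s, nz * ny * v + nx * s, c + nz * nz * v],
    ]

# Rotation by 90 degrees about the z axis maps the x axis to the y axis.
R = axis_angle_to_matrix((0, 0, 1), math.pi / 2)
x_rotated = [row[0] for row in R]  # R applied to (1, 0, 0)
print([round(c, 6) for c in x_rotated])  # [0.0, 1.0, 0.0]
```

The resulting matrix is orthonormal by construction, which is the appeal of the axis/angle form over solving for nine matrix elements directly.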
When exploiting the collinearity condition, [FaHu84][RoAg80] and [Wolf83] all linearize the nonlinear projection equations. This requires computing the partial derivatives of their model equations with respect to the rotation parameters. There are three fundamental parameters which capture the three degrees of freedom that rotation in R³ possesses. The three degrees of freedom can be expressed as three rotations, each about one of the standard basis vectors (these are referred to as the Euler angles), or as a unit length axis (two degrees of freedom) and the amount of rotation about that axis, θ. The direction cosines of the rotation axis, though specifiable with only two parameters, are written with three parameters if the norm of the vector represents the amount of rotation about the axis. Fang and Huang parameterize rotation as three direction cosines defining a rotation axis containing the origin and an angle specifying the amount of rotation about the axis [FaHu84]. An approximation to an orthonormal matrix, using the direction cosines and rotation angle, θ, is used in the derivation of the nonlinear motion equations. Some algorithms that solve for relative orientation based on collinearity in photogrammetry parameterize rotations using the Euler angles [Wolf83]. Roach and Aggarwal use Euler angles as well [RoAg80]. Small angle approximations to orthonormal matrices have been used to simplify the analyses [FaHu84][YaMe86].

Coplanarity and conservation of distance

As well as the previously mentioned axis/angle and Euler angle parameterizations, the use of quaternions has become popular. Horn represents rotations either with quaternions or orthonormal matrices [Horn87]. The use of quaternions simplifies the process of ensuring unit normality, while the use of orthonormal matrices slightly complicates the process of maintaining orthonormality. He suggests using quaternions to represent incremental rotation adjustments, which are then mapped into an orthonormal matrix.
When this matrix is postmultiplied by the current rotation estimate, represented as an orthonormal matrix, the result is orthonormal modulo machine precision. Faugeras, Lustman, and Toscani represent rotations as quaternions [FaLuTo87]. The quaternion representation simplifies the expression of the function to be minimized, i.e., min_R ||E − TR||, where R is represented with quaternions. The solution to this minimization problem is the eigenvector associated with the least eigenvalue of the quadratic expression derived from the minimization expression. When solving for relative orientation and depth from line tokens, [FaLuTo87] parameterize rotations by a vector, the norm of which is the amount of rotation. The rotation matrix is represented by an exponential function whose exponent is a function of the rotation vector. This parameterization permits the application of extended Kalman filtering to determining the rotation. Weng and Huang also solve for rotation using unit quaternions [WeHuAh87]. This representation facilitates the use of a minimization expression involving eigenvalue analysis. [FaTo86] represent rotations as 3x3 orthonormal matrices when solving for translation. They describe the geometric projection model with R parameterized in this form. However, when actually solving for rotation, they switch to the quaternion representation. This allows the minimization to be rewritten as an eigenvalue problem. The use of direction cosines allows a small angle approximation to be written for the orthonormal matrix. By parameterizing rotation as a vector, whose direction is the rotation axis and whose norm is the amount of rotation about the axis, a 3x3 rotation matrix, which is approximately orthonormal, can be constructed [TsHu81]. This small angle representation is also adopted by [FaHu84][FaHu84b][YaMe86]. In all other cases, the nine parameters of an orthonormal matrix are the rotation parameters.
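Horn's point about unit normality can be made concrete: renormalizing a quaternion is a single division, after which the derived matrix is orthonormal to machine precision. A sketch of the generic quaternion-to-matrix mapping (not Horn's own code):

```python
import math

def quat_to_matrix(q):
    """Map a quaternion (w, x, y, z) to a 3x3 rotation matrix.
    Dividing the quaternion by its norm is all that is needed to keep
    the resulting matrix orthonormal, which is the simplification over
    re-orthonormalizing a 3x3 matrix directly."""
    n = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / n for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ]

# An incremental adjustment: a small rotation about z (quaternion
# (cos(a/2)... here written directly) yields rows of unit length, i.e.
# the matrix is orthonormal to machine precision.
R = quat_to_matrix((math.cos(0.05), 0.0, 0.0, math.sin(0.05)))
row_norms = [math.sqrt(sum(c * c for c in row)) for row in R]
print([round(n, 12) for n in row_norms])  # [1.0, 1.0, 1.0]
```

Composing such an incremental matrix with the current rotation estimate by matrix multiplication preserves orthonormality up to machine precision, as the text describes.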
Attention has been paid to maintaining orthonormality, and often this requirement provides additional constraints when solving for relative orientation and depth.

2.2.3 Token type and correspondence

A token is a symbolic entity computed from images. Tokens have associated with them a number of measures such as position, brightness, size or orientation. The token of choice for the vast majority of these analyses is the point. Line tokens have been investigated to some extent [YeHu83][MiSeAg86][LiHu86][Kan88][FaLu87][FaLuTo87][SpA187], and higher order tokens very little [Huang87]. By higher order tokens it is meant, for example, curvilinear tokens, bars, blobs, and bounded regions. Huang mentions a solution based on bounded regions [Huang87]. A planar contour, for example, will change shape with the viewpoint. The information contained in these changes can be used to compute relative orientation. A general methodology, even for single-region cases, has yet to be developed. As most researchers assume the correspondence problem has been solved, there is little mention of it beyond the usual assumptions. However, Marr has noted that fully specifying the optic flow should render the small motion correspondence problem essentially solved [Marr82]. As with other chicken and egg problems, which process precedes the other is an open question. The answer may be that the processes are cooperative. Given a hypothesis for some token matches, an initial estimate for the relative orientation and depth can be computed. The initial relative orientation and depth estimates will allow the correspondence process to make further tentative matchings. If more matches are not forthcoming, then the relative orientation and depth estimates are revised as a function of some mismatch error in the correspondence process. The two processes should eventually settle on a set of matches, as well as relative orientation and depth estimates, that minimizes their respective error terms.
Some work has been done that uses a tentative estimate for the relative orientation to guide a search for further matches. For example, Faugeras and Lustman [FaLu87] use the current estimate of the relative orientation parameters to restrict the search space for the corresponding point or line in the second image, given its observation in the first image. Fang and Huang discuss matching when the image scales are quite different [FaHu84b].

2.2.4 Iterative and noniterative solution methods

Iterative solutions are usually associated with equations which are linearized approximations to the exact nonlinear equations. The exact equations that express the collinearity, coplanarity or conservation of distance condition are nonlinear. Nonlinear solutions require an initial estimate to start the convergence process, and this estimate generally must be quite close to the correct value. Linear closed form solutions have been derived for relative orientation and depth [Lh81][TsHu81][SpA187]. The linearization is achieved by the introduction of intermediate variables. The coplanarity condition is the basis for the linear closed form methods developed by Longuet-Higgins [Lh81] and Tsai and Huang [TsHu81]. The collinearity condition forms the basis of [SpA187]'s closed form method. These closed form solutions generally report a sensitivity to noisy image measurements. The solution developed in this thesis requires solving a set of nonlinear equations. However, due to the choice of model equations and their parameterization, the method developed here exhibits a wide range of convergence and a fair degree of robustness with respect to noisy image measurements.

Collinearity

The collinearity condition is expressible only through nonlinear equations. These equations, in general, are then linearized and solved by iterative numerical techniques. Spetsakis and Aloimonos have derived a linear closed form solution [SpA187].
They use line tokens and the solution is linear in a large set of intermediate variables (far more parameters are solved for than exist in the basic physical model). They report a sensitivity to noisy image measurements, however, which is a common problem with other closed form solutions. Methods based on linearizing nonlinear equations, in almost all cases, formulate the equations with more parameters than exist in the basic physical model. This is especially true of the parameterization adopted for rotation. Rotation has three degrees of freedom, but many methods explicitly solve for nine rotation parameters. This leads to large systems of equations and to possible inaccuracies from computations with finite precision. Problems with rotation estimation are sometimes dealt with by assuming small rotation angles. This assumption is generally unrealistic in real situations and represents a major drawback for those methods that make it. The solution method developed in this thesis does not make this assumption. Methods that linearize a set of nonlinear equations and solve them iteratively often report their best results when using the Levenberg-Marquardt method for nonlinear parameter estimation. [FaHu84] solve a set of over-determined nonlinear equations using three different methods: (1) Newton iteration, (2) Gauss-Newton, and (3) Levenberg-Marquardt. Their formulation assumes that the rotation angles are small. This is the major drawback of the method. A minimum of five points will yield a solution to the set of nonlinear equations. In practice, more than five points are used. The motion equations are derived from equations that determine the image space shift of a projected image point between two disparate views. Expressions for the image space shifts, ΔX and ΔY, are derived in terms of the unknown motion parameters for rotation and translation, and the relative depth of the imaged point.
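The Levenberg-Marquardt idea referred to throughout this section is to damp the Gauss-Newton normal equations, interpolating between a Gauss-Newton step (small damping) and gradient descent (large damping). A one-parameter toy sketch on a hypothetical model y = exp(a·x), not the motion equations themselves:

```python
import math

def levenberg_marquardt_1d(f, df, a0, data, lam=1e-3, iters=50):
    """Minimal one-parameter Levenberg-Marquardt sketch.  The damping
    term lam added to J^T J blends Gauss-Newton with gradient descent;
    lam shrinks on successful steps and grows on failed ones, which is
    what widens the range of convergence."""
    a = a0
    for _ in range(iters):
        r = [y - f(a, x) for x, y in data]   # residuals
        J = [df(a, x) for x, _ in data]      # Jacobian entries (df/da)
        JtJ = sum(j * j for j in J)
        Jtr = sum(j * ri for j, ri in zip(J, r))
        step = Jtr / (JtJ + lam)             # damped normal equation
        new_a = a + step
        err = lambda p: sum((y - f(p, x)) ** 2 for x, y in data)
        if err(new_a) < err(a):
            a, lam = new_a, lam / 2.0        # accept step, trust Gauss-Newton more
        else:
            lam *= 10.0                      # reject step, lean toward gradient descent
    return a

# Recover a = 0.7 in y = exp(a * x) from noise-free samples.
data = [(x, math.exp(0.7 * x)) for x in (0.0, 0.5, 1.0, 1.5)]
a = levenberg_marquardt_1d(lambda a, x: math.exp(a * x),
                           lambda a, x: x * math.exp(a * x),
                           a0=0.0, data=data)
print(round(a, 6))  # converges to 0.7 (up to numerical tolerance)
```

Note how the very first Gauss-Newton step from a = 0 overshoots and is rejected; the growing damping term shortens subsequent steps until one succeeds, which is the mechanism behind the improved robustness these methods report.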
[FaHu84] try three methods for solving the nonlinear equations. Their best results are with the Levenberg-Marquardt least squares method and the Newton method. Fang and Huang's solution given in [FaHu84] is seen in other applications [HuFa][FaHu84b]. [YaMe86] solve for the motion and structure of a rigid body using regularization. The major difficulty with their method is that it requires an exhaustive search of the parameter space to locate the global minimum of the merit function. The regularizing functional is the norm of the rotation, expressed as a vector. The regularizing functional imposes the additional constraint of minimizing the size of the estimated rotation when minimizing the error term with respect to rotation and translation. This functional is appropriate, since the model equations assume a small rotation. They solve for the motion parameters, with a formulation written with relative depth eliminated, using exhaustive search. The solution by exhaustive search may partly be motivated by a desire to illustrate the effect of regularization on the error function that they are minimizing. [RoAg80] solve a system of nonlinear equations using a modified Levenberg-Marquardt least squares algorithm. Their formulation over-parameterizes the problem, and they make use of an inefficient least squares technique to perform the nonlinear parameter estimation. They require a minimum of five points in two views or four points in three views. In the case of five points, the number of variables is eighteen. There are twelve unknowns for the 3D coordinates of the points. There are fifteen variables for the point coordinates; however, one of the depths is fixed to set the scale of the solution. The remaining six unknowns, three for rotation and three for translation, describe the motion of the object (or camera). Each pair of corresponding points yields four nonlinear projection equations.
Hence, five corresponding point pairs yield twenty equations, but two of the equations have no unknowns. The eighteen nonlinear equations are solved using the Brown-Dennis version of the Levenberg-Marquardt algorithm [BroDen72]. This method computes the partial derivatives of the Jacobian using finite differences, which is computationally very expensive. The number of parameters in their model equations far exceeds the number of parameters in the underlying physical model. This leads to a very rapid expansion in the size of the system of equations, which certainly leads to poor efficiency. The least squares error term is the difference between the observed image coordinates and the coordinates predicted by the model equations. The nonlinear equations expressing the collinearity condition can be rewritten as linear equations by introducing intermediate variables. This is done by Spetsakis and Aloimonos in a method that uses line tokens [SpA187]. [SpA187] solve for three dimensional motion and structure with an algorithm that requires thirteen line correspondences over three views. The solution is in linear closed form. They discuss the advantages of line correspondences, stating that the correspondence process is made easier, and they speculate that error terms involving lines will be less sensitive to image measurement noise. Their results, however, contradict their original presuppositions: they report a sensitivity to noisy measurements. As with other closed form solutions, the introduction of many intermediate variables appears to negate the advantages of the linearization. They begin their analysis with the situation where the motion is known, and the task is to recover the scene structure. They parameterize 3D lines with four parameters, two for the direction and two to fix the position in space. Two views of a line with known motion are sufficient to constrain the four line parameters, yielding a unique solution.
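One common four-parameter line representation makes the two-plus-two count concrete: two parameters fix the direction and two fix the position. This is an illustrative choice, not necessarily the parameterization used in [SpA187].

```python
def line_point(params, t):
    """A 3D line that is not parallel to the z = 0 plane can be written
    with four parameters (a, b, p, q): the point (a, b, 0) where it
    crosses z = 0, and the direction (p, q, 1).  Two parameters fix the
    direction and two fix the position, matching the count in the text.
    (An illustrative parameterization, not necessarily that of [SpA187].)"""
    a, b, p, q = params
    return (a + p * t, b + q * t, t)

# Sample two points on the line with (a, b, p, q) = (1, 2, 0.5, -0.25).
print(line_point((1.0, 2.0, 0.5, -0.25), 0.0))  # (1.0, 2.0, 0.0)
print(line_point((1.0, 2.0, 0.5, -0.25), 4.0))  # (3.0, 1.0, 4.0)
```

Each view of the line constrains these four parameters, which is why two views with known motion suffice for a unique solution.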
Spetsakis and Aloimonos extend this analysis to three frames by treating the three frame case as two instances of the two frame case. Four equations (two for direction and two for position) can be written for three views of one line. Two of the equations are dependent. The remaining two equations are rewritten in terms of three 3x3 matrices of unknowns, for a total of twenty-seven unknowns. The two equations are linear in the intermediate variables, with each equation involving eighteen of these variables. Each of the unknowns is a nonlinear combination of a subset of the two unknown translations and rotations. The rotation is parameterized by the nine elements of an orthonormal matrix. Thirteen corresponding sets of lines allow twenty-six equations to be written in twenty-six unknowns. The twenty-seventh unknown is set to one, since the solution is only known up to a scale factor. Decomposing the solved intermediate variables, to yield the motion parameters, involves an eigenvalue analysis of the intermediate variable matrices. This analysis produces six vectors from which the rotation and translation vectors can be solved for. They report results indicating a sensitivity to noisy measurements. This correlates with results for the eight pure parameter closed form solution for point correspondences. In both solution methods the problem is linearized by the introduction of intermediate variables which are themselves nonlinear functions of the motion parameters. Confounding the motion parameters leads to the possibility (especially in the presence of noise) of incorrect combinations of parameter values which can account for the observations. Also in this case, the rotation parameterized by nine variables introduces six more degrees of freedom than necessary, leading to reduced efficiency and a greater opportunity for numerical inaccuracy due to computations with finite precision.
Coplanarity

This condition has been used to a much greater extent to solve for relative orientation and depth, dating back to the work of Thompson in photogrammetry [Tho59]. Although the model equations are nonlinear in the physical parameters, linear equations are derivable by introducing a set of intermediate variables. Linear closed form solutions, however, appear to be sensitive to noisy image measurements. By incorporating techniques for solving overdetermined systems of equations, the noise sensitivity is reduced somewhat but remains relatively high. Longuet-Higgins computes the relative orientation of two viewpoints from the closed form solution of a set of linear equations [Lh81]. The equations are derived from the coplanarity condition. This condition is a consequence of the transformation from one coordinate system to the next due to rotation, R, and translation, t. A 3D point p' in the rotated and translated coordinate frame is related to a point p in the reference frame by p' = Rp + t. The corresponding vectors, represented by the projection of a point into two image planes, must lie in a plane that includes the translation vector. Longuet-Higgins defines a 3x3 matrix Q which, in effect, maps the coordinates of a point (treated as a vector) in one frame (the non-reference frame) into a frame with an orientation parallel to the reference frame, and computes the vector product of this reoriented vector with the translation vector. The resulting cross product is taken into an inner product with the corresponding vector measured in the reference frame. This product is equal to zero by the coplanarity condition, since the vectors are orthogonal. The nine elements of the transformation matrix Q are found by solving eight simultaneous linear equations and setting the ninth element to one (Q can only be known up to a scale factor). Eight corresponding points yield the eight equations.
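The coplanarity product described above can be written out directly: with p' = Rp + t, the triple product p'·(t × Rp) vanishes for any rigid motion. A minimal sketch with illustrative values (not the data of [Lh81]):

```python
def cross(a, b):
    """3D vector cross product."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, p):
    return tuple(dot(row, p) for row in M)

def coplanarity_residual(R, t, p):
    """With p' = R p + t, the vectors p', t and R p are coplanar, so the
    triple product p' . (t x R p) vanishes.  This is the constraint
    behind Longuet-Higgins' matrix Q (in later terminology, the
    essential matrix)."""
    p2 = tuple(c + ti for c, ti in zip(matvec(R, p), t))  # p' = R p + t
    return dot(p2, cross(t, matvec(R, p)))

# A 90-degree rotation about z plus an arbitrary translation:
# the residual is zero for any scene point.
R = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
print(abs(coplanarity_residual(R, (0.5, -0.2, 1.0), (3.0, 1.0, 7.0))) < 1e-9)  # True
```

Each corresponding point pair contributes one such equation, linear in the elements of Q, which is why eight correspondences suffice to solve for Q up to scale.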
The translation is derived from Q by observing that the elements of QQ^T can be written in terms of the translation only. Once t is computed, R is computed by observing that Q is the cross product of the translation and rotation. A series of manipulations involving R, rewritten in terms of the cross product of two other rotation matrices and the linear combination of t and Q_a × t (× denotes the vector cross product, and Q_a is one of the three rows of Q), which are orthogonal, leads to an expression for R written in terms of Q and t (both of which are known). Once R is solved for, the depths are computed straightforwardly. Tsai and Huang, at about the same time as Longuet-Higgins, developed a linear closed form solution to the rigid general motion problem, computable from eight point matches on a planar patch [TsHu81]. Using the standard mapping expression for points from one camera coordinate system to the next (p' = Rp + t), extended to include an affine transformation and specialized to the planar point case, they write two nonlinear expressions for the image ordinates, X' and Y', of a projected point. This expression is written in terms of their eight "pure parameters". The equations for the eight pure parameters are nonlinear in twelve unknowns. Rewriting the eight parameter equations to account for the global scale factor, and specializing to the planar case, leads to a set of eight nonlinear equations in the eight unknown motion and structure parameters. There are three parameters for rotation, two for translation and three parameters which define the object plane. These equations are rewritten in terms of only one unknown, the relative translation in the x direction, Δx″. This equation is a sixth order polynomial in Δx″. Once Δx″ is solved for, the seven other unknowns are computed in closed form. The eight pure parameters are determined by solving a set of eight or more linear equations by least squares techniques.
The system of equations that determines the eight pure parameters is derived from an analysis in Lie algebra. The analysis proves that the eight pure parameters are unique for the mapping from (X,Y) to (X',Y'). The equations are written in terms of frame differences. A frame difference is defined to be the difference in image intensity of the projection of the same physical point at two different times. Tsai and Huang assume that the image intensities are the same for the same physical point. Tsai and Huang determine rotation and translation by solving a set of linear equations to obtain a set of intermediate parameters [TsHu84]. They report a sensitivity to noisy image measurements. From the intermediate parameters in matrix form, the actual motion parameters are computed from the singular value decomposition components (a three stage process with two sets of intermediate parameters). Eight "pure parameters" are computed from eight image point correspondences. Each correspondence yields one equation, linear in the eight unknown "pure parameters". The equation is derived from the basic transformation p' = Rp + t. Tsai, Huang and Zhang [TsHuZh82] continue their analysis from [TsHu81]. They develop an algorithm for determining the motion and structure of planar patches. Their computation of the eight pure parameters remains the same. The computation of R and t from the eight parameters now relies on a singular value decomposition (SVD) of a 3x3 matrix whose elements are the eight pure parameters and a constant. A case analysis of the multiplicity of the singular values leads to the appropriate computations for R, t and the plane parameter, p. The analysis of the singular values is presented as three theorems. The first theorem deals with singular values of multiplicity two. In this case, the motion is a rotation about the origin, along with a translation parallel to the normal axis of the planar patch.
This is a necessary and sufficient condition for the singular values to be of multiplicity two. Theorem two deals with singular values of multiplicity one. In this case two possible solutions exist. The necessary and sufficient condition for this case is a rotation about an axis through the origin and a translation, not parallel to the normal axis of the planar patch, measured with respect to the second coordinate frame. Theorem three covers the case where all the singular values are the same. The necessary and sufficient condition for this to occur is that the motion is a pure rotation about an axis through the origin. [TsHu84b] use the same SVD methods detailed in [TsHuZh82] to solve for rotation, translation and depth. They extend their algorithm to the three view situation of four corresponding point sets. The pure parameter matrix for three views is the product of the matrices from views one and two, and views two and three. Faugeras and Toscani make use of the coplanarity constraint when solving for rotation and translation [FaTo86]. Translation is computed first, followed by rotation. Both translation and rotation are solved as minimization problems. The solutions are found by computing the eigenvector associated with the least eigenvalue satisfying certain sign and norm constraints of a quadratic vector expression (the norm of the error term expressed as a vector). Faugeras and Lustman compute the singular value decomposition (SVD) of a 3x3 matrix A, a linear transformation matrix which describes the relative orientation between two viewpoints and the projection of a set of coplanar points or lines [FaLu87]. This transformation is linear as a result of using homogeneous coordinates. The points or lines are contained in a plane whose parameters are given by p = (a,b,c,d). The SVD of A permits the extraction of the rotation R by equating the SVD components with the expression for A in terms of R, the translation t, and p.
The resulting equations are redefined in terms of the basis vectors of R, e_i. t is then eliminated from this set of equations, yielding three linear equations in three intermediate variables x_i. Each x_i is a function of the normal n, which defines the object plane, a matrix V from the SVD result, and e_i. The distance of the plane, d, from the origin can then be solved for in closed form. A case analysis of the solutions for the singular values yields R. Once R is solved for, t follows straightforwardly. The matrix A, which maps planar points or lines from one coordinate system into the other, is computed by iteratively minimizing a least squares error function with respect to the eight elements of A from observations of four or more points or lines. The ninth element of A is set to one, since the solution is only known up to a scale factor. This solution method involves more parameters than exist in the underlying physical model (the fundamental parameters are: four for the plane, three for rotation and two for the translation). The algorithm proceeds by first estimating A. Then the motion parameters must be extracted from the intermediate variables (the elements of A) by an SVD analysis. The introduction of many intermediate variables may lead to problems with noisy measurements. Because the motion parameters are confounded when represented by intermediate variables, a large variation in the individual motion parameters may still yield a value for the intermediate variables which accounts for the image measurements. [FaLu87] give experimental results for one test case. The image had a 256x256 pixel resolution. The desired amount of rotation was 5.55 degrees, and their algorithm estimated it to be 13.81 degrees. Eight coplanar points were used as input, and the plane's surface was 3.81 meters from the camera. The direction of the rotation axis and the plane normal were determined with reasonable accuracy.
In the case of line tokens, [FaLu87] use the same procedure developed for points, since coplanar lines and points can be treated similarly in their formulation of the projective geometric constraints. Faugeras, Lustman and Toscani [FaLuTo87] use the same approach detailed in [FaTo86]. They minimize the least squares system derived from the coplanarity constraint. The solution to the minimization problem is the eigenvector associated with the least eigenvalue of the norm of a vector. The vector is given by the coplanarity constraint, with each pair of corresponding image points yielding one equation. A minimum of eight matched points is required to solve for the eight pure parameters. These eight parameters are the eight elements of the matrix E defined by E = TR (where T represents the translation and is written as a 3x3 anti-symmetric matrix that effects a cross product), with the ninth element set to one, since the equations are homogeneous (the solution is known only up to a scale factor). This work represents an extension of the methods developed by Longuet-Higgins [Lh81][Lh84] and Tsai and Huang [TsHu81][TsHu84]. [FaLuTo87] also compute relative orientation and depth from line matches. A projected line defines a plane containing the origin. They prove that a minimum of three views of a line establishes the rigid displacement between the three views. The three planes must satisfy the equation which defines the pencil of planes that these planes must belong to. For this matrix equation to be satisfied, four determinants of size three must all equal zero. By mapping the equations of the planes into a single coordinate system, an expression for a determinant which only involves R can be written and solved. Similarly, another determinant, only involving t, can be written and solved. The remaining determinants only involve t and are all proportional.
Six sets of corresponding lines yield six equations for both rotation and translation, which is a sufficient constraint since there are six unknowns for R and five for t (t is known only up to a global scale factor). The rotation parameters are determined by extended Kalman filtering. The translation is computed in linear closed form once the rotation is known. Weng, Huang and Ahuja solve for translation from the eight pure parameters by standard least squares techniques [WeHuAh87]. Results from synthetic images indicate that the estimate of the pure parameters is sensitive to noise. Since the motion parameters are derived from the pure parameters, this pure parameter error dominates the errors in the motion parameters. Rotation is solved for by finding the unit eigenvector, associated with the smallest eigenvalue, which minimizes the norm of a 4x4 matrix. The 4x4 matrix expresses the least squares error between the observed rotation and the closest orthonormal rotation matrix. Given that the translation does not vanish, they then compute the relative depth of the corresponding points using least squares. They require eight or more corresponding points to compute the eight pure parameters, using least squares techniques. [ZhHa85] describe a method that is essentially the method of [Lh81] or [TsHu84], based on the computation of motion from the eight pure parameters. By making assumptions about the type of motion and properties of the scene, linear closed form solutions are derivable [YeHu83]. These assumptions, however, limit the useful application of their methods. Yen and Huang take advantage of the duality between points and lines when projected onto a Gaussian sphere to yield closed form solutions for motion and relative depth. In the two view case, additional information, such as the direction of translation, is required. Two views under-constrain the solution for relative orientation and depth from line correspondences.
For general motion, assumptions regarding scene properties are made, allowing a set of nonlinear equations to be written to solve for rotation. The solution methods of Yen and Huang are in closed form for points derived from coplanar line correspondences. They use the so-called "eight pure parameter" methods to solve for motion and depth. Their solution methods for the two frame case are specialized by making assumptions about the point and line configurations and by the use of a priori knowledge such as coplanarity, known relative orientation, etc. Such assumptions are also made for sequences of images. When solving for rotation, translation and relative depth from line correspondences over three or more frames, Yen and Huang derive an expression which is a fourth order polynomial in the rotation parameters. They were not able to report any success in solving these equations. Each line correspondence yields one equation; thus, for the three frame case, at least six line correspondences are required for a solution. Rather than solving for rotation using nonlinear equations, Yen and Huang opt for a linear method analogous to Huang's eight pure parameter method. This linear method, however, requires the introduction of three 3x3 intermediate parameter matrices. The elements of these matrices can be solved for given twenty-six line correspondences, generating twenty-six linear equations in twenty-six unknowns (one of the unknowns is a constant since the solution is known only up to a scale factor). Again, the analysis solves for more parameters than the underlying imaging situation possesses, which may lead to noise sensitivity. The nonlinear motion and depth equations, derived from the coplanarity condition, have also been solved by iterative numerical techniques. Toscani and Faugeras do an iterative minimization of an error function using a finite difference technique when solving for rigid rotation, R, and translation, t [ToFa87].
They suggest that their error function lends itself to steepest descent techniques and extended Kalman filtering. The error function is defined by the cross product of the projection of the estimated scene point, P_e (labelled Q2 in fig. 2-3), and its observed location, Q2, written as vectors in the camera-centered coordinate system. The estimated object point, P_e, is defined by projecting the point defined by the intersection of a ray, p1 (corresponding to the observed point in one image), with a plane, Q. Q is defined in terms of the plane that contains the estimated translation, t_e, and the observed ray, p1. The ray p2 is defined by the projection of the observed point, P, into the second image plane. Note that due to measurement noise and other factors p1 and p2 do not in general intersect at P (see fig. 2-3). The geometry is symmetric; thus a second estimated point, P_e, is similarly defined from p2. Two error terms can be written, one for each camera position. The overall criterion to be minimized is the sum of the two camera-specific error functions.

Fig. 2-3. Image geometry for reconstruction and reprojection method [ToFa87].

Horn solves for relative orientation using an iterative least squares algorithm that seeks to minimize a triple product derived from the coplanarity condition [Horn87]. The triple product is formed from two corresponding rays and the translation, all measured with respect to the reference camera-centered coordinate frame. The triple product is zero when the three vectors are coplanar. Measurement errors, or an inaccurate estimate of the relative orientation, will lead to a nonzero triple product. A triple product is derived that incorporates an incremental correction to the relative orientation estimate. A least squares error function is written in terms of the triple product.
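The triple-product criterion can be sketched as follows. The synthetic motion, ray count, and names are illustrative assumptions, and the incremental-correction machinery of [Horn87] is omitted; the sketch only evaluates the criterion itself:

```python
import numpy as np

def coplanarity_error(rays1, rays2, b, R):
    """Sum of squared triple products b . ((R r1) x r2): zero when every
    rotated ray, its corresponding ray, and the baseline are coplanar."""
    triples = np.cross(rays1 @ R.T, rays2) @ b
    return float(np.sum(triples ** 2))

# Synthetic rigid motion and five ray pairs (illustrative values).
theta = np.radians(15.0)
R_true = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(theta), 0.0, np.cos(theta)]])
b = np.array([0.5, 0.1, 0.0])        # baseline (translation)
rng = np.random.default_rng(2)
P = rng.uniform([-1, -1, 4], [1, 1, 8], size=(5, 3))
rays1 = P / np.linalg.norm(P, axis=1, keepdims=True)
P2 = P @ R_true.T + b
rays2 = P2 / np.linalg.norm(P2, axis=1, keepdims=True)

err_true = coplanarity_error(rays1, rays2, b, R_true)       # ~ 0
err_wrong = coplanarity_error(rays1, rays2, b, np.eye(3))   # > 0
```

The criterion vanishes at the true relative orientation and grows when the rotation estimate is wrong, which is what the iterative corrections exploit.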
The constrained minimization problem reduces to a set of linear equations in the six correction parameters: δb (an infinitesimal correction to the translation vector) and δω (an infinitesimal correction to the rotation represented as a quaternion). A qualitative description of his results indicates that the algorithm will converge to the global minimum from every possible starting point in the parameter space if there are a sufficiently large number of ray pairs available. This is indeed a very good result. Local minima appear if there are only a few more than the minimum five pairs of rays. However, even in the case of multiple solutions it is claimed that only one of the solutions yields the same sign for all of the rays. Thus a good initial estimate is still not required. A systematic search of the parameter space can be conducted to locate the global minimum. It is not mentioned how many ray pairs are required for the algorithm to converge to the global minimum from any initial parameter estimate. The number of iterations required to converge in practice is also not discussed, nor are results with noisy measurements. The use of least squares techniques will probably yield good results for noisy images. Gennery solves for relative orientation using generalized least squares adjustment [Gen79]. He also implicitly uses the coplanarity condition. His error term is the distance between an observed image point (x2,y2) in frame C2 and the epipolar line defined by the image ray (x1,y1,f), measured in C1, which is the projection of a scene point, p, into camera 1's image plane (C1) (see fig. 2-4). The projection of image ray (x1,y1,f) defines an epipolar line in C2. If there is no measurement error and no error in the relative orientation estimate, then this distance is zero. Note that the epipolar line, the baseline (or translation), and the scene point are all coplanar when there is no error.
His formulation of Gauss least squares leads to a linearization of the nonlinear camera model. This is similar to the approach adopted in this thesis for line tokens. Given this error model, he writes an expression for the epipolar line. The epipolar line is defined by the intersection of C2's image plane and the plane defined by the translation and the image ray in C1. All of these vectors are measured with respect to C2's coordinate frame.

Fig. 2-4. Coplanarity constraint and error term from Gennery.

Two cases for the error term are described. The first case yields one error measurement, which occurs when the image point, (x2,y2), does not lie beyond the end-point (x0,y0) (designated as D2 in fig. 2-4) of the epipolar line. Hence, there is one distance measurement (the perpendicular distance). In the second case, two error measures are computed when (x2,y2) lies beyond the end-point (x0,y0) of the epipolar line segment. The two error measures are orthogonal. Note that by treating the epipolar line as an infinite line, the perpendicular distance could still be computed. Similarly, the norm of the two error measures could be computed. In effect, there is still only one error measurement for each pair of corresponding points. Least squares iteration computes corrections to the current estimates for rotation and translation, and iteration stops when there is no appreciable change in the parameter estimates. This may lead to problems if the minimum is poorly conditioned. An additional test on the residuals may be called for. Nagel exploits the coplanarity condition to solve for orientation [Nag81]. Three nonlinear homogeneous equations in three unknown rotation parameters can be written, and solved exactly, given five corresponding pairs of points in two views. Having more points leads to a least squares minimization of the homogeneous equations.
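The perpendicular-distance error term of the first case can be sketched as follows. The sketch treats the epipolar line as infinite and uses illustrative synthetic values; it is not Gennery's two-case formulation:

```python
import numpy as np

def skew(t):
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def epipolar_distance(x1, x2, E):
    """Perpendicular distance (in image units) from the observed point x2
    to the epipolar line l = E x1 in the second image, with image points
    written homogeneously as (x, y, 1) for focal length 1."""
    l = E @ x1
    return abs(x2 @ l) / np.hypot(l[0], l[1])

# Synthetic relative orientation and one correspondence (illustrative).
theta = np.radians(5.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.3, 0.0, 0.1])
E = skew(t) @ R

P = np.array([0.2, -0.1, 5.0])       # scene point in camera 1's frame
x1 = P / P[2]
P2 = R @ P + t
x2 = P2 / P2[2]

d_exact = epipolar_distance(x1, x2, E)                           # ~ 0
d_noisy = epipolar_distance(x1, x2 + np.array([0.0, 0.01, 0.0]), E)
```

With no measurement error and the true relative orientation the distance is zero; perturbing the observed point by 0.01 image units perpendicular to the (nearly horizontal) epipolar line produces a distance of about 0.01, as expected.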
Once the rotation has been determined, the translation can be computed by a linear equation up to a scale factor. Similarly, the relative depths can be computed. Liu and Huang investigate the problem of solving for rigid body motion using line correspondences [LiHu86]. They are able to solve for the rotation component of general rigid motion using six or more line correspondences over three frames. The rotation and translation components are solved separately. The rotation component is found by solving six or more nonlinear equations in the six unknown rotation parameters. They represent rotations using tilt, span and swing angles [sic]. The nonlinear equation is based on the coplanarity condition. Three views of an object line define three planes containing the origin when imaged by the central projection process. The vector product of two image plane normals, reoriented to the reference observation, should be orthogonal to the normal of the plane defined from the reference observation. One such equation can be written for each set of corresponding lines. Six sets of corresponding lines yield six equations which are solved with iterative numerical techniques. More than six sets of lines yield an overdetermined system, which can be solved with least squares techniques. Once the rotations have been determined, the translation can be solved (up to a scale factor) from five or more linear equations in five unknown translation parameters (recall that only three views are being considered). The ratio of the angles between the projection planes can be determined from the ratio of the vector products of the plane normals, measured with respect to the reference viewpoint or observation time. The ratio of these angles is equal to the ratio of the inner products of the unknown translations and the corresponding plane normals (all measured with respect to the reference observation).
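The orthogonality constraint on the reoriented plane normals can be verified with a small sketch. The line, motions, and names below are illustrative assumptions, not the parameterization of [LiHu86]:

```python
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# A 3D line (point q, unit direction d) in the reference frame, and two
# further views related to it by rigid motions (all values illustrative).
q = np.array([0.5, -0.2, 6.0])
d = np.array([0.3, 1.0, 0.2])
d = d / np.linalg.norm(d)
R2, t2 = rot_z(np.radians(12.0)), np.array([1.5, 0.2, -0.5])
R3, t3 = rot_z(np.radians(-9.0)), np.array([-1.2, 0.8, 0.3])

def plane_normal(R, t):
    """Unit normal of the projection plane through the focal point that
    contains the imaged line, expressed in that view's own frame."""
    qk, dk = R @ q + t, R @ d
    n = np.cross(qk, dk)
    return n / np.linalg.norm(n)

n1 = plane_normal(np.eye(3), np.zeros(3))
n2 = plane_normal(R2, t2)
n3 = plane_normal(R3, t3)

# The reoriented normals R2^T n2 and R3^T n3 are both orthogonal to the
# line direction, so their cross product is parallel to it; hence it is
# orthogonal to n1, giving one scalar equation in the rotations alone.
constraint = n1 @ np.cross(R2.T @ n2, R3.T @ n3)

# The same cross product recovers the line's direction (up to sign).
d_est = np.cross(R2.T @ n2, R3.T @ n3)
d_est = d_est / np.linalg.norm(d_est)
```

Each corresponding line contributes one such scalar equation, which is why six lines are needed for the six rotation unknowns.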
In test results with noise free synthetic images, Liu and Huang report that a good guess was required for the iterative solution to converge to the global minimum. When rotation is parameterized by direction cosines and the amount of rotation about the specified axis, they report poorer results than with the rotation parameterized by tilt, pan and swing angles. No reason was given as to why this should be. The range of convergence for six line correspondences was plus or minus three degrees in the rotation parameters (tilt, pan and swing). With seven correspondences the rotation range is plus or minus five degrees. More line correspondences did not appreciably improve these results. In the case where the rotation was the same for the two successive frames the rotation range was plus or minus twenty-five degrees. They report using a routine from MINPACK-1 from Argonne National Laboratory for iteratively solving the nonlinear equations. Their results may improve somewhat if they use a different numerical least squares routine.

Conservation of distance or angular configuration

The equations derived from this condition are nonlinear and second order in the depth unknowns, and are solved iteratively. The best results are reported when use is made of the Levenberg-Marquardt method for nonlinear parameter estimation. The iterative methods required a good initial estimate to achieve accurate estimates of the scene structure and exhibit some sensitivity to measurement error. Mitiche, Seida and Aggarwal use standard iterative numerical techniques to solve a system of nonlinear equations for the depths of imaged points [MiSeAg88]. The equations express the 'conservation of distance with respect to rigid body motion' condition and are second order in the unknown depth parameters. Reported results indicate a sensitivity to image measurement errors.
In tests on synthetic images, initial estimates for the depth of the points were exact except for a random additive factor drawn from an interval that was only twenty percent, or less, of the actual depth. They report good results for this initial estimate with ninety percent of the trials achieving less than a ten percent error in the depth estimates. The additive noise was from a Gaussian distribution with a standard deviation of one-half pixel. Each corresponding pair of image points contributes two unknown depths, and each pair of 3D points gives rise to one second order equation. Hence, five pairs of corresponding points yield ten equations in ten unknowns. Given the solution for the relative depth parameters by iterative techniques, the relative orientation of the cameras can be solved in closed form, using a subset of four points, taken from the minimum five points used to compute the depths. The four non-coplanar points, measured with respect to the two viewpoints, can be related by a linear transformation matrix (in homogeneous coordinate form). The elements of the transformation can be computed by solving twelve linear equations in twelve unknowns. The relative orientation parameters can then be computed straightforwardly. Their results report the use of the Levenberg-Marquardt algorithm for least squares parameter estimation. Mitiche, Seida and Aggarwal solve for the 3D relative orientation of lines by solving twelve second order equations in twelve unknowns [MiSeAg86]. They make use of the "conservation of angular configuration with respect to rigid body motion" condition. This can be interpreted as the line dual of the "conservation of distance with respect to rigid body motion" condition. The conservation of angular configuration leads to the observation that the orientation of 3D lines associated with a rigid object will remain invariant as that object moves. 
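The equation counting for the distance-conservation condition can be verified with a small sketch. The motion, point values, and names are illustrative assumptions; the sketch only evaluates the residuals, it does not solve for the depths:

```python
import numpy as np
from itertools import combinations

# Synthetic rigid motion and five corresponding points (illustrative).
theta = np.radians(20.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.5, -0.2, 0.3])
rng = np.random.default_rng(3)
P1 = rng.uniform([-1, -1, 4], [1, 1, 8], size=(5, 3))
P2 = P1 @ R.T + t

# Each point is depth times a unit ray in its own frame: P = z * r.
r1 = P1 / np.linalg.norm(P1, axis=1, keepdims=True)
z1_true = np.linalg.norm(P1, axis=1)
r2 = P2 / np.linalg.norm(P2, axis=1, keepdims=True)
z2_true = np.linalg.norm(P2, axis=1)

def residuals(z1, z2):
    """One second-order equation per point pair: the squared distance
    between points i and j must agree across the two frames (rigidity)."""
    out = []
    for i, j in combinations(range(5), 2):
        d1 = np.sum((z1[i] * r1[i] - z1[j] * r1[j]) ** 2)
        d2 = np.sum((z2[i] * r2[i] - z2[j] * r2[j]) ** 2)
        out.append(d1 - d2)
    return np.array(out)

res_true = residuals(z1_true, z2_true)         # ten equations, all ~ 0
res_wrong = residuals(1.1 * z1_true, z2_true)  # rescaling one frame fails
```

Five points give ten pairs, hence ten second-order equations in the ten unknown depths; rescaling both frames together would still satisfy them, which is the global scale ambiguity.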
They report results with noisy synthetic images that show that a reasonably good initial guess for the initial depths is required. To achieve less than an average ten degree error in the spatial orientation of the 3D lines, the relative error in the initial estimate must be less than twenty-seven percent. The direction of a 3D line is parameterized in terms of the relative depth between a pair of points that define the line. They write one equation involving four unknowns for two views of two lines. The four unknowns are the relative depths of two points that define the line, for two lines, measured in two coordinate frames. Three views of four lines yield six different line pairs. There are twelve unknowns (four unknown relative depths per view). Six line pairs give twelve equations, each involving four of the unknowns. The equations are second order in the relative depth variables. Once these are solved for, the rotation can be computed; this is done iteratively. The translation and the depth of the lines (up to a scale factor) are then computed in linear closed form. The rotation is solved by using the Levenberg-Marquardt algorithm to compute the three rotation angles. Knowledge of the scene and type of motion provides sufficient constraint to enable a formulation involving linear sets of equations. For example, [Kan88] solves for pure rotation by computing angular differences between projections of lines in the image plane due to camera rotation only. The assumption of known orientations in the scene leads to a closed form solution for the amount of rotation. Similar to the approach of [MiSeAg86], use is made of the conservation of angular configuration with respect to rigid body motion.

2.2.4.1 Iterative solution - initial parameter estimates

Iterative solutions require an initial estimate to start the convergence process. It is clear from the literature that the choice of numerical technique plays a large part in the success of the algorithm.
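The Levenberg-Marquardt iteration referred to throughout this section can be sketched minimally. The toy problem (recovering a planar rotation angle from noise-free matches) and all values are illustrative assumptions, not any of the cited formulations:

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, x0, iters=50):
    """Minimal Levenberg-Marquardt loop: solve (J^T J + lam I) dx = -J^T r,
    shrinking lam after a successful step and growing it otherwise."""
    x = np.asarray(x0, dtype=float)
    lam = 1e-3
    cost = 0.5 * np.sum(residual(x) ** 2)
    for _ in range(iters):
        r, J = residual(x), jacobian(x)
        dx = np.linalg.solve(J.T @ J + lam * np.eye(len(x)), -J.T @ r)
        new_cost = 0.5 * np.sum(residual(x + dx) ** 2)
        if new_cost < cost:
            x, cost, lam = x + dx, new_cost, 0.5 * lam
        else:
            lam = 10.0 * lam
    return x

# Toy problem: recover a planar rotation angle from point matches
# p_obs = R(a) p (illustrative values).
def rot2(a):
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

a_true = np.radians(25.0)
p = np.array([[1.0, 0.0], [0.3, 0.8], [-0.5, 0.4]])
p_obs = p @ rot2(a_true).T

def res(x):
    return (p @ rot2(x[0]).T - p_obs).ravel()

def jac(x):
    a = x[0]
    dR = np.array([[-np.sin(a), -np.cos(a)], [np.cos(a), -np.sin(a)]])
    return (p @ dR.T).ravel()[:, None]

a_est = levenberg_marquardt(res, jac, [0.0])[0]
```

The damping parameter interpolates between Gauss-Newton (small lam) and gradient descent (large lam), which is what improves the range of convergence over plain Newton iteration.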
Clearly, assessing the merits of an iterative algorithm is complicated by the numerical techniques used to solve the equations. In the future, it is hoped that a standard numerical technique for solving least squares nonlinear parameter estimation problems may be agreed upon (or discovered, as it remains an open question which technique is best). Ideally, any initial estimate for the unknown parameters should work. The only report where this may be the case is from Horn, for a method based on the coplanarity condition that solves only for relative orientation [Horn87]. In most other cases a good initial estimate is required. Some methods assume that the motion is small and solve a simplified set of equations for the unknown parameters. This solution serves as the initial estimate for the more general equations. There is very little analysis in the computational vision literature regarding appropriate initial estimates for the iterative algorithms. For example, [MiSeAg86] require initial guesses for their rotation estimation, but do not describe how this was accomplished. [MiSeAg88] solve for a set of depth parameters using the Levenberg-Marquardt algorithm. The initial depth parameters are set equal to an arbitrary value. Initial depth estimates for synthetic images were randomly chosen from a range of values defined by some percentage of the known depth. [ToFa87] use, as an initial estimate, the results from a linear closed form solution method which is an extension of Longuet-Higgins' and Tsai and Huang's methods. [Horn87] advocates initial rotation estimates based on the internal angles of regular polyhedra. This approach is necessary if there is an insufficient number of corresponding rays. If there are a sufficient number of rays, then a random starting value is claimed to be sufficient. [FaHu84b] use approximations in the rotation parameters to simplify a nonlinear equation.
The estimates computed from this simplified equation serve as an initial estimate for the method using the more accurate equation. [FaHu84] found that an initial guess of no translation or rotation worked well when the motion was small. For larger motion it was found that the initial guess for the translation was more important than that for rotation. The initial estimate for translation, then, was given by setting the rotation estimate to zero and solving a set of linear equations for the translation parameters. [RoAg80] assume small interframe displacement and no rotation to compute the parallax of displaced image points. The parallax is then used to compute the 3D location of the imaged points, given an estimate of the second camera position. The second camera position is derivable from the fact that the depth of one of the imaged points is fixed. These initial estimates vary with the choice of the fixed point. A least squares minimization of the error between observed and estimated image coordinates of imaged points determines the best reference point.

2.2.5 Recovery of scene structure

In some cases, algorithms are developed that only solve for relative orientation. The implicit assumption is that once motion has been recovered, depth estimates are easily computed [Lh81]. Note that recovery of depth depends on a nonzero translation, since it is the disparity in the corresponding image features which contains the information regarding depth. Pure rotation about an axis through the focal point does not produce disparity. Depth estimates from relative orientation parameters can only be determined up to a global scale factor. Only when additional information regarding scene structure is known can the absolute depth be determined. This scale factor also determines the amount of translation, as depth and translation are directly proportional [Horn86]. [MiSeAg86] recover the relative depth of the endpoints of the projected lines.
The depths are relative due to the fact that the equations involved are homogeneous. [FaLu87] recover the depth of a plane up to a scale factor from observations of four or more points or lines on the plane. [FaLuTo87] recover the parameters for 3D lines by solving an overdetermined system of linear equations by least squares or extended Kalman filtering. By defining a line as the intersection of two planes, they can incorporate the 'pencil of planes' constraint, which says that the two planes defining the line must be in the same pencil of planes that includes the planes generated from the three projections. Three sets of two equations can be written, which are derived from the determinants of size three of the matrix that describes the pencil of planes. Since there are only four unknowns for the intersection of two planes, the system is overdetermined and least squares or extended Kalman filtering is used to compute them. [YeHu83] recover the relative depth of lines in the case of pure translation and general rigid motion. [TsHu84] can determine the depth up to a global scale factor once the rotation and translation of the motion have been determined. [Lh81] computes scene structure by deriving an expression involving the inverse of the transform, p' = Rp + t, and the inverse projection equation. This is computable once R and t have been solved for. [TsHuZh82] recover the planar patch parameters which give the distance of the patch from the camera coordinate frame up to a global scale factor. [TsHu84b] recover depth of the planar patch via the same methods given in [TsHuZh82]. [WeHuAh87] recover depth if the translation is not vanishingly small. [FaHu84] recover depth relative to the translation in the z direction. [RoAg80] solve for the depth of imaged points simultaneously with the motion parameters using least squares minimization of a set of nonlinear equations. [Nag81] recovers the relative depth of imaged points in closed form.
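Depth recovery once the relative orientation is known, and the global scale ambiguity, can be sketched as follows. The motion, point, and names are illustrative assumptions:

```python
import numpy as np

def depths_from_motion(r1, r2, R, t):
    """Given the relative orientation (R, t) and one pair of corresponding
    unit rays, solve  z2 r2 = R (z1 r1) + t  for the two depths by linear
    least squares (three equations, two unknowns)."""
    A = np.column_stack([R @ r1, -r2])
    depths, *_ = np.linalg.lstsq(A, -t, rcond=None)
    return depths                     # (z1, z2)

# Synthetic motion and one scene point (illustrative values).
theta = np.radians(8.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.4, 0.1, 0.0])
P = np.array([0.3, -0.4, 5.0])
r1, z1_true = P / np.linalg.norm(P), np.linalg.norm(P)
P2 = R @ P + t
r2, z2_true = P2 / np.linalg.norm(P2), np.linalg.norm(P2)

z1, z2 = depths_from_motion(r1, r2, R, t)
# Doubling the (unknowable) baseline length doubles every depth: the
# global scale factor shared by depth and translation.
z1_big, z2_big = depths_from_motion(r1, r2, R, 2.0 * t)
```

With t = 0 the right-hand side vanishes and the system becomes homogeneous, so the depths are undetermined, which is the algebraic form of the observation that pure rotation produces no disparity.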
[MiSeAg88] compute depth using conservation of distance with respect to rigid body motion.

2.2.6 Assumptions regarding scene or motion properties

In many cases, it is assumed that only a certain type of motion is being observed and/or there is a priori knowledge of some scene structural properties. These are additional assumptions beyond the rigidity assumption. These assumptions provide additional constraints to produce unique solutions to the posed problem and often lead to closed form solutions. These assumptions, however, lead to an oversimplification of the structure from motion problem. Thus, the solution methods that make these assumptions will generally have little practical application to the general motion and structure problem. They are only useful for the situations they have been derived for. For example, Nagel, by assuming the motion is confined to a plane, recovers the relative depth of imaged points in closed form [Nag81]. [Kan88] assumes known length and 3D line orientations. [FaLu87] assume the world is piecewise planar. This allows them to solve for R and t and relative depth in linear closed form. [FaLuTo87] assume planar surfaces when solving for R, t, and depth, z, in the case of point tokens. [TsHu81] assume their corresponding points are coplanar. This allows them to reduce the number of unknowns from twelve to eight. [YeHu83] assume a number of line configurations when analyzing two frame motion. This is required in order to reduce the number of degrees of freedom to the number of available constraints. These extra assumptions are not necessary when three or more frames are available. [TsHuZh82] assume that the corresponding point tokens are coplanar. [TsHu84b] assume points lie on a rigid planar patch. [WeAg81] assume that the rotation axis is fixed for short periods of time. This reduces the complexity of the expression that describes the dynamic motion.
In other cases, the assumptions are stated as necessary for a unique solution to be computed. That is, the equations become singular or inconsistent if the stated assumptions are not met. For example, [MiSeAg86] assume that 3D object lines are not parallel, i.e. no two orientations are identical. This ensures that the equations will be independent. This issue is discussed further in the next section.

2.2.7 Necessary and sufficient conditions for uniqueness

Linear closed form solutions are attractive, since proofs of uniqueness are usually straightforward. Solutions derived from nonlinear systems of equations are, in general, nonunique. It is difficult to prove, in general, when a solution from a set of nonlinear equations is unique. Empirical evidence, however, suggests that, in the case of solving for relative orientation and depth, a unique solution is computed when the unknowns are strongly overdetermined. In some cases assumptions regarding the motion or scene structure permit a proof of the uniqueness of the solution. However, for the same reasons that these assumptions restrict the applicability of the solution methods to real situations, these uniqueness proofs are also of limited usefulness. For example, [FaLu87] prove that, under the assumption that points are not collinear or that lines are not parallel, a linear system of equations with full rank is obtainable, thus a unique solution can be determined. [MiSeAg86] argue that three views of four lines are sufficient to uniquely determine a solution for the motion parameters. This conclusion is contradicted by [FaLuTo87], who claim that six or more lines in three views are required. This latter claim is supported by the results from this thesis. [MiSeAg88] require four points, not coplanar, to recover relative orientation. [FaLuTo87] argue that [MiSeAg86] improperly determine the minimum number of lines required to solve for R and t. They claim that [MiSeAg86] count dependent equations.
They also describe the minimum number of lines needed for a solution to exist; six in this case, over three views. They argue simply from the knowledge of how many unknowns there are and the number of equations that can be written for each set of matches. [TsHu81] discuss the uniqueness of the solution of the sixth order polynomial in Δx, which is the relative translation in the x direction. They prove the uniqueness of their eight pure parameters, which describe a mapping from one image coordinate frame to the next. This proof involves the use of Lie algebra. [YeHu83] treat numerous specific cases of line configurations and types of motion. In general, given sufficient knowledge of the scene or motion, a unique solution for the motion estimates is possible. With arbitrary line configurations, uniqueness is not guaranteed for the two frame case. With three or more frames, motion can be uniquely determined with arbitrary line configurations. Longuet-Higgins' method for determining relative orientation is ambiguous in the sign of the solution for the translation and in the elements of the intermediate parameter array Q [Lh81]. This ambiguity is resolved by constraining the depths (also referred to as the forward coordinates) to be positive, which is possible if and only if the signs of t and Q are correctly chosen. It is also mentioned that certain eight point configurations will result in dependent equations when computing Q. He develops this more fully in [Lh84]. Longuet-Higgins discusses configurations of eight points which result in an inability to compute relative orientation uniquely [Lh84]. Essentially, the points must lie on a quadric surface which passes through the two viewpoints. It is not known, and it may not be possible to determine a priori, whether the points satisfy this condition. [TsHuZh82] give three theorems detailing the necessary and sufficient conditions for the existence and uniqueness of the solution.
The first theorem deals with singular values of multiplicity two. In this case, the motion is a rotation about the origin, along with a translation parallel to the normal axis of the planar patch. This is a necessary and sufficient condition for the singular values to be of multiplicity two. Theorem two deals with singular values of multiplicity one. In this case two possible solutions exist. The necessary and sufficient condition for this case is a rotation about an axis through the origin and a translation, not parallel to the normal axis of the planar patch, measured with respect to the second coordinate frame. Theorem three covers the case where all the singular values are the same. The necessary and sufficient condition for this to occur is that the motion be a pure rotation about an axis through the origin. [TsHu84b] give a proof that three distinct views of four points (no three collinear) are sufficient to determine the motion uniquely. They assume that two solutions are possible and arrive at a contradiction. The only condition necessary for a unique solution, in the three view situation, is that the two orientations of the observed planar patch not be the same for views one to two or two to three. [TsHu84] give uniqueness proofs and the minimum number of points required for a solution to exist and for its uniqueness. They prove that a minimum of seven points is required to uniquely determine, up to a scale factor, the rigid general motion of an object with curved surfaces (i.e. no restriction on point configuration). They also prove that for a solution to exist, the seven points cannot be traversed by two planes, one of which contains the origin, or by a cone containing the origin. Nonlinear equations, unfortunately, do not lend themselves very easily to such proofs.
In the case of [FaHu84], they do arrive at a uniqueness proof for a solution involving nonlinear equations, but this was only possible using a small motion assumption and involves taking the limit of an expression as the amount of rotation goes to zero. Note that these uniqueness proofs go beyond the proofs for solution existence. Proving existence (or necessity) usually amounts to demonstrating that the number of unknowns is sufficiently constrained by the equations. This is almost always the case when rigidity is assumed. Proving sufficiency is another matter altogether. In many cases, discussing uniqueness amounts to discussing empirical results (this is especially true of methods which perform nonlinear parameter estimation). For example, Horn discusses ambiguities which arise from his solution method [Horn87]. It is possible that a solution may represent projections from behind the camera plane. The sign of the scalars describing the projection rays can be used to resolve this. Also, the solution for the sense of the translation may not be correct. Again, the correct sense for the translation can be resolved by the corresponding ray coefficients. When using quaternions, the negated quaternion represents the same rotation. This redundancy is resolved by simply choosing the first nonzero component to be positive. If there are five or fewer pairs of corresponding rays, the solution is minimally determined or is under-determined; hence, more than one solution exists, in general. Lastly, he discusses the issue of critical surfaces. If the imaged points lie on a critical surface then the error term varies non-linearly in parameter space. Hence, the least squares error varies non-quadratically. This makes the extremum difficult to locate, and in some cases the error does not vary in some direction in parameter space; thus, the solution is not unique.
Unfortunately, there is no method to detect, as a preprocessing step when solving for relative orientation, whether the imaged points lie on such a surface. [Kan88] proves the uniqueness of the solution for structure up to a reflection in a mirror orthogonal to the optical axis, i.e. because the projection is equivalent to an orthographic projection, the solution for the depths is unique up to a reflection in space. [Gen79] discusses the convergence of the Gauss least squares method. The special case of one camera in front of the other may lead to a non-unique solution. Also, if the depth of the observed point causes the image point to be near the epipolar end point (x0,y0), then the residuals are not approximately quadratic and the error estimates will tend to be very incorrect. [FaHu84] give a uniqueness proof for nonlinear parameter estimation for small rotation angles. The proof requires a minimum of nine corresponding points, not all on a quadratic surface passing through the viewpoint (they assume a static camera). They first prove the lemma that the mapping due to rotation and translation between camera frames is unique for every point in 3D object space. That is, no more than one mapping, a, exists between corresponding points which are related by a motion transformation with transformation parameters R and t. Informally, this just states that if a mapping is given by R and t, no other set of parameters R', t' can effect the same mapping (a is one to one and onto). The second lemma they prove states that, for five or more properly chosen point correspondences, the solution to the motion transformation equation is unique. Similar to the proof for the first lemma, they demonstrate that the system of equations has a full rank of five for the five unknowns; hence the solution is unique. Their first theorem proves that as θ (the amount of rotation) approaches zero, the least squares solution differs from the real motion parameters by O(θ²).
This result, again, is only valid for small rotation angles. Their second theorem deals with degenerate point configurations for nine points. If the nine points do not lie on a quadratic surface containing the viewpoint, then theorem one holds true. [YaMe86] illustrate their solution method with graphs of the error function whose minimum defines the solution for the motion parameters. Implicitly, this illustrates the uniqueness of their solution, as there is only one minimum, in general. 2.2.8 Parameter estimation error and robustness All of the algorithms discussed so far must at some time be tested with real (read noisy) images. Most analyses and algorithms have explicitly taken into account the effect of noise by including more than the minimum number of observations and solving the overdetermined system of equations by least squares or other similar techniques. In a few cases a sensitivity analysis of the algorithm to noise has been made [WeHuAh87][Triv87]. The analyses either make predictions of parameter estimate errors based on some noise model, or some estimate of the noise is made which can be used to minimize the error in the parameter estimates. Extended Kalman filtering has also been used to palliate the effects of noise and recover a noise estimate. Extended Kalman filtering has the drawback, however, of requiring an a priori noise model. A difficulty in assessing the robustness of the various algorithms has been the lack of standard error measures for the amount of noise present in the observations (in real and synthetic images). The noise measure that appears to be most appropriate is one given by Shariat and Price [ShPr85]. The measure is the ratio of the amount of noise to the average disparity in x and y (both measured in pixels). In addition to this is the lack of a standard error measure for the estimated parameters. Least squares techniques are attractive in that an estimate of the parameter variance can be computed [Gen79].
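The Shariat and Price noise measure just described is simple to compute. Below is a minimal sketch; the function name, and the summary of the noise by a single pixel amplitude, are assumptions of this illustration rather than the paper's notation:

```python
import numpy as np

def noise_measure(pts1, pts2, noise_pixels):
    """Ratio of image noise to average disparity, after Shariat and Price.

    pts1, pts2: (N, 2) arrays of corresponding image points, in pixels.
    noise_pixels: amount of noise, in pixels (assumed given).
    """
    mean_disparity = np.abs(pts2 - pts1).mean()  # mean |dx|, |dy| over all points
    return noise_pixels / mean_disparity

# Two pixels of noise against an average disparity of twenty pixels:
ratio = noise_measure(np.zeros((4, 2)), np.full((4, 2), 20.0), 2.0)
```

A small ratio indicates that the disparities carrying the motion information sit well above the noise floor.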
Regularization has been tried as a means of constraining the solution to provide a unique solution. [YaMe86] solve for rigid body motion using regularization. Their regularizing functional is the norm of the rotation vector. Their choice of the functional scaling factor is not theoretically derived, but appears to be empirical. They achieve marked improvement in their parameter estimation results for small numbers of corresponding points, approximately eight. They note that the error term they regularize is quite smooth for large numbers of corresponding points, i.e. the solution is highly overconstrained, and regularization adds little benefit in this case. A variety of error measures and noise measures have been used. For example, [MiSeAg86] compute noisy endpoints for lines by randomly perturbing line orientations with uniformly distributed numbers in the range ±√50 pixels. Their parameter estimation error is the orientation error of 3D lines. This is determined as a function of the precision of the initial guess. For a guess with 30% precision, the average error in orientation was approximately 12 degrees. [FaTo86] consider the difference in angle between calibrated and estimated translation. Rotation error is the difference between the estimated rotation and the value determined from the camera calibration. [FaLuTo87] test real and synthetic noisy images. Their real data input consists of 113 points. By the standards of this thesis, this is a very large data set; here, typically fewer than 50 points are used. [FaTo86] define estimation errors in terms of the angles between the desired axes of R and T and the estimated axes. They also define an absolute error for the amount of rotation about the axis. Synthetic noise levels are uniformly distributed in the range of one to five pixels. [TsHu84] give a few synthetic image test cases with and without noise.
Noise is reported as a relative percentage and parameter estimation errors are also reported as relative percentages. [Gen79] estimates the parameter values using Gauss least squares. The method produces a covariance matrix for the errors in the estimated parameters. The square root of the diagonal elements yields the standard deviation of the estimated errors. [FaHu84] give an order of magnitude error analysis for the motion estimates. They find experimentally that for large rotations their method failed (this is not unexpected, since the model they use assumes small rotations). They discuss the effect of image resolution on the parameter estimates. There are indications of a sensitivity to image noise and to large depth values. This follows, since for a given translation, the further away a point, the less disparity between two views. [RoAg80] generate noisy synthetic images by adding random uniformly distributed noise to the image measurements. This noise is distributed over the interval of plus or minus one to four pixels. By over-determining the solution (twelve to fifteen corresponding point pairs) they achieve accurate estimates. They found that two views of fewer than nine points lead to a numerically unstable solution for the object model. The rotation estimate did not improve appreciably when more corresponding points were included. The object model improved the most with more corresponding points. [MiSeAg88] compute the relative distance between points on imaged objects as a measure of the accuracy of their algorithm. Noise for synthetic images was from a Gaussian distribution with zero mean and a standard deviation of one half pixel. The parameter estimation error was defined to be the relative distance between scene points. Cumulative frequency of error plots show robust behavior. Some analyses are aimed directly at compensating for noise. For example, [WeHuAh87] perform a first order perturbation analysis of their algorithm.
Given a noise model, they determine the amount of error introduced into the motion estimates. The noise is assumed to be in the image measurements. Their analysis is confined to determining the amount of perturbation of an eigenvector of a matrix that has been computed from noisy image data. The eigenvector determines the rotation, thus only the rotation estimate benefits from this analysis. Their estimates for translation and depth come from least squares techniques with their own error analysis methods; they do not pursue the least squares error analysis. Trivedi gives a brief description of a method to compute relative orientation that incorporates an extra error term which compensates for errors in the image measurements [Triv87]. The error term is added to the image measurements such that, for a given estimate of the model parameters, the residual error in the model equations is reduced, subject to the constraint that the magnitude of the measurement correction term is minimized. The method is described as the "variational principle". [YaMe86] use exhaustive search of the translation parameter space and are able to compute the shape of the error function as a function of the translation. This is mostly for the benefit of illustrating the effects of regularization. 2.2.9 Multiple objects The ability to deal with multiple objects has not had much attention. The bulk of the work appears to have taken place using optical flow methods [Adiv85]. Jointed objects with rigid parts were analyzed by Webb and Aggarwal [WeAg81]. Huang discusses multiple objects briefly in an overview paper. Segmentation is mentioned as a key issue, and he offers a possible scenario for segmenting objects based upon his linear algorithm for relative orientation and depth [Huang87]. [TeWiKa88] discuss the application of their physical model based methods for analyzing images containing several deformable objects.
2.2.10 Perspective and orthographic projection Ullman used orthographic projection when he first introduced rigidity to solve the structure from motion problem [Ull79]. Some researchers still use orthographic projection since it simplifies some analyses [GrHi87][Kan88][WeAg81]. Orthographic projection, however, is not generally used, since central projection is considered to be a more accurate camera model. [WeAg81] assume orthographic projection to recover the motion and structure of an object. They assume that the object is a jointed collection of rigid parts. The orthographic assumption simplifies the projective geometry, e.g. the projection of two points rotating about a fixed axis lies on an ellipse. Under perspective projection, for sufficiently large motion, the projection would lie on a spiral, leading to a more complicated analysis. Note that depth is not recoverable from pure translation under orthographic projection. The vast majority of researchers, however, use perspective projection (also referred to as central projection). Perspective projection is considered the best geometrical description of the image formation process [RoAg80]. It should be pointed out that as the focal length of the camera becomes very large, or as the absolute depths of the imaged scene features become very large, the perspective projection process approaches an orthographic projection, i.e. the amount of perspective effect decreases. The accuracy of the structure estimation derived from perspective effects decreases as the perspective effects decrease, and the estimation also becomes more noise sensitive. These observations apply to the method developed in this thesis, since perspective projection is assumed. 2.3 Conclusion The question was, what is the best error term to minimize? The answer is still somewhat obscure. A brief description of the advantages and disadvantages of each method may be in order.
2.3.1 Conservation of distance or angular configuration Advantages Methods based on this condition use both point and line tokens. Experimental results appear to be reasonably good, at least in the case of point tokens [MiSeAg88]. Closed form solutions for some of the parameters are derived. Disadvantages Very few methods are based on this condition. This is probably due to the fact that numerous solution methods already exist, some in closed form. Nonlinear parameter estimation is required for some of the unknowns. Iterative solutions developed for methods in this category require a good initial guess for the unknowns. These methods appear to be somewhat sensitive to measurement error. 2.3.2 Coplanarity Advantages Many algorithms exist which are based on this condition. Linear formulations have been written, and many proofs for solution existence and uniqueness have been written for these linear methods. Disadvantages Linear methods appear to be noise sensitive. Reducing a wealth of data to eight parameters from which the motion is computed seems to be a large data reduction task with plenty of scope for error. Nonlinear methods based on this condition appear to offer no particular advantage over other nonlinear methods. In general, a good initial estimate of the unknowns is required. Horn's algorithm may prove to be the standard for coplanarity based nonlinear methods [Horn87]. The computational vision and photogrammetric communities eagerly await Horn's experimental results on real images. 2.3.3 Collinearity Advantages A linear method for line correspondences exists, which is based on this condition. However, like its eight parameter point counterpart, it appears to be sensitive to noise. Linearizing the nonlinear equations describing the collinearity condition is easier than for the coplanarity condition. Disadvantages Relatively few algorithms have been developed. Nonlinear formulations suffer from the usual drawbacks, i.e.
a requirement for a good initial estimate, a small range of convergence, the need for an appropriate parameterization, etcetera. In most cases, the formulations involve far more parameters than exist in the underlying physical model. 2.3.4 The best error term? With well established linear and nonlinear formulations, it is not hard to be convinced that the coplanarity condition establishes the standard by which all other algorithms should be judged. Of the algorithms reviewed, Horn's solution for relative orientation, based on the triple product, is probably the best nonlinear formulation. Gennery's work is interesting, since it is one of the first methods that made use of two key concepts central to the formulation developed here, namely, least squares parameter estimation and an error term based on the distance between a point and an image line [Gen79]. Lowe extended these ideas in his formulation for determining viewpoint for model based vision [Lowe87]. This thesis represents an extension of Gennery's and Lowe's methods within the framework associated with the collinearity condition. Furthermore, the line token based algorithm developed here represents the second line token based, nonlinear parameter estimation method derived from the collinearity condition. The first was Lowe's algorithm for solving for viewpoint and model parameters [Lowe87]. CHAPTER 3 3 Point correspondences 3.1 Mathematical derivation The solution for relative orientation and depth from point tokens is derived in this chapter. An algorithm is presented that implements the solution. The chapter opens with a description of the image geometry. A statement of the assumptions is followed by the derivation of the solution. 3.1.1 Description of the imaging model The standard perspective projection model is assumed. Let the 3D position of scene point p be (x,y,z)^T with respect to the camera centered coordinate system of camera 1 (see fig. 3-1).
The image plane coordinate, (u,v)^T, in camera 1 of the projection of this point is derived from similar triangles (see fig. 3-2):

(3.1)  u = -fx/z,  v = -fy/z

This equation also describes a well known condition in photogrammetry, the collinearity condition. It states simply that the focal point, image point and the scene point must all lie on the same 3D line. This is obvious from fig. 3-2. Therefore, given (u,v)^T for point p, the scene coordinate is given by

(3.2)  p = (x,y,z)^T = (-zu/f, -zv/f, z)^T

To determine where in the image of camera 2 the point p will project, a rotation R and a translation t are applied to map scene coordinates in camera 1's system into that of camera 2's. This is followed by the projection onto the image plane. The focal length f is given. Hence we have:

(3.3)  p'' = (x'',y'',z'')^T = Rp + t

This can be rewritten as

(3.3.1)  p'' = p' + t

where p' is defined by

(3.3.2)  p' = (x',y',z')^T = Rp

Fig. 3-1. Imaging model and camera coordinate systems. p = (x,y,z)^T is the coordinate of P with respect to camera 1's coordinate system C. p'' = (x'',y'',z'')^T is measured with respect to camera 2's coordinate system C''. The image plane origin is (0,0,-f) with respect to the camera-centered coordinate system. (u,v) is the image plane coordinate of P measured with respect to the image plane coordinate system c. (u'',v'') is the image plane coordinate of P'' measured with respect to image plane coordinate system c''.

Equation (3.3.2) will become useful when computing the partial derivatives of p'' with respect to the rotation parameters. The image coordinates of p'', (u'',v'')^T, are given by the standard central projection equations:

(3.4)  u'' = -fx''/z'',  v'' = -fy''/z''

3.1.2 What is sought Given two or more images of a scene, taken from different viewpoints, it is desired to estimate the rotation (3 degrees of freedom) and translation (3 degrees of freedom) that describe the change of basis between the successive viewpoints. It is assumed that the image features will be measured with respect to a camera centered coordinate frame.
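The imaging model of equations (3.1) through (3.4) can be written down directly. The following is a minimal sketch; the function names and the sample values of f, R, t and p are assumptions of this illustration, not part of the thesis:

```python
import numpy as np

f = 1.0  # focal length, assumed known (section 3.1.3)

def project(p):
    """(3.1)/(3.4): central projection; camera looks down -z, image plane at z = -f."""
    x, y, z = p
    return np.array([-f * x / z, -f * y / z])

def backproject(u, v, z):
    """(3.2): recover the scene point from its image coordinate and its depth z."""
    return np.array([-z * u / f, -z * v / f, z])

# (3.3): map camera 1 coordinates into camera 2's frame, then project.
R = np.eye(3)                      # sample rotation (identity, for illustration)
t = np.array([0.1, 0.0, 0.0])      # sample translation
p = np.array([1.0, 2.0, -4.0])     # a point in front of the camera (z < 0)

u, v = project(p)
assert np.allclose(backproject(u, v, p[2]), p)  # (3.1) and (3.2) are inverses
u2, v2 = project(R @ p + t)        # image coordinates (u'', v'') in camera 2
```

The round-trip assertion is exactly the collinearity condition: fixing the depth z pins down the unique scene point on the ray through the focal point and the image point.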
It is also desired to determine the depths of corresponding sets of imaged scene points (1 degree of freedom for each pair of corresponding points). These parameters are denoted by R, t, and z. These parameters can be solved for from measurements made on the intensity images of the scene. Photogrammetry provides only one source of information concerning camera motion and the relative depths of imaged scene surfaces. Other sources of information include, but are not restricted to, optic flow and shape-from-x, where x can be shading, texture or contour, for example. The solution derived here is based on a concept first developed in photogrammetry, the collinearity condition. 3.1.3 Assumptions The correspondence problem has been solved. The scene is static or the scene objects are rigid. The focal length is known. 3.1.4 Closed form solutions Closed form solutions that explicitly solve for R, t, and z are not known or do not exist. There are no general solution methods for solving nonlinear systems of equations in closed form. 3.1.5 Iterative solutions Iterative solutions for nonlinear parameter estimation are fraught with peril. In general it is very difficult to analytically prove that a given problem formulation satisfies the method's convergence criteria. Iterative solution methods can suffer from numerical instability, sensitivity to ill conditioning, and poor convergence when the current parameter estimates are far from the solution. The iterative method chosen here for the nonlinear parameter estimation is Newton's method. The rate of convergence of Newton's method is nominally quadratic near the solution. Also, Newton's method provides a flexible framework that enables reparameterizations for different forms of correspondence [Lowe87]. Fig. 3-2. Perspective projection model in the y-z plane. It is assumed that the camera is pointed in the -z direction. The image plane is at z = -f.
The projection center is coincident with the origin. From similar triangles, with respect to the point p = (x,y,z)^T, in the x-z plane, u/x = -f/z, thus u = -fx/z. Similarly, in the y-z plane, v = -fy/z. This figure also illustrates the collinearity condition.

3.1.5.1 Newton's method Newton's method offers perhaps one of the best rates of convergence in that it is nominally quadratic. However, it requires a good initial guess to achieve convergence. Gradient methods, although slower in convergence rate, especially near the solution, are guaranteed to converge to some minimum. Marquardt [Marq63] has proven that his formulation of damped least squares, a process that can also be considered an extension of Newton's method, offers an optimal trade off in performance between gradient descent and Newton's method. Thus, with this extension, Newton's method should perform well. The solution should converge nearly quadratically once near the global minimum, and otherwise is guaranteed to converge to some local minimum. In its general form, Newton's method for nonlinear systems of equations can be written as

(3.5)  x^k = G(x^(k-1)) = x^(k-1) - J(x^(k-1))^(-1) F(x^(k-1))

where x ∈ R^n, J(x) is the Jacobian matrix, and F(x) is the objective function such that F(x*) = 0, where x* is the fixed point of G [BurFai85]. Equation (3.5) describes the k-th iteration of the solution for the fixed point x* of G. The practical application of this method proceeds by finding a vector y which satisfies J(x^k) y = -F(x^k). The new approximation x^(k+1) is obtained by adding y to x^k. The algorithm continues to iterate until either ||y|| < TOL (some tolerance) or the iteration limit is exceeded. 3.1.5.1.1 Convergence criteria and rate of convergence Burden and Faires describe a sufficient set of criteria for which Newton's method will provably converge quadratically [BurFai85].
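The iteration of equation (3.5) can be sketched in a few lines. The toy two-equation system below is for illustration only, not the thesis camera model:

```python
import numpy as np

def newton(F, J, x, tol=1e-10, max_iter=50):
    """Newton's method (3.5): solve J(x) y = -F(x), then set x <- x + y,
    iterating until ||y|| < tol or the iteration limit is exceeded."""
    for _ in range(max_iter):
        y = np.linalg.solve(J(x), -F(x))
        x = x + y
        if np.linalg.norm(y) < tol:
            break
    return x

# Toy system: intersection of the circle x0^2 + x1^2 = 4 with the line x0 = x1.
F = lambda x: np.array([x[0] ** 2 + x[1] ** 2 - 4.0, x[0] - x[1]])
J = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])
root = newton(F, J, np.array([1.0, 0.5]))   # converges toward (sqrt(2), sqrt(2))
```

Solving the linear system J y = -F at each step, rather than explicitly inverting J, is the standard practical form of (3.5).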
The proof states that the method will converge quadratically provided ||x^0 - p|| < ξ, where ||·|| is the standard l2 norm, x^0 is the initial estimate of x, p is a solution of G(x) = x, and ξ defines a neighborhood of p satisfying various continuity, boundedness and gradient constraints. No proof is offered here that these conditions are satisfied in the general case. Note also that the algorithm presented here relies on the Marquardt damped least squares formulation, with its own attendant proofs for convergence and rate of convergence (see [More77]). 3.1.5.1.2 Application of Newton's method The application of Newton's method to this problem requires determining the partial derivatives of u and v with respect to each of the image model parameters. This is straightforward in the case of translation and depth, but it is not the case for rotation. Rotation is particularly difficult in that there are numerous ways of expressing the rotation in terms of its underlying parameters. The parameterization of the rotation adopted here is that of Lowe [Lowe87]. This parameterization offers simplicity in that only three parameters are required, thus saving computational resources and minimizing the size of the solution space. It also leads to very simple partial differentiation of u and v with respect to these parameters. The parameters are φx, φy and φz. They represent incremental rotations about each of the coordinate axes. These rotations are mutually orthogonal and are approximately independent for small rotation angles. R is maintained as a three by three matrix and is updated after each iteration by prefix multiplication with the 3x3 correction matrices Rx, Ry and Rz. Rx, Ry, and Rz are functions of φx, φy and φz respectively. The partials of u and v with respect to z depend on R, and the representation of R in this form facilitates their calculation.
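The update scheme just described can be sketched as follows. The composition order of the three small correction matrices is immaterial to first order for small angles, so the particular order below is an assumption of this illustration:

```python
import numpy as np

def rot_x(phi):
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(phi):
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(phi):
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def update_rotation(R, phi_x, phi_y, phi_z):
    """Prefix-multiply the current 3x3 estimate R by the small corrections."""
    return rot_z(phi_z) @ rot_y(phi_y) @ rot_x(phi_x) @ R

# R stays orthonormal after each update, since every factor is orthonormal.
R = update_rotation(np.eye(3), 0.01, -0.02, 0.005)
```

Because R itself is carried as a full matrix and only the three incremental angles are estimated per iteration, the parameterization never leaves the rotation group and avoids the singularities of global three-angle representations.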
3.1.6 Least squares By including more observations, or reducing the number of parameters to be adjusted, an overconstrained set of equations can be written. Such a system can be solved by formulating the normal equations for a least squares solution. Following Press, Flannery et al. [PrFlTe88], the least squares merit function, χ², to be minimized is given by

(3.6)  χ² = Σ_{i=1..N} [ (y_i - y(x_i; a)) / σ_i ]²

where a is the M×1 parameter vector, y_i is the observation (or measurement), y(x_i; a) is the measurement predicted from the model equation, x_i is the measurement which is mapped to y_i by the model, and σ_i is the standard deviation associated with measurement y_i. Sufficiently close to the minimum, it may be reasonable to approximate χ² by its Taylor series expansion truncated to its quadratic form. Therefore

(3.7)  χ²(a) ≈ γ - d·a + (1/2) a·D·a

where γ is the value of χ²(a_0), the merit function evaluated at the point a_0 about which the expansion is taking place, d is defined to be an M×1 vector and is the negative gradient of χ² evaluated at a_0, and D is an M×M matrix and is the Hessian matrix of χ² evaluated at a_0. Now it is known that if the approximation is good enough, then a_min can be reached in a single step by

(3.8)  a_min = a_cur + D^(-1) (-∇χ²(a_cur))

The gradient of χ², which is zero at the minimum, has the following terms:

(3.9)  ∂χ²/∂a_k = -2 Σ_{i=1..N} [ (y_i - y(x_i; a)) / σ_i² ] ∂y(x_i; a)/∂a_k,   k = 1, 2, ..., M

Define β_k = -(1/2) ∂χ²/∂a_k. Taking an additional partial derivative gives

(3.10)  ∂²χ²/∂a_k ∂a_l = 2 Σ_{i=1..N} (1/σ_i²) [ (∂y_i/∂a_k)(∂y_i/∂a_l) - (y_i - y(x_i; a)) ∂²y_i/∂a_k ∂a_l ],   k, l = 1, 2, ..., M

Define α_kl = (1/2) ∂²χ²/∂a_k ∂a_l. Thus α = (1/2) D in equation (3.8). Equation (3.8) can be rewritten as the linear system of equations

(3.11)  Σ_{l=1..M} α_kl δa_l = β_k

This set of equations is solved for δa_l, the incremental corrections which are added to the current estimate of a. At this point it must be noted that α_kl will be defined to only include the first partials and ignore the second order partials.
Thus

(3.12)  α_kl = Σ_{i=1..N} (1/σ_i²) (∂y_i/∂a_k)(∂y_i/∂a_l)

Briefly, the reason is that the second order terms are negligibly small in practice: if the model is a good one, multiplication by (y_i - y(x_i; a)) in (3.10) and summing over i can be seen as multiplying by the random measurement noise. The sign of this term can be either positive or negative and should not be correlated with the model, so these terms should cancel each other out and be negligibly small relative to the first order terms. Note that redefining α_kl does not change the parameter values at the minimum, but only the path taken to reach them. This follows from the independence of α_kl from the condition that β_k = 0 for all k at the χ² minimum. The value of σ_i, if not known, is set to 1. 3.1.6.1 Damped least squares 3.1.6.1.1 Levenberg-Marquardt Levenberg [Lev44] introduces the concept of damping the increments to the residual function parameters when performing least squares iteration. That is, due to the first order approximations of the nonlinear normal equations, the increments added to the parameters may be quite inappropriate in an absolute sense, i.e. one must remain in a neighborhood where the linearization of the problem adequately approximates the nonlinear objective function. The idea then is to scale these increments to a more appropriate value. The least squares problem becomes one of minimizing the sum of the square of the residuals plus the sum of the square of the increments. The important factor here is the relative weighting between the residuals and the increments. What is not addressed adequately in [Lev44] is how to determine this relative weighting. He instead relies on a trial and error approach to determining its value. Marquardt [Marq63] independently arrived at essentially the same damping formulation as Levenberg.
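The quantities β_k (3.9) and the first-order α_kl (3.12) reduce to weighted matrix products over the per-observation partials. Below is a minimal sketch; the function and variable names are this illustration's, and the example model is linear in the parameters so that a single solve of (3.11) reaches the minimum exactly:

```python
import numpy as np

def normal_equations(jac, resid, sigma):
    """First-order alpha (3.12) and beta (3.9).

    jac:   (N, M) partials dy_i/da_k of the predicted measurements
    resid: (N,)   residuals y_i - y(x_i; a)
    sigma: (N,)   standard deviations (set to 1 when unknown)
    """
    w = 1.0 / sigma ** 2
    alpha = (jac * w[:, None]).T @ jac    # sum_i (1/s_i^2) dy_i/da_k dy_i/da_l
    beta = (jac * w[:, None]).T @ resid   # sum_i (1/s_i^2) r_i dy_i/da_k
    return alpha, beta

# Example model y = a0 + a1 x, fitted from a zero initial estimate.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 0.5 * x
a = np.zeros(2)
jac = np.stack([np.ones_like(x), x], axis=1)        # dy/da0, dy/da1
alpha, beta = normal_equations(jac, y - jac @ a, np.ones_like(x))
a = a + np.linalg.solve(alpha, beta)                # delta-a added to a
```

For the nonlinear camera model, the same solve is simply repeated, with jac and resid re-evaluated at each new estimate.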
He proves that as the relative weighting term described above is varied from zero to infinity, the correction vector varies monotonically between the vector defined by the Taylor series approximation and the vector defined by the negative of the gradient. The algorithm Marquardt proposes is based on a trial and error search for an appropriate value of the damping factor at each iteration of the least squares method. At each iteration the value of the damping factor is adjusted from its previous value so that the residuals are monotonically decreasing. Thus this algorithm shares the properties of gradient descent methods with those of Taylor series methods. More reformulates the Levenberg-Marquardt algorithm, providing a more rigorous theoretical analysis of the damped least squares method [More77]. The algorithm incorporates two enhancements to the original algorithm. The first is a more precisely defined method for determining the step bound of the parameter correction, and the second is a better treatment of how to determine the Levenberg-Marquardt damping parameter. He also gives a brief description of what an appropriate scaling for the parameter correction vector should be. The step bound of the correction vector is updated by maintaining the residual reduction at some reasonable level. Criteria for increasing the step bound are not rigorous, and multiplying by some value greater than one is sufficient. The criterion for decreasing the step bound is more explicit. The step bound is scaled by a factor which minimizes the residuals in a small neighborhood around the current parameter estimate. This scale factor is defined to be in the range (1/10, 1/2) and is set to one of the end points if its computed value is outside this range. The Levenberg-Marquardt parameter is chosen to satisfy a tolerance constraint on the size of the next correction step. That is, the size of the correction vector must be within some tolerance of the desired step size.
A relation is defined that describes this tolerance, which is a function of the Levenberg-Marquardt parameter. The value of the damping parameter which minimizes this tolerance relation can be found via an iterative scheme. It is claimed that this scheme typically converges within one or two iterations, depending on the size of the tolerance allowed on the correction step. The constant of proportionality between the correction vector and the sum vector in the least squares system of linear equations is determined from an examination of how the system was derived. It follows that this scaling is proportional to the diagonal elements of the least squares coefficient matrix. More observes that if one of the diagonal elements increases, then the correction for this parameter is probably becoming unreliable, and therefore the correction step size in this direction should be decreased. He incorporates this observation into his algorithm for maintaining an appropriate scaling of the correction vector. Needless to say, these enhancements to the basic algorithm are costly in terms of additional computation. Most notable is the computation of the QR decomposition of the least squares normal equations, which is said to improve the algorithm's robustness. Additional computation is incurred to find the new step bound and damping parameter. The version of the Levenberg-Marquardt algorithm that forms the basis of the algorithm adopted here is due to Press, Flannery, et al. [PrFlTe88]. The damping is computed by multiplying the diagonal elements of the least squares coefficient matrix by (1 + λ). That is

(3.13)  α'_jj = α_jj (1 + λ),   α'_jk = α_jk  (j ≠ k)

Rewrite (3.11) as

(3.14)  Σ_{l=1..M} α'_kl δa_l = β_k

The Levenberg-Marquardt algorithm proceeds as follows. 1. Compute the merit function χ²(a). 2. Set λ to 0.0001 (this value is somewhat arbitrary, but should not be too large). 3. Solve the linear equations (3.14) for δa and evaluate χ²(a + δa). 4. If χ²(a + δa) >= χ²(a), then increase λ
by a factor of 10 and return to step 3. 5. If χ²(a + δa) < χ²(a), then decrease λ by a factor of 10, assign (a + δa) to a, and return to step 3. Due to round off error and poor conditioning of the minimum, it is possible that the residuals will not decrease substantially on each iteration near the minimum. A test is suggested that will cause the algorithm to terminate if such a condition is detected. The test simply could be to keep track of the residual reduction and halt if its relative decline is less than, say, 10^(-3) over two or three consecutive iterations. 3.2 Solving for relative orientation and depth The approach used to solve for the image model parameters is based on minimizing the error between the actual location of a scene feature in an image and its predicted position. The position of a scene point in an image is decomposed into two components, u and v. u and v are nonlinear functions of the image model parameter vector a. The problem can be linearized and solved iteratively as follows. The functions u and v are expanded into a first order Taylor series. In its general form, the Taylor expansion of u is

(3.15)  u(a) = P_n(a) + R_n(a)

The first order expansion is

(3.16)  u(a) = P_1(a) + R_1(a) = u(a^(0)) + u'(a^(0)) · (a - a^(0)) + R_1(a)

where a^(0) is the point about which the expansion is taking place. Now define

(3.17)  u'(a^(0)) = (∂u/∂a_1, ∂u/∂a_2, ..., ∂u/∂a_n)^T, evaluated at a^(0)

and

(3.18)  Δa = (a - a^(0)) = (Δa_1, Δa_2, ..., Δa_n)^T

Now define E_u as

(3.19)  E_u = u(a) - u(a^(0)) ≈ u'(a^(0)) · Δa = (∂u/∂a_1)Δa_1 + (∂u/∂a_2)Δa_2 + ... + (∂u/∂a_n)Δa_n

The derivation for E_v is identical and yields

(3.20)  E_v = (∂v/∂a_1)Δa_1 + (∂v/∂a_2)Δa_2 + ... + (∂v/∂a_n)Δa_n

R_1(a) is assumed to be vanishingly small for a^(0) sufficiently close to a. E_u and E_v are the differences in the u and v directions, respectively, between u(a) and v(a), which are measured in the image plane, and u(a^(n)) and v(a^(n)), which are the values predicted from the model after n iterations.
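The λ-adjustment loop of steps 1 to 5, applied to least squares normal equations built from linearized residuals, can be sketched on a small curve-fitting problem. The exponential example, the function names, and the fixed iteration count are assumptions of this illustration, not the thesis camera model:

```python
import numpy as np

def lm_fit(x, y, model, dmodel, a, lam=1e-4, iters=60):
    """Levenberg-Marquardt, steps 1-5: damp the diagonal of alpha by (1 + lambda),
    solve for delta-a, and accept or reject the step by comparing chi-squared."""
    chi2 = lambda a: float(np.sum((y - model(x, a)) ** 2))
    for _ in range(iters):
        jac = dmodel(x, a)                                 # (N, M) partials dy/da
        resid = y - model(x, a)
        alpha = jac.T @ jac
        beta = jac.T @ resid
        alpha[np.diag_indices_from(alpha)] *= 1.0 + lam    # damping, as in (3.13)
        da = np.linalg.solve(alpha, beta)                  # solve (3.14)
        if chi2(a + da) >= chi2(a):
            lam *= 10.0            # step rejected: lean toward gradient descent
        else:
            lam /= 10.0            # step accepted: lean toward Newton's method
            a = a + da
    return a

# Fit y = exp(-k x) to noise-free samples generated with k = 1.3.
x = np.linspace(0.0, 2.0, 20)
y = np.exp(-1.3 * x)
model = lambda x, a: np.exp(-a[0] * x)
dmodel = lambda x, a: (-x * np.exp(-a[0] * x))[:, None]    # dy/dk
k = lm_fit(x, y, model, dmodel, np.array([0.5]))
```

The same accept/reject rule carries over unchanged when the residuals are the E_u, E_v error terms of the relative orientation problem.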
Define e = (E_u1, E_u2, E_u3, ..., E_uk, E_v1, ..., E_vk)^T, where k is the number of matched points. Then e(a) = 0 is the objective function whose solution a* is desired. When the number of equations is greater than the number of unknowns (i.e. the system is overdetermined), the sum of the squares of the residuals e(a) is minimized with respect to a. From (3.19) and (3.20), e(a) = J(a)Δa, where

J(a) = (u'_1^T, u'_2^T, ..., u'_k^T, v'_1^T, v'_2^T, ..., v'_k^T)^T

is the Jacobian matrix and Δa is the vector of corrections to a^(0). The linear system e(a) = J(a)Δa is iteratively solved for Δa, and these corrections are added to a. This linearization process is equivalent to Newton's method. Camera model The camera model equations are repeated here. The image coordinate, (u,v)^T, of the scene point p is given by the standard central projection equation. Therefore, given (u,v)^T for point p, the scene coordinates are given by

p = (x,y,z)^T = (-zu/f, -zv/f, z)^T

The coordinate transformation between two views is given by

p'' = (x'',y'',z'')^T = Rp + t

This can be rewritten as p'' = p' + t, where p' is defined by

p' = (x',y',z')^T = Rp

From the perspective projection model, the image coordinates u'' and v'' in the second camera centered coordinate system are given by

u'' = -fx''/z'',  v'' = -fy''/z''

The first step in applying Newton's method is to determine the partial derivatives of u'' and v'' with respect to each of the independent variables. u'' and v'' are the image coordinates with respect to camera 2's coordinate frame (see fig. 3-1). The unknown parameters are the depth values of the object points, z, the translation vector, t = (T_x, T_y, T_z)^T, and the rotation, R. This standard form for the central projection equation does not lend itself to an easy calculation of the partial derivatives.
Determining the partial derivatives with respect to rotation poses particular difficulties: rotation has only three underlying parameters, yet any formulation with a rotation operator must necessarily be expressed with more than three terms, which in almost all cases become the rotation parameters (e.g. the nine elements of an orthonormal rotation matrix). Furthermore, the formulation makes no suggestion as to how the rotation should be represented in terms of its underlying parameters [Lowe87]. Newton's method does not require that the partial derivatives be derivable with respect to the model parameters. What is required is to be able to adjust the model parameters with respect to a set of parameters whose partials can be written. In the case of rotation, this can be accomplished by defining an infinitesimal correction to the rotation matrix R. This infinitesimal correction will be denoted by r, r = (φ_x, φ_y, φ_z)ᵀ. The elements of r represent rotations about each of the basis vectors in R³ and are approximately independent for small angles. The partial derivatives of the model equations with respect to these rotation parameters turn out to be quite simple.

In the course of applying Newton's method, the rotation estimate is updated by composing R_k with R_x(φ_x), R_y(φ_y), and R_z(φ_z). R_k is the current estimate for the rotation component between the two camera stations (maintained as a 3x3 orthonormal matrix). The components of r, r = (φ_x, φ_y, φ_z)ᵀ, are the rotation arguments for R_x, R_y, and R_z respectively.

          φ_x     φ_y     φ_z
   x′      0       z′     −y′
   y′     −z′      0       x′
   z′      y′     −x′      0

Fig. 3-3. The partial derivatives of x′, y′, and z′ with respect to rotation. The partials are with respect to the rotation parameters φ_x, φ_y, and φ_z. The rotations are counterclockwise (measured in radians) about the coordinate axes as viewed toward the origin.

Partial derivatives of u″

We have for u″

(3.21) ∂u″/∂z = (−f/(z″)²) (z″ ∂x′/∂z − x″ ∂z′/∂z)

where ∂x′/∂z and ∂z′/∂z are determined as follows:

(3.22) ∂p′/∂z = ∂(x′,y′,z′)ᵀ/∂z = ∂(Rp)/∂z = R (−u/f, −v/f, 1)ᵀ

The partials of u″ with respect to the rotation parameters are

∂u″/∂φ_x = (−f/(z″)²) (z″ ∂x″/∂φ_x − x″ ∂z″/∂φ_x)

But ∂x″/∂φ_x = ∂x′/∂φ_x and ∂z″/∂φ_x = ∂z′/∂φ_x. Therefore

(3.23) ∂u″/∂φ_x = f x″ y′ / (z″)²

since ∂x′/∂φ_x = 0 and ∂z′/∂φ_x = y′ from figure 3-3. Similarly, ∂x″/∂φ_y = ∂x′/∂φ_y and ∂z″/∂φ_y = ∂z′/∂φ_y. Therefore

(3.24) ∂u″/∂φ_y = −f (z″ z′ + x″ x′) / (z″)²

since ∂x′/∂φ_y = z′ and ∂z′/∂φ_y = −x′ from figure 3-3. Likewise ∂x″/∂φ_z = ∂x′/∂φ_z and ∂z″/∂φ_z = ∂z′/∂φ_z. Therefore

(3.25) ∂u″/∂φ_z = f y′ / z″

since ∂x′/∂φ_z = −y′ and ∂z′/∂φ_z = 0 from figure 3-3.

The partials of u″ with respect to the translation parameters are now derived. The partial derivative of u″ with respect to T_x is given by

∂u″/∂T_x = (−f/(z″)²) (z″ ∂x″/∂T_x − x″ ∂z″/∂T_x)

Therefore

(3.26) ∂u″/∂T_x = −f / z″

since ∂x″/∂T_x = 1 and ∂z″/∂T_x = 0.

(3.27) ∂u″/∂T_y = 0

since ∂x″/∂T_y = 0 and ∂z″/∂T_y = 0.

(3.28) ∂u″/∂T_z = f x″ / (z″)²

since ∂x″/∂T_z = 0 and ∂z″/∂T_z = 1.

Partial derivatives of v″

The partial derivative with respect to depth is

(3.29) ∂v″/∂z = (−f/(z″)²) (z″ ∂y′/∂z − y″ ∂z′/∂z)

where ∂y′/∂z and ∂z′/∂z are determined from equation (3.22). The partials of v″ with respect to the rotation parameters now follow. The partial derivative with respect to φ_x is

(3.30) ∂v″/∂φ_x = f (z″ z′ + y″ y′) / (z″)²

since ∂y′/∂φ_x = −z′ and ∂z′/∂φ_x = y′ from figure 3-3. The partial derivative with respect to φ_y is

(3.31) ∂v″/∂φ_y = −f x′ y″ / (z″)²

since ∂y′/∂φ_y = 0 and ∂z′/∂φ_y = −x′. The partial with respect to φ_z is

(3.32) ∂v″/∂φ_z = −f x′ / z″

since ∂y′/∂φ_z = x′ and ∂z′/∂φ_z = 0.

The partials of v″ with respect to the translation parameters are

(3.33) ∂v″/∂T_x = 0, since ∂y″/∂T_x = 0 and ∂z″/∂T_x = 0

(3.34) ∂v″/∂T_y = −f / z″, since ∂y″/∂T_y = 1 and ∂z″/∂T_y = 0

(3.35) ∂v″/∂T_z = f y″ / (z″)², since ∂y″/∂T_z = 0 and ∂z″/∂T_z = 1

Newton's method computes a vector of corrections which is used to update the current estimate of α. Let this vector of corrections be

H = (Δz_1, Δz_2, ..., Δz_n, ΔT_x, ΔT_y, ΔT_z, Δφ_x, Δφ_y, Δφ_z)ᵀ

The linear system can be written as

(3.36) J(α) H = e

where J(α) is the Jacobian matrix. The rows of J are indexed by the measurements u″_1, ..., u″_n, v″_1, ..., v″_n and the columns by the parameters z_1, ..., z_n, T_x, T_y, T_z, φ_x, φ_y, φ_z. The row for u″_i is

( 0 ... ∂u″_i/∂z_i ... 0   ∂u″_i/∂T_x  ∂u″_i/∂T_y  ∂u″_i/∂T_z  ∂u″_i/∂φ_x  ∂u″_i/∂φ_y  ∂u″_i/∂φ_z )

with ∂u″_i/∂z_i in depth column i; the depth columns are zero except for the one belonging to the measured point. The rows for v″_i have the same structure. H is as above and e is the vector of errors E_u and E_v. E_u and E_v are computed as the difference between the predicted position of the image points and the observed values. The predicted position is computed from (3.4) using the current estimate of α. Newton's method proceeds by solving for H using, for example, Gaussian elimination. α_(n+1) is the newest estimate of α, where

(3.37) α_(n+1) = α_n + H

Iteration continues until ‖H‖ < tolerance or the iteration limit is exceeded.
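The analytic partials can be spot-checked against central finite differences. The sketch below (my own; assumes numpy) verifies the rotation partial ∂u″/∂φ_x = f x″ y′ / (z″)² for one configuration by perturbing an incremental rotation about the x axis:

```python
import numpy as np

f = 1.0

def rot_x(a):
    """Rotation by angle a about the x axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def u_pp(phi_x, R, p, t):
    """u'' as a function of an incremental rotation composed with R."""
    x, _, z = rot_x(phi_x) @ R @ p + t
    return -f * x / z

R = np.eye(3)
p = np.array([1.0, -0.5, -5.0])   # scene point in the negative z half space
t = np.array([0.3, 0.1, -1.0])

xp, yp, zp = R @ p                # p'  = R p
xpp, ypp, zpp = R @ p + t         # p'' = p' + t

analytic = f * xpp * yp / zpp**2  # analytic rotation partial
h = 1e-6
numeric = (u_pp(h, R, p, t) - u_pp(-h, R, p, t)) / (2 * h)
```

The same pattern extends to the other rotation and translation partials; such checks are a cheap guard against sign errors in hand-derived Jacobians.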
Least squares solution

The least squares normal equations are given by pre-multiplying both sides of (3.36) by Jᵀ, yielding

(3.38) JᵀJ(α) H = Jᵀe

This formulation is equivalent to (3.11).

Damped least squares

The system of normal equations solved here is given by

(3.39) (JᵀJ + λ diag(JᵀJ)) H = Jᵀe

where diag(JᵀJ) is the matrix JᵀJ with all the non-diagonal elements set to zero. H can be determined by Gaussian elimination, for example. α_(n+1) is the newest estimate of α, where α_(n+1) is given by (3.37). Iteration continues until ‖H‖ < tolerance, or the iteration limit is exceeded.

Solution existence and uniqueness

Each pair of corresponding image points gives rise to two measurements and one extra degree of freedom z (its depth value). The number of image model parameters, then, is three for translation, three for rotation, and one for each pair of corresponding image points. Hence, to solve for these parameters exactly, six pairs of corresponding image points are required, since six pairs of points yield 12 measurements and there are 12 unknowns. However, since the translation and depth can only be determined up to a global scale factor, there is in fact one less degree of freedom.

Fixing the scale of the solution

The number of degrees of freedom is actually one less than described above. This is due to the perspective projection process, which is invariant with respect to an equal scaling in translation and depth [Huang87]. For example, given a point at some fixed distance from the reference camera position and a translation between two viewpoints, the scene point will project onto the same image point if the translation is doubled and the depth is doubled. To reduce the number of degrees of freedom it is convenient to fix the global scale, either by fixing one of the depth parameters or by constraining the translation to have unit norm.
Fixing one of the depth values to some appropriate negative value has the advantage of helping to constrain the solution to lie in the negative z half space, which is appropriate for the model presented here. For the case of six matched points, fixing one of the depth values reduces the number of unknowns to eleven. This leads to an overdetermined set of equations, which is solved using least squares techniques. The minimum number of points required for a unique solution, up to a global scale factor, is now five if one of the depths is fixed. The model equations are nonlinear, hence in general more than one solution is possible. Empirical evidence suggests that, for the problem posed here, the solution is unique if the equations are sufficiently overdetermined [Horn87]. Monte Carlo simulations, conducted as a part of this work, support this claim.

3.3 The algorithm

A pseudo code version of the algorithm now follows.

1. Initialize. λ = 0.0001
2. From two images determine n ≥ 5 pairs of corresponding points. That is, each pair of points corresponds to the same physical point being imaged. This is the essence of the correspondence problem. Measure (u_i, v_i) and (u_i, v_i)″. Make an initial estimate of α. Fix one of the depth values.
3. while ‖H‖ > tolerance and number of iterations < iteration limit {
4.   Compute e = (e_uᵀ, e_vᵀ)ᵀ where e_ui = u″_i − u_i(α) and e_vi = v″_i − v_i(α)
5.   Compute J from equations (3.21) to (3.35). Note that one of the depths has been fixed.
6.   Compute ‖e‖. ‖e‖_old ← ‖e‖.
7.   Compute JᵀJ and Jᵀe. (JᵀJ)_0 ← JᵀJ. (Jᵀe)_0 ← Jᵀe.
8.   JᵀJ ← JᵀJ + λ diag(JᵀJ)
9.   while ‖e‖ >= ‖e‖_old {
10.    solve JᵀJ H = Jᵀe
11.    α′ ← α + H
12.    compute ‖e‖
13.    if ‖e‖ >= ‖e‖_old {
14.      λ ← λ × 10
15.      JᵀJ ← (JᵀJ)_0 + λ diag((JᵀJ)_0)
16.      Jᵀe ← (Jᵀe)_0
       }
     }
17.  λ ← λ / 10
18.  α ← α′
19.  compute ‖H‖
20.  increment number of iterations
   }
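The pseudo code above can be rendered compactly in Python. This is a sketch, not the thesis implementation: `residual` and `jacobian` are assumed callbacks returning e and J at the current estimate, and the safeguards discussed later (λ reset, a floor on pivot elements) are omitted:

```python
import numpy as np

def lm_solve(residual, jacobian, a, tol=1e-2, max_iter=50, lam=1e-4):
    """Levenberg-Marquardt iteration mirroring steps 3-20 of the pseudo code."""
    for _ in range(max_iter):
        e = residual(a)
        J = jacobian(a)
        err_old = np.linalg.norm(e)
        JtJ0, Jte0 = J.T @ J, J.T @ e          # saved undamped system
        while True:
            A = JtJ0 + lam * np.diag(np.diag(JtJ0))
            H = np.linalg.solve(A, Jte0)       # solve the damped normal equations
            a_new = a + H
            if np.linalg.norm(residual(a_new)) < err_old:
                break                          # step accepted
            lam *= 10.0                        # step rejected: increase damping
        lam /= 10.0                            # relax damping after acceptance
        a = a_new
        if np.linalg.norm(H) < tol:            # correction is negligible: done
            break
    return a
```

On a linear test problem the loop converges in essentially one accepted step, since the first-order model is then exact.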
CHAPTER 4

4 Line correspondences

The use of corresponding line tokens has been widely investigated for solving the problem of relative orientation and depth (or structure from motion) [YeHu83] [LiHu86] [MiSeAg86] [SpAl87] [Huang87] [FaLuTo87] [FaLu87] [Kan88]. Finding lines is easier because low level vision routines are better at determining the transverse locations of lines, though they are less accurate at determining the location of line terminations [Lowe87] [Huang87] [SpAl87]. The solution for relative orientation and depth is under-constrained in the two view situation [Huang87] [MiSeAg86] [FaLuTo87]. Arguments similar to those contained in the references will be made with regard to the algorithm developed in this thesis. A minimum of three views is required to permit a unique solution up to a global scale factor.

4.1 Computational model

The computational model for point correspondences is easily extended to line correspondences. Although not developed here, the model can be modified to handle point and line correspondences within the same computation. The error term will be expressed as the difference in distance between an observed image line and the line predicted from the estimated motion and depth parameters applied to the corresponding line in the reference image. This distance error will be measured perpendicularly from each end-point of the predicted line to the observed line. (This is a relative measurement and could be written as the distance from the end-point of the observed line to the predicted line.) Hence, each corresponding line will yield two equations per non-reference frame. (In a sequence of images there will be one frame referred to as the reference frame, with the rest referred to as non-reference frames.) Although no formal proof is presented, an informal derivation of the number of line correspondences and views necessary for a unique solution (up to a global scale factor) is offered.
Each corresponding line yields two equations per non-reference frame. The number of unknowns is three for rotation and three for translation for each non-reference frame. There are two depth unknowns for each line. Therefore, there are 6k + 2n unknowns for k non-reference frames and n lines. Each line yields two equations per non-reference frame. Therefore, 6k + 2n ≤ 2nk, i.e. the number of unknowns must be less than or equal to the number of constraints. The first positive integer solution is n = 6 and k = 2. Six corresponding sets of lines are required over two non-reference frames, i.e. a total of three frames. Now in practice, one of the depth values is fixed to fix the scale of the solution. Thus, five and one half lines are required over a total of 3 frames. Six lines over-constrains the solution, and least squares is used to solve the system of equations. The partial derivatives of the distance error term with respect to the model parameters are taken, rather than the partial derivatives of u and v, as is the case for point correspondences.

4.1.1 Error term derivation

The distance d of the line ax + by + c = 0 from the origin is given by |c| / √(a² + b²) = d. Therefore |c| = d√(a² + b²). Substitution into ax + by + c = 0 gives ax + by ± d√(a² + b²) = 0. Therefore,

(4.1) ax/√(a² + b²) + by/√(a² + b²) ± d = 0

Now the slope m is given by m = −a/b. Substituting into (4.1) and factoring out b in the denominator gives

(4.2) −mx/√(m² + 1) + y/√(m² + 1) ± d = 0

Rewrite this as Ax + By ± d = 0. Given two points (x_1,y_1) and (x_2,y_2), the slope is given by m = (y_2 − y_1)/(x_2 − x_1). However, if x_2 − x_1 is approximately zero, then redefine m as m = (x_2 − x_1)/(y_2 − y_1); the slope of ax + by + c = 0 is then given by m = −b/a. Substituting into (4.1) and factoring out a in the denominator gives

(4.3) x/√(m² + 1) − my/√(m² + 1) ± d = 0

This can be written as A′x + B′y ± d = 0 where A′ = B and B′ = A from (4.2).
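The two slope cases (4.2) and (4.3) can be folded into one routine that returns unit norm coefficients, so that Ax + By + c evaluates directly to a perpendicular distance. A sketch (function names are mine):

```python
from math import hypot

def line_coefficients(p1, p2):
    """Unit-norm coefficients (A, B) of the line through p1 and p2,
    choosing the better-conditioned slope definition."""
    (x1, y1), (x2, y2) = p1, p2
    if abs(x2 - x1) >= abs(y2 - y1):
        m = (y2 - y1) / (x2 - x1)          # ordinary slope, as in (4.2)
        return -m / hypot(m, 1.0), 1.0 / hypot(m, 1.0)
    m = (x2 - x1) / (y2 - y1)              # inverted slope, as in (4.3)
    return 1.0 / hypot(m, 1.0), -m / hypot(m, 1.0)

def distance_to_line(A, B, p_on_line, q):
    """Perpendicular distance of point q from the line with unit normal (A, B)
    passing through p_on_line; since A^2 + B^2 = 1, no renormalization is needed."""
    c = -(A * p_on_line[0] + B * p_on_line[1])
    return A * q[0] + B * q[1] + c
```

Because the coefficients are normalized, the returned value is a signed distance; its magnitude is the perpendicular distance used in the error term.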
The distance of a line from the image plane origin, d″, is given by

(4.4) d″ = Au″ + Bv″

Calculating the partial derivatives

The partial derivative of d″ with respect to each model parameter a_j follows from (4.4) by the chain rule:

∂d″/∂a_j = A ∂u″/∂a_j + B ∂v″/∂a_j

For the translation parameters, for example,

∂d″/∂T_x = A ∂u″/∂T_x = A(−f/z″), since ∂v″/∂T_x = 0

∂d″/∂T_y = B ∂v″/∂T_y = B(−f/z″), since ∂u″/∂T_y = 0

(4.12) ∂d″/∂T_z = A ∂u″/∂T_z + B ∂v″/∂T_z = (A x″ + B y″) f / (z″)²

Computing the line coefficients

If the slope m is defined by m = (y_2 − y_1)/(x_2 − x_1) then

(4.13) A = −m/√(m² + 1) and B = 1/√(m² + 1)

However, if x_2 − x_1 is vanishingly small, then let m = (x_2 − x_1)/(y_2 − y_1). Then

(4.14) A = 1/√(m² + 1) and B = −m/√(m² + 1)

In practice, (4.14) is used if |x_2 − x_1| < |y_2 − y_1|.

Error term definition

The error term is

(4.15) e = d″ − d″_e = Au″ + Bv″ − (Au″_e + Bv″_e) = A(u″ − u″_e) + B(v″ − v″_e)

The error is depicted in fig. 4-1.

Fig. 4-1. Error term for line correspondences. This figure depicts the error term for line correspondences. The lines are in a non-reference image plane with origin (0,0). (x″,y″)_e1 and (x″,y″)_e2 define the estimated image line. (x″,y″)_1 and (x″,y″)_2 define the observed line. The equation A(x″ − x″_e1) + B(y″ − y″_e1) = E_1 defines the distance error for estimated line end-point (x″,y″)_e1. A similar equation holds for E_2.

4.2 The algorithm

A pseudo code version of the algorithm now follows.

1. Initialize. λ = 0.0001
2. From m images (m ≥ 3) determine n ≥ 6 m-tuples of corresponding lines.
3. Compute the line coefficients A_jk and B_jk as described above in computing the line coefficients, where j is the image number and k is the corresponding line set number.
4. Make an initial estimate of α. Fix one of the depth values.
5. while ‖H‖ > tolerance and number of iterations < iteration limit {
6.   Compute e = (e_121, e_221, e_122, e_222, e_123, e_223, ..., e_12k, e_22k, e_131, e_231, ..., e_13k, e_23k, ..., e_1jk, e_2jk, ..., e_1mk, e_2mk)ᵀ where

       e_ijk = A_jk(u_ij″ − u_ij,e″) + B_jk(v_ij″ − v_ij,e″)

     i is the line end-point number (1 or 2),
     j is the image number (the reference image number is 1; therefore j = 2, ..., m, where m is the number of images),
     k is the corresponding line set number.
7.   Compute J from equations (4.7) to (4.21). Note that one of the depths has been fixed.
8.   Compute ‖e‖. ‖e‖_old ← ‖e‖.
9.   Compute JᵀJ and Jᵀe. (JᵀJ)_0 ← JᵀJ. (Jᵀe)_0 ← Jᵀe.
10.  JᵀJ ← JᵀJ + λ diag(JᵀJ)
11.  while ‖e‖ >= ‖e‖_old {
12.    solve JᵀJ H = Jᵀe
13.    α′ ← α + H
14.    compute ‖e‖
15.    if ‖e‖ >= ‖e‖_old {
16.      λ ← λ × 10
17.      JᵀJ ← (JᵀJ)_0 + λ diag((JᵀJ)_0)
18.      Jᵀe ← (Jᵀe)_0
       }
     }
19.  λ ← λ / 10
20.  α ← α′
21.  compute ‖H‖
22.  increment number of iterations
   }

CHAPTER 5

5 Experimental results and discussion

The major contribution of this work is a clear demonstration of the algorithm's ability to compute relative orientation and depth from a sequence of real images. These images are not enhanced and not of high resolution. The image tokens are not computed with sub-pixel accuracy but are computed by the Canny edge detector and an algorithm for line following. A Monte Carlo simulation was run to determine the algorithm's range of convergence. The simulation generated synthetic images with exact and noisy image measurements. The scene structure and motion parameters were taken from the results of the analysis of the real image sequence. Thus, the Monte Carlo analysis is based on a real situation, and a comparison of the Monte Carlo results with the results of the real image analysis can be made. This chapter opens with a discussion of the algorithm's implementation.

5.1 Algorithm implementation

Various extensions were made to the basic algorithm for line correspondences.
Finding the maximal 3D model

It is possible, in a sequence of images from disparate viewpoints, that certain regions of the scene, visible in one image, may not be visible in another. This phenomenon is generally referred to as occlusion. As the viewpoint changes, an object may interpose itself between the object under view and the viewer, or a region on an object may disappear from view due to self occlusion. The scene structure, determined initially by the algorithm, is computed from line tokens in what is referred to as the reference image. These line tokens have corresponding line tokens in the non-reference images. Because of occlusion or a failure of the edge detection process, it is possible that the projection of a scene line into one image may cover only a small segment of that line relative to its projection in another image. Alternatively, one of the views may be of a segment of the scene line not seen at all in the other views. Recall that this line segment still belongs to the same physical scene line seen from other viewpoints. It is only that, in general, different views of the same scene line will be of different segments of that line. The objective, then, is to combine all of these views to determine the full spatial extent, in the scene, of the imaged line.

A two step process was implemented to determine the full 3D extent of the scene lines. The first step determines the depth of the image line endpoints measured with respect to the non-reference frames. This is determined by applying the inverse projection equations to the image points to compute the depth, with the constraint that the calculated point is collinear with the known 3D line. This is done for all points in all non-reference images. This computation is overconstrained (either u or v can be used), hence the solution that minimizes the projection error is chosen.
The mathematical derivation begins with the standard parametric equations for a line in three dimensional space.

(5.1) x = x_1 + a_1 t
(5.2) y = y_1 + a_2 t
(5.3) z = z_1 + a_3 t

The a_i are the elements of the vector a = p_2 − p_1, where the p_i are the three dimensional coordinates of the two endpoints of a line in the scene, which are computed by the algorithm. p_i is measured with respect to the non-reference camera coordinate system. These endpoints (when projected into the reference image) correspond to the endpoints of lines in the reference image. From (5.1) and (5.3) and u = −fx/z,

[−(u/f)z − x_1] / a_1 = (z − z_1) / a_3

Hence,

(5.4) z = [a_3 x_1 − a_1 z_1] / [−a_3 (u/f) − a_1]
(5.5) t = (z − z_1) / a_3, where z is from (5.4)

The 3D coordinates of the image point (u,v) are given by

(5.6) x = x_1 + a_1 t, y = y_1 + a_2 t, z = z, where t is from (5.5) and z is from (5.4)

Equations (5.2) and (5.3) and v = −fy/z can be used to write a similar set of equations. Equations (5.6) are the 3D coordinates of the non-reference image point and guarantee collinearity with the corresponding scene line derived from the reference image. At least one image ordinate will agree with the observed value when the 3D point is projected back into the non-reference image (to be called reprojection). For the above equations it is the u ordinate. The v ordinate, in general, will not agree exactly with the image observation due to parameter estimation error and measurement noise. By computing the 3D scene coordinates from both sets of equations (inverse projection of u or v) and comparing the error of their reprojections, the 3D coordinates with the least error on reprojection are selected. The second step maps the scene coordinates measured in the non-reference frames back into the reference frame. A comparison is made of all the scene coordinates that belong to the same scene line, and two points are selected which yield the line of maximum length.
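Equations (5.4) to (5.6) can be sketched as follows (my own helper names; the focal length of 633 pixels is the value reported later in this chapter, and is only an assumption here):

```python
import numpy as np

f = 633.0  # focal length in pixels (assumed, per the reported experiments)

def depth_on_line(u, p1, a):
    """Depth z of the point with image u-ordinate u constrained to lie on the
    3D line p1 + a*t, where a = p2 - p1; this is equation (5.4)."""
    return (a[2] * p1[0] - a[0] * p1[2]) / (-a[2] * (u / f) - a[0])

def point_on_line(u, p1, a):
    """3D coordinates (5.6) of the image point constrained to the line."""
    z = depth_on_line(u, p1, a)
    t = (z - p1[2]) / a[2]                 # line parameter, equation (5.5)
    return np.array([p1[0] + a[0] * t, p1[1] + a[1] * t, z])
```

Reprojecting the returned point reproduces the observed u ordinate exactly; the v ordinate carries the residual error, mirroring the discussion above.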
Handling missing lines

It is possible that the correspondence process will fail for some scene features in one or more images. This may be due to occlusion or a failure of the edge detector. The implementation keeps track of missing lines (signaled by a zero valued line entry on input) as a function of the image they are missing from, and computes the Jacobian accordingly. Essentially, the Jacobian has fewer rows than the Jacobian that would be computed if all corresponding lines from each image were present.

Ill conditioned coefficient matrices

The least squares system may become singular as a result of nearly dependent equations. From experiments, it was generally observed that a failure to converge was caused by near zero pivot elements introduced by the partial derivatives with respect to depth. As these partials go to zero, the correction estimates go to infinity. The damping parameter, λ, also increases, but at a much slower rate. Since the Levenberg-Marquardt damping parameter is multiplied by the diagonal matrix element (the pivot element), no benefit is realized from the damping. This is true as long as λ does not keep pace with the rate of decrease of the pivot elements. Since the size of the residuals also decreases with a decrease in the partial derivative values, λ will not keep pace with the decrease in the pivot elements. Thus, the increase in the size of λ tends to be restrained, and the system becomes degenerate. This behavior is apparently atypical for this method of determining nonlinear parameters, since large corrections to the parameters should generally lead to an increase in the residuals, with the result that the large correction is rejected. A floor on the size of the diagonal elements of the least squares coefficient matrix was introduced.
However, the value of the diagonal element is arbitrary (it was set to the computer's smallest positive floating point value representable with only the mantissa), and typically, by the time this value was reached, the solution had moved so far from the global minimum that only local minima were encountered. The most desirable solution is to detect the onset of failure early on and remove the offending data points from the computation. It is possible to reduce the divergence of the solution from the global minimum by restricting the size of the parameter corrections to some maximum value. This can be viewed as another form of 'damping'. The introduction of the floor value for the coefficient matrix diagonal elements causes the corresponding parameter to remain essentially unchanged from one iteration to the next.

Test results revealed that a very large number of iterations were required to produce convergence. This was true even for noise free images with twelve lines and no initial estimate error except for the initial depths. The norm of the residuals decreases rapidly for the first five or so iterations, then just as rapidly levels off, decreasing only slightly over the remaining iterations (see fig. 5-1 at the end of the chapter). This behavior is closely related to the behavior of the Levenberg-Marquardt damping parameter, λ. On the first iteration λ increases by a factor of 100 to 1000. This forces the least squares solution away from the quadratic approximation and towards pure gradient descent with a small step size. On each subsequent iteration two values of λ are tried. The smaller value of the damping factor leads to an increase in the residuals, thus λ is increased. Even after a large decrease in the residuals, indicating an approach to some minimum (usually the global minimum), this behavior does not change, i.e. λ stays constant.
Given this behavior, and in the interests of improving the rate of convergence, a test was inserted to detect it; after an arbitrary number of successive iterations at a constant value, λ is set to zero. With λ set to zero, the least squares method computes corrections for a quadratic approximation of the error function. Best results were found by setting λ to zero after detecting ten successive iterations at a constant value. This value was chosen as a compromise between extending the range of convergence and decreasing the number of iterations to converge. This value is only approximate, however, and may be decreased if the solution is highly over-constrained or if the image measurements are not noisy (for noise free images of twelve lines, setting λ to zero after five iterations saved approximately 4 iterations without affecting the range of convergence, compared to the case where λ is set to zero after ten iterations). Plotting the image lines and their estimated values after each iteration reveals that the solution converges rapidly in less than ten iterations over a wide range of initial estimates, with little improvement after ten to fifteen iterations.

Despite the unexpected behavior of the Levenberg-Marquardt least squares technique, it should be made clear that this technique contributes significantly to increasing the range of convergence of the algorithm. The evidence for this comes from a comparison of fig. 5-13 with fig. 5-3. The range of convergence is seen to have been increased by approximately 18 degrees (referenced to the fifty percent failure rate), with a penalty, on average, of approximately 7 extra iterations required to achieve convergence.

Stopping criterion for convergence

The algorithm iterates until one of the following stopping criteria is met:
1. the norm of the correction vector H is less than 10⁻²,
2. the residuals do not decrease by more than the relative amount 10⁻³ over 3 successive iterations,
3.
the number of iterations exceeds a predetermined limit (set to 50 for the Monte Carlo tests).

Criterion 2 is suggested by [PrFlTe88] and seemed to work well in practice. Criterion 1 is a bit more arbitrary. Setting this value too low causes the algorithm to iterate longer than necessary, with the result that criterion 2 or 3 is invoked. Setting this value too high will result in poor final estimates of the parameters. The value given above was empirically the best value overall. For noiseless image measurements, when criterion 1 was met, the final absolute estimate errors were typically of the order 10⁻⁴ to 10⁻⁵ for all the parameters.

5.2 Analysis of real images

A sequence of real images was analyzed (see fig. 5-8). A program was written which enables the user to specify corresponding lines in a sequence of images (see fig. 5-9). The initial relative orientation parameters had to be guessed at, as the experiment was not calibrated. A pseudo Monte Carlo analysis was conducted to determine the range of convergence. The intrinsic camera parameters were not known and had to be estimated. All angle measures are in degrees and all distance measures are in pixel units. This also applies to the results for the Monte Carlo analysis of synthetic images.

5.2.1 Image acquisition

A sequence of four images of an office staple remover was taken by a Hitachi CCD camera. Image resolution was 512 by 512 pixels. The camera's intrinsic parameters were found empirically by improving on a set of initial estimates. Upon successfully converging on a unique solution from a variety of initial guesses (in fact, no other solutions were found), a series of tests was run to find the optimal focal length and image scale factors. The values of the intrinsic camera parameters that minimized the residuals upon convergence were designated as optimal. It was found that a focal length of 633 pixels and a scale factor of 1.1 in the u (horizontal) direction were optimal.
The vertical scale factor was unity. Recall that all distance measures are in pixel units and all angle measures are in degrees. After the images were acquired, the Canny edge detector was run on them. From the zero crossing images, lines were extracted using methods developed by Lowe [Lowe87]. A program was written which displays the line images and enables the user to select corresponding lines from these images. The output from the program consists of a list of line endpoints. From occlusion or edge detection failures, some corresponding lines may not exist in one or more images. This situation is signaled and is dealt with appropriately by the implementation developed here. This permits as much information as possible to be used. The Monte Carlo analysis was run on three of the images shown in fig. 5-8: the first, second and fourth. In all, twenty-one scene lines were used.

5.2.2 Monte Carlo method for determining range of convergence of real images

The same infrastructure for determining the range of convergence of synthetic images was applied to the real images. The basic difference is that the correct, or desired, solution can only be the best estimate determined by a small number of tests. This, in fact, turns out to be adequate, and good results from the simulation were found. The Monte Carlo simulation, to determine the range of convergence, systematically varies the initial estimate of one of the unknown parameters with the other parameters fixed to their correct (desired) values. The exception to this is the initial estimate for the depth values. Initially, all depth estimates are set to be equal, i.e. coplanar. The progress of the algorithm's convergence is monitored and statistics gathered on the algorithm's failure rate and number of iterations to converge. Note that the failure criterion was not altered from the Monte Carlo simulations run with noise free synthetic images.
In those tests, a failure was signaled if the number of iterations exceeded fifty, the parameter estimates became undefined, or the error in the final estimate of the rotation exceeded one percent (this is quite stringent, but is appropriate for noise free synthetic images). The range of convergence would be expected to increase, to some extent, if some of these criteria were relaxed (especially the rotation error test). The use of a graphics package for generating plots of the image measurements and their estimates was useful for analyzing the rate of convergence and the accuracy of the parameter estimates. A typical sequence of image plots illustrating the convergence process for the real image sequence is given in Appendix A. Since the experiment was not calibrated, some other means of determining the performance of the algorithm had to be found. The most obvious measure is the solution for the structure of the 3D scene. If the computed structure appears to closely match the physical appearance of the object (for example, lines are parallel and relative distances on the object are preserved), then at least some small measure of success can be noted. Appendix B contains a series of views of the solution for the 3D structure of the staple remover, determined from three of the real images (see fig. 5-8). Figs. 5-10 to 5-12 plot the range of convergence of the algorithm, tested with three of the real images (the first, second and fourth). The range of convergence is virtually the same for real images and for the Monte Carlo tests with noisy images (noise was uniformly distributed in the range ±3 pixels). In the case of rotation error, with reference to the fifty percent failure rate, the range of convergence extends to about 36 degrees. Only the results for translation error in x,y differ markedly in the failure rate.
If the rotation error failure criterion used for noisy synthetic images were applied to the real images, the range of convergence would be expected to increase.

5.3 Monte Carlo simulation

The purpose of the simulation was to determine the algorithm's range of convergence. This requires that a failure criterion be defined. It was decided that a failure to converge would be signaled if
1. an arbitrary iteration limit was exceeded,
2. the parameter estimates became undefined, or
3. the error in the final estimate of the rotation exceeded some arbitrary value.
The iteration limit was set to fifty. This value was arrived at empirically: with the modification to the Levenberg-Marquardt algorithm, which sets λ to zero after a certain number of iterations, the solution almost always converged in fewer than fifty iterations. The relative error in the amount of rotation about the rotation axis was determined; if it exceeded one percent, then a failure to converge to the global minimum was signaled. For noise-free synthetic images, this relative amount should really reflect the machine's precision, since the algorithm should converge exactly (within round-off and other computational errors). Monte Carlo simulations should seek to recreate real situations as closely as possible. The randomly generated scene points and desired motion and depth parameters were drawn from the results of the real image analysis described earlier. In the case of noisy synthetic images, the noise was drawn from a uniform distribution. This choice was mostly due to expediency and adherence to the majority of work done in this area, which seems to favor this distribution (probably also for expediency).

General setup

For each new initial parameter estimate, ten iterations of the simulation were run. The scene structure was randomly chosen for each of the ten iterations. All random numbers were drawn from a uniform distribution.
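As a minimal sketch, the three failure criteria listed above might be checked as follows; the parameter names are illustrative, not the thesis's own.

```python
import numpy as np

MAX_ITERS = 50      # empirically chosen iteration limit
ROT_REL_TOL = 0.01  # one percent relative error in rotation amount

def converged(n_iters, params, tau_est, tau_true):
    """Return False if any of the three failure criteria fires:
    1. the iteration limit was exceeded,
    2. a parameter estimate became undefined (NaN or infinite),
    3. the relative error in the amount of rotation about the
       rotation axis exceeds one percent."""
    if n_iters > MAX_ITERS:
        return False
    if not np.all(np.isfinite(params)):
        return False
    if abs(tau_est - tau_true) > ROT_REL_TOL * abs(tau_true):
        return False
    return True
```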
5.3.1 Data set definition

The scene points were drawn from a uniform distribution of scene coordinates in a volume defined by the coordinate ranges (in pixel units), 500 < x < 1500, 500 < y < 1400, -2200 < z < -1800. The specification for this volume comes from the results of the analysis of the real image sequence. The 3D structure of the staple remover defines a volume in space, the dimensions of which are approximated by the coordinate ranges just given. The motion estimated from the real image sequence, described earlier, served as the desired motion for the simulation.

5.3.2 Exploring the parameter space

To simplify the analysis of the algorithm, it was necessary to reduce the number of degrees of freedom that must be explored to determine the range of convergence. From the premise that the range of convergence is limited by the smallest range of any one parameter, a scenario was constructed. In the scenario, all but one of the initial parameter estimates are fixed to their desired values, i.e. the initial estimate is set to the correct solution. Only one of the estimates is systematically varied, and for each initial estimate the failure rate and average number of iterations are computed over a set of tests having different scene structures. This scenario was then modified to take into account the fact that the depth estimates of the line endpoints are not known, in general. By setting all of the depths to some number (which may be derived from domain-specific knowledge), the scale of the solution is fixed. This scale also determines the scale of the translation. Therefore, the only parameters that are varied systematically are the translation and rotation parameters. In general, upon successful convergence, the average absolute error of the parameter estimates for noise-free data was on the order of 10⁻⁴ to 10⁻⁵.

5.3.2.1 Depth

Since, in general, the depths of the scene points are not known, all of their values are set to be equal, i.e.
the points are coplanar. The actual value chosen for this depth fixes the scale of the solution. In these tests, it was set to the estimated depth of the object analyzed in the real image sequence.

5.3.2.2 Rotation

Rotation has three underlying parameters and, therefore, a three-dimensional parameter space should be explored. However, it can be argued that the important measure of the error between the desired coordinate system orientation and the initial estimate of that orientation is the amount of rotation about the axis between the two orientations. The significance of the error in the direction of the rotation axis decreases as the orientation difference between initial and desired orientations goes to zero. The scenario is that the rotation axis is randomly chosen, and the amount of rotation, τ, is varied systematically over some range. This incremental rotation is composed with the desired rotation estimate, and becomes the initial rotation estimate. Hence, τ is an absolute error in the initial estimate. As the rotation axis is random, only the range 0 to 180 degrees for the value of τ needs to be considered (the random axis serves to vary τ over the 360 degree range). Fig. 5-2 is a plot of the range of convergence and number of iterations to converge, for six scene lines (synthetic data), with respect to error in the initial estimate for rotation, τ. Fig. 5-3 is the same plot, but for twelve scene lines. By doubling the number of line correspondences, the range of convergence (with reference to the fifty percent failure rate) is increased by 12 degrees, from approximately 48 degrees to 60 degrees. The average number of iterations to converge at the fifty percent failure rate is virtually the same for both, at about 34.

5.3.2.3 Translation

Translation also has three degrees of freedom. Unfortunately, unlike rotation, two components must be retained: one parallel to the image plane and one orthogonal to the image plane.
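The rotation-error construction of Sec. 5.3.2.2 — a random unit axis, a systematic rotation amount τ, composed with the desired rotation to form the initial estimate — can be sketched with Rodrigues' formula. This is a hedged illustration; the thesis's own rotation parameterization may differ.

```python
import numpy as np

rng = np.random.default_rng(1)

def axis_angle_to_matrix(axis, angle_deg):
    """Rodrigues' formula: rotation matrix for a unit axis and angle."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    t = np.radians(angle_deg)
    K = np.array([[0.0, -a[2], a[1]],
                  [a[2], 0.0, -a[0]],
                  [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(t) * K + (1.0 - np.cos(t)) * (K @ K)

def perturbed_initial_rotation(R_desired, tau_deg):
    """Compose an error rotation of tau degrees about a random axis
    with the desired rotation to form the initial estimate."""
    axis = rng.normal(size=3)  # random direction (normalized above)
    R_err = axis_angle_to_matrix(axis, tau_deg)
    return R_err @ R_desired
```

Because the axis is random, sweeping τ over 0 to 180 degrees covers the full range of orientation errors, as noted in the text.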
These components are differentiated as a consequence of their effects on the image projections. Pure translation in the z direction (orthogonal to the image plane) causes uniform image scaling, and depth is not recoverable (unless the translation is known). Translation in the x,y direction (parallel to the image plane) produces image disparity, and depth is recoverable (assuming the translation is large enough). The range of convergence for translation is determined by varying one of (x,y) or z, and fixing the other to have an initial estimate error equal to zero. The initial estimate error is a percentage of the desired translation norm. For (x,y), y is randomly chosen in the interval -k||t|| < y < k||t||, where k is some percentage in the range 0 to 200. The value of x is determined such that ||(x,y)|| is equal to k||t||; the sign of x is also random. Fig. 5-4 is a plot of the range of convergence of the algorithm for six scene lines with respect to an initial estimate error in translation. The error is only in the x and y directions. Fig. 5-5 is the same plot as fig. 5-4, except that the initial estimate error is only in the z direction; thus, -k||t|| < z < k||t||. Figs. 5-6 and 5-7 are the same as the plots in figs. 5-4 and 5-5, but for twelve scene lines. Doubling the number of line correspondences extends the range of convergence for translation error in x,y from about 170 percent of ||t|| to beyond the test range. For translation error in z, the range is extended from 70 percent to about 80 percent. The real improvement here, though, is the range over which the failure rate is zero: this range was extended from 20 percent of ||t|| to 60 percent. The number of iterations to converge for twelve scene lines is slightly less than for six scene lines.

5.3.3 Image noise and algorithm robustness

The Monte Carlo simulation was run on the synthetic scene of 12 lines described above. Noise was added to all of the image measurements.
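The in-plane translation perturbation of Sec. 5.3.2.3 can be sketched as follows; the function and variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)

def xy_translation_error(t, k_percent):
    """Initial-estimate error in the image plane: choose y uniformly
    in [-k||t||, k||t||], then pick |x| so that ||(x, y)|| = k||t||,
    with a random sign on x (per Sec. 5.3.2.3)."""
    k = k_percent / 100.0
    r = k * np.linalg.norm(t)        # target error magnitude
    y = rng.uniform(-r, r)
    x = np.sqrt(max(r * r - y * y, 0.0)) * rng.choice([-1.0, 1.0])
    return np.array([x, y, 0.0])     # no error in z for this test
```

By construction, the perturbation vector always has norm k||t||, so the horizontal axes of the translation plots (error as a percentage of ||t||) are exact.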
The noise was uniformly distributed in the range ±3 pixels. The failure criterion for the rotation error was relaxed such that a failure was signaled if the error in the final estimate for the amount of rotation exceeded 15 degrees. The other failure criteria remained the same. Figs. 5-14 to 5-16 are the range of convergence plots for the noisy image tests.

5.3.3.1 Noise mensuration

The noise measure used here is defined as the ratio of the amount of noise in the disparity vector to the norm of the disparity vector [ShPr85]. The horizontal disparity is computed as the difference in the u ordinate between the projections of object point p and p + t_x. Similarly, the vertical disparity is the difference in the v ordinate between the projections of object point p and p + t_y. The norm of the disparity vector is the root of the sum of the squares of the horizontal and vertical disparities. The average disparity norm was 472 pixels. This is equal to the hypotenuse of the right triangle formed by the average horizontal disparity of 380 pixels and the average vertical disparity of 280 pixels. Uniformly distributed image noise in the range of ±3 pixels was added to both ordinates of all image points. The largest noise level, therefore, is 2·√(2·3²)/472 ≈ 1.8%, since the radical term is the amount of noise for one line in a pair of corresponding lines.

Results

The failure rate for an initial estimate error in x,y for noisy images has increased to about the 30 to 35 percent level, as compared to the zero percent level for the noise-free test (compare fig. 5-14 with fig. 5-6). The failure rate increases linearly from zero to 30 percent for the initial estimate error in z, at which point the failure rate increases rapidly as compared to the noise-free test (compare fig. 5-15 with fig. 5-7).
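The worst-case noise-level figure of Sec. 5.3.3.1 can be reproduced numerically from the reported average disparities:

```python
import numpy as np

# Average disparities reported for the synthetic scene (pixels).
h_disp, v_disp = 380.0, 280.0
disp_norm = np.hypot(h_disp, v_disp)          # about 472 pixels

# Worst-case uniform noise of +/-3 pixels on both ordinates of an
# image point gives sqrt(2 * 3^2) pixels of noise per line; a pair
# of corresponding lines doubles it.
noise_per_line = np.sqrt(2.0 * 3.0 ** 2)
noise_level = 2.0 * noise_per_line / disp_norm  # about 0.018, i.e. ~1.8%
```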
For noise-free data and the test for initial estimate error in z, the failure rate is zero over the same range in which the failure rate increases linearly for the noisy image case. Note that, with respect to the 50 percent failure rate, the range of convergence for initial estimate error in z is approximately the same as for the noise-free image case. The number of iterations, on average, has increased by 5 to 10 for initial estimate errors in translation. It would appear that the presence of noise has not adversely affected the range of convergence for initial estimate errors in translation; however, the failure rate has increased somewhat over this range (on the order of 20 to 30 percent). For the case of an initial estimate error in rotation only (bearing in mind that the initial depth estimates are coplanar for all of the tests described here), it can be seen that the average failure rate has increased to about the 20 percent level before rapidly moving beyond the 50 percent failure rate level, in comparison to the noise-free case (compare fig. 5-16 and fig. 5-3). The range of convergence for rotation is reduced to about 37 degrees from the 60 degree value for the noise-free case (with reference to the 50 percent failure rate). The number of iterations for noisy data has increased by approximately 10 as compared to the noise-free case. The general observation to be made here, concerning the effect of noise, is that the range of convergence is not reduced for initial estimate errors in translation, whereas it is reduced for initial estimate errors in rotation (the range of convergence for rotation is reduced by almost 40 percent). The average failure rate, however, is somewhat higher for translation than for rotation, indicative of the algorithm's stability with respect to rotation. The absolute errors in translation, depth and rotation were also computed for the noisy image tests.
In all cases, when the failure rate was 50 percent or less, the average absolute error in rotation was less than 2 degrees. The error in the norm of the translation vector was less than 2 percent (this was a worst-case observation from the test with translation error in x,y; typically the relative error is less than 1 percent). The average absolute depth error was less than 20 pixels in approximately 2000 (again a worst case; the typical value was 10 pixels or less). When comparing these results with the results from the analysis of the real image sequence, it was found that they agree almost exactly for rotation and translation in z, but they do not agree for translation in x,y. As discussed in the section on real images, this may be due to the lower error tolerance for rotation used to signal a failure to converge. In any event, the range of convergence for the real image data, for translation in x,y, was approximately half of the range for noisy synthetic images. Except for the reduction in the range of convergence for rotation, these results appear to be quite good.

Fig. 5-1. Behavior of residuals with respect to the number of iterations. This plot is for 6 lines over 3 frames. Image measurements are noise-free. The rotation initial estimate error, τ, is 10 degrees. Initial depths are coplanar. There is no translation initial estimate error.

Fig. 5-2. Range of convergence with respect to τ for 6 lines, noise-free synthetic image. Initial depths are coplanar. The failure rate is in percent and the number of iterations is the bold line.

Fig. 5-3. Range of convergence with respect to τ for 12 lines, noise-free synthetic image. Initial depths are coplanar. The failure rate is in percent and the number of iterations is the bold line.
Fig. 5-4. Range of convergence with respect to Tx and Ty for 6 lines, noise-free synthetic image. Initial depths are coplanar. The failure rate is in percent and the number of iterations is the bold line.

Fig. 5-5. Range of convergence with respect to Tz for 6 lines, noise-free synthetic image. Initial depths are coplanar. The failure rate is in percent and the number of iterations is the bold line.

Fig. 5-6. Range of convergence with respect to Tx and Ty for 12 lines, noise-free synthetic image. Initial depths are coplanar. The failure rate is in percent and the number of iterations is the bold line.

Fig. 5-7. Range of convergence with respect to Tz for 12 lines, noise-free synthetic image. Initial depths are coplanar. The failure rate is in percent and the number of iterations is the bold line.

Fig. 5-8. Images of office staple remover from four viewpoints. Images are numbered 1 to 4 from left to right, top to bottom.
Fig. 5-9. Sample screen from program to designate corresponding line segments. This interactive program highlights selected line segments which are deemed to be corresponding. The program's output is a list of endpoints that define the corresponding image lines.

Fig. 5-10. Range of convergence with respect to Tx and Ty, real image sequence. Initial depths are coplanar. The failure rate is in percent and the number of iterations is the bold line.

Fig. 5-11. Range of convergence with respect to Tz, real image sequence. Initial depths are coplanar. The failure rate is in percent and the number of iterations is the bold line.

Fig. 5-12. Range of convergence with respect to τ, real image sequence. Initial depths are coplanar. The failure rate is in percent and the number of iterations is the bold line.

Fig. 5-13. Range of convergence with respect to τ, for 12 lines and no damping. This graph illustrates the range of convergence with the Levenberg-Marquardt parameter set to zero, i.e. no damping. Compare with fig. 5-3. Again, initial depths are coplanar.
The failure rate is in percent and the number of iterations is the bold line.

Fig. 5-14. Range of convergence with respect to Tx and Ty, noisy synthetic image. Initial depths are coplanar. The failure rate is in percent and the number of iterations is the bold line.

Fig. 5-15. Range of convergence with respect to Tz, noisy synthetic image. Initial depths are coplanar. The failure rate is in percent and the number of iterations is the bold line.

Fig. 5-16. Range of convergence with respect to τ, noisy synthetic image. Initial depths are coplanar. The failure rate is in percent and the number of iterations is the bold line.

CHAPTER 6

6 Conclusion

A solution for relative orientation and depth has been presented. The solution method, derived from the collinearity condition, used both point and line tokens. Experimental results for the line token method were presented. Results from a Monte Carlo analysis indicate a relatively large range of convergence and a fair degree of robustness. It was found that, for noisy synthetic images with uniformly distributed noise of ±3 pixels, the range of convergence for rotation was approximately ±40 degrees. The range of convergence for rotation for noise-free images was found to be approximately ±60 degrees. The range of convergence in translation was not seriously affected by this noise level. The range of convergence for translation was greater than 200 percent of the size of the translation vector in any direction parallel to the image plane. The range of convergence for translation orthogonal to the image plane was found to be approximately ±80 percent of the size of the translation.
These ranges are given with respect to a fifty percent failure rate in convergence for the algorithm. The number of iterations required to converge for noise-free data was typically in the range of 20 to 30 iterations. For noisy synthetic images, the number of iterations required to converge increased, on average, by about 10, giving a range of approximately 30 to 40 iterations. The analysis of the real image sequence was a good demonstration of the algorithm's performance. A pseudo Monte Carlo analysis of the real image data was performed to determine the range of convergence. Those results agree in general with the results from the analysis of the synthetic images. In fact, for rotation, the range of convergence agreed exactly at the 60 percent failure level (this value was 40 degrees) and was very close at the lower failure rates. The results did not agree as well for translation: the real images proved to have a lower range of convergence in translation relative to the noisy synthetic image tests. The range of convergence for an initial estimate error in translation orthogonal to the image plane was approximately the same as for noisy synthetic images (approximately ±80 percent of the size of the translation vector, measured with respect to a 50 percent failure rate in convergence). However, the range of convergence parallel to the image plane was approximately half of the range for noisy synthetic images (half of 200 percent of the size of the translation vector, measured with respect to a 50 percent failure rate in convergence). This discrepancy may be accounted for by the lower error tolerance used in the test on the real image data. The number of iterations required to converge was slightly greater than the values found for noisy synthetic images, by approximately five. The range, then, was approximately 35 to 45 iterations.
Although the motion for the real image sequence was not calibrated, computer graphics of the estimated object structure gave a good indication of the performance of the algorithm. The 3D structure of the object was consistent with its true physical structure. The object, in this case, was a familiar office object, a staple remover. The performance of the Levenberg-Marquardt algorithm for least squares adjustment was found to be less than optimal. It was necessary to remove the damping, after an arbitrary number of iterations, to increase the rate of convergence. The arbitrariness of this number was reduced by the use of graphical techniques, which served to illustrate the accuracy of the estimated parameters after a given number of iterations. Further extensions to the least squares algorithm, such as those suggested by Moré, may be required [More77]. Nevertheless, by incorporating the Levenberg-Marquardt least squares extension, a marked improvement in the algorithm's performance was achieved for estimating relative orientation and depth.

6.1 Directions for further research

This algorithm can be incorporated into a system for computing 3D object models. The models computed by the algorithm developed here serve as input to a model-based vision system. In addition to providing 3D scene models, an accurate estimate of the camera viewpoint compensates for a lack of information due to occlusion, poor edge detection, and so on. Therefore, the algorithm developed in this thesis will be of interest to researchers in model-based vision. Methods for nonlinear parameter estimation require further research. One possible avenue is to investigate the use of Tykhonov stabilizers. A closer look at some of the extensions to the Levenberg-Marquardt algorithm suggested by Moré (and others) may be in order. Although these extensions were felt to be costly in a computational sense, some simplifications may be possible to reduce the cost.
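The damping schedule discussed above — standard Levenberg-Marquardt corrections with the damping parameter forced to zero after a fixed number of iterations — can be sketched generically as follows. This is an illustration on a toy residual, not the thesis's implementation; `undamp_after` plays the role of the arbitrary iteration count mentioned in the text.

```python
import numpy as np

def lm_step(J, r, lam):
    """One damped Gauss-Newton (Levenberg-Marquardt) correction:
    solve (J^T J + lam * I) dp = -J^T r for the parameter update."""
    JtJ = J.T @ J
    g = J.T @ r
    return np.linalg.solve(JtJ + lam * np.eye(JtJ.shape[0]), -g)

def solve_lm(residual, jacobian, p0, lam0=1e-3, undamp_after=10,
             max_iters=50):
    """Iterate LM corrections; drop the damping (lam = 0) after a
    fixed number of iterations to recover the Gauss-Newton rate."""
    p, lam = p0.copy(), lam0
    for k in range(max_iters):
        if k >= undamp_after:
            lam = 0.0            # remove damping, as in the thesis
        r = residual(p)
        if np.linalg.norm(r) < 1e-12:
            break
        p = p + lm_step(jacobian(p), r, lam)
    return p
```

With the damping left on, each step is biased toward gradient descent and the final convergence rate suffers, which is consistent with the observation that removing it sped up convergence.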
Extending the parameterization to nonrigid objects requires investigation. An easier, more immediate extension is the ability to handle point and line correspondences simultaneously; this extension is relatively straightforward. Other extensions include determining parameterizations for higher order tokens. These tokens could be curvilinear extensions to line tokens, for example. Region-based tokens should also be examined. It should be fairly straightforward to detect measurements which cause the solution to become singular. Typically, the partial derivatives of degenerate lines rapidly approach zero, with the algorithm responding by computing very large corrections. The partial derivative with respect to depth was usually the most sensitive to degenerate data. Upon being detected, the degenerate data would then be edited from the calculations. Combining this algorithm with intensity gradient or optical flow methods opens up the possibility of segmenting images while computing structure and motion. This extension could be useful for analyzing images of multiple objects. It is an open question what the exact relationship is between structure from motion and structure from other processes, such as binocular stereo, shading, texture, and so on. Also of interest is the role played by structure from motion computations in the correspondence process. Motion analysis is an area that has already seen a great deal of progress, with a good base of results on which to build. There are still many open questions to answer and many improved solution methods await discovery.

Bibliography

[Adiv85] Adiv, G., Determining three-dimensional motion and structure from optical flow generated by several moving objects, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. PAMI-7, no. 4, pp. 384-401, 1985.
[Ana85] Anandan, P., A review of motion and stereopsis research, COINS technical report 85-52, 1985.
[BaTe78] Barrow, H.G.
and Tenenbaum, J.M., Recovering intrinsic scene characteristics from images, in Computer Vision Systems, A.R. Hanson & E.R. Riseman (eds), Academic Press, New York, pp. 3-26, 1978.
[BroDen72] Brown, K.M. and Dennis, J.E., Derivative free analogues of the Levenberg-Marquardt and Gauss algorithms for nonlinear least squares, Numer. Math., vol. 18, pp. 289-297, 1972.
[BurFai85] Burden, R.L. and Faires, J.D., Numerical Analysis, 3rd ed., Prindle, Weber & Schmidt, Boston, MA, 1985.
[FaHu84] Fang, J.Q. and Huang, T.S., Solving three-dimensional small-rotation motion equations: uniqueness, algorithms, and numerical results, Computer Vision, Graphics and Image Processing, vol. 26, no. 2, pp. 183-206, May 1984.
[FaHu84b] Fang, J.Q. and Huang, T.S., Some experiments on estimating the 3D motion parameters of a rigid body from two consecutive image frames, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 6, no. 5, pp. 545-554, Sept. 1984.
[FaLu87] Faugeras, O.D. and Lustman, F., Let us suppose that the world is piecewise planar, in Robotics Research: The Third International Symposium, O.D. Faugeras and Georges Giralt (eds), MIT Press, 1987.
[FaLuTo87] Faugeras, O.D., Lustman, F., and Toscani, G., Motion and structure from motion from point and line matches, IEEE 1st International Conference on Computer Vision, pp. 25-34, London, 1987.
[FaTo86] Faugeras, O.D. and Toscani, G., The calibration problem for stereo, Proceedings of CVPR86, Miami Beach, Florida, USA, 1986, pp. 15-20.
[Gen79] Gennery, D.B., Stereo-camera calibration, in Proc. Image Understanding Workshop (DARPA), pp. 101-108, November 1979.
[GrHi87] Grzywacz, N.M. and Hildreth, E.C., Incremental rigidity scheme for recovering structure from motion: position-based versus velocity-based formulations, J. Opt. Soc. Am. A, vol. 4, no. 3, March 1987, pp. 503-518.
[Hil83] Hildreth, E.C., The Measurement of Visual Motion, MIT Press, Cambridge, MA, 1983.
[Horn86] Horn, B.K.P., Robot Vision, MIT Press, Cambridge, MA, 1986.
[Horn87] Horn, B.K.P., Relative orientation, MIT AI Memo 994, 1987.
[Huang83] Huang, T.S. (ed.), Image Sequence Processing and Dynamic Scene Analysis, Springer-Verlag, Heidelberg, FRG, 1983.
[Huang87] Huang, T.S., Motion analysis, in Encyclopedia of Artificial Intelligence, vols. 1 and 2, S.C. Shapiro (ed.), John Wiley & Sons, New York, 1987.
[HuFa] Huang, T.S. and Fang, J.Q., Estimating 3-D motion parameters: some experimental results, Proceedings of the Society of Photo-Optical Instrumentation Engineers, vol. 449, part 2, pp. 435-437.
[HuTs81] Huang, T.S. and Tsai, R.Y., Image sequence analysis: motion estimation, chapter 1 in Image Sequence Analysis, T.S. Huang (ed.), Springer-Verlag, Heidelberg, FRG, pp. 1-18, 1981.
[Kan88] Kanatani, K., Constraints on length and angle, Computer Vision, Graphics and Image Processing, 41, 28-42, 1988.
[Law83] Lawton, D.T., Processing translational motion sequences, Computer Vision, Graphics and Image Processing, 22, 116-144, 1983.
[Lev44] Levenberg, K., A method for the solution of certain nonlinear problems in least squares, Quart. Appl. Math., vol. 2, pp. 164-168, 1944.
[Lh81] Longuet-Higgins, H.C., A computer program for reconstructing a scene from two projections, Nature 293, pp. 133-135, September 1981.
[Lh84] Longuet-Higgins, H.C., The reconstruction of a scene from two projections - configurations that defeat the 8-point algorithm, Proceedings of the First Conference on Artificial Intelligence Applications, Denver, CO, pp. 395-397, December 5-7, 1984.
[LiHu86] Liu, Y. and Huang, T.S., Estimation of rigid body motion using straight line correspondences, Proceedings Workshop on Motion: Representation and Analysis, IEEE Computer Society, pp. 47-51, 1986.
[Lowe87] Lowe, D.G., Three-dimensional object recognition from single two-dimensional images, Artificial Intelligence 31, pp. 355-395, 1987.
[Marq63] Marquardt, D.W., An algorithm for least-squares estimation of nonlinear parameters, J. SIAM, vol. 11, no. 2, 1963.
[Marr82] Marr, D., Vision: A Computational Investigation into the Human Representation and Processing of Visual Information, W.H. Freeman & Co., San Francisco, 1982.
[MiSeAg86] Mitiche, A., Seida, S., and Aggarwal, J.K., Line based computation of structure and motion using straight line correspondences, Proceedings ICPR, pp. 1110-1112, 1986.
[MiSeAg88] Mitiche, A., Seida, S., and Aggarwal, J.K., Using constancy of distance to estimate position and displacement in space, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 10, no. 4, pp. 594-599, July 1988.
[More77] Moré, J.J., The Levenberg-Marquardt algorithm: implementation and theory, in Numerical Analysis, G.A. Watson (ed.), Lecture Notes in Mathematics 630, Springer-Verlag, Berlin, pp. 105-116, 1977.
[Nag81] Nagel, H.H., Determination of moving rigid objects based on visual observations, IEEE Computer, vol. 14, pp. 29-39, 1981.
[Pent85] Pentland, A.P., A new sense for depth of field, IJCAI 85, pp. 988-994, 1985.
[PrFlTe88] Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T., Numerical Recipes in C, Cambridge University Press, Cambridge, 1988.
[RoAg80] Roach, J.W. and Aggarwal, J.K., Determining the movement of objects from a sequence of images, IEEE Trans. Pattern Anal. Machine Intelligence, vol. PAMI-2, pp. 554-562, Nov. 1980.
[ShPr85] Shariat, H. and Price, K.E., Results of motion estimation with more than two frames, DARPA, Image Understanding: Proceedings of a Workshop held at Miami Beach, Florida, December 9-10, 1985.
[SpAl87] Spetsakis, M.E. and Aloimonos, J.Y., Closed form solution to the structure from motion problem from line correspondences, AAAI87, pp. 738-743, 1987.
[TeWiKa88] Terzopoulos, D., Witkin, A., and Kass, M., Constraints on deformable models: recovering 3D shape and nonrigid motion, Artificial Intelligence, 36, 1988, pp. 91-123.
[Tho59] Thompson, E.H., A rational algebraic formulation of the problem of relative orientation, Photogrammetric Record, vol. 3, no. 14, pp.
152-159, Oct. 1959.
[ToFa87] Toscani, G. and Faugeras, O.D., Structure from motion using the reconstruction & reprojection technique, IEEE Workshop on Computer Vision, pp. 345-348, 1987.
[ToPo86] Torre, V. and Poggio, T.A., On edge detection, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. PAMI-8, no. 2, March 1986.
[Triv87] Trivedi, H.P., Estimation of stereo and motion parameters using a variational principle, Image and Vision Computing, vol. 5, no. 2, May 1987.
[TsHu81] Tsai, R.Y. and Huang, T.S., Estimating three-dimensional motion parameters of a rigid planar patch, IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-29, no. 6, pp. 1147-1152, Dec. 1981.
[TsHu84] Tsai, R.Y. and Huang, T.S., Uniqueness and estimation of 3-D motion parameters of rigid bodies with curved surfaces, IEEE Trans. PAMI, vol. 6, no. 1, pp. 13-27, 1984.
[TsHu84b] Tsai, R.Y. and Huang, T.S., Estimating three-dimensional motion parameters of a rigid planar patch, III: finite point correspondences and the three-view problem, IEEE Transactions on ASSP, vol. ASSP-32, no. 2, pp. 213-221, April 1984.
[TsHuZh82] Tsai, R.Y., Huang, T.S., and Zhu, W., Estimating three-dimensional motion parameters of a rigid planar patch, II: singular value decomposition, IEEE Transactions on ASSP, vol. 30, no. 4, pp. 525-534, August 1982.
[Ull79] Ullman, S., The Interpretation of Visual Motion, MIT Press, Cambridge, MA, 1979.
[Ull84] Ullman, S., Maximizing rigidity: the incremental recovery of 3-D structure from rigid and nonrigid motion, Perception 13, pp. 255-274, 1984.
[WeAg81] Webb, J.A. and Aggarwal, J.K., Visually interpreting the motion of objects in space, IEEE Computer, 14, pp. 40-49, 1981.
[WeHuAh87] Weng, J., Huang, T.S., and Ahuja, N., Error analysis of motion parameter estimation from image sequences, ICCV87, pp. 703-707, 1987.
[Wolf83] Wolf, P.R., Elements of Photogrammetry, 2nd edition, McGraw-Hill, New York, NY, 1983.
[YaMe86] Yasumoto, Y.
, Medioni, G., Robust estimation of three-dimensional motion parameters from a sequence of image frames using regularization, IEEE trans, on pattern analysis and machine intelligence, vol. PAMI-8, no. 4, pp. 464-471, July 1986. [YeHu83] Yen, B . L , Huang, T.S., Determining 3-D motion and structure of a rigid body using straight line correspondences, in [Huang83], pp365-394. [ZhHa85] Zhuang, X . , Haralick, R . M . , Two view motion analysis, CVPR85, pp. 686-690, 1985. 98 Appendix A Sequence of images illustrating algorithm convergence. Estimated image lines are emboldened. The images on the left of the page are from the first non-reference image (image 2 in fig. 5-8). The images on the right are from the second non-reference image (image 4 in fig. 5-8). I t e r a t ion 0 Itsration 0 99 Iteration 4 Iteration 101 4 102 Appendix B 2D projections of a 3D model of a staple remover c by the algorithm from a sequence of real images. 103 104 105 106 107 108 109
Item Metadata
Title | Solving for relative orientation and depth
Creator | McReynolds, Daniel Peter
Publisher | University of British Columbia
Date Issued | 1988
Description | Simulations with the same noisy images generated for the rotation tests indicate that the range of convergence for translation is not appreciably affected by these noise levels. Tests with real images, in which a 3D model of an object is derived from multiple views with human assistance in determining line correspondences, yield results that agree reasonably well with the noisy image simulations.
Subject | Images, Photographic
Genre | Thesis/Dissertation
Type | Text
Language | eng
Date Available | 2010-08-31
Provider | Vancouver : University of British Columbia Library
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
DOI | 10.14288/1.0051262
URI | http://hdl.handle.net/2429/28023
Degree | Master of Science - MSc
Program | Computer Science
Affiliation | Faculty of Science, Department of Computer Science
Degree Grantor | University of British Columbia
Campus | UBCV
Scholarly Level | Graduate
Aggregated Source Repository | DSpace
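The abstract describes estimating the camera parameters by nonlinear least squares, using Newton's method with the Levenberg-Marquardt extension on residuals derived from the collinearity condition (focal point, image point and object point on one line, i.e. u = f·X/Z under perspective projection). The sketch below is only an illustration of that damping scheme, not the thesis's actual parameterization: the scene coordinates, the single translation parameter tz, and the damping schedule are all invented for the example.

```python
import numpy as np

def levenberg_marquardt(residual, p0, n_iter=100, lam=1e-3):
    """Minimal Levenberg-Marquardt loop for a small dense least-squares problem."""
    p = np.asarray(p0, dtype=float)

    def jac(p, eps=1e-7):
        # Forward-difference Jacobian of the residual vector.
        r0 = residual(p)
        J = np.empty((r0.size, p.size))
        for j in range(p.size):
            dp = np.zeros_like(p)
            dp[j] = eps
            J[:, j] = (residual(p + dp) - r0) / eps
        return J, r0

    for _ in range(n_iter):
        J, r = jac(p)
        A = J.T @ J
        g = J.T @ r
        # Damped normal equations: (J^T J + lam * diag(J^T J)) step = -J^T r
        step = np.linalg.solve(A + lam * np.diag(np.diag(A)), -g)
        if np.sum(residual(p + step) ** 2) < np.sum(r ** 2):
            p, lam = p + step, lam * 0.1  # accept: behave more like Gauss-Newton
        else:
            lam *= 10.0                   # reject: behave more like gradient descent
    return p

# Toy collinearity-style residual: an image coordinate under perspective
# projection is u = f * X / Z.  Recover a camera translation tz along the
# optical axis from five observed points (all numbers below are invented).
f = 1.0
X = np.array([0.5, -0.3, 0.8, 0.2, -0.6])   # hypothetical scene X coordinates
Z = np.array([4.0, 5.0, 3.5, 6.0, 4.5])     # hypothetical scene depths

def project(tz):
    return f * X / (Z + tz)

u_obs = project(1.2)                        # synthetic measurements, true tz = 1.2
tz_hat = levenberg_marquardt(lambda p: project(p[0]) - u_obs, [0.0])
```

Increasing the damping factor lam on a rejected step is what gives the method its large convergence range: far from the solution it takes short, gradient-like steps, and near the solution it reduces to Gauss-Newton.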