Real-Time Motion Tracking of Mobile Robots via Image Registration by Atousa Soroushi B.A.Sc. Electrical Engineering, Iran University of Science and Technology, 1989. A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES ELECTRICAL ENGINEERING We accept this thesis as conforming to the required standard THE UNIVERSITY OF BRITISH COLUMBIA August 1996 © Atousa Soroushi, 1996 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of £ /ccfr>U^ I K l £r,V&^ The University of British Columbia Vancouver, Canada Date /}(4(ASh Z( , Cjf? DE-6 (2/88) Abstract Abstract In order to have accurate, safe and reliable remote control for mobile robots, it is necessary to track their motion. With new high technology systems, fools for imaging and image acquisition are becoming less expensive every day. Consequently, vision-based systems are considered practical for tracking purposes. Most of the present motion-detection systems used for control of heavy duty hydraulic machines depend on external equipment like inertial sensors, dead-reckoning or laser beacons. There are also a few existing systems for navigation purposes that are based on vision, but they do not perform in real-time or they require a partially known environment. The goal of this thesis is the design and implementation of a stand-alone motion-tracking system, to track the local motion of a machine-mounted down-looking camera in real-time. By installing the camera under the base of a mobile robot, the local trajectory of that robot in an unstructured environment can be tracked. In the present work we use a computational vision approach to design an image registration system to estimate the two dimensional translation, rotation and small scaling factor from two partially overlapping Images. This is done in real-time (about 20 frames per second) even when there is simultaneous rotation and translation occurring between the two successive frames and images may have few significant features. To reduce the processing time and obtain information from different regions of the image, we process only selected regions of interest of the original image. An initial estimate is used to select those regions, so that regions contain similar features with the least possible disparity. This method, with the aid of a coarse-to-fine search method, makes the motion-detection algorithm able to execute in about 45ms and detect a dependable rotation of up to 11.25 degrees and translations of up to 8 pixels for each pair of image frames. ii Table of Contents iii Table of Contents Abstract ii List of Tables vi List of Figures vii Acknowledgment viii Dedication ix 1 Introduction 1 2 Related Work on Motion Tracking 4 2.1 Inertial Navigation 5 2.2 Dead-Reckoning and Odometry 6 2.3 Vision Based Approaches 6 2.3.1 Active Beacons 6 2.3.2 Landmarks .7 2.3.3 Image Correspondence 8 2.3.3.1 Model-Based Image Correspondence: 9 2.3.3.2 Feature-Based Image Correspondence:. 10 3 Background on Image Registration 13 3.1 Registration in Theory 13 3.2 Common Transformations 14 3.3 Registration Methods and Trade-Offs 16 3.3.1 Fourier Phase Correlation 16 3.3.2 Spatial Correlation Methods 18 Table of Contents iv 3.4 Registration Components 21 3.4.1 Feature Space 21 3.4.2 Similarity Metric 22 3.4.3 Search Space 24 3.4.4 Search Strategy 24 4 Motion-Detection Algorithm 26 4.1 Motion Model 28 4.2 Sub-Image Generation 30 4.3 Resolution Pyramid 33 4.4 Hierarchical Image Registration 35 4.5 Velocity Prediction 36 5 Hierarchical Image Registration 37 5.1 Feature Extraction 39 5.2 Hierarchical Correlation Matching 45 5.2.1 Match Verification 48 5.2.2 Subpixel Correlation 50 5.3 Transformation Estimation 51 5.3.1 Transformation Refinement 53 5.3.2 False Match Recovery 54 6 Implementation Issues and Experimental Results 55 6.1 Hardware Configuration 55 6.1.1 Camera Setup 55 6.1.2 Frame Grabber 56 Table of Contents v 6.2 Calculation of Camera Location from Transformation Parameters 57 6.3 Optimal Parameters 59 6.3.1 Image Size 60 6.3.2 Number of Resolution Levels 62 6.3.3 Number of Features to be Extracted 63 6.3.4 Correlation Window and Search Window for Matching 64 6.3.5 Thresholding for Discarding False Matches 66 6.4 Timing Cost 67 6.5 Maximum Detectable Displacement of Camera 69 6.6 Results 70 6.6.1 On Images with Synthesized Motion 70 6.6.2 On Real Images 72 7 Conclusions 75 7.1 Concluding Remarks 75 7.2 Directions for Future Improvements 76 Bibliography 78 List of Tables vi List of Tables Table 6.1 Comparison.between hierarchical registration with different numbers of resolution levels 63 Table 6.2 Intermediate results, obtained from registration in four levels, for 20° rotation 66 Table 6.3 CPU time spent on various functions of motion-detection algorithm 68 Table 6.4 Experimental results on a terrain images with various synthetic translations and rotations 71 Table 6.5 Experimental results on images, taken by the camera, mounted on a mini-excavator 73 List of Figures vii List of Figures Figure 2.1 The interaction between components of Ravela et al. registration system 12 Figure 4.1 Flowchart of motion-detection algorithm 27 Figure 4.2 Sub-image Generation of two successive images based on an initial estimate 31 Figure 4.3 Generation of Resolution Pyramid 34 Figure 4.4 Resolution Pyramid generated based on non-weighted averaging 35 Figure 5.1 Flowchart of Hierarchical Image Registration 38 Figure 5.2 Hierarchical feature extraction with aim of search space estimation 42 Figure 5.3 Use of window patch versus pixel in hierarchical corner detector 43 Figure 5.4 Feature Extraction results on real images by use of Window patch versus single pixel detector 44 Figure 5.5 Match verification process 48 Figure 5.6 Behavior of Correlation Matching with Match Validity Check on real images " 49 Figure 5.7 Sub pixel sampling by using quadratic estimator 50 Figure 6.1 Setup of Camera 55 Figure 6.2 transformation of image frame versus global frame of robot platform 58 Figure 6.3 sub-image constructed of four different regions vs. one region 61 Figure 6.4 An approximate maximum detectable rotation within a 5x5 correlation window 65 Figure 6.5 Experimental results on synthetic images 72 Figure 6.6 Experimental results on real images 74 Acknowledgment I wish to express my sincere gratitude to my advisor Dr. Peter D. Lawrence, for his guidance, support and advice throughout the course of this thesis. I would especially like to extend my appreciation to Dr. David Lowe for his insightful suggestions, invaluable discussions and continuous encouragement. I would also like to acknowledge Dr. Ian Cumming for his time, reading my thesis and his valuable comments. In addition, I thank my good friends, Ying Cui, Shirin Abadi, Ray Burge, Jye-Kai Chang and Keyvan Hashtrudi-Zaad. They supported my ideas, reviewed my work and provided me with valuable recommendations and advice. I would also like to thank Poul Jacobsen for his invaluable help in our experiment on the mini excavator. I also wish to acknowledge Forestry Engineering Institute of Canada for providing the research assistantship and helpful information on the application requirements of this project. I appreciate the research project funding provided by the Canadian Network Center of Excellence Program IRIS (Institute for Robotics and Intelligent Systems), project IS-4. Special thanks to my husband Kazem Naderi for his love, faith, understanding and continuous encouragement. Finally, I would like to express my deep gratitude and indebtness to my parents for providing me an excellent environment to educate and grow up, and encouraging me to follow my interests. Atousa Soroushi viii Dedication To my beloved parents, Paridokht Anousheh and Shahrokh Soroushi 1 Chapter 1 Introduction In the last few years, a considerable amount of interest has been devoted to robotics and mo-bile robots. There is a wide range of applications for mobile robots, in forestry, construction and transportation of dangerous materials. Accordingly, some work has been done in autonomous guidance of vehicles that are used for the above purposes, such as excavators, bulldozers, and road finishers [1],[2],[3]. Such machines are able to follow a commanded path. However, there are many different environmental or communication factors, such as slippery surfaces, interac-tion with obstacles, etc., that might lead to failures. Therefore, for accurate guidance, with the ability of position correction, it is necessary to locate the machine and get feedback about the path that it has followed. The need for a reliable and inexpensive system for guidance of mobile robots, has led us to the development of a dependable and fast motion-tracking system based on vision. Through our eyes, not only we can derive a rich understanding of what is in the world and where it is located, but also we can locate and track our motion. The same principle can be used in motion-tracking of a mobile robot, by using a vision system. The purpose of this project is to develop and implement an algorithm to track the motion of a mobile robot in real-time by using image registration. By mounting a down-looking camera on a mobile robot, the problem of tracking the motion of the robot becomes similar to tracking the motion of the camera. With the assumption that the robot moves on a two-dimensional plane, detection of two-dimensional translation and rotation between each pair of successive images, continuously taken from the ground, will be sufficient to track the camera's movements. 1 . Introduction 2 Image registration is a fundamental task in image analysis and is used to match two or more pictures taken at different times, from different sensors, or from different viewpoints. It enables the images taken in different physical conditions to be aligned by a spatial transformation. Therefore, detection and measurement of the variations in the images is possible. To track the motion of an image, via image registration, it is first necessary to find enough reliable, informative patches in the image and then, to detect their corresponding matches in the moved image. Finally, with all relative location of features and their corresponding matches, an over-constrained linear equation is built that leads to the detection of the existing transformation between images. Normally, since the computation time for feature extraction and match detection is too high, it is not possible to process every incoming image in real-time. Therefore, only a certain number of images can be processed. Obviously, in processing fewer images with a larger temporal gap between them, there is a higher chance of having a bigger change of disparity in the images. As the images might move farther, the size of the search area must increase. Since the spatial disparity between corresponding image features becomes larger, the bigger search area requires more computation time. In addition, sampling the images in larger intervals can also lead to difficulty in predicting the motion, since the range of potential motion increases. Thus, it is desired to decrease the amount of computation time and increase the rate of processing images as much as possible. In this thesis, we focus on processing a few selected regions out of whole images in multi-scale resolution, to have a fast registration algorithm. Velocity prediction is used to obtain an approximate location of the corresponding regions to be selected. This saves a lot of computation time, since the entire image does not need to be searched for corresponding features. Since at a higher rate of image processing, the expected spatial disparity between images is smaller, 1 . Introduction 3 motion can be predicted more accurately. The other consideration is that some terrain might lack objects with specific structures. We found that corners, rather than other structured features, are the most suitable features to be tracked. Corners are extracted reliably with a relatively low computational cost, in a multi resolution hierarchy. Then, a correlation-based metric, which is robust to the environmental change of illumination intensity, is used to locate the corresponding matches, with subpixel accuracy. Some methods are also presented to eliminate false matches, such as match verification, and discarding of matches that have high residuals relative to the best fit of the least squares estimation. In Chapter 2 we briefly study some of the previous work done on motion tracking. A background on image registration and its general requirements, which could lead to the selection of an efficient registration algorithm, follows in Chapter 3. Next, Chapter 4, presents the main procedures of the proposed motion-detection algorithm, including the generation of sub-images and resolution pyramid and a brief review of our image registration algorithm. The velocity prediction based on the results of registration, is also covered in this chapter. The details of image registration, which is one of the major components of the presented motion-detection algorithm, is explained in Chapter 5. Also the selected methods for feature extraction, matching in subpixel resolution, match verification and finally translation and rotation estimation are explained in details. Then, the implementation issues and experimental results on synthetic and real images are discussed in Chapter 6. Finally, Chapter 7 concludes this dissertation with some recommendations for future improvements. 4 Chapter 2 Related Work on Motion Tracking Many different approaches to resolving motion tracking problems have been undertaken. As reviewed by Borenstein er a/.[4], the ones mostly used for navigation purposes operate on the basis of tracking the motion of mobile machines by using sensors mounted on them. These methods can be categorized into the following groups based on the type of sensors used: 1. Inertial sensors such as gyroscopes and accelerometers 2. Sensors for dead-reckoning and odometry 3. vision systems based on: a. Active beacons b. Landmarks c. Image Correspondences i. Model-based positioning ii. Feature-based positioning The positioning methods that are based on active beacons and landmarks provide absolute position measurements, however the other methods are appropriate for relative positioning measurements. Because of the lack of a single, generally good method, developers of automated guided vehicles normally combine the two methods, one from each category. 2 . Related Work on Motion Tracking 5 In this chapter, we will review some of the recent work done in the above area. We will mainly concentrate on vision systems which compute the correspondence between two successive images. 2.1 Inertial Navigation The principle of inertial navigation involves continuous sensing of minute accelerations in each of the three directional axes and integrating them over time to derive velocity and position. A gyroscopically stabilized sensor platform is used to maintain consistent orientation of the three accelerometers throughout this process. Inertial devices can provide accurate information on the attitude of a vehicle. But to obtain the desired accuracy, an inertial navigation system of very high standard is required, the price of which is unacceptable for practical use in mobile machines. For example, a high quality inertial navigation system found in a commercial airliner will have a typical drift of about 1850 meters per hour of operation and cost between $50K and $70K[5]. A hyper-frequency system such as GPS (General Positioning System) gives the 3-D position coordinates. However, its accuracy is not very high. DGPS (Differential GPS) can derive a higher precision of the order of one meter in. real-time[6]. In addition, the GPS update time is very slow and it cannot be used under structures or even dense foliage. An integrated inertial platform consisting of three gyroscopes, a biaxial accelerometer and two tilt sensors is described in [7]. Their system is designed for estimating the position and orientation of a moving robot vehicle. Gyroscopes provide angular rate information and accelerometers provide velocity rate information. By supplying error models for inertial sensors, they claim a maximum, rotation error of. 1-3 degrees/min and a position drift rate of 1-8 cm/sec. We wish to obtain higher accuracy with our vision system. 2 . Related Work on Motion Tracking 6 2.2 Dead-Reckoning and Odometry Dead-reckoning (originating from "deduces reckoning" of sailing days) is a simple mathe-matical procedure for determining the present location of a vessel by advancing some previous position through the known course and velocity information, over a given length of time[8]. The most simplistic implementation of dead-reckoning is odometry. Odometry is based on simple equations that utilize data from inexpensive incremental wheel encoders, such as brush encoders, potentiometers, synchros, resolvers, optical encoders, magnetic encoders, inductive encoders and capacitive encoders. However, odometry is also based on the assumption that wheel revolutions can be translated into linear displacement relative to the floor. This assumption is not always valid, as slippage is one extreme example. In the case that one wheel slips, then the associated encoders will register an invalid revolutionary motion. There are some other reasons for inaccuracies in the translation of wheel encoder readings into linear motion such as unequal wheel diameters, misalignment of the wheels, finite encoder sampling rate or travel over uneven surfaces. For a survey on different dead-reckoning approaches and measurement of odometry errors, the reader can refer to the survey done by Borenstein et al.[4]. 2.3 Vision Based Approaches 2.3.1 Active Beacons Active beacon navigation systems are the most common and reliable navigation aids for ships and airplanes [4]. Vision approaches using active beacons allow for high sampling rates and high reliability, but they require high cost installation and the maintenance of beacons. Accurate mounting of beacons is necessary for accurate positioning. So the most important drawback of the active beacon navigation systems is that they cannot be used in unstructured sites. 2 . Related Work on Motion Tracking 7 Corre and Garcia[1] present a sensory system that provides the position, attitude and speed of a mobile robot. A turning linear CCD camera attached to the vehicle, receives light from sources located on the site. The system derives the azimuth and height angles of the light sources. Consequently, the location and the speed of the robot at each reading of a beacon is. computed. With the rotational speed of one revolution per two seconds and three beacons, the position information could be updated three times in two seconds. To estimate the position at any instant, it is suggested that linear interpolation or a Kalman filter be used. Transitions Research Corporation, TRC has introduced a beacon navigation system, which calculates vehicle position and heading at ranges up to 24.4 meters, within a quadrilateral 24.4x24.4 m2 area defined by four passive retroreflective beacons [9]. A static, 15 second unobstructed view of all four beacons is required for initial acquisition and setup. Then, only two beacons must remain in view as the robot moves about. The system resolution is 120 mm in translation and 0.125° for rotation in the horizontal plane. The scan unit has a maximum of 1 Hz update rate. A dedicated 68HC11 microprocessor continuously outputs navigational parameters (x,y,0) to the vehicles on-board controller via an RS-232 serial port. 2.3.2 Landmarks Landmarks are distinct features that a robot can recognize from its sensory input. In general, landmarks have a fixed and known position relative to which a robot can localize itself. Landmarks can be categorized in two main groups: artificial and natural landmarks. Natural landmarks are those objects or features that are already in the environment and have other functions than just for robot navigation. In contrast, artificial landmarks are specially-designed objects or markers that need to be placed in the environment for the purpose of robot navigation. The main task in localization based on landmarks is first to recognize the landmarks reliably and then to calculate 2 . Related Work on Motion Tracking 8 the robot's position. Abe et a/.[10] proposed a vision navigation system based oh natural landmarks. The system recognizes outlets of air conditioning systems located on the ceiling as landmarks. They apply Fuzzy Template Matching to detect the landmarks in an image, in addition to a neural network for fast detection. The navigation system can then calculate the distance and the angle of the landmark from the map information. As a result, the present robot position and orientation is identified. Their experimental results show that with a system composing a cpu with two transputers and two cameras all recognition and distance measurements take about four seconds. They could derive satisfactory results with the robot's maximum speed of 2 km/hr. The above positioning systems require structured targets (landmarks), so they are not suitable for motion tracking in an unstructured environment. 2.3.3 Image Correspondence One of the most efficient vision-based approaches to motion tracking is by computing the correspondences between successive frames. Assume that any object existing in the scene represented by a model corresponds to a set of simple features in the image. A change in the configuration of features denotes a relative change in position of the object versus the camera. Therefore, a relation between the change of feature configuration and the change of model pose or camera pose should be determined. To relate the change of feature positions to the object or camera poses, two approaches can be taken. In feature-based approaches the relative motion can be estimated directly from the change of feature configurations. However, model-based systems form and test different model poses and find the one that has the best match with the particular pose corresponding to the image features. 2 . Related Work on Motion Tracking 9 Model-Based Image Correspondence: Model-based techniques include utilizing a model of an object in an environment or the environment's model to facilitate searching for a possible camera position. Since, this approach is able to exploit global attributes of the object in the scene [11], it can provide significant advantages over purely local methods for situations in which: • the environment is cluttered, • multiple moving objects exist, or • large motion of an object exists. Huang and Gillies[12] determine the position and direction of a camera on a land vehicle f moving on a curved road. They use the curved road's image to find the curvature of the road in 3-D space. Once the road's edge has been found, the position of the vehicle is inferred by matching the curvature of the real road, which can be extracted from map data, and the curvature of the measured road points, which are back-projected into the 3-D space. The achieved precision for the above position detection system is claimed to be 20 meters. However, it cannot be used for motion tracking in an unstructured environment. The tracking system designed by Verghese et a/.[13] is also a model-based approach. They need to have a model of the moving object(s) in the scene. The system also requires the temporal sampling rate of images to be high with respect to the rate of spatial changes in the image. Therefore, they obtain real-time processing through the use of parallelism. New object pose hypotheses are generated and tested in parallel by searching for the correct model pose in a neighborhood around the previous model pose. These hypotheses are correlated with a stream of edge-detected feature maps by performing a logical "and" between the edge points and the model points projected from each hypothesis. Usually, the model-based image correspondence systems require expensive computations. 2 . Related Work on Motion Tracking 10 So they do not obtain real-time speed. In addition, those systems initially require a model of the moving object(s) in the scene, which is not feasible in all navigation environments. Feature-Based Image Correspondence: Feature-based techniques, as opposed to model-, based techniques, do not require an expensive search strategy to locate an object. In feature-based methods, preliminary feature extraction can be employed such as edge or comer detection and then a few features can be tracked in a less expensive system than a model-based system. The feature-based approach was taken by Mandelbaum and Mintz[14] to solve the problem of localization in a partially known environment. They use a clustering algorithm to extract planar features from ultrasound data in the local coordinate system of a mobile robot. With a histogram-based function, correspondences are established between extracted planar features and model features. The pose estimation stage makes use of least squares estimation to yield a refined estimate of translation and rotation. Rapidly convergent gradient descent techniques are employed for orientation estimation. The comparison of this system to dead-reckoning shows that this algorithm reduces the localization error by almost 50% over dead-reckoning. However, the assumptions of having a partially known environment and initial knowledge of model features, limit the applications of this method. Zheng and Chellapa [15] have a feature-based approach to registration, for the estimation of 2-D translation, rotation and scale when the rotation of the camera and scale change between two successive frames are significant. A feature of their approach is that an illuminance direction estimation method is first used to obtain an initial estimate of camera rotation. A small number of feature points, in both successive images is then located based on a Gabor Wavelet model, for detecting local curvature discontinuities. Consequently a scale factor is computed at the lowest level of a hierarchical matching process, by pairwise matching of feature points. Finally, through 2 . Related Work on Motion Tracking 11 the coarse-to-fine feature matching, an accurate estimate of translation and rotation is obtained. With synthetic and real aerial images they could yield accurate results when the scale of the successive images differs by up to 10% and camera rotation is significant. The system requires one to assume that the illumination on the Martian surface is from the sun and is constant during the time the image pairs are taken. They have not reported the timing costs of the system. Ravela et a/.[16] have also presented a registration algorithm, capable of tracking an object through distinct aspects in real-time. It is assumed that they have a few model points of the object from different aspects and they are all stored in an aspect look-up table. Therefore to perform the registration for two successive frames no feature extraction is required. The whole algorithm can be summarized in the following steps: 1. The known model points are projected, with an initial pose or the computed pose obtained from previous image frames, to get the initial guess about the new location of model points. 2. The Rotation Compensated Normalized Cross Correlation (NCC-R) is performed for matching purposes. 3: With the use of Kumar's algorithm the pose including rotation and translation is solved. Kumar's algorithm is an iterative approach that minimizes the squared image plane distance from the data points to the projected model points. The Levenberg-Marquart method is employed to solve this nonlinear optimization problem. Meanwhile, outliers are detected and excluded with a median filter. 4. From the computed pose the new aspect will be achieved. Consequently, the new model points regarding to the new aspect will be extracted from the aspect look up table. 2 . Related Work on Motion Tracking 12 Figure 2.1 The interaction between components of Ravela et al. registration system As a result, they claim to have a stable system by using a median filter in pose estimation module, which detects and eliminates the mistracked features. They also claim to be capable of detecting 360 degrees out-of-plane rotation with a real-time performance on Sparc-2, when the speed of registration for six points is 7 Hz. The drawback of their method is that it is assumed that the camera is roughly pointed toward a specific modelled object throughout the relative motion, which is not always possible. Moreover, model points should be extracted manually and an aspect look-up table should be constructed off-line. 13 Chapter 3 Background on Image Registration This chapter will give a brief background on image registration and explains how a suitable registration method can be designed to meet the requirements of our problem. 3.1 Registration in Theory Image registration is a method of aligning images, comparing them and detecting their differences by finding the proper transformation. Image registration is used when one wishes to detect and/or remove variations. Generally variation is defined as a difference in the value of pixels and their locations between the images, and can be categorized into two groups: 1. Distortions: These are variations which are caused by the source of image misalignment, and do not permit true measurement in the images to be done. 2. Variations of interest: These are variations in the images which we would like to detect and measure after registration is performed, such as movements, growths or any other changes in the scene. So these variations are not to be removed by registration. During registration we would like to model the distortions and remove them by finding an appropriate transformation (or sequence of transformations). If we do that for images that are taken from different sensors or different viewpoints, they will be aligned. They can then be compared and any differences between them, which might be of interest, can be detected. Since 3 . Background on Image Registration 14 the objective of registration is to detect the changes of interest between images, it is important that images be matched only with regards to the misregistration source. Otherwise the changes of interest would be removed as the distortions are removed. In order to register images with variations of interest, there js a need to have enough knowledge about the misalignment source to model it. By defining the registration as: a mapping between two images both spatially and with respect to intensity, for 2-D images h(x,y) and h{x,y) we have: where / is a 2-D transformation that maps two images spatially, and g is 1-D transformation that maps the images with respect to intensity. However, finding the intensity map in real time is not usually necessary because it could be determined by sensor calibration techniques and then matching image intensities by using look-up tables[17]. So, a registration model can be simplified to: 3.2 Common Transformations According to the survey of L. G. Brown in 1992 [17], the most common general transforma-tions that can map images spatially, are: • 2-D Rigid transformation: This is composed of translation, rotation and scale change. During this transformation just a change in the position of the sensor can be modeled, when the relative shape and size of the objects in images remain unchanged : Ux,y) = g[I1(f{x,y))] (3.1) h(x,y) = h[fx(x,y)Jy(x,y)] (3.2) L2/2 J + 5 cos 0 — sin 6 x\ sin 9 cos 6 y^ (3.3) 3 . Background on Image Registration 15 where l x , 6 and S are a 2-D translation vector, a rotation angle and a scaling factor }y J respectively. • Affine transformation: This is a more general form of rigid transformation and more complex. As opposed to a rigid transformation it does not have the properties associated with the orthogonal rotation matrix. Therefore angles and lengths of an object are no longer preserved, but parallel lines do remain parallel. It can count for some general spatial distortions such as shear and changes in aspect ratio. The general 2-D affine transformation is: (3.4) 2^ «13 + V2 «23 a2\ a22 Perspective transformation: This is a mapping from 3-D to a 2-D plane. This transforma-tion accounts for distortions which cause an image to appear smaller the farther it is from the camera or more compressed in one direction the more it is inclined away from the camera. The perspective transformation can be given by: -fxQ • ', (3.5) zo - f -fyo zo- f where / is focal length of lenses, when camera is in focus for distant objects and (x0)y0, z0) are coordinates of object in the scene. • Projective transformation: In a special case where the scene is composed of a flat plane tilted with respect to the image plane, the projective transformation maps the scene to a tilt-free image plane: Vi = auxp + a12yp + Qi3 a3ixp + a32yp + 033 a21xp + a22yp + a23 a31xp + a32yp + a 3 3 (3.6) 3 . Background on Image Registration 16 where, a terms can be computed with regard to equations of image and scene plane. (xt,y,) and {xp,yp) are the image and scene coordinates respectively. • Polynomial transformation: This is a very general transformation which accounts for any distortion, as long as it does not vary too much over the image, such as some types of lens distortions. 3.3 Registration Methods and Trade-Offs In this section two basic registration methods, which are both efficient for finding small rigid transformations, are introduced— Fourier Phase Correlation and Spatial Correlation. 3.3.1 Fourier Phase Correlation This method is introduced by De Castro and Morandi [18] as an expansion of the conventional correlation between phase-only, or whitened, versions of two images to be aligned, so that this extension covers both rotational and translational movements. Moreover, it is strongly robust to distortions, such as those encountered with unsteady, time varying illumination. However, no scale change is allowed and as we will see this method needs some initial assumptions that cannot be provided in our problem. We consider an image plane A t performing rigid movements of translation and rotation within our observed field C. Then, sQ(x,y) can represent the image at reference time t = 0 and st(x,y) represents a replica of s 0 translated by x0 and y0 and rotated by 00 in the present image. Therefore we can have: st(x,y) = s0(u,v) (3.7) when: x — x0 = ucos60 — vs'm80 (3.8) 3 . Background on Image Registration 17 y — yo — u sin 90 + v cos 60 (3.9) In a special case, when we have only translation: s t ( x , y ) = s 0 ( x - x 0 ) y - y 0 ) (3.10) If we take the Fourier transform of both images: St{wx,uy) = e-'2"(u'x°+u»riSo(ux,a>y) • (3.11) and the ratio of the Cross Power Spectrum to its magnitude is: G(«x,wj,) = r M i = e-**(»-x°+u>>y<>) (3.12) Therefore, by inverse transforming G { u x , u y ) , a Dirac 6 distribution centered on ( x 0 , y 0 ) is obtained. So the translation can be found by finding the location of the Dirac function. However, if there is also a rotational transformation included, then G will be a function of u)x,Loy and 9: G ( ^ y , 6 ) = - 5?<"""v) . , (3.13) bo(u)x cos V 4- u>y sin tV, — o;x sin v 4- cos In the case of 0 = 0O (the rotation angle) the inverse of G becomes a Dirac 6 distribution, having its peak amplitude, centered on ( x 0 , y 0 ) : % , w ^ 0 ) = e - ^ I 1 + " * ) (3.14) Therefore, in order to find the transformation parameters, when there are both translation and rotation included, we should vary 0 and compute the inverse of G repeatedly until we obtain the peak of Dirac 6 function in F ^ I G ] . 3 . Background on Image Registration 18 Trade-Off This method is robust to low frequency noise, such as a slow change of illumination caused by weather conditions. In addition, it is an efficient and fast means of computation (using an FFT), with existing well tested programs. On the other hand, one of the weak points of this method is that peak amplitude of G at 6 - 80 cannot be determined by nonoriented search techniques, for instance, by successive halvings, unless the specific application suggests a sufficiently approximate first tentative value. Moreover this approach assumes that At is the only image region that is always fully contained in C, the observation field. However this assumption does not hold in many applications such as our problem. While the camera is moving, parts of the image taken at time t go out of the field and some new parts of scene come into the observation field of image at time t + 1. So, we cannot have just a specific image region, that is moving such as At, assuming that it is fully contained in the observation field of both images. 3.3.2 Spatial Correlation Methods Cross correlation is a basic approach to registration. By itself it is not a registration method. It is a similarity measure or a match metric which gives a measure of the degree of similarity between image and a template (assumed to be equal to or smaller than the image). However, there are several registration methods for which it is a primary tool. All those methods use a correlation match metric to find the location of corresponding features in two images. Then, by having the location of a feature before and after a variation, the related transformation can be computed. As in Fourier phase correlation, these methods are generally useful for images which are misaligned by small rigid transformation. In order to compare the existing correlation based methods, and selecting the most efficient one, we can look at the experimental results obtained by Hotz in [19] and described in more detail 3 . Background on Image Registration 19 in his technical report[20]. In those experiments he compared the correlation results obtained for various window sizes and the following four, one dimensional, correlation measures: Non-normalized cross correlation: • Normalized cross correlation: C i = Jrf,1^ (3-15) ~; ( 3 1 6 ) • Non-normalized mean squared differences C3 = ^ \:12Z (3.17) • Normalized mean squared differences: „_ E ( ( A - A ) - f e - ^ ) ) 2 ( 3 , 8 ) \li:(h-h)2-£(h-h)2 For evaluation of these correlation measures, the following quantities are computed: - Percentage of reference points for which a match is found - Percentage of reference points for which the correct match is found - Average difference between the computed and real disparities The results obtained from scores, C2,C3 and C4, are similar and much better than d . Among C 2 , C 3 and C4, C2 and C4 are more efficient than C3 because C2 and C4 are insensitive to variations in the mean intensity value of images. Between C4 and C2, C4 is preferred for a faster solution, because it requires fewer multiplications, which is an expensive operation. It should be noted that, for faster implementation, all squaring computations can be avoided by using look-up tables, in which the square of all values, representing the image intensity, are stored. Also, the square root computation can be eliminated by computing (C4f instead of C4. 3 . Background on Image Registration 20 Generally, for template T and image I the two dimensional normalized mean squared differences function measures the difference between the template and the image after each translation (dx,dy). The translation, whose the mean squared difference is the smallest, specifies how the template can be optimally registered to the image: . E E {(T{*,y) - MT) - (70 - d x , y - d y ) - N)f C(dx,dy) = - T x y (3.19) J E E (T(x,y) - UT)2 E E Vi* - d x , y - d y ) - Ul)2 y x y x y Where U-T and m are average intensity of the template and image over the correlation window. The selection of the size of template, or correlation window, is also very important. By using large windows more matches can be found but, the finer details are smoothed and, in effect, resolution is reduced. Trade-Off An important reason why the correlation metric is chosen in many applications is that it can be computed using the Fast Fourier Transform (FFT). The use of FFT becomes most beneficial when the image and template to be tested are large. In addition, it is robust in presence of white noise. On the other hand, only cross correlation before normalization may be treated by the FFT. In addition, the computational cost for spatial correlation based methods grows proportional to the number of transformations. However, spatial correlation is still a preferred method for our problem, because in such cases with global transformations, measures can be computed on a few features instead of the whole image area. This significantly reduces the computational cost and needed memory. In addition, it is possible to illuminate the scene area captured by the camera, in order to avoid 3 . Background on Image Registration 21 any frequency dependant noise. All we should consider is allowing for white noise caused by imperfect photographing, which can be tolerated by the correlation metric. 3.4 Registration Components Generally registration methods can be viewed as a combination of four major components: • feature space • similarity metric • search space • search strategy In this section we study the different choices for each component and choose them so that they meet the requirements of our problem. 3.4.1 Feature Space To find what transformation is efficient for registration, the first step is to compare the images and find the differences between similar features. The choice of feature space determines what can be matched. According to the survey done by L. Brown [17], common feature spaces that are used for matching are: • Raw pixel values, such as intensity. Even though they give good information about the image, their computation is costly and time-consuming. • Edges, Contours and Surfaces are intrinsic structures in an image, less sensitive to noise and fast. However, edge points are not easily distinguishable for matching. • Salient features, such as comers, line-intersections and points of high curvature, which are easily distinguishable, have accurate positioning. 3 . Background on Image Registration 22 • Interesting points in absence of shape or curves interesting points are found such as points of greatest local variance. • Statistical features are measurements over a region of an image which represent the evaluation of the region. The region can be a result of a preprocessing segmentation step. Examples are Moment Invariants and Centroid/Principal axes [21]. They are used very often because their values are independent of coordinate system, however, their computation is costly. • High level descriptors, such as Topological, Morphological and Fourier transforms. They are used because they are more unique and discriminating. In order to select a good feature space the following issues should be considered: - to reduce the effect of noise - to extract proper structures of the image to be matched to reduce the computation cost by increasing the precomputations and/or decreasing the similarity measures Features can be found in each image independently in a preprocessing step. This reduces the amount of data to be matched each time and consequently reduces the cost. In our approach, images taken of the ground might lack objects with special structures. So, we need features that are distinguishable from unstructured objects and have accurate positioning. Moreover, we require features to be detected with low computational cost. With respect to the above constraints, corners are selected as the elements of an optimal feature space. 3.4.2 Similarity Metric A similarity metric determines how matches are rated. Since feature space is precomputed 3 . Background on Image Registration 23 on each image before matching, the similarity metric is computed using both images. The most common similarity metrics are: • Normalized Cross Correlation:Th\s is accurate for white noise but not tolerant of local distortions. Sharp peaks in correlation space are difficult to find [21]. • Fourier invariance properties: Phase correlation and other Fourier invariant approaches are tolerant of frequency dependant noise [18]. • Sum of absolute differences: This is an efficient computation, good for finding matches with no local distortion. • Sum of squares of differences of curves, surfaces: This is more robust to local distortion, and not as sharply peaked. • Number of sign changes in pointwise intensity difference: This is good for dissimilar images. • Higher level metrics: These optimize matches based on features or relations of interest. The Feature space and Similarity Metric stages are both responsible for ignoring the un-corrected variations, and detecting and matching the corrected distortions. So, if noise or any other uncorrected variation is not extracted in the stage of selecting the feature space, it should be done in similarity measurement phase. In other words, with a proper similarity metric the remaining uncorrected variations should be ignored. However, if they are extracted in the fea-ture space, this task is done in a single step precomputed independently on each image prior to matching. This performs the registration faster than the case of extracting noise in the similarity measurement phase. For example, Fourier methods like phase-correlation, which are more noise tolerant, can be used on raw images, when there is frequency dependant noise[18]. However, correlation as a similarity metric is optimized for exact matches (without noise); so it requires a 3 . Background on Image Registration 24 feature space such as edges to be robust to noise. Therefore a preprocessing step is required to detect the edges, followed by edge-correlation. For our registration method, normalized mean squared differences as a correlation based similarity metric, is used to match comers which are used as our selected features. It will be seen later that noise and false matches are eliminated with other strategies, such as match verification and error recovery in least squares estimation. 3.4.3 Search Space The search space is a class of transformations from which we would like to select the optimal transformation (translation and rotation) to align the images. By reducing the size of the search space, the number of similarity measures to be computed can be reduced . For example, if we limit our search space to just translation, when correlation is the similarity metric, there is only a need to correlate a single template at all possible shifts to find the correct translation. However, if there is larger search space for more complex misalignment or local variations, finding the optimal transformation is more difficult, time-consuming and costly. Regarding the definition and requirements of our problem, we have rigid transformations, containing 2-D translation, rotation and small scaling, in a continuous mode between sequence of images taken from a moving camera. 3.4.4 Search Strategy The search strategy decides how to choose the next transformation from the search space, to be tested for the optimal transformation. The choice of search strategy is determined by the characteristics of the search space including the form of the transformation and how hard it is to find the optimum. 3 . Background on Image Registration 25 Examples of common search strategies and the kind of problems for which they are used, as explained in more detail in the survey done by Brown[17], are as follows: • Hierarchical techniques: Applicable to improve and speed up by guiding the search through progressively finer resolutions. • Tree and Graph Matching: Uses tree/graph properties to minimize the search. Good for inexact and higher level structures. • Linear programming. For solving systems of linear inequality constraints. Used for finding a rigid transformation with point matching with polygon-shaped error bounds at each point. • Dynamic programming. For finding local transformations when an intrinsic ordering for matching is present. • Hough transform: For shape matching of rigidly displaced contours. • Relaxation: For finding global transformations when local distortions are present. • Decision Sequencing. For similarity optimization for rigid transformations. Each strategy has its own advantages and disadvantages. However, for each specific problem there are some issues that should be considered in order to find a proper approach: - How does the strategy deal with missing information? - Does the strategy make any assumption? - What are the typical computations and storage costs? By considering the above constraints and the trade-offs of different strategies, a hierarchical search strategy is chosen for our registration problem. 26 Chapter 4 Motion-Detection Algorithm The goal of this chapter is to provide an overview of the principal procedures of our algorithm for estimating any 2-D transformation existing between two sequential images. Figure 4.1 shows the block diagram of our algorithm. At the very beginning of the algorithm, two (w x w) sub-images will be generated from the original successive images (W x W). We match sub-regions that contain informative matching features. The future location of these sub-regions in an image sequence are predicted from the estimated kinematics of the camera motion. This reduces the search space and preserves matching targets with high image detail. Then, a coarse-to-fine resolution pyramid is generated for both images 11 and 12. This stage of the algorithm provides our hierarchical image registration algorithm with three levels of resolution for each image. The coarse-to-fine registration algorithm, which will be explained in detail in the next chapter, is used to estimate the transformation between two images within subpixel accuracy. Hierarchical feature extraction, correlation matching, match verification, least squares estimation, and error point rejection are components of the presented registration algorithm. Since the input images for the registration algorithm are the generated sub-images, and we need to obtain a transformation between the original images, the initial transformation parameters used for sub-image generation are also used in the registration procedure to refine the transformation results. 4 . Motion-Detection Algorithm < Grab Image 11 • Grab Image 12 • ( Image i Q Image I 2 ^ « Subimage ' Subimage Generation Generation T Resolution Resolution Pyramid Pyramid Generation Generation J (3-level Resolution ] f 3-level Resolution) Pyramid for 11 J \^ Pyramid for 12 J Hierarchical Registration & Estimation Refinement 1 (Final estimated^ fc ( l n i t i a l estimation\J ^transformation J \ for next step J I Grab another Image 13 11 =12 I 12= 13 Figure 4.1 Flowchart of motion-detection algorithm 4 . Motion-Detection Algorithm 28 Finally, velocity prediction isdone to provide an initial estimate for sub-image generation of the next image pairs. For continuous motion detection another image is taken of the ground and the motion-detection algorithm is performed on a new pair of images. In the new image pair, the transformed image 12 of the previous stage is considered as the new 11 or nontransformed image and the last grabbed image is considered as the new transformed image or 12. An advantage of this algorithm is that it reduces the feature space size in various stages. This is fulfilled with the aid of a rough approximation in multi-scale registration and velocity prediction based on expecting small changes of velocity in short intervals of image analysis. Consequently the motion-detection algorithm can be used in real-time applications. The reduction in computation by reducing the feature space might be expected to result in decreased accuracy. However, we will show that satisfactory results are achieved even with this sparse input information. This could not be accomplished without the aid of match verification and false match recovery procedures that will be presented in the next chapter. 4.1 Motion Model Assume that (x^y^Zi) are the 3-D coordinates of an object in the scene and (XhYi) are 2-D image frame coordinates both measured with respect to the position of the camera at time f,. Then the general projection equations are : 'x{ = i Xi y. t = {l,2} (4.1) where, / is the focal length of camera. The transformation between two 3-D camera locations in the scene can be approximated by: %2 V2 Z2 cos 0 — sin 0 0 sin 0 cos 0 0 0 0 1 Xi z\. + Ax Ay Az (4.2) 4 . Motion-Detection Algorithm 29 where 0 is the rotational angle, Ax, Ay and Az are displacements of the camera in the 3-D world. So: x2 = X\. cos 0 — t/i. sin 0 -+- Ax y2 = x-i. sin 8 + y^. cos 8 + Ay (4.3) z 2 - z\ + Az Therefore: X2 = - X 2 C O S 8 — yy sin # + A z T^+Az 2 1 - ( X 1 c o s f l - y l S i n f l ) + / ^ A X A ^ (4.4) Similarly: z : + A z v 1 1 ; ' zx-\- Az Y2 = —i /2 -*2 sin 0 + yi cos (9 + Ay * Zj + A z 2 1 - ( ^ sing + y- l Cosg) + / A \ (4.5) + A z v - 1 ; ' Z! + Az So: *2 ^ 2 = s cos 6 sin 6 - sin 8 cos 0 + AX AY (4.6) Where the scaling factor, 5, stands for the changes in the vertical distance of camera from the ground7: S = zi + Az 2 2 (4.7) ' In our project, even though, we deal with 2-D motion of a vehicle, by considering a small scaling factor, we can handle the possible small vertical movements (shaking) of the vehicle 4 . Motion-Detection Algorithm 30 and AX and AY are 2-D translations of the camera origin in the image plane: A * = / - ^ r -Ay = / — y — z1 + Az Therefore by determining AX, Ay and ,5, v = (Ax,Ay,Az)T can be determined as follows if z-i, the initial height of the camera is available. (Note that by calculating S, any small vertical motion of the camera is also taken into account in order to improve the estimation of the 2-D transformation.) Az = jAX Ay = ^ -Ay (4.9) Thus the unknown 2-D translation and rotation of the camera can be estimated by detecting the 2-D translation, rotation and scale parameters, existing between successive images taken from that camera via Image Registration. 4.2 Sub-Image Generation The main reason for the generation of sub-images from the original 11 and 12 full resolution image sequence, is that obviously less computations are required for their analysis than for the original images'. In building of a (w x w) sub-image from an (W x W) original image where W > w, there are two concerns. First, in order to obtain the most possible information from sub-images, it is required to include those regions of the original images that have the most overiapping features in them. Moreover, to avoid possible lack of features in a certain region and/or false information provided by unexpected motion of an unknown object in a certain part Of the scene, it is preferred to have well distributed features from all over the image. 4 . Motion-Detection Algorithm 32 velocity prediction stage, to obtain the coordinates of the upper left corner of the corresponding region in the transformed image, 12. To clarify this matter, assume that the upper left comer of each region p. in the sub-image S1 is located at (Ilcp,Ilrp) in the original image 11 and similarly the upper left comer of each corresponding region in the sub-image S2 is located at (I2cp,I2rp) in the original image 12. For the generation of the sub-image S1 we have': Then an initial transformation estimate such as T 0 = (xo,y0,60) is used for the generation of the sub-image S2. For this purpose, the upper left corners of the 11 regions are transformed to obtain the location of upper left comers of the corresponding regions in the image 12. So, for four regions of p - 1..4 : Figure 4.2 shows the selected regions of images 11 and 12 and the consequent sub-images when the initial estimate vector is T 0 = (10,5,7) . The difference between the sizes of the inner window, m, and the original window W', corresponds to the maximum amount of displacement that might be predicted. So, in order ' The image frame origin is located at the centre of the camera-lens/ image (4.10) J2cp = x0 + Ilcp cos 0O + Ilrp sin 0O (4.11) I2rp = -y'o + Ilrp cos 0O - Ilcp sin 0O 4 . Motion-Detection Algorithm 33 to avoid any out-of-image regions being selected from the image 12, we should have7: W - ra —-—> Max{dx,dy) (4.12) Where: TTl 71% dx = xo + y (cos 0O + sin 90) - — (4.13) m . m dy = y0 + y ( c o s #o - sin BQ) - — In the above computations it is considered that the biggest displacement caused by any trans-formation Tb = (zo,2fo,0o) in an (m x m) image is at the comer points: ( ± y , ± y ) 2 . 4.3 Resolution Pyramid In order to have a fast registration algorithm, a hierarchical approach is taken. There are many different methods for generating a resolution pyramid; such as the Gaussian pyramid based on a weighted averaging proposed in [22]. We did tests on both weighted and non-weighted averaging methods. Finally, we concluded that the non-weighted averaging method is much more appropriate for our purpose, since it provides adequate filtering at less computational cost. Suppose an image is initially represented by the array 50 which contains C columns and R rows of pixels. Each pixel represents the gray scale at the corresponding image point. This image becomes the zero level of the pyramid. Pyramid level / contains image 5/which is a reduced or low pass filtered version of Each value within level / is computed as an average ' Usually it would not really matter if a small portion of image falls out of the window. However, for a more robust estimation we prefer not to have an out-of image region. 2 In the rest of our algorithm the 11 and 12 images will refer to the generated S1 and s2 sub-images. 4 . Motion-Detection Algorithm 34 of values in level / - l within a KxK window, where K is the scale-down ratio. A hierarchical representation of this process, for K=2, is shown in Figure 4.3. '—. — • ^ ^ *-~~~ —' —. J 9 . 9i+i 9i+2 Figure 4.3 Generation of Resolution Pyramid The level-to-level averaging process is performed by the function average: gi = average(gi_i) which means that for level 0 < / < TV and nodes (i, j) : 1 1 1 9i(hj) = ^9i-i(K* + m,Kj + n) m—On=0 0 < i < Ci 0 < j <Ri where: K = the scale — down ratio N'= the number of resolution levels (4.14) (4.15) 4 . Motion-Detection Algorithm 35 Figure 4.4 illustrates the contents of resolution pyramid generated by the average function for K = 2 and N = 3 . Figure 4.4 Resolution Pyramid generated based on non-weighted averaging In our hierarchical approach, three levels of the pyramid (as will described later in Section 6.3) are generated. At the first iteration of the registration algorithm a pair of the coarsest images at level g2 are registered in order to provide a rough estimate of the transformation parameters. In the next iteration that estimate is used as an initial estimate for registration of the next finer images which are at level a/j. Finally to obtain the finest estimation, the registration algorithm is performed on images, go, based on initial estimate obtained from previous iteration. 4.4 Hierarchical Image Registration Since image registration is the principal function of the motion-detection algorithm, it has been explained in more detail in the next chapter. As an overall description it can be said that in this function, the existing two dimensional transformation, including 2-D translation, rotation and 4 . Motion-Detection Algorithm 36 scaling, between two images can be detected. An initial parameter of the maximum range of the search window can be defined with respect to the expected size of transformation. Obviously the bigger the transformation that is expected, the slower the image registration function will be. However, with the aim of velocity prediction, and consequently generation of almost similar sub-images from both original images, 11 and 12, we could expect a small transformation between two sub-images. Therefore, the selected size of the search window for matching can be small. It will be mentioned again later, that it is just 3x3. By having a small search window, we can expect to have real-time image registration. It should also be mentioned that the estimated transformation resulting from the registration of sub-images should be refined to produce a real transformation of the original images. Thus, the initial estimate used for the generation of sub-images is another input for the Registration function to refine the final transformation. 4.5 Velocity Prediction Knowing where an object is in an image frame allows a good estimate of where it can be. in the next frame, provided that the interval between frames is not large. In order to reduce the size of the search window for matching, the estimated transformation of each motion step is used to predict the amount of transformation of the following step. Since imaging and image analysis is expected to be done in real-time, it can be assumed that the velocity of motion between two successive frames is almost constant. Therefore, a zeroth order prediction is used. In other words, the transformation between two proceeding frames is used in tracking the features in the current frame. To make the registration simpler, this prediction is used to select similar regions of two images in generation of sub-images, as it has been explained in Section 4.2. Then, the registration algorithm is performed with no initial estimates, except the ones that are achieved internally in the hierarchical registration process. 37 Chapter 5 Hierarchical Image Registration Registration is done when two images need to be aligned with one another so that differences can be detected or when an optimal match for a template is found in an image. In these cases a transformation is found to relate the features of one image to their corresponding points in the other. Some of the feature-based algorithms such as [15] extract features of both images and then perform matching procedures on them. However, in our approach feature extraction is performed only on the non-transformed image, to detect some features to be tracked in the transformed image. The match pairs will be found by doing the correlation matching on the features over their corresponding predicted search area in the transformed image. Figure 5.1 illustrates the flowchart of the presented registration algorithm. In our hierarchical approach we perform feature extraction, correlation matching and least squares estimation at ajl of the resolution levels. As has been shown in the Flowchart in Figure 5.1, the results of each level will aim to predict the search area for feature extraction and correlation matching at the following level. In Feature Detection, the location of extracted features of a coarse level will be an initial guess for fast detection of features in the next finer levels of resolution. Moreover, the rough estimate of the transformation vector in a coarse level will lead to an initial estimate for matching process in the next finer level. Obviously, for the coarsest level no prior knowledge about translation, rotation, and scaling is available, so an initial estimate is obtained by processing the whole image for feature extraction and performing correlation matching on all extracted points of one image with their pairwise neighbors in the other 5 . Hierarchical Image Registration 38 Coarsest/ Next finer level of 11 coarsest / Next finer level of 12 Feature Extraction I . Initial I estimation \ for next leve> J ^ - i^nteresting points^-• Feature transformation Correlation Matching 4 l_. Ksearch space inN Image 2 J Match Verification / Initial estimation N \ for next level ' If finest level : Subpixel Correlation Location Refinement for match points I f Subimage_ N ^ — 1 Generation i \ Initial estimate } Least Square Estimation rough estimated transformation If the finest level: Discarding errors Final estimated transformation Figure 5.1 Flowchart of Hierarchical Image Registration image. The processing of the coarsest level is quite fast because: Firstly, only a small coarse image is processed for feature extraction, Secondly, in the coarsest resolution features have so small a displacement that their match can be found within a small range of search window. 5 . Hierarchical Image Registration 39 After extraction of a small number of feature points from the non-transformed image, 11, based on Moravec's comer detection, they are transformed with the initial estimation to the 12 image frame. This helps to predict the location of their corresponding matches in 12. Consequently for each feature of 11 a search region in 12 will be obtained around the predicted match point. Accordingly, the search window for matching can be minimized in size to reduce computations and mismatches. Correlation matching is next performed on each feature point of the 11 image and its corresponding search region in 12. Then by using the validity criteria (see Section 5.2), gross errors of correlation matching are eliminated. It should be mentioned that for acquiring a sub-pixel accuracy, at the finest level of resolution, sub pixel correlation is performed. The unknown translation, scale and rotation parameters are identified by accumulating the coordinates of matched feature pairs as coefficients of a set of linear equations and then performing the least squares estimation. Finally, for false match recovery, at the finest level of resolution, the components of matched feature pairs with a high residual of least squares fit are discarded. 5.1 Feature Extraction For various applications, different features are used to detect informative patches of an image, such as: line intersections [23], centers of windows having locally maximum variance [24], combined comers and edges [25] and comers [26]. To enable explicit tracking of image objects to be performed, we extract the comer features, because they are discrete and invariant to the scale and rotational changes[27]. Moravec's comer detector proved to be simple, computationally low cost, fast and adequate for our purpose. 5 . Hierarchical Image Registration 40 To study the basic mathematical form of Moravec's comer detector assume : / : Image Intensity • f l within a specif ied rectanglar region • W : Image Window = < [ 0 elsewhere V Wu>v G Search Space For each of four directions resulted from x,y displacements' given by : ( a ; , y ) e { ( 0 , l ) , ( - l , l ) , ( - l , 0 ) , ( - l , - l ) } . a regional intensity-change is defined as: ^x,y — ^ ^ ^^u,v \ Iy.-\-x,v+y -^ 11,111 (5-3) u,v then Wu,v is a comer patch if: min {E} > threshold (5.4) Moravec's corner detector considers a local windowed patch, W, in an image and determines the changes of regional-intensity that result from shifting the patch by a small amount in various directions of x, y . The computed value will be assigned to an EXiV corresponding to that direction. If the minimum of all different EXiV for a patch is higher than a threshold then that patch is considered as a comer. In other words this comer detector simply looks for local maxima in min {E} above a threshold. Therefore, if the patch is flat in intensity then all shifts will result a small change of intensity. If the windowed patch contains an edge, the shifts perpendicular to the edge direction will produce a big change, the ones along the edge will result a small change ' With this set of x,y displacements, comers with a certain orientation will be detected. Since the orientation of comers is not important to our purpose, we have selected them arbitrarily. 5 .. Hierarchical Image Registration 41 of intensity so the m'm{E} remains small. Finally, if a comer or a discrete point exists in the windowed patch, the result of shifts will be big in all directions. Therefore, min{£} will be big enough to be above a threshold. Each level of our hierarchical approach is similar to Moravec's comer detector'with some differences. One of the differences is that, just in the first step, where features are extracted from the coarsest image, the search space is the whole image. However, at the next finer level the search space will be reduced to a few small rectangular regions, corresponding to the comers found at the coarsest level. So generally it can be said that, first, at the coarsest level, the whole image (gN-i) is searched for comers. Then, the position of each of the extracted corners from the coarsest level, are used to estimate a small rectangular search space in the following finer images (gi). The above calculation can be shown in the following mathematical form: /or each corner : gN-i(iiJ) - ' i * ( A ^ " 1 - ' ) ) < col < i * ( A ' ^ " 1 - ' ) + l j - 1 search space in level I : V gi(col,row) | < & - [ 'j* ( A ^ - 1 - ' ) ) < row < j * ( A ' t ^ - 1 - ' ) + 1) - 1 (5.5) where: 0 < / < .V - J N : number of resolution levels' in pyramid K : the scale down, ratio of the resolution pyramid 5 . Hierarchical Image Registration 42 32x32 b) 64x64 c) 128x128 d) original 128x128 sub-image level 0 level 1 level 2 Highlighted patches are the pre-estimated search spaces for feature extraction and white pixels are the extracted features. By using the location of each extracted feature in level 0 of pyramid, it is possible to predict a search space for a possible feature in the following finer levels. Figure 5.2 Hierarchical feature extraction with aim of search space estimation For each of the finer images gi-k, Moravec's corner detector is used again to extract the location of that corner within the pre-estimated search space. Figure 5.2 illustrates the pre-estimated search space and detected corners of the three levels of the resolution pyramid. In comparison to the methods that analyze the whole image for feature extraction, our method that predicts a corner search space based on the location of extracted corners at the coarser level, requires much less computation. On the other hand, in using Moravec's corner detector for hierarchical feature extraction it is preferred not to shift a large corner detector window, W. To clarify, suppose that a shifting window of a coarse image results in recognition of a corner. However, it doesn't necessarily mean that in a higher resolution of that image, within the estimated search space, there will be 5 . Hierarchical Image Registration 43 (a) w contains a (b) none of w1 ,w2 or w3 corner in coarse contain a corner in fine image image (c) detected features by shifting pixel instead of Window patch in fine image Figure 5.3 Use of window patch versus pixel in hierarchical corner detector a same sized corner detector window that would detect a corner. At a higher resolution, in fact we zoom into the corner. Even though, the whole zoomed area is a corner, the region itself contains finer details that cannot be recognized as corners by the use of a window patch. Figure 5.3 illustrates an example of a corner in a coarse image which cannot be detected at its higher resolution by using window patches in Moravec's corner detector. One solution in the hierarchical process is that the corner detector window should be scaled as well as the images. However, the objective is to find adequate corners at all levels of resolution, but not necessarily the same corners. Therefore, it is preferred to determine the intensity change of a single pixel, rather than the intensity change of a window, which also requires much less computations. In other words, in Moravec's Corner detector we use a 1x1 window patch. In Figure 5.3.C, corners that can be detected, by considering pixels instead of window patches, are pointed to. 5 . Hierarchical Image Registration 44 Figure 5.4 illustrates the extracted features of real images resulting from both methods: considering intensity changes resulted by shifting of a) 3x3 windows and b) pixels (1x1 windows). a) Extracted features in hierarchical process, b) Extracted features in hierarchical process, using 3x3 window in Moravec's corner detector usinglxl window in Moravec's corner detector 64x64 6 4 x 6 4 32x32 32x32 Figure 5.4 Feature Extraction results on real images by use of Window patch versus single pixel detector 5 . Hierarchical Image Registration 45 It can be seen that when the patch W is 3x3, only a few corners in the coarse image can be detected, and in images with fine detail, not much change in the patch-intensity will result, by performing small shifts on window patches. On the other hand, a small amount of shift performed on a pixel in a comer position, will be sufficient to result a big change in the pixel-intensity. Therefore, by using our method, in all image resolutions, even where we have very small and discrete objects, a large number of features can be detected. Obviously, this method does not necessarily determine the precise location of a comer and might be too sensitive to intensity changes. However, since we are just looking for some informative patches rather than sharp location of comers, it is wise to consider this method, which provides adequate information for our purpose with relatively low computational cost. It should also be mentioned that in the presence of noise, any information regarding a false comer will be ignored later, with the aim of performing correlation matching, match verification and false information rejection. Corners are ranked based on their corresponding min{£} . To make our feature extraction compatible with various image types, in other words, to extract sufficient features from both highly textured images and images with less detail, a very small threshold has been selected. However, since just a few comers are adequate to track the image motion, another function will choose just a specific number of the most significant ones. If the total number of extracted comers are bigger than the number of features that are necessary, then a function will select the most significant comers based on a Quick Search method introduced in [28]. If the total number of extracted comers at the first point has been less than what is required, then, all the extracted comers are used. 5.2 Hierarchical Correlation Matching A correlation match metric provides a similarity measure score for two points. The points 5 . Hierarchical Image Registration 46 which result in the highest similarity score will be considered as match pairs. Some stereo matching algorithms such as [19], attempt to compute a similarity score for every point in an image. For each point, this is done by taking a fixed window in image 11 and a shifting window, along the epipolar line, in the image 12. These approaches require expensive computation while they perform correlation over the whole image. However, in our approach, we eliminate the computation by not only performing the correlation matching on only the interesting points as a subset of 11, but also eliminating the matching search space for each interesting point to a small (pxp) rectangular region in 12. Each square search region is centered at the location of a predicted match corresponding to a feature point of 11. In a coarse-to-fine hierarchical matching process, at each level, the location of a predicted match is computed based on a transformation estimate, resulting from the previous resolution level. Thus the size of a square search region corresponds to the required offset from a predicted match location. Since at the coarsest level no initial knowledge about the transformation is available, the initial parameters (AX, AY,6, S) are assumed to be (0, 0, 0, 1). Therefore, the search region for each interesting point of 11 is a square region centered at the location of its pairwise point in 12. However, at each of the following levels of the resolution pyramid, the transformation parameters obtained from the previous level, /, are converted to the current resolution level, / - 1, by using the following equations: AA',_i = AAA', A l 7 , . , = KAX, , . e,^=et = e (5.7) = Si = S K : scale down rate of resolution pyramid — 2 5 . Hierarchical Image Registration 47 Then, each interesting point of 11 at the current level of the pyramid is transformed with the above parameters to obtain the location of the predicted match in 12. Therefore for each extracted feature ( z / - i , 3 / i _ i ) , of image 11 at level / - 1, a predicted match point ( X ; _ 1 , V / _ 1 ) is computed as follows: Later in the implementation chapter it will be explained how the size of the search window and correlation window has been selected. In summary, we obtain a similarity score for every interesting pixel of the image 11 by taking a fixed window centered at the pixel's location in the image 11 and a shifting window centered at the location of predicted match point in the image 12. The second window is moved within a pxp search region by integer increments along the vertical and/or horizontal lines and meanwhile an array of similarity scores is generated. A measure based on normalized mean square differences of gray scale is used for similarity score generation. To obtain, Sc(x,y), the score of the best match for each interesting point, the smallest normalized difference square is computed as follows: Sc(x,y) - min{C} (5.9) E E ((^ (x, y) - pr) - (Hx ~dx,y- dy) - /x/))2 C(dx,dy)= * v (5.10) , / E E m*,y) - M T ) 2 E E (/(* - d x , y - d y ) -y x- y x y cos 0 sin# — sin 9 cos 8 cos 8 sin 9 — sin 9 cos 9 x ^ \ {(AXl (5.8) 5 . Hierarchical Image Registration 48 Where T represent the fixed window in 11 and / represents the shifting window in 12 and pT and HI are average intensities of fixed and shifting windows. Obviously dx and dy represent the disparity between the centre of the fixed window and the centre of shifting window. 5.2.1 Match Verification In previous sections we described how we generate a dense map of interesting points propagated across the image even in a featureless area. After detecting a best match for each interesting point of the nontransformed image, 11, with a correlation-based matching algorithm, gross matching errors, resulting from false corner points or correlation errors, are eliminated by using a validity criterion used by Pascal Fua [19]. The validity check function improves the matching algorithm so that when the correlation fails the algorithm usually rejects the matched pair rather than yielding an incorrect match pair. In fact a valid disparity measure is defined in which two images play a symmetric role. To find a valid match pair the correlation is performed two times. Once, matching from image II to 12. Second time, in opposite direction, from image 12 to 11. If in both iterations the same pair is achieved as having the highest similarity score, then those points are valid matches. R'2 R1 R'1 P'1 • -P1 • invalid R2 -• 32 11 . 12 Figure 5.5 Match verification process 5 . Hierarchical Image Registration 49 As a better illustration (see Figure 5.5), assume a fixed correlation window centered at an interesting point p1 in 11. Suppose that in matching from 11 to 12, point p2, among all points of the search region R2, is determined as the best match. To verify the validity of that match, in the second matching iteration, a fixed correlation window is centered at P2 and the centre of the shifting window is moving within the search region R1 around P1. Points P1 and P2 are considered as valid matches, if and only if, in the second iteration of matching, point P1 is selected as the best match for P2 as well. Figure 5.6 Behavior o f Correlation Matching wi th Match Val id i ty Check on real images Thus, with the validity checking we can reject the false matches achieved from images that are suffering from noise or occlusion. Figure 5.6 shows this behavior, using real image pairs. All of the valid match pairs in 11 and 12 images appear in white and those match pairs that did not pass the validity check criterion appear in black points. In this example some extracted points of the image 11 are occluded in 12. Even though some random points of the occluded area in the image 12 are detected as best matches, after the reverse correlation-matching iteration, almost 70% of the false matches are rejected. 5 . Hierarchical Image Registration 50 Since these images are highly textured and we want to run the correlation matching with the least possible computations, a 5x5 correlation window size has been used. Our experi-ments show that the proposed validity check function adequately discriminates the existing false matches. 5.2.2 Subpixel Correlation In order to obtain an optimal match location with subpixel accuracy, at the finest level of resolution, level l0, we fit a second degree curve into three correlation scores, the best score Sc0 and its two neighbors 5c_ and 5c + . Then an optimal match location is computed by using a simple quadratic estimator introduced by Dvomychenko [29]: 6 = (Sc+ - Sc_) 2(2 * Sc0 - Sc_ - Sc+) (5.11) estimated peak SO true peak Figure 5.7 Sub pixel sampling by using quadratic estimator 6: estimates location of real peak Sc in terms of the sample scores and referenced Sc0 5 . Hierarchical Image Registration 51 To obtain both coordinates of an optimal location of a match, the estimator is used twice. In order to compute the x coordinate of the optimal point, the similarity score Sc0 corresponding to the matching pixel and scores ScR and ScL corresponding to the immediate right and left neighbors are taken. Similarly Sc0 and the scores corresponding to immediate up and down neighbors, Scv and ScD, are used to calculate the vertical coordinate of the optimal match. Supposing (x0,y0) is the location of the estimated match corresponding to the highest similarity score then Sc0, the location of the optimal match point (x,y) with subpixel accuracy can be computed as follows: where: 6x = Sy = X — XQ + 6x y = yo + 6y (ScR - ScL) 2(2 * Sc0 - ScL - ScR) (Sep - Scy) (5.12) (5.13) 2(2 * 5c 0 - Scv - ScD) 5.3 Transformation Estimation Motion of a rigid body can be defined as a transformation, that is the product of a rotation by 6, a translation by (Ax,Ay)T and a scaling by S. This transformation is described as vector x = (Az , Ay,5cos6>,Ssin0)T, such that it transforms each arbitrary point (u,v) of an image II to the location (x,y) in image 12, after the body has moved. As has also been stated in equation (4.6) the motion model can be given by the following set of equations: 'x ' = s •y • • 0 cos 6 — sin 8 sin 8 cos 9 + u V Ax Ay S cos 8 S sin 8 'Ax Ay (5.14) 5 . Hierarchical Image Registration 52 or b = A.x (5.15) A transformation vector is expressed as a linear matrix operation as Pope has also done [30]. Obviously, with additional pairings of match features the problem of transformation esti-mation will be over-constrained and can be solved with linear or recursive estimations. Ayache and Faugeras[31] have suggested a recursive method based on generation of a number of hy-potheses. The one-by-one generation of hypotheses after achievement of each matched feature pair will improve the quality of estimation. However, we elected to use least squares estimation since it requires less computation and simpler mathematics. For a set of N matched points in two images we have 2N equations with four unknowns: x0 '1 0 UQ -v0 yo 0 1 v0 U0 1 0 -v1 2/1 = 0 i 1 0 —UJV-1 -VN-i - .0 1 or : b H A„.x Where a - Ax , b '= Ay , c = Scos0 , d - Ss'm6 and for each feature i , (ut,vi) is its location in the image 11 and (z,-,^) is its corresponding match location in image 12. Then the least squares estimation is: x = ( A J A H ) - ^ (5.17) and transformation parameters are estimated as: • Ax = a (5.18) 5 . Hierarchical Image Registration 53 Ay = b (5.19) (5.20) 5 = \ / c 2 + <P (5.21) The minimum number of required match pairs for estimating the transformation parameters is N = 2. However, our experiments show that an approximate number of 20 to 30 feature pairs are necessary to obtain reliable results. 5.3.1 Transformation Refinement To achieve a refined transformation between two original images rather than two sub-images, as has been mentioned before, we should refine the locations of match pairs. Consequently, the location of match points in the sub-images should be replaced with their locations in the original images before the transformation is computed with the least squares estimator. Recall that each sub-image S1 consists of four rectangular regions, taken from the four corner regions of an (m x m) inner window of the original image 11 (see Figure 4.2). Assume that, the upper left corner of each selected region p is located at (Ilcp,Ilrp) in 11; and the upper left comer of each corresponding region that is used to build sub-image S2, is located at (I2cp,I2rp) in original image 12. Then, for each point (Slxp,Slyp) in an region p of sub-image S1, its real location at (IIX,I1Y) in 11, is computed as: I1Y - Slyp + dyp and similarly, for each point (S2xp, S2yp)\r\ region p of sub-image S2, its location at (I2XJ2Y) in 12, is obtained from the following: I2X = S2xp + dxp + I2cp- Ilcp (5.23) I2Y = S2yp + dyp + I2rp - Ilr, v 5 . Hierarchical Image Registration 54 where: dxp = { m—w m—w if p if P 2 or 4 1 or 3 (5.24) and dyv = if p if P 3 or 4 1 or 2 (5.25) 5.3.2 False Match Recovery While the least squares estimator tries to fit a number of data points to the final estimation, it is possible to compute the disparity of each data point with respect to the estimation. This can be done by calculating the residual of each element. For additional improvement of our estimated transformation, at the finest level of the hierarchy, those match pairs that are too far from the estimated transformation are discarded. To discard those elements, the local residual, Ri, according to the best fit for each corresponding feature pair, relative to the global least squares fit, x, should be computed: The elements corresponding to each match pair are discarded from matrixes A and b, if their corresponding residual is higher than a residual-threshold. After discarding all extraneous points, the least squares fit is computed again. The process is repeated and any existing high residual points in the new fit are discarded. This process will be done repeatedly until no high residual points remain. With this technique those feature pairs that are far away (relative to the threshold) from the least squares fit and which could deteriorate the estimation are discarded. It should be mentioned that since we are using a small search window (a 3x3 window), the outliers cannot be very distant from the correct least squares fit. So, the outliers cannot affect the estimation significantly, however, we will improve the results by discarding them. Ri = ||A,x-bi|| ? threshold (5.26) 55 Chapter 6 Implementation Issues and Experimental Results This chapter describes the implementation of a real-time motion tracking system based on the theory set forth in the previous chapters. 6.1 Hardware Configuration 6.1.1 Camera Setup Figure 6.1 Setup of Camera A down-looking, CCD TV camera (Philips, VC72505T) is mounted on a mobile arm at a height of about 70cm above the ground, as shown in Figure 6.1. The mobile arm can manually be rotated, around a central axis perpendicular to the ground, and translated in a 2-D plane parallel to the ground plane. The centre of the camera lens is considered to be the origin of image frames and the central axis of the mobile platform to be the origin of the robot frame. Therefore, even a pure rotation of the mobile arm around its central axis, in the robot frame, will generate a simultaneous 2-D translation and rotation in the image frame. By having a textured 6 . Implementation Issues and Experimental Results 56 surface of dirt under the camera and rotary motion of the mobile arm, a 2-D motion of a mobile robot, consisting of translation, rotation and small scaling is synthesized. 6.1.2 Frame Grabber Images taken from the camera are digitized by an analog video interface. The S1V Video Capture and Frame Buffer Interface made by EDT Inc., which is an S-Bus compatible module, has been selected for this purpose. This frame grabber is hosted by a Sun SPARC20 workstation , using Solaris 2.3. The S1V Video Capture and Frame Buffer Module provides a gray scale video capture and output capability for the Sun SPARC station at a rate of 30 frames per second. The output signal of the camera, which is a 60 Hz standard CVBS signal, is digitized and stored in S1V RAM memory. Each pixel is defined by 8 bits of data, providing 256 levels of gray. The S1V has 1MB RAM divided into two banks. Each image can be captured into one bank while the host computer is processing the contents of the other bank. This double buffering ability of S1V is employed, in continuous acquisition mode, to minimize the waiting period for an image to be captured. When the grabbing function is called, the driver fills one of the banks and returns the last bank into which it acquired data. The S1V driver then continuously acquires frames into the other bank, while the application program processes the contents of the first bank. The driver does not cycle between both banks again until the application calls a function announcing that application is done. With this setup, our motion-detection program requires about 45 ms to detect the relative motion between two images. Moreover, the driver can capture a frame in 33 ms. Therefore, by calling the application done function at the end of each 45 ms period of process, ideally the rate of 22 frames per second can be achieved in motion detection. 6 . Implementation Issues and Experimental Results 57 6.2 Calculation of Camera Location from Transformation Parameters In previous chapters we described how to acquire the translation and rotation parameters existing between two successive images. These transformations were obtained corresponding to the location of matching pixels in the image frame. In this section we will explain how to convert the transformation parameters from the image frame to the global frame of the mobile robot platform in real world. Then it is possible to find the real location of the camera after each step of the transformation and track the motion. As has been mentioned previously, each image frame originates at the centre of the camera lens. Therefore, the image frame is transformed while camera moves. In the cases where only pure translation exists between images, the obtained translation in the image frame is equal to the real translation in the global frame. However, if the transformation includes some rotation, then the obtained translation in an image frame should be converted to the global robot frame in the real world. To clarify this matter, as is shown in Figure 6.2, assume that in coordinate frame A, the mobile arm can rotate around its central axis which is located at A(0,0)1. Assume that initially the origin of the camera frame is located at p0 - A(xo,yo)- After the first step of motion it is relocated to point ^ = A(xl,yl) and next to point p2 = A{x2,y2) and so on. In registration, the transformation existing between two successive images, is estimated based on the coordinate system of the first image (the nontransformed image). Thus, BTBB, = B(Ax0,Ay0)7 and 90 are the estimated translation vector and rotation of the camera moving from location B to B' in terms ' To represent the coordinates of a vector, we follow the notation used in [32]; the leading superscript indicates the frame in which the coordinates are expressed. Pre-multiplying the coordinates of a vector written in frame B by a rotation matrix gR yields the coordinates in frame A. 6 . Implementation Issues and Experimental Results 58 Figure 6.2 Transformation of image frame versus global frame of robot platform of frame B. Similarly, B'TB<B" - s ' ( A a ; ] , ) T and 0] are the estimated translation vector and rotation for a motion from location B' to B" in terms of frame B'. Assume, the location of the camera after the first step, in terms of global frame A, is presented by vector APAB' and after the second step by APAB" • Then we have: APAB<.') - APAB('-1) + B<-i)-ft- B T B ( i - i ) B ( . ) (6.1) In other words, to obtain the location of the camera, after step i, in terms of frame A, APAB{.), the transformation B ( " 1 ) T B L . - L ) B M , which is in terms of an image frame is multiplied with a rotation matrix B I A R . The rotation matrix corresponds to the rotation of frame B ^ with respect to frame A: • where : 7i = ^ K (6.2) It can be seen that after each step i of motion, in the rotation matrix, ^ is the summation of all previous rotational angles except the last one. B(') cos 7; — sin ji sin 7; C O S 7J-6 . Implementation Issues and Experimental Results 59 Therefore after first step of motion we have: APAB' = APAB + A-R-BTBB' 2/1 or : x0 Ax0 A y 0 (6.3) where: B R -1 0 0 1 after the first step : 7o = 0 and the computed location vector of the camera, APAB> . is used to locate it after the second step: [PAB<> = x2 .2/2 APAB' + B > R - B TB'B' or xx (6.5) where: B'R -cos 6Q - sin 60 sin 60 cos 60 after the second step : 71 = 60 6.3 Optimal Parameters In the initial stages of implementation, a few parameters existed as program inputs. Param-eters were specified regarding the density of features in images, range of motion and the speed of the processor. However, as the algorithm performance, on various scenes within different motion ranges improved, less variable parameters were required. Finally, we obtained good performance from a single set of parameters across all real image sequences taken from the camera with the above setup. 6 . Implementation Issues and Experimental Results 60 6.3.1 Image Size At the height of 70 centimeters, for the best focus, the focal length of camera is 1.44 mm. In the interlaced mode, the biggest area that can be captured in an (640x480) image is (112.5 cm x 74.4 cm). For simplicity and speed of process, just a 240x240 central segment of the captured image is stored in image arrays. Recall from Section 4.2 that four segments are selected from an inner window of the image. In our implementation the central inner window is 152x152 and each segment is 64x64 therefore, the generated sub-images are 128x128. Referring to the Figure 4.2 we have: W = 240, m = 152, w = 128 . Obviously in sub-images with four separate regions, more distributed features can be extracted. Also, in cases where some parts of an image do not have adequate information, there is a greater chance to acquire sufficient informative features. Figure 6.3 illustrates an example in which features of the upper part of images are very similar. By analyzing the information of only one central region of each image, those similar features, that also have the highest comer rank, are chosen. Consequently, transformation estimation is based on false matches and the registration algorithm fails. However, by performing the registration on sub-images consisting of four different regions, correct results are obtained. 6 . Implementation Issues and Experimental Results multiple regions search resulted in correct corresponding matches result: Ax = 14.56 Ay = 0.68 0 =0.79 Figure 6.3 sub-image constructed of four different regions vs. one region 6 . Implementation Issues and Experimental Results 62 6.3.2 Number of Resolution Levels By building a resolution pyramid, and performing all steps of registration in,a hierarchical manner, the required time for detection of motion between two successive frames has reduced significantly. Analysis of a small coarse image is much faster than the finer versions of the image. The results obtained.from analysis of a coarse image leads to fast processing of the next higher resolution image. This results in a faster algorithm. On the other hand, there is a trade-off between having more levels of resolution and losing the details in coarse images. Our experiments on real images illustrated in Figure 5.2, show that having 3 levels of resolution with the coarsest image of 32x32 is optimal. So, when the scaling rate is two, there are 32x32 , 64x64 and 128x128 images existing in three levels of the resolution pyramid for each image. However, it can be seen later that we obtain satisfactory results on Our images with synthesized motion, illustrated in Figure 6.5, with four level of resolution, because those images contain bigger details that are still detectable in the coarsest resolution. So, it can be concluded that the number of resolutions depends on the image regarding the size of fine features that it contains. Table 6.1 compares the time spent on the most expensive functions of the algorithm and the number of interesting patches found in hierarchical corner detection, for different numbers of resolution levels. By having more levels in the resolution pyramid the total required time for the algorithm will be less. The reason is that less time is needed for hierarchical detection of the corners and their correspondences. Moreover, since the detected features and their correspondences are obtained from more levels and their validity is confirmed by more levels, results are more reliable and the algorithm will fail for fewer frames. On the other hand, in this example with four levels of resolution, the coarsest images, which are 16x16, do not have 6 . Implementation Issues and Experimental Results 63 adequate information. Therefore, the algorithm fails more often. no. of res. levels total time (ms) Correlate (ms)=%of total Find-Corner (ms)=%of total failed frames ave. no. of detected corners 1 9.72 0.5 = 6% 4.04 = 42% 4 24 2 8.27 1.0=12% 1.41 = 16% 3 20 3 6.60 1.5 = 24% 0.63 = 24% . 2 19 4 , 3.18 1.0 = 25% 0.36 = 9% 6 9 Table 6.1 Comparison between hierarchical registration with different numbers of resolution levels 6.3.3 Number of Features to be Extracted As has been explained in Section 5.3 when solving the linear equation of rigid transformation for four unknowns, at least two valid corresponding points should be found in two images. However, usually for accurate and reliable results, it is preferred to have an over-constrained problem. Moreover, in match verification and false match recovery steps some of the extracted patches might be discarded. Therefore, having more corresponding matches in two images, more reliable results can be generated. On the other hand the number of extracted patches is directly proportional to the amount of computational cost. Our experiments have shown that an average number of 30 patches to be extracted at each level of resolution is adequate, assuming that at most 20% of those extracted patches might be discarded in each level. To assure that at least 30 corners can be extracted from the image of any unknown scene, a very small threshold (e.g. 2) is used in the comer detection function. As it is mentioned in Section 5.1 in our corner detection algorithm all detected corriers are ranked based on their 6 . Implementation Issues and Experimental Results 64 sharpness. Therefore, by using an indexing function which is based on Quick-sort algorithm and is introduced in [28], then the 30 highest ranked comers among all detected comers can be extracted. In the case that even with the mentioned low threshold the number of detected comers has been less than 30, the indexing function would not be called and the system will cope with the number of detected comers. 6.3.4 Correlation Window and Search Window for Matching After extraction of a sufficient number of interesting patches in the nontrarisformed image, 11, a correlation matching approach is taken to find the corresponding match points of the following image 12. A correlation window of 7x7 requires a high computational cost and a 3x3 window does not provide enough information for reliable matching. Therefore a 5x5 correlation window was selected for this purpose. For each interesting pixel of 11 , the matching search space is apxp rectangular region of 12 that is centered at the predicted location for the corresponding match. The size of search window, at the coarsest level, can be chosen by considering the amount of expected transformation between images and in the next finer levels, by considering the maximum expected error in . prediction of the location of match points. Since our correlation matching is performed on sub-images, that are generated to minimize the possible displacements, in the coarsest images, displacements are expected to be small. On the other hand, since transformation is computed based on integer location of matching pairs on the grid, the accuracy of transformation parameters is within ±0.5 pixels. Moreover, in the generation of the resolution pyramid the chosen scale rate, K - 2, therefore the maximum error in conversion of the transformation parameters from each level to the next one and consequently in prediction of match points is: K x (±0.5) = 2 x (±0.5) = ± 1 6 . Implementation Issues and Experimental Results 65 Therefore, a 3x3 square region, centered at the location of predicted match, is sufficient to track matching between levels. Consequently, with a 3x3 search window, for each interesting pixel, 9 match values are computed. With a 3x3 search window, 3 levels of resolution and scale rate of 2, the maximum displacement that can be detected between two consequent images is ±1 pixel at the coarsest level or ±8 pixels at the finest level of images. Usually, big rotations cause correlation matching to fail, because after a big rotation the intensity configuration of the patch under the correlation window changes significantly. However, we assume the rotations that cause the displacement of less than or equal to half a pixel, would retain a substantial degree of correlation. Figure 6.4 shows that, in a 5x5 window, if the center of a pixel moves less than or equal to half a pixel it will remain in the angle of a, which is 11.25°. In other words, a 5x5 correlation window can provide consequential degree of accuracy while the rotation is less than or equal to 11.25°. a = ^ - = 11.25° (6.7) 0.5 pixel -•I H*-5x5 Correlation Window Figure 6.4 An approximate maximum detectable rotation within a 5x5 correlation window It should be mentioned that, in the actual experiments, this amount highly depends on the contrast of image and the shape and size of the rotated feature, overlaid by the correlation window. 6 . Implementation Issues and Experimental Results 66 In an experiment on synthetic images, illustrated in Figure 6.5, when there were no trans-lations but 20° rotation and image was relatively blurred, we could detect the rotation of 20.9°, within 4 levels of resolution. Table 6.2 shows the intermediate estimated results (translation and rotation), obtained from coarse-to-fine resolution levels, when 20 comer patches are requested to be extracted. The translation and rotation obtained from the fourth (finest resolution) level has the highest accuracy. Res. level# Ax. (pixels) Ay (pixels) o n # of found matches 1 0.7963 -2.1667 15.4411 18 2 0.1107 -1.0549 19.5408 10 3 0.0825 ' -0.2922 20.5535 13 4 0.0056 -0.2427 20.9180 7 Table 6.2 Intermediate results, obtained from registration in four levels, for 20° rotation 6.3.5 Thresholding for Discarding False Matches There are two stages, used to discard false matches. First, in the correlation matching function, where the validity of correlation matching is verified. In that step the disparity of the obtained best matches from two reverse matching iterations is compared. For the highest accuracy, the corresponding matches with equal disparities are considered as valid, and the ones with more different disparities are discarded. The next discarding of possible false matching points is after the transformation estimation step in the finest level of resolution. In this stage, if the residual of the best fit for each pair of obtained match points relative to the final least squares fit is more than the residual threshold of 1.0, that pair of corresponding matches will be discarded. For an ideal fit, the residual should 6 . Implementation Issues and Experimental Results 67 be zero but since, with subpixel correlation, we expect an accuracy of less than a pixel, the residual-threshold of 1.0 has been selected: ||A;X-b;|| < 1,0 (6.8) 6.4 Timing Cost This system has been simulated using C language, on Sun Sparc20 with Solaris 2.3. The most effort has been done to optimize the code and modulate the system in order to obtain fast subsystems with less communicational cost and required memory. The most severe drawback of a registration algorithm is its high computation requirements. Especially, where finding and evaluating of a match is required. Computation time is mainly spent on performing correlation between two images, over a specific disparity range. As stated earlier, to reduce the computation time, we only correlate a limited number of patches of the nontransformed image, 11, (interesting patches) over a small 3x3 region of the transformed image, 12. The number of executed correlations is proportional to the number of interesting patches times the size of search area. On a Sun SPARC20 all steps of sub-image generation, resolution pyramid generation and hierarchical registration of the algorithm with the above parameters is done in approximately 46 milliseconds. This figure has been obtained by generating a Graph Profile and measuring the spent CPU time of during the execution of algorithm. The contents of GProfile, shown in Table 6.3, illustrate the spent CPU time for each call, number of calls, total spent CPU time and percentage of CPU time, spent for each function. 6 . Implementation Issues and Experimental Results 68 % of Total time Total seconds Number of calls Spent time (ms/call) Function 37.4 2.72 27382 0.10 Correlate 15.7 1.14 462 2.47 Find_LeftAve 9.6 0.70 13860 0.05 Find_RightAve 9.1 0.66 462 1.43 Find_Corner 8.9 0.65 308 2.11 SubImage_Gen 5.6 0.41 616 0.67 Res_Scaling 5.5 0.40 462 0.87 Find_Match 4.1 0.30 3913 0.08 SubPixel_Corr 1.0 0.07 1112 0.06 Array_Mult 0.7 0.07 462 0.11 Choose_Crnr 0.3 0.02 248 . 0.08 Find_Residual 100 7.28 155 46.96 Total/frame Table 6.3 CPU time spent on various functions of motion-detection algorithm The above GProfile has been generated while the motion of camera has been tracked, within 155 frames, and the algorithm uses the mentioned optimal parameters. It can be seen that the most time consuming function is the correlation function. The second and the third most time consuming functions calculate and provide the average intensity of patches, that overlay the features which are being compared, for doing the normalized correlation. Average CPU time spent for complete motion detection of each frame relative to its previous frame, is 46.96 ms, therefore, with a real_time operating system with faster capability for I/O and memory access 6 . Implementation Issues and Experimental Results 69 the actual speed of motion detection can be up to 21.3 frames/sec. However, because of the limitation of the current non real-time computational resources in our experiments, we could perform our algorithm in only 10 frames/sec. Consequently, it was necessary to move the camera quite smoothly and slowly. 6.5 Maximum Detectable Displacement of Camera As it has been mentioned in Section 6.3.4 the maximum detectable displacement in the images, with 3 levels of resolution, a 3x3 search window and a 5x5 correlation window, is 8 pixels per frame. According to the resolution and focal length of our camera and its height relative to the ground, the maximum displacement in the scene, that can be detected, is as: Therefore, maximum detectable displacement of the camera within two successive images is ±1.4 centimeter. By prediction of velocity to be constant between two pairs of successive images, the change of velocity, or acceleration of 1.4 cm/sec2 in each frame is detectable. Therefore, with the imaging rate of 20 frames/sec the maximum motion with acceleration of 2Scm/sec2 can be tracked. In the other hand, the detectable displacement is limited by the image size. Another factor that is limiting the maximum speed of the vehicle is the sub-image parameters, described in Section 4.2. As equation 4.12 illustrates, the difference between the size of central window, m , and the size of original image, W, should be bigger than the maximum displacement. X 8 image pixels = 16000 CCD pixels (6.9) = 1.4 cm in the scene 6 . Implementation Issues and Experimental Results 70 With the selected sub-image size parameters, we have: W = 240 , m = 152 W - m ^ = 44 image'.pixels (6.10) = 7.73 cm So, the maximum displacement of camera can be 7.73 cm/frame. In other words the maximum detectable velocity of camera's displacement, consisting of both rotation and translation, with the processing rate of 20 frames/sec, is 7.73x20=154.6 cm/sec or 5.4 km/hr. Obviously, by having a faster processing, or a wider angle lens, larger displacement can be detected. However, there is a trade-off between having a wide angle lens and extracting enough details from the images. 6.6 Results 6.6.1 On Images with Synthesized Motion To study the accuracy of algorithm we can refer to the images that have been taken from a terrain of forest and transformed synthetically, with a known amount of translation and rotation. To produce synthetic transformation, each of the above images has been scanned and then shifted and rotated via software. In the following figure the existing transformation between a pair of transformed and nontransformed images has been detected. Our algorithm has performed well on various images with wide range of synthetic trans-formations. Table 6.4 illustrates the results of those experiments, with various transformations, when the search window is 3x3, correlation window is 5x5 and resolution pyramid has 4 levels. 6 . Implementation Issues and Experimental Results 71 These experiments simulate situations in which, motion includes pure rotations, pure translation or simultaneous translation and rotation. Synthetic Motion Detected results Total time (ms) Translation in X = Y (pixels) Rotation (°) A x (pixels) A y (pixels) e n 0 1 -0.46 -0.46 1.03 45 0 -1 -0.65 0.43 -0.9 46 0 4 0.02 T0.34 . 3.9 45 0 -4 -0.11 0.18 -4.08 48 0 20 0.00 -0.24 20.91 50 1 0 0.93 1.04 0.10 45 10 0 9.95 9.98 0.02 46 15 0 14.97 15.01 0.04 50 5 9 5.61 4.40 9.32 51 7 6 6.34 7.79 6.04 46 10 2 9.41 10.19 2.02 47 Table 6.4 Experimental results on a terrain images with various synthetic translations and rotations Figure 6.5 illustrates one of the above examples. An image of a terrain in forest, 11 is shown, with another version of it, 12, that is synthetically shifted and rotated by Ax - Ay -7 pixels and 0 = 6°. The valid correspondences are identified with white and the invalid ones, that did not pass the match verification, are in black. The estimated transformation, 6 . Implementation Issues and Experimental Results 72 obtained from four levels hierarchical registration of the generated sub-images S1 and S2 is : (Ax = 6.34, Ay = 7.79,f5 = 6.04) . Estimated Transformation: Ax = 6.3419 , Ay = 7.7921 9 = 6.0400 Figure 6.5 Experimental results on synthetic images 6.6.2 On Real Images In the mentioned setup, camera has a rotational motion around the central axis of the mobile arm. During the motion of camera 135 images has been taken from the ground, but only one image per 15 sample is displayed in the Figure 6.6. In the result image, each of the white pixels 6 . Implementation Issues and Experimental Results 73 represents the centre of camera at each moment, consequently, the white curve illustrates the tracked motion of camera. By having different patterns on the floor we proved that the algorithm i performs well on various types of grounds, regardless of any specific object existing. We also carried out an experiment on a mini-excavator to obtain a rough idea about the accumulated error. The camera was mounted on the rear dozer-blade of the machine at a height of 50 centimeters'. As shown in Table 6.5, two paths were examined while taking 300 images from the gravelled surface of the test yard. Approximate Motion Detected Results Missed Translation in X (cm) Translation in Y (cm) Rotation (°) Ax (cm) Ay (cm) 0 (°) frames 0 180 0 -0.5 157.5 0.9 5 o 0 75 0.8 0.5 62.8 8 Table 6.5 Experimental results on images, taken by the camera, mounted on a mini-excavator The detected translation and rotation, show that the total accumulated error after 300 image frames in translation is 12% and in rotation is 16%. Accordingly, the average rate of accumulated error per frame is 0.04% in translation and 0.05% in rotation. In other words, total accumulation error in translation, after processing 300 image frames is 22.5 centimeter which is approximately equal to 128.57 image pixels. Therefore, the average translation error is 0.42 pixels per frame. The obtained error value strongly supports the subpixel accuracy of our estimation. Because of the current communication's tether limitations, we could not move the machine farther. In the near future, it is intended to have more experiments on the mini-excavator and compare the results obtained from the presented vision system and other motion tracking systems. ' In the implementation on a large scale excavator, the camera will be mounted at a convenient location under the chassis. 6 . Implementation Issues and Experimental Results Motion of Camera detected within 135 image frames: (only one sample per 15 is displayed) Estimated transformation: Ax = -183.6 pixels = -32.3 cm Ay = 473.3 pixels = 73.9 cm a = 123.93 degree Figure 6.6 Experimental results on real images 75 Chapter 7 Conclusions 7.1 Concluding Remarks We have described a feature-based, real-time motion tracking system for local control of a mobile robot equipped with a down-looking camera. We do not assume the presence of any beacons or landmarks, nor do we require any inertial sensors or any expensive dedicated hardware for fast performance. We have derived a new faster approach to detect motion via image registration. In this approach, an effective method to cut down the computation is to process the sub-regions of images. In partitioning a transformed image, velocity prediction provides enough information to filter out areas which are unlikely to contain features that are being followed in a corresponding non-transformed image. Moreover, a hierarchical approach has been taken to improve the speed of feature extraction and match detection. To eliminate false information two methods have been presented to discard errors caused by noise or partial occlusion. Results obtained from synthetic and real images with various two dimensional transforma-tions showed that this algorithm performs well with, better than a pixel accuracy in detection of translation and better than one degree accuracy in detection of rotation, at a processing rate of 20 frames/sec. The above processing rate allows the maximum acceleration of 28 cm/sec2 and maximum speed of 5.4 km/hr\o be detected while the camera is moving on a two dimensional plane. Moreover, satisfactory results, obtained from our experiment on the mini-excavator in a gravelled yard, shows that, the average accumulated error per frame of the present system is 7 . Conclusions 76 0.04% in translation and 0.05% in rotation. Moreover, the system can successfully compensate the small vertical motions of the camera. 7.2 Directions for Future Improvements An inherent characteristics of our motion-detection system is that estimation errors get accumulated over time and distance. Although, the position information obtained from each iteration of the algorithm is reliable, over long periods of time the system must be integrated to provide absolute measurement of location. Like inertial systems, the platform requires additional information from some absolute positioning systems, such as GPS, to eliminate the accumulated errors. One way of doing this is to periodically reset the initial estimates with the absolute position sensing estimates. As we mentioned before, the CPU time spent on the motion-detection system is much less than (about half of) the actual time that we have obtained from running the program on Solaris 2.3. Higher speed can be obtained by optimizing the code for faster access to I/O and memory, or using a real-time operating system. Since, our algorithm can fail in situations when the camera has abrupt motion discontinuities or large vibrations, future improvements include handling these two problems as well. In addition, an unexpected motion of some relatively large object(s), in the scene such as a branch under the machine, can produce unreliable results as well. Ways should be found to recognize the presence of more than just one global motion. Then, the results could be evaluated and the object could be ignored. In the case that a robot moves on an uneven surface, the vertical distance of the camera to the ground, z, changes significantly thus, the objects that are closer to the camera seems to move faster than others. This situation is also similar to the case that more than one global 7. Conclusions 77 motion is present. Even though, we have presented some methods to discard false information, more robust methods can be searched for. For further applications, other properties such as speed of the mobile robot can be computed using time intervals between frames. In the near future we are going to incorporate our algorithm into a motion detection system for excavators, which allows the path planar know the current position of the machine. We hope to be able to add higher level control capabilities to this application. Bibliography [1] J.F. Le Corre and G. Garcia. Real-Time Determination of the Location and Speed of Mobile Robots Running on Non-Planar Surfaces. In Proceedings of IEEE International Conference on Robotics and Automation, 1992. [2] J.L. Gourdon and F. Peyret. Modelling and Controlling the Road Finishing Process. In Proceedings of The 8th International Symposium on Automation and Robotics in Construction, volume ISARC91, pages 479-488, June 1991. [3] Y. Kikuchi, M. Tochizawa, S. KamadaS. Takeda, K. Hirosawa., T. Wada, and S. Ito. Automatic Excavator. In Proceedings of The 8th International Symposium on Automation and Robotics in Construction, volume ISARC91, pages 277-284, June 1991. [4] J. Borenstein, H. R. Everett, and L. Feng. Navigation of Mobile Robots, Systems and Techniques. AK Peters Wellesley, Massachusetts, 1996. [5] R. H. Byrne, P. R. Klarer, and J. B. Pletta. Techniques for Autonomous Navigation. Technical report, Sandia National Laboratories, Alburquerque, NM, 1992. [6] G. Nard. DGPS Correction and GPS Reception and Processing. A One Unit Embedded Solution. In RTCM Annual Assembly Meeting, 6-10 May, 1990. [7] Billar Barshan and Hugh F. Durrant_Whyte. Inertial Navigation Systems for Mobile Robots. IEEE Trans on Robotics and Automation, 11(3), June 1995. [8] G. D. Dunlap and H. H. Shufeldt. Durttons Navigation and Piloting. Naval Institute Press, 1972. [9] TRC - Transition Research Corp., Shelter Rock Lane, Danbury, CT 06810, 203-798-8999. Beacon Navigation System, 1994. Product Literature. •71 [10] Yasunori Abe, Kouetsu Tanaka, and Yoshio Tanaka. Navigation System Based on Ceiling landmark recognition for Autonomous Mobile Robot; Landmark Detection based on Fuzzy Template Matching. In Proceedings,of IEEE International Conference on Intelligent Robots and Systems, 1995. [11] Scott Bulmer. Real-Time Motion Tracking: A Case Study in Parallelizing Vision Algorithms. Master's thesis, University of British Columbia, 1995. [12] Chao Chi Huang and Duncan Gillies. The Determination of an Autonomous Vehicle's Position by Matching the Road Curvature. In Proceedings of the International Society for Optical Engineering, volume SPIE 2591, Oct 1995. [13] G. Verghese, K. Gale, and CR. Dyer. "Real-Time Parallel Motion Tracking of Three Dimensional Objects from Spatiotemporal Sequences". Parallel Algorithms for Machine Vision. Springer-Verlag, New York, 1990. [14] Robert Mandelbaum and Max Mintz. Feature Based Localization by Using Fixed Ultrasonic Transducers. In Proceedings of IEEE International Conference on Intelligent Robots and Systems, 1995. [15]Q. Zheng and R. Chellappa. A Computational Vision Approach to Image Registration. IEEE Transaction on Image Processing, 2(3), 1993. [16]S. Ravela, B. Draper, J. Lim, and R. Weiss. Adaptive Tracking and Model Registration Across Distinct Aspects. In Proceedings of IEEE International Conference on Intelligent Robots and Systems, pages 174-180, 1995. [17]L. Gottesfeld Brown. A Survey of Image Registration Techniques. ACM Computing Surveys, 24(4):325-376, 1992. to [18]E. De Castero and C. Morandi. Registration of translated and rotated images using Finite Fourier Transforms. IEEE Trans. Patt. Anal. Machine Intell, 9(5):700-703, 1987. [19] Pascal Fua. A Parallel Stereo Algorithm that Produces Dense Depth Maps and Preserves Image Features. Machine Vision and Applications, 3(6):35-49, 1993. [20] B. Hotz. Etude de Techniques de Stereo Vision par Correlation. (Rapport des stage de dea) CNESToulouse, France, 1991. [21] A. Rosenfeld and A. C. Kak. Digital Picture Processing , volume 1 and 2. Academic Press, Oriando, Fla., 1982. [22] Peter J. Burt and Edward H. Aseldon. The Laplacian Pyramid as a Compact Image Code. IEEE Transaction on Communication, COM-31(4), 1983. [23] G. C. Stockman, S. Kopstein, and S. Bennett. Matching Images to Models for Registration and Object Detection via Clustering. IEEE Transaction on Pattern Analysis and Machine Intelligence, 4:229-241, 1982. [24] H. Moravec. Rover Visual Obstacle Avoidance. In Proceedings of 7th International Conference of Artificial Intelligence, pages 785-790, 1981. [25] Chris Harris and Mike Stephens. A Combined Corner and Edge Detector. Plessey Research Rokw Manor, United Kingdom. [26] Chris Harris and J. M. Pike. 3-D Positional Integration from Image Sequence. In Proceedings of third AVC, 1987. [27] Ying Cui and Peter D. Lawrence. Detecting Scale-Space Consistent Corners Based on Corner Attributes. IEEE Conference on Man, System & Cybernetic, 1996. [28] William H. Press, Saul A. Teakolsky, William T. Vetteriing, and Brian P. Flannery. Numerical Recepies in C, The Art of Scientific Computing. Cambridge University Press, 1992. «1 [29] V.N. Devornychenko. Bounds on (Deterministic) Correlation Functions with Application to Registration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(2), 1983. [30] Arthur R. Pope. Learning to Recognize Objects in Images: Acquiring and Using Probabilistic Models of Appearance. PhD thesis, University of British Columbia, 1995. [31] N. Ayache and O. D. Faugeras. A New Approach for Recognition and Positioning of Two-Dimensional Objects. IEEE Trans. Patt. Anal. Machine Intell. PAMI, 8(1):44-54, 1986. [32]Erliang Yeh and David J. Kriegman. Toward Selecting and Recognizing Natural Landmarks. In Proceedings of International Conference on'Intelligent Robots and Systems, volume 1, pages 47-53, 1995. [33] R. Hummel and S. Zucker. On the Foundations of Relaxation Labeling Processes . IEEE Trans. Patt. Anal. Machine Intell, 5:267-287, 1983. [34] H. K. Nishihara. Practical Real-time Imaging Stereo Matcher. Optical Eng., 23(5), 1984. [35] D. Charnley and R. J. Blissett. Surface Reconstruction from outdoor Image Sequences. Submitted to AVC 88, 1988.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Real-time motion tracking of mobile robots via image...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Real-time motion tracking of mobile robots via image registgration Soroushi, Atousa 1996
pdf
Page Metadata
Item Metadata
Title | Real-time motion tracking of mobile robots via image registgration |
Creator |
Soroushi, Atousa |
Date Issued | 1996 |
Description | In order to have accurate, safe and reliable remote control for mobile robots, it is necessary to track their motion. With new high technology systems, fools for imaging and image acquisition are becoming less expensive every day. Consequently, vision-based systems are considered practical for tracking purposes. Most of the present motion-detection systems used for control of heavy duty hydraulic machines depend on external equipment like inertial sensors, deadreckoning or laser beacons. There are also a few existing systems for navigation purposes that are based on vision, but they do not perform in real-time or they require a partially known environment. The goal of this thesis is the design and implementation of a stand-alone motiontracking system, to track the local motion of a machine-mounted down-looking camera in realtime. By installing the camera under the base of a mobile robot, the local trajectory of that robot in an unstructured environment can be tracked. In the present work we use a computational vision approach to design an image registration system to estimate the two dimensional translation, rotation and small scaling factor from two partially overlapping Images. This is done in real-time (about 20 frames per second) even when there is simultaneous rotation and translation occurring between the two successive frames and images may have few significant features. To reduce the processing time and obtain information from different regions of the image, we process only selected regions of interest of the original image. An initial estimate is used to select those regions, so that regions contain similar features with the least possible disparity. This method, with the aid of a coarse-to-fine search method, makes the motion-detection algorithm able to execute in about 45ms and detect a dependable rotation of up to 11.25 degrees and translations of up to 8 pixels for each pair of image frames. |
Extent | 9446276 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-03-14 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
IsShownAt | 10.14288/1.0064842 |
URI | http://hdl.handle.net/2429/6040 |
Degree |
Master of Applied Science - MASc |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1996-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_1996-0492.pdf [ 9.01MB ]
- Metadata
- JSON: 831-1.0064842.json
- JSON-LD: 831-1.0064842-ld.json
- RDF/XML (Pretty): 831-1.0064842-rdf.xml
- RDF/JSON: 831-1.0064842-rdf.json
- Turtle: 831-1.0064842-turtle.txt
- N-Triples: 831-1.0064842-rdf-ntriples.txt
- Original Record: 831-1.0064842-source.json
- Full Text
- 831-1.0064842-fulltext.txt
- Citation
- 831-1.0064842.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0064842/manifest