Abstract

This thesis presents a vision-based tracking system suitable for autonomous robot vehicle guidance. The system includes a head with three on-board CCD cameras, called Triclops®, which can be mounted anywhere on a mobile vehicle. By processing consecutive trinocular sets of precisely aligned and rectified images, the local 3D trajectory of the vehicle in an unstructured environment can be tracked. The use of three cameras enables the system to construct an accurate representation of its environment and therefore results in accurate motion estimation. First, a 3D representation of stable corners in the image scene is generated using a stereo algorithm. Second, motion is estimated by tracking matched features over time. The motion equation with 6 degrees of freedom is then solved using an iterative least squares fit algorithm. Finally, a Kalman filter implementation is used to optimize the 3D representation of scene features in world coordinates. The system has been implemented on a Pentium processor in conjunction with a Triclops and a Matrox Meteor® frame grabber. The system is capable of detecting rotations with 3% error over a 90 degree rotation and translations with 13% average error over a 6 m long path.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements

1 Introduction
  1.1 What is motion tracking?
  1.2 What is image registration?
  1.3 Summary of the approach
  1.4 Overview

2 Previous Work
  2.1 Radio Sensing
  2.2 Laser Range Finders
  2.3 Dead-reckoning
  2.4 Inertial Systems
  2.5 Video Based Systems
    2.5.1 Landmarks
    2.5.2 Active Beacons
    2.5.3 Image Correspondence
  2.6 Summary

3 System Design
  3.1 Assumptions
  3.2 Objectives
  3.3 Motion Tracking Methodology
    3.3.1 Model Initialization and Calibration
    3.3.2 Feature Detection
    3.3.3 Feature Identification
    3.3.4 Feature Tracking
    3.3.5 Motion Estimation
    3.3.6 Position Filtering

4 System Implementation
  4.1 Feature Extraction
  4.2 Stereo Matching and Match Validation
  4.3 Matching World Points and Feature Points
  4.4 Motion Estimation
  4.5 Kalman Filter Updating
  4.6 Retiring Old and Introducing New Features

5 Experimental Results
  5.1 Hardware Configuration
    5.1.1 System Setup
    5.1.2 Frame Grabber
    5.1.3 Triclops™
  5.2 Optimal Parameters
  5.3 Results
    5.3.1 World Point Construction
  5.4 Motion Estimation using two Frames
    5.4.1 Accumulated Error
    5.4.2 Motion Estimation Results from a Number of Images

6 Conclusions and Future Work
  6.1 Specification Evaluation
    6.1.1 Continuous Sensing
    6.1.2 Real-Time Performance
    6.1.3 No Odometry
    6.1.4 Minimal Cost and High Accuracy
  6.2 Future Work
    6.2.1 Accurate Testing Platform
    6.2.2 Decreasing the Accumulative Error
    6.2.3 Speeding up the Performance
    6.2.4 System Applications

References

List of Tables

2.1 Estimation of mean error and the corresponding standard deviation.
3.1 The partial derivatives table.
3.2 Partial derivatives of u and v with respect to the camera viewpoint parameters.
5.1 Feature number in different stages of the system.
5.2 Iteration results for a single pair of images.
5.3 Iteration results for a single pair of images.
5.4 Depth construction comparison with the real measured values.
5.5 Estimated and actual motion between successive image sets.

List of Figures

1.1 System block diagram.
2.1 Sensor hierarchy vs measurement range and functional characteristics.
2.2 Mechanism of an odometer.
3.1 Motion tracking system tasks.
3.2 Initialization tasks.
3.3 Ideal pinhole camera model produces a perspective projection of the world.
3.4 Camera geometry with perspective projection and radial lens distortion.
3.5 Feature detection sub-sections.
3.6 Result of corner detector on a sample image.
3.7 Stereo vision sub-sections.
3.8 Epipolar constraint.
3.9 Rectification of two images.
3.10 An unrectified image with the lens distortion.
3.11 Same image after calibration and rectification.
3.12 Depth construction using camera parameters.
3.13 Feature tracking sub-sections.
3.14 Position filtering components.
4.1 Flowchart of real-time 3D motion detection system.
4.2 Corners before suppression of unstable corners.
4.3 Threshold corners with modified non-maximal suppression.
4.4 A triple set of images with suppressed corners.
4.5 Match validation using three images.
4.6 Sub-pixel sampling using quadratic estimator.
4.7 Kalman filter ellipsoid surface for real points.
4.8 Identical features are tracked over time.
4.9 Scaled Kalman filter ellipsoid surfaces at the beginning.
4.10 Updating the Kalman filter.
5.1 System setup.
5.2 Position of the Triclops at each moment is shown in the screen.
5.3 Triclops Stereo Vision System.
5.4 Stereo matching result for a trinocular set of images.
5.5 Stable corners after validation test are compared in two consecutive images.
5.6 Corresponding features are related and tracked in two consecutive images.
5.7 Accumulated error for different motion variables.
5.8 Sample images taken at reference points A, B, C and D, moving toward end point H.
5.9 Sample images taken at reference points E, F, G and H, moving toward end point H.
5.10 Comparison of the actual path with the estimated, path A to H.
5.11 Sample images taken at reference points H, G, F and E, moving toward end point A.
5.12 Sample images taken at reference points D, C, B and A, moving toward end point A.
5.13 Comparison of the actual path with the estimated, path H to A.

Acknowledgements

I sincerely thank my supervisor Dr. Peter Lawrence for giving me this unique opportunity to explore my creativity, and also for his priceless guidance, advice and support. I also feel indebted to Dr. David Lowe, my co-supervisor, for the time he committed to me and his genuine interest in my work. I would also like to thank all the members of the Robotics and Control Laboratory (Electrical and Computer Engineering) and the Laboratory for Computational Intelligence (Computer Science). Without their technical assistance, this thesis would have been impossible. I would like to express my deep gratitude to my parents for their love, support and trust, and for encouraging me to pursue my interests. I would also like to thank all of my friends who believed in me, stood by my side and helped me to get through my work. They consistently supported my ideas and would drop their work on a moment's notice to lend a hand. Also, a special thanks is extended to the IRIS Network of Centers of Excellence for their financial support of this thesis.

Parvaneh Saeedi
The University of British Columbia
November 1998

Chapter 1 Introduction

Mobile robots have a wide variety of applications in such diverse areas as forestry, mining, space, nuclear reactors, construction and industry. Tasks in these environments may be remote, hazardous to humans or tedious. Remotely controlling such robots requires a system that locates or tracks the robot's position and orientation in real time, and that operates robustly under a variety of environmental conditions.
For this reason, previous research has addressed the autonomous guidance of vehicles such as excavators and bulldozers using GPS or dead-reckoning. However, the accuracy of these systems depends upon environmental conditions such as weather, lighting and contact surface, and they may therefore fail under particular conditions. Recently, many new approaches have been developed for the autonomous guidance of mobile machines using computer vision. With such guidance, the probability of collision and other dangers of navigating in a changing environment can be reduced. Due to the fast processing speed required for real-time operation, many of the reported approaches assume that motion is limited to a two-dimensional plane. In an attempt to overcome this limitation, a system is presented here for real-time motion tracking of a mobile robot in a 3D world via image registration.

1.1 What is motion tracking?

Motion tracking is the process of determining changes in the location of a mobile object from sensed information over time. By rigidly mounting a camera system on the object as the sensory device, the problem of tracking the object's motion becomes the problem of tracking the camera's motion. If we specify the origin of the camera (robot) coordinate frame by a 6D vector P relative to fixed world coordinates, the objective of motion tracking is to estimate this camera pose vector, P, over time using visual observation.

1.2 What is image registration?

Image registration is the process of comparing and matching two or more pictures taken at different times, from different sensors, or from different points of view. It enables us to align images by a spatial transformation and to measure variation between images. To track motion via image registration, we identify informative features of the scene and align consecutive images to find the overall changes in the positions of those features.
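The alignment step described above can be made concrete with a small sketch. Assuming features have already been matched between two images, the following recovers the 2D rigid transform (rotation plus translation) that best aligns them in a least-squares sense. The function name is illustrative and the sketch is restricted to one plane; the thesis system itself estimates the full 6-degree-of-freedom motion.

```python
import math

def register_rigid_2d(src, dst):
    """Estimate the rotation theta and translation (tx, ty) mapping the src
    points onto the dst points, via the closed-form least-squares rigid
    alignment of two centered point sets (2D case)."""
    n = len(src)
    # Centroids of both point sets.
    sx = sum(p[0] for p in src) / n; sy = sum(p[1] for p in src) / n
    dx = sum(p[0] for p in dst) / n; dy = sum(p[1] for p in dst) / n
    # Cross-covariance terms of the centered sets give the rotation angle.
    a = b = 0.0
    for (x1, y1), (x2, y2) in zip(src, dst):
        x1 -= sx; y1 -= sy; x2 -= dx; y2 -= dy
        a += x1 * x2 + y1 * y2      # cosine component
        b += x1 * y2 - y1 * x2      # sine component
    theta = math.atan2(b, a)
    # Translation maps the rotated source centroid onto the target centroid.
    tx = dx - (sx * math.cos(theta) - sy * math.sin(theta))
    ty = dy - (sx * math.sin(theta) + sy * math.cos(theta))
    return theta, tx, ty
```

For noise-free matches this recovers the transform exactly; with noisy or mismatched features, outlier rejection (as the thesis does via statistical filtering) becomes necessary.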
1.3 Summary of the approach

The purpose of this thesis is to implement a motion tracking system for a mobile robot in a static environment. The design incorporates continuous sensing and 3D construction of a scene model with minimal processing time. Since 3D construction of the whole scene is very time-consuming, we construct 3D representations of only corner features, which are among the most informative features of the scene. Corners are discrete and explicit features, and they are partially invariant to scale and rotational changes. Although working with only selected image features speeds up motion estimation, we are still not able to process every incoming image. Processing fewer images with a larger temporal gap leads to higher disparities and larger estimation errors, so it is desirable to decrease the computation time and to increase the image processing rate as much as possible.

At the beginning, a 3D feature representation process is invoked. Corners detected in the stereo camera images are matched by cross-correlation with sub-pixel accuracy using epipolar constraints. The matched corners are then cast into 3D using disparity and the camera separation distance. The ego-motion computation is used to establish preliminary motion and to construct the basic world-point set. Then, in run mode, camera motion is tracked using a Kalman filter. At the end of each motion estimation, features that are no longer within the field of view are retired and new features are introduced to the system. Figure 1.1 shows a simplified block diagram of the vision system.

1.4 Overview

Chapter 2 presents a discussion of issues relevant to motion tracking and a study of previous work, summarizing several notable papers. Chapter 3 states the system requirements and explores the design decisions for the design proposed in this thesis, based on a study of different registration algorithms.
Chapter 4 describes the design details as they are implemented in the system. It presents the main procedures of the motion-detection algorithm, including feature selection, 3D representation of features, the matching process, motion estimation with sub-pixel accuracy, and Kalman filter generation and updating. Chapter 5 presents descriptions of the tests performed to evaluate the system, along with the results and a discussion of them. Chapter 6 presents the conclusions inferred from the results and proposes future work for improving the system.

Figure 1.1: System block diagram. (The diagram maps image coordinates (u, v) to world coordinates (x, y, z) and produces the motion vector P, through four stages: feature extraction, 3D feature construction from stereo matching, motion detection by tracking identical features, and refinement of 3D feature positions.)

Chapter 2 Previous Work

This chapter presents a summary of several existing motion-tracking systems and some related work.

Motion Tracking Systems

Motion tracking is concerned with accurately measuring an object's position and/or orientation in real time as it moves. Motion tracking systems may use different types of sensing techniques; most operate based on sensors mounted on the mobile robot. Laser range finders, radio beacons, sonar and video are some common sensors that have been used in navigation systems. The following is a list of these systems based on the sensing equipment [3]:

1. Radio Sensing
2. Laser Range Finders
3. Dead-Reckoning
4. Inertial Systems
5. Video Based Systems
   (a) Landmarks
   (b) Active Beacons
   (c) Image Correspondences
       i. Model-Based
       ii. Feature-Based

Each system has its own advantages and disadvantages depending upon the navigation application, and sometimes a combination of two or more sensors is more practical for a more accurate and reliable system. In the following, a short description is given for each system.
2.1 Radio Sensing

Radio sensing, such as GPS (Global Positioning System), operates by receiving radio signals from satellites, and can be used on airplanes, ships or land vehicles. These systems can be inexpensive, quick and accurate, but are intended for a global scale: their accuracy is within 10 m to 100 m. While this may be adequate when calculating latitude and longitude, it cannot be used, for example, for controlling a vehicle along a roadway. Although new developments in GPS (such as differential GPS) have led to higher accuracy, the GPS update rate is still very slow, and it can fail in dense wet foliage or under structures. Nevertheless, GPS is a valuable navigation aid to pilots and others, proving the usefulness of location information [21]. Other common radio triangulation techniques can also be accurate and inexpensive but may not be extensible.

2.2 Laser Range Finders

Range finders were designed and built during the Second World War for estimating the position of a target. A range finder usually has two lenses horizontally aligned at either end of a bar. By adjusting one of the lenses, the image viewed through the range finder can be made to closely coincide with the other. The angle by which the lens needs to be moved can then be used to calculate the distance of the object being viewed.

Kim [13] proposes a hierarchical localization system using a line-scanning laser range finder. The localization is performed in three steps: prediction of the environment around the robot, range measurement and disparity estimation. Instead of fusing sensor information, the system employs a hierarchical way of using sensors. A variety of external sensors are used for this purpose: a touch sensor stops the robot whenever it confronts an unexpected obstacle; a sonar sensor provides low-resolution depth information for fairly short distances; a laser range finder obtains a more accurate environment range map; and finally a stereo vision system produces intensity images used for understanding the scene and for symbolic planning of the robot path. Figure 2.1 shows the sensor hierarchy depending on the range of measurement and purpose.

Figure 2.1: Sensor hierarchy vs measurement range and functional characteristics.

The robot in this system follows a cycle of stop-look-collision-avoidance-and-move operation based on local path characteristics. At each step the range finder acquires a range map and detects scene feature points. These features (corners) are matched with their correspondences in the predicted local map. A weighted least squares estimation is then used to estimate the translational and rotational errors in the dead-reckoning measurements. The local path is planned by checking the direction to the sub-goal from the current position. If there is no obstacle, the robot does not change direction; otherwise it finds another path around the obstacle. The local maps accumulate to create a global map, and dead-end paths are remembered to avoid exploring them again.

2.3 Dead-reckoning

Dead-reckoning is a method to monitor and maintain the robot's position and orientation in a fixed external reference frame. Like an odometer in a car, the robot's position is estimated by summing its incremental movements relative to a starting point. Odometry is a simple implementation of dead-reckoning. It uses encoders, such as magnetic encoders, optical encoders, potentiometers, resolvers and brush encoders, to measure wheel rotation and/or steering orientation. Odometry is based on the assumption that wheel revolutions can be translated into linear displacement relative to the ground. However, this assumption is not always valid in practice. For example, if one wheel slips on an oil spill, the associated encoder registers wheel revolutions even though these revolutions do not correspond to a linear displacement of the wheel [27]. Figure 2.2 shows the mechanism of an odometric sensor. There are several other reasons for inaccuracies in the translation of wheel encoder readings into linear motion. These error sources are divided into two categories:

Systematic errors: unequal wheel diameters, misaligned wheels, limited encoder resolution, limited encoder sampling rate.
Non-systematic errors: travel over uneven floors, travel over unexpected objects on the floor, wheel slippage.

Figure 2.2: Mechanism of an odometer. (The figure relates the number of encoder ticks along the outside and inside curves of a turn to the wheel diameter, ticks per revolution and wheel separation.)

Odometry has the advantage of being totally self-contained, and it is always capable of providing a motion estimate. Its disadvantage is that the positional error grows without bound unless an independent reference is used periodically to correct it.

Stella and Cicirelli [27] propose a method for position estimation of a mobile robot using two independent sub-systems. They combine an on-board camera and an image processing unit based on artificial landmarks with a dead-reckoning sub-system based on odometry, and use a Kalman filter to integrate the robot's position from both systems' measurements. Experiments, done by tracking the robot along a 9-meter path, show that Kalman filtering yields a better solution than either system by itself: the average error is reduced by 1% compared with odometry alone and by 70% compared with the vision system alone. Table 2.1 shows the estimated mean error (E) and the corresponding standard deviation (σ) for each sub-system.
Although the result after Kalman filtering improves, the system still suffers a large amount of uncertainty originating from odometry.

Table 2.1: Estimation of mean error and the corresponding standard deviation.

      Odometer    Vision      Kalman Filter
  E   98.948      313.813     98.141
  σ   36.635      179.410     35.102

2.4 Inertial Systems

Such systems utilize gyroscopes and accelerometers to measure changes in angular and linear motion. The principle of operation involves continuous sensing of minute accelerations in each of the three directional axes. By integrating these measurements, inertial methods derive velocity and position, which are essential in stabilization and attitude control. The main advantage of inertial systems is that they do not need a fixed reference point against which measurements are made. On the other hand, inertial systems drift with time: since the data from inertial sensors is integrated, any small constant error results in an ever larger error after integration. Inertial sensors are thus unsuitable for accurate positioning over an extended period of time. Another problem with inertial systems is high equipment cost. For example, a high-quality inertial navigation system in a commercial airliner, with a drift of about 2000 meters per hour, costs between $50K and $70K [7].

Vaganay [32] presents an algorithm for attitude estimation of a mobile robot using the fusion of inertial data. He uses low-cost inertial sensors, including two accelerometers and three gyroscopes. The robot's attitude, represented by roll and pitch angles, is obtained by two different methods: the first is based on the gravity components of the accelerometer measurements, while the second integrates the robot's attitude differential equation using the angular velocity measured by the gyroscopes. An extended Kalman filter then combines the results of these two methods. Experiments on a mobile robot with two differential wheels on flat indoor ground show that this method improves the result by 11% in normal cases (flat floors) and by 55% when passing an obstacle or bump. The system appears to be fast and accurate when used on an indoor mobile robot, but the variation range of the attitude is very restricted. Suggested future work includes improving the reliability and precision for navigating in outdoor environments with a more variable domain.

Borenstein and Feng [4] introduce a method, called gyrodometry, for combining measurements from gyroscopes and wheel encoders. By studying the behavior of odometers, they discovered that non-systematic odometry error sources, such as bumps, alter vehicle motion over a very short period, a fraction of a second. If during that time the encoder's measurement is ignored and another source provides the information, it is possible to reduce the unbounded accumulated error of odometry. By using the data of an optical fiber gyroscope only during those instances, the tracking system overcomes both the gyroscope's large drift rate and the unbounded accumulating error of the encoders. To implement such a system, the discrepancy between the odometry and gyrometry measurements is monitored constantly. If this discrepancy is higher than a predefined threshold, the odometer's measurement is ignored and the gyroscope's data is taken into account. The robot's dead-reckoning system thus relies on the gyroscope's measurements for only a small fraction of the total travel time, keeping the system largely free of the drift associated with the gyroscope. They show that gyrodometry is able to decrease the maximal error to ±0.5°, which is 18 times smaller than the odometry-only error and about 8 times smaller than the gyro-only error. Although this system has good accuracy, it covers only straight-line motion, not turning motion.
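The threshold-switching rule at the heart of gyrodometry can be sketched as follows. This is a simplified illustration of the idea, not Borenstein and Feng's implementation: per-step angular rates from odometry and from the gyro are compared, and the gyro is trusted only when the two disagree by more than a threshold, so that gyro drift is integrated only during brief bump events.

```python
def gyrodometry_heading(odo_rates, gyro_rates, dt, threshold):
    """Fuse per-step angular rates (rad/s) from odometry and a gyroscope.
    When they disagree by more than `threshold`, trust the gyro (odometry
    is presumably slipping or crossing a bump); otherwise trust odometry,
    so that the gyro's drift is not continuously integrated.  Returns the
    final heading in radians."""
    heading = 0.0
    for odo, gyro in zip(odo_rates, gyro_rates):
        rate = gyro if abs(gyro - odo) > threshold else odo
        heading += rate * dt
    return heading
```

In a short simulated run where odometry reports a spurious rotation during a single bump while the gyro carries a small constant bias, the fused heading error stays below both the odometry-only and the gyro-only errors, which is the behavior the method is designed to exploit.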
2.5 Video Based Systems

2.5.1 Landmarks

Systems based on landmarks detect and track distinguishable features from sensed samples. If there are different types of landmarks, the task of identifying them must also be added to the list of tasks for most of these systems. There are two typical methods of landmark tracking. The first retains the image coordinates of the landmark from the last input and searches for it within a specified range of coordinates. The second derives the expected position of the landmark in the image from the last calculated location of the robot. Although the second requires additional processing, it can use location information from other sources such as dead-reckoning sensors and is therefore more accurate.

The composition of the landmark varies from system to system. Some systems use only natural landmarks [16], such as naturally occurring vertical lines, while others use artificial landmarks such as coded vertical poles or satellites (GPS) [20]. Usually the reliability of natural landmarks is not as high as that of artificial landmarks. Landmarks can be active, originating a signal to be sensed by the robot, or passive. GPS is an example of an active landmark system, since the satellites send radio signals that can be detected by receivers in the earth's atmosphere.

A method has been presented by Bao and Fujiwara [2] for finding the position and orientation of a vehicle by processing images of a flat artificial landmark. They use the landmark to control the direction of the robot, and a rotary encoder on the wheels to measure translations. Preliminary results show an average error of ±20° with this method. In order to increase the accuracy, the flat landmark was replaced by a cubic landmark; introducing edges into the landmark improved the direction control significantly, by about 20%. The system is very simple and useful, but the motion range and direction are quite restricted.

Nickerson and Jenkin [22] introduce a sonar-based mobile robot system named ARK (Autonomous Robot for a Known environment) that can autonomously navigate in a known industrial environment. The on-board sensor system includes a laser range finder, a color camera mounted on a pan-tilt head, and a dead-reckoning system. The ARK robot uses naturally occurring objects as landmarks. The robot relies on vision as its main sensor for global navigation, using a map of permanent structures in the environment (pillars) to plan its path. While the robot follows the pre-planned path, it keeps track of its own motion using the dead-reckoning system. Moving toward a point from which it believes there is a good view of known landmarks, it stops, identifies and measures the position and direction of the detected landmark using the range finder. After recognizing one of the landmarks, the robot updates its position with respect to the map. Once the first landmark is detected, the head turns toward the next closest landmark and finds its position as well. By measuring the positions and orientations of these two landmarks, the robot calculates its own position using triangulation. The ARK has been developed for a static industrial environment (a nuclear power plant). Clearly this type of motion tracking is not suitable for unstructured and natural environments, due to the poor accuracy of dead-reckoning measurements and the lack of known permanent landmarks. Also, the range finder is not very fast or economical. In general, landmark tracking systems need structured features that usually do not exist in natural, unconstrained environments.

2.5.2 Active Beacons

As one of the most common navigation aids, an active beacon system computes the absolute position of a robot, ship or airplane by measuring the directions of three or more transmitting beacons. The beacons usually transmit light or radio signals, and their transmitters must be accurately mounted at known sites in the environment. Although beacon systems are reliable, their installation and maintenance are very expensive.

An active beacon localization system is described by Kleeman [15] that provides position and direction for a mobile robot. The system combines the measurements of a dead-reckoning sensor with active ultrasonic beacons in order to obtain more accurate results. It includes 6 ultrasonic beacons, 8 receivers arranged at 45° intervals on the robot, and a dead-reckoning sensor located on the robot's driving wheel. Since dead-reckoning measurements are only accurate over short distances, an active ultrasonic beacon is a good complementary method, because it has random independent errors that do not accumulate. The system implements an extended Kalman filter for fusing the measurements. In experiments the system is shown to be capable of updating the position every 150 ms, with an error of 40 mm over a 12-meter path. Active beacons offer the advantages of simplicity, speed and low cost, but they are at odds with the notion of complete robot autonomy in an unstructured environment.

2.5.3 Image Correspondence

2.5.3.1 Model-Based Tracking

The purpose of model-based vision is to find the occurrence and location of known 2D or 3D objects within an image. Only certain aspects of the object, such as the geometry of its edges, are utilized to construct a geometric model of the object. Geometric models are attractive to work with because of their strong invariances under perspective projection; they can provide reliable results and they are computationally simple. Model-based tracking consists of applying model-based vision to a sequence of video images.
Due to the high data rate in sequences of images, this task seems more difficult than model-based vision, but the continuity between successive images eases the problem by allowing the motion of the object to be anticipated with some precision. The geometric model features used in tracking should be simple, reliable and computationally inexpensive for the sake of accuracy and speed. Edges and corners are reliable and simple local features.

Harris [11] developed a model-based 3D tracking algorithm named RAPiD (Real-time Attitude and Position Determination) for a known object with arbitrary motion, viewed by a standard video camera. The 3D object model consists of selected control points on high-contrast object edges, which can be surface markings, fold edges (the edges of a cube) or profile edges (the outline of a sphere). The disadvantage of this method is that the object's motion must be limited to small movements between frames: given the high data rate, if the object moves rapidly, the estimate of its pose will not be accurate. To deal with high-speed motion, Harris implemented an extended Kalman filter, which uses the previous and current estimates to predict the new estimate and its positional uncertainty. RAPiD has maintained object tracking at a rotational rate of 10 rad/s and an angular acceleration of 100 rad/s².

Since model-based tracking uses global attributes of the tracked object, it has significant advantages over other, local methods. Not only can it be used efficiently in a cluttered environment, it is also capable of estimating motion with multiple moving objects. Despite these advantages, it requires a significant amount of computation and cannot be used for motion tracking in an unstructured environment.

2.5.3.2 Feature-Based Tracking

In feature-based motion tracking, preliminary features of the scene are selected to be tracked. By measuring the displacement of these features in the sequence of images seen by a camera, it is possible to recover motion information. The choice of feature type is extremely important, since feature detection is the most computationally intensive task performed by the system. Corners, edges and informative patches are the most common features used in motion tracking systems. These features are stable and mostly correspond to real 3D structures in the scene.

A. Soroushi [26] has implemented a real-time feature-based motion tracking system to track the local motion of a machine-mounted, down-looking camera in a real-world environment. She uses computational vision approaches to build an image registration system that estimates motion in one plane (3 degrees of freedom) from two partially overlapping images. To reduce the processing time and to acquire more information from the images, the system processes only selected regions of the original images, chosen to contain similar features with the least possible disparity. With the aid of a coarse-to-fine search method, the system is able to track the motion of the camera at approximately 20 frames per second, and to detect rotations of up to 11.25° and translations of up to 8 pixels between each pair of images. Since a feature-based system does not need an object model, it is potentially faster than a model-based method, and as long as there are some features in the scene, motion tracking can be done reliably.

2.6 Summary

As explained, each approach to the motion tracking problem has a specific application, and most of these methods do not meet a 3D tracking requirement. Techniques such as radio sensing operate on a global scale and are therefore not very suitable for small motions within a few meters or less.
Other techniques, based on dead-reckoning sensors, are mostly suitable for flat and smooth surfaces; bumpy or natural surfaces cause a large amount of error and make them unreliable. Methods based on inertial sensors are very expensive and not practical for small systems. They also suffer from a large amount of drift in their estimates. Vision systems based on landmark recognition need a known environment, which is not available in general. Feature-based methods seem more applicable to natural scenes, although most of them cover motion in only one plane. In this thesis we have implemented a more general and affordable solution which covers 3D motion with accuracy within a few millimeters over a one meter distance. The system is applicable to any natural outdoor environment, as long as the scene has some texture and fair lighting conditions. The next chapter introduces the outline of this system and explains the objectives, issues and details of the individual tasks.

Chapter 3

System Design

The main objectives of this thesis are to provide visual analysis and control algorithms to monitor and register changes happening in the real world, and to apply such theories in the implementation of a three-dimensional motion tracking system. The system uses a robot head, called Triclops [24], which includes three cameras. This chapter is structured to present a specification of the system requirements.

System Specification

The design presented in this thesis is an exploration of the relevant issues in creating a real-time, on-board motion tracking system for a static environment using an active camera. Our system is designed to be as general as possible, but some assumptions are necessary, as there are uncertainties and exceptions. The following is the list of assumptions that we have undertaken in our approach.
3.1 Assumptions

• Much of current vision research is concerned with rigid objects, since non-rigid objects are very difficult to handle without prior knowledge of their behavior. Thus, we assume that our scene includes mostly rigid objects. If there are a few small moving objects, the system is able to rely on information from static objects, while information from moving objects can be discarded as statistical outliers.

• The camera characteristics are known. In particular, the focal length and the baseline separation of the stereo cameras should be provided.

• The motion of the robot is assumed to be limited in acceleration. This allows the match searching techniques to work on a small and predictable range of possible matches.

• There is no specific knowledge about the unstructured scene which is to be processed.

We would like to define a solution that covers the following objectives.

3.2 Objectives

• Continuous sensing. The robot vision system should be able to track its own motion in a continuous manner. Some systems have a stop-sense-and-move mode, which is contrary to the application of this system.

• Real-time performance. The system should be able to give the location and orientation of the robot at each moment. We would like to control the motion of the robot, an excavator for example, using this information. Therefore, we must be able to make on-line control decisions.

• No odometry. Due to the high accumulative error of odometers, we would like to avoid any odometry sensor. The outdoor environment in which we would like to run our robot may include many bumps and result in wheel slippage. Therefore any odometry sensor would fail in such an environment.

• Accuracy, simplicity and cost. Motion accuracy is the most important consideration of this problem. Most current accurate systems are expensive and are therefore not applicable to small robots.
We would like to come up with a solution which is reliable, simple and inexpensive enough to be used for every purpose.

3.3 Motion Tracking Methodology

Figure 3.1 is a simple overview of the motion tracking tasks.

Figure 3.1: Motion tracking system tasks.

3.3.1 Model Initialization and Calibration

Figure 3.2 provides a brief view of the initialization tasks in our system.

Figure 3.2: Initialization tasks.

Camera calibration is the process of modeling camera images with respect to the environment, including the determination of the internal camera geometric and optical characteristics and/or the 3D position and orientation of the camera frame relative to a world coordinate system. A camera system provides a model of a camera and an algorithm for finding its parameters, such as the focal length and a transformation from frame coordinates to ideal image plane coordinates. The camera is usually calibrated off-line, since calibration involves very high order calculations, and the model of the camera can then be used when computing the positions of subjects in subsequent images.

Ideal Camera Model

Each camera can simply be modeled using the classic pinhole model, by its optical center C and its image plane p. This leads to equations for calculating where on the image plane a point in space will appear. These equations are based on perspective projection [12]. Figure 3.3 shows a graphical presentation of the ideal camera model. The projective transformations that project the world point P(x, y, z) to its image point P'(u, v) are:

u = f x / z,  v = f y / z  (3.1)

Here, f represents the focal length of the camera.

Figure 3.3: Ideal pinhole camera model produces a perspective projection of the world.

Since cameras exhibit non-ideal behavior, precise measurements from images need a more sophisticated camera model than the ideal model. In other words, a simple perspective projection is not enough for locating world points in the image. Therefore a 3 x 4 calibration matrix T is needed such that, if the image point p' = (u, v) is the projection of the world point P = (x, y, z), the following relationship holds:

[U V W]^T = T [x y z 1]^T  (3.2)

where u = U/W and v = V/W.

Tsai Camera Calibration

In [30], Tsai introduces a camera calibration model and algorithm that is widely accepted and has been implemented on our stereo vision system, Triclops. Tsai's camera model consists of extrinsic and intrinsic parameters. Extrinsic parameters are those that are not a function of the construction of the camera, but of its position in the environment. Intrinsic parameters, on the contrary, are physical parameters related to the construction of the camera and the CCD itself. The camera model transforms real world coordinates into image coordinates and vice versa. Figure 3.4 shows the geometry of the camera calibration problem.

Figure 3.4: Camera geometry with perspective projection and radial lens distortion.

There are four steps in transforming from a 3D world coordinate to a camera coordinate:

Step 1: Transformation from the object world coordinates to the camera 3D coordinate system. (xw, yw, zw) is the coordinate of the object point P in the 3D world coordinate system, and P(x, y, z) is the coordinate of the same point in the camera coordinate system, with origin at O. This rigid body transformation can be represented by rotation and translation operations:

[x y z]^T = R [xw yw zw]^T + T  (3.3)

where R is a 3x3 rotation matrix and T is a 3x1 translation vector. The operators R and T form the group of extrinsic parameters of the camera model.

Step 2: Transformation from the 3D camera coordinates (x, y, z) to the ideal (undistorted) image coordinates (xu, yu). This is done using a perspective projection with a pinhole camera geometry, Equation 3.1.
Step 3: Transformation from the ideal image plane to the distorted image plane. This is due to radial lens distortion, and it transforms (xu, yu) by the following equations:

Xd + Dx = xu
Yd + Dy = yu  (3.4)

(Xd, Yd) is the distorted or true image coordinate on the image plane. Dx and Dy represent the distortion and are defined as:

Dx = Xd (κ1 r^2)
Dy = Yd (κ1 r^2)  (3.5)

where r^2 = Xd^2 + Yd^2 and κ1 is the radial distortion coefficient.

Step 4: Transformation from distorted image plane coordinates to frame buffer coordinates. Here the distorted image coordinates are changed to the row and column offsets of a buffer in the computer memory:

Xf = Sx (d'x)^(-1) Xd + Cx
Yf = (dy)^(-1) Yd + Cy  (3.6)

where:

(Xf, Yf): row and column of the image pixel in computer frame memory.
(Cx, Cy): row and column of the center of computer frame memory.

d'x = dx Ncx / Nfx  (3.7)

dx: center-to-center distance between adjacent CCD sensor elements in the x direction.
dy: center-to-center distance between adjacent CCD sensor elements in the y direction.
Ncx: number of sensor elements in the x direction.
Nfx: number of pixels in a line as sampled by the computer.
Sx: the uncertainty image scale factor to be calibrated.

Using a calibration process we can find a calibration matrix that projects each real point in the world to a corresponding pixel on the image plane.

3.3.2 Feature Detection

An important feature of a motion tracking system is its real-time performance. To satisfy this requirement the system may extract some scene marks or features to work with. Since feature detection is the most time consuming sub-task, choosing the right type of feature and the most effective extraction method is critical, and depends greatly on the type of input sensors. For instance, a system that uses video images may use an object recognition algorithm, while a GPS simply monitors a given frequency for a modulated signal.
As explained in the previous chapter, in systems based upon landmarks or models it is likely that at times no landmark is visible, and so the motion estimate will not be accurate for some percentage of the time. Choosing simple features within the scene increases the reliability of the solution for motion tracking and enables the system to find an answer to the problem most of the time, unless the scene is very uniform. Deciding on the type of feature is the primitive task in image registration. Image registration is the alignment of two or more images in order to compare them and detect similarities or differences; the common features used for matching are [6]:

• Raw pixel values, i.e. the intensities. They carry the most information possible, but computation with them is usually time consuming.

• Edges, surfaces and contours. Edges are interesting features since they usually correspond to real 3D structures in the scene, and they provide connectivity information [28]. They are less sensitive to noise and working with them is fast. However, edges extracted from real images are subject to inconsistencies such as drop-out from image to image. Also, edges are often not viewable as complete entities; for example, one or both end points may be obscured by other structures or lie outside the field of view.

• Salient features, such as corners, line intersections and points of locally maximum curvature on contour lines. These types of features have the advantage of being discrete, reliable, meaningful and more accurate in localization. However, the lack of connectivity is a major limitation when obtaining higher level descriptions such as objects and surfaces.

• Statistical features, such as moment invariants. These features are measurements over a region of the image, representing an evaluation of the region. They have the advantage of using all the information, but are computationally very expensive.

• Higher level features, such as structural and synthetic features.
In this type of feature, relations and other higher level information are used. They are usually unique, fitted to specific applications, and incapable of determining exact matches.

In the search for a feature type that suits our application, a natural unstructured environment with varying lighting conditions, we have chosen to work with corners, because they are discrete and partially invariant to scale and rotational changes, which makes them quite suitable. Now that we have chosen the type of feature, we need an algorithm for extracting these features. Since we find these features from scratch each time, we should be cautious in choosing the best method, for the most accurate positioning with the least computation time. Figure 3.5 provides a brief overview of corner detection and the two specific corner detectors that have been implemented in our system.

Figure 3.5: Feature detection sub-sections.

Corner Detection

Many approaches to detecting corners in shapes involve segmentation of the shape, for example by thresholding, extracting the boundary and searching for significant turnings in the boundary. However, relying on segmentation might lead one astray through errors in the segmentation. Therefore it is of interest to find other techniques of corner detection which can be applied directly to any gray-scale image without the need for prior segmentation. Kitchen and Rosenfeld [14] introduce a group of corner detection techniques that involve applying a local operator to a neighborhood of a gray-scale image. They show, with the same localization accuracy, that their techniques are faster than previous methods.
On average they need 10n^2 + 1 arithmetic operations and one square-root function call for each image pixel, where the neighborhood is n x n.

Moravec Corner Detector

Moravec [19] first introduced a corner detector based on considering a local window in the image and determining the average changes of intensity that result from shifting the window by a small amount in various directions. In a simple analysis there are three cases to be considered:

• If the image patch is uniform, then shifting the window along all directions results in small changes.

• If the image patch includes an edge, then shifting along the edge results in small changes, while shifting perpendicular to the edge results in larger changes.

• If the image patch includes a corner, then all shifts result in large changes.

Therefore a corner can be detected when the minimum of the changes produced by any of the shifts is large enough. This is expressed analytically by:

E(x, y) = Σ_{u,v} W_{u,v} |I_{x+u, y+v} − I_{u,v}|^2  (3.8)

where:

I: image intensity
W_{u,v}: rectangular unity image window
E(x, y): change produced by a shift of (x, y)
(x, y) ∈ {(1, 0), (1, 1), (0, 1), (−1, 1)}

The Moravec corner detector simply looks for local maxima in the minimum of E which are above a suitable threshold. However, this corner detector suffers from a number of disadvantages:

1. The response is noisy due to the rectangular binary window W.

2. The response is anisotropic because only a set of shifts at every 45° is considered.

3. The operator responds too readily to edges because only the minimum of E is considered.

Harris Corner Detector

Harris and Stephens [10] enhanced Moravec's technique in order to overcome its deficiencies.

1. A smooth circular window can compensate for the noisy response due to the rectangular window.

2. All possible shifts can be covered by an analytical expansion of E about the shift origin.
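As an illustration of Equation 3.8, the Moravec measure can be sketched in a few lines. This is only a sketch of the idea, not the thesis implementation; the function name, window size and shift handling are our own choices.

```python
import numpy as np

def moravec_response(img, win=3, shifts=((1, 0), (1, 1), (0, 1), (-1, 1))):
    """Moravec interest measure: the minimum, over the four shifts of Eq. 3.8,
    of the sum of squared differences inside a square window."""
    img = np.asarray(img, float)
    h, w = img.shape
    r = win // 2
    R = np.zeros_like(img)
    for yy in range(r + 1, h - r - 1):
        for xx in range(r + 1, w - r - 1):
            patch = img[yy - r:yy + r + 1, xx - r:xx + r + 1]
            E = []
            for sx, sy in shifts:
                moved = img[yy + sy - r:yy + sy + r + 1,
                            xx + sx - r:xx + sx + r + 1]
                E.append(np.sum((moved - patch) ** 2))
            R[yy, xx] = min(E)   # a corner requires even the *minimum* change to be large
    return R
```

Thresholding R and keeping local maxima then yields the corner locations.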
E(x, y) = Σ_{u,v} W_{u,v} |I_{x+u, y+v} − I_{u,v}|^2  (3.10)

Since the displacement vector is small, the intensity function can be approximated by its Taylor series up to the linear term:

I(x + u, y + v) − I(u, v) ≈ x ∂I/∂x + y ∂I/∂y  (3.11)

so that

E(x, y) ≈ Σ_{u,v} W_{u,v} (xX + yY)^2  (3.12)

The two gradients are approximated by:

X = I ⊗ [−1, 0, 1] ≈ ∂I/∂x
Y = I ⊗ [−1, 0, 1]^T ≈ ∂I/∂y  (3.13)

For small shifts, E can be written as:

E(x, y) = Ax^2 + 2Cxy + By^2  (3.14)

where:

A = X^2 ⊗ W,  B = Y^2 ⊗ W,  C = XY ⊗ W  (3.15)

3. Reformulating the corner measure can make use of the variation of E with the direction of shift:

E(x, y) = [x, y] M [x, y]^T,  where M = [A C; C B]  (3.16)

Harris and Stephens revisit the three cases considered in Moravec's algorithm. We consider α and β as the eigenvalues of the matrix M. Since E is closely related to the local auto-correlation function, α and β are proportional to the principal curvatures of the local auto-correlation.

• If both curvatures are small, the local auto-correlation is flat, meaning that the windowed image region is approximately uniform.

• If just one curvature is high, then the local auto-correlation is ridge shaped and only shifts along the ridge cause small changes, which indicates an edge.

• If both curvatures are high, then the auto-correlation function has a sharp peak and a shift in any direction causes large changes, indicating a corner in the windowed image.

To measure the quality of a corner, a corner response R is introduced:

R(α, β) = Det(M) − k Tr(M)^2  (3.17)

where

Det(M) = αβ = AB − C^2,  Tr(M) = α + β = A + B  (3.18)

A quick look at R shows that the response function is very small within a uniform region, negative in edge regions and positive in corner regions. The constant k is set by K, the maximum ratio of the eigenvalues of M for which the response function is positive. A value of K = 25 (Harris' suggestion) is used. Figure 3.6 shows the result of Harris' corner detection on a sample image.
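The Harris response of Equations 3.13 to 3.17 can be sketched as follows. This is a hedged sketch, not the thesis code: we use a uniform 3x3 box window in place of the smooth circular window, and the common constant k = 0.04, which is consistent with an eigenvalue-ratio bound on the order of the K = 25 quoted in the text.

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris/Stephens corner response R = Det(M) - k Tr(M)^2 (Eq. 3.17)."""
    img = np.asarray(img, float)
    X = np.zeros_like(img); Y = np.zeros_like(img)
    X[:, 1:-1] = img[:, 2:] - img[:, :-2]      # X = I (*) [-1, 0, 1]   (Eq. 3.13)
    Y[1:-1, :] = img[2:, :] - img[:-2, :]      # Y = I (*) [-1, 0, 1]^T

    def smooth(Z):                             # simple 3x3 box window W
        out = np.zeros_like(Z)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                out[1:-1, 1:-1] += Z[1 + dy:Z.shape[0] - 1 + dy,
                                     1 + dx:Z.shape[1] - 1 + dx]
        return out / 9.0

    A, B, C = smooth(X * X), smooth(Y * Y), smooth(X * Y)   # Eq. 3.15
    return A * B - C * C - k * (A + B) ** 2                 # Det(M) - k Tr(M)^2
```

On a synthetic step image, the response is positive at the corner of the step, negative along its edges, and zero in uniform regions, matching the three cases above.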
Stereo Vision

The most important information required for the 3D representation of features in the world is depth: the distance, at each time, of a feature from a plane passing through the center of the camera. Feature depth can be constructed using stereo vision. Before getting into the details of stereo vision, we review some of the important concepts related to the stereo vision problem. Figure 3.7 shows the related subjects that are explained in this section.

Figure 3.6: Result of corner detector on a sample image.

Figure 3.7: Stereo vision sub-sections.

Epipolar Constraint

Given a known imaging geometry, the epipolar constraint in stereo images defines a line in one image along which a match can be found. Consider a stereo vision system with two pinhole camera models with optical centers C1 and C2. A world point P projects to Pr and Pl respectively. E1 and E2 are the epipoles, through which all epipolar lines pass, and DE1 and DE2 are the epipolar lines on which Pr and Pl are bound to lie. P1, P2 and P3 are points in the scene that may all project to the point Pr in the right image, and P1l, P2l and P3l are the projections of P1, P2 and P3 on the left image. The epipolar constraint forces the points P1l, P2l and P3l to lie along a straight line. The implication of the epipolar constraint is that the search for a matching point in one image needs to be done only along one line in the other image.

Rectification

As seen with the epipolar constraint, given several images and a point in one of them, the corresponding points in the other images are bound to lie on epipolar lines. These epipolar lines are parallel if and only if all the image planes are parallel. In general, the epipolar lines form a pencil of lines going through an epipolar center.
The epipolar center in each image is the image of the other camera's optical center in that image plane. If the image planes are coplanar and lie parallel to the vector defined by the optical centers C1 and C2, then the epipolar centers are sent to infinity and the epipolar lines form a pencil of parallel lines. Ayache and Hansen [1] show that it is always possible to apply to each image a transformation which is linear in projective coordinates to obtain conjugate horizontal epipolar lines. Rectification is one of the important processes in stereo vision: it allows potential matches between two or more images to satisfy simpler relations, allowing for simpler and more efficient matching algorithms. Figures 3.10 and 3.11 show two images before and after calibration and rectification. As discussed, to find a match a search along the epipolar line is sufficient, and since the epipolar lines lie along an image axis after rectification, a search along one column or row is sufficient. The rectification process belongs to the group of camera calibration functions, and it should be applied to every image to be processed.

Figure 3.8: Epipolar constraint.

Figure 3.9: Rectification of two images.

Stereo Algorithm

The basic idea behind a stereo vision algorithm is to identify a primitive in one image and compare it with a number of primitives in the other images. These candidate matches, or primitives, are chosen on the rectified images. The type of primitive used for matching defines the type of stereo method employed. For example, area-based methods use a patch of the image, while feature-based techniques match the results of a lower level image processing algorithm, such as edges or line segments. The matching process between primitives is established based on the amount of similarity or difference between the properties of the primitives.
In the ideal case these primitives would be identical, and matching would be done based on the properties that have not changed between two slightly different viewpoints, although sometimes, due to the nature of the stereo problem, the two views differ significantly. This difference can originate from several sources, such as camera distortion, noise, occlusions, repetitive features and lack of texture [31]. In the presence of these obstacles, the general solution for stereo vision is to define a similarity measure that determines the most likely match coordinates. Generally, the probability of each match candidate is expressed by a score and the most likely candidate is chosen based on a simple comparison of the scores. The problem with this method has been that the percentage of false matches is very high. Accordingly, validation techniques were developed to reduce the number of false matches.

Intensity-Based Methods

In these methods the primitives are rectangular patches of the image, centered at the pixel for which a match is being sought. The comparison takes place against patches in the other image chosen so that the epipolar constraint is satisfied. The match is found by choosing the best correlation score. A number of correlation measures exist, such as:

Normalized cross-correlation:

C1 = Σ_{x,y} (I1(x, y) − Ī1)(I2(x, y) − Ī2) / sqrt( Σ_{x,y} (I1(x, y) − Ī1)^2 · Σ_{x,y} (I2(x, y) − Ī2)^2 )  (3.19)

Normalized mean squared differences:

C2 = Σ_{x,y} ((I1(x, y) − Ī1) − (I2(x, y) − Ī2))^2 / sqrt( Σ_{x,y} (I1(x, y) − Ī1)^2 · Σ_{x,y} (I2(x, y) − Ī2)^2 )  (3.20)

where the sums run over a window of dimensions M x N and Ī1, Ī2 are the window means. Since neither of these measurements guarantees the correctness of the match, a validation check should be done. One of the simplest means of validation is to set a threshold and accept the match if and only if its score is above that threshold.

Figure 3.10: An unrectified image with the lens distortion.
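The two correlation measures of Equations 3.19 and 3.20 can be sketched directly (a minimal sketch; the function names are ours, and the window means are simply the patch means):

```python
import numpy as np

def ncc(p1, p2):
    """Normalized cross-correlation (Eq. 3.19): +1 for a perfect match,
    invariant to an additive brightness offset."""
    a = p1 - p1.mean()
    b = p2 - p2.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

def nmsd(p1, p2):
    """Normalized mean squared differences (Eq. 3.20): 0 for a perfect match."""
    a = p1 - p1.mean()
    b = p2 - p2.mean()
    return float(((a - b) ** 2).sum() / np.sqrt((a * a).sum() * (b * b).sum()))
```

Note that subtracting the means makes both scores insensitive to a uniform brightness change between the two images, which is one reason these normalized forms are preferred over raw sums of differences.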
The problem with this sort of validation is that it does not identify an invalid match. A more sophisticated technique is to perform the correlation twice, reversing the roles of the two patches. In this case a match is considered valid if and only if the two similarity measurements give similar disparities [23]. Although this method is a robust solution for validation, it still has a small likelihood of false matching; however, by changing the size of the patch and the threshold value, this likelihood can be minimized. Correlation methods are simple and efficient and they can be run on specialized parallel hardware or single-CPU processors over a wide range of scenes. However, they fail under occlusion and in uniform (textureless) image areas.

Feature-Based Methods

These approaches use the results of a lower level image processing algorithm, which detects edges, corners or line segments, as primitives. Since there is more information about the primitives, such as shape and size, it is more likely that the right primitive is matched. The similarity measurement in feature-based algorithms can differ based upon the type of feature that is used. For example, edge pixels can be compared based on their orientation and intensity, while line segments can be compared by their length and orientation. Although corners can be detected easily and with high stability in the presence of image noise, they carry limited information, and a comparison based on corner properties alone results in many false matches. For our problem, a combination of intensity-based and feature-based algorithms seems to be the best choice. After running the corner detector and finding the location of each corner, the intensity-based method is applied to the corner and its neighbors.

Depth Construction Using Stereo

If corresponding features are found and matched within a set of stereo images, construction of depth, the distance of the object from the camera, is possible using triangulation.
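The left-right validation described above can be sketched on a single rectified scanline, where the epipolar search reduces to a 1-D search along the row. This is only an illustrative sketch under our own naming; the score is a plain sum of squared differences rather than the normalized measures above.

```python
import numpy as np

def best_match_column(row_a, row_b, x, half, search):
    """Best-scoring column in row_b for the patch centred at x in row_a."""
    ref = row_a[x - half:x + half + 1]
    best, best_score = None, None
    for cx in range(max(half, x - search), min(len(row_b) - half, x + search + 1)):
        cand = row_b[cx - half:cx + half + 1]
        score = -np.sum((ref - cand) ** 2)        # higher is better
        if best_score is None or score > best_score:
            best, best_score = cx, score
    return best

def validated_match(row_l, row_r, x, half=2, search=5):
    """Left-right consistency check: accept the match only if matching
    back from right to left returns the original column x."""
    xr = best_match_column(row_l, row_r, x, half, search)
    if xr is None:
        return None
    xl = best_match_column(row_r, row_l, xr, half, search)
    return xr if xl == x else None
```

Running the correlation in both directions roughly doubles the cost, but, as noted in the text, it is what allows invalid matches to be detected rather than merely scored.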
Figure 3.12 shows the geometry of triangulation.

Figure 3.12: Depth construction using camera parameters.

From the similar triangles △OLP ∼ △PC2S and △OLQ ∼ △QC1T:

k = d1 (z − f) / f  (3.21)

(z − f) / f = (k + d1 + B − d2) / d2  (3.22)

where:

z: depth of the object
C1 and C2: camera centers
p: image plane
d1: the object's image displacement from the origin of the first camera coordinate system
d2: the object's image displacement from the origin of the second camera coordinate system

By substituting k from 3.21 into 3.22, we obtain:

z = f B / (d2 − d1)  (3.23)

Equation 3.23 shows that finding the disparity (d2 − d1), the positional change of one point between the two images, is sufficient for constructing the point's depth z.

3.3.3 Feature Identification

When there is more than one feature type, model or landmark to be detected, each detected feature must be identified. Identifying the feature can be based upon search for and recognition of one or more identification features, such as barcodes [5]. Feature identification can be eliminated if only one type of model or feature is tracked. In our implementation we work with uniform corners; therefore, we ignore the feature identification process.

3.3.4 Feature Tracking

In this section corresponding 3D features are tracked from one frame (at time t) to the next frame (at time t + Δt). Systems with more complicated features or landmarks usually track the landmark through different frames, since detecting the landmark or model from scratch may take more time. This may take place by tracking the position of the robot and predicting the position of the landmark in the next frame. Although this method has the advantage of being fast and high in performance, losing track, which happens often, is a serious disadvantage. Systems with simple feature types usually detect features in each set of images from scratch. Since the features are simple and therefore less informative, a matching process is needed to find the corresponding similar features in subsequent frames.
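Equation 3.23 reduces depth construction to a single division once the disparity is known; a minimal sketch (the function name and error handling are ours):

```python
def depth_from_disparity(f, B, d1, d2):
    """Depth by triangulation, Eq. 3.23: z = f * B / (d2 - d1)."""
    disparity = d2 - d1
    if disparity <= 0:
        # zero disparity corresponds to a point at infinity
        raise ValueError("non-positive disparity")
    return f * B / disparity
```

With f expressed in pixels and the baseline B in meters, a 10-pixel disparity on a 500-pixel focal length and 12 cm baseline gives z = 500 * 0.12 / 10 = 6 m; note also that depth resolution degrades quickly as the disparity shrinks.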
In our system, in which corners have been chosen as features, it is not possible to track identical corners from frame to frame without detecting them in the new images. Accordingly, a match metric function is implemented to measure the similarity between each pair of candidates.

Corner Matching

The cross-correlation match metric provides a similarity score for each pair of match candidates. Each pair of candidates is chosen by a simple search routine in a neighborhood around the predicted position of each corner in the second image frame. The two corners that give the best similarity metric are considered a matched pair. As explained for stereo matching, for each pair of candidates two fixed rectangular windows, centered at the location of the corner, are considered, one in each image (Figure 4.5). Then the correlation between corresponding pixels in the two windows is computed:

C = Σ_{x,y} ((I1(x, y) − Ī1) − (I2(x, y) − Ī2))^2 / sqrt( Σ_{x,y} (I1(x, y) − Ī1)^2 · Σ_{x,y} (I2(x, y) − Ī2)^2 )  (3.24)

where the sums run over an M x N window. Since the correlation function relies on the neighborhood of pixels as well as on each corner itself, false matches can be largely eliminated by thresholding the similarity metric. Since the normalized squared difference is chosen as the similarity metric, we locate the most accurate matches simply by considering matches with a similarity metric lower than a maximum.

3.3.5 Motion Estimation

Once identical features are detected and identified in a sequence of images, the motion of the camera must be calculated. This includes calculating the position of the camera relative to the features and transferring this position to the absolute reference frame. Finding the final transformation usually involves iterative least squares minimization, which must be performed with care, due to high sensitivities which may cause large errors in the calculated location.
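The corner matching step, Equation 3.24 scored over a candidate neighborhood and then thresholded, can be sketched as follows. The names, window half-size and threshold value are illustrative assumptions, not the thesis parameters.

```python
import numpy as np

def match_score(img1, img2, c1, c2, half=3):
    """Normalized squared-difference metric of Eq. 3.24 for two corner windows."""
    (y1, x1), (y2, x2) = c1, c2
    w1 = img1[y1 - half:y1 + half + 1, x1 - half:x1 + half + 1].astype(float)
    w2 = img2[y2 - half:y2 + half + 1, x2 - half:x2 + half + 1].astype(float)
    a, b = w1 - w1.mean(), w2 - w2.mean()
    return float(((a - b) ** 2).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

def match_corner(img1, img2, c1, candidates, half=3, max_score=0.5):
    """Pick the candidate corner with the lowest score (0 = identical
    windows); reject the match entirely if the best score exceeds the
    threshold, eliminating most false matches."""
    scored = [(match_score(img1, img2, c1, c2, half), c2) for c2 in candidates]
    best_score, best = min(scored, key=lambda s: s[0])
    return best if best_score <= max_score else None
```

Because lower scores are better for this metric, the acceptance test is "below a maximum", exactly as stated in the text.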
Figure 3.13 shows feature tracking sub-sections.

Figure 3.13: Feature tracking sub-sections.

Solving the Camera Motion

Construction of 3D features from 2D features is non-linear, but the whole process is a smooth and well-behaved transformation. Any 3D motion includes rotation and translation.

• Rotations in depth, or about an axis, are functions of the cosine of the rotation angles.

• Translation toward or away from the camera introduces a perspective distortion as a function of the inverse of the distance from the camera.

• Translation parallel to the image plane is almost linear.

All these transformations are smooth and well behaved. Therefore the problem of 3D motion estimation is a promising candidate for the application of Newton's method, which is based on assuming that the function is locally linear. To minimize the chance of converging to a false local minimum, we look for errors and eliminate them during the iterative process.

Newton's method is an iterative method that in each iteration computes a correction vector x that is subtracted from the current estimate P, resulting in a new estimate. If P^(i) is the parameter vector for iteration i, then

P^(i+1) = P^(i) − x  (3.25)

Given a vector of error measurements between the world 3D features and their images, we find the x that eliminates (minimizes) this error. The effect of each element of the correction vector, x_j, on the error measurement e_i is the partial derivative of the error with respect to that parameter multiplied by the correction; this follows from the main assumption, the local linearity of the function:

Jx = e,  where J_ij = ∂e_i / ∂x_j  (3.26)

J is called the Jacobian matrix. In this system, e_i is the error between the predicted location of the object and the actual position of the match found in image coordinates.
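The iterative scheme of Equations 3.25 and 3.26 can be sketched as a generic least-squares loop; the `residual` and `jacobian` callables below are hypothetical stand-ins for the reprojection error and its derivatives, not the thesis code.

```python
import numpy as np

def newton_least_squares(residual, jacobian, p0, iters=10):
    """Newton-style iteration: solve the normal equations J^T J x = J^T e
    for the correction x, then subtract it from the estimate (Eq. 3.25)."""
    p = np.asarray(p0, float)
    for _ in range(iters):
        e = residual(p)                          # error vector e
        J = jacobian(p)                          # Jacobian J_ij = de_i / dx_j
        x = np.linalg.solve(J.T @ J, J.T @ e)    # dense LU-based solve
        p = p - x
    return p
```

For a residual that is exactly linear in the parameters, this converges in a single iteration to the ordinary least-squares solution; for the mildly non-linear reprojection error it converges in a few iterations under the local-linearity assumption.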
Each row of this matrix equation states that one measured error e; should be equal to the sum of all changes in that error resulting from the parameter correction shown in [18]. The problem is solved if all these constraints can be simultaneously satisfied; this means that the error is reduced to zero after subtracting the correction 3.25. Since the Equation 3.26 is usually over-determined we can find an x that minimizes the 2-norm of the residual rather than solving the whole equation. min\\Jx — e||2 (3.27) Equation 3.26 has the same solution as normal equation, JTJx — J1 e. On each iteration of Newton's method, we can simply solve the normal equation for x using any standard method. LU decomposition [25] has been used here. Note that 40 this minimization is based on the assumption that the original nonlinear function is locally linear over the range of typical errors. Efficient Computation of Partial Derivatives The most computationally expensive aspect of implementing the Newton method is in calculating the partial derivatives or the Jacobian matrix. Lowe [17] has developed a method to pre-compute these partial derivatives in an efficient way. We can project the three dimensional model point P onto a two-dimensional image point (u, v) («,«) = ( - , - ) (3-28) z z The partial derivatives with respect to the translation parameters can be most easily calculated by first reparametrizing the projection equations [17]. If the motion parameter vector is (Dx, Dy, Dz:cj)X}<py,cj)z), then the new location of projected point in the image is M = {iiE±M,m±3ii) (3,9) z + Dz z + Dz Dx, Dy and Dz show the incremental translations and <j)x, (py and (f>z are rotational increments about the x, y and z. In this application the first two rows of the Jacobian are: a„, Pi.. Pi., Pi., Pi.. Pi.. A<f>z = eu (3.30) = ev (3.31) du A du du . 
Now we would like to precompute these partial derivatives using Equation 3.29. For the translations,

$$\frac{\partial u}{\partial D_x} = \frac{f}{z + D_z}, \quad \frac{\partial u}{\partial D_y} = 0, \quad \frac{\partial u}{\partial D_z} = -\frac{f(x + D_x)}{(z + D_z)^2} \qquad (3.32)$$

and for the rotations,

$$\frac{\partial u}{\partial \phi_x} = \frac{f}{z + D_z}\frac{\partial x}{\partial \phi_x} - \frac{f(x + D_x)}{(z + D_z)^2}\frac{\partial z}{\partial \phi_x} \qquad (3.33)$$

$$\frac{\partial u}{\partial \phi_y} = \frac{f}{z + D_z}\frac{\partial x}{\partial \phi_y} - \frac{f(x + D_x)}{(z + D_z)^2}\frac{\partial z}{\partial \phi_y} \qquad (3.34)$$

$$\frac{\partial u}{\partial \phi_z} = \frac{f}{z + D_z}\frac{\partial x}{\partial \phi_z} - \frac{f(x + D_x)}{(z + D_z)^2}\frac{\partial z}{\partial \phi_z} \qquad (3.35)$$

The partial derivatives of x, y and z with respect to counterclockwise world-coordinate rotations φ (in radians) can be found in Table 3.1.

Table 3.1: The partial derivatives table.

        |  x  |  y  |  z
φ_x     |  0  | −z  |  y
φ_y     |  z  |  0  | −x
φ_z     | −y  |  x  |  0

Using this table we can now summarize the partial derivatives of u and v with respect to the camera viewpoint in Table 3.2, where c = (z + D_z)^{-1}.

Table 3.2: Partial derivatives of u and v with respect to the camera viewpoint parameters.

        | ∂u                  | ∂v
D_x     | f/(z+D_z)           | 0
D_y     | 0                   | f/(z+D_z)
D_z     | −fc²(x+D_x)         | −fc²(y+D_y)
φ_x     | −fc²(x+D_x)y        | −fc(z + cy(y+D_y))
φ_y     | fc(z + cx(x+D_x))   | fc²x(y+D_y)
φ_z     | −fcy                | fcx

This table shows that we can easily and efficiently compute the Jacobian matrix elements in Equations 3.30 and 3.31. Computing the Jacobian elements is now a fast task and quite promising for real-time system performance.

3.3.6 Position Filtering

For many systems, each motion estimate from an individual sample or set of samples contains a significant degree of random error. If there is no significant systematic error, these errors can be reduced by filtering the location information. The filtering methods may vary depending on the system. One of the most common filters in position and motion estimation is the Kalman filter [9]. Kalman filters usually combine the location information from different sources or different times/positions to produce a better location estimate. The following sections describe the position filtering sub-tasks which are used in our system for filtering the world position of each feature, as shown in Figure 3.14.
Figure 3.14: Position filtering components (Kalman filter, least squares estimation, recursive least squares estimation).

Kalman Filter

The Kalman filter is the general solution to the recursive, minimized mean square estimation problem within the class of linear estimators. It gives a linear, unbiased and minimum-error-variance recursive solution for estimating the unknown state of a dynamic process or quantity of interest from noisy data. The Kalman filter uses knowledge of the second-order statistics of the process noise and the measurement noise to provide a solution that minimizes the mean square error between the true state and the estimate of the state. It also provides a convenient means of determining the weightings to be given to input measurement data when the measurements are not equally reliable; here, corner features are weighted based upon estimation variance and quality. For Gaussian random variables the Kalman filter is the optimal predictor-estimator; for variables of forms other than Gaussian, it is the best only within the class of linear estimators. Kalman filtering has been used extensively for many diverse applications, such as navigational and guidance systems, radar tracking, sonar ranging and satellite orbit measurements.

Least Squares Estimation

Gauss discovered [9] that if a system of equations is expressed in matrix form,

$$\begin{pmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\ \vdots & & & & \vdots \\ a_{m1} & a_{m2} & a_{m3} & \cdots & a_{mn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}, \quad \text{or} \quad Ax = b \quad (m > n) \qquad (3.36)$$

then the best solution for $x$, given $A$ and $b$, is the one that minimizes $\|Ax - b\|^2$. Usually in applications the measurements $b_i$ are not equally reliable, so it is more reasonable to attach more weight to the more reliable measurements. If the first observation is trusted more than the second (for example, because it comes from a larger sample or more careful measurements), then we minimize the weighted sum

$$W_1^2 (A_1 x - b_1)^2 + W_2^2 (A_2 x - b_2)^2 + \cdots$$

where $A_i$ denotes the $i$-th row of $A$,
with $W_1 > W_2$. (3.38)

The systematic solution to this problem is to find the right weighting matrix $C$ for a given $W$ ($C = W^T W$) and then, as data arrive continuously, fit the best straight line to them. Since with the arrival of each new set of data the best-fit line changes, we would hopelessly fall behind by trying each time to solve the new normal equation

$$A^T A x = A^T b \qquad (3.39)$$

If, rather than solving the problem from the beginning, only the changes in $x$ are computed, then a real-time implementation becomes possible. This is called the Recursive Least Squares Algorithm.

Recursive Least Squares and Kalman Filtering

Suppose we have estimated the vector $x$ from a first set of measurements $b_0$, giving the system $A_0 x = b_0$. It has been proven [29] that the best weighting choice $C$ is the inverse of the covariance matrix of the error, $V$:

$$C = V^{-1} \qquad (3.40)$$

If the solution of $A_0 x = b_0$ is weighted by the covariance matrix $V_0$ of the errors in $b_0$, then

$$x_0 = (A_0^T V_0^{-1} A_0)^{-1} A_0^T V_0^{-1} b_0 \qquad (3.41)$$

The error $(x - x_0)$ then has zero mean, and its covariance matrix is

$$P_0 = E[(x - x_0)(x - x_0)^T] = (A_0^T V_0^{-1} A_0)^{-1} \qquad (3.42)$$

Now when new data arrive, the best estimate for the combined system $A_0 x = b_0$ and $A_1 x = b_1$ can be computed from $x_0$ and $b_1$ without restarting the calculation from $b_0$; the result should be the same as computing $x_1$ from scratch. The covariance matrix of the errors $(e_0, e_1)$ is

$$V = \begin{pmatrix} V_0 & 0 \\ 0 & V_1 \end{pmatrix}$$

$V$ is block-diagonal because $e_1$ is independent of $e_0$. Therefore

$$P_1^{-1} = \begin{pmatrix} A_0 \\ A_1 \end{pmatrix}^T \begin{pmatrix} V_0 & 0 \\ 0 & V_1 \end{pmatrix}^{-1} \begin{pmatrix} A_0 \\ A_1 \end{pmatrix} = A_0^T V_0^{-1} A_0 + A_1^T V_1^{-1} A_1 \qquad (3.43)$$

Now $x_1$ is not based on $b_1$ alone; instead, $A_0 x = b_0$ and $A_1 x = b_1$ are combined to solve for $x_1$:

$$A^T V^{-1} A x = A^T V^{-1} b \qquad (3.44)$$

$$x_1 = P_1 \begin{pmatrix} A_0 \\ A_1 \end{pmatrix}^T \begin{pmatrix} V_0 & 0 \\ 0 & V_1 \end{pmatrix}^{-1} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = P_1 (A_0^T V_0^{-1} b_0 + A_1^T V_1^{-1} b_1) \qquad (3.45)$$

This is the solution that we find recursively, by using the previously computed $x_0$ in place of $b_0$ in Equation 3.45.
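This recursive combination can be sketched numerically as below. It is an illustrative sketch, not the thesis code; explicit matrix inverses are kept for clarity of correspondence with the formulas rather than for efficiency.

```python
import numpy as np

def rls_update(x_prev, P_prev, A, b, V):
    """Fold a new measurement set b (covariance V) into the running
    estimate: information-form covariance update, then gain-weighted
    correction of the estimate."""
    P_inv = np.linalg.inv(P_prev) + A.T @ np.linalg.inv(V) @ A
    P = np.linalg.inv(P_inv)
    k = P @ A.T @ np.linalg.inv(V)            # gain matrix
    x = x_prev + k @ (b - A @ x_prev)
    return x, P

# repeated unit-variance observations of a scalar: the estimate moves
# toward the data and the covariance P shrinks with every update
x, P = np.array([0.0]), np.eye(1)
for z in [1.2, 0.8, 1.1, 0.9]:
    x, P = rls_update(x, P, np.eye(1), np.array([z]), np.eye(1))
```

After the four updates the scalar estimate is the prior and the four observations averaged with equal weight (0.8), and the covariance has shrunk to 1/5.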
We also update the matrix $P$ along with the estimate of $x$, using 3.43:

$$P_1^{-1} = P_0^{-1} + A_1^T V_1^{-1} A_1 \qquad (3.46)$$

It is important to note that 3.46 is based on the statistical properties of $b_0$ and $b_1$ and not on their exact measurements. Substituting into Equation 3.45,

$$x_1 = P_1 (P_1^{-1} x_0 - A_1^T V_1^{-1} A_1 x_0 + A_1^T V_1^{-1} b_1) = x_0 + k_1 (b_1 - A_1 x_0), \quad \text{where } k_1 = P_1 A_1^T V_1^{-1} \text{ (gain matrix)} \qquad (3.47)$$

This formula is recursive, and it uses $x_0$ instead of $b_0$. The least squares estimate of $x$, and its covariance $P$, are given recursively by

$$P_i^{-1} = P_{i-1}^{-1} + A_i^T V_i^{-1} A_i \qquad (3.48)$$

$$x_i = x_{i-1} + k_i (b_i - A_i x_{i-1}), \quad \text{with } k_i = P_i A_i^T V_i^{-1} \qquad (3.49)$$

This formula has been implemented in our solution for the optimization of each corner's world position.

Chapter 4

System Implementation

This chapter presents a detailed design of the motion detection algorithm for estimating any 3D transformation existing between two or more sequential images. It includes a block diagram of the system, shown in Figure 4.1, and a brief description of the principal procedures. The basic scheme is to take a set of images and unwarp them to correct for lens distortions prior to applying a corner detector to each image. Corresponding corners are matched in the vertically and horizontally aligned images, and a disparity measurement is computed for each stable corner pair. At this point, using the corner position and disparity information, a list of features in world coordinates is generated. This feature list is matched against a list of current world points, then the optimal transformation between the feature points and the world points is found using the least squares fit algorithm, which in turn updates the camera motion. The camera position is used to compute a position for each feature point in world space, along with an error vector for each point. This information is processed using a Kalman filter to update the list of world points. Finally, weak features (features with smaller probabilities of reappearing in the scene) are retired and new features are introduced.
This process guarantees that at each time there are enough features for the system to rely on. The chapter includes a detailed explanation of each procedure.

Figure 4.1 shows the processing pipeline at each level N: capturing the N-th set of images, lens distortion correction, feature extraction, stereo matching and match validation, matching world points and feature points, motion estimation, Kalman filter updating, and retiring weak corners which are unlikely to re-appear while introducing new corners, with the world points passed from level to level.

Figure 4.1: Flowchart of real-time 3D motion detection system.

4.1 Feature Extraction

At each time step the system grabs a set of three images. After correcting the lens distortions [30], Gaussian smoothing is applied to the (320 × 240) pixel images in order to remove high-frequency noise. As explained in detail in the previous chapter (Section 3.3.2), the Harris and Stephens corner detector is then applied to each image [10].

Figure 4.2: Corners before suppression of unstable corners.

A couple of algorithm implementation issues are noted. Since Moravec's [19] method uses a binary rectangular window, Harris and Stephens [10] relate the noisy response of the algorithm to that choice and suggest Gaussian smoothing. Smoothing in this context can be applied either to the input images or to the final output response. Here, smoothing is implemented with σ = 1 and a kernel size of (3 × 3), on the squares of the derivatives:

$$G_{x,y} = e^{-\frac{x^2 + y^2}{2\sigma^2}} \qquad (4.1)$$

$$\begin{pmatrix} 0.0751 & 0.1238 & 0.0751 \\ 0.1238 & 0.2041 & 0.1238 \\ 0.0751 & 0.1238 & 0.0751 \end{pmatrix} \qquad (4.2)$$

In applying the Gaussian smoothing we have used two one-dimensional kernels, (3 × 1) and (1 × 3), which speeds up the smoothing process by 33%:

$$G_x = e^{-\frac{x^2}{2\sigma^2}}, \quad G_y = e^{-\frac{y^2}{2\sigma^2}} \qquad (4.3)$$

$$G_x = \begin{pmatrix} 0.274 & 0.452 & 0.274 \end{pmatrix}, \quad G_y = \begin{pmatrix} 0.274 \\ 0.452 \\ 0.274 \end{pmatrix} \qquad (4.4)$$

Figure 4.2 shows the result of corner detection for a typical sample image. A non-maximal suppression filter with a kernel size of 3 is then applied to the resulting corners.
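The separable smoothing of Equations 4.3-4.4 and the subsequent non-maximal suppression of the corner response can be sketched together as below; this is an illustrative sketch (a real implementation would vectorize the suppression loop), not the thesis code.

```python
import numpy as np

# normalized 1-D Gaussian, sigma = 1, kernel size 3 (Equation 4.4)
g = np.array([0.274, 0.452, 0.274])

def smooth(img):
    """Two 1-D passes, (3x1) then (1x3), equivalent to convolving with the
    3x3 kernel of Equation 4.2; the split is where the 33% speed-up comes from."""
    rows = np.apply_along_axis(lambda r: np.convolve(r, g, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, g, mode="same"), 0, rows)

def suppress(C, threshold):
    """Keep (u, v) only when the corner response meets the threshold and is
    the maximum of its 3x3 neighbourhood (8-way local maximum)."""
    corners = []
    for u in range(1, C.shape[0] - 1):
        for v in range(1, C.shape[1] - 1):
            if C[u, v] >= threshold and C[u, v] == C[u-1:u+2, v-1:v+2].max():
                corners.append((u, v))
    return corners
```

Because the 1-D kernel sums to one, smoothing preserves constant regions; suppression then reduces each corner blob to a single peak response.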
A corner is selected as a nominated corner only if its response is an 8-way local maximum: $C_{u,v}$ is a corner if

$$C_{u,v} = \max(C_{i,j}), \quad i = u-1, u, u+1; \quad j = v-1, v, v+1 \qquad (4.5)$$

The detected corners are then thresholded with a dynamic threshold (starting value of 20). The result after thresholding is shown in Figure 4.3. The value of the corner threshold is one of the most important parameters in this system: it determines the number of corners that the system processes. If there are too few features, the system's answer will not be stable, and even one false match may result in a wrong or very poor motion estimation. On the other hand, if the number of corners is too great, not only does the computation time increase, but the number of mismatches may also rise. Here the probability of false matches originates from the fact that increasing the number of corners decreases the physical distance between them, so two nearby corners can be matched with each other while they are really two different features. This leads to an unreliable solution. Since for each feature only one match is detected, finding the right match is then often not possible. For all these reasons, it is important to have an appropriate threshold value.

Figure 4.3: Thresholded corners with modified non-maximal suppression.

A major limitation of the whole proposed system is that it matches each feature to only a single feature; it does not consider multiple match hypotheses. Once a feature is mapped to an incorrect match, it will never be re-matched with the correct feature. The corners therefore must be spaced far enough apart that most of them are matched correctly at first guess. Our system starts with a predefined threshold of 20, which the corner detector applies to the first image of the current set of three stereo images.
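The dynamic adjustment of this threshold, described next, can be sketched as below. `count_corners` is a hypothetical callback returning the number of detected corners at a given threshold, and the loop assumes (as the text implies) that the count falls as the threshold rises; the band limits mirror the 600-1000 range used by the system.

```python
def adapt_threshold(count_corners, threshold=20.0, low=600, high=1000):
    """Double the threshold when too many corners are found, halve it when
    too few, until the count lands in [low, high]."""
    while True:
        n = count_corners(threshold)
        if n > high:
            threshold *= 2.0
        elif n < low:
            threshold /= 2.0
        else:
            return threshold
```

For example, with a synthetic detector whose count is inversely proportional to the threshold, `lambda t: int(40000 / t)`, the loop settles after a single doubling.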
If the number of features found after corner detection is less than 600 or greater than 1000, the threshold is adjusted accordingly and corner detection is repeated for the first image. In our system, if there are too many corners we double the threshold, so that fewer features are chosen; if there are too few, the threshold is divided by two, yielding more corners. This checking process continues until between 600 and 1000 corners are detected. The examination might appear time consuming, but as long as the brightness of the scene is constant or changes only slightly, the system does not have to refine this threshold repeatedly. It is also necessary to inspect the result of the corner detector because, if there is a sudden change in the brightness of the scene, the number of corners changes dramatically. Figure 4.4 shows corners in a typical set of three sample images after suppression with thresholding.

Figure 4.4: A triple set of images with suppressed corners.

4.2 Stereo Matching and Match Validation

Once corners are detected in the left, right and top images, the next step is to find the corresponding corners. Since the images are rectified, for each corner in the left image there should be a correspondence along the same row and column in the right and top images respectively. By applying a simple binary search algorithm along the same row, all the candidate corners are found in the right image in a small predefined
A corner is considered stable if and only if there is an acceptable match in the top image within the same disparity range along the vertical direction, as shown in Figure 4.5. Therefore, a search is per-M Valid Invalid Invalid 1 Valid \ —f-\ P5 | _ / \ .< \ -'—, m P3 • \ T \ i j ( \ / \ L Figure 4.5: Match validation using three images. formed in the top image along the same column using a small neighborhood (3 x 3). If there is a corner with a high similarity metric, then the original corner is considered as a stable corner otherwise that corner is rejected. After finding each stable corner in the original image and associating it with a disparity measurement, a sub-pixel accuracy routine is applied to improve upon the 53 disparity measurement. Since the resolution is limited to the pixel size, if each corner is shifted a fraction of a pixel, then each disparity measurement has an offset equal to or less than one pixel. This inaccuracy can cause a large error when computing the depth value and later when computing the motion. In order to obtain an optimal corner match location with sub-pixel accuracy, a second degree curve is fitted to three correlation scores, the best match correlation (Ro) and its left and right neighbors and R+ [8]. With the condition that RQ > R-,R+ the optimal match location is estimated by fitting a sub-sample estimator, which is a simple quadratic estimator. Figure 4.6 describes the quadratic estimator. (R+-R.) Ax = (4.6) 2(2i?o -R--R+) This correction gives a good estimation of the corner location and disparity value, New estimated peak True peak -1 0 +1 Displacement (sample interval) Figure 4.6: Sub-pixel sampling using quadratic estimator. which is used in depth construction. When applying sub-pixel accuracy we noticed that sometimes at the position of a corner, intensity is not locally maximum. Therefore two neighbors of the corner can not be used in fitting a second order curve. 
Such a failure can originate from a simple shift in the position of the corner, or it can be due to inconsistency of the scene or background. Since it is possibly the corner detector that causes a one-pixel shift in the x or y direction, we try to find the local maximum among the position of the corner and its four neighbors (along one axis). If there is a local maximum in that neighborhood, it substitutes for the corner, and the sub-pixel disparity measurement is done based on its two new neighbors. In this way we make sure that fitting the second-order curve converges. If there is no local maximum, that corner is rejected and the system continues with the other corners.

Now the depth information is constructed using the focal length of the camera ($f$), the baseline ($B$) and the disparity measurement ($d$):

$$z = \frac{fB}{d} \qquad (4.7)$$

Using the depth information it is possible to back-project the image coordinates $(u, v)$ and find the world coordinates $(x, y, z)$:

$$x = \frac{zu}{f}, \quad y = \frac{zv}{f} \qquad (4.8)$$

Therefore, at the end of this step we have generated a list of world features from the last processed frame.

4.3 Matching World Points and Feature Points

This section matches the feature points in world space with the 3D features of the current frame (generated by the previous section) and solves for the camera location in the world. For this purpose a Kalman filter (KF) represents each feature point by a mean position and a covariance. Introducing these Kalman filters improves the system's accuracy. As we will see in more detail, these Kalman filters integrate the location information for each feature that is seen and recognized by the camera over a period of time. The major role of the Kalman filter in the matching process is to provide better location estimates for 3D points, and it therefore results in more accurate matches. Figure 4.7 shows KF ellipsoid surfaces (probability densities) for some of the world feature points at an arbitrary time.
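Returning to the depth construction step, Equations 4.7 and 4.8 take the following form in code. This is a sketch only; it assumes $f$ and $d$ are expressed in pixels and $B$ in metres, so that $z$ comes out in metres.

```python
def backproject(u, v, d, f, B):
    """Depth from disparity (Equation 4.7), then back-projection of the
    image coordinates (u, v) into world coordinates (Equation 4.8)."""
    z = f * B / d                    # z = f*B/d
    return z * u / f, z * v / f, z   # x = z*u/f, y = z*v/f
```

For example, with f = 100 px, B = 0.1 m and a 10-pixel disparity, a corner observed at (u, v) = (50, −20) lands at (0.5, −0.2, 1.0) m; halving the disparity would double the depth.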
Figure 4.7: Kalman filter ellipsoid surfaces for real points.

For each world point feature, we search a space that includes its Kalman filter ellipse and find the match candidates among the 3D features of the current frame. Then, by assessing each candidate's gray-level attributes using a normalized mean squared difference function with a kernel size of 5 (Equation 3.20), we choose the best match candidate. Figure 4.8 shows matched features between one image and a subsequent image.

4.4 Motion Estimation

In this stage, the changes in the camera translation and rotation vectors from one image to the next are found. To solve for the camera motion we use Newton's method in conjunction with a least squares minimization over a number of iterations. For each matched feature we use the current motion parameters, $p_{t-1}$, to predict the new location, $p_t$. Therefore, for each world feature there are two candidate projected positions on the image: one from the current motion parameters and a second from the matching process. We then compute a vector of error measurements between the matches found and the predicted match locations. This error measurement lets us find a correction value for each of the parameters in $p = (x, y, z, \phi_x, \phi_y, \phi_z)$. In other words, each point in the world is projected onto the image plane using the current parameter estimates, and then its position error, compared to the found match, is measured. This error is computed for both the u and v components, yielding $E_u$ and $E_v$. Each error can be expressed as the sum of all changes in the error resulting from all the parameter corrections:

$$\frac{\partial u}{\partial x}\Delta x + \frac{\partial u}{\partial y}\Delta y + \frac{\partial u}{\partial z}\Delta z + \frac{\partial u}{\partial \phi_x}\Delta\phi_x + \frac{\partial u}{\partial \phi_y}\Delta\phi_y + \frac{\partial u}{\partial \phi_z}\Delta\phi_z = E_u \qquad (4.9)$$

$$\frac{\partial v}{\partial x}\Delta x + \frac{\partial v}{\partial y}\Delta y + \frac{\partial v}{\partial z}\Delta z + \frac{\partial v}{\partial \phi_x}\Delta\phi_x + \frac{\partial v}{\partial \phi_y}\Delta\phi_y + \frac{\partial v}{\partial \phi_z}\Delta\phi_z = E_v \qquad (4.10)$$

From three match correspondences we can derive six equations and produce a complete linear system which can be solved for all six camera-motion corrections.
Since, in general, the number of matched features exceeds the number of parameters (six), an iterative least squares estimator is implemented to fit the data points. After each iteration the correction shrinks by about one order of magnitude, and usually between 4 and 6 iterations are needed for high accuracy. For higher accuracy in the final estimate, we compute the disparity of each data point with respect to the estimate by calculating the residual of each element. Matches that are too far from the estimated transformation are discarded; we reject those whose error measurements are larger than twice the average residual in the current iteration:

$$\text{Rejected if } \|A_i x - b_i\| > 2\sigma \qquad (4.11)$$

where $\sigma$ is the average residual of the error, computed from

$$\sigma = \frac{\|Ax - b\|}{N} \qquad (4.12)$$

After discarding the extraneous points, the least squares fit is computed again. This process is repeated until the improvement in each motion parameter is less than 1% of the total estimate for that parameter. It should be mentioned that since our motion estimation is based upon local linearity of the motion, meaning small changes at each frame, outliers cannot be very distant from the correct least squares fit result. Therefore, this process improves the overall result, but it does not have a large effect on each estimate.

4.5 Kalman Filter Updating

After solving the motion equation, a more precise measurement is available for each 3D feature position, because each new motion estimate provides further information about the positions of the points. The Kalman filter's job is now to estimate both each point's most likely 3D location and its positional uncertainty. The Kalman filters can be illustrated by interpreting them as normal probability distributions. Each Kalman filter is initially defined with a small, identical probability distribution in space, Figure 4.9.
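Before moving on, the outlier rejection rule of Equations 4.11-4.12 above can be sketched as follows (illustrative only, using per-row absolute residuals):

```python
import numpy as np

def reject_outliers(A, b, x):
    """Drop rows whose residual exceeds twice the average residual
    (Equations 4.11-4.12); returns the retained system."""
    residuals = np.abs(A @ x - b)                 # per-match residual
    sigma = np.linalg.norm(A @ x - b) / len(b)    # average residual (Eq. 4.12)
    keep = residuals <= 2.0 * sigma
    return A[keep], b[keep]
```

With four consistent measurements and one gross outlier, only the outlier is dropped, after which the least squares fit would be recomputed on the retained rows.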
Figure 4.9: Scaled Kalman filter ellipsoid surfaces at the beginning.

To explain the Kalman filter, a single point is considered, as all points are treated independently and in the same way. Suppose a feature point is observed in the current image plane at position $x_{i-1}$; this is the first Kalman filter measurement. The positional accuracy of each feature is specified by the observation error covariance matrix $V_i$. If the current estimate for the point location (the centroid) is $x_{i-1}$, with accompanying positional accuracy $P_{i-1}$, then the covariance $P_i$ and centroid $x_i$ of the Kalman filter after incorporating the current observation are found from:

$$P_i^{-1} = P_{i-1}^{-1} + A_i^T V_i^{-1} A_i \qquad (4.14)$$

$$x_i = x_{i-1} + k_i (b_i - A_i x_{i-1}) \qquad (4.15)$$

where $k_i = P_i A_i^T V_i^{-1}$ and $A_i$ is the projection matrix. The Kalman filter therefore combines the previous position estimate with the new motion estimate to update the corresponding covariance $P_i$; if a feature point is matched, the uncertainty about its 3D position decreases. If no match is found for that feature, the Kalman filter increases the uncertainty of that feature point; accordingly, a larger space must be searched to find a match. This process is illustrated in Figure 4.10. In implementing the Kalman filter, the observation errors of the corners are assumed to have a relative localization error of 2 pixels.

4.6 Retiring Old and Introducing New Features

World points are retired when they are no longer being used, because they have gone out of the field of view of the camera. New points are added as new parts of the scene come into the camera's view. The accuracy of each point in the world is reduced slightly if it was not matched to any point in the current image, and any world point that drops below a minimal accuracy threshold is deleted from the world point set. This allows points to be slowly retired and removed from the world set.
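This decay-and-purge bookkeeping can be sketched as below. All constants (growth and decay factors, accuracy floor, initial accuracy) are illustrative assumptions, not values from the thesis.

```python
def update_accuracies(points, matched, grow=1.1, decay=0.9, floor=0.05):
    """points maps feature id -> accuracy. Matched points gain accuracy,
    unmatched points lose it, and points below the floor are purged."""
    kept = {}
    for pid, acc in points.items():
        acc = acc * grow if pid in matched else acc * decay
        if acc >= floor:
            kept[pid] = acc
    return kept

# a repeatedly unmatched point fades out; a repeatedly matched one strengthens
pts = {"stable": 0.1, "spurious": 0.1}
for _ in range(10):
    pts = update_accuracies(pts, matched={"stable"})
```

After ten frames the unmatched point has decayed below the floor and been purged, while the matched point's accuracy has grown past its initial value.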
Unmatched points in the current feature list are added to the world points with a very low accuracy. If they are matched in future frames, their accuracy gradually increases to the point where they become some of the strongest feature points; if they fail to match repeatedly, they are purged very quickly. This strategy also takes care of feature points which are incorrectly matched: their Kalman filters will usually not be supported by matches in subsequent frames, and so these erroneous matches are discarded from the system.

Figure 4.10: Updating the Kalman filter.

Chapter 5

Experimental Results

This chapter describes the performance of each system component, as well as of the overall system, based on the theory presented in the previous chapters.

5.1 Hardware Configuration

The purpose of this section is to describe the system setup and the system hardware elements.

5.1.1 System Setup

Figure 5.1 shows the system setup. The Triclops is mounted on a movable arm, which can be placed on any mobile device. The position and orientation of the Triclops at each sample instant are computed and shown on the screen, as in Figure 5.2. The center of the world is taken to be the center of the bottom-left camera at the first sample instant, so all relative changes in position and orientation are estimated with respect to that reference point.

Figure 5.1: System setup.

Figure 5.2: Position of the Triclops at each moment is shown on the screen.

5.1.2 Frame Grabber

Images taken from the cameras are digitized by an analog video interface, the Matrox Meteor. The Matrox Meteor is a high-quality color and monochrome PCI-bus frame grabber that provides real-time image transfer to a host at up to 45 MB/s. It transfers image data in real time to CPU RAM for processing or real-time display.
In this application the Matrox Meteor is used as the Triclops interface with a standard resolution of 640×480 pixels, and is hosted by a Pentium II 266 MHz PC. RS-170 is a standard monochrome composite video signal that contains both timing and image information in a single signal. This monochrome video signal is a 525-line system with a frame rate of 30 frames per second.

5.1.3 Triclops™

The Triclops is a stereo vision system designed and implemented by Point Grey Research [24]. It provides real-time range images using stereo vision technology. It includes a camera module with three monochrome cameras (each a Panasonic BS7258) and a software system. The three cameras are rigidly positioned so that each adjacent pair is horizontally or vertically aligned, which results in more reliable and accurate range images.

Figure 5.3: Triclops Stereo Vision System.

The Triclops software system implements a correlation-based trinocular stereo vision algorithm. It acquires images, removes lens distortions and constructs the depth map of the scene. The main correlation core of the system, written in Pentium MMX assembly language, computes the range images. Although the Triclops software system is capable of creating accurate and reliable depth maps, we have elected not to use it. This is because we use the Triclops for motion estimation, and we would like to allocate as little time as possible to the depth construction task and spend the rest of the time on the motion calculation. Therefore, our software system uses the Triclops to grab rectified images and constructs a sparse depth map for corner points only. Since the number of corners is much lower than the number of pixels in each image, this choice lets us implement a real-time motion tracking system.
5.2 Optimal Parameters

In the initial stages of implementation, a few parameters had to be defined regarding the density of the scene, the features' distance from the camera, the range of motion and the speed of the processor. The following gives a short discussion of some of these parameters and the most suitable value for each one.

Image Size

Images of (640 × 480) pixels are captured and reduced to a resolution of (320 × 240) after rectification. This is because corner detection is an expensive function, and by using smaller images we speed up the process while still retaining good resolution for extracting reliable corners.

Number of Features

As explained in Section 3.3.5, when solving the linear equation for six unknown parameters, at least three valid corresponding points must be found in two sets of images. However, for accurate and reliable results, it is preferable to have an over-constrained problem. Moreover, in match verification (stereo matching) and in tracking identical features across frames, some of the extracted features may be discarded. Therefore, having more features produces more reliable results. Experiments show that in stereo matching 70% of the features fail the validation test. Approximately 30% of the remaining features do not have a good corresponding feature in the following frame, and about 30% of those are later discarded as outliers in the least squares fitting. Therefore, at feature extraction time we adjust the threshold of the Harris corner detector [10] until we end up with between 600 and 1000 corners. The final transformation is then based upon 50 to 80 corner features. With this number of features the least squares algorithm is still very fast and takes less than 0.1% of the overall computation time of the system. Table 5.1 presents the number of features in the different stages of one of the experiments.

Table 5.1: Feature numbers in different stages of the system.
No. of extracted features | No. after stereo validation | No. of trackable features | No. in final transformation
769 | 162 | 107 | 74

Correlation and Search Window Size

After extraction of a sufficient number of features, a correlation matching process is performed for stereo matching. Since each feature is a corner, a small correlation window of (3 × 3) pixels provides enough information for matching with minimal computational cost. The search routine finds all possible corner candidates within a 3 × 100 pixel window during stereo matching. Rather than searching through just one row, the search covers three rows to make up for a possible vertical shift in the position of the corner. Looking for a match along 100 columns takes care of both near and distant objects. With the current camera configuration, a focal length of 3.8 mm and a field of view of 71.5°, the system is capable of depth construction for objects located between 0.284 m and 28.4 m from the camera plane. The correlation and search routines are also used when tracking identical features over time, from frame to frame. The correlation window size in this part is increased to 5 × 5 in order to filter false matches.

Thresholding for Discarding False Matches

After finding the corresponding features in two consecutive frames, an estimate of the motion equation is generated. At each iteration, the residual of the best fit for each pair of matched features, relative to the final least squares fit, is computed. If this value is more than twice the standard deviation of all errors, the corresponding match pair is discarded in the next iteration. Tables 5.2 and 5.3 show the results of motion estimation, including the number of features and the error residuals after each iteration, for a single pair of images in one experiment. The number of iterations is determined by the convergence of the final transformation.
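Returning to the search window: the quoted 0.284 m to 28.4 m working range follows from Equation 4.7, since a disparity search over 1 to 100 pixels bounds the reconstructible depth between f·B/100 and f·B/1. The sketch below illustrates this; the focal length in pixels (derived from the 320-pixel image width and 71.5° field of view) and the baseline value are illustrative assumptions, not figures stated in the thesis.

```python
import math

def depth_range(f_px, baseline_m, d_min=1, d_max=100):
    """Nearest and farthest reconstructible depth (z = f*B/d, Equation 4.7)
    for a disparity search window of [d_min, d_max] pixels."""
    return f_px * baseline_m / d_max, f_px * baseline_m / d_min

# focal length in pixels from a 320-pixel-wide image and a 71.5 deg field of view
f_px = 160.0 / math.tan(math.radians(71.5 / 2.0))
near, far = depth_range(f_px, baseline_m=0.128)   # baseline assumed, not from thesis
```

The ratio of far to near depth equals the ratio of maximum to minimum disparity, so a 100-column search spans exactly two orders of magnitude in depth.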
Whenever the change in the 6 motion parameters is less than 1% of the final estimated parameters, the iterative process stops.

Table 5.2: Iteration results for a single pair of images.

Iteration No. | No. of Features | No. of Outliers | Error Residual σEu | Error Residual σEv
1 | 99 | 0  | 2.140 | 1.791
2 | 90 | 9  | 0.677 | 0.557
3 | 80 | 10 | 0.416 | 0.313
4 | 74 | 12 | 0.337 | 0.279
5 | 74 | 0  | 0.333 | 0.278

Table 5.3: Iteration results for a single pair of images (parameter changes).

Iteration No. | Δx | Δy | Δz | Δφx | Δφy | Δφz
1 | -0.002670 |  0.000006 | -0.028513 | -0.000691 |  0.000372 |  0.000997
2 |  0.001314 |  0.001592 | -0.006051 |  0.000398 | -0.000251 | -0.000307
3 |  0.000504 | -0.000857 | -0.002445 | -0.000214 | -0.000020 | -0.000305
4 | -0.000471 |  0.000793 | -0.001174 |  0.000251 |  0.000165 |  0.000382
5 |  0.000001 | -0.000002 |  0.000001 |  0.000000 |  0.000000 |  0.000000

5.3 Results

5.3.1 World Point Construction

Since the motion estimation in this work relies upon tracking world points which are constructed using a stereo algorithm, it is important to ensure that the construction of world points is accurate. Due to the lack of a good setup for calibration, we created a simple scene with two planes at different depths, resulting in 12 sharp corners. The depths of these corners were measured both with measuring tools and with the stereo system. Figure 5.4 shows the corners in the trinocular images; corresponding features are labelled with identical numbers. Table 5.4 shows the results of depth construction and compares them with the real measured distances. As shown in this table, the use of sub-pixel interpolation increases the accuracy of the depth construction, and subsequently of the total motion estimation, by 0.33%.

5.4 Motion Estimation using Two Frames

The system detects all corners and uses the stereo algorithm to initialize the world coordinate representation of each feature point, after which it estimates the corresponding motion and continues by refining the corresponding Kalman filters. Figure 5.5 shows corner detection results after stereo validation.
Figure 5.6 shows the identical features, related to their correspondences, in two consecutive image frames. A few corners with false and correct matches are marked in this image. Table 5.5 compares the motion estimated from two single frames with the real motion measurements.

Figure 5.6: Corresponding features are related and tracked in two consecutive images.

[Table 5.4: Depth construction comparison with the real measured values. The table body is not recoverable from the scanned original.]

Table 5.5: Estimated and actual motion between successive image sets.

    Motion        x [m]      y [m]      z [m]      φx [rad]   φy [rad]   φz [rad]
    Actual 1     -0.04000    0.00000    0.00000    0.00000    0.00000    0.00000
    Estimated 1  -0.03810   -0.00130    0.00150   -0.00090   -0.00020    0.00020
    Actual 2      0.00000    0.02000    0.00000    0.00000    0.00000    0.00000
    Estimated 2  -0.00012    0.02116   -0.00050   -0.00124   -0.00079   -0.00106
    Actual 3      0.02000    0.02000    0.00000    0.00000    0.00000    0.00000
    Estimated 3   0.02121    0.02018   -0.00191    0.00015   -0.00088    0.00021
    Actual 4      0.00000    0.06000    0.00000    0.00000    0.00000    0.01745
    Estimated 4  -0.00641    0.05199    0.00304    0.00069    0.00233   -0.01428

5.4.1 Accumulated Error

Sub-pixel accuracy and Kalman filtering are features that help to reduce error in motion estimation. Figure 5.7 (a, b and c) shows the accumulated error after processing 85 frame samples with a stationary Triclops. Figure 5.7-a shows this error in the depth estimate of the transformation, Figure 5.7-b shows the error along the Triclops camera plane, and Figure 5.7-c shows the error in the rotational angle about the viewing axis. As expected, this error is higher for the depth variable than for the others, due to the limited accuracy of depth construction. The error graphs for the three other parameters are not shown since they are similar to Figures 5.7-b and 5.7-c. After processing 87 subsequent images with the system stationary, the error in estimation, ideally 0 in every parameter, is 0.8 mm in depth construction and 0.4 mm along the image plane. The error in the angle parameters is also very small; as shown in the third graph it is just 0.00045 radian.

5.4.2 Motion Estimation Results from a Number of Images

To show the performance of the system, a motion estimation experiment was carried out along a known path by processing 87 image frames. Motion is determined while Triclops moves from a starting point A toward an ending point H. At the end of the path, the camera reverses direction and moves back toward the starting point. In order to compare the estimated motion at different positions with the real values, a number of reference points are considered.
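Comparing the estimated pose at each reference point with the ground truth requires chaining the per-frame motion estimates, such as the "Estimated" rows of Table 5.5, into a cumulative transform. A minimal sketch follows; the thesis does not fix an Euler-angle convention in this section, so the Rz·Ry·Rx ordering here is an assumption.

```python
import numpy as np

def transform(x, y, z, rx, ry, rz):
    """Homogeneous 4x4 transform from the six motion parameters.
    The Euler convention (Rz @ Ry @ Rx) is an assumed choice."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = [x, y, z]
    return T

# chain the first two estimated per-frame motions of Table 5.5 into a pose
pose = np.eye(4)
for step in [(-0.03810, -0.00130, 0.00150, -0.00090, -0.00020, 0.00020),
             (-0.00012, 0.02116, -0.00050, -0.00124, -0.00079, -0.00106)]:
    pose = pose @ transform(*step)
```

Because each per-frame estimate carries a small error, this composition is also where the accumulated drift of Section 5.4.1 builds up.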
The positions of these reference points are measured in the world. By using these measurements we can observe and analyze the accuracy of the estimation.

Figure 5.7: Accumulated error for different motion variables, with sub-pixel accuracy on and off. [Panels (a), (b) and (c) plot the error over 90 frames; the curves are not recoverable from the scanned original.]

This experiment tests just three of the parameters, due to the lack of a setup with 6 degrees of freedom. Other experiments have been performed to show the accuracy of the system for the remaining parameters. The parameters tested in this experiment are the depth z, x, and the depth axis orientation φy. This rotation parameter seems to be the best angle to test, since depth change has more impact on it than on the others: if Triclops rotates around the z axis, scene features rotate only in one plane, and therefore depth change and the system's depth estimation do not affect the final estimation. Figures 5.8 and 5.9 show sample images captured at the reference points while the system was moving from the starting point toward the end point. Figure 5.10 shows the real path and the estimated path found by the system. Figures 5.11, 5.12 and 5.13 show the same results for the reversed motion from reference point H to A. A study of the results shown in Figure 5.10 shows that the depth construction has 15% error at point B. This error is reduced to 10% when the system moves to position C. While moving toward points D, E and F, the z estimation error increases to 18%. This behavior can be explained by studying the experiment's environment. As can be seen in Figure 5.8-a, at the beginning of the experiment most of the objects within the scene are located far from the camera system. This is the main source of the inaccuracy in the z parameter.
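The dominance of distant objects explains the z error quantitatively: triangulated depth is z = f·b/d, so a fixed disparity uncertainty maps to a depth uncertainty that grows with the square of the distance. A first-order sketch follows; the baseline, pixel focal length, and disparity noise below are made-up values, since this section gives only the 3.8mm focal length and the 0.284m to 28.4m working range.

```python
def depth_std(z, focal_px, baseline_m, disparity_std_px):
    """First-order propagation of disparity noise into depth:
    z = f*b/d  implies  dz = (z**2 / (f*b)) * dd,
    i.e. depth uncertainty grows quadratically with distance."""
    return z * z * disparity_std_px / (focal_px * baseline_m)

# hypothetical numbers: 400 px focal length, 10 cm baseline, 0.5 px noise
near = depth_std(1.0, 400.0, 0.10, 0.5)    # point 1 m away
far = depth_std(20.0, 400.0, 0.10, 0.5)    # point 20 m away
```

A point 20 times farther away thus has a depth estimate roughly 400 times noisier, which is why scenes dominated by distant objects degrade the z parameter most.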
Distant features have particularly poor depth information, since their depth is obtained from stereo. This large error in depth effectively displaces these points and can cause error in localization of the mobile system as it tries to solve the least squares fit using data that has a very systematic type of error in depth. As the system moves closer to objects, the depth construction becomes more accurate. As soon as the system starts its rotation, the error in the z parameter of the estimation increases dramatically, and continues to do so until the end of the curve. This reaction can be clarified by noticing that farther points show behavior under rotations that is very similar to a forward translation. This causes the least squares fit to find a solution with a translation or rotation that is much larger than any actual movement. In other words, there is effectively a coupling between rotational and translational motion, due to these points.

Figure 5.8: Sample images taken at reference points A, B, C and D, moving toward end point H.

Figure 5.9: Sample images taken at reference points E, F, G and H, moving toward end point H.

Figure 5.10: Comparison of the actual path with the estimated, path A to H. [The plot shows the map of the path between point A and point H, the system estimated motion, and the depth axis orientation.]

[Figures 5.11 and 5.12: captions partly unrecoverable from the scan; sample images taken while moving toward end point A.]

Figure 5.13: Comparison of the actual path with the estimated, path H to A. [The plot shows the system estimated motion and the depth axis orientation from point H to point A.]

By looking at the reverse experiment results, one can see that the same effect happens at the beginning of the curve: rotation from point F toward point C introduces the same error. Since the direction of the rotation is opposite this time, distant points move backward instead.
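The rotation-translation coupling for distant points can be checked numerically: a small camera rotation about the vertical axis shifts a centred point's image coordinate by about f·tan(θ) regardless of depth, while a lateral translation t shifts it by f·t/z. For a far point the two displacement fields are nearly indistinguishable, so the least squares fit can trade one for the other. The focal length below is a made-up value, used only for illustration.

```python
import math

f = 800.0        # assumed focal length in pixels (hypothetical)
z = 30.0         # depth of a distant scene point, in metres
theta = 0.01     # small rotation about the camera's vertical axis (rad)

# image shift of a centred point caused by the rotation itself
du_rotation = f * math.tan(theta)
# the same shift produced instead by a lateral translation of theta * z
du_translation = f * (theta * z) / z
```

The two displacements differ only by the tan(θ) ≈ θ approximation error, a small fraction of a pixel here, which is exactly the ambiguity described above.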
Therefore this error can compensate for a part of the accumulated error in z and x.

Chapter 6

Conclusions and Future Work

This thesis has described a feature-based 3D motion tracking system for the control of a mobile robot equipped with a camera system. The system cuts computational cost effectively by processing only informative features of the scene. Also, using a stereo algorithm to construct world features has enabled us to perform 3D motion estimation using 2D images. Sub-pixel interpolation and Kalman filtering have improved the accuracy of the system compared to other visual tracking systems. In this chapter the proposed system is evaluated against the specifications outlined in Section 3.2, and several improvements that could be made to the system are suggested.

6.1 Specification Evaluation

6.1.1 Continuous Sensing

Based on this specification, the system should not require the robot to stop in order to carry out the motion estimation; this is clearly satisfied. The only case in which the system will lose continuous sensing is when the scene is textureless. In this case the system would repeatedly adjust the corner threshold, halting further image sensing.

6.1.2 Real-Time Performance

This specification is not completely met. Although the performance of the system is no faster than 1.1 seconds per motion estimation, the system is capable of finding the correct motion provided the motion is not very fast. It is fairly clear that additional optimization of the system could satisfy the real-time specification.

6.1.3 No Odometry

This requirement is satisfied. The system does not inherently rely on any odometry. However, odometry could be used to make the system more robust.

6.1.4 Minimal Cost and High Accuracy

The system requires a Pentium processor without any special requirements. This specification has been met, but it is obviously a trade-off with the real-time performance specification.
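The textureless-scene failure mode of Section 6.1.1 comes from a threshold-adjustment loop like the following sketch. The function name, the halving schedule, and the constants are assumptions; the point is that on a scene with no corner response the loop bottoms out without ever finding enough corners, which is when continuous sensing is lost.

```python
def adapt_threshold(strengths, wanted=100, threshold=1000.0, floor=1.0):
    """Halve the corner-strength threshold until enough corners pass.
    On a textureless scene the loop reaches `floor` without success,
    the failure case described in Section 6.1.1. Constants are assumed."""
    while threshold > floor:
        corners = [s for s in strengths if s >= threshold]
        if len(corners) >= wanted:
            return threshold, corners
        threshold /= 2.0
    return floor, [s for s in strengths if s >= floor]
```

A scene with a spread of corner strengths settles on a usable threshold after a few halvings; an all-zero (textureless) response returns the floor with an empty corner list.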
A faster processor could be used to satisfy the real-time specification at increased expense. Results obtained from real images with various transformations showed that the system performs well, with less than one pixel accuracy.

6.2 Future Work

This section discusses the scope for future work and extensions to this project.

6.2.1 Accurate Testing Platform

There is a need for an accurate testing platform with six degrees of freedom to test and evaluate this type of system. During the implementation it was difficult to compare the system's results with the actual motion, which was not accurately known.

6.2.2 Decreasing the Accumulative Error

An inherent characteristic of the system is the estimation error which accumulates over time and distance. Although the estimated transformation between each pair of frames is reliable, over a long period of time the system must be able to correct this error and reset itself. This can be performed by using information obtained from an absolute positioning system such as GPS. Combining a down-looking camera with the current system could help correct the accumulated error in the z direction and therefore may result in higher accuracy. Integrating other sensors, such as odometers, could help to find the motion between frames, so that the system would perform more accurate estimation during fast motion.

6.2.3 Speeding up the Performance

Since corner detection is an extremely time-consuming process, its optimization can help a great deal in increasing performance. In this routine, Gaussian smoothing can be accelerated by choosing σ equal to 0.8. By using this value for σ, each three multiplications and one division can be substituted by two shifts. During the implementation, experiments were performed using only two images of each set of trinocular images. This proved that the performance of the system does not change and the system still retains high accuracy in depth construction.
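The σ = 0.8 trick works because a sampled Gaussian with σ ≈ 0.8 has normalized 3-tap weights close to [0.25, 0.5, 0.25], which are exact powers of two. Smoothing then needs only adds and bit shifts, no multiplications or divisions. A minimal integer sketch of one row pass:

```python
def smooth_row(row):
    """Approximate Gaussian smoothing (sigma ~= 0.8) of one image row with
    the binomial kernel [1, 2, 1] / 4. The normalized sigma = 0.8 Gaussian
    taps are about [0.24, 0.52, 0.24], so the power-of-two kernel is a
    close fit, and (a + 2b + c) / 4 becomes adds plus two bit shifts."""
    out = list(row)
    for i in range(1, len(row) - 1):
        out[i] = (row[i - 1] + (row[i] << 1) + row[i + 1]) >> 2
    return out
```

A separable 2D smoothing would apply the same pass to rows and then to columns; the border pixels are simply copied here for brevity.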
This reduces the corner detection time by 30%. Since correlation is the primary function for evaluating the similarity between features, in both the depth construction and motion tracking tasks, its optimization can also accelerate the system's running time. This optimization can be achieved by implementing the correlation routine in Pentium MMX assembly language instructions. Corner detection, as one of the most time-consuming functions, can also be sped up using MMX assembly instructions.

6.2.4 System Applications

Although the main objective of this system is to control a mobile excavator visually, it could be used in many other applications, such as a visual guide for the blind, automotive auto-pilots, and systems for tracking objects in remote or dangerous environments.
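The correlation routine targeted for MMX in Section 6.2.3 reduces to a sum-of-absolute-differences inner loop over a small window, which packed integer instructions can process several pixels at a time. The scalar Python sketch below shows what that inner loop computes; it is an illustration of the operation, not the thesis implementation, and the helper names are invented.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-size patches: the
    inner loop that MMX packed-arithmetic instructions would accelerate."""
    return sum(abs(x - y) for x, y in zip(a, b))

def best_match(patch, candidates):
    """Pick the candidate patch with the lowest SAD score, as in the
    stereo and tracking searches along the 3 x 100 pixel window."""
    scores = [sad(patch, c) for c in candidates]
    return scores.index(min(scores))
```

In the SIMD version, each `abs(x - y)` over eight packed bytes becomes a single instruction, which is where the speed-up comes from.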
Real-time 3D motion tracking for a mobile robot Saeedi, Parvaneh 1998
Title: Real-time 3D motion tracking for a mobile robot
Creator: Saeedi, Parvaneh
Date Issued: 1998
DOI: 10.14288/1.0064822
URI: http://hdl.handle.net/2429/10024
Degree: Master of Applied Science - MASc
Program: Electrical and Computer Engineering
Affiliation: Faculty of Applied Science, Department of Electrical and Computer Engineering
Degree Grantor: University of British Columbia
Graduation Date: 1999-05