Vision-Based 3D Motion Tracking in Natural Environments

by

Parvaneh Saeedi

M.Sc., University of British Columbia, Vancouver, Canada, 1998
B.Sc., Iran University of Science and Technology, Tehran, Iran, 1991

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Department of Electrical and Computer Engineering)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
March 2004
© Parvaneh Saeedi, 2004

Abstract

Remotely controlled mobile robots have been a subject of interest for many years. They have a wide range of applications in science and in industries such as aerospace, marine, forestry, construction and mining. A key requirement of such control is the full and precise knowledge of the location and motion of the mobile robot at each moment of time.

This thesis presents a vision-based location tracking system suitable for autonomous vehicle navigation and guidance in unknown environments. The system includes a trinocular vision head that can be mounted anywhere on a navigating robot. Consecutive sets of triplet 2D images are used to determine the 3D location and the 3D motion parameters of the robot at each frame. By selecting only the most informative points of each image, using a feature detection algorithm, faster performance is achieved. The use of three cameras improves the accuracy and the robustness of the system. By tracking the 3D location of world features within a multi-stage tracking approach, the location of the observer camera is estimated and tracked over time. The motion with 6 DoF is found via a least squares minimization method. A Kalman filter implementation is used to optimize the 3D representation of scene features in order to improve the accuracy of the overall system.

The system introduces several novel contributions for vision-based trajectory tracking. Firstly, it presents a new binary corner detector that can automatically detect the most informative and reliable scene feature points. This feature detector performs 60% faster than the most common method, the Harris corner detector, used in vision-based tracking applications. Secondly, it compensates for an inherent source of inaccuracy in the camera geometry. By identifying and matching features in raw images, and then unwarping the matched data, accurate 3D world feature reconstruction is achieved. Thirdly, through a two-stage search and tracking design, similar 3D world feature points are identified and tracked over time. This design, by itself, improves the accuracy of the estimated motions at each time by up to 10%. Finally, it improves the accuracy of the 3D trajectory tracking system with 6 degrees of freedom in natural environments.

Results of the application of our method to the trajectory tracking problem in natural and unknown environments are reported to validate our approach, hypotheses and methods. The cumulative translational and rotational errors of the system are less than 1% for the studied examples.
Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments

1 Introduction
  1.1 Motivation
  1.2 Major Contributions
  1.3 Thesis Outline

2 Visual Trajectory Tracking
  2.1 Landmark-Based Methods
    2.1.1 Navigation using predesigned landmarks
    2.1.2 Navigation using natural landmarks
  2.2 Natural Feature-Based Methods
    2.2.1 2D estimation with single cameras
    2.2.2 3D estimation using single cameras
    2.2.3 3D estimation using multiple cameras
  2.3 Thesis Objective

3 Feature Detection
  3.1 Features
  3.2 Corner Detection Background
  3.3 Harris Corner Detector
  3.4 SUSAN Corner Detector
  3.5 Binary Corner Detector
  3.6 Evaluation Metric: Repeatability Rate
  3.7 Repeatability Results
    3.7.1 Camera noise
    3.7.2 Rotation
    3.7.3 Scale
    3.7.4 Illumination
      3.7.4.1 Uniform change
      3.7.4.2 Directional change in lighting
  3.8 Discussion
  3.9 Computational Time
  3.10 Chapter Summary

4 3D World Reconstruction
  4.1 Camera Model
    4.1.1 Camera calibration
    4.1.2 Image unwarping
  4.2 Stereo Vision Geometry
    4.2.1 Epipolar constraint
      4.2.1.1 Rectification procedure
    4.2.2 Stereo correspondence matching process
      4.2.2.1 Multiple baseline
      4.2.2.2 Matching through correlation with validation
    4.2.3 Stereo correspondence matching rules
    4.2.4 Depth construction inaccuracy sources
    4.2.5 Depth construction with higher accuracy
  4.3 Chapter Summary

5 Tracking and Motion Estimation
  5.1 Similarity Measurement
  5.2 Feature Tracking
  5.3 Motion Estimation
    5.3.1 Least-squares minimization
    5.3.2 Setting up equations
    5.3.3 Implementation consideration
    5.3.4 Motion estimation results
    5.3.5 Feature flow results
  5.4 Dynamic Features and Estimation Reliability
  5.5 Feature Update
  5.6 Feature Retirement
  5.7 Chapter Summary

6 Trajectory Error Modeling
  6.1 Global 3D Feature Creation
  6.2 Camera Position Uncertainty
    6.2.1 Prediction
    6.2.2 Measurement
    6.2.3 Update
  6.3 Feature Position Uncertainty
  6.4 Feature Update
    6.4.1 Feature covariance update
      6.4.1.1 Roll transformation
      6.4.1.2 Pitch transformation
      6.4.1.3 Yaw transformation
      6.4.1.4 Translational transformation
    6.4.2 Feature position update
  6.5 Experimental Results
  6.6 Chapter Summary

7 Implementation Results
  7.1 Camera System: Digiclops™
  7.2 Trajectory Estimation
    7.2.1 Experiment 1
    7.2.2 Experiment 2
    7.2.3 Experiment 3
    7.2.4 Experiment 4
  7.3 Trinocular and Binocular Stereo Comparison
  7.4 Trajectory Estimation Refinement by Kalman Filtering
  7.5 Computation Time
  7.6 Chapter Summary

8 Conclusions and Future Work
  8.1 Major Contributions
    8.1.1 Fast feature detection
    8.1.2 3D feature reconstruction with minimal positional error
    8.1.3 Multi-stage feature tracking
  8.2 Future Research
    8.2.1 Combined wide and narrow views
    8.2.2 Orthogonal views
    8.2.3 Hierarchical structure
    8.2.4 Sub-image processing
    8.2.5 Cumulative error reduction
    8.2.6 Specialized hardware

Bibliography

Appendix

A Location Determination Problem by Triangulation
  A.1 Triangulation Problem
  A.2 The 3D Location Determination Problem

List of Tables

3.1 Comparison of the time complexity of the three methods
4.1 8.0 cm motion estimation using unwarped and warped images
5.1 The partial derivatives table
5.2 Iteration results along with the error residual, in pixels, for one motion estimation
5.3 Feature numbers and comparison of the matching accuracy of the tracking process at different stages of the system for two individual experiments
5.4 Comparison of the number of dynamic features versus estimation accuracy
7.1 Localization error for Experiment 7.2.1 (using completely calibrated images)
7.2 System drift for Experiment 7.2.2 (using unwarped images)
7.3 3D drift in the estimated camera trajectory for a 6 meter long translation
7.4 3D drift in the estimated camera trajectory for a 3 meter long path
7.5 Estimated 3D drift for the camera motion with 360° rotation on a circular path with a diameter of 60 cm
7.6 Comparison of the estimated cumulative error for trinocular and binocular stereos
7.7 Comparison of the reduction of 3D drift for a 6 meter long path using Kalman filter
7.8 CPU time spent on various functions of the trajectory tracking system

List of Figures

1.1 Visual motion tracking system overview
2.1 Landmark based triangulation
2.2 (a) Top views of the landmark and projected images. (b) Estimation geometry
3.1 The SUSAN corner detector
3.2 SUSAN corner detector uses the local area information to identify corners
3.3 The block diagram of the Binary Corner Detector
3.4 An image and its binary sign of Laplacian image
3.5 Conditions in (3.15) and (3.18) do not (by themselves) provide enough stability
3.6 Illustration of different cases for a corner
3.7 Repeatability results for camera noise for neighborhoods of 1.5 (left) and 0.5 (right) pixels
3.8 Repeatability rate under rotation for neighborhoods of 1.5 (left) and 0.5 (right) pixels
3.9 Images of a scene under different rotations
3.10 Repeatability rate for small rotations in neighborhoods of 1.5 (left) and 0.5 (right) pixels
3.11 Repeatability rate for different scales in neighborhoods of 1.5 (left) and 0.5 (right) pixels
3.12 Images of a scene using different scales
3.13 Images of a scene with linear illumination change
3.14 Repeatability rate under uniform illumination change for neighborhood radii of 1.5 (left) and 0.5 (right) pixels
3.15 Images of a scene under changing illumination
3.16 Repeatability rate for changing illumination for neighborhoods of 1.5 (left) and 0.5 (right) pixels
4.1 An ideal pinhole camera produces a perspective projection model of the world
4.2 Camera geometry with perspective projection and lens distortion
4.3 (a) A warped image. (b) The corresponding cut unwarped image
4.4 Depth reconstruction using stereo geometry
4.5 Epipolar constraint
4.6 Rectification of two images
4.7 Camera positions for multiple-baseline stereo
4.8 Consistent vs inconsistent matches can be identified by validity check
4.9 Radial lens distortion removal can change the shape of objects of smaller size. (a) A warped (raw) image. (b) The corresponding unwarped image
4.10 The raw image (1). The raw image after the calibration (2). The calibrated image after the interpolation (3). The final cut unwarped image (4)
4.11 Warped (a-1, b-1, c-1) and unwarped (a-2, b-2, c-2) images of a plane at three different positions of the image
4.12 Some important features are eliminated by the unwarping process
5.1 Graphical overview of the validity check in the feature tracking process. The valid match candidates are shown with the solid line
5.2 Validity checking reduces the number of false match correspondences. In this figure feature points are shown as white dots. Matched features in the first frame with no validity check are shown in circles (a). Matched features in the second frame are shown with arrows that connect their previous positions to their current positions (b). Matched features in the first frame with validity check are shown in circles (c). Matched features with validity check in the second frame are shown with arrows that connect their previous positions to their current positions (d)
5.3 The tracking is performed in two steps
5.4 Dynamic features can distract the estimated motion
6.1 Camera position Kalman filtering model
6.2 Positional uncertainties associated with features in the image plane
6.3 (a) Projected relative positional variances of features with distances up to 3.5 meters from the camera on the XZ plane. (b) Top view of positional uncertainties for all features
6.4 Original and updated positional uncertainties of world features
7.1 Digiclops stereo camera system
7.2 The indoor scene from the camera's point of view
7.3 The graphical presentation of the 3D camera motion while moving from point A to point B, described in Experiment 7.2.1, using Visual Toolkit 4.0. In this figure the position is shown with a red sphere and the orientation with a yellow cone
7.4 3D position and orientation of the mobile camera in Experiment 7.2.2 using VTK 4.0
7.5 The outdoor scene for Experiment 7.2.3
7.6 Graphical 3D representation of the motion using VTK 4.0
7.7 The graphic representation of the traced path in Experiment 7.2.3 using VTK 4.0
7.8 (a) The outdoor scene used for the rotational motion. (b) An up-close view of the same scene from the robot's point of view
7.9 A number of images from the sequence that is used in Experiment 7.2.4
7.10 Graphical presentation of the 3D estimated camera motion on a circular path with a yaw rotation of 360° using VTK 4.0
7.11 A closer view of the estimated circular path with a radius of 30 cm that is traversed by the camera system in Experiment 7.2.4
7.12 Selected number of images viewed while traveling on a path. The motion of an object is highlighted using the yellow circle
7.13 Estimated traveled distance along x [cm] for a 1 m closed loop
7.14 Estimated camera distances relative to a starting point for a 6 meter long path with Kalman filter
7.15 Estimated camera orientations for a 6 meter long path with Kalman filter on
7.16 Estimated camera trajectories relative to a starting point for a 6 meter long path without Kalman filter
7.17 Estimated camera orientation for a 6 meter long path with no Kalman filtering
7.18 System's modular flowchart for the time analysis
7.19 Comparison of the running time versus search window dimensions for the tracking module
8.1 A second set of cameras overcomes the coupling effect between translation and rotation
8.2 Processing small image patches can reduce the processing time
A.1 The intersection of the two circles using two pairs of viewed landmarks provides the observer location
A.2 The locations of three landmarks along three image projectors must match the distances between the same points

Acknowledgments

My graduate studies at UBC can be best described by the words of John Lennon: "Life is what happens to you while you're busy making other plans." Starting with the idea of getting a Ph.D., I formed ties with people that have and will influence my life forever.

My thanks must go to my supervisor, Professor Peter Lawrence, for his patience and guidance over all these years. He has been very supportive and a continuous source of encouragement during my PhD. My thanks also go to my co-supervisor, Professor David Lowe, for introducing me to computer vision. I would have floundered long ago were it not for his genuine interest in my work and his willingness to answer my constant questions. I would also like to thank the committee members, university and external examiners for their invaluable time and advice in the improvement of this thesis. Special thanks also go to Don Murray and Simon Bachman for their technical help through my entire work. I would like to thank Danny French for his priceless help with outdoor experiments. Finally, I would like to thank the ECE office administration members, IT staff, and workshop technicians.

I would like to express my deep gratitude to my parents and my sister for their love and trust, and for encouraging me to pursue my interests. My sincerest thanks go to my good friends for listening to my thoughts during the best and worst periods of my graduate studies. A special thanks extends to the IRIS Network of Centers of Excellence for their financial support.

Parvaneh Saeedi
The University of British Columbia
March 2004

Chapter 1

Introduction

Vision, as one of our most important senses, provides us with a vast amount of information. Fundamentally, our vision system helps us to identify structures in an environment and to locate ourselves within it. The inherent difficulty in visually determining observer and object locations in an environment originates from the fact that optical projection is a destructive transformation. This is mainly because the projected images on the retina (the camera's image plane) are purely 2D and contain very little 3D information.
Construction of 3D information from 2D images, therefore, is an ill-posed problem necessitating additional assumptions and constraints. The extraordinary proficiency of the human visual system is a sign that it is possible to solve the inverse projection problem for machine vision. Computer vision, the inverse process of image formation, attempts to construct meaningful and explicit descriptions of the world from images. The principal aim of computer vision is to take an image or sequence of images of a scene and process them in such a way as to infer specific information about the scene. In this sense, vision is a very challenging problem because it is full of uncertainties. Computer vision is becoming increasingly important as an essential technology for a broad range of applications, such as industrial automation and inspection [11,37], medical diagnosis [43,68], human-computer interaction [20,102], virtual reality [19,72], and autonomous navigation [83,87,101].

One of the principal subjects in computer vision is the visual tracking problem. Visual tracking relates to the problem of following objects, markers or paths for many purposes, such as the behavioral study of specific species in science [26,109], or automation in industrial applications [73]. In this group of tracking applications the aim is to constantly keep a target in the field of view, or to move toward a landmark where specific aspects of it can be observed. Often a sequence of landmark targets serves as a route that the robot can traverse for this group of applications. Here, the goal is the tracking task itself rather than estimating the motion vector.

Another group of applications where visual tracking is used comprises trajectory and motion estimation for mobile robots from image sequences acquired with an on-board camera system. Usually, the objective of this group is localization for the purpose of autonomous navigation. Autonomous navigation is required for a wide range of applications, spanning from scientific research and industrial automation to entertainment and military purposes [6,7,50,79,81]. The localization task here is performed first by detection and identification of static scene structures and objects. The positions of these structures are either estimated or provided as system inputs. The relative or absolute position of the observer system is then estimated, either by measuring the relative changes in the position of these structures from one frame to another, or by a geometric approximation model.

Different assumptions and constraints made during visual tracking provide the bases for different approaches. Some of the assumptions that have been used are these:

• The scene is constructed mostly from static objects, and if it does undergo any changes, appropriate adjustments are made to accommodate the new changes.
• The degrees of freedom of the motion are often limited to a plane.
• Multiple sensors may be required.
• Previous methods have often assumed that the environment is known in advance, or at least that some predefined landmarks with specific characteristics at known positions in the scene are identified.
• When known landmarks are used, a number of them must be observable at each time during the system operation.
Each one of these constraints and hypotheses could reduce the applicability domain and the generality of the associated tracking system. While some of these assumptions might be valid for indoor and small to medium size environments, they are not practical for natural terrains or unknown environments.

Perhaps the most restrictive of all these assumptions is the assumption of prior knowledge of the environment or objects within it. Existing methods in trajectory and motion tracking can be partitioned according to the amount of prior information about the environment and the model they attempt to capture. In methods based on prior information, a topological map of the environment is facilitated by placing landmarks within it and including the expected visibility region of each. The positions of these landmarks are either provided to the system or are found during an initialization process. Most commonly, these landmarks are artificial and are designed with the intention of achieving fast, accurate and unique recognition. The position of the observer robot at each time is found by positional triangulation of the observed landmarks. The most important requirement for systems based on landmarks is the observation of two or more landmarks at each time.

In the absence of any prior information, vision-based systems either recover the full geometric structure of the scene or attempt to reconstruct a partial representation of the scene through the use of salient points or other naturally occurring features of the scene. The 3D locations of features or scene structures are computed through the use of stereo cameras, laser range finders, multiple cameras with known geometry, or monocular cameras. Generally, the focus of these techniques is on the exact localization and recovery of the relative pose of the robot with respect to its environment. This is achieved by finding the optimal transformation that can project instances of identical features from one instant of time to another. Ideally, such systems allow arbitrary camera motion and only require most of the scene to be static.

1.1 Motivation

Existing operational trajectory tracking systems are mostly based on fusing a combination of sensors, such as laser range finders, sonar sensors, GPS, prior maps, artificial landmarks and wheel encoders. Each of these sensor modalities has several shortcomings that make it insufficient by itself. For example, not only can laser range finders be expensive and slow, but they are also very sensitive to outdoor elements such as dust, rain and snow. GPS systems have the problem of obstruction of the line-of-sight to satellites, and they may suffer from active jamming by other radio frequency sources. Sonar sensors suffer from specular reflections on smooth surfaces. Employment of artificial landmarks for large-scale outdoor environments is also often hard if not impossible, and the maintenance costs can be high. Moreover, acquiring prior information about an unexplored or dangerous environment is not possible. Dead reckoning sensors may generate a great deal of uncertainty as they are mostly designed for smooth and man-made surfaces, and therefore their readings for rough and uneven outdoor surfaces are unreliable.
For instance, simultaneous localization and mapping (SLAM) has been the core of many successful indoor robot systems. SLAM allows map generation and localization within large consistent environments. Trajectory and tracking systems based on SLAM generally assume that the unknown environment is static and contains only rigid, non-moving objects. Se [87] presents a SLAM-based system using the Scale Invariant Feature Transform (SIFT) [61] for indoor environments using odometers and a stereo camera system. Cobzas introduced a system using 3D vertical lines that are obtained by laser range finders [25]. SLAM-based systems use multiple readings of several kinds of sensors such as odometers, laser range finders, ultrasonic sensors and camera systems.

As presented earlier, existing vision-based motion and trajectory tracking systems often impose a number of constraints that confine them to specific applications or limit their performance. As a result, most of the existing systems generally do not perform well enough in outdoor applications, largely because they are not designed robustly enough to deal with the irregularities of natural environments. Therefore, robust, autonomous, long-range navigation for natural terrains or unknown environments, based on machine vision, remains an unsolved problem in robotics. This research is therefore motivated to address some of the limitations of vision-based trajectory and motion tracking. The key deficiencies in many different solutions to visual trajectory and motion tracking can be identified as the following:

• Accuracy: cumulative error makes system output unreliable, as often the drift increases over time.
• Robustness: different environmental and imaging conditions can influence the reliability and the robustness of the estimated solution.
• Scene dependence: dependence on a unique environment makes these systems inflexible and less useful.
• Cost: the use of expensive and sophisticated sensors is not affordable or practical for smaller applications.
• Real-time performance: real-time performance adds to a system's reliability in dealing with abrupt changes or unexpected events.

This thesis presents a completely vision-based trajectory estimation system for outdoor and unknown environments. At each time frame, the system provides the 3D trajectory of the navigating robot. The system includes an inexpensive trinocular set of stereo cameras that can be mounted anywhere on the robot. It employs existing scene information and requires no prior map, nor any modification to be made in the scene. The system focuses on improving accuracy by removing any noise introduced by sensory devices or other processes in the system. Special attention is also paid to the problems of robustness and reliability in different environmental and imaging conditions. The main assumptions here are that the scene provides enough features for matching and that most of the scene objects are static. Moreover, it is assumed that the velocity of the robot is limited in such a way that there is some overlap between each two consecutive frames. The system is mainly designed for use in autonomous navigation in natural environments where a map or prior information about the scene is either impossible or impractical to acquire.
Figure 1.1 represents a simple block diagram of the system. At each time, a set of trinocular stereo images is captured. The most informative features from these images are then extracted. Features with high stability are used to reconstruct a sparse representation of the world around the mobile camera. These 3D features are then tracked over time, and the motion that causes their 3D displacements, from one frame to the next, is estimated. The features are then transformed into a global coordinate system, where their positional information is updated by merging their previous and current readings if they have been viewed before. These features will be used in future frames, and the process continues for the next incoming captured frames.
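A minimal sketch of this per-frame loop is given below. It is purely illustrative: the stage callables (detect, reconstruct, track, estimate, update) are hypothetical placeholders standing in for the modules of Figure 1.1 and are supplied by the caller; this is not the implementation developed in the following chapters.

```python
def run_tracker(frames, detect, reconstruct, track, estimate, update, pose, world):
    """Illustrative per-frame tracking loop; all stage functions are assumed inputs.

    frames: iterable of trinocular (left, centre, right) image triplets.
    pose:   current 6-DoF camera pose as a 4x4 homogeneous transform (e.g. numpy array).
    world:  accumulated map of 3D world features in the global frame.
    """
    for left, centre, right in frames:                       # trinocular image capture
        corners = detect(left)                               # most informative features
        pts3d = reconstruct(corners, left, centre, right)    # sparse 3D reconstruction
        matches = track(world, pts3d, pose)                  # match against world features
        delta = estimate(matches)                            # least-squares 6-DoF motion step
        pose = pose @ delta                                  # accumulate the trajectory
        world = update(world, pts3d, pose)                   # merge features into global frame
    return pose, world
```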
1.2 Major Contributions

The major contributions of this thesis can be summarized as follows:

1. A novel binary corner detector for choosing the most informative scene features in a fast manner.
2. A new approach for 3D world reconstruction that not only minimizes the lens radial distortion, but also removes the unwanted added distortion that is introduced by the lens distortion removal algorithm.
3. A multi-step scheme for the motion estimation process, in which a rough motion estimation facilitates a more precise tracking implementation, resulting in the elimination of more outliers at earlier stages.
4. Improvement of the robustness and accuracy of the general trajectory tracking problem for outdoors, under different environmental conditions with no prior information about the scene. The translational and rotational accuracy is improved, yielding a value of 1% with paths as long as 6 meters and with rotations up to 360 degrees.

[Figure 1.1: Visual motion tracking system overview. For each frame the block diagram shows trinocular image capture, feature detection, 3D world reconstruction, feature tracking, motion estimation (Dx, Dy, Dz, φx, φy, φz) and a 3D world update that feeds world features from previous frames back into tracking.]

1.3 Thesis Outline

This thesis is concerned with the design and implementation of a vision-based system for tracking the 3D trajectory and motion of a robot in an unknown environment using an on-board camera system. It consists of eight chapters as follows.

Chapter 2: Visual Trajectory Tracking. In this chapter the background of visual motion and trajectory tracking problems is explained. Different approaches to the tracking problem, along with examples of each, are then presented. The performances of these systems are compared based on their sensory devices, accuracy, applicability and cost. The approach selected in this thesis is also justified in this chapter.

Chapter 3: Feature Detection. An overview of some common feature detectors is presented in this chapter. The basis of a novel binary corner detector, developed for this work, is explained next. The performance of the suggested method is then compared with the most common feature detector, the Harris corner detector, and with the original inspiration for this work, the SUSAN corner detector.

Chapter 4: 3D World Reconstruction. This chapter first explains the conventional method for reconstructing the 3D world using the stereo algorithm. We take a slightly different approach to the 3D world reconstruction problem, in which the positional uncertainty resulting from the lens distortion removal process is minimized. Through this approach, every single pixel of the acquired images is processed; traditionally, some of this information is discarded during the lens distortion removal process, and therefore a wider scene is processed here.

Chapter 5: Feature Tracking and Motion Estimation. A multiple-stage approach for tracking world features is presented in this chapter. By taking this direction, improvements are gained in the accuracy of the overall motion estimation by means of more accurate match correspondences and a lower number of outliers. The motion estimation process, involving the calculation of image Jacobians as well as the least squares fit solution to the problem, is also explained in this chapter.
Chapter 6: Trajectory Error Modeling. This chapter starts with a discussion of the inherent positional uncertainty associated with each feature point. It continues with the study of a Kalman filter scheme for each feature that is used for combining positional information and uncertainties over time. Derivation of the formulation for propagating the existing uncertainty in the estimated motion to feature positions concludes this chapter.

Chapter 7: Experimental Results. A study of the implemented system's performance, accuracy and cost is presented through several examples for indoor and outdoor environments in this chapter. A discussion of some practical considerations, as well as timing issues for implementing the system, is also included.

Chapter 8: Conclusions and Future Work. The contributions of this thesis, along with suggestions for future research, are summarized in this chapter.

Chapter 2

Visual Trajectory Tracking

In visual motion and trajectory tracking, the relative motion between objects in a scene and the camera is determined through the apparent motion of objects in a sequence of images. This motion may be characterized or measured through observing the apparent change in brightness patterns in the images or the motion of a discrete set of scene structures. Visual trajectory tracking approaches can be divided into two classes:

• Motion-Based Methods
• Feature-Based Methods

Motion-Based Methods

Motion-based methods detect motion through optical flow tracking and motion-energy estimation.

• Optical Flow-Based Methods: Optical flow tracking operates by extracting the velocity field, assuming that image intensity is a continuous function of time. Methods based on optical flow are fast and can be used for motion detection by any unconstrained platform; however, they cannot be used where the camera motion is more than a few pixels between successive frames, thus limiting the pan/tilt angles to small changes [92].

• Motion-Energy-Based Methods: Motion-energy tracking is based on calculating the temporal derivatives of images, which are then thresholded to eliminate noise. The resulting image is segmented into regions of motion and inactivity. Methods based on temporal derivatives are simple and suitable for parallel computing architectures, but are subject to noise, leading to imprecise values. In addition, pixel motion is detected but not quantified [70] (a minimal frame-differencing sketch of this idea is given after this list).
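As a concrete illustration of the motion-energy idea, the toy sketch below thresholds the absolute temporal difference of two consecutive grayscale frames to produce a binary motion mask. The input arrays and the threshold value are assumed inputs chosen for illustration; this is not a method from the systems surveyed here.

```python
import numpy as np

def motion_mask(prev_frame: np.ndarray, cur_frame: np.ndarray, thresh: float = 15.0) -> np.ndarray:
    """Binary mask of 'moving' pixels from the thresholded temporal derivative.

    prev_frame, cur_frame: grayscale images (equal-shaped uint8 or float arrays).
    thresh: intensity-difference threshold used to suppress sensor noise (assumed value).
    """
    diff = np.abs(cur_frame.astype(np.float32) - prev_frame.astype(np.float32))
    return diff > thresh   # True where the temporal derivative exceeds the noise threshold
```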
Feature-Based Methods

Feature-based methods recognize an object or objects (landmarks or scene structures) and extract their positions in successive frames. The 2D change of the object in the image can be due either to a change in its position in the scene or to a new aspect of the object. Therefore, by tracking an object over a sequence of images and measuring the similarities, the motion of the navigation system can be retrieved. While the performance of these approaches can be limited by the efficiency of the recognition method, the object type, and positional accuracies, they have the advantage of 3D capability in non-stationary environments with higher accuracy than motion-based methods. Work on the feature-based camera localization problem can be divided into two general domains:

• Landmark-Based Methods: Landmark-based methods detect distinguishable known structures (landmarks) in the scene and, through the geometric interpretation of the known positions of these structures, obtain the position of the camera. Tracking with these methods might be aided by an a priori map of the environment.

• Natural Feature-Based Methods: Natural feature-based approaches attempt to track projections of preliminary and more general features of a scene in a sequence of images. By tracking and finding relative changes in the positions of these features in images, it is possible to find the trajectory and motion of the navigation system.

As described in the previous chapter, precise measurement of the motion and trajectory of a mobile navigation system is the main goal of this thesis. Therefore, feature-based approaches seem to be more suitable for this application. A brief study of the previous work in vision-based navigation systems is presented in the following sections.

2.1 Landmark-Based Methods

A landmark is defined as any detectable structure, marking or object with a known position in a physical environment that can be recognized in an image. A landmark can be a simple vertical line, a specially designed mark, such as a cross or a pattern of concentric circles, or even a known object in the scene, such as a building or an indoor ceiling lamp. Landmarks can be categorized into two types:

• Predesigned landmarks. These artificial landmarks enrich the environment with simple and distinguishable geometric patterns.

• Natural landmarks. If the workspace is known in advance, it might be possible to manually select objects to constitute a set of known landmarks.

A system based on landmarks usually requires a training phase in which the landmarks and their corresponding information, such as world position and physical attributes, are extracted to construct a database. Motion tracking proceeds by detecting landmarks, followed by camera position estimation based on triangulation. Triangulation for robot localization is based on traditional methods in cartography and navigation that use angles measured between lines of sight to known landmarks in the environment. There has been extensive research on these methods in the domains of cartography, photogrammetry and computational geometry. Anderson et al. [2] categorized the triangulation methods into three groups, based on the number of landmarks:

I. Triangulation based on one landmark views one landmark (LM) from two viewpoints; see Figure 2.1(a). The observer position (P2) is determined using the initial location (P1) or the traveled distance (b). The main assumption here is that the observer's displacement occurs parallel to the image plane.

II. Triangulation using two landmarks from two viewpoints is similar to using a single landmark, with the same assumption and measurements. The additional landmark, however, constrains the position estimation process, thus vastly reducing the noise sensitivity of the process; see Figure 2.1(b).

III. Triangulation via three landmarks from the same viewpoint relies on the world positions (LM1, LM2, LM3) and the angular separations (φ1, φ2) of the landmarks to recover the observer's position, as demonstrated in Figure 2.1(c).

More details about 2D and 3D localization using triangulation can be found in Appendix A; a simplified numerical sketch follows Figure 2.1.

[Figure 2.1: Landmark based triangulation.]
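To make the use of bearing measurements concrete, the sketch below computes a 2D position fix from two landmarks with known world coordinates. It deliberately uses the simplest variant, a fix from two absolute bearings, which assumes the observer's heading is known so that each measured direction can be expressed in the world frame and the fix is the intersection of the two bearing lines. The relative-angle cases of Figure 2.1 and Appendix A intersect circles rather than lines, but the underlying principle of locating the observer from angles to known landmarks is the same. The landmark coordinates and bearings in the example are made-up values.

```python
import numpy as np

def fix_from_two_bearings(lm1, lm2, bearing1, bearing2):
    """2D observer position from two known landmarks and absolute bearings (radians).

    Each bearing is the direction from the observer toward the landmark, measured
    counter-clockwise from the +x axis of the world frame. The observer lies on the
    line through each landmark along the reversed bearing; the fix is the intersection.
    """
    lm1, lm2 = np.asarray(lm1, float), np.asarray(lm2, float)
    d1 = np.array([np.cos(bearing1), np.sin(bearing1)])   # unit vector observer -> lm1
    d2 = np.array([np.cos(bearing2), np.sin(bearing2)])   # unit vector observer -> lm2
    # Observer p satisfies p + t1*d1 = lm1 and p + t2*d2 = lm2, so
    # t1*d1 - t2*d2 = lm1 - lm2 is solved for the ray lengths t1, t2.
    A = np.column_stack((d1, -d2))
    if abs(np.linalg.det(A)) < 1e-9:
        raise ValueError("bearings are (nearly) parallel; no unique fix")
    t1, _ = np.linalg.solve(A, lm1 - lm2)
    return lm1 - t1 * d1

# Made-up example: landmarks at (-5, 5) and (5, 5) seen at bearings of 135° and 45°
p = fix_from_two_bearings((-5.0, 5.0), (5.0, 5.0), np.radians(135), np.radians(45))
# p is approximately (0, 0), the point from which both landmarks are seen at those bearings
```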
Several parameters influence the result of the triangulation procedure, including the camera's focal length, the world positions of the landmarks, the odometric readings, and/or the initial position estimate. A study of these methods in [2] shows that their efficiency and robustness are proportional to the number of viewed landmarks. Sutherland and Thompson [98], and later Betke and Gurvits [13], studied triangulation methods to show the effect of landmark configuration on the size and shape of the area of triangulation uncertainty. In this work, landmark-based navigation methods are classified according to the type of landmark. A study of several examples in this class is presented next.

2.1.1 Navigation using predesigned landmarks

In these types of systems, a predesigned landmark is placed at different but known locations in the environment. Reliable position estimation with this type of approach depends on bounding the uncertainty existing in the initialization, recognition and identification of such markers. Simple geometric models such as circles, ellipses or polygons are often used as artificial landmarks because they can be detected easily in real-time using edge detection algorithms. Most of the previous works have adopted planar designs when creating landmarks [12,21,33,56,93,112]; however, better accuracy may be obtained by using 3D artificial landmarks. For instance, Bao et al. [5] describe a system for estimating the position and orientation of a vehicle using 3D artificial landmarks. They show that a significant improvement of orientation accuracy can be achieved by replacing 2D landmarks with 3D ones. Several systems are presented for indoor navigation using 3D geometric landmarks [53,99]. They emphasize landmark issues such as high information content, generality under different viewing conditions, and fast extraction.

Generally, in these types of systems, circular patterns are the most popular among the geometric shapes because their projection in the image plane can be approximated by an ellipse, and they do not usually get confused with the majority of patterns frequently seen in the scene. Also, circular patterns are more robust to noise and occlusion than polygonal patterns during the matching process. For instance, one algorithm employing 3D artificial landmarks, introduced by Lin and Tummala [58], uses a landmark that is a dark circular disk on a white background with a secondary, smaller, white, concentric disk (see Figure 2.2(a)). The relative position and heading of the robot can be estimated using the corresponding parameters of the ellipses observed in the image, as illustrated in Figure 2.2(b).

[Figure 2.2: (a) Top views of the landmark and projected images. (b) Estimation geometry.]

The major disadvantage of these approaches is that they require substantial engineering of the environment, and therefore they are not suitable for changing or unexplored environments. Moreover, these methods are mostly limited to planar estimates. The choice of the landmark type might also add some limitations to the tracking system. Sometimes the cost of the creation and maintenance of beacons might not be practical, and sometimes finding a one-to-one correspondence between objects in the workspace and their projections in the image plane is difficult in the presence of noise. Also, as the detection of each landmark in an image requires an approximation of the perspective projection of that landmark, the accuracy of such systems can be influenced by a poor approximation.

2.1.2 Navigation using natural landmarks

Much of the recent work in landmark-based navigation attempts to avoid the requirement for artificial landmarks or domain-specific features. This is mainly due to the restrictions that artificial landmarks generally place on the types of environment that can be explored. Not only do they restrict camera motion and pose, but their calibration in the environment is usually an onerous task. This has given rise to methods that are not based on predesigned landmarks; rather, they automatically extract naturally occurring landmarks via a local distinctiveness criterion from the environment during a learning phase. Such systems usually require an a priori map of the environment that is either provided or created during a learning phase.

Over the past few years, many vision-based localization systems using landmarks have been developed. They mainly differ in the features they employ. The distinctiveness criteria for feature selection vary based upon the application and the navigation environment. Many of these methods focus on using structured objects, such as doors, windows and straight lines. For instance, Munoz and Gonzalez [69] develop a system based on early work by Krotkov [55] that employs vertical line models. The system fuses a single camera's information, a two-dimensional map including landmark positions and attributes (vertically oriented parts of fixed objects such as doors, desks and wall junctions) and odometer readings. Another work is explained in [9], where lines and edges are extracted from reference images in a learning phase. For each reference image a set of geometric models is created using the extracted features. Then, the rough position of the robot is estimated by applying geometric transformations that fit the extracted data to the reference image models. In another work, Trahanias et al. [103] develop a system to find 2D camera motion by employing natural landmarks on walls.
During the learning phase, distinct features on the walls are chosen using statistical measurements and placed into a navigation map. Sim and Dudek [94] introduce a method in which landmark candidates are selected, based upon the extrema of the edge distribution density, to construct a representation of the environment in the form of a database. The position estimation is achieved by extracting image landmark candidates and matching them with the tracked landmarks in the database, and then by linear position interpolation of the match correspondence candidates. In another approach [54], vertical lines are extracted from camera images and are combined with data obtained from ultrasound sensors to estimate the position and orientation of the robot.

As demonstrated by these examples, such approaches have had success in 2D motion tracking for indoor environments, where structured landmarks are more present and where the readings of an odometer sensor are more reliable. Such systems, however, become less successful in outdoor applications. This is because odometer readings in environments with rough and uneven surfaces are less accurate. Moreover, outdoor scenes usually include fewer structured objects and characteristics, and the shapes of these objects can easily vary under different or changing environmental or imaging conditions. One system that attempts to achieve natural target tracking for an outdoor environment is proposed by Murrieta-Cid et al. [71]. In their system, principal regions of the scene are extracted by color segmentation and are characterized by their color and texture using statistical information. These principal regions compose a database which includes different environmental classes such as grass, trees, soil, sky and rocks. The tracking task is carried out by comparing image segments, extracted from image sequences, with the database models [44].

The major advantage of these landmark-based approaches is that they have a bounded cumulative error. Moreover, their simplicity and the fact that they do not require any 3D reconstruction of the scene have made them very popular. However, these systems require some knowledge of the geometric model of the environment, either built into the system in advance as a prior map, or acquired using sensory information during movement (the learning phase), or sometimes a combination of both. This requirement seriously limits the approach's capability in unknown environments. Moreover, the motion of the robot is limited along one or two axes, and the orientation is confined to a field in which the landmark(s) is visible. These restrictions make such approaches inappropriate for environments where motion involves more degrees of freedom. Furthermore, the underlying assumption is that the robot can always reliably detect a landmark in the sensor data. This is a fundamental assumption, critical for the reliability of the whole system. One of the important aspects of all of these solutions is that sensor measurements are not always accurate, and therefore it is more reasonable to seek a solution that minimizes the uncertainty of the position estimate.
2.2 Natural Feature-Based Methods

Natural feature-based tracking methods take advantage of the fact that, as the camera moves in an environment, changes in the pattern of image intensities over short time intervals are usually related to each other. This correlation shows that patterns move in an image stream as the camera moves in the environment [90]. If I(x, y, t) represents the image intensity function, then for a simple translational motion of the camera system the following equality formally holds:

I(x, y, t) = I(x + ξ, y + η, t + τ)    (2.1)

Therefore, the local displacement between pixels in subsequent 2D images corresponds to relative 3D motion between corresponding points in the scene. Through the measurement of such displacements, the 3D locations of these features, as well as the motion of the camera, can be determined. One important issue involved with finding these displacements is that a single pixel-sized scene element, unless it has a very distinctive brightness with respect to each of its neighbors, can be confused with adjacent scene elements. A solution to this problem is the use of preliminary features of the scene with maximum local stability under projective transformations. The objective of natural feature-based tracking methods therefore becomes to derive the motion of features in a scene through the analysis of the motion of corresponding features in the sequence of images.

Common local structural features extracted from images are points, lines, statistical features, surfaces and geometric patterns. The fact that these features are usually not scene-dependent makes them very popular for tracking applications in unknown environments. The type of feature is highly dependent on the working environment that the system is designed for. For instance, Wilson et al. [110] use object corners, centroids and the areas of holes in an object as local features. The centroid and diameter of circles are used by Harrell et al. [38] for a fruit tracking robot for harvesting. Rives and Borrelly [78] employ edge features to track pipes with an underwater robot. The road-following vehicle of Dickmanns et al. [29] is also based on edge tracking. In another application, edge contours are tracked in real-time [111] for manipulation of free-floating objects by a space robot. Vanishing points (the intersection of two nearly parallel lines) and line orientations are used by Zhang et al. [116] for robot navigation. Chaumette et al. [22] and Espiau et al. [30] derive variations of a tracking method for points, circles and lines.

The main advantage of using local features is that they correspond to specific physical elements of the observed objects and, once correctly located and matched, they provide very accurate information concerning the relative position between camera and scene. Also, in systems based on landmarks or models, it is possible that no landmark is visible, so the motion estimation cannot be accurate for some percentage of the time, while estimations based on scene features are potentially less likely to fail due to the large number of features that can be available from any point of view. The accuracy of these methods, however, is highly dependent on the accuracy of the features. Even a small amount of positional uncertainty can eventually result in a significant trajectory drift.
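As an illustration of how relation (2.1) is exploited in practice, the sketch below recovers the 2D displacement (ξ, η) of a small patch around a feature point by minimizing the sum of squared intensity differences over a search window in the next frame. The window sizes and the brute-force search are illustrative simplifications and assumed values, not the matching strategy developed later in this thesis.

```python
import numpy as np

def patch_displacement(img0, img1, x, y, half=7, search=10):
    """Estimate the displacement (dx, dy) of the patch centred at (x, y) in img0.

    img0, img1: consecutive grayscale frames (2D arrays).
    half: half-width of the square template patch (assumed value).
    search: maximum displacement considered in each direction (assumed value).
    Returns the offset minimizing the sum of squared differences (SSD).
    """
    template = img0[y - half:y + half + 1, x - half:x + half + 1].astype(np.float64)
    best, best_ssd = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = img1[y + dy - half:y + dy + half + 1,
                        x + dx - half:x + dx + half + 1].astype(np.float64)
            if cand.shape != template.shape:      # skip candidates falling off the image
                continue
            ssd = np.sum((cand - template) ** 2)  # brightness-constancy residual
            if ssd < best_ssd:
                best, best_ssd = (dx, dy), ssd
    return best
```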
The previous work in natural feature-based tracking can be categorized based on the estimation's degrees of freedom and the number of employed cameras.

2.2.1 2D estimation with single cameras

Due to limited processing and sensory technologies, early work in the natural feature-based motion tracking area is limited to planar motions. Sethi and Jain [88] describe a method for finding 2D trajectories of feature points in monocular images by employing an iterative optimization scheme based on constraints on the direction and magnitude of the motion. Wang and Clarke [108] implement a 2D general motion estimator using Fourier descriptors of object contours. Soroushi [96] implements a real-time, natural feature-based trajectory tracking system for the planar motion of a down-looking monocular camera in an unknown environment, using a coarse-fine registration algorithm. Betke et al. [14] develop a real-time vision system that analyzes color videos of a highway to recognize and track road boundaries and vehicles. The boundaries of the road and the outline of the vehicle are found using the edge, color and saturation content of the video images. Etoh and Shirai [31] address a system that segments input images into fragments characterized by their color, pixel position and intensity gradients. The planar motion parameters are estimated precisely by tracking clusters of fragments over image sequences.

2.2.2 3D estimation using single cameras

Later attempts are directed toward 3D tracking using monocular images. For instance, Basri et al. [10] present a feature-based navigation system that moves a 3D robot arm to a desired location in space. The destination information is provided by a single image taken at the goal position. Through the matching of a minimum of eight corner correspondences between two consecutive frames, the rotation matrix between the two images is found first [8]. Employing a multi-step approach, the camera rotates and then translates until it reaches the goal position. Irani et al. propose a 3D motion estimation approach for static scenes in [46]. They use the particular behavior of points on planar surfaces to register patches on consecutive frames first, and to derive the 3D translation and rotation vectors of the camera motion later. Harris introduces a monocular vision system that uses the visual motion of image corners to create a 3D geometric model of the viewed scene [39]. In his work he finds the 3D locations of features through an egomotion process. Successive frames of a moving camera are then processed, and the motion of the camera, as well as the 3D positions of features in the consecutive frames, are estimated by an optimization algorithm. Leung et al. [57] propose a constrained 3D motion estimation method using lines over monocular images. In this method lines are identified from a set of edge correspondences and their mathematical equations are computed. The lines are tracked over three consecutive frames, and the motion of the camera is estimated using an optimization scheme on a triplet set of correspondences.
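The eight-correspondence requirement mentioned above comes from the classical linear ("eight-point") estimation of the essential matrix relating two calibrated views, from which rotation and translation are subsequently extracted. The sketch below shows only that linear estimation step on normalized image coordinates; it is a generic illustration of the textbook algorithm, not the specific procedure used in [8] or [10].

```python
import numpy as np

def essential_from_correspondences(pts1, pts2):
    """Linear (eight-point) estimate of the essential matrix E with x2^T E x1 = 0.

    pts1, pts2: (N, 2) arrays of corresponding points in *normalized* image
    coordinates (pixel coordinates premultiplied by the inverse camera matrix), N >= 8.
    """
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])   # homogeneous coordinates
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    # Each correspondence contributes one linear constraint on the nine entries of E.
    A = np.column_stack([
        x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
        x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
        x1[:, 0],            x1[:, 1],            np.ones(len(x1)),
    ])
    _, _, vt = np.linalg.svd(A)
    E = vt[-1].reshape(3, 3)                  # null vector of A, reshaped row-major
    # Project onto the space of valid essential matrices (two equal singular
    # values, third zero), as required by calibrated two-view geometry.
    u, s, vt = np.linalg.svd(E)
    sigma = (s[0] + s[1]) / 2.0
    return u @ np.diag([sigma, sigma, 0.0]) @ vt
```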
2.2.3 3D estimation using multiple cameras

The main problems with the systems described above are poor overall motion estimation, limited motion, small range tolerance, and long-term cumulative error. Several attempts have been made to increase motion range, accuracy and robustness through the employment of images from multiple views. Through orthogonal or slanted views, 3D position estimation and alignment are achieved with higher accuracy. Joshi and Sanderson address an application of feature-based visual tracking for three-dimensional filament alignment, with 5 degrees of freedom (DoF), using two orthogonal cameras [47]. Argyros and Fredrik describe a vision system [3] that allows the robot to move in the middle of the free space by exploiting a forward-looking camera and two side-looking slanted cameras for the peripheral visual fields. Zhang [113] implements an algorithm to estimate the 3D motion of line segments in space using two perspective images taken by two calibrated cameras. With the assumption that the matched line segments contain the projection of a common part of the corresponding line segment in space, he can estimate the relative motion between the two cameras. The main problem with these techniques, however, is the existing ambiguity with respect to the absolute value of the distance between the camera and the scene [1].

Later research in natural feature-based 3D motion and trajectory estimation addresses stereo vision analysis. The use of stereoscopy uniquely determines depth and hence absolute values for motion parameters. Wettergreen et al. [109] embark on the development of autonomous underwater vehicles for performing automatic sub-sea searches by tracking dynamic or fixed targets. The system uses area-based features and is equipped with a stereo vision system that was designed specifically for tracking targets and extracting their position, orientation and velocity. Thorpe et al. [100] successfully develop a real-time autonomous driving system, Navlab, to navigate on highways with the aid of a map, a laser range finder and a stereo camera system. The visual tracking module employs a variety of perception methods and techniques, selected depending upon road and lighting conditions. Zhang and Faugeras [115] propose an approach using 3D line segments obtained from stereo images. Tracking is performed using a prediction, matching and update loop, and the motion is estimated using an extended Kalman filtering scheme. Shieh et al. [91] introduce a method to find the relative motion of an object with respect to a moving camera using stereo and image brightness information. The main assumption here is that the motion of the object is limited to a very small range. Although they show that their approach improves accuracy, the reported error is still very large. Mark et al. [106] present a vehicle egomotion estimation system using stereo and corner features based on the Harris corner detector [40]. The major problem with their work is reported to be the accuracy. Moreover, the results are presented only for simulated and synthetic data. Pedersini et al. [75] describe a method for egomotion estimation using 3D straight and curved contours. The motion is first approximated through the optimization of a 3D transformation that maps the 3D straight line correspondences, and it is then refined by adding 3D curved contour correspondences to the optimization process.
The motion is first approximated through the optimization of a 3D transformation that maps the 3D straight line correspondences, and it is refined by adding 3D curved contour correspondences to the optimization process. Demirdjian and Darrel [27] address a system using correspondences in disparity images. They show that working in a projective disparity space, where the error associated with the image points is precisely known, can help to reduce the disturbing noise effect on the final estimated motion. Another stereo-based 3D motion estimation is addressed by Goncalves and Araujo in [35]. An estimate of the velocity along the direction perpendicular to the image plane is used to recover the complete motion parameters. The analysis of the results shows that the suggested method is very path dependent, and generally suffers from an average error of between 10% and 30%.

The main advantages of natural feature-based systems are reduced complexity, as only a limited set of features is processed at each time, and applicability to more general environments. However, as presented, the central problems for these methods are the accuracy and robustness of the estimated motion. The performance of these systems is highly dependent on accurate feature correspondence establishment and tracking [59,105,117], and on the resolution of the optical system. These become serious issues when estimating and tracking a trajectory over a long sequence of images. Therefore, while these types of systems can perform reliable real-time 3D object tracking, they may drift in long-term trajectory tracking.

2.3 Thesis Objective

The design presented in this thesis is an exploration of the relevant issues in creating a real-time, on-board motion tracking system for a natural environment using an active camera. The system is designed to be as general as possible, but some necessary assumptions are made, as some uncertainties and exceptions do exist. The following is the list of assumptions that we have undertaken in our approach:

• Much of the current vision research is concerned with stationary objects, since non-stationary objects are very difficult to handle without prior knowledge of their behavior. Thus, we assume that our scene includes mostly stationary objects. If there are a few moving objects, the system is able to rely on static object information, while information from moving objects can be discarded as statistical outliers.

• The camera characteristics are known. In particular, the focal length and the baseline separation of the stereo cameras should be provided.

• The motion of the robot is assumed to be limited in acceleration. This allows the match searching techniques to work on a small and predictable range of possible matches.

• There is no specific knowledge about the unstructured scene which is to be processed. The working environment is assumed not to be a uniform scene and includes a number of objects and textures. There is also the assumption that at most only 40% of the scene objects are moving.

The ultimate goal of this thesis is to define a solution covering the following objectives:

• Accuracy, simplicity and cost. The motion accuracy is the most important consideration of this problem. Most of the current accurate systems are expensive, and therefore are not applicable to small applications.
In this work, a solution that is reliable, simple, and inexpensive enough to be used for every purpose is sought.

• Continuous sensing. The system should be able to track its own motion in a continuous manner. Some systems have a stop-sense-and-move mode, which is contrary to the application of this system.

• Real-time performance. The system should be able to provide the location and orientation of the robot at each moment in time. This is an important requirement, since making the necessary decisions based on abrupt and unexpected changes in natural environments is a critical issue for the safety of such autonomous operations.

• Vision-based only. System input is only provided through vision-based sensors, that is, cameras. The use of wheel-based encoders, range finders or GPS sensors is not considered practical. This is mainly because the working environment includes severely uneven surfaces, outdoor elements such as fog, dust, snow and rain, and foliage.

Chapter 3

Feature Detection

Many vision-based tracking systems estimate camera motion by measuring relative changes in the projection of identical scene features in different image frames. Although, globally, all points in the scene convey some information about the motion, locally not all the pixels of the image carry valuable motion information. For example, edges, occlusions or areas of uniform local intensity can convey, at best, only partial information about the motion. An important aspect of a motion tracking system is its real-time performance. Processing all the pixels of an image, of which only a small number carry information about the camera's motion, may not be possible within the real-time requirement for such systems. Therefore, special attention is paid to selecting regions with higher information content. There have been many different criteria for choosing such regions, mostly based on areas with high second-order derivatives or high frequency content. In spite of the fact that these methods deliver trackable features, there are several characteristics that make some of them more reliable than others. This chapter first presents an overview of the previous work in feature detection. A novel feature detector using binary images is then introduced. A comparison of the efficiency and performance of the proposed method with existing ones is then presented.

3.1 Features

Marr [66] describes scene features as tokens that represent attributes of the image which correspond to physical events in the scene, e.g., a discontinuity in the surface reflectance or depth. In a simpler definition, features are meaningful tokens of a scene that can be identified and tracked over a sequence of image frames. Several feature attributes, such as good localization, robustness with respect to noise and illumination changes, and efficiency of the detection algorithm, can lead to a more accurate or faster estimation. Deciding on the type of feature is critical and depends greatly on the type of input sensors, the environment, and the application that these features are used for. Common features that are generally used include the following [17]:

• Raw pixel values, i.e. the intensities. They retain the most information possible, but the extraction of that information is expensive. Moreover, raw pixel values are sensitive to noise.
• Edges, surfaces and contours. These features usually correspond to real 3D structures in the scene, and they provide connectivity information [97]. For example, edges are less sensitive to noise and working with them is fast. However, they are subject to inconsistencies, such as dropout from image to image, especially when extracted from real images. Also, they are often not viewable as complete entities; for example, one or both end points may be obscured by other structures or be outside the field of view. Edges also suffer from the aperture problem. This means that for a moving edge, only the component of the motion perpendicular to the edge can be found accurately, while the other component may belong to a set of possible motions.

• Salient features, such as corners, line intersections, points of locally maximum curvature on contour lines, and texture patches. These types of features have the advantage of being discrete, reliable, meaningful and more accurate for finding the position.

• Statistical features, such as moment invariants, energy, entropy and color histograms. These features are measurements over a region of an image, representing an evaluation of the region. The main problem with this type of feature is that they are difficult to use in a matching process.

• Higher level features, such as structural and synthetic features. In this type of feature, relations and other higher level information are used. They are usually unique and fitted to specific applications or specific environments.

Choosing simple features within the scene increases the reliability of the solution for motion tracking and enables the system to find answers to problems most of the time, unless the scene is very uniform. In the search for a feature type that suits our application, a natural, unstructured environment with varying lighting conditions, it was decided to work with corners, because they are discrete and partially invariant to scale and rotational changes.

3.2 Corner Detection Background

Two broad groups of corner detectors are identified by Deriche and Giraudon [28].

• The first group extracts corners from edges represented as chain codes, by searching for points with maximum curvature or with some polygonal approximation, and searching for intersection points.

• The second group detects corners by computing a measure that indicates the presence of an interest point directly from the signal, for example the grey-level values or the brightness gradient.

Several detectors for identifying and localizing interest points have been developed, and many of them are reviewed in [86]. Kitchen and Rosenfeld [52] find points in the image where the product of the gradient magnitude and the edge contour curvature is maximum. Wang and Brady [107] define corners as points with maximum gradient in the direction perpendicular to the edge curvature. Moravec [67] presents a method based on computing the intensity autocorrelation in four directions over a small window. Features with locally minimum auto-correlations are declared as corners. Harris and Stephens [40] improve Moravec's method by computing the auto-correlation for the squared first-order image derivatives.
Smith and Brady [95] introduce a significantly different low-level corner detector that measures the similarity of the related parts of the image to each individual pixel. They associate each pixel with its local image region of similar brightness. Distinctive features are detected by minimizing these image regions. Schmid et al. [86] show that Harris and Stephens' method gives the best results based on several criteria, such as repeatability with respect to scale, illumination, viewpoint change and noise.

The robustness of a feature detector can strongly affect the performance of a system based on such features. Originally, for this work, the Harris corner detector was used. Since the Harris method involves several smoothings, for the purpose of noise suppression, and the computation of first-order image derivatives, it becomes computationally expensive. Also, by smoothing the intensity image and its derivatives, the higher spatial frequencies are attenuated, and therefore the position of the corner appears with an offset, usually a fraction of a pixel, from its real position. These two effects limit the performance of the system and to some extent restrict further attempts at quality improvement. One of the methods that promises very good positional accuracy is the Smith and Brady [95] corner detector. As will be explained later, this method is computationally very expensive and highly noise sensitive. The principles of these two methods, as well as a novel method developed for this research, are described in the following sections.

3.3 Harris Corner Detector

The Harris and Stephens corner detector [40] was developed as an enhancement of the Moravec interest operator. The problem of detecting corners can be analyzed in terms of the curvature properties of the local image brightness autocorrelation function, where curvature information can be represented by the Hessian matrix. The autocorrelation function is useful for characterizing how brightness values change in the neighborhood of a location. At a corner or an isolated brightness peak, all shifts result in large changes in the autocorrelation function at that location. One version of the brightness spatial change function for a small shift (x, y) can be expressed by

E(x, y) = \sum_{u,v} W_{u,v} \, [ I_{x+u, y+v} - I_{u,v} ]^2     (3.1)

Here W_{u,v} represents a smooth circular window and is defined by

W_{u,v} = e^{-(u^2 + v^2)/(2\sigma^2)}     (3.2)

A first-order approximation of E is given by

E(x, y) = A x^2 + 2 C x y + B y^2     (3.3)

where A, B and C are approximations of the second-order directional derivatives of the Gaussian-smoothed brightness image:

A = X^2 \otimes W, \qquad B = Y^2 \otimes W, \qquad C = XY \otimes W     (3.4)

E can be rewritten as

E(x, y) = [x, y] \, M \, [x, y]^T, \qquad M = \begin{bmatrix} A & C \\ C & B \end{bmatrix}     (3.5)

where M is an approximation of the Hessian matrix of E and is closely related to the image's local autocorrelation function. The principal curvatures of the image brightness autocorrelation function at a point can be approximated by the eigenvalues of the approximated two-by-two Hessian matrix, M, defined at that point. A corner is announced if the principal curvatures are approximately equal and sufficiently large. The determinant of the approximated Hessian matrix is proportional to the product of the principal curvatures.
The Harris and Stephens corner detector is given by the following operator, where a large value of R corresponds with the presence of a corner [40]:

R(\alpha, \beta) = Det(M) - K \, Tr^2(M)     (3.6)

Tr(M) = \alpha + \beta = A + B, \qquad Det(M) = \alpha \beta = AB - C^2     (3.7)

X = I \otimes [-1, 0, 1] \approx \frac{\partial I}{\partial x}, \qquad Y = I \otimes [-1, 0, 1]^T \approx \frac{\partial I}{\partial y}     (3.8)

Here I is the image brightness. A positive value of the scalar K can be used to ensure that the Gaussian curvature approximation Det(M) is valid, by suppressing the detector response for image regions with very high contrast, as signaled by a large mean curvature. Zhang et al. specify a value of 0.04 for K to discriminate against high-contrast pixel step edges [114].
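To make the corner response concrete, the following is a minimal sketch of Equations 3.4 to 3.8 in Python with NumPy and SciPy. It is an illustration rather than the implementation used in this work; the smoothing value of 0.8 and K = 0.04 are simply the values quoted in this chapter.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(image, sigma=0.8, k=0.04):
    """Corner response R = Det(M) - k*Tr(M)^2 (Equations 3.4-3.7), sketched."""
    I = image.astype(float)

    # First-order derivatives, I convolved with [-1, 0, 1] and its transpose (Equation 3.8).
    X = np.zeros_like(I)
    Y = np.zeros_like(I)
    X[:, 1:-1] = I[:, 2:] - I[:, :-2]
    Y[1:-1, :] = I[2:, :] - I[:-2, :]

    # A, B, C: squared derivatives smoothed by the circular Gaussian window W (Equation 3.4).
    A = gaussian_filter(X * X, sigma)
    B = gaussian_filter(Y * Y, sigma)
    C = gaussian_filter(X * Y, sigma)

    # Large, positive, locally maximal R indicates a corner (Equation 3.6).
    return (A * B - C * C) - k * (A + B) ** 2
```

Corners would then be taken as local maxima of R above a threshold.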
3.4 SUSAN Corner Detector

A low-level corner detector recently introduced by Smith [95] may provide a solution to the position uncertainty and noise problems. The SUSAN corner detector is based on measuring the similarity of the related parts of the image to each individual pixel by means of non-linear filtering; each pixel is associated with local image regions of similar brightness. Distinctive features are detected by minimizing these regions. Since the method is not based on a derivative, it is less sensitive to white noise in the image; therefore, initial noise reduction is not necessary.

The principle of this corner detector is illustrated in Figures 3.1(a) and 3.1(b), which show a circular mask (with a central pixel, the nucleus) at 5 different positions on a dark rectangle on a white background. The area of the mask that has similar brightness to that of the nucleus is called the USAN (Univalue Segment Assimilating Nucleus) and is the basis for the SUSAN (Smallest Univalue Segment Assimilating Nucleus) corner detector. This local area contains much information about the structure of the image. From its size, centroid and second moments, one- and two-dimensional features can be detected. Figure 3.1(b) shows the area of the USAN, where each point in the input image is used as the nucleus of a small circular mask associated with the USAN at that point. It can be seen that this area decreases as an edge is approached, and it decreases even further near a corner, giving rise to a local minimum at the exact position of the image corner. Figure 3.2 shows some examples where a corner or a non-corner situation is identified.

Figure 3.1: The SUSAN corner detector.

Figure 3.2: SUSAN corner detector uses the local area information to identify corners.

Therefore, a circular mask, for an isotropic response, is placed at each point in the image. The brightness of each pixel within the mask is compared with that of the nucleus:

c(\vec{r}, \vec{r}_0) = e^{ - \left( \frac{ I(\vec{r}) - I(\vec{r}_0) }{ t } \right)^6 }     (3.9)

Here, \vec{r}_0 is the position of the nucleus in the two-dimensional image, \vec{r} is the position of any point within the mask, and I(\vec{r}) is the brightness of that pixel. Parameter t defines the minimum contrast of detected features, as well as the maximum amount of ignored noise. A running total n is then generated from the outputs of c:

n(\vec{r}_0) = \sum_{\vec{r}} c(\vec{r}, \vec{r}_0)     (3.10)

Here, n represents the area of the USAN. For a point to be considered a corner, n must be less than half of its maximum possible value, g, which determines corner quality. The dependence of the correct threshold on the data, and the capability of setting it without human intervention, are two factors having a large effect on the success of the algorithm. Occasionally, SUSAN gives false positives with real data, where blurring of boundaries between regions occurs or where there is a thin line with a brightness approximately halfway between those of the two surrounding regions. This problem is eliminated by computing the USAN's center of gravity \bar{r}:

\bar{r}(\vec{r}_0) = \frac{ \sum_{\vec{r}} \vec{r} \; c(\vec{r}, \vec{r}_0) }{ \sum_{\vec{r}} c(\vec{r}, \vec{r}_0) }     (3.11)

Clearly, a USAN corresponding to a proper corner has a center of gravity that is not near the nucleus, while false positives can be rejected by the short distance of their center of gravity from the nucleus.

The real-time performance required for our 3D motion tracking system demands a fast corner detector with high positional accuracy. After initial inspections of both methods, it was concluded that both are computationally expensive. This can be explained by the fact that in the Harris corner detector, three image derivatives are computed for each input image. These images are later smoothed by Gaussian filters. Computing these images, as well as the corner response R (Equation 3.6), for every pixel of an image increases the execution time of this method. Computationally, the SUSAN corner detector was more promising, as it cuts down the number of corner candidates at each inspection phase. However, measurement of the brightness similarity c(\vec{r}, \vec{r}_0), Equation 3.9, turned out to be a computationally expensive process; it is in fact the most expensive task in the SUSAN corner detector. If it is assumed that the intensity values of two consecutive pixels p_1 and p_2, at locations (n, m) and (n, m+1), are equal, then the evaluation of Equation 3.9 can be greatly simplified, because most of the computations are identical for these two points. The only work that remains is the evaluation and addition of Equation 3.9 for the last column of the masking window when it is centered at p_2, and the subtraction of the corresponding values for the first column of the mask when it was centered at p_1. This observation has led to the development of a very fast corner detector, as explained in the next section.

3.5 Binary Corner Detector

We developed a novel corner detector called the Binary Corner Detector [82]. The basic idea behind the SUSAN corner detector is the main inspiration for this algorithm. The main emphasis of this method is on exploiting binary images and substituting arithmetic operations with logical ones. Figure 3.3 represents the block diagram of the binary corner detector. To generate a binary image that contains good low-level information content, first the Laplacian is computed at each point of the intensity image. Horn [42] approximates the image Laplacian value by

\frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2} \approx I_{i-1,j} + I_{i+1,j} + I_{i,j-1} + I_{i,j+1} - 4 I_{i,j}     (3.12)

I_{i,j} stands for the image intensity value at row i and column j. Such an approximation of the 2D Laplacian is separable and is implemented efficiently by logical operations.
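As an illustration of this first step only, the binary sign-of-Laplacian image can be produced with a few shifts and additions. The 4-neighbour stencil below is the standard discrete approximation assumed for Equation 3.12; this is a sketch, not the logical-operation implementation referred to above.

```python
import numpy as np

def sign_of_laplacian(image):
    """Binary image L(p): 1 where the discrete Laplacian is non-negative, 0 elsewhere."""
    I = image.astype(np.int32)
    lap = np.zeros_like(I)
    # 4-neighbour approximation: I[i-1,j] + I[i+1,j] + I[i,j-1] + I[i,j+1] - 4*I[i,j]
    lap[1:-1, 1:-1] = (I[:-2, 1:-1] + I[2:, 1:-1] +
                       I[1:-1, :-2] + I[1:-1, 2:] - 4 * I[1:-1, 1:-1])
    return (lap >= 0).astype(np.uint8)
```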
The binary image is then generated from the sign of the Laplacian value at each point, (1) in Figure 3.3. Figure 3.4 shows a typical image and the corresponding binary image.

Figure 3.3: The block diagram of the Binary Corner Detector: (1) create a binary image from the sign of the Laplacian; (2) count the number of pixels similar to the central pixel within a circular mask, n(p_0); (3) form the corner response R(p_0); (4) compute the center of gravity G(p_0); (5) reject candidates whose center of gravity lies close to the mask center; (6) compute the intensity variation along the vector between the center of gravity and the centroid; (7) reject candidates with small intensity variation.

Figure 3.4: An image and its binary sign-of-Laplacian image.

At this point a circular mask, w, is placed on each point of the binary image. The binary value of each point inside the mask is compared with that of the central point:

C(p_0, p) = \begin{cases} 1 & \text{if } L(p) = L(p_0) \\ 0 & \text{if } L(p) \neq L(p_0) \end{cases}     (3.13)

L(p) represents the binary image value at location p(x, y). Now a total running sum n is generated from the output of C(p_0, p), (2) in Figure 3.3:

n(p_0) = \sum_{p \in w} C(p_0, p)     (3.14)

n represents the area of the mask where the sign of the Laplacian of the image is the same as that of the central point. For a pixel to be considered a potential corner, the value of n must be smaller than at least half the size of the mask w in pixels, (3) in Figure 3.3. This value is shown by t in the corner response equation:

R(p_0) = \begin{cases} n(p_0) & \text{if } n(p_0) < t \\ 0 & \text{otherwise} \end{cases}     (3.15)

At this point, for each candidate with R(p_0) > 0, a center of connectivity G(p_0) is computed, (4) in Figure 3.3. In order to be consistent with the notation of the SUSAN corner detector [95], the center of connectivity is referred to as the center of gravity:

G(p_0) = \sqrt{ g(x_0)^2 + g(y_0)^2 }     (3.16)

where

g(x_0) = \frac{ \sum_{p \in w} (x_0 - x) }{ n(p_0) }, \qquad g(y_0) = \frac{ \sum_{p \in w} (y_0 - y) }{ n(p_0) }     (3.17)

The center of gravity G provides the corner direction, as well as a condition for eliminating points with random distributions. Randomly distributed binary patches tend to have a center of gravity fairly close to the center of the patch. Therefore, all points with close centers of gravity are filtered out of the remaining process, (5) in Figure 3.3:

G(p_0) > | \vec{r}_g |     (3.18)

where |\vec{r}_g| is a minimum acceptable distance between the center of gravity and the mask center. The two conditions in (3.15) and (3.18) do not (by themselves) provide enough stability. For instance, Figure 3.5 shows an example where conditions 3.15 and 3.18 are both satisfied, yet the geometrical distribution of the pixels does not represent a corner.

Figure 3.5: Conditions in (3.15) and (3.18) do not (by themselves) provide enough stability.

Therefore, a third inspection is performed by computing the directional derivative along the corner direction for the remaining candidates, (6) in Figure 3.3. Once again, points with small directional intensity variations are eliminated, (7) in Figure 3.3. This condition is expressed by

| I(p_0) - I(p) | > I_t     (3.19)

where I_t represents the brightness variation threshold. Figures 3.6.a to 3.6.e illustrate the principles of the binary corner detector through different cases. Figure 3.6.a shows a circular mask on an ideal corner. The response R(p_0) for this corner is smaller than half the mask size, and the center of gravity g is located a fair distance from the center of the mask p_0. Furthermore, the intensity variation of the mask center along \vec{g p_0} is large.
Figures 3.6.b and 3.6.c demonstrate cases where the mask center is located either on an edge or in an area of uniformity. In such cases, n(p_0) fails the condition in equation (3.15). Figure 3.6.d represents an example where the pixels in the mask have a random distribution. Although the initial verification in equation (3.15) may be satisfied, the condition on a distant center of gravity in (3.18) fails. Finally, Figure 3.6.e shows an instance where the first two tests are fully fulfilled, but the intensity variation along the vector \vec{g p_0} fails to provide a sufficiently large value, condition (3.19).

Figure 3.6: Illustration of different cases for a corner.

3.6 Evaluation Metric: Repeatability Rate

Schmid et al. [85] introduce an evaluation metric to verify the performance of several feature detectors. This measurement is defined as the percentage of the total detected corners that are repeated in two images, and is called the repeatability rate. A scene point P with projection p_i in image I_i is repeated if the corresponding point p_j is detected in image I_j. In computing this rate, points that are seen in one image but, due to the existing transformation H_{ij} between the two frames, are not visible in the other one are not counted. Also, since the point p_j corresponding to p_i is usually not detected at the exact predicted position, but rather in a neighborhood of size \epsilon, the repeatability rate is defined as a function of the neighborhood size by

r(\epsilon) = \frac{ |D(\epsilon)| }{ \min(n_i, n_j) }     (3.20)

where

D(\epsilon) = \{ (p_i, p_j) \;|\; dist(p_i, H_{ij} p_j) < \epsilon \}     (3.21)

and n_i represents the number of corners in image I_i that can potentially be observed in image I_j.

3.7 Repeatability Results

In this section, the performance of the algorithm under different imaging conditions is presented. A reliable computation of repeatability in this part involves the careful registration of the image frames. For this purpose, first the registration parameters between each image pair are roughly estimated by solving the registration equations for manually found correspondences. Second, through a recursive least squares minimization, the registration parameters are refined until an error residual smaller than half a pixel is attained. Since the repeatability rate depends on the neighborhood \epsilon in Equation 3.20, for each case repeatability results are presented for two neighborhoods of 1.5 (left) and 0.5 (right) pixels.

3.7.1 Camera noise

The camera noise effect is studied by processing images of a static scene, captured under identical conditions but at different moments. The results are shown in Figure 3.7. In this experiment Harris delivers the highest robustness, followed closely by the binary method. The SUSAN method, however, does not retain a good rate, mainly due to the elimination of Gaussian smoothing. Even though the binarization of the SUSAN corner detector is an approximation and should lead to worse results, the third condition (Equation 3.19) more than compensates and provides better repeatability rates.

Figure 3.7: Repeatability results for camera noise for neighborhoods of 1.5 (left) and 0.5 (right) pixels.
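As an aside, the measurements reported in the following subsections amount to counting corners that reappear within \epsilon of their predicted positions. A minimal sketch of Equations 3.20 and 3.21 is given below; the homography H_ij and the two corner lists are assumed to be given, for example from the manual registration described above, and the function is illustrative rather than the evaluation code used for the reported results.

```python
import numpy as np

def repeatability(corners_i, corners_j, H_ij, eps):
    """Repeatability rate r(eps) = |D(eps)| / min(n_i, n_j), Equations 3.20-3.21.

    corners_i, corners_j: (N, 2) arrays of (x, y) corner positions in images I_i and I_j,
    already restricted to the part of the scene visible in both images.
    H_ij: 3x3 homography mapping points of I_j into I_i.
    """
    # Map corners of I_j into image I_i (homogeneous coordinates).
    pts = np.hstack([corners_j, np.ones((len(corners_j), 1))])
    proj = (H_ij @ pts.T).T
    proj = proj[:, :2] / proj[:, 2:3]

    # A pair (p_i, p_j) belongs to D(eps) if some detected corner p_i lies within eps of H_ij p_j.
    dists = np.linalg.norm(corners_i[:, None, :] - proj[None, :, :], axis=2)
    repeated = np.sum(dists.min(axis=0) < eps)

    return repeated / min(len(corners_i), len(corners_j))
```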
3.7.2 Rotation

The behavior of the method under rotation is studied by processing a set of images that are acquired while the camera rotates near its optical axis. Due to the limitations of the rotation mechanism, the transformations between frames are not purely rotational around y, and include translation as well. Figure 3.8 shows the repeatability results for neighborhoods of 1.5 (left) and 0.5 (right) pixels. This experiment covers rotations between 0° and 180° with steps of 10°. As the rotation angle increases, the repeatability of the Harris and binary methods becomes closer. Since in many real-time applications the rotation between two consecutive frames is small, another experiment is performed to test the methods for smaller rotations, covering a range of 0° to 10°, as shown in Figure 3.10.

Figure 3.8: Repeatability rate under rotation for neighborhoods of 1.5 (left) and 0.5 (right) pixels.

Figure 3.9: Images of a scene under different rotations.

Figure 3.10: Repeatability rate for small rotations in a neighborhood of 1.5 (left) and 0.5 (right) pixels.

3.7.3 Scale

The scale change is accomplished by moving the camera in a direction perpendicular to the image plane. Scale changes over 1.8 are not tested, due to the high sensitivity and poor performance of all the methods. Figure 3.11 represents the repeatability rate for the set of images shown in Figure 3.12. The results show that all methods perform better in a larger neighborhood (\epsilon = 1.5). This can be explained by the fact that at different scales, identical features of the scene project onto areas with different resolutions. Therefore, although corners are considered invariant to scale change, their positional precision and their repeatability are highly scale dependent.

Figure 3.11: Repeatability rate for different scales in neighborhoods of 1.5 (left) and 0.5 (right) pixels.

3.7.4 Illumination

In this part, the sensitivity of the proposed method to image intensity variation is studied.

3.7.4.1 Uniform change

Uniform illumination change is simulated in this part, due to existing limitations on changing our camera's aperture. For this purpose, first a set of images of a static scene with an identical point of view is acquired at different moments. Then, the average grey level (intensity) of each image is manually changed in steps of 10%, as shown in Figure 3.13. Figure 3.14 displays the computed repeatabilities. These graphs reveal that as the relative grey level with respect to the reference image changes, the number of repeating features decreases for all of these methods.

Figure 3.13: Images of a scene with linear illumination change.
Figure 3.14: Repeatability rate under uniform illumination change for neighborhood radii of 1.5 (left) and 0.5 (right) pixels.

3.7.4.2 Directional change in lighting

One of the important features of a good corner detector is its robustness under more complex lighting conditions. Therefore, the method's stability is evaluated under directional illumination variations in which the light source illuminates the scene from different directions. Figure 3.15 shows some of the processed images taken under such conditions. The light source is moved with a step of 10° and covers a range of 0° to 90°. As the light source moves from a direction perpendicular to the camera plane (0°), the left image in Figure 3.15, to its side (at 90°), the right image in Figure 3.15, the shadow effect becomes stronger. Such shading effects cause object boundaries to move toward the light source. This noticeably affects the repeatability, especially in small neighborhoods. Figure 3.16 shows the results of this experiment.

Figure 3.15: Images of a scene under changing illumination.

Figure 3.16: Repeatability rate for changing illumination for neighborhoods of 1.5 (left) and 0.5 (right) pixels.

It is important to note that in real situations there are interactions between different imaging conditions. Here, however, the effect of each imaging condition was studied individually. Such interactions may affect the presented results.

3.8 Discussion

As presented in the previous section, on average the repeatability of the binary method is about 20% less than Harris's and 15% more than SUSAN's. This might seem to be a problem at first, but as discussed here, for many vision-based applications, including our trajectory tracking system, it may not affect reliability. There are two important aspects that make the binary corner detector still suitable:

• If the acquired images are processed faster, the transformation between two consecutive frames becomes smaller. A smaller difference between the two frames can increase the repeatability rate. For instance, if the binary corner detector performs 60% faster than the Harris corner detector, then for a motion of 10 degrees per second the repeatability rate of the binary corner detector becomes about 6% higher than the results reported in Figure 3.8.

• One of the questions that was confronted in our previous work [80] is how many corners are needed for a reliable motion estimation. It is observed that motion parameters can be determined from a minimum of 30 to 40 corners. However, a number between 50 and 100 guarantees reliable motion estimation. For a typical outdoor image, Harris detects about 400 corners. A loss of 20% with the binary method results in about 300 corners, still plenty to provide sufficient information for a reliable motion estimation.

3.9 Computational Time

The computational complexity of all the methods is studied by comparing their running times under the same conditions.
In this study, images of 320x240 pixels are used, and the running time is averaged over one thousand executions on a 1.14 GHz AMD Athlon(tm) processor. The results of this comparison are presented in Table 3.1.

Table 3.1: Comparison of the time complexity of the three methods.

  Method    Execution time (msec)
  Binary    23.293
  Harris    37.674
  SUSAN     168.452

As shown in this table, the SUSAN algorithm has the slowest performance. At first glance, the SUSAN corner detector suggests a low computational complexity, because most of the computations and evaluations are performed on a considerably smaller selection of pixels than the total number of pixels in the image. However, the initial inspections for nominating such candidates involve a considerable number of operations, causing the SUSAN algorithm to have the slowest performance. The Harris method performs significantly faster than SUSAN; however, 60% of its time is spent on the Gaussian smoothing of each image and its corresponding squared first-order derivatives. This is in spite of the fact that all the 2D Gaussian filters, with a sigma of 0.8, are approximated by two 1D filters, and a large number of the arithmetic operations are approximated by logical operations. The binary corner detector performs 1.64 times faster than Harris, and 7.23 times faster than SUSAN. Moreover, about 70% of all the employed operations can be substituted by bitwise operations using specific hardware, such as an FPGA (Field Programmable Gate Array), for an even faster implementation.

3.10 Chapter Summary

In this chapter a novel corner detector for selecting informative pixels of an image is described. The proposed method is evaluated by computing the repeatability rate and comparing it to those of the Harris and SUSAN corner detectors. All the experiments are performed on images of outdoor scenes, including natural geometric shapes. The faster performance of this method may allow for better real-time performance of feature-based vision systems. On average, the repeatability rate of our method is about 20% less than Harris's. As mentioned earlier, the estimated motion can be computed from a small subset of the image points, usually about 5% to 10% of the image size. A decrease of 20% in that number still provides more than enough points and delivers the same motion parameters. Also, as the system performs faster, the changes between consecutive frames lessen, which can potentially lead to a higher repeatability rate.

Chapter 4

3D World Reconstruction

Systems with no prior information about a scene usually need to determine the 3D positions of features in the scene. This is a difficult problem because each projected 2D image only records a point's relative direction from the camera, not its distance. Recovering distance or depth is known as the "inverse projection problem". This chapter describes the problem of optical projection as performed by our camera system, 3D world position reconstruction for feature points, and considerations for increasing the system accuracy.

4.1 Camera Model

A camera can be simply modeled using the classic pinhole model, with its optical center C and its image plane \rho, as in Figure 4.1.
This leads to equations for calculating where on the image plane a point in space will appear. These equations are based on perspective projection. The projective transformation that projects a world point P(x, y, z) to its image point p(x_u, y_u) is

x_u = f_x \frac{x}{z}, \qquad y_u = f_y \frac{y}{z}     (4.1)

Figure 4.1: An ideal pinhole camera produces a perspective projection model of the world.

Here, f_x and f_y represent the horizontal and vertical focal lengths of the camera. Since a camera exhibits non-ideal behavior, the precise measurements from an image that are necessary in the 3D reconstruction process require a more sophisticated camera model than the ideal one. In other words, a simple perspective projection is not enough for locating world points in the image.

4.1.1 Camera calibration

Camera calibration is the process of finding the internal and external camera system parameters that affect the imaging process. A camera model consists of extrinsic and intrinsic parameters. Extrinsic parameters are those that are not a function of the construction of the camera but of its position in the environment. Intrinsic parameters, on the other hand, are physical parameters related to the construction of the camera and the CCD itself. The camera model transforms real world coordinates into image coordinates and vice versa. Figure 4.2 shows the geometry of the camera calibration problem.

Figure 4.2: Camera geometry with perspective projection and lens distortion.

There are four steps in transforming from a 3D world coordinate to a camera coordinate:

I. Transformation from the object world coordinates to the camera 3D coordinate system. (x_w, y_w, z_w) is the coordinate of the object point P in a known 3D world coordinate system, and (x, y, z) is the same point in the camera coordinate system. This rigid body transformation can be represented by rotation and translation operations:

[x, y, z]^T = R \, [x_w, y_w, z_w]^T + T     (4.2)

where R is a 3x3 matrix and T is a 3x1 matrix (the camera's extrinsic parameters).

II. Transformation from the 3D camera coordinate (x, y, z) to the ideal (undistorted) image coordinate (x_u, y_u), the perspective projection in Equation 4.1.

III. Transformation from the ideal image plane to the distorted image plane, radial lens distortion:

x_d + D_x = x_u, \qquad y_d + D_y = y_u     (4.3)

(x_d, y_d) represents the distorted image coordinates and (D_x, D_y) represent their distortions:

D_x = x_d (k_1 r^2 + k_2 r^4 + \dots), \qquad D_y = y_d (k_1 r^2 + k_2 r^4 + \dots), \qquad r = \sqrt{x_d^2 + y_d^2}     (4.4)

D_x and D_y can be approximated by their first terms:

D_x \approx x_d (k_1 r^2), \qquad D_y \approx y_d (k_1 r^2)     (4.5)

From these equations, the distortion can be seen to increase when moving from the center of the image toward its edges.

IV. Transformation from the distorted image plane coordinates to the frame buffer coordinates (x_f, y_f), the row and column offsets in a buffer in the computer memory:

x_f = S_x \, d_x'^{-1} \, x_d + C_x, \qquad y_f = d_y^{-1} \, y_d + C_y, \qquad d_x' = d_x \frac{N_{cx}}{N_{fx}}     (4.6)

Where:
(x_f, y_f) : row and column of the image pixel in the computer frame memory.
(C_x, C_y) : row and column of the center of the computer frame memory.
d_x : center-to-center distance between adjacent CCD sensor elements in the x direction.
d_y : center-to-center distance between adjacent CCD sensor elements in the y direction.
N_cx : number of sensor elements in the x direction.
N_fx : number of pixels in a line as sampled by the computer.
S_x : the uncertainty image scale factor to be calibrated. This additional uncertainty factor is introduced due to a variety of factors, such as a slight hardware timing mismatch between the image acquisition hardware and the camera scanning hardware.
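For illustration only, the following sketch maps a frame-buffer pixel to its undistorted image coordinates using the first-order model above. The function and parameter names are ours, the exact form of Equation 4.6 is the reconstruction given above, and the actual parameter values would come from the manufacturer's calibration described next.

```python
def undistort_point(xf, yf, Cx, Cy, sx, dxp, dy, k1):
    """Map a frame-buffer pixel (xf, yf) to undistorted image coordinates (xu, yu).

    A sketch of inverting Equation 4.6 and applying Equations 4.3 and 4.5 with the
    first-order radial term only; dxp stands for d'_x = d_x * N_cx / N_fx.
    """
    # Equation 4.6 inverted: frame-buffer coordinates -> distorted image-plane coordinates.
    xd = (xf - Cx) * dxp / sx
    yd = (yf - Cy) * dy

    # Equations 4.3 and 4.5: x_u = x_d + D_x with D_x ~ x_d * k1 * r^2 (same for y).
    r2 = xd * xd + yd * yd
    return xd * (1.0 + k1 * r2), yd * (1.0 + k1 * r2)
```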
The intrinsic parameters to be found are the focal lengths (f_x, f_y), the computer image center (C_x, C_y), the lens distortion coefficient (k_1), and the uncertainty factor S_x. Details of the method for estimating the intrinsic parameters can be found in [41,104]. Once the camera's intrinsic parameters are found, the positional relationship between the undistorted and distorted images is established. The intrinsic parameters for our camera system are estimated by the manufacturer of the unit, Point Grey Research [45].

4.1.2 Image unwarping

Once the camera intrinsic parameters are found, they can be used to transform distorted image coordinates to their ideal, undistorted coordinates. This procedure is usually called the "unwarping" process and is performed using a lookup table that transfers each pixel of the distorted image onto its location in the corresponding undistorted image. Figure 4.3-a shows an image, acquired by our camera system, that has a 104° field of view. In this image the distortion effect is most noticeable on the curved bookshelves. Figure 4.3-b shows the same image after removal of the lens distortion. It can be clearly seen that the curved shelves in the original image are now straightened.

Figure 4.3: a- A warped image. b- The corresponding cut unwarped image.

As explained in more detail in the next section, reconstruction of the world through the stereo process is based on the underlying assumption of an ideal pinhole camera model. Therefore, camera calibration and image unwarping are central issues for stereo-based vision systems. In fact, each time a set of images is acquired, the images must be corrected for radial lens distortion if they are to be used in the stereo vision system.

4.2 Stereo Vision Geometry

Stereo geometry is based on the fact that two spatially separated cameras (or eyes) view a scene from slightly different angles. The resulting shift in the projected images encodes the depth of objects in the scene. The stereo geometry is shown in Figure 4.4.

Figure 4.4: Depth reconstruction using stereo geometry.

The distance z of a point O (viewed in both stereo images) from the camera image plane can be estimated via triangulation, using the similar triangles in Figure 4.4 and an intermediate variable k for the lateral offset of O from the optical axis of the first camera:

\triangle OLP \sim \triangle PC_1S \;\Rightarrow\; \frac{d_1}{f} = \frac{k}{z}     (4.7)

\triangle OLQ \sim \triangle QC_2T \;\Rightarrow\; \frac{d_2}{f} = \frac{k + B}{z}     (4.8)

Here z is the distance of the object point O from the image planes, C_1 and C_2 are the camera centers, \rho is the image plane, f and B stand for the focal length and the displacement (baseline) between the stereo cameras, and d_1 and d_2 represent the point's image displacements from the origins of the first and second image centers. Substituting k from 4.7 into 4.8 yields the main stereo formulation:

z = \frac{f B}{d_2 - d_1}     (4.9)

Equation 4.9 shows that finding the disparity difference (d_2 - d_1), the positional change of one point between the two images, is sufficient for reconstructing the point's depth, z. Therefore, in stereo processing one of the important factors that affects the reliability of depth reconstruction is the baseline selection.
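As a small illustration of Equation 4.9 and the inverse of the projection in Equation 4.1, the sketch below reconstructs a feature's 3D position from a rectified pair. The 10 cm baseline is only an example value borrowed from the camera description in Section 4.2.3, and the function is not the system's actual reconstruction code.

```python
def triangulate(x_ref, y_ref, x_other, f, B=0.10):
    """Reconstruct the 3D position of a feature from a rectified stereo pair.

    x_ref, y_ref : undistorted image coordinates of the feature in the reference image.
    x_other      : undistorted x coordinate of the matched feature in the second image.
    f            : focal length (pixels); B : baseline (metres, example value).
    """
    disparity = x_other - x_ref          # (d2 - d1) in Equation 4.9
    if disparity <= 0:
        raise ValueError("non-positive disparity: no valid depth for this match")
    z = f * B / disparity                # Equation 4.9
    x = x_ref * z / f                    # invert the perspective projection (Equation 4.1)
    y = y_ref * z / f
    return x, y, z
```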
One of the assumptions made implicitly when describing the stereo geometry is that the two stereo images are coplanar. In reality, this assumption may not be valid, and therefore extra effort is required to make the two images coplanar. This assumption is called the "epipolar constraint", and the process involved in creating such a condition is known as the "rectification procedure". Details of these processes are described next.

4.2.1 Epipolar constraint

Given a known imaging geometry, the epipolar constraint in stereo images defines a line in one image along which a match can be found. Consider a stereo vision system with two pinhole camera models with optical centers C_1 and C_2 (see Figure 4.5). A world point P projects to P_r and P_l, respectively. E_1 and E_2 are the epipoles through which all epipolar lines pass, and D E_1 and D E_2 are the epipolar lines on which P_r and P_l are bound to lie. P_1, P_2 and P_3 are points in the scene that may project onto the point P_r in the right image, and points P_{l1}, P_{l2} and P_{l3} are the projections of points P_1, P_2 and P_3 in the left image. The epipolar constraint forces the points P_{l1}, P_{l2} and P_{l3} to lie along a straight line. The implication of the epipolar constraint is that the search for a matching point in one image needs to be made only along one line in the other image.

Figure 4.5: Epipolar constraint.

4.2.1.1 Rectification procedure

As seen for the epipolar constraint, given several images and a point in one of them, the corresponding points in the other images are bound to lie on epipolar lines. These epipolar lines would be parallel if and only if all the image planes were parallel. In general, the epipolar lines form a pencil of lines going through an epipolar center. The epipolar center in each image is the image of the other camera's center in that image plane. If the image planes are coplanar and lie parallel to the vector C_1 C_2 defined by the optical centers, then the epipolar centers are projected to infinity and the epipolar lines form a pencil of parallel lines. Ayache and Hansen [4] show that it is always possible to apply a linear transformation to each image to obtain "conjugated" horizontal epipolar lines, e.g. the two lines E_2 D and E_1 D in Figure 4.6. Rectification is one of the important processes in stereo vision, and it allows potential matches between two or more images to satisfy simpler relations, further allowing for simpler and more efficient matching algorithms. Positional corrections for the rectification process are also performed by Point Grey Research [45]. Such corrections are added to the displacement parameters D_x and D_y in Equation 4.5. Therefore, once the unwarping lookup tables are created, they correct for the rectification distortion as well. As discussed, for finding a match a search along the epipolar line is sufficient, and since the epipolar lines are aligned with an image axis after rectification, a search in one column or row is sufficient.

Figure 4.6: Rectification of two images.

A reliable 3D reconstruction of a physical point in space depends heavily on the correct identification of its match correspondences in the stereo images.
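To illustrate what rectification buys, the sketch below searches for a match only along the corresponding image row, scores candidates with normalized SSD, and applies the left-right validity check described later in Section 4.2.2.2. The window size and the 90-pixel search range are example values (the latter borrowed from the disparity constraint in Section 4.2.3), and the code is illustrative rather than the thesis implementation.

```python
import numpy as np

def nssd(a, b):
    """Normalized sum of squared differences between two equally sized patches."""
    a = (a - a.mean()) / (a.std() + 1e-9)
    b = (b - b.mean()) / (b.std() + 1e-9)
    return float(np.sum((a - b) ** 2))

def best_match_on_row(src, dst, x, y, win=5, max_disp=90):
    """Search the same row of 'dst' for the patch of 'src' centred at (x, y).

    Assumes (x, y) lies at least win//2 pixels away from the image border.
    """
    h = win // 2
    patch = src[y - h:y + h + 1, x - h:x + h + 1]
    lo, hi = max(h, x - max_disp), min(dst.shape[1] - h - 1, x + max_disp)
    scores = [(nssd(patch, dst[y - h:y + h + 1, xc - h:xc + h + 1]), xc)
              for xc in range(lo, hi + 1)]
    return min(scores)[1]

def validated_match(ref, other, x, y, win=5, max_disp=90):
    """Keep a match only if matching back from the other image returns the original point."""
    xc = best_match_on_row(ref, other, x, y, win, max_disp)
    return xc if best_match_on_row(other, ref, xc, y, win, max_disp) == x else None
```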
4.2.2 Stereo correspondence matching process

As discussed in Section 4.2, the essence of the stereo algorithm is in finding two corresponding points in the stereo images such that the two points are the projections of an identical physical point. Equation (4.9) also indicated that the depth is proportional to the baseline length, B; accordingly, the estimated distance is more accurate with two more distant cameras. Therefore, in stereo processing one of the important factors that affect the reliability of the depth reconstruction is the baseline selection. A short baseline means that the estimated distance is less precise, due to a narrow triangulation, but it also means that the two images have more similarities and consequently a higher correspondence rate; whereas a longer baseline increases the accuracy by creating a larger range of disparity. This nevertheless results in a more expensive search during the matching process, as well as an increase in the number of false correspondences. Thus, there is a tradeoff between accuracy and cost.

4.2.2.1 Multiple baseline

Okutomi and Kanade [48,74] introduce a stereo matching algorithm that uses multiple stereo image pairs with different baselines. These multiple baselines are generated by displacement of one of the cameras, as illustrated in Figure 4.7.

Figure 4.7: Camera positions for multiple-baseline stereo.

The image intensity functions f_0(x) and f_i(x) near the matching positions can be modeled as

f_0(x) = f(x) + n_0(x), \qquad f_i(x) = f(x - d_{r(i)}) + n_i(x)     (4.10)

where d_{r(i)} represents the real disparity for point x in the image pair P_0 and P_i. With the assumption of constant distance near x and independent Gaussian white noise such that

n_0(x), n_i(x) \sim N(0, \sigma_n^2)     (4.11)

the SSD function e_{d(i)}, over a window W and at pixel position x of the image f_0(x), for the candidate disparity d_{(i)} is defined as

e_{d(i)}(x, d_{(i)}) = \sum_{j \in W} \left( f_0(x + j) - f_i(x + d_{(i)} + j) \right)^2     (4.12)

By introducing

\zeta = \frac{1}{z}, \qquad d_{r(i)} = B_i F \zeta_r, \qquad d_{(i)} = B_i F \zeta     (4.13)

the SSD with respect to the inverse distance can be computed from

e_{\zeta(i)}(x, \zeta) = \sum_{j \in W} \left( f_0(x + j) - f_i(x + B_i F \zeta + j) \right)^2     (4.14)

where \zeta_r and \zeta are the real and candidate inverse distances. The expected value of the function e_{\zeta(i)}(x, \zeta) is then computed from

E[ e_{\zeta(i)}(x, \zeta) ] = \sum_{j \in W} \left( f(x + j) - f(x + B_i F (\zeta - \zeta_r) + j) \right)^2 + 2 N_w \sigma_n^2     (4.15)

where N_w is the number of points within the window W. An integration of each e_{\zeta(i)} creates an evaluation function e_{\zeta(12 \dots n)}(x, \zeta), called the SSSD, with an expected value as follows:

e_{\zeta(12 \dots n)}(x, \zeta) = \sum_{i=1}^{n} e_{\zeta(i)}(x, \zeta)     (4.16)

E[ e_{\zeta(12 \dots n)}(x, \zeta) ] = \sum_{i=1}^{n} \sum_{j \in W} \left( f(x + j) - f(x + B_i F (\zeta - \zeta_r) + j) \right)^2 + 2 n N_w \sigma_n^2     (4.17)

This new function has two interesting characteristics that make it beneficial in depth reconstruction and motivated its use in this work.

1. Elimination of the ambiguity. If the intensity pattern f(x) has the same pattern around x and x + a, then

E[ e_{\zeta(i)}(x, \zeta_r) ] = E\left[ e_{\zeta(i)}\left(x, \zeta_r + \frac{a}{B_i F}\right) \right] = 2 N_w \sigma_n^2     (4.18)

Equation (4.18) shows that the expected SSD value is minimum for the correct inverse distance \zeta_r as well as for the false matches \zeta_f = \zeta_r + a / (B_i F). However, an important point to be observed is that the position of this false minimum depends upon the baseline B_i, whereas the minimum for the correct match does not.
Therefore, having more stereo camera pairs and combining their information can resolve the inherent ambiguity in finding correspondences in periodic patterns.

2. Precision. This method not only resolves the ambiguity for periodic and non-periodic patterns, but it also improves the precision of the estimated distance. If \sigma_n^2 represents the variance of the white Gaussian image noise, then the variance of an estimate \zeta_{r(i)} is computed by

Var(\zeta_{r(i)}) = \frac{ 2 \sigma_n^2 }{ B_i^2 F^2 \sum_{j \in W} f'(x + j)^2 }     (4.19)

The variance of the estimated inverse distance \zeta_{r(12 \dots n)} that minimizes the SSSD in inverse distance is then defined by

Var(\zeta_{r(12 \dots n)}) = \frac{ 2 \sigma_n^2 }{ F^2 \sum_{i=1}^{n} B_i^2 \sum_{j \in W} f'(x + j)^2 } = \left( \sum_{i=1}^{n} \frac{1}{ Var(\zeta_{r(i)}) } \right)^{-1}     (4.20)

Equation (4.20) demonstrates that using the SSSD in inverse distance with multiple baseline pairs improves the accuracy of the estimated depth.

4.2.2.2 Matching through correlation with validation

In this method, by Fua [32], the similarity scores for every feature point in the image are computed by measuring the NSSD (Normalized Sum of Squared Differences) for all the match candidates along the epipolar line. The best match is then selected based on the similarity metric value. In this approach, the probability of a mismatch decreases as the size of the correlation window and the texture density of the area inside the window increase. Using a large window is not a practical solution, as it slows down the system. In order to improve the reliability, a validation is performed in which the two images play similar roles. Figure 4.8 gives a graphical presentation of the validity check for the stereo matching process. Given a point P_1 in I_1, let P_2 be the point of I_2, located on the epipolar line corresponding to P_1, such that the windows centered on P_1 and P_2 yield an optimal correlation measure. The match is valid if and only if P_1 is also the point that maximizes the scores when correlating the window centered on P_2 with windows that shift along the epipolar line of I_1 corresponding to P_2. This method was also used in this work for matching features that belong to areas that are common between either the reference and horizontal, or the reference and vertical, images.

Figure 4.8: Consistent vs. inconsistent matches can be identified by the validity check.

4.2.3 Stereo correspondence matching rules

Our camera system, Digiclops [45], includes 3 stereo cameras that are vertically and horizontally aligned. The displacement between the reference camera and each of the horizontal and vertical cameras is 10 centimeters. Due to the orthogonal arrangement of the stereo cameras and their identical baseline values, an improvement of depth accuracy by incorporating longer baselines is not really possible. However, to take full advantage of the existing features in the three stereo images, the following constraints are employed in the stereo matching process:

• Feature stability constraint I: For each feature in the reference image that is located in the regions common to the three stereo images, there should be two correspondences; otherwise the feature is rejected due to this instability condition. The 3D locations of the features that pass this constraint are estimated by the multiple baseline method [74].
• Feature stability constraint II: Features located in the areas common to only the reference and horizontal or to the reference and vertical images are reconstructed if they pass the validity check of Fua's method [32].

• Disparity constraint: The disparities of each feature from the vertical and horizontal images to the reference image have to be positive, similar (with a maximum difference of 1 pixel), and smaller than 90 pixels. This constraint allows the reconstruction of points as close as 12.5 cm from the camera for our system configuration.

• Epipolar constraint: The vertical disparity between the matched features in the horizontal and reference images must be within 1 pixel. The same rule applies to the horizontal disparity for matches between the vertical and reference images.

• Match uniqueness constraint: If a feature has more than one match candidate that satisfies all the above conditions, it is considered ambiguous and is omitted from the rest of the process.

The similarities between each feature and its corresponding candidates are measured by employing the Normalized Sum of Squared Differences metric [32]. After matching the features, a subset of features from the reference image is retained, and for each one, its 3D location with respect to the current camera coordinate system is obtained using Equation 4.9.

4.2.4 Depth construction inaccuracy sources

While the previous constraints decrease the chance of false matches and so construct a more reliable depth value, the accuracy of the 3D reconstruction is still limited by the accuracy of the unwarped images. Although the unwarping process is a necessary step in the stereo process, it is noticeable that for a camera with a wide field of view, the unwarped image includes some blurriness that increases when moving from the center toward the sides of the image. This effect is more noticeable and disturbing when the scene consists of objects with finer details, as in Figure 4.9.

Figure 4.9: Radial lens distortion removal can change the shape of objects of smaller size. a) A warped (raw) image. b) The corresponding unwarped image.

Lens distortion elimination through the unwarping process is the procedure of projecting distorted image pixels onto their undistorted positions. As explained in Section 4.1.2, the unwarping process is performed using the intrinsic parameters of the camera. During this process the following occurs, as shown in Figures 4.10-1 to 4.10-4:

I. The image coordinates of each pixel, integer values, are transformed into the corresponding undistorted image coordinates, floating point values, Figure 4.10-2.

II. Since not all the pixels in the unwarped image can be defined from the warped image, an interpolation scheme is used to reconstruct the image values on an integer, equally spaced mesh grid, Figures 4.10-2 and 4.10-3.

III. The resultant image is cut to the size of the original raw image, Figure 4.10-4.

Figure 4.10: The raw image (1). The raw image after the calibration (2). The calibrated image after the interpolation (3). The final cut unwarped image (4).

For our camera system, 28.8% of the unwarped image pixels have no correspondences in the warped image, and therefore are merely guessed at by the interpolation of the neighboring pixels. This creates considerable distortion of the shape of smaller objects located near the sides.
This interpolation-induced blurring also increases the inaccuracy of the 3D world reconstruction, which in turn influences the overall accuracy of the motion tracking system. The effect of this distortion on the estimated 3D motion is examined by measuring the same motion using identical features located at different regions of the image plane. Figures 4.11 a-1, b-1 and c-1 show the raw (warped) images of a planar subject with small details at three different locations in the scene, and Figures 4.11 a-2, b-2 and c-2 show the corresponding unwarped images. The subject plane is placed at a distance of 20cm from the camera plane. For each case, a motion of 8cm in the direction perpendicular to the image plane is created. The 3D motion is then estimated using two approaches:

Figure 4.11: Warped (a-1, b-1, c-1) and unwarped (a-2, b-2, c-2) images of a plane at three different positions in the image.

• In the first approach, feature points are found in the warped (raw) images. Then, using the calibration equations in Section 4.1.1, the exact locations of the feature points in the unwarped images are estimated. These values are used for finding the disparity values and hence for estimating the 3D locations of the feature points.

• In the second approach, feature points are found in the unwarped images and are projected into the world using the stereo algorithm, which uses their interpolated, unwarped positions.

The motion in each case is then estimated using the displacement between these world points in two consecutive frames. Table 4.1 presents the results of this experiment. As illustrated in Table 4.1, using warped images instead of unwarped ones may improve trajectory estimation results. Eliminating the unwarping process can also reduce the execution time of the system, since when unwarped images are used, the three new images acquired at each frame capture must all be unwarped.

Table 4.1: 8.0 cm motion estimation using unwarped and warped images.

Image Type      a      b      c
Warped (-1)     7.98   7.97   7.99
Unwarped (-2)   7.18   8.01   7.98

Another undesirable effect of the unwarping process is the elimination of some raw image regions. Generally, image pixels located relatively far from the camera, with projections close to the image center, behave very similarly under small rotations and under lateral translations. As these points move farther from the image center, their rotational behavior becomes more distinguishable from their translational behavior. Therefore, by cutting the edges of the unwarped image, a number of features that may carry crucial information about the nature of the motion are eliminated. This can be another source of inaccuracy in the estimated motion. Figure 4.12 illustrates an example where a number of far away features are eliminated after the size adjustment.

Figure 4.12: Some important features are eliminated by the unwarping process.

4.2.5 Depth construction with higher accuracy

For all of the above reasons, a combination of raw and unwarped images is employed. Features are detected only on the raw images, which have sharper detail. The image coordinates of each feature are then unwarped using unwarping lookup tables, and these unwarped locations are used for constructing the 3D positions of the features.
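A minimal sketch of this partial-unwarping idea is shown below, again in illustrative Python/NumPy with the same assumed radial model as in the previous sketch; the table layout and function names are hypothetical. The point is that only the detected feature coordinates are mapped to their undistorted positions, while the image content itself is never resampled.

```python
import numpy as np

def build_unwarp_lut(h, w, fx, fy, cx, cy, k1, k2):
    """Precompute, once per camera, the undistorted floating-point coordinates
    of every integer raw pixel (an unwarping lookup table)."""
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    xn, yn = (u - cx) / fx, (v - cy) / fy
    r2 = xn * xn + yn * yn
    s = 1.0 + k1 * r2 + k2 * r2 * r2          # assumed radial model
    return xn * s * fx + cx, yn * s * fy + cy  # map_u[row, col], map_v[row, col]

def unwarp_feature_coords(features_rc, map_u, map_v):
    """Map integer (row, col) feature locations, detected in the raw image,
    to sub-pixel undistorted coordinates used for disparity and triangulation."""
    rows, cols = features_rc[:, 0], features_rc[:, 1]
    return np.column_stack([map_u[rows, cols], map_v[rows, cols]])
```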
However, when measuring the similarity of features, raw locations w i t h warped image content are used. Performing a p a r t i a l unwarping procedure for a small percentage of each image also improves the running time of the system. T h e 3 D reconstruction of the feature points can be summarized as having the following steps:  1. A projection lookup table implementation using intrinsic camera parameters for raw image projection onto the unwarped image and vice versa. 2. Detection of features i n raw images. 3. D i s p a r i t y measurement i n raw images using the projection l o o k u p table. 4. 3 D reconstruction of image features using the constraints i n 4.2.2.  A n o t h e r gained advantage on e m p l o y i n g the above steps i n 3 D reconstruction process is the e l i m i n a t i o n of the sub-pixel estimator when c o m p u t i n g the disparity values; T h i s is due to the fact that the exact position of each corner is now found (in the undistorted images) w i t h an accuracy of 0.01 pixel.  4.3  Chapter Summary  In this chapter, the camera c a l i b r a t i o n process as well as the stereo a l g o r i t h m for the inverse perspective projection problem are explained. It is explained that camera calibration is an essential process i n the 3 D reconstruction problem. However, the existing approximation  4.3 Chapter  Summary  65  performed i n the reconstruction of calibrated images can degrade the accuracy of the overall system.  T o remove this undesirable effect, the reconstruction of the calibrated images is  eliminated. Instead, a calibration m a p for each one of the cameras is constructed that defines the relationship between the integer p o s i t i o n of uncalibrated pixels w i t h corresponding floating point locations on the calibrated image. T h r o u g h this process, three m a i n advantages axe gained. F i r s t , the elimination of the resampling process improves the running time, as previously, for each time three images needed to be unwarped. Second, a sharper image is processed that improves the number of detected features. For wide angle lenses, such as the one we use, the b l u r r i n g effect on the sides is so disruptive that it degrades the feature detection process i n those areas. T h i r d , this process increases the accuracy i n found disparities. Therefore, a more reliable world reconstxuction is achieved, resulting i n moxe accuxate m o t i o n estimation.  Chapter 5 Tracking and Motion Estimation T h e measurement of local displacement between the 2 D projection of similar features on consecutive image frames is the basis for measuring the 3 D camera m o t i o n i n the world coordinate system. Therefore, world and image features must be tracked from one frame (at t i m e = £ ) to the next frame (at t i m e = t +  At).  In order to take advantage of a l l the information acquired by the camera while navigating i n the environment, a database is created. the features seen since the first frame.  T h i s database includes information about a l l  For each feature, the 3 D l o c a t i o n i n the reference  coordinate system and the number of times it has been observed are recorded. E a c h database entry also holds a 3 x 3 covariance m a t r i x that represents the uncertainty associated w i t h the 3 D location of each feature.  T h e i n i t i a l camera frame is used as the reference coordinate  system, and a l l the features are represented i n relation to this frame. 
After the world features are reconstructed using the first frame, the reference world features, as well as the robot's starting position, are initialized. By processing the next frame, in which a new set of world features is created corresponding to the current robot position, new entries are created in the database. This database is updated as the robot navigates in the environment.

Since corner features are simply points, a matching process is needed to find corresponding features in subsequent frames. The accuracy of this match correspondence finding process has a crucial impact on the accuracy of the system's performance.

5.1 Similarity Measurement

In the feature matching process, a feature is selected in one image frame and a similar feature is sought in the following image frames based upon a similarity measurement criterion. There are many different similarity measurement techniques in the literature, and the suitability of each depends highly on the application in which it is used [17,51,65]. To measure the similarity of a feature to a set of correspondence candidates in this work, a normalized cross-correlation metric [32] is employed. Each feature and its candidates are first projected onto their corresponding image planes, I_1(x, y) and I_2(x, y). The similarity score, Equation 5.1, is then estimated for each pair:

C(I_1, I_2) = \frac{\sum_{x=-M/2}^{M/2} \sum_{y=-M/2}^{M/2} \left( (I_1(x,y) - \bar{I}_1) - (I_2(x,y) - \bar{I}_2) \right)^2}
{\sqrt{\sum_{x=-M/2}^{M/2} \sum_{y=-M/2}^{M/2} (I_1(x,y) - \bar{I}_1)^2 \; \sum_{x=-M/2}^{M/2} \sum_{y=-M/2}^{M/2} (I_2(x,y) - \bar{I}_2)^2}}   (5.1)

Here, \bar{I}_1 and \bar{I}_2 are average gray levels over image patches of I_1 and I_2 with dimensions of M x M. After evaluating the similarity metric for all pairs, the candidate with the highest similarity (the best score under this metric) is selected.

The highest similarity, by itself, does not provide enough assurance of a true match. Since the patch sizes are fairly small, there may be cases where a feature (at time = t) and its match correspondence (at time = t + Δt) do not correspond to an identical feature in space. In order to eliminate such falsely matched pairs, a validity check is performed. In this check, after finding the best match for a feature, the roles of the match and the feature are exchanged. All the candidates for the match are found in the previous frame (at time = t), the similarity metric is evaluated for all candidate pairs, and the most similar pair is chosen. If this pair is exactly the same as the one found before, the pair is declared a true match correspondence. Otherwise, the corner under inspection is eliminated from the rest of the process. Figure 5.1 gives a graphical overview of the match correspondence validity check.

Figure 5.1: Graphical overview of the validity check in the feature tracking process. The valid match candidates are shown with the solid line.

The effect of the validity check on the number of correct match correspondences for two consecutive outdoor images is shown in Figure 5.2. Figure 5.2 a and b show the match correspondences without the validity check.
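Before turning to the validated matches shown in Figure 5.2 c and d, the sketch below illustrates the matcher just described: the normalized score of Equation 5.1 combined with the bidirectional (feature-to-match and match-back-to-feature) validity check. It is a simplified Python/NumPy illustration with hypothetical helper names and data layout, not the thesis implementation.

```python
import numpy as np

def nssd(p1, p2):
    """Normalized sum of squared differences between two equal-size gray-level
    patches (Equation 5.1); lower scores mean more similar patches."""
    d1, d2 = p1 - p1.mean(), p2 - p2.mean()
    denom = np.sqrt((d1 ** 2).sum() * (d2 ** 2).sum()) + 1e-12  # avoid divide-by-zero
    return ((d1 - d2) ** 2).sum() / denom

def best_match(patch, candidates):
    """Index of the most similar candidate patch under the score above."""
    return int(np.argmin([nssd(patch, c) for c in candidates]))

def validated_match(feat_patch, cand_patches, prev_patches, feat_index):
    """Bidirectional validity check: the winning candidate, matched back
    against the patches of the previous frame, must pick the original feature;
    otherwise the pair is rejected (returns None)."""
    fwd = best_match(feat_patch, cand_patches)
    back = best_match(cand_patches[fwd], prev_patches)
    return fwd if back == feat_index else None
```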
Figure 5.2 c, and d display the results of the matching process for the same images in the presence of a validity check.  Clearly, the  number of false matches are reduced after the validity check.  5.2  Feature Tracking  The objective of the feature matching process is to find and to match the correspondences of a feature in the 3D world on two consecutive image planes (the current and the previous frames) of the reference camera . At all times, a copy of the previous image frame is maintained. Therefore, database feature points in the reference coordinate system are transformed to the  5.2 Feature Tracking  69  c  d  Figure 5.2: Validity checking reduces the number of false match correspondences. In this figure feature points are shown in white dots. Matched features in the first frame with no validity check are shown in circles (a). Match features in the second frame are shown with arrows that connect their previous positions into their current positions (b). Matched features in the first frame with validity check are shown in circles (c). Match features with validity check in the second frame are shown with arrows that connect their previous positions into their current positions (d). last found robot (camera) position. They are then projected, if their projection fall inside the image plane, onto the previous image plane. With two sets of feature points, one in the previous image frame and one in the current image frame, the goal becomes to establish a one by one correspondence between the members of both sets. The matching and tracking  5.2 Feature Tracking  70  process is performed using a two-stage scheme. I. Position of each feature in the previous frame is used to create a search boundary for corresponding match candidates in the current frame. For this purpose it is assumed that the motion of featuresfromprevious frame to the current frame do not have image projection displacements more than 70 (w) pixels in all four directions. This means that if a feature is located in row n and column m of the previous imageframe,its match candidates are sought in a window that expands horizontallyfromn —w to n+w and verticallyfromm — w to m + w on the currentframe.The value of w is highly dependent on the lens type, the average distance of the camera from the scene, and the speed of the mobile robot. For the current system, a search window of 141x141 seemed to be appropriate. Therefore, all the feature points in the current image frame that fall inside these boundaries are chosen as match candidates. If a feature does not have any correspondences, it cannot be used at this stage and therefore is ignored until the end of first stage of the tracking. The normalized cross-correlation with validity check, as explained in Section 5.1, is then evaluated over windows of 13x13 pixels to select the most similar candidate. Using the established match correspondences between the twoframes,the motion of the camera is estimated. Due to the large search window, and therefore, a large number of match candidates, some of these match correspondences may be chosen falsely. This influences the accuracy of the motion process. In order to eliminate inaccuracy due to faulty matches, the estimated motion is used as an initial guess for the amount and direction of the motion to facilitate a more precise motion estimation in the next stage. II. Then, using the found motion vector and the previous robot location, all the database features are transformed into the current camera coordinate system. 
Regardless of the motion type or the distance of the features from the coordinate center, features with a persistent 3D location end up on a very close neighborhood to their real matches on the current image plane. Using a small search window of 4x4, the normalized  5.3 Motion Estimation  71  cross-correlation and the image intensity values i n the previous and current frames, best match correspondences are found quickly w i t h higher accuracy. T h e new set of correspondences are used to estimate a m o t i o n correction vector that is added to the previous estimated m o t i o n vector to provide the final camera m o t i o n .  Figure 5.3 a and 5.3 b, show match correspondences on the two frames for the first step and Figure 5.3 c and 5.3 d, show the matches using the i n i t i a l m o t i o n estimation from the first step. N o t only does the number of false matches decrease when a rough m o t i o n estimate is used, but the total number of matches increases dramatically. These matches are missed i n the i n i t i a l stage, due to the large number of match candidates w i t h i n the large search window.  5.3  Motion Estimation  G i v e n a set of corresponding features between a pair of consecutive images, m o t i o n estimation becomes the problem of o p t i m i z i n g a 3 D transformation that projects the world corners, from the previous image coordinate system, onto the next image. A l t h o u g h the 3 D construction of 2 D features is a nonlinear transformation, the problem of m o t i o n estimation is still well behaved.  T h i s reasonable behavior is because any 3 D m o t i o n includes rotations and  translations [63].  • Rotations are functions of the sine and cosine of the rotation angles.  W i t h the as-  sumption of small rotational changes between each two frames, sine and cosine can be linearly related to the angle. • Translation toward or away from the camera introduces a perspective distortion as a function of the inverse of the distance from the camera. • Translation parallel to the image plane is almost linear.  5.3 Motion Estimation  72  Figure 5.3: The tracking is performed in two steps. Therefore, the problem of 3D motion estimation is a promising candidate for the application of Newton's minimization method, which is based on the assumption of local linearity.  5.3.1  Least-squares minimization  The objective of the Newton method is to find the camera motion that brings each projected 3D feature into best alignment with its observed matching feature. Rather than solving this directly for the camera motion with 6 DoF, Newton's method iteratively estimates a vector of corrections x, that if subtracted from the current estimate, resulting in the new  73  5.3 Motion Estimation  estimate [64]. If P W is the vector of parameters for iteration i, then  P  (i+i)  =  P  d)_  (5.2)  x  G i v e n a vector of error measurements between the projection of 3 D world features on two consecutive image frames, e, a vector x is computed that eliminates (minimizes) this error [63].  Jx = e  (5.3)  T h e effect of each correction vector element, Xj, on error measurement e, is shown i n E q u a tion (5.4), where  (5.4)  Here e* is the error vector between the predicted location of the object and the actual position of the m a t c h found i n image coordinates, n represents the number of matched features. Since E q u a t i o n (5.3) is usually over-determined, x is estimated to m i n i m i z e the error residual [34].  
\min_x \| Jx - e \|^2     (5.5)

x = [J^T J]^{-1} J^T e     (5.6)

Therefore, in each iteration of Newton's method, Equation (5.6) is solved for x using LU decomposition [77]. The most computationally expensive aspect of implementing Newton's method is calculating the partial derivatives. The partial derivatives with respect to the translation parameters can be calculated most effectively by first reparameterizing the projection equations to express translations in terms of each camera coordinate system [62].

5.3.2 Setting up equations

If the motion parameter vector is (D_x, D_y, D_z, \phi_x, \phi_y, \phi_z), then the new image location of a rotated and translated point (x, y, z) is

u = \frac{f(x + D_x)}{z + D_z}, \qquad v = \frac{f(y + D_y)}{z + D_z}     (5.7)

Here D_x, D_y and D_z represent the incremental translations, while \phi_x, \phi_y and \phi_z are rotational increments around the x, y and z axes. The partial derivatives in rows 2n and 2n+1 of the Jacobian matrix of Equation (5.3), which correspond to the n'th matched feature, can be calculated by:

J_{2n,1:6} = \left[ \frac{\partial u}{\partial D_x}, \frac{\partial u}{\partial D_y}, \frac{\partial u}{\partial D_z}, \frac{\partial u}{\partial \phi_x}, \frac{\partial u}{\partial \phi_y}, \frac{\partial u}{\partial \phi_z} \right], \quad
\frac{\partial u}{\partial D_x} = \frac{f}{z + D_z}, \quad
\frac{\partial u}{\partial D_y} = 0, \quad
\frac{\partial u}{\partial D_z} = -\frac{f(x + D_x)}{(z + D_z)^2}, \quad
\frac{\partial u}{\partial \phi} = \frac{f}{z + D_z}\frac{\partial x}{\partial \phi} - \frac{f(x + D_x)}{(z + D_z)^2}\frac{\partial z}{\partial \phi}     (5.8)

J_{2n+1,1:6} = \left[ \frac{\partial v}{\partial D_x}, \frac{\partial v}{\partial D_y}, \frac{\partial v}{\partial D_z}, \frac{\partial v}{\partial \phi_x}, \frac{\partial v}{\partial \phi_y}, \frac{\partial v}{\partial \phi_z} \right], \quad
\frac{\partial v}{\partial D_x} = 0, \quad
\frac{\partial v}{\partial D_y} = \frac{f}{z + D_z}, \quad
\frac{\partial v}{\partial D_z} = -\frac{f(y + D_y)}{(z + D_z)^2}, \quad
\frac{\partial v}{\partial \phi} = \frac{f}{z + D_z}\frac{\partial y}{\partial \phi} - \frac{f(y + D_y)}{(z + D_z)^2}\frac{\partial z}{\partial \phi}     (5.9)

The partial derivatives of x, y and z with respect to the counterclockwise rotation parameters \phi (in radians) can be found in Table 5.1. This table shows that computing the Jacobian elements in Equations 5.8 and 5.9 is fast and quite promising for real-time system performance.

Table 5.1: The partial derivatives table.

\partial/\partial     \partial x     \partial y     \partial z
\phi_x                 0             -z              y
\phi_y                 z              0             -x
\phi_z                -y              x              0

The Jacobian components with respect to rotation in Equations 5.8 and 5.9 are then simplified into:

\frac{\partial u}{\partial \phi_x} = -\frac{f y (x + D_x)}{(z + D_z)^2}, \quad
\frac{\partial u}{\partial \phi_y} = \frac{f z}{z + D_z} + \frac{f x (x + D_x)}{(z + D_z)^2}, \quad
\frac{\partial u}{\partial \phi_z} = -\frac{f y}{z + D_z},
\qquad
\frac{\partial v}{\partial \phi_x} = -\frac{f z}{z + D_z} - \frac{f y (y + D_y)}{(z + D_z)^2}, \quad
\frac{\partial v}{\partial \phi_y} = \frac{f x (y + D_y)}{(z + D_z)^2}, \quad
\frac{\partial v}{\partial \phi_z} = \frac{f x}{z + D_z}     (5.10)

After setting up Equation 5.6, it is solved iteratively until a stable solution is obtained.

5.3.3 Implementation considerations

Converging to a true solution requires a high quality feature matching process. Although several tests attempt to prevent outliers from reaching this stage, chances are that some outliers exist among the matched features. In order to minimize the effect of faulty matches on the final estimated motion, the following considerations are taken into account during implementation:

• The estimated motion is allowed to converge to a more stable state by running the first three consecutive iterations without any outlier rejection.

• At the end of each iteration, the residual errors for each matched pair in both image coordinate directions, E_u and E_v, are computed.

• From the fourth iteration, the motion is refined by elimination of outliers.
For a feature  76  5.3 Motion Estimation to be considered an outlier, it must have a large residual error, \f E  2 u  + E. 2  v  Removing  all outliers, where the estimation is not finalized yet, may affect the correct solution dramatically. F o r this reason, at each time only 10% of the features w i t h the highest residual error w i l l be discarded as outliers. • T h e m i n i m i z a t i o n is repeated for u p to 10 iterations i f changes i n the variance of the error residual vector is more t h a n 10%. D u r i n g this process the estimation moves gradually toward the best solution. • T h e m i n i m i z a t i o n process stops i f the number of inliers drops to 40 or less matches. T h i s is due the fact the least squares m i n i m i z a t i o n results, for six unknowns, when the number of m a t c h correspondences between two frames are less than 40 may not be very stable and accurate. Moreover, a m i n i m u m number of 40 m a t c h correspondences prevents m a t r i x J J from being i l l conditioned. 7  5.3.4  Motion estimation results  T h e results of an entire m o t i o n estimation cycle, for a distance of about 5cm i n the outdoor environment, is presented i n Table 5.2. A s shown i n this table, T h e error is reduced i n a consistent manner and the final error residual is less than a pixel. Generally the error residual is only a fraction of a pixel.  5.3.5  Feature flow results  A s it has been explained i n Section 5.3.1, when solving the linear equation system w i t h six unknowns, at least three valid corresponding points should be found i n two sets of images. However, for accurate and reliable results, it is better to have an over-constrained problem. Moreover, i n match validation required for the stereo process, and for tracking identical features i n different frames, some of the extracted features might be discarded. Therefore, having more features results i n more reliable results.  Table 5.3 presents the number of  77  5.3 Motion Estimation  Table 5.2: Iteration results along with the error residual, in pixels, for one motion estimation.  4>y  ^/El  + El  Iter-  No of  ation  matches  Cm  Cm  Cm  Degree  Degree  Degree  Pixel  1  188  -6.063  -0.059  -0.233  -0.047  0.127  -1.041  19.7947  2  188  -5.976  0.012  -0.207  0.118  0.099  -1.014  10.5260  3  188  -5.975  0.012  -0.207  0.117  0.096  -1.015  10.48685  4  170  -6.413  0.063  -0.493  -0.030  0.570  -0.345  4.4073  0.018  0.530  -0.207  2.4057  D  D  z  x  4>z  5  153  -6.401  0.096  -0.414  6  138  -6.469  0.112  -0.454  -0.030  0.541  -0.121  1.5502  7  125  -6.530  0.141  -0.424  0.049  0.585  -0.052  1.1942  8  113  -6.681  0.101  -0.427  -0.025  0.614  -0.061  1.2864  9  102  -6.521  0.145  -0.468  -0.014  0.588  -0.075  0.7966  -0.078  0.6333  92  10  -6.514  0.137  -0.458  -0.013  0.563  features at different stages of the system for two individual experiments.  Table 5.3: Feature numbers and comparison of the matching accuracy of the tracking process at different stages of the system for two individual experiments. Variable  Static  Moving  definition  camera  camera  1  No. of features in the reference image  2199  2127  2  No. of world features after stereo validation  1054  963  3  No. 
of tracked features in two consecutive frames  476  160  4  Features' column variance in the first iteration (E )  0.0949  107.22  5  Features' row variance in the first iteration (E ) v  0.1180  41.635  6  Features' column variance in the last iteration  0.0946  0.19811  7  Features' row variance in the last iteration  0.1176  0.0756  No  u  Here, two cases are studied. In the first case, the camera is static and in the second, it is mobile (columns 3 and 4). From this table, about 50% of features fail the validation test or have no correspondences in the stereo reconstruction process row 2 of the Table 5.3. Variables  5.4 Dynamic Features and Estimation  Reliability  78  "Features' c o l u m n / r o w variance i n the first iteration" , rows 4 and 5 of this table, represent the positional variances (in pixels) when estimating the m o t i o n on the first iteration. Variable "Features' c o l u m n / r o w variance i n the last iteration" (rows 6 a n d 7 of Table 5.3) show these variances i n the last iteration. T h e transition from the i n i t i a l m o t i o n estimation to the final estimation can be observed by the d r a m a t i c change i n values of error residuals of E  u  5.4  and  E. v  Dynamic Features and Estimation Reliability  T h e m a i n task of the m o t i o n estimation is to find the 3 D m o t i o n vector that projects two set of observations onto each other w i t h m i n i m a l error. One of the m a i n assumptions i n this work was that most of scene objects are static. In that case the o p t i m u m results are gained by e l i m i n a t i n g outliers that are detected through their large positional error residual. These outliers originate from either falsely matched features or d y n a m i c scene features.  If outliers are from different d y n a m i c objects w i t h  r a n d o m motions they would not affect the estimated results as m u c h as when they are clustered a n d they a l l have the same m o t i o n . F i g u r e 5.4 shows a static plane, w i t h a d y n a m i c part (highlighted w i t h the bright square i n F i g u r e 5.4 a). T h e m o t i o n of the static plane is estimated while the d y n a m i c part is moved. A m o t i o n vector of zero, i n the last c o l u m n of Table 5.4, is expected i f the system is not overwhelmed by outlier features. Table 5.4 shows results of this experiment.  F r o m these results when the number of  d y n a m i c features is more t h a n 35% to 40% of the employed scene features, rows b and c, the accuracy of the estimated m o t i o n is questionable.  5.5  Feature Update  After the m o t i o n parameters are found, the database information must be updated. T h i s is performed based on the prediction and observation of each feature and the robot's m o t i o n .  5.5  Feature Update  79  <-  d  e  f  Figure 5.4: Dynamic features can distract the estimated motion. The position and uncertainty matrix for features that are expected to be seen and have corresponding matches are updated (details in Chapters 6). Their count increases by 1.  • Positions and covariance matrices for features that are not expected to be seen are also updated; however, their counts stay unchanged.  • Features that are expected to be seen but have no unique matches are updated. The uncertainty for these features is increased by a constant rate of 10% and their count decreases by 1.  • New features, with no correspondence in the reference world, are initialized in the database and their count is set to 1.  
80  5.6 Feature Retirement  Table 5.4: C o m p a r i s o n of the number of d y n a m i c features versus estimation accuracy. Figure  Dynamic  Dynamic patch's actual motion  Static scene's estimated motion  5.4  features  D , D , D (Cm)  D , D , D (Cm)  subfigure  (%)  b  51.8  x  y  <f>x, <f>y, 4>z (Deg)  44.9  0, 3, 0  0.134, 1.871, 0.017  0, 0, 0  0.939, 0.094, 0.162  35.6  0, 2.5, 0  0.009, - 0 . 0 2 0 , - 0 . 0 4 2  0, 0, 0  - 0 . 0 8 8 , - 0 . 0 0 0 , 0.006  29.1  0, 3.5, 0  - 0 . 0 5 8 , 0.065, 0.084  0, 0, 0  0.145, - 0 . 0 3 2 , 0.193  20.1  0, 3.7, 0  - 0 . 1 2 7 , 0.005, - 0 . 0 0 3  0, 0, 0  0.020, 0.448, - 0 . 0 0 7  relative to e  5.6  y  - 0 . 0 4 8 , - 0 . 2 0 9 , 1.558  relative to d f  z  0, 0, 0  relative to c e  v  - 0 . 0 3 3 , 0.557, 3.217  relative to b d  x  <t>x,4> , <t>z (Deg)  • 0, 4, 0  relative to a c  z  Feature Retirement  After u p d a t i n g global feature points, an investigation is carried out to eliminate those features that have not been seen over some distance. F o r this purpose, feature points w i t h a count value of —5, i n d i c a t i n g that they have not been observed for at least 5 consecutive frames, which for our system corresponds to a distance of 50cm, are eliminated. T h i s cond i t i o n removes some of the remaining unstable features that falsely pass the stability and disparity conditions i n the stereo m a t c h i n g process i n spite of their p o o r conditions. It also decreases the size of the global database and prevents the system from wasting the processing time on such features i n later operations.  5.7  Chapter Summary  In this chapter, the feature tracking process as well as m o t i o n estimation, are explained. Identification of similar features is performed through a normalized sum of squares differences  5.7 Chapter Summary  81  w i t h a validity check. T h e tracking is refined by a novel multi-step scheme that increases the correct match correspondences by up to 30%. T h e m o t i o n is estimated using Newton's least squares m i n i m i z a t i o n . In this estimation, the d y n a m i c scene features are eliminated as outliers, forcing the m o t i o n vector to a more reliable solution. For an even more accurate 3 D representation, the global location of the features are updated at the end of each frame. A l s o , features that have not been seen for a while, and w i l l probably not be seen again are eliminated.  Chapter 6 Trajectory Error Modeling There axe several error sources that can affect the accuracy of m o t i o n estimation results. Image noise, the quantization associated w i t h images and the locations of found image features can a l l introduce inaccuracy to the location of the world features, as well as into the overall estimated robot trajectory. K n o w i n g how reliable the 3 D feature locations are is key to incorpoxating the uncertainty factor (position covariance) for each feature that later facilitates the elimination of less reliable features. T h e noise associated w i t h an image is considered to be white and Gaussian [18]. T h e dispaxity measuxement using such an image inherits this noise. Since the 3 D estimation of each feature i n space is a linear function of the inverse dispaxity, a K a l m a n filter estimator seems to be a p r o m i s i n g m o d e l for reducing the system exxor associated w i t h the existing noise [36]. 
Thexefore, a K a l m a n filtexing scheme is incoxpoxated into the system, that uses the many measuxements of a feature over time and smooths out the effects of noise i n the feature positions, as well as i n the robot's trajectory. i n d i v i d u a l K a l m a n filter is generated.  I n this scheme, for each feature an  E a c h K a l m a n filter includes a 3 x 1 mean position  vector and a 3 x 3 covariance m a t r i x that represent the mean position and the positional uncertainty associated w i t h that feature i n space. A K a l m a n filter is also created for the robot mounted camera, that includes position and the uncertainty associated w i t h it.  82  6.1 Global 3D Feature Creation  6.1  83  Global 3D Feature Creation  A t each frame, the 3 D location of the features generated through the stereo algorithm are computed relative to the current position of the camera system. M a n y of these features get repeated i n future frames, and as the robot moves around, these features may come into or go out of view. In order to combine several observations of an identical feature that is seen over two or more frames, it is necessary to represent these features i n a reference coordinate system. F o r this purpose, a global reference coordinate system is created, which is the same as the first camera coordinate system. T h e processes that are involved i n m a i n t a i n i n g and u p d a t i n g world features include the following:  I. I n i t i a l i z i n g world features and the camera pose i n a global reference frame. T h i s global reference frame is defined by the i n i t i a l pose of the camera stereo unit and at the position of reference camera, F i g u r e 7.1. II. A c q u i r i n g the next set of images which become the "current frame" and constructing the next set of observed features i n the current camera frame. A l s o , estimating the camera m o t i o n from the previous frame a n d u p d a t i n g the camera pose accordingly. III. U s i n g the current robot position to transfer the observed features i n the current frame into the global coordinate system. I V . M a t c h i n g a n d c o m b i n i n g the measurements of identical scene features. V . A d d i n g entries to the global set for features that are observed for the first time. V I . Keeping global features seen before which might be seen i n the near future. V I I . R e t i r i n g features that have not been seen for a while and may not be seen again. E a c h time a feature from the current frame matches an existing global feature, the covariance m a t r i x of the feature i n the current frame is transfered into the global coordinate system, where it combines w i t h the uncertainty m a t r i x of the matching feature.  6.2 Camera Position Uncertainty  6.2  84  Camera Position Uncertainty  T h e robot's position is u p d a t e d using the estimated m o t i o n vector found by the least-squares m i n i m i z a t i o n . For this purpose, a simple K a l m a n filter model is employed. T h e only assumption that is made for this m o d e l is that the robot moves w i t h a constant velocity. 
Following equations represent the K a l m a n filter model for this application:  z  = Fx + €  k+l  k  + Vk  z = Hx k  where x  k  (6.1)  k  k  (6.2)  represents the state variable at frame k,  x  = [x, y, z, <j) , (f> , (f> , x, y, i , </> , </>„, <[> ]  (6.3)  T  k  x  y  z  x  Z  and F is a constant 12 x 12 m a t r i x and is defined by:  F =  1 T  0  0  1  0  0  0  0  0  0  0  0  0  0  0  o"  T  0  0  0  0  0  0  0  0  0  1  T  0  0  0  0  0  0  0  0  0  0  1  T  0  0  0  0  0  0  0  0  0  0  0  1  T  0  0  0  0  0  0  0  0  0  0  0  1  T  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  1  (6.4)  In m a t r i x F, T is the s a m p l i n g rate and is set to 1. & and n are repectively (unknown) k  system and observation noises. H is a 6 x 12 m a t r i x and is defined by 6 x 6 identity and N is a 6 x 6 zero matrices.  I  N  ,where / is a  6.2 Camera Position Uncertainty  6.2.1  85  Prediction  U s i n g the standard K a l m a n filter notation [23], the state prediction is made by  x(k + l\k) =  Fx(k\k)  (6.5)  If P(k\k) represents the process covariance, the process covariance prediction is  P(k + l\k) = FP(k\k)F  T  (6.6)  + Q(Jfe)  Q shows the noise associated w i t h the process covariance and is defined by  Q(*) =  0 05  0 00  0 00  0 00  0.00  0.00  0 00  0.00  0 00  0 00  0 00  0.00  0 00  0 05  0 00  0 00  0.00  0.00  0 00  0.00  0 00  0 00  0 00  0.00  0 00  0 00  0 05  0 00  0.00  0.00  0 00  0.00  0 00  0 00  0 00  0.00  0 00  0 00  0 00  0 03  0.00  0.00  0 00  0.00  0 00  0 00  0 00  0.00  0 00  0 00  0 00  0 00  0.03  0.00  0 00  0.00  0 00  0 00  0 00  0.00  0 00  0 00  0 00  0 00  0.00  0.03  0 00  0.00  0 00  0 00  0 00  0.00  0 00  0 00  0 00  0 00  0.00  0.00  0 05  0.00  0 00  0 00  0 00  0.00  0 00  0 00  0 00  0 00  0.00  0.00  0 00  0.05  0 00  0 00  0 00  0.00  0 00  0 00  0 00  0 00  0.00  0.00  0 00  0.00  0 05  0 00  0 00  0.00  0 00  0 00  0 00  0 00  0.00  0.00  0 00  0.00  0 00  0 03  0 00  0.00  0 00  0 00  0 00  0 00  0.00  0.00  0 00  0.00  0 00  0 00  0 03  0.00  0 00  0 00  0 00  0 00  0.00  0.00  0 00  0.00  0 00  0 00  0 00  0.03  (6-7)  M a t r i x Q is a constant m a t r i x and is found experimentally. I n this m a t r i x , the associated uncertainties w i t h the r o t a t i o n a l components of the state variable are defined to be smaller than those of the translational parameters. T h i s is m a i n l y due to the fact that the accuracy of the estimated rotational parameters of the m o t i o n is higher. These values however are defined to be larger t h a n the estimated uncertainties associated w i t h the measurements as shown i n E q u a t i o n 6.9. Such larger uncertainties emphasize the fact that the measurement values are more reliable under n o r m a l circumstances. However, i f for any reason, the leastsquared m i n i m i z a t i o n for estimating the m o t i o n parameters fails, then the covariance m a t r i x  6.2 Camera Position Uncertainty  86  of the measurements i n E q u a t i o n 6.9 is changed to an identity m a t r i x , forcing the system to give larger weight to the predicted values.  
6.2.2  Measurement  T h e measurement prediction is computed as  z(k + l\k) = Hx{k + l\k)  (6.8)  T h e new position of the robot is obtained by u p d a t i n g its previous position using estimated m o t i o n parameters by the least squares m i n i m i z a t i o n from E q u a t i o n 5.6. T h i s is the new position measurement for the robot or x^sobtained by c o m p u t i n g the inverse of J^J  T h e covariance Ris  for the measurement is  [64] i n Section 5.3.1.  M a t r i x 6.9 represents a typical RLS that is computed by our system d u r i n g one of the tracking process.  0.000001  0.000000  0.000000  0.000001  -0.000002  -0.000001  0.000000  0.000000  -0.000000  0.000001  -0.000001  -0.000000  0.000000  -0.000000  0.000001  0.000001  -0.000001  -0.000001  0.000001  0.000001  0.000001  0.000005  -0.000004  -0.000004  -0.000002  -0.000001  -0.000001  -0.000004  0.000006  0.000005  -0.000001  -0.000000  -0.000001  -0.000004  0.000005  0.000007  (6.9)  If for any reason, a feasible result for the measurement vector XLS is not found by the leastsquared m i n i m i z a t i o n procedure, RLS is set to a 6 x 6 identity m a t r i x . A RLS w i t h larger components, c o m p a r i n g to Q, causes the system to give the prediction values a higher weight than the unknown measurements that are set to zero.  6.2 Camera Position Uncertainty  6.2.3  87  Update  The filter gain is defined by  W(k + 1) = Pik + lltfFFS-^k + l)  (6.10)  The state update is then computed from  x(k + l\k + l)=x(k + l\k) + W{k + l)[x - z{k + l|jfe)] LS  (6.11)  The covariance update is then  P(k + l\k + 1) = P(k + l\k) - W(k + \)HP(k + l\k)  (6.12)  In summary the Kalman filtering formulation can be presented by the following set of relationship:  P(0|0) = Var(x ) 0  P(k + l\k) = FP(k\k)F + Q{k) T  W{k + 1) = P(k + l\k)H [HP(k + 1 \k)H + T  T  RLS]'  1  P{k + l\k + l) = P(k + l\k) - W(k + l)HP(k + l\k) x(k + l\k) = Fx(k\k) x(k + l\k + l) = x(k + l\k) + W{k + l)(x  LS  - z{k + l\k))  k = l,2,...  Figure 6.1 also represents a graphical presentation of the Kalman filtering model that is used for the linear motion of the camera.  6.3 Feature Position Uncertainty  88  x(Jfe + l | A ; + l )  W(k + l)  Q  '+  H  F  x{k + l\k)  delay  F i g u r e 6.1: C a m e r a position K a l m a n filtering model.  6.3  Feature Position Uncertainty  Uncertainties i n the image coordinates and disparity values of the features from the stereo algorithm propagate t o uncertainty i n the features' 3 D positions. A feature point at (u, v) on the image plane, w i t h a disparity value of d, has the world corresponding location (x, y, z) that is found by the inverse perspective projection:  x=  (C  x  d  _(V-  7 !  "  Where (C ,C ), x  - u)b Cy)b  (6.14)  d  b and / represent the image center, stereo camera separation and camera  y  focal length. A first-order error propagation model [15] provides the positional uncertainties associated w i t h each feature.  bo o = d Vo c d 2  b\C -ufo\ d V{v-C fo d  2  _  ~2  Cl  x  2  4  x  2  2  v  y  2  =  Or,  Where, o , 2  x  spectively.  o , 2  y  a , oc, 2  2  z  x  oc 2  and o  2  v  d  (6.15)  4  d  d  4  are the variances of x, y, z, C , C a n d d, rex  y  Based on the results given i n Section 5.3.3, where the mean o f error i n the  6.4 Feature Update  89  least-squares m i n i m i z a t i o n is less than one pixel, assumptions are made that o c 2  x  e c = 0.5 and o 2  = 1.  
2  s  = 0.5,  d  K n o w i n g the intrinsic parameters of the camera system, we compute the variances for each feature's 3 D position, i n the current camera frame coordinate, according to the above error propagation formula i n the stereo process.  6.4  Feature Update  E a c h time a feature is observed, a new set of measurements is obtained for that feature i n the space. Therefore, at the end of each frame and after estimating the m o t i o n , world features found i n the current frame are used to update the existing global feature set. T h i s requires that these features be transformed into the global coordinate system first. N e x t , the positional mean and covariance of each feature are combined w i t h corresponding matches i n the global set.  T h e 3 D uncertainty of a feature i n the current frame is computed as  described by Equations i n 6.15. However, when this feature is transformed into the global coordinate system the uncertainty of the m o t i o n estimation and robot position propagates to the feature's 3 D position uncertainty i n the global frame. Therefore, before combining the measurements, the 3 D positional uncertainties of the feature are updated first. F r o m the least-squares m i n i m i z a t i o n procedure, the current robot pose, as well as its covariance ( [ J J ] ~ ) , T  1  are obtained [64] (Section 5.3.1). T h e current p o s i t i o n of the features  can be transfered into the reference frame by  Pnew ={RY{Rx{Rz{Pobs))) + T where  P  oos  and  P  new  (6.16)  are the observed position i n the current frame and the corresponding  transformed position i n the reference frame. T is the translational transformation and  Rz,  Rx and Ry are the rotational transformations (roll, pitch and yaw) required around each  6.4  90  Feature Update  one of the three axes defined by  cos((f> )  —sin((f) ) 0  z  Rz =  z  sin((f) ) cos((j) ) 0 z  0  0  (6.17)  z  0  1  0  Rx = 0 cos(4> ) -sin((f) ) x  (6.18)  x  0 sin((j> ) cos((j) ) x  x  cos((f)y) 0 sin((j)y) RY  0  (6.19)  0  —sin((j)y) 0 cos((j) ) y  6.4.1  Feature covariance update  T h e goal is to obtain the covariance of the features i n the reference coordinate system, given the diagonal uncertainty m a t r i x for each observed feature i n the current frame consisting of <7 x, 2  <y and o 2  2  y  z  ( E q u a t i o n set 6.15). For this purpose, the new covariance m a t r i x ( E  N E U )  of a feature is combined w i t h the previous covariance m a t r i x of the matched feature i n the global set (EJCF) to obtain the new covariance m a t r i x (Y,KF')-  X' =  PX  Given  (6.20)  where P is a 3 x 3 transformation m a t r i x and X and X' are the o l d and new position vectors, 3 x 1 . If there are errors associated w i t h b o t h P and X , A p (9 x 9 covariance for P) and, A x ( 3 x 3 covariance for X),  the 3 x 3 covariance of the resulting vector X' is computed  )  6.4 Feature Update  91  by [24]:  X X  0  T  P  0  X  T  0  0  0 X 0  0 A  0  0  0 0  0 0  X  0  (6.21)  X  Ax  T  In Equation 6.21, the first matrix is a 3 x 12, the second is a 12 x 12 and the third, which is the transpose of the first matrix, is a 12 x 3 matrix. With the assumption that at each time the three rotation angles are small and therefore independent, the transformation proceeds, in order, for rotations Rz Variances of a | , a £ and x  (roll), Rx  (pitch), R  Y  (yaw) first, and then the translation T.  are already found during the last motion estimation. 
Required  transformations for each stage and how the positional uncertainties propagate are explained next.  6.4.1.1  R o l l transformation  With the assumption that noises associated with the rotational angles  <f> , <j> and (f> are x  y  Gaussian and of zero mean, the 9x9 covariance matrix for the roll transformation is computed  z  92  6.4 Feature Update  by  A  0  0  A  0  0  0  0  0  B  0  —B 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  -B  0  B  0  0  0  0  0  A  0  0  0  A  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  (6.22)  where A a n d B are defined b y V ariance(cos(t> ) and Variance(sin(f> ). B y definition z  z  Variance(cos(p ) = E(cos <t> ) — E (coscj) ) 2  (6.23)  2  z  g  z  (6.24)  Variance(siri(f> ) = E(sin (j> ) — E (sin<f> ) = E(sin (j) ) 2  2  z  2  z  z  z  T h e expected value of cos <f> is computed [76], w i t h the assumption that <f> has a G a u s s i a n 2  z  z  distribution, by + 0  E{cos cj> ) = -=L2  z  [  °  e  iLl +c o ^  d4z  (6.25)  = ——e 2 2 also  E(cos<j) ) = e~ z  2  (6.26)  6.4 Feature Update  93  therefore  (6.27)  In the same manner B is computed by  B= -(l-e-^l) 1  (6.28)  U s i n g Equations 6.21 and 6.17, the covariance m a t r i x after the r o l l , <f> , rotation is comz  puted from  2  2  2  x  z  —2a cos<j> sin<f) 2  z  2  z  z  +ipl ~ °l)sin<t> cos(t>  (A - B)xy  Ay + Bx + olsin (f>  2  +a (cos (p 2  2  +o sin 4>. 2  A<*.  z  +oxy (cos (j) — sin (j) ) 2  z  ol cos<j>  (A - B)xy  Ax' + By + o cos <\)  2  z  2  - sin (j) )  2  2  z  +{o - o-?,)sin(t> cos(f> 2  x  z  z  o sin§ xz  z  (6.29)  +a cos(j). 2  z  2  y  z  o- sin<$> 2  z  xz  -a sin^) 2  yz  2  +a cos (j) 2  z  ol cos<f> z  2  +2a sin(f> cos(j)  z  —a sincp  z  z  2  z  z  z  +al coscj>  z  z  z  Here, (x, y, z) is the 3 D location of the feature i n the current frame. Since this is the first time the transformation is carried out, o  2  —a  2  y  —a  2  z  z  — 0 as the i n i t i a l covariance m a t r i x of  the observation is a diagonal m a t r i x . A p p l y i n g the r o l l transformation to the i n i t i a l position of the feature provides the transformed position of the feature. T h i s new position is used for the next stage. T h e uncertainty associated w i t h this position is the recent covariance m a t r i x of A ^ .  94  6.4 Feature Update  6.4.1.2  P i t c h transformation  The 9 x 9 covariance matrix for the pitch rotation, <j) , is computed in a similar way, as shown x  in Equation 6.22. Once again, substituting the pitch rotational transform in Equation 6.21 and the covariance matrix in 6.29, the new covariance matrix can be defined by  a cos(j) l  xy  x  +al cos(j>  -al sin(j> z  a cos(j>  Cy  2  xy  -al sin(j> z  A,<t>z<t>x  + Dz  2  x  z  x  ,  2  (C -  +OyCos <f> + alsin <j) 2  x  +{a  2  x  -2a cos(j> sin(j)  o-l siruj> y  +(o  2  x  2  2  x  + Dy  2  +a sin <fr + a cos <f> 2  x  2  2  2  x  — sin (j) )  x  +2a cos^ sin^)  2  x  x  2  x  (6.30)  cf> — sin (f> )  Cz  2  x  2  yz  — o )cos<f> sin<t>  +a (cos (j)  x  +a (cos  x  D)yz al)coscj) sin(p  2  (C - D)yz  x  +o cos(f) xz  x  -  2  x  2  z  x  2  x  z  x  x  where C and D are defined by  C = -(l + - < 2  e  (6.31)  -2e-°*>)  (6.32)  D= -{l-e-H*) l  In this formula, a^  x  is found from the last motion estimation,  (x, y, z) is the transformed  3D location of the feature after the roll transformation and o~ , a , a , a , 2  2  2  2  y  a  2 z  and a  yz  are  from the covariance matrix A ^ , Equation 6.29. 
Applying the pitch transform provides the transformed position of the feature, which is used in the next stage together with this new feature covariance.  95  6.4 Feature Update  6.4.1.3  Y a w transformation  T h e 9 x 9 covariance m a t r i x after the yaw rotation is computed b y  Ex + Fz 2  2  c  +a cos (t>y + a sin (j)y 2  2  (E - F)xz  aly °s<t>y  2  +a sin(j)  2  yz  +{o  — al)sin(j)yCos(j)  2  y  y  +a (cos (j)y  +2a sin(f>yCos(f)y  2  2  xz  — sin <j)y) 2  a cos(j)y  a cos4>y  +ol sin<t>  -o- sin<t>  2  2  xy  yz  (6.33)  '•<t>i<j>x<t>y z  2  xy  y  (E - F)xz +(°z  ~  a )sin^ cos4>y 2  x  -  2  xz  -a  y  +a (cos 4> 2  •xy  y  2  sin(j)  Ez  + Fx  2  y  cos<j>  sin 4> )  2  +a sin <t> + a cos <t> 2  y  y  2  x  2  y  2  z  y  -2a sin(f>yCOS(j)y  2  xz  y  E a n d F are defined by:  E = - ( 1 + e~ *v - 2e~"^)  (6.34)  F=^(l-e~  (6.35)  2a  2  a  ^)  Zt and a ^ is the variance of <j> estimated earlier i n the m o t i o n estimation process, 2  y  (x, y, z)  is the transformed 3 D location o f the feature after the p i t c h transformation a n d cr ., a , a , 2  2  y  a , 2  y  a  xz  and a  yz  are from the covariance m a t r i x A ^ ^ , E q u a t i o n 6.29. A p p l y i n g the yaw  transformation gives the transformed p o s i t i o n o f the feature, which is used along w i t h the new covariance i n the next stage.  96  6.4 Feature Update  6.4.1.4  Translational transformation  T h e final transformed covariance for each feature i n the global coordinate system is computed from:  ^ n e t u  Here o , x  o  Y  —  0  o\  0  0  ^-<Pz<t>x<l>y " f "  (6.36)  and o~\ are the translational uncertainties associated w i t h the estimated mo-  tion and found by the first three diagonal components of m a t r i x [J J] 7  1  computed i n Sec-  tion 5.3.1.  6.4.2  Feature position update  T o update the 3 D p o s i t i o n of a feature [89], the transformed covariance matrix, E combined w i t h the existing covariance of the m a t c h i n g global feature, T, , KF  new covariance m a t r i x ,  n e t 0  , is  to obtain the  T,' . KF  ^KF  = ^KF  T h e new global position of the feature, S' ,  + KLy  (6-37)  1  is then found using the covariances, the trans-  KF  formed position (using E q u a t i o n 6.16).and the previous position.  S  = T, (Y,K SK 1  K  F  KF  F  + ^new new)  (6.38)  r  F  Here, SKF represents the global position of a feature i n the database and r  new  new position of the same feature i n the global coordinate system.  represents the  97  6.5 Experimental Results  6.5  Experimental Results  50  100  150  200  250  300  Figure 6.2: Positional uncertainties associated with features in the image plane. Figure 6.2 shows the projection of estimated uncertainties associated with world features on the image plane. In this figure, the uncertainties associated with closer objects are very small and therefore appear as bright dots. As expected farther features, for instance features around windows on the upper right corner of the scene, have larger projected uncertainties. Some of the closer features also have large positional uncertainties associated with them. These large positional uncertainties imply wrong depth estimation for those features. To get a closer look at 3D scene features and their positional uncertainties Figure 6.3 is generated. It demonstrates two top views of world features and their associated uncertainties. Figure 6.3 a represents a look at the features located on close distances, up to 3.5 meters, from the image plane. 
Figure 6.3 b represents a broader view of the uncertainty space for  6.5 Experimental Results  98  the entire existing features.  A s clearly displayed, associated positional uncertainties w i t h  features grow i n dimensions as these features move away (in depth) from the camera plane and as they move to the sides (from the camera center). Z[mk  b Figure 6.3: a- Projected relative positional variances of features w i t h distances up to 3.5 meters from the camera on the X Z plane, b- T o p view of positional uncertainties for a l l features.  fj.6 Chapter Summary 1  99  Figure 6.4 shows the updated positional uncertainties of world features and their corresponding i n i t i a l p o s i t i o n a l uncertainties. In this figure, dotted ellipsoids show the original and the solid ellipsoids show the updated uncertainties.  Figure 6.4: O r i g i n a l and updated positional uncertainties of world features.  Results of the K a l m a n filters incorporated into the trajectory tracking system are presented i n C h a p t e r 7.  6.6  Chapter Summary  In this chapter the error m o d e l i n g is described. T h e m a i n idea is to combine the measurements of an identical feature viewed several times to reduce the error originating from existing noise i n the measurement of such feature. For this purpose, for each feature a 3D ellipsoid is created that represents the 3D positional uncertainty associated w i t h that feature. Such ellipsoids are updated i n each frame and depending on the visibility of features, and l o c a t i o n of the features w i t h respect to camera they might expand or shrink. W o r l d features w i t h large  6.6 Chapter Summary  100  uncertainties are eliminated as they might not be very reliable. A K a l m a n filtering model is created for the camera (robot) m o t i o n . W i t h the assumption of constant translational and rotational velocities, the camera location on the next frame is predicted and refined using the camera position measurement gained through the least-squares m i n i m i z a t i o n process.  Chapter 7 Implementation Results T h i s chapter contains the experimental results obtained from implementing the solution strategies put forth i n previous chapters.  These results provide trajectory estimations for  several indoor and outdoor experiments.  T h e y also discuss system processing time and  accuracy as well as some related issues.  7.1  Camera System: Digiclops™  T h e Digiclops stereo vision camera system, F i g u r e 7.1, is designed and b u i l t by Point G r e y Research [45].  It provides real-time d i g i t a l image capture for different applications.  It  includes three monochrome cameras and a software system w i t h the I E E E - 1 3 9 4 interface. These three cameras are rigidly positioned so that each adjacent pair is horizontally or vertically aligned. T h e Digiclops system is equipped w i t h a standard C / C  + +  interface l i b r a r y and is capable  of acquiring and displaying continuous real-time digital video from the camera. It is also capable of real-time depth and 3 D image construction w i t h a speed of 30 to 4.3Hz depending on the image size. In this work, the Digiclops is used only for the purpose of image acquisition. T h e m a i n reason for this decision is due to the fact that not a l l the image pixels carry,  101  102  7.2 Trajectory Estimation  Vertical camera  Horizontal camera  Reference camera  Figure 7.1: Digiclops stereo camera system. significant information about the camera motion. 
Therefore, reconstruction of the 3D map of the entire viewed scene could waste the time that is required for other tasks of the system. In this work, the intrinsic camera parameters are also provided by Point Grey Research. The camera system captures gray scale images of 320x240 pixels. In order to reduce the ambiguity between the yaw rotation and sideways movements, a set of wide angle lenses with a 104° field of view, VL-2020 by Universe Kogaku America, is used. These lenses incorporate information from the sides of the images that behave differently under translational and rotational movements.  7.2  Trajectory Estimation  The performance of the system is evaluated based on its cumulative trajectory error or positional drift. For this purpose a series of experiments are performed in indoor and outdoor environments. These experiments are either performed for known paths or closed paths. For a known path the location of the robot is compared either with a destination point or some known points along the path. The location of the destination or known points are measured ahead of time by a human operator. For closed path experiments, the robot starts from an initial point with an initial pose. After moving around, it returns to the starting point. To ensure returning to the exact initial pose, an extra set of images are acquired at the starting position, right before the robot starts its motion. This set is used as the last set and since a second image will differ from the first image by noise and is thus more appropriate to use  103  7.2 Trajectory Estimation  than the first image again. For perfect tracking, the expected cumulative positional error over the whole trajectory should be zero and therefore anything else represents the system's drift. The first two presented experiments, 1 and 2, show the performance of the system using unwarped images. Experiments 3 and 4, however, show system's output using warped images. The error was observed to be increasing with distance and consequently the use of percent error - being the error divided by the total traversed distance seemed reasonable as then the percent error is roughly constant.  7.2.1  Experiment 1:  This experiment is intended to investigate the localization accuracy.  It represents the  performance of the system at an earlier stage of this thesis when unwarped images were processed for the trajectory estimation purpose. The camera is moved along a route from point A to the known point B, Figure 7.2. A number frames (32) are processed along this route, white line segment in this figure.  Figure 7.2: The indoor scene from camera point of view. Figure 7.3 represents the estimated 3D path of the system.  The gradual motion of  7.2 Trajectory Estimation  104  the camera system is displayed using a graphical interface that is written i n V i s u a l C  +  +  and V i s u a l Toolkit ( V T K ) 4.0. In this figure the estimated location a n d orientation of the camera along the path are shown w i t h red spheres a n d 3 D yellow cones.  x  Figure 7.3: T h e graphical presentation of the 3 D camera m o t i o n while moving from point A to point B , described i n E x p e r i m e n t 7.2.1, using V i s u a l Toolkit 4.0. In this figure the position is shown w i t h a red sphere a n d the orientation w i t h a yellow cone A s shown, the camera rotates as well as translates.  T h e coordinates of point B are  measured manually i n a coordinate system w i t h origin A . 
Table 7.1 provides the estimated location and compares it with the actual location of point B. The path in this experiment is a straight line and it includes translation as well as rotation. The motion range includes an overall translation of 158.9 cm along Z and an overall yaw rotation of 45°. From these results, the drift of the system is about 4.16 centimeters, with an orientational drift of 4.20 degrees.

Table 7.1: Localization error for Experiment 7.2.1 (using fully calibrated images).

                      D_x (cm)   D_y (cm)   D_z (cm)   phi_x (deg)   phi_y (deg)   phi_z (deg)
  Motion range          0.00       0.00     -158.90       0.00          45.0           0.00
  Estimation error      0.40      -0.86        4.05      -1.25         -3.97           0.59

7.2.2 Experiment 2:

The next experiment is designed to measure the resulting drift on a closed path. For this purpose, the camera starts moving from point A on an arbitrary route and then returns to its starting point. Along this path 71 consecutive frames are processed. The images for this experiment were fully calibrated. The subject environment is an indoor scene with objects at distances ranging from 0.5 meter to about 8 meters from the camera image plane. Figure 7.4 represents the graphical presentation of the estimated 3D positions and orientations of the camera along this path using VTK 4.0.

Figure 7.4: 3D position and orientation of the mobile camera in Experiment 7.2.2 using VTK 4.0.

Table 7.2 represents the existing drift in this case. In this table, Motion range represents the absolute amount of the motion along each coordinate variable. The motion range along the z axis is 2.94 m and along the x axis 1.25 m, with a yaw rotation of 10°. To show the improvement resulting from Kalman filtering, the results are presented once without the Kalman filter and once with it. It can be seen that the system with the Kalman filter has about 8 cm of positional drift.

Table 7.2: System drift for Experiment 7.2.2 (using unwarped images).

  Kalman    Motion range                                       Estimation error
  filter    D_x, D_y, D_z (cm)   phi_x, phi_y, phi_z (deg)     E_Dx, E_Dy, E_Dz (cm)    E_phix, E_phiy, E_phiz (deg)
  Off       125, 0, 294          0, 10, 0                      13.68, 8.06, 4.04        0.20, -2.43, 0.24
  On        125, 0, 294          0, 10, 0                      -1.77, 7.57, -0.45       1.10, 0.22, -0.06

The large amount of cumulative error can be explained by the fact that most of the scene features were located fairly far from the camera plane (about 8 meters away). Only a small number of scene features, about 10% of all features, were located within 1 meter of the camera plane. Generally, for an accurate trajectory estimation there should be a good balance between close and far features in the scene. Close features improve the precision of the estimated motion, as they have more accurately reconstructed 3D locations. Far features, however, help in separating the motion components from each other. Also, they are less prone to sudden positional changes: for instance, if the camera undergoes a large motion, closer features may undergo dramatic shape changes or simply fall out of the viewable scene, while features at farther distances still remain viewable. This characteristic of far features assists the robustness of the system in dealing with such conditions.
A tilted camera can therefore create images that include objects on the floor (close features) as well as the surrounding areas (far features). In the case of this example, however, a tilted view would not be advisable, as the floor includes no visible texture.

7.2.3 Experiment 3:

In this experiment the robot moves along an outdoor path and the system operates on warped images. The scene was a natural environment including trees, leaves and building structures located at distances between 0.1 and 20 meters from the camera image plane. For this environment, most of the scene features belong to leaves, which have a small amount of dynamic motion due to the breeze. Other possible scene components, such as sidewalks and building walls, in which there are only a few features, are problematic for feature-based vision systems. Figure 7.5 shows the scene in this experiment. This environment allowed us to test the stability of the system in an outdoor environment with more non-static elements in the image.

Figure 7.5: The outdoor scene for Experiment 7.2.3.

The traversed path was 6 meters long, and along the path 172 warped images were captured and processed. In this figure the traversed path is highlighted with a red line and the orientation of the camera is shown using a white arrow. The robot starts the forward motion from point A to point B. At point B the backward motion begins until point A is reached. Figures 7.6 26-a, 96-a and 151-a show a number of warped images processed by the system during this experiment. Although the scene includes some structures from the building, most of the features, over 90%, belong to the unstructured objects of the scene. The estimated 3D path, from the first frame up to each displayed image frame, is shown in Figures 7.6 26-b, 96-b and 151-b. The estimated trajectory at each frame is shown with a red sphere and the orientation is displayed with a yellow cone. The center of the reference camera is considered the center of the motion. Figure 7.7 shows a closer look at the overall estimated trajectory along the entire experiment. Table 7.3 displays the cumulative trajectory error in this experiment.

Table 7.3: 3D drift in the estimated camera trajectory for a 6 meter long translation.

  Cumulative error    E_Dx (cm)   E_Dy (cm)   E_Dz (cm)   E_phix (deg)   E_phiy (deg)   E_phiz (deg)
  Experiment 7.2.3    -1.93000     1.74472     0.52874     -0.06628        0.00783       -1.00911

From this table, the translation error is about 2.651 cm, which in this case is only 0.4% of the overall translation. To show the repeatability of these results, another experiment is performed in which a 3 meter path is traveled. Table 7.4 displays the error associated with this experiment; the error in this case is about 0.54%.

Table 7.4: 3D drift in the estimated camera trajectory for a 3 meter long path.

  Cumulative error       E_Dx (cm)   E_Dy (cm)   E_Dz (cm)   E_phix (deg)   E_phiy (deg)   E_phiz (deg)
  3 meters long path      -1.3830     -0.4921      0.7340      -0.1578         0.0231         0.5980

7.2.4 Experiment 4:

In this experiment the robot moves on a closed circular path that includes a full 360° yaw rotation. The orientation of the camera is toward the ground.
Figure 7.8 a represents the overview of the environment i n this scenario a n d F i g u r e 7.8 b shows the scene from robot point o f view. D u r i n g this experiment, 101 warped images are captured a n d processed. Some of these images are shown i n F i g u r e 7.9. T o make i t easier for the viewer to sense the m o t i o n between frames, along each row of this figure, bright circles highlight identical features of the scene. Once again the presented results i n this experiment are based on processing scene warped images. T h e estimated trajectory from the first frame u n t i l corresponding frames i n Figure 7.9, are shown i n Figure 7.10.  7.2 Trajectory Estimation  151-a  110  151-b  F i g u r e 7.6: G r a p h i c a l 3 D representation of the motion using V T K 4.0.  7.2 Trajectory Estimation  111  Figure 7 . 8 : a-The outdoor scene used for the rotational motion. b-An up close view of the same scene from robot's point of view.  Figure 7.9: A number of images from the sequence that is used in Experiment 7.2.4.  7.2 Trajectory Estimation  113  Figure 7.10: Graphical presentation of the 3 D estimated camera motion on a circular path with a yaw rotation of 360° using V T K 4.0.  7.2 Trajectory Estimation  114  Figure 7.11: A closer view of the estimated circular path with a radius of 30cm that is traversed by the camera system in Experiment 7.2.4.  From this table the overall rotational error in this Experiment is about 3.341° or 0.9% and the translational error is 1.22cm or about 0.6% of the overall translation.  7.3 Trinocular and Binocular Stereo Comparison  115  Table 7.5: Estimated 3D drift for the camera motion with 360° rotation on a circular path with a diameter of 60cm. Cumulative  E  D  E D  x  <t>*  E  V  <t>v  E  <t>z  E  error  (Cm)  (Cm)  (Cm)  (Degree)  (Degree)  (Degree)  Experiment 7.2.4  1.030093  -0.252289  0.603087  -1.07127  -2.59989  1.18193  7.3  Trinocular and Binocular Stereo Comparison  Establishing accurate match correspondences in a stereo system is a key issue in the accuracy and precision of the 3D reconstruction and trajectory tracking problems.  The physical  arrangement of the cameras in stereo vision has an important role in the correspondence matching problem. As shown in Equation 4.9, the accuracy of the depth reconstruction has a direct relationship with the baseline and it can be improved by choosing a wider separation between the stereo lenses. On the other hand a narrower baseline facilitates a faster search scheme when establishing the correspondences in the stereo image pair. The use of more than one stereo camera was originally introduced to compensate for the trade-off between the accuracy and ease of the match correspondences [49]. As shown in Figure 7.1 the stereo baselines are almost identical in length. Therefore the improvement of the accuracy by means of multi-baseline stereo matching is not expected. However, the non-collinear arrangement of the lenses adds a multiple view of the scene that could improve the robustness and therefore the long term system accuracy. This is mainly because:  • Generally a broader scope of the scene is viewable by the three images increasing the number of the features. • Moreover, the third image is used for a consistency check, eliminating a number of unstable features that are due to shadows and light effects. • Often close objects to the camera plane have a large shape change. 
Sometimes a feature, depending on its position in the scene, is recognizable in only one pair of stereo images. Switching between the stereo pairs when necessary prevents the elimination of feature points that, even though they are not identifiable in both stereo sets, are stable enough to be used in the system.
• The perpendicular arrangement of the two stereo systems improves the accuracy by creating a condition in which the recognition of periodic patterns in the horizontal and vertical directions, such as horizontal and vertical lines in man-made objects and environments, is less ambiguous.

The above improvement, however, could potentially slow down the system, as each time there is one extra image to be processed. To assess the effectiveness of trinocular stereo versus binocular stereo, an experiment is conducted in which a closed path is traversed. Once again the cumulative error is used as a measure of system performance. Figure 7.12 shows a number of warped images, among a set of 32, that are taken on a path of approximately one meter in length. Table 7.6 shows the resultant error vector in both cases. In this table E_T and E_O represent the overall translational and rotational errors.

Table 7.6: Comparison of the estimated cumulative error for trinocular and binocular stereo.

               E_Dx (cm)   E_Dy (cm)   E_Dz (cm)   E_phix (deg)   E_phiy (deg)   E_phiz (deg)   E_T (cm)   E_O (deg)
  Trinocular    -1.138      -0.029      -0.233        0.178          0.005          0.189        1.162       0.259
  Binocular     -1.770      -0.402       0.580       -0.055          0.135         -0.031        1.906       0.149

These results show that the estimated motions are similar regardless of the number of cameras used. Figure 7.13 displays the traveled distance along the x direction for this experiment. Although the trinocular stereo provides a slightly smaller error vector, the estimation based on the binocular stereo is still quite accurate. Considering that the cost and the complexity of a binocular system are lower than those of a trinocular stereo system, binocular stereo might be a better solution for some applications.

Figure 7.12: A selected number of images (25, 29 and 31) viewed while traveling on the path. The motion of an object is highlighted using the yellow circle.

7.4 Trajectory Estimation Refinement by Kalman Filtering

The estimated trajectories with and without a Kalman filtering scheme are compared through the cumulative error in the 3D trajectory parameters. This comparison is studied for the case of Experiment 7.2.3, in which the traversed path is 6 meters long.

Figure 7.13: Estimated traveled distance along x [cm] for a 1 m closed loop.

Table 7.7 represents the results of this comparison. In this table E_T and E_O represent the overall translational and rotational drifts.

Table 7.7: Comparison of the reduction of 3D drift for a 6 meter long path using the Kalman filter.

  Cumulative error     E_Dx (cm)   E_Dy (cm)   E_Dz (cm)   E_phix (deg)   E_phiy (deg)   E_phiz (deg)   E_T (cm)   E_O (deg)
  Kalman filter on       -1.930      1.745       0.529        -0.066         0.008         -1.009        2.655       1.011
  Kalman filter off      -4.092      4.805      -0.138        -0.277        -0.262         -0.104        6.313       0.395

The overall translational error with the Kalman filter is considerably less than that when the Kalman filtering scheme is turned off.
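As described in Chapter 6, the filter assumes constant translational and rotational velocities between frames and refines the pose measurement obtained from the least-squares minimization. Purely as an illustration of that constant-velocity assumption, the following sketch applies a one-dimensional predict/update cycle to a single pose parameter; the decoupled per-axis treatment, the noise values and the measurement sequence are assumptions made for this example rather than the thesis' actual filter.

#include <cstdio>

// Minimal 1-D constant-velocity Kalman filter, applied to one pose parameter
// (e.g. x in cm). The process/measurement noise values below are assumptions
// chosen only for illustration.
struct KF1D {
    double x = 0, v = 0;                       // state: position, velocity
    double P[2][2] = {{1e3, 0}, {0, 1e3}};     // state covariance
    double q = 0.01, r = 1.0;                  // process / measurement noise

    void predict(double dt) {
        // x' = x + v*dt, v' = v  (constant-velocity model)
        x += v * dt;
        double p00 = P[0][0] + dt * (P[1][0] + P[0][1]) + dt * dt * P[1][1] + q;
        double p01 = P[0][1] + dt * P[1][1];
        double p10 = P[1][0] + dt * P[1][1];
        double p11 = P[1][1] + q;
        P[0][0] = p00; P[0][1] = p01; P[1][0] = p10; P[1][1] = p11;
    }
    void update(double z) {                    // z: measured position from least squares
        double s = P[0][0] + r;                // innovation covariance
        double k0 = P[0][0] / s, k1 = P[1][0] / s;   // Kalman gain
        double innov = z - x;
        x += k0 * innov;
        v += k1 * innov;
        double p00 = (1 - k0) * P[0][0],     p01 = (1 - k0) * P[0][1];
        double p10 = P[1][0] - k1 * P[0][0], p11 = P[1][1] - k1 * P[0][1];
        P[0][0] = p00; P[0][1] = p01; P[1][0] = p10; P[1][1] = p11;
    }
};

int main() {
    KF1D kf;
    const double dt = 0.125;                   // approx. 8 Hz frame rate
    double zs[] = {0.9, 2.1, 2.9, 4.2, 5.0};   // synthetic noisy measurements (cm)
    for (double z : zs) {
        kf.predict(dt);
        kf.update(z);
        std::printf("filtered x = %.2f, v = %.2f\n", kf.x, kf.v);
    }
    return 0;
}

In the real system the same prediction/update structure runs over all six pose parameters, with the least-squares motion estimate of each frame acting as the measurement z.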
The rotational error with the Kalman filtering on is, by contrast, slightly larger. However, both of these rotational drifts are very small, 1.011 and 0.395 degrees, and they can easily be due to noise in the estimation process. Figure 7.14 represents the estimated trajectories in the presence of the Kalman filtering scheme. As shown in the top sub-figure of Figure 7.14, the robot moves along X for 3 meters in the forward direction and then returns to its starting point. The overall translational error for this experiment is about 2.65 centimeters. Figure 7.15 represents the estimated orientation at each frame for this experiment. The cumulative orientation error for this experiment is about 1°. The exact cumulative errors are shown in the first row of Table 7.7.

Figure 7.14: Estimated camera distances relative to the starting point for a 6 meter long path with the Kalman filter on.

Figures 7.16 and 7.17 show results for the same experiment when the Kalman filtering process is turned off. As represented in the second row of Table 7.7, the positional error is increased to 6.31 centimeters. The cumulative orientational error in this case is smaller than 1°.

Figure 7.15: Estimated camera orientations for a 6 meter long path with the Kalman filter on.

7.5 Computation Time

The presented system is implemented in Microsoft Visual C++ 6.0, on a 1.14 GHz AMD Athlon™ processor under the Microsoft Windows® operating system. The camera system captures gray scale images of 320x240 pixels at an 8 Hz rate. Considerable effort has been made to optimize the code and modularize the system in order to obtain fast subsystems with low communication cost and memory requirements.

The most severe drawback of the system is its high computational requirement, especially when finding and evaluating match correspondences. Computation is mostly spent on performing correlation between image patches. In the older version of the system, when unwarped images were used, the system had a speed of 2.8 Hz. However, when the system was changed to improve the accuracy by eliminating the unwarping process, and it was moved to outdoor environments, the running time of the system increased to 8.4 seconds per frame. This can be explained by several sources.

Figure 7.16: Estimated camera trajectories relative to the starting point for a 6 meter long path without the Kalman filter.

• Perhaps the most important reason for this problem is the considerably larger number of features outdoors compared to indoor environments. If n represents the number of corners, the stereo matching and construction stages have complexities of O(n²). Tracking n 3D features from one frame to another frame also has a complexity of O(n²).
This is due to the fact that both the tracking and stereo processes are heavily involved in the use of the normalized cross-correlation function for the purpose of measuring similarities. It must be pointed out that the number of features in Experiment 7.2.3 was 4 times higher than the number of features in Experiments 7.2.1 and 7.2.2. An increase in the number of features by a factor of 4 can therefore potentially increase the running time of the tracking and stereo tasks alone by a minimum factor of 16. The running time of the tracking task is thus expected to increase from 0.21 second to 3.36 seconds, and the stereo procedure from 0.057 second to 0.912 seconds. Since the tracking and stereo routines include several added functions that are required for the partial image unwarping process, in practice the running times of these two procedures increase to 5.1 and 1.35 seconds (as shown in Table 7.8). A dramatic increase in the running time of the system is therefore to be expected when moving to outdoor environments with a higher number of features. As shown in Table 7.8, about 80% of the time in Experiment 7.2.3 is spent on the tracking and the similarity evaluation of matched features.

• Another reason for the slower performance is the distance of the features from the camera. For the indoor environments, Experiments 7.2.1 and 7.2.2, the objects were located at distances ranging from 0.5 to 8 meters. This range is increased to 0.1 to 20 meters, with 60% of the features at distances of 0.125 to 0.4 meter from the camera, in the outdoor environments of Experiments 7.2.3 and 7.2.4. For the same amount of motion, closer features undergo larger positional changes, and therefore for the outdoor cases the search window had to be increased by 4 times (141 x 141 versus 71 x 71) [84]. Such larger search boundaries cause a higher number of match candidates and consequently increase the running time.

• Finally, when using warped images, features are detected and stored based on their positions in the warped images. However, they must be sorted based on their unwarped image locations right before the stereo algorithm. This sorting is necessary because, in the stereo process, the search is performed along epipolar lines. Also, when transferring world features from world coordinates to project them onto the previous image plane, they must once again be sorted based on their warped image locations. This is also necessary since the tracking is performed based on the features' warped image locations and contents. These sorting processes also increase the running time, although not by much.

Figure 7.18 shows the implemented software modules. In this figure, the modules in the first column (1.1, 1.2, 1.3 and 1.4) are performed only for the first frame. The rest of the modules are repeatedly performed for later frames. Table 7.8 represents the cost associated with each module for the typical outdoor scene in Experiment 7.2.3. The running time of the system is highly dependent on the scene, the number of features, and the imaging conditions. In this table, at the end of each module name there is a number which shows the corresponding block number in Figure 7.18.
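The per-module costs in Table 7.8 are obtained by timing each block repeatedly and averaging. A minimal harness for collecting such an average is sketched below; it is only an illustration, the timed function is a stand-in for one of the blocks of Figure 7.18, and it is not the instrumentation used for the thesis measurements.

#include <chrono>
#include <cmath>
#include <cstdio>

// Stand-in for one of the profiled modules; the real modules are the blocks
// of Figure 7.18.
static double dummyModule() {
    double s = 0;
    for (int i = 1; i < 100000; ++i) s += std::sqrt(static_cast<double>(i));
    return s;
}

int main() {
    const int runs = 100;                      // average over repeated runs
    volatile double sink = 0;                  // keep the work from being optimized away
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) sink = sink + dummyModule();
    auto t1 = std::chrono::steady_clock::now();
    double totalMs = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("mean time per call: %.4f ms (over %d runs)\n", totalMs / runs, runs);
    return 0;
}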
The displayed times are the result of 100 runs for each module.

Figure 7.18: System's modular flowchart for the time analysis. For the first frame only: detect features of the input images (1.1); sort features based on their row and column positions in the unwarped images (1.2); run the stereo algorithm to find world features in the current frame (1.3); initialize the global world feature database (1.4). For the second frame and every frame after that: read the input images (1); detect features of the input images (2); sort features based on their row and column positions in the unwarped images (3); run the stereo algorithm to find world features in the current frame (4); transfer world features into the previous robot position (5); project world features onto the previous image frame (6); compute mean and square values for features in the current and previous frames (7); track features between the current and previous frames (8); find the camera's motion (9); use the motion parameters to transfer the world features into the current frame (10); sort world features based on their unwarped image locations (11); project the world features after transformation onto the previous image frame (12); find correspondences between the two frames once again, over small search neighborhoods (13); estimate the camera's motion (14); update the robot's current position (15); move the current features into their corresponding global locations using the robot position and orientation (16); combine the transformed current world features with the existing features in the database (17); update the database features and retire them if necessary (18).

Table 7.8: CPU time spent on various functions of the trajectory tracking system.

  Function name                                              Spent time (ms/call)   No. of calls   Total time (ms)   % of total time
  Read input images (1)                                             0.95353               3              2.85606          0.0339
  Feature detection (2)                                            31.13                  3             93.39             1.1094
  Sort features (3)                                                 3.16                  1              3.16             0.0375
  Stereo algorithm (4)                                           1350.25                  1           1350.25            16.0403
  Transfer world features (5)                                       0.8336                1              0.8336           0.0099
  Project features onto the image (6)                               0.4202                1              0.4202           0.0050
  Compute mean and square values (7)                              218.5476                2            437.0952           5.1925
  Track features (8)                                             5109.3                   1           5109.3             60.6959
  Find the motion (9)                                               8.132                 1              8.132            0.1198
  Transform world features using the motion (10)                    0.82037               1              0.82037          0.0966
  Sort features using their unwarped locations (11)                 0.10125               1              0.10125          0.0097
  Project features after transformation on the image (12)           0.78925               1              0.78925          0.0012
  Find correspondences (small neighborhood) (13)                    5.1716                1              5.1716           0.0614
  Estimate the motion (14)                                          6.16                  1              6.16             0.0732
  Update robot's current location (15)                              0.3411                1              0.3411           0.0041
  Transfer from current to the global (16)                          5.51                  1              5.51             0.0655
  Combine features (17)                                            91.43                  1             91.43             1.0861
  Update database and retire if possible (18)                       0.1                   1              0.1              0.0012
  Miscellaneous                                                  1302                     1           1302               15.4671
  Total                                                                                               8417.9            100

From this table, it is concluded that the most time consuming modules are the stereo and tracking routines. The main core of these two modules is the correlation (NSSD) function. The tracking process, block (8), requires over 60% of the processing time. In this part of the system, since there is no knowledge about the motion, a fairly large search neighborhood was required. Since the number of match candidates increases with the search window dimensions, it is interesting to see the effect of the search boundary on the running time.
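Each candidate inside that search boundary is scored with the correlation function, so the per-candidate cost is worth making explicit. One common form of a normalized SSD between two patches is sketched below; the exact normalization used by the system is defined in an earlier chapter, so the zero-mean, unit-variance form shown here is an assumption made for illustration, and the patch values are synthetic.

#include <cmath>
#include <cstdio>
#include <vector>

// Normalized SSD between two equally sized patches: both patches are shifted
// to zero mean and scaled to unit variance before the squared differences are
// summed, so a pure brightness offset does not change the score.
double nssd(const std::vector<double>& a, const std::vector<double>& b) {
    const size_t n = a.size();
    double ma = 0, mb = 0;
    for (size_t i = 0; i < n; ++i) { ma += a[i]; mb += b[i]; }
    ma /= n; mb /= n;
    double va = 0, vb = 0;
    for (size_t i = 0; i < n; ++i) {
        va += (a[i] - ma) * (a[i] - ma);
        vb += (b[i] - mb) * (b[i] - mb);
    }
    double sa = std::sqrt(va / n) + 1e-9, sb = std::sqrt(vb / n) + 1e-9;
    double score = 0;
    for (size_t i = 0; i < n; ++i) {
        double d = (a[i] - ma) / sa - (b[i] - mb) / sb;
        score += d * d;
    }
    return score;   // lower is more similar
}

int main() {
    // Two synthetic 3x3 patches; b is a brightened copy of a, so its
    // normalized score stays near zero, while the scrambled patch c does not.
    std::vector<double> a = {10, 20, 30, 40, 50, 60, 70, 80, 90};
    std::vector<double> b = {30, 40, 50, 60, 70, 80, 90, 100, 110};
    std::vector<double> c = {90, 10, 70, 20, 60, 30, 80, 40, 50};
    std::printf("nssd(a,b) = %.4f  nssd(a,c) = %.4f\n", nssd(a, b), nssd(a, c));
    return 0;
}

Because the mean and variance of a reference patch can be computed once and reused for every candidate it is compared against, precomputing them (block 7 in Figure 7.18) removes part of this cost from the inner search loop.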
Figure 7.19 represents the r u n n i n g time for our system based on the search dimension i n block 8 Figure 7.18.  W i n d o w w i d t h a n d height [pixel] Figure 7.19: C o m p a r i s o n of the running time versus search window dimensions for the tracking module.  A s expected when the search size decreases a faster performance achieved.  Another  time consuming process, block 7, calculates the average and square intensity values of the compared candidate patches used i n the correlation process. Obviously, by using a faster processor, the r u n n i n g time of the system can be lowered. To conclude this discussion, it is i m p o r t a n t to see the trade off between the system processing rate w i t h the m o t i o n rate and search window dimensions i n the tracking process. A smaller m o t i o n , between two consecutive frames, results i n smaller displacements of image  7.6 Chapter  Summary  128  features i n two corresponding image frames. In such condition, corresponding features can be found by searching over smaller regions. A s shown i n F i g u r e 7.19, smaller windows speed up the system processing rate. Therefore, through a slower m o v i n g robot a faster performance can be achieved.  7.6  Chapter Summary  In this chapter the implementation results are presented.  T h e improvement of the system  accuracy by employing a combined use of images before and after lens distortion removal was shown through several indoor and outdoor experiments. T h e comparison of the trinocular and binocular stereo is also presented. It is concluded that the use of binocular system can still provide a good accuracy. T h e effectiveness of the K a l m a n filter over a long range of m o t i o n is presented. T h e cost of different modules is also discussed.  Chapter 8 Conclusions and Future Work T h i s chapter summarizes the m a i n contributions of this thesis. It also discusses the scope for future work a n d extensions to this thesis for further improvement.  8.1  Major Contributions  T h i s thesis has presented important contributions for the successful development of a general purpose 3 D trajectory tracking problem for u n k n o w n indoor and outdoor environments. It requires no modifications to be made to the scene and it is not dependent on any prior information about the scene. T h e system operates using a small fraction of the scene features that can potentially carry m a x i m u m information about the camera m o t i o n . There are no constraints imposed on the camera m o t i o n or the scene features other t h a n the rigidity of most of the features.  T h e system reduces the time that is generally spent for the lens  distortion correction process by removing i t only for a number of selected feature points (at most only 2% of the entire image pixels). T h i s process minimizes the positional projection error of image features, corresponding to identical scene features, at different time instances. T h i s improved positional accuracy allows one to obtain the camera's position and m o t i o n parameters w i t h a higher accuracy. Moreover the system implements an active vision-based  129  8.1 Major Contributions  130  3 D trajectory tracking system that is affordable and can be easily fitted to satisfy different requirements. T h e cumulative positional error of the system is less t h a n 1% i n b o t h position and angle. Four m a i n objectives were sought i n this thesis.  
• 3 D estimation: T h i s requirement is fully satisfied and at each estimation six m o t i o n parameters are estimated. • Scene independence: T h i s c o n d i t i o n for indoors and outdoors w i t h the exception of the following pathological cases are achieved: I. T h e environment must be textured and the details of the textured surfaces must be large enough to be identified i n d i v i d u a l l y i n scene images. II. N o significant surface specularity such as objects w i t h polished mirror-like surfaces. III. P e r i o d i c scene features (picket fence effect). I V . Significant effects of occlusions. • H i g h accuracy: T h e system performs w i t h an accuracy of 1% i n b o t h rotational and translational movements. A l t h o u g h we do not have the % error for a system w i t h a similar configuration as ours, our system appears to be more accurate than some of the prior similar works (e.g. H a r r i s [103], who we estimated, achieved a 3.2% error). • Real-time performance: T h e real-time performance of the system is not achieved as there is a trade off between the performance accuracy and speed. Improvement of the accuracy was given a higher p r i o r i t y i n this work. Suggestions for future improvement of the speed are represented i n Section 8.2.  T h e major contributions of this thesis can be summarized as described below:  8.1 Major Contributions  8.1.1  131  Fast feature detection  A novel fast feature detection algorithm named the B i n a r y Corner Detector ( B C D ) is developed. T h e performance quality of this a l g o r i t h m is compared w i t h the most commonly used method (Harris corner detector) as well as the method inspiring it ( S U S A N corner detector). T h e result of this comparison indicates that it performs considerably faster than b o t h methods while m a i n t a i n i n g a high stability under different environmental and imaging conditions. T h e faster performance is gained by substituting arithmetic operations w i t h logical ones. Since the m a i n assumption for the whole system has been that temporal changes between consecutive frames are not large, a faster feature detector leads to less temporal changes between the consecutive frames a n d therefore resulting i n a higher accuracy i n the overall system.  8.1.2  3D feature reconstruction with minimal positional error  Reconstruction of world 3 D features using the pinhole camera model and stereo algorithm is an essential part of the trajectory t r a c k i n g approach that was undertaken i n this thesis. Due to imperfect lenses, the acquired images include some distortions that are generally corrected through the calibration process. N o t only is the image c a l i b r a t i o n at each frame for three images a time consuming process but it could add positional shifts (error) to image pixels. T h i s process degrades 3 D reconstruction results and increases the cumulative error i n the overall trajectory tracking process.  T o remove this undesired effect, a calibration  m a p for each one of the cameras is constructed that defines the relationship between the integer position of the uncalibrated pixels w i t h the corresponding floating point location on the calibrated image. T h i s process allows one to work w i t h sharper images. It also provides a more accurate disparity estimation up to several pixels as well as a faster processing time by e l i m i n a t i n g the calibration process for three i n d i v i d u a l images.  
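A small sketch can make the calibration-map idea concrete. The layout below is an assumption made for illustration (a per-camera table of floating-point destinations indexed by integer raw-pixel position), and the map values are fabricated placeholders rather than a real distortion model.

#include <cstdio>
#include <vector>

// Per-camera calibration map: for every integer pixel of the raw (distorted)
// image, the table stores the floating-point location that pixel maps to in
// the undistorted image. Features are detected and matched on the raw image,
// and only the matched feature coordinates are pushed through the table
// before the 3D reconstruction.
struct Point2f { float x, y; };

struct CalibrationMap {
    int width, height;
    std::vector<Point2f> table;                       // width*height entries
    Point2f unwarp(int u, int v) const { return table[v * width + u]; }
};

int main() {
    const int w = 4, h = 3;
    CalibrationMap map{w, h, std::vector<Point2f>(w * h)};
    // Placeholder map: a small, position-dependent shift standing in for the
    // real lens-distortion correction.
    for (int v = 0; v < h; ++v)
        for (int u = 0; u < w; ++u)
            map.table[v * w + u] = {u + 0.1f * u, v - 0.05f * v};

    // Only the handful of matched feature pixels are corrected, instead of
    // resampling all three full images every frame.
    int features[][2] = {{1, 1}, {3, 2}};
    for (auto& f : features) {
        Point2f p = map.unwarp(f[0], f[1]);
        std::printf("raw (%d,%d) -> undistorted (%.2f,%.2f)\n", f[0], f[1], p.x, p.y);
    }
    return 0;
}

Because only the matched feature coordinates are looked up, the cost of the correction scales with the number of features (at most about 2% of the image pixels) rather than with the image size.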
8.2 Future 8.1.3  Research  132  Multi-stage feature tracking  T r a c k i n g identical 3 D world features at different times is a key issue for accurate 3 D trajectory estimation. Correct identification of such features however, depends on several factors such as search boundaries, similarity measurement window size, and a robot's m o t i o n range. E x p a n d i n g search boundaries and the window size for similarity evaluation can improve the accuracy by adding more correct matches. T h e y can however slow d o w n the performance, leading to a larger m o t i o n between two consecutive frames. A larger m o t i o n introduces more inaccuracy into the system. In order to improve the accuracy, a two-stage tracking scheme is introduced i n which the m a t c h correspondences are first found using a large search window and a smaller similarity measurement window. T h r o u g h this set of correspondences a rough estimation of the m o t i o n is obtained. These m o t i o n parameters are used i n the second stage to find and track identical features w i t h higher accuracy. T h e process increases the tracking accuracy by up to 30%.  8.2  Future Research  There are several aspects of the trajectory tracking system that can be improved or extended to make the system more useful and reliable. Some example are o u t l i n e d below:  8.2.1  Combined wide and narrow views  T h e type of the lenses i n the stereo set has an i m p o r t a n t role on the overall accuracy as well as the real-time performance of the system.  A wider field of view can improve the  speed since its coarser image resolution leads to a smaller search window which can recover motions w i t h larger ranges. A narrower field of view on the other h a n d can provide more precise location information. T h i s however w i l l decrease the speed as wider search windows are required. A s presented i n Section 7.3 the orthogonal arrangement of the cameras w i t h equal baselines does not affect the accuracy of the system significantly. For this reason, it  133  8.2 Future Research  seems more reasonable to substitute the trinocular stereo set w i t h two i n d i v i d u a l binocular stereo systems w i t h two different lens types. T h e m o t i o n can be estimated first through the wider set w i t h the lower resolution w i t h a smaller search window depending on the w i d t h of the view. T h i s result can be used as an i n i t i a l estimate of the m o t i o n and can be refined w i t h a much higher accuracy through the narrower lens set i n a timely manner.  8.2.2  Orthogonal views  One of the general existing problems for our system is the increase of the translational error w i t h rotational movements.  T h i s m a i n l y originates from the fact that points at different  locations i n the environment show different behaviors under varying motions. T h i s can be explained i n F i g u r e 8.1. F o r a rotational m o t i o n of a the behavior of point A is the same as moving toward left w i t h amount of A x . A d d i n g a second set of stereo cameras w i t h an image plane perpendicular to the first pair can prevent this. W i t h a pure translation toward left, point A still remains fixed i n the image of the second camera set, while w i t h a rotation a toward left i t moves toward left o n the second image plane. A s suggested i n Section 8.2.1, the use o f two sets of i n d i v i d u a l stereo systems w i t h different lenses can improve the accuracy. 
A r r a n g i n g the two sets i n an orthogonal way, as explained, can prevent the coupling effect between the translation and the rotation.  8.2.3  Hierarchical structure  T h e most time consuming process of the system was the establishment of feature correspondences for consecutive frames. T h i s deficiency becomes more costly as m o v i n g from indoor scenes to outdoor scenes increases the number of features dramatically. O n e possible solution to this problem is the establishment of a hierarchical approach i n feature extraction and tracking as well as i n m o t i o n estimation. F o r this purpose, input images are resampled to create a n image resolution p y r a m i d . A t each time, features are detected on the coarser level and using them a rough m o t i o n estimation is obtained. T h e estimated m o t i o n along  134  8.2,Future Fiesearch  B'  B  '  r  v  (  * —  A  C a m e r a set 1  Figure 8.1: A second set of camera overcomes the coupling effect between translation and rotation. w i t h a finer image from a lower level of the p y r a m i d w i l l be employed to refine the m o t i o n parameters. T h i s method reduce the computation by the 4'th power of the reduction i n the resolution. For instance reduction by a factor 2 reduces the image sizes to 1/4 of the original sizes. Since there is a d u a l role i n treating the image features this process w i l l speed up the computation by a factor of 16.  8.2.4  Sub-image processing  A n o t h e r possible way for i m p r o v i n g the processing rate is to select and process only selected patches of each image instead of the entire image.  In order to obtain the most possible  information from these image patches, they should be chosen from specific areas. A n example of such areas are shown i n Figure 8.2. Patches from areas close to image edges help to separate the rotation from the sideway translation, while a patch from the center provides features w i t h higher 3 D positional accuracies and therefore results i n more accurate m o t i o n estimation. It is also possible to choose these image patches based on the number of their features. However these patches must be a carefully distributed to avoid false information provided by the unexpected motion of an unknown object i n a certain part of the scene.  8.2 Future Research  135  Figure 8.2: Processing small image patches can reduce the processing time.  8.2.5  Cumulative error reduction  A n inherent characteristic of the system is the estimation error which is accumulated over time and distance. Although estimated transformation between each two frames is reliable, over a long period of time the system must be able to correct this error and reset itself. Integrating other sensors such as odometers can help to find the motion between frames more accurately and therefore, the system would perform more accurate estimation while there is a fast motion. Another possibility could be the creation of a map of the environment and retaining specific frames (for example every 10'th frame) and their corresponding location in a database.  As the system moves around if it becomes close to a place that has been  traversed before, the saved frames and their location can be used to decrease the cumulative error.  8.2 Future Research  8.2.6  136  Specialized hardware  Employment of specific hardware (e.g. F P G A ' s ) that allows the system to perform bitwise parallel operations on the generated binary images can improve the speed of the system. 
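Even in software, the kind of bitwise, parallel comparison that such hardware would accelerate can be sketched with ordinary integer operations: a small binary patch is packed into a machine word and similarity is measured by a population count on the XOR of two words. This is a generic illustration rather than any specific correlator from the literature or the detector used in this thesis.

#include <cstdint>
#include <cstdio>

// Count the set bits of a 64-bit word by repeatedly clearing the lowest one.
static int popcount64(uint64_t x) {
    int c = 0;
    while (x) { x &= x - 1; ++c; }
    return c;
}

// Similarity in [0, 64]: number of pixels whose binary value agrees between
// two 8x8 binary patches packed into single 64-bit words.
int binarySimilarity(uint64_t patchA, uint64_t patchB) {
    return 64 - popcount64(patchA ^ patchB);
}

int main() {
    uint64_t a = 0xF0F0F0F0F0F0F0F0ULL;        // synthetic binary patch
    uint64_t b = a ^ 0x0000000000000003ULL;    // same patch with two bits flipped
    std::printf("similarity(a,a) = %d\n", binarySimilarity(a, a));   // 64
    std::printf("similarity(a,b) = %d\n", binarySimilarity(a, b));   // 62
    return 0;
}

On hardware or processors with a native population-count operation, each 64-pixel comparison reduces to a couple of machine instructions, which is the source of the expected speed-up.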
Bitwise operations can increase the speed of the feature detection sub-system.  T h e y can  also be used to substitute the expensive N S S D on the intensity images w i t h a binary crosscorrelator as described by [109].  Bibliography [1] J . K . Aggarwal and N. Nandhakumar. On the computation of motion from sequences of images - a review.  Proceedings of the IEEE,  pages 917-935, 1988.  [2] C S . Andersen, O B . Madsen, T . Johannesson, and O. Stefansson. Landmark based  Proceedings of the 1998 Mobile-Robots Transportation Systems, pages 170-181, 1998.  XIII and  navigation strategies.  Intelligent  [3] A . A . Argyros and F . Bergholm. Combining central and peripheral vision for reactive  Proceedings of the 1999 IEEE Computer Computer Vision and Pattern Recognition, pages 646-651,  robot navigation.  Society  Conference  on  1999.  [4] N . Ayache and C . Hansen. Rectification of Images for Binocular and Trinocular Stereo Vision.  International  Conference  on Computer  Vision  and Pattern  Recognition,  pages  11-16, 1988. [5] Y . Bao, N . Fujiwara, and Z. Jiang. A Method of Location Estimating for a Vehicle by Using Image Processing.  IEEE International  Systems, pages 781-786,  Conference on Intelligent  Transportation  1998.  [6] D. Bapna, E . Rollins, J . Murphy, M . Maimone, W . R . L . Whittaker, and D. Wettergreen. The atacama desert trek: Outcomes. In  Proceedings of ICRA 1998,  volume 1, pages  597 - 604, May 1998. [7] J . Bares and D. Wettergreen. learned.  Dante ii: Technical description, results and lessons  International Journal of Robotics Research, 137  18(7):621-649, July 1999.  BIBLIOGRAPHY  138  [8] E . B . Barrett, M . H . B r i l l , N N . N i l s , and P . M . Payton.  Computer Vision,  Geomoteric Invariance in  pages 277-292. M I T Press, 1992.  [9] R . B a s r i and E . R i v l i n . L o c a l i z a t i o n and h o m i n g using combinations of model views. A I , 78(l-2):327-354, October 1995. [10] R . B a s r i , E . R i v l i n , and I. Shimshoni. Image-based robot navigation under the perspective model.  IEEE International  Conference  on Robotics  and Automation,  pages  Intelligent Vision Systems for Industry.  Springer-  2578-2583, 1999. [11] B . G . Batchelor and P . F . W h e l a n . Verlag, 1997. [12] C . Becker, J . Salas, K . Tokusei, and J . L a t o m b e . Reliable navigation using landmarks.  IEEE International  Conference  on Robotics  and Automation,  pages 401-406, 1995.  [13] M . Betke and L . G u r v i t s . M o b i l e robot localization using landmarks.  tional Conference  on Robotics and Automation,  IEEE Interna-  pages 251-263, 1997.  [14] M . Betke, E . H a r i t a o g l u , and L . S . D a v i s . Highway scene analysis i n hard real-time.  Proceedings  of the 1997 IEEE  Conference  on Intel igent  Transportation  Systems,  pages  812-817, 1997. [15] P . R . B e v i n g t o n and D . K . Robinson.  ical Sciences. M c G r a w - H i l l ,  Data Reduction and Error Analysis for the Phys-  Massachusetts,  1992.  [16] R . C . Bolles and M . A . Fischler. A ransac-based approach to model fitting and its application to finding cylinders i n range data. In  Proc. of the 7th IJCAI,  pages 6 3 7 -  643, Vancouver, C a n a d a , 1981. [17] L . Gottesfeld B r o w n . A Survey of Image Registration Techniques.  Technical report,  Department of C o m p u t e r Science, C o l u m b i a University, N e w Y o r k , N Y 10027, 1992. [18] M . J . B u c k i n g h a m .  Noise in Electronic Devices and Systems.  Electronic Engineering. 
E l l i s H o r w o o d / W i l e y , 1983.  Series i n E l e c t r i c a l and  BIBLIOGRAPHY  139  [19] G . B u r d e a , G . Patounakis, V . Popescu, and R . E . Weiss. V i r t u a l reality-based training for  the diagnosis of prostate cancer.  IEEE Transactions on Biomedical Engineering,  46(10):1253-1260, October 1999. [20] W . B u r g a r d , A . B . Cremers, D . Fox, D . H a h n e l , G . Lakemeyer, D . Schulz, W . Steiner, and S. T h r u n . T h e interactive museum tour-guide robot.  Conference  on Artificial  Intel igence,  In  Proc. of the National  1998.  [21] M . F . Montenegro C a m p o s and L . de Souza Coelho. A u t o n o m o u s dirigible navigation using visual tracking and pose estimation.  and Automation, 4:2584-2589,  IEEE International Conference on Robotics  November 1999.  [22] F . Chaumette, P . B a n k s , and B . E s p i a u . P o s i t i o n i n g of has robot w i t h respect to year object, alignment it and estimating its velocity by visual servoing. I n  one Robotics and Automation, ICRA' 91, volume  IEEE Int Conf  3, pages 2248 - 2253, Sacramento,  California, A p r i l 1991. [23] C . K . C h u i and G . C h e n .  Kalman Filtering with Real-time Applications.  Springer-  Verlag, N e w Y o r k , 1991. [24] J . Clarke. M o d e l l i n g uncertainty: A primer, 1998. [25] D . Cobzas, H . Zhang, and M . Jagersand.  Image-Based L o c a l i z a t i o n w i t h D e p t h -  Proceedings of IEEE International ICRA 03, Sept. 2003.  Enhanced Image M a p .  and Automation,  Conference  on  Robotics  [26] S. Comtois, D . Laurendeau, and E . Bourget. A real-time computer vision system for tracking of underwater benthic larvas. In  MVA2002,  Tokyo, J a p a n , Dec. 11-13 2002.  [27] D . D e m i r d j i a n and T . D a r r e l l . M o t i o n E s t i m a t i o n from D i s p a r i t y Images.  of IEEE  Conference  on Computer  Vision  (ICCV01),  Proceedings  1:213-218, 2001.  [28] R . Deriche and G . G i r a u d o n . A computational approach for corner and vertex detection.  IJCV,  10(2):101-124, 1993.  BIBLIOGRAPHY  140  [29] E . D . D i c k m a n n s , B . D . Mysliwetz, and T . Christians. A n integrated spatio-temporal approach to automatic visual guidance of autonomous vehicles. S M C , 20(6):1273-1284, November 1990. [30] B . E s p i a u , F . Chaumette, and P . Rives. A new approach to visual servoing i n robotics.  IEEE Trans, on Robotics and Automation,  8(3):313-326, June 1992.  [31] M . E t o h and Y . Shirai. Segmentation and 2d m o t i o n estimation by region fragments.  Proceedings  of Fourth  International  Conference  on Computer  Vision,  pages 192-199,  May 1993. [32] P . F u a .  Machine Vision and Applications.  Springer-Verlag, 1993.  [33] J . C . G a r c i a G a r c i a , M . M a r r o n R o m e r a , M . M a z o Quintas, and J . U r e n a U r e n a . P o s i t i o n i n g and localization system for autonomous wheelchairs.  ference of the Industrial [34] A . G e l b .  Electronics  Society,  Applied Optimal Estimation.  IEEE 28th Annual Con-  2:1555-1560, November 2002. T h e M I T Press, Massachusetts, 1974.  [35] N . Goncalves and H . Araujo. E s t i m a t i o n of 3d m o t i o n from stereo images-uncertainty analysis and experimental results.  Robots and System,  IEEE/RSJ International Conference on Intelligent  1:7-12, 2002.  [36] M . S . G r e w a l and A . P . Andrews.  Kalman Filtering, Theory and Practice.  Prentice  H a l l , N e w Jersey, 1993. [37] P . G u n a t i l a k e and M. Siegel. 
Remote enhanced visual inspection of aircraft by a mobile robot. In  Proceedings of the 1998 IMTC Conference, pages 49 -  [38] R . C . H a r r e l l , D . C . Slaughter, and P D . A d s i t .  58. I E E E , M a y 1998.  A fruit-tracking system for robotic  harvesting. M V A , 2(2):69-80, 1989. [39] C . Harris.  Geometry from Visual Motion.  [40] C . H a r r i s and M . Stephens.  Alvey Vision Conference,  A c t i v e V i s i o n , M I T Press, Cambridge, 1992.  A combined corner and edge detector.  pages 147-151, 1988.  Proceeding 4'th  BIBLIOGRAPHY  141  [41] B . K . P. H o r n . Tsai's camera calibration method revisited. Technical report, M I T , Massachusetts, [42] B . K . P . H o r n .  2000.  Robot Vision.  T h e M I T Press, Massachusetts,  1992.  [43] S. H u , E . A . Hoffman, and J . M . R e i n h a r d t . A u t o m a t i c lung segmentation for accurate quantitation of volumetric X - r a y C T images.  IEEE Trans. Medical Imaging, 20(6):490-  498, June 2001. [44] D . P . Huttenlocher, G . A . K l a n d e r m a n , and W . J . Rucklidge. C o m p a r i n g images using the Hausdorff distance.  gence, pages  IEEE Transactions  on Pattern  Analysis  and Machine  Intelli-  850-864, 1993.  [45] P o i n t G r e y Research Inc. Triclops Stereo V i s i o n System. Technical report, Department of C o m p u t e r Science, University of B r i t i s h C o l u m b i a , Vancouver, www.ptgrey.com, 1997. [46] M . Irani, B . Rousso, and S. Peleg. Recovery of ego-motion using image stabilization.  IEEE  International  Conference  on Computer  Vision  and Pattern  Recognition,  pages  454-460, 1994. [47] R . Joshiand and A.C. Sanderson. A p p l i c a t i o n of feature-based m u l t i - v i e w servoing for l a m p filament alignment.  IEEE International  Conference  on Robotics  and  Automation,  pages 1306-1313, 1996. [48] T . K a n a d e , A . Y o s h i d a , K . O d a , H . K a n o , and M . T a n a k a . A stereo machine for videorate dense depth mapping and its new applications.  Computer  Society  Conference  on Computer  Vision  Proceedings of the 1996 IEEE and Pattern Recognition, pages  196-202, 1996. [49] S . B . K a n g , J . A . Webb, C . Z i t n i c k , and T . K a n a d e .  A multibaseline stereo system  w i t h active i l l u m i n a t i o n and real-time image acquisition. In  International  Conference on Computer  Vision (ICCV '95),  Proceedings of the Fifth  pages 88-93, June 1995.  BIBLIOGRAPHY  142  [50] G . A . K a n t o r , D . Wettergreen, J . P. Ostrowski, and S. Singh. C o l l e c t i o n of environmental  Proceedings of the SPIE Conference on Sensor Control in Robotic Systems IV, volume 4571, pages 76-83,  d a t a from an airship platform. In  Fusion and Decentralized  B e l l i n g h a m , W A , October 2001. S P I E . [51] D . K i m , T . M . K i n t e r , and J . F . Greenleaf. C o r r e l a t i o n search m e t h o d w i t h third-order statistics for c o m p u t i n g velocities from ultrasound images.  Symposium, pages  IEEE 1989 Ultrasonics  110-111, 1989.  [52] L . K i t c h e n and A . Rosenfeld. Gray-level Corner Detection. Technical report, C o m p u t e r V i s i o n Laboratory, C o m p u t e r science Center, University of M a r y l a n d , College Park, M D 20742, U S A , 1982. [53] D . K o r t e n k a m p , M . Huber, C B . C o n g d o n , S . B . Huffman, C R . B i d l a c k , C J . Cohen, and F.V. Koss-Frank. Integrating obstacle avoidance, global path planning, visual cue detection, and landmark triangulation i n a mobile robot .  
International  Society for Optical Engineering,  Proceedings of SPIE The  pages 515-522, 1993.  [54] D . K o r t e n k a m p and T . W e y m o u t h . Topological m a p p i n g for mobile robots using a combination of sonar and vision sensing. In  Proc. of A A AI-94,  pages 979-984, Seattle,  W A , 1994. [55] E . K r o t k o v . M o b i l e robot localization using a single image.  ference on Robotics and Automation,  IEEE International Con-  pages 978-983, 1989.  [56] S. K u r o g i , Y . Fuchikawa, T . Ueno, K . M a t s u o , and T . N i s h i d a . A method to measure 3d positions of elevator buttons from a mobile robot using a 2d artificial landmark, a laser navigation system and a competitive neural net.  on Neural Information Processing.,  IEEE 9th International Conference  4:2122-2126, November 2002.  [57] M . K . Leung, Y . L i u , and T . S . Huang. E s t i m a t i n g 3d vehicle m o t i o n i n an outdoor scene from monocular and stereo image sequences. W V M , 91:62-68.  143  BIBLIOGRAPHY  [58] C . C . L i n and R . L . T u m m a l a .  M o b i l e robot navigation using artificial landmarks.  Journal of Robotic Systems, pages 93-106, 1997. [59] J . J . L i t t l e , J . L u , a n d D . R . M u r r a y . Selecting stable image features for robot localization using stereo. IEEE/RSJ  International Conference on Intelligent Robots and  Systems, pages 1072-1077, 1998. [60] Y . L i u , T . S . H u a n g , a n d O . D . Faugeras. Determination of camera location from 2-d to 3-d line a n d point correspondences. IEEE Transactions on PAMI, 12:28-37, 1990. [61] D a v i d G . Lowe. Object recognition from local scale-invariant features. In International Conference on Computer Vision, pages 1150-1157, Corfu, Greece, September 1999. [62] D . G . Lowe. Artificial Intelligence, Three-Dimensional Object Recognition from Single Two-Dimensional Images. Elsevier Science Publishers B . V . ( N o r t h - H o l l a n d ) , 1987. [63] D . G . Lowe. F i t t i n g Parameterized Three-dimensional M o d e l s to Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 441-450, 1991. [64] D . G . Lowe. Robust model-based m o t i o n tracking through the integration of search and estimation. Technical Report T R - 9 2 - 1 1 , 1992. [65] R . M a n d u c h i and G . A . M i a n . A c c u r a c y analysis for correlation-based image registration algorithms. I n ISCAS, pages 834-837, 1993. [66] D . M a r r .  Vision: a computational investigation into the human representation and  processing of visual information. S a n Francisco: W . H . Freeman, 1982. [67] H . Moravec. Obstacle avoidance a n d navigation i n the real world by a seeing robot rover. Technical report, Robotics Institute, Carnegie-Mellon University, 1980. [68] N . R . M u d i g o n d a , R . M . Rangayyan, a n d J . E . L . Desautels. Detection of breast masses i n mammograms by density slicing and texture flow-field analysis. IEEE Trans. Medical Imaging, 20(12):1215-1227, 2001.  BIBLIOGRAPHY  144  [69] A . J . M u n o z and J . Gonzalez. Two-dimensional landmark-based position estimation from a single image.  IEEE International  Conference  on Robotics and  Automation,  pages 3709-3714, 1998. [70] D . M u r r a y and A . B a s u . M o t i o n tracking w i t h an active camera.  on Pattern Analysis  and Machine Intelligence,  IEEE Transactions  pages 449-459, 1994.  [71] R . M u r r i e t a - C i d , M . B r i o t , and N . V a n d a p e l . L a n d m a r k identification and tracking i n natural environment.  
Systems, pages  IEEE/RSJ  International  Conference  on Intelligent  Robots  and  179-184, 1998.  Proceedings Reality Based  [72] P . J . Narayanan and T . K a n a d e . V i r t u a l worlds using computer vision. I n  of the 1998 IEEE and ATR Workshop on Computer Vision for Human Communications, pages 2 - 13, January 1998. [73] S . B . Nickerson and M . J e n k i n . environment.  Virtual  A R K : Autonomous mobile robot for an i n d u s t r i a l  Intelligent autonomous systems IAS-3,  pages 288-296, 1993.  [74] M. O k u t o m i and T . K a n a d e . A multiple-baseline stereo.  ' Analysis and Machine Intelligence,  IEEE Transactions on Pattern  pages 353-363, 1993.  [75] F . Pedersini, A . S a r t i , and S. Tubaro. E g o m o t i o n estimation of a m u l t i c a m e r a system through line correspondences.  ICIP,  B: 175-178.  [76] R . L . Peterson, R . E . Ziemer, and D . E . B o r t h .  munications.  Introduction to Spread Spectrum Com-  P r e n t i c e - H a l l , N J , 1995.  [77] W . H . Press, S . A . Teukolsk, W . T . Vetterling, and B . P . Flannery.  C: the art of scientific computing.  Numerical recipes in  C a m b r i d g e University Press, 1992.  [78] P. Rives and J . J . Borrelly. R e a l time image processing for image-based visual servoing.  ICRA98 IEEE  Notes Intl.  for workshop Conf. on Robotics  WS2 (Robust Vision for Vision-Based and Automation, May 1998.  Control  of  Motion),  BIBLIOGRAPHY  145  [79] P. Saeedi, P. Jacobsen, K . A r d r o r i , P. D . Lawrence, and D . G . Lowe. P a t h tracking  9th Aerospace Division Inand Operations in Challenging  control for tracked based machines using sensor fusion.  ternational Conference on Engineering, Environments, M a r c h 2004.  Construction  [80] P. Saeedi, P. Lawrence, and D . Lowe. 3 D m o t i o n tracking of a mobile robot i n a natural environment.  IEEE Int. Conf. Robotics and Automation,  pages 1682-1687, 2000.  [81] P. Saeedi, P . D . Lawrence, D . G . Lowe, a n d P. Jacobsen. Multi-Sensor Track Controls for  ISSS-CCCT03 International Conference on Computer, Technologies, Orlando, Florida, USA, J u l y 2003.  Excavator Based Machines.  Communication  and Control  [82] P. Saeedi, P . D . Lawrence, and D . G . Lowe. A n efficient binary corner detector. 7 t h I n -  ternational  Conference  on Control,  Automation,  Robotics  and  Vision,  2002,  ICARCV,  pages 338-343, 2002. [83] P. Saeedi, P . D . Lawrence, D . G . Lowe, P. Jacobsen, D . Kusalovic, and K . A r d r o n . A V i s i o n - B a s e d M o t i o n and Slippage Controller for C o n t r o l of a Tracked M o b i l e R o b o t  . Accepted  by IEEE  Transaction  on Control  Systems  Technology.  [84] P. Saeedi, D . G . Lowe, and P . D . Lawrence. 3 D localization and tracking i n unknown  Proceedings of IEEE International tion, ICRA OS, 1:1297 - 1303, Sept. 2003.  environments.  Conference  on Robotics  and  [85] C . Schmid and R . M o h r . L o c a l G r e y value Invariants for Image Retrieval.  Conf. Pattern Analysis  and Machine Intelligence,  AutomaIEEE Int.  pages 530-535, 1997.  [86] C . Schmid, R . M o h r , and C . Bauckhage. C o m p a r i n g and E v a l u a t i n g Interest Points.  Int. Conf. Computer Vision,  pages 230-235, 1998.  [87] S. Se, D . Lowe, and J . L i t t l e . Vision-based mobile robot localization and m a p p i n g using scale-invariant features.  Automation,  Proceedings  of IEEE International  pages 2051-2058, 2001.  Conference  on Robotics  and  BIBLIOGRAPHY  146  [88] I . K . Sethi a n d R . J a i n . 
F i n d i n g trajectories of feature points i n a monocular image sequence. IEEE Transactions Pattern Analysis and Machine Intelligence, pages 56-73, 1987. [89] L . S . Shapiro, A . Zisserman, a n d M . Brady.  3d m o t i o n recovery v i a affine epipolar  geometry. International Journal of Computer Vision, 16:147-182, 1995. [90] J . S h i a n d C . Tomasi.  G o o d features to track.  IEEE International Conference on  Computer Vision and Pttern Recognition, pages 593-600, 1994. [91] J . Shieh, H . Zhuang, a n d R . Sudhakar.  M o t i o n estimation from a sequence of stereo  images: A direct method. IEEE Transaction on Systems Man and Cybernetics, SMC, pages 1044-1053, J u l y 1994. [92] C W . S h i n and S. Inokuchi. robots.  E s t i m a t i o n of optical-flow i n real time for navigation  Proceedings of the IEEE International Symposium on Industrial Electronics,  pages 541-546, 1995. [93] C . N . S h y i , J . Y . Lee, and C H . C h e n .  R o b o t guidance using standard mark.  IEEE  Electronics Letters, pages 1326-1327, 1988. [94] R . S i m a n d G . Dudek. M o b i l e R o b o t L o c a l i z a t i o n from Learned L a n d m a r k s .  IEEE  International Conference on Intelligent Robots and Systems, pages 1060-1065, 1998. [95] S . M . S m i t h and J . M . Brady. S U S A N - A new approach to low level image processing. J. Computer Vision, pages 45-78, 1997. [96] A . Soroushi. M o t i o n T r a c k i n g of a M o b i l e R o b o t . Masters Thesis, University of British Columbia, 1996. [97] M . Stephens a n d C . Harris. 3 D W i r e - F r a m e Integration from Image Sequences. Proceeding 4'th Alvey Vision Conference, pages 988-994, 1988. [98] K . T . Sutherland and W . B . T h o m p s o n . Inexact navigation. IEEE International Conference on Robotics and Automation, pages 1-7, 1993.  BIBLIOGRAPHY  147  [99] K . Tashiro, J . O t a , Y . C . L i n , a n d T . A r a i . artificial landmarks.  Design of the o p t i m a l arrangement of  IEEE International Conference on Robotics and Automation,  pages 407-413, 1995. [100] C . Thorpe, M . Hebert, T . K a n a d e , and S. Shafer. Toward autonomous d r i v i n g : T h e emu navlab. part i : Perception. IEEE Expert, 6(4), A u g u s t 1991. [101] C . Thorpe, M . Hebert, T . K a n a d e , and S. Shafer. Toward autonomous driving: T h e emu navlab. part ii: System a n d architecture. IEEE Expert, 6(4):44 - 52, A u g u s t 1991. [102] S. T h r u n , M . Beetz, M . Bennewitz, W o . B u r g a r d , A . B . D.  Creemers, F . Dellaert, D . Fox,  Hahnel, C . Rosenberg, N . Roy, J . Schulte, a n d D . Schulz.  Probabilistic algo-  rithms and the interactive museum tour-guide robot minerva. International Journal of Robotics Research, 1 9 ( l l ) : 9 7 2 - 9 9 9 , November 2000. [103] P . E . Trahanias, S. Velissaris, and T . Garavelos. V i s u a l landmark extraction a n d recognition for autonomous robot navigation. IEEE/RSJ  International Conference on In-  telligent Robot and Systems, pages 1036-1042, 1997. [104] R . Y . T s a i . A Versatile C a m e r a C a l i b r a t i o n Technique for H i g h - A c c u r a c y 3 D Machine V i s i o n Metrology U s i n g Off-the-Shelf T V Cameras and Lenses. International Conference on Robotics and Automation, pages 323-344, 1987. [105] V . S . Tsonis, K . V . ing  Chandrinos, and P . E . Trahanias.  projective invariants.  Landmark-based navigation us-  IEEE International Conference on Intelligent Robots and  Systems, pages 342-347, 1998. [106] W . van der M a r k , D . Fontijne, L . Dorst, a n d F . C . 
[107] H. Wang and M. Brady. A practical solution to corner detection. IEEE International Conference on Image Processing, pages 919-923, 1994.

[108] R.J. Wang and Q. Clarke. Estimation of general 2D motion using Fourier descriptors. IEE Proceedings of Vision, Image and Signal Processing, 141(2):115-121, April 1994.

[109] D. Wettergreen, C. Gaskett, and A. Zelinsky. Development of a visually-guided autonomous underwater vehicle. IEEE International Oceans Conference, pages 1200-1204, 1998.

[110] W.J. Wilson, C.C. Williams Hulls, and F. Janabi-Sharifi. Robust image processing and position based visual servoing. In SPIE/IEEE Series on Imaging Science and Engineering, 1999.

[111] P. Wunsch and G. Hirzinger. Real-time visual tracking of 3D objects with dynamic handling of occlusion. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Albuquerque, USA, pages 2868-2873, April 1997.

[112] K. Yoon and I. Kweon. Artificial landmark tracking based on the color histogram. IEEE/RSJ International Conference on Intelligent Robots and Systems, 4:1918-1923, October 2001.

[113] Z. Zhang. Estimating motion and structure from correspondences of line segments between two perspective images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(12):1129-1139, 1995.

[114] Z. Zhang, R. Deriche, O.D. Faugeras, and Q. Luong. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence, 78(1-2):87-119, 1995.

[115] Z. Zhang and O.D. Faugeras. Three-dimensional motion computation and object segmentation in a long sequence of stereo frames. IEEE International Conference on Computer Vision, pages 211-241, July 1994.

[116] Z. Zhang, R. Weiss, and A.R. Hanson. Automatic calibration and visual servoing for a robot navigation system. Technical Report UM-CS-1993-014, 1993.

[117] Q. Zheng and R. Chellappa. Automatic feature point extraction and tracking in image sequences for arbitrary camera motion. International Journal of Computer Vision, 15:31-76, 1995.

Appendix A

Location Determination Problem by Triangulation

If a set of known beacons is observed from an unknown location, then the position of the observer can be computed from the relative directions of the observed beacons. This is known as the triangulation problem. In other words, for the visual localization problem the camera's optical center can be estimated by examining the relative directions of a set of observed landmarks. The 2D and the 3D location determination problems are defined differently and are therefore examined separately here.

A.1 Triangulation Problem

In R², the observed angle α_{1,2} between two landmarks p_1 and p_2, with unit direction vectors v_1 and v_2 respectively, is given by

    sin α_{1,2} = |v_1 × v_2|                                   (A.1)

By the inscribed angle theorem, α_{1,2} defines a circle passing through p_1, p_2 and the optical center o of the image, as shown in Figure A.1.

Figure A.1: The intersection of the two circles using two pairs of viewed landmarks provides the observer location.

For all points along the perimeter of this circle, the observed angle between p_1 and p_2 is the same.
The radius of this circle is given by

    r_{1,2} = |p_1 − p_2| / (2 sin α_{1,2})                     (A.2)

There are two opposing circles of radius r_{1,2} passing through the two points p_1 and p_2. The correct circle is determined by examining the sign of the cross-product v_1 × v_2.

The observed angle between two landmarks p_1 and p_2 defines a circle on which the observer o must lie. Observing a third landmark p_3 defines a second such circle. As shown in Figure A.1, the intersection of these two circles gives the unique position of the camera's optical center. Once the location of the camera is found, any of the landmarks can be used to determine the appropriate rotation of the image around the optical center.

A.2 The 3D Location Determination Problem

The 3D problem is analogous to the 2D problem, except that now the observed angle α_{1,2} between landmarks p_1 and p_2 defines a surface on which o must lie. In principle, the intersection of three such surfaces, derived from the three different pairs of landmarks, will uniquely determine the location of o. However, this approach requires solving a complex 4th-order polynomial. A simpler approach was described by Fischler and Bolles [16], called the location determination problem. This method is non-linear and examines three landmarks, with a fourth used to resolve ambiguity. Faugeras et al. [60] also described a linear solution to the location determination problem that uses six landmarks.

Figure A.2: The locations of three landmarks along three image projectors must match the distances between the same points.

If three landmarks p_1, p_2 and p_3 are projected onto an image with optical center o, with direction vectors v_1, v_2 and v_3, then the position of each feature point along its respective projector must be such that the distances between the recovered points match the distances between the landmarks, as shown in Figure A.2:

    p_i = o + Ω_i v_i,   i = 1, 2, 3                            (A.3)

with the constraints that

    |p_1 − p_2| = d_{1,2}
    |p_1 − p_3| = d_{1,3}                                       (A.4)
    |p_2 − p_3| = d_{2,3}

Here d_{i,j} is the distance between landmarks p_i and p_j in the solution. Equation A.4 is quadratic with three unknowns, Ω_1, Ω_2 and Ω_3, and has up to eight solutions [16]. However, for every positive solution there is a geometrically isomorphic negative solution behind the camera, so there are at most four actual solutions to consider. To resolve the ambiguity, the procedure is repeated substituting a fourth point p_4 in place of p_3 to identify the unique solution.

Equation A.3 gives the relative distance Ω_i of each feature point p_i from the focal point o. The actual position of o with respect to the feature points is found by solving

    |o − p_1| = Ω_1
    |o − p_2| = Ω_2                                             (A.5)
    |o − p_3| = Ω_3

which has two solutions, one on each side of the triangle with vertices p_1, p_2 and p_3. The correct position of o is determined by examining the triple scalar product (v_1 × v_2) · v_3.
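To make the 2D construction of Section A.1 concrete, the following Python fragment is an illustrative sketch only (it is not the implementation used in this thesis, and all function names are hypothetical). For each landmark pair it builds both candidate circles with the radius of Equation A.2, intersects them, and keeps the intersection points whose reproduced viewing angles match the measurements; this numerical verification stands in for the cross-product sign test described above. With the pairs (p_1, p_2) and (p_2, p_3), both circles pass through p_2, so that shared intersection is discarded and the remaining consistent intersection is the observer.

```python
import numpy as np

def circles_through_pair(p_a, p_b, alpha):
    """Both candidate circles (centre, radius) through p_a and p_b whose
    inscribed angle over the chord p_a-p_b equals alpha (cf. Eq. A.2)."""
    chord = np.linalg.norm(p_b - p_a)
    r = chord / (2.0 * np.sin(alpha))
    mid = 0.5 * (p_a + p_b)
    n = np.array([-(p_b - p_a)[1], (p_b - p_a)[0]]) / chord   # unit normal to the chord
    h = np.sqrt(max(r * r - (chord / 2.0) ** 2, 0.0))         # midpoint-to-centre offset
    return [(mid + h * n, r), (mid - h * n, r)]

def circle_intersections(c1, r1, c2, r2):
    """Intersection points of two circles (0, 1 or 2 points)."""
    d = np.linalg.norm(c2 - c1)
    if d == 0 or d > r1 + r2 or d < abs(r1 - r2):
        return []
    a = (r1 * r1 - r2 * r2 + d * d) / (2.0 * d)
    h = np.sqrt(max(r1 * r1 - a * a, 0.0))
    base = c1 + a * (c2 - c1) / d
    perp = np.array([-(c2 - c1)[1], (c2 - c1)[0]]) / d
    return [base + h * perp, base - h * perp]

def locate_observer_2d(p1, p2, p3, a12, a23, tol=1e-6):
    """Candidate observer positions given observed angles a12 (between p1, p2)
    and a23 (between p2, p3); landmarks are 2D points in world coordinates."""
    p1, p2, p3 = (np.asarray(p, float) for p in (p1, p2, p3))
    candidates = []
    for c1, r1 in circles_through_pair(p1, p2, a12):
        for c2, r2 in circles_through_pair(p2, p3, a23):
            for o in circle_intersections(c1, r1, c2, r2):
                if np.linalg.norm(o - p2) < 1e-9:    # both circles pass through p2 itself
                    continue
                v = [(p - o) / np.linalg.norm(p - o) for p in (p1, p2, p3)]
                if (abs(np.arccos(np.clip(v[0] @ v[1], -1, 1)) - a12) < tol and
                        abs(np.arccos(np.clip(v[1] @ v[2], -1, 1)) - a23) < tol):
                    candidates.append(o)
    return candidates
```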
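The three-point procedure of Equations A.3-A.5 can also be sketched numerically. In the Python fragment below (again an illustrative sketch; the function name is hypothetical, and a generic root finder is used as a stand-in for the closed-form solution of Fischler and Bolles [16]), the projector lengths Ω_i are obtained from the quadratic constraints of Equation A.4, Equation A.5 is then solved by standard trilateration, and the resulting mirror-image pair is separated by the sign of the triple scalar product (v_1 × v_2) · v_3, as described above. A full implementation would enumerate all positive roots of Equation A.4 and use the fourth landmark to select among them.

```python
import numpy as np
from scipy.optimize import fsolve

def locate_observer_3d(v, p_world):
    """Illustrative sketch: recover the observer position o from three known
    landmarks p_world (3x3 array, one landmark per row, world coordinates)
    and the unit direction vectors v (3x3 array) under which they are seen."""
    v = np.asarray(v, float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    p = np.asarray(p_world, float)

    # Known inter-landmark distances d_ij (right-hand sides of Equation A.4).
    d12 = np.linalg.norm(p[0] - p[1])
    d13 = np.linalg.norm(p[0] - p[2])
    d23 = np.linalg.norm(p[1] - p[2])
    c12, c13, c23 = v[0] @ v[1], v[0] @ v[2], v[1] @ v[2]

    # Quadratic constraints on the projector lengths (Equations A.3/A.4):
    # |Omega_i v_i - Omega_j v_j|^2 = d_ij^2.
    def constraints(omega):
        o1, o2, o3 = omega
        return [o1*o1 + o2*o2 - 2*o1*o2*c12 - d12*d12,
                o1*o1 + o3*o3 - 2*o1*o3*c13 - d13*d13,
                o2*o2 + o3*o3 - 2*o2*o3*c23 - d23*d23]

    # A generic root finder replaces the closed-form 4th-order solution;
    # the naive initial guess picks one of the up-to-four positive roots.
    omega = fsolve(constraints, x0=np.full(3, d12))

    # Trilateration (Equation A.5): o lies on three spheres of radius Omega_i
    # centred on the landmarks (assumed non-collinear); two mirror solutions.
    ex = (p[1] - p[0]) / np.linalg.norm(p[1] - p[0])
    i = ex @ (p[2] - p[0])
    ey = p[2] - p[0] - i * ex
    ey = ey / np.linalg.norm(ey)
    ez = np.cross(ex, ey)
    d = np.linalg.norm(p[1] - p[0])
    j = ey @ (p[2] - p[0])
    x = (omega[0]**2 - omega[1]**2 + d*d) / (2.0*d)
    y = (omega[0]**2 - omega[2]**2 + i*i + j*j - 2.0*i*x) / (2.0*j)
    z = np.sqrt(max(omega[0]**2 - x*x - y*y, 0.0))
    candidates = [p[0] + x*ex + y*ey + s*z*ez for s in (+1.0, -1.0)]

    # Disambiguate with the triple scalar product (v1 x v2) . v3, whose sign
    # is invariant under the (unknown) camera rotation.
    target = np.sign(np.cross(v[0], v[1]) @ v[2])
    for o in candidates:
        w = (p - o) / np.linalg.norm(p - o, axis=1, keepdims=True)
        if np.sign(np.cross(w[0], w[1]) @ w[2]) == target:
            return o, omega
    return candidates[0], omega   # degenerate case: observer in the landmark plane
```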
