Open Collections

UBC Theses and Dissertations

On the design and implementation of decision-theoretic, interactive, and vision-driven mobile robots Elinas, Pantelis 2007


ON THE DESIGN AND IMPLEMENTATION OF DECISION-THEORETIC, INTERACTIVE, AND VISION-DRIVEN MOBILE ROBOTS

by

PANTELIS ELINAS

M.Sc., University of British Columbia, 2002
B.Sc., York University, 1999

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
in
THE FACULTY OF GRADUATE STUDIES
(Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA
December 2007

© Pantelis Elinas, 2007

Abstract

We present a framework for the design and implementation of visually-guided, interactive, mobile robots. Essential to the framework's robust performance is our behavior-based robot control architecture, enhanced with a state-of-the-art decision-theoretic planner that takes into account the temporal characteristics of robot actions and allows us to achieve principled coordination of complex subtasks implemented as robot behaviors/skills. We study two different models of the decision-theoretic layer: Multiply Sectioned Markov Decision Processes (MSMDPs), under the assumption that the world state is fully observable by the agent, and Partially Observable Markov Decision Processes (POMDPs), which remove the latter assumption and allow us to model the uncertainty in sensor measurements. The MSMDP model utilizes a divide-and-conquer approach for solving problems with millions of states using concurrent actions. For solving large POMDPs, we present heuristics that improve the computational efficiency of the point-based value iteration algorithm while tackling the problem of multi-step actions using Dynamic Bayesian Networks. In addition, we describe a state-of-the-art simultaneous localization and mapping algorithm for robots equipped with stereo vision. We first present the Monte-Carlo algorithm σMCL for robot localization in 3D using natural landmarks identified by their appearance in images. Secondly, we extend σMCL and develop the σSLAM algorithm for solving the simultaneous localization and mapping problem for visually-guided, mobile robots.
We demonstrate our real-time algorithm mapping large, indoor environments in the presence of large changes in illumination, image blurring and dynamic objects. Finally, we demonstrate empirically the applicability of our framework for developing interactive, mobile robots capable of completing complex tasks with the aid of a human companion. We present an award-winning robot waiter for serving hors d'oeuvres at receptions and a robot for delivering verbal messages among inhabitants of an office-like environment.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
Co-Authorship Statement

1 Introduction
  1.1 Robot Control Architectures
  1.2 Sequential Decision Making Under Uncertainty
  1.3 Simultaneous Localization and Mapping
  1.4 Thesis Contributions
  Bibliography

2 Waiting With José, a Vision-based Mobile Robot
  2.1 José
  2.2 Perception and Motor Control
  2.3 Modeling and Task Execution
    2.3.1 Mobility
    2.3.2 Human-Robot Interaction
    2.3.3 Serving
  2.4 Planning
  2.5 Results
  2.6 Conclusions
  Bibliography

3 Multiply Sectioned Markov Decision Processes for Task Coordination in Interactive Mobile Robots
  3.1 Related Work
  3.2 HOMER
    3.2.1 Hardware
    3.2.2 Software Architecture
  3.3 Management and Planning
    3.3.1 Markov Decision Processes
    3.3.2 Multiply Sectioned MDPs
    3.3.3 Managing the Message Delivery Task
  3.4 Experiments
  3.5 Conclusion and Future Work
  Bibliography

4 Decision Theoretic Task Coordination for a Visually-guided Interactive Mobile Robot
  4.1 Partially Observable Markov Decision Processes
    4.1.1 Basics
    4.1.2 POMDPs for Behavior Coordination
    4.1.3 Randomized Point-based Value Iteration (Perseus)
    4.1.4 Speeding Up Perseus
  4.2 The Human-Oriented MEssenger Robot (HOMER)
  4.3 Conclusions
  Bibliography

5 σMCL: Monte-Carlo Localization for Mobile Robots with Stereo Vision
  5.1 Bayesian Filtering with Particle Filters
  5.2 Map Construction
  5.3 Observation Function
  5.4 Computing Camera Motion
    5.4.1 Linear Estimation of the Essential Matrix
    5.4.2 Non-Linear Estimation of the Camera Pose
  5.5 The Mixture Proposal Distribution
  5.6 Single-frame Approach Using RANSAC
  5.7 Experiments
  5.8 Conclusion
  Bibliography

6 σSLAM: Stereo Vision SLAM Using the Rao-Blackwellised Particle Filter and a Novel Mixture Proposal Distribution
  6.1 SLAM Using the Rao-Blackwellised Particle Filter
  6.2 Vision-based SLAM
    6.2.1 Map Representation
    6.2.2 Observation Model
    6.2.3 Motion Model
  6.3 The Proposal Distribution
  6.4 The σSLAM Algorithm
  6.5 Experimental Results
  6.6 Conclusion
  Bibliography

7 Conclusions
  7.1 Contributions
  7.2 Future Work
    7.2.1 Decision Theoretic Behavior Coordination
    7.2.2 Visual Simultaneous Localization and Mapping
  Bibliography

Appendix A Face detection
  A.1 Color Segmentation
  A.2 Filtering and Connected Components Analysis
  A.3 Component Selection

Appendix B Navigation
  B.1 Path Planning
  B.2 Path Following
    B.2.1 Detecting and Representing Safe Space for Collision Avoidance
    B.2.2 POMDP for Path Following and Obstacle Avoidance

List of Tables

3.1 HOMER's state variables. For each variable, we show the MDPs that include it.
3.2 HOMER's MDPs and their corresponding actions. For each action, we indicate the modules which effect them.
4.1 Comparing the original Perseus algorithm (SU) with two variants (SM and HM) that select belief points to optimize using a heuristic value.
5.1 Comparing our least-squares estimate of camera motion with odometry. Position is shown in cm and orientation in degrees.
B.1 The navigation POMDP state and observation variables.

List of Figures

1.1 Illustration of the overlap of skills for three robots that perform different tasks.
1.2 (a) The Sense-Model-Plan-Act and (b) Subsumption robot control architectures.
1.3 Diagram of a three-layered robot control architecture.
1.4 The standard reinforcement learning model.
1.5 Graphical representation of a POMDP agent.
1.6 Representation of a POMDP agent using a Dynamic Bayesian Network.
1.7 Examples of 3D-landmark and 2D occupancy-grid maps learned with our σSLAM algorithm.
2.1 José, the robotic waiter.
2.2 The control architecture of the robot waiter José.
2.3 Example of radial map estimation from stereo vision.
2.4 Two examples of SIFT features extracted from images captured with a stereo camera.
2.5 Bird's eye view of the SIFT map around the base region: the home location is indicated by the square, with the current robot position and view direction shown as a V.
2.6 Faces coincide with speech: (a) "Stop stealing food!", (b) "I'm sorry", (c) "Would you like an appetizer?", (d) "I have no food left!"
2.7 Top row: images taken while a person helps himself to an appetizer. Bottom row: segmented regions.
2.8 The state diagram for José's serving behaviors.
2.9 Two views of José's environment from the home base as seen through the color camera.
2.10 Occupancy grids at three times during a serving run. Also shown are the accumulated skin-color maps and José's trajectory, with an 'x' marking the home base and 'o's marking serving locations.
2.11 Desirability maps at two times during a serving run.
3.1 (a) HOMER the messenger robot interacting with a person and (b) closeup of HOMER's head.
3.2 Control Architecture.
3.3 Example message delivery run, showing key frames, variables that change or are significant, actions taken by each planner (Navigator, Dialogue, Gesture), and a description of what happened.
4.1 Diagram of a three-layered behavior-based robot control architecture showing the deliberative, executive and functional components.
4.2 Dynamic Bayesian Network representation of the behavior control problem in robotics.
4.3 Performance of the different heuristics for belief point selection in Perseus.
4.4 The components of HOMER's POMDP model including the state and observation variables.
4.5 The actions available to HOMER along with the maximum duration for each.
4.6 Part 1: Trace of HOMER's POMDP during a complete run from receiving to delivering a message.
4.7 Part 2 (continued from Figure 4.6): Trace of HOMER's POMDP during a complete run from receiving to delivering a message.
5.1 Bird's eye view of the 3D landmark map used for localization.
5.2 Examples of estimating the epipolar geometry for forward and rotating motions.
5.3 Photos of HOMER, the visually-guided mobile robot we used for our experiments.
5.4 An example of the evolution of particles during global localization with σMCL (part (a)).
5.5 An example of the evolution of particles during global localization with σMCL (part (b)).
5.6 An example of the evolution of particles during global localization with σMCL (part (c)).
5.7 Plots of the localization error for σMCL and a single-frame adaptive RANSAC approach.
5.8 The robot's path for image set S2. The robot traveled a total distance of 18 meters.
5.9 The robot's path for image set S3.
5.10 Experimental results for the kidnapped robot problem.
6.1 The RWI B-14 robot and its camera used for data collection.
6.2 Sample frames from our σSLAM test sequence.
6.3 Plot of the localization error for the filter using just the standard proposal, the mixture proposal (with a variable mixing ratio) and the raw visual odometry estimates.
6.4 Examples of two maps learned with σSLAM including the 3D landmarks and 2D occupancy grid.
6.5 Plot of the average processing time vs. frame number for 500 particles using σSLAM.
6.6 Results from the outdoor image sequence with 5000 particles using σSLAM. The visual odometry trajectory is shown in blue, the GPS ground truth data in red, and the maximum likelihood particle in black. Not all landmarks are visible in the image. The map has 5277 landmarks. The arrows point to the locations where the visual odometry was very unreliable.
A.1 Face detection example in the presence of noise.
A.2 Example of object detection using color and shape constraints.
B.1 Plot of disparity vs. tilt angle data collected during training and used for learning a model for floor detection.
B.2 Example of floor detection from disparity maps obtained with a steerable stereo camera.
B.3 Diagram showing how the values for the navigation POMDP observation variables relating to obstacles in the robot's path are set from the input images.
B.4 Diagram showing how the value for the navigation POMDP's observed variable OP is set.
B.5 Example of the robot navigating using our POMDP for path following and obstacle avoidance.
B.6 Example of the robot navigating around a dynamic obstacle not present in its occupancy-grid map.

Acknowledgements

First, I want to express my most sincere thanks to my supervisor Dr. James J. Little for his guidance and support during my time at UBC. Similarly, I wish to thank my supervisory committee members Dr. David Poole, Dr. Kevin Murphy and Dr. Giuseppe Carenini for their insightful comments and guidance over the years. In addition, I want to thank Dr. Jesse Hoey for helping me establish some of the theoretical basis for parts of my work and for proofreading many of my papers, providing valuable feedback without which my submissions may not have been successful. Also, I am most indebted to Dr. Don Murray for teaching me how to develop software for controlling our robots and stereo cameras, the two most important components in my thesis work. Finally, most thanks belong to my parents, who encouraged me to enter graduate school and pursue an academic career.

Pantelis Elinas
The University of British Columbia
December 2007

Co-Authorship Statement

This is a manuscript thesis and as such Chapters 2 to 6 have been written as papers that have already been published in conferences and journals. I have performed some of the work presented in this thesis in collaboration with other researchers from the Laboratory for Computational Intelligence (LCI) at UBC. I have often benefited greatly from in-depth discussions with all my colleagues. In addition, Dr. James J.
Little has provided me with invaluable guidance and support over the years.

Chapter 2 is the result of a collaboration among a number of LCI members who helped develop our entry for the AAAI "Hors D'oeuvres anyone?" competition. Specifically, the following people should receive credit: Stephen Se for section 2.3.1, Jefferson D. Montgomery for section 2.3.3, Darrell Lahey for section 2.3.3, Don Murray for section 2.3.1, and Jesse Hoey and Don Murray for help with the hardware setup and systems support. The author of this thesis worked on the control architecture, people finding (section 2.3.2), interaction with people (section 2.3.2), navigation (section 2.3.1), and overall design and system integration.

For Chapter 3, Jesse Hoey provided the SPUDD software for solving factored MDPs, while Enrique Sucar and Alberto Reyes helped to carry out the experiments. For Chapter 6, Robert Sim provided an implementation of the FastSLAM data structure for map sharing among the different particles and the k-d tree data structure for finding the nearest neighbors of SIFT keys.

Pantelis Elinas

Chapter 1

Introduction

This thesis presents a framework and essential components for the design, specification and implementation of visually-guided, interactive, mobile robots. The framework is based on organizing the skills of the robot into a behavior-based, three-layered architecture with the behavioral parameters controlled by a layer modeled using Markov Decision Processes. The combination of the distributed and reactive robot control architecture and the mathematical harmony of utility-based planning allows us to realize complex robots relying on vision for sensing. In addition, we present three critical robot skills: localization and mapping, navigation with real-time obstacle avoidance, and face detection.

In this thesis, our focus is on visually-guided, mobile robots that interact with people. Such robots are intricate systems both in hardware and software.
Some of the complexity in software derives from the complexity of the underlying hardware that it must control. As a result, elaborate systems for the design and implementation of control software have been invented. The framework that these techniques define is known as a Robot Control Architecture (RCA). An RCA includes mechanisms for defining its parts and their interaction. It also provides a consistent interface to the robot's mechanical sub-systems. A well-designed RCA makes programming robots easy and enables the machines to exhibit intelligent behavior, whether programmatically specified or emergent.

Robots that interact with people fall under the more general class of social robots. Social robots are embodied mobile agents that are capable of interacting with people using multi-modal interfaces [Res02, Bre03] while following social conventions. Tasks for such robots vary from helping out with household chores such as intelligent vacuuming to providing services for large crowds such as serving hors d'oeuvres at receptions. More importantly, such robots can be useful in caring for an ageing population [Dav99]. We can develop intelligent systems to remind the elderly when it is time for their medication, or to help them remember the location of misplaced personal items and guide them around complex and unfamiliar places. One such example application for assisted living is the intelligent, powered wheelchair anti-collision system that we have developed for use by older adults suffering from Alzheimer's disease or other forms of dementia [MEBH07]. The roles of such robots impose significant constraints on their appearance and interaction interfaces [Cap86, Duf02]. People are more likely to accept and more apt to use robots that mimic a familiar form such as that of a pet or even another human. The robots must also be able to navigate environments designed for humans.
There is a need for robots to integrate smoothly into our lives without users being required to apply much effort to learn how to use them. This makes multi-modal interfaces that are based on speech and vision a necessity. People expect robots to correctly understand their instructions, quickly respond to their requests and complete their given task efficiently and promptly. These requirements of robustness, responsiveness and optimality are extremely hard to satisfy in general, but even more so within the parameters of a multi-modal interface. The newly created field of Human-Robot Interaction (HRI) is the discipline focused on enabling robots with the ability of multimodal communication and social behavior. This includes being able to learn models of interactions from user studies and later integrating these models into the system. Towards this goal, we need both a robot control architecture that allows the re-usability of components or skills and, given these, an easy way to specify high-level tasks. Additionally, the same framework should allow us to easily add any new models that have been learned.

Figure 1.1: Diagram illustrating the overlap among skills for three robots that perform different tasks. Small boxes indicate individual skills; of these, shaded boxes represent skills that control the robot actuators while unshaded boxes represent skills that process sensor data to be made available in a form suitable for consumption by any other module. We describe the waiter robot in Chapter 2 and the messenger robot in Chapters 3 and 4. The third robot, allowing for multimodal interaction for directed exploration, is described in [EL02].
Figure 1.1 shows the overlap among skills for three different robots that we have developed over the last few years. The waiter robot is designed to serve hors d'oeuvres to large groups of people at receptions; the messenger robot delivers verbal messages among inhabitants of an office-like environment; and lastly, the explorer receives directions from a human guide using arm gestures and speech [EL02].

Our work considers the problem of developing interactive robots that are capable of helping an individual to complete a specific task. We develop a number of individual robot skills that can be easily re-used and a theoretical framework that allows us to quickly prototype systems and compute near-optimal solutions for specific tasks. The main robot task that we consider here is that of the robot delivering verbal messages among people in an office-like environment. It is very similar to the more traditional robot task of delivering mail. It requires high-level planning and it depends on the most basic of robot skills, such as localization, mapping, obstacle avoidance and path following.
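The thesis later formalizes this kind of high-level planning with Markov decision processes (Chapters 3 and 4). As a minimal, hypothetical illustration of that machinery — a flat, toy message-delivery MDP solved by value iteration, with states, actions, transition probabilities and rewards invented here purely for exposition, not taken from the thesis's factored MSMDP/POMDP models:

```python
# Value iteration on a toy "message delivery" MDP (illustrative only).
# States model a messenger robot; navigation actions sometimes fail,
# leaving the robot where it was.
states = ["at_base", "at_sender", "has_message", "delivered"]

# P[s][a] = list of (probability, next_state) pairs.
P = {
    "at_base":     {"go_sender":   [(0.9, "at_sender"), (0.1, "at_base")]},
    "at_sender":   {"record":      [(1.0, "has_message")]},
    "has_message": {"go_receiver": [(0.8, "delivered"), (0.2, "has_message")]},
    "delivered":   {},             # terminal: no actions available
}
R = {"delivered": 10.0}            # reward only for completing the delivery
gamma = 0.95                       # discount factor

V = {s: 0.0 for s in states}
for _ in range(100):               # iterate the Bellman backup to convergence
    V = {s: R.get(s, 0.0) + max(
            (gamma * sum(p * V[s2] for p, s2 in P[s][a]) for a in P[s]),
            default=0.0)
         for s in states}

# Greedy policy: in each state pick the action with the best expected value.
policy = {s: max(P[s], key=lambda a, s=s: sum(p * V[s2] for p, s2 in P[s][a]))
          for s in states if P[s]}
print(policy)
```

The resulting policy simply chains the skills toward the goal (go to the sender, record, deliver); the point of the decision-theoretic formulation is that the same computation still yields near-optimal behavior when actions are unreliable and rewards compete, which is where hand-coded sequencing breaks down.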
Our framework addresses the issues of efficiency, modularity and optimality, as we demonstrate its use in designing a visually-guided, interactive, mobile robot capable of delivering verbal messages. In this thesis we present our progress towards developing the tools and algorithms for constructing visually-guided, interactive, mobile robots. More specifically, in a series of papers, we present our progress in developing, within a distributed, behavior-based robot control architecture, several robot skills for navigation, mapping, robot localization and face detection, as well as principled coordination of these skills to successfully complete complex human-interactive tasks.

1.1  Robot Control Architectures

A Robot Control Architecture is a development framework that allows the design and implementation of robotic systems. In the early days of AI, intelligent agents were commonly designed using the Sense-Model-Plan-Act (SMPA) model shown in Part (a) of Figure 1.2. SMPA favors the design of intelligent agents that reason by searching within an accurate symbolic model of the world, continuously updated from sensor data. Early robots such as the Stanford Cart [Mor77] and Shakey [Nil69] utilized such an architecture and presented promising results for future research. The progress of the SMPA model was eventually hindered by its inability to scale to large problems, driven mostly by the difficulty of creating accurate world models. In SMPA, the rather hierarchical and centralized method of control prohibited the lower levels from acting independently, as they were always under the explicit control of the higher-level ones. Lower-level layers that are attached to the actuators need to run at higher frequencies to handle the real-time nature of the hardware. On the other hand, higher-level layers need longer time to process sensory data to build world models and search for a good plan.
The difference in time requirements among the different layers in SMPA makes it difficult for the robots to act quickly and respond to dynamic changes in the world. Robots that used SMPA were only demonstrated to work in static, well-engineered environments that allowed the quick construction of accurate world models with ample time for planning. The expectation was that with the increase in computer speed and with the discovery of better model-building and search algorithms, SMPA-type architectures would become viable. Unfortunately, to this very day, real-time robust algorithms for constructing accurate models are still rare. One can point out the vast amount of work on the Simultaneous Localization and Mapping (SLAM) problem, an area in which only recently researchers have found adequate solutions for mapping small indoor environments that are mostly static [BFT97, DBFT99, SLL02, Thr00]. Similarly, many planning algorithms have only shown partial success in domains where a compact symbolic representation of the problem is easy to construct, such as the blocks-world domain or board games such as chess.

Figure 1.2: (a) The Sense-Model-Plan-Act and (b) Subsumption robot control architectures.

Brooks [Bro86] challenged the Artificial Intelligence community by introducing a new paradigm for building intelligent agents, expressed in the Subsumption architecture shown in Part (b) of Figure 1.2. Subsumption was meant to replace the traditional SMPA model. Brooks took the extreme position that no world models, and effectively no complex plans, are necessary for building an intelligent agent. In Subsumption there are a number of stand-alone processes (also called behaviors) attached directly to the actuators and sensors of the agent. They receive input from the environment (stimuli) and generate control signals for the actuators without constructing a world model.
When there is structure in the environment, and hence consistency among the stimuli for all the behaviors, they all collectively work together to perform a complex task; e.g., a six-legged robot learns to walk as a behavior is independently attached to each leg [Bro89]. Subsumption is a purely reactive architecture, as the agent's behavior consists of its instantaneous reactions to environmental stimuli. There is no memory mechanism and no elaborate plans are constructed. Intelligent behavior emerges from the implicit coordination of all the available behaviors. Brooks advocated this approach and supported the notion of intelligence without representation [Bro91]. In truth, however, some plans and some representation of the world do exist even within Subsumption. Nevertheless, these constructs tend to be very simple. Both the memory of events retained and the plans constructed are short-lived and only local. Within Subsumption there is no explicit coordination among the behaviors (although some behaviors subsume others using a priority scheme) and there is no central control. Instead, all behaviors run in parallel and only an arbitration mechanism for resolving conflicts in accessing the actuators is present. Brooks argues [Bro91] that there is room for more complex representations within Subsumption, but not as a centralized module that handles control; rather, as another behavior whose output signals are input to other behaviors. Even so, any representation is short-lived and unsuitable for representing long-term, high-level goals such as "deliver coffee". Eventually, Subsumption as a purely reactive control mechanism was shown not to scale to human-like intelligence [Tso95]. Regardless, a purely reactive architecture is very useful for constructing insect-like robots whose overall behavior is reactive. In order to program Subsumption robots, Brooks created the Behavior Language [Bro90].
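The priority-based arbitration just described can be sketched in a few lines. The following is a minimal illustration (not Brooks' actual Behavior Language): behaviors map instantaneous stimuli to actuator commands, and a fixed priority scheme lets higher behaviors suppress lower ones. The behavior names, stimuli and commands here are all invented.

```python
# Subsumption-style arbitration sketch: no world model, no central plan;
# conflicts over the actuators are resolved purely by behavior priority.

class Behavior:
    def __init__(self, priority):
        self.priority = priority

    def react(self, stimuli):
        """Return an actuator command, or None to yield to other behaviors."""
        raise NotImplementedError

class AvoidObstacle(Behavior):
    def react(self, stimuli):
        # Reacts only to the instantaneous stimulus; keeps no memory.
        if stimuli.get("obstacle_distance", float("inf")) < 0.5:
            return "turn_away"
        return None

class Wander(Behavior):
    def react(self, stimuli):
        return "move_forward"   # always has something to do

def arbitrate(behaviors, stimuli):
    # Higher-priority behaviors subsume (suppress) lower-priority ones.
    for behavior in sorted(behaviors, key=lambda b: -b.priority):
        command = behavior.react(stimuli)
        if command is not None:
            return command
    return "stop"

behaviors = [Wander(priority=1), AvoidObstacle(priority=2)]
print(arbitrate(behaviors, {"obstacle_distance": 0.3}))   # turn_away
print(arbitrate(behaviors, {"obstacle_distance": 3.0}))   # move_forward
```

Intelligent-looking behavior emerges from the interplay of such simple rules; note that no component holds a representation of a long-term goal.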
The Behavior Language is a rule-based, Common Lisp derivative that compiles into machine code for a number of computer architectures. Behaviors are represented as groups of rules that are activated based on different schemes. Communication between behaviors is done using message passing. Since this is a tool for programming Subsumption systems, it provides facilities for the parallel and asynchronous execution of the specified behaviors, along with mechanisms for suppression and inhibition of signals for behavior coordination. The biggest drawback of the Behavior Language was its lack of a learning mechanism. It only provided a better way (compared to other programming languages) for a programmer to fully specify a Subsumption system as a collection of Finite State Automata. The behavior-based architectures in use today provide the means for integrating both reactive and deliberative behaviors [ARH87, Ark89, Alb91]. These are known as hybrid architectures. Arkin [Ark89] was the first to propose one, in the Autonomous Robot Architecture (AuRA) developed for robot navigation in a dynamic environment. It used reactive behaviors, called schemas, for collision avoidance and path following in an application for robot navigation. A deliberative schema was responsible for path planning in metric maps acquired using sonar sensors [ME85]. Deliberation was present as a mechanism for deciding what behavioral subset should be active at any one time for a long-term plan to be executed. The deliberative module would be de-activated once a plan was formed, and control would be passed to another behavior, the Schema Controller, that interfaced with the reactive schemas. Planning would occur again only if a failure was detected or the mission was completed. AuRA has been used on real robot projects and has been extended with different learning methods for tuning schema parameters [AB97, RB94]. Later, Connell introduced the SSS architecture [Con92].
It combined three layers of behaviors: Servo-control, Subsumption and Symbolic. Each component operated independently but under the guidance of the Symbolic level. Its main focus was on the communication protocol between the three layers. SSS has been demonstrated on the mobile robot TJ, whose main task was navigation in an office-like environment. Since the early 90s many more architectures have been proposed, but they have contributed little beyond the hybrid architectures of Arkin and Connell. Most research centers on the basic principles of hybrid architectures, differing in the experts' decisions on how modules are instantiated and communicate with each other. One such example is the Task Control Architecture (or TCA) [LSF89]. It has mechanisms for process control, exception handling, parallelism and, more importantly, error recovery. Recently, TCA has been replaced by two packages that encapsulate the same principles. They are the Task Description Language (TDL) [SA98], a programming-language derivative of C++ that provides support for task specification, execution and error recovery, and the Inter-Process Communication (IPC) package that handles message passing between processes. The complete architecture differs from Subsumption in that there is centralized control (for communication and process scheduling) as well as explicit encoding of tasks (in a tree structure aptly named a task tree). The older TCA and the newer TDL and IPC packages have been used to build several successful robots such as Xavier [SFG+99], Amelia (an improved Xavier), and most recently, PEARL (a robot designed to aid elderly people [MPR+02, MDK+02]) and GRACE (Graduate Student Attending a Conference [SGG+03]). The latter is probably the most complete mobile robot demonstrated to this day. It was designed to attend a conference (as part of an AAAI challenge). The robot was left at the entrance of a conference's exhibition hall.
It found its way to the registration desk, received its registration kit, proceeded to an elevator and used it to reach a lecture hall, where it gave a brief talk about itself. Each of these actions was coded as a task in TDL. Although the robot did not perform perfectly at each of the subtasks, it managed to find its way around a crowded hall, stand in line for registration, briefly interact with a human guide and negotiate an elevator. GRACE's creators explain how the largest challenge in developing the software was the integration of the many software components (behaviors) and the coordination of those components as the robot switched between subtasks. GRACE and PEARL's successes arise from their use of statistical techniques to model uncertainty in sensor measurements, actuator responses and multi-modal interactions with people. Two other very successful robots driven using a hybrid robot control architecture are RHINO and Minerva, two generations of museum tour guides [Bee99, BCF+98, BCF+99, TBB+99a, TBB+99b]. TCX handled process-related issues (similar to TCA) while task planning was specified using GOLOG, a programming language based on the situation calculus. GOLOG has the power to perform theorem proving in partially observable environments. TCX had a significantly more reactive component while still allowing for high-level planning. Both robots operated for days giving tours to large groups of people in very crowded environments, never getting lost or requiring much help from engineers. Another effort to compensate for the inability of reactive architectures to represent high-level tasks comes from Nicolescu et al. [NM02]. They introduce an architecture that includes a way to add abstract behaviors to the system. Theirs is again a hybrid architecture. Instead of partitioning behaviors into three layers, they introduce the behavior network construct, which is a flat representation but has the ability to represent hierarchical, plan-like strategies.
Their model of abstract behaviors lacks a learning mechanism, and it makes no mention of handling the uncertainty in sensor measurements and behavioral output. For completeness, we should mention that there are a number of open-source and commercially available architectures that usually ship with research robots, such as Saphira, TeamBots, BERRA, Microsoft Robotics Studio, and Player/Stage [GVH03]. Oreback and Christensen [OC03] provide an evaluation of these in terms of ease of use, portability, programming and run-time efficiency. Arkin [Ark98] presents a more complete treatment of the different choices for behavior-based control architectures. Part of this thesis is concerned with the implementation of a three-layered, behavior-based, hybrid robot control architecture. We have used this architecture to develop three different robots: a robot waiter, a human-interactive explorer and a human-oriented messenger [EHL+02, EHL03, EL02]. We describe the waiter and messenger robots in Chapters 2, 3 and 4. For our architecture, we define a behavior as a stand-alone, OS-level process with well-specified inputs and outputs. For this thesis, we use the terms modules, processes or skills as synonyms for behaviors. Conceptually, behaviors are classified as low, middle and high level, very much like in the SSS architecture. Figure 1.3 shows an example of the layered behavior organization for a robotic waiter that we have developed and describe in Chapter 2. The lowest-level behaviors interface with the actuators and sensors. The middle-level behaviors communicate with the lower layer to send motor commands and receive sensor data. Behaviors can also communicate with others in the same layer.
Figure 1.3: Diagram of a three-layered robot control architecture showing the conceptual division of behaviors into three distinct layers: high, middle, and low. The example shows the behavior organization for an interactive robot waiter described in Chapter 2. Notice that direct access to the robot's sensors and actuators is only available to the lower-level behaviors.

Middle and low level behaviors communicate through a shared memory mechanism. All other communication among behaviors uses network sockets. Behavioral input and output, including connections between modules, are specified by an expert and cannot be changed online. All behaviors are started independently at boot time and run according to their internal execution model. The lower layer of behaviors for all robots that we have developed consists of essentially the same pieces. This is the only layer that must be modified to port the architecture to different platforms. For example, there is a behavior, the ImageServer, that handles interfacing with the on-board stereo vision system. Another, the RobotServer, handles commands to the robot's motors and extracts odometry information. Our work uses an RWI-B12 mobile robot, but the same architecture has been ported for use with the ActivMedia PowerBot platform simply by modifying the RobotServer module. Similarly, in order to use a camera other than the Point Grey Research stereo camera, one must simply update the ImageServer module.
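The porting argument above can be made concrete with a small sketch: middle-level behaviors program against an abstract server interface, so moving to a new robot base only means swapping in a different concrete server. The class and method names below are illustrative, not the system's actual API.

```python
# Why only the lowest layer changes across platforms: middle-level behaviors
# see only an abstract RobotServer; porting replaces the concrete driver.
from abc import ABC, abstractmethod

class RobotServer(ABC):
    """Lowest-layer behavior: the only module that touches the motors."""

    @abstractmethod
    def set_velocity(self, linear, angular):
        ...

    @abstractmethod
    def get_odometry(self):
        """Return the current (x, y, theta) odometry estimate."""

class SimulatedRobotServer(RobotServer):
    """Stand-in driver; a real port would wrap the platform's motor API."""

    def __init__(self):
        self._pose = (0.0, 0.0, 0.0)

    def set_velocity(self, linear, angular):
        x, y, theta = self._pose
        # Trivial unit-time integration, purely for illustration.
        self._pose = (x + linear, y, theta + angular)

    def get_odometry(self):
        return self._pose

# A middle-level behavior depends only on the abstract interface, so it runs
# unchanged on any platform that provides a RobotServer implementation.
def drive_forward(server: RobotServer, steps: int):
    for _ in range(steps):
        server.set_velocity(1.0, 0.0)
    return server.get_odometry()

print(drive_forward(SimulatedRobotServer(), 3))  # (3.0, 0.0, 0.0)
```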
Middle-level behaviors mostly perform processing of the sensor data to extract useful information. For example, the explorer robot is equipped with sound localization, face finding and gesture recognition behaviors for interacting with its human guide. The messenger robot adds face recognition, occupancy-grid mapping and localization behaviors. These are all skills that are self-contained and implemented as independent software modules. What differs significantly among the different robots is the structure of the highest layer. All of our robots use a single behavior at the highest level that coordinates the middle-level ones. We refer to it as the Supervisor. The robot waiter implements the Supervisor as a single Finite State Automaton. This requires an engineer to specify the FSA, much like in the architectures mentioned earlier. However, for the messenger robot, we use a decision-theoretic approach. We specify the task as a Markov Decision Process and compute an optimal policy using the efficient method of Hoey et al. [HSAHB99]. A policy is a mapping from states to actions; we will further discuss MDPs in Section 1.2 and Chapters 3 and 4. The state of the system is set by the middle-level behaviors. The Supervisor monitors the outputs from the middle layer and, every time the state of one of them changes, it consults the learned policy and takes the optimal action. Supervisor actions are control signals to the middle layer that activate behavioral subgroups. This model improves the ease of programming the highest layer, and it allows for the automatic generation of the optimal policy for the given task, unlike previous approaches that rely on an engineer to specify the policy by hand. Others have also proposed non-decision-theoretic approaches for achieving the same goal. For example, Tsotsos [Tso97] presents a framework and procedure to verify plans that focus on active vision tasks, but to this day no fully implemented system exists. Other systems do not automatically compute optimal policies but allow the developer to verify their own plans using simulation. Unfortunately, the more complex the robot, the more difficult it is to test it in simulation. Such approaches do not tend to scale very well as the domains increase in complexity, as is the case with robots that can interact with people.

1.2  Sequential Decision Making Under Uncertainty

In the previous section, we described the dominance of three-layered control architectures in robot development. In this section, we focus on the deliberative layer, responsible for long-term planning and coordination of the behaviors resident at the executive layer. A robot is continuously faced with making decisions about what actions to take in order to complete its given task. Optimally, it always makes the best decision, with the highest immediate reward and largest future returns. Reinforcement learning is the branch of AI that studies this type of agent [SB98]. Figure 1.4 shows the standard model for reinforcement learning.

Figure 1.4: The standard reinforcement learning model [KLM96].

In this model the agent is connected to the environment through perception (sensors) and actions (actuators). At any time, the agent receives information about the state of the world, τ ∈ T, through its sensors (shown as signal i). It then produces an action, α, that causes a transition in the world state, T × A → T, and generates a reinforcement signal, γ, that is the instantaneous reward for taking action α. Since the agent acts in such a way as to maximize the accumulated total reward over the long term, its behavior, B, is fully specified with this model. Assuming that the state of the world is fully observable and that future states depend only on the current state, the model commonly used in reinforcement learning is a Markov Decision Process [KLM96, RN03, Put94].
An MDP consists of

• a discrete and finite set of environment states, S,
• a discrete and finite set of agent actions, A,
• a transition function T(s, a, s′) for all s, s′ ∈ S and a ∈ A, i.e., P(s′|s, a), and
• a reward function R(s, a) that maps S × A → ℝ.

The state transition model encodes the probability of transitioning from state s to s′ if action a is executed. It is often encoded as an S × S × A table. The reward function assigns a reward for reaching state s′ from state s after action a. The planning problem is that of finding a policy, π, which is a mapping from states to actions. The policy that maximizes the total amount of reward accumulated in the long run is the optimal policy, π*. An MDP makes the Markov assumption in that the future state depends only on the current state (this is encoded implicitly in T). In this thesis, we focus on discounted-reward, infinite-horizon problems with discrete states and observations. For a fully observable environment, there are exact algorithms for computing the optimal policy; there are model-based and model-free ones. We focus on the model-based methods, which assume that the number of states, the transition probabilities and the rewards are known. The value iteration and policy iteration algorithms can be used to solve an MDP, i.e., find the optimal policy π*. Each algorithm performs a search in either the value space or the policy space. Value iteration focuses on finding the best policy by maximizing the utility of each state. The utility of a state, U(s), is defined as the expected utility of the sequence of states that follow it [RN03]. Since the sequence of states depends on the policy being executed, for policy π, we denote the utility of a state as U^π(s).

Algorithm 1 Value Iteration
  initialize U′(s) ← 0 for all s ∈ S
  repeat
    U ← U′; δ ← 0
    for each state s ∈ S do
      U′(s) ← R(s) + γ max_a Σ_{s′} T(s, a, s′) U(s′)
      if |U′(s) − U(s)| > δ then δ ← |U′(s) − U(s)| end if
    end for
  until δ < ε(1 − γ)/γ
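To make the value iteration procedure concrete, here is a minimal runnable sketch on an invented two-state MDP. The algorithm is standard; the states, transition probabilities and rewards below are made up for illustration.

```python
# Value iteration sketch, using the R(s) reward convention from the text.
# The two-state "thermostat" MDP is invented.

def value_iteration(S, A, T, R, gamma=0.9, eps=1e-6):
    """T[s][a][s2] = P(s2 | s, a); returns utilities U and a greedy policy."""
    U = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        U_new = {}
        for s in S:
            U_new[s] = R[s] + gamma * max(
                sum(T[s][a][s2] * U[s2] for s2 in S) for a in A
            )
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps * (1 - gamma) / gamma:   # stop within eps of optimal
            break
    # The optimal policy is greedy with respect to the converged utilities.
    policy = {
        s: max(A, key=lambda a: sum(T[s][a][s2] * U[s2] for s2 in S))
        for s in S
    }
    return U, policy

S, A = ["cold", "warm"], ["heat", "wait"]
T = {
    "cold": {"heat": {"cold": 0.2, "warm": 0.8}, "wait": {"cold": 1.0, "warm": 0.0}},
    "warm": {"heat": {"cold": 0.0, "warm": 1.0}, "wait": {"cold": 0.7, "warm": 0.3}},
}
R = {"cold": -1.0, "warm": 1.0}
U, pi = value_iteration(S, A, T, R)
print(pi)  # {'cold': 'heat', 'warm': 'heat'}
```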
We can define the utility of a state under policy π in terms of the immediate rewards as

U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t) ]    (1.1)

where γ is a discount factor in the interval (0, 1). For the optimal policy, the agent selects actions based on the maximum expected utility, such that

π*(s) = argmax_a Σ_{s′} T(s, a, s′) U(s′)    (1.2)

Equation 1.2 is central to the value iteration algorithm because it tells us that if we know U(s) then we can trivially determine π*. In Equation 1.1 we have already expressed the utility of a state as a function of the expected future discounted reward. It can also be written as

U(s) = R(s) + γ max_a Σ_{s′} T(s, a, s′) U(s′)    (1.3)

This naturally leads to the value iteration algorithm shown in Algorithm 1 (adapted from [RN03]). In the algorithm, U′ is initially set to zero; ε is the maximum error allowed in the utility of a state, and γ is the discount factor as defined previously; δ is the maximum change in the utility function at each iteration, so that the algorithm stops when the solution is within ε of the optimal solution. Policy iteration is an alternative algorithm for solving MDPs by searching in the space of policies. The algorithm begins with an arbitrary policy and then proceeds to improve it iteratively in two steps. During the nth iteration, first, the value of the policy π_n is evaluated by solving the set of |S| equations given by Equation 1.3; the result is U^{π_n}. Second, the policy is improved by selecting the actions that maximize Equation 1.3 for U(s) = U^{π_n}(s). The algorithm exits when the policy can no longer be improved. In general, policy iteration converges after only a few iterations. Most real problems have very large state spaces and as such are hard to solve. Consequently, researchers tackle large MDPs with a variety of techniques, including clever data structures that exploit problem structure or approximate solution methods that yield sub-optimal but good-enough solutions.
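The evaluate-improve loop of policy iteration can be sketched in the same style. As a simplification, policy evaluation below uses fixed-point sweeps instead of solving the |S| linear equations directly; the toy two-state MDP is invented.

```python
# Policy iteration sketch: alternate policy evaluation and greedy improvement
# until the policy is stable. The toy MDP numbers are made up.

def policy_iteration(S, A, T, R, gamma=0.9, eval_sweeps=200):
    pi = {s: A[0] for s in S}                      # arbitrary initial policy
    while True:
        # Evaluation: U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U(s')
        U = {s: 0.0 for s in S}
        for _ in range(eval_sweeps):
            U = {
                s: R[s] + gamma * sum(T[s][pi[s]][s2] * U[s2] for s2 in S)
                for s in S
            }
        # Improvement: act greedily with respect to the evaluated utilities.
        new_pi = {
            s: max(A, key=lambda a: sum(T[s][a][s2] * U[s2] for s2 in S))
            for s in S
        }
        if new_pi == pi:                            # no further improvement
            return pi, U
        pi = new_pi

S, A = ["cold", "warm"], ["heat", "wait"]
T = {
    "cold": {"heat": {"cold": 0.2, "warm": 0.8}, "wait": {"cold": 1.0, "warm": 0.0}},
    "warm": {"heat": {"cold": 0.0, "warm": 1.0}, "wait": {"cold": 0.7, "warm": 0.3}},
}
R = {"cold": -1.0, "warm": 1.0}
pi, U = policy_iteration(S, A, T, R)
print(pi)  # {'cold': 'heat', 'warm': 'heat'}
```

On this tiny problem a single improvement step already stabilizes the policy, illustrating why policy iteration typically needs few iterations.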
Factored MDPs are designed to exploit problem structure, while hierarchical and distributed MDPs are tailored towards finding approximate solutions. Factored MDPs take advantage of problem structure by exploiting variable independence expressed using a Dynamic Bayesian Network (DBN) [BDG95]. Boutilier et al. [BDG95] present a modified algorithm for policy iteration that exploits this type of structure. They call it the Structured Policy Iteration (SPI) algorithm and they show that it can do better than standard policy iteration. In SPI, the state space is compressed by aggregating states that have the same value or are assigned the same action by the current policy. They use decision trees to represent policies and value functions in a compact way. Later, Hoey et al. [HSAHB99] presented a value iteration algorithm for factored MDPs that uses Algebraic Decision Diagrams (ADDs) [BFG+93] as the underlying data structure to achieve large gains in compactness and computational efficiency; ADDs are a generalization of the binary decision diagrams used in SPI for representing policies and value functions. Their approach performs better than SPI and can solve much larger problems. In general, factored MDPs will perform well if there is structure in the problem; otherwise their complexity can be worse than that of standard methods because of the overhead involved in maintaining the data structures. Hierarchical MDPs decompose a problem into a hierarchy of subproblems such that each subproblem is small enough to be solved independently [Die98, Die00, PR98, MPR+98]. The tasks higher in the hierarchy are more general and update at a slower rate. Decomposing the problem into parts while guaranteeing that the global policy is optimal is rather difficult, and it often requires much guidance from a programmer. Solutions to problems expressed as HMDPs can be either globally or recursively optimal.
Globally optimal solutions can sacrifice optimality for some or all subtasks such that the overall policy is optimal. Recursively optimal solutions recurse into the hierarchy and compute optimal policies for each task given that the policies of all its subtasks are optimal. There is no equivalence between the two, although recursively optimal policies are a weaker form of optimality. One apparent drawback of HMDPs for decision making in robots is the recursive execution of policies. When a subtask is given control, no other subtask can interrupt it until it completes and returns control back to the next level up in the hierarchy that invoked it. The work of [MPR+98] addresses this issue by allowing a subtask to be interrupted and another subtask to execute for the next time step. They refer to each subtask using the term option, which is an MDP that can be activated only when one of several activation states is reached. Most importantly, the options framework allows the agent to reason about temporally extended actions, i.e., actions that take more than one time step to conclude. It does not have to handle the case of conflicting actions among different options, since only one option is active at any time. Another advantage of this framework is that it allows an agent to reason uniformly using both primitive and abstract actions. Temporal extension of actions leads naturally to a different class of problems known as semi-Markov Decision Processes. SMDPs are continuous-time systems and as such are a bit more challenging to solve. However, there are extensions to the Bellman equations and to the value and policy iteration algorithms for SMDPs. SMDPs are described in detail in [Put94].

Figure 1.5: Graphical representation of a POMDP agent. The agent is composed of two parts, a state estimator SE and a policy estimator π [Lit94].

An alternative approach to hierarchical problem decomposition is the divide-and-conquer approach of splitting a problem into a number of subproblems that can be solved independently, while the global policy is obtained by combining the solutions. The difference compared to HMDPs is that there is no dependence among the different subproblems. Decentralized MDPs have been studied in the AI literature for a long time. Solutions focus on determining loosely coupled MDPs, often by exploiting structure in the transition model [MHK+98, SC98, RM05, MW04, BZLG03]. In Chapter 3, we propose Multiply Sectioned MDPs (MSMDPs). We take the approach of decomposing a task into a number of subtasks that can be solved independently, much as in HMDPs. With MSMDPs we investigate concurrency at a single level of a hierarchy. We compute an optimal policy for each subtask MDP using a factored MDP solver in order to exploit any structure inherent to the problem. We then execute the sub-MDPs concurrently, in parallel, in the same way that behaviors are executed in the Subsumption architecture. We make the assumption of non-intersecting action sets among the sub-MDPs. We show how we used MSMDPs to design and implement a robot that can successfully complete an interactive task. Since we are interested in real, physical agents that exist in a dynamic world, the assumption that the world state is fully observable is violated, making MDPs unsuitable for use. The state of a dynamic world is not fully observable because of noisy robot dynamics and uncertainty in sensor measurements. MDPs have been extended to handle this uncertainty over the current state in what are known as Partially Observable Markov Decision Processes (POMDPs). A POMDP has all the elements of an MDP, augmented by another two. One is a finite set of possible observations, Ω.
The other is an observation model O(s′, o, a) that describes the probability of making observation o after taking action a from state s and reaching state s′; formally, this is P(o|s′, a). Figure 1.5 shows a graphical representation of a POMDP agent and Figure 1.6 shows the same model as a Dynamic Bayesian Network. The agent consists of two parts, one estimating the current state, SE, and one maintaining and executing a policy. The State Estimator is responsible for computing a probability distribution over the possible states of the world. The distribution output by SE is called the belief, b. The problem of finding the optimal policy for a POMDP can be formalized as one of searching for the optimal policy of a fully observable MDP defined over the belief state space, b(s). More specifically, this "belief MDP" consists of

• the set of belief states B, such that b ∈ B,
• the set of possible actions, A, which are the same as for the original POMDP,
• the belief state transition function, τ(b′, a, b), and
• the reward function given by

ρ(b, a) = Σ_{s∈S} b(s) R(s, a)    (1.4)

We can use value iteration to find the optimal solution for a belief MDP. However, it has been shown that finding the optimal policy is NP-hard [Lit94]. As a result, recent work has focused on using decomposition and hierarchical approaches to compute approximate solutions for POMDPs. For example, Theocharous et al. [TRM01] study how to learn an HPOMDP for robot navigation; they also show how an HPOMDP can be learned efficiently if represented as a DBN [TMK04]. Other work considers the problem of decentralized or interactive POMDPs [GD04, NTY+03, CSC02].

Figure 1.6: Three time slices of the general POMDP model viewed as a Dynamic Bayesian Network. The diagram shows four different types of nodes: A for actions, S for states, O for observations, and R for rewards.
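The state estimator SE described above can be realized as a discrete Bayes filter: the old belief is propagated through T and then reweighted by the observation likelihood. A minimal sketch on an invented two-state door domain (all probabilities are made up):

```python
# Discrete belief update for the state estimator SE:
#   b'(s') ∝ P(o | s', a) * sum_s T(s, a, s') * b(s)
# The two-state door domain and its probabilities are invented.

def belief_update(b, a, o, S, T, O):
    """b maps state -> probability; T[s][a][s2] = P(s2|s,a); O[s2][a][o] = P(o|s2,a)."""
    b_new = {}
    for s2 in S:
        predicted = sum(T[s][a][s2] * b[s] for s in S)   # prediction step
        b_new[s2] = O[s2][a][o] * predicted              # correction step
    norm = sum(b_new.values())
    return {s: p / norm for s, p in b_new.items()}

S = ["door_open", "door_closed"]
# Listening does not change the door state; it only yields an observation.
T = {s: {"listen": {s2: 1.0 if s2 == s else 0.0 for s2 in S}} for s in S}
O = {
    "door_open":   {"listen": {"hear_open": 0.85, "hear_closed": 0.15}},
    "door_closed": {"listen": {"hear_open": 0.15, "hear_closed": 0.85}},
}

b = {"door_open": 0.5, "door_closed": 0.5}
b = belief_update(b, "listen", "hear_open", S, T, O)
print(round(b["door_open"], 2))  # 0.85
```

The policy of the belief MDP then maps such beliefs, rather than states, to actions.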
Solving decentralized POMDPs has been shown to be NEXPcomplete [BGIZ02] and so it is not currently suitable for the development of robotic agents. In recent years, point-based approaches have gained much ground as the preferred solution method for POMDPs. Point-based methods compute an estimate for the value function for a sampled set of belief points; the idea is that only a few key beliefs are relevant for getting a good estimate for the value function and its gradient that generalizes well for all reachable beliefs. The Point-Based Value Iteration (PBVI) [PGT03] and Perseus [SV05], a randomized and more efficient version of PBVI, algorithms are considered state of the art in approximate planning for POMDPs. Perseus has been shown to largely outperform PBVI and so we have chosen to study it further in this thesis. Perseus like PBVI estimates the value function and its gradient for a small set of key beliefs. Selecting a good set of belief points is important and is the focus of some current research [IPA06] but the simplest approach is to sample some beliefs at random by simulating trajectories according to the POMDP model. Recent approaches such as Heuristic Search Value Iteration (HSVI) [SS05] and Forward Search Value Iteration (FSVI) [SBS07] attempt  20  to improve efficiency by using heuristics to guide the search for key beliefs. HSVI computes an upper and lower bound to the optimal value function and selects beliefs that will close the gap between the two bounds as much as possible but it requires the solution of a linear program making it expensive to compute. FSVI improves on HSVI by using the solution to the underlying MDP to guide the search for key beliefs. FSVI is much faster than HSVI but it fails in domains that require agents to execute information gathering actions. 
Perseus, on the other hand, works with a fixed set of belief points and achieves fast performance by not optimizing every sampled belief at each step of value iteration, optimizing only enough points that the value function is improved overall. In Chapter 4 we describe a factored-POMDP framework with multi-step actions for the development of interactive mobile robots. First, we improve upon Perseus by utilizing a heuristic that selects belief points to optimize according to the expected improvement of the value function; we show that Perseus with our heuristic outperforms the original algorithm on a number of large problems. Second, we consider the case of multi-step actions and show how we can derive the problem's transition model for such actions from a partially specified model. We demonstrate our solution using the Human-Oriented Messenger Robot (HOMER).

1.3 Simultaneous Localization and Mapping

Simultaneous localization and mapping (SLAM) is the problem of recursively estimating both the position of a moving sensor and the structure of the environment it senses. SLAM is a core problem in mobile robotics, as it must be solved in order to accurately and effectively explore and navigate through an unknown environment, such that a robot can learn a representation of its surroundings. SLAM is especially difficult to solve because a robot's motions are often noisy and difficult to estimate accurately over time; the error in the position estimate can grow without bound. Likewise, all sensors are noisy, and the problem of recovering structure from their output can be ill-posed. Formally, SLAM is the problem of estimating the joint posterior distribution over the map M and robot pose s at time t given all observations z and controls u since time 1:

p(st, Mt | z1...t, u1...t)    (1.5)

Non-linear robot and sensor dynamics result in a posterior distribution that does not admit a closed-form solution.
In addition, the large size of the map makes maintaining an approximation to the joint difficult. As a result, most SLAM algorithms either assume linear-Gaussian dynamics or exploit structure in the problem in order to find a tractable solution. Linear-Gaussian SLAM solutions can be derived using the Kalman Filter (KF); however, real systems are non-linear, and as a result SLAM research has focused on the Extended Kalman Filter (EKF) and, most recently, its information (inverse covariance) form, the Information Filter (IF) [SSC90, LD91, TLK+04, ESL05]. These approaches model the joint posterior distribution as a unimodal Gaussian. The EKF scales quadratically with respect to the number of landmarks in the map and is unable to handle bad data associations in landmark observations. As a result, much focus has been placed on improving the filter's scalability by decomposing the problem into loosely-coupled, spatially independent subproblems using submaps [LF99, GN01, GN03, KDR01, PJTN07, NSD07]. Recent work has produced information filter approaches that operate in constant time, but in most cases they depend on approximations [TLK+04, Pas03] or linear-time iterative state recovery methods [ESL+05].

An alternative approach that does not assume linear-Gaussian dynamics is to use a particle filter (PF). In recent years, particle filters have gained ground in state estimation because of their ability to represent arbitrary distributions. For SLAM, PFs can model the multi-modal joint posterior distribution that can arise from non-linear robot dynamics and the perceptual aliasing problem in sensing. In addition, PF updates scale logarithmically with respect to the map size, and can handle poor data association robustly. The first successful application of PFs to robot localization was the Monte Carlo Localization (MCL) algorithm used in the robot museum tour guide Minerva [DBFT99, BCF+98].
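A single MCL update can be sketched as a generic particle filter step. The motion and likelihood models below are placeholders for the robot- and sensor-specific models, not those used on any particular system:

```python
import numpy as np

rng = np.random.default_rng(0)

def mcl_step(particles, weights, u, z, motion_model, likelihood):
    """One Monte Carlo Localization update: sample each particle through
    the noisy motion model, reweight by the observation likelihood, then
    resample to concentrate particles on likely poses."""
    particles = np.array([motion_model(p, u) for p in particles])
    weights = weights * np.array([likelihood(z, p) for p in particles])
    weights = weights / weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```

With a 1D pose, Gaussian motion noise, and a Gaussian observation likelihood, a few hundred particles concentrate around the true pose after a single update.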
MCL was based on the CONDENSATION [IB98] algorithm proposed in computer vision for robust visual tracking in cluttered scenes. Since then, MCL has been adapted to work with a number of different sensors, especially vision [WBB02, PMZI04, KBH+04]. In Chapter 5 we present σMCL, an algorithm for solving the localization problem for unconstrained camera motion in 3D. We assume that a map of 3D point landmarks is given to the algorithm. We show that our approach accurately estimates the camera's position using only a few frames. The σMCL algorithm is also capable of solving the kidnapped robot problem, that is, estimating the robot's location given a strong but incorrect prior.

The application of particle filters to SLAM was first studied by Murphy [Mur99] in the context of learning occupancy-grid maps. He was the first to exploit the method of Rao-Blackwellisation to show that a sampling approach can be used to solve this challenging problem. Because PFs require a very large number of particles as the size of the state increases, the method of Rao-Blackwellisation is used to factor the joint posterior distribution as shown in Equation 1.6,

p(s1...t, Mt | z1...t, u1...t) = p(s1...t | z1...t, u1...t) Π_k p(Mt(k) | s1...t, z1...t, u1...t)    (1.6)

where Mt(k) denotes the k-th landmark in the map. In addition, notice that this modified posterior distribution is over robot trajectories, avoiding the need to integrate out the past. One can estimate each of the distributions in Equation 1.6 using either the EKF or a PF. In SLAM, it is common to use a standard PF to estimate p(s1...t | z1...t, u1...t) and an EKF for each of the k landmarks in the map. The properties of the Rao-Blackwellised Particle Filter (RBPF) were studied in detail by Doucet et al. [DdFMR00]. However, RBPF-based SLAM was popularized by Montemerlo et al. [MTK+04, MTKW03], who introduced the FastSLAM algorithm for learning landmark-based maps using a time-of-flight laser sensor.
Their major contribution was a tree data structure for sharing landmark estimates among particles, significantly reducing the memory requirements of the algorithm and allowing for fast particle propagation. Similarly, Eliazar et al. [EP04] introduced an efficient data structure for a variation of FastSLAM that learns high-fidelity occupancy-grid maps from laser measurements.

Recently, focus has been shifting to vision as the sensing modality [KA07, CDM07, Dav03, Bar05, SLL02, Tom07, CGKM07]. Se et al. [SLL02] utilize a stereo camera to learn maps of 3D landmarks identified using the Scale Invariant Feature Transform [Low99] (SIFT). They also use the method of Rao-Blackwellisation to reduce the size of the problem, but employ a Kalman filter where Murphy used a particle filter. This is equivalent to block-diagonalizing an EKF (correlations between robot pose and landmarks, and between the landmarks themselves, are ignored). They demonstrate their approach successfully mapping a small laboratory environment. For larger environments, Se's method is likely to be inconsistent and lead to filter divergence, as it has been demonstrated that a diagonalized (decoupled) EKF for SLAM is overconfident [DNC+01]. Recently, Barfoot [Bar05] has extended this work to use a particle filter in place of the Kalman filter, but he has only presented results using a single particle, in which case it does not differ significantly from Se's original work.

Other recent work on visual SLAM considers the case of using a single camera. Davison [Dav03] presents monocular SLAM using a Kalman filter. He uses salient image features as landmarks and employs a Bayesian approach for estimating their 3D position. Notably, his algorithm runs in real time, but it has only been demonstrated to work well on small trajectories. The vSLAM [KBO+05] approach employs particle filters to localize in a hybrid metric-topological map.
They identify landmarks using SIFT but can only localize accurately in locations where nodes of the topological map have been instantiated.

Figure 1.7: Examples of the maps learned with σSLAM for two adjacent rooms of size 16 × 10 meters. Part (a) shows a top-down view of the maps of 3D landmarks used for localization. Shown in gray is the filter trajectory and in blue the raw visual odometry trajectory. Part (b) shows the occupancy grid constructed using the camera location estimated with σSLAM. In the grid, white pixels represent empty space while black pixels represent occupied space. Gray pixels denote unobserved areas. The grid's resolution is 15 × 15 cm.

Visual SLAM is very closely related to the structure from motion (SFM) problem, wherein, given a set of images taken of a scene from unknown positions with an arbitrary amount of overlap, the problem is to recover the position of the camera for each image, as well as the structure of the scene (or a subset of 3D point features extracted from the images) [HZ00]. The primary difference between SFM and SLAM is that SLAM involves recursive estimation of the camera position and scene structure (usually taking advantage of the spatio-temporal adjacency of images in the sequence), whereas SFM solutions are usually batch methods. Recursive estimation is especially important for a robot that must explore autonomously. Our work centers around solving the problem of 3D visual SLAM using a calibrated stereo camera. In Chapter 6 we present σSLAM, the σMCL extension for solving the SLAM problem. It is similar to Se et al. [SLL02] because we learn the same type of maps and identify landmarks using SIFT. It is also similar to Barfoot [Bar05] because we employ a particle filter. The main difference between that work and ours is the real-time performance of our system for large numbers of particles.
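A particle filter's accuracy depends heavily on its proposal distribution. As a schematic of the mixture idea (the mixing weight phi and the two samplers below are placeholders, not the actual models developed in this thesis), each new pose hypothesis can be drawn either from the motion model or from an observation-driven estimate:

```python
import random

def mixture_propose(pose, u, z, motion_sample, obs_sample, phi=0.5):
    """With probability phi, draw the new pose from an observation-driven
    proposal (e.g. a pose hypothesis computed from the latest image);
    otherwise draw it from the standard odometry-based motion model.
    The importance weights must then be computed with respect to the
    mixture density, not a single component."""
    if random.random() < phi:
        return obs_sample(z)
    return motion_sample(pose, u)
```

Sampling some particles from the observation keeps the filter responsive when odometry is poor (e.g. wheel slip), while the motion-model samples keep it robust when the observation is ambiguous.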
We present a new mixture proposal distribution for the particle filter that takes into account both the robot's motion dynamics and the most recent observation. We use an SFM-based approach for visual odometry estimation, similar in nature to that developed in [ZS01], and as such our system is not confined to planar motion. We show that we can create maps of dynamic indoor environments under difficult conditions including large changes in illumination and image blurring. Figure 1.7 shows one map learned using our algorithm for a typical office-like environment. The figure shows both the map of 3D landmarks used for localization and the 2D occupancy grid that is used for path planning and obstacle avoidance.

1.4 Thesis Contributions

The work presented in this thesis makes the following major scientific contributions:

• Collectively, our work presents a framework for designing and implementing visually-guided, interactive, mobile robots. It is significant that all robots presented in this thesis utilize no sensors other than a steerable stereo camera. The framework draws from the strengths of reactive robot control architectures and the expressiveness of utility-based planning methods.

• [ESRH04, EHL03] We study the use of loosely coupled, factored, Markov Decision Processes (MDPs) for offline task specification, high level planning, and real-time execution of plans with concurrent actions. We call the framework Multiply Sectioned Markov Decision Processes (MSMDPs) because it is partially inspired by research in multi-agent systems. In Chapter 3, we show that MSMDPs are an effective approach for mobile robot planning under the strong assumption that the world's state is fully observable.

• [EL07] In Chapter 4, we remove the assumption of full observability and use the Partially Observable Markov Decision Process (POMDP) framework for behavior coordination. Once again, we focus on using a factored representation.
In addition, we show how to properly model the temporal characteristics of the robot's actions as exposed to the planner via high level behaviors. We present heuristics for improving the efficiency of point-based solution methods computing approximate solutions for large POMDPs. Finally, we justify our work by presenting empirical results with our mobile robot delivering verbal messages among the inhabitants of an office-like environment.

• [EL05, SEGL05, SEL05, SEG+06] Finally, we present our state-of-the-art metric localization system for indoor spaces, based on visually identified natural 3D landmarks. In Chapter 5, we present σMCL, a Monte Carlo approach for solving the problem of robot localization in 3D using a stereo camera. We show that the algorithm correctly estimates the robot's position over long trajectories and in the presence of strong but incorrect location priors. In Chapter 6, we present σSLAM, the extension of σMCL towards solving the simultaneous localization and mapping problem in 3D using stereo vision. We show our method performing in near real time, constructing maps of large indoor environments in the presence of large variations in illumination, image blurring, and dynamic objects.

Bibliography

[AB97] R. C. Arkin and T. R. Balch. AuRA: Principles and practice in review. Journal of Experimental and Theoretical Artificial Intelligence (JETAI), 9(2/3):175–188, April 1997.

[AHB87] K. S. Arun, T. S. Huang, and S. D. Blostein. Least-squares fitting of two 3-d point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(5):698–700, 1987.

[Alb91] J. S. Albus. Outline for a theory of intelligence. IEEE Transactions on Systems, Man and Cybernetics, 21(3):473–509, May/June 1991.

[ARH87] R. C. Arkin, E. Riseman, and A. Hanson. AuRA: An architecture for vision-based robot navigation. pages 417–431, Los Angeles, 1987.

[Ark89] R. C. Arkin. Motor schema-based mobile robot navigation.
International Journal of Robotics Research, 8(4):99–112, 1989.

[Ark98] R. C. Arkin. Behavior-based Robotics. MIT Press, 1998.

[Bar05] T. D. Barfoot. Online visual motion estimation using FastSLAM with SIFT features. In Proc. IEEE/RSJ Conf. on Intelligent Robots and Systems, pages 3076–3082. IEEE/RSJ, IEEE Press, Edmonton, AB, August 2005.

[BCF+98] W. Burgard, A. B. Cremers, D. Fox, D. Hähnel, G. Lakemeyer, D. Schulz, W. Steiner, and S. Thrun. The interactive museum tour-guide robot. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI '98), Madison, Wisconsin, July 1998.

[BCF+99] W. Burgard, A. B. Cremers, D. Fox, D. Hähnel, G. Lakemeyer, D. Schulz, W. Steiner, and S. Thrun. Experiences with an interactive museum tour-guide robot. Artificial Intelligence, 114(1-2):3–55, 1999.

[BDG95] C. Boutilier, R. Dearden, and M. Goldszmidt. Exploiting structure in policy construction. In Proc. of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95), August 1995.

[Bee99] M. Beetz. Structured reactive controllers. In Oren Etzioni, Jörg P. Müller, and Jeffrey M. Bradshaw, editors, Proceedings of the Third International Conference on Autonomous Agents (Agents'99), pages 228–235, Seattle, WA, USA, 1999. ACM Press.

[BF81] R. C. Bolles and M. A. Fischler. A RANSAC-based approach to model fitting and its application to finding cylinders in range data. In Proceedings of the Seventh Int. Joint Conf. on Artificial Intelligence, pages 637–643, Vancouver, B.C., Canada, August 1981.

[BFG+93] R. I. Bahar, E. A. Frohm, C. M. Gaona, G. D. Hachtel, E. Macii, A. Pardo, and F. Somenzi. Algebraic Decision Diagrams and their applications. In Proc. International Conf. Computer-Aided Design (ICCAD), pages 188–191, 1993.

[BFT97] W. Burgard, D. Fox, and S. Thrun. Active mobile robot localization. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI '97), Nagoya, Japan, August 1997.

[BGIZ02] D. S. Bernstein, R. Givan, N. Immerman, and S.
Zilberstein. The complexity of decentralized control of Markov Decision Processes. Mathematics of Operations Research, 27(4), November 2002.

[BL97] J. S. Beis and D. G. Lowe. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1000–1006. IEEE, IEEE Press, Puerto Rico, June 1997.

[BNN06] T. Bailey, J. Nieto, and E. Nebot. Consistency of the FastSLAM algorithm. In Proc. IEEE International Conference on Robotics and Automation (ICRA), pages 424–429. IEEE Press, Orlando, FL, May 2006.

[Bre03] C. Breazeal. Towards sociable robots. Robotics and Autonomous Systems, 42(3-4):167–175, 2003.

[Bro86] R. A. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2(1):14–23, April 1986.

[Bro89] R. A. Brooks. A Robot that Walks; Emergent Behavior from a Carefully Evolved Network. Neural Computation, 2(1):253–262, 1989.

[Bro90] R. A. Brooks. The behavior language: User's guide. Technical Report AIM-1227, 1990.

[Bro91] R. A. Brooks. Intelligence without representation. Artificial Intelligence, 47:139–159, 1991.

[BZLG03] R. Becker, S. Zilberstein, V. Lesser, and C. V. Goldman. Transition-independent decentralized Markov Decision Processes. In Proc. AAMAS, 2003.

[Cap86] L. R. Caporael. Anthropomorphism and mechanomorphism: Two faces of the human machine. Computers in Human Behavior, 3(2):215–234, 1986.

[CDM07] J. Civera, A. Davison, and J. Montiel. Inverse depth to depth conversion for monocular SLAM. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA'07), Rome, Italy, April 2007.

[CGKM07] R. Castle, D. Gawley, G. Klein, and D. Murray. Towards simultaneous recognition, localization and mapping for hand-held and wearable cameras. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA'07), Rome, Italy, April 2007.

[Con92] J. H. Connell.
SSS: A hybrid architecture applied to robot navigation. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA '92), pages 2719–2724, Nice, France, May 1992.

[CSC02] I. Chades, B. Scherrer, and F. Charpillet. A heuristic approach for solving decentralized-POMDP: assessment on the pursuit problem. In Proceedings of the 2002 ACM Symposium on Applied Computing (SAC '02), pages 57–62, Madrid, Spain, August 2002. ACM.

[Dav99] A. M. Davies. Aging and health in the 21st century. In Aging and Health: A Global Challenge for the Twenty-First Century, Geneva, Switzerland: World Health Organization, 1999.

[Dav03] A. Davison. Real-time simultaneous localisation and mapping with a single camera. In Proceedings of the IEEE Int. Conf. on Computer Vision, pages 1403–1410, Nice, France, 2003.

[DBFT99] F. Dellaert, W. Burgard, D. Fox, and S. Thrun. Using the condensation algorithm for robust, vision-based mobile robot localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'99), Fort Collins, CO, June 1999.

[DdFG01] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer-Verlag, 2001.

[DdFMR00] A. Doucet, N. de Freitas, K. Murphy, and S. Russell. Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Uncertainty in Artificial Intelligence, pages 176–183, 2000.

[Die98] T. G. Dietterich. The MAXQ method for hierarchical reinforcement learning. In Proc. 15th International Conf. on Machine Learning, pages 118–126. Morgan Kaufmann, San Francisco, CA, 1998.

[Die00] T. G. Dietterich. An overview of MAXQ hierarchical reinforcement learning. Lecture Notes in Computer Science, 1864:26–??, 2000.

[DNC+01] G. Dissanayake, P. Newman, S. Clark, H. Durrant-Whyte, and M. Csorba. A solution to the simultaneous localization and map building (SLAM) problem. IEEE Transactions on Robotics and Automation, 17(3):229–241, June 2001.

[Duf02] B. R. Duffy. Anthropomorphism and the social robot.
In Robot as Partner: An Exploration of Social Robots, IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems, Lausanne, Switzerland, 2002.

[EHL+02] P. Elinas, J. Hoey, D. Lahey, J. Montgomery, D. Murray, S. Se, and J. J. Little. Waiting with Jose, a vision based mobile robot. In Proc. of the IEEE International Conference on Robotics & Automation (ICRA '02), Washington, D.C., May 2002.

[EHL03] P. Elinas, J. Hoey, and J. J. Little. Human oriented messenger robot. In Proc. of AAAI Spring Symposium, Stanford, CA, March 2003.

[EL02] P. Elinas and J. J. Little. A robot control architecture for guiding a vision-based mobile robot. In Proc. of AAAI Spring Symposium in Intelligent Distributed and Embedded Systems, Stanford, CA, March 2002.

[EL05] P. Elinas and J. Little. σMCL: Monte-Carlo localization for mobile robots with stereo vision. In Proceedings of Robotics: Science and Systems, Cambridge, MA, USA, 2005.

[EL06] P. Elinas, R. Sim, and J. J. Little. σSLAM: Stereo vision SLAM using the Rao-Blackwellised Particle Filter and a novel mixture proposal distribution. In Proceedings of the 2006 IEEE International Conference on Robotics and Automation, pages 1564–1570, Orlando, FL, USA, 2006.

[EL07] P. Elinas and J. J. Little. Decision theoretic task coordination for a visually-guided interactive mobile robot. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), San Diego, USA, 2007.

[EP04] A. I. Eliazar and R. Parr. DP-SLAM 2.0. In Proceedings of the 2004 IEEE International Conference on Robotics and Automation, volume 2, pages 1314–1320. IEEE Press, New Orleans, LA, 2004.

[ESL05] R. Eustice, H. Singh, and J. Leonard. Exactly sparse delayed-state filters. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation, pages 2428–2435, Barcelona, Spain, April 2005.

[ESL+05] R. Eustice, H. Singh, J. Leonard, M. Walter, and R. Ballard. Visually navigating the RMS Titanic with SLAM information filters.
In Proceedings of Robotics: Science and Systems, Cambridge, MA, June 2005.

[ESRH04] P. Elinas, E. Sucar, A. Reyes, and J. Hoey. A decision theoretic approach for task coordination in social robots. In The 13th IEEE International Workshop on Robot and Human Interactive Communication, pages 679–684, Kurashiki, Okayama, Japan, September 2004.

[Fau93] O. D. Faugeras. Three-Dimensional Computer Vision: A Geometric Viewpoint. MIT Press, 1993.

[FJC05] J. Folkesson, P. Jensfelt, and H. I. Christensen. Graphical SLAM using vision and the measurement subspace. In Int. Conf. on Intelligent Robotics and Systems (IROS), pages 325–330. IEEE/JRS, Edmonton, Canada, August 2005.

[GD04] P. Gmytrasiewicz and P. Doshi. Interactive POMDPs: Properties and preliminary results. pages 1374–1375, New York, NY, 2004.

[GN01] J. E. Guivant and E. M. Nebot. Optimization of the simultaneous localization and map building algorithm for real-time implementation. IEEE Transactions on Robotics and Automation, 17(3):242–257, June 2001.

[GN03] J. E. Guivant and E. M. Nebot. Solving computational and memory requirements of feature-based simultaneous localization and mapping algorithms. IEEE Transactions on Robotics and Automation, 19(4):749–755, August 2003.

[GVH03] B. Gerkey, R. T. Vaughan, and A. Howard. The Player/Stage Project: Tools for multi-robot and distributed sensor systems. In Proc. of the 11th International Conference on Advanced Robotics (ICAR), Coimbra, Portugal, June 2003.

[HFBT03] D. Hähnel, D. Fox, W. Burgard, and S. Thrun. A highly efficient FastSLAM algorithm for generating cyclic maps of large-scale environments from raw laser range measurements. In Proc. of the Conference on Intelligent Robots and Systems (IROS), pages 206–211, 2003.

[HSAHB99] J. Hoey, R. St-Aubin, A. Hu, and C. Boutilier. SPUDD: Stochastic planning using decision diagrams. In Proceedings of International Conference on Uncertainty in Artificial Intelligence (UAI '99), Stockholm, 1999.

[HZ00] R.
Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge Univ. Pr., Cambridge, UK, 2000.

[IB98] M. Isard and A. Blake. Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998.

[IPA06] M. T. Izadi, D. Precup, and D. Azar. Belief selection in point-based planning algorithms for POMDPs. In Canadian Conference on AI, pages 383–394, 2006.

[KA07] K. Konolige and M. Agrawal. Frame-frame matching for real-time consistent visual mapping. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA'07), Rome, Italy, April 2007.

[KBH+04] B. Kröse, R. Bunschoten, S. ten Hagen, B. Terwijn, and N. Vlassis. Household robots look and learn. IEEE Robotics and Automation Magazine, 11(4):45–52, December 2004.

[KBO+05] N. Karlsson, E. Di Bernardo, J. Ostrowski, L. Goncalves, P. Pirjanian, and M. E. Munich. The vSLAM algorithm for robust localization and mapping. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 24–29, Barcelona, Spain, April 2005.

[KD04] N. M. Kwok and G. Dissanayake. An efficient multiple hypothesis filter for bearing-only SLAM. In Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems, pages 736–741. IEEE Press, Sendai, Japan, October 2004.

[KDR01] J. Knight, A. Davison, and I. Reid. Towards constant time SLAM using postponement. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Maui, USA, 2001.

[KLM96] L. P. Kaelbling, M. L. Littman, and A. P. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

[LD91] J. J. Leonard and H. F. Durrant-Whyte. Simultaneous map building and localization for an autonomous mobile robot. In Proceedings of the IEEE Int. Workshop on Intelligent Robots and Systems, pages 1442–1447, Osaka, Japan, November 1991.

[LF99] J. J. Leonard and H. J. S. Feder.
A computationally efficient method for large-scale concurrent mapping and localization. In J. Hollerbach and D. Koditschek, editors, Proceedings of the Ninth International Symposium on Robotics Research, pages 316–321, London, England, October 1999. Springer-Verlag.

[Lit94] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning (ML-94), pages 157–163, New Brunswick, NJ, 1994. Morgan Kaufmann.

[Low91] D. G. Lowe. Fitting parameterized three-dimensional models to images. IEEE Trans. Pattern Analysis Mach. Intell. (PAMI), 13(5):441–450, May 1991.

[Low99] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the Seventh International Conference on Computer Vision (ICCV'99), pages 1150–1157, Kerkyra, Greece, September 1999.

[LSF89] L. J. Lin, R. Simmons, and C. Fedor. Experience with a task control architecture for mobile robots. Technical Report CMU-RI-TR-89-29, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, December 1989.

[MDK+02] A. Morris, R. Donamukkala, A. Kapuria, A. Steinfeld, J. Matthews, J. Dunbar-Jacobs, and S. Thrun. A robotic walker that provides guidance. Submitted for publication, 2002.

[ME85] H. Moravec and A. Elfes. High-resolution maps from wide-angle sonar. In Proc. IEEE Int'l Conf. on Robotics and Automation, St. Louis, Missouri, March 1985.

[MEBH07] A. Mihailidis, P. Elinas, J. Boger, and J. Hoey. An intelligent powered wheelchair to enable mobility of cognitively impaired older adults: an anticollision system. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 15(1):136–143, 2007.

[MHK+98] N. Meuleau, M. Hauskrecht, K. Kim, L. Peshkin, L. P. Kaelbling, T. Dean, and C. Boutilier. Solving very large weakly coupled Markov Decision Processes. In Proc. AAAI, pages 165–172, 1998.

[MPR+98] A. McGovern, D. Precup, B. Ravindran, S. Singh, and R. Sutton. Hierarchical optimal control of MDPs.
In Proc. of the Yale Workshop on Adaptive and Learning Systems, 1998.

[MPR+02] M. Montemerlo, J. Pineau, N. Roy, S. Thrun, and V. Verma. Experiences with a mobile robotic guide for the elderly. In Proceedings of the AAAI National Conference on Artificial Intelligence (AAAI '02), Edmonton, Canada, 2002.

[MTKW03] M. Montemerlo and S. Thrun. Simultaneous localization and mapping with unknown data association using FastSLAM. In Proceedings of the 2003 International Conference on Robotics and Automation, pages 1985–1991, 2003.

[MTKW03] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. In Proceedings of the Eighteenth Int. Joint Conf. on Artificial Intelligence (IJCAI-03), pages 1151–1156. Morgan Kaufmann Publishers, San Francisco, CA, 2003.

[MTK+04] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM: A factored solution to the simultaneous localization and mapping problem. In Proceedings of the AAAI National Conf. on Artificial Intelligence, pages 593–598. AAAI, Edmonton, Canada, 2002.

[MW04] Mausam and D. S. Weld. Solving concurrent Markov Decision Processes. In Proc. AAAI, pages 716–722, San Jose, CA, USA, July 2004.

[Mor77] H. Moravec. Towards automatic visual obstacle avoidance. In Proceedings of the Fifth International Joint Conference on Artificial Intelligence (IJCAI), Cambridge, MA, USA, August 1977.

[Mur99] K. Murphy. Bayesian map learning in dynamic environments. In Neural Information Processing Systems (NIPS), pages 1015–1021, 1999.

[NM02] M. N. Nicolescu and M. J. Matarić. A hierarchical architecture for behavior-based robots. In Proceedings of the 1st International Joint Conference on Autonomous Agents and Multiagent Systems, pages 227–233, Bologna, Italy, July 2002.

[NSD07] K. Ni, D. Steedly, and F. Dellaert. Tectonic SAM: Exact, out-of-core, submap-based SLAM. In Proc.
of the IEEE International Conference on Robotics and Automation (ICRA'07), Rome, Italy, April 2007.

[NTY+03] R. Nair, M. Tambe, M. Yokoo, D. Pynadath, and S. Marsella. Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, August 2003.

[Nil69] N. Nilsson. A mobile automaton: An application of Artificial Intelligence techniques. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Washington D.C., USA, May 1969.

[OC03] A. Oreback and H. I. Christensen. Evaluation of architectures for mobile robots. Autonomous Robots, 14:33–49, 2003.

[Pas03] M. A. Paskin. Thin junction tree filters for simultaneous localization and mapping. In Proceedings of the Eighteenth Int. Joint Conf. on Artificial Intelligence (IJCAI-03), pages 1157–1164. Morgan Kaufmann Publishers, San Francisco, CA, 2003.

[PGT03] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, 2003. IJCAI.

[PJTN07] L. M. Paz, P. Jensfelt, J. D. Tardós, and J. Neira. EKF SLAM updates in O(n) with Divide and Conquer SLAM. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA'07), Rome, Italy, 2007.

[PMZI04] E. Menegatti, M. Zoccarato, E. Pagello, and H. Ishiguro. Image-based Monte-Carlo localisation with omnidirectional images. Robotics and Autonomous Systems, 48(1):17–30, August 2004.

[PR98] R. Parr and S. Russell. Reinforcement learning with hierarchies of machines. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1998.

[Put94] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, NY, 1994.

[RB94] M.
Pearce, R. C. Arkin, A. Ram, and G. Boone. Using genetic algorithms to learn reactive control parameters for autonomous robotic navigation. Adaptive Behavior, 2(3), 1994.

[Res02] S. Restivo. Romancing the robots: Social robots and society. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems, Lausanne, Switzerland, 2002.

[RM05] K. Rohanimanesh and S. Mahadevan. Coarticulation: An approach for generating concurrent plans in Markov Decision Processes. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 720–727, 2005.

[RN03] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, New Jersey, 2003.

[SA98] R. Simmons and D. Apfelbaum. A task description language for robot control. In Proceedings of the Conference on Intelligent Robotics and Systems, Vancouver, Canada, 1998.

[SB98] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[SBS07] G. Shani, R. I. Brafman, and S. E. Shimony. Forward search value iteration for POMDPs. In Proc. of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI-07), pages 2619–2624, January 2007.

[SC98] S. Singh and D. Cohn. How to dynamically merge Markov Decision Processes. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Information Processing Systems, volume 10, pages 1057–1063, Cambridge, MA, 1998. The MIT Press.

[SEG+06] R. Sim, P. Elinas, M. Griffin, A. Shyr, and J. J. Little. Design and analysis of a framework for real-time vision-based SLAM using Rao-Blackwellised particle filters. In Proceedings of the 3rd Canadian Conference on Computer and Robotic Vision (CRV 2006), Quebec City, QC, Canada, May 2006.

[SEGL05] R. Sim, P. Elinas, M. Griffin, and J. J. Little. Vision-based SLAM using the Rao-Blackwellised Particle Filter. In IJCAI Workshop on Reasoning with Uncertainty in Robotics (RUR), Edinburgh, Scotland, 2005.

[SEL05] R. Sim, P. Elinas, and J. J. Little.
A study of the Rao-Blackwellised particle filter for efficient and accurate vision-based SLAM. International Journal of Computer Vision/International Journal of Robotics Research Special Joint Issue on Vision in Robotics.

[SFG+99] R. Simmons, J. Fernandez, R. Goodwin, S. Koenig, and J. O’Sullivan. Xavier: An autonomous mobile robot on the web. IEEE Robotics and Automation Magazine, 1999.

[SGB05] C. Stachniss, G. Grisetti, and W. Burgard. Recovering particle diversity in a Rao-Blackwellized particle filter for SLAM after actively closing loops. In Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), pages 667–672, Barcelona, Spain, 2005.

[SGG+03] R. Simmons, D. Goldberg, A. Goode, M. Montemerlo, N. Roy, B. Sellner, C. Urmson, A. Schultz, M. Abramson, W. Adams, A. Atrash, M. Bugajska, M. Coblenz, M. MacMahon, D. Perzanowski, I. Horswill, R. Zubek, D. Kortenkamp, B. Wolfe, T. Milam, and B. Maxwell. GRACE: An autonomous robot for the AAAI robot challenge. AI Magazine, 24(2):51–72, 2003.

[SGSL05] R. Sim, M. Griffin, A. Shyr, and J. J. Little. Scalable real-time vision-based SLAM for planetary rovers. In IEEE IROS Workshop on Robot Vision for Space Applications, pages 16–21, Edmonton, AB, August 2005.

[SLL02] S. Se, D. Lowe, and J. J. Little. Mobile robot localization and mapping with uncertainty using scale-invariant landmarks. International Journal of Robotics Research, 21(8):735–758, August 2002.

[SMDL05] J. Solà, A. Monin, M. Devy, and T. Lemaire. Undelayed initialization in bearing only SLAM. In Proc. Int. Conf. on Intelligent Robots and Systems (IROS), pages 2751–2756, Edmonton, AB, August 2005.

[SS05] T. Smith and R. Simmons. Point-based POMDP algorithms: Improved analysis and implementation. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI ’05), Edinburgh, Scotland, 2005.

[SSC90] R. Smith, M. Self, and P. Cheeseman.
Estimating uncertain spatial relationships in robotics. In I. J. Cox and G. T. Wilfong, editors, Autonomous Robot Vehicles, pages 167–193. Springer-Verlag, 1990.

[SV05] M. T. J. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195–220, 2005.

[TBB+99a] S. Thrun, M. Bennewitz, W. Burgard, A. B. Cremers, F. Dellaert, D. Fox, D. Hähnel, C. Rosenberg, N. Roy, J. Schulte, and D. Schulz. Minerva: A second-generation museum tour-guide robot. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA’99), Detroit, Michigan, May 1999.

[TBB+99b] S. Thrun, M. Bennewitz, W. Burgard, A. B. Cremers, F. Dellaert, D. Fox, D. Hähnel, C. R. Rosenberg, N. Roy, J. Schulte, and D. Schulz. MINERVA: A tour-guide robot that learns. In KI – Künstliche Intelligenz, pages 14–26, 1999.

[TFB00] S. Thrun, D. Fox, and W. Burgard. Monte Carlo localization with mixture proposal distribution. In Proceedings of the 2000 National Conference of the American Association for Artificial Intelligence (AAAI), pages 859–865, 2000.

[Thr00] S. Thrun. Probabilistic algorithms in robotics. AI Magazine, 21(4):93–109, 2000.

[TLK+04] S. Thrun, Y. Liu, D. Koller, A. Y. Ng, Z. Ghahramani, and H. Durrant-Whyte. Simultaneous localization and mapping with sparse extended information filters. International Journal of Robotics Research, 23(7-8):693–716, 2004.

[TMK04] G. Theocharous, K. Murphy, and L. P. Kaelbling. Representing hierarchical POMDPs as DBNs for multi-scale robot localization. In Proceedings of the IEEE Conference on Robotics and Automation (ICRA ’04), New Orleans, USA, April 2004.

[Tom07] M. Tomono. Monocular SLAM using a Rao-Blackwellised particle filter with exhaustive pose space search. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA’07), Rome, Italy, April 2007.

[TRM01] G. Theocharous, K. Rohanimanesh, and S. Mahadevan.
Learning hierarchical partially observable Markov Decision Process models for robot navigation. In Proc. of the IEEE International Conference on Robotics & Automation (ICRA ’01), Seoul, Korea, May 2001.

[Tso95] J. K. Tsotsos. On behaviorist intelligence and the scaling problem. Artificial Intelligence, 75(2):135–160, 1995.

[Tso97] J. K. Tsotsos. Intelligent control for perceptually attentive agents: The S* proposal. Robotics and Autonomous Systems, 5(21):5–21, January 1997.

[WBB02] J. Wolf, W. Burgard, and H. Burkhardt. Robust vision-based localization for mobile robots using an image retrieval system based on invariant features. In Proc. of the IEEE International Conference on Robotics & Automation (ICRA), 2002.

[ZS01] Z. Zhang and Y. Shan. Incremental motion estimation through local bundle adjustment. Microsoft Research Technical Report MSR-TR-015, 2001.

Chapter 2

Waiting with José, a Vision-based Mobile Robot

This Chapter is about using vision for autonomous robotics. Vision provides rich, high-bandwidth, two-dimensional data containing information about color, texture, depth and optic flow, among others. This multi-modal data source can be exploited for the accomplishment of many different tasks. It offers a wealth of information about a robot's environment, and is an alternative to more specialized sensors such as sonar or laser range finders. Although vision is such a rich data source, it usually requires complex techniques for the extraction of useful information. For example, while sonar data directly estimates depth, vision data (from multiple cameras) requires a stereo matching algorithm. However, vision sensors can estimate further properties of environmental structure using the integrated color and texture information. In this Chapter, the techniques we have developed for using vision are discussed in the context of a particular robotic task: serving food to a gathering of people.
To accomplish this task, a robot must reliably navigate around a room populated by groups of people, politely serving appetizers to humans. The robot must also monitor the food it has available to serve, and return to a home base location to refill when the food is depleted. The serving task involves many basic aspects of mobile robotics, including localization, mapping, navigation and human-robot interaction. In ongoing research, we have developed a solid framework for accomplishing these fundamentals using only vision as a sensor on our autonomous robot, José (Figure 2.1).

[Footnote 1: A version of this chapter has been published. P. Elinas, J. Hoey, D. Lahey, J. D. Montgomery, D. Murray, S. Se and J. J. Little, Waiting with José, a vision-based mobile robot, IEEE International Conference on Robotics and Automation, 2002.]

Figure 2.1: José, the robotic waiter

Problems specific to the serving task were also solved using vision, including finding people to serve and monitoring food. Previous approaches to the autonomous serving task include Alfred [MMA+99], the winning robot waiter at the 1999 "Hors D'œuvres Anyone?" competition. José differs from Alfred in three respects. First, Alfred relies on sonar for navigation, while José uses only vision. Second, Alfred focused on speech recognition much more than José. Although Alfred's speech recognition worked well in the laboratory environment, it performed poorly in the crowded, noisy conference reception hall typical of "Hors D'œuvres Anyone?" competitions [MMA+99]. Commercial speech recognition systems, as used by both Alfred and José, have not reached the level of accuracy needed for conference reception environments, and we therefore decided not to rely on speech recognition for José. Third, Alfred needed special landmarks for navigation, and had lighting- and scale-dependent landmark recognition systems. José uses natural landmarks and a scale- and illumination-invariant recognition system.
This Chapter is structured as follows. The next section presents an overview of our mobile vision-based robot, José, including his hardware systems and his software architecture, which separates control into low, mid and high-level behaviors. Sections 2.2, 2.3 and 2.4 present details on these three levels. Section 2.5 shows results from some sample serving runs, and discusses our experiences at the 2001 AAAI mobile robot competition. Section 2.6 concludes the Chapter.

2.1 José

José is a Real World Interface (RWI) B-14 mobile robot equipped with an Intel Pentium PC running the Linux operating system. José senses the environment through five cameras, three of which are encapsulated in a Triclops stereo vision camera module. José has a Sony pan-tilt color camera, which is used for locating people to interact with. A Sony laptop computer mounted above José supports the food tray, and its screen displays José's face. A Logitech web-cam keeps watch over the food tray. José has a Compaq wireless ethernet modem which allows his software systems to be distributed. In particular, a second Linux PC serves as a host computer running a supervisor module, and the laptop computer runs the food monitoring and face generation programs.

José is driven by a hierarchical behavior-based control architecture [Ark98], as shown in Figure 2.2. The system divides the robot's behaviors into three levels, each of which contains simple, independent modules. The modularity of the system makes implementation and testing simple and efficient. The lowest level involves perception and motor control, and includes servers interfacing with the robot's motors and odometry, the Triclops unit, and the color camera.
Figure 2.2: Control Architecture

The middle level includes modules for the various behaviors the robot needs to perform its tasks. These fall into three categories, as shown in Figure 2.2.

Mobility: Behaviors which enable the robot to circulate in its environment: mapping, localization, and navigation.

Human-Robot Interaction (HRI): Behaviors for interacting with humans: finding people to interact with, speech recognition and synthesis, and facial expression generation.

Serving: Behaviors specific to the serving task: planning locations for service and monitoring the food tray.

The highest level is a single supervisor behavior which delegates tasks to middle-level modules. To ensure scalability of the system, the supervisor runs on a remote computer, and communicates with the middle-level behaviors through sockets. The middle-level modules communicate with the lowest level through a shared memory architecture. Middle- and low-level behaviors must therefore all run on the robot, with the exception of speech recognition, the facial expressions and the tray monitoring, which communicate directly with sensors. The following three sections describe each of the three levels in Figure 2.2.

2.2 Perception and Motor Control

José's trinocular stereo unit (Triclops) outputs three images. The corresponding dense two-dimensional depth information is used as the primary input for the map building, localization, navigation, and people finding behaviors. Triclops was developed at the UBC Laboratory for Computational Intelligence (LCI) and is marketed by Point Grey Research, Inc. (www.ptgrey.com).
A Matrox Meteor frame grabber connects the Triclops to José. The Triclops stereo vision module has three identical wide-angle (90° field-of-view) cameras, arranged in an L shape. The system is calibrated, and corrected for lens distortion and camera misalignment in software, to yield three corrected images that conform to a pinhole camera model with square pixels. The camera coordinate frames are co-planar and aligned so that the epipolar lines of the camera pairs lie along the rows and columns of the images.

The trinocular stereo approach is based on the multi-baseline stereo developed by Okutomi and Kanade [OK93]. Each pixel in the reference image is compared with pixels along the epipolar lines in the top and left images. The comparison measure used is the sum of absolute differences. The results of the two image pairs (left/right, top/bottom) are summed to yield a combined score. Multi-baseline stereo avoids ambiguity because the sum of the comparison measures is unlikely to cause a mismatch: an erroneous minimum in one pair is unlikely to coincide with an erroneous minimum in another pair. Examples of the stereo results are shown in Figure 2.3(a) and (b). Further details on the stereo algorithm we use can be found in [ML00].

A PCTV frame grabber card delivers color images from the Sony pan-tilt unit through the color image server (see Figure 2.2). The color images are registered with the stereo images from the Triclops using an offline manual calibration. The calibration must be repeated only when the positions (relative to the robot) of the color camera or Triclops unit are adjusted. We are currently replacing the Triclops and color camera with a single digital color Triclops unit called Digiclops (also from Point Grey Research). The Digiclops unit, with integrated color and stereo information, will circumvent the need for the calibration.

The RWI robot platform has motor controls for rotation and translation, and provides odometry data.
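The two-pair scoring just described can be sketched as follows. This is a minimal illustration of multi-baseline SAD matching for a single reference pixel, not the production stereo code; all function and parameter names are our own.

```python
import numpy as np

def sad(a, b):
    # sum of absolute differences over a square window
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def best_disparity(ref, left, top, r, c, max_disp, win=2):
    """Score each candidate disparity d for reference pixel (r, c) by adding
    the SAD of the horizontal pair (reference vs. left image shifted along the
    row) to the SAD of the vertical pair (reference vs. top image shifted
    along the column). An erroneous minimum in one pair is unlikely to
    coincide with one in the other, so the combined score disambiguates."""
    ref_patch = ref[r - win:r + win + 1, c - win:c + win + 1]
    scores = []
    for d in range(max_disp + 1):
        left_patch = left[r - win:r + win + 1, c + d - win:c + d + win + 1]
        top_patch = top[r + d - win:r + d + win + 1, c - win:c + win + 1]
        scores.append(sad(ref_patch, left_patch) + sad(ref_patch, top_patch))
    return int(np.argmin(scores))
```

For a scene at constant disparity d, the left image reproduces the reference shifted by d columns and the top image by d rows, so the combined score reaches its minimum exactly at d.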
Although the odometry is fairly accurate, it can lead to serious errors in mapping and localization over the time period of a typical circulation of the serving robot. Methods for correcting such errors are discussed in Section 2.3.1.

2.3 Modeling and Task Execution

This section describes the mid-level behaviors which enable José to accomplish basic mobility (mapping, localization and navigation), human-robot interaction (people finding, speech synthesis and recognition, and facial expressions), and other behaviors specific to the serving task (goal planning and tray monitoring).

2.3.1 Mobility

The most fundamental, though by no means the simplest, task for a mobile robot is moving around in its environment. This must be accomplished within certain safety limits for the robot. If humans are present (as in the serving task), their safety cannot be jeopardized. These constraints are satisfied by building an accurate map, localizing the robot, and then navigating safely through the mapped environment, as we now describe.

Occupancy Grid Mapping

Occupancy grid mapping, pioneered by Moravec and Elfes [Elf89], is the most widely used robot mapping technique due to its simplicity, robustness and flexibility in accommodating many kinds of spatial sensors. It also adapts well to dynamic environments. The technique divides the environment into a discrete grid and assigns to each grid location a value related to the probability that the location is occupied by an obstacle. Initially, all grid values are set to 50%, indicating equal probability that the grid location is occupied and unoccupied. Sensor readings supply uncertainty regions within which an obstacle is expected to be. Probabilities at grid locations that fall within these regions of uncertainty are increased, while those at locations in the sensing path between the robot and the obstacle are decreased.
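The update scheme above can be sketched in the standard log-odds formulation; the chapter does not spell out the exact update rule, so the sensor-model probabilities below are illustrative assumptions, not the values used on José.

```python
import math

P_HIT, P_MISS = 0.7, 0.4   # assumed sensor-model probabilities (illustrative)

def logodds(p):
    return math.log(p / (1.0 - p))

class OccupancyGrid:
    """Every cell starts at P(occupied) = 0.5, i.e. log-odds 0. A range
    reading raises the probability of the cell inside the obstacle's
    uncertainty region and lowers the cells on the sensing path between
    the robot and the obstacle."""
    def __init__(self, width, height):
        self.L = [[0.0] * width for _ in range(height)]   # log-odds values

    def update_ray(self, path_cells, hit_cell):
        for r, c in path_cells:                 # evidence for "free"
            self.L[r][c] += logodds(P_MISS)
        r, c = hit_cell                         # evidence for "occupied"
        self.L[r][c] += logodds(P_HIT)

    def prob(self, r, c):
        # convert accumulated log-odds back to a probability
        return 1.0 / (1.0 + math.exp(-self.L[r][c]))
```

After a few readings along the same ray, path cells drift toward probability 0 and the hit cell toward 1; thresholding this probability then yields the obstacle map, as in the chapter.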
Although occupancy grids may be implemented in any number of dimensions, most mobile robotics applications (including ours) use 2D grids. Much of the 3D data is lost in the construction of a 2D occupancy grid map. The robot possesses 3 DOF (X, Y, heading) within a 2D plane corresponding to the floor. The robot's field of view sweeps out a 3D volume above this plane. A projection of all obstacles within this volume to the floor uniquely identifies free and obstructed regions in the robot's space.

Figure 2.3 shows the construction of the 2D occupancy grid sensor reading from a single 3D stereo image. Figure 2.3(a) shows the reference camera greyscale image (320x240 pixels), and (b) the resulting disparity image. Black regions indicate image areas which were invalidated; otherwise, brighter areas indicate higher disparities (closer to the camera). The maximum disparities in each column are converted to depth to produce a radial map, as shown in Figure 2.3(c). Figure 2.3(d) shows these depth values converted into an occupancy grid representation; light grey indicates clear regions, black indicates occupied, and dark grey indicates unknown areas.

Figure 2.3: From stereo images to radial maps. (a) greyscale image (b) disparity image (black indicates invalid, otherwise brighter indicates closer to the cameras) (c) depth vs. columns graph (depth in cm) (d) the resultant estimate of clear, unknown and occupied regions (light grey is clear, black is occupied and dark grey is unknown)

The process illustrated in Figure 2.3 generates the input into our stereo vision occupancy grid. The mapping system then integrates these values over time, to expand the map and keep it current in the changing world. We identify an obstacle at all locations in the occupancy grid where the value is above a threshold. Figure 6.4 shows examples of occupancy grids generated in this way.

Localization

Safe mobility involves simultaneous localization and mapping (SLAM).
The robot must build a map of the environment and track its position relative to that environment. However, accurate localization is a prerequisite for building a good map, and having an accurate map is essential for good localization. This problem has been a central research topic for the past few years [SK95, BCF+98, TBF98, DBFT99, TBF00].

Our vision-based SLAM algorithm uses Triclops stereo data of features detected by the Scale Invariant Feature Transform (SIFT) [Low99]. Simply put, José finds out where he is by recognizing and locating previously observed visual features in his environment. SIFT features are invariant to image translation, scaling and rotation, and partially invariant to illumination changes and affine or 3D projection. These characteristics make SIFT features suitable landmarks for mapping and localization, since when mobile robots are moving around in an environment, landmarks are observed from different angles, distances and under different illuminations. Figure 2.4(a) shows an example of detected SIFT features, including scale and orientation.

The SIFT features must be located in three dimensions. To accomplish this, we match SIFT features in each of the three images delivered by the Triclops system, combined with epipolar and disparity constraints. Figure 2.4(b) shows the final disparities of all consistent SIFT features. From the positions of the matches, and the camera intrinsic parameters, we can compute the 3D world coordinates of each feature relative to the robot. We maintain a database of the located SIFT landmarks and use it to match features found in subsequent views. Once the SIFT features are matched, we can use the matches in a least-squares procedure to compute a more accurate camera ego-motion and hence correct localization errors.

Figure 2.4: The (a) SIFT features found, with scale and orientation indicated by the size and orientation of the squares, and (b) stereo matching result, where horizontal and vertical lines indicate the horizontal and vertical disparities respectively.

This SLAM procedure results in a 3D map of SIFT features, and an accurate position and orientation of the robot in the map. The SIFT map is presently separate from the occupancy grid, but in principle it can be integrated with the grid so that errors due to drift and slippage can be corrected. An example SIFT map with over 2000 landmarks is shown in Figure 2.5. Readers are referred to [SLL01b] for further details of the SLAM technique.

The database of SIFT features can also be used for global localization [SLL01a], i.e., determining the current position with no prior position information. To tackle this problem, we consider matching a set of SIFT landmarks as a whole. Given a small set of current SIFT features and a large set of SIFT landmarks in the database, we would like to estimate the robot position that would have brought the largest number of landmarks into close alignment, provided that the robot has previously viewed the current scene during the map building stage. We use the RANSAC (RANdom SAmple Consensus) method to generate hypotheses of the form (X, Z, θ), where X is the sideways displacement, Z is the forward displacement and θ is the orientation. We select the hypothesis with the maximum number of matches and the lowest least-squares error.

Figure 2.5: Bird's eye view of the SIFT map around the base region: the home location is indicated by the square, with the current robot position and view direction shown as a V.

Global localization is necessary for serving food when the robot must find its way back to a home base to refill with food. When a refill is required, the robot navigates by dead-reckoning to around 2m away from the home base and carries out global localization there. Figure 2.5 shows the positions of the robot and home base at this stage. Using the localization estimate, the robot can then proceed to the home region successfully for refill.

Navigation

Given a goal location, the robot position, and the occupancy grid map, we want to find the shortest and safest path connecting the two. The path planning algorithm we use is a mixture of shortest path [LRDG90] and potential field methods [Kha86, BLL92]. In clear areas, the method operates as a shortest path planner with a fixed distance constraint from obstacles. In cluttered areas, the method turns into a potential field planner, to avoid getting stuck. The combination of the two allows the robot to navigate efficiently in clear environments without getting stuck in cluttered areas. Our navigator is described more fully in [ML00].

2.3.2 Human-Robot Interaction

We are mainly interested in robotic tasks oriented towards people, and devote a significant portion of our research to human-robot interaction. We wish to develop natural interfaces for control of, and for social interaction with, our robots. Natural interfaces include speech, gesture and facial expression. This section describes our efforts towards enabling José with the capacity to find people in his environment (a necessary precursor to interaction) and with natural interaction behaviors.

Finding People

Humans are distinguished from the environment in a two-stage process: skin-color segmentation, followed by rejection of false positives using the occupancy grid. One feature that all people have in common is the hue of their skin, which falls in a narrow range that is largely invariant to a person's skin color. The threshold value is decided during a training stage by calculating the mean and standard deviation of the hue of a number of sample skin pixels. We re-train the system for significant changes of the illumination in the operating environment. On average, the hue threshold falls around the value 30 ± 10. José converts RGB color images from the color camera to HSV color space and segments them to select the human skin-colored pixels.
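The train-then-threshold scheme above can be sketched per pixel as follows. This is our own minimal illustration: the hue is scaled to an 8-bit range so that the quoted threshold of roughly 30 ± 10 is meaningful, but the exact hue scale and the acceptance band (here, two standard deviations) are assumptions.

```python
import colorsys

def hue8(r, g, b):
    """Hue of an RGB pixel on an 8-bit scale (0..255); the scale matching
    the chapter's ~30 +/- 10 skin threshold is an assumption."""
    h, _, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 256.0

def train_threshold(skin_samples):
    """Mean and standard deviation of the hue of labelled skin pixels,
    computed during the training stage."""
    hues = [hue8(*p) for p in skin_samples]
    mean = sum(hues) / len(hues)
    std = (sum((h - mean) ** 2 for h in hues) / len(hues)) ** 0.5
    return mean, std

def is_skin(pixel, mean, std, k=2.0):
    # accept pixels whose hue lies within k standard deviations of the mean
    return abs(hue8(*pixel) - mean) <= k * std
```

Re-training under new illumination simply means re-running train_threshold on freshly labelled samples; the second stage, rejection against the occupancy grid, operates on the surviving pixels.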
Since the color images are registered with the Triclops stereo data (see Section 2.2), 3D locations of skin-colored regions are recovered. These locations are then projected to the floor, and used to build a 2D map of people locations (see Figure 6.4 for examples). Some objects have hues very similar to human skin (e.g., cardboard and wood). José must differentiate people from such obstacles to ensure appropriate serving behavior (e.g., so as not to serve wooden tables). Fortunately, José's map, as described in Section 2.3.1, is built while he is alone in the area he is to operate in. Thus, we can compare each selected skin pixel's projected floor location against the occupancy grid and ignore locations that are unoccupied or marked as static obstacles. Two examples of skin-color detection are shown in Figure 2.9. While the segmentation clearly misses some of the skin-colored regions in both images, there are no false positives remaining after comparison with the occupancy grid map. The 2D people location map is integrated over time, resulting in a map, Pp(x, t), giving the probability that a person is at location x at time t. Figure 6.4 shows examples of this map during a typical serving run.

Interacting with People

A robot gains acceptance by humans if it allows for natural interaction. We have explored interactions between José and his customers using speech and facial expressions. José uses a DoubleTalk speech generation engine to utter pre-defined statements. In conjunction with speech, facial expressions are displayed with an animated face on a laptop screen mounted above the serving tray. Examples are shown in Figure 2.6. The animated face lends expressiveness to the speech, thus making interactions with José more interesting for his customers. While many face generation systems use complex 3D graphics [MMA+99], José's face is a simple cartoon.
This allows for fast rendering, and does not detract from interaction quality, since humans will interact with even the simplest of generated faces as with a real human face [RN96]. José has speech recognition capabilities, but has not made extensive use of them yet. We chose not to rely on speech recognition, as robustness to environmental factors has not yet emerged in commercially available products. We are also working on facial expression and gesture recognition for José [Eli01].

Figure 2.6: Faces coincide with speech (a) "Stop stealing food!", (b) "I'm sorry", (c) "Would you like an appetizer?", (d) "I have no food left!"

2.3.3 Serving

The particular task we have used recently as a testbed for our vision-based robotic system is that of serving food to a gathering of people. This task requires many of the behaviors which have been implemented on our robotic platform, José. Serving also necessitates some additional task-specific behaviors: circulating in a room full of people, ensuring coverage (everyone gets served) and making sure there is food in the serving tray. This section describes these two behaviors.

Route planning

José must plan a route through the environment that enables him to offer food to candidate humans and to return to his refilling station when required. This is accomplished using a procedure that dynamically determines the best feasible goal location. At each time, t, the best goal is defined using a dynamic desirability function, D(x, t), x ∈ E, where E is the spatial extent of the environment. The desirability of a location, x, tells the robot the utility of being at position x at time t, given that the current robot position is xr(t). A goal is chosen as the maximum of the desirability function. We calculate the desirability as a weighted sum of the people probability map, Pp (Section 2.3.2), and three cost terms Cc, Co, and Ch:

1.
The cost associated with locomotion, Cc, is given by the length of a path planned to x from the current robot position, xr(t): Cc = c(xr(t), x).

2. The cost of proximity to obstacles, Co, is given by Co = min_{i=0..No} ‖x − oi‖, where oi is the location of the i-th static obstacle (Section 2.3.1) and No is the number of obstacles.

3. The cost of serving at previously served sites, Ch, is given by

Ch = Σ_{τ=1}^{τmax} m^{c(x, xr(t−τ))} h^τ    (2.1)

where the constant parameter m ∈ (0, 1) adjusts José's desire to serve as many locations as possible, and the constant parameter h ∈ (0, 1) discounts the past. The history is not considered beyond a horizon τmax.

The desirability function is given by D = Pp − ωo² Co − ωh² Ch − ωc² Cc, where the weights, ω, are parameters specified by the designer. Figure 2.11 shows some example desirability maps. Our experiments in various environments have shown that maximization of the desirability function produces reasonable goals. If no people are detected, José will wander the room in an exploratory fashion. If people are detected, José will try to serve the closest person. The more people that are detected, the longer José will remain in the area to serve before moving on. Despite this success, we have found such a primitive motion model to be insufficient in general. The assumption that congregation sites remain relatively stationary in a typical reception setting was found to be misguided. Therefore, a method of dynamically tracking the desired target(s) and appropriately adjusting the goal location is important for general application, and is a subject of ongoing research.

Monitoring appetizers

Figure 2.7: Top row: images taken while a person helps himself to an appetizer. Bottom row: segmented regions.

As shown in Figure 2.1, José carries a food tray monitored by a Logitech web cam.
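The route planner's goal selection, maximizing D = Pp − ωo²Co − ωh²Ch − ωc²Cc with Ch from equation (2.1), can be sketched as follows. This is a minimal illustration under two stated simplifications: Euclidean distance stands in for the planned-path cost c(·, ·), and the people probability map is passed as a callable. The default parameter values are those reported for the example run in Section 2.5.

```python
import math

def desirability(x, robot, people_prob, obstacles, past_poses,
                 wo=3.0, wh=1.0, wc=0.2, m=0.75, h=0.9, tau_max=10):
    """D(x) = Pp(x) - wo^2*Co - wh^2*Ch - wc^2*Cc, with Ch as in eq. (2.1).
    Euclidean distance stands in for the planned-path cost c(., .)."""
    c = math.dist
    Cc = c(robot, x)                                   # locomotion cost
    Co = min(c(x, o) for o in obstacles)               # obstacle-proximity cost
    Ch = sum(m ** c(x, past_poses[-tau]) * h ** tau    # served-site history
             for tau in range(1, min(tau_max, len(past_poses)) + 1))
    return people_prob(x) - wo**2 * Co - wh**2 * Ch - wc**2 * Cc

def best_goal(candidates, **kwargs):
    # the goal is simply the maximizer of the desirability function
    return max(candidates, key=lambda x: desirability(x, **kwargs))
```

In the full system the maximization runs over the grid of the environment rather than a small candidate list, and c(·, ·) is the planned path length, which also makes Cc respect obstacles.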
Monitoring the amount of food allows Jos´ e to detect when someone takes food, and when the tray is empty (calling for a return to base). The tray is solid black and has a dull texture so that regions containing food will have a significantly higher intensity than the background, allowing the percentage of food on the tray to be estimated using a simple thresholding operation. Other objects, such as human hands, occasionally appear in the cameras field of view, causing increases in the percentage of segmented pixels. However the amount of food on the tray should only decrease as people take food from the tray. The amount of food on the tray should increase only when Jos´ e is at home base for refilling. Therefore, if the number of non-black pixels suddenly increases significantly, it is likely that some other object has entered the image. However, a persistent increase indicates new food on the tray. Jos´ e keeps a ten second history of the percentages of non-black pixels that it has computed for the images. The percentage of food on the tray is estimated as the minimum of the percentages of non-black pixels in the history. With this 60  Has−Food−in−tray & Has−target−location  Start  At−target−location  Serving  To−serve−location Has−food−in−tray & Has−target−location  No−more−food  No−Food−in−tray Reached−base No−food−in−tray  Done  Go−to−base  Figure 2.8: The state diagram for Jos´ e’s serving behaviors strategy, an increase in the percentage of non-black pixels will not affect the food percentage unless the increase persists for the entire length of the history. Figure 2.7 shows images of the food tray before, during, and after, respectively, a person takes food from it. The top set of images are the images taken by the camera. The bottom set of images are the corresponding images resulting from segmentation of food pixels (shown in white) from tray pixels (shown in black). 
Before the person takes food from the tray (leftmost image), the computed food percentage is 34%. When the person moves his hand into the view of the camera (middle image), to take food, the number of non-black pixels increases. However, the computed food percentage remains at 34%. Once the food is removed (rightmost image), the computed food percentage drops to 26%.  2.4  Planning  The highest level of control belongs to a supervisor that activates mid-level behaviors to achieve the task at hand. The behavior of the supervisor is modeled with a finite state automaton, as shown in Figure 2.8. The supervisor remains in the start state while the robot waits at the base location for food to be placed in its tray. When the tray is filled and the goal planner returns a location to serve, the supervisor enters the to-serve-location state and the 61  supervisor directs the navigator to move to the goal location. Once the robot arrives at the goal, the supervisor enters the serving state, and directs the speech and face to offer food. After a brief pause, the supervisor checks the amount of food left. If there is still food in the tray, the goal planner is again invoked. If there is no food left in the tray, the supervisor switches to the go-to-base state, and directs the navigator to return back to the base. When the robot arrives near the base, the supervisor invokes the localization behavior, which globally locates the robot and base, allowing a precise move to the base.  2.5  Results  Figures 2.9, 6.4, and 2.11 show data from an example serving run performed in our laboratory. Jos´ e had previously built an occupancy grid, which is shown in Figure 6.4. The occupancy grids show unexplored and explored space in dark and light gray, respectively, while obstacles are shown in black. Jos´ e starts at his home base and performs a visual scan of his environment. Color, skin segmentations and stereo data from this scan are shown across two images in Figure 2.9. 
The people-finding behavior locates two groups based on the skin segmentations and stereo data, as shown in light-colored pixels overlaid on the occupancy grids in Figure 6.4(a). The goal planner computes the initial desirability function, as shown in Figure 2.11(a). The parameters for the desirability function were set to ωo = 3.0, ωh = 1.0, ωc = 0.2, m = 0.75 and h = 0.9. The first goal is chosen as the maximum of this function, and is centered on the group of people to José's right. The navigator plans a path to this goal, and José offers appetizers upon arrival in front of the group, as shown in Figure 2.11(b). The goal planner incorporates this serving location into the desirability function, as shown in Figure 2.11(b). The new maximum of the function is centered on the second group of people, which becomes the next goal location. Figure 6.4(b) shows José's path as he navigates to the second group and again offers food. José has now run out of food on his tray, and proceeds to navigate back to home base, as shown in Figure 6.4(c). To ensure accurate global localization, he first navigates to a point 2m in front of the base, performs the localization, and then navigates to the base, arriving within 10cm. José was deployed at the Hors D'œuvres Anyone? mobile robot serving competition in Seattle. He detected and approached groups of people, knew when his tray was empty, and found his way back home to within 10cm. José's face, voice and well-tailored dress were great crowd pleasers, evoking many smiles and laughter. Our experience at the competition uncovered two facts about the robotic serving task: we found that vision alone is sufficient to perform the serving job, and we realized that probabilistic dynamic modeling of people would be a very useful additional component to a system for robotic waiting. We were initially apprehensive about José moving in a room full of people without using sonar.
Collisions with humans in the room would lead to disqualification. However, we found that our vision capabilities provided ample real-time feedback about the positions of obstacles to allow the robot to move and serve safely and successfully. The primary reason for these capabilities is the fast, high-quality stereo data provided by the Triclops system. However, using vision data alone does impose constraints. First, stereo matching takes time, which bounds the translation and rotation speeds the robot can achieve. Translation is limited to avoid collisions: the robot cannot react to objects in its path until they appear in the stereo data. Rotation is limited because multiple frames are needed to confirm the presence or absence of an obstacle; if the robot rotates too quickly, the presence and position of an obstacle will not be confirmed and will not appear in the occupancy grid, possibly leading to a collision. Translation speed was set to 30 cm/s, and rotation speed to 10 deg/s. While this gave fairly satisfactory performance, an increase in speed would give a more life-like performance. The second constraint imposed by our stereo vision data is a limited field of view, implying a limit on the number of map updates that can be performed in a time interval. Sonar and laser range sensors avoid this problem with omnidirectional scanning; additional Triclops units could be used to achieve a larger field of view for a vision-based robot. The Hors D'œuvres Anyone? competition showed that our assumption of static groups of people is often not valid in a serving environment. People are dynamic objects, and seem to behave in strange ways in the presence of a robot. Many of the observed human behaviors were attempts to provoke some kind of reaction from the robot: clustering around, waving hands in front of the cameras, attempting to block José's path, etc.
These behaviors were interpreted by José (perhaps correctly) as attempts to foil his serving task. He would remonstrate with the culprits, sometimes to no avail. José currently makes assumptions about the dynamics of people. He chooses a group of people to serve, and then makes his way to the location of the group. Once he begins, he does not verify that the group has maintained its position, and continues until he reaches his target. In many cases, the chosen group moves, often towards José. If they come towards him, he perceives them as an obstacle, asks them to move, and waits for them to do so. We are currently working on simple following behaviors that will avoid these kinds of problems; however, a more general dynamic plan-updating scheme would be an asset. We are currently investigating a probabilistic people-mapping algorithm. As well, we are combining the occupancy-grid navigation and obstacle avoidance with the localization and odometry correction provided by the SIFT map.

2.6  Conclusions

We have presented our visually guided autonomous serving robot, José. Mapping, localization and navigation issues that have been the focus of recent research in our laboratory were discussed. Human-robot interaction and serving issues were also covered. Our results show that José is a capable waiter, combining effective robotic techniques with panache and wit, and the delicate savoir-faire of an élite waiter. Our experiences at the 2001 AAAI Hors D'œuvres Anyone? competition uncovered issues that we are currently looking into. These include more dynamic modeling of people, better navigation techniques, and more integrated speech and facial expression interactions.

Figure 2.9: Two views of José's environment from the home base as seen through the color camera (top row) and corresponding skin segmentations (middle row) and stereo depth images (bottom row).
Note how the positions of the two groups of people in these images relate to the 2D people locations shown superimposed on the occupancy grids in Figure 6.4.

Figure 2.10: Occupancy grids at three times during a serving run. Also shown are the accumulated skin-color maps, and José's trajectory, with an 'x' marking the home base, and 'o's marking serving locations.

Figure 2.11: Desirability maps at two times during a serving run.

Bibliography

[Ark98] R. C. Arkin. Behavior-based Robotics. MIT Press, 1998.

[BCF+98] W. Burgard, A. B. Cremers, D. Fox, D. Hahnel, G. Lakemeyer, D. Schulz, W. Steiner, and S. Thrun. The interactive museum tour-guide robot. In AAAI-98, Madison, Wisconsin, July 1998.

[BLL92] J. Barraquand, B. Langlois, and J. Latombe. Numerical potential field techniques for robot path planning. IEEE Trans. on Systems, Man, and Cybernetics, 22(2):224–241, March/April 1992.

[DBFT99] F. Dellaert, W. Burgard, D. Fox, and S. Thrun. Using the condensation algorithm for robust, vision-based mobile robot localization. In CVPR-99, Fort Collins, CO, June 1999.

[Elf89] A. Elfes. Using occupancy grids for mobile robot perception and navigation. IEEE Computer, 22(6):46–67, June 1989.

[Eli01] P. Elinas. Interactive directed exploration for mobile robots. Master's thesis, University of British Columbia, 2001.

[Kha86] O. Khatib. Real-time obstacle avoidance for manipulators and mobile robots. Intl. Journal of Robotics Research, 5(1):90–98, Spring 1986.

[Low99] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV-99, Kerkyra, Greece, September 1999.

[LRDG90] J. Lengyel, M. Reichert, B. Donald, and D. Greenberg. Real-time robot motion planning using rasterizing computer graphics hardware. In SIGGRAPH, Dallas, Texas, August 1990.

[ML00] D. Murray and J. J. Little. Using real-time stereo vision for mobile robot navigation. Autonomous Robots, 8:161–171, 2000.

[MMA+99] L. Meeden, B. Maxwell, N. S. Addo, L. Brown, P.
Dickson, J. Ng, S. Olshfski, E. Silk, and J. Wales. Alfred: The robot waiter who remembers you. In AAAI Workshop on Robotics, Orlando, FL, 1999.

[OK93] M. Okutomi and T. Kanade. A multiple-baseline stereo. IEEE PAMI, 15(4):353–363, April 1993.

[RN96] B. Reeves and C. Nass. The Media Equation. Cambridge University Press, 1996.

[SK95] R. Simmons and S. Koenig. Probabilistic robot navigation in partially observable environments. In IJCAI-95, San Mateo, CA, 1995. Morgan Kaufmann.

[SLL01a] S. Se, D. Lowe, and J. J. Little. Local and global localization for mobile robots using visual landmarks. In IROS-01, Maui, Hawaii, October 2001.

[SLL01b] S. Se, D. Lowe, and J. J. Little. Vision-based mobile robot localization and mapping using scale-invariant features. In ICRA-01, Seoul, Korea, May 2001.

[TBF98] S. Thrun, W. Burgard, and D. Fox. A probabilistic approach to concurrent mapping and localization for mobile robots. Machine Learning and Autonomous Robots (joint issue), 31(5):1–25, 1998.

[TBF00] S. Thrun, W. Burgard, and D. Fox. A real-time algorithm for mobile robot mapping with applications to multi-robot and 3D mapping. In ICRA-00, San Francisco, CA, April 2000.

Chapter 3

Multiply Sectioned Markov Decision Processes for Task Coordination in Interactive Mobile Robots¹

We are concerned with the problem of building mobile robotic systems with capacities to interact with humans. Such robots will need navigation, mapping, localization and obstacle avoidance capabilities to deal with moving around in an uncertain and changing environment. They will also need to model the dynamics of people in an environment, including their locations in space and their behavioral patterns. Finally, human users will require such robots to present clear, simple and natural interactive interfaces, which enable easy exchanges of information between robot and human.
The systems to deal with each of these problems can be made fairly independent, and thus can be implemented modularly. The remaining challenge is to integrate and coordinate the modules to perform a given task.

¹A version of this chapter has been published: P. Elinas, E. Sucar, A. Reyes, and J. Hoey. A decision theoretic approach for task coordination in social robots. In the 13th IEEE International Workshop on Robot and Human Interaction, 2004.

We propose a framework for task coordination based on multiple Markov decision processes (MDPs), which we call Multiply Sectioned Markov Decision Processes (MS-MDPs), that satisfies the previous requirements. Using a representation based on MS-MDPs, we partition the task into a number of subtasks, each assigned to an MDP, such that each one can be specified and solved independently. Each MDP controls one aspect of the global task, and all are executed concurrently, coordinated implicitly by common state variables. At the execution stage, all MDP policies are consulted concurrently, and the best actions for each are selected according to its policy and the current state. The MS-MDP approach has several advantages over using a single MDP: (i) it is easier to specify a simpler, smaller MDP for each subtask, and (ii) solving each subtask MDP is more efficient than solving a single, more complex MDP. However, these efficiency gains come at the cost of optimality, as we discuss in Section 3.3.2. MS-MDPs also have an advantage over hierarchical MDPs [Die00], since they allow multiple actions to be executed simultaneously without considering all possible action combinations. We validate our approach by presenting experiments performed with our robot HOMER. HOMER, the Human Oriented MEssenger Robot, is a mobile robot that communicates messages between humans in a workspace. The message delivery task is a challenging domain for an interactive robot.
It presents all the difficulties associated with uncertain navigation in a changing environment, as well as those associated with exchanging information and taking commands from humans using a natural interface. For this task we use three MDPs: the navigator, the dialogue manager and the gesture generator. Together they coordinate 10 behaviors for accomplishing the message delivery task. Although we describe our framework for the message delivery task, our system can easily be extended to other human-interactive mobile robotic tasks. The rest of this chapter is organized as follows. We begin by reviewing related work in social mobile robots, in particular alternative approaches for planning and coordination. Then we introduce our mobile robot, HOMER, describing his hardware and software systems and the different modules that are used for the message delivery task. We describe the general framework for task coordination based on MS-MDPs, and present the particular configuration used for HOMER. We then introduce the domain of message delivery, and show results of experiments demonstrating HOMER's performance in a complex domain. Lastly, we conclude and discuss future research directions.

3.1  Related Work

Building service robots to help people has been the subject of much recent research. The challenge is to achieve reliable systems that operate in highly dynamic environments and have easy-to-use interfaces. This involves solving both the more traditional robot problems of navigation and localization and the more recent problems in human-robot interaction. Another challenge arises from the large scope of these systems and the many pieces that must be integrated to make them work. RHINO [BCF+98] was one of the most successful service robots ever built. It was designed as a museum tour guide.
RHINO successfully navigated a very dynamic environment using laser sensors, and interacted with people using pre-recorded information; a person could select a specific tour of the museum by pressing one of many buttons on the robot. RHINO's task planning was specified using GOLEX, an extension of the GOLOG language (itself based on first-order calculus) that adds the ability to generate hierarchical plans and a run-time component monitoring the execution of those plans. MINERVA [TBB+99] was the successor of RHINO. MINERVA differed from RHINO in that it could generate tours of exhibits in real time, as opposed to choosing one of several pre-determined tours. MINERVA also improved on the interaction by incorporating a steerable head capable of displaying different emotional states. The GOLOG language was combined with decision-theoretic planners in DTGOLOG, which was used in the implementation of a service delivery robot [BRST00]. More recently, the robot PEARL escorted elderly people around an assisted living facility [MPR+02]. Its navigation and localization used probabilistic techniques with laser sensors. PEARL is more focused on the interaction side, with an expressive face and a speech recognition engine. One of PEARL's contributions is the use of a hierarchical partially observable Markov decision process (HPOMDP), an extension of hierarchical MDPs (HMDPs) [Die00] that models uncertain observations. HMDPs use a specified hierarchical breakdown of the domain, and introduce abstract actions in higher-level MDPs which invoke the policies of lower-level MDPs. Our framework of Multiply Sectioned MDPs (MS-MDPs) is related to HMDPs, with the important difference that MS-MDPs do not require recursive execution of policies, instead allowing multiple actions to be performed simultaneously.
When a subtask is given control in an HMDP, no other subtask can interrupt it until it completes and returns control to the level in the hierarchy that invoked it. In an MS-MDP, on the other hand, subtasks are all executed concurrently, allowing multiple behaviors in a robotic system to control the actions of the robot simultaneously. Markov Task Sets [MHK+98] also allow this type of concurrent control, but assume utility independence between subtasks. MS-MDPs use a common reward function for all subtasks, and seek collaborative solutions between behaviors. The work we present in this chapter builds upon our previous service robots, including the award-winning waiter José [EHL+02] and an initial implementation of HOMER [EHL03], in which we used a single Markov decision process (MDP) planner. Our current work extends this by adding behaviors, and by introducing MS-MDPs to make the planning more efficient.

Figure 3.1: (a) HOMER the messenger robot interacting with a person and (b) a closeup of HOMER's head.

3.2  HOMER

3.2.1  Hardware

Our robot, HOMER, shown in Figure 5.3(a), is a Real World Interface B-14 robot with a single sensor: a Point Grey Research Bumblebee™ stereo vision camera. The Bumblebee is mounted atop an LCD screen upon which is displayed a pleasant and dynamic animated face. We call the combination of Bumblebee and LCD screen the head. The face displays non-verbal invitations to humans to approach and speak, expresses emotions, and emphasizes or conveys further information. The head is mounted on a Directed Perception pan-tilt unit, as shown in Figure 5.3(b), which provides lateral and dorsal movement for the camera system and animated face, enabling visual search and following for realistic interaction. An omnidirectional microphone is the only other sensor available on HOMER, and it is used for speech recognition.
HOMER is equipped with four on-board computers based on the Intel Pentium III processor and running the Linux operating system. The computers communicate with each other over a 100Mbps local area network. A Compaq wireless 10Mbps network connection allows the robot's computers to communicate with any other computer in our lab's LAN for additional compute power as needed.

3.2.2  Software Architecture

Crucial to the design of a human-interactive mobile robot is the ability to rapidly and easily modify the robot's behavior, specifying the changes in a clear and simple language of high-level concepts. These design constraints call for a mobile robot to have a modular software system, with a planning module written in a simple and easy-to-use language for specifying new world states and goals. HOMER's software system is designed around a behavior-based architecture [Ark98]. For HOMER, a behavior is an independent software module that solves a particular problem, such as navigation or face detection; we refer to behaviors and modules interchangeably. Behaviors exist at three different levels, as shown in Figure 3.2. The lowest-level behaviors interface with the robot's sensors and actuators, relaying commands to the motors or retrieving images from the camera. These are described more fully elsewhere [EL02]. Behaviors at the middle level can be grouped into two broad categories: mobility and human-robot interaction (HRI). Mobility modules perform mapping, localization and navigation [ML00, SLL02], as well as resource (battery level) monitoring. HRI modules are for people finding and tracking, speech synthesis and recognition, and facial expression generation [EHL+02, EL02]. Middle-level modules interface with the lowest level through a shared-memory mechanism. Each middle-level module computes some aspect of the state of the environment. For example, the localization module computes the current location of the robot with respect to the robot's internal maps.
These outputs are typically reported to the highest-level modules. Each module further offers a set of possible actions the module can effect. Communication among middle- and high-level behaviors is done using network sockets.

Figure 3.2: Control Architecture (supervising planning layer: the Navigation, Dialogue and Gestures planners plus the Manager; modelling and task execution layer: the mobility and HRI modules over shared memory; perception and motor control layer: the Robot Server and Image Server).

There are four high-level modules: a manager and three planners. The manager maps between the outputs of the modules and the inputs to the three planning engines. The current outputs of all the modules (together with the manager's state) form the current state of the robot, which is divided into different sections, one for each planner. The manager's job is to integrate all the state information from the modules with its own, and present the result to the planning engines. The planning engines together implement the MS-MDP framework to select the actions to perform given the current state. The manager then delegates the actions chosen by the planners to the appropriate modules.

3.3  Management and Planning

The global state of the system is described by a vector X = (X1, . . . , XN), where Xi is the state vector for module i and N is the number of modules. The manager synthesizes this state vector by collecting information from each module, possibly compressing the probabilistic belief states reported by modules by choosing the value with maximum likelihood. In the general case, the planners would take advantage of the information contained in the belief state by using partially observable MDPs (POMDPs).
However, finding policies for POMDPs is PSPACE-hard in the general case, calling for approximate techniques in robotics applications [MPR+02]. Hierarchical methods are another way to combat the complexity of POMDPs [SK95, TRM01, PGT03]. HOMER considers the state vector to be fully observable, and uses MS-MDPs to break the domain into tractable components. After a review of MDPs, we describe our MS-MDP framework and how it is used for the message delivery task.

3.3.1  Markov Decision Processes

Markov decision processes (MDPs) have become the semantic model of choice for decision-theoretic planning (DTP) in the AI community [Put94]. They are simple for domain experts to specify, or can be learned from data. They are the subject of much current research, and have many well-studied properties, including approximate solution and learning techniques. An MDP is a tuple {S, A, Pr, R}, where S is a finite set of states and A is a finite set of actions. Actions induce stochastic state transitions, with Pr(s, a, t) denoting the probability with which state t is reached when action a is executed in state s. R(s) is a real-valued reward function, associating with each state s its immediate utility R(s). Solving an MDP means finding a mapping from states to actions. Solutions are evaluated based on an optimality criterion such as the expected total reward: an optimal solution achieves the maximum of the optimality measure, while an approximate solution comes to within some bound of the maximum. We use a factored, structured MDP solver, SPUDD [HSAHB99], which uses the value iteration algorithm to compute an optimal infinite-horizon policy of action for each state, with expected total discounted reward as the optimality criterion. SPUDD uses a representation of MDPs as decision diagrams, and is able to take advantage of structure in the underlying process to make computation more efficient and scalable to larger environments.
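SPUDD's decision-diagram machinery is beyond a short example, but the value iteration it performs can be sketched on a flat (enumerated) MDP {S, A, Pr, R}; the tiny two-state domain in the test below is purely illustrative:

```python
def value_iteration(S, A, Pr, R, gamma=0.9, eps=1e-6):
    """Solve a flat MDP {S, A, Pr, R} by value iteration.

    Pr[s][a] maps each successor state t to Pr(s, a, t); R[s] is the
    immediate utility of state s. Returns the optimal policy (a mapping
    from states to actions) and the value function, using expected
    total discounted reward as the optimality criterion. This is the
    flat counterpart of what SPUDD computes over decision diagrams,
    not SPUDD's structured algorithm itself.
    """
    V = {s: 0.0 for s in S}
    while True:
        # Bellman backup: immediate reward plus the best discounted
        # expected value over successor states.
        V_new = {s: R[s] + gamma * max(sum(p * V[t] for t, p in Pr[s][a].items())
                                       for a in A)
                 for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            V = V_new
            break
        V = V_new
    policy = {s: max(A, key=lambda a: sum(p * V[t] for t, p in Pr[s][a].items()))
              for s in S}
    return policy, V
```

For example, with states {far, home}, actions {go, stay}, and reward only at home, the computed policy at far is go, since that is the action reaching the rewarding state fastest.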
The modularity of our system makes representation as a factored MDP simple and typically results in a sparsely connected Markov network. Such sparseness leads to very efficient calculations when using a structured solution approach as in SPUDD. However, if we required simultaneous actions using a single MDP, we would need to consider all possible action combinations, implying an additional increase in the size of the state-action space. We therefore propose a framework for task coordination based on multiple MDPs, which we call Multiply Sectioned Markov Decision Processes (MS-MDPs).

3.3.2  Multiply Sectioned MDPs

An MS-MDP is a set of N MDPs, all of which share the same goal and state space, but have different action sets. We assume that the actions of each MDP do not conflict with those of the other processes, so that the action sets can be executed concurrently. As mentioned in the introduction, we do not find optimal solutions for the global MDP, but simply execute the optimal solutions from each sub-MDP simultaneously. Intuitively, we can think of each MDP as solving one aspect of the global task, coordinated with the others by a common state vector, and in this way accomplishing the common goal. Our results for the messenger robot give empirical evidence for this framework; we leave a formal proof of optimality as future work. Given that we have a factored representation of the state space, each MDP only needs to consider the state variables that directly influence its actions and reward. This implies that each MDP, Pi, will in general have a subset of the state variables spanning its local state space, Si. Further, we do not consider the effects of combined actions. These two aspects can considerably reduce the state-action space of the problem, as we show for the messenger task. At the design phase, we specify each MDP to solve one aspect of the global task.
For HOMER's message delivery task, one MDP can focus on navigation, another on the speech dialogue, and another on gesture generation. Each MDP can be specified relatively independently of the others, although the goal is the same and the designer should be aware of the other subtasks. In fact, specifying the model is an iterative procedure. The expert specifies the model for each MDP and then solves them independently to obtain the optimal policy for each. He then examines the computed policies for a few key states to make sure that the combined policy is globally consistent. If it is not, he changes one or more of the MDP models and repeats the evaluation procedure until a correct policy is found (at least with respect to the key states examined). At the execution stage, all the MDPs are executed concurrently, and the best actions for each are selected according to its policy and the current state. A final consideration is conflicts between the actions of the different MDPs. A conflict in this case is simply a constraint that would preclude two actions from being executed together. For example, the robot cannot navigate and recognize people simultaneously, because the robotic head must face in a different direction for each. Currently, we use a simple heuristic to resolve conflicts, but one could extend our approach with an arbiter that decides the best action based on its value.

3.3.3  Managing the Message Delivery Task

HOMER's message delivery task consists of accepting messages, finding recipients and delivering messages. In his quiescent state, HOMER explores the environment looking for a message sender. A potential sender can initiate an interaction with HOMER by calling his name, or by presenting herself to the robot.
HOMER asks the person for her name (the sender), the recipient's name, and the message. During the interactions, HOMER uses speech recognition, speech generation and gesture generation to communicate with people.

Table 3.1: HOMER's state variables. For each variable, we show the MDPs that include it (N = Navigator, D = Dialogue, G = Gesture).

Variable             | Description               | MDP
Has message          | has a message to deliver  | N,D,G
Receiver name        | receiver name or none     | N,D,G
Sender name          | sender name or none       | N,D,G
At location          | at location of receiver   | N,D,G
Has location         | has receiver's location   | N
Location unreachable | cannot go to location     | N
Receiver unreachable | cannot find the receiver  | N
Battery low          | battery is low            | N
Uncertain location   | uncertain about location  | N
Voice heard          | detected voice (speech)   | D,G
Person close         | detected a person         | D,G
Called Homer         | someone called its name   | D,G
Yes/No               | yes/no response           | D,G

Once HOMER has a message to deliver, he must find the recipient. This requires some model of the typical behavioral patterns of people within HOMER's workspace. We use a static map of person locations, which is updated when new information is obtained about the presence or absence of persons. This map allows HOMER to assess the most likely location to find a person at any time. Navigation to that location is then attempted. If the location is not reachable, HOMER finds another location and re-plans. If the location is reached, then HOMER attempts to find a potential receiver using face and voice detection. Upon verifying the receiver's name, HOMER delivers the message. During the entire process, HOMER will localize in the map if necessary, or go home to recharge if his battery is low. The message delivery task can be divided into three subtasks, each controlled by an MDP.
The Navigator controls the navigation and localization of the robot, the Dialogue Manager controls the interaction with people using speech, and the Gesture Generator controls the interaction with people using gestures performed by an animated face. Each MDP includes the relevant variables as its state space, and controls several behaviors through its actions. The complete state is represented by 13 variables, shown in Table 3.1. The actions for each MDP and the corresponding behaviors are shown in Table 3.2. Several behaviors do not appear in the table (person detection, voice detection, etc.); these are used to gather information by observing the world. The Dialogue Manager includes some actions that are generic, such as the ask action, which can be used to ask for the sender's name, the receiver's name and the message. The specific action is determined by the state variables, and is decided at another level; for the Dialogue MDP, it is like any other action (the transition function considers all the specific actions). We implement these generic actions as finite state machines, although they could also be represented as further MDPs in a hierarchical way.

Table 3.2: HOMER's MDPs and their corresponding actions. For each action, we indicate the module which effects it.

MDP       | Action       | Module
Navigator | explore      | navigation
Navigator | navigate     | navigation
Navigator | localize     | localization
Navigator | get new goal | location generator
Navigator | go home      | navigation
Navigator | wait         | navigation
Dialogue  | ask          | speech generation
Dialogue  | confirm      | speech generation
Dialogue  | give message | speech generation
Gesture   | neutral      | gesture generation
Gesture   | happy        | gesture generation
Gesture   | sad          | gesture generation
Gesture   | angry        | gesture generation

The goal of the message delivery task is encoded in the reward function: a small reward for receipt of a message, a big reward for message delivery, and a negative reward for a low battery. The Dialogue and Gesture planners only include rewards for message receipt and delivery, while the Navigator includes all three.
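The concurrent execution scheme of Section 3.3.2, with the shared reward just described and Navigator priority for resolving conflicts, can be sketched as follows (variable names, reward magnitudes, and the policy/state representations are illustrative assumptions; only the reward ordering and the priority heuristic come from the text):

```python
def reward(state, planner):
    """Sketch of the shared reward: all planners reward message receipt
    and delivery; only the Navigator is penalized for a low battery.
    Magnitudes are assumed; the text fixes only small/big/negative."""
    r = 1.0 if state.get("has_message") else 0.0           # small: receipt
    r += 10.0 if state.get("message_delivered") else 0.0   # big: delivery
    if planner == "navigator" and state.get("battery_low"):
        r -= 5.0                                           # negative: battery
    return r

def execute_step(state, policies, conflicts):
    """Consult every MDP's precomputed policy on the shared state and
    execute the chosen actions concurrently, dropping any action that
    conflicts with the Navigator's choice (the simple priority
    heuristic described in the text)."""
    chosen = {name: policy(state) for name, policy in policies.items()}
    nav = chosen.get("navigator")
    return {name: act for name, act in chosen.items()
            if name == "navigator" or (nav, act) not in conflicts}
```

With this scheme, if the Navigator decides to go home on a low battery, a conflicting Dialogue action such as starting an interaction is simply not executed that step, while non-conflicting Gesture actions proceed.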
We solved the three MDPs using SPUDD and generated the optimal policy for each. During concurrent execution of the policies, potential conflicts are avoided by simply giving priority to the Navigator. Thus, if HOMER is navigating to a location, such as home, he does not stop for an interaction. Our current work involves using an arbiter to resolve these conflicts more generally.

3.4  Experiments

To validate our approach, we ran several experiments with our robot. Each experiment involves the robot receiving and delivering a message, visiting locations as necessary. Initially, we performed a guided exploration task in order to build all the necessary maps for navigation and localization. We also manually specified a list of possible users and the areas they most likely inhabit. HOMER then ran autonomously for the message delivery task.² Figure 3.3 shows key frames from one message delivery run. Some of the key variable values are shown, as well as the actions taken by each planner (Navigator, Dialogue, Gestures). A brief description is also given.

²Videos of three message delivery runs are available at http://www.ubc.ca/~elinas/homer2.html.

3.5  Conclusion and Future Work

In this chapter, we introduced MS-MDPs as an efficient approach to the design and implementation of socially interactive robots. This technique allows the partition of a robot's task into a number of subtasks, each assigned to an MDP, such that each one can be specified and solved independently. At run-time, all MDPs execute concurrently, coordinated implicitly by common state variables. We validated the MS-MDP framework by using it to design our robot HOMER, which approaches the problem of message delivery among people in a workplace. In the future, we wish to analyze the optimality of MS-MDP policies in order to establish theoretical bounds, and to investigate conflict resolution.
It would also be worthwhile to extend HOMER's current model such that it can engage in more complex interactions. In the next Chapter, we remove the assumption that the world state is fully observable and study the use of the Partially Observable Markov Decision Process (POMDP) framework for behavior coordination. We also present a model that properly handles the multi-step actions available to the planner within our behavior-based robot control architecture.

Key frame 1. Variables changed: person close = true, voice heard = true. Actions: N: -, D: ask, G: smile. Description: Person approached. HOMER initiated conversation by asking the person's name, smiled encouragement.

Key frame 2. Variables changed: has message = true, has location = true, voice heard = false, sender name = 4, receiver name = 1. Actions: N: navigate, D: mute, G: neutral. Description: HOMER received the message and recipient, and got a location from the location planning module. The Navigator began moving towards that location. The Dialogue Manager was silent.

Key frame 3. Variables changed: person close = false, uncertain loc = true. Actions: N: localize, D: -, G: -. Description: During navigation, HOMER's position became uncertain. The Navigator MDP localizes. The state of the other two MDPs is not affected by the variable uncertain loc.

Key frame 4. Variables changed: at location = true, voice heard = true, yes/no = yes. Actions: N: wait, D: deliver message, G: smile. Description: HOMER reached the target location, detected a person and confirmed the correct recipient (through yes/no). Message delivered with a smile.

Key frame 5. Variables changed: has message = false, at location = false, battery low = true. Actions: N: go home, D: -, G: -. Description: HOMER's battery signaled as low; the Navigator MDP returned HOMER to home.

Figure 3.3: Example message delivery run, showing key frames, variables that change or are significant, actions taken by each planner (Navigator, Dialogue, Gesture), and a description of what happened.

Bibliography

[Ark98] R. C. Arkin. Behavior-based Robotics. MIT Press, 1998.

[BCF+98] W. Burgard, A. B. Cremers, D. Fox, D. Hahnel, G. Lakemeyer, D. Schulz, W. Steiner, and S. Thrun.
The interactive museum tour-guide robot. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI '98), Madison, Wisconsin, July 1998.

[BRST00] C. Boutilier, R. Reiter, M. Soutchanski, and S. Thrun. Decision-theoretic, high-level agent programming in the situation calculus. In Proceedings of the AAAI National Conference on Artificial Intelligence (AAAI '00), 2000.

[Die00] T. G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.

[EHL+02] P. Elinas, J. Hoey, D. Lahey, J. Montgomery, D. Murray, S. Se, and J. J. Little. Waiting with Jose, a vision based mobile robot. In Proc. of the IEEE International Conference on Robotics & Automation (ICRA '02), Washington, D.C., May 2002.

[EHL03] P. Elinas, J. Hoey, and J. J. Little. Human oriented messenger robot. In Proc. of AAAI Spring Symposium on Human Interaction with Autonomous Systems, Stanford, CA, March 2003.

[EL02] P. Elinas and J. J. Little. A robot control architecture for guiding a vision-based mobile robot. In Proc. of AAAI Spring Symposium on Intelligent Distributed and Embedded Systems, Stanford, CA, March 2002.

[HSAHB99] J. Hoey, R. St-Aubin, A. Hu, and C. Boutilier. SPUDD: Stochastic planning using decision diagrams. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI '99), Stockholm, 1999.

[MHK+98] N. Meuleau, M. Hauskrecht, K. E. Kim, L. Peshkin, L. P. Kaelbling, T. Dean, and C. Boutilier. Solving very large weakly coupled Markov decision processes. In Proc. AAAI, 1998.

[ML00] D. Murray and J. J. Little. Using real-time stereo vision for mobile robot navigation. Autonomous Robots, 8:161–171, 2000.

[MPR+02] M. Montemerlo, J. Pineau, N. Roy, S. Thrun, and V. Verma. Experiences with a mobile robotic guide for the elderly. In Proceedings of the AAAI National Conference on Artificial Intelligence (AAAI '02), Edmonton, Canada, 2002.
[PGT03] J. Pineau, G. Gordon, and S. Thrun. Policy-contingent abstraction for robust robot control. In Proceedings of Uncertainty in Artificial Intelligence (UAI), pages 477–484, Acapulco, Mexico, August 2003.

[Put94] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, NY, 1994.

[SK95] R. Simmons and S. Koenig. Probabilistic robot navigation in partially observable environments. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI '95), pages 1080–1087, Montreal, Canada, 1995.

[SLL02] S. Se, D. G. Lowe, and J. J. Little. Mobile robot localization and mapping with uncertainty using scale-invariant landmarks. International Journal of Robotics Research, 21(8):735–758, August 2002.

[TBB+99] S. Thrun, M. Bennewitz, W. Burgard, A. B. Cremers, F. Dellaert, D. Fox, D. Hahnel, C. Rosenberg, N. Roy, J. Schulte, and D. Schulz. Minerva: A second-generation museum tour-guide robot. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '99), Detroit, Michigan, May 1999.

[TRM01] G. Theocharous, K. Rohanimanesh, and S. Mahadevan. Learning hierarchical partially observable Markov decision process models for robot navigation. In Proc. of the IEEE International Conference on Robotics & Automation (ICRA '01), Seoul, Korea, May 2001.

Chapter 4

Decision Theoretic Task Coordination for a Visually-guided Interactive Mobile Robot1

Modern robots are good at performing very specific tasks such as constructing maps of large environments, navigating in crowded areas and performing basic interactions with people. These robots rely on multiple cameras and range sensors such as sonar and laser in order to achieve this basic functionality. In this Chapter, we will present a robot that is capable of performing a human-interactive task using a steerable stereo camera as the primary sensor.
We can do this because we take advantage of the modularity of a distributed behavior-based robot control architecture and the expressive power of decision theoretic planning. The Partially Observable Markov Decision Process (POMDP) [Put94] model has emerged as a preferred framework for planning in robotics because of its ability to deal with the uncertainty inherent in sensor measurements and action outcomes. Previously, POMDPs have been used to develop mobile robots capable of autonomous navigation [TRM01] and interaction with humans [MPR+02].

Figure 4.1 shows a diagram of the typical 3-layered robot control architecture. The focus of our work is on the fine interplay between the deliberative/supervisory and executive layers. The executive layer consists of independently developed robot skills such as navigation, localization, mapping, speech generation and people detection. The deliberative layer enables task specification and performs planning using the POMDP framework. Planning involves coordinating the robot's skills such that it completes a task specified via the POMDP's reward function.

The behavioral modules perform two critical tasks: (a) they expose abstract/multi-step actions and (b) they provide observations to the planner. We assume that all processes are synchronized using a global clock. The behavioral modules operate at a faster rate than the planner. Because of this difference, when the planner selects an action, the activated behavioral module will return a long sequence of observations before the action completes. The planner needs to perform belief updates at the behavior's rate while deciding whether the action should be allowed to continue or be interrupted.

1 A version of this chapter has been published. P. Elinas and J. J. Little, Decision Theoretic Task Coordination for a Visually-guided Interactive Mobile Robot, IEEE International Conference on Intelligent Robots and Systems (IROS), 2007.
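The dispatch-observe-update-interrupt cycle just described can be sketched as follows. The behavior name, observation values, and update rule are invented placeholders, not HOMER's actual modules; the point is only the two-rate structure, where one dispatched action yields many belief updates.

```python
# Sketch of the planner's two-rate loop: dispatch one multi-step action (a
# behavior), then perform a belief update at every dchi step as observations
# arrive, until the action finishes or the planner interrupts it.
# All behavior and observation details below are illustrative placeholders.
import random

def observe(action):
    # Placeholder sensor: the active behavior reports "done" occasionally.
    return "done" if random.random() < 0.3 else "running"

def update(belief, action, z, k):
    # Placeholder dchi-rate belief update: nudge the confidence that the
    # dispatched action has completed, based on the latest observation z.
    p = 0.9 * belief["completed"] + 0.1 * (1.0 if z == "done" else 0.0)
    return {"completed": p}

def run_action(action, max_steps, belief, should_interrupt):
    # max_steps plays the role of the Delta-X interval in dchi-size steps.
    for k in range(1, max_steps + 1):
        z = observe(action)               # behavior emits one observation
        belief = update(belief, action, z, k)
        if should_interrupt(belief):      # planner may cut the action short
            break
    return belief

belief = run_action("navigate", max_steps=10, belief={"completed": 0.0},
                    should_interrupt=lambda b: b["completed"] > 0.5)
print(belief)
```

The interruption test is where the planner's reactivity comes from: because the belief is refreshed at the faster rate, an important external event can terminate a long action before its maximum duration elapses.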
Several frameworks have been suggested for handling such temporal abstraction, including hierarchical POMDPs [PGT03b, TRM01], MAXQ [Die00], HAM [PR98] and the Options [SPS99] framework. In the work presented in this Chapter, we deal with such multi-step actions as black boxes (similar to the Options framework) and represent the evolution of the system using Dynamic Bayesian Networks (DBNs). Figure 4.2 shows a DBN representation of the dynamical system. An action is selected and dispatched at time t − 1 and remains operational for a ∆X interval. Dispatching an action means passing control to a behavior in the executive layer. The active behavior corresponding to the selected action generates observations at ∆χ intervals.

(Figure 4.1 diagram: a supervising/behavior coordination layer containing the DBN of actions, states, rewards and observations; a task execution layer with localization, mapping, navigation, speech synthesis, facial expressions, people finding and face detection modules; and a hardware abstraction layer.)

Figure 4.1: Three-layered behavior-based robot control architecture showing the deliberative, executive and functional components. The decision theoretic planner coordinates the behaviors at the executive level. Actions enable behavioral modules (solid arrow) while the latter supply observations to the planner (dotted arrow). All modules are synchronized using a global clock. See Figure 4.2 for a description of the Dynamic Bayesian Network model used at the deliberative layer.

The planner
Solving a POMDP entails finding a mapping from beliefs, i.e., distributions over hidden states, to robot actions. Finding optimal policies for POMDPs is P-SPACE hard in general but in recent work approximate solution methods that work well in practice have been proposed. One family of approximate methods includes point-based value iteration that seeks to estimate the value function and its gradient for a small but important subset of key beliefs. The Point-Based Value Iteration (PBVI) [PGT03a] and Perseus [SV05], a randomized and more efficient version of PBVI, algorithms are considered state of the art in approximate planning for POMDPs. Perseus has been shown to largely outperform PBVI and so it is the focus of our work. Perseus like PBVI estimates the value function and its gradient for a small set of key beliefs. Selecting a good set of belief points is important and the focus of current research [IPA06] but the simplest approach is to sample reachable beliefs at random by simulating trajectories according to the POMDP model. Recent approaches such as Heuristic Search Value Iteration (HSVI) [SS05] and Forward Search Value Iteration (FSVI) [SBS07] use heuristics to guide the search for key beliefs. HSVI computes an upper and lower bound to the optimal value function and selects beliefs that will close the gap between the two bounds as much as possible but it requires the solution of a linear program making it expensive to compute. FSVI improves on HSVI by using the solution of the underlying MDP to guide the search for key beliefs. FSVI is much faster than HSVI but it fails in domains that require agents to execute information gathering actions. Perseus on the other hand works with a fixed set of belief points and achieves fast performance by not  92  ∆Χ A t−1  At  A t+1  A t+2  A t+3  St  S t+1  S t+2  S t+3  S t+4 R t+3  Ot+1  Ot  Ot+2  Ot+3  O t+4  ∆χ  Figure 4.2: Dynamic Bayesian Network representation of the behavior control problem in robotics. 
The diagram shows four different types of nodes: A for actions, S for hidden states, O for observations, and R for rewards. The action/behavior selected at time t − 1 executes for a ∆X time interval while belief updates occur at ∆χ intervals as observations become available. Dashed action boxes denote that the action dispatched at an earlier time is still executing. Rewards are received when the action completes at the conclusion of the ∆X interval.

optimizing each sampled belief at every step of value iteration, but only optimizing enough points such that the value function is improved overall. In this Chapter, we show that it is possible to further improve the performance of Perseus by applying a more intelligent sampling scheme during value iteration, introducing heuristics within Perseus as opposed to during belief point selection.

The rest of this Chapter is structured as follows. First, we introduce POMDPs and our method for handling belief updates during the execution of multi-step actions. Then, we present the Perseus algorithm and compare three different heuristics for improving its performance. We show that there exists a heuristic that improves performance significantly in terms of computation time without degrading the quality of the computed policy. Finally, we present in detail the Human Oriented MEssenger Robot (HOMER) as a practical example of the applicability of our approach for constructing behavior-driven, visually-guided, interactive mobile robots.

4.1 Partially Observable Markov Decision Processes

POMDPs are a useful framework for modeling uncertainty in an agent's actions and the partial observability of the true state given noisy sensors.

4.1.1 Basics

A POMDP is defined as the tuple {A, S, Z, T, O, R}. In the POMDP model, A = {a_1, a_2, ..., a_n} is the set of actions, S = {s_1, s_2, ..., s_m} is the set of states and Z = {z_1, z_2, ..., z_k} is the set of observations.
In this work, we only consider POMDPs with discrete states, actions and observations. T is the state transition probability function, i.e., p(s'|s, a). O is the observation model, p(z|s', a), that is, the probability of observing z after executing action a and arriving at state s'. R represents the rewards the agent can receive, such that r(s, a) is the reward received if the agent executes action a and reaches state s.

Since the agent cannot observe the true state, it maintains a probability distribution, or belief, b, over all states. Every time the agent executes an action a and receives an observation z, it updates its current belief using Bayes rule

    b_t^{a,z}(s') = [ p(z|s',a) Σ_{s∈S} p(s'|s,a) b_{t−1}(s) ] / p(z|a, b_{t−1})        (4.1)

Given the POMDP model, the agent's goal is to construct a plan (or policy) maximizing future discounted reward (this is the focus of our work, but other variations exist). A policy, π, is a mapping from beliefs to actions. A policy is related to a value function, V^π : ∆ → ℝ, which is defined as

    V^π(b) = E_π[ Σ_{t=0}^{∞} γ^t r(b_t, π(b_t)) ]        (4.2)

where γ is the discount rate. Given these, one can define an optimal policy, π*, and a corresponding optimal value function, V*, that satisfies the Bellman equation

    V*(b) = max_{a∈A} [ Σ_{s∈S} r(s,a) b(s) + γ Σ_{z∈Z} p(z|a,b) V*(b^{a,z}) ]        (4.3)

We can compute the optimal value function incrementally using the value iteration algorithm. The value function is piecewise linear and convex [Son71], so it can be represented as a set of vectors, V_n = {α_1, α_2, ..., α_{|V_n|}}, such that each vector has |S| elements. With each alpha vector in the value function we associate the action that is best to execute in that part of the belief space. Given such a representation of the value function, one can compute the value of any belief using

    V_n(b) = max_{α∈V_n} b · α        (4.4)

while the policy for b is given by π(b) = a(α_b), where α_b is the alpha vector that maximizes Equation 4.4.
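As a concrete illustration of Equations 4.1 and 4.4, the following sketch implements the discrete belief update and the greedy choice over a set of alpha vectors. The tiny two-state model, the alpha vectors, and the action labels are invented for the example, not taken from HOMER's POMDP.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    # Equation 4.1: b'(s') is proportional to p(z|s',a) * sum_s p(s'|s,a) b(s),
    # normalized by p(z|a,b).
    predicted = T[a].T @ b                 # sum over s of p(s'|s,a) b(s)
    unnormalized = O[a][:, z] * predicted  # weight by observation likelihood
    return unnormalized / unnormalized.sum()

def best_action(b, alphas, actions):
    # Equation 4.4: V_n(b) = max_alpha b . alpha; return the action attached
    # to the maximizing alpha vector.
    scores = [float(b @ alpha) for alpha in alphas]
    return actions[int(np.argmax(scores))]

# Illustrative two-state model with one action and two observations.
T = {0: np.array([[0.9, 0.1],        # T[a][s, s'] = p(s'|s,a)
                  [0.2, 0.8]])}
O = {0: np.array([[0.8, 0.2],        # O[a][s', z] = p(z|s',a)
                  [0.3, 0.7]])}
b = np.array([0.5, 0.5])
b_next = belief_update(b, 0, 1, T, O)
print(b_next)  # belief mass shifts toward state 1
print(best_action(b_next, [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
                  ["wait", "navigate"]))  # prints "navigate"
```

Note that the normalizer p(z|a,b) falls out automatically as the sum of the unnormalized entries, so the updated belief always sums to one.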
Value iteration is one algorithm that can be used to compute the optimal value function. It utilizes a subroutine that applies a backup operator H to the current estimate of the value function to generate an improved one. The algorithm exits when a stopping criterion is met.

4.1.2 POMDPs for Behavior Coordination

When using a POMDP for high-level robot control, the actions available to the planner are commands that activate behaviors at the executive layer. When a behavioral module is given control, it operates for a certain period of time depending on the task it solves. For example, navigating to two different locations will take a variable amount of time depending on the robot's distance to each destination; in addition, a speech action will have a much shorter duration than a navigation action. As we explained earlier, the planner selects among the behaviors at times that do not necessarily occur at fixed intervals, ∆X, but belief updates must occur regularly at the ∆χ rate. Performing belief updates at this rate allows the planner to be more reactive to important external events. However, the updates require that the POMDP's transition model over ∆χ is known.

In our approach, an expert specifies the POMDP model defined over the ∆X
We can estimate T∆χ from T∆X as the linear combination of a mixture transition model that says the state will not change, pI , and one that says that the transition is likely to happen under the current action, p∆X . We have, p∆χ (s |s, a, k) = (1 − β)pI (s |s, a, k) + βp∆X (s |s, a, k)  (4.5)  where k is the number of ∆χ size steps since the action started executing and β is the mixing ratio for the two transition models. The value of β varies linearly between 0 and 1 as a function of the number of elapsed steps k. We have found that our choice for the value of β works well for our problem but other distributions might be more suitable in other cases depending on the actions’ temporal characteristics. Algorithm 2 Perseus: Vn+1 = HP erseus Vn ˜ to B 1: Set Vn+1 = ∅ and initialize B 2: while B = ∅ do ˜ 3: Sample a belief point b uniformly at random from B 4: Compute α = backup(b) 5: if b · α >= Vn (b) then 6: Vn+1 = Vn+1 ∪ α 7: else 8: Compute α = argmaxα∈Vn b · α 9: Vn+1 = Vn+1 ∪ α 10: end if ˜ = {b ∈ B : Vn+1 < Vn (b)} 11: Compute B 12: end while  96  4.1.3  Randomized Point-based Value Iteration (Perseus)  We can compute the optimal policy for a POMDP using the value iteration algorithm. However, computing optimal policies is prohibitive for all but the smallest problems. Approximate value iteration allows us to compute policies for large problems. It works by only considering a small subset of key belief points. These points are selected to cover the interesting and relevant parts of the belief simplex. In this section, we consider the randomized point-based value iteration (or Perseus) algorithm for solving our POMDP. We present a modification to the algorithm that reduces computation time with no loss in the policy’s quality. Perseus is the most efficient point-based method for solving a POMDP. It operates on a fixed set of belief points, B, generated by simulating random trajectories in the belief tree using the POMDP model. 
It then proceeds to apply the backup operator not to all of these points, but only to a sufficient subset of them, such that at each iteration the value function is improved or remains the same. Since most of the cost of point-based approaches comes from applying the backup operator to a belief point, Perseus's randomized backups achieve large speed increases on common test problems. The algorithm, reprinted from [SV05], is given in Algorithm 2. Steps 5–9 ensure that the number of elements in B̃ decreases at every iteration, assuring that the algorithm eventually terminates.

4.1.4 Speeding Up Perseus

There are two aspects of Perseus that can be considered for improvement. One is to find the smallest but highest quality set of beliefs to work with [IPA06]. The other is to devise a heuristic that finds the best value function with fewer backups. We focus on the latter, and more specifically on step 3 of Algorithm 2.

(Figure 4.3 plot: number of backups versus number of iterations, up to 40 iterations, for the SU, SM and HM heuristics.)

Figure 4.3: Performance of the different heuristics for belief point selection in Perseus. Results are shown as the average of 10 runs solving the message delivery factored POMDP using a randomly sampled set of 3500 belief points.

Given that the set of beliefs that Perseus receives is fixed, is it possible to select among beliefs more intelligently such that the value of all points is improved in fewer iterations? The original algorithm samples uniformly from the set of available beliefs. Instead, we explore two methods that prioritize beliefs according to a heuristic value. In each case, the heuristic ranks every belief point according to its importance with respect to the other points. The heuristics that we consider are:

• Stochastic Uniform (SU): Uniformly sample from the available set of belief points, as described in the original Perseus algorithm.
• Heuristic Myopic (HM): At the conclusion of the n-th step of value iteration, we use the current value estimate, V_n(b), of each belief point to place all points in a priority queue and optimize them in sequence. The idea is to perform backups first on those points that have shown the most increase in value so far.

• Stochastic Multinomial (SM): Instead of sampling uniformly from all beliefs, each point is weighted according to the HM metric. The normalized values are then treated as a multinomial distribution that we can sample from. The idea is to avoid committing to the points that improve the most, as in the HM approach, while still preserving the stochastic component of the original algorithm.

We evaluated the performance of the proposed heuristics and compared it to that of the original Perseus algorithm using two large POMDPs. One is our message delivery POMDP, which has 1,728 states, 9 actions and 768 observations (see Section 4.2 for a detailed description of this POMDP). The other is the well-known Tag POMDP used as a benchmark in the literature; it has 870 states, 5 actions and 30 observations. Table 4.1 shows the results for all algorithms. We sampled 3500 belief points for the message delivery problem and 20000 for Tag. The initial belief is set to a uniform distribution over all states.

Our experiments show that the SM heuristic performs fewer backups on both problems compared to both the SU and HM ones. Comparing the SM and SU methods, Perseus was 21% more efficient for the message delivery problem and 14% more efficient for Tag. Figure 4.3 shows a plot of the number of backups for every iteration, up to 40, for the larger problem. The HM method required the same number of backups as the original algorithm. However, the policy computed using the HM heuristic was not of the same quality as the others, because the algorithm focuses on optimizing a small number of beliefs that do not necessarily generalize well to all points.
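The SM heuristic's sampling step can be sketched as follows. The improvement scores are invented for the example; in Perseus they would be each point's increase in value during the last sweep of value iteration.

```python
import numpy as np

# Sketch of the SM heuristic: weight each belief point by its value
# improvement so far, normalize into a multinomial distribution, and sample
# the index of the next point to back up from that distribution.

def sample_sm(improvements, rng):
    # Clamp to non-negative and add a tiny floor so every point stays
    # reachable, preserving the stochastic component of the original Perseus.
    w = np.maximum(improvements, 0.0) + 1e-9
    p = w / w.sum()                  # normalized multinomial weights
    return rng.choice(len(p), p=p)

rng = np.random.default_rng(0)
improvements = np.array([0.1, 2.0, 0.5, 0.0])  # point 1 improved the most
counts = np.bincount([sample_sm(improvements, rng) for _ in range(2000)],
                     minlength=4)
print(counts)  # point 1 is sampled most often
```

Compared with HM's strict priority queue, the multinomial still favors the fast-improving points but never locks onto them exclusively, which matches the rationale given above for SM's better-generalizing policies.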
For the message delivery problem, running 5000 simulations, the average discounted reward for SU and SM was 1.8; in contrast, the policy computed using HM achieved a lower average reward of 1.7 over the same number of simulations. Similarly, for Tag, running 10000 simulations, the average discounted reward for SU was −6.9, for SM −6.5, and for HM the much lower −8.8. Our empirical results show that significant performance increases can be achieved in Perseus if the sampling scheme in the inner loop is something other than uniform.

          Message Delivery (3500 beliefs)    Tag (20000 beliefs)
Method    # backups    Avg Reward            # backups    Avg Reward
SU        1610         1.8                   1198         -6.9
SM        1272         1.8                   1030         -6.4
HM        1565         1.7                   2487         -8.5

Table 4.1: Comparing the original Perseus algorithm (SU) with two variants (SM and HM) that select belief points to optimize using a heuristic value.

4.2 The Human-Oriented MEssenger Robot (HOMER)

HOMER's [EHL03] message delivery task consists of accepting messages, finding recipients and delivering messages. HOMER is an RWI B14 base that operates using an on-board PIII computer along with 3 off-board P4 computers. The robot perceives its environment using a Point Grey Research Bumblebee stereo camera mounted on a Directed Perception pan-tilt unit. On the same pan-tilt unit there is an LCD screen, attached below the camera, that displays HOMER's animated face. HOMER can be seen in Figure 4.1.

In his quiescent state, HOMER remains stationary, waiting for a message sender. A potential sender can initiate an interaction with HOMER by presenting herself to the robot. HOMER asks the person for her name, the recipient's name, and the message. During the interaction, HOMER communicates with the person verbally but accepts user input via a GUI application.

Once HOMER has a message to deliver, he must find the recipient. This requires some model of the typical behavioral patterns of people within HOMER's workspace.
For each person, we maintain two searchable locations that HOMER visits in sequence as necessary. Navigation to the first location is then attempted. When HOMER is moving, the camera is tilted forward so that the robot can detect obstacles that might be in its path. If the first location is not reachable, HOMER retrieves the second location and re-plans. If the location is reached, HOMER attempts to find the potential receiver using face detection. Upon verifying the receiver's name, HOMER delivers the message. If HOMER visits both target locations and cannot find the message's intended receiver, he gives up and resets his internal state, waiting for a new message.

During the message delivery process, HOMER will localize, if necessary, using a previously learned map; we construct 3D point-landmark and 2D occupancy-grid maps using the σSLAM [ESL06] algorithm during an offline phase. Localization requires that HOMER lift his head up to look for previously mapped landmarks and localize using the σMCL [EL05] algorithm.

We model HOMER's message delivery task using a factored POMDP that includes the relevant variables as its state and observation space, and controls several behaviors through its actions. The hidden state and observation variables are shown in Figure 4.4, while the available actions are shown in Figure 4.5. The model consists of 9 state variables, 8 observation variables and 9 actions corresponding to enabling behaviors in the executive layer. Behavior modules operate at a fixed frequency of 1.0 Hz, emitting observations; at any time, we allow only one module to be in control of the robot's actuators. The goal of the message delivery task is encoded in the reward function, which specifies a big reward for message delivery and no reward for any other state.

Figures 4.6 and 4.7 show a trace of a typical run for HOMER. The robot starts by waiting for a person to initiate an interaction.
After 38 time units, a person has approached the robot and HOMER asks for his name. HOMER interacts with the person for 10 time units, receiving a message and recipient; 1 time unit in our system corresponds to 1 second of real time. Then, it retrieves a target location for the receiver and executes a navigate action. After navigating for 193 time units, HOMER reaches the target location and engages a person in dialogue in an attempt to confirm the message's recipient. Upon failure to do so, HOMER retrieves the second target location and starts navigating again. At 389 time units, the robot performs localization and then continues to navigate. Once at the second location, HOMER verifies the recipient and delivers the message. The complete run lasted 722 time units.

The example presented shows that the robot can complete its task successfully. In some test runs, we have observed that the robot may repeat a speech action, such as asking a person for his name twice in a row, until it becomes confident enough that it has identified the person correctly. This is expected, since the robot's sensors are noisy and cannot be trusted fully. In order to better evaluate the effectiveness of our approach, future work will focus on running longer trials with HOMER delivering multiple messages to a variety of people.

4.3 Conclusions

In this Chapter, we presented a visually guided, interactive mobile robot that utilizes a decision theoretic framework for behavior coordination in order to complete a complex task. We have shown that, by modeling the task using a factored POMDP and taking into account the temporal characteristics of the robot's actions, we can achieve principled coordination of the robot's behavioral modules. In addition, we have presented a heuristic that reduces the total number of backups in randomized point-based value iteration for solving large POMDPs.
In the near future, we plan to further investigate the performance of our heuristics using a variety of standard problems, including an extension to domains with continuous observations, which has been the focus of recent research [HvBPM07]. Additionally, we plan to perform user studies with our robot to better evaluate the applicability of our framework for the design and implementation of interactive mobile robots.

State Variables (shorthand: domain)                      Description
Has message (msg: none, has, del)                        has a message to deliver
Has receiver's name (hrn: true, false)                   receiver name or none
Has sender's name (hsn: true, false)                     sender name or none
Reached target location (al: true, false)                at location of receiver
Target location (tloc: none, loc, nomore)                has receiver's location
Robot lost (lost: true, false)                           the robot is uncertain about its own location
Person close (pc: true, false)                           detected a person nearby
Receiver confirmed (conf: yes, no, never)                is the person near the robot the message's receiver?
Location unreachable (lu: true, false)                   target location cannot be reached

Observation Variables (shorthand: domain)                Description
Message (omsg: true, false)                              user selected one of two messages
Sender name (sn: name, none)                             the sender's name
Receiver name (rn: name, none)                           the receiver's name
Response (rsp: yes, no, none, other)                     person's response to a confirmation action
Face visible (fv: true, false)                           face visible in the stereo images
Reached location (oal: true, false)                      noisy reached location
Location unreachable (olu: true, false)                  noisy location unreachable
Location (loc: none, val, nomore)                        receiver location retrieved from database

Figure 4.4: The components of HOMER's POMDP model, including the state and observation variables. All variables are discrete and their domains are given in this table.
Actions (shorthand: max duration)    Description
Wait (W: 1)                          do nothing
Navigate (N: 120)                    navigate to location
Localize (L: 20)                     do MCL
Get next goal (GNG: 1)               get the next location to search for the receiver
Ask sender's name (ASN: 8)           ask the sender's name
Ask receiver's name (ARN: 8)         ask the receiver's name
Ask message (AM: 8)                  ask for the message
Confirm (C: 8)                       confirm that the person in front of the robot is the message's receiver
Give message (GM: 8)                 play back the message

Figure 4.5: The actions available to HOMER, along with the maximum duration of each. An action's maximum duration is given as the number of ∆χ steps to the expected completion.

(Figure 4.6 shows the belief state as marginal distributions over the state variables, which are not recoverable from the extracted text; the observed values and selected actions are listed below.)

Time    Observed values                                                              Action
1       omsg:false sn:none rn:none rsp:none fv:false oal:false olu:false loc:none    W
38      omsg:false sn:none rn:none rsp:none fv:true oal:false olu:false loc:false    ASN
42      omsg:none sn:name rn:none rsp:other fv:true oal:false olu:false loc:none     ARN
45      omsg:false sn:none rn:name rsp:other fv:true oal:false olu:false loc:none    AM
48      omsg:true sn:none rn:none rsp:none fv:true oal:false olu:false loc:none      GNG
49      omsg:false sn:none rn:none rsp:none fv:false oal:false olu:false loc:val     N

Figure 4.6: Part 1: Trace of HOMER's POMDP during a complete run from receiving to delivering a message. We only show some of the most important steps and we mark in bold the observed values that are most relevant to the current situation. See Figures 4.4 and 4.5 for an explanation of the notation.
[Trace table: the action column lists C, GNG, N, L, C and GM at times 242, 245, 246, 389, 714 and 718; the per-step marginal distributions over the observation variables are not reproduced in this extraction.]

Figure 4.7: Part 2 (continued from Figure 4.6): Trace of HOMER's POMDP during a complete run from receiving to delivering a message. We only show some of the most important steps and we mark in bold the observed values that are most relevant to the current situation. See Figures 4.4 and 4.5 for explanation of notation.

Bibliography

[Die00] T. G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.

[EHL03] P. Elinas, J. Hoey, and J. J. Little. Human oriented messenger robot. In Proc. of the AAAI Spring Symposium, Stanford, CA, March 2003.

[EL05] P. Elinas and J. J. Little. σMCL: Monte-Carlo Localization for mobile robots with stereo vision. In Proc. of Robotics: Science and Systems (RSS '05), pages 373–380, Boston, MA, June 2005.

[ESL06] P. Elinas, R. Sim, and J. J. Little. σSLAM: Stereo vision SLAM using the Rao-Blackwellised particle filter and a novel mixture proposal distribution. In Proc. of the IEEE International Conference on Robotics & Automation (ICRA '06), pages 1564–1570, Orlando, FL, May 2006.

[HSAHB99] J. Hoey, R. St-Aubin, A. Hu, and C. Boutilier. SPUDD: Stochastic planning using decision diagrams. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI '99), Stockholm, 1999.

[HvBPM07] J. Hoey, A. von Bertoldi, P. Poupart, and A. Mihailidis.
Assisting persons with dementia during handwashing using a Partially Observable Markov Decision Process. In Proceedings of the 5th International Conference on Computer Vision Systems (ICVS '07), Bielefeld, Germany, 2007.

[IPA06] M. T. Izadi, D. Precup, and D. Azar. Belief selection in point-based planning algorithms for POMDPs. In Canadian Conference on AI, pages 383–394, 2006.

[MPR+02] M. Montemerlo, J. Pineau, N. Roy, S. Thrun, and V. Verma. Experiences with a mobile robotic guide for the elderly. In Proceedings of the AAAI National Conference on Artificial Intelligence (AAAI '02), Edmonton, Canada, 2002.

[PGT03a] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI '03), Acapulco, Mexico, 2003.

[PGT03b] J. Pineau, G. Gordon, and S. Thrun. Policy-contingent abstraction for robust robot control. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI '03), pages 477–484, 2003.

[PR98] R. Parr and S. Russell. Reinforcement learning with hierarchies of machines. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1998.

[Put94] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, NY, 1994.

[SBS07] G. Shani, R. I. Brafman, and S. E. Shimony. Forward search value iteration for POMDPs. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI '07), pages 2619–2624, January 2007.

[Son71] E. J. Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, Stanford, California, U.S.A., 1971.

[SPS99] R. S. Sutton, D. Precup, and S. P. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

[SS05] T.
Smith and R. Simmons. Point-based POMDP algorithms: Improved analysis and implementation. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI '05), Edinburgh, Scotland, 2005.

[SV05] M. T. J. Spaan and N. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195–220, 2005.

[TRM01] G. Theocharous, K. Rohanimanesh, and S. Mahadevan. Learning hierarchical partially observable Markov decision process models for robot navigation. In Proc. of the IEEE International Conference on Robotics & Automation (ICRA '01), Seoul, Korea, May 2001.

Chapter 5

σMCL: Monte-Carlo Localization for Mobile Robots with Stereo Vision¹

¹ A version of this chapter has been published. P. Elinas and J. J. Little, σMCL: Monte-Carlo Localization for Mobile Robots with Stereo Vision, Robotics: Science and Systems, 2005.

Global localization is the problem of a robot estimating its position by considering its motion and observations with respect to a previously learned map. Bayesian filtering is a general method applicable to this problem that recursively estimates the robot's belief about its current pose. Monte-Carlo localization (MCL) provides an efficient method for representing and updating this belief using a set of weighted samples/particles. Previous MCL approaches have relied on the assumption that the robot traverses a planar world, and they use a motion model that is a function of the robot's odometric hardware. Because their sensor measurements are often uninformative, they suffer from the perceptual aliasing problem [Chr92], requiring that the robot move for several meters and make many observations before its location can be established. They also demand a large number of particles in order to converge. MCL has been demonstrated to be accurate for laser-based robots, but it has failed to achieve similar results for vision-based ones.
In this Chapter, we present Monte-Carlo localization for robots with stereo vision. We call it σMCL, and it differs from previous approaches in several ways. Firstly, it is not limited to robots executing planar motion. We solve for unconstrained 3D motion (6 degrees of freedom) by decoupling the motion model from the robot's hardware, deriving an estimate of the robot's motion from visual measurements using stereo vision. Secondly, we use sparse maps of 3D natural landmarks based on the Scale Invariant Feature Transform [Low99], which is fully invariant to changes in image translation, scaling and rotation, and partially invariant to illumination changes. The choice of SIFT leads to a reduction of perceptual aliasing, enabling σMCL to converge quickly after the robot has traveled only a short distance. Finally, our method is more accurate than other constrained vision-based approaches and only requires a small number of particles.

In comparison, Thrun et al. [TFBD00] introduce MCL and study its performance for planar, laser-guided robots that utilize 2D occupancy grid maps [ME85]. They also demonstrate it for a robot with vision that relies on an image-based mosaic of a building's ceiling, but fail to match the accuracy of their laser-based approach [DBFT99]. Wolf et al. [WBB02] implement MCL for a vision-guided robot that uses an image retrieval system based on invariant features, but also need 2D occupancy grid maps for visibility computations. Their system requires the storage of a large database of images along with the metric maps.

Recently, image-based approaches have been proposed that do not require metric maps but operate using collections of reference images and their locations. These methods combine filtering with image-based localization [SD03] in a more general setting than originally proposed by [WBB02]. Variations exist for different choices of image descriptors and associated similarity metrics. Menegatti et al.
[EMI04] represent images using the Fourier coefficients of their lower frequency components, and define a simple similarity metric using the Euclidean distance of these coefficients. Gross et al. [GKSB03], in a similar approach, use the Euclidean distance of the mean RGB values of images as a similarity metric and employ a luminance stabilization and color adaptation technique to improve matching accuracy. Ulrich et al. [UN00] represent images using color histograms in the normalized RGB and HLS color spaces, and study the performance of different similarity metrics for their histogram representation. Kröse et al. [KBH+04] perform Principal Component Analysis on images and store the first 20 components; they match images by comparing the Euclidean distance of these components smoothed by local Gaussian kernels, and for improved efficiency they implement an approximate nearest-neighbour approach using the kd-tree data structure. Röfer et al. [RJ03] propose a variant that focuses on fast feature extraction; their approach is limited to very small environments and depends on color-based landmarks suitable only for the robot soccer domain.

The rest of this Chapter is structured as follows. We begin with an overview of Bayesian filtering and its application to robot localization, leading to MCL. We describe the acquisition of maps and then present the main elements of σMCL, namely its vision-based motion and observation models. We provide experimental results to demonstrate its accuracy, compare it with other vision-based MCL methods, and show that it performs better. Finally, we conclude and suggest directions for future work.

5.1 Bayesian Filtering with Particle Filters

Our goal is to estimate the robot's position and orientation at time t, denoted by st. There are 3 degrees of freedom for the position (x, y and z) and 3 for the orientation (roll, pitch and yaw). That is, st is a 6-dimensional vector.
The state evolves according to p(st|st−1, ut), where ut is a control signal, most often an odometry measurement. Evidence, denoted by yt, is conditionally independent given the state (Markov assumption) and distributed according to p(yt|st). Bayes filtering recursively estimates a probability density over the state space, given by [TFBD00]:

    p(st|y^t, u^t) = Bel(st) = α p(yt|st) ∫_{st−1} p(st|st−1, ut) Bel(st−1) dst−1    (5.1)

Its performance depends on how accurately the transition model is known and how efficiently the observation probability density can be estimated.

Particle filtering is a method for approximating Bel(st) using a set of m weighted particles, Bel(st) = {x^(i), w^(i)}, i = 1, ..., m. The system is initialized according to p(s0), and the recursive update of the Bayes filter proceeds as shown in Algorithm 3. This procedure is known as sampling-importance-resampling.

Algorithm 3 Particle Filter
  for each particle i do
    Sample from Bel(st−1) using the weighted samples, giving s^(i)_{t−1}
    Sample from the proposal distribution qt = p(st|st−1, ut), giving s^(i)_t
    Compute the importance weight w^(i) according to p(yt|s^(i)_t)
  end for
  Normalize the weights such that they add to 1.0
  Resample from the particles proportionally to their weights

It has been shown that the choice of proposal distribution is important with respect to the performance of the particle filter. An alternative proposal distribution that leads to what is called dual MCL is suggested in [TF00]. Dual MCL requires that we can sample poses from the observations (a hard problem in robotics) using the dual proposal distribution

    q̃t = p(yt|s̃t)/π(yt),  where  π(yt) = Σ_{s̃t} p(yt|s̃t)    (5.2)

In this case, the importance factors are given by [TF00]:

    w^(i) = π(yt) π(s̃^(i)_t|ut) Bel(s^(i)_{t−1})    (5.3)

MCL and dual MCL are complementary, and it is shown in [TF00] that using a mixture of the two results in superior performance compared to either of them.
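The sampling-importance-resampling loop of Algorithm 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used on our robot; the `motion_model` and `likelihood` callables are stand-ins for p(st|st−1, ut) and p(yt|st), and the 1-D example at the bottom is purely hypothetical.

```python
import math
import random

def particle_filter_step(particles, weights, u_t, y_t, motion_model, likelihood):
    """One sampling-importance-resampling update (Algorithm 3).

    motion_model(s, u) samples from p(s_t | s_{t-1}, u_t);
    likelihood(y, s) evaluates p(y_t | s_t). Both are caller-supplied stand-ins.
    """
    m = len(particles)
    # Sample from Bel(s_{t-1}) proportionally to the current weights.
    parents = random.choices(particles, weights=weights, k=m)
    # Propagate each sample through the proposal p(s_t | s_{t-1}, u_t).
    proposed = [motion_model(s, u_t) for s in parents]
    # Importance weights from the observation model, then normalize to 1.0.
    w = [likelihood(y_t, s) for s in proposed]
    total = sum(w) or 1.0
    return proposed, [wi / total for wi in w]

# Hypothetical 1-D demonstration: the pose is a scalar, the robot moves +0.5
# per step, and the observation likelihood is a Gaussian around the measurement.
random.seed(0)
motion = lambda s, u: s + u + random.gauss(0.0, 0.1)
like = lambda y, s: math.exp(-0.5 * ((y - s) / 0.2) ** 2)
particles = [random.uniform(-1.0, 1.0) for _ in range(100)]
weights = [0.01] * 100
for y in (0.5, 1.0, 1.5):
    particles, weights = particle_filter_step(particles, weights, 0.5, y, motion, like)
estimate = sum(p * w for p, w in zip(particles, weights))  # weighted mean pose
```

`random.choices` implements the weighted resampling step; a low-variance (systematic) resampler is a common alternative in practice.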
Implementing a particle filter is straightforward. One must describe the initial belief, the proposal distribution, and the observation and motion models. Until now, MCL methods have been limited to solving the simpler case of planar robots with only 3 dof. In the next few sections we will provide our solution, σMCL, for vision-based robots moving unconstrained in 3D space with 6 dof. However, first we will describe the maps we use and how we construct them, since they play a central role in our approach.

5.2 Map Construction

We use maps of naturally occurring 3D landmarks as proposed in [SLL02]. Each landmark is a vector l = {P, C, α, s, f} such that P = {X^G, Y^G, Z^G} is a 3-dimensional position vector in the map's global coordinate frame, C is the 3 × 3 covariance matrix for P, and α, s, f describe an invariant feature based on the Scale Invariant Feature Transform (SIFT) [Low99]. Parameter α is the orientation of the feature, s is its scale and f is the 128-dimensional key vector. SIFT descriptors have been shown [MS03] to outperform others in matching accuracy and as such they are a natural choice for this application. An example of a map with 32,000 landmarks is shown in Figure 5.1. We constructed it using a total of 1200 frames while the robot traveled a total distance of 25 meters.

We learn these maps using the method presented in [SLL02]: we use visual measurements to correct the odometry estimate as the robot travels, assuming that each landmark is independent, and we track each landmark's position using a Kalman filter. This approach is only useful for constructing maps of small environments [SLL02].

Figure 5.1: Bird's eye view of the 3D landmark map used for localization. Only the 3D coordinates of each landmark are shown, as dark dots. Overlayed is the path the robot followed during map construction, while the robot's position at the end of the path is shown using a green "V".
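The landmark vector l = {P, C, α, s, f} and the SIFT-key comparison used throughout this chapter can be pictured with a small record type. This is an illustrative sketch only; the field names, the linear-scan matcher and the distance threshold are our assumptions, not the thesis code.

```python
import math
from dataclasses import dataclass

@dataclass
class Landmark:
    """One map landmark l = {P, C, alpha, s, f} (field names are ours)."""
    P: tuple      # (X^G, Y^G, Z^G) position in the map's global frame
    C: list       # 3 x 3 covariance matrix of P
    alpha: float  # SIFT feature orientation
    s: float      # SIFT feature scale
    f: list       # 128-dimensional SIFT key vector

def key_distance(a, b):
    """Euclidean distance between two 128-d SIFT keys."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_match(query, candidates, threshold=0.6):
    """Nearest candidate by SIFT-key distance, or None if none is close enough.

    The threshold value here is an assumption for illustration only.
    """
    best = min(candidates, key=lambda c: key_distance(query.f, c.f))
    return best if key_distance(query.f, best.f) < threshold else None
```

A linear scan such as `best_match` is quadratic in the number of landmarks; as noted later in this chapter, a kd-tree over the keys, as in [Low99], makes this step efficient.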
5.3 Observation Function

In order to implement the Monte-Carlo algorithm we must first specify the distribution p(yt|st) that is used to compute the importance weights. Given our map representation and our visual sensor, a measurement yt consists of correspondences between landmarks in the current view and landmarks in the known map.

Let I^R_t and I^L_t denote the right and left gray scale images captured using the stereo camera at time t; the right camera is the reference camera. We compute image points of interest from both images by selecting maximal points in the scale space pyramid of a Difference of Gaussians [Low99]. For each such point, we compute the SIFT descriptor and record its scale and orientation. We then match the points in the left and right images in order to compute the points' 3D positions in the camera coordinate frame. Matching is constrained by the stereo camera's known epipolar geometry and the Euclidean distance of their SIFT keys. Thus, we obtain a set O^C = {o1, o2, ..., on} of n local landmarks such that oj = {P_oj = {X^L_oj, Y^L_oj, Z^L_oj}, p_oj = {r_oj, c_oj, 1}, C, α, s, f}, where p_oj = {r_oj, c_oj, 1} gives the image coordinates of the point and j ∈ [1..n].

An observation is defined as a set of k correspondences between landmarks in the map and the current view, yt = ∪_{1..k} {li ↔ oj}, such that i ∈ [1..m] and j ∈ [1..n], where m is the number of landmarks in the map and n is the number of landmarks in the current view. We compare the landmarks' SIFT keys in order to obtain these correspondences, just as we did during stereo matching. There are no guarantees that all correspondences are correct, but the high specificity of SIFT results in a reduced number of incorrect matches.

A pose of the camera, st, defines a transformation [R, T]_st from the camera to the global coordinate frame. Specifically, R is a 3 × 3 rotation matrix and T is a 3 × 1 translation vector.
Each landmark in the current view can be transformed to global coordinates using the well known equation

    P^G_oj = R_st P_oj + T_st    (5.4)

Using equation 5.4 and the Mahalanobis distance metric (in order to take into account the map's uncertainty), we can define the observation density by:

    p(yt|st) = exp(−0.5 Σ_{b=1}^{k} (P^{G,b}_oj − P^{G,b}_i)^T C^{−1} (P^{G,b}_oj − P^{G,b}_i))    (5.5)

5.4 Computing Camera Motion

Another essential component of the implementation of MCL is the specification of the robot's motion model, ut. In all previous work this has been a function of the robot's odometry, i.e., wheel encoders that measure the amount the robot's wheels rotate, which can be mapped to a metric value of displacement and rotation. Noise drawn from a Gaussian is then added to this measurement to take into account slippage as the wheels rotate. Such measurements, although accurate and available on all research robots, are only useful for planar motions. We want to establish a more general solution. Thus, we obtain ut measurements by taking advantage of the vast amount of research in multiple view geometry. Specifically, it is possible to compute the robot's displacement directly from the available image data, including an estimate of the uncertainty in that measurement.

Let It and It−1 represent the pairs of stereo images taken with the robot's camera at two consecutive intervals, with the robot moving between the two. For each pair of images we detect points of interest, compute SIFT descriptors for them and perform stereo matching, as described earlier in section 5.3, resulting in two sets of landmarks, Lt−1 and Lt. We compute the camera motion in two steps: first, by the linear estimation of the Essential matrix, E, and its parameterization, as described below; second, by computing a more accurate estimate, dropping the linearity assumption and using a non-linear optimization algorithm that minimizes the re-projection error of the 3D coordinates of the landmarks.
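Equations 5.4 and 5.5 translate almost directly into code. The sketch below is a pure-Python illustration under our own naming assumptions: each correspondence carries the observed landmark in camera coordinates, its matched map position, and a precomputed inverse covariance C^{-1} (the map itself stores C).

```python
import math

def mat_vec(M, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def observation_likelihood(correspondences, R, T):
    """p(y_t | s_t) as in Eqs. 5.4-5.5 (illustrative sketch).

    correspondences: list of (P_cam, P_map, C_inv) tuples, where P_cam is the
    observed landmark in camera coordinates, P_map its matched map position in
    the global frame, and C_inv the inverse of the map landmark's covariance.
    """
    d2 = 0.0
    for P_cam, P_map, C_inv in correspondences:
        # Eq. 5.4: transform the observed landmark into the global frame.
        RP = mat_vec(R, P_cam)
        diff = [RP[i] + T[i] - P_map[i] for i in range(3)]
        # Accumulate the squared Mahalanobis distance to the map landmark.
        Cd = mat_vec(C_inv, diff)
        d2 += sum(diff[i] * Cd[i] for i in range(3))
    # Eq. 5.5: the likelihood decays exponentially with the total distance.
    return math.exp(-0.5 * d2)

# A pose that explains every correspondence exactly scores 1.0.
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
perfect = observation_likelihood([((1.0, 2.0, 3.0), (1.0, 2.0, 3.0), eye)],
                                 eye, [0.0, 0.0, 0.0])
```

Because the exponent is the negated sum of squared Mahalanobis distances, the score is 1.0 for a perfectly explained observation and decays smoothly as the residuals grow.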
5.4.1 Linear Estimation of the Essential Matrix

For this part we only consider the images from the reference cameras for times t and t − 1. Using the SIFT descriptors, we obtain landmark correspondences between the two images as described earlier in section 5.3. Let the i-th such pair of landmarks be denoted l^i_t ↔ l^i_{t−1}. For the time being, we only consider the image coordinates of these landmarks, p^i_t ↔ p^i_{t−1}. Given this set of 2D point correspondences, the Essential matrix, E, is any 3 × 3 matrix that satisfies the property

    (p^i_t)^T E p^i_{t−1} = 0    (5.6)

The Essential matrix maps points from one image to lines in the other. If it is known, the camera pose, [R, T], at time t can be estimated with respect to the camera pose at time t − 1 via its parameterization.

We use the normalized 8-point algorithm to estimate E. The algorithm can be found in most modern machine vision books such as [HZ00], and so we do not repeat it here. We note that the algorithm requires a minimum of 8 point correspondences. In our experiments, we obtain an average of 100 such correspondences, allowing us to implement the robust version of the algorithm, which uses RANSAC to consider solutions for subsets of them until a solution is found with a high number of inliers.

We can compute the camera pose via the Singular Value Decomposition (SVD) of E, as described in [HZ00]. As a result, we obtain the rotation matrix R̃ and a unit size vector T̃ that denotes the camera's displacement direction but not the actual camera displacement. Although we could take advantage of the information in T̃ to guide our non-linear solution described in the next section, currently we do not, and we simply set T̃ = 0. Figure 5.2 gives two examples of the estimated epipolar geometry for forward motion and rotation.

Figure 5.2: Examples of estimating the epipolar geometry for (a) forward motion and (b) a rotation to the right. The top row shows the point correspondences between two consecutive image frames. Crosses mark the points at time t − 1 and lines point to their location at time t. The bottom row shows the epipolar lines drawn for a subset of the matched points using the estimated Essential matrix.

We use [R̃, T̃]_st to initialize the non-linear estimation of the camera pose. The advantage of the initial value is that it allows us to do outlier removal, i.e., we can remove landmark matches that do not satisfy equation 5.6. Additionally, having an initial estimate helps guide the non-linear optimization algorithm away from local minima.

5.4.2 Non-Linear Estimation of the Camera Pose

Given an initial value for the camera pose at time t with respect to the camera at time t − 1, we compute a more accurate estimate using the Levenberg-Marquardt (LM) non-linear optimization algorithm. We utilize the 3D coordinates of our landmarks and use the LM algorithm to minimize their re-projection error. Let xt be the 6-dimensional vector xt = [roll, pitch, yaw, T11, T21, T31] corresponding to a given [R, T]. Our goal is to iteratively compute a correction term χ,

    x^{i+1}_t = x^i_t − χ    (5.7)
The LM algorithm is the initial solution from the linear estimate given by [R, introduces the variable λ that controls the convergence of the solution by switching between pure gradient descent and Newton’s method. As discussed in [Low91] solving 6.10 , i.e., the normal equations, minimizes  ||Jχ − ||2 + λ2 ||χ − d||2  (5.10)  The normal equations can be solved efficiently using the SVD algorithm. A byproduct from solving 6.11 is that we also get the covariance of the solution in the inverse of J T J. Table 5.1 compares our vision-based estimate of camera motion with that of odometry. One can see that it is very accurate.  5.5  The Mixture Proposal Distribution  As we discussed in section 5.1, we perform filtering using a mixture of the MCL and dual MCL proposal distributions [TFBD00] (1 − φ)˜ qt + φqt = (1 − φ)p(yt |˜ st ) + φp(st |st−1 , ut )  120  (5.11)  x 0.00 0.00 0.00 0.12 0.11  Odometry y θ 0.00 -1.76 0.00 -2.11 0.00 2.11 0.00 0.00 0.01 0.00  x 0.01 -0.01 0.01 0.18 0.09  Vision-based Estimate y z α θ 0.00 -0.01 0.22 -1.83 -0.00 0.01 -0.01 -1.97 0.02 0.00 0.01 1.63 -0.02 0.01 -0.01 -0.13 0.01 0.01 -0.26 -0.52  β 0.05 -0.02 0.00 0.18 0.04  Table 5.1: Comparing our least-squares estimate of camera motion with odometry. Position is shown in cm and orientation is degrees.  where φ is known as the mixing ratio and we have set it to 0.80 for all of our experiments. Sampling from the transition prior, p(st |st−1 , ut ), is straightforward as all particles from time t − 1 are updated using our estimate of the camera’s motion, ut , with noise added drawn from our confidence on this estimate given by (J T J)−1 . On the other hand, sampling from the dual proposal distribution is thought of as a hard problem in robotics since we must be able to generate poses from observations. It turns out that for our choice of maps and sensor this is trivial. Let M be a map of m landmarks liG for 1 ≤ i ≤ m and N be the set of n landmarks ljL for 1 ≤ j ≤ n in the current view. 
Let yt be the current observation. Using the procedure described in section 5.4 and k random subsets of matched landmarks, x_{1:k} ⊂ X, we compute k candidate poses s̃^{(1:k)}_t. For efficiency, we only call on the non-linear estimation procedure, initializing the pose to zero translation and rotation. Even if our subset of landmarks leads to an incorrect estimate, this sample will receive a low weight given our complete set of observations, and as such we do not need to incur the penalty of the robust linear estimate described in section 5.4.1. For each sampled pose we compute p(yt|s̃^{(j)}_t) with 1 ≤ j ≤ k using equation 5.5. Because of the high quality of observations, we only need to sample as few as 100 poses to get a good approximation for p(yt|s̃t). To sample from the dual proposal, we draw a random particle from {s̃^{(1:k)}_t, w̃^{(1:k)}_t} according to the particle
A major disadvantage is that is not suitable for tracking the robot’s pose over multiple frames due to the large variance of the estimates. Additionally, a user must specify beforehand the number of samples to consider ( [SLL01] uses only 10) and a threshold value for what is a good pose. However, the former can be avoided if the adaptive RANSAC algorithm is used. Until now, no vision-based MCL method has been shown to match the accuracy of the single-frame approach. We will show in the next section that σMCL comes close to this target while considering a more general case of unconstrained motion in 3D.  122  (a)  (b)  Figure 5.3: The robot we used for our experiments (a) seen from a distance and (b) closeup of its head  5.7  Experiments  We have implemented σMCL for our mobile robot seen in Figure 5.3. It is a modified RWI B-14 quasi-holonomic base. It is equipped with a PointGrey Research Bumblebee stereo vision camera mounted on a Pan-Tilt Unit. It has an on-board Pentium-based computer and wireless connectivity. It has no other functioning sensors other than the stereo camera. For our experiments, we have used 3 different sets of images, S1 , S2 , S3 , at 320×240 resolution. S1 is the set of images that we used for constructing the map. It is a useful set because along with the images we have the robot’s pose at the time they were acquired. We use these corrected odometry estimates as a baseline for judging the accuracy of the σMCL approach. One obvious disadvantage of using this set is that we always get a really large and accurate number of landmark correspondences. In order to show that σMCL works in general, the second set of images, S2 , is an arbitrary sequence that was not used during map building. Unfortunately, both of these image sets are acquired as the robot traverses a planar path. 
To demonstrate that our solution works for non-planar motions, we have acquired a 3rd set of images, S3, in which we moved the camera by hand in a rectangular pattern off the robot's plane of motion.

An important and open question in the study of particle filters is the number of particles to use. In our experiments with S1 we found that as few as 100 particles were sufficient to achieve highly accurate results; we can get away with only a small number of particles because of good observations. Using more particles provided no improvement. For our experiments with S2 and S3 we used 500 particles; using more did not generate a noticeable improvement in the computed trajectory of the robot. These numbers are significantly smaller than those of other vision-based approaches; for example, [WBB02] uses 5000 particles.

Figures 5.4 to 5.6 show the results of global localization using S1. Figure 5.4 presents the initial belief. Figure 5.5 shows the particles after 7 frames, the robot having moved forward for about 80cm; one can see that there are several modes to the distribution. Figure 5.6 shows the samples after 35 frames, having all converged to the robot's true location. The robot has moved a total of 100cm and rotated by 20 degrees.

Figure 5.7 plots the localization error for the mean pose taken over all the particles with respect to the odometry estimate. The position error is less than 20cm and the orientation error is less than 6 degrees. The results are the average of 10 runs per frame. In comparison, [TF00] reports a localization error that varies from 100cm to 500cm. Similarly, the results presented in [WBB02] specify an error that is as large as 82cm in position and 17 degrees in orientation. In [GKSB03] the position error varies from 45cm to 71cm. Finally, [EMI04] reports that the localization error is 20cm, but this value is highly correlated with the distance between the reference images, which is also 20cm.
[TF00, GKSB03, EMI04] do not include results for the error with respect to the robot's orientation. Also shown in Figure 5.7 is the estimate of the single-frame approach using adaptive RANSAC, as described in section 5.6. It performs better than σMCL, but the generated track is noisier due to the large variance of the estimates.

Figure 5.8 shows the camera path for the image set S2. It was generated using a total of 500 images while the robot traversed a distance of 18 meters. It

Figure 5.4: An example of the evolution of particles during global localization. The top row shows the state from a top view point and the bottom row shows it in 3D. The current observation is marked using blue crosses/dots. The particles are shown in red, and the robot's true position is shown using a green "V" on the top row and a green sphere on the bottom. In this figure, we show the initial distribution of particles, p(s0).

Figure 5.5: An example of the evolution of particles during global localization (same conventions as Figure 5.4). In this figure, we show the distribution of particles after processing 7 image frames.

Figure 5.6: An example of the evolution of particles during global localization (same conventions as Figure 5.4). In this figure, we show the distribution of particles after processing 35 image frames.
took about 20 frames before the robot estimated its location, and it successfully kept track of it from then on.

[Plots omitted: localization error in meters (a) and degrees (b) versus frame number for σMCL and single-frame RANSAC.]

Figure 5.7: Plots of the localization error for σMCL and the single-frame adaptive RANSAC approach. Part (a) shows the mean error in position and (b) shows the mean error in orientation for 100 frames. Both approaches are equally accurate, but the standard deviation of σMCL is less than 1/3 that of RANSAC.

Finally, Figure 5.9 shows the results using S3. We have provided a general solution to the localization problem that can handle unconstrained motions in 3D. So far we have only demonstrated it using data from our robot, which moves on a planar surface and has limited range off this plane due to its steerable head. However, in order to demonstrate that our approach really works, we present results with an image set obtained by moving the camera around while holding it in our hand. This particular image set is very challenging, with much jitter in the camera motion. The robot lives on the x − z plane, and we obtained S3 by moving the camera on the x − y plane in an approximately rectangular pattern. In Figure 5.9, the first few frames show a large variation in the x − z plane until σMCL converges, after which the camera's path is correctly tracked along the rectangular path.

We timed the performance of our approach on a 3.2GHz Pentium 4. σMCL takes on average 1.1 seconds per frame. The time does not include computing the landmark correspondences, i.e., yt, because this is common to both approaches. In our implementation, it takes about 1 second to compute yt because we have not implemented the efficient kd-tree approach for SIFT matching as in [Low99].
The performance of σMCL is good enough for an online system even though we have not made much effort to optimize our software.

For completeness, Figure 5.10 shows the results for the kidnapped robot problem using S1. We demonstrate that the robot can quickly re-localize after a sudden loss of position, even with a strong prior estimate of its previous location. Specifically, we allowed the robot to localize itself and then transported it at frame 30. The robot localized again within 20 frames. We repeated this procedure at frame 80 and the robot once again estimated its position, within the error bounds reported earlier, after only 30 frames.

Figure 5.8: The robot's path for image set S2. The robot traveled a total distance of 18 meters.

Figure 5.9: The robot's path for image set S3. The camera's path is correctly tracked around an approximately rectangular pattern on the x−y plane, after the first few frames required for σMCL to converge.

Figure 5.10: Results for the kidnapped robot problem. We kidnapped the robot twice, at frames 30 and 80. Part (a) shows the error for the robot's position and (b) shows it for the orientation.

5.8 Conclusion

We have presented an approach to vision-based Monte-Carlo localization with a mixture proposal distribution that combines the strengths of invariant image-based landmarks and statistical techniques. Our approach decouples the sensor from the robot's body by specifying the motion model as a function of sensor measurements rather than robot odometry. As a result, we can solve for unconstrained 6-DoF motion by taking advantage of results in multiple view geometry.
We presented a number of experimental results demonstrating that our approach is robust, accurate and efficient. In the future, we plan to optimize the efficiency of our implementation with such improvements as the kd-tree approach for landmark matching [Low99]. For all our experiments we used a fixed value for the mixing ratio. It would be worth experimenting with adapting this ratio according to the variance of the two proposal distributions. In the next Chapter, we extend our algorithm to solve the simultaneous localization and mapping (SLAM) problem using the Rao-Blackwellised particle filter (RBPF) framework, similar to the FastSLAM [MTKW02] algorithm.

Bibliography

[Chr92] L. Chrisman. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92), pages 183–188, San Jose, California, 1992. AAAI Press.

[DBFT99] F. Dellaert, W. Burgard, D. Fox, and S. Thrun. Using the condensation algorithm for robust, vision-based mobile robot localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'99), Fort Collins, CO, June 1999.

[EMI04] E. Menegatti, M. Zoccarato, E. Pagello, and H. Ishiguro. Image-based Monte-Carlo localisation with omnidirectional images. Robotics and Autonomous Systems, 48(1):17–30, August 2004.

[GKSB03] H. M. Gross, A. Koenig, Ch. Schroeter, and H. J. Boehme. Omnivision-based probabilistic self-localization for a mobile shopping assistant continued. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'03), pages 1505–1511, Las Vegas, USA, 2003.

[HZ00] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.

[KBH+04] B. Krose, R. Bunschoten, S. T. Hagen, B. Terwun, and N. Vlassis. Household robots look and learn. IEEE Robotics and Automation Magazine, 11(4):45–52, December 2004.

[Low91] D. G. Lowe.
Fitting parameterized three-dimensional models to images. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 13(5):441–450, May 1991.

[Low99] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the Seventh International Conference on Computer Vision (ICCV'99), pages 1150–1157, Kerkyra, Greece, September 1999.

[ME85] H. Moravec and A. Elfes. High-resolution maps from wide-angle sonar. In Proceedings of the IEEE International Conference on Robotics and Automation, St. Louis, Missouri, March 1985.

[MS03] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, volume 2, pages 257–263, June 2003.

[MTKW02] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM: A factored solution to the simultaneous localization and mapping problem. In Proceedings of the AAAI National Conference on Artificial Intelligence, Edmonton, Canada, 2002. AAAI.

[RJ03] T. Rofer and M. Jungel. Vision-based fast and reactive Monte-Carlo localization. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA'03), pages 856–861, Taipei, Taiwan, 2003.

[SD03] R. Sim and G. Dudek. Comparing image-based localization methods. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI '03), Acapulco, Mexico, 2003.

[SLL01] S. Se, D. Lowe, and J. J. Little. Local and global localization for mobile robots using visual landmarks. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '01), Maui, Hawaii, October 2001.

[SLL02] S. Se, D. Lowe, and J. J. Little. Mobile robot localization and mapping with uncertainty using scale-invariant landmarks. International Journal of Robotics Research, 21(8):735–758, August 2002.

[TF00] S. Thrun and D. Fox. Monte-Carlo localization with mixture proposal distribution. In Proceedings of the AAAI National Conference on Artificial Intelligence, Austin, TX, 2000.
AAAI.

[TFBD00] S. Thrun, D. Fox, W. Burgard, and F. Dellaert. Robust Monte-Carlo localization for mobile robots. Artificial Intelligence, 128(1-2):99–141, 2000.

[UN00] I. Ulrich and I. Nourbakhsh. Appearance-based place recognition for topological localization. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '00), pages 1023–1029, San Francisco, CA, April 2000.

[WBB02] J. Wolf, W. Burgard, and H. Burkhardt. Robust vision-based localization for mobile robots using an image retrieval system based on invariant features. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2002.

Chapter 6

σSLAM: Stereo Vision SLAM Using the Rao-Blackwellised Particle Filter and a Novel Mixture Proposal Distribution 1

In robotics, the problem of Simultaneous Localization and Mapping (SLAM) is that of estimating both a robot's location and a map of its surrounding environment. It is an inherently hard problem because noise in the estimate of the robot's pose leads to noise in the estimate of the map and vice versa. In general, SLAM algorithms must address issues regarding sensors, map representation and robot/environment dynamics. A probabilistic framework is necessary for combining the incoming sensor measurements and robot control signals over time. There are two such frameworks popular in the SLAM community: the Extended Kalman Filter and the Rao-Blackwellised Particle Filter. We choose to study the latter since it scales better for maps with large numbers of landmarks and it handles incorrect data associations

1 A version of this chapter has been published. P. Elinas, R. Sim, and J. J. Little, σSLAM: Stereo Vision SLAM Using the Rao-Blackwellised Particle Filter and a Novel Mixture Proposal Distribution, IEEE International Conference on Robotics and Automation, 2006.
In this Chapter, we present σSLAM (where σ stands for stereo), a novel model of the RBPF applied to the SLAM problem for robots with stereo vision. Our work is unique because it uses a video stream of stereo images as the only input while it achieves results whose accuracy is comparable to those of other models that rely on laser and odometric measurements. We derive an estimate of the robot’s motion from sparse visual measurements using stereo vision and multiple view geometry techniques. We construct dense metric maps of 3D point landmarks identified using SIFT [Low99] whose invariance to changes in image translation, scaling and rotation makes it suitable for reliable, but not perfect, data association. SIFT’s invariance properties along with our novel mixture proposal distribution improve the robustness of our filter in closing large loops. We show that our model works well in environments with large changes in illumination, image blurring, glass doors and many large windows. Finally, we show that we can trivially compute 2D occupancy grids, if desired. Such grids are useful for robot path planning and obstacle avoidance. In comparison, previous work on SLAM has focused predominantly on robots that utilize 2D occupancy grid maps [ME85] and use laser sensors [EP04, TMK+ 04]. Laser has high depth resolution providing accurate measurements of landmark positions but its low perceptual resolution makes it difficult to distinguish between different landmarks. We focus on using vision as the sensing modality. Vision can be used to construct 2D occupancy grids [TSM+ 97] and as such one might believe that the application of laser-based SLAM algorithms could be used just as well with stereo vision. However, as shown in part (b) of Figure 6.4 its low depth resolution makes it very difficult to directly apply such occupancy grid methods to vision-based SLAM. Instead, we focus on learning maps of 3D natural landmarks identifiable by their appearance in images. 
We demonstrate how we can trivially construct the 2D occupancy grids as a by-product of our algorithm.

Figure 6.1: The RWI B-14 robot that we used for data collection, seen from (a) afar, and (b) a close-up view of the stereo camera.

Our work is analogous to that of [SLL02] because we learn a similar type of map. However, their approach maintains only the maximum likelihood estimate of the camera pose, predicted by robot odometry and corrected using visual information, that is, 3D points identified using SIFT. They exploit odometry information to achieve better data associations. This approach has only been shown to work for small environments and trajectories. A similar method that constructs metric maps is presented in [Bar05]. Their algorithm is a direct extension of FastSLAM [TMK+04] for visually guided robots that identify landmarks using SIFT. They construct maps similar to those in [SLL02]. They present results that are promising but preliminary, and only for a single particle, in which case their approach closely resembles the original work of [SLL02]. Some recent work has also focused on the use of monocular cameras. For example, [KBO+05] uses a particle filter with a single camera and robot odometry to map small indoor environments. They construct topological maps and identify visual landmarks using SIFT. In this case, occupancy grid maps would have to be constructed using a different sensor, e.g., laser, in which case using vision for SLAM seems unnecessary since this could be achieved with the laser data in the first place. Work on monocular SLAM that does not depend on robot odometry has been presented in [Dav03]. Their approach is mostly tailored to real-time performance. In consequence, they identify visual landmarks using salient visual features that are not necessarily distinctive enough
This approach is also based on the Extended Kalman Filter that is inefficient because the filter scales quadratically in the number of landmarks. This Chapter is structured as follows.  We first present an overview of  Bayesian filtering applied to SLAM and its RBPF approximation. We then focus on the details of our vision-based SLAM presenting our map representation, observation and motion models. We discuss our novel mixture proposal distribution and give the complete algorithm for our particle filter. We then provide experimental results that support our hypothesis in terms of robot localization accuracy and our ability to construct occupancy grid maps for navigation. We conclude by discussing directions for future work.  6.1  SLAM Using the Rao-Blackwelllized Particle Filter  Formally, let st denote the robot’s pose at time t, mt the map learned thus far and xt = {st , mt } be the complete state. Also, let ut denote a control signal or a measurement of the robot’s motion from time t − 1 to time t and zt be the current observation. The set of observations and controls from time 0 to t are denoted by z t and ut respectively. Our goal is to estimate the density p(st , mt |z t , ut ) = p(xt |z t , ut )  (6.1)  That is, we must integrate the set of observations and controls as they arrive over time in order to compute the posterior probability over the unknown state. Applying Bayes rule on Equation 6.1 and making the Markov assumption we get [Thr02] p(xt |z t , ut ) = Bel(xt ) = ηp(zt |xt )  p(xt |ut , xt−1 )p(xt−1 |z t−1 , ut−1 )dxt−1 = ηp(zt |xt )  p(xt |ut , xt−1 )Bel(xt−1 )dxt−1  where η is a normalizing constant. 140  (6.2)  Equation 6.2 allows us to recursively estimate the posterior probability of maps and robot poses if the two distributions p(zt |xt ) and p(xt |ut , xt−1 ) are given. These distributions model the observations and the state’s evolution respectively. 
For SLAM, an analytical form for Bel(x_t) is hard to obtain, and as such the Bayes filter is not directly applicable. Instead, it was noted in [TMK+04, Mur99] that if we estimate the modified posterior over the full trajectory, p(x^t | z^t, u^t), then we do not have to integrate and we can obtain a solution efficiently using a Particle Filter. Particle Filtering is a general method for approximating p(x^t | z^t, u^t) using a set of m weighted particles, {x^(i), w^(i)}_{i=1,...,m}. The system is initialized according to p(x_0), and the recursive update of the Bayes filter proceeds using a procedure known as sampling-importance-resampling [DdFG01]. A major caveat of the standard PF is that it requires a very large number of particles as the size of the state increases. Since for SLAM the state of the system includes the map, which often has tens of thousands of landmarks, the PF is not applicable from a practical point of view. The method of Rao-Blackwellization can be used to reduce the complexity of the PF by factoring the posterior [Mur99, DdFMR00]

    p(s^t, m_t | z^t, u^t) = p(s^t | z^t, u^t) ∏_k p(m_t(k) | s^t, z^t, u^t)    (6.3)

where m_t(k) denotes the k-th landmark in the map. We use a standard PF to estimate p(s^t | z^t, u^t) and an EKF for each of the k landmarks.

6.2 Vision-based SLAM

In this section we present our model for vision-based SLAM. We first describe how we represent maps, which are central to our method. Next we define observations and how we compute the observation likelihood, followed by a description of our motion model based on visual odometry. We then discuss our mixture proposal distribution and give the complete particle filter algorithm.

6.2.1 Map Representation

We construct maps of naturally occurring 3D landmarks similar to those proposed in [SLL02].
Each landmark is a vector l = {P^G, C^G, α, s, f} such that P^G = {X^G, Y^G, Z^G} is a 3-dimensional position vector in the map's global coordinate frame, C^G is the 3 × 3 covariance matrix for P^G, and α, s, f describe an invariant feature based on the Scale Invariant Feature Transform [Low99]. Parameter α is the orientation of the feature, s is its scale and f is the 128-dimensional key vector which represents the histogram of local edge orientations.

6.2.2 Observation Model

Let I_t^R and I_t^L denote the right and left gray scale images captured using the stereo camera at time t. We compute image points of interest from both images by selecting maximal points in the scale space pyramid of a Difference of Gaussians [Low99]. For each such point, we compute the SIFT descriptor and record its scale and orientation. We then match the points in the left and right images in order to compute the points' 3D positions in the camera coordinate frame. Matching is constrained by the stereo camera's known epipolar geometry and the Euclidean distance of the points' SIFT keys. Thus, we obtain a set O^C = {o_1, o_2, ..., o_n} of n local landmarks such that o_j = {P^L_{o_j} = {X^L_{o_j}, Y^L_{o_j}, Z^L_{o_j}}, p^L_{o_j} = {r_{o_j}, c_{o_j}, 1}, C^L, α, s, f}, where p^L_{o_j} = {r_{o_j}, c_{o_j}, 1} is the image coordinates of the point and j ∈ [1 ... n].

An observation is defined as a set of k correspondences between landmarks in the map and the current view, z_t = ∪_{1...k} {l_i ↔ o_j}, such that i ∈ [1..m] and j ∈ [1..n], where m is the number of landmarks in the map and n is the number of landmarks in the current view. Each local landmark either corresponds to a mapped landmark l_k, or has no corresponding landmark, denoted by the null correspondence l_∅. We compare the landmarks' SIFT keys in order to obtain these correspondences, just as we did before during stereo matching.
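The stereo back-projection that produces the camera-frame positions P^L_{o_j} above can be sketched as follows. The focal length, principal point, and baseline are made-up values for a hypothetical rectified, calibrated rig (with equal horizontal and vertical focal lengths), not the Bumblebee's actual calibration.

```python
import numpy as np

# Assumed intrinsics of a hypothetical calibrated stereo rig.
FX = 400.0             # focal length in pixels (fx = fy assumed)
CX, CY = 160.0, 120.0  # principal point in pixels
BASELINE = 0.12        # distance between the two cameras, in meters

def backproject(row_l, col_l, col_r):
    """3D position, in the left-camera frame, of a feature matched across
    a rectified stereo pair. Rows are equal by the epipolar constraint,
    so only the left row is needed."""
    disparity = col_l - col_r
    Z = FX * BASELINE / disparity    # depth from disparity
    X = (col_l - CX) * Z / FX
    Y = (row_l - CY) * Z / FX
    return np.array([X, Y, Z])

# A feature at the image centre with 24 px disparity lies 2 m straight ahead.
print(backproject(120.0, 160.0, 136.0))  # -> [0. 0. 2.]
```

The per-landmark covariance C^L would, in practice, be derived by propagating pixel-level matching uncertainty through this same mapping.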
There are no guarantees that all correspondences are correct, but the high specificity of SIFT results in a reduced number of incorrect matches.

A pose of the camera, s_t, defines a transformation [R, T]_{s_t} from the camera to the global coordinate frame. Specifically, R is a 3 × 3 rotation matrix and T is a 3 × 1 translation vector. Each landmark in the current view can be transformed to global coordinates using the well-known equation

    P^G_{o_j} = R_{s_t} P^L_{o_j} + T_{s_t}    (6.4)

Using Equation 6.4 and the Mahalanobis distance metric, we can define the observation log-likelihood, log p(z_t | m_t, s_t). Special consideration must be taken when computing this quantity, particularly where large numbers of feature observations, with significant potential for outlier correspondences, are present. We compute it by summing over the feature correspondences:

    log p(z_t | m_t, s_t) = ∑_k log p(o_k | l_k, s_t)    (6.5)

The log-likelihood of the k-th observation is given by

    log p(o_k | l_k, s_t) = −0.5 min(T_l, (P^G_{o_k} − P^G_k)^T S^{−1} (P^G_{o_k} − P^G_k))    (6.6)

where the correspondence covariance S is given by the sum of the transformed local landmark covariance C^L_{o_k} and the map landmark covariance C^G_k:

    S = R_{s_t} C^L_{o_k} R^T_{s_t} + C^G_k    (6.7)

For the null correspondence, S is assumed to be zero. The maximum observation innovation T_l is selected so as to prevent outlier observations from significantly affecting the observation likelihood. However, given the potentially large numbers of correspondences, even with a reasonable setting for T_l (in our case, 4.0), the magnitude of log p(z_t | m_t, s_t) can be such that exponentiating it results in zero values. Since particle weights are computed partially based on the observation likelihood, it is important that the computation of this quantity is robust to outliers. We have shown in previous work [SEGL05] that this can be done correctly and without loss of generality by subtracting the observation log-likelihood of the least likely particle.
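Equations 6.5–6.7 can be sketched in code as follows. The cap T_l = 4.0 follows the text; the array layout and the example pose and covariances are illustrative assumptions.

```python
import numpy as np

T_L = 4.0  # maximum observation innovation, as in the text

def obs_log_likelihood(R, T, P_obs_local, C_obs_local, P_map, C_map):
    """log p(z_t | m_t, s_t): sum of capped Mahalanobis terms (Eqs. 6.5-6.7).

    R (3x3), T (3,): camera-to-global transform for the pose s_t
    P_obs_local, C_obs_local: observed landmark positions/covariances (camera frame)
    P_map, C_map: corresponding map landmark positions/covariances (global frame)
    """
    total = 0.0
    for P_o, C_o, P_m, C_m in zip(P_obs_local, C_obs_local, P_map, C_map):
        P_g = R @ P_o + T                  # Equation 6.4
        S = R @ C_o @ R.T + C_m            # Equation 6.7
        innov = P_g - P_m
        d2 = innov @ np.linalg.inv(S) @ innov
        total += -0.5 * min(T_L, d2)       # Equation 6.6, capped against outliers
    return total

# Perfectly-aligned correspondences score 0; a gross outlier costs at most -0.5*T_L.
R, T = np.eye(3), np.zeros(3)
P = [np.array([1.0, 0.0, 2.0])]
C = [0.01 * np.eye(3)]
print(obs_log_likelihood(R, T, P, C, P, C))               # -> 0.0
print(obs_log_likelihood(R, T, P, C, [P[0] + 10.0], C))   # -> -2.0
```

The cap is what keeps a handful of wrong SIFT matches from dominating the sum: each outlier can subtract at most 0.5 T_l from the total.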
Similarly, in the same work [SEGL05], we provide details as to how a landmark is evaluated as stable enough to be added to the map.

6.2.3 Motion Model

An essential component of the implementation of the RBPF is the specification of the robot's motion model, u_t. In the vast majority of previous SLAM work, this has been a function of the robot's odometry, i.e., wheel encoders that measure how much the robot's wheels rotate, which can be mapped to a metric value of displacement and rotation. Noise drawn from a Gaussian is then added to this measurement to account for slippage as the wheels rotate. Odometric measurements of this type are limited to robots moving on planar surfaces. We want to establish a more general solution. Thus, we obtain u_t measurements by taking advantage of the vast amount of research in multiple view geometry [HZ00]. Specifically, it is possible to compute the robot's displacement directly from the available image data, including an estimate of the uncertainty in that measurement. This is often known as visual odometry [OMSM03, NNB04, CSN04, LS04, MS06].

Let I_t and I_{t−1} represent the pairs of stereo images taken with the robot's camera at two consecutive intervals, with the robot moving between the two. For each pair of images we detect points of interest, compute SIFT descriptors for them and perform stereo matching, as described earlier in section 6.2.2, resulting in two sets of landmarks, L_{t−1} and L_t. We compute the camera motion using the Levenberg-Marquardt (LM) algorithm, minimizing the re-projection error of the 3D coordinates of the landmarks [HZ00]. Let s̃_t be the 6-dimensional vector s̃_t = [roll, pitch, yaw, T_{11}, T_{21}, T_{31}] corresponding to a given [R, T]. Our goal is to iteratively compute a correction term χ,

    s̃_t^{i+1} = s̃_t^i − χ    (6.8)
For a known camera calibration matrix K,      =     0 + T) p0t − K(RPt−1        1   p − K(RP 1 + T ) t−1   t = .. ..    . .               T 0      T 1  k + T) pkt − K(RPt−1  T k  is defined as  (6.9)  Given an initial estimate for the parameters, we wish to solve for χ that minimizes , i.e.,        J    χ =   λI    λd   T T  ⇔ (J J + λI)χ = J + λd  (6.10)  where J = [ ∂∂χ0 , · · · , ∂∂χk ]T , is the Jacobian matrix, I is the identity matrix and d is an initial solution that in this case is set to zero rotation and translation. The LM algorithm introduces the variable λ that controls the convergence of the solution by switching between pure gradient descent and Newton’s method. As discussed in [Low91] solving Equation 6.10 minimizes ||Jχ − ||2 + λ2 ||χ − d||2  (6.11)  The normal equations can be solved efficiently using the SVD algorithm. A byproduct from solving Equation 6.11 is that we also get the covariance of the solution in the inverse of J T J.  6.3  The Proposal Distribution  The correct performance of a particle filter highly depends on the selection of proposal distribution. Most SLAM approaches using the RBPF depend on a proposal that is derived from the motion model using the observations to offset the difference between the proposal and target distribution. The FastSLAM 2.0 algorithm [MTKW03] employs a proposal that also takes into account the most recent 145  observation via the linearization of the motion and observation models allowing them to compute the proposal in closed form. In contrast, [KBO+ 05] uses a mixture proposal that combines hypotheses from the motion and observation models weighted according to either the motion or observation models (such a mixture proposal has been previously used for global localization and tracking [EL05, TF00]). 
However, computing the particle weights according to two different models generates an inconsistent set of hypotheses, which in [KBO+05] is not treated in a principled way. We also use a mixture proposal, generating hypotheses from both the motion model and a second distribution that depends on the latest observation and the map learned thus far. We compute the weights correctly, avoiding the problem described in [KBO+05]. Formally, our proposal distribution is given by:

    q(s_t | s_{t−1}, z_t, u_t) = φ q_global(s_t | z_t, m_{t−1}) + (1 − φ) q_m(s_t | s_{t−1}, u_t)    (6.12)

where φ is known as the mixing ratio.

Sampling from the motion model, p(s_t | s_{t−1}, u_t), is straightforward: all particles from time t−1 are updated using our estimate of the camera's motion, u_t, with added noise drawn from our confidence in this estimate, given by (J^T J)^{−1}. We construct q_global by taking advantage of the 3D geometry of our map and camera configuration. Specifically, let z_t be the current observation and m^(i)_{t−1} be the map of the i-th particle learned up to time t−1. Given the correspondences in z_t, we can compute the transformation that relates the two coordinate frames using weighted least squares. The procedure is similar to that described in section 6.2.3 except that in this case, instead of minimizing the re-projection error of the 3D points, we minimize their Euclidean distance in 3D [AHB87]. The problem with this approach is that it is sensitive to the outliers present in our data association. To overcome this problem, we employ a RANSAC-type approach where we select subsets of the point correspondences and compute a candidate robot pose for each. We have found that generating as few as 200 candidate poses, s_t^{1...200}, is sufficient for good localization given our noisy observations. For computational efficiency, we only compute the candidate poses with respect to the map of the most likely particle at time t−1.
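Once a Gaussian q_global has been fitted to the candidate poses, sampling from the mixture proposal of Equation 6.12 and weighting each sample by the target/proposal ratio can be sketched as below. This is a deliberately simplified 1D version with made-up noise levels and an example observation likelihood, not our 6-DoF implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def gauss(x, mu, sigma):
    """Gaussian density, used here for both proposal components."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def sample_mixture(s_prev, u, mu_g, sigma_g, z_lik, phi=0.5, motion_std=0.1):
    """Draw a pose from the mixture proposal of Eq. 6.12 (1D toy version)
    and weight it by the target/proposal ratio.

    s_prev: previous pose, u: visual-odometry displacement
    mu_g, sigma_g: the Gaussian q_global fitted to the candidate poses
    z_lik: function returning p(z_t | s_t) for a pose
    """
    if rng.random() < phi:
        s = rng.normal(mu_g, sigma_g)            # observation-driven component
    else:
        s = rng.normal(s_prev + u, motion_std)   # standard motion-model component
    p_motion = gauss(s, s_prev + u, motion_std)
    q = phi * gauss(s, mu_g, sigma_g) + (1 - phi) * p_motion
    w = z_lik(s) * p_motion / q                  # target / proposal ratio
    return s, w

# Toy use: odometry and the fitted global Gaussian both point near pose 1.0.
s, w = sample_mixture(0.0, 1.0, mu_g=1.0, sigma_g=0.05,
                      z_lik=lambda s: gauss(s, 1.0, 0.1))
print(s, w)
```

Because every sample is weighted by the same target/proposal ratio, hypotheses from the two components remain mutually consistent, which is exactly the property missing from the weighting scheme criticized above.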
In order to sample from q_global, we evaluate, using Equation 6.5, the probability of our latest observation given each candidate pose. We then fit a Gaussian distribution to these data points such that:

    q_global = N_global(µ, Σ) estimated from s_t^{1...200}    (6.13)

We compute the weight for the i-th particle correctly by evaluating the ratio of the target and proposal distributions,

    w^(i) = p(z_t | s_t^(i), m^(i)_{t−1}) p(s_t^(i) | s^(i)_{t−1}, u_t) / [φ q_global(s_t^(i) | z_t, m^(i)_{t−1}) + (1 − φ) p(s_t^(i) | s^(i)_{t−1}, u_t)]    (6.14)

Each of the distributions involved is a Gaussian that we have already described how to compute. One should notice that the weights for the particles are equal to the observation likelihood scaled by the ratio of the probability of the pose under the motion model to the weighted sum of the probability under the motion model and the global distribution. That is, those particles that are supported by both models are given weights that are mostly proportional to the observation probability, while those that disagree are penalized.

Finally, our empirical results show that using a constant mixing ratio tends to generate noisy robot trajectories. This is the result of the bad data associations in our observations. Also, the observations are often dominated by landmarks that were seen most recently, biasing the global distribution towards the most recently added landmarks. This potentially prohibits the closure of large loops. To correct this, we vary the mixing ratio as a function of the ratio of old landmarks to total landmarks observed at time t. An old landmark in our case is defined as one that has not been observed for longer than 2 minutes. Using a variable mixing ratio, we rely on the standard proposal in the short term and the mixture proposal for closing loops. In the next section, we give the complete algorithm for our particle filter.

6.4 The σSLAM Algorithm

Algorithm 4 gives the filter update steps for σSLAM.
The input to the algorithm is the set of particles from the previous time step, the current observation and the visual odometry estimate. It starts by computing the ratio of old landmarks to the total number of landmarks. If this ratio is larger than 30%, poses are sampled using the mixture proposal; otherwise, all poses are sampled from the motion model. In the latter case, the particle weights are simply proportional to the likelihood of the observation given by Equation 6.5. If the mixture proposal is used, then the weights are computed according to Equation 6.14. As mentioned earlier, for efficiency, we only compute q_global with respect to the most likely particle from the previous time step. If we do not wish to make this approximation, then we must move the code that estimates q_global inside the last for loop, i.e., compute it for every particle.

6.5 Experimental Results

We have implemented the σSLAM algorithm for our mobile robot seen in Figure 6.1. The robot is equipped with a PointGrey Research Bumblebee stereo vision camera mounted on a pan-tilt unit. The robot also has an on-board Pentium-based computer and no other functioning sensors. For our experiments, we manually drove the robot inside two adjacent rooms, 10 × 16 meters, in our laboratory and around a large corridor. In both cases we collected the stereo images for offline processing. For the first data set, we followed a star-like exploration pattern, starting from the middle of one room, exploring one corner at a time and returning to the center to close the loop. We collected a total of 8500 frames and processed all of them.
The total length of the trajectory

Algorithm 4 update(P_{t−1} = {x^(i), w^(i)}_{i=1,...,M}, z_t, u_t)
  if (old landmarks / total landmarks) > 30% then
    use_mixture = true
    P^{ML}_{t−1} = most likely particle from P_{t−1}
    {h^(j), h^(j)_w}_{j=1,...,N} = generatePoses(P^{ML}_{t−1}, z_t)
    for i = 1 to N do
      compute weight h^(i)_w = p(h^(i) | z_t, m^{ML}_{t−1})
    end for
    Estimate N_global(µ, Σ) from {h^(j), h^(j)_w}
  end if
  {Compute the new set of particles}
  if use_mixture == false then
    for i = 1 to M do
      sample s^(i)_t from p(s_t | s^(i)_{t−1}, u_t)
      compute w^(i)_t = p(z_t | s^(i)_t, m^(i)_{t−1})
    end for
  else
    for i = 1 to M do
      Draw a random number, rand, from a uniform distribution
      if rand < φ then
        sample s^(i)_t from p(s_t | s^(i)_{t−1}, u_t)
      else
        sample s^(i)_t from N_global(µ, Σ)
      end if
      Compute w^(i)_t using Equation 6.14
    end for
  end if
  Normalize particle weights
  Resample
  Update particle maps using z_t

Figure 6.2: Sample frames from our test sequence. A scene with (a) lots of texture that is good for localization; (b) excessive brightness; (c) blurring caused by the uneven floor surface; and (d) a person walking in front of the camera. Also, (e) and (f) are two examples of typical images from the outdoors dataset.

Figure 6.3: The localization error for the filter using just the standard proposal, the mixture proposal (with a variable mixing ratio) and the raw visual odometry estimates. Note that the robot starts at the left-most node in the plot and finishes at the right-most node.

is 110 meters. For the corridor data set, we collected and processed 2400 frames for a trajectory of over 100 meters. The data we collected is challenging from a computer vision point of view because it includes many frames with blurring caused by the uneven floor surface and people walking in front of the camera.
Also, the existence of large windows on the North, East and South walls of the rooms and at the four corners of the corridors causes large portions of some image frames to be saturated with bright light. Figure 6.2 shows a small sample of good and bad frames in our test sequences.

Part (a) of Figure 6.4 shows a top-down view of the 3D landmark map, learned using σSLAM with 500 particles for our first data set. We used a mixing ratio that varied between 0 and 50% according to the age of the observed landmarks. The map shown is that of the most likely particle at the end of the trajectory. The map has a total of 80,000 landmarks. Part (b) of Figure 6.4 also shows the occupancy grid associated with this particle, constructed using the method given in [TSM+97] and the dense stereo images returned by the stereo camera. The resolution of the grid cells is 15 × 15 cm.

In order to measure the accuracy of σSLAM, we have measured ground truth pose data for 5 different locations. The starting location is node 1, and each of the corners in the first room is labeled node 2, 3, 4 or 5. We have labeled these locations on the learned map shown in part (b) of Figure 6.4. The plot in Figure 6.3 shows the average position error at each node for the raw visual odometry, the particle filter using the standard proposal and σSLAM using the mixture proposal. One can see that the latter outperformed the other methods. In fact, the average error, over all nodes, using the mixture proposal is approximately 15 cm; using the standard proposal it is 25 cm, and using just the raw visual odometry measurements it is 159 cm. Also, using σSLAM the robot closed a large loop (25 meters long), going from node 1 to the room on the left and back to node 1, achieving accuracy approximately a factor of two better than with the standard proposal.

Part (c) of Figure 6.4 shows the 3D landmark map learned using the second data set and 500 particles. This map has approximately 22,000 landmarks.
For this experiment we also used a varying mixing ratio, as in the first experiment, but we only closed the loop once, at the end of the trajectory. The error was less than 1m for an environment with very unreliable visual odometry due to the lack of texture and excessive light at the ends of the corridors. Part (d) of Figure 6.4 shows the occupancy grid corresponding to the most likely particle at the end of the trajectory.

Figure 6.4: Examples of the maps learned with σSLAM for (top row) two adjacent rooms of size 16 × 10 meters and (bottom row) a corridor. Parts (a) and (c) show a top-down view of the maps of 3D landmarks used for localization. Shown in gray is the filter trajectory and in blue the raw visual odometry trajectory. Parts (b) and (d) show the occupancy grids constructed using the camera location estimated with σSLAM. In the grids, white pixels represent empty space, black pixels represent occupied space, and gray pixels denote unobserved areas. The grid resolution is 15 × 15cm.

Figure 6.5: Plot of the average processing time vs. frame number for 500 particles using σSLAM.

Figure 6.5 shows the processing time for each frame of the two-room image sequence running on a Pentium Xeon 3.2 GHz computer. The average time over all frames is 1.5 seconds. A large part of this, 0.35 seconds, is spent on SIFT extraction and matching. The time increases as the number of landmarks in the map increases because it takes longer to compute data associations, update the map for each particle, and copy maps between particles. For efficiency, we organize all SIFT keys using a kd-tree data structure as originally suggested in [Low99]. In addition, in order to perform efficient copying of maps between particles, we have implemented the tree data structure introduced in [TMK+04]. We should emphasize that the latter data structure is only used to efficiently copy the 3D landmark maps and not the occupancy grids.
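The kd-tree indexing of SIFT keys can be sketched as follows. This is a minimal exact kd-tree over small vectors, for illustration only; [Low99] actually uses an approximate best-bin-first search, since exact kd-tree lookup degrades badly for 128-dimensional SIFT descriptors.

```python
def build_kdtree(points, depth=0):
    """Build a kd-tree over equal-length descriptor vectors (nested dicts;
    a toy stand-in for a SIFT-key index)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {'point': points[mid], 'axis': axis,
            'left': build_kdtree(points[:mid], depth + 1),
            'right': build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    """Exact nearest neighbour: descend toward the query, then backtrack
    into the far subtree only if the current best hypersphere crosses the
    splitting plane. Returns (squared_distance, point)."""
    if node is None:
        return best
    d = sum((a - b) ** 2 for a, b in zip(node['point'], query))
    if best is None or d < best[0]:
        best = (d, node['point'])
    diff = query[node['axis']] - node['point'][node['axis']]
    near, far = ((node['left'], node['right']) if diff < 0
                 else (node['right'], node['left']))
    best = nearest(near, query, best)
    if diff * diff < best[0]:        # hypersphere crosses the split plane
        best = nearest(far, query, best)
    return best
```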
We construct the occupancy grids only to demonstrate that it is feasible; in fact, we could manage the grids efficiently by implementing the data structure introduced in [EP04], but we have not yet done so. For some frames the processing time rises well above the average; these are instances when a large number of particles, and in effect their maps, must be pruned during the resampling step because of their low likelihood.

Lastly, we have also tested σSLAM using data provided by SRI International, collected with an outdoor rover (see Figure 6.2 for example frames). Ground truth for this data set was obtained using RTK GPS. Figure 6.6 shows the map learned using σSLAM on this data set for 1182 frames. For this experiment, we used 5000 particles and the robot traveled a distance of 304 meters. The final map has a total of 5277 landmarks.

Figure 6.6: Results from the outdoor image sequence with 5000 particles using σSLAM. The visual odometry trajectory is shown in blue, the GPS ground truth data in red, and the maximum likelihood particle in black. Not all landmarks are visible in the image. The map has 5277 landmarks. The arrows point to the locations where the visual odometry was very unreliable.

Comparing the filter estimate with the ground truth for each frame, the average error over all frames is approximately 40 meters. This error is the result of two factors. First, even though our visual odometry estimate is in 3D, our particle filter assumes planar motion; this assumption is violated in this data set and, as a result, the filter estimate is inherently inaccurate. Second, most of the error in the filter is the result of very bad visual odometry estimates on two occasions, marked with arrows in Figure 6.6. The mistakes in visual odometry were the result of very noisy data associations for some frames, due to the lack of nearby landmarks and of sufficient texture for SIFT matching to work well.
We believe that if we relax the planar motion assumption in the particle filter and further optimize our visual odometry algorithm, σSLAM can be an effective algorithm for SLAM in outdoor environments.

6.6  Conclusion

We have presented σSLAM, an approach for vision-based SLAM using a stereo camera and the Rao-Blackwellised Particle Filter. We have proposed a specific model that does not use mechanical odometry information and operates with a video stream of stereo images as its only input. We derive our motion model from the sparse optical flow of image points identified using SIFT. We have proposed a novel mixture proposal distribution that allows us to robustly close large loops. We have shown that we can construct accurate metric maps of 3D point landmarks, as well as 2D occupancy grid maps from dense correlation-based stereo. Finally, we should note that our model is derived from the start for unconstrained 3D camera motion. In this chapter, we have presented mostly results for planar robot motion, for which we have ground truth data allowing us to evaluate its performance. However, we have evidence [EL05] that the system can solve for 6 degrees of freedom, and we are currently exploring methods for evaluating it more precisely. Possible applications are a wearable stereo camera and the construction of volumetric representations of indoor spaces.

Bibliography

[AHB87] K. S. Arun, T. S. Huang, and S. D. Blostein. Least-squares fitting of two 3-D point sets. IEEE Trans. Pattern Anal. Mach. Intell., 9(5):698–700, 1987.

[Bar05] T. D. Barfoot. Online visual motion estimation using FastSLAM with SIFT features. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3076–3082, Edmonton, Alberta, August 2005.

[CSN04] J. Campbell, R. Sukthankar, and I. Nourbakhsh. Techniques for evaluating optical flow for visual odometry in extreme terrain. In IEEE/RSJ Int.
Workshop on Robots and Systems (IROS-04), Sendai, Japan, October 2004.

[Dav03] A. J. Davison. Real-time simultaneous localisation and mapping with a single camera. In Proc. International Conference on Computer Vision, Nice, France, pages 1403–1410, October 2003.

[DdFG01] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer-Verlag, 2001.

[DdFMR00] A. Doucet, N. de Freitas, K. Murphy, and S. Russell. Rao-Blackwellised particle filtering for Dynamic Bayesian Networks. In Uncertainty in Artificial Intelligence, 2000.

[EL05] P. Elinas and J. J. Little. σMCL: Monte-Carlo localization for mobile robots with stereo vision. In Proceedings of Robotics: Science and Systems, Cambridge, MA, USA, 2005.

[EP04] A. I. Eliazar and R. Parr. DP-SLAM 2.0. In Proceedings of the 2004 IEEE International Conference on Robotics and Automation, New Orleans, LA, 2004. IEEE Press.

[HZ00] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge Univ. Pr., Cambridge, UK, 2000.

[KBO+05] N. Karlsson, E. Di Bernardo, J. Ostrowski, L. Goncalves, P. Pirjanian, and M. E. Munich. The vSLAM algorithm for robust localization and mapping. In IEEE Int. Conf. on Robotics and Automation (ICRA), pages 24–29, Barcelona, Spain, April 2005.

[Low91] D. G. Lowe. Fitting parameterized three-dimensional models to images. IEEE Trans. Pattern Analysis Mach. Intell. (PAMI), 13(5):441–450, May 1991.

[Low99] David G. Lowe. Object recognition from local scale-invariant features. In Int. Conf. on Computer Vision, pages 1150–1157, Corfu, Greece, September 1999.

[LS04] A. Levin and R. Szeliski. Visual odometry and map correlation. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), Washington, DC, USA, 2004.

[ME85] H. P. Moravec and A. Elfes. High resolution maps from wide angle sonar. In Proc. IEEE Int. Conf. on Robotics and Automation, pages 116–121, 1985.

[MS06] A. Milella and R. Siegwart.
Stereo-based ego-motion estimation using pixel tracking and iterative closest point. In Proceedings of the IEEE International Conference on Computer Vision Systems (ICVS 2006), Manhattan, New York, USA, 2006.

[MTKW03] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. In Proceedings of the Eighteenth Int. Joint Conf. on Artificial Intelligence (IJCAI-03), pages 1151–1156, San Francisco, CA, 2003. Morgan Kaufmann Publishers.

[Mur99] K. Murphy. Bayesian map learning in dynamic environments. In Neural Information Processing Systems (NIPS), pages 1015–1021, 1999.

[NNB04] D. Nister, O. Naroditsky, and J. Bergen. Visual odometry. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), pages 652–659, 2004.

[OMSM03] C. F. Olson, L. H. Matthies, M. Schoppers, and M. W. Maimone. Rover navigation using stereo ego-motion. Robotics and Autonomous Systems, 43(4):215, 2003.

[SEGL05] R. Sim, P. Elinas, M. Griffin, and J. J. Little. Vision-based SLAM using the Rao-Blackwellised Particle Filter. In IJCAI Workshop on Reasoning with Uncertainty in Robotics (RUR), Edinburgh, Scotland, 2005.

[SLL02] S. Se, D. G. Lowe, and J. J. Little. Mobile robot localization and mapping with uncertainty using scale-invariant visual landmarks. Int. J. Robotics Research, 21(8):735–758, 2002.

[TF00] S. Thrun and D. Fox. Monte Carlo localization with mixture proposal distribution. In Proceedings of the AAAI National Conference on Artificial Intelligence, 2000.

[Thr02] S. Thrun. Robot mapping: A survey. Technical Report CMU-CS-02-11, Carnegie Mellon University, February 2002.

[TMK+04] S. Thrun, M. Montemerlo, D. Koller, B. Wegbreit, J. Nieto, and E. Nebot. FastSLAM: An efficient solution to the simultaneous localization and mapping problem with unknown data association. Journal of Machine Learning Research, 2004.

[TSM+97] V.
Tucakov, M. Sahota, D. Murray, A. Mackworth, J. Little, S. Kingdon, C. Jennings, and R. Barman. A stereoscopic visually guided mobile robot. In Proc. of Hawaii International Conference on Systems Sciences, 1997.

Chapter 7

Conclusions

This chapter summarizes the work presented in this thesis. We begin with an overview of our major contributions towards developing visually-guided, interactive, mobile robots. We conclude with suggestions for future work extending the material presented in this thesis.

7.1  Contributions

The main contribution of this thesis is the development of a unified framework for the design and development of visually-guided, interactive, mobile robots. We have shown that by utilizing a decision-theoretic framework within a three-layered, distributed, behavior-based robot control architecture, we can implement mobile robots capable of completing complex tasks requiring interaction with a human companion. Our framework makes possible the re-use of robot skills implemented as stand-alone software modules referred to as behaviors. We first provide evidence that visually-driven, interactive, mobile robots are feasible. The robot waiter presented in Chapter 2 is such a robot, capable of completing a complex, human-interactive task in a crowded and dynamic environment. The only shortcoming of this system is its reliance on an expert to supply a policy for behavior coordination. There are no guarantees that the policy is optimal in any way beyond the hope that the expert has taken into account all possible critical situations the robot may encounter. Next, we develop methods for the principled coordination of these behaviors using a decision-theoretic framework based on the theory of factored Markov Decision Processes and their partially observable extension.
We explore methods for solving large planning problems within a decision-theoretic framework, including problem decomposition, partial observability, modeling of multi-step actions, and real-time execution. In addition, we develop important robot skills relying strictly on visual feedback; these include robot localization and mapping, face detection, and navigation with dynamic obstacle avoidance. Specifically, our contributions as presented in each chapter of this thesis are as follows.

• Chapter 2: Early in this thesis, we present our first visually-guided service robot. The robot is designed to serve hors d'oeuvres to large groups of people attending a reception. It achieves the waiting task using three cameras for people detection, navigation, localization, mapping and food supply monitoring. Our success with this robot waiter provides evidence that it is possible to develop robots that complete complex, human-interactive tasks without the time-of-flight sensors traditionally used in robotics, such as sonar and the recently popular laser scanners. The robot relies on many skills that we have implemented as independent software modules organized in a three-layered, distributed, hybrid robot control architecture. In this early design, an expert must specify a policy for the robot to execute in order to perform the task at hand.

• Chapter 3: Under the assumption of full state observability, we explore the use of decision-theoretic frameworks for the task specification and coordination of robot behaviors. In order to solve large and complex tasks, we study the option of decomposing the large problem into a number of smaller, loosely coupled ones that can be solved optimally using current methods. Then, we combine their policies to assemble a solution for the original problem.
We introduce the framework of Multiply-Sectioned Markov Decision Processes, which performs state decomposition by exploiting the independence among state variables imposed by structure in the problem's domain. Such structure can be derived from the properties of non-conflicting actions. Specifically, the MSMDP framework exploits the fact that different behavioral modules can act concurrently because they affect non-interacting parts of the state and reward variables. As a result, the optimal value function for each subproblem can be estimated separately and the value function for the large problem computed from their sum. We demonstrate the applicability of the MSMDP framework with the development of a message delivery robot solving a problem with millions of states.

• Chapter 4: This chapter addresses the issue of planning under partial state observability, inherent in dynamical systems due to noisy sensors and actuators. In addition, we study the proper modeling of multi-step actions, improving the responsiveness and robustness of our interactive, message delivery mobile robot. We utilize the Partially Observable Markov Decision Process model to handle state uncertainty. Furthermore, we propose heuristic methods for improving the computational efficiency of point-based methods for solving large POMDPs. Our empirical evaluation supports the hypothesis that heuristics do exist that can improve the performance of point-based value iteration without loss in solution quality. We demonstrate the effectiveness of our heuristic function by comparing its performance against current state-of-the-art methods on standard benchmark problems, in addition to our own message delivery POMDP.

• Chapter 5: We propose σMCL, a Monte-Carlo localization method for robots using stereo vision. We take advantage of recent work in object recognition, and more specifically the development of robust invariant features, to recognize landmarks.
In addition, we propose a new mixture proposal distribution for the particle filter that effectively solves the continuous localization and kidnapped robot (i.e., localization given a strong but incorrect prior) problems. Most importantly, σMCL works well for unconstrained 3D motion, as demonstrated by our empirical evaluation of the method tracking a camera moving in 3D.

• Chapter 6: We extend our localization algorithm presented in Chapter 5 towards solving the more difficult simultaneous localization and mapping problem. We propose σSLAM, which is based on the Rao-Blackwellised particle filter specifically adapted to work for robots equipped with stereo vision. Utilizing a new mixture proposal distribution and the same invariant features used in σMCL, we show that we can construct accurate metric maps of 3D landmarks for large indoor environments. Moreover, σSLAM operates in near real-time. Finally, we show that we can also construct accurate 2D occupancy-grid maps from stereo vision; these maps are useful for path planning and obstacle avoidance during navigation.

7.2  Future Work

This thesis develops a framework for the specification and development of visually-guided, interactive, mobile robots. We address issues in task specification, robot skill development and coordination in a principled way. In this section, we briefly examine future challenges and describe open problems worth pursuing in future research.

7.2.1  Decision Theoretic Behavior Coordination

In Chapters 3 and 4, we present our work on behavior coordination using a decision-theoretic approach based on the framework of Markov Decision Processes. The central problem in using this framework for robot control is that real-world problems are too large to solve optimally. Instead, we explore two different methods for computing approximate policies for large MDPs. In both cases, we use a factored representation for the problem, separating between state and observation variables.
Such a representation allows us to exploit conditional independencies among the model variables. Chapter 3 addresses the planning problem under the assumption of full state observability but maintains uncertainty in action outcomes. We develop the Multiply-Sectioned MDP framework, which works using a divide-and-conquer approach: we decompose a large MDP into a set of smaller, loosely coupled MDPs that can be solved optimally, and combine their policies to provide a solution for the original MDP. We assume that concurrent actions are non-conflicting and that the reward functions of the smaller MDPs are additive. These are both very restrictive assumptions that need to be addressed in future work [MLRG04]. Actions are often in conflict, as they can affect several common state variables. Assuming that the actions are non-conflicting effectively eliminates influences in the Bayesian Network representation of our model, and removing these dependencies leads to sub-optimal solutions. In future work, we must devise methods for estimating bounds on the optimality of the approximate solution with respect to the optimal one. Such bounds are hard to estimate and there is little current work in this direction [KM01].

In Chapter 4, we remove the assumption that the world state is fully observable and instead explore the use of Partially Observable Markov Decision Processes for behavior coordination. State uncertainty makes finding optimal solutions for large planning problems even harder. We explore a point-based approach for computing POMDP policies, devising heuristics that help guide the search for a good approximation to the optimal value function. In future work, we must further study the class of problems for which our heuristic can be used to improve the performance of point-based POMDP solution methods. It is also worth exploring whether other heuristics can be devised in order to solve some general class of POMDPs.
Furthermore, we explore the modeling of multi-step actions within the POMDP framework. Behavior coordination requires that we find solutions for POMDPs with such actions. We study a method that allows an engineer to specify a POMDP model describing the task at a high level; this model includes the observation and transition tables and the maximum duration of each multi-step action. We use this initial model as a guide to further discretize the POMDP into a finer Dynamic Bayesian Network, with its parameters computed from the expert's initial model. We demonstrate the effectiveness of our approach with the design of an interactive mobile robot performing a complex task, coordinating high-level behaviors using our decision-theoretic model. In future work, we must further study the properties of our method. Currently, we assume that actions have finite duration, but we should consider the extension to problems with actions that do not. In addition, our model for computing the parameters of the finer DBN is based on the assumption that the effect of actions on the world state varies linearly with the time passed. There are actions for which this assumption may not hold true; in that case, we must consider alternative action models. Finally, we should mention that in our experience there is a need for a new set of benchmark problems for testing solution methods for large POMDPs with or without multi-step actions. The set of problems used in the literature today is unfortunately too small to be considered adequate. However, we do not propose that our message delivery task is a good alternative problem.

7.2.2  Visual Simultaneous Localization and Mapping

In Chapters 5 and 6, we present our solution to the visual localization and mapping problems utilizing a stereo camera and the Rao-Blackwellised particle filter. Our method focuses mostly on mapping indoor environments in near real-time.
Our experimental results clearly demonstrate the applicability of our approach for localizing within and mapping a 3-dimensional environment. We produce accurate metric maps of 3D point-landmarks recognized by their appearance in images, and also 2D occupancy-grid maps that are useful for path planning and obstacle avoidance in navigation. Our work on visual SLAM can be extended in two ways:

• Outdoor mapping: Clearly the next step would be to extend σSLAM for use in outdoor environments [AK07, HFBT03, NCH06]. This requires a stereo camera with a longer baseline, since most landmarks are expected to be much farther away from the robot than in indoor navigation.

• Topological mapping: One method for improving the accuracy of σSLAM is to consider a hybrid mapping approach combining metric and topological SLAM [CN01, RZL+03]. For an indoor environment, a robot could construct a topological map connecting all rooms in the building and then learn accurate metric 3D and 2D maps for each room. Visual recognition of landmarks would provide hints about the location of the robot at the topological level, while accurate tracking of the position of these landmarks via stereo vision would provide an accurate estimate of the robot's location within a particular room, as described in Chapter 5. A similar approach should be studied for mapping large outdoor environments.

Bibliography

[AK07] M. Agrawal and K. Konolige. Rough terrain visual odometry. In Proceedings of the International Conference on Advanced Robotics (ICAR), Jeju Island, Korea, August 2007.

[CN01] H. Choset and K. Nagatani. Topological Simultaneous Localization and Mapping (SLAM): Toward exact localization without explicit localization. IEEE Transactions on Robotics and Automation, 17(2):125–137, April 2001.

[HFBT03] D. Hähnel, D. Fox, W. Burgard and S. Thrun.
A highly efficient FastSLAM algorithm for generating cyclic maps of large-scale environments from raw laser range measurements. In Proc. of the Conference on Intelligent Robots and Systems (IROS), Las Vegas, Nevada, USA, 2003.

[KM01] D. Koller and B. Milch. Multi-agent influence diagrams for representing and solving games. In Proceedings of the Seventeenth Int. Joint Conf. on Artificial Intelligence (IJCAI-01), pages 1027–1036, Seattle, Washington, USA, August 4-10, 2001.

[MLRG04] B. Marthi, D. Latham, S. Russell and C. Guestrin. Concurrent hierarchical reinforcement learning. In Proceedings of the AAAI National Conference on Artificial Intelligence (AAAI), San Jose, CA, USA, 2004.

[NCH06] P. Newman, D. Cole and K. Ho. Outdoor SLAM using visual appearance and laser ranging. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), pages 1180–1187, Orlando, FL, USA, May 15-19, 2006.

[RZL+03] P. Rybski, F. Zacharias, J. Lett, O. Masoud, M. Gini and N. Papanikolopoulos. Using visual features to build topological maps of indoor environments. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), pages 850–855, Taipei, Taiwan, September 14-19, 2003.

Appendix A

Face detection

This Appendix describes the face detection algorithm that we use for all interactive robots presented in this thesis. It depends on image segmentation using skin color detection, connected components analysis, and model fitting for the removal of outliers. In the following sections, we describe each step in detail.

A.1  Color Segmentation

Our method for face detection depends on the segmentation of images using skin color. We transform the RGB (Red-Green-Blue) color image captured with the Bumblebee stereo vision camera to the HSV (Hue-Saturation-Value) color space, where the color of human skin forms a small, tight cluster. The values for H are in the range 0.0 to 360.0, while S and V are in the range 0.0 to 1.0.
The probability that the i-th pixel, n_i, is skin-colored is given by Equation A.1,

p(n_i = skin | H_i, S_i, V_i) = p(n_i = skin | H_i) × p(n_i = skin | S_i) × p(n_i = skin | V_i)   (A.1)

where we assume that each color channel is independent. We use a Gaussian distribution for each of the three factors on the right-hand side of Equation A.1. More specifically,

p(n_i = skin | H_i) = N(11, 3)   (A.2)

p(n_i = skin | S_i) = N(0.5, 0.1)   (A.3)

p(n_i = skin | V_i) = N(0.5, 0.2)   (A.4)

We select the parameters for each distribution from a sample set of skin-colored pixels collected during an offline calibration phase. Calibration requires that we provide the system with 40 to 50 examples of skin pixels at the current level of illumination. To acquire the data, a person positions himself in front of the camera while another manually selects skin pixels in the color image. For each color channel, we compute the mean and standard deviation of the selected pixels to be used as the parameters of the Gaussian distributions in Equation A.1. The distribution over the values in the V channel favors those pixels that are neither too bright nor too dark. Pixels that are too bright are either the result of the imaging process, i.e., the dynamic range of the camera is not large enough to capture pixels brighter than a certain value, or the result of specularities on the surfaces of objects. Human skin in particular tends to be very specular. We have tuned our system to work in environments illuminated by ambient artificial light, aiding in the reduction of specularities. After computing for each pixel the probability of being of skin color, we apply an empirically selected threshold, resulting in a binary image of candidate skin-colored pixels. Skin color detection is far from perfect, requiring additional steps to eliminate false positives.
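The per-pixel model of Equations A.1-A.4 can be sketched directly. The Gaussians are interpreted here as densities with the stated mean and standard deviation; the empirically chosen acceptance threshold from the text is not reproduced.

```python
import math

def gauss(x, mu, sigma):
    """Gaussian density N(mu, sigma) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def skin_probability(h, s, v):
    """Skin likelihood of Equation A.1 for one HSV pixel, using the
    calibrated channel Gaussians of Equations A.2-A.4: N(11, 3) for H
    (degrees), N(0.5, 0.1) for S and N(0.5, 0.2) for V, with the three
    channels treated as independent."""
    return gauss(h, 11.0, 3.0) * gauss(s, 0.5, 0.1) * gauss(v, 0.5, 0.2)
```

A pixel near the calibrated skin cluster, e.g. `skin_probability(11.0, 0.5, 0.5)`, scores orders of magnitude higher than one far from it in any single channel, which is what makes a simple threshold on this product usable.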
False positives are pixels that have been marked as skin even though they are not; for example, a brown box in the background could fall in the same color range as human skin and so be falsely included in the segmented image. We perform connected components analysis on the binary image resulting from the skin-color segmentation in order to eliminate such false positives.

A.2  Filtering and Connected Components Analysis

Before computing the connected components, we apply a median filter to the binary image. Applying a median filter means that we count the set pixels in the local neighborhood of each pixel; if these outnumber the pixels that are not set, the center pixel is set, and vice versa. Filtering a binary image this way eliminates noise pixels that are incorrectly set during color segmentation. Through experimentation, we concluded that the best results are obtained using a 7 × 7 filter.

Next, we compute the connected components. We scan the image and mark all neighboring pixels as belonging to the same region. We look in the 4-neighborhood of each pixel; the four neighbors of a pixel are the pixels north, south, east and west of it. Once all the components are identified, we apply an area threshold to eliminate the smallest of them. We expect that the person will be within a few meters of the robot, so his/her face can be no smaller than a certain size. From our experiments we set this size to 49 pixels, that is, a 7 × 7 pixel region. Any component with an area smaller than 49 pixels is deleted as background noise.

A.3  Component Selection

The connected components analysis outlined in the previous section returns a set of n components C = {c_1, c_2, ..., c_n}. Given C, we need to select one component as the user.
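The filtering and labeling steps of Section A.2 can be sketched in a few lines of dependency-free Python, with nested lists of 0/1 standing in for the binary image buffer; border handling and stack-based labeling are implementation choices made here, not taken from the thesis.

```python
def majority_filter(img, k=7):
    """7x7 median (majority-vote) filter on a binary image; border pixels
    use whatever part of the window fits inside the image."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            out[y][x] = 1 if sum(window) * 2 > len(window) else 0
    return out

def connected_components(img, min_area=49):
    """4-connected component labeling; components smaller than min_area
    (a 7x7 pixel region, as in Section A.2) are discarded as noise.
    Returns {label: [(y, x), ...]}."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    comps, next_label = {}, 1
    for y in range(h):
        for x in range(w):
            if img[y][x] and not labels[y][x]:
                stack, pixels = [(y, x)], []
                labels[y][x] = next_label
                while stack:                      # iterative flood fill
                    cy, cx = stack.pop()
                    pixels.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = next_label
                            stack.append((ny, nx))
                if len(pixels) >= min_area:
                    comps[next_label] = pixels
                next_label += 1
    return comps
```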
For each component we compute

p(c_i = face | A_i, D_i) = p(c_i = face | A_i) × p(c_i = face | D_i)   (A.5)

where A_i is the area of the component and D_i is the distance of the component from the camera. The component with the highest probability computed using Equation A.5, provided it is above 0.5, denotes the user. The two distributions on the right-hand side of Equation A.5 are Gaussian distributions with parameters set as explained next.

The first distribution, p(c_i = face | A_i), is the probability that a component is generated by the presence of a person's face in the image, given its area. The area of a component equals the number of pixels in it. The face of a person looking at the camera traces on the image plane a rectangular region of skin-colored pixels. We measured the average size of a person's head to be approximately 11 × 15 ± 2cm. We determine the distance of each component from the camera as the mean of the distances of all the pixels belonging to the component. Then, given the known camera parameters, we project the face model to the image plane and estimate its area in pixels. The probability that a component is generated by a user looking at the camera is given by Equation A.6,

p(c_i = face | A_i) = N(A_m, 4)   (A.6)

where A_m is the area of the model face projected to the image plane.

The second factor in Equation A.5 favors those components that are at an optimal distance from the robot. It is reasonable to expect that people interacting with the robot will stand 1-3 meters away from it. People that are too far from the robot are not likely to be engaged in a dialogue with it. In addition, social norms dictate that people do not stand too close to someone with whom they are interacting. We capture this assumption using a Gaussian distribution with a mean of 2m and a standard deviation of 0.5m,

p(c_i = face | D_i) = N(2, 0.5)   (A.7)

Figure A.1 shows an example of face detection in the presence of noise.
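The component scoring of Equations A.5-A.7 can be sketched as follows. The focal length and the area standard deviation used below are made-up illustration values (the thesis uses the calibrated camera parameters and N(A_m, 4)); unnormalized Gaussians keep scores in [0, 1] so the 0.5 acceptance threshold can be applied.

```python
import math

def select_face(components, focal_px=500.0, face_cm=(11.0, 15.0)):
    """Score each component with Equation A.5 and return the most likely
    face, or None if no score exceeds 0.5.

    components: list of dicts with 'area' (pixels) and 'distance' (metres,
    the mean stereo depth of the component's pixels).
    """
    def g(x, mu, sigma):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2)   # unnormalized
    best, best_p = None, 0.0
    for c in components:
        d = c['distance']
        # project the 11 x 15 cm face model to the image plane at depth d
        w_px = focal_px * (face_cm[0] / 100.0) / d
        h_px = focal_px * (face_cm[1] / 100.0) / d
        a_m = w_px * h_px                                 # A_m of Eq. A.6
        p = g(c['area'], a_m, 0.25 * a_m) * g(d, 2.0, 0.5)  # Eq. A.5
        if p > best_p:
            best, best_p = c, p
    return best if best_p > 0.5 else None
```

A component of roughly face-sized area at about 2 m wins over a much larger blob at the same depth or a face-sized blob far from the robot, mirroring the desk-versus-person example below.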
The desk next to the person has a color that falls within the range of skin tone. The figure shows each of the major steps in the algorithm described in Sections A.1, A.2 and A.3. The connected components analysis and shape constraint correctly eliminate the desk as a less likely face component; the person is then correctly identified as the component in the middle of the image.

Our face detection method depends heavily on a model of skin color and a size constraint. If these two components are modified, then other objects can also be identified. Figure A.2 shows an example of the face detection algorithm operating with a different color model and the same size model as before. In this case, we have trained our method to identify the blue recycling bins. Notice that color segmentation correctly identifies both bins, but the connected components analysis and the size constraint prefer the bin in the top-right corner of the image. The other bin is too large to satisfy the size constraint.

Figure A.1: Face detection example showing the major steps of our approach and demonstrating robustness in the presence of skin-colored objects such as a wooden desk: (a) the original color image, (b) the binary image resulting from color segmentation, where white pixels denote skin regions, (c) the connected components computed using the binary image from the previous step, and (d) the original image with the most likely face component marked with a rectangle.

Figure A.2: Object detection example demonstrating the use of our face detection system operating with a different color scheme.
The figure shows the major steps of our approach: (a) the original color image, (b) the binary image resulting from color segmentation, where white pixels denote regions that match our color model, (c) the connected components computed using the binary image from the previous step, and (d) the original image with the most likely object component marked with a rectangle.

Appendix B

Navigation

The navigator consists of two components. The first performs high-level path planning within the global 2D occupancy-grid map constructed offline using our visual SLAM algorithm described in Chapter 6; the second implements dynamic collision avoidance and path following using a POMDP. The two work together to move the robot safely to a target location. For navigation, we assume that the robot is well localized in the maps; we presented our approach to robot localization in Chapters 5 and 6. In the following sections, we describe each of the two navigation components in more detail and present empirical results to demonstrate the capabilities of our navigation module.

B.1 Path Planning

We perform path planning in the 2D occupancy-grid map, from the current robot location to the desired goal location. The plan is represented as a list of waypoints chosen so that the robot remains a safe distance from any obstacles represented in the map while following the shortest route to the target location. The planner uses breadth-first search to find a safe path; this method may be computationally inefficient for planning in large environments, but it is sufficient for our experiments with the waiter and message-delivery robots. We construct the occupancy grid using stereo data during an offline mapping procedure that uses the σSLAM algorithm for robot localization, as described in Chapter 6. An example of such a map, along with a computed path, is shown in Figure B.5.
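The planner described above can be sketched as a plain breadth-first search over the grid. This is a minimal illustration (4-connected neighborhood, no safety margin around obstacles, no waypoint thinning), not the thesis implementation.

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Breadth-first search over a 2D occupancy grid.

    grid[r][c] == 0 means free space; returns a list of (row, col)
    waypoints from start to goal, or None if the goal is unreachable."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:          # walk parents back to start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(bfs_path(grid, (0, 0), (2, 0)))   # routes around the wall of occupied cells
```

Because BFS expands cells in order of distance from the start, the first time the goal is dequeued the reconstructed waypoint list is a shortest path in grid steps.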
Using standard notation, white pixels in the grid represent empty space, black pixels represent occupied space, and grey pixels denote space of unknown occupancy. Given the path to the target location, control is passed to the path-following and obstacle-avoidance module, which drives the robot towards the goal while avoiding any dynamic obstacles that may not be represented in the occupancy-grid map.

B.2 Path Following

Once a path to the goal has been determined, the navigator switches to path-following mode. There are two main requirements that path following must satisfy. First, the robot's motion must be smooth. Second, the robot must stay near the computed path since, given the prior knowledge about the environment encoded in the occupancy-grid map, this trajectory represents the safest route towards the goal. The robot does not necessarily have to pass over every waypoint, as long as it reaches the goal. The robot may have to move away from the pre-computed path in order to avoid dynamic obstacles not present in the occupancy grid.

We use a POMDP for path following and obstacle avoidance. We detect obstacles using stereo vision. Since we use a wheeled robot, we make the explicit assumption that it moves on a planar surface. This assumption allows us to define obstacles as those objects that are not part of the floor. We employ an approach that uses disparity information to identify the floor in the robot's current view. Section B.2.1 provides a more detailed description of our obstacle detection method. In Section B.2.2, we describe the POMDP we use for path following and obstacle avoidance.

B.2.1 Detecting and Representing Safe Space for Collision Avoidance

We teach the robot the disparity values of the floor by collecting example depth maps. Since the baseline of the stereo camera remains parallel to the floor at all times, the disparity of the floor for each row is independent of the column.
So, for a training depth image of M × N pixels we need to learn M disparities for each tilt angle. We collected pairs of depth maps and head tilt angles for 10 different angles. For each row, we computed the mean and standard deviation of the valid floor disparities. These define a 1D Gaussian distribution that, for the i-th row, is given by Equation B.1,

    G_d^i(x) = 1 / (σ_d^i √(2π)) · exp( −(x − m_d^i)² / (2 (σ_d^i)²) )    (B.1)

where m_d^i and σ_d^i are the mean and standard deviation of the learned disparity for this row, and x is the disparity value of a pixel in that row. For each row, we fit a line through the data. This gives us a method for computing an expected disparity for each candidate floor pixel given an arbitrary tilt angle. Figure B.1 shows an example of the mean values collected for four different image rows and several tilt angles, along with the lines fit to the data. Figure B.2 shows an example of detecting the floor: the corresponding color and depth images for a random view, and the binary image that results from thresholding the probabilities computed using Equation B.1, with pixels scoring over 0.7 marked as floor in white and all other pixels in black.

B.2.2 POMDP for Path Following and Obstacle Avoidance

Once a path to the goal and a description of the occupancy of the space in front of the robot are given, the robot must follow the path by generating the appropriate motor commands. This is a sequential decision-making problem and as such we model it
The navigation POMDP makes local decisions for keeping the robot on the global path while avoiding any dynamic obstacles. There are 4 actions available to the robot which are rotate left, rotate right, move forward and stop. The stop action is executed only when the robot is within 1 meter of the target location and it is not part of the POMDP model. The hidden state consists of the 4 variables {L, R, F, OP}. We give the meaning and domain of these variables in Table B.1.  The  OP variable encodes the robot’s deviation from the pre-computed path to the next waypoint. Once the robot gets to within 1 meter of the waypoint, it switches to the next one until all points have been visited. The last waypoint is the target location. The observed quantities are the obstacles detected in front of the robot. These are encoded in the binary image resulting from the floor detection step. In  180  (a)  ( b)  (c)  (d )  Figure B.2: Detecting the floor (a) the color image, (b) the original disparity map,(c) the learned floor disparities for the head’s tilt angle (40 degrees in this example) and (d) the binary floor image with floor pixels in white and all others in black State variable L R F OP Observation variable o1 o2 o3 o4  Domain  Description  T, F T, F T, F val1, val2, val3, val4 Domain  Obstacle to the robot’s left Obstacle to the robot’s right Obstacle infront of the robot The angle to the global path Description  occ25, occ50, occ75, occ100 occ25, occ50, occ75, occ100 occ25, occ50, occ75, occ100 val1, val2, val3, val4  obstacle to the robot’s left obstacle to the robot’s right obstacle in front of the robot noisy OP  Table B.1: The navigation POMDP state and observation variables.  181  (a)  o1  o3  o2  (b) Figure B.3: Setting the observed quantities for the POMDP showing (a) the raw input image and (b) the binary image with the obstacles marked in black and free space marked in white. 
order to keep the size of the POMDP model small, we group together the occupancy of neighboring pixels. Table B.1 lists the observation variables. Three variables denote the likelihood of an obstacle existing to the left, to the right, and in front of the robot; the fourth is the measured deviation from the global path. The mapping between o4 and OP is not deterministic because the robot's location is not known exactly: localization is prone to small errors. Variables o1, o2 and o3 encode the percentage of empty space in the area visible to the robot's camera. Occupancy is discretized into 4 intervals: 0-25%, 25-50%, 50-75% and 75-100%. Figure B.3 shows an example of how we discretize the current view and set the values of the observation variables; the three observed variables o1, o2 and o3 take values based on the ratio of free to occupied space in the image regions shown there.

Figure B.4: This diagram shows how we set the values for the observation variable OP. The robot is assumed to be at the center of the circle, facing to the right. If the direction to the next waypoint in the pre-computed path falls between -20 and 20 degrees, variable OP takes value val1; if the direction is between 20 and 90 degrees, it takes value val2; if the direction is between -20 and -90 degrees, OP takes value val3; in all other cases, OP has value val4.

The robot receives a large negative reward (-100) for having an obstacle in front of it and a small positive reward (+1) for staying on the path, i.e., when the OP variable has value val1. Figure B.5 shows an example trajectory followed by the robot. In this case the path computed using the known occupancy-grid map was sufficient to lead the robot to the target location. Notice that the robot does not follow the path exactly, as it keeps a safe distance from the obstacle to its left.
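The observation discretization above, and the OP sectors of Figure B.4, can be sketched as follows; the symbol names mirror Table B.1 and the bin boundaries follow the text, while the treatment of exact boundary values is our assumption.

```python
def occupancy_bin(occupied_fraction):
    """Map the fraction of occupied pixels in an image region to one of the
    four occupancy symbols (0-25%, 25-50%, 50-75%, 75-100%)."""
    if occupied_fraction <= 0.25:
        return "occ25"
    if occupied_fraction <= 0.50:
        return "occ50"
    if occupied_fraction <= 0.75:
        return "occ75"
    return "occ100"

def op_value(angle_deg):
    """Map the direction to the next waypoint (in degrees, with the robot
    facing 0) to the OP symbols of Figure B.4."""
    if -20 <= angle_deg <= 20:
        return "val1"          # roughly on the path
    if 20 < angle_deg <= 90:
        return "val2"          # 20 to 90 degree sector
    if -90 <= angle_deg < -20:
        return "val3"          # -20 to -90 degree sector
    return "val4"              # path is behind the robot

print(occupancy_bin(0.6), op_value(-45))
```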
One of the problems with the path-following POMDP is that it forgets obstacles it can no longer see. For example, if there is an obstacle in front of the robot, the POMDP turns left or right to avoid it; once the obstacle leaves the camera's field of view, the robot may turn back into it. One way to deal with this problem is to augment the POMDP's hidden state with variables describing obstacles behind the robot. This, however, increases the size of the POMDP, making it harder to find a good solution. Instead, we recompute the global plan at regular intervals. Our empirical evaluation showed that replanning at 15-second intervals eliminates the problem of forgotten obstacles by computing a path around them. A better solution is certainly possible; however, the navigation method presented here has been sufficient for the empirical evaluation of the robot design and development framework described in this thesis. Figure B.6 shows the robot avoiding an obstacle (a plant) that was inserted to overlap with the pre-computed plan. The robot successfully navigates around the object to reach the target location.

Figure B.5: Navigation example showing (a) the initial plan (red) to the goal location and (b) the robot's path during navigation (blue). The blue square denotes the robot.

Figure B.6: Navigation example showing how the robot avoids a dynamic obstacle not represented in the original occupancy-grid map. Part (a) shows the initial plan (red) towards the goal location and part (b) shows the path followed (blue) as the robot avoids an obstacle introduced dynamically. The blue square denotes the robot.
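The periodic-replanning scheme can be sketched as a simple control loop. `plan_path`, `localize`, `follow_step`, and `at_goal` are hypothetical stand-ins for the planner, localizer, POMDP controller, and goal test; only the 15-second replanning period comes from the text.

```python
import time

REPLAN_PERIOD_S = 15.0   # from the text: replan at 15-second intervals

def navigate(plan_path, localize, follow_step, at_goal):
    """Follow a planned path, recomputing the global plan every 15 s so that
    obstacles that have left the field of view are still planned around.

    plan_path(pose) -> waypoint list; follow_step(pose, path) issues one
    POMDP-driven motion command; at_goal(pose) -> bool."""
    pose = localize()
    path = plan_path(pose)
    last_replan = time.monotonic()
    while not at_goal(pose):
        if time.monotonic() - last_replan >= REPLAN_PERIOD_S:
            path = plan_path(pose)       # route around obstacles no longer in view
            last_replan = time.monotonic()
        follow_step(pose, path)
        pose = localize()
```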
