UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Cooperative and intelligent control of multi-robot systems using machine learning Wang, Ying 2008-09-08

You don't seem to have a PDF reader installed, try download the pdf

Item Metadata


24-ubc_2008_spring_wang_ying.pdf [ 15.19MB ]
JSON: 24-1.0066612.json
JSON-LD: 24-1.0066612-ld.json
RDF/XML (Pretty): 24-1.0066612-rdf.xml
RDF/JSON: 24-1.0066612-rdf.json
Turtle: 24-1.0066612-turtle.txt
N-Triples: 24-1.0066612-rdf-ntriples.txt
Original Record: 24-1.0066612-source.json
Full Text

Full Text

Cooperative and Intelligent Control of Multi-robot SystemsUsing Machine LearningbyYing WangB.A.Sc., Shanghai Jiao Tong University, 1991M.A.Sc., Shanghai Jiao Tong University, 1999A THESIS SUBMITTED IN PARTIAL FULFILMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHYinTHE FACULTY OF GRADUATE STUDIES(Mechanical Engineering)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)March 2008© Ying Wang, 2008ABSTRACTThis thesis investigates cooperative and intelligent control of autonomous multi-robotsystems in a dynamic, unstructured and unknown environment and makes significantoriginal contributions with regard to self-deterministic learning for robot cooperation,evolutionary optimization of robotic actions, improvement of system robustness, vision-based object tracking, and real-time performance.A distributed multi-robot architecture is developed which will facilitate operation of acooperative multi-robot system in a dynamic and unknown environment in a self-improving, robust, and real-time manner. It is a fully distributed and hierarchicalarchitecture with three levels. By combining several popular AI, soft computing, andcontrol techniques such as learning, planning, reactive paradigm, optimization, andhybrid control, the developed architecture is expected to facilitate effective autonomousoperation of cooperative multi-robot systems in a dynamically changing, unknown, andunstructured environment.A machine learning technique is incorporated into the developed multi-robot systemfor self-deterministic and self-improving cooperation and coping with uncertainties in theenvironment. A modified Q-learning algorithm termed Sequential Q-learning withKalman Filtering (SQKF) is developed in the thesis, which can provide fast multi-robotlearning. By arranging the robots to learn according to a predefined sequence, modelingthe effect of the actions of other robots in the work environment as Gaussian white noiseand estimating this noise online with a Kalman filter, the SQKF algorithm seeks to solveseveral key problems in multi-robot learning.As a part of low-level sensing and control in the proposed multi-robot architecture, afast computer vision algorithm for color-blob tracking is developed to track multiplemoving objects in the environment. By removing the brightness and saturationinformation in an image and filtering unrelated information based on statistical featuresand domain knowledge, the algorithm solves the problems of uneven illumination in theenvironment and improves real-time performance.iiIn order to validate the developed approaches, a Java-based simulation system and aphysical multi-robot experimental system are developed to successfully transport anobject of interest to a goal location in a dynamic and unknown environment withcomplex obstacle distribution. The developed approaches in this thesis are implementedin the prototype system and rigorously tested and validated through computer simulationand experimentation.iiiTABLE OF CONTENTSAbstract ^ iiTable of Contents ^ ivList of Figures viiNomenclature xiAcknowledgements ^ xivChapter 1 Introduction  11.1 Goals of the Research ^ 21.2 Problem Definition  31.3 Related Work ^ 51.3.1 Multi-robot Architectures ^ 91.3.2 Machine Learning in Multi-robot Systems ^  131.3.3 Computer Vision and Sensing Technologies for Multi-robot Systems ^ 171.4 Contributions and Organization of the Thesis 20Chapter 2 Learning/Planning Based Hierarchical Multi-robot Architecture ^ 222.1 Overview^ 222.2 General Architecture ^ 252.3 Machine Learning Unit 282.4 Planning Unit 292.4.1 Planning and Learning in Multi-robot Applications ^ 292.4.2 Local Path Planning for Multi-robot Cooperative Transportation ^ 322.5 Integrating Learning with Planning ^ 352.6 Behavior-based Distributed Control 362.6.1 Behavior Representation 372.6.2 Behavior Composition ^ 382.6.3 Group Coordination 402.6.4 Communication Behaviors 402.6.5 Coordinated Object Transportation ^ 412.7 Low-level Control ^ 432.7.1 Robotic Manipulators ^ 432.7.2 Mobile Robots 442.8 Distributed Control Issues 472.9 Summary^ 49Chapter 3 Machine Learning for Multi-robot Decision-making ^ 513.1 Overview 51iv3.2 Markov Decision Process (MDP) and Reinforcement Learning ^ 523.2.1 Reinforcement Learning ^ 553.3 Multi-robot Transportation Using Machine Learning: First Version ^ 573.3.1 Multi-agent Infrastructure 593.3.2 Cooperation Based on Machine Learning ^ 613.3.3 Simulation Result ^ 683.3.4 Experimentation 723.4 Distributed Multi-robot Q-learning in Complex Environments ^ 743.4.1 Distributed Q-learning ^ 743.4.2 Task Description 763.4.3 Multi-robot Box-pushing With Distributed Q-learning ^ 763.4.4 Simulation Results ^ 823.5 Team Q-learning Based Multi-robot Cooperative Transportation ^ 903.5.1 Stochastic Games and the Team Q-learning Algorithm 913.5.2 Simulation Results ^ 923.5.3 Impact of Learning Parameters ^ 953.5.4 Comparing Team Q-learning with Multi-robot Distributed Q-learning 1023.6 Summary ^ 110Chapter 4 Sequential Q-learning with Kalman Filtering (SQKF) ^ 1124.1 Overview  1124.2 Sequential Q-learning with Kalman Filtering ^  1134.2.1 Sequential Q-learning ^  1134.2.2 Kalman Filtering based Reward Estimation  1154.3 Simulation Results ^  1204.4 Summary ^ 131Chapter 5 Computer Vision in Multi-robot Cooperative Control ^ 1335.1 Overview  1335.2 Multi-robot Transportation System ^  1355.3 The Fast Color-blob Tracking Algorithm  1375.3.1 Removal of the Lighting Disturbance ^  1375.3.2 Statistical Feature Filtering ^  1395.3.3 Color-blob Template Matching  1405.3.4 Coordinate Transformation and Pose Reconstruction ^ 1405.4. Experimental Result^  1415.4.1 The Test-bed  1415.4.2 Experimental Result  142v5.5 Multi-robot Route Planning ^  1475.6 Summary ^ 149Chapter 6 Experimentation in Distributed Multi-robot Cooperative Transportation^  1506.1 Overview^  1506.2 Test Bed  1516.3 JAVA RMI 1556.3.1 Remote Interface^  1586.3.2 Server Side  1586.3.3 Client Side  1606.4 A Physical Multi-robot Transportation System ^  1616.5 Force/Position Hybrid Control ^  1656.6 Summary ^ 172Chapter 7 Conclusions  1737.1 Primary Contributions ^  1737.2 Limitations and Suggested Future Research^  1757.2.1 Improvement of the Model and Algorithm of SQKF ^ 1767.2.2 GA-based Learning Space Compression  1767.2.3 Local Modeling and Low-level Control ^  1767.2.4 Integrating Planning and Learning  1777.3 Summary^ 177Bibliography 178Appendix: JAVA Documents of the Developed System ^ 190viLIST OF FIGURESFigure 1.1Figure 2.1Figure 2.2Figure 2.3Figure 2.4Figure 2.5Figure 2.6Figure 2.7Figure 2.8Figure 2.9Figure 2.10Figure 2.11Figure 2.12Figure 3.1Figure 3.2Figure 3.3Figure 3.4Figure 3.5Figure 3.6Figure 3.7Figure 3.8Figure 3.9Figure 3.10Figure 3.11Schematic representation of the developed system ^ 3A general and hierarchical multi-robot architecture 26The principle of the machine learning unit ^ 28An example of the local minimum phenomenon in a purely learning-basedmulti-robot cooperative box-pushing task ^ 30The bounded local environment in navigation potential function planning 33The hierarchical behavior-based control layer ^ 36A SR diagram of a simple behavior ^ 37The diagram of a composite behavior with a competitive operator ^ 39The diagram of a composite behavior with a cooperative operator ^ 40Behavioral hierarchy describing the coordinated transport sub-task ^ 41A FSA describing the transitions of the group behaviors in the project ofcoordinated object transportation ^ 43Control of a mobile robot ^ 45The block diagram of motion control of a mobile robot ^ 46The value iteration algorithm to calculate the utilities  55The single-agent Q-learning algorithm ^ 56The developed multi-robot system 59The multi-agent architecture used in the developed system ^ 59The world state representation ^ 62Representation of the cooperation strategy (action) ^ 63The Genetic Algorithm^ 64The integration scheme of RL and GA ^ 68The robots move randomly and locate the box 70The whole transportation process ^ 71The goal location was suddenly changed from (380,370) to (385,60) at t =34s while the robots were transporting the box ^ 72viiFigure 3.12Figure 3.13Figure 3.14Figure 3.15Figure 3.16Figure 3.17Figure 3.18Figure 3.19Figure 3.20Figure 3.21Figure 3.22Figure 3.23Figure 3.24Figure 3.25Figure 3.26Figure 3.27Figure 3.28Figure 3.29Figure 3.30Figure 3.31Figure 3.32The robots have selected the "sweeping" action to remove a movableobstacle in the path ^  73Description of a multi-robot box-pushing task ^  76The binary coding of the environmental states  77Coding rule of the environmental states ^  78Pushing actions available to each robot  79The dynamics of the box pushed by three robots ^  80Three box-pushing snapshots under different obstacle distributions and goallocations 84The environmental state with binary value "1000000" and six pushingactions available to the robots ^ 87The Q values under the state of "1000000" in 100,000 rounds ofbox-pushing ^ 88Probability density estimate of Q values under state of "1000000" in100,000 rounds of box-pushing ^ 88Probability density estimate of action selection probability under the stateof "1000000" in 100,000 rounds of box-pushing ^ 89The box trajectory in a round of box-pushing 93The rewards received by the robots in a round of box-pushing ^ 94The surface of the Q-table (the states from 1 to 32) after 10,000 trainingcycles ^ 94The impact of training ^ 96Impact of the lower bound of the random action probability ^ 97The impact of the detection radius ^ 98The impact of the pushing force 99The rewards received by the robots in a round of box-pushing ^ 101Simulation results with the single-agent Q-learning algorithm and the teamQ-learning algorithm ^  103Performance index comparison of the single-agent Q-learning algorithmbefore and after the training stage ^  106viiiFigure 3.33 Performance index comparison of the Team Q-learning algorithm beforeand after the training stage ^  109Figure 4.1 The simulated multi-robot box-pushing system ^ 120Figure 4.2 A successful example of multi-robot box pushing with the new SQKFalgorithm ^  121Figure 4.3 Probability density estimate of the number of steps per round before andafter training^  122Figure 4.4 Probability density estimate of the average global reward per round beforeand after training ^  123Figure 4.5 The number of steps per round in 1,000 rounds of continuous box-pushing^  124Figure 4.6Figure 4.7Figure 4.8Figure 4.9Figure 4.10Figure 4.11Figure 4.12Figure 5.1Figure 5.2Figure 5.3Figure 5.4Figure 5.5Figure 5.6Figure 5.7Figure 5.8The average global reward per round in 1,000 rounds of continuousbox-pushing ^  125The global rewards received by the robots in a round of box-pushing beforetraining ^  126The probability density estimate of the global rewards in Figure 4.7 ^ 127The global rewards received by the robots in a round of box-pushing aftertraining ^  128The probability density estimate of the global rewards in Figure 4.9 ^ 128Probability density estimate of number of steps per round in 500 rounds ofbox-pushing 129Probability density estimate of average global reward per round in 500rounds of box-pushing ^  130The developed physical multi-robot system ^  135The color-blob tracking algorithm ^  137A sample image of a color blob  139The experimental system in IAL at UBC ^ 142An original image captured by the CCD camera 143The hue component of the original color image in the HSI color space ^ 144The hue component image after the background is filtered ^ 145The result after filtering Figure 5.7 using template matching ^ 146ixFigure 5.9Figure 6.1Figure 6.2Figure 6.3Figure 6.4Figure 6.5Figure 6.6Figure 6.7Figure 6.8Figure 6.9Figure 6.10The reconstructed result of the poses of the robots and the box ^ 147The multi-robot object transportation system developed in IAL ^ 151The color blobs used to identify the orientation of the box ^ 153The PioneerTM 3-DX mobile robot ^  154The PioneerTM 3-AT mobile robot  154JAVA RMI distributed computing ^  157Multi-robot cooperative transportation  164The robot hybrid force/position control scheme ^ 166The robots move to the desired position and push the box under hybridforce/position control ^  168Position control of the robotic arm ^ 170The force control of the robotic arm  171xNOMENCLATURENotationsI k(Xy)^Attraction function in the potential field approachP(x, y)^Repulsion function in the potential field approachCo(x, y) Potential functionco(xi , y,)^Gradient of a potential fieldD(q) Inertia matrix of the manipulatorC(q,q)^Coriolis matrixJoint torque (see Chapter 2)S = {s 1 , s2 ,^sn }^Set of states in the environment^{a„a2 ,...,an }^Set of actions available to a robotT(s,a,s')^Transition function, which decides the next environmental states' when robot selects action a1 under current state s .fl(s)^Probability distribution over the statesR : S x A —* R^Reward function. Gives the immediate reward when robot takesaction a, under current state sPolicy with which a robot selects its actions7Z"^ Optimal policy/3 (Chapter 2)^Mapping function/3 (Chapter 3)^Discount factorU(s)^Utility of a stateMaximum error allowed in the utility of any stateMaximum change in the utility of any state in an iterationQ(s,,a,)^An entry in the Q-table77 Learning rater^ Direct rewardxir (Section 3.2)^" Temperature " parameter (see Chapter 3)r (Section 3.3)^The angle between the net force vector and the position vector ofthe closest obstacleNet torque applied by the robotsNet force applied by the robotsA.^Action set of the ith robotA set including all actions selected by the robots thus farEmpty setA^Set including the currently available actionsg, Global rewardb,^An additive noise process which models the effect of theunobservable states on the global rewardz.^A Gaussian random variableMean2^VarianceCovariance matrixIs'^Total number of world statesL Direction constraint matrixStandard deviationP^Composite behaviorxiiAbbreviationsMDP^Markov Decision ProcessGA Genetic AlgorithmRI_^Reinforcement LearningSQKF Sequential Q-learning with Kalman FilteringAI^Artificial IntelligenceGLIE Greedy in the Limit with Infinite ExplorationSG^Stochastic GameRGB Red, Blue, and GreenHSI^Hue, Saturation, and IntensityLAN Local Area NetworkTCP/IP^Transmission Control Protocol/Internet ProtocolRMI Remote Method InvocationACKNOWLEDGEMENTSI wish to express my sincere appreciation to my supervisor, Prof. Clarence W. deSilva. Four years ago, when I first contacted him to seek a Ph.D. student position in hisgroup, he was so kind and warm to accept my request, which enabled my memorablePh.D. career at The University of British Columbia. In the past 4 years, Prof. de Silva hasgiven me constant supervision, support, advice, and help. These include, to name a few,supervising my Ph.D. research, suggesting and sharpening the research ideas, discussingthe research problems and helping to solve them, carefully revising and improving mywriting, generously funding me to attend top conferences in the robotics and controlsfield and providing financial support for my Ph.D. studies, purchasing state-of-the-artrobots and other equipment for my experiments, kindly appointing me as Lab Manager ofthe Industrial Automation Laboratory, and writing excellent recommendation letters forscholarships and job applications. In particular, it was Prof. de Silva who suggested theidea of integrating machine learning with multi-robot architectures for my Ph.D. research,which has been proved to be an excellent and innovative idea as I carried out my researchin the subsequent years. In a word, I am very grateful to Prof. de Silva who has helpedme so much, and in my heart, he is not only a fantastic supervisor but also anotherrespected father.I want to thank Dr. Otman Basir from University of Waterloo, Dr. Farrokh Sassani, Dr.Jose Marti, Dr. Ian Yellowley, and Dr. Robin Turner to work as the examining committeefor my Ph.D. defense exam, who carefully assessed my thesis and gave me valuableopinions. I also wish to express my appreciation to Prof. Farrokh Sassani, Prof.Mohamed Gadala, and Dr. Roya Rahbari, who have kindly served as members of myresearch committee. It is they who have helped and encouraged me to improve myresearch in these years. In addition, I also want to thank Prof. Elizabeth Croft, Prof. IanYellowley, Prof. Yusuf Altintas, Prof. Mu Chiao, Prof Ryozo Nagamune and Prof.Xiaodong Lu for their valuable suggestions and assessments during our annual graduateresearch seminars.xivI would also want to thank Ningbo University in China and K.C.Wong EducationFoundation (Hong Kong) for offering me an amazing Ph.D. fellowship. In addition, I amgreatly indebted for the Ph.D. tuition award from The University of British Columbia.The appreciation should also go to other generous financial support, including researchassistantships and equipment funding from the Natural Sciences and EngineeringResearch Council (NSERC) of Canada and the Canada Foundation for Innovation (CFI)through Prof de Silva as the Principal Investigator.I take this opportunity to thank my close friends and colleagues in our research group;especially, Dr. Poi Loon Tang, Mr. Roland (Haoxiang) Lang, Dr. Tao Fan, Mr. DumindaWijewardene, Mr. Jason J. Zhang, Mr. M. Tahir Khan, Mr. Arun Gupta, Mr. MohammedAlrasheed, Prof. Lalith Gamage, Mr. Richard McCourt, Dr. Yang Cao, and Ms. F. NazlyPirmoradi. You guys have made my life at UBC memorable and pleasant. In addition, Iwant to thank my previous colleagues at Ningbo University in China, especially Prof.Zhimin Feng, Dr. Chao Hu and Prof Shirong Liu, for their warm and constant support.Finally, I want to thank my family for their powerful support. Above all, I wish toexpress my exceptional appreciation to my wife Jiangfei Sun, my daughter Shihan (Lucy)Wang, my father and mother, my sister and her family, for their continuous love, support,and encouragement throughout. I dedicate this thesis to them.xvChapter 1IntroductionA multi-agent robotic system is a group of autonomous robots, which are organizedinto a multi-agent architecture so as to cooperatively carry out a common task in apossibly dynamic environment with or without obstacles. In such a system, each robot isregarded as an agent, which possesses capabilities of independent sensing, actuation, anddecision making. Besides the physical agents (e.g., robots), some pure software agentsmay be included into the system to help coordinate the physical agents, or implementcomplex machine intelligence. The group of agents constitutes an "agent society," wherethey cooperate, coordinate, and even compete with each other to achieve a common goal.In the past decade, multi-agent robotic systems have received increased attention in therobotics community as they possess important advantages over a single robot system. First,a multi-agent robotic system is a parallel system, which can provide improved efficiency,especially in such tasks as mapping, searching, and space exploration. Second, throughcooperation and communication, a multi-agent robotic system may become more powerfulthan a single complex robot, at the same cost, and may be able to complete some specialtasks which a single robot is unable to carry out. Third, multi-agent robotic systems havebetter fault tolerant capabilities than a single robot. For example, if one robot is damagedor malfunctioned, the remaining robots in the system can takeover its role and continue tocomplete the task, perhaps at a reduced level of performance.Multi-agent robotic systems face some special challenges as well. These includesynchronization with regard to the locations, applied forces and actions of the robots;determination of a cooperation strategy; and the way to learn and improve such a strategythrough interaction with the environment.An important research topic in the multi-agent robotics is the multi-robot objecttransportation problem. In this problem, several autonomous robots move cooperatively to1transport an object to a goal location and orientation in either a static or a dynamicenvironment, which may contain fixed or removable obstacles. The object may be soheavy or large that no single robot will be able to handle it alone. This is a ratherchallenging problem. For example, in the transportation process, each robot needs to sensepossible changes in the environment, the locations of the obstacles, and the other robots inthe neighborhood. Then it needs to communicate with the peers, exchange sensinginformation, discuss cooperation strategies, suggest obstacle avoidance schemes, planrobot trajectories, assign or receive subtasks and roles, and coordinate robot actions so asto transport the object quickly and successfully to the goal location. Because of manychallenges that belong to different research fields, the multi-robot object transportationproblem is a good benchmark to assess the effectiveness of various multi-agentarchitectures and cooperation strategies. Furthermore, multi-robot cooperation has manypractical applications in fields such as space exploration, intelligent manufacturing, deepsea salvage, dealing with accidents in nuclear power plants and other hazardousenvironments, and robotic warfare. Therefore, multi-robot object transportation problemhas become an important topic in the area of multi-agent robotics.1.1 Goals of the ResearchIn this thesis, a physical multi-robot transportation system operating in a dynamic andunknown environment with complex obstacle distribution will be developed. In order tocomplete this challenging task, several key approaches will be established. The fourprimary research goals of the thesis are to: develop a fully distributed multi-robot architecture that accommodates behaviorcoordination, resource assignment, and dynamic unknown environment. develop a new machine learning algorithm that will enable multi-robot systems todeal with uncertainties in the environment. develop technologies of computer vision and sensing for carrying out multi-robottasks. study performance issues such as robustness, cooperative behavior, self-learning andadapting capability, speed, and accuracy of the developed methodology.2Digital CameraObjectRobotic Arm Mobile RobotFixedObstacles Goal Location /// MovableObstaclesForce SensorsEthernet Network1.2 Problem DefinitionA primary objective of the present work is to develop a physical multi-robot system,where a group of intelligent autonomous robots work cooperatively to transport an objectto a goal location and orientation in an unknown and dynamic environment whereobstacles may be present or even appear randomly during the transportation process.Robot control, multi-agent planning, and machine learning are integrated into thedeveloped physical platform to cope with the challenges of the problem. A schematicrepresentation of the developed system is shown in Figure 1.1.Figure 1.1: Schematic representation of the developed system.In this system there are three or more autonomous mobile robots or fixed robotic arms,which cooperatively transport an object to a goal location. During the course of thetransportation, they have to negotiate the cooperation strategy and decide on the optimallocations and amplitudes of the individual forces applied by them so that the object istransported quickly and effectively while avoiding any obstacles that may be present in thepath. In some special cases, they also need to consider whether to remove the obstaclesrather than negotiating around them. Other considerations such as the level of energyutilization and avoiding damage to the transported object may have to be taken intoaccount as well. Charge Coupled Device (CCD) camera systems are used to monitor andmeasure the current location and orientation of the object. The environment is dynamicand unpredictable, and contains movable obstacles, which may appear randomly, and3some fixed obstacles. The robots, the camera and the sensors are separately linked to theirhost computers, which are connected through a local network, to implement complexcontrols and machine intelligence.There are three major challenges in the developed systems, which are all key issues inthe multi-robot domain. The first one is the dynamic environment. In particular, the robotsare working in an environment with dynamic random obstacle distribution. The obstaclesmaybe appear and disappear randomly. In addition, because there are multiple robotsworking in parallel within the same environment, while one robot makes decisions, someother robots may have changed the environment. This also can make the environmentdynamic from the standpoint of a single robot. The dynamic environment makes it verydifficult for the robots to make decisions and learn useful policies because there will behidden environment states, which are unobservable for the robots.The second challenge results from the local sensing capabilities of the the robots. Inthis project, it is assumed that each robot is able to only detect an object or obstacleswithin a limited detection radius. It means that the robots possess local sensing capabilitiesonly, and the global environment is unknown to them. Before a robot moves close to anobstacle, it does not know the location or the shape of the obstacle. The unknownenvironment is another major challenge in the multi-robot domain. Because the robotscannot know the complete environment in advance, in part due to the unfamiliarity and inpart due to the dynamic nature, they cannot employ the traditional planning technologiesfor decision making associated with the execution of the task.The third challenge arises due to the unreliability of communication among robots, asassumed in the present project, where the disturbances are represented by white noise. Dueto unreliable communication, sometimes the robots may end up selecting a wrongcooperation strategy. Because multi-robot systems are usually expected to work in someextreme environments (for example, planetary surfaces) where communication may beunreliable due to unknown disturbances, how to improve the robustness of multi-robotdecision-making in an environment with unreliable communication is a key issue in themulti-robot domain.A typical natural environment of a multi-robot system is usually dynamic andunknown. In addition, robot communication is usually unreliable in such an environment.4In this thesis, several approaches will be developed to meet these key challenges in themulti-robot domain and enable multi-robot systems to work effectively in an unstructurednatural environment.1.3 Related WorkA considerable amount of work has been done in multi-robot cooperative control anddecision-making by the robotics community in the past 15 years, particularly to supporteffective and robust operation of multi-robot systems in both dynamic and stationaryenvironments. Some pertinent work is surveyed below.Kube and Bonabeau (2000) studied cooperative transport by ants and robots, and founda "Phase Transition" phenomenon. Specifically, the ants were found to suddenly improvetheir performance after a learning stage. They also noticed that the ants coped with thedeadlock problem in collective transportation using the method of "Realignment andRepositioning." Based on these observations, a formalized model of cooperativetransportation was proposed, which included an FSM (Finite State Machine) basedcontroller, a reactive architecture, perceptual cues, and taxis-based or kinesthetic-basedactions. This biologically inspired approach was applied to a multi-robot box-pushing taskand they observed that local interactions among the robots resulted in a global action. Themain shortcoming of their approach stems from inefficiency for real multi-robotapplications.Cao et al. (2006) studied a multi-robot hunting task in an unknown environment. Theyproposed a distributed control approach involving local interactions with local coordinatesystems. This approach can cope with the cumulative errors of wheels and imperfectcommunication networks. Computer simulations showed the validity of the proposedapproach.Rus et al. (1995) employed a team of autonomous robots to move furniture. In thisproject, they proposed four cooperative protocols and analyzed them in an informationinvariant framework, and studied whether a global controller, a global planner, explicitcommunication and a geometric model of the pushed object were necessary forcompleting the task. They also presented a pushing-tracking reorientation method for amulti-robot box-pushing task.5Kumar and Garg (2004) developed a multi-robot cooperative manipulation project withtwo industrial robotic arms. They employed a fuzzy logic-based approach to coordinatethe motions of the two robotic manipulators so that the internal forces among thembecame a minimum. In addition, they used Kalman Filtering to estimate the external forceon the end effector based on information from the force/torque sensor mounted on therobot wrist.Gustafson et al. (2006) studied the scaling issues of multi-robot systems. They usedseveral abstract models to elucidate that it was harmful to increase the number of robots orthe sensor strength in multi-robot systems. Stone et al. (2006) investigated the issue ofuncertainty in multi-robot systems. They identified several uncertainty sources andintroduced methods for reducing uncertainty and making decisions in the face ofuncertainty. These contributions enable effective planning under uncertainty for robotsengaged in goal-oriented behavior within a dynamic, collaborative and adversarialenvironment.Mataric et al. (1995) developed a behavior-based multi-robot box-pushing system. Intheir project, they proposed a behavior-based reactive architecture for multi-robot systemswith explicit communication. The cooperative strategies among the robots were studiedand two legged robots with minimum computational capabilities were used to demonstratethe proposed approach. In their latest work Lerman et al. (2006) presented a mathematicalmodel for dynamic task allocation in multi-robot systems, based on stochastic processes.They assumed a large scale multi-robot system with local sensing, behavior-basedcontroller and distributed task allocation capabilities. Through storing the history of theenvironment observed in the internal state of a robot, they tried to cope with the localsensing and indirect communication issues in multi-robot systems. The model proposed bythem was demonstrated in a multi-robot foraging task and some statistical results andcomparisons were given. In another paper written by them (Jones and Mataric, 2004a), aprincipled framework suitable for describing and reasoning about the intertwined entitiesinvolved in any task-achieving multi-robot systems was addressed. Using this framework,they presented a systematic procedure to synthesize controllers for the robots in a multi-robot system such that a given sequential task would be correctly executed. Gerkey andMataric (2004) also proposed a taxonomy method for MRTA (multi-robot task allocation).6They proposed three axes for use in describing MRTA problems: (1) Single-task robots(ST) versus multi-task robots (MT). (2) Single-robot tasks (SR) versus multi-robot tasks(MR). (3) Instantaneous assignment (IA) versus time-extended assignment (TA). Based onthis method they compared six typical multi-robot architectures and their MRTAapproaches.There are several useful surveys in the multi-robot domain. An older one on"Cooperative Mobile Robotics: Antecedents and Directions" was done by Cao et al.(1997). A newer overview is presented by Arai et al. (2002), where they identify sevenprincipal areas of multi-robot systems: biological inspirations, communication,architectures/task allocation/control, localization/mapping/exploration, objecttransportation and manipulation, motion coordination, and reconfigurable robots. Theyalso point out that the recent progress is beginning to advance the areas of reconfigurablerobotics and multi-robot learning. Dudek et al. (2002) also proposed a taxonomy of multi-robot systems, which included seven dimensions: Collective size, Communication range,Communication topology, Communication bandwidth, Collective Reconfigurability,Processing ability, and Collective composition. In addition, a survey done by Parker(2000b) studied the existing multi-robot architectures, and pointed out several challengesin typical multi-robot tasks. In particular, she summarized that there were three multi-robot architectures in the literature, identified as: general architecture, specializedarchitecture and hybrid architecture. Further, she identified three typical multi-robot tasksand their main challenges. For example, in multi-robot object transportation tasks, a majorchallenge is uneven terrain; in multi-robot motion coordination, the key challenges arephysical demonstration, extending from a 2D environment to a 3D environment, andcomputational complexity; For multi-robot learning, an important issue is how to assigncredits among robots. Moreover, Parker argued that an important direction in the multi-robot domain was to extend the multi-agent learning approaches to multi-robot systems,where real-time constraints challenged most existing multi-robot systems.A recent survey on multi-robot systems is the one completed by Farinelli et al. (2004).In this survey, they proposed a taxonomy-like methodology for multi-robot systems,which focused on coordination among robots. They addressed the differences betweenmulti-robot systems (MRS) and multi-agent systems (MAS). They also classified multi-7robot tasks as unaware systems, aware but not coordinated systems, weakly coordinatedsystems, strongly coordinated and strongly centralized systems, strongly coordinated andweakly centralized systems, and strongly coordinated and distributed systems.Furthermore, they pointed out several research trends in the area, which involved complexcoordination, distributed systems, uncertain environments and sensing, coordinationmodel, conflict resolution, large scale systems, more complex social deliberationarchitecture, and weakly centralized approaches.As Parker pointed out (Parker, 2000b), multi-robot learning has received increasedattention in the past decade in order for multi-robot systems to work in dynamic andunknown environments. There are several surveys that summarize multi-robot learning ormulti-agent learning; for example, the survey done by Yang and Gu (2004). In this survey,they identified that scaling an individual reinforcement learning algorithm to multi-robotsystems would violate its Markov assumption, which is the drawback of many existingmulti-robot reinforcement learning approaches. They also identified four frameworks ofmulti-agent reinforcement learning; namely, the Stochastic Games (SGs) based framework,the Fictitious Play framework, the Bayesian framework, and the Policy Iterationframework. Next, they introduced seven typical multi-agent reinforcement algorithms forpossible application to multi-robot systems. They argued that there were four challenges toapplying reinforcement learning to multi-robot systems; namely, the Markov assumption,continuous spaces, uncertainties, and incomplete information. Three future directions werepointed out as well; as, the behavior-based approach with reinforcement learning, fuzzyapproximation functions, and continuous reinforcement learning.Shoham et al. (2003) presented a critical survey of multi-agent reinforcement learning(Shoham et al., 2003), where they criticized that the researchers had been focusing onNash equilibrium, which is very awkward. In addition, they identified four well-definedproblems in multi-agent reinforcement learning. Another famous survey in this area wasdone by Stone and Veloso (2000). They proposed a taxonomy methodology for multi-agent systems from a machine learning perspective, which classified multi-agent systemsas homogenous non-communicating agents, homogenous communicating agents,heterogeneous non-communicating agents, and heterogeneous communicating agents.Opportunities for applying machine learning to these multi-agent systems were8highlighted, and robotic soccer was presented as an appropriate test bed for multi-agentsystems.Panait and Luke (2005) completed a survey on cooperative multi-agent learning, wherethey pointed out that the huge learning space and dynamic environment were two mainchallenges in cooperative multi-agent learning. Simultaneously, they identified two typesof learning: Team learning (Centralized learning) and Concurrent learning. Then fourissues in the area of concurrent learning were discussed, as: credit assignment, learningdynamics, teammate modeling, and relationship between learning and communication. Inaddition, some future directions in cooperative multi-agent learning were presented, as:large-scale multi-robot systems, heterogeneity, agents with internal states (e.g., morecomplex computational intelligence), and dynamic team size.1.3.1 Multi-robot ArchitecturesTo develop a general and flexible architecture for multi-robot cooperation andcoordination has been a standing objective in the robotics community. Five primary typesof multi-robot architectures are discussed in the literature; namely, the behavior-basedarchitecture, the planning-based architecture, the market-based architecture, the distributedcontrol based architecture, and the learning-based architecture. A brief description of thesearchitectures is given next. Behavior-based multi-robot architectureParker (1998, 2000a) proposed a distributed and behavior-based multi-robotarchitecture called L-ALLIANCE. It uses the concepts "Impatient" and "Acquiesce"to dynamically motivate behaviors. Moreover, by assessing the performance of theteammates and dynamic changes in the environment, L-ALLIANCE autonomouslyupdates its parameters to adapt to those changes. L-ALLIANCE was validated in acooperative box-pushing task with two heterogeneous mobile robots. Similar research inmulti-robot transportation can be found in (Rus et al., 1995), where a team of autonomousrobots were employed to move furniture, and in (Yamada and Saito, 2001) where a multi-robot box-pushing system was adopted based on a reactive behavior architecture.9The behavior-based multi-robot architecture, CAMPOUT, has been employed in theRobot Work Crew (RWC) project developed in Jet Propulsion Laboratory (JPL) of NASA(Huntsberger et al., 2003; Schenker et al., 2000, 2003; Trebi-011ennu et al., 2002;Bouloubasis et al., 2003). CAMPOUT is a fully distributed, behavior-based multi-robotarchitecture. It employs the leader-follower decentralized control strategy. CAMOUT wasvalidated in a multi-robot cooperative transportation task in a PV tent autonomousdeployment project on a planetary surface and a cliff robot project. In their latest workStroupe et al. (2005, 2006) presented a multi-robot autonomous construction task based onthe CAMPOUT architecture.Pimentel et al. (2002) developed a behavior-based architecture with application in atwo-robot cooperative carrying project. In this project, they considered simple obstacles inthe environment and compared the effects of implicit communication with explicitcommunication.Goldberg and Mataric (2002) demonstrated a successful application of behavior-basedcontrol in a task of distributed multi-robot collection. They attempted to developcontrollers which were robust to noise and robot failures. Three versions of the collectiontask were presented and were evaluated in a spatio-temporal context using interference,time to completion, and distance traveled as the main diagnostic parameters.Balch and Arkin (1998) proposed a formation control approach for multi-robot teams.This is a behavior-based architecture using the motor scheme approach. In thisarchitecture, multiple behaviors can be activated simultaneously and combinedcooperatively. Four types of robot team formation were controlled and switched with theproposed architecture. The developed system employed explicit communication andworked in an environment with static obstacles. In addition, they developed a behavior-based control project of a non-holonomic robot for pushing tasks where they proposed tointegrate the planning approach with the behavior-based approach to cope with the localminimum problem (Emery and Balch, 2001).In their pioneering work, Konidaris and Hayes (2005) proposed to integrate thebehavior-based approach with Reinforcement Learning for multi-robot coordination.Furthermore, they suggested using topological maps to reduce the learning space inReinforcement Learning.10Camacho et al. (2006) developed a behavior-based architecture for coordinating robotsoccer agents. In their work, they assigned each robot a role for the task execution. A rule-based RECCA control algorithm was presented as well. Planning-based multi-robot architectureMiyata, et al. (2002) studied cooperative transportation by multiple mobile robots inunknown static environments. A centralized task planning and assignment architecturewas proposed in their work. Priority-based assignment algorithms and motion planningapproaches were employed. In their later paper (Yamashita et al., 2003), a motionplanning approach in a 3D environment was presented. Distributed-control-based multi-robot architectureWang et al. (2003a, 2003b, 2003c, 2004a, 2004b, 2005), Kume et al. (2002), Zaerpooraet al. (2005), Pereira et al. (2004), Sugar and Kumar (2002), and Chaimowicz et al. (2004)studied the caging formation control problem in multi-robot cooperative transportation.They proposed the "Object Closure" strategy to move the object with multiple mobilerobots while maintaining the formation. The key issue was to introduce a boundedmovable area for the object during its transportation and manipulation. In this case, thecontact between the object and the robots did not need to be maintained by the controllerof each robot. They termed it "Object Closure", which was employed as a type ofdistributed formation control strategy for multi-robot cooperative carrying tasks.Basically, the "Object Closure" approach is a sub-category of behavior-based multi-robot coordination. However, it places its emphasis on the distributed control features.Moreover, some other distributed-control strategies were employed by researchers toimplement task allocation among multiple robots or maintain formations of a robot team.For example, Dahl et al. proposed to allocate tasks among multiple robots throughvacancy chains (Dahl et al., 2003). They also demonstrated how Reinforcement Learningcan be used to make vacancy chains emerge in a group of behavior-based robots.Experiments showed that the proposed algorithm outperformed random or static strategies.On the other hand, Marshall et al. proposed another distributed-control strategy for multi-robot coordination, which was called "cyclic pursuit". They demonstrated that the "cyclic11pursuit" approach was surprisingly robust in the presence of un-modeled dynamics anddelays due to sensing and information processing. Market-based multi-robot architectureMulti-robot task allocation is a quite complex subject. It has been shown thatconventional task allocation approaches are not successful in this area, especially when thenumber of the robots is large. However, market-based approaches seem promising insolving the complex multi-robot task allocation problem, and they are becoming popularin the multi-robot domain. For example, Gerkey and Mataric (2002a, 2002b) developed anauction-based approach for multi-robot coordination. They used the publish/subscribefirst-price auction method and validated it in a multi-robot box-pushing task. A "watcher"robot observed and monitored the course of the box transportation, and "published" therequired actions for use by the "pusher" robots in the same environment. The "pusher"robots understood which kinds of tasks were available at this point through "subscribing"to the information from the "watcher" robot. By matching its own skills to the requiredtasks, each "pusher" robot bids for a task. When the "watcher" robot received all bidsfrom the "pusher" robots, it would select the most suitable one for each task and broadcastits decisions to the "pusher" robots. This auction-based approach was demonstrated to bequite successful in a multi-robot box-pushing task in a static environment withoutobstacles.A well-known work in this area was completed by Zlot and Stenz (2006). In this paper,they integrated the market-based approach with a hierarchical task tree for complex taskallocation in a multi-robot system. They demonstrated their approach in a cooperativereconnaissance task which required sensor coverage by a team of robots over a set ofdefined areas of interest. Learning-based multi-robot architectureMulti-robot systems usually work in a dynamic and unknown environment where thetraditional behavior-based robots can easily fail because it is very difficult to design asufficiently extensive behavioral rule base to cope with all possible world statesencountered by the robots. Therefore, the learning capabilities are important for multi-12robot systems that work in such an environment. Most existing work in multi-robotlearning employs Reinforcement Learning due to its simplicity and good real-timeperformance. For example, Mataric (1997) proposed a Reinforcement Learning approachfor a multi-robot foraging task. In this paper, she integrated the behavior-based approachwith Reinforcement Learning so as to condense the state/action space of ReinforcementLearning. She also designed a new credit assignment method called the social creditassignment, to speed up the convergence of the learning algorithm. Similar work can befound in the paper by Elahibakhsh et al. (2004) which proposed a distributed form closurestrategy with Reinforcement Learning for a multi-robot cooperative carrying task.Another interesting work in this area was completed by Ito and Gofuku (2004). Theyproposed a two-layer multi-robot architecture for a multi-robot cooperative transportationtask. In the top level of this architecture, they employed a centralized machine learningmethod for decision-making. For the lower level, a distributed rule-based control strategywas developed, to control the robot movement to reach a specified position and take aspecific action according to the commands from the upper level. In addition, theyintegrated Reinforcement Learning with Genetic Algorithms to reduce the learning space.1.3.2 Machine Learning in Multi-robot SystemsAs stated before, learning capabilities are desirable for multi-robot systems, especiallywhen they are required to work in a dynamic and unknown environment, and machinelearning technologies have been employed commonly for this purpose. Among thesemachine learning approaches, Reinforcement Learning and especially Q-learning, hasbeen quite popular due to its simplicity and good real-time performance.Parker et al. (2000a, 2002) studied multi-robot learning with regard to two aspects:Learning new behaviors and learning parameter adjustments. In order to make robots learnnew behaviors autonomously, they proposed two approaches. The first one, distributedpessimistic lazy Q-learning, combines lazy (or instance-based) learning with Q-learningand uses a Pessimistic Algorithm for dealing with the credit assignment problem. Thesecond approach, called Q-learning with VQQL(Vector Quantization with Q-learning) andGeneralized Lloyd Algorithm, addresses the generalization issue in reinforcement learning.The two approaches have been validated in a CMOMMT (Cooperative Multi-robot13Observation of Multiple Moving Targets) project. In the area of learning for parameteradjustment, the L-ALLIANCE multi-robot architecture enables robots to autonomouslyupdate their control parameters so that they can adapt their behavior over time in responseto changes in the capabilities of the robots, team composition, and the environment(Parker et al., 2000a, 2002; Fernandez and Borrajo, 2005).Kapetanakis and Kudenko (2002) have summarized machine learning for autonomousagents and multi-agent systems. They established that there were two kinds of multi-agentlearning: multiple single-agent learning and social multi-agent learning. In multiple single-agent learning, a learning agent regards other agents as a part of the environment and hasno explicit awareness of their existence, let alone their goals and beliefs. However, insocial multi-agent learning, agents can have a high awareness of other agents andincorporate this knowledge in the learning process, potentially using communication,coordination, and agent modeling techniques to support the learning task. In addition, theystudied the relationships among learning, communication and roles.Martinson and Arkin (2003) developed a multi-robot foraging task with Q-learning.They integrated Q-learning with the behavior-based approach. The learning space wasreduced by the behavior representation, and Q-learning was employed to dynamicallyselect the roles of robots. They compared the Q-learning strategy with a hand-craftedstrategy and the simulation results showed that the former outperformed the latter. Inanother work (Martinson et al., 2002), they employed Q-learning to select behaviors for asingle robot.Park et al. (2001) proposed a modular Q-learning based multi-agent cooperation forrobotic soccer. The basic idea was to use modular Q-learning to cope with the largelearning space problem. Ishiwaka et al. (2003) also developed an approach to the pursuitproblem in a heterogeneous multi-agent system using reinforcement learning. In theirpaper, they proposed the angle-based state representation in Q-learning and employedPOMDP (Partially Observable Markov Decision Processes) to construct predictions of thetarget position and its next moving direction. In addition, Q-values for each state andaction were updated via a radial basis function (RBF) due to the continuity of the sate-action space.14Dahl et al. (2004) developed a Q-learning based multi-robot cooperative transportationtask in a simple environment. They proposed a two-layer multi-robot architecture where abehavior-based reactive subsystem was placed in its lower level while a Q-learning basedbehavior-selection subsystem was placed in the upper level. They also introduced twocommunication strategies to speed up the convergence of Q-learning.Svinin et al. (2000) employed another type of reinforcement learning, ClassifierSystem, in a multi-robot cooperative lifting task. In their project, they considered the issueof continuous space in reinforcement learning. Kawakami et al. (2001) developed a multi-robot cooperative carrying project by combining two reinforcement learning algorithms.The first reinforcement learning algorithm was used to predict the states of other robotswhile the second reinforcement learning algorithm was used to select the action of thecurrent robot.Kovac et al. (2004) addressed a "pusher-watcher" multi-robot box-pushing problemwith reinforcement learning. In this project, one robot acted as a "watcher" whichobserved the current world state and broadcasted it to other robots. Other robots in theproject acted as "pushers" which selected respective pushing actions with Q-learning. Thisarchitecture had to rely on a fully robust communication network, which represents aweakness of the approach.Inoue et al. (2004) developed a reinforcement learning based multi-robot transportationproject in which they employed machine learning to correct the robot positions in thecourse of transportation. Moreover, in this project they compared Q-learning withClassifier System.Talor and Stone (2005) addressed an interesting issue in reinforcement learning: how totransfer the learned knowledge from one task to another one that has a different state-action space. The main benefit of this kind of transfer is the reduction of the training timein reinforcement learning. They used the specific domain knowledge to construct aknowledge mapping function between two tasks so that the knowledge transfer happenedeffectively.Tangamchit et al. (2003) addressed several crucial factors affecting decentralizedmulti-robot learning in an object manipulation task. They first gave a taxonomy of multi-robot systems with the following dimensions: individual architecture (reactive,15deliberative, hybrid), group architecture (centralized vs. decentralized), number of robots(small vs. large), diversity (homogenous vs. heterogeneous), learning type (decentralizedvs. centralized), learning reward (local reward vs. global reward) and learning algorithm(Q-learning vs. Monte Carlo). Then, they identified four factors affecting distributedmulti-robot learning, which were the reward type, the learning algorithm, diversity, andthe number of robots. They concluded that the reward type and the learning algorithmaffected the final results significantly, and diversity and the number of robots could affectthe learning speed but had little effect on the final results.In order to reduce the state-action space (learning space) of reinforcement learning inmulti-robot systems, Ferch and Zhang (2002) proposed a kind of graph representation ofthe state-action space. They demonstrated their approach in a cooperative assembly task.It is clear that Reinforcement Learning is rather popular in multi-robot coordination.However, the applications used in this context do not have strong theoretical basis. Theenvironment in a multi-robot task is usually dynamic because it is changed by the robotsthemselves. Simply extending the single-agent Reinforcement Learning algorithm to themulti-robot field violates its assumption of stationary environment (Yang and Gu, 2004).However, it is believed that this type of extension is feasible in a purely cooperative tasksuch as multi-robot transportation.Several multi-agent reinforcement algorithms have been developed in the multi-agentcommunity. They include MiniMax Q-learning algorithm (Littman, 1994), Nash Q-learning algorithm (Hu and Wellman, 1998), and Friend-or-Foe Q-learning algorithms(Littman, 2001b). In particular, for purely cooperative games, Littman (2001a) suggestedthe Team Q-learning algorithm, which is a simplified version of the Nash Q-learningalgorithm. All these algorithms assume a dynamic environment and allow each robot toobserve the actions of its peers before it makes decisions. Under some conditions, thesealgorithms are proved to converge to the optimal policy (Yang and Gu, 2004).The main disadvantage of the multi-agent reinforcement learning algorithms is thatthey result in a tricky and extensive learning space (state/action space) because each robotneeds to monitor not only its own actions but also the actions of its peers. When thenumber of robots increases, the resulting extensive learning space will make the algorithmless effective. However, when the number of robots increases, the single agent16Q-learning algorithm still can work because its learning space is fixed.Some researchers have tried to employ the neuro-fuzzy approach to coordinate multiplerobots. For example, Pham and Awadalla (2002) presented a neuro-fuzzy based actionselection architecture for cooperative robots to push a box in an obstacle-free environment.However, the neuro-fuzzy approach is not popular in the multi-robot community due tothe need of complex training.Another trend in multi-robot learning is to integrate Genetic Algorithms (GAs) withReinforcement Learning to reduce its large learning space. A representative work in thisarea is done by Ito and Gofuku (2004). They tried to use GAs to search an optimallearning space for Reinforcement Learning so that it can operate within a much smallerspace. Several papers discussed the benefits of integrating learning with evolution (Sen,1998; Nolfi and Floreano, 1999). However, research in this area is still in its infancy.There are many open questions yet to be explored.1.3.3 Computer Vision and Sensing Technologies for Multi-robot SystemsIn a multi-robot system, it is important for a robot to know the latest poses(positions/orientations) of other robots and potential obstacles in the environment so as tomake rational decisions. There are many methods to detect the poses of objects or peerrobots in the environment; for example, using global/geographical positioning systems(GPS), sonar, laser distance finders, wireless beacons, and digital CCD (charge-coupleddevice) cameras. Today, the CCD camera, sonar and laser distance finder have become thetypical sensors included in a standard configuration of mobile robots. As a result, there hasbeen a need in the robotics community to develop sensing algorithms to recognize,identify and estimate features or poses of objects in a robotic work environment. Forexample, Stone and Veloso (1998) studied a multi-robot soccer system. In their work, theyused a global camera to monitor the positions of the robots and objects in the game. Rochaet al. (2005) developed a cooperative multi-robot system to build 3-D maps of anunknown environment using the vision-based approach. Kato et al. (2003) proposed amethod for identifying a robot and obtaining its relative position in a multi-robot systemusing an omnidirectional vision sensor. They validated their approach using the simulationand experiment results. Similar work has been done by Hajjdiab et al. (2004) where a17vision-based approach for multi-robot Simultaneous Localization and Mapping (SLAM)was proposed. They proposed to calculate the locations of the robots using a collection ofsparse views of the planar surfaces on which the robots were moving on. The cameramotions were estimated using inter-image homographes computed by matching theoverhead transformed views.Simultaneous tracking of multiple moving objects with vision sensors is an importanttopic in road traffic control systems. For example, Veeraraghavan et al. (2003) developeda computer vision algorithm to track the vehicle motion at a traffic intersection. Amultilevel approach using a Kalman filter was presented by them for tracking the vehiclesand pedestrians at an intersection. The approach combined low-level image-based blobtracking with high-level Kalman filtering for position and shape estimation. Maurin et al.(2005) presented a vision-based system for monitoring crowded urban scenes. Theirapproach combined an effective detection scheme based on optical flow and backgroundremoval that could monitor many moving objects simultaneously. Kalman filteringintegrated with statistical methods were used in their approach. Chen et al. (2005)presented a framework for spatiotemporal vehicle tracking using unsupervised learning-based segmentation and object tracking. In their work, an adaptive background learningand subtraction method was applied to two real-life traffic video sequences in order toobtain accurate spatiotemporal information on vehicle objects. Rabic et al. (2005)developed a mobile bus-mounted machine vision system for transit and traffic monitoringin urban corridors. They used a new computer vision approach: the active vision paradigm,which has mechanisms that can actively control camera parameters such as orientation,focus, and zoom in response to requirements of the task and external stimuli. Siyal (2004)proposed a novel neural network and window-based image processing technique for roadtraffic applications. In particular, they used morphological edge detection techniques todetect vehicles. Once the vehicles were detected, a back-propagation (BP) neural networkwas used for calculating various traffic parameters. Deng et al. (2005) developed anadaptive urban traffic signal control system using vision sensors. They integrated andperformed vision-based methodologies that included object segmentation, classificationand tracking methods to determine the real-time measurements in urban roads.The laser distance finder and the CCD camera are two standard sensing devices whichare commonly used in mobile robots today. The laser distance finder can accurately detectobjects within a radius of 50 meters. It can scan the surrounding environment and return adistance measurement for each degree in a 180 degree view. Alternatively, the CCDcamera can return rich vision information instead of the distance information. It can detect18the shape, color and surface features of the objects of interest. Therefore, it is a naturalway for researchers to combine the two sensors for object recognition and localization.For example, Alenya et al. (2005) used laser and vision to locate a robot in an industrialenvironment. They combined the precision of laser-based local positioning with theflexibility of vision-based robot motion estimation. Their proposed approach wasvalidated in the warehouse of a beer production factory. Tomono (2005) employed amobile robot equipped with a laser distance finder and a monocular camera to build a 3-Denvironmental model. In this framework, they first built a 2-D map by scanning objects inthe environment with a laser sensor, and then verified the detected candidates by usingvision-based object recognition.Takahashi and Ghosh (2005) integrated a laser range finder sensor with a vision sensorto identify the parameters of a moving object. They showed that this approach couldreduce the dimension of the parameter ambiguity. Murarka et al. (2006) presented a hybridmethod using laser range-finders and vision for building local 2D maps to help anintelligent robotic wheelchair for distinguishing between safe and hazardous regions in itsimmediate environment. In this project, laser range-finders were used for localization andmapping of the obstacles in the 2D laser plane while vision was used for detection of thehazards and other obstacles in the 3D space. The information on the hazard and theobstacle was then combined to construct a local 2D safety map.Frontoni and Zingaretti (2005) compared the sonar-based Monte Carlo Localizationapproach with the vision-based localization approach in terms of localization accuracy,and found that both approaches bore some weaknesses in an environment with highperceptual aliasing. However, it was shown that the overall accuracy was improved whenthe approaches were combined. Asensio et al. (1999) developed a goal directed reactiverobot navigation system with relocation using laser and vision. In their project, the robotfirst selected the initial goal location for the navigation task, and laser was used toaccomplish reactive navigation while avoiding obstacles. An extended Kalman filter wasused to solve the data association problem. Ortin et al. (2004) presented a method forsolving the localization problem using 2D laser and vision. A 2D laser scan together withits corresponding image was obtained and it was segmented into textured vertical planeswhich allowed safe plane recognition. Gopalakrishnan et al. (2005) also developed amobile robot navigation project which integrated laser and vision. The laser-basedlocalization, vision-based object detection, and route-based navigation techniques for amobile robot were combined.191.4 Contributions and Organization of the ThesisThis thesis researches and develops new techniques and expertise which will helpmulti-robot systems operate in a robust and effective manner in dynamic and unknownenvironments. There are four main contributions in the thesis, as follows: A fully distributed hierarchical multi-robot architecture is developed, whichintegrates the techniques of machine learning, path planning, behavior-basedparadigm and low-level control. This architecture enables multi-robot systems toplan their sub-tasks dynamically, learn and adapt to new unknown environments,promote cooperation among robots, detect any local minimum problems, switchdecision-making methods online, and complete the desired tasks in robust andeffective manner. A modified reinforcement learning algorithm termed the Sequential Q-learningwith Kalman Filtering (SQKF), is developed, which is designed to deal with somecrucial issues in multi-robot learning, such as credit assignment, reducing thelearning space, and Markov assumption. The proposed SQKF algorithm iscompared with the conventional single-agent Q-learning algorithm and Team Q-learning algorithm through computer simulation, and it is shown that the SQKFalgorithm outperforms the conventional Q-learning algorithms. A fast computer vision algorithm is developed to track multiple objectssimultaneously, and applied to a multi-robot object transportation task. Thealgorithm is validated using physical experimentation. A physical multi-robot transportation project, which integrates the developedtechniques, is developed in-house. This multi-robot system works in a dynamicand unknown real-world environment and shows adaptivity, effectiveness, andoverall good performance.The organization of this thesis is as follows. In Chapter 1, the background of multi-robot systems and the research objectives of the thesis are introduced. Then theresearch problem of the thesis is defined and a literature survey is carried out toestablish the related background work. In Chapter 2, a learning/planning basedhierarchical multi-robot architecture is presented. Various techniques included in thisarchitecture, such as the behavior-based approach, the machine learning and planning20techniques, are addressed. Their relationships are also studied. Finally, severalpertinent issues of distributed control are discussed. In Chapter 3, two conventionalreinforcement learning algorithms--the single-agent Q-learning and the Team Q-learning--are extended to the multi-robot domain to facilitate decision making forrobots in a dynamic and unknown environment. The two algorithms are assessedthrough a multi-robot transportation project. The simulation results are analyzed, andthe advantages and disadvantages of the two algorithms are assessed. In Chapter 4, amodified reinforcement learning algorithm termed the Sequential Q-Learning withKalman Filtering (SQKF), which is suitable for multi-robot decision-making, isproposed and developed. Simulation results are presented to show that the modifiedalgorithm outperforms the conventional Q-learning algorithms in multi-robotcooperative tasks. In Chapter 5, a fast color-blob tracking algorithm is developed tosupport robots to track multiple objects of interest in an environment with unevenillumination. The algorithm is validated in a multi-robot object transportation project.In Chapter 6, a physical multi-robot cooperative transportation system operating in adynamic and unknown environment is developed and presented. A JAVA RMIdistributed computing model is introduced to allow robots to communicate andcooperate effectively. Multiple experimental results are presented to demonstrate theperformance of the developed system, particularly concerning adaptivity androbustness. Chapter 7 summarizes the primary contributions of the thesis and indicatessome relevant issues for future research.21Chapter 2Learning/Planning Based Hierarchical Multi-robotArchitecture2.1 OverviewA multi-robot architecture represents the organization structure of the key componentsand their function descriptions in a multi-robot system. It has two primary functions: topromote cooperation or coordination among robots; and to enable a member robot to makerational decisions and effectively complete the corresponding tasks.As presented in section 1.3.1, there are five main types of multi-robot architectures asgiven in the literature. They are the behavior-based architecture, the planning-basedarchitecture, the distributed-control based architecture, the market-based architecture andthe learning-based architecture. Especially, the behavior-based architecture, profiting fromits simplicity and robustness, is the most popular one and has enjoyed many successfulapplications in industrial and academic fields (Parker, 1998, 2000b; Huntsberger et al.,2003; Schenker et al., 2003; Balch and Arkin, 1998). If the designer can foresee allpossible world states the robots will encounter, designs the corresponding behavior rulesand carefully adjust their preference levels, a behavior-based multi-robot system will workvery well in a known and structured environment.However, most multi-robot systems need to work in dynamic and unknownenvironments; for example, planet surfaces or deep ocean floors, where they face severalchallenges. First, the behavior-based approach requires a human designer to design allbehavior rules and determine their preferences for the individual robots in advance.Because it is almost impossible for a designer to foresee all possible world states orsituations that the robots will encounter and design the corresponding behavior rules forthem, a behavior-based robot can easily fail in a dynamic and unknown environment witha large number of states. Second, since for a dynamic and unstructured environmentusually there will be a large number of possible "world" states, the designer will be forcedto design an extensive rule base to deal with all these states. However, it is very difficult22and sometimes impossible to arrange the preference for each behavior rule in an extensiverule base due to the "combination explosion" problem. It is clear then that a purelybehavior-based multi-robot system is doomed to be weak and not robust in a dynamic andunknown environment.The second multi-robot architecture is the planning-based architecture whichemploys planning techniques of traditional Artificial Intelligence (AI) to find optimaldecisions for the robots (Miyata et al., 2002). There are two types of planning-based multi-robot architectures: the centralized planning architecture and the distributed planningarchitecture. In a centralized planning architecture, one robot will plan actions and assignsub-tasks for all other robots in the team. Because task allocation among multiple robots isa NP (Non-deterministic Polynomial time) problem in the computational complexitytheory, the main disadvantage of this approach is its computational complexity. When thenumber of robots is large, it is very difficult and even impossible to find an analyticalsolution. The only appropriate method might be to employ an AI-based optimizationtechnique to approximate it. In addition, the fragility and communication bottleneck of acentralized system will make it weak. On the other hand, the distributed planning multi-robot architecture attempts to enable each robot to plan its actions independently, andreach a global optimal decision by merging the individual decisions of all robots. Due toits difficulty, as reported in the literature, there have few successful examples in this field.Furthermore, both centralized-planning architecture and distributed-planning architecturewill easily fail in a dynamic and unknown environment because they cannot adjust thepolicies (action sequences) of robots quickly to adapt to a fast changing environment.The third type of multi-robot architecture is based on distributed control strategies(Pereira et. al, 2004; Sugar and Kumar, 2002; Wang et al., 2003a, 2003b, 2003c, 2004a,2004b, 2005). They employed a distributed control strategy such as "Form Closure,""Object Closure," or "Leader-Follower Control" to allocate sub-tasks for robots so as tocontrol the team formation. The main disadvantage of this type of multi-robot architectureis that it is only suitable for a special multi-robot task in the class of Multi-robotCooperative Transportation, and it is not a general architecture. In particular, the "ObjectClosure" approach cannot be extended to other multi-robot systems such as multi-robot23cooperative foraging and multi-robot cooperative observation. In addition, this type ofmulti-robot architecture also suffers from the computational complexity problem.The fourth type of multi-robot architecture is the market-based architecture (Zlot andStenz, 2006; Gerkey and Mataric, 2002b). Rooted in the multi-agent coordination theory,it is a new and promising approach for coordinating multiple robots in a complexenvironment. However, there are still many open questions in this area, and its mainchallenges are related to computational complexity and real-time performance.The fifth type of multi-robot architecture is the machine learning-based architecture.Rooted in iteration and dynamic programming techniques, machine learning technologiesapproximate the optimal polices of the robots and attempt to avoid the computationalcomplexity problem which affect other multi-robot architectures. In particular, due to itssimplicity and good real-time performance, reinforcement learning, especially the Q-learning algorithm, has been used commonly (Ferch and Zhang, 2002; Ito and Gofuku,2004; Martinson and Arkin, 2003; Mataric, 1997; Stone and Veloso, 1998; Yang and Gu,2004; Liu and Wu, 2001). In a survey paper, Arai et al. (2002) have suggested that area oflearning-based multi-robot system was one of two emerging trends in the multi-robotdomain.As discussed in Chapter 1, the learning capability is crucial for multi-robot systems towork in a dynamic and unknown environment. It will help robots to identify new worldstates and learn how to adapt to them. Based on the experiences of past success and failurein a dynamic environment, the robots with a learning capability can continuously adjusttheir action policies and improve the performance, thereby completing the required tasksin a quicker and more robust manner.It has been shown, however, that simply extending learning algorithms that aredesigned for a single robot to a multi-robot system is not appropriate (Yang and Gu, 2004).In particular, our own work has shown that extending the single-agent Q-learningalgorithm to the multi-robot domain violated its assumption of a stationery environment,and made the learning algorithm not converge to its optimal policy. In addition, a purelylearning-based multi-robot system cab be easily trapped into a local minimum (Wang andde Silva, 2006a, 2006b).24From these observations it is clear that the problem of multi-robot decision andcontrol is a challenging one. No single approach can solve all the problems associatedwith it. Accordingly, the combination of several existing mainstream techniques may benecessary to build a new multi-robot architecture that can support multi-robot cooperativeoperations in a dynamic and unknown environment, effectively and in a robust manner. Inthis chapter, a new hierarchical multi-robot architecture is proposed, which integratesmachine learning, planning, and behavior-based approaches. In section 2.2, a preliminaryintroduction of this general architecture is given. Its detailed implementation is presentedin the subsequent sections. Finally, some distributed control issues in multi-robot systemsare discussed. The multi-robot architecture developed in this chapter will be validated withcomputer simulations and physical experiments in the subsequent chapters.2.2 General ArchitectureIn order to support multi-robot systems that work in a dynamic and unknownenvironment, a hierarchical multi-robot architecture that offers improved robustness andreal-time performance is developed in this chapter. This architecture, presented in Figure2.1, will integrate machine learning, planning and behavior-based approaches for roboticdecision-making.25rocal PlanningUnit [Sensors) (Actuators)CommunicationHardwareAHigh-level TaskDecomposition andAssignmentBehavior-basedDistributed Control711 Modeling Low-level Force/PositionControl Model BaseHardware LevelRobot j4-- Low-level ControlmeLeamingUnitArbitratorComposite Behaviors Group Behaviors,• 'Communication Behay . orI  ^   Primitive BehaviorsFigure 2.1: A general and hierarchical multi-robot architecture.This is a fully distributed multi-robot architecture which attempts to integrate severalpopular artificial intelligence (AI) techniques for decision-making in a dynamic,unstructured, and complex environment. By organizing different techniques in ahierarchical manner, this architecture provides the robots with flexible and robustcapabilities of decision-makings to adapt to the external environment dynamically. It hasfour layers which are: High-level Task Decomposition and Assignment, Middle-levelBehavior-based Control, Low-level Robot Force/Position Control, and Hardware Level.In its top level, there is a learning unit and a planning unit, which are used toimplement inter-robot coordination. In particular, a specific robot will collect informationfrom its own sensors and from the sensors of the peer robots through the communicationnetwork, determine the current world state, and decide its actions with its learning-basedor planning-based units to reach the goal of the robot team. Therefore, this level will carryout inter-robot task decompositions and assignment.As noted before, when making decisions, each robot may use a learning-based unit ora planning-based unit. An arbitrator will monitor the current world state and decide which26unit will be used to make decisions for the robot. For example, when the robots work in adynamic environment, the learning-based unit is usually selected by the arbitrator todecide actions for the robot. However, once the arbitrator realizes that the whole multi-robot system is trapped in a local minimum, it will turn off the learning unit temporarilyand switch on the planning unit to plan actions for the robots in a local environment.When the robots escape the local minimum, the arbitrator will return the decision-makingcapability to the learning unit. Through this type of switch mechanism, the multi-robotarchitecture benefits from both learning and planning, and can flexibly adapt to variousenvironments.In the middle level of the architecture, a popular behavior-based approach is employedto implement robust control of robot actions. As stated in Section 2.1, the behavior-basedapproach is known to be effective in multi-robot cooperative control. Here, it is employedto implement sub-tasks as decided by the top level. When the top level of the architectureestablishes the current sub-tasks for the robots, they will be sent down to the behavior-based control level so as to be decomposed in further detail of robot behavior. There arefour types of behavior: group behaviors, composite behaviors, communication behaviorsand primitive behaviors. The group behaviors implement coordination among robots; forexample, "Follow the leader robot," and they are usually fired by the learning unit or theplanning unit in the top level of the robot architecture. The communication behaviors areresponsible for communication among robots. Composite behaviors control the actions ofindividual robots; for example, "Avoid obstacle," and they can be decomposed further intoprimitive behaviors such as "Turn left" or "Go straight." Through the design ofhierarchical behaviors and by carefully arranging their preferences, the robots cancomplete the required tasks as defined by the top level, in a effective and robust manner.The third level of the architecture is the low-level control layer. In this layer, variouslow-level "hard" controllers, such as proportional-integral-derivative (PID) controllers orforce controllers, can be designed to accurately control the actions of the robots. Therequired values of the controllers are established by receiving commands from thebehavior-based control layer. In addition, a local modeling unit may be designed to modellocal interactions among the robots or between one robot and the manipulated object (or,work environment) based on the online position/force information of the robots.27The lowest level of the architecture is the hardware level. It includes sensors,actuators, communication hardware, video cameras of robots. They constitute thehardware platform of the multi-robot architecture.The first two levels of the architecture implement "soft" control of a robot while thethird level implements "hard" control. In particular, the top level is employed todecompose and assign tasks among robots; the behavior-based control layer decomposesthe actions of a single robot further into more detailed primitive behaviors; and the thirdlevel controls motions and forces of the robot with "hard" control laws. By combiningestablished mainstream approaches in multi-robot coordination, such as learning, planningand behavior-based methods, this multi-robot architecture provides more flexiblity andcooperating power in a dynamic, unknown and unstructured environment than thosepresented in literature.The proposed multi-robot architecture will be validated through computer simulationand experimentation, as presented in the following chapters.2.3 Machine Learning UnitA machine learning unit is designed in the proposed multi-robot architecture and isplaced in its top layer. The objective of this unit is to learn optimal decision-makingpolicies in a given environment through continuously observing actions of the robots andthe corresponding environmental rewards. The principle of the machine learning unit ispresented in Figure 2.2.Figure 2.2: The principle of the machine learning unit.28In Figure 2.2, each robot gains experience by continuously observing the currentenvironmental state, selecting and taking an action, and getting a reward from theenvironment. Simultaneously, it also needs to observe actions of its peer robots and assesstheir effects on the environment. Based on the past experience with the work environment,each robot employs a machine learning technique, as developed in the thesis, tocontinuously improve its performance and approximate its optimal decision-making policy.In particular, with the help of the machine learning unit, the robot dynamically adjuststhe mapping relationship between the world (work environmental) states and its actions.Therefore, it will outperform a robot that may use a conventional behavior-based approachwhose behavior rules and rule preferences are fixed.In this thesis, Q-learning will be employed and enhanced to learn and improve themapping relationship between the world states and the robot actions. Here, the mainchallenge is how to extend the single-agent Q-leaming algorithm to the multi-robotdomain. Chapter 3 and Chapter 4 will discuss this issue in detail.2.4 Planning Unit2.4.1 Planning and Learning in Multi-robot ApplicationsA multi-robot architecture integrating the behavior-based approach with machinelearning works well in a complex environment even when the environment is dynamic andunknown. However, in our previous research (Wang and de Silva, 2006a), it was shownthat with this type of architecture, sometimes the multi-robot system may be trapped in alocal minimum, as exemplified in Figure 2.3.29Figure 2.3: An example of the local minimum phenomenon in a purely Q-learning-basedmulti-robot cooperative box-pushing task.Figure 2.3 shows a multi-robot cooperative box-pushing task which employs thedistributed Q-learning algorithm. This project will be described in detail in the nextchapter. In the present section, it is used just to demonstrate why planning capabilities arestill necessary for multi-robot tasks. In Figure 2.3, there are three autonomous robotswhich attempt to push a rectangle box cooperatively from its starting position to a goallocation. Many challenges exist. For example, there are obstacles distributed randomly inthe environment and each robot possesses local sensing capabilities only, which means arobot does not know about an obstacle before it moves sufficiently close to the obstacle.Another challenge is that the box is very long and heavy so that no single robot will beable to handle it alone. The robots have to find optimal cooperative pushing strategies tomove the box to the goal location while avoiding collision with any obstacles in theenvironment. In our previous work (Wang and de Silva, 2006a, 2006b), a Q-learningbased distributed multi-robot architecture was proposed to complete the task. While it wassuccessful in most of trials, some failure examples like the one shown in Figure 2.3 wereobserved.30In Figure 2.3, it is noted that the multi-robot system enters a local minimum area andthe box is stuck there. Even though there was a possible trajectory for the robots to movethe box out of the local minimum area, the learning-based robots could not find it. Thisexample shows that a purely learning-based multi-robot system sometimes fails in acomplex environment, due to the local minimum problem. In other words, a multi-robotsystem with learning capability alone is not strong enough to deal with all possibleenvironmental states.However, if the multi-robot system in Figure 2.3 possesses some local path planningcapabilities, it will be able to get out of the local minimum easily because of the presenceof at least one path for the robots to avoid the obstacles and move the box to the goallocation.It follows that, both learning and planning are important for a multi-robot system tosuccessfully complete cooperative tasks in a dynamic and unknown environment in thepresence of complex obstacle distribution. The learning capability will help robots toadapt to a fast changing environment and improve their performance continuously whilethe planning capability will overcome shortcomings of a purely learning-based decision-making system and make it escape any local minimum.In the top level of the proposed multi-robot architecture shown in Figure 2.1, a localplanning unit is integrated with a machine learning unit to decompose tasks among therobots. The so-called "local planning" means that planning is executed in a local world orlocal environment instead of a global one. In Section 2.1, it is stated that a global planningarchitecture suffers from shortcomings such as computational complexity andcommunication bottleneck. However, a local planning unit does not possess theseproblems or these problems are not very serious because it only needs to deal with localdecision-making issues in a part of the environment. The cost of a local planning unit is itis not strong enough to make all decisions necessary for the robots to complete a task. Ithas to combine with other AI techniques, such as machine learning and the behavior-basedparadigm, to constitute a more powerful and flexible decision-making unit for multi-robotsystems.312.4.2 Local Path Planning for Multi-robot Cooperative TransportationAs presented in Section 1.1, one of the objectives of this thesis is to develop a multi-robot cooperative transportation system which works in a dynamic and unknownenvironment with complex obstacle distribution. In order to reach this goal, a local pathplanning unit is designed. This planning unit will be activated when the learning-baseddecision-making unit is trapped at a local minimum, and it will temporarily plan localmoving paths of the robots so as to escape the local minimum. Once the robots are awayfrom the local minimum, the decision-making right will be returned to the learning-basedunit from the planning unit.The local planning unit is a centralized planning unit. In other words, when a localminimum is detected, all learning units of robots will stop working and transfer theirdecision rights to the planning unit of a leader robot whose role is designated in the designstage. Then, the leader robot will merge the sensor information of all robots and establishthe current local world state. Based on the local world state, it will employ a potential-function-based path planning algorithm to plan the moving paths for all robots in thesystem.Although the centralized planning approach is criticized in Section 2.1, it is acceptablein the present multi-robot local planning unit. The first reason for this is thatcomputational complexity is not high here because only a small part of the environment isconsidered when planning the moving paths for the robots. The second reason is that thelocal planning unit is seldom activated when a multi-robot system is working. It isactivated only when a local minimum is detected. Once the local minimum is escaped, itwill transfer the decision-making right to the machine learning unit in the top level of thearchitecture. Therefore, the local planning unit only works for a very short period of timeand during most of time the machine learning unit determines the action policy of a robot.In this thesis, instead of the conventional attractive/repulsive potential functionapproach, the navigation potential function approach developed by Rimon and Koditschek(1992) is employed to plan the moving paths of multiple robots. The advantage of thisapproach is that it has only one minimum so that the planner will not be trapped at a localminimum. The navigation potential function approach is presented below.32ObstaclesGoal Location •This approach initially assumes that the local environment is bounded by a circlecentered at (x0, yo) and has n spherical obstacles centered at (x1,3 21),(x2 2), '" , (xn,yn),as presented in Figure 2.4. The objective of the planning algorithm is to find a path from(X start , Y start ) to (xgoal ,ypa/  ) while avoiding all obstacles in the local environment.. Figure 2.4: The bounded local environment in navigation potential function planning.In order to set up the potential function of the local environment, the distance functionsat a point (x, y) inside the local environment are defined asN0 (x, Y) -(x - x0)2 - (Y - Y0)2 + /02^ (2.1)fl,(x,y)=(x-x,) 2 +(y- yi) 2 - ri2^1,2,'   ',n^ (2.2)where ro is the radius of the local environment with the center at (x0 ,y0 ), and r, is theradius of the^obstacle in the local environment. Instead of considering the distance tothe closest obstacle or the distance to each individual obstacle, the sum of the distancefunctions is used as the repulsive function of this potential field approach; specifically,n18(x, Y) = EA(x,y)=0In addition, the attractive function is defined as(2.3)33Yk (x , y) = ((x — xgoal ) 2 + (y — ygoa, )) k^ (2.4)where k is a constant. Finally, the total potential function in the local environment isdefined as —93(x, y) ( TIC ( X, Y) ^)Ilk = ^(x XgOal )2 + (y YgOal )2 P(X' y) rk ( X, y)^(((x - xgoa, )2 +(Y ygoai )2 )k fi(x, y))1 /IC(2.5)It is shown that the potential function q)(x, y) has a unique minimum in the localenvironment (Rimon and Koditschek, 1992) if k is sufficiently large. Therefore it isdesirable to employ a gradient descent algorithm on the potential field to find the localgoal location (the unique minimum) . The gradient descent algorithm is presented below.Algorithm 2.1 Gradient Descent Input: A means to compute the gradient V co(x,y) at a point q(x, y)Output: A sequence of points fq(xo , yo ),q(xl , yi ), • • • ,q(xi , yi )}1: q(xo , v ) a(^v )- 0,^start,' start,2: i=03: While IV ca(xi , y )1> E do4: q(x^= q(xi , y) + a(i)V co(xi , yi )5: i=i+16: end WhileIn algorithm 2.1, e is chosen to be a sufficiently small positive number based on thetask requirement. In addition, the value of the scalar a(i) determines the step size at theid, step. It is important to select a good a(i) because a small value will result in long andunnecessary computation and a big value will cause the robots to "jump" into theobstacles. Usually a(i) is selected based on empirical results. For example, it isdetermined according to the distance to the nearest obstacle or to the goal.342.5 Integration of Learning with PlanningAs discussed in Section 2.1, multi-robot planning faces challenges such ascomputational complexity and communication bottleneck. In particular, a planning-basedmulti-robot system can easily fail in a dynamic and unknown environment because it doesnot have the knowledge of the global environment and cannot adjust its planned policysufficiently fast to adapt to a fast changing environment. Therefore, learning capabilitiesare crucial for a multi-robot system to work robustly in a dynamic environment. However,as stated in Section 2.4.1, our previous research shows that a purely learning-based multi-robot system can be easily trapped at a local minimum in a complex environment. Itindicates that both learning and planning capabilities are very important for multi-robotsystems when carrying out tasks in a dynamic and unknown environment.Accordingly, in Figure 2.1, an arbitrator is designed in the top layer of the proposedarchitecture to select either the learning unit or the local planning unit for multi-robotdecision-making. When a multi-robot system begins its work, the arbitrator selects themachine learning unit as its decision-making unit to determine its action policy, which isthen communicated to its lower level: the behavior-based distributed control layer. Byobserving the world states, the robotic actions and the rewards received when the multi-robot system is in operation, the machine learning unit gains knowledge of cooperationamong the robots and improves its performance continuously. At the same time, thearbitrator continuously monitors the progress of the multi-robot task and determineswhether it is trapped at a local minimum. If the arbitrator finds that the multi-robot systemis trapped at a local minimum, it will temporarily close the machine learning unit of thecurrent robot and transfer its decision making right to a leader robot which is designated inadvance. After all robots have transferred their decision-making rights to the leader robot,a centralized planning architecture is formed temporarily. Through fusing sensorinformation of all robots, a local environment is estimated by the leader robot. Based onthis local environment, the leader robot plans the actions of the robots by employing thenavigation-potential-function approach which is presented in Section 2.4.2. All the robotstake actions planned by the leader robot until they escape the local minimum. When therobots are away from the local minimum, the arbitrator of each robot will return thedecision-making right to its machine learning unit. The temporarily centralized planning35Group Behaviors- -•^411.wOther Behaviors-„/ 0 0Composite Behaviors••0" •^•^•Primitive Behaviorsarchitecture is then dismissed and the multi-robot system returns to a learning-baseddistributed decision-making system.In addition, a heuristic approach is used to recognize a local minimum forimplementing the planning/learning switch mechanism, as outlined above. In particular, ifn(n 3) successive q(xi , yi ) are found to lie within a small region of the localenvironment, it is possible that a local minimum exists. For example, if for a smallpositive cm , Iq(xi, yi )—q(xi+1,yi+l )I«m , lq(x,, y)— q(x,+2 , yi+2 )1< em andq(x„ y,) — q(xi+3 , yi„)1< em are detected, then it is concluded that q(xi, yi ) is close to alocal minimum.2.6 Behavior-Based Distributed ControlIn the proposed multi-robot architecture as presented in Figure 2.1, a behavior-baseddistributed control layer is designed to control the local actions of a robot. The behaviorcontrol layer is situated between the high-level task decomposition and assignment layerand the low-level "hard" control layer. Actions selected by the high-level taskdecomposition and assignment layer are usually quite abstract and cannot be directly sentto the low-level "hard" control layer for execution. Therefore, these high-level abstractactions need to be decomposed into more detailed actions in the layer of behavior-baseddistributed control.Here, a hierarchical behavior control architecture, which is based on the CAMPOUTarchitecture developed at the NASA Jet Propulsion Laboratory or JPL (Huntsberger et al.,2003), is developed to control the robot behavior in a robust manner. The general idea ofthe behavior control layer is presented in Figure 2.5.Figure 2.5: The hierarchical behavior-based control layer.36The so-called "Behavior-based systems" are systems that use behaviors as basicbuilding blocks of robotic actions. Here, a behavior is a reaction to a stimulus. Therefore,in some literature, behavior systems are also called reactive systems. In such a system, theuse of an explicit abstract knowledge representation is usually avoided. Actions of robotsare coupled tightly with their perceptions without a knowledge reasoning process or timehistory. The ideas of behavior-based systems are derived from animal models, which havebeen proved to be quite successful in AI research.There are several advantages of behavior-based systems over planning-based systems.First, a behavior-based system is simple and easy to design. By tightly couplingperceptions with actions and avoiding complex time-consuming reasoning and planning, itis much easier to design a behavior-based system than a traditional AI system withplanning and reasoning. In addition, it is shown that planning is often "useless" in a fastchanging environment because it cannot adapt to it. Second, the behavior-based approachhas shown good performance of robustness in simulated and physical robotic systems. Ifthe designer can foresee all possible world states and arrange the corresponding behaviorrules to deal with them, a behavior-based system can be quite successful even in adynamic and unknown environment. Third, through the interaction of existing behaviors,some "emergent behaviors" or new behaviors generated by uncertainties in theenvironment will occur automatically, which can surprise the designer.2.6.1 Behavior RepresentationHere, behaviors are expressed using Stimulus-Response (SR) diagrams. Figure 2.6shows a SR diagram.Figure 2.6: A SR diagram of a simple behavior.In Figure 2.6, a behavior can be expressed as a 3-tuple (S,R,fl) where S denotes theset of all possible stimuli, R represents the range of the response, and fi denotes themapping function between S and R.37An individual stimulus s (s E S) is coded as a binary tuple (p,2) where p indicates thetype of stimulus and A denotes its strength. In addition, there is a threshold value r foreach stimulus. If the stimulus strength is bigger than r , a response will be generated.Further, for mobile robots, a response r can be expressed with a six-dimensional vectorconsisting of six subcomponents. Each subcomponent encodes the magnitude of thetranslational and orientational responses for each of the six degrees of freedom of motion.r=[x,y,z,0,0,p], where r E R^ (2.6)For ground-based mobile robots, the response vector can be further simplified as:r =[x,y,0]^ (2.7)where x and y denote the position of the mass center of the robot, and 9 represents itsheading.For a particular behavior 16 , a gain value g is to be defined to modify the magnitude ofits response further. Finally, a behavior is expressed asr'=g*r=g*fl(s)^ (2.8)2.6.2 Behavior CompositionIn Figure 2.5, simple primitive behaviors can be combined to build more abstractcomposite behaviors. Composite behaviors are behavior packages on which a behavior-based robot system is built. Each composite behavior includes a behavior coordinationoperator and a number of behavioral components (primitive behaviors or other compositebehaviors).The power of composite behaviors results from abstraction of behaviors in which high-level behaviors can be built on some simple behaviors, and they can be referred to withoutneeding to know its behavioral components.A formal expression of a composite behavior is given byp = C(G * R) = C(G * B(S))^ (2.9)where38Sl gi 0   ^ 0R 2 , S = S2 , G = 0 g2  . 0 , and B = P2_rn _ sn _ 0   ^ 0 grn AiIn equation (2.9), B represents the n behaviors (fli ,f32 ,• • •, 16,) which constitute thecurrent composite behavior. Also, S denotes a vector of all detectable stimuli relevant foreach behavior A at time t ; R denotes a vector of responses of n behaviors(flp P2 ,       , fin ) at time t; G denotes a gain matrix which modifies the response matrix R;and C denotes the behavior coordination operator of this composite behavior.There are two types of behavior coordination operators: the competitive operator andthe cooperative operator. Figure 2.7 shows the SR diagram of a composite behavior with acompetitive operator.siBehavior #1Stimuli pResponseFigure 2.7: The diagram of a composite behavior with a competitive operator.In Figure 2.7, because multiple behaviors may be activated simultaneously and theirresponses may conflict with each other, a competitive operator is designed to select thewinner of the behaviors, and only the winner's response can be directed to the robot forexecution. There are several competitive operators available in literature, such assubsumption-based arbitration, arbitration via action-selection, voting-based arbitration,and so on.Another type of behavior coordination operator is the cooperative operator, which isshown in Figure 2.8.39Stimuli2.8: The diagram of a composite behavior with a cooperative operator.In Figure 2.8, the outputs of multiple behaviors are fused and directed to the robot forexecution. It indicates that multiple behaviors are executed by the robot at a time if theydo not conflict with each other. The most straightforward cooperative operator isperformed through vector addition, which simply adds the outputs of multiple behaviors.The key to a cooperative operator is to design the behaviors so that their responses can befused. The most popular approach is the potential field based method which definesbehavior outputs as positions and orientations of the robot, which can be added easily.2.6.3 Group CoordinationIn order to manage and solve conflicts among robots and make them contribute tocommon tasks, the activities of the robots have to be carefully coordinated. Groupcoordination is implemented in the proposed multi-robot architecture through the machinelearning unit or the planning unit which is located in its top level. The learning or planningunit continuously decomposes sub-tasks among multiple robots and assigns them to eachrobot. When a robot receives its sub-task, it will decompose the sub-task further into moredetailed composite behaviors in its behavior-based distributed control level and finallyconvert them into primitive behaviors for execution.2.6.4 Communication BehaviorsIn order to coordinate the activities of the robots, communication behaviors aredesigned to exchange sensing information, the objectives, and internal states of the robots.40f, StopAssumeFFind ObjectIt'Move, to alocationgrtIRRITITM81?Sub-task- ------Search ColorMea.surem t........Laser^ ObstacleBlobs voidanceCarry^ DismissThere are two types of communication: explicit communication and implicitcommunication. Explicit communication concerns sending or receiving information viaexpress data links; for example, a computer network. However, a robot also can indirectlyexchange information with its teammates by interacting with the environment or usingsome sensing approaches. For instance, it can estimate its relative distance from anotherrobot by using its CCD camera or laser distance finder.2.6.5 Coordinated Object TransportationIn this thesis, a multi-robot coordinated object transportation system is developed. Inorder to facilitate a robot to complete its sub-tasks as assigned by the learning/planningunits in the top level of the proposed multi-robot architecture, a behavior-based distributedcontrol subsystem which implements Figure 2.5 is developed, as shown in Figure 2.9.^/ Image ss‘^Turn in^f' Move^Move ss ,' DetectAnalysis / 1^,/Place--___-- Forward,/ ' , ,Backward, ,)^Obstacles--____-- Legend: PrimitiveBehavior ,/ CompositeBehaviorFigure 2.9: Behavioral hierarchy describing the coordinated transport sub-task.41In Figure 2.9, the group behavior of "Assume Formation" is to make the robot move toa designated site around the transported object so as to establish a formation with its peers.The learning/planning unit in the top level of the proposed multi-robot architecturedetermines the transportation formation of the robots, decomposes it into the specificposes and sites of the robots, and sends downward these sub-tasks to the behavior-controllayer for execution. When the "Assume Formation" behavior is activated, it will firstactivate its child behavior of "Find Object" which attempts to search the transported objectand estimates its location and orientation in the environment by suing the CCD cameraand the laser distance finder. Accordingly, the "Find Object" behavior will call the"Search Color Blobs" behavior to make the robot turn in place continuously and try to findsome predefined color blobs which are attached to the surface of the object of interest, byusing its CCD camera, so that it can be distinguished from other objects or obstacles in theenvironment. When the robot finds the object of interest, the behavior of "LaserMeasurement" is activated to estimate the position and orientation of the object with thelaser distance finder sensor. Then, the behavior of "Move to a Location" will attempt tomove the robot to the designated position as determined by its higher levellearning/planning unit. As a robot moves to its goal location, the "Move" behavior and the"Obstacle Avoidance" behavior are activated in turn, depending on the obstacledistribution of the environment.After all robots reach their designated positions and generate a formation around theobject of interest, the "Signal" behavior is waked to start a synchronization process whichnotifies all robots to activate the "Carry" behavior so that the object is transported. Whilethe robots transport the object, they also monitor whether the object collides with anyobstacles in the environment. If the object does collide with an obstacle in the course oftransportation, the robots will cancel the "Carry" behavior and report this accident to theirlearning/planning units in the top level and ask them to re-plan their formation.In this thesis, Finite State Diagrams (FSA) is used to describe the aggregations andsequences of the behaviors so that the process of behavior activation and transitionbecomes more explicit. A finite state diagram can be specified by the quadruple(Q, 5, q0 , F) with Q representing all behavioral states; qo representing the initial state;F representing all final (termination) states; and g representing the mapping function42Not at thedesigna^positionThe formationis establishedOtherNot at thegoal positionwhich maps the current behavioral state q and input a into a new behavioral state q' , i.e.,q = g(q,a). The FSA describing the transitions of the group behaviors in the project ofcoordinated object transportation is shown in Figure 2.10.Figure 2.10: A FSA describing the transitions of the group behaviors in the project ofcoordinated object transportation.2.7 Low-level ControlThe third-level of the proposed multi-robot architecture presented in Figure 2.1 is thelow-level "hard" control layer. This layer receives commands such as "Move Forward","Move Backward" or "Turn in Places for a Certain Degree" from its higher level of"behavior-based control," and exactly controls the physical motion of the robot.Basically, there are two types of robots employed in multi-robot systems. They arerobotic manipulators or mobile robots whose models and corresponding control strategiesare quite different. Both types are employed in the present research.2.7.1 Robotic ManipulatorsA robotic manipulator is a robot with multiple links connected with some revolute orprismatic joints which are driven by motors (typically DC motors). The robot is mountedon a fixed base in the environment and the end effector at the other end of the robot can43move in a limited workspace by controlling the rotational angles of its joint motors. Overthe past thirty years modeling and control of robotic manipulators has become a matureresearch field. For example, the dynamic model of a robotic manipulator with n joints canbe presented as (Spong et al., 2006)^D(q) q+ C(q, q) q+ g(q) = r^ (2.10)whereq = (q,, q2 , • • • , q n ) T is a vector of generalized coordinates of n jointsand^D(q) =[ifm iJ^J^+ J^Ri(q)liRi(q)T J ,,,,, (q)}] is the n x n inertiamatrix of the manipulator. The (k, j)th element of the Coriolis matrix C(q, q) is defined asx--, 1 { adk; adk. adu •Cki =^ qi2 aq,^ aqkand the gravity vector g(q) is given byg(q) ^(0,g2 (0,.. .,gn (ofIn most cases, the dynamic model given by (2.10) is not used due to its nonlinear andcoupled nature, and the independent joint control approach which employs multipleproportional-derivative (PD) controllers to control each joint independently is adequate tomeet the control requirements. However, sometimes, nonlinear control approaches basedon the dynamic model of (2.10), such as reverse dynamics control, are employed tocontrol the motion of robotic manipulators more accurately, at the expense of increasedcomputational effort.2.7.2 Mobile RobotsMobile robots are another class of robots which are used commonly in multi-robotsystems. Control issues of mobile robots are quite different from those of fixed roboticmanipulators. Figure 2.11 shows a control problem of a mobile robot with two standardwheels (Siegwart and Nourbakhsh, 2004).441^CO^7 ' CO21 +2^2rco —rw21 ± ^2/^2/ _Figure 2.11 Control of a mobile robot.In Figure 2.11, the axes X l and Y1 define a global reference frame from the origin0 : {X,, Y, } . The coordinate axes {XR ,YR } define the robot's local reference frame with itsorigin at point P centered between the two drive wheels, and each wheel is located at adistance 1 from P. The angular difference between the global and local reference frames isdenoted by 0 . The diameter of each wheel is r, and the spinning speeds of the two wheelare co, and (02 . The linear and angular velocities of the robot are v and w, respectively.Given 0 , 1 ,xv and w , thecos 0 0-kinematic model of the mobile robot is presented as5)O= sin 0 00^1 [ vwl-(2.11)(2.12)where .i and j) are the linear velocities in the directions of X1 and Y, , respectively, of theglobal reference frame.45Given an arbitrary position and orientation of the robot in the global reference frame,the control objective is to drive the robot to a predefined goal position and orientation bycontrolling the spinning speed of the wheels of the robot. The control block diagram isshown in Figure 2.12.xrob°,Mobile RobotFigure 2.12: The block diagram of motion control of a mobile robot.Here [x, y, O]g oai and [x,y,t9] rTobot give position and orientation of the goal and the robot,respectively, in the global reference frame. The task of the controller layout is to find asuitable control gainK =[k"k21such that the control of v(t) and w(t)ki2k22 k23(2.13)Ho]Lw(t)] = K • e= K(xyxy (2.14)- 8 - goal robotwill drive the error e toward zero:lim e(t) = 0^ (2.15)A control law is proposed in (Siegwart and Nourbakhsh, 2004), which may bepresented asv k p^ (2.16)W = kaa (2.17)where46Ax xgoal Xrobot , and , Ay = Y goal — Y robot^ (2.18)p _ /Ax2 Ay2^ (2.19)a = —0 + a tan 2(Ay, Ax) (2.20)fl = —0 — a^ (2.21)When the following conditions are met:kp >0,kfl <0,and ka —kp > 0^ (2.22)the closed loop control system in Figure 2.12 will be locally exponentially stable(Siegwart and Nourbakhsh, 2004).2.8 Distributed Control IssuesThe multi-robot architecture proposed in this chapter is a distributed controlarchitecture. It means that there does not exist a leader robot to make decisions for allother robots in the team. Instead, each robot needs to decide its own actions independentlyby observing the environment or communicating and negotiating with its teammates. Theonly exception is the local path-planning unit which needs centralized decision-makingwhen the learning and behavior-based multi-robot units are found to being trapped in alocal minimum. However, the centralized planning unit works only for a very short periodof time. Once the multi-robot system leaves the local minimum, the learning and behavior-based units will takeover the control of the robot again.Compared with a centralized control architecture, a distributed architecture has manyadvantages, especially for multi-robot systems. The most beneficial feature of a distributedsystem is its scalability. For a distributed multi-robot system, it is very easy to scale therobot team without significantly increasing its computational cost. For example, the multi-robot architecture proposed in this chapter can accommodate a robot team with 3 to 20robots, and the team size can change dynamically if some robot members quit or newmembers enroll in the team in the course of transportation. However, for a centralizedmulti-robot system, scalability is a very difficult issue because the computationalcomplexity of the centralized decision-making algorithm increases rapidly with thenumber of robots. In most of cases, centralized decision-making is an N-P complexity47problem requiring superpolynomial time, which makes the algorithm useless in a realmulti-robot system.Another advantage of a distributed system is its low communication volume. Becausethere is no centralized leader, the communication bottleneck problem will not appear.Instead, in a centralized multi-robot system, when the number of robots increases, thecommunication bottleneck will be so serious that the algorithm becomes worthless. Inaddition, through carefully designing its local behavior rules, a distributed system cancomplete some cooperative tasks even without communication. It is always desirable for adistributed multi-robot system to work with a limited communication capability becausemost multi-robot systems need to run in an environment where communication amongrobots is usually unreliable; for example, on a planetary surface or the bottom of a deepocean. When communication among robots is not reliable, while a distributed multi-robotsystem may still work well, a multi-robot system with centralized decision-makingbecomes weak because the robot members cannot receive commands from their leader in atimely and accurate manner.The third advantage of a distributed system is its robustness. Decision-makingcapabilities are spread across the robot members. When one robot in the team fails or quits,other robots still can make decisions for themselves so as to continue their tasks. However,a centralized system is not a robust one. In such a system, because one leader robot makesdecisions for all other robots in the team, once the leader robot fails, the whole system willtotally fail.A main disadvantage of a distributed system is its low efficiency, where multiplerobots reach their agreements through some negotiation technique such as voting, auction,and so on. These negotiation approaches are usually time-consuming and not efficient. Inaddition, the distributed planning and learning algorithms are much more difficult andcomplex than their centralized counterparts. Consequently, compared to a centralizedsystem, the final policy reached by the members in a distributed system is usually notoptimal. Therefore, it is still an open question as to how to implement efficient andoptimal distributed decision-making, especially distributed planning and distributedlearning, for multi-robot systems.482.9 SummaryIn this chapter, a distributed multi-robot architecture, which accommodates distributedmachine learning, local path planning, behavior-based distributed control and low level"hard" control, was developed. The objective of the architecture is to support multi-robotsystems for carrying out cooperative tasks in a dynamic and unknown environment in aneffective and robust manner.In the beginning of the chapter, several popular multi-robot architectures given in theliterature were introduced and compared, and their advantages and disadvantages werehighlighted. Basically, there are five architectures that have been employed by the existingmulti-robot systems; namely, the behavior-based architecture, the planning-basedarchitecture, the distributed control based architecture, the market-based architecture, andthe learning-based architecture. Although each architecture possesses some advantages insolving the multi-robot cooperative problem, all of them are generally weak in supportinga multi-robot system that operates in a dynamic and unknown environment.In order to facilitate multi-robot systems when operating in a dynamic and unknownenvironment, which is quite common for multi-robot tasks, a distributed hierarchicalmulti-robot architecture was developed in this chapter. This architecture is a generalplatform that consolidates several existing techniques of multi-robot decision-making, andaccommodates homogenous or heterogeneous robots in the same framework.Next, the top level of the proposed architecture was introduced, which includes amachine learning unit, a local path planning unit and an arbitrator. By monitoring whetherthe multi-robot system is trapped at a local minimum, the arbitrator dynamically selectseither the learning-based or the planning-based approach to make decisions for the robots.In particular, the potential-field-based path planning algorithm was introduced.The second level of the proposed multi-robot architecture constitutes a behavior-basedcontrol layer which further decomposes the behaviors of the individual robots. Thebehavior-based approach has been proved to be robust and effective in the literature. Inthis thesis, it is combined with the learning and planning units in the top level of thearchitecture to control the actions of a single robot.Below the behavior-based control layer there exists the low-level "hard" control layer.The low-level controller in this layer receives its commands from the "primitive49behaviors" in the upper level, and precisely controls the motion or force of a robot. Thedynamic or kinematic models of two types of representative robots--robotic manipulatorsand mobile robots--were presented. In particular, a motion control law for a wheeledmobile robot was introduced.Finally, some distributed control issues of multi-robot systems were discussed. In theinception of multi-robot systems, centralized architectures were used commonly due totheir simplicity. However, centralized architectures caused many problems when the robotteam size became large. The advantages of distributed architectures were pointed out andsome open questions in this field were identified.The proposed multi-robot architecture provides a framework and guideline fordesigning physical multi-robot systems. Its machine-learning based multi-robot decision-making will be presented in more details in Chapter 3 and Chapter 4, and the simulationand experimental results will be given in Chapter 6, which will validate this new multi-robot architecture.50Chapter 3Machine Learning for Multi-robot Decision-Making3.1 OverviewMulti-robot decision-making has been a challenging problem because the actions ofmultiple robots and their interactions have to be considered simultaneously so as to reacha global optimal policy. Intuitively, planning is the suitable approach to deal with thisproblem. However, it is shown that planning is usually not feasible if the environmentchanges fast (Arkin, 1998) because a planned optimal policy can become inapplicable dueto changes of the environment.Multi-robot environments, such as planet surfaces and urban environments are typicaldynamic and unstructured environments, where the terrain features can changecontinuously. In addition, even though the physical environment is stationary, from theviewpoint of a single robot, it is still a dynamic one because the other robots in the teamtake actions in the work environment and change it continuously.There are two solutions to the challenge of the dynamic environment: the behavior-based approach and the machine learning approach. The behavior-based approach asmentioned in Chapter 2, is known to be more robust than the planning approach in adynamic environment. By identifying the current world state and tightly coupling it with aspecific action, behavior-based systems can quickly adapt to a fast changing environmentand show good robustness. However, although the behavior-based approach is quitesuccessful in simple environments, it is shown to be weak in a complex environment witha large number of states. In particular, it is almost impossible for a designer to considerthousands or ten-thousands of world states and arrange the corresponding behavior rulesand preferences for those rules. Therefore, the behavior-based approach is sometimesmocked as "the insect-level intelligence" which only can deal with a simple world.By observing the human capabilities to deal with a dynamic environment, it is notdifficult to draw a conclusion that a human employs not only planning or reactive(behavior-based) techniques but also learning techniques to successfully complete tasks in51a dynamic environment. Through learning, a human learns new world states, findscorresponding optimal actions from the past experience, and improves his planning andreactive techniques. In other words, learning makes a human deal with unexpected worldstates, decreases uncertainties in the environment, and makes his decisions more robustand effective in a dynamic environment.Accordingly, machine learning has become an important topic in the roboticscommunity. Especially in the multi-robot field, machine learning is regarded as as a newand important trend due to the highly dynamic character of typical multi-robotenvironments.Among the existing machine learning approaches, Reinforcement Learning (RL),especially Q-learning, is used most commonly in multi-robot systems due to its simplicityand good real-time performance. Few multi-robot systems employ neural network orneuro-fuzzy approaches because they need complex training and longer computationaltime.In this chapter, machine learning is adopted to make decisions for a multi-robot objecttransportation task. In section 3.2, Markov Decisions Processes (MDP) and the single-agent Q-learning algorithm are introduced. The single agent Q-learning algorithm willconverge to an optimal policy in an MDP problem without a transition model, and it willbe integrated with Genetic Algorithms (GA) to constitute the centralized decision-makingunit in a multi-robot transportation task, as presented in section 3.3. In section 3.4, adistributed version of Q-learning based multi-robot box-pushing is presented and it isvalidated through computer simulation. In section 3.5, another Q-learning algorithm moresuitable for multi-robot tasks, the Team Q-learning algorithm, is applied to multi-robotdecision making, and it is compared with the single agent Q-learning algorithm, againusing computer simulation.3.2 Markov Decision Process (MDP) and Reinforcement LearningMarkov Decision Process (MDP) is a popular topic in the artificial intelligence (AI)community. Not only is it important in the theoretical research of AI, it also has somepractical applications such as facilitating an intelligent agent to make decisions in a real-world situation. The basic idea of MDP is presented below.52MDP can be defined by the 4-tuple <S, A,7', R, 13 > , where• S = {s1 , s2 ,^, is a set of states of the environment• A= {ai ,a2 ,...,am } , is a set of actions available to the robot• T:SxA—> n(s), is a transition function, which decides the next environmentalstate s ' when the robot selects an action a, under the current state s . n(s) is theprobability distribution over the states.• R : S x A --->^is a reward function, which determines the immediate rewardwhen the robot takes an action a, under the current state s .MDP describes a sequence of decisions made by an agent or a robot in an environmentwith many states. At time t, the agent needs to observe the environment and determine itscurrent state s E S , and then select an action a E A based on its policy if which specifieswhat the agent should do for any state that it might reach. In addition, there is a rewardfunction which determines the immediate reward when the agent takes an action a underthe current state s .There are two things which make MDP interesting and challenging. The first one isthat the subsequent state s' is not deterministic when the agent takes an action a under thecurrent environmental state s . Instead, it has a probability distribution n(s) over all thestates. This probability distribution is defined by a transition function or transition modelT(s,a,^) . The second attribute of MDP is that the transitions among states areMarkovian; that is, the probability of reaching s' from s depends only on s and not on thehistory of earlier states.The performance of an agent policy 7z - is measured by the rewards the agent receivedwhen it made a sequence of decisions according to this policy and visited a sequence ofstates. This measurement is usually represented by a sum of discounted rewards, as givenbyf=0 fil(s, ) I R- ]^(3.1)where, 0 < fl 1 is the discount factor, and R(s1 ) is the reward received when the agentvisits the state st at time t . Because the transitions among states are not deterministic in53view of the probablistic nature of the transition function T(s,a,s'), given a policy, thesequence of states visited by the agent each time is not fixed and has a probabilitydistribution. Therefore, an optimal policy 7r* is a policy that yields the highest expectedsum of the discounted rewards, which is given byrc* =argmaxE[Ifl tR(si )Iffi^ (3.2)t=0Given an MDP problem, one crucial consideration is how to find all optimal policies (ifthere exist more than one optimal policy). The value iteration algorithm has beendeveloped to find the solution to this problem (Russel and Norvig, 2003).In the value iteration algorithm, first the concept of "utility of a state" is defined asU(s)=E[Ifi tR(st )1 7-1-* , S 0 = Sit =0(3.3)From equation (3.3), the utility of a state s is given by the expected sum of discountedrewards when the agent executes the optimal policy.If we have the utilities of all states, an agent's decision making (or action selection)will become easy; specifically, to choose the action that maximizes the expected utility ofthe subsequent state:(s) = arg max E T (s , a , )U (s') (3.4)aHowever, the problem is how to find the utilities of all states. It is not an easy task.Bellman found that the utility of a state could be divided into two parts: the immediatereward for that state and the expected discounted utility of the next state, assuming that theagent chooses the optimal action.U(s) = R(s) + 13 max yT(s,a,s')U(s) (3.5)a ,Equation (3.5) is the famous Bellman equation. For each state, there is a correspondingBellman equation. If there are n states totally, then we can have n Bellman equations. Theutilities of the n states are found by solving the n Bellman equations. However, theBellman equation is a nonlinear equation because it includes a "max" operator. Theanalytical solutions cannot be found using techniques of linear algebra. The only way tofind the solutions of n Bellman equations is to employ some iterative techniques. The54value iteration algorithm is such an approach to find the utilities of all states. Thisalgorithm is presented below (Russel and Norvig, 2003).Function Value-Iteration (mdp, e) returns a utility functioninputs: mdp, an MDP with state S, transition model T, reward function R,discount fi , and the maximum error s allowed in the utility of anystate.local variables: U ,U ' are the vectors of the utilities for the states in S ,initially zero. 8 is the maximum change in the utility of any state in aniteration.repeatU <— U ' ; 5 4— 0for each state s in S doU'(s) R(s) + p max E T (s , a , s ')U (s ')a^s,If I U '(s) — U(s)1> 8 then 8 <— IU'(s) — U(s)1until 5 < 8(1— 13)1 13Figure 3.1: The value iteration algorithm to calculate the utilities.3.2.1 Reinforcement LearningIn section 3.2, the value iteration algorithm is introduced to solve MDP problems. Itappears that MDP is no longer a difficult issue. However, in a real robot decision-makingsituation, the value iteration algorithm is not practical because the environment modelT (s, a, s') and the reward function R(s) are usually unknown. In other words, it is usuallyimpossible for one to obtain perfect information of the environment model and employ thevalue iteration algorithm to solve the Markov Decision Process (MDP) problem.Reinforcement Learning is developed to meet this challenge.In reinforcement learning, through trials of taking actions under different states in theenvironment, the agent observes a sequence of state transitions and rewards received,which can be used to estimate the environment model and approximate the utilities of thestates. Therefore, reinforcement learning is a type of model-free learning, which is itsmost important advantage.There are several variants of reinforcement learning among which Q-learning is themost popular one. A typical Q-learning algorithm is presented below.55 For each state si E (s1 , S2 ,• • •,s,i ) and action a./ E (a1 ,a2 ,• • •table entry Q(si ,a,) to zero.Initialize z- to 0.9. Initialize the discount factor 0 < firate 0< ^1. Observe the current state s Do repeatedly the following:. Probabilistically select an action ak with probabilityeQ(s,ak )Ir, and execute itx-■ eQ(s,a1 )1r1=1 Receive the immediate reward r Observe the new state s'▪ Update the table entry for Q(s, ak ) as follows:▪ Q(s,ak) = (1 - ri)Q(s,ak)+ ri(r + flmaxQ[s ,a1)P(ak ) =,am), initialize the1 and the learninga■ z-^r*0.999Figure 3.2: The single-agent Q-learning algorithm.At the initiation of the Q-learning algorithm, an empty Q-table is set up and all itsentries are initialized to zero. The Q-table is a 2D table whose rows represent theenvironment states and whose columns represent the actions available to the agent (therobot). Here, the value of a Q-table entry, Q(si ,a,), represents how desirable it is to takethe action a./ under the state s1 , while the utility U(s1 ) in section 2.2 represents howappropriate for the agent to be in the state Parameters such as the learning rate ri , thediscount factor /3 and the " temperature " parameter r , have to be initialized as well.During operation, the agent observes the environment, determines the current state s ,probabilistically selects an action ak with probability given by (3.6) and executes it.eQ(s,ak)11-P(ak) = m^(3.6)zeQ(s,a1)Ir1=156After the agent takes the action a k , it will receive a reward r from the environment andobserve the new environment s' . Based on the information of r and s' , the agent willupdate its Q-table according toQ(s ,a k ) = (1 — 77)Q(s ,a k ) + ii(r + ft. max Q[s' ,a1)^ (3.7)aIn this manner the current environment state is transited from s to s' . Based on thenew state, the above operations are repeated until the values of the Q-table entriesconverge.In the Q-learning algorithm described in Figure 3.2, a s — greedy search policypresented in equation (3.6) is used instead of a greedy policy which always selects theaction with the maximum Q value. In a e — greedy search policy, with probability 6. , theagent chooses one action uniformly randomly among all possible actions, and withprobability 1— 6. , the agent chooses the action with a high Q value under the current state.In addition, the probability s is decreased gradually. The advantage of the s. — greedysearch policy is its balance in exploring unknown states against exploiting known stateswhen the agent attempts to learn its decision-making skills and improve its Q-table. Inequation (3.6), the probability s is controlled through decreasing the " temperature "parameter r .3.3 Multi-robot Transportation Using Machine Learning: First VersionAn important research topic in multi-agent robotics is multi-robot object transportation.There, several autonomous robots move cooperatively to transport an object to a goallocation and orientation in a static or dynamic environment, possibly avoiding fixed orremovable obstacles. It is a rather challenging task. For example, in the transportationprocess, each robot needs to sense any change in the environment, the positions of theobstacles, and the other robots in the neighborhood. Then it needs to communicate withthe peers, exchange the sensing information, discuss the cooperation strategy, suggest theobstacle avoidance strategy, plan the moving path, assign or receive the subtasks and roles,and coordinate the actions so as to move the object quickly and successfully to the goallocation. Arguably, the success of this task will require the use of a variety of technologiesfrom different fields. As a result, the task of multi-robot transportation is a good57benchmark to assess the effectiveness of a multi-agent architecture, cooperation strategy,sensory fusion, path planning, robot modeling, and force control. Furthermore, the taskitself has many practical applications in fields such as space exploration, intelligentmanufacturing, deep sea salvage, dealing with accidents in nuclear power plants, androbotic warfare.In this section, a physical multi-robot system is developed, integrated with machinelearning and evolutionary computing, for carrying out object transportation in an unknownenvironment with simple obstacle distribution. In the multi-agent architecture,evolutionary machine learning is incorporated, enabling the system to operate in a robust,flexible, and autonomous manner. The performance of the developed system is evaluatedthrough computer simulation and laboratory experimentation.As explained in section 3.1, a learning capability is desirable for a cooperative multi-robot system. It will help the robots to cope with a dynamic or unknown environment, findthe optimal cooperation strategy, and make the entire system increasingly flexible andautonomous. Although most of the existing commercial multi-robot systems are controlledremotely by a human, the autonomous performance will be desirable for the nextgeneration of robotic systems. Without a learning capability, it will be quite difficult for arobotic system to become fully autonomous. This provides the motivation for theintroduction of machine-learning technologies into a multi-robot system.The primary objective of the work presented here is to develop a physical multi-robotsystem, where a group of intelligent robots work cooperatively to transport an object to agoal location and orientation in an unknown and dynamic environment. A schematicrepresentation of a preliminary version of the developed system is shown in Figure 1.1 andit is repeated in Figure 3.3.58Digital CameraRobotic AnnFixedObstacles Goal Location A/IForce SensorsMovableObstaclesr —1SonarEthernet NetworkIIVisionAgentRobotAssistantAgent #1CameraLearning/Evolution Agent--------RobotAssistantAgent #2High-levelCoordinationPhysicalAgent #2Low-levelControlPhysicalAgent #1Figure 3.3: The developed multi-robot system.3.3.1 Multi-agent InfrastructureThe multi-agent architecture as outlined before is used here as the infrastructure toimplement cooperative activities between the robots. This infrastructure is schematicallyshown in Figure 3.4.Figure 3.4: The multi-agent architecture used in the developed system.In Figure 3.4 shows four software agents and two physical agents in the developedarchitecture, forming the overall multi-agent system. Each agent possesses its own internal59state (intention and belief) and is equipped with independent sensing and decision-makingcapabilities. They also are able to communicate with each other and exchange informationon their environment as acquired through sensors, and the intentions and actions of thepeers. Based on the information from their own sensors and their internal states, the agentscooperatively determine a cooperation strategy to transport the object.In Figure 3.4, the four software agents constitute a high-level coordination subsystem.They will cooperate and coordinate with each other to generate cooperation strategies, andassign subtasks to the two robot assistant agents. In the meantime, the two physical agents(robots) form a low-level subsystem of control and execution, which receives commandsfrom the upper-level subsystem.There is a vision agent, which is in charge of acquiring and processing the imagesusing a CCD camera. Because the present section focuses on developing a learning-basedmulti-robot architecture, a simplified computer vision scheme is employed, which uses aglobal camera and a global localization technology. The vision agent will analyze theacquired images and compute the exact locations and orientations of the robots, the objectand the obstacles. All the information is then broadcasted to the other three softwareagents so that they will immediately know the current world state (the position andorientation messages of the robots, the object and the obstacles; and the environmentalmap information).A special learning/evolution agent is used in the architecture to play the combined roleof a learner, a monitor and an advisor. The machine-learning algorithm and the decision-making capabilities are embedded in this agent. First, it will collect from other agents theinformation of the environment, the robots, the object and the obstacles, to form thecurrent world state, which can be shared by its peers. Then it will make decisions on theoptimal cooperation strategy based on its own knowledge base so that the common taskcan be completed effectively and quickly. Finally, it will assign the roles of other agents,and monitor the task schedule.Each physical agent (robot) is linked to a corresponding assistant agent, which needs toforward the robot position/force information to the other agents in the upper level, and toreceive the desired force/trajectory of the robot from the learning/evolution agent and sendthem to the corresponding robot in the low-level subsystem. All six agents will form a60tightly-coupled multi-agent system to transport the object cooperatively to the goallocation while avoiding obstacles.3.3.2 Cooperation Based on Machine LearningLearning ability is a desirable feature for a multi-robot transportation system. It willhelp the robots to cope with dynamic changes in an unknown environment. Althoughconventional planning algorithms can make the robots move effectively in a staticenvironment, they usually fail in an environment with dynamic obstacles. A multi-robotsystem with learning capability possesses characteristics of adaptation and self-organization, which will enable it to overcome the difficulties in an unknown environment.In this section, two well-known machine learning approaches, Reinforcement Learning(RL) and Genetic Algorithms (GAs), are integrated into the learning/evolution agent toachieve robust and effective cooperation between robots. Reinforcement learningA key problem in a multi-robot transportation system is the determination of thecooperation strategy; i.e., finding the optimal amplitudes and locations of the appliedforces of the robots so that a net force with maximum amplitude will point toward the goallocation, while avoiding any obstacles. Reinforcement Learning (RL) provides a way todetermine an appropriate cooperation strategy through experimentation with theenvironment. A method of reinforcement learning known as the Q-learning algorithm, asgiven in Figure 3.2, is employed here to find the optimal action policy for each robot.In the system developed in this section, the world state is defined as the relativeposition and orientation between the object and the goal location, and the positions of theobstacles within a fixed detection radius around the object. It is shown in Figure 3.5.61Obstacles0y Goal Locationi^0,^...0^'%, ............... ..........0^1^ 0DetectionRadiuso^.4% %CZ:2)^Object^0 00Figure 3.5: The world state representation.In Figure 3.5, only the obstacles located within the detection radius around the objectare detected and considered. The detection circle is evenly divided into 16 sectors and thesectors are represented by a 16-bit binary code. If one obstacle is located in one of thesectors, the corresponding bit of the binary code will become "1." Otherwise it willbecome "0." Accordingly, this binary code represents the obstacle distribution aroundthe object. The world state s in the Q-learning algorithm is made up of this binary codeand the angel 9 in Figure 3.5, which describes the relative orientation between the goallocation and the object. For example, the world state shown in Figure 3.5 can berepresented by the vector / 000 1  0 0 0 0 0 / 1 0 0 0 0].The cooperation strategies; i.e., the force locations and amplitudes for all robots, whichare shown in Figure 3.6, form the set of actions in the Q-learning algorithm.62Robot #iSGoal LocationFigure 3.6: Representation of the cooperation strategy (action).In Figure 3.6, the application point of the force from each robot is represented by thepair of angles (a, 13). Therefore, possible actions in the Q-learning algorithm for the robotR, can be modeled as the set A i = {(a,fl,F) k }, k =1,2,...,m, .There remains a special action called "sweeping" which instructs the robots to removeall the obstacles in the path after establishing that they are capable of carrying out thissweeping task.The reward r is calculated asr = D1 — Df^(3.8)where, D, and^are the distance between the mass center of the object and the goallocation, at times t, and ti_1 , respectively. Genetic AlgorithmsAnother approach that is used here to find the optimal cooperation strategy is GeneticAlgorithms (GAs). By simulating the biological evolution process, GA provides apowerful way to quickly search and determine the optimal individual in a large candidatespace.A typical Genetic Algorithm as found in (Karray and de Silva, 2004; Mitchell, 1997) isgiven below.63 Initialize p (the number of individuals in the population), ti (the fraction ofthe population to be replaced by crossover at each step), m (the mutationrate), and fitness threshold. Generate p individuals at random to produce the first generation ofpopulation P Calculate the fitness(i) for each i (individual) in P While (max fitness(i)) < fitness _ threshold , then repeat:Produce a new generation Ps:1. Probabilistically select (1— ii)p members of P to be included in P. Theselection probability is given by: 2(i) ^Fitness(i) IPj _ I Fitness(j)2. Execute the Crossover operation to generate (rip)12 pairs of offspringand add them to P.3. Mutate m percent of the numbers of Ps4. Update P5. Calculate Fitness(i) for each i in P Return the individual with the highest fitness in PFigure 3.7: The Genetic Algorithm.In this section, the entire candidate space is formed by the "sweeping" action and(I) = (a„ f31 ,F1 ,• • • ,ai „ 60 Fi ,• • • an , Ign ,F„) described in Figure 3.6, which is the set of allpossible cooperation strategies. This candidate space can be represented by a vector space913n+1 which is made up of all possible actions: ((, sweeping). The Genetic Algorithmgiven in Figure 3.7 is used to search for the optimal individual vector; i.e., the optimalrobot cooperation strategy, in this large vector space.In the beginning, the "sweeping" action and nine individuals randomly selected fromthe candidate space are used to constitute the first generation of population. Then the64Genetic Algorithm outlined before is used to search for the optimal individual with themaximum fitness value. The fitness is calculated as follows:if it is a "sweeping" action and someC removable obstacles are in the path.Fitness= 0 if a "sweeping" action withoutremovable obstacles in the path.M otherwise(3.9)M= ki *IFI+k,*(1+cos0) 2 +k3 1111+k4S+k5 1(1+cosi-)2^(3.10)k,+k,+k,+k4 +k,where, C is a threshold value which is determined through trials, F is the net force appliedby the robots, 0 is the angle between the net force vector and the goal location vectorwhich is described in Figure 3.5, F is the net torque applied by the robots, S is the area ofthe polygon formed by the robot locations, r is the angle between the net force vector andthe position vector of its closest obstacle, and k i to k5 are the weights(ki + k2 + k3 + + k5 =1).The fitness function includes several practical considerations. First, the "sweeping"action is always intended for use in the current world state. If a removable obstacle ispresent in the forward path of the object (it is assumed that, based on its size, color, shapeand so on, the agents are able to judge whether the obstacle can be moved away by therobots), the "sweeping" action will assume a high fitness value so that the obstacle canbe removed immediately. In order for it to always have the opportunity to be selected, the"sweeping" action is included in the population of the first generation and it is alwayscopied to the next generation in the evolution process.Second, a large net force that points to the goal location will earn a high fitness valueso that the internal force becomes minimum and the object is transported to the goallocation quickly and effectively. Third, a high net torque F is not encouraged, because theresulting unnecessary rotation of the object will make it difficult for the robots to handle it,while a small F may be encouraged for avoiding the obstacles. Fourth, a large S impliesthat the spacing of the robots is quite adequate, which will receive a high fitness value,because crowding of the robots will make their operations more difficult and less effective.65Finally, a small z- means that the probability of the presence of an obstacle in the path ishigh, and this condition should be punished. The weights k — k5 are calibrated throughexperimentation.In each step of object transportation, the GA is used to search for the optimalcooperation strategy. Then this strategy is broadcasted to the two robot assistant agents bythe learning/evolution agent. Next, the robot assistant agents forward the commands to thephysical robots. Based on their internal hybrid force/position controllers, the robots thenmove themselves to the desired positions and transport the object for a specified timeperiod. In the meantime, the current force and position information is sent back to thehigh-level coordination subsystem via the robot assistant agents for decision-making in thenext step. After the object is moved through some distance, the world state would bechanged (For example, new obstacles are observed and so on). In this manner, the GA isused again to search for the new optimal cooperation strategy under the new world state.This process will be repeated until the object is successfully moved to the goal location.The main problem of GA in a physical robotic system is that its real-time performancecannot be guaranteed because of its low convergence speed. However, in the present thesis,this problem is solved partly by using the layered architecture shown in Figure 3.4. In thistwo-layer multi-agent architecture, the low-level physical agents can continue to executethe old commands before the new commands are given by the agents in the upper level. Inother words, the agents in the upper level are not necessary to work at the speed of thephysical agents in the lower level. This enables a " low-speed " GA agent to makedecisions for a real-time task. Integrating reinforcement learning with Genetic AlgorithmsReinforcement Learning (RL) may face difficulties and generate non-optimal outputs.A typical problem is that "the agent runs the risk of over-committing to actions that arefound to have high Q values during the early stages of training, while failing to exploreother actions that also have high values" (Mitchell, 1997). Another problem is that RLmay not be able to capture the slow changes in the environment. In other words, when theenvironment changes very slowly, RL will have less opportunity to try the actions under66different world states. Consequently, the knowledge base of RL will be updated veryslowly. It means that the learning process will become very slow and the output of RL willnot be optimal.Genetic Algorithms (GAs) can partly overcome these problems. Because of the specialevolution operations such as Crossover and Mutation in GAs (Karray and de Silva, 2004),it is possible for GAs to select a new search point in the candidate space, which had notbeen tried in the past steps. This will provide an opportunity for the actions with low Qvalues in Reinforcement Learning to demonstrate their capabilities. Moreover, because aGA simulates the biological evolution process, it provides better adaptivity than RL and isable to capture slow changes in the environment.On the other hand, GAs also benefit from RL because GAs cannot guarantee aprecisely optimal output. In particular, in the system developed in the present work, thecrowding phenomenon (Mitchell, 1997) was observed, where "very similar individualstake over a large fraction of the population." In addition, GAs are unable to deal withquick changes in the environment because the evolution process usually requires aconsiderable duration of time. All these problems can greatly slow down the evolutionprocess and make the output of the GAs unacceptable.However, by integrating GAs with RL, different individuals can be inserted into thepopulation of GA through RL, because RL uses an entirely different mechanism to selectthe new cooperation strategy. Consequently, it would be possible to disturb the populationpurity in a GA and resolve the "crowding" problem in part. In addition, RL is able to dealwith quick changes in the environment, where GA usually fails.The combination of learning with evolution, which are two important adaptivemechanisms that exist in nature, gives rise to distinct advantages, as discussed before. Inthis section, a specific integration scheme of these two capabilities is implemented in themulti-agent object transportation system, which is indicated in Figure 3.8.67Model KBActionRewarding PerformanceEvaluationProbabilisticArbitratorAction jGAKnowledgeBase (KB)Robot and Object LocalModelingWorld StateExtractionWorld state attime t7. GeneticAlgorithms ^ Actionat time tSensoryFusionReinforcementLearning Action iRL KB IIFigure 3.8: The integration scheme of RL and GA.In Figure 3.8, the sensor data is first sent to a sensory fusion module. This informationwill form the current positions and orientations of the object, the robots and the obstacles,and the state of the environment. Moreover, there is a model knowledge base in which thegeometry and dynamic parameters of the object and the robots are stored. These modelparameters are determined off line and stored in the model knowledge base prior tooperating the system. When the sensory data is sent to a local modeling module, themodule will identify and update the current model parameters by combining the off-lineinformation in the knowledge base and the new sensory data. The information from thesensory fusion module and the model knowledge base is then integrated to form the newworld state, as described in Figure 3.5.The two decision-making mechanisms, Reinforcement Learning and GeneticAlgorithm, are implemented based on the algorithms presented in the previous sections.They both are able to determine a robot cooperation strategy according to the currentworld state. Their outputs are probabilistically selected by an arbitrator, which determinesthe winning output. In this chapter, it is assumed that the output of the reinforcementlearning block is selected by the arbitrator with a probability of 0.7, while the output of the68genetic algorithm block is selected with a probability of 0.3. The output of the arbitrator;i.e., the cooperation strategy, is then broadcasted to the two robot assistant agents by thelearning/evolution agent to implement the object transportation task.3.3.3 Simulation ResultsSome simulation work has been carried out to demonstrate the performance of theevolutionary learning mechanism described before. Because the simulation is used tovalidate the evolutionary learning approach in section 3.3.2, a special case is assumed,which is somewhat different from Figure 3.4. Specifically, it is assumed that there arethree autonomous robots, which cooperate to transport a rectangular box (object) to a goallocation in a 400*400-unit environment without any obstacles. The box has a width of 60units and a height of 40 units. Each robot is simplified as a circular point in theenvironment shown in the subsequent figures. The multi-thread programming of Javalanguage is used to simulate the parallel operations of the three robots.The present work is applicable not only in object transportation tasks with smalldisplacements, described in Figure 3.4, but also in more challenging cases such as spaceexploration. For example, consider the case of several robots sent to a planet forexploration of a special ore, where they transport the ore cooperatively into a spacecraft.In this application, it is impossible for a robot or an agent to possess global information. Inother words, each robot only possesses the local sensing ability and it will need tocommunicate with the other robots to establish the global information. In the presentsimulation, each robot is assumed to be only equipped with local sensing capability.Moreover, the evolutionary learning mechanism is also embedded in each robot so thatthey can cope with the possible existence of unreliable communication, in the planetaryexploration.Because a robot only possesses local sensing capability and the environment dimensionis somewhat large, it will not know where the box is in the beginning. Therefore, a randomsearch strategy is used to find the object in the beginning of the simulation. Specifically,the robots walk randomly to search the object. When one robot reaches a wall or anotherrobot, it simply changes its direction and continues the "wandering state." However, if arobot reaches the box, the "wandering" behavior ceases and the position of the box is69immediately broadcasted to the peers. Then this robot will temporarily take over the"leader" role in the subsequent transportation task. Next, it will use its embeddedevolutionary learning mechanism to compute the optimal cooperation strategy and send itout to the other robots. Consequently, the other robots will move themselves to the desiredlocations and orientations, which are described in the new cooperation strategy sent by theleader robot. This process is shown in Figure 3.9.XFigure 3.9: The robots move randomly and locate the box.The leader robot is responsible for synchronizing the forces and motions among theteammates. It will check whether all the robots have moved to the desired position andorientation. If all the robots are ready, the leader robot will send out a "transport"command. Then the robots will exert the desired forces on the box, and the position andorientation of the box will be changed as a result. Next, the leader robot will compute thenew world state and generate the new cooperation strategy. This process will be repeateduntil the box is transported to the goal location and orientation. The overall process, assimulated, is shown in Figure 3.10.70XFigure 3.10: The whole transportation process.Figure 3.11 demonstrates the adaptivity of the evolutionary learning mechanismemployed by the robots. In this example, the goal location was suddenly changed from(380,370) to (385,60) at t = 34s, while the robots were transporting the box. From Figure3.11 it is observed that the robots quickly adjusted the cooperation strategy andsuccessfully transported the box to the new goal location.71VXFigure 3.11: The goal location was suddenly changed from (380,370) to (385,60) at t =34s while the robots were transporting the box.From the simulation results presented here, it is seen that the evolutionary learningmechanism described before is quite effective in meeting its design objective.3.3.4 ExperimentationThe experiment presented here is designed to test the multi-agent learning strategydescribed before. Here, a movable obstacle is placed in the path while the robots arecooperatively transporting the box to the goal location. Figure 3.12 shows how the robotshandle this problem.72(a) Initial state (Pushing of the box). ^(b) Leaving the box.(c) Removing the obstacle.^(d) Returning.Figure 3.12: The robots have selected the "sweeping" action to remove a movable obstaclein the path.From Figure 3.12 it is seen that when the robots detect the existence of a movableobstacle in the forward path with the help of the camera agent, they select the "sweeping"action. Accordingly the mobile robot leaves the box, moves toward the obstacle, andpushes it away from the path. While the mobile robot is sweeping the obstacle, the roboticarm just stays in the same place, waiting for its peer to return. Once the obstacle is sweptaway, the mobile robot returns to continue the task of box transportation in cooperationwith the robotic arm.This experiment demonstrates that the multi-agent robotic system developed in thepresent work, equipped with machine learning capability, exhibits good flexibility andadaptivity in a complex environment, while effectively carrying out the intended task.733.4 Distributed Multi-robot Q-learning in Complex EnvironmentsIn section 3.3, machine learning is used to find optimal cooperation policies for amulti-robot transportation task. Although this fist version of learning-based multi-robottransportation system is successful, it has two limitations. First, it is a partly centralizedsystem. In the proposed multi-agent architecture shown in Figure 3.4, a centralizedlearning/evolution agent makes decisions for all robots (physical agents) in theenvironment. As stated in section 2.1, this centralized learning/evolution agent willdecrease the scalability performance of the system and result in a communicationbottleneck problem. When the number of robots is large, the burden of thelearning/evolution agent will increase rapidly. In other words, the multi-agent architecturein Figure 3.4 is not suitable for a large multi-robot system with many robots.Second, a relatively simple environment is assumed in section 3.3. There are fewobstacles in the environment and the obstacles are usually stationary. However, in a realmulti-robot environment, the obstacle distribution will be more complex. In particular,there may be many obstacles which may appear and disappear randomly. In such acomplex environment, a multi-robot system is expected to still complete its task in arobust manner.In this section, a multi-robot transportation system equipped with distributed Q-learningis studied, which works in a dynamic and unknown environment with complex obstacledistribution. In this more challenging environment, the Q-learning algorithm will stillshow good adaptivity and robustness.3.4.1 Distributed Q-learningThere are two options for extending single-agent Reinforcement Learning to the multi-robot field. The first one is to simply regard the multi-robot learning as a combination ofmultiple single-agent learning processes across the team members. In this case, each robotis equipped with the reinforcement learning technique, and learns their skills as if the otherrobots do not exist in the environment. This type of reinforcement learning is called"distributed learning." The main advantage of "distributed learning" is its simplicity(Yang and Gu, 2004). Every robot just needs to monitor its own actions and thecorresponding rewards, and it does not need to care about the behaviors of the peer robots.74As a result, the learning space is small and the computational cost is low. However,directly extending the single-agent reinforcement learning algorithm to the multi-robotdomain violates the MDP (Markov Decision Processes) assumption, the theoretical basisof reinforcement learning, which assumes that the agent's environment is stationary anddoes not contain any other active agents. Although some research has shown that"distributed learning" is effective in some cases, more investigation is needed on thisaspect.The second one is the SG-based (Stochastic Game based) Reinforcement Learning. Inthat extension, each robot not only monitors its own actions but also observes the actionsof the peer robots in the same environment. It believes that the rewards and theenvironmental changes are caused by the joint actions of all the robots. This type ofreinforcement learning is called "SG-based learning." Several types of SG-based learningalgorithms have been developed in the multi-agent systems area, such as MiniMax-Qlearning, Nash-Q learning, and Friend-or-Foe-Q learning. With a solid theoretical basis,these algorithms are shown to be effective in several multi-agent/ multi-robot simulationsand experiments. However, they introduce an extensive learning space (state/action space)and complex calculations in the Reinforcement Learning algorithms because each robothas to observe not only its own actions but also the actions of its peers. The extensivelearning space is absolutely a nightmare for researchers in the Reinforcement Learning(RL) field, especially in the multi-robot RL field where the real-time performance isimportant.It follows that, "distributed learning" still makes sense in the multi-robot learningdomain because of its simplicity. Yang and Gu (2004) indicates that this type of learningmay work in a purely cooperative multi-robot task, and Littman (2001) theoreticallyproves its convergence under some conditions. Next, "distributed Q-learning" is applied toa multi-robot box-pushing task in a dynamic and unknown environment, which is achallenging and important sub-field in the multi-robot domain. Effectiveness, robustnessand adaptivity of this learning-based multi-robot system will be shown through computersimulation, and some issues of "distributed learning" will be discussed.753.4.2 Task DescriptionMulti-robot box-pushing is a long-standing yet difficult problem in the multi-robotdomain. In this task, multiple autonomous robots have to cooperatively push a box to agoal location and orientation in a dynamic and unknown environment. The box is big andheavy enough so that no single robot can complete this task alone. In addition, each robotonly possesses local sensing capability and the obstacles may occur and disappearrandomly in the path, so the robots cannot plan the trajectory in advance. In such adynamic and unknown environment with a continuous space which is presented in Figure3.13, any static planning technology is bound to fail. Although dynamic planning (Miyataet al., 2002) may be useful to complete the task, it will fail if the environment changesquickly and the calculation speed of the robots is not sufficiently fast.Figure 3.13: Description of a multi-robot box-pushing task.Because the learning-based architecture may be a better solution to this challengingproblem than a static/dynamic planning architecture, next, the single-agent Q-learningdescribed in Figure 3.2 is extended to this multi-robot box-pushing problem.3.4.3 Multi-robot Box-pushing With Distributed Q-learningIn this section, single-agent Q-learning is directly extended to the multi-robot box-pushing problem. The basic idea is as follows. Each robot is equipped with single-agentQ-learning. When the robots begin to transport the box, each robot identifies the currentenvironmental state, and then independently employs the Q-learning algorithm in Figure763.2 to select the "optimal" action to push the box. After the actions are taken, each robotindependently observes the new environmental state and receives its direct reward r. Therobots then update their own Q-tables with the reward and the new state information, andcontinue with the next step of box-pushing. There are no communications among therobots except the synchronization signal. Each robot independently maintains its own Q-table and makes its decisions. Therefore, it is a fully distributed learning-based decision-making system. In other words, the multi-robot box-pushing problem is solved with apurely learning-based architecture instead of complex static or dynamic planningapproaches.The main challenge of such a "distributed reinforcement learning" system is that itviolates the assumption of stationary environment in the MDP problem. While a robot ispushing the box, its peer robots are also pushing it. It means that the environment is nolonger stationary from the viewpoint of this robot and Q-learning may not converge to theoptimal policy. In addition, the random obstacles in the environment can make the case farworse. However, the simulation results in the following section will show that the task stillcan be completed at the cost of slight reduction in efficiency or increase in the requirednumber of pushing steps. The environmental statesBecause the environmental space in the multi-robot box-pushing problem is continuous,in order to employ the Q-learning algorithm, the continuous space is converted into a setof finite states. In particular, each environmental state is coded into a 13-bit binary codewhich is shown in Figure 3.14.Obstacle Distribution^Goal Orientation12^ 5 4^0Figure 3.14: The binary coding of the environmental states.770 00DetectionRadiusObstacles0 Goal LocationxIn Figure 3.14, the higher 8 bits (from bit 5 to bit 12) in the code represent the currentobstacle distribution around the box within the detection radius. The lower 5 bits (from bit0 to bit 4) represent the orientation of the goal relative to the current box pose. The codingrule is explained in Figure 3.15.Figure 3.15: Coding rule of the environmental states.In Figure 3.15, it is assumed that the robots only possess local sensing abilities.Consequently, only those obstacles within the detection radius can be detected and codedinto the higher 8-bits of the environmental state. In order to describe the obstacledistribution in Figure 3.15, the detection circle is divided into 8 equal sectors. Each sectorcorresponds to one of the higher 8 bits of the environmental state. If there is at least oneobstacle detected in this sector, the corresponding bit in the state will become "1."Otherwise, it will become "0." For example, in Figure 3.15, the higher 8-bits of theenvironmental state is "10100100" which indicates the current obstacle distributionaround the box within the detection radius.The lower 5 bits of the environmental state indicate the orientation of the goal relativeto the current box pose. The orientation angle 9 of the goal in Figure 3.15 is used tocalculate the lower 5-bits of the state usingStateba(0- 4) = INT (01(360.0132))^ (3.11)where, INTO is a function to covert a float value into an integer.783.4.3.2 Robot actionsIn a physical world, a robot usually can push a box at any contact point. It means thatthere are infinite actions available to a robot to push the box. However, in this section,because it is focused on extending the single-agent reinforcement learning algorithm to themulti-robot domain, the action space is simplified and only six actions are assumed to beavailable to each robot. The available actions (a, — a6 ) are presented in Figure 3.16. a4Figure 3.16: Pushing actions available to each robot. Box dynamicsBecause this section focuses on extending the single-agent reinforcement learningalgorithm to the multi-robot box-pushing problem, the box dynamics is simplified in thesimulation system. It is assumed that the box displacement is proportional to the net forceexerted by the robots, and the rotational angle of the box is also proportional to the nettorque exerted by the robots. In Figure 3.17, the dynamics of the box is illustrated, wherethe box is being pushed by three mobile robots.79(a)•YenvC l• ,^,,r♦Yf'♦xenvF2 YbFlt1Box xbY2Ylxl^X2(c)Figure 3.17: The dynamics of the box pushed by three robots.In Figure 3.17, three robots push the box and Fi F3 aretheir pushing forces. The boxdynamics is described with the following equations:x2 = x, +^cos(y + cel )^ (3.12)y2 = y1 + c^sin(y + a, ) (3.13)80c€2 =^ctr^ (3.14)where, EF is the net force, F is the net torque, cf and c, are coefficients, and= arctg2(1EFy llIEF,)). Reward functionA common payoff reward strategy is employed in this section. Specifically, each robotgets the same reward after it pushes the box for a short period of time At . The rewardconsists of the following three parts:(1) The reward for pushing the box toward the goal locationRdistace = (Dist old — Dist new )Cd^ (3.15)where, Disc,/ and Dist,,,,„ are the distances between the goal location and the box centerbefore and after taking the pushing actions, and cd is a coefficient.(2) The reward for the rotational behavior of the boxRrotatton = cos(a2 — a1 ) — 0.9^ (3.16)where, ( a2 — a1 ) is the rotational angle of the box when it is pushed by the robots for aperiod of time At . While a small rotation is encouraged, a large rotation will be punishedbecause it will make handling the box difficult for the robots.(3) The reward for obstacle avoidance=1.0 if no obstacleRobstacle =^ (3.17)= 1.5(sin(v /2) — 0.3)where, yr is the angle between the direction of the net force and the direction of theobstacle closest to the net force.Finally, the total reward is calculated as.R = wiRdis tan ce W 2R rotation + W 3R obstacle^ (3.18)where, w1 w3 are the weights and w1 + w2 + w3 =1.0 . Random actionReinforcement Learning may face difficulties in the action selection process. Forexample, "the agent runs the risk that it will over-commit to actions that are found to have81high -.1 values during the early stages of training, while failing to explore other actionsthat also have high values." One solution to this problem is to select actions with the GLIE(Greedy in the Limit with Infinite Exploration) policy (Littman, 2001 a). However, here asimpler strategy instead of the GLIE policy is used. The strategy is to let the robotsrandomly select their actions at a low level of probability. In other words, most of the time,the robots take their actions as determined by the Q-learning algorithm, while they doselect their actions randomly at a low level of probability.3.4.4 Simulation ResultsA simulation system is developed with the Java language to simulate a learning-basedmulti-robot box-pushing system, which has been described in the foregoing sections. Thedimensions of the environment are 950x700 units and the dimensions of the box are120x80 units. There are three robots, denoted by circles with a radius of 2 units in thesubsequent figures, which push the box to the goal location. The box is big and heavy sothat no individual robot can handle it without the help of its peers. The Java multi-threadprogramming technique is used to implement the parallel operations of the robots.When the simulation system starts to run, based on the current environmental state, therobots independently select their actions using the single-agent Q-learning algorithm.Then they begin to push box after synchronization. The box is moved for a short period oftime At , and the reward is collected and distributed equally among the robot. Based on thereward and the new environmental state, the robots independently update their own Q-tables. Then they start the next step of pushing, and so on, until the goal location isreached or the total step number exceeds the maximum limit.When the box reaches the goal location or the total step number in one round exceedsthe maximum allowed step number, this round of box-pushing is completed and a newround of box-pushing is started. In addition, every time the next round of box-pushing isstarted, the locations of the box, the obstacles, and the goal are selected randomly.However, the Q-table of each robot is inherited from the last round. When the totalnumber of rounds reaches the desired number, the simulation system is stopped.Parameters of the simulation system are: /3 =0.9, r = 0.9 , 77 = 0.8 ,c f = 0.2,c, = 0.02,cd = 0.5, wl = 0.7,w2 = 0.05,^Iv, = 0.25,/ =10 , the probability of82ObstacleBox TraiectoryGoal•Obstacle.1§tarting pointselecting random actions is 0.2, the total number of obstacles in the environment is 11, andthe maximum number of steps per round is 6, Box-pushing simulation resultsThree different box-pushing snapshots under different obstacle distributions arepresented in Figure 3.18 where three robots represented by very small circles are locatedon the sides of the box. The eleven big circles in Figure 3.18 represent the obstacles in theenvironment.(a) A successful box-pushing task.83(b) Another successful box-pushing task with a different obstacle distribution(c) A failed box-pushing task because of a local maximumFigure 3.18: Three box-pushing snapshots under different obstacle distributions and goallocations.84In Figure 3.18 (a), the three robots first cooperatively push the box upwards toapproach the goal location in the top right corner. However, the box reaches an obstacle inthe path. After several failed trials, the robots select to slightly push the box downwards toavoid the obstacle. Then they push the box upwards again to approach the goal location.However, before the box reaches the goal location, it reaches another obstacle. The robotsagain adapt their actions to the new states and avoid the obstacle. Finally, the box issuccessfully pushed to the goal location. Figure 3.18(b) presents another successful box-pushing task with a different obstacle distribution.Figure 3.18(a)-(b) demonstrates that the single-agent reinforcement learning algorithmcan be extended to the multi-robot box-pushing domain in an effective and successfulmanner. Even though each robot is only equipped with a simple single-agent Q-learningalgorithm, and no communications exist among robots except the synchronization signal,the robots still can cooperatively push the box to the goal location in an unknownenvironment with a random obstacle distribution. In the process of pushing and obstacleavoidance, the robots demonstrate a flexible and adaptive action selection strategy in acomplex and unknown environment. In addition, the system employs a fully distributedmulti-robot decision-making architecture with local sensing capability and limitedcommunications (only the synchronization signal). Consequently, it is robust and scalable,and may be easily employed in various complex environments and applications.It is also observed that the learning algorithm is not perfect in every aspect. It cansuffer from the lack of local sensing and the presence of a local maximum. A typical caseis shown in Figure 3.18(c), where the robots are trapped in a local maximum and theycannot avoid the obstacles to reach the goal. The problem can be solved partly byincreasing the sensing radius. However, that will lead to rapid expansion of the learningspace, which makes the algorithm ineffective. Effectiveness of single-agent reinforcement learning in a multi-robot cooperativetaskIn general, in a multi-robot environment, because a robot's reward is affected by theactions of its peers, the assumption of a stationary environment in the single-agent Q-85learning algorithm is violated. This means the Q-learning algorithm makes little sense in amulti-robot environment. However, Yang and Gu (2004) suggested that it was possible forsingle-agent Q-learning to learn the coordination among robots in a purely cooperativemulti-robot task. With the multi-robot box-pushing simulations presented in this section,this observation is verified and supported. Furthermore, in the following figures, someprobability analysis results are employed to show why the single-agent Q-Learning canwork in a multi-robot cooperative task.It should be noted that the Q-learning algorithm presented in this chapter is differentfrom Littman's algorithm (Littman, 2001a). The latter is a type of "SG-based" Q-learning,and requires a robot to monitor the actions of all peer robots. Consequently, it also suffersdue to the extensive learning space problem, highlighting the difference betweenLittman's algorithm and the "distributed learning" algorithm discussed in the presentsection.In the computer simulations, the robots are allowed to push the box continuously for100,000 rounds to gain experience and update their respective Q-tables. Then oneenvironmental state is selected randomly for analysis, whose state value is binary"1000000" or decimal "64," and its Q values corresponding to the six action candidatesare examined. According to the coding rule of the environmental states in section,the binary value of "1000000" represents an environmental state, which is shown inFigure 3.19.86Action #3 Action #2Goal LocationxDetectionRadiusAction #5Figure 3.19: The environmental state with binary value "1000000" and six pushing actionsavailable to the robots.The Q values under the state of "1000000" in 100,000 rounds of box-pushing arerecorded and shown in Figure 3.20, and the probability density estimate of these Q valuesis given in Figure 3.21.87^Action 1^ Action 2^ Action 3Action 4^ Action 5Action 6181614124^6Round number8^10x 10 4420 0 210vi0 86Action 1Action 2Action 3Action 4Action 5Action 6E 0.25w>, 0.15015 205^100 Value0-50 350.3a)0.05Q Values under State=64Figure 3.20 The Q values under the state of "1000000" in 100,000 rounds of box-pushing.Probability Density Estimate of Q Values under State=64Figure 3.21: Probability density estimate of Q values under state of "1000000"in 100,000 rounds of box-pushing.It is known that if the environment is stationary, the Q values in a single-robot systemwill usually converge to some values with probability one (Yang and Gu, 2004). However,88Action 1Action 2Action 3Action 4Action 5Action 6in the multi-robot box-pushing task presented in this chapter, from Figure 3.20, it is seenthat the Q values increase rapidly from the initial values of zero to some positive values ina very early stage. Then they begin to fluctuate in a bounded area with an upper bound of16 and a lower bound of 2. From Figure 3.20 and Figure 3.21, it is rather difficult to findwhich action is selected at higher probability under the state of "1000000." The Q valuesin Figure 3.20 seem to fluctuate randomly and all the curves overlap. Therefore, it appearsthat the robots do not learn any cooperative strategy in the training process and onlyrandomly select their actions. This phenomenon conflicts with the results in Figure 3.18.In order to reveal the action selection strategy learned by the robots, the action selectione0(s,ak )17-probability in each round is calculated using the equation P(ak)— ^ , and themzeQcs,aoh-1=1probability density estimate of the action selection probability is determined with thesample of 100,000 rounds of box-pushing. The result is presented in Figure 3.22.Probability Density Estimate of Action Selection Probabilty under State=e41412a)76E  108Cu'a)0 6IL-(1_a 420-0.2 0 0.2 0.4 0.6 0.8 1 12Action Selection ProbabiltyFigure 3.22: Probability density estimate of action selection probability under the state of"1000000" in 100,000 rounds of box-pushing.89Figure 3.22 gives the probability density estimate of action selection probability undera special state in 100,000 rounds of box-pushing. From Figure 3.22 it is seen that there is ahigh probability that the robots select action #1 in a low probability, and there is a lowerprobability that the robots select action #4 in a low probability. In other words, action #1is selected by the robots in low probability under the environmental state of "1000000"and action #4 is selected by the robots in higher probability under this state. From Figure3.19 it is observed that action #1 pushes the box away from the goal location, and action#4 pushes the box towards the goal without colliding with the obstacle. By comparingFigure 3.22 with Figure 3.19, it is easy to conclude that the robots learn the correct actionselection strategy when they cooperatively push the box to the goal location. This explainswhy the robots in Figure 3.18 can cooperate to push the box to the goal successfully in acomplex and dynamic environment with random obstacle distributions.3.5 Team Q-learning Based Multi-robot Cooperative TransportationIn section 3.4, the single-agent Q-learning algorithm is directly extended to the multi-robot domain for decision-making. Although some positive results were obtained, suchextensions do not have a solid theoretical foundation because they violate the staticenvironment assumption in single-agent Q-learning. In the multi-robot domain, theenvironment is no longer static because multiple robots may change it simultaneously(Yang and Gu, 2004). Therefore, robots may be confused if the single-agent Q-learningalgorithm is directly employed in a multi-robot environment. This kind of confusion canbe observed in Figure 3.20 where no action prevails over other actions under a specificstate.It appears important for a multi-agent Q-learning algorithm to model a dynamicenvironment instead of a static environment in the single-agent Q-learning algorithm.Several types of multi-agent reinforcement algorithms, such as MiniMax Q-learning andNash Q-learning, have been developed by the multi-agent community (Yang and Gu, 2004;Littman, 1994). All of them assume a dynamic environment and are proved to converge tooptimal policies under certain conditions. These algorithms enjoy a good theoreticalfoundation and are preferable in multi-agent tasks over the simple extension of single-agent reinforcement learning. In particular, Littman suggests using the team Q-learning90algorithm for purely cooperative games, which is much simpler than the Nash Q-learningalgorithm but has comparable performance (Littman, 2001a).In this section, the team Q-learning algorithm is applied to a multi-robot box-pushingtask in an unknown environment with a complex obstacle distribution. A fully distributedmulti-robot system will be developed, which is equipped with learning-based decision-making and local sensing capabilities. Using computer simulations, it will be studied howthe learning parameters such as random action probability, detection radius, and thenumber of robots affect the performance of the team Q-learning algorithm in a multi-robottask.3.5.1 Stochastic Games and the Team Q-learning AlgorithmMost multi-agent reinforcement learning algorithms are based on the framework ofStochastic Games (SGs) (Yang and Gu, 2004). A learning framework of SGs is definedby the tuple <S, A1 , , An , T, RI , Rn , f3 >, where• S is a finite set of states and n is the number of agents.• AI ,^, An are the corresponding finite sets of the actions available to eachagent.• T: S x Al x A2 x x An^S , is a transition function which maps the currentstate to the next state.• R,: S x AI x A2 x x An^, is a reward function which describes therewards received by the ith agent. 0 < 13 <1 is the discount factor.In a SGs framework, the joint actions of the agents determine the next state and therewards received by each agent. Based on the interaction history, each agent tries to findthe optimal action-selection policy that maximizes the expected sum of the discountrewards in the future. For a purely cooperative game in which each agent shares the samegoal and receives the same payoff as its peers, Littman suggests the team Q-learningalgorithm, which is a simplified version of the Nash Q-learning algorithm, to find theoptimal policy (Littman, 2001a). The team Q-learning algorithm is given below For each state s, E S and joint action (ai ,... ,an )e^, initialize the table91entry (s^a n ) to zero Observe the current state s Do repeatedly the following: Let (a „•^,a n )= arg max Q(s,a,,... ,a,,) . Execute the joint action^(a 1 ,^,a„)(a l , ...,a n ) . Receive an immediate reward r Observe the new state s' Update the table entry for 0( s, ,^,an*) as follows:,a„ *)= (1- - s)0(s,a,. ,...,a,:)+ e(r+ )6' max 0[s' ,a„...,a„])where, 0 < s <1 is the learning rate S^SAfter a sufficient number of iterations, the 0(s, a„ , a,) will converge to the optimalvalue. In a purely cooperative game, compared with the Nash Q-learning algorithm, theteam Q-learning algorithm requires less computational resources, and exhibits comparableperformance (Littman, 2001a).The standard team Q-learning algorithm uses the greedy policy to choose actions(Littman, 2001a). However, this policy may explore too few states to converge to theoptimal policy. Therefore, in this section, a modified GLIE policy (Greedy in the Limitwith Infinite Exploration) as given below is used to guarantee the team Q-learningalgorithm to converge to optimal policies (Littman, 2001a):c In(s) if Cr I n(s)> Pthresholdrandom =^ (3.19)'threshold if cr I n(s)^where, P.random is the probability of selecting random actions, n(s) is the number of timesstate s has been encountered so far during learning, 0 < cr < 1 is a constant, and Pthreshold isthe threshold value of Prandom .3.5.2 Simulation ResultsA simulation system is developed with the Java language to simulate a multi-robotbox-pushing system using the team Q-learning algorithm. In this simulation system, thePthreshold92representation of environmental states, robot actions and the reward function of the teamQ-learning algorithm are the same as those in the single-agent Q-learning algorithm, asdescribed in section 3.4.3. The dimensions of the environment are 950x700 units and thedimensions of the box are 120x80 units. There are three robots, denoted by circles withradius 2, in the subsequent figures, which push the box to the goal location. Parameters ofthe simulation system are fl =0.9, s = 0.8, cf = 0.2,c, = 0.02,cd = 0.5, wl = 0.7,w2 = 0.05,w3 = 0.25,/ =10, cr = 0.9, Pthreshold = 0.1, the detection radius is 72, the maximum numberof steps per round of box-pushing (from the beginning position to the goal position) is5,000, and the total number of rounds in the training stage is 10,000. Figure 3.23 presentsa round of box-pushing after training.Figure 3.23: The box trajectory in a round of box-pushing.The rewards received by the robots are shown in Figure 3.24 and the Q values learnedby the robots after 10,000 cycles of training are shown in Figure 3.25.93ft'200^400^600^800^1000^1200Step14002.521.5E 0.5a 024232276- 2019181735302520151050State (1-32)250200150100Joint action (1-216)Figure 3.24: The rewards received by the robots in a round of box-pushing.Figure 3.25: The surface of the Q-table (the states from 1 to 32) after 10,000 trainingcycles.94From Figure 3.23 to Figure 3.25, it may be concluded that the team Q-learningalgorithm can be effective in a multi-robot box-pushing task. As shown in Figure 3.23, thebox was successfully pushed to the goal by the robots in an environment with complexobstacle distribution. The rewards shown in Figure 3.24 indicate that the robots receivedpositive rewards most of the time although sometimes they received negative values forcolliding with the obstacles. The robots quickly adapted to new states with a differentobstacle distribution and pushed the box to the goal without any centralized strategies. TheQ-table surface in Figure 3.25 also indicates that the robots learned and increased the Qvalues in 10,000 rounds of training.3.5.3 Impact of Learning ParametersIn this section, the impact of several learning parameters on the team performance willbe assessed. These parameters are the random action probability, the detection radius, thenumber of training rounds, the pushing forces of the robots, and the number of robots.Prior to assessing the impact of the learning parameters, the robots are given 10,000rounds of training in box-pushing. Then, based on the knowledge acquired in the trainingstage, the robots push the box for 1,000 rounds and record the number of steps in eachround, to assess the performance of the team Q-learning algorithm, based on the numberof steps per round. Impact of trainingThe impact of training on the team performance is assessed in Figure 3.26.9521 81 61 40.40.2 — — - no training^ after 10,000 trainingsjrtllx10 3500^1000^1500^2000^2500^3 000^3500Step number per roundFigure 3.26: The impact of training.In Figure 3.26, two cases are compared. The first one corresponds to "no training" inwhich the robots pushed for 1,000 rounds without any preliminary training. In the secondcase, denoted by "after 10,000 trainings," the robots first pushed the box for 10,000rounds to gain experience and then pushed the box for further 1,000 rounds to assess theteam performance. The probability distributions of the number of steps per round in thetwo cases are shown in Figure 3.26. They indicate that training is helpful but it has only asmall influence on the team performance. Random action probability.In equation (3.19), the modified GLIE policy is used to determine the random actionprobability of the robots so that the team Q-learning algorithm explores an adequatenumber of states to converge to the optimal policy. Here, the impact of the random actionsis assessed. Three different lower bounds of the random action probability (Pthreshold =0, 0.3and 0.6) are employed by the robots to push the box, and the results are presented inFigure 3.27.96P(threshold)=0P(threshold)=0.3— — - P(threshold)=0.6x10 311' 403p0_a20^500^1 000^1500^2000- 11-111111tfith^2500^3000 3500Step number per roundFigure 3.27: Impact of pth,esholi, (Lower bound of the random action probability).Figure 3.27 illustrates that random actions are very important for the team Q-learningalgorithm. When Pthr„hom is equal to zero, the number of steps per round is 3,000 at a highprobability. Because the maximum number of steps per round is set to 3,000 in thesimulation program, it means that the robots cannot complete the task in the desirednumber of steps at a high probability when 13;hreshom is equal to zero. In other words, thestandard team Q-learning algorithm with GLIE policy (i.e., Pthreshold =0) cannot work in anenvironment with complex obstacle distribution because it cannot escape from a localminimum. When Pthr„hoid increases to 0.3, the number of steps per round decreases to 500at all probabilities. It means that the robots push the box to the goal more quickly.However, when Pthreshoid increases to 0.6, the number of steps per round increases to 700 atall probabilities because too many random actions make the robots fail to committhemselves to the goal and reduce the team performance.97---  R=92+ R=132^ R-222Figure 3.28: The impact of the detection radius.I+ \0^500^1000^1500^2000^2500Step number per round411f3 0003.5.3.3 Detection radiusIn the developed multi-robot box-pushing system, each robot only possesses the localsensing capability. Only the obstacles within the detection radius can be sensed by therobots. Here, the impact of the detection radius on the team performance is assessed, aspresented in Figure 3.28.x10 33.5 ^3500In Figure 3.28, three multi-robot box pushing systems with different detection radii(R=92, R=132, and R=222) are compared. Although a bigger detection radius helps therobots to find more local obstacles, Figure 3.28 indicates that team performance degradeswhen the detection radius increases. Pushing forceIn the previous sections, a team of homogenous robots each possessing the samecapabilities is assumed. In particular, each robot has the same pushing force( F. =10, i =1,2,3 ). However, here a team of robots with different pushing forces are981 F1 =10, F,=50,F3 =50F1 = 10, F2=20,F3=20F1 =10, F2 =10,F3 =10500^1000^1500^2 000^250 0^3000^3 500Step number per roundassessed, as presented in Figure 3.29.x10 35ct)4cvo 3Figure 3.29: The impact of the pushing force.Figure 3.29 indicates that a bigger pushing force helps to improve the teamperformance because it enables the robots to move the box more quickly and makes it easyto escape from a local minimum. However, Figure 3.29 does not indicate that a biggerpushing force will result in excessive box rotations, which can make it difficult for therobots to handle the box. The impact of the number of robots and the disadvantage of team Q-learningIn the previous sections, three robots were used to push the box. Now, five robots areassumed to complete the same task and compare its team performance with that of thethree-robot team. The rewards received by the robots in a round of box-pushing are shownin Figure 3.30.99rifirrir--- S0-I I'- S-ZS'0 a00T(q)dais0001,^008^009^00t^00Z- Z -(u)dais009E^0002^0096^0006^009 Z-9Z^ 3 robots+^5 robots1 81 61^\-^0 unlit-3^-2^-1 0^1^2reward4(c)Figure 3.30: The rewards received by the robots in a round of box-pushing. (a) Therewards received by the three-robot team. (b) The rewards received by the five-robot team.(c) The probability density of the rewards.Figure 3.30 indicates that the five-robot team does not do a better job than the three-robot team. In particular, the former does not get more rewards than the latter. Because thenet force is bigger in the five-robot team, if the members cooperate correctly, the numberof steps per round should be much smaller, as indicated in Figure 3.30. However, thisteam performance is improved at a cost of much heavier computational burden.The main disadvantage of the team Q-learning algorithm is its extensive learning space(state/action space). When the number of robots increases, the learning space growsquickly. For example, in this section, there are three robots and 8,192 environmental states,and each robot can select one of six available actions. Then the Q-table size is8192 x 6 3 (z1.7M). However, if the number of robots increases to 10, the Q-table size willincrease to 8192 x 6 b0 (z495G) which will require an extensive memory space in the host101computer. Clearly, it is very difficult and even impossible to operate and maintain such ahuge memory space in a computer. Therefore, when the number of robots is more than 10,the team Q-learning algorithm can become entirely ineffective. In this case, directlyextending the single-agent Q-learning algorithm to the multi-robot domain becomes anappropriate alternative because its learning space is fixed.In this section, the team Q-learning algorithm has been assessed in a multi-robot box-pushing task. As a type of multi-agent reinforcement learning algorithm, the team Q-learning algorithm possesses a good theoretical foundation. Hence, it is preferable over thesingle-agent Q-learning algorithm for multi-robot problems. The algorithm of team Q-learning has been presented in this section and a multi-robot box-pushing simulationsystem developed to assess the algorithm. In addition, the impact of the learningparameters has been studied. The simulation results show that the team Q-learningalgorithm is successful in a multi-robot box-pushing system with distributed decision-making, local sensing capabilities and an unknown environment containing obstacles. Themain disadvantage of the team Q-learning algorithm is that it only can work in a smallrobot team, typically with 2 to 5 members.3.5.4 Comparison of Team Q-learning with Multi-robot Distributed Q-learningIn section 3.4, the single-agent Q-learning (or distributed Q-learning) has been directlyextended to a multi-robot box-pushing task. Now, the team Q-learning algorithm iscompared with distributed Q-learning, using the same benchmark problem: the multi-robotbox-pushing project described in Figure 3.13. Both Q-learning algorithms employ thesame representation of the environmental states, robot actions, and reward function, aspresented in section 3.4.3. A simulation system is developed to simulate the learning-based multi-robot system, and compare the performances of the two Q-learning algorithms.The maximum number of steps per round in the simulation is 5,000, and the total numberof rounds in the training stage is 10,000. The simulation results are presented below. Simulation results without trainingThe simulation results of the two Q-learning algorithms are presented in Figure 3.31.Note that the robots in Figure 3.31 do not receive any training before they begin to push1028765co 40322.51.5-8 0.5a 0-05-1 5 Box Trajectory(al)0 value history when state=5100^200^300^400^500^700^800step(bl) (b2)Reward received by the agents100^200^300^400^500^600^700^800step(c 1) (c2)(a2)0 value history when state=5Reward received by the agentsthe box. Accordingly, all the robots are novice and they do not have any experience in boxpushing.Figure 3.31: Simulation results with the single-agent Q-learning algorithm and the team103Q-learning algorithm: (al) The box-pushing process with the single-agent Q-learningalgorithm; (a2) The box-pushing process with the team Q-learning algorithm; (bl) TheQ value history (state=5) in the single-agent Q-learning algorithm; (b2) The Q valuehistory (state=5) in the team Q-learning algorithm; (c 1) The received reward in thesingle-agent Q-learning algorithm; (c2) The received reward in the team Q-learningalgorithm.From Figure 3.31 (al) and (bl), it is seen that both Q-learning algorithms cansuccessfully push the box to the goal location in an unknown environment. Because eachrobot independently makes decisions and chooses its actions, it is a fully distributed multi-robot system. From Figure 3.31 (a2) and (b2), it is seen that the Q values converge tosome fixed values. In Figure 3.31 (a3) and (b3), it is observed that the rewards received bythe robots tend to fluctuate. This fluctuation results from reaching unknown obstaclesduring the pushing process, or disturbance from the actions of its peers. However, most ofthe time, the robots receive positive rewards. Comparison between two Q-learning algorithmsIn Figure 3.32, the performance of the single-agent Q-learning algorithm before andafter training is compared. Here, the performance index is taken as the reciprocal of thetotal number of steps per round. All simulations are based on the same complexenvironment with eleven obstacles, as shown in Figure 3.32 (a).104(a) The environment with 11 obstacles.5 000 ^45004 0003 500(,) 3000a)-(7) 25002000 -15001000500 ^0 20^40^60^80Round number100(b) The total steps per round BEFORE training.105IIVI^ 1 20 40^60Round number80^1001.61.47'±LOc 1.2D›,1co-g 0.800.60.40.20After Training+ ++5000 ^4 50040003500in 3000eLa)2500200015001000500 0(c) The total steps per round AFTER training.X 10-3^Probability density estimate of steps per round21.8Before Training0^1000^2000^3000^4000^5000Steps per round(d ) Probability density estimate of steps per roundFigure 3.32: Performance index comparison of the single-agent Q-learning algorithmbefore and after the training stage.6 000106From Figure 3.32 it is seen that the performance index is significantly improved in thetraining stage. After training, the total number of steps per round is decreased from 5,000to 1,000. Because the maximum number of steps is set at 5,000 for a round of box-pushing,a total number of steps of 5,000 means that the robots fail to push the box to the goallocation within the desired number of steps in that round. Figure 3.32 indicates that thesuccess rate of box pushing greatly increases after training because the robots learn andimprove the box-pushing policy through experience.Similarly, the performance of the team Q-learning algorithm before and after trainingis compared in Figure 3.33. The simulations are based on the same environment as inFigure 3.32 (a).5 000450040003500coeLa)co 3000250020001 500 0 I^ i 20^4060^80Round number100(a)10732.52cio 1.502015 000 ^4 5004 0003 500a 30002 500200015001000 ^0X 10 -33.5 ^20 40^60^80Round number(b)Probability density estimate of steps per round100+0.5II5000 6 0001000^2000^3000^4000Steps per round(c)108Figure 3.33: Performance index comparison of the Team Q-learning algorithm before andafter the training stage. (a) The total number of steps per round BEFORE training. (b) Thetotal number of steps per round AFTER training. (c) Probability density estimate of stepsper round before and after training.From Figure 3.33, the following conclusion may be drawn: Training is not required inthe team Q-learning algorithm. The robots do not learn any useful policy for pushing thebox in the training stage. It appears that the team Q-learning algorithm does not help therobots to improve their skills using the experience received during the training stage. Thetotal number of steps per round has a high probability index of 5,000 before and aftertraining. It means that the robots always fail to push the box to the goal location in thedesired number of steps even after training.By comparing Figure 3.32 (d) with Figure 3.33 (c), one observes a further interestingresult: The single-agent Q-learning algorithm seems to do a better job than the team Q-learning algorithm even though the latter has a stronger theoretical foundation. A possibleexplanation for this phenomenon is that some random actions are necessary to solve thelocal minimum problem in a complex and unknown environment for a multi-robot box-pushing task. Although the team Q-learning algorithm with GLIE policy does take randomactions, the probability of random action is still too low. Therefore, the team Q-learningalgorithm can be easily trapped in a local maximum.The single-agent Q-learning algorithm is different since a robot/agent does not observethe actions of its peers when it makes decisions. The peer actions are regarded as a type ofrandom disturbance or random "noise" in the environment. When this type of random"noise" is large, it degrades the team performance. However, if the random "noise" issmall, it can help the robot team to escape from a local maximum. That is why the single-agent Q-learning algorithm does a better job than the team Q-learning algorithm in thepresent simulations.There are two possible ways to improve the performance of the team Q-learningalgorithm in an unknown environment with a complex distribution of obstacles. First is toincrease the probability of random actions, which will overcome the local minimum109problem. Second is to increase the number of training rounds so that the robot can exploreenough states and converge to the optimal policy.In the present section, the performance of two types of Q-learning algorithms has beencomparatively studied in the context of multi-robot box pushing. First, the single-agent Q-learning algorithm was directly extended to the multi-robot domain. Such extensions donot possess a solid theoretical foundation because they violate the static environmentassumption of the algorithm. Second, the team Q-learning algorithm was introduced intothe multi-robot box-pushing field, which had been specifically designed to solve thepurely cooperative stochastic game problem. This algorithm was shown to converge to theoptimal policy if an adequate number of explorations were executed. Both algorithmswere implemented in a multi-robot box-pushing system. The simulation results showedthat both algorithms were effective in a simple environment without obstacles or with avery few obstacles. However, in a complex environment with many obstacles, thesimulation results showed that the single-agent Q-learning algorithm did a better job thanthe team Q-learning algorithm. It is believed that more random disturbances in the formerhelped it to escape from local minimum.3.6 SummaryThis chapter addressed the application of machine learning to multi-robot decisionmaking. Multi-robot systems face some special challenges; for example, dynamic andunknown environments, which the traditional AI technologies cannot effectively deal with.Therefore, it is important to integrate machine learning with the conventional planning orbehavior-based decision-making techniques to help multi-robot systems work in acomplex and unknown environment, in a robust and effective manner.Reinforcement learning is commonly used to make decisions for a single robot due toits good real-time performance and guaranteed convergence in a static environment. Inthis chapter, first, the single-agent reinforcement learning was directly extended to multi-robot decision-making. The basic idea of the single-agent Q-learning and its framework ofMarkov Decision Processes were introduced in section 3.2. The first version of learning-based multi-robot transportation was developed in section 3.3 to validate the feasibility ofthe proposed algorithm. A centralized decision-making architecture was assumed for the110multi-robot system.In section 3.4, a more challenging multi-robot environment was assumed, which isdynamic and unknown and has more complex obstacle distribution. For such anenvironment, a distributed multi-robot box-pushing system based on the single-agent Q-learning algorithm was developed. The simulation results showed that the robots stillcould complete the task well albeit at a slower speed.In order to overcome the static environment assumption in the single-agent Q-learningalgorithm, the team Q-learning algorithm which inherently models the dynamicenvironment was employed to build a multi-robot decision-making subsystem. Theperformances of the two Q-learning algorithms were compared and analyzed on the sameplatform of multi-robot box-pushing.The simulation results show that both Q-learning algorithms have disadvantages whenthey are used to make decisions in a multi-robot system, even though they can help robotsmake good decisions and properly complete tasks in most of cases. Because it violates theassumption of a static environment, directly applying the single-agent Q-learningalgorithm to the multi-robot domain will result in very slow convergence speed, and thereis no guarantee that the algorithm will converge to the optimal policies. On the other hand,the team Q-learning algorithm will generate an extensive learning space which makes thealgorithm infeasible when the number of robots is large. It indicates that a new Q-learningalgorithm that is suitable for multi-robot decision-making is required, by integrating theadvantages of both single-agent Q-learning and team Q-learning.111Chapter 4Sequential Q-learning with Kalman Filtering (SQKF)4.1 OverviewLearning is important for cooperative multi-robot systems whose environments areusually dynamic and unknown. In chapter 3, two Q-learning algorithms are employed tofind optimal action policies for the members of a multi-robot system. Although both Q-learning algorithms can help robots find good cooperation policies, they also have someserious disadvantages in a multi-robot environment. Furthermore, directly extending thesingle-agent Q-learning algorithm to the multi-robot domain violates its assumption ofstatic environment, and as a result the values of the Q-table cannot converge. Although therobots still can find some good policies for cooperation, the performance of the wholeteam can be degraded when that approach is used. On the other hand, the team Q-learningalgorithm models a dynamic environment inherently and is clearly more suitable formulti-robot systems, but it requires an extensive learning space (state/action space) andthis space increases rapidly with the number of robots in the system. It is very difficult fora robot to manipulate and manage an extensive learning space in real-time. This isexacerbated by the fact that all states of this extensive space should be visited to guaranteethat the team Q-learning algorithm converges to its optimal policies, which is quitedifficult to achieve in a real multi-robot system.These observations indicate that it is important to develop a new Q-learning algorithmwhich is suitable for real-time operation of cooperative multi-robot systems. In thischapter, a modified Q-learning algorithm termed Sequential Q-learning with KalmanFiltering (SQKF) is developed, which takes inspiration from human cooperation inperforming heavy-duty manual tasks, to meet the main challenges of Q-learning in themulti-robot domain. The SQKF algorithm is presented in section 4.2, and it is validatedthrough computer simulation in section 4.3.1124.2 Sequential Q-learning with Kalman FilteringThe Sequential Q-learning algorithm with Kalman Filtering (SQKF) is speciallydesigned to cope with various challenges which the single-agent Q-learning algorithm andthe team Q-learning algorithm face in a multi-robot environment. The SQKF algorithm,which is inspired by human cooperation in executing heavy-duty manual tasks, has twoparts: Sequential Q-learning and Kalman Filtering based Reward Estimation, which aredescribed next.4.2.1 Sequential Q-learningThe sequential Q-learning algorithm may be summarized as follows:• Assume that there are n robots, R" R2 , .^, which are arranged in a specialsequence. The subscripts represent their positions in this sequence. A,, Az , ...,A n are the corresponding action sets available for the robots. Inparticular, the robot Ri has mi actions available for execution as given byR: Ai = (a',a2i ,^ )• Qi (s,a),Q2 (s,a),^,Qn (s,a) are the corresponding Q tables for the robots. Allentries in the Q tables are initialized to zero.• Initialize r to 0.99.• is a set including all actions selected by the robots thus far, and q5 representsthe empty set.• Do repeatedly the following:■ Initialize kis =■ Observe the current world state s■ For (i=1 to n)1) Generate the currently available action set Ai = (A i — (A, n kif))2) The robot R selects the action a ji E A, with probabilityP(aid - (4.1)113where a ir e A, (r = 1,2,    k) and k is the size of the set A,3) Add action a f' to the set■ End For■ For each robot Ri (i = 1,2,...,n), execute the corresponding selected action a■ Receive an immediate reward r■ Observe the new state s'■ For each robot R,(i = 1, 2,...,n), update its table entry for Q(s,a1,(22;^asfollows:2,a,    ^) = (1- 0)Q,^a2,,    ^) + efr + max Q,[5 ^2^di),a,     where 0 < s < 1 is the learning rate and 0 < ,u <1 is the discount rate.■ ,^r*0.999The basic idea of the Sequential Q-learning algorithm comes from a strategy that istypically employed by humans when they cooperatively push a large and heavy object to agoal location. In transporting the object, the group members typically do not select theirpushing locations and forces concurrently. Instead, one of them will select his pushinglocation and apply a force first. Then, by observing the first person's action, the secondperson will select his pushing action (i.e., a cooperative strategy) accordingly. Next, thethird person will determine his action by observing the actions of the first two people. Inthis manner, when all the people in the group have determined their actions, they willexecute the overall pushing actions through simple synchronization. This cooperativestrategy is known to work well in manual tasks. By taking inspiration from humancooperation, the same strategy is used here to develop the Sequential Q-learning algorithmfor multi-robot transportation tasks.In the Sequential Q-learning algorithm, in each step of decision making, the robots donot select their actions simultaneously. Instead, they select their actions one by oneaccording to a pre-defined sequence. Before a robot selects its action, it observes thenature of actions that have been chosen by the robots who precede it in the sequence. By(4.2)114not selecting exactly the same actions as those of the preceding robots and instead bytaking a complementary cooperative action, this robot is able to successfully solve thebehavior conflict problem and promote effective cooperation with its teammates.The benefits of the Sequential Q-learning algorithm are obvious. Because each robotobserves the actions of the robots preceding it in the sequence before it makes its owndecision, the Sequential Q-learning algorithm is likely to avoid conflicts and promotecomplementary actions, resulting in more effective cooperation and faster convergencethan a single-agent Q-learning algorithm in a multi-robot cooperative task. Furthermore,because a robot does not need to observe the actions of all its teammates, the SequentialQ-learning algorithm results in a significantly smaller learning space (or Q table) than thatfor the team Q-learning algorithm. In addition, by selecting their actions according to aparticular pre-defined sequence, the robots avoid selecting the same action which wouldresult in behavior conflicts and cooperation failure in multi-robot cooperativetransportation tasks.4.2.2 Kalman Filtering Based Reward EstimationWhen the conventional Q-learning algorithms (single-agent Q-learning or team Q-learning) are employed in a multi-robot environment, there are several factors that confuseor even fail the learning agents. First, as argued before, a multi-robot environment isessentially dynamic. In this dynamic environment the learning agent finds it very difficultto assess all the rewards received and update its learning process effectively.Second, usually a multi-robot environment is only partially observed. Due tolimitations in the sensor performance, a robot or an agent sometimes cannot see theregions of the environment that are obstructed or far away from it. Consequently, it onlyobserves a part of the world. The feature of partial observation makes it difficult for alearning agent to assess an action under a specific state. In particular, the agent may beconfused when an action receives different rewards under the "same" world state.Third, the credit assignment is rather difficult in multi-robot learning. For example,when multiple robots cooperatively push a box to a goal location, how to split theobserved global reward among the robots (e.g., how to assign a lazy or disruptive robot anegative reward and a hard-working robot a positive value) is an important issue. In the115single-agent Q-learning or team Q-learning algorithm, the global reward is directlyassigned to each robot. It means that each robot receives the same reward regardless ofwhether it makes a positive contribution to the collective task. Due to this global rewardpolicy, the simulation results in Chapter 3 have shown that the learning agents canconverge slowly or even become confused in a multi-robot cooperative task. It followsthat updating the Q-values with a global reward is not proper in a dynamic and unknownenvironment. How to estimate the real reward of each robot from the global reward signalhas become a key topic in multi-robot Q-learning.The issue of Kalman filtering based reward estimation was first studied by Chang andKaelbling (2003). In their paper, they proposed to model the effects of unobservable stateson the global reward as a random white noise process, and employed a Kalman filter toremove the noise and continuously extract the real reward from the global reward. Whiletheir simulation results validated this approach, they had made some crucial assumptionsto simplify their model, which degraded its value to some degree. Among thoseassumptions, most importantly, they model the effect of unobservable states on the globalreward as a zero-mean Gaussian noise process, which may be not correct in a real multi-robot system. Another limitation of their approach is the necessity to guess a variancevalue o the random noise process before the algorithm is executed. Therefore, someimprovements are needed before the approach can be applied to a real multi-robot system.In this section, a modified approach to estimate real rewards from global rewards,which is based on the original approach given in (Chang and Kaelbling, 2003), isproposed. A more general noise process is assumed and a method to estimate and updatethe parameters of the noise process is included in the algorithm to improve itsperformance. In addition, the approach of Kalman fitering based reward estimation isintegrated into the sequential Q-learning algorithm presented in the previous section, topromote cooperation among robots and speed up the convergence process. The system modelIn the approach of Kalman filtering based reward estimation, the global rewardreceived by a learning agent is thought to be the sum of its real reward and a random noisesignal caused by changes in the environment and unobservable world states. In particular,116if the agent is in the world state i at time t and it receives a global reward g„ then it canbe expressed asg, = r(i),+b,^ (4.3)where r(i), is the real reward the agent receives in state i at time t , and 4 is an additivenoise process which models the effect of the unobservable states on the global reward andevolves according tob, + zr , z, N(,u,c,„)^ (4.4)bt+12Here z, is a Gaussian random variable with mean p and variance a. Based on theseassumptions, the system model may be presented as{x, = Ax,_, + w„ w,— N(A,Ei )g, = Cx, + v„ v, — N(0,E2 )( r(1),r(2),(4.5)where, x, =^is the state vector, Id is the total number of world states, w, isr(is1),b, )(isroxi00the system noise having a Guassian distribution with mean A =0and covariancematrix EI =r0 00 00^  ^ 0, and v, is the observation error which is a zero-mean Gaussian white noise.In addition, E 2 is set to zero because no observation error is assumed. The systemmatrix is A= I , and the observation matrix is C = (0       0 1, 0       0 1) 1x(1,1+1)where the 1, occurs in the position when state i is observed.1174.2.2.2 Kalman filtering algorithmKalman filter is a powerful tool to estimate states of a linear system with Guassiannoise. Here, the Kalman filtering algorithm is employed to dynamically estimate the realrewards and process noise through observing global rewards received by the learningagent. Then, the estimated real reward r(i), in state i at time t instead of the globalreward g, will be used to update the Q-table. The standard Kalman filtering algorithm isbased on a system model with zero-mean Gaussian white noise. In order to employ thestandard Kalman filtering algorithm, the system model in equation (4.5) is transformed to{xi = Ax,_,+ Bu + et ,^e, — N(0, Zi ) (4.6)g,=Cx,+v„^v,— N(0,E 2 )'o r0 0where, B = , u= : , e, is a zero-mean Gaussian white0 01noise signal with covariance matrix El , and E2 = 0 .Based on the model given by (4.6), the Kalman filtering algorithm is presented below. Initialize xo = (0,       , 0) T , the covariance matrix Po = I , u= (0       0 0) T2 = 0.1, and t =1. While (true)■ From current state i , select an action a with Q-learning and execute it, observea new state k , and receive a global reward g,.■ Update the estimate _it and its covariance matrix Pt according to= Ax,_,+ BuPt =^+El■ These priori estimates are updated using the current observation g,.C,=(0,•••,1,,0,•••,0,1), whose ith element is 1.)(1,1+0.(1,1+0 \/-1)(Is1+0.1118K,=13,c 7 (c,Pc r,) -1x, = x, + K, (g, — Cric,)P = (I — K,C,)P,■ Replace the global reward r in equation (4.2) with x,(i) to update the Q-table.■ Re-estimate the mean ,u and variance 6,,2 of the noise process with the historyof b, (i.e., xt-width^+ 1),   —,x,(1s1 + 1) ), and update u and El .■ t^t +1, i <—k End While4.2.2.3 Online parameter estimation of the noise processIn the approach proposed by Chang and Kaelbling (2003), a value of a 2 , thecovariance of the process noise, has to be guessed before the Kalman filtering algorithm isexecuted. However, in a real multi-robot task, it is usually very difficult to guess thiscovariance value. In this section, an online estimation method is presented to estimate thecovariance 63,2 with the history of b, (i.e., x,(IsI + 1) ). Moreover, the mean p is estimatedonline because it is not assumed to be zero in the system model presented in equation (4.6).The estimation method is given below. Initialize po = 0 , o-2 0 =0.1, and t= 0. Execute the Kalman filtering algorithm for n (n > 200) iterations with constant poand awo , and record the historical values of x,, 1 (1s1 + 1), xt+2 (Isl +1),      ,x,+„ (Is' + 1) . While (true)■ Estimate the mean and covariance as follows:1^nfit = n —1 j=2E(xt,.(Is1+1)—x,i_i (Isi+ 1))in2Cr = ^ L(x t+i + 1)— X i+i_1 (Isi + 1) — 10 2n- 1 1=2■ Execute the Kalman filtering algorithm with p, and cr2„ and record the valueof xt+.+I(Is1+ 1)  119  t+- t+1  End While4.3 Simulation ResultsIn order to validate the SQKF algorithm developed in this chapter, a multi-robot boxpushing simulation system is used as shown in Figure 4.1.Figure 4.1: The simulated multi-robot box-pushing system.Here, three autonomous robots attempt to cooperatively push a box to a goal locationand orientation in an unknown and dynamic environment. The box is sufficiently big andheavy so that no single robot can complete the task alone. In addition, each robot onlypossesses local sensing capability and the obstacles may occur and disappear randomly inthe path, so the robots are unable to plan the trajectory in advance. In such an unknownand dynamic environment in continuous space, a static planning approach will fail.A system is developed using the Java language to simulate the multi-robot box-pushingsystem equipped with the proposed SQKF algorithm presented in the previous sections.The dimensions of the environment are 950x700 units and the dimensions of the box are120x80 units. There are three robots, denoted by circles with a radius of 2 in thesubsequent figures, which push the box to the goal location. The Java RMI (RemoteMethod Invocation) is used to implement a fully distributed multi-robot simulation system.120The definitions of the environmental states, robot actions and reward functions of theQ-learning algorithm are given in Chapter 3. Without any prior knowledge, the threemobile robots employ the SQKF learning algorithm to find good cooperation strategies sothat the box is pushed to the goal location as quickly as possible. A successful example ofbox-pushing is shown in Figure 4.2.Figure 4.2: A successful example of multi-robot box pushing with the new SQKFalgorithm.Figure 4.3 shows the effect of training on the number of steps per round of box-pushing(The robots pushing the box from the start position to the goal position is counted as around).1212 5 x103— — - before trainingafter training0.5!500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500 8000Step number per roundFigure 4.3: Probability density estimate of the number of steps per round beforeand after training.In Figure 4.3, 500 rounds of box pushing are first executed, the number of steps perround is recorded, and its probability density is estimated based on the 500 samples. Wheneach round of box pushing is started, the Q-table entries of each robot is filled with zeroswhich means the robots do not have any prior knowledge of pushing the box. Then, therobots are trained to improve their Q tables by pushing the box for 10,000 roundscontinuously. After training, another 500 rounds of box pushing are executed, the numberof steps per round is recorded, and its probability density is estimated. Similarly, everytime a round of box-pushing is started, the Q-table of each robot is reloaded with the Q-table learned in the training stage. Figure 4.3 shows that the robots spent around 1,500steps to push the box from the starting position to the goal location before the trainingstage, while they only spent approximately 300 steps to complete the same task aftertraining, which demonstrates the effectiveness of learning. In addition, the variation of thenumber of steps per round after training is much smaller than that before training.0 o122Therefore, it is clear that the robots have learned good cooperative policies and improvedtheir performance significantly in the training stage.Besides the number of steps per round, the average global reward per round alsoincreases after training. Figure 4.4 shows the effect of training on the global reward.0.2^0.4^0.6^0.8^1^1.2^1_4^1_6^18Average globe) reward per roundFigure 4.4: Probability density estimate of the average global reward per roundbefore and after training.The average global reward per round is obtained by dividing the sum of global rewardsreceived by the number of steps in a round of box-pushing. Figure 4.4 shows theprobability density estimates of the average global reward per round before and aftertraining. Both probability density curves are obtained by analyzing 500 samples of data.From Figure 4.4, it is clear that the average global reward is approximately 0.05 beforetraining while it becomes 1.15 after training. Because a bigger average global rewardindicates that the robots employed a better cooperation strategy in this round of box-123pushing, training/learningperformance and help thepushing task.Figure 4.5 shows thecontinuous box-pushing.is effectively used in the new SQKF algorithm to improve itsrobots learn good cooperation policies in a multi-robot box-history of number of steps per round in 1,000 rounds of1100^200^300^400^500^600^700^800^900^1000Round #300025002000a20_1500Eo_55a)1000500take only approximately 400 steps to complete the same task. It means that the robots havelearned correct strategies to cooperatively push the box in a faster manner and moresuccessfully avoid obstacles in the environment.The above observation is also validated by Figure 4.6 which shows the history of theaverage global reward per round in 1,000 rounds of continuous box-pushing.32.5o 1.5ascr)as0.5100^200^300^400^500^600^700^800^900 1000Round #Figure 4.6: The average global reward per round in 1,000 rounds of continuous box-pushing.From Figure 4.6, it is clear that the average global reward per round is increased whenthe robots obtained more experience of box-pushing. It increases from approximately 0.3to 1.4 in 1,000 rounds of continuous box-pushing. Because a bigger average global rewardindicates that the robots have employed good cooperation strategies to move the box withlow probability of colliding with obstacles, Figure 4.6 confirms that the new SQKF12511algorithm is effective in helping the robots to improve their performance in carrying outthe cooperative task.The global rewards within one round are observed in order to understand the effect oftraining on the performance of the robots. Figure 4.7 shows the global rewards received bythe robots in a round of box-pushing before training, while Figure 4.8 shows theprobability density estimate of the global rewards. 1Toam500 1000 1500Step #Figure 4.7: The global rewards received by the robots in a round of box-pushing beforetraining.1260 350 3 -0 250.10_05-4^-2^0^2^4^6Global rewardFigure 4.8: The probability density estimate of the global rewards in Figure 4.7.Figure 4.7 shows that the robots take 1,500 steps to cooperatively push the box to thegoal location while avoiding the obstacles in the environment. The global rewards are veryrandom in the course of transportation. Figure 4.8 describes the statistical nature of theglobal rewards received when the box is moved from the start position to the goal location.The average value of the global rewards is approximately 0 and the rewards fluctuate from-4 to +4. Figure 4.8 and Figure 4.7 indicate that the robots receive both positive andnegative global rewards in a round of box-pushing before training. It means that at thisstage the robots are still exploring the world states and collecting experience to improvetheir Q-tables.Figure 4.9 show the global rewards received in a round of box-pushing after thetraining stage, and the corresponding statistical result is shown in Figure 4.10.127 Ji 50^100^150^200^250^300^350Step #Figure 4.9: The global rewards received by the robots in a round of box-pushing aftertraining.0.50.450.40.35Z." 0.30_25F5ys2—  0o_^-^0^1^2^3^4^5^6Global rewardFigure 4.10:.The probability density estimate of the global rewards in Figure 4.9.128o o 500 1000 1500^2000^2500^3000Step number per round after training3500 4000 45002.5 x 10-3SQKF algorithm— — - Conventional Q-learning20.5From Figure 4.9 and Figure 4.10 it is clear that the global rewards are increasedsignificantly after training, thereby decreasing the number of steps to 300, which meansthe robots have leaned good cooperation strategies in the training stage and improved theirperformance significantly. In particular, now the average value of the global rewards isapproximately 1.5. By comparing Figure 4.7 and Figure 4.8 with Figure 4.9 and Figure4.10, it is convincing that the new SQKF algorithm can improve the performance of therobots to cooperatively transport the box through training.Next the performance of the new SQKF algorithm is compared with that of aconventional single-agent Q-learning algorithm for the same multi-robot task. Thecomparison results are presented in Figure 4.11 and Figure 4.12.Figure 4.11: Probability density estimate of number of steps per round in 500 rounds ofbox-pushing.1291.8 21.64.5— — - Conventional Q-learningSQKF algorithm43.50.2^0.4^0.6^0.8^1^1.2^1.4Average global reward per round after training0.500Figure 4.12: Probability density estimate of average global reward per round in 500rounds of box-pushing.For both learning algorithms, the robots are trained for 10,000 rounds of box-pushingfirst to improve their Q-tables. Then another 500 rounds of box-pushing are executed, andthe number of steps and the average global reward per round are recorded for statisticalanalysis. From Figure 4.11 and Figure 4.12, it is clear that the new SQKF algorithm hasbetter performance than the conventional Q-learning algorithm in the multi-robot box-pushing project. In particular, the new SQKF algorithm takes an average of 350 steps tocomplete the task while the conventional Q-learning algorithm takes almost 1,200 steps tocomplete the same task. Similarly, the new algorithm receives a higher average of globalrewards than the conventional Q-learning algorithm.1304.4 SummaryMost multi-robot operating environments, such as planet surfaces, are dynamic andunstructured where the terrain features and the obstacle distribution change with time.Furthermore, even if the external physical environment is stationary, the overall systemstructure is still dynamic from the viewpoint of a single robot because other robots maytake actions thereby changing the environment of that particular robot continuously. Theenvironmental dynamics makes multi-robot decision-making quite complex, and thetraditional task planning becomes ineffective because a planned optimal policy canbecome inappropriate a few minutes later due to changes in the environment.By observing human capabilities of dealing with a dynamic environment, it is easy toconclude that a human employs not only planning or reactive (behavior-based) techniquesbut also learning techniques to successfully complete a task in such an environment.Through learning, a human learns new world states, finds optimal actions under thesestates, and improves his planning and reactive techniques continuously. Learning enableshim to deal with unexpected world states, decrease uncertainties in the environment, andmake his decision-making capability more robust in a dynamic environment.Therefore, machine learning has become important in multi-robot applications. Amongthe existing machine learning approaches, Reinforcement Learning, especially Q-learning,is used most commonly in multi-robot systems due to its simplicity and good real-timeperformance. Chapter 3 addressed how to employ two Q-learning algorithms, the single-agent Q-learning and the team Q-learning, to find optimal cooperative strategies for amulti-robot box-pushing task. While they help improve the performance of the robots in acooperative task, there are some serious disadvantages in both Q-learning algorithms. Inparticular, the single-agent Q-learning has a slow convergence speed when applied to amulti-robot task due to the presence of dynamic environment, and the learning space ofthe team Q-learning algorithm can become too extensive to be practical in a real multi-robot application. Consequently, it has become important to develop a new reinforcementlearning algorithm that is suitable for multi-robot decision-making.In this chapter, a modified Q-learning algorithm suitable for multi-robot decisionmaking was developed. By taking inspiration from the execution of heavy-duty objectmoving tasks by a team of humans, through arranging the robots to learn in a sequential131manner and employing Kalman filtering to estimate real rewards and the environmentalnoise, the developed learning algorithm showed better performance than the conventionalsingle-agent Q-learning algorithm in a simulated multi-robot box-pushing project.132Chapter 5Computer Vision in Multi-robot Cooperative Control5.1 OverviewMulti-robot systems have become a promising area in robotics research (Cao et al.,1997). In such a system, several autonomous robots cooperatively work to complete acommon task. Due to the advantages of robustness, flexible configuration, potential highefficiency and low cost, and close similarity to human-society intelligence, multi-robotsystems have attracted researchers in the robotics community.In a multi-robot system, it is important for a robot to know the latest poses(positions/orientations) of other robots and potential obstacles in the environment so as tomake rational decisions. There are many means to measure the pose of a robot or anobstacle; for example, using sonar or laser distance finders. However, most multi-robotsystems employ digital cameras for this purpose. The main reasons are: (1) A digitalimage provides a rich source of information on multiple moving objects (robots andobstacles) simultaneously in the operating environment, (2) The vast progress in computervision research in recent years has made it possible to build fast and accurate visionsubsystems at low cost, (3) Use of cameras by robots to observe and understand theiroperating environment is as "natural" as the use of eyes by the humans to observe theworld.Some work has been done to monitor multiple moving objects in a dynamicenvironment by using computer vision. Stone and Veloso (1998) studied a multi-robotsoccer system. In their work, they used a global camera to monitor the positions of therobots and objects in the game. Veeraraghavan et al. (2003) developed a computer visionalgorithm to track the vehicle motion at a traffic intersection. A multilevel approach usinga Kalman filter was presented by them for tracking the vehicles and pedestrians at anintersection. The approach combined low-level image-based blob tracking with high-levelKalman filtering for position and shape estimation. Maurin et al. (2005) presented a133vision-based system for monitoring crowded urban scenes. Their approach combined aneffective detection scheme based on optical flow and background removal that couldmonitor many moving objects simultaneously. Kalman filtering integrated with statisticalmethods was used in their approach. Chen et al. (2005) presented a framework forspatiotemporal vehicle tracking using unsupervised learning-based segmentation andobject tracking. In their work, an adaptive background learning and subtraction methodwas applied to two real-life traffic video sequences in order to obtain accuratespatiotemporal information on vehicle objectsNevertheless, significant difficulties exist in employing the conventional computervision algorithms in a multi-robot system. The first difficulty results from the real-timeconstraints in the specific system. In a multi-robot system, multiple mobile robots movequickly in an unpredicted manner, and the vision system needs to capture their positionsand orientations within a very short time. Consequently, it is not practical to employ aneffective yet time-consuming vision algorithm. The conventional algorithms, such as thoseused by others in the work outlined above, are too complex for meeting real-timeconstraints in a multi-robot system. The second difficulty arises when multiple mobilerobots move in a large area that has different levels of illumination. Multi-robot systemsusually work in large and unstructured environments with uneven lighting conditions. Therobots usually move into and move out of sub-areas having different brightness levels. Inorder to track the robots effectively in such an environment, the vision system must berobust with respect to different illumination conditions. However, most of existingalgorithms do not consider this problem.A fast color-blob tracking algorithm has been developed (Wang and de Silva, 2006d)for the multi-robot projects which are carried out in the Industrial Automation Laboratory(de Silva, 2005; Karray and de Silva, 2004) of the University of British Columbia (UBC).The algorithm meets the real-time constraints in our multi-robot projects, and caneffectively track the positions and orientations of the robots when they move quickly in anunstructured environment with uneven illumination. In the next section, the multi-robottransportation system in the laboratory is introduced. In Section 5.3, the developedcomputer vision algorithm is presented. In Section 5.4, a representative experimentalresult obtained through the developed approach is presented. Based on the proposed color-134Robotic Ann Object- Force SensorsFixedObstaclesMovableObstaclesSonarEthernet NetworkMobile Robotblob tracking algorithm, a multi-robot route planning approach is designed and presentedin Section 5.5. The conclusions are given in Section Multi-robot Transportation SystemThe objective of our multi-robot transportation project in the Industrial AutomationLaboratory (IAL), is to develop a physical system where a group of intelligent robots workcooperatively to transport an object to a goal location and orientation in an unknown andunstructured dynamic environment. Obstacles may be present and even appear randomlyduring the transportation process. Robot control, sensing, multi-agent technologies, andmachine learning are integrated into the developed physical platform to cope with themain challenges of the problem. A schematic representation of the initial version of thedeveloped system is shown in Figure 1.1 and is repeated in Figure 5.1. The latest systemin the laboratory uses state-of-the art mobile robots together with fixed-base articulatedrobots (one state-of-the art commercial robot and two multi-modular robot prototypesdeveloped in our laboratory).Digital CameraFigure 5.1: The developed physical multi-robot system.In this first prototype there is one fixed robot and one mobile robot, whichcooperatively transport an object to a goal location. During the course of the transportation,135the robots have to negotiate the cooperation strategy and decide on the optimal locationand magnitude of the individual forces that would be applied by them so that the objectwould be transported quickly and effectively while avoiding any obstacles that might bepresent in the path. In some special cases, they also need to consider whether to move anobstacle out of the way rather than negotiating around it. Other considerations such as thelevel of energy utilization and damage mitigation for the transported object may have tobe taken into account as well. A global camera is used to monitor and measure the currentlocation and orientation of the object. The environment is dynamic and unpredictable withsome movable obstacles which may appear randomly, and some fixed obstacles. Therobots, the camera, and the sensors are separately linked to their host computers, which areconnected through a local communication network, to implement complex controls andmachine intelligence.A robotic arm and a mobile robot are used in the system shown in Figure 5.1. Thesystem provides an opportunity to observe the cooperation between two robots withdifferent dynamics and obtain useful information on the system behavior. Moreover, withthe ability of exact localization of the robotic arm and the capability of broad groundcoverage of the mobile robot, it is possible to integrate different advantages of the tworobots in a complementary manner, so as to improve the effectiveness of the overallsystem.The vision subsystem is rather crucial in the current project. It is expected toeffectively and accurately track the poses (positions and orientations) of the robots and theobject in real-time. Several color-blobs with different colors are used to identify the posesof the robots and the object, thereby simplifying the programming requirements. Twodifficulties arise in this vision subsystem. The first one is its speed of response, which issubjected to real-time constraints. In order to simultaneously find the current poses ofmultiple robots and the object, all the image processing work, such as image acquisition,pre-processing, searching the positions of the color-blobs, and coordinate transformation,must be completed within 500 ms so that the decision-making subsystem can effectivelytrack and update the new world model and make correct decisions. The second difficultyis caused by the uneven lighting in the environment, which is unavoidable in a naturalenvironment. In our projects, it is found that uneven lighting greatly degrades the136Color-Blob Template MatchingCoordinateTPoses of the robots and the objectperformance of the vision subsystem, and it represents a primary disturbance in detectingthe positions of the color blobs. Because uneven illumination is inevitable in a largephysical environment, it is imperative to develop a computer vision algorithm that isrobust to uneven lighting conditions.5.3 The Fast Color-blob Tracking AlgorithmIn order to cope with the two main problems in our multi-robot project, as mentioned inthe previous section, a fast color-blob tracking algorithm is developed. This algorithm ispresented in Figure 5.2 and its steps are described in detail next.Original ImageFigure 5.2: The color-blob tracking algorithm.5.3.1 Removal of the Lighting DisturbancesAs discussed before, uneven illumination in the operating environment greatlydegrades the performance of the color-blob tracking algorithm. In the first step of thepresent algorithm, the disturbances caused by uneven illuminations are removed bytransforming the original image from the RGB (Red, Blue, and Green) color space to the137HSI (Hue, Saturation, and Intensity) color space and removing its Saturation and Intensitycomponents.When the image is captured by the camera, it is usually represented in the RGB colorspace. Each color corresponds to a point in the RGB space; i.e., any color is composed ofRed, Green, and Blue with different intensity levels. Although the RGB color space is agood tool to generate color, it is not particularly suitable for describing color. The HSIcolor space is another well-known color model, which is effectively employed in thedigital image processing field (Gonzalez and Woods, 2002). Its most important advantageis that it corresponds closely with the way humans describe and interpret color. Anotheradvantage is that it successfully decouples the color and the gray-level information in animage.In this section, in order to remove the disturbances caused by uneven illumination inthe environment, the original color image is transformed from the RGB color space to theHSI color space. Then the Saturation and the Intensity components in the image aredeleted and only the Hue component is retained. Because the Hue component in the imagerepresents the color attribute, the Saturation component represents the amount of whitelight that is added into a monochromatic wave of light, and the Intensity componentrepresents brightness, by separating the Hue component from the Saturation and theIntensity components, the disturbances arising from uneven illumination in theenvironment are effectively removed.Equations (5.1)-(5.3) indicate how to convert a color image from the RGB color spaceto the HSI color space (Gonzalez and Woods, 2002).{ 19^if B GH = (5.1)360 — t9 if B > Gwith0.5 [(R — G) + (R — B)]0 = cos-1 {[(R — G)2 + (R — B)(G — B)1 1/23 (5.2)S^ G, B)]=1 [min(R,(R+G+B)I=-1 (R+G+B) (5.3)31385.3.2 Statistical Feature FilteringIn this step, a type of statistical feature filtering is employed to remove the colorswhich are not related to the color blobs of interest in the image so that the succeedingalgorithm will be able to easily determine the positions of these color blobs.Figure 5.3: A sample image of a color blob.A sample image of a color blob is shown in Figure 5.3. It is assumed that there aren x n pixels in this image and h(i,i) represents the hue value of the pixel (i, j) . Theaverage hue value h and the standard deviation a of this sample color blob can becalculated as follows:n nEzh(i,j)=  i=1 j=1n 2E nELh(i j) --h12i=1 j=1n 2 — 1(5.4)(5.5)Given k color blobs in the image, and their corresponding average hue valuesh1 ,     hk and standard deviations a1 ,     •, 6k , the statistical filtering algorithm that ispresented next is executed.For each pixel in the original imageIf its hue value is within the set ofth^— 1.2a1 h ^ h, + 1.2a, or-1.262 h ^ h2 +1.172or • • -hi —1.26i Sh ^ hior • • • orhk -1.2ak ^h ^ hk +1.2ak )139it will not be changedElseIts hue values will be replaced with the value of O.5.3.3 Color-blob Template MatchingAfter statistical feature filtering, apart from the regions of color blobs there still existvery small regions of pixels which are caused by unknown disturbances that arise when animage is captured by a camera. A color-blob template matching algorithm is developed inorder to remove these small regions of pixels.A color-blob template is a matrix of hue values of a given color blob. Thecorresponding m x m matrix is generated off line from the image of the color blob, and isused to search its position in the entire image of the environment which is captured by theglobal camera. Given a n x n hue matrix H of the environment image, and a m x mtemplate matrix T , the following color-blob template matching algorithm is executed:• Initialize min-distance=100000, blobpos=(1,1)• For each element H(i, j) in the matrix of H,distance=+ k -1, j + 1 -1)- T (k,1)) 2k----I 1=1m 2if (distance < min-distance) then• min-distance=distance• update: blob_pos=(i,j)• Output blobpos5.3.4 Coordinate Transformation and Pose ReconstructionOnce the positions of all color blobs in the image are obtained, it is necessary totransform their coordinates in the mage to those in the real world. Because the positionsare defined in a 2-D plane, this transformation is straightforward. Similarly, given all colorblob positions in the real world, it is easy to reconstruct the poses of all robots and the140object in the environment. Usually, two blobs with different colors are needed todetermine the position and orientation of a robot or an object. By checking thecorresponding color blobs of each robot or object, the poses of all robots and the objectcan be reconstructed rather quickly.5.4. Experimental Result5.4.1 The Test-bedA test bed is developed for multi-robot projects in the Industrial AutomationLaboratory (IAL), as shown in Figure 5.4. It includes a 4-joint 2D robotic arm, anAmigoTM mobile robot, a JVC TM color CCD (charge-coupled device) camera, twoTransducer Techniques TM force sensors, a rectangular box (object), an indexing table, aServoToGoTM motion control board, a Matrox TM frame grabber board, a CISCO TMwireless LAN access point and a Pentium NTM computer with 512M RAM. The roboticarm is particularly designed to move an object on a 2-D surface. Its four joints (two armjoints and two wrist joints) are equipped with optical encoders to accurately detect thejoint motions. One force sensor is mounted in its end effector. The encoder and force dataare acquired by the host computer via the motion control board.The mobile robot is a two-wheel-driven intelligent machine with two 500-tick encoders,8 sonar units, wireless LAN communication ability and a force sensor in its end effecter.The mobile robot is connected to the host computer with the wireless LAN via a CISCO TMwireless Ethernet access point equipment. It can wander within a radius of 50 meterswithout any communication interruption.141Figure 5.4: The experimental system in IAL at UBC.A color CCD camera with a resolution of 768x492 pixels is used to monitor thepositions and orientations of the box, the obstacles, and the mobile robot. The visionimages are captured into the computer by a frame grabber board at a sampling rate of 30MHz.5.4.2 Experimental ResultIn order to speed up the image processing task and reduce the computational load ofthe vision subsystem, the mobile robot and the box are marked with color blobs (a purpleand a green blob on the mobile robot; an orange and a blue blob on the box), whichrepresent their current poses, as shown in Figure 5.5. The poses of the mobile robot andthe box are determined by quickly localizing the four color blobs from the image,followed by the necessary coordinate transformation operations.142Figure 5.5: An original image captured by the CCD camera (Note the color blobs on themobile robot and the box.)The first step of the developed approach is to convert the original image from the RGBcolor space to the HSI color space so as to remove the disturbances caused by unevenlighting conditions. The result of this processing step is shown in Figure 5.6. Bycomparing Figure 5.5 with Figure 5.6, it is noted that the influence of uneven lighting iseffectively removed and only the hue of the image is retained.143Figure 5.6 The hue component of the original color image in the HSI color space.The second step is statistical feature filtering, which separates the color blobs ofinterest from the background. In Figure 5.5, there are four types of sample hue (purple,green, orange, and blue), which have to be detected from an image. First, the sample huevalues are extracted from the region of interest. For example, the 4*4=16 pixels, whichare around the center of the purple blob on the mobile robot in Figure 5.5, are extracted.Then the average hp and the standard deviation crp are computed from the 16 sample datavalues, using the equations (5.4)-(5.5) in Section 5.3. Here, the subscript p denotes thecolor "purple." Next, hp±1.2*cfp is considered as the range of "purple hue," which is ofinterest here. Through similar operations, the statistical results of hg±1.2*ug, h0+1.2*a0,andhb±1.2*crb are obtained as well. In this experiment, the hue parameters of the four colorblobs are as follows:hp = 0.93948, o-p = 0.00655hb = 0.54805 , ab = 0.03896hg = 0.35959, ag = 0.00527144ho = 0.02253, cy.„ 0.001The hue component image in Figure 5.6 is filtered pixel by pixel according to thealgorithm described in Section 5.3.2. Only the pixels with hue values falling in the rangeh,±1.2*cri (i=r, g, o, b) are retained. The filtering result obtained in this manner is shown inFigure 5.7.Figure 5.7: The hue component image after the background is filtered.From Figure 5.7 it is clear that the background of the image is virtually removed andonly the four types of sample hue (purple, green, orange and blue) are retained, which aredistinguished with the distinct values 0.3, 0.5, 0.7 and 0.9. However, because the hue ofthe mobile robot is quite similar to the orange blob and there still exist some small regionsof disturbance, the positions of the color blobs of interest are not very clear in Figure 5.7.The third step is to use the template matching method described in Section 5.3.3 andsome known space information (for example, the distance between the blue and the purpleblobs is always greater than the distance between the blue and the orange blobs) to removethe small unwanted regions and the mobile robot hue in Figure 5.7. The template matricesof four color blobs are generated off line from their original images, which are presented1450.937230.935400.929460.932830.938410.936630.934570.934960.940770.939730.935350.939240.54085 0.50000 0.55731. -0.54336 0.52108 0.538600.60163 0.62418 0.613590.53026 0.50000 0.536120.35129 0.36415 0.350650.36254 0.36306 0.354780.35636 0.36244 0.356360.35819 0.35960 0.353640.02252 0.02253 0.022530.02253 0.02253 0.022530.02253 0.02253 0.022530.02253 0.02253 0.02253below.0.951440.95047purple  = 0.945580.949600.532610.53026Tblue^0.589460.509490.363590.36787Tgreen^0.363010.365950.022530.02253orange^0.022530.02253The processing result is shown in Figure 5.8.Figure 5.8: The result after filtering Figure 5.7 using template matching.146From Figure 5.8 it is observed that the positions of the four color blobs are accuratelyestablished. Through coordinate transfoimation operations, the positions and orientationsof the mobile robot and the box are reconstructed, as shown in Figure 5.9.Figure 5.9: The reconstructed result of the poses of the robots and the box.Because the image processing scheme presented here does not require too manycomplex operations and since some operations may be carried out off line, the total timeof acquiring and processing an image is found to be less than 500 ms, which meets therequirement of the transportation task in the present application.5.5 Multi-robot Route PlanningThe positions and orientations of multiple robots and the object can be trackedsimultaneously using the color-blob tracking algorithm presented in this chapter.Consequently, a potential-field-based multi-robot route planning algorithm can bedesigned.A main challenge in the multi-robot route planning is the avoidance of collisions withother robots and objects. Finding an optimal and collision-free route in the presence ofmultiple robots is rather difficult and requires complex and time-consuming computation.147Instead, a simplified multi-robot route-planning approach is employed in this chapter,which is summarized below.1) Before carrying out route planning, each robot broadcasts its current position and thedestination position to all other robots in the team.2) Each robot broadcasts the positions of obstacles detected by its sensors to its teammembers.3) Each robot fuses the obstacle information received from its peers and its own sensors,and computes a global obstacle distribution.4) Each robot computes the moving distance between its current position and itsdestination position. Based on the distance information of all robots, a robot queue isgenerated by sorting the distances in the ascending order. In this queue, the robot with asmaller moving distance will be placed farther ahead.5) Based on the global obstacle distribution information, each robot computes its routefrom its current position to its destination position, using the potential-field-based path-planning approach. When computing the route of a robot, the other robots are regardedas special obstacles whose positions are determined as follows: If a teammate is aheadof the current robot in the queue mentioned in step 4, this teammate is regarded as anobstacle whose position is the same as its destination position. If a teammate is behindthe current robot in the queue, this teammate is regarded as an obstacle whose positionis the same as its current position. In this manner all the robots compute their routes atthe same time.6) After the routs of all the robots are computed in step 5, they will move to thedestinations one by one according to these routes. When one robot is in motion, theother robots will remain stationary. In particular, the robot in the first place in the queuewill move first, until it reaches its destination positions, and during this period the otherrobots will stay in their current positions. After the first robot reaches its destination,the second robot in the queue will begin to move towards its destination. Thisprocedure will be repeated until all the robots reach their destinations.Through the strategy of "Plan the routes simultaneously and move at different times,"the multi-robot route planning approach presented here avoids complex optimal path-planning algorithms. As a result it possesses superior real-time performance in practical148multi-robot tasks. The main drawback is that the robots are not allowed to move at thesame time. In some multi-robot tasks, as in multi-robot transportation, where the speedmay not be a critical specification, the present multi-robot route-planning algorithm can bequite advantageous.5.6 SummaryIn this chapter, a fast and robust color-blob tracking algorithm was presented for use ina developed prototype multi-robot transportation system. In order to remove thedisturbances caused by uneven lighting conditions, the original RGB image was convertedinto the HSI color space and only the Hue component information was retained. Methodsof statistical feature filtering and template matching were developed to remove thebackground and reconstruct the current world state. Because the algorithm did not involvea high computing load, real-time performance could be achieved. A simplified multi-robotroute-planning algorithm was developed, which uses the information acquired by thecolor-blob algorithm. The performance of the method was studied using an experimentalrobotic object transportation task.149Chapter 6Experimentation in Distributed Multi-robot CooperativeTransportation6.1 OverviewMulti-robot cooperative transportation has been an important subject of activity in therobotics community. A typical task of a multi-robot system is where two or moreautonomous robots attempt to cooperatively transport an object of interest to a goallocation in a static or dynamic environment. There is both theoretical and practicalsignificance to address issues of multi-robot transportation. First, a multi-robot systemprovides a good benchmark to validate and evaluate various artificial intelligence (AI)techniques such as machine learning, and planning or reactive approaches in a multi-agentenvironment. The traditional single-agent AI technologies face special challenges in amulti-robot or multi-agent task. These challenges include task assignment, dynamicenvironment, and distributed planning. New AI approaches such as multi-agent learning,and planning techniques are in development, which will deal with important issues thatarise in a multi-agent environment. Multi-robot transportation provides an ideal platformto test and validate these new techniques. Second, multi-robot transportation itself hasmany practical applications in industrial projects and scientific explorations. It is desirableto have multiple autonomous robots transport an object of interest in a dangerous orhazardous environment, rather than employing human operators for it.In this chapter, the techniques developed in the previous chapters, such as thehierarchical multi-robot architecture, the SQKF algorithm, and the multi-robot color-blobtracking algorithm, are integrated to implement a physical multi-robot transportationsystem running in a dynamic and unknown environment. These techniques are tested andvalidated with the physical robots in our laboratory (IAL). In addition, a hybrid150Obstacle Obstacle,u) ObstacleObstacleMobile robot Mobile robotServerI.TCP/IPTCP/IPMobile robotWired/wirelessRouterforce/position control strategy suitable for multi-robot transportation is developed andtested using the physical robots.6.2 Test bedAn experimental system has been developed in the Industrial Automation Laboratory(IAL) at The University of British Columbia to implement a physical multi-robot objecttransportation system. An overview of this system is presented in Figure 6.1.GoalFigure 6.1: The multi-robot object transportation system developed in IAL.151In the developed system, three autonomous mobile robots are employed to transport along box to a goal location on the lab floor. In order to focus on the techniques developedin the previous chapters, a simple global coordinate system is employed. Each robot isinformed its initial position and orientation in this global coordinate system when themulti-robot system begins to operate, and it estimates its latest position and orientation byrecording and analyzing the data of the encoders mounted in its wheels and the data of itscompass sensor while it moves in the environment. It is also able to inquire the currentpositions and orientations of its peer robots via the wireless network. In addition, therobots are assumed to know the global coordinates of the goal location in advance.However, the box and the obstacles are placed on the floor in a random manner.Therefore the robots have to search and estimate the poses of the box and the obstacles inthe environment by fusing the sensory data from its sonar, laser distance finder and CCDcamera. Furthemiore, the sensors are assumed to only detect objects within a radius of 1.5meters, and a robot is able to exchange its sensory information with its peers via wirelesscommunication to establish a larger local world state. In essence, this is a typical localsensing system and the robots are able to know only a local part of the whole environment.There are different color blobs on the four lateral sides of the box so that a robot canestimate the orientation and position of the box by identifying the color blobs with its ownCCD cameras and fusing this information with the data from its sonar and laser distancefinder. If an object without any color blobs is detected, it will be regarded as an obstacle inthe environment. The box with the color blobs is shown in Figure 6.2.Color blob (a) A big blue blob and a small blue blob on the same surface of the box.152(b) A blue blob on one side and two green blobs on another side.Figure 6.2: The color blobs used to identify the orientation of the box.Each robot makes decisions independently by itself. However, before it makes adecision, it needs to exchange information with its peers so as to form a cooperationstrategy, as presented in Chapter 4.There is a computer server in the developed system, which is used to synchronize theactions of the robots. However, it does not serve as a centralized decision maker becausethe system is fully distributed where each robot makes decisions independently.The mobile robots are manufactured by MobileRobots Inc. (formerly ActivMediaRobotics Company) which is a main player in the mobile robot market. In this project, onefour-wheel driven Pioneerm4 3-AT robot and two two-wheel driven PioneerTm 3-DXrobots are used. They represent an agile, versatile, and intelligent mobile robotic platformwith high-performance current management to provide power when it is needed. Built ona core client-server model, the P3-DX/AT robots contain an embedded Pentium IIIcomputer, opening the way for onboard vision processing, Ethernet-basedcommunications, laser sensing, and other autonomous functions. The P3 robots store upto 252 watt-hours of hot-swappable batteries. They are equipped with a ring of 8 forwardsonar and a ring of 8 rear sonar. Their powerful motors and 19cm wheels can reach speedsof 1.6 meters per second and carry a payload of up to 23 kg. In order to maintain accuratedead reckoning data at these speeds, the Pioneer robots use 500-tick encoders. Its sensingmoves far beyond the ordinary with laser-based navigation, bumpers, gripper, vision,153compass and a rapidly growing suite of other options. The appearance of the P3-DX robotand P3-AT robot is shown in Figure 6.3 and Figure 6.4.Figure 6.3: The PioneerTM 3-DX mobile robot.Figure 6.4: The PioneerTM 3-AT mobile robot.The P3-DX/AT robots with the ARIA software have the ability to: Wander randomly Drive under control of software, keys or joystick Plan paths with gradient navigation Display a map of its sonar and/or laser readings Localize using sonar and laser distance finder154 Communicate sensor and control information relating sonar, motor encoder, motorcontrols, user I/O, and battery charge data Provide C/C++/JAVA development platform. Simulate behaviors offline with the simulator that accompanies each developmentenvironmentWith its Laser Mapping & Navigation System and MobileEyes, the robot can mapbuildings and constantly update its position within a few cm while traveling withinmapped areas. With appropriate accessories, a user can view the view of the robotremotely, speak, play and hear audio and send the robot on patrol.In summary, the Pioneer-3 DX or AT robot is an all-purpose base, used for researchand applications involving mapping, teleoperation, localization, monitoring,reconnaissance, vision, manipulation, cooperation and other behaviors.6.3 JAVA RMIThe multi-robot system developed in this chapter is fully distributed. Each robotidentifies the current world state and selects the corresponding action independently. Inorder to find good cooperation strategies with other robots and complete a common taskquickly and effectively, each robot has to observe the environmental states and actions ofother robots frequently through fusing its sensory data and communicating with itsteammates. Therefore, frequent communication takes place among the robots.In the developed multi-robot transportation system as presented in Figure 6.1, theIEEE 802.11g wireless TCP/IP protocol provides the communication infrastructure for therobots. While the TCP/IP protocol is a reliable and convenient communication tool, it onlyworks in a low level of the communication hierarchy, which only transfers binary datapackages. However, the information which the robots in a multi-robot system may needto exchange, can include data with high-level structures like an action list and world states.As a result, the designer of a multi-robot system has to design his own high-levelcommunication protocol that can run on the TCP/IP protocol so that the robots canexchange information effectively.A considerable amount of energy and time is required to design a customized high-level communication protocol. It is not sensible for the designer of a multi-robot system to155develop such a customized communication protocol. In this thesis, this problem is solvedby using the JAVA RMI technology (Horstmann and Cornell, 2005).JAVA RMI (Remote Method Invocation) is the JAVA implementation of "DistributedComputing." Its basic idea is the use of a family of objects which can be located anywherein the network and collaborate with each other through the standard TCP/IPcommunication protocol. The concept of "distributed computing" or "distributed objects"is proposed to solve critical problems in the traditional client/server computing model. Inthe client/server model, if a client wants to request data from server, it has to first convertthe request into a specific data format which is suitable for transmission via the network.When the server receives this data package, it parses the requested data format, finds theresponse data in its database, converts it into the desired format, and send it to the client.When the client receives the response data, it also needs to parse the data format and showthe result to the users. The traditional client/server mode requires users to design their owndata transmission format and develop the corresponding conversion program, whichincreases the programming burden of the user and degrades the performance ofcompatibility.The idea of JAVA RMI is to let a client call a method of a remote object that runs inanother computer in the network. The JAVA RMI will automatically convert the requestand response data into an appropriate transmission format and administer thecommunication process for the users. Its principle is presented in Figure 6.5.156Call stubmethod locally Send marshalledCall servermethod locallyParameters Send marshaled Return resultIReturn value Return valueFigure 6.5: JAVA RMI distributed computingIn Figure 6.5, when a client wants to call a method of a remote object that runs on theserver, a proxy object termed stub will be called automatically. The stub object resides onthe client machine and it packages the parameters used in the remote method into a datapackage which is encoded with a device-independent protocol. This encoding process iscalled parameter marshalling whose purpose is to convert the data into a specific formatsuitable for communication over the network.When the server receives the data package sent by the stub object on the client machine,it will parse the package, find the object of interest on the server machine and pass theparameters to its corresponding method. The return value of the method will be packagedby the server and sent to the client. When the stub object on the client machine receivesthe return data package from the server, it will parse it, extract the return data and pass thedata to the client object locally.157The whole communication process is based on the TCP/IP protocol, and the dataconversion and transmission operations are administered by JAVA RMI automatically.With the help of JAVA RMI, a client can call a remote method on the server machine as ifit were a method of a local object.6.3.1 Remote InterfaceIn order to implement the JAVA RMI, some programming rules have to be established.First, a remote interface has to be set up, which is used to declare which remote methodscan be invoked by the client. A representative remote interface of the developed multi-robot project is presented below.public interface robotServerinterface extends Remote{void receiveRobotActionList(Vector actionList, boolean islnitialAction) throwsRemoteException;objPose getPose0 throws RemoteException;boolean awakeMeOthrows RemoteException;boolean isSleeping0 throws RemoteException;}This JAVA interface declares that there are 4 remote methods of the robot server whichcan be called by clients (other robots in the same environment). A robot can call thesemethods to collect the information of another robot, such as its current action list, currentpose, or its sleeping status. In addition, it can wake another robot by calling the remotemethod of awakeme .The remote method information is shared by both the client and the server, so theremote interface resides simultaneously on both machines.6.3.2 Server SideOn the server machine, an object is created which implements the remote methodsrevoked by the client. A program script of the server object is shown below.158public class robotServer extends UnicastRemoteObjectimplements robotServerinterface{public robotServer(robot r) throws RemoteException (theRobot=r;}public objPose getPose() throws RemoteException(Lock rlock rwlock.readLock();rlock.lock();try{return theRobot.getPose();}finally{rlock.unlock();}In addition, the server has to register the remote objects with the bootstrap registryservice so that the client can locate them and download their stub objects correctly. Theregistration codes are as follows:String robotName="Robot #2";robotServer rServer = new robotServer(myRobot);System.out.println("Binding robot server implementations to registry..');Context namingContext = new InitialContextO;namingContext.bind("rmi:"+robotName, rServer);System.out.println("The robot server is ready for calling');1596.3.3 Client SideIn order to revoke the remote methods on the server machine, a client program firstlooks for the remote objects on the server with a known IP address. This process isdescribed below.try{String url="rmi://";Context namingContext = new InitialContext();robotServerInterface rServer= (robotServerInterface)namingContext.lookup(ur1+"Robot #2');rServer.receiveRobotActionList(actionList,true);}catch (Exception e) {e.printStackTrace();}Furthermore, a security manager should be loaded by the JAVA RMI to prevent amalicious program from attacking the system. The program script is presented below.try{System.setProperty(' ava.security.policy", "client.policy");System.setSecurityManager(new RMISecurityManager());}catch (Exception e){e.printStackTraceO;160This script will load the JAVA security policy file: client.policy, which is stored in thesame directory as the client program. The security policy is used to restrict the port rangeand related operations. In the developed multi-robot transportation project, the content ofthe security policy file is as follows:grant{permission java.net.SocketPermission"*:1024-65535" "connect, accept";permission java.net.SocketPermission"localhost:80", "connect";permission java.io.FilePermission".II-", "read,write,delete";};6.4 A Physical Multi-robot Transportation SystemA physical multi-robot transportation system has been developed in the IndustrialAutomation Laboratory at The University of British Columbia, which integrates variousadvanced approaches developed in chapters 2 through 5. This is a distributed multi-robotsystem with local sensing capability. An experiment is carried out to validate theseapproaches in a real environment in the presence of sensor noise.In this experiment, to test the cooperative transportation capability of the developedsystem, three mobile robots equipped with the SQKF algorithm are employed to transporta big box from one end of the environment to the other one.Before the experiment is carried out, the simulation system described in the previouschapters is employed to train the robots so that their Q-tables or knowledge bases areimproved. After 10,000 rounds of simulated box-pushing, the Q-tables of three simulatedrobots are exported to the three physical robots to complete a physical cooperative box-pushing task.A different policy from the simulation system is employed in the experimental systemdeveloped here. In the simulation system described in Chapter 4, after each step of box-pushing, the robots will identify the new world state and select the corresponding actions161with the SQKF algorithm. However, it is time-consuming and not practical for the robotsin a real experimental system to change their actions (pushing positions) frequently.Instead, a new policy is employed in the experimental system, where the robots willcontinue with the actions selected in the previous step unless the reward earned in theprevious step is lower than a predefined threshold value. This policy avoids changing theactions of the robots very frequently, which will be expensive and time-consuming for aphysical multi-robot box-pushing system.The experimental results of cooperatively transporting a box in a real environment arepresented in Figure 6.6.(a)(b)162(0)(p)£91(f)(g)Figure 6.6: Multi-robot cooperative transportation.In Figure 6.6 (a), a big box is placed on the floor, which is within the detection radiusof the sensors of the mobile robots. The three robots are informed about their initialpositions and orientations in the global coordinate system before they begin to work.When the system starts to operate, each robot uses its CCD camera to search and identifythe color blobs on the box surface and to estimate the orientation of the box relative to thecurrent pose of the robot. By fusing this relative orientation of the box and the sensorydata from its laser distance finder, the robot can estimate the position and orientation ofthe box in the global coordinate system. If one robot cannot detect the box with its laser orvision sensors, it will communicate with other robots to request the position and164orientation information of the box. If no robot finds the box, they will wander in theenvironment until one of them detects the box.At the same time, each robot scans its local environment with the laser distance finder.If an object is detected which is not the box or one of the peer robots, the object will beregarded as an obstacle.By fusing the information of the box pose with the information of local obstacledistribution, a local world state is established by the robot and the optimal action underthis state is selected using its Q-table.Figure 6.6 (b) shows how the robots push the box with the selected actions, and Figure6.6 (c) shows that the robots have changed to another formation so that the box is pushedwith the biggest net force.In Figure 6.6 (c), the robots find out there exists an obstacle (the blue garbage bin) inthe path, which was not detected before by their sensors due to the limited detection radius.In order to determine the position and area of the obstacle, Figure 6.6 (d) shows that oneof the robots moves close to the obstacle and measures the distance between the obstacleand itself. The obstacle position estimated by this robot is sent to its two peers so that theycan re-compute their local world states and select the corresponding actions to adapt to thenew local world.Figure 6.6 (e) and (f) show that the robots have changed their formation to adapt to thenew world state. Here they attempt to change the orientation of the box in order to avoidthe obstacle.Figure 6.6 (g) shows that the robots have avoided the obstacle successfully and restoredto the formation which generates the biggest net pushing force.From Figure 6.6 (a)-(g), it is observed that the learning based multi-robot system isquite effective in carrying out a cooperative transportation task in an environment withunknown obstacles. The learned Q-tables in the training stage help the robots select goodcooperation strategies in a robust and effective manner.6.5 Force/position Hybrid ControlForce and position control of the robots is quite valuable in a physical multi-robottransportation system. Without proper control of the forces and positions of the robots, the1651 PositionControllerParameterTuning ModelIdentification----1 ----11.4-DirectionConstraintsForceControllerModelIdentification+4.--RobotAssistantAgent(Supervisor)ParameterTuningRobotHcooperation strategy described in the previous chapters may not be effective. The hybridforce/position control scheme shown in Figure 6.7 is implemented in the low-level controllayer in the multi-robot system of the present work.o.Figure 6.7: The robot hybrid force/position control scheme.In Figure 6.7, two sets of control modes, position control and force control, areintegrated into the robot control system. Here L and L' are the direction constraintmatrices (Craig, 2005), which, respectively, determine in which directions the forcecontrol mode is used and in which directions the position control mode is employed. Theyare diagonal matrices with ones and zeros on the diagonal. For a diagonal element of "1"in L, there is a "0" as the corresponding diagonal element of L', indicating that positioncontrol is in effect. On the other hand, if a "0" is present on the diagonal of L, then "1"will be present as the corresponding element of L' , indicating that force control is in effect.Note that L and L' are set by the robot assistant agent in the upper level. Moreover, therobot assistant agent will provide the desired forces and positions for the low-level control,based on the current cooperation strategy.In addition, an adaptive control mechanism (Astrom and Wittenmark, 1994) is166integrated into the control scheme. In each control loop, a dynamic model of the robot isidentified on line and this information is used to tune the controller parameters.Finally, a two-step control strategy is used in the practical control process. When therobot assistant agent receives a new cooperation strategy from the high-level decision-making subsystem, it computes the desired force and position of the corresponding robot.As the upper level supervisor for the low-level control subsystem, the agent then sendscommands to the low-level position controller to initiate the first control step, which is tomove the robots to the desired position and orientation. Next, the agent starts the secondcontrol step, which requires the position controller to continue the position control strategy,while instructing the force controller to execute the desired force control strategy in aspecific direction in order for the robot to exert a constant pushing force on the object.An experiment is developed to validate this force/position hybrid control scheme,where the two robots have to first move to a specified location with a specified orientation,and then cooperatively transport the box for 8 seconds while maintaining a constantcontact force. This process is shown in Figure 6.8 (a)-(c).167'144.(b)^t^desired positionMEM(a) The initial state(c)Pushing of the boxFigure 6.8: The robots move to the desired position and push the box under hybridforce/position control.According to the control scheme given in Figure 6.7, the process is divided into twostages: position control and hybrid force/position control. The former is used to move therobots to a desired position with a desired orientation. The latter is employed to exert aconstant force in a specific direction while the position is kept unchanged in the otherdirections. Figure 6.9 (a)-(d) shows the control process of the robotic arm in the first stage.1680.30.20.10E-0:1-0.2-0.3-0.401 0.2 0.6 0.7 080.3 0.4^0.5x(m)Trajectory of the end effector(a) The trajectory of the end effector.X coordinate of the end effectorE0. -0.20.1 ^0 10^20^30^40^50^60time(s)(b) The x coordinate of the end effector.1690 30.20.1E Y coordinate of the end effector\10^20^30^40^50^60time(s)(c) They coordinate of the end effector.Speed of the end effectorLL10^20^30^40^50^60time(s)(d) The speed of the end effector.Figure 6.9: Position control of the robotic arm.Figure 6.10 (a)-(b) shows the force curve and the control input of the robotic arm underthe hybrid force/position scheme in the second stage.0.350.30 250.2Ea)6 0. 150.10.050170Force of the end effector— measured force— - desired value200I^ , 56^57^58^59^60^61^62^63^64time(s)(a) The force on the end effector.Control input of the force control(b) The control input of the joint.Figure 6.10: The force control of the robotic arm.From Figures 6.8-6.10 it is observed that the hybrid force/position control scheme andthe two-step control strategy, which are presented in this chapter, perform properly in theexperiments.0552000 ^18001600 -14001200 -+Ta' 1000 -g-800 -600 -4001716.6 SummaryIn this chapter, a physical multi-robot transportation system is implemented in theIndustrial Automation Laboratory (IAL) at the University of British Columbia, whichintegrated the approaches developed in the previous chapters. In this system, three mobilerobots were employed to push a big box to a goal location in the laboratory environment.There were three challenges to reach the goal in the developed system. First, each robotonly possesses local sensing capability which means they have to make decisions whilethey do not possess knowledge of the global environment. Second, there exist sensor noiseand uncertainty of actions in a physical system, which do not exist in a simulation system.These constraints degrade the performance of the robots. Finally, the environment isdynamic due to the presence of random obstacles. All these represent important challengeswhich a multi-robot system has to meet in a real natural environment.The distributed computing technology, JAVA RMI, was introduced into the developedmulti-robot system, which enabled the robots to cooperate in an easy and efficient manner.It was demonstrated through experiments in this chapter that JAVA RMI was a powerfultool to implement distributed multi-robot systems.An experiment was carried out to evaluate the performance of the developed multi-robot system. In this experiment, cooperative robotic transportation of a box in anenvironment with unknown obstacles was carried out. The experimental results showedthat the developed multi-robot system was able to work well in a realistic environment. Itcan be concluded that the approaches developed in chapters 2 through 5 are helpful andeffective in enable multi-robot systems to operate in a dynamic and unknown environment.A force/position hybrid control scheme was proposed in this chapter, which isimportant in multi-robot transportation systems. The developed scheme was evaluatedusing a physical experiment involving a robotic arm pushing a box on a 2D surface. Goodperformance was achieved by using the hybrid force/position control strategy.172Chapter 7Conclusions7.1 Primary ContributionsMulti-robot systems have to undergo many improvements before they can be used inreal-life applications. For example, there are enormous challenges to develop a multi-robot system which can cooperatively transport an object of interest in a dynamic andunknown environment like the Mars surface. In this thesis, several techniques have beendeveloped and integrated to support the operation of multi-robot transportation systems incomplex and unstructured dynamic environments with unknown terrains. In particular thethesis has made contributions with respect to self-deterministic learning, robustness,action optimization, and real-time operation of a cooperating team of robots in suchenvironments. Basically, the contributions of the thesis can be classified into four mainareas as follows.First, a general distributed multi-robot architecture was developed in Chapter 2. Thishierarchical architecture integrated several techniques from artificial intelligence (AI) sothat multi-robot systems could operate in a robust and effective manner in a dynamic andunknown environment. In its top level, a machine learning unit and a local path planningunit were combined to establish good cooperation strategies for the robots while theyattempt to transport an object cooperatively. A machine learning unit with effectivetraining was used to select proper actions for the robots so that they could complete acommon task quickly and effectively. If a local minimum of decision-making wasdetected, a local path planning unit would be temporarily activated by an arbitrator toselect a local transportation path for the object in order for the robots to escape the localminimum. In the middle level of the architecture, the behavior-based approach, which hasbeen proved to be very effective for single robot systems, was employed to decompose theabstract behavior sent by its upper level into more detailed primitive behaviors so that thedetailed control requirements could be generated and accepted by the lower levelcontroller. For the bottom level of the architecture, a low-level controller was designed to173receive control tasks from the behavior-based layer and control the motion or force of therobot. This architecture also included a communication module so that any robot couldeasily exchange information with other robots using standard network protocols. Bycombining the learning, planning and behavior-based approaches, and carefully designingthe coordination mechanism among them, the developed multi-robot architecture wasfound to be more flexible and powerful than a traditional two-layer architecture based on asingle AI technique.Second, machine learning was employed to find optimal cooperation strategies formulti-robot systems so that they could operate in a dynamic and unknown environment.For the first prototype system, a centralized learning agent was proposed, where a Q-learning unit and a Genetic Algorithm unit were integrated to search the optimalcooperation strategies for all robots in the environment. While the approach was found tobe successful for a robot team with relatively few members, it faced challenges withrespect to scalability, computational complexity and communication bottleneck when thenumber of robots increased. In order to meet these challenges, a distributed Q-learningmulti-robot transportation system was developed. By directly extending single-agent Q-learning to the multi-robot domain and carefully designing the reward function, thedeveloped system was able to demonstrate good adaptivity and effectiveness in a complexenvironment with multiple obstacles. However, due to lack of knowledge of the actions ofthe peer robots, the single-agent Q-learning algorithm was found to converge very slowlyin a multi-robot environment. In order to improve the convergence speed, a team Q-learning algorithm was developed for use in the same multi-robot task. An interestingconclusion was drawn by comparing the simulation results of the team Q-learningalgorithm with those of the single-agent Q-learning algorithm. Specifically, single-agentQ-learning showed a better performance (as measured by the average number of stepsrequired in a round of box-pushing) than team Q-learning even though the latter is usuallythought to be more efficient. Furthermore, the drawback of a large learning space in theteam Q-learning algorithm was demonstrated through computer simulation.In order to overcome various disadvantages of the traditional single-agent Q-learningalgorithm and the team Q-learning algorithm, a modified Q-learning algorithm termedSequential Q-learning with Kalman Filtering (SQKF) was developed in this thesis. By174enabling the robots to learn in a predefined sequence and employing a Kalman filter toextract the real rewards of a robot dynamically, the new SQKF algorithm was able toovercome several major shortcomings in the single-agent Q-learning algorithm or the teamQ-learning algorithm, and consequently more suitable for multi-robot tasks. Thesimulation results showed that the SQKF algorithm needed less time to complete a multi-robot transportation task and received better reward than the traditional Q-learningalgorithms.Third, a fast computer vision algorithm was developed to track multiple color-blobssimultaneously so that the orientations and positions of the robots could be determined ina multi-robot transportation task. This algorithm was found to meet two challenges incarrying out the specific task: the changing illumination levels in the work environmentand the real-time operating requirement of the algorithm. By extracting the hueinformation in the image and combining it with some statistical techniques and theavailable domain knowledge, the algorithm was able to effectively and quickly trackmultiple moving color blobs simultaneously in an environment with uneven illumination.Finally, a physical multi-robot transportation project was developed in the laboratory toimplement and validate the approaches developed in the thesis. The physical system facedmore challenges than the computer simulation system; in particular, sensor noise, wheelslip, real-time operation, and motion constraints. In addition, in order to implementeffective communication among multiple robots and facilitate proper execution of the newSQKF algorithm, the JAVA RMI technology was incorporated into the system. Theexperimental results showed that the physical system was able to operate effectively androbustly in a dynamic physical environment with obstacles.7.2 Limitations and Suggested Future ResearchAlthough the developed multi-robot transportation system has demonstrated quite goodperformance both in computer simulation and physical experimentation, there are someareas that need improvement. Some directions for further research are indicated next.1757.2.1 Improvement of the Model and Algorithm of SQKFThe SQKF algorithm developed in this thesis assumes a linear model for use in theKalman filtering algorithm, which may not be true in a real multi-robot system. In practice,there is a high possibility that the model is nonlinear. If the model nonlinearity issignificant, the standard linear Kalman filtering algorithm may not estimate the rewardvalues correctly, thereby making the SQKF algorithm ineffective. Therefore, more workmay be needed to identify a suitable nonlinear model for the SQKF algorithm.Furthermore, with a nonlinear model one may have to incorporate advanced filteringalgorithms, such as the Extended Kalman Filter or the Unscented Kalman Filter, inextending the SQKF algorithm.7.2.2 GA-based Learning Space CompressionIn this thesis, when applying the Q-learning algorithm to the multi-robot domain, theresulting large learning space was not dealt with rigorously. When the number of robots inthe environment increases, the resulting extensive learning space will make the Q-learningalgorithm ineffective.One possible solution to this problem would be to integrate a Genetic Algorithm (GA)with the Q-learning algorithm. For example, when a step of Q-learning is completed, bysearching an optimal sub-space in the Q-table, the GA algorithm may be able to set up anew learning sub-space for the Q-learning algorithm which is much smaller than theoriginal Q-table. The next step of Q-learning would be carried out using this smaller newsub-space. The challenge is how to determine which sub-space in the Q-table is "optimal"and how to deal with any "unexpected" world states that might be encountered by therobots, which were not included into the "optimal" sub-space.7.2.3 Local Modeling and Low-level ControlMost multi-robot systems have focused on high-level decision-making and ignoredlow-level control of multiple robots. Although, some work has been done in this thesis,there is room to improve the control methodology in the context of multi-robotcooperation.176The control needs in a multi-robot system are different from those in a single robotsystem. Local modeling may be a possible solution to the low-level control problem inmulti-robot systems.. The so-called local modeling will be such that one robot not onlycollects information from its own sensors but also exchanges information with otherrobots so that a local kinematic or dynamic model can be established to facilitate thecontrol task. The model may include the information of the formation and capabilities ofthe robots, their current states and actions, the shared environment, and so on. Based onthe local model, an advanced force/position controller may be developed to meet thecontrol requirements communicated by the high-level decision-making sub-system.7.2.4 Integrating Planning and LearningPlanning and learning are both important for multi-robot systems when operate in anunknown and dynamic environment. In this thesis, a type of switching strategy betweenlearning and planning was developed. In particular, a local path planning unit wastemporarily activated when a local minimum of the machine learning approach wasencountered. However, a more advanced integration scheme would be useful so as tomake use of planning in a multi-robot task.7.3 SummaryNew developments in multi-robot systems will endow the next generation of roboticsystems with more powerful capabilities. It has been an important objective to allowmultiple robots to autonomously carry out such tasks as cooperative object transportationor robotic soccer, in dynamic, unknown, and unstructured environments. In this thesis,several key approaches were developed toward achieving this goal. It is believed that nosingle AI technique can meet this objective. Instead, learning, reactive (behavior-based)paradigm, planning, optimization, and low-level control approaches should be integrated,and new approaches should be developed on that basis so that an autonomous multi-robotsystem will be able to operate in a dynamic and unknown environment in an effective androbust manner. Some significant work has been carried out in this thesis towards thisgeneral objective, specifically with regard to self-deterministic learning, improvedrobustness, optimization of robotic action, and real-time operation.177BibliographyAlenya, G., Escoda, J., Martinez, A.B. and Torras, C., "Using laser and vision to locate arobot in an industrial environment: a practical experience," Proceedings of IEEEInternational Conference on Robotics and Automation, Barcelona, Spain, pp. 3528-3533,April, 2005.Alpaydin, E., Introduction to Machine Learning, The MIT Press, Cambridge, MA, 2004.Arai, T., Pagello, E. and Parker, L.E., "Advances in multirobot systems," IEEE Trans. onRobotics and Automation, Vol.18, No. 5, pp. 655-661, 2002.Arkin, R.C., Behavior-Based Robotics, The MIT Press, Cambridge, MA, 1998.Asada, M., Uchibe, E. and Hosoda, K., "Cooperative behavior acquisition for mobilerobots in dynamically changing real world via reinforcement learning and development,"Artificial Intelligence, V ol. 110, No. 2, pp. 275-292, 1999.Asensio, J.R., Montiel, J.M.M. and Montano, L., "Goal directed reactive robot navigationwith relocation using laser and vision," Proceedings of IEEE International Conference onRobotics and Automation, Vol. 4, pp. 2905-2910, May, Detroit, MI, 1999.Astrom, K.J. and Wittenmark, B., Adaptive Control, Second Edition, Addison-WesleyPublishing Company, Menlo Park, CA„ 1994.Balch, T. and Arkin, R.C., "Behavior-based formation control for multirobot teams,"IEEE Trans. on Robotics and Automation, Vol. 14, No.6, pp. 926-939, 1998.Bellman, R.E., Dynamic Programming, Princeton University Press, Princeton, New Jersey,1957.Bouloubasis, A.K., McKee, G.T. and Schenker, P.S., "A behaviour-based manipulator formulti-robot transport tasks," Proceedings of IEEE International Conference on Roboticsand Automation, Taipei, Taiwan, pp. 2287-2292, 2003.Camacho, D., Fernandez, F. and Rodelgo, M.A., "Roboskeleton: an architecture forcoordinating robot soccer agents," Engineering Application of Artificial Intelligence, Vol.19, No. 2, pp. 179-188, 2006.Cao, Y.U., Fukunaga, A.S. and Kahng, A.B., "Cooperative mobile robotics: antecedentsand directions," Autonomous Robots, Vol. 4, No. 1, pp. 7-27, 1997.Cao, Z., Tan, M., Li, L. Gu, N. and Wang, S., "Cooperative hunting by distributed mobilerobots based on local interaction," IEEE Transactions on Robotics, Vol. 22, No. 2, pp.403-407, 2006.178Casella, G. and Berger, R.L., Statistical Inference, Second Edition, Thomson Learning,Belmont, CA, 2002.Chaimowicz, L., Kumar, V. and Campos, M.F.M., "A paradigm for dynamic coordinationof multiple robots," Autonomous Robots, Vol. 17, No. 1, pp. 7-21, 2004.Chang, Y.-H., Ho, T., and Kaelbling, L.P., "All learning is local: multi-agent learning inglobal reward games," Proceedings of Neural Information Processing Systems (NIPS),Vancouver, BC, Canada, 2003.Chen, S., Shyu, M., Peeta, S. and Zhang, C., "Spatiotemporal vehicle tracking," IEEERobotics and Automation Magazine, Vol. 12, No. 1, pp.50-58, 2005.Choset, H., Lynch, K.M., Hutchinson, S., Kantor, G., Burgard, W., Kavraki, L.E. andThrun, S., Principles of Robot Motion, The MIT Press, Cambridge, MA, 2005.Chui, C.K. and Chen G., Kalman Filtering with Real-Time Applications, Third Edition,Springer, Berkeley, CA, 1999.Craig, J.J., Introduction to Robotics: Mechanics and Control, Third Edition, PearsonPrentice Hall, Upper Saddle River, NJ, 2005.Dahl, T.S., Mataric, M.J. and Sukhatme, G.S., "Multi-robot task-allocation throughvacancy chains," Proceedings of the 2003 IEEE International Conference on Robotics andAutomation, Taipei, Taiwan, pp. 2293-2298, 2003.Dahl, T.S., Mataric, M.J. and Sukhatme, G.S., "Adaptive spatio-temporal organization ingroups of robots," Proc. of the 2002 IEEE/RSJ International Conference on IntelligentRobots and Systems (IROS), Lausanne, Switzerland, pp. 1044-1049, 2004.De Silva, C.W, MECHATRONICS—An Integrated Approach, Taylor & Francis/CRCPress, Boca Raton, FL, 2005.Deng, L.Y., Tang, N.C., Lee, D., Wang, C.T. and Lu, M.C., "Vision based adaptivetraffic signal control system development," Proceedings of 19th International Conferenceon Advanced Information Networking and Applications, Taipei, Taiwan, pp. 385-388,2005.Dudek, G., Jenkin, M. and Milios, E., "A taxonomy of multirobot systems," (Chapter 1),Editor: Balch, T. and Parker, L.E., Robot Teams: From Diversity to Polymorphism, AKPeters, Ltd., Natick, Massachusetts, 2002.Elahibakhsh, A.H., Ahmadabadi, M.N., Janabi-Sharifi, F. and Araabi, B.N., "Distributedform closure for convex planar objects through reinforcement learning with local179information," Proceedings of 2004 IEEE/RSJ International Conference on IntelligentRobots and Systems, Sendai, Japan, pp. 3170-3175, 2004.Emery, R. and Balch, T., "Behavior-based control of a non-holonomic robot in pushingtasks," Proceedings of the 2001 IEEE International Conference on Robotics andAutomation, Seoul, Korea, pp. 2381-2388, 2001.Farinelli, A., Iocchi, L. and Nardi, D., "Multirobot systems: a classification focused oncoordination," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics,Vol. 34, No. 5, pp. 2015-2028, 2004.Ferch, M. and Zhang, J., "Learning cooperative grasping with the graph representation ofa state-action space," Robotics and Autonomous Systems, Vol. 38, No. 3-4. pp.183-195,2002.Fernandez, F. and Borrajo, D., "A reinforcement learning algorithm in cooperative multi-robot domains," Journal of Intelligent and Robotic Systems, Vol. 43, No. 2-4, pp. 161-174,2005.Frontoni, E. and Zingaretti, P., "A vision based algorithm for active robot localization,"Proceedings of IEEE International Symposium on Computational Intelligence in Roboticsand Automation, pp. 347-352, June, Espoo, Finland, 2005.Gerkey, B.P. and Mataric, M.J., "Pusher-watcher: an approach to fault-tolerant tightly-coupled robot coordination," Proceedings of the 2002 IEEE International Conference onRobotics and Automation, Washington, DC, pp. 464-469, 2002a.Gerkey, B.P., and Mataric, M.J., "Sold!: auction methods for multirobot coordination,"IEEE Trans. on Robotics and Automation, Vol. 18, No. 5, pp. 758-768, 2002b.Gerkey, B.P., and Mataric, M.J., "A formal analysis and taxonomy of task allocation inmulti-robot systems," International Journal of Robotics Research, Vol. 23, No. 9, pp. 939-954, 2004.Goldberg, D. and Mataric, M.J., "Design and evaluation of robust behavior-basedcontrollers," (Chapter 11), Editor: Balch, T. and Parker, L.E., Robot Teams: FromDiversity to Polymorphism, AK Peters, Ltd., Natick, Massachusetts, 2002.Gonzalez, R.C. and Woods, R.E., Digital Image Processing, Prentice Hall, Upper SaddleRiver, NJ, 2002.Gustafson, S. and Gustafson, D.A., "Issues in the scaling of multi-robot systems forgeneral problem solving," Autonomous Robots, Vol. 20, No. 2, pp. 125-136, 2006.180Gopalakrishnan, A., Greene, S. and Sekmen A., "Vision-based mobile robot learning andnavigation," Proceedings of IEEE International Workshop on Robot and HumanInteractive Communication, Nashville, TN, pp. 48-53, 2005.Hajjdiab, H. and Laganiere, R., "Vision-based multi-robot simultaneous localization andmapping," Proceedings of 1st Canadian Conference on Computer and Robot Vision, pp.155-162, May, London, ON, Canada, 2004.Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical Learning: DataMining, Inference, and Prediction, Springer-Verlag, New York, 2001.Horstmann, C.S. and Cornell, G., Core Java 2, Volume II — Advanced Features, SeventhEdition, Pearson Education, Inc., Upper Saddle River, NJ, 2005.Hu, J. and Wellman, M.P., "Multiagent reinforcement learning: theoretical framework andan algorithm," Proc. Of 15th International Conf on Machine Learning, San Francisco,CA, pp. 242-250, 1998.Huntsberger, T., Pirjanian, P., Trebi-011ennu, A., Nayar, H.D., Aghazarian, H, Ganino,A.J., Garrett, M., Joshi, S.S. and Schenker, P.S., "Campout: a control architecture fortightly coupled coordination of multi-robot systems for planetary surface exploration,"IEEE Trans. on Systems, Man, and Cybernetics, Vol. 33, No. 5, pp. 550-559, 2003.Ishiwaka, Y., Sato, T. and Kakazu, Y., "An approach to the pursuit problem on aheterogeneous multiagent system using reinforcement learning," Robotics andAutonomous Systems, Vol. 43, No. 4, pp. 245-256, 2003.Inoue, Y., Tohge, T. and Iba, H., "Object transportation by two humanoid robots usingcooperative learning," Proc. of the 2004 Congress on Evolutionary Computation, Portland,OR, pp. 1201-1207, 2004.Ito, K. and Gofuku, A., "Hybrid autonomous control for multi mobile robots," AdvancedRobotics, Vol. 18, No. 1, pp. 83-99, 2004.Jones, C. and Mataric, M.J., "Automatic synthesis of communication-based coordinatedmulti-robot systems," Proceedings of the 2004 IEEE/RSJ International Conference onIntelligent Robots and Systems, Sendai, Japan, pp. 381-387, 2004a.Jones, C. and Mataric, M.J., "Utilizing internal state in multi-robot coordination tasks,"Proceedings of Nineteenth National Conference on Artificial Intelligence (AAAI-2004),San Jose, CA, pp. 958-959, 2004b.Kalman, R.E., "A new approach to linear filtering and prediction problems," Transactionsof ASME - Journal of Basic Engineering, Vol. 82 (Series D), pp. 35-45, 1960.181Kapetanakis, S. and Kudenko, D., "Reinforcement learning of coordination in cooperativemulti-agent systems," Proceedings of Eighteenth national conference on Artificialintelligence, Edmonton, Alberta, Canada, pp. 326-331, 2002.Karray, F.O. and de Silva, C.W., Soft Computing and Intelligent Systems Design, AddisonWesley/Pearson, New York, NY, 2004.Kato, K., Ishiguro, H. and Barth, M., "Identification and localization of multiple robotsusing omnidirectional vision sensors," Electronics and Communications in Japan, Part II:Electronic, Vol. 86, No. 6, pp. 1270-1278, 2003.Kawakami, K., Ohkura, K. and Ueda, K., "Reinforcement learning approach tocooperation problem in a homogeneous robot group," IEEE International Symposium onIndustrial Electronics, Pusan, Korea, pp. 423-428, 2001.Konidaris, G.D. and Hayes, G.M., "An architecture for behavior-based reinforcementlearning," Adaptive Behavior, Vol. 13, No. 1, pp. 5-32, 2005.Kovac, K., Zivkovic, I. and Basic, B.D., "Simulation of multi-robot reinforcementlearning for box-pushing problem," Proceedings of the Mediterranean ElectrotechnicalConference — MELECON, Dubrovnik, Croatia, pp. 603-606, 2004.Kube, C.R. and Bonabeau, E., "Cooperative transport by ants and robots," Robotics andAutonomous Systems, Vol. 30, No. 1, pp. 85-101, 2000.Kumar, M. and Garg, D.P., "Sensor-based estimation and control of forces and momentsin multiple cooperative robots," Journal of Dynamic Systems, Measurement, and Control,Transactions of the ASME, Vol. 126, No. 2, pp. 276-283, 2004.Kume, Y., Hirata, Y., Wang, Z. and Kosuge, K., "Decentralized control of multiplemobile manipulators handling a single object in coordination," Proceedings of the 2002IEEE/RSJ International Conference on Intelligent Robots and Systems, EPFL, Lausanne,Switzerland, pp. 2758-2763, 2002.Lerman, K., Jones, K., Galstyan, A. and Mataric, M.J., "Analysis of dynamic taskallocation in multi-robot systems," International Journal of Robotics Research, Vol. 25,No. 3, p 225-241, 2006.Liberty, J., Teach Yourself C++ in 21 Days, Third Edition, Sams Publishing, IndianapolisIN, 1999.Littman, M.L., "Markov games as a framework for multi-agent reinforcement learning,"Proc. of the Eleventh International Conference on Machine Learning (ML-94), NewBrunswick, NJ, pp.157-163, 1994.182Littman, M.L., "Value-function reinforcement learning in Markov games," Journal ofCognitive Systems Research, Vol. 2001, No. 1, pp.1-12, 2001a.Littman, M.L., "Friend-or-Foe Q-learning in general-sum games," Proc. 18thInternational Conf on Machine Learning, San Francisco, CA, pp. 322-328, 200 lb.Liu, J. and Wu, J., Multi-agent Robotic Systems, CRC Press, Boca Raton, FL, 2001.Marshall, J.A., Fung, T., Broucke, M.E., D'eleuterio, G.M.T., Francis, B.A., "Experimentsin multirobot coordination," Robotics and Autonomous Systems, Vol. 54, No. 3, pp. 265-275, 2006.Martinson, E., Stoytchev, A. and Arkin, R.C., "Robot behavioral selection using Q-learning," Proceedings of the 2002 IEEE/RSJ International Conference on IntelligentRobots and Systems, EPFL, Lausanne, Switzerland, pp. 970-977, 2002.Martison, E. and Arkin, R.C., "Learning to role-switch in multi-robot systems," Proc. ofIEEE International Conference on Robotics & Automation, Taibei, Taiwan, pp. 2727-2734, 2003.Mataric, M.J., Nillsson, M., and Simsarian, K.T., "Cooperative multi-robot object-pushing," Proc. of IEEE/RSJ Int. Conf on Human Robot Interaction and CooperativeRobots, Pittsburgh, PA, pp. 556-561, 1995.Mataric, M.J., "Reinforcement learning in the multi-robot domain," Autonomous Robots,Vol. 1997, No. 4, pp.73-83, 1997.Maurin, B., Masoud, 0. and Papanikolopoulos, N. P., "Tracking all traffic: computervision algorithms for monitoring vehicles, individuals, and crowds," IEEE Robotics andAutomation Magazine, Vol. 12, No. 1, pp. 29-36, 2005.Mitchell, T.M., Machine Learning, McGraw-Hill Companies, Inc., New York, NY, 1997.Miyata, N., Ota, J., Arai T. and Asama, H. "Cooperative transport by multiple mobilerobots in unknown static environments associated with real-time task assignment," IEEETrans. on Robotics and Automation, Vol. 18, No. 5, pp. 769-780, 2002.Murarka, A., Modayil, J. and Kuipers, B., "Building local safety maps for a wheelchairrobot using vision and lasers," Proceedings of the 3rd Canadian Conference on Computerand Robot Vision, Quebec City, QC, Canada, pp. 25-32, 2006.Nolfi, S. and Floreano, D., "Learning and evolution," Autonomous Robots, Vol. 7, No. 1,pp. 89-113, 1999.Ogata, K., Modern Control Engineering, Third Edition, Prentice Hall, Inc., Upper SaddleRiver, NJ, 1998.183Ortin, D., Neira, J. and Montiel, J.M.M., "Relocation using laser and vision," Proceedingsof IEEE International Conference on Robotics and Automation, New Orleans, LA, pp.1505-1510, 2004.Panait, L. and Luke, S., "Cooperative multi-agent learning: the state of the art,"Autonomous Agents and Multi-Agent Systems, Vol. 11, No. 3, pp. 387-434, 2005.Park, K., Kim, Y. and Kim, J., "Modular Q-learning based multi-agent cooperation forrobot soccer," Robotics and Autonomous Systems, Vol. 35, No. 2, pp. 109-122, 2001.Parker, L.E., "ALLIANCE: an architecture for fault tolerant multirobot cooperation,"IEEE Transactions on Robotics and Automation, Vol. 14, No. 2, pp. 220-240, 1998.Parker, L.E., "Lifelong adaptation in heterogeneous multi-robot teams: response tocontinual variation in individual robot performance," Autonomous Robots, Vol. 2000, No.8, pp. 239-267, 2000a.Parker, L.E., "Current state of the art in distributed autonomous mobile robotics,"Distributed Autonomous Robotic Systems 4, Springer-Verlag, Tokyo, pp. 3-12, 2000b.Parker, L.E., Touzet, C. and Fernandez, F., "Techniques for learning in multirobot teams,"(Chapter 7), Editor: Balch, T. and Parker, L.E., Robot Teams: From Diversity toPolymorphism, AK Peters, Ltd., Natick, Massachusetts, 2002.Pereira, G., Campos, M. and Kumar, V., "Decentralized algorithms for multi-robotmanipulation via caging," The International Journal of Robotics Research, Vol. 23, No. 7-8, pp. 783-795, 2004.Pham, D.T., and Awadalla, M.H., "Neuro-fuzzy based adaptive co-operative mobilerobots," Proceedings of the 2002 28th Annual Conference of the IEEE IndustrialElectronics Society, Sevilla, Spain, pp. 2962-2967, 2002.Pimentel, B.S., Pereira, G.A.S. and Campos, M.M.F.M., "On the development ofcooperative behavior-based mobile manipulators," Proceedings of the InternationalConference on Autonomous Agents, Bologna, Italy, pp. 234-239, 2002.Rabie, T., Abdulhai, B., Shalaby, A. and El-Rabbany, A., "Mobile active-vision trafficsurveillance system for urban networks," Computer-Aided Civil and InfrastructureEngineering, Vol. 20, No. 4, pp. 231-241, 2005.Rimon, E. and Koditschek, D.E., "Exact robot navigation using artificial potentialfunctions," IEEE Trans. on Robotics and Automation, Vol. 8, No. 5, pp. 501-518, 1992.184Rocha, R., Dias, J. and Carvalho, A., "Cooperative multi-robot systems: A study of vision-based 3-D mapping using information theory," Robotics and Autonomous Systems, Vol.53, No. 3-4, p 282-311, 2005.Rosen, K.H., Discrete Mathematics and Its Applications, Fifth Edition, McGraw-HillCompanies, Inc., New York, NY, 2003.Ross, S.M., Introduction to Probability Models, 8th Edition, Elsevier, Singapore, 2006.Rus, D., Donald, B., and Jennings, J., "Moving furniture with teams of autonomousrobots," Proc. of IEEE/RSJ International Conference on Human Robot Interaction andCooperative Robots, Pittsburgh, PA, pp. 235-242, 1995.Russell S. and Norvig P., Artificial Intelligence: A Modern Approach, Second Edition,Pearson Education, Inc., Upper Saddle River, NJ, 2003.Sanchez-Pena, R. and Sznaier, M., Robust Systems Theory and Applications, John Wiley& Sons, Inc., New York, NY, 1998.Sardag, A. and Akin, H.L., "ARKAQ-Learning: Autonomous state space segmentationand policy generation," Lecture Notes in Computer Science, Vol. 3733, LNCS, pp. 512-523, Springer Verlag, New York, NY, 2005.Schenker, P.S., Huntsberger, T.L., Pirjanian, P., Trebi-011ennu, A., Das, H., Joshi, S.,Aghazarian, H., Ganino, A.J., Kennedy, B.A. and Garrett, M.S, "Robot work crews forplanetary outposts: Close cooperation and coordination of multiple mobile robots,"Proceedings of SPIE - The International Society for Optical Engineering, Boston, MA, pp.210-220, 2000.Schenker, P.L., Huntsberger, T., Pirjanian, P., Baumgartner, E.T. and Tunstel, E.,"Planetary rover developments supporting MARs exploration, sample return and futurehuman-robotic colonization," Autonomous Robots, Vol. 14, No. 2-3, pp. 103-126, 2003.Sen, S., "Evolution and learning in multiagent systems," International Journal of Human-Computer Studies, V ol. 48, No. 1, pp. 1-7, 1998.Shoham, Y., Powers, R. and Grenager, T., "Multi-agent reinforcement learning: a criticalsurvey," Technical report, Stanford University, Stanford, CA, 2003.Siegwart, R. and Nourbakhsh, I.R., Introduction to Autonomous Mobile Robots, The MITPress, Cambridge, MA, 2004.Siyal, M.Y., "A novel image processing based approach for real-time road trafficapplications," Proceedings of 2004 IEEE Region 10 Conference: Analog and DigitalTechniques in Electrical Engineering, Chiang Mai, Thailand, pp. 447-450, 2004.185Spong, M.W., Hutchinson, S., and Vidyasagar, M., Robot Modeling and Control, JohnWiley & Sons, Inc., Hoboken, NJ, 2006.Stone, P. and Veloso, M., "Layered approach to learning client behaviors in theROBOCUP soccer server," Applied Artificial Intelligence, Vol. 12, No. 203, pp. 165-188,1998.Stone, P. and Veloso, M., "Multiagent systems: a survey from a machine learningperspective," Autonomous Robots, Vol. 8, No. 3, pp. 345-383, 2000.Stone, P., Sridharan, M., Stronger, D., et al., "From pixels to multi-robot decision-making:A study in uncertainty," Robotics and Autonomous Systems, Vol. 54, No. 11, pp. 933-943,2006.Stroupe, A., Huntsberger, T., Okon, A, Aghazarian, H. and Robinson, M, "Behavior-basedmulti-robot collaboration for autonomous construction tasks," Proceedings of the 2005IEEE/RSJ International Conference on Intelligent Robots and Systems, Edmont, Alberta,Canada, pp. 1989-1994, 2005.Stroupe, A., Okon, A., Robinson, M., Huntsberger, T., Aghazarian, H. and Baumgartner,E., "Sustainable cooperative robotic technologies for human and robotic outpostinfrastructure construction and maintenance," Autonomous Robots, Vol. 20, No. 2, pp.113-123, 2006.Sugar, T.G., Desai, J.P. Kumar, V. and Ostrowski, J.P., "Coordination of multiple mobilemanipulators," Proceedings of the 2001 IEEE International Conference on Robotics &Automation, Seoul, Korea, pp. 3022-3027, 2001.Sugar, T.G. and Kumar, V., "Control of cooperating mobile manipulators," IEEE Trans.on Robotics and Automation, Vol.18, No. 1, pp. 94-103, 2002.Sutton, R.S. and Barto, A.G., Reinforcement Learning: An Introduction, The MIT Press,Cambridge, MA, 1998.Svinin, M.M., Kojiama, F., Katada, Y. and Ueda, K., "Initial experiments onreinforcement learning control of cooperative manipulations," Proceedings of the 2000IEEE/RSJ International Conference on Intelligent Robots and Systems, Takamatsu, Japan,2000.Takahashi, S. and Ghosh, B.K., "Parameter identification of a moving object based onsensor fusion," Proceedings of IEEE International Conference on Mechatronics andAutomation, Niagara Falls, ON, Canada, pp. 171-176, 2005.Tangamchit, P., Dolan, J.M. and Khosla, P.K., "Crucial factors affecting decentralizedmultirobot learning in an object manipulation task," Proceedings of IEEE InternationalConference on Intelligent Robots and System, Las Vegas, NV, pp. 2023-2028, 2003.186Taylor, M.E. and Stone, P., "Behavior transfer for value-function-based reinforcementlearning," Proceedings of the 4th International Conference on Autonomous Agents andMulti agent Systems, Utrecht, Netherlands, pp. 201-207, 2005.Thrun, S., Burgard, W. and Fox, D., Probabilistic Robotics, The MIT Press, Cambridge,MA, 2005.Tomono, M., "Environment modeling by a mobile robot with a laser range finder and amonocular," Proceedings of 2005 IEEE Workshop on Advanced Robotics and its SocialImpacts, Nagoya, Japan, pp. 133-138, 2005.Trebi-011ennu, A., Nayar, H.D., Aghazarian, H., Ganino, A., Pirjanian, P., Kennedy, B.,Huntsberger, T. and Schenker, P., "Mars rover pair cooperatively transporting a longpayload," Proceedings of IEEE International Conference on Robotics and Automation,Washington, DC, pp. 3136-3141, 2002.Veeraraghavan, H., Masoud, 0. and Papanikolopoulos, N.P., "Computer vision algorithmsfor intersection monitoring," IEEE Transactions on Intelligent Transportation Systems,Vol. 4, No. 2, pp. 78-89, 2003.Wang, Y. and de Silva, C.W., "An object transportation system with multiple robots andmachine learning," Proc. of American Control Conference (ACC 2005), Portland, OR,pp.1371-13'76, 2005a.Wang, Y. and de Silva, C.W., "A fast computer vision algorithm for multi-robotcooperative control," Proceedings of International Symposium on Collaborative Researchin Applied Science, pp. 189-194, Vancouver, BC, Canada, October, 2005b.Wang, Y. and de Silva, C.W., "Extend single-agent reinforcement learning approach to amulti-robot cooperative task in an unknown dynamic environment," Proceedings of IEEE2006 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC,Canada, pp. 10098-10104, 2006a.Wang, Y. and de Silva, C.W., "Multi-robot box-pushing: single-agent Q-Learning vs.team Q-Learning," Proceedings of 2006 IEEE/RSJ International Conference on IntelligentRobots and Systems (IROS), Beijing, China, pp. 3694-3699, 2006b.Wang, Y. and de Silva, C.W., "Cooperative transportation by multiple robots withmachine Learning, " Proceedings of 2006 IEEE Congress on Evolutionary Computation(CEC), Vancouver, BC, Canada, pp. 10407-10413, 2006c.Wang, Y. and de Silva, C.W, "A fast and robust algorithm for color-blob tracking inmulti-robot coordinated tasks," International Journal of Information Acquisition, Vol. 3,No. 3, pp. 191-200, 2006d.187Wang, Y. and de Silva, C.W, "A machine learning approach to multi-robot coordination,"Engineering Applications of Artificial Intelligence (Elsevier), 2007a (In Press).Wang, Y. and de Silva, C.W, "A modified Q-learning algorithm for multi-robot decision-making," Proceedings of ASME International Mechanical Engineering Congress andExposition (IMECE 2007), Seattle, WA, 2007b (In Press).Wang, Y. and de Silva, C.W, "Assess team Q-learning algorithm in a purely cooperativemulti-robot task," Proceedings of ASME International Mechanical Engineering Congressand Exposition (IMECE 2007), Seattle, WA, 2007c (In Press).Wang, Z., Kumar, V., Hirata, Y. and Kosuge, K., "A strategy and a fast testing algorithmfor object caging by multiple cooperative robots," Proceedings of the 2003 IEEEInternational Conference on Robotics and Automation, Taipei, Taiwan, pp. 2275-2280,2003a.Wang Z., Nakano, E. and Takahashi, T., "Solving function distribution and behaviordesign problem for cooperative object handling by multiple mobile robots," IEEE Trans.On Systems, Man, and Cybernetics — Part A: Systems and Humans, Vol. 33, No. 5, pp.537-549, 2003b.Wang Z., Hirata, Y. and Kosuge, K., "Control multiple mobile robots for object cagingand manipulation," Proceedings of the 2003 IEEE/RSJ International Conference OnIntelligent Robots and Systems, Las Vegas, Nevada, pp. 1751-1756, 2003c.Wang, Z., Takano, Y., Hirata, Y., and Kosuge, K., "A pushing leader based decentralizedcontrol method for cooperative object transportation," Proceedings of 2004 IEEE/RSJInternational Conference on Intelligent Robots and Systems, Sendai, Japan, pp. 1035-1040,2004a.Wang, Z., Hirata, Y. and Kosuge, K., "Control a rigid caging formation for cooperativeobject transportation by multiple mobile robots," Proceedings of the 2004 IEEEInternational Conference on Robotics and Automation, New Orleans, LA, pp. 1580-1585,2004b.Wang, Z., Hirata, Y., and Kosuge, K., "An algorithm for testing object caging conditionby multiple mobile robots," Proc. of IEEE/RSJ International Conference on IntelligentRobots and Systems (IROS), Edmonton, Alberta, Canada, pp. 2664-2669, 2005.Weiss, G., Multiagent Systems, The MIT Press, Cambridge, MA, 1999.Wooldridge, M., An Introduction to Multiagent Systems, John Wiley & Sons, LTD,Hoboken, NJ, 2002.188Yamada, S. and Saito, J., "Adaptive action selection without explicit communication formulti-robot box-pushing," IEEE Trans. on Systems, Man, and Cybernetics, Vol. 31, No. 3,pp. 398-404, 2001.Yamashita, A., Arai, T., Ota, J. and Asama, H., "Motion planning of multiple mobilerobots for cooperative manipulation and transportation," IEEE Trans. On Robotics andAutomation, Vol. 19, No. 2, pp. 223-237, 2003.Yang, E., and Gu, D., "Multiagent reinforcement learning for multi-robot systems: asurvey," Technical Report,http://robotics.usc.edu/—maja/teaching/cs584/papers/yang04multiagent.pdf, 2004.Zaerpoora, A., Ahmadabadia, M.N., Barunia, M.R. and Wang, Z., "Distributed objecttransportation on a desired path based on constrain and move strategy," Robotics andAutonomous Systems, Vol. 50, No. 2-3, pp. 115-128, 2005.Zlot, R. and Stenz, A., "Market-based multirobot coordination for complex tasks," TheInternational Journal of Robotics Research, Vol. 25, No. 1, pp. 73-101, 2006.189AppendixJAVA Documents of the Developed System1. SummaryPackage SQKFInterface SummaryenvServerInterface FThe interface for remote product objects.robotServerInterface fThe interface for remote product objects.Class SummaryrobotrobotServerboxenvenvServer This is the implementation class for the remote product objects.goal main-ob*Pose'obstaclepainterpointrobotData 2. Class boxSQKFClass boxjava.lang.ObjectL SQKF . boxpublic class boxextends java.lang.Object190Method Summary-void;7boxResponse()objPoselgetBoxPose()double getDetectionRadius()double lgetGlobalReward()int getHeight()int getWidth()voidfSetBoxPose(objPose bPose)setEnvServer(envServer s)voidConstructor Summary box(env e, java.util.Vector oGroup, objPose p, goal g)Creates a new instance of boxConstructor Details1) boxpublic box(env e,java.util.Vector oGroup,objPose p,goal g)Creates a new instance of boxMethod Details2) setEnvServerpublic void setEnvServer(envServer s)3) getBoxPosepublic objPose getBoxPose()4) setBoxPosepublic void setBoxPose(objPose bPose)5) get Widthpublic int getWidth ( )1916) getHeightpublic int getHeight ()7) getDetectionRadiuspublic double getDetectionRadius()8) boxResponsepublic void boxResponse()9) getGlobalRewardpublic double getGlobalReward()3. Class envSQKFClass envjava.lang.ObjectLsQKF.envpublic class envextends java.lang.ObjectField Summarydouble height doubleiwidthConstructor Summaryenv(double wid, double hei)Method SummaryMethods inherited from class java.lang.Objectclone, equals, finalize, getClass, hashCode, notify, notifyAll,toString, wait, wait, wait10) widthpublic double width19211) heightpublic double heightConstructor Details12) envpublic env(double wid,double hei)4. Class envServerSQKFClass envServerjava.lang.ObjectLjava.rmi.server.RemoteObjectLjava.rmi.server.RemoteServerLjava.rmi.server.UnicastRemoteObjectL-SQKF.envServerAll Implemented Interfaces:java.io.Serializable, java.mii.Remote, envServerInterfacepublic class envServerextends java.rmi.server.UnicastRemoteObjectimplements envServerInterface This is the implementation class for the remote product objects.See Also:Serialized FormField SummaryFields inherited from class java.rmi.server.RemoteObjectrefConstructor SummaryenvServer(goal g, box b, java.util.Vector oGroup)Creates a new instance of envServer193Method Summaryint addRobot(robotData r)booleanvoidvoidvoidallRobotReady()awakeAllRobots()awakeRobot(int id)clearAllRobotReady()point getBoxDimension ()objPose getBoxPose()doubledoublegetDetectionRadius()getGlobalReward()point getGoal()Jjava.util.Vector getObstacleGroup()java.util.Vectorjava.util.Vectorjava.util.VectorgetRobotDataList() igetRobotPoseList ()getRobotStatusList()java.util.Vector getRobotUr1List()jvoid setBoxPose(objPose bPose)void setRobotAction(int id,^int action)void setRobotPose(int id, objPose p)void setRobotReady(int id)void setRobotStatus(int id, boolean s)Constructor Details13) envServerpublic envServer(goal g,box b,java.util.Vector oGroup)194throws java.rmi.RemoteExceptionCreates a new instance of envServerThrows:java.rmi.RemoteException14) addRobotpublic int addRobot(robotData r)throws java.rmi.RemoteExceptionSpecified by:addRobot in interface envServerInterface Throws:java.rmi.RemoteException15) getGoalpublic point getGoal()throws java rmi.RemoteExceptionSpecified by:getGoal in interface envServerInterface Throws:java.rmi.RemoteException16) getObstacleGrouppublic java.util.Vector getObstacleGroup()throws java.rmi.RemoteExceptionSpecified by:getObstacleGroup in interface envServerInterfaceThrows:java.rmi.RemoteException17) getBoxPosepublic objPose getBoxPose()throws java.rmi.RemoteExceptionSpecified by:getBoxPose in interface envServerInterfaceThrows:java.rmi.RemoteException18) getBoxDimensionpublic point getBoxDimension()throws java.rmi.RemoteExceptionSpecified by:getBoxDimension in interface envServerInterfaceThrows:java.rmi.RemoteException19) getRobotUr1Listpublic java.util.Vector getRobotUr1List()throws java.rmi.RemoteExceptionSpecified by:getRobotUr1List in interface envServerInterface195Throws:java.rmi.RemoteException20) getRobotStatusListpublic java.util.Vector getRobotStatusList()throws java.rmi.RemoteExceptionSpecified by:getRobotStatusList in interface envServerInterfaceThrows:java.rmi.RemoteException21) getGlobalRewardpublic double getGlobalReward()throws java.rmi.RemoteExceptionSpecified by:getGlobalReward in interface envServerInterfaceThrows:java.rmi.RemoteException22) getDetectionRadiuspublic double getDetectionRadius()throws java.rmi.RemoteExceptionSpecified by:getDetectionRadius in interface envServerInterfaceThrows:java.rmi.RemoteException23) getRobotPoseListpublic java.util.Vector getRobotPoseList()throws java.rmi.RemoteExceptionSpecified by:getRobotPoseList in interface envServerInterfaceThrows:java.rmi.RemoteException24) getRobotDataListpublic java.util.Vector getRobotDataList()25) awakeAllRobotspublic void awakeAllRobots()26) awakeRobotpublic void awakeRobot(int id)27) allRobotReadypublic boolean allRobotReady()28) setRobotActionpublic void setRobotAction(int id,int action)throws java.rmi.RemoteException196Specified by:setRobotActionini/ACtikeenvServerInterfaceThrows:java.rmi.RemoteException29) setRobotReadypublic void setRobotReady(int id)throws java.rmi.RemoteExceptionSpecified by:setRobotReady in interface envServerInterfaceThrows:java.rmi.RemoteException30) setBoxPosepublic void setBoxPose(objPose bPose)throws java.rmi.RemoteExceptionSpecified by:setBoxPose in interface envServerInterface Throws:java.rmi.RemoteException31) clearAllRobotReadypublic void clearAllRobotReady()throws java.rmi.RemoteExceptionThrows:java.rmi.RemoteException32) setRobotPosepublic void setRobotPose(int id,objPose p)throws java.rmi.RemoteExceptionSpecified by:setRobotPose in interface envServerInterfaceThrows:java.rmi.RemoteException33) setRobotStatuspublic void setRobotStatus(int id,boolean s)throws java.rmi.RemoteExceptionSpecified by:setRobotStatus in interface envServerInterfaceThrows:java.rmi.RemoteException5. Interface envServerSQKFInterface envServerInterfaceAll Superinterfaces:197java.rmi.RemoteAll Known Implementing Classes:env S ery erpublic interface envServerInterfaceextends java.rmi.RemoteThe interface for remote product objects.Method Summaryint addRobot(robotData r)point'getGoal()java.util.Vector.getObstacleGroup()java.util.VectorgetRobotPoseList()java.util.VectorlgetRobotStatusList()pointIgetBoxDimension()objPoseigetBoxPose()doubleigetDetectionRadius()doubleigetGlobalReward()java.util.VectorigetRobotUr1List()-------^-void:setBoxPose(objPose boxPose)void setRobotAction(int id, int action)voidisetRobotPose(int id, objPose p)void setRobotReady(int id)t- —^voidisetRobotStatus(int id, boolean s)198Method Details34) addRobotint addRobot(robotData r)throws java.rmi.RemoteExceptionThrows:java.rmi.RemoteException35) getGoalpoint getGoal()throws java.rmi.RemoteExceptionThrows:java.rmi.RemoteException36) getObstacleGroupjava.util.Vector getObstacleGroup()throws java.rmi.RemoteExceptionThrows:java.rmi.RemoteException37) getBoxPoseobjPose getBoxPose()throws java.rmi.RemoteExceptionThrows:java.rmi.RemoteException38) getBoxDimensionpoint getBoxDimension()throws java.rmi.RemoteExceptionThrows:java.rmi.RemoteException39) getRobotUr1Listjava.util.Vector getRobotUr1List()throws java.rmi.RemoteExceptionThrows:java.rmi.RemoteException40) getGlobalRewarddouble getGlobalReward( )throws java.rmi.RemoteExceptionThrows:java.rmi.RemoteException41) getDetectionRadiusdouble getDetectionRadius()throws java.rmi.RemoteExceptionThrows:java.rmi.RemoteException42) getRobotPoseListjava.util.Vector getRobotPoseList()throws java.rmi.RemoteException199Throws:java.rmi.RemoteException43) getRobotStatusListjava.util.Vector getRobotStatusList()throws java.rmi.RemoteExceptionThrows:java.rmi.RemoteException44) setRobotActionvoid setRobotAction(int id,int action)throws java.rmi.RemoteExceptionThrows:java.rmi.RemoteException45) setRobotReadyvoid setRobotReady(int id)throws java.rmi.RemoteExceptionThrows:java.rmi.RemoteException46) setRobotPosevoid setRobotPose(int id,objPose p)throws java.rmi.RemoteExceptionThrows:java.rmi.RemoteException47) setRobotStatusvoid setRobotStatus(int id,boolean s)throws java.rmi.RemoteExceptionThrows:java.rmi.RemoteException48) setBoxPosevoid setBoxPose(objPose boxPose)throws java.rmi.RemoteExceptionThrows:java.rmi.RemoteException6. Class goalSQKFClass goaljava .lang.ObjectsQKF.goalpublic class goalextends java.lang.Object200Field Summarypoint positionConstructor Summarygoal(double x, double y)Creates a new instance of goal Method SummaryMethods inherited from class java.lang.Objectclone, equals, finalize, getClass, hashCode, notify, notifyAll,toString, wait, wait, waitField Details49) positionpublic point position. . .^. _^.Constructor Details50)goalpublic goal(double x,double y)Creates a new instance of goal7. Class mainSQKFClass mainjava.lang.ObjectL ava .awt.ComponentL j ava .awt.ContainerL j avax.swing.JComponentL j avax swing.JPanelL SQKF .mainAll Implemented Interfaces:java.awt.image.ImageObserver, jaVa.awt.MenuContainer, java.io.Serializable,javax.accessibility.Accessiblepublic class mainextends javax.swing.JPanelSee Also:201Serialized FormField SummaryConstructor Summarymain()Method Summaryvoid clearCanvas()static voidlmain(java.lang.String[] args)void;paintComponent(java.awt.Graphics g)static double randomNum(double al, double a2) Constructor Details51) mainpublic main()Method Details52) paintComponentpublic void paintComponent(java.awt.Graphics g)Overrides:paintComponent in class j avax.swing.JComponent53) clearCanvaspublic void clearCanvas()54) mainpublic static void main(java.lang.String[] args)throws java.lang.ExceptionThrows:java.lang.Exception55) randomNumpublic static double randomNum(double al,double a2)202Field Summarydouble anglepoint centerCreates a new instance of objPose8. Class objPoseSQKFClass objPosejava.lang.ObjectLsOKF.objPoseAll Implemented Interfaces:java.io.Serializablepublic class objPoseextends java.lang.Objectimplements java.io.SerializableSee Also:Serialized FormConstructor SummaryobjPose ()Creates a new instance of objPoseField Details56) centerpublic point centerCreates a new instance of objPose57) anglepublic double angleConstructor Details58) objPosepublic objPose()Creates a new instance of objPose9. Class obstacleSQKFClass obstaclejava.lang.ObjectL sQKF.obstacleAll Implemented Interfaces:203java.io.Serializablepublic class obstacleextends java. lang . Objectimplements java . io . SerializableSee Also:Serialized FormField Summarypoint positiondouble T;Constructor Summaryobstacle (point p, double r1)Creates a new instance of obstacleField Details59) positionpublic point position60) rpublic double rConstructor Details61) obstaclepublic obstacle (point p,double r1)Creates a new instance of obstacle10. Class robotServerSQKFClass robotServerjava . lang . Obj ec tL Java rmi . server .RemoteObjectL Java .rmi . server .RemoteServerL ava rmi . server .UnicastRemoteObjectLsQKF.robotServerAll Implemented Interfaces:javaio.Serializable, java.rmi.Remote, robotServerInterfacepublic class robotServer204extends java.rmi.server.UnicastRemoteObjectimplements robotServerInterface Field SummaryFields inherited from class java.rmi.server.RemoteObjectrefConstructor SummaryrobotServer(robot r)Creates a new instance of robotServerMethod Summaryboolean awakeMe()objPose getPose()boolean isSleeping()void receiveRobotActionList(java.util.Vector actionList,boolean isInitialAction)Constructor Details62) robotServerpublic robotServer(robot r)throws java.rmi.RemoteExceptionCreates a new instance of robotServerThrows:java.rmi.RemoteExceptionMethod Details63) receiveRobotActionListpublic void receiveRobotActionList(java.util.Vector actionList,boolean isInitialAction)throws java.rmi.RemoteExceptionSpecified by:receiveRobotActionList in interface robotServerInterfaceThrows:java.rmi.RemoteException64) getPosepublic objPose getPose()throws java.rmi.RemoteExceptionSpecified by:205getPoseininterfWerobotServerInterfaceThrows:java.rmi.RemoteException65) awakeMepublic boolean awakeMe()throws java.rmi.RemoteExceptionSpecified by:awakeMe in interface robotServerInterfaceThrows:java.rmi.RemoteException66) isSleepingpublic boolean isSleeping()throws java.rmi.RemoteExceptionSpecified by:isSleeping in interface robotServerInterface Throws:java.rmi.RemoteException11. Class robotSQKFClass robotjava . lang . Obj ectL j ava.lang.ThreadL sQKF . robotAll Implemented Interfaces:j ava.lang.Runnablepublic class robotextends java.lang.ThreadNested Class SummaryNested classes/interfaces inherited from class java.lang.Threadjava.lang.Thread.State, java.lang.Thread.UncaughtExceptionHandlerField SummaryFields inherited from class java.lang.ThreadMAX_PRIORITY, MIN_PRIORITY, NORM_PRIORITY206fvoid modiRobotActionList(java.util.Vector actionList,boolean isInitialAction)void run()Constructor Details67) robotpublic robot(double fValue,java.lang.String sUrl,java.lang.String mUrl,objPose mPose)Creates a new instance of robotMethod DetailsConstructor Summaryrobot(double fValue, java.lang.String sUrl, java.lang.String mUrl,objPose mPose)Creates a new instance of robotMethod Summary boolean amSleeping()void awakeMe()void becomeWaiting(int timeout)void , frame2Tol(point p1, point p , point origin, double theta)objPoseigetPose()static void main(java.lang.String[] args)68)mainpublic static void main(java.lang.String[] args)69) runpublic void run()Specified by:run in interface j ava lang . RunnableOverrides:run in class java . lang . Thread20770) become Waitingpublic void becomeWaiting(int timeout)71) awakeMepublic void awakeMe()72) amSleepingpublic boolean amSleeping()73) modiRobotActionListpublic void modiRobotActionList(java.util.Vector actionList,boolean isInitialAction)74) fivme27-61public void frame2Tol(point pl,point p2,point origin,double theta)75) getPosepublic objPose getPose()208


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items