Robot Learning for Autonomous Navigation in a DynamicEnvironmentbyYunfei ZhangB.Sc., Qingdao University of Science and Technology, 2006M.Sc., Shanghai Jiao Tong University, 2010A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Mechanical Engineering)The University of British Columbia(Vancouver)September 2015c© Yunfei Zhang, 2015AbstractThis dissertation addresses autonomous navigation of robots in a dynamic environ-ment where the existence of moving and/or unknown objects leads to more seriouschallenges for the robots than those when operating in a traditional stationary en-vironment. Therefore, the use of learning capabilities to facilitate proper roboticoperation in a dynamic environment has become an important research area in thepast decade. This dissertation proposes several novel learning-based methods toovercome the shortcomings in the existing approaches of autonomous navigation.Three aspects are addressed in the present work.First, a real-time path planning method is designed for autonomous naviga-tion that can generate a path that avoids stationary and moving obstacles. To thisend, learning ability is imparted to the robot. The present framework incorporatesthe statistical planning approach called probabilistic roadmap (PRM), Q-learningtogether with regime-switching Markov decision process (RSMDP) due to its ben-eficial characteristics, to form a robust Q-learning. Consequently, the initial pathcan be improved through robust Q-learning during interaction with a dynamic en-vironment.Second, motion planning under constraints is investigated. Specifically, a closed-form piecewise affine control law, called piecewise affine-extended linear quadraticregulator (PWA-ELQR), for nonlinear-nonquadratic control problems with con-straints is proposed. Through linearization and quadratization in the vicinity ofthe nominal trajectories, nonlinear-nonquadratic control problems can be approx-imated to linear-quadratic problems where the closed-form results can be derivedrelatively easily.Third, people detection is integrated into the autonomous navigation task. Aiiclassifier trained by a multiple kernel learning-support vector machine (MKL-SVM) is proposed to detect people in sequential images of a video stream. Theclassifier uses multiple features to describe a person, and learn its parameter valuesrapidly with the assistance of multiple kernels.In addition to the methodology development, the present research involvescomputer simulation and physical experimentation. Computer simulation is usedto study the feasibility and effectiveness of the developed methodologies of pathplanning, motion planning and people detection. The experimentation involvesautonomous navigation of a homecare robot system. The performance of the de-veloped system is rigorously evaluated through physical experimentation and isimproved by refining the developed methodologies.iiiPrefaceThe entire work presented in this dissertation was conducted at the Industrial Au-tomation Laboratory of the University of British Columbia, Vancouver campusunder the supervision of Dr. Clarence W. de Silva. A collection of manuscripts,resulting from the collaboration of several researchers, contribute the content ofthis work. I was responsible for majority of the research in this work, including lit-erature survey, algorithm development and implementation, open-source softwareimplement, numerical simulation and physical experimentation, with the guidanceand advice of my supervisor, Dr. Clarence W. de Silva. I wrote the manuscripts,which were edited and refined by Dr. Clarence W. de Silva.Parts of Chapter 2 and Chapter 3 are based on the work published on:• Yunfei Zhang, Weilin Li and Clarence W. de Silva, ”RSMDP-based RobustQ-learning for Optimal Path Planning in a Dynamic Environment,” Interna-tional Journal of Robotics and Automation, 2015 (accepted).A version of combined Chapter 4 and Chapter 6 has been submitted to a journaland is under second round review:• Yunfei Zhang, Xun Chen and Clarence W. de Silva, ”PWA-Extended LQRfor Optimal Motion Planning of Nonholonomic Mobile Robot with Con-straints”.Chapter 5 is based on the work published on:• Yunfei Zhang, Rajen Bhatt, and Clarence W. de Silva, ”MKL-SVM-basedhuman detection for autonomous navigation of a robot,” IEEE InternationalConference In Computer Science & Education (ICCSE), Vancouver, August22-24, 2014.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixNomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiList of Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xivAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Research Scope and Problem Specification . . . . . . . . . . . . . 41.2.1 Research Scope . . . . . . . . . . . . . . . . . . . . . . . 41.2.2 Problem Specification . . . . . . . . . . . . . . . . . . . 51.3 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81.3.1 Path Planning . . . . . . . . . . . . . . . . . . . . . . . . 81.3.2 Motion Planning . . . . . . . . . . . . . . . . . . . . . . 91.3.3 People Detection . . . . . . . . . . . . . . . . . . . . . . 111.4 Challenges, Contributions and Organization of the Dissertation . . 12v1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 Robot Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . 182.2.1 Markov Decision Process . . . . . . . . . . . . . . . . . 182.2.2 Dynamic Programming: Model-based Method for SolvingMDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222.2.3 Reinforcement Learning: Model-free Method for SolvingMDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262.3 Support Vector Machine: A Popular Method of Supervised Learning 342.3.1 Geometric Margins . . . . . . . . . . . . . . . . . . . . . 352.3.2 Optimal Margin Classifier . . . . . . . . . . . . . . . . . 392.3.3 Dual Problem and Support Vectors . . . . . . . . . . . . . 403 Robust Q-learning with Regime-Switching Markov Decision Processfor Optimal Path Planning . . . . . . . . . . . . . . . . . . . . . . . 423.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423.2 RSMDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.3 Probabilistic Roadmap for RSMDP . . . . . . . . . . . . . . . . . 453.4 Path Planner with Online Q-learning . . . . . . . . . . . . . . . . 473.5 Simulation Studies . . . . . . . . . . . . . . . . . . . . . . . . . 514 Extended Linear Quadratic Regulator Enhanced with PWA for Mo-tion Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564.2 Optimal Motion Planning with Constraints . . . . . . . . . . . . . 564.3 Piecewise Affine-ELQR . . . . . . . . . . . . . . . . . . . . . . . 584.3.1 Traditional Linear-Quadratic-Regulator . . . . . . . . . . 584.3.2 Piecewise Affine Feedback Control for Constrained Con-trol Problems . . . . . . . . . . . . . . . . . . . . . . . . 594.3.3 PWA-LQR Smoothing . . . . . . . . . . . . . . . . . . . 624.3.4 PWA-ELQR: Local Approximation for Nonliear-NonquadraticControl Problems . . . . . . . . . . . . . . . . . . . . . . 65vi4.4 Simulation Studies . . . . . . . . . . . . . . . . . . . . . . . . . 675 People Detection-using MKL-SVM . . . . . . . . . . . . . . . . . . . 745.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745.2 Combined Features: HOG-HOF . . . . . . . . . . . . . . . . . . 755.2.1 Histogram of Oriented Gradients . . . . . . . . . . . . . . 755.2.2 Histograms of Optical Flow . . . . . . . . . . . . . . . . 785.3 MKL-SVM Classifier . . . . . . . . . . . . . . . . . . . . . . . . 795.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . 816 Implementation and Experimentation . . . . . . . . . . . . . . . . . 856.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 856.2 The Hardware System . . . . . . . . . . . . . . . . . . . . . . . . 876.2.1 Segway Mobile Base . . . . . . . . . . . . . . . . . . . . 876.2.2 3D Camera: Microsoft Kinect . . . . . . . . . . . . . . . 876.2.3 2D Laser Ranger Finder: Hokuyo UTM-30LX . . . . . . 896.3 Software System . . . . . . . . . . . . . . . . . . . . . . . . . . 896.3.1 People Detection . . . . . . . . . . . . . . . . . . . . . . 896.3.2 Autonomous Navigation . . . . . . . . . . . . . . . . . . 926.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 957 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . 1047.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1047.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108viiList of TablesTable 5.1 Comparisons of different kernels and corresponding results . . 82Table 6.1 Segway RMP100 Specifications . . . . . . . . . . . . . . . . . 87Table 6.2 Microsoft Kinect Specifications . . . . . . . . . . . . . . . . . 88Table 6.3 Hokuyo UTM-30LX Specifications . . . . . . . . . . . . . . . 89viiiList of FiguresFigure 1.1 Percentage of elderly (>65) with respect to working-age (16-65) population. . . . . . . . . . . . . . . . . . . . . . . . . . 2Figure 1.2 A scenario of autonomous navigation in home environment. . 3Figure 1.3 A typical function structure of autonomous navigation. . . . . 5Figure 1.4 The designed scenario of autonomous navigation in a real-world home environment. . . . . . . . . . . . . . . . . . . . 6Figure 1.5 (a) The Segway two-wheel differential-drive mobile robot; (b)Kinematic scheme of the two-wheel differential-drive mobilerobot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7Figure 2.1 Basic MDP scheme for modeling the environment when robotlearning is applied. . . . . . . . . . . . . . . . . . . . . . . . 18Figure 2.2 Basic MDP scheme for modeling the environment where robotlearning are applied into. . . . . . . . . . . . . . . . . . . . . 23Figure 2.3 Static collision-free roadmap for establishing Q table. . . . . . 33Figure 2.4 Reward matrix R (-1 means there is no edge between nodes, 0means an arrived node is not a goal node, 10 means an arrivednode is a goal node). . . . . . . . . . . . . . . . . . . . . . . 34Figure 2.5 Q table in the form of matrix Q at the initial iteration. . . . . . 35Figure 2.6 Q table in the form of matrix Q at the iteration 5. . . . . . . . 36Figure 2.7 Q table in the form of matrix Q at the iteration 50. . . . . . . 37Figure 2.8 The basic scheme of linear classifier. . . . . . . . . . . . . . . 37Figure 2.9 The basic scheme of geometric margins. . . . . . . . . . . . . 39ixFigure 2.10 An example of the concept of support vectors for a linear clas-sifier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41Figure 3.1 The probabilistic roadmap (PRM) process in the configurationspace (C-space) . . . . . . . . . . . . . . . . . . . . . . . . . 46Figure 3.2 Local roadmap generation. . . . . . . . . . . . . . . . . . . . 48Figure 3.3 Simulated environment for the mobile robot in a 640×480 im-age plane, in a learning process. . . . . . . . . . . . . . . . . 51Figure 3.4 History of Q-value with γmax = 0.8,0.5,0.2 and 100 sampledpoints generated in PRM. . . . . . . . . . . . . . . . . . . . . 52Figure 3.5 Performance comparison between traditional Q-learning androbust Q-learning with γmax = 0.98. . . . . . . . . . . . . . . 53Figure 3.6 Obstacle avoidance in dynamic environment. . . . . . . . . . 55Figure 4.1 (a) The real-world environment where SegPanda executes au-tonomous navigation; (b) The simplified simulation environ-ment corresponding to (a). . . . . . . . . . . . . . . . . . . . 68Figure 4.2 Comparison of the optimal trajectories of the input control be-tween ELQR (does not consider constraints on control duringmotion planing) and PWA-ELQR. . . . . . . . . . . . . . . . 69Figure 4.3 The corresponding comparison of the optimal trajectories ofthe state between ELQR, which does not consider constraintson control during motion planing, and PWA-ELQR(Hence thestate trajectory of ELQR can theoretically reach to the goallocation). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70Figure 4.4 Comparison of the optimal trajectories of the state betweenELQR and PWA-ELQR in a simplified simulation environ-ment. Although ELQR results in a somewhat shorter path, itcosts 142.97 and can not be executed because of control con-straints. PWA-ELQR costs 140.77 and is executed success-fully since it considers constraints during motion planning. . . 72Figure 5.1 The definition of cell and block for a sample image . . . . . . 75Figure 5.2 Calculation of the gradients of the sample image. . . . . . . . 76xFigure 5.3 8-bin histogram orientation for one cell . . . . . . . . . . . . 77Figure 5.4 Overview of extracting HOG feature . . . . . . . . . . . . . . 78Figure 5.5 Overview of extracting HOF feature . . . . . . . . . . . . . . 78Figure 5.6 (a) Detection results (light color bounding boxes) based onHOG-HOF feature using MKL-SVM; (b) Detection results (darkcolor bounding boxes) based on HOG feature using MKL-SVM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84Figure 6.1 The prototype of the autonomous navigation system-SegPanda. 86Figure 6.2 An example of a simulated environment built by ROS. . . . . 86Figure 6.3 The structure of Kinect components . . . . . . . . . . . . . . 88Figure 6.4 An example of the scanning result of Hokuyo in the IndustrialAutomation Laboratory (IAL) laboratory environment. Thegreen lines pointed using red arrows show the scanning resultof Hokuyo. . . . . . . . . . . . . . . . . . . . . . . . . . . . 90Figure 6.5 The result of people detection in a one-person case. . . . . . . 91Figure 6.6 The result of people detection in a two-person case. . . . . . . 92Figure 6.7 The overview structure of the ROS navigation stack. . . . . . 93Figure 6.8 The overall structure of the autonomous navigation system . . 95Figure 6.9 An example of the local costmap used for autonomous naviga-tion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97Figure 6.10 An example of the computed global path used for autonomousnavigation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 99Figure 6.11 The flowchart of the overall autonomous navigation system. . 100Figure 6.12 The views of the experimental demonstration. . . . . . . . . . 101Figure 6.13 The views of the experimental demonstration: avoiding a staticobstacle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102Figure 6.14 The views of the experimental demonstration: avoiding a mov-ing obstacle. . . . . . . . . . . . . . . . . . . . . . . . . . . 103xiNomenclatureA = {a1,a2, ...,aN} a set of actions of an MDPAt state matrix of a discrete-time formulationA¯t state matrix of a reverse discrete-time formu-lationα learning rateb optimal parameterBt control matrix of a discrete-time formulationB¯t control matrix of a reverse discrete-time for-mulationct immediate cost function at time-step tcl immediate cost function at final time-step l[Dt CTtCt Et]cost weight matrix in cost-to-go function[D¯t C¯TtC¯t E¯t]cost weight matrix in cost-to-come function[dtet]cost weight vector in cost-to-go function[d¯te¯t]cost weight vector in cost-to-come functionδ iteration termination thresholddx vertical stride sizedy horizontal stride sizeη geometric marginE mean value (expectation)γ discount factorI imageIm gradient magnitude imageIo gradient orientation imagexiiIx vertical difference imageIy horizontal difference imagek kernel functionL distance between the two wheelsP transition function of an MDPPψk transition function of an RSMDPpi policy of an MDPpiψk policy of an RSMDPpi∗ optimal policy of an MDPpi∗ψk optimal policy of an RSMDPψ regimeΨ collection of regimeΦi feature spaceQ state-action functionQ∗ optimal state-action functionQψk state-action function in each regimeQ∗ψk optimal state-action function in regime ψkQ Q value matrixr cost functionrψk cost function in regime ψkR reward function of an MDPRψk reward function of an RSMDPR reward matrixσ iteration thresholdS = {s1,s2, ...,sN} a set of states of an MDPTC terminal cost function of an MDPTCψk terminal cost function of an RSMDPt timeu = [vl,vr]T robot control input, vl left wheel velocity, vrright wheel velocityε greedy thresholdvt cost-to-go functionv¯t cost-to-come functionV value functionV ∗ optimal value functionVψk value function in each regimeV ∗ψk optimal value function in regime ψkwm normal vector to the hyperplane for the fea-ture space Φmw optimal parameterxiiiList of AcronymsANN artificial neural networksC-space configuration spaceCNN convolutional neural networksDP dynamic programmingEL evolutionary learningELQR extended linear-quadratic-regulatorHOG histograms of oriented gradientsHOG-HOF histograms of oriented gradients-histograms of optical flowHOF histograms of optical flowKKT Karush-Kuhn-TuckerLQR linear-quadratic-regulatorLSVM linear support vector machineMC Monte CarloMDP Markov decision processMKL multiple kernel learningMKL-SVM multiple kernel learning-support vector machineMPC model predict controlIAL Industrial Automation LaboratoryxivIL imitation learningPCL point cloud libraryPOMDP partially observable Markov decision processPRM probabilistic roadmapPWA-LQR piecewise affine-LQRPWA-ELQR piecewise affine-ELQRQP quadratic programmingRBF radial basis functionRGB red-green-blueRGB-D red-green-blue-depthRL reinforcement learningRHC receding horizon controlROS robot operating systemRSMDP regime-switching Markov decision processSimpleMKL simple multiple kernel learningSVM support vector machineTDL temporal difference learningxvAcknowledgmentsFive years of pursing PhD degree is an extremely fantastic journey in life, bringingme much joy and cheers, at the price of many challenges, hardship, pressure andtears. Only after experiencing such feelings, can I thoroughly understand whatgratitude is.First, I wish to express my sincere gratitude to my supervisor, Dr. ClarenceW. de Silva. It is he who offered me this precious opportunity for pursing a PhDdegree, enabling me to freely swim in the ocean of knowledge and solidly build afoundation for my future, which without doubt has completely changed my pathof life. I greatly appreciate all of his advice, mentorship, understanding and unwa-vering support during the past five years. In my eyes, Dr. de Silva is not just myacademic supervisor, but also my life mentor who always encouraged me with hisgenerosity and patience.I would like to thank a few of senior labmates, Dr. Ying Wang for his recom-mendation, Dr. Tahir Khan for his humour, Dr. Roland Haoxiang Lang for thehappy hours I spent at his home during festivals, and Dr. Yanjun Wang for en-joyable conversations in our lab. Without their help and guidance, I could not gothrough the difficulties in the beginning period of my PhD program.Also, I wish to thank all my other colleagues in the Industrial AutomationLaboratory (IAL), Ms. Yu Du, Mr. Muhammad Tufail Khan , Mr. Shan Xiao, Ms.Lili Meng, Mr. Hani Balkhair, Ms. Pegah Maghsoud, Mr. Min Xia, Mr. ShujunGao, Mr. Zhuo Chen, Mr. Teng Li., and Mr. Tongxin Shu for their friendship andhelp.A big thank you goes to my close friendship groups, THREEPLUSONE inChina and WOCHANG in Vancouver. Their support and comfort have made mexvibelieve that a bright future will come my way eventually. Special thanks go to mydear friend, Prof. Xun Chen, for his advice on research that inspires me to purse ahigher academic goal.This work has been supported by research grants from the Natural Sciences andEngineering Research Council (NSERC) of Canada, the Canada Foundation forInnovation (CFI), the British Columbia Knowledge Development Fund (BCKDF),and the Tier 1 Canada Research Chair in Mechatronics and Industrial Automationheld by C.W. de Silva. Thank all sponsors.In conclusion, I would like to express my deepest gratitude to my family. Thankyou my parents and parents-in-law, for your unconditional love and confidence inme that encouraged me to always go forward. Thank you my lovely twins, Lunlunand Yaya. You are god-sent gifts that can recharge me whenever I am out of power.Pingping Liu, my wife, a simple thank you cannot express my gratitude for yoursharing and contributing to this journey. Only with her that I am able to know howmy world rotates.xviiChapter 1Introduction1.1 MotivationDuring the past decade, researchers in robotics have increasingly redirected theirattention from traditional industrial robots operating in structured or stationary en-vironments to the more challenging area of mobile robotics in unstructured anddynamic environments with applications in military, space exploration, underwateroperations, and service robots. In the area of service robotics, homecare roboticsfor the elderly and the disabled has a special significance for its contribution to im-proving the quality of life and reducing the caregiver burdens and costs. Accordingto UN statistics shown in Fig. 1.1, the percentage of adults who are over 65 (withrespect to those over 16) will be more than double in many countries (includingCanada) in the next 50 years. As a result there will be a severe shortage of nursingassistants for the elderly and the disabled. This will generate a major opportunityfor the development of robotic technologies to assist or even substitute the nursingassistants in a home environment. Also, this has important quality-of-life implica-tions since it is known that people regardless of their age or the ability prefer to liveindependently in their own homes. The overall goal of the homecare project in ourlaboratory (Industrial Automation Lab) is to develop technologies for autonomousmobile robots that can assist the elderly and the disabled in their daily activities ina home environment. Many such activities involve autonomous navigation in thepresence of static and dynamic obstacles under uncertain and unknown conditions.1Figure 1.1: Percentage of elderly (>65) with respect to working-age (16-65)population.Autonomous navigation means that the robot can navigate by itself from thestart location to reach the goal location, in the presence of obstacles, which is anessential ability for a variety of application of intelligent robots such as militarytasks, space exploration, and people rescue. In static and structured environments,maps and models of the world (environment in which the robot moves) can begenerated to support autonomous navigation by a prior sensory/perception processtogether with available information. However, in homecare robotics, as shown inFig. 1.2, autonomous navigation becomes more complex since the home environ-ment is dynamic (and somewhat arbitrary), and changes take place due to movingobjects such as humans and pets and their actions. For instance, giving first aidto a person who suddenly suffers heart attack or guiding an elderly person to awashroom involves robotic movement in a dynamic environment[1]. Therefore, itis important that such robots learn how to deal with dynamic environments so thatthey can repeat those or similar tasks more effectively, by taking advantage of theirprior experience. The main focus of the present dissertation is the imparting of thelearning capability to robots so that they can adapt to dynamic environments during2autonomous navigation.Figure 1.2: A scenario of autonomous navigation in home environment.The problem of population aging is a global one due to the rising life ex-pectancy and declining birth rate. Initially, this has been a problem primarily incountries that are economically more developed. But recently, developing coun-tries such as China and India are also affected by this problem. According to theprediction of Statistics Canada, the percentage of the senior population will reach25% of the total population by 2050. The social and economic effects of popula-tion aging are enormous. The elderly need some care and physical assistance whennecessary, in their daily life. The total cost of providing care to people of this groupis tremendous, considering the large size of the aging population.Another group of people that will rapidly increase its size are those with phys-ical and cognitive impairments. The Canadian government spends about $9 billionon disability related matters. The cost of basic care for a disabled person at homeis about $10,000/month, and this is a huge burden on the Government.The burden described above may be reduced through the use of a homecarerobotic system, which consists of a group of robotic devices to work together in acoordinated and efficient manner and carry out a common task of providing assis-tance to elderly and/or disabled people in the home setting. The needed technolo-gies of robotics, network communication, and control are sufficiently mature and3are available at reasonable cost. Due to the lack of necessary assistive technologiesand specialized end-effector devices, and also due to insufficient efforts in bring-ing the pertinent technology teams together, the development of affordable andreliable systems has eluded the application area in the past. The research in home-care robotics is a new but rapidly developing field of service robotics, especially inJapan.Through our established facilities in the Industrial Automation Laboratory weare in the process of developing an affordable and effective system of service robotsfor the elderly or the disabled people with physical and cognitive impairments ina home setting. The system comprises several sets of robot manipulators havingmobile bases, and can provide assistance in daily activities, medical assistance,surveillance, and so on. They can work independently or collaboratively in a team,based on the complexity of the task. The designed homecare robotic system allowsthe care-receiver to stay in their familiar home environment where the assistance isprovided, which is a merit of the associated technologies.1.2 Research Scope and Problem Specification1.2.1 Research ScopeIn carrying out complex robotic tasks in a dynamic environment, autonomous nav-igation involves multi-domain technologies such as sensor fusion, data processing,computer vision, artificial intelligence, optimal control and so on. Typically it canbe divided into three functional parts: perception part, computation part and execu-tion part, as shown in Fig. 1.3. The perception part provides many kinds of sensordata to recognize the environment around the robot; the computation part analy-ses the data from the perception part and gives corresponding commands to therobot according to the task goals; and the execution part accomplishes all the me-chanical action by following the commands from the computation part. In view ofthe needed capabilities of autonomous navigation for a homecare application, thescope of the present research focuses on three main areas: 1. global path planningthat provides an optimal path to direct the robot to the global goal; 2. local motionplanning that gives the robot specific controls to follow the global path; 3. people4detection in a home environment. Path planning and local motion planning whichbelong to the computation part are critical and indispensable for autonomous navi-gation. People detection also falls into the computation part. But more importantly,humans play an important role in the environment of many robotic applications,especially in a homecare environment. Therefore, taking the existence of humansinto account during autonomous navigation has a vitally practical significance.Figure 1.3: A typical function structure of autonomous navigation.1.2.2 Problem SpecificationThis dissertation combines the aforementioned three research scopes into one au-tonomous navigation scenario, as shown in Fig. 1.4. The autonomous navigationsystem needs to complete the following tasks: detect people around the robot andthe people’s location, set the people location as the goal location, calculate an op-timal path starting from the current location to the goal location, compute a localmotion plan to follow the global path, and avoid obstacles (both moving and static)5during all of the procedures. The mobile robot that is used in this research workis a nonholonomic two-wheel differential-drive mobile robot. A zoom-out view ofthe mobile robot and its kinematic scheme are shown in Fig. 1.5.Figure 1.4: The designed scenario of autonomous navigation in a real-worldhome environment.The robot states are represented by its position and orientation (angle of rota-tion) x= [x,y,θ ]T . Considering the left wheel velocity and the right wheel velocityas the robot control inputs, u = [vl,vr]T , the robot kinematic model can be ex-pressed asx˙ =12(vr + vl)cosθy˙ =12(vr + vl)sinθθ˙ =1L(vr− vl)(1.1)where L represents the distance between the two wheels.Generally for formula derivation, it is common to express Equation (1.1) ina vector form of a deterministic continuous-time formulation which is the mosttypical way to describe the kinematics of a nonlinear robotic system, as given byEquation (1.2).x˙(t) = f(t,x(t),u(t)) (1.2)6For computer representation of a continuous-time system, the discrete-time formu-lation of Equation(1.2) for any given time-step is expressed as:xt+1 = gt(xt ,ut), (1.3)with gt ∈X×U→X, xt ∈X and ut ∈U, where X⊂Rn and U⊂Rm. Accordingly,for the purpose of cost-to-come (as discussed in Section 4.3.3) the inverse discrete-time kinematic is denoted as:xt = g¯t(xt+1,ut), (1.4)where g¯t ∈ X×U→ X such that gt(g¯t(xt +1,ut),ut) = xt+1.(a) (b)Figure 1.5: (a) The Segway two-wheel differential-drive mobile robot; (b)Kinematic scheme of the two-wheel differential-drive mobile robot.71.3 Related Works1.3.1 Path PlanningPath planning is critical for a homecare robot, as it moves in an environment ofunknown and dynamic factors. In path planning, the path of the navigating robotfrom the start location to the goal location is planned according to some criteria(e.g., shortest path, quickest path and/or path of minimum energy) while avoidingcollisions with static and moving obstacles, and subject to some constraints (e.g.,robot capabilities with respect to its possible movements). However, path planningbecomes more complex in the present application since the home environment isdynamic and unstructured due to moving objects such as humans and pets and theiractions (e.g., rearrangement of furniture)[2–4].Sampling-based methods such as probabilistic roadmap (PRM) [5–7] and Rapidly-exploring Random Trees (RRT) [5, 8] became very popular since the nineties.These algorithms have proved to be very effective for path planning in high-dimensionsince they rely on a collision-checking module instead of explicit representation ofthe environment; however, convergence to optimal solutions cannot be guaranteedwith probability one . Recently in [8], a series of variants to the sampling-basedmethods (PRM*, RRT*) have been proposed to provide probabilistic complete-ness. This means, if a solution exists for a particular plan, the probability of dis-covering the optimal solution converges to one as the sampled number of statesincreases to infinity. All these methods do not explicitly consider moving obstacleswhen asymptotically converging to the optimal path.In practical applications, however, most challenges in path planning come fromfactors of uncertainty in dynamic environments such as varying environment andunknown and moving obstacles. Inspired by the consideration of static obstacleand moving obstacle separately and the configuration-time state space, Van denBerg proposed a hybrid approach, which first constructs a path using PRM in theconfiguration-time state space based on the stationary obstacles in the environment,and subsequently plans a collision-free path from the original path by taking intoaccount the moving obstacles, which are modeled as time discs [9, 10]. A particularadvantage of this approach is that it does not need to exactly know the specific8movements of the obstacles, such as speed and direction, and the planner is ableto generate a path on-line while taking into account changes in the environmentduring the period of deliberation. However, the approach has some disadvantagesas well. One is the assumption that the maximum speed of obstacle motion mustbe less than that of the robot. Another is that the safety buffer of the time discmodule sacrifices part of the collision-free space. Furthermore, the approach doesnot make use of the previous planning experience, which can help to reduce thecomputational burden.1.3.2 Motion PlanningThe objective of motion planning under constraints is to plan optimal motions fora robot such that a user-defined cost function is minimized, without disruptingany required constraints on states and inputs. It is basically treated as an optimalcontrol problem. The linear-quadratic-regulator (LQR)[11] is a transitional andpopular optimal feedback controller,which can determine a proper optimal roboticcontrol action, given the current state. It has been studied for several decades andbeen successfully applied in many practical optimal control problems [12][13][14].However, it only provides a closed-form optimal solution and assumes that therobot system is linear and the cost function is quadratic. Iterative LQR (iLQR)[15]and its variants [16][17] have been proposed in order to deal with nonlinear systemswith nonquadratic cost functions. But these approaches do not consider constraintsexplicitly during the iteration process. They just add an extra effort (e.g., linearsearch) to guarantee iteration convergence. Consequently, the executed trajectorieseasily deviate the robot from the planned trajectories beyond a tolerant range, andthe robot may even fail to complete the task (e.g., avoiding obstacles).Much of the work related to nonlinear-nonquadratic control problems has fo-cused on deterministic robot systems. Basically most of the numerical methods fallinto one of the following three categories.One category consists of sample-based methods [18][19].These methods utilizesearch algorithms to go through the configuration space that is defined by the user.Even though these methods can give global results and implicitly handle constraintsby sampling, the need of gridding or other ad doc discretization can cause difficul-9ties. Learning-based method is another category. Based on the structure of MarkovDecision Processing (MDP), approximate reinforcement learning (ARL)[20]usespolity iteration or value iteration to get an approximate cost function by learningfrom samples and then obtain the optimal control. Optimistic planning (OP)[21]approximates the cost function by using a new search method called simultane-ous optimistic optimization for planning. However, these learning-based methodsalso have to first discretize the continuous space, which makes them unsuitable forlarge-scale problems.The third category contains optimization-based methods. Differential dynamicprogramming (DDP)[22][23] belongs to this category. DDP first maintains a rep-resentation of a single trajectory, and then improves it locally relying on dynamicprogramming. It shows second-order convergence and is numerically more effi-cient than Newton’s method[24]. iLQR is closely related to DDP but proves to bemore efficient for complex control problems [15]. It iteratively linearizes the non-linear system and quadratizes the cost function at the current nominal trajectorygiven by the previous iteration and gets a new trajectory via modified Riccati equa-tions. In order to guarantee convergence, iterative LQR (iLQR) [15]additionallyneeds convergence measure (e.g. line search) to drive the new trajectory not farfrom where linearization and quadratization are valid. Extended LQR (ELQR)[17]is proposed to avoid the specific convergence measure through a concept calledLQR-smoothing. LQR-smoothing utilizes cost-to-go and cost-to-come cost func-tions to establish a total-cost function for each iteration. It implements linearizationand quadratization only about a sequence of states and controls that are dynami-cally feasible. A particular advantage of the methods DDP, iLQR and ELQR is thatthey yield feedback control laws that are suitable for real-time control. However,their common shortcoming is that they are unable to handle constraints explicitly.Model Predictive control (MPC) [25][26][27], also called receding horizoncontrol (RHC)[28], has become an acceptably systematic way to handle constraintsexplicitly for motion planning[29]. MPC iteratively solves the convex-optimizationproblem starting from the current state over a finite horizon, and then obtains a se-quence of open-loop optimal controls. The main drawbacks of the MPC are theintrinsic open-loop characteristic and relatively formidable on-line computationaleffort[30]. These drawbacks limit its applicability to relatively slow or small prob-10lems, though the partially close-loop RHC (PCLRHC)[28] method can improvethe speed of performance. Sequential quadratic programing (SQP) is also suitablefor solving constraints. It can accommodate convex constraints on the state andthe control input. However, SQP typically does not generate a feedback controlpolicy, but gives a sequence of optimal open-loop control inputs. The piecewiseaffine-ELQR (PWA-ELQR) method proposed in the present work, inherits advan-tages from ELQR, and incorporates piecewise affine close-loop optimal controlinto the process of minimizing the total cost function, so that the input controlconstraints and differential constraints can be explicitly solved in each iteration.Additionally, the feedback control is locally optimal, which further facilities fastconvergence under constraints to generate executable controls. The present workassumes that obstacle occupation is a unique form of state constraint, since the en-tire unreachable state space is defined as the obstacle state and is considered in thecost function.1.3.3 People DetectionPeople detection for autonomous navigation of homecare robots has attracted con-siderable interest in the past decade, since humans are integral part of a homecarerobots environment. Though remarkable progress has been made in human-freescenarios such as the DARPA Urban Challenge1, autonomous navigation in dy-namic environment where there are moving people is still a largely unsolved prob-lem.Digital cameras can deliver much richer appearance information than rangesensors such as LIDAR, although cameras rarely reach the geometric accuracy asthose of range sensors. This advantage of richer appearance makes cameras attrac-tive for detecting humans in applications of autonomous navigation. One popularframework to solve appearance-based human detection usually insists on a spaceof image features and a learned classifier. In such a framework, there are threemain steps: feature extraction, classifier training, and detection strategy. One typeof features extracted from raw image data is called histograms of oriented gradi-ents (HOG) [31] and is quite popular. Its variants include histograms of optical1http://www.darpa.mil/GRANDCHALLENGE/11flow (HOF) [32], and combined versions with other features like integral channelfeatures [33][34] or color features [35]. Classifiers usually use two categories ofapproaches: support vector machine (SVM) or some variants of boosting, whichlearn to map the features to the scores indicating the probability that the detectedarea represents a human. After training the classifier, a sliding window scheme[36] is commonly used to search the entire image and detect which part is believedto represent a human according to the detection score.1.4 Challenges, Contributions and Organization of theDissertationThe challenges in each considered component of autonomous navigation are sum-marized below:1. The considered dynamic environment is unstructured and has uncertaintydue to lack of knowledge of the behavior of moving obstacles. Hence it israther difficult to model the surrounding environment and the unpredictablemovements of the obstacles. Therefore, a good autonomous navigation sys-tem requires that the global path planning is able to deal with such dynamicenvironments.2. Motion planning is more difficult as well in dynamic environments due tothree main challenges:(1) continuous state and control space bring aboutthe inevitable curse of dimensionality; (2) nonlinear robot system and non-quadratic cost function make the solving optimization problem extremelyhard; and (3) inherent system constraints decrease the effectiveness of theoptimal solution. Apparently, it is non-trivial to address the foregoing threepoints all-in-once for constrained motion planning.3. Distinguishing moving people from other objects so as to locate a personsposition is another big challenge. This challenge is caused due to severalreasons [37]: (1) Human appearance possesses very high variability causedby changing pose, wearing different clothes with diverse colors, and havinga considerable range of sizes. In addition, a cluttered background (home orurban environment) under a wide range of illumination and weather condi-tions varies the quality of the sensed information. The case of being partially12occluded by other objects definitely increases the analysis complexity; (2)Both humans and sensors that are in motion can complicate the movementanalysis. Furthermore, humans appearing at different view angles and a sen-sor system working over a large scope of distances result in more complexsituations. (3) The real-time requirement for detection demands fast reactiontime and robust performance.The goal of this research is to develop novel learning methods to address thesechallenges that arise when a homecare robot operates in a dynamic environment.In achieving this objective, the main three technical contributions are:1. This present work proposes a novel framework called the regime-switchingMarkov decision process (RSMDP) scheme to represent a dynamic and un-structured global environment. Develop a novel optimal path planner by inte-grating an online reinforcement learning approach into the RSMDP schemeto avoid unknown or unpredictable moving obstacles2. To address the constraint problem explicitly in the iteration process as wellas the indicated four challenges together, a new iLQR-based feedback con-troller called PWA-ELQR is proposed in the present work. The idea ofPWA-ELQR comes from the extended linear-quadratic-regulator (ELQR)[17] which is inspired by the extended Kalman filter. It incorporates a piece-wise affine structure into ELQR to solve constraints explicitly, with the helpof quadratic programming (QP). The entire iteration process begins with anarbitrary initial trajectory. Then it iteratively linearizes the robot system andquadratizes the cost function at the states of the previous nominal trajectoryforwardly and backwardly. Next, it adds the cost functions of the forwardpass and the backward pass to get approximate total-cost functions. Lastly,it minimizes the approximate total-cost functions to progressively obtain abetter knowledge of the robot’s future trajectory, and uses the new trajectoryto repeat the entire process.3. Design a classifier trained by an multiple kernel learning-support vector ma-chine (MKL-SVM) method, that comes from an off-the-shelf open source,to detect moving people in sequential images from a video stream, where ap-plying multiple features consisting of HOG + HOF features, which had been13proposed in literature, for detecting moving people. Then combine multiplefeatures and the designed MKL-SVM classifier as the people detector formoving people.1.5 Thesis OutlineThe rest of this dissertation is organized as follows:Chapter 3 presents a robust Q-learning method for global path planning in adynamic environment. The method consists of three steps: first, a regime-switchingMarkov decision process,RSMDP, is formed to present the dynamic environment;second a probabilistic roadmap PRM is constructed, integrated with the RSMDPand stored as a graph whose nodes correspond to a collision-free world state forthe robot; and third, an online Q-learning method with dynamic step size, whichfacilitates robust convergence of the Q-value iteration, is integrated with the PRMto determine an optimal path for reaching the goal. In this manner, the robot is ableto use past experience for improving its performance in avoiding not only staticobstacles but also moving obstacles, without knowing the nature of the obstaclemotion. The use of regime switching in the avoidance of obstacles with unknownmotion is particularly innovative. The developed approach is applied to a homecarerobot in computer simulation. The results show that the online path planner withQ-learning is able to rapidly and successfully converge to the correct path.Chapter 4 presents a closed-form piecewise affine control law for nonlinear-nonquadratic control problems with constraints. The existing ELQR is a validiterative method for handling nonlinear-nonquadratic control problems. Yet it can-not explicitly deal with constraints during iteration. The present work proposes aneffective method, PWA-ELQR, to take constraints into account explicitly duringthe iteration process, by combining QP and ELQR smoother. In this method, QPprovides a linear piecewise affine feedback control law for linear-quadratic con-trol problems, and the ELQR smoother provides tools to approximate nonlinear-nonquadractic control problems through linearization and quadratization in thevicinity of nominal trajectories. The developed approach is tested in a complexcontrol problem: optimal motion planning for a nonholonomic mobile robot withconstraints on the control input. The simulation and experimental results demon-strate good performance, and the proposed approach is a promising feedback con-14troller for optimal motion planning with constraints.Chapter 5 presents a classifier trained by a multiple kernel-learning supportvector machine,MKL-SVM to detect a human in sequential images from a videostream. The developed method consists of two aspects: multiple features consist-ing of HOG features and HOF features suitable for moving objects, and combinednonlinear kernels for SVM. For the purpose of real time application in autonomousnavigation, the popular open-source algorithm called simple multiple kernel learn-ing (SimpleMKL) is implemented into the proposed MKL-SVM classifier. It isable to converge rapidly with sufficient efficiency through a weighted 2-norm reg-ularization formulation with an additional constraint on the weights. The classi-fier is compared with the state-of-the-art linear SVM using a dataset called TUD-Brussels, which is available on line. The results show that the proposed classifieroutperforms the Linear SVM with respect to accuracy.Chapter 6 presents a physical experimental platform, called SegPanda, to testthe proposed methods in this dissertation, as well as the hardware and software usedfor SegPanda. The designed experiment combines into one autonomous navigationscenario all of the three key aspects: path planning, motion planning and peopledetection, which shows good performance based on the proposed methods.The conclusions of the dissertation and suggestions for future research are sum-marized in Chapter 7.15Chapter 2Robot Learning2.1 IntroductionRobot Learning, an application of machine learning approaches to the physicalworld of robotics, is a research area that brings together the methods to endowrobots with learning capabilities. Learning implies that every time the observa-tion about the environment is obtained, the robot will be able to perform theiractivities better in view of the knowledge/experience gained through the learningprocess. Historically, industrial robots did not possess learning capabilities. Theywere used in controlled contexts such as factories to perform predefined specifictasks repeatedly with little uncertainty in interaction with humans or in a varyingenvironment. Recently, the goal of developing autonomous robots that have ca-pabilities of adapting to unstructured or unknown and dynamic environments hasprovided a good rationale for incorporating learning capabilities into robots.There are many ways by which the robots may learn and exploit that capability.They may learn by adjusting parameters based on an outcome, build environmentalmodels such as maps, exploit patterns, evolve rule sets for condition-action pairs,generate entire behaviors, devise new strategies for actions, predict environmentalchanges, recognize the strategies of the opponents or exchange knowledge withother robots. Such capability in robots would relieve humans from much of thedrudgery of intervention to correct robotic actions and would potentially allow op-eration in environments that are unstructured, changeable, unpredictable or only16partially known. In view of (but not limited to) the mentioned advantages of learn-ing, autonomous robots with the capability has been an important goal of robotics,artificial intelligence, and the cognitive sciences.Although boundaries between different learning methods are blurred, basicallyrobot learning is divided into the following three classifications in terms of whattype of feedback can be received [38]:Supervised Learning: In supervised learning, the robot observes some input-output pairs in the form of a given training set, and discovers a function that mapsfrom input to output. Methods such as support vector machine(SVM) [39, 40],artificial neural networks (ANN) [41] and imitation learning (IL) [42] belong tothis category.Unsupervised Learning: In unsupervised learning, the robot needs to deducevarious patterns by observing the input without explicit feedback from the output.evolutionary learning (EL) [43, 44] falls into this category.Reinforcement Learning: In reinforcement learning, unlike the forgoing twotypes of learning, there is no available input in the beginning for the robot. Itrequires the robot to interact with its environment to acquire the input knowledge.Such a trial-and-error mechanism prompts the robot to find the optimal policy,which establishes the mapping between states and actions. Policy iteration [45]and value iteration [46] are common methods that belong to this category.In addition, increasingly, researchers have adopted hybrid learning strategiesto adapt to a dynamic environment. For example, the work reported in [1, 47, 48]employs inverse reinforcement learning methods to recover the intent of the teacherby modeling his cost function, and subsequently, derives the policy that is optimalwith respect to the cost-to-go. The work in [49] combines ANN with imitationlearning to teach goal finding and obstacle avoidance to a Nomad 200 mobile robot.In the context of autonomous navigation in a dynamic home environment, theresearch presented in this dissertation mainly concentrates on suitable methods thatbelong to supervised learning and reinforcement learning (RL). The associatedbackground and the algorithms (Algorithm 1-6) which had been proposed in thepast by other researchers are introduced in the following sections.172.2 Reinforcement Learning2.2.1 Markov Decision ProcessMuch attention has been given recently [50–52] concerning the formulation of theuse of robot learning as either Markov decision process (MDP) or a partially ob-servable Markov decision process (POMDP) framework. Both MDP and POMDPperform very well in a dynamic environment, but their disadvantages are also ob-vious. MDP requires complete observation of states, and POMDP has a heavycomputational burden of exact planning since it needs to estimate various possi-ble uncertainties, thereby making the method unfeasible in just about any practicalproblem of robotics, without further extension [53].In the present research, the approximation process in path/motion planning in-evitably leads to the introduction of noise and uncertainty into the autonomousnavigation task. Hence, explicit consideration of stochastic elements and uncer-tainty cannot result in adequate benefits for an autonomous navigation task. There-fore, unless explicitly stated, the entire system is treated as a deterministic one andMDP is chosen as the framework for the reinforcement learning methods. MDPFigure 2.1: Basic MDP scheme for modeling the environment when robotlearning is applied.is a tuple (S,A,P,R) where S = {s1,s2, ...,sN} is a set of states, being infinite or18finite, discrete or continuous; A = {a1,a2, ...,aN} is a set of actions, being infiniteor finite, discrete or continuous; P(· | ·, ·) : S×S×A→R>0 is a transition functionthat satisfies ∑s′∈S P(s′ | s,a) = 1 for all s ∈ S and a ∈ A; R(·, ·, ·) : S×A× S→ Ris an immediate reward (cost) function for all s ∈ S and a ∈ A. Some researchersalso use the form of (S,A,P,R,TC) where TC : S→ R is a terminal cost functiondenoting an end of an MDP, to especially express the impact of the terminal state.The transition function P and the reward function R together define the model ofthe MDP. The basic MDP scheme is shown in Fig. 2.1. Based on the status of Pand R, the MDP problems can be further divided into subtypes. It is termed a a de-terministic MDP if P or R is deterministic, otherwise it is called a stochastic MDP(e.g. P or R is a probabilistic function). In addition, it is termed a model-free MDPif P or R is unknown in advance. Conversely, it is called model-based MDP if bothP and R are known in advance. The fundamental property of the MDP assumesthat the current state s possesses all the information of the historical states, whichimplies that all of the future states are determined only by the current state, and notusing the historical states. This assumption plays a significant role when applyingthe MDP. The following introduces the core components of an MDP.Policies A policy pi in an MDP acts as an mapping between a state s ∈ S and anaction a ∈ A. The mathematical form of deterministic policy pi can be defined aspi : S→ A. It also has the stochastic form defined as pi : S×A→ [0,1] such thatpi(s,a)≥ 0 and ∑a∈Api(s,a) = 1 for each s∈ S. In the present research, policies areassumed to be deterministic. Given a policy pi , the MDP works in the followingway: 1. Get an initial state s0 using either a random mechanism or a predefinedfunction (e.g., initial state distribution); 2. Obtain the responded action a0 = pi(s0)according to pi; 3. With the obtained s0 and a0, predict the next state s1 and theassociated reward r0 relying on the transition function T (s0,a0,s1) and the rewardfunction R(s0,a0,s1); 4. Previous steps continue to get a sequence of state-action-reward process in the form s0,a0,r0,s1,a1,r1.... If the state-action-reward processends in a terminal state sgoal and is restarted in a new random state s0, the processfrom s0 to sgoal is called one episode associated to one pi . The objective of learningin an MDP is to find an optimal policy pi∗ which is able to gather the largest rewardin an episode.19Optimality The optimality in an MDP is expressed by optimizing the gatheredrewards. There are three models of gathering rewards, as given by:E[h∑t=0rt ] (2.1)E[h∑t=0γ trt ] (2.2)limh→ ∞E[1hh∑t=0rt ] (2.3)Equation 2.1 is the finite horizon model which takes rewards of a finite horizon oflength h. Here pi∗ should optimize its expected rewards over this horizon and yieldh− steps of optimal action. The robot can either take all of the h actions or justchoose the first certain amount of actions. The latter case is known as recedinghorizon control (RHC) or model predict control (MPC) in the context of optimalcontrol problems. The problem that exits in this model is that the choice of thehorizon length depends on the specific problem. Equation 2.2 represents the in-finite horizon model, which takes long-run rewards into account. The parameterγ ∈ [0,1) denotes the discount factor, which determines how many future rewardswill be received. With the exponential form of γ , the further away a future rewardis received, the lower the weight this future reward occupies in the entire set ofrewards. Such design accords with real-world experience, since farther future re-wards come with more uncertainty and hence their weights should be lower thanthe near future rewards. When γ = 0, only the immediate reward would be takeninto account at the current state. In addition, even with infinite horizon, the sumof the gathered rewards is still finite since γ ∈ [0,1) mathematically guaranteesthis property mathematically. Hence, many algorithms prefer to using this model.Equation 2.3 represents average-reward model, which is used when only the en-tire reward is the concern. Obviously this model cannot distinguish between twopolicies that have different rewards in different steps, because all the differencesare averaged into the result. Therefore, which model should be chosen depends onthe specific learning problem. If the length is known, the finite-horizon model isthe best. Otherwise if the task continues infinitely and the associated episode has20no clear terminal goal, the infinite-horizon model would be more suitable. Moredetails about choosing an optimal model are found in [54].Value Functions Value functions in an MDP play the role of connecting the opti-mality models to the policies. A value function estimates how good the policy is interms of the number of the expected rewards, which is termed as the return, the pol-icy gathers at state s. The value function of a state s under policy pi , represented asV pi(s), denotes the expected return when starting in state s and following pi there-after. Using the infinite-horizon model of optimality in the present dissertation, thefull form of the value function is expressed as:V pi(st) = Epi{∞∑i=0γ irt+i | st = s} (2.4)Without of loss generality, the expectation symbol E is used in the above expressionand in later sections to handle the stochastic MDP. However, E can be disregardedif only deterministic MDP is considered.Sometimes an analogous state-action value function termed as Q-function maybe defined as:Qpi(st ,at) = Epi{∞∑i=0γ irt+i | st = s,at = a} (2.5)where Q : S×A→ R. With Q-function, the state s ∈ S and action a ∈ A can bechosen without knowing the transition function T , so that it is suitable for model-free approaches.Value function 2.4 possesses a fundamental recursive property, and hence theso-called Bellman equation can be deduced as follows:V pi(st) = Epi{∞∑i=0γ irt+i | st = s}= Epi{rt + γrt+1+ γ2rt+2+ ... | st = s}= Epi{rt + γV pi(st+1) | st = s}= ∑st+1P(st ,pi(st),st+1)(R(st ,pi(st),st+1)+ γV pi(st+1)).(2.6)It denotes that an value function consists of the immediate reward, and the value of21the possible next state weighted by a discount factor. In addition, the reward-next-state set may be multiple due to transition probabilities. Each of the reward-next-state sets is weighted by the associated transition probabilities. The optimal valuefunction V ∗(st) can be obtained by solving the following optimization problem:V ∗(st) = maxa∈A ∑st+1∈SP(st ,pi(st),st+1)(R(st ,pi(st),st+1)+ γV ∗(st+1)). (2.7)The associated optimal policy can be obtained by:pi∗(st) = argmaxa∈A ∑st+1∈SP(st ,pi(st),st+1)(R(st ,pi(st),st+1)+ γV ∗(st+1)) . (2.8)As in Equation 2.5, an analogous Q-function version of the Bellman equation andthe related optimal policy are defined as:Q∗(st ,at) = maxa∈A ∑st+1∈SP(st ,pi(st),st+1)(R(st ,pi(st),st+1)+ γQ∗(st+1),at+1),(2.9)pi∗(st) = argmaxa∈AQ∗(st ,at) (2.10)Policy Q∗ and V ∗ is called a greedy policy because it selects the best action ac-cording to the maximum mechanism. The relation between Q∗ and V ∗ is givenby:V ∗(st) = maxa∈AQ∗(st ,at) (2.11)2.2.2 Dynamic Programming: Model-based Method for SolvingMDPOnce all of the components of the MDP have been defined, explicitly solving theoptimal policy from the value functions becomes the next step. One method of so-lution uses dynamic programming (DP). It represents a class of algorithms that iscapable of computing optimal policies given a perfect model (P and R) of the MDPproblem. Classical DP algorithms have limited application in reinforcement learn-ing due to the requirement of a perfect model and its heavy computation. However,DP serves as an essential foundation since such assumption leads to good proper-ties in terms of mathematical deduction. Hence, most other methods can be viewed22as attempts to achieve similar results as DP by transferring the original conditionsto the perfect-model assumption, with some added noise. Also, because of suchmathematical benefits, DP has been widely used in the optimal control domain,where the state vector is denoted by the set of x = {x1,x2, ...,xN} and the inputvector by the action set u = {u1,u2, ...,uN}, in order to differentiate from the plan-ning problems. This situation will be encountered in Section 4.3 where an optimalmotion controller for the motion planning function of autonomous navigation isproposed.Two typical DP algorithms are policy iteration [55] and value iteration[56],which provide two fundamental structures for various variants and combinations.They are discussed now.Policy IterationPolicy iteration includes two phases: policy evaluation and policy improvement.These two phases iterate to a convergent result (optimal policy) that satisfies certainthreshold conditions, which are used for terminating the iteration process. Fig. 2.2shows the basic scheme of the entire iteration process.Figure 2.2: Basic MDP scheme for modeling the environment where robotlearning are applied into.23Policy Evaluation This phase computers the value function of the k-th iterationV pik (st) for each state st ∈ S in the current policy pi , based on Equation (2.6). In eachiteration as k goes to infinity, the old value for the state st is replaced until conver-gence, based on the expected value of possible successor states. This sequence ofupdating V pik (st) is termed backup operation. By defining a backup operator Bpiover the value functions, the policy evaluation phase can be expressed as:(BpiV )(st) = ∑st+1P(st ,pi(st),st+1)(R(st ,pi(st),st+1)+ γV pi(st+1)). (2.12)Policy Improvement This phase improves the current policy with respect to thenewly available value functions obtained from the policy evaluation phase. Oncethe value functions for the states are known, the next step will determine if thereis an action at ∈ A that can improve the value function of the associated state st byusing:Qpi(st ,at) = Epi{rt + γV pi(st+1) | st ,at}= ∑st+1∈SP(st ,pi(st),st+1)(R(st ,pi(st),st+1)+ γV pi(st+1),at+1)(2.13)If now V pi(st) is less than Qpi(st) for at ∈ A, then it means there is a better policyfor this particular state. Similarly, the entire original policy can be improved bygreedily selecting the best action in each state as follows:pi ′(st) = argmaxa∈AQpi(st ,at)= argmaxa∈AEpi{rt + γV pi(st+1) | st ,at}= argmaxa∈A ∑st+1∈SP(st ,pi(st),st+1)(R(st ,pi(st),st+1)+ γV ∗(st+1)) .(2.14)The iterative procedure of these two phases generates a sequence of alternatingpolicies and value functions. The complete algorithm of policy iteration is given inAlgorithm 1.24Algorithm 1 The Policy Iteration AlgorithmInput: initial policy pi(st) and value function V pi(st) for all st ∈ S; iterationthreshold σOutput: optimal policy pi∗(st)1: Policy Evaluation2: repeat3: δ := 04: for each step of episode t do5: for each st ∈ S do6: v :=V pi(st)7: V (st) := ∑st+1∈S P(st ,pi(st),st+1)(R(st ,pi(st),st+1)+ γV (st+1))8: δ = max(δ , | v−V (s) |)9: end for10: end for11: until δ < σ12: Policy Improvement13: policy-state:=true14: for each step of episode t do15: for each st ∈ S do16: b := pi(st)17: pi(st) := argmaxa∈AV (st)18: if b 6= pi(st) then19: policy-state:=false20: end if21: end for22: end for23: if policy-stable then24: pi∗ = pi25: Stop Iteration26: else27: go to Policy Evaluation28: end ifValue IterationOne disadvantage of the policy iteration algorithm 1 is that the policy evaluationphase can provide a useful value function only when it satisfies the convergence25Algorithm 2 The Value Iteration AlgorithmInput: initial policy pi(st) and value function V pi(st) for all st ∈ S; iterationthreshold σOutput: optimal policy pi∗(st)1: repeat2: δ := 03: for each step of episode t do4: for each st ∈ S do5: v :=V pi(st)6: V (st) := maxa∈A∑st+1∈S P(st ,pi(st),st+1)(R(st ,pi(st),st+1)+ γV (st+1))7: δ = max(δ , | v−V (s) |)8: end for9: end for10: until δ < σ11: pi∗(st) = argmaxa∈A∑st+1∈S P(st ,pi(st),st+1)(R(st ,pi(st),st+1)+ γV∗(st+1))condition δ < σ in each iteration, which may cost significant computation. How-ever, in some cases, the policy evaluation phase does not need to exactly convergein each iteration. Then, truncating it to save computational time will not hurt theconvergence guarantee of the policy iteration[57].One traditional approach is to letthe policy evaluation stop after only one step backup operation, and then executethe policy improvement. This algorithm is called value iteration, which may beexpressed as follows:Vk+1(st) = maxat∈AE{rt+1+ γV pik (st+1) | st ,at}= maxat∈A ∑st+1∈SP(st ,pi(st),st+1)(R(st ,pi(st),st+1)+ γVk(st+1),at+1).(2.15)The complete value iteration algorithm is given in Algorithm 2.2.2.3 Reinforcement Learning: Model-free Method for Solving MDPRL is a class of algorithms that obtain an optimal policy for problems where amodel of MDP is not available in advance. The underlining objective in RL con-26centrates on finding the estimated model for MDP. Due to the lack of a model, RLneeds to sample and explore the MDP to acquire experience (statistical knowledge)about the unknown model, in order to establish the estimated value functions, asin DP methods. Basically the RL algorithms for MDP can be divided into twocategories in terms of how to deal with the employed policy during learning: off-policy methods and on-policy methods. In off-policy methods, the policy used forgenerating behaviors (actions) is independent of the policy that is evaluated andimproved. On the contrary, on-policy methods attempt to evaluate and improve thesame policy that is used for generating behaviors. As a result, exploration must bebuilt into the policy to find a possible improvement, and the speed of the policy im-provements should be fast. These two different aspects usually oppose each other;specifically, strengthening one aspect will weaken the other aspect. This is calledexploration-exploitation trade-off problem. The most basic exploration strategy isthe ε-greedy method in which the best action is taken at probability (1−ε) and theother action is randomly selected at probability ε . Many other exploration meth-ods can be found in[58, 59]. In the following sections, two types of RL methods,Monte Carlo (MC) methods and temporal difference learning (TDL) methods, areintroduced.Monte Carlo MethodsMC comes from the law of large numbers in random theory; namely, in the limit,the expected value of a set of samples is equal to the average value (mean). That’sthe reason why MC is a class of algorithms solving RL that rely on averaging sam-ple returns. It samples sequences of states, actions, and rewards from interactionwith either actual or simulated environment to acquire an estimated model insteadof a real model. Then the estimated model provides approximate transition andreward functions that are required by model-based methods. In addition, to guar-antee satisfying principal conditions, the experience of MC is learned episode byepisode rather than step by step. This means the value functions and the policiesget updated only after one episode is finished, with the assumption that all episodeseventually reach the terminal state in the limit no matter what actions are selected.Another property of MC is that it prefers to estimate the state-action value27functions Qpi(s,a) under one policy pi other than state value functions V (s), sinceV (s) alone is not sufficient to provide sample information for estimating the tran-sition and reward functions when a model is not available. Therefore, all of thefollowing MC methods use action value functions Qpi(s,a). Suppose that a givenset of episodes is sampled by following pi and passing through state s. Each occur-rence of state s in an episode is called a visit to s. If the estimated Qpi(s,a) comesfrom the average of the returns following all the visits to s in a set of episodes, it istermed the every-visit MC method. Similarly, first-visit method uses the average ofreturns following just the first visit to s in a set of episodes. The first-visit methodhas been widely researched and hence the following section uses it to express MCmethods.On-Policy MC The overall scheme of on-policy MC is similar to DP methods.It also includes policy evaluation and policy improvement. As introduced before,on-policy MC improve the policy that is currently used for generating actions. Anε−greedy policy exploration is built into the process of learning the optimal policy.An example of on-policy MC algorithm is given in Algorithm 3.Off-Policy MC Off-policy MC distinguishes from on-policy MC in that the pol-icy used to generate actions is not the one that is used for evaluation, as given inAlgorithm 4.Temporal Difference LearningTDL is a class of algorithms combining the advantages of both DP and MC. Itcan learn the experience directly from interaction with the environment without amodel of the environment like MC does, as well as update estimates by using partof other learned estimates without waiting for an entire episode to end like DP does.With all of these advantages, TDL has become an important breakthrough in RLbecause it can be applied to a wide situations such as the case where the episodesare very long or where there is no episodes at all. TDL update the estimates inan incremental manner based on other estimates, so that each step of update willgenerate a useful experience. The simplest TDL method TD(0) is given as an28Algorithm 3 The On-Policy Monte Carlo AlgorithmInput: for all st ∈ S,at ∈ A(s): initial policy pi(s) with ε − greedy := arbitrary;initial state-action value function Q(s,a):=arbitrary; Retures(s,a):=empty list;iteration threshold σOutput: optimal policy pi∗(s)1: repeat2: Generate an episode using pi3: for each step t = 0,1,2, ...,T of the generated episode do4: for each state-action pair st ,at appearing in each episode do5: R := the return following the first-visit MC6: Append R to Retures(s,a) (not (st ,at), since the (st ,at) may be samein different t )7: q := Q(s,a)8: Q(s,a) := average(Retures(s,a))9: δ = max(δ , | q−Q(s,a) |)10: end for11: for each state s appearing in each episode do12: a∗ = argmaxa Q(s,a)13: for all a ∈ A(s):ε−greedy :={1− ε+ ε/|A(s) | i f a = a∗ε/|A(s) | i f a 6= a∗14: end for15: end for16: until δ < σ17: pi∗(s) = pi(s)example:Vst =Vst +α[rt+1+ γV (st+1)−Vst ], (2.16)where α is a constant step-size parameter denoting the learning speed and γ denotesthe discount factor of the future estimate. From Equation (2.16), it is noted thatTD(0) updates the estimated value function V(st) only until the next time step t+1and by using the observed reward rt+1 and the estimated value function of nextstate V(st+1). In MC, the return Rt , which can be obtained only when one episodeends, is used for the value function, whereas the observed reward rt+1 is used inTD(0). Hence the target for the TD update is rt+1+ γVt(st+1) instead of Rt in MC.29Algorithm 4 The Off-Policy Monte Carlo AlgorithmInput: for all st ∈ S,at ∈ A(s):initial policy pi(s):= arbitrary;initial state-action value function Q(s,a):=arbitrary;Nu(s,a):=0; Numerator of Q(s,a)De(s,a):=0; Denominator of Q(s,a)iteration threshold σOutput: optimal policy pi∗(s)1: repeat2: for each step t = 0,1, ...,T of the generated episode do3: Generate an episode {s0,a0,r1,s1,a1,r2, ...,sT−1,aT−1,rT ,st} by select-ing a policy pi ′4: τ=last time at which aτ 6= pi(sτ)5: for each state-action pair st ,at appearing in each episode do6: t=the time of first visit (after τ) of (s,a)7: ω :=∏T−1k=t+11pi ′(sk,ak)8: Nu(s,a) := Nu(s,a)+ωRt9: De(s,a) := De(s,a)+ω10: Q(s,a) := Nu(s,a)De(s,a)11: q := Q(s,a)12: δ = max(δ , | q−Q(s,a) |)13: end for14: for each state s appearing in each episode do15: pi(s) = argmaxa Q(s,a)16: end for17: end for18: until δ < σ19: pi∗(s) = pi(s)As discussed under MC, TDL also has the requirement to build approximatetransition and reward functions, and have to trade off exploration and exploitationby some policies like the ε − greedy exploration policy. Accordingly, the state-action value function Qpi of TDL also has similar challenges. All of these methodscan also be divided into two categories: on-policy methods and off-policy methods.In the following, an on-policy TDL, SARSA, and an off-policy TDL, Q-learning,are introduced.30Algorithm 5 The SARSA AlgorithmInput: for all st ∈ S,at ∈ A(s):initial state-action value function Q(s,a):=arbitrary;iteration threshold σOutput: optimal policy pi∗(s)1: repeat2: for each episode do3: Randomly choose a state s as starting state4: Choose a for s using policy derided from Q(s,a) based on ε − greedyexploration5: repeat6: for each step t = 0,1, ...,T of the generated episode do7: Choose at for st , observe reward rt and the next state st+18: Choose at+1 for st+1 using policy derived from Q(s,a) based on ε−greedy exploration9: Qst ,at = Qst ,at +α[rt+1+ γQ(st+1,at+1)−Qst ,at ]10: end for11: until st reaches the terminal state12: end for13: q := Q(s,a)14: δ = max(δ , | q−Q(s,a) |)15: until δ < σ16: pi∗(s) = argmaxa Q(s,a))SARSA: On-Policy TDL In SARSA methods, the policy for generating actions isthe same as the one to update evaluation. Each evaluation update occurs after everytransition from a state-action pair (st ,at) to next state-action pair (st+1,at+1), andreceives a feedback reward rt , as expressed in the following equation:Qst ,at = Qst ,at +α[rt+1+ γQ(st+1,at+1)−Qst ,at ], (2.17)Here st is not a terminal state. If state st+1 is terminal, then define Q(st+1,at+1) aszero. All of the elements of Equation 2.17 forms a quintuple of (st ,at ,rt ,st+1,at+1)that makes up a transition. This is why this algorithm is called SARSA. The com-plete algorithm is given in Algorithm 5.31Q-learning: An Off-Policy TDL method Q-learning is an effective method and asignificant development of off-policy TDL. It is different from SARSA in that ituses a build-in max operator over the function values of the next state-action pairwhen updating every transition. The core algorithm is given by:Q(st ,at)← Q(st ,at)+ α[rt(st ,at)+ γmaxaQ(st+1,at+1)−Q(st ,at)] (2.18)where the value of Q(st ,at) is termed Q value. Q-learning builds its experiencebase through establishing a Q table. An example is presented here to demonstratehow to establish a Q table.Suppose that a static collision-free roadmap has been obtained as shown inFig. 2.3. According to the definitions given in Section 3.3, each node representsone state s, and action a is represented by the arrows around each edge. The rewardr after executing an action will be 10 if it arrives at the goal state s11; otherwise,it will be 0 after executing one action. Suppose that the mobile robot is in states1, and it has four action options: go to state s1,s2,s4 and s5. If it chooses states5, the value of r is set equal to 0. It cannot go to state s7 because there is noedge between them, which means that the space between them is blocked either bymoving obstacles or by static obstacles. Let us express such a state diagram andthe instant reward values by the reward table, matrix R, given in Fig. 2.4.After building the reward matrix R, the next step is to build the Q table accord-ing to matrix R and set the Q-learning algorithm given by Equation (2.18). In thisdissertation, Q table is also expressed in the form of a matrix Q. We will start bysetting all the Q values to zero as shown in Fig. 2.5, and the initial state as state s1.According to the second row of matrix R there are four possibilities for the currentaction. By the strategy of uniform random selection among these four actions, weselect to go to state s2 without loss of generality. Suppose now that the robot isin state 2. Then according to the third row of matrix R, there are four possibleactions: go to state s1, s3,s5 or s6. The corresponded transition is:Q(1,2) = Q(1,2)+α {R(1,2)+ γmax[Q(2,1),Q(2,3),Q(2,5),Q(2,6)−Q(1,2)]Now the next state s2 becomes the current state. We repeat the above steps until the32Figure 2.3: Static collision-free roadmap for establishing Q table.robot arrives at the goal state s11. In this manner we have completed one episode.We begin the next episode with a randomly chosen initial state, say state s7, andupdate the matrix Q through similar calculation using Equation (2.18). Episodesare repeated until each Q value converges to a certain optimal value. Then wecan choose the optimal policy for starting state s0 to goal state s11 according tothe matrix Q. Fig. 2.6 and Fig. 2.7 show the results of the Q value after 5 and 50iterations, respectively. The value in the red square shows the updating processof the Q value of the state-action pair (s1,a2). Note that the converged values ofmatrix Q will not remain unchanged forever since it is initially obtained using astatic environment. The robot will update its Q value when interacting with thereal-time dynamic environment. The complete Q-learning algorithm is given inAlgorithm 6[60].33Figure 2.4: Reward matrix R (-1 means there is no edge between nodes, 0means an arrived node is not a goal node, 10 means an arrived node is agoal node).2.3 Support Vector Machine: A Popular Method ofSupervised LearningMany applications of robotic technologies require the solution of classificationproblems. Detecting people, for instance, is a binary classification problem, orsay yes-or-no classification to answer if the tested area contains people. SVM[61, 62] is among the best ”out-of-box” supervised learning algorithms to solveclassification problems. Given a training set of sample-label pairs S = {(xi,yi), i =1,2, ...,m} where xi ∈ Rn denotes the samples and y ∈ {1,−1}l denotes the classlabels, the SVM works out the optimal parameters w,b with respect to the trainingset S for the linear classifier defined as:hw,b(x) = g(wT x+b), (2.19)34Figure 2.5: Q table in the form of matrix Q at the initial iteration.where g(wT x+b) = 1 if wT x+b≥ 0 and g(wT x+b) =−1 if wT x+b< 0. There-fore, wT x+ b = 0 is called the decision boundary, or the separating hyperplane ifthe dimension of the feature x is higher than 2. Fig. 2.8 shows the basic scheme oflinear classifier.2.3.1 Geometric MarginsWith the definition of (w,b), the geometric margin of (w,b) with respect to eachtraining sample (xi,yi) is defined as the shortest distance from the sample (xi,yi)to the decision boundary wT x+b = 0. It is denoted by the symbol ηi as shown inFig. 2.9. In Fig. 2.9, point A represents a sample xi and point B is the cross pointgenerated by the orthogonal vector of the decision boundary which goes throughpoint A. Segment AB is hence the distance from point A to the decision boundary,35Figure 2.6: Q table in the form of matrix Q at the iteration 5.denoted as the geometric margin ηA.Since A is xi, point B can be given by xi−ηi× w‖w‖ . Note that point B lies onthe decision boundary. Hence wT(xi−ηi× w‖w‖)+b= 0, and the ηi can be solvedas:ηi =(w‖ w ‖)Txi+b‖ w ‖ . (2.20)In Equation (2.20), the solution of ηi is obtained for the case of a positive sampleA. Accordingly, the general solution of ηi is:ηi = yi((w‖ w ‖)Txi+b‖ w ‖). (2.21)The forgoing geometric margin is associated with one training sample. More-over, the geometric margin of (w,b) associated with the entire training set S is36Figure 2.7: Q table in the form of matrix Q at the iteration 50.Figure 2.8: The basic scheme of linear classifier.37Algorithm 6 The Q-learning AlgorithmInput: discount factor γ , learning rate αinitial state-action value function Q(s,a):=arbitrary, for all st ∈ S,at ∈ A(s);iteration threshold σOutput: optimal policy pi∗(s)1: repeat2: for each episode do3: Randomly choose a state s as the starting state4: repeat5: for each step t = 0,1, ...,T of the generated episode do6: Choose at for st using policy derived from Q(s,a) based on ε −greedy exploration7: Observe reward rt and the next state st+1, choose at+1 for st+18: Q(st ,at)← Q(st ,at)+ α[rt(st ,at)+ γmaxaQ(st+1,at+1)−Q(st ,at)]9: end for10: until st reaches the terminal state11: end for12: q := Q(s,a)13: δ = max(δ , | q−Q(s,a) |)14: until δ < σ15: pi∗(s) = argmaxa Q(s,a)defined as:η = mini=1,...,mηi. (2.22)Obviously, the bigger the η , the greater the gap between the positive training sam-ples and the negative samples, which reflects a very confident decision boundary.Therefore, the objective of the classifier is to maximize the geometry margin η .The entire process of working out a linear classifier now becomes how to solve thefollowing optimization problem:maxη ,w,bηs.t. yi(wT xi+b)≥ η , i = 1, ...,m‖ w ‖= 1.(2.23)38Figure 2.9: The basic scheme of geometric margins.Apparently the optimization problem (2.23)cannot be solved by ”off-the-shelf”methods, since the constraint ‖w ‖= 1 is non-convex, which is an intractable opti-mization problem. Therefore, functional margins are imported into (2.23) in orderto transfer (2.23) into a tractable optimization problem.2.3.2 Optimal Margin ClassifierTo aim at solving (2.23), the functional margin of (w,b) with respect to each train-ing sample, (xi,yi) is defined as:ηˆi = yi(wT xi+b) (2.24)The corresponded functional margin with respect to the training set S is defines as:ηˆ = mini=1,...,mηˆi. (2.25)By comparing with the form of the geometry margin (2.21) with the functionalmargin (2.24), η = ηˆ‖w‖ can be obtained. By integrating η =ηˆ‖w‖ into (2.23), aswell as noting that maximizing ηˆ‖w‖ =ηˆ‖w‖ is the same as minimizing ‖ w ‖2, a39more tractable optimization problem can be obtained:maxη ,w,b12‖ w ‖2s.t. yi(wT )xi+b)≥ 1, i = 1, ...,m(2.26)The optimization problem (2.26) has a transitional quadratic objective with onlylinear constraints, which can be solved easily by many commercial methods suchas the QP method. The solution gives the optimal classifier.2.3.3 Dual Problem and Support VectorsThe previous optimization problem (2.26) is a so-called primal problem. An as-sociated dual form of the problem will bring more benefits for solving the opti-mization problem (2.26). First the Lagrangian factor for (2.26) is constructed asfollows:L(w,b,α) =12‖ w ‖ −m∑i=1αi[yi(wT xi+b)−1], (2.27)where α = {αi, i = 1, ...,m} denotes the Lagrange multiplier. Then fix α , andminimize L(w,b,α) with respect to w and b separately:∇wL(w,b,α) = w−m∑i=1αiyixi = 0⇒ w =m∑i=1αiyixi (2.28)∂∂bL(w,b,α) =m∑i=1αiyi = 0 (2.29)Next combine Equations (2.27, 2.27, and 2.29), and consider the constraints αi ≥ 0together. The dual optimization problem can be given by:maxαW (α) =m∑i=1αi− 12m∑i, j=1yiy jαiα j〈xi,x j〉s.t. αi ≥ 0, i = 1, ...,mm∑i=1αiyi = 0,(2.30)40Obviously the form of the dual optimization problem (2.30) is convex so that it canbe solved using optimization methods. It also shows two significant advantages.First, it transfers solving the optimal (w,b) into solving only one parameter α .Once the optimal α∗ is solved, the optimal w∗ and b∗ can be obtained by substitut-ing α∗ back into Equations (2.28) and (2.29). More deeply, the experience on thesolution of α discloses that most of αi will be zero, with a few non-zero α whichrightly associate with the training samples that have the smallest geometric mar-gins (closest to the decision boundary). These samples are shown in Fig. 2.10 asthe points that lie on the dash lines parallel to the decision boundary. These pointsare term support vectors . By using support vectors, the size of the training sam-ples related to the solution of the optimal parameter will reduce significantly. Thesecond advantage arises in the form of < xi,x j > which is defined as kernel.Thedefinition of kernel helps to map the feature space of the training samples to thefeature space of higher dimension, where the nonlinear decision boundary in thelow dimension space may become linear. These two advantages are the main rea-sons why SVM has been quite popular in the research community of classification.Figure 2.10: An example of the concept of support vectors for a linear clas-sifier.41Chapter 3Robust Q-learning withRegime-Switching MarkovDecision Process for OptimalPath Planning3.1 IntroductionThis chapter presents a new path planning approach, which incorporates onlinereinforcement learning integrated with regime-switching Markov decision process(RSMDP) for a mobile robot moving in a dynamic environment. Modeling thesurrounding dynamic environment is the first challenge. To address this challenge,a novel framework based on the RSMDP scheme is introduced to represent a dy-namic environment. In addition, an online reinforcement learning approach is inte-grated into the RSMDP scheme to resolve the uncertainty in a model-free environ-ment, and probabilistic roadmap(PRM), a sample-based method, is used to resolvethe curse of dimensionality that arises with reinforcement learning when facing acontinuous space of state and action.The main contributions of the present chapter, in the context of the related pre-vious work [63, 64], are as follows. First, unlike [63, 64], which consider only42static obstacles, the present path planner is able to return a globally optimal path inthe presence of unknown moving obstacles, with regard to balancing the shortestpath and obstacle avoidance. Second, through PRM, both state space and controlspace can be constrained to a low-dimensional finite space. Third, and most impor-tantly, reinforcement learning is used in an online formation, using the concept ofregime-switching [65, 66], to represent the changing environment caused by mov-ing obstacles, where value iteration is robust to parameter changes. This appearsto be the first application of regime switching to solve path planning problems in adynamic and unstructured environment.3.2 RSMDPThe scenario of path planning problem described in Section 1.2.2 can be formu-lated as an Markov decision process(MDP) because it has the Markov property thatthe future state depends only on the current state and has no dependence on the paststates. An MDP is defined by a tuple with five elements: M = (S,A,P,R,TC)whereS is a set of states, A is a set of actions depending on S, P(· | ·, ·) : S×S×A→R>0is a transition probability function that satisfies ∑s′∈S P(s′ | s,a) = 1 for all s ∈ Sand a ∈ A, R(·, ·, ·) : S×A×S→ R is an immediate cost function for all s ∈ Sanda ∈ A, and TC : S→ R is a terminal cost function denoting the sign of an end ofan MDP. An absorbing state with zero cost is usually used in a path planning prob-lem . Whether or not a control policy pi : S→ A of one MDP is a good processis determined by its corresponding expected value function, which usually can beobtained by solving the Bellman equation given by:V pi(st) = Epi(∞∑i=0γ irt+i |st ) = Epi(rt +γ∞∑i=0γ irt+i+1 |st ) = Epi(rt +γV pi(st+1)) (3.1)where γ ∈ [0,1) is the step size that corresponds to the iteration rate of the Bellmanequation, and st+1 = δ (st ,at) is the transition function according to P(· | ·, ·) : S×S×A→ R>0. The goal is to find an optimal policy pi∗(s) = argmaxaV pi(s) thatminimizes (or maximizes, depending on specific definition of the cost function R)the expected value (3.1) for every initial state S0. In this section a sample-basedmethod is applied to overcome the curse of dimensionality. Specifically, the finite43horizon discount version of MDP where i< ∞, is used for this purpose.Next regime-switching is integrated into MDP to represent a dynamic environ-ment. From experience it is known that a home environment can change betweenstatic and dynamic states. A regime ψ is defined as the time/step period betweenthe last changes and the current changes. Therefore, the state, action and transitionprobability of MDP stay the same in one regime and vary from one regime to thenext. Consider a countable collection Ψ of changing regimes of cost-minimizingMDP problems. Each regime ψk ∈Ψ is associated with one period of static MDPgiven by one RSMDP Mψk = (Sψk ,Aψk ,Pψk ,Rψk ,TCψk), where k ∈ N denotes theindex of each discrete static time/step period of the changing environment. Con-sequently, the goal becomes finding the optimal policy piψk∗(s) = argmaxaV piψk(s)where V piψk(s)is given by:V piψk(sψk,t) = Eψk,pi(∞∑i=0γψkirψk,t+i∣∣sψk,t )= Eψk,pi(rψk,t + γψk∞∑i=0γ iψk rψk,t+i+1∣∣sψk,t )= Eψk,pi(rψk,t + γψkVpiψk(sψk,t+1))(3.2)It is seen that the optimal policy piψk ∗ (s)varies from one regime to another.Hence in RSMDP, the problem of tracking the optimal policypiψk ∗ (s) correspond-ing to each regimeψk is considered. The only assumptions that is needed forpiψk ∗ (s) is as follows:Assumption 1: For each regime we have ψk ∈Ψ,E[Tψ ] supE[ti], where Tψrepresents the duration of regime piψk ∗ (s), and ti represents the duration of eachiteration in the Bellman equation.Assumption 1 implies that the requirement of successfully converging to piψk ∗(s) means that the regime does not change too often when compared with the timeused for each iteration step in (3.2). This is satisfied in the practical scenario ofpath planning that is considered in the present work. Next, the way to express thepath planning problem by incorporating PRM into RSMDP is described.443.3 Probabilistic Roadmap for RSMDPA home environment is arguably unstructured. For example, furniture may be clut-tered and unorganized, and it is difficult to determine the structure of such furnitureusing sensors. Furthermore, this will impose a huge computational burden whenbuilding an accurate model to represent the environment. Sampling-based meth-ods have adequately resolved the problem of computational burden, because thesemethods rely on a collision-checking module instead of using an explicit represen-tation of the environment. PRM and its variants [6, 7], provide effective methodsof path planning that are sampling-based.PRM is a network of simple curve segments or arcs that meet at nodes. Eachnode corresponds to a configuration in the configuration space (C-space). Each arcbetween two nodes corresponds to a collision free path between two configurations.It comprises a preprocessing phase and a query phase. In the following, let Cdenote the robot’s C-space, C f the free C-space, N the node set, and E the edge set.First, initiate a graph R= (N,E) that is empty. The preprocessing phase constructsthe free C-space, giving the global picture, as shown in Fig. 3.1a. The query phase,shown in Fig. 3.1b, generates an optimal global collision-free path (bold line inFig. 3.1b) by connecting the start and goal nodes to the roadmap, where heuristicmethods are usually used (Q-learning is used in the present section). Details arefound in [6, 7].As defined, state set Sψk in RSMDP corresponds to the node set N in PRM, andaction set Aψk corresponds to the edge set E. If the system is continuous, index tdenotes the time interval; otherwise, index t denotes the step interval. In this sectiont is considered as the step interval without losing generality since the dynamicenvironment is formulated as a discrete RSMDP and each step is very short relativeto the entire C-space. Therefore, the action subset A¯ψk ∈ Aψk associated with a statesψk,i ∈ Sψk at step i under regime ψk, is countable and corresponds to those edgesconnecting to the associated node in PRM. Hence which action should be chosenat a certain state in the learning process is governed by a stochastic behavior dueto the unpredictable motion of the obstacles, although the available actions arecountable. The final goal of the present work is to find the optimal policy piψk∗(s)iteratively after the Q-learning process, as described in Section 3.4. Regime ψk45will not be changed unless the moving obstacles affect the current piψk∗(s). Fig. 3.2shows two common scenarios of regime change where one moving obstacle isdetected using some distance threshold. As it blocks the current optimal path, thetransition probability (although not known in advance) of the corresponding statesand actions will be changed so that the current regime ψk will be changed intothe next regime ψk+1. Then the Q-learning process will choose another availablepath according to the new regime. Sometimes, once the chosen path is blockedin the current regime, there may not exist an available path to choose from due tothe lack of sampled nodes in PRM. For example in Fig. 3.2c, an obstacle movesto block all possible paths to the goal node and stays there for a long time. Hereagain, PRM is imported to build a local roadmap around the robot and the movingobstacle, in order to determine a feasible path. In particular, as shown in Fig. 3.2d,(a) (b)Figure 3.1: The PRM process in the C-space: (a)Preprocessing stage; (b)Query state. Note:Polygons represent static and moving obstacles, andthe blank space represents the collision-free C-space. In order to repre-sent a robot as a point as it moves on the ground, the standard practice isto expand the obstacles corresponding to the size reduction of the robot,as shown by the bold sideline of polygons. A uniformly random sam-ple method is used to construct deterministic nodes in the free C-space.Then, such nodes are collision-free nodes. The roadmap is constructedusing the collision-free nodes.46first a semicircular local region is built centered on the location of the previousstate of the blocked edge. Its radius is calculated from the distance between thelocations of two states connected by the same blocked edge. Then, a roadmapis generated within this semicircle by the same PRM method as before. Clearlythe extra sampled nodes generated by the local roadmap will change the structureof the current PRM and consequently change the current regime ψk into the nextregime ψk+1.3.4 Path Planner with Online Q-learningPRM works well in a static environment, but it cannot adapt to a dynamic en-vironment where there are moving obstacles. Re-planning might be an intuitivealternative, in the presence of moving obstacles, but it would be impractical ingeneral. For example, a moving obstacle might rapidly change its position afterthe path planner re-calculates a path based on the previous sensor information ofthe moving obstacle, and that new position of the moving obstacle might still blockthe new path. Considering such problems, the present section incorporates the re-inforcement learning method, Q-learning, into the query phase of the PRM in theRSMDPformulation. In this manner, when the optimal path is blocked by a mov-ing obstacle, the path planner is able to quickly choose another optimal path, usingthe previous experience about the map, as determined by the Q function value.In the 1990s reinforcement learning (RL) was proposed to solve MDP prob-lems [5]. In RL, an agent learns its behavior through trial-and-error interactionswith a dynamic environment, while receiving rewards for good actions and penal-ties for bad actions. Specifically, the agent performs an action at in state st andreceives a real-valued reward rt = r(st ,at)∈ R from the environment. Through thisprocess, the agent learns an optimal policy pi∗(s) = argmaxaV pi(s) where V pi(s) isequal to 1, which maps the state set S into the action set A, and arrives at its nextstate st+1 = δ (st ,at). The policy should be able to maximize the cumulative rewardaccording to V pi(s).Q-learning is a popular version of off-policy reinforcement learning which,regardless of the policy being followed, always estimates the optimal Q-functionthat is defined as Q(st ,at) : S×A→ R. Q-learning has two main advantages when47(a) (b)(c) (d)Figure 3.2: Local roadmap generation: (a) an original path generated byPRM; (b) an alternative path selected if the original path is blocked;(c)(d) if no alternative path is available, a semicircle is used to define thescope where some new points will be generated to attach to the PRM inorder to find an alternative collision-free path .compared with other approaches of reinforcement learning. First, it does not re-quire a model of the environment, which is advantageous when dealing with anunknown environment. For example, in an unknown environment rt = r(st ,at) andst+1 = δ (st ,at) are nondeterministic functions. Then, r and s are initiated arbi-48trarily and the algorithm will eventually converge to the optimal Q*(s,a) value inview of its mathematical basis. Second, it is able to update the estimates using par-tially learned estimates without waiting for completing the entire episode, whichmeans it bootstraps. These aspects are discussed under MDP formulation. Thecore algorithm of Q-learning in RSMDP is given by:Qψk(st ,at)←Qψk(st ,at)+ γψk,t [rψk,t(st ,at)+αψk maxa Qψk(st+1,at+1)−Qψk(st ,at)](3.3)The optimal action policy pi∗ψk is given by:pi∗ψk(s) = argmaxa Q∗ψk(s,a) (3.4)There are two conditions that should be satisfied to guarantee the convergenceof Q-learning to optimal Q∗ψk(s,a) with probability one [11]. These are given now.Condition 1: All the state-action pairs Q(st ,at) are visited infinitely often asthe number of transitions approaches infinity.Condition 2: The step size γt should satisfy γψk,t > 0,∀k,∞∑t=0γψk,t = ∞,∞∑t=0γ2ψk,t < ∞ .Condition 1 is called exploration, which requires that Q-learning has nonzeroprobability of choosing any action when it also needs to exploit its current knowl-edge in order to perform well by selecting greedy actions in the current Q-function.A popular method to balance exploration with exploitation is the ε − greedy ap-proach:at ←an uni f orm random action in Aψk , with probability εka ∈ argmaxaQψk(st ,at) , with probability 1− εk(3.5)Condition 2 implies that the step size should meet the requirement of limt→0γψk,t =0. There is also a tradeoff problem when choosing γψk,t in regime ψk. In order forQ-learning to converge to optimal Q∗ψk(s,a) quickly, the step size γψk,t has to belarge; however, the step size γψk,t should be small in order to minimize the mag-nitude of the fluctuations of the Q-function within a given regime. In traditionalQ-learning, this tradeoff does not considerably affect the system since the speed is49usually adequate to solve related problems in a static situation. However, in obsta-cle avoidance in the RSMDP framework, the way how the Q-function convergesto the optimal value greatly affects the robot system. A large γψk,t is expected toproduce a fast convergence speed, but the high-magnitude fluctuations caused bysmall γψk,t will lead to an incorrect optimal Q-function, possibly causing the robotto collide with obstacles. At high speeds, safety should be given more attention.The strategy to achieve these performance requirements is to make the Q-functioniteration process robust to γψk,t . Then, accurate optimal value can be achievedat a satisfactory speed. Therefore, in the current work, by setting and resettingγψk,t = γ tmax, the path planner always selects the largest possible step size for thecurrent regime and makes it converge to zero within the same regime. But the stepsize is reset to the largest possible value again when the regime changes. In thisway, γψk,t is able to eventually converge to zero. But it is set to a large value in thebeginning of the iteration so that the Q-function iteration is robust to the changesin γψk,t , while having sufficient speed of converging to the new optimal Q-functionQ∗ψk(s,a) to adapt to the new regime ψk+1. When to reset γψk,t is critical in thepresent approach. In view of Assumption 1 in Section 3.4, the changing frequencyof the dynamic environment should not be very high although the moving obstaclemay always make the environment to change. To this end, it is assumed that theregime is changed only when the current path has been blocked by obstacles thatenter the robot’s dangerous area as defined by some threshold, rather than whenmoving obstacles change the PRM.The online path planner with Q-learning, which is used in the present section,chooses the optimal path according to the maximum Q value with respect to eachstate-action pair. The Q value of each state-action pair is obtained by taking into ac-count both the shortest path and obstacle avoidance in the cost function rψk,t(st ,at)defined by:rψk,t(st ,at) = ω f (st ,st+1)+(1−ω)h(st) (3.6)Here f (st ,st+1) = ‖st ,st+1 ‖δ denotes a distance function defined by the δ norm, ωis the weighting parameter used to balance f (st ,st+1), and h(st) denotes the rewardfunction according to moving obstacles.50Once an optimal path is obtained in the current regime, the robot begins tomove. When a moving obstacle blocks the recurrent optimal path, the step sizeresetting will be made and the path planner will chooses another available optimalpath by quickly converging to the new optimal policy. It is seen that although thestep size γψk,t changes every time when the regime changes, the learning rate withinthe new regime will be faster than in the previous regimes since Q-value for eachstate-action pair is saved as the knowledge for the new regime. That is the reasonwhy the present path planner is able to adapt to a dynamic environment.3.5 Simulation StudiesThis section presents simulation studies to validate the online path planner de-veloped for mobile robot. The system is simulated using open-source softwareOpenCV1.Figure 3.3: Simulated environment for the mobile robot in a 640×480 imageplane, in a learning process.Fig. 3.3 shows an image of a simulated environment obtained from a presumedglobal camera before the mobile robot is started. The white rectangles representstatic obstacles and the white ellipse represents a moving obstacle. The worldstate is made up of the pixel coordinates of the nodes of the roadmap within a640× 480 image plane, which is represented as s(x,y) where x = 0,1, . . . ,640;y = 0,1,2. . . ,480. After obtaining the collision-free states, the starting point and1http://opencv.org/51the goal point of the robot are set at fixed positions indicated by arrows. These twopoints are added to the collision-free roadmap in the same way that the roadmapis generated, except that loop connection is used instead of incremental connec-tion in order to provide more possibilities of path connection between the startingpoint and the goal point. The thin lines represent available paths and the thickerlines represent optimal paths obtained during the learning process. The rotationcoordination θo can be ignored here to illustrate path planning for the purpose ofsimplicity of simulation, because the mobile robot can first rotate to the correctdirection before it starts to move. The number close to each edge is the Q value foreach state-action pair corresponding to the edge. Euclidian distance is used by set-ting δ = 2,ω = 10/(640+480) in Equation (3.6) and setting the reward functionas follows:h(st) =R= 0; when the robot reaches the gaolR= -10; when the robot touches the obstacleR= -5; in any other situation.(3.7)The first simulation (shown in Fig. 3.4) tests how the Q value converges to itsFigure 3.4: History of Q-value with γmax = 0.8,0.5,0.2 and 100 sampledpoints generated in PRM.optimal value in terms of the possible maximum beginning value of the step size52Figure 3.5: Performance comparison between traditional Q-learning and ro-bust Q-learning with γmax = 0.98.when using the strategy γψk,t = γ tmax. The Q value of the edge close to the goalpoint is chosen under three beginning step-size values, γmax = 0.8,0.5,0.2(shownby the dotted line, star line and triangle line, respectively, in the figure). It is seenthat the smallest value γmax = 0.2 can bring up less intense fluctuations, which is inaccordance with Condition 2 in Section 3.4. Nevertheless, it is seen that the optimalQ-value and the learning rate do not change appreciably for different maximumbeginning values of the step size γmax. In this sense, the Q-value iteration processis robust for choosing γmax. Specifically, γmaxcan be chosen as high as possible inorder to obtain the fastest possible learning rate or the fastest convergence speed.Fig. 3.5 shows the performance of the iteration process when γmax is set at 0.98for robust Q-learning and traditional Q-learning. In Fig. 3.5, it is seen that bothrobust Q-learning (dotted line) and traditional Q-learning (solid line) are able tosuccessfully converge to a new optimal Q-value when the regime changes fromregime 1 to regime 2, caused by moving obstacles. Although traditional Q-learningis faster than robust Q-learning, robust Q-learning converges to the optimal Q-valuemore smoothly when compared with the traditional Q-learning. In addition, higherrewords (in Q-value) are obtained by robust Q-learning. These properties will helpthe robot to choose a more accurate optimal action if the optimal Q-values of theactions with respect to one state intersect with each other when the regime changes53quickly and there is not enough time for the Q-value iteration to converge to theoptimal value. Therefore, the safety can be guaranteed as much as possible at thecost of low speed of convergence. This low speed is fast enough in such conditionsand can be achieved using a modern computer.The second simulation (shown in Fig. 3.6) verifies that the present algorithm isable to successfully avoid both static and moving obstacles under the RSMDP androbust Q-learning framework. Keyboard controller is used to control the movingobstacle and make it move to block the obtained optimal path. With the cost func-tion as defined in Equation (3.6) and Equation (3.7), the simulated robot reaches thegoal point by choosing the shortest available path and avoiding obstacles, which isconsidered as regime 1. When the robot detects that the moving obstacle is block-ing its current optimal path, it quickly finds another optimal path by using thelearning experience, which is considered as regime 2. It is noted that although in-creasing the number of PRM nodes will generate more available paths, the timespent to learn a new optimal path will also increase. Hence, there is a tradeoffbetween the number of nodes and the time taken to avoid obstacles. In the presentcase, having 400 sampled nodes can provide the fastest speed to adapt to a dynamicenvironment.The online robust Q-learning method that was investigated in the present sec-tion is a behavior-based decision-making process. In this method, the robot con-tinuously observes the world states and selects the action having the optimal Qvalue among the possible actions in the current state, as given by the Q function.This is different from a traditional behavior-based system where the rule base ofbehavior is designed entirely by a human expert in advance. The rule base of Q-learning is learned autonomously when the robot interacts with its environmentduring the training process. The curse of dimensionality is a serious challenge inthis process because, in theory, an infinite number of iterations would be needed toguarantee convergence to the optimal value. The method proposed in the presentwork overcomes this problem by incorporating a PRM roadmap as the world statefor the robot. The safety is another challenge, when dealing with rapidly movingobstacles in a dynamic environment. The robust Q-learning in the present workguarantees smooth convergence for Q-value iteration so that a relatively accurateaction is chosen when the regime changes.54(a) (b)(c) (d)(e) (f)Figure 3.6: Obstacle avoidance in dynamic environment. (a), (c), (e) origi-nal optimal path in regime 1; (b), (d), (f) scenarios for regime 2 wheremoving obstacle is blocking the current path, hence choose another op-timal path. (a) and (b) correspond to 100 sampled points in PRM, (c)and (d) correspond to 400 sampled points, and (e) and (f) correspond to600 sampled points.55Chapter 4Extended Linear QuadraticRegulator Enhanced with PWAfor Motion Planning4.1 IntroductionThis chapter begins by reviewing linear-quadratic-regulator (LQR) with regard tocost-to-go function, and how to obtain the piecewise affine closed-form controllaw, piecewise affine-LQR (PWA-LQR), for constrained LQR. It is done by in-corporating quadratic programming (QP) in the traditional LQR. PWA-LQR alsobrings in the concept of cost-to-come function which uses a similar procedure asthe cost-to-go function, except that cost-to-come runs forward in time. PWA-LQRsmoothing combines these two functions to form a smoother that provides a se-quence of optimal states for the linear-quadratic control problem. Then PWA-LQRis extended (E) to PWA-ELQR which is used to deal with nonlinear-nonquadraticcases where an iteratively local approximation method is used.4.2 Optimal Motion Planning with ConstraintsIn the present section, the objective of the optimal motion planning is to find anoptimal control policy pit ∈ X→ U such that for the entire horizon 0 6 t 6 l the56selected input control ut = pit(xt) can (approximately) minimize a cost functiondefined by:cl(xl)+l−1∑t=0ct(xt ,ut), (4.1)where l denotes the horizon, cl ∈ X→ R denotes the final-step cost function andct(xt ,ut) ∈ X×U→ R represents the immediate cost function at time-step t.Based on the concept of cost function given by Equation(4.1), the cost-to-gofunction vt ∈ X→ R and the related optimal control policy pit at the current statext can be calculated by the backward recursion procedure:vl(xl) = cl(xl)vt(xt) = minut (ct(xt ,ut)+ vt+1(gt(xt ,ut))),pit(xt) = argminut (ct(xt ,ut)+ vt+1(gt(xt ,ut))),s.t. xmin 6 xt 6 xmax,umin 6 ut 6 umax,xt+1 = gt(xt ,ut),(4.2)According to the foregoing definition, the kinematic system (1.1) is nonlinearand the cost function (4.1) generally is nonquadratic. Hence linear and quadraticapproximation methods are needed for handling the problem. Basically, by deriv-ing Riccati equations when solving a quadratic optimization problem at time stept, a formula for minimizing the cost-to-go function can be obtained at time stept−1. However, optimizing a quadratic problem is solvable only when the Hessianis positive-(semi)definite. This rule must be followed to assure that all of the im-mediate cost functions have positive-(semi)definite Hessians when designing costfunctions. Specifically,∂ 2cl∂xl∂xl> 0,∂ 2ct∂ut∂ut> 0,[∂ 2ct∂xt∂xt∂ 2ct∂xt∂ut∂ 2ct∂ut∂xt∂ 2ct∂ut∂ut]> 0.(4.3)574.3 Piecewise Affine-ELQR4.3.1 Traditional Linear-Quadratic-RegulatorLQR is a prevalent controller that provides a closed-form optimal solution for thelinear-quadratic control problem[11]. The linear kinematics for all horizon 06 t 6l are defined as:xt+1 = gt(xt ,ut) = Atxt +Btut + ct , (4.4)where At ∈ Rn×n, Bt ∈ Rn×m, ct ∈ Rn, and both ql and pt are constant. Thequadratic immediate cost functions for all horizons 06 t 6 l are defined as:cl(xl) =12xTl Qlxl +xTl ql +ql,ct(xt ,ut) =12[xtut]T [Qt PTtPt Rt][xtut]+[xtut]T [qtrt]+ pt ,(4.5)whereQt ∈ Rn×n,Rt ∈ Rm×m,Pt ∈ Rm×n,qt ∈ Rn, and rt ∈ Rm, such that Ql > 0and Rt > 0 are positive-definite, and[Qt PTtPt Rt]> 0 is positive-semidefinite, inaccordance with Equation (4.3).There are two attractive features in the linear-quadratic control problem. Thefirst one is that the corresponding cost-to-go functions have explicit quadratic for-mulation as follows:vl(xl) =12xTl Slxl +xTl sl + sl,vt(xt) =12xTt Stxt +xTt st + st ,(4.6)where sl ,st are constant, St ∈ Rn×n > 0, and st ∈ Rn. For final-step l, Sl = Ql ,sl = ql and sl = ql . By combining Equations (4.2), (4.4), and (4.5), we can obtain:vt(xt) = minut12[xtut]T [Dt CTtCt Et][xtut]+[xtut]T [dtet]+ pt , (4.7)58where:Ct = Pt +BTt St+1At ,Dt = Qt +ATt St+1At ,Et = Rt +BTt St+1Bt ,dt = qt +ATt st+1+ATt St+1ct ,et = rt +BTt st+1+BTt St+1ct .(4.8)and accordingly,St = Dt −CTt E−1t Ct , (4.9)st = dt −CTt E−1t et . (4.10)The second feature is that linear-quadratic control problems have a closed-formoptimal controller which is an explicit linear formulation in the feedback form:ut = pit(xt) = Ltxt + lt , (4.11)Lt =−E−1t Ct , lt =−E−1t et . (4.12)4.3.2 Piecewise Affine Feedback Control for Constrained ControlProblemsThe closed-form feedback control law (4.11) comes from the minimization of thequadratic cost-to-go function (4.6) with the assumption that none of the variableshave constraints. This can be considered as solving an unconstrained optimizationproblem. Even when there are natural constraints on variables, it may be reasonableto disregard them as their effects on the solution are negligible. However, in manypractical problems, the constraints play a critical role (like in the example of thepresent work). Therefore, the unconstrained problem has to be transformed into a59constrained one as follows, in accordance with Equation (4.6):vl(xl) =12xTl Slxl +xTl sl + sl,vt(xt) =12xTt Stxt +xTt st + st ,s.t.xmin 6 xt 6 xmax,umin 6 ut 6 umax,xt+1 = gt(xt ,ut) = Atxt +Btut + ct ,(4.13)Equation (4.13) is a typical constrained optimization problem which can be solvedby QP [67].An optimization problem with a quadratic objective function and linear con-straints is called a quadratic program. Clearly, Equation (4.13) belongs to inequality-constrained problems and satisfies first-order optimality conditions. Therefore, itcan be solved by applying Karush-Kuhn-Tucker (KKT) conditions, also known asfirst-order necessary conditions as follows:Step 1, Lagrangian function for Equation (4.13):£(xt ,λt) = vt(xt)+λ(1)t(Atxt +Btut + ct)+λ(2)t(xt −xmin)+λ(3)t(xmax−xt)+λ(4)t(ut −umin)+λ(5)t(umax−ut),(4.14)60Step 2, specifying KKT conditions for the optimal solution x∗:Ox£(x∗t ,λ ∗t ) = 0,x∗t −xmin ≥ 0,xmax−x∗t ≥ 0,ut −umin ≥ 0,umax−ut ≥ 0,λ(1)t ≥ 0,λ(2)t ≥ 0,λ(3)t ≥ 0,λ(4)t ≥ 0,λ(5)t ≥ 0,λ(1)t(Atx∗t +Btut + ct) = 0,λ(2)t(x∗t −xmin) = 0,λ(3)t(xmax−x∗t ) = 0,λ(4)t(ut −umin) = 0,λ(5)t(umax−ut) = 0.(4.15)Based on (4.14) and (4.15), the piecewise affine feedback control law for (4.13)can be obtained as follows:ut = pit(xt) = Litxt + lit ,xt ∈CRi (4.16)where CRi is the critical region to which xt belongs, according to the constraints onthe state and (or) the control input, and,Sit = Dt −CTt Lit , (4.17)sit = dt −CTt lit , (4.18)where Ct , Dt , Et , dt and et are similar as (4.8).There are a handful of effective methods to calculate CRi.They include active-61set methods which are suitable for small- and medium-sized problems[68], andinterior-point methods for large problems[69][70][71]. It is indicated in [69] that,in convex quadratic programming, interior-point methods are generally much fasteron large problems than active-set methods.QPs can always be solved (or shown to be infeasible) in a finite amount ofcomputation. But the effort required to find a solution depends strongly on thecharacteristics of the objective function and the number of inequality constraints.If the Hessian matrix G is positive semi-definite, Equation (4.13) is said to bea convex QP, and in this case the problem is often similar in difficulty to a linearprogram (strictly convex QPs are those in which G is positive definite). NonconvexQPs, in which G is an indefinite matrix, can be more challenging because they canhave several stationary points and local minima.”4.3.3 PWA-LQR SmoothingThe cost-to-go functions (4.2) is called backward LQR or PWA-LQR if there areconstraints. It means only the total future cost that will occur between stage t andstage l (inclusive of stage t and stage l) is considered. In this subsection the sim-ilar concepts, cost-to-come functions and forward LQR, are introduced. Then theforward LQR is extended to forward PWA-LQR. Finally the PWA-LQR smootheris obtained by combining backward PWA-LQR and forward PWA-LQR.Cost-to-come functions and Forward PWA-LQRCost-to-come functions, denoted by v¯t(xt), give the total past cost that occurred be-tween stage 0 and stage t (excluded). Given the inverse kinematics (1.4), v¯t(xt) andinverse control policy p¯it are defined by the following forward value iteration[17]:v¯0(x0) = 0v¯t+1(xt+1) = minut (ct(g¯t(xt+1,ut)),ut)+ v¯t(g¯t(xt+1,ut))),pit(xt+1) = argminut (ct(g¯t(xt ,ut))+ v¯t(g¯t(xt+1,ut))),s.t. xmin 6 xt 6 xmax,umin 6 ut 6 umax,xt = g¯t(xt+1,ut),(4.19)62The control input ut = p¯it(xt+1) drives state xt to state xt+1 with minimal cost-to-come cost. Assume that the inverse kinematics are linear, defined by:xt = g¯t(xt+1,ut) = A¯txt+1+ B¯tut + c¯t , (4.20)where A¯t = A−1t , B¯t = −A−1t Bt and c¯t = −A−1t ct . Combining (4.19), (4.20) thesame local cost functions as (4.5), an explicit quadratic formulation of global cost-to-come functions can be obtained:v¯0(x0) = 0v¯t(xt) =12xTt S¯txt +xTt s¯t + s¯t ,(4.21)Similar to (4.11) and (4.12), the feedback control law for forward LQR is:ut = p¯it(xt+1) = L¯txt+1+ l¯t (4.22)L¯t =−E¯−1t C¯t , l¯t = E¯−1t e¯t . (4.23)whereS¯t+1 = D¯t − C¯Tt L¯t , (4.24)s¯t+1 = d¯t − C¯Tt l¯t , (4.25)C¯t = B¯Tt (S¯t +Qt)A¯t +PtA¯t ,D¯t = A¯Tt (S¯t +Qt)A¯t ,E¯t = B¯Tt (S¯t +Qt)B¯t +Rt +PtB¯t + B¯Tt PTt ,d¯t = A¯Tt (s¯t +qt)+ A¯Tt (S¯t +Qt)c¯t ,e¯t = rt +Pt c¯t + B¯Tt (s¯t +qt)+ B¯Tt (S¯t +Qt)c¯t(4.26)Suppose that there are constraints on the state and control input in the cost-to-63go functions, given by:v¯0(x0) = 0v¯t(xt) =12xTt S¯itxt +xTt s¯it + s¯t , xt ∈CRis.t. xmin 6 xt 6 xmax,umin 6 ut 6 umax,xt = g¯t(xt+1,ut),(4.27)The similar KKT conditions (4.14)(4.15) should be added, and the correspondedpiecewise affine feedback control law can be obtained as:ut = p¯it(xt+1) = L¯itxt+1+ l¯it , xt+1 ∈CRi (4.28)whereS¯it+1 = D¯t − C¯Tt L¯it , (4.29)s¯it+1 = d¯t − C¯Tt l¯it , (4.30)C¯t = B¯Tt (S¯it +Qt)A¯t +PtA¯t ,D¯t = A¯Tt (S¯it +Qt)A¯t ,E¯t = B¯Tt (S¯it +Qt)B¯t +Rt +PtB¯t + B¯Tt PTt ,d¯t = A¯Tt (s¯it +qt)+ A¯Tt (S¯it +Qt)c¯t ,e¯t = rt +Pt c¯t + B¯Tt (s¯it +qt)+ B¯Tt (S¯it +Qt)c¯t(4.31)The recursive update Equations (4.25)-(4.27) run forward from stage 0 to staget. The overall procedure is referred as forward PWA-LQR.PWA-LQR SmootherExecuting both backward and forward PWA-LQR for a given constrained linear-quadratic control problem gives the cost-to-go functions vt and cost-to-come func-tions v¯t . The sum of vt and v¯t results in the total cost, denoted by vˆt , which accu-mulates the entire cost between stage 0 and stage l by a minimal-cost sequence of64states and controls that visits xt at stage t:vˆt(xt) = vt(xt)+ v¯t(xt) =12xTt (Sit + S¯it)xt +xTt (sit + s¯it)+ st . (4.32)Using xˆt to denote the state at stage t that minimizes the total-cost function v¯t , weobtain:xˆt =−(Sit + S¯it)−1(sit + s¯it). (4.33)The xˆt is the smoothed state, which results in the sequence of minimum-cost states{xˆ0, ..., xˆl} for the given linear-quadratic control problems.4.3.4 PWA-ELQR: Local Approximation for Nonliear-NonquadraticControl ProblemsThe foregoing discussion concerns the linear quadratic control problem. How-ever, most practical systems come under nonlinear, nonquadratic control prob-lems, such as that considered in the present work. Therefore, it is necessary toextend PWA-LQR (both forward and backward) to the case of nonlinear kinematicand nonquadratic local cost function. This is similar to extending the Kalman fil-ter to nonlinear systems. This subsection proposes the PWA-ELQR to deal withthe linearization and quadralization. These two approximation procedures are per-formed in the vicinity of the state and control input candidates for both backwardPWA-ELQR and forward PWA-ELQR. A critical challenge in this regard is howto choose candidates about which to linearize the kinematics and quadratize thelocal cost functions. In the present work, the smoother state xˆt and the correspond-ing control uˆt are selected as the candidates since they bring excellent convergencecharacteristic for the entire iterative structure of PWA-ELQR. The details of lin-earization and quadralization are introduced next.For the backward PWA-ELQR with 06 t 6 l, the recursion sequence of updat-ing cost-to-go functions proceeds from t = l− 1 until t− 1, with the current-bestapproximation of total-cost at stage t+1:xˆt+1 =−(Sit+1+ S¯it+1)−1(sit+1+ s¯it+1). (4.34)Using the transitional kinematic and control policy from the backward procedure,65as introduced in section 4.3.1, the state and control input candidates for stage t canbe obtained by:uˆt = p¯it(xˆt+1), xˆt = g¯t(xˆt+1, uˆt) (4.35)Then linearize the kinematic gt(xt ,ut) about uˆt , xˆt for (4.4), with:At =∂gt∂xt(xˆt , uˆt),Bt =∂gt∂ut(xˆt , uˆt),ct = xˆt+1−At xˆt −Bt uˆt ,(4.36)and quadratize the local cost function ct(xt),ut about uˆt , xˆt for (4.5), with:[Qt PTtPt Rt]=∂ 2ct∂[xtut]∂[xtut](xˆt , uˆt),[qtrt]=∂ct∂[xtut](xˆt , uˆt)−[Qt PTtPt Rt][xˆtuˆt] (4.37)Combining (4.36) and (4.37) with (4.17) and (4.18), the corresponding Sit , sit andcontrol policy pit can be calculated repeatedly until t = 0.For the forward PWA-ELQR with 06 t 6 l, the recursion sequence of updatingcost-to-come functions proceeds from t = 0 until t = t + 1, with the current-bestapproximation of total-cost at stage t:xˆt =−(Sit + S¯it)−1(sit + s¯it). (4.38)Using the inverse kinematics and the inverse control policy from the forward pro-cedure (introduced in section 4.3.3), the state and the control input candidates forstage t can be obtained by:uˆt = p¯it(xˆt), xˆt+1 = g¯t(xˆt , uˆt) (4.39)Take uˆt and xˆt+1 as the candidates about which to linearize the inverse kinematic66g¯t(xt ,ut) to get (4.20):A¯t =∂ g¯t∂xt+1(xˆt+1, uˆt), B¯t =∂ g¯t∂ut(xˆt+1, uˆt),c¯t = xˆt − A¯t xˆt+1− B¯t uˆt ,(4.40)Given the above matrices and vectors, the parameters for quadratizing the localcost function ct(xt),ut can be calculated in the same way as (4.37). Then S¯it , s¯itand the control policy p¯it can be calculated repeatedly up to t = t + 1 based on(4.22)-(4.26).The entire iterative algorithm of PWA-ELQR is summarized in Algorithm 7.From Algorithm 7, it is seen that PWA-ELQR is a locally optimal method dueto its requirement for linearization and quadralization around state and input can-didates. Therefore, explicitly considering constraints during the iterative processwill facilitate restriction of the searching region to the vicinity of the state and inputcandidates. This advantage of local approximation for nonlinear-nonquadratic con-trol problems helps to avoid divergence, which is caused by blindly following theoptimal principle, and hence increases the speed of convergence. The simulationgiven later will show this characteristic.4.4 Simulation StudiesIn this section, the proposed PWA-ELQR method is applied to the constrained mo-tion planning function of autonomous navigation for a two-wheeled differential-drive mobile robot, named SegPanda, as described in detail in chapter 6. SegPandais equipped with a Segway nonholonomic mobile robot, a Kinect camera to detectthe goal position and a Hokoyu laser ranger finder to detect obstacles (both staticand moving) in the environment. All the physical devices (mobile robot, drivers,sensors etc.) are operated by Robot Operating System (ROS, http://www.ros.org/),which is a popular open-source software platform for controlling many types ofrobots. A main advantage of ROS is that, it provides powerful simulation tools formotion planning and result visualization, and the simulated results can be trans-ferred to a physical experiment seamlessly.The PWA-ELQR is implemented in both real world environment as shown67(a) (b)Figure 4.1: (a) The real-world environment where SegPanda executes au-tonomous navigation; (b) The simplified simulation environment corre-sponding to (a).in Fig. 4.1a, and simplified simulation environment created by ROS shown inFig. 4.1b. With the help of ROS, the result from the simulated environment can betransferred conveniently to the corresponding real-world environment. In Fig. 4.1a,the circles represent obstacles (corresponding to boxes in Fig. 4.1b ) with radius of0.3 m.The simulated SegPanda, which is built to the same scale as the physical Seg-way mobile robot, has a radius of 0.3 m. The initial state is set to x∗0 = (0,0,0)and the goal state (the people position in Fig. 4.1a and the black cube position inFig. 4.1b) to x∗l = (2,2,2), where the number of steps is fixed at l = 60. PWA-ELQR need not be seeded with an initial trajectory. The algorithm runs until therelative improvement satisfies:oldcost−newcostnewcost< 10−4Three types of constraints are considered in the present control problem:Differential/kinematic constraints as (1.1);Control input constraints vl,vr ∈ [−0.75,0.75]m/s;State constraints [x,y,θ ]∈ [(−10m,10m),(−10m,10m),(−3.14rad,3.14rad)].Actually here the only state constraints is the maximum scope of the entire68Figure 4.2: Comparison of the optimal trajectories of the input control be-tween ELQR (does not consider constraints on control during motionplaning) and PWA-ELQR.69Figure 4.3: The corresponding comparison of the optimal trajectories of thestate between ELQR, which does not consider constraints on controlduring motion planing, and PWA-ELQR(Hence the state trajectory ofELQR can theoretically reach to the goal location).70environment. The obstacles can also bring constraints on states. However, in orderto improve convergence performance, a more efficient way is to consider the effectof the obstacles in cost functions defined as:cl =12(x−x∗l )T Q(x−x∗l ),c0 =12(x−x∗0)T Q(x−x∗0)+12(u−u∗)T R(u−u∗),ct =12(u−u∗)T R(u−u∗)+q∑iexp(−di(x)),where q ∈ R+ gives the weight the obstacle cost occupies in the entire local costfunction; u∗ = [0.25,0.25]m/s represents the nominal control input which is themoving speed we expect the SegPanda has during navigation; and the function−di(x) represents the signed distance between the SegPanda and the i′th obstaclein the environment. It is seen that the Hessian of the term q∑i exp(−di(x)) cannotbe guaranteed to be positive-semidefinite, hence its Hessian is regularized by com-puting the eigen decomposition and setting the negative eigenvalues to zero[72].With this setting, both ELQR and PWA-ELQR are tested in the environmentof Fig. 4.1. Fig. 4.2 shows the comparison of the optimal trajectories the in-put control between ELQR and PWA-ELQR. It is seen that the result of PWA-ELQR follows the constrains of vl,vr ∈ [−0.75,0.75]m/s rather accurately. Onthe other hand, ELQR just blindly follows the optimal principle without takingvl,vr ∈ [−0.75,0.75]m/s into account. Common sense tells us that SegPanda can-not realize the control input during execution. The corresponding comparison ofthe optimal trajectories of the state between ELQR and PWA-ELQR is shown inFig. 4.3 and in Fig. 4.4. In Fig. 4.4, it is noted that although ELQR results in asomewhat shorter path, it costs 142.97 and can not be executed because of controlconstraints. PWA-ELQR gives a competitive path that costs 140.77 and can be ex-ecuted successfully, since PWA-ELQR follows the control input constraints whenplanning motions.Another advantage of PWA-ELQR is that, in the present implementation, ittakes 8 iterations to achieve convergence, which is 3 fewer than that for ELQR.It is noted that constraints introduce extra calculation for each iteration of PWA-71Figure 4.4: Comparison of the optimal trajectories of the state betweenELQR and PWA-ELQR in a simplified simulation environment. Al-though ELQR results in a somewhat shorter path, it costs 142.97 andcan not be executed because of control constraints. PWA-ELQR costs140.77 and is executed successfully since it considers constraints duringmotion planning.ELQR. However, with modern computers (e.g., in our case Intel i5 3320 2.60GHz,4GB RAM), such differences can be ignored with regard to real-time control.72Algorithm 7 The PWA-ELQR Algorithm (proposed in the present work)Input: local cost functions:ct ,0 6 t 6 l; standard kinematic equation: (1.3) andinverse kinematic equation: (1.4); l: number of time stepsVariables: xˆt :smoothed states; pit : control policy; p¯it :inverse control policyvt :cost-to-go function;v¯t :cost-to-come functionOutput: pit for all t1: pit = 0,St = 0,st = 0,st = 02: repeat3: S¯t := 0, s¯t := 0,st := 04: for t = 0; t < l; t = t+1 do5: xˆt =−(Sit + S¯it)−1(sit + s¯it) (smoothed states following constraints)6: uˆt = pit(xˆt), xˆt+1 = g(xˆt , uˆt)7: Linearize Eq. (4.40) and Quadratize Eq. (4.37) about (xˆt+1, uˆt)8: Compute S¯t , s¯t , s¯t and p¯it of forward PWA-LQR based on Eqs. (4.22)-(4.25)9: if ut = p¯it(xˆt+1) ∈ [umin,umax] then10: S¯it = S¯t , s¯it = s¯t , s¯it = s¯t , p¯it11: else12: Re-compute S¯it , s¯it , s¯it and p¯it of constrained forward PWA-LQR basedon Eqs. (4.28)-(4.30)13: end if14: end for15: Quadratize cl about xˆl in the form of Eq. (4.37)16: Sl = Ql , sl = ql and sl = ql17: for t = l−1; t ≥ 0; t = t−1 do18: xˆt+1 = −(Sit+1 + S¯it+1)−1(sit+1 + s¯it+) (smoothed states following con-straints)19: uˆt = p¯it(xˆt+1), xˆt = g¯(xˆt+1, uˆt)20: Linearize Eq. (4.4) and Quadratize Eq. (4.37) about (xˆt , uˆt)21: Compute St ,st ,st and pit of backward PWA-LQR based on Eqs. (4.16)-(4.18)22: if ut = pit(xˆt) ∈ [umin,umax] then23: Sit = S¯t ,sit = st ,sit = st24: else25: Re-compute Sit ,sit ,sit and pit of constrained backward PWA-LQR basedon Eqs. (4.16)-(4.18)26: end if27: end for28: until Converged73Chapter 5People Detection-usingMKL-SVM5.1 IntroductionThe present chapter mainly focuses on finding a suitable people detector for ap-plication in autonomous navigation. Specifically, a learning scheme that integratesmultiple kernel learning (MKL) with support vector machine (SVM), leading toMKL-SVM, is designed in order to increase the precision of detection. Comparingwith a state-of-the-art classifier linear support vector machine (LSVM) [73], thedesigned MKL-SVM can map the features into a higher dimension space so thatthe linear assumption for data set can be released, while controlling the trainingtime within an acceptable range. An experiment based on open-source dataset,TUD-Brussels [74] is used to show that the proposed classifier achieves good per-formance.The performance of a sliding window-based detector is significantly affectedby the choice of both feature and classifier. For autonomous navigation, the mov-ing system and the dynamic environment require that the people detector is able toconsider features caused by movements. Therefore, the present work chooses to de-scribe the people feature in an image by combining static feature HOG(histogramsof orientated gradients) and motion feature HOF(histograms of optical flow) as in[35], but does not include color feature due to the specific application for real-time74detection that is treated in the dissertation. Then MKL-SVM classifier based onthe combined features is proposed and how to apply the learned classifier to detecta person in an image is explained.5.2 Combined Features: HOG-HOF5.2.1 Histogram of Oriented GradientsHOG, first proposed in [31], characterizes local object appearance with the distri-bution of local intensity gradients. HOG feature is a local description. The gradientinformation is quite suitable for depicting the edge features of the body of a person.Apart from preprocessing steps like illumination normalization and image graying,the process of extracting HOG feature has three main steps:Figure 5.1: The definition of cell and block for a sample image751. Define some area parameters by dividing the entire sample image I = height×width into cells, each of which consists of c× c pixels, and combining theneighbored e× e cells into one block, as shown in Fig. 5.1. Then calculatethe gradients at each pixel in the sample image I, as shown in Fig. 5.2. Ob-tain one gradient for a pixel by first calculating the vertical difference by thevertical filter [−1,0,1]T and calculating the horizontal different by the hor-izontal filter [−1,0,1]; then calculate the gradient magnitude image Im andthe gradient orientation image Io based on the vertical difference image Ixand the horizontal difference image Iy using the following relationships:Im =√I2x + I2yIo = arctanIyIx(5.1)Figure 5.2: Calculation of the gradients of the sample image.2. Compute the histogram of oriented gradients for each cell by voting orienta-tions Io in each cell through distributing the gradient orientation into b binsof the histogram. The bins are defined as the gradient orientation interval76with equal space. The present research uses an 8-bins histogram orientationbased on a 360-degree setting, as shown in Fig. 5.3. The magnitude of eachis obtained by the following equation:h(ci) = h(ci)+m× (1− o− cib )h(c j) = h(c j)+m× (o− cib )(5.2)where ci, c j are the two closest bin centers to the orientation of one pixelfor the orientation histogram of the cell which covers the pixel; b is the binsize of the orientation histogram; m is the magnitude of each orientation ofa pixel; h(ci) is the magnitude of each bin with center ci in the orientationhistogram, and the same h(c j) for c j. Note that ci and ci can be obtainedthrough interpolation (e.g., trilinear interpolation). After this, concatenatethe histograms of each cell within this block and normalize the value of eachhistogram to form the HOG feature for each block.Figure 5.3: 8-bin histogram orientation for one cell3. Slide the block along the sample image vertically and horizontally with stridesizes dxand dy. The sliding block is referred to as the HOG descriptor. Thensend each HOG descriptor into a trained SVM classifier which generates thepeople detection result for this block. If this block is positive, it is consideredas people area and is marked as people position.77Interpolation and normalization make the HOG representation robust to changesin lighting conditions and small variations in pose. The entire process is summa-rized in Fig. 5.4.Figure 5.4: Overview of extracting HOG feature5.2.2 Histograms of Optical FlowHOF has been proposed to analyze video content or camera images on movingequipment where the background moves as much as the people in the scene [32].The basic idea of HOF came from HOG, and their principles are almost sameexcept that HOF uses optical flow gradients instead of pixel intensity gradients.Further improvements can be done for making the HOF scheme more accurate.In the present work, an IMHdiff (Internal Motion Histograms Difference) schemeas in [32] is used for real-time application. The process of calculating HOF issummarized in Fig. 5.5.Figure 5.5: Overview of extracting HOF feature785.3 MKL-SVM ClassifierThe classification algorithm is a key element of the detection procedure in theMKL-SVM classifier. Its reliability, computational complexity and recognition ef-ficiency have a significant effect on the overall performance of the system. Twofamilies of classifiers, SVMs and AdaBoosts, provide the best performance, ac-cording to a recent evaluation [75]. Since AdaBoosts cannot achieve more compet-itive results [76], it is not considered here. Linear SVMs often perform well, butthe assumption that the dataset is linearly separable inevitably brings errors intothe detection. Therefore, in the present work, a multiple kernel learning SVM, thatis termed MKL-SVM, is proposed to increase the precision of the people detector.SVM uses an optimal separating hyperplane between classes by focusing onthe support vectors, which are training samples lying at the edge of the class distri-butions. The remaining training samples can be effectively discarded by this way.The principle of the SVM classifier is that only the training samples lying on theclass boundaries are considered for discrimination. A more detailed discussion ofSVM is found in [77]. In SVM, a feature map k(xi,xj) =〈Φ(xi),Φ(xj)〉, namedkernel function k, implicitly maps samples x into a high-dimension feature space.For kernel algorithms, the solution of the learning problem is of the form:f (x) =`∑i=1wmk(xi,x)+b (5.3)Since there are different kernel types such as radial basis function (RBF) ker-nel, polynomial kernel, and Hyperbolic tangent kernel, it is often unknown whatthe most suitable kernel is for the presented task. Hence the designer may com-bine several possible kernels. One problem with simply adding kernels is that theuse of uniform weights may not result in an optimal solution. For example, onekernel unrelated with the labels will just add noise and degrade the performance.Hence, using an optimized kernel function that could fit the data structure well is adesirable way. An intuitive thought is to use a group of optimally weighted kernelfunctions to fit the data structure. Therefore, combination of multiple features aswell as multiple kernels in SVM has been proposed in literature. The present workproposes the use of the following multiple kernel learning approach for people79detection.A useful way of optimizing kernel weights is MKL. In the MKL approach, aconvex formation of M kernels is used as follows:k(xi,xj) =M∑m=1dmk(xi,xj) (5.4)where dm ≥ 0,M∑m=1dm = 1, and M is the total number of kernels used. In themultiple kernel learning problem, for binary classification, assume that we havea training set of N samples {xi,yi}, i = 1,. . . ,N, yi ∈ {1,−1}, where xi is trans-lated via M mappings Φm(xi) 7→ RDm , m = 1, ...,M, from the input into M featurespaces (Φ1(xi), ...,ΦM(xi)), where Dm denotes the dimensionality of the m-th fea-ture space. We need to solve the following MKL primal problem [78], which isequivalent to the linear SVM when M =1:min12(M∑m=1dm‖wm‖2)2+CN∑i=1ξiw.r.t wm ∈ RDm ,ξ ∈ RN+,b ∈ Rsubject to yi(M∑m=1dm 〈wm,Φm(xi)〉+b)> 1−ξi∀i = 1, ...N(5.5)where wm is the normal vector to the hyperplane for the feature space Φm. In [78]80the MKL dual is derived for the problem (5.5) as given below:min γw.r.t γ ∈ R,α ∈ RNsubject to 06 α 6 1C,N∑i=1αiyi = 012N∑i, j=1αiα jyiy jkm(xi,xj)−N∑i=1αi︸ ︷︷ ︸=:Sm(α)6 γ∀m = 1, ...M(5.6)where α = (α1,α2, ...,αN), αi are Lagrange multipliers, and wm =N∑i=1αiyiΦm(xi).In order to solve (5.6), a weighted 2-norm regularization formulation has beenproposed in [79], in order to deal with the MKL problem. The work in [79] uses anadditional constraint on the weights so as to encourage sparse kernel combinations.This algorithm is called SimpleMKL, and it can converge rapidly at comparableefficiency. Therefore, SimpleMKL will be adopted as the MKL approach in thepresent work.5.4 Performance EvaluationIn this section, TUD-Brussels database is chosen as the evaluation data due to itsarguably realistic and challenging dataset, which provides motion pair images forthe HOF feature. The motion-pair training set contains 1776 annotated pedestriansrecorded from multiple viewpoints taken from a handheld camera in a pedestrianzone. These 1776 annotated pedestrians are positive samples cropped from the1092 positive image pairs. In the present example 300 positive samples are chosenamong the 1776 samples. TUD-Brussels also contains 192 negative image pairs,from which 300 negative samples with fixed [height width] size of [128 64] arechopped. Therefore, a total of 600 motion-pair samples are obtained from TUD-Brussels.The sizes of the sliding windows for both extracting histograms of oriented81gradients-histograms of optical flow (HOG-HOF) features and detecting on thetargeted image are set as [128 64]. To fit such sliding windows, positive samplesneed to be resampled at the same size of [128 64]. Within each sliding window,every block contains 2× 2 cells each of which having 8× 8 pixels. The stridesize for both sliding window and sliding block is 8. The bin size for histogramorientation is set at 9 within the scope of [0,2pi]. With these settings, each vectorof HOG-HOF feature contains 4608×2 components and the entire training datasethas a matrix size of 9261×600.For training the MKL-SVM and LSVM classifier, 60% of the 600 motion-pairsamples are selected randomly as the training subset. Specifically, all the left 40%samples are considered as the testing subset. With the duality gap of 0.01 for theconvergence threshold of both MKL-SVM and LSVM, Table 5.1 shows the testresults when using different features combined with different classifier.The first row gives the LSVM result, and the remaining rows concern MKL-SVM.It is seen that MKL-SVM has higher precision than LSVM. For MKL-SVM, twocommon nonlinear kernels, RBF kernel and polynomial kernel (Poly), are used.The tuned parameter for the RBF kernels is sigma, and for the Poly kernels it isdegree. Clearly RBF is able to describe the HOG-HOF better than other options.Three combined RBF kernels are enough to reach the highest precision in this eval-uation. Further increasing the kernel number cannot produce any improvement.Table 5.1: Comparisons of different kernels and corresponding resultsKernel Type Kernel Number Kernel Parameters PrecisionLinear SVM 94.24%RBF 10 [0.5 1 2 5 8 10 12 15 17 20] 96.68%RBF 5 [0.5 2 5 10 15 20] 96.68%RBF 3 [0.5 2 5] 96.68%RBF 2 [0.5 2 ] 74.27%RBF, Poly 8 [0.5 2 5 10 15 20],[1 2] 96.68%RBF, Poly 6 [0.5 2 5],[1 2 3] 96.27%RBF, Poly 4 [0.5 2],[1 2] 92.95%Choosing the best kernel combination according to Table 5.1, Fig. 5.6 shows82an example of the detection results given by the trained classifiers based on HOGfeatures and HOG-HOF features. It is seen that HOG-HOF features result in betterdetecting performance than with HOG features. The scores on the pictures indicatethe possibility that the bounding boxes contain humans. A score larger than 0 givesa positive result while a score less than 0 means a negative answer. The closer it isto 1, the higher the likelihood of people presence. Conversely, closer to 0 meansa lower likelihood. The chosen images in Fig. 5.6 show for a classifier a typicalcomplex scenario where there exist some advertising boards with people models orsome parts of the construction are similar to a people body profile. This can causesome false positive bounding boxes. Even the motion feature failed to distinguishthem from the presence of a real person.83(a)(b)Figure 5.6: (a) Detection results (light color bounding boxes) based onHOG-HOF feature using MKL-SVM; (b) Detection results (dark colorbounding boxes) based on HOG feature using MKL-SVM.84Chapter 6Implementation andExperimentation6.1 OverviewThis chapter presents the physical experimentation carried out in the Industrial Au-tomation Laboratory (IAL) of the University of British Columbia. The laboratorymimics a home environment, to implement and evaluate the performance of themethodologies developed in the present research. A prototype autonomous nav-igation system, named ”SegPanda”, has been assembled for the experimentationas shown in Fig. 6.1. It consists of three hardware components: a mobile base, aon-board 2D laser ranger finder and a 3D camera. The laser ranger finder is usedto build maps and detect obstacles around the robot. The 3D camera is used fordetecting people. All of the hardware operates under an open-source system calledrobot operating system (ROS)1 which provides rich libraries for a wide variety ofhardware. Fig. 6.2 shows an example of a simulated environment built by ROS,which also builds the modeled SegPanda. One big advantage of ROS is that theresults of simulations can be transferred to the physical SegPanda very easily.SegPanda’s task in this experiment is to realize the function of human-awareautonomous navigation in a home environment, as shown in Fig. 1.4. Specifically,1http://www.ros.org/85Figure 6.1: The prototype of the autonomous navigation system-SegPanda.Figure 6.2: An example of a simulated environment built by ROS.SegPanda needs to first recognize if there are people in front of it and to determinethe location of a detected person. Then taking the location of the detected people asthe goal location, a planned path is followed to reach that location, while avoidingobstacles simultaneously. The designed experiment does not limit itself to com-plete a common-sense human following function. In addition, it evaluates whetherthe proposed methodologies of path planning, motion planning and human detec-tion, each of which can be applied to many other robotic tasks, can perform well.That is more meaningful than just completing an autonomous navigation function.866.2 The Hardware System6.2.1 Segway Mobile BaseThe mobile base used in this experiment is a Segway RMP100. It is a typical non-holonomic two-wheel differential-drive mobile robot, which can be represented bya unicycle model as in Fig. 1.5b. The Segway mobile base has a build-in odometerwhich can provide wheel speed information, so that the speed and the position ofthe robot can be obtained accordingly by using ROS libraries. The related specifi-cations are found in Table 6.12.Table 6.1: Segway RMP100 SpecificationsSegway RMP100 SpecificationsTop Speed 10km/hTrack Width 46cmTire Diameter 41cmTurning Radius zero degreesTurning Envelope 64cmData Updated Rate 100HzMaximum Payload 68Kg6.2.2 3D Camera: Microsoft KinectKinect was originally developed as a motion sensing device for Microsoft Xbox360. Because of its relatively low price, it has become a popular experimental sen-sor in the robotics community. Kinect consists of four key components: infreredoptics, red-green-blue (RGB) camera, motorized tilt and multi-array microphone,as shown in Fig. 6.33. The two components related to this experiment are the RGBcamera and the infrared optics. The RGB camera provides a 2D image with richcolor information, and the infrared optics provides depth information. The com-2http://rmp.segway.com/3http://www.winbeta.org/news/company-behind-microsofts-kinect-sensor-sold-apple-345-million87bination of the RGB camera and the infrared optics can generate red-green-blue-depth (RGB-D) 3D data, which have been used in many robotics applications[80–82].Figure 6.3: The structure of Kinect components.The specifications of Kinect are listed in Table 6.24. It is seen that the depthspace range of 0.8m-4m is its main disadvantage. However, this range is goodenough to realize the experiment in the present work where the people are locatedin this range.Table 6.2: Microsoft Kinect SpecificationsKinect SpecificationsViewing Angle 43◦ vertical by 57◦ horizontalVertical Tilt Range 27◦Frame Rate (depth and color stream) 30 frames per second (FPS)Depth Space Range 0.8m-4m4http://msdn.microsoft.com/en-us/library/jj131033.aspx886.2.3 2D Laser Ranger Finder: Hokuyo UTM-30LXSegPanda uses a single UTM-30LX Hokuyo sensor to map the environment anddetect obstacles in a 2D map. The UTM30LX has a 30m and 270 degree scanningrange, which is ideal for mapping lab environments (usually fairly large rooms).Another advantage of this sensor is its low weights of only 370g. As a result itis a suitable sensor for small robots. The sensor is located about 20cm above theground, and hence only works for obstacles that can be seen at that height, suchas boxes and walls. But this setting usually can detect most obstacles in a homeenvironment. The green lines pointed by red arrows in Fig. 6.4 show an exampleof the scanning result of Hokuyo in the IAL lab environment. More details of itsspecifications are found in Table 6.35.Table 6.3: Hokuyo UTM-30LX SpecificationsSegway RMP100 SpecificationsDetection Range 0.1m to 30mAccuracy 0.1 to 10m:30mm; 10 to 30m:50mmAngular Resolution 0.25◦Scan Time 25msec/scanWeight Approx. 370g(with cable attachment)6.3 Software System6.3.1 People DetectionSegPanda detects people in front of it using a Microsoft Kinect 3D camera. AROS-wrapped driver, called OpenNI6, publishes information from the Kinect sen-sor into ROS, which is then analyzed using point cloud library (PCL) 1.7 7 func-tions to provide image and depth information. The people namespace in PCL isused to process the published RGB-D data. This algorithm is focused on speed5https://www.hokuyo-aut.jp/02sensor/07scanner/utm 30lx.html6http://wiki.ros.org/openni launch/7http://pointclouds.org89Figure 6.4: An example of the scanning result of Hokuyo in the IAL labora-tory environment. The green lines pointed using red arrows show thescanning result of Hokuyo.and real time processing without the use of a dedicated GPU. The people detec-tion is based on Euclidean clustering using multiple kernel learning-support vectormachine(MKL-SVM) parameters trained by the method in Section 5.3 for detec-tion of human bodies. If there are several people in an image, the clustering maygroup two people together into a single person. To resolve this problem, the Head-Based subclustering in ROS is borrowed to perform after Euclidean Clusteringto detect the number of heads and use that information to split groups of people90that were clustered together into separate people. The confidence of each person iscalculated using histograms of oriented gradients-histograms of optical flow(HOG-HOF). This algorithm has satisfactory performance in slow moving and stationarypeople who stand upright, which is executed as a ROS node. When the peopledetection ROS node is first run, it waits for point cloud data indefinitely until thefirst RGB-D data is acquired. When the first RGB-D data are acquired, they aredisplayed in the visualizer and the user is requested to select three floor pointsmanually to calculate where the floor in the 2D RGB image is. After processing,the people with HOG-HOF confidence above a threshold are published onto a vi-sualizer and also published as a custom ROS message. The ROS message that iseventually published is used to send as a navigation goal to make the robot followaround the person. Fig. 6.5 and Fig. 6.6 show the results of the people detection.Figure 6.5: The result of people detection in a one-person case.The Microsoft Kinect has a relatively narrow viewing angle and can only gen-erate relatively high-accuracy RGB-D data from a distance of 0.8 to 4 meters. Thislimited viewing range plus the fact that the algorithm can only process uprightpeople limits the possible places the person can stand to be seen by the peopledetection algorithm. The algorithm does not reliably work in all environments.In environments where random object orientations can appear like a person whentransformed into a point cloud, the algorithm will have trouble separating people91Figure 6.6: The result of people detection in a two-person case.from inanimate objects. Due to the user entered ground plane coordinate imple-mentation, the ground plane must always stay the same in the Kinect view frame.This means the Kinect cannot be translated along the z axis or rotated along the xand y axis without the people detection algorithm failing.6.3.2 Autonomous NavigationSegPanda relies on the ROS Navigation stack to combine all of the elements of theautonomous navigation system and make it to move. The overview structure of theROS Navigation stack is shown in Fig. 6.78.A goal is given to the navigation stack by either user input or another ROSnode configured to interact with the navigation stack. The navigation stack usesthe odometry of the mobile base and at least one sensor to localize itself on theglobal map created by the Gmapping package. It also uses the sensor to createa local map that updates quickly for real time object avoidance. The navigationalgorithm operates by using a global costmap and local costmap. Each costmap ismodeled as an occupancy grid with the local costmap having a higher resolution.Each grid square is assigned a probability for it to be occupied. The global costmap8http://wiki.ros.org/navigation/Tutorials/RobotSetup92Figure 6.7: The overview structure of the ROS navigation stack.is taken directly from the map created using Gmapping when the navigation stackfirst boots.Amcl package is used to localize the robot inside the global costmap, accordingto data from the Hokuyo sensor. The localization only works when the odometryis somewhat closely matched with the global map frame. The user is responsiblefor providing the navigation stack with a pose estimate of where the robot is whenthe navigation stack first starts. Once the navigation starts, Amcl localizes therobot automatically. This removes the reliance on perfect odometry and preventsSegPanda from drifting out of position on the global costmap as it navigates aroundthe environment. Odometry must still be relatively accurate for Amcl to work. Thetests specified in the mapping section for odometry must be satisfied for Amclnavigation as well.The navigation stack takes a goal input and calculates a global path based on themost current global costmap, by using the path planner presented in Chapter 3. Thegoal can be input into the navigation stack using RVIZ or a goal sending programthat conforms to the navigation API. Once a goal is received, the navigation stackcalculates a global path to the goal using the global costmap. The path is usuallythe shortest way to the goal that the robot considers to be safe. Then, based on theglobal path, the navigation stack creates a constantly updating local path that goesto the edge of the local costmap (5m x 5m). The local path is created through themotion planner proposed in Chapter 4 based on the local costmap, which attempts93to adhere generally to the global path but has a more accurate and smooth path. Thelocal path is what the navigation stack uses to send the robot velocity commands.Due to the fast update rate of the local path and local costmap, the robot is ableto navigate in dynamic environments, and able to avoid randomly slow movingobjects.The navigation package outputs to the mobile base driver by sending velocitycommands under the unicycle model. It sends a velocity command that consistsof a linear velocity with positive forward and an angular velocity with positivecounter-clockwise. The navigation stack is plugged into SegPanda’s tf transforms.Using, the odometry message and the base link message to various robot body linktransforms, the navigation stack is able to create an accurate 2D model of the robotinside the global map. The navigation stack takes into account the radius of therobot when navigating, instead of navigating with only a center point. This allowsit to solve navigation problems such as how far should the calculated path be awayfrom any obstacles and whether or not the robot can fit in between two obstacles.The local motion planner (local planner) and global planner can be configuredas plugins to ROS. SegPanda uses the previously mentioned global planner and thelocal planner for its navigation. The Eband local planner is the planner includedin the segbot ROS package that we took and modified for our SegPanda. As longas the navigation algorithm conforms to the ROS Navcore API, it can be used as aplugin for either global planners or local planners.The driver package for the mobile base of SegPanda is taken from the Seg-bot package available on the ROS Wiki9, and modified according to SegPanda’shardware. The odometry information is published by the driver and read by thenavigation stack for use in navigation and localization. The robot driver createsodometry information by keeping track of its motor rotations and how it wouldaffect the robot’s real world position. Since SegPanda is much heavier than theoriginal Segway50rmp, which the driver was written for, the odometry scale andmin-max speeds have been modified to suit SegPanda’s specifications.9http://wiki.ros.org/segbot946.4 Experimental ResultsThe overall scheme of the present experiment is shown in Fig. 6.8.Figure 6.8: The overall structure of the autonomous navigation systemSegPanda uses the dual navigation goal node to implement the goal sendingcapability. The navigation goal program connects the people detection programwith the ROS navigation stack. The people detection node publishes the people lo-cation information to a ROS topic. The navigation goal node sender subscribes tothat ROS topic and uses the data to determine the goal that is to be sent to the nav-igation stack. The dual navigation goal node in fact has two nodes each of whichsends navigation goals to the navigation stack. Each node works by receiving theposition of a person from the people detection node, transforming the data into aposition on the map and then sending it to the navigation stack as a goal. The pro-95gram waits until the robot has reached the goal before a new goal can be receivedand sent. Thus, an individual navigation node is not capable of changing the goalduring navigation. A property of the navigation stack is that it can be interruptedin the middle of navigating if a new goal is received from an input source. The im-plementation on SegPanda takes advantage of this property and has two navigationgoal nodes alternatively sending the location of the person as a goal at a frequencyof about 1 Hz. Since these two nodes interrupt each other continuously, the navi-gation goal will always be reset to the current position of the person, allowing therobot to follow a person in real time.In the event that more than one person is in view or no people are in view, therobot will idle and wait until it sees a person before moving. While idling, it canstill be controlled by manual navigation goals or teleoperation key controls. Sincethe people detection algorithm may interpret noise data as people while the robotis moving, a noise filtering algorithm is implemented to ensure that the robot onlystarts moving once it is confident that what it sees is a person. The Kinect musthave five consecutive frames where a person is in the RGB-D data before it sendsas a navigation command. This solves the problem when random noise makes oneor two frames see a fake person. However, this does not deal with the problem thatoccurs when the people detection node legitamately detects a non-person object asa person, which happens fairly often in uncontrolled environments.In addition, the navigation system that is used on SegPanda requires a map fileof the robot’s environment constructed beforehand. This map file is used as theglobal costmap for navigation, allowing the robot to self-localize using Amcl. Theused map is created using the ROS wrapper for OpenSlam’s Gmapping10, whichis a package available on ROS that is supported up to ROS Indigo. Odometryinformation published by SegPanda and at least one horizontally mounted sensor(Hokuyo Laser Ranger Finder in our case) are required for Gmapping to make amap of the environment.Once the global costmap is obtained, using the UTM-30LX hokuyo sensor andmost currently global costmap, SegPanda creates a fast updating local costmap ina 5× 5m square with itself located in the center (assuming rolling window is set10http://wiki.ros.org/gmapping96to true), as shown in Fig. 6.9. In Fig. 6.9, the blue area denotes the expansionof the obstacles according to the radius of SegPanda. In this way, the robot canbe considered as a single point to facilitate the calculation during planning. Theobstacles seen in the local costmap are saved to update to global costmap constantlyas the robot navigates around. The primary use of the local costmap is to recognizenon static objects (such as boxes or chairs) and navigate around them. The updatesto the global costmap are not saved in the map file, so each time the navigationstack boots up, a new instance of the global costmap is created.Figure 6.9: An example of the local costmap used for autonomous naviga-tion.97Now all the preliminaries for the core navigation algorithm have been devel-oped and presented. First the global planner provides, based on the current globalcostmap, a global path which is able to avoid static obstacles. But the global pathdoes not possess specific control commands to move SegPanda. It provides a roughidea on how and where the SegPanda should move. Therefore, the crossing pointbetween the global path and the frame of the local costmap, which is in front of therobot, is sent to the local planner as its temperate goal. The local planner createsa local motion plan and control actions based on the local costmap which con-tains all the information on the real-time environment. The local costmap alwayshas the robot in the middle and is constantly updated so that the created motionplan can have capability to avoid obstacles (either static or moving) which are un-known when determining the global path. Fig. 6.10 shows the calculated globalpath (green line) and the local motion path (green circle). It is noted that the localmotion path does not completely follow the global path, and it has a new trajectorywithin the scope of the associated local costmap.The last step is to control the SegPanda for it to move to the goal location .The driver plugs into ROS, sends the robot odometry and receives unicycle modelcontrol commands from the local planner. The unicycle model is a convenientabstraction of linear and angular velocity that the driver converts into left wheelspeed and right wheel speed. The navigation stack is responsible for sending thesevelocity commands that the driver then executes. The complete flowchart of theautonomous navigation system is shown in Fig. 6.11. The people detection modulelaunches the Microsoft Kinect and analyzes the RGB-D data to find any peoplein the Kinect viewing range. If a person is found, the position of the person rela-tive to the Kinect is published for another module to read. The people navigationgoal module waits for the people detection module to send it the position of theperson to follow, then sends the position of the person as a goal to the navigationstack. The navigation goal module constantly updates the position of the personand sends new goals to interrupt previous ones so as to follow the person in realtime. The navigation stack receives the goal sent by the people navigation goalmodule and uses a combination of global and local planners to try to navigate tothe point. It must receive the transform functions and odometry information fromthe robot driver and description to work effectively. The navigation stack outputs98Figure 6.10: An example of the computed global path used for autonomousnavigation.only a linear velocity command and an angular velocity command in each cycle,and will output a stop command once the goal is reached. The robot driver movesthe motors based on the commands received from the navigation stack and con-stantly publishes the odometry information at a rapid rate. Based on the forgoingalgorithm, the physical experiment has been executed in IAL to evaluate its validityfor navigation. The screen of the experiment demonstration is divided into threedifferent views as shown in Fig. 6.12: the top-left view represents the Kinect view99Figure 6.11: The flowchart of the overall autonomous navigation system.which shows the result of people detection; the down-left view represents the realworld as seen by the audience; the entire right view represents the navigation func-tion, or say, the view of SegPanda, operated by ROS. The overall process of the100Figure 6.12: The views of the experimental demonstration.present experiment is shown in the sequence of still images in the following figures.Fig. 6.13 shows the case of detecting people and making the SegPanda successfullymove to the goal location, where there are only static obstacles (the boxes in thereal world view) in the environment. The generated paths illustrate that SegPandais able to reach the global goal even when the global goal is changing during thenavigation process. Fig. 6.14 is the sequential image set that follows Fig. 6.13. Itshows SegPanda’s ability of avoiding a moving obstacle during navigation. Thered-circled box denotes the moving obstacle, which is unknown in advance, andis pushed out in the middle of the original path, allowing SegPanda to move for-ward (sequence 1-3 in Fig. 6.14). The sensor of SegPanda detects this change, andSegPanda updates the costmap accordingly, generates alternative paths, and goesforward to the detected person successfully (sequence 4-6 in Fig. 6.14).101Figure 6.13: The views of the experimental demonstration: avoiding a staticobstacle.102Figure 6.14: The views of the experimental demonstration: avoiding a mov-ing obstacle.103Chapter 7Conclusions and Future Work7.1 ConclusionsIn this dissertation, several main challenges related to three aspects of autonomousnavigation in a dynamic environment were investigated, primarily with a focus onrobot learning methods. These three aspects are: path planning, motion planningand people detection.Path planning:An online, robust Q-learning path planning method for an autonomous robotin a dynamic environment with unpredictable moving obstacles was proposed anddeveloped. The strategy of dynamic step size in online robust Q-learning is cen-tral when using regime-switching Markov decision process (RSMDP) to representa dynamic environment. This strategy makes the Q-value iteration robust to thechoice of maximum initial step size. Probabilistic roadmap (PRM) contributes toovercoming the curse of dimensionality, which is a common problem in Markovdecision process (MDP) and in reinforcement learning. Simulation studies pre-sented in this dissertation have shown that the developed path planner can rapidlyand safely find an optimal path in a dynamic environment in the presence of bothstatic and moving obstacles.Motion planning:A novel locally optimal feedback control, piecewise affine-extended linear-quadratic regulator (PWA-ELQR), was proposed and developed to solve nonlinear-104nonquadratic constrained motion planning for a nonholonomic mobile robot. In op-timization, PWA-ELQR gives an optimal control policy with the objective of min-imizing a user-defined cost function. First, piecewise affine-linear quadratic regu-lator (PWA-LQR) was introduced to handle constrained, linear-quadratic, controlproblems, by combining quadratic programming (QP) and PWA-LQR smoother.QP specifically handles constraints by providing a piecewise affine control law.PWA-LQR smoother facilitates the convergence performance by combining cost-to-go functions and cost-to-come functions to form total-cost functions. NextPWA-ELQR was proposed and developed to extend PWA-LQR to nonlinear non-quadratic control problems by iteratively performing linearization and quadratiza-tion in the vicinity of nominal trajectories. The characteristic of local optimizationof PWA-ELQR is inherently suitable to handle constraints since it seeks to keep thechanges of trajectories small in scope. Conversely, solving constraints using QPbenefits the convergence performance of PWA-ELQR by producing results withinthe constraint scope, which results in fewer iteration steps. The proposed approachwas implemented in both simulation and real-world experimentation. The resultsdemonstrated that PWA-ELQR could provide competitive results compared withELQR, but with a better executable control policy that takes into account con-straints.People detection:Under this topic, an multiple kernel learning-support vector machine (MKL-SVM) classifier was designed, combined with multiple features, static feature HOG(histograms of oriented gradients) and motion feature HOF (histograms of opticalflows) which have been proposed by other researchers, to detect people in a videosequence for the present task of autonomous navigation. MKL-SVM usually onlyfocuses on one type of feature to increase the detection precision. This is becausecalculating a multiple kernel matrix for large quantities of data introduces a sig-nificant computational burden. In order to increase the detection accuracy whiledecreasing the computational load, the SimpleMKL algorithm, which is availableas an open-source library, was incorporated into the conventional support vec-tor machine (SVM) scheme. SimpleMKL effectively relieves the computationalload through a weighted 2-norm regularization formulation with an additional con-straint on the weights. Tests carried out on the open-source TUD-Brussels database105showed that the detection precision was increased as a result, when compared toLIBLINER SVM.Finally, an autonomous navigation experiment was designed to combine allthe proposed methods into one framework and test if these methods together canwork well in a real environment. To this end, an experimental platform, calledSegPanda, was assembled. All of the hardware of SegPanda are controlled by robotoperating system (ROS) system that also provides rich simulation tools and open-source functions. With the help of ROS, the simulated results can be transferred tophysical SegPanda without much difficulties. The experimental results showed thatSegPanda could successfully detect and localize people who appear in its Kinectview, set the people location as the goal location, and reach the goal location bysuccessfully avoiding static and moving obstacles during navigation.7.2 Future WorkThe performance comparison between the proposed piecewise affine-extended lin-ear quadratic regulator (PWA-ELQR) and the existing ELQR was presented in thisdissertation. The PWA-ELQR originated from ELQR, and the latter approach doesnot incorporate constraints in the motion planning problem. Hence a more reason-able comparison would be with other methods that can directly resolve the con-straint problem in optimal motion planning. For instance, model prediction control(MPC) [25][26][27] has also been proposed to solve constraint problems for mo-tion planning. To this end, comparing the performance of PWA-ELQR with MPCmay lead to stronger evidence on the performance of PWA-ELQR in situationswith constraints associated with optimal motion planning. A suitable approach forthis would be: compare the paths generated by PWA-ELQR and MPC individuallyin both simulation and experimentation, and then analyse the results by using nu-merical data. This plan will be undertaken as important future work. In addition,autonomous navigation that was investigated and developed in this dissertation as-sumes that the robot knows the state of the environment with certainty, so that MDPmodel can be used successfully for learning how to act in a dynamic environment.Obviously, this is a strong assumption which is typically not satisfied in a real situa-tion. In a real dynamic environment, there always exist uncertainties caused by the106robot itself or the environment. For example, robot’s on-board sensors suffer fromnoise due to mechanical reasons. Another source of uncertainty may come fromsome hidden states. For instance, if a part of the body of a person is occluded by atable, then the camera fails to detect the person from the surrounding environment.To better solve the problem of autonomous navigation in a dynamic environmentfull of uncertainty, future work may utilize the partially observable Markov deci-sion process (POMDP), which provides a solid mathematical framework to solvethe problem of optimal decision making in environments that are only partiallyobservable to a robot. In addition to applying POMDP in robot navigation in asmall 2D grid like that in the work of [83], new problems with hundreds of thou-sands of state or even with a continuous state space can be solved in just a few ofminutes [84–86]. Therefore, exploring POMDP for autonomous navigation withuncertainty and continuous state will be a promising future direction of research.Another possible future direction is to apply deep learning, specifically con-volutional neural networks (CNN), to people detection, since the developments incloud computation can greatly enhance the practical utility of CNN [87]. In thework of the present dissertation, only one location was needed for autonomousnavigation. Hence, a method of detecting a single person was integrated into theautonomous navigation framework. The big advantage of CNN is that there is noneed, or little need, to specify design specific features for specific objects. Thealgorithm itself is able to learn proper features to distinguish from different objects[88]. Therefore, with one learning framework, multiple types of objects can be de-tected and classified by using CNN, as some recent work has introduced [89, 90].The Segway robot has only been tested in the IAL and may not work well inother environments, especially those with many complex objects. It also requiresto be connected to a power source, as the Kinect and Hokuyo need outlet powerto function. A desirable direction of future work can be the expansion of the au-tonomous navigation application to a larger and more complicated environmentlike an urban area. If the method proposed in the present dissertation can workwell in such dynamic environments, a variety of meaningful applications, such asself-driving vehicles, can arise from it.107Bibliography[1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse rein-forcement learning. In Proceedings of the 21th international conference onMachine learning, page 1. ACM, 2004.[2] Howie M Choset. Principles of robot motion: theory, algorithms, and imple-mentation. MIT press, 2005.[3] Steven M LaValle. Planning algorithms. Cambridge university press, 2006.[4] Joelle Pineau, Michael Montemerlo, M. Pollack, Nicholas Roy, and Sebas-tian Thrun. Probabilistic control of human robot interaction: Experimentswith a robotic assistant for nursing homes. In The second IARP/IEEE/RASJoint Workshop on Technical Challenges for Robots in Human Environments(DRHE), October 2002.[5] Ryan Luna, Ioan A Sucan, Mark Moll, and Lydia E Kavraki. Anytime solu-tion optimization for sampling-based motion planning. In IEEE InternationalConference on Robotics and Automation (ICRA), pages 5068–5074, 2013.[6] EE Kavraki, Mihail N Kolountzakis, and J-C Latombe. Analysis of prob-abilistic roadmaps for path planning. IEEE Transactions on Robotics andAutomation, 14(1):166–171, 1998.[7] Lydia E Kavraki, Petr Svestka, J-C Latombe, and Mark H Overmars. Proba-bilistic roadmaps for path planning in high-dimensional configuration spaces.IEEE Transactions on Robotics and Automation, 12(4):566–580, 1996.[8] Sertac Karaman, Matthew R Walter, Alejandro Perez, Emilio Frazzoli, andSeth Teller. Anytime motion planning using the rrt*. In IEEE InternationalConference on Robotics and Automation (ICRA), pages 1478–1483, 2011.[9] Jur Van Den Berg, Dave Ferguson, and James Kuffner. Anytime path plan-ning and replanning in dynamic environments. In IEEE International Con-ference on Robotics and Automation (ICRA), pages 2366–2371, 2006.108[10] Jur Van Den Berg and Mark Overmars. Planning time-minimal safepaths amidst unpredictably moving obstacles. The International Journal ofRobotics Research, 27(11-12):1274–1294, 2008.[11] Dimitri P Bertsekas. Dynamic programming and optimal control, Third Edi-tion, volume 1. Athena Scientific Belmont, MA, 2005.[12] Huzefa Shakir and Won-Jong Kim. Nanoscale path planning and motioncontrol with maglev positioners. IEEE/ASME Transactions on Mechatronics,11(5):625–633, 2006.[13] Yantao Shen, Eric Winder, Ning Xi, Craig A Pomeroy, and Uchechukwu CWejinya. Closed-loop optimal control-enabled piezoelectric microforce sen-sors. IEEE/ASME Transactions on Mechatronics, 11(4):420–427, 2006.[14] Thomas Baumgartner and Johann Walter Kolar. Multivariable state feed-back control of a 500 000-r/min self-bearing permanent-magnet motor.IEEE/ASME Transactions on Mechatronics, 2015 (To be published).[15] Weiwei Li and Emanuel Todorov. Iterative linear quadratic regulator designfor nonlinear biological movement systems. In Proceedings of the First In-ternational Conference on Informatics in Control, Automation and Robotics,August 25-28 2004.[16] Jan van den Berg. Iterated lqr smoothing for locally-optimal feedback controlof systems with non-linear dynamics and non-quadratic cost. In AmericanControl Conference (ACC), 2014, pages 1912–1918, 2014.[17] Jur van den Berg. Extended lqr: Locally-optimal feedback control for sys-tems with non-linear dynamics and non-quadratic cost. In International Sym-posium of Robotics Research (ISRR), December 2013.[18] Amit Bhatia, Lydia E Kavraki, and Moshe Y Vardi. Sampling-based motionplanning with temporal goals. In IEEE International Conference on Roboticsand Automation (ICRA), pages 2689–2696, 2010.[19] Sertac Karaman and Emilio Frazzoli. Sampling-based algorithms for optimalmotion planning. The International Journal of Robotics Research, 30(7):846–894, 2011.[20] Lucian Busoniu, Robert Babuska, Bart De Schutter, and Damien Ernst. Rein-forcement learning and dynamic programming using function approximators.CRC press, 2010.109[21] Lucian Busoniu, Alexander Daniels, Re´mi Munos, and Robert Babuska. Op-timistic planning for continuous-action deterministic systems. In IEEE Sym-posium on Adaptive Dynamic Programming And Reinforcement Learning(ADPRL), pages 69–76, 2013.[22] David H Jacobson and David Q Mayne. Differential dynamic programming.1970.[23] Evangelos Theodorou, Yuval Tassa, and Emo Todorov. Stochastic differentialdynamic programming. In American Control Conference (ACC), pages 1125–1132, 2010.[24] Li-zhi Liao and Christine A Shoemaker. Advantages of differential dynamicprogramming over newton’s method for discrete-time optimal control prob-lems. Technical report, Cornell University, 1992.[25] Morteza Farrokhsiar and Homayoun Najjaran. An unscented model pre-dictive control approach to the formation control of nonholonomic mobilerobots. In IEEE International Conference on Robotics and Automation(ICRA), pages 1576–1582, 2012.[26] M Farrokhsiar and H Najjaran. A robust probing motion planning scheme:A tube-based mpc approach. In American Control Conference (ACC), pages6492–6498, 2013.[27] Morteza Farrokhsiar and Homayoun Najjaran. Unscented model predic-tive control of chance constrained nonlinear systems. Advanced Robotics,28(4):257–267, 2014.[28] Noel E Du Toit and Joel W Burdick. Robot motion planning in dynamic,uncertain environments. IEEE Transactions On Robotics, 28(1):101–115,2012.[29] C Goerzen, Zhaodan Kong, and Bernard Mettler. A survey of motion plan-ning algorithms from the perspective of autonomous uav guidance. Journalof Intelligent and Robotic Systems, 57(1-4):65–100, 2010.[30] Alberto Bemporad, Manfred Morari, Vivek Dua, and Efstratios N Pistikopou-los. The explicit linear quadratic regulator for constrained systems. Automat-ica, 38(1):3–20, 2002.[31] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for humandetection. In IEEE Computer Society Conference on Computer Vision andPattern Recognition(CVPR), pages 886–893, 2005.110[32] Navneet Dalal, Bill Triggs, and Cordelia Schmid. Human detection usingoriented histograms of flow and appearance. In European Conference onComputer Vision(ECCV), pages 428–441. Springer, 2006.[33] Piotr Dolla´r, Zhuowen Tu, Pietro Perona, and Serge Belongie. Integral chan-nel features. In The British Machine Vision Conference (BMVC), 2009.[34] Rodolphe Sepulchre, Derek A Paley, and Naomi Ehrich Leonard. Stabiliza-tion of planar collective motion with limited communication. IEEE Transac-tions on Automatic Control, 53(3):706–719, 2008.[35] Stefan Walk, Nikodem Majer, Konrad Schindler, and Bernt Schiele. New fea-tures and insights for pedestrian detection. In IEEE conference on Computervision and pattern recognition (CVPR), pages 1030–1037, 2010.[36] Paul Viola and Michael J Jones. Robust real-time face detection. Interna-tional Journal of Computer Vision, 57(2):137–154, 2004.[37] David Geronimo, Antonio M Lopez, Angel Domingo Sappa, and ThorstenGraf. Survey of pedestrian detection for advanced driver assistance sys-tems. Pattern Analysis and Machine Intelligence, IEEE Transactions on,32(7):1239–1258, 2010.[38] Stuart Russell and Peter and Norvig. Artificial Intelligence: A Modern Ap-proach. 3 edition, 2011.[39] Dylan Campbell and Mark Whitty. Metric-based detection of robot kidnap-ping with an svm classifier. Robotics and Autonomous Systems, 2014.[40] Kun Li and Max Meng. Robotic object manipulation with multilevel part-based model in rgb-d data. In IEEE International Conference on Roboticsand Automation (ICRA), pages 3151–3156, 2014.[41] S Hamid Dezfoulian, Dan Wu, and Imran Shafiq Ahmad. A generalizedneural network approach to mobile robot navigation and obstacle avoidance.In Intelligent Autonomous Systems 12, pages 25–42. Springer, 2013.[42] David Silver, James Bagnell, and Anthony Stentz. High performance outdoornavigation from overhead data using imitation learning. Robotics: Scienceand Systems IV, pages 262–269, 2009.[43] Chia-Feng Juang and Yu-Cheng Chang. Evolutionary-group-based particle-swarm-optimized fuzzy controller with application to mobile-robot navi-gation in unknown environments. IEEE Transactions on Fuzzy Systems,19(2):379–392, 2011.111[44] Matt Knudson and Kagan Tumer. Adaptive navigation for autonomousrobots. Robotics and Autonomous Systems, 59(6):410–420, 2011.[45] Thomas Kollar and Nicholas Roy. Trajectory optimization using reinforce-ment learning for map exploration. The International Journal of RoboticsResearch, 27(2):175–196, 2008.[46] Daoyi Dong, Chunlin Chen, Jian Chu, and T Tarn. Robust quantum-inspiredreinforcement learning for robot navigation. IEEE/ASME Transactions onMechatronics, 17(1):86–97, 2012.[47] Adam Coates, Pieter Abbeel, and Andrew Y Ng. Learning for control frommultiple demonstrations. In Proceedings of the 25th international conferenceon Machine learning, pages 144–151. ACM, 2008.[48] Nathan D Ratliff, David Silver, and J Andrew Bagnell. Learning to search:Functional gradient techniques for imitation learning. Autonomous Robots,27(1):25–53, 2009.[49] Noel E Sharkey. Learning from innate behaviors: A quantitative evaluationof neural network controllers. Machine Learning, 31(1-3):115–139, 1998.[50] Peter Trautman and Andreas Krause. Unfreezing the robot: Navigation indense, interacting crowds. In IEEE/RSJ International Conference on Intelli-gent Robots and Systems (IROS), pages 797–803, 2010.[51] Guy Shani, Joelle Pineau, and Robert Kaplow. A survey of point-basedpomdp solvers. Autonomous Agents and Multi-Agent Systems, 27(1):1–51,2013.[52] Tiago Veiga, Matthijs Spaan, and Pedro Lima. Point-based pomdp solvingwith factored value function approximation. In Twenty-Eighth AAAI Confer-ence on Artificial Intelligence, 2014.[53] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic robotics.MIT press, 2005.[54] Sven Koenig and Yaxin Liu. The interaction of representations and planningobjectives for decision-theoretic planning tasks. Journal of Experimental &Theoretical Artificial Intelligence, 14(4):303–326, 2002.[55] Ronald A Howard. Dynamic programming and markov processes. 1960.112[56] Richard Bellman. Dynamic programming and lagrange multipliers. Proceed-ings of the National Academy of Sciences of the United States of America,42(10):767, 1956.[57] Richard S Sutton and Andrew G Barto. Introduction to reinforcement learn-ing. MIT Press, 1998.[58] Stuart Ian Reynolds. Reinforcement learning with exploration. PhD thesis,The University of Birmingham, 2002.[59] Bohdana Ratitch. On characteristics of Markov decision processes and rein-forcement learning in large domains. PhD thesis, McGill University, 2005.[60] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning,8(3-4):279–292, 1992.[61] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A trainingalgorithm for optimal margin classifiers. In Proceedings of the 5th annualworkshop on Computational learning theory, pages 144–152. ACM, 1992.[62] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machinelearning, 20(3):273–297, 1995.[63] Ron Alterovitz, Thierry Sime´on, and Kenneth Y Goldberg. The stochasticmotion roadmap: A sampling framework for planning with markov motionuncertainty. In Robotics: Science and Systems (RSS), volume 3, pages 233–241, 2007.[64] Vu Anh Huynh, Sertac Karaman, and Emilio Frazzoli. An incrementalsampling-based algorithm for stochastic optimal control. In IEEE Interna-tional Conference on Robotics and Automation (ICRA), pages 2865–2872,2012.[65] G Yin, Vikram Krishnamurthy, and Cristina Ion. Regime switching stochas-tic approximation algorithms with application to adaptive discrete stochasticoptimization. Society for Industrial and Applied Mathiematics Journal onOptimization, 14(4):1187–1215, 2004.[66] Andre Costa and Felisa J Va´zquez-Abad. Adaptive stepsize selection fortracking in a regime-switching environment. Automatica, 43(11):1896–1908,2007.[67] Nocedal Jorge and J Wright Stephen. Numerical optimization,second edition.Springerverlang, USA, 2006.113[68] Elizabeth Lai Sum Wong. Active-set methods for quadratic programming.2011.[69] Nicholas IM Gould and Philippe L Toint. Numerical methods for large-scalenon-convex quadratic programming. Springer, 2002.[70] Nick Gould, Dominique Orban, and Philippe Toint. Numerical methods forlarge-scale nonlinear optimization. Acta Numerica, 14:299–361, 2005.[71] E Michael Gertz and Stephen J Wright. Object-oriented software forquadratic programming. ACM Transactions on Mathematical Software,29(1):58–81, 2003.[72] Nicholas J Higham. Computing a nearest symmetric positive semidefinitematrix. Linear algebra and its applications, 103:103–118, 1988.[73] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. The Journal ofMachine Learning Research, 9:1871–1874, 2008.[74] Christian Wojek, Stefan Walk, and Bernt Schiele. Multi-cue onboard pedes-trian detection. In IEEE Conference on Computer Vision and Pattern Recog-nition (CVPR), pages 794–801, 2009.[75] Piotr Dollar, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestriandetection: An evaluation of the state of the art. IEEE Transactions on PatternAnalysis and Machine Intelligence, 34(4):743–761, 2012.[76] Dennis Park, C Lawrence Zitnick, Deva Ramanan, and Piotr Dolla´r. Explor-ing weak stabilization for motion feature extraction. In IEEE Conference onComputer Vision and Pattern Recognition (CVPR), pages 2882–2889, 2013.[77] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S Sathiya Keerthi, and Sella-manickam Sundararajan. A dual coordinate descent method for large-scalelinear svm. In Proceedings of the 25th international conference on Machinelearning, pages 408–415. ACM, 2008.[78] Francis R Bach, Gert RG Lanckriet, and Michael I Jordan. Multiple kernellearning, conic duality, and the smo algorithm. In Proceedings of the 21thinternational conference on Machine learning, page 6. ACM, 2004.[79] Alain Rakotomamonjy, Francis Bach, Ste´phane Canu, Yves Grandvalet, et al.Simplemkl. Journal of Machine Learning Research, 9:2491–2521, 2008.114[80] Felix Endres, Ju¨rgen Hess, Nikolas Engelhard, Ju¨rgen Sturm, Daniel Cre-mers, and Wolfram Burgard. An evaluation of the rgb-d slam system. InIEEE International Conference on Robotics and Automation (ICRA), pages1691–1696, 2012.[81] Christian Kerl, Jurgen Sturm, and Daniel Cremers. Robust odometry estima-tion for rgb-d cameras. In IEEE International Conference on Robotics andAutomation (ICRA), pages 3748–3754, 2013.[82] Peter Henry, Michael Krainin, Evan Herbst, Xiaofeng Ren, and Dieter Fox.Rgb-d mapping: Using depth cameras for dense 3d modeling of indoor envi-ronments. In Experimental Robotics, pages 477–491. Springer, 2014.[83] Amalia Foka and Panos Trahanias. Real-time hierarchical pomdps for au-tonomous robot navigation. Robotics and Autonomous Systems, 55(7):561–571, 2007.[84] Haoyu Bai, David Hsu, Wee Sun Lee, and Vien A Ngo. Monte carlo valueiteration for continuous-state pomdps. In Algorithmic foundations of roboticsIX, pages 175–191. Springer, 2011.[85] Hanna Kurniawati and Vinay Yadav. An online pomdp solver for uncertaintyplanning in dynamic environment. ISRR, 2013.[86] Konstantin M Seiler, Hanna Kurniawati, and Surya PN Singh. An online andapproximate solver for pomdps with continuous action space. In Proc. IEEEInt. Conference on Robotics and Automation (ICRA), 2015.[87] Rodrigo Benenson, Mohamed Omran, Jan Hosang, and Bernt Schiele. Tenyears of pedestrian detection, what have we learned? In European Conferenceon Computer Vision (ECCV), pages 613–627. Springer, 2014.[88] Pulkit Agrawal, Ross Girshick, and Jitendra Malik. Analyzing the perfor-mance of multilayer neural networks for object recognition. In EuropeanConference on Computer Vision(ECCV), pages 329–344. Springer, 2014.[89] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, EricTzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature forgeneric visual recognition. Computing Research Ropository, abs/1310.1531,2013.[90] Ning Zhang, Jeff Donahue, Ross Girshick, and Trevor Darrell. Part-basedr-cnns for fine-grained category detection. In European Conference on Com-puter Vision(ECCV), pages 834–849. Springer, 2014.115
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Robot learning for autonomous navigation in a dynamic...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Robot learning for autonomous navigation in a dynamic environment Zhang, Yunfei 2015
pdf
Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.
Page Metadata
Item Metadata
Title | Robot learning for autonomous navigation in a dynamic environment |
Creator |
Zhang, Yunfei |
Publisher | University of British Columbia |
Date Issued | 2015 |
Description | This dissertation addresses autonomous navigation of robots in a dynamic environment where the existence of moving and/or unknown objects leads to more serious challenges for the robots than those when operating in a traditional stationary environment. Therefore, the use of learning capabilities to facilitate proper robotic operation in a dynamic environment has become an important research area in the past decade. This dissertation proposes several novel learning-based methods to overcome the shortcomings in the existing approaches of autonomous navigation. Three aspects are addressed in the present work. First, a real-time path planning method is designed for autonomous navigation that can generate a path that avoids stationary and moving obstacles. To this end, learning ability is imparted to the robot. The present framework incorporates the statistical planning approach called probabilistic roadmap (PRM), Q-learning together with regime-switching Markov decision process (RSMDP) due to its beneficial characteristics, to form a robust Q-learning. Consequently, the initial path can be improved through robust Q-learning during interaction with a dynamic environment. Second, motion planning under constraints is investigated. Specifically, a closed-form piecewise affine control law, called piecewise affine-extended linear quadratic regulator (PWA-ELQR), for nonlinear-nonquadratic control problems with constraints is proposed. Through linearization and quadratization in the vicinity of the nominal trajectories, nonlinear-nonquadratic control problems can be approximated to linear-quadratic problems where the closed-form results can be derived relatively easily. Third, people detection is integrated into the autonomous navigation task. A classifier trained by a multiple kernel learning-support vector machine (MKL-SVM) is proposed to detect people in sequential images of a video stream. The classifier uses multiple features to describe a person, and learn its parameter values rapidly with the assistance of multiple kernels. In addition to the methodology development, the present research involves computer simulation and physical experimentation. Computer simulation is used to study the feasibility and effectiveness of the developed methodologies of path planning, motion planning and people detection. The experimentation involves autonomous navigation of a homecare robot system. The performance of the developed system is rigorously evaluated through physical experimentation and is improved by refining the developed methodologies. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2015-10-24 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
DOI | 10.14288/1.0166745 |
URI | http://hdl.handle.net/2429/54912 |
Degree |
Doctor of Philosophy - PhD |
Program |
Mechanical Engineering |
Affiliation |
Applied Science, Faculty of Mechanical Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2015-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2015_november_zhang_yunfei.pdf [ 35.04MB ]
- Metadata
- JSON: 24-1.0166745.json
- JSON-LD: 24-1.0166745-ld.json
- RDF/XML (Pretty): 24-1.0166745-rdf.xml
- RDF/JSON: 24-1.0166745-rdf.json
- Turtle: 24-1.0166745-turtle.txt
- N-Triples: 24-1.0166745-rdf-ntriples.txt
- Original Record: 24-1.0166745-source.json
- Full Text
- 24-1.0166745-fulltext.txt
- Citation
- 24-1.0166745.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
data-media="{[{embed.selectedMedia}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0166745/manifest